Welcome to Docop

Docop provides an opinionated way to easily build simple document processing pipelines for importing, processing and exporting textual content items, or more generally anything that has a textual representation.

It has been used, for example, to fetch HTML pages, convert them to plaintext and export the results to an AI knowledge base.

Introduction

Some basic concepts:

sources specify the content you will retrieve
content specifies already retrieved content documents to process
targets specify destinations to export the results to
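
To make the three concepts concrete, here is a minimal sketch in plain Python. It is not docop's API: the function names (`fetch`, `to_plaintext`, `export`), the dictionary layout and the example URL are all assumptions made for illustration, and the HTML is hard-coded instead of being downloaded.

```python
import re

# Illustrative only: every name below is hypothetical, not part of docop.

def fetch(source):
    """Source step: turn a source reference into a content document."""
    # A real task would download source["url"]; the HTML is hard-coded here.
    return {"title": source["name"], "text": "<p>Hello, world</p>", "meta": {}}

def to_plaintext(doc):
    """Processing step: strip HTML tags and note the conversion in the metadata."""
    result = dict(doc, meta=dict(doc["meta"]))  # copy, leave the input untouched
    result["text"] = re.sub(r"<[^>]+>", "", doc["text"])
    result["meta"]["converted"] = True
    return result

def export(doc, target):
    """Target step: deliver the processed document to its destination."""
    target.append(doc)

source = {"name": "example", "url": "https://example.com/"}
target = []  # stands in for a real destination such as a knowledge base
export(to_plaintext(fetch(source)), target)
```

After this runs, `target` holds one processed document whose text is plain and whose metadata records the conversion.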

Practicalities:

configuration is given in a YAML file
tasks are Python modules
pipelines are defined as YAML files
file names are used as names for tasks, docs and pipelines
first comment lines are used as the description for tasks and pipelines
retrieved content is stored as YAML documents, to which tasks can add arbitrary metadata
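
The "first comment lines" convention can be sketched as follows. This is a guess at the idea, not docop's actual implementation: it assumes `#` comments (which both Python and YAML use), and the helper name `leading_comment` is hypothetical.

```python
def leading_comment(text):
    """Return the comment lines at the very top of a file as one string."""
    parts = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("#"):
            break  # the first non-comment line ends the description
        if stripped.startswith("#!"):
            continue  # skip a shebang line
        parts.append(stripped.lstrip("#").strip())
    return " ".join(part for part in parts if part)

# Hypothetical task module source, used only as sample input.
task_source = (
    "# Fetch HTML pages\n"
    "# and store them as content.\n"
    "import sys\n"
    "# not part of the description\n"
)
description = leading_comment(task_source)
```

Only the comment block at the top counts here; comments further down the file are ignored.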

Getting started

do a pip install
set up a few directories for tasks, docs and pipes
copy config.yaml.in to config.yaml and specify the directories there
read the docs on how to configure docop, use the command-line tool, create tasks and pipes etc.