Skip to content

Welcome to Docop

Docproc provides some opinionated means to very easily build simple document processing pipelines for importing, processing and exporting textual content items, or in general anything possibly having a textual representation.

It has been used to for example fetching HTML pages, converting them to plaintext and exporting the results to an AI knowledge base.

Introduction

Some basic concepts:

  • sources specify the content you will retrieve
  • content speficies already retrieved content documents to process
  • targets specify destinations to export the results to

Practicalities:

  • configuration is given in a YAML file
  • tasks are Python modules
  • pipelines are defined as YAML files
  • file names are used as names for tasks, docs and pipelines
  • first comment lines are used as the description for tasks and pipelines
  • retrieved content is stored as YAML documents, with arbitrary metadata addable by tasks

Getting started

  • do a pip install
  • set up a few directories for tasks, docs and and pipes
  • copy config.yaml.in to config.yaml and specify the directories there
  • read the docs on how to configure docop, use the command-line tool, create tasks and pipes etc.