Docop pipelines explained

Pipelines are defined in YAML-formatted files. Each file specifies an ordered sequence of tasks to run.

How to review available pipelines

Use the pipes command to list them:

docop pipes

Assuming the 'mypipe' pipeline defined in the next section exists, the command outputs:

mypipe: My first full pipe (retrieve → process1 → process2 → export)

How to create a task pipeline

Create a YAML file with a descriptive name, add a comment line that describes the pipeline, and list the tasks in the order they should run:

mypipe.yaml

# My first full pipe

tasks:
  - retrieve
  - process1
  - process2
  - export
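
To illustrate how a definition like this maps to the one-line summary printed by the pipes command, here is a minimal Python sketch. It is not Docop's actual implementation: the function names parse_pipe and describe_pipe are hypothetical, and the hand-written parser assumes the simple layout shown above (a leading comment line, then a tasks list).

```python
from typing import List, Tuple


def parse_pipe(text: str) -> Tuple[str, List[str]]:
    """Return (description, tasks) from a pipeline definition like mypipe.yaml.

    Hypothetical helper: assumes the first comment line is the description
    and that tasks are listed as "- name" items under a "tasks:" key.
    """
    description = ""
    tasks: List[str] = []
    in_tasks = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Only the first comment line is used as the description.
            if not description:
                description = line.lstrip("#").strip()
        elif line == "tasks:":
            in_tasks = True
        elif in_tasks and line.startswith("- "):
            tasks.append(line[2:].strip())
        else:
            # Any other key ends the task list.
            in_tasks = False
    return description, tasks


def describe_pipe(name: str, text: str) -> str:
    """Format a pipeline the way the listing above shows it."""
    description, tasks = parse_pipe(text)
    return f"{name}: {description} ({' → '.join(tasks)})"


MYPIPE = """\
# My first full pipe

tasks:
  - retrieve
  - process1
  - process2
  - export
"""

print(describe_pipe("mypipe", MYPIPE))
```

The pipeline name is taken from the file name without the .yaml suffix, which matches how pipelines are referred to on the command line.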

How to run a pipeline

Use the run command with the pipeline name, for example docop run mypipe. Docop will automatically find the given pipeline and run it. There is no need to give a path to the pipeline definition file or to include the .yaml suffix.

Using the --help option gives more details:

Usage: docop run [OPTIONS] TASKNAME or PIPENAME [EXTRAS]...

  Run a task or pipeline.

Options:
  -s, --source TEXT   Sources that will be fetched and stored as documents.
  -c, --content PATH  Stored documents to process.
  -t, --target TEXT   Targets to export document content to
  -a, --account TEXT  Account to use (source or target)
  --help              Show this message and exit.

Docop will provide ample status information when it runs the pipeline.

How Docop runs task pipelines

The following diagram describes how Docop loops over sources, content and targets and runs a pipe of tasks to fetch, process and export content.

graph LR
  S((Start)) --> QS{Sources\nfetched?};
  QS -- No --> RT(⚡ Run 1st task\nto fetch);
  RT --> QS;
  QS -- Yes --> QL{Next\ntask\nlast?};
  QL -- No --> RP(⚡ Run next task\nto process);
  RP --> QL;
  QL -- Yes --> RE(⚡ Run last task\nto export);
  RE --> QE{Docs\nexported?};
  QE -- Yes --> E((End));
  QE -- No --> RE;

To recap:

When a task runs, it is provided a set of execution context variables
The first task in a pipe should check the sources and fetch them
The next tasks should process the fetched content
The last task should export content to targets
Each task can process one or more sources, collections, documents or targets per run
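
The loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Docop's own code: run_pipe, the context dictionary keys and the example tasks are all hypothetical, and the sketch runs the first task once per source and the last task once per target, although a real task may handle several per run.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical shapes: a task is a (name, function) pair, and the function
# receives a dictionary of execution context variables.
Context = Dict[str, Any]
Task = Tuple[str, Callable[[Context], None]]


def run_pipe(tasks: List[Task], sources: List[str], targets: List[str]) -> List[str]:
    """Run fetch, process and export tasks in the order the diagram describes."""
    if len(tasks) < 2:
        raise ValueError("a pipe needs at least a fetch task and an export task")
    (first_name, fetch), *middle, (last_name, export) = tasks
    documents: List[dict] = []
    trace: List[str] = []

    # Sources fetched? No -> run the first task for the next source.
    for source in sources:
        fetch({"source": source, "documents": documents})
        trace.append(f"{first_name}:{source}")

    # Next task last? No -> run the next task to process the content.
    for name, process in middle:
        process({"documents": documents})
        trace.append(name)

    # Docs exported? No -> run the last task for the next target.
    for target in targets:
        export({"target": target, "documents": documents})
        trace.append(f"{last_name}:{target}")
    return trace


exported: List[Tuple[str, int]] = []


def retrieve(ctx: Context) -> None:
    ctx["documents"].append({"source": ctx["source"], "text": "hello"})


def process1(ctx: Context) -> None:
    for doc in ctx["documents"]:
        doc["text"] = doc["text"].upper()


def export(ctx: Context) -> None:
    exported.append((ctx["target"], len(ctx["documents"])))


trace = run_pipe(
    [("retrieve", retrieve), ("process1", process1), ("export", export)],
    sources=["a", "b"],
    targets=["t1"],
)
print(trace)
```

The returned trace lists each task run in order, which makes it easy to see that fetching finishes before processing starts and that exporting happens last.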