Output¶
Each task can either write to the specified directory, or return the output in memory to the downstream tasks.
If the output directory is null, the output will not be dumped / exported,
and passed to the downstream tasks directly.
For example, the below configuration defines test_dir as the default
output directory in the metadata section.
task_aoverrides the output directory asnullso the output is only passed totask_bin memory.task_bwrites to the default output directory with its task name.
metadata:
output:
directory: "test_dir"
task:
task_a:
caller: ...
output:
directory: null
task_b:
caller: ...
parameters:
- !data task_a
Format¶
Currently, the library defines the default behavior of writing and loading the pickle and JSON formats. For other formats, their loaders and dumpers should be customised accordingly. The file extension is determined by the format.
Both JSON and pickle uses the standard library loader and dumper. Pickle will be used by default if not specified.
Loader and dumper¶
As mentioned above, if the format is different from the standard library supported formats, the customised loaders and dumpers should be defined.
For example, if you would like to use pandas to load and dump the parquet file
in pyarrow engine by default, you can define the configuration like below.
metadata:
format: parquet
dump-caller: object.to_parquet
dump-parameters:
engine: pyarrow
load-caller: pandas:read_parquet
load-parameters:
engine: pyarrow
All the tasks will be written in parquet format and file pattern <name>.parquet.
Existence¶
Currently, if the output already exists in the output directory, the task will not be executed and the output is loaded directly.
In the meantime, if you have external dependencies, you can define their location directly without specifying its caller.
For example, the following configuration expects the file task_a.pickle
already exists in the output directory; otherwise, an exception will be
thrown that a caller is expected.
task:
task_a: {}