Providers
Though we aim to offer a large suite of built-in transforms, it is inevitable that people will need to author their own. This is made possible through the notion of Providers which leverage expansion services and vend catalogues of schema transforms.
Java
For example, you could build a jar that vends a cross language transform or schema transform and then use it in a transform as follows
pipeline:
type: chain
source:
type: ReadFromCsv
config:
path: /path/to/input*.csv
transforms:
- type: MyCustomTransform
config:
arg: whatever
sink:
type: WriteToJson
config:
path: /path/to/output.json
providers:
- type: javaJar
config:
jar: /path/or/url/to/myExpansionService.jar
transforms:
MyCustomTransform: "urn:registered:in:expansion:service"
A full example of how to build a java provider can be found here.
Python
Arbitrary Python transforms can be provided as well, using the syntax
providers:
- type: pythonPackage
config:
packages:
- my_pypi_package>=version
- /path/to/local/package.zip
transforms:
MyCustomTransform: "pkg.module.PTransformClassOrCallable"
We offer a python provider starter project that serves as a complete example for how to do this.
YAML Provider listing files
One can reference an external listings of providers in the yaml pipeline file via the syntax
providers:
- include: "file:///path/to/local/providers.yaml"
- include: "gs://path/to/remote/providers.yaml"
- include: "https://example.com/hosted/providers.yaml"
...
where providers.yaml
is simply a yaml file containing a list of providers
in the same format as those inlined in this providers block.
See, for example, the provider listing here.
In fact, this is how many of the the built in transforms are declared, see for example the builtin io listing file.
Hosting these listing files (together with their required artifacts) allows one to easily share catalogues of transforms that can be directly used by others in their YAML pipelines.