ssoudan.blog

Written on March 11, 2021

TFX is a framework for developing and deploying production ML pipelines that I have been using for the last few months. Pipelines are made of components that consume and produce artifacts. The framework can be extended by defining our own artifacts and components to enrich the catalog of what is already provided: standard artifacts & standard components. I’m sure these catalogs are bound to grow.

I have packaged some of the components that proved handy in a library and shared it here: tfx_x.

There are two sets of components at this point: one to manipulate a new artifact type named PipelineConfiguration - I will come back to this in a sec - and another to manipulate Examples. Let’s close the case for Examples first.

Examples is basically the artifact type for datasets - possibly at different stages of transformation - and that is what the first two components help with: filtering based on a predicate on examples, and stratified sampling. Nothing too fancy here: a couple of lines of Beam and quite a few more lines of boilerplate code.
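To make the two operations concrete, here is a minimal sketch in plain Python rather than Beam. It is not the tfx_x implementation: the function names `filter_examples` and `stratified_sample`, the dict-based example format and the toy dataset are all assumptions made for illustration. In a Beam pipeline the filtering step would presumably map onto something like `beam.Filter`, with the sampling done per key.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List

# Hypothetical in-memory stand-in for a serialized example.
Example = Dict[str, object]


def filter_examples(examples: Iterable[Example],
                    predicate: Callable[[Example], bool]) -> List[Example]:
    """Keep only the examples for which the predicate holds."""
    return [ex for ex in examples if predicate(ex)]


def stratified_sample(examples: Iterable[Example],
                      key_fn: Callable[[Example], Hashable],
                      per_stratum: int,
                      seed: int = 0) -> List[Example]:
    """Draw up to `per_stratum` examples from each stratum defined by `key_fn`."""
    strata = defaultdict(list)
    for ex in examples:
        strata[key_fn(ex)].append(ex)

    rng = random.Random(seed)  # fixed seed so a rerun yields the same sample
    sample = []
    for key in sorted(strata, key=repr):  # deterministic order of strata
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Toy dataset: 30 examples spread over 3 labels.
dataset = [{"label": i % 3, "value": i} for i in range(30)]
evens = filter_examples(dataset, lambda ex: ex["value"] % 2 == 0)
balanced = stratified_sample(evens, lambda ex: ex["label"], per_stratum=2)
```

The fixed seed is a deliberate choice here: a sampling step that changes its output on every rerun makes it harder to compare two runs of the same pipeline.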

Back to PipelineConfiguration. Building pipelines is an essential step to get anywhere with ML models. You are bound to run and rerun slightly different variations of everything again and again. And let’s be honest, not all of them are going to produce great results - if they produce anything at all. “One must imagine Sisyphus happy,” wrote Camus.

This happens while assembling all the steps that take you from some data somewhere to a model you are confident in, but also while experimenting, exploring, tuning… Yes, there is ‘Restart & Run All’ in Jupyter, but that eventually shows its limits, and if we are talking about ‘operationalizing’ ML (as in MLOps), hopefully that’s not the solution.

One of the challenges is keeping track of what has been tested and with which parameters, so it can be reproduced or used as a starting point for further exploration - in case you paint yourself into a corner and feel the safest option is to revert to what was sort of working last week. Git can be part of the answer but should not be the only element of the solution.

Immutability and versioning everywhere!

For the code, it’s git; for the runtime, containers. And we have immutable artifacts that components can produce AND consume, as stated before. You probably see where I’m going with the parameters and the artifacts by now. To put some order in the parametrization of my pipelines, I ended up creating a single choke point where that parametrization takes place, and leveraging artifacts to carry it.

This has 2 benefits. First, I can have my own components take their configuration from this artifact. But more importantly, this is an artifact: it is immutable, gets versioned and is stored along with the other artifacts - see mlmd. The code I use to analyze the results of an experiment or the performance of a model can access the parameters that were used in the same way as the rest of the artifacts a pipeline run has produced.
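The idea of a frozen configuration artifact can be sketched with nothing but the standard library. This is not the TFX or tfx_x API: in a real pipeline the orchestrator assigns the artifact location and records it in mlmd. The helpers `write_config_artifact` and `read_config_artifact`, the `config.json` file name and the hash-based directory layout are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from types import MappingProxyType


def write_config_artifact(params: dict, root: Path) -> Path:
    """Serialize params once and store them under a content-derived directory."""
    payload = json.dumps(params, sort_keys=True, indent=2)
    # Same parameters -> same digest -> same location (hypothetical layout).
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    artifact_dir = root / "PipelineConfiguration" / digest
    artifact_dir.mkdir(parents=True, exist_ok=True)
    (artifact_dir / "config.json").write_text(payload, encoding="utf-8")
    return artifact_dir


def read_config_artifact(artifact_dir: Path):
    """Load the frozen parameters as a read-only mapping."""
    payload = (artifact_dir / "config.json").read_text(encoding="utf-8")
    return MappingProxyType(json.loads(payload))


# A component or an analysis notebook would both go through the same reader.
root = Path(tempfile.mkdtemp())
uri = write_config_artifact({"learning_rate": 0.01, "epochs": 5}, root)
config = read_config_artifact(uri)
```

Returning a `MappingProxyType` means a consumer that tries to tweak a parameter in place gets a `TypeError`, which keeps the choke point honest: a new parameter value means a new artifact.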

My recipe at this point:

- Code: in git.
- Runtime: a container versioned after the git commit, in a registry.
- Runtime parameters: collected from different places - depending on the context - but assembled and frozen in an artifact that is passed to the components and stored so it can be recovered in the future.
- Analysis code: a Jupyter notebook which checks out the latest code from git and imports it as a library, and that only requires the ‘id’ of the pipeline to fetch all the artifacts it needs to produce beautiful charts.
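As a rough illustration of the "collected from different places, then assembled" part of the recipe, here is a small sketch. The sources (defaults, a file, environment variables), their precedence, the `PIPELINE_` prefix, the registry name and the commit hash are all hypothetical; the post does not say where the parameters actually come from.

```python
import json
from typing import Mapping, Optional


def assemble_parameters(defaults: Mapping[str, object],
                        file_overrides: Optional[Mapping[str, object]] = None,
                        env: Optional[Mapping[str, str]] = None,
                        env_prefix: str = "PIPELINE_") -> dict:
    """Merge parameter sources; later sources win over earlier ones."""
    params = dict(defaults)
    params.update(file_overrides or {})
    for name, raw in (env or {}).items():
        if name.startswith(env_prefix):
            key = name[len(env_prefix):].lower()
            try:
                params[key] = json.loads(raw)  # "20" -> 20, "true" -> True
            except json.JSONDecodeError:
                params[key] = raw  # keep plain strings as they are
    return params


def image_tag(repository: str, git_commit: str) -> str:
    """Name the runtime image after the commit it was built from."""
    return f"{repository}:{git_commit[:12]}"


params = assemble_parameters(
    defaults={"epochs": 5, "learning_rate": 0.01},
    file_overrides={"learning_rate": 0.001},
    env={"PIPELINE_EPOCHS": "20", "HOME": "/root"},
)
# Recording the image next to the parameters ties code, runtime and config together.
params["image"] = image_tag("registry.example.com/my-pipeline",
                            "0123456789abcdef0123")
```

Once assembled, the resulting dict is what would be frozen into the configuration artifact, so the run carries its full set of parameters with it regardless of where they originally came from.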

Together with a bit of Kubeflow and GCP, I’m a ‘one man army’ as I have been told.

Have fun, stay safe.
