Alex Miłowski, geek

Workflows - DSL or code?

A workflow for data engineering, machine learning, or other business processes is typically described as a graph of tasks that are chained together by dependencies and consequences. When a task completes, it may cause another task (or many tasks) to execute. The graph of tasks and their connections is a description of the work accomplished by the workflow.

A workflow with a start node connected to task A, A connected to B and C, C connected to D, and both B and D connected to the end node.

An example workflow
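As a minimal sketch, the example graph can be written down as a mapping from each task to the tasks it depends on. The names `depends_on` and `execution_order` are illustrative assumptions, not part of any orchestration system; the ordering uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# This mirrors the example: start -> A, A -> B and C, C -> D, B and D -> end.
depends_on = {
    "start": set(),
    "A": {"start"},
    "B": {"A"},
    "C": {"A"},
    "D": {"C"},
    "end": {"B", "D"},
}


def execution_order(graph):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(depends_on)
print(order)
```

Several orders are valid for this graph (B may run before or after C and D), which is why an orchestrator is free to run independent tasks in parallel.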

How a workflow is described depends on the workflow orchestration system. The common options are:

  • the workflow and tasks are stored in a database - typically defined via an API
  • the workflow is defined by code using an API and possibly code annotations
  • the workflow is its own artifact (e.g., a file) defined by a DSL (Domain Specific Language), which is specified in a common syntax (e.g., YAML, JSON, or XML) or a custom syntax.

In my survey of workflow orchestration systems, I noted a trend of moving from a DSL to code for defining workflows. You can see this trend in the chart below:

DSL Format by Category

And within specific categories, you can also consider whether code annotations are used to describe the workflow (which implies not using a DSL):

Annotation API by Category

What is a workflow as code annotations?

Many programming languages allow code annotations and some enable these annotations to define or change the behavior of the annotated code. In Python, annotations (decorators) are often a convenient way to register a function and wrap it with the “core behavior” needed to interact with a more complicated system. For a workflow, this is a convenient mechanism for registering and packaging a Python function as a task in the workflow.

For example, in the Metaflow tutorial example, the workflow is defined by three mechanisms:

  1. The workflow is defined by extending the class FlowSpec.
  2. A step in the workflow is defined with the @step annotation on a workflow class method.
  3. A step defines its dependent tasks by calling self.next().
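The three mechanisms can be sketched in plain Python. This is not Metaflow itself: `MiniFlow` and this `step` decorator are hypothetical stand-ins written only to show the shape of the pattern (a base class, an annotation that marks steps, and a `next()` call that names the following step), and the runner handles linear flows only.

```python
def step(method):
    """Mark a method as a workflow step (toy stand-in for an annotation)."""
    method.is_step = True
    return method


class MiniFlow:
    """Hypothetical base class that runs steps by following next() calls."""

    def __init__(self):
        self.trace = []    # names of the steps in the order they ran
        self._next = None  # step chosen by the most recent next() call

    def next(self, target):
        # Called inside a step to declare which step follows it.
        self._next = target

    def run(self):
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not marked with @step")
            self._next = None
            self.trace.append(current.__name__)
            current()
            current = self._next
        return self.trace


class HelloFlow(MiniFlow):
    @step
    def start(self):
        self.next(self.hello)

    @step
    def hello(self):
        self.message = "hello from a step"
        self.next(self.end)

    @step
    def end(self):
        pass  # no next() call, so the flow stops here


trace = HelloFlow().run()
print(trace)  # ['start', 'hello', 'end']
```

Note that the order of the steps only becomes visible by running the methods, because each transition is declared inside a method body.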

There are many different approaches to code annotations, but they share two common elements:

  • There is a mechanism for identifying a function or method as a task in a workflow.
  • There is either an API for chaining steps within the task definition or as separate code that defines the workflow (possibly with annotations).
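The second variant, where tasks are annotated functions and the workflow is defined by separate code, might look like the following sketch. The `task` decorator, the `TASKS` registry, and `run_chain` are hypothetical names chosen for illustration and do not come from any particular orchestration system.

```python
TASKS = {}


def task(func):
    """Register a plain function as a named task."""
    TASKS[func.__name__] = func
    return func


@task
def extract():
    return [3, 1, 2]


@task
def transform(values):
    return sorted(values)


@task
def load(values):
    return f"loaded {len(values)} values"


# The workflow is defined separately from the tasks, as an ordered chain of names.
PIPELINE = ["extract", "transform", "load"]


def run_chain(names):
    """Run the named tasks in order, passing each result to the next task."""
    result = None
    for position, name in enumerate(names):
        func = TASKS[name]
        result = func() if position == 0 else func(result)
    return result


print(run_chain(PIPELINE))  # loaded 3 values
```

Here the task functions know nothing about each other; only `PIPELINE` describes how they are chained.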

What is a workflow as a DSL?

A DSL (Domain Specific Language) is typically a separate artifact (a file) that describes an object (i.e., the workflow) using a “language” encoded in some syntax. In workflow orchestration systems, there is a high prevalence of using a generic syntax like YAML or JSON to encode the workflow description. Consequently, the DSL is a specific structure in that syntax that encodes the workflow, tasks, and the connections between the tasks.

For example, Argo Workflows are YAML files. The example workflow graph shown at the beginning of this article can be encoded as an Argo Workflow in YAML as follows:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-
spec:
  entrypoint: start
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: start
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "C"
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
      - name: end
        depends: "B && D"
        template: echo
        arguments:
          parameters: [{name: message, value: end}]

A workflow orchestration system has to process the workflow DSL artifact into an internal representation, resolve the workflow graph, and ensure all the parts are properly specified. Often, the tasks are references to implementations that are invokable in some system. In the case of Argo Workflows, the tasks are container invocations (i.e., batch-style jobs that run as pods on Kubernetes).
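A rough sketch of that processing step follows. It assumes the YAML has already been parsed into Python data, and it uses a deliberately simplified structure (a list of template names and a flat task list) rather than Argo's real schema. The function `resolve` is a hypothetical helper, and the regular expression only extracts task names from simple `depends` expressions such as "B && D".

```python
import re
from graphlib import CycleError, TopologicalSorter

# The DAG portion of the workflow, as a parser might return it (abbreviated).
workflow = {
    "templates": ["echo"],
    "tasks": [
        {"name": "A", "template": "echo"},
        {"name": "B", "template": "echo", "depends": "A"},
        {"name": "C", "template": "echo", "depends": "A"},
        {"name": "D", "template": "echo", "depends": "C"},
        {"name": "end", "template": "echo", "depends": "B && D"},
    ],
}


def resolve(spec):
    """Build {task: dependencies}; raise ValueError on bad references or cycles."""
    names = {t["name"] for t in spec["tasks"]}
    graph = {}
    for t in spec["tasks"]:
        if t["template"] not in spec["templates"]:
            raise ValueError(f"task {t['name']}: unknown template {t['template']}")
        deps = set(re.findall(r"[A-Za-z0-9_-]+", t.get("depends", "")))
        missing = deps - names
        if missing:
            raise ValueError(f"task {t['name']}: unknown dependencies {sorted(missing)}")
        graph[t["name"]] = deps
    try:
        TopologicalSorter(graph).prepare()  # raises CycleError if the graph loops
    except CycleError as err:
        raise ValueError(f"cycle detected: {err.args[1]}") from err
    return graph


graph = resolve(workflow)
print(graph["end"])
```

None of the task implementations are loaded or executed here; the structure is checked from the description alone.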

Which is better?

The answer is somewhat subjective. If your whole world is Python, the code annotation approach is very attractive. Systems that use this approach often make it very easy to get things working quickly.

When your world gets a little more complicated, it isn’t a stretch to imagine how a task in a workflow might call a remote implementation. This enables the workflow system to stay within a particular programming paradigm (e.g., Python with code annotations) while allowing interactions with other components or systems that are implemented differently.

On the other hand, a DSL can be processed separately and is typically agnostic to task implementation. You can get an idea of the shape of the workflow (e.g., the workflow graph) without running any code or locating the code for each task. That’s an attractive approach for someone who might not be an expert on a particular code base or programming language.

The challenges of workflows as code:

  • you have to execute the code to truly understand the structure of the workflow,
  • running it requires a correctly configured environment, which is typically the domain of a developer,
  • everything is packaged as code - which is great until it isn’t,
  • as the number of workflows and variety of environments expands over time, technical debt can make these workflows become brittle.

In contrast, the challenges of workflows as a DSL:

  • the workflow isn’t code - it is something else you need to learn,
  • understanding the syntax and semantics may be challenging (e.g., love or hate YAML?),
  • synchronizing workflows and task implementations may be challenging and requires extra coordination.

The common thread here is the need for coordination. A workflow is an orchestration of tasks and those tasks define an API. Regardless of how you define the workflow, you need to be careful about how task implementations evolve. That means your organization has to continually curate their workflows to be successful with either approach.

Conclusions

There is simply nothing terribly wrong with either approach for authoring workflows. If your part of the organization is primarily developers who work in a particular language (e.g., Python), then you may be better off with using code annotations. The process for keeping the workflows and tasks compatible with each other is the same as any other software engineering challenge; solutions for this are well known.

On the other hand, if your organization has a heterogeneous environment with tasks implemented in a variety of languages and different kinds of consumers of the workflows themselves, you are likely better off with a system that has a DSL somewhere in the mix. The DSL acts as an intermediary between the developers of the tasks, the way they are orchestrated, and the different business consumers within your organization.

As a final note, a DSL opens up the possibility of authoring tools, or of generating workflows from diagrams, which may help “cross chasms” between parts of an organization with different skill sets. Generating workflows via a DSL is a way to add dynamic and generative approaches to MLOps. So having a generative metalanguage as a workflow of task primitives for your organization may also be helpful with “agentic AI” systems where the workflow is not just the means but is also an outcome that can be executed to accomplish a goal.
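As a small illustration of generating a workflow description, the sketch below turns a dependency mapping into a task list shaped like the `dag.tasks` entries in the Argo example above. The function `dag_tasks` is a hypothetical helper, the output is only the task list and not a complete Workflow document, and it is printed as JSON under the assumption that the result would be embedded in a larger document before use.

```python
import json


def dag_tasks(depends_on, template="echo"):
    """Turn {task: [dependencies]} into a list shaped like the dag.tasks entries."""
    tasks = []
    for name, deps in depends_on.items():
        entry = {
            "name": name,
            "template": template,
            "arguments": {"parameters": [{"name": "message", "value": name}]},
        }
        if deps:
            # Every listed dependency must complete before this task runs.
            entry["depends"] = " && ".join(deps)
        tasks.append(entry)
    return tasks


generated = dag_tasks({"A": [], "B": ["A"], "C": ["A"], "D": ["C"], "end": ["B", "D"]})
print(json.dumps(generated[-1], indent=2))
```

Because the workflow is data, a diagram editor or another program can produce the same structure without writing any task code.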