Alex Miłowski, geek

Welcome!

Here you'll find a variety of articles I've written over the years (see the chronological list). The site is organized as a very simple blog (old school), and below are a few of the first entries listed in chronological order. You can find site keywords and more information about me in the upper right.

ready? set? go!

A survey of workflow orchestration systems

Introduction

Workflow orchestration is a common problem in business automation that has an essential place in the development and use of ML models. While systems for running workflows have been available for many years, these systems have a variety of areas of focus. Earlier systems were often focused on business process automation. Newer systems are developed specifically for the challenges of orchestrating the tasks of data science and machine learning applications. Depending on their focus, these systems have different communities of use, features, and deployment characteristics specific to their targeted domain.

This article provides a general overview of what constitutes a workflow orchestration system and follows with a survey of trends in the available systems that covers:

  • origins and activity
  • how workflows are specified
  • deployment options

What is workflow orchestration?

A workflow is an organization of a set of tasks that encapsulates a repeatable pattern of activity which typically provides services, transforms materials, or processes information [1]. The origin of the term dates back to the 1920s, primarily in the context of manufacturing. In modern parlance, we can think of a workflow as akin to a “flow chart of things that need to be accomplished” for a specific purpose within an organization. In more recent years, “workflow orchestration” or “workflow management” systems have been developed to track and execute workflows for specific domains.

In the recent past, companies used workflow orchestration for various aspects of business automation. This has enabled companies to move from paper-based or human-centric processes to ones where the rules by which actions are taken are dictated by workflows encoded in these systems. While ensuring consistency, it also gives the organization a way to track metadata around tasks and ensure completion.

Within data platforms, data science, and more recent machine learning endeavours, workflow orchestration has become a fundamental tool for scaling processes and ensuring quality outcomes. When the uses of the systems are considered, earlier systems focused on business processes whilst later ones focus on data engineering, data science, and machine learning. Each of the systems surveyed was categorized into one of the following areas of focus:

  • Business Processing - oriented for generic business process workflows
  • Science - specifically focused on scientific data processing, HPC, and modeling or inference for science
  • Data Science / ML - processing focused on data science or machine learning
  • Data Engineering - processing specific to data manipulation, ETL, and other forms of data management
  • Operations - processes for managing computer systems, clusters, databases, etc.

Of the systems surveyed, the breakdown of categories is shown below:

Systems by Category

While many of these systems can be used for different purposes, each has specializations for specific domains based on their community of use. An effort has been made to place each system into a single category based on the use cases, documentation, and marketing associated with the system.

Origins of projects

Project Creation by Category

All the systems surveyed appear after 2005, which is just after the “dot-com era” and at the start of “big data”. In the above figure, the start and end dates are shown for each category. Each column starts at the earliest project formation and ends at the last project formation. This gives a visual representation of activity and possible innovation in each category.

While business process automation has been and continues to be a focus of workflow system development, you can see some evolution of development from data engineering or operations to data science and machine learning. Meanwhile, the creation of new science-oriented systems appears to have stagnated. This may be because data engineering and machine learning methods are now used in scientific contexts, and so there is no need for special systems.

Activity

Active Projects by Category

As is often the case with open-source software, even when associated with a commercial endeavour, some of the projects appear to have been abandoned. As the above chart shows, there tends to be a 20-25% rate of abandonment for workflow systems, with the notable exception of science-oriented systems. In addition, it should be noted that some of these active projects are merely being maintained whilst others are being actively developed by a vibrant community.

While few new science-oriented workflow systems have been created in recent years, most of those that exist are still actively being used.

SaaS Offered

In addition, some of these projects have commercial SaaS offerings, which also indicates viability. The largest share is in Data Science / ML, at 35% of those surveyed. This likely correlates with the current investment in machine learning and AI technologies.

SaaS Available by Category

Workflow specification

Workflow Graph

Most workflows are conceptualized as a “graph of tasks” where there is a single starting point that may branch out to any number of tasks. Each subsequent task depends on a preceding task, which creates a link between tasks. This continues through to “leaf” tasks at the very end of the workflow. In some systems, these are all connected to an end of the workflow.

Systems differ in how a workflow is described. Some have a DSL (Domain-Specific Language) that is used to encode the workflow. Others have an API that is used by code to create the workflow via program execution. Still others have a hybrid mechanism that uses the code annotation features of a specific programming language to describe the workflow. The use of annotations simplifies the description of a workflow via an API and serves as a middle ground between the API and a DSL.
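To make the annotation approach concrete, here is a minimal, self-contained sketch of the style; the @task decorator and run() helper are hypothetical and not the API of any particular system:

# The decorator registers each function as a task and records its upstream
# dependencies; run() then executes the resulting graph in dependency order.
# This is a sketch of the annotation style, not any specific system's API.
from graphlib import TopologicalSorter

_TASKS = {}      # task name -> function
_DEPENDS = {}    # task name -> set of upstream task names

def task(*, depends_on=()):
    def register(fn):
        _TASKS[fn.__name__] = fn
        _DEPENDS[fn.__name__] = set(depends_on)
        return fn
    return register

@task()
def extract():
    return "raw data"

@task(depends_on=["extract"])
def transform():
    return "clean data"

@task(depends_on=["transform"])
def load():
    return "loaded"

def run():
    # TopologicalSorter yields each task only after its dependencies.
    for name in TopologicalSorter(_DEPENDS).static_order():
        print(f"running {name}: {_TASKS[name]()}")

if __name__ == "__main__":
    run()

The graph of tasks (extract, then transform, then load) falls out of the annotations rather than being declared in a separate DSL document.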

In the following chart, the use of a DSL and the encoding format is shown. If the DSL and format are compared to project creation dates, you can see that a DSL is more prominent in Business Processing and Science workflow systems, which generally have an earlier origin (~2005), whereas Data Engineering and Data Science / ML systems tend to use code or annotations on code rather than a DSL to describe the workflow.

Further, there is a strong trend towards using YAML as the syntax for describing the graph of tasks in the workflow DSL. This is almost exclusively true for those surveyed in the Data Science / ML category. It should be noted that there is some use of specialized syntaxes (Custom), which occurs most often in the Science category, where the DSL uses a specialized syntax that must be learned by the user.

DSL Format by Category

Meanwhile, using annotations in code to describe workflows is a growing trend. In those surveyed, it appears that as systems evolved from focusing on data engineering to data science and ML, the use of code annotations has increased. This is also likely due in part to the dominance of Python as the programming language of choice for machine learning applications and the fondness of Python users for annotation schemes.

Annotation API by Category

When it comes to describing tasks, systems that use annotations have a clear advantage in terms of simplicity. In those systems, a task is typically a function with an annotation. Subsequently, the system orchestrates execution of that function within the deployment environment.

In general, tasks are implemented as code in some programming language. Some workflow systems are agnostic to the choice of programming language because they use containers for invocation, a service request (e.g., an HTTP request to a service), or some other orthogonal invocation mechanism. Other systems are deliberately opinionated about the choice of language, either through the API provided or because of the way the workflow is described through code annotations.
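As a contrast, here is a sketch of the language-agnostic approach, where a task is just a container image that the orchestrator runs to completion; it uses the Docker CLI via subprocess, and the image name and command are only illustrative:

# Language-agnostic task invocation: the orchestrator doesn't care what
# language the task is written in, only that the container exits cleanly.
import subprocess

def run_container_task(image: str, command: list[str]) -> str:
    """Run a task packaged as a container image and return its stdout."""
    result = subprocess.run(
        ["docker", "run", "--rm", image, *command],
        capture_output=True,
        text=True,
        check=True,  # raise if the task exits with a non-zero status
    )
    return result.stdout

# e.g., a "task" that is just a one-liner in a stock Python image
print(run_container_task("python:3.12-slim", ["python", "-c", "print('hello from a task')"]))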

The following chart shows the distribution of task languages in the surveyed systems. The dominance of Python is clear from this chart due to its prevalence in data engineering, data science, and machine learning. Many of the uses of Java are from systems that focus on business processing workflows.

Task Language

Deployment

As with any software, these workflow systems must be deployed on infrastructure. Unsurprisingly, there is a strong trend towards containers and container orchestration. Many still leave the deployment considerations up to the user to decide and craft.

Deployments

When only the Data Engineering and Data Science / ML categories are considered, you can see the increasing trend towards Kubernetes as the preferred deployment.

Deployments - Data Science/ML Only

Conclusions

Overall, when you look at the activity and project creation over all the categories, two things seem to be clear:

  1. There is a healthy ecosystem of workflow systems for a variety of domains of use.
  2. There is no clearly dominant system.

While particular communities or companies behind certain systems might argue otherwise, there is clearly a lot of choice and activity in this space. There are certainly outlier systems with smaller usage, support, and less active development. In a particular category, there are probably certain systems you might have on a short list of “winners”.

In fact, what is missing here is any community sentiment around the various systems. There are systems that are in use by a lot of companies (e.g., Airflow) simply because they have been around for a while. Their “de facto” use doesn’t mean that all of their users’ needs are being met, nor that those users are satisfied with their experience using the system. These users may simply not have a choice or sufficient reason to change; working well enough means there is enough momentum to make change costly.

Rather, the point here is that there is a lot of choice given the activity in workflow systems. That variety of choice means there is a lot of opportunity for innovation by users or developers as well as for companies who have a workflow system product. And that is a very good thing.

Data

All the systems considered were drawn either from curated lists of systems or from GitHub tags such as workflow-engine or workflow. Whilst not a complete list, it does consist of 80 workflow systems or engines.

Each system’s documentation and GitHub project was examined to determine various properties. Some of these values may be subjective. An effort was made to make consistent judgements for the categories of use. Meanwhile, a valiant attempt was made to understand the features of each system by finding evidence in their documentation or examples. As such, some things may have been missed if they were hard to find. That said, this is not unlike a user’s experience with the product: if a feature is hard to find, they may assume it doesn’t exist.

The data is available here: workflow-orchestration-data.csv

References


  1. “Workflow”, Wikipedia, https://en.wikipedia.org/wiki/Workflow


Circularity in LLM-curated knowledge graphs

I recently read “Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation” (MedGraphRAG), where the authors use a careful and thoughtful construction of a knowledge graph, curated from various textual sources and extracted via a well-orchestrated LLM. Queries against this knowledge graph are used to create a prompt for a pre-trained LLM, where a clever use of tagging allows the answer to be traced back to the sources.

The paper details several innovations:

  • using layers in the graph to represent different sources of data,
  • allowing the layers to represent data with different update cadences and characteristics (e.g., patient data, medical research texts, or reference material)
  • careful use of tags to enable the traceability of the results back to the sources.

I highly recommend you read through the paper as the results surpass SOTA and the techniques are mostly sound.

When digging into the “how”, especially given the associated github project, I am continually nagged by thoughts around using an LLM for Named Entity Recognition (NER) and Relation Extraction (RE) tasks. In particular:

  1. How often does such an LLM miss reporting entities or relations entirely (omissions)?
  2. What kinds of errors does such an LLM make and how often (misinformation)?
  3. If we use an LLM to generate the knowledge graph, and it has problems from (1) and (2), how well does an LLM answer questions given information from the knowledge graph (circularity)?

The success the authors demonstrate in using the MedGraphRAG technique to answer various medical diagnosis questions is one measure of (3). As with all inference, incorrect answers will happen. Tracing down the “why” of an incorrect answer relies on understanding whether the fault lies with the input (the prompt generated from the knowledge graph) or with the inference drawn by the LLM. This means we must understand whether something is “wrong” or “missing” in the knowledge graph itself.

To answer this, I went on a tear for the last few weeks, reading whatever I could on NER and RE evaluation, datasets, and random blog posts to update myself on the latest research. There are some good NER datasets out there and some for RE as well. I am certain there are many more resources that I haven’t encountered, but I did find this list on GitHub, which led me to the conclusion that I really need to focus on RE.

In going through how the MedGraphRAG knowledge graph is constructed, there are many pre- and post-processing steps that need to be applied to their datasets. Not only do they need to process various medical texts to extract entities and relations, but they also need to chunk or summarize these texts in a way that respects topic boundaries. This helps the text fit into the limits of the prompt. The authors use “proposition transfer”, which serves as a critical step regarding topic boundaries, and that process also uses an LLM, bringing another circularity and more questions about correctness.

All things considered, the paper demonstrates how a well-constructed knowledge graph can be used to contextualize queries for better answers that are traceable back to the sources supporting those answers. To put such a technique into production, you need to be able to evaluate that the entities and relations extracted are correct and that you aren’t missing important information, and you need to do this every time you update your knowledge graph. For that, you need some mechanism for evaluating an external LLM’s capabilities and the quality of its relation extraction (RE).

Experiments with llama3

Maybe I shouldn’t be surprised, but there are subtle nuances in the prompts that can generate vastly different outcomes for relation extraction tasks. I ran some experiments running llama3.1 locally (8B parameters) just to test various things. At one point during ad hoc testing, one of the responses said something along the lines of “there is more, but I omitted them” and adding “be as comprehensive and complete as possible” to the prompt fixed that problem.

Everyone who has put something into production knows that a very subtle change can have drastic and unintended outcomes. When constructing a knowledge graph from iterative interactions with an external LLM, we need some way to know that our new prompt that fixes one problem hasn’t created a hundred problems elsewhere. That is usually the point of unit and system testing (and I already hear the groans from the software engineers).
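In that spirit, here is a minimal sketch of what a prompt “regression test” could look like; extract_relations() is a stand-in for whatever LLM call produces (subject, relation, object) triples, and the sample text and gold triples are made up:

# A tiny, hand-checked set of gold triples that every prompt revision must
# still recover. extract_relations() is hypothetical; it is hard-coded here
# so the sketch runs, but in practice it would call the LLM and parse output.
SAMPLE_TEXT = "Metformin is commonly used to treat type 2 diabetes."
GOLD = {("metformin", "treats", "type 2 diabetes")}

def extract_relations(text: str) -> set[tuple[str, str, str]]:
    return {("metformin", "treats", "type 2 diabetes")}

def test_prompt_still_recovers_gold_triples():
    missing = GOLD - extract_relations(SAMPLE_TEXT)
    assert not missing, f"prompt change lost known relations: {missing}"

if __name__ == "__main__":
    test_prompt_still_recovers_gold_triples()
    print("gold triples still recovered")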

In the case of the MedGraphRAG implementation, they use the CAMEL-AI libraries in Python to extract “entities” and “relations”. That library instructs the LLM to produce a particular syntax that reduces to typed entities and relation triples (i.e., subject, relation, object), which is then parsed by the library. I am certainly curious as to when that fails to parse, as escaping text is always a place where errors proliferate.

Meanwhile, in my own experimentation, I simply asked llama to output YAML and was surprised that it did something close to what might be parsable. A few more instructions were sufficient to pass the results into a YAML parser:

A node should be formatted in YAML syntax with the following rules:

 * All nodes must be listed under a single 'nodes' property. 
 * All relationships must be listed under a single 'relations' property.
 * The 'nodes' and 'relations' properties may not repeat at the top level.

Note:

There are so many ways I can imagine this breaking. So, we will have to see what happens when I run a lot of text through this kind of prompt.
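For what it’s worth, checking that the output at least has the required shape is straightforward; here is a minimal sketch (the sample output below is made up, and the checks mirror the rules in the prompt above):

# Parse the model's YAML output and verify the top-level 'nodes' and
# 'relations' properties that the prompt asks for are actually present.
import yaml  # PyYAML

sample_output = """
nodes:
  - name: metformin
    type: drug
  - name: type 2 diabetes
    type: disease
relations:
  - subject: metformin
    relation: treats
    object: type 2 diabetes
"""

def parse_graph(text: str) -> tuple[list, list]:
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("output is not a YAML mapping")
    for key in ("nodes", "relations"):
        if key not in data or not isinstance(data[key], list):
            raise ValueError(f"missing or malformed '{key}' property")
    return data["nodes"], data["relations"]

nodes, relations = parse_graph(sample_output)
print(f"{len(nodes)} nodes, {len(relations)} relations")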

I did spend some time experimenting on whether I could prompt llama to produce a property graph. That is, could it separate the properties of an entity from relations? Could it identify a property of a relation? I wasn’t particularly successful (yet), but that is a topic for a different blog post.

A gem of a paper on relation extraction

In my wanderings looking for more research in this area, I found this paper titled “Revisiting Relation Extraction in the era of Large Language Models” which addresses the question at the heart of knowledge graph construction with an LLM. While NER and entity resolution are critical steps, a knowledge graph wouldn’t be a graph if the LLM does not handle the RE task well. This is another paper that I highly recommend you read.

The authors give a good outline of the elements of an evaluation of RE with an LLM. They compare the results of various models and LLM techniques against human annotated datasets for relations. They also detail the need for human evaluators for determining “correctness” given the various challenges already present in the thorny problems of RE.

In some contexts, the datasets were not well suited to be run through an LLM for RE tasks. The authors say at one point,

“These results highlight a remaining limitation of in-context learning with large language models: for datasets with long texts or a large number of targets, it is not possible to fit detailed instructions in the prompt.”

This is a problem that the MedGraphRAG technique solved using proposition transfer, but doing so muddies the RE task with yet another LLM task.

An idea

I’ve recently become involved in the ML Commons efforts, where I am particularly interested in datasets. I think collecting, curating, or contributing to datasets that support LLM evaluation for knowledge graph construction would be particularly useful.

This effort at ML Commons could focus on a variety of challenges:

  • Collecting datasets: identification of existing datasets or corpora that can be used for NER and RE tasks in various domains
  • Standardized metadata: helping to standardize the metadata and structure of these datasets to allow more automated use for evaluation
  • Annotation: annotation of datasets with entities and relations to provide a baseline for comparison
  • Conformance levels: enable different levels of conformance to differentiate between the “basics” and more complex RE outcomes.
  • Tools: tooling for dataset curation and LLM evaluation

One area of innovation here would be the ability to label outcomes from an LLM not just in terms of omissions or misinformation but also in terms of whether the model can identify more subtle relations, inverse relations, etc. That would allow a consumer of these models to understand what they should expect and what they may have to do afterwards to the knowledge graph as a secondary inference.
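As a rough illustration of the kind of scoring such annotated datasets would enable, omissions show up as false negatives and misinformation as false positives when predicted triples are compared against a baseline; the triples below are made up:

# Score extracted relations against an annotated baseline: triples the model
# missed are omissions (false negatives); triples it invented are
# misinformation (false positives).
def score_triples(gold: set, predicted: set) -> dict:
    true_positives = gold & predicted
    omissions = gold - predicted       # missed relations
    spurious = predicted - gold        # hallucinated relations
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "omissions": omissions, "spurious": spurious}

gold = {("aspirin", "treats", "headache"), ("aspirin", "is_a", "NSAID")}
predicted = {("aspirin", "treats", "headache"), ("aspirin", "treats", "fever")}
print(score_triples(gold, predicted))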

I will post more on this effort when and if it becomes an official work item. I hope it does.


GQL: Schemas and Types

In GQL, a graph is a set of zero or more nodes and edges where:

A Node has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names.
An Edge has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names,
  • two endpoint nodes in the same graph,
  • an indication of whether the edge is directed or undirected,
  • when a directed edge, one endpoint node is the source and the other node is the destination.

While data can be unstructured in a graph such that nodes and edges are created in an ad hoc manner, graphs are often carefully crafted and structured to represent the relationships and properties of the subject. As such, having a schema that matches the rules used to structure the graph is useful for validation, queries, and optimizations by a database engine.

What is a schema for a graph?

A graph schema should attempt to answer basic questions like:

  • What kinds of nodes are allowed in the graph?
  • What kinds of relationships (i.e., edges) are allowed between nodes?
  • What are the named properties of nodes and edges?
  • What are the allowed values of properties?
  • What properties are optional?

Additionally, it would be good to define more ontology-oriented questions for validity like:

  • the cardinality of edges and nodes,
  • inverse relations (e.g., a parent relation to node has a child relation in reverse)
  • cross-node or cross-edge property constraints

Such additional criteria are often described as a constraint language that is separable from a schema language. That is, they can often be viewed as an additional layer and not the domain of the schema language.

Schemas in GQL

In GQL, the graph type (see §4.13.2 “Graph types and graph element types”) is the main construct for defining a schema. A graph type describes the nodes and edges allowed in a graph. It is created with a create graph type statement (see §12.6):

CREATE GRAPH TYPE MyGraph AS {
  // definitions go here
}

graph type creation

Once created, the graph type can be used to type specific instances of graphs in your site (i.e., a portion of your graph database). In this example, the specific syntax uses a nested graph type specification (see §18.1) that contains element type specifications in a comma-separated list.

Each element type specification is either:

  • a node type specification (see §18.2)
  • an edge type specification (see §18.3)

For each of these, there are two ways to describe the type:

  • a “pattern” that is similar in syntax to node or edge patterns in queries,
  • a “phrase” that is more explicit and verbose.

The simplest and most compact form of a node type specification is as a pattern:

(:label {
  prop1 :: type1,
  prop2 :: type2,
  // etc.
})

node type pattern basics

The label set defines the “keys” by which the type can be referenced, and these keys are also used to query for matching nodes. A node type can also define additional non-keyed labels; the union of these and the keyed labels is the total set of labels for the type. For example, we could model animal nodes as follows:

(:cat => :mammal:animal),
(:dog => :mammal:animal)

animals as nodes

The node types for cat and dog are keyed with unique labels but also have the labels mammal and animal on the same nodes.

Similarly, we can specify edges with patterns:

(:Person)~[:sibling]~(:Person),
(:Person)-[:children]->(:Person)

edge patterns

The first edge pattern is an undirected edge that is keyed with sibling and has two endpoints whose type is keyed with Person. The second edge pattern is a directed edge that is keyed with children and has a source and destination type that is keyed with Person.

A worked example

If we imagine we’re trying to structure a graph to represent data retrieved from resources using the Person type from schema.org, we can see how these declarations fit together as well as the “rough edges” of graph typing.

Here is a complete example where the properties and relations have been limited to make the schema smaller:

CREATE GRAPH TYPE People AS {
   (:Thing {
      name :: STRING NOT NULL,
      url :: STRING
   }),
   (:Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   }),
   (:Person)-[:knows]->(:Person),
   (:Person)-[:children]->(:Person),
   (:Person)-[:parent]->(:Person),
   (:Person)~[:sibling]~(:Person),
   (:Person)~[:spouse { started :: DATE NOT NULL, ended :: DATE }]~(:Person)
}

schema via patterns

This same schema can be described via phrase declarations:

CREATE GRAPH TYPE PeopleViaPhrase AS {
   NODE :Thing {
      name :: STRING NOT NULL,
      url :: STRING
   },
   NODE :Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   } AS Person,
   DIRECTED EDGE knows {} CONNECTING (Person -> Person),
   DIRECTED EDGE children {} CONNECTING (Person -> Person),
   DIRECTED EDGE parent {} CONNECTING (Person -> Person),
   UNDIRECTED EDGE sibling {} CONNECTING (Person ~ Person),
   UNDIRECTED EDGE spouse 
      { started :: DATE NOT NULL, ended :: DATE } 
      CONNECTING (Person ~ Person)
}

schema via phrases

Note:

It is not clear to me why there are two ways to do the same thing. In the phrase version, you’ll see AS Person added to the declaration of the Person node type. This seems necessary as the endpoint pair phrase requires a local alias. Otherwise, the outcome is effectively the same. The question remains: what can you do with one form that you can’t do with the other?

Curious things

I’ve noticed a few things:

  1. You’d like to see type extension: If Person extends Thing, then it inherits the properties of Thing instead of re-declaring the property set. If you think about that a bit, there be dragons. Having type extension in a schema language really requires having constructs and semantics for both extension and restriction. There is a whole section (§4.13.2.7 “Structural consistency of element types”) that explains what could be interpreted as a subtype. Yet, that structural consistency doesn’t appear to apply to schemas.
  2. The way local aliases are specified and used seems a bit inconsistent. You only have a local alias for a type when you declare it, yet there are specific situations where you need a local alias (e.g., the endpoint pair phrase). It also appears useful when you have a key label set with more than one label, in that you can use the local alias to avoid having to repeat the multiple labels for each edge type.
  3. Edges to anything: It isn’t clear how you can describe an edge which has an endpoint (e.g., a destination) that is any node.
  4. Node type unions: It isn’t clear to me how you specify an edge whose destination is a union of node types. While you may be able to work around this with a common label, the set of destination node types may be disjoint.
  5. Lack of graph constraints: There are graph type level constraints such as limiting the cardinality of edges that would be very useful (e.g., node type X can only have one edge of type Y).
  6. Optional keywords - why?: CREATE PROPERTY GRAPH and CREATE GRAPH, NODE TYPE and NODE, and EDGE TYPE and EDGE are all substitutable syntax without any change in semantics. I feel like we should have dropped PROPERTY and TYPE from the grammar.
  7. Synonyms - why?: NODE vs VERTEX and EDGE vs RELATIONSHIP - as synonyms, they mean the same thing, and so why didn’t they just pick one?

Concluding remarks

As with many schema languages, there are things left yet to do. As a first version, a graph type provides some basic mechanisms for describing a graph as a set of “leaf types”. The modeling language you want to use to describe your graph may simply be out of scope for GQL. Meanwhile, you can simply use another tool (pick your favorite UML or ERM tool) to generate GQL graph types useful in your database.

Presently, the type system in GQL provides a minimum bar for validation. More importantly, these types give the database system a contract for the kinds of nodes, edges, and properties that are to be created, updated, and queried. This affords the database engine the ability to optimize how information is stored, indexed, and accessed.

That should probably be the primary takeaway here:

The graph schema is for your database and not for you. It doesn’t describe everything you need to understand about your graph’s structure.
