Alex Miłowski, geek

Welcome!

Here you'll find a variety of articles I've written over the years (see the chronological list). The site is organized as a very simple blog (old school), and below are a few of the first entries listed in chronological order. You can find site keywords and more information about me in the upper right.

ready? set? go!

A survey of workflow orchestration systems

Introduction

Workflow orchestration is a common problem in business automation that has an essential place in the development and use of ML models. While systems for running workflows have been available for many years, these systems have a variety of areas of focus. Earlier systems were often focused on business process automation. Newer systems are developed specifically for the challenges of orchestrating the tasks of data science and machine learning applications. Depending on their focus, these systems have different communities of use, features, and deployment characteristics specific to their targeted domain.

This article provides a general overview of what constitutes a workflow orchestration system and follows with a survey of trends in the available systems that covers:

  • origins and activity
  • how workflows are specified
  • deployment options

What is workflow orchestration?

A workflow is an organization of a set of tasks that encapsulates a repeatable pattern of activity which typically provides services, transforms materials, or processes information [1]. The origin of the term dates back to the 1920s, primarily in the context of manufacturing. In modern parlance, we can think of a workflow as akin to a “flow chart of things that need to be accomplished” for a specific purpose within an organization. In more recent years, “workflow orchestration” or “workflow management” systems have been developed to track and execute workflows for specific domains.

In the recent past, companies used workflow orchestration for various aspects of business automation. This has enabled companies to move from paper-based or human-centric processes to ones where the rules by which actions are taken are dictated by workflows encoded in these systems. While ensuring consistency, it also gives the organization a way to track metadata around tasks and ensure completion.

Within data platforms, data science, and more recent machine learning endeavours, workflow orchestration has become a fundamental tool for scaling processes and ensuring quality outcomes. When the uses of the systems are considered, earlier systems focused on business processes whilst later ones focus on data engineering, data science, and machine learning. Each of the systems surveyed was categorized into one of the following areas of focus:

  • Business Processing - oriented for generic business process workflows
  • Science - specifically focused on scientific data processing, HPC, and modeling or inference for science
  • Data Science / ML - processing focused on data science or machine learning
  • Data Engineering - processing specific to data manipulation, ETL, and other forms of data management
  • Operations - processes for managing computer systems, clusters, databases, etc.

Of the systems surveyed, the breakdown of categories is shown below:

Systems by Category

While many of these systems can be used for different purposes, each has specializations for specific domains based on their community of use. An effort has been made to place each system into a single category based on the use cases, documentation, and marketing associated with the system.

Origins of projects

Project Creation by Category

All the systems surveyed appear after 2005, which is just after the “dot-com era” and at the start of “big data”. In the above figure, the start and end dates are shown for each category. Each column starts at the earliest project formation and ends at the last project formation. This gives a visual representation of activity and possible innovation in each category.

While business process automation has been and continues to be a focus of workflow system development, you can see some evolution of development from data engineering or operations to data science and machine learning. Meanwhile, the creation of new science-oriented systems appears to have stagnated. This may be because data engineering and machine learning methods are now used in scientific contexts, and so there is no need for special systems.

Activity

Active Projects by Category

As is often the case with open-source software, even when associated with a commercial endeavour, some of the projects appear to have been abandoned. As the above chart shows, there tends to be a 20-25% rate of abandonment for workflow systems, with the notable exception of science-oriented systems. In addition, it should be noted that some of these active projects are merely being maintained whilst others are being actively developed by a vibrant community.

While few new science-oriented workflow systems have been created in recent years, most of those that exist are still actively being used.

SaaS Offered

In addition, some of these projects have commercial SaaS offerings, which also indicates viability. The largest share is in Data Science / ML, at 35% of those surveyed. This likely correlates with the current investment in machine learning and AI technologies.

SaaS Available by Category

Workflow specification

Workflow Graph

Most workflows are conceptualized as a “graph of tasks” where there is a single starting point that may branch out to any number of tasks. Each subsequent task depends on a preceding task, which creates a link between tasks. This continues through to “leaf” tasks at the very end of the workflow. In some systems, these are all connected to an end of the workflow.

Systems differ in how a workflow is described. Some have a DSL (Domain-Specific Language) that is used to encode the workflow. Others have an API that is used by code to create the workflow via program execution. Still others have a hybrid mechanism that uses the code annotation features of a specific programming language to describe the workflow. The use of annotations simplifies the description of a workflow via an API and serves as a middle ground between the API and a DSL.
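To make the annotation approach concrete, here is a minimal, self-contained sketch of the style; the @task decorator and run() helper are hypothetical and not the API of any particular system:

# The decorator registers each function as a task and records its upstream
# dependencies; run() then executes the resulting graph in dependency order.
# This is a sketch of the annotation style, not any specific system's API.
from graphlib import TopologicalSorter

_TASKS = {}      # task name -> function
_DEPENDS = {}    # task name -> set of upstream task names

def task(*, depends_on=()):
    def register(fn):
        _TASKS[fn.__name__] = fn
        _DEPENDS[fn.__name__] = set(depends_on)
        return fn
    return register

@task()
def extract():
    return "raw data"

@task(depends_on=["extract"])
def transform():
    return "clean data"

@task(depends_on=["transform"])
def load():
    return "loaded"

def run():
    # TopologicalSorter yields each task only after its dependencies.
    for name in TopologicalSorter(_DEPENDS).static_order():
        print(f"running {name}: {_TASKS[name]()}")

if __name__ == "__main__":
    run()

The graph of tasks (extract, then transform, then load) falls out of the annotations rather than being declared in a separate DSL document.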

In the following chart, the use of a DSL and the encoding format is shown. If the DSL and format are compared to project creation dates, you can see that a DSL is more prominent in Business Processing and Science workflow systems, which generally have an earlier origin (~2005), whereas Data Engineering and Data Science / ML systems tend to use code or annotations on code rather than a DSL to describe the workflow.

Further, there is a strong trend towards using YAML as the syntax for describing the graph of tasks in the workflow DSL. This is almost exclusively true for those surveyed in the Data Science / ML category. It should be noted that there is some use of specialized syntaxes (Custom), which occurs most often in the Science category, where the DSL uses a specialized syntax that must be learned by the user.

DSL Format by Category

Meanwhile, using annotations in code to describe workflows is a growing trend. In those surveyed, it appears that as systems evolved from focusing on data engineering to data science and ML, the use of code annotations has increased. This is also likely due in part to the dominance of Python as the programming language of choice for machine learning applications and the fondness of Python users for annotation schemes.

Annotation API by Category

When it comes to describing tasks, systems that use annotations have a clear advantage in terms of simplicity. In those systems, a task is typically a function with an annotation. Subsequently, the system orchestrates execution of that function within the deployment environment.

In general, tasks are implemented as code in some programming language. Some workflow systems are agnostic to the choice of programming language because they use containers for invocation, a service request (e.g., an HTTP request to a service), or some other orthogonal invocation mechanism. Other systems are deliberately opinionated about the choice of language, either through the API provided or because of the way the workflow is described through code annotations.
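As a contrast, here is a sketch of the language-agnostic approach, where a task is just a container image that the orchestrator runs to completion; it uses the Docker CLI via subprocess, and the image name and command are only illustrative:

# Language-agnostic task invocation: the orchestrator doesn't care what
# language the task is written in, only that the container exits cleanly.
import subprocess

def run_container_task(image: str, command: list[str]) -> str:
    """Run a task packaged as a container image and return its stdout."""
    result = subprocess.run(
        ["docker", "run", "--rm", image, *command],
        capture_output=True,
        text=True,
        check=True,  # raise if the task exits with a non-zero status
    )
    return result.stdout

# e.g., a "task" that is just a one-liner in a stock Python image
print(run_container_task("python:3.12-slim", ["python", "-c", "print('hello from a task')"]))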

The following chart shows the distribution of task languages in the surveyed systems. The dominance of Python is clear from this chart due to its prevalence in data engineering, data science, and machine learning. Many of the uses of Java are from systems that focus on business processing workflows.

Task Language

Deployment

As with any software, these workflow systems must be deployed on infrastructure. Unsurprisingly, there is a strong trend towards containers and container orchestration. Many still leave the deployment considerations up to the user to decide and craft.

Deployments

When only the Data Engineering and Data Science / ML categories are considered, you can see the increasing trend towards Kubernetes as the preferred deployment.

Deployments - Data Science/ML Only

Conclusions

Overall, when you look at the activity and project creation over all the categories, two things seem to be clear:

  1. There is a healthy ecosystem of workflow systems for a variety of domains of use.
  2. There is no clearly dominant system.

While particular communities or companies behind certain systems might argue otherwise, there is clearly a lot of choice and activity in this space. There are certainly outlier systems with smaller usage, support, and less active development. In a particular category, there are probably certain systems you might have on a short list of “winners”.

In fact, what is missing here is any community sentiment around the various systems. There are systems that are in use by a lot of companies (e.g., Airflow) simply because they have been around for a while. Their “de facto” use doesn’t mean that all of their users’ needs are being met, nor that those users are satisfied with their experience using the system. These users may simply not have a choice or sufficient reason to change; working well enough means there is enough momentum to make change costly.

Rather, the point here is that there is a lot of choice given the activity in workflow systems. That variety of choice means there is a lot of opportunity for innovation by users or developers as well as for companies who have a workflow system product. And that is a very good thing.

Data

All the systems considered were drawn either from curated lists of systems or from GitHub tags such as workflow-engine or workflow. Whilst not a complete list, it does consist of 80 workflow systems or engines.

Each system’s documentation and GitHub project was examined to determine various properties. Some of these values may be subjective. An effort was made to make consistent judgements for the categories of use. Meanwhile, a valiant attempt was made to understand the features of each system by finding evidence in their documentation or examples. As such, some things may have been missed if they were hard to find. That said, this is not unlike a user’s experience with the product: if a feature is hard to find, they may assume it doesn’t exist.

The data is available here: workflow-orchestration-data.csv

References


  1. “Workflow”, Wikipedia, https://en.wikipedia.org/wiki/Workflow


Circularity in LLM-curated knowledge graphs

I recently read “Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation” (MedGraphRAG), where the authors use a careful and thoughtful construction of a knowledge graph, curated from various textual sources and extracted via a well-orchestrated LLM. Queries against this knowledge graph are used to create a prompt for a pre-trained LLM, where a clever use of tagging allows the answer to be traced back to the sources.

The paper details several innovations:

  • using layers in the graph to represent different sources of data,
  • allowing the layers to represent data with different update cadences and characteristics (e.g., patient data, medical research texts, or reference material)
  • careful use of tags to enable the traceability of the results back to the sources.

I highly recommend you read through the paper as the results surpass SOTA and the techniques are mostly sound.

When digging into the “how”, especially given the associated github project, I am continually nagged by thoughts around using an LLM for Named Entity Recognition (NER) and Relation Extraction (RE) tasks. In particular:

  1. How often does such an LLM miss reporting entities or relations entirely (omissions)?
  2. What kinds of errors does such an LLM make and how often (misinformation)?
  3. If we use an LLM to generate the knowledge graph, and it has problems from (1) and (2), how well does an LLM answer questions given information from the knowledge graph (circularity)?

The success the authors demonstrate in using the MedGraphRAG technique to answer various medical diagnosis questions is one measure of (3). As with all inference, incorrect answers will happen. Tracing down the “why” of an incorrect answer relies on understanding whether the fault lies with the input (the prompt generated from the knowledge graph) or with the inference drawn by the LLM. This means we must understand whether something is “wrong” or “missing” in the knowledge graph itself.

To answer this, I went on a tear for the last few weeks, reading whatever I could on NER and RE evaluation, datasets, and random blog posts to update myself on the latest research. There are some good NER datasets out there and some for RE as well. I am certain there are many more resources that I haven’t encountered, but I did find this list on GitHub, which led me to the conclusion that I really need to focus on RE.

In going through how the MedGraphRAG knowledge graph is constructed, there are many pre- and post-processing steps that need to be applied to their datasets. Not only do they need to process various medical texts to extract entities and relations, but they also need to chunk or summarize these texts in a way that respects topic boundaries. This helps the text fit into the limits of the prompt. The authors use “proposition transfer”, which serves as a critical step regarding topic boundaries, and that process also uses an LLM, bringing another circularity and more questions about correctness.

All things considered, the paper demonstrates how a well-constructed knowledge graph can be used to contextualize queries for better answers that are traceable back to the sources supporting those answers. To put such a technique into production, you need to be able to evaluate that the entities and relations extracted are correct and that you aren’t missing important information, and you need to do this every time you update your knowledge graph. For that, you need some mechanism for evaluating an external LLM’s capabilities and the quality of its relation extraction (RE).

Experiments with llama3

Maybe I shouldn’t be surprised, but there are subtle nuances in the prompts that can generate vastly different outcomes for relation extraction tasks. I ran some experiments running llama3.1 locally (8B parameters) just to test various things. At one point during ad hoc testing, one of the responses said something along the lines of “there is more, but I omitted them” and adding “be as comprehensive and complete as possible” to the prompt fixed that problem.

Everyone who has put something into production knows that a very subtle change can have drastic and unintended outcomes. When constructing a knowledge graph from iterative interactions with an external LLM, we need some way to know that our new prompt that fixes one problem hasn’t created a hundred problems elsewhere. That is usually the point of unit and system testing (and I already hear the groans from the software engineers).
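In that spirit, here is a minimal sketch of what a prompt “regression test” could look like; extract_relations() is a stand-in for whatever LLM call produces (subject, relation, object) triples, and the sample text and gold triples are made up:

# A tiny, hand-checked set of gold triples that every prompt revision must
# still recover. extract_relations() is hypothetical; it is hard-coded here
# so the sketch runs, but in practice it would call the LLM and parse output.
SAMPLE_TEXT = "Metformin is commonly used to treat type 2 diabetes."
GOLD = {("metformin", "treats", "type 2 diabetes")}

def extract_relations(text: str) -> set[tuple[str, str, str]]:
    return {("metformin", "treats", "type 2 diabetes")}

def test_prompt_still_recovers_gold_triples():
    missing = GOLD - extract_relations(SAMPLE_TEXT)
    assert not missing, f"prompt change lost known relations: {missing}"

if __name__ == "__main__":
    test_prompt_still_recovers_gold_triples()
    print("gold triples still recovered")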

In the case of the MedGraphRAG implementation, they use the CAMEL-AI libraries in Python to extract “entities” and “relations”. That library instructs the LLM to produce a particular syntax that reduces to typed entities and relation triples (i.e., subject, relation, object), which is then parsed by the library. I am certainly curious as to when that fails to parse, as escaping text is always a place where errors proliferate.

Meanwhile, in my own experimentation, I simply asked llama to output YAML and was surprised that it did something close to what might be parsable. A few more instructions were sufficient to pass the results into a YAML parser:

A node should be formatted in YAML syntax with the following rules:

 * All nodes must be listed under a single 'nodes' property. 
 * All relationships must be listed under a single 'relations' property.
 * The 'nodes' and 'relations' properties may not repeat at the top level.

Note:

There are so many ways I can imagine this breaking. So, we will have to see what happens when I run a lot of text through this kind of prompt.
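For what it’s worth, checking that the output at least has the required shape is straightforward; here is a minimal sketch (the sample output below is made up, and the checks mirror the rules in the prompt above):

# Parse the model's YAML output and verify the top-level 'nodes' and
# 'relations' properties that the prompt asks for are actually present.
import yaml  # PyYAML

sample_output = """
nodes:
  - name: metformin
    type: drug
  - name: type 2 diabetes
    type: disease
relations:
  - subject: metformin
    relation: treats
    object: type 2 diabetes
"""

def parse_graph(text: str) -> tuple[list, list]:
    data = yaml.safe_load(text)
    if not isinstance(data, dict):
        raise ValueError("output is not a YAML mapping")
    for key in ("nodes", "relations"):
        if key not in data or not isinstance(data[key], list):
            raise ValueError(f"missing or malformed '{key}' property")
    return data["nodes"], data["relations"]

nodes, relations = parse_graph(sample_output)
print(f"{len(nodes)} nodes, {len(relations)} relations")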

I did spend some time experimenting on whether I could prompt llama to produce a property graph. That is, could it separate the properties of an entity from relations? Could it identify a property of a relation? I wasn’t particularly successful (yet), but that is a topic for a different blog post.

A gem of a paper on relation extraction

In my wanderings looking for more research in this area, I found this paper titled “Revisiting Relation Extraction in the era of Large Language Models” which addresses the question at the heart of knowledge graph construction with an LLM. While NER and entity resolution are critical steps, a knowledge graph wouldn’t be a graph if the LLM does not handle the RE task well. This is another paper that I highly recommend you read.

The authors give a good outline of the elements of an evaluation of RE with an LLM. They compare the results of various models and LLM techniques against human annotated datasets for relations. They also detail the need for human evaluators for determining “correctness” given the various challenges already present in the thorny problems of RE.

In some contexts, the datasets were not well suited to be run through an LLM for RE tasks. The authors say at one point,

“These results highlight a remaining limitation of in-context learning with large language models: for datasets with long texts or a large number of targets, it is not possible to fit detailed instructions in the prompt.”

This is a problem that the MedGraphRAG technique solved using proposition transfer, but doing so muddies the RE task with yet another LLM task.

An idea

I’ve recently become involved in the ML Commons efforts, where I am particularly interested in datasets. I think collecting, curating, or contributing to datasets that support LLM evaluation for knowledge graph construction would be particularly useful.

This effort at ML Commons could focus on a variety of challenges:

  • Collecting datasets: identification of existing datasets or corpora that can be used for NER and RE tasks in various domains
  • Standardized metadata: helping to standardize the metadata and structure of these datasets to allow more automated use for evaluation
  • Annotation: annotation of datasets with entities and relations to provide a baseline for comparison
  • Conformance levels: enable different levels of conformance to differentiate between the “basics” and more complex RE outcomes.
  • Tools: tooling for dataset curation and LLM evaluation

One area of innovation here would be the ability to label outcomes from an LLM not just in terms of omissions or misinformation but also in terms of whether the model can identify more subtle relations, inverse relations, etc. That would allow a consumer of these models to understand what they should expect and what they may have to do afterwards to the knowledge graph as a secondary inference.
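As a rough illustration of the kind of scoring such annotated datasets would enable, omissions show up as false negatives and misinformation as false positives when predicted triples are compared against a baseline; the triples below are made up:

# Score extracted relations against an annotated baseline: triples the model
# missed are omissions (false negatives); triples it invented are
# misinformation (false positives).
def score_triples(gold: set, predicted: set) -> dict:
    true_positives = gold & predicted
    omissions = gold - predicted       # missed relations
    spurious = predicted - gold        # hallucinated relations
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "omissions": omissions, "spurious": spurious}

gold = {("aspirin", "treats", "headache"), ("aspirin", "is_a", "NSAID")}
predicted = {("aspirin", "treats", "headache"), ("aspirin", "treats", "fever")}
print(score_triples(gold, predicted))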

I will post more on this effort when and if it becomes an official work item. I hope it does.


GQL: Schemas and Types

In GQL, a graph is a set of zero or more nodes and edges where:

A Node has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names.
An Edge has:
  • a label set with zero or more unique label names,
  • a property set with zero or more name-value pairs with unique names,
  • two endpoint nodes in the same graph,
  • an indication of whether the edge is directed or undirected,
  • when a directed edge, one endpoint node is the source and the other node is the destination.

While data can be unstructured in a graph such that nodes and edges are created in an ad hoc manner, graphs are often carefully crafted and structured to represent the relationships and properties of the subject. As such, having a schema that matches the rules used to structure the graph is useful for validation, queries, and optimizations by a database engine.

What is a schema for a graph?

A graph schema should attempt to answer basic questions like:

  • What kinds of nodes are allowed in the graph?
  • What kinds of relationships (i.e., edges) are allowed between nodes?
  • What are the named properties of nodes and edges?
  • What are the allowed values of properties?
  • What properties are optional?

Additionally, it would be good to define more ontology-oriented questions for validity like:

  • the cardinality of edges and nodes,
  • inverse relations (e.g., a parent relation to node has a child relation in reverse)
  • cross-node or cross-edge property constraints

Such additional criteria are often described as a constraint language that is separable from a schema language. That is, they can often be viewed as an additional layer and not the domain of the schema language.

Schemas in GQL

In GQL, the graph type (see §4.13.2 “Graph types and graph element types”) is the main construct for defining a schema. A graph type describes the nodes and edges allowed in a graph. It is created with a create graph type statement (see §12.6):

CREATE GRAPH TYPE MyGraph AS {
  // definitions go here
}

graph type creation

Once created, the graph type can be used to type specific instances of graphs in your site (i.e., a portion of your graph database). In this example, the specific syntax uses a nested graph type specification (see §18.1) that contains element type specifications in a comma-separated list.

Each element type specification is either:

  • a node type specification (see §18.2)
  • an edge type specification (see §18.3)

For each of these, there are two ways to describe the type:

  • a “pattern” that is similar in syntax to node or edge patterns in queries,
  • a “phrase” that is more explicit and verbose.

The simplest and most compact form of a node type specification is as a pattern:

(:label {
  prop1 :: type1,
  prop2 :: type2,
  // etc.
})

node type pattern basics

The label set defines the “keys” by which the type can be referenced, and these keys are also used to query for matching nodes. A node type can also define additional non-keyed labels; the union of these and the keyed labels is the total set of labels for the type. For example, we could model animal nodes as follows:

(:cat => :mammal:animal),
(:dog => :mammal:animal)

animals as nodes

The node types for cat and dog are keyed with unique labels but also have the labels mammal and animal on the same nodes.

Similarly, we can specify edges with patterns:

(:Person)~[:sibling]~(:Person),
(:Person)-[:children]->(:Person)

edge patterns

The first edge pattern is an undirected edge that is keyed with sibling and has two endpoints whose type is keyed with Person. The second edge pattern is a directed edge that is keyed with children and has a source and destination type that is keyed with Person.

A worked example

If we imagine we’re trying to structure a graph to represent data retrieved from resources using the Person type from schema.org, we can see how these declarations fit together as well as the “rough edges” of graph typing.

Here is a complete example where the properties and relations have been limited to make the schema smaller:

CREATE GRAPH TYPE People AS {
   (:Thing {
      name :: STRING NOT NULL,
      url :: STRING
   }),
   (:Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   }),
   (:Person)-[:knows]->(:Person),
   (:Person)-[:children]->(:Person),
   (:Person)-[:parent]->(:Person),
   (:Person)~[:sibling]~(:Person),
   (:Person)~[:spouse { started :: DATE NOT NULL, ended :: DATE }]~(:Person)
}

schema via patterns

This same schema can be described via phrase declarations:

CREATE GRAPH TYPE PeopleViaPhrase AS {
   NODE :Thing {
      name :: STRING NOT NULL,
      url :: STRING
   },
   NODE :Person => :Thing {
      name :: STRING NOT NULL,
      url :: STRING,
      givenName :: STRING,
      familyName :: STRING NOT NULL,
      birthDate :: DATE NOT NULL,
      deathDate :: DATE
   } AS Person,
   DIRECTED EDGE knows {} CONNECTING (Person -> Person),
   DIRECTED EDGE children {} CONNECTING (Person -> Person),
   DIRECTED EDGE parent {} CONNECTING (Person -> Person),
   UNDIRECTED EDGE sibling {} CONNECTING (Person ~ Person),
   UNDIRECTED EDGE spouse 
      { started :: DATE NOT NULL, ended :: DATE } 
      CONNECTING (Person ~ Person)
}

schema via phrases

Note:

It is not clear to me why there are two ways to do the same thing. In the phrase version, you’ll see AS Person added to the declaration of the Person node type. This seems necessary as the endpoint pair phrase requires a local alias. Otherwise, the outcome is effectively the same. The question remains: what can you do with one form that you can’t do with the other?

Curious things

I’ve noticed a few things:

  1. You’d like to see type extension: If Person extends Thing, then it inherits the properties of Thing instead of re-declaring the property set. If you think about that a bit, there be dragons. Having type extension in a schema language really requires having constructs and semantics for both extension and restriction. There is a whole section (§4.13.2.7 “Structural consistency of element types”) that explains what could be interpreted as a subtype. Yet, that structural consistency doesn’t appear to apply to schemas.
  2. The way local aliases are specified and used seems a bit inconsistent. You only have a local alias for a type when you declare it, yet there are specific situations where you need a local alias (e.g., the endpoint pair phrase). It also appears useful when you have a key label set with more than one label, in that you can use the local alias to avoid having to repeat the multiple labels for each edge type.
  3. Edges to anything: It isn’t clear how you can describe an edge which has an endpoint (e.g., a destination) that is any node.
  4. Node type unions: It isn’t clear to me how you specify an edge whose destination is a union of node types. While you may be able to work around this with a common label, the set of destination node types may be disjoint.
  5. Lack of graph constraints: There are graph type level constraints such as limiting the cardinality of edges that would be very useful (e.g., node type X can only have one edge of type Y).
  6. Optional keywords - why?: CREATE PROPERTY GRAPH and CREATE GRAPH, NODE TYPE and NODE, and EDGE TYPE and EDGE are all substitutable syntax without any change in semantics. I feel like we should have dropped PROPERTY and TYPE from the grammar.
  7. Synonyms - why?: NODE vs VERTEX and EDGE vs RELATIONSHIP - as synonyms, they mean the same thing, and so why didn’t they just pick one?

Concluding remarks

As with many schema languages, there are things left yet to do. As a first version, a graph type provides some basic mechanisms for describing a graph as a set of “leaf types”. The modeling language you want to use to describe your graph may simply be out of scope for GQL. Meanwhile, you can simply use another tool (pick your favorite UML or ERM tool) to generate GQL graph types useful in your database.

Presently, the type system in GQL provides a minimum bar for validation. More importantly, these types give the database system a contract for the kinds of nodes, edges, and properties that are to be created, updated, and queried. This affords the database engine the ability to optimize how information is stored, indexed, and accessed.

That should probably be the primary takeaway here:

The graph schema is for your database and not for you. It doesn’t describe everything you need to understand about your graph’s structure.
