In GQL, a graph is a set of zero or more nodes and edges where:

A node has:
- a label set with zero or more unique label names,
- a property set with zero or more name-value pairs with unique names.

An edge has:
- a label set with zero or more unique label names,
- a property set with zero or more name-value pairs with unique names,
- two endpoint nodes in the same graph,
- an indication of whether the edge is directed or undirected; when directed, one endpoint node is the source and the other is the destination.
While data can be unstructured in a graph such that nodes and edges are created in an ad hoc manner,
graphs are often carefully crafted and structured to represent the relationships and properties of the
subject. As such, having
a schema that matches the rules used to structure the graph is useful for validation, queries, and
optimizations by a database engine.
What is a schema for a graph?
A graph schema should attempt to answer basic questions like:
- What kinds of nodes are allowed in the graph?
- What kinds of relationships (i.e., edges) are allowed between nodes?
- What are the named properties of nodes and edges?
- What are the allowed values of properties?
- What properties are optional?
Additionally, it would be good to define more ontology-oriented
questions for validity like:
- the cardinality of edges and nodes,
- inverse relations (e.g., a parent relation to a node has a child relation in reverse),
- cross-node or cross-edge property constraints.
Such additional criteria are often described as a constraint language that is separable
from a schema language. That is, they can often be viewed as an additional
layer and not the domain of the schema language.
Schemas in GQL
In GQL, the graph type
(see §4.13.2 “Graph types and graph element types”) is
the main construct for defining a schema. A graph type describes the nodes and edges
allowed in a graph. It is created with a
create graph type statement
(see §12.6):
CREATE GRAPH TYPE MyGraph AS {
// definitions go here
}
graph type creation
Once created, the graph type can be used to type specific instances of graphs
in your site (i.e., a portion of your graph database). In this example, the specific
syntax uses a nested graph type specification
(see §18.1)
that contains a comma-separated list of element type specifications.
Each element type specification is either:
- a node type specification (see §18.2), or
- an edge type specification (see §18.3).
For each of these, there are two ways to describe the type:
- a “pattern” that is similar in syntax to node or edge patterns in queries,
- a “phrase” that is more explicit and verbose.
The simplest and most compact form of a node type specification
is as a pattern:
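For example, a minimal node type pattern (using the Thing properties from the worked example below) looks like:

```gql
(:Thing {
  name :: STRING NOT NULL,
  url :: STRING
})
```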
The label set defines the “keys” by which the type can be referenced and is also
used to query for matching nodes. A node type can also define additional non-keyed
labels, and the union of these and the keyed labels is the total set of labels for the type. For example, we could
model animal nodes as the following:
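A sketch, with hypothetical Dog and Animal labels, where Dog is the key label and Animal is an additional non-keyed label:

```gql
(:Dog => :Animal {
  name :: STRING NOT NULL,
  breed :: STRING
})
```

Edge types can be declared with similar patterns, as in these two examples:

```gql
(:Person)~[:sibling]~(:Person),
(:Person)-[:children]->(:Person)
```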
The first edge pattern is an undirected edge that is keyed with sibling and has two endpoints whose type is keyed with Person.
The second edge pattern is a directed edge that is keyed with children
and has a source and destination type that is keyed with Person.
A worked example
If we imagine we’re trying to structure a graph to represent data retrieved from
resources using the Person type from schema.org,
we can see how these declarations fit together as well as the “rough edges” of
graph typing.
Here is a complete example where the properties and relations have been
limited to make the schema smaller:
CREATE GRAPH TYPE People AS {
(:Thing {
name :: STRING NOT NULL,
url :: STRING
}),
(:Person => :Thing {
name :: STRING NOT NULL,
url :: STRING,
givenName :: STRING,
familyName :: STRING NOT NULL,
birthDate :: DATE NOT NULL,
deathDate :: DATE
}),
(:Person)-[:knows]->(:Person),
(:Person)-[:children]->(:Person),
(:Person)-[:parent]->(:Person),
(:Person)~[:sibling]~(:Person),
(:Person)~[:spouse { started :: DATE NOT NULL, ended :: DATE }]~(:Person)
}
schema via patterns
This same schema can be described via phrase declarations:
CREATE GRAPH TYPE PeopleViaPhrase AS {
NODE :Thing {
name :: STRING NOT NULL,
url :: STRING
},
NODE :Person => :Thing {
name :: STRING NOT NULL,
url :: STRING,
givenName :: STRING,
familyName :: STRING NOT NULL,
birthDate :: DATE NOT NULL,
deathDate :: DATE
} AS Person,
DIRECTED EDGE knows {} CONNECTING (Person -> Person),
DIRECTED EDGE children {} CONNECTING (Person -> Person),
DIRECTED EDGE parent {} CONNECTING (Person -> Person),
UNDIRECTED EDGE sibling {} CONNECTING (Person ~ Person),
UNDIRECTED EDGE spouse
{ started :: DATE NOT NULL, ended :: DATE }
CONNECTING (Person ~ Person)
}
schema via phrases
Note:
It is not clear to me why there are two ways to do the same thing. In the phrase version, you’ll
see that AS Person is added to the declaration of the Person node type. This seems necessary
as the endpoint pair phrase requires a local alias. Otherwise,
the outcome is effectively the same. The question remains: what can you do with one form that you
can’t do with the other?
Curious things
I’ve noticed a few things:
- Type extension: You’d like to see type extension: if Person extends Thing, then it inherits the properties of Thing instead of re-declaring the property set. If you think about that a bit, there be dragons. Having type extension in a schema language really requires having constructs and semantics for both extension and restriction. There is a whole section (§4.13.2.7 “Structural consistency of element types”) that explains what could be interpreted as a subtype. Yet, that structural consistency doesn’t appear to apply to schemas.
- Local aliases: The way local aliases are specified and used seems a bit inconsistent. You only have a local alias for a type when you declare it, yet there are specific situations where you need a local alias (e.g., the endpoint pair phrase). It also appears useful when you have a key label set with more than one label, in that you can use the local alias to avoid being required to repeat the multiple labels for each edge type.
- Edges to anything: It isn’t clear how you can describe an edge which has an endpoint (e.g., a destination) that is any node.
- Node type unions: It isn’t clear to me how you specify an edge whose destination is a union of node types. While you may be able to work around this with a common label, the set of destination node types may be disjoint.
- Lack of graph constraints: There are graph-type-level constraints, such as limiting the cardinality of edges, that would be very useful (e.g., node type X can only have one edge of type Y).
- Optional keywords - why?: CREATE PROPERTY GRAPH and CREATE GRAPH, NODE TYPE and NODE, and EDGE TYPE and EDGE are all substitutable syntax without any change in semantics. I feel like we should have dropped PROPERTY and TYPE from the grammar.
- Synonyms - why?: NODE vs VERTEX and EDGE vs RELATIONSHIP - as synonyms, they mean the same thing, so why didn’t they just pick one?
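If an edge endpoint can reference a shared non-keyed label (an assumption I haven’t verified), a sketch of the common-label workaround mentioned above might look like this, with hypothetical Dog, Cat, Pet, and Person types:

```gql
CREATE GRAPH TYPE CommonLabelWorkaround AS {
  (:Dog => :Pet { name :: STRING NOT NULL }),
  (:Cat => :Pet { name :: STRING NOT NULL }),
  (:Person { name :: STRING NOT NULL }),
  (:Person)-[:owns]->(:Pet)
}
```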
Concluding remarks
As with many schema languages, there are things left yet to do. As a first version, a graph type
provides
some basic mechanisms for describing a graph as a set of “leaf types”. The modeling language you want to use to describe your graph
may simply be out of scope for GQL. Meanwhile, you can use another tool (pick your favorite UML or ERM tool) to generate
GQL graph types useful in your database.
Presently, the type system in GQL provides a minimum bar for validation. More importantly, these types give
the database system a contract for the kinds of nodes, edges, and properties that are to be created, updated, and queried. This
affords the database engine the ability to optimize how information is stored, indexed, and accessed.
That should probably be the primary takeaway here:
The graph schema is for your database and not for you. It doesn’t describe everything you need to
understand about your graph’s structure.
When trying to understand a new language, I always like to start with how variables are defined
and examine how scoping works with respect to variable usage. This helps with
understanding the basics of how you can store values, use them in expressions, and where scope of use ends.
While this may seem simple, how a language implements scoping rules isn’t always straightforward.
Execution context
At the center of evaluation in GQL is the “execution context” (§ 4.8)
that has:
- a working record, which is a record,
- a working table, which is a binding table,
- an execution outcome.
A statement in GQL may modify this context by setting the working table, working record, or the
execution outcome. During the execution of a construct, new “child contexts” may be
established, based on the current context, and used for nested evaluation. In the end,
the “current working table” can be used to formulate the query result which is represented in
the execution outcome.
For more on that, see “General Rules” in §14.10 <primitive result statement> where you’ll see
the conditions under which:
The current execution outcome is set to a successful outcome with the current working table as its result.
The crucial point here is that the current “in scope” variable names are drawn from the fields of the working records and tables from
the current context in various ways.
Calling procedures
Consider the following GQL program, which is an inline procedure call (see § 15 “Procedure calling”). An
inline procedure call is like a closure
where the arguments to the procedure are a list of variables that must be available in the
incoming working record. When that list is omitted completely, all the fields of the incoming working record
are available. Note that when an empty list of variables is specified (i.e., ()), the effective
incoming variables are an empty sequence.
CALL (x, y) {
VALUE z = x + y
RETURN z
}
An inline procedure call
For a procedure call statement to evaluate, we need a complete context where x and y are defined so that
we can expect a successful outcome:
VALUE x = 40
VALUE y = 2
VALUE k = 12
CALL (x, y) {
VALUE z = x + y
RETURN z
}
An inline procedure call in context
Each statement in this program can change the context, using incoming values for the
working record and working table to produce an outgoing working record and table.
For this query, the context is transformed as follows:
| Level | Statement | Incoming | Outgoing |
|-------|-----------|----------|----------|
| 1 | VALUE x = 40 | record: unit; table: unit | record: x=40; table: unit |
| 1 | VALUE y = 2 | record: x=40; table: unit | record: x=40, y=2; table: unit |
| 1 | VALUE k = 12 | record: x=40, y=2; table: unit | record: x=40, y=2, k=12; table: unit |
| 1.1 | CALL (x,y) { ... } | record: x=40, y=2, k=12; table: unit | record: x=40, y=2, k=12; table: 1: z=42 |
| 1.1 | (x,y) { ... } | record: x=40, y=2; table: unit | record: x=40, y=2; table: 1: z=42 |
| 1.1.1 | { VALUE z = x + y RETURN z } | record: x=40, y=2; table: unit | record: x=40, y=2; table: 1: z=42 |
| 1.1.1 | VALUE z = x + y | record: x=40, y=2; table: unit | record: x=40, y=2, z=42; table: unit |
| 1.1.1 | RETURN z | record: x=40, y=2, z=42; table: unit | record: x=40, y=2, z=42; table: 1: z=42 |
In the above table, I have attempted to outline how the execution context
changes when executing the GQL program with values for x and y. Each
of the columns has these definitions:
- The “level” column shows the execution context as it is changed for “child” constructs.
- The “statement” column is the portion of the query being evaluated.
- The “incoming” column is the record and table of the context coming into the expression.
- The “outgoing” column is the record and table of the context resulting from the expression evaluation.
The evaluation can be explained as follows:
- A value variable definition (§ 10.3) (in rows 1-3 and 7) adds a single variable to the incoming working record of the current context. It is an error if the name is already in the working record.
- The call procedure statement (§ 15.1) (in row 4) invokes a procedure call; the outgoing context is the incoming working table amended with the outgoing working table of the called procedure.
- The inline procedure call (§ 15.2) (in row 5) changes the incoming record to contain only the binding variables listed in the variable scope clause (e.g., (x, y)).
- The nested procedure specification (§ 9.1) / procedure body (§ 9.2) (in row 6) creates a new execution context whose working record contains the results of the value variable definition; otherwise, the working record remains unchanged. In this case, the definition of z causes a new child execution context. Also, the outgoing declared type of the procedure body is the declared type of the last statement in the procedure.
- The value variable definition (§ 10.3) (in row 7) only declares z in the context - in this case, a new child context. The result is that z won’t be available outside the nested procedure specification.
- The return statement (§ 14.11) (in row 8) changes the working table of the current execution context. This establishes the declared type of the procedure body, which will be used to merge the results into the calling context.
Note:
While I believe this is correct, I may have missed some nuance. I am working through how binding tables are transformed and how values interact with them.
The result of this query is:
- The working record: x=40, y=2, k=12
- The working table: 1: z=42
- The outcome: successful, unit
Note:
The overall query didn’t return anything and so the result is successful but empty.
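Presumably, appending a final return statement would surface fields from both the working table and the working record into the outcome - a sketch I haven’t verified against the amending rules:

```gql
VALUE x = 40
VALUE y = 2
VALUE k = 12
CALL (x, y) {
  VALUE z = x + y
  RETURN z
}
RETURN z, k
```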
Let statements
If a value variable definition adds a field to the working record, then what does a let statement (§ 14.7) do? The short
answer is that a let statement changes the working table. Consider the example query with let statement below,
where we bind the variable friendsCount:
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
LET friendsCount = count(friend)
RETURN n, friendsCount
Query with let statement
The first thing to note is that every instance of variable = expression is a shorthand for
VALUE variable = expression. You can see this in the grammar below:
---
title: simple query statement
---
stateDiagram-v2
direction LR
state LetVariableDefinition {
direction LR
state fork <<choice>>
state join <<choice>>
RegularIdentifier1 : regular identifier
EQUALS1 : =
ValueExpression1 : value expression
fork --> RegularIdentifier1
RegularIdentifier1 --> EQUALS1
EQUALS1 --> ValueExpression1
ValueExpression1 --> join
RegularIdentifier2 : regular identifier
typed : #colon;#colon; | TYPED
EQUALS2 : =
ValueExpression2 : value expression
ValueType : value type
fork --> VALUE
VALUE --> RegularIdentifier2
state value_type_fork <<choice>>
state value_type_join <<choice>>
RegularIdentifier2 --> value_type_fork
RegularIdentifier2 --> value_type_join
value_type_fork --> typed
typed --> ValueType
ValueType --> value_type_join
value_type_fork --> ValueType
value_type_join --> EQUALS2
EQUALS2 --> ValueExpression2
ValueExpression2 --> join
}
LET --> LetVariableDefinition
LetVariableDefinition --> LetVariableDefinition
The semantics of a let statement first transforms these shorthands into value variable definitions:
LET friendsCount = count(friend) → LET VALUE friendsCount = count(friend)
The second transformation is from the let statement into a call procedure statement with an inline procedure call
where the variable scope clause (§ 15.2) is a list of all the binding variable references used in the right-hand side
of the let variable definitions:
LET VALUE friendsCount = count(friend) → CALL (friend) {
VALUE friendsCount = count(friend)
RETURN friendsCount
}
The body of the procedure is a value variable definition for each let variable definition. The procedure ends with
a return statement that enumerates the left-hand side of the let variable definitions.
The resulting query is shown in query executed below, where the let statement has been
replaced with a call procedure statement.
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
CALL (friend) {
VALUE friendsCount = count(friend)
RETURN friendsCount
}
RETURN n, friendsCount
query executed
When tracing through the execution contexts, you’ll see how friendsCount is added to the binding table
via the call procedure statement that replaced the let statement. At the end, the values in the context’s working
table are used for the result statement.
Note:
I see a problem here with friend that I can’t quite sort. The friend variable reference is to a node
from the match statement. So, that’s in the binding table. The binding variable references are
allowed to be the union of the field names of the working record and working table. Meanwhile, § 15.1 states
that the incoming working record’s declared type is the incoming working record amended by the
declared type of the incoming working table - which is some sort of record union. But, it isn’t
clear to me yet how the call procedure statement scopes references from the binding table nor
exactly how that amending works with the current working record used for variable scoping.
Can variables refer to each other?
The short answer is “yes” but only in declaration order. It is easy to see that this
should work:
VALUE x = 1
VALUE y = 2
VALUE z = x + y
And this should not:
VALUE z = x + y
VALUE x = 1
VALUE y = 2
For a let statement, there is a similar ordering requirement:
LET x = 1, y = 2, z = x + y
which becomes:
CALL {
VALUE x = 1
VALUE y = 2
VALUE z = x + y
RETURN x, y, z
}
Where the only difference between let and value bindings is whether the field names are in the working record or the working table.
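Presumably, reordering the let variable definitions fails the same way the reordered value variable definitions do, since the generated procedure body preserves declaration order:

```gql
LET z = x + y, x = 1, y = 2  // error: x and y are not yet defined
```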
Concluding thoughts
There was a lot of background material to get to the conclusion of “expected semantics” for variable
binding and use. The distinction between the “working record” and “working table” is essential
to understand but has some hidden complexity. The challenge there is understanding how
simple value bindings are handled versus variables that refer to patterns of results (e.g., a
set of matching nodes) where their enumeration could be large; a topic for another post.
This is my first foray into how various expressions affect the context. There is more complexity that I need
to sort for myself in terms of how values are added to binding tables when you have fields whose value
space is a set of nodes from a match pattern. That is,
I need to have a deeper understanding of exactly how working tables are amended via joins or other operators.
GQL (ISO/IEC 39075:2024) is a new database query language for property graphs that was recently
published (April 2024) by ISO/IEC JTC 1/SC 32/WG 3 – the same committee that
publishes and maintains the SQL query language standards. The scope of GQL is to provide a
language for storing and querying Property Graph structured data.
A property graph can be described as a directed multigraph where the nodes and edges may have:
- labels - like a set of “type names” associated with the target
- properties - a set of name/value pairs
A simple way to conceptualize a property graph is with an entity-relationship model but with the addition of attributes for relations. For example, a very simple graph of movies, actors, and genres might start with
an ER diagram as follows:
erDiagram
Movie ||--o{ Person : actor
Movie ||--o{ Genre : genre
Movie {
string title
}
Person {
string name
}
Genre {
string name
}
Note:
Relations may also have attributes - if we go back to the original “The Entity-Relationship Model – Toward a unified view of data”, Chen, 1976,
you will see: “The information about an entity or a relationship is obtained by observation or measurement, and is expressed by a set of attribute-value pairs.” As such, an ER diagram is a great way to describe a property graph, but diagramming tools (e.g., mermaid) don’t always help you show it that way. Also, while you can represent an n-ary relationship in an ER diagram, you can’t necessarily restrict the cardinality of a relationship in the property graph.
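Tying this back to graph types, the ER diagram above might correspond to a graph type sketch like the following (the role property is hypothetical, added to show an attribute on a relation):

```gql
CREATE GRAPH TYPE Movies AS {
  (:Movie { title :: STRING NOT NULL }),
  (:Person { name :: STRING NOT NULL }),
  (:Genre { name :: STRING NOT NULL }),
  (:Movie)-[:actor { role :: STRING }]->(:Person),
  (:Movie)-[:genre]->(:Genre)
}
```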
Querying property graphs
Property graphs are not new and there are many graph databases that provide query languages for property graphs (not exhaustive):
From the above list, some of these companies participated in the development of GQL. As such, it shouldn’t be a surprise that GQL resembles Cypher in many ways.
In fact, some Cypher queries are valid GQL queries.
So, what does a GQL query look like (apart from Cypher-like)?
Moving from Cypher to GQL
You can read the openCypher language specification which
contains many examples of queries and expressions you can use in Cypher. At the current time, you would be hard-pressed to find a similar
primer on GQL with examples. The standard is also long and focused on query syntax and semantics, and it does not contain examples.
I did not personally participate in the GQL standards effort, but I have been trying to track its progress, and so after ISO published the
standard, I immediately purchased a copy (yes, that is how ISO works and, yes, I’m that much of a nerd). It is a long and detailed specification and not a casual read. After
working through many of the grammars, I went about developing a full parser in Python that is in a very “alpha” state at the moment.
Note:
Stay tuned while I work on a primer for GQL and towards releasing that parser
in the near future. Those two go together as I need a good conformance test suite. Unless one exists somewhere … ISO?
I have been using my parser to validate my forays into writing hypothetical GQL queries. As such, I went through the openCypher
specification and tried to translate the examples. Let’s go through a few examples from the introduction.
Example 1
“Nodes with name “John” who have more than three “FRIEND” relations.”
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
WITH n, count(friend) AS friendsCount
WHERE friendsCount > 3
RETURN n, friendsCount
Cypher Example 1
GQL doesn’t have the WITH construct but does have LET and FILTER and so this becomes:
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
LET friendsCount = count(friend)
FILTER friendsCount > 3
RETURN n, friendsCount
GQL Example 1 (version 1)
But GQL has some optionality in the syntax and so this may also be:
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
LET friendsCount = count(friend)
FILTER WHERE friendsCount > 3
RETURN n, friendsCount
GQL Example 1 (version 2)
Note:
I’m still reading through the semantics and so it isn’t clear whether FILTER WHERE exp is different from FILTER exp.
Example 2
Another example for mutating the graph:
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
WITH n, count(friend) AS friendsCount
SET n.friendsCount = friendsCount
RETURN n.friendsCount
Cypher Example 2
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
SET n.friendsCount = count(friend)
RETURN n.friendsCount
GQL Example 2
Note:
I’m aggregating the friend count in the SET statement. I question whether a LET should be there to
perform the aggregation outside the context of the SET, even though I omitted it above. I will have to see as I dig deeper.
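The LET variant I’m second-guessing would look like:

```gql
MATCH (n {name: 'John'})-[:FRIEND]-(friend)
LET friendsCount = count(friend)
SET n.friendsCount = friendsCount
RETURN n.friendsCount
```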
Understanding linear statements
In the published BNF, you’ll see that the above GQL queries are broken down into different statements that are chained together. These
queries eventually end up being parsed as an ambient linear query statement, which is processed as a sequence of
simple query statement productions followed by a primitive result statement.
So, the prior GQL examples turn into:
match + let + filter + return statements
match + set + return statements
And all of these statements are chained together by an implementing system.
---
title: ambient linear query statement
---
flowchart LR
S[➡]
E[➡]
S --> simpleLinearQueryStatement
S --> primitiveResultStatement
simpleLinearQueryStatement --> primitiveResultStatement
primitiveResultStatement --> E
subgraph simpleLinearQueryStatement[simple linear query statement]
simpleQueryStatement[simple query statement]
simpleQueryStatement --> simpleQueryStatement
end
subgraph primitiveResultStatement[primitive result statement]
returnStatement["return statement"]
end
---
title: simple query statement
---
flowchart LR
S[➡]
E[➡]
callQueryStatement[call query statement]
S --> primitiveQueryStatement --> E
S --> callQueryStatement --> E
subgraph primitiveQueryStatement[primitive query statement]
direction LR
S1[➡]
E1[➡]
matchStatement[match statement]
letStatement[let statement]
forStatement[for statement]
filterStatement[filter statement]
orderByPage[order by and page statement]
S1 --> matchStatement --> E1
S1 --> letStatement --> E1
S1 --> forStatement --> E1
S1 --> filterStatement --> E1
S1 --> orderByPage --> E1
end
What’s next?
With a working parser I can validate my syntactic assumptions of what is and is not GQL. The next steps are to map my
expected semantics to the actual semantics of the language. I understand how Cypher works, so how does that
translate to GQL? That requires a deeper understanding of how results are built from statements - and there is a specification for that!
Back to reading, … but I am going to be back with more after I peel back the next 🧅 layer.