GQL: Schemas and Types
In GQL, a graph is a set of zero or more nodes and edges where:
- A Node has:
  - a label set with zero or more unique label names,
  - a property set with zero or more name-value pairs with unique names.
- An Edge has:
  - a label set with zero or more unique label names,
  - a property set with zero or more name-value pairs with unique names,
  - two endpoint nodes in the same graph,
  - an indication of whether the edge is directed or undirected,
  - when the edge is directed, a designation of one endpoint node as the source and the other as the destination.
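To make the data model above concrete, here is a minimal sketch in Python rather than GQL. The class names (`Node`, `Edge`, `Graph`) and their fields are assumptions chosen for illustration; they are not part of the GQL standard or of any GQL implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, Optional, Set


@dataclass(eq=False)  # identity-based equality, so elements can be stored in sets
class Node:
    labels: FrozenSet[str] = frozenset()  # zero or more unique label names
    properties: Dict[str, Any] = field(default_factory=dict)  # unique names


@dataclass(eq=False)
class Edge:
    first: Node
    second: Node
    directed: bool = True
    labels: FrozenSet[str] = frozenset()
    properties: Dict[str, Any] = field(default_factory=dict)

    @property
    def source(self) -> Optional[Node]:
        # Only a directed edge distinguishes a source endpoint.
        return self.first if self.directed else None

    @property
    def destination(self) -> Optional[Node]:
        # Only a directed edge distinguishes a destination endpoint.
        return self.second if self.directed else None


@dataclass(eq=False)
class Graph:
    nodes: Set[Node] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)

    def add_node(self, node: Node) -> Node:
        self.nodes.add(node)
        return node

    def add_edge(self, edge: Edge) -> Edge:
        # Both endpoints must already be nodes of this same graph.
        if edge.first not in self.nodes or edge.second not in self.nodes:
            raise ValueError("both endpoints must be nodes of this graph")
        self.edges.add(edge)
        return edge


# Hypothetical usage: two Person nodes joined by a directed "children" edge.
graph = Graph()
ann = graph.add_node(Node(frozenset({"Person"}), {"name": "Ann"}))
bob = graph.add_node(Node(frozenset({"Person"}), {"name": "Bob"}))
parent_of = graph.add_edge(
    Edge(ann, bob, directed=True, labels=frozenset({"children"}))
)
```

Using a set for labels and a dictionary for properties gives the uniqueness of label names and property names for free.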
While data can be unstructured in a graph such that nodes and edges are created in an ad hoc manner, graphs are often carefully crafted and structured to represent the relationships and properties of the subject. As such, having a schema that matches the rules used to structure the graph is useful for validation, queries, and optimizations by a database engine.
What is a schema for a graph?
A graph schema should attempt to answer basic questions like:
- What kinds of nodes are allowed in the graph?
- What kinds of relationships (i.e., edges) are allowed between nodes?
- What are the named properties of nodes and edges?
- What are the allowed values of properties?
- What properties are optional?
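As a rough illustration of the property-related questions, the sketch below checks one node's properties against a declared set of names, value types, and optionality. It is plain Python, not GQL, and the `PERSON` declaration, the `property_errors` function, and the error messages are all hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical declaration: property name -> (expected Python type, required?)
PropertySpec = Dict[str, Tuple[type, bool]]

PERSON: PropertySpec = {
    "name": (str, True),
    "birthDate": (str, False),  # optional property
}


def property_errors(spec: PropertySpec, properties: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the properties conform."""
    errors: List[str] = []
    for name, (expected, required) in spec.items():
        if name not in properties:
            if required:
                errors.append(f"missing required property: {name}")
        elif not isinstance(properties[name], expected):
            errors.append(f"wrong type for property: {name}")
    for name in properties:
        if name not in spec:
            errors.append(f"undeclared property: {name}")
    return errors
```

For instance, `property_errors(PERSON, {"name": "Ann"})` returns an empty list because the only missing property is optional.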
Additionally, it would be good to answer more ontology-oriented questions about validity, such as:
- the cardinality of edges and nodes,
- inverse relations (e.g., a parent relation to a node implies a child relation in the reverse direction),
- cross-node or cross-edge property constraints.
Such additional criteria are often expressed in a constraint language that is separable from the schema language. That is, they can be viewed as an additional layer rather than the domain of the schema language.
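One way to picture such a separate layer is a check that runs over the edges of a graph independently of any type declarations. The sketch below is Python, not GQL; the tuple representation of an edge, the node identifiers, and the function name `max_out_degree_violations` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical edge representation: (source id, edge label, destination id)
EdgeTuple = Tuple[str, str, str]


def max_out_degree_violations(
    edges: Iterable[EdgeTuple], label: str, limit: int
) -> List[str]:
    """Return the ids of source nodes with more than `limit` edges of `label`."""
    counts = Counter(src for src, lbl, _dst in edges if lbl == label)
    return sorted(src for src, n in counts.items() if n > limit)


edges = [
    ("a", "spouse", "b"),
    ("a", "spouse", "c"),  # second spouse edge from "a"
    ("b", "spouse", "a"),
    ("a", "children", "d"),
]
violations = max_out_degree_violations(edges, "spouse", 1)
```

Here `violations` contains only `"a"`, since it is the sole node with more than one outgoing `spouse` edge.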
Schemas in GQL
In GQL, the graph type (see §4.13.2 “Graph types and graph element types”) is the main construct for defining a schema. A graph type describes the nodes and edges allowed in a graph. It is created with a create graph type statement (see §12.6):
Once created, the graph type can be used to type specific instances of graphs in your site (i.e., a portion of your graph database). In this example, the specific syntax uses a nested graph type specification (see §18.1) that contains a comma-separated list of element type specifications.
Each element type specification is either:
- a node type specification (see §18.2)
- an edge type specification (see §18.3)
For each of these, there are two ways to describe the type:
- a “pattern” that is similar in syntax to node or edge patterns in queries,
- a “phrase” that is more explicit and verbose.
The simplest and most compact form of a node type specification is as a pattern:
The label set defines the “keys” by which the type can be referenced and also used to query for matching nodes. A node type can also define additional non-keyed labels, and the union of these and the keyed labels is the total set of labels for the type. For example, we could model animal nodes as the following:
The node types for cat and dog are keyed with unique labels but also have the labels mammal and animal on the same nodes.
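The relationship between key labels and the total label set can be modeled in a few lines. This is a Python sketch rather than GQL syntax, and `NodeType`, `TYPES_BY_KEY`, and `with_label` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class NodeType:
    key_labels: FrozenSet[str]  # labels by which the type is referenced
    extra_labels: FrozenSet[str] = frozenset()  # additional non-keyed labels

    @property
    def labels(self) -> FrozenSet[str]:
        # Total label set: the union of keyed and non-keyed labels.
        return self.key_labels | self.extra_labels


CAT = NodeType(frozenset({"cat"}), frozenset({"mammal", "animal"}))
DOG = NodeType(frozenset({"dog"}), frozenset({"mammal", "animal"}))

# Key label sets must be unique, so they can index the node types.
TYPES_BY_KEY: Dict[FrozenSet[str], NodeType] = {
    t.key_labels: t for t in (CAT, DOG)
}


def with_label(
    label_sets: Iterable[FrozenSet[str]], label: str
) -> List[FrozenSet[str]]:
    """Select the label sets (standing in for nodes) that carry `label`."""
    return [labels for labels in label_sets if label in labels]
```

Querying by `mammal` or `animal` would select nodes of both types, while `cat` or `dog` selects only one.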
Similarly, we can specify edges with patterns:
The first edge pattern is an undirected edge that is keyed with sibling and has two endpoints whose type is keyed with Person.
The second edge pattern is a directed edge that is keyed with children and has a source and destination type that is keyed with Person.
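What these two edge types permit can be sketched as a conformance check. The code is Python, not GQL; `EdgeType`, `edge_conforms`, and the simplification that each endpoint type is identified by a single key label are assumptions for illustration.

```python
from typing import NamedTuple, Set


class EdgeType(NamedTuple):
    key_label: str
    directed: bool
    first_key: str   # key label of the source (or of one endpoint if undirected)
    second_key: str  # key label of the destination (or of the other endpoint)


SIBLING = EdgeType("sibling", False, "Person", "Person")
CHILDREN = EdgeType("children", True, "Person", "Person")


def edge_conforms(
    edge_type: EdgeType,
    label: str,
    directed: bool,
    first_labels: Set[str],
    second_labels: Set[str],
) -> bool:
    """Check an edge's label, direction, and endpoint labels against a type."""
    if label != edge_type.key_label or directed != edge_type.directed:
        return False
    forward = (
        edge_type.first_key in first_labels
        and edge_type.second_key in second_labels
    )
    if edge_type.directed:
        return forward
    # An undirected edge has no endpoint order, so either orientation is fine.
    backward = (
        edge_type.first_key in second_labels
        and edge_type.second_key in first_labels
    )
    return forward or backward
```

A `children` edge that is undirected, or a `sibling` edge with an endpoint that is not a `Person`, would be rejected by this check.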
A worked example
If we imagine we’re trying to structure a graph to represent data retrieved from resources using the Person type from schema.org, we can see how these declarations fit together as well as the “rough edges” of graph typing.
Here is a complete example where the properties and relations have been limited to make the schema smaller:
This same schema can be described via phrase declarations:
Note: AS Person is added to the declaration of the Person node type. This seems necessary as the endpoint pair phrase requires a local alias. Otherwise, the outcome is effectively the same. The question remains: what can you do with one form that you can’t do with the other?
Curious things
I’ve noticed a few things:
- You’d like to see type extension: If Person extends Thing, then it inherits the properties of Thing instead of re-declaring the property set. If you think about that a bit, there be dragons. Having type extension in a schema language really requires having constructs and semantics for both extension and restriction. There is a whole section (§4.13.2.7 “Structural consistency of element types”) that explains what could be interpreted as a subtype. Yet, that structural consistency doesn’t appear to apply to schemas.
- The way local aliases are specified and used seems a bit inconsistent. You only have a local alias for a type when you declare it, yet there are specific situations where you need a local alias (e.g., the endpoint pair phrase). It also appears useful when you have a key label set with more than one label, in that you can use the local alias to avoid repeating the multiple labels for each edge type.
- Edges to anything: It isn’t clear how you can describe an edge which has an endpoint (e.g., a destination) that is any node.
- Node type unions: It isn’t clear to me how you specify an edge whose destination is a union of node types. While you may be able to work around this with a common label, the set of destination node types may be disjoint.
- Lack of graph constraints: There are graph type level constraints such as limiting the cardinality of edges that would be very useful (e.g., node type X can only have one edge of type Y).
- Optional keywords - why?: CREATE PROPERTY GRAPH and CREATE GRAPH, NODE TYPE and NODE, and EDGE TYPE and EDGE are all substitutable syntax without any change in semantics. I feel like we should have dropped PROPERTY and TYPE from the grammar.
- Synonyms - why?: NODE vs VERTEX and EDGE vs RELATIONSHIP - as synonyms, they mean the same thing, and so why didn’t they just pick one?
Concluding remarks
As with many schema languages, there are things left yet to do. As a first version, a graph type provides some basic mechanisms for describing a graph as a set of “leaf types”. The modeling language you want to use to describe your graph may simply be out of scope for GQL. Meanwhile, you can simply use another tool (pick your favorite UML or ERM tool) to generate GQL graph types useful in your database.
Presently, the type system in GQL provides a minimum bar for validation. More importantly, these types give the database system a contract for the kinds of nodes, edges, and properties that are to be created, updated, and queried. This affords the database engine the ability to optimize how information is stored, indexed, and accessed.
That should probably be the primary takeaway here:
The graph schema is for your database and not for you. It doesn’t describe everything you need to understand about your graph’s structure.