Reflections on the First openCypher Implementers Meeting

By Mats Rydberg & Petra Selmer  |  31 March 2017

Introduction

The conference was held in Walldorf, Germany, at SAP’s headquarters.

The conference room, featuring Alastair Green making the case for multiple graph querying.

What happened

The opening presentation was given by Alastair Green, the Product Manager at Neo4j responsible for the development of Cypher and the openCypher project. In his talk, Alastair explained that the goal of the openCypher initiative is to craft a standard language for querying graphs, using as a basis Cypher in its current form, and seeking to evolve it via an open process, with the active participation of all interested vendors and implementers.

Mats Rydberg, an engineer at Neo4j, presented an overview of the library of shared artifacts that have been produced and made publicly available under the auspices of openCypher. He then proceeded to a discussion of ideas around a shared grammar and a verifiable test kit.

Next, Marcus Paradies, developer at SAP, discussed how SAP has integrated Cypher into its HANA Graph stack, detailing how graphs are modelled in the relational system and mentioning some of the shortcomings that were encountered. In particular, Marcus highlighted the importance of compositionality within the language, and brought up two of the conference’s larger topics: pattern matching semantics, and multiple graphs. Later in the day, Stefan Plantikow, team lead of Neo4j’s CLG (Cypher Language Group, the internal Neo4j team responsible for language development of Cypher), and Oskar van Rest, Principal Member of Technical Staff at Oracle, both described problems relating to Cypher’s pattern matching semantics, and explored alternatives.

Following on from Marcus’ presentation, Dmitry Vrublevsky, software engineer at Neueda, presented and demonstrated the Cypher developer tool that they have been building as a plugin for the popular JetBrains family of IDEs. As of 8 February 2017, the plugin had over 11,000 downloads, and Dmitry showcased its syntax highlighting, refactoring and error reporting features.

Just before the first coffee break of the day, Gábor Szárnyas and József Marton, researchers at Budapest University of Technology and Economics, took the audience through their research project of incremental query execution on graphs. The example model of a railroad network resonated well with the audience, as did the extensive work on mapping Cypher onto relational algebra (with special extensions). The session concluded with a list of challenging components of the language, which include the handling of lists, aggregations and the default bag semantics that are in effect when not using DISTINCT.
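To illustrate the bag semantics mentioned above, here is a minimal Python sketch. The rows and the helper names (`distinct`, `multiplicities`) are hypothetical and are not taken from the talk. It shows that a projection keeps duplicate rows unless DISTINCT is requested, and that keeping a count per row is one plausible way (an assumption on our part) for an incremental engine to track a bag.

```python
from collections import Counter

# Hypothetical result of projecting a city name from three matched rows.
# Under bag semantics, duplicates are kept exactly as they were produced.
bag_result = ["Budapest", "Budapest", "Walldorf"]


def distinct(rows):
    """Mimic DISTINCT: drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def multiplicities(rows):
    """Represent a bag as row -> count, so one copy of a row can be removed
    later without forgetting that other copies still exist."""
    return Counter(rows)


print(bag_result)                  # three rows, duplicates included
print(distinct(bag_result))        # two rows
print(multiplicities(bag_result))  # counts per row
```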

Both Gábor and József, as well as Dmitry, had already been directly involved in the openCypher project, raising issues, providing pull requests and discussing various topics, mostly related to the grammar specification.


Andrés Taylor, engineer at Neo4j, father of Cypher, and former team lead of the CLG, started the next session by describing how Cypher has been (and is) implemented in Neo4j, the product for which, and out of which, the language has grown. After a brief overview of Cypher’s history, Andrés described in detail the cost-based query planner and the algorithm it is based on, and ended with a quick look at Neo4j’s new Cypher runtime, which runs on generated code.

Roi Lipman, software engineer at Redis Labs, then gave a presentation on how he had developed the Redis graph module based on a hexastore model, where node-relationship-node triplets are stored in six permutations to enable fast prefix-based searches.
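The hexastore idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Redis graph module’s actual implementation: the `Hexastore` class, the `order:x:y:z` key format, and the assumption that identifiers contain no colons are all hypothetical.

```python
from bisect import bisect_left
from itertools import permutations


class Hexastore:
    """Toy hexastore: each (subject, predicate, object) triple is stored
    under all six orderings so any bound prefix becomes a range scan."""

    # The six orderings: spo, sop, pso, pos, osp, ops.
    ORDERS = ["".join(p) for p in permutations("spo")]

    def __init__(self):
        self.keys = []  # sorted list of keys such as "spo:alice:knows:bob"

    def add(self, s, p, o):
        parts = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            key = ":".join([order] + [parts[c] for c in order])
            i = bisect_left(self.keys, key)
            if i == len(self.keys) or self.keys[i] != key:
                self.keys.insert(i, key)  # keep the list sorted, no duplicates

    def scan(self, prefix):
        """Return all keys starting with prefix. Keys sharing a prefix are
        contiguous in sorted order, so we seek once and read forward."""
        i = bisect_left(self.keys, prefix)
        found = []
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            found.append(self.keys[i])
            i += 1
        return found


store = Hexastore()
store.add("alice", "knows", "bob")
store.add("carol", "knows", "bob")
# "Who knows bob?" binds predicate and object, so the pos ordering is used.
print(store.scan("pos:knows:bob:"))
```

The trade-off is six times the storage in exchange for answering every combination of bound and unbound positions with a single prefix scan.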

A concept that has arisen in prior meetings with several interested parties is that of a shared standard of internal graph query representation, possibly compiled from several distinct source languages. Stefan Plantikow gave an insight as to how Neo4j and the CLG had been thinking about such a model, called QUIL (Query Intermediary Language).

Tomasz Zdybał, software engineer at Dgraph Labs, presented Dgraph, an in-memory native graph database, and the implementation in their product of a graph query language based on Facebook’s GraphQL. Tomasz highlighted Dgraph’s intentions of adding support for Cypher, and how schema validation was an important topic.

Just before lunch, Alastair Green took the floor again, standing in for Bitnine (the Korean company behind Agens Graph, who were unable to attend in person), to discuss how Cypher, SQL, and other query languages could be integrated with one another. Alastair presented slides authored by Kisung Kim from Bitnine, detailing the hybrid relational/graph architecture of Agens Graph, which is based on PostgreSQL. The most significant contribution was the way in which Bitnine had introduced integration points between SQL and Cypher, allowing each to be passed in as a subquery construct to the other, and allowing functions to be shared as expressions between the languages.


The lunch break paved the way for the big topic previously brought up by Marcus Paradies of SAP in the morning: multiple graphs. Cypher has always been a language that operates on a single implicit graph, producing a stream of records as output (the collection of which effectively form a table). Alastair Green discussed the motivations for changing this model to make Cypher a language closed over graphs, envisioning a future where Cypher would be capable of processing multiple graphs provided as input, and producing as output one or more graphs. Alastair explored salient subtopics such as identity, addressing, and ways of defining compositions of graphs from distinct sources.

Following on from Alastair’s talk on the vision of multiple graphs in Cypher, Stefan Plantikow led a longer session on the topic, presenting his latest thinking on how Cypher can be remodeled towards a graph-in-graph-out paradigm. Stefan discussed the motivations from a different angle to those covered by Alastair, focusing on how to make the concept of multiple graphs logically consistent and how to extend the execution model, whilst still keeping in mind Cypher’s considerable user base and the cost of imposing breaking changes in semantics. One of the major concepts in Stefan’s discussion was the re-interpretation of Cypher’s result records as ‘g-records’ (graph-records, or graphlets), meaning that each binding of a matched subgraph would itself be interpreted as a (typically very small) graph, together with the extended ability to collapse/union all such g-records into one large result graph. Both the g-record model and the companion model of the unionised graph would make Cypher closed over graphs: upon retrieving the result graph(s), it would be possible to immediately issue a new Cypher query that pattern matches on the newly computed results. Stefan also gave detailed syntax proposals for how to define and use graphs as values in the context of a query, including a take on the addressing topic raised by Alastair. It was made clear that this topic is foremost in the minds of several key Cypher innovators.
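The g-record idea described above can be sketched roughly as follows. This is our own illustration in Python, not the syntax or model proposed in the session: the `Graph` type, `match_knows`, and `collapse` are hypothetical names, and the graph model is reduced to node names and (source, type, target) triples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """A tiny immutable graph: node names plus (source, type, target) triples."""
    nodes: frozenset = frozenset()
    rels: frozenset = frozenset()

    def union(self, other):
        return Graph(self.nodes | other.nodes, self.rels | other.rels)


def match_knows(graph):
    """Match the pattern (a)-[:KNOWS]->(b); each binding is returned as its
    own small graph (a g-record) instead of a flat row."""
    return [
        Graph(frozenset({s, t}), frozenset({(s, ty, t)}))
        for (s, ty, t) in sorted(graph.rels)
        if ty == "KNOWS"
    ]


def collapse(g_records):
    """Union all g-records into one result graph."""
    result = Graph()
    for g in g_records:
        result = result.union(g)
    return result


source = Graph(
    frozenset({"alice", "bob", "carol"}),
    frozenset({("alice", "KNOWS", "bob"),
               ("bob", "KNOWS", "carol"),
               ("carol", "LIKES", "alice")}),
)
result_graph = collapse(match_knows(source))
# The output is itself a graph, so it can be queried again.
print(len(match_knows(result_graph)))
```

Because the output has the same type as the input, queries compose, which is the sense in which such a model is closed over graphs.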

Before Stefan’s dive into the world of multiple graphs, Martin Junghanns, researcher at the University of Leipzig, presented his research project on implementing Cypher on Gradoop, a graph platform based on Apache Hadoop. Martin gave us an overview of how their system handled query planning, and of the model used to represent (intermediate) query results in Flink, the distributed framework in which the queries are executed. The project also featured interesting extensions to the Property Graph Model, upon which Cypher is based, including the concept of logical subgraphs and a set of graph operations.

Following on from Martin’s talk, Hannes Voigt, researcher at the Technical University of Dresden, walked us through his research project, in which Michael Hunger, community caretaker at Neo4j, had participated. The topic comprised virtual graphs and views, and featured several intriguing extensions to Cypher. These were expressed in terms of ‘crossing the concept chasm’, which Hannes explained as the different levels of abstraction at which users view their data. At the lowest level of abstraction is the actual raw unprocessed data, which is usually very high in volume. At higher levels, larger patterns start to appear, composed of groups of nodes and relationships from the lower levels. In the model, these larger patterns are constructed using virtual nodes and relationships, with the additional ability to define views, which bring several benefits such as performance optimizations and query modularization.


The afternoon session featured two larger sections in which four of the members of the CLG presented views and ideas on how to address the most prominently mentioned shortcomings of the language in its current form. Mats Rydberg presented a proposal for a new schema/constraint syntax, and also talked in more depth about the Technology Compatibility Kit, highlighting its usefulness in verifying that a multitude of language implementations are semantically consistent. Petra Selmer went through the latest thinking on several classes of subqueries, including syntax proposals. Petra also detailed the revised Cypher improvement process, which has been designed to chime in with the open, collaborative format intended for the openCypher project. As mentioned above, Stefan Plantikow and Oskar van Rest discussed the semantics of pattern matching in terms of isomorphism/homomorphism of subgraphs and entities. Tobias Lindaaker provided insights as to how Cypher could complete its support for Conjunctive Regular Path Queries (CRPQs), going through historical thinking as well as recent syntax and semantics proposals. Tobias also presented the topic of vendor-specific extensions, including some of the pitfalls to look out for, drawing on experiences with the SQL standard.
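To make the isomorphism/homomorphism distinction concrete, here is a small Python sketch of our own; it is not taken from either speaker’s slides. It counts matches of a directed two-hop pattern under three semantics. The toy graph, the function `two_hop_matches`, and the labels "homomorphism", "edge-isomorphism" and "node-isomorphism" are illustrative assumptions.

```python
from itertools import product

# Toy graph: relationship id -> (source node, target node).
# Relationship 3 is a self-loop on node "a".
RELS = {1: ("a", "b"), 2: ("b", "a"), 3: ("a", "a"), 4: ("b", "c")}


def two_hop_matches(rels, semantics):
    """Match (x)-[r1]->(y)-[r2]->(z) and return the (r1, r2) pairs.

    homomorphism:     nodes and relationships may repeat freely
    edge-isomorphism: r1 and r2 must be different relationships
    node-isomorphism: x, y and z must be different nodes
    """
    matches = []
    for (r1, (x, y)), (r2, (y2, z)) in product(sorted(rels.items()), repeat=2):
        if y != y2:
            continue  # the two hops must meet at the middle node
        if semantics == "edge-isomorphism" and r1 == r2:
            continue
        if semantics == "node-isomorphism" and len({x, y, z}) < 3:
            continue
        matches.append((r1, r2))
    return matches


for semantics in ("homomorphism", "edge-isomorphism", "node-isomorphism"):
    print(semantics, two_hop_matches(RELS, semantics))
```

On this graph the three semantics return six, five and one match respectively: only homomorphism lets the self-loop be traversed twice, and only node-isomorphism rejects paths that revisit a node.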

The last session also featured Paolo Guagliardo, researcher at the University of Edinburgh, who presented a recent research project on formalising semantics of SQL, and the advantages conferred on a language through the provision of a formal specification. Paolo also announced the recently begun project of producing a formal semantics specification of Cypher, which is to be carried out by his research team, including Nadime Francis and Professor Leonid Libkin. This project will be undertaken during the spring of 2017, and reports of progress will be given at upcoming oCIMs.


All in all, the meeting was a resounding success, and we look forward to many more in the future, starting with the 2nd oCIM already on May 10th, this time in London. Please do reach out to us at openCypher@neo4j.com if you are interested, and be sure to check out the event page for the 2nd oCIM for further details.

Presentation materials

The programme, complete with all slide material presented during the conference, may be found at the event page for the 1st oCIM.