Monday 27 February 2012

Fun with Gremlin (not related to the movie or the car)

So you have all of this graph-tacular data in your graph database (for this post, I'm using neo4j).  It looks slick with its vertices and edges.  People stop by your desk to say, "Is that the new connect-the-dots app you're working on?"

After staring at them for a couple moments and repressing the urge to sell them the "app" for five bucks, you start thinking about how you're going to access and use this wonderful data.

"If only there was a way to query the data..." you wonder.

While SQL is the querying language of choice for relational databases, there is no real "standard" as far as NoSQL databases go (that's a subject for another time).

In the world of graph-driven databases, there are options:
  • Gremlin; a Groovy-based querying language that can handle any type of query.  This language is perhaps geared more towards those with more of a math- or graph-based background as the syntax is nothing like SQL.
  • SPARQL; a popular query language for RDF graphs.  This language is likely more easily picked up by those with an SQL background as the syntax is more SQL-like than Gremlin.
  • Cypher; a neo4j-specific querying language.  This language currently only allows read-only queries of graphs (i.e. no inserts, updates or deletions).  Like SPARQL, Cypher also takes its cues from SQL.
Neo4j folks: At the time of this writing, neo4j comes pre-loaded with plugins for both Gremlin and Cypher.  (As I understand things, there is currently a ticket open in the neo4j community to develop a SPARQL plugin, but it has not yet been completed; there are likely complications stemming from the fact that neo4j is not an RDF database at its core.)

In this article, I'm going to cover some basics in using Gremlin.  By using some concrete examples, I hope to demonstrate a bit of the power behind using a graph-based database!

For the purposes of this post, I'll be making use of a small graph I created that contains some people, who they know, and what they've purchased.  I expect this graph to grow and change as time goes on, but, that's where it stands for now (this is the beauty of schema-less data).

So far, Gremlin is the only querying language I've used with graph databases, hence this article making use of Gremlin.  The good thing about this Gremlin is that it won't require a new muffler and you don't have to worry about feeding it after midnight.

Getting started with Gremlin and Neo4j is easy enough.  It's a plugin that comes with Neo4j, so all you need to do to begin is to open up your Neo4j web admin instance, click the "Console" menu option at the top, and select the "Gremlin" option from the top-right of the console that appears.

At this point, you're faced with a list of the currently-available variables, with a gremlin overlooking them.



We see that the variable g contains the current graph.  If we issue the query "g.V" (without the quotes, as always), we get a list of all the vertices (nodes) in the graph; however, this information is not incredibly useful as you're only given each node's ID.

Let's say some (but not all) of our nodes have been given the property "Name".  If we try using "g.V.Name", we again get one entry per node, but this time it shows the value of each node's "Name" property.  If "Name" isn't a property of a specific node, "null" appears in its place.  Also note that Gremlin is case-sensitive.
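To make the "null" and case-sensitivity behaviour a bit more concrete, here's a tiny plain-Python model of what the console appears to be doing.  This is not Gremlin, just a sketch: the IDs, the names and the `prop` helper are all made up for illustration.

```python
# Hypothetical vertices keyed by ID; only some of them carry a "Name" property.
vertices = {
    1: {"Name": "Alice"},
    2: {"ProductType": "Book"},
    6: {"Name": "Bob"},
}

def prop(vertex_props, key):
    # A missing property comes back as None, standing in for the console's "null".
    return vertex_props.get(key)

# Rough equivalent of g.V.Name: one entry per vertex, None where "Name" is absent.
names = [prop(props, "Name") for props in vertices.values()]
print(names)  # ['Alice', None, 'Bob']

# Property keys are case-sensitive, so "name" is not the same key as "Name".
lowercase = [prop(props, "name") for props in vertices.values()]
print(lowercase)  # [None, None, None]
```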


We can also similarly view a list of the edges (relationships) by issuing "g.E" to the console; this time, however, in addition to the edge IDs, we also see the type of relationship and the adjoining vertices (nodes), e.g. 1-KNOWS->6.  Note that you can see that these edges are directed!  In this case, we see that the edge goes out from node 1 and goes in to node 6.  Useful stuff.

We can also (similarly) view a property of the edges (if it exists) as we did for the vertices, e.g. "g.E.Quantity".

Identifying individual vertices and edges is simply a matter of knowing each one's ID number.  Obtaining a reference to a node or an edge (which, yes, can be assigned to a variable for easier reference) involves a call like "g.v(6)" for a node or "g.e(3)" for an edge (note the lowercase "v" and "e").

You can examine the value of an individual node's/relationship's specific property by querying it much like we did above, e.g. "g.v(6).Name".

Want to know all of the edges coming out of a node?  "g.v(6).outE" will do the trick.  Similarly, if you want to know all of the edges coming in to a node, we can use "g.v(6).inE".

We can also go one step further and use the "inV" and "outV" steps to identify the nodes on the ends of an edge.  "inV" will correspond to the node at the head of an edge (also known as the "incoming vertex"), whereas "outV" will correspond to the other side of an edge (an "outgoing vertex").

You can also use "bothV" and "bothE" to get both incoming and outgoing vertices and edges (respectively).
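If the in/out naming makes your head spin, a plain-Python model of directed edges may help.  Again, this is only a sketch and not Gremlin: the edge list and the helper names (`out_e`, `in_e`, `both_e`, `out_v`, `in_v`) are my own stand-ins for the steps of the same name.

```python
# Hypothetical directed edges: (edge_id, tail_vertex, label, head_vertex).
# The first one reads as 1-KNOWS->6.
edges = [
    (0, 1, "KNOWS", 6),
    (1, 6, "KNOWS", 3),
    (2, 1, "PURCHASED", 4),
]

def out_e(v):
    # Edges leaving v (v is the tail).
    return [e for e in edges if e[1] == v]

def in_e(v):
    # Edges arriving at v (v is the head).
    return [e for e in edges if e[3] == v]

def both_e(v):
    # Edges touching v in either direction.
    return out_e(v) + in_e(v)

def out_v(e):
    # The vertex the edge goes out from (its tail).
    return e[1]

def in_v(e):
    # The vertex the edge goes in to (its head).
    return e[3]

out_ids = [e[0] for e in out_e(6)]
in_ids = [e[0] for e in in_e(6)]
both_ids = [e[0] for e in both_e(6)]
ends = (out_v(edges[0]), in_v(edges[0]))

print(out_ids, in_ids, both_ids)  # [1] [0] [1, 0]
print(ends)                       # (1, 6)
```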

So if you want to travel from one node to another, you might do something like: g.v(1).outE.inV.  You can shorten this by using: g.v(1).out.  There exist similar constructs for "in" and "both". (Note that "in" is a short-cut for "inE.outV".)

Have more than one type of relationship connected to a node?  No problem!  You can follow a specific one via something like this: g.v(1).out('LIKES') (this will take you to the node on the other end of the 'LIKES' relationship going out from node 1).
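Here's a rough plain-Python picture of why "out" is just "outE.inV" in disguise, and what passing a label does.  It's a sketch rather than real Gremlin; the edges and the `out` / `in_` helper names are invented for the example.

```python
# Hypothetical directed edges: (tail_vertex, label, head_vertex).
edges = [
    (1, "KNOWS", 6),
    (1, "LIKES", 4),
    (3, "KNOWS", 1),
]

def out(v, label=None):
    # outE then inV: follow edges leaving v and land on their heads.
    return [head for tail, lbl, head in edges
            if tail == v and (label is None or lbl == label)]

def in_(v, label=None):
    # inE then outV: follow edges arriving at v back to their tails.
    return [tail for tail, lbl, head in edges
            if head == v and (label is None or lbl == label)]

print(out(1))           # [6, 4]
print(out(1, "LIKES"))  # [4]
print(in_(1))           # [3]
```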

(Gremlin's github page has a good basic tutorial about all this.)

I could go on at great length about the features of Gremlin, but I think this is a great starting point.  I'm going to include some concrete examples below.  If I use any constructs or syntax that don't make sense, I very much encourage you to visit the Gremlin wiki page to look up the answers; this is a good little exercise, especially for the "groupCount" and "cap" constructs.

The examples below assume a graph that has nodes describing products and people, and relationships showing purchases and who knows whom.

How many times has each product been purchased?
g.V.inE('PURCHASED').inV.ProductType.groupCount.cap
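For anyone who hasn't looked up "groupCount" yet, here's my understanding of that query modelled in plain Python: walk every vertex, keep its incoming PURCHASED edges, read the ProductType at the head of each edge, and tally.  The data is made up and this is a sketch of the idea, not what Gremlin runs internally.

```python
from collections import Counter

# Hypothetical graph: two people and two products.
vertices = {
    1: {"Name": "Alice"},
    6: {"Name": "Bob"},
    4: {"ProductType": "Book"},
    5: {"ProductType": "Lamp"},
}
# Directed edges: (tail_vertex, label, head_vertex).
edges = [
    (1, "PURCHASED", 4),
    (6, "PURCHASED", 4),
    (6, "PURCHASED", 5),
    (1, "KNOWS", 6),
]

counts = Counter()
for v in vertices:                                  # g.V
    for tail, label, head in edges:
        if head == v and label == "PURCHASED":      # inE('PURCHASED')
            # inV.ProductType, then the tally that groupCount keeps
            counts[vertices[head]["ProductType"]] += 1

print(dict(counts))  # {'Book': 2, 'Lamp': 1}
```

Note that the tally is keyed on the ProductType value, so two different product nodes sharing a ProductType would be counted together.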

Which products have been purchased more than once?
g.V.filter{it.inE('PURCHASED').count() > 1}.ProductType
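The filter step is easier to picture as a plain list comprehension.  A minimal Python sketch, using made-up data and a hypothetical `in_e_count` helper standing in for "it.inE('PURCHASED').count()":

```python
# Hypothetical graph: two people and two products.
vertices = {
    1: {"Name": "Alice"},
    6: {"Name": "Bob"},
    4: {"ProductType": "Book"},
    5: {"ProductType": "Lamp"},
}
# Directed edges: (tail_vertex, label, head_vertex).
edges = [
    (1, "PURCHASED", 4),
    (6, "PURCHASED", 4),
    (6, "PURCHASED", 5),
    (1, "KNOWS", 6),
]

def in_e_count(v, label):
    # Number of edges with the given label arriving at v.
    return sum(1 for tail, lbl, head in edges if head == v and lbl == label)

# Keep only vertices with more than one incoming PURCHASED edge,
# then read their ProductType.
repeat_products = [
    props.get("ProductType")
    for v, props in vertices.items()
    if in_e_count(v, "PURCHASED") > 1
]

print(repeat_products)  # ['Book']
```

People never have incoming PURCHASED edges in this model, so the filter drops them without any extra check.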

Who is known by more than one person in this graph (we define 'knowing' as sharing an edge/relationship--in or out--with another person)?
g.V.filter{it.bothE('KNOWS').count() > 1}.Name

Who knows the most people, and how many people do they know?
g.V.bothE('KNOWS').outV.Name.groupCount.cap

How well-known is each person?
g.V.both('KNOWS').Name.groupCount.cap
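As I read it, this last query counts how many KNOWS edges touch each person, in either direction.  Here's a plain-Python sketch of that idea with invented names and edges; treat it as my mental model, not a statement about Gremlin's internals.

```python
from collections import Counter

# Hypothetical people keyed by vertex ID.
name = {1: "Alice", 3: "Carol", 6: "Bob"}
# Hypothetical KNOWS edges: (tail_vertex, head_vertex).
knows = [(1, 6), (3, 6)]

tally = Counter()
for v in name:                       # g.V
    for tail, head in knows:         # both('KNOWS'): hop across the edge either way
        if tail == v:
            tally[name[head]] += 1   # landed on the head, count its Name
        elif head == v:
            tally[name[tail]] += 1   # landed on the tail, count its Name

print(dict(tally))           # {'Bob': 2, 'Alice': 1, 'Carol': 1}
print(tally.most_common(1))  # [('Bob', 2)]
```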

(By the way, if anyone notices anything wrong with anything above, please let me know as I'm always looking to evolve and develop my knowledge of, well, everything.)

So you begin to see the power of what we can extract out of a graph database!  Personally, I'm tempted to find out "who is your daddy and what does he do?"  Such a question would be relatively straight-forward to figure out!



Ok, I think that's enough for now.  Hopefully this is a decent (but brief) introduction to the world of Gremlin and querying graph databases.  I know writing this has forced me to examine in more depth exactly what these queries do.

I'll see you next time!
