Today we play further with Neo4j, exploring the ReST API, indexes, and algorithms in various languages.
The ReST API is always available, although not the easiest thing to work with. Besides what the book covers, I also learned how to extend it, and how to bypass it for large loads.
Indexing can be manual, as the book shows, or automatic (although the documentation warns this is still an experimental feature).
Finally, the algorithms are mostly provided by an external library, JUNG, so using them requires direct access to the data, bypassing the server.
Creating an index on relationship
As the index is of type
exact, there is no need to create it first
(although it is
inserting data in the index will do:
About the ReST API
Clearly this is not how one would want to program. I copied the
code from the book (instead of just using a downloaded version), and
ran it for hours before finding a bug in the data to create
indexes… Running it again with this bug fixed made it much faster
(as actors were reused instead of being duplicated).
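The difference between reusing and duplicating actors comes down to looking a name up in the index before creating a node. Here is a minimal sketch of that lookup-then-create pattern in plain Java. It is not the book's script: the `ActorLoader` class is hypothetical, and a `HashMap` stands in for the Neo4j index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a graph store: nodes are names in a list,
// and the "index" is a plain HashMap from name to node id (an assumption,
// not the Neo4j index API).
class ActorLoader {
    private final List<String> nodes = new ArrayList<>();
    private final Map<String, Integer> index = new HashMap<>();

    // Returns the id of the actor node, creating the node only when
    // the index does not already know the name.
    int getOrCreate(String name) {
        Integer id = index.get(name);
        if (id == null) {
            id = nodes.size();
            nodes.add(name);
            index.put(name, id);
        }
        return id;
    }

    int nodeCount() {
        return nodes.size();
    }

    public static void main(String[] args) {
        ActorLoader loader = new ActorLoader();
        String[] credits = {"Kevin Bacon", "Elvis Presley", "Kevin Bacon"};
        for (String actor : credits) {
            loader.getOrCreate(actor);
        }
        // Two distinct actors, so two nodes even though there are three credits.
        System.out.println(loader.nodeCount());
    }
}
```

If the index lookup never matches (for instance because the indexed key is wrong), every credit creates a new node, which is the kind of duplication described above.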
There is a higher level API, Neo4j.rb, which runs on JRuby (so it does not use the ReST API). It should be noted that this is not really a driver, but a library to manage a Neo4j database directly in Ruby. Still, with it, it is possible to create the database that will be used by a server. There are other alternatives (the Gremlin console, for instance), but for Ruby it seems to be one of the most advanced, and is still being improved.
There is also a ReST API wrapper called neography, but as I’m trying to save time I’ll go with Neo4j.rb.
To use this API you first need to clone the Git repository:
neo4j directory, build then install the gem (making sure
the default Ruby is JRuby):
As I said above, it is possible to use it to feed data into a database, but it should not be used while the server is running. I used it to create the movie network, as it was significantly faster than the book's Ruby script.
To do so, I first rewrote the import script to use the
neo4j gem. I
am also using the
for extra performance; the resulting code is less readable, but not
significantly so. The script is mostly the same size as the original
one, but much easier to understand (if you know the
I first shut down the Neo4j server. I defined a
NEO4J_HOME environment variable as the root of the Neo4j instance,
and cleared the content of
While not strictly necessary, this step helps ensure that the database is always in a known (i.e. empty) state each time.
Finally I ran the script with
The whole import took a little over an hour on my not very powerful MacBook Air. The original script never finished, even after running for a few hours.
I also found that index creation is the main cost. My first attempt at loading data did not use indexes at all, and the whole file was loaded in less than 3 minutes (but of course the resulting graph was unusable).
The script is not complete; it should certainly handle exceptions and close the database properly. But for an initial load it does the job.
After it finished, I just restarted the server.
Note: I strongly suggest backing up the
data/graph.db directory just
after the initial load (and before starting the server). I had a crash
while running the Kevin Bacon queries, and Neo4j unhelpfully lost the
property data file, forcing me to import again…
A data corruption during a read only operation does not inspire confidence…
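The backup suggested in the note above is just a recursive directory copy made while the server is stopped. A minimal sketch in plain Java follows; the `GraphDbBackup` class and the way the two paths are passed on the command line are my own assumptions, not part of Neo4j.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

class GraphDbBackup {
    // Recursively copies every directory and file under source into target.
    // Files.walk visits a directory before its content, so each target
    // directory exists by the time its files are copied.
    static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java GraphDbBackup <graph.db directory> <backup directory>");
            return;
        }
        copyTree(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

It would be run with the `data/graph.db` directory as the first argument, after the import and before the server is started.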
One thing I had not properly understood, and which caused me some
problems as I was trying to learn how to use the driver, is that
all indexes use Lucene. They are either
fulltext, and can
be queried as shown here:
So the fact that the driver only supports Lucene indexes is not a limitation. There is nothing else (although presumably there could be).
As I found in this post, it is fairly easy to extend the ReST API with arbitrary code. Deploying the code is as simple as copying the jar to the right location.
The official documentation is mostly an updated version of the post above.
I claimed yesterday that it was impossible to use the ReST API directly to list just the names of all the nodes.
Of course, today I know I could pass a Gremlin expression through ReST, and get the same result as in the console. But that could be considered cheating.
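For the record, here is a sketch of what passing a Gremlin expression through ReST could look like from Java 11. It rests on several assumptions: that the server runs on `localhost:7474`, that the Gremlin plugin is exposed at `/db/data/ext/GremlinPlugin/graphdb/execute_script` (as I understand it is in the Neo4j 1.x series), and that `g.V.name` is the expression used in the console. The `GremlinOverRest` class is hypothetical. The sketch only builds the request; it does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class GremlinOverRest {
    // Assumed location of the Gremlin plugin endpoint; check the service
    // root of your own server before relying on it.
    static final String ENDPOINT = "/db/data/ext/GremlinPlugin/graphdb/execute_script";

    // Wraps the script in a one-field JSON object, escaping backslashes and quotes.
    static String jsonBody(String script) {
        String escaped = script.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"script\":\"" + escaped + "\"}";
    }

    // Builds (but does not send) the POST request carrying the script.
    static HttpRequest buildRequest(String serverRoot, String script) {
        return HttpRequest.newBuilder(URI.create(serverRoot + ENDPOINT))
                .header("Content-Type", "application/json")
                .header("Accept", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody(script)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("http://localhost:7474", "g.V.name");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(jsonBody("g.V.name"));
    }
}
```

Sending it would be a matter of handing the request to `java.net.http.HttpClient`, with the server running.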
The alternative is to extend the ReST API with a plugin, as I show here.
As always, the use of Maven is recommended. My
pom.xml loads the
server-api for Neo4j:
The Java code is simplified by the use of annotations. The code returns an iterator that extracts the names from the underlying node iterator:
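The core of such a plugin is an iterator that wraps another one and maps each element on the fly. The following is a sketch of that wrapping idea only, in plain Java without the Neo4j classes or annotations: a `Map<String, Object>` stands in for a node's properties, and the `MappingIterator` and `NodeNames` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Wraps a source iterator and applies a function to each element lazily.
class MappingIterator<S, T> implements Iterator<T> {
    private final Iterator<S> source;
    private final Function<S, T> mapper;

    MappingIterator(Iterator<S> source, Function<S, T> mapper) {
        this.source = source;
        this.mapper = mapper;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public T next() {
        return mapper.apply(source.next());
    }
}

class NodeNames {
    // Each "node" is a property map here (an assumption standing in for
    // a Neo4j node); a node without a "name" property yields null.
    static Iterable<String> names(Iterable<Map<String, Object>> nodes) {
        return () -> new MappingIterator<Map<String, Object>, String>(
                nodes.iterator(), node -> (String) node.get("name"));
    }

    public static void main(String[] args) {
        List<Map<String, Object>> nodes = new ArrayList<>();
        nodes.add(Collections.<String, Object>singletonMap("name", "Alice"));
        nodes.add(Collections.<String, Object>singletonMap("name", "Bob"));
        for (String name : names(nodes)) {
            System.out.println(name);
        }
    }
}
```

Because nothing is collected up front, the names are produced as the caller iterates, which matters when the underlying iterator covers every node in the database.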
Finally, it is important to add a file
META-INF/services/org.neo4j.server.plugins.ServerPlugin with the
complete name of the new plugins (in this case, just one):
The jar should be copied to the
plugins directory of the Neo4j
instance, and Neo4j restarted.
It is possible to test the correct deployment of the plugin using the ReST API:
The query returns the list of extensions, as well as the URL to
call each of them. Using the
GET method, the extension is self documenting:
Finally, it can be invoked with the
Of course Kevin Bacon
This section is about the code of the book version beta 2.0.
I had trouble getting the code to work in Neo4j 1.5. Here I document the alternative code I came up with and used.
Defining steps in Gremlin
I could not get the book code to define the
costars step to work: it
outV does not accept a filter expression as argument.
Even with the addition of a dedicated
filter step, I could not
filter properly. Instead, I started from scratch, using the
code as a basis:
Note the use of
sideEffect to introduce the variable
the expression. Without this (that is, when following the book code),
the filter did not work at all (i.e. the start node was
still part of the result). Also I have a different type for the
Movie#acted_in) as it was generated by Neo4j.rb.
From Elvis to Kevin Bacon
The loop step does not emit intermediate nodes by default, so while
the query in the book is accepted, it does not return any result
because the actual degree of separation between Elvis and Kevin Bacon
is just 3.
The latest version of Gremlin extends the basic
loop pattern to emit
intermediate nodes if requested, but this is not possible with the
version embedded in Neo4j 1.5 admin console.
The standalone Gremlin shell version 1.3 is a bit too old (it links against Neo4j version 1.5.M01, whose database format is not compatible with version 1.5’s format). So I tried the current head of the Git repository.
To build it you will need to download half the Internet, so be patient.
The build command is
mvn install (the install step generates the
scripts to launch the console).
Once started, you can load the database with:
The costars step that was working in the console no longer works
in the shell. I had to replace the
uniqueObject() step with
dedup(), and make sure
everything is on a single line:
Finally, the command to find nodes by index has to explicitly use the index:
(if you created the data using the original import command, the index
As frustrating as it all is, the end result is that the
can now be used as needed:
I also had to change the query once more to use
next instead of
>> 1 as
in the book, as that does not work in the latest version of Gremlin
Once again, I had to change the code from the book to get it to work:
adding a dedicated
filter step did the trick:
The loop argument does not change, as the filter expression already
counted as a step in the book version.
I had a small problem with the book code: the query is not run if the
command is followed by
; '' (which the book uses to prevent the
display of the results). Just running this:
works. Why on earth such a small change would have such an impact is beyond me. Now I’m scared of Groovy.
This time the book code was working as intended, but I found that there is an even more central actor than Donald Sutherland: Bobby Vitale…
Neo4j ReST API
The documentation is here.
Binding or ReST API
See my use of Neo4j.rb above.
API for the JUNG project
The API is here.
Path-finding as a step
I am using the Gremlin shell, rather than the Neo4j console.
I used the possibility to pass arguments to the closure to introduce both the target node and the loop limit as parameters. Otherwise the code is identical to the one I was using above. With this, the path from Elvis to Kevin Bacon becomes
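To make the idea of a path search parameterised by a target and a hop limit concrete outside Gremlin, here is a sketch in plain Java. It swaps the Gremlin loop for a breadth-first search, which is a different technique. The `BoundedPath` class and the single-letter node names are hypothetical, and the adjacency map stands in for the costars step.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class BoundedPath {
    // Breadth-first search from start to target, giving up beyond maxHops.
    // Returns the nodes along a shortest path, or an empty list if the
    // target is not reachable within the limit.
    static List<String> find(Map<String, List<String>> neighbours,
                             String start, String target, int maxHops) {
        Map<String, String> parent = new HashMap<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                // Walk the parent links back to the start to rebuild the path.
                LinkedList<String> path = new LinkedList<>();
                for (String n = current; n != null; n = parent.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            int d = depth.get(current);
            if (d == maxHops) {
                continue; // hop limit reached, do not expand further
            }
            for (String next : neighbours.getOrDefault(current, Collections.<String>emptyList())) {
                if (!depth.containsKey(next)) {
                    depth.put(next, d + 1);
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        Map<String, List<String>> costars = new HashMap<>();
        costars.put("a", Arrays.asList("b"));
        costars.put("b", Arrays.asList("a", "c"));
        costars.put("c", Arrays.asList("b", "d"));
        costars.put("d", Arrays.asList("c"));
        System.out.println(find(costars, "a", "d", 3)); // prints [a, b, c, d]
        System.out.println(find(costars, "a", "d", 2)); // prints []
    }
}
```

Unlike a fixed-count loop, this search stops as soon as the target is found, so a limit larger than the actual degree of separation still returns a result.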
A family graph
I used Ruby (with the Neo4j.rb library). The code first defines a
family data structure that maps each family member to other members
keyed by their relationships.
The code first iterates over the family members and inserts them; it
then goes over the
family structure a second time to insert the
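The two-pass structure described above (members first, relationships second) can be sketched in plain Java as follows. This is not the Ruby code I ran: the `FamilyGraph` class, the member names, and the relationship types are all made up for illustration, and in-memory collections stand in for the database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FamilyGraph {
    final Set<String> members = new LinkedHashSet<>();
    // Each relationship is stored as {from, type, to}.
    final List<String[]> relationships = new ArrayList<>();

    // family maps each member to the other members, keyed by relationship type.
    static FamilyGraph build(Map<String, Map<String, List<String>>> family) {
        FamilyGraph graph = new FamilyGraph();
        // Pass 1: insert every member, so that all nodes exist before linking.
        graph.members.addAll(family.keySet());
        // Pass 2: go over the structure again and insert the relationships.
        for (Map.Entry<String, Map<String, List<String>>> member : family.entrySet()) {
            for (Map.Entry<String, List<String>> rel : member.getValue().entrySet()) {
                for (String other : rel.getValue()) {
                    if (!graph.members.contains(other)) {
                        throw new IllegalArgumentException("unknown member: " + other);
                    }
                    graph.relationships.add(new String[] {member.getKey(), rel.getKey(), other});
                }
            }
        }
        return graph;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> family = new LinkedHashMap<>();
        Map<String, List<String>> alice = new LinkedHashMap<>();
        alice.put("married_to", Arrays.asList("Bob"));
        alice.put("parent_of", Arrays.asList("Carol"));
        family.put("Alice", alice);
        family.put("Bob", Collections.singletonMap("married_to", Arrays.asList("Alice")));
        family.put("Carol", Collections.<String, List<String>>emptyMap());

        FamilyGraph graph = build(family);
        for (String[] r : graph.relationships) {
            System.out.println(r[0] + " -[" + r[1] + "]-> " + r[2]);
        }
    }
}
```

Splitting the work in two passes avoids having to create a node on demand in the middle of inserting a relationship.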
Run a JUNG algorithm
I tried to run a simple Dijkstra shortest path algorithm in Gremlin, but had to give up as the shell kept giving me weird exceptions when I tried to load the required class. Furthermore, since the graph is directed from the actor nodes to the movie nodes, there is no path between anything but an actor and one of their movies (and the JUNG class that transforms a directed graph into an undirected one seems to convert the whole graph eagerly).
Eventually I gave up, dumped the movie database, and used the family graph instead.
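The direction problem is easy to reproduce on a toy graph. The sketch below is a small Dijkstra implementation in plain Java (not JUNG, and not the code I ran) applied to two actors pointing at the same movie: with directed edges there is no path between the actors, and once reverse edges are added the distance is 2. The `ShortestPath` class and the node names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class ShortestPath {
    // Queue entry: a node together with its tentative distance.
    private static final class Item implements Comparable<Item> {
        final String node;
        final int distance;

        Item(String node, int distance) {
            this.node = node;
            this.distance = distance;
        }

        @Override
        public int compareTo(Item other) {
            return Integer.compare(distance, other.distance);
        }
    }

    // Dijkstra over a weighted adjacency map; returns -1 when dst is unreachable.
    static int distance(Map<String, Map<String, Integer>> graph, String src, String dst) {
        Map<String, Integer> best = new HashMap<>();
        PriorityQueue<Item> queue = new PriorityQueue<>();
        best.put(src, 0);
        queue.add(new Item(src, 0));
        while (!queue.isEmpty()) {
            Item item = queue.poll();
            if (item.distance > best.get(item.node)) {
                continue; // stale entry, a shorter route was already found
            }
            if (item.node.equals(dst)) {
                return item.distance;
            }
            Map<String, Integer> edges = graph.get(item.node);
            if (edges == null) {
                continue;
            }
            for (Map.Entry<String, Integer> edge : edges.entrySet()) {
                int candidate = item.distance + edge.getValue();
                Integer known = best.get(edge.getKey());
                if (known == null || candidate < known) {
                    best.put(edge.getKey(), candidate);
                    queue.add(new Item(edge.getKey(), candidate));
                }
            }
        }
        return -1;
    }

    // Eagerly builds a copy of the graph with every edge present in both directions.
    static Map<String, Map<String, Integer>> undirected(Map<String, Map<String, Integer>> graph) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> from : graph.entrySet()) {
            for (Map.Entry<String, Integer> edge : from.getValue().entrySet()) {
                result.computeIfAbsent(from.getKey(), k -> new HashMap<>())
                      .put(edge.getKey(), edge.getValue());
                result.computeIfAbsent(edge.getKey(), k -> new HashMap<>())
                      .put(from.getKey(), edge.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> directed = new HashMap<>();
        directed.computeIfAbsent("actor1", k -> new HashMap<>()).put("movie", 1);
        directed.computeIfAbsent("actor2", k -> new HashMap<>()).put("movie", 1);
        System.out.println(distance(directed, "actor1", "actor2"));             // prints -1
        System.out.println(distance(undirected(directed), "actor1", "actor2")); // prints 2
    }
}
```

Note that `undirected` copies the whole graph up front, which is cheap here but is exactly the kind of eager conversion that is impractical on the full movie database.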
The code is in Java, the language I used after Groovy scared me with these weird exceptions (Java is boring but predictable).
The hardest part was perhaps figuring out the dependencies for the
The code in Java is verbose; in particular, I could not find a simple way to look up nodes in the BluePrints graph, nor could I use the properties of the Neo4j nodes from the BluePrints vertices…
Running it produces
Wrapping Up Day 2
I must say that today was a rather frustrating experience. The Neo4j ecosystem is still evolving, which means that most of the documentation I came upon was already obsolete. Navigating the data was at times very hard to figure out, and the error messages (really, the underlying Java exceptions) were not helpful.