Today we play further with Neo4j, exploring the ReST API, indexes, and algorithms in various languages.
The ReST API is always available, although not the easiest thing to work with. Besides what the book covers, I also learned how to extend it, and how to bypass it for large loads.
Indexing can be manual, as the book shows, or automatic (although the documentation warns this is still an experimental feature).
Finally, the algorithms are mostly provided by an external library, JUNG, so using them requires direct access to the data, bypassing the server.
Creating an index on relationship
As the index is of type exact, there is no need to create it first (although it is possible). Just inserting data in the index will do:
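As a rough illustration of adding a relationship to an index over ReST, here is a minimal Ruby sketch that only builds the request, without sending it. It assumes the 1.x-style URL layout (index name, key and value in the path, the relationship URI as a JSON string in the body) and a server on the default localhost:7474. The helper name index_relationship_request and the index roles are hypothetical.

```ruby
require 'net/http'
require 'uri'
require 'json'
require 'erb'

# Build (but do not send) a request that adds a relationship to an index.
# Assumption: 1.x-style layout, POST <base>/index/relationship/<index>/<key>/<value>
# with the relationship URI as a JSON string in the body.
def index_relationship_request(base, index, key, value, relationship_id)
  path = [index, key, value].map { |part| ERB::Util.url_encode(part) }.join('/')
  uri = URI("#{base}/index/relationship/#{path}")
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request.body = "#{base}/relationship/#{relationship_id}".to_json
  [uri, request]
end

# Hypothetical index name, key, value and relationship id.
uri, request = index_relationship_request(
  'http://localhost:7474/db/data', 'roles', 'name', 'Neo Anderson', 42
)
puts uri
puts request.body
```

Sending it would be a matter of passing the request to Net::HTTP.start(uri.host, uri.port), which is left out so the sketch runs without a server.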
About the ReST API
Clearly this is not how one would want to program. I copied the importer.rb code from the book (instead of just using a downloaded version), and ran it for hours before finding a bug in the data used to create indexes… Running it again with this bug fixed made it much faster (as actors were reused instead of being duplicated).
There is a higher level API, Neo4j.rb, which runs on JRuby (so it does not use the ReST API). It should be noted that this is not really a driver, but a library to manage a Neo4j database directly in Ruby. Still, with it, it is possible to create the database that will be used by a server. There are other alternatives (the Gremlin console, for instance), but for Ruby it seems to be one of the most advanced, and is still being improved.
There is also a ReST API wrapper called neography, but as I’m trying to save time I’ll go with Neo4j.rb.
To use this API you first need to clone the Git repository:
In the neo4j directory, build then install the gem (making sure the default Ruby is JRuby):
As I said above, it is possible to use it to feed data into a database, but it should not be used while the server is running. I used it to create the movie network, as it was significantly faster than the book's Ruby script.
To do so, I first rewrote the import script to use the neo4j gem. I am also using the Neo4j::Batch::Inserter for extra performance; the resulting code is less readable than plain neo4j gem code, but not significantly so. The script is about the same size as the original one, but much easier to understand (if you know the neo4j gem).
I first shut down the Neo4j server. I defined a NEO4J_HOME environment variable as the root of the Neo4j instance, and cleared the content of $NEO4J_HOME/data/graph.db with
While not strictly necessary, this step helps ensure that the database is always in a known (i.e. empty) state each time.
Finally I ran the script with
The whole import took a little over an hour on my not very powerful MacBook Air. The original script never finished, even after running for a few hours.
I also found that index creation is the main cost. My first attempt at loading data did not use indexes at all, and the whole file was loaded in less than 3 minutes (but of course the resulting graph was unusable).
The script is not complete; it should certainly handle exceptions and close the database properly. But for an initial load it does the job.
After it finished, I just restarted the server.
Note: I strongly suggest backing up the data/graph.db directory just after the initial load (and before starting the server). I had a crash while running the Kevin Bacon queries, and Neo4j unhelpfully lost the property data file, forcing me to import again…
Data corruption during a read-only operation does not inspire confidence…
Indexes
One thing I had not properly understood, and which caused me some problems as I was trying to learn how to use the driver, is that all indexes use Lucene. They are either exact or fulltext, and can be queried as shown here:
So the fact that the driver only supports Lucene indexes is not a limitation. There is nothing else (although presumably there could be).
Extending Neo4j
As I found in this post, it is fairly easy to extend the ReST API with arbitrary code. Deploying the code is as simple as copying the jar to the right location.
The official documentation is mostly an updated version of the post above.
I claimed yesterday that it was impossible to use the ReST API directly to list just the names of all the nodes.
Of course, today I know I could pass a Gremlin expression through ReST, and get the same result as in the console. But that could be considered cheating.
The alternative is to extend the ReST API with a plugin, as I show here.
As always, the use of Maven is recommended. My pom.xml loads the server-api for Neo4j:
The Java code is simplified by the use of annotations. The code returns an iterator that extracts the names from the underlying node iterator:
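The idea of an iterator that wraps another iterator and extracts names can be sketched in plain Ruby. This is only an analogue of the technique, not the Java plugin code: the nodes are hypothetical hashes rather than Neo4j nodes, and skipping nodes without a name property is a choice made for this sketch.

```ruby
# Hypothetical stand-ins for graph nodes: plain hashes, some without a name.
NODES = [
  { name: 'Kevin Bacon' },
  { born: 1950 },
  { name: 'Elvis Presley' }
].freeze

# Wrap the node collection lazily: a name is extracted only when the caller
# asks for the next element, and nodes without a :name key are skipped.
def each_name(nodes)
  nodes.lazy.select { |node| node.key?(:name) }.map { |node| node[:name] }
end

puts each_name(NODES).first(2)
```

Because the enumerator is lazy, nothing is read from the underlying collection until the caller pulls a value, which is the property that matters when the source is a whole graph.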
Finally, it is important to add a file META-INF/services/org.neo4j.server.plugins.ServerPlugin with the fully qualified name of each new plugin (in this case, just one):
The jar should be copied to the plugins directory of the Neo4j instance, and Neo4j restarted.
It is possible to test the correct deployment of the plugin using the ReST API:
The query returns the list of extensions, as well as the URL to call each of them. Using the GET method, the extension is self-documenting:
Finally, it can be invoked with the POST method:
Of course Kevin Bacon
This section is about the code from beta 2.0 of the book.
I had trouble getting the code to work in Neo4j 1.5. Here I document the alternative code I came up with and used.
Defining steps in Gremlin
I could not get the book's code defining the costars step to work: it seems outV does not accept a filter expression as argument. Even with the addition of a dedicated filter step, I could not filter properly. Instead, I started from scratch, using the Gremlin wiki code as a basis:
Note the use of sideEffect to introduce the variable start into the expression. Without this (that is, when following the book code), the filter did not work at all: the start node was still part of the result. Also, I have a different type for the relationship (Movie#acted_in) as it was generated by Neo4j.rb.
From Elvis to Kevin Bacon
The loop step does not emit intermediate nodes by default, so while the query in the book is accepted, it does not return any result because the actual degree of separation between Elvis and Kevin Bacon is just 3.
The latest version of Gremlin extends the basic loop pattern to emit intermediate nodes if requested, but this is not possible with the version embedded in the Neo4j 1.5 admin console.
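The difference between a loop that only returns what is left after the last iteration and one that also emits intermediate results can be illustrated outside Gremlin. The following plain Ruby sketch uses a made-up four-actor co-star chain (A, B, C, D) and hypothetical helper names. It is an analogue of the behaviour, not Gremlin's implementation.

```ruby
# Hypothetical co-star graph: each actor maps to the actors they shared a film with.
COSTARS = {
  'A' => ['B'],
  'B' => ['A', 'C'],
  'C' => ['B', 'D'],
  'D' => ['C']
}.freeze

# One "costars" step: every co-star of every actor in the frontier, minus the start.
def step(frontier, start)
  frontier.flat_map { |actor| COSTARS.fetch(actor, []) }.uniq - [start]
end

# Plain loop: only the frontier left after exactly `depth` steps is returned.
def loop_exact(start, depth)
  (1..depth).reduce([start]) { |frontier, _| step(frontier, start) }
end

# Emitting loop: the frontier after every step is kept, keyed by depth.
def loop_emitting(start, depth)
  frontier = [start]
  (1..depth).each_with_object({}) do |d, seen|
    frontier = step(frontier, start)
    seen[d] = frontier
  end
end

p loop_exact('A', 4)
p loop_emitting('A', 4)
```

In this toy chain D is three steps away from A, so looping four times without emitting misses it, while the emitting variant reports it at depth 3.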
The standalone Gremlin shell version 1.3 is a bit too old (it links against Neo4j version 1.5.M01, whose database format is not compatible with version 1.5’s format). So I tried the current head of the Git repository.
To build it you will need to download half the Internet, so be patient.
The build command is mvn install (the install step generates the scripts that launch the console).
Once started, you can load the database with:
The costars step that was working in the console no longer does in the shell. I had to replace the uniqueObject() step with dedup(), and make sure everything is on a single line:
Finally, the command to find nodes by index has to explicitly use the index:
(If you created the data using the original import command, the index name is actors.)
As frustrating as it all is, the end result is that the loop step can now be used as needed:
I also had to change the query once more to use next instead of >> 1 as in the book, as that does not work in the latest version of Gremlin either.
Random walk
Once again, I had to change the code from the book to get it to work; adding a dedicated filter step did the trick:
The loop argument does not change, as the filter expression already counted as a step in the book version.
Centrality
I had a small problem with the book code: the query is not run if the command is followed by ; '' (which the book uses to prevent the display of the results). Just running this:
works. Why on earth such a small change would have such an impact is beyond me. Now I’m scared of Groovy.
JUNG Algorithms
This time the book code was working as intended, but I found that there is an even more central actor than Donald Sutherland: Bobby Vitale…
Exercises
Neo4j ReST API
The documentation is here.
Binding or ReST API
See my use of Neo4j.rb above.
API for the JUNG project
The API is here.
Path-finding as a step
I am using the Gremlin shell, rather than the Neo4j console.
I used the possibility to pass arguments to the closure to introduce both the target node and the loop limit as parameters. Otherwise the code is identical to the one I was using above. With this, the path from Elvis to Kevin Bacon becomes
A family graph
I used Ruby (with the Neo4j.rb library). The code first defines a family data structure that maps each family member to other members keyed by their relationships.
The code first iterates over the family members and inserts them; it then goes over the family structure a second time to insert the relationships.
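The two-pass approach described above can be sketched without the Neo4j.rb library, using plain hashes as stand-ins for nodes and triples as stand-ins for relationships. The family members, relationship names and the build_graph helper are all hypothetical. Only the shape of the data structure and the two passes come from the description.

```ruby
# Hypothetical family: each member maps relationship types to related members.
FAMILY = {
  'Alice' => { married_to: ['Bob'], parent_of: ['Carol'] },
  'Bob'   => { married_to: ['Alice'], parent_of: ['Carol'] },
  'Carol' => { child_of: ['Alice', 'Bob'] }
}.freeze

def build_graph(family)
  # Pass 1: create every node first, so relationships can refer to either end.
  nodes = family.keys.each_with_object({}) { |name, acc| acc[name] = { name: name } }

  # Pass 2: one directed relationship per (member, type, relative) triple.
  relationships = family.flat_map do |name, links|
    links.flat_map do |type, relatives|
      relatives.map { |relative| [nodes.fetch(name), type, nodes.fetch(relative)] }
    end
  end

  [nodes, relationships]
end

nodes, relationships = build_graph(FAMILY)
puts "#{nodes.size} nodes, #{relationships.size} relationships"
```

Using fetch in the second pass makes a misspelled relative fail loudly instead of silently creating a dangling relationship.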
Run a JUNG algorithm
I tried to run a simple Dijkstra shortest path algorithm in Gremlin, but eventually had to give up as the shell kept giving me weird exceptions when I tried to load the required class. Furthermore, as the graph is directed from the actor nodes to the movie nodes, there is no path between anything but an actor and one of their movies (and the JUNG class that transforms a directed graph into an undirected one seems to convert the whole graph eagerly).
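The effect of edge direction on path finding can be shown on a tiny made-up graph. The sketch below is plain Ruby and uses a breadth-first search as an unweighted stand-in for Dijkstra; it does not use JUNG, and the edge list and helper names are hypothetical.

```ruby
# Hypothetical directed edges, actor -> movie, shaped like the imported graph.
EDGES = [
  %w[actor1 movie1], %w[actor2 movie1],
  %w[actor2 movie2], %w[actor3 movie2]
].freeze

# Build an adjacency list; with undirected: true every edge is also reversed.
def adjacency(edges, undirected:)
  edges.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(from, to), adj|
    adj[from] << to
    adj[to] << from if undirected
  end
end

# Breadth-first search; returns the list of nodes on a shortest path, or nil.
def shortest_path(adj, source, target)
  previous = { source => nil }
  queue = [source]
  until queue.empty?
    node = queue.shift
    break if node == target
    adj.fetch(node, []).each do |neighbour|
      next if previous.key?(neighbour)
      previous[neighbour] = node
      queue << neighbour
    end
  end
  return nil unless previous.key?(target)

  path = [target]
  path.unshift(previous[path.first]) while previous[path.first]
  path
end

p shortest_path(adjacency(EDGES, undirected: false), 'actor1', 'actor3')
p shortest_path(adjacency(EDGES, undirected: true), 'actor1', 'actor3')
```

With directed edges the search stops at the first movie, since movies have no outgoing edges; once edges are traversable both ways, the two actors are connected through their shared co-star.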
Eventually I gave up, dumped the movie database, and used the family graph instead.
The code is in Java, the language I used after Groovy scared me with these weird exceptions (Java is boring but predictable).
Perhaps the hardest part was figuring out the dependencies for the pom.xml:
The code in Java is verbose; in particular, I could not find a simple way to look up nodes in the BluePrints graph, nor could I use the properties of the Neo4j nodes from the BluePrints vertices…
Running it produces
Wrapping Up Day 2
I must say that today was a rather frustrating experience. The Neo4j ecosystem is still evolving, which means that most of the documentation I came upon was already obsolete. Navigating the data was at times very hard to figure out, and the error messages (really, the underlying Java exceptions) were not helpful.