Wakatta!

Like Eureka!, only cooler

Seven Databases in Seven Weeks Neo4j Day 2

Today we play further with Neo4j, exploring the ReST API, indexes, and algorithms in various languages.

The ReST API is always available, although not the easiest thing to work with. Besides what the book covers, I also learned how to extend it, and how to bypass it for large loads.

Indexing can be manual, as the book shows, or automatic (although the documentation warns this is still an experimental feature).

Finally, the algorithms are mostly provided by an external library, JUNG, so its use require direct access to the data, bypassing the server.

Creating an index on relationship

As the index is of type exact, there is no need to create it first (although it is possible). Just inserting data in the index will do:

1
2
3
4
curl -X POST http://localhost:7474/db/data/index/relationship/published \
-H "Content-Type: application/json" \
-d '{ "uri" : "http://localhost:7474/db/data/relationship/0", \
"key" : "date", "value" : "1916-11-28" }'

About the ReST API

Clearly this is not how one would want to program. I copied the importer.rb code from the book (instead of just using a downloaded version), and ran it for hours before finding a bug in the data to create indexes… Running it again with this bug fixed made it much faster (as actors were reused instead of being duplicated).

There is a higher level API, Neo4j.rb, which runs on JRuby (so it does not use the ReST API). It should be noted that this is not really a driver, but a library to manage a Neo4j database directly in Ruby. Still, with it, it is possible to create the database that will be used by a server. There are other alternatives (the Gremlin console, for instance), but for Ruby it seems to be one of the most advanced, and is still being improved.

There is also a ReST API wrapper called neography, but as I’m trying to save time I’ll go with Neo4j.rb.

To use this API you first need to clone the Git repository:

1
git clone git://github.com/andreasronge/neo4j.git

In the neo4j directory, build then install the gem (making sure the default Ruby is JRuby):

1
2
3
4
5
6
7
$ gem build neo4j.gemspec
  Successfully built RubyGem
  Name: neo4j
  Version: 1.3.1
  File: neo4j-1.3.1-java.gem
$ gem install neo4j-1.3.1-java.gem
... (lot's of output elided)

As I said above, it is possible to use it to feed data into a database, but it should not be used while the server is running. I used it to create the movie network, as it was significantly faster than the book Ruby script.

To do so, I first rewrite the import script to use the neo4j gem. I am also using the Neo4j::Batch::Inserter for extra performance; the resulting code is less readable, but not significantly so. The script is mostly the same size as the original one, but much easier to understand (if you know the neo4j gem).

(importer_driver.rb) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
#!/usr/bin/env ruby

%w{rubygems neo4j}.each{|r| require r}

class Movie
  include Neo4j::NodeMixin
  property :name
  index :name
end

class Actor
  include Neo4j::NodeMixin
  property :name
  index :name
  has_n(:acted_in).to(Movie)
end

def get_or_create_node(inserter, name, clazz)
  n = inserter.index_get('name', name, :exact, clazz)
  n = n.first if n
  unless n
    n = inserter.create_node({'name' => name}, clazz)
    inserter.index_flush(clazz)
  end
  n
end

puts "begin processing..."

Neo4j::Config[:storage_path] = "#{ENV['NEO4J_HOME']}/data/graph.db"
count = 0

inserter = Neo4j::Batch::Inserter.new

File.open(ARGV[0]).each do |line|
  _, _, actor, movie = line.split("\t")
  next if actor.empty? || movie.empty?

  actor_node = get_or_create_node(inserter, actor, Actor)
  movie_node = get_or_create_node(inserter, movie, Movie)

  inserter.create_rel(Actor.acted_in, actor_node, movie_node)

  puts "  #{count} relationships loaded" if (count += 1) % 100 == 0
end

puts "done!"
inserter.shutdown

I first shut down the Neo4j server. I defined a NEO4J_HOME environment variable as the root of the Neo4j instance, and cleared the content of $NEO4J_HOME/data/graph.db with

1
rm -rf $NEO4J_HOME/data/graph.db/*

While not strictly necessary, this step helps ensure that the database is always in a known (i.e. empty) state each time.

Finally I ran the script with

1
jruby importer_driver.rb performance.tsv

The whole import took a little above 1 hour on my not really powerful macBook Air. The original script never finished, even after running a few hours.

I also found that index creation is the main cost: my first attempt at loading data did not use indexes at all: the whole file was loaded in less than 3 minutes (but of course the resulting graph was unusable).

The script is not complete; it should certainly handle exceptions and close the database properly. But for an initial load it does the job.

After it finished, I just restarted the server.

Note: I strongly suggest backing up the data/graph.db directory just after the initial load (and before starting the server). I had a crash while running the Kevin Bacon queries, and Neo4j unhelpfully lost the property data file, forcing me to import again…

A data corruption during a read only operation does not inspire confidence…

Indexes

Once thing I had not properly understood, and which caused me some problems as I was trying to learn how to use the driver, is that all indexes use Lucene. They are either exact or fulltext, and can be queried as shown here:

1
2
3
4
5
6
7
8
9
10
11
12
13
$ curl http://localhost:7474/db/data/index/node/
{
  "movies" : {
    "template" : "http://localhost:7474/db/data/index/node/movies/{key}/{value}",
    "provider" : "lucene",
    "type" : "exact"
  },
  "actors" : {
    "template" : "http://localhost:7474/db/data/index/node/actors/{key}/{value}",
    "provider" : "lucene",
    "type" : "exact"
  }
}

So the fact that the driver only supports Lucene indexes is not a limitation. There is nothing else (although presumably there could be).

Extending Neo4j

As I found on this post, it is fairly easy to extend the ReST API with arbitrary code. Deploying the code is a simple as copying the jar at the right location.

The official documentation is mostly an updated version of the post above.

I claimed yesterday that it was impossible to the ReST API directly to list just the names of all the nodes.

Of course, today I know I could pass a Gremlin expression through ReST, and get the same result as in the console. But that could be considered cheating.

The alternative is to use extend the ReST with a plugin, as I show here.

As always, the use of Maven is recommended. My pom.xml loads the server-api for Neo4j:

(pom.xml) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>jp.wakatta</groupId>
  <artifactId>listNames</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <neo4j.version>1.5</neo4j.version>
  </properties>
  <build>
      <plugins>
          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>2.3.2</version>
              <configuration>
                  <source>1.6</source>
                  <target>1.6</target>
              </configuration>
          </plugin>
      </plugins>
  </build>
  <dependencies>
      <dependency>
          <groupId>org.neo4j</groupId>
          <artifactId>server-api</artifactId>
          <version>${neo4j.version}</version>
      </dependency>
  </dependencies>
</project>

The Java code is simplified by the use of annotations. The code returns an iterator that extract the names of the underlying node iterator:

(ListNames.java) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
package jp.wakatta;

import java.util.Iterator;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.server.plugins.Description;
import org.neo4j.server.plugins.Name;
import org.neo4j.server.plugins.PluginTarget;
import org.neo4j.server.plugins.ServerPlugin;
import org.neo4j.server.plugins.Source;

@Description("An extension to list all node names")
public class ListNames extends ServerPlugin {
  @Name("list_all_names")
  @Description("List all the node names")
  @PluginTarget(GraphDatabaseService.class)
  public Iterable<String> getAllNames(@Source GraphDatabaseService graphDb) {
      final Iterator<Node> nodeIterator = graphDb.getAllNodes().iterator();
      return new Iterable<String>() {
          public Iterator<String> iterator() {
              return new Iterator<String>() {
                  @Override
                  public void remove() { // do nothing 
                  }
                  @Override
                  public String next() {
                      try {
                          return (String) nodeIterator.next().getProperty("name");
                      } catch (Exception ex) {
                          return "";
                      }
                  }
                  @Override
                  public boolean hasNext() {
                      return nodeIterator.hasNext();
                  }
              };
          }
      };
  }
}

Finally, it is important to have add a file META-INF/services/org.neo4j.server.plugins.ServerPlugin with the complete name of the new plugins (in this case, just one):

(org.neo4j.server.plugins.ServerPlugin) download
1
jp.wakatta.ListNames

The jar should be copied to the plugins directory of the Neo4j instance, and Neo4j restarted.

It is possible to test the correct deployment of the plugin using the ReST API:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
$ curl  http://localhost:7474/db/data/
{
  "relationship_index" : "http://localhost:7474/db/data/index/relationship",
  "node" : "http://localhost:7474/db/data/node",
  "relationship_types" : "http://localhost:7474/db/data/relationship/types",
  "neo4j_version" : "1.5",
  "batch" : "http://localhost:7474/db/data/batch",
  "extensions_info" : "http://localhost:7474/db/data/ext",
  "node_index" : "http://localhost:7474/db/data/index/node",
  "reference_node" : "http://localhost:7474/db/data/node/0",
  "extensions" : {
    "CypherPlugin" : {
      "execute_query" : "http://localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query"
    },
    "GremlinPlugin" : {
      "execute_script" : "http://localhost:7474/db/data/ext/GremlinPlugin/graphdb/execute_script"
    },
    "ListNames" : {
      "list_all_names" : "http://localhost:7474/db/data/ext/ListNames/graphdb/list_all_names"
    }
  }
}

The query returns the list of each extension, as well as the URL to call it. Using the GET method, the extension is self documenting:

1
2
3
4
5
6
7
$ curl http://localhost:7474/db/data/ext/ListNames/grphdb/list_all_names
{
  "extends" : "graphdb",
  "description" : "List all the node names",
  "name" : "list_all_names",
  "parameters" : [ ]
}

Finally, it can be invoked with the POST method:

1
2
$ curl -X POST http://localhost:7474/db/data/ext/ListNames/grphdb/list_all_names
[ "", "actor", "film", "Leif Andrée", "7X - This is Our Kids ", ...

Of course Kevin Bacon

This section is about the code of the book version beta 2.0.

I had trouble to get the code to work in Neo4j 1.5. Here I document the alternative code I came up with and used.

Defining steps in Gremlin

I could not get the book code to define the costars step to work: it seems outV does not accept a filter expression as argument.

Even with the addition of a dedicated filter step, I could not filter properly. Instead, I started from scratch, using the Gremlin wiki code as a basis:

1
2
3
4
5
Gremlin.defineStep('costars',
                   [Vertex, Pipe],
                   {_().sideEffect{start = it}.
                       outE('Movie#acted_in').inV.inE('Movie#acted_in').
                       outV.filter{!start.equals(it)}.uniqueObject()})

Note the use of sideEffect to introduce the variable start into the expression. Not doing this (and instead following the book code), the filter was not working at all (i.e. the start node was still part of the result). Also I have a different type for the relationship (Movie#acted_in) as it was generated by Neo4j.rb.

From Elvis to Kevin Bacon

The loop step does not emit intermediate node by default, so while the query in the book is accepted, it does not return any result because the actual degree of separation between Elvis and Kevin Bacon is just 3.

The latest version of Gremlin extends the basic loop pattern to emit intermediate nodes if requested, but this is not possible with the version embedded in Neo4j 1.5 admin console.

The standalone Gremlin shell version 1.3 is a bit too old (it links against Neo4j version 1.5.M01, whose database format is not compatible with version 1.5’s format). So I tried the current head of the Git repository.

To build it you will need to download half the Internet, so be patient.

The build command is mvn install (the install step will make the scripts to launch the console).

Once started, you can load the database with:

1
g = new Neo4jGraph('/users/x/neo4j-enterprise-1.5/data/graph.db')

The code costars step that was working in the console no longer does in the shell. I had to replace the uniqueObject() step with dedup(), and make sure everything is on a single line:

1
Gremlin.defineStep('costars', [Vertex, Pipe], {_().sideEffect{start = it}.outE('Movie#acted_in').inV.inE('Movie#acted_in').outV.filter{!start.equals(it)}.dedup()})

Finally, the command to find nodes by index has to explicitly use the index:

1
2
bacon = g.idx('Actor_exact')[['name':'Kevin Bacon']].next()
elvis = g.idx('Actor_exact')[['name':'Elvis Presley']].next()

(if you created the data using the original import command, the index name is actors).

As frustrating as it all is, the end result is that the loop step can now be used as needed:

1
elvis.costars.loop(1){ it.loops < 4}{true}.filter{it.equals(bacon)}.paths.next().name.grep{it}

I also had to change the query once more to use next instead of >> 1 as in the book, as that does not work in the latest version of Gremlin either.

Random walk

Once again, I had to change the code from the book to get it to work: adding a dedicated filter step did the trick:

1
bacon.outE.filter{ rand.nextDouble() <= 0.01 }.inV.inE.outV.loop(5){ it.loops < 3 }.count()

The loop argument does not change, as the filter expression already counted as a step in the book version.

Centrality

I had a small problem with the book code: the query is not run if the command is followed by ; '' (which the book uses to prevent the display of the results). Just running this:

1
2
3
role_count = [:]; count = 0
g.V.in.groupCount(role_count).loop(2){ count++ < 1000 }
role_count.sort{a,b -> a.value <=> b.value}

works. Why on earth would such a small change have such an impact is beyond me. Now I’m scared of Groovy.

JUNG Algorithms

This time the book code was working as intended, but I found that there is an even more central actor than Donald Sutherland: Bobby Vitale…

Exercises

Neo4j ReST API

The documentation is here.

Binding or ReST API

See above my useo Neo4j.rb.

API for the JUNG project

The API is here.

Path-finding as a step

I am using the Gremlin shell, rather than the Neo4j console.

1
Gremlin.defineStep('path_to', [Vertex, Pipe], {Vertex to, Integer max -> _().costars.loop(1){ it.loops < max }{true}.filter{it.equals(to)}})     

I used the possibility to pass arguments to the closure to introduce both the target node and the loop limit as parameters. Otherwise the code is identical to the one I was using above. With this, the path from Elvis to Kevin Bacon becomes

1
2
3
4
5
6
gremlin> elvis.path_to3(bacon, 4).paths.next().name.grep{it}
==>Elvis Presley    
==>Frankie and Johnny
==>Nathan Lane
==>He Said, She Said
==>Kevin Bacon

A family graph

I used Ruby (with the Neo4j.rb library). The code first defines a family data structure that maps each family member to other members keyed by their relationships.

The code first iterate over the family members and inserts them; it then goes over the family structure a second time to insert the relationships.

(family.rb) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#!/usr/bin/env ruby

%w{rubygems neo4j}.each{|r| require r}


class Person
  include Neo4j::NodeMixin
  property :name
  index :name
end

def get_or_create_node(inserter, name, clazz)
  n = inserter.index_get('name', name, :exact, clazz)
  n = n.first if n
  unless n
    n = inserter.create_node({'name' => name}, clazz)
    inserter.index_flush(clazz)
  end
  n
end

family = {
  'Alice' => {:sibbling_of => ['Bob', 'Carol'], :married_to => ['Walter'],:child_of => ['Trent', 'Peggy'] },
  'Bob' => {:sibbling_of => ['Alice', 'Carol'], :married_to => ['Eve'], :child_of => ['Trent', 'Peggy']},
  'Carol' => {:sibbling_of => ['Bob', 'Alice'], :child_od => ['Trent', 'Peggy'] },
  'Trent' => {:married_to => ['Peggy'], :parent_of => ['Alice', 'Bob', 'Carol']},
  'Peggy' => {:married_to => ['Trent'], :parent_of => ['Alice', 'Bob', 'Carol']},
  'Eve' => {:married_to => ['Bob']},
  'Walter' => {:married_to => ['Alice'], :sibling_of => ['Dave']},
  'Dave' => {:sibling_of => ['Walter']}
}

puts "begin processing..."

Neo4j::Config[:storage_path] = "#{ENV['NEO4J_HOME']}/data/graph.db"

inserter = Neo4j::Batch::Inserter.new

family.each do |k, _|
  inserter.create_node({'name' => k}, Person)
end

inserter.index_flush(Person)

family.each do |k, v|
  v.each do |r, os|
    os.each do |o|
      inserter.create_rel(r,
                          inserter.index_get('name', k, :exact, Person).next,
                          inserter.index_get('name', o, :exact, Person).next
                          )
    end
  end
end

puts "done!"
inserter.shutdown

Run a JUNG algorithm

I tried to run a simple Dijkstra shortest path algorithm in Gremlin, but eventually had to give up as the shell kept giving me weird exceptions when I tried to load the required class. Furthermore, the graph being directed from the actor nodes to the movie nodes, there is not path between anything but an actor and one of its movies (and the JUNG class to transform a directed graph to an undirected one seems to convert the whole graph eagerly).

Eventually I gave up, dumped the movie database, and used the family graph instead.

The code is in Java, the language I used after Groovy scared me with these weird exceptions (Java is boring but predictable).

The hardest perhaps was to figure out the dependencies for the pom.xml:

(pom.xml) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>jp.wakatta</groupId>
  <artifactId>graph-algo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <neo4j.version>1.5</neo4j.version>
  </properties>
  <build>
      <plugins>
          <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-compiler-plugin</artifactId>
              <version>2.3.2</version>
              <configuration>
                  <source>1.6</source>
                  <target>1.6</target>
              </configuration>
          </plugin>
      </plugins>
  </build>
  <repositories>
      <repository>
          <id>tinkerpop-repository</id>
          <name>TinkerPop Maven2 Repository</name>
          <url>http://tinkerpop.com/maven2</url>
      </repository>
  </repositories>
  <dependencies>
      <dependency>
          <groupId>org.neo4j</groupId>
          <artifactId>neo4j</artifactId>
          <version>${neo4j.version}</version>
      </dependency>
      <dependency>
          <groupId>net.sf.jung</groupId>
          <artifactId>jung-graph-impl</artifactId>
          <version>2.0.1</version>
      </dependency>
      <dependency>
          <groupId>com.tinkerpop.gremlin</groupId>
          <artifactId>gremlin-java</artifactId>
          <version>1.4</version>
      </dependency>
      <dependency>
          <groupId>com.tinkerpop.blueprints</groupId>
          <artifactId>blueprints-neo4j-graph</artifactId>
          <version>1.1</version>
      </dependency>
      <dependency>
          <groupId>com.tinkerpop.blueprints</groupId>
          <artifactId>blueprints-core</artifactId>
          <version>1.1</version>
      </dependency>
      <dependency>
          <groupId>com.tinkerpop.blueprints</groupId>
          <artifactId>blueprints-graph-jung</artifactId>
          <version>1.1</version>
      </dependency>
  </dependencies>
</project>

The code in Java is verbose; especially I could not find a simple way to look up nodes in the BluePrints graph, nor could I use the properties of the Neo4j nodes from the BluePrints vertices…

(GraphAlgorithm.java) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
package jp.wakatta;

import java.util.Map;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.kernel.EmbeddedGraphDatabase;

import com.tinkerpop.blueprints.pgm.Edge;
import com.tinkerpop.blueprints.pgm.Vertex;
import com.tinkerpop.blueprints.pgm.impls.neo4j.Neo4jGraph;
import com.tinkerpop.blueprints.pgm.oupls.jung.GraphJung;

import edu.uci.ics.jung.algorithms.shortestpath.DijkstraShortestPath;
import edu.uci.ics.jung.graph.Graph;


public class GraphAlgorithm {
  public static void main(String...args) {
      String neo4jHome = "/users/x/neo4j-test";
      
      if (args.length > 0)
          neo4jHome = args[0];
      
      GraphDatabaseService graphDb = new EmbeddedGraphDatabase(neo4jHome + "/data/graph.db/");
      
      com.tinkerpop.blueprints.pgm.Graph g = new Neo4jGraph(graphDb);
      
      Graph<Vertex, Edge> j = new GraphJung(g);
      
      DijkstraShortestPath<Vertex, Edge> dijkstra = new DijkstraShortestPath<Vertex, Edge>(j);
      
      Node trent = graphDb.index().forNodes("Person_exact").get("name", "Trent").next();
      Node dave = graphDb.index().forNodes("Person_exact").get("name", "Dave").next();
      
      System.out.println("Distance between Trent and Dave:" + dijkstra.getDistance(g.getVertex(trent.getId()), g.getVertex(dave.getId())));
      
      System.out.println("Distance between Trent and everybody:");
      for (Map.Entry<Vertex, Number> kv: dijkstra.getDistanceMap(g.getVertex(trent.getId())).entrySet()) {
          System.out.println(graphDb.getNodeById((Long) kv.getKey().getId()).getProperty("name") + " => " + kv.getValue());
      }
      g.shutdown();
      graphDb.shutdown();
  }    
}

Running it produces

1
2
3
4
5
6
7
8
9
10
Distance between Trent and Dave:3.0
Distance between Trent and everybody:
Trent => 0.0
Carol => 1.0
Peggy => 1.0
Alice => 1.0
Bob => 1.0
Walter => 2.0
Eve => 2.0
Dave => 3.0

Wrapping Up Day 2

I must say that today was a rather frustrating experience. Neo4j ecosystem is still evolving, but this means that most of the documentation I came upon was already obsolete. The navigation on the data was at time very hard to figure out, and the error messages (really, the underlying Java exception) not helpful.

Comments