Third, last and quite short day with Neo4j. Today on the menu: transactions, replication, and backups.
Transactions are a standard feature of relational databases, but NoSQL databases seem to consider them too costly (of the other databases in the book, only HBase and Redis also support transactions, as far as I can tell). Neo4j does support them, along with rollbacks.
Replication is Neo4j’s answer for high availability and, to some extent, scaling. The latter is limited, as Neo4j does not partition the data: everything has to fit on each computer in the cluster.
Finally, backups are exactly what you would expect them to be. Neo4j offers both full and incremental backups, which update a previous backup.
I cannot comment much on transactions, as I could not use them: the Gremlin shell in the Web Admin console could not find the required enumeration (even though I imported it), while the standalone Gremlin shell gave me strange errors when I tried to import the relevant classes.
I suppose pure Java would be more reliable, either as standalone code or as a plugin, but I did not explore that possibility.
High availability is achieved by deploying and linking together several instances of Neo4j. The setup is somewhat tedious, as there are additional processes to configure and run (the coordinators), and four different configuration files to edit. Really, this is the kind of thing you wish Apache Whirr would do for you.
But if you want to do it manually, you should follow the official HA setup tutorial rather than the book version (at least in the beta 2.0 version of the book): the book uses the property ha.zoo_keeper_servers in the configuration file, when the correct property is ha.coordinators. What is worse is that it will look like it works, until you try to write to a slave over the REST API, which fails with an exception. Writes to the master would also not be pulled by the slaves. Using the right property name fixes both problems.
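For reference, the fix is a one-line change in conf/neo4j.properties (the host list here is just an example):

```
# Book (beta 2.0) version -- looks like it works, but replication is broken:
# ha.zoo_keeper_servers = host1:2181,host2:2181,host3:2181

# Correct version:
ha.coordinators = host1:2181,host2:2181,host3:2181
```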
Once set up, the cluster has one master and several slaves. The master contains the authoritative version of the data. The book recommends always writing to slaves, as they have to push any update to the master before completing it, which guarantees that your data is replicated. However, what the book does not explain is how to figure out which server is a slave, or even whether the list of servers in the cluster can be discovered…
Actually, it is possible to get some idea of which server is the master by querying any server’s management interface (assuming one of the servers is listening on port 7471). A sample reply is shown (only partially, as it is very long) in the HA setup tutorial. But the actual address of each server is not shown, and I could not find any way to get the address property to be properly filled.
So the proper way to use such a cluster is probably to put HAProxy in front of it, as explained in the Neo4j HA documentation. It can be configured to differentiate between master and slaves, and to restrict connections to slaves (keeping the list updated with a health check). It can also split the requests by some specific parameter (for instance, the user id), and direct all requests for a given value of that parameter to the same server. While Neo4j does not shard the data itself, this mechanism can be used to shard the data cache (what must be loaded in memory).
Neo4j supports remote, full, and incremental backups. Incremental backups are properly understood as updates to the previous backup (either full or incremental), and are therefore much faster.
This is a good feature, and should be used often. But as I’m just playing, and the notion of a backup does not lend itself to exploration, I only looked at it briefly.
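For the record, the Enterprise edition ships with a bin/neo4j-backup script; reconstructing from the 1.x-era documentation (so treat the exact flags, host and target directory as assumptions), a full and a later incremental backup look something like this:

```shell
# Full backup from a running instance (host and path are examples)
bin/neo4j-backup -full -from single://10.202.90.131 -to /mnt/backup/neo4j

# Incremental backup: only applies the changes made since the last backup
bin/neo4j-backup -incremental -from single://10.202.90.131 -to /mnt/backup/neo4j
```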
Neo4j licensing guide
The guide is fortunately quite short.
This seems to be a description of the original HA feature in Neo4j, but as far as I can tell it does not exist anymore. In fact, there is an update to the official documentation removing the mention of read-only slaves.
There used to be a Java class for creating a server as a read-only slave, as documented here, but it no longer exists either.
Maximum number of nodes supported
Replication across three physical servers
As I already explained how to set up a cluster of EC2 virtual machines for Riak, I will skip the details.
I launched four instances: one will be the HAProxy server, the remaining three the Neo4j servers.
All the rules but the first one are internal (i.e. the source is the name of the security group, which should be specific to the cluster).
- 22: SSH (the only rule open to the outside)
- 2181: coordinator client port
- 2888: quorum election port
- 3888: leader election port
- 6001: inter cluster communication port
- 7474: web interface for the Neo4j servers
- 8080: admin interface for HAProxy
- 80: web interface for the proxy
Neo4j does not need ranges, unlike Riak.
I connected to each of the Neo4j servers and downloaded the Enterprise edition.
The first step is to configure the coordinators. On each instance, I edited the conf/coord.cfg file, replacing the single server.1 property with a block declaring all three servers (I got the IP addresses by running ifconfig on each instance). I also updated data/coordinator/myid on each instance with its own number (1 to 3).
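The block in question follows the usual ZooKeeper format; on my cluster it looked something like this (2888 and 3888 being the quorum and leader election ports opened in the security group above):

```
server.1=10.202.90.131:2888:3888
server.2=10.202.81.171:2888:3888
server.3=10.195.78.222:2888:3888
```

with data/coordinator/myid on each instance containing just the matching number (1, 2 or 3).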
I then modified each conf/neo4j.properties, setting each instance’s own ha.server_id, and pointing ha.coordinators at 10.202.90.131:2181,10.202.81.171:2181,10.195.78.222:2181. I also set ha.server to use the eth0 IP address rather than the default. Finally, I modified each conf/neo4j-server.properties: the web server needs to listen on the eth0 IP address rather than localhost, and the server needs to be set to HA mode.
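Put together, the relevant settings for the first instance looked something like this (the other two instances differ only in ha.server_id and the IP addresses; the exact property names may vary slightly between Neo4j versions, so take this as a sketch):

```
# conf/neo4j.properties (instance 1)
ha.server_id = 1
ha.coordinators = 10.202.90.131:2181,10.202.81.171:2181,10.195.78.222:2181
ha.server = 10.202.90.131:6001

# conf/neo4j-server.properties
org.neo4j.server.webserver.address = 10.202.90.131
org.neo4j.server.database.mode = HA
```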
Surprisingly enough, the three servers started and were configured properly…
I checked the setup by querying the management interface of each server: I looked for the string InstancesInCluster, and made sure there were three known servers.
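The check itself was a query over the management REST API against the JMX beans of each server; I am reconstructing the command from memory, so take the exact URL (and the 1.x-era layout of the management API) as an assumption:

```shell
# List the org.neo4j JMX beans of one server and look for the HA bean's
# InstancesInCluster attribute (URL layout assumed from the 1.x management API)
curl -s -H "Accept: application/json" \
  http://10.202.90.131:7474/db/manage/server/jmx/domain/org.neo4j \
  | grep -o 'InstancesInCluster'
```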
Finally, I pushed something into the second (slave) server, then tried to retrieve it from the third (slave) server. So far so good…
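Over the REST API, that boils down to two curl calls; a sketch (the node id in the second call is whatever URL the first call returns):

```shell
# Create a node with a property on the second (slave) server;
# the JSON reply contains the new node's URL
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"name":"P.G. Wodehouse"}' \
  http://10.202.81.171:7474/db/data/node

# Read it back from the third server (replace 42 with the id returned above)
curl -s http://10.195.78.222:7474/db/data/node/42
```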
Well, HAProxy seems a good choice, so I’ll go with that.
The compiled jar should be copied to the lib directory of each instance, and the conf/neo4j-server.properties configuration file updated to contain the line documented on the page above.
As a first test, I deployed HAProxy on my own machine; I had installed it with Homebrew. The configuration does not bind to port *:80, so I can run it without root privileges.
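The configuration file was along these lines; the health-check URL is whatever the status extension exposes to tell slaves apart, so treat that line (and the timeouts) as assumptions rather than gospel:

```
# haproxy.cfg -- local test; nothing binds below 1024, so no root needed
defaults
    mode http
    timeout connect 5s
    timeout client  50s
    timeout server  50s

frontend neo4j
    bind *:7000
    default_backend neo4j-slaves

backend neo4j-slaves
    # keep only the slaves in rotation, via an HTTP health check
    # (the URL depends on the status extension installed above)
    option httpchk GET /db/manage/server/ha/slave
    server neo4j-1 10.202.90.131:7474 check
    server neo4j-2 10.202.81.171:7474 check
    server neo4j-3 10.195.78.222:7474 check

listen stats
    bind *:8080
    stats enable
    stats uri /haproxy?stats
```

Started with haproxy -f haproxy.cfg, it serves the proxied Neo4j API on port 7000 and the stats page on port 8080.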
Once up, I opened a browser on the HAProxy stats page (it is not JSON; you really need a browser) to check that two instances of Neo4j were configured as slaves and available.
Finally, I checked a Gremlin script through the proxy (7000 is the HAProxy port, not any of the Neo4j ports). I had a few P.G. Wodehouse nodes that I had inserted when testing writes to slaves.
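The script goes through the Gremlin plugin’s execute_script endpoint, pointed at the proxy rather than at a Neo4j instance directly; something like this (the Gremlin one-liner is just an example query):

```shell
# Run a Gremlin script through HAProxy (port 7000 is the proxy, not Neo4j)
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"script": "g.V[0..9].name"}' \
  http://localhost:7000/db/data/ext/GremlinPlugin/graphdb/execute_script
```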
Ok, this is ready to be tested on the AWS cluster.
Deploying on the cloud
I used the small cluster deployed in the previous exercise. I just copied the Neo4j HAStatus extension jar to each machine (in the lib directory), and changed conf/neo4j-server.properties exactly as before.
I quickly checked that the extension was installed: each command is supposed to return nothing (if there’s a problem, it returns an error page instead).
Everything looks fine. Time to set up the HAProxy machine.
Once again, I followed the instructions from the Neo4j documentation: first I installed the “Development Tools” package group (this step is very fast because the packages are all stored in the Amazon cloud), then I retrieved the HAProxy source code.
To build it, I used the command make TARGET=26 (which means: build for a recent, 2.6-series Linux kernel). I did not copy the executable anywhere, as I will run it without root privileges anyway.
I then created a haproxy.cfg file, essentially the same as the one I used on my own machine.
I established SSH tunnels, including one to port 8080, and checked the status of the proxy on http://localhost:8080/haproxy?stats (I had made a mistake in one of the IP addresses, so I fixed it and restarted the proxy).
Finally, I was able to run the same Gremlin check as before, and all was good.
Wrapping up Neo4j
This is another database I had to fight all along the way. The book, the available documentation, and the actual behaviour of the database overlap only partially. Figuring out what is actually possible, and how to achieve it, was harder than for any other database in the book.
One thing that was especially irritating is the error handling of the Gremlin shell: a syntax error such as a missing closing quote renders the shell unusable. It keeps complaining about the syntax error, but offers no way to actually correct it, and I could find no way to reset the shell except by restarting the whole server…
This, and the fact that the embedded interpreter and the standalone shell are each unstable in their own way (not to mention slightly incompatible), makes Gremlin useless. But the alternatives, Cypher and Java, are not really usable either: Cypher is too limited, Java too verbose and its syntax ill-suited.
This said, Neo4j occupies a fairly specific niche which does not have many alternatives. Let’s hope the ecosystem matures into something more coherent and stable.
The other databases
It seems that Redis might be available soon, but CouchDB is not there yet. So I will probably switch to a different book for the time being.