Final day with MongoDB. First to cover geospatial indexing; then to explore MongoDB’s approach to the CAP theorem.
Like PostgreSQL, MongoDB has integrated support for geometric of geographic data and queries. Finding the neighbours of a point or location is trivial, and such queries are optimized using dedicated kind of indexes.
Regarding the CAP theorem, MongoDB strictly separates Availability from Partition tolerance: replica sets are designed for availability, using a quorum approach (like Riak) to select the most recent data in case of conflict.
Sharding is the dedicated mechanism for partitions. A replica set can own a shard of the data.
Unlike with Riak, where availability and partitioning are functions of the properties set on buckets, MongoDB requires the whole topology to be explicitly configured. I assume that what MongoDB loses in flexibility, it gets it back in predictability.
Exercises
Replica set configuration options
The documentation is here.
Spherical geo index
I don’t know if this is another instance of the book describing features from old version of MongoDB, but there is no such thing as a spherical geo index.
Spherical searches rely on standard 2d
indexing, as explained
here.
Find all cities within a 50 mile radius of London
To solve this exercise, it is necessary to format the data as required in the documentation. Unfortunately, the data files in the code for the second beta version of the book use latitude, longitude whereas MongoDB expects longitude, latitude (i.e. a X, Y coordinate).
I used the small script below to reformat the data file, and imported the reformated one:
1 2 |
|
With that loaded, and with the geospatial indexing in place, MongoDB is ready to run the queries.
First I need to locate London. There are a few places named London,
but I assume the authors meant the one in England. I create a centre
variable to be used in the queries.
1 2 |
|
As indicated in the documentation, I have to measure distances in radians. For this I need to know the Earth Radius in miles.
1 2 |
|
Finally I can run my queries. I have a few options:
geoNear
command
I can pass the spherical: true
option to the geoNear
command. By
default, the query will only return 100 results.
1 2 3 4 5 |
|
As it turns out, there are far more cities in this range. A circle of 500 miles radius around London includes much of Western Europe:
To get unlimited results, I set the number of possible results to the number of cities:
1 2 3 4 5 6 |
|
$within
operator
Alternatively, I can use the $within
operator. I get the spherical
behaviour by specifying a centerSphere
:
1 2 3 4 5 |
|
This query will return cities within the range, just like the
unlimited geoNear
one.
Sharded replicas
This is the kind of things that is not overly difficult, but tedious. And I don’t like tedious.
As a good UNIX geek, I’d rather spend hours to automate what would have taken me 10 minutes to do manually. So here’s the automated setup in Bash scripts.
First I create all the necessary directories:
1 2 |
|
Then I start two sets of 3 replicas that can also be sharded:
1 2 3 4 5 |
|
I setup each replica set:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
At this point it is good to wait a minute for the replica sets to be fully online.
Next step is to figure out the shards URL: they are composed of the shard name, followed by the list of comma separated servers:
1 2 3 4 5 6 7 8 |
|
Now it is time to start the config server. I move it to the port 27019
as 27016
is already in use:
1 2 3 |
|
And finally I add the shards to the config mongo, and enable sharding
on test
for both the cities
collection and GridFS:
1 2 3 4 5 6 7 8 9 10 11 |
|
I can check that everything looks ok with:
1 2 3 |
|
Of course, all these scripts would be useless to actually distribute the servers on different machines. Given some time, I’ll try to setup a AWS EC2 cluster as I did for Riak.
At this point, I tried to import the cities data file. It was somewhat slower than without replicas, but not significantly so.
I also added a file, using the same command as in the book.
Now, to test the replicas, I killed the two primary servers (to
identify them, I used ps auxw | grep mongod
, which gave me the
process id I needed to kill).
With two servers down, mongofiles -h localhost:27020 get my_file.txt
was still able to retrieve the file.
Wrapping up MongoDB
MongoDB is the first database besides PostgreSQL I feel comfortable using. They both provide more “database-like” features than either Riak or HBase: integrated queries, advanced indexing, … The use of JavaScript is well integrated and pleasant to use.
Moreover, MongoDB’s approach to the CAP theorem is simple. While it is less flexible or dynamic than Riak’s, its simplicity makes it easy to reason about.