Day 2 with Riak covers Mapreduce, distribution, rings, and consistency.
Mapreduce will be familiar to anyone who has been paying attention to Google. The basic idea comes from functional programming, but Google showed how it could be used to distribute the load over many machines. This in turn spawned two distinct recent developments in software: on the one hand, a range of databases explicitly built around the concept of Mapreduce; on the other, the idea that functional languages are better suited to concurrent and parallel programming.
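In its functional-programming form the idea fits in a few lines. A sketch in JavaScript, with a hypothetical list of rooms (the data shape is an assumption, chosen to match the hotel example used later):

```javascript
// a hypothetical list of rooms, as plain objects
var rooms = [{floor: 1, capacity: 2},
             {floor: 1, capacity: 4},
             {floor: 2, capacity: 3}];

// map: extract the interesting value from each item
var capacities = rooms.map(function (room) { return room.capacity; });

// reduce: combine the mapped values into a single result
var total = capacities.reduce(function (a, b) { return a + b; }, 0);

console.log(total); // 9
```

The distributed version runs the map on the nodes that hold the data and only ships the (much smaller) mapped values around, which is the whole point.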
A followup to yesterday's simple trick
There is an alternative, maybe even simpler than yesterday's use of python -mjson.tool. It is shown (maybe a bit late) in the book: pass an Accept: text/plain header in the HTTP request. Something along these lines (port and bucket are from the book's dev setup):

```shell
curl -H "Accept: text/plain" http://localhost:8091/riak/rooms/101
```
This asks Riak to return plain text instead of JSON; Riak helpfully interprets that as meaning the same data, but formatted for people. The advantage of this approach is that it still works even when curl generates extra output. Something like:

```shell
curl -v -H "Accept: text/plain" http://localhost:8091/riak/rooms/101
```

will both generate verbose debugging output and return the data in a human-readable format.
Nothing wrong with Mapreduce per se, but…
I understand that the examples have to remain simple enough to be implementable on a (possibly shared) single personal computer, but I still think the authors are being dishonest when they contrast Mapreduce with an "SQL" based solution.
The problem with this solution is that it is just what a self-taught coder with no understanding of SQL might produce (ok, I am being unfair: some college-educated programmers I have known write code like that as well). But this idea of pushing the code to the data instead of pulling the data to the code is exactly what a database like PostgreSQL does.
An actual SQL solution would be much simpler, would run fully in the database (no need for a script to create the initial data), and would deliver the data long before you’d be done coding and debugging the Javascript Mapreduce code.
A sketch of such a solution (the floor and room ranges are from the book's hotel example; the schema and the random capacities are assumptions):

```sql
-- seed the data: floors 1..100, rooms 0..99 on each floor,
-- room key = floor * 100 + room, random capacity between 1 and 8
CREATE TABLE rooms (
  id       integer PRIMARY KEY,
  floor    integer NOT NULL,
  capacity integer NOT NULL
);

INSERT INTO rooms (id, floor, capacity)
SELECT floor * 100 + room,
       floor,
       1 + (random() * 7)::integer
FROM generate_series(1, 100) AS floor,
     generate_series(0, 99)  AS room;

-- total capacity by floor: the whole "Mapreduce" in one query
SELECT floor, sum(capacity) AS total_capacity
FROM rooms
GROUP BY floor
ORDER BY floor;
```
It does not help that, actually testing the code on Riak, I had to wait about 10 seconds while PostgreSQL's answer was immediate. Of course, I have 3 instances of Riak on a single machine, which is certainly not a typical setup.
Riak sits at a specific point on the trade-off line to which data stores are constrained by the CAP theorem. It would have been better to work on an example (maybe using more complex data, loaded from a file, like the movie database of PostgreSQL Day 3) that plays to Riak's strengths, rather than showing how badly it performs compared to a much easier and more natural solution in SQL.
It might have been enough to ask the reader to imagine a dataset so large that it could not fit on a single server (for instance, a simulation of Hilbert's Hotel), or to simulate a rocky network by shutting down some nodes at awkward moments, and to show how Riak still delivers the goods. In such conditions most SQL databases would start to show their limits, and the comparison would have been legitimate.
Another comment on the example
If you run all the examples and the exercises, you will notice something interesting: there is a floor 101. That’s actually a bug in the initial seeding script: the room number should range over 0 to 99, not 1 to 100. Mapping composite keys to a single value is not always safe. Oh well…
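The ambiguity is easy to demonstrate (a sketch; the floor * 100 + room key scheme is inferred from the book's script):

```javascript
// composite key: floor * 100 + room
function roomKey(floor, room) { return floor * 100 + room; }

// with room numbers 1..100 the mapping is ambiguous:
// floor 1 room 100 and floor 2 room 0 would share key 200,
// and floor 100 room 100 decodes to the phantom floor 101
var phantom = Math.floor(roomKey(100, 100) / 100);
console.log(phantom); // 101

// with room numbers 0..99 every key decodes uniquely
console.log(Math.floor(roomKey(100, 99) / 100)); // 100
```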
I changed my script accordingly after I went through everything, rebuilt the data, and ran everything again.
A sketch of the corrected script, following the Riak Ruby client used in the book (the styles and the capacity range are assumptions):

```ruby
require 'riak' # riak-client gem

STYLES = %w{single double queen king suite}

client = Riak::Client.new(:http_port => 8091)
bucket = client.bucket('rooms')

(1..100).each do |floor|
  (0..99).each do |room|     # 0..99, not 1..100: keys stay unambiguous
    key = (floor * 100 + room).to_s
    object = Riak::RObject.new(bucket, key)
    object.content_type = 'application/json'
    object.data = {
      'style'    => STYLES[rand(STYLES.length)],
      'capacity' => rand(8) + 1
    }
    object.store
  end
end
```
Partial updates are not supported, unless they are
I said yesterday that partial updates are not supported; if they are, I really have not found a way to get them to work.
Except for buckets. You can update the properties of a bucket without having to specify every one of them, which is good. Still, it might have been nice for the rest of the data to have this feature.
The fact that the properties of buckets are probably fixed, meaning you cannot remove any, makes a partial update non-ambiguous. Supporting partial updates on generic data would require a way to specify whether an update is complete or partial, and a way to remove properties. I can see why the Riak designers chose not to go that way, but it still feels a bit ad hoc overall.
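A sketch of why generic partial updates are ambiguous (plain JavaScript, nothing Riak-specific):

```javascript
// a naive partial update: copy the patch's properties over the stored object
function partialUpdate(stored, patch) {
  for (var key in patch) { stored[key] = patch[key]; }
  return stored;
}

var room = {style: 'queen', capacity: 2};
partialUpdate(room, {capacity: 4});
console.log(room); // capacity is now 4, style untouched

// but there is no way to say "remove the style property": a missing key
// means "keep it". Fixed bucket properties sidestep the problem, since
// nothing can be removed in the first place.
```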
CAP choices
The section called “On Consistency and Durability” is more interesting, and better shows which features Riak attempts to provide.
The ability to choose among various strategies for consistency and durability certainly introduces additional ways to make a mistake, but it also offers flexibility where relational databases typically offer little.
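At the core of these knobs is simple quorum arithmetic: with N replicas, a write acknowledged by W nodes and a read that waits for R nodes are guaranteed to overlap on at least one node exactly when R + W > N. A one-line sketch:

```javascript
// with n replicas, a read quorum r and a write quorum w are guaranteed
// to intersect (so the read sees the latest write) exactly when r + w > n
function readSeesWrite(n, r, w) { return r + w > n; }

var consistent = readSeesWrite(3, 2, 2);   // true: 2 + 2 > 3
var fastButStale = readSeesWrite(3, 1, 1); // false: reads may miss writes
console.log(consistent, fastButStale);
```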
Exercises
Online Riak Mapreduce documentation
The documentation is at http://wiki.basho.com/MapReduce.html.
Riak contrib functions
Google is a good friend. The central site and the repository.
Keys filter documentation
http://wiki.basho.com/Key-Filters.html
Total capacity by floor
The first step is to map the original data to an array of capacities indexed by floor. What has not been seen so far is how to get the key of an object, but it is just another property (directly in v, not in the values array). A sketch of such a map function:

```javascript
function(v) {
  // v.key is the room number; the leading digits are the floor
  var room = JSON.parse(v.values[0].data);
  var floor = Math.floor(parseInt(v.key, 10) / 100);
  var capacityByFloor = {};
  capacityByFloor[floor] = room.capacity;
  return [capacityByFloor];
}
```
A quick test over a couple of hand-picked keys (the keys, bucket, and port are from my local setup; any two room keys would do):

```shell
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8091/mapred --data @- <<'EOF'
{"inputs": [["rooms", "4200"], ["rooms", "4201"]],
 "query": [
   {"map": {
     "language": "javascript",
     "source": "function(v) { var room = JSON.parse(v.values[0].data); var floor = Math.floor(parseInt(v.key, 10) / 100); var o = {}; o[floor] = room.capacity; return [o]; }"
   }}
 ]}
EOF
```
Ok, looking good. Now the reduce part. It is strictly identical to the book's reduce function, except that I renamed some variables to make it clear what is being iterated over:

```javascript
function(partials) {
  // each element of partials is a {floor: capacity} object from the map phase
  var totals = {};
  for (var i in partials) {
    for (var floor in partials[i]) {
      if (totals[floor]) {
        totals[floor] += partials[i][floor];
      } else {
        totals[floor] = partials[i][floor];
      }
    }
  }
  return [totals];
}
```
Testing, with the same hand-picked keys as before (the map and reduce sources are the functions above, abbreviated here):

```shell
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8091/mapred --data @- <<'EOF'
{"inputs": [["rooms", "4200"], ["rooms", "4201"]],
 "query": [
   {"map":    {"language": "javascript", "source": "function(v) { ... }"}},
   {"reduce": {"language": "javascript", "source": "function(partials) { ... }"}}
 ]}
EOF
```
Ok, still looking good. Let's run it on the whole set. The only change is that the inputs entry becomes the bucket name, so the query covers every key; the output lists the total capacity for each of the 100 floors:

```shell
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8091/mapred --data @- <<'EOF'
{"inputs": "rooms",
 "query": [
   {"map":    {"language": "javascript", "source": "function(v) { ... }"}},
   {"reduce": {"language": "javascript", "source": "function(partials) { ... }"}}
 ]}
EOF
```
Restrict the capacity count to floors 42 and 43
It should be enough that the key, read as an integer, be between 4200 and 4399 (rooms are numbered 0 to 99 on each floor). That is a job for key filters:

```shell
curl -X POST -H "Content-Type: application/json" \
  http://localhost:8091/mapred --data @- <<'EOF'
{"inputs": {
   "bucket": "rooms",
   "key_filters": [["string_to_int"], ["between", 4200, 4399]]
 },
 "query": [
   {"map":    {"language": "javascript", "source": "function(v) { ... }"}},
   {"reduce": {"language": "javascript", "source": "function(partials) { ... }"}}
 ]}
EOF
```
The output matches what was computed for the whole set of rooms, which is always a good thing.
And this completes Day 2. I must admit I have mixed feelings so far: I understand better what Riak tries to achieve, but the example is just not the kind of thing Riak should be used for, and the Mapreduce syntax is a bit heavy (although, fortunately, very regular).
Tomorrow will cover Vector Clocks, the mechanism Riak exposes both to move conflict resolution to the client side and to help the client implement a decent resolution.