I managed to solve this exercise using sum by parts; the book uses partial fractions, but as far as I can tell I never learned about these. They seem to be part of the high school curriculum in America, but might not be in Belgium (or I was sleeping that day).
Sum by parts worked for me because I eventually asked myself what kind of function of $k$ would produce a finite difference $\frac{1}{4k^2-1} = \frac{1}{(2k+1)(2k-1)}$.
By looking at $\Delta (2k)^{\underline m}$, I realised that the solution was fairly simple. I first experimented with $\Delta (2k)^{\underline{1}}$, then found the right expression:
For the sum by parts, I can therefore try to use:
The last expression was computed using the product rule for finite differences.
When I put everything into the sum by parts formula, the various blocks fell into place with satisfying “clicks”:
The answer as a function of $n$ is
I could not do this exercise; I also failed to see the basic sum that was at the centre of the question (quite prominently).
Even the book solution took me a while to figure out.
The book solution asks how many pairs $a$, $b$ there are such that $\sum_{a\le k \lt b}k = 1050$.
Rewriting this in terms of finite calculus is simple enough:
Now, one thing to notice is that if a sum of two integers is even, so is their difference, and vice versa. Therefore the product above is the product of one even and one odd integer.
So we are now looking for ways to express
as a product with $x$ even and $y$ odd.
To compute how many ways there are to produce a divisor of a number whose prime factors are known, it is enough to see that, for each prime factor $p$ with multiplicity $n_p$, this prime can be left out, or included up to $n_p$ times in the divisor; so for each prime $p$ there are $n_p+1$ possibilities. Multiplying these over the prime factors gives the number of divisors of the original number.
In the present case, this number is 12.
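As a quick check (assuming the product being factored is $2\times 1050 = 2100$), the twelve ways correspond to the odd divisors of 2100:

```python
# Each factorisation 2100 = x * y with y odd (and hence x even) is
# determined by the odd divisor y, so we count odd divisors of 2100.
n = 2100
odd_divisors = [d for d in range(1, n + 1, 2) if n % d == 0]
print(len(odd_divisors))  # 12
```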
Now that we have the number of possible pairs of $x$ and $y$, we have to go back to $a$ and $b$. There might be a principle involved here, but I don’t really see it, so I’ll just try to rebuild the solution from the ground up.
We already know that either $b-a=x$ and $b+a-1=y$, or $b-a=y$ and $b+a-1=x$. So we already see that, whatever expressions we need, both $a$ and $b$ will have a $\frac{1}{2}$ added so that their difference cancels out, and their sum produces a $+1$ that will cancel the $-1$.
Looking at sums and differences, we have $(x+y)-(x-y) = 2y$ and $(x+y)+(x-y)=2x$, so there will be a sum and a difference involved. We also need $a$ to be smaller than $b$, so it is a candidate for $\frac{1}{2}(x-y)+\frac{1}{2}$. However, we also need it to be positive, so we add an absolute value.
So the candidate solutions are $a=\frac{1}{2}|x-y|+\frac{1}{2}$ and $b=\frac{1}{2}(x+y)+\frac{1}{2}$. Let’s check them:
So it all adds up.
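The whole construction can also be verified numerically; a sketch, assuming the product is $2100 = xy$ with $x$ even and $y$ odd, and using the candidate formulas above:

```python
# For each odd divisor y of 2100, build the pair (a, b) from the
# candidate formulas and confirm that the k's from a to b-1 sum to 1050.
target = 1050
pairs = []
for y in range(1, 2 * target + 1, 2):      # odd factors y
    if (2 * target) % y == 0:
        x = (2 * target) // y              # the even co-factor
        a = (abs(x - y) + 1) // 2          # a = |x - y|/2 + 1/2
        b = (x + y + 1) // 2               # b = (x + y)/2 + 1/2
        assert sum(range(a, b)) == target
        pairs.append((a, b))
print(len(pairs))  # 12 valid pairs
```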
This exercise was more in line with the content of this chapter, and easy enough. Essentially, it is just a matter of changing the order of summation.
The inner sum is a geometric progression:
So we need to solve $\sum_{j\ge 2}\frac{1}{j(j-1)}=\sum_{j\ge 0}\frac{1}{(j+1)(j+2)}$, which we already saw as well, and the value is indeed 1.
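The value 1 is easy to confirm, since the sum telescopes; a quick check with exact rationals:

```python
from fractions import Fraction

# 1/(j(j-1)) = 1/(j-1) - 1/j, so the partial sums telescope to 1 - 1/n
n = 1000
partial = sum(Fraction(1, j * (j - 1)) for j in range(2, n + 1))
assert partial == 1 - Fraction(1, n)   # tends to 1 as n grows
```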
Using the same approach:
Once again, we have a geometric progression, just as easy to solve as the previous one:
Now we have $\sum_{j\ge 2} j(j-2)^{\underline{3}}$, which can be summed by parts. I chose:
and now the sum by parts:
I am not really sure of my solution here, despite the fact that the outcome is identical to the book; my method is somewhat different from the book’s, and using some concepts from Chapter 3.
To prove that the two sums have the same value, I just evaluate each.
First, a basic observation on the $\dot-$ operator: if $b\le 0$, $a\dot- b$ is at least zero, and if $a\le 0$, at most $a$ (otherwise it is always zero).
So if $a\le 0$ or $a\le b$, $a\dot- b=0$.
I will now assume that $x\ge 0$; otherwise both sums are zero.
First, I replace the infinite sum by a finite one: as seen above, if $k\gt x$, $\min(k,x\dot- k)=0$, so
I then try to remove the $\min$ operator:
so that
The general idea here is to find a way to eliminate the $k$ terms, by shifting each term in the second sum by an equal amount.
At this point, the number of terms in each sum is important: the total number of terms is the number of $k$ such that $0\le k\le x$; clearly this is $\lfloor x \rfloor + 1$.
The number of terms in the first sum is similarly $\lfloor\frac{x}{2}\rfloor + 1$.
The number of terms in the second sum is therefore $\lfloor x \rfloor - \lfloor \frac{x}{2} \rfloor$. To find an expression for this, it helps to look at integral $x$ first.
If $x=4$, $\lfloor 4 \rfloor - \lfloor \frac{4}{2} \rfloor = 4-2 = 2$.
If $x=5$, $\lfloor 5 \rfloor - \lfloor \frac{5}{2} \rfloor = 5-2 = 3$.
More generally, if $2n\le x\lt 2n+1$, there will be $n+1$ terms in the first sum, and $n$ in the second. And if $2n-1\le x\lt 2n$, there will be $n$ terms in each sum.
A first attempt for the number of terms in the second sum is $\lfloor \frac{x}{2} \rfloor$, but this only works for $x$ such that $2n\le x\lt 2n+1$. But it is easy to see that $\lfloor \frac{x+1}{2} \rfloor$ will always work: if $2n\le x\lt 2n+1$, $\lfloor \frac{x+1}{2} \rfloor = \lfloor\frac{2n+1}{2}\rfloor = n$, and if $2n-1\le x\lt 2n$, $\lfloor \frac{x+1}{2}\rfloor = \lfloor\frac{2n}{2}\rfloor = n$.
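The identity $\lfloor x\rfloor - \lfloor x/2\rfloor = \lfloor (x+1)/2\rfloor$ can also be checked mechanically on a grid of rational values:

```python
import math
from fractions import Fraction

# floor(x) - floor(x/2) == floor((x+1)/2) for all real x >= 0;
# exact rationals avoid floating-point trouble at the boundaries
for i in range(400):
    x = Fraction(i, 10)
    assert math.floor(x) - math.floor(x / 2) == math.floor((x + 1) / 2)
```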
Using this value to “shift” the terms of the second sum:
Now the question is whether the new $k$ in the second sum cancel the $k$ in the first sum. Once again, let’s check the cases:
So we can safely rewrite the sum as
And, as we already know the number of terms is $\left\lfloor \frac{x+1}{2}\right\rfloor$, the sum value is
This sum is much easier than the previous one. First I remove the $\dot-$ operator. I need $2k+1\le x$:
This gives me $\lfloor \frac{x+1}{2}\rfloor$ terms.
So I can extract $x$ and work only on $k$
So both expressions have the same value.
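Both sums can be compared numerically; a sketch writing $x \dot- y$ as $\max(x-y, 0)$, checking non-integer $x$ too:

```python
def monus(a, b):                       # a ∸ b = max(a - b, 0)
    return max(a - b, 0)

def lhs(x):                            # sum of min(k, x ∸ k); terms vanish for k > x
    return sum(min(k, monus(x, k)) for k in range(int(x) + 1))

def rhs(x):                            # sum of x ∸ (2k + 1)
    return sum(monus(x, 2 * k + 1) for k in range(int(x) + 1))

for i in range(200):
    x = i / 4.0                        # exact binary fractions, safe to compare
    assert lhs(x) == rhs(x)
```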
I find it easier to work from the basic $\min$ operator, and scale that up to $\vee$ (the formulas can always be derived mechanically from the underlying algebra).
$\min$ is an associative and commutative operator, addition distributes over it, and its neutral element is $\infty$.
I will not repeat the full list of formulas; the book has them already.
An undefined sum, according to (2.59), is one in which both the sum of positive terms and the sum of negative terms are unbounded.
I define $K^+$ as $\{k\in K \mid a_k \gt 0\}$ and $K^-$ as $\{k\in K \mid a_k\lt 0\}$.
The point about unbounded sums is that, even if I drop a large number of terms, there are always enough remaining terms to add up to an arbitrary amount.
For instance, given $n$ even and $E_n=K^+\setminus F_{n-1}$, it is always true that I can find $E'_n\subset E_n$ such that
So if I define $F_n = F_{n-1} \cup E'_n$, $\sum_{k\in F_n}a_k \ge A^+$.
And when $n$ is odd, with $O_n=K^-\setminus F_{n-1}$, I can always find a subset $O'_n\subset O_n$ such that
(with $K^-$, the $a_k$ are smaller than zero, so the sum can be made arbitrarily small).
If I define $F_n = F_{n-1} \cup O'_n$, $\sum_{k\in F_n}a_k \le A^-$.
As I could not do the other bonus questions, this completes Chapter 2.
Now I have no choice but to urge all my three readers (that includes you, Mom) to go and buy this great book. Even if you already own a copy.
Anyway, I do have time now and am eager to go on with Chapter 3; but first let’s finish Chapter 2. Today the homework exercises, and very soon the exam and bonus exercises (at least the ones I could do).
This exercise is not tricky in any way; just follow the method and the result is guaranteed.
The recurrence equations are
The $a_n$, $b_n$ and $c_n$ series are:
The summation factor
After experimenting a bit, I found that $s_1 = 2$ is slightly easier to work with, so the summation factor is $s_n = \frac{2^n}{n!}$.
With $S_n = \frac{2^{n+1}}{n!} T_n$, the recurrence equation becomes
The sum is well-known, with $\sum_{k=0}^n 2^k = 2^{n+1}-1$, so $\sum_{k=1}^n 2^k = 2^{n+1}-2$.
Going back to $T_n$, we have
Using the perturbation method:
so $\sum_{k=0}^n H_k$ is
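The standard closed form obtained by perturbation, $\sum_{k=0}^n H_k = (n+1)H_n - n$, checks out numerically:

```python
from fractions import Fraction

def H(n):                              # harmonic number H_n
    return sum(Fraction(1, j) for j in range(1, n + 1))

# sum_{k=0}^{n} H_k == (n+1) H_n - n   (H_0 = 0, so k may start at 0 or 1)
for n in range(30):
    assert sum(H(k) for k in range(n + 1)) == (n + 1) * H(n) - n
```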
This exercise is only tricky in the very first step (working out the exact meaning of $S_{n+1}$), as the sign of the terms changes depending on whether $n$ is odd or even.
This means that instead of the book equation (2.24), $S_{n+1} = S_n + a_{n+1}$, we find something like $S_{n+1} = a_{n+1} - S_n$.
First, the left-hand side of the equation:
Then, the right-hand side:
Putting both together, $S_n = \frac{1-(-1)^{n+1}}{2}$, or, as the book states, $S_n = [\text{\(n\) is even}]$.
Using the same approach as above:
and
Together:
The last version uses the ceiling operator from Chapter 3.
It will probably not be a surprise to find $U_n$ expressed in terms of $S_n$ and $T_n$.
and
With $2T_n = n+1-S_n$, this produces $U_{n+1} = U_n + n + 1$, which gives the answer away, but let’s just continue with the current method.
Putting both sides together:
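The three relations can be checked together; a sketch assuming $S_n$, $T_n$, $U_n$ are the alternating sums $\sum_{k=0}^n (-1)^{n-k} k^j$ for $j=0,1,2$ respectively:

```python
def alt(n, j):                         # sum_{k=0}^{n} (-1)^(n-k) k^j
    return sum((-1) ** (n - k) * k ** j for k in range(n + 1))

for n in range(50):
    S, T, U = alt(n, 0), alt(n, 1), alt(n, 2)
    assert S == (1 if n % 2 == 0 else 0)   # S_n = [n is even]
    assert 2 * T == n + 1 - S              # 2 T_n = n + 1 - S_n
    assert U == n * (n + 1) // 2           # from U_{n+1} = U_n + n + 1
```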
First, I look for a usable double sum. I use the fact that for any $j, k$, $j \lt k$, $(a_jb_k - a_kb_j) = -(a_kb_j - a_jb_k)$ and $(A_jB_k - A_kB_j) = -(A_kB_j - A_jB_k)$. This means that, with $s_{j,k} = (a_jb_k - a_kb_j)(A_jB_k - A_kB_j)$, $s_{j,k} = s_{k,j}$.
There is also the fact that $s_{j,j} = 0$, so now I can complete the sum to the whole rectangle:
The expansion of $s_{j,k}$ is $a_jA_jb_kB_k - a_jB_jA_kb_k - A_jb_ja_kB_k + b_jB_ja_kA_k$. Showing the summation just for the first term (the other three are identical):
Putting it all together:
In particular, with $a_k = A_k$ and $b_k = B_k$, the sum is $\left(\sum_{k=1}^n a_k^2 \right)\left(\sum_{k=1}^n b_k^2 \right) - \left(\sum_{k=1}^n a_kb_k \right)^2$.
Using
First the sum by parts
Then the evaluation
For the sum by parts, I use
The sum by parts
The evaluation is
I don’t think I had listed all the laws for this exercise, as the only complete list of the sum laws in the book is in the answer to this exercise.
I will not repeat it here; suffice it to say that when we replace sums by products, the laws can be updated by replacing products by exponentiation, and sums by products.
While it took me a few false starts, I eventually found that the triangular completion used for (2.32) works here as well.
So $\prod_{1\le j \le k \le n}a_ja_k = \left(\prod_{1\le k\le n} a_k\right)^{n+1}$.
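A quick numeric sanity check of the identity, with arbitrary sample values:

```python
from functools import reduce
from operator import mul

a = [2, 3, 5, 7, 11]                   # arbitrary sample values
n = len(a)
lhs = reduce(mul, (a[j] * a[k]
                   for j in range(n) for k in range(j, n)), 1)
rhs = reduce(mul, a, 1) ** (n + 1)
assert lhs == rhs                      # prod_{1<=j<=k<=n} a_j a_k = (prod a_k)^(n+1)
```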
As suggested, I worked out $\Delta c^{\underline x}$:
I did not immediately see the relation between this and the original sum. First I rewrote the original sum to remove the division:
Now the relation is visible. So we have
As stated in the book, the infinite sums do not converge, so the third step is invalid.
And that’s all for today.
However, in Chapter 4, about Naïve Bayes classifiers, I didn’t see how the implementation derived from the maths. Eventually, I confirmed that it could not have, and tried to correct it.
It is of course possible that the implementation is eventually correct, and derives from more advanced theoretical concepts or practical concerns, but the book mentions neither; on the other hand, I found papers (here or here) that seem to confirm my corrections.
Everything that follows assumes the book’s implementation was wrong. Humble and groveling apologies to the author if it was not.
The book introduces the concept of conditional probability using balls in buckets. This makes the explanation clearer, but this is just one possible model; each model (or distribution) uses dedicated formulas.
The problem is that the book then uses sets of words or bags of words as if these were the same underlying model, which they are not.
If we are only interested in whether a given word is present in a message or not, then the correct model is that of a biased coin where tails indicate the absence of the word, and heads its presence.
This is also known as a Bernoulli trial, and the estimator for the probability of presence is the mean presence: the number of documents in which the word is present, divided by the total number of documents.
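In code, the Bernoulli estimate is just a document-level average; a minimal sketch with hypothetical data:

```python
# Each document is its set of words; the estimate for "cash" is the
# fraction of documents that contain it, regardless of repetitions.
docs = [{"win", "cash", "now"}, {"meeting", "now"}, {"cash", "prize"}]
p_cash = sum("cash" in d for d in docs) / len(docs)
print(p_cash)  # 2 of 3 documents contain it
```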
The book algorithm does not implement this model correctly, as its numerator is the count of documents in which the word is present (correct), but the denominator is the total number of words (incorrect).
If we want to consider the number of times a word is present in messages, then the balls in buckets model is correct (it is also known as a categorical distribution), and the code in the book adequately implements it.
The book then improves the algorithm in two different ways. One is the use of logarithms to prevent underflow. The other is to always use one as the basic count for words, whether they are present or not.
This is in fact not so much a trick as a concept called Additive smoothing, where a basic estimator $\theta_i = \frac{w_i}{N}$ is replaced by $\hat{\theta}_i = \frac{w_i + \alpha}{N + \alpha d}$
$\alpha$ is a so-called smoothing parameter, and $d$ is the total number of words.
If the model is a Bernoulli trial, $w_i$ is the number of documents where word $i$ is present, and $N$ is the total number of documents.
If the model is a categorical distribution, $w_i$ is the total count of word $i$ in the documents, and $N$ is the total count of words in the documents.
As we are interested in $P(w_i \mid C_j)$ (with $C_0, C_1$ the two classes we are building a classifier for), the $N$ above is restricted to documents in the relevant class; $\alpha$ and $d$ are independent of the classes.
So the correct formula becomes
With $\alpha=1$ as a smoothing parameter, the book should have used numWords instead of 2.0 as an initial value for both p0Denom and p1Denom.
The differences with the code from the book are minor: first I introduce a flag to indicate whether I’m using set of words (Bernoulli trials) or bags of words (categorical distribution) as a model. Then I initialise p0Denom and p1Denom with numWords as explained above; finally I check the bag flag to know what to add to either denominator.
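This is only a sketch of what the corrected training function might look like; the book’s variable names (numWords, p0Denom) are kept in spirit, but the function name, signature, and the bag flag are my own, not the book’s code:

```python
import numpy as np

def train_nb(train_matrix, train_category, bag=False):
    """Sketch of a corrected trainNB0: rows of train_matrix are documents
    (word counts for bag-of-words, 0/1 flags for set-of-words);
    train_category holds the 0/1 class labels."""
    train_matrix = np.asarray(train_matrix, dtype=float)
    num_docs, num_words = train_matrix.shape
    nums = {0: np.ones(num_words), 1: np.ones(num_words)}  # alpha = 1
    denoms = {0: float(num_words), 1: float(num_words)}    # alpha * d, not 2.0
    for doc, cls in zip(train_matrix, train_category):
        nums[cls] += doc
        # bag of words: N is the word count in the class;
        # set of words: N is the document count in the class
        denoms[cls] += doc.sum() if bag else 1.0
    # logarithms to avoid underflow when many probabilities are multiplied
    p0 = np.log(nums[0] / denoms[0])
    p1 = np.log(nums[1] / denoms[1])
    return p0, p1, np.mean(np.asarray(train_category) == 1)
```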
For the Spam test, the book version has an average error of 6%. The rewritten version has an error between 3% and 4%. The Spam test uses messages as sets, for which my version differs the most.
For the NewYork/San Francisco messages classification, I did not measure any difference in error rates; this test uses messages as bags, for which the book version was mostly correct (the only difference was in the denominators).
OK, well, but the book algorithm still works, at least on the original data.
But how well exactly would it work with other data? As the algorithm does not seem to implement any kind of sound model, is there any way to quantify the error we can expect? By building on theoretical foundations, at least we can quantify the outcome, and rely on the work of all the brilliant minds who improved that theory.
Theories (the scientific kind, not the hunch kind) provide well studied abstractions. There are always cases where they do not apply, and other cases where they do, but only partially or imperfectly. This should be expected as abstractions ignore part of the real world problem to make it tractable.
Using a specific theory to address a problem is very much similar to looking for lost keys under a lamppost: maybe the keys are not there, but that’s where the light is brightest, so there is little chance to find them anywhere else anyway.
So far, this was the only chapter where I had anything bad to say about the book. And even then, it was not that bad.
The rest of the book is very good; the underlying concepts are well explained (indeed, that’s how I found the problem in the first place), there is always data to play with, and the choice of language and libraries (Python, Numpy and matplotlib) is very well suited to the kind of exploratory programming that makes learning much easier.
So I would recommend this book as an introduction to this subject, and I’m certainly glad I bought it.
I liked that the book started with PostgreSQL. All too often, I am put off by the amazingly uninformed criticisms of the NoSQL crowd about relational databases; this left me with the general impression that a younger generation of engineers was just too ignorant to figure SQL out, so they built something new (without the benefits of decades of experience…).
By having a balanced approach, the book cleared this misconception (Hadoop, the Definitive Guide also has balanced coverage in its introduction).
Each database’s strengths and weaknesses are correctly (as far as I can tell) reported, along with its position in the CAP triangle, and intended or ideal usage.
A recapitulative (but already partially incorrect, at least for the 5.0 beta version) overview of all the databases’ properties in Appendix A is also very useful.
Well, this is not exactly a problem of the book itself, but rather of the tools it covers: the rapid and sometimes radical changes in some of the databases meant that the technical information in the book was already obsolete.
The book’s intention is not to be a detailed tutorial; for instance, they skip installations (really, most technical books should skip installation and go straight to setup and use; think of the number of trees that would save), but the search for corrections was heavily taxing my already sparse free time.
All this will eventually improve, as the tools and documentation mature; right now using them is a bit too involved for the broad but shallow approach this book follows.
Compared to Seven Languages in Seven Weeks, I found this book more challenging. But this is perhaps a consequence of my prior exposure to a variety of languages and programming concepts; I suspect many people may find this book much easier.
Of all the books I have read recently, this is the one that changed and enlarged my views the most.
If you are, like me, a traditional software engineer with years of experience in relational databases but little exposure to newer kind of storage, you will benefit from this presentation of many databases and solution designs.
If, however, you already come from the NoSQL world and have experience in a few of the covered tools, this book might not be the ideal one to convince you of the strengths of PostgreSQL. The problem with relational databases is that, having been the de facto standard storage solution for decades, nobody remembers why they became popular in the first place (they actually replaced databases that looked pretty much like document or graph databases, only much more primitive).
Still, given its price, as a broad introduction to many different data tools and techniques, this book is hard to beat. I certainly am glad for having read it, and I think you would be too.
Today is less about Redis (indeed, it is hardly used at all), and more about a concept: Polyglot Persistence, and about an implementation that showcases the concept.
In fact, I spent most of my time browsing the documentation of Node.js, the library/framework the authors used to build the demo application.
Polyglot Persistence, the use of several kinds of storage systems in a project, makes even more sense than Polyglot Programming (the use of several languages in a project).
While languages are, by and large, equivalent in expressive power, and mostly a matter of choice, culture, or comparative advantage (some languages favour small teams, other large ones), storage systems are sufficiently different that they are not interchangeable.
Once the idea of eventual consistency takes root, it is only a simple extension to view the data as services available from a number of sources, each optimised for its intended use (instead of a single, default source that only partially meets the more specialised needs), and with its own update cycles.
The problem, of course, is that it introduces several levels of complexity: development, deployment, monitoring, and a dizzying range of potential errors, failures, …
The implementation described in the book is small enough to fit in less than 15 pages, yet rich enough to show what is possible.
The databases are (with the versions I used):
and the glue language is Node.js.
Redis is used first as initial storage for the first data take-on. It is then used to track the transfer of data between CouchDB and the other databases, and finally to support autocompletion of band names.
CouchDB is intended as the System Of Records (i.e. master database) for the system. Data is meant to be loaded into CouchDB first, then propagated to the other databases.
Besides that, it is not used much, and after the exercises, not used at all…
Neo4j keeps a graph of bands, members, and instruments (or roles), and their relationships.
Node.js is a framework/library for JavaScript based on the concept of event-based programming (similar to, but perhaps more radical than, Erlang). All I/O is done in continuation-passing style, which means that whenever an I/O operation is initiated, one of the arguments is a function to handle whatever the operation produces (or to deal with the errors).
This is good from a performance point of view, but it is of course more complex to design and code with. Still, it looks like a fun tool to glue various servers together.
I had to fix some of the code from the authors (nothing serious, and all reported in the errata):

populate_couch.js: the trackLineCount has an off-by-one error. The check for completion should be totalBands <= processedBands.

bands.js: the initialisation of membersQuery in the function for the /band route has a syntax error.
The book uses a now dated version of Neo4j, so the queries do not work: the shortcut to access a node by index does not work anymore, and the uniqueObject step has been replaced by dedup.
Here are the updated relevant portions:
I’m not sure what the second homework exercise was supposed to be about: Neo4j already contains information about members and memberships. Perhaps it dates from an early draft, before this chapter’s code evolved into what it is now. In any case, the first exercise had enough Neo4j anyway.
The start and end dates for memberships in bands are sometimes provided; the purpose of this exercise is to use this information.
I load the start and end dates into their own keys in Redis. The key formats are from:bandName:artistName and to:bandName:artistName.
First I take the data from the relevant columns:
Then, if they’re not empty, I create the keys in Redis:
Adding the information to CouchDB is not hard; the main difficulty is to figure out how to modify the populate_couch.js script (continuation-passing style is hard).

Eventually, I just reused the roleBatch (therefore renamed artistInfoBatch) to retrieve the roles, from and to information.
Putting it in CouchDB is trivial:
Neo4j was the hardest piece of the puzzle: I didn’t know, and could not find any definitive documentation on, how to set relationship properties at creation time. Eventually I found that adding them to the data attribute passed at creation time did the trick (although it still took me more time to understand how to use them).

The problem with doing so is that the neo4j_caching_client.js library does not support adding properties to relationships, but it was easy enough to modify the library to add this feature.
Then the relevant properties can be passed to the function above in the graph_sync.js script:
To make use of the new data, I tried to differentiate between current and old members of a band. I simply define a current member as one whose to property is null.
Figuring how to write a Gremlin query that extracted the information I needed was challenging: the documentation is often sparse, and many concepts barely explained.
I found that I could collect nodes or relationships along a path by naming them (with the as step), and then gather all of them in a single row of a Table.
I used this to get both the from and to properties and the artist name property in a single query. However, I spent some time tracking a bug in my filters where, apparently, members with a null to would not be returned as current members. I finally realised that when a given node or relationship is given two different names, these names will appear in reverse order in the Table.
So in my case, the query:
I gave the names from and to to the relationship, but used them in reverse order in the Table closures. Is this the intended behaviour or a bug? Does anybody know?
It seems like a common problem with some NoSQL databases: the query language feels very much ad hoc, and not entirely sound or fully thought through. Despite its many defects, SQL was at least based (if sometimes remotely) on relational calculus, which gave a precise meaning to queries. It was further specified in different standards, so that even its defects were fully clarified (XPath/XQuery is another pretty well specified query language). When playing with NoSQL databases that pretend to have a query language, I often find it difficult to go beyond the simpler examples, precisely because of this linguistic fuzziness.
But I solved it for this case, so now I have my Table. It is an object with two properties: columns is an array of column names, and data is an array of arrays (each one being a row). To convert them to an array of objects, I use the following code:
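The transformation itself is generic (pair each row of values with the column names); here is the same idea sketched in Python, with hypothetical data:

```python
# A Table-like object: column names plus rows of values.
table = {"columns": ["artist", "from", "to"],
         "data": [["Ringo Starr", "1962", "1970"],
                  ["Pete Best", "1960", "1962"]]}

# zip pairs each value with its column name; dict builds the object.
rows = [dict(zip(table["columns"], row)) for row in table["data"]]
print(rows[0]["artist"])  # Ringo Starr
```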
The rest of the code is just the nested Node.js event functions, and the formatting using mustache (which was pretty cool and easy to use).
The book (in beta 5.0) suggested using Riak’s Luwak, but this component has recently been removed, and there seems to be no replacement at this time. So I went with MongoDB’s GridFS instead. This is a little more complex than a simple replacement of the client libraries: MongoDB does not have an HTTP REST API for GridFS, so I need to stream the content of the file through the server.
To keep things simple, I load only one sample per band; the file name must be the same as the CouchDB key, followed by ‘.mp3’.
To access MongoDB from Node.js, I use node-mongodb-native, which can be installed with npm. It has all the expected features of a client, including GridFS support (with one caveat, see below).
To stream the file from the server, I use a dedicated port, for no better reason than that Brick.js, which the authors used to build the service, was giving me trouble, while the standard http module did not.
When displaying the band information, I check whether a file exists with the same name as the band’s key: if it does, I add a link to the dedicated streaming port, passing the key as parameter:
Then, I create a new http server to send the music:
The only problem I had (but it took me a while to figure it out) was that the stream support in the MongoDB client for GridFS content is (as far as I can tell) defective: it will close the stream after just one or two chunks’ worth of data (issue on GitHub).
So instead I have to load the whole file into memory, then write it to the response… Clearly not the best approach, but hey, it works!
Well, that was a long day. I should have enjoyed it, but the lack of maturity in some of the tools (Neo4j’s always evolving query language and the GridFS streaming bug) caused hours of frustration. The main cause, however, was missing knowledge: faced with an unexpected behaviour, I had no idea whether it was a bug (find a workaround) or an incorrect invocation (rework the query to correct it).
The exposition of polyglot persistence through the music information service was pretty good, given the space constraints. Of course it skipped the really ugly and tedious parts (how to incrementally keep the databases in sync when the main records are updated, not merely created); given the variation in data models, data manipulation (or lack thereof), and querying between the different databases, this can easily become a nightmare (especially if incremental updates are not part of the initial design).
Another upcoming book, Big Data, takes a very different approach (no updates, only appends). I look forward to reading it.
To show that
I start by rewriting the sum in the right side of the equation:
This latest value can now be put back into the original right side:
which is indeed the left side of the equation (the but-last step is permitted under the associative law, but that didn’t fit in the margin).
It is clear that there is a single $p(k)$ for every possible (integer) $k$. So I need to show that for every $m$, there is a single $k$ such that $p(k)=m$, defining $p^{-1}$.
The book method is smart, mine clearly less so, but as far as I can tell, still correct: for $m$, I consider $m-c$ and $m+c$. The difference is $2c$, so they’re either both even, or both odd.
If they’re both even, then $m-c+(-1)^{m-c}c=m$, so $k=m-c$. If they’re both odd, then $m+c+(-1)^{m+c}c=m$, so $k=m+c$. So $k$ is always well defined for every $m$, and $p$ is indeed a permutation.
While I found the closed formula for the sum, I could not do it with the repertoire method.
Solving the sum is not really difficult (although a little bit more so than with the repertoire method, if you know how to use the latter); one way is to solve the positive and negative sums separately (they can be broken down into already solved sums); another one is to compute the sum of an even number of terms (one positive and one negative), then to compute sums of an odd number of terms (by adding a term to the previous solution), and finally to combine both to find the closed formula.
In both attempts above, I tried to remove the $(-1)^k$ factor from the terms; when using the repertoire method I tried to do the same, which is why I failed.
The repertoire method relies on a good intuition: one must have a sense of the general shape of the parametric functions. In retrospect, it seems obvious, but I just couldn’t see it, blinded as I was by $(-1)^k$.
Expressing the sum as a recurrence is easy:
Also, looking at the first few values of the sum, $-1, 3, -6, 10, -15, \dots$, it is natural to consider solutions of the form $(-1)^n F(n)$; it is a little bit trickier to see where a good generalisation of the recurrence above should put the additional terms:
With such a form, plugging in solutions $(-1)^nF(n)$ will simplify to $F(n) = \beta + \gamma n + \delta n^2 - F(n-1)$.
At this stage, it becomes very easy to find the $A(n)$, $B(n)$, $C(n)$ and $D(n)$ functions (the latter being the solution we are looking for). In fact, if all you care about is $D(n)$, then it is enough to use $R_n = (-1)^n n$ and $R_n = (-1)^n n^2$:
which gives $-B(n)+2C(n) = (-1)^n n$.
which gives $B(n)-2C(n)+2D(n) = (-1)^n n^2$. Combining with the previous answer, we have $2D(n) = (-1)^n (n^2+n)$, or $D(n) = (-1)^n \frac{n^2+n}{2}$.
In hindsight, these steps could have helped me solve this exercise as intended:
Not overly complicated; at least the introduction of $j$ is not a mystery (unlike the next exercise).
The inner sum can be rewritten as
Here I use the already known sum $\sum 2^k$. Putting this last result in the original sum
It took me some time to convince myself that the original rewrite was legitimate; eventually I did it by induction (the book version is much shorter, and once you see it, much easier). Clearly it works for $n=1$, so assuming it does for $n-1$, we have
So the rewrite is correct. At this stage, (2.33) pretty much finishes it:
so $\sum_{1\le k\le n}k^3=\frac{n^2(n+1)^2}{4}$.
This follows directly from $\frac{a}{b} = \frac{c}{d} \implies ad = bc$, and the use of equation (2.52).
I’ll just do the conversion from raising factorial power to falling factorial power; the other conversion is just the same.
$x^{\overline m} = \frac{1}{(x-1)^{\underline{-m}}}$ follows from (2.51) and (2.52).
For the other equalities, by induction on $m$, and using (2.52) and its raising factorial powers equivalent:
They all follow from definition:
Assuming the relations hold for all $k, 0\le k\lt m$:
Using the recurrence relations derived from (2.52) and its raising factorial power equivalent:
Assuming the relations hold for all $k, m\lt k\le 0$:
So the main difficulty is to derive two equalities from (2.52) (four if we count the negative cases as well), and the identification of the recurrence equation in the induction step (especially for $(x+m-1)^{\underline{m\pm 1}}$).
I suppose I could say it follows directly from the equivalence of the metric functions (if my memory of metric space terminology is correct).
More basically, the equivalence of the propositions follows from the relationships based on the hypotenuse formula: $\sqrt{(\Re z)^2+(\Im z)^2}\le |\Re z| + |\Im z|$, so the absolute convergence of the real and imaginary parts implies the absolute convergence of the absolute value. Conversely, $|\Re z|,|\Im z|\le\sqrt{(\Re z)^2+(\Im z)^2}$, so the absolute convergence of the absolute value also implies the absolute convergence of both the real and imaginary parts.
This time, I found a solution to all the exercises, which is a progress of some sort. I still have trouble with the repertoire method, or perhaps not with the method itself but in identifying suitable generalisations and candidate solutions. This is something that can only be developed with practice, so I just have to be patient and keep trying (I hope I’ll get there eventually).
The meaning of such an expression is not clear, so there is no real way to fail this exercise.
A first interpretation, maybe the common one, is that the sum is zero because the range is empty. In other words, the sum is $\sum_{4\le k\le 0} q_k$.
A second interpretation, perhaps natural for those used to programming languages with very flexible loops, is that the sum is $q_4 + q_3 + q_2 + q_1 + q_0$.
I toyed briefly with a negative sum, similar to integrals with reversed bounds, but I did not come up with the nice book solution of $\sum_{k=m}^n = \sum_{k\le n} - \sum_{k\lt m}$, which is consistent with and extends the first interpretation.
It is easy to see that the expression has the same value as $|x|$:
The first one is easy:
The second one is tricky, in more than one way. One problem is that $k$ is not explicitly defined, and I had assumed it was a natural, when the authors thought of it as an integer; now the latter is in line with the book conventions, so I was wrong and had missing terms. The right answer is:
Here it is important to restrict the bounds as much as possible (but no more); otherwise there is a risk of introducing spurious terms.
The terms appear in the same order, but are grouped in sums differently.
The problem is the step
$k$ is already bound in the inner sum, so it is invalid to replace $j$ by $k$ in the outer.
This can be worked explicitly:
The result is not surprising:
So $\bigtriangledown f(x)$ is the difference operator to use with rising factorials.
Clearly, when $m\gt 0$, $0^{\overline{m}} = 0$; when $m = 0$, $0^{\overline{m}} = 1$ (to make the expression $x^{\underline{1+0}}=x^{\underline 1}(x-1)^{\underline 0}$ work when $x=1$); I had forgotten about $m\lt 0$, which was perhaps the easiest case, as $\frac{(-1)^m}{(-m)!}$ (it follows directly from the definition of falling factorials with negative powers).
It is easy to see that $x^{\overline{m+n}} = x^{\overline m}(x+m)^{\overline n}$:
From there, the value of rising factorials for negative powers follows quickly:
To start, I quickly looked up the proof of the original derivative product rule on Wikipedia; the geometric nature of the proof was illuminating (I believe I was taught the so-called Brief Proof both in high school and at university).
This geometric proof can be used for both the infinite and the finite calculus; there are two ways to compute the area of the big rectangle: $f(x)g(x)+(f(w)-f(x))g(w) + f(x)(g(w)-g(x))$ and $f(x)g(x)+f(w)(g(w)-g(x)) + (f(w)-f(x))g(x)$. In the infinite calculus, the symmetry (and equality) of the two decompositions is restored because $\lim_{w\rightarrow x}f(w) = f(x)$ and $\lim_{w\rightarrow x}g(w) = g(x)$, a restoration that is not possible in the finite calculus.
However, the equivalent finite calculus formulas, $\bigtriangleup(uv) = u\bigtriangleup v + Ev\bigtriangleup u$ and $\bigtriangleup(uv) = Eu\bigtriangleup v + v\bigtriangleup u$, have together the symmetry they lack on their own.
OK, that was not entirely bad (two small mistakes, both a matter of blindness to negative numbers). Next step, the basic exercises.
Overall, this chapter felt less overwhelming than the first, despite being much longer and introducing very powerful techniques. I have yet to do the exercises, though, so I may still revise this judgement.
The authors mention that the Sigma-notation is “… impressive to family and friends”. I can confirm that assessment.
The remark on keeping bounds simple actually goes beyond resisting “premature optimisation”, that is, removing terms just because they are equal to zero. Sometimes, it is worth adding a zero term if it simplifies the bounds. Such a trick is used in solving $\sum_{1\le j\lt k\le n} \frac{1}{k-j}$, and I’ll get back to this point when I go over this solution.
The Iverson notation (or Iversonian) is a very useful tool, as is the general Sigma-notation. About the latter, it already simplifies variable changes a lot, but I found it useful (and less error prone) to always write the variable change in the right margin (for instance as $k \leftarrow k+1$) and to keep that change as the only one in a given line of the rewrite; otherwise, no matter how trivial the change, any error I make at that time will be hard to locate (I know; I tried).
First we see how easy it is to use the repertoire method to build solutions to common (or slightly generalised) sums. The only problem with the repertoire method is that it requires a well-furnished repertoire of solutions to basic recurrences; I’m sure I would never have come up with the radix-change solution to the generalised Josephus problem. And given that there is an infinite number of functions one could try, a more directed method is sometimes necessary.
This section also shows how to turn some recurrence equations (such as the Tower of Hanoi one) into a sum; this method involves a choice ($s_1$ can be any nonzero value), which could either simplify or complicate the solution. I haven’t done the exercises yet, so I don’t know to what extent the choice is obvious or tricky.
Finally it shows how to turn a recurrence expressed as a sum of all the previous values into a simpler recurrence by computing the difference between two successive values. This is one instance of a more general simplification using a linear combination of a few successive values.
Unsurprisingly, sums have the same basic properties as common additions: distributive, associative and commutative laws. Only the latter is really tricky, as it involves a change to the index variable. As mentioned above, I found it useful to make such changes really clear and isolated in any reasoning.
With these laws confirmed, it is possible to build the first method for solving sums: the perturbation method. It is very simple, and while it does not always work, when it does it is very quick.
This is perhaps the first section where I had to slow down; basically multiple sums are not different from simple sums, and manipulations are defined by the distributive law, but index variable changes (especially the rocky road variety) require special attention. This, combined with “obvious” simplifications (obvious to the authors, and sometimes in retrospect to the reader as well), gave me some difficulties.
For instance, the solution to
The index variable change $k \leftarrow k+j$ is explained as a specific instance of the simplification of $k+f(j)$; more perplexing are the ranges for $j$ and $k$ when the sum is replaced by a sum of sums:
The range for $j$ is built from $1\le j$ and $k+j\le n$, so there is nothing really strange here.
The range for $k$, however, looks like a typo: certainly the authors meant $1\le k\lt n$. A margin graffiti confirms the range, but it does not really explain it.
The fact is, it is safe to let $k\le n$ here, because the sum over $j$ when $k=n$ is zero: not only is the expression $\sum_{1\le j \le n-k = 0} \frac{1}{k}$ zero because there is no $j$ that can satisfy the range predicate, but the closed form of this sum, $\frac{n-k}{k}$, is also zero when $k=n$.
With the closed form checked, it is safe to add extra terms to simplify the range of $k$.
What happens if you don’t see this possible simplification? As expected, the answer remains the same:
So to expand on the original advice of keeping the bounds as simple as possible: sometimes it is possible to extend the bounds (in order to simplify them), as long as the extra terms in closed form evaluate to zero. If the extra terms are still defined as sums, just checking that the range is empty might not be enough.
A cool and fun section on the various ways to solve a given sum.
Method 0 is to look it up. This book, written before the rise of the Internet (I remember the Internet in the early 1990’s; most of it was still indexed manually on the CERN index pages…), suggests a few books as resources.
Fortunately, some of them have migrated to the Web, which is a more suitable tool than books for such knowledge; the combination of search and instant updates is hard to beat (a book remains best for content that is mostly linear and somewhat independent of time; a novel, or a textbook, for instance. References are better on the Internet, free if possible, or by subscription otherwise).
Method 1 is guessing then proving; proving in fact should be a complement to all the other methods (except perhaps Method 0). Having two independent proofs is always good.
Method 2 is the perturbation method. In this section example, we see how an apparent failure can still be exploited by being imaginative.
Method 3 is the repertoire method. In this chapter it is usually much simpler than in the first.
Method 4 uses calculus to get a first approximation, then uses other methods to solve the equations for the error function.
Method 5 is a clever rewriting of the problem into a sum of sums; like the repertoire method but unlike the others, it requires some intuition to find a solution (perhaps more than the repertoire method); I have bad memories of trying such a method to solve problems at university, always somehow ending up right where I started. I guess I will try other methods if I can.
Method 6 is the topic of the next section; method 7 is for another chapter.
This section was surprising and exciting, but not really that complex. It really is a matter of adapting regular calculus reflexes to the finite version. I have to see how it works in practice.
One thing that is causing me some trouble is the falling-power version of the law of exponents:
While the rule is easy to prove and to remember, it is less easy than the general one to recognise in practice; I failed to see it when it came up in the solution to
Worse, even with the explanation in the book, I had to write it down and play with it before seeing it.
So I’m thinking about a notation that would bring out the rule more clearly, an extension of the shift operator $E$:
This would turn the exponent law into
Whether this is useful, or whether I’ll get used to the original notation anyway, we’ll see in the exercises…
The last section is about infinite sums. The authors quite sensibly restrict the scope to absolutely convergent sums, which have the advantage that the three basic laws and the manipulations they allow are still valid.
Once again, this was not overly difficult; the only point I had trouble understanding was the existence of the subsets $F_j$ such that $\sum_{k\in F_j} a_{j,k} \gt (A/A’)A_j$ when $\sum_{j\in G} A_j = A’ \gt A$. But this last equation means that $A/A’ \lt 1$, so $(A/A’)A_j \lt A_j$. The first equation is therefore just a consequence of the fact that $A_j$ is a least upper bound.
Next post, the warmups.
As I’m refreshing my C skills, I thought it would be interesting to try and implement a version as fast as possible.
I represent a subset as bit patterns in a 32-bit integer. This means I am limited to 32 different values (in other words, $n$ must be no larger than 32). The upside is that I have extremely fast intersection (&) and union (|) operations, among others.
I use a work memory allocated at the beginning of the search; additional memory is allocated on the stack (using C99 features), and the selected tickets are just printed to avoid having to remember them.
The work memory is large enough to store $1 + \beta$ times a block large enough to hold the complete set of $j$-subsets. The first block keeps the remaining $j$-subsets, and there’s an extra block for each random ticket: each time (for a total of $\beta$) a random ticket is generated, the $j$-subsets that are not covered yet are computed for this ticket; after I have generated $\beta$ tickets, I copy the work block of the best one over the first one.
I could have used just 3 blocks: a reference, the best so far, and one for the current random ticket, copying from the current to the best each time the current ticket is better. There would be more copy operations, but perhaps less movement between the cache and the memory. The current design requires less than 2 MB, and only one copy operation per random ticket.
I am using a few GCC built-in bit-level operations (number of bits, index of least significant 1 bit, and count of trailing zeroes); Bit Twiddling Hacks and Hacker’s Delight have portable alternatives.
I also use /dev/random as a source of random numbers; replacing dev_random by random would restore portability (but the output would always be the same, as the random state is reset when the program starts).
So, is it fast?

The program found 71 tickets covering all 7-subsets with at least 6 numbers in less than a second. Even when the conditions are not that good, it remains fast:

Here it generated 1077 tickets using the smaller ticket size from the Younas and Skiena paper; the paper had a 1080-ticket solution, so my version is effective.
Of course, it would be useless and unfair to compare the speed of this version against the numbers from the paper; more relevant is the difference with the Haskell version: while the latter was not meant to be fast, it is hundreds of times slower. I suppose it would be interesting to try and make it faster, but I suspect it would end up just as ugly as the C version, or uglier. And I like to keep using Haskell as a design and exploratory tool.
solve
The main function, solve, is more complex than in the Haskell version. It allocates the work memory, and fills it with init. A first ticket is used in init to filter out $j$-subsets.
Then the loop for the other tickets starts. It of course stops when there are no remaining $j$-subsets.
The subset of remaining numbers is computed with funion (fold union), and the digits array prepared to be used in sample. It consists of the individual bits of the number representing the remaining numbers subset. It is computed by repeatedly isolating the rightmost 1 bit (with d & -d), then clearing this bit (with d &= d - 1).
A first ticket is randomly generated and its uncovered set computed. It is also set as the best new ticket (and indeed is the best so far). Then for the remaining $\beta-1$ new tickets, the uncovered set is computed as well, and if the new set is smaller than the best’s, the new ticket becomes the best as well.
The best ticket is printed, the main work memory is updated with the best uncovered set, and if there are any remaining $j$-subsets to find, we loop.

init
init’s purpose is to avoid wasting a loop over the $j$-subsets by merging the generation of $j$-subsets with the coverage of a first permutation (defined as [1..k] in solve). The returned value is the size of the not-yet-covered set of $j$-subsets.
If all tickets had to be generated randomly, 0 could be passed instead of a ticket to keep all $j$-subsets.

check_cover
check_cover has a similar design to init, but reads the $j$-subsets from the work memory from instead of generating them.

sample
sample is very similar to the Haskell version (indeed they are both based on the same algorithm); here the digits array plays the role that ds played in the Haskell version.

next_perm
The next_perm function is from Bit Twiddling Hacks, and explained here.

Using gcc, the necessary option is -std=c99 to activate C99 support; -O3 gives much (really) better performance, while -Wall is in general a good idea:


To run it, just pass the $n$, $k$, $j$ and $l$ parameters on the command line. There are no checks, so avoid mistakes. The program outputs the generated tickets:

After I completed the Haskell version, I found it not overly difficult to implement the C one. I was lucky to have discovered Bit Twiddling Hacks the week before; the code fragments there were very helpful in writing efficient set-oriented functions over words.
Surprisingly, I had just one bug to track (I was using a variable both as parameter and temporary storage in one of the functions); that was lucky, as I’m not sure I could have debugged such code.

The first War Story is Psychic Modeling, an attempt to exploit “precognition” to improve the chances of winning the lottery.
This war story is also the subject of one of the first implementation projects, in Chapter 1. A few years ago, when I bought the book, I had easily solved the previous exercises, but then I reached this implementation project, and I got stuck. I could not even get a high-level sketch of what a solution would look like.
Certainly, if I was unable to solve an exercise of the first chapter of this book, it was hopelessly beyond my reach…
Still, I had the ambition of one day resuming my reading, and I would from time to time give this problem another attempt.
Recently, it feels like all the pieces finally fell into place, and after a few hours of coding I had a (naive) implementation. Yet I still have doubts, as the only reference I have to compare my solution with, Skiena’s own paper (Randomized Algorithms for Identifying Minimal Lottery Ticket Sets), apparently is worse (in terms of necessary tickets) than my solution…
Note on this paper: unfortunately it is in Word format, and I found that some characters are not properly displayed on non MS Word text processing tools (such as Open Office). So you might have to open it with MS Word or MS Word Viewer.
I will use the notation from the book rather than the paper. The problem is defined as this:
A first difference between the paper’s approach and mine is that I’m using the notion of coverage size rather than distance: I measure how similar two subsets are by defining their cover as the size of their intersection; in their paper the authors use a notion of distance defined as the size of the difference of the two subsets (perhaps to help with the design of heuristics in the backtracking version of their algorithm).
Now, clearly the two approaches are equivalent; it is less clear that the formulas derived from either are indeed the same.
For a given $j$-subset, how many $j$-subsets have a coverage of at least $l$ with the first one? The covered $j$-subsets must have at least $l$ numbers (between $l$ and $j$, to be precise) in common with the first one, and the rest taken from the $n-j$ other numbers. This gives
For a given $j$-subset, how many $j$-subsets are within $j-l$ distance of the first one? We can choose at most $j-l$ numbers out of the $n-j$ rest, and complete with numbers from the first subset. This gives
It took me a while to confirm it, but the formulas are indeed the same:
Note that I do not use the $k$ size of a ticket. In fact, in my original design, I used it but ignored $j$; reading the paper I realised that $j$ was indeed critical: one of the $j$-subsets will be on the winning ticket, so they are the ones we need to cover. However, I could not understand why the paper did not use the potentially larger size of a ticket to cover more $j$-subsets.
Restated with a complete ticket, the coverage formula becomes
This apparently small change actually reduces the lower bound on the necessary tickets significantly. For $n=15$, $k=6$, $j=5$, $l=4$, for instance, while the paper offers $58$ as a lower bound, the formula above gives $22$.
So the question is: is it valid to use the possibly larger value $k$ when generating tickets? I could not think of any reason not to, and if I’m right, this gives each ticket a much larger cover, and therefore a lower number of necessary tickets.
For a first effort, I chose to code in Haskell, and favoured simplicity over speed. The code is indeed both simple, and wasteful, but Moore’s Law says that computers have become about 1000 times faster since the time the paper was written, so I have some margin.
To keep things simple, sets and subsets are just lists.
Such functions ought to belong to a dedicated library (and perhaps they do); I include them to keep the implementation mostly selfcontained.

fact is just the factorial; combi computes the binomial coefficient, and remainingNumbers is just the union of all the passed $j$-subsets.

genCombi k s generates the $k$-subsets of $s$.
These are simple implementations of the formula above.

ticketCover just implements the coverage estimate I defined above (the one that uses $k$); lowerBound computes the lower bound for a single win.
As stated above, I define the cover between two subsets as the size of their intersection, and define sufficient coverage as the cover being larger than $l$.

cover implements the cover definition; coveredP and notCoveredP are predicates that check for (or against) sufficient coverage. notCovered and notCoveredBatch compute the subsets that are not covered by a single ticket or a set of tickets, respectively; they are used to compute what is left to cover after selecting a ticket, and to check solutions.
Finally coverageScore computes the number of subsets covered by a ticket. This function is used to compare potential tickets and select the one with the best (i.e. largest) coverage.

checkFormula computes the size of the coverage of a single ticket; it can be used to confirm the value of ticketCover above (and as far as I can tell from my checks, it does).
The solution loop takes the parameters and a ticket candidate generating function; it then gets one ticket at a time, computes the $j$-subsets not covered yet, and repeats until the set of remaining $j$-subsets becomes empty.

The solve function expects the candidate generation function to be monadic; this is to make it possible to use random number generators.
I do not really know how to navigate subsets, so I won’t try to implement a backtracking solution as described in the paper. Instead, I have what is really the simplest greedy algorithm: when a new ticket is needed, get the one that has the best coverage among all the possible tickets:

So for each set of remaining $j$-subsets, generate all the $k$-subsets, and compare their coverage.
Needless to say, this function does not return anything anytime soon for even slightly large values of $n$.
To improve the performance (well, to get a result in my lifetime), I am using what I understand to be the same approach as in the paper: generate $\beta$ tickets, compare their coverage of the remaining subsets, and keep the best one.
The difference with the paper, as mentioned before, is that my tickets are $k$-subsets rather than $j$-subsets themselves.
I first need a function to generate a random combination. I’m using a method derived from Knuth (no reference as I don’t have Volume 4 just yet).

The generating function is very similar to the naive one

The only difference is the tickets candidate set: the naive function generates them all; the randomised one selects $\beta$ randomly.
By using solve n j j l instead of solve n k j l, my implementation should compute subset coverage the same way the paper’s implementation does.
I will not compare speed, as this would be meaningless. But I can check whether different values for ticket size can indeed help reduce the size of the covering set.
Let’s start with a very simple problem, where $n=5$, $k=3$, $j=3$ and $l=2$.
I don’t really need to generate the $j$-subsets, but if I do, I can check the solution.
The solution itself is computed by passing a ticket generating function; I could have used getCandidate, but here I’m passing getCandidateRandom with $\beta=100$.
The notCovered set is empty, so the solution is at least a covering one.
The solution has two tickets, and the lower bound confirms it is pretty good.

Next test, with $n=15$, $k=5$, $j=5$ and $l=4$. The paper reports that they found a solution with $137$ tickets. As $k=j$, my algorithm cannot really beat that (and indeed finds a solution of the same size, if I try a couple of times):

For the next test, I should have a better solution than the paper, as $k$ is larger than $j$: $n=15$, $k=6$, $j=5$, $l=4$.
The paper has a lower bound of $58$, and a solution of size $138$, but my lower bound is $22$, and my solution has size $57$.

When the difference between $k$ and $j$ becomes large, the solution improves significantly: with $n=18$, $k=10$, $j=7$, $l=6$, the paper has a lower bound of $408$, mine is $18$. The paper’s solution has size $1080$, but mine is just $73$.

Even if my approach is ultimately wrong, I can say I must be close to an actual solution. I could (and probably will, given time) try to rewrite my solution in C, and focus on performance.
So I declare this problem conquered, and I will resume my reading.

First, it helps to see that the indices of the recurrence are actually $S_n$:
And of course, $S_n = S_{n-1} + n$.
Setting $m=S_{n-1}$, we try to show:
Now, obviously, if we have $m+n$ discs, we can move the $m$ top ones from $A$ to $C$ using $B$ and $D$ as transfer pegs, then move the bottom $n$ ones from $A$ to $B$ using $D$ as transfer peg, and finally move the top $m$ ones from $C$ to $B$.
The first step takes $W_m$ moves, the second one is the classic Tower of Hanoi problem (as we can no longer use peg $C$, we only have three pegs), so it takes $T_n$ moves, and the last step takes $W_m$ moves again.
This is only one possible solution; the optimal one must be equal or better, so we have
This is true for any $m+n$ discs, and in particular for $S_n = S_{n-1} + n$ ones.
I could not solve this problem. I had found that the half-lines did intersect, but then I failed to show that their intersections were all distinct.
Even with the solution from the book, it took me a while before I finally had a complete understanding.
One problem I had was that lines in a graph are basic college level mathematics, but college was a long, long time ago. I pretty much had to work from first principles.
Following the book in writing the positions as $(x_j, 0)$ and $(x_j - a_j, 1)$, I need to find $\alpha$ and $\beta$ such that $y=\alpha x + \beta$ is true for both points above.
With this given, I can try to find the intersection of lines from different zigs, $j$ and $k$:
Now, still following the book, I replace $x$ by $t$ with $x=x_j - t a_j$:
Somehow, I have a faint memory of such a result; I need to check a college math book.
To complete, I need to show that $y = t$:
So the intersection of any pair of half-lines from different zigs is $(x_j - t a_j, t)$. Note that $t$ has the same value whether $j \gt k$ or $k \gt j$. To simplify further computations, I set $j \gt k$.
There are two remaining steps: show that $t$ is different for different pairs of $j$, $k$ (with $j \ne k$); and then show that the four intersections for a pair $j$, $k$ are also distinct.
$a_j$ can be of two forms: $n^j$ and $n^j + n^{-n}$. So $a_j - a_k$ can be one of
So there are three different forms for $a_j - a_k$, which I will simply write $n^j - n^k + \epsilon$ where $|\epsilon| \lt 1$.
Let’s show that $n^j+n^k - 1 \lt t \lt n^j+n^k + 1$: multiply the whole inequality by $n^j - n^k + \epsilon$. As
so $n^j - n^k + \epsilon \gt 0$. Defining
the left and right inequalities become
Subtracting $N_{jk}N’_{jk} = (n^j-n^k)(n^j+n^k)$ from the original inequality:
I need to prove the following inequality
We already know $|\epsilon| \lt 1$, so looking at the second term (and assuming $\epsilon \ne 0$, as this case is trivial)
and we have
So the inequalities are established. $N_{jk}$ can be seen as a number in base $n$ whose digits are all zeroes except the $j$th and $k$th ones, so $N_{jk} = N_{j’k’} \implies j=j’, k=k’$, and therefore $t$ uniquely defines $j$ and $k$; in other words, two different pairs of zigs must have different $t$.
I still need to show that for a given pair of zigs, when $t$ is the same, the intersections are different. There are only three different values of $t$ for the four intersections, so two intersection points have the same height. This happens for
which happens when $a_j = n^j$, $a_k = n^k$ and $a_j = n^j + n^{-n}$, $a_k = n^k + n^{-n}$. But the $x = x_j - t a_j$ value of the intersections differs: $t n^j$ versus $t (n^j + n^{-n})$, so there are indeed four distinct intersection points.
I could not solve this problem. Once again, my lack of intuition with geometry was to blame.
But if we have two zigs with half-line angles $\phi$, $\phi + 30^{\circ}$ and $\theta$, $\theta + 30^{\circ}$, then for any two pairs of half-lines from the two zigs to intersect, their angles must be between $0^{\circ}$ and $180^{\circ}$. Taken together, these constraints give $30^{\circ} \lt \phi - \theta \lt 150^{\circ}$.
Update: The original version of this post had a lower bound of $0$. Thanks to Tailshot for pointing out the error.
This means there cannot be more than $5$ such pairs (and to be honest, I would have said 4, but the book says it’s indeed 5).
Using the repertoire method, solve the recurrence equations
The general form of $h(n)$ is
We get three of these functions directly by solving
So we have a solution for $A(n)$, $B_0(n)$ and $B_1(n)$.
Setting $h(n) = n$
which gives the equation $n = A(n) + B_1(n) - 2(C_0(n) + C_1(n))$.
Setting $h(n) = n^2$
which gives the equation $n^2 = A(n) + B_1(n) + 4C_1(n)$
The latter gives us $C_1(n) = (n^2 - A(n) - B_1(n))/4$. To solve for $C_0$, one can either replace the value of $C_1$ in the equation for $h(n) = n$ above, or, equivalently, add twice that equation to the one for $h(n) = n^2$, which eliminates $C_1(n)$:
It took me a while, as I was trying to find a recurrence equation of some sort which would help me with this problem and the bonus one (where Josephus’ position is fixed but he can pick $m$). Eventually I found one, which did not help me with the bonus problem, but led me to a solution for this problem.
Obviously, if we have $k$ persons and want to remove the last one in the first round, we can choose $m=k$ and that will work. Actually, any multiple $m=ak$ works as well.
This shows that at each round, if we have $k$ persons left, and we start counting on the first one, when $m=ak$ we will remove the $k^{th}$ person then start counting from the first one again.
Back to the original problem: there are $2n$ persons, and we want to get rid of persons $n+1, \cdots, 2n$ first. If we take $m=\mathrm{lcm}(n+1,\cdots, 2n)$, then in each of the first $n$ rounds the last (bad) person will be removed, leaving only the good ones at the end.
When first solving the problem, I picked $m=\prod_{i=1}^n (n+i)$, which has the same property as the least common multiple, but is larger. Perhaps a smaller number is better for the nerves of the participants.
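The argument can be checked with a small simulation (the helper below is mine, not from the book): with $2n$ persons and $m=\mathrm{lcm}(n+1,\cdots,2n)$, the first $n$ persons removed should be exactly $2n, 2n-1, \cdots, n+1$.

```javascript
// Simulate the counting-off: people 1..count stand in a circle, every
// m-th person is removed; return the first `rounds` removals in order.
function removalOrder(count, m, rounds) {
  var people = [];
  for (var i = 1; i <= count; i++) people.push(i);
  var removed = [];
  var idx = 0; // counting starts at person 1
  for (var r = 0; r < rounds; r++) {
    idx = (idx + m - 1) % people.length; // the m-th person from the current position
    removed.push(people.splice(idx, 1)[0]);
    if (people.length > 0) idx = idx % people.length; // resume counting after the gap
  }
  return removed;
}
```

For $n=3$ (six persons), $m=\mathrm{lcm}(4,5,6)=60$, and the simulation removes persons 6, 5, 4 first, leaving 1, 2 and 3 as survivors.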
I tried to solve the bonus questions, but after repeatedly failing, I had a glimpse at the solutions: they obviously require either knowledge of later chapters, or other concepts I know nothing about, so I will get back to these bonus problems after I finish the book.
I am now working through Chapter 2. It is a much larger chapter than the first, so it will take me some time.
]]>Advanced views in CouchDB are, as noted yesterday, materialized output of MapReduce computations.
This has a cost: such computations are saved, so they take more time than with other implementations, the first time at least.
Updating the views, on the other hand, is fairly fast (CouchDB recomputes only what is necessary). Views have to be planned, but once there they are fairly cheap. For exploratory queries, other databases might be more appropriate.
CouchDB’s reduce functions distinguish between the first invocation and the following ones (on values that have already gone through the reduce function). This makes it possible to implement a _count function which counts the number of values (the first invocation transforms values into numbers, and the following ones add the numbers up).
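A reduce function in that style might look as follows (a sketch in CouchDB’s JavaScript idiom; the third argument, rereduce, is how CouchDB signals a re-invocation on previously reduced values):

```javascript
// A count-style reduce: on the first pass each raw value counts as 1;
// on re-reduce passes the inputs are already counts, so just add them up.
function countReduce(keys, values, rereduce) {
  if (!rereduce) {
    return values.length;          // first invocation: count the raw values
  }
  var total = 0;                   // later invocations: sum the partial counts
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}
```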
Replication is the one-way process of replicating the changes of one database onto another. Replication can be between any two databases, whether on the same server or on different ones. It can be one-time, or continuous. The documents to replicate can be filtered, or selected by _id.
Replication is a lower-level mechanism than what MongoDB, for instance, proposes (where there is a strict hierarchy of masters and slaves), and closer to the flexible approach of Riak.
Of course, when concurrent writes are permitted, conflicts can occur, and CouchDB handles them.
Concurrent updates can cause conflicts, and CouchDB detects them so they can be dealt with.
First, conflicts cannot happen on a single server: updates to a document must refer to the latest revision, otherwise the update fails. So clients are directly aware that they need to resubmit the (merged) document.
When replication is enabled, conflicts result from concurrent updates in two replicated databases. At the next replication, one version will be selected as winning, and replicated to the other databases. The other versions are still accessible from the _conflicts attribute (initially, only in the losing databases).
If two-way replications are in place, eventually all databases will have the _conflicts attribute populated (with all the losing revisions, if there is more than one).
This makes it possible to implement remedial actions: it is possible to have views with only the documents in conflict, or to filter changes for conflicts, and to implement merging actions in monitoring scripts.
CouchDB documentation helpfully provides some advice for designing conflict-aware applications.
Changes are dedicated views that contain a list of updates for a specific database. The parameters support starting at a given revision (in this case, a database revision, not a document revision), filtering documents, and keeping the stream open in several ways.
This makes it possible (easy, even) to monitor (interesting or relevant) changes, to synchronize with other systems, or to automatically resolve conflicts, for instance.
When using LongPolling, I found that on very large datasets the JSON.parse invocation could take a long time, and would suggest always using a limit parameter on the query, to cut the dataset down to manageable chunks.
There are three built-in reduce functions, documented on the Wiki.
They are implemented directly in Erlang, so they perform better than JavaScript functions.
_sum
This function behaves just as the reduce function from the book; it sums the values by key. It is useful when the map function uses emit(key, 1); (or some other numeric value).
_count
It is similar to _sum, but it counts the number of values rather than merely summing them. It is useful when the value is not a number.
_stats
This is an extension of _sum which computes additional statistics (minimum, maximum, …) on the numeric values.
Filters on the _changes output are nicely described in CouchDB: The Definitive Guide.
To create a new filter, I first create a design document to store the function:

The by_country function retrieves a country parameter from the request, and compares it against the record’s country attribute; only the matching records are returned.
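Based on that description, the filter function is presumably along these lines (a sketch; the design-document wrapping is omitted, and the signature function(doc, req) with parameters in req.query is CouchDB’s standard filter interface):

```javascript
// A changes filter: keep only documents whose country attribute
// matches the country parameter passed on the request.
function byCountry(doc, req) {
  return doc.country === req.query.country;
}
```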
To monitor only updates to bands from Spain, for instance, I can use


To monitor for conflicts, I have the following design document:

With that, I can then listen for changes, keeping only the conflicts:

Because CouchDB only sets the _conflicts attribute in the losing database, the winning database (the one in which the winning revision was initially created) does not know about the conflicts. This means I must check against musicrepl instead of music.
The API is documented here.
To use it, simply pass the source and target databases to the _replicate URL:

The _replicator database is an alternative to the use of the _replicate URL above: documents inserted in the _replicator database will, if properly formed, cause a replication job to be started (either one-off, or continuous).
Deleting the document will cancel the replication job.
Documents describing replications are updated to reflect the progress of the job.
The command below triggers a replication from music to musicrepl:

Using the watch_changes_longpolling_impl.js script on the _replicator database, it is possible to monitor the replication job:

The first change is when the document is created; the second when the job starts, and the third when it successfully completes.
Unlike the _replicate-based API, continuous jobs stored in _replicator will resume when the database is restarted.
The approach is to keep the input in a buffer, then extract as many lines from the buffer as possible (if the last line is incomplete, it is put back into the buffer), and parse each line as a JSON object.
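The buffering logic can be sketched like this (the function name and shape are mine; the actual script keeps the buffer in the surrounding scope):

```javascript
// Append a chunk to the buffer, extract every complete line, parse each
// as JSON, and return the parsed objects plus the incomplete remainder.
function extractChanges(buffer, chunk) {
  var lines = (buffer + chunk).split("\n");
  var rest = lines.pop();            // the last piece may be an incomplete line
  var changes = [];
  for (var i = 0; i < lines.length; i++) {
    if (lines[i].length > 0) changes.push(JSON.parse(lines[i]));
  }
  return { changes: changes, rest: rest };
}
```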
The format of each parsed object is different: each change is in its own object, so there is no results attribute any more.

I just inserted the code block above in the original watch_changes_skeleton.js; no other modifications were required.
With the code block above, both the long polling and the continuous outputs are identical.
As I said above, conflicts are only created in the losing database, so to test this I must use the musicrepl database.
Otherwise, the code is simple: iterate on the _conflicts attribute, and for each revision it contains, emit that revision mapped to the document _id:
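A sketch of such a map function (emit is provided by CouchDB’s view server; the function name here is mine):

```javascript
// Map each conflicting revision to the _id of its document; documents
// without a _conflicts attribute produce no rows at all.
function conflictsMap(doc) {
  if (doc._conflicts) {
    for (var i = 0; i < doc._conflicts.length; i++) {
      emit(doc._conflicts[i], doc._id);
    }
  }
}
```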

Testing it:

And this completes Day 3 and this overview of CouchDB.
It is another fairly short day, as much of this section is actually about the complexities of XML parsing…
Like Riak and MongoDB, CouchDB is scripted with JavaScript, so today has a feeling of déjà vu.
A View is just a mapping of a key to a value. Keys and values are extracted from documents; there can be more than one key for each document, as in MongoDB.
Once the view has been built and updated for the documents it applies to, it can be accessed by key using optimized methods (all based on some form of lexicographical order).
A View in CouchDB is essentially the equivalent of a materialized view in relational databases.
Access to the view causes it to be updated (i.e. recomputed) if necessary, which can be a painfully slow experience. I had imported the whole content of the music database (26990 records), and each time I tested a Temporary View or saved a Permanent one, I had to wait for CouchDB to finish the refresh (fortunately not too long on this dataset).
It is interesting to note that relational databases require the schema to be designed ahead of time but support arbitrary queries, while CouchDB lets you ignore the schema but needs you to design the queries ahead of time.
The key passed to the emit function can be any JSON object, although I would say that only strings and arrays of strings have sensible semantics.
Arrays can be used with reduce functions to provide query-time custom grouping, as explained here.
For instance, to compute the number of records by date, I used the releasedate of each album to create a key array [year, month, date], and a value of 1 (1 for each album):
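The map function for this was presumably close to the following sketch (the releasedate format is assumed to be parseable by JavaScript’s Date; the function name is mine):

```javascript
// Emit a [year, month, day] key with a value of 1 for each album
// that has a releasedate; note that JavaScript Date months are 0-based.
function byDateMap(doc) {
  if (doc.releasedate) {
    var d = new Date(doc.releasedate);
    emit([d.getFullYear(), d.getMonth(), d.getDate()], 1);
  }
}
```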

As I intend to use grouping, I also need a reduce function:
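The reduce just adds the values up; CouchDB provides a sum helper for this, but a plain loop (shown here as a self-contained sketch) does the same thing:

```javascript
// Sum the values: 1 per album on the first pass, partial sums on rereduce.
// The same code works for both passes, so the rereduce flag can be ignored.
function byDateReduce(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}
```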

Each document in the view is now a date as an array, with a value of 1 for each record made on that date (there are as many identical keys as there were records for a given day).
When querying, by default, the reduce function will be called on identical keys to get a single value:

(month is 0-based…)
With the group_level parameter, I can control whether I want to group by day (group=true or group_level=3, as above), by month (group_level=2), or by year (group_level=1):


There are quite a few of them listed here.
The code is essentially the same as the one mapping names to ids, but here it associates random to name.
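A sketch of such a map function (assuming each artist document carries name and random attributes; the function name is mine):

```javascript
// Key each artist by its random attribute, so that a range query
// starting at a random number picks an (approximately) random artist.
function randomMap(doc) {
  if (doc.name && doc.random) {
    emit(doc.random, doc.name);
  }
}
```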

The URL below returns the first artist whose random number is greater than the random one generated by Ruby.

As expected, if given a value too large (for instance, 1), the query returns nothing:

The code of each script is similar, in a way Russian Dolls are similar: each one is an extension of the previous, digging deeper into the nested structure of the original document.

Testing:


Testing:


Testing:

And that’s it for Day 2.
Today is just a short introduction: CouchDB is (yet another) key-value store; it has a ReST API, stores JSON data, and, like Riak, only supports full updates. Unlike Riak, however, it does not support concurrent updates; instead it requires the client to only update from the latest version of the data.
I thought at first that the data was versioned, like in HBase, but this is not the case: the version id (_rev) is there to ensure that updates occur sequentially, not concurrently. CouchDB can keep previous versions of documents, but the retention is unreliable, as explained here.
Besides the HTTP based ReST API, CouchDB also provides a web interface; among other tools, there is a complete test suite, which is always nice to check the installation.
The documentation is here; there is also a reference
Besides the basic CRUD POST, GET, PUT and DELETE, there is also HEAD (for basic information on a document):

When using cURL, the HEAD command must be used with the -I flag, otherwise cURL will wait (endlessly) for data after the headers.
Finally, there is a COPY command, which as expected copies a document (without having to retrieve it first):

To PUT a new document with a specific _id, it is just a matter of specifying the id when creating the document:

To create an attachment, it is necessary to know the version of the document, as it is considered an update. The URL for the attachment is just the URL for its document, with any suffix (the suffix naming the attachment). The _rev is specified by passing a rev parameter. Using the document with _id ‘beatles’ created above, the attachment is uploaded with:

The document now has a new _rev.
To retrieve the attachment, just use its URL:

(the line breaks have been lost…)
Onward to Day 2!
Redis’ low-level protocol supports the notion of pipelines: sending commands in a batch, and collecting all the results at the end, instead of waiting for a result between each command. This saves a round-trip delay per command, so there can be huge performance boosts for specific usages, as the informal benchmarks below show.
Redis servers can be distributed for performance or memory concerns, but much of the work falls on the client side.
Slaves in Redis are just the opposite of MongoDB’s. Whereas MongoDB’s slaves are meant to be written to, so that updates are automatically pushed to the master, Redis slaves are, or should be, read-only. Updates are only propagated from master to slaves.
There is no integrated support for failover; it has to be implemented in client code.
So slaves are mainly a mechanism to distribute reads; combined with monitoring client code, they can also be used for data replication and failover.
Note that each slave needs as much memory as the master, as it contains the same data.
By itself, Redis does not support sharding, and relies on the client library to spread accesses over several instances. There is ongoing development to have real Redis Clusters, but for the time being sharding has to be simulated.
One issue not mentioned in the book is that sharding breaks transactions and pipelines: there is no guarantee that the relevant keys are all in the same instance, so the Redis Ruby client, for instance, will raise an exception when invoking MULTI.
The Java client, Jedis, has a mechanism to “tag” a key such that keys with the same tag are guaranteed to be on the same Redis server. This makes the distribution of keys predictable, and allows the use of transactions (provided all the involved keys have the same tag).
This shows not only that this is a client-side feature, but that the actual extent of the feature may vary widely. And of course, there is no reason to think that different clients will shard keys the same way.
Properly set up, sharding will distribute the data over the nodes, reducing the memory load on each node.
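The idea behind key tags can be sketched as follows (the hash and the {tag} syntax here are illustrative, not Jedis’ actual implementation): the shard is chosen from the tag when one is present, so all keys sharing a tag land on the same shard.

```javascript
// Pick a shard index for a key; keys sharing a "{tag}" always land
// on the same shard, because only the tag is hashed.
function shardFor(key, shardCount) {
  var m = key.match(/\{(.+?)\}/);          // use the tag if present
  var basis = m ? m[1] : key;              // otherwise hash the whole key
  var h = 0;
  for (var i = 0; i < basis.length; i++) { // simple illustrative string hash
    h = (h * 31 + basis.charCodeAt(i)) >>> 0;
  }
  return h % shardCount;
}
```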
I first tried to rewrite the code in Java, to measure the cost of Ruby’s convenience. The code in Java is clumsier than in Ruby, but it ran a bit faster (105 seconds instead of 155 seconds for the Ruby version using hiredis).

Using pipelines, the difference was 11 seconds against 26 seconds (again, the Ruby version is using hiredis).

Disabling snapshots and the append-only file did not improve the time significantly compared to the default (snapshots but no append-only file).
Enabling the append-only file and setting it to always was almost 3 times as slow for the pipelined Java version (27 seconds). For the original Ruby version (with hiredis), it was even worse (1101 seconds). This means the overhead of writing to file can be mitigated with pipelines.
To recap: disabling snapshots did not improve performance measurably, but enabling the append-only file with always degrades performance significantly; using pipelines makes it a bit better, but it is still much slower.
The exact setup to implement is not described, so what I did is to distribute the data between two shards, each consisting of one master and two slaves.
There is no direct support for such a layout in Jedis (nor, as far as I can tell, in the Ruby library), so I had to write some of it myself.
As always with Redis, the writes are restricted to the masters, and the reads are distributed over the slaves (and the masters as well, if needed).
Jedis does not support slaves directly. What the documentation proposes is to have a dedicated client to the master to write on, and a sharded pool to the slaves. However, such an approach would be difficult, as I need to shard the writes to the masters as well (I would have to use a different sharding algorithm, and manage the routing of commands through the tree of Redis instances).
Fortunately, Redis user Ingvar Bogdahn had posted an implementation of a Round Robin pool of slaves. This implementation manages a connection pool to a master, and another connection pool to a set of slaves. The commands are properly distributed: all the write commands are sent to the master, and the reads commands are distributed over the slaves.
I had to fix the code in some places: a command implementation was missing, another was incorrect, and finally the password was never sent to the master, causing authentication errors. But the bulk of the code is Ingvar’s, and I was glad to use it.
The classes are:

- UniJedis: provides pools for both a master and a set of slaves, and dispatches commands to the correct pool.
- RoundRobinPool: implements a pool with Round Robin access.
- ChainableTransaction: (not used in this project) provides a fluent interface for Redis transactions.
- DBKeys: (not used in this project) abstracts databases and keys.

Sharding is directly supported by Jedis, but as organized the code is restricted to a set of clients to specific instances.
There are basic, generic classes (Sharded, ShardInfo, …) that can be used to implement sharding of arbitrary clients (such as the Round Robin pool above), but it requires a lot of tedious code to map each command to a method on the right shard. Worse, such code would be the same for every kind of shard.
So I first wrote generic classes that implement sharding in terms of a generic Jedis client; the actual implementation is then much simpler (just the constructors, and the few commands that cannot be sharded, such as disconnect or flushAll).
- BinaryShardedGJedis: first level of the Jedis commands implementation (binary commands).
- ShardedGJedis: second level of the Jedis commands implementation (String-based commands).
- UniJedisShardInfo: descriptor class to use with Sharded.
- ShardedUniJedis: actual implementation of sharded UniJedis. As promised, the class has hardly any code.

The code for the service itself is now fairly small. JedisClient is the class that builds the tree of sharded master/slave pools. It is loaded and initialized as a Spring bean. The web services are JSR 311 services, running over Jersey, and loaded and initialized by Spring.
Admin lets the user define a keyword for a specific URL, and Client extracts the keyword from the request URL, retrieves the URL for this keyword, and returns a redirect to this URL.
Once deployed (on Apache Tomcat), it can be used in a browser or on the command line:

and for clients:

The code for the whole project can be found on Github.
And this completes Day 2.
Redis is basically a key-value store, like Riak, but while Riak is agnostic about the values, Redis values can be data structures (lists, queues, dictionaries, …, or even messaging queues). This allows Redis to act as a synchronized shared memory for cooperating applications.
Redis values can have structure, and specific commands manipulate these values in appropriate ways. Redis supports strings, which can also behave as numbers if they have the right format, lists which can also be seen as queues, and support blocking reads, sets, hashes (that is, dictionaries), and sorted sets.
All Redis commands are atomic, and it is possible to group a sequence of commands into a transaction for an all-or-nothing execution with the MULTI command. But a Redis transaction is not similar to a transaction in relational databases: it just queues all the commands and executes them when it receives the EXEC command. This means it is not possible to read any data while in a transaction.
Perhaps nothing labels Redis as a datastore for transient data more than expiry: keys can be marked for expiration (either relative from the current time, or absolute).
Redis also supports messaging but this is a topic for Day 2.
This post has a more detailed but still balanced coverage of Redis.
The documentation is well done and easy to navigate. Of all the databases I have seen so far, this is probably the best (PostgreSQL being a strong second).
I’m using Java and the Jedis client library.
The code is simple enough:

The pom.xml file:

This one is simple as well, but having a reader and a writer allowed me to try one writer and two readers.
First the writer program:

The pom.xml is a bit more complex, as it creates a self-contained jar with a MANIFEST.MF (so I can run it from the command line easily):

The reader program:

with its pom.xml:

The blpop command can block on several lists, so when it receives something it is always at least a pair: the list key, and the value.
Now, I can open three terminals to test the code: two with readers:


and one with the writer (which must be started last):


The writer will simply state


One of the readers will get the message:

but the other one will just keep waiting:


So Redis blocking queues only serve one blocking reader at a time (as they should).
The reader programs can be stopped with Ctrl-C, or by pushing finish into msg:queue from a Redis client (twice, once for each client):

And that’s all for today.
The repertoire method is really a tool to help with the intuitive step of figuring out a closed formula for a recurrence equation. It does so by breaking the original problem into smaller parts, in the hope that they might be easier to solve.
Let’s assume we have a system of recurrence equations with parameters, so that the unknown function can be expressed as a linear combination of other (unknown) functions where the coefficients are the parameters:
We can consider $g$ as a specific point in an $m$-dimensional function space (determined by both the recurrence equations and the parameters), and because $g$ is a linear combination, we can try to find $m$ base functions (hopefully known or easy to compute) $f_k(n) = \sum_{i=1}^m A_i(n)\alpha_{i_k}$ with $1 \le k \le m$, expressed in terms of $m$ linearly independent vectors $(\alpha_{1_k},\cdots,\alpha_{m_k})$.
In other words, if we can find $m$ linearly independent parameter vectors such that, for each, we have a known solution $f_k(n)$, then we can express the function $g$ as a linear combination of $f_k(n)$ for any parameters (because the $m$ $f_k(n)$ form a base for the $m$dimensional function space defined by the recurrence equations).
First, we need to check that the recurrence equations accept a solution expressed as
It is enough to plug this definition into the recurrence equations, and make sure the different parameters always remain in different terms.
Then we can either solve $f(n) = \sum_{i=1}^m A_i(n)\alpha_i$ for known $f(n)$, or for known $\alpha_i$ parameters, as long as we end up with $m$ linearly independent parameter vectors (or, as it is equivalent, $m$ linearly independent known functions for specific parameters).
It is important to keep in mind that a solution can be searched from both direction: either set a function and try to solve for the parameters, or set the parameters and solve for the function.
Given
We need to check that $g$ can be written as
The base case is trivial. The recurrence case is
so $g$ can be expressed as a linear combination of other functions, with the parameters as the coefficients.
Now, when I tried to solve this problem, I didn’t know I could set the parameters to values that would lead to an easy solution ($\gamma = 0$ turns the problem into an easy to solve generalised radix-based Josephus problem); instead I wasted a lot of time trying to find known functions and solve for the parameters, which is why I have four steps below instead of just two as in the book.
As the book suggests, I tried to solve for $g(n) = n$:
As the recurrence equation looks like the generalised radix-based Josephus equation, I tried to solve for $g(2^m+1) = 3^m$:
I tried to solve for $g(n) = 1$, as it seemed useful to solve for a constant (no linear combination of linearly independent nonconstant functions can produce a constant function).
This is the step that took me the longest, and when I finally understood I could fix the parameters, I was able to use the radix-based Josephus solution.
The recurrence equations
have as solution $g(2^m + (b_m\cdots b_0)) = 3^m + (b_m\cdots b_0)_3$.
We have the equations
We have two functions already defined ($A(n)$ and $B_1(n)$), and the other two equations give us the remaining function.
Now we can solve for $g(n)$:
The $\gamma$ term is really $h_3(n) - n$.
The $\beta_0$ term is the same as $h_3(2^m-1-l)$, as can be seen by observing that in base $3$, $3^m$ is $1$ followed by $m$ zeroes, so $3^m-1$ is $m$ twos, and $\frac{3^m-1}{2}$ is $m$ ones, in other words the same representation as the binary representation of $2^m-1$.
Now, the binary representation of $l$ is the same as the representation in base $3$ of $h_3(l)$ (by definition of $h_3$), so the binary representation of $2^m-1-l$ is the same as the representation in base $3$ of $\frac{3^m-1}{2} - h_3(l)$.
With these two observations, it is possible to rewrite $g$ as
which is the book solution.
It is enough to solve for $\alpha, \beta_0, \beta_1 \ne 0, \gamma = 0$, and to find the parameters for $g(n) = n$. The first gives $A$, $B_0$ and $B_1$ directly by the generalised radix-based Josephus solution, and the second one adds a constraint to solve for $C$ as well.
As can be seen above, approaching the problem from both directions (solving for known functions and solving for known parameters) can result in time saved, and simplified expression of the solution.
To solve this, I first observed that for $n=1$, we need $m_1$ moves, and for $n \gt 1$, we need $A(m_1, \cdots, m_{n-1}) + m_n + A(m_1, \cdots, m_{n-1})$ or $2A(m_1, \cdots, m_{n-1}) + m_n$ moves.
This leads to the solution,
which is trivially shown by induction. The base case:
And for larger $n$, assuming $A(m_1, \cdots, m_n) = \sum_{i=1}^n m_i 2^{n-i}$,
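For reference, the induction step can be spelled out directly from the recurrence $A(m_1, \cdots, m_n) = 2A(m_1, \cdots, m_{n-1}) + m_n$:

$$A(m_1, \cdots, m_n) = 2\sum_{i=1}^{n-1} m_i 2^{n-1-i} + m_n = \sum_{i=1}^{n-1} m_i 2^{n-i} + m_n 2^{n-n} = \sum_{i=1}^{n} m_i 2^{n-i}$$

since $2 \cdot 2^{n-1-i} = 2^{n-i}$, and the $m_n$ term is exactly the $i = n$ term of the sum.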
A geometric problem, but very similar to the previous intersecting lines. A zigzag is made of 3 segments, so a pair of zigzag lines can intersect at 9 different points. The first zigzag line defines two regions; each new zigzag adds a new region, plus one more for each intersection point.
This gives the following recurrence equations:
Using the linearity of the recurrence equation, it is easy to see that
Here I used the linearity to compute solutions to both $ZZ_n = ZZ_{n-1} + 9(n-1)$ and $ZZ_n = ZZ_{n-1} + 1$, which are equally trivial. Then I combined the solutions into one.
I use (again) induction to confirm the solution. The base case is $ZZ_1 = ZZ_1 + 9S_0 + 0$. And for other $n$, assuming $ZZ_n = ZZ_1 + 9S_{n-1} + (n-1)$
The formula can also be written as
Again, a geometric problem. This one gave me more trouble. It took me a while before finally seeing that a new plane’s intersection with the previous ones will be a set of intersecting lines, which defines the regions the new plane will divide in two.
The number of regions formed by intersecting lines was solved in the book, and defined as $L_n = S_n + 1$
So a plane cutting $n$ existing planes is divided by them into $L_n$ regions, each of which splits one region of space in two, giving $P_{n+1} = P_n + L_n$. This recurrence gives $P_5 = 26$ regions.
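A quick numeric check of this recurrence (a sketch; function names are mine), with $P_0 = 1$ and $L_n = S_n + 1 = n(n+1)/2 + 1$:

```python
def regions_lines(n):
    # L_n = S_n + 1: regions formed by n lines in general position
    return n * (n + 1) // 2 + 1

def regions_planes(n):
    # P_0 = 1; the (k+1)-th plane is cut by the k existing planes into
    # L_k regions, each of which splits one region of space in two
    p = 1
    for k in range(n):
        p += regions_lines(k)
    return p
```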
The book did not expect a closed formula for this exercise, as the necessary techniques are only covered in chapter 5.
The recurrence equation for $I(n)$ follows the structure of $J(n)$, but with different base cases:
Here I generated the first few values to get inspired. I noticed that $I(n)$ had batches of increasing odd values that were longer than those of $J(n)$: $3, 6, 12, 24, \cdots$.
These numbers are from the series $3\cdot 2^m$, so using the same “intuitive” step as in the book, I tried to show that $I(3\cdot 2^m + l) = 2l + 1$ with $0 \le l \lt 3\cdot 2^m$ (the formula does not work for $I(2)$, which has to be defined separately).
By induction on $m$: the base case is $I(3) = I(3\cdot 2^0 + 0) = 1$.
Assuming $I(3\cdot 2^m + l) = 2l+1$, we have
The book solution is defined in terms of $2^m+2^{m-1}+k$, which is the same:
with $1 \le m$, while I have $0 \le m$.
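As a check, here is a sketch of the closed form, together with a test that it satisfies Josephus-style recurrences $I(2n) = 2I(n) - 1$ and $I(2n+1) = 2I(n) + 1$ (an assumption on my part, since the text above only says the recurrence follows the structure of $J(n)$):

```python
def i_closed(n):
    # I(3*2^m + l) = 2l + 1 for 0 <= l < 3*2^m
    # (valid for n >= 3; I(2) is a special case not covered here)
    m = 0
    while 3 * 2 ** (m + 1) <= n:
        m += 1
    l = n - 3 * 2 ** m
    return 2 * l + 1
```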
I put the repertoire method in its own post as it was both the most difficult exercise and the one where I learned the most.
I eventually found a better way, which I document here.
There is an antlr3-maven-archetype, which I started from. However, for the purpose of clarity, I will start from scratch here.
The Maven plugin for Eclipse is called m2e (m2eclipse is an obsolete version), and is available in the default Eclipse Marketplace. However, the current version (1.0 at the time of writing) does not handle the life cycle of some common Maven plugins very well. In particular, it does not know where the generation of classes from grammar files fits into the Eclipse life cycle.
The 1.1 milestone handles this much better, so I suggest installing it. The location is http://download.eclipse.org/technology/m2e/milestones/1.1, which can be used with the “Install New Software” function.
Create a new Maven Project, and skip the archetype selection (i.e. use simple project). As I said above, I could use the ANTLR v3 archetype, but chose not to.
By default Maven uses compiler source and target version 1.5. On Mac OS X Lion, there is no JDK 1.5 (only 1.6), so I always update pom.xml to set the source and target configuration options to something meaningful:
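A minimal sketch of such a configuration (assuming Java 1.6, as discussed above):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.6</source>
        <target>1.6</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```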
I create a property for the ANTLR version, as I will need it for both the ANTLR plugin and the runtime jar:
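For example, assuming ANTLR 3.4 (the property name is my own choice):

```xml
<properties>
  <antlr.version>3.4</antlr.version>
</properties>
```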
Then I add the plugin declaration:
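A sketch of what this declaration typically looks like, binding the plugin's antlr goal:

```xml
<plugin>
  <groupId>org.antlr</groupId>
  <artifactId>antlr3-maven-plugin</artifactId>
  <version>${antlr.version}</version>
  <executions>
    <execution>
      <goals>
        <goal>antlr</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```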
Finally, I add a dependency on the ANTLR runtime:
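A sketch of the runtime dependency, reusing the version property:

```xml
<dependencies>
  <dependency>
    <groupId>org.antlr</groupId>
    <artifactId>antlr-runtime</artifactId>
    <version>${antlr.version}</version>
  </dependency>
</dependencies>
```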
At this stage, Eclipse is upset because the lifecycle configuration org.antlr:antlr3-maven-plugin:3.4:antlr is not covered. But as we’re using m2e 1.1, we can look for the appropriate connector in the m2e Marketplace. There should be only one: antlr by Sonatype, which should be installed.
This is something that the original ANTLR v3 Maven archetype suggests: to include the ANTLR runtime into the generated jar.
Using the Maven Assembly Plugin, it is possible to declare what goes into the generated jar. As the jar is then self-contained, it is also possible to declare a main class (not done below, as I did not have a main class yet):
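A sketch using the Assembly Plugin's predefined jar-with-dependencies descriptor (a main class could later be declared under a configuration/archive/manifest/mainClass element):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```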
Now, the ANTLR plugin can process code under src/main/antlr3, so we can create this folder and add it as a source folder in the Eclipse project properties. Creating or updating a grammar file in Eclipse will also create or update the generated classes.
The ANTLR connector also added the target/generated-sources/antlr3 directory as another source folder, but it will disappear when executing the Maven/Update Project Configuration action, so it is best to add it manually. You can then change the properties for this folder to check ‘Locked’ (to avoid accidental edits) and ‘Derived’ (to hide the content from the “Open Resource” command).
Note that the plugin is unable to follow the @header directive properly (that is, it will copy the directory structure of the grammar file, instead of following the directory structure implied by the @header directive), so the grammar files must use the same directory structure as the Java package intended for the generated classes. In other words, if you want your generated classes to have the package org.something, you need both to put the grammar files under src/main/antlr3/org/something, and to use the @header package directive to set the package of the generated classes.
It is also unable to handle grammar files directly under src/main/antlr3. If you try, it will generate this error when running the process-sources goal: “error(7): cannot find or open file: null/NestedNameList.g”. Running this goal is also the only way to get the error message if something is wrong with the grammar file (unless you install an ANTLR Eclipse plugin, which I didn’t try).
Small gotcha: I found that with the current version of plugins, connectors and so on, Eclipse does not detect changes to generated classes directly: it is always one change behind, especially when there are errors.
If you made a mistake in the grammar file that causes the generated classes to stop compiling, you have to change the grammar file twice for the error markers to go away: after the first change, Eclipse correctly reports that the errors in the classes are gone, but the project error markers stay; after a second change (even a trivial one: add a character, delete it, and save), the error markers finally disappear.
This is more of an annoyance than a serious problem, and in any case the files are always properly generated, so if there is no error, all files are kept up-to-date.
If you include the build-helper-maven-plugin plugin in your pom.xml, then it is possible to automatically add the relevant source folders to Eclipse:
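A sketch of such a configuration, using the plugin's add-source goal for the two folders discussed above:

```xml
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>add-source</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>add-source</goal>
      </goals>
      <configuration>
        <sources>
          <source>src/main/antlr3</source>
          <source>target/generated-sources/antlr3</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>
```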
To use it, another connector is necessary, but it is found directly in the m2e Marketplace.
Once this is in the pom.xml, just importing the project into Eclipse will create the relevant source folders automatically. However, the ‘Locked’ and ‘Derived’ flags on the target/generated-sources/antlr3 folder are stored in the workspace .metadata, so these flags have to be set manually for each workspace.
If all the above seems tedious, that is because it is. The antlr3-maven-archetype will generate much of it, but not, for instance, the additional source folders.
I have the kind of laziness that causes me to spend hours trying to save a few minutes later on, so I created my own archetype, a trivial little thing whose only purpose is to get the basic setup in place quickly.
It does not really do much, and is perhaps best seen as a template, which is why the best use is to download it, adjust it to your own needs, then install it locally.
Hope this helps.