Wakatta!

Like Eureka!, only cooler

Psychic Modeling

In The Algorithm Design Manual, Steven Skiena entertains, edifies, and educates his readers with so-called “War Stories”, that is, interesting implementation challenges from his own experience.

The first War Story is Psychic Modeling, an attempt to exploit “precognition” to improve the chances of winning the lottery.

This war story is also the subject of one of the first implementation projects, in chapter 1. A few years ago, when I bought the book, I had easily solved the previous exercises, but when I reached this implementation project, I got stuck: I could not even produce a high-level sketch of what a solution would look like.

Certainly, if I was unable to solve an exercise from the first chapter, the rest of the book was hopelessly beyond my reach…

Still, I kept the ambition of one day resuming my reading, and from time to time I would give this problem another attempt.

Recently, all the pieces finally fell into place, and after a few hours of coding I had a (naive) implementation. Yet I still have doubts, as the only reference I can compare my solution with, Skiena’s own paper (Randomized Algorithms for Identifying Minimal Lottery Ticket Sets), apparently does worse (in terms of necessary tickets) than my solution…

Note on this paper: unfortunately it is in Word format, and some characters are not displayed properly by non-Microsoft text processing tools (such as Open Office), so you might have to open it with MS Word or MS Word Viewer.

The problem

I will use the notation from the book rather than the paper. The problem is defined as follows:

  • a lottery ticket has $k$ numbers
  • a win requires $l$ numbers from the winning ticket
  • the psychic visualises $n$ numbers
  • of which $j$ are “guaranteed” to be on the winning ticket.

Defining “sufficient coverage”

A first difference between the paper’s approach and mine is that I use the notion of coverage size rather than distance: I measure how similar two subsets are by defining their cover as the size of their intersection; the paper’s authors use a notion of distance defined as the size of the difference of the two subsets (perhaps to help with the design of heuristics in the backtracking version of their algorithm).

Now, clearly the two approaches are equivalent; it is less clear that the formulas derived from them are indeed the same.

For a given $j$-subset, how many $j$-subsets have a coverage of at least $l$ with the first one? The covered $j$-subsets must have at least $l$ numbers (between $l$ and $j$, to be precise) in common with the first one, and the rest taken from the $n-j$ other numbers. This gives

\begin{align} \sum_{l \le i \le j} \binom{j}{i} \binom{n-j}{j-i} \end{align}

For a given $j$-subset, how many $j$-subsets are within distance $j-l$ of the first one? We can choose at most $j-l$ numbers out of the $n-j$ remaining ones, and complete with numbers from the first subset. This gives

\begin{align} \sum_{0 \le i \le j-l} \binom{n-j}{i} \binom{j}{j-i} = \sum_{0 \le i \le j-l} \binom{n-j}{i} \binom{j}{i} \end{align}

It took me a while to confirm it, but the formulas are indeed the same:

\begin{align} \sum_{0 \le i \le j-l} \binom{n-j}{i} \binom{j}{i} & = \sum_{0 \le i \le j-l} \binom{n-j}{i} \binom{j}{j-i}\\ & = \sum_{l-j \le i \le 0} \binom{n-j}{-i} \binom{j}{j+i}&&\text{changing the sign of \(i\)}\\ & = \sum_{l \le j+i \le j} \binom{n-j}{-i} \binom{j}{j+i}\\ & = \sum_{l \le i \le j} \binom{n-j}{j-i} \binom{j}{i}&&\text{replacing \(j+i\) by \(i\)}\\ \end{align}
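The equality can also be checked numerically. Here is a quick sanity sketch (the function names `coverCount` and `distCount` are mine, not part of the implementation below) that compares both sums over a range of small parameters:

```haskell
-- factorial and binomial coefficient; integer division is exact here,
-- and conveniently yields 0 when the "choose" is impossible
fact :: Integer -> Integer
fact n = product [1..n]

combi :: Integer -> Integer -> Integer
combi n c = fact n `div` (fact c * fact (n - c))

-- coverage-based count: j-subsets sharing at least l numbers with a fixed one
coverCount :: Integer -> Integer -> Integer -> Integer
coverCount n j l = sum [combi j i * combi (n - j) (j - i) | i <- [l..j]]

-- distance-based count: j-subsets within distance j-l of a fixed one
distCount :: Integer -> Integer -> Integer -> Integer
distCount n j l = sum [combi (n - j) i * combi j i | i <- [0..j-l]]

-- the two counts agree on every small parameter combination
formulasAgree :: Bool
formulasAgree = and [coverCount n j l == distCount n j l
                    | n <- [1..12], j <- [1..n], l <- [0..j]]
```

For instance, for $n=15$, $j=5$, $l=4$, both sums give $51$.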

Size of a ticket

Note that I do not use $k$, the size of a ticket. In fact, in my original design I used it but ignored $j$; reading the paper I realised that $j$ was indeed critical: one of the $j$-subsets will be on the winning ticket, so those are the ones we need to cover. However, I could not understand why the paper did not use the potentially larger size of a ticket to cover more $j$-subsets.

Restated with a complete ticket, the coverage formula becomes

\begin{align} \sum_{l \le i \le j} \binom{k}{i} \binom{n-k}{j-i} \end{align}

This apparently small change significantly reduces the lower bound on the number of necessary tickets. For $n=15$, $k=6$, $j=5$, $l=4$, for instance, while the paper offers a lower bound of $58$, the formula above gives $22$.
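Concretely, the two bounds can be compared with a quick computation; this anticipates the `ticketCover` and `lowerBound` functions defined in the implementation below:

```haskell
fact :: Integer -> Integer
fact n = product [1..n]

combi :: Integer -> Integer -> Integer
combi n c = fact n `div` (fact c * fact (n - c))

-- number of j-subsets sufficiently covered by one k-number ticket
ticketCover :: Integer -> Integer -> Integer -> Integer -> Integer
ticketCover n k j l = sum [combi k i * combi (n - k) (j - i) | i <- [l..j]]

-- total number of j-subsets divided by the cover of a single ticket
lowerBound :: Integer -> Integer -> Integer -> Integer -> Double
lowerBound n k j l = fromIntegral (combi n j) / fromIntegral (ticketCover n k j l)

-- lowerBound 15 5 5 4 ~ 58.88  (j-sized tickets, as in the paper)
-- lowerBound 15 6 5 4 ~ 21.30  (full k-sized tickets)
```

A $k$-sized ticket covers $141$ subsets here instead of $51$, hence the much smaller bound.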

So the question is: is it valid to use the possibly larger value $k$ when generating tickets? I could not think of any reason not to, and if I’m right, this gives each ticket a much larger cover, and therefore a lower number of necessary tickets.

Implementation

For a first effort, I chose to code in Haskell, and favoured simplicity over speed. The code is indeed both simple and wasteful, but Moore’s Law suggests that computers have become about 1000 times faster since the paper was written, so I have some margin.

To keep things simple, sets and subsets are just lists.

Support functions

Such functions ought to belong to a dedicated library (and perhaps they do); I include them to keep the implementation mostly self-contained.

Support functions
fact n = product [1..n]
combi n c = (fact n) `div` (fact c * fact (n-c))

remainingNumbers js = foldr union [] js

fact is just the factorial; combi computes the binomial coefficient, and remainingNumbers is just the union of all the passed $j$-subsets.

Generating Combinations
genCombi 0 _ = [[]]
genCombi _ [] = []
genCombi k (l:ls) = [l:cs | cs <- genCombi (k-1) ls] ++
                    genCombi k ls

genCombi k s generates the $k$-subsets of $s$.
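A quick usage example, assuming `genCombi` as defined above:

```haskell
genCombi :: Int -> [a] -> [[a]]
genCombi 0 _  = [[]]
genCombi _ [] = []
genCombi k (l:ls) = [l:cs | cs <- genCombi (k-1) ls] ++ genCombi k ls

-- all 2-subsets of [1..4], in lexicographic order
pairs :: [[Int]]
pairs = genCombi 2 [1..4]
-- [[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]]
```

As expected, `genCombi k [1..n]` yields $\binom{n}{k}$ subsets, e.g. `length (genCombi 3 [1..7])` is `35`.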

Lower Bound Estimate

These are simple implementations of the formula above.

Lower Bound Estimates
ticketCover n k j l = sum [ (combi k i) *
                            (combi (n-k) (j-i)) | i <- [l..j]]
lowerBound n k j l = (fromIntegral $ combi n j) /
                     (fromIntegral $ ticketCover n k j l)

ticketCover just implements the coverage estimate I defined above (the one that uses $k$); lowerBound divides the total number of $j$-subsets, $\binom{n}{j}$, by the cover of a single ticket, giving a lower bound on the number of tickets needed for a single win.

Coverage

As stated above, I define the cover between two subsets as the size of their intersection, and define sufficient coverage as the cover being at least $l$.

Defining Coverage
cover l1 l2 = length $ intersect l1 l2

coveredP l t j = l <= cover t j
notCoveredP l t j = l > cover t j

notCovered l t js = filter (notCoveredP l t) js
notCoveredBatch l ts js = foldr (notCovered l) js ts

coverageScore l t js = length $ filter (coveredP l t) js

cover implements the cover definition; coveredP and notCoveredP are predicates that check for (or against) sufficient coverage.

notCovered and notCoveredBatch compute the subsets that are not covered by a single ticket or by a set of tickets, respectively; they are used to compute what is left to cover after selecting a ticket, and to check solutions.

Finally, coverageScore computes the number of subsets covered by a ticket. This function is used to compare potential tickets and select the one with the best (i.e. largest) coverage.
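A few concrete calls illustrate these definitions (the coverage functions are reproduced here to keep the example self-contained):

```haskell
import Data.List (intersect)

cover :: Eq a => [a] -> [a] -> Int
cover l1 l2 = length (intersect l1 l2)

coveredP, notCoveredP :: Eq a => Int -> [a] -> [a] -> Bool
coveredP l t s = l <= cover t s
notCoveredP l t s = l > cover t s

coverageScore :: Eq a => Int -> [a] -> [[a]] -> Int
coverageScore l t js = length (filter (coveredP l t) js)

example :: (Int, Bool, Int)
example = ( cover [1,2,3] [2,3,4]            -- two numbers in common
          , coveredP 2 [1,2,3] [2,3,4]       -- 2 <= 2, so covered
          , coverageScore 2 [1,2,3] [[1,2,4],[1,4,5],[2,3,5]] )
          -- the ticket [1,2,3] covers 2 of the 3 subsets
```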

Checking the estimates
checkFormula n k j l =
  let candidates = genCombi j [1..n]
      ticket = [1..k]
      covered = filter (coveredP l ticket) candidates
  in length covered

checkFormula computes the size of the coverage of a single ticket; it can be used to confirm the value of ticketCover above (and as far as I can tell from my checks, it does).
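Putting the pieces together, the exhaustive count and the closed formula can be compared directly; this standalone sketch checks one small case:

```haskell
import Data.List (intersect)

fact :: Integer -> Integer
fact n = product [1..n]

combi :: Integer -> Integer -> Integer
combi n c = fact n `div` (fact c * fact (n - c))

-- closed-formula count of j-subsets covered by a k-number ticket
ticketCover :: Integer -> Integer -> Integer -> Integer -> Integer
ticketCover n k j l = sum [combi k i * combi (n - k) (j - i) | i <- [l..j]]

genCombi :: Int -> [a] -> [[a]]
genCombi 0 _  = [[]]
genCombi _ [] = []
genCombi k (l:ls) = [l:cs | cs <- genCombi (k-1) ls] ++ genCombi k ls

coveredP :: Eq a => Int -> [a] -> [a] -> Bool
coveredP l t s = l <= length (intersect t s)

-- exhaustive count of j-subsets covered by the ticket [1..k]
checkFormula :: Int -> Int -> Int -> Int -> Int
checkFormula n k j l =
  length (filter (coveredP l [1..k]) (genCombi j [1..n]))

-- e.g. checkFormula 8 4 3 2 and ticketCover 8 4 3 2 both give 28
```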

Solution Loop

The solution loop takes the problem parameters and a ticket-candidate generating function; it then gets one ticket at a time, computes the $j$-subsets not covered yet, and repeats until the set of remaining $j$-subsets becomes empty.

Solution Loop
solve n k j l gc =
  let jtuples = genCombi j [1..n]
  in loop jtuples
 where loop [] = return []
       loop js = do
         t <- gc n k j l js
         ts <- loop $ notCovered l t js
         return (t:ts)

The solve function expects the candidate generating function to return its result in a monad; this makes it possible to use random number generators.
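In other words, the expected shape of a generator is roughly `Int -> Int -> Int -> Int -> [[Int]] -> IO [Int]` (my sketch below specialises the monad to IO; the original solve is polymorphic). To illustrate the interface with something deterministic, here is a hypothetical generator, gcFirst (my name, not part of the original code), that simply extends the first uncovered $j$-subset into a $k$-number ticket:

```haskell
import Data.List (intersect)

genCombi :: Int -> [a] -> [[a]]
genCombi 0 _  = [[]]
genCombi _ [] = []
genCombi k (l:ls) = [l:cs | cs <- genCombi (k-1) ls] ++ genCombi k ls

notCovered :: Int -> [Int] -> [[Int]] -> [[Int]]
notCovered l t = filter (\s -> l > length (intersect t s))

-- the same loop as solve above, specialised to IO
solve :: Int -> Int -> Int -> Int
      -> (Int -> Int -> Int -> Int -> [[Int]] -> IO [Int]) -> IO [[Int]]
solve n k j l gc = loop (genCombi j [1..n])
  where loop [] = return []
        loop js = do
          t  <- gc n k j l js
          ts <- loop (notCovered l t js)
          return (t:ts)

-- hypothetical deterministic generator: take the first uncovered
-- j-subset and pad it with other numbers up to k
gcFirst :: Int -> Int -> Int -> Int -> [[Int]] -> IO [Int]
gcFirst n k _ _ (s:_) = return (take k (s ++ filter (`notElem` s) [1..n]))
gcFirst _ _ _ _ []    = return []
```

Since each ticket fully contains the head $j$-subset, the uncovered list shrinks at every iteration, so the loop terminates; the resulting cover is of course far from minimal.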

Naive Ticket Selection

I do not really know how to navigate subsets, so I won’t try to implement a backtracking solution as described in the paper. Instead, I use what is really the simplest greedy algorithm: when a new ticket is needed, take the one that has the best coverage among all the possible tickets:

Naive Ticket Selection
getCandidate n k j l js =
  let numbers = remainingNumbers js
      tickets = genCombi k numbers
      ticketsScore = map (\t -> (coverageScore l t js, t)) tickets
  in return $ snd $ maximumBy (comparing fst) ticketsScore

So, for each remaining set of $j$-subsets, generate all the $k$-subsets and compare their coverage.

Needless to say, this function does not return anything anytime soon for even slightly large values of $n$.
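To give an idea of the cost (a rough count, not a measurement): the naive generator scores every possible ticket against every remaining $j$-subset, and the candidate counts grow quickly:

```haskell
fact :: Integer -> Integer
fact n = product [1..n]

combi :: Integer -> Integer -> Integer
combi n c = fact n `div` (fact c * fact (n - c))

-- (tickets to score, j-subsets to score them against)
-- for the (n, k, j) problem sizes used later in this post
costs :: [(Integer, Integer)]
costs = [ (combi n k, combi n j)
        | (n, k, j) <- [(15, 6, 5), (18, 10, 7)] ]
-- [(5005,3003),(43758,31824)]
```

So for $n=18$, the first greedy pass alone already computes roughly $43758 \times 31824 \approx 1.4 \times 10^9$ intersections, and the loop repeats that work for each selected ticket.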

Randomised Ticket Selection

To improve the performance (well, to get a result in my lifetime), I use what I understand to be the same approach as in the paper: generate $\beta$ tickets, compare their coverage of the remaining subsets, and keep the best one.

The difference from the paper, as mentioned before, is that my tickets are $k$-subsets rather than $j$-subsets.

I first need a function to generate a random combination. I’m using a method derived from Knuth (no reference as I don’t have Volume 4 just yet).

Sample Generation
sample 0 _ = return []
sample _ [] = return []
sample k ds = do
  s <- sample (k-1) (tail ds)
  p <- randomRIO (0, (length ds - 1))
  let t = ds!!p
  if not (t `elem` s)
    then return (t:s)
    else return (head ds:s)

The generating function is very similar to the naive one:

Randomised Generating Function
getCandidateRandom beta n k j l js = do
  let numbers = remainingNumbers js
  tickets <- replicateM beta (sample k numbers)
  let ticketsScore = map (\t -> (coverageScore l t js, t)) tickets
  return $ snd $ maximumBy (comparing fst) ticketsScore

The only difference is the candidate ticket set: the naive function generates them all; the randomised one selects $\beta$ of them at random.

Compatibility with the paper version

By using solve n j j l instead of solve n k j l, my implementation should compute subset coverage the same way the paper’s implementation does.

Testing and Results

I will not compare speed, as this would be meaningless. But I can check whether different values for ticket size can indeed help reduce the size of the covering set.

Let’s start with a very simple problem, where $n=5$, $k=3$, $j=3$ and $l=2$.

I don’t really need to generate the $j$-subsets, but doing so lets me check the solution.

The solution itself is computed by passing a ticket generating function; I could have used getCandidate, but here I’m passing getCandidateRandom with a $\beta=100$.

The notCovered set is empty, so the solution is at least a covering one.

The solution has two tickets, and the lower bound confirms it is pretty good.

*Main> let problem = genCombi 3 [1..5]
*Main> solution <- solve 5 3 3 2 (getCandidateRandom 100)
*Main> solution
[[3,4,5],[1,2,4]]
*Main> notCoveredBatch 2 solution problem 
[]
*Main> lowerBound 5 3 3 2
1.4285714285714286

Next test, with $n=15$, $k=5$, $j=5$ and $l=4$. The paper reports that they found a solution with $137$ tickets. As $k=j$, my algorithm cannot really beat that (and indeed finds a solution of the same size, if I try a couple of times):

*Main> let problem = genCombi 5 [1..15]
*Main> solution <- solve 15 5 5 4 (getCandidateRandom 100)
*Main> length solution
137
*Main> notCoveredBatch  4 solution problem
[]
*Main> lowerBound 15 5 5 4
58.88235294117647

For the next test, I should have a better solution than the paper, as $k$ is larger than $j$: $n=15$, $k=6$, $j=5$, $l=4$.

The paper has a lower bound of $58$, and a solution of size $138$, but my lower bound is $22$, and my solution has size $57$.

*Main> let problem = genCombi 5 [1..15]
*Main> solution <- solve 15 6 5 4 (getCandidateRandom  100)
*Main> length solution
57
*Main> notCoveredBatch 4 solution problem
[]
*Main> lowerBound 15 6 5 4
21.29787234042553

When the difference between $k$ and $j$ becomes large, the solution improves significantly: with $n=18$, $k=10$, $j=7$, $l=6$, the paper has a lower bound of $408$, mine is $18$. The paper’s solution has size $1080$, but mine is just $73$.

*Main> let problem = genCombi 7 [1..18]
*Main> solution <- solve 18 10 7 6 (getCandidateRandom 100)
*Main> length solution
73
*Main> notCoveredBatch 6 solution problem
[]
*Main> lowerBound 18 10 7 6
17.68

Wrapping up

Even if my approach is ultimately wrong, I can say I must be close to an actual solution. I could (and, given time, probably will) rewrite my solution in C and focus on performance.

So I declare this problem conquered, and I will resume my reading.

Complete code

Psychic Modeling Implementation (psychic.hs) download
import Data.List
import Data.Ord

import Control.Monad
import System.Random

-- helpers functions
fact n = product [1..n]
combi n c = (fact n) `div` (fact c * fact (n-c))

remainingNumbers js = foldr union [] js

-- generate combinations
genCombi 0 _ = [[]]
genCombi _ [] = []
genCombi k (l:ls) = [l:cs | cs <- genCombi (k-1) ls] ++
                    genCombi k ls

-- estimates for the solution size
ticketCover n k j l = sum [ (combi k i) *
                            (combi (n-k) (j-i)) | i <- [l..j]]
lowerBound n k j l = (fromIntegral $ combi n j) /
                     (fromIntegral $ ticketCover n k j l)

-- definition of coverage measure
cover l1 l2 = length $ intersect l1 l2

-- predicate functions: check cover between ticket and j-subset
coveredP l t j = l <= cover t j
notCoveredP l t j = l > cover t j

-- keep only j-subsets that are not covered by tickets
notCoveredBatch l ts js = foldr (notCovered l) js ts
notCovered l t js = filter (notCoveredP l t) js

-- length of j-subsets from js that are sufficiently covered by t
coverageScore l t js = length $ filter (coveredP l t) js

-- check the coverage of a single k number ticket on the C_j^n
-- potentials; can be compared against ticketCover estimate
checkFormula n k j l =
  let candidates = genCombi j [1..n]
      ticket = [1..k]
      covered = filter (coveredP l ticket) candidates
  in length covered

-- compute solution given a candidate generator
solve n k j l gc =
  let jtuples = genCombi j [1..n]
  in loop jtuples
 where loop [] = return []
       loop js = do
         t <- gc n k j l js
         ts <- loop $ notCovered l t js
         return (t:ts)

-- naive candidate generator
getCandidate n k j l js =
  let numbers = remainingNumbers js
      tickets = genCombi k numbers
      ticketsScore = map (\t -> (coverageScore l t js, t)) tickets
  in return $ snd $ maximumBy (comparing fst) ticketsScore

-- Knuth method for generating combination
-- adapted for arbitrary set
sample 0 _ = return []
sample _ [] = return []
sample k ds = do
  s <- sample (k-1) (tail ds)
  p <- randomRIO (0, (length ds - 1))
  let t = ds!!p
  if not (t `elem` s)
    then return (t:s)
    else return (head ds:s)

-- random candidate generator
getCandidateRandom beta n k j l js = do
  let numbers = remainingNumbers js
  tickets <- replicateM beta (sample k numbers)
  let ticketsScore = map (\t -> (coverageScore l t js, t)) tickets
  return $ snd $ maximumBy (comparing fst) ticketsScore
