## I’m Above Average and So Are You

Derek Sivers’ short essay “I Assume I’m Below Average” recently went viral on Hacker News. The article begins:

Ninety-six percent of cancer patients claim to be in better health than the average cancer patient.

Ninety-four percent of professors say they are better-than-average teachers.

Ninety percent of students think they are more intelligent than the average student.

Ninety-three percent of drivers say they are safer-than-average drivers.

What the article doesn’t discuss, however, is whether the people in these statistics are actually wrong.

There’s no fundamental reason why you can’t have 90% of people be better than average. For example, more than 99.9% of people have an above-average number of legs. And more than 90% of people commit fewer felonies than average. These examples are obvious, but they’re not so different from some of the examples above.
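
As a sanity check, here’s the legs example as a tiny Python snippet (the population numbers are made up purely for illustration):

```python
# A made-up population: 998 people with 2 legs, 2 people with 1 leg.
legs = [2] * 998 + [1] * 2

average = sum(legs) / len(legs)                        # 1.998
above_average = sum(1 for x in legs if x > average)

print(f"{above_average / len(legs):.1%} have an above-average number of legs")
# 99.8% have an above-average number of legs
```

A couple of outliers pull the mean below the typical value, so almost everyone sits above it.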

Any cancer patient who isn’t on the verge of dying is probably doing above average. (Unfortunately, so many cancer patients die that the 96% figure is probably still over-optimistic.)

What about the fraction of students that are above average? It depends on the class. Looking at Stanford’s data for the grade distribution of their graduate Algorithms class, about 78% of students (anyone with an A or A+) have an above-average grade.

This type of skew will happen in any probability distribution where a small number of people bring the average way down. I suspect another example of this is car drivers (consider the fact that, in America, 1 in 7 drivers don’t wear a seatbelt, or that 1 in 50 drivers admit to driving drunk in the past month!). Whether or not university professors fit this type of distribution is harder to tell.

To be clear, the examples given at the beginning are pretty extreme, and people probably are overly optimistic about their standings. But the examples aren’t as extreme as they look at first glance.

This reminds me of an old joke about there being an imaginary town called Lake Wobegon, “where all the women are strong, all the men are good-looking, and all the children are above average.” The joke, of course, is that the children can’t actually all be above average. But almost all of them can be. You just need one to bring the average down. (Poor Timmy.)

## What is the actual infection rate at universities?

Here’s a question: what fraction of students at MIT have COVID-19 right now?

Since MIT tests students twice a week, this question should be pretty easy to answer. But it’s not.

That’s because MIT, along with other universities in the area, doesn’t report their infection rate. What they report is the positive test rate.

The Positive Rate is computed as

${\frac{\text{\# positive tests}}{\text{\# tests performed}}.}$

At first glance, this might sound like the fraction of the population that has the virus, but it’s not. That’s because, once you test positive, you don’t get any additional tests for 14 days (the length of a quarantine). That means that, in the time that a healthy person would get tested four times, an infected person only gets tested once. So MIT’s reported positive rate of 0.22% should be scaled up by roughly a factor of four: the true percent of the MIT population with coronavirus is probably more like

$\displaystyle 4 \cdot 0.22\% = 0.88\%.$

In my view, that’s pretty bad. Roughly speaking, this suggests that every two weeks, each student has a ${0.0088}$ chance of getting the virus. If a student lives at MIT for a year, their chance of avoiding the virus extrapolates to

$\displaystyle (1 - 0.88 / 100)^{26} < 0.8.$

So, by the end of the year, we may be looking at roughly 20% of MIT’s population having had the virus.
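
Here’s that back-of-the-envelope arithmetic in Python, using the same rough assumptions as above (a 0.22% positive rate, a factor-of-4 testing correction, and 26 two-week periods per year):

```python
positive_rate = 0.0022           # reported positive test rate (0.22%)
testing_correction = 4           # an infected student gets tested ~4x less often

infection_rate = testing_correction * positive_rate    # per two-week period
p_avoid = (1 - infection_rate) ** 26                   # chance of avoiding it all year

print(infection_rate)   # 0.0088
print(p_avoid)          # ≈ 0.79, so roughly 21% infected by year's end
```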

Of course, my extrapolation could be way off. But that’s the problem: MIT should be providing data that makes it easy to perform this sort of extrapolation, and they’re not.

Other schools are doing the same thing. Here’s Northeastern University’s infographic.

In my opinion, this graphic crosses the line into intellectual dishonesty. The sea of blue is intended to comfort the reader without providing any useful information.

This is incredibly harmful. I’ve heard first hand from students who are convinced that the virus risk is low enough that they should just continue business as usual. Graphics like the one above inappropriately perpetuate that belief. In a crisis where bad math can literally kill, universities should do better.

## My favorite example of: the probabilistic method

The “probabilistic method” is the art of applying probabilistic thinking to non-probabilistic problems. Applications of the probabilistic method often feel like magic. Here is my favorite example:

Theorem (Erdös, 1965). Call a set ${X}$ sum-free if for all ${a, b \in X}$, we have ${a + b \not\in X}$. For any finite set ${S}$ of positive integers, there is a sum-free subset ${X \subseteq S}$ of size ${|X| > |S| / 3}$.

This theorem seems to have nothing to do with probability. Yet Erdös’s proof relies on a short and beautiful probabilistic argument.

Proof. Erdös’s first insight is to treat the set ${S}$ as living in ${\mathbb{Z}_p}$ for some large prime number ${p}$, rather than living in ${\mathbb{Z}}$. Living in ${\mathbb{Z}_p}$ only makes the theorem harder to prove, since any set ${X}$ that is sum-free in ${\mathbb{Z}_p}$ is also sum-free in ${\mathbb{Z}}$. But, as we shall see, ${\mathbb{Z}_p}$ has several structural properties that Erdös beautifully exploits.

Let ${p = 3k + 2}$ be a prime number satisfying ${p > s}$ for all ${s \in S}$. (The fact that such a prime exists is actually nontrivial, and uses the fact that there are infinitely many primes congruent to ${2}$ mod ${3}$.) We’ll be using ${p}$ and ${k}$ throughout the rest of the proof.

Erdös begins his construction by defining a large sum-free set in ${\mathbb{Z}_p}$ which is not necessarily a subset of ${S}$:

$\displaystyle \text{MIDDLE-THIRD} = \{k + 1, k + 2, \ldots, 2k + 1\}.$

As the name suggests, MIDDLE-THIRD consists of what is roughly the middle third of ${\mathbb{Z}_p}$. One can easily confirm that MIDDLE-THIRD is sum-free in ${\mathbb{Z}_p}$.

Next, Erdös selects a random value ${r \in \{1, 2, \ldots, p - 1\}}$, and examines the set

$\displaystyle X_r = S \cap (r \cdot \text{MIDDLE-THIRD}).$

In words, ${X_r}$ is the set of ${s \in S}$ that can be written as ${rx \pmod p}$ for some ${x \in \text{MIDDLE-THIRD}}$. If we could show that ${X_r}$ is a sum-free set of size greater than ${|S| / 3}$, then the proof would be complete.

The good news is that ${X_r}$ is always sum-free, no matter the value of ${r}$. This is because, if we have ${a + b = c}$ for some ${a, b, c \in X_r}$, then we also have ${a / r + b / r = c / r}$, violating the sum-freeness of MIDDLE-THIRD. (Here we are implicitly exploiting a nontrivial property of ${\mathbb{Z}_p}$, namely that it supports division!)

The bad news is that ${|X_r|}$ could potentially be too small. To prove the theorem, we need to show that there is some option for ${r}$ that results in ${|X_r| > |S| / 3}$.

Rather than directly finding a value of ${r}$ for which ${|X_r| > |S| / 3}$, Erdös instead proves a probabilistic statement: the expected value of ${|X_r|}$ for a randomly chosen ${r}$ satisfies ${\mathbb{E}[|X_r|] > |S| / 3}$. Of course, if ${\mathbb{E}[|X_r|] > |S| / 3}$, then there must exist some ${r}$ for which ${|X_r| > |S| / 3}$. What’s neat is that we never actually construct ${r}$; we just prove that it exists.

Our new challenge is to prove that ${\mathbb{E}[|X_r|] > |S| / 3}$. One way to think of ${|X_r|}$ is as the number of ${x \in \{k + 1, k + 2, \ldots, 2k + 1\}}$ for which ${rx \in S}$. Thus we can expand ${\mathbb{E}[|X_r|]}$ as

$\displaystyle \mathbb{E}[|X_r|] = \sum_{x = k + 1}^{2k + 1} \Pr[rx \in S].$

This is where Erdös really exploits the structure of ${\mathbb{Z}_p}$. In ${\mathbb{Z}_p}$, every non-zero element ${x}$ is an additive generator, meaning that the multiples ${x, 2x, 3x, \ldots, (p - 1)x}$ of ${x}$ are just the numbers ${1, 2, \ldots, p - 1}$ in some order. The result is that ${rx}$ is just a random element of ${\{1, 2, \ldots, p - 1\}}$, and thus

$\displaystyle \mathbb{E}[|X_r|] = \sum_{x = k + 1}^{2k + 1} \frac{|S|}{p - 1}$

$\displaystyle = (k + 1) \frac{|S|}{p - 1}$

$\displaystyle > |S| / 3.$

Putting the pieces together, we have shown that ${\mathbb{E}[|X_r|] > |S| / 3}$, which implies that there is some ${r}$ for which ${|X_r| > |S| / 3}$. Since ${X_r}$ is sum-free, this completes the proof of the theorem.

Magic!
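
The whole argument is easy to verify by brute force on a small example. Here’s a Python sketch with ${S = \{1, \ldots, 10\}}$ and ${p = 11 = 3 \cdot 3 + 2}$ (the variable names are mine, not Erdös’s):

```python
# Brute-force check of the argument for S = {1, ..., 10}.
# Take p = 11 = 3*3 + 2, so k = 3 and MIDDLE-THIRD = {4, 5, 6, 7}.
S = set(range(1, 11))
p, k = 11, 3
middle_third = set(range(k + 1, 2 * k + 2))

def is_sum_free(X):
    """Check sum-freeness over the integers."""
    return all((a + b) not in X for a in X for b in X)

best = set()
for r in range(1, p):
    # X_r = elements of S expressible as r*x (mod p) with x in MIDDLE-THIRD.
    X_r = {s for s in S if any(s == (r * x) % p for x in middle_third)}
    assert is_sum_free(X_r)   # sum-free in Z_p, hence in Z
    if len(X_r) > len(best):
        best = X_r

print(sorted(best), len(best))   # a sum-free subset with more than 10/3 elements
```

Every ${X_r}$ comes out sum-free, and the best one beats ${|S| / 3}$, exactly as the expectation argument promises.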

## My favorite example of: the pigeonhole principle

This is part of a new sequence of posts titled,

My favorite example of: ${x}$,

for different values of ${x}$. Today, ${x}$ is the pigeonhole principle.

The Erdös-Szekeres Theorem: Consider any sequence of ${n}$ distinct numbers. There must exist a subsequence ${S}$ of ${\sqrt{n}}$ numbers such that ${S}$ is either entirely increasing or entirely decreasing.

The Erdös-Szekeres Theorem is a classic result in permutation combinatorics. Although the theorem was first presented in 1935, the proof I’m going to describe appears in the 1959 paper “A simple proof of a theorem of Erdös and Szekeres”, written by Abraham Seidenberg. Seidenberg’s proof is arguably one of the slickest applications of the pigeonhole principle ever.

The Proof.

Consider a sequence ${x = x_1, x_2, \ldots, x_n}$ of ${n}$ distinct numbers. For each number ${x_i}$, define

$\displaystyle a_i = \text{ length of longest increasing subsequence ending in }x_i$

and define

$\displaystyle b_i = \text{ length of longest decreasing subsequence ending in }x_i.$

For example, if ${x = 1, 4, 5, 3, 2}$, then ${a_5 = 2}$ is the length of the longest increasing subsequence ending in ${x_5 = 2}$ and ${b_5 = 3}$ is the length of the longest decreasing subsequence ending in ${x_5 = 2}$.

In order to prove the theorem, Seidenberg made one crucial observation:

Observation: If ${i < j}$ then either ${a_i \neq a_j}$ or ${b_i \neq b_j}$.

To see why the observation holds, consider the case where ${x_i < x_j}$. Then for any increasing subsequence ${S}$ that ends in ${x_i}$, the longer subsequence ${S \cup \{x_j\}}$ is an increasing subsequence that ends in ${x_j}$. This implies that ${a_j \ge a_i + 1}$ and thus that ${a_i \neq a_j}$.  A similar argument can be used in the case where ${x_i > x_j}$ to show that  ${b_j \ge b_i + 1}$, and thus that ${b_i \neq b_j}$. In both cases, the observation holds.

In order to prove the theorem, our goal is to show that there is some ${i}$ for which either ${a_i \ge \sqrt{n}}$ or ${b_i \ge \sqrt{n}}$. Seidenberg realized that he could prove this with a simple application of the pigeonhole principle.

Suppose for contradiction that ${a_i < \sqrt{n}}$ and ${b_i < \sqrt{n}}$ for all ${i}$. Then there are fewer than ${\sqrt{n} \cdot \sqrt{n} = n}$ possibilities for the value of each pair ${(a_i, b_i)}$. It follows by the pigeonhole principle that at least two of the pairs

$\displaystyle (a_1, b_1), (a_2, b_2), (a_3, b_3), \ldots, (a_n, b_n)$

are equal to one another. That is, there exists $i \neq j$ such that $(a_i, b_i) = (a_j, b_j)$. But this contradicts Seidenberg’s observation, and thus completes the proof.
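
Here’s a short Python sketch of the proof’s machinery: an ${O(n^2)}$ dynamic program computes the pairs ${(a_i, b_i)}$, and we can check both the observation and the theorem on random sequences:

```python
import math
import random

def monotone_pairs(x):
    """For each i, compute (a_i, b_i): lengths of the longest increasing /
    decreasing subsequences ending at x_i, via an O(n^2) dynamic program."""
    n = len(x)
    a, b = [1] * n, [1] * n
    for j in range(n):
        for i in range(j):
            if x[i] < x[j]:
                a[j] = max(a[j], a[i] + 1)
            else:                      # x[i] > x[j], since entries are distinct
                b[j] = max(b[j], b[i] + 1)
    return list(zip(a, b))

# The worked example from the text: x = 1, 4, 5, 3, 2 gives (a_5, b_5) = (2, 3).
assert monotone_pairs([1, 4, 5, 3, 2])[4] == (2, 3)

random.seed(0)
for _ in range(100):
    n = 50
    x = random.sample(range(1000), n)
    pairs = monotone_pairs(x)
    assert len(set(pairs)) == n                          # Seidenberg: pairs distinct
    assert max(max(p) for p in pairs) >= math.sqrt(n)    # the theorem

print("all checks passed")
```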

## The Many Quirks of Qsort

In 1973, Dennis Ritchie, the author of the C programming language, was such a fan of the quicksort algorithm that he decided to name the language’s sort function after it.

In this post, I’m going to dive into some of the most interesting (and bizarre) aspects of the qsort function. I’m going to be focusing on the GNU C library version of qsort—you can find the code here and here.

The m in qsort. What if I told you that the world’s most famous implementation of quicksort actually uses… mergesort.

It’s true. The GNU implementation of qsort directly calls mergesort as the default sorting algorithm. It’s hard to tell exactly when this change occurred—based on the version history of glibc, it looks like the modification happened sometime between 1992 and 1995.

But why? I haven’t been able to find any documentation describing the reason for the transition from quicksort to mergesort, but I do have a conjecture. Although quicksort is typically faster than mergesort in practice, mergesort has one advantage: it performs fewer total comparisons. In the C library, this ends up being important, because the sort function performs comparisons by dereferencing a pointer to a comparison function. This makes comparisons much more expensive than they should be, giving mergesort an advantage.
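
My timing experiments used the library code itself, but the comparison-count gap is easy to reproduce with textbook versions of the two algorithms. Here’s a Python sketch (not glibc’s code) that counts calls to a comparator function:

```python
import random

comparisons = 0

def cmp_lt(a, b):
    """Stand-in for the comparator that qsort invokes through a function
    pointer; here we just count how often it gets called."""
    global comparisons
    comparisons += 1
    return a < b

def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if cmp_lt(right[j], left[i]):
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    return out + left[i:] + right[j:]

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, less, more = xs[0], [], []
    for x in xs[1:]:
        (less if cmp_lt(x, pivot) else more).append(x)
    return quicksort(less) + [pivot] + quicksort(more)

random.seed(42)
data = [random.randrange(10**9) for _ in range(10000)]

comparisons = 0
assert mergesort(data) == sorted(data)
merge_comps = comparisons

comparisons = 0
assert quicksort(data) == sorted(data)
quick_comps = comparisons

print(merge_comps, quick_comps)   # mergesort makes noticeably fewer calls
```

In expectation, quicksort performs about ${2 n \ln n \approx 1.39\, n \log_2 n}$ comparisons, versus roughly ${n \log_2 n}$ for mergesort—exactly the regime where expensive comparisons favor mergesort.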

I compared the library implementations of quicksort and mergesort on my machine, running both on a random array of ten million 64-bit integers. Mergesort was faster, taking 1.4 seconds instead of quicksort’s 1.7 seconds.

Then I modified the implementations to perform comparisons inline (and to also move objects around in 64-bit chunks instead of 8-bit chunks). This completely flipped things. Now quicksort became faster, taking 0.9 seconds instead of mergesort’s 1.2 seconds. So the pointer dereferences really do seem to be key.

If you really want quicksort, you can have it. Here’s a neat trick. If malloc fails, then qsort can’t allocate the memory needed for mergesort. In this case, qsort actually does use quicksort.

I wrote a function called malloc_all that allocates all of the memory in my system (this takes about half a second). If I call qsort after running malloc_all, then it runs the quicksort algorithm.

Wait, where’s the randomness? Here’s where things start to get interesting. If I call malloc_all and then I run qsort on the million-element array

$\displaystyle 0, 0, 0, \ldots, 0, 1, 2, 3, \ldots, 250000, 0, 1, 0, 2, 0, 3, \ldots, 0, 250000,$

then qsort takes a whopping 394 seconds to run. Wow!

It turns out that qsort’s quicksort doesn’t use randomness to pick its pivot. As a consequence, there are lots of natural inputs that cause the algorithm to take quadratic time!

Unlike C++, which requires that its sort function runs in ${O(n \log n)}$ time, the C standard places no such requirement on qsort. So it’s not that qsort is broken. It’s just quirky.

But why not use randomness? Most modern library implementations of quicksort are based on Bentley and McIlroy’s 1993 paper, Engineering a Sort Function. In it, Bentley and McIlroy advocate against using randomness, saying that “a library sort has no business side-effecting the random number generator”. As far as I can tell, this single sentence decided the fate of most modern quicksort library implementations. I think this is a bit of a shame, since it only takes a few lines to implement a library-specific random number generator, and then we would have a sort function without any worst-case inputs… But I’m also a theoretician, so I’m very biased.

The true namesake of qsort. I need to confess something: qsort isn’t actually named after quicksort. It’s named after quickersort, the variant of quicksort introduced by R.S. Scowen in 1965, three years after the introduction of quicksort.

There is one very neat way in which the two algorithms differ: quickersort implements its own recursion stack manually in a way that guarantees a stack depth of ${O(\log n)}$. The way that quickersort does this is simple but clever. Suppose quickersort is executing a subproblem ${A}$, which has two recursive child subproblems ${B}$ and ${C}$, where ${B}$ is the smaller of the two. After performing the partitioning step in ${A}$, quickersort executes ${B}$ recursively. After finishing ${B}$, however, quickersort overwrites ${A}$’s stack frame with ${C}$’s stack frame, and performs ${C}$ at the same stack depth as ${A}$. The result is that the only time the stack depth increases is when the problem size decreases by at least a factor of two. Hence the bound of ${O(\log n)}$ on stack size.
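
Here’s a Python sketch of the recurse-on-the-smaller-side idea. It uses genuine recursion rather than quickersort’s hand-rolled stack, but the depth guarantee is the same: recursion only ever happens on the smaller half, so the depth never exceeds ${\log_2 n + O(1)}$.

```python
import math
import random

def smaller_side_quicksort(xs):
    """In-place quicksort that recurses only on the smaller partition and
    loops on the larger one (quickersort's trick, with real recursion
    standing in for its manual stack). Returns the max recursion depth."""
    max_depth = 0

    def sort(lo, hi, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        while hi - lo > 1:
            # Lomuto partition around the last element.
            pivot, i = xs[hi - 1], lo
            for j in range(lo, hi - 1):
                if xs[j] < pivot:
                    xs[i], xs[j] = xs[j], xs[i]
                    i += 1
            xs[i], xs[hi - 1] = xs[hi - 1], xs[i]
            # Recurse on the smaller side; keep looping on the larger side.
            if i - lo < hi - (i + 1):
                if i - lo > 1:
                    sort(lo, i, depth + 1)
                lo = i + 1
            else:
                if hi - (i + 1) > 1:
                    sort(i + 1, hi, depth + 1)
                hi = i

    sort(0, len(xs), 1)
    return max_depth

random.seed(0)
xs = [random.random() for _ in range(100000)]
depth = smaller_side_quicksort(xs)
assert xs == sorted(xs)
print(depth, math.log2(len(xs)))   # depth stays below log2(n) + O(1)
```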

A beautiful piece of code. At the end of the day, GNU libc qsort is one of my favorite pieces of code. I like the strange quirks. But there’s more to it than that. It’s a simple and beautifully written piece of code, full of nuances and subtleties. If you have a few minutes, you might want to take a look.

You can also rerun the experiments in this blog post using this code.

## The World’s Simplest Interesting Algorithm

In this post, I want to tell you about what I think might be the world’s simplest interesting algorithm.

The vertex cover problem. Given a graph ${G = (V, E)}$, we want to find the smallest set of vertices ${S \subseteq V}$ such that every edge ${e \in E}$ is covered by the set ${S}$. This means that for each edge ${e = (u, v)}$, at least one of ${u}$ or ${v}$ is in ${S}$.

The vertex cover problem is known to be NP-complete, meaning that it is very hard (or impossible) to find a polynomial-time algorithm for the problem.

But what we can do is find an approximately optimal solution to the problem. There’s a beautiful and simple algorithm that finds a set ${S}$ whose size is guaranteed to be within a factor of ${2}$ of optimal.

The algorithm. To build our set ${S}$, we repeatedly find some edge ${e = (u, v)}$ that is not yet covered, and we add both ${u}$ and ${v}$ to our set ${S}$. That’s the entire algorithm.

At first sight, this seems like a crazy algorithm. Why would we put both ${u}$ and ${v}$ into ${S}$, when we only need one of ${u}$ or ${v}$ in order to cover the edge ${e = (u, v)}$?

The answer to this question is simple. If we define ${S_{\text{OPT}}}$ to be the optimal solution to the vertex-cover problem, then ${S_{\text{OPT}}}$ must contain at least one of ${u}$ or ${v}$. By placing both of ${u}$ and ${v}$ into our set ${S}$, we guarantee that at least half the elements in our set ${S}$ also appear in ${S_{\text{OPT}}}$. That means that, once our algorithm finishes, we are guaranteed that ${|S| \le 2 |S_{\text{OPT}}|}$.
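
The algorithm and its guarantee fit in a few lines of Python. The example graph below is arbitrarily chosen, and the brute-force computation of the optimum is there only to illustrate the factor of 2:

```python
from itertools import combinations

def approx_vertex_cover(edges):
    """The 2-approximation: grab any uncovered edge, take both endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def is_cover(edges, s):
    return all(u in s or v in s for u, v in edges)

# A small, arbitrarily chosen example graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4), (4, 5), (2, 5)]
cover = approx_vertex_cover(edges)
assert is_cover(edges, cover)

# Brute-force the optimum (fine for 6 vertices) to check the guarantee.
vertices = sorted({v for e in edges for v in e})
opt = next(len(s) for k in range(len(vertices) + 1)
           for s in combinations(vertices, k) if is_cover(edges, set(s)))

print(len(cover), opt)   # |cover| is at most 2 * opt
```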

A challenge.  This might be the world’s simplest interesting algorithm (and analysis).  But it also might not be. What do you think?

## What Linearity of Expectation Has to Do with Needles and Pi

Take a needle of length ${\ell}$ and drop it in a random position on a hardwood floor whose boards have the same width ${\ell}$:

There’s some probability ${p}$ that the needle crosses between two adjacent floorboards. It turns out that ${p}$ has a surprisingly simple formula,

$\displaystyle p = \frac{2}{\pi}.$

Does this mean we can drop thousands of needles (or, for safety’s sake, popsicle sticks) to accurately estimate ${\pi}$? A few years ago, my friend Jake Hillard and I put this to the test. Jake measured out the width between floorboards in the Stanford History Department and 3D-printed a few dozen popsicle sticks of the same length. Here’s one I kept as a memento:

(Since the sticks weren’t infinitesimally thin, the rule was to count a crossing only if the center of the stick crossed.)

With the help of Stanford Splash, we then recruited about a hundred middle- and high-school students to perform the experiment. In total, we recorded roughly ${4,000}$ popsicle-stick drops, and ended up estimating ${\pi}$ to about… ${2.95}$. That’s actually not too far from what we expected, since even with no experimental error at all, the standard deviation that we would expect with ${4,000}$ data points is roughly ${0.07}$.
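
If you’d rather not recruit a hundred students, you can run the experiment in software. Here’s a quick Monte Carlo sketch, taking ${\ell = 1}$ and putting the floorboard boundaries at integer heights:

```python
import math
import random

def crosses(rng):
    """Drop a unit-length needle on boards of unit width; the board
    boundaries are horizontal lines at integer heights."""
    center = rng.random()                    # height of the needle's center, mod 1
    theta = rng.random() * math.pi           # angle between needle and boards
    half = 0.5 * math.sin(theta)             # vertical half-extent of the needle
    return center < half or center + half > 1

rng = random.Random(0)
n = 200_000
p_hat = sum(crosses(rng) for _ in range(n)) / n
print(p_hat, 2 / math.pi)    # both ≈ 0.6366
print(2 / p_hat)             # the corresponding estimate of pi
```

With 200,000 virtual drops, the resulting estimate of ${\pi}$ typically lands within a percent or so of the truth.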

Ok, so dropping popsicle sticks clearly isn’t the best way to estimate ${\pi}$. But I’m hoping I can convince you that it is one of the combinatorially coolest. And that’s because the formula for ${p}$ is actually just an extraordinarily slick (although slightly informal) application of linearity of expectation.

Anything Can be Solved with Enough Glue. For simplicity, I’ll assume for the rest of the post that $\ell = 1$. Imagine that, instead of dropping a single needle, we drop ${n}$ needles, glued together to form a regular ${n}$-gon. Here’s an example for ${n = 8}$:

Each needle individually crosses between two floorboards with probability ${p}$. By linearity of expectation, the expected total number of crossings by our ${n}$-gon is ${n \cdot p}$. That is, the ${n}$-gon as a whole should, on average, induce ${n}$ times as many crossings as would a single needle.

But now let’s think about what this picture looks like when ${n}$ is large.

When ${n}$ is very large, our ${n}$-gon starts to closely resemble a circle with circumference ${n}$. The height of the ${n}$-gon corresponds to the diameter of the circle, and is therefore roughly ${n / \pi}$. (This step of the argument isn’t actually as obvious as it sounds, and requires a bit more work to make formal.) But no matter how we drop our circle-like ${n}$-gon on the hardwood floor, the total number of crossings will always be almost exactly the same (up to ${\pm 1}$)! In particular, the total number of crossings by needles in the ${n}$-gon is just twice the height of the ${n}$-gon, i.e., $2 \cdot n / \pi$. Since the expected number of crossings can be expressed as ${n \cdot p}$, it follows that, as ${n}$ grows large, ${n \cdot p}$ approaches ${2 \cdot n / \pi}$. Hence the formula ${p = 2 / \pi}$.

## Proving Optimality for the 100 Prisoners Problem

The 100 Prisoners Problem is the single coolest algorithmic riddle I know. It’s a great example of a simple algorithm accomplishing something totally unexpected.

But the algorithm for the 100 Prisoners Problem is only half the story. The second, lesser known half, is a breathtakingly simple proof that no other algorithm can do better.

The Riddle: A sadistic prison warden runs a prison with 100 prisoners (numbered 1 through 100). He decides to force them to play the following game. The warden lines up 100 boxes in a room, and places a random permutation of the numbers 1 through 100 in the boxes. Each prisoner enters the room and is allowed to open 50 boxes. If the prisoner’s number is in one of the opened boxes, then the prisoner wins the game. Here’s the catch: In order for the prisoners to be set free, all of the prisoners have to win the game.

Before each prisoner enters the room, the boxes are reset to their initial configuration (so there’s no way for one person to leave a hint for the next person). Moreover, once the game has begun, no communication is permitted between the prisoners. They can, however, agree on a strategy beforehand for how to choose which boxes to open. The question is, what’s the optimal strategy?

The Algorithm: If the prisoners each independently open a random subset of the boxes, then the probability of all the prisoners winning is ${1/2^{100}}$, which is astronomically small when you consider that there are fewer than ${2^{300}}$ atoms in the universe.

Surprisingly, there is a strategy which allows the prisoners to do much better, raising the probability of freedom to roughly 30 percent.

The algorithm works like this: When the ${i}$-th prisoner walks in the room, they begin by opening the ${i}$-th box. They then look at the number ${j}$ which they just revealed, and use that to decide which box to open next, opening the ${j}$-th box. They continue like this until they’ve opened 50 boxes, opening each successive box according to the number in the previous box.

The Intuition: If a box has some number ${j}$ in it, think of the box as having an arrow pointing from the box to the ${j}$-th box. If you start at a given box, and follow the arrows, eventually you’ll come in a full circle and get back to the box where you started. In other words, each box is part of a cycle of arrows.

The ${i}$-th prisoner starts by opening the ${i}$-th box. They then continue to open each box in the cycle until they get to the box containing the number ${i}$. Notice that the arrow from the box containing the number ${i}$ actually goes back to the ${i}$-th box, the very first box the prisoner opened. In other words, the box containing ${i}$ is guaranteed to be the last box in the cycle. It follows that the ${i}$-th prisoner will win the game if and only if the cycle containing the ${i}$-th box is of length no more than fifty.
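
This characterization—all the prisoners win if and only if every cycle has length at most fifty—also makes the strategy easy to simulate. Here’s a quick Python sanity check:

```python
import random

def max_cycle_length(perm):
    """Length of the longest cycle in a permutation (given as a list)."""
    seen, longest = [False] * len(perm), 0
    for i in range(len(perm)):
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        longest = max(longest, length)
    return longest

rng = random.Random(1)
trials = 20_000
wins = sum(max_cycle_length(rng.sample(range(100), 100)) <= 50
           for _ in range(trials))
win_rate = wins / trials
print(win_rate)   # ≈ 0.31
```

The empirical win rate hovers around 31%, matching the roughly-30-percent figure quoted above.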

In order for any prisoner to lose, there has to be some cycle of length more than fifty. This fact alone is very powerful, because it ensures that if any prisoners lose, then actually at least half of the prisoners lose (namely, all the prisoners in the cycle of length more than fifty lose). The algorithm manages to make it so that whether or not various prisoners lose is now highly correlated.

To see how powerful this observation is, consider the following slightly easier variant of the game: Each prisoner opens boxes just as before, but now they are permitted to open 75 boxes instead of 50. Using our algorithm, whenever any prisoners lose, at least 75 prisoners lose. But, since each prisoner individually has a ${1/4}$ chance of losing, the average number of prisoners who will lose is 25. Given that the average is 25, the fraction of the time that more than 75 can lose is at most ${\frac{25}{75} = 1/3}$. The remaining ${2/3}$ of the time, all of the prisoners must win, resulting in them being set free.

Although this is a slick analysis, it doesn’t work when each prisoner is only allowed to open fifty boxes. Instead, we’ll do a more careful analysis, allowing us to calculate the exact probability of winning.

A Precise Analysis: Let’s start by counting the number of box configurations (i.e., permutations of numbers) which result in there being a cycle of length, say, 63. In order to construct such a permutation, we can begin by picking which boxes will be in the cycle. There are 100 ways to pick the first box in the cycle, 99 ways to then pick the second box in the cycle, and so on, ending with 38 ways to pick the final box in the cycle. But, of course, that’s not quite right, because what does it mean for a box to be the “first” box in the cycle? A cycle has no first element, meaning there are 63 ways to construct the same cycle. Dividing by 63, we get that the number of ways to construct a cycle of length 63 is

$\displaystyle \frac{100 \cdot 99 \cdots 38}{63}.$

Once we’ve constructed the cycle, we still need to determine where the other numbers will go. Notice that 63 numbers already have their positions determined by the cycle which we constructed. We’re left with 37 remaining numbers and 37 positions to put them. Putting them in an arbitrary ordering, there are

$\displaystyle 37 \cdot 36 \cdot 35 \cdots 1$

ways to complete the construction of our permutation.

Multiplying the number of ways to pick the cycle by the number of ways to then complete the permutation, it follows that the number of permutations containing a cycle of length 63 is

$\displaystyle \frac{100 \cdot 99 \cdots 1}{63}.$

But that’s just the total number of permutations, ${100!}$, divided by ${63}$. In other words, if you pick a permutation at random, then the probability of there being a cycle of length ${63}$ is ${1/63}$.
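
You can sanity-check this ${1/63}$ figure by sampling random permutations:

```python
import random

def cycle_lengths(perm):
    """Cycle lengths of a permutation given as a list."""
    seen, lengths = [False] * len(perm), []
    for i in range(len(perm)):
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length:
            lengths.append(length)
    return lengths

rng = random.Random(0)
n, m, trials = 100, 63, 40_000
hits = sum(m in cycle_lengths(rng.sample(range(n), n)) for _ in range(trials))
print(hits / trials, 1 / m)   # both ≈ 0.0159
```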

The same is true if we replace ${63}$ with any of the numbers ${51, 52, \ldots, 100}$. The probability of there being any cycle of length greater than 50, and the prisoners therefore losing the game, is exactly

$\displaystyle \frac{1}{51} + \frac{1}{52} + \cdots + \frac{1}{100}.$

At this point, we could just do out the summation by hand to get the answer. But there’s an easier way. In particular, we use the fact that the ${n}$-th Harmonic number ${H_n}$, which is defined to be

$\displaystyle H_n = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n},$

satisfies ${H_n \approx \ln n}$. Notice that the probability of the prisoners losing the game can be written as ${H_{100} - H_{50}}$. Plugging in the approximation for ${H_n}$, we get that the probability of some prisoner losing is roughly

$\displaystyle \ln 100 - \ln 50 = \ln \frac{100}{50} = \ln 2 \approx 0.7,$

as desired.
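
For what it’s worth, the exact sum is a one-liner, and it comes out slightly below the ${\ln 2}$ approximation:

```python
import math

p_lose = sum(1 / k for k in range(51, 101))   # H_100 - H_50, the exact answer
print(p_lose)            # ≈ 0.6882
print(math.log(2))       # ≈ 0.6931
print(1 - p_lose)        # ≈ 0.3118: an ~31% chance that everyone goes free
```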

Proving Optimality: A natural question is, can we do better? Surprisingly there’s a very simple and elegant proof that we cannot. Moreover, the proof gives a deep intuition for why the algorithm does so well.

Consider the following variant of the game, which I call The Baby 100 Prisoners Problem. In this version of the game, whenever a prisoner opens a box, the box is left open for any prisoners who come later. When a prisoner comes in, if their number has already been revealed, then the prisoner wins for free. Otherwise, the prisoner is allowed to open up to 50 of the remaining boxes in search of their number. Once the prisoner finds their number, they are not permitted to open any more boxes.

The baby problem is clearly much easier than the original. In fact, it’s not even an interesting algorithmic problem anymore, since every strategy performs exactly the same. In particular, because the numbers are randomly permuted in the boxes, whenever you open a box that has never been opened before, it doesn’t matter which unopened box you pick.

So the baby problem is easier than the original, and has the property that every algorithm performs optimally on it. Our proof strategy for showing that the cycle algorithm for the 100 Prisoners Problem is optimal will be to show that the cycle algorithm performs exactly the same on the standard problem as it does on the baby version of the problem. In other words, by using the cycle algorithm, we make it so that the full problem is just as easy as the baby one!

Consider the cycle algorithm for the standard version of the problem. Suppose the ${i}$-th prisoner comes into the room, and all the previous prisoners have won the game so far. The ${i}$-th prisoner wishes to find the box containing the number ${i}$. Notice that if that box has ever been opened before (by a previous prisoner), then the entire cycle for that box must have also been opened by the past prisoner. Since all the past prisoners won the game, the cycle must be length no greater than fifty. It follows that the ${i}$-th prisoner will also win the game! In other words, if the box containing ${i}$ has been opened at some point in the past, then we’re guaranteed to successfully find it now. On the other hand, if the box hasn’t been opened yet, then all the boxes which we will open will be in a cycle which has not yet been explored. That is, we will be opening only boxes which have not yet been opened.

But that’s exactly what happens in the baby problem! If the box we’re looking for has been opened before then we get it for free, and otherwise, we open new boxes to search for it! The cycle algorithm allows us to simulate the baby problem while actually playing the standard problem. Since the algorithm manages to do as well on the standard problem as it does on the baby problem, and since no algorithm for the standard problem can do better than any algorithm for the baby problem, it follows that the cycle algorithm must be optimal.

Conclusion: The 100 Prisoners Problem implicitly relies on a number of beautiful ideas from permutation combinatorics. At the heart of the algorithm’s analysis is the theory of permutation cycles. The proof of optimality can even be seen as implicitly constructing what’s known as Foata’s Transition Lemma, a classic bijection between two canonical representations of permutations.

Sometimes seeing classic combinatorial ideas applied to a different and less intuitive setting makes you realize just how elegant and subtle those ideas actually are. For me that’s exactly what the 100 Prisoners Problem does. It reminds me that sometimes the key to fully appreciating mathematical beauty is to look at it the wrong way.

## The Combinatorics of Boarding an Airplane

I’ve spent a lot of time at airports recently. And whenever I’m at an airport I find myself thinking about combinatorial problems having to do with airplanes. While flying from SFO to BOS recently, I stumbled across something I found really surprising.

Suppose ${n}$ people board a plane in a random order. Each person has a seat number assigned to them, and whenever someone gets to their seat, they pause for a moment to put their luggage in the overhead bin.

Whenever someone is pausing for their luggage, they prevent anyone in back of them from passing. Suppose that it takes each person time ${w}$ to walk by a given seat, and time ${p \gg w}$ to put away their luggage once they get to their seat. The question is:

How long will it take for everyone to sit down?

In the worst case the passengers are arranged in decreasing numerical order. This means everyone will have to wait on everyone in front of them, and boarding will take time roughly ${n \cdot p}$. If, on the other hand, the passengers arrive in the correct (increasing) order, then they’ll each be able to walk straight to their seats, boarding in time ${n \cdot w + p}$.

But what if the passengers arrive in a random order? Then how much parallelism do we get?

The answer, as it turns out, is quite a lot. In fact, the expected time to board the plane will be

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

This has some pretty weird consequences. For example, it means that for any fixed ${w}$ and ${p}$, and for a large enough plane-size ${n}$, the boarding time will become dominated by the time necessary to walk down to the end of the plane. The time spent waiting on people to put their luggage overhead becomes a non-issue!

A Simplifying Assumption:  To make our analysis easier, we’ll make a simplifying assumption that turns out not to affect the outcome of the analysis. In particular, we’ll treat each of the passengers as though they don’t take up any space in the aisle. For example, if one passenger is paused at seat $i$, and some passengers behind them want to get past seat $i$, then that won’t stop another passenger from getting to seat $i - 1$.

On top of this, we’re already making the simplifying assumption that each row has only one seat. It turns out that both of these assumptions can be removed without changing the end result. I’ll talk about this more at the end of the post.

The Combinatorics:  The key insight of the analysis is this: Think of the line of people as a permutation of seat numbers. If the longest decreasing subsequence in that permutation has length ${d}$, then it turns out the time to board the plane will be within a factor of two of ${n \cdot w + d \cdot p}$.

To see this, consider some person ${x_1}$ as they walk to their seat. Let ${x_2}$ be the final person on whom ${x_1}$ waits before ${x_1}$ gets to their seat. Then the total time ${x_1}$ spends waiting on others will be at most ${p}$ greater than the total time that ${x_2}$ spends waiting on others. Similarly there is some person ${x_3}$ who is the last person on whom ${x_2}$ waits; and ${x_2}$ spends at most ${p}$ time more in total waiting than does ${x_3}$. Continuing like this, we can find a subsequence of people

$\displaystyle x_1, x_2, x_3, \ldots, x_k$

such that the right-most person ${x_k}$ in the sequence doesn’t have to wait at all, and such that each person ${x_i}$ spends at most ${p}$ longer than ${x_{i + 1}}$ waiting. This means that our original person ${x_1}$ spends time at most ${k \cdot p}$ waiting on others.

The sequence of people ${x_1, x_2,x_3, \ldots, x_k}$ also has the property that their seat numbers are in decreasing order. So what we’ve really shown is that if ${d}$ is the length of the longest decreasing subsequence in the permutation of people, then no person will have to wait for time more than ${d \cdot p}$, and everyone will have boarded after time ${n \cdot w + d \cdot p}$.

On the other hand, it’s not hard to see that the left-most person in the length-${d}$ decreasing sequence will require time at least ${d \cdot p}$ to board the plane; and that the left-most person in the entire line will require time at least ${n \cdot w}$ to board. This means that the true boarding time will be within a factor of two of ${n \cdot w + d \cdot p}$.
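To make the accounting concrete, here's a small simulation (my own model, not code from the post). I assume seats are numbered so that higher numbers sit nearer the door, everyone is queued at the door at time zero, and passengers take no aisle space; under those assumptions the boarding time satisfies a simple max-plus recurrence, and we can check it against ${n \cdot w + d \cdot p}$:

```python
import random

def boarding_time(seats, w, p):
    """Sit-down time of the last passenger under the zero-aisle-space model.

    seats[j] is the seat of the j-th person in line. With higher-numbered
    seats nearer the entrance, person j is blocked at seat seats[i]
    whenever i < j and seats[i] > seats[j]. O(n^2) recurrence.
    """
    n = len(seats)
    finish = []
    for j, s in enumerate(seats):
        reach = (n - s) * w  # unobstructed walk from the door to seat s
        for i in range(j):
            if seats[i] > s:  # person i pauses at a seat person j must pass
                reach = max(reach, finish[i] + (seats[i] - s) * w)
        finish.append(reach + p)  # pause p to stow luggage, then sit
    return max(finish)

def longest_decreasing(seats):
    """Length d of the longest decreasing subsequence (O(n^2) DP)."""
    best = [1] * len(seats)
    for j in range(len(seats)):
        for i in range(j):
            if seats[i] > seats[j]:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)

n, w, p = 300, 1, 40
seats = list(range(n))
random.shuffle(seats)
t, d = boarding_time(seats, w, p), longest_decreasing(seats)
print(t, n * w + d * p)  # the first lands within a factor of two of the second
```

Whatever permutation you feed in, the simulated time stays within a factor of two of ${n \cdot w + d \cdot p}$, just as the argument above promises.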

So the real question becomes: On average, how large is ${d}$, the length of the longest decreasing subsequence in a random permutation?

Bounding the longest decreasing subsequence: We’ll start by considering a different question. For a given length ${k}$, how many different decreasing subsequences of length ${k}$ should we expect in our permutation? (We count two subsequences as different even if they differ in only a single element.)

This is easy to compute by linearity of expectation: Each ${k}$-element subsequence has a ${1/k!}$ probability of being in decreasing order. And there are ${\binom{n}{k}}$ different ${k}$-element subsequences. So the expected number of decreasing subsequences is

$\displaystyle \frac{\binom{n}{k}}{k!} = \frac{n \cdot (n - 1) \cdots (n - k + 1)}{k! \cdot k!}.$
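As a sanity check on the formula (my own quick experiment, not part of the original argument), we can count decreasing triples in small random permutations and compare against ${\binom{n}{k}/k!}$:

```python
import itertools
import math
import random

n, k, trials = 8, 3, 20_000
total = 0
for _ in range(trials):
    perm = random.sample(range(n), n)
    # combinations() yields k-element subsequences in line order,
    # so we just count the ones that are strictly decreasing
    total += sum(1 for c in itertools.combinations(perm, k)
                 if all(a > b for a, b in zip(c, c[1:])))

print(total / trials)                       # empirical average
print(math.comb(n, k) / math.factorial(k))  # predicted: 56/6 = 9.33...
```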

A useful fact about ${k!}$ is that it’s at least as large as ${(k / 3)^k}$. So the expected number of decreasing ${k}$-element subsequences is at most

$\displaystyle \frac{n^k}{(k^2 / 9)^k}.$

When ${k = 6\sqrt{n}}$, we have ${k^2/9 = 4n}$, so this is at most ${\frac{1}{4^k} = \frac{1}{4^{6 \sqrt{n}}}}$.

Of course, if the expected number of ${6\sqrt{n}}$-element decreasing subsequences is that small, then by Markov's inequality the probability of even one such subsequence appearing is just as small.

So with extremely high probability, no decreasing subsequence will be of length more than ${O(\sqrt{n})}$. And this implies that the expected length is also at most ${O(\sqrt{n})}$.

What about a lower bound? We’ve just shown that the expected length of the longest decreasing subsequence is at most ${O(\sqrt{n})}$. But is that tight?

For this, we can use a classical result from permutation combinatorics. The Erdős–Szekeres Theorem says that any permutation of length ${n}$ must contain either a decreasing or an increasing subsequence of length ${\sqrt{n}}$. Since a random permutation is equally likely to favor either case (reversing the line order swaps the two), there will be a decreasing subsequence of length ${\sqrt{n}}$ with probability at least ${1/2}$. So the expected length of the longest decreasing subsequence is at least $\sqrt{n}/2$.
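We can check both bounds empirically. Here's a quick experiment (mine), using patience sorting, which finds the longest decreasing subsequence in ${O(n \log n)}$ time:

```python
import bisect
import random

def longest_decreasing_length(perm):
    """Longest strictly decreasing subsequence via patience sorting:
    negate the entries and find the longest strictly increasing one."""
    tails = []  # tails[i] = smallest tail of an increasing run of length i+1
    for x in perm:
        i = bisect.bisect_left(tails, -x)
        if i == len(tails):
            tails.append(-x)
        else:
            tails[i] = -x
    return len(tails)

for n in [100, 400, 1600, 6400]:
    avg = sum(longest_decreasing_length(random.sample(range(n), n))
              for _ in range(200)) / 200
    print(n, avg / n ** 0.5)
```

The printed ratio lands comfortably between the ${1/2}$ and ${6}$ proven above; in fact it slowly approaches ${2}$, the constant from the Vershik–Kerov / Logan–Shepp theorem on longest monotone subsequences.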

The full version of the problem:  Near the beginning of the post I mentioned we were making two assumptions: (1) That passengers don’t take up any space in the aisle; and (2) that each row has a single seat.

Removing these assumptions makes the problem considerably harder. But it turns out the end conclusion is basically the same. A 2009 math paper on the topic analyzed the problem in its full complexity, and found that as long as each person takes up roughly one row's worth of space in the aisle, and the number of seats per row is constant, the boarding time remains

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

## The (Almost) Secret Algorithm Researchers Used to Break Thousands of RSA Keys

RSA encryption allows anyone to send me messages that only I can decode. To set this up, I select two large random primes $p$ and $q$ (each of which is hundreds of bits long), and release their product $x = p \cdot q$ online for everyone to see; $x$ is known as my public key. In addition, I pick some number $e$ which shares no factors with $p-1$ or $q-1$ and release it online as well.

The beauty of RSA encryption is that using only the information I publicly released, anyone can encode a message they want to send me. But without knowing the values of $p$ and $q$, nobody but me can decode the message. And even though everyone knows my public key $x = p \cdot q$, that doesn’t give them any efficient way to find values for $p$ or $q$. In fact, even factoring a 232-digit number took a group of researchers more than 1,500 years of computing time (distributed among hundreds of computers).

On the surface, RSA encryption seems uncrackable. And it might be, too, except for one small problem. Almost everyone uses the same random-prime-number generators.

A few years ago, this gave researchers an idea. Suppose Bob and Alice both post public keys online. Because they both used the same program to generate random prime numbers, there’s a higher-than-random chance that their public keys share a prime factor. Factoring Bob’s or Alice’s public key individually would be nearly impossible. But finding a common factor between them is much easier. In fact, the time needed to compute the greatest common divisor (GCD) of two numbers is close to proportional to the number of digits in the two numbers. Once I identify the prime factor shared by Bob’s and Alice’s keys, I can divide it out to obtain the prime factorization of both keys. In turn, I can decode any messages sent to either Bob or Alice.
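Here's the attack in miniature, with toy primes a few digits long rather than hundreds of bits (the names and numbers are mine, purely for illustration):

```python
import math

# Suppose both generators happened to output the prime 1000003.
p = 1000003                       # the shared prime factor
q_bob, q_alice = 1000033, 1000037

bob_key = p * q_bob               # the public keys: products of two primes
alice_key = p * q_alice

# Neither key reveals its factors on its own, but together they do:
shared = math.gcd(bob_key, alice_key)
print(shared)                     # 1000003 -- the shared prime
print(bob_key // shared)          # 1000033 -- Bob's other prime
print(alice_key // shared)        # 1000037 -- Alice's other prime
```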

Armed with this idea, the researchers scanned the web and collected 6.2 million actual public keys. They then computed the greatest common divisor between pairs of keys, cracking a key whenever it shared a prime factor with any other key. All in all, they were able to break 12,934 keys. In other words, if used carelessly, RSA encryption provides less than $99.8\%$ security.

At first glance this seems like the whole story. Reading through their paper more closely, however, reveals something strange. According to the authors, they were able to run the entire computation in a matter of hours on a single core. But a back-of-the-envelope calculation suggests that computing GCDs for 36 trillion pairs of keys should take years, not hours.

So how did they do it? The authors hint in a footnote that at the heart of their computation is an asymptotically fast algorithm, allowing them to bring the running time of the computation down to nearly linear; but the actual description of the algorithm is kept a secret from the reader, perhaps to guard against malicious use. Within just months of the paper’s publication, though, follow-up papers had already discussed various approaches in detail, both presenting fast algorithms (see this paper and this paper), and even showing how to use GPUs to make the brute-force approach viable (see this paper).

There’s probably a lesson here about not bragging about things if you want them to stay secret. But for this post I’m not interested in lessons. I’m interested in algorithms. And this one turns out to be both relatively simple and quite fun.

Algorithm Prerequisites: Our algorithm will deal with integers having an asymptotically large number of digits. Consequently, we cannot treat addition and multiplication as constant-time operations.

For $n$-bit integers, addition takes $O(n)$ time. Using long multiplication, multiplication would seem to take $O(n^2)$ time. However, the Schönhage–Strassen algorithm brings this down to $O(n \log n \log \log n)$.

Computing the GCD naively using the Euclidean algorithm would take $O(n^2 \log n \log \log n)$ time. Once again, however, researchers have found a better algorithm, running in time $O(n \log^2 n \log \log n)$.

Fortunately, all of these algorithms are already implemented for us in GMP, the GNU big-integer library. For the rest of the post we will use Big-O-Tilde notation, a variant of Big-O notation that ignores logarithmic factors. For example, while GCD-computation takes time $O(n \log^2 n \log \log n)$, in Big-O-Tilde notation we write that it takes time $\widetilde{O}(n)$.

Transforming the Problem: Denote the set of public RSA keys by $k_1, \ldots, k_n$, where each key is the product of two large prime numbers (each hundreds of bits long) and $n$ is the total number of keys. Rather than computing the GCD of each pair of keys, we can instead compute, for each key $k_i$, the GCD of $k_i$ and the product of all the other keys, $\prod_{t \neq i} k_t$. If $k_i$ shares exactly one prime factor with the other keys, then this GCD is that prime factor. If both prime factors of $k_i$ are shared with other keys, however, then the GCD is $k_i$ itself, and the computation fails to extract the individual prime factors. This case is probably rare enough that it’s not worth worrying about.

The Algorithm: The algorithm has a slightly unusual recursive structure in that the recursion occurs in the middle of the algorithm rather than at the end.

At the beginning of the algorithm, all we have is the keys,

$k_1,$
$k_2,$
$k_3, \cdots$

The first step of the algorithm is to pair off the keys and compute their products,

$j_1 = k_1 \cdot k_2,$
$j_2 = k_3 \cdot k_4,$
$j_3 = k_5 \cdot k_6, \cdots$

Next we recurse on the sequence of numbers $j_1, \ldots, j_{n / 2}$, in order to compute

$r_1 = GCD(j_1, \prod_{t \neq 1} j_t),$
$r_2 = GCD(j_2, \prod_{t \neq 2} j_t),$
$r_3 = GCD(j_3, \prod_{t \neq 3} j_t), \cdots$

Our goal is to compute $s_i = GCD(k_i, \prod_{t \neq i} k_t)$ for each key $k_i$. The key insight is that when $i$ is odd, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{(i + 1) / 2} \cdot k_{i + 1}),$
and that when $i$ is even, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{i / 2} \cdot k_{i - 1}).$
To see why this is the case, one can verify that the term on the right side of the GCD is guaranteed to be a multiple of $GCD(k_i, \prod_{t \neq i} k_t)$, while also being a divisor of $\prod_{t \neq i} k_t$. This, in turn, implies that the GCD-computation will evaluate to exactly $GCD(k_i, \prod_{t \neq i} k_t)$, as desired.

Computing each of the $s_i$‘s in terms of the $r_i$‘s and $k_i$‘s completes the algorithm.
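The whole recursion fits in a short sketch (my own code, assuming for simplicity that the number of keys is a power of two):

```python
import math

def batch_gcd(keys):
    """For each key k_i, compute gcd(k_i, product of all the other keys),
    using the pairing recursion described above. Assumes len(keys) is a
    power of two; a real implementation would pad or handle odd sizes.
    """
    n = len(keys)
    if n == 1:
        return [1]  # gcd with an empty product
    # Step 1: pair off the keys and compute their products j_1, j_2, ...
    pairs = [keys[i] * keys[i + 1] for i in range(0, n, 2)]
    # Step 2: recurse to get r_t = gcd(j_t, product of the other j's)
    r = batch_gcd(pairs)
    # Step 3: s_i = gcd(k_i, r * (k_i's partner within its pair))
    return [math.gcd(k, r[i // 2] * keys[i + 1 if i % 2 == 0 else i - 1])
            for i, k in enumerate(keys)]

# Four toy keys; the prime 1000003 is shared between keys 0 and 2.
keys = [1000003 * 1000033, 1000037 * 1000039,
        1000003 * 1000081, 1000099 * 1000117]
print(batch_gcd(keys))  # [1000003, 1, 1000003, 1]
```

Running it on the toy keys recovers the shared prime from both keys that contain it, and returns 1 for the keys whose factors appear nowhere else.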

Bounding the Running Time: Let $m$ denote the total number of bits needed to write down $k_1, k_2, \ldots, k_n$. Each time the algorithm recurses, the total number of bits in the input being recursed on is no larger than at the previous level of recursion; this is because the new inputs are products of pairs of elements from the old input.

Therefore each of the $O(\log n)$ levels of recursion acts on an input of total size $O(m)$ bits. Moreover, the arithmetic operations within each level of recursion take total time at most $\widetilde{O}(m)$. Thus the total running time of the algorithm is also $\widetilde{O}(m)$ (since the $O(\log n)$ recursion levels can be absorbed into the Big-O-Tilde notation).

If we unwrap the running time into standard Big-O notation, we get
$O(m \log^3 m \log \log m).$

Is it practical? At first glance, the triple-logarithmic factor might seem like a deal breaker for this algorithm. But the actual performance turns out to be pretty reasonable. This paper found that the algorithm takes time roughly 7.65 seconds per thousand keys, meaning it would take a little more than 13 hours to run on 6.2 million keys.

One of the log factors can be removed using a slightly more clever variant of the algorithm, which avoids GCD computations at all but the first level of recursion (See this paper). The improved algorithm takes about 4.5 seconds per thousand keys, resulting in a total running time of about 7.5 hours to handle 6.2 million keys.

So there we go. A computation that should have taken years is reduced to a matter of hours.  And all it took was a bit of clever recursion.