## I’m Above Average and So Are You

Derek Sivers’ short essay “I Assume I’m Below Average” recently went viral on Hacker News. The article begins:

Ninety-six percent of cancer patients claim to be in better health than the average cancer patient.

Ninety-four percent of professors say they are better-than-average teachers.

Ninety percent of students think they are more intelligent than the average student.

Ninety-three percent of drivers say they are safer-than-average drivers.

What the article doesn’t discuss, however, is whether the people in these statistics are actually wrong.

There’s no fundamental reason why you can’t have 90% of people be better than average. For example, more than 99.9% of people have an above-average number of legs. And more than 90% of people commit fewer felonies than average. These examples are obvious, but they’re not so different from some of the examples above.
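
As a sanity check, here’s the legs example as a tiny Python snippet (the population numbers are made up purely for illustration):

```python
# A made-up population: 998 people with 2 legs, 2 people with 1 leg.
legs = [2] * 998 + [1] * 2

average = sum(legs) / len(legs)                        # 1.998
above_average = sum(1 for x in legs if x > average)

print(f"{above_average / len(legs):.1%} have an above-average number of legs")
# 99.8% have an above-average number of legs
```

A couple of outliers pull the mean below the typical value, so almost everyone sits above it.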

Any cancer patient who isn’t on the verge of dying is probably doing above average. (Unfortunately, so many cancer patients die that the 96% figure is probably still over-optimistic.)

What about the fraction of students that are above average? It depends on the class. Looking at Stanford’s data for the grade distribution of their graduate Algorithms class, about 78% of students (anyone with an A or A+) have an above-average grade.

This type of skew will happen in any probability distribution where a small number of people bring the average way down. I suspect another example of this is car drivers (consider the fact that, in America, 1 in 7 drivers don’t wear a seatbelt, or that 1 in 50 drivers admit to driving drunk in the past month!). Whether or not university professors fit this type of distribution is harder to tell.

To be clear, the examples given at the beginning are pretty extreme, and people probably are overly optimistic about their standings. But the examples aren’t as extreme as they look at first glance.

This reminds me of an old joke about there being an imaginary town called Lake Wobegon, “where all the women are strong, all the men are good-looking, and all the children are above average.” The joke, of course, is that the children can’t actually all be above average. But almost all of them can be. You just need one to bring the average down. (Poor Timmy.)

## What is the actual infection rate at universities?

Here’s a question: what fraction of students at MIT have COVID-19 right now?

Since MIT tests students twice a week, this question should be pretty easy to answer. But it’s not.

That’s because MIT, along with other universities in the area, doesn’t report their infection rate. What they report is the positive test rate.

The Positive Rate is computed as

${\frac{\text{\# positive tests}}{\text{\# tests performed}}.}$

At first glance, this might sound like the fraction of the population that has the virus, but it’s not. That’s because, once you test positive, you don’t get any additional tests for 14 days (the length of a quarantine). That means that, in the time that a healthy person would get tested four times, an infected person only gets tested once. So MIT’s reported positive rate of 0.22% should be scaled up by roughly a factor of four: the true percent of the MIT population with coronavirus is probably more like

$\displaystyle 4 \cdot 0.22\% = 0.88\%.$

In my view, that’s pretty bad. Roughly speaking, this suggests that every two weeks, each student has a ${0.0088}$ chance of getting the virus. If a student lives at MIT for a year, their chance of avoiding the virus extrapolates to

$\displaystyle (1 - 0.88 / 100)^{26} < 0.8.$

So, by the end of the year, we may be looking at roughly 20% of MIT’s population having had the virus.
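
Here’s that back-of-the-envelope arithmetic in Python, using the same rough assumptions as above (a 0.22% positive rate, a factor-of-4 testing correction, and 26 two-week periods per year):

```python
positive_rate = 0.0022           # reported positive test rate (0.22%)
testing_correction = 4           # an infected student gets tested ~4x less often

infection_rate = testing_correction * positive_rate    # per two-week period
p_avoid = (1 - infection_rate) ** 26                   # chance of avoiding it all year

print(infection_rate)   # 0.0088
print(p_avoid)          # ≈ 0.79, so roughly 21% infected by year's end
```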

Of course, my extrapolation could be way off. But that’s the problem: MIT should be providing data that makes it easy to perform this sort of extrapolation, and they’re not.

Other schools are doing the same thing. Here’s Northeastern University’s infographic.

In my opinion, this graphic crosses the line into intellectual dishonesty. The sea of blue is intended to comfort the reader without providing any useful information.

This is incredibly harmful. I’ve heard first hand from students who are convinced that the virus risk is low enough that they should just continue business as usual. Graphics like the one above inappropriately perpetuate that belief. In a crisis where bad math can literally kill, universities should do better.

## My favorite example of: the probabilistic method

The “probabilistic method” is the art of applying probabilistic thinking to non-probabilistic problems. Applications of the probabilistic method often feel like magic. Here is my favorite example:

Theorem (Erdös, 1965). Call a set ${X}$ sum-free if for all ${a, b \in X}$, we have ${a + b \not\in X}$. For any finite set ${S}$ of positive integers, there is a sum-free subset ${X \subseteq S}$ of size ${|X| > |S| / 3}$.

This theorem seems to have nothing to do with probability. Yet Erdös’s proof relies on a short and beautiful probabilistic argument.

Proof. Erdös’s first insight is to treat the set ${S}$ as living in ${\mathbb{Z}_p}$ for some large prime number ${p}$, rather than living in ${\mathbb{Z}}$. Living in ${\mathbb{Z}_p}$ only makes the theorem harder to prove, since any set ${X}$ that is sum-free in ${\mathbb{Z}_p}$ is also sum-free in ${\mathbb{Z}}$. But, as we shall see, ${\mathbb{Z}_p}$ has several structural properties that Erdös beautifully exploits.

Let ${p = 3k + 2}$ be a prime number satisfying ${p > s}$ for all ${s \in S}$. (The fact that such a prime exists is actually nontrivial, and uses the fact that there are infinitely many primes congruent to ${2}$ mod ${3}$.) We’ll be using ${p}$ and ${k}$ throughout the rest of the proof.

Erdös begins his construction by defining a large sum-free set in ${\mathbb{Z}_p}$ which is not necessarily a subset of ${S}$:

$\displaystyle \text{MIDDLE-THIRD} = \{k + 1, k + 2, \ldots, 2k + 1\}.$

As the name suggests, MIDDLE-THIRD consists of what is roughly the middle third of ${\mathbb{Z}_p}$. One can easily confirm that MIDDLE-THIRD is sum-free in ${\mathbb{Z}_p}$.

Next, Erdös selects a random value ${r \in \{1, 2, \ldots, p - 1\}}$, and examines the set

$\displaystyle X_r = S \cap (r \cdot \text{MIDDLE-THIRD}).$

In words, ${X_r}$ is the set of ${s \in S}$ that can be written as ${rx \pmod p}$ for some ${x \in \text{MIDDLE-THIRD}}$. If we could show that ${X_r}$ is a sum-free set of size greater than ${|S| / 3}$, then the proof would be complete.

The good news is that ${X_r}$ is always sum-free, no matter the value of ${r}$. This is because, if we have ${a + b = c}$ for some ${a, b, c \in X_r}$, then we also have ${a / r + b / r = c / r}$, violating the sum-freeness of MIDDLE-THIRD. (Here we are implicitly exploiting a nontrivial property of ${\mathbb{Z}_p}$, namely that it supports division!)

The bad news is that ${|X_r|}$ could potentially be too small. To prove the theorem, we need to show that there is some option for ${r}$ that results in ${|X_r| > |S| / 3}$.

Rather than directly finding a value of ${r}$ for which ${|X_r| > |S| / 3}$, Erdös instead proves a probabilistic statement: the expected value of ${|X_r|}$ for a randomly chosen ${r}$ satisfies ${\mathbb{E}[|X_r|] > |S| / 3}$. Of course, if ${\mathbb{E}[|X_r|] > |S| / 3}$, then there must exist some ${r}$ for which ${|X_r| > |S| / 3}$. What’s neat is that we never actually construct ${r}$; we just prove that it exists.

Our new challenge is to prove that ${\mathbb{E}[|X_r|] > |S| / 3}$. One way to think of ${|X_r|}$ is as the number of ${x \in \{k + 1, k + 2, \ldots, 2k + 1\}}$ for which ${rx \in S}$. Thus we can expand ${\mathbb{E}[|X_r|]}$ as

$\displaystyle \mathbb{E}[|X_r|] = \sum_{x = k + 1}^{2k + 1} \Pr[rx \in S].$

This is where Erdös really exploits the structure of ${\mathbb{Z}_p}$. In ${\mathbb{Z}_p}$, every non-zero element ${x}$ is an additive generator, meaning that the multiples ${x, 2x, 3x, \ldots, (p - 1)x}$ of ${x}$ are just the numbers ${1, 2, \ldots, p - 1}$ in some order. The result is that ${rx}$ is just a random element of ${\{1, 2, \ldots, p - 1\}}$, and thus

$\displaystyle \mathbb{E}[|X_r|] = \sum_{x = k + 1}^{2k + 1} \frac{|S|}{p - 1}$

$\displaystyle = (k + 1) \frac{|S|}{p - 1}$

$\displaystyle > |S| / 3.$

Putting the pieces together, we have shown that ${\mathbb{E}[|X_r|] > |S| / 3}$, which implies that there is some ${r}$ for which ${|X_r| > |S| / 3}$. Since ${X_r}$ is sum-free, this completes the proof of the theorem.

Magic!
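
The whole argument is easy to verify by brute force on a small example. Here’s a Python sketch with ${S = \{1, \ldots, 10\}}$ and ${p = 11 = 3 \cdot 3 + 2}$ (the variable names are mine, not Erdös’s):

```python
# Brute-force check of the argument for S = {1, ..., 10}.
# Take p = 11 = 3*3 + 2, so k = 3 and MIDDLE-THIRD = {4, 5, 6, 7}.
S = set(range(1, 11))
p, k = 11, 3
middle_third = set(range(k + 1, 2 * k + 2))

def is_sum_free(X):
    """Check sum-freeness over the integers."""
    return all((a + b) not in X for a in X for b in X)

best = set()
for r in range(1, p):
    # X_r = elements of S expressible as r*x (mod p) with x in MIDDLE-THIRD.
    X_r = {s for s in S if any(s == (r * x) % p for x in middle_third)}
    assert is_sum_free(X_r)   # sum-free in Z_p, hence in Z
    if len(X_r) > len(best):
        best = X_r

print(sorted(best), len(best))   # a sum-free subset with more than 10/3 elements
```

Every ${X_r}$ comes out sum-free, and the best one beats ${|S| / 3}$, exactly as the expectation argument promises.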

## My favorite example of: the pigeonhole principle

This is part of a new sequence of posts titled,

My favorite example of: ${x}$,

for different values of ${x}$. Today, ${x}$ is the pigeonhole principle.

The Erdös-Szekeres Theorem: Consider any sequence of ${n}$ distinct numbers. There must exist a subsequence ${S}$ of ${\sqrt{n}}$ numbers such that ${S}$ is either entirely increasing or entirely decreasing.

The Erdös-Szekeres Theorem is a classic result in permutation combinatorics. Although the theorem was first presented in 1935, the proof I’m going to describe appears in the 1959 paper “A simple proof of a theorem of Erdös and Szekeres”, written by Abraham Seidenberg. Seidenberg’s proof is arguably one of the slickest applications of the pigeonhole principle ever.

The Proof.

Consider a sequence ${x = x_1, x_2, \ldots, x_n}$ of ${n}$ distinct numbers. For each number ${x_i}$, define

$\displaystyle a_i = \text{ length of longest increasing subsequence ending in }x_i$

and define

$\displaystyle b_i = \text{ length of longest decreasing subsequence ending in }x_i.$

For example, if ${x = 1, 4, 5, 3, 2}$, then ${a_5 = 2}$ is the length of the longest increasing subsequence ending in ${x_5 = 2}$ and ${b_5 = 3}$ is the length of the longest decreasing subsequence ending in ${x_5 = 2}$.

In order to prove the theorem, Seidenberg made one crucial observation:

Observation: If ${i < j}$ then either ${a_i \neq a_j}$ or ${b_i \neq b_j}$.

To see why the observation holds, consider the case where ${x_i < x_j}$. Then for any increasing subsequence ${S}$ that ends in ${x_i}$, the longer subsequence ${S \cup \{x_j\}}$ is an increasing subsequence that ends in ${x_j}$. This implies that ${a_j \ge a_i + 1}$ and thus that ${a_i \neq a_j}$.  A similar argument can be used in the case where ${x_i > x_j}$ to show that  ${b_j \ge b_i + 1}$, and thus that ${b_i \neq b_j}$. In both cases, the observation holds.

In order to prove the theorem, our goal is to show that there is some ${i}$ for which either ${a_i \ge \sqrt{n}}$ or ${b_i \ge \sqrt{n}}$. Seidenberg realized that he could prove this with a simple application of the pigeonhole principle.

Suppose for contradiction that ${a_i < \sqrt{n}}$ and ${b_i < \sqrt{n}}$ for all ${i}$. Then there are fewer than ${\sqrt{n} \cdot \sqrt{n} = n}$ possibilities for the value of each pair ${(a_i, b_i)}$. It follows by the pigeonhole principle that at least two of the pairs

$\displaystyle (a_1, b_1), (a_2, b_2), (a_3, b_3), \ldots, (a_n, b_n)$

are equal to one another. That is, there exists $i \neq j$ such that $(a_i, b_i) = (a_j, b_j)$. But this contradicts Seidenberg’s observation, and thus completes the proof.
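
Here’s a short Python sketch of the proof’s machinery: an ${O(n^2)}$ dynamic program computes the pairs ${(a_i, b_i)}$, and we can check both the observation and the theorem on random sequences:

```python
import math
import random

def monotone_pairs(x):
    """For each i, compute (a_i, b_i): lengths of the longest increasing /
    decreasing subsequences ending at x_i, via an O(n^2) dynamic program."""
    n = len(x)
    a, b = [1] * n, [1] * n
    for j in range(n):
        for i in range(j):
            if x[i] < x[j]:
                a[j] = max(a[j], a[i] + 1)
            else:                      # x[i] > x[j], since entries are distinct
                b[j] = max(b[j], b[i] + 1)
    return list(zip(a, b))

# The worked example from the text: x = 1, 4, 5, 3, 2 gives (a_5, b_5) = (2, 3).
assert monotone_pairs([1, 4, 5, 3, 2])[4] == (2, 3)

random.seed(0)
for _ in range(100):
    n = 50
    x = random.sample(range(1000), n)
    pairs = monotone_pairs(x)
    assert len(set(pairs)) == n                          # Seidenberg: pairs distinct
    assert max(max(p) for p in pairs) >= math.sqrt(n)    # the theorem

print("all checks passed")
```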

## The Many Quirks of Qsort

In 1973, Dennis Ritchie, the author of the C programming language, was such a fan of the quicksort algorithm that he decided to name the language’s sort function after it.

In this post, I’m going to dive into some of the most interesting (and bizarre) aspects of the qsort function. I’m going to be focusing on the GNU C library version of qsort—you can find the code here and here.

The m in qsort. What if I told you that the world’s most famous implementation of quicksort actually uses… mergesort.

It’s true. The GNU implementation of qsort directly calls mergesort as the default sorting algorithm. It’s hard to tell exactly when this change occurred—based on the version history of glibc, it looks like the modification happened sometime between 1992 and 1995.

But why? I haven’t been able to find any documentation describing the reason for the transition from quicksort to mergesort, but I do have a conjecture. Although quicksort is typically faster than mergesort in practice, mergesort has one advantage: it performs fewer total comparisons. In the C library, this ends up being important, because the sort function performs comparisons by dereferencing a pointer to a comparison function. This makes comparisons much more expensive than they should be, giving mergesort an advantage.
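
My timing experiments used the library code itself, but the comparison-count gap is easy to reproduce with textbook versions of the two algorithms. Here’s a Python sketch (not glibc’s code) that counts calls to a comparator function:

```python
import random

comparisons = 0

def cmp_lt(a, b):
    """Stand-in for the comparator that qsort invokes through a function
    pointer; here we just count how often it gets called."""
    global comparisons
    comparisons += 1
    return a < b

def mergesort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if cmp_lt(right[j], left[i]):
            out.append(right[j]); j += 1
        else:
            out.append(left[i]); i += 1
    return out + left[i:] + right[j:]

def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot, less, more = xs[0], [], []
    for x in xs[1:]:
        (less if cmp_lt(x, pivot) else more).append(x)
    return quicksort(less) + [pivot] + quicksort(more)

random.seed(42)
data = [random.randrange(10**9) for _ in range(10000)]

comparisons = 0
assert mergesort(data) == sorted(data)
merge_comps = comparisons

comparisons = 0
assert quicksort(data) == sorted(data)
quick_comps = comparisons

print(merge_comps, quick_comps)   # mergesort makes noticeably fewer calls
```

In expectation, quicksort performs about ${2 n \ln n \approx 1.39\, n \log_2 n}$ comparisons, versus roughly ${n \log_2 n}$ for mergesort—exactly the regime where expensive comparisons favor mergesort.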

I compared the library implementations of quicksort and mergesort on my machine, running both on a random array of ten million 64-bit integers. Mergesort was faster, taking 1.4 seconds instead of quicksort’s 1.7 seconds.

Then I modified the implementations to perform comparisons inline (and to also move objects around in 64-bit chunks instead of 8-bit chunks). This completely flipped things. Now quicksort became faster, taking 0.9 seconds instead of mergesort’s 1.2 seconds. So the pointer dereferences really do seem to be key.

If you really want quicksort, you can have it. Here’s a neat trick. If malloc fails, then qsort can’t allocate the memory needed for mergesort. In this case, qsort actually does use quicksort.

I wrote a function called malloc_all that allocates all of the memory in my system (this takes about half a second). If I call qsort after running malloc_all, then it runs the quicksort algorithm.

Wait, where’s the randomness? Here’s where things start to get interesting. If I call malloc_all and then I run qsort on the million-element array

$\displaystyle 0, 0, 0, \ldots, 0, 1, 2, 3, \ldots, 250000, 0, 1, 0, 2, 0, 3, \ldots, 0, 250000,$

then qsort takes a whopping 394 seconds to run. Wow!

It turns out that qsort’s quicksort doesn’t use randomness to pick its pivot. As a consequence, there are lots of natural inputs that cause the algorithm to take quadratic time!

Unlike C++, which requires that its sort function runs in ${O(n \log n)}$ time, the C standard places no such requirement on qsort. So it’s not that qsort is broken. It’s just quirky.

But why not use randomness? Most modern library implementations of quicksort are based on Bentley and McIlroy’s 1993 paper, Engineering a Sort Function. In it, Bentley and McIlroy advocate against using randomness, saying that “a library sort has no business side-effecting the random number generator”. As far as I can tell, this single sentence decided the fate of most modern quicksort library implementations. I think this is a bit of a shame, since it only takes a few lines to implement a library-specific random number generator, and then we would have a sort function without any worst-case inputs… But I’m also a theoretician, so I’m very biased.

The true namesake of qsort. I need to confess something: qsort isn’t actually named after quicksort. It’s named after quickersort, the variant of quicksort introduced by R.S. Scowen in 1965, three years after the introduction of quicksort.

There is one very neat way in which the two algorithms differ: quickersort implements its own recursion stack manually in a way that guarantees a stack depth of ${O(\log n)}$. The way that quickersort does this is simple but clever. Suppose quickersort is executing a subproblem ${A}$, which has two recursive child subproblems ${B}$ and ${C}$, where ${B}$ is the smaller of the two. After performing the partitioning step in ${A}$, quickersort executes ${B}$ recursively. After finishing ${B}$, however, quickersort overwrites ${A}$’s stack frame with ${C}$’s stack frame, and performs ${C}$ at the same stack depth as ${A}$. The result is that the only time the stack depth increases is when the problem size decreases by at least a factor of two. Hence the bound of ${O(\log n)}$ on stack size.
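
Here’s a Python sketch of the recurse-on-the-smaller-side idea. It uses genuine recursion rather than quickersort’s hand-rolled stack, but the depth guarantee is the same: recursion only ever happens on the smaller half, so the depth never exceeds ${\log_2 n + O(1)}$.

```python
import math
import random

def smaller_side_quicksort(xs):
    """In-place quicksort that recurses only on the smaller partition and
    loops on the larger one (quickersort's trick, with real recursion
    standing in for its manual stack). Returns the max recursion depth."""
    max_depth = 0

    def sort(lo, hi, depth):
        nonlocal max_depth
        max_depth = max(max_depth, depth)
        while hi - lo > 1:
            # Lomuto partition around the last element.
            pivot, i = xs[hi - 1], lo
            for j in range(lo, hi - 1):
                if xs[j] < pivot:
                    xs[i], xs[j] = xs[j], xs[i]
                    i += 1
            xs[i], xs[hi - 1] = xs[hi - 1], xs[i]
            # Recurse on the smaller side; keep looping on the larger side.
            if i - lo < hi - (i + 1):
                if i - lo > 1:
                    sort(lo, i, depth + 1)
                lo = i + 1
            else:
                if hi - (i + 1) > 1:
                    sort(i + 1, hi, depth + 1)
                hi = i

    sort(0, len(xs), 1)
    return max_depth

random.seed(0)
xs = [random.random() for _ in range(100000)]
depth = smaller_side_quicksort(xs)
assert xs == sorted(xs)
print(depth, math.log2(len(xs)))   # depth stays below log2(n) + O(1)
```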

A beautiful piece of code. At the end of the day, GNU libc qsort is one of my favorite pieces of code. I like the strange quirks. But there’s more to it than that. It’s a simple and beautifully written piece of code, full of nuances and subtleties. If you have a few minutes, you might want to take a look.

You can also rerun the experiments in this blog post using this code.

## The World’s Simplest Interesting Algorithm

In this post, I want to tell you about what I think might be the world’s simplest interesting algorithm.

The vertex cover problem. Given a graph ${G = (V, E)}$, we want to find the smallest set of vertices ${S \subseteq V}$ such that every edge ${e \in E}$ is covered by the set ${S}$. This means that for each edge ${e = (u, v)}$, at least one of ${u}$ or ${v}$ is in ${S}$.

The vertex cover problem is known to be NP-complete, meaning that it is very hard (or impossible) to find a polynomial-time algorithm for the problem.

But what we can do is find an approximately optimal solution to the problem. There’s a beautiful and simple algorithm that finds a set ${S}$ whose size is guaranteed to be within a factor of ${2}$ of optimal.

The algorithm. To build our set ${S}$, we repeatedly find some edge ${e = (u, v)}$ that is not yet covered, and we add both ${u}$ and ${v}$ to our set ${S}$. That’s the entire algorithm.

At first sight, this seems like a crazy algorithm. Why would we put both ${u}$ and ${v}$ into ${S}$, when we only need one of ${u}$ or ${v}$ in order to cover the edge ${e = (u, v)}$?

The answer to this question is simple. If we define ${S_{\text{OPT}}}$ to be the optimal solution to the vertex-cover problem, then ${S_{\text{OPT}}}$ must contain at least one of ${u}$ or ${v}$. By placing both of ${u}$ and ${v}$ into our set ${S}$, we guarantee that at least half the elements in our set ${S}$ also appear in ${S_{\text{OPT}}}$. That means that, once our algorithm finishes, we are guaranteed that ${|S| \le 2 |S_{\text{OPT}}|}$.
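
The algorithm and its guarantee fit in a few lines of Python. The example graph below is arbitrarily chosen, and the brute-force computation of the optimum is there only to illustrate the factor of 2:

```python
from itertools import combinations

def approx_vertex_cover(edges):
    """The 2-approximation: grab any uncovered edge, take both endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def is_cover(edges, s):
    return all(u in s or v in s for u, v in edges)

# A small, arbitrarily chosen example graph.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (3, 4), (4, 5), (2, 5)]
cover = approx_vertex_cover(edges)
assert is_cover(edges, cover)

# Brute-force the optimum (fine for 6 vertices) to check the guarantee.
vertices = sorted({v for e in edges for v in e})
opt = next(len(s) for k in range(len(vertices) + 1)
           for s in combinations(vertices, k) if is_cover(edges, set(s)))

print(len(cover), opt)   # |cover| is at most 2 * opt
```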

A challenge.  This might be the world’s simplest interesting algorithm (and analysis).  But it also might not be. What do you think?

## What Linearity of Expectation Has to Do with Needles and Pi

Take a needle of length ${\ell}$ and drop it in a random position on a hardwood floor whose boards have the same width ${\ell}$:

There’s some probability ${p}$ that the needle crosses between two adjacent floorboards. It turns out that ${p}$ has a surprisingly simple formula,

$\displaystyle p = \frac{2}{\pi}.$

Does this mean we can drop thousands of needles (or, for safety’s sake, popsicle sticks) to accurately estimate ${\pi}$? A few years ago, my friend Jake Hillard and I put this to the test. Jake measured out the width between floorboards in the Stanford History Department and 3D-printed a few dozen popsicle sticks of the same length. Here’s one I kept as a memento:

(Since the sticks weren’t infinitesimally thin, the rule was to count a crossing only if the center of the stick crossed.)

With the help of Stanford Splash, we then recruited about a hundred middle- and high-school students to perform the experiment. In total, we recorded roughly ${4,000}$ popsicle-stick drops, and ended up estimating ${\pi}$ to about… ${2.95}$. That’s actually not too far from what we expected, since even with no experimental error at all, the standard deviation that we would expect with ${4,000}$ data points is roughly ${0.07}$.
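
If you’d rather not recruit a hundred students, you can run the experiment in software. Here’s a quick Monte Carlo sketch, taking ${\ell = 1}$ and putting the floorboard boundaries at integer heights:

```python
import math
import random

def crosses(rng):
    """Drop a unit-length needle on boards of unit width; the board
    boundaries are horizontal lines at integer heights."""
    center = rng.random()                    # height of the needle's center, mod 1
    theta = rng.random() * math.pi           # angle between needle and boards
    half = 0.5 * math.sin(theta)             # vertical half-extent of the needle
    return center < half or center + half > 1

rng = random.Random(0)
n = 200_000
p_hat = sum(crosses(rng) for _ in range(n)) / n
print(p_hat, 2 / math.pi)    # both ≈ 0.6366
print(2 / p_hat)             # the corresponding estimate of pi
```

With 200,000 virtual drops, the resulting estimate of ${\pi}$ typically lands within a percent or so of the truth.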

Ok, so dropping popsicle sticks clearly isn’t the best way to estimate ${\pi}$. But I’m hoping I can convince you that it is one of the combinatorially coolest. And that’s because the formula for ${p}$ is actually just an extraordinarily slick (although slightly informal) application of linearity of expectation.

Anything Can be Solved with Enough Glue. For simplicity, I’ll assume for the rest of the post that $\ell = 1$. Imagine that, instead of dropping a single needle, we drop ${n}$ needles, glued together to form a regular ${n}$-gon. Here’s an example for ${n = 8}$:

Each needle individually crosses between two floorboards with probability ${p}$. By linearity of expectation, the expected total number of crossings by our ${n}$-gon is ${n \cdot p}$. That is, the ${n}$-gon as a whole should, on average, induce ${n}$ times as many crossings as would a single needle.

But now let’s think about what this picture looks like when ${n}$ is large.

When ${n}$ is very large, our ${n}$-gon starts to closely resemble a circle with circumference ${n}$. The height of the ${n}$-gon corresponds to the diameter of the circle, and is therefore roughly ${n / \pi}$. (This step of the argument isn’t actually as obvious as it sounds, and requires a bit more work to make formal.) But no matter how we drop our circle-like ${n}$-gon on the hardwood floor, the total number of crossings will always be almost exactly the same (up to ${\pm 1}$)! In particular, the total number of crossings by needles in the ${n}$-gon is just twice the height of the ${n}$-gon, i.e., $2 \cdot n / \pi$. Since the expected number of crossings can be expressed as ${n \cdot p}$, it follows that, as ${n}$ grows large, ${n \cdot p}$ approaches ${2 \cdot n / \pi}$. Hence the formula ${p = 2 / \pi}$.

## Proving Optimality for the 100 Prisoners Problem

The 100 Prisoners Problem is the single coolest algorithmic riddle I know. It’s a great example of a simple algorithm accomplishing something totally unexpected.

But the algorithm for the 100 Prisoners Problem is only half the story. The second, lesser known half, is a breathtakingly simple proof that no other algorithm can do better.

The Riddle: A sadistic prison warden runs a prison with 100 prisoners (numbered 1 through 100). He decides to force them to play the following game. The warden lines up 100 boxes in a room, and places a random permutation of the numbers 1 through 100 in the boxes. Each prisoner enters the room and is allowed to open 50 boxes. If the prisoner’s number is in one of the opened boxes, then the prisoner wins the game. Here’s the catch: In order for the prisoners to be set free, all of the prisoners have to win the game.

Before each prisoner enters the room, the boxes are reset to their initial configuration (so there’s no way for one person to leave a hint for the next person). Moreover, once the game has begun, no communication is permitted between the prisoners. They can, however, agree on a strategy beforehand for how to choose which boxes to open. The question is, what’s the optimal strategy?

The Algorithm: If the prisoners each independently open a random subset of the boxes, then the probability of all the prisoners winning is ${1/2^{100}}$, which is astronomically small when you consider that there are fewer than ${2^{300}}$ atoms in the universe.

Surprisingly, there is a strategy which allows the prisoners to do much better, raising the probability of freedom to roughly 30 percent.

The algorithm works like this: When the ${i}$-th prisoner walks in the room, they begin by opening the ${i}$-th box. They then look at the number ${j}$ which they just revealed, and use that to decide which box to open next, opening the ${j}$-th box. They continue like this until they’ve opened 50 boxes, opening each successive box according to the number in the previous box.

The Intuition: If a box has some number ${j}$ in it, think of the box as having an arrow pointing from the box to the ${j}$-th box. If you start at a given box, and follow the arrows, eventually you’ll come in a full circle and get back to the box where you started. In other words, each box is part of a cycle of arrows.

The ${i}$-th prisoner starts by opening the ${i}$-th box. They then continue to open each box in the cycle until they get to the box containing the number ${i}$. Notice that the arrow from the box containing the number ${i}$ actually goes back to the ${i}$-th box, the very first box the prisoner opened. In other words, the box containing ${i}$ is guaranteed to be the last box in the cycle. It follows that the ${i}$-th prisoner will win the game if and only if the cycle containing the ${i}$-th box is of length no more than fifty.
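
This characterization—all the prisoners win if and only if every cycle has length at most fifty—also makes the strategy easy to simulate. Here’s a quick Python sanity check:

```python
import random

def max_cycle_length(perm):
    """Length of the longest cycle in a permutation (given as a list)."""
    seen, longest = [False] * len(perm), 0
    for i in range(len(perm)):
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        longest = max(longest, length)
    return longest

rng = random.Random(1)
trials = 20_000
wins = sum(max_cycle_length(rng.sample(range(100), 100)) <= 50
           for _ in range(trials))
win_rate = wins / trials
print(win_rate)   # ≈ 0.31
```

The empirical win rate hovers around 31%, matching the roughly-30-percent figure quoted above.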

In order for any prisoner to lose, there has to be some cycle of length more than fifty. This fact alone is very powerful, because it ensures that if any prisoners lose, then actually at least half of the prisoners lose (namely, all the prisoners in the cycle of length more than fifty lose). The algorithm manages to make it so that whether or not various prisoners lose is now highly correlated.

To see how powerful this observation is, consider the following slightly easier variant of the game: Each prisoner opens boxes just as before, but now they are permitted to open 75 boxes instead of 50. Using our algorithm, whenever any prisoners lose, at least 75 prisoners lose. But, since each prisoner individually has a ${1/4}$ chance of losing, the average number of prisoners who will lose is 25. Given that the average is 25, the fraction of the time that more than 75 can lose is at most ${\frac{25}{75} = 1/3}$. The remaining ${2/3}$ of the time, all of the prisoners must win, resulting in them being set free.

Although this is a slick analysis, it doesn’t work when each prisoner is only allowed to open fifty boxes. Instead, we’ll do a more careful analysis, allowing us to calculate the exact probability of winning.

A Precise Analysis: Let’s start by counting the number of box configurations (i.e., permutations of numbers) which result in there being a cycle of length, say, 63. In order to construct such a permutation, we can begin by picking which boxes will be in the cycle. There are 100 ways to pick the first box in the cycle, 99 ways to then pick the second box in the cycle, and so on, ending with 38 ways to pick the final box in the cycle. But, of course, that’s not quite right, because what does it mean for a box to be the “first” box in the cycle? A cycle has no first element, meaning there are 63 ways to construct the same cycle. Dividing by 63, we get that the number of ways to construct a cycle of length 63 is

$\displaystyle \frac{100 \cdot 99 \cdots 38}{63}.$

Once we’ve constructed the cycle, we still need to determine where the other numbers will go. Notice that 63 numbers already have their positions determined by the cycle which we constructed. We’re left with 37 remaining numbers and 37 positions to put them. Putting them in an arbitrary ordering, there are

$\displaystyle 37 \cdot 36 \cdot 35 \cdots 1$

ways to complete the construction of our permutation.

Multiplying the number of ways to pick the cycle by the number of ways to then complete the permutation, it follows that the number of permutations containing a cycle of length 63 is

$\displaystyle \frac{100 \cdot 99 \cdots 1}{63}.$

But that’s just the total number of permutations, ${100!}$, divided by ${63}$. In other words, if you pick a permutation at random, then the probability of there being a cycle of length ${63}$ is ${1/63}$.
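
You can sanity-check this ${1/63}$ figure by sampling random permutations:

```python
import random

def cycle_lengths(perm):
    """Cycle lengths of a permutation given as a list."""
    seen, lengths = [False] * len(perm), []
    for i in range(len(perm)):
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        if length:
            lengths.append(length)
    return lengths

rng = random.Random(0)
n, m, trials = 100, 63, 40_000
hits = sum(m in cycle_lengths(rng.sample(range(n), n)) for _ in range(trials))
print(hits / trials, 1 / m)   # both ≈ 0.0159
```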

The same is true if we replace ${63}$ with any of the numbers ${51, 52, \ldots, 100}$. The probability of there being any cycle of length greater than 50, and the prisoners therefore losing the game, is exactly

$\displaystyle \frac{1}{51} + \frac{1}{52} + \cdots + \frac{1}{100}.$

At this point, we could just do out the summation by hand to get the answer. But there’s an easier way. In particular, we use the fact that the ${n}$-th Harmonic number ${H_n}$, which is defined to be

$\displaystyle H_n = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n},$

satisfies ${H_n \approx \ln n}$. Notice that the probability of the prisoners losing the game can be written as ${H_{100} - H_{50}}$. Plugging in the approximation for ${H_n}$, we get that the probability of some prisoner losing is roughly

$\displaystyle \ln 100 - \ln 50 = \ln \frac{100}{50} = \ln 2 \approx 0.7,$

as desired.
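
For what it’s worth, the exact sum is a one-liner, and it comes out slightly below the ${\ln 2}$ approximation:

```python
import math

p_lose = sum(1 / k for k in range(51, 101))   # H_100 - H_50, the exact answer
print(p_lose)            # ≈ 0.6882
print(math.log(2))       # ≈ 0.6931
print(1 - p_lose)        # ≈ 0.3118: an ~31% chance that everyone goes free
```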

Proving Optimality: A natural question is, can we do better? Surprisingly there’s a very simple and elegant proof that we cannot. Moreover, the proof gives a deep intuition for why the algorithm does so well.

Consider the following variant of the game, which I call The Baby 100 Prisoners Problem. In this version of the game, whenever a prisoner opens a box, the box is left open for any prisoners who come later. When a prisoner comes in, if their number has already been revealed, then the prisoner wins for free. Otherwise, the prisoner is allowed to open up to 50 of the remaining boxes in search of their number. Once the prisoner finds their number, they are not permitted to open any more boxes.

The baby problem is clearly much easier than the original. In fact, it’s not even an interesting algorithmic problem anymore, since every strategy performs exactly the same. In particular, because the numbers are randomly permuted in the boxes, whenever you open a box that has never been opened before, it doesn’t matter which unopened box you pick.

So the baby problem is easier than the original, and has the property that every algorithm performs optimally on it. Our proof strategy for showing that the cycle algorithm for the 100 Prisoners Problem is optimal will be to show that the cycle algorithm performs exactly the same on the standard problem as it does on the baby version of the problem. In other words, by using the cycle algorithm, we make it so that the full problem is just as easy as the baby one!

Consider the cycle algorithm for the standard version of the problem. Suppose the ${i}$-th prisoner comes into the room, and all the previous prisoners have won the game so far. The ${i}$-th prisoner wishes to find the box containing the number ${i}$. Notice that if that box has ever been opened before (by a previous prisoner), then the entire cycle for that box must have also been opened by the past prisoner. Since all the past prisoners won the game, the cycle must be length no greater than fifty. It follows that the ${i}$-th prisoner will also win the game! In other words, if the box containing ${i}$ has been opened at some point in the past, then we’re guaranteed to successfully find it now. On the other hand, if the box hasn’t been opened yet, then all the boxes which we will open will be in a cycle which has not yet been explored. That is, we will be opening only boxes which have not yet been opened.

But that’s exactly what happens in the baby problem! If the box we’re looking for has been opened before then we get it for free, and otherwise, we open new boxes to search for it! The cycle algorithm allows us to simulate the baby problem while actually playing the standard problem. Since the algorithm manages to do as well on the standard problem as it does on the baby problem, and since no algorithm for the standard problem can do better than any algorithm for the baby problem, it follows that the cycle algorithm must be optimal.

Conclusion: The 100 Prisoners Problem implicitly relies on a number of beautiful ideas from permutation combinatorics. At the heart of the algorithm’s analysis is the theory of permutation cycles. The proof of optimality can even be seen as implicitly constructing what’s known as Foata’s Transition Lemma, a classic bijection between two canonical representations of permutations.

Sometimes seeing classic combinatorial ideas applied to a different and less intuitive setting makes you realize just how elegant and subtle those ideas actually are. For me that’s exactly what the 100 Prisoners Problem does. It reminds me that sometimes the key to fully appreciating mathematical beauty is to look at it the wrong way.

## The Combinatorics of Boarding an Airplane

I’ve spent a lot of time at airports recently. And whenever I’m at an airport I find myself thinking about combinatorial problems having to do with airplanes. While flying from SFO to BOS recently, I stumbled across something I found really surprising.

Suppose ${n}$ people board a plane in a random order. Each person has a seat number assigned to them, and whenever someone gets to their seat, they pause for a moment to put their luggage in the overhead bin.

Whenever someone is pausing for their luggage, they prevent anyone in back of them from passing. Suppose that it takes each person time ${w}$ to walk by a given seat, and time ${p \gg w}$ to put away their luggage once they get to their seat. The question is:

How long will it take for everyone to sit down?

In the worst case the passengers are arranged in decreasing numerical order. This means everyone will have to wait on everyone in front of them, and boarding will take time roughly ${n \cdot p}$. If, on the other hand, the passengers arrive in the correct (increasing) order, then they’ll each be able to walk straight to their seats, boarding in time ${n \cdot w + p}$.

But what if the passengers arrive in a random order? Then how much parallelism do we get?

The answer, as it turns out, is quite a lot. In fact, the expected time to board the plane will be

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

This has some pretty weird consequences. For example, it means that for any fixed ${w}$ and ${p}$, and for a large enough plane-size ${n}$, the boarding time will become dominated by the time necessary to walk down to the end of the plane. The time spent waiting on people to put their luggage overhead becomes a non-issue!

A Simplifying Assumption:  To make our analysis easier, we’ll make a simplifying assumption that turns out not to affect the outcome of the analysis. In particular, we’ll treat each of the passengers as though they don’t take up any space in the aisle. For example, if one passenger is paused at seat $i$, and some passengers behind them want to get past seat $i$, then that won’t stop another passenger from getting to seat $i - 1$.

On top of this, we’re already making the simplifying assumption that each row has only one seat. It turns out that both of these assumptions can be removed without changing the end result. I’ll talk about this more at the end of the post.

The Combinatorics:  The key insight of the analysis is this: Think of the line of people as a permutation of seat numbers. If the longest decreasing subsequence in that permutation has length ${d}$, then it turns out the time to board the plane will be within a factor of two of ${n \cdot w + d \cdot p}$.

To see this, consider some person ${x_1}$ as they walk to their seat. Let ${x_2}$ be the final person on whom ${x_1}$ waits before ${x_1}$ gets to their seat. Then the total time ${x_1}$ spends waiting on others will be at most ${p}$ greater than the total time that ${x_2}$ spends waiting on others. Similarly there is some person ${x_3}$ who is the last person on whom ${x_2}$ waits; and ${x_2}$ spends at most ${p}$ time more in total waiting than does ${x_3}$. Continuing like this, we can find a subsequence of people

$\displaystyle x_1, x_2, x_3, \ldots, x_k$

such that the right-most person ${x_k}$ in the sequence doesn’t have to wait at all, and such that each person ${x_i}$ spends at most ${p}$ longer than ${x_{i + 1}}$ waiting. This means that our original person ${x_1}$ spends time at most ${k \cdot p}$ waiting on others.

The sequence of people ${x_1, x_2,x_3, \ldots, x_k}$ also has the property that their seat numbers are in decreasing order. So what we’ve really shown is that if ${d}$ is the length of the longest decreasing subsequence in the permutation of people, then no person will have to wait for time more than ${d \cdot p}$, and everyone will have boarded after time ${n \cdot w + d \cdot p}$.

On the other hand, it’s not hard to see that the left-most person in the length-${d}$ decreasing sequence will require time at least ${d \cdot p}$ to board the plane; and that the left-most person in the entire line will require time at least ${n \cdot w}$ to board. This means that the true boarding time will be within a factor of two of ${n \cdot w + d \cdot p}$.
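To make the accounting concrete, here's a small simulation (my own model, not code from the post). I assume seats are numbered so that higher numbers sit nearer the door, everyone is queued at the door at time zero, and passengers take no aisle space; under those assumptions the boarding time satisfies a simple max-plus recurrence, and we can check it against ${n \cdot w + d \cdot p}$:

```python
import random

def boarding_time(seats, w, p):
    """Sit-down time of the last passenger under the zero-aisle-space model.

    seats[j] is the seat of the j-th person in line. With higher-numbered
    seats nearer the entrance, person j is blocked at seat seats[i]
    whenever i < j and seats[i] > seats[j]. O(n^2) recurrence.
    """
    n = len(seats)
    finish = []
    for j, s in enumerate(seats):
        reach = (n - s) * w  # unobstructed walk from the door to seat s
        for i in range(j):
            if seats[i] > s:  # person i pauses at a seat person j must pass
                reach = max(reach, finish[i] + (seats[i] - s) * w)
        finish.append(reach + p)  # pause p to stow luggage, then sit
    return max(finish)

def longest_decreasing(seats):
    """Length d of the longest decreasing subsequence (O(n^2) DP)."""
    best = [1] * len(seats)
    for j in range(len(seats)):
        for i in range(j):
            if seats[i] > seats[j]:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)

n, w, p = 300, 1, 40
seats = list(range(n))
random.shuffle(seats)
t, d = boarding_time(seats, w, p), longest_decreasing(seats)
print(t, n * w + d * p)  # the first lands within a factor of two of the second
```

Whatever permutation you feed in, the simulated time stays within a factor of two of ${n \cdot w + d \cdot p}$, just as the argument above promises.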

So the real question becomes: On average, how large is ${d}$, the length of the longest decreasing subsequence in a random permutation?

Bounding the longest decreasing subsequence: We’ll start by considering a different question. For a given length ${k}$, how many different decreasing subsequences of length ${k}$ should we expect in our permutation? (We count two subsequences as different even if they differ in only a single element.)

This is easy to compute by linearity of expectation: Each ${k}$-element subsequence has a ${1/k!}$ probability of being in decreasing order. And there are ${\binom{n}{k}}$ different ${k}$-element subsequences. So the expected number of decreasing subsequences is

$\displaystyle \frac{\binom{n}{k}}{k!} = \frac{n \cdot (n - 1) \cdots (n - k + 1)}{k! \cdot k!}.$
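As a sanity check on the formula (my own quick experiment, not part of the original argument), we can count decreasing triples in small random permutations and compare against ${\binom{n}{k}/k!}$:

```python
import itertools
import math
import random

n, k, trials = 8, 3, 20_000
total = 0
for _ in range(trials):
    perm = random.sample(range(n), n)
    # combinations() yields k-element subsequences in line order,
    # so we just count the ones that are strictly decreasing
    total += sum(1 for c in itertools.combinations(perm, k)
                 if all(a > b for a, b in zip(c, c[1:])))

print(total / trials)                       # empirical average
print(math.comb(n, k) / math.factorial(k))  # predicted: 56/6 = 9.33...
```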

A useful fact about ${k!}$ is that it’s at least as large as ${(k / 3)^k}$. So the expected number of decreasing ${k}$-element subsequences is at most

$\displaystyle \frac{n^k}{(k^2 / 9)^k}.$

When ${k = 6\sqrt{n}}$, we have ${k^2/9 = 4n}$, so this is at most ${\frac{1}{4^k} = \frac{1}{4^{6 \sqrt{n}}}}$.

Of course, if the expected number of ${6\sqrt{n}}$-element decreasing subsequences is that small, then by Markov's inequality the probability of even one such subsequence appearing is just as small.

So with extremely high probability, no decreasing subsequence will be of length more than ${O(\sqrt{n})}$. And this implies that the expected length is also at most ${O(\sqrt{n})}$.

What about a lower bound? We’ve just shown that the expected length of the longest decreasing subsequence is at most ${O(\sqrt{n})}$. But is that tight?

For this, we can use a classical result from permutation combinatorics. The Erdős–Szekeres Theorem says that any permutation of length ${n}$ must contain either a decreasing or an increasing subsequence of length ${\sqrt{n}}$. Since a random permutation is equally likely to favor either case (reversing the line order swaps the two), there will be a decreasing subsequence of length ${\sqrt{n}}$ with probability at least ${1/2}$. So the expected length of the longest decreasing subsequence is at least $\sqrt{n}/2$.
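We can check both bounds empirically. Here's a quick experiment (mine), using patience sorting, which finds the longest decreasing subsequence in ${O(n \log n)}$ time:

```python
import bisect
import random

def longest_decreasing_length(perm):
    """Longest strictly decreasing subsequence via patience sorting:
    negate the entries and find the longest strictly increasing one."""
    tails = []  # tails[i] = smallest tail of an increasing run of length i+1
    for x in perm:
        i = bisect.bisect_left(tails, -x)
        if i == len(tails):
            tails.append(-x)
        else:
            tails[i] = -x
    return len(tails)

for n in [100, 400, 1600, 6400]:
    avg = sum(longest_decreasing_length(random.sample(range(n), n))
              for _ in range(200)) / 200
    print(n, avg / n ** 0.5)
```

The printed ratio lands comfortably between the ${1/2}$ and ${6}$ proven above; in fact it slowly approaches ${2}$, the constant from the Vershik–Kerov / Logan–Shepp theorem on longest monotone subsequences.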

The full version of the problem:  Near the beginning of the post I mentioned we were making two assumptions: (1) That passengers don’t take up any space in the aisle; and (2) that each row has a single seat.

Removing these assumptions makes the problem considerably harder. But it turns out the end conclusion is basically the same. A 2009 math paper on the topic analyzed the problem in its full complexity, and found that as long as each person takes up roughly one row's worth of space in the aisle, and the number of seats per row is constant, the boarding time remains

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

## The (Almost) Secret Algorithm Researchers Used to Break Thousands of RSA Keys

RSA encryption allows anyone to send me messages that only I can decode. To set this up, I select two large random primes $p$ and $q$ (each of which is hundreds of bits long), and release their product $x = p \cdot q$ online for everyone to see; $x$ is known as my public key. In addition, I pick some number $e$ which shares no factors with $p-1$ or $q-1$ and release it online as well.

The beauty of RSA encryption is that using only the information I publicly released, anyone can encode a message they want to send me. But without knowing the values of $p$ and $q$, nobody but me can decode the message. And even though everyone knows my public key $x = p \cdot q$, that doesn’t give them any efficient way to find values for $p$ or $q$. In fact, even factoring a 232-digit number took a group of researchers more than 1,500 years of computing time (distributed among hundreds of computers).

On the surface, RSA encryption seems uncrackable. And it might be, too, except for one small problem. Almost everyone uses the same random-prime-number generators.

A few years ago, this gave researchers an idea. Suppose Bob and Alice both post public keys online. Because they both used the same program to generate random prime numbers, there’s a higher-than-random chance that their public keys share a prime factor. Factoring Bob’s or Alice’s public key individually would be nearly impossible. But finding a common factor between them is much easier. In fact, the time needed to compute the greatest common divisor (GCD) of two numbers is close to proportional to the number of digits in the two numbers. Once I identify the prime factor shared by Bob’s and Alice’s keys, I can divide it out to obtain the prime factorization of both keys. In turn, I can decode any messages sent to either Bob or Alice.
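Here's the attack in miniature, with toy primes a few digits long rather than hundreds of bits (the names and numbers are mine, purely for illustration):

```python
import math

# Suppose both generators happened to output the prime 1000003.
p = 1000003                       # the shared prime factor
q_bob, q_alice = 1000033, 1000037

bob_key = p * q_bob               # the public keys: products of two primes
alice_key = p * q_alice

# Neither key reveals its factors on its own, but together they do:
shared = math.gcd(bob_key, alice_key)
print(shared)                     # 1000003 -- the shared prime
print(bob_key // shared)          # 1000033 -- Bob's other prime
print(alice_key // shared)        # 1000037 -- Alice's other prime
```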

Armed with this idea, the researchers scanned the web and collected 6.2 million actual public keys. They then computed the greatest common divisor between pairs of keys, cracking a key whenever it shared a prime factor with any other key. All in all, they were able to break 12,934 keys. In other words, if used carelessly, RSA encryption provides less than $99.8\%$ security.

At first glance this seems like the whole story. Reading through their paper more closely, however, reveals something strange. According to the authors, they were able to run the entire computation in a matter of hours on a single core. But a back-of-the-envelope calculation suggests that computing GCDs for 36 trillion pairs of keys should take years, not hours.

So how did they do it? The authors hint in a footnote that at the heart of their computation is an asymptotically fast algorithm, allowing them to bring the running time of the computation down to nearly linear; but the actual description of the algorithm is kept a secret from the reader, perhaps to guard against malicious use. Within just months of the paper’s publication, though, follow-up papers had already discussed various approaches in detail, both presenting fast algorithms (see this paper and this paper), and even showing how to use GPUs to make the brute-force approach viable (see this paper).

There’s probably a lesson here about not bragging about things if you want them to stay secret. But for this post I’m not interested in lessons. I’m interested in algorithms. And this one turns out to be both relatively simple and quite fun.

Algorithm Prerequisites: Our algorithm will deal with integers having an asymptotically large number of digits. Consequently, we cannot treat addition and multiplication as constant-time operations.

For $n$-bit integers, addition takes $O(n)$ time. Using long multiplication, multiplication would seem to take $O(n^2)$ time. However, the Schönhage–Strassen algorithm brings this down to $O(n \log n \log \log n)$.

Computing the GCD naively using the Euclidean algorithm would take $O(n^2 \log n \log \log n)$ time. Once again, however, researchers have found a better algorithm, running in time $O(n \log^2 n \log \log n)$.

Fortunately, all of these algorithms are already implemented for us in GMP, the GNU big-integer library. For the rest of the post we will use Big-O-Tilde notation, a variant of Big-O notation that ignores logarithmic factors. For example, while GCD-computation takes time $O(n \log^2 n \log \log n)$, in Big-O-Tilde notation we write that it takes time $\widetilde{O}(n)$.

Transforming the Problem: Denote the set of public RSA keys by $k_1, \ldots, k_n$, where each key is the product of two large prime numbers (each hundreds of bits long) and $n$ is the total number of keys. Rather than computing the GCD of each pair of keys, we can instead compute, for each key $k_i$, the GCD of $k_i$ and the product of all the other keys, $\prod_{t \neq i} k_t$. If $k_i$ shares exactly one prime factor with the other keys, then this GCD is that prime factor. If both prime factors of $k_i$ are shared with other keys, however, then the GCD is $k_i$ itself, and the computation fails to extract the individual prime factors. This case is probably rare enough that it’s not worth worrying about.

The Algorithm: The algorithm has a slightly unusual recursive structure in that the recursion occurs in the middle of the algorithm rather than at the end.

At the beginning of the algorithm, all we have is the keys,

$k_1,$
$k_2,$
$k_3, \cdots$

The first step of the algorithm is to pair off the keys and compute their products,

$j_1 = k_1 \cdot k_2,$
$j_2 = k_3 \cdot k_4,$
$j_3 = k_5 \cdot k_6, \cdots$

Next we recurse on the sequence of numbers $j_1, \ldots, j_{n / 2}$, in order to compute

$r_1 = GCD(j_1, \prod_{t \neq 1} j_t),$
$r_2 = GCD(j_2, \prod_{t \neq 2} j_t),$
$r_3 = GCD(j_3, \prod_{t \neq 3} j_t), \cdots$

Our goal is to compute $s_i = GCD(k_i, \prod_{t \neq i} k_t)$ for each key $k_i$. The key insight is that when $i$ is odd, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{(i + 1) / 2} \cdot k_{i + 1}),$
and that when $i$ is even, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{i / 2} \cdot k_{i - 1}).$
To see why this is the case, one can verify that the term on the right side of the GCD is guaranteed to be a multiple of $GCD(k_i, \prod_{t \neq i} k_t)$, while also being a divisor of $\prod_{t \neq i} k_t$. This, in turn, implies that the GCD-computation will evaluate to exactly $GCD(k_i, \prod_{t \neq i} k_t)$, as desired.

Computing each of the $s_i$‘s in terms of the $r_i$‘s and $k_i$‘s completes the algorithm.
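The whole recursion fits in a short sketch (my own code, assuming for simplicity that the number of keys is a power of two):

```python
import math

def batch_gcd(keys):
    """For each key k_i, compute gcd(k_i, product of all the other keys),
    using the pairing recursion described above. Assumes len(keys) is a
    power of two; a real implementation would pad or handle odd sizes.
    """
    n = len(keys)
    if n == 1:
        return [1]  # gcd with an empty product
    # Step 1: pair off the keys and compute their products j_1, j_2, ...
    pairs = [keys[i] * keys[i + 1] for i in range(0, n, 2)]
    # Step 2: recurse to get r_t = gcd(j_t, product of the other j's)
    r = batch_gcd(pairs)
    # Step 3: s_i = gcd(k_i, r * (k_i's partner within its pair))
    return [math.gcd(k, r[i // 2] * keys[i + 1 if i % 2 == 0 else i - 1])
            for i, k in enumerate(keys)]

# Four toy keys; the prime 1000003 is shared between keys 0 and 2.
keys = [1000003 * 1000033, 1000037 * 1000039,
        1000003 * 1000081, 1000099 * 1000117]
print(batch_gcd(keys))  # [1000003, 1, 1000003, 1]
```

Running it on the toy keys recovers the shared prime from both keys that contain it, and returns 1 for the keys whose factors appear nowhere else.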

Bounding the Running Time: Let $m$ denote the total number of bits needed to write down $k_1, k_2, \ldots, k_n$. Each time the algorithm recurses, the total number of bits in the input being recursed on is no larger than at the previous level of recursion; this is because the new inputs are products of pairs of elements from the old input.

Therefore each of the $O(\log n)$ levels of recursion acts on an input of total size $O(m)$ bits. Moreover, the arithmetic operations within each level of recursion take total time at most $\widetilde{O}(m)$. Thus the total running time of the algorithm is also $\widetilde{O}(m)$ (since the $O(\log n)$ recursion levels can be absorbed into the Big-O-Tilde notation).

If we unwrap the running time into standard Big-O notation, we get
$O(m \log^3 m \log \log m).$

Is it practical? At first glance, the triple-logarithmic factor might seem like a deal breaker for this algorithm. But the actual performance turns out to be pretty reasonable. This paper found that the algorithm takes time roughly 7.65 seconds per thousand keys, meaning it would take a little more than 13 hours to run on 6.2 million keys.

One of the log factors can be removed using a slightly more clever variant of the algorithm, which avoids GCD computations at all but the first level of recursion (See this paper). The improved algorithm takes about 4.5 seconds per thousand keys, resulting in a total running time of about 7.5 hours to handle 6.2 million keys.

So there we go. A computation that should have taken years is reduced to a matter of hours.  And all it took was a bit of clever recursion.