## The Combinatorics of Boarding an Airplane

I’ve spent a lot of time at airports recently. And whenever I’m at an airport I find myself thinking about combinatorial problems having to do with airplanes. While flying from SFO to BOS recently, I stumbled across something I found really surprising.

Suppose ${n}$ people board a plane in a random order. Each person has a seat number assigned to them, and whenever someone gets to their seat, they pause for a moment to put their luggage in the overhead bin.

Whenever someone is pausing for their luggage, they prevent anyone in back of them from passing. Suppose that it takes each person time ${w}$ to walk by a given seat, and time ${p \gg w}$ to put away their luggage once they get to their seat. The question is:

How long will it take for everyone to sit down?

In the worst case the passengers are arranged in decreasing numerical order. This means everyone will have to wait on everyone in front of them, and boarding will take time roughly ${n \cdot p}$. If, on the other hand, the passengers arrive in the correct (increasing) order, then they’ll each be able to walk straight to their seats, boarding in time ${n \cdot w + p}$.

But what if the passengers arrive in a random order? Then how much parallelism do we get?

The answer, as it turns out, is quite a lot. In fact, the expected time to board the plane will be

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

This has some pretty weird consequences. For example, it means that for any fixed ${w}$ and ${p}$, and for a large enough plane-size ${n}$, the boarding time will become dominated by the time necessary to walk down to the end of the plane. The time spent waiting on people to put their luggage overhead becomes a non-issue!

A Simplifying Assumption:  To make our analysis easier, we’ll make a simplifying assumption that turns out not to affect the outcome of the analysis. In particular, we’ll treat each of the passengers as though they don’t take any space in the aisle. For example, if one passenger is paused at seat $i$, and there are there are some passengers waiting behind him that want to get past seat $i$, then that won’t stop the passenger behind them from getting to seat $i - 1$.

On top of this, we’re already making the simplifying assumption that each row has only one seat. It turns out that both of these assumptions can be removed without changing the end result. I’ll talk about this more at the end of the post.

The Combinatorics:  The key insight to analyze the algorithm is this: Think of the people as a permutation of seat numbers. If the longest decreasing subsequence in the line of people has length ${d}$, then it turns out the time to board the plane will be within a factor of two of ${n \cdot w + d \cdot p}$.

To see this, consider some person ${x_1}$ as they walk to their seat. Let ${x_2}$ be the final person on whom ${x_1}$ waits before ${x_1}$ gets to their seat. Then the total time ${x_1}$ spends waiting on others will be at most ${p}$ greater than the total time that ${x_2}$ spends waiting on others. Similarly there is some person ${x_3}$ who is the last person on whom ${x_2}$ waits; and ${x_2}$ spends at most ${p}$ time more in total waiting than does ${x_3}$. Continuing like this, we can find a subsequence of people

$\displaystyle x_1, x_2, x_3, \ldots, x_k$

such that the right-most person ${x_k}$ in the sequence doesn’t have to wait at all, and such that each person ${x_i}$ spends at most ${p}$ longer than ${x_{i + 1}}$ waiting. This means that our original person ${x_1}$ spends time at most ${k \cdot p}$ waiting on others.

The sequence of people ${x_1, x_2,x_3, \ldots, x_k}$ also has the property that their seat numbers are in decreasing order. So what we’ve really shown is that if ${d}$ is the length of the longest decreasing subsequence in the permutation of people, then no person will have to wait for time more than ${d \cdot p}$, and everyone will have boarded after time ${n \cdot w + d \cdot p}$.

On the other hand, it’s not hard to see that the left-most person in the length-${d}$ decreasing sequence will require time at least ${d \cdot p}$ to board the plane; and that the left-most person in the entire line will require time at least ${n \cdot w}$ to board. This means that the true boarding time will be within a factor of two of ${n \cdot w + d \cdot p}$.

So the real question becomes: On average, how large is ${d}$, the length of the longest decreasing subsequence in a random permutation?

Bounding the longest decreasing subsequence: We’ll start by considering a different question. For a given length ${k}$, how many different decreasing subsequences of length ${k}$ should we expect in our permutation? (We count two subsequences as different even if they differ in only a single element.)

This is easy to compute by linearity of expectation: Each ${k}$-element subsequence has a ${1/k!}$ probability of being in decreasing order. And there are ${\binom{n}{k}}$ different ${k}$-element subsequences. So the expected number of decreasing subsequences is

$\displaystyle \frac{\binom{n}{k}}{k!} = \frac{n \cdot (n - 1) \cdots (n - k + 1)}{k! \cdot k!}.$

A useful fact about ${k!}$ is that it’s at least as large as ${(k / 3)^k}$. So the expected number of decreasing ${k}$-element subsequences is at most

$\displaystyle \frac{n^k}{(k^2 / 9)^k}.$

When ${k = 6\sqrt{n}}$, this reduces to at most ${\frac{1}{4^k} = \frac{1}{4^{6 \sqrt{n}}}}$.

Of course, if the expected number of ${6\sqrt{n}}$-element decreasing subsequences is so small, then so is the probability of any such subsequence appearing.

So with extremely high probability, no decreasing subsequence will be of length more than ${O(\sqrt{n})}$. And this implies that the expected length is also at most ${O(\sqrt{n})}$.

What about a lower bound? We’ve just shown that the expected length of the longest decreasing subsequence is at most ${O(\sqrt{n})}$. But is that tight?

For this, we can use a classical result from permutation combinatorics. The Erdös-Szekeres Theorem says that any permutation of length ${n}$ must either contain either a decreasing or an increasing subsequence of length ${\sqrt{n}}$. For a random permutation, this means that with probability at least ${1/2}$, there will be a decreasing subsequence of length ${\sqrt{n}}$.  So the expected length of the longest decreasing subsequence is at least $\sqrt{n}/2$.

The full version of the problem:  Near the beginning of the post I mentioned we were making two assumptions: (1) That passengers don’t take up any space in the aisle; and (2) that each row has a single seat.

Removing these assumptions makes the problem considerably harder. But it turns out the end conclusion is basically the same. A 2009 math paper on the topic analyzed the problem in its full complexity, and found that as long as each person takes up roughly one row worth of space in the aisle, and the number of seats in each row is constant,  the boarding time remains

$\displaystyle \Theta(n \cdot w + \sqrt{n} \cdot p).$

## The (Almost) Secret Algorithm Researchers Used to Break Thousands of RSA Keys

RSA encryption allows for anyone to send me messages that only I can decode. To set this up, I select two large random primes $p$ and $q$ (each of which is hundreds of bits long), and release their product $x = p \cdot q$ online for everyone to see; $x$ is known as my public key. In addition, I pick some number $e$ which shares no factors with $p-1$ or $q-1$ and release it online as well.

The beauty of RSA encryption is that using only the information I publicly released, anyone can encode a message they want to send me. But without knowing the values of $p$ and $q$, nobody but me can decode the message. And even though everyone knows my public key $x = p \cdot q$, that doesn’t give them any efficient way to find values for $p$ or $q$. In fact, even factoring a 232-digit number took a group of researchers more than 1,500 years of computing time (distributed among hundreds of computers).

On the surface, RSA encryption seems uncrackable. And it might be too, except for one small problem. Almost everyone uses the same random-prime-number generators.

A few years ago, this gave researchers an idea. Suppose Bob and Alice both post public keys online. But since they both used the same program to generate random prime numbers, there’s a higher-than-random chance that their public keys share a prime factor. Factoring Bob’s or Alice’s public keys individually would be nearly impossible. But finding any common factors between them is much easier. In fact, the time needed to compute the largest common divisor between two numbers is close to proportional to the number of digits in the two numbers. Once I identify the common prime factor between Bob’s and Alice’s keys, I can factor it out to obtain the prime factorization of both keys. In turn, I can decode any messages sent to either Bob or Alice.

Armed with this idea, the researchers scanned the web and collected 6.2 million actual public keys. They then computed the largest common divisor between pairs of keys, cracking a key whenever it shared a prime factor with any other key. All in all, they were able to break 12,934 keys. In other words, if used carelessly, RSA encryption provides less than $99.8\%$ security.

At first glance this seems like the whole story. Reading through their paper more closely, however, reveals something strange. According to the authors, they were able to run the entire computation in a matter of hours on a single core. But a back-of-the-envelope calculation suggests that it should take years to compute GCD’s between 36 trillion pairs of keys, not hours.

So how did they do it? The authors hint in a footnote that at the heart of their computation is an asymptotically fast algorithm, allowing them to bring the running time of the computation down to nearly linear; but the actual description of the algorithm is kept a secret from the reader, perhaps to guard against malicious use. Within just months of the paper’s publication, though, follow-up papers had already discussed various approaches in detail, both presenting fast algorithms (see this paper and this paper), and even showing how to use GPUs to make the brute-force approach viable (see this paper).

There’s probably a lesson here about not bragging about things if you want them to stay secret. But for this post I’m not interested in lessons. I’m interested in algorithms. And this one turns out to be both relatively simple and quite fun.

Algorithm Prerequisites: Our algorithm will deal with integers having an asymptotically large number of digits. Consequently, we cannot treat addition and multiplication as constant-time operations.

For $n$-bit integers, addition takes $O(n)$ time. Using long multiplication, multiplication would seem to take $O(n^2)$ time. However, it turns out there is an algorithm which runs in time $O(n \log n \log \log n)$.

Computing the GCD naively using the Euclidean algorithm would take $O(n^2 \log n \log \log n)$ time. Once again, however, researchers have found a better algorithm, running in time $O(n \log^2 n \log \log n)$.

Fortunately, all of these algorithms are already implemented for us in GMP, the C++ big-integer library. For the rest of the post we will use Big-O-Tilde notation, a variant of Big-O notation that ignores logarithmic factors. For example, while GCD-computation takes time $O(n \log^2 n \log \log n)$, in Big-O-Tilde notation we write that it takes time $\widetilde{\text{O}}(n)$.

Transforming the Problem: Denote the set of public RSA keys by $k_1, \ldots, k_n$, where each key is the product of two large prime numbers (i.e., hundred digits). Note that $n$ is the total number of keys. Rather than computing the GCD of each pair of keys, we can instead compute for each key $k_i$ the GCD of it and the product of all the other keys $\prod_{t \neq i} k_t$. If a key $k_i$ shares exactly one prime factor with other keys, then this provides that prime factor. If both prime factors of $k_i$ are shared with other keys, however, then the computation will fail to actually extract the individual prime factors. This case is probably rare enough that it’s not worth worrying about.

The Algorithm: The algorithm has a slightly unusual recursive structure in that the recursion occurs in the middle of the algorithm rather than at the end.

At the beginning of the algorithm, all we have is the keys,

$k_1,$
$k_2,$
$k_3, \cdots$

The first step of the algorithm is to pair off the keys and compute their products,

$j_1 = k_1 \cdot k_2,$
$j_2 = k_3 \cdot k_4,$
$j_3 = k_5 \cdot k_6, \cdots$

Next we recurse on the sequence of numbers $j_1, \ldots, j_{n / 2}$, in order to compute

$r_1 = GCD(j_1, \prod_{t \neq 1} j_t),$
$r_2 = GCD(j_2, \prod_{t \neq 2} j_t),$
$r_3 = GCD(j_3, \prod_{t \neq 3} j_t), \cdots$

Our goal is to compute $s_i = GCD(k_i, \prod_{t \neq i} k_t)$ for each key $k_i$. The key insight is that when $i$ is odd, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{(i + 1) / 2} \cdot k_{i + 1}),$
and that when $i$ is even, $s_i$ can be expressed as
$s_i = GCD(k_i, r_{i / 2} \cdot k_{i - 1}).$
To see why this is the case, one can verify that the term on the right side of the GCD is guaranteed to be a multiple of $GCD(k_i, \prod_{t \neq i} k_t)$, while also being a divisor of $\prod_{t \neq i} k_t$. This, in turn, implies that the GCD-computation will evaluate to exactly $GCD(k_i, \prod_{t \neq i} k_t)$, as desired.

Computing each of the $s_i$‘s in terms of the $r_i$‘s and $k_i$‘s completes the algorithm.

Bounding the Running Time: Let $m$ denote the total number of bits needed to write down $k_1, k_2, \ldots, k_n$. Each time the algorithm recurses, the total number of bits in the input being recursed on is guaranteed to be no more than at the previous level of recursion; this is because the new inputs are products of pairs of elements from the old input.

Therefore each of the $O(\log n)$ levels of recursion act on an input of total size $O(m)$ bits. Moreover, the arithmetic operations within each level of recursion take total time at most $\tilde{O}(m)$. Thus the total running time of the algorithm is also $\tilde{O}(m)$ (since the $O(\log n)$ recursion levels can be absorbed into the Big-O-Tilde notation).

If we unwrap the running time into standard Big-O notation, we get
$O(m \log^3 m \log \log m).$

Is it practical? At first glance, the triple-logarithmic factor might seem like a deal breaker for this algorithm. It turns out the actual performance is pretty reasonable. This paper found that the algorithm takes time roughly 7.65 seconds per thousand keys, meaning it would take a little more than 13 hours to run on 6.2 million keys.

One of the log factors can be removed using a slightly more clever variant of the algorithm, which avoids GCD computations at all but the first level of recursion (See this paper). The improved algorithm takes about 4.5 seconds per thousand keys, resulting in a total running time of about 7.5 hours to handle 6.2 million keys.

So there we go. A computation that should have taken years is reduced to a matter of hours.  And all it took was a bit of clever recursion.

## Fibonacci Formulas For Fantastic Fun

The Fibonacci numbers are the sequence ${1, 1, 2, 3, 5, 8, \ldots}$, and satisfy the recurrence ${F(n) = F(n - 1) + F(n - 2)}$. They also have a beautiful formula,

$\displaystyle F(n) = \frac{1}{\sqrt{5}} \left(\frac{1+ \sqrt{5}}{2}\right)^n - \frac{1}{\sqrt{5}} \left(\frac{1 - \sqrt{5}}{2}\right)^n.$

This seemingly complicated formula actually has a very simple explanation.

Consider the set ${V}$ of number sequences of the form ${a_1, a_2, \ldots}$ that satisfy the recurrence ${a_n = a_{n - 1} + a_{n - 2}}$. I call the set “${V}$” because it turns out to form a vector space. That is, given any two sequences ${a_1, a_2, a_3, \ldots}$ and ${b_1, b_2, b_3, \ldots}$ in ${V}$, their sum ${(a_1 + b_1), (a_2 + b_2), \ldots}$ satisfies the same Fibonacci-style recurrence as they do, as does any scalar multiple of ${a_1, a_2, a_3, \ldots}$.

Notice that ${V}$ is not, however, a very complicated vector space. Because each number sequence in ${V}$ satisfies the Fibonacci-recurrence, each number sequence is entirely determined by its first two entries. Consequently, ${V}$ has dimension two.

If you look at the first few terms of the Fibonacci numbers, you might notice that they seem to grow at a roughly exponential rate. Consequently, one might hope that ${V}$ has a two basis elements of the form ${1, a, a^2, a^3, \ldots}$ and of the form ${1, b, b^2, b^3, \ldots}$. Indeed, if ${V}$ did, then we would get for free that, for some constants ${c_1,c_2 \in \mathbb{R}}$,

$\displaystyle F(n) = c_1 a^n + c_2 b^n.$

Fortunately, such ${a}$ and ${b}$ do exist and are easy to find. In order for ${1, a, a^2, \ldots}$ to satisfy the recurrence ${a^n = a^{n -1} + a^{n - 2}}$, we need that ${a^2 = a + 1}$. Applying the quadratic formula, we get values for ${a}$ and ${b}$,

$\displaystyle a = \frac{1 + \sqrt{5}}{2},$

$\displaystyle b = \frac{1 - \sqrt{5}}{2}.$

The two-envelope paradox is a great example where seemingly correct math results in a bizarre conclusion. Not only that, but the paradox is so confusing that even well-trained mathematicians and computer scientists sometimes disagree on the source of the issue.

The Paradox: Consider two identical envelopes, each of which contains some positive amount of money, and one of which contains twice as much money as the other. You are given an envelope at random, and you open it to find ${X}$ dollars. Then you’re given a choice: either keep the ${X}$ dollars you have, or trade and get the money in the other envelope.

So how much money do you expect to make or lose if you swap envelopes? With probability ${1/2}$, the other envelope contains ${2X}$ dollars, meaning you would make ${X}$ dollars. The rest of the time, the other envelope contains ${X/2}$ dollars, meaning that you lose ${X / 2}$ dollars. Thus the expected amount of money that you’ll make by swapping is:

$\displaystyle \frac{1}{2} \cdot X + \frac{1}{2} \cdot (-X/2) = \frac{1}{4}X,$

a positive amount! If follows that you should always swap envelopes.

But Wait a Second…: This, of course, is nonsense. The strategy of always swapping envelopes is equivalent to the strategy of never swapping envelopes, since both strategies give you a random envelope. So it can’t be that, in general, swapping is a better strategy than not swapping.

The question is, where did we go wrong? That is, which part of the argument above can’t be made mathematically rigorous?

The Source of the Confusion: The first thing to notice is that ${X}$ is a random variable, meaning it’s the result of a random process and doesn’t always take the same value. But in order for a random variable to be well defined, it must be over a probability space. So the question is, what probability space is ${X}$ over?

The paradox is very careful not to specify. But there are two natural answers one might try:

• Option 1: ${X}$ is a random positive rational number.
• Option 2: The probability space doesn’t matter for the argument.

It turns out that Option 1 isn’t actually valid. There is no probability space that is uniform over the positive rationals. So that leaves us with option 2. This means that the argument should work, regardless of the initial distribution used for the values in the envelopes. To make our lives easier, let’s pick a distribution. One envelope has ${1}$ dollar and the other has ${2}$ dollars.

With This Simplification, The Issue Becomes Obvious: Once we know the value of ${X}$, it’s no longer the case that the other envelope has a fifty-percent chance of having ${2X}$ dollars and a fifty-percent chance of having ${X/2}$ dollars. Indeed, the case where the other envelope has ${2X}$ dollars only happens when ${X = 1}$, in which case the profit of ${X}$ dollars is actually just a profit of a single dollar. The case where the other envelope has ${X/2}$ dollars, on the other hand, only happens when ${X = 2}$, in which case the loss of ${X / 2}$ dollars is actually just a loss of a a single dollar.

But Where Did the Analysis Formally Go Wrong? In the analysis in the paradox, the expected profit is expressed in terms of ${X}$:

$\displaystyle \mathbb{E}[\text{profit}] = \frac{1}{2} \cdot X + \frac{1}{2} \cdot (-X/2) = \frac{1}{4}X.$

But that’s nonsensical math. The left side of the equation is a fixed value, while the right side is a random variable, expressed as a function of ${X}$. The types on the two sides of the equation just don’t agree. To do the math properly, we would have to write,

$\mathbb{E}[\text{profit}] = \mathbb{E}[\text{profit} \mid X = 1] \cdot \Pr[X = 1] + \mathbb{E}[\text{profit} \mid X = 2] \cdot \Pr[X = 2]$

$= \frac{1}{2} \cdot \mathbb{E}[\text{profit} \mid X = 1] + \frac{1}{2} \cdot \mathbb{E}[\text{profit} \mid X = 2]$

$= \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot (-1) = 0.$

The Takeaway: The reason people so frequently disagree on the cause of the paradox is simple. It’s because the aspect of the paradox that makes it confusing (namely, that the probability distribution for ${X}$ is unspecified) is different from the aspect of the paradox that makes it wrong (namely, that an expectation is expressed as a random variable). Like any good magic trick, it’s all about the misdirection.

## The Proof That Shook The World Had No Diagonals

In 1874, Georg Cantor proved a result that shook the world of mathematics: that no infinite sequence, no matter how cleverly constructed, can contain every real number. His proof was so controversial that it nearly ruined his career. Famed French mathematician Henri Poncare described Cantor’s ideas as a “grave disease”. Cantor’s former professor Leopold Kronecker called Cantor a “scientific charlatan” and “corrupter of youth”. Cantor even briefly considered swapping fields to philosophy.

At this point, any math-history enthusiast worth their weight in black-board chalk will likely guess what proof I am referring to: Cantor’s famed diagionalization argument. The only problem is, they’d be wrong.

The diagionalization argument, which can be used to prove the same result, wasn’t published for another 17 years until 1891. In fact, Cantor’s original proof relied on nothing but standard elementary number theory.

The Theorem: There is no sequence of real numbers $a_1, a_2, \ldots$ containing every real number.

Recall that the real numbers include non-fractions like $\pi$ and $\sqrt{2}$. The theorem claims that for every infinite sequence of real numbers, there must be some real number not in the sequence. Notice that the same can’t be said for the integers. For example, the sequence $0, 1, -1, 2, -2, 3, -3, \ldots$ contains every integer. In fact, the same can’t even be said for pairs of integers; if you’re clever you can construct a sequence containing all the ways there are to pair up two integers. Since every rational number can be written as a fraction of two integers, it follows that one can also list off all the rational numbers. So the fact that such a list cannot be made for the reals should, apriori, be kind of surprising.

Cantor’s Original Proof: Suppose there is a sequence $a_1, a_2, a_3, \ldots$ containing every real number. In other words, for every real number $r$, there is some position $i$ such that $a_i = r$. Create two variables $t$ and $b$ (top and bottom) initialized to be the first two elements of the sequence. For simplicity, we will assume $b < t$. Start looking at the elements in the sequence. Each time we encounter an element between $t$ and $b$, change one of the two variables to take on that element’s value. Each time alternate which variable takes on a new value. Keep track of the values of $b$ and $t$ as $b_1, b_2, \ldots$ and $t_1, t_2, \ldots$.

For example, if the sequence is $1, 6, 8, 3, 2, 11, 5, 4, \ldots$, then $t$ and $b$ would start off as $b_1 = 1$ and $t_1=6$. The first element between $b$ and $t$ that we find is $3$, leading us increase $b$ to $b_2 = 3$. The next value in the list between $b$ and $t$ (now $3$ and $6$) is $5$, leading us to decrease $t$ to $t_2 = 5$. The next value in the list between $b$ and $t$ (now $3$ and $5$) is $4$, leading us to increase $b$ to $b_3 = 4$. The process continues on forever (and, in general, the elements of the list won’t all be integers like in this example). In particular, if $a_1, a_2, \ldots$ really does contain every real number then the process will never stop and we will always find new numbers between $b$ and $t$.

At this point, Cantor asked a crucial question. What is $L =\lim_{i \rightarrow \inf} b_i$? In particular, because $b_1, b_2, \ldots$ is a strictly increasing sequence which is bounded above by $t$, the limit $L$ exists and is well defined.

Cantor realized that this leads to a problem. In particular, it turns out $L$ must remain between $b$ and $t$ at all times. Indeed if $L for some $i$, then since $b$ is strictly increasing $L$ cannot be $\lim_{i \rightarrow \inf} b_i$. And if $L > t_i$ for some $i$, then since $b$ can never exceed any value $t$ takes on, it follows that $\lim_{i \rightarrow \inf}b_i \le t_i < L$.

The problem, Cantor realized, is that if $L$ must always remain between $b$ and $t$, then there is no way that $L$ can appear in the sequence. In particular, if $L$ does appear, then when we reach $L$ in the sequence $a_1, a_2, \ldots$, we will advance one of $b$ or $t$ to take its value (since $L$ is between $b$ and $t$). But then once we advance both $b$ and $t$ one more time (increasing $b$ and reducing $t$), $L$ will no longer be between them. This shows that $L$ cannot appear anywhere in the list, which, in turn, implies that the list cannot contain all real numbers after all, a contradiction.

## What Made Gödel’s Incompleteness Theorem Hard to Prove: It’s about How You Say it, Not Just What You Say

Roughly speaking, Gödel’s Incompleteness Theorem states that there are true mathematical statements that cannot be proven. When I was in 11-th grade, my geometry teacher Mr. Olsen, my friend Uma Roy, and I spent five weeks reading through Gödel’s original proof of the theorem. Why did it take so long? Partly because Uma and I were high-school students. Partly because Gödel was a less-than-talented writer. But mostly because the proof is actually pretty hard.

That might seem surprising, since it’s easy to present a one-paragraph summary of essentially how the proof works: Gödel begins by constructing a mathematical statement essentially equivalent to the sentence,

This statement cannot be proven.

Then Gödel considers what would happen if the statement were false. That would mean that the statement could be proven. But any statement that can be proven must be true, a contradiction. From this Gödel deduces that the statement must be true. But, since the statement is true, it follows that the statement cannot be proven. Note that this final statement is not a contradiction. Rather, it’s a proof of Gödel’s theorem.

So what makes the actual proof so hard? The trickiness comes from the fact that what might sound like a valid mathematical statement in English often isn’t (especially when the sentence is self-referential). Consider, for example, the sentence,

This sentence is false.

The sentence is nonsensical: it cannot be false (since that would make it true) and it cannot be true (since that would make it false). And it certainly cannot be written as a formal mathematical statement.

Here’s another example (known as the Berry paradox):

Define ${x}$ to be the smallest positive integer that cannot be defined in under ${100}$ words.

This might look like a valid mathematical definition. But again it’s nonsensical. And, importantly for the sanity of mathematics, no analogous statement can be made mathematically formal.

Even statements that use math symbols can be nonsensical:

${S = \{A \mid A \not\in A\}}$.

(i.e., ${S}$ is the set of sets ${A}$ which are not elements of themselves.)

This is again a nonsensical definition (known as Russell’s Paradox). In particular, once we’ve defined ${S}$ we can ask ourselves, does ${S}$ contain itself? If it does, then ${S}$ cannot be a member of ${S}$, a contradiction; and if it doesn’t, then ${S}$ will be a member of ${S}$, a contradiction.

The point of the above three examples is that, if you want to prove a theorem about mathematical statements, then you have be extremely careful that what you’re proving the theorem about really is a mathematical statement. And, indeed, from the 46 definitions at the start, to the remarkably dense proofs at the end, Gödel’s original paper is nothing if not a massive exercise in carefulness.

## Random Walks are Fun, Weird, and Slick to Analyze

Random walks are fun, weird, and slick to analyze. Here are two examples of what I mean.

Suppose you start at 1 on the number line, and then you perform a random walk. At each step, you have a fifty percent chance of stepping to the left, and a fifty percent chance of stepping to the right.

Question 1: What’s the probability that you hit 100 before you hit 0?

Answer: After the very first step, you already have a fifty percent chance of hitting 0. It’s tempting to extrapolate from this and conclude that the probability of hitting 100 before 0 will be something like ${1 - \frac{1}{2^{100}}}$. But it’s not. Actually, the probability of hitting 100 first is much higher. It’s exactly ${\frac{1}{100}}$.

Think about the problem as a betting game. You start with one dollar, and then at each step you make a fair one-dollar bet, either increasing or decreasing your total wealth by a dollar. You stop when either you run out of money (i.e., hit zero) or you have 100 dollars (i.e., hit 100).

The key observation is that this is a fair betting game. On average you shouldn’t make or lose money. Since you walk into the game with a dollar, the average amount you walk away with should also be a dollar. But in reality you always walk away with either 0 or 100 dollars. So the question is: What fraction of the time do you have to walk away with 100 dollars, so that the average amount you walk away with is 1 dollar? When phrased this way, the answer is obvious: The probability of walking away with 100 dollars is ${\frac{1}{100}}$.

Question 2: It turns out that with probability one, the random walk will eventually hit zero. On average, how many steps does it take for this to happen?

Answer: This question is even weirder than the first. Rephrased as a betting game, you walk into a casino with a dollar and then you make a series of one-dollar bets until you eventually run out of money. How long does it take before you get to go home?

The fact that I’m asking this question at all should make you think that either the answer is surprisingly small (e.g., 5) or surprisingly large (e.g. ${2^{100}}$).

Actually, the answer is infinite. That’s right. The average number of bets you have to make before you get to go home is infinite. Moreover, the proof is simple. Rather then counting the number of bets you make, let’s just count something smaller: the number of distinct values ${n}$ such that at some point in the betting process (before running out of money) you end up with ${n}$ dollars. In Question 1, we found that for any value ${n}$, the probability that you end up with ${n}$ dollars before you ever hit ${0}$ is exactly ${1/n}$. It follows that the expected number of distinct monetary-amounts that you hit during the betting game is:

$\displaystyle \sum_{n > 0} \Pr[\text{You hit }n\text{ before }0]$

$= \sum_{n > 0} \frac{1}{n}$

$= \sum_{1 \le n < 2} \frac{1}{n} + \sum_{2 \le n < 4} \frac{1}{n} + \sum_{4 \le n < 8} \frac{1}{n} + \cdots$

$\ge \sum_{1 \le n < 2} \frac{1}{2} + \sum_{2 \le n < 4} \frac{1}{4} + \sum_{4 \le n < 8} \frac{1}{8} + \cdots$

$= \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \cdots = \infty.$

PS For Question 1, we assumed without a formal proof that any betting game consisting of fair bets has to, on average, neither make nor lose money. It turns out that this is true as long as the betting game (a) takes finite expected time to complete and (b) consists of one-dollar bets. (See the Optional Stopping Theorem for more on this.) When either of these requirements are broken, it turns out there are weird betting games that can be designed so that each individual bet is fair, but the game as a whole is not. In fact, we even have an example in Question 2: Start with a dollar and make one-dollar bets until you run out of money. Each bet is fair, but at the end of the game you still always come out a dollar behind.