CMPS 2200¶

Introduction to Algorithms¶

Selection¶

Agenda:

Randomized Linear Work Selection

The $\texttt{Selection Problem}$¶

Given an unsorted list $a$ and an integer $k$ ($0\leq k< |a|$), the order statistics (or selection) problem asks us to return the $k^{th}$ smallest element from $a$.

We also refer to the $k^{th}$ smallest element as the element of rank $k$.

Example:

Let $a=\langle 5, 11, 9, 3, 6, -1, 99\rangle.$

For $k=0$, the "$0^{th}$ smallest" element is the minimum element in $a$, or $-1$.
For $k=n-1$, it is the maximum, or $99$.
For $k=3$, we return $6$.

Observations¶

Before we come up with a randomized algorithm, we can reason about bounds for our runtime using reductions.

First, let's establish a lower bound on our potential runtime.

What problems can we reduce to the $\texttt{Selection Problem}$?

Both $\texttt{Min}$ and $\texttt{Max}$ reduce to this problem.

What does that tell us about our runtime?

$W(n) \in \Omega(n)$

Now let's establish an Upper Bound on our potential runtime.

What can we reduce this problem to?

$\texttt{Sorting}$: Sort $a$ and return the $k^{th}$ element of this sorted list.

$W(n) \in O(n \log n)$

We have that $W(n) \in \Omega(n) ~ \cap ~ O(n \log n)$.

Can we do any better? Sorting seems like overkill since we don't really need to rearrange all the elements, or even return a list.

Let's build an Algorithm¶

A useful observation is that the $k^{th}$ smallest element in $a$ partitions $a$ into a set of $k-1$ smaller elements, and a set of $n-k-1$ larger elements.

Example: Suppose $a=\langle 5, 11, 9, 3, 6, -1, 99\rangle$.

$n = |a| = 7$.

Let $k=3$.

So $6$ is larger than $\langle 5, 3, -1 \rangle$ and smaller than $\langle11, 9, 99\rangle$.

This implies that for any element $x$ in the list, we can compute its rank by partitioning the list.

This can be done in $O(n)$ work and $O(\log n)$ span.

Pivoting to Search¶

We'll use partitioning to invent an algorithm to find the element with rank $k$.

Again, suppose $a=\langle 5, 11, 9, 3, 6, -1, 99\rangle$ and we are looking for the element with rank $k=3$.

We can choose an element to partition around (a "pivot" element) to get an indication of where to continue searching.

For simplicity, let's pivot around $a[0] = 5$.

$5$ is larger than 2 elements in $l = \langle 3, -1 \rangle$ and smaller than 4 elements in $r = \langle 11, 9, 6, 99\rangle$.

What does this tell us?

We've identified 3 smaller elements ($\langle 3, -1 \rangle$ and $5$).

Thus, the element of rank $k=3$ in $a$ must be in $r = \langle11, 9, 6, 99\rangle$.

Moreover it's rank is $k-3 = 0$ in $r$.

This motivates an algorithm like binary search, but with the partition step helping establish some order.

$$\begin{array}{ll} \mathit{simple\_select}~a~k = \\ \texttt{let} \\ ~~~~p = a[0] \\ ~~~~\ell = \left\langle\, x \in a \;|\; x < p \,\right\rangle \\ ~~~~r = \left\langle\, x \in a \;|\; x > p \,\right\rangle \\ \texttt{in} \\ ~~~~\texttt{if}~(k < |\ell|)~\texttt{then}~\mathit{simple\_select}~\ell~k \\ ~~~~\texttt{else if}~(k < |a| - |r|)~\texttt{then}~p \\ ~~~~\texttt{else}~\mathit{simple\_select}~r~(k - (|a| - |r|)) \\ \texttt{end} \end{array}$$

Analysis¶

We just have one recursive call so no parallelism there. However, we can use filter to partition in parallel. This has $O(n)$ work $O(\log n)$ span.

What is the total work over all recursive calls?

We know that the work in each recursive call is the $\max\{W(\mid l\mid), W(\mid r\mid)\} + O(n)$.

What is the worst case scenario?

Consider the case where $a$ is a sorted list. Then in every call, $\ell = \emptyset$, and $\mid r\mid = n-1$.

So we would we have $W(n) = W(n-1) + n = O(n^2)$.

This is worse than just sorting the list!

Moreover the span is $S(n) = S(n-1) + \lg n \in O(n\lg n)$

Far worse than sorting!

In [9]:

def simple_select(a, k):
    p = a[0]
    print('\na=', a, 'k=', k, 'p=', p)
    l = list(filter(lambda x: x < p, a))  # O(|a|) work, O(log|a|) span
    r = list(filter(lambda x: x > p, a))  # O(|a|) work, O(log|a|) span
    print('l=', l, 'r=', r)
    if k < len(l):
        print('...recursing with l=%s and k=%d' % (str(l), k))
        return simple_select(l, k)
    elif k < len(a) - len(r):
        print('...returning p=%d' % p)
        return p
    else:
        print('...recursing with r=%s and k=%d' % (str(r), k - (len(a) -  len(r))))
        return simple_select(r, k - (len(a) -  len(r)))
    
simple_select([5, 11, 9, 3, 6, -1, 99], 3)

a= [5, 11, 9, 3, 6, -1, 99] k= 3 p= 5
l= [3, -1] r= [11, 9, 6, 99]
...recursing with r=[11, 9, 6, 99] and k=0

a= [11, 9, 6, 99] k= 0 p= 11
l= [9, 6] r= [99]
...recursing with l=[9, 6] and k=0

a= [9, 6] k= 0 p= 9
l= [6] r= []
...recursing with l=[6] and k=0

a= [6] k= 0 p= 6
l= [] r= []
...returning p=6

Out[9]:

In [10]:

# worst case: find the max of a sorted list
simple_select([-1,3,5,6,9,11,99], 6)

a= [-1, 3, 5, 6, 9, 11, 99] k= 6 p= -1
l= [] r= [3, 5, 6, 9, 11, 99]
...recursing with r=[3, 5, 6, 9, 11, 99] and k=5

a= [3, 5, 6, 9, 11, 99] k= 5 p= 3
l= [] r= [5, 6, 9, 11, 99]
...recursing with r=[5, 6, 9, 11, 99] and k=4

a= [5, 6, 9, 11, 99] k= 4 p= 5
l= [] r= [6, 9, 11, 99]
...recursing with r=[6, 9, 11, 99] and k=3

a= [6, 9, 11, 99] k= 3 p= 6
l= [] r= [9, 11, 99]
...recursing with r=[9, 11, 99] and k=2

a= [9, 11, 99] k= 2 p= 9
l= [] r= [11, 99]
...recursing with r=[11, 99] and k=1

a= [11, 99] k= 1 p= 11
l= [] r= [99]
...recursing with r=[99] and k=0

a= [99] k= 0 p= 99
l= [] r= []
...returning p=99

Out[10]:

The Problem, and its Solution¶

The problem is that we just don't know anything about the element we're using for the partition. How do we avoid this worst case?

Pick a random pivot for partitioning!

$$\begin{array}{ll} \mathit{select}~a~k = \\ \texttt{let} \\ ~~~~p = \text{pick a uniformly random element from}~a \\ ~~~~\ell = \left\langle\, x \in a \;|\; x < p \,\right\rangle \\ ~~~~r = \left\langle\, x \in a \;|\; x > p \,\right\rangle \\ \texttt{in} \\ ~~~~\texttt{if}~(k < |\ell|)~\texttt{then}~\mathit{select}~\ell~k \\ ~~~~\texttt{else if}~(k < |a| - |r|)~\texttt{then}~p \\ ~~~~\texttt{else}~\mathit{select}~r~(k - (|a| - |r|)) \\ \texttt{end} \end{array}$$

In [17]:

import random
# random.seed(42)  # for repeatability

def select(a, k):
    p = random.choice(a)
    print('\na=', a, 'k=', k, 'p=', p)
    l = list(filter(lambda x: x < p, a))  # O(|a|) work, O(log|a|) span
    r = list(filter(lambda x: x > p, a))  # O(|a|) work, O(log|a|) span
    print('l=', l, 'r=', r)
    if k < len(l):
        print('...recursing with l=%s and k=%d' % (str(l), k))
        return select(l, k)
    elif k < len(a) - len(r):
        print('...returning p=%d' % p)
        return p
    else:
        print('...recursing with r=%s and k=%d' % (str(r), k - (len(a) -  len(r))))
        return select(r, k - (len(a) -  len(r)))
    
select([-1,3,5,6,9,11,99], 6)

a= [-1, 3, 5, 6, 9, 11, 99] k= 6 p= 5
l= [-1, 3] r= [6, 9, 11, 99]
...recursing with r=[6, 9, 11, 99] and k=3

a= [6, 9, 11, 99] k= 3 p= 11
l= [6, 9] r= [99]
...recursing with r=[99] and k=0

a= [99] k= 0 p= 99
l= [] r= []
...returning p=99

Out[17]:

Quantifying our chances¶

If we get a sorted list as input, what is the probability of the worst-case?

The size of the $l$ and $r$ will depend on the random choice.

Thus the recurrences describing the work and span depend on each random choice and we need to find their expected asymptotic work/span.

Intuition¶

We saw that the work of our algorithm depends on $\max\{W(\mid l\mid), W(\mid r\mid)\}$ in each recursive call.

What is the probability of getting a perfect split?

$1/n$

What about a split that is "good enough"?

So, suppose we knew that $\max\{W(\mid l\mid), W(\mid r\mid)\} \leq W(3n/4)$.

This would be a good enough split since the overall work would be $W(n) = W(3n/4) + n \in O(n)$. (root dominated)

Probability of "good enough"¶

What is the probability that the input to the recursive call has size $\le \frac{3}{4} n$?

We can examine where $p$ might land in the sorted version of $a$, to understand the probability of a good split.

If we sample any pivot in the green region, what is the size of the larger partition?

It is at most $3n/4$.

What is the probability of sampling a point in the green region?

$1/2$

$\mathbf{P}[\max\{W(\mid l\mid), W(\mid r\mid)\} \leq W(3n/4)] = 1/2$.

If we think of each choice of pivot as a coin flip ("good" vs. "bad") then the expected number of pivot choices to reduce the input to $3n/4$ is 2.

In other words, every two recursions yields the desired reduction in list size, and so in expectation we will do linear work.

What if we're unlucky?

We could keep sampling pivots outside of the green area. What is the probability if we do so $i$ times in a row?

$\frac{1}{2} * \frac{1}{2} * \frac{1}{2} * ... = \frac{1}{2^i}$

E.g., for $i=10$, probability of getting no good pivots is $\approx 0.1\%$.

Thus, probability of getting at least one good pivot for 10 splits is $\approx 99.9\%$

Analysis¶

With our intuition established, let's analyze the performance rigorously.

Let $X(n)$ be the fractional size of the larger side of the split, for an input of size $n$. So

$$X(n) = \frac{\max{\{|l|, |r|\}}}{n}$$

e.g., $n=6$

i	len(l)	len(r)	X(i)
0	0	5	5/6
1	1	4	4/6
2	2	3	3/6
3	3	2	3/6
4	4	1	4/6
5	5	0	5/6

Our work and span recurrences are:

$$W(n) \leq W(X(n) \cdot n) + O(n)$$$$S(n) \leq S(X(n) \cdot n) + O(\lg n)$$

The expected size of the larger partition¶

We want to prove that $\mathbf{E}[X(n)] = 3/4$ as our intuition says.

As we discussed previously, if $|l| = i$, then $|r| = n - i -1$.

Using the fact that the probability of the pivot being any particular $i$ is $1/n$, we have:

$\begin{align} \mathbf{E}\left[{X(n)}\right] &= \sum_{i=0}^{n-1} X(i) \cdot P[X(i)] \\ & = \frac{1}{n} \sum_{i=0}^{n-1} \frac{\max\{i, n-i-1\}}{n} \\ \end{align}$

i	len(l)	len(r)	X(i)
0	0	5	5/6
1	1	4	4/6
2	2	3	3/6
3	3	2	3/6
4	4	1	4/6
5	5	0	5/6

$\begin{align} \mathbf{E}\left[{X(n)}\right] &= \sum_{i=0}^{n-1} X(i) \cdot P[X(i)]\\ & = \frac{1}{n} \sum_{i=0}^{n-1} \frac{\max\{i, n-i-1\}}{n} \\ &\leq \frac{1}{n} \sum_{j=n/2}^{n-1} \frac{2j}{n} ~~~~~~~~~~~~~~ \left(\sum_{j=3}^{5} \frac{2j}{n} = \frac{2*3}{6} + \frac{2*4}{6} + \frac{2*5}{6}\right) \\ &\leq \frac{2}{n^2}\sum_{j=n/2}^{n-1} j\\ &\leq \frac{3}{4} \end{align}$

Thus the expected size of the larger partition is $\leq \frac{3}{4}$.

Full detail

Last line uses $\sum_{i=x}^y i = \frac12(x+y)(y - x + 1)$:

$= \frac12 (n/2 + (n-1))((n-1)-(n/2) + 1)$

$= \frac12 (3n/2 - 1)(n/2)$

$= \frac12 (3n^2/4 - n/2)$

$=(3n^2/8 - n/4)$

$=(3n^2 - 2n)/8$

multiply by $2/n^2$:

$= \frac{3n^2-2n}{4n^2} = \frac{3n^2}{4n^2} - \frac{1}{2n} \le \frac{3}{4}$

What if we're unlucky?¶

It might seem tempting to say that we are done. However, we could get "unlucky" in a series of recursions even though $\mathbf{E}[X(n)]\leq 3/4.$ We will show the following.

Theorem. At the $d^{th}$ level of recursion, the size of the input is $(3/4)^d n$ in expectation.

Proof: We can prove this by induction.

The base case $d=0$ holds trivially.

For the inductive step, we make the inductive hypothesis that our theorem holds for $d \ge 0$ and will show that it holds after the $d+1^{st}$ recursive call.

For the $d^{th}$ recursive call, let $Y_d$ be a random variable for the instance size and let $Z_d$ be the rank of the pivot.

For any value of $y$ and $z$, let $f(y,z)$ be a function that take in $y$ (the size of the input) and $z$ (the position of the pivot) and returns the fraction of the input reduced by that choice of the pivot.

The expected input size in the $(d+1)^{st}$ call is: $$\mathbf{E}[Y_{d+1}] = \sum_{y}{\sum_{z}{y\cdot f(y,z) \mathbf{P}_{Y_d,Z_d}(y,z)}} $$

$$\begin{align} \mathbf{E}[Y_{d+1}] &= \sum_{y}{\sum_{z}{y\cdot f(y,z) \mathbf{P}_{Y_d,Z_d}(y,z)}} \\ & = \sum_{y}{\sum_{z}{y f(y,z) \mathbf{P}_{Y_d}(y) \mathbf{P}_{Z_d \mid Y_d}(z \mid y)}} ~~~~~ \hbox{by def,} ~~ p(a,b) = p(a)p(b|a)\\ &= \sum_{y}{y \mathbf{P}_{Y_d}(y) \sum_{z}{f(y,z) \mathbf{P}_{Z_d \mid Y_d}(z \mid y)}} ~~~~~~ \hbox{grouping terms} \\ &= \sum_{y}{y \mathbf{P}_{Y_d}(y) \mathbf{E}\left[{X(y)}\right]} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \hbox{def of}~ X(i) ~\hbox{and expectation}\\ & \le \frac{3}{4} \sum_{y}{y \mathbf{P}_{Y_d}(y)} ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\hbox{by our bound above} \\ & \le \frac{3}{4} \mathbf{E}\left[{Y_d}\right]. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\hbox{definition of expectation} \end{align}$$

Thus $\mathbf{E}[Y_{d+1}] \le \frac{3}{4} \mathbf{E}\left[{Y_d}\right]$

Since we have shown that $\mathbf{E}[Y_{d+1}] \le \frac{3}{4} \mathbf{E}\left[{Y_d}\right]$, a statement true for all $d$, at the $d^{th}$ level of recursion, the size of the input is $(3/4)^d n$ in expectation. $\blacksquare$

Expected Work¶

We've shown that for every pivot, we expect the larger of the two partitions to be $< 3/4n$.

Furthermore, we've shown that the expected size of the problem at every level $d$ is $(3/4)^d n$ in expectation.

Finally, we need to compute the expected work.

This is analogous to the tree method, but using the expected size of $n$ at each level.

$ \begin{align} \mathbf{E}[W(n)] & \le \sum_{i=0}^n \mathbf{E}[Y_i]~~~~~~~~ \hbox{since there is linear work at each iteration}\\ & \le \sum_{i=0}^n \Big(\frac{3}{4}\Big)^i n ~~~~~~ \hbox{by theorem above}\\ & \le n\sum_{i=0}^n \Big(\frac{3}{4}\Big)^i \\ & \le 4n ~~~~~~~~~~~~~~~~~~~~ \hbox{by}~ \sum_{i=0}^{\infty}\alpha^i\:<\:\frac{1}{1\:-\:\alpha}~\hbox{for}~ \alpha < 1\\ & \in O(n) \end{align} $

Span¶

For the span we can also use the theorem to show that at each level the span is $O(\log n)$. By showing that the number of levels is $O(\log n)$ with high probability, we can establish that the span is $O(\log^2 n)$ with high probability.

Final Results¶

Finally, we have that by using random pivots to solve the $\texttt{Selection Problem}$ we achieve the following work and span in expectation:

$W(n) \in O(n)$

$S(n) \in O(\log^2 n)$