Home/Magazine Archive/October 2020 (Vol. 63, No. 10)/Lower Bounds for External Memory Integer Sorting via.../Full Text

Research highlights
## Lower Bounds for External Memory Integer Sorting via Network Coding

Sorting extremely large datasets is a frequently occurring task in practice. These datasets are usually much larger than the computer's main memory; thus, external memory sorting algorithms, first introduced by Aggarwal and Vitter, are often used. The complexity of comparison-based external memory sorting has been understood for decades by now; however, the situation remains elusive if we assume the keys to be sorted are integers. In internal memory, one can sort a set of *n* integer keys of Θ(lg *n*) bits each in *O*(*n*) time using the classic Radix Sort algorithm; however, in external memory, there are no faster integer sorting algorithms known than the simple comparison-based ones. Whether such algorithms exist has remained a central open problem in external memory algorithms for more than three decades.

In this paper, we present a *tight* conditional lower bound on the complexity of external memory sorting of integers. Our lower bound is based on a famous conjecture in network coding by Li and Li, who conjectured that network coding cannot help anything beyond the standard multicommodity flow rate in undirected graphs.

The only previous work connecting the Li and Li conjecture to lower bounds for algorithms is due to Adler et al. Adler et al. indeed obtain relatively simple lower bounds for *oblivious* algorithms (the memory access pattern is fixed and independent of the input data). Unfortunately, obliviousness is a strong limitation, especially for integer sorting: we show that the Li and Li conjecture implies an Ω(*n* lg *n*) lower bound for internal memory *oblivious* sorting when the keys are Θ(lg *n*) bits. This is in sharp contrast to the classic (nonoblivious) Radix Sort algorithm. Indeed, going beyond obliviousness is highly nontrivial; we need to introduce several new methods and involved techniques, which are of their own interest, to obtain our tight lower bound for external memory integer sorting.

Sorting is one of the most basic algorithmic primitives and has attracted lots of attention from the beginning of the computing era. Many classical algorithms have been designed for this problem such as Merge Sort, Bubble Sort, Insertion Sort, etc. As sorting extremely large data has become essential for many applications, there has been a strong focus on designing more efficient algorithms for sorting big datasets^{2} These datasets are often much larger than the computer's main memory and the performance bottleneck changes from being the number of CPU instructions executed to being the number of accesses to slow secondary storage. In this external memory setting, one usually uses the external memory model to analyze the performance of algorithms. External memory algorithms are designed to minimize the number of *input/output (I/O)*s between the internal memory and external memory (e.g., hard drives and cloud storage), and we measure the complexity of an algorithm in terms of the number of I/Os it performs.

Formally, the external memory model consists of a main memory that can hold *M* words of *w* bits each (the memory has a total of *m* = *Mw* bits), and an infinite (random access) disk partitioned into blocks of *B* consecutive words of *w* bits each (a block has a total of *b* = *Bw* bits). The input to an external memory algorithm is initially stored on disk and is assumed to be much larger than *M*. An algorithm can then read blocks into memory or write blocks to disk. We refer jointly to these two operations as an I/O. The complexity of an algorithm is measured solely in terms of the number of I/Os it makes.

Aggarwal and Vitter^{2} considered the sorting problem in the external memory model. A simple modification to the classic Merge Sort algorithm yields a comparison based sorting algorithm that makes *O*((*n/B*) lg_{M/B}(*n/B*) ) I/Os for sorting an array of *n* comparable records (each storable in a word of *w* bits). Notice that *O*(*n/B*) would correspond to *linear* I/Os, as this is the amount of I/Os needed to read/write the input/output. Aggarwal and Vitter^{2} complemented their upper bound with a matching lower bound, showing that comparison-based external memory sorting algorithms must make Ω((*n/B*) lg_{M/B}(*n/B*)) I/Os. In the same paper, Aggarwal and Vitter also showed that any algorithm treating the keys as *indivisible* atoms, meaning that keys are copied to and from disk blocks, but never reconstructed via bit tricks and the like, must make Ω(min{*n*, (*n/B*) lg_{M/B}(*n/B*)}) I/Os. This lower bound does not assume a comparison-based algorithm, but instead makes an indivisibility assumption. Notice that the lower bound matches the comparison-based lower bound for large enough *B* (*B* > lg *n* suffices). The comparison and indivisibility settings have thus been (almost) fully understood for more than three decades.

However, if the input to the sorting problem is assumed to be *w* bit integers and we allow arbitrary manipulations of the integers (hashing, XOR tricks, etc.), then the situation is completely different. In the standard internal memory computational model, known as the word-RAM, one can design integer sorting algorithms that far outperform comparison-based algorithms regardless of *w*. More concretely, if the word and key size is *w* = Θ(lg *n*), then Radix Sort solves the problem in *O*(*n*) time, and for arbitrary *w*, one can design sorting algorithms with a running time of in the randomized case^{6} and *O*(*n* lg lg *n*) in the deterministic case^{5} (both bounds assume that the word size and key size are within constant factors of each other). In external memory, no integer sorting algorithms faster than the comparison-based *O*((*n/B*) lg_{M/B}(*n/B*)) bound are known! Whether faster integer sorting algorithms exist was posed as an important open problem in the original paper by Aggarwal and Vitter^{2} that introduced the external memory model. Three decades later, we still do not know the answer to this question.

In this paper, we present tight conditional lower bounds for external memory integer sorting via a central conjecture by Li and Li^{7} in the area of *network coding*. Our conditional lower bounds show that it is impossible to design integer sorting algorithms that outperform the optimal comparison-based algorithms, thus settling the complexity of integer sorting under the conjecture by Li and Li.

**1.1. Network coding**

The field of network coding studies the following communication problem over a network: Given a graph *G* with capacity constraints on the edges and *k* data streams, each with a designated source-sink pair of nodes (*s _{i}*,

CONJECTURE 1 (UNDIRECTED *k*-PAIRS CONJECTURE). *The coding rate is equal to the multicommodity flow rate in undirected graphs*.

Despite the centrality of this conjecture, it has so forth resisted all attempts at either proving or refuting it. Adler et al.^{1} made an exciting connection between the conjecture and lower bounds for algorithms. More concretely, they proved that if Conjecture 1 is true, then one immediately obtains nontrivial lower bounds for all of the following:

*Oblivious*external memory algorithms*Oblivious*word-RAM algorithms*Oblivious*two-tape Turing machines

In the above, *oblivious* means that the memory access pattern of the algorithm (or tape moves of the Turing machine) is fixed and independent of the input data. Thus proving Conjecture 1 would also give the first nontrivial lower bounds for all these classes of algorithms. One can view this connection in two ways: Either as exciting conditional lower bounds for (restricted) algorithms, or as a strong signal that proving Conjecture 1 will be very difficult.

In this paper, we revisit these complexity theoretic implications of Conjecture 1. Our results show that the restriction to oblivious algorithms is unnecessary. In more detail, we show that Conjecture 1 implies nontrivial (and in fact tight) lower bounds for external memory sorting of integers and for external memory matrix transpose algorithms. We also obtain tight lower bounds for word-RAM sorting algorithms when the word size is much larger than the key size, as well as tight lower bounds for transposing a *b* × *b* matrix on a word-RAM with word size *b* bits. The striking thing is that our lower bounds hold without *any* extra assumptions such as *obliviousness*, *indivisibility*, *comparison-based*, or the like. Thus proving Conjecture 1 is as hard as proving super-linear algorithm lower bounds in the full generality word-RAM model, a barrier far beyond current lower bound techniques! Moreover, we show that the assumption from previous papers about algorithms being oblivious makes a huge difference for integer sorting: We prove an Ω(*n* lg *n*) lower bound for sorting Θ(lg *n*) bit integers using an oblivious word-RAM algorithm with word size Θ(lg *n*) bits. This is in sharp contrast to the classic (nonoblivious) Radix Sort algorithm, which solves the problem in *O*(*n*) time. Thus, the previous restriction to oblivious algorithms may be very severe for some problems.

**1.2. Lower bounds for sorting**

Our main result for external memory integer sorting is the following connection to Conjecture 1:

THEOREM 2. *Assuming Conjecture 1, any randomized algorithm for the external memory sorting problem with w* = Ω(lg *n*) *bit integers, having error probability at most* 1/3, *must make an expected*

*I/Os*.

Thus if we believe Conjecture 1, then even for randomized algorithms, there is no hope of exploiting integer input to improve over the simple external memory comparison-based algorithms (when *B* ≥ lg *n* such that the latter term in the lower bound is the min).

Now observe that because our lower bound only counts I/Os, the lower bound immediately holds for word-RAM algorithms when the word size is some *b* = Ω(lg *n*) by setting *m* = *O*(*b*) and *B* = *b/w* in the above lower bound (the CPU's internal state, i.e., registers, can hold only a constant number of words). Thus, we get the following lower bound:

COROLLARY 3. *Assuming Conjecture 1, any randomized word-RAM algorithm for sorting w* = Ω(lg *n*) *bit integers, having error probability at most* 1/3 *and word size b* ≥ *w bits, must spend*

*time*.

We note that a standard assumption in the word-RAM is a word size and key size of *b*, *w* = Θ(lg *n*) bits. For that choice of parameters, our lower bound degenerates to the trivial *t* = Ω(*n*). This has to be the case, as Radix Sort gives a matching upper bound. Nonetheless, our lower bound shows that when the key size is much smaller than the word size, one cannot sort integers in linear time (recall linear is *O*(*nw/b*) as this is the time to read/write the input/output).

Finally, we show that the obliviousness assumption made in the previous paper by Adler et al.^{1} allows one to prove very strong sorting lower bounds that even surpass the known (nonoblivious) Radix Sort upper bound:

THEOREM 4. *Assuming Conjecture 1, any oblivious randomized word-RAM algorithm for sorting* Θ(lg *n*) *bit integers, having error probability at most* 1/3 *and word size* Θ(lg *n*), *must spend* Ω(*n* lg *n*) *time*.

Thus, at least for the natural problem of integer sorting, being oblivious has a huge impact on the possible performance of algorithms. Our results are therefore not just an application of the previous technique to a new problem, but a great strengthening. Moreover, as we discuss in Section 3, removing the obliviousness assumption requires new and deep ideas that result in significantly more challenging lower bound proofs.

**1.3. Lower bounds for matrix transpose**

We also reprove an analog of the lower bounds by Adler et al.^{1} for the matrix transpose problem, this time without any assumptions of obliviousness. In the matrix transpose problem, the input is an *n* × *n* matrix *A* with *w*-bit integer entries. The matrix is given in row-major order, meaning that each row of *A* is stored in *n/B* blocks of *B* consecutive entries each. The goal is to compute *A ^{T}*, that is, output the column-major representation of

THEOREM 5. *Assuming Conjecture 1, any randomized algorithm for the external memory matrix transpose problem with w bit integer entries, having error probability at most* 1/3, *must make an expected*

*I/Os*.

Consider now the matrix transpose problem on the word-RAM with word size *b* bits (and thus memory size *m* = *O*(*b*)). Given an *n* × *n* matrix *A* with *w*-bit integer entries, the lower bound in Theorem 5 implies (by setting *B* = *b/w)*:

COROLLARY 6. *Assuming Conjecture 1, any randomized word-RAM algorithm for computing the transpose of an n* × *n matrices with w-bit integer entries, having error probability at most* 1/3 *and word size b bits, must spend*

*time*.

We now give a formal definition of the *k*-pairs communication problem and the multicommodity flow problem.

** k-pairs communication problem.** To keep the definition as simple as possible, we restrict ourselves to directed acyclic communication networks/graphs and we assume that the demand between every source-sink pair is the same. This will be sufficient for our proofs. For a more general definition, we refer the reader to Adler et al.

The input to the *k*-pairs communication problem is a directed acyclic graph *G* = (*V*, *E*) where each edge *e* ∈ *E* has a capacity *c*(*e*) ∈ R^{+}. There are *k* sources *s*_{1},…,*s _{k}* ∈

Each source *s _{i}*. receives a message

A network coding solution specifies for each edge *e* ∈ *E* an alphabet Γ(*e*) representing the set of possible messages that can be sent along the edge. For a node *v* ∈ *V*, define In(*u*) as the set of in-edges at *u*. A network coding solution also specifies, for each edge *e* = (*u*, *v*) ∈ *E*, a function *f _{e}*: Π

In an execution of a network coding solution, each of the extra nodes *S _{i}* starts by transmitting the message

Following Adler et al.^{1} (and simplified a bit), we define the *rate* of a network coding solution as follows: Let each source receive a uniform random and independently chosen message *A _{i}* from

*H*(*A*) ≥_{i}*rd*=_{i}*r*for all*i*.- For each edge
*e*∈*E*, we have*H*(*A*) ≤_{e}*c*(*e*).

Here *H*(x) denotes binary Shannon entropy. The intuition is that the rate is *r*, if the solution can handle upscaling the entropy of all messages by a factor *r* compared to the demands.

**Multicommodity flow.** A multicommodity flow problem in an undirected graph *G* = (*V*, *E*) is specified by a set of *k* source-sink pairs (*s _{i}*,

A (fractional) solution to the multicommodity flow problem specifies for each pair of nodes (*u*, *v*) and commodity *i*, a flow *f _{i}*(

- For all nodes
*u*that is not a source or sink, we have ∑_{w∈v}*f*(_{i}*u*,*w*) − ∑_{w∈v}*f*(_{i}*w*,*u*)=0. - For all sources
*s*, we have ∑_{i}_{w∈v}*f*(_{i}*s*,_{i}*w*)−∑_{w∈v}*f*(_{i}*w*,*s*)=1._{i} - For all sinks, we have ∑
_{w∈v}*f*(_{i}*w*,*t*)−∑_{i}_{w∈v}*f*(_{i}*t*,_{i}*w*)=1.

The flow also satisfies that for any pair of nodes (*u*, *v*) and commodity *i*, there is only flow in one direction, that is, either *f _{i}*(

- For all edges
*e*= (*u*,*v*) ∈*E*, we have*r*· ∑_{i}*d*(_{i}*f*(_{i}*u*,*v*)+*f*(_{i}*v*,*u*))=*r*· ∑_{i}(*f*(_{i}*u*,*v*)+*f*(_{i}*v*,*u*))≤*c*(*e*).

Intuitively, the rate is *r* if we can upscale the demands by a factor *r* without violating the capacity constraints.

**The undirected k-pairs conjecture.** Conjecture 1 implies the following for our setting: Given an input to the

Having defined coding rate and flow rate formally, we also mention that the result of Braverman et al.^{4} implies that if there exists a graph *G* where the network coding rate *r* and the flow rate *r'* in the corresponding undirected graph *G'* satisfy *r* ≥ (1 + ▭)*r'* for a constant ε > 0, then there exists an infinite family of graphs {*G*^{*}} for which the corresponding gap is at least (lg|*G*^{*}|)^{c} for a constant *c* > 0. So far, all evidence suggests that no such gap exists, as formalized in Conjecture 1.

In this section, we give an overview of the main ideas in our proof and explain the barriers we overcome in order to remove the assumption of obliviousness. To prove our lower bound for external memory sorting, we focus on the easier problem of permuting. In the permutation problem, we are given an array *A* of *n* entries. The *i*'th entry of *A* stores a *w*-bit data item *d _{i}* and a destination π(

Consider now an algorithm A for permuting, and assume for simplicity that it is deterministic and always correct. As in the previous work by Adler et al.^{1}, we define a graph *G*(*A*) that captures the memory accesses of A on an input array *A*. The graph *G* has a node for every block in the input array, a node for every block in the output, and a node for every intermediate block written/read by A. We call these block nodes. Moreover, the graph has a memory node that represents the memory state of A. The idea is that whenever A reads a block into memory, then we add a directed edge from the corresponding block node to the memory node. When A writes to a block, we create a new node (that replaces the previous version of the block) and add a directed edge from the memory node to the new node. The algorithm A can now be used to send messages between input and output block nodes as follows: Given messages *X*_{1}, …, *X _{n}* of

The idea is that we want to use Conjecture 1 to argue that the graph *G* must be large (i.e., there must be many I/Os). To do so, we would like to argue that if we undirect *G*, then there is a permutation π such that for many pairs *A*[*i*] and *C*[π(*i*)], there are no short paths between the block nodes storing *A*[*i*] and *C*[π(*i*)]. If we could argue that for *n*/2 pairs (*A*[*i*], *C*[π(*i*)]), there must be a distance of at least *ℓ* steps in the undirected version of *G*, then to achieve a flow rate of *w*, it must be the case that the sum of capacities in *G* is at least *lwn*/2. But each I/O adds only 2*b* bits of capacity to *G*. Thus, if A makes *t* I/Os, then it must be the case that *tb* = Ω(ℓ*wn*) ⇒ *t* = Ω((*nw*/*b*)×ℓ) = Ω((*n*/*B*)×ℓ).

Unfortunately, we cannot argue that there must be a long path between many pairs in the graph *G* we defined above. The problem is that the memory node is connected to all block nodes and thus the distance is never more than 2. To fix this, we change the definition of *G* slightly: After every *m/b* I/Os, we deactivate the memory node and create a new memory node to replace it. Further I/Os insert edges to and from this new memory node. In order for the new memory node to continue the simulation of A, the new memory node needs to know the memory state of A. Hence, we insert a directed edge from the old deactivated memory node to the new memory node. The edge has capacity *m* bits. Thus, in the simulation, when the current memory node has performed *m/b* I/Os, it forwards the memory state of A to the next memory node who continues the simulation. The *m/b* I/Os between the creation of new memory nodes has been chosen such that the amortized increase in capacity due to an I/O remains *O*(*b*).

We have now obtained a graph *G* where the degrees of all nodes are bounded by 2*m/b*. Thus, for every node *G*, there are at most (2*m/b*)^{ℓ} nodes within a distance of *ℓ*. Thus, intuitively, a random permutation π should have the property that for most pairs (*A*[*i*], *C*[*π*(*i*)]), there will be a distance of ℓ= Ω(lg_{2m/b} *n*/*B*) between the corresponding block nodes. This gives the desired lower bound of *t* = Ω((*n*/*B*)×ℓ)=Ω((*n*/*B*)×lg_{2m/b} *n*/*B*).

If we had assumed that the algorithm A was oblivious as in previous work, we would actually be done by now. This is because, under the obliviousness assumption, the graph *G* will be the same *for all* input arrays. Thus, one can indeed find the desired permutation π where there is a large distance between most pairs (*A*[*i*], *C*[*π*(*i*)]). Moreover, all inputs corresponding to that permutation π and data bit strings *d*_{1}, …, *d _{n}* can be simulated correctly using A and the graph

To overcome this barrier, we first argue that even though there can be many distinct graphs, the number of such graphs is still bounded by roughly (*nw/b* + *t*)^{t} (each I/O chooses a block to either read or write and there are *t* I/Os). This means that for *t* = *o*(*n*), one can still find a graph *G* that is the result of running A on many different input arrays *A*. We can then argue that among all those inputs *A*, there are many that all correspond to the same permutation π and that permutation π has the property from before that, and for most pairs (*A*[*i*], *C*[*π*(*i*)]), there will be a distance of ℓ= Ω(lg_{2m/b} *n*/*B*) between the corresponding block nodes. Thus, we would like to fix such a permutation and use A to obtain a network coding solution. The problem is that we can only argue that there are *many* data bit strings *d*_{1}, …, *d _{n}* that together with π result in an array

LEMMA 7. *Consider a communication game with a coordinator u, a set F* ⊆ {0, 1}^{nw} *and n players. Assume* |*F*| ≥ 2^{mw-r} *for some r. The coordinator receives as input n uniform random bit strings X _{i} of w bits each, chosen independently of the other X_{j}. The coordinator then sends a prefix-free message R_{i} to the i'th player for each i. From the message R_{i} alone (i.e., without knowing X_{i}), the i'th player can then compute a vector*

*In particular, if r* = *o*(*nw*) *and w* = ω(1), *then the communication satisfies* ∑_{i} E[|*R _{i}*|]=

We use the lemma as follows: We create a coordinator node *u* that is connected to all input block nodes and all output block nodes. In a simulation of A, the input block nodes start by transmitting their inputs to the coordinator node *u*. The coordinator then computes the messages in the lemma and sends *R _{i}* back to the input block node storing

The introduction of the node *u* of course allows some flow to traverse paths not in the original graph *G*. Thus, we have to be careful with how we set the capacities on the edges to and from *u*. We notice that edges from the input nodes to *u* need only a capacity of *w* bits per array entry (they send the inputs), and edges out of *u* need E [|*R _{i}*|] capacity for an input

One may observe that our proof uses the fact that the network coding rate is at most the flow rate in a strong sense. Indeed, the introduction of the node *u* allows a constant fraction of the flow to potentially use a constant length path. Thus, it is crucial that the network coding rate *r* and flow rate *r'* is conjectured to satisfy *r* ≤ *r'* and not, for example, *r* ≤ 3*r'*. Indeed, we can only argue that a too-good-to-be-true permutation algorithm yields a graph in which *r* ≥ *ar'* for some constant *a* > 1. However, Braverman et al.^{4} recently proved that if there is a graph where *r* ≥ (1 + ε)^{r'} for a constant ε > 0, then there is an infinite family of graphs {*G'*} where the gap is Ω((lg |*G'*|)^{c}) for a constant *c* > 0. Thus, a too-good-to-be-true permutation algorithm will indeed give a strong counter example to Conjecture 1.

Our proof of Lemma 7 is highly nontrivial and is based on the elegant proof of the bound by Barak et al.^{3} for compressing interactive communication under nonproduct distributions. Our main idea is to argue that for a uniform random bit string in {0, 1}^{nw} (corresponding to the concatenation *X* = *X*_{1} ο … ο *X _{n}* of the

As mentioned in the proof overview in Section 3, we prove our lower bound for external memory sorting via a lower bound for the easier problem of permuting: An input to the permutation problem is specified by a permutation π of {1, 2, …, *n*} as well as *n* bit strings *d*_{1}, …, *d _{n}* ∈ {0, 1}

The array *A* is presented to an external memory algorithm as a sequence of blocks, where each block contains ⌊*b*/(*w*+lg *n*)⌋ consecutive entries of *A* (the blocks have *b* = *Bw* bits). For simplicity, we henceforth assume (*w* + lg *n*) divides *b*.

The algorithm is also given an initially empty output array *C*. The array *C* is represented as a sequence of *n* words of *w* bits each, and these are packed into blocks containing *b/w* words each. The goal is to store *d*_{π−1(i)} in *C*[*i*]. That is, the goal is to *copy* the bit string *d _{i}* from

The best known upper bounds for the permutation problem work also under the *indivisibility* assumption. These algorithms solve the permutation problem in

I/Os.^{2} Moreover, this can easily be shown to be optimal under the indivisibility assumption by using a counting argument.^{2} The *n* bound is the bound obtained by running the naive "internal memory" algorithm that simply puts each element into its correct position one at a time. The other term is equivalent to the optimal comparison-based sorting bound (one thinks of *d _{i}* as an integer in [2

We thus set out to use Conjecture 1 to provide a lower bound for the permutation problem in the external memory model. Throughout the proof, we assume that *nw/b* = *n/B* is at least some large constant. This is safe to assume, as otherwise we only claim a trivial lower bound of Ω(1).

Let A be a randomized external memory algorithm for the permutation problem on *n* integers of *w* bits each. Assume A has error probability at most 1/3 and let *b* denote the disk block size in number of bits. Let *m* denote the memory size measured in number of bits. Finally, let *t* denote the expected number of I/Os made by A (on the worst input).

**I/O-graphs.** For an input array *A* representing a permutation π and bit strings *d*_{1},…*d _{n}*, and an output array

Now, run algorithm A on *A*. Whenever it makes an I/O, do as follows: If this is the first time, the block is being accessed and it is not part of the input or output (a write operation to an untouched block). Then, create a new live block node in *G* and add a directed edge from the current live memory node to the new block node (see Figure 1e). Label the new node by the next unused integer label. Otherwise, let *v* be the live node in *G* corresponding to the last time the disk block was accessed. We add a directed edge from *v* to the live memory node, mark *v* as dead, create a new live block node *v'*, and add a directed edge from the live memory node to *v'*. We give the new node the same label as *v* (Figure 1b and c). Finally, once for every *m/b* I/Os, we mark the memory node as dead, create a new live memory node, and add a directed edge from the old memory node to the new live memory node (Figure 1d).

**Figure 1. I/O-graph for an array A consisting of 3-bit strings d_{1},…,d_{8}. In this example, each disk block contains two words of w = 3 bits, that is, B = 2 (and b = Bw = 6). Also, the main memory holds M = 6 words (m = 18). Figure (a) shows the initial I/O-graph. For each disk block, we have initially one block node that is illustrated underneath them. Black nodes are dead, and white nodes are live. Figure (b) shows the updated I/O-graph after making an I/O to access the first disk block. Figure (c) is the I/O-graph after accessing the block containing C[1] and C[2]. Figure (d) shows the graph after making another I/O on the first disk block. Also, we create a new memory node after every m/b = M/B = 3 I/Os and mark the old memory node as dead. Figure (e) shows the updated graph after accessing some block other than the input or output.**

To better understand the definition of *G*, observe that all the nodes with the same label represent the different versions of a disk block that existed throughout the execution of the algorithm. Moreover, there is always exactly one live node with any fixed label, representing the current version of the disk block. Also, observe that at the end of the execution, there must be a live disk block node in *G* representing each of the output blocks in *C*, and these have the same labels as the original nodes representing the empty disk blocks of *C* before the execution of A.

**Fixing the randomness of** A. Consider the execution of A on an input *A* representing a uniform random permutation π as well as independent and uniform random bit strings *d*_{1}, …, *d _{n}* ∈ {0, 1}

**Finding a popular I/O-graph.** We now find an I/O-graph *G* that is the result of running A^{*} on a large number of different inputs. For notational convenience, let *t* denote the worst case number of I/Os made by A^{*} (instead of using *t*^{*} or 6*t*). Observe that the total number of different I/O-graphs one can obtain as the result of running A^{*} is small:

LEMMA 8. *There are no more than*

*I/O-graphs that may result from the execution of A ^{*}*.

This means that we can find an I/O-graph, which corresponds to the execution of A^{*} on many different inputs, and moreover, we can even assume that A^{*} is correct on many such inputs:

LEMMA 9. *There exists a set Γ containing at least* (*n*!2^{nw})/(2(*t* + *n*(*w* + lg *n*)/*b* + *nw*/*b* + 1)^{t+1}) *different input arrays A, such that* A^{*} *is correct on all inputs A* ∈ Γ *and the I/O-graph is the same for all A* ∈ Γ.

**Data must travel far.** The key idea in our lower bound proof is to argue that there is a permutation for which most data bit strings *d _{i}* are very far away from output entry

We prove the following:

LEMMA 10. *If* (*t*+*n*(*w*+lg *n*)/*b*+*nw*/*b*+1)^{t+1} ≤ (*nw*/*b*)^{(n/30)}, *then there exists a permutation* π, *a collection of values* *F* ⊆ {{0, 1}^{w}}^{n} *and an I/O-graph G such that the following holds:*

- For all (
*d*_{1}, …,*d*) ∈_{n}*F*, it holds that the algorithm A^{*}executed on the input array*A*corresponding to inputs π and*d*_{1}, …,*d*results in the I/O-graph_{n}*G*and A^{*}is correct on*A*. *There are at least*(4/5)*n indices i*∈ {1, …,*n*}*for which dist*(π,*i*,*G*) > (1/2) lg_{2m/b}(*nw/b*).

**Reduction to network coding.** We are now ready to make our reduction to network coding. The basic idea in our proof is to use Lemma 10 to obtain an I/O-graph *G* and permutation π with large distance between the node containing *A*[*i*] and the node containing *C*[π(*i*)] for many *i*. We will then create a source *s _{i}* at the node representing

However, the reduction is not as straightforward as that. The problem is that Lemma 10 leaves us only with a subset *F* of all the possible values *d*_{1}, …, *d _{n}* that one wants to transmit. For other values of

- Add source and sink nodes
*s*_{1}, …,*s*and_{n}*t*_{1}, …,*t*to_{n}*G*^{*}. - For each source
*s*, add an additional node_{i}*p*._{i} - Add all nodes of
*G*to*G*^{*}. - Add all edges of
*G*to*G*^{*}. Edges between a block node and a memory node have capacity*b*bits. Edges between two memory nodes have capacity*m*bits. - Remove all block nodes that have an incoming and outgoing edge to the same memory node (this makes the graph acyclic).
- Add a directed edge with capacity
*w*bits from each source*s*to_{i}*p*, and add a directed edge with capacity_{i}*w*bits from each*p*to the input block node containing_{i}*A*[*i*]. - Add an edge with capacity
*w*bits from the output block node containing*C*[π(*i*)] to the sink*t*._{i} - Add a special node
*u*to*G*^{*}. Add an edge of capacity*w*bits from each source*s*to_{i}*u*. Also, add a directed edge from*u*to each*p*having capacity ρ_{i}_{i}for parameters ρ_{i}> 0 to be fixed later. Also, add an edge from*u*to sink*t*with capacity ρ_{i}_{i}.

We argue that for sufficiently large choices of ρ_{i}, one can use A^{*} to efficiently transmit *w* bits of information between every source-sink pair (*s _{i}*,

**Deriving the lower bound.** We observe that for all edges, except those with capacity ρ_{i}, our protocol sends a fixed number of bits. Thus, messages on such edges are prefix-free. For the edges with capacity ρ_{i}, the protocol sends a prefix-free message with expected length ρ_{i}. Because all messages on all edges are prefix-free, it follows from Shannon's Source Coding theorem that the expected length of each message is an upper bound on its entropy. Because the expected lengths are at most the capacity of the corresponding edges, we get by the definition of network coding rate from Section 2, that the above solution achieves a rate of *w* bits. Hence, from Conjecture 1, it follows that if we undirected *G*^{*}, then the multicommodity flow rate must be at least *w* bits. From the definition of multicommodity flow rate in Section 2, we see that this implies that there is a (possibly fractional) way of sending *w* units of flow between each source-sink pair.

We first examine the amount of flow that can be transported between pairs (*s _{i}*,

Thus, using the reduction to sorting we have proved:

For *w* = Ω(lg *n*), we may use the reduction to sorting and we immediately obtain Theorem 2 as a corollary.

THEOREM 2. *Assuming Conjecture 1, any randomized algorithm for the external memory sorting problem with w* = Ω(lg *n*) *bit integers, having error probability at most* 1/3, *must make an expected*

*I/Os*.

1. Adler, M., Harvey, N.J.A., Jain, K., Kleinberg, R., Lehman, A.R. On the capacity of information networks. In *Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm*, SODA '06 (2006), 241–250.

2. Aggarwal, A., Vitter, J. The input/output complexity of sorting and related problems. *Commun. ACM 9*, 31 (1988), 1116–1127.

3. Barak, B., Braverman, M., Chen, X., Rao, A. How to compress interactive communication. In *Proceedings of the Forty-Second ACM Symposium on Theory of Computing*, STOC '10 (2010), 67–76.

4. Braverman, M., Garg, S., Schvartzman, A. Coding in undirected graphs is either very helpful or not helpful at all. In *8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9–11, 2017, Berkeley, CA, USA* (2017), 18:1–18:18.

5. Han, Y. Deterministic sorting in *O*(*n* lg lg *n*) time and linear space. In *Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing* (2002), ACM, New York, 602–608.

6. Han, Y., Thorup, M. Integer sorting expected time and linear space. In *Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science* (2002), IEEE, 135–144.

7. Li, Z., Li, B. Network coding: the case of multiple unicast sessions. In *Proceedings of the 42nd Allerton Annual Conference on Communication, Control and Computing, Allerton '04* (2004).

The original version of this paper is entitled "Lower Bounds for External Memory Integer Sorting via Network Coding" and was published in STOC 2019.

**©2020 ACM 0001-0782/20/10**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2020 ACM, Inc.

No entries found