Home/Magazine Archive/February 2014 (Vol. 57, No. 2)/Communication Costs of Strassen's Matrix Multiplication/Full Text

Research highlights
## Communication Costs of Strassen's Matrix Multiplication

Algorithms have historically been evaluated in terms of the number of arithmetic operations they performed. This analysis is no longer sufficient for predicting running times on today's machines. Moving data through memory hierarchies and among processors requires much more time (and energy) than performing computations. Hardware trends suggest that the relative costs of this communication will only increase. Proving lower bounds on the communication of algorithms and finding algorithms that attain these bounds are therefore fundamental goals. We show that the communication cost of an algorithm is closely related to the graph expansion properties of its corresponding computation graph.

Matrix multiplication is one of the most fundamental problems in scientific computing and in parallel computing. Applying expansion analysis to Strassen's and other fast matrix multiplication algorithms, we obtain the first lower bounds on their communication costs. These bounds show that the current sequential algorithms are optimal but that previous parallel algorithms communicate more than necessary. Our new parallelization of Strassen's algorithm is communication-optimal and outperforms all previous matrix multiplication algorithms.

Communication (i.e., moving data) can greatly dominate the cost of an algorithm, whether the cost is measured in running time or in total energy. This holds for moving data between levels of a memory hierarchy or between processors over a network. Communication time per data unit varies by orders of magnitude, from order of 10^{9} seconds for an L1 cache reference to order of 10^{2} seconds for disk access. The variation can be even more dramatic when communication occurs over networks or the internet. In fact, technological trends^{16, 17} are making communication costs grow exponentially over time compared to arithmetic costs. Moore's Law is making arithmetic on a chip improve at about 60% per year, but memory and network bandwidth is improving at only 26% and 23% per year.^{16} So even in cases where communication is not the bottleneck today, it may be in the future.

Ideally, we would be able to determine lower bounds on the amount of required communication for important problems and design algorithms that attain them, namely, algorithms that are communication-optimal. These dual problems have long attracted researchers, with one example being classical Θ(*n*^{3}) matrix multiplication (see further details below), with lower bounds proved in Hong and Kung^{18} and Irony et al.^{20} and many optimal sequential and parallel algorithms obtained in, for example, Agarwal et al.^{1} and Cannon^{11}.

These lower bounds have recently been extended to a large class of other classical linear algebra problems, including linear system solving, least squares, and eigenvalue problems, for dense and sparse matrices, and for sequential and parallel machines.^{9} Surprisingly, the highly optimized algorithms in widely implemented libraries like LAPACK and ScaLAPACK^{3} often do not attain these lower bounds, even in the asymptotic sense. This has led to much recent work inventing new, faster algorithms that do; see the citations in Ballard et al.^{9, 10} for references.

In this paper, we describe a novel approach to prove the first communication lower bounds for Strassen's Θ() matrix multiplication algorithm, as well as many similar fast algorithms. Specifically, we introduce expansion analysis of the computational graphs of the algorithms and show that the expansion helps determine the communication cost. These communication cost bounds are *lower* than those of classical matrix multiplication: this means that not only does Strassen's algorithm reduce computation, but it also creates an opportunity for reducing communication. In addition, the lower bound decreases as the amount of available memory grows, suggesting that using extra memory may also allow for faster algorithms.

In fact, there is an optimal parallel algorithm that attains our lower bounds for varying amounts of memory, whose performance exceeds all other known matrix multiplication implementations, classical or Strassen-based, on a large parallel machine,^{6} see Figure 1. In the rest of this paper, we focus on explaining our new lower bounds for Strassen's algorithm and their implications.

**1.1. Communication models**

In order to analyze the communication costs of algorithms, we consider idealized memory and communication models. In the sequential case (see Figure 2), we consider a machine with two levels of memory hierarchy: a fast memory of size *M* words (where computation is performed) and a slow memory of infinite size. We assume that the input initially resides in slow memory and is too large to fit in fast memory. We define the *communication cost* of a sequential algorithm to be the total number of words transferred between the slow and fast memories.

In the parallel case (see Figure 2), we consider *p* processors, each with a local memory of size *M*, connected over a network. In this case, the communication cost is the number of words transferred between processors, counted along the critical path of the algorithm. That is, two words that are communicated simultaneously between separate pairs of processors are counted only once.

**1.2. Classical matrix multiplication**

To illustrate the effects of arithmetic reordering on communication and running time of a sequential computation, consider the problem of computing matrix multiplication *C* = *A* · *B*, where the (*i, j*)^{th} output element is computed by the classical formula *C*_{ij} = Σ_{k} *A*_{ik} · *B*_{kj}. One "naive" ordering of the computation of the classical algorithm can be specified simply by three nested loops (see Algorithm 1). For matrices that are too large to fit in fast memory, this ordering requires the communication of at least one operand for each scalar multiplication, resulting in a total communication cost of Θ(*n*^{3}). A natural question to ask is: can we do better?

The answer is yes. We can reduce communication by using a "blocked" algorithm (see Algorithm 2). The idea is to partition *A, B*, and *C* into square blocks of size *b* × *b* so that three blocks can simultaneously fit in the fast memory. We use the notation *C*[*I, J*] to refer to the (*I, J*)^{th} *b* × *b* block of the *C* matrix. When *C*[*I, J*], *A*[*I, K*], and *B*[*K, J*] are all in fast memory, then the inner loop of the algorithm (corresponding to (*b*^{3}) arithmetic operations) can be performed with no more communication.

If we pick the maximum block size of *b* = , this results in a total of Θ((*n*/)^{3}) block operations, each requiring (*M*) words to be communicated. Hence, the total communication cost is Θ(*n*^{3}/), a factor of Θ(√*M*) better than that of the naive algorithm.

The typical performance difference between the naive and blocked algorithms on a sequential machine is an order of magnitude. With the blocked algorithm, attained performance is close to the peak capabilities of the machine. Again, the question arises: can we do better? Can we further reorder these computations to communicate less?

If we insist on performing the (*n*^{3}) arithmetic operations given by the classical formulation, the answer is no. Hong and Kung^{18} proved a communication cost lowerbound of Ω(*n*^{3}/) for any reordering, showing that the blocked algorithm is communication-optimal. But this is not the end of the story: this communication optimality of the blocked algorithm assumes (*n*^{3}) arithmetic operations.

**1.3. Strassen's matrix multiplication**

While the classical algorithms for matrix multiplication have already been optimized for reducing communication cost to the minimum possible, a completely different algorithmic approach for this problem is possible. Let us recall Strassen's algorithm^{24} (see Algorithm 3).

Strassen's key idea is to multiply 2 × 2 matrices using seven scalar multiplies instead of eight. Because *n* × *n* matrices can be divided into quadrants, Strassen's idea applies recursively. Each of the seven quadrant multiplications is computed recursively, and the computational cost of additions and subtractions of quadrants is (*n*^{2}). Thus, the recurrence for the flop count is *F*(*n*) = 7*F* (*n*/2) + (*n*^{2}) with base case *F*(1) = 1, which yields *F*(*n*)=(), which is asymptotically less computation than the classical algorithm.

The main results presented in the following section expose a wonderful fact: not only does Strassen's algorithm require less computation than the classical algorithm, but it also requires less communication!

In this section, we state our main results: communication lower bounds for Strassen's matrix multiplication. The proof technique described in Section 3 allows us to state bounds in both sequential and parallel cases. As mentioned in the Section 1, the lower bounds are lower than the bounds for the classical algorithm.^{18, 20} In both sequential and parallel cases, there now exist communication-optimal algorithms that achieve the lower bounds.

**2.1. Sequential case**

We obtain the following lower bound:

THEOREM 1.^{10} *Consider Strassen's algorithm implemented on a sequential machine with fast memory of size M. Then for M n*^{2}, *the communication cost of Strassen's algorithm is*

It holds for any implementation and any known variant of Strassen's algorithm that is based on performing 2 × 2 matrix multiplication with seven scalar multiplications. This includes Winograd's *O*() variant that uses 15 additions instead of 18, which is the most commonly used fast matrix multiplication algorithm in practice.

This lower bound is tight, in that it is attained by the standard recursive sequential implementation of Strassen's algorithm. The recursive algorithm's communication cost is given by the recurrence . The base case occurs when the input and output sub-matrices fit in the fast memory and the matrix multiplication can be performed with no further communication. This yields

for *M n*^{2}, matching the lower bound stated in Theorem 1.

**2.2. Parallel case**

The proof technique of Theorem 1 extends to parallel machines, yielding

COROLLARY 2.^{10} *Consider Strassen's algorithm implemented on a parallel machine with p processors, each with a local memory of size M. Then for , the communication cost of Strassen's algorithm is*

While Corollary 2 does not hold for all sizes of local memory (relative to the problem size and number of processors), the following memory-independent lower bound can be proved using similar techniques^{5} and holds for all local memory sizes, though it requires separate assumptions.

THEOREM 3.^{5} *Suppose a parallel algorithm performing Strassen's matrix multiplication load balances the computation. Then, the communication cost is*

Note that the bound in Corollary 2 dominates the one in Theorem 3 for *M* = *O* (*n*^{2}/*p*^{2/log 7}). Thus, the tightest lower bound for parallel implementations of Strassen is the maximum of these two bounds. Table 2 and Figure 3, both adapted from Ballard et al.,^{5} illustrate the relationship between the two functions. Figure 3 in particular shows bounds on strong scaling: for a fixed dimension *n*, increasing the number of processors (each with local memory size *M*) within a limited range does not increase the total volume of communication. Thus, the communication cost along the critical path decreases linearly with *p*. This is because in this "perfect strong scaling range," the dominant lower bound includes a *p* in the denominator; however, when the second bound begins to dominate, the denominator includes a *p*^{2/3} rather than *p*, and increasing *p* leads to more communication volume. As shown in the figure, a similar phenomenon occurs for the classical algorithm, though with slightly different parameters.^{5, 23}

The recent parallel algorithm for Strassen's matrix multiplication^{6} has communication cost

where *p* is the number of processors and *M* is the size of the local memory. Note that this matches the lower bounds of Corollary 2 and Theorem 3 above. A similar algorithm for Strassen's matrix multiplication in the BSP model is presented in McColl and Tiskin.^{22}

The crux of the proof of Theorem 1 is based on estimating the edge expansion of the computation graph of Strassen's algorithm. We describe below how communication cost is closely related to the edge expansion properties of this graph. The graph has a recursive structure, and we use a combinatorial analysis of the expansion. The high-level argument is based on partitioning the computation in segments, which we explain in Section 3.3. Let us first define two key concepts: computation graphs and edge expansion. See Ballard et al.^{10} for the full proof.

**3.1. Computation graphs**

The computation performed by an algorithm on a given input can be modeled as a computation directed acyclic graph (*CDAG*): we have a vertex for each input, intermediate, and output argument, and edges according to direct dependencies (e.g., for the binary arithmetic operation *x*: = *y + z*, we have directed edges from vertices corresponding to operands *y* and *z* to the vertex corresponding to *x*).

In the sequential case, an implementation (or scheduling) determines the order of execution of the arithmetic operations, which respects the partial ordering of the CDAG. In the parallel case, an implementation determines which arithmetic operations are performed by which of the *p* processors as well as the ordering of local operations. This corresponds to partitioning the CDAG into *p* parts. Edges crossing between the various parts correspond to arguments that are in the possession of one processor but are needed by another processor and therefore relate to communication.

**3.2. Edge expansion**

Expansion is a graph-theoretic concept^{19} that relates a given subset of a graph to its boundary. If a graph has large expansion, then subsets of vertices will have relatively large boundaries. For example, a 2D grid where each vertex has north, south, east, and west neighbors has small expansion, whereas a complete graph has large expansion. While there are several variants of expansion metrics, we are interested in edge expansion of regular graphs, defined as follows: the edge expansion *h*(*G*) of a *d*-regular undirected graph *G* = (*V, E*) is

where *E*_{G}(*A, B*) is the set of edges connecting the disjoint vertex sets *A* and *B*.

Note that CDAGs are typically not regular. If a graph *G* = (*V, E*) is not regular but has a bounded maximal degree *d*, then we can add (<*d*) loops to vertices of degree <*d*, obtaining a regular graph *G*. We use the convention that a loop adds 1 to the degree of a vertex. Note that for any *S* ⊆ *V*, we have |*E*_{G}(*S, V\S*)| *= |E*_{G} (*S, V*\S)|, as none of the added loops contributes to the edge expansion of *G*.

For many graphs, small sets have larger expansion than larger sets. Let *h*_{s}(*G*) denote the edge expansion of *G* for sets of size at most *s*:

For many interesting graph families (including Strassen's CDAG), *h*_{s}(*G*) does not depend on |*V*(*G*)| when *s* is fixed, although it may decrease when *s* increases.

**3.3. The partition argument**

The high-level lower bound argument is based on partitioning the execution of an algorithm's implementation into segments. Let *O* be any total ordering of the vertices that respects the partial ordering of the CDAG *G*, that is, all the edges are directed upwards in the total order. This total ordering can be thought of as the actual order in which the computations are performed. Let *P* be any partition of *V* into segments *S*_{1}, *S*_{2}, ..., so that a segment *S*_{i} *P* is a subset of the vertices that are contiguous in the total ordering *O*.

Let *S* be some segment, and define *R*_{S} and *W*_{S} to be the set of read and write operands, respectively (see Figure 4), namely, *R*_{S} is the set of vertices outside *S* that have an edge going into *S*, and *W*_{S} is the set of vertices in *S* that have an edge going outside of *S*. Recall that *M* is the size of the fast memory. Then, the total communication cost due to reads of operands in *S* is at least |*R*_{S}*| M*, as at most *M* of the needed |*R*_{S}| operands are already in fast memory when the segment starts. Similarly, *S* causes at least |*W*_{S}| *M* actual write operations, as at most *M* of the operands needed by other segments are left in the fast memory when the segment ends. The total communication cost is therefore bounded below by

**3.4. Edge expansion and communication**

Consider a segment *S* and its read and write operands *R*_{S} and *W*_{S} (see Figure 4). If the graph *G* containing *S* has *h*(*G*) edge expansion, maximum degree *d* and at least 2|*S*| vertices, then (using the definition of *h*(*G*)), we have

Combining this with (3) and choosing to partition *V* into |*V*|/s segments of equal size *s*, we obtain *IO* max_{s} (|*V*|/s) · (*h*(*G*) · *s* 2*M*) = (|*V*| · *h*(*G*)). In many cases, *h*(*G*) is too small to attain the desired communication cost lower bound. Typically, *h*(*G*) is a decreasing function of |*V*(*G*)|; that is, the edge expansion deteriorates with the increase of the input size and number of arithmetic operations of the corresponding algorithm (this is the case with Strassen's algorithm). In such cases, it is better to consider the expansion of *G* on small sets only: *IO* max_{s} (| *V*|/*s*) · (*h*_{s}(*G*) · *s* 2*M*). Choosing the minimal *s* so that

we obtain

The existence of a value *s* | *V*|/2 that satisfies condition (4) is not always guaranteed. In Ballard et al.,^{10} we confirm the existence of such *s* for Strassen's CDAG for sufficiently large |*V*|.

Recall Strassen's algorithm for matrix multiplication and consider its computation graph. If we let *H*_{i} be the computation graph of Strassen's algorithm for recursion of depth *i*, then *H*_{} corresponds to the computation for input matrices of size *n* × *n*. Let us first consider *H*_{1} as shown in Figure 5, which corresponds to multiplying 2 × 2 matrices. Each of *A* and *B* is "encoded" into seven pairs of multiplication inputs, and vertices corresponding to the outputs of the multiplications are then "decoded" to compute the output matrix *C*.

The general computation graph *H*_{} has similar structure:

- Encode
*A*: generate weighted sums of elements of*A* - Encode
*B*: generate weighted sums of elements of*B* - Multiply the encodings of
*A*and*B*element-wise - Decode
*C*: take weighted sums of the products

Denote by *Enc*_{}*A* the part of *H*_{} that corresponds to the encoding of matrix *A*. Similarly, *Enc*_{}*B*, and *Dec*_{}*C* correspond to the parts of *H*_{} that compute the encoding of *B* and the decoding of *C*, respectively. Figure 6 shows a high level picture of *H*_{}. In the next section, we provide a more detailed description of the CDAG.

**4.1. Recursive construction**

We construct the computation graph *H*_{i+1} by constructing *Dec*_{i+1}*C* from *Dec*_{i}*C* and *Dec*_{1}*C*, similarly constructing *Enc*_{i+1}*A* and *Enc*_{i+1}*B*, and then composing the three parts together. Here is the main idea for recursively constructing *Dec*_{i+1}*C*, which is illustrated in Figure 7.

- Replicate
*Dec*_{1}*C*7^{i}times. - Replicate
*Dec*_{i}*C*4 times. - Identify the 4 · 7
^{i}output vertices of the copies of*Dec*_{1}*C*with the 4 · 7^{i}input vertices of the copies of*Dec*_{i}*C*:- - Recall that each
*Dec*_{1}*C*has four output vertices. - - The set of each first output vertex of the 7
^{i}*Dec*_{1}*C*graphs is identified with the set of 7^{i}input vertices of the first copy of*Dec*_{i}*C*. - - The set of each second output vertex of the 7
^{i}*Dec*_{1}*C*graphs is identified with the set of 7^{i}input vertices of the second copy of*Dec*_{i}*C*, and so on. - - We make sure that the
*j*^{th}input vertex of a copy of*Dec*_{i}*C*is identified with an output vertex of the*j*^{th}copy of*Dec*_{1}*C*.

- - Recall that each

After constructing *Enc*_{i+1}*A* and *Enc*_{i+1}*B* in a similar manner, we obtain *H*_{i+1} by connecting edges from the *k*^{th} output vertices of *Enc*_{i+1}*A* and *Enc*_{i+1}*B* to the *k*^{th} input vertex of *Dec*_{i+1}*C*, which corresponds to the element-wise scalar multiplications.

**4.2. Strassen's edge expansion**

Given the construction of the CDAG for Strassen's algorithm, we now state our main lemma on the edge expansion of the decoding graph. The proof technique resembles the expander analysis in Alon et al.^{2} For the complete proof, see Ballard et al.^{10}

LEMMA 5. (MAIN LEMMA) *The edge expansion of Dec*_{k}*C* is

By another argument (proof in Ballard et al.^{10}), we obtain that

where *s* = (7^{k}). Choosing *s* = (*M*^{}), we satisfy Inequality 4 and obtain Inequality 5 (for sufficiently large |*V*|). This gives Theorem 1.

In this paper, we focus on lower bounds for Strassen's matrix multiplication algorithm on two machine models. However, the design space of improving fundamental algorithms via communication minimization is much larger. It includes proving lower bounds and developing optimal algorithms; using classical methods as well as fast algorithms like Strassen's; performing matrix multiplication, other matrix algorithms, and more general computations; minimizing time and/or energy; using minimal memory or trading off extra memory for less communication; and using hierarchical, homogeneous, or heterogeneous sequential and parallel models. In this section, we discuss a subset of these extensions; see Ballard et al.^{9, 10} and the references therein for more details.

**5.1. Lower bounds**

The proof technique described in Section 3 is not specific to Strassen's algorithm and can be applied more widely. The partition argument is used for classical algorithms in numerical linear algebra^{8, 20} where a geometric inequality specifies the per-segment communication cost rather than edge expansion. Further, the edge expansion technique applies to *Strassen-like* algorithms that also multiply square matrices with *o*(*n*^{3}) arithmetic operations, to other fast algorithms for rectangular matrix multiplication, and to other matrix computations.

**Strassen-like algorithms**. Strassen-like algorithms are recursive matrix multiplication algorithms based on a scheme for multiplying *k* × *k* matrices using *q* scalar multiplications for some *k* and *q < k*^{3} (so that the algorithm performs *O*() flops where *ω*_{0} = log_{k} *q*.) For the latest bounds on the arithmetic complexity of matrix multiplication and references to previous bounds, see Williams.^{25} For our lower bound proof to apply, we require another technical criterion for Strassen-like algorithms: the decoding graph must be connected. This class of algorithms includes many (but not all) fast matrix multiplications. For details and examples, see Ballard et al.^{7, 10}

For Strassen-like algorithms, the statements of the communication lower bounds have the same form as Theorem 1, Corollary 2, and Theorem 3: replace log_{2} 7 with *ω*_{0} everywhere it appears! The proof technique follows that for Strassen's algorithm. While the bounds for the classical algorithm have the same form, replacing log_{2} 7 with 3, the proof techniques are quite different.^{18, 20}

**Fast rectangular matrix multiplication**. Many fast algorithms have been devised for multiplication of rectangular matrices (see Ballard et al.^{7} for a detailed list). A fast algorithm for multiplying *m* × *k* and *k* × *r* matrices in *q* < *mkr* scalar multiplications can be applied recursively to multiply *m*^{t} × *k*^{t} and *k*^{t} × *r*^{t} matrices in *O*(*q*^{t}) flops. For such algorithms, the CDAG has very similar structure to Strassen and Strassen-like algorithms for square multiplication in that it is composed of two encoding graphs and one decoding graph. Assuming that the decoding graph is connected, the proofs of Theorem 1 and Lemma 5 apply where we plug in *mr* and *q* for 4 and 7. In this case, we obtain a result analogous to Theorem 1 which states that the communication cost of such an algorithm is given by Ω(*q*^{t}*/M*^{}). If the output matrix is the largest of the three matrices (i.e., *k* < *m* and *k* < *r*), then this lower bound is attained by the natural recursive algorithm and is therefore tight. The lower bound extends to the parallel case as well, analogous to Corollary 2, and can be attained using the algorithmic technique of Ballard et al.^{6}

**The rest of numerical linear algebra**. Fast matrix multiplication, algorithms are basic building blocks in many fast algorithms in linear algebra, such as algorithms for LU, QR, and eigenvalue and singular value decompositions.^{13} Therefore, communication cost lower bounds for these algorithms can be derived from our lower bounds for fast matrix multiplication algorithms. For example, a lower bound on LU (or QR, etc.) follows when the fast matrix multiplication algorithm is called by the LU algorithm on sufficiently large sub-matrices. This is the case in the algorithms of Demmel et al.,^{13} and we can then deduce matching lower and upper bounds.^{10}

**Nested loops computation**. Nearly all of the arguments for proving communication lower bounds are based on establishing a relationship between a given set of data and the amount of useful computation that can be done with that data, a so-called "surface-to-volume" ratio. For example, Hong and Kung^{18} use an analysis of dominator sets and minimal sets of CDAGs to establish such ratios. The LoomisWhitney geometric inequality is applied for this purpose to matrix computations specified by three nested loops in Ballard et al.^{8} and Irony et al.^{20} Recently, Christ et al.^{12} have extended this analysis using a generalization of the Loomis Whitney inequality, known as the HölderBrascampLieb inequality, to prove lower bounds for computations that are specified by an arbitrary set of nested loops that linearly access arrays and meet certain other criteria.

**5.2. Algorithms**

The main motivation for pursuing communication lower bounds is to provide targets for algorithmic performance. Indeed, the conjecture and proof of Theorem 1 and Corollary 2, as well as the existence of an optimal algorithm in the sequential case, were the main motivations for improving the parallel implementations of Strassen's algorithm. Not only were we able to devise an optimal algorithm, but we were also able to show with an implementation for distributed-memory machines that it performs much faster in practice.^{6, 21}

**Communication avoiding parallel Strassen**. In Section 2.2, we stated the communication cost of a new parallel algorithm for Strassen's matrix multiplication, matching the asymptotic lower bound. The details of the algorithm appear in Ballard et al.,^{6} and more extensive implementation details and performance data are given in Lipshitz et al.^{21} We show that the new algorithm is more efficient than any other parallel matrix multiplication algorithm of which we are aware, including those that are based on the classical algorithm and those that are based on previous parallelizations of Strassen's algorithm.

Figure 1 shows performance on a Cray XT4. For results on other machines, see Lipshitz et al.^{21} For example, running on a Cray XE6 with up to 10,000 cores, for a problem of dimension *n* = 131712, our new algorithm attains performance as high as 30% above the peak for classical matrix multiplication, 83% above the best classical implementation, and 75% above the best previous implementation of Strassen's algorithm. Even for a small problem of dimension *n* = 4704, it attains performance 66% higher than the best classical implementation.

**Further applications**. The key algorithmic idea in our parallel implementation of Strassen's algorithm is a careful parallel traversal of the recursion tree. This idea works for many other recursive algorithms where the subproblems do not have interdependencies (and it also works in some cases where dependencies exist). For example, classical rectangular matrix multiplication^{14} and sparse matrixmatrix multiplication^{4} can be parallelized in this way to obtain communication optimality.

The same techniques can be utilized to save energy at the algorithmic level (since communication consumes more energy than computation) as well as to obtain lower bounds on energy requirements.^{15}

In summary, we believe this work flow of theoretical lower bounds to algorithmic development to efficient implementations is very effective: by considering fundamental computations at an algorithmic level, significant improvements in many applications are possible.

We would like to thank Benjamin Lipshitz for his work on many of these ideas and for useful discussions during the writing of this paper.

This work is supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding by U.C. Discovery (Award #DIG07-10227); additional support from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, and Samsung is acknowledged. This research is supported by U.S. Department of Energy grants under Grant Numbers DE-SC0003959, DE-SC0004938, DE-SC0005136, DE-SC0008700, AC02-05CH11231, and DE-FC02-06-ER25786, and DARPA grant HR0011-12-2-0016. The research is also supported by the Sofja Kovalevskaja programme of Alexander von Humboldt Foundation and by the National Science Foundation under agreement DMS-0635607, and by ERC Starting Grant Number 239985.

1. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P. A three-dimensional approach to parallel matrix multiplication. *IBM J. Res. Dev. 39*, 5 (1995), 575582.

2. Alon, N., Schwartz, O., Shapira, A. An elementary construction of constant-degree expanders. *Combinator. Probab. Comput. 17*, 3 (2008), 319327.

3. Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Croz, J.D., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., Sorensen, D. LAPACK's User's Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. Also available from http://www.netlib.org/lapack/.

4. Ballard, G., Buluç, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O., Toledo, S. Communication Optimal Parallel Multiplication of Sparse Random Matrices. In *Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures*, (2013), ACM, New York, NY, USA.

5. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O. Brief announcement: Strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. In *Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures*, (2012), ACM, New York, NY, USA, 7779.

6. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O. Communication-optimal parallel algorithm for Strassen's matrix multiplication. In *Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures*, SPAA '12 (2012), ACM, New York, NY, USA, 193204.

7. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O. Graph expansion analysis for communication costs of fast rectangular matrix multiplication. In *Design and Analysis of Algorithms*. G. Even and D. Rawitz, eds., Volume 7659 of *Lecture Notes in Computer Science* (2012), Springer, Berlin-Heidelberg, 1336.

8. Ballard, G., Demmel, J., Holtz, O., Schwartz, O. Graph expansion and communication costs of fast matrix multiplication. In *Proceedings of the 23rd Annual ACM Symposium on Parallel Algorithms and Architectures* (2011), ACM, New York, NY, USA, 112.

9. Ballard, G., Demmel, J., Holtz, O., Schwartz, O. Minimizing communication in numerical linear algebra. *SIAM J. Matrix Anal. Appl. 32*, 3 (2011), 866901.

10. Ballard, G., Demmel, J., Holtz, O., Schwartz, O. Graph expansion and communication costs of fast matrix multiplication. *J. ACM* (Dec. 2012) *59*, 6, 32:132:23.

11. Cannon, L. *A cellular computer to implement the Kalman filter algorithm*. PhD thesis, Montana State University, Bozeman, MN (1969).

12. Christ, M., Demmel, J., Knight, N., Scanlon, T., Yelick, K. Communication lower bounds and optimal algorithms for programs that reference arrays Part I. Manuscript, 2013.

13. Demmel, J., Dumitriu, I., Holtz, O. Fast linear algebra is stable. *Numer. Math. 108*, 1 (2007), 5991.

14. Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O., Spillinger, O. Communication-optimal parallel recursive rectangular matrix multiplication. In *Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium* (*IPDPS*) (2013), IEEE.

15. Demmel, J., Gearhart, A., Lipshitz, B., Schwartz, O. Perfect strong scaling using no additional energy. In *Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium*, IPDPS '13 (2013), IEEE.

16. Fuller, S.H., Millett, L.I., eds. *The Future of Computing Performance: Game Over or Next Level?* The National Academies Press, Washington, D.C., 2011, 200 pages, http://www.nap.edu.

17. Graham, S.L., Snir, M., Patterson, C.A., eds. *Getting up to Speed: The Future of Supercomputing*. Report of National Research Council of the National Academies Sciences. The National Academies Press, Washington, D.C., 2004, 289 pages, http://www.nap.edu.

18. Hong, J.W., Kung, H.T. I/O complexity: The red-blue pebble game. In *STOC '81: Proceedings of the 13th annual ACM Symposium on Theory of Computing* (1981), ACM, New York, NY, USA, 326333.

19. Hoory, S., Linial, N., Wigderson, A. Expander graphs and their applications. *Bull. AMS 43*(4), (2006), 439561.

20. Irony, D., Toledo, S., Tiskin, A. Communication lower bounds for distributed-memory matrix multiplication. *J. Parallel Distrib. Comput. 64*, 9, (2004), 10171026.

21. Lipshitz, B., Ballard, G., Demmel, J., Schwartz, O. Communication-avoiding parallel Strassen: Implementation and performance. In *Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis*, (2012), IEEE Computer Society Press, Los Alamitos, CA, USA, 101:1101:11.

22. McColl, W.F., Tiskin, A. Memory-efficient matrix multiplication in the BSP model. *Algorithmica 24* (1999), 287297.

23. Solomonik, E., Demmel, J. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In *Proceedings of the 17th International European Conference on Parallel and Distributed Computing* (2011), Springer.

24. Strassen, V. Gaussian elimination is not optimal. *Numer. Math. 13* (1969), 354356.

25. Williams, V.V. Multiplying matrices faster than Coppersmith-Winograd. In *Proceedings of the 44th Symposium on Theory of Computing*, STOC '12 (2012), ACM, New York, NY, USA, 887898.

The original version of this paper is entitled "Graph Expansion and Communication Costs of Fast Matrix Multiplication" and was first published in the *Proceedings of the 2011 ACM Symposium on Parallelism in Algorithms and Architectures* and also appeared in the December 2012 issue of the *Journal of the ACM*.

Figure 1. Strong-scaling performance comparison of parallel matrix multiplication algorithms on a Cray XT4.^{6} All data corresponds to a fixed dimension *n* = 94080. The *x*-axis represents the number of processors *p* on a log scale, and the *y*-axis measures effective performance, or 2*n*^{3}/(*p* · time). The new algorithm outperforms all other known algorithms and exceeds the peak performance of the machine with respect to the classical flop count. The new algorithm runs 24184% faster than the best previous Strassen-based algorithm and 5184% faster than the best classical algorithm for this problem size.

Figure 2. Sequential two-level (left) and parallel distributed-memory (right) models.

Figure 3. Communication costs and strong scaling of matrix multiplication: classical vs. Strassen.^{5} The vertical axis corresponds to *p* times the communication cost, so horizontal lines correspond to perfect strong scaling. The quantity *p*_{min} is the minimum number of processors required to store the input and output matrices (i.e., *p*_{min} = 3*n*^{2}/*M* where *n* is the matrix dimension and *M* is the local memory size).

Figure 4. A subset (segment) *S* and its corresponding read operands *R*_{S} and write operands *W*_{S}.

Figure 5. Computation graph of Strassen's algorithm for multiplying 2 × 2 matrices (*H*_{1}). The encodings of *A* and *B* correspond to the additions and subtractions in lines 410 of Algorithm 3, and the decoding of the seven multiplications to compute *C* corresponds to lines 1114. A vertex labeled with two indices *ij* corresponds to the (*i, j*)^{th} entry of a matrix and a vertex labeled with one index *k* corresponds to the *k*^{th} intermediate multiplication.

Figure 6. High-level view of Strassen's CDAG for *n* × *n* matrices. The graph is composed of two encoding subgraphs and one decoding subgraph; connections between the subgraphs are not shown.

Figure 7. Illustration of the recursive construction of the decoding subgraph. To construct *Dec*_{i+1}*C, Dec*_{i}*C* is replicated 4 times and *Dec*_{1}*C* is replicated 7^{i} times, and appropriate vertices are identified.

Table 1. Asymptotic communication cost lower bounds for sequential matrix multiplication, where *n* is the matrix dimension and *M* is the fast memory size. Note that although the expressions for classical and Strassen are similar, the proof techniques are quite different

Table 2. Asymptotic communication cost lower bounds for parallel matrix multiplication, where *n* is matrix dimension, *M* is local memory size, and *p* is the number of processors

**©2014 ACM 0001-0782/14/02**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.

No entries found