Parallel Algorithm

Brent’s Principle state and proof with example

Apr 3, 2016

6259

Brent’s principle provides a schema for realizing the inherent parallelism in a problem. However, it is important to note that the time for communication between operations can be a serious impediment to the efficient implementation of a problem on a parallel machine. Often, the time to route messages between operations can be the most important limitation on exploitation of parallelism.

Theorem: (Brent’s Theorem): A PRAM algorithm involving t time steps and performing a total of m operations then it can be executed by p processors in no more than [t + {(m – t) / p}] time steps.
Or
Given A, a parallel algorithm with computation time t, if parallel algorithm A performs m computational operations, then processors can execute algorithm A in time [t + {(m – t) / p}].

Proof:

Brent’s theorem specifies for a sequential algorithm with t time steps, and a total of m operations, that a run time T is definitely possible on a shared memory machine (PRAM) with p processors. There may be an algorithm that solves this problem faster, or it may be possible to implement this algorithm faster (by scheduling instruction differently to minimize idle processors, for instance), but it is definitely possible to implement this algorithm in this time given p processors.

Key to understanding Brent’s theorem understands time steps. In a single time step every instruction that has no dependencies is executed, and therefore t is equal to the length of the longest chain of instructions that depend on the results of other instructions (as any shorter chains will be finished executing by (or before) the time the longest chain has).

Let us consider adding n integers x₁, x₂, .. .. .., x_n (where n = 2^k) is s_i denoted the number of computational operations performed by parallel algorithm A at step i, where 1 ≤ i ≤ t. Under the assumption that at most two integers can be added in one primitive operation, we see that the sum can be formed by performing n/2 additions, n/4 additions of these results, etc., until the last sum is formed. Thus by definition

∑_i=1^t ⌈s_i/ p⌉ = m_i= n / 2ⁱ for i ≤ ⌈log₂n⌉

When only p processors are available, we assign ⌈n ⁄ p⌉ integers to p-1 processors and n-(p-1) ⌈n ⁄ p⌉ integers to the remaining processor. In ⌈n ⁄ p⌉ steps, the p processors each compute their local sums, leaving their results in a reserved location. In each subsequent phase, half of the processors active in the preceding phase are active in this one. Each active processor fetches the partial sum computed by one other processor, adds it to its partial sum, and stores the result in a reserved place.

The entire computation by parallel algorithm be performed with p processors in time

∑_i=1^t ⌈s_i/ p⌉ ≤ ∑_i=1^t (s_i+ p – 1) / p
∵p processors can simulate step i in time ⌈s_i / p⌉
=∑_i=1^t p / p + ∑_i=1^t (s_i – 1) / p
=t + (m-t) / p

This expression tells us many useful things about the algorithm:

No matter how many processors used, there can be no implementation of this algorithm that runs faster than O (log n).
If we have n processors, the algorithm can be implemented in (log n) time
If we have (log n) processors, the algorithm can be implemented in 2(log n) time (asymptotically this is the same as (log n) time)
If we have one processor, the algorithm can be implemented in n time.

If we consider the amount of work done in each case, with one processor, we do n work, with (log n) processors we do n work, but with n processors we do (n log n) work. The implementations with 1 or (log n) processors, therefore are cost optimal, while the implementation with n processors is not.

It is important to remember that Brent’s theorem does not tell us how to implement any of these algorithms in parallel; it merely tells us what is possible. The Brent’s theorem implementation may be hideously ugly compared to the naive implementation.

You May Also Like: