

November 2008


Parallel Computing Language Constructs

Parallel for-Loops

Applications that apply the same algorithm to different data sets, such as Monte Carlo simulations, parameter sweeps and test suites, are often implemented in MATLAB using for-loops. Replacing the for-loop with the parfor construct can speed up such loops by allowing several MATLAB workers to execute individual loop iterations simultaneously. For example, a loop of 100 iterations could run on a cluster of 20 MATLAB workers, so that each worker executes only five iterations. You might not get a full 20-fold speedup because of communication overhead and network traffic, but the speedup should be significant for a moderately time-intensive loop.

The following example shows how even a 2-worker MATLAB pool can speed up a time-consuming (but somewhat arbitrary in this case!) task: calculating the eigenvalues of a large random matrix 12 times.

How long does a single iteration take? (Note: I turn off multithreading for this test!)

maxNumCompThreads(1);
tic
a = eig(rand(1000));
singleTime = toc
                                
singleTime =
    3.4435
                                

That task took around 3.4 seconds on my laptop. Logically then, running the same task 12 times should take around 41 seconds.

tic,
for i=1:12
    a=eig(rand(1000));
end
serialLoopTime = toc
                                
serialLoopTime =
   42.5411
                                

Now let's try to use both processors on my laptop, by making use of my "local" configuration (which is created automatically when you install Parallel Computing Toolbox, and does not require MATLAB Distributed Computing Server).

The matlabpool commands instruct MATLAB to set up a pool of 2 MATLAB workers on the local cluster, and then at the end to release (close) those workers.

matlabpool local 2
tic,
parfor i=1:12
    a=eig(rand(1000));
end
localLoopTime = toc
matlabpool close
                                
Starting matlabpool using the parallel configuration 'local'.
Waiting for parallel job to start...
Connected to a matlabpool session with 2 labs.
localLoopTime =
   26.1158
Sending a stop signal to all the labs...
Waiting for parallel job to finish...
Performing parallel job cleanup...
Done.
                                

As you can see, the task completed faster than in a single MATLAB session, but not quite twice as fast, because of communication overhead between the client and the workers. Running the same loop on a cluster in our training room yields the following results.

matlabpool training
tic,
parfor i=1:12
    a=eig(rand(1000));
end
trainingLoopTime = toc
matlabpool close
                                
Destroying 1 pre-existing parallel job(s) created by matlabpool 
that were in the finished or failed state.

Starting matlabpool using the parallel configuration 'training'.
Waiting for parallel job to start...
Connected to a matlabpool session with 6 labs.
trainingLoopTime =
   14.1150
Sending a stop signal to all the labs...
Waiting for parallel job to finish...
Performing parallel job cleanup...
Done.
                                

This task took around 14 seconds. The overhead comes from sending the tasks to the training room over the network and waiting for them to complete; the training room machines are also slower than my laptop. Here is a graph showing the time for each configuration:

bar([serialLoopTime, localLoopTime, trainingLoopTime]);
set(gca,'XTickLabel', {'Serial (1)', 'Local (2)', 'Training (6)'})
title('Running an arbitrary process 12 times on various clusters');
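
To put numbers on the comparison, the speedup and parallel efficiency can be computed from the three timings. The following is a small sketch that hard-codes the times printed above; the names numWorkers, times, speedup and efficiency are introduced here for illustration only.

```matlab
% Timings copied from the outputs shown above (seconds).
serialLoopTime   = 42.5411;
localLoopTime    = 26.1158;
trainingLoopTime = 14.1150;

numWorkers = [1 2 6];
times      = [serialLoopTime, localLoopTime, trainingLoopTime];

speedup    = serialLoopTime ./ times;   % how many times faster than serial
efficiency = speedup ./ numWorkers;     % fraction of the ideal linear speedup

% fprintf cycles through the matrix column by column, one row of output each.
fprintf('%d worker(s): speedup %.2f, efficiency %.0f%%\n', ...
        [numWorkers; speedup; 100*efficiency]);
```

With these timings, the two local workers give roughly a 1.6-fold speedup and the six training room workers roughly a 3-fold speedup, so efficiency drops as more (and slower, networked) workers are added.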
                                

Single Program Multiple Data

The single program multiple data (SPMD) language construct allows seamless interleaving of serial and parallel programming.

The "single program" aspect of spmd means that the identical code runs on multiple workers. You run one program in the MATLAB client, and those parts contained within spmd blocks run on the workers. The "multiple data" aspect means that even though the spmd statement runs identical code on all workers, each worker can have different, unique data for that code.

Typical applications of spmd are those that run a program simultaneously on multiple data sets when the workers need to communicate or synchronize with each other, or when the data must be distributed across several workers because it does not fit in the memory of a single machine.
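
As a minimal sketch of the "single program, multiple data" idea, the block below runs the same two statements on every worker while each worker produces a different result. It assumes a pool is already open and uses labindex, the toolbox function that returns the index of the current worker; the variable names localData and localSum are hypothetical.

```matlab
% Assumes a pool is already open, e.g. via: matlabpool local 2
spmd
    % Identical code on every worker, but labindex differs per worker,
    % so each worker builds different data.
    localData = labindex * ones(1, 3);
    localSum  = sum(localData);
end

% Back on the client, localSum is a Composite with one entry per worker.
for w = 1:length(localSum)
    fprintf('Worker %d computed %g\n', w, localSum{w});
end
```

Indexing the Composite with braces, as in localSum{w}, copies that worker's value back to the client.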

The following example demonstrates how to use an spmd block to perform a task that involves data too large to fit on my machine. In this example, we perform a vectorised Monte Carlo analysis to solve a financial mathematics problem: finding the price of a European call option. For more information on European call option pricing, see [Hull, John C, "Options, Futures and Other Derivatives", Prentice Hall].

First we clear our data and set up the cluster we will use for this example.

clear all
matlabpool training
                                
Starting matlabpool using the parallel configuration 'training'.
Waiting for parallel job to start...
Connected to a matlabpool session with 6 labs.
                                

Next, we run some serial code on the client.

riskFree = 0.1;
volatility = 0.4;
numTradingDays = 252;
numPaths = 120000;
numYears = 1;

spotPrice = 100;
strikePrice = 98;
                                

Now we calculate the paths in parallel. Note how this code is written once, but runs on every worker in the cluster. All required client variables are automatically broadcast to the workers. Because optionPrice and epsilon are created with a codistributor, their data is distributed among the cluster machines without my having to break up the matrices manually; MATLAB keeps track of the partitioning for me.

spmd
    optionPrice = spotPrice * ones(1,numPaths,codistributor('1d'));
    epsilon = randn(numTradingDays-1,numPaths,codistributor('1d'));
    dt = 1/numTradingDays;
    dOptionPrice = exp((riskFree-volatility^2/2)*dt + ...
        volatility*epsilon*sqrt(dt));
    optionPrice = cumprod([optionPrice;dOptionPrice]);
    % Make one worker gather the results.
    res = gather(optionPrice, 1);
end
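
In mathematical terms, the block above applies the standard geometric Brownian motion update to each simulated path. The notation here is introduced for illustration and does not appear in the code: S_k is the simulated price after k steps (stored in optionPrice), r is riskFree, σ is volatility, Δt is dt, and ε_k is a standard normal draw (one row of epsilon).

```latex
S_{k+1} = S_k \exp\!\left( \left(r - \tfrac{\sigma^2}{2}\right)\Delta t
          + \sigma\,\varepsilon_k \sqrt{\Delta t} \right),
\qquad
S_k = S_0 \prod_{j=0}^{k-1} \exp\!\left( \left(r - \tfrac{\sigma^2}{2}\right)\Delta t
          + \sigma\,\varepsilon_j \sqrt{\Delta t} \right)
```

The product form is why a single cumprod over the stacked matrix [optionPrice; dOptionPrice] yields every path at once: the first row holds S_0 = spotPrice and each following row holds one multiplicative step.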
                                

Finally, retrieve the results from the first worker and run the remaining serial code on the client.

finalPrice=res{1};
payoff = max(finalPrice(end,:)-strikePrice, 0);
expected_payoff = mean(payoff);
% Take the time value of money into account
callMonteCarlo = exp(-riskFree*numYears)*expected_payoff
matlabpool close
                                
callMonteCarlo =
   20.9617
Sending a stop signal to all the labs...
Waiting for parallel job to finish...
Performing parallel job cleanup...
Done.
                                

Note that on my local machine with 2 GB of RAM, I could not run this example, as I got an "Out of memory" error.

Let M-Lint Be Your Guide

In order to benefit from a parfor speedup, the loop contents must be independent. Fortunately, MATLAB includes M-Lint code checking to warn you when a parfor loop is not independent.
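
To illustrate what "independent" means here, the following sketch contrasts two loops that parfor accepts with one that it does not. The variable names and the small loop bodies are made up for illustration; without an open pool the parfor loops simply run in the client.

```matlab
n = 10;

% Independent iterations: each pass writes only its own element of y,
% so the iterations can run in any order on any worker.
y = zeros(1, n);
parfor i = 1:n
    y(i) = i^2;
end

% Reduction: a running total of the form s = s + expr is also accepted,
% because the additions can be accumulated in any order.
s = 0;
parfor i = 1:n
    s = s + i^2;
end

% Dependent iterations: each pass reads the value written by the previous
% one, so this loop must stay an ordinary for-loop. Changing "for" to
% "parfor" here is flagged by M-Lint and fails at run time.
z = ones(1, n);
for i = 2:n
    z(i) = z(i-1) + i;
end
```

A useful rule of thumb is that a loop is a parfor candidate if running its iterations in reverse or random order would give the same result.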

 
