
Optimizing Implementation of Autocorrelation Function

Authors: Eng. Mihai Nicolae, Newgate Design SRL; Dr. Eng. Ioan Rugina, Institute of Solid Mechanics; Eng. Alexandru Vasile, University Politehnica of Bucharest
e-mail: nicolaem@go.ro; vasal@colel.pub.ro; rugina@mecsol.ro

Abstract
The (auto)correlation function is of great practical importance whenever a signal must be processed in the presence of noise. It can be implemented in very different ways, each with its own advantages and disadvantages. One option is software: running a program on an ordinary PC, or on a system built around a microprocessor, microcontroller or digital signal processor. In this case performance depends on the computing power of the processor, on how well the processor is optimized for the operations specific to digital signal processing, and on how the algorithms are implemented. The advantage of such implementations is high flexibility: the processing algorithm can easily be modified. The processing speed, however, is moderate, and real-time processing of high-frequency signals is very difficult to achieve. Another approach is to design a dedicated integrated circuit for this function. This yields very high processing speed and allows real-time processing. Its main disadvantage is the high cost, which is justified only when a large number of circuits is needed. Semi-custom circuits, which allow the processing parameters to be adjusted, can also be used. The modeling and testing of such dedicated circuits can be done with the hardware description language VHDL.

1. Introduction

Correlation is a mathematical operation that is very similar to convolution. Just as with convolution, correlation uses two signals to produce a third signal, called the cross-correlation of the two input signals. The standard equation for correlation: if a[n] is an N-point signal running from 0 to N-1, and b[n] is an M-point signal running from 0 to M-1, the correlation r[k] = a[k] \star b[k] of the two is an (N+M-1)-point signal running from -(N-1) to (M-1), given by:

$$r[k] = \sum_{n=0}^{N-1} a[n]\, b[n+k] \qquad (1)$$

with k between -(N-1) and (M-1).
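As an illustration, the correlation sum can be implemented directly in a few lines (a plain Python sketch, not from the paper; autocorrelation is simply the special case b = a):

```python
def correlate(a, b):
    """Direct implementation of the correlation sum:
    r[k] = sum over n of a[n] * b[n + k],
    for k from -(N-1) to M-1, i.e. N+M-1 output points."""
    n_len, m_len = len(a), len(b)
    r = {}
    for k in range(-(n_len - 1), m_len):   # each output point is independent
        r[k] = sum(a[n] * b[n + k]
                   for n in range(n_len)
                   if 0 <= n + k < m_len)  # skip terms that fall outside b
    return r

# Autocorrelation is the special case b = a; note the symmetry r[k] = r[-k]:
r = correlate([1, 2, 3], [1, 2, 3])
print(r[0], r[1], r[-1])   # 14 8 8
```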

This equation is called the correlation sum. It allows each point in the output signal to be calculated independently of all the other points. The index k determines which sample of the output signal is being calculated; in computer programs performing correlation, a loop makes this index run through every sample of the output. To calculate one output sample, the index n runs through 0 to N-1, and each sample a[n] is multiplied by the corresponding sample of the input signal, b[n+k]. All these products are added together to produce the output sample being calculated.

Correlation is the optimal technique for detecting a known waveform in random noise: the peak rises higher above the noise than with any other linear system. (To be precise, it is optimal only for random white noise.) Using correlation to detect a known waveform is frequently called matched filtering.

Writing a program that correlates one signal with another is a simple task, requiring only a few lines of code. Executing the program may be more painful. The problem is the large number of additions and multiplications required by the algorithm, resulting in long execution times. As the programs below show, the time-consuming operation is multiplying two numbers and adding the result to an accumulator. The other parts of the algorithm, such as indexing the arrays, are very quick. The multiply-accumulate is a basic building block in DSP and appears in several other important algorithms; in fact, the speed of DSP processors is often specified by how long a multiply-accumulate operation takes.

2. Autocorrelation

If a signal is correlated with itself, the resulting signal is called the autocorrelation. In this case equation (1) becomes:
$$r[k] = a[k] \star a[k] = \sum_{n=0}^{N-1} a[n]\, a[n+k] \qquad (2)$$

with k between -(N-1) and (N-1), which represents 2N-1 points. The autocorrelation is symmetrical, r[k] = r[-k], so we do not need to calculate all 2N-1 points; it is enough to determine N of them, say between 0 and N-1. Let us write this equation out for a few points:

  r[-(N-1)] = a(0)a(N-1)
  r[-(N-2)] = a(0)a(N-2) + a(1)a(N-1)
  ...
  r[0] = a(0)a(0) + a(1)a(1) + ... + a(N-1)a(N-1)
  ...
  r[N-2] = a(0)a(N-2) + a(1)a(N-1)
  r[N-1] = a(0)a(N-1)

As shown above, it is enough to calculate r[0] to r[N-1]. For this purpose a computer program such as the following can be used; it calculates all the elements between 0 and N-1:

  for k := 0 to N-1 do
    for n := 0 to N-1 do
      if (n+k) < N then
        r[k] := r[k] + a(n)*a(n+k)
      else
        break;

This simple algorithm can be implemented on an ordinary PC or on a dedicated system based on a microprocessor, microcontroller or digital signal processor. The problem is that the number of operations (especially additions and multiplications) is very large, and the calculation can take anywhere between tens of minutes on a top-level PC and tens of milliseconds on a dedicated system with a high-class DSP. If a signal composed of N samples is autocorrelated, roughly N^2/2 multiply-accumulations must be performed. Personal computers of the late 1990s required about 0.8 microseconds per multiply-accumulation (a 450 MHz Pentium III using single-precision floating point). Calculating the autocorrelation of a 10,000-sample signal therefore takes about 40 seconds, and processing a 100,000-point signal takes over an hour. A decade earlier (an 80286 at 12 MHz), this calculation would have taken days!

3. Hardware Implementation

Another implementation method is a dedicated integrated circuit. This provides the maximum possible speed, because all the operations are implemented in hardware and execute in one clock cycle. We assume a special integrated circuit that reads the signal samples, stores them in a memory, then calculates the autocorrelation and stores the results in memory. The circuit will, of course, also be able to deliver the calculated autocorrelation data to its output pins.

Suppose the input samples are 16 bits long. A product is then a 32-bit operand, and the sum of, say, 100,000 terms is 49 bits long. This raises the problem of choosing the most suitable memory. In principle the function can be calculated for any number of samples, given enough memory; the result memory, however, must be wide enough for the number of input samples. Here we have two possibilities: to choose the maximum allowed number of samples

and accordingly choose the result memory width (even if this is not a typical size, a multiple of 8 or 16), or to choose a standard-width memory that is large enough for our purpose. We follow the second way and choose a 64-bit-wide result memory. In this case we can theoretically process about 4 billion samples, given enough memory. For our example we preferred a standard memory block of 32K x 32. We use 2 such modules for the sample memory, obtaining a size of 128K x 16 (room for over 100,000 samples), and 8 modules for the result memory, obtaining a size of 128K x 64. The result memory is very big and for practical reasons cannot be used like that; the implementation method then depends on the requirements of the application. For example, we can give up the result data memory and output each result as soon as it is calculated, together with an output strobe signal; an external device must then capture this data. Our dedicated circuit has the following block diagram:

[Block diagram: the 16-bit Input Samples are written into the 2 x 32K x 32 Input Data Memory; the ALU multiplies its operands X and Y and accumulates into Reg64; the 64-bit results go to the 8 x 32K x 32 Result Data Memory and to the Output Results pins; the Command Unit, driven by Input Strobe, Start Strobe, Output Strobe, Clk and Reset, generates Busy and controls all the blocks.]
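The 49-bit figure and the 4-billion-sample limit mentioned above follow from a short bit-width calculation; the following Python sketch (illustrative only, with constants mirroring the text) makes the arithmetic explicit:

```python
import math

SAMPLE_BITS = 16                  # input sample width from the text
PRODUCT_BITS = 2 * SAMPLE_BITS    # a 16 x 16 multiply gives a 32-bit product

def accumulator_bits(n_samples):
    """Bits needed to sum n_samples full-scale products without overflow."""
    return PRODUCT_BITS + math.ceil(math.log2(n_samples))

print(accumulator_bits(100_000))  # 49 bits, as stated in the text
print(2 ** (64 - PRODUCT_BITS))   # a 64-bit word holds ~4.29e9 products
```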

We have the following signals:

- Input Samples (16 bits) - the input data written into the memory;
- Input Strobe - each input datum is written when this strobe is active;
- Start Strobe - the command that starts the calculation;
- Output Results - the output data (64 bits) representing the result of the operation;
- Output Strobe - each output datum is delivered on this strobe;
- Busy - active as long as the calculation is being performed;
- Clk, Reset - the clock and reset signals.

The Command Unit controls all the operations: it loads the registers, drives the ALU, and starts and stops the calculation. The ALU is the most important part of the circuit; the output data are calculated here. In the same clock cycle the circuit multiplies X by Y and adds the result to Reg64.

4. Working Diagram

On each Input Strobe the input datum is stored into the Input Data Memory. When all the desired samples are in memory, a pulse on Start Strobe starts the calculation. As long as the calculation is running, Busy is active, indicating that no other operation can be done. If the result data must be output, then after Busy goes inactive one result datum is delivered on Output Results for each Output Strobe, in order from 0 to N-1. The working diagram is the following:

[Timing diagram: Input Samples are latched on successive Input Strobe pulses; a Start Strobe pulse raises Busy for the duration of the calculation; after Busy falls, Output Data words are delivered on successive Output Strobe pulses.]
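The handshake just described can be sketched behaviorally; the following Python model is a hypothetical illustration of the protocol (not the authors' VHDL design):

```python
class AutocorrelatorModel:
    """Hypothetical behavioral model of the circuit's handshake:
    samples enter on input strobes, a start strobe triggers the
    computation, and results leave in order on output strobes."""

    def __init__(self):
        self.samples = []    # models the Input Data Memory
        self.results = []    # models the Result Data Memory
        self.busy = False
        self._out_idx = 0

    def input_strobe(self, sample):
        """One sample is written on each Input Strobe."""
        self.samples.append(sample)

    def start_strobe(self):
        """Start Strobe: compute r[0] .. r[N-1]; Busy is active meanwhile."""
        self.busy = True
        n = len(self.samples)
        self.results = [
            sum(self.samples[i] * self.samples[i + k] for i in range(n - k))
            for k in range(n)   # symmetry r[k] = r[-k] gives the other half
        ]
        self.busy = False

    def output_strobe(self):
        """One result is delivered, in order, on each Output Strobe."""
        value = self.results[self._out_idx]
        self._out_idx += 1
        return value
```

Feeding the samples 1, 2, 3 and pulsing the strobes yields r[0] = 14, r[1] = 8, r[2] = 3.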

5. Logical Diagram

The most important part of this circuit is the ALU: the global performance depends on the performance of the ALU, so we tried to design it to be as fast as possible. For this reason many operations are performed in a single clock period, using a parallel hardware architecture. Each step of the diagram below executes in one clock cycle. As can be observed from the diagram, the calculation of one term takes 2 clock cycles, because the memory cannot deliver 2 different data words simultaneously. This means that for 100,000 samples we need 200,000 clock cycles for the longest element, r[0].

[Flowchart of the Command Unit, rendered as pseudocode:]

  wait until StbSTART = 1
  -- initialization
  ADR_X <= 0; ADR_Y <= 0; ADR_Rez <= 0; k <= 0; n <= 0; Reg64 <= 0; Busy <= 1
  repeat
    -- prime the operand registers (one memory read per cycle)
    InputDataMemory_ADR <= ADR_X; ADR_X <= ADR_X + 1
    X <= Operand_X; InputDataMemory_ADR <= ADR_Y
    Y <= Operand_Y; InputDataMemory_ADR <= ADR_X; ADR_Y <= ADR_Y + 1
    while (ADR_X < N) and (ADR_Y < N) do
      -- multiply-accumulate, alternating the two memory reads
      Reg64 <= Reg64 + X * Y; X <= Operand_X; ADR_X <= ADR_X + 1
      InputDataMemory_ADR <= ADR_Y
    end while
    -- one element finished
    OutputResult <= Reg64; ADR_X <= 0; ADR_Y <= k + 1
    ADR_Rez <= ADR_Rez + 1; k <= k + 1; n <= 0; Reg64 <= 0
  until k = N
  Busy <= 0; STOP

Let us consider a 50 MHz clock, i.e. a period of 20 ns. One multiply-accumulate operation is then performed in 40 ns, and the final result takes about 200 s (3 min and 20 s). If more speed is really needed, we can use a dual-port memory, which allows two data reads at the same time, or a parallel architecture with several ALUs that calculate several terms of the autocorrelation simultaneously.

We have written this circuit in VHDL, which can describe its behavior exactly, and simulated it with ModelSim to test its performance. We used MATLAB to generate an input data file and an autocorrelation result data file, and converted these into hex data files (with a Pascal program), which are easier to use from VHDL. By simulating the code with the input data file we obtained a result data file; with another Pascal program we compared the result files generated by MATLAB and by our code. In this way we could fully test the circuit. The next step is to synthesize it with the Synopsys tools and to check how big the circuit is, what its maximum clock frequency is, and how much it costs.

6. Observations and Conclusions

This paper has presented a hardware method for implementing the autocorrelation function. Its processing speed is much higher than that of the other methods: for 100,000 samples a PC needs over an hour for this calculation, while our circuit needs just 200 s. With a sampling period of 2 ms the calculation can therefore be performed in real time, and if we decrease the number of input samples we can increase the sampling rate. The main disadvantages of this method are the high cost and the lack of flexibility; it is worthwhile only when a large number of circuits must do the same job.

References
1. The Scientist and Engineer's Guide to Digital Signal Processing - Steven W. Smith
2. The VHDL Cookbook - Peter J. Ashenden
