
Microprocessors and Microsystems 23 (1999) 35-41

A programmable image processor for real-time image processing applications


M.Y. Siyal a,1,*, M. Fathy b,2

a School of EEE, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
b Department of Computer Engineering, Iran University of Science and Technology, Narmak, Tehran, Iran

Received 22 September 1998; received in revised form 25 October 1998; accepted 12 November 1998

Abstract

In this article we present the architectural design and analysis of a programmable image processor, named Snake. The processor was designed with a high degree of parallelism to speed up a range of image processing operations. The data parallelism found in array processors has been incorporated into the architecture of the proposed processor. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed. The performance of Snake is also compared with that of other processor architectures. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: SIMD; RISC; CISC; Array processor; Parallel processing; Image processing; Real-time

1. Introduction

Low-level image processing operations usually involve simple and repetitive operations over entire input images [1-5]. In real-time applications, these input images must be processed at a rate of 25 frames per second (in most countries). Hence, with every frame consisting of 256 K pixels (512 × 512 resolution), tens of megabytes of data must be processed every second. To provide such a high throughput rate, parallel implementations have to be investigated [8,9]. Area processing [2], including convolution and morphological operations, underlies nearly all feature extraction algorithms and therefore plays a vital role in image processing applications. However, because of their dependency on neighbouring pixels, these operations are computationally expensive [1,7,8]; developing suitable architectures to implement them efficiently is thus an important task for real-time image processing applications. Theoretically, the highest throughput rate is achieved when a perfect match exists between the processor architecture and the algorithms [2,5]. Keeping this in mind, the design of Snake was undertaken. The processor was designed to implement various image
* Corresponding author. Tel.: 0065 790 4464; fax: 0065 790 0415. E-mail address: emysiyal@ntu.ac.sg (M.Y. Siyal).
1 E-mail: Eyakoob@ntu.edu.sg (M.Y. Siyal).
2 E-mail: fathy@rose.ipm.ac.ir (M. Fathy).
0141-9331/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0141-9331(98)00112-4
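The throughput figure quoted in the Introduction follows from simple arithmetic. The sketch below checks it, assuming 8-bit (1-byte) grey-scale pixels, which the paper implies but does not state:

```python
# Real-time throughput requirement for 25 frames/s of 512 x 512 video.
# Assumes 1 byte per pixel (8-bit grey scale) -- an assumption, since the
# paper does not state the pixel depth explicitly.
frame_rate = 25                          # frames per second
pixels_per_frame = 512 * 512             # 256 K pixels per frame
raw_rate = frame_rate * pixels_per_frame # bytes touched per second
neighbourhood_rate = raw_rate * 9        # one 3x3 neighbouring read per pixel

print(pixels_per_frame)      # 262144
print(raw_rate)              # 6553600  (~6.6 MB/s reading each pixel once)
print(neighbourhood_rate)    # 58982400 (~59 MB/s with 3x3 neighbourhood reads)
```

The factor of nine from the neighbouring reads is what pushes the data rate into the "tens of megabytes per second" range the text mentions.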

area operations with a 3 × 3 mask, because area processing with a 3 × 3 mask is common in many image processing applications [5,9]. This is particularly true for the image processing algorithms developed for road traffic analysis [8,9]. However, other area processing with larger masks can easily be realised through several 3 × 3 basic mask operations.

2. The architecture of Snake

The whole design reflects the idea of combining the parallelism found in mesh-connected array processors with an efficient pipeline processor, so that high concurrency can be achieved.

2.1. The overall system

Fig. 1 shows the block diagram of Snake. The processor contains four functional units that are connected to each other, forming an efficient four-stage pipeline. The four hardware units are listed below:

1. Address Generation Unit (AGU): this module consists of two sub-units and is responsible for data movement between the processor and the image memory.
2. PE-array Computation Unit: it consists of nine interconnected processing elements, forming a fully mesh-connected 3 × 3 PE-array.
3. Multi-input Arithmetic Logical Unit (MALU): besides



Fig. 3. The links in PE-array.

Fig. 1. Block diagram of the processor.

the execution of normal arithmetic and logical functions, several specially designed circuits are integrated. This unit can therefore handle up to ten inputs at a time and perform group arithmetic and logical operations within one cycle.
4. Processor Control Unit (PCU): it consists of several sub-units and is responsible for fetching and decoding instructions and for generating control signals for the corresponding components. The efforts at parallelism exploitation at the machine-instruction level are mainly made in this unit.

Data flow in Snake is based on a four-stage pipeline: first, a set of neighbouring pixels is read into the processor by the AGU. The PE-array then takes the data, computes on it in parallel, and sends the processed results to the MALU. The MALU combines these partial results to obtain the final result. Finally, the result is output by the AGU. An instruction set, supporting both the Snake architecture and the characteristics of area image processing, was also developed. The instruction set is simple and designed with efficient pipelining and decoding in mind. Moreover, parallelism at the instruction level was also exploited: some instructions take advantage of spatial parallelism by using multiple functional units to execute several operations concurrently.

2.2. Address generation unit

As shown in Fig. 2, the AGU consists of two sub-units: one for address calculation and the other for interfacing.

Each sub-unit is connected to the PCU, which handles program execution. The address calculation sub-unit works out the addresses of the pixel itself and its neighbouring pixels, while the interfacing sub-unit generates interface signals to control the data movement between the processor and the memory. The most distinctive operation is the neighbouring read, in which nine neighbouring pixels in image memory are fetched by the AGU into the nine corresponding inputs of the PE-array. During each read cycle, an 18-bit base address, for three neighbouring pixels in a row, is first put on the higher bits of a 24-bit address and data bus, while the three offset addresses (2 bits for each pixel) are put on the corresponding lower bits of the bus. Then, after the control signals have been sent, the contents of the three pixels are obtained by the AGU. Finally, each item of data obtained is routed to the appropriate input of the PE-array.

2.3. PE-array computation unit

The mesh of 3 × 3 processing elements forms this computation module. As shown in Fig. 3, the nine PEs are fully interconnected; they therefore work synchronously, in the same way as a mesh-connected array processor. Each PE in the array accepts its corresponding input from the AGU, along with inputs from its neighbouring PEs. As shown in Fig. 4, every processing element (PE) consists of a Processing Arithmetic and Logical Unit (PALU) providing Boolean as well as arithmetic functions,

Fig. 2. Block diagram of AGU.
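The neighbouring-read bus format described in Section 2.2 (an 18-bit row base address on the high bits of the 24-bit bus, plus three 2-bit pixel offsets on the low bits) can be sketched as bit packing. The exact field positions below are our assumption; the paper specifies only the field widths:

```python
def pack_row_request(base18, off0, off1, off2):
    """Pack one row's neighbouring-read request onto a 24-bit word:
    bits [23:6] hold the 18-bit base address, bits [5:4], [3:2], [1:0]
    hold the three 2-bit pixel offsets. Field placement is illustrative;
    the paper gives only the widths, not the positions."""
    assert 0 <= base18 < (1 << 18)
    for off in (off0, off1, off2):
        assert 0 <= off < 4
    return (base18 << 6) | (off0 << 4) | (off1 << 2) | off2

def unpack_row_request(word24):
    """Recover the base address and the three pixel offsets."""
    return (word24 >> 6, (word24 >> 4) & 3, (word24 >> 2) & 3, word24 & 3)

word = pack_row_request(0x2ABCD, 1, 2, 3)
assert word < (1 << 24)                         # fits the 24-bit bus
assert unpack_row_request(word) == (0x2ABCD, 1, 2, 3)
```

Three such requests, one per mask row, would supply the nine pixels of a 3 × 3 neighbourhood per read cycle.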

Fig. 4. Structure of a processing element.


instruction and constant registers, and a Logic Control Unit (LCU). Each processing element has a local memory, which is used to store instructions and data. Initially, both the instruction and constant registers in each PE are loaded, ready for execution. During processing, every PE in the array simultaneously executes the operation defined in its own instruction register. Note that the whole process in the array is closely coordinated and is controlled by the PCU.

2.4. Multi-input arithmetic logical unit

As shown in Fig. 5, the MALU has ten data inputs: nine from the PE-array, carrying pre-processed data, and one from the PCU. The instructions to be run are stored in the local memory; they are loaded serially by the PCU through the unit's LCU. In every clock cycle, the MALU can therefore take more than one value from its inputs and produce a result. The LCU is responsible for the execution of the instructions that are stored in the local memory or sent by the PCU. As mentioned earlier, apart from its conventional arithmetic and logical functions, some specially designed circuits are integrated to enhance its ability to handle these ten inputs and perform group operations. The enhanced group operations in the MALU allow many complex operations essential for image area processing to be executed very efficiently. For example, the enhanced MALU can select the maximum value from its nine PE-array inputs within one cycle. The enhanced ALU can therefore deal with a variety of image area processing operations, such as convolution and morphological operations.

Fig. 5. Structure of MALU.

2.5. Processor control unit

Like traditional RISC processors, the PCU decomposes processing into a three-stage instruction pipeline (Fig. 6): instruction fetching (IF), instruction decoding (ID), and instruction execution (EX). The possible overlapping of execution among the three stages was investigated during the design, so that the performance of the processor could be improved. The processor carries out the two stages (EX and IF) concurrently when two consecutive, independent instructions are met. The EX stage includes two sections of instructions. One initialises the local memories in both the PE-array and the MALU; this section of instructions is responsible for mapping algorithms onto the functional units. The other section synchronises the processing steps among all the functional units in the processor.
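The single-cycle group operations of the MALU (Section 2.4) can be modelled in software. Real Snake performs these in dedicated hardware within one cycle; this sketch only shows the dataflow, and the operation names here are illustrative:

```python
# Software model of the MALU's group operations over its nine PE-array
# inputs. Operation names are illustrative, not Snake mnemonics.
GROUP_OPS = {
    "group_max": max,                   # e.g. grey-scale dilation
    "group_min": min,                   # e.g. grey-scale erosion
    "group_add": sum,                   # e.g. convolution accumulation
    "group_med": lambda v: sorted(v)[len(v) // 2],  # median (cf. SMED)
}

def malu_cycle(pe_outputs, op_name):
    """One MALU cycle: apply a group operation to the nine PE outputs."""
    assert len(pe_outputs) == 9, "MALU takes the nine PE-array outputs"
    return GROUP_OPS[op_name](pe_outputs)

pe_outputs = [17, 3, 250, 42, 0, 99, 128, 7, 64]
print(malu_cycle(pe_outputs, "group_max"))  # 250
print(malu_cycle(pe_outputs, "group_add"))  # 610
```

On a conventional ALU each of these reductions would cost at least eight sequential two-input operations; collapsing them into one cycle is exactly the advantage the text claims for the MALU.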

3. Instruction set

Along with the architectural design of Snake, an effective instruction set had to be developed as well, so that various image-processing algorithms could be implemented. Simplicity and suitability for image processing were the two major concerns during the development of the instruction set. In addition, the possible parallelism at the instruction level and the particular architecture of the processor were important factors, and were taken into consideration at the design stage. A total of 64 instructions with an 8-bit format have been developed for basic use. Apart from instructions for normal functions, there are quite a few distinctive instructions developed to take advantage of the special architecture of Snake. A short list of the instructions is shown in Table 1. As an example, CEP is an instruction for both the PE-array and the MALU: its execution results not only in all PEs in the PE-array processing simultaneously, but also in the MALU processing at the same time. With the help of such instructions, including the group instructions and the neighbour read/write instructions, the architecture of the processor is well exploited.

3.1. Programming

In Snake, each unit has its own short program and data, and each unit executes its own program on all pixels. The first step of programming is therefore to map the algorithm onto these units. In our approach, the mapping is done by loading instructions and data into the local memories. After the mapping procedure, the units are ready for operation. During processing, instructions are decoded in the PCU, and synchronisation signals are then sent to the corresponding units. These signals are used to control the pace of processing in the processor.
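The 8-bit opcodes listed in Table 1 can be used to sketch a trivial decoder of the kind the PCU's ID stage performs. The decoder itself is our illustration, not Snake's decode logic; CEP is omitted because its opcode appears truncated in the table:

```python
# Opcodes taken from Table 1 (CEP omitted: its printed opcode is truncated).
OPCODES = {
    "ARD":  0b00000111,  # neighbour read
    "RPA":  0b00111011,  # PE-array execute
    "SMED": 0b01000010,  # select medium from PE-array
    "OUTR": 0b01000101,  # output result in MALU to AGU
    "AINC": 0b00000100,  # AGU address register plus 1
    "CRP":  0b10000111,  # both neighbour read and PE-array execution
}
DECODE = {code: mnem for mnem, code in OPCODES.items()}

def decode(byte):
    """ID-stage sketch: map an 8-bit opcode back to its mnemonic."""
    assert 0 <= byte < 256
    return DECODE.get(byte, "ILLEGAL")

print(decode(0b00000111))  # ARD
print(decode(0b11111111))  # ILLEGAL
```

With a fixed 8-bit format and only 64 instructions, decoding is a single table lookup, which is consistent with the paper's emphasis on efficient pipelining and decoding.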

4. The analytical model


Fig. 6. Instruction pipeline.

As most image area algorithms use a 3 × 3 mask for the basic implementation, we discuss the analytical model for Snake based on a 3 × 3 neighbourhood. Disregarding the times for program decoding and register initialisation, the total processing time (TA) has three main components: the data I/O time (TIO), the computation time in the PE-array (TPE), and the computation time in the MALU (TMALU). These times are functions of the program length (L) and the image size (N). The total time to process a whole image can be expressed by

TA = TIO + TPE + TMALU, (1)

The I/O time in Eq. (1) can be expressed by

TIO = N(tin + tout), (2)

where N is the number of pixels in an image, tin is the cycle time for performing a neighbouring read operation, and tout is the cycle time for performing a write-result operation. For Snake, because of the enhanced design of the AGU,

tin = tout = 12 CLK in practice. (3)

The computation time in the PE-array is expressed by

TPE = N · tPE,

where tPE is the longest processing cycle time among the PEs in the array. Note that all PEs compute in parallel. The computation time in the MALU is expressed by

TMALU = N · tMALU, (4)

where tMALU is the time taken to execute an algorithm step on one pixel. Note that, because of the special hardware inside the MALU, the number of instructions the MALU needs to execute an algorithm is significantly reduced. Hence, the total time for a whole image can again be expressed by

TA = N(tIO + tMALU + tPE). (5)

Eq. (5) can also be expressed graphically by the pipeline shown in Fig. 7. Moreover, because of the spatial parallelism in the pipeline, the four processing times (tin, tout, tMALU, tPE) overlap. Taking this into account, a coefficient r (0 < r ≤ 1) is introduced to adjust Eq. (5) as follows:

TA = N · r(tIO + tMALU + tPE). (6)

Fig. 7. The analytical model.

As shown in Eq. (6), two ways are available to speed up the processing: one is to reduce the coefficient r, and the other is to reduce the processing time of each stage. The former means developing spatial parallelism in the pipeline, while the latter means improving the efficiency of every stage. Enormous efforts were therefore made during the processor design, both on the optimisation of the architecture at component level and on the exploitation of parallelism at the machine-instruction level. For Snake, we achieved r = 0.4.

Table 1
A short list of instructions

Instruction mnemonic  Brief description                                             Opcode
ARD                   Neighbour read                                                00000111
RPA                   PE-array execute                                              00111011
SMED                  Select medium from PE-array                                   01000010
OUTR                  Output result in MALU to AGU                                  01000101
AINC                  AGU address register plus 1                                   00000100
CEP                   Execute operations in both PE-array and MALU, simultaneously  1001100
CRP                   Both neighbour and PE-array execution                         10000111

5. Evaluation and comparison

This section presents an evaluation of the proposed processor by implementing typical image processing algorithms. For the purpose of comparison, the algorithms were also implemented on a number of CISC, RISC and DSP processors. These processors, namely the Intel 80860 (i860) RISC processor, the Pentium-200 CISC processor, the TMS320C30 DSP, and the T800 Transputer, are among the most commonly used processors for real-time applications [6]. In order to compare the performance of Snake with these processors, we implemented various image-processing algorithms on Snake; however, for the purpose of comparison and performance evaluation, only two algorithms are described in this article.

5.1. Dilation algorithm

The first algorithm considered in this investigation is a dilation operation with a 3 × 3 mask over a 512 × 512 image. Dilation is a basic morphological operation and represents a class of algorithms devised for local feature extraction in image processing [9]. The two-dimensional dilation of a grey-scale image I(x, y) by a grey-scale structuring element C(i, j) can be expressed by the following equation:

D(x, y) = maximum[I(x + i, y + j) + C(i, j)]. (7)

In our investigation, matrix C is the 3 × 3 mask shown below, and the test image is a picture with 512 × 512 pixels:

    | 1 0 2 |
C = | 2 1 1 |
    | 1 0 1 |
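A minimal software sketch of Eq. (7) with this structuring element is given below. Border pixels are left unprocessed, which is a simplification of ours; the paper does not state Snake's border policy:

```python
# Grey-scale dilation, Eq. (7): D = max over (i, j) of I(x+i, y+j) + C(i, j),
# using the 3x3 structuring element C given in Section 5.1.
C = [[1, 0, 2],
     [2, 1, 1],
     [1, 0, 1]]

def dilate(image):
    """Apply Eq. (7) to every interior pixel; the 1-pixel border is
    copied through unprocessed (border handling is our assumption)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = max(image[y + j - 1][x + i - 1] + C[j][i]
                            for j in range(3) for i in range(3))
    return out

img = [[0, 0, 0, 0],
       [0, 5, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
res = dilate(img)
print(res[1][1])  # 6, i.e. 5 + C[1][1]
```

On Snake, the inner `max` over the nine sums is precisely the case where the PE-array computes the nine additions in parallel and the MALU selects the group maximum in a single cycle.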


Therefore, the test dilation algorithm can be expressed by the following equation:

D(i, j) = maximum{I(i + k − 1, j + l − 1) + C(k, l)}, 0 ≤ k ≤ 2, 0 ≤ l ≤ 2, (8)

where I(i, j) is the value of the input pixel (i, j), C(k, l) is the matrix C coefficient, and D(i, j) is the result value at pixel (i, j). Overall, the algorithm requires 3² × 512² calculations over the 512 × 512 test image.

5.2. A low-bound filter

The second algorithm implemented was a convolution algorithm, named the low-bound filter. In this filter, a 3 × 3 matrix is applied as:

    | 0 1 0 |
C = | 1 3 1 |
    | 0 1 0 |

Like the dilation operation, matrix C is applied over the test picture (512 × 512 pixels). The filtering algorithm is expressed by the following equation:

D(i, j) = Σ{I(i + k, j + l) · C(k, l)}, 0 ≤ k ≤ 2, 0 ≤ l ≤ 2, (9)

where I(i, j) is the input pixel (i, j), C(k, l) is the matrix C coefficient, and D(i, j) is the resultant pixel.

5.3. Implementation

As shown in Fig. 8, two steps are required to implement the above algorithms on Snake. The first step initialises the values of the registers in both the AGU and the PE-array. The second step is the repeated execution of a set of instructions. In practice, the first step is carried out once at the beginning, while the second step is executed until the whole image has been processed. Since the same test picture was used for both algorithms, the four AGU registers were initialised during the mapping process with the same set of data, reflecting the structure of the test image (512 × 512) and the starting position of the pixel (0, 0). During this process, the instructions and the registers of every processing element in the PE-array are initialised with suitable values, corresponding to the algorithm involved. For example, for the dilation algorithm, the constant and instruction registers of PE0 in the PE-array were initialised with the values 1 and Add, respectively, while for the low-bound filter these two registers were initialised with 0 and Multiply, respectively. In the

Fig. 8. Flow chart.
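The low-bound filter of Eq. (9) can be sketched the same way as the dilation; only the per-neighbour operation (multiply instead of add) and the group operation (sum instead of maximum) change. As before, leaving the border unprocessed is our simplification:

```python
# Low-bound filter, Eq. (9): D(i, j) = sum over (k, l) of I(i+k, j+l) * C(k, l),
# with the 3x3 matrix C given in Section 5.2.
C = [[0, 1, 0],
     [1, 3, 1],
     [0, 1, 0]]

def low_bound_filter(image):
    """3x3 convolution per Eq. (9); the 1-pixel border is copied through
    unprocessed (border handling is an assumption of this sketch)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(image[y + k - 1][x + l - 1] * C[k][l]
                            for k in range(3) for l in range(3))
    return out

img = [[1, 1, 1],
       [1, 2, 1],
       [1, 1, 1]]
print(low_bound_filter(img)[1][1])  # 1 + 1 + 6 + 1 + 1 = 10
```

This is why, on Snake, the same five-instruction inner loop serves both algorithms: the PE instruction registers hold Add or Multiply, and the MALU local memory holds group-maximum or group-adding instructions.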

Fig. 9. Execution times of processors (dilation algorithm).


Fig. 10. Execution times of processors (convolution algorithm).

implementation of the dilation algorithm, the instructions related to obtaining the group maximum were downloaded into the local memory of the MALU, while for the low-bound filter the local memory was initialised with the group-adding instructions. After the above mapping process, the second step is the processing of the image. At this stage, the PCU decodes the processing instructions from the memory module and generates the synchronisation signals for the corresponding units. The nine neighbouring pixels are obtained by the neighbouring read instruction, followed by the instruction that sets both the PE-array and the MALU working simultaneously. After two management instructions, the pixel result is obtained and output by the AGU. With the help of the spatial-parallelism instructions, only five instructions are involved in this stage for the two algorithms.

6. Results and comparison

In order to evaluate the architecture and performance of Snake, we have conducted extensive experiments and have compared the performance of Snake with that of the other processors. Figs. 9 and 10 show the comparative performance of the processors in implementing the dilation and convolution algorithms over the same 512 × 512 test image, respectively. As can be seen, the proposed architecture is more powerful than the commonly used processors.

7. Conclusion

In this article, we have presented the architectural design and performance results of a programmable image processor called Snake. An important factor in the design of Snake is its pipeline structure with a small number of processing elements. As a result of this small mesh-connected

PE-array, concurrency in the processor is significantly improved. Some other parallel strategies, such as parallelism at the instruction level, were also implemented in the design. As Snake is a programmable processor, it is easy to implement various image-processing algorithms on it, and it is flexible enough to deal with different image structures. The proposed processor is therefore a good candidate for real-time image processing applications.

References
[1] C.J. Turner, V.C. Bhavsar, Parallel implementations of convolution and moments algorithms on a multi-transputer system, Microprocessing and Microprogramming (1995) 283-290.
[2] P.A. Navaux, A.F. Cesar, Performance evaluation in image processing with GAPP array processor, Microprocessing and Microprogramming (1995) 71-82.
[3] D.D. Eggers, E. Ackerman, High speed image rotation in embedded systems, Computer Vision and Image Understanding (1995) 270-277.
[4] A. Abnous, N. Bagherzadeh, Architectural design and analysis of a VLIW processor, Computers and Electrical Engineering 21 (1995) 119-142.
[5] V. Cantoni, C. Guerra, S. Levialdi, Towards an evaluation of an image processing system, in: Computer Structures for Image Processing, Academic Press, New York, 1983, pp. 43-56.
[6] M.O. Tokhi, M.A. Hossain, CISC, RISC and DSP processors in real-time signal processing and control, Microprocessors and Microsystems, 1995.
[7] M. Fathy, M.Y. Siyal, A combined edge detection and background differencing image processing approach for real-time traffic analysis, Journal of Road and Transport Research 4 (3) (1995) 114-123.
[8] M.Y. Siyal, M. Fathy, A real-time image processing approach for automatic traffic control and monitoring, IES Journal 35 (1) (1995) 50-55.
[9] M. Fathy, M.Y. Siyal, Measuring traffic movements using image processing techniques, Pattern Recognition Letters 18 (5) (1997) 493-500.

Dr M.Y. Siyal obtained his M.Sc. and Ph.D. from the University of Manchester Institute of Science and Technology, England, in 1988 and 1991, respectively. From February 1991 to January 1993, he was with the Department of Electrical and Electronic Engineering, University of Newcastle upon Tyne, England. Since February 1993, he has been with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. Dr Siyal is a senior member of IEEE, a member of IEE and a chartered engineer. He has published over 40 refereed journal and conference papers. His research interests are real-time image processing, computer architecture, multimedia and information technology.


Dr M. Fathy received his B.Sc. degree in Electrical Engineering from Iran University of Science and Technology in 1984. He obtained his M.Sc. in Process Engineering from Bradford University, England, in 1987 and his Ph.D. from the University of Manchester Institute of Science and Technology, England, in 1990. He is currently working in the Department of Computer Engineering at Iran University of Science and Technology, where he teaches postgraduate students. Dr Fathy has been a consultant to various organisations and is actively conducting research in the areas of image processing, neural networks, fuzzy logic and real-time image processing applications, and has published extensively in these areas in international journals and conferences.
