Sei sulla pagina 1di 212

Proceedings of the

Rapid System Pro t o t yp i n g


Shortening the Path from Specification to Prototype

2011 22nd IEEE International Symposium on

24-27 may 2011 Karlsruhe, Germany Sponsored by IEEE Reliability Society Karlsruhe Institute of Technology

Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the rst page, provided the percopy fee indicated in the code is paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For other copying, reprint or republication permission, write to IEEE Copyrights Manager, IEEE Operations Center, 445 Hoes Lane, Piscataway, NJ 08854. All rights reserved. Copyright c 2011 by IEEE.

2011 22nd IEEE International Symposium on Rapid System Prototyping

Table of Contents
Message from the General Chair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Message from the Program Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Message from the Organizing Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Conference Committees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Tutorial and Keynotes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Session 1: Automotive & FPGA An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar . . . . . . . . . . . . . . . . Sazzadur Chowdhury, Roberto Muscedere and Sundeep Lal FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance . . . . . . . . . . . Alexander Bochem, Kenneth Kent and Rainer Herpers 2 9

Rapid Prototyping of OpenCV Image Processing Applications using ASP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Felix Mhlbauer, Michael Grohans and Christophe Bobda Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations . . . . 23 Ming Zhang and Zonghua Gu FPGA Design for Monitoring CANbus Trafc in a Prosthetic Limb Sensor Network . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Alexander Bochem, Kenneth Kent, Yves Losier, Jeremy Williams and Justin Deschenes Session 2: Prototyping Architectures Rapid Single-Chip Secure Processor Prototyping on OpenSPARC FPGA Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Jakub Szefer, Wei Zhang, Yu-Yuan Chen, David Champagne, King Chan, Will Li, Ray Cheung and Ruby Lee A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Owen Callanan, Antonino Castelfranco, Catherine Crawford, Eoin Creedon, Scott Lekuch, Kay Muller, Mark Nutter, Hartmut Penner, Brian Purcell, Mark Purcell and Jimi Xenidis Rapid automotive bus system synthesis based on communication requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Matthias Heinz, Martin Hillenbrand, Kai Klindworth and Klaus D. Mller-Glaser An event-driven FIR lter: design and implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Taha Beyrouthy and Laurent Fesquet Session 3: Prototyping Radio Devices Applying Graphics Processor Acceleration in a Software Dened Radio Prototyping Environment. . . . . . . . . . . 67 William Plishker, George Zaki, Shuvra Bhattacharyya, Charles Clancy and John Kuykendall Validation of Channel Decoding ASIPs A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 Christian Brehm and Norbert Wehn Area and Throughput Optimized ASIP for Multi-Standard Turbo decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Rachid Alkhayat, Purushotham Murugappa, Amer Baghdadi and Michel Jezequel Design of an Autonomous Platform for Distributed Sensing-Actuating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Franois Philipp, Faizal A. Samman and Manfred Glesner Session 4: Virtual Prototyping for MPSoC A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 92 Tennessee Carmel-Veilleux, Jean-Franois Boland and Guy Bois i

Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator . . . . . . . . . . . . . . . . . . . . . . . 99 Joo Moreira, Felipe Klein, Alexandro Baldassin, Paulo Centoducatte, Rodolfo Azevedo and Sandro Rigo A non intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software . . . . . . . . 106 Damien Hedde and Frdric Ptrot Embedded Virtualization for the Next Generation of Cluster-based MPSoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Alexandra Aguiar, Felipe Gohring De Magalhaes and Fabiano Hessel Session 5: Model Based System Design Rapid Property Specication and Checking for Model-Based Formalisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Daniel Balasubramanian, Gabor Pap, Harmon Nine, Gabor Karsai, Michael Lowry, Corina Pasareanu and Tom Pressburger Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models . . . . . . . . . . . . . 128 Philipp Kutzer, Jens Gladigau, Christian Haubelt and Jrgen Teich An Automated Approach to SystemC/Simulink Co-Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Francisco Mendoza, Christian Koellner, Juergen Becker and Klaus D. Mller-Glaser Extension of Component-Based Models for Control and Monitoring of Embedded Systems at Runtime . . . . . . 142 Tobias Schwalb and Klaus D. Mller-Glaser A model-driven based framework for rapid parallel SoC FPGA prototyping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149 Mouna Baklouti, Manel Ammar, Philippe Marquet, Mohamed Abid and Jean-Luc Dekeyser A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures . . . . . 156 Sebastien Le Nours, Anthony Barreteau and Olivier Pasquier Session 6: Software for Embedded Devices Task Mapping on NoC-Based MPSoCs with Faulty Tiles: Evaluating the Energy Consumption and the Application Execution Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Alexandre Amory, Csar Marcon, Fernando Moraes and Marcelo Lubaszewski Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development . . . . . . . . . . . . . . . . . . . 171 Hui Chen, Guillaume Godet-Bar, Frdric Rousseau and Frdric Ptrot Session 7: Tools and Designs for Congurable Architectures Schedulers-Driven Approach for Dynamic Placement/Scheduling of multiple DAGs onto SoPCs . . . . . . . . . . . . 179 Ikbel Belaid, Fabrice Muller and Maher Benjemaa Generation of emulation platforms for NoC exploration on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 Junyan Tan, Virginie Fresse and Fdric Rousseau Arbitration and Routing Impact on NoC Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Edson Moreno, Cesar Marcon, Ney Calazans and Fernando Moraes On-Chip Efcient Round-Robin Scheduler for High-Speed Interconnection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Surapong Pongyupinpanich and Manfred Glesner

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

ii

Message from the General Chair


Welcome to Germany and welcome to Karlsruhe Institute of Technology KIT for the 22nd IEEE International Symposium on Rapid System Prototyping (RSP). RSP explores trends in Rapid Prototyping of Computer Based Systems. Its scope ranges from embedded system design, formal methods for the verication of systems, engineering methods, process and tool chains to case studies of actual software and hardware systems. It aims to bring together researchers from the hardware and software communities to share their experiences and to foster collaboration of new and innovative Science and Technology. The 22nd annual Symposium focus will encompass theoretical and practical methodologies, resolving technologies of specication, completeness, dynamics of change, technology insertion, complexity, integration, and time to market. RSP 2011 is a 4 day event starting with an industry driven tutorial program on safety critical systems and the according standard IEC 61508 and to go along with a visit to a car manufacturing plant. We will have three days of technical program starting each day with outstanding keynote speakers from industry and academia talking about RSP in Automotive, Aerospace and Robotics Applications. Time for technical discussions will be supplemented by social gathering with old and new friends. The success of the symposium is based on the effort of many volunteers. I wish to thank and express my appreciation for their hard and dedicated efforts of Program Chairs Fabiano Hessel and Frdric Ptrot who set up an excellent symposium program, the Tutorial Chair Michael Huebner for organizing the highly topical tutorial day, the Publicity Chair Jrme Hugues and the IEEE Liaison Chair Alfred Stevens from IEEE Reliability Society who is sponsoring RSP2011, the Organizing Chair Matthias Heinz and last but not least the Finance Chair Martin Hillenbrand. Special thanks to all of you who contributed a paper and will contribute to formal presentations and informal discussions. I hope that you will nd this years symposium interesting and rewarding, and that you will enjoy your time in Karlsruhe. Klaus D. Mueller-Glaser Karlsruhe Institute of Technology, Germany

iii

Message from the Program Chairs


We welcome you to the 22nd IEEE International Symposium on Rapid System Prototyping (RSP-2011) held in Karlsruhe, Germany. RSP is the rst conference to bring together people from the software and hardware communities coming from academia and industry for exchanging on RSP-related research topics from scientic and technical standpoints. For more than 20 years, the symposium has been attracting an outstanding mix of practitioners and researchers, and since its early days spans numerous disciplines, making it the one of its kind. Your participation in RSP will have signicant impact to understand the challenges behind rapid system prototyping and how the methods and tools you develop can lead to better systems for our everyday lives. It is the intent of this symposium to foster exchanges between professionals from industry and academia, from hardware design to software engineering, and to promote a dialog on the latest innovations in rapid system prototyping. The quality of the Technical Program is a key point to RSP. The strength of the Technical Program is due to the work of the Technical Program Committee, for soliciting colleagues for quality submissions and for the review work. We believe that the program of this year covers many exiting topics, and we hope that you will enjoy it. We extend our sincere appreciation to all those who contributed to make RSP 2011 a bubbling experience: the Authors, Speakers, Reviewers, Session Chairs, and volunteers. We extend a particular thank you to the technical Program Committee for their dedication to RSP and their excellent work in reviewing the submissions. We wish you a very productive and exciting conference.

Frdric Ptrot - TIMA, France Fabiano Hessel - PUCRS, Brazil

iv

Message from the Organizing Chairs


We are happy to welcome you at the 22nd IEEE International Symposium on Rapid System Prototyping in Karlsruhe. We hope you will enjoy the well-chosen program of remarkable keynotes, talks and topics as well as the technology region Karlsruhe. The city of Karlsruhe was founded in 1715 by Margrave Charles III William of Baden-Durlach. From the Karlsruhe Palace, which forms the center of the city, 32 streets are radiating out like the ribs of a fan. This is why Karlsruhe is also called the "fan city". On the right-hand side directly next to the Karlsruhe Palace, the campus of the University of Karlsruhe is located. In 2009, the University of Karlsruhe and the Research Center of Karlsruhe joined forces as the Karlsruhe Institute of Technology (KIT). This year, the RSP Symposium is hosted by the KIT. We thank our partners for their contributions and efforts to accomplish the symposium: the local partners and institutions at KIT for their support, the general chair and the program chairs for developing the scientic program of this symposium and the members of the program committee for their conscientious reviews. We are also grateful for the sponsorship of Institution of Electrical Engineers (IEEE) and the KIT in nancially and technically supporting the symposium. We hope you enjoy the conference program and additionally take some time to explore the city of Karlsruhe. Matthias Heinz (KIT), Martin Hillenbrand (KIT), Germany

Conference Committees
General Chair
K. Mueller-Glaser - KIT, Germany

Program Chairs
F. Ptrot - TIMA, France F. Hessel - PUCRS, Brazil

Tutorial Chair
M. Hbner - KIT, Germany

Publicity Chair
J. Hugues - ISAE, France

IEEE Liaison Chair


A. Stevens - IEEE Reliability Society, USA

Local Organization Chair


M. Heinz - KIT, Germany

Finance Chair
M. Hillenbrand - KIT, Germany

Technical Program Commitee Members


M. Aboulhamid Universite de Montral G. Alexiou CTI T. Antonakopoulos University of Patras P. Athanas Virginia Tech M. Auguston NPS A. Baghdadi TELECOM Bretagne J. Becker Karlsruhe Institute of Technology C. Bobda University of Arkansas G. Bois Ecole Polytechnique de Montreal D. Buchs CUI, University of Geneva R. Cheung UCLA R. Drechsler University of Bremen, Germany J. Drummond RSP PC M. Engels FMTC A. Frhlich Federal University of Santa Catarina M. Glesner TU Darmstadt M. Glockner BMW AG D. Hamilton Auburn University W. Hardt Chemnitz University of Technology M. Heinz Karlsruhe Institute of Technology J. Henkel Karlsruhe Institute of Technology F. Hessel PUCRS J. Hugues ISAE A. Jerraya CEA G. Karsai Vanderbilt University K. Kent University of New Brunswick F. Kordon Univ. P. & M. Curie R. Kress I. Krueger R. Lauwereins M. Lemoine P. Leong T. Le R. Ludewig G. Martin B. Michael K. Mueller-Glaser N. Navet G. Nicolescu V. Olive Y. Papaefstathiou C. Park L. Pautet C. Pereira R. Pettit D. Pnevmatikatos F. Ptrot J. Rice F. Rousseau M. Shing O. Sokolsky T. Taha E. Todt B. Zalila Inneon UCSD IMEC CERT The Chinese University of Hong Kong National University of Singapour IBM Germany Tensilica Naval Postgraduate School Karlsruhe Institute of Technology INRIA / RTaW Ecole Polytechnique de Montreal CEA-LETI Technical University of Crete SAMSUNG Electronics TELECOM ParisTech Univ. Federal do Rio Grande do Sul The Aerospace Corporation Tech. Univ. of Crete & FORTH-ICS TIMA Lab, Grenoble-INP University of Lethbridge TIMA - UJF Naval Postgraduate School University of Pennsylvania Clemson University Universidade Federal do Parana ReDCAD Laboratory, Univ. of Sfax

Additional Reviewers
Doucet, Fred Gligor, Marius Kuehne, Ulrich Li, Min Menarini, Massimiliano Noack, Joerg Eggersgl, Stephan Grosse, Daniel Le, Hoang Linard, Alban Muller, Ivan Philipp, Francois Thoma, Florian Friederich, Stephanie Heisswolf, Jan Lesecq, Suzanne Love, Andrew Muller, Olivier Samman, Faizal Arya Zha, Wenwei Gheorghe, Luiza Hostettler, Steve Li, Hua Marinissen, Erik Jan Naessens, Frederik Spies, Christopher

vi

Tutorial and Keynotes


Tutorial Program Tuesday May 24th : Safety Integrity Levels in FPGA based designs Gernot Klaes, Technical-Inspection Authority, Germany "Functional Safety according to IEC 61508 a short introduction " Giulio Corradi, Xilinx "FPGA and the IEC61508 an inside perspective" Romual Girardey, Endress+Hauser "Safety Aware Place and Route for On-Chip Redundancy in FPGA"

Keynote Speeches Wednesday May 25th : Prof. Dr.-Ing. Juergen Bortolazzi, Porsche AG, "RSP in Automotive" Thursday May 26th : Dr. Costa Pinto, Efacec, "FPGA in Aerospace Applications" Friday May 27th : Prof. Dr. Rdiger Dillmann, FZI, "Prototyping in Robotics"

vii

Session 1 Automotive & FPGA

An FPGA-Based Signal Processing System for a 77 GHz MEMS Tri-Mode Automotive Radar
Sundeep Lal, Roberto Muscedere, Sazzadur Chowdhury
Department of Electrical and Computer Engineering University of Windsor Windsor, Ontario, N9B 3P4 Canada
Abstract An FPGA implemented signal processing algorithm to determine the range and velocity of targets using a MEMS based tri-mode 77 GHz FMCW automotive radar has been presented. In the developed system, A Xilinx Virtex 5 FPGA based signal processing and control algorithm dynamically reconfigures a MEMS based FMCW radar to provide a short-range, mid-range, and long range coverage using the same hardware. The MEMS radar incorporates two MEMS SP3T RF switches, two microfabricated Rotman lenses, and two reconfigurable microstrip antenna arrays embedded with MEMS SPST switches in addition to other microelectronic components. By sequencing the FMCW signal through the 3 beamport of the Rotman lens, the radar beam can be steered 4 degrees in a combined cycle time of 62 ms for all the 3 modes. A worst case range accuracy of 0.21m and velocity accuracy of 0.83 m/s has been achieved which is better than the state-of-the art Bosch LRR3 radar sensor. Keywords-component; formatting; style; styling; insert (key words)

percent annually, with demands reaching 3 million units in 2011 with 2.3 million of them using radar sensors [2]. The strategic Automotive Radar frequency Allocation (SARA) consortium specified that a combined SRR and LRR platform in the 77-79 GHz range will enable to reduce size and improve performance of automotive radars [3-4]. In [3] it has been identified that in the long term, 77 GHz will become the only reasonable technology platform to serve both short and long range radars. In [3, 5] it has been determined that frequency modulated continuous wave (FMCW) radar with an analog or digital beamforming capability with a low cost SiGe based radar front end is the technology of choice for forward collision warning applications. Though GaAs or SiGe based MMICs are being pursued vigorously to minimize the cost and size while improving the performance of automotive radars [3, 6], the auto industry is eyeing on to exploit the small cost, batch fabrication capability of the MEMS technology to realize more sophisticated radar systems [2]. The project goal of a European consortium SARFA has been set to utilize RF MEMS as an enabling technology for performance improvement and cost reduction of automotive radar front ends operating at 76-81 GHz. [7]. In [5], a MEMS based long range radar comprised of a microfabricated Rotman lens and MEMS SP3T switches and a microstrip antenna array has been presented [8]. The DistronicPlus system uses one long range and two short range radars in the front, two other short range radar in the front for park assist, and two other short range radars in the rear to provide an effective collision avoidance system. Due to the individual short and long range units, the price tag of the DistronicPlus system is pretty high. Almost all the commercially available automotive radars use microelectronic based ASICs. However, FPGAs are becoming increasingly popular in the development phase for rapid prototyping as opposed to DSP based solutions. Relative advantages of an FPGA based system over the DSP based ones for automotive radars are discussed in [9-11] where it has been determined that FPGAs can offer superior performance in terms of footprint area, high throughput, more time-andresource efficient implementations, high speed parallel processing, digital data interfacing, ADC and DAC handling, and clock management in a relatively low-cost platform. For example, An FPGA can generate a precision Vtune signal to

I.

INTRODUCTION

The global auto industries are extensively pursuing radar based proximity detection systems for applications including adaptive cruise control, collision avoidance, and pre-crash warning to avoid or mitigate collision damage. In [1], it has been identified by analyzing actual crash records from the 2004-08 files of the National Automotive Sampling System General Estimates System (NASS GES) and the Fatality Analysis Reporting System (FARS) that a forward collision warning/mitigation system comprised of radar sensors has the greatest potential to prevent or mitigate up to 1.2 million crashes, up to 66,000 nonfatal serious and moderate injury crashes, and 879 fatal crashes per year. Ironically, the IIHS study found that the forward collision warning crash avoidance features that could prevent or mitigate this much fatal and nonfatal injury related crashes were available only to just a handful of luxury vehicle models due to the high cost of the currently available forward collision warning technology. Thus a low cost radar technology for forward collision warning system that can be made available to all the on-road vehicles would be able to prevent/mitigate up to 1.2 million crashes per year. Market research firm Strategy Analytics predicts that over the period 2006 to 2011, the use of long-range distance warning systems in cars could increase by more than 65
This research has been supported by NSERC Canada, Ontario Centres of Excellence (OCE), and Auto21 Canada.

978-1-4577-0660-8/11/$26.00 2011 IEEE

control the VCO linearity which is a critical issue to avoid false targets. Investigation shows that instead of a passive antenna system, a MEMS based reconfigurable microstrip antenna in conjunction with MEMS SP3T switches and a microfabricated Rotman lens can be used to realize a compact tri-mode radar that can provide all the short, mid and long range functionality in a small form-factor single unit. An FPGA based control unit will control the operation of an array of MEMS SPST RF switches embedded in the reconfigurable antenna array to dynamically alter the antenna beamwidth to switch the radar from short to mid to long range using a predetermined time constant. This will reduce the price tag significantly by multiplexing the functionality of three different range radars in the same hardware. Additionally, the passive microfabricated Rotman lens will eliminate microelectronics based analog or digital beamforming components as used in commercially available automotive radars. Consequently, the overall system will become less complex, faster, lower cost, and more reliable. Due to the faster signal processing and digital data interfacing capability, an FPGA like Xilinx Virtex 5 can offer a very robust control of VCO linearity and a faster refresh rate for range and velocity data. In addition to the conventional signal processing tasks, control algorithms for the MEMS SP3T and SPST RF switches can also be embedded in the same FPGA. In this context, this paper presents the development, implementation, and validation of Xilinx Virtex 5 FPGA based control and signal processing algorithms for the developed MEMS tri-mode radar sensor. The algorithm is able to determine the target range and velocity with a very high degree of precision in a cycle time that is shorter than the state-of-the art 3rd generation long range radar (LRR3) from Bosch [12]. II. MEMS TRI-MODE RADAR OPERATING PRINCIPLE

Figure 1. MEMS tri-mode radar block diagram.

Figure 2. Reconfigurable microstrip antenna array.

A simplified architecture of the MEMS tri-mode radar is shown in Fig. 1. The radar operating principle is as follows: (i) An FPGA implemented control circuit generates a triangular signal (Vtune) to modulate a voltage controlled oscillator (VCO) to generate a linear frequency modulated continuous wave (FMCW) signal centered at 77 GHz. (ii) The FMCW signal is fed to a MEMS SP3T switch. (iii) An FPGA implemented control algorithm controls the SP3T switch to sequentially switch the FMCW signal among the three beam ports of a microfabricated Rotman lens. (iv) As the FMCW signal arrives at the array ports of the Rotman lens after traveling through the Rotman lens cavity, the time-delayed in-phase signals are fed to a reconfigurable microstrip antenna array. (v) The reconfigurable antenna has MEMS SPST switches embedded in each of the linear sections as shown in Fig. 2. The scan area of a conventional microstrip antenna array depends on the antenna beamwidth that in turn depends on the number of microstrip patches. The higher the number of patches, the narrower will be the beam. It has been determined that for a short range radar, a beamwidth of 80 degrees is necessary to scan an area up to 30 meter in front of the vehicle as shown in Fig.3. For mid range, a beam width of 20 degrees is necessary to cover an area between 30-80 meters ahead of the vehicle and in the LRR mode, a beam width of 9 degrees is necessary to

Figure 3. SRR, MRR, and LRR coverage with beam width.

scan an area 80-200 meters ahead of the vehicle. Following Fig. 2, when both the SPST switches SW1 and SW2 are in OFF position, 4 microstrip patches per linear array provide short range coverage. When the switch SW 1 is turned ON and SW 2 is OFF, 8 microstrip patches per linear array provide mid-range range coverage. Finally, when both the switches SW 1 and SW 2 are ON, 12 microstrip patches per linear array provide long range coverage. An FPGA implemented control module controls the operation of the switches SW1 and SW2. (vi) The sequential switching of the input signal among the beamports of the Rotman lens enables the beam to be steered across the target area in steps by a pre-specific angle as shown in Fig. 4. (vii) On the receiving side, a receiver antenna array receives the signal reflected off a vehicle or an obstacle and feeds the signal to another SP3T switch through another Rotman lens. (viii) An FPGA based control circuit controls the operation of the receiver SP3T switch in tandem with the transmit SP3T switch so that the signal output at a specific beamport of the receiver Rotman lens can be mixed with the corresponding

TABLE I.

SPEED COMPARISON - FPGA VS. A DUAL CORE P AND DSP Clock Freq. (MHz) 3000 600 600 200 2048-point FFT Latency (s) 37.55 32.40 34.20 39.60 No. of Clock Cycles 112650 19440 20520 7920

Part Name Intel 32-bit Core 2 Duo Analog Devices ADSPBF53x Texas Instruments TMS320C67xx Xilinx Virtex-5 FFT Core Figure 4. Beam steering by the Rotman lens.

transmit signal. (ix) The output of the receiver SP3T switch is passed through a mixer to generate an IF signal in the range of 0-200 KHz. (10) An analog-to-digital converter (ADC) samples the received IF signal and converts it to a digital signal. (x). Finally, an FPGA implemented algorithm processes the digital signal from the ADC to determine the range and velocity of the detected target. In this way a wider near-field area and a narrow far field area can be progressively scanned with a minimum hardware. III. RADAR SIGNAL PROCESSING AND SWITCH CONTROL

A. Choice of Development Platform FPGA vs. DSP Older radar systems relied on various analog components. However, Digital Signal Processors (DSP) and FieldProgrammable Gate Arrays (FPGA) are increasingly becoming popular as both offer attractive features. Off-the-shelf signal processing blocks are available from both DSP and FPGA manufacturers as IP (Intellectual Property) cores for rapid prototyping. Key considerations to choose one over the other depend on internal architecture, speed, development time, complexity, and system requirements. DSPs follow a pipelined architecture, and even a dual core DSP requires resource sharing. This limits the overall data throughput in DSPs, and data capture channels depend on the number of memory modules and interrupts. On the other hand, FPGAs are fully customizable, and modules do not have to share resources like I/O ports and memory. DSPs spend most processing time and power in moving instructions and variables in and out of shared memories, which FPGAs inherently avoid. This makes FPGAs more suitable for parallel processing and optimized resource handling, especially in low-memory applications. The capability of FPGAs are continuously increasing due to the development of fast multipliers, accumulators, and RAM units, etc. With parallel processing capabilities, FPGAs have a higher throughput than DSPs while using a slower clock. Table I shows benchmark results comparing a Xilinx FPGA with various processors for a 2048-point FFT and Table II compares DSPs to FPGAs in terms of various development criteria that affect their development time, reliability and applicability. From the Table I, it is clear that Virtex-5 SX50T is welladapted for signal processing applications and computes the same FFT in a lesser number of clock cycles compared to typical DSPs by using parallel resources. Based on the Tables I and II, the Xilinx Virtex 5 SX50T FPGA has been selected for rapid system prototyping for the target MEMS tri-mode radar.

B. LFMCW Sweep Generation and Switch Control To keep the hardware requirements to a minimum for the Virtex 5 FPGA, FMCW sweep durations of 1.0 ms, 3 ms, and 6 ms have been selected for the LRR, MRR, and SRR modes, respectively. A 10-bit counter in the FPGA is used to generate a digital sweep signal which is fed to a 10 bit DAC to generate the VTune signal for the successive modes as shown in Fig. 5. The sweep generation timing diagram as shown in Fig. 5 is for a clock frequency of 100 MHz on a Virtex-5 FPGA. A sweep bandwidth of 800 MHz for LRR, 1.4 GHz for MRR, and 2 GHz for SRR has been selected to provide sufficient samples. Fig. 5 also shows the required DAC output and tuning voltage for a TLC Precision 77 GHz GaAs VCO (MINT77TR). The MEMS SP3T switches (SP3T-T and SP3T-R in Fig. 1) are operated by 18 V charge pumps which are activated at the end of the down sweep of the SRR mode.
TABLE II. Criteria Sampling rates Data rate Memory management Data capture Group development DEVELOPMENT CRITERIA COMPARISON DSP FPGA

Low (interrupts and I/O are High (parallel data capture shared) and processing) Better below 30MB/s Unpredictable optimized arrangement can lead to failures by mishandling pointers Uses interrupts with varying orders of precedence can lead to conflicts Developers do not have a clear idea of resource availability and usage by others Can handle faster data rates Fully customizable memory arrangement and read/write ports Independent data capture with dedicated input/output ports and memory blocks Developers can work independently without resource availability concerns

Figure 5. VTune timing diagram.

C. Signal Processing Algorithm Following the theory of FMCW radars, the range R and relative velocity VR of a detected target can be calculated from:

R=

( f up + f down ) 2

c 2k

(1)

VR =

( f up f down ) 4

c f0

(2)

where f up , represents the up sweep beat frequency, f down represents the down-sweep beat frequency, c is the speed of the electromagnetic wave in the medium, k = B / T is the bandwidth / sweep duration and f 0 represents the center frequency as shown in Fig. 6. Fig. 7 presents the developed radar signal processing algorithm. In the system, the transmitted and received signals are mixed to generate a beat signal which is then passed through a low pass filter (LPF) to remove the noise. The filtered signal is then converted to digital format using an ADC and a hamming window is applied to limit the frequency content. Afterwards, a Fast Fourier Transform (FFT) is applied to the time-domain samples. The normalized peak intensity for all the FFT samples is computed and processed by a Cell Averaging Constant False Alarm Rate (CA-CFAR) processor architecture as in [11]. Upon processing of both up and down sweep for a beam port, peak pairing is done to compute preliminary values for target range and velocity. Peak pairing is responsible for matching the detected peak in the up sweep spectrum to a detected peak in the down sweep spectrum as belonging to the same target.
D. Implementation Methodology The critical design considerations to implement the developed algorithm include: (a) Interconnection between the processing modules, (b) Data interdependency between the units, (c) Control interdependency between the units, (d) Coherency in data format, and (e) Synchronization of up and down sweeps and timely switching of the SP3T switches.

Figure 7. Signal processing algorithm.

TABLE III. Processing Unit DAC ADC Window Type/Length FFT Type FFT Length CFAR Type CFAR Parameters Peak Pairing Criteria

SIGNAL PROCESSING UNITS Details 10-bit, 1.2 MHz 11-bit, 2.2 MHz Hamming/2048 Mixed Radix-2/4 DIT 2048 Cell Averaging M = 8, GB = 2, Pfa = 10-6 * - Power Comparison, - Spectral Proximity

* M = depth of cell averaging, GB = no. of guard bands, Pfa = probability of false alarm.

The following stepwise methodology has been adopted to address the mentioned considerations to achieve a high degree of accuracy in the FPGA based signal processing scheme: (a) Development and simulation of the developed signal processing algorithm in Matlab environment for a single target, (b) Verification of the algorithm implemented in Matlab for a single target and then extending the Matlab codes for a single Rotman lens beam with 7 random targets, (c) HDL coding and testing of individual modules of the verified algorithm, (d) Identification of hardware resource sharing, timing, and area optimization, (e) Assembly of modules to form overall radar signal processing system in HDL, (f) Validation using a 7target scenario by comparing the results obtained from Matlab codes. IV. HARDWARE IMPLEMENTATION

A. HDL Modules Fig. 8 shows the top-level black box view of the FPGA implemented signal processing system and Fig. 9 presents the developed HDL building blocks for FPGA implementation. The system has been developed using Verilog HDL 2005 (IEEE 1364-2005). B. Fixed-Point Considerations Fixed-point implications arise at 4 stages of the developed HDL system: (i) ADC the quantization noise added by

Figure 6. FMCW chirp signal and beat frequency.

sampling is unavoidable. To minimize the quantization error, an 11-bit ADC has been employed and a maximum error of 0.125% is induced due to quantization. (ii) Windowing A 2048-point Hamming window is stored inside an on-chip ROM. The window coefficients are stored in 10-bit resolution. This results in 0.084% error in computation. The use of such digitized window functions are validated in [13]. (iii) FFT The accuracy of the FFT that depends on input resolution and phase coefficient/twiddle factor affects the precision of the entire signal processing algorithm and deduced target information. With a fixed resolution of 12 bits, different phase coefficient resolutions were tested. Highly accurate results were obtained with 16-bit resolution. (iv) Peak Pairing As the implementation of the ADC and the window function as mentioned above in an FPGA environment involves several multiplications/divisions, to avoid computational delay and resource overhead, a simplified approach is used. It is known that actual frequency is the product of frequency resolution of FFT and bin number of detected target peak by the CFAR processor. For the developed system, the FFT frequency resolution defined as Sampling frequency / FFT Size equals 976.5625 Hz/bin. Also, factor k in (1) can be calculated from the bandwidth B (800MHz) and sweep duration T (1 ms) as 8 x 1011 for the LRR mode. Using these information, (1) can be simplified to: R = ( f up_bin + f down_bin ) 0.09290625

Figure 9. HDL building block modules.

processing bottlenecks in the algorithm and use of multiple parallel units to resolve the same, (vi) Potential race conditions and data consistency issues in shared resources. (vii) Memory sharing, (viii) Clocked/combinational logic synchronization, (ix) Data word length and fixed-point format (position of decimal point) considerations, and (x) Handling of signed and unsigned data. A bottleneck was identified in the Power spectral density calculation module (PSD unit) where a square root operation caused significant delay. This was resolved by using 4 PSD units in parallel as shown in Fig. 10. Since the time-domain data RAM and frequency-domain data RAM are implemented separately, this allows for simultaneous sampling of the next frequency sweep while processing samples from the previous sweep. This also optimizes the HDL design for time. Upon testing the Xilinx FFT it has been observed that the first half of the output has more noise and higher DC components at lower frequencies as compared to the latter half. Accordingly, only the latter half of the FFT output is utilized. This inherently saves memory and improves timing by a factor of 2. The CFAR unit is designed to process 32 values at a time and is fully synchronized with the FDR and PSD units (Fig. 8) thus avoiding storage of the entire 1024 frequency samples.
D. Resource Usage The design was synthesized for Xilinx Spartan-3A DSP Edition and Virtex-5 SX50T. The resource usage for the HDL implementation of the developed signal processing system is listed in Table IV.

(3)

where, f up(down) = f up(down)_b in 976 .5625 and f up(down)_b in is the FFT bin number for up and down peaks of a valid target detected by CFAR. Similarly, (2) can be simplified as
VR = ( f up_bin f down_bin ) 3.398 km/h

(4)

The constants 0.09290625 in (3) and 3.398 in (4) have been approximated as 0.0927734375 (11-bit binary number) and 3.40625 (7-bit binary number). This reduces multiplier resolution and is suitable for fast computation using embedded FPGA multipliers e.g. in Xilinx DSP48E slices. This creates a maximum error of 0.193% in range and velocity computation. Similar calculations are done for the MRR and SRR modes.
C. HDL Optimization Critical optimization considerations include: (i) Interfacing modules, (ii) Synchronization of modules in the overall system, (iii) Fixed-point truncation/rounding errors, (iv) Potential overflow identification, (v) Identification of data flow or

Figure 8. Top-level module.

Figure 10. Parallel PSD units.

TABLE IV. Resource Slice registers Slice LUTs DSP48 Slices LUT-FF pairs FPGA fabric area

RESOURCE USAGE ON DIFFERENT FPGAS Spartan-3A 46% 96% 30% 11% 47% Virtex-5 4% 23% 6% 9% 21%

E. Processing Latency Table V shows the respective number of clock cycles required by each module in the HDL implementation in Fig. 9 for the LRR mode and the total number of clock cycles consumed for processing both up and down frequency sweeps and producing the final target information. Implementation on Xilinx Spartan-3A has a safe maximum operating frequency of 50 MHz and 160 MHz on Xilinx Virtex-5.

Figure 11. Highway test scenario.

The system has been simulated for 50 MHz on the Spartan3A and 100 MHz on the Virtex-5, as presented in Table V. One beam corresponds to both up and down frequency sweeps on the same beam port of the Rotman lens.
TABLE V. Operation PROCESSING LATENCY ON DIFFERENT FPGAS FOR LRR Clock Cycles / Beam 204756 2072 Latency at 50 MHz Spartan-3A (ms) 2.04756 0.04144 Latency at 100 MHz Virtex-5 (ms) 2.04756 0.02072 0.03960 0.10743 0.04388 0.21163 2.25928

can determine the range with a maximum error of 0.28 m when compared with actual values and the maximum error in relative velocity is 3 km/h (0.83 m/s) only. In both cases, the HDL generates more accurate results as compared to Matlab determined values. Comparative accuracy and FGPA parameters for all the three modes are listed in Table VIII. From the Table VIII it is clear that the SRR offers highest range and velocity accuracy. Table IX compares the range and velocity accuracy of all the 3 modes from the implemented HDL version with the state-ofthe-art Bosch LRR3. From Table IX, it is clear that the new radar can determine the range and velocity with almost same accuracy as Bosch LRR3 while covering 3 ranges and the complete cycle time of all the three modes is 62 ms as compared to 80 ms for the Bosch LRR3. The marginal
TABLE VI. LRR RANGE ACCURACY COMPARISON: MATLAB-HDL HDL value (m) 9.00 24.00 29.00 55.00 78.00 106.00 147.75 MatlabActual (m) 0.38 0.34 0.27 0.37 0.32 0.28 0.37 HDLActual (m) 0.00 0.00 0.00 0.00 0.00 0.00 0.28 MatlabHDL (m) 0.38 0.34 0.27 0.37 0.32 0.28 0.62

Sweep Sampling Window and feed to FFT FFT 1 3960 0.07920 PSD (4 parallel 10743 0.21486 units ) CFAR 4388 0.08776 Total Processing 21163 0.42326 Overall 225928 2.47082 1 This FFT delay is for real-valued input data.

V.

SIMULATION AND VALIDATION

A randomly generated 3-lane highway scenario as shown in Fig. 11 has been used to test the validity of the HDL codes. The scenario includes 7 arbitrary targets covered by a composite 17 wide scanning beam formed by the 3 beamports of the Rotman lens as shown in Fig. 1 [5]. Table VI compares the range accuracy of Matlab and HDL implemented versions with actual values for the LRR mode only. The simulation has been carried out over 6 iterations of time-domain MATLABgenerated samples with the algorithm running on a Xilinx Virtex-5 ML506 development board. During simulation, following physical conditions are assumed: Light-medium rain producing RF attenuation of 0.8 dB/km [14]. 2. Negligible attenuation and reflection from radome with less than 0.05 mm thick water deposition [15]. 3. Simulated targets can be entirely described by Swerling I, III and V (or 0) models [16] and clutter sources are spectrally stationary. 4. Best case SNR of 4.73 dB. Table VII shows a similar comparison for the velocity accuracy for the same targets for the LRR mode. From the Tables VI and VII, it appears that the HDL version of the developed algorithm 1.

Actual Matlab Target Distance value ID from Host (m) (m) 1 9.00 9.38 2 3 4 5 6 7 24.00 29.00 55.00 78.00 106.00 148.00 24.34 29.27 55.37 78.32 106.28 148.37

TABLE VII. Targ et ID 1 2 3 4 5 6 7

LRR VELOCITY ACCURACY COMPARISON: MATLAB-HDL Matlab value (km/h) 123.85 52.31 89.78 100.00 69.34 79.56 21.64 HDL value (km/h) 123.5 53.5 87.5 100.0 70.5 83.0 22.0 Matlab Actual (km/h) 0.85 2.69 0.78 0.00 0.66 0.44 0.36 HDLActual (km/h) 0.5 1.5 1.5 0.0 0.5 3.0 0.0 Matlab -HDL (km/h) 0.35 1.19 2.28 0.00 1.16 3.44 0.36

Actual Velocity relative to Host (km/h) 123 55 89 100 70 80 22

TABLE VIII.

TRI-MODE RADAR DESIGN SPECIFICATIONS SRR 0-30 m 300 km/h 6 ms 2000 MHz 200 KSPS 1024 points 196 Hz/bin 0.28 m 0.10 m 0.14 m/s 106 s MRR 30-100 m 300 km/h 3 ms 1400 MHz 700 KSPS 2048 points 342 Hz/bin 0.29 m 0.14 m 0.42 m/s 212 s LRR 100-200 m 300 km/h 1 ms 800 MHz 2000 KSPS 2048 points 977 Hz/bin 0.34 0.28 0.83 m/s 212 s

Criteria Range coverage Relative velocity coverage Up or Down sweep duration Sweep bandwidth Required sampling rate FFT size FFT frequency resolution Range accuracy with worst case VCO linearity of 25% Range accuracy with practical VCO linearity of 1% Velocity accuracy Processing delay for beam port

a highly reliable low cost small form factor radar sensor that can enable even the lower-end vehicles to be equipped with a collision avoidance system. All the MEMS components have been fabricated and the assembly and packaging of a prototype device is in progress. ACKNOWLEDGMENT The authors would like to greatly acknowledge the additional support provided by the Canadian Microelectronics Corporation (CMC Microsystems), and Evigia systems Inc., Ann Arbor, MI. REFERENCES
[1] J. S. Jermakian , Crash Avoidance Potential of Four Passenger Vehicle Technologies, Insurance Institute of Highway safety, April, 2010, http://www.iihs.org/research/topics/pdf/r1130.pdf H. Arnold, Infineon: Automotive radar is aimed at mid-range cars, available. [Online]., http://www.electronics-eetimes.com/en/infineonauto motive-radar-is-aimed-at-mid-range-cars? cmp_id= 7& news_id=202803354 R. Lachner, Development Status of Next generation Automotive Radar in EU, ITS Forum 2009, Tokyo, 2009, [Online]. Available. http://www.itsforum.gr.jp/Public/J3Schedule/ P22/ lachner090226.pdf G. Rollmann, Frequency Regulations for Automotive Radar, SARA, presented at Industrial Wireless Consortium (IWPC), Dsseldorf, Germany, 2009. R. Schneider, H. Blcher, K. Strohm, KOKON Automotive High Frequency Technology at 77/79 GHz, Proceedings of the 4th European Radar Conference, 2007, Munich, Germany, pp. 247-250. R. Stevenson, SiGe threatens to weaken GaAs grip on automotive radar, 2009, Compound Semiconductor, [Online]. Available. http://www.compoundsemiconductor.net J. Oberhammer, RF MEMS Steerable Antennas for Automotive Radar and Future Wireless Applications (acronym SARFA), NORDITE, the Scandinavian ICT Research Programme, [Online]. Available. http://www.sarfa.ee.kth.se A. Sinjari, S. Chowdhury, MEMS Automotive Collision Avoidance Radar Beamformer, in Proc. IEEE ISCAS2008, Seattle, WA, 2008, pp. 2086-2089. D. Kok, J. S. Fu, Signal Processing for Automotive Radar, in IEEE Radar Conf.(EURAD2005), Arlington, VA, June 2005, pp. 842-846. J. Saad, A. Baghdadi, FPGA-based Radar Signal Processing for Automotive Driver Assistance System, in IEEE/IFIP Intl. Symp. Rapid System Prototyping, Fairfax, VA, 2009, pp. 196-199. T. R. Saed, J. K. Ali, Z. T. Yassen, An FPGA Based Implementation of CA-CFAR Processor, Asian Journal of Information Technology, Vol. 6, No. 4, pp. 511-514, 2007. Robert Bosch, LRR3: 3rd generation Long-Range Radar Sensor, [Online]. Available. http://www.bosch-automo tivetechnology.com /media/en/pdf/fahrsich erheits systeme_2/lrr3_datenblatt_de_2009.pdf G. Hampson, Implementation Results of a Windowed FFT, Sys. Eng. Div., Ohio State Univ., Columbus, OH, July 12, 2002. [Online]. Available: http://esl.eng.ohio-state.edu/~rstheory/iip/window.pdf P. W. Gorham, RF Atmosphere Absorption/Ducting, Antarctic Impulsive Transient Antenna Project (ANITA), Univ. Hawaii (Manoa), April 21, 2003. A. Arage, G. Kuehnle, R. Jakoby, Measurement of Wet Antenna Effects on Millimeter Wave Propagation, in Proc. IEEE RADAR Conf., New York City, NY, 2006, pp. 190-194. P. Swerling, Probability of Detection for Fluctuating Targets, The RAND Corp., Santa Monica, CA, Mar. 17, 1954.

[2]

deviation from Bosch LRR3 for MRR and LRR modes is offset by the fact that the combined cycle time for the 3 modes (62 ms) is smaller than the cycle time of Bosch LRR3 (80 ms).
TABLE IX. TRI-MODE RADAR ACCURACY COMP. WITH BOSCH LRR3 Bosch LRR3 0.5-250 -100 to +200 0.10 0.12 N/A 80 ms MEMS Tri-Mode (Xilinx Virtex 5 HDL Simulation) SRR 0.4-30 300 0.10 0.14 106s MRR 30-100 300 0.14 0.42 212s LRR 100-200 300 0.28 0.83 212s

[3]

[4]

Parameter Range (m) Velocity (km/h) Range accuracy (m) Velocity accuracy (m/s) Processing latency Cycle time

[5]

[6]

[7]

62 ms for three modes combined

[8]

VI.

CONCLUSIONS

[9] [10]

A signal processing algorithm has been developed and tested in a Xilinx Virtex 5 SX50T FPGA for a MEMS based tri-mode (short, mid, and long range) 77 GHz FMCW automotive radar to determine the range and velocity of multiple targets. The MEMS based radar uses a microfabricated Rotman lens and MEMS SP3T RF switches in conjunction with a reconfigurable MEMS SPST switch embedded microstrip antenna array. The proposed signal processing hardware implementation for the MEMS radar makes use of independent modules thus eliminating the latency posed by the MicroBlaze softcore used in [11]. The algorithm can determine the target range with a maximum error of 0.28 m for LRR and 0.10 for SRR. The maximum velocity error has been determined as 0.83 m/s for LRR and 0.14 m/s for SRR. Further investigation shows that by increasing the sweep duration to 6 ms, the maximum velocity error for LRR can be reduced to 0.14 m/s following (1). These accuracies of the FPGA determined range and velocity results meet the auto industry set specifications and are in par with the Bosch/Infineon LRR3 radar. The developed system allows for

[11]

[12]

[13]

[14]

[15]

[16]

FPGA based Real-Time Object Detection Approach with Validation of Precision and Performance
Alexander Bochem, Kenneth B. Kent
Faculty of Computer Science University of New Brunswsick Fredericton, Canada alexander.bochem@unb.ca, ken@unb.ca

Rainer Herpers
Department of Computer Science University of Applied Sciences Bonn-Rhein-Sieg Sankt Augustin, Germany rainer.herpers@h-brs.de

AbstractThis paper presents the implementation and evaluation of a computer vision problem on a Field Programmable Gate Array (FPGA). This work is based upon previous work where the feasibility of application specic image processing algorithms on a FPGA platform have been evaluated by experimental approaches. This work coveres the development of a BLOB detection system on an Altera Development and Education II (DE2) board with a Cyclone II FPGA in Verilog. It detects binary spatially extended objects in image material and computes their center points. Bounding Box and Center-of-Mass have been applied for estimating center points of the BLOBs. The results are transmitted via a serial interface to the PC for validation of their ground truth and further processing. The evaluation compares precision and performance gains dependent on the applied computation methods.

keywords: FPGA; BLOB Detection; Image Processing; Bounding Box; Center-of-Mass; Verilog I. I NTRODUCTION

The usability of interfaces in software is a major criteria for acceptance by the user. A major issue in computer science is the improvement of information representation in interfaces and nding alternatives for user interaction. Simplifying the way a user operates with a computer helps in optimizing the usability and increasing the benet of digitization. A common virtual reality (VR) system uses standard-input devices such as keyboards and mice, or multimodal devices such as omni directional treadmills and wired gloves. The issue with these active input devices is the lack of information about the user and his point of interest. A computer mouse, for example, only gives information about
978-1-4577-0660-8/11/$26.00 c 2011 IEEE

its relative position on the desk in relation to its position on the computer monitor. A wired glove represents the user in the coordinate system of the virtual world, but it gives no information about the absolute position in the real world. The common way to display a virtual environment are computer monitors, customized stereoscopic displays, also known as Head Mounted Displays (HMD), and the Cave Automatic Virtual Environment (CAVE). In general computer screens are relatively small compared to the virtual environment they display. The frame of a computer monitor causes a signicant disturbance in the perception of the VR by the user. HMDs on the other hand allow a complete immersion of the virtual reality for the user. The downside is that HMDs are heavy and therefore not recommendable to wear for a longer time. The CAVE concept applies projection technology on multiple large screens, which are arranged in the form of a big cubicle [1]. The size of the cubicle usually ts several people. Based on the CAVE concept the Bonn-Rhein-Sieg University of Applied Sciences has invented a low cost mobile platform for immersive visualisations, called Immersion Square [2]. It consists of three back projection walls using standard PC hardware for the image processing task. For immersive environments the position and orientation of the user in relation to the projection surface would allow alternative ways of user interaction. Based on this information, it would be possible to manipulate the ambiance without having the user actively use an input device. At the BonnRhein-Sieg University for Applied Sciences this

problem has been addressed in a project named 6 Degree-of-Freedom Multi-User Interaction-Device for 3D-Projection Environments (MI6) by the Computer Vision research group. The projects intent is to estimate the position and orientation of the user to improve the interaction of the user with the immersive environment. Therefore a light-emitting device is developed, that creates a unique pattern of light dots. This device will be carried by the user and pointed towards the area on the projection screens of the immersive environment where the user is looking. By detecting the light dots on the projection screens, it is possible to estimate the position and orientation of the user in the cubicle, as shown in [3], [4]. One problem is that the detection process of the BLOBs and the estimation of their center points requires fast image processing. It is intended to realize an immediate feedback for the user for any change of direction and position in the immersive environment. Another problem is the required precision for the estimation of the BLOBs center points. A mismatch of the BLOBs center point by a few pixels will cause an error for the estimated position of the user. With those problems in mind the intent is to develop a BLOB detection system which serves both high performance and high accuracy. An FPGA is a grid of freely programmable logic blocks which can be combined through interconnection wires. This allows the hardware designer to have a exible and also application specic hardware design. While FPGAs have been used for prototyping only in the beginning, the revolution of personal computers helped this eld become attractive for consumer and industry products as well. The decreasing sizes for chip circuit elements allowed manufacturers to create FPGAs, which are large enough to t even complete processor designs. Those advantages have been perceived in the computer vision area [5] and used in various research projects [6], [7], [8]. The FPGA architecture allows one to create a hardware design that can process data in parallel like a multi-core architecture on a GPU. This again is well suited for computer vision problems, as have been addressed in [9]. It has to be kept in mind that, although ASICs and FPGAs have the design of hardware in common, the circuit space and power consumption of an FPGA is still several times higher than that of an ASIC. The remainder of this paper is organized as follows. Section 2 gives a brief overview of the

applied methods and system design. Section 3 shows how the implementation of the system is completed. In Section 4 the results of the system are veried and validated. Section 6 nally concludes the paper. II. S YSTEM D ESIGN

Fig. 1: Schematic design of the system architecture.

The target platform of the realized system design is an DE2 board from Altera [10]. Figure 1 shows the schematic design of the BLOB detection system. The image material can be acquired on different input devices. The analog video input is used for precision evaluation with the recorded image material. The CCD camera has been applied for performance evaluation with real-time data. Both input devices have individual pre-processing modules and provide the image material in RGB format with a resolution of 640x480 pixels. The AD-converter of the analog-video input performs with 30 frames per second (fps). The CCD camera can provide up to 70 fps for the given image resolution if the full ve Megapixels of the camera sensor is used [11]. A. CCD Camera The CCD camera D5M from Terasic has been used [11] for the performance evaluation of the BLOB detection approach. This camera sensor is arranged in a Bayer pattern format. The read-out

10

Fig. 2: Four and eight pixel neighbourhood.

speed depends on several features such as resolution, exposure time and the activation of skipping or binning. Skipping and binning is used to combine lines and columns of the CCD sensors. This functionality allows the design to read out a larger area of the sensor in the form of a smaller image resolution. The consecutive lines in the sensor are merged and handled as a single image line. The controller on the D5M camera can read out one pixel per clock pulse and processes each sensor line in sequential order. B. BLOB Detection In computer vision a binary large object is considered as a set of pixels, which describe a common attribute and are adjacent. This common attribute in most cases is dened by color or brightness of the pixel. The representation of the pixel depends on the applied color model in the image material. For the detection of the BLOBs the rst problem to be solved is the identication of relevant pixels. In this work a pixel is considered to be relevant if its brightness value exceeds a specied threshold value. This threshold value can be a static parameter or a computed average value, based upon the pixel values of previous frames. The adjacency condition is the second important step in BLOB detection. The two most common denitions for adjacency are known as four pixel neighbourhood and an eight pixel neighbourhood. Figure 2 is showing the two ways of labelling pixels to describe adjacency. On the left image the four pixel neighbourhood is applied and four BLOBs could be detected. The detection applies the adjacency check only on the horizontal and vertical axis. On the right image it can be seen that the same pixels are labelled as only two BLOBs. For the eight pixel neighbourhood the diagonal axis is

taken into account as well [12]. The check of the adjacency condition has to be performed on all pixels that match the criterion of the common attribute. Based on these two steps all BLOBs in a frame can be found. The aim of the BLOB detection for our requirements is to determine the center points of the BLOBs in the current frame. With respect to the application area, this project defines BLOBs as sets of white and light gray pixels, while the background pixels are black. This follows from the setup for the image acquisition, where infrared cameras will be applied to track the projection surface of the Immersion Square. The expected image material will be similar to the samples in Figure 3.
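To make the thresholding and adjacency steps concrete, the following is a minimal software sketch of threshold-based BLOB labelling with an eight pixel neighbourhood. It is only an illustration of the concept on a host PC; it is not the FPGA design described in this paper, and the image layout and threshold value are assumptions.

    #include <cstdint>
    #include <queue>
    #include <vector>

    // Label all BLOBs in a grayscale image using a brightness threshold and an
    // eight pixel neighbourhood (horizontal, vertical and diagonal adjacency).
    // Returns one vector of (x, y) pixel coordinates per detected BLOB.
    std::vector<std::vector<std::pair<int,int>>>
    detectBlobs(const std::vector<uint8_t>& img, int width, int height, uint8_t threshold)
    {
        std::vector<int> label(img.size(), -1);          // -1 = not assigned yet
        std::vector<std::vector<std::pair<int,int>>> blobs;

        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                int idx = y * width + x;
                if (img[idx] <= threshold || label[idx] != -1)
                    continue;                            // background or already labelled

                // Start a new BLOB and grow it with a breadth-first flood fill.
                int id = static_cast<int>(blobs.size());
                blobs.emplace_back();
                std::queue<std::pair<int,int>> open;
                open.push({x, y});
                label[idx] = id;

                while (!open.empty()) {
                    auto [px, py] = open.front();
                    open.pop();
                    blobs[id].push_back({px, py});

                    // Visit all eight neighbours of the current pixel.
                    for (int dy = -1; dy <= 1; ++dy) {
                        for (int dx = -1; dx <= 1; ++dx) {
                            int nx = px + dx, ny = py + dy;
                            if ((dx == 0 && dy == 0) || nx < 0 || ny < 0 || nx >= width || ny >= height)
                                continue;
                            int nidx = ny * width + nx;
                            if (img[nidx] > threshold && label[nidx] == -1) {
                                label[nidx] = id;
                                open.push({nx, ny});
                            }
                        }
                    }
                }
            }
        }
        return blobs;
    }

With a four pixel neighbourhood, only the neighbours with dx == 0 or dy == 0 would be visited, which is why the left example in Figure 2 splits the same pixels into four BLOBs.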

Fig. 3: Example of BLOBs perfect shape (left) and with blur (right).

The characteristics of the blurring effect depend on the acceleration of the light source by the user. This effect causes some issues for the estimation of the BLOB center points, which have been addressed in this project. The Bounding Box based computation estimates the center point of a BLOB by searching for the minimal and maximal X and Y coordinates of the BLOB. The computation can be implemented very efficiently and does not cause large performance issues.


X center position of a BLOB = (maxX position + minX position) / 2
Y center position of a BLOB = (maxY position + minY position) / 2
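The following is a minimal host-side sketch of the two center-point estimators used in this work: the Bounding Box formula above and the (optionally brightness-weighted) Center-of-Mass average introduced below. It only illustrates the arithmetic; the actual implementation in this paper is a hardware design on the FPGA, and the data structure is an assumption.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Pixel { int x; int y; uint16_t brightness; };

    // Bounding Box: the center is the midpoint of the minimal and maximal coordinates.
    std::pair<double,double> boundingBoxCenter(const std::vector<Pixel>& blob)
    {
        int minX = blob[0].x, maxX = blob[0].x, minY = blob[0].y, maxY = blob[0].y;
        for (const Pixel& p : blob) {
            if (p.x < minX) minX = p.x;
            if (p.x > maxX) maxX = p.x;
            if (p.y < minY) minY = p.y;
            if (p.y > maxY) maxY = p.y;
        }
        return { (maxX + minX) / 2.0, (maxY + minY) / 2.0 };
    }

    // Center-of-Mass: average of all pixel coordinates, here weighted by brightness
    // so that dim border pixels caused by blur contribute less to the result.
    std::pair<double,double> centerOfMass(const std::vector<Pixel>& blob)
    {
        double sumX = 0.0, sumY = 0.0, sumW = 0.0;
        for (const Pixel& p : blob) {
            sumX += static_cast<double>(p.x) * p.brightness;
            sumY += static_cast<double>(p.y) * p.brightness;
            sumW += p.brightness;
        }
        return { sumX / sumW, sumY / sumW };
    }

Setting every weight to 1 gives the plain Center-of-Mass formula; the brightness-weighted variant corresponds to the refinement mentioned after that formula.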

The result of the center coordinates is strongly affected by the pixels at the BLOB's border. This effect becomes even stronger for BLOBs in motion. With reference to the light-emitting device for the Immersion Square environment, the angle between the light beams and the projection surface changes the shape of the BLOBs and increases the range of pixels with lower intensity. In addition, the movement of the device by the user will cause motion blur. These effects increase the flickering of pixels at the BLOB's border and cause a flickering of the computed center point. With Center-of-Mass, the coordinates of all pixels of the detected BLOB are taken into account for the computation of the center point. The algorithm combines the coordinate values of the detected pixels as a weighted sum and calculates an averaged center coordinate [12]:

X center position of a BLOB = (sum of X positions of all BLOB pixels) / (number of pixels in BLOB)
Y center position of a BLOB = (sum of Y positions of all BLOB pixels) / (number of pixels in BLOB)

To achieve even better results, the brightness values of the pixels have been applied as weights as well. This increases the precision of the estimated center point with respect to the BLOB's intensity. As described in [12], the sequential processing of a frame requires an additional adjacency check for the BLOBs themselves. Depending on a BLOB's shape and orientation, the detection might separate pixels of one BLOB into two different BLOBs. The adjacency check for BLOBs is based on the same concept as has been explained for pixels earlier. C. Serial Interface To observe the system process and evaluate its results, a module for communication over the serial interface of the target board has been developed. Other hardware interfaces were available, such as USB, Ethernet, and IrDA, but the serial interface allows the smallest design with respect to protocol overhead and resource allocation on the FPGA. The serial module reads the information about a BLOB result from a FIFO buffer and transmits it to the RS232 controller. The serial interface module operates independently from the other system modules and sends results as long as the FIFO containing the BLOB results is not empty. III. IMPLEMENTATION The overall processing of BLOB detection is separated into several sub-processes. For a higher flexibility of the system design, the functionality has been implemented in separate modules. Those modules use FIFOs to store their results. This allows all modules to run separately with individual clock rates. The identification of pixels which might belong to a BLOB and the collection of BLOB attributes is very similar for both BLOB detection solutions. Only the computation of the center point for the detected BLOBs is different, based on the algorithm explained in the previous section. In the first processing step the check of the common attribute on the RGB pixel data identifies the relevant data. All pixels which do not match the criterion are dropped from further processing. The chosen identification criterion is the brightness value of the pixel. Every pixel which has a higher value than the specified threshold value is saved for the adjacency check. Those pixels are saved in a dual-clock FIFO from where the adjacency check acquires its input data. The BLOB detection module implements the adjacency check for the pixel data and sorts the pixels into data structures, referred to as containers. During the detection process the containers are continuously updated until a new frame starts. At that point the BLOB attributes are written into a FIFO and the container contents are cleared. The amount of attributes stored for each BLOB depends on the selected method for computing the center point. The last task to be processed in the BLOB detection is the computation of the center point. For controlling the processing of the BLOB attributes, the module is designed as a state machine (Figure 4). Using state machines is a common method for process control in hardware design. The results of the center point computation are written into the result FIFO from where the serial communication module obtains its input data. IV. VERIFICATION AND VALIDATION For the verification and validation of the BLOB detection results, two different input sources have been applied. The S-Video input is used for the


Fig. 4: State Machine for reading BLOB attributes and computing center points.

verification of the detection accuracy and precision of the different center point computation methods. Secondly, the CCD camera input is used for performance benchmarking of the system design. For the performance measurements, the visualization of the captured image data from the CCD camera on the VGA display had to be disabled. This was necessary because the VGA controller module did not support higher frame rates. The applied input material on the S-Video input had a resolution of 640x480 pixels. The sampling of the AD-converter was set to the same resolution to avoid the artificial offset in the digitized image material that would have shown up for different image resolutions. The ground truth values for the precision of the expected BLOB center points were verified by hand. Based on the specified application area, the shapes of the BLOBs to be detected have been assumed to be perfect circles and circular shapes with a blur effect, as shown in Figure 3. The BLOB detection system needs to be configured before its results are used for further processing. This includes the estimation of the best threshold value for the given application environment. A. Precision For the computation of the BLOB center point the Bounding Box and the Center-of-Mass based methods showed exactly the same results for clear BLOBs with a perfect circular shape. This has been tested for several representative threshold values. If a BLOB does not show any blur effect, the applied method for computing the center point has no influence on the precision. The Bounding Box and Center-of-Mass computations showed different results for the image material containing BLOBs with a blur effect. The Center-of-Mass results turned out to be closer to the BLOB's center point. The results for one particular example are shown in Table I.
Threshold | Ground Truth X | Ground Truth Y | Bounding Box X | Bounding Box Y | Center-of-Mass X | Center-of-Mass Y
0x190     | 230 | 306 | 226 | 308 | 229 | 307
0x20B     | 230 | 306 | 227 | 308 | 229 | 307
0x286     | 232 | 305 | 228 | 307 | 230 | 306
0x301     | 234 | 303 | 231 | 305 | 233 | 304
0x37C     | 237 | 300 | 236 | 300 | 237 | 300
0x3E0     | 242 | 298 | 241 | 299 | 241 | 298

TABLE I: Results for center point computation with blur-shaped BLOBs.

Center point computation with Center-of-Mass shows higher precision for BLOBs with a blur effect, compared to Bounding Box. The Center-of-Mass method shows an error of 0.02 %, the Bounding Box based computation an error of 0.04 %. The threshold values which have been used for the evaluation of precision lie within the value range of the BLOBs in the applied image material. This value range depends on the applied image material and cannot simply be reused for any image source or material. The estimation of the value range is a configuration requirement before using the BLOB detection system. For threshold values above or below the given value range the system was not able to detect all BLOBs in the image material accurately. B. Performance The system performance has been evaluated on a fixed environment setup. The CCD camera has been applied to run the BLOB detection approach with real-time image data. The performance has been measured by counting the number of frames per second that were processed on the DE2 board. For verifying that the BLOB detection was working correctly, the results have been observed manually on the connected host PC. The visualization of the BLOB detection results on the connected host PC


allowed a reasonable validation by hand for an input speed of 45 frames per second. This was the limitation of the CCD camera for the applied resolution of 640x480 pixels. Reported values about performance and resource allocation are given in Tables II and III.
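For completeness, this is a small sketch of how a host PC could read the BLOB results from the board's RS232 interface for manual validation. The paper does not specify the serial frame format, baud rate or device name, so the line-oriented "x y" format, 115200 baud and /dev/ttyUSB0 below are assumptions for illustration only.

    #include <cstdio>
    #include <cstdlib>
    #include <fcntl.h>
    #include <termios.h>
    #include <unistd.h>

    int main()
    {
        // Open the serial device (name assumed; the DE2 board exposes RS232).
        int fd = open("/dev/ttyUSB0", O_RDONLY | O_NOCTTY);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        // Configure 115200 baud, raw input (settings assumed).
        termios tio{};
        tcgetattr(fd, &tio);
        cfmakeraw(&tio);
        cfsetispeed(&tio, B115200);
        cfsetospeed(&tio, B115200);
        tcsetattr(fd, TCSANOW, &tio);

        // Read characters and print one "x y" pair per received line.
        char line[64];
        size_t len = 0;
        char c;
        while (read(fd, &c, 1) == 1) {
            if (c == '\n' || len + 1 >= sizeof(line)) {
                line[len] = '\0';
                int x, y;
                if (sscanf(line, "%d %d", &x, &y) == 2)
                    printf("BLOB center: (%d, %d)\n", x, y);
                len = 0;
            } else if (c != '\r') {
                line[len++] = c;
            }
        }
        close(fd);
        return EXIT_SUCCESS;
    }

A reader like this is sufficient for the manual validation described here, since the serial link only carries a few coordinate pairs per frame.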
with Monitor Output        | Bounding Box | Center-of-Mass
Speed (fps)                | 12           | 12
Camera Speed (MHz)         | 25           | 25
System Speed (MHz)         | 40           | 50
Max. System Speed (MHz)    | 72           | 65
Allocated Resources on the FPGA
Logic Elements             | 7,850        | 14,430
Memory Bits                | 147,664      | 237,616
Registers                  | 2,260        | 2,871
Verified                   | *            | *

TABLE II: Resource allocation and benchmark results with monitor output.

without Monitor Output     | Bounding Box | Center-of-Mass
Speed (fps)                | 46           | 50
Camera Speed (MHz)         | 96           | 96
System Speed (MHz)         | 125          | 125
Max. System Speed (MHz)    | 140          | 189
Allocated Resources on the FPGA
Logic Elements             | 5,884        | 13,311
Memory Bits                | 113,364      | 239,316
Registers                  | 1,510        | 2,078
Verified                   | *            | *

TABLE III: Resource allocation and benchmark results without monitor output.

The performance result fps refers to the frame rate obtained during the benchmarking. The camera speed is the clock rate that is used to read out the CCD sensor. System Speed is the clock rate of the BLOB detection module during the performance test, and Max. System Speed is the maximum possible clock rate for the implemented design. The CCD camera itself has an average speed of 45 frames per second for the applied configuration parameters. While both BLOB detection approaches would have been able to run at faster frame rates, the estimation of the maximum performance was again restricted by the input source. The resource allocation shows that Center-of-Mass requires about twice as many logic elements and memory bits as Bounding Box. Figure 5 shows the relation between the number of BLOBs that can be processed by the system and the maximum performance in MHz. The maximum number of BLOBs that can be detected with the applied target board is approximately seven for Center-of-Mass and sixteen for Bounding Box.

V. FUTURE WORK It has been shown in Section IV-B that the BLOB detection could not be tested at its maximum performance. Both available input sources turned out to be a bottleneck for the acquisition of image material. It is planned to integrate the BLOB detection approach into a camera with an on-board FPGA. With direct access to the sensor of a high performance camera, the BLOB detection can be evaluated at its maximum processing speed. VI. CONCLUSION With Bounding Box and Center-of-Mass, two common methods for the estimation of a BLOB center point have been implemented. The evaluation has shown reliable precision results with respect to the given application area. As estimated in [12], the Center-of-Mass based approach shows higher precision and is the recommended solution for the given application task. The BLOB detection approach is a working threshold-based solution specialized for BLOBs which consist of white and light grey pixels on a black image background. Both available input sources have been identified as a bottleneck for the system's processing speed. This has proven the FPGA design to be well qualified for the proposed computer vision problem. For the evaluation and further processing a module for transmitting results over the serial interface has been designed, tested and applied. The maximum performance of the serial interface is high enough for even faster frame processing rates, as described in Section II-C. The output format of the computation results can easily be changed in the system design. For instance, further information about the BLOBs, such as direction of movement or speed, can be computed on the FPGA system and transmitted as well. This paper has shown that a BLOB detection system can be successfully implemented and evaluated on an FPGA platform. Validation and verification showed reliable results and the advantages of image processing tasks designed in hardware. The work required a higher time effort compared to the implementation of a similar system in a high-level language, such as C++ or Java. Acknowledgments This work is supported in part by Matrix Vision, CMC Microsystems and the Natural Sciences and Engineering Research Council of Canada.


Fig. 5: Performance for BLOB Detection.

REFERENCES
[1] C. Cruz-Neira, D. J. Sandin, and T. A. DeFanti, "Surround-screen projection-based virtual reality: the design and implementation of the CAVE," in SIGGRAPH '93: Proceedings of the 20th annual conference on Computer graphics and interactive techniques. New York, NY, USA: ACM, 1993, pp. 135-142. [Online]. Available: http://portal.acm.org/ft gateway.cfm? id=166134&type=pdf&coll=GUIDE&dl=GUIDE&CFID= 82404499&CFTOKEN=73325867 [2] R. Herpers, F. Hetmann, A. Hau, and W. Heiden, "Immersion Square - a mobile platform for immersive visualisations," 2005, University of Applied Sciences Bonn-Rhein-Sieg. [Online]. Available: http://www.cv-lab. inf.fh-brs.de/paper/remagen2005-I-Square1b.pdf [3] C. Wienss, I. Nikitin, G. Goebbels, K. Troche, M. Göbel, L. Nikitina, and S. Müller, "Sceptre - an infrared laser tracking system for virtual environments," in Proceedings of the ACM Symposium on Virtual Reality Software and Technology (VRST 2006), ISBN 1-59593-321-2, 2006, pp. 45-50. [Online]. Available: http://www.digibib.net/openurl?sid=hbz:dipp&genre= proceeding&aulast=Wienss&aurst=Christian&title= +Proceedings+of+the+ACM+symposium+on+Virtual+ Reality+software+and+technology+VRST+2006&isbn= 1-59593-321-2&date=2006&pages=45-50 [4] M. E. Latoschik and E. Bomberg, "Augmenting a laser pointer with a diffraction grating for monoscopic 6DOF detection," Journal of Virtual Reality and Broadcasting, vol. 4, no. 14, Jan. 2007, urn:nbn:de:0009-6-12754, ISSN 1860-2037. [Online]. Available: http://www.jvrb.org/4.2007/ 1275/4200714.pdf [5] J. Hammes, A. P. W. Böhm, C. Ross, M. Chawathe, B. Draper, and W. Najjar, "High performance image processing on FPGAs," in Proceedings of the Los Alamos Computer Science Institute Symposium, Santa Fe, NM, 2000. [Online]. Available: http://www.cs.colostate. edu/draper/publications/hammes lacsi01.pdf

[6] A. Benedetti, A. Prati, and N. Scarabottolo, "Image convolution on FPGAs: the implementation of a multi-FPGA FIFO structure," in 24th EUROMICRO Conference, Volume 1 (EUROMICRO'98), August 25-27, 1998. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download? doi=10.1.1.50.1811&rep=rep1&type=pdf [7] D. G. Bariamis, D. K. Iakovidis, and D. E. Maroulis, "An FPGA-based architecture for real time image feature extraction," in Proceedings of the ICPR International Conference on Pattern Recognition, 2004, pp. 801-804. [Online]. Available: http://rtsimage.di.uoa.gr/publications/ 01334338.pdf [8] J. Trein, A. T. Schwarzbacher, and B. Hoppe, "FPGA implementation of a single pass real-time blob analysis using run length encoding," MPC-Workshop, February 2008. [Online]. Available: http://www.electronics.dit.ie/ postgrads/jtrein/mpc08-1Blob.pdf [9] J. W. MacLean, "An evaluation of the suitability of FPGAs for embedded vision systems," The First IEEE Workshop on Embedded Computer Vision Systems (San Diego), June 2005. [Online]. Available: http://wjamesmaclean.net/ Papers/cvpr05-ecv-MacLean-eval.pdf [10] Terasic, DE2 Development and Education Board User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007. [Online]. Available: http: //www.terasic.com/downloads/cd-rom/de2/ [11] Terasic, Terasic TRDB-D5M Hardware Specification, Terasic, June 2008. [Online]. Available: http://www.terasic.com.tw/cgi-bin/page/archive download.pl?Language=English&No=281&FID= cbf3f0dcdbe2222a36f93826bcc25667 [12] A. Bochem, R. Herpers, and K. Kent, "Hardware acceleration of blob detection for image processing," in Advances in Circuits, Electronics and Micro-Electronics (CENICS), 2010 Third International Conference on, July 2010, pp. 28-33.


Rapid Prototyping of OpenCV Image Processing Applications using ASP


Felix Mühlbauer*, Michael Großhans*, Christophe Bobda†
* Chair of Computer Engineering, University of Potsdam, Germany ({muehlbauer,grosshan}@cs.uni-potsdam.de)
† CSCE, University of Arkansas, USA (cbobda@uark.edu)

Abstract - Image processing is becoming more and more present in our everyday life. With the requirements of miniaturization, low power and performance in order to provide intelligent processing directly in the camera, embedded cameras will dominate the image processing landscape in the future. While the common approach for developing such embedded systems is to use sequentially operating processors, image processing algorithms are inherently parallel, thus hardware devices like FPGAs provide a perfect match for developing highly efficient systems. Unfortunately hardware development is more difficult and there are fewer experts available compared to software. Automating the design process will leverage the existing infrastructure, thus providing faster time to market and quick investigation of new algorithms. We exploit ASP (answer set programming) for system synthesis with the goal of generating an optimal hardware/software partitioning, a viable communication structure and the corresponding scheduling from an image processing application.

I. INTRODUCTION Image processing is becoming more and more present in our everyday life. Mobile devices are able to automatically take a photo when detecting a smiling face, and intelligent cameras are used to monitor suspicious people and operations at airports. In production chains, smart cameras are used for quality control. Besides those fields of application, many others are being considered and will be widened in the future. The challenge in developing such embedded image processing systems is that image processing often results in very high resource utilization, while embedded systems are usually equipped with only limited resources. The common approach is based on general purpose processor systems, which process data mainly sequentially. In contrast, image processing algorithms are inherently parallel and thus hardware devices like FPGAs and ASICs provide the perfect match to develop highly efficient solutions. Unfortunately, compared to software development, only few hardware experts are available. Additionally, hardware development is error prone, difficult to debug and time consuming, leading to a long time-to-market. From an economic point of view two criteria for the development are important: time to market and the performance of the product. Automatic synthesis, with the aim of generating an optimal

architecture according to the application, will help provide the required performance in reasonable time. Our motivation is to design a development environment in which high-level software developers can leverage the speed of hardware accelerators without knowledge of low-level hardware design and integration. Our approach consists of running a ready-to-use and well-known software library for computer vision on a processor featuring an operating system. We rely on very popular open source software like OpenCV [1] and the operating system Linux. This approach allows the software application developers to focus on the development of high-quality algorithms, which will be implemented with the best performance. Given an application, the decision to map a task to a processing element and defining the underlying communication infrastructure and protocol is a challenging task. In this paper we focus on the system synthesis problem of OpenCV applications using heterogeneous FPGA-based on-chip architectures as the target architecture. We use ASP (answer set programming) to prune the solution space. The goal is to find optimal solutions for the task mapping and communication simultaneously, using constraints like timings and chip resources. This paper is structured as follows: After addressing related work, we explain our model for image processing architectures and the resulting design space. A brief introduction into ASP is followed by a description of the strategy for expressing and solving the problem in an ASP-like manner. The paper concludes with results and future work. II. RELATED WORK Search algorithms like, e.g., evolutionary algorithms are capable of solving very complex optimization problems. A generic approach is implemented by the PISA tool [2]. Here a search algorithm is a method which tries to find solutions for a given problem by iterating three steps: first, the evaluation of candidate solutions, second, the selection of promising candidates based on this evaluation and third, the generation of new candidates by variation of these selected

978-1-4577-0660-8/11/$26.00 © 2011 IEEE


candidates. As examples, most evolutionary algorithms and simulated annealing fall into this category. PISA is mainly dedicated to multi-objective search, where the optimization problem is characterized by a set of conflicting goals. The module SystemCoDesigner (SCD) was developed to explore the design space of embedded systems. The found solution is not necessarily the best possible; in contrast, we use ASP to find an optimal solution. In Ishebabi et al. [3] several approaches for architecture synthesis for adaptive multi-processor systems on chip were investigated. Besides heuristic methods, also ILP (integer linear programming), ASP and evolutionary algorithms were used. In our work the focus is less on processor allocation and more on communication, especially handling streaming data. The resulting complexity is different. III. ARCHITECTURE The flexibility of FPGAs allows for the generation of application-specific architectures without modification of the hardware infrastructure. We are especially interested in centralized systems in which a coordinator processing element is supported by other slave processing elements. Generally speaking, these processing elements could be GPPs (general purpose processors) like those available in SMP systems, coprocessors like FPUs (floating point units), or dedicated hardware accelerators (AES encryption, Discrete Fourier Transformation, ...). Each processing element has different capabilities and interfaces which define the way data exchange with the environment takes place. Communication is also a key component of a hardware/software system. While an efficient communication infrastructure will boost the performance, a poorly designed communication system will badly affect the system. In the following, our architecture model and the different paths of communication are described in detail. In general, the most important components for image processing are processing elements, memories and communication channels. The specification will therefore focus in more detail on those components. In our architecture model we distinguish two kinds of processing elements: software-executing processors and dedicated hardware accelerators that we call PUs (processing units). For each image processing function one or more implementations may exist in software or hardware, each of which has a different processing speed and resource utilization (BRAM, slices, memory, ...). Considering the communication, image data usually does not fit¹ into the on-chip memory inside an FPGA and must be stored in an external memory. Because of the sequential nature (picture after picture) of the image capture, the computation on video data is better organized and processed as a stream. The hardware representation of this idea is known as pipelining: several PUs are concatenated and build up a processing
¹ A video with VGA resolution, 8 bits per color and 25 frames per second amounts to (640 × 480) · (3 × 8) · 25 ≈ 175 megabit of data per second.

Fig. 1. Example architecture according to our model (PU = processing unit, IM = interconnection module; PU chains and IMs are linked by SDI connections to the camera input, the processor, the memory controller and the system bus towards the external memory on the FPGA).

chain. The data is processed while flowing through the PU chain. In order to allow a seamless integration of streaming-oriented computation in a software environment, we implemented an interface called SDI (streaming data interface). The interface is simple and able to control the data flow to prevent data loss caused by different speeds of the modules. It consists of the signals and protocols that allow an interlocking data transport across a chain of connected components. The SDI interface allows for reusing PUs in different calculation contexts. A variety of processors available for FPGAs, like PowerPC, MicroBlaze, NIOS, LEON or OpenRISC, provide a dedicated interface to connect co-processors. These interfaces could be used for instruction set extensions but also as a dedicated communication channel to other modules. Hence, to build an architecture, a PU or a chain of PUs can be connected to a memory or a processor. Furthermore, memories are accessible directly (e.g. internal BRAM) or via a shared bus (e.g. external memory). For these interconnections, so-called IMs (interconnection modules) are introduced in our model to link an SDI interface to another interface. Figure 1 shows an example architecture. IV. DESIGN SPACE AND SCHEDULING Compared to a software-only solution, the best architecture, based on a hardware/software partitioning, should be found. The search is based on a given set of image processing algorithms and a given set of objective functions and optimization constraints. These general requirements usually concern the system performance, the chip's resource utilization or the power consumption. We use the task graph model to capture a computation application. It defines the dependencies between the different processing steps, from the capture of the raw image data to the production of the result. Besides a pool of software and hardware implementations, a database was filled with meta-information about these implementations like costs and interoperability. Assuming that for each task a software implementation exists, the costs of selecting a component are its processing time and its memory utilization. For computationally intensive tasks a hardware implementation is available. The important costs here are processing time, initial delay, throughput and chip utilization


(slices, BRAM, DSP, ...). This information is gathered, e.g., by profiling of function calls or data flow analysis. The problem to be solved is to distribute the tasks to a selected set of processing elements while being aware of timings and scheduling. As mentioned earlier, communication is a key part and often a trade-off between communication and processing has to be found. For example, for adjacent tasks it could be faster in total to process them all locally instead of transferring the data from one task in the middle to and from another high-speed module. While mapping the tasks to processors, two kinds of parallelism have to be considered: first, the parallel operation of independent processors like in SMP systems and second, PUs processing data in series while streaming, like in pipelining. Both kinds of parallel operation have a different impact on the scheduling. Figure 2 shows a Petri net modeling the behavior of the implementation of an image processing function. One byte of data is represented by one token, which traverses the chain from pin to pout according to the implementation type T. I is the number of pixels in one image or frame. The two paths on the left describe filter-like operations which consume one image with a certain number of bytes per pixel and output one image with possibly another data rate. The two paths on the right describe operations with a fixed result size r, like the image brightness. While stream-based implementations (T=0;2) work on a pixel-by-pixel basis and cause an initialization delay (modeled as transition t1 / t4) and a processing delay (t2 / t5), other implementations (T=1;3) need full access to the whole image and take a delay of t3 / t6. PUs introduce two additional parameters which are important to calculate the scheduling: the initial delay and the throughput of data, which must be considered for chained PUs. It takes some time after a PU has read the first data before the first result is available. This delay determines the starting time of the next PU as specified in the task graph. Additionally, it is not always possible for a PU to operate at its maximum speed, because the speed is also determined by the speed of the incoming data from the previous module. Thus, to calculate the actual operation speed of a PU, these two facts have to be taken into account too. The different operation speeds for different contexts make the decision for the best architecture more difficult. Furthermore, the interplay of components on the chip may incur communication bottlenecks. For example, the parallel access to main memory by different modules will incur delays in the computation.
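As an illustration of these two parameters, the following sketch computes, for a chain of PUs, the effective operating speed of each stage (limited by the slower of its own maximum speed and the incoming data rate) and the time at which each stage produces its first result. It is only a simplified model of the scheduling considerations described above; the structure and field names are assumptions and not part of the paper's ASP model.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Simplified model of a chained processing unit (PU).
    struct PU {
        double initialDelay;   // time from first input byte to first output byte
        double maxThroughput;  // bytes per time unit the PU can process on its own
    };

    int main()
    {
        // Example chain; values are purely illustrative.
        std::vector<PU> chain = { {10.0, 100.0}, {5.0, 50.0}, {2.0, 80.0} };

        double incomingRate = 100.0;   // data rate delivered by the source (e.g. camera)
        double firstResultTime = 0.0;  // when the current stage emits its first byte

        for (std::size_t i = 0; i < chain.size(); ++i) {
            // A PU cannot run faster than the data arriving from the previous stage.
            double effectiveRate = std::min(chain[i].maxThroughput, incomingRate);

            // The first result of this stage appears after the previous stage's first
            // result plus this stage's own initial delay.
            firstResultTime += chain[i].initialDelay;

            std::printf("PU %zu: effective rate %.1f, first result at t=%.1f\n",
                        i, effectiveRate, firstResultTime);

            // The next stage sees this stage's output rate as its input rate.
            incomingRate = effectiveRate;
        }
        return 0;
    }

The same two effects, accumulated initial delays and throttled throughput, are what the ASP model captures with the time and speed atoms introduced in Sections V-E and V-F.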

Fig. 2. Petri net modeling the data flow within processing elements with different implementation types T (paths for T=0..3 with transitions t1 to t6 between pin and pout).

V. ANSWER SET PROGRAMMING Answer Set Programming (ASP) is a declarative programming paradigm which uses facts, rules and other language elements to specify a problem. Based on this formal description an ASP solver can determine possible sets of facts which fit the given model. A. Background The basic concept for modeling ASP programs are rules of the form:

p0 ← p1, ..., pm, not pm+1, ..., not pn.   (1)

For a rule r the sets body+(r) = {p1, ..., pm}, body−(r) = {pm+1, ..., pn} and head(r) = p0 are defined. To understand the concept of answer sets the rule can be interpreted as follows: if an answer set A contains p1, ..., pm, but not pm+1, ..., pn, then p0 has to be inserted into this answer set. Additionally, to avoid unfounded solutions, for an answer set A which contains p0 there must exist one rule r such that head(r) = p0, body+(r) ⊆ A and body−(r) ∩ A = ∅. Of course, this is only an intuitive way to describe the wide field of answer sets. More specific definitions can be found in several works about answer set programming [4]. With regard to the answer set semantics, a solving strategy and some solving tools are needed to handle the proposed way of understanding logic programs. The first step in computing answer sets is to build a grounded version of the logic program: all variables are eliminated by duplicating rules for each possible constant and substituting the variable. For example, the program

p(1).   p(2).   q(X) ← p(X).   (2)

is ground to

p(1).   p(2).   q(1) ← p(1).   q(2) ← p(2).   (3)

Of course the grounded version of the program could be much bigger than the original one. Another problem is that the grounder needs a complete domain for each variable. For


this reason sometimes it is necessary to model such a domain manually, e.g. the grounder could need a time domain where all possible time values are explicitly given. After generating the grounded version a SAT-like solver is used to compute all answer sets. The most common way to model logic programs for an ASP solver is to use the generate-and-test paradigm. Thus, there are some rules which are responsible for generating a set of facts. Additionally, there exist some constraints which have to be met such that the generated set is an answer set, and consequently a solution for the given problem. For that reason ASP solvers support some special language extensions. Generating rules can be modeled using aggregates [5]:

l [ v0 = a0, v1 = a1, ..., vn = an ] u.   (4)

The brackets define a weighted sum of the atoms v0, ..., vn and their weights a0, ..., an. The rule describes the fact that a subset A ⊆ {v0, ..., vn} of true atoms exists such that the sum of weights is within the boundaries [l; u]. Omitted weights default to 1. These rules can be used to generate different sets of atoms. Integrity constraints test whether a generated set of atoms is an answer set. These constraints describe which conditions must not be true in any answer set and are written:

← p, q.   (5)

In this example there exists no answer set containing both p and q. So far we have described all language features which are necessary to model our optimization problem. B. Model Our ASP model is structured in three parts: first, the problem description including a task graph and constraints for the demanded architecture; second, a summary of meta-information for all implementations like costs, the mapping of tasks to HW or SW components and the mapping interconnect; third, the solver itself with all rules needed to find a solution. This separation (into files) also offers a highly flexible and reusable model. The basic idea of the model is to select an implementation for each task, connect the associated modules and finally consider timings and dependencies to build the scheduling. Details are described in the following sections. C. Allocation of processors First, each task needs to be mapped to exactly one component. For the introduced scenario this could be the main processor or a PU. The amount of permutations in the model to map components is M!, where M is the maximum number of possible components allowed to be instantiated. To reduce symmetries, a component is defined to have two indices: cij, j ∈ [1; Ji]. With Ji defining the maximum possible number of instantiations of a component i, the amount of permutations is reduced to J1! · ... · Jn!. The values Ji are derived from the task graph. Normally it is not necessary to instantiate a certain component more than once, thus Ji often is equal to 1. For each instantiated component cij and each task tn an atom M_{tn,cij} is defined, where i specifies the implementation type of a component and j an instance counter. Thus, for each task ti the sum of mapped components must equal 1:

1 [ M_{ti,c11}, ..., M_{ti,ckl} ] 1.   (6)

D. Data flow After all processing units are instantiated, they need to be connected. Connections are derived from edges in the task graph and could be simple point-to-point connections, but could also involve more components; for example, if data should be transferred from a processing unit to a memory, the connection requires an IM in between to link the different interfaces. For each transfer n an atom C_{n,cij,ckl} is defined, which indicates that data for that transfer is sent from component cij to ckl. This atom only exists if the transfer actually happens:

0 [ C_{n,cij,ckl} ] 1.   (7)

Again, after generating atoms they have to be limited to useful ones. Modeling these constraints is similar to path-finding algorithms. Assuming a transfer n describes the dependency between two tasks tx (source) and ty (sink), the following constraints need to be met:

← M_{tx,cij}, [ C_{n,cij,ckl} ] 0.   (8)
← M_{ty,cij}, [ C_{n,ckl,cij} ] 0.   (9)

If a source task tx is mapped to a component cij there must exist a component ckl to which cij sends data in transfer n (see rule 8). Similarly, if a sink task ty is mapped to a component cij there must exist a component ckl which sends data to cij in transfer n (see rule 9). Otherwise the solution is invalid. Additionally, the model must ensure that there exists a path for each transfer between the source and sink components, and it has to avoid senseless connections, e.g. in the case of incompatible interfaces between two components. E. Time To evaluate the performance of a hardware architecture it is necessary to schedule all tasks in a temporal order that will ensure the minimal run-time of the algorithm. Modeling temporal behavior can be done with the help of a time domain, which defines a discrete and finite set of possible time slots. Each task is assigned to a time slot to indicate its starting time. Additionally, the task graph is extended by two special tasks to mark the start and end of the computation. While the start task is assigned to time slot 0, the time slot of the end task indicates the total runtime and is the value to be optimized by the solver. Choosing a practical duration for a time slot is difficult. Selecting shorter time intervals results in a very accurate scheduling; however, a small time slot leads to an explosion of the number of possibilities that the solver has to

deal with. As a trade-off we choose a normalized time interval for a time slot, related to the fastest component: one time slot is the amount of time the fastest component takes to process a certain amount of data. The occupation of two time slots indicates that a component operates at half the speed of the fastest one. In our ASP model each task ti is assigned to exactly one time slot k, indicated by an atom T_{ti,k}:

1 [ T_{ti,1}, ..., T_{ti,m} ] 1.   (10)

where m is the total number of time slots available in the time domain, given as a constant in our model. To meet the dependencies given by the task graph, a task tx may not start before its predecessor ty:

← T_{tx,kx}, T_{ty,ky}, kx < ky.   (11)

F. Synchronization In Section IV we explained why the maximum operation speed of a component may not be exhausted and the actual speed depends on the processing context. Therefore an atom S_{ti,k} is defined as

1 [ S_{ti,1}, ..., S_{ti,p} ] 1.   (12)

where k is the speed of the task ti. Similar to the time model, a relative criterion was chosen for modeling the speed, for the same reasons. A value of 1 implies that the fastest component needs one time slot to process a certain amount of data. The constant p is the maximum speed value, and thus the speed of the slowest component. The possible speed values of a component depend on the speed of its predecessor. If two tasks tx and ty are dependent and mapped on adjacent components, the assigned speed values have to be equal:

← S_{tx,kx}, S_{ty,ky}, kx ≠ ky.   (13)

To find a scheduling some more helper values are needed. If the starting time and the speed of a task are known, its end time can be determined. Similar to the definition of the starting time T in Section V-E, the end time of a task ti is described by an atom E_{ti,k}:

E_{ti,ke} ← S_{ti,d}, T_{ti,ks}, ke = ks + d.   (14)

To introduce a local scheduling on each component, there may not exist any tasks which have intersecting computation times. Thus, if two tasks tx and ty are mapped to the same component and ty starts after tx, then tx must have finished before ty starts:

← T_{tx,ks}, E_{tx,ke}, T_{ty,ky}, ks ≤ ky, ky < ke.   (15)

G. Resource utilization With the rules introduced so far it is possible to build up valid architectures. In the following, further rules are presented which, first, have a global influence on the quality of the generated architectures and, second, comply with the general conditions. In detail, this concerns memory bandwidth, chip area utilization and total runtime. One major issue is the utilization of the system bus resp. memory bus, because most data is stored in the main memory and the memory interface easily becomes a bottleneck. For each point in time it must be assured that the bus is not overloaded and the speed of the attached components is throttled if necessary. In our model the speed of the system bus is given as a constant sb. For each time slot the traffic of all active bus transfers is summed up and compared to the system bus capacity. Furthermore, the traffic caused by a component is inversely proportional to its speed; e.g., if a component operates four times slower than the bus, the bus utilization is one quarter. This is expressed by the following inequation:

1/s_{t,c1} + ... + 1/s_{t,cn} ≤ 1/sb.   (16)

For the time slot t the components c1, ..., cn are loading the bus according to their individual speeds s_{t,ck}. For ease of reading, cij is shortened to ck. In ASP, fractional numbers should be avoided and integer numbers used instead. Therefore the bus capacity is modeled as discrete work slots which can be allocated by active components. The constant p (introduced in rule 12) is derived from the slowest component and hence the minimal bus load is 1/p if the bus speed sb equals 1. This also results in p as the number of needed slots, respectively p/sb for sb ≠ 1. To understand this, consider that it is possible to normalize all speed values, including the maximum value p, by sb and get a new maximum value p' = p/sb and a new bus speed s'b = 1. For a component ck sending data and operating at speed s_{t,ck}, the normalization coefficient is s_{t,ck}/sb. Thus ck uses sb/s_{t,ck} of the bus capacity and consequently allocates

(sb/s_{t,ck}) · (p/sb) = p/s_{t,ck}   (17)

work slots. With this normalization, equation 16 becomes an integrity constraint using only integer numbers:

p/s_{t,c1} + ... + p/s_{t,cn} ≤ p.   (18)

Another issue concerning the general constraints of a solution is the chip area. As described before, the resource utilization r of each component is given as part of the meta-information. For each instantiated component ck the value rk is represented by the atom R_{ck,rk}. With Ru defining the overall resource constraint, the integrity constraint

← R_{c1,r1}, ..., R_{cn,rn}, Ru, r = r1 + ... + rn, u < r.   (19)

rejects architectures that consume too many resources. This rule is replicated to handle different resources like slices or BRAMs. Finally, to obtain the optimized model, the total run-time should be minimized. As an indicator for the runtime the end task te was


task       | throughput PPC | throughput PU | slices | bram
gauss      | 16             | 1             | 2      | 2
sobel      | 16             | 1             | 2      | 2
gradient   | 8              | 2             | 1      | 0
trace      | 16             | -             | -      | -
system bus | -              | -             | -      | -
IM (bus)   | -              | -             | 10     | 3

TABLE I: Brief meta-information for the different implementations.

defined earlier. In the ASP model an aggregate is used to find the time slot of te:

minimize [ T_{te,1} = 1, ..., T_{te,m} = m ].   (20)


Each atom T_{te,k} is weighted by its time slot number k. Because only one atom is true, the sum results in the time slot number of the end task and hence the total runtime. VI. RESULTS At the University of Potsdam, Germany, a collection of tools called POTASSCO [6] was developed to support the computation of answer sets. Some of those tools are trend-setting and award-winning in the wide field of logic programming [7]. We use the tools gringo and clasp to solve our problem. These applications are capable of handling optimization statements, which are very similar to aggregates: a sum of specific weighted literals is built and the solver tries to optimize this sum during the solving process. As an example application for this paper we used the Canny edge detector, a common preprocessing stage for object recognition. The processing steps are: camera, Gauss filter (noise reduction), Sobel filter (find edges), calculate gradient of edges, trace edges to find contours. While for the first steps a hardware implementation is very fast, the tracing of edges has no consecutive memory access and thus only a software implementation is assumed. Table I summarizes the resource utilization and the assumed throughput for each implementation. On our test system² the ASP solver needs about 3 seconds to find a solution. Figure 3 shows two different generated architectures obtained while decreasing the constraint for the available chip area. The mapping of software tasks is illustrated with parallelograms and dashed arrows. In the bottom right corner of each drawing the consumption of chip area and the estimated run-time is given. Finally, the software-only solution (not shown) takes 65 time slots compared to 21 for the most hardware-intensive architecture. Our current ASP model is a first approach and thus not optimized. Nevertheless we want to give an idea of the performance of the solving process. For the measurements, for each number of tasks from 5 up to 10, 1000 problems were generated randomly, in particular the task graph and the meta-information for the different implementations. Figure 4 shows an
² Desktop with an Intel Core2 Duo processor at 3.16 GHz and 3.2 GB RAM.


Fig. 3. Resulting architectures for different chip area constraints (top: gauss, sobel and gradient as a hardware PU chain with trace in software; chip area 18, 21 time slots. Bottom: only gauss and sobel in hardware, gradient and trace in software; chip area 17, 37 time slots). The pure software solution takes 65 time slots.

Fig. 4. ASP solver runtime measurements according to the number of tasks

exponential growth of the solving time relative to the number of tasks in the problem, which is not worse than expected. VII. CONCLUSION AND FUTURE WORK We have shown that answer set programming is a viable approach to solve complex problems like the architecture generation for data-stream-based hardware/software co-design systems. The advantage over evolutionary algorithms or heuristic methods is the guaranteed finding of an optimal solution. Our development platform is an intelligent camera system, based on a Virtex4 FX FPGA with an embedded PowerPC hardcore processor. Because image data is normally huge and stored in the external DDR memory, the first IM which was developed connects the PLB (system bus) to an SDI component and vice versa. This module operates similar to a DMA controller, but instead of just copying, the data is streamed out of and into the module between the load and store operations in order to pass through PUs. Our next step is to improve the ASP model so that it can be solved faster and is more accurate, especially concerning the resolution of timings, and to examine more complex task graphs.



An extension of the POTASSCO tools, which is currently in heavy development and capable of handling real numbers, will be used for this purpose. Our ASP model is already capable of generating architectures which include a partial reconfiguration of modules. The scheduling is also valid except for the case that a module is used, reconfigured and directly used again. Here the delay for the reconfiguration must be considered in the scheduling because it stalls the processing. It is possible to include this case in our model with little modification. In the future we are going to combine the work of this paper with our work in the domain of partial reconfiguration. REFERENCES
[1] Intel Inc., Open Computer Vision Library, http://www.intel.com/ research/mrl/research/opencv/, 2007.

[2] ETH Zurich, "PISA - A Platform and Programming Language Independent Interface for Search Algorithms," http://www.tik.ee.ethz.ch/pisa/, 2010. [3] H. Ishebabi and C. Bobda, "Automated architecture synthesis for parallel programs on FPGA multiprocessor systems," Microprocess. Microsyst., vol. 33, no. 1, pp. 63-71, 2009. [4] C. Anger, K. Konczak, T. Linke, and T. Schaub, "A glimpse of answer set programming," Künstliche Intelligenz, no. 1/05, pp. 12-17, 2005. [5] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, S. Thiele, and T. Schaub, "A User's Guide to gringo, clasp, clingo, and iclingo," Nov. 2008. [6] University of Potsdam, "Potassco - Tools for Answer Set Programming," http://potassco.sourceforge.net/, 2010. [7] M. Denecker, J. Vennekens, S. Bond, M. Gebser, and M. Truszczyński, "The second answer set programming competition," in Proceedings of the Tenth International Conference on Logic Programming and Nonmonotonic Reasoning (LPNMR'09), ser. Lecture Notes in Artificial Intelligence, E. Erdem, F. Lin, and T. Schaub, Eds., vol. 5753. Springer-Verlag, 2009, pp. 637-654.


Optimization Issues in Mapping AUTOSAR Components To Distributed Multithreaded Implementations


Ming Zhang, Zonghua Gu
College of Computer Science, Zhejiang University Hangzhou, China 310027 {editing, zgu}@zju.edu.cn
Abstract - AUTOSAR is a component-based modeling language and development framework for automotive embedded systems. Component-to-ECU mapping is conventionally done manually and empirically. As the number of components and ECUs in vehicle systems grows rapidly, it becomes infeasible to find optimal solutions by hand. We address some design issues involved in mapping an AUTOSAR model to a distributed hardware platform with multiple ECUs connected by a bus, each ECU running a real-time operating system. We present algorithms for extracting connectivity between ports of atomic software components from an AUTOSAR model and for calculating blocking times of all tasks of a taskset scheduled by PCP. We then address optimization issues in mapping AUTOSAR components (SWCs) to distributed multithreaded implementations. We formulate and solve two optimization problems: map SWCs to ECUs with the objective of minimizing the bus load; and, for a given SWC-to-ECU mapping, map runnable entities on each ECU to OS tasks and assign a data consistency mechanism to each shared data item to minimize the memory size requirement on each ECU while guaranteeing schedulability of the tasksets on all ECUs. Keywords - software component; ECU; schedulability; data consistency

I. INTRODUCTION

Today's automotive electrical and electronic systems are becoming more and more complex. In order to ease the development of automotive electronic systems, leading automobile companies and first-tier suppliers formed a partnership in 2003 and established AUTOSAR (AUTomotive Open System ARchitecture), a standard for automotive software development. According to AUTOSAR, application software components (SWCs) are platform-independent and need to be mapped to ECUs [1]. This mapping is an important step of system configuration. For a system consisting of multiple ECUs, with a number of application SWCs to be mapped to these ECUs, there can be many different mapping schemes, including valid and invalid schemes with respect to constraints (i.e., timing constraints of tasks). As the number of ECUs and application SWCs increases, it is inefficient and error-prone to perform the mapping manually in a trial-and-error manner. In this work, we propose an approach that works closely with the AUTOSAR model to automate the mapping process, which guarantees schedulability of tasks and consistency of data shared among application tasks, while minimizing the data rate over the bus as well as the memory overhead needed to protect data consistency.

W. Peng et al. [2] addressed the deployment optimization problem for AUTOSAR system configuration. In their work, an algorithm was presented to find an SWC-to-ECU mapping scheme that guarantees task schedulability while minimizing inter-ECU communication bandwidth. However, their work did not consider data shared among tasks and the corresponding protection mechanisms. Ferrari et al. [3] discussed several strategies for protecting shared data items and raised the issue of optimization for time and memory trade-offs, but did not propose any concrete algorithms. In this paper, we attempt to formulate and solve the optimization problems involved in mapping an AUTOSAR model to distributed multithreaded implementations. This paper is organized as follows: Section II introduces basic concepts of AUTOSAR. Section III describes our approach in detail, while Section IV presents two algorithms for extracting the connectivity of ports and calculating the blocking times of tasks. In Section V, two simple experiments on an application example demonstrate the correctness and effectiveness of our approach. Finally, this work is concluded in Section VI.

II. BASIC CONCEPTS OF AUTOSAR

According to AUTOSAR, application software is conceptually located above the AUTOSAR RTE and consists of platform-independent software components (SWCs). An SWC may have multiple ports. A port is either a P-Port or an R-Port. A P-Port provides output data while an R-Port requires input data. Each port is associated with a port interface. Two types of port interfaces, client-server and sender-receiver, are supported by AUTOSAR. A client-server interface defines a set of operations that can be invoked by a client and implemented by a server. A sender-receiver interface defines a set of data elements sent/received over the VFB. Runnable Entities (runnables for short) are the smallest executable elements. One component consists of at least one runnable. All runnables are activated by RTEEvents. If no RTEEvent is specified as StartOnEvent for a runnable, then the runnable is never activated by the RTE. Two categories of runnables are defined. Category 1 runnables do not have WaitPoints and have to terminate in finite time. Category 2 runnables always have at least one WaitPoint, or they invoke a server and wait for the response. At runtime, runnables within software components are grouped into tasks scheduled by the OS scheduler. The RTE generator is responsible for constructing the OS task bodies. Before the RTE generator takes the ECU configuration description as the input information to generate the code, the

978-1-4577-0660-8/11/$26.00 © 2011 IEEE


RTE configurator configures parts of the ECU-Configuration, e.g. mapping of runnables to tasks. III. PROBLEM FORMULATION


A. Outline Our approach consists of two phases. The first phase tries to find one or more optimal SWC-to-ECU mapping schemes with respect to the data rate over the inter-ECU bus, while trying to guarantee schedulability of the taskset on every ECU. This phase takes as input a set of interconnected atomic application SWCs [1] and a set of ECUs, and outputs a mapping scheme from the atomic application SWCs to the ECUs. In the second phase, a per-ECU optimization is performed. For each ECU, our approach tries to select a method for each data item to guarantee its data consistency as well as schedulability of the taskset on the ECU, while minimizing the memory overhead used to protect data consistency on the ECU. This phase takes as input a mapping scheme produced by the first phase, and outputs the selected method to guarantee data consistency for each data item. Both phases involve runnable-to-task mapping. We assume the worst-case execution time and the worst-case execution time of the longest critical section of every runnable are given. Before the two phases of optimization, the top-level component of a system [1] is decomposed into interconnected atomic SWCs. During this process, the connectivity between the ports of the atomic SWCs is maintained, as described in detail in Section IV.A. Given a set S of atomic application SWCs and a set E of ECUs, a mapping scheme is a function

map : S → E.   (1)

In the definition of the bus data rate (5), d denotes a data item transmitted between ECUs, R(e) the set of runnables mapped to ECU e, D(r) the set of data items transmitted by runnable r over the bus, size(d) the size of data item d, and T(d) the period of transmission of data item d. The first phase is formulated as the following optimization problem: find a mapping map that minimizes the total bus data rate (5), subject to the schedulability condition (2) for all tasks on all ECUs, with the blocking time Bi = 0 for every task.
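To make the shape of this first-phase problem concrete, the following is a small brute-force sketch that enumerates SWC-to-ECU assignments, keeps only those that pass a (stubbed) schedulability check, and returns the one with the lowest bus data rate. It is only an illustration of the problem statement; the data structures, the schedulability stub and the exhaustive search are assumptions and not the method proposed in the paper.

    #include <cstdio>
    #include <functional>
    #include <limits>
    #include <vector>

    struct Swc { double interEcuBytesPerSec; };   // simplified per-SWC traffic contribution (assumed model)

    // Stub: a real implementation would run response-time analysis per ECU.
    bool schedulable(const std::vector<int>& mapping, int numEcus) { (void)mapping; (void)numEcus; return true; }

    // Bus load of a mapping: traffic of every connection whose sender and receiver
    // SWCs end up on different ECUs; a real model would inspect port connectivity.
    double busLoad(const std::vector<Swc>& swcs, const std::vector<int>& mapping,
                   const std::vector<std::pair<int,int>>& connections)
    {
        double load = 0.0;
        for (auto [src, dst] : connections)              // (sender SWC, receiver SWC)
            if (mapping[src] != mapping[dst])            // different ECUs -> over the bus
                load += swcs[src].interEcuBytesPerSec;
        return load;
    }

    int main()
    {
        std::vector<Swc> swcs = { {100}, {50}, {200} };                    // illustrative values
        std::vector<std::pair<int,int>> connections = { {0,1}, {1,2} };
        const int numEcus = 2;

        std::vector<int> mapping(swcs.size(), 0), best;
        double bestLoad = std::numeric_limits<double>::infinity();

        // Enumerate all numEcus^numSwcs assignments.
        std::function<void(std::size_t)> search = [&](std::size_t i) {
            if (i == swcs.size()) {
                if (!schedulable(mapping, numEcus)) return;
                double load = busLoad(swcs, mapping, connections);
                if (load < bestLoad) { bestLoad = load; best = mapping; }
                return;
            }
            for (int e = 0; e < numEcus; ++e) { mapping[i] = e; search(i + 1); }
        };
        search(0);

        std::printf("best bus load: %.1f\n", bestLoad);
        for (std::size_t i = 0; i < best.size(); ++i)
            std::printf("SWC %zu -> ECU %d\n", i, best[i]);
        return 0;
    }

Exhaustive enumeration scales exponentially with the number of SWCs, which is exactly why the paper argues that the mapping cannot be done by hand or by trial and error for realistic system sizes.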

After the application SWCs are mapped to the ECUs, the consistency of data shared among tasks on each ECU needs to be guaranteed. We consider two methods mentioned in [3]: semaphore locks and rate transition blocks. A semaphore lock incurs negligible memory overhead while introducing significant delays; rate transition blocks incur negligible time overhead but require additional memory space to store multiple copies of the data. For each data item d shared among tasks on a given ECU, one of the two methods is selected. Hence, a function

prot : D(e) → {SL, RTB}   (8)

is to be found, where D(e) is the set of data items shared by tasks on ECU e.

map(c) gives the ECU to which SWC c is mapped. For the mapping to meet the timing constraints, every task τi in the taskset on each ECU must finish before its deadline Di, that is, its worst-case response time Ri satisfies

Ri ≤ Di.   (2)

prot(d) gives the method to protect the consistency of data item d (SL: semaphore lock or RTB: rate transition blocks), where d is a data item shared by tasks on the given ECU. As in the first phase, schedulability of the tasks on the ECU must be guaranteed. In this phase, data sharing is taken into account: as required by PCP [5], the blocking time Bi is the longest critical section of the lower-priority tasks of τi, which is given by (4). In addition, the memory overhead on the ECU introduced by rate transition blocks is to be minimized:

Mem(e) = Σ_{d : prot(d) = RTB} copies(d) · size(d).   (9)

In this work, we assume that the Priority Ceiling Protocol (PCP) [5] is applied as the scheduling and locking policy, under which the worst-case response time and blocking time of a task τi are given by

Ri = Ci + Bi + Σ_{τj ∈ hp(τi)} ⌈Ri / Tj⌉ · Cj,   (3)

Bi = max { CSj | τj ∈ lp(τi) },   (4)

where Ci and Ti are the worst-case execution time and period of τi, hp(τi) and lp(τi) are the sets of higher- and lower-priority tasks, and CSj is the longest critical section of τj.

Where is the number of copies of . Since the original copy exists even if no mechanism is applied to guarantee data consistency, the original copy is not counted as overhead. To determine cases: , we need to consider three

During the first phase, we do not consider data sharing among tasks, hence, for this phase, 0. In addition, the total data rate over the bus minimized. We define as: (5) needs to be

If has only one writer and no reader, i.e., is written by one task and no task reads it, no extra copy is needed. If has only readers and no writer, i.e., is read by some tasks but written by no task, the original copy suffices.

24

If has more than one writers, or both a writer and a reader, i.e., is written by more than one tasks, or written by some tasks and read by other tasks, then a copy is needed for each of the writers and another copy is required for all the readers, including the original copy. In other words, the number of extra copies is equal to the number of writers.

Combining all of the three cases above, we define as: . (10) Where and are the number of writer tasks and reader tasks of , respectively, and evaluates to 1 when has a reader and 0 when has no reader. The second phase of our approach is formulated as the following optimization problem: For a given ECU , Find Minimize Subject to . with given by (4) for every on given by (4) is for all ,
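To make the second-phase objective concrete, the following Python sketch computes the number of extra copies in (10) and the memory overhead in (9) for a given consistency assignment. The data structures and names (writers, readers, size, the task and item names) are illustrative assumptions for this sketch only; they are not part of the AUTOSAR meta-model or of the paper's implementation.

```python
# Illustrative sketch of the memory-overhead objective, eqs. (9)-(10).
# The dictionaries below are hypothetical inputs, not AUTOSAR artifacts.

def extra_copies(n_writers: int, n_readers: int) -> int:
    """Number of extra copies needed by rate transition blocks, eq. (10)."""
    if n_writers == 0:          # only readers: the original copy suffices
        return 0
    has_reader = 1 if n_readers > 0 else 0
    return n_writers - 1 + has_reader

def memory_overhead(assignment, writers, readers, size):
    """Memory overhead of a consistency assignment, eq. (9).

    assignment maps each shared data item to 'SL' or 'RTB';
    only items protected by rate transition blocks cost memory.
    """
    total = 0
    for item, method in assignment.items():
        if method == 'RTB':
            total += extra_copies(len(writers[item]), len(readers[item])) * size[item]
    return total

# Example: one item written by two tasks and read by one task, 36 bytes each.
writers = {'d0': {'task0', 'task1'}, 'd1': {'task0'}}
readers = {'d0': {'task2'}, 'd1': set()}
size = {'d0': 36, 'd1': 36}
print(memory_overhead({'d0': 'RTB', 'd1': 'SL'}, writers, readers, size))  # 72
```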

In the outline described above, there are a few pending problems: mapping runnables to tasks; identification of the data items transmitted between ECUs; and identification of the data items shared by more than one task on the same ECU. We address these problems next.

B. Mapping Runnables to Tasks

In this work, we take into account only periodic runnables. We map runnables to tasks in a simple way that is common practice in the industry: all runnables on the same ECU with the same period are mapped to the same task. Further, we define the period of the task as the period of the runnables mapped to it,

    T(τ) = T(r) for every runnable r mapped to τ.    (11)

The worst-case execution time of the task is defined as the sum of the worst-case execution times of the runnables mapped to it,

    C(τ) = Σ_{r mapped to τ} C(r).    (12)

Therefore, the period of a task is unique among all the tasks on the same ECU. By rate-monotonic priority assignment [4], the priority of each task is also unique.

C. Identification of Data Items Transmitted between ECUs

According to AUTOSAR [1], an atomic application SWC must be mapped to an ECU as a whole. From the viewpoint of the sender SWC, this implies that it can send data to a remote ECU only via its ports. The structure of a data item is defined by a data element in a port interface. Therefore, we identify a data item transmitted over the bus by a port-data element pair (p, de). From the data element, the size of the data item, size(p, de), can be obtained. From the viewpoint of a runnable, a data item it transmits is referenced by its data send point or data write access. Hence we obtain the period of the transmission of a data item (p, de) from the runnable that has a data send point or data write access referencing it,

    period(p, de) = T(run(p, de)),    (13)

where run(p, de) denotes the runnable that transmits the data item identified by (p, de).

In order to identify data items transmitted over the bus, instead of via memory on the local ECU, it is necessary to find out whether there is a receiver SWC on a remote ECU. We tackle this problem in two steps. First, we define a function

    conn : P → 2^R,    (14)

where conn(p) gives the set of all R-Ports that are connected with the given P-Port p. The algorithm for calculating conn(p) is described in detail in Section IV.A. Then, for each R-Port q in conn(p), the ECU to which the owner SWC of q is mapped is compared with that of p. If there exists a q in conn(p) whose owner SWC is mapped to a different ECU than the owner SWC of p, or formally,

    ∃ q ∈ conn(p) : M(owner(q)) ≠ M(owner(p)),    (15)

where owner(p) is the owner SWC of p and M(owner(p)) is the ECU to which owner(p) is mapped, then every data item transmitted via p must be transmitted via the bus. With (12), (13), (14), and (15), we can restate the first-phase optimization problem concretely: its objective becomes

    D(M) = Σ_{(p, de)} x(p) · size(p, de) / period(p, de),    (16)

where

    x(p) = 1 if ∃ q ∈ conn(p) : M(owner(q)) ≠ M(owner(p)), and x(p) = 0 otherwise,    (17)

and its schedulability constraint is checked with the task periods and execution times given by (11) and (12).
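As an illustration of the runnable-to-task mapping in (11)-(12) and of the bus data-rate computation in (13)-(17), the following Python sketch groups runnables by ECU and period and sums the data rates of the items that cross the bus. All inputs and function names are hypothetical; the sketch only mirrors the definitions above under the stated assumptions (times in seconds, sizes in bytes).

```python
# Illustrative sketch of runnable-to-task mapping (eqs. (11)-(12)) and of the
# bus data-rate computation (eqs. (13)-(17)). All names and inputs below are
# hypothetical examples that mirror the structures described in the text.
from collections import defaultdict

def build_tasks(runnables, mapping):
    """Group runnables of the same ECU and period into one task; sum WCETs."""
    tasks = defaultdict(lambda: {'C': 0.0})
    for r in runnables:
        ecu = mapping[r['swc']]
        key = (ecu, r['period'])
        tasks[key]['C'] += r['wcet']          # eq. (12)
        tasks[key]['T'] = r['period']         # eq. (11), deadline = period
    return dict(tasks)

def bus_data_rate(sent_items, conn, owner, mapping):
    """Data rate over the inter-ECU bus for a given mapping (eqs. (15)-(17))."""
    rate = 0.0
    for (p, de), item in sent_items.items():
        remote = any(mapping[owner[q]] != mapping[owner[p]] for q in conn[p])
        if remote:                             # x(p) = 1 in eq. (17)
            rate += item['size'] / item['period']
    return rate

# Tiny example: one 36-byte item sent every 80 ms to a remote ECU -> 450 B/s.
runnables = [{'swc': 'CP01', 'period': 0.08, 'wcet': 0.025}]
mapping = {'CP01': 'Ecu0', 'CP12': 'Ecu1'}
owner = {'P0': 'CP01', 'R0': 'CP12'}
conn = {'P0': {'R0'}}
sent = {('P0', 'DE00'): {'size': 36, 'period': 0.08}}
print(build_tasks(runnables, mapping))
print(bus_data_rate(sent, conn, owner, mapping))  # 450.0
```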

D. Identification of Data Items Shared by Tasks on the Same ECU

In an AUTOSAR model, data shared by runnables of application SWCs can be classified into two categories: data shared by runnables on the same ECU, and data shared by runnables on different ECUs. For the first category, race conditions may occur, since the data is shared via memory. For the second category, data consistency is not an issue, since the communication is via message passing. From the viewpoint of application SWCs, data shared by runnables on the same ECU comes in two forms: data shared by runnables of the same atomic SWC, i.e., inter-runnable variables; and data shared by runnables of different atomic SWCs, transmitted and received by different SWCs via their ports.

According to the AUTOSAR specification [1], an inter-runnable variable v is referenced by the runnables that read or write it. By counting the tasks that reference v as a writer or a reader, we obtain nw(v) and nr(v), respectively. For shared data items in the form of an inter-runnable variable, (9) can be rewritten as

    O_irv(P) = Σ_{v : P(v) = RTB} n_v · size(v).    (18)

For the latter form, it is easy to prove the following lemma.

Lemma 1. If data is transmitted by an atomic SWC via a P-Port p and received by another atomic SWC via an R-Port q, the data passes exactly one assembly connector.

The concept of an assembly connector is defined in [1]. Note that it is possible for different data items transmitted via a P-Port to pass different assembly connectors, since AUTOSAR allows a port to be connected to more than one port. The same holds for a data item received via an R-Port. We identify a data item shared by runnables of different SWCs on the same ECU with a pair (AC, de), where AC is the assembly connector the data item passes, and de is the data element in the port interface associated with the port via which the data item is transmitted, i.e., the P-Port connected by the assembly connector AC.

To make sure that the data item is shared by runnables on the same ECU, the P-Port via which the data item is transmitted and the R-Port via which the data item is received must belong to SWCs mapped to the same ECU, or formally,

    y(AC) = 1 if M(owner(PPort(AC))) = M(owner(RPort(AC))), and y(AC) = 0 otherwise.    (19)

By counting the tasks (possibly with runnables of different SWCs mapped to them) on the given ECU that reference the data item identified by (AC, de) as a writer or a reader, we obtain nw(AC, de) and nr(AC, de), respectively. To determine whether a data item (AC, de) is read or written by a runnable r, it needs to be determined whether the data item transmitted by r passes AC. We define a function

    ACs : P → 2^{AC},    (20)

where ACs(p) gives the set of assembly connectors that data transmitted via p could pass. For data items shared by runnables of different SWCs on the same ECU, (9) is rewritten analogously to (18), restricted by (19) to those connectors whose ports belong to SWCs on the same ECU. Note that (19) and (20) are used by the counting process that determines nw(AC, de) and nr(AC, de), which in turn are used to find n_(AC, de). Considering both the inter-runnable variables of (18) and the connector-based data items, we can rewrite (9) as

    O(P) = Σ_{v : P(v) = RTB} n_v · size(v) + Σ_{(AC, de) : P(AC, de) = RTB} y(AC) · n_(AC, de) · size(AC, de).    (21)

E. Genetic Algorithm

We use the NSGA-II [6] variant of the Genetic Algorithm to solve the optimization problems described in the first subsection of this section. For the optimization of the SWC-to-ECU mapping, we encode each individual as a vector, with the i-th element representing the ECU to which SWC ci is mapped. To recombine two individuals g and h, the cross-over operator randomly picks a subset S of C each time, with |S| < n, and exchanges the values of the corresponding elements of g and h for every ci ∈ S. The mutation operator also picks a subset S of C, and maps each ci ∈ S to a random ECU by assigning that ECU to the corresponding element in the target individual.

To optimize the selection of the method used to guarantee data consistency of shared data for a given ECU, the individual encoding, cross-over operator and mutation operator are similar to those used for the optimization of the SWC-to-ECU mapping scheme. Each individual is encoded as a vector, with the i-th element representing the method (SL: semaphore lock or RTB: rate transition block) selected to guarantee data consistency of data item di. To recombine two individuals, the cross-over operator randomly picks a subset S of the set of all shared data items D, with |S| < |D|, each time, and exchanges the values of the corresponding elements. The mutation operator also picks a subset of D, and assigns a random method from the available data consistency methods to each picked element in the target individual.
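The following Python sketch illustrates the kind of vector encoding and subset-based cross-over and mutation operators described above. It is a simplified, hypothetical illustration only; the NSGA-II selection machinery from [6] (non-dominated sorting, crowding distance) is deliberately omitted, and the example individuals are invented.

```python
# Simplified sketch of the chromosome encoding and variation operators
# described above (not the full NSGA-II implementation from [6]).
import random

def crossover(parent_a, parent_b):
    """Exchange the genes of a randomly chosen subset of positions."""
    child_a, child_b = list(parent_a), list(parent_b)
    positions = random.sample(range(len(parent_a)),
                              k=random.randint(1, len(parent_a) - 1))
    for i in positions:
        child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def mutate(individual, alphabet, rate=0.1):
    """Reassign a random subset of genes to random values from the alphabet."""
    mutant = list(individual)
    for i in range(len(mutant)):
        if random.random() < rate:
            mutant[i] = random.choice(alphabet)
    return mutant

# SWC-to-ECU mapping: genes are ECU indices; consistency assignment: 'SL'/'RTB'.
mapping_a = [0, 0, 1, 1, 0, 1]          # six SWCs, two ECUs
mapping_b = [1, 1, 0, 0, 1, 0]
print(crossover(mapping_a, mapping_b))
print(mutate(['SL', 'SL', 'RTB'], alphabet=['SL', 'RTB']))
```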

IV. ALGORITHMS

A. Deriving Port Clusters and Port Connectivity from AUTOSAR Models

In the last section, we defined the function conn(p), which finds the set of all R-Ports that are connected with a given P-Port p, and the function ACs(p), which finds the set of assembly connectors that data transmitted via p could pass. We define a port cluster to be a pair (PSet, RSet), where PSet is a set of P-Ports of atomic SWCs, RSet is a set of R-Ports of atomic SWCs, and data transmitted via any P-Port in PSet can be received by every R-Port in RSet.

By Lemma 1, it is obvious that a port cluster corresponds to exactly one assembly connector, which data transmitted by a P-Port in PSet passes before being received via one or more R-Ports in RSet. In this subsection, we propose an algorithm that finds, for every port whose owner is an atomic SWC, the port clusters the port belongs to. Our algorithm starts with the top-level composition [1] of a system and performs a breadth-first traversal through the hierarchical structure of the components. In this process, for every composition, a port cluster is created for each assembly connector, with the P-Port the connector connects added to its PSet and the R-Port added to its RSet; every outer port of the composition that has been added to one or more port clusters is conceptually replaced with the inner ports that are connected to it by delegation connectors [1]. This process continues until all compositions are processed, at which point every port in every port cluster belongs to an atomic SWC. The pseudo-code of the algorithm is as follows:

Algorithm 4.1 (find port clusters). Input: top-level component CP0

    add CP0 to queue Q
    while Q is not empty
        remove CP1 from Q
        if CP1 is a composition
            clear DelegationMap
            for each connector Cn0 in CP1
                if Cn0 is an assembly connector
                    create a port cluster PC(Cn0) = (PSet, RSet)
                    add Cn0.PPort to PSet; add Cn0.RPort to RSet
                    add PC(Cn0) to PortToClusterMap
                else /* Cn0 is a delegation connector */
                    add (Cn0.outerPort, Cn0.innerPort) to DelegationMap
                end if
            end for each
            update PortToClusterMap with DelegationMap
            add all ComponentPrototypes in CP1 to Q
        end if
    end while
    for each (p, PCSet(p)) in PortToClusterMap
        /* p is a port; PCSet(p) is the set of all port clusters containing p */
        for each PC(PSet, RSet) in PCSet(p)
            if p is a P-Port
                add p to PSet
            else /* p is an R-Port */
                add p to RSet
            end if
        end for each
    end for each
    return PortToClusterMap

In the pseudo-code above, DelegationMap is used to track the inner ports that are directly connected with each outer port. The pseudo-code of the "update PortToClusterMap with DelegationMap" step is as follows:

    for each (outerPort, innerPortSet) in DelegationMap
        /* innerPortSet is the set of all inner ports connected directly with outerPort */
        find PCSet(outerPort) in PortToClusterMap
        /* PCSet(outerPort) is the set of all port clusters that contain outerPort */
        if PCSet(outerPort) exists in PortToClusterMap
            /* outerPort is connected from the outside of its owner ComponentPrototype */
            for each innerPort in innerPortSet
                find PCSet(innerPort) in PortToClusterMap
                add (innerPort, PCSet(innerPort) ∪ PCSet(outerPort)) to PortToClusterMap
            end for each
        end if
        remove (outerPort, PCSet(outerPort)) from PortToClusterMap
    end for each

Note that an inner port may belong to multiple port clusters. When updating an entry (innerPort, PCSet(innerPort)), care must be taken not to lose the port clusters previously associated with innerPort.

B. Calculating Blocking Times of Tasks

In this subsection, we propose an algorithm to calculate Bi in (4) for all tasks, which have been sorted by priority (from the highest to the lowest). In the context of this paper, we do not distinguish between a semaphore and a shared data item. Our algorithm performs two scans through the sorted list of tasks. First, a scan is performed from the highest priority to the lowest priority, which determines the ceiling ceil(s) of every semaphore s protecting a shared data item. Then, a second scan is performed from the lowest priority to the highest priority. During this scan, our algorithm maintains a map from each semaphore s to the longest critical section that uses s. For every task τi, this scan performs the following steps one by one:

- remove from the map the semaphores that cannot contribute to the blocking time of τi, i.e., all s with ceil(s) < prio(τi);
- find the longest critical section currently in the map and save its length as Bi;
- add all semaphores encountered for the first time during this scan, i.e., all s whose lowest-priority user is τi, to the map along with the critical section of τi that uses s (we use the priority of the lowest-priority task that uses s to decide this);
- for each remaining semaphore s used by τi, update the critical section currently associated with s in the map, if it is shorter than the critical section of τi that uses s.
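As a sanity check of the two-scan procedure, here is a small Python sketch of the blocking-time computation under PCP described above. The task and semaphore names are invented for the example; the sketch assumes that each task's longest critical section per semaphore is known, as stated in Section III.

```python
# Illustrative sketch of the two-scan blocking-time computation described
# above (PCP). Task and semaphore structures are hypothetical examples.

def blocking_times(tasks, usage):
    """tasks: list of task ids sorted from highest to lowest priority.
    usage[(task, sem)]: length of the longest critical section of `task`
    guarded by semaphore `sem` (absent if the task does not use it)."""
    prio = {t: len(tasks) - i for i, t in enumerate(tasks)}  # larger = higher

    # Scan 1 (highest -> lowest priority): priority ceiling of each semaphore.
    ceiling = {}
    for t in tasks:
        for (task, sem), _ in usage.items():
            if task == t and sem not in ceiling:
                ceiling[sem] = prio[t]

    # Scan 2 (lowest -> highest priority): longest blocking critical section.
    longest = {}   # semaphore -> longest critical section of lower-priority tasks
    blocking = {}
    for t in reversed(tasks):
        # Semaphores whose ceiling is below prio[t] cannot block t.
        longest = {s: L for s, L in longest.items() if ceiling[s] >= prio[t]}
        blocking[t] = max(longest.values(), default=0)
        # Record t's own critical sections for the higher-priority tasks.
        for (task, sem), L in usage.items():
            if task == t:
                longest[sem] = max(longest.get(sem, 0), L)
    return blocking

tasks = ['task0', 'task1', 'task2']            # task0 has the highest priority
usage = {('task1', 's0'): 8, ('task2', 's0'): 5, ('task0', 's0'): 2}
print(blocking_times(tasks, usage))            # {'task2': 0, 'task1': 5, 'task0': 8}
```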

V. APPLICATION EXAMPLE

In this section, we describe two experiments on an application example, which consists of 6 atomic SWCs to be mapped to 2 ECUs. The hierarchy of SWCs, along with the connectors, is shown in Figure 1. For simplicity, each atomic SWC contains one and only one runnable, each port interface contains only one data element, and the size of each data element is 36 bytes. There is no inter-runnable variable. The worst-case execution times of the runnables are shown in TABLE I. The maximum lengths of the critical sections of the runnables, along with the accessed shared data items, are shown in TABLE II, where each shared data item accessed by a runnable is represented by the port and the data element.

Figure 1. Hierarchy of SWCs and connectors of the application example.

TABLE I. WORST-CASE EXECUTION TIMES OF RUNNABLES IN MS (RE DENOTES RUNNABLE ENTITY)
SC0/CP2, RE40 = 100
SC0/CP1/CP12, RE30 = 30
SC0/CP0/CP01, RE10 = 25
SC0/CP1/CP11, RE20 = 10
SC0/CP0/CP00, RE00 = 5
SC0/CP1/CP10, RE10 = 25

TABLE II. CRITICAL SECTIONS OF RUNNABLES IN MS (DE DENOTES DATA ELEMENT SHARED BETWEEN DIFFERENT RUNNABLES)
SC0/CP2, RE40, R0, DE00 = 10
SC0/CP1/CP12, RE30, R1, DE00 = 2
SC0/CP1/CP12, RE30, R0, DE00 = 8
SC0/CP1/CP12, RE30, P0, DE00 = 3
SC0/CP0/CP01, RE10, R0, DE00 = 5
SC0/CP0/CP01, RE10, P0, DE00 = 6
SC0/CP1/CP11, RE20, R0, DE00 = 1
SC0/CP1/CP11, RE20, P0, DE00 = 2
SC0/CP0/CP00, RE00, P1, DE00 = 1
SC0/CP0/CP00, RE00, P0, DE00 = 1
SC0/CP1/CP10, RE10, R0, DE00 = 5
SC0/CP1/CP10, RE10, P0, DE00 = 6

A. Experiment 1

In this experiment, we map the atomic SWCs to the ECUs. Three mapping schemes are found, all with the same data rate over the bus of 450 B/s. The first mapping scheme maps CP00, CP01 and CP2 to EcuInstance0, and CP10, CP11 and CP12 to EcuInstance1. From Figure 1 we can see that all data on the bus comes from P0 of CP01. This port is written by the runnable RE10 of CP01 with a period of 80 ms. Hence the data rate over the bus is 36 B / 80 ms = 450 B/s. (The numerical values are for illustration purposes only.) The second mapping scheme maps CP00 and CP01 to EcuInstance0, and CP2, CP10, CP11 and CP12 to EcuInstance1. The data rate over the bus is the same as for the first scheme. The third mapping scheme is actually the same as the first mapping scheme except that the ECUs are interchanged, thus resulting in the same data rate over the bus, since we assume a homogeneous hardware platform consisting of identical ECUs.

The task sets on EcuInstance0 and EcuInstance1 under the first scheme are shown in TABLE III and TABLE IV, respectively. The task sets under the second scheme are shown in TABLE V and TABLE VI, respectively. We can see that all tasks meet their deadlines.

TABLE III. TASKSET ON ECUINSTANCE0 UNDER 1ST MAPPING SCHEME
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     30            5         0         5
Task 1     80            25        0         30
Task 2     620           100       0         210

TABLE IV. TASKSET ON ECUINSTANCE1 UNDER 1ST MAPPING SCHEME
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     40            10        0         10
Task 1     80            25        0         35
Task 2     132           30        0         75

TABLE V. TASKSET ON ECUINSTANCE0 UNDER 2ND MAPPING SCHEME
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     30            5         0         5
Task 1     80            25        0         30

TABLE VI. TASKSET ON ECUINSTANCE1 UNDER 2ND MAPPING SCHEME
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     40            10        0         10
Task 1     80            25        0         35
Task 2     132           30        0         75
Task 3     620           100       0         600
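The WCRT columns in the tables above match classic fixed-priority response-time analysis with blocking, R = C + B + Σ_j ceil(R/T_j)·C_j over the higher-priority tasks j. The paper does not spell out the exact schedulability test it uses, so the following Python sketch is an assumed, illustrative reconstruction; it reproduces TABLE IV.

```python
# Hedged reconstruction of the schedulability check (assumed response-time
# analysis with blocking); the paper's own test may differ in detail.
import math

def wcrt(task, higher_priority, deadline):
    """Iterate the response-time recurrence until it converges or misses."""
    r = task['C'] + task['B']
    while True:
        nxt = task['C'] + task['B'] + sum(math.ceil(r / h['T']) * h['C']
                                          for h in higher_priority)
        if nxt == r or nxt > deadline:
            return nxt
        r = nxt

# TABLE IV: taskset on EcuInstance1 under the 1st mapping scheme.
taskset = [{'T': 40, 'C': 10, 'B': 0},
           {'T': 80, 'C': 25, 'B': 0},
           {'T': 132, 'C': 30, 'B': 0}]
for i, t in enumerate(taskset):                      # rate-monotonic order
    print(i, wcrt(t, taskset[:i], deadline=t['T']))  # -> 10, 35, 75
```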

Next, we assign a data consistency method to each shared data item based on the 1st mapping scheme. All the local shared data items are protected with semaphore locks, hence the memory overhead is minimal (0). The task sets on EcuInstance0 and EcuInstance1 after data consistency method assignment are shown in TABLE VII and TABLE VIII, respectively. Again, we can see that all tasks meet their deadlines.

TABLE VII. TASKSET ON ECUINSTANCE0 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     30            5         5         10
Task 1     80            25        0         30
Task 2     620           100       0         210

TABLE VIII. TASKSET ON ECUINSTANCE1 AFTER DATA CONSISTENCY METHOD ASSIGNMENT
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     40            10        5         15
Task 1     80            25        8         53
Task 2     132           30        0         75

B. Experiment 2

In this experiment, we consider the SWC-to-ECU mapping under the 1st mapping scheme (TABLE III and TABLE IV), and increase the length of the critical section in which the runnable RE30 of SC0/CP1/CP12 accesses the shared data item identified by (R0, DE00) (3rd line in TABLE II) from 8 ms to 60 ms. If all shared data items are protected with semaphore locks, the task set is not schedulable due to excessive blocking time, as shown in TABLE IX. We then run our algorithm for data consistency method assignment on EcuInstance1. This time, our approach assigns a rate transition block to the shared data item identified by (A0, DE00), as shown in TABLE X. Now the task set is schedulable, as shown in TABLE XI.

TABLE IX. NON-SCHEDULABLE TASKSET ON ECUINSTANCE1
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     40            10        5         15
Task 1     80            25        60        115
Task 2     132           30        0         75

TABLE X. DATA CONSISTENCY METHOD ASSIGNMENT ON ECUINSTANCE1 AFTER MODIFICATION
SC0, A2, DE00: Lock
SC0/CP1, A1, DE00: Lock
SC0/CP1, A0, DE00: Rate Transition Block
Memory overhead: 36.0

TABLE XI. SCHEDULABLE TASKSET ON ECUINSTANCE1 FOR THE DATA CONSISTENCY METHOD IN TABLE X
Task ID    T (ms) = D    C (ms)    B (ms)    WCRT (ms)
Task 0     40            10        5         15
Task 1     80            25        2         37
Task 2     132           30        0         75

VI. CONCLUSIONS

As vehicle electronic systems become increasingly complex, the number of software components and the number of ECUs have also increased, making it difficult or infeasible to find optimal SWC-to-ECU mapping schemes manually. In this work, we present an approach to automate the mapping process, which guarantees schedulability of tasks and consistency of data shared among tasks, while minimizing the data rate over the bus as well as the memory overhead needed to protect data consistency. Along with our approach, we present an algorithm for extracting the connectivity between ports of atomic software components from an AUTOSAR model and an algorithm for calculating blocking times of tasks under PCP. Finally, we use an application example to show the correctness and effectiveness of the proposed techniques.

VII. ACKNOWLEDGEMENTS

This work was supported by NSFC Project Grants #61070002 and #60736017, and by National Important Science & Technology Specific Projects under Grants No. 2009ZX01038001 and 2009ZX01038-002.

REFERENCES
[1] AUTOSAR GbR, "AUTOSAR Specifications," AUTOSAR Development Partnership, 2008, Release 3.0.
[2] Wei Peng, Hong Li, Min Yao, Zheng Sun, "Deployment Optimization for AUTOSAR System Configuration," International Conference on Computer Engineering and Technology, vol. 4, pp. 4189-4193, 2010.
[3] Alberto Ferrari, Marco Di Natale, Giacomo Gentile, Giovanni Reggiani, Paolo Gai, "Time and memory tradeoffs in the implementation of AUTOSAR components," Design, Automation & Test in Europe Conference & Exhibition, pp. 864-869, 2009.
[4] C. L. Liu, J. W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment," Journal of the ACM, vol. 20, no. 1, pp. 46-61, 1973.
[5] L. Sha, R. Rajkumar, J. Lehoczky, "Priority Inheritance Protocols: An Approach to Real-Time Synchronization," IEEE Transactions on Computers, vol. 39, no. 9, pp. 1175-1185, 1990.
[6] Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, T. Meyarivan, "A Fast Elitist Non-dominated Sorting Genetic Algorithm for Multi-objective Optimisation: NSGA-II," Parallel Problem Solving from Nature VI, pp. 849-858, 2000.


FPGA Design for Monitoring CANbus Traffic in a Prosthetic Limb Sensor Network
A. Bochem, J. Deschenes, J. Williams, K.B. Kent, Y. Losier+
Faculty of Computer Science, Institute of Biomedical Engineering+
University of New Brunswick, Fredericton, Canada
{alexander.bochem, justin.deschenes, jeremy.williams, ken, ylosier}@unb.ca

Abstract—This paper presents a successful implementation of a Field Programmable Gate Array (FPGA) CANbus monitor for embedded use in a prosthesis device, the University of New Brunswick (UNB) hand. The monitor collects serial communications from two separate Controller Area Networks (CAN) within the prosthetic limb's embedded system. The information collected can be used by researchers to optimize performance and monitor patient use. The data monitor is designed with an understanding of the constraints inherent in both the prosthesis industry and embedded systems technologies. The design uses a number of Verilog logic cores which compartmentalize individual logic areas, allowing for more successful validation and verification through both simulations and practical experiments.

Keywords: FPGA; CANbus; Data Monitor; Prosthesis; Verilog

I. INTRODUCTION

In the prosthetics field, various research institutions and commercial vendors are currently developing new microprocessor-based prosthetic limb components which use a serial communication bus. Although some groups' efforts have been of a proprietary nature, many have expressed interest in the development of an open bus communication standard. The goal is to simplify the interconnection of these components within a prosthetic limb system and to allow the interchangeability of devices from different manufacturers. This initiative is still in development and will undoubtedly face some obstacles during its development and implementation, as there are currently no embedded devices available to reliably monitor the bus activity for the newly developed protocol. The open bus communication standard uses the

CAN bus protocol as its underlying hardware communication platform. Higher levels of the protocol define the initialization, inter-module communication, and data streaming capabilities. Commercially available off-the-shelf CANbus logic analyzers, although capable of decoding the primary CAN fields, are unable to interpret the protocol messages in order to provide detailed information on the system behavior. The design of an FPGA-based prosthetic limb data monitor will allow embedded system engineers to monitor the new protocol's communication activity occurring in the system. This provides an effective development tool that not only helps develop new prosthetic limb components but also advances the open bus standard initiative. This monitor will help develop new prosthetic components as well as provide a means to assess their rehabilitation effectiveness. The design of the system will be flexible enough to meet future needs and follow current standards, such as simplifying the work required for end users to utilize the system. Furthermore, the monitor's data logging capabilities will allow the prosthetic fitting rehabilitation team to analyze the amputee's daily use of the system in order to assess its rehabilitation effectiveness. The evaluation of the data monitor's capabilities will be performed in conjunction with UNB researchers who are leading members of the Standardized Communication Interface for Prosthetics forum [1]. Section 2 of the paper outlines the field of biomedical engineering and presents an overview of related work. Section 3 gives an introduction to the CANbus standard. Section 4 covers the system design and its implementation. Section 5


shows how the functionality of the system has been tested and evaluated. Finally, Section 6 concludes the paper. II. E MBEDDED S YSTEMS Within this section we will look at traditional embedded system and biomedical engineering projects. It will highlight the characteristics of some approaches with the applied technology and projects that implement CANbus communication in particular. Initially, robots were controlled by large and expensive computers requiring a physical connection to link the control unit to the robot. Today the shrinking size and cost of embedded systems and the advances in communication, specically wireless methods, have allowed smaller, cheaper mobile robots. Robots operate and interact with the physical world and thus require solutions to hard real-time problems. These solutions must be robust and take into account imperfections in the world. These autonomous systems usually consists of sensors and actuators; the sensors collect information coming into the system and the actuators are the outputs and can be used to interact with the outside environment. The signal and control data is sent to a central processing unit which runs the main operating system. For reducing power consumption and complexity, these sensor networks use communication buses to exchange data. In Zhang et al. [2] the researchers started with a typical robotic arm setup which included a number of controllers and command systems communicating through a communication bus to one another. The researchers, who consider the communication system the most important aspect of the space arm design set out to improve it. Their system utilized the Controller Area Network (CAN) communication bus and enhanced reliability through the implementation of a redundant strategy. The researchers cited features of the CANbus that are desirable, which include: error detection mechanisms, error handling by priority, adaptability and a high costperformance ratio. In order to increase reliability, they implemented a Hot Double Redundancy technology. This technology was implemented as the communication system which consisted of an ARM microprocessor, the CANbus controller circuit, data storage, system memory and a complex programmable logic device (CPLD) used to implement a redundant strategy. The CPLD interfaced with two redundant CAN controller circuits, which were each connected to their own set of system

devices, while taking commands from the main microprocessor. The logic to handle the Hot aspect of the technology, the ability to switch from one system to the other without any down time, was implemented in the hardware denition language VHDL and put onto the CPLD. This increases the redundancy, but also signicantly improves the reliability by handling major system faults without down time and increases the exibility of the design by having hardware components which can be updated and changed without physical contact. The process was tested through the Quartus II software tool from Altera, which simulated normal and extreme activity for a period of sixteen to thirty-two hours all the while alternating between the two redundant systems. The researchers reported the tests to be successful, with a 100% transmission rate and no error frames or losses due to the redundancy switch time. Biomedical engineering is an industry which requires a multi-disciplinary skillset in which engineering principles are applied to medicine and the life sciences. In recent years renewed interest from the American military has promoted the advancement of articial limb technology. In 2005, the Defense Advanced Research Projects Agency (DARPA) launched two prosthetic revolution program, revolutionizing prosthetics 2007 (RP2007) and revolutionizing prosthetics 2009 (RP2009) [3]. The goal of RP2007 was to deliver a fully functional upper extremity prosthesis to an amputee, utilizing the best possible technologies. The arm is to have the same capacity as a native arm, including fully dexterous ngers, wrist, elbow and shoulder with the ability to lift weight, reach above ones head and around ones back. The prosthetic will include a practical power source, a natural look and a weight equal to that of the average arm. Another, more difcult goal includes control of the prosthesis through use of the patients central nervous system. In RP2009 the prosthesis technology will be extended to include sensors for proprioception feedback, a 24 hour power source, the ability to tolerate various environmental issues and increased durability. Although the prostheses developed during the course of both projects proved to be technological achievements [4], it is still unclear whether the cost of these systems, aimed at being around $100,000[5], will prohibit their use when they become commercially available. The University of New Brunswicks (UNB) hand project [6] seeks to design a low-cost three axis, six


basic grip anthropomorphic hand, with control of the hand using subconscious grasping to determine movement.The UNB hand team built intelligent Electromyography (EMG) sensors [7] that could amplify and process signal information, passing required information to the main microprocessor through the use of serial communication bus. This allows for a reduction in wiring, which reduces the weight and simplies the component architecture. The serial bus chosen, the Controller Area Network bus (CANbus), allowed a power strategy to be implemented, reducing overall power consumption. The CANbus is also noted as having a good compromise between speed and data protection, a necessity for prosthetics [8]. The hand project creators have begun creating a communication standard [9] to improve interchangeability and interconnection between limb components. If adopted, major increases in exibility would be gained which would be benecial to all people involved in the prosthesis industry. The paper by Banzi Mainardi and Davalli [8] extends the idea of a distributed control system and Controller Area Network serial bus to an entire arm prosthesis. In this project, along with the parallel distribution cited in the UNB hand project, the device had the additional task of handling external communication through either a Bluetooth or RS232 serial connection. The paper also cites similar reasons in using the CANbus to the UNB hand project, adding evidence of successful device integration in the CANbus traditional area, automotives and how this could parallel the prosthetic industry. The paper also outlines reasons behind not choosing other communication protocols or technologies. The two initial choices, Inter-Integrated Circuit (I2 C)[10] and Serial Peripheral Interface (SPI) Bus[11], were rejected because they failed to have adequate data control and data protection systems and were unable to handle faults acceptably. The SPI system required additional hardware overhead which would allow for device addressing. The personal computing standards and industry standards were unable to adapt to the space and weight constraints required by the prole of the prosthesis. The CANbus allowed for a reasonable number of sensors, exibility in expansion and interfacing, microcontrollers with integrated bus controllers and efcient, robust, deterministic data transmission with a reduction of cables required and near optimal voltage levels.

III. CAN BUS

The CANbus is a serial communication standard used to handle secure and real-time communication between electrical and microcontroller devices; it primarily defines the physical and data link layers of the open systems interconnection (OSI) model. The CANbus was originally created to support the automotive industry and its increased reliance on electronics; however, because of its reliability, high speed and low-cost wiring, it has been used in many additional areas. The CANbus supports bit rates of up to 1 Mbps and has been engineered to handle the constraints of a security-conscious real-time system [12]. The physical medium is shielded copper wiring, which utilizes a non-return-to-zero line coding. The CAN message frame is either the 11-bit identifier base frame (CAN 2.0A) or the 29-bit identifier extended frame (CAN 2.0B). There are a number of custom bit identifiers, most of which are used to synchronize messages, perform error handling or signal various values. The CAN standard operates on a network bus; therefore, all devices have access to each message, and addressing is handled by identifiers in each message frame. The CANbus standard outlines the arbitration method as CSMA/BA, where BA stands for bit arbitration. The bit arbitration method allows any device to transmit its message onto the bus. If there is a collision, the transmitter with the greatest priority wins bus arbitration; priority is decided bit by bit, with a dominant bit overriding a recessive bit, so the message whose identifier has the longer run of leading dominant bits, i.e., the lower identifier value, wins. The lower-priority devices then back off for a predefined period of time and re-transmit their messages. This allows the highest-priority messages to be handled the fastest; a short sketch illustrating this arbitration mechanism follows the property list below. Other important properties defined in this standard are:

- Message prioritization - Critical devices or messages have priority on the network. This is done through the media arbitration protocol.
- Guaranteed latency - Real-time messaging latency utilizes a scheduling algorithm which has a proven worst case and can therefore be relied upon in all situations [13].
- Configuration flexibility - The standard is robust in its handling of additional nodes; nodes can be added and removed without requiring a change in the hardware or software of any device on the system.
- Concurrent multicasting - Through the addressing and message filtering protocols in place, the CANbus can have multicasting in which it is guaranteed that all nodes will


accept the message at the same time, allowing every node to act upon the same message.
- Global data consistency - Messages contain a consistency flag which every node must check to determine if the message is consistent.
- Emphasis on error detection - Transmissions are checked for errors at many points throughout messaging. This includes monitoring at global and local transmission points, cyclic redundancy checks, bit stuffing and message frame checking [12].
- Automatic retransmission - Corrupted messages are retransmitted when the bus becomes idle again, according to prioritization.
- Error distinction - Through a combination of line coding, transmission detection, and hardware and software logic, the CANbus is able to differentiate between temporary disturbances, total failures and the switching off of nodes.
- Reduced power consumption - Nodes can be set to sleep mode during periods of inactivity. Activity on the bus or an internal condition will awaken the nodes.
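As a purely illustrative aside (not part of the monitor design), the following Python sketch shows how bitwise arbitration resolves a collision: identifiers are driven onto the bus MSB first, a dominant 0 overrides a recessive 1 on the wired-AND bus, and the node with the lower identifier keeps transmitting. The identifier values are invented for the example.

```python
# Illustrative sketch of CAN bitwise arbitration (CAN 2.0A, 11-bit identifiers).
# A node that reads back a dominant level while sending recessive loses
# arbitration and backs off; the lowest identifier (highest priority) wins.

def arbitrate(identifiers, width=11):
    contenders = list(identifiers)
    for bit in reversed(range(width)):              # MSB first
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus_level = min(sent.values())              # wired-AND: 0 wins
        contenders = [i for i in contenders if sent[i] == bus_level]
        if len(contenders) == 1:
            break
    return contenders[0]

print(arbitrate([0x1A5, 0x0F3, 0x0F7]))  # 0x0F3 wins (prints 243)
```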

IV. SYSTEM DESIGN

The open-bus-standard-based data monitor system for prosthetic limbs captures and collects the serial information from two separate Controller Area Network (CAN) buses, the sensor bus and the actuator bus. With the collected information, it then transforms the data into the correct CAN message format. A timestamp is added and the messages are passed through a user-controlled filter that dictates which messages should be logged. After the filter, the messages of the two buses are merged and sorted according to their timestamps. Once sorted, the information is sent to an output device for further processing. The current design allows choosing between three different output interfaces. First, the RS232 serial interface sends the CAN message data encoded in ASCII format to the serial port using an RS232 chip on the DE2 board. Second, the RS232 serial interface module can be connected directly to the pin interface of the DE2 board; this allows transmission with higher bandwidth, using an external RS232 chip with better performance. The third communication module uses the USB interface of the DE2 board to transmit the CAN message data [14]. This project will allow the engineers to observe the communication on the buses and search for sources of errors. An overview of the system

design is given in Figure 1. The implementation consists of various working cores that can be individually tested, each containing one piece of the functionality required for the overall system. These cores interface with one another through a standard FIFO, alleviating timing issues. The FIFO modules are dual-clocked memory buffers that work on the first-in/first-out concept. With the dual-clock ability, these cores can be used to exchange data between two different modules in a hardware design, even if those modules run with different clock speeds. The FIFO cores belong to the Intellectual Property (IP) cores library that is available within the Quartus II development environment from Altera. The usage of those cores is usually free for academic use, but requires a license fee for industrial or end-consumer development purposes. The CAN Reader module handles the observation of the connected CANbus. It receives all messages that are transmitted without causing a collision on the bus and forwards them to a FIFO buffer, from where the Filter module gets its input data. One pair of the CAN Reader and the Filter is connected to the control bus while the other pair is listening to the sensor bus. The modular design would allow connecting more or fewer CAN buses to the monitoring system. In the current implementation of the Message Writer module, only the data transmission on the RS232 interface is implemented. The data compression of the messages and the application of different interface technologies have been evaluated during the project; their implementations have been postponed as future work. RS232 was implemented as it required the least overhead to establish a communication channel [15]. The design of the CAN Reader module is based upon the CAN Protocol Controller project by Igor Mohor from opencores.org. This design implements the specification of the SJA1000 Stand-alone CAN controller from Philips Semiconductors [16]. To allow the integration of the core design, an I/O module that handles the Wishbone interface had to be implemented. The configuration of the CAN controller is register based, as defined in the specification of the SJA1000. The created I/O module is used by the CAN Reader, which configures the CAN controller. The CAN controller receives the data from the CANbus, while the CAN Reader collects the messages from the CAN controller and forwards


Fig. 1: System design of CANbus monitor.

them to the timestamp and filter procedures. For a manageable processing flow of the modules, their internal control has been designed as state machines. This design concept allowed the localization of problems caused by signal runtimes and race conditions. Figure 2 gives an idea of the state machine that describes the functionality of the CAN Reader module. To give an idea of the design complexity, the state Read Receive Buffer uses the I/O module to communicate with the CAN controller. This leads to a design with state machines encapsulated in state machines. Such designs should be avoided in hardware if possible, since they tend to obscure runtime race conditions. In the current design the occurrence of race conditions is prevented by event-driven mutexes.

V. VALIDATION & VERIFICATION

The validation of the system design has been done by simulation and runtime experiments on the target platform. For verification of the functionality, the single modules have been simulated with Altera's ModelSim tool. This allowed a step-wise execution of the Verilog code to identify logic errors in the implementation. Afterwards the modules have been tested on the target platform, with testbench modules providing the required input data. The output of the tested modules has been transmitted over the RS232 interface to a connected computer and verified by hand. The timing verification of the hardware design has been done with an experimental test setup, with two sensor nodes on one connected CANbus (Figure 3). The two sensor nodes were configured to send messages on the bus continuously at a frequency of 250 Hz. To allow the verification that all messages

can be received and no message is lost, each message contained a set of four consecutive numbers. The first node was configured to send numbers in the range from 0 to 999. The second node was sending numbers in the range from 1000 to 1999. Due to the CANbus standard, each node would resend its current message until it receives an acknowledge response on the bus from the other node. This would cause the node to increment the numbers for the next message. If a collision on the bus has occurred, the CANbus protocol standard ensures that the transmitting node is informed by a global collision signal. This would be sent by the first node on the CANbus which detects the collision. The analysis of the system led to the conclusion that the hardware design is working correctly. All messages that were sent from both nodes on the CANbus could be received successfully and with correct data by the CAN Reader module in the FPGA design. This could be verified with the logged data from the output of the serial module in the hardware design. Since the test setup only had one CANbus, the input pin has been used twice to evaluate the system with two CAN buses connected. This allowed verifying the proper ordering of the messages by the Message Merge Unit. It turned out that the only flaw in the current design is the RS232 interface. The available chip on the DE2 board has a maximum bandwidth of 120 kbit/s. That becomes a bottleneck if the connected CANbus runs at its maximum bandwidth of 1 Mbit/s, and even more so for two connected CAN buses. The design for a USB module as communication interface has been created but could not be sufficiently evaluated for further usage. The analysis of the test results showed some


Fig. 2: State machine design for CAN Reader module.

unexpected behavior. While the messages of one node had been received completely, some messages of the other node were missing. By resetting the CANbus system, this effect could switch to the other node or stay the same. For either one of the two nodes some messages were missing, but never for both nodes at the same test execution. The extensive analysis of the system design leads to the conclusion that this error might be caused by the conguration of the sensor nodes. Further evaluation of the system design will be continued, after a new test setup can be provided by the project partner. VI. F UTURE W ORK It might be useful to have the ability running the CANbus monitoring design without a direct connection to a computer. The logged data could be stored on an integrated ash memory module. For this feature it would be helfpul to have a compression module to increase the systems mobility. The compression would need to be fast enough so that it does not become a bottleneck for the system. Ideally the core would take in a stream of data and output a stream of compressed data. This could plug directly into the current system design. It would be useful to have wireless functionality so that the prosthesis does not need to be tethered

to a computer to retrieve the logged data. The biggest hurdle to overcome is to understand how to communicate with the wireless controller on the hardware level. Existing solutions in open source projects could build a starting point here. At the moment all the values for the CAN controller are hard coded and thus can not be changed after the design has been synthesized. This could be improved in several ways ranging from full congurability at runtime to a bit of code reorganization so that the values can easily be changed for re-synthesis. A compromise could be the ability to congure a small number of values at runtime. The most time effective solution would be to reorganize the code so that the hard coded values are excluded into a conguration le. The ltering is currently fairly rudimentary and inexible. It would be advantageous to have more exibility with how to lter messages. Reconguring the lters, and enable or disable lters during runtime would be helpful. This becomes even more crucial as the design moves from lab testing to real world testing where re-synthesizing the design becomes less feasible. The system could potentially be a memory mapped device and masks could be stored in registers which could be written to by a microcontroller. A simpler solution would be to have several kilobytes of persistent memory, to


Fig. 3: Experimental test setup for final system verification.

which masks could be written, to be used during the filtering of the messages.

VII. SUMMARY

This paper has shown the successful implementation of a monitoring system for a CANbus sensor network in a hardware design on an FPGA development board. Details of successful projects and related work have been introduced. The information presented in this paper should offer a good starting point, providing a general understanding of embedded projects, with emphasis on actual biomedical engineering solutions and the basics of the CANbus standard. The modular design of the approach allows its application in different projects that have a need for a CANbus monitoring system. All project source code will be made available through opencores.org.

Acknowledgments

This work is supported in part by CMC Microsystems, the Natural Sciences and Engineering Research Council of Canada and Altera Corporation.

REFERENCES
[1] Standardised Communication Interface for Prosthetics. [Online]. Available: http://groups.google.ca/group/scip-forum/
[2] J. Yang, T. Zhang, J. Song, H. Sun, G. Shi, and Y. Chen, "Redundant design of a CAN bus testing and communication system for space robot arm," in Control, Automation, Robotics and Vision, 2008. ICARCV 2008. 10th International Conference on, Dec. 2008, pp. 1894-1898.
[3] DARPA, DARPA: 50 Years of Bridging the Gap, 1st ed. Defense Advanced Research Projects Agency, 2008.
[4] L. Ward. (2007, October) Breakthrough awards 2007: DARPA-funded Proto 2 brings mind control to prosthetics. Electronic/periodical. Popular Mechanics.
[5] S. Adee. (2008, February) IEEE Spectrum: Dean Kamen's Luke arm prosthesis readies for clinical trials. Electronic. IEEE Spectrum.
[6] A. Wilson, Y. Losier, P. Kyberd, P. Parker, and D. Lovely, "EMG sensor and controller design for a multifunction hand prosthesis system - the UNB hand," draft document, 2009.
[7] A. W. Wilson, Y. G. Losier, P. A. Parker, and D. F. Lovely, "A bus-based smart myoelectric electrode/amplifier," in Medical Measurements and Applications Proceedings (MeMeA), 2010 IEEE International Workshop on, 2010.
[8] S. Banzi, E. Mainardi, and A. Davalli, "A CAN-based distributed control system for upper limb myoelectric prosthesis," in Computational Intelligence Methods and Applications, 2005 ICSC Congress on, Istanbul, 2005.
[9] Y. Losier and A. Wilson, "Moving towards an open standard: The UNB prosthetic device communication protocol," 2009.
[10] The I2C-Bus Specification, Philips Semiconductors Std. 9398 393 40 011, Rev. 2.1, January 2000.
[11] Motorola, M68HC11 Microcontrollers Reference Manual, rev. 6.1 ed., Freescale Semiconductor Inc., May 2007.
[12] CAN Specification, Robert BOSCH GmbH Std., Rev. 2.0, September 1991.
[13] J. Krakora and Z. Hanzalek, "Verifying real-time properties of CAN bus by timed automata," in FISITA World Automotive Congress, Barcelona, May 2004.
[14] DE2 Development and Education Board - User Manual, 1st ed., Altera, 101 Innovation Drive, San Jose, CA 95134, 2007.
[15] MAX232, MAX232I Dual EIA-232 Drivers/Receivers, Texas Instruments, Post Office Box 655303, Dallas, Texas 75265, March 2004.
[16] SJA1000 Stand-alone CAN Controller, Philips Semiconductors, 5600 MD Eindhoven, The Netherlands, January 2000.


Session 2 Prototyping Architectures


Rapid Single-Chip Secure Processor Prototyping on the OpenSPARC FPGA Platform


Jakub M. Szefer³, Wei Zhang¹, Yu-Yuan Chen³, David Champagne³, King Chan¹, Will X.Y. Li¹, Ray C.C. Cheung², Ruby B. Lee³

¹,² Department of Electronic Engineering, City University of Hong Kong
¹ {wezhang6, kingychan8, xiangyuli4}@student.cityu.edu.hk, ² r.cheung@cityu.edu.hk

³ Electrical Engineering Department, Princeton University, USA
³ {szefer, yctwo, dav, rblee}@princeton.edu

Abstract—Secure processors have become increasingly important for trustworthy computing as security breaches escalate. By providing hardware-level protection, a secure processor ensures a safe computing environment where confidential data and applications can be protected against both hardware and software attacks. In this paper, we present a single-chip secure processor model and demonstrate rapid prototyping of the secure processor on the OpenSPARC FPGA platform. OpenSPARC T1 is an industry-grade, open-source, FPGA-synthesizable general-purpose microprocessor originally developed by Sun Microsystems, now acquired by Oracle. It is a multi-core, multi-threaded 64-bit processor with open-source hardware, including the microprocessor core, as well as system software that can be freely modified by researchers. We modify the OpenSPARC T1 processor by adding security modules: an AES engine, a TRNG and a memory integrity tree. These enhancements enable security features like memory encryption and memory integrity verification. By prototyping this single-chip secure processor on the FPGA platform, we find that the OpenSPARC T1 FPGA platform has many advantages for secure processor research. Our prototyping demonstrates that additional modules can be added quickly and easily and that they add little resource overhead to the base OpenSPARC processor.

I. INTRODUCTION

As computing devices become ubiquitous and security breaches escalate, protection of information security has become increasingly important. Many software schemes, e.g., [1]-[3], have been proposed to enhance the security of computing systems and are effective in defending against software attacks. However, they are generally ineffective against physical or hardware attacks. Attackers who have full control of the physical device can easily bypass software-only protection, and the whole system is left unsafe and subject to hardware attacks. This results in an increasing need for hardware-enhanced security features in the microprocessor. Considerable efforts have been made to build secure computing platforms that can address security threats. In this paper, we present our extensible secure computing model prototyped on the OpenSPARC FPGA platform. The platform consists of the OpenSPARC T1 processor and system software, including the hypervisor and the operating system. The OpenSPARC T1 processor is the open-source form of the UltraSPARC T1 processor (from Sun Microsystems, now Oracle) that gives

designers the freedom to modify the processor according to their own needs [4]. This OpenSPARC T1 processor is also easily synthesizable for FPGA targets, which makes the implementation of the processor quite easy. A Field-Programmable Gate Array (FPGA) is an integrated circuit designed to be configured by the customer or the designer after manufacture. Because of its reconfigurability, an FPGA can be used to implement any logic function that an Application-Specific Integrated Circuit (ASIC) chip could perform, which makes it a good platform for rapid system prototyping. Through the prototyping process, we have found that the OpenSPARC FPGA platform has many advantages for secure processor research. For example, the memory subsystem is emulated in a MicroBlaze softcore, which allows new features to be added without re-synthesizing the whole platform. Furthermore, new hardware components can be easily added as Fast Simplex Link (FSL) peripherals without worrying about strict timing or editing fragile HDL code of the processor core. However, to the best of our knowledge, there are currently only a few papers [5]-[7] about processor research on the OpenSPARC FPGA platform and none focusing on security. In this paper, we propose a single-chip secure processor architecture based on this platform. Furthermore, we find that the OpenSPARC FPGA platform is relatively friendly for secure processor prototyping. Although there are other open-source processors like OpenRISC and LEON available [8], they lack the infrastructure and components (e.g., hypervisor or emulated caches) of the OpenSPARC platform. The main contributions of this paper are:
- A reconfigurable single-chip secure processor model.
- A prototype of the single-chip secure processor on the OpenSPARC FPGA platform with our new additions: an AES engine, a true random number generator (TRNG), and a memory integrity tree (MIT).
- An evaluation showing the OpenSPARC FPGA platform's advantages for secure processor research.

The rest of the paper is organized as follows. Section II describes some existing secure processor models as background information. Section III proposes our single-chip secure processor architecture on the OpenSPARC FPGA platform. Section IV describes the single-chip secure processor features.


Section V gives an evaluation of the secure OpenSPARC platform. Finally, in Section VI we conclude the paper.

II. RELATED WORK

With the emergence of hardware attacks, hardware-enhanced security has been given considerable attention by researchers and engineers. Different secure processor architectures have been proposed to provide a secure computing environment for protecting sensitive information against both software and hardware attacks. Single-chip secure processors consider the processor to be trusted, but anything outside the processor, e.g., memory, is untrusted. One such example is the Aegis architecture, as described in [9], [10]. In this approach, the secure processor contains two key primitives: a physical unclonable function (PUF) and off-chip memory protection. Both primitives are realized within one single-chip processor so that the internal state of the processor cannot be tampered with or observed directly by physical means. The Secret-Protecting (SP) architecture is proposed in [11]. In SP, trusted software modules have their data and code encrypted and hashed when off-chip, and a concealed execution mode is provided where these software modules are protected from other software snooping on them; e.g., registers are encrypted on an interrupt so that a potentially compromised commodity operating system cannot read them. Also, a hierarchical key chain structure is used to store all keys in encrypted and hashed form, and only the root key needs to be stored in hardware. Another secure processor architecture is Bastion [12], which can protect a trusted hypervisor, which then protects trusted software modules in the application or in the operating system. Bastion scales to provide support for multiple mutually-distrustful security domains. Bastion also provides a memory integrity tree for runtime memory authentication and protection from memory replay attacks. It also protects the hypervisor from physical attacks and offline attacks, not just software attacks. Based on this previous work, we propose our secure computing model in Section III. Despite the many advantages of the OpenSPARC FPGA platform, it is not very widely used as a research platform. We propose a single-chip secure processor on this platform and hope that our work can provide some reference for researchers interested in OpenSPARC.

III. SINGLE-CHIP SECURE OPENSPARC PROCESSOR

This section presents our secure computing model and single-chip secure OpenSPARC T1 processor.

A. Secure Computing Model

Figure 1 illustrates our secure computing model. We divide this computing system into two parts. The first part includes those components in the processor chip, shown inside the dashed box, and the second part consists of all components off the processor chip, shown outside the dashed box. All on-chip modules, including the CPU core, cache, registers, encryption/decryption engine and integrity verification module, are


assumed to be trusted and protected from physical attacks in our design, because the internal state of the processor chip could not easily be tampered with or observed directly by physical means. We do not consider in this paper such physical means as side-channel attacks that employ differential power analysis or electromagnetic analysis. On the other hand, all off-chip modules, including external memory and peripherals, are considered insecure, because those modules can easily be tampered with by an adversary using physical attacks. In addition to hardware attacks, the system may suffer from software attacks from an untrusted operating system or malicious software. To ensure a secure computing environment, the computing system must have some secure functions that enable it to defend against either software or hardware attacks. The gray blocks in Figure 1 represent our initial set of new security enhancements that we add to the computing system. The encryption/decryption module encrypts all data evicted off the processor chip so that they are meaningless to an adversary. The integrity verification module verifies that all data coming from the off-chip memory has not been tampered with.

Fig. 1: Secure computing model. The processor chip is regarded as a physically secure region and gray blocks represent new security enhancements.

B. OpenSPARC FPGA Platform

Our single-chip secure processor design targets the FPGA-synthesizable version of the OpenSPARC T1 general-purpose microprocessor. The OpenSPARC T1 microprocessor is an industry-grade, 64-bit, multi-threaded processor and is freely available from the OpenSPARC website [4], [13]. In addition to the processor core source code (HDL), simulation tools, design verification suites, and hypervisor source code (C and assembly) are available for download [4]. The OpenSPARC FPGA platform consists of the following major components: the OpenSPARC T1 microprocessor core, the memory subsystem (L2 cache), the DRAM controller and DRAM, the hypervisor and a choice of operating systems (Linux or OpenSolaris). Due to the size constraint of the FPGA chip, the OpenSPARC FPGA platform that we use includes only one

single-thread T1 CPU core to minimize its size. In addition, the L2 cache and the L2 cache controller are emulated in a MicroBlaze softcore, i.e. there is no physical L2 cache. A high-level block diagram of the OpenSPARC T1 processor is shown in Figure 2. The microprocessor core is connected to the L2 cache, emulated by a MicroBlaze softcore, through the CPU-Cache Crossbar (CCX). The DRAM controller is an IP (Intellectual Property) block synthesized and implemented in the FPGA fabric and connected to the MicroBlaze softcore. A physical DRAM is connected to the FPGA board to serve as the actual memory. Due to the different complexities of these components, and to the place-and-route operations that determine critical paths when implementing on the FPGA, the FPGA version of the OpenSPARC processor chip has multiple clock domains. The OpenSPARC T1 core is in one clock domain (50MHz) and the MicroBlaze softcore is in another clock domain (125MHz). Peripherals synthesized in the FPGA fabric and connected to the MicroBlaze softcore are in another clock domain (e.g. 10MHz or 50MHz for the peripherals we create). Finally, the DRAM chip has its own clock domain (400MHz).

Fig. 2: Block diagram of the stock OpenSPARC FPGA platform.

The OpenSPARC FPGA platform has many advantages for security research [14]. It allows users to freely modify real hardware. In particular, the memory subsystem is emulated, rather than being fully implemented in HDL, so new features can easily be added (in C code run on the MicroBlaze softcore) and re-synthesizing the whole platform can be avoided. In addition, new hardware components can also easily be added as firmware code or synthesized in the FPGA fabric and connected to the emulated cache by buses, e.g. the FSL bus. The peripherals on the ML505 board are also very useful, e.g. the network port. Finally, secure processors can be prototyped without fabricating a real processor chip.

C. Secure Processor Architecture

Based on the secure computing model described in Figure 1, we propose our single-chip secure processor architecture. Our approach is to add security modules to the stock OpenSPARC FPGA platform. We synthesize and implement the design on the FPGA [15]. Figure 3 shows the block diagram of our single-chip secure processor architecture on the OpenSPARC FPGA platform.

Fig. 3: Block diagram of the secure OpenSPARC FPGA platform. Gray blocks show our new additions.

The gray blocks in the diagram represent the security modules we add to the original platform, including a TRNG (true random number generator, HDL code), an AES (Advanced Encryption Standard) engine (HDL code), and a memory integrity tree (MIT, MicroBlaze firmware code). The TRNG and the AES engine are implemented in the FPGA fabric, and the MIT is executed in the MicroBlaze softcore as firmware. The TRNG and the AES engine are connected to the MicroBlaze through the FSL bus. Through the MicroBlaze, the OpenSPARC microprocessor communicates with these security modules. The MIT firmware calls the AES engine for memory integrity verification. Each module works in its own clock domain, and a Digital Clock Manager (DCM) on the FPGA board generates these clock frequencies. Table I shows the different clock domains in our secure OpenSPARC system. In addition to the FPGA chip, the FPGA board also has many on-board resources that can be utilized by the secure processor. In our experimental setup, we use the Xilinx Virtex-5 ML505 FPGA board. This board contains a 256MB DRAM module, in which an 80MB ramdisk is used to boot the Linux or Solaris operating system. In addition, the board has an Ethernet port that can connect the secure processor to the Internet if enabled. The Ethernet port can also serve as a communication port that provides a high data exchange rate between the host computer and the secure processor. Our secure processor works in one of four modes:

STD - standard mode, which has no additional security measures;
CONF - confidential mode, which performs memory encryption to ensure data confidentiality;
ITR - integrity tamper-resistant mode, which performs memory integrity verification to ensure data integrity;
FTR - full tamper-resistant mode, which performs both memory encryption and memory integrity verification to ensure data confidentiality and integrity.

The secure processor can work in any of the four modes depending on the user's needs.
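Internally the four modes reduce to two independent protections: memory encryption and memory integrity verification. A minimal C sketch of how control software could encode this mapping is shown below; the flag and function names are our own illustration, not identifiers from the OpenSPARC sources or the firmware described in this paper.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical encoding of the four operating modes as two feature bits. */
    enum { SEC_ENCRYPT = 1 << 0, SEC_INTEGRITY = 1 << 1 };

    static unsigned mode_flags(const char *mode)
    {
        if (strcmp(mode, "CONF") == 0) return SEC_ENCRYPT;
        if (strcmp(mode, "ITR") == 0)  return SEC_INTEGRITY;
        if (strcmp(mode, "FTR") == 0)  return SEC_ENCRYPT | SEC_INTEGRITY;
        return 0;                      /* STD: no additional protection */
    }

    int main(void)
    {
        unsigned f = mode_flags("FTR");
        printf("encrypt=%d integrity=%d\n",
               !!(f & SEC_ENCRYPT), !!(f & SEC_INTEGRITY));
        return 0;
    }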


TABLE I: Clock domains in the OpenSPARC system

  Module      OpenSPARC T1   MicroBlaze   AES engine   TRNG    DRAM
  Frequency   50MHz          125MHz       50MHz        10MHz   400MHz

IV. SINGLE-CHIP SECURE PROCESSOR FEATURES

The proposed single-chip secure processor provides the following security features: memory encryption/decryption, secret key generation, and memory integrity verification.


A. AES Engine

Advanced Encryption Standard (AES) is a symmetric encryption algorithm approved by the National Institute of Standards and Technology (NIST). It is one of the most widely used symmetric encryption algorithms and is advanced in terms of both security and performance. While any symmetric key encryption algorithm would suit our purpose, we adopt AES as the memory encryption/decryption algorithm and AES-CBC-MAC as the cryptographic hash primitive. AES processes data blocks of 128 bits using cryptographic keys of 128, 192 or 256 bits. The encryption or decryption takes 10 to 14 rounds of array operations, depending on the key size. In our AES unit design, we employ the idea of parallel table lookup (PTLU) as in [16], [17]. The AES unit is based on AES-128 and takes a 128-bit input block and a 128-bit key to produce a 128-bit output. The block diagram of the AES unit is shown in Figure 4. The AES unit consists of one finite state machine (FSM) that controls the operation of aes_round and aes_key_expander. The load signal triggers the FSM to load the registers with input data and key. The start signal sets an internal counter value and starts the AES encryption/decryption cycle from round 0. The mode signal switches the AES operation between encryption and decryption. The latency of one block of AES-128 encryption and decryption is 14 cycles and 25 cycles, respectively. For encryption, the AES unit takes 11 cycles to produce the output data (1 cycle per round and 1 cycle for the initial AddRoundKey operation), and 3 cycles to assert the load, start, and done signals. However, the decryption process incurs an extra 11 cycles in order to generate the first round key for decryption, since the first round key of decryption is the last round key of encryption. After the round operations are done, a done signal is asserted to signify that the output data is ready on the data_out bus. The AES engine works in cipher-block chaining (CBC) mode, i.e., AES-CBC. The input and output data width is 512 bits (64 bytes) in our design, in order to correspond to the common size of modern cache lines. The AES engine is connected to the MicroBlaze softcore through the FSL bus, which has a data bus width of 32 bits. All input data first has to be loaded into the AES engine to start the AES-CBC encryption, which needs (32+128+128+512)/32 = 25 cycles to load the mode, initial_vector, key, and data_in; outputting the data back to the MicroBlaze takes 512/32 = 16 cycles. If we include the 14 cycles for encrypting each 128-bit block (25 cycles for decryption), the total latency of encrypting 64 bytes of input data is 25 + 4×14 + 16 = 97 cycles. Similarly, we get 25 + 4×25 + 16 = 141 cycles for decryption. We note the high overhead of the FSL bus for data transfer (41 of the 97 cycles for encryption). Also, performance can be improved by storing the decryption round keys so that they do not need to be regenerated, at the cost of using more hardware resources.
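As a quick sanity check of this cycle accounting, the small C illustration below (our own, not code from the MicroBlaze firmware) recomputes the per-line latencies from the figures quoted above: the 32-bit FSL width, the load and store word counts, and the per-block round latencies.

    #include <stdio.h>

    /* Recompute the AES-CBC latency for one 64-byte cache line moved over the
     * 32-bit FSL bus, using the per-block cycle counts quoted in the text. */
    int main(void)
    {
        const int fsl_bits  = 32;
        const int line_bits = 512;                   /* 64-byte cache line   */
        const int load_bits = 32 + 128 + 128 + 512;  /* mode, IV, key, data  */
        const int blocks    = line_bits / 128;       /* 4 AES blocks per line */
        const int enc_block = 14, dec_block = 25;    /* cycles per 128-bit block */

        int load  = load_bits / fsl_bits;            /* 25 cycles */
        int store = line_bits / fsl_bits;            /* 16 cycles */
        printf("encrypt 64B: %d cycles\n", load + blocks * enc_block + store);
        printf("decrypt 64B: %d cycles\n", load + blocks * dec_block + store);
        return 0;
    }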

Fig. 4: Block diagram of the AES-128 unit.

B. True Random Number Generator (TRNG)

The secure processor needs a secret key for memory encryption and decryption. In addition, the secret key should be unpredictable to attackers. A true random number generator (TRNG) is utilized to generate a secret key for the AES engine. Figure 5 shows the internal structure of the TRNG, which consists of many identically laid-out delay loops, or ring oscillators (ROs). We call this a ring oscillator TRNG, introduced by Sunar et al. in [18]. Based on the design in [19], which uses 110 rings with 13 inverters, our TRNG consists of 114 ROs, each of which is a simple circuit containing 15 concatenated NOR gates that oscillate at a certain frequency. One of the two inputs of the NOR gate is used to reset the TRNG. Because of manufacturing variations, each RO oscillates at a slightly different frequency. The outputs of all ROs are exclusive-ORed in order to correct bias and correlation and to generate a random signal. We sample the random signal output from the XOR gate at a frequency of 10MHz.

Fig. 5: Internal structure of the true random number generator (TRNG).

Similar to the AES engine, the TRNG is also connected to the MicroBlaze (the emulated cache) through the FSL bus and interacts with the OpenSPARC T1 core through the MicroBlaze. The TRNG module is separate from the other modules, which requires data transfer overhead over the FSL bus if the random bits are used by firmware or another module. For example, a transfer of 128 random bits is needed for the AES engine: TRNG to MicroBlaze to AES. The advantage is that the TRNG is not tied to the AES and can be used for other purposes. Only a TRNG to MicroBlaze transfer is needed if the random bits are used inside the firmware. Furthermore, the TRNG could be integrated with the AES unit to reduce the transfer overhead if it is dedicated to AES key generation. To the best of our knowledge, combining a TRNG with an AES engine on a secure processor platform for key generation has not been explicitly discussed in previous work. In testing the TRNG on the FPGA platform, we devised a new method to collect enough random bits for testing. We connect to the MicroBlaze an Ethernet core, which works at 10/100/1000Mb/s. The TRNG outputs random bits at 10Mb/s, so these random bits can be sent out to the host computer through the Ethernet port on the FPGA board. The MicroBlaze reads random bits from the TRNG and sends them to the Ethernet core, which then sends these random bits out to the host computer for testing.

C. Cryptographic Key Generation

The output of the TRNG is a single bit. However, AES encryption and decryption require a 128-bit key. To generate the 128-bit key using the single-bit TRNG, a key generator is employed, shown in Figure 6. A single-bit-in, 128-bit-out shift register is used in the key generator. The shift register contains 128 concatenated flip-flops, each of which stores one bit of the 128-bit symmetric key. The TRNG generates only one random bit per clock cycle, and its output is connected to the input of the shift register. It takes the shift register 128 clock cycles to store 128 random bits. These 128 random bits are then connected to a 128-bit register, which outputs them as a key. In addition to the registers, there is a counter in the key generator. The counter counts from 0 to 127. When the count reaches 127, it sets a flag signal to 1, indicating that the key is ready. After the key is read, the flag signal is reset to 0 until 128 new bits are available. This allows new randomness to be returned immediately if some time has passed since the last read. In this way, generating one key needs 128 clock cycles. The generated key can also be used as a seed to generate more 128-bit keys using the AES engine. The 10MHz working frequency of the TRNG might be a little slow. To get a faster key generation speed, more than one TRNG can be used. For example, if two TRNGs are used, the generated random bits can be stored in two 64-bit shift registers respectively, in which case a key can be generated in only 64 clock cycles and the speed is doubled. However, this is accomplished at the price of more hardware resource utilization.

Fig. 6: 128-bit key generation circuit.
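The shift-register construction above is simple enough to model in a few lines of C, which can be convenient when writing firmware tests against the real hardware. The sketch below is a behavioral model of our own, not the HDL: one TRNG bit is shifted in per 10MHz clock, a counter tracks progress, and a flag marks the key as ready after 128 bits. rand() merely stands in for the ring-oscillator output and is not cryptographically secure.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct keygen {
        uint64_t key[2];   /* 128-bit shift register, key[1] holds the MSBs */
        int      count;    /* bits shifted in since the last key            */
        int      flag;     /* 1 when a full 128-bit key is ready            */
    };

    /* One TRNG clock: shift a single random bit into the 128-bit register. */
    static void keygen_clock(struct keygen *kg, int trng_bit)
    {
        kg->key[1] = (kg->key[1] << 1) | (kg->key[0] >> 63);
        kg->key[0] = (kg->key[0] << 1) | (uint64_t)(trng_bit & 1);
        if (++kg->count == 128) { kg->flag = 1; kg->count = 0; }
    }

    int main(void)
    {
        struct keygen kg = { {0, 0}, 0, 0 };
        while (!kg.flag)                      /* 128 TRNG cycles per key */
            keygen_clock(&kg, rand() & 1);
        printf("key = %016llx%016llx\n",
               (unsigned long long)kg.key[1], (unsigned long long)kg.key[0]);
        return 0;
    }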

D. Key Management

Cryptographic keys in the secure OpenSPARC system are critical secrets. In the future, as more secure modules and applications are added to the system, the number of keys may increase, and key management and protection will become a problem. We address this problem by utilizing the concept of a key chain, which is described in [11]. The key chain is a hierarchical structure that stores all keys of the secure OpenSPARC system in encrypted form, as shown in Figure 7. Each key in the chain is encrypted by its parent key. At the root of the chain is the Root Key. Only a leaf key can be used to encrypt a user's data. This tree structure allows an unlimited number of keys to be stored on the key chain. The Root Key is the most critical and is stored in the Root Key register on the processor chip, which is assumed secure from physical probing. Because all the other keys are in encrypted form, they can be stored in off-chip repositories (external memory) for retrieval and need no special protection. This greatly enhances key security and also reduces the on-chip storage needed for keys.

Fig. 7: Hierarchical key chain.
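The wrapping discipline is easy to illustrate in C. The sketch below is our own illustration rather than code from the platform: wrap() stands in for the AES engine (a toy XOR is used so the example stays self-contained and runnable; it is obviously not secure), and only the root value is assumed to live in an on-chip register.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint8_t b[16]; } key128;

    /* Placeholder for the AES engine: a toy XOR "cipher" so the example runs
     * stand-alone.  For XOR, wrapping and unwrapping are the same operation. */
    static key128 wrap(key128 parent, key128 child)
    {
        key128 out;
        for (int i = 0; i < 16; i++)
            out.b[i] = parent.b[i] ^ child.b[i];
        return out;
    }

    int main(void)
    {
        key128 root = {{0x5a}};            /* held in the on-chip register */
        key128 leaf = {{0x22}};            /* user key to be protected     */
        key128 blob = wrap(root, leaf);    /* stored off-chip, encrypted   */
        key128 back = wrap(root, blob);    /* unwrapped when the key is used */
        printf("recovered leaf key: %s\n",
               memcmp(&back, &leaf, sizeof leaf) == 0 ? "yes" : "no");
        return 0;
    }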

E. Memory Integrity Verification

Even though off-chip data stored in the external memory are encrypted, an adversary can still tamper with them by modifying them. In ITR or FTR mode, the secure processor performs memory integrity verification on all off-chip data fed into the processor to ensure that they have not been tampered with. The memory integrity verification is realized using a hash tree, which is shown in Figure 8. The external memory is divided into data blocks, and each data block is hashed to obtain one hash value. These hash values are further hashed to obtain their parent hash values (in the nodes above them). Thus, a hash tree (called a Memory Integrity Tree, MIT) is formed, and at the top of the tree is the root hash. The root hash value is stored in the root hash register on-chip, which is assumed secure. When data are evicted to off-chip memory, the processor performs cache line hashing and updates the non-leaf nodes of the MIT on the path from the leaf node to the root hash. When external data are read in, the processor performs cache line verification on the path from the leaf node to the root hash. Whenever there is a mismatch between the new root hash and the old root hash, it can be asserted that the contents of the external memory have been tampered with, and an exception is generated. The memory integrity tree (MIT) is realized as OpenSPARC firmware in our design, rather than in real FPGA fabric. The MIT firmware calls the AES engine to perform the hash algorithm using AES-CBC-MAC. Using firmware to emulate new hardware has some benefits, one of which is that the firmware can be updated without having to re-synthesize any of the existing components.

Fig. 8: Memory Integrity Tree.
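A compact software model of the tree makes the update and verify logic easy to see. The following C sketch is our own toy model: node_hash() is a placeholder mixing function standing in for the AES-CBC-MAC primitive, the tree is binary over eight blocks for brevity, and only the root value corresponds to what would be held in the on-chip register.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of a binary memory integrity tree over 8 data blocks.
     * node_hash() is a placeholder for AES-CBC-MAC, not a real MAC. */
    #define BLOCKS 8

    static uint64_t node_hash(uint64_t a, uint64_t b)
    {
        uint64_t h = a * 0x9e3779b97f4a7c15ULL ^ (b + 0x7f4a7c15U);
        return h ^ (h >> 29);
    }

    static uint64_t tree_root(const uint64_t leaf[BLOCKS])
    {
        uint64_t level[BLOCKS];
        int n = BLOCKS;
        for (int i = 0; i < n; i++) level[i] = leaf[i];
        while (n > 1) {                       /* hash pairs up to the root */
            for (int i = 0; i < n / 2; i++)
                level[i] = node_hash(level[2 * i], level[2 * i + 1]);
            n /= 2;
        }
        return level[0];
    }

    int main(void)
    {
        uint64_t mem[BLOCKS] = {1, 2, 3, 4, 5, 6, 7, 8};
        uint64_t root = tree_root(mem);       /* kept in the on-chip register */
        mem[3] ^= 1;                          /* simulate off-chip tampering  */
        printf("tamper detected: %d\n", tree_root(mem) != root);
        return 0;
    }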

V. SYSTEM EVALUATION

This section evaluates our single-chip secure OpenSPARC T1 processor, including its performance and hardware costs. The secure processor is prototyped on a Xilinx Virtex-5 ML505 FPGA board with an XC5VLX110T FPGA chip. We find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping. The synthesis time in the Xilinx ISE environment for the whole secure platform is about 4 hours. However, if the firmware is altered, there is no need to re-synthesize the whole system, and recompiling the firmware takes only several minutes, which saves a lot of time.

A. Encryption/Decryption Performance

In CONF and FTR mode, the secure processor has to perform encryption and decryption operations. As mentioned in Table I, the secure OpenSPARC system has multiple clock domains. It takes the AES engine 97 clock cycles to encrypt 64 bytes of input plaintext, and 141 clock cycles to decrypt 64 bytes of input ciphertext, where many of these cycles are due to data transfer via the 32-bit FSL bus. The TRNG works at a frequency of 10MHz, and generating one 128-bit key needs 128 TRNG cycles. However, a new key is infrequently

needed. Data from the AES engine and the TRNG have to be processed first by the MicroBlaze, and they are fetched by the MicroBlaze at a frequency of 125MHz. The clock cycles needed for all these operations are shown in Table II.

TABLE II: Clock cycles needed for various operations

  Operation                                 Cycles   Cycle frequency
  AES encryption of 64-byte block           97       50MHz (AES cycles)
  AES decryption of 64-byte block           141      50MHz (AES cycles)
  128-bit key generation from TRNG          128      10MHz (TRNG cycles)
  128-bit key transfer, TRNG to MicroBlaze  4        125MHz (FSL cycles)
  128-bit key transfer, MicroBlaze to AES   4        125MHz (FSL cycles)

B. Overall Performance

The STD mode causes no extra performance overhead for the processor. In CONF and FTR mode, the processor has to encrypt/decrypt off-chip data, which incurs performance overhead, but this happens only on cache misses, which are relatively infrequent. For each 64 bytes of data evicted off the secure processor chip, a 97-cycle delay is incurred due to the encryption operation. Similarly, for each 64 bytes of off-chip data fed into the processor, a 141-cycle delay is incurred due to the decryption operation. This overhead could be reduced if the FSL bus were widened or multiple FSL buses were used. Counter-mode AES can also be used to reduce the effective encryption/decryption latency. In ITR and FTR mode, the processor performs cache line hashing on data evicted to off-chip memory and cache line verification on data read into the processor, which also causes additional performance overhead. In the FPGA version of OpenSPARC, the frequency of the OpenSPARC T1 core is 50MHz, which is a bit slow for performance research running large software applications. Regardless, the OpenSPARC platform is still suitable for secure processor research because of its many advantages. In this paper, we mainly focus on how to modify the platform to add security features rather than on performance.
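To give a feel for what these per-line latencies mean at the application level, the back-of-the-envelope model below estimates the CONF-mode slowdown from an assumed miss rate and dirty-eviction ratio. All of the workload parameters are invented for illustration, and the model ignores the counter-mode and wider-FSL optimizations mentioned above.

    #include <stdio.h>

    /* Rough CONF-mode overhead model: every cache-line fill pays the 141-cycle
     * decryption latency and every dirty eviction the 97-cycle encryption
     * latency.  The workload numbers below are assumptions, not measurements. */
    int main(void)
    {
        const double dec_cycles = 141.0, enc_cycles = 97.0;  /* from Table II */
        double miss_per_instr = 0.002;   /* assumed off-chip miss rate   */
        double dirty_fraction = 0.3;     /* assumed dirty eviction share */
        double base_cpi       = 1.5;     /* assumed baseline CPI         */

        double extra_cpi = miss_per_instr *
                           (dec_cycles + dirty_fraction * enc_cycles);
        printf("CPI %.2f -> %.2f (%.0f%% overhead)\n",
               base_cpi, base_cpi + extra_cpi, 100.0 * extra_cpi / base_cpi);
        return 0;
    }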

C. Hardware Costs

Our single-chip secure processor is implemented on a Xilinx Virtex-5 XC5VLX110T FPGA. Table III shows the total resources of the FPGA chip and the hardware costs after the security modules are added. Table III shows that the OpenSPARC T1 core takes up most of the slices of the FPGA chip, up to 78%, while the new security modules consume far fewer slices. After both the AES engine and the TRNG are added, the slice utilization increases by only 10%, which is far less than the 78% consumed by the T1 core. In our design, the OpenSPARC T1 microprocessor has been tailored to include only one CPU core. The resource utilization ratio of the AES engine and the TRNG will be even lower if two or more cores are used.

TABLE III: Logic utilization of the single-chip secure OpenSPARC processor on the Xilinx Virtex-5 FPGA

  Module                                  Slice         LUT           Register      BRAM
  Virtex-5 FPGA XC5VLX110T                17280         69120         69120         148
  OpenSPARC T1                            13561 (78%)   40270 (58%)   30087 (43%)   119 (80%)
  OpenSPARC T1 with AES added             14030 (81%)   43174 (62%)   30945 (44%)   143 (96%)
  OpenSPARC T1 with TRNG added            15166 (87%)   42162 (60%)   30146 (43%)   119 (80%)
  OpenSPARC T1 with AES and TRNG added    15181 (88%)   45030 (65%)   31004 (44%)   143 (96%)

ACKNOWLEDGMENT

This work is supported in part by NSF CCF-0917134 and NSF EEC-0540832, and by the City University of Hong Kong Start-up Grant 7200179.

REFERENCES
[1] W. Ford and B. S. Kaliski, Jr., Server-assisted generation of a strong secret from a password, in Proceedings of the 9th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises. IEEE Computer Society, 2000, pp. 176-180.
[2] J. Garay, R. Gennaro, C. Jutla, and T. Rabin, Secure distributed storage and retrieval, in Distributed Algorithms, M. Mavronicolas and P. Tsigas, Eds. Springer Berlin / Heidelberg, 1997, vol. 1320, pp. 275-289.
[3] P. MacKenzie and M. K. Reiter, Networked cryptographic devices resilient to capture, International Journal of Information Security, vol. 2, pp. 1-20, 2003.
[4] OpenSPARC, World's First Free 64-bit CMT Microprocessors. [Online]. Available: http://www.opensparc.net/
[5] I. Parulkar, A. Wood, J. C. Hoe, B. Falsafi, S. V. Adve, and J. Torrellas, OpenSPARC: An open platform for hardware reliability experimentation, the Fourth Workshop on Silicon Errors in Logic-System Effects (SELSE), April 2008.
[6] K. Chandrasekar, R. Ananthachari, S. Seshadri, and R. Parthasarathi, Fault tolerance in OpenSPARC multicore architecture using core virtualization, International Conference on High Performance Computing, December 2010.
[7] D. Lee, OpenSPARC - a scalable chip multi-threading design, in 21st Intl. Conference on VLSI Design, VLSID, 2008, p. 16.
[8] P. Pelgrims, T. Tierens, and D. Driessens, Overview of embedded processors: Excalibur, LEON, MicroBlaze, NIOS, OpenRISC, Virtex II Pro, De Nayer Instituut, Tech. Rep., 2003.
[9] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas, Aegis: architecture for tamper-evident and tamper-resistant processing, in Proceedings of the 17th annual international conference on Supercomputing. ACM, 2003, pp. 160-171.
[10] G. E. Suh, C. W. O'Donnell, and S. Devadas, Aegis: A single-chip secure processor, IEEE Design and Test of Computers, vol. 24, pp. 570-580, 2007.
[11] R. B. Lee, P. C. S. Kwan, J. P. McGregor, J. Dwoskin, and Z. Wang, Architecture for protecting critical secrets in microprocessors, in Proceedings of the 32nd annual international symposium on Computer Architecture. IEEE Computer Society, 2005, pp. 2-13.
[12] D. Champagne and R. Lee, Scalable architectural support for trusted software, in 16th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2010, pp. 1-12.
[13] D. Weaver, OpenSPARC Internals. Sun Microsystems, Inc., 2008.
[14] J. Szefer, Y.-Y. Chen, R. Cheung, and R. B. Lee, Evaluation of OpenSPARC FPGA platform as a security and performance research platform, Princeton University Department of Electrical Engineering Technical Report CE-L2010-002, September 2010.
[15] PALMS OpenSPARC Security and Performance Research Platform. [Online]. Available: http://palms.ee.princeton.edu/opensparc
[16] R. Lee and Y.-Y. Chen, Processor accelerator for AES, in Proceedings of IEEE Symposium on Application Specific Processors (SASP), June 2010, pp. 16-21.
[17] J. Szefer, Y.-Y. Chen, and R. B. Lee, General-purpose FPGA platform for efficient encryption and hashing, in Application-specific Systems, Architectures and Processors conference, July 2010, pp. 309-312.
[18] B. Sunar, W. Martin, and D. Stinson, A provably secure true random number generator with built-in tolerance to active attacks, IEEE Transactions on Computers, vol. 56, no. 1, pp. 109-119, 2007.
[19] D. Schellekens, B. Preneel, and I. Verbauwhede, FPGA vendor agnostic true random number generator, in International Conference on Field Programmable Logic and Applications, FPL, 2006, pp. 1-6.


Table III also shows that the secure OpenSPARC T1 processor takes up almost all the resources of the Virtex-5 XC5VLX110T FPGA chip. This will restrain the system from further development. For example, if more security modules are to be added, or two OpenSPARC T1 cores are desired in the system, the remaining resources may not be enough. One solution to this problem is to move the system to a larger FPGA chip with more logic resources, for example, the Xilinx Virtex-6 or Altera Stratix V FPGAs.

VI. CONCLUSIONS

OpenSPARC T1 is an open-source, FPGA-synthesizable, general-purpose microprocessor. In this paper, we have described a secure computing model which assumes that only the on-chip environment is secure from physical attacks, and proposed a single-chip secure processor architecture. Further, we have prototyped the secure OpenSPARC T1 processor on an FPGA and evaluated the resulting system. The new security modules added to the OpenSPARC system incur little extra hardware cost and performance overhead. By prototyping the secure OpenSPARC T1 processor, we find that the OpenSPARC FPGA platform has many advantages for secure processor research and prototyping: the ability to modify real hardware, ease of modification due to the emulated cache, the ability to run a commodity OS and benchmarks, the availability of an open-source hypervisor, etc. The low clock rate of the stock OpenSPARC T1 processor and the high overhead of data transfer cycles over the 32-bit FSL bus affect the performance of the prototype. Hence, performance monitoring, performance estimation and performance improvement are fruitful areas for further research. In this work, we have added the AES engine, the TRNG and the MIT to the OpenSPARC platform. More security modules can be added to further enhance its security features. On the other hand, high-level applications can be developed to make use of these security modules. In summary, we find that the OpenSPARC FPGA platform is relatively easy to use for secure processor prototyping, with many advantages including its advanced software and hardware platform components.


A Study in Rapid Prototyping: Leveraging Software and Hardware Simulation Tools in the Bringup of System-on-a-Chip Based Platforms
O. Callanan, A. Castelfranco, C.H. Crawford, E. Creedon, S. Lekuch, K. Muller, M. Nutter, H. Penner, B. Purcell, M. Purcell and J. Xenidis

Abstract Traditional use of software and hardware simulators and emulators has been in efforts for chip level analysis and verification. However, prototyping and bringup requirements often demands system or platform level integration and analysis requiring new uses of these traditional pre-silicon methods along with novel interpretations of existing hardware to prototype some functions matching behaviors of future systems. In order to demonstrate the versatility and breadth of the presilicon environments in our systems lab, ranging from functional instruction set software simulators to Field Programmable Gate Array (FPGA) chip logic implementations to integrated systems of existing hardware built to mimic key functional aspects of the future platforms, we present our experiences with platform level verification, analysis and early software development/enablement for an I/O attached network appliance system. More specifically, we show how simulation tools along with these early prototype systems were used to do chip level verification, early software development and even system level software testing for a System on a Chip processor attached as an I/O accelerator via Peripheral Component Interconnect Express (PCI Express) to a host system. Our experiences demonstrate that leveraging the full range of pre-silicon environment capabilities results in full system level integrated software test for a I/O attached platform prior to the availability of fully functional ASICs. Index TermsSoftware debugging, software prototyping, accelerator architectures, product engineering, system analysis and design.

I. INTRODUCTION

Workload optimized computer systems represent an integrated approach to hardware and software development to achieve maximum performance from an available footprint, power or cost metric. For many of these systems,

Copyright 978-1-4577-0660-8/11/$26.00 2011 IEEE. O. Callanan, A. Castelfranco, E. Creedon, K. Muller, B. Purcell, and M. Purcell are with IBM Ireland Product Dist, Ltd, Muddart, Ireland (email: {owen.callanan, antonino_castelfranco, eoin.creedon, kay.muller, brianpurcell, mark_purcell}@ie.ibm.com). C.H. Crawford, S. Lekuch, M. Nutter, and H. Penner are with the IBM TJ Watson Research Center in Yorktown Heights, NY (email: {catcraw, scottl, hpenner}@us.ibm.com). J. Xenidis is with the IBM Austin Research Lab in Austin, TX (email: jimix@us.ibm.com).

general purpose processors are integrated with purpose built processors to balance ease of programming and implementation with integrated acceleration. When this design occurs in a single ASIC we refer to this as System On a Chip (SoC) architecture. One example of the SoC architecture is the IBM Power Edge of NetworkTM (PowerENTM) processor. This processor was designed to handle wirespeed, network facing applications. It consists of 64 PowerPCTM cores and a set of accelerators which exist as first class units in the memory subsystems on the chip. PowerENTM includes compression/decompression, encryption/decryption, regular expression (RegX pattern matching), and an Extensible Markup Language (XML) accelerators all connected via a high speed bus with an integrated Host Ethernet Adapter (HEA). (For a complete review of the PowerENTM architecture see [1]). The combination of integrated I/O with the accelerators and a massively multithreaded (MMT) capability targeting many sessions of parallel processing is ideally suited for many edge of network applications [2]. However, for some solutions and workloads, because of the availability of required libraries or components which require significant single threaded performance, implementing an entire end to end application on PowerENTM is not possible, and a hybrid solution is warranted. The integration of general purpose processors with special purpose accelerators has become a mainstay in performance sensitive computing solutions. In the high performance computing segment, hybrid architectures have been used to create systems with over a petaflop of performance using Cell Broadband Architecture [3] and GPUs [4] connected to x86 ISA based hosts. In other technical computing systems FPGAs have been employed to provide application specific acceleration [5]. Accelerated systems exist outside of technical and high performance computing as well. Today, there are a variety of vendors offering both ASIC and FPGA based I/O attached accelerators for TCP/IP offload [6], security [7], and financial data processing for real time trading [8], exactly some of the workloads for which PowerENTM was targeted,. In all of these systems, the accelerators were connected to the hosts via PCI Express, allowing for a variety of choices for both the host platform and operating system, and


so we also choose PCI Express for PowerENTM accelerated systems. There are also a variety of programming models to support these hybrid computing systems. Flexible runtimes such as OpenCL [9] and CUDA [10] provide versatile language bindings for a variety of applications. Runtimes derived from accelerator hierarchies, clusters of hybrid systems or a hybrid system built from x86_64 hosts and Cell Broadband Engine accelerators, have also been developed and show another level of flexibility of heterogeneous programming [11]. Other vendors build libraries for very specific function with the accelerator platform, and these libraries are then linked in with customer applications. In any of these approaches, fast and reliable communication is required between the host and the device both for latency (e.g. synchronization) and throughput (e.g. bulk data transfer) driven communication (see for instance [11]). For PowerENTM accelerated systems, we are especially concerned with the performance of our PCI Express data transport layer given that our set of targeted workloads have inherent wirespeed computing requirements. Designing and developing the appropriate PCI Express software stack for these performance sensitive applications requires significant access to systems for implementation and testing. In fact, verifying the PCI Express hardware features in PowerENTM in addition to a highly integrated and optimized user space networking stack from the HEA to the PCI Express interface requires testing from hardware functional tests to software unit tests all the way to full system level stress tests. In order to verify hardware designs and logic implementations, as well as develop and test software and stress test the system without jeopardizing time to market for our solution, we leveraged a full range of pre-silicon environments. Our challenge was to find the right mix of tools and prototype environments to efficiently develop, debug and test for what would eventually be a complex, heterogeneous system. This paper is organized as follows. In section 2, we describe the PCI Express hardware implementation on PowerENTM and the software stack which supported PCI Express connected PowerENTM to an x86 based Linux host. We review our instruction set simulation tools along with some prototype environments we used to develop and test the software in section 3. The hardware verification via HDL simulation and FPGA implementation is described in section 4. We demonstrate the flexibility of all these environments with a review of our comprehensive pre-silion test development, including performance and system stress tests in section 5. We conclude the paper with a section to summarize the value that the simulation tools and prototype environments provide for complex heterogeneous systems along with some of our future work plans in terms of application design and development.

II. THE POWERENTM PCI EXPRESS FUNCTION A. Hardware The PowerENTM architecture incorporated two PCIe ports. The two ports share up to 17 PCI Express Generation 2 lanes, with each lane providing a data traffic bandwidth of 500 MB/s. Additionally PCI Express Port 0 can be freely configured as PCI Express Host Bridge or as a PCI Express endpoint. In the context of the hybrid appliance architecture, the endpoint configuration is of interest, since it enables a PowerENTM chip to be attached to a host as a device. However, as we will show later in this paper, the ability to have the loopback configuration (Port 0 endpoint, Port 1 root complex) allowed PowerENTM chip logic to operate both functions at the same time, i.e. be a PCI Express host server and be a PCI Express device, and provide an additional hardware verification and software test configuration for bring-up. The PowerENTM PCI Express endpoint configuration provides a set of PCI Express Gen 2 features which gives a host a wide variety of useful features. Most prominent it supports SR-IOV [12], with the support of 2 physical (PF) and 16 virtual (VF) functions. This feature enables the development of device drivers on the host which exploit the PowerENTM in a virtual environment -- multiple device drivers can operate independently with hardware isolation. As an example that will be described in more detail in the next section, PF0 operates a virtual Ethernet driver whereas PF1 is used as platform of an userspace enablement driver. In future exploitations, the 16 virtual functions can be used for many instances of PCIe device drivers and corresponding device features (embedded analytics for instance), providing one instance per logical partition or user space process of the host. The PCI Express endpoint DMA engine in the PowerENTM architecture is connected to the main bus (PowerBus or PBus) on the chip as a coprocessor. As such, just as all other coprocessors on the PowerENTM chip (e.g. crypto), it has a PBus Interface Controller (PBIC) TLB which give provides mapping between bus addresses and virtual user space addresses and a DMA engine. Because it is a coprocessor, the DMA engine can be operated in userspace with coprocessor initiate (icswx) commands along with the hardware protection features of the PBIC TLB. Furthermore, by using an IOMMU on the host side, the PCI Express addresses can be translated into user space addresses, providing a user space to user space communication path from PowerENTM to the host. B. Software Given the rich PowerENTM hardware features to support user space PCI Express communication, the software stack needed to provide corresponding function for high-speed communications between userspace applications on the host and device. In other words, our PCI Express software infrastructure does not to require kernel involvement in packet transfers, as with standard socket style systems. In order to enable such high speed communication, a kernel device driver is required to provide low-level access to the PCIe device.


That is to say, after initial probing and provision of features (i.e. DMA engine, shared memory areas etc.) to userspace, the kernel device driver involvement is minimal. Since the PCI Express DMA engine is a PowerENTM coprocessor, we created a thin abstraction interface to initiate coprocessor requests. We called this interface libarb. This abstraction interface also provided requisite memory registration functions for DMA and MMIO between user space processes. We implemented libarb on both x86 and PowerENTM to allow for architectural neutrality as well for any software developed on top of this layer, thus allowing for us to only port this layer across the various simulation and prototype platforms. Since the DMA engine is only on the PowerENTM device, host initiated DMA actually first required the DMA command to be transferred via MMIO to the device and remapped to a device initiated call. Any thread management, locking, state management or endian conversion is done by the callers of libarb. By providing implementations of this abstraction, we have the ability to swap device drivers in/out depending on the particular bring up platform were using at the time whilst preserving the integrity of the upper layers. Layered on top of this is our communications protocol library, the Hardware Abstraction Layer for devices (HAL-d). HAL-d provides the necessary queue infrastructure required for two sided packet flow as well as the Remote DMA (RDMA) semantics and support for one sided throughput oriented bulk transfer. HAL-d is also not thread safe, but it is thread-friendly. That is to say that we provide for multiple DMA groups so that individual threads can initiate DMAs using separate command groups. Now that we have a full userspace stack in place, we require function to start host/device userspace applications. Starting host applications is trivial, given that the host is standard Linux server. However, starting applications on the remote device is not so straightforward. An intuitive method is to ssh into the remote system and then start the devices userspace application. In order to do this, we have an additional kernel device driver that essentially provides a virtual ethernet channel over PCI Express. On probing this device, a new network interface will be instantiated e.g. eth1, which can be configured using ifconfig. In order to avoid writing yet another network device driver, we researched the feasibility of using the virtio infrastructure [13] (particularly virtio_net) to provide most of the interaction with the kernels networking stack. Although virtio was originally intended for virtualization, it proved suitable for our purpose. Internally, this driver also uses the same communications protocol library for packet flow as the userspace stack, thus providing the possibility of having a device userspace to host kernel packet flow. Although many concepts about MMIO, receive and send queue management, and RDMA have been well understood in protocols and network research for decades, actual design and implementation on new hardware can be a challenge. This is especially true when working across endian boundaries. To

help facilitate our work on the PowerENTM platforms we used a variety of pre-silicon environments, many of which are described in [14]. In the next sections we describe how we leveraged these tools and environments for PCI Express and PowerENTM in order to meet our feature and performance requirements within strict time to market constraints. III. SOFTWARE DESIGN AND VERIFICATION A. Shared memory platforms Using memory areas shared between two processes is an effective way to mimic the behavior of PCI Express communications hardware without needing access to any specific hardware. By mapping shared memory areas into two separate processes running on a single system, one process runs the device-side code whilst the other process runs the host-side code, both MMIO and DMA data transfer methods can be simulated on a single system.
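Before any device-specific driver exists, the cross-mapping idea can be tried with nothing more than POSIX shared memory. The sketch below is a minimal illustration of our own, not the kernel module the team used: one region standing in for the device-side MMIO receive window is visible to two processes, the "host" writes into its send view and rings a doorbell byte, and the "device" polls its receive view.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Minimal model of the cross-mapped MMIO emulation: POSIX shared memory
     * stands in for the kernel module that creates the receive areas. */
    int main(void)
    {
        int fd = shm_open("/mmio_dev_rx", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *dev_rx = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        if (fork() == 0) {                           /* "device" process      */
            while (*(volatile char *)dev_rx == 0)    /* poll the doorbell     */
                usleep(1000);
            printf("device received: %s\n", dev_rx + 1);
            _exit(0);
        }
        strcpy(dev_rx + 1, "ping from host");        /* host writes send view */
        dev_rx[0] = 1;                               /* ring the doorbell     */
        wait(NULL);
        shm_unlink("/mmio_dev_rx");
        return 0;
    }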

Figure 2: A diagram showing the logical description of the HAL-d IOREMAP as described in the text for MMIO based data movement. MMIO is the simplest to implement; two shared memory areas are created within a kernel module, a device-side MMIO receive area and a host-side MMIO receive area. The deviceside area is mapped into the device process as the receive MMIO space, and into the host process as the send MMIO space. Similarly the host-side receive area is mapped to the host process as the receive MMIO space, and to the device process as the send MMIO space. This IOREMAP is shown in Figure 2. MMIO data communication is performed by, for example, the host process writing data to its send MMIO space, which will then be visible to the device process in its receive space. DMA is more complex to simulate due to its asynchronous nature, and since with many systems arbitrary areas of memory can be used as DMA buffers, once they have been suitably prepared. The libarb interface is the key to effectively simulating DMA with shared memory. Libarb places two key restrictions on DMA users (e.g. HAL-d): 1. All memory for DMA must be allocated through arb_allocate_buffer 2. DMA transfers may only be initiated by calling arb_send. Restriction 1 ensures that only shared memory areas are allocated to the user application as DMA buffers.
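To make the calling convention concrete, the sketch below mimics the shared-memory libarb behavior described here: arb_send() simply copies between a "local" and a "remote" DMA buffer and marks the completion struct before returning. The three function names come from the text, but their exact prototypes are not given in the paper, so the signatures and types used here are our own stand-ins.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Self-contained model of the shared-memory libarb described in the text;
     * the real implementation maps kernel-provided shared memory instead of
     * calling calloc(), and the prototypes below are assumptions. */
    typedef struct { volatile int done; } arb_completion_t;

    static void *arb_allocate_buffer(size_t len)      /* rule 1: all DMA memory */
    {
        return calloc(1, len);
    }

    static int arb_send(void *dst, const void *src, size_t len,
                        arb_completion_t *c)          /* rule 2: all transfers  */
    {
        memcpy(dst, src, len);                        /* synchronous in software */
        c->done = 1;
        return 0;
    }

    static int arb_check_completion(const arb_completion_t *c) { return c->done; }

    int main(void)
    {
        arb_completion_t comp = { 0 };
        char *host_buf = arb_allocate_buffer(64);
        char *dev_buf  = arb_allocate_buffer(64);
        strcpy(host_buf, "hello device");
        arb_send(dev_buf, host_buf, 64, &comp);
        while (!arb_check_completion(&comp))          /* trivially true here */
            ;
        printf("device sees: %s\n", dev_buf);
        return 0;
    }

Because the copy is synchronous, arb_check_completion() is already true when arb_send() returns, which is exactly the limitation discussed in the following paragraphs that motivated the port to the asynchronous data mover.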


This restriction is also useful for the PCIe libarb since it hides the complexities of creatig DMA'able memory from the libarb user. To enable user-space memory copies libarb also internally maps the "remote" DMA buffers, so when arb_send() is called it simply copies data between the "local" DMA buffer (mapped to the user-space process) and the "remote" DMA buffer, sets the completion struct and then returns. Due to restriction 2 this is invisible to the user-space process, so the behaviour of shared-memory libarb is identical to that of the libarb running on physical PCI Express PowerENTM hardware. With libarb, a DMA command is issued with arb_send(), and command completion is checked using arb_check_completion(). When run on a standard processor architecture however, such as Power or x86 instruction set, the simulated DMA transfers are inherently synchronous; the transfer is performed in software within the calling process' thread of execution. As a result shared memory on x86 does not properly test the DMA completion monitoring systems of any applications using libarb, since arb_check_completion will also return true when called after arb_send(). On PowerENTM PCI Express hardware the transfer is performed asynchronously by the chip's PCI Express DMA engine, and the engine notifies DMA completion by updating a struct held in the calling process' memory. DMA's may be completed out-of-order and at any time. To provide a more complete test of the DMA path, and to verify the HAL-d stack on PowerENTM's A2 processor architecture, the x86 shared memory libarb was ported to an early version of PowerENTM hardware. This hardware did not have PCI Express end-point functionality, so to test the DMA completion code arb_send was altered to use the asynchronous data mover (ADM). The ADM is a co-processor on the PowerENTM chip that asynchronously moves data between locations in PowerENTM memory. It's mode of operation is very similar to that of the PCI Express DMA engine, except that both the source and destination addresses must be in PowerENTM memory. The shared memory drivers were ported to single-chip PowerENTM, and arb_send was enhanced to use the ADM. In this way libarb users can test their completion monitoring more completely, without needing full PowerENTM PCI Express hardware. B. Hybrid Architecture Simulation The shared memory device drivers and libarb layer are very useful for early development and test of the user-space software stack, however they are little use for pre-silicon development of the kernel space device drivers. For this a much more sophisticated simulation environment that simulates at a hardware register level is required. A combination of the PowerENTM version of the IBM Full System Simulator (IBM FSS or Mambo) [15] and an x86 simulator called Simics [16], from WindRiver is used to provide this. Similar in concept to the IBM FSS, Windriver Simics speeds up system design, software development, deployment and test automation of hardware architectures such as embedded systems, single and multicore CPUs, complex

hybrid architectures and network connected systems like clusters, racks and distributed systems. It supports several processor families, IO devices and standard communication protocols. It runs indifferently the same binary working on real systems on modeled virtual hardware enabling the developers to program, debug and deploy firmware, device drivers, operating systems, middleware, and the application software. Debugging and testing are simplified by the use of a user friendly interface to run, break and stop the execution of the simulator, to inspect the hardware faults, save the hardware state for later inspection, get the output for using the system in batch from convenient automation tools. By appropriately connecting the Simics x86 simulator to the IBM FSS PowerENTM simulator, the PCIe link between an end-point mode PowerENTM processor and a root-complex x86 processor was simulated with sufficient accuracy to allow pre-silicon design and implementation of the x86 and PowerENTM device drivers. We used this software to develop two PCI Express devices drivers and a middleware on top of them. We have built two very small images containing busybox linux OS and a customized linux kernel, to have an agile environment for kernel development that allowed us to reboot, debug and modify quickly the system in order to advance investigation useful to possibly fixing hardware and software issues. We developed a set of bash scripts to create automatically these images and we created a set of Simics scripts to test our device drivers indifferently using Linux or Bare Metal Applications (BMAs) [14] on the device. Furthermore since the Simics/FSS environment runs Linux on both sides, it was also suitable for test and verification of the user-space software stack. Using the Simics File System we were able to mount the partition on the real machine to run user-space applications. Testing the user-space stack on Simics/FSS uncovered a number of bugs and problems which did not show up on the shared memory platform, in particular, those software bugs related to endian conversion., We were actually able to discover software bugs in the fully integrated software stack since the Simics scripts also provides a way to call the simulation in batch mode from an external automation tool, our test harness, for continuous integration development automating build, deployments, unit tests and functional tests. In particular, Simics provides a way to connect X terminals using a virtual serial connections that allowed us to stream the output of the host side and the device side terminal in log files saved on the host machine and to monitor execution of the tests to exit gracefully in case of errors during the executions. The tests are called automatically from an external automation tool and produce reports about the sanity of the software. The FSS/Simics environment was our first system level pre-silicon platform on which we developed and executed a comprehensive test suite for the PCI Express software and corresponding applications. C. Prototype PowerPCTM - x86 Testbed A third bring up platform was based on, existing hardware, namely the PowerXCellTM 8i based PCI Express accelerator board, or PX-CAB [17]. The PowerXCellTM 8i is a Cell BE


based processor, and thus has a PowerPCTM based main processor, just like PowerENTM. The PCI Express DMA engine logic implemented as a separate ASIC on the PX-CAB also resembles the PCI Express logic on PowerENTM. Given these two features of PX-CAB, it is the most closely related physical platform to our target platform. There are several benefits to using real hardware for bring up. Due to the fact that PX-CAB is a PowerPCTM system, its use immediately highlights any endian correction issues when attached to an Intel host. It is also an established and marketed device, thus being a stable platform for test case generation and execution, allowing us to use it to drive more complete end-end test case scenarios, for example, using the PX-CAB as a NIC and performing some packet processing prior to forwarding packets to the host. Additionally, we were able to re-use some portions of the existing PX-CAB device driver to accelerate bring up. Our PX-CAB stack is identical to our target software stack at the upper stack layers, the main difference being the actual device driver and an implementation of our libarb abstraction layer to interface with it. Also, some development effort was required on the PX-CAB device driver to bring it up to the kernel version of our target software stack, from 2.6.26 to 2.6.32 which helped development teams gain learning on pushing our driver from one kernel version to another. One particularly useful aspect of using PX-CAB was to be able to stress test and performance tune the upper stack layers relatively early. Especially in relation to our packet flow communications library, we were able to stress the packet queues and ensure that there was no unnecessary polling across the PCI Express bus accessing either the send or receive queues. Measuring the latency for direct data transfers it was possible to ascertain any large discrepancy between this and the transfers using our communications library. IV. HARDWARE VERIFICATION A. HDL Simulator The HDL which described the actual implementation of the PowerENTM chip was regularly run on the hardware simulator [14] to ensure the functionality and performance of the resulting chip. As stated previously PowerENTM has two PCI Express ports which could be operated in different modes. This feature enabled us to create an easy test configuration with only minor changes of the HDL. This was done by connecting PCI Express Port 1 Root Complex (RC) to the PCI Express Port 0 Endpoint Complex (EP). On the HDL it is basically connecting the TX lanes from one port to the RX port of the other. This configuration enabled a wide range of test, starting as being able to run the full initialization code including the PCI Express scan to extensive testing of PCI Express EP functionally like DMA. And we were able to find issues we could resolve before the real chip, significantly improving the quality of the first chip samples. The major development and testing on this environment was the firmware code responsible for the PCI Express setup of

PCI Express EP and PCI Express RC. For the former a static setting has to be applied whereas for the later the initialization code is more complex by the nature of the logic PCI Express scan has to do a PCI Express bus walk and configure the device find on this way. The wrap we had with our model gave this code an opportunity to find a device, which is a SR-IOV capable PCI Express EP, providing the most complex setup of existing PCI Express devices. With this we could not only find bugs in the newly developed SR-IOV capable scan code but also give feedback to the chip team where the functionality was not correct or the documentation lacked preciseness. Another challenge we addressed in the loopback test setup was the complex process of PCI Express link training. Since the PHY was part of the HDL simulated and the PCI Express lanes connected where after the PHY we were able to run the link training sequence and spend quite some time to get the link up and running. This not only exposed to the firmware team to complexity of the link training but also provided the opportunity ahead of hardware availability to have initialization code and debug code ready for the debug on the real hardware. The second task on this environment was the development of testcases for the PCI Express EP functionality, e.g. DMA, MMIO access, interrupt handling and SR-IOV. As part of the firmware a special wrap test was created, which ensured the functionality of the entire feature set listed above in the wrap configuration. This test again not only discovered bugs in the early phase but also gave us first performance impression, since the HDL model is cycle accurate. Only by this effort we were able to ensure the full functionality of PCI Express of the PowerENTM chip. And as a side effect we created a combined hardware and software team which fully understands the system architecture and was able to do the real hardware bringup in a much shorter test cycle than our lab has previously experience for full platform enablement. B. FPGA Emulation of the PCI Express Platform The PowerENTM FPGA Based Emulator (PFBE) architecture [14] could be used to exercise unique unit functions by reassigning the logic configurations assigned to the FPGA units. For example, logic cards connect to the central bus could be assigned to hold either additional processing nodes, or reconfigured to hold multiple instances of either I/O or accelerator units. In one example, multiple instances of a single accelerator were emulated in the system by re-assigning multiple instances of the accelerator to FPGA's originally provisioned to be used for other accelerator units. Having multiple instances of the accelerator in the FPGA logic created a platform that enabled the software team to exercise advanced unit to unit SMP communication functions of the accelerator engines well in advance of the availability of the final ASIC mounted in an SMP configuration. In a second example, a unique PCI Express to PCI Express wrap logic block was created to bridge two controllers in the PCI Express logic chiplet. This wrap logic block removed a dependency on any external physical I/O by directly connecting the PIPE interface


of one PCI Express unit configured as a host controller, to the PIPE interface of a PCI Express unit configured as a endpoint. By directly connecting the FPGA logic units, without any physical external connection, all of the logic could be run at the same relative system speed, eliminating external real time dependencies. This combination of root complex with endpoint complex wrapped together in the FPGA system enabled the software development teams to develop the drivers for early root and endpoint logic function. As a result of this early pre-silicon work, the software team was able to demonstrate PCI Express driver function on the chip within one week of receiving the physical chip. V. TEST DEVELOPMENT AND MULTI-PLATFORM EXECUTION Given the variety of function capabilities in our pre-silicon environment, we were able to develop a comprehensive platform test suite to validate function, measure performance, and stress both the hardware and software for the x86 host PowerENTM device environment. To achieve a broad range of test execution on multiple pre-silicon platforms a common test framework was developed which could be utilized across all target test platforms. Our test framework began with the various unit and function level testing of the A2 cores and corresponding memory subsystem. For this, we leveraged the Linux Test Program (LTP) suite [18], augmented with lmbench [19], along with some microbenchmarks that we developed internally which exercised some of the key features of a highly multithreaded indexing code, e.g. a pointer chase test. We then added component level testing for each of the coprocessors to stress thread concurrency. Finally, at the chip level we added user level networking. Where possible, similar tests were run in the BMA environment to measure software overheads and ensure any errors found were introduced by our own PBIC address management code. As part of system level testing, test suites have been developed to test the MMIO, DMA and virtual ethernet channels of the PCI Express software stack for PowerENTM. As it is difficult to predict how a user will utilize HAL-d, e.g what combinations of DMA vs. MMIO, data sizes, address offsets, etc., we had to develop a test harness with which one could easily change the data patterns and simulate multithreaded and multi-process use cases, as well as the standard single-thread single-process use case. The test driver was designed primarily to enable the easy, robust and extendable testing of HAL-d functionality. It also provides monitoring and reporting functionality. Monitoring is especially important in utilities such as HAL-d where blocking communication plays a role. Should a test fail during the communication or hang, the monitoring function has the ability to terminate the test after a user defined timeout. Defect identification is assisted by the reporting functionality of the framework. This functionality can provided detailed descriptions of the type of failure encountered and also generate its own backtrace allowing for quick identification of any issues should they arise. This test framework provides uniform test execution across

This test framework provides uniform test execution across all target test platforms, making results comparison, issue identification and status reporting easier. For instance, we found several scenarios where the software tests would pass in the functional simulator environment, but the same software tests would fail on the PFBE loopback system. Upon further inspection and discussions with the hardware teams, we discovered that the problem was caused by errors in the hardware documentation when compared with the implementation. Because we could isolate this to a functional level on the PFBE system well ahead of silicon delivery, this helped us considerably when actual hardware arrived. In comparison to testing on the simulated hardware, in which test execution can take much longer than on real hardware, shared memory testing provides real-world execution times of the PCI Express software stack on the target architecture. This allows for more efficient prototyping of the test framework, as both host and device processes are resident on a single machine and the tests themselves run at current microprocessor speeds. Along with PX-CAB, which provided fast execution times for developing tests around endian issues, these systems provided the test team with a complete test execution environment prior to the release of PowerENTM. Using the different pre-silicon platforms, satisfactory test coverage of the target architecture can be achieved. We were also able to leverage the wide variety of our pre-silicon platforms to create performance models of the PCI Express hardware and software. As mentioned previously, the hardware performance model was developed using the awanstar cycle accurate simulator. On top of that we took measurements on both shared memory x86 as well as the PX-CAB environment to develop estimates of the HAL-d software overhead. For PowerENTM, we took measurements for both pure shared memory and the asynchronous data mover (ADM) for DMA. Interestingly enough, these measurements showed performance variations depending on the size of individual transfers as well as the number of transfers before synchronizing when using the ADM. Upon further investigation we found that this was an artifact of the ADM hardware architecture and not something we would necessarily find in the PCI Express DMA engine. We adjusted the performance models accordingly. In the end, we created a performance model, and the results of this suite and model have been used to identify performance bottlenecks and tune the PCI Express software stack to remove them. Finally, to gauge the HAL-d and PCI Express software stack performance, both user space and PCI Express link level applications were used. Differences between the applications reflect the overhead of HAL-d and the PCI Express software stack, allowing the performance model to be created independent of the execution platform.

VI. DISCUSSION AND FUTURE WORK

A network processor attached via PCI Express to an x86-based host allows for a variety of applications, ranging from


intrusion detection systems to financial market data feed handlers to sensor network data aggregation and filtering. Many of these applications target hybrid architectures to gain a substantive performance benefit while maintaining the general programmability and library availability of the host. The hybrid approach also allows a development path in which portions of the code can be ported while others remain on the x86 host, allowing for greater experimentation as well as a progressive, limited-risk process. For all of these reasons, the hybrid computing approach is attractive, especially one in which an integrated, highly programmable and powerful network processor such as PowerENTM is used. In order to enable PowerENTM's PCI Express hardware capabilities and the corresponding high-performance applications, we implemented a low-latency, high-throughput PCI Express software stack, containing both MMIO and RDMA based programming interfaces. This software stack was designed with the requirements of application data planes in mind. Since many applications also require control plane operations, such as logging, heartbeating, etc., we also implemented a virtual Ethernet over PCI Express stack which allows for standard socket-based programming between the x86 host process and the PowerENTM based device process. Application developers therefore have a choice of interfaces to leverage when optimizing for the PowerENTM PCI Express system. To gain the greatest benefit from workload-optimized application development on combined general purpose and purpose built systems, the software application and the hardware platform, including the processor, need to be designed and developed in an integrated process. This integrated development process implies that iterative hardware designs should be considered as applications are ported and tuned. Therefore, the availability of system level pre-silicon environments for hardware verification and test along with software development is crucial not just from a time to market perspective, but also from a performance optimization perspective. In this paper we have presented a variety of pre-silicon environments which were used to validate hardware and software designs for the new PowerENTM processor in a hybrid system. Traditionally, pre-silicon environments have been used at the microprocessor level. System level design and development work, including system test, required the existence of silicon and early hardware. Our goals were to design and prototype at the system level, from processor unit verification all the way to system software design and enablement to fully integrated stress tests. In order to accomplish this we had to include standard software instruction set simulators, HDL simulators, a novel approach to using FPGAs to emulate actual chip logic VHDL at microprocessor speeds, and even some early hardware prototype environments. The various stages and requirements of integrated development were carefully reviewed and the various pre-silicon environments were

chosen for appropriate and efficient verification, development and debugging. For instance, much of the early design and development of our PCI Express software stack was done in shared memory and then on prototypes mimicking tightly coupled x86 and PowerPCTM systems. Once the PCI Express software stack was available on the various pre-silicon environments, build acceptance, functional verification, and even full system, stress and performance tests were developed. This allowed both the full software stack and the various tests to be available as soon as the real hardware arrived. Our test suites have been developed not just to exercise the PCI Express function, but also the entirety of the system used by applications of interest, e.g. the HEA for packet processing, the A2 PowerPCTM binaries, and the PowerENTM coprocessors, along with moving data up to and back from the x86 host. We are currently evaluating a variety of wire-speed applications in the areas mentioned previously in terms of their performance characteristics and ability to stress the overall system, in order to integrate them into our system test suite. As actual hardware arrives, having already developed and debugged the core of the system infrastructure will allow us to focus on the performance optimizations required to reach new levels of processing capability in network computing.

ACKNOWLEDGMENT

We are grateful to Nancy Greco and Heather Achilles for their continued support and guidance throughout all phases of these projects, as well as to the members of the Hybrid Systems Lab, specifically Heinz Baier, Thomas Hovarth, Ken Inoue, and Steve Millman.

REFERENCES
[1] H. Franke, J. Xenidis, C. Basso, B.M. Bass, S.S. Woodward, J.D. Brown, C.L. Johnson, "Introduction to the wire-speed processor and architecture," IBM Journal of Research and Development, Vol. 54, No. 1, paper 3, January/February 2010.
[2] D.P. Lapotin, S. Daijavad, C.L. Johnson, S.W. Hunter, K. Ishizaki, H. Franke, H.D. Achilles, D.P. Dumarot, N.A. Greco, B. Davari, "Workload and network-optimized computing systems," IBM Journal of Research and Development, Vol. 54, No. 1, paper 1, pp. 1-12, January/February 2010.
[3] M. Kistler, J. Gunnels, D. Brokenshire, B. Benton, "Petascale computing with accelerators," Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2009), Raleigh, NC, 2009, pp. 241-250.
[4] R. Stone and H. Xin, "Supercomputer leaves competition and users in the dust," Science, Vol. 330, No. 6005, pp. 746-747, 5 November 2010.
[5] R. Sass, W.V. Kritikos, A.G. Schmidt, S. Beeravolu, P. Beeraka, "Reconfigurable Computing Cluster (RCC) Project: Investigating the Feasibility of FPGA-Based Petascale Computing," Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 127-140, IEEE Computer Society, Washington, DC, USA, 2007.
[6] "The Unified Wire," Chelsio Communications whitepaper, http://www.chelsio.com/unifiedwire_eng.html.
[7] "Accelerators in Cray's Adaptive Supercomputing," http://rssi.ncsa.illinois.edu/2007/docs/industry/Cray_presentation.pdf
[8] G. Valente, "Implementing Hardware Accelerated Applications For Market Data and Financial Computations," HPC on Wall Street, September 17, 2007, New York, NY, http://www.lighthousepartners.com/highperformance/presentations07/Session-7.pdf
[9] OpenCL, http://www.khronos.org/opencl/
[10] CUDA Zone, NVIDIA Corporation; see http://www.nvidia.com/object/cuda_home.html#.
[11] IBM Corporation, "Data Communication and Synchronization Library Programmer's Guide and API Reference," http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf
[12] PCI Special Interest Group, "Single Root I/O Virtualization," http://www.pcisig.com/specifications/iov/single_root/
[13] R. Russell, "virtio: Towards a De-Facto Standard For Virtual I/O Devices," http://portal.acm.org/citation.cfm?id=1400108
[14] J. Aylward, C. Cox, C.H. Crawford, K. Inoue, S. Lekuch, K. Muller, M. Nutter, H. Penner, K. Schleupen, J. Xenidis, "A Review of Software and System Tools for Hardware Design, Verification and Software Enablement for System-on-a-Chip Architectures," IBM Research report, submitted.
[15] P. Bohrer, M. Elnozahy, A. Gheith, C. Lefurgy, T. Nakra, J. Peterson, R. Rajamony, R. Rockhold, H. Shafi, R. Simpson, E. Speight, K. Sudeep, E. Van Hensbergen, and L. Zhang, "Mambo: A Full System Simulator for the PowerPC Architecture," ACM SIGMETRICS Performance Evaluation Review, Vol. 31, No. 4, pp. 8-12, March 2004.
[16] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, B. Werner, "Simics: A Full System Simulation Platform," IEEE Computer, Feb. 2002.
[17] H. Penner, U. Bacher, J. Kunigk, C. Rund, H.J. Schick, "directCell: Hybrid systems with tightly coupled accelerators," IBM Journal of Research and Development, Vol. 53, No. 5, paper 2, 2009.
[18] LTP: Linux Test Project; see http://ltp.sourceforge.net/
[19] LMBench; see http://www.bitmover.com/lmbench/

Rapid automotive bus system synthesis based on communication requirements


Matthias Heinz, Martin Hillenbrand, Kai Klindworth, K.-D. Mueller-Glaser
Karlsruhe Institute of Technology (KIT), Germany Institute of Information Processing Technology Email: {heinz, hillenbrand, kai.klindworth, klaus.mueller-glaser}@kit.edu
Abstract: The complexity of modern cars, and with it their electric/electronic architecture (EEA), has increased rapidly during the last years. New applications like driver assistance systems are highly distributed over the network of hardware components. More and more systems share common sensors placed in sensor clusters. This leads to a greater number of mutually connected electronic control units (ECUs) and bus systems. The traditional domain-specific approach of grouping connatural ECUs into one bus system does not necessarily lead to an overall optimal EEA design. We developed a method to automatically determine a network structure based on the communication requirements of ECUs. Based on the EEA model, which is developed during the vehicle development life-cycle, we have all the information we need, like cycle times and data widths, to build a network of automotive bus systems. We integrated our method into the EEA tool PREEvision to allow rapid investigation of realization alternatives. The relocation of functions from one ECU to another can ideally be supported by our method, since we can generate a new network structure fitting the new communication demands within minutes.

I. INTRODUCTION

During the design of a vehicle, several thousand signals between up to 70 ECUs have to be considered [1]. Based on the given requirements, the network designer has to set up a compound of automotive bus systems. Arising from the communication requirements, he has to select the right bus systems for the given bus load. Although not all ECUs in a new vehicle model are newly introduced and former architectures can be taken as reference, the high innovation rate in vehicle electronics leads to a bunch of new functions in every new car. Also, new concepts like the use of sensor clusters, sharing sensors among several ECUs, lead to changed design requirements. Moving functions from one ECU to another may require a restructuring of bus systems. New technologies, like radar or video based driver assistance systems, feature high requirements concerning the required data rates. This additionally leads to new connections to the already established ECUs. Currently the complete bus system architecture is designed by hand. Using a domain approach, where ECUs of a certain application area are bound together, designers try to handle the complexity. In the past, powertrain, chassis and body bus systems were installed to fulfill the communication needs. But new, highly distributed systems, like for example Adaptive

Cruise Control (ACC) or lane keeping, are distributed over a network of many ECUs and do not clearly fit into this fixed structure. So the grown domain-specific approach does not necessarily lead to the overall optimal design. To avoid these problems and to speed up the prototyping and development process, we present a method that allows to automatically generate automotive bus system connections between ECUs. The connections are based on the communication requirements of the provided ECUs. To get all necessary information, we employ the data provided in the electric/electronic architecture (EEA) model. Since the EEA model is a crucial part of the overall vehicle design process, the data provided by this model is always up to date and can ideally be taken as reference. This allows to directly influence the network structure based on the current data requirements. Since our methodology has been implemented as a plugin for the Eclipse-based EEA modeling tool PREEvision, we can automatically create all generated bus systems with one click. The partitioning of ECUs we use for building our networks has been the basis for hardware/software co-design approaches for a while. The place and route algorithms used in tools for reconfigurable hardware devices like Field Programmable Gate Arrays (FPGAs), and also in Very-large-scale integration (VLSI) processes, extensively use partitioning techniques [2]. The literature provides many different algorithms for solving such problems. Basic algorithms for Electronic Design Automation (EDA) approaches for electronic devices can e.g. be found in [3]. We adapted several techniques for the partitioning problem described in this paper. The remainder of the paper is organized in five sections. A short introduction to automotive bus systems and architecture modeling is given in sections II and III. Section IV presents our approach of ECU partitioning, describing the clustering, the implemented nearness functions, partition optimizing and partition merging. Section V describes the verification and test of our approach, followed by conclusions and outlook in section VI.

II. AUTOMOTIVE BUS SYSTEMS

Current vehicles feature a number of different bus systems fulfilling the diverse communication requirements of the distributed network of interconnected sensors, ECUs and


actuators. Based on their communication bandwidth and application, they are separated into different classes. Currently deployed bus systems are listed in Table I. Since infotainment bus systems have not been designed for open or closed loop control, they form their own class.
TABLE I: OVERVIEW OF AUTOMOTIVE BUS SYSTEMS [8]

Class        | Data rate        | Deployed buses
A            | < 25 kbit/s      | Local Interconnect Network (LIN) [4]
B            | 25-125 kbit/s    | Controller Area Network (CAN) Class B [5]
C            | 125-1000 kbit/s  | Controller Area Network (CAN) Class C [5]
D            | > 1 Mbit/s       | FlexRay [6]
Infotainment | > 10 Mbit/s      | Media Oriented Systems Transport (MOST) [7]

III. ELECTRIC/ELECTRONIC ARCHITECTURE MODELING

During the concept phase of a vehicle the electric/electronic architecture (EEA) is designed. The modeling of such architectures allows to balance the possible realization alternatives and to find an overall design. The tool PREEvision, which is used by leading car manufacturers [9], allows to design such complex architectures, containing up to 800,000 elements for a premium car. To handle this complexity, different perspectives on the model are provided (Fig. 1). The EEA elements required for our method are located in the function network and the component network. Functions feature ports that implement interfaces. An interface describes the data elements exchanged between the participants. A communication requirement, which can be allocated to a port prototype, allows to describe the cycle time of the exchanged data elements. In the component network, ECUs, bus connectors and bus systems are modeled.

IV. ECU PARTITIONING

The following information from the EEA model is utilized for starting the ECU partitioning. Elements can be directly accessed in the model using Java code.
- Cycle time given by the communication requirement (PortCommunicationRequirement)
- Type and number of data elements out of interfaces (DataElement)
- Sending ECU (sender)
- Receiving ECU (receiver)

A. Representation as graph

The representation of ECUs and their communication requirements as a graph allows to solve the partitioning problem with the help of algorithms. In our case, edges represent the exchanged data while nodes represent ECUs. The algorithm tries to partition the nodes into single networks while trying to reduce the cutting costs between the partitions. Partitioning is a classical problem in computer science and is considered non-deterministic polynomial-time hard (NP-hard). There

Installation Location

Segment
Branchoff

Installation Location

Installation Location

Installationspace

Installationspace

Fig. 1.

Layered EEA

are some well established algorithms to solve such problems [10], namely Kernighan-Lin, Fiduccia-Mattheyses, Simulated Annealing, Hierarchical Clustering, Evolutionary Algorithms, Integer Linear Programming and Tabu Search.

B. Hierarchical clustering

To build a set of partitions in the first place we used the Hierarchical Clustering (HC) algorithm. While the nodes of the graph are ECUs, weighted edges indicate the nearness of the ECUs, represented by their communication requirements. The edges are undirected because the direction of information exchange does not influence the result. The information flow in both directions between the ECUs is summed up in the weight of the edge. Hierarchical Clustering starts with one ECU in each cluster and merges the two partitions with the greatest nearness. This proceeds until only one partition is left. Each group of partitions featuring all ECUs forms a possible solution, independent of the solution step in which it appears. This feature will be utilized in the succeeding steps. To find the best overall solution, the costs for each determined solution have to be calculated. The costs for one partition are composed of the following parts:

- Bus system costs: the costs for establishing a bus system
- Bus participant costs: calculated for each bus member
- Gateway costs: calculated if there is data transfer to other partitions
- Costs per byte/s of external data transfer: covers the costs for the data that must be transferred through the gateway



The overall cost of a partition is the cheapest of the cost sums calculated for all possible bus systems (LIN, CAN, FlexRay). If no bus system can fulfill the communication requirements, the algorithm returns an error message.
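The cost composition described above can be illustrated with a small Java sketch. The bus prices, capacities and the gateway/external-traffic terms below are simplified, fictitious stand-ins in the spirit of the example costs used later in this section, not the values of the actual tool:

```java
import java.util.OptionalInt;

// Illustrative cost evaluation for one partition: pick the cheapest bus type
// that can carry the internal traffic; add gateway and external-traffic costs
// if data leaves the partition. All numbers are fictitious.
public class PartitionCost {
    enum Bus {
        LIN(20_000, 50, 5), HS_CAN(1_000_000, 200, 10), FLEXRAY(10_000_000, 300, 25);
        final long capacityBitPerS; final int busCost; final int deviceCost;
        Bus(long cap, int bus, int dev) { capacityBitPerS = cap; busCost = bus; deviceCost = dev; }
    }

    /** Cheapest feasible cost for one partition, or empty if no bus type fits. */
    static OptionalInt cheapestCost(int devices, long internalBitPerS,
                                    long externalBytePerS, int gatewayCost, double costPerExtByte) {
        int best = Integer.MAX_VALUE;
        for (Bus bus : Bus.values()) {
            if (internalBitPerS > bus.capacityBitPerS) continue;       // bus load would be exceeded
            int cost = bus.busCost + devices * bus.deviceCost;         // bus system + participants
            if (externalBytePerS > 0)                                  // data leaving the partition
                cost += gatewayCost + (int) Math.ceil(externalBytePerS * costPerExtByte);
            best = Math.min(best, cost);
        }
        return best == Integer.MAX_VALUE ? OptionalInt.empty() : OptionalInt.of(best);
    }

    public static void main(String[] args) {
        // E.g. a 5-device partition with 180 kbit/s of internal traffic and no
        // external traffic yields 250 (High-speed CAN: 200 + 5 * 10).
        System.out.println(cheapestCost(5, 180_000, 0, 0, 0.0));
    }
}
```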
TABLE II: CALCULATED COSTS FOR THE GRAPH IN FIGURE 2

Partition | Bus system                 | Own costs / Costs of children
1         | FlexRay                    | 950 / 860
2         | FlexRay                    | 525 / 490
3         | High-speed CAN             | 370 / 570
4         | High-speed CAN (5 devices) | 250 / -
5         | High-speed CAN (4 devices) | 240 / -
6         | High-speed CAN (8 devices) | 280 / -
7         | High-speed CAN (9 devices) | 290 / -

Fig. 2. Partition tree after Hierarchical Clustering



To find the overall best solution for the whole HC tree, the algorithm (Fig. 3) processes the tree beginning at the top. In the first step, the costs for the currently selected partition are calculated. Afterwards the costs for the child partitions are calculated and their sum is compared to the own costs. The cheaper solution is then taken. So the algorithm steps recursively through the tree and determines the cheapest solution over all possible partitions. Using the graph in Fig. 2, the algorithm returns the solution in Table II. To get a better overview and to keep it simple, we set the gateway and external data costs to zero for this example. We set a High-speed CAN bus to 200, an HS-CAN device to 10, a FlexRay bus to 300, and a FlexRay device to 25 as fictitious costs for this example. As the cheapest solution, partitions 3, 4 and 5 emerge. Since the HC algorithm does not take the available data rate of a bus system into account during partitioning, inappropriate partition sizes can appear (Fig. 4). To overcome this issue, an additional bin packing algorithm has been implemented to merge the under-utilized partitions together.

Data: Partition part
Result: costs c, list of partitions list
  myCost := costOfPartition(part)
  mySolution := {part}
  childrenCost := 0
  childrenSolution := {}
  while child := part.nextChild do
      childrenCost += cheapestSolution(child).c
      childrenSolution := childrenSolution ∪ cheapestSolution(child).list
  if childrenCost < myCost and childrenSolution ≠ {} then
      return (childrenCost, childrenSolution)
  else
      return (myCost, mySolution)
Fig. 3. Cheapest solution algorithm
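A compact Java rendering of this recursion (Fig. 3) could look as follows; the Partition type and its cost function are simplified placeholders for the tool's model classes:

```java
import java.util.*;

// Recursively walks the HC tree: keep a cluster as one bus, or split it into
// its children, whichever is cheaper (cf. Fig. 3).
public class CheapestSolution {
    static class Partition {
        int ownCost;                                  // cost of realizing this cluster as one bus
        List<Partition> children = new ArrayList<>();
    }
    record Solution(int cost, List<Partition> partitions) {}

    static Solution cheapest(Partition part) {
        Solution mine = new Solution(part.ownCost, List.of(part));
        if (part.children.isEmpty()) return mine;     // leaf of the HC tree
        int childrenCost = 0;
        List<Partition> childrenSolution = new ArrayList<>();
        for (Partition child : part.children) {
            Solution s = cheapest(child);             // recurse into the subtree
            childrenCost += s.cost();
            childrenSolution.addAll(s.partitions());
        }
        return childrenCost < mine.cost()
                ? new Solution(childrenCost, childrenSolution)  // splitting is cheaper
                : mine;                                         // keep the merged cluster
    }
}
```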

Fig. 4. Inappropriate partitioning: Low-speed CAN buses with 90% and 30% bus load (*bus load is approx. 50% of the 125 kbit/s data rate for Low-speed CAN)

C. Nearness function

To execute the HC algorithm, a nearness function has to be implemented. The obvious idea of taking the absolute data rate between the partitions turned out to be inapplicable. A usual network consists of unequally fast participants. An algorithm only taking the data rate into account would start to merge the fastest participants together. A big partition dominating the others would emerge. Since it is very likely that left-over partitions featuring only one ECU also feature a high nearness to this big partition, one after another would be added to the big partition. This leads to an abnormal HC tree. This behavior is not desired, since the slower participants then have no chance to build their own network featuring minor bus requirements. During the development of nearness functions, it turned out that different nearness functions lead to different results. In the first nearness function, which we call relative nearness, the data transfer of the current partition to another partition is divided by the overall transfer rate of the current partition. This shows the percentage quota of the communication to other partitions (Fig. 5). This enables partitions communicating strongly with each other to have a high nearness, even if they have low data rates. To avoid the merging of slow nodes with faster ones, always the smaller value for one connection is used. In Fig. 5 a nearness of 0.05 would be taken for the shown connection. We call this both-sided relative nearness. Because the right node massively lowers the nearness of the connection, and so the sum of all connections for the left node, the quotient of one connection to all other connections is again calculated for the weighted nearness (Fig. 6).


Fig. 5. Weighting of connections

The sum of the connections in this example is 0.20 + 0.14 + 0.02 = 0.36. If we divide the single values by 0.36 we get 0.56, 0.39 and 0.06 as new values.
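As an illustration, the relative, both-sided relative and weighted nearness measures described in this subsection can be computed on a plain adjacency-map graph as in the following Java sketch (the data structures are simplified and not those of the actual implementation):

```java
import java.util.Map;

// Nearness measures on a minimal adjacency-map graph:
// graph: partition -> (neighbour -> summed bit rate in both directions).
public class Nearness {
    /** Share of a partition's total traffic that goes to one particular neighbour. */
    static double relative(Map<String, Long> edgesOfA, String b) {
        long total = edgesOfA.values().stream().mapToLong(Long::longValue).sum();
        return total == 0 ? 0.0 : (double) edgesOfA.getOrDefault(b, 0L) / total;
    }

    /** Both-sided relative nearness: the smaller of the two relative values. */
    static double bothSided(Map<String, Long> edgesOfA, Map<String, Long> edgesOfB,
                            String a, String b) {
        return Math.min(relative(edgesOfA, b), relative(edgesOfB, a));
    }

    /** Weighted nearness: one connection's both-sided value normalised by the
     *  sum over all of A's connections (cf. the 0.20 / 0.56 example above). */
    static double weighted(Map<String, Map<String, Long>> graph, String a, String b) {
        double sum = 0.0;
        for (String n : graph.get(a).keySet())
            sum += bothSided(graph.get(a), graph.get(n), a, n);
        return sum == 0 ? 0.0 : bothSided(graph.get(a), graph.get(b), a, b) / sum;
    }
}
```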
Fig. 6. Balanced weighting of connections
It turned out that penalizing partitions with many connections leads to a better solution, so we divided the weighted nearness by the number of connections. We call this new function shared weighted nearness. The software we developed contains all of the above mentioned nearness functions, since the structure of future networks cannot be foreseen. Therefore all the different functions are calculated and the best solution is taken. Since the best solution for a certain communication network is unknown, we used randomly generated networks to prove the concepts experimentally.

D. Partition optimization

Since the HC algorithm follows a greedy strategy and takes a locally optimal choice at each stage, it does not necessarily lead to the overall best solution. To improve the solution found by HC, we implemented a succeeding Fiduccia-Mattheyses (FM) algorithm. This iterative algorithm, featuring a complexity of O(n), helps to lower the cutting costs between partitions. The advantage of this algorithm is that it does not require partitions of equal size like, e.g., Kernighan-Lin. A balance criterion can be used to check that the balance between partitions is not too unequal and that the bus load of the bus system is not exceeded. The implemented FM algorithm comprises the following steps:
1) The gain of all nodes for shifting between the partitions is calculated.
2) A node not violating the balance criterion and holding the highest gain is selected. If several nodes feature the same gain, the one best fitting the balance criterion is selected.
3) This node is added to a list and cannot be moved in further steps during this optimization pass.

4) Until all nodes are in this list, the gain for all connected nodes is recalculated and the algorithm continues with step 2.
5) The shifting sequence is then executed up to the point of the highest aggregate gain. If the gain is negative for all steps, we stop and do not shift any nodes. Otherwise, the list of fixed elements is cleared and we start over with step 1.
The first pass of the algorithm is depicted in Fig. 7. The shown nodes are ECUs, while the dashed lines indicate the limits of the partitions. The numbers below the nodes show the achievable gain when moving the node to the other partition.

Fig. 7. Example of the first pass of the FM algorithm (balance criterion: 1 <= partition size <= 3)
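A simplified Java sketch of one such FM pass over two partitions is given below; the balance criterion and the gain bookkeeping are reduced to the essentials and do not reflect the full tool implementation:

```java
import java.util.*;

// One simplified FM pass over two partitions. side[i] is 0 or 1; w[i][j] is
// the traffic between ECUs i and j; gain = external minus internal edge weight.
public class FmPass {
    static int gain(int node, int[] side, int[][] w) {
        int g = 0;
        for (int j = 0; j < w.length; j++)
            if (j != node) g += (side[j] != side[node] ? w[node][j] : -w[node][j]);
        return g;
    }

    /** Runs one pass and keeps only the move prefix with the best aggregate gain. */
    static void pass(int[] side, int[][] w, int minSize, int maxSize) {
        int n = side.length;
        boolean[] fixed = new boolean[n];
        List<Integer> order = new ArrayList<>();
        List<Integer> cumulative = new ArrayList<>();
        int running = 0;
        for (int step = 0; step < n; step++) {
            int best = -1, bestGain = Integer.MIN_VALUE;
            for (int v = 0; v < n; v++) {
                if (fixed[v]) continue;
                int from = side[v];
                int fromSize = (int) Arrays.stream(side).filter(s -> s == from).count();
                if (fromSize - 1 < minSize || (n - fromSize) + 1 > maxSize) continue; // balance criterion
                int g = gain(v, side, w);
                if (g > bestGain) { bestGain = g; best = v; }
            }
            if (best < 0) break;
            side[best] ^= 1;                       // tentatively move the node
            fixed[best] = true;                    // locked for the rest of this pass
            running += bestGain;
            order.add(best);
            cumulative.add(running);
        }
        int bestPrefix = -1, bestValue = 0;        // keep only the best positive prefix
        for (int i = 0; i < cumulative.size(); i++)
            if (cumulative.get(i) > bestValue) { bestValue = cumulative.get(i); bestPrefix = i; }
        for (int i = order.size() - 1; i > bestPrefix; i--)
            side[order.get(i)] ^= 1;               // undo moves beyond the best prefix
    }
}
```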

E. Partition merging

After executing the HC and FM algorithms, several partitions are sometimes not filled to 100%. To lower the costs of the overall architecture it is reasonable to merge partitions onto a single bus. Only partitions featuring the same bus type are merged, since the costs would rise if nodes were shifted to a faster and thus more expensive bus. The merging of partitions can be considered as a bin packing problem [11]. Our goal is to maximize the filling level of the partitions. There are several algorithms available in the literature to solve the bin packing problem exactly [12]. We solved the problem using the dynamic programming approach [3]. A challenge for this specific problem is that the partitions change their weight when they are packed together. A simple example of this relationship is depicted in Fig. 8. The data transfer between two partitions transferred through a gateway is counted for both partitions. If these partitions are merged, the data transfer is only counted once, since the ECUs can communicate directly. So the overall bus utilization is lower than the sum of both parts. The bin packing algorithm has been implemented for each different bus system.


Fig. 8. Weight change arising from partition merging: 5 kbit/s of gateway traffic between ECU A on bus A and ECU B on bus B is counted only once after merging onto a new bus

Since the dynamic programming reuses the formerly calculated blocks to speed up the calculation, it suffers from the partition size problem described above. To minimize this drawback we implemented the following steps:
1) Generate a list of all partitions featuring the same bus system.
2) Create an empty bin featuring the capacity of the bus system.
3) Fill the bin with partitions. Delete used partitions from the list and add the newly created bin to the list if more than one partition has been packed into it.
4) If unpacked partitions are left over, restart with step 1.
Step 3 makes it possible to recalculate the size of the bins and allows packing another partition into a bin if there is enough space. This addresses the above described problem concerning the size changes after merging partitions. For a small number of bus systems it is also possible to use the exact algorithm, which uses more computing time and memory than the dynamic programming approach. The exact algorithm steps through each possible combination of partitions. It recursively calls itself, once adding a partition to the bin and a second time without adding it. So it checks all available solutions. The pseudocode is given in Fig. 9.

Data: list of old partitions oldpart, partition list allparts, list position it, used bus bus
Result: used partitions
  part := oldpart ∪ {allparts(it)}
  if it = allparts.last then
      if bus can carry part then      /* check that the bus load of the current bus is not exceeded */
          return part
      else
          return oldpart
  else
      if bus cannot carry part then
          return rucksack(oldpart, it + 1, bus)
      else
          return maxInternalTraffic(rucksack(part, it + 1, bus), rucksack(oldpart, it + 1, bus))
Fig. 9. Exact bin packing algorithm
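For illustration, the exhaustive packing recursion of Fig. 9 could be written in Java as follows; fitsOnBus() and internalTraffic() are simplified stand-ins for checkBus() and the weight recalculation explained with Fig. 8:

```java
import java.util.*;

// Exhaustive bin packing: try every subset of the remaining partitions on one
// bus and keep the best feasible one (sketch, simplified load model).
public class ExactBinPacking {
    record Part(long load) {}

    static boolean fitsOnBus(List<Part> parts, long busCapacity) {
        return parts.stream().mapToLong(Part::load).sum() <= busCapacity;   // simplified load check
    }
    static long internalTraffic(List<Part> parts) {
        return parts.stream().mapToLong(Part::load).sum();                  // placeholder metric
    }

    static List<Part> pack(List<Part> chosen, List<Part> allParts, int it, long busCapacity) {
        if (it == allParts.size())
            return fitsOnBus(chosen, busCapacity) ? chosen : List.of();
        List<Part> with = new ArrayList<>(chosen);
        with.add(allParts.get(it));
        if (!fitsOnBus(with, busCapacity))                    // adding this partition overloads the bus
            return pack(chosen, allParts, it + 1, busCapacity);
        List<Part> a = pack(with, allParts, it + 1, busCapacity);
        List<Part> b = pack(chosen, allParts, it + 1, busCapacity);
        return internalTraffic(a) >= internalTraffic(b) ? a : b;            // maximise the bus filling
    }

    public static void main(String[] args) {
        List<Part> parts = List.of(new Part(60), new Part(50), new Part(45), new Part(30));
        System.out.println(pack(new ArrayList<>(), parts, 0, 125));         // best subset for a 125 kbit/s bus
    }
}
```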

Fig. 10. Additional execution of the FM algorithm

The whole process now looks as follows:
1) Hierarchical clustering
2) Selecting the best solution in the HC tree
3) Optimizing the cutting costs using the FM algorithm
4) Merging of under-utilized partitions using the bin packing algorithm
5) Optimizing the cutting costs again using the FM algorithm
To enable a rapid exploration of different architecture prototypes, we implemented these steps in a customized metric block in PREEvision (Fig. 11). PREEvision is based on the Eclipse platform and can thus easily be extended by using custom metric blocks. Our customized metric block is based on Java code which can be executed directly in the framework [13]. To start the calculation, we provide a list of ECUs that shall be partitioned and a folder containing the allowed bus systems. This allows excluding ECUs from bus generation, e.g. to meet non-technical requirements. The same is true for the list of allowed bus systems, which allows including or excluding certain network types. Additional data, e.g. communication requirements, is directly read from the EEA model. This provides all necessary input data for the above described steps. The metric block automatically generates the determined bus connectors and bus systems in the architecture model.
Fig. 11. PREEvision block plugin implementation: the bus synthesis metric block receives an ECU list (e.g. DoorLockDD, DoorLockFP, MasterModule, BodyModule, SlidingDoorController) and the allowed bus systems as inputs and creates the bus systems and networks

After finishing the bin packing, it makes sense to run the FM algorithm again. This is because merged partitions may have a higher nearness to nodes in other partitions. This relationship is depicted in Fig. 10. Before merging the partitions, shifting the gray node would cause a loss of 1. After merging the partitions on the right, we can achieve a gain of 3.

V. RESULTS

Since the real communication structure between the functions and ECUs is strictly confidential knowledge of the car manufacturers, there was no data from a real car available to


test our approach. As a workaround, we designed a customizable random network generator. This generator features the following settings:

- min/max number of ECUs
- min/max number of connections
- min/max distance between the min/max number of connections
- min/max of the minimum data rate of connections
- min/max of the maximum data rate of connections

Furthermore we implemented a likelihood that an ECU connects to ECUs of the same block of ten. This means that ECU 16 has a higher likelihood of connecting to the ECUs 10-19 than to all others. In addition, the user can set a different data rate for each of these blocks. This helps to see whether the algorithm correctly detects the ECUs belonging together. The designed network generator also allows setting the group size of ECUs belonging together, but this prevents identifying the ECUs belonging together. Another setting allows setting the group of adjacent ECUs by hand. We generated 100 different networks using our network generator. The results are depicted in Fig. 12. While it looks like the shared weighted nearness is the overall winner of the benchmark, this is only the case for most of the networks. Looking at the standard deviation shows that other nearness functions can also lead to a better solution for a specific network. Because of this, the result of all different nearness functions is calculated and the best one is selected. The result graph in Figure 12 is based on the best solution found for each network. The deviation of the current solution compared to the best solution found is depicted in percent.
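A minimal Java sketch of such a generator, assuming the settings listed above and a configurable same-block probability, could look as follows (parameter names are illustrative):

```java
import java.util.*;

// Random network generator sketch: ECU count, connection count and data rates
// are drawn from min/max ranges; ECUs prefer partners from their own block of ten.
public class RandomNetworkGenerator {
    record Edge(int from, int to, int kbitPerS) {}
    static final Random RND = new Random(42);

    static int between(int min, int max) { return min + RND.nextInt(max - min + 1); }

    static List<Edge> generate(int minEcus, int maxEcus, int minConn, int maxConn,
                               int minRate, int maxRate, double sameBlockProbability) {
        int ecus = between(minEcus, maxEcus);
        List<Edge> edges = new ArrayList<>();
        for (int e = 0; e < ecus; e++) {
            int connections = between(minConn, maxConn);
            for (int c = 0; c < connections; c++) {
                int peer;
                if (RND.nextDouble() < sameBlockProbability) {     // prefer the same block of ten
                    int blockStart = (e / 10) * 10;
                    peer = Math.min(ecus - 1, blockStart + RND.nextInt(10));
                } else {
                    peer = RND.nextInt(ecus);
                }
                if (peer != e) edges.add(new Edge(e, peer, between(minRate, maxRate)));
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(generate(20, 70, 1, 4, 10, 500, 0.7).size() + " connections generated");
    }
}
```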
Fig. 12. Comparison of the implemented nearness functions and algorithms: deviation in percent for the variants without optimization, with bin packing, and with FM + bin packing

VI. CONCLUSION AND FUTURE WORK

Our method to automatically partition communicating ECUs into automotive networks allows to rapidly evaluate different design alternatives. During the design phase of a vehicle development, different architectures are investigated. Automatic ECU partitioning can help the designer to quickly generate a new network prototype when moving function blocks from one ECU to another. With the help of our approach, all bus system requirements are met. Since only a subset of ECUs can be selected, the ECU partitioning can also be executed for only a specific set of ECUs. Modifications to the automatically generated network may of course be necessary, since political decisions always have to be considered during the design. Nevertheless our tool can provide a very good starting solution, meeting all requirements concerning data rates. The cost function, which is the basis for the decision for a specific network, can be set individually by the designer and can so match the specific cost calculations of different car manufacturers. The current approach can also easily be extended by new bus systems, since it is not dependent on a certain kind of bus. In the next steps, we will try to improve the selection of bus systems. Currently, a certain bus is selected by a fixed bandwidth value. This could be extended by an in-depth configuration and scheduling for the selected bus, so that possibly a better bandwidth utilization could arise.

REFERENCES
[1] J. Broy and K. D. Mueller-Glaser, "The impact of time-triggered communication in automotive embedded systems," in Industrial Embedded Systems, 2007 (SIES '07), International Symposium on, Jul. 2007, pp. 353-356.
[2] J. Teich and C. Haubelt, Digitale Hardware/Software-Systeme: Synthese und Optimierung, 2nd ed. Berlin: Springer, 2007.
[3] J. Lienig, Layoutsynthese elektronischer Schaltungen - Grundlegende Algorithmen für die Entwurfsautomatisierung. Berlin: Springer, 2006.
[4] LIN Consortium, LIN Specification Package, revision 2.1, Nov. 2006.
[5] Robert Bosch GmbH, CAN Specification, 2nd ed., Stuttgart, September 1991. [Online]. Available: http://www.semiconductors.bosch.de/pdf/can2spec.pdf
[6] FlexRay Consortium, FlexRay Communications System - Protocol Specification Version 2.1, Revision A, Dec. 2005.
[7] MOST Cooperation, MOST Specification, Rev. 3.0 E2, July 2010.
[8] W. Zimmermann and R. Schmidgall, Bussysteme in der Fahrzeugtechnik. Protokolle und Standards, 3. Auflage. Vieweg+Teubner, September 2008.
[9] aquintos GmbH, E/E-Architekturwerkzeug PREEvision, 2009.
[10] R. Xu and D. Wunsch, Clustering (IEEE Press Series on Computational Intelligence). New York: IEEE Press, 2009.
[11] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack problems. Berlin: Springer, 2004. [Online]. Available: http://books.google.de/books?isbn=3540402861
[12] S. Martello and P. Toth, Knapsack problems: algorithms and computer implementations. New York, NY, USA: John Wiley & Sons, Inc., 1990.
[13] B. Daum, Java-Entwicklung mit Eclipse 3.3: Anwendungen, Plugins und Rich Clients, 5th ed. Heidelberg: Dpunkt-Verl., 2008. [Online]. Available: http://www.ulb.tu-darmstadt.de/tocs/194895912.pdf



An event-driven FIR filter: design and implementation
Taha Beyrouthy, Laurent Fesquet
TIMA Laboratory - Concurrent Integrated Systems Group, Grenoble, France Taha.beyrouthy@img.fr Laurent.fesquet@imag.fr

Abstract: Non-uniform sampling has proven, through different works, to be a better scheme than uniform sampling for sampling low-activity signals. With such signals it generates fewer samples, which means less data to process and lower power consumption. In addition, it is well known that asynchronous logic is a low-power technology. This paper deals with the coupling of a non-uniform sampling scheme and an asynchronous design in order to implement a digital filter. It presents the first design of a micropipeline asynchronous FIR filter architecture coupled to a non-uniform sampling scheme. The implementation has been done on an Altera FPGA board.
Index Terms: Asynchronous logic, non-uniform sampling, FIR filter, FPGA.

I. INTRODUCTION

With the increasing system-on-chip complexity, several problems become more and more critical and severely affect the performance of the system. These issues can take different forms, such as power consumption, clock distribution, electromagnetic emission, etc. Synchronous logic seems to reach its technological limits when dealing with these problems. However, asynchronous logic has proven that it can be a better alternative in many cases. It is well known that it has many interesting properties such as immunity to metastable states [2], low electromagnetic noise emission [15], low power consumption [11], [12], high operating speed [13], [14] or robustness towards variations in supply voltage, temperature, and fabrication process parameters [16]. Moreover, non-uniform sampling and especially level-crossing sampling become more interesting and beneficial when they deal with specific signals like temperature, pressure, electro-cardiograms or speech, which evolve smoothly or sporadically. Indeed, these signals can remain constant over a long period and vary significantly during a short period of time. Therefore, using the Shannon theory for sampling such signals leads to useless samples, which artificially increases the computational load. Classical uniform sampling takes samples even if no change occurs in the input signal. The authors in [5] and [6] show how using the non-uniform sampling technique in ADCs leads to drastic power savings compared to Nyquist ADCs.

A new class of ADCs, called asynchronous ADCs (A-ADCs), has been developed by the TIMA Laboratory [7]. This A-ADC is based on the combination of a level-crossing sampling scheme and a dedicated asynchronous logic [8]. The asynchronous logic only samples digital signals when an event occurs, i.e. a sample is produced by the A-ADC, which delivers data that are non-uniform in time. This event-driven architecture combined with the level-crossing sampling scheme is able to significantly reduce the dynamic activity of the signal processing chain. Many publications on non-uniform sampling are available in the literature, but to the best of our knowledge none relates to the coupling of event-driven (asynchronous) logic and FIR filter techniques applied to a non-uniform sampling scheme. This paper presents an asynchronous FIR filter architecture based on a micro-pipeline asynchronous design style. It also shows a successful implementation of this architecture on a commercial FPGA board from Altera. The second part of the paper is dedicated to asynchronous logic and more precisely to one kind of asynchronous circuit: the micro-pipeline. Some details about the A-ADC as well as the non-uniform sampling scheme are given in the third section. The fourth section handles the asynchronous FIR filter algorithm and architecture. Finally, the fifth section presents the implementation results of the proposed architecture on a DE1 Altera FPGA board.

II. PRINCIPLES OVERVIEW

A. Asynchronous logic

Asynchronous logic is well known for interesting properties that synchronous logic does not have, such as low electromagnetic emission, low power consumption, robustness, etc. [1]. It has been proven that this logic improves the performance of Nyquist ADCs in terms of immunity to metastable states [2], low electromagnetic emission [3] or low power consumption [4]. This section briefly presents the main asynchronous logic principles. It also shows how asynchronous micro-pipelined circuits are built from two distinct parts: the data path and the asynchronous control path.



Asynchronous principles

Unlike synchronous logic, where the synchronization is based on a global clock signal, asynchronous logic does not need a clock to maintain the synchronization between its sub-blocks. It is considered a data-driven logic, where computation occurs only when new data arrive. Each part of an asynchronous circuit establishes a communication protocol with its neighbors in order to exchange data with them. This kind of communication protocol is known as a handshake protocol. It is a bidirectional protocol between two blocks called Sender and Receiver, as shown in Figure 1.

This is why the C-element is extensively used in asynchronous logic and is considered the fundamental component on which the communication protocols are based.

Asynchronous micro-pipeline circuits

Many asynchronous logic styles exist in the literature. It is worth mentioning that the choice of the asynchronous style affects the circuit implementation (area, speed, power, robustness, etc.). One of the best known styles is the micro-pipeline style. Among all the asynchronous circuit styles, the micro-pipeline most closely resembles the design of synchronous circuits due to the extensive use of timing assumptions [5]. As in a synchronous pipeline circuit, the storage elements are controlled by control signals. Nevertheless, there is no global clock. These signals are generated by the Muller gates in the pipeline controlling the storage elements, as shown in Figure 3. A simple asynchronous micro-pipeline circuit can be built by using transparent latches as storage elements, as shown in Figure 4.

Figure 1: Handshake protocol is established between two subblocks of an asynchronous circuit that need to exchange data between each other

The sender starts the communication cycle by sending a request signal req to the receiver. This signal means that data are ready to be sent. The receiver starts the new computation after the detection of the req signal and sends back an acknowledge signal ack to the sender, marking the end of the communication cycle, so that a new one can start. The main gate used in this kind of protocol is the Muller gate, also known as C-element. Thanks to its properties, it helps to detect a rendezvous between different signals. The C-element is in fact a state-holding gate; Table 1 shows its output behavior.

Figure 3: Muller pipeline, controlling Latch chain

The Muller gate pipeline is used to generate the local clocks. The clock pulse generated in a stage overlaps the pulses generated in the neighboring stages in a specific controlled interlocked manner (depending on the handshake protocol).

Figure 2: C-Element or Muller gate


Input1 | Input2 | Output
0      | 0      | 0
0      | 1      | previous output
1      | 0      | previous output
1      | 1      | 1

Table 1: Truth table of the C-element. The output copies the inputs' value when both are the same, and maintains its previous value when the inputs differ.

Figure 4: Micro-pipeline asynchronous circuit

Consequently, when the output changes from 0 to 1, we may conclude that both inputs are 1. Similarly, when the output changes from 1 to 0, we may conclude that both inputs are now set to 0. This behavior can be interpreted as an acknowledgement that indicates when both inputs are 1 or 0.
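The behavior in Table 1 can be captured in a few lines of Java as a software model of the C-element (a behavioral sketch only, not a gate-level description):

```java
// Behavioural model of the Muller C-element: the output follows the inputs
// when they agree and otherwise keeps its previous value.
public class CElement {
    private boolean state;                            // previous output value

    public boolean evaluate(boolean a, boolean b) {
        if (a == b) state = a;                        // both inputs equal: copy them
        return state;                                 // otherwise hold the old value
    }

    public static void main(String[] args) {
        CElement c = new CElement();
        System.out.println(c.evaluate(false, false)); // false
        System.out.println(c.evaluate(true, false));  // holds false
        System.out.println(c.evaluate(true, true));   // true: both signals arrived
        System.out.println(c.evaluate(false, true));  // holds true until both return to 0
    }
}
```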

This circuit can be seen as an asynchronous data-flow structure composed of two main blocks: the data path, which is clocked by a distributed gated-clock driver, and the control path.


III. ASYNCHRONOUS ANALOG TO DIGITAL CONVERTER (A-ADC)

Most real-life signals are time varying in nature. The spectral contents of these signals vary with time, which is a direct consequence of the signal generation process [6]. Synchronous ADCs are based on Nyquist architectures. They do not exploit the input signal variations. Indeed, they sample the signal at a fixed rate, without taking into account the intrinsic signal nature. Moreover, they are highly constrained by the Shannon theory, especially in the case of low-activity sporadic signals like electrocardiograms, seismic signals, etc. This leads to capturing and processing a large number of samples without any relevant information, and to a useless increase of the system activity and power consumption. The Asynchronous Analog-to-Digital Converter (A-ADC) presented in [7] and [8] is based on a non-uniform sampling scheme called level-crossing sampling [9]. This system is only driven by the information present in the input signal. Indeed, it only reacts to the analog input signal variations.

A. Non-uniform level-crossing sampling

The sampling process strongly affects the performance of the subsequent Digital Signal Processing (DSP) chain. Best performance can be achieved if the signal is efficiently sampled. Several ways exist to sample an analog signal. Classical uniform sampling is well developed and well adapted to existing signal processing devices. Although it covers the whole range of existing DSP areas, it is not the best one for all of them. In many cases, non-uniform sampling can be a better candidate and provides advantages such as system complexity reduction, compression, smarter data transmission and acquisition, etc., which are not attainable with the uniform sampling process. With our non-uniform sampling scheme, a sample is only captured when the Continuous Time (CT) input signal x(t) crosses one of the defined levels (Figure 5).

For an M-bit resolution, 2^M - 1 quantization levels are regularly disposed along the amplitude range of the signal. Unlike classical Nyquist sampling, the samples are not uniformly spaced in time, because they depend on the signal variations. Thus, together with the value of the sample axn, the time dtxn = txn - txn-1 is defined. It corresponds to the time elapsed since the previous sample axn-1. A local timer of period TC is dedicated to recording dtxn and delivering it when necessary along with axn. Contrary to the usual sampling technique, the amplitude of the sample is known and the time elapsed between two samples is quantized with a timer. The Signal to Noise Ratio (SNR) depends on the timer period TC, and not on the number of quantization levels [8]. Thus, for a given implementation of the non-uniform sampling A/D converter (a fixed number of quantization levels: L = 2^M - 1), the SNR can be externally tuned by changing the period TC of the timer. In theory, for level-crossing sampling, the SNR can be improved as far as needed by reducing TC [10].

B. A-ADC architecture

Let δ be the A-ADC processing delay for one sample. Then proper capture of the signal x(t) must satisfy the tracking condition given by equation (1):

|dx(t)/dt|max <= q / δ    (1)
where q is the A-ADC quantum, defined by equation (2):

q = E / (2^M - 1)    (2)
where E represents the amplitude dynamics of the A-ADC, and M its resolution. The output digital value Vnum is converted to Vref by the DAC, and compared to the CT input signal x(t) (Figure 6).

Figure 6: Block diagram of the A-ADC

Figure 5: Level-crossing sampling. q is considered as the A-ADC quantum.

If the comparison result is greater than q/2, the counter is incremented. If it is lower than -q/2, the counter is decremented. In other cases, nothing is done and the output Vnum remains constant. The output signal is composed of couples (axn, dtxn), where axn is the digital value of the sample and dtxn is the time elapsed


since the previous converted sample axn-1, given by the timer, as said before. Since the architecture of the A-ADC is asynchronous, it uses an asynchronous communication protocol (based on req and ack signals) to exchange data with its environment.
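A software sketch of this level-crossing scheme, assuming uniformly spaced levels derived from the quantum q and a timer of period TC, is given below; it is an illustration of the sampling principle, not of the A-ADC hardware:

```java
import java.util.*;

// Level-crossing sampling sketch: a sample (level, dt) is emitted only when the
// input crosses one of the 2^M - 1 levels; dt is the time since the previous
// sample, quantised by a timer of period TC.
public class LevelCrossingSampler {
    record Sample(int level, long dtTimerTicks) {}

    static List<Sample> sample(double[] x, double amplitudeE, int bitsM,
                               double stepSeconds, double timerTcSeconds) {
        int levels = (1 << bitsM) - 1;
        double q = amplitudeE / levels;                        // quantum, cf. equation (2)
        List<Sample> out = new ArrayList<>();
        int current = (int) Math.floor(x[0] / q);
        double sinceLast = 0.0;
        for (double value : x) {
            sinceLast += stepSeconds;
            int level = (int) Math.floor(value / q);
            if (level != current) {                            // a level has been crossed
                out.add(new Sample(level, Math.round(sinceLast / timerTcSeconds)));
                current = level;
                sinceLast = 0.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = new double[1000];
        for (int i = 0; i < x.length; i++) x[i] = 1.5 + Math.sin(2 * Math.PI * i / 500.0);
        System.out.println(sample(x, 3.0, 4, 1e-4, 1e-5).size() + " non-uniform samples");
    }
}
```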

IV. ASYNCHRONOUS FIR FILTER

A. Principles & Algorithm

A synchronous Nth-order FIR filter based on a uniform sampling scheme computes a digital convolution product (equation (3)):

y(n) = sum over k = 0..N of h(k) * x(n - k)    (3)

where the samples are taken with a constant sampling period Te.

In the non-uniform sampling scheme, the sampling time of the kth sample of the impulse response h does not necessarily correspond to the sampling time of the (n-k)th sample of the input signal ax (equation (4)):

y(n) = sum over k = 0..N of ah(k) * ax(n - k)    (4)
The product of two such samples is thus meaningless. To bypass this issue, the impulse response h of the filter is resampled and interpolated, as is the input signal ax. The new convolution product is then processed between these new samples (Figure 7).

Figure 7: Principle of the resampling scheme used in the irregular FIR computation. The continuous lines represent the original samples, whereas the dashed lines correspond to the new resampled interpolated samples.

The new convolution product is an area computation. The easiest way to compute this area is the rectangle method, i.e. a zero-order interpolation. This method allows splitting the area corresponding to the convolution product into a sum of rectangle areas with different widths (Figure 7). In order to compute each rectangle area, an iterative loop can do the job [10]: with min = min(dtxn-k, dthj), if min = dtxn-k then k = k + 1; if min = dthj then j = j + 1; and if both are equal then j = j + 1 and k = k + 1. An example illustrating these iterations is shown in Figure 8.
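A Java sketch of this rectangle-method convolution on resampled non-uniform pairs (value, dt) is given below; the index-advance rules follow the iteration described above, and the code is an illustration rather than the hardware algorithm of the filter:

```java
// Rectangle-method convolution on non-uniform samples: each sample holds its
// value for dt, and the product is integrated rectangle by rectangle.
public class NonUniformConvolution {
    record NuSample(double value, double dt) {}      // amplitude and time since previous sample

    /** Accumulates dtMin * ax * ah over the overlapping intervals. */
    static double convolve(NuSample[] ax, NuSample[] ah) {
        double acc = 0.0;
        int j = 0, k = 0;
        double remX = ax[k].dt(), remH = ah[j].dt();
        while (j < ah.length && k < ax.length) {
            double dtMin = Math.min(remX, remH);     // width of the current rectangle
            acc += dtMin * ax[k].value() * ah[j].value();
            remX -= dtMin;
            remH -= dtMin;
            if (remX == 0) { k++; if (k < ax.length) remX = ax[k].dt(); }  // input interval exhausted
            if (remH == 0) { j++; if (j < ah.length) remH = ah[j].dt(); }  // coefficient interval exhausted
        }
        return acc;
    }

    public static void main(String[] args) {
        NuSample[] ax = { new NuSample(1.0, 2.0), new NuSample(0.5, 1.0), new NuSample(-0.25, 3.0) };
        NuSample[] ah = { new NuSample(0.5, 1.5), new NuSample(0.25, 4.5) };
        System.out.println(convolve(ax, ah));
    }
}
```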

Figure 8: An example of an asynchronous convolution product.

B. FIR filter asynchronous micro-pipeline architecture

General architecture

The previously proposed algorithm is implemented with the feedback structure presented in Figure 9. This structure


describes the architecture of the FIR filter. The Delay Line gets the sampled signal data (axn, dtxn) from the A-ADC. The communication between these two blocks is based on the handshake protocol. The Delay Line is the memory block of the filter. It is a shift register that stores the input samples (magnitudes and time intervals). The output of this register is connected to a multiplexer (not shown) that allows selecting samples depending on the value of the selection input k. A ROM and another multiplexer (not shown) are used to store the impulse response coefficients. The coefficients are selected by the signal j.

Figure 9: Iterative structure of the asynchronous FIR filter.

The MIN block has multiple functionalities:
1) It determines the minimum time interval dtmin of (dtxn-k, dthj).
2) It generates the selection signals j and k that control the selection process in the Delay Line.
3) It detects the end of a convolution product round and allows starting a new one. This functionality is

based on detecting the signal k. If k reaches its maximum, this means that all filter coefficients have been used and that the convolution product is done. The filter is ready to perform a new one. The MIN block generates at this point a reset signal to the other blocks, commanding them to start a new convolution.
4) Finally, at the end of each convolution product cycle, MIN generates a reset signal and an enable signal to reset the output of the Accumulator and enable the output of the Buffer.
The Multiplier computes all sub-area values (dtmin * axn-k * ahj), which are accumulated in the Accumulator in order to compute the convolution product. Until now, nothing is special: the structure is a simple and logical translation of the iterative function presented in the previous paragraph. In the micro-pipeline architecture, this part of the circuit is considered the data path. The challenge begins with defining the control path of the micro-pipeline of the filter. The control path is in charge of synchronizing the communication between the different parts of the data path, so that they work in complete harmony while exchanging their data. Once the data and control paths are described, the filter implementation starts on the FPGA. In order to implement the control path, a specific asynchronous library has to be specified. This library contains Muller gates, asynchronous controllers and some other asynchronous functions [17]. As mentioned before, each functional block has its own controller. For the clarity of the paper, not all the controllers are presented. As an illustration, the controller of the MIN block is studied below.

Control block of MIN

The simplest block specified for our asynchronous controllers is the Linear_control (Figure 10). It ensures the rendezvous between two incoming signals req and ack. It has two inputs and two delayed outputs. The first input represents an input request signal coming from a previous block P connected to the inputs of the MIN block. The P block sends a request signal to the MIN block along with its data. The second input is used for the input acknowledge signal that comes from the following block F. The F block receives data to compute from the MIN block and sends back an acknowledge signal; then the MIN block is ready to receive new data. The Linear Controller also has two outputs. The first one is an output request signal. This output indicates whether MIN has finished its computation, and thus whether new data is ready to be sent. This signal is sent to the controller of


the F block. The second output is the output acknowledge signal; it is sent to the P block to indicate whether MIN is ready to receive new data.

Figure 10: Primitive Linear Controller. The delay value depends on the propagation delay of the functional block.

In the case of the MIN block, it receives inputs from only one P block: the Delay Line. This means only one req input is sent to its controller. However, MIN's output is connected to more than one block: it is connected to the Multiplier, which receives dtmin. It is also connected to the Accumulator. The Accumulator receives a reset signal from MIN in order to restart a new accumulation (a new accumulation corresponds to a new convolution product, as mentioned in the description of the MIN functionality). Finally, one of the MIN outputs is connected to the Buffer input. At the end of each convolution product, the Buffer receives an enable signal from MIN in order to transfer the new convolution product value to the output. In conclusion, the MIN outputs are connected to four F blocks. This means that the MIN controller receives four incoming acknowledge signals, one from each F block. This also means that MIN has to wait for these four acknowledgement signals in order to generate new output data. Thus, a rendezvous between these four signals has to be processed. This is done by a block called join_4, which is implemented with three 2-input Muller gates connected to each other. The MIN block and its controller are shown in Figure 11.

Figure 11: MIN block and its controller.

Figure 12: Asynchronous FIR Filter after P&R simulation

V. CONCLUSION An asynchronous FIR Filter architecture was presented in this paper, along with an asynchronous Analog to digital converter (A-ADC). The FIR Filter architecture is designed using the micro-pipeline asynchronous style. This architecture has been successfully implemented for the first time on a commercial FPGA board (Altera-DE1). A specific library has also been designed for this purpose. It allows the synthesis of asynchronous primitive bocks (the control path in our case) on the synchronous FPGA. Simulation results of the FIR Filter after place and route validate the implementation. This work is still going on, in order to optimize the implementation. We expect to have a very low-power FIR Filter, with a reduction of the total power consumption by one order of magnitude. REFERENCES [1] M. Renaudin, Asynchronous Circuits and Systems: a Promising Design Alternative, Journal of Microelectronic Engineering, Vol. 54, pp. 133-149, 2000. [2] D. Kinniment et al.,SynchronousandAsynchronousADConversion,IEEE Trans. on VLSI Syst., Vol. 8, n 2, pp. 217-220, April 2000. [3] D.J. Kinniment et al., Low Power, Low Noise Micropipelined Flash A-D Converter, IEE Proc. On Circ. Dev. Syst., Vol. 146, n 5, pp. 263-267, Oct. 1999.

Figure 11: MIN block and its controller

Practically, the MIN block is more complex. In fact, MIN is divided into 3 sub-blocks, each processing one of the functions previously described. Each sub-block has its own controller. The problem that could appear with multiple blocks connected to each other in a non-linear pipeline is a dead-lock.

64

[4] L. Alacoque et al., An Irregular Sampling and Local Quantification Scheme A-D Converter, IEE Electronics Letters, Vol. 39, n 3, pp. 263-264, Feb. 2003. [5] PrinciplesofAsynchronousCircuitDesign A Systems Perspective, Edited by JENS SPARS Technical University of Denmark & STEVE FURBER The University of Manchester UK [6] L. Wiliams et al.A Stereo 16-bit Delta-Sigma A/D ConverterforDigitalAudio,Ph.D.dissertation Stanford University 1993. [7] E. Allier, L. Fesquet, G. Sicard, M. Renaudin, Low Power Asynchronous A/D Conversion, Proceedings of the 12th International Workshop on Power and Timing, Modeling,Optimization and Simulation (PATMOS02), September 11-13 2002, Sevilla, Spain. [8] E. Allier, G. Sicard, L. Fesquet, M. Renaudin, A New Class of Asynchronous A/D Converters Based on Time Quantization, ASYNC Proceedings, pp. 197-205, May 12-16 2003, Vancouver, Canada. [9] J.W.Marketal.,ANonuniform Sampling Approachto Data Compression, IEEE Trans. on Communication. Vol. COM-29, n 4, pp. 24-32, Jan. 1981. W.-K. Chen, Linear Networks and Systems (Book style). Belmont, CA: Wadsworth, 1993, pp. 123135. [10] F. Aeschlimann, E. Allier, L. Fesquet, M. Renaudin, "Asynchronous FIR Filters: Towards a New Digital Processing Chain," Asynchronous Circuits and Systems, International Symposium on, pp. 198-206, 10th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC'04), 2004 [11] S.B. Furber, J.D. Garside, S. Temple, J. Liu, P. Day, and N.C. Paver.AMULET2e: An asynchronous embedded controller. In Proc. International Symposium on advanced Research in Asynchronous Circuits and Systems, pages 290299. IEEE Computer Society Press, 1997. [12] L.S. Nielsen. Low-power Asynchronous VLSI Design. PhD thesis, Department of Information Technology, Technical University of Denmark, 1997. IT-TR:1997-12. [13] SPeedster, a very high speed FPGA by Achronix: http://www.achronix.com/ [14] A.J. Martin, A. Lines, R. Manohar, M. Nystrom, P. Penzes, R. Southworth, U.V. Cummings, and T.-K. Lee. The design of an asynchronous MIPS R3000. In Proceedings of the 17th Conference on Advanced Research in VLSI, pages 164181. MIT Press, September 1997. [15] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien, and J. Liu. A low-power, low-noise configurable self-timed DSP. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 3242, 1998. [16] L.S. Nielsen, C. Niessen, J. Spars, and C.H. van Berkel. Low-power operation using self-timed circuits and adaptive scaling of the supply voltage. IEEE Transactions on VLSI Systems, 2(4):391397, 1994. [17] Quoc Thai Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, R. Rolland, "Implementing asynchronous circuits on LUT based FPGAs", The 12th International Conference on Field Programmable Logic and Applications (FPL),

September 2-4, 2002, Montpellier (La Grande-Motte), France.

65

Session 3 Prototyping Radio Devices

66

Applying Graphics Processor Acceleration in a Software Dened Radio Prototyping Environment


William Plishker, George F. Zaki, Shuvra S. Bhattacharyya
Dept. of Electrical and Computer Engineering and Institute for Advanced Computer Studies University of Maryland College Park, Maryland {plishker,gzaki,ssb}@umd.edu

Charles Clancy, John Kuykendall


Laboratory for Telecommunications Sciences College Park, Maryland, USA {clancy, jbk}@ltsnet.net

AbstractWith higher bandwidth requirements and more complex protocols, software dened radio (SDR) has ever growing computational demands. SDR applications have different levels of parallelism that can be exploited on multicore platforms, but design and programming difculties have inhibited the adoption of specialized multicore platforms like graphics processors (GPUs). In this work we propose a new design ow that augments a popular existing SDR development environment (GNU Radio), with a dataow foundation and a stand-alone GPU accelerated library. The approach gives an SDR developer the ability to prototype a GPU accelerated application and explore its design space fast and effectively. We demonstrate this design ow on a standard SDR benchmark and show that deciding how to utilize a GPU can be non-trivial for even relatively simple applications.

I. I NTRODUCTION GNU Radio [1] is a software development framework that provides software dened radio (SDR) developers a rich library and a customized runtime engine to design and test radio applications. GNU Radio is extensive enough to describe audio radio transceivers, distributed sensor networks, and radar systems, and fast enough to run such systems on off-the-self radio hardware and general purpose processors (GPPs). Such features have made GNU Radio an excellent rapid prototyping system, allowing designers to come to an initial functional implementation quickly and reliably. GNU Radio was developed with general purpose programmable systems in mind. Often initial SDR prototypes were fast enough to be deployed on general purpose processors or needed few custom accelerators. As new generations of processors were backwards compatible with software, GNU Radio implementations could track with Moores Law. As a result, programmable solutions have been competitive with custom hardware solutions that required longer design time and greater expense to port to the latest process generation. But with the decline in frequency improvements of GPPs, SDR solutions are increasingly in need of multicore acceleration, such as that provided by graphics processors (GPUs). SDR is well positioned to make use of them since many SDR

applications have abundant parallelism. GPUs are starting to be employed in SDR solutions, but their adoption has been inhibited by a number of difculties, including architectural complexity, new programming languages, and stylized parallelism. While other research is addressing these topics [5] [6], one of the primary barriers in many domains is the ability to quickly prototype the performance advantages of a GPU for a particular application. The inability to assess the performance impact of a GPU with an initial prototype leaves developers to doubt if the time and expense of targeting a GPU is worth the potential benet. Many design decisions are needed before arriving at initial multicore prototype including mapping tasks to processors and data to distributed memories. Mapping SDR applications is further complicated by application requirements. The amount of parallelism present may be dictated by the application itself based on its latency tolerances and available vectorization of the kernels. More vectorization tends to lead to higher utilization of the platform (and therefore higher throughput), but often at the expense of increased latency and buffer memory requirements. Also an accelerator typically requires signicant latency to move data to or from the host processor, so sufcient data must be burst to the accelerator to amortize such overheads. Ideally, application designers would be simply presented with a Pareto curve of latency versus vectorization trade-offs so that an appropriate design point can be selected. However, vectorization generally inuences the efciency of a given mapping. Thus, to fully unlock the potential of heterogeneous multiprocessor platforms for SDR, designers must be able to arrive at a variety of solutions quickly, so that the design space may be explored along such critical dimensions. To enable developers to arrive at an initial prototype that utilizes GPUs, we introduce a new SDR design ow, as shown in Figure 1. We begin with a formal description of an SDR application, which we extract from a GNU Radio specication. Formalisms provide the design ow with a structured, portable application description which can be used for vectorization,

978-1-4577-0660-8/11/$26.00 c 2011 IEEE

67

represented, respectively, by prd (e) and cns(e). Homogeneous Synchronous Data Flow (HSDF) is a restricted form of SDF where prd (e) = cns(e) = 1 for every edge e. Given an SDF graph G, a schedule for the graph is a sequence of actor invocations. A valid schedule guarantees that every actor is red at least once, there is no deadlock due to token underow on any edge, and there is no net change in the number of tokens on any edge in the graph (i.e., the total number of tokens produced on each edge during the schedule is equal to the total number consumed from the edge). If a valid schedule exists for G, then we say that G is consistent. For each actor v in a consistent SDF graph, there is a unique repetition count q(v), which gives the number of times that v must be executed in a minimal valid schedule (i.e., a valid schedule that involves a minimum number of actor rings). In general, a consistent SDF graph can have many different valid schedules, and these schedules can differ widely in the associated trade-offs in terms of metrics such as latency, throughput, code size, and buffer memory requirements [4]. III. R ELATED W ORK Many models of computation have been suggested to describe software radio systems. In [2], the advantages and drawbacks of various models are investigated. Also different dataow models that can be applied to various actors of an LTE receiver are demonstrated. Actor implementation on GPUs is discussed in [13]. A GPU compiler is described in order to take a naive actor implementation written in CUDA [11], and generate an efcient kernel conguration that enhances the load balance on the available GPU cores, hides memory latency, and coalesces data movement. This work can be used in our proposed framework to enhance the implementation of individual software radio actors on a GPU. Raising the abstraction of CUDA programming through program analysis is the focus of Copperhead [6]. In [12], the authors present a multicore scheduler that maps SDF graphs to a tile based architecture. The mapping process is streamlined to avoid the derivation of equivalent HSDF graphs, which can involve signicant time and space overhead. In more general work, MpAssign [5] employs several heuristics, allows different cost functions and architectural constraints to arrive at a solution. In [15], a dynamic multiprocessor scheduler for SDR applications is described. The basic platform consists of a Universal Software Radio Peripheral (USRP), and cluster of GPPs. A exible framework for dynamic mapping of SDR components onto heterogeneous multiprocessor platforms is described in [9]. Various heuristics and mixed linear programming models have been suggested for scheduling task graphs on homogeneous and heterogeneous processors (e.g., see [10]). In these works, the problem formulations are developed to address different objective functions and target platforms for implementing the input application graphs. The focus of this work is to construct a backend capable of integrating specialized multicore solutions into a domain

Fig. 1.

Dataow founded SDR Design Flow.

latency, and other design decisions. These design decisions can ultimately be incorporated into an SDR application through a GPU specic library of SDR actors. For this work, we have constructed GRGPU, which is a such a library written for GNU Radio. With this design process, we demonstrate the value of this approach with GNU Radio benchmark on a platform with a GPU. II. BACKGROUND Dataow graphs are widely used in the modeling of signal processing applications. A dataow graph G consists of set of vertexes V and a set of edges E. The vertices or actors represent computational functions, and edges represent FIFO buffers that can hold data values, which are encapsulated as tokens. Depending on the application and the required level of model-based decomposition, actors may represent simple arithmetic operations, such as multipliers or more complex operations as turbo decoders. A directed edge e(v1 , v2 ) in a dataow graph is an ordered pair of a source actor v1 = src(e) and sink actor v2 = snk (e), where v1 V and v2 V . When a vertex v executes or res, it consumes zero or more tokens from each input edge and produces zero or more tokens on each output edge. Synchronous Data Flow (SDF) [8] is a specialized form of dataow where for every edge e E, a xed number of tokens is produced onto e every time src(e) is invoked, and similarly, a xed number of tokens is consumed from e every time snk(e) is invoked. These xed numbers are

68

specic prototyping environment. This should facilitate the previously described dataow based design ow, but should also enable these other works to be applied in the eld of SDR. Any solution targeting a complex multicore system is unlikely to produce the optimal solution with its rst implementation. The ability to quickly generate and evaluate many solutions on a multicore platform should improve the efcacy the approach and ultimately the quality of the nal solution. IV. SDR D ESIGN F LOW FOR GPU S We implemented the design ow proposed in Figure 1 by using GNU Radio as the SDR description and runtime environment and the Dataow Interchange Format (DIF) [7] for the dataow representation and associated tools. Our GPU target was CUDA enabled NVIDIA GPUs. With these tools in place the design ow proceeds as described in the following steps: 1) Designers write their SDR application in GNU Radio with no consideration for the underlying platform. As GNU Radio has an execution engine and a library of SDR components, designers can verify correct functionality of their application. For existing GNU Radio applications, nothing must be changed with the description to continue with the design ow. 2) If actors of interest are not in the GPU accelerated library, a designer writes accelerated versions of the actors in CUDA. The design focuses on exposing the parallelism to match the GPU architecture in as parametrized way as possible. 3) Either through automated or manual processes, instantiated actors are either assigned to a GPU or designated to remain on a GPP. With complex trade offs between GPU and GPP assignments possible, this step may be revisited often as part of a system level design space exploration. Dataow provides a platform independent foundation for analytically determining good mappings, but designer insight is also a valuable resource to be utilized at this step. 4) The mapping result is utilized by augmenting the original SDR application description environment. By leveraging a stand-alone library of CUDA accelerated actors for GNU Radio, the designer can describe and run the accelerated application description with existing design ow properties. The following sections cover these steps in detail, specifically as they relate to our instance of the design ow that utilizes CUDA, GNU Radio, and DIF. A. Writing GPU Accelerated Actors After the application graph is described in GNU Radio, actors are individually accelerated using GPU specic tools. If an actor of interest is not present in the GPU accelerated library, the developer switches to the GPU customized programming environment, which in our case is CUDA. The designer is still saddled with difcult design decisions, but these decisions are localized to a single actor. System level design decisions are

orthogonal to this step of the design process. While we do not aim to replace the programming approach of the actors functionality, the following design strategy lends itself to later design space exploration by the developer. As with other GPU programming environments, in CUDA a designer must divide their application into levels of parallelism: threads and blocks, where threads represent the smallest unit of a sequential task to be run in parallel and blocks are groups of threads. In our experience, SDR actors vary in how to use thread level parallelism, but tend to realize block level parallelism with parallelism at the sample level. The ability to tightly couple execution between threads within a block creates a host of possibilities for the basic unit of work within a block, be it processing a code word, multiplying and accumulating for a tap, or performing an operation on a matrix. Because blocks are decoupled, only fully independent tasks can be parallelized. For SDR those situations tend to arise between channels or between samples on a single channel. Some samples may overlap between blocks to support the processing of a neighboring sample, but this redundancy is often more than offset by the performance benets of parallelization. The performance of this parallelization strategy strongly inuenced by the number of channels or the size of a chunk of samples that can be processed at one time. When the application requests processing on a small chunk of sample, there are few blocks to spread across a GPU leaving it under utilized, while large chunks enable high-utilization. The performance difference between small and large chunks is non-linear due to the high xed latency penalty that both scenarios experience when transferring data to and from the GPU and launching kernels. When chunks are small, GPU time is dominated by transfer time, but when chunks are larger, computation time of the kernel dominates, which amortizes the xed penalty delay. As the application dictates these values, actors must be written in a parametrized way to accommodate different size inputs. B. Partitioning, Scheduling, and Mapping Once actors are written, system level design decisions must be made, such as assigning which actors are to invoke GPU acceleration. With some applications, the best solution may be to ofoad every actor that is faster on the GPU than it is on the GPP. But in some cases, this greedy strategy fails to recognize the work that could occur simultaneously on the GPP, while the host thread with the kernel call waits for the GPU kernel to nish. A general solution to the problem would consider application features such as rates of rings, dependencies, and execution times on each platform of each actor, as well as architectural features such as the number and types of processing elements, memories, and topology. To simplify the problem, designers can cluster certain actors together so that they are assigned to the same. To promote this clustering, designers may partition the application graph. Multirate applications also need to be scheduled properly to ensure ring rates and dependencies are proper accounted for.

69

compatibility with previous and future versions of GNU and GNU Radio. When a GNU Radio actor is instantiated, a new C++ object is created which stores and manages the state of the actor. However, state in the CUDA le is not automatically replicated, creating a conict when more than one GRGPU actor of the same type is instantiated. To work around this issue, we save CUDA (both host and GPU) state inside the C++ actor, which includes GPU memory pointers of data already loaded to the GPU. The state from the GPU itself is not saved inside the C++ object, but rather the pointers to the device memory are. Data residing in the GPUs memory space is explicitly managed on the host, so saving GPU pointers is sufcient for keeping the state of the CUDA portion of an actor. To minimize the number of host-to-GPU and GPU-tohost transfers, we introduce two actors, H2D and D2H, to explicitly move data to and from the device in the ow graph. This allows other GRGPU actors to contain only kernels that produce and consume data in the GPU memory. If multiple GPU operations are chained together, data is processed locally, reducing redundant I/O between GPU and host as shown in Figure 3. In GNU Radio, the host side buffers still exist which connect links between the C++ objects that wrap the CUDA kernels. Instead of carrying data, these buffers now carry pointers to data in GPU memory. From a host perspective, H2D and D2H transform host data to and from GPU pointers, respectively. While having both a host buffer and a GPU buffer introduces some redundancy, it has a number of benets which make this an attractive solution. First, there is no change to the GNU Radio engine. The GNU Radio engine still manages data being produced and consumed by each actor, so decisions on chunk size or invocation order do not need to be changed with the use of GRGPU actors. Second, GPU buffers may be safely managed by the GRGPU actors. With GPU pointers being sent through host buffers, actors need only concern themselves with maintaining their own input and output buffers. This provides dynamic exibility (actors can choose to create and free memory for data as needed) or static performance tuning (actors can maintain circular buffers which they read and write a xed amount of data to and from). Such schemes require coordination between GRGPU actors and potentially information regarding buffer sizing, but the designer does have the power to manage these performance critical actions without redesigning or changing GRGPU. Future versions of GRGPU could provide a designers with a few options regarding these schemes and even make use of the dataow schedule or other analysis to make quality design decisions. Finally, no extraneous transfers between GPU and host occur. While the host and GPU buffers mirror each other, no transfers occur between them, which avoids I/O latencies that can be the cause of application bottlenecks.

Fig. 2.

GRGPU: A GNU Radio integration of GPU accelerated actors.

When the application can be extracted into a formal dataow model, schedulers will not only respect these constraints but are able to optimize for buffer assignments [3]. The applicability of such techniques for specialized multicore platforms are still open research, and this design ow enables greater experimentation with them for SDR applications. Manual scheduling and mapping is likely to continue to dominate smaller, more homogeneous mappings, but a grounding in dataow opens the door for new automation techniques. In this work we focus on the design ow, conventions for writing SDR actors, and integrating GPU accelerated actors with GNU Radio. C. GRGPU: GPU Acceleration in GNU Radio We developed GPU accelerated GNU Radio actors in a separate, stand-alone library called GRGPU. GRGPU extends GNU Radios build and install framework to link against libraries in CUDA as shown in Figure 2. After building against CUDA libraries, the resulting actors may be instantiated alongside traditional GNU Radio actors, meaning that designers may swap out existing actors for GRGPU actors to bring GPU acceleration to existing SDR applications. The traditional GNU Radio actors run unaffected on the host GPP, while GRGPU actors utilize the GPU. When writing a new GRGPU actor, application developers start by writing a normal GNU Radio actor including a C++ wrapper that describes the interface to the actor. The GPU kernels are written in CUDA in a separate le and tied back to the C++ wrapper via C functions such as device work(). Additional conguration information may be sent in through the same mechanism. For example, the taps of a FIR lter typically need to be updated only once or rarely during the execution, so instead of passing the tap coefcients during each ring of the actor (taps sent from work() to device work() to the kernel call), they could be loaded into device memory when the taps are updated in GNU Radio. The CUDA compiler, NVCC, is invoked to synthesize C++ code which contains binaries of the code destined for the GPU, but glue code formatted for C++. By generating the C++ instead of an object le directly, we are able to make use of the standard GNU build process using libtool. Even though the original application description was in a different language, the code is wrapped and built in the GNU standard way giving it

70

Fig. 3. GRGPU actors within H2D and D2H communicate data using the GPUs memory, avoiding unnecessary host/GPU transfers

of a lter tap coefcient with an input sample, and adding this product to the partial sum of the previous stage. After processing a set of inputs, the threads perform a block store of the calculated results to the GPU device memory. B. Empirical Results We found a variety of design points of mp-sched to evaluate the utility of rapid prototyping with GRGPU. The target platform was two GPPs (Intel Xeon CPUs 3GHz) and a GPU (an NVidia GTX 260). The actors performed a 60 tap FIR ltering with either CUDA acceleration in the case of GPU accelerated actors or SSE acceleration in the case of the GPP. To minimize the latencies incurred by using H2D and D2H, the GPU accelerated actors were clustered together leaving remaining GPP actors similarly clustered. In the case of our exploration of the mp-sched implementation design space, each pipeline was located in a separate thread and the number of actors with GPU acceleration was congurable. Mp-sched pipelines could run in parallel and share the GPU as an acceleration resource during runtime. Multiple pipelines with GPU accelerated actors were forced to serialize their GPU accesses according to CUDA conventions. For example, one possible solution to a 2x20 instance of mp-sched is shown in Figure 5. The Gantt chart is not to scale, but shows how the two different pipelines (one in red and one in blue), are able to run in parallel on the two GPPs, but must have exclusive access to the GPU when running accelerated actors. While the cross thread sequencing was not specied at runtime, GRGPUs ability to specify acceleration and clustering enables the creation of multicore, GPU accelerated complete solutions. The problem for a designer is then to leverage GPPs and GPU, weigh SSE acceleration and CUDA acceleration, account for communication latencies between GPU and GPP and thread to thread, and consider how all of this will occur in parallel. Models and automated techniques should continue to assist in providing good starting points, but a necessary condition to arriving at a quality solution is still the ability to try many points quickly. To this end, we constructed an illustrative example that produces an interesting set of design points: mp-sched with 20 stages and varied the number of pipelines. Figure 6 shows a sub-sampling of the design space. All GPP means all stages of all pipelines are assigned to the GPPs, while All GPU means all stages of all pipelines are assigned to the GPU. 3/4 GPP, Half GPP, and 3/4 GPU indicates that three quarters, one half, or one quarter of the stages of all pipelines are assigned to the GPP, respectively, while the remaining actors use the GPU. For example, Figure 5 shows the 2x20 Half GPP solution. We also evaluated solutions in which one of the pipelines was all GPP and the rest GPU (One GPP) and the reverse (One GPU). In the case of only one pipeline, these solutions were equivalent to an all GPP or all GPU solution. We ran each solution for 200,000 samples and recorded the execution time, including GNU Radio overheads, communication overheads, etc.

Fig. 4.

SDF graph of the mp-sched Benchmark.

V. E VALUATION We have experimented with the proposed design ow using the mp-sched benchmark. Figure 4 shows the mp-sched benchmark pictorially. Each of the actors after the distributor performs FIR ltering. To provide exibility for evaluating different multicore platforms, it is congurable with number of chains of FIR lters (pipelines) and the depth of the chains (stages). This benchmark describes a ow graph that consists of a rectangular grid of FIR lters. The dimensions of this grid are parametrized by the number of stages (ST AGES) and number of pipelines (P IP ES). The total number of FIR lters is thus equal to P IP ES ST AGES. This benchmark represents a non-trivial problem for the multiprocessor scheduler as all actors in different pipelines can be executed in parallel. More information about the mp-sched benchmark can be found in [1]. A. FIR Filter Design In this implementation [14], we take advantage of data parallelism between the lter output samples as well as functional parallelism to calculate every sample. For relatively large chunks of samples, the CUDA kernel is congured such that the number of blocks is equal to double the number of available streaming multiprocessors. By using this conguration, the rst level of data parallelism can be achieved if every CUDA block is responsible to calculate a different set of output samples. In other words, the required number of output samples are evenly distributed on the number of CUDA blocks. To overcome the inherited stateful property of the FIR lter(i.e, consecutive output samples depend on some shared input samples), the input of every block must contain an extra set of delayed input samples equal to the number of taps. To reduce the number of device memory access, initially all of the threads will perform a load of a coalesced chunk of input elements to the shared memory of a multiprocessor. Then every thread will be responsible of calculating the product

71

Fig. 5. Gantt chart for 2x20 mp-sched graph on 2 GPP and 1 GPU. The blue and the red set of blocks and arrows each represent one branch of the mp-sched instance. Fig. 6. A sampling of the design space of 1x20, 2x20, 3x20, 4x20 mp-sched graph on 2 GPPs and 1 GPU for different assignments.

For the 60 tap FIR lter, SSE acceleration performs well, but still somewhat slower than the GPU implementation, so once a sufcient amount of computation is located on the GPU, GPU weighted implementations tend to perform better. But this graph does reveal that the GPU should be employed in different ways depending on the number of pipelines. For example, a single pipeline implies that there is not quite enough computation present to merit GPU acceleration. However when 2 or more pipelines are used, the GPPs become saturated to the point that GPU acceleration can improve upon the result. When 4 pipelines are needed, one GPP only pipeline proves higher performing than an all GPU solution, indicating that the GPU itself has become saturated with computation and that employing more of the GPP is appropriate. In each of the cases, retrospective reasoning gives us insight into improving performance, but a change in GPU, communication latencies, etc. would likely change this space again, leaving a designer to re-explore the design space. It should be possible to arrive at these solutions more analytically to accelerate the design space exploration, but inevitably a set of points will need to be evaluated to judge the efcacy of any analytical assistance. GRGPU will continue to provide value in such a scenario feeding-back empirical solutions to the design space exploration engine. VI. C ONCLUSION
AND

solutions that have been developed. Also, GRGPU should be able to extend to multi-GPU platforms by customizing GRGPU actors to communicate and launch on a specic GPU. Acknowledgments This research was sponsored in part by the Laboratory for Telecommunication Sciences, and Texas Instruments. R EFERENCES
[1] http://gnuradio.org/redmine/wiki/gnuradio. Nov 2010. [2] H. Berg, C. Brunelli, and U. Lucking. Analyzing models of computation for software dened radio applications. In Proc. IEEE International Symposium on System-on-Chip, pages 14, Nov. 2008. [3] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataow Graphs. Kluwer Academic Publishers, 1996. [4] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Synthesis of embedded software from synchronous dataow specications. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 21(2):151166, June 1999. [5] Y. Bouchebaba, P. Paulin, A. E. Ozcan, B. Lavigueur, M. Langevin, O. Benny, and G. Nicolescu. Mpassign: A framework for solving the many-core platform mapping problem. In Rapid System Prototyping (RSP), June 2010. [6] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: Compiling an embedded data parallel language. Technical Report UCB/EECS-2010124, EECS Department, University of California, Berkeley, Sep 2010. [7] C. Hsu, I. Corretjer, M. Ko., W. Plishker, and S. S. Bhattacharyya. Dataow interchange format: Language reference for DIF language version 1.0, users guide for DIF package version 1.0. Technical Report UMIACS-TR-2007-32, Institute for Advanced Computer Studies, University of Maryland at College Park, June 2007. Also Computer Science Technical Report CS-TR-4871. [8] E. A. Lee and D. G. Messerschmitt. Synchronous dataow. Proceedings of the IEEE, 75(9):12351245, September 1987. [9] V. Marojevic, X. R. Balleste, and A. Gelonch. A computing resource management framework for software-dened radios. IEEE Transactions on Computers, 57:13991412, 2008. [10] R. Niemann and P. Marwedel. Hardware/software partitioning using integer programming. In Proc. of the European Design and Test Conference, pages 473 479, Mar. 1996. [11] NVIDIA. CUDA C programming guide version 3.1.1. July 2010. [12] S. Stuijk, T. Basten, M. C. W. Geilen, and H. Corporaal. Multiprocessor resource allocation for throughput-constrained synchronous dataow graphs. In Proc. of the 44th annual Design Automation Conference, DAC 07, pages 777782, June 2007.

F UTURE W ORK

As SDR attempts to leverage more special purpose multicore platforms in complex applications, application developers must be able to quickly arrive at an initial prototype to understand the potential performance benets. In this paper, we have presented a design ow that extends a popular SDR environment, lays the foundation for rigorous analysis from formal models, and creates a stand-alone library of GPU accelerated actors which can be placed inside of existing applications. GPU integration into an SDR specic programming environment allows application designers to quickly evaluate GPU accelerated implementations and explore the design space of possible solutions at a system level. Useful directions for future work include new methods for dealing with scheduling, partitioning, and mapping for multicore systems along with evaluating existing automation

72

[13] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A gpgpu compiler for memory optimization and parallelism management. In Proc. of the 2010 ACM SIGPLAN conference on Programming language desing and implementation, June 2010. [14] G. Zaki, W. Plishker, T. OShea, N. McCarthy, C. Clancy, E. Blossom, and S. S. Bhattacharyya. Integration of dataow optimization techniques into a software radio design framework. In Proceedings of the IEEE Asilomar Conference on Signals, Systems, and Computers, pages 243 247, Pacic Grove, California, November 2009. Invited paper. [15] K. Zheng, G. Li, and L. Huang. A weighted-selective scheduling scheme in an open software radio environment. In IEEE Pacic Rim Conference on Communications, Computers and Signal Processing, pages 561 564, Aug 2007.

73

Validation of Channel Decoding ASIPs A Case Study


Christian Brehm, Norbert Wehn
Microelectronic Systems Design Research Group University of Kaiserslautern, Germany {brehm, wehn}@eit.uni-kl.de

Sacha Loitz, Wolfgang Kunz


Electronic Design Automation Research Group University of Kaiserslautern, Germany {loitz, kunz}@eit.uni-kl.de

Abstract It is well known that validation and verication is the most time consuming step in complex System-on-Chip design. Thus, different validation and verication approaches and methodologies for various implementation styles have been devised and adopted by the industry. Application specic instruction set-processors (ASIPs) are an emerging implementation technology to solve the energy efciency/exibility trade-off in baseband processing for wireless communication where multiple standards have to be supported at a very low power budget and a small silicon footprint. In order to balance these contrary aims ASIPs for these application domains have a restricted functionality tailored to a specic class of algorithms compared to traditional ASIPs. Downside of the outstanding efciency/exibility ratio is the coincidence of bad attributes for validation. Compared to standard processors, these ASIPs often have a very complex instruction set architecture (ISA) due to the tight coupling between the instructions and the optimized micro-architecture requiring new validation concepts. This paper will sensitize for the distinctiveness and complexity of the validation of ASIPs tailored to channel decoding. In a case study a composite approach comprising formal methods as well as simulations and rapid-prototyping for validating an existing channel decoding ASIP is applied and transferred it into an industry product.

I. I NTRODUCTION Todays and future wireless communication networks require exible modem architectures to support seamless services between different network standards. Next generation handsets will have to suppport multiple standards, such as UMTS, LTE, DVB-SH or WiMax. This creates the demand for the design of exible, yet power- and area-efcient solutions for baseband signal processing, which is one of the most computation intensive tasks in mobile wireless devices [1]. Application specic instruction set processors are a very promising candidate for this task, as they promise a much higher exibility than dedicated architectures and a better energy efciency than general purpose processers (GPPs) [2]. For many applications efcient ASIP designs are best derived from standard processor pipelines in a top-down manner. This is done by adding functionality and instructions for the most common kernel operations of the targeted algorithms, such as e.g., an FFT. Also in the eld of channel decoding ASIPs are very popular, as they are seen as an elegant way to cope with the
978-1-4577-0660-8/11/$26.00 c 2011 IEEE

vast amount of different coding schemes and their parameters, e.g. [3][7]. However, there ASIP designs often have not much in common with an enhanced standard processor pipeline. Energy and area efciency demand for distributed memory embedded into the pipeline which are typical for many stateof-the-art decoding schemes and the demand for unifying the commonalities of several dedicated architectures. Fully customized deep pipelines with non-standard memory interfaces and instructions tailored to the targeted algorithms are the consequence. A minimal support for ow control operations is added, resulting in a weakly programmable architecture that offers no more than the desired exibility. We denote this type of ASIPs as Weakly Programmable IP Cores (WPIPs). While WPIPs combine many advantages of standard IP block design and programmable architectures, they inherit the drawbacks of the respective implementation styles w.r.t. validation. This has yet been barely addressed by the research community. Muller [8] and Alles [9] have presented rapid prototyping platforms for ASIPs, which can be used for testing purposes. But both approaches are far too inexible, as they can only show the presence of errors but never their source or their absence. The rest of the paper is structured as follows: we will illuminate the differences in the design ows for the various implementation styles (Section II) and the challenges in WPIP validation (Section III) quantify the effort required for different verication and validation tasks in order to sensitize for the importance introduce our approach for validation in a case study which was successfully applied to our FlexiTreP ASIP in order to bring it to product level (Section IV). II. I MPLEMENTATION S TYLES Design methodologies for the implementation of digital signal processing systems consist of two phases. The goal of the rst phase is to make all functional design decisions from algorithm selection down to quantization. Purely functional, software-based system models are used at this stage to guarantee the desired functional behavior. For state-of-the-art communication systems this step is particularly challenging, as the communications performance of todays channel codes can not be evaluated analytically. Instead, extensive Monte Carlo simulations have to be performed to determine the

74

TABLE I C HARACTERISTICS IN ASIP I MPLEMENTATION S TYLES Top-Down Approach (classical ASIP) Standard Pipeline Standard Memory Access Scheme Standard Instruction Set Extendend by Application Specic Instructions Single-context instructions Bottom-Up Approach (WPIP) Application Specic Pipeline Application Specic Memory Access and Organization Only Application Specic Instructions dening interplay of functional blocks Multi-context instructions

Fig. 1.

Implementation Styles [10]

bit-error performance or frame-error performance of every single design candidate. At the end of this iterative renement and evaluation process stands the so called Golden Reference Model, a functional software implementation of the system. The second phase deals mostly with non-functional aspects of the actual system implementation. Various implementation styles exist and the right choice depends on exibility and energy or area efciency requirements (cf. Figure 1). By far the most challenging part at this stage is the validation and verication of the implementation against the Golden Reference Model. Their properties are highlighted in the following.

General Purpose Processors (GPP) offer the greatest exibility of all implementation styles. It is also easily possible to upgrade such systems to support new features or new standards by a simple update of the system software. Another advantage of this implementation style is the comparatively low effort for system validation and verication. Given the correctness of the processor and a functional model of the instruction set architecture (so called ISA model), the application software can be validated independently from the underlying processor. The correctness of the hardware is often proven with formal verication methods and guaranteed by the manufacturer. The big drawback of such platforms is their very low area and energy efciency. Dedicated, hardwired architectures in contrast offer the highest implementation efciency. For such architectures, traditional synthesis based design ows are widely used. As the RTL (register transfer level) hardware description is typically derived at least in parts by iterative renement of the well elaborated golden reference model, the correctness of the implementation can be shown by simulation or formal methods. The high implementation efciency of dedicated architectures of course comes at the cost of very limited exibility. Application Specic Instruction Set Processors (ASIP) [11][13] try to close the gap between dedicated hardwired and programmable off-the-shelf solutions. Typically, the instruction set of a GPP is enhanced with

special non-standard instructions to allow a more efcient processing of the algorithms under consideration. These additional instructions, extracted in detailed analysis and proling of the algorithms, are supported by additional dedicated stages which are inserted into the processor pipeline. Thus, the original instructions from the GPP remain unchanged and the ISA is only enhanced. Concepts from standard processor validation are still applicable. Weakly programmable IPs (WPIP), too, are ASIPs, but they are created in a bottom-up approach starting from dedicated architectures rather than from a GPP. The commonalities of dedicated architectures with similar kernel operations and memory requirements are extracted and unied in a fully customized pipeline with a custom, scattered memory architecture offering exactly the required bandwidth and exibility. The characteristic differences compared to traditional ASIPs are faced in Table I. The gain of this approach is a performance and energy efciency very close to that offered by dedicated architectures and at the same time offering at least the minimum required exibility from programmable architectures (see Figure 1). Thus, they are the preferable implementation style for upcoming multi-standard channel decoder implementations. The biggest challenge in WPIP design, however, is the validation. III. WPIP VALIDATION

While WPIPs inherit many desirable properties from dedicated as well as programmable architectures, this is not true for the ease of validation. The ISA of a WPIP is not designed specically, but is merely an emerging phenomenon from the combination of architectures. Thus, there is no standard ISA model that can be used in tools for formal verication. The tight coupling of hardware and software hinders the separated validation of the WPIP architecture without the applications running on it. Furthermore, the pipeline of a typical WPIP is very deep (e.g. , 15 stages in case of [14]) and contains a complex system of irregularly sized and distributed memories, which even may be accessed in an out-of-order fashion in several pipeline stages. This fact creates inter-instruction dependencies over many clock cycles exceeding the capabilities of formal verication tools commonly available today. Taking these properties into account the following approaches turned out as being potentially appropriate for WPIP validation.

75

TABLE II RUNTIMES FOR S IMULATIONS AND V ERIFICATION Simulation Property Checking [16] Monte Carlo Simulation (Software) Viterbi, 1k info bits, w/ ASIP bTC UMTS, w/ ASIP bTC UMTS, w/ SW reference decoder RTL Sim. bTC, UMTS, only ASIP Runtime 10 k blocks 18 h 0.7 h 10 h 15 min 47 h Throughput 83 properties 3.8 kbps 1.4 kbps 58 kbps 0.3 kbps

TABLE III S IMULATIONS R EQUIRED FOR M ULTI - STANDARD S YSTEM VALIDATION Code Types (Enc. Type, Tailing, Rates, Polynomials) 3 8 4 1 1 1 1 Different Blocksizes per Type 375 1000 724 188 5075 18 17

Standard

Code Viterbi 256 states Viterbi 64 states Viterbi 16 states bin. Turbo bin. Turbo bin. Turbo duobin. Turbo

GSM/EDGE LTE UMTS/HSPA CDMA2k WiMax

A. Formal Verication Formal methods prove the absence of errors, while simulation can always only show the presence of errors. Loitz et. al. [15] have recently established a way to reduce the complexity of interval property checking for WPIPs by composing instructions of micro operations making the validation of complex WPIP instructions feasible. Their completeness approach enables a formal proof that each of the complex instructions behaves as intended. Although this does not necessarily prove our design to be completely functional as expected, verication of each instruction can detect errors such as saturation or rounding issues. However, formal verication of the system behavior is by now impossible and still under research. B. Simulations For programmable architectures based on a standard instruction-set, verication of the instruction-set guarantees that any arbitrary algorithm can be implemented. In contrast to that for WPIPs the correctness of all instructions is not sufcient since this does not guarantee that the intended system behavior can be implemented. Hence, despite the condence that formal methods provide, there is still the demand for simulations. Particularly, they are invaluable for analysis during the program development phase which turned into a challenging and error prone task due to the optimized instruction-set. As WPIPs are designed to implement a large number of possible standards, a purely simulation-based validation approach needs to simulate every channel decoding application and compare a statistically signicant number of computed values to the respective golden reference models or the respective frame or bit error rate (FER, BER) point specied in communication standards. Further, WPIPs are not created by renement from the golden reference model, as are dedicated architectures. Hence, there is no structural similarity between the two, which could be exploited for validation. An approach to cope with this will be presented in Section IV. Simulation times of WPIPs exceeding those of the golden reference or even an RTL model of a dedicated architecture by orders of magnitudes (see Table II). It quickly becomes infeasible to perform statistical Monte Carlo simulations as the only means of implementation validation.

overall different code blocks: 17,319

Finally, one of the biggest advantages designers hope for when choosing programmable architectures is the exibility to easily extend the system to new standards. While this is feasible by software adjustment for GPPs or ASIPs based on standard ISAs, every change in the pipeline of a WPIP effectively poses a potential change to the functional behavior of every single application running on it. As there is no clear separation between the WPIP and the applications by the means of a well dened ISA, and hardware is shared over the supported algorithms, even small changes can require a complete re-validation of all implemented applications. C. Rapid Prototyping Simulations are a powerful validation method. Drawback is that simulations with a sufcient amount of test vectors last very long, up to several days, depending on complexity and computation intensity. This problem can be attenuated by rapid prototyping: The simulation is transferred to an acceleration platform, usually an FPGA board and run there. A sophisticated variant is often denoted as hardware in the loop where the testbench or simulation environment remains in software and the device under test is integrated as a real hardware component. D. Combined Approach The specic properties of WPIP and the andvantages and disadvantages of the above described validation methods yield that neither of these common approaches for traditional implementation styles solely will be applicable. A mix of formal methods and simulation or emulation will be appropriate. In the next section we will introduce this approach for WPIP validation using the example of a WPIP designed for industrial use and quantize the validation and verication effort. IV. C ASE S TUDY: F LEXI T RE P VALIDATION The WPIP for the case study is FlexiTreP [14], a Flexible Trellis Processing engine. With its capability of decoding binary and duobinary Turbo and convolutional codes it supports most important wireless communication standards, among others UMTS, LTE, DVB-SH, or WiMax. It comprises 15 pipeline stages and seven memories that are accessed in different pipeline stages. The pipeline is dynamically recongurable in order to react to code changes. The pipeline is implemented in

76

Fig. 2.

ASIP Design and Validation Flow

Fig. 3.

Generic System Simulation Chain

a high-level processor description language, LISA [17]. From this description a cycle accurate C++ simulation model as well as a synthesizable RTL model are generated using Synopsys Processor Designer [11]. For validation we applied a combined approach according to Figure 2: with the properties we gained during the implementation phase from our system specication and the existing implementation knowledge we can verify formally that the instructions described in our high-level language work correctly by applying property checking to the RTL model (cf. left part of Figure 2). This can be done independently from application program development and excludes a wide range of errors such as memory access conicts (e.g. , from stall units), address range faults or rounding and saturation faults. Despite the instruction verication there is still a huge amount of simulations to be done which is shown in the right part of Figure 2. Simulations are mandatory for two purposes. For channel decoding architectures like FlexiTreP the algorithmic performance needs to be proven. This is only possible by Monte Carlo simulations comparing FER or BER against the specication in the respective standards with all their parameters. Therefore the approach from [8] is not applicable since only single blocks can be decoded with this platform. Rapid Prototyping as introduced in [9] is an option for gathering BER/FER performance but lacks in exibility for analysis, debugging and application development: Usually the designer wants to check a modication as early as possible. According to traditional IP block design approaches smaller functional parts of the pipeline are compared against the reference model. This shortens hardware as well as software development which consumes a great amount of time due to the sophisticated ISA. For a rst test of minor modications simulations on single blocks are perfectly suitable and save simulation time. Table III shows that for the validation of FlexiTreP for the required standards more than 17,000 different codeblocks

exist. Each of them is combinable with various typical parameters depending on the code (e.g. , windowsize, block or acquisition length, . . . ) so that the theoretical number of possible combinations multiplies. However, many paramaters can be considered constant for a given code, as they are known to provide the best properties. Nevertheless, for each code block hundreds of thousands of bits have to be simulated in order to reach statistical signicance. As an example let us assume that 10,000 blocks (corresponds a FER of 102 ) per simulation were sufcient. Table II lists the simulation times for a single block for each case. For a complete validation of FlexiTreP for the given standards calculative simulation times of more than ve years were required. In order to reduce this simulation effort, we have set up a simulation environment modeling the channel encoding and decoding chain as depicted in Figure 3. It is arbitrarily congurable. Modules can be exchanged and added or removed according to the needs. For comparison of a design against a reference it is possible to instantiate the design under test (DUT) and an arbitrary reference model in parallel. The deployed reference models are IO equivalent to the implementation, they are well elaborated and also proven by existing hardware implementations. This enables debugging and program development with an environment supporting the exibility of the design. Functional simulations can be run without the fairly slow cycle accurate model only on basis of the well elaborated reference models which reduces the simulation times by up to an order of magnitude, depending on the code. In conjunction with formal methods we still get a high quality while simulation times reduce to a few days. The additional time for property checking is negligible once the properties are set up. Verication can be done independently in parallel to the simulations. Additionally we have added an interface from the simulation chain to an FPGA board over Ethernet. With this the whole simulation chain runs on a standard PC offering the full

77

exibility of simulation. Only the design under test is exchanged by the hardware implementation. This setup enables from the same environment debugging and analysis of single code blocks, performance simulations in software or emulation with the RTL design, or comparisons to an already veried reference. The emulation offers an acceleration of another order of magnitude. Our separation approach offers an additional big advantage: whenever the hardware is modied, it can be shown formally that unmodied instructions are not inuenced by the modications. Hence, by applying the existing (or only slightly modied) properties again, it is assured that programs that do not use any new or modied instructions are still working as before and need not to be simulated again. This reduces the additional validation time for the enhancement to a third compared to a full rerun of all simulations. V. C ONCLUSIONS AND F UTURE W ORK Application specic programmable architectures are spreading quickly in the eld of channel decoding. In this paper we outlined the differences between various implementation styles, showed the advantages of ASIPs and in particular WPIPs for channel decoding and highlighted their disadvantages w.r.t. validation and eshed this out by concrete numbers. We validated our existing FlexiTreP ASIP with our exible channel decoding simulation environment and formal verication methods. We showed that with deliberate simulations in combination with formal methods the effort for validation of a multi-standard architecture can be reduced from infeasible times calculatingly in the range of several months to a few days, depending on the supported codes. Our validated ASIP was successfully produced in a 65 nm technology and integrated into a commercial product. For the future enhancement in the architecture and enhancements for a multi-core system is planned. The validation environment emerged to be perfectly suitable for debugging these thanks to its modular character and congurability. R EFERENCES
[1] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, "SODA: A Low-power Architecture For Software Radio," in Proc. 33rd International Symposium on Computer Architecture (ISCA '06), 2006, pp. 89-101.
[2] C. Rowen, "Silicon-efficient DSPs and digital architecture for LTE baseband," in 10th International Forum on Embedded MPSoC and Multicore, Gifu, Japan, Aug. 2010.

[3] B. Bougard, R. Priewasser, L. Van der Perre, and M. Huemer, "Algorithm-Architecture Co-Design of a Multi-Standard FEC Decoder ASIP," in ICT-MobileSummit 2008 Conference Proceedings, Stockholm, Sweden, Jun. 2008.
[4] S. Kunze, E. Matus, and G. P. Fettweis, "ASIP decoder architecture for convolutional and LDPC codes," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS 2009), May 2009, pp. 2457-2460.
[5] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan, L. Van der Perre, and F. Catthoor, "A unified instruction set programmable architecture for multi-standard advanced forward error correction," in Proc. IEEE Workshop on Signal Processing Systems (SiPS 2008), Oct. 2008, pp. 31-36.
[6] O. Muller, A. Baghdadi, and M. Jezequel, "From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92-102, Jan. 2009.
[7] M. Alles, T. Vogt, and N. Wehn, "FlexiChaP: A Reconfigurable ASIP for Convolutional, Turbo, and LDPC Code Decoding," in Proc. 5th International Symposium on Turbo Codes and Related Topics, Lausanne, Switzerland, Sep. 2008, pp. 84-89.
[8] O. Muller, A. Baghdadi, and M. Jezequel, "From Application to ASIP-based FPGA Prototype: a Case Study on Turbo Decoding," IEEE International Workshop on Rapid System Prototyping, pp. 128-134, Jun. 2008.
[9] M. Alles, T. Lehnigk-Emden, C. Brehm, and N. Wehn, "A Rapid Prototyping Environment for ASIP Validation in Wireless Systems," in Proc. edaWorkshop 09, Dresden, Germany, May 2009, pp. 43-48.
[10] T. Noll, T. Sydow, B. Neumann, J. Schleifer, T. Coenen, and G. Kappen, "Reconfigurable Components for Application-Specific Processor Architectures," in Dynamically Reconfigurable Systems, M. Platzner, J. Teich, and N. Wehn, Eds. Springer Netherlands, 2010, pp. 25-49.
[11] Synopsys Processor Designer, June 2010. [Online]. Available: http://www.synopsys.com/Tools/SLD/ProcessorDev/
[12] Target Compiler Technologies, http://www.retarget.com.
[13] Tensilica Inc., http://www.tensilica.com.
[14] T. Vogt and N. Wehn, "A Reconfigurable ASIP for Convolutional and Turbo Decoding in a SDR Environment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, pp. 1309-1320, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1109/TVLSI.2008.2002428
[15] S. Loitz, M. Wedler, C. Brehm, T. Vogt, N. Wehn, and W. Kunz, "Proving Functional Correctness of Weakly Programmable IPs - A Case Study with Formal Property Checking," in Proc. Symposium on Application Specific Processors (SASP 2008), Anaheim, CA, USA, Jun. 2008, pp. 48-54.
[16] S. Loitz, M. Wedler, D. Stoffel, C. Brehm, N. Wehn, and W. Kunz, "Complete Verification of Weakly Programmable IPs against Their Operational ISA Model," in FDL, A. Morawiec and J. Hinderscheit, Eds. ECSI, Electronic Chips & Systems design Initiative, 2010, pp. 29-36.
[17] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, O. Wahlen, and H. Meyr, "A methodology for the design of application specific instruction set processors (ASIP) using the machine description language LISA," in Proc. IEEE/ACM International Conference on Computer Aided Design (ICCAD 2001), Nov. 2001, pp. 625-630.


Area and Throughput Optimized ASIP for Multi-Standard Turbo decoding


Rachid Al-Khayat, Purushotham Murugappa, Amer Baghdadi, Michel Jézéquel
Institut Telecom; Telecom Bretagne; UMR CNRS 3192 Lab-STICC
Electronics Department, Telecom Bretagne, Technopôle Brest-Iroise, CS 83818, 29238 Brest
Université Européenne de Bretagne, France
E-mail: {firstname.surname}@telecom-bretagne.eu
Abstract—In order to address the large variety of channel coding options specified in existing and future digital communication standards, there is an increasing need for flexible solutions. Recently proposed flexible solutions in this context generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of Application Specific Instruction-set Processors (ASIP). In this paper we illustrate how the application of adequate algorithmic and architecture level optimization techniques on an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The proposed architecture integrates two ASIP components supporting binary/duo-binary turbo codes and combines several optimization techniques regarding pipeline structure, trellis compression (Radix-4), and memory organization. The logic synthesis results yield an overall area of 1.5 mm2 using 90 nm CMOS technology. Payload throughputs of up to 115.5 Mbps in both double binary turbo codes (DBTC) and single binary turbo codes (SBTC) are achievable at 520 MHz. The demonstrated results constitute a promising trade-off between throughput and occupied area compared with existing implementations.
Index Terms—SoC design, Embedded System Architecture, ASIP, Pipeline Processor, Turbo codes, WiMAX, 3GPP, LTE, DVB-RCS.
TABLE I: Selection of standards supporting turbo codes (DBTC: Double Binary Turbo Code, SBTC: Single Binary Turbo Code).

  Standard              Codes   Rates       States   Block size    Channel Throughput
  IEEE 802.16 (WiMAX)   DBTC    1/2 - 3/4   8        up to 4800    up to 75 Mbps
  DVB-RCS               DBTC    1/3 - 6/7   8        up to 1728    up to 2 Mbps
  3GPP-LTE              SBTC    1/3         8        up to 6144    up to 150 Mbps

I. INTRODUCTION

Systems on chips (SoCs) in the field of digital communication are becoming more and more diversified and complex. In this field, performance requirements, such as throughput and error rates, are becoming increasingly severe. To reduce the error rate (closer to the Shannon limit) at a lower signal-to-noise ratio (SNR), turbo (iterative) processing algorithms have been proposed [1] and adopted in emerging digital communication standards. These standards target different sectors: LTE and WiMAX cover metropolitan areas for voice and data applications with limited video service, while the DVB series targets video broadcasting. A selected list of current standards and their throughput requirements is given in Table I. User demands, on the other hand, require these applications to be supported on a single portable device, which calls for future wireless devices such as PDAs and smartphones to be multi-standard. The efficient implementation

of advanced channel decoders, which constitute the most area consuming and computationally intensive block of a baseband modem, therefore becomes more important. Numerous research groups have come up with different architectures providing specific reconfigurability to support multiple standards on a single device. A majority of these works target channel decoding and particularly turbo decoding. The supported types of channel coding for turbo codes are usually Single Binary and/or Double Binary Turbo Codes (SBTC and DBTC). In this context, the work in [2] presents an ASIP-based implementation (Application-Specific Instruction-set Processor) with a flexible pipeline architecture that supports turbo decoding of SBTC and DBTC. The presented ASIP occupies a small area of 0.42 mm2 in 65 nm technology (about 0.84 mm2 in 90 nm); however, it achieves a limited throughput of 37.2 Mbps in DBTC and 18.6 Mbps in SBTC modes at 400 MHz. Besides ASIP-based solutions, other flexible implementations have been proposed using a parameterized dedicated architecture (not based on an instruction set), such as the work presented in [3]. The proposed architecture supports DBTC and SBTC modes and achieves a high throughput of 187 Mbps. However, the occupied area is large: 10.7 mm2 in 130 nm technology (5.35 mm2 in 90 nm). On the other hand, several single-standard dedicated architectures exist. In this category we can cite the dedicated architectures presented in [4] and [5], which support only the SBTC mode (3GPP-LTE). In [4] a maximum throughput of 150 Mbps is achieved at the cost of a large area of 2.1 mm2 in 65 nm technology (about 4.2 mm2 in 90 nm), while in [5] a maximum throughput of 130 Mbps is achieved at the cost of 2.1 mm2 in 90 nm technology. Another example is the dedicated architecture proposed in [6], which supports only the


DBTC mode (WiMAX). A limited throughput of 45 Mbps is achieved (not covering all WiMAX requirements) with an occupied area of 3.8 mm2 in 180 nm technology (about 0.95 mm2 in 90 nm). When analyzing the overall state of the art in this context, one can note that the proposed flexible solutions generally present a significant area overhead and/or throughput reduction compared to dedicated implementations. This is particularly true when adopting instruction-set programmable processors, including the recent trend toward the use of ASIPs. In this paper we illustrate how the application of adequate algorithmic and architecture level optimization techniques on an ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. The considered initial ASIP for turbo decoding is the one proposed in [7]; the proposed optimizations reduce its area from 0.2 mm2 to 0.15 mm2 in 90 nm technology and increase its throughput from 50 Mbps to 115.5 Mbps in DBTC and from 25 Mbps to 115.5 Mbps in SBTC. These significant improvements are obtained by applying three levels of optimization: (1) architecture optimization by re-arranging the pipeline and decreasing the number of instructions used in the iterative loop to generate the extrinsic information, (2) algorithmic optimization by applying trellis compression (Radix-4) to double the throughput in SBTC mode, and (3) memory re-organization to optimize the area. The proposed architecture allows a simple, lightweight 1x1 decoder system which achieves a good ratio between throughput and area to support SBTC/DBTC turbo decoding for an array of standards (WiMAX, LTE, DVB-RCS). The rest of the article is organized as follows: Section II presents the decoding algorithms used in the proposed architecture. Section III explains in detail the proposed architecture of the decoder system and the proposed optimization techniques. The synthesis results and comparisons w.r.t. the state of the art are given in Section IV, and finally the paper concludes with Section V, giving some future perspectives.

II. DECODING ALGORITHMS

A. Turbo decoding

The typical system diagram for turbo decoding is shown in Fig. 1. It consists of two component decoders exchanging extrinsic information via interleaving (Pi) and deinterleaving (Pi^-1) processes. Component decoder 0 receives the log-likelihood ratio lambda_k (1) for each bit k of a frame of length N in the natural order, while component decoder 1 is initialized in interleaved order.
lambda_k = log( Pr{d_k=0 | y_{0..N-1}} / Pr{d_k=1 | y_{0..N-1}} )   (1)
Fig. 1: Turbo decoding system (block diagram: the channel LLRs feed component decoder 0 and component decoder 1, which exchange normalized extrinsic information Z_k^{n.ext} through the interleaver/deinterleaver; component decoder 0 delivers the hard decision.)
The extrinsic information defined by (3) is calculated from the a posteriori probability given by (4), wherein alpha_k(s) and beta_k(s) are the state metrics of the forward (5) and backward (6) recursions respectively, and gamma_k(s',s) are the branch metrics (7). gamma_k^{sys}(s',s) and gamma_k^{par}(s',s) are the systematic and parity symbol LLRs. Finally, when the required number of iterations N_iter is completed, the hard decision is calculated as given by (9).
Z_k^{ext}(d(s',s)=i) = lambda * (Z_k^{apos}(d(s',s)=i) - gamma_k^{n.ext}(s',s)),   i in {00, 01, 10, 11}   (3)

Z_k^{apos}(d(s',s)=i) = max_{(s',s)/d(s',s)=i} (alpha_{k-1}(s') + gamma_k^{int}(s',s) + gamma_k^{n.ext}(s',s) + beta_k(s))   (4)

alpha_k(s) = max_{s'} (alpha_{k-1}(s') + gamma_k(s',s))   (5)

beta_k(s) = max_{s'} (beta_{k+1}(s') + gamma_{k+1}(s,s'))   (6)

gamma_k(s',s) = gamma_k^{int}(s',s) + gamma_k^{n.ext}(s',s)   (7)

gamma_k^{int}(s',s) = gamma_k^{sys}(s',s) + gamma_k^{par}(s',s)   (8)

Z_k^{Hard.dec} = sign(Z_k^{apos})   (9)
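For illustration only, the C sketch below shows how the Max-Log-MAP forward recursion (5) and the a posteriori computation (4) can be written for a generic 8-state double-binary trellis. The trellis connectivity table prev_state, the metric widths and the branch-metric indexing are hypothetical placeholders, not the paper's hardware tables; only the max-based recursions themselves follow the equations above (the backward recursion (6) is symmetric and omitted for brevity).

    #include <limits.h>

    #define NSTATES 8
    #define NSYM    4   /* double-binary symbols 00, 01, 10, 11 */

    /* Hypothetical trellis table: for state s and symbol i,
     * prev_state[s][i] is the predecessor state s' reached by symbol i. */
    extern const int prev_state[NSTATES][NSYM];

    static int max2(int a, int b) { return a > b ? a : b; }

    /* Forward recursion, eq. (5): alpha_k(s) = max_{s'}(alpha_{k-1}(s') + gamma_k(s',s)).
     * gamma[s][i] holds the full branch metric gamma_k(s',s) of the transition
     * labelled i that ends in state s (i.e. gamma^int + gamma^n.ext). */
    void forward_step(const int alpha_prev[NSTATES],
                      const int gamma[NSTATES][NSYM],
                      int alpha[NSTATES])
    {
        for (int s = 0; s < NSTATES; s++) {
            int m = INT_MIN;
            for (int i = 0; i < NSYM; i++) {
                int sp = prev_state[s][i];
                m = max2(m, alpha_prev[sp] + gamma[s][i]);
            }
            alpha[s] = m;
        }
    }

    /* A posteriori values per symbol i, eq. (4): maximum over all transitions
     * labelled i of alpha_{k-1}(s') + gamma_k(s',s) + beta_k(s). */
    void apost_step(const int alpha_prev[NSTATES],
                    const int beta[NSTATES],
                    const int gamma[NSTATES][NSYM],
                    int z_apos[NSYM])
    {
        for (int i = 0; i < NSYM; i++)
            z_apos[i] = INT_MIN;
        for (int s = 0; s < NSTATES; s++)
            for (int i = 0; i < NSYM; i++) {
                int sp = prev_state[s][i];
                z_apos[i] = max2(z_apos[i], alpha_prev[sp] + gamma[s][i] + beta[s]);
            }
    }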

B. Radix-4 decoding algorithm

For SBTC, the trellis length is reduced by half by applying the one-level look-ahead recursion [8]. The modified alpha and beta state metrics for this Radix-4 optimization are given by (10) and (11), where Gamma_k(s'',s) is the new branch metric for the combined two-bit symbol (u_{k-1}, u_k) connecting state s'' and state s.
Fig. 2: Trellis compression (Radix-4) (figure: two consecutive single-binary trellis steps u_{k-1} and u_k over states S0-S3 are merged into a single Radix-4 step labelled by the two-bit symbol (u_{k-1}, u_k).)


alpha_k(s) = max_{s''} {alpha_{k-2}(s'') + Gamma_k(s'',s)}   (10)

beta_k(s) = max_{s''} {beta_{k+2}(s'') + Gamma_{k+2}(s,s'')}   (11)

For efficient hardware implementation, the Max-Log-MAP algorithm is used, as described in [7]. For DBTC, the three normalized extrinsic information values are defined by (2), where i in {01, 10, 11} denotes the value of the k-th symbol, and s' and s are the previous and current trellis states, respectively.
Z_k^{n.ext}(d(s',s)=i) = Z_k^{ext}(d(s',s)=i) - Z_k^{ext}(d(s',s)=00)   (2)

Gamma_k(s'',s) = gamma_{k-1}(s'',s') + gamma_k(s',s)   (12)

Z_{k-1}^{n.ext} = lambda * (max(Z_{10}^{ext}, Z_{11}^{ext}) - max(Z_{00}^{ext}, Z_{01}^{ext}))   (13)

The extrinsic information for u_{k-1} and u_k is computed as:


Z_k^{n.ext} = lambda * (max(Z_{01}^{ext}, Z_{11}^{ext}) - max(Z_{00}^{ext}, Z_{10}^{ext}))   (14)
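As a small illustration of eqs. (13) and (14), the separation of the four Radix-4 symbol extrinsics into two bit-level extrinsics reduces to a few max operations. The C sketch below assumes the symbol extrinsics are already available as integers and uses a generic scaling factor lambda; the fixed-point format is an assumption, not the paper's hardware representation.

    /* Z[i] are the symbol extrinsics for i = (u_{k-1}, u_k) in {00, 01, 10, 11}. */
    static int max2i(int a, int b) { return a > b ? a : b; }

    void radix4_bit_extrinsics(const int Z[4], float lambda,
                               int *z_km1, int *z_k)
    {
        /* eq. (13): bit u_{k-1} = 1 corresponds to symbols 10, 11; u_{k-1} = 0 to 00, 01 */
        *z_km1 = (int)(lambda * (max2i(Z[2], Z[3]) - max2i(Z[0], Z[1])));
        /* eq. (14): bit u_k = 1 corresponds to symbols 01, 11; u_k = 0 to 00, 10 */
        *z_k   = (int)(lambda * (max2i(Z[1], Z[3]) - max2i(Z[0], Z[2])));
    }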


C. QPP Interleaving

Interleaving/deinterleaving of extrinsic information is a key issue when several extrinsic values must be addressed in the same cycle, because memory access contention may occur when the MAP decoder fetches or writes extrinsic information from/to memory. For the 3GPP-LTE standard, the required interleaving/deinterleaving addresses are generated by QPP interleaving, a contention-free interleaver that is expressed by a simple mathematical formula. Let N be the number of data couples in each block at the encoder input:
For j = 0 .. N-1,   I(j) = (F2*j^2 + F1*j) mod N   (15)

where F1 and F2 are constants defined in the standard and j is the index in natural order. By definition, parameter F1 is always an odd number whereas F2 is always an even number. The QPP interleaver has many algebraic properties; an interesting one is that I(x) has the same even/odd parity as x:
I(2k) mod 2 = 0,   I(2k + 1) mod 2 = 1   (16)
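A direct software reading of eq. (15) and of the parity property (16) is shown below; the function and the parity check are illustrative C code, with F1 and F2 being whatever constants the standard tabulates for the chosen block size (not values fixed here).

    #include <assert.h>
    #include <stdint.h>

    /* QPP interleaver address, eq. (15): I(j) = (F2*j^2 + F1*j) mod N. */
    uint32_t qpp(uint32_t j, uint32_t f1, uint32_t f2, uint32_t n)
    {
        uint64_t jj = j;                       /* 64-bit to avoid overflow for N up to 6144 */
        return (uint32_t)((f2 * jj * jj + f1 * jj) % n);
    }

    /* Parity property, eq. (16): since F1 is odd, F2 is even and N is even,
     * I(j) has the same even/odd parity as j, which is what enables the
     * odd/even extrinsic memory banking described in Section III-C. */
    void check_qpp_parity(uint32_t f1, uint32_t f2, uint32_t n)
    {
        for (uint32_t j = 0; j < n; j++)
            assert((qpp(j, f1, f2, n) & 1u) == (j & 1u));
    }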

This algebraic property is used later in Section III-C when designing the memory system for addressing multiple extrinsic values without memory access contention.

III. DECODER SYSTEM ARCHITECTURE

The proposed decoder system architecture consists of 2 ASIPs interconnected directly as shown in Fig. 3. Shuffled decoding ASIPs [7] are configured to operate in a 1x1 mode, with one ASIP (ASIP 0) processing the data in natural order while the other (ASIP 1) processes it in interleaved order. The generated extrinsic information is exchanged between the two ASIP decoder components via a connection of buffers and multiplexers.
Fig. 3: 1x1 decoder system architecture (block diagram: component decoder 0 (ASIP 0) and component decoder 1 (ASIP 1), each with top/bottom input memories and odd/even banks of top/bottom extrinsic memories, exchange extrinsic information through multiplexer and buffer blocks.)
A. ASIP architecture and optimization levels

Fig. 7a illustrates the overall architecture of the optimized ASIP with the proposed memory structure and a pipeline organization in 9 stages. The numbers in brackets indicate the equations (referred to in Section II-A) mapped onto the corresponding pipeline stage. The extrinsic information format at the output of the ASIP is also depicted in the same figure for the two modes SBTC and DBTC. The rest of this sub-section details the proposed architecture optimizations to achieve an efficient solution in terms of area and throughput, classified into three levels.

1) Architecture level optimization: The initial decoding process of the ASIP proposed in [7] was done through 8 pipeline stages and implemented the butterfly scheme for metric computation. During this process, the ASIP calculates the alpha metrics (forward recursion) and beta metrics (backward recursion) simultaneously until it reaches half of the processed sub-block (the so-called left butterfly). In the left butterfly, the recursion units perform the state metric calculations in the first clock cycle and the max operators find the state metric maximum values in the second clock cycle, eq. (5) and eq. (6). While processing the other half of the sub-block (right butterfly), besides finding the state metric values, another three clock cycles are required to perform the additions for the extrinsic information and then to find the maximum a posteriori information, eq. (4). All these operations take place in the EX stage, so 7 clock cycles in total are required to generate the extrinsic information for two symbols. The major proposed optimization at the architecture level is a re-arrangement of the pipeline: the EX stage is modified to place the recursion units and max operators in series, so that a single instruction is enough to calculate the state metrics and find the maximum values, eq. (5) and eq. (6). Similarly, one instruction is needed to find the maximum a posteriori information, eq. (4). In fact, finding the maximum a posteriori information is done in three cascaded stages of max operators (searching for the maximum among 8 metric values). Thus, placing them in series with the recursion units in one pipeline stage would increase the critical path (i.e. reduce the maximum clock frequency). To avoid this, a new pipeline stage (MAX stage) is added after the EX stage to distribute the max operators as shown in Fig. 4. During the decoding process in the left butterfly, the ACS units (Add, Compare, Select) perform the state metric calculations and find the state metric maximum values in the same clock cycle in the EX stage, while during the right butterfly, besides finding the state metric values, the ACS units perform the additions and find the maximum a posteriori information, eq. (4), in the MAX stage in one clock cycle. So 3 clock cycles in total are required to generate the extrinsic information for two symbols. Another proposed optimization concerns the implementation of windowing to process large block sizes, which is achieved by dividing the frame into N windows, where the maximum supported window size is 128 bits. Fig. 5 shows the window processing in the butterfly scheme, i.e. the ASIP calculates the alpha values (forward recursion) and beta values (backward recursion) simultaneously, and when it reaches half of the processed window (left butterfly) and starts the other half (right butterfly), the ASIP can calculate the extrinsic information on the fly along with the alpha and beta calculations. State initializations (alpha^i_init(w_{n-1}), beta^i_init(w_{n-1})) of a specific recursion across windows are done by message passing via a specific array of registers. Since the maximum window size is 128 bits, 48 windows are needed to cover all LTE block sizes, so a 48x96 array of registers is added. 2) Algorithmic level optimization: In the ASIP proposed in [7], the SBTC throughput equals half of the DBTC throughput


3) Memory level optimization: Three major memory structure optimizations are implemented. The first one concerns the normalization of the extrinsic information as presented in eq. (2). This optimization reduces the extrinsic memory by 25% because Z_{00}^{n.ext} is no longer stored. The second optimization is to restrict the supported trellis definitions to a limited number of standards (WiMAX, 3GPP) rather than all possible ones. Besides reducing the complex multiplexing logic, this optimization allows a 1-bit mode selection that is passed through the instruction set, and thus the configuration memories which store the trellis definition are eliminated. The third optimization is to re-organize the input and extrinsic memories, where the input memories contain the channel LLRs, each quantized to 6 bits. In the proposed organization for DBTC mode, the LLR values of the systematic bits (S_0^n, S_1^n) and parities (P_0^n, P_1^n) of the same double binary symbol are stored in the same input memory word. In SBTC mode, each input memory word stores the LLR values (S_0^{n+1}, S_0^n, P_0^{n+1}, P_0^n) of two consecutive bits (single binary symbols). The same approach is used for the extrinsic memories. As the normalized extrinsic values are quantized to 8 bits, in DBTC mode the values Z_{01}^{n.ext}, Z_{10}^{n.ext}, Z_{11}^{n.ext} related to the same symbol are stored in the same memory word, while in SBTC mode each memory word stores the extrinsic values Z^{n.ext}, Z^{n+1.ext} of two consecutive bits. In this way the memory resources are efficiently re-utilized in the two turbo code modes (SBTC/DBTC). Memory sizes are dimensioned to support the maximum block size of the target standards (Table I). This corresponds to the frame size of 6144 bits of the 3GPP-LTE standard, which results in a memory depth of (6144 + 3 tail bits + 1 unused) / ((N_sym = 2) x (N_mb = 2)) = 1537 words for both input and extrinsic memories, where N_sym is the number of symbols per memory word and N_mb is the number of memory blocks (N_mb = 2 as the butterfly scheme is adopted). Table II presents the memories used in the proposed ASIP. It has 2 single-port input memories of size 24x1537 to store the channel LLR values and 2 simple dual-port (one-port-read and one-port-write) extrinsic memories to save the a priori information. Each extrinsic memory is split into two banks: odd 8x1537 and even 16x1537. Each ASIP is further equipped with a 128x16 cross-metric memory implementing buffers to store alpha and beta during the left butterfly calculation phase, re-utilized in the right butterfly phase.

Fig. 4: Modified pipeline stages (EX, MAX) (figure: cascaded max-finder trees of widths 64/32/16/8 distributed across the EX and MAX pipeline stages.)


Fig. 5: Windowing in the butterfly computation scheme (figure: the frame of size N is split into windows w_0 ... w_{N-1}; each window is processed as a left butterfly followed by a right butterfly producing extrinsic information, while alpha_init/beta_init values are passed between consecutive windows over time.)

because the decoded symbol is composed of 1 bit in SBTC while it is 2 bits in DBTC mode. Trellis compression is applied to overcome this bottleneck, as explained in Section II-B. This makes the decoding calculation for SBTC similar to DBTC, as presented in eq. (10), eq. (11) and eq. (12), so no additional ACS units are added. The only extra calculation is the separation of the extrinsic information into the corresponding bit-level values, as presented in eq. (13) and eq. (14), and the cost of its hardware implementation is very small. Fig. 6 depicts the butterfly scheme with Radix-4, where the numbers indicate the equations (referred to in Section II-B). In this case four bits (single binary symbols) are decoded each time.
Fig. 6: Butterfly scheme with Radix-4 (figure: four single-binary LLRs (LLR1-LLR4) are decoded per butterfly step over the frame of size N, with extrinsic values produced according to eq. (13) and eq. (14).)

TABLE II: Typical RAM configuration used for one ASIP decoder component

  Memory name             #   Depth   Width
  Program memory          1     64      16
  Input memory            2   1537      24
  Extrinsic memory odd    2   1537       8
  Extrinsic memory even   2   1537      16
  Cross-metric memory     1     16     128


B. Assembly Code Example

An assembly code example of the proposed optimized ASIP in turbo mode is shown in Fig. 7b. First we initialize the ASIP mode (SBTC, DBTC), the scaling factor lambda identified in eq. (3) and eq. (13) (found from software BER simulations to give the best performance: lambda = 0.75 for DBTC and lambda = 0.5 for SBTC), the current iteration number (iter = 0), the number of windows (N) per ASIP, the length of the windows (L) and the length of the last window (L_last). The REPEAT instruction controls the number of iterations (ITER_MAX = 6). For the first iteration (i = 0) the ASIP starts with zero as the initial state metric (alpha^{i=0}_init(w_n) = beta^{i=0}_init(w_n) = 0). The ZOLB instruction controls the instructions @10 and @12-13 to execute L (or L_last in case of the last window) times. The DATA LEFT instruction @10 executes the left-butterfly recursion, calculating the alpha/beta metrics and storing them in the cross-metric memory. The DATA RIGHT instruction executes the right-butterfly recursion, calculating the beta/alpha metrics to be used on the fly together with the corresponding stored metrics from the cross-metric memory in the next instruction, EXTCALC, which calculates the extrinsic information (3) @12-13 and sends it to the other decoder component through the buffer/MUX; the extrinsic calculation therefore requires two clock cycles. To avoid a conflict in the cross-metric memory when the ASIP finishes processing the left butterfly and starts the right butterfly, the NOP @11 is placed and executed once for a delay of 1 clock cycle. In SBTC mode, four extrinsic values are generated, one for each input LLR, while in DBTC six extrinsic values are generated, three for each input symbol ((13), (14)). The EXCH WIN instruction forwards the last alpha values of window n as alpha^i_init(w_n), initializes the state metrics of the next window with beta^i_init of window n, and increments the current window counter (n = n + 1).

C. Addressing implementation

In DBTC turbo decoding, due to the use of the butterfly scheme, two symbols are decoded at the same time, so two extrinsic values are generated simultaneously and must be addressed to the other component decoder. As explained in Section III-B, during the right butterfly there are two clock cycles to generate the extrinsic information, so one value is addressed in the first clock cycle and the other is buffered to be addressed in the next clock cycle. In SBTC, Radix-4 decoding is adopted. Using this decoding with the butterfly scheme generates four extrinsic values simultaneously each time, which must be addressed and sent to the other decoder component in two clock cycles. To avoid collisions, QPP interleaving is applied as explained in Section II-C. According to eq. (16), odd addresses in the natural domain are also odd in the interleaved domain, and the same holds for even addresses. The extrinsic memories have therefore been split into two banks (odd/even) to avoid memory conflicts. In fact, in the first clock cycle two extrinsic values (out of the four generated in SBTC mode), one with an odd and one with an even address, are sent, followed by the other two extrinsic values in the next clock cycle.

IV. SYNTHESIS RESULTS

The ASIP was modeled in the LISA language using CoWare's Processor Designer tool. The generated VHDL code was validated and synthesized using Synopsys tools and 90 nm CMOS technology. The obtained results demonstrate an area of 0.15 mm2 per ASIP with a maximum clock frequency of F_clk = 520 MHz. Thus, the proposed turbo decoder architecture with 2 ASIPs occupies a logic area of 0.3 mm2 with a total memory area of 1.2 mm2. With these results, the turbo decoder throughput can be computed through equation (17). An average of N_instr = 3 instructions per iteration is needed to generate the extrinsic information for N_sym = 2 symbols in DBTC mode, where a symbol is composed of Bits_sym = 2 bits. In SBTC mode, the same number of instructions is required for N_sym = 4 symbols, where a symbol is composed of Bits_sym = 1 bit. Considering N_iter = 6 iterations, the maximum throughput achieved is 115.5 Mbps in both modes.

Throughput = (N_sym x Bits_sym x F_clk) / (N_instr x N_iter)   (17)
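Reading eq. (17) numerically with the parameter values quoted above (the numbers are the paper's; the function itself is only illustrative arithmetic):

    /* Payload throughput of eq. (17) in bits per second. */
    double payload_throughput(double nsym, double bits_sym, double fclk_hz,
                              double ninstr, double niter)
    {
        return nsym * bits_sym * fclk_hz / (ninstr * niter);
    }

    /* DBTC: 2 symbols x 2 bits; SBTC (Radix-4): 4 symbols x 1 bit.
     * With Ninstr = 3, Niter = 6 and Fclk = 520 MHz both give about 115.5 Mbps. */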

TABLE III: Comparison with state-of-the-art implementations

             Standard compliant    Tech (nm)  Fclk (MHz)  Core area (mm2)  Normalized area @90nm (mm2)  Throughput (Mbps)
  This Work  WiMAX, DVB-RCS, LTE   90         520         1.5              1.5                          115.5 @ 6 iter
  [3]        WiMAX, LTE            130        250         10.7             5.35                         187 @ 8 iter
  [2]        DBTC, SBTC            65         400         0.42             0.84                         18.6-37.2 @ 5 iter
  [6]        WiMAX                 180        99          3.8              0.95                         45
  [4]        LTE                   65         300         2.1              4.2                          150 @ 6.5 iter
  [5]        LTE                   90         275         2.1              2.1                          130 @ 8 iter

Table III compares the obtained results of the proposed architecture with other related works. The ASIP presented in [2] supports both turbo modes (DBTC, SBTC). Although it occupies almost half the area of our proposed ASIP, it offers a throughput about 6 times lower in SBTC mode and 3 times lower in DBTC mode. The parameterized dedicated architecture in [3] supports both turbo modes (DBTC, SBTC) and achieves a higher throughput (about 1.6 times) at the cost of more than 3.5 times the area compared to this work. The SBTC-dedicated architecture proposed in [4] achieves a throughput about 30% higher than the proposed work, but at the cost of almost 3 times the occupied area. Similarly, the SBTC-dedicated architecture proposed in [5] achieves a throughput about 13% higher than the proposed work, but at the cost of almost 1.4 times the occupied area. The DBTC-dedicated architecture proposed in [6] occupies an area around 30% smaller than this work, but the achieved throughput is around 40% lower. This analysis demonstrates how the proposed optimized architecture constitutes a promising trade-off between throughput and occupied area compared with existing implementations.


Fig. 7: ASIP pipeline and execution schedule.
(a) ASIP pipeline architecture (figure): 9 pipeline stages (Prefetch, Fetch, Decode, Operand fetch, BranchMetric1 (7), BranchMetric2 (8), EX (3)(5)(6), MAX (4), Extrinsic Exchange (2)(9)) connected to a 16x64 program memory, 24x1537 input memories holding the channel LLRs (s0, p0, s1, p1), odd (8x1537) and even (16x1537) extrinsic memory banks and a 128x16 cross-metric memory; the numbers in brackets map the equations of Section II onto the pipeline stages. In SBTC mode the extrinsic output word packs two {address, Z^ext} pairs per cycle, and in DBTC mode an address with the three values Z_11^ext, Z_10^ext, Z_01^ext.
(b) Example assembly code:

     k   instruction
     1   SET CONF double
     2   SET SF 6
     3   SET WINDOW ID 1
     4   SET WINDOW N 3              ; set num windows
     5   SET SIZE 32,8               ; 1st and last window length
     6   REPEAT until LOOP 6 times   ; repeat @11-41 if last window executed, else @28-41, for 6*WINDOW_N times
     7   NOP
     8   ZOLB RW1, CW1, LW1          ; repeat 30-31 and 35-36 for CurrWindowLen times
     9   NOP
    10   RW1: DATA LEFT add m column2
    11   CW1: NOP                    ; save last beta, load alpha init
    12   DATA RIGHT add m column2
    13   LW1: EXTCALC add i line2 EXT
    14   EXCH WIN                    ; save last alpha, load beta init if last window, else exchange calculated alpha and beta
    15   NOP
    16   LOOP: NOP

V. CONCLUSION

In this paper, we have presented an area-efficient, high-throughput 1x1 decoder system based on ASIPs that supports turbo codes in both modes: DBTC (WiMAX, DVB-RCS) and SBTC (LTE). Three levels of optimization (architecture, algorithmic, memory) have been proposed and significant performance improvements have been demonstrated. The proposed contribution illustrates how the application of adequate optimization techniques to a flexible ASIP for turbo decoding can make it an attractive and efficient solution in terms of area and throughput. Future work targets the integration of low-power decoding techniques.

VI. ACKNOWLEDGMENT

This work was supported in part by the UDEC and TEROPP projects of the French National Research Agency (ANR).

REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes (1)," in Proc. IEEE International Conference on Communications (ICC '93), vol. 2, pp. 1064-1070, 1993.
[2] T. Vogt and N. Wehn, "A Reconfigurable Application Specific Instruction Set Processor for Viterbi and Log-MAP Decoding," in Proc. IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS '06), pp. 142-147, 2006.

[3] J.-H. Kim and I.-C. Park, "A Unified Parallel Radix-4 Turbo Decoder for Mobile WiMAX and 3GPP-LTE," in Proc. IEEE Custom Integrated Circuits Conference (CICC '09), pp. 487-490, 2009.
[4] M. May, T. Ilnseher, N. Wehn, and W. Raab, "A 150 Mbit/s 3GPP LTE Turbo Code Decoder," in Proc. Design, Automation and Test in Europe Conference & Exhibition (DATE '10), pp. 1420-1425, 2010.
[5] C.-C. Wong and H.-C. Chang, "Reconfigurable Turbo Decoder With Parallel Architecture for 3GPP LTE System," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 7, pp. 566-570, 2010.
[6] H. Arai, N. Miyamoto, K. Kotani, H. Fujisawa, and T. Ito, "A WiMAX turbo decoder with tailbiting BIP architecture," in Proc. IEEE Asian Solid-State Circuits Conference (A-SSCC '09), pp. 377-380, 2009.
[7] O. Muller, A. Baghdadi, and M. Jezequel, "From Parallelism Levels to a Multi-ASIP Architecture for Turbo Decoding," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 92-102, 2009.
[8] Y. Zhang and K. Parhi, "High-Throughput Radix-4 logMAP Turbo Decoder Architecture," in Proc. Asilomar Conference on Signals, Systems, and Computers (ACSSC '06), pp. 1711-1715, 2006.


Design of an Autonomous Platform for Distributed Sensing-Actuating Systems


François Philipp, Faizal A. Samman and Manfred Glesner
Microelectronic Systems Research Group
Technische Universität Darmstadt
Merckstraße 25, 64283 Darmstadt, Germany
{francoisp, faizalas, glesner}(at)mes.tu-darmstadt.de
Abstract—A platform for the prototyping of distributed sensing and actuating applications is presented in this paper. By combining a low power FPGA and a System-on-Chip specialized in low power wireless communication, we enable the development of a large range of smart wireless networks and control systems. Thanks to multiple customization possibilities, the platform can be adapted to specific applications while providing high performance and consuming little energy. We present our approach to designing the platform and two application examples showing how it was used in practice in the frame of a research project for adaptronics.
Keywords—Smart structures, Control Systems, Wireless Sensor Networks, Reconfigurable Hardware

Alternatively, local data preprocessing and in-network aggregation algorithms can be implemented with high energy-efficiency on FPGAs, significantly improving the lifetime of the network. The paper is organized as follows. After a short review of related work in Section II, the architecture of the developed platform and our design concept are introduced in Section III. We then present in Sections IV-A and IV-B two Wireless Sensor-Actuator Network (WSAN) applications using HaLOEWEn as a prototype.

II. RELATED WORK

Prototyping of wireless sensor nodes with reconfigurable hardware was addressed by Hinkelmann et al. in [1]. A Xilinx Spartan3E with 2000k gates was used as a prototyping chip for emulating wireless sensor network microcontrollers. A sophisticated hardware/software debugging interface was additionally developed for precise internal debugging of the design implemented on the FPGA. Although the platform provides enough flexibility to implement and test a large range of wireless sensor network applications, its power consumption is too high for long-term deployments. The node still needs a reliable power supply close at hand, or cables, which makes the wireless communication feature less attractive. The reconfigurability of FPGAs was used by Portilla et al. [2] to implement custom sensor interfaces. Based on a Spartan III with 200k gates, the COOKIE platform can interface a large range of analog and digital sensors thanks to an HDL interface library. However, even if it reduces the power consumption, the limited size of the FPGA does not allow the implementation of the complex data processing circuits required by our target applications. Following a similar approach to reduce energy consumption by locally processing the data, the Imote 2 node [3] developed by Intel includes a high-speed multimedia DSP coprocessor to handle high-bandwidth sensing. Significant improvements in performance and energy-efficiency were shown for various applications in comparison to standard nodes including only a simple microcontroller. Now commercially available with multiple sensor and power supply extensions, the Imote 2 is an interesting alternative for rapid prototyping of wireless network applications involving complex data processing. However, custom hardware implementations enabled

I. INTRODUCTION

Distributed sensing systems are nowadays a key element for the development of intelligent environments and adaptive structures. Information gathered by spatially distributed sensors can either be used for passive monitoring of a system condition or for active real-time feedback control. In both cases, tiny platforms that can be easily integrated on existing structures are required. Using wireless sensor nodes, the placement of power and sensor cables along a construction is no longer necessary, but new issues regarding synchronization, speed and autonomy appear. We introduce a platform combining low power consumption for autonomous wireless sensor network applications and high performance for real-time distributed control systems. The Hardware accelerated LOw Energy Wireless Embedded Sensor-Actuator node (HaLOEWEn) relies on fine-grained reconfigurable hardware to implement complex data processing tasks. While FPGAs tend to replace microcontrollers and DSPs for prototyping control systems, their introduction in the design of very low power autonomous embedded systems is recent. The new generation of FPGAs based on non-volatile memory is highly suitable for this range of applications, where switches between active and sleep periods are frequent. The power consumption of these devices is also sufficiently low for them to be integrated in systems intended to run on long-term deployments with batteries. In addition, monitoring and control of structures with wireless sensor networks depends on high-bandwidth sensing. Large amounts of data are generated by vibration or acceleration sensors. Wireless transmission of raw data would have a non-negligible impact on the energy consumption of the node.



TABLE I: Power consumption of the platform during different sleep modes

  Low Power Mode                           Power Consumption
  FPGA Flash & Freeze - RF SoC Idle        30.7 mW
  FPGA deep sleep - RF SoC Idle            29.8 mW
  FPGA deep sleep - RF SoC LPM1            2.1 mW
  FPGA deep sleep - RF SoC deep sleep      50 uW

by FPGAs result in many cases in higher performance for a larger range of applications.

III. PLATFORM ARCHITECTURE

For our design, we considered an Actel IGLOO FPGA AGL1000V5 [4] with 1000k equivalent system gates as the central unit of the system. It is by default extended by a Texas Instruments CC2531 System-on-Chip (SoC), integrating an IEEE 802.15.4 compliant 2.4 GHz transceiver and an 8051 CPU core [5]. The SoC includes 256k of programmable Flash memory and 8k of RAM. The system runs with a 32 MHz oscillator. Any node can be connected to a PC via the integrated USB port. Through such nodes acting as base stations, data can be accumulated and visualized immediately and user requests can be disseminated to the whole network. Debuggers can be plugged into the board, allowing simultaneous monitoring of the FPGA and the RF SoC operation. The FPGA and the RF SoC communicate via a dedicated SPI bus. When running at the maximum frequency with a DMA controller, the data rate can reach 2 Mbps. Both components have different deep low power modes useful for applications involving long sleeping periods. The FPGA has a so-called Flash & Freeze low-power mode with internal SRAM retention, activated by a dedicated pin driven by the software running on the RF SoC. The RF SoC is also able to switch off the power supply of the FPGA, resulting in a deep low power mode with configuration retention since IGLOO FPGAs are based on flash memory. Thus, FPGA functionality is quickly and energy-efficiently recovered after sleeping periods. The power consumption measured on the platform is summarized in Table I. The power consumption of the FPGA in active mode depends on the implemented design and can range from 4 mW to 120 mW. When the radio is activated, 47 mW have to be added in listening mode and 72 mW when transmitting at maximum power output (10 dBm). The platform can be powered by an external power supply available next to the node or by a battery. If the external conditions are adequate, a hybrid energy harvesting circuit combining power extracted from different sources has been developed for this platform [6]. The latter solutions are well suited for monitoring-only systems, but they are inappropriate when the application includes the control of actuators. The power generated by batteries or energy harvesting is not sufficient in these cases, and the platform is then likely to be supplied by an external source. As they do not require complex data processing, low-rate analog sensors are connected to the integrated ADC of the
Fig. 1. Schematic Side View of the Platform with Extension Boards

SoC. A temperature and a light sensor were placed by default on the board. Further analog sensors may be attached through an external connector. The other parts of the platform are customizable modular circuits which are connected to one of the three remaining FPGA I/O banks. An FPGA with a relatively high number of I/Os (256) has been chosen to maximize the connectivity of the platform. Available extension boards include, for example, additional volatile and non-volatile memory, Analog-to-Digital (ADC) and Digital-to-Analog (DAC) converters, interfaces to other boards, digital sensors, etc. Each available I/O bank has a dedicated 50-pin header connector to plug in the extensions. The board is small enough (60 mm x 96 mm) to be easily integrated in various environments.

A. Development Environment

We distinguish two main parts in typical distributed sensing-actuating applications: communication and data processing. The wireless communication with other nodes of the network is handled by the microcontroller in software, while sensors and actuators are directly interfaced by the FPGA (Fig. 3). Preprocessing of the sensor data is thus handled by dedicated hardware circuits for enhanced energy-efficiency. Similarly, actuator control is implemented on the FPGA for fast and accurate operation. Communication between the microcontroller and the FPGA is limited to updates of control parameters (feedback information from other nodes) and data extracted from the sensors. The wireless communication should guarantee a very accurate synchronization between nodes in the case of distributed control systems, in order to minimize delays in the feedback loop. As it is independent from the sensor and actuator data processing, both operations can run in parallel. Thus, a very accurate synchronization may be achieved by frequent resynchronization phases without interference with the accelerator operation. A direct implementation of the communication protocol on the FPGA, as was done in [7], is limited by the area available on the FPGA. Complex synchronization or routing protocols require control units and memory that cannot fit together with the data processing blocks. Even if improvements in the energy


Fig. 3. Typical Application Mapping

Fig. 4. Wireless Sensor Network for Acoustic Source Localization

Fig. 2. Top and Bottom View of the HaLOEWEn platform

consumption are possible, it is then preferable to use software implementations of the networking protocol for prototyping purposes. Power management is also handled in software: FPGA and RF SoC operation can be shut down to reduce the power consumption of the platform during idle periods. If the platform uses energy harvesting, part of the power management control, like the maximum-power-point-tracking algorithm, can also be mapped on the processor [6]. The communication protocol is programmed in the C language and can be supported by well-known wireless sensor network operating systems like Contiki [8]. The design implemented on the FPGA is described with VHDL or Verilog and Actel IP cores within the Libero IDE.

IV. PROTOTYPING APPLICATIONS

In this section, we present the details of two applications illustrating how the platform is used to test and develop distributed sensing-actuating systems in the frame of a research project.

A. Acoustic localization

We first detail a setup for the prototyping of a wireless sensor network used for acoustic localization [9], as illustrated in Fig. 4. The purpose of this application is to identify the source of a sound disturbance in a closed environment. Each platform is extended with two sensors: an ultrasonic transceiver and a low-cost MEMS microphone. The ultrasonic transceiver is used to perform an accurate self-localization of the nodes with respect to each other. Ultrasonic signals are exchanged at regular time intervals in order to determine distances and then the positions of the nodes with a multilateration algorithm.


Fig. 6. Architecture of the Localization Accelerator

Fig. 5. Time on Arrival Estimator

Nodes may thus have arbitrary positions and can be moved during operation without interfering with the correct operation of the application. The sound disturbance localization process involves two steps where intensive computation is required. In the first step, the time difference of arrival of the sound at the different nodes must be estimated. The most common way to realize this is to use cross-correlation, either with a reference signal or among the sounds recorded by the node members of the network. In order to minimize delays due to communication overhead, the first solution is preferred, although it limits the localization to predefined sounds (a ringtone for example).

t_arrival = t at which (record * reference)(t) is maximum, provided max_t {(record * reference)(t)} > threshold   (1)

Thanks to the FPGA implementation of the cross-correlation based detection, a timestamp can be extracted very quickly. The nodes can then synchronize with each other by exchanging their local time references (post-facto synchronization) and compute the time differences on arrival (TDOAs).
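As a plain software reference for the detection rule of eq. (1), a sliding cross-correlation followed by a thresholded peak search can be written as below. This is an illustrative C model only; the buffer sizes, sample format and threshold value are assumptions and do not describe the FPGA implementation.

    #include <stddef.h>

    /* Returns the sample index of the cross-correlation peak between the
     * recorded signal and the reference pattern, or -1 if the peak never
     * exceeds 'threshold' (cf. eq. (1)). The index times the sampling
     * period gives the local arrival time t_arrival. */
    long toa_detect(const short *record, size_t rec_len,
                    const short *ref, size_t ref_len,
                    long long threshold)
    {
        long best_idx = -1;
        long long best_val = threshold;
        for (size_t t = 0; t + ref_len <= rec_len; t++) {
            long long acc = 0;
            for (size_t i = 0; i < ref_len; i++)
                acc += (long long)record[t + i] * ref[i];   /* correlation at lag t */
            if (acc > best_val) {
                best_val = acc;
                best_idx = (long)t;
            }
        }
        return best_idx;
    }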

In a second step, the TDOAs and the spatial coordinates of each node are combined by using an algorithm based on a least-squares estimation called spherical intersection [10]. The complexity of this algorithm is O(n^3) for three-dimensional localization, where n denotes the number of available measurements. When the size of the network is large, it is thus worthwhile to use hardware acceleration to speed up the computation. However, an implementation of this algorithm is only necessary on one node, since the result is unique for the whole network. As it does not fit together with the cross-correlator on a single FPGA, the localization accelerator is only implemented on a reference node that keeps track of the other nodes' positions and measurements. The hardware architecture of the accelerator is depicted in Fig. 6. Using FPGAs for this application has several advantages: it first allows a fast estimation of the cross-correlation necessary for precise synchronization of the nodes. When activated, sound processing has to be performed continuously in order to detect the sound on every node. This data streaming implies real-time processing of the incoming data, which is only supported using dedicated computation blocks. Additionally, the network traffic generated by the application is highly reduced: only time-of-arrival and synchronization information needs to be exchanged. Secondly, it allows a fast estimation of the source position within the network. The system is then able to run autonomously, without support from an external computation unit, allowing deployments in harsh environments. Examples of applications can be found in the military domain (countersniper project [11]), but also in structural health monitoring based on acoustic emissions [12].

B. Wireless Distributed Control Systems

In our project, the platform will also be used for active vibration and noise control systems, in which the platform will act as a wireless distributed controller. A decentralized control strategy will be used, where a master platform will coordinate data synchronization and communications. Fig. 7 presents an example of a distributed control system for vibration control of a large plate. We assume that the plate vibration is controlled by using N local adaptive controllers. A single small area i in {1, 2, ..., N} on the plate is controlled by one adaptive controller. As shown in the figure, only two adaptive controllers are presented for the sake of simplicity. The objective of the vibration control system is as follows: the vibration sensed at N points on the plate is to be minimized subject to a force disturbance d at an arbitrary location on the plate. The error signals e_i, i in {1, 2, ..., N}, measured by the error sensors should be minimized. A pair of sensor and actuator is placed on each node i. Piezoelectric patches can be used as actuators and sensors; in some cases, tuned mass dampers can also be used to absorb vibrations. The signal u_i is the actuating signal sent by the controller to the actuator, while e_i and x_i are the measured error signal and reference signal, respectively. Both signals are used by the controller parameter adaptation mechanism.


Fig. 9. Adaptive Controller Architecture (figure: adaptive transversal filter with delay line x(k), x(k-1), ..., x(k-P), tunable taps a_0 ... a_P, output u(k), and a parameter adaptation algorithm driven by the error signal e(k).)

u(k) = sum_{j=0}^{P} a_j(k) * x(k-j)   (2)

Fig. 7. Distributed parameter control for vibration control of a large plate (figure: plate with N sensor/actuator pairs and local adaptive controllers; the transfer functions T_dx, T_de,i, T_ux,i and T_ue,i relate the disturbance d and the control signals u_i to the reference signal x and the error signals e_i.)


The controller parameters a_p, p in {0, 1, ..., P}, are adaptively tuned by using the commonly used Least-Mean-Square (LMS) algorithm [13] shown in Equ. (3). The parameters are adaptively updated by using the measurement of the mean square error of an error sensor signal at a local point in the system. The constant mu is the adaptation gain, which can be set to accelerate the parameter adaptation mechanism; however, a higher value tends to destabilize the system, so a correct value must be chosen.

a_j(k+1) = a_j(k) + mu * e(k) * x(k-j)   (3)
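Before the transversal filter and the LMS update are mapped to the FPGA, they can be prototyped in a few lines of C. The sketch below is only an illustration of eqs. (2) and (3): the filter order, the floating-point arithmetic and the ordering of update and output are assumptions for readability, not the fixed-point hardware implementation described in the text.

    #define FILTER_ORDER 16   /* illustrative value of P */

    typedef struct {
        float a[FILTER_ORDER + 1];   /* controller parameters a_0 .. a_P */
        float x[FILTER_ORDER + 1];   /* delay line x(k), x(k-1), ..., x(k-P) */
    } lms_filter_t;

    /* One sampling step: update the taps per eq. (3) with the measured error
     * e(k) and adaptation gain mu, then compute the control output per eq. (2). */
    float lms_step(lms_filter_t *f, float x_k, float e_k, float mu)
    {
        /* shift the delay line and insert the new reference sample */
        for (int j = FILTER_ORDER; j > 0; j--)
            f->x[j] = f->x[j - 1];
        f->x[0] = x_k;

        /* eq. (3): a_j(k+1) = a_j(k) + mu * e(k) * x(k-j) */
        for (int j = 0; j <= FILTER_ORDER; j++)
            f->a[j] += mu * e_k * f->x[j];

        /* eq. (2): u(k) = sum_j a_j(k) * x(k-j) */
        float u = 0.0f;
        for (int j = 0; j <= FILTER_ORDER; j++)
            u += f->a[j] * f->x[j];
        return u;
    }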

Fig. 8. Block diagram of the control systems (figure: N adaptive controllers C(z) with their parameter adaptation mechanisms; the disturbance d acts through T_dx and T_de,i on the reference signal x and the error signals e_1 ... e_N, while the control signals u_1 ... u_N act through T_ux,i and T_ue,i.)

Some blocks of transfer functions are shown in Fig. 7. T_dx is the transfer function from the disturbance signal d to the reference sensor signal x. T_de is the transfer function from the disturbance signal d to the error sensor signal e. T_ux is the transfer function from the control signal u to the reference sensor signal x. T_ue is the transfer function from the control signal u to the error sensor signal e. Fig. 8 shows the block diagram of the distributed adaptive control system. Because the location of the disturbance signal d is not fixed, the parameter values (and perhaps also the structure) of T_dx and T_de will change accordingly. An adaptive control system is used to handle this situation: the parameters of the adaptive controller can be adaptively tuned to compensate for the changes of the transfer function parameters of T_dx and T_de. The structure of the adaptive transversal filter is shown in Fig. 9. The transversal filter and the parameter adaptation algorithm will be implemented on the Actel IGLOO FPGA mounted on the platform. The filter consists of three main units, i.e. a multiplier, an adder and a delay unit (z^-1). The filter output is described in Equ. (2), where u(k) is the control signal, a_j(k) is the tunable controller parameter, and x(k) is the reference signal.

For a system with a single adaptive controller, the error signal is e(k) = y(k) - z(k) = T_de(z)*d(k) - T_ue(z)*u(k). By using Equ. (2), we then have e(k) = T_de(z)*d(k) - T_ue(z)*{sum_{j=0}^{P} a_j(k)*x(k-j)}, or, based on Fig. 8, e(k) = T_de(z)*d(k) - T_ue(z)*C(z)*x(k). From the figure, we see that x(k) = T_dx(z)*d(k) - T_ux(z)*C(z)*x(k), and hence x(k) = [T_dx(z) / (1 + T_ux(z)*C(z))]*d(k). Therefore, in a system with a single controller, the adaptive control will reach a steady state, i.e. e(k) -> 0, when Equ. (4) is fulfilled.

T_de(z) = T_dx(z)*T_ue(z)*C(z) / (1 + T_ux(z)*C(z))   (4)

For a system with N adaptive controllers, the steady-state condition holds when Equ. (5) is fulfilled:

T_de,i(z) = T_dx(z)*T_ue,i(z)*C_i(z) / (1 + sum_{i=1}^{N} T_ux,i(z)*C_i(z))   (5)

where i in {1, 2, ..., N}. In order to optimize the logic gate usage in the IGLOO FPGA, the adaptive transversal filter structure presented in Fig. 9 can also be implemented using a serial architecture with (P + 1) storage registers for the shifted reference signals and controller parameters. Wireless communication has been an interesting issue in the area of networked control systems [14], [15]. There are two control strategies that can be used, depending on the role of the wireless communication on our platform. Firstly, the


wireless communications could be used by a master controller and the other local controllers to synchronize data sampling from the reference and error sensors. In this case a decentralized control strategy is used, where the LMS adaptation algorithm tunes the controller parameters. Secondly, the wireless communications could be used to exchange the actuator and sensor data such that a master controller can perform online parameter identification. The online parameter identification can also be performed in each local controller. The identified parameters can then be used to reconfigure the controller parameter values so as to meet the control objective. However, this strategy requires a high-speed and guaranteed-lossless communication infrastructure. To cope with such issues, a predictive or a state-estimation control strategy can be used to reduce data exchanges over the wireless medium [16], [17].

V. CONCLUSION AND FUTURE WORK

A platform that can be used for multiple distributed sensing and actuating applications is introduced in this paper. The hardware platform is realized as a printed circuit board (PCB), on which two main computing elements, i.e. the IGLOO FPGA and the CC2531 SoC, are used. By plugging multiple independent extensions onto the FPGA, the node can be easily adapted to specific applications. The power consumption of the whole system is low enough for autonomous operation during long periods, while complex control and data processing algorithms can be implemented with high efficiency. In our future work, the performance and the reliability of our platform for the aforementioned applications will be precisely measured. Based on comparisons with traditional motes, the speedup and energy consumption gains will be estimated. Additionally, we will use the platform for further applications on smart structures. In particular, a structural health monitoring network for a bridge will be deployed.

ACKNOWLEDGEMENTS

This work has been supported by the European FP7 project Maintenance on Demand (MoDe), Grant FP7-SST-2008-RTD 233890, and by the Hessian Ministry of Science and Arts toward Project AdRIA (Adaptronik - Research, Innovation, Application) with Grant Number III L 4518/14.004 (2008).

REFERENCES
[1] H. Hinkelmann, A. Reinhardt, and M. Glesner, "A Methodology for Wireless Sensor Network Prototyping with Sophisticated Debugging Support," in Proceedings of the 19th IEEE/IFIP International Symposium on Rapid System Prototyping, 2008.
[2] J. Portilla, A. de Castro, E. de la Torre, and T. Riesgo, "A Modular Architecture for Nodes in Wireless Sensor Networks," Journal of Universal Computer Science, vol. 12, pp. 328-339, 2006.
[3] L. Nachman, J. Huang, J. Shahabdeen, R. Adler, and R. Kling, "IMOTE2: Serious Computation at the Edge," in Proceedings of the International Conference on Wireless Communications and Mobile Computing, 2008.
[4] IGLOO Low-Power Flash FPGAs Datasheet, Actel.
[5] CC253x System-on-Chip Solution for 2.4 GHz IEEE 802.15.4 and ZigBee Applications User's Guide, Texas Instruments.
[6] F. Philipp, P. Zhao, F. A. Samman, and M. Glesner, "Demonstration: Monitoring and Control of a Dynamically Reconfigurable Wireless Sensor Node Powered by Hybrid Energy Harvesting," in Design, Automation & Test in Europe (DATE), University Booth, 2011.

[7] L. A. Vera-Salas, S. V. Moreno-Tapia, R. A. Osornio-Rios, and R. de J. Romero-Troncoso, "Reconfigurable Node Processing Unit For A Low-Power Wireless Sensor Network," in Proceedings of the International Conference on Reconfigurable Computing, 2010.
[8] A. Dunkels, B. Gronvall, and T. Voigt, "Contiki - A Lightweight and Flexible Operating System for Tiny Networked Sensors," in Proceedings of the 29th IEEE International Conference on Local Computer Networks, 2004.
[9] F. Philipp, F. A. Samman, and M. Glesner, "Real-time Characterization of Noise Sources with Computationally Optimised Wireless Sensor Networks," in Proceedings of the 37th Annual Convention for Acoustics (DAGA), 2011.
[10] H. C. Schau and A. Z. Robinson, "Passive Source Localization Employing Intersecting Spherical Surfaces from Time-of-Arrival Differences," IEEE Transactions on Acoustics, Speech and Signal Processing, 1987.
[11] G. Simon, M. Maroti, A. Ledeczi, G. Balogh, B. Kusy, A. Nadas, G. Pap, J. Sallai, and K. Frampton, "Sensor Network-Based Countersniper System," in Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, 2004.
[12] S. D. Glaser, C. U. Grosse, and M. Krueger, "Initial Development of Wireless Acoustic Emission Sensor Motes for Civil Infrastructure State Monitoring," Smart Structures and Systems, vol. 6, pp. 197-209, 2010.
[13] S. Haykin, Adaptive Filter Theory, 3rd ed. Prentice-Hall, 1996.
[14] N. J. Ploplys, P. A. Kawka, and A. G. Alleyne, "Closed-Loop Control over Wireless Networks," IEEE Control Systems Magazine, vol. 24, no. 3, pp. 58-71, June 2004.
[15] H. A. Thompson, "Wireless and internet communications technologies for monitoring and control," Control Engineering Practice, vol. 12, no. 6, pp. 781-791, June 2004.
[16] J. K. Yook, D. M. Tilbury, and N. R. Soparkar, "Trading Computation for Bandwidth: Reducing Communication in Distributed Control Systems using State Estimators," IEEE Trans. Control Systems Technology, vol. 10, no. 4, pp. 503-518, July 2002.
[17] R. Wang, G.-P. Liu, W. Wang, D. Rees, and Y. B. Zhao, "Guaranteed Cost Control for Networked Control Systems Based on an Improved Predictive Control Method," IEEE Trans. Control Systems Technology, vol. 18, no. 5, pp. 1226-1232, Sep. 2010.

90

Session 4 Virtual Prototyping for MPSoC

91

A Novel Low-Overhead Flexible Instrumentation Framework for Virtual Platforms


Tennessee Carmel-Veilleux , Jean-Franois Boland and Guy Bois
Dept. of Electrical Engineering, cole de Technologie Suprieure, Montral, Qubec, Canada Dept. of Software and Computer Engineering, cole Polytechnique de Montral, Montral, Qubec, Canada

AbstractInstrumentation methods for code proling, tracing and semihosting on virtual platforms (VP) and instruction-set simulators (ISS) rely on function call and system call interception. To reduce instrumentation overhead that can affect program behavior and timing, we propose a novel low-overhead exible instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework uses a new table-based parameter-passing method that reduces the runtime overhead of instrumentation to only that of the interception. Furthermore, it provides a high-level interface to extend the functionality of any VP or ISS with debugging support, without changes to their source code. Our framework unies the implementation of tracing, proling and semihosting use cases, while at the same time reducing detrimental runtime overhead on the target as much as 90 % compared to widely deployed traditional methods, without signicant simulation time penalty. Index TermsComputer simulation, Software debugging, Software prototyping, System-level design

I. I NTRODUCTION With the advent of multiprocessor systems-on-chip (MPSoC) for consumer and networking applications, complexity has become a signicant issue for system debugging and prototyping. Simulators and system-level modeling tools have become necessary tools to manage this complexity. Virtual platforms (VP) are system-level software tools combining instruction-set simulators (ISS) and peripheral models that are used to start software prototyping before availability of the nal product. In the case of state-of-the-art MPSoCs, virtual platform models can even be used as the golden model provided to developers years before availability of nal silicon [1]. The proliferation of SystemC-based design-space exploration tools (e.g. Platform Architect [2], ReSP [3], Space Studio [4], etc.) was also made possible by mature VP technology. When using VPs for debugging or design-space exploration, software instrumentation methods can be used to obtain proling data, execution traces or other introspective behavior. The runtime overhead (i.e. intrusiveness) on the target of these instrumentation methods is critical. It must be minimized to prevent interfering with the strict timing constraints common in embedded software [5]. In this paper, we present a novel low-overhead exible code instrumentation framework called Virtual Platform Instrumentation (VPI). The VPI framework can be used to extend existing virtual platforms with additional tracing, proling and semihosting capabilities with minimal target code overhead and timing interference. Semihosting is a mechanism whereby a functions execution on the target is delegated to an external hosted environment, such as a VP.
The authors would like to acknowledge nancial support from the Fonds qubcois de la recherche sur la nature et les technologies (FQRNT), the cole de technologie suprieure (TS) and the Regroupement Stratgique en Microsystme du Qubec (ReSMiQ) in the realization of this research work.

Semihosting is traditionally used to exploit the hosts I/O, console and le system before support becomes available on the target [6]. Through our proposed framework, we make three main contributions. Firstly, we describe a new mechanism for fully inlinable instrumentation insertion with table-driven parameter-passing between a simulated target and its host. Our method completely foregoes function call parameter preparation overhead seen in traditional semihosting. In doing so, we reduce detrimental runtime overhead on the target between 210 times in comparison to traditional methods, while showing nearly identical simulation run times. Secondly, we show that our framework can realize function semihosting, tracing and proling tasks, thus unifying usually separate use cases. Thirdly, we propose a generic high level instrumentation handling interface for VPs which allows for new instrumentation behavior to be added to existing tools, without requiring modications. This paper is organized as follows: section II presents background information and related works about virtual platform code instrumentation, section III describes our proposed instrumentation framework, section IV presents experimental case studies of semihosting and proling with conclusions and future work in section V. II. BACKGROUND AND RELATED WORK In this section we explore different instrumentation methods used for debugging, system prototyping and proling on virtual platforms. This is followed by an overall comparison of the methods, including our proposed VPI framework. For our purposes, we dene virtual platforms as software environments that simulate a full target system on a host platform. Virtual platforms integrate instruction-set simulators as well as models of memories, system buses and peripherals to realize a full SoC simulator. The conceptual layering of a VP is shown in Figure 1. Through the development of our framework, we evaluated the features and mechanisms present in the Simics [7], Platform Architect [2], QEMU [8], ReSP [3] and OVPSim [9] virtual platforms. In our experimental case study of section IV, we concentrated on Simics and QEMU. A. Instrumentation use cases overview We dene instrumentation as tools added to a program to aid in testing, debugging or measurements at run-time. These tools can be implemented as intrusive instrumentation functions in source code or as non-intrusive instrumentation functionality within a VP. In our context, intrusive means that target run time is affected in some way by the instrumentation. An instrumentation site refers to the location where instrumentation is inserted.

978-1-4577-0660-8/11$26.00 2011 IEEE

92

Memory Model

ISS Core 1 System Bus Model

Instruction Set Simulator (ISS) Core 0

Peripheral Model

Virtual Platform Instrumentation Interface (VPII)

Virtual Platform Model of Target System (e.g. SystemC model) Host Operating System (Windows, Linux) Host Hardware (Users Workstation)

Figure 1.

Virtual platform modeling layers

Some examples of intrusive instrumentation use cases are:


compile-time insertion of tracing or proling calls at every function entry and exit point [10, p. 75]; compile-time insertion of code coverage or other measurement statements in existing source code; insertion of probe points for ne-grained execution tracing at the OS kernel level (e.g. Kernel Markers [11] in the Linux kernel). insertion of breakpoints and watchpoints at runtime using a debugger to aid in tracing and debugging; interception of library function calls through their runtime address to emulate functionality or store proling data [4], [3]; runtime insertion of transparent user-dened instruments tied to program or data accesses such as the probe, event and watch mechanisms of Avrova [12]; storage of control-ow data in hardware trace buffers readable through specialized interfaces.

Conversely, examples of non-intrusive instrumentation include:


The instrumentation methods which implement these use cases differ signicantly in how much they affect target and host run time (i.e. their intrusiveness) and what they enable the user to do (i.e. their exibility). In terms of tracing instrumentation, many varieties exist which differ in semantic level. The VPs we evaluated each allowed for straightforward dumping of a trace showing all instructions executed and every data access, with no context-related semantic information. Conversely, user-dened or compiler-inserted high-level tracing stores much less data, but with much higher semantic contents (e.g. a list of task context switches in an OS [7]). When we refer to tracing in this paper, we are referring to the high-level tracing case. In the next section we will discuss semihosting, a method used to implement several of the aforementioned instrumentation tasks either intrusively or non-intrusively. B. Semihosting function calls In the general case, semihosting works by intercepting calls to specic function stubs in the target code. Instead of running the function on the target at these sites, the ISS forwards an event to the VP. The VPs semihosting implementation then examines the processor state in the ISS and emulates the functions behavior appropriately, with target time stopped.

With semihosting, mechanisms for call interception differ by implementation. Run-time and code size intrusiveness are directly linked to which interception mechanism is used. We distinguish three such mechanisms: Syscall interception: the system call instruction is diverted for use in interception. This is the traditional approach as used by ARM tools [6] as well as in QEMU for PPC and ARM platforms. This approach may require an exception to be taken, with associated runtime overhead. Simcall interception: a specic instruction is diverted for use as an interception point. The instruction can be specic for that purpose, like the SIMCALL instruction in the Tensilica Xtensa architecture [13, p. 520], or it can be an architectural NO-OP as in the Simics virtual platform, where it is called a magic instruction [7]. Address interception: the entrypoints of all functions to be emulated are registered as implicit breakpoints in the VP. When the program counter (PC) reaches these breakpoints, interception occurs. This approach is used in tools such as Imperas OVPSim [9], ReSP [3] and Space Studio [4] amongst others. It can also be implemented in any VP with debugging support using watchpoints or breakpoints. With all traditional semihosting methods, context-specic parameters are passed using regular function parameter-passing. The preparation of semihosted function call parameters according to normal calling conventions accounts for most of the runtime overhead of these methods. Since the emulated function is fully executed by VP host code, the entire state of the modeled system can be exploited. Using the example of a MPSoC model, this would imply that internal registers of all CPU cores could be accessed while processing the emulated function. This opens the possibility of emulating as much as a full OS system call as done in [3], [14] or providing perfect barrier synchronization primitives across the system [15]. Semihosting is increasingly used in the implementation of design space exploration tools for hardwaresoftware codesign. In that case, it enables developers to quickly assess the performance of an algorithm without having to deal with the accessory details of OS porting or adaptation early in the design phase [3], [4]. C. Comparison of instrumentation methods In this section we compare different instrumentation methods commonly used under virtual platforms. The comparison matrix is shown in Table I. Although we have not yet detailed our Virtual Platform Instrumentation (VPI) method, it is included in the Table for comparison purposes and fully described in section III. Firstly, we evaluated intrusiveness in terms of code size, run time and features lost to the method (e.g. system call no longer available). Secondly, we established whether the methods work without symbol information (i.e. even with a raw binary image) and whether they allow for inlining of the instrumentation. By symbol information, we mean the symbols table that links function names to their addresses, which is present in all object le formats. Finally, we determined if the methods listed were suitable for different use cases presented earlier. For the qualitative criteria, we evaluated implementation source code or manuals of every method listed to determine the values shown. Our VPI method appears to compare favorably with existing approaches. We contrast our method with other approaches and provide experimental results supporting these intrusiveness comparisons in section IV.

93

Table I I NSTRUMENTATION METHODS COMPARISON MATRIX Intrusive ? Method Compiler-inserted proling function calls Traditional semihosting ("Syscall" interception) Traditional semihosting ("Simcall" interception) Traditional semihosting (Address interception) Watchpoints / Breakpoints VPI (Proposed method) Code High Medium Low Low None Low Run-time High Medium Low None None NoneLow Features lost No Yes Yes No No No Works ? Without Inline symbols No Yes Yes No No Yes No No No No N/A Yes Supports ? Syscall/OS emulation No Yes Yes Yes No Yes

Tracing Yes Depends Depends Yes Yes Yes

Proling Yes Depends Depends Yes No Yes

III. D ETAILS OF PROPOSED FRAMEWORK Our code instrumentation framework (VPI) is composed of two software elements: 1) an inline instrumentation insertion method with table-based parameter-passing, implemented with inline assembler in C code; 2) a high-level virtual platform instrumentation interface (VPII) that handles interception of instrumentation sites by calling appropriate virtual platform instrumentation functions (VPIFs). Combined, these two components form a low-overhead generic code instrumentation framework that can be implemented on any VP or ISS with debugger support or extension capabilities. For the purposes of this paper, the compilers inline assembler extensions are those of the unmodied GCC version 4.5 C compiler [16]. However, the concepts behind our method are tool-agnostic and applicable to production-level compilers. In the following subsections, we refer to the numbered markers in Figure 2 to illustrate the ow of instrumentation insertion from initial source code to the compiler-generated assembler code. Marker 1 of Figure 2 will be listed as (), marker 2 as () and so on. We will use the fopen() C library function as a semihosting example to illustrate instrumentation insertion. A. Target-side instrumentation insertion Instrumentation statements are inserted into target code by the developer using common C macros (). They can refer to any program variable (). Each instrumentation macro expands to inline assembler statements containing a semihosting interception block (,) and a parameter-passing payload related to the desired instrumentation (). The entire instrumentation call site is inserted inline (i.e. in-situ). At compile time, the interception block () and parameter-passing payload table () are constructed from compiler-provided register and memory address allocations. This is done by accessing inlineassembler-specic placeholders () and pretending instructions are emitted from them. When inline assembler is used within a functions body, placeholders referencing C variables in the assembler code are replaced by values from the compilers internal registers and memory addresses allocation algorithms. We save these references out of band from the main code section (.text), in the read-only data section (.rodata). The choice of the .rodata section for the payload data table is deliberate, to prevent instruction cache interference by data that never gets read by user code. However, it is possible and sometimes required to use the .text section for the payload table. For example, if the target OS uses paged

virtual memory, the interception block and payload table may need to be inlined in the code section. Otherwise, the tables effective address range might not currently be mapped-in by the OS, causing a data access exception at interception time. Interception block Although interception is still necessary with our method, we do not mandate the use of a specic mechanism. The interception block from our example of Figure 2 (,) is composed of three parts: 1) Simcall interception instruction (rlwimi 0,0,0,0,9 in this case); 2) pointer-skipping branch; and 3) payload table pointer. The interception block shown is an arbitrary example. Any other interception mechanism described in section II-B could be used, as long as it is supported by the VP. Along with the interception block, a pointer to the parameterpassing payload table is used to link an instrumentation site with its parameters. An unconditional branch is added to the interception block to prevent the fetching and execution of the payload table pointer. Parameter-passing payload table The parameter-passing payload table serves as a link between the target programs state and the high-level instrumentation interface running in the VP. For an instrumentation site, it both uniquely identies desired behavior and provides reference descriptors to the function parameters that should be passed tofrom the handler. These reference descriptors allow a high-level instrumentation interface to both read data from, and write data back to the target programs state. The format used for each payload table is as follows: Signature header (1 word) including a functional identier (16 bits) and quantity (from 015 each) of constants, input variables and output variables references; Constants table (1 word each); Input variables references (xed number of strings and/or instructions); Output variables references (xed number of strings and/or instructions); The signature header identies the desired functional behavior (e.g.: tracing, fopen(), printf(), etc.). For every functional identier, it is possible to use more or less constants, inputs and outputs depending on the need. For instance, a printf() function could be implemented as 16 versions, covering the cases where 0 to 15 variables need to be formatted. Constants are emitted from references known before runtime. For instance, our implementation of a semihosted printf() uses a

94

2 1

char filename[] = "testfile"; FILE *f; f = VP_FOPEN(filename, "w+");

Table II D ESCRIPTION OF F IGURE 2 S PAYLOAD TABLE Compiled value 0x0120000b Description Signature header Function 0x000b 0 constants 2 inputs 1 output Value of retval output variable is in GPR11 Value of lename input variable is in GPR30 Pointer to mode (w+) input variable is contents of GPR10 + 8

Expand macro
do { FILE *retval; __asm__(" SIMCALL/NOP b 2f .long 1f 2: .section "rodata" 1: .long FOPEN_IDENTIFIER .asciz "%[RETVAL]" .asciz "%[FNAME]" .asciz "%[MODE]" .text " : [RETVAL] "=r" (retval) : [FNAME] "rm" (filename), [MODE] "rm" ("w+") ); return retval; } while (0) Instrumentation parameter-passing payload with 5 placeholders

11 30
Interception 3 Pointer to payload 4 (skipped by interception)

8(10) or stw 0,8(10)

(stw) instruction that could replace the reference string of the last reference in the Table. Remarks about insertion method construction As far as we know, every other semihosting-based methods are designed for source code equivalenceall instrumentation-calling code must remain identical after instrumentation is removed. This requirement has the advantage of allowing instrumentation to be included by simply linking with different versions of the libraries. However, parameter-passing becomes bound to the C calling conventions in effect on the target platform. We constructed our proposed instrumentation insertion method to overcome the articial requirement of function call setup when running on virtual platforms. In our case, where we know the instrumented binary will be run in a VP, it becomes only necessary to somehow tell the VP where to nd function parameters after a call is intercepted. Function call preparation merely copies program variables into predetermined registers or stack frame locations. Since the VP can access all system state in the background without incurring instruction execution penalties, we replaced the function call and associated execution overhead with a static parameter-passing table. Parameters can then be accessed by interpreting the table, rather than reading predetermined registers or stack frame locations. The compiler guarantees the reloading from memory of any variable not locally available from registers or offsetable memory locations. This reloading overhead is, in all cases, a subset of standard function call overhead. Another side effect of our methods construction is that instrumentation insertion is always inlined. This has the desirable consequence of following other inlining done by the compiler. It then becomes trivial to instrument functions inlined by the compilers optimizer, and to identify them uniquely, without any special compiler support. Finally, while optimizing compilers can reorder statements around sequence points in C code, some compiler-specic mechanisms can be used to guarantee the positioning of the inlined assembler blocks. During our tests with GCC 4.5 on ARM and PPC platforms, the use of volatile asm statements with memory clobber prevented any instruction reordering from affecting the test result signatures at every optimization level. B. Virtual Platform Instrumentation Interface Within our framework, we propose that the VP be pre-congured to run a centralized instrumentation handler whenever interception occurs at an instrumentation point. A high-level, object-oriented virtual platform instrumentation interface (VPII) layer is used to interface between the VP and the instrumentation functions by providing abstract interfaces to the VPs state and parameter-passing tables.

Parameters list in inline assembler 6 syntax

Compile

/* Start of rlwimi b .long 2: 1:

inserted inline assembler */ 0,0,0,0,9 7 Interception: 2f 3 words, 2 executed 1f

.section .rodata,"a",@progbits Parameter-passing .long 0x0120000b payload: .asciz "11" 1 word and 3 strings .asciz "30" (16 bytes) 8 .asciz "8(10)" .text /* End of inserted inline assembler */

Figure 2.

Overview of VPI instrumentation insertion in source code

constant slot for the pointer to the format string. The example of Figure 2 does not use any constants. Input variable references and output variable references are compiler-provided data references that can be accessed by the VPIFs through the VPII. Table II breaks-down the payload table of our fopen() example from Figure 2. Again, this example is based on a PowerPC target, but equivalent content would be present for any architecture. Although our example of Figure 2 uses only strings for references, both strings and instructions can be used, as long as the table format is understood by the VPII implementation. In the case where instructions are used, the VPII can disassemble them at runtime to decode the references they contain. To illustrate this, we show a store

95

Virtual Platform Instrumentation Functions (VPIF) Virtual Platform Instrumentation Interface (VPII) GDB + Python OR Internal Python Interface

Instruction Set Simulator (ISS) Virtual Platform (VP)

compared our experimental framework implementation to common instrumentation methods using controlled examples. Source code for the case study, as well as for our VPI implementation, is available at http://tentech.ca/vp-instrumentation/ under a BSD open-source license. A. Experimental setup The case study was run on a standard PC running Windows 7 x64 Professional with an Intel Core 2 Duo P8400 with two 2.26GHz cores. The toolchain and C libraries were from the Sourcery G++ 2010.09-53 release, based on GNU GCC 4.5.1 and GNU Binutils 2.20.51. The target was a PowerPC e600 single-core processor on the Wind River Simics 4.0.60 and QEMU PPC 0.11.50 virtual platforms. We used GDB 7.2.50 with Python support as the debugger. We instrumented the QURT quadratic equation root-nding benchmark program from the SNU WCET suite [17] based on two instrumentation scenarios, which were run independently. These scenarios showcase the unication of proling and semihosting use cases since both are implemented using the same VPI framework functionality and insertion syntax. Each scenario comprised a base non-instrumented case, and four instrumented cases. The instrumented cases represent different combinations of instrumentation methods and VPI congurations. The VPI congurations were the following:

Figure 3.

VPI framework implementation layers

t accesses-state-through DebuggerInterface +eval_expr() +read_mem() +write_mem() +read_n_bytes_variable() +write_n_bytes_variable() +read_string_variable() +get_reg() +set_reg() +get_pc() ApplicationBinaryInterface +is_64_bits() +get_longlong_second_reg() +get_longlong_from_regs() +get_regs_from_longlong() +get_adapted_endian() +to_signed() +get_top_address() +get_double_from_longlong() VariableAccessor +read_n_bytes_variable() +write_n_bytes_variable() +read_string_variable() created

DirectAccessor RegisterAccessor SymbolicAccessor

creator VPIInterface +register_func() +get_payload() +process() +accessor_factory() VPIFunction +get_func_id() +process() VPITriggerMethod +register_func() +get_payload() +process() +accessor_factory()

GdbDebuggerInterface

PpcApplicationBinaryInterface

PpcGdbDebuggerInterface PpcVPIInterface PpcGdbVPITriggerMethod

PpcGdbVPIInterface

Figure 4.

Class hierarchy of a sample VPII implementation

It also executes the appropriate virtual platform instrumentation function (VPIF) handler on behalf of the target code. The layers forming this high-level interface are shown in Figure 3. The VPII abstraction allows the VPIF handlers to access registers, memory and internal VP state using generic accessors that hide lowlevel platform interfaces. It also handles data conversion tasks related to a platforms application binary interface (ABI). The VPII is implemented using the high-level language (HLL) extension interfaces built into VPs. For instance, this could be an internal script interpreter, such as Python in Simics [7] or Tcl in Synopsis Platform Architect tools [2]. It could also be a C++ library built on top of a SystemC simulator. Alternatively, the GNU Debuggers (GDB) Python interface can be used to implement a generic VPII suitable for existing ISS and VP implementations with GDB debugging support. For our experimental implementation, we developed VPII and VPIF libraries supporting both Simics and GDBs Python extension interfaces. The class hierarchy for our implementation of the VPII interface for PowerPC targets with GDB-based VP access is shown in Figure 4. In that example, the PpcGdbVPIInterface class is used as the focal point to register instrumentation behavior (VPIF handlers) and access the VP through GDB. For testing, we also developed a sample library of VPIF handlers covering common instrumentation tasks of I/O semihosting, tracing and code timing. New VPIF handlers can be registered and modied dynamically at run-time with our sample Python implementation. Handlers written generically using only parameter accessors without using any VP-specic functionality can be reused on any supported architecture. IV. E XPERIMENTAL CASE STUDY In order to validate that the method we propose is exible and has low overhead, we performed a comparative case study. We

Internal: VPI handler is run internally on Simics Python interpreter with simcall interception. External: VPI handler is run externally on GDBs Python interpreter with debugger watchpoint interception under either Simics or QEMU. VPI: uses our inlined VPI instrumentation for each site; Stub-call: calls a C function stub at every site which wraps an inlined VPI instrumentation site, so that traditional semihosting function call overhead can be compared; Full-code: in the case of the printf() scenario, we run an optimized printf() implementation entirely on the target, with I/O redirected to a null device so that manual non-semihosted instrumentation overhead can be compared.

We compared the three following instrumentation methods:


For each run, we recorded binary section sizes, simulation times on the host and cycle counts on the target. Section sizes provide information about code size interference. Simulation times and cycle counts are used to compare runtime overhead. With the stub-call cases, the results in Tables III, IV and V are compensated by subtracting the wrapped VPI site contribution, which would have articially inated the results of those cases. All results are from release-type builds with no debugging symbols and -O2 (optimize more) option on GCC. Host OS noise was quantied by executing 50 runs of each case. B. Results of printf() semihosting scenario The printf() semihosting scenario compares space and time overhead of a printf() function semihosting use case. In this case, we inserted 3 instrumented sites to display the results of different loops of the QURT benchmark. Each loop ran 100 times, for a total of 300 calls. For all cases, the printf() implementation was functionally equivalent, with full oat support. The printf() statement was printf("Roots: x1 = (%.6f%+.6fj) x2 = (%.6f%+.6fj)\n", x1[0], x1[1], x2[0], x2[1]).

96

Table III TARGET RUNTIME OVERHEAD IN CYCLES FOR P R I N T F () SCENARIO

Simulation time (seconds)

Instrumentation case None Internal VPI External VPI Stub-call Full-code

CPU cycles 5 538 648 5 539 272 5 539 872 5 545 848 7 174 476

Total overhead 0 62424 122424 688824 1 635 82824

Per-call overhead 0 2 4 23 5453

Overhead increase N/A 1 (Base) 2 11.5 2726.5

Simics QEMU
100

150.0 182.2

10

9.791

1
0.353 0.247 0.249 0.285 0.292 0.293 0.304 0.318 0.353

size

size

size

None Internal VPI External VPI Stub-call Full-code

34 796 +132 +168 +320 +328

1864 +0 +32 +0 +0

1224 +200 +200 +80 +40

37 884 +332 +400 +400 +368

Int

Instrumentation case

.text

.data

.rodata

Total

Figure 5.

We ran this scenario on both Simics and QEMU. QEMU only supports the external conguration without source code modications. Since all binaries are identical between Simics and QEMU, the results of Tables III and IV apply equally to both VPs. Simulated CPU runtime overhead results are detailed in Table III. Uncertainty on overhead was 24 cycles because of the timing method. We observe that execution overhead per call for VPI cases is only 24 cycles, depending on the conguration. The external VPI congurationwith watchpoint interceptionrequires twice as many instructions per call as the internal simcall-based VPI conguration. Overheads of the VPI cases are a signicant 511 times reduction over traditional stub-call instrumentation. Function call preparation accounts for the higher overhead of the stub-call case. In contrast, even when excluding I/O cost, the full-code printf() cases has 3 orders of magnitude higher runtime overhead than either VPI cases. Space overhead results are listed in Table IV, in comparison to the uninstrumented base case. Code section (.text) space overheads of the VPI cases are noticeably lower than the other cases. Through manual assembler code analysis we conrmed that function call preparation accounted for the difference observed between the stub-call and fullcode cases. As expected, the lower code section overheads with VPI cases come at the cost of a larger constants section (.rodata), although total sizes are comparable. In terms of simulation time, our VPI frameworks overhead depends considerably on whether an internal or external conguration is used. Simulation times for different scenarios under both Simics and QEMU are shown in Figure 5 (note the logarithmic scale on the simulation time axis). The full -code and stub-call cases in that gure do not have any interception methods enabled at runtime. The internal uninstrumented (Internal None) case is shown to have no penalty on simulation time. Conversely, the external uninstrumented (External None) casewhich uses watchpoint instead of simcall interceptioncauses some baseline interception overhead. Furthermore, the internal VPI-only case is shown to have no penalty over a traditional internal stub-call case. With all instrumented cases, those using internal congurations

display signicantly better simulation performances than those using external congurations. Moreover, the internal VPI instrumented case is even faster than the external uninstrumented case under Simics. This shows that simulation time is practically unaffected by low instrumentation loads when an internal VPI framework conguration is used. The interception and VPII mechanisms appear to be much slower when going through the GDB interfaces used in all external cases. We determined that the slowdown was due to the overhead of both the GDB ASCII protocol and the context switches required to go back and forth between the GDB and VP processes. In contrast, the internal conguration has direct access to VP resources, which explains its better performances. In the case of QEMU, GDB communication overhead was prohibitive enough to prevent the use of our framework for non-trivial cases under that particular VP. C. Results of proling scenario The proling scenario compares overhead of runtime proling between stub-call (i.e. compiler-inserted) and inlined tracing/proling instrumentation. In the stub-call cases, the -finstrument-functions option of GCC was used to automatically insert a call to instrumentation stubs at every function entry and exit points. For the VPI cases, we manually inserted the VPI tracing calls in the C source code at every function entry and exit points. In both cases, the instrumentation behavior involved recording execution tracing information to a le, as usually done by proling tools. The tracing call was of the form vp_gcc_inst_trace("FUNC_ENTER", "NAME", __FILE__, __LINE__), where vp_gcc_inst_trace is a VPI instrumentation site insertion macro. There were 11 instrumentation sites, totalling 456 802 calls over a run and yielding a trace le over 23 megabytes long. This is more than a thousandfold increase in instrumentation calls over the printf() scenario. We did not run this scenario under QEMU in light of the prohibitive simulation times for the much simpler printf() scenario. Results are detailed in Table V. We only present runtime and simulation time overhead results, since space overhead is negligible in runtime-dominated proling use cases. As with the printf() semihosting scenario, large differences exist in results depending on the conguration used. With internal conguration, the instrumentation calls penalized simulation time on the order of 150 s per call. In contrast, the negative impact on simulation speed of accessing VP state through an external interface is clearly demonstrated by over-

97

ern Bas al e N Fu one ll-c S ode Int tub-c Ex ernal all ter V n P Ex al N I ter one na lV PI Ex ter Ba na se lN Fu one ll-c Stu ode Ex bter cal na l lV PI
Simulation times for printf() scenario

Table IV B INARY SIZE OVERHEAD FOR P R I N T F () SCENARIO

0.1

Table V OVERHEADS PER CALL FOR PROFILING SCENARIO Total simulation time (s) 0.250 62.59 69.88 3286 3299 Runtime overhead (cycles) 0 2.13 9.7 4.06 9.7 Simulation overhead (seconds) 0 136.5 152.4 7193 7221

Instrumentation case None Internal VPI Internal stub-call External VPI External stub-call

Secondly, since our prototype implementation uses pure Python scripting code, it is at least an order of magnitude slower than what could be achieved using a native C/C++ implementation. Future work includes implementing our VPI framework on a wider variety of VPs and architectures. Additional case studies and benchmarks could be benecial in identifying more use cases where our method is an optimization of existing practices, while also serving as validation that inlined instrumentation is robust under more optimizations than those we validated. ACKNOWLEDGEMENTS We would like to thank L. Moss, J. Engblom, G. Beltrame and L. Fossati for providing us with valuable insights about code instrumentation on virtual platforms, which helped shape the construction of our framework and its presentation in this paper. We also wish to thank J-P. Oudet and the peer reviewers for helpful comments about the original manuscript. R EFERENCES
[1] Freescale Semiconductors, Inc. (2008, Jun.) Virtutech announces breakthrough hybrid simulation capability allowing mixed levels of model abstraction. Accessed 6/7/2010. [Online]. Available: http: //goo.gl/UErXR [2] Synopsys, inc., CoWare Platform Architect Product Family: SystemC Debug and Analysis Users Guide, v2010.1.1 ed., Jun. 2010. [3] G. Beltrame, L. Fossati, and D. Sciuto, Resp: A nonintrusive transaction-level reective mpsoc simulation platform for design space exploration, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 12, pp. 18571869, Dec. 2009. [4] L. Moss, M. de Nanclas, L. Filion, S. Fontaine, G. Bois, and M. Aboulhamid, Seamless hardware/software performance co-monitoring in a codesign simulation environment with rtos support, in Proc. Design, Automation Test in Europe Conf. Exhibition (DATE), 2007, pp. 16. [5] S. Fischmeister and P. Lam, On time-aware instrumentation of programs, in Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symp. (RTAS), Apr. 2009, pp. 305314. [6] ARM Ltd, ARM Compiler toolchain: Developing Software for ARM Processors, 2010, version 4.1, document number ARM DUI 0471B. [Online]. Available: http://goo.gl/qlKkO [7] J. Engblom, D. Aarno, and B. Werner, Full-System Simulation from Embedded to High-Performance Systems. Springer US, 2010, ch. 3, pp. 2545. [Online]. Available: http://dx.doi.org/10.1007/ 978-1-4419-6175-4_3 [8] Qemu open-source processor emulator. [Online]. Available: http: //www.qemu.org [9] Imperas Ltd. (2010) Technology ovpsim. Accessed 12/15/2010. [Online]. Available: http://www.ovpworld.org/technology_ovpsim.php [10] IBM Corporation, IBM XL C/C++ for Linux, V11.1, Optimization and Programming Guide, 2010, document number SC23-8608-00. [Online]. Available: http://goo.gl/e1Ri9 [11] J. Corbet. (2007, aug) Kernel markers. Accessed 12/10/2010. [Online]. Available: http://lwn.net/Articles/245671/ [12] B. L. Titzer and J. Palsberg, Nonintrusive precision instrumentation of microcontroller software, ACM SIGPLAN Not., vol. 40, pp. 5968, June 2005. [Online]. Available: http://doi.acm.org/10.1145/1070891.1065919 [13] Tensilica, inc., Xtensa Instruction Set Architecture: Reference Manual, Santa Clara, CA, Nov. 2006, document number PD-06-0801-00. [14] H. Shen and F. Petrot, A exible hybrid simulation platform targeting multiple congurable processors soc, in Proc. 15th Asia and South Pacic Design Automation Conf., Jan. 2010, pp. 155160. [15] N. Anastopoulos, K. Nikas, G. Goumas, and N. Koziris, Early experiences on accelerating dijkstras algorithm using transactional memory, in Proc. IEEE Int. Symp. on Parallel Distributed Processing (IPDPS), May 2009, pp. 18. [16] Free Software Foundation. The gnu c compiler. Accessed 11/1/2010. [Online]. Available: http://gcc.gnu.org [17] S.-S. Lim. (1996) Snu-rt benchmark suite for worst case timing analysis. Original SNU site now down. [Online]. Available: http: //www.cprover.org/goto-cc/examples/snu.html

head results around 7 ms per call, which is close to 50 times worse than with internal cases. On the opposite end of the performance spectrum, the internal VPI instrumentation case displays signicantly lower runtime overhead than the traditional stub-call approach, for a comparable simulation time. In terms of target runtime overhead, a reduction of 25 times over the stub-call case is seen with the VPI cases. If complex behavior had been implemented in the instrumentation functions on the target instead of wrapping a VPI call, overhead would have increased proportionately over the simple stub-call cases shown. V. C ONCLUSION AND FUTURE WORK Compared to existing semihosting and proling instrumentation approaches, our contributed framework is shown to have lower runtime and space overhead on the target. In both case study scenarios, our method showed 211 times lower runtime interference compared to traditional methods. The lower overall target overhead and the construction of our VPI instrumentation insertion method enable the use of our framework to unify the implementation of previously separate semihosting and tracing/proling use cases. Because our method allows for inlining, is fully compatible with all optimization levels and has low target space and time overhead, it may remain in release code. With interception disabled in the VP, instrumented sites do not affect the runtime. This opens the possibility of distributing instrumented binaries which can later be pulled from the eld for re-execution with instrumentation enabled under a VP. In terms of simulation time, our VPI implementation has performances comparable to traditional stub-call semihosting when using the internal conguration. We have also shown that our framework can be used to extend the instrumentation capabilities of existing VPs without changing their source code. This add-on instrumentation capability exploits scripting interfaces currently available in VPs and provides users with the option of reusing our sample implementation in their own environments. While our results validate our assertions, we must also acknowledge that our prototype implementation suffers from some performance issues which are unrelated to the core VPI concepts presented in this paper. Firstly, simulation time overhead is dominated by choice of VPI conguration, with the external conguration executing as much as 50 times slower than internal congurations. In the case of our GDB-based external implementation, performances are limited by the communication and context switching overheads between GDB and the VP. These performance issues are due to the architecture of GDB and shared by any tool employing GDB as a generic interface to a virtual platform.

98

Using Multiple Abstraction Levels to Speedup an MPSoC Virtual Platform Simulator


Jo o Moreira , Felipe Klein , Alexandro Baldassin , Paulo Centoducatte , Rodolfo Azevedo and Sandro Rigo a
of Computing University of Campinas (UNICAMP) Brazil joao.moreira@lsc.ic.unicamp.br, {klein, ducatte, rodolfo, sandro}@ic.unicamp.br IGCE/DEMAC UNESP Brazil alex@rc.unesp.br
Institute

AbstractVirtual platforms are of paramount importance for design space exploration and their usage in early software development and verication is crucial. In particular, enabling accurate and fast simulation is specially useful, but such features are usually conicting and tradeoffs have to be made. In this paper we describe how we integrated TLM communication mechanisms into a state-of-the-art, cycle-accurate, MPSoC simulation platform. More specically, we show how we adapted ArchC fast functional instruction set simulators to the MPARM platform in order to achieve both fast simulation speed and accuracy. Our implementation led to a much faster hybrid platform, reaching speedups of up to 2.9x and 2.1x on average with negligible impact on power estimation accuracy (average 3.26% and 2.25% of standard deviation).

I. I NTRODUCTION As new hardware architectures become increasingly complex, the need for tools to support their development becomes evident. The use of virtual platforms to enable design space exploration has shown to be an important procedure for accelerating the design of new hardware components, allowing early architectural exploration and verication. Low power consumption is a key feature in hardware development, not only for embedded systems, extending battery lifetime, but also for hardware in general, reducing heat dissipation. The development of energy-aware systems is a hard task that can be assisted by virtual platforms, since they make it possible to trace the behavior of interconnected hardware components, allowing performance and power estimation. Many approaches have been proposed for power estimation on single core applications, but only a few options are available for the multi-core domain. Some simulation platforms [1], [2] with power analysis support were published and are in use, but a need for more alternatives and resources, satisfying a wider range of testing possibilities, still exists. Virtual platforms may be implemented using different abstraction levels. Cycle-accuracy provides precise simulations with highly trustable results, but this efciency comes at cost of complexity and, consequently, increased simulation time. This characteristic imposes hard performance limitations, sometimes making the execution of real-world applications unfeasible. The use of higher abstraction levels, such as functional simulators, reduces the simulation complexity, improving the platforms time efciency. However, since many
978-1-4577-0660-8/11/$26.00 c 2011 IEEE

hardware details are not taken into account, their results might be less precise if compared to those generated with a cycleaccurate platform. This paper is focused on how to improve the speed of a cycle-accurate platform by including a functional simulator while maintaining the accuracy. We integrated functional simulators generated with ArchC [3] into the MPARM [1] platform, turning it into a faster hybrid platform. The contributions of this work are threefold. First, we introduce a new simulation resource into the MPARM platform to improve its speed up to 2.9 times. Second, we present a detailed implementation description of a hybrid simulation platform, showing how the abstraction compatibility problems were xed. Finally, a description of how we managed to statistically x the precision loss introduced by the functional simulator is showed. This paper is organized as follows: Section 2 describes related works. Section 3 details the implementation of the platform, describing the interface of the functional simulator with the MPARM platform, techniques used to improve precision, and the verication process. Section 4 describes the experimental results, showing the obtained speedups and describing how power estimations were statistically xed. Section 5 presents our conclusions. II. R ELATED W ORK MPARM [1] is a complete platform for Multi-processor Systems-on-Chip simulation (MPSoCs). It is written in C++ and makes use of SystemC [4] as its simulation engine. The platform includes an implementation of a cycle-accurate ARM simulator called SWARM [5], AMBA buses, hierarchic memories and synchronization mechanisms. Cycle-accurate power models for many of the simulated devices are included in MPARM platform, which makes it quite suitable for power estimation. MPARM is well known, and have been used for power analysis in MPSoCs [6], for testing Hardware and Software Transactional Memory systems [7], [8]. SimWattch [2] is a simulation tool based on Simics [9] and on Wattch [10], a power modeling extension present on SimpleScalar [11]. This tool have been designed to support microprocessor performance and power estimation, but no models are provided for other system components, such as external memories.

99

Fig. 1.

MPARM platform with ArchC generated cores

ArchC [3], [12] is an open-source SystemC-based architecture description language capable of generating fast, functional, Instruction Set Simulators (ISS). It is easy to modify an ArchC model to use its natively supported TLM [13] interfaces to communicate with external modules, providing seamless integration into virtual platforms. The ArchC ARMv5 model is a functional simulator, which means that there is no detailed pipeline simulation. This makes the simulator implementation very simple, turning it into a much faster option than the original SWARM simulator distributed with MPARM. III. P ROBLEM D ESCRIPTION Hardware research frequently requires the execution of large simulation sets, using multiple applications with varying number of congurations and hardware compositions. Simulation time is crucial in order to evaluate a specic design and, since the result of the simulation will probably require design modications and further simulation, it becomes impractical to wait hours (or even days) for a single simulation to nish. Cycle-accuracy, as in MPARM, imposes hard restrictions to simulation sets, requiring the remotion of heavy-weight software or complex hardware from the workload. As a realistic example, consider the simulation of a lock-based version of the genome application, which is one of the fastest applications within the STAMP [14] benchmark. Running genome on MPARM with one core required 3:16 minutes. When the number of cores is increased to 2, 4, 8, and 16, the overall simulation time rises to 5:20, 8:47, 21:21 and 45:02 minutes, respectively. Simulating larger STAMP applications, such as Yada, takes around 31 hours with the 8 cores conguration. Another signicant limitation imposed by MPARM is its lack of exibility. The platform implementation tightly couples the processor with adjacent modules, making the exploration of different architectures a hard task. This lack of exibility also happens in other simulation platforms but we focus on MPARM in this paper. A. MPARM modications In order to improve performance, the cycle-accurate SWARM processor simulator was replaced by a functional ARMv5 simulator core generated with the ArchC toolset. This implementation led to a hybrid simulator, where all modules, except for the processor, are cycle-accurate. As it will be seen

ahead, power measurement was not seriously compromised and could be statistically estimated. We choose to modify the processor simulator because we are interested in platforms with a varying number of cores (from 1 up to 16) and our prole indicated that it is possible to get better results with a faster ISS (see Section III-E). In the MPARM platform, each ISS module consists of a C/C++ implementation encapsulated into a SystemC wrapper. The wrapper is an interface between the processor core and the platform, handling all communication. This wrapper is also responsible for providing control to the core to execute a new cycle, which turns it into an efcient layer to force module synchronization on the platform. Due to the loss of simulation accuracy implied by the higher abstraction level, some modications needed to be applied to the model.The original ArchC ARM model, that was designed to use internal memory, was changed to use two TLM ports and the memory loader available in the platform. The platform was modied to correctly compile and instantiate the new processors. Functions to estimate core power and generate simulation reports were also created. Finally, some modications were applied to the core signal data structure existent on the platform, making it compatible with SystemC 2.1. The implementation required 460 lines of code for the TLM interfaces and the SystemC wrapper. Around 200 code lines were written in the original platform to correctly support the ArchC processor. The code can be easily reused to integrate any other ArchC processor model into MPARM, allowing the exploration of many architectures not yet supported by MPARM. This new feature can turn the platform, that was originally designed to evaluate embedded systems, into a more embracing one. B. Model integration Models written with ArchC language may use internal memory or TLM interfaces to allow processor communication with external modules. When using TLM interfaces, every memory operation executed by the processor is forwarded to the TLM ports through a TLM packet. The TLM interface implements the SystemC signaling communication with external modules. Being a centralized communication channel, creating memory hierarchy only required plugging the cache memories to the TLM interface code, correctly consulting and updating it on every operation. The cache memories implementation was the same originally used on MPARM. To support split data and instruction caches, two TLM ports were created in the ArchC processor model. One port is exclusively used for instruction fetching and the other for any other memory access. Each TLM port is connected to its own cache, which is consequently turned into a data or instruction cache. A diagram with detailed TLM interface implementation can be seen on Figure 1. MPARM does not support TLM by default. For this reason, once the packets reach the ArchC TLM interface, they are translated to an MPARM core signal, and a memory operation request is made to the bus master. These memory operations will block the processor until a ready signal is received back

100

from the bus master. The MPARM platform also uses certain memory addresses to create a communication channel between the application simulated and the simulator itself. This is useful for allowing the use of the simulation support API, which provides calls for functionalities such as enabling and disabling power measurement or printing output messages. The original AMBA bus was not modied, keeping its original cycle accuracy. When performing memory operations, this characteristic makes the processors wait the same number of cycles as cycle-accurate ones would. Being a blocking function, the TLM interface forces synchronization between the processor and other modules in the platform. To provide correct power estimation, calls to the measurement API were encapsulated into the TLM interface. Since the power estimation models were developed focusing a cycleaccurate simulation, another strategy needed to be applied. The SWARM processor is modeled as a state machine with eight different states. Each of these states describes an operation mode of the processor, and its transitions are dened by the internal ow of events such as cache and memory operations. On every simulated cycle, the processor state is stored by measurement calls. At the end of the simulation, the number of cycles on each state is used to estimate power. Since the ArchC simulator is functional, this power estimation mechanism could not be directly applied. But, instead of using a xed state ow or value for each instruction, we placed the measurement calls on the TLM interface. Within the TLM interface we can easily trace all memory accesses, cache updates and number of wait cycles, which allows us to dynamically reproduce the processors state ow for each instruction. This approach led to a reduced precision loss in our power estimation model, making it very similar to the original one, as we show in Section IV. Figure 2 illustrates the measurement API calls on the TLM interface.

Fig. 2. TLM flowgraph

C. ArchC ARMv5 model modifications

For the sake of speed, ArchC simulators use an instruction decode cache to avoid decoding the same instruction twice. Once an instruction is decoded, a data structure with the results is stored, using its address as an index. If the instruction is needed again, it is retrieved directly from the decode cache. Although essential for high performance, this mechanism complicates statistics collection: for an instruction that has already been decoded, no memory access is performed, leading to wrong memory and cache measurements. To fix this behavior, we modified the original decode cache code to perform a dummy memory access to the corresponding address in case of a cache hit. This solution forces the core to issue the same memory accesses as the original, while still avoiding the cost of decoding the same instruction twice.

In a first comparison between the cycle-accurate and the functional simulations, a large difference in the number of memory read operations was noticed. This difference is due to the lack of a pipeline in the functional simulation. With a pipeline, the execution of a branch may flush the instructions in the first and second pipeline stages; in the functional simulation, these two instructions would never be fetched from memory. This inconsistency is imposed by the different abstraction levels between the processors and the rest of the platform. To cope with this problem, a branch detection mechanism was implemented in the ArchC simulator. If a branch is taken, this mechanism generates two dummy memory reads to the next addresses of the not-taken program flow, corresponding to the instructions that would have entered the pipeline but been discarded in the cycle-accurate simulation. This mechanism drastically improved the precision of the functional simulation, allowing a much more reliable power measurement.

D. Platform Verification Process

The new platform was tested using the STAMP benchmark [14], which is a set of applications targeting the evaluation of Software Transactional Memory (STM) systems. This benchmark proved to be very suitable for our purposes, since its inherent concurrency helps to correctly exercise the platform's communication through the TLM interfaces. The wide variety of algorithms in the STAMP applications provides a good robustness test, evaluating different aspects of the platform, such as varying bus contention, critical section lengths and shared memory area sizes. The platform verification comprised three main stages. First, different applications were executed and their outputs were compared for correctness. The tests started with micro-benchmark applications and were concluded after executing all applications in STAMP. At this point, all tests were executed with a single core instantiated on the platform. The second stage consisted in comparing memory traces generated by the new platform with traces generated by the original one. This test allowed the verification of correct memory operations and address translation. After comparing the traces, the output reports were also checked, showing that both platforms had the same number of memory accesses for each memory area while executing the same benchmark. Once again, only one core was instantiated on the platform. The third validation stage consisted in executing the first and second stages with multiple cores (2, 4 and 8). It is worth

mentioning that some of the STAMP programs run more than a billion instructions on each processor. Once the expected output for the application was found, memory traces were compared. Due to the different simulation abstraction levels, the memory traces were not exactly the same for both platforms. A big difference in the number of memory reads to the processors' private memory area was noticed in the tests. By using the branch detection mechanism mentioned in Section III-C, this difference was drastically reduced, showing that it was an effect of the pipeline absence in the new abstraction, as described. In multi-core simulations, other memory areas also showed some differences, but none was significant. The nature of such differences will be described in Section III-E. An illustrative comparison between 4 applications can be seen in Table I. Values denoted with a + sign mean that the simulation made with ArchC executed a larger number of memory operations. Values describing multi-core simulations refer to the number of operations executed by the first core.
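As a minimal sketch of the branch-detection fix described above, the following plain C++ fragment shows one way a functional ISS loop could compensate for the missing pipeline: after a taken branch, two dummy reads are issued to the next two sequential addresses of the not-taken path. The MemoryPort and Cpu types and the fixed 4-byte instruction width are assumptions for illustration, not the real ArchC interfaces.

```cpp
// Hedged sketch of the branch-detection mechanism; MemoryPort and Cpu are
// placeholders, not the actual ArchC-generated simulator classes.
#include <cstdint>
#include <cstdio>
#include <vector>

struct MemoryPort {
    std::vector<uint32_t> reads;                  // trace of read addresses
    void dummy_read(uint32_t addr) { reads.push_back(addr); }
};

struct Cpu {
    MemoryPort* port = nullptr;

    // Called once per executed instruction by the functional simulator.
    void after_execute(uint32_t instr_addr, bool branch_taken) {
        if (branch_taken) {
            // The cycle-accurate core would have fetched (and then flushed)
            // the two instructions on the not-taken path; reproduce those reads.
            port->dummy_read(instr_addr + 4);
            port->dummy_read(instr_addr + 8);
        }
    }
};

int main() {
    MemoryPort port;
    Cpu cpu;
    cpu.port = &port;
    cpu.after_execute(0x5000, false);   // ordinary instruction: nothing extra
    cpu.after_execute(0x5004, true);    // taken branch: two compensating reads
    for (uint32_t a : port.reads) std::printf("dummy read @ 0x%x\n", (unsigned)a);
}
```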
TABLE I. MEMORY OPERATION DIFFERENCE

Operation      genome 1 core   genome 8 cores   Intruder 1 core   Intruder 8 cores
Private Rd     0.33%           9.16%            0.05%             14.04%
Private Wr     0%              0.2%             0%                +1.44%
Shared Rd      0%              0.08%            0%                2.89%
Shared Wr      0%              1.08%            0%                +1.54%
Semaphore Rd   0%              3.12%            0%                2.47%
Semaphore Wr   0%              0.49%            0%                +1.06%

E. Platform Profiling

After reaching the expected behavior with the STAMP applications, the original and the modified platforms were profiled in order to enable a better understanding of the differences imposed by changing the abstraction. Using the gprof tool, the execution of both platforms was profiled while running the application genome with 1, 2, 4 and 8 cores. The profiling allowed the verification of the time spent running each of the platform's modules. The results showed that the differences in the time spent by each of the modules were not restricted to the ISS code, indicating that the effects of changing its abstraction propagated to other parts of the platform, such as buses and memories. A deeper observation of the bus behavior revealed different contention on each platform, showing that the processors' abstraction had an influence on bus operations. In fact, the way each abstraction emits its instructions and its memory accesses is not the same. Only one cycle is required to run an instruction that does not require memory accesses on the functional simulator, whereas, even with an empty pipeline, at least three cycles would be required to run the same instruction on a cycle-accurate pipeline. Assuming the block being executed is already in the cache, if each operation needs to perform a memory access, the functional simulator emits one memory access per cycle. In a similar situation, the cycle-accurate simulator emits a number of memory accesses that is directly dependent on the number of stages in its pipeline, and it would never be equal to one memory access per cycle.

Since the functional simulator requires fewer cycles to execute an instruction, it also executes a code block in a smaller interval. This fact led to differences in the number of cycles spent running critical sections, which also had a major effect on the number of cycles and memory operations performed by other processors on the platform while waiting for the lock acquisition. Consequently, the bus contention of both platforms is not the same. Considering the effects of the new abstraction on critical sections, and the fact that a single difference here may influence the whole program flow, not only are the different numbers of memory accesses explained, but also the varying times spent on modules other than the processor on each platform. Table II summarizes the profiling data. It presents the time spent by each processor implementation, in seconds, and the percentage of the whole execution time spent on the processors. Since the abstraction level influences the whole platform, it is not possible to make an absolute efficiency comparison between the two processors. However, the values show that, while using the ArchC simulator, the percentage of time spent on the processor in relation to the overall simulation time was reduced by at least 2.5x, reaching a reduction of 4x in an 8-core configuration. After these tests were complete, a series of experiments was performed to assess the performance and power estimation capabilities of our implementation. These experiments are discussed in the next section.

TABLE II. PROCESSOR PROFILING SUMMARY

Cores   ArchC (% of sim. time)   SWARM (% of sim. time)   ArchC (s)   SWARM (s)
1       6.87%                    17.25%                   5.15        25.15
2       7.07%                    20.29%                   8.72        51.83
4       7.61%                    23.18%                   16.9        97.1
8       5.55%                    22.59%                   32.07       180.29

IV. EXPERIMENTAL RESULTS

The STAMP benchmark with lock-based synchronization was used to evaluate both performance and power estimation. The test consisted in executing all 8 applications available in the benchmark with 1, 2, 4, and 8 cores. Some of them were executed with more than one configuration, totaling 13 application variants. A sequential version of each variant was also executed. The whole test consisted of 65 simulations, which were executed on 2.4 GHz machines with 4 GB of RAM running Ubuntu Linux with kernel 2.6.9. Due to restrictions imposed by the source code of the platform, all the code was compiled with GCC 3.4 using the -O3 optimization level.

A. Performance Assessment

The result of each simulation was compared with a similar simulation on the original MPARM platform. A performance comparison can be seen in Figures 3 and 4. The speedup is shown as gray bars in the figures. The nomenclature for each simulation follows the one presented in the original STAMP paper [14]. As can be seen, our implementation reached at least a 1.8x speedup for each simulation, with a maximum speedup of 2.9x. The average speedup was 2.1x. The absence of a pipeline reduced the overall number of simulated cycles for each execution. The number of cycles, normalized to the values obtained with the original platform, is identified by a black line in Figures 3 and 4. As shown, 70% of the original cycles were simulated with the sequential implementation. This value also stands for lock simulations with 1 core, but it increases as more cores are added to the platform. This is an effect of the bigger bus contention generated by the addition of new cores to the platform. Since the bus is cycle-accurate, memory operations keep the cores blocked for a number of cycles equivalent to the cycle-accurate simulation, which makes the number of cycles increase towards the one obtained with the original platform. Simulating fewer cycles was not the only reason for the speedup. The functional ArchC simulator is simpler, and thus faster, than SWARM. On average, the ArchC model simulated 1.9x more cycles per second, reaching a maximum of 2.4x. A comparison of simulated cycles per second can be seen in Table III. Serially running these 65 simulations would take 187 hours and 30 minutes on the original MPARM with SWARM. The same batch of simulations was completed in 98 hours and 19 minutes on the new MPARM implementation using ArchC functional cores.

Fig. 3. Lock-based speedup / Simulated cycles

Fig. 4. Sequential speedup / Simulated cycles

TABLE III. NUMBER OF KCYCLES SIMULATED PER SECOND (ArchC / SWARM)

Application     Sequential   1 core      2 cores     4 cores     8 cores
kmeans-low      270 / 150    276 / 209   406 / 196   546 / 345   870 / 515
kmeans-high     275 / 203    279 / 216   407 / 282   587 / 360   938 / 359
yada            297 / 214    295 / 142   409 / 308   465 / 375   727 / 540
bayes           318 / 150    312 / 227   487 / 322   657 / 396   991 / 584
intruder        351 / 161    341 / 159   480 / 216   603 / 262   889 / 556
intruder+       345 / 160    350 / 160   488 / 325   590 / 261   894 / 562
labyrinth       319 / 236    317 / 149   461 / 204   575 / 243   892 / 356
labyrinth+      318 / 233    316 / 223   465 / 204   575 / 357   880 / 530
vacation-low    352 / 260    341 / 163   468 / 223   588 / 268   906 / 391
vacation-high   340 / 261    347 / 260   479 / 333   613 / 427   880 / 415
ssca2           326 / 254    339 / 244   502 / 324   710 / 408   1063 / 623
genome          324 / 248    322 / 232   519 / 337   711 / 431   1083 / 605
genome+         334 / 255    330 / 157   520 / 355   709 / 428   1075 / 464
average         320 / 214    320 / 195   468 / 279   609 / 350   929 / 500

B. Energy and Power estimation

In MPARM, the total energy estimation is calculated based on the stored states of the processor, as described in Section III-B. At each cycle, the processor state is stored and, at the end, it is applied to the energy model present in the platform. As expected, the use of a higher abstraction introduced imprecision in the energy estimation. A scatter plot showing the error in energy estimation obtained from the new platform can be seen in Figure 5, where the dots represent the obtained results, and the line the value obtained with the original platform, used as the reference for correctness.

Fig. 5. Total energy measurement error

In the results presented in Figure 5, the applications Bayes, Labyrinth, Labyrinth+, and Yada showed a more significant error when executed with 4 cores. Since these applications have long critical sections, they are more susceptible to the new abstraction's effects on bus contention. The upper limit of bus contention can be understood as all processors on the platform waiting for bus operations. For the mentioned applications, the 4-core simulation reached maximum contention in

the hybrid platform, but not in the original one, resulting in the observed error. The 8-core simulations reached maximum bus contention, in both platforms, during almost the whole runtime and, for this reason, the observed error was not significant. The power estimation mechanism in MPARM uses the number of simulated cycles to estimate the consumption of each core. As the hybrid platform implementation trades off cycle precision for performance, an error margin was introduced into the power calculations due to the difference in the number of cycles. Raw results obtained with the modified platform had an average power estimation error of 21.2% with a 2.7% standard deviation (SD). In order to improve these results, a model based on regression analysis was built using the least squares method. To build the model, values measured on the original platform were used as expected values. The coefficients reached, when applied to the results obtained with the hybrid platform, minimize the percentage of error. Since this first model was built from the results obtained with the whole simulation set, it was named the general model. After applying the general model to the results obtained with the hybrid platform, the average error was reduced to 14.45%, with a SD of 4.3%.
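The correction is an ordinary least-squares linear fit. The sketch below shows the closed-form fit of y = a*x + b, where x is the raw power estimate from the hybrid simulation and y the reference measured on the original platform; the sample values are made up for illustration and are not data from the paper.

```cpp
// Minimal sketch of the regression-based correction model (assumed workflow,
// illustrative numbers only).
#include <cstdio>
#include <vector>

struct LinearModel {
    double a, b;
    double apply(double x) const { return a * x + b; }
};

// Ordinary least squares fit of y = a*x + b.
LinearModel fit_least_squares(const std::vector<double>& x, const std::vector<double>& y) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double b = (sy - a * sx) / n;
    return {a, b};
}

int main() {
    // raw hybrid estimates vs. reference measurements (illustrative only)
    std::vector<double> raw = {100, 150, 200, 260, 310};
    std::vector<double> ref = { 78, 118, 158, 205, 245};
    LinearModel m = fit_least_squares(raw, ref);
    std::printf("model: %.3f*x + %.3f, corrected(200) = %.1f\n", m.a, m.b, m.apply(200));
}
```

Fitting one such model per core count, as done below, simply repeats this procedure on the subset of simulations with that number of cores.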
As shown in Figure 3 and explained in Section IV-A, the normalized number of simulated cycles is not the same for simulations with different numbers of cores. As this value is an important parameter for the power estimation, using the general model described above turns out not to be the most appropriate choice. In order to achieve a higher accuracy level, we calculated new models for each number of cores. The core-specific coefficients obtained through the linear regression can be seen in Table IV, where x stands for the raw power estimation value obtained with the hybrid simulation. By using the core-specific models, the error margin was reduced to an average of 3.25% with a SD of 2.25%. A scatter plot showing the error of the final results can be seen in Figure 6, where dots represent the results obtained after applying the estimation models and the diagonal line represents the results obtained with the original platform.

TABLE IV. LINEAR REGRESSION COEFFICIENTS

Cores           Model
general model   0.9x − 3
1               0.71x + 0.57
2               0.8x
4               0.75x + 7.5
8               0.8x + 24

Fig. 6. Total power measurement error

C. Regression Model validation

In order to correctly assess our models, a small test set, composed of STAMP applications executed with different inputs, was used. The general and the core-specific coefficients were applied to the results obtained after a hybrid simulation. The applications that composed the test set were Bayes, Intruder, and Intruder+ with a different random seed, and Labyrinth and Labyrinth+ with different mazes.


Fig. 7. (a) Power measurement error; (b) Power measurement error, general model applied; (c) Power measurement error, specific model applied

By using the general model, the average error was reduced from 22.21% to 18.74%. The specific model reduced the average error to 5.85%. Despite showing generally worse efficiency than the core-specific model, the general model was more precise when applied to the 8-core simulation. This behavior is due to the larger data set employed in the construction of the general model and the similarity in the number of cycles executed in the simulations. Plots with the errors originally obtained and with the errors after applying the models are presented in Figures 7(a), 7(b) and 7(c).

V. CONCLUSION

We have introduced a new simulation resource to MPARM, turning it into a hybrid simulation platform with regard to the model abstraction levels. By replacing the cycle-accurate processor with a functional one we have significantly increased performance, as a consequence of simulating more cycles per second and of reducing the number of overall simulated cycles. We have also introduced techniques to reduce the loss of precision in cycle/power estimates imposed by a higher abstraction level. The lack of precision introduced by the abstraction modification was discussed, highlighting the effects of bus contention while running simulations with a bigger number of cores. Finally, we have suggested the use of regression analysis to improve power estimation results, defining a different correction model for each number of cores, due to the variation in bus contention effects in each case. By reaching an average error of 3.26% we showed that our hybrid platform, in spite of reaching speedups of up to 2.9 times compared to the original MPARM, is able to generate power estimations with a very similar level of confidence in the results.

VI. ACKNOWLEDGEMENT

This work was partially supported by grants from FAPESP (2009/04707-6, 2009/08239-7, 2009/14681-4), CNPq, and

CAPES.

REFERENCES
[1] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, MPARM: Exploring the multi-processor SoC design space with systemc, J. VLSI Signal Process. Syst., vol. 41, no. 2, pp. 169182, 2005. [2] J. Chen, M. Dubois, and P. Stenstrom, Integrating complete-system and user-level performance/power simulators: the simwattch approach, in ISPASS 03: Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, 2003, pp. 110. [3] S. Rigo, G. Araujo, M. Bartholomeu, and R. Azevedo, ArchC: a systemc-based architecture description language, Computer Architecture and High Performance Computing, pp. 6673, October 2004. [4] D. C. Black and J. Donovan, SystemC: from the ground up, 2004. [5] M. Dales, Swarm 0.44 documentation, February 2003, www.cl.cam.ac.uk/ mwd24/phd/swarm.html. [6] M. Loghi, M. Poncino, and L. Benini, Cycle-accurate power analysis for multiprocessor systems-on-a-chip, in GLSVLSI 04: Proceedings of the 14th ACM Great Lakes symposium on VLSI, 2004, pp. 410406. [7] C. Ferri, T. Moreshet, R. I. Bahar, L. Benini, and M. Herlihy, A hardware/software framework for supporting transactional memory in a mpsoc environment, SIGARCH Comput. Archit. News, vol. 35, no. 1, pp. 4754, 2007. [8] A. Baldassin, F. Klein, G. Araujo, R. Azevedo, and P. Centoducatte, Characterizing the energy consumption of software transactional memory, Computer Architecture Letters, vol. 8, no. 2, pp. 5659, Feb. 2009. [9] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, Simics: A full system simulation platform, Computer, vol. 35, pp. 5058, 2002. [10] D. M. Brooks, P. Bose, S. E. Schuster, H. Jacobson, P. N. Kudva, A. Buyuktosunoglu, J.-D. Wellman, V. Zyuban, M. Gupta, and P. W. Cook, Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors, IEEE Micro, vol. 20, pp. 2644, 2000. [11] T. Austin, E. Larson, and D. Ernst, Simplescalar: An infrastructure for computer system modeling, Computer, vol. 35, no. 2, pp. 5967, 2002. [12] R. Azevedo, S. Rigo, M. Bartholomeu, G. Araujo, C. Araujo, and E. Barros, The ArchC architecture description language and tools, Int. J. Parallel Program., vol. 33, no. 5, pp. 453484, 2005. [13] F. Ghenassia, Transaction-Level Modeling with Systemc: Tlm Concepts and Applications for Embedded Systems, 2006. [14] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Olukotun, STAMP: Stanford transactional applications for multi-processing, in IISWC 08: Proceedings of The IEEE International Symposium on Workload Characterization, September 2008.


A non intrusive simulation-based trace system to analyse Multiprocessor Systems-on-Chip software


Damien Hedde, Frédéric Pétrot
TIMA Laboratory CNRS/Grenoble INP/UJF, Grenoble, France
{Damien.Hedde, Frederic.Petrot}@imag.fr

Abstract: Multiprocessor Systems-on-Chip (MPSoC) are scaling in complexity. Most parts of MPSoCs are affected by this evolution: number of processors, memory hierarchy, interconnect systems, and so on. Due to this increase in complexity and the debugging and monitoring difficulties it implies, developing software targeting these platforms is very challenging. The need for methods and tools to assist the development process of MPSoC software is mandatory. Classical debugging and profiling tools are not suited for use in the MPSoC context, because they lack adaptability and awareness of the parallelism. As virtual prototyping is today widely used in the development of MPSoC software, we advocate the use of simulation platforms for software analysis. We present a trace system that consists in tracing hardware events produced by the models of multiprocessor platform components. The component models are modified in a non-intrusive way, so that their behavior in simulation is not altered. Using these trace results allows running precise analyses, such as data race detection, targeting the software executed on the platform.

I. INTRODUCTION

The ever-increasing performance and flexibility demands for running applications on embedded platforms led to the emergence of Multiprocessor Systems-on-Chip (MPSoC) platforms a decade ago. Nowadays, these systems integrate very complex memory subsystems and interconnects. In its 2009 edition [1], the ITRS (International Technology Roadmap for Semiconductors) expects Systems-on-Chip in the portable consumer segment with more than 1000 processing elements in 2020. Unfortunately, MPSoCs embed more and more elements while keeping few external debugging or monitoring capabilities. As a consequence, the observability of such SoCs is not increasing as their complexity does. By integrating elements that were previously outside the chip, it becomes almost impossible to observe their behavior and communication. Furthermore, the growing number of internal elements that need to be connected together leads to the saturation of classical interconnect systems such as buses. To replace them, scalable interconnects (i.e., Networks-on-Chip, NoCs) are used. Although they provide higher bandwidth, they do not have the same observability. To settle this general observability problem, Design for Debug (DfD) features ([2], [3], [4], [5]) are developed and integrated into SoCs. This complexity raises problems for designing an MPSoC, but also for the software running on its processors. The

software has to target the processors and make them communicate in order to execute the required application. Depending on the MPSoC architecture, the software may have to handle multiple communication mechanisms (shared memory, mailboxes, DMA) and target different processor types (general-purpose processors, specialized processors). Analysing and debugging programs that use several processors concurrently is not a new problem. Many techniques were developed [6] during the 1980s, particularly targeting distributed systems programs. The difficulties encountered in debugging these systems are not far from those for MPSoCs. A very important one is the difficulty of obtaining the state of the global system. There is no real problem in getting the state of each node, but getting the state of every node at the same time is nearly impossible. With the development of GALS (Globally Asynchronous, Locally Synchronous) integrated systems, it becomes clearly unfeasible. In this paper, we present a method for fine-grain analysis of software running on an MPSoC. This method uses simulation and consists in instrumenting the component models. During the simulation, these models generate events which are collected for further analysis. Our approach relies on the fact that, when simulating an entire SoC platform, we have access to everything that is going on during the simulation. The ability to collect information is indeed not limited by the classical constraints a real SoC has: limited bandwidth to the external debugging device, limited observability, and so on. Contrary to software instrumentation, information from the simulation can be collected without being intrusive: it can be obtained without changing the behavior of the simulated components. However, simulation relies on models that are often only approximate in timing. In our case this is not a problem, because we are mainly concerned with the order of memory events. The produced trace mainly contains the instructions executed by the different processors. The related memory accesses are also traced up to the memory by every component relaying them. These memory accesses are used to recover the inter-processor instruction dependencies, allowing the detailed, low-level synchronization mechanisms between the different processors to be analysed. The remainder of this article is organized as follows. Related work is reviewed in Section II. Section III describes the


trace mechanisms and the analysis method. Section IV presents experimentations and results. We then conclude in Section V.

II. RELATED WORK

In recent years, many methods have been proposed to help debugging and analysing programs running on MPSoCs. Due to the increasing SoC parallelism, an important effort is made on communication monitoring; parallel program errors indeed come from interaction errors between the processing elements, and interactions are mainly done through memory. Several solutions have been proposed to integrate features in chips ([5], [7]) allowing debug at run-time. Following another track, work has been done to improve debugging abilities through simulation techniques. An advantage of these methods is that they are truly non-intrusive. [8] proposes a solution using a virtual machine that exposes the data structures of the Operating System (OS) running in the virtual machine to an external debugger. In [9], an Instruction Set Simulator (ISS) API is proposed for the simulation of processors in MPSoCs. This ISS has an interface allowing the addition of instrumentation tools independent of the simulated processor. An implementation of a GDB server using this instrumentation ability has been made, allowing control of the set of processors of the MPSoC. Although it allows controlling and monitoring every processor, it does not address the communication mechanisms. In [10], a solution is proposed for verifying the shared memory mechanism. The method consists in recording memory operations and their order, and then checking whether there was a bug or not. Non-intrusive instrumentation in simulators already exists: in eSimu [11], a very low-level trace is generated by a cycle-accurate simulator. This trace is used for energy profiling. It contains instructions with cache penalty information, peripheral state changes and data evolution through fifos, in order to link energy consumption to the instruction which generates it. For example, sending data through a wireless device costs a lot of energy but is issued a long time after the related instruction, so it is necessary to keep track of the data. This solution does not target platforms that embed multiple processors and so lacks the information needed to recover the memory access order in multiprocessor platforms. It is also focused only on profiling. In his thesis, D. Kranzlmueller [12] studies the modelling of concurrent programs using events, with a debugging goal. His work mainly targets distributed programs at the application level and not the whole software stack. He records and analyses the main events of a concurrent program (mostly communication events). The event trace is intrusive, as it is generated through software instrumentation. Event relations and orders are analyzed in order to detect erroneous behaviors. Some other works, such as [13], focus on MPSoC software monitoring by using a specific programming model with integrated observation abilities. In this last work, a component-based approach is used where the components have observation interfaces.

III. CONCURRENT PROGRAM ANALYSIS

There are several kinds of analyses which can be performed on a trace: verification of a cache coherency protocol, verification of a memory consistency model, detection of data races, and so on. These analyses require building order relations between the memory access instructions. Part III-A below explains how to build this instruction trace for a multiprocessor platform using a sequential consistency memory model [14]. The analysis needs to build the software threads (which may have migrated between several processors in SMP architectures) and identify synchronizations between the threads to highlight erroneous behaviours. We explain how to proceed in part III-B, focusing on data race detection.

A. Recovering instruction scheduling

In order to analyse a concurrent program running on a multiprocessor shared-memory platform, we need to sort the concurrent program instructions that access the same memory. But the date of a memory access may be very different from the date of the instruction that generates the access. Indeed, due to communication interconnects and cache or buffer components, the date of an access might be significantly shifted from the instruction date. This section describes the method we use to associate the right date of the access to each instruction accessing memory, using information provided by the simulation. This method is done in two steps and has no impact on the program executed by the simulated platform.

1) Tracing hardware events

This first step takes place during the simulation of a multiprocessor platform executing a concurrent program. It consists in tracing all events related to the program execution and memory accesses. Several kinds of platform components are involved in this operation: processors, caches and memories. Traced events are stored to allow further use. Each involved component traces events depending on the operations it is performing. Traced events are: processor instructions, processor requests, cache acknowledgements, cache requests and memory acknowledgements. Requests and acknowledgements represent memory accesses. The initiator of a memory access generates a request event and the target generates an acknowledgement event. An acknowledgement matches the action of the target on the memory array (mainly a read from or a write to it). Events contain some type-related data. A request event contains the address, width and type of the access. Examples of types are: load, store, linked load, exclusive (for write-back policy caches), etc. A processor request event also contains the data read or written by the access. An instruction event contains the instruction address and processor state changes. A memory acknowledgement event contains only the date of the target's action.

Figure 1. Event dependency examples for a 2-processor platform with write-through policy caches and write buffers. An arrow denotes a dependency between two events: the event on the circle side contains the identifier of the event on the triangle side.

Because dates are only used to sort acknowledgements, they can be logical dates, not related to the simulated time. Events are generated by several components, and we need to link some of them together (for example, an acknowledgement with the corresponding request). In order to do that, each event is tagged with a unique identifier. The identifier of an event can then be used by another event to indicate a relation between the two events. This method is used in three cases:

- An acknowledgement event (issued by a cache or a memory) contains the identifier of the request event that led to this acknowledgement.
- A cache request contains the identifier of a processor request if the cache relays a request to the memory (for example, when a processor load triggers the load of a whole cache line).
- A processor request event contains the identifier of the instruction that generated the memory access.

Figure 1 shows some typical examples of event dependencies. In this figure, processor 0 first performs a load, which leads the cache to load a line from the memory. Processor 0 then performs a second load in the same line, which is handled by the cache. Processor 1 first performs two writes, which are gathered by the cache write buffer, and then performs an uncached read, which is handled by the memory, skipping the cache. As long as events are not linked by identifiers, generating them is not a problem: processor, cache and memory components have to be modified to generate the events, but not their communication channels. However, as an event may need the identifier of another event, the communication channel between the two components that generate these events must be modified. The channel indeed has to transport the identifier of an event along with the standard communication. Due to the direction of the links (an acknowledgement needs the identifier of the request event, but not the opposite), the response channel does not need to be modified; only the request channel is modified.

2) Associating dates to memory-related instructions

Figure 2. Example of date associations to processor requests for a processor behind a write-back policy cache

The second step consists in building, for each processor, the thread of executed instructions with proper dates associated to memory access instructions. These dates are later used to sort the instructions from different processors so as to match the memory access order. An instruction that generates a memory access must be tagged with the date of the acknowledgement at the memory. Note that the way write accesses are handled depends on the cache policy. Write-through caches do not generate acknowledgement events on write accesses, but relay them to the memory, as shown in Figure 1. On the contrary, write-back caches can acknowledge write accesses when the corresponding line state allows it. When a write-through cache uses a write buffer, still no write access acknowledgement is generated by the cache, but several identifiers (one for each initial request) are indicated in the resulting request event (see Figure 1). In order to associate a date to an instruction that generates an access, the relations between events must be followed from the memory acknowledgement down to the instructions. But, due to the presence of caches, each processor request is not directly linked to a memory acknowledgement. In the case of a read cache hit, a cache acknowledges a request, but the real read access up to the memory may have been done long before by the cache. A similar issue is raised for a write access with a write-back cache policy (but the real access happens after, not before). It would be possible to assign the true date (which could be in the past or in the future) of the memory access to every processor request, but that is not what we need. It would not lead to a satisfying result, because the dates of memory accesses would not be in increasing order inside a processor's instruction sequence. Without this last property, the analysis of threads would be complicated: for example, it would be impossible to find which is the next access to a memory location without checking the whole instruction sequence of each processor.
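As a sketch of the tagged-event idea described above, the following plain C++ fragment shows events that carry a unique identifier and a list of identifiers they depend on, so that an acknowledgement can be walked back to its request and, from there, to the instruction. The type names (Event, EventFactory) and field layout are illustrative assumptions, not the real SoCLib or trace-system data structures.

```cpp
// Hedged sketch of identifier-tagged events; all names are illustrative.
#include <cstdint>
#include <vector>

enum class EventKind { Instruction, ProcRequest, CacheRequest, Acknowledgement };
enum class AccessType { Load, Store, LinkedLoad, Exclusive };

struct Event {
    uint64_t id;                       // unique identifier
    EventKind kind;
    uint64_t date;                     // logical date (meaningful for acknowledgements)
    uint32_t address;
    uint8_t  width;
    AccessType type;
    std::vector<uint64_t> depends_on;  // identifiers of the events this one refers to
};

struct EventFactory {
    uint64_t next_id = 1;
    Event instruction(uint32_t pc) {
        return Event{next_id++, EventKind::Instruction, 0, pc, 0, AccessType::Load, {}};
    }
    Event proc_request(uint32_t addr, AccessType t, uint64_t instr_id) {
        // the request remembers the instruction that generated it
        return Event{next_id++, EventKind::ProcRequest, 0, addr, 4, t, {instr_id}};
    }
    Event ack(uint64_t date, uint64_t request_id) {
        // the acknowledgement remembers the request it answers
        return Event{next_id++, EventKind::Acknowledgement, date, 0, 0, AccessType::Load, {request_id}};
    }
};

int main() {
    EventFactory f;
    Event i  = f.instruction(0x5000);
    Event rq = f.proc_request(0x2004, AccessType::Load, i.id);
    Event ak = f.ack(/*date=*/42, rq.id);
    return ak.depends_on[0] == rq.id ? 0 : 1;  // the ack points back to its request
}
```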

We then have the following constraints: 1) dates within a processor's instruction sequence must be in increasing order; 2) the dates of two memory accesses to the same address from two different processors must be in the proper order: the access that takes place first must have the smaller date. Accesses can be classified into two categories: memory-acknowledged accesses and cache-acknowledged accesses. Figure 2 shows how accesses are dated. There is no difficulty in assigning a date to an access that is acknowledged by the memory: we keep the date of the acknowledgement and the constraints are met. However, for accesses that are acknowledged by a cache, a date must be computed. In order to meet the first constraint, this date must be between the dates of the previous and next memory-acknowledged accesses of the processor. The date of the corresponding memory acknowledgement cannot be used, since it would violate this constraint. For all cache-acknowledged accesses between two memory-acknowledged accesses, we use the date of the previous memory-acknowledged access. The first constraint is obviously met, although the processor order is not strict. The second constraint is also met, because the cache-acknowledged access could have been done at the date of the previous memory-acknowledged access without changing the results of the accesses at that address. There are two cases, the read case and the write case. Let us call T1 the true date of the access and T2 the date of the previous memory-acknowledged access. Sequential consistency ensures there is a total order of all memory accesses respecting every processor's access order.

If the access is a read, then it is a cache hit and T1 <= T2. If a modification of the memory cell had occurred before T2, the cache would have received an invalidation before the previous memory-acknowledged access, due to sequential consistency. So the line has not been written between T1 and T2. Some processor may have read the line, but reordering consecutive reads to the same address is not a problem. If the access is a write, then it is delayed due to the write-back cache policy and T1 > T2. So the cache has the accessed line in exclusive state and no other processor can access the line (even for a read) without the cache first writing it back to the memory. So the line has not been accessed between T2 and T1.

The following algorithm tags each processor request with a proper date. Due to the direction of the links between events (from acknowledgements to requests), it is not easy to find the acknowledgement event starting from a request event. Instead, starting from the acknowledgement event to find the requests that correspond to it raises no difficulty. This is why the algorithm is driven by the top memory hierarchy components (i.e., the memories).

Main: Consume all events of the memories in date order. Memories should only have generated acknowledgement events.
Memory: For each consumed event, identify the source component (processor or cache) with the identifier contained in the event, and consume the events of this component up to the one referred to by the identifier. The acknowledgement date is given to this source component.
Cache: For each consumed event, if there is a source identifier, consume the events of the source component (lower-level cache or processor) up to the identified event. The date given to the source component is either the previous date, or the just-received date if the event is linked to the request being acknowledged by the memory (or higher-level cache).
Processor: Each consumed request is tagged with the currently received date.
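The fragment below is a simplified sketch of the dating rule only (not the full event-graph walk driven by the memories): within one processor's request sequence, memory-acknowledged accesses keep the date of their memory acknowledgement, and every cache-acknowledged access in between inherits the date of the previous memory-acknowledged one. The ProcRequest type and the example values are assumptions for illustration.

```cpp
// Hedged sketch of the date-assignment rule for one processor's request sequence.
#include <cstdint>
#include <cstdio>
#include <vector>

struct ProcRequest {
    uint32_t address;
    bool memory_acknowledged;   // false => served by a cache
    uint64_t mem_ack_date;      // valid only when memory_acknowledged is true
    uint64_t assigned_date;     // filled in by assign_dates()
};

void assign_dates(std::vector<ProcRequest>& seq) {
    uint64_t last_mem_date = 0;              // date of the previous memory-acknowledged access
    for (ProcRequest& r : seq) {
        if (r.memory_acknowledged) last_mem_date = r.mem_ack_date;
        r.assigned_date = last_mem_date;     // non-decreasing within the sequence
    }
}

int main() {
    std::vector<ProcRequest> seq = {
        {0x2000, true, 10, 0},   // line load that reached the memory
        {0x2004, false, 0, 0},   // cache hit in the same line: inherits date 10
        {0x3008, true, 25, 0},   // next access acknowledged by the memory
    };
    assign_dates(seq);
    for (const ProcRequest& r : seq)
        std::printf("@0x%x -> date %llu\n", (unsigned)r.address,
                    (unsigned long long)r.assigned_date);
}
```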

This algorithm works as long as dates are consistent across all memory components. Depending on the simulator used, such a common time base might not be available. The algorithm consumes the events of each component in the order they were generated. The intermediate storage of each component's event thread can therefore be avoided, and the components can directly feed the algorithm. This way, only processor sequences are generated.


B. Software analysis: data race detection

The previously generated processor instruction sequences contain the executed instructions, associated with processor state changes and memory access acknowledgement dates. This information allows running several analyses on the executed concurrent software.

1) Building software structure: threads

Processor sequences contain information related to the hardware part of the execution: processor state changes and dates of memory accesses. These dates guarantee a proper interleaving of memory instructions. However, the sequences lack information on the software part. The following analysis method needs, in addition to the processor instruction sequences, the information called symbols, which allows matching, for example, instruction addresses to functions. A prerequisite of the analysis is that the symbols are known. The Application Binary Interface (ABI) is also needed. Processor sequences are then analysed to generate the software structure. This operation is done with respect to the memory access order and may be compared to a kind of replay of the execution. The ABI allows detecting function calls and returns if the symbols of these functions are known. Additional data, such as DWARF debugging information, contains the parameter locations for function symbols and allows identifying the parameters of function calls. Some analyses rely on the identification of specific function calls or returns. At the start, a thread is associated with each processor and the call graphs are built. This leads to correct results as long as there is no Operating System (OS) doing thread scheduling. If there is an OS, thread creation and scheduling functions have to be detected in order to create new threads and change the thread associated with a processor.

2) Adding synchronization points

Detection of any function is possible if its symbol information is known. In order to apply this system to detect data races, the synchronization mechanisms of the software threads first need to be detected. Without taking synchronization constraints into account, the whole threads would be considered concurrent. A synchronization point is a point in a software thread that has some link with at least one other point in a second software thread. A link tells that a synchronization point occurs either after or before the other. At the lowest level, atomic memory accesses (load linked, store conditional, test-and-set, compare-and-swap, etc.) may be considered as synchronization points, but some synchronization can be done without using these accesses. Considering every memory access as a synchronization is not a good solution either, as most of them do not serve the realization of synchronization. From a higher-level point of view, synchronizations are done through software functions. Synchronization points are then

created when the execution of such a function is detected. Useless synchronization points generate additional constraints which might indeed mask data races. To avoid this, synchronization points must be set only for the highest-level synchronization functions, as they may be constructed from several lower-level synchronization ones.

3) Checking data races

A data race corresponds to a case where multiple threads access the same memory location concurrently and the result of the accesses cannot be decided without knowing in which order the accesses take place. Two cases exist: write-read and write-write. These two cases can be reduced to a single one: a data race occurs if one thread performs a write access to a memory location and another thread performs an access (read or write) to the same memory location. Figure 3 shows some software threads with their synchronization points. Parts of threads between two synchronization points are called segments in the following. To find all data races, every pair of threads must be analyzed.
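The sketch below illustrates the pairwise check in a simplified form: segments are modelled by the interval of synchronization "ticks" they span, two segments are treated as concurrent when neither interval is ordered entirely before the other, and a race is flagged when concurrent segments touch the same address with at least one write. This is an assumption-laden simplification of the full happens-before graph described in this section, with all names and values invented for the example.

```cpp
// Hedged sketch of the segment-based data race check (simplified ordering model).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Access { uint32_t addr; bool is_write; };
struct Segment {
    int starts_after;   // synchronization point that ends before this segment
    int ends_before;    // synchronization point that starts after this segment
    std::vector<Access> accesses;
};

bool concurrent(const Segment& a, const Segment& b) {
    // a is fully before b when a ends no later than b starts (and symmetrically)
    return !(a.ends_before <= b.starts_after) && !(b.ends_before <= a.starts_after);
}

void report_races(const std::vector<Segment>& t0, const std::vector<Segment>& t1) {
    for (const Segment& s0 : t0)
        for (const Segment& s1 : t1) {
            if (!concurrent(s0, s1)) continue;
            for (const Access& a0 : s0.accesses)
                for (const Access& a1 : s1.accesses)
                    if (a0.addr == a1.addr && (a0.is_write || a1.is_write))
                        std::printf("possible race @ 0x%x\n", (unsigned)a0.addr);
        }
}

int main() {
    std::vector<Segment> t0 = {{0, 2, {{0x1000, true}}}};   // thread 0 writes 0x1000
    std::vector<Segment> t1 = {{1, 3, {{0x1000, false}}}};  // thread 1 concurrently reads it
    report_races(t0, t1);
}
```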

Figure 3. Example of threads with their synchronization point. Numbers represent synchronization points and letters represent segments. Dashed arrows represent indirect links.

In order to find data races between two threads, the whole graph including all threads must be reduced. Although only two threads are concerned, the graph cannot be reduced by simply removing every other thread and the links not involving the two analysed threads. Due to the transitivity of the links between synchronization points, some indirect links between the two studied threads may be inferred. For example, in Figure 3, point 1.3 happens before point 0.2 although there is no direct link between them. Two segments of the two threads may be concurrent if there is no constraint forcing one to be completely executed before the other starts. In Figure 3, the segments of thread 1 that are concurrent with segment 0.A of thread 0 are 1.A, 1.B and 1.C. Finally, the accesses of concurrent segments must be checked in order to find data races.

IV. IMPLEMENTATION AND RESULTS

This section presents the implementation we have done and the obtained results. We first describe the implementation architecture of the software analysis program. We then detail our experimentations and results.


A. Analysis implementation

The two steps described in Sections III-A1 and III-A2 are not implemented in the same program. The first step has been implemented in the component models used in the simulation. Events are stored into separate files (one per component); the storage is done in parallel with the simulation, in a separate thread, to limit the overhead. The second step is done in a separate program and generates a dated instruction sequence per processor. The software thread analysis program is organized as follows. It takes as input the previously generated processor instruction sequences and the software binary image executed by the simulated platform. The software image is used to obtain the software symbols. The main program core consumes instructions following the date order. When an instruction is consumed, it is put into the current software thread of its processor. Using the software symbols and the ABI, the software thread histories (call graphs) are built. Hook functions can be registered by the user at the entry or end of function symbols to perform some tasks (for example, changing the current thread of a processor, or adding synchronization points). A hook function is executed by the main core when the given symbol is detected. The program stores the software thread histories and the synchronization graph into files. Thread histories can then be used to study the software behavior. A data race program has been implemented to detect data races between a pair of threads, using their two thread histories and the synchronization graph.

B. Experimentation detail

We implemented the first step (Section III-A1) using the SoCLib framework, which consists of a library of SystemC components for MPSoC simulation. The SoCLib [15] library provides components such as processors, caches, memories and interconnects, and allows building platforms at the system level.

We modified the CABA (Cycle Accurate, Bit Accurate) implementation of the components to build a platform containing several MIPS32 ISA processors with write-through caches and write buffers. Memory and caches communicate through an abstract network. The memory is kept coherent between the caches through a directory-based mechanism.

Figure 4. Simulated platform

Figure 4 shows an overview of this platform. The platform also contains other peripheral components: a timer, an interrupt controller, a frame buffer, a serial output peripheral and a storage peripheral. The simulation was launched under the SystemCASS [16] SystemC kernel. The software that runs on this platform is a parallel MJPEG decoder on top of a small Operating System called DNA [17]. The MJPEG decoder is organized as follows: a first thread is in charge of reading the file and dispatching JPEG blocks to several threads which do the decoding part. Decoded parts of images are then given to a last thread which is in charge of the display. For the software analysis, several hooks have been registered on DNA functions. Hooks have been set for the thread context handlers (cpu_context_init, cpu_context_load and cpu_context_save) to handle thread creation and scheduling. Hooks have been set for the lock functions (lock_acquire, lock_release) and the operating system barrier (cpu_mp_proceed and cpu_mp_wait) to create synchronization points between the threads. The insertion of synchronization points must be done very carefully, since missing some will lead to false data race detections, while adding too many points may prevent some races from being detected. For example, semaphores should not be considered as synchronization points, since they are not generally used to protect shared variables or memory. Each thread was separated into an application level and a kernel level, and other hooks have been set for several kernel functions to operate the switch between the levels. This allows removing the kernel synchronization from the application part and vice versa.

C. Results

The platform was simulated with 1 to 16 processors. Figure 5 shows some numbers for a simulation of 100,000,000 cycles for 1, 2 and 8 processors. Sim. time is the time to simulate the platform and store the events. Overhead is the simulation time overhead compared with the initial simulation without event tracing. Ev. num and Ev. size are the number of events and their size. Inst. time is the total and user time needed to generate the dated processor instruction sequences from the events (step described in Section III-A2). Inst. num and Inst. size are the number of instructions executed by all processors and their size. Soft. time is the total and user time used to build the software threads and synchronization graph from the dated instruction sequences. As shown by these numbers, the application does not scale well on the platform. This is due to the platform implementing sequential consistency, which is a very strict constraint on the memory hierarchy. Furthermore, we do not use any DMA transfers because the trace system does not support them yet. In consequence, the performance is very low when using several processors.
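The following fragment sketches the hook registration described above: the analysis program maps symbol entry addresses to user callbacks and fires them while replaying the dated instruction stream. The Analyzer class, the Hook signature and the symbol address are invented for illustration; they are not the real trace-tool API or the actual DNA symbol addresses.

```cpp
// Illustrative sketch of symbol-entry hooks in the analysis program.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <unordered_map>
#include <vector>

struct Instr { int processor; uint32_t address; uint64_t date; };
using Hook = std::function<void(const Instr&)>;

class Analyzer {
public:
    void register_entry_hook(uint32_t symbol_addr, Hook h) { hooks_[symbol_addr] = std::move(h); }
    void consume(const std::vector<Instr>& stream) {
        for (const Instr& i : stream) {
            auto it = hooks_.find(i.address);
            if (it != hooks_.end()) it->second(i);   // symbol entry detected
        }
    }
private:
    std::unordered_map<uint32_t, Hook> hooks_;
};

int main() {
    constexpr uint32_t LOCK_ACQUIRE = 0x80001000;   // hypothetical symbol address
    Analyzer a;
    a.register_entry_hook(LOCK_ACQUIRE, [](const Instr& i) {
        std::printf("sync point: lock_acquire on cpu %d at date %llu\n",
                    i.processor, (unsigned long long)i.date);
    });
    a.consume({{0, 0x80000ff0, 10}, {0, LOCK_ACQUIRE, 11}});
}
```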

Figure 5. Time spent in the different steps for different numbers of processors

Processors                 1           2           8
Sim. time (seconds)        156s        395s        468s
Overhead                   6.3%        4.5%        3.5%
Ev. num (millions)         64.5        64.9        70.2
Ev. size                   1.3GB       1.3GB       1.4GB
Inst. time (total/user)    127s/12s    127s/13s    131s/14s
Inst. num                  31.4        31.6        34.6
Inst. size                 582GB       586GB       638GB
Soft. time (total/user)    43s/7s      44s/8s      48s/8s

The overhead of the trace system during the simulation is not very high. The time spent in the following steps is significant compared to the simulation time, but this time is for the most part spent in system calls (the user time is low). This is due to the large amount of data read from files in both steps. Consequently, most of that time could be avoided by pipelining all the steps and not using intermediate files. This could be done without difficulty, as the two last steps read the results of the previous step linearly. Using the thread call graphs and the synchronization graph generated by the last step, data race detection was run on pairs of threads that communicate together, despite the complexity of the check (each part of the first thread between two consecutive synchronization points must be checked against each part of the second thread that can be concurrent with it). No data races were detected in the MJPEG decoder using these synchronization points. In order to test our analysis, we removed some locks from the fifo driver used by the threads of the MJPEG decoder to transfer data. This led to the detection of data races. However, a data race detection must be studied carefully, as there may be false detections. For example, an increasing counter which is only protected by a lock for updating and not for reading will cause false data races to be detected on reads.

V. CONCLUSION

The generalization of multiprocessors in integrated circuits introduces the issue of debugging parallel programs in embedded devices. Debugging can no longer be done step by step on a console, and relies more and more on trace analysis. We have shown that, using virtual prototypes producing non-intrusive traces, it is possible to perform complex analyses, such as data race detection. The analysis can be done at different levels of abstraction or target different parts of the software. For example, the operating system kernel and the application can be analysed independently. However, our method only analyses a given execution on a platform. It may be coupled with stimulation of the simulation (for example by adding random delays in communication) to try to change the execution at each simulation and increase the test coverage. We plan to extend this work to systems that do not follow the sequential consistency model, and to the verification of cache coherence protocol implementations.

ACKNOWLEDGMENT

This work is funded by the French Authorities in the framework of the Nano 2012 Program.

REFERENCES

[1] ITRS, ITRS 2009 Edition, 2009, http://www.itrs.net. [2] ARM, Embedded trace macrocell architecture specication, http:// www.arm.com. [3] MIPS Technologies, EJTAG Trace Control Block Specication, http: //www.mips.com. [4] A. B. Hopkins and K. D. McDonald-Maier, Debug support strategy for systems-on-chips with multiple processor cores, IEEE Transactions on Computers, vol. 55, no. 2, pp. 174184, February 2006. [5] B. Vermeulen, K. Gooseens, and S. Umrani, Debugging distributedshared-memory communication at multiple granularities in networks on chip, in Proceedings of ACM/IEEE International Symposium on Networks-on-Chip, April 2008, pp. 312. [6] C. E. Mcdowell and D. P. Helmbold, Debugging concurrent programs, ACM Computing Surveys, vol. 21, no. 4, pp. 593622, December 1989. [7] C.-N. Wen, S.-H. Chou, T.-F. Chen, and A. P. Su, Nuda: A non-uniform debugging architecture and non-intrusive race detection for many-core, in Proceedings of 46th ACM/IEEE Design Automation Conference, July 2008, pp. 148153. [8] L. Albertsson, Simulation-based debugging of soft real-time applications, in Proceedings of the 7th IEEE Real-Time Technology and Applications Symposium, may 2001, pp. 107108. [9] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, A generic instruction set simulator api for timed and untimed simulation and debug of mp2-socs, in Proceedings of the IEEE/IFIP International Symposium on Rapid System Prototyping, June 2009, pp. 116122. [10] S. Taylor, C. Ramey, C. Barner, and D. Asher, A simulation-based method for the verication of shared memory in multiprocessor systems, in IEEE/ACM International Conference on Computer Aided Design, November 2001, pp. 1017. [11] N. Fournel, A. Fraboulet, and P. Feautrier, esimu : a fast and accurate energy consumption simulator for real embedded system, in IEEE International Symposium on a World of Wireless, Mobile and Multimedia Newtorks, 2007, pp. 16. [12] D. Kranzlmueller, Event graph analysis for debugging massively parallel programs, Ph.D. dissertation, Johannes Kepler University of Linz, Austria, 2000. [13] C. Prada-Rojas, V. Marangozova-Martin, K. Georgiev, J.-F. Mehaut, and M. Santana, Towards a component-based observation of mpsoc, in Proceedings of International Conference on Parallel Processing Workshops, 2009, pp. 542 549. [14] L. Lamport, How to make a multiprocessor computer that correctly executes multiprocess programs, vol. 28, pp. 690691, 1979. [15] SoCLib, http://www.soclib.fr. [16] R. Buchmann, F. Ptrot, and A. Greiner, Fast cycle accurate simulator to simulate event-driven behavior, in Proceedings of The 2004 International Conference on Electrical, Electronic and Computer Engineering (ICEEC04), 2004, pp. 3539. [17] X. Guerin and F. Ptrot, A system framework for the design of embedded software targeting heterogeneous multi-core socs, in Proceedings of the 20th IEEE International Conference on Application-Specic Systems, Architectures and Processors, 2009, pp. 153160.


Embedded Virtualization for the Next Generation of Cluster-based MPSoCs


Alexandra Aguiar, Felipe G. de Magalhães, Fabiano Hessel
Faculty of Informatics, PUCRS, Av. Ipiranga 6681, Porto Alegre, Brazil
alexandra.aguiar@pucrs.br, felipe.magalhaes@acad.pucrs.br, fabiano.hessel@pucrs.br

Abstract
Classic MPSoCs tend to be fully implemented using a single communication approach. However, recent efforts have shown a new promising multiprocessor system-on-chip infrastructure: the cluster-based or clustered MPSoC. This infrastructure adopts hybrid interconnection schemes where both buses and NoCs are used concomitantly. The main idea is to decrease the size and complexity of the NoC by using bus-based communication systems at each local port. For example, while in a classic approach a 16-processor NoC might be formed in a 4 x 4 arrangement, in cluster-based MPSoCs a 2 x 2 NoC is employed and each router's local port is connected to a bus that carries 4 processors. Nevertheless, although good results have been reached using this approach, the implementation of wrappers to connect the local router port to the bus can be complex. Therefore, we propose in this work the use of embedded virtualization, another currently promising technique, to achieve results similar to cluster-based MPSoCs without the need for wrappers, besides providing decreased area usage.

Introduction

Embedded Systems (ES) have become a solid reality in people's lives. They are present in a broad range of applications, such as entertainment devices (smart phones, video cameras, games, toys), medical supplies (dialysis machines, infusion pumps, cardiac monitors), the automotive business (engine control, security, ABS) and even the aerospace and defense fields (flight management, smart weaponry, jet engine control) [18]. Usually, these systems need powerful implementation solutions which contemplate several processing units, such as Multiprocessor Systems-on-Chip (MPSoCs) [11]. One of the most important issues regarding MPSoCs lies in the way communication is implemented. Initially, bus-based systems used to be the most common communication solution, since they were usually simpler in terms of implementation. On the other hand, buses have poor scalability, since only a few dozen processors can be placed in the same structure without prohibitive contention rates. Therefore, other communication solutions started being researched, and the most prominent one is the Network-on-Chip (NoC) approach [16]. NoCs are a widely accepted communication solution based on general-purpose network concepts. However, NoCs can present more complex communication protocols and, consequently, less predictability. In this context, a recent idea known as Cluster-based MPSoCs has gained notoriety [7], [12]. In this approach, the best of both worlds is intended to be placed together: NoCs allow higher scalability, while buses keep the design simpler even with more processors on the system. To better understand the concept, Figure 1 depicts a 2x2 NoC which contains a bus located at each local port. Each bus carries four processors, which communicate in simpler ways internally and, if needed, can communicate with other clusters through the NoC. Dotted lines represent the wrappers needed to connect the bus to the NoC.

Figure 1. Cluster-based MPSoC concept

Another recent idea for embedded systems is the use of virtualization in their composition. Virtualization has several possible advantages, including decreased area, increased security levels and easier software design [10], [2], [3]. Virtualized systems are composed of


a hypervisor that holds and controls all virtual machine operation details. This paper proposes the unification of both concepts. Instead of using buses on each router of the NoC, we propose a single processor holding a hypervisor, providing the emulation of several virtual processors. Since buses are poorly scalable, hypervisors do not need to support more processors than a simple bus would. The main contribution of this proposal, named Virtual Cluster-based MPSoCs, is to provide multiprocessed systems with less area occupation. The remainder of the paper is organized as follows. The next section shows some related work on cluster-based MPSoCs. Section 3 presents basic concepts regarding embedded virtualization. Then, in Section 4, details about Virtual Cluster-based MPSoCs are discussed. Section 5 details motivational use cases and some initial experimental results. Finally, Section 6 concludes the paper and presents some future work.


Figure 2. The architecture of Cluster-based MPSoC proposed by [7]

Cluster-based MPSoCs

It is widely known that several MPSoCs are bus-based architectures; systems such as the ARM MPCore [9], the Intel IXP2855 [6] and the Cell processor [13] are examples. Nevertheless, the need for more processing elements and growing system complexity have led other approaches to be researched. Networks-on-Chip (NoCs) have arisen as the main communication infrastructure for complex MPSoCs. However, the design of NoC-based parallel applications is far more complex than that of bus-based systems [7]. Due to the lack of scalability of bus-based systems and the excessive application design complexity found in NoCs, cluster-based systems are becoming a possible alternative, intending to achieve the advantages of both. In [7], the authors propose a cluster-based MPSoC prototype design that integrates 17 NiosII [14] cores, organized into four processing clusters and a central core. In every cluster, each core has its own local memory, and communication is performed through a shared memory accessed from the bus. For inter-cluster communication, the cores use a shared network interface. The system also proposes that a single processing element has access to external peripherals, such as SDRAM controllers. This central control unit is responsible for managing the mapping of the parallel application onto the clusters as well as for gathering the expected results. Figure 2 depicts the architecture proposed by [7]. In this figure, LM stands for Local Memory, CSM for Common Shared Memory, NI for Network Interface and SDRAM IF for SDRAM Interface.

Results were taken for two real applications, matrix chain multiplication and JPEG picture decoding, both implemented on an FPGA development board. The implementation resulted in speedup ratios above 15 times. The main drawback is that real-time applications are not addressed by the authors. Figure 3 shows an example of a processing cluster that composes the cluster-based MPSoC, itself composed of four processor cores. Each processor core, a NiosII, contains its own Local Memory (LM in the figure) and a bridge to access the local bus. A Common Shared Memory (CSM in the figure), used to exchange data among the processors, is also connected to this bus. In addition, a semaphore register file is present, used to synchronize the processes while they access the shared memory. Finally, the cores also share a Network Interface (NI in the figure), which allows inter-cluster communication.
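To illustrate the kind of cluster-internal communication described above, the following minimal C sketch shows a core writing a message into the common shared memory under the protection of the semaphore register file. It is not taken from [7]; the addresses, the register semantics and the busy-wait protocol are assumptions made only for illustration.

/* Hypothetical sketch: a producer core in one cluster copies a block of
 * data into the Common Shared Memory (CSM) after claiming the semaphore
 * register. All addresses and register layouts below are invented. */
#include <stdint.h>
#include <string.h>

#define CSM_BASE ((uint8_t *)0x80000000u)            /* assumed CSM address       */
#define SEM_REG  ((volatile uint32_t *)0x80100000u)  /* assumed semaphore register */

/* A real semaphore register file typically offers an atomic test-and-set;
 * it is modeled here as a plain flag for brevity. */
static void sem_acquire(void)
{
    while (*SEM_REG != 0u) {
        /* spin until the other core releases the semaphore */
    }
    *SEM_REG = 1u;   /* claim it */
}

static void sem_release(void)
{
    *SEM_REG = 0u;
}

/* Copy a message into the shared region visible to all four cores of the
 * cluster; a consumer core would mirror this with a guarded read. */
void cluster_send(const void *msg, uint32_t len)
{
    sem_acquire();
    memcpy(CSM_BASE, msg, len);
    sem_release();
}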

Figure 3. The architecture of each processing cluster proposed by [7]

Jin [12] proposes a cluster-based MPSoC using hierarchical on-chip buses, aiming to tackle some of the problems that pure NoC implementations can present to the components connected to the network. One of the main problems pointed out by the authors concerns real-time applications,


where the NoC must provide highly efficient data exchange. In this approach, no NoC is adopted. Therefore, in such cluster-based MPSoCs the performance of the computation cluster is very important for the system as a whole. The approach presented in [12] can be seen in Figure 4. The system adopts the AMBA AHB protocol, a high-performance system bus that supports multiple bus masters and provides high-bandwidth operation. The authors also use a hierarchical bus architecture to obtain better performance, especially by decreasing bus collision rates, improving the speed of register configuration and avoiding shared memory contention and bottlenecks.

To implement the hypervisor, also known as the Virtual Machine Monitor (VMM), two approaches are commonly used. A type 1 hypervisor, also known as hardware-level virtualization, can itself be considered an operating system, since it is the only piece of software that runs in kernel mode, as depicted in Figure 5. Its main task is to manage multiple copies of the real hardware - the virtual boards (virtual machines or domains) - just as an OS manages multitasking.

Figure 4. The architecture of the cluster-based MPSoC proposed by [12]

Figure 5. Hypervisor Type 1

Type 2 hypervisors, also known as operating system level virtualization, depicted in Figure 6, are implemented such that the hypervisor itself can be compared to another user application that simply interprets the guest machine ISA.

The proposed solution is divided into inner buses, which are present in each SoC itself - forming each cluster - and the outer bus, which connects the clusters to each other and to external peripherals. In addition, the work proposed in [4] also targets pure NoC implementations by adding a bus-based interface to the NoC routers. The main goal is to ease integration with other bus-based IP components, which are more commonly found. Thus, the proposed NoC is able to integrate standard non-packet-based components, reducing design time. Other approaches have also studied the use of buses in NoCs for different purposes [15], [20]. In our case, we still want to use the NoC infrastructure, but instead of adding another level of communication we propose to use virtual domains. The next section introduces some concepts of embedded virtualization.

Virtualization and Embedded Systems

Figure 6. Hypervisor Type 2

One of the most successful techniques to implement virtualized systems is known as para-virtualization. It is a technique that replaces sensitive instructions of the original kernel code with explicit hypervisor calls (also known as hypercalls). Sensitive instructions belong to a classification of the instructions of an ISA (Instruction Set Architecture) into three different groups, proposed by Popek and Goldberg

First of all, even for classic virtualization concepts, which date back more than 30 years [8], the main component of virtualization is the hypervisor. The hypervisor is responsible for managing the virtual machines (also known as virtual domains) by providing them with the environment needed for their proper operation.


[19]: 1. privileged instructions: those that trap when used in user mode and do not trap when used in kernel mode; 2. control-sensitive instructions: those that attempt to change the configuration of resources in the system; and 3. behavior-sensitive instructions: those whose behavior or result depends on the configuration of resources (the content of the relocation register or the processor's mode). The goal of para-virtualization is to reduce the problems encountered when dealing with different privilege levels. Usually, a scheme referred to as protection rings is used, which guarantees that the lower-level rings (Ring 0, for instance) hold the highest privileges. Most OSs are executed in Ring 0 and are thus able to interact directly with the physical hardware. When a hypervisor is adopted, it becomes the only piece of software executed in Ring 0, with severe consequences for the guest OSs: they are no longer executed in Ring 0 and instead run in Ring 1, with fewer privileges. These concepts are present in virtualization for general-purpose systems, but they are very important when dealing with the typical challenges of embedded systems. Next, some peculiarities found in the application of virtualization solutions to embedded systems are discussed.

Therefore, in VHH, domains need to be modified before being executed on top of it. As a result, they do not manage hardware interrupts directly. Instead, the guest OS must be modified to use the virtualized operations provided by the VHH (hypercalls). Figure 7 depicts the Virtual-Hellfire Hypervisor structure. In this figure, the hardware continues to provide basic services such as timers and interrupts, but they are managed by the hypervisor, which provides hypercalls to the different domains, allowing them to perform privileged operations.
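As an illustration of this para-virtualized structure, the sketch below shows how a modified guest kernel might delegate a privileged operation (programming a timer) to the hypervisor through a hypercall. The hypercall number, the trap routine and its signature are hypothetical; only the general mechanism follows the description above.

/* Hypothetical guest-side sketch of para-virtualization: instead of writing
 * the hardware timer registers directly (a privileged/sensitive operation),
 * the modified guest OS issues a hypercall to the hypervisor. */
#include <stdint.h>

#define HC_SET_TIMER 1u  /* assumed hypercall number */

/* Low-level trap into the hypervisor; on a Plasma/MIPS-like target this
 * would be a few lines of assembly issuing a trap instruction with the
 * hypercall number in a register (not shown here). */
extern int vhh_trap(uint32_t hypercall_num, uint32_t arg0);

/* A native kernel would program the timer compare register itself;
 * the para-virtualized guest delegates the privileged operation instead. */
void guest_set_timer(uint32_t ticks)
{
    (void)vhh_trap(HC_SET_TIMER, ticks);
}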

Figure 7. Virtual-Hellfire Hypervisor Domain structure

Thus, the Virtual-Hellfire Hypervisor is implemented based on HellfireOS [1] and relies on the following layers: the Hardware Abstraction Layer (HAL), responsible for implementing the set of drivers that manage the mandatory hardware, such as the processor, interrupts, clock and timers; the Kernel API and Standard C Functions, which are not available to the partitions; and the Virtualization layer, which provides the services required to support virtualization and para-virtualization. The hypercalls are implemented in this layer. Figure 8 depicts the architecture of the VHH, where the following modules can be found: the domain manager, responsible for domain creation, deletion, suspension and so on; the domain scheduler, responsible for scheduling domains on a single processor;

3.1 Virtual-Hellfire Hypervisor


There are several hypervisors with an embedded systems focus [22], [10], [21]. In this work, we adopt the Virtual-Hellfire Hypervisor (VHH) [3], part of the Hellfire Framework. The main advantages of VHH are: temporal and spatial isolation among domains (each domain contains its own OS); resource virtualization (clock, timers, interrupts, memory); efficient context switching for domains; a real-time scheduling policy for domain scheduling; and deterministic hypervisor system calls (hypercalls). VHH considers a domain as an execution environment where a guest OS can be executed, and it offers the virtualized services of the real hardware to it. In embedded systems where no hardware support is offered, para-virtualization tends to present the best performance results.


the interrupt manager, which handles hardware interrupts and traps and is also in charge of delivering virtual interrupts and traps to the domains; and the hypercall manager, responsible for handling calls made from the domains, analogous to system calls in conventional operating systems.
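The hypercall manager can be pictured as a dispatcher very similar to a system-call table. The following C sketch illustrates the idea; the handler names, numbers and signatures are invented for the example and do not correspond to the actual VHH implementation.

/* Hypothetical hypervisor-side sketch: hypercall numbers are mapped to
 * handlers and executed on behalf of the calling domain. */
#include <stdint.h>

typedef int (*hypercall_fn)(uint32_t domain_id, uint32_t arg0, uint32_t arg1);

static int hc_set_timer(uint32_t dom, uint32_t ticks, uint32_t unused)
{
    (void)dom; (void)ticks; (void)unused;
    return 0;   /* would program the domain's virtual timer */
}

static int hc_send_msg(uint32_t dom, uint32_t dst, uint32_t buf)
{
    (void)dom; (void)dst; (void)buf;
    return 0;   /* would copy the message into the shared partition */
}

/* Dispatch table indexed by hypercall number. */
static const hypercall_fn hypercall_table[] = {
    /* 0 */ 0,
    /* 1 */ hc_set_timer,
    /* 2 */ hc_send_msg,
};

/* Entry point reached from the trap handler when a domain issues a
 * hypercall; analogous to a system-call dispatcher in a conventional OS. */
int hypercall_dispatch(uint32_t dom, uint32_t num, uint32_t a0, uint32_t a1)
{
    if (num >= sizeof(hypercall_table) / sizeof(hypercall_table[0]) ||
        hypercall_table[num] == 0)
        return -1;                      /* unknown hypercall */
    return hypercall_table[num](dom, a0, a1);
}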

Figure 9. VHH Memory for (A) Non-clustered systems (B) Clustered systems

Figure 8. VHH System Architecture

Virtual Cluster-Based MPSoCs

This section describes the Virtual Cluster-Based MPSoC proposal. Initially, let us take a look at each cluster of the MPSoC. Since our work is based on the Hellfire Project, we also use the Plasma [5] processor, a MIPS-like architecture. Therefore, the VHH is placed on a Plasma processor as the basis of our cluster, and it is responsible for managing several virtual domains. In our case, each VHH manages its own processing cluster and allows internal communication among these processors through shared memory. Figure 9 is divided into two parts. Part A shows the current memory division, which foresees only a single memory partition per virtual domain; this partition is considered to be the local memory of that virtual domain. In part B, an extra partition is added: the shared partition. Here, the idea is to provide easy and low-overhead communication inside the cluster. The VHH was extended to allow communication at two levels. The first level is named intra-cluster communication and occurs through shared memory. Currently, this is not transparent to the user, and a specific hypercall must be used for this communication. In this hypercall, a single CPU identification (CPU ID) is used, which means the communicating tasks belong to the same processing cluster. These hypercalls are similar to the communication functions provided by HellfireOS and have the following

parameters: VHH_SendMessage(cpu_id, task_id, message, message_length), used to send a message through the shared memory, and VHH_ReceiveMessage(source_cpu_id, source_task_id, message, message_length), used to receive it. The second communication level takes place among clusters, through the NoC. In our case, we use the HERMES NoC [17] and a MIPS-like processor at each router. We adopted a Network Interface (NI) as a wrapper that connects the NoC router to the processor located at its local port. This interface works in a similar way to the non-virtualized approach, which increases the possibility of using several NoC infrastructures as the underlying architecture. Figure 10 depicts this approach.

Figure 10. VHH Communication Infrastructure with NoC-based Systems

The wrapper is connected to the Plasma through specific memory addresses for read and write. In addition, a VHH communication driver had to be written to allow the integration between the wrapper and the virtual cluster. The hypercalls provided by the VHH also allow a virtual processor to send or receive messages with an extra parameter: the


Virtual CPU ID, which identifies the virtual CPU on a specific cluster. Thus, the hypercalls to be used for inter-cluster communication are: VHH_SendMessageNoC(cpu_id, virtual_cpu_id, task_id, message, message_length), used to send a message through the NoC, and VHH_ReceiveMessage(source_cpu_id, source_virtual_cpu_id, source_task_id, message, message_length), used to receive it. The complete view of the system is depicted in Figure 11. In the figure, VHH is the Virtual-Hellfire Hypervisor, LM stands for Local Memory, NI stands for Network Interface, PE for Processing Element, and R represents each router of the NoC.
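A usage sketch of these communication hypercalls is given below. The function names and parameter lists follow the text above, but the concrete C types, the identifier values and the message layout are assumptions.

/* Hypothetical usage sketch of the VHH communication hypercalls. */
#include <stdint.h>

/* Prototypes as suggested by the description (exact signatures assumed). */
int VHH_SendMessage(uint32_t cpu_id, uint32_t task_id,
                    const void *message, uint32_t message_length);
int VHH_ReceiveMessage(uint32_t source_cpu_id, uint32_t source_task_id,
                       void *message, uint32_t message_length);
int VHH_SendMessageNoC(uint32_t cpu_id, uint32_t virtual_cpu_id,
                       uint32_t task_id, const void *message,
                       uint32_t message_length);

void example_task(void)
{
    char buf[32] = "hello";

    /* Intra-cluster: both tasks run on virtual CPUs managed by the same
     * VHH instance, so the message goes through the shared partition. */
    VHH_SendMessage(/*cpu_id*/ 0, /*task_id*/ 3, buf, sizeof(buf));

    /* Inter-cluster: the extra virtual CPU ID selects the destination
     * domain on another physical processor, reached through the NoC. */
    VHH_SendMessageNoC(/*cpu_id*/ 2, /*virtual_cpu_id*/ 1,
                       /*task_id*/ 5, buf, sizeof(buf));
}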

Figure 12. Virtual Cluster-Based MPSoC with Application Specialization

Figure 11. Virtual Cluster-Based MPSoC proposal

Use Cases and Experimental Results

In this section we highlight some possible use cases for Cluster-Based MPSoCs and some preliminary prototyping results. The main use of a cluster-based MPSoC is the possibility of field specialization. In this case, each cluster is responsible for executing a set of tasks with a common purpose. For instance, it is possible to execute a JPEG decoder in one cluster, an MPEG decoder in another, and so on. Here, the greatest advantage is to simplify the communication of similar tasks, since they share a given memory area, while still allowing a great number of processors and increasing system scalability through the use of the NoC. Figure 12 depicts an example of a cluster-based MPSoC with application specialization. Another possible use of the Virtual Cluster-based MPSoC is when decreased area with guaranteed system scalability is needed. Scalability is assured by the NoC, and the cluster-based organization itself allows an easier use of

real-time tasks with no extra communication penalties. Regarding area occupation, we prototyped some possible configurations to illustrate the benefits of our approach in this respect. We used a Xilinx Virtex-5 XC5VLX330T FPGA. First, when using HellfireOS on a Plasma processor, we usually provision a processor with at least 16KB of local memory. HellfireOS is a highly optimized kernel and, depending on the application, even such a small memory can fulfill the expected needs. When using the VHH, more memory is required, and the total memory size depends especially on the number of virtual domains that are required. Although larger memory sizes imply more block RAMs, they do not affect the FPGA area measured in LUTs; in all experiments performed, the total system memory could be mapped to block RAMs. We used three different MPSoC configurations, all with 16 processors (physical or virtual). First, we have a 16-processor MPSoC distributed over a 4x4 NoC where each router carries its own processor, known as the Pure 4x4 NoC approach. The second MPSoC configuration is a 2x2 NoC with a bus-based clustering system, known as the Bus Clustered approach. Here, each router has a wrapper to connect it to the cluster bus, and each bus carries four processors. Finally, the last approach is the virtual cluster-based one (V-Cluster 2x2 NoC), where a 2x2 NoC is used again and each router contains a single physical processor. This processor runs the VHH, where 4 virtual domains are emulated per cluster, totaling the 16 processors of the MPSoC. In the first two solutions, each processor has 16KB of local memory. In the last, the virtual cluster approach, 4 processors with 128KB of memory each were employed. Table 1 shows the prototyping results for the three different MPSoCs. These results show a decrease in area occupation of up to 70%, depending on the processor local memory


Table 1. Area results for the MPSoC configurations

Configuration            Area occupation (LUTs)
Pure 4x4 NoC             60934
Bus Clustered 2x2 NoC    56099
V-Cluster 2x2 NoC        17179


configuration and the original MPSoC configuration. Also, depending on the bus structure used for the bus-based clustered version, the bus communication overhead is similar to the virtualization overhead.


Concluding Remarks and Future Work

This paper presents a new proposal for MPSoC configuration using virtualization with a cluster-based approach. For validation purposes, we use an extension of the Hellfire Framework and, in order to incorporate our virtualization methodology, the Virtual-Hellfire Hypervisor (VHH). We use a HERMES NoC as the underlying architecture, where each processor runs the VHH, forming the processing clusters. We achieved up to a 70% decrease in FPGA area occupation in our preliminary tests. As future work, we intend to obtain comparative results for performance and overhead with respect to other approaches. We also want to improve the proposal itself, especially regarding memory and I/O management.

Acknowledgment
The authors acknowledge the support granted by CNPq and FAPESP to the INCT-SEC (National Institute of Science and Technology Embedded Critical Systems Brazil), processes 573963/2008-8 and 08/57870-9. Also, this work is supported in the scope of the project SRAM by the Research and Projects Financing (FINEP) under Grant 0108031000.


References
[1] A. Aguiar, S. Filho, F. Magalhaes, T. Casagrande, and F. Hessel. Hellfire: A design framework for critical embedded systems applications. In Quality Electronic Design (ISQED), 2010 11th International Symposium on, pages 730-737, 2010.
[2] A. Aguiar and F. Hessel. Embedded systems virtualization: The next challenge? In Rapid System Prototyping (RSP), 2010 21st IEEE International Symposium on, pages 1-7, 2010.
[3] A. Aguiar and F. Hessel. Virtual-Hellfire Hypervisor: Extending Hellfire framework for embedded virtualization support. In Quality Electronic Design (ISQED), 2011 12th International Symposium on, 2011.


[4] B. Ahmad, A. Ahmadinia, and T. Arslan. Dynamically Reconfigurable NoC with Bus Based Interface for Ease of Integration and Reduced Design Time. IEEE, June 2008.
[5] OpenCores. Plasma - most MIPS I(TM) opcodes. http://www.opencores.org.uk/projects.cgi/web/mips/, 2007. Accessed September 2009.
[6] Intel Corp. Intel IXP2855 network processor. Available at http://www.intel.com/, 2005.
[7] L.-F. Geng. Prototype design of cluster-based homogeneous Multiprocessor System-on-Chip. 2009 3rd International Conference on Anti-counterfeiting, Security, and Identification in Communication, pages 311-315, Aug. 2009.
[8] R. P. Goldberg. Survey of virtual machine research. Computer, pages 34-35, 1974.
[9] J. Goodacre and A. N. Sloss. Parallelism and the ARM instruction set architecture. Computer, 38:42-50, July 2005.
[10] G. Heiser. Hypervisors for consumer electronics. pages 1-5, Jan. 2009.
[11] A. Jerraya, H. Tenhunen, and W. Wolf. Multiprocessor systems-on-chips. Computer, 38(7):36-40, July 2005.
[12] X. Jin, Y. Song, and D. Zhang. FPGA prototype design of the computation nodes in a cluster based MPSoC. IEEE, July 2010.
[13] M. Kistler, M. Perrone, and F. Petrini. Cell multiprocessor communication network: Built for speed. IEEE Micro, 26:10-23, May 2006.
[14] Altera Ltd. Nios II processor reference. Available at http://www.altera.com/, 2009.
[15] R. Manevich, I. Walter, I. Cidon, and A. Kolodny. Best of both worlds: A bus enhanced NoC (BENoC). IEEE, Nov. 2010.
[16] G. D. Micheli and L. Benini. Networks on Chips: Technology and Tools (Systems on Silicon). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.
[17] F. Moraes, N. Calazans, A. Mello, L. Möller, and L. Ost. Hermes: an infrastructure for low area overhead packet-switching networks on chip. Integr. VLSI J., 38(1):69-93, 2004.
[18] T. Noergaard. Embedded Systems Architecture: A Comprehensive Guide for Engineers and Programmers. Newnes, 2005.
[19] G. J. Popek and R. P. Goldberg. Formal requirements for virtualizable third generation architectures. Commun. ACM, 17(7):412-421, 1974.
[20] T. Richardson, C. Nicopoulos, D. Park, V. Narayanan, Y. Xie, C. Das, and V. Degalahal. A hybrid SoC interconnect with dynamic TDMA-based transaction-less buses and on-chip networks. In VLSI Design, 2006 (held jointly with the 5th International Conference on Embedded Systems and Design), 19th International Conference on, page 8 pp., 2006.
[21] Wind River. Available at http://www.windriver.com/. Accessed 2 Oct. 2010.
[22] XEN.org. Embedded Xen project. Available at http://www.xen.org/community/projects.html. Accessed 10 Aug. 2010.


Session 5: Model Based System Design


Rapid Property Specification and Checking for Model-Based Formalisms


Daniel Balasubramanian, Gábor Pap, Harmon Nine, Gábor Karsai
ISIS / Vanderbilt University, Nashville, TN 37212
Email: {daniel.a.balasubramanian, gabor.pap, harmon.s.nine, gabor.karsai}@vanderbilt.edu
Michael Lowry, Corina Păsăreanu, Tom Pressburger
NASA Ames Research Center, Moffett Field, CA 94035
Email: {michael.r.lowry, tom.pressburger, corina.s.pasareanu}@nasa.gov

Abstract - In model-based development, verification techniques can be used to check whether an abstract model satisfies a set of properties. Ideally, implementation code generated from these models can also be verified against similar properties. However, the distance between the property specification languages and the implementation makes verifying such generated code difficult. Optimizations and renamings can blur the correspondence between the two, further increasing the difficulty of specifying verification properties on the generated code. This paper describes methods for specifying verification properties on abstract models that are then checked on implementation-level code. These properties are translated by an extended code generator into implementation code and special annotations that are used by a software model checker.

I. INTRODUCTION

Model-based development (MBD) is a software and system design paradigm based on abstractions called models. Domain-specific modeling languages (DSMLs) [1] provide the ability to represent models that are specific to a particular problem domain. Cast in this light, Matlab/Simulink [2] can be viewed as a DSML for physical and embedded systems, as it allows modeling the (dynamics of the) physical plant as well as the behavior of its controller software. Once the model is created, the closed-loop system can be simulated, output traces observed, and the model modified as needed. Simulation alone, however, cannot provide rigorous guarantees about a model's behavior. In order to prove exhaustively that a model's dynamic behavior always satisfies a set of properties, some form of verification [3] must be performed. Typical properties include state reachability, deadlock-freedom and a wide range of temporal properties. In recent years, model-level verification tools have been developed that can check models for such properties. While these tools play an important role in MBD and can provide guarantees about a model's behavior, their use is often limited to a small portion of a complex system, i.e., key properties and algorithms. One of the key goals of MBD is to gradually refine abstract, high-level models until they can be automatically synthesized

into an implementation that runs on a non-ideal computational platform. However, one crucial problem is often ignored: how can one verify that the synthesized implementation code satisfies the same properties as the models from which it was generated? Without verifying the implementation, the guarantees provided by checking the abstract models are lost. Checking or proving the correctness of the synthesis (transformation) algorithms is an open problem. Further, if no verification is performed on the high-level models, then verifying the implementation is the only way to prove properties about the system. The major difficulty in verifying model-level properties on implementation-level code lies in the different levels of abstraction. Abstract models are developed by hand and designed with readability in mind, while automatically generated code can be difficult to read. Further, the correspondence between model elements and their generated code is not obvious. Renamings and optimizations make it difficult to understand how a particular model element is represented in the generated code. As a result, knowing where to place properties that are to be verified becomes a challenge. Another difficulty lies in the mismatch between the input languages of verification tools used at the different levels of abstraction. Individual verification tools typically each use their own input language for defining properties, so that properties checked at the model level must be rewritten in a new syntax to be checked on the implementation-level code. This problem is exacerbated by the fact that code generators typically rename model elements in the generated code, so that, for instance, the names of variables in the generated code are not known at the model level. Without knowing the names of the variables, verification properties certainly cannot be defined. We present in this paper a method for specifying properties on high-level models that are then used in the verification of the generated, implementation-level code. Properties are written in an intuitive way, directly on the model elements. As the model is translated into various intermediate forms and

978-1-4577-0660-8/11/$26.00 c 2011 IEEE




Fig. 1. Overview of the framework. Verification properties can be specified using observer automata or contracts.

ultimately into executable code, the user-defined properties are preserved and translated into implementation code and annotations that are checked by a software model checker. The translation is performed via a code generator that has been extended to handle the extra information. The results of the verification are then displayed to the user (in terms of the original high-level model). While we focus on Matlab/Simulink, we believe that our method of defining properties on the model level that are checked against a generated implementation can be generalized and leveraged in other MBD tools as well. This approach makes property-based verification an integral part of the development workflow. Note that the framework enables run-time verification in addition to model checking. The remainder of the paper is organized as follows. Section II gives an overview of our approach and background, including a description of the tool-suite. Section III provides details on how the user annotates Simulink models with properties. Section IV presents an end-to-end example. We compare our approach with related work in Section V and conclude in Section VI. II. OVERVIEW AND BACKGROUND An overview of our approach is depicted in Figure 1 and consists of the following steps: (1) a Simulink model is defined, (2) the model is annotated with properties to verify, (3) the code generator is invoked to produce executable code, (4) the software model checker is executed on the code and properties, (5) results about the verification process are reported. The first and third steps are described in [4]; this paper concentrates on the other steps. The code generator produces restricted-form Java code and is the same code generator described in [4], but extended with features for generating annotations for verification. The main motivation for this choice of target language was that the software model checker used can work with Java programs. The code generated by our toolchain is completely sequential and does not use dynamic memory (after initialization), hence it is suitable for embedded applications. The code is also object-oriented (an increasing trend in embedded software): subsystems are translated into Java classes that are instantiated at initialization time. Our code generator actually uses a re-targetable back-end, such that either Java or C code can be produced from the same abstract syntax tree.

A. Property annotations

The second step in Figure 1 is annotating the Simulink model with properties to verify on the generated code. Since the development of model checking [5] in the early 1980s, a number of specification languages have been invented to formally define properties. Common ways of specifying these properties include regular expressions and temporal logic, such as LTL and CTL. However, the drawback to using temporal logics for property specification is their steep learning curve for industrial practitioners. Consequently, designers and developers will be less likely to use verification tools if they must devote large amounts of time to learning a specification language. For this reason, we decided to take two approaches to property specification. The first uses the pattern-based system introduced in [6]. In that work, the authors studied a large body of existing property specifications and found that the majority of them were instances of a small set of parameterizable patterns: reusable solutions to recurring problems. Patterns are entered into our system using a custom interface that we integrated directly into Simulink. After the parameters have been entered, our interface generates an observer automaton to represent an instance of that pattern. Formulation of assertions as Statechart observer automata has been described in Chapters 4 and 5 of [7]. Because we are in a Simulink context, it is natural that we represent observer automata as Stateflow subsystems inserted in the Simulink diagram that implement the logic of the specification described by the pattern. They contain input signals corresponding to the variables and events under observation, and the internal states that implement the logic of the property. The generated observer automata are competitive in size with those coded by hand. Full details are given in Section III. The second approach to property specification is based on contracts and is similar to the idea of programming by contract [8]. Programming by contract is a methodology for writing programs that use interface specifications on software components to define properties about their behavior. Typically, the specification on a component includes three elements: properties that must hold in order to use the component correctly (preconditions), properties that will hold when the component has finished executing (postconditions), and properties that must always be satisfied (invariants). We applied this idea of contracts to specifying properties for Simulink subsystems. On any subsystem, the user is allowed to write preconditions, postconditions and invariants that must be satisfied by that subsystem. During the code generation phase the contracts on various subsystems are translated into annotations on the methods and classes implementing these subsystems in the generated code. A thorough description is given in Section III.

B. Software model checking

Our generated code is verified using Java Pathfinder (JPF) [9], a software model checker for Java. We chose JPF for two reasons. First, our toolsuite was already configured to generate Java code. Second, JPF provides libraries supporting a number of verification features especially useful in our toolsuite: code contracts, monitoring execution for exceptions and numerical problems, as well as symbolic execution. The code contract feature of JPF permits annotations for preconditions, postconditions and invariants to be written

on classes and methods. JPF monitors these conditions at runtime and reports any violations. This feature allows the preconditions, postconditions and invariants that are defined on the Simulink model elements to be translated to the generated code in a straightforward manner by the code generator. The symbolic execution [10] feature of JPF allows us to perform state reachability analysis and test case generation. The symbolic execution engine runs a program much like normal program execution, but it does not assign concrete values to program input variables. Instead, input variables are left as symbolic values. When input variables are used in a branching condition, a constraint solver attempts to find values for the symbolic variables that will allow both branches of the condition to be taken. This idea is explained further in [10]. In this paper, we do not concentrate on the symbolic execution aspect. III. SPECIFICATION PATTERNS AND CONTRACTS This section gives details on how properties are specified on the model level and then translated into generated code. We first describe the specification patterns, which can be attached to the model using a custom interface or from a supplied library. If the interface is used, a corresponding observer automaton is automatically generated from the specifications. The interface can be used to insert basic properties, but to describe more complex properties, the observer automata can be compositionally defined using the supplied library. We also describe the details of how contracts are written on the model and then translated into annotations on the generated code.

Fig. 2. Scope library: observer automaton templates for the global, before and until scopes.

must be followed by another state) and the precedence (a state must be preceded by another state) patterns, and the compound group contains the chain precedence, and chain response patterns. Dwyer et al. [6] have shown how these scopes and patterns can be expressed in LTL, CTL, and other formalisms. However, the property specication patterns can also be easily expressed as parameterized observer automata, which is the A. Specication patterns approach we take. Note that many specications can be Property specication patterns describe commonly observed added to a model and each one is translated into a separate requirements in a generalized manner. They capture a par- automaton. Additionally, the denition of a simple interface ticular aspect of a systems behavior as a sequence of state allows the composition of the scope and pattern aspects of the congurations. Note that the specications can be state-based specication, represented as two distinct automata templates. or event-based. In the discussion below we mention the state- Furthermore, using the Stateow language allows the observer based form, but the same approach applies to events as well. automata to be created inside Simulink diagrams. Statechart To illustrate, consider the property that throughout a sys- hierarchy is exploited in some of the examples in Chapters 4 tems execution the value of a certain variable should always and 5 of [7], and we make use of hierarchy in formulating be greater than zero. There are two basic parts to this property each scope as a Stateow diagram that contains a pattern that commonly occur. The rst tells when the property should submachine. hold (in this case, at all times during execution), and the The Simulink model extended with the observer automata second tells what condition should be satised during this time is then translated into the target language. Hence the gener(here, the variable should be greater than zero). ated, functional code will be augmented with the code that A property consists of precisely those two pieces: a scope implements the observer automata. Now the software model and a pattern. The scope denes when a particular property checker can monitor and verify the execution of the entire should hold during program execution, and the pattern denes implementation, paying special attention to the error states and the conditions that must be satised. There are ve basic kinds properties specied in observer automata. As specications are of scopes: global (the entire execution), before (execution translated into executable code, the distance between codeup to a given state), after (execution after a state), between level monitoring and software model checking and model-level (execution from one state to another) and until (execution from property specications is reduced. one state even if the second never occurs). Figure 2 shows the automata for three of the ve scopes. There are three categories of patterns: occurrence, order We now briey describe each of these. The automaton for the global scope is shown at the top and compound. The occurrence group contains the absence (never true), universality (always true), existence (true at least of Figure 2. This scope indicates that a property should hold once) and bounded existence (true for a nite number of during the entire system execution. Initially, the state labeled times) patterns. The order group contains the response (a state Pattern is entered. There are two transitions from this state 123

to the state labeled Error State. The rst is triggered by Initial State P1 Encountered an event named error event. This event is generated by an [P1] en: propertyOK = false en: propertyOK = true enclosed property when that property has been violated. The Existence pattern second transition is triggered by an event named end event and a guard condition requiring the boolean value properError State Initial State [P1]{propertyOK = false; error_event;} en: propertyOK = true tyOK to be false. The end event is generated upon system termination and the propertyOK variable is set to false by Absence pattern the scopes enclosed property if that property is violated. That Initial State is, the second transition is taken if the system terminates and [P2]{propertyOK=false; error_event;} Error State en: propertyOK = true 1 the property enclosed by this scope has been violated. 2 The automaton for the before scope is shown in the middle [P1] of Figure 2. This scope is used to express that a property P1 Encountered Safe State [P2] should hold before some other condition is met. In the Figure, the event named Before is used to represent the condition. Precedence pattern Initially, the Pattern state is entered. If end event occurs Fig. 3. Pattern library. (the system terminates) and the enclosed property has been violated (propertyOK is false) then the rst transition is taken and the ErrorState is entered. If the Before event Property error_event occurs and propertyOK is false, the second transition is Error State Ini?al State taken and ErrorState is entered. The state named Safe end_event [propertyOK == false] en: propertyOK = false State is only entered if the Before event occurs and the x enclosed property has not been violated (propertyOK is [x > 0] true). Note that, in general, a property is considered satised Safe State as long as the error state of the propertys scope automaton is en: propertyOK = true not active. The until scope captures the requirement that some condiExistence paEern Global scope tion should hold from one state to another even if the second condition never occurs, or stated differently, in between one condition and a second, even if the second condition never Fig. 4. Property describing that at some point, x should be greater than 0. occurs. The bottom of Figure 2 shows the automaton for Scope states are white and patterns states are shaded. this scope. The two variables named Before and After are used to represent the two conditions in between which a property should hold. Upon entry, Initial State is entered. that the property is initially satised: P1 has not occurred. If When the variable After becomes true, then the transition to P1 does become true, then the transition to Error State the Pattern state is taken. While in this state, the automaton is taken,propertyOK is set to false and the error event is is waiting for the property to happen before the second con- emitted. The automaton for the precedence pattern is at the bottom dition is satised. When the property is satised, the variable propertyOK becomes true. If before propertyOK becomes of Figure 3. This captures the property that some condition true either the Before condition becomes true or system (P2) must be preceded by another condition (P1). Note execution ends (end event occurs), the transition to Error that in this automaton, the initial state sets the propertyOK State occurs and signals an error to the user. Otherwise, if variable to true: the property is initially satised. 
If P2 is true propertyOK is true (the property is satised) and the second before P1, that is, the condition denoted by P2 happens condition is also satised (Before is true), the transition back before the condition denoted by P1 is met, then the transition to Error State is taken, propertyOK is set to false, and to Initial State is taken, and the cycle repeats. Figure 3 shows the automata for three of the patterns. the error event is emitted. Otherwise, the overall precedence At the top of the Figure is the automaton for the existence pattern is satised. Scopes and patterns are combined to form property specipattern. This pattern states that a condition (represented in the automaton by the boolean variable P1) should occur cations. Consider the example in Figure 4, which species the during a specied scope. When the Initial State is entered, following property: at some point during system execution, the the propertyOK variable is set to false, indicating that the input variable x should be greater than 0. Stated differently, property is initially unsatised: P1 has not occurred. If P1 throughout the entire system execution (i.e., global scope), x does become true, then the transition to P1 Encountered is should be greater than 0 at least once (i.e., existence property). taken and propertyOK is set to true. To dene this property, the existence pattern shown in Figure 3 A simple pattern, absence, is shown in the middle portion is inserted into the Pattern state of the global scope shown in of Figure 3. This pattern states that a condition (represented Figure 2. The difference is that the generic condition shown in the automaton by the boolean variable P1) should not as P1 in the basic existence pattern is replaced with the occur during a specied scope. When the Initial State is condition x > 0. Note that the propertyOK variable is set entered, the propertyOK variable is set to true, indicating by the pattern and its value is used by the scope. 124
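To make the composition of scope and pattern more concrete, the following is a small hand-written C sketch of a monitor equivalent to the Figure 4 property (the input x should be greater than 0 at least once during the entire execution). It is not the output of the code generator described here; the function names, the step/termination callbacks and the use of assert are assumptions chosen only to illustrate how an existence pattern inside a global scope behaves.

/* Sketch of an existence pattern composed with the global scope. */
#include <assert.h>

enum pattern_state { INITIAL, SAFE };

static enum pattern_state st = INITIAL;
static int property_ok = 0;      /* set by the pattern, read by the scope */

/* Called once per execution step with the monitored signal. */
void observer_step(int x)
{
    if (st == INITIAL && x > 0) {
        st = SAFE;               /* "P1 Encountered" in the pattern */
        property_ok = 1;
    }
}

/* Called when the system terminates (the scope's end_event). */
void observer_end(void)
{
    /* Global scope: if the enclosed property never held, report an error. */
    assert(property_ok && "existence property violated: x was never > 0");
}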



z is greater than 0, and if x is 1, then the output z is less than 0. These requirements are attached to the subsystem using the dialog box as shown at the top of Figure 5. The contracts are added to the subsystem model as specially formatted descriptions (that are usually just unstructured text), using XML-like syntax. The code generator parses these descriptions, and if they are syntactically correct, it constructs the properly formatted strings (with variable names rewritten into their code equivalents) that are suitable for the software model checker. A Java implementation of the subsystem in Figure 5 that is very similar to the code produced by our code generator is shown in Listing 1. Note that in the contract, the inputs and outputs of the subsystem are referred to by their name in the model. This is an important part of our approach: the user always refers to the model elements as they are written in the model. No knowledge of the code generation process is needed to write specifications. The contract specified in the model is generated in the Java code as annotations that automatically reference the correct variable names. These annotations are used by the software model checker to monitor the code execution.
Listing 1. Java implementation of the subsystem in Figure 5.

public class Subsystem15 {
    private int value1 = 0;
    private int value2 = 0;

    @Requires((x13 == 0 && y25 > 0 && y25 < 10) ||
              (x13 == 1 && y25 > 10 && y25 < 20))
    @Ensures((x13 == 0 && z65 > 0) || (x13 == 1 && z65 < 0))
    public void Main23(int x13, int y25, int[] z65) {
        value1 = x13;
        value2 = y25;
        ... // Code implementing subsystem logic
    }
}

Fig. 5. Contract example.

Additionally, we developed a dedicated user interface that uses dialog forms for entering property specifications. The dialogs capture both the kind of scope and pattern, as well as the parameters needed to instantiate and compose them. The user picks the scope and the pattern and enters the appropriate conditions. A composed automaton combining an instance of both the scope and the pattern is then automatically generated. An example using these dialog forms is described in Section IV. B. Contracts

IV. E XAMPLE The second method we use for describing verication properties is based on contracts. We extended Simulink with a This section shows how our framework can be applied to custom interface that allows the user to annotate any subsystem realistic models. The example we use is the Apollo Lunar with three additional items. Module digital autopilot model, which is included with the Preconditions that the input signals to the subsystem must Matlab/Simulink distribution as an example. The full model satisfy. includes a dynamic model of the plant: the Apollo Lunar Mod Postconditions that the output signals of the subsystem ule, as well as a model of the Reaction Jet Controller (RJC) must satisfy. we focused on the embedded controller. A very high-level view Invariants that must always be satised by the subsystem. is shown in Figure 6. The RJC receives attitude measurements Note that a subsystem translates into an executable function and desired attitude values, and generates control signals to that is called by some scheduler, periodically. Hence, the above activate yaw, pitch and roll thrusters. conditions and invariants can be checked during execution of that function block. Figure 5 shows an example of specifying contracts on a A. Step 1: Dene Property subsystem block. The internal details of the subsystem are The Yaw Jets output of the RJC block is a value from the not important, but rather serve to show how our approach set -2, 0, 2, which indicates that the yaw thruster should have allows the complexities of certain elements to be ignored when a negative thrust, no thrust or a positive thrust, respectively. writing specications. The subsystem in Figure 5 has two Suppose we wish to verify the property that the Yaw Jets inputs, x and y, and one output, z. Suppose we wish to check output can never go directly from -2 to 2 or directly from the following property: either x is equal to 0 and y is between 2 to -2: at least one output of 0 must always be found in0 and 10, or x is equal to 1 and y is between 10 and 20. between. Section III showed how a property like this could be Suppose we also wish to check that if x is 0, then the output built manually using automata. Using the scope and pattern 125

automata as building blocks, one could dene this property directly in Stateow. As mentioned above, we have also developed a custom extension to the Simulink environment that allows properties to be entered in an easier way using dialog forms. These dialogs decompose the patterns detailed in Section III-A: the user selects a pattern, enters a scope and a property and the equivalent automaton is generated, including input ports. Our rst task is to decide which pattern we need to implement the property that the Yaw Jets can never go directly from -2 to 2 or directly from 2 to -2. Part of the property states that we do not want the value of Yaw Jets to be -2 during a certain scope. The absence pattern ts this requirement, as it checks to see that some condition never occurs. The dialog form for the absence pattern is shown in Figure 7. This dialog guides the user through the process of dening a property. After dening the condition that should never hold (Command == -2), we dene the scope during which this condition should hold. In this example, we never want Command to go directly from 2 to -2, so the condition that Command should never be -2 should hold after Command is equal to 2 and before Command is equal to 0. The property that Command should never go directly from -2 to 2 is dened in an analogous way using the absence pattern dialog. B. Step 2: Connect generated automata After entering the parameters in the dialog form, the observer automaton monitoring the property is generated, as shown in Figure 8. The states representing the scope portion of the property are white, and the states representing the pattern are shaded. The transition from the initial state is taken when Command is 2, at which point we are in scope and want to verify the absence of the condition that Command is -2 before it is 0. If the value of Command is -2 before it is 0, the transition to the inner error state is taken, which sets the propertyOK variable to false and emits the error event. When error event is emitted, the outer transition to the error state is taken and the automaton remains in this state. Note that while the automaton is in scope, system termination (the end event) will not cause the property to be violated as long as Command has not been set to -2. The input parameter for command is automatically generated, so the user must connect the Yaw Jets signal to the automaton so that it can be monitored. In Figure 6, the Command Constraint and Command Constraint2 automata have already been connected to the Yaw Jets signal.
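For the Yaw Jets requirement just described, the generated observer of Figure 8 can be pictured roughly as the following hand-written C state machine (absence pattern inside an until scope). This is an illustrative sketch, not the toolchain's actual output; the enum, the function name and the assert-based error reporting are assumptions.

/* Sketch: after Command becomes 2, it must not become -2 before returning
 * to 0. The symmetric automaton for the -2 -> 2 direction is analogous. */
#include <assert.h>

enum scope_state { OUT_OF_SCOPE, IN_SCOPE, FAILED };

static enum scope_state st = OUT_OF_SCOPE;

void command_observer_step(int command)
{
    switch (st) {
    case OUT_OF_SCOPE:
        if (command == 2)            /* scope entered: "after Command == 2" */
            st = IN_SCOPE;
        break;
    case IN_SCOPE:
        if (command == -2)           /* absence pattern violated            */
            st = FAILED;
        else if (command == 0)       /* scope left: "before Command == 0"   */
            st = OUT_OF_SCOPE;
        break;
    case FAILED:
        break;                       /* remain in the error state           */
    }
    assert(st != FAILED && "Yaw Jets command went from 2 to -2 without 0");
}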

Fig. 6. High-level view of the Apollo Autopilot. The Command Constraint automaton was automatically generated using the property defined in Figure 7. The second automaton was also automatically generated.

Fig. 7. Property dialog. The property says that after the input variable Command becomes 2, it should never be equal to -2 before returning to 0.

either method, property violations can be reported to the user in the form of a stack trace showing the sequence of method invocations that led to an error state. V. RELATED WORK

In more traditional forms of software development, verication is done in one of two ways. Either an abstract model of the software is created and veried, or the executable code itself is veried. [11] discusses the ongoing trend towards placing the verication efforts directly on the executable code rather than on models. In MBD, however, one intentionally begins with C. Step 3: Verication with JPF models and gradually renes them until they are synthesized The nal step is to invoke the code generator and use JPF to into the executable code, and ideally both artifacts can be verify our properties. There are two ways JPF can check the veried. Our approach eases the burden of both specifying code for property violations. The rst uses concrete inputs and checking properties on code generated during the MBD provided by the user. If this is done, JPF will perform a process. concrete system execution using those inputs and report any A number of tools are available for verifying Simulink/Sproperty violations in the form of stack traces. The second tateow models. Simulink Design Verier [12] and Reactis way JPF can check for property violations uses the symbolic [13] are commercial tools for checking model properties. execution module. In this case, JPF will try to determine inputs [14] describes an approach that is based on hybrid automata: to the system that will cause properties to be violated. With models are translated from Simulink to a hybrid automata for126


Fig. 8. Generated observer automaton implementing the property specified in Figure 7. Scope states are white and pattern states are shaded.

time properties, and (2) dealing with concurrency. Translated Simulink subsystems are typically executed periodically, with a fixed rate. Timing properties can be related to a single execution run (i.e., the worst-case execution time of a function block), as well as to the temporal properties of the system over multiple execution runs (e.g., the system reacts to a triggering event within a bounded number of execution runs). Translated Simulink subsystems are also completely sequential; they are usually translated to functions in an implementation language. In order to run them on an execution platform, they have to be embedded into OS processes, and their communication and synchronization implemented outside of Simulink. Hence, we need to model these embeddings, and how the threads containing the function blocks communicate and synchronize. These topics are the subject of on-going research. VII. ACKNOWLEDGMENTS The work described in this paper has been supported by NASA under Cooperative Agreement NNX09AV58A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. The authors would also like to thank Michael Whalen for valuable discussions and feedback. REFERENCES
[1] Á. Lédeczi, A. Bakay, M. Maroti, P. Völgyesi, G. Nordstrom, J. Sprinkle, and G. Karsai, Composing domain-specific design environments, IEEE Computer, vol. 34, no. 11, pp. 44-51, 2001.
[2] MATLAB, version 7.10.0 (R2010a). Natick, Massachusetts: The MathWorks Inc., 2010.
[3] G. J. Holzmann and R. Joshi, Model-driven software verification, in SPIN, 2004, pp. 76-91.
[4] J. Porter, P. Völgyesi, N. Kottenstette, H. Nine, G. Karsai, and J. Sztipanovits, An experimental model-based rapid prototyping environment for high-confidence embedded software, in IEEE International Workshop on Rapid System Prototyping, 2009, pp. 3-10.

malism, and existing techniques for checking hybrid automata can then be applied. Our approach is complementary to these methods and ensures that the properties proved by these tools also hold for the generated code. Our approach to specifying properties through patterns is based on the work of Dwyer et al. in [6]. The pattern library described there contains a general description along with mappings into multiple formalisms, including LTL, CTL and quantified regular expressions. Our implementation uses dialog forms to choose and configure simple patterns from which observer automata are generated, and includes a library of observer automata for individual scopes and properties from which more complex patterns can be defined. Runtime monitoring [15] is a related area in which formally specified properties are typically translated into executable code that is used to check program properties during program execution. Recent work in this area includes optimizing such monitors through static analysis techniques [16]. Our approach translates properties specified using observer automata into executable code that is checked by a software model checker, and translates contracts on model elements into annotations that are used by the model checker. VI. CONCLUSION

[5] [6] [7] [8]

Checking model level properties on implementation code [9] is a useful approach for practical model-driven development. In this paper, we have shown how relevant properties can be specied on the model level and then translated into [10] implementation code that can be veried with a software [11] model checker. Our approach is a pragmatic realization of the work described in [6], in the context of the Simulink/Stateow [12] environment. We have shown how the specication patterns [13] can be instantiated from observer automata templates for [14] scopes and properties and how subsystem blocks can be annotated with pre-, post-conditions, and invariants that are monitored by the software model checker. We have shown the [15] use of the approach on a realistic example. Our approach allows two ways for specication: contracts [16] and property specications based on patterns (that are translated into observer automata). For designers of embedded systems two extensions would be very useful: (1) specifying real127

Automatic Generation of System-Level Virtual Prototypes from Streaming Application Models


Philipp Kutzer, Jens Gladigau, Christian Haubelt, and Jürgen Teich
Hardware/Software Co-Design, Department of Computer Science University of Erlangen-Nuremberg, Germany Email: {philipp.kutzer, jens.gladigau, haubelt, teich}@cs.fau.de
Abstract - Virtual prototyping is a more and more accepted technology to enable early software development in the design flow of embedded systems. Since virtual prototypes are typically constructed manually, their value during design space exploration is limited. On the other hand, system synthesis approaches often start from abstract and executable models, allowing for fast design space exploration, considering only predefined design decisions. Usually, the output of these approaches is an "ad hoc" implementation, which is hard to reuse in further refinement steps. In this paper, we propose a methodology for automatic generation of heterogeneous MPSoC virtual prototypes starting with models for streaming applications. The advantage of the proposed approach lies in the fact that it is open to subsequent design steps. The applicability of the proposed approach to real-world applications is demonstrated using a Motion JPEG decoder application that is automatically refined into several virtual prototypes within seconds, which are correct by construction, instead of using error-prone manual refinement, which typically requires several days.

Fig. 1. Application model of a Motion JPEG decoder, clustered and mapped to an architecture template. The architecture template consists of a CPU, a hardware accelerator (HW) and an external memory. All the components are connected via a bus. (The figure shows the actors Source, Parser, Recon, MComp, IDCT and Sink connected by the FIFO channels c1 to c8, and their mapping onto the CPU, the hardware accelerator and the memory.)

I. I NTRODUCTION Today, modern Multi-Processor System-on-Chip (MPSoC) architectures consist of a mixture of microprocessors, digital signal processors (DSPs), memory subsystems, and hardware accelerators, as well as interconnect components. It is noticeable that the adoption of programmable logic in such electronic systems is more and more increasing. Driven by this rise, the process of software development becomes the dominating part during system design. In the course of software development, software engineers have to cope with operating systems, communication stacks, drivers, and so forth. In order to allow early software development, virtual prototyping is a more and more frequently used technology in Electronic System Level (ESL) design. There, the desired target platform is modeled as an abstract, executable, and often completely functional software model. Hence, the virtual prototype includes all functional properties of the target platform, while non-functional properties, such as timing behavior, are mostly disregarded. In contrast to FPGA-based prototyping, virtual prototypes are deployed before architectural models on register-transferlevel are available. Due to this early availability, the overall time spent on hardware and software design can be reduced,
Supported in part by the German Science Foundation (DFG Project HA 4463/3-1).

because software can be implemented, rened, tested, debugged, and veried on realistic hardware models in parallel to the hardware design process. Nevertheless, additional time is needed for implementing such prototypes from the functional and desired architectural system specication. This drawback could be avoided with an automatic virtual prototype generation. This would further speed up the design process and in addition, errors, often made in manual prototype generation, are avoided. Describing a complex application abstracted as an actororiented model [1] is a more and more accepted approach in ESL design. Such models are used to describe the functional behavior of the application. Therefore, they consist of concurrently executing actors, which communicate over abstract channels. In our approach, the communication takes place via channels with FIFO semantics. An example is shown in Fig. 1 for a small actor-oriented model, a Motion JPEG decoder, which consists of the actors Source, Parser, Reconstruction (Recon), Inverse Discrete Cosine Transformation (IDCT), Motion Compensation (MComp), and Sink, as well as FIFO channels c1 to c8 . In order to generate a virtual prototype starting with an actor-oriented model, additional information about the system architecture candidates and the


mapping possibilities of the functional components have to be specified. In the lower part of Fig. 1, a possible mapping to an architecture template is given by the dotted arrows. In the following, we present a method for automatic generation of MPSoC virtual prototypes from actor-oriented models. Our proposed approach performs the virtual prototype generation in two steps: (i) Based on a given resource mapping, communication within the application model is refined to transactions in the virtual prototype, and controllers for intra-resource communication are generated. (ii) The virtual prototype is generated by assembling cycle-accurate processor models, memory models, and models for hardware accelerators using bus models, and by synthesizing the software for each processor according to the given mapping. The remainder of this paper is structured as follows: Section II reviews related work. In Section III, a brief overview of our approach is given. Section IV describes application modeling. In Section V, the automatic generation of architectural TLM models is discussed in more detail. Section VI describes the architectural refinement in more detail. Section VII presents experimental results from applying the proposed prototype generation approach to a Motion JPEG decoder, a multimedia streaming application mapped onto an MPSoC architecture. Finally, conclusions are given in Section VIII.

II. RELATED WORK

As virtual prototypes are nowadays commonly used in system-level design flows, several commercial as well as free of charge tools exist to build, simulate and evaluate such prototypes. Most prominent are Platform Architect from CoWare, CoMET from VaST and OVPsim [2] from Imperas. The two first-mentioned tools were acquired by Synopsys [3] within the last year. Most existing virtual prototyping tools support the integration of transaction level models written in SystemC [4] in the prototypes. However, none of them allows the automatic transformation of a formal description, like an actor-oriented model, to a virtual prototype. In general, mapping formal models to MPSoCs is a current research topic in system synthesis (e.g., see [5], [6]). There exist several system-level synthesis tools that automatically map formally described applications to an MPSoC target, like Daedalus [7], Koski [8], and SystemCoDesigner [9]. All these approaches pursue a common purpose: final product generation. This means they have to cover the complete design flow, from a high-level application specification down to the running system. As a consequence, their integration into existing design flows is hard to establish. In contrast to system synthesis tools, our proposed approach targets automatic virtual prototype generation. In this scenario, important design decisions are reflected in the generated prototype, while support for further manual refinement is retained. Hence, the product quality can still be influenced by a designer and, even more importantly, our proposed approach

Fig. 2. Design flow from an application model, represented by an abstract executable specification, to a virtual prototype. The flow includes automatic mapping of actor-oriented models to TLM architecture models, as well as virtual prototype generation (a 2-step prototype generation: TLM generation, followed by prototype generation and software refinement).

could be easily integrated in established industrial design ows. III. V IRTUAL P ROTOTYPE G ENERATION - OVERVIEW The goal of our system-level design approach is to automatically implement abstract system descriptions written in SystemC as virtual MPSoC prototypes. The associated design ow is depicted in Fig. 2. At the beginning of our ESL design process, an abstract model has to be derived for the desired application. In our approach, a distinction is drawn between the application model, which describes the functional behavior of the system, and the architecture template, which represents all architecture instances of the system. The system behavior is modeled in form of actor-oriented models, which only consist of actors and channels, as depicted in the Motion JPEG example from Fig. 1. Actors are the communicating entities, which are executed concurrently. For communication, tokens are produced and consumed by actors, and transmitted via dedicated channels. The architecture template of the system is represented by a heterogeneous MPSoC platform, which is specied by connected cores. Single actors or clusters of actors can either be mapped onto processor elements (CPU) or on dedicated hardware accelerators (HW), as depicted in Fig. 1. Hardware accelerators will typically be used for computationally intensive or time critical parts of the application. In general, System-on-Chips include both processor elements as well as hardware accelerators. Depending on the actor mapping,


communication channels can either be mapped onto internal memory of the data processing units (CPU or HW accelerators), or onto shared memory modules. In the Motion JPEG decoder example, all channels except c1 and c8 are mapped to the hardware accelerator, as communication takes place internally. Channels c1 and c8 represent the communication between the CPU and the dedicated accelerator, and hence have to be mapped to the shared memory. After modeling the application and the architecture template, and defining a mapping of functional to structural elements, an architectural model will be automatically generated. In this intermediate model, the actors are clustered according to the mapping on architectural resources. Due to the fact that virtual prototypes are usually implemented using transaction level modeling (TLM), we use the OSCI TLM-2.0 [10] standard in our design flow. For virtual prototyping, parts of the architectural model are subsequently replaced by the corresponding resources from a virtual component library, which consists of cycle-accurate processor models, as well as models of communication entities. Besides the architectural refinement, software is generated and cross-compiled for each CPU, to match its instruction set architecture (ISA). The resulting virtual prototype can now be used for further software and communication refinement. Moreover, due to the cycle-accurate processor models, performance estimation becomes possible. The steps of architectural mapping as well as prototype generation will be described later in more detail. First, our application modeling approach is described.

IV. APPLICATION MODEL

This section introduces our concept of actor-oriented modeling, which is necessary to understand our proposed mapping approach. In actor-oriented models, actors are potentially executed concurrently and communicate over dedicated abstract channels. Thereby, they produce and consume data (so-called tokens), which are transmitted by those channels. These models may be represented as bipartite graphs, consisting of channels c ∈ C and actors a ∈ A. In the following, we use the term network graph for this kind of representation.

Definition 1 (network graph): A network graph is a directed bipartite graph Gn = (A, C, P, E), containing a set of actors A, a set of channels C, and a channel parameter function P : C → N × V* that associates with each channel c ∈ C its buffer size n ∈ N = {1, 2, 3, ..., ∞} and a possibly non-empty sequence v ∈ V* of initial tokens, where V* denotes the set of all possible finite sequences of tokens v ∈ V. Additionally, the network graph contains directed edges e ∈ E ⊆ (C × A.I) ∪ (A.O × C) between actor output ports o ∈ A.O and channels, as well as between channels and actor input ports i ∈ A.I. An example of a network graph is already given in the upper part of Fig. 1.
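To make Definition 1 concrete, the following C++ sketch shows one possible in-memory representation of a network graph; it is illustrative only, and the type and field names are not taken from the SysteMoC code base.

#include <cstddef>
#include <string>
#include <vector>

// Illustrative data structures mirroring Definition 1; all names are hypothetical.
struct Channel {
    std::string name;
    std::size_t bufferSize;                 // n in N = {1, 2, 3, ..., inf}; a sentinel such as SIZE_MAX may stand for "unbounded"
    std::vector<double> initialTokens;      // the (possibly empty) sequence of initial tokens
};

struct Actor {
    std::string name;
    std::vector<std::string> inputPorts;    // a.I
    std::vector<std::string> outputPorts;   // a.O
};

// A directed edge either connects a channel to an actor input port (C x A.I)
// or an actor output port to a channel (A.O x C).
struct Edge {
    std::string from;   // channel name, or "actor.outputPort"
    std::string to;     // "actor.inputPort", or channel name
};

struct NetworkGraph {
    std::vector<Actor>   actors;    // A
    std::vector<Channel> channels;  // C, with the channel parameter function P encoded in bufferSize/initialTokens
    std::vector<Edge>    edges;     // E, a subset of (C x A.I) union (A.O x C)
};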

Fig. 3. Visual representation of an actor which sorts input data according to its algebraic sign. The actor consists of one input port i1 and two output ports o1 and o2. (The figure shows the actor's FSM with a single state start and two self-loop transitions, i1(1) & gcheck & o1(1) / fpositive and i1(1) & !gcheck & o2(1) / fnegative; the guard gcheck tests i1[0] >= 0, fpositive copies the token to o1, and fnegative copies it to o2.)

Definition 2 (Channel): A channel is a tuple c = (I, O, n, d) containing channel ports partitioned into a set of channel input ports I and a set of channel output ports O, its buffer size n ∈ N = {1, 2, 3, ..., ∞}, and a possibly empty sequence d ∈ D* of initial tokens, where D* denotes the set of all possible finite sequences of tokens d ∈ D.

In the following approach, we will use SysteMoC [11], a SystemC [4] based library for modeling and simulating actor-oriented models. In the basic SysteMoC model, each channel is a unidirectional point-to-point connection between an actor output port and an actor input port, i.e., |c.I| = |c.O| = 1. The communication between actors is restricted to these abstract channels, i.e., actors are only permitted to communicate with each other via channels, to which the actors are connected by ports. In a SysteMoC actor, the communication behavior is separated from its functionality. The communication behavior is defined as a finite state machine (FSM); the functionality is a collection of functions that can access data on channels via ports. These functions are classified into actions and guards, and are driven by the finite state machine. SysteMoC thus follows the FunState [12] (Functions driven by State machines) approach. An action of an actor is able to access data on all channels the actor is connected to, and is allowed to manipulate the internal state of the actor, implemented by internal variables. In contrast, a guard function is only allowed to query, but not to alter, either the internal state or the data on channels. A graphical representation of a SysteMoC actor is given in Fig. 3. The actor Sorter, which is used to sort input data tokens according to their algebraic sign, possesses one input port (i1) and two output ports (o1 and o2). Tokens from input port i1 will be forwarded to output port o1 by the function fpositive if the activation pattern i1(1) & gcheck & o1(1) of the state transition from state start back to state start evaluates to true. This pattern determines under which conditions the transition may be taken. In SysteMoC, the activation pattern can depend on


class Sorter : public smoc_actor {
public:
  smoc_port_in<double>  i1;
  smoc_port_out<double> o1;
  smoc_port_out<double> o2;
  smoc_firing_state start;

  Sorter(sc_module_name name)
    : smoc_actor(name, start) {
    start = (i1(1) &&  GUARD(check) && o1(1)) >> CALL(positive) >> start
          | (i1(1) && !GUARD(check) && o2(1)) >> CALL(negative) >> start;
  }

private:
  bool check(void) const {
    return i1[0] >= 0;
  }
  void positive(void) {
    double in = i1[0];
    o1[0] = in;
  }
  void negative(void) {
    double in = i1[0];
    o2[0] = in;
  }
};

Listing 1. SysteMoC code for the actor Sorter. The FSM of the actor is defined in the constructor of the actor class, whereas the functionality is encoded as private member functions.
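For context, a complete network containing the Sorter actor could be assembled roughly as follows. This is a hedged sketch: it assumes the smoc_graph base class and the connectNodePorts/smoc_fifo helpers of SysteMoC [11], and the Source, PosSink and NegSink actor classes are hypothetical placeholders with ports named out and in.

// Hedged sketch only: assumes SysteMoC's smoc_graph, smoc_fifo and connectNodePorts
// helpers as described in [11]; Source, PosSink and NegSink are hypothetical actors.
class SorterNetwork : public smoc_graph {
private:
  Source  src;
  Sorter  sorter;
  PosSink posSink;
  NegSink negSink;
public:
  SorterNetwork(sc_module_name name)
    : smoc_graph(name),
      src("src"), sorter("sorter"), posSink("posSink"), negSink("negSink") {
    // Each connection becomes a FIFO channel; here every buffer holds 16 tokens.
    connectNodePorts(src.out,   sorter.i1,  smoc_fifo<double>(16));
    connectNodePorts(sorter.o1, posSink.in, smoc_fifo<double>(16));
    connectNodePorts(sorter.o2, negSink.in, smoc_fifo<double>(16));
  }
};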

Fig. 4. Clustered network graph of the Motion JPEG example. The cluster X(x1) represents the CPU, X(x2) the communication bus and X(x3) the hardware accelerator. Cluster X(x4) represents the whole system.

some internal state of the actor, on the availability and values of tokens on input channels, and on the availability of free space on output channels. In our example, the state transition will be taken if at least one token is available on the input port (i1(1)), the guard gcheck evaluates to true (the data on the input channel has a positive algebraic sign), and output port o1 has space for at least one additional token (o1(1)). Analogously, the second transition is taken if the input data is negative. The corresponding SysteMoC code is given in Listing 1. To summarize, the transition-based execution of SysteMoC actors can be divided into four steps: (i) Evaluation of the activation patterns of all outgoing state transitions in the current state qc ∈ Q. (ii) Non-deterministic selection and execution of one activated transition t ∈ T. (iii) Execution of the corresponding action f ∈ a.F. (iv) Notification of token consumption/production on the channels connected to the corresponding actor input and output ports after completion of the action, as well as transition to the next state. During system synthesis from actor-oriented models, the actors a ∈ A and the communication channels c ∈ C are mapped to components of a system architecture. To reflect the architectural structure in network graphs after mapping, nodes can be clustered. For the representation of clustering, we define a clustered network graph.

Definition 3 (clustered network graph): A clustered network graph Gcn = (Gn, T) consists of a network graph Gn and a rooted tree T such that the leaves of T are exactly the vertices of Gn. Each node x of T represents a cluster X(x) of the vertices of the network graph Gn that are leaves of the subtree

rooted by x. The representation as a tree illustrates the hierarchical structure of the system. This means the root of T represents the whole system, whereas nodes x ∈ T with height(x) = 1 represent the components of the system. As reuse of parts of models is common in the design process, hierarchical structures with more than two levels are possible. The clustered network graph of the example from Fig. 1 is depicted in Fig. 4. Although we use SysteMoC, our approach is not restricted to this framework and can be adapted to other frameworks for actor-oriented design, e.g., pure SystemC FIFO channel communication. A deeper insight into SysteMoC is given in [11].

V. GENERATING THE TLM ARCHITECTURE

Transaction level modeling (TLM) with SystemC has become the de-facto industry standard for virtual prototyping and architectural modeling [13], [14]. These models are characterized by an encapsulation of low-level communication details. Due to this abstraction, very fast simulation speed can be achieved. To enable fast simulation, details of bus-based communication protocol signaling are replaced with single transactions. In the course of releasing a TLM standard (OSCI TLM-2.0) to enforce interoperability of models, the Open SystemC Initiative defined two coding styles [15]: the loosely-timed (LT) and the approximately-timed (AT) coding style. The loosely-timed coding style allows only two timing points to be associated with each transaction, namely the start and the end of the transaction. This timing granularity of communication is sufficient for software development using a virtual prototype model of an MPSoC. A transaction in an approximately-timed model is broken down into multiple phases, with timing points marking the transition between two consecutive phases. Due to the finer granularity of timing, approximately-timed models are typically used in architectural exploration and performance analysis. As our approach targets software development, or more precisely the refinement of


parts of the application in software, by means of virtual prototyping, the loosely-timed coding style is adequate [15]. As described, the actors a ∈ A and the communication channels c ∈ C are partitioned into clusters X(x) and mapped to components of a system architecture. Due to the mapping, the channel communication can either be internal, in case both communicating actors aa and ab are mapped onto the same resource (aa ∈ X(xy) and ab ∈ X(xy)), or external, in case the communication crosses cluster boundaries (aa ∈ X(xy) and ab ∉ X(xy)). For intra-resource communication, FIFOs can be put in the private memory of the architectural component, whereas FIFOs of inter-resource communication, like c1 and c8 from Figure 1, have to be placed in an external memory model. Either way, the actor communication semantics through ports are not altered, in order to reuse the existing actors written in SystemC-based SysteMoC. So, the challenge of this step in the design flow is to map the FIFO-based communication via dedicated channels to a memory-mapped, bus-based communication with global and local shared memory. Since our abstract communication semantics (read, write, commit) calls for uniform channel access, access transparency has to be ensured after mapping to the architectural template, resulting in the transaction level architectural model. As communicating actors on different resources are executed concurrently, simultaneous access to FIFO storage has to be avoided. This means that memory coherence as well as cache coherence has to be guaranteed. To cope with actor clustering and to ensure synchronized channel access, independent of the communication mapping, we use aggregators and adapters in our approach that implement a suitable communication protocol [16]. Adapters, by which the SysteMoC ports (i ∈ A.I and o ∈ A.O) are substituted, serve as links between the actors and the transaction level. Due to the fact that more than one actor can be mapped to one resource, and actors can possess multiple ports, an aggregator is needed for each transaction level component (X(xi) : height(xi) = 1) to encapsulate the desired number of adapters. These aggregators perform transaction level communication and implement the interface of the component to the rest of the architectural model. There is no need to connect adapters for internal channels with the aggregator, because no communication will take place over component boundaries. In Fig. 1, communication between the actors Parser, Recon, MComp and IDCT is internal and can be implemented using, e.g., internal memory. In our approach, we use a transaction level memory model for each communication channel. In the following, we will describe the functionality of the adapter and the aggregator in more detail.

A. Adapter

An adapter adapts between transactions in the virtual prototype and the asynchronous FIFO channel communication used in the application model. Hence, the communication adapter implements two different interfaces. The interface towards the actor is equivalent to the abstract channel, which has to be

Fig. 5. Mapping of parts of cluster X(x3) from the model depicted in Fig. 1 to an architectural component. The internal communication takes place over a virtual channel, which substitutes the abstract channel. Therefore, adapters adapt between the abstract model and the transaction level model. The FIFO queue semantics are implemented using a TLM memory model. (The figure shows the actors Parser and Recon connected via the adapters Out c2 and In c2 of the virtual channel c2, backed by a TLM memory model.)

replaced. To sustain the abstract communication semantics, the adapter needs to access tokens in a random manner and to commit completed transitions via this interface. Therefore, a conversion of the token data type (e.g., serialization and deserialization) has to be performed in the adapters. An adapter also has to respect the abstract channel synchronization mechanism. This means the adapter has to provide an interface through which it can be notified when tokens on the channel are produced or consumed, respectively. This notification can be used to trigger the corresponding actor waiting for free space or tokens on the channel. The transaction level interface consists of three transaction level communication sockets (see Fig. 5). One is used for data transmission. The actor, which is connected to the adapter, can read or write data from a memory through this socket. The other two sockets are needed to sustain the channel synchronization. For synchronization, the adapters communicate among each other over arbitrary TLM communication resources. Therefore, a dedicated address has to be assigned to each adapter. Due to the fact that the SysteMoC channels possess memory, the FIFO storages have to be mapped to resources. As different locations are possible, we allocate the storage in a memory to which the adapters are connected. For internal communication, the sockets of the adapters can be directly coupled with each other, as depicted in Fig. 5. The synchronization sockets of the two communicating adapters are directly coupled, whereas the data sockets are connected with the memory. The memory for external communication is accessible over a bus system, to which the aggregator is connected (see Fig. 6). Allocating the storage in one adapter, or splitting and distributing the storage over both communicating adapters, is also possible. Independent of the chosen implementation and mapping, each adapter needs to know to which address space its buffer is mapped, in order to read or write tokens.
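A skeleton of such an adapter, sketched with the OSCI TLM-2.0 convenience sockets (tlm_utils), might look as follows; the class, socket and method names are illustrative and do not reproduce the authors' implementation.

#include <cstdint>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>

// Illustrative sketch of a channel adapter: one data socket towards the FIFO
// storage in memory, and two synchronization sockets towards the peer adapter.
struct ChannelAdapter : sc_core::sc_module {
  tlm_utils::simple_initiator_socket<ChannelAdapter> data_socket;  // read/write tokens in the mapped FIFO storage
  tlm_utils::simple_initiator_socket<ChannelAdapter> sync_out;     // notify the peer adapter after a commit
  tlm_utils::simple_target_socket<ChannelAdapter>    sync_in;      // receive notifications from the peer adapter

  sc_core::sc_event tokens_changed;   // wakes up the actor waiting for tokens or free space
  std::uint64_t fifo_base_address;    // address of the FIFO storage this adapter accesses

  SC_CTOR(ChannelAdapter)
    : data_socket("data_socket"), sync_out("sync_out"), sync_in("sync_in"),
      fifo_base_address(0) {
    sync_in.register_b_transport(this, &ChannelAdapter::on_peer_notification);
  }

  // Called when the peer adapter signals that it produced or consumed tokens.
  void on_peer_notification(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
    tokens_changed.notify(sc_core::SC_ZERO_TIME);
    trans.set_response_status(tlm::TLM_OK_RESPONSE);
  }
};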


TABLE I
5 MEASUREMENT TERMS OF THE DIFFERENT VIRTUAL PROTOTYPES

VP    Instructions    Host time [s]    Simulated (VP) time [ms]    CPI     MIPS
I     4944835683      1997             44285                       1.79    111.66
II    5319738192      521              30494                       1.15    174.45
III   5726625319      1791             29222                       1.02    195.97
IV    5765601708      660              26993                       0.94    213.59
V     6188808202      1760             7224                        0.23    856.66
VI    3492102237      550              30870                       1.77    113.12

Fig. 6. Mapping of the cross-component communication between HW and CPU from Fig. 1. For the sake of clarity, the internal communication structure is omitted. (The figure shows the aggregators of clusters X(x1) and X(x3) with the adapters Out c1/In c8 and In c1/Out c8, connected via a TLM bus model to a TLM memory model.)
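As a cross-check of Table I (Section VII computes CPI and MIPS against the simulated time, not the host time), the figures are internally consistent; for prototype I, with a processor clock of roughly 200 MHz inferred from the data (the clock rate is not stated in the paper):

\[
\mathrm{MIPS}_{\mathrm{I}} \;=\; \frac{4\,944\,835\,683\ \text{instructions}}{44.285\ \mathrm{s}} \;\approx\; 111.7\times 10^{6}\ \text{instructions/s},
\qquad
f_{\mathrm{clk}} \;\approx\; \frac{\mathrm{CPI}_{\mathrm{I}}\cdot 4\,944\,835\,683}{44.285\ \mathrm{s}} \;\approx\; 200\ \mathrm{MHz}.
\]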

B. Aggregator

As real computational resources like CPUs or DSPs have a limited number of connection pins, each node x ∈ T besides the root node needs a mechanism that aggregates the children connected to x. For nodes that represent data transferring units, like buses (x2), this is done by arbitration and address translation. Unlike the communication resources (data transferring units), the computational resources (data processing units) need an aggregator for this purpose. The aggregators contain TLM ports to perform transaction level cross-component communication. Therefore, they implement the communication protocol for the connected adapters at the transaction level. For communication, the aggregators communicate among each other over arbitrary TLM communication resources. For this purpose, each aggregator is assigned a dedicated address range. Its size depends on the number of adapters registered to the aggregator. So each adapter is assigned a single address, at which it is accessible for event-based synchronization. Besides its own address range, each aggregator has to know the addresses of the peer adapters, which are associated with its registered adapters, and the addresses of the corresponding FIFOs in memory.

VI. VIRTUAL PROTOTYPE GENERATION

In the final step of our automatic design flow, a virtual prototype is generated based on the transaction level architectural model.

A. Architectural Refinement

In order to allow for early software development, parts of the architecture have to be substituted by virtual component models. In our approach, all resources except the hardware accelerators are replaced. As our approach focuses on software

development, the inserted processor models must provide an instruction set simulator, in order to simulate or, furthermore, debug the software running on the models. Therefore, we use a commercial virtual component library [3], which provides the opportunity to integrate TLM. This feature is necessary to couple the hardware accelerators with the virtual components. In order to sustain the abstract channel synchronization mechanism, an interrupt controller is added for each processor element. By the use of this controller, the processor element can be informed about channel data modifications by another processor or hardware accelerator.

B. Target Software Generation

During the process of target software generation, the actor description in SystemC is transformed into standard C/C++ code. Therefore, the communication ports of the actor are replaced by pointers to FIFO interfaces, and the finite state machine is encoded as a switch-case statement. The FIFO interfaces represent the communication interface equivalent to the TLM communication adapters described in Section V. Moreover, scheduling strategies have to be implemented in case multiple actors are mapped onto the same processor element.

VII. EXPERIMENTAL RESULTS

In order to show the applicability of our approach, we present our first results on generating virtual prototypes from an actor-oriented Motion JPEG model. Therefore, we use a more fine-grained model than the one given in Fig. 1, which consists of 19 actors, interconnected by a total of 56 FIFO channels. In Table I, the results of several test cases in terms of different mappings are presented. Since the architecture template contains 19 processors, 19 hardware accelerators, and a shared memory, which are all connected by a bus, many architecture instances exist. With our approach, it is possible to generate virtual prototypes from all of them. To show the applicability, we consider only a few mappings serving as representatives. Our first prototype (I) consists of a single processor (ARM926), onto which all actors are mapped. For the next two test cases, two processors are allocated and connected via a bus. For this architecture instance, two mappings are tested: (i) The IDCT actors are mapped to one processor, all remaining actors to the other one (II). (ii) The actors are mapped to the processors alternately, i.e., the


neighbor of each actor in the decoding pipeline is mapped to a different processor than the actor itself (III). For the FIFO communication between the two processors, a memory is additionally allocated and connected to the bus. In the fourth prototype (IV), three processors and a memory are allocated. Here, the actors Source and Sink are clustered onto one processor, IDCT is mapped to the second one, and the remaining actors are mapped to the third. To take full advantage of pipelined execution, 19 processors are allocated in the fifth prototype (V). In the last test case, VI, one processor and one hardware accelerator are allocated. This test case is analogous to the second prototype, except that the functionality of the IDCT actors is moved to the hardware accelerator. Figure 7 shows the time needed for prototype generation and compilation. It can be seen that the time spent on prototype generation is nearly independent of the mapping, whereas the time for compiling depends on the components of the prototype. On the one hand, it is obvious that the more processors are allocated, the more time is needed for compiling. On the other hand, the code for the transaction level hardware accelerators is more complex than the code running on processors, so more time is needed for compiling hardware accelerators. However, in summary it can be seen that all virtual prototypes have been generated within seconds, instead of hours.

Fig. 7. Times measured for generation and compilation of the different configurations (generate and compile times in seconds for prototypes I to VI).

In the following, 5 measurement terms are evaluated for decoding 10 images (176x144): total instructions executed; cycles per instruction (CPI); million instructions per second (MIPS); simulation time (host time); and simulated time. In order to make a statement about system performance, not simulator performance, the terms CPI and MIPS relate to the simulated time. The corresponding values are given in Table I. It can be seen that the performance of the prototypes behaves as expected. The more processors are allocated, the better the pipeline of the decoder can be exploited. This means fewer cycles are needed per instruction, which causes a higher MIPS and a lower CPI rate. The small difference between II and III is due to a better workload distribution. As different developer teams implement different parts of the application, it is often unnecessary to refine all components of the TLM architectural model to virtual processor models. Prototype VI shows that there is no appreciable difference in simulated and host time in contrast to the completely refined model (II).

VIII. CONCLUSION

In this paper, we have presented a two-step methodology for automatically generating virtual system-level prototypes from an abstract system specification. Our main goal was to provide a methodology to remove the dependency on hardware availability, needed for software development, in an early phase of the design flow, which starts with an abstract and executable application model. For this purpose, design decisions are first represented in SystemC TLM, which is typically supported by all commercial virtual prototyping tools. Second, the TLM generation is used to assemble the virtual prototype and generate the embedded software. To show the applicability of our approach to real-world applications, we presented first simulation results for an actor-oriented Motion JPEG model.

REFERENCES

[1] E. A. Lee, "Overview of the Ptolemy project," Technical Memorandum No. UCB/ERL M03/25, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA, Jul. 2004.
[2] OVPworld, http://www.ovpworld.org.
[3] Synopsys, http://www.synopsys.com.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[5] O. Moreira, F. Valente, and M. Bekooij, "Scheduling multiple independent hard-real-time jobs on a heterogeneous multiprocessor," in Proceedings of EMSOFT, 2007, pp. 57-66.
[6] P. K. F. Hölzenspies, J. L. Hurink, J. Kuper, and G. J. M. Smit, "Run-time spatial mapping of streaming applications to a heterogeneous multiprocessor system-on-chip (MPSoC)," in Proceedings of DATE, 2008, pp. 212-217.
[7] M. Thompson, T. Stefanov, H. Nikolov, A. D. Pimentel, C. Erbas, S. Polstra, and E. F. Deprettere, "A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs," in Proceedings of CODES+ISSS, 2007, pp. 9-14.
[8] T. Kangas et al., "UML-based multi-processor SoC design framework," ACM TECS, vol. 5, no. 2, pp. 281-320, May 2006.
[9] J. Keinert, M. Streubühr, T. Schlichter, J. Falk, J. Gladigau, C. Haubelt, J. Teich, and M. Meredith, "SystemCoDesigner - An automatic ESL synthesis approach by design space exploration and behavioral synthesis for streaming applications," TODAES, vol. 14, no. 1, pp. 1-23, 2009.
[10] Open SystemC Initiative (OSCI), OSCI SystemC TLM 2.0, http://www.systemc.org/downloads/standards/tlm20/.
[11] J. Falk, C. Haubelt, and J. Teich, "Efficient representation and simulation of model-based designs in SystemC," in Proceedings of FDL, Sep. 2006, pp. 129-134.
[12] L. Thiele, K. Strehl, D. Ziegenbein, R. Ernst, and J. Teich, "FunState - an internal design representation for codesign," in Proceedings of ICCAD. Piscataway, NJ, USA: IEEE Press, 1999, pp. 558-565.
[13] F. Ghenassia, Transaction-Level Modeling with SystemC. Dordrecht: Springer, 2005.
[14] B. Bailey and G. Martin, ESL Models and their Application. Dordrecht: Springer, 2010.
[15] OSCI TLM-2.0 user manual, Open SystemC Initiative, Jun. 2008.
[16] J. Gladigau, C. Haubelt, B. Niemann, and J. Teich, "Mapping actor-oriented models to TLM architectures," in Proceedings of the Forum on Specification and Design Languages (FDL 2007), Barcelona, Spain, Sep. 2007, pp. 128-133.

An Automated Approach to SystemC/Simulink Co-Simulation


F. Mendoza and C. Köllner
FZI Research Center for Information Technology Dept. of Embedded Systems and Sensors Engineering (ESS) Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe, Germany Email: {mendoza|koellner}@fzi.de

J. Becker and K. D. Müller-Glaser


Institute for Information Processing Technology Karlsruhe Institute of Technology, Karlsruhe, Germany Email: {becker|klaus.mueller-glaser}@kit.edu

Abstract - We present a co-simulation framework which enables rapid elaboration, architectural exploration and verification of virtual platforms made up of SystemC and Simulink components. We exploit the benefits of Simulink's graphical environment and simulation engine to instantiate, parametrize and bind SystemC modules which reside inside single or multiple component servers. Any set of SystemC module implementations can be easily added into a component server controlled by Simulink through a set of well-defined interfaces and simulation synchronization schemes. The complexity of our approach is hidden by the automated framework, which enables a designer to focus on the creation and verification of SystemC models and not on the intricate low-level aspects of the co-simulation between different simulation engines.

I. I NTRODUCTION The increasing complexity of embedded systems has constantly triggered the creation of tools and methodologies that can aid in the different stages of their design and verication. Traditional approaches for the design of embedded systems are based on common practices, such as the creation of specications and modeling guidelines, simulation of key concepts and algorithms, and the implementation into hardware prototypes. Though widespread, such approaches are not tted for complex embedded systems, especially when it comes to the implementation into hardware prototypes, where up to 70% of a projects design time is invested in costly functional verication and redesign cycles [1]. There is an evident need for newer approaches that can improve design efciency and the quality of embedded systems. In the recent years System Level Design (SLD) methodologies have gained popularity in the electronic design automation market. SLD was created to cope with the increasing complexity of embedded systems and to enhance the productivity of designers. Regardless of the denition given by each author, the goals of SLD are to enable new levels of design and reuse using higher levels of modeling abstraction and to enable HW and SW Co-Design [2]. The motivation of our work is to incorporate SLD methodologies into the development ow of embedded systems. In the automotive and industry automation elds for example, Simulink is the most accepted simulation and model driven

prototyping tool for continuous and discrete time data ow designs. It is here where the functionality of algorithms are tested and where SLD methodologies can be seamlessly integrated. By adding SLD support to Simulink we can enable rapid architectural exploration in early stages of a design. Our approach uses the right tool for the right job: Simulink for the creation of functional models and test benches, and SystemC for the creation of system level models of hardware implementation solutions. A designer will be able to investigate different architectural partitions of a design that can be tested along with sensors/actuators, controllers, and embedded software. This will provide a better understanding of the functionality and interactions between the different components of a system. The acquired knowledge can then be used for the selection of an appropriate hardware prototype implementation whose functionality can be later on veried with the available simulation results. Our work uses S-Functions developed in C ++ as a common principle to extend Simulinks functionality. An S-Function is basically the source code that describes the behavior of a user dened Simulink block. S-Functions have access to Simulinks simulation engine through a set of dened function calls. Using S-Function function calls and the expressive power of C ++ we are able to instantiate, connect, parameterize and simulate SystemC models inside Simulink. An automated approach for the co-simulation of SystemC and Simulink will be further explained in this paper. Additionally, we present its implementation in the verication of a DSP algorithm. The challenges involved in the time synchronization between the simulation models of Simulink and SystemC are discussed. Simulink uses a time continuous simulation model, while SystemC uses a discrete time event-driven simulation model. In a continuous simulation model, time is discretized into xed or variable time steps, also called integration steps, according to the numerically solver used by the simulator engine. In a discrete time event-driven simulation model, time steps are inherently variable and are calculated according to events scheduled in a queue. Delta cycles are used to update all processes running concurrently in a same time step. Only when the event queue for that time step is empty, the time can



be updated to the next scheduled event.

II. RELATED WORK

A list of available commercial mixed-language simulation tools for the creation of virtual platforms is presented ahead. SystemVision [3] from Mentor Graphics enables the interaction of SPICE, C/C++, SystemC and Verilog-AMS for creating simulations of analog and digital components. System Generator [4] from Xilinx provides a library of their DSP IP blocks translated as Simulink components. They enable co-simulation of their DSP IP blocks with standard Simulink components, with the advantage of being able to synthesize into Xilinx's FPGAs. In the System-on-Chip area, tools for the simulation of multi-processor systems are common, for example VaST from Synopsys and Seamless from Mentor Graphics. A common feature found in mixed-language simulation tools is the use of simulation wrappers to adapt and communicate between different abstraction levels and simulation models. The use of wrappers is commonly found in the simulation of multiprocessor systems, such as [5] and [6], where processor models are wrapped and connected to a SystemC backplane. Further formal approaches for the generation of simulation wrappers are presented in [7] and [8]. The mixed-language simulation approach we focus on is the co-simulation between Simulink continuous models and SystemC discrete models. A systematic analysis of continuous and discrete simulation models along with their respective triggering mechanisms is presented in [9]. We have classified the available co-simulation approaches into two variants, according to the simulation engine that takes control of the whole simulation. In the first case, where SystemC takes control of the simulation, two synchronization schemes can be identified. A basic, though effective, synchronization scheme is presented in [1] and [10], where a SystemC application synchronizes with a Simulink model at fixed time intervals. A more efficient approach based on SystemC's event-driven scheduling mechanism is presented in [9]. The authors use SystemC's event queue to determine the required synchronization points with Simulink. They additionally include the possibility for a Simulink model to trigger additional synchronization points. A continuation of their research is given in [8], where better defined interfaces between Simulink and SystemC are presented. In the second approach, where Simulink takes control of the simulation, it is possible to synchronize with one or more SystemC kernels via user-defined S-Functions triggered at fixed or variable sampling times. The authors in [11] use a synchronization scheme based on variable sampling times extracted from SystemC's event queue; however, no technical details are given on its implementation. Our work has a similar approach, where the time synchronization scheme is controlled by Simulink and allows for a true event-based simulation inside the SystemC sub-system. Each SystemC module instance corresponds to a Simulink block with appropriate input and output connectors. Therefore, a SystemC sub-system can

consist of any number of arbitrarily interconnected module instances. Our contribution differs from the above in the sense that we exploit the benets of Simulinks graphical interface to instantiate, parameterize and bind SystemC modules which reside inside single or multiple component servers. The benet of our approach is increased usability. A hardware designer is able to re-arrange the overall system structure of a virtual platform in order to explore several design aspects and realization alternatives. Therefore, we simplify system composition and simulation control without the need to manually edit SystemC source code. Furthermore, the proposed approach enables the designer to create as many instances of SystemC modules (including multiple instances of the same module) dynamically inside a Simulink model and to interconnect them with other SystemC module instances and native Simulink blocks. III. T HE C O -S IMULATION I NTEGRATION F RAMEWORK
Figure 1. Overview showing the design flow for creating a component server. (The figure shows SystemC modules A, B and C compiled together with infrastructure code into a component server, in three variants: single address space, shared memory, and TCP/IP, each used from Matlab/Simulink through a matching client S-Function.)

A. Overview Figure 1 shows an overall view of our co-simulation integration framework. During a design entry step, the developer denes and creates the SystemC modules which will be available in the repository of a component server. These modules are then compiled along with a SystemC kernel


and infrastructure code in order to build a component server and client application. The component server allows for the dynamic instantiation and interconnection of the SystemC modules inside its repository. The SystemC kernel built inside the component server is able to execute a dynamically created SystemC model. With the help of a set of well-dened interfaces and synchronization schemes, a client encapsulates the functionality to connect to the component server and control the data exchange. Three variants of component servers with their respective clients are available, differing only on the middleware used to connect them. This gives the designer the liberty of dening where a component server will be located, either running inside a Simulink process, locally in another process or in an external server connected via network. In the rst variant, the SystemC kernel and the Simulink solver engine are both executed in the same process, using the same address space. This approach has the advantage of high simulation performance, but has an impact on robustness. If a software bug inside a module implementation crashes the component server, it will also crash the Simulink environment. Furthermore, debugging the design is tedious compared to debugging a standalone application. In the second and third variants, the component server and client are executed as different processes communicating through shared memory or TCP/IP inter-process communication (IPC). Both variants provide a better isolation between processes and provide a convenient way to co-simulate Simulink with one or more SystemC kernels running concurrently. B. Usage Our approach separates the task of SystemC module development from the complex code infrastructure required to interface SystemC and Simulink. A SystemC module designer is provided with a small set of preprocessor macros which, when inserted inside a module class declaration, automatically register that class to a component repository. All the designer has to do is compile his module implementations along with the SystemC library and our infrastructure library. Build parameters let him decide which of the three variants (see Figure 1) will be created. The component server is displayed in Simulink as an SFunction block. All modules present in the servers repository can be instantiated in arbitrary quantities using such component server block (which all refer to the same S-Function). Parameters specify which module to create and, if necessary, its constructor arguments. Each block automatically adopts the interface of the underlying SystemC module instance in terms of input and output ports. For enhanced usability, the user can create a Simulink block library which hides the details of the S-Function parameterization. IV. I MPLEMENTATION A. Infrastructure/Component Server Figure 2 shows the class diagram of the infrastructure code. Throughout this subsection, we will focus on the key concepts
Figure 2. Class diagram of the component server infrastructure.

required to achieve the level of integration and usability described in section III-B. 1) Structural Analysis: By structural analysis we understand the process of determining the interface of a SystemC module. This includes the set of its ports along with their names and type information. The interface description is required by both the diagram editor and the solver engine. In the rst case, it is needed to present a meaningful graphical representation of the module and to check the type compatibility of diagram connections. In the second case, it is required to prepare according data structures which are needed to run the simulation. Approaches that try to reveal the structure of SystemC models by source code analysis are presented in [12] and [13]. These approaches require sticking to certain coding standards. PINAPA [14] is a hybrid approach where the elaboration phase of a SystemC model is virtually executed in order to determine module hierarchy and interface descriptions. We chose a simpler approach where the analysis is done at runtime. The SystemC base class sc_object implements two methods get_child_objects and kind which let the user enumerate dependent objects, such as ports and processes. We store the processed information in instances of ModuleInstanceDescriptor and


Table I
THE EKIND ENUMERATION

Literal    SystemC port type      Description
DataIn     sc_in<DataType>        Inbound dataflow
DataOut    sc_out<DataType>       Outbound dataflow
Import     sc_import<IfType>      Required interface
Export     sc_export<IfType>      Provided interface

PortInstanceDescriptor. SystemC provides four basic port types, which are described using the EKind enumeration (Table I). Usually, it is the compiler's responsibility to apply type-checking to a SystemC model to ensure that all module interconnections are correct. For example, ports of the data types sc_in<X> and sc_out<Y> may only be bound to an sc_signal<Z> if the types X, Y and Z match. Our framework allows for instantiation and interconnection of ports at runtime. This means that type-checking has to be performed by the runtime infrastructure. The problem is solved using C++ run-time type information (RTTI). We apply the dynamic_cast operator to perform type compatibility checking and typeid to get textual type information. 2) Automation: Many higher-level languages, such as Java, C# and Objective-C, support reflection features. Reflection is a powerful metaprogramming paradigm which allows a program to examine its own structure and behavior at runtime, or even to alter its behavior. C++ RTTI can be considered a very basic and strongly limited implementation of reflection. An application of reflection is to resolve a class name (given as a string) into a class type descriptor. This way, an instance of the class can be constructed indirectly, without specifying its type at the source code level. As RTTI does not support this feature, we implemented a small meta-language which imitates that behavior, as it allows the designer to enable selected classes for indirect instantiation. This is done by inserting preprocessor macros in order to declare a class as automatable. It is also possible to describe the set of constructor arguments with respect to their types and names. When the infrastructure code initializes, the class will register itself to the Repository (which follows the singleton design pattern). 3) Simulation Control: An appropriate interface supports synchronization and data exchange between the SystemC kernel and Simulink. The SimulationContext class exposes functionality to control the simulation. A Reset method resets the SystemC kernel to its initial state: all module instances are destroyed and the simulation time is reset to 0. RunSimulation executes the simulation for a specified amount of time. The GetTimeOfNextEvent method tells the point in time at which the next process inside SystemC's process queue is scheduled (or whether no process is currently pending). B. Client S-Function The client S-Function acts as a mediator between a component server and Simulink. It synchronizes both simulator kernels and enables the exchange of signal values.
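A minimal sketch of the "automatable class" idea (registration macro plus singleton repository) is shown below; all names are illustrative, and the real infrastructure additionally records constructor argument types and names.

#include <functional>
#include <map>
#include <string>
#include <systemc>

// Illustrative component repository following the singleton design pattern.
class Repository {
public:
  using Factory = std::function<sc_core::sc_module*(const std::string&)>;
  static Repository& instance() {
    static Repository repo;                       // created on first use
    return repo;
  }
  void registerClass(const std::string& name, Factory f) { factories_[name] = std::move(f); }
  sc_core::sc_module* create(const std::string& className, const std::string& instName) const {
    auto it = factories_.find(className);
    return it == factories_.end() ? nullptr : it->second(instName);
  }
private:
  std::map<std::string, Factory> factories_;
};

// A macro a module author would place next to the class definition to make it
// instantiable by name at runtime (hypothetical macro name).
#define AUTOMATABLE(CLASS)                                               \
  namespace {                                                            \
    const bool CLASS##_registered = [] {                                 \
      Repository::instance().registerClass(#CLASS,                      \
        [](const std::string& n) -> sc_core::sc_module* {                \
          return new CLASS(n.c_str());                                   \
        });                                                              \
      return true;                                                       \
    }();                                                                 \
  }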

1) Signal Data Exchange: CoSimImport and CoSimExport (see Figure 2) are specializations of sc_channel which are designed to transfer signal data in and out of a SystemC simulation. Both can be interfaced with Simulink. We distinguish four kinds of connections which can occur in a Simulink model: A Simulink/Simulink connection models a dataflow dependency between two native Simulink blocks. This type of connection does not need any further consideration, as it is handled by the Simulink solver. A Simulink/SystemC connection links a Simulink signal to a so-called import gateway block. This block maps to the client S-Function, which will create an instance of CoSimImport in order to tackle the data exchange from Simulink to SystemC. It is important to mention that the import gateway block is the only block which supports that connection type. It is not possible to link a Simulink signal directly to an arbitrary SystemC module (see Figure 5). A SystemC/Simulink connection links an export gateway block to a Simulink signal. In this case, the S-Function will create an instance of CoSimExport. Again, the export gateway is the only block supporting this type of connection. A SystemC/SystemC connection links two blocks, each representing a SystemC module instance. This connection type is realized using a propagate-and-bind scheme which will be explained further below.
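One simple way to picture the gateway blocks is a channel whose value the client S-Function sets or reads between two runs of the SystemC kernel. The sketch below is conceptual only: the paper's CoSimImport/CoSimExport are sc_channel specializations, whereas this sketch merely wraps an sc_signal, and all names are illustrative.

#include <systemc>

// Conceptual gateway sketch; not the paper's CoSimImport/CoSimExport implementation.
struct GatewaySketch : sc_core::sc_module {
  sc_core::sc_signal<double> value;   // SystemC modules bind their sc_in<double>/sc_out<double> here

  explicit GatewaySketch(sc_core::sc_module_name n)
    : sc_core::sc_module(n), value("value") {}

  // Import direction: called by the client S-Function between two sc_start() runs;
  // the new value becomes visible to SystemC processes in the next delta cycle.
  void write_from_simulink(double v) { value.write(v); }

  // Export direction: called once all module outputs are stable (after the
  // delta-cycle loop of the synchronization algorithm) to copy the value back
  // into a Simulink output buffer.
  double read_for_simulink() const { return value.read(); }
};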
Figure 3. Two connected Simulink blocks and their internal representation. (Each block wraps a SystemC module instance between a CoSimImport and a CoSimExport channel; the Simulink solver transfers the signal value from the first block's CoSimExport to the second block's CoSimImport.)

A simple solution to realize SystemC/SystemC connections would be to create and bind instances of CoSimImport and CoSimExport. In this case Simulink transfers the actual signal data according to Figure 3. To ensure that no signal value change is missed, it is necessary to carefully choose the sample times of both blocks. Setting them too high causes data loss and unintuitive behavior. Setting them too low results in poor simulation performance since the frequent context changes hinder the SystemC simulation kernel to skip unnecessary simulation cycles. There are applications where the simulation performance is affected in a way that the approach becomes completely intractable, for example, if SystemC is applied to analyze packet-based data [13]. In the considered application, packets are recorded by a data logger and processed by a SystemCbased data analysis framework. The framework synchronizes the simulation time with the receive timestamp of each currently processed packet. All timestamps possess a resolution


of 100ns. However, the time lag between two consecutive packets usually lays some orders of magnitude above this resolution. Given moderate trafc, it is possible to run the data analysis faster than real-time on a standard PC. Obviously, a context switch every 100ns would result in a non-viable analysis performance. We decided to implement a different approach which allows the SystemC sub-model to be executed at a much higher rate (or, truly event-based) than the rest of the model. Synchronization is only necessary at transitions between Simulink and SystemC blocks which are modeled explicitly by import and export gateways. The approach imitates the standard way of constructing SystemC models where a module connection is realized by binding the involved module ports to the same instance of an sc_channel (in most cases sc_signal, a specialization of sc_channel). For each SystemC/SystemC connection inside the Simulink block diagram, the client S-Function creates an appropriate sc_signal instance and binds it to the underlying port instances. Unfortunately, there is no elegant way of extracting the set of diagram connections from inside the SFunction. However, the information is gained implicitly using a propagate-and-bind scheme. As soon as the simulation is running, Simulink provides the S-Function with buffers which are used to store its input and output values. The idea is not to store an actual signal value inside a buffer, but to store a reference (or pointer) to the signal instance. During the rst Simulink simulation cycle, the Simulink propagates the references in order to complete the binding of the whole SystemC sub-model. On each simulation cycle, Simulink passes (amongst others) a calculate outputs phase which instructs each block to update its outputs. The computation may involve block inputs, given that they are marked as having a direct feedthrough property [15]. Our implementation indicates every input as direct feedthrough in order to get access throughout the calculate outputs phase. This leads to the following algorithm: 1) When a block is created: Instantiate the appropriate module class, create and bind an sc_signal instance for each data output port. Leave all data input ports unbound. 2) When entering calculate outputs for the rst time: Store the references (or pointers) to all signals which were created in step (1) in the appropriate output data buffers (provided by Simulink). Fetch all references from input data buffers (provided by Simulink) and bind all data input ports to the according signal reference. 3) When all blocks passed calculate outputs: Model is ready to elaborate, start the SystemC simulation. Figure 4a shows the internal representation of the Simulink model shown in Figure 5 before the simulation is started (step 1). After propagating all signal references the binding is completed (step 2, see Figure 4b). Elaboration and start of the SystemC simulation (step 3) still take place during the

Prior to the simulation, Simulink analyzes all data dependencies in the model and computes an appropriate block execution order which ensures that the inputs of each block are already computed before that block enters the calculate outputs phase. However, the propagation scheme is only viable for SystemC sub-models without loops. Simulink would recognize each loop as being algebraic and report an error, regardless of whether a register within the underlying behavior actually breaks that loop or not.

2) Time Synchronization Algorithm: Our time synchronization algorithm is controlled by Simulink, as opposed to [9], where the SystemC event queue is used to control the synchronization intervals. If the Simulink model refers to multiple component servers, it is possible to have more than one SystemC kernel. In that case, each SystemC kernel runs independently of the others, though all of them are controlled by and synchronized with the Simulink simulation time. There is no direct data exchange between modules belonging to different component servers. Instead, gateway blocks have to be used. It is up to the designer to establish an appropriate sampling time for each gateway block. Setting a sampling rate too low can lead to loss of data; setting it too high will affect the simulation performance due to oversampling.

The involved co-simulation algorithm (Algorithm 1) is quite simple. The Simulink solver triggers each gateway block at its specified sampling time. This is done when entering the calculate outputs phase, which instructs SystemC to synchronize with Simulink's simulation time. If the block is an import gateway, the input signal value is transferred into the SystemC model. If the block is an export gateway, a number of single delta cycle simulations follow until no processes are pending for the current simulation time. This step accounts for combinational computation paths inside the modules and ensures that all module outputs have stable values. Afterwards, the output signal value is transferred to Simulink.

Algorithm 1: When entering calculate outputs, synchronize Simulink and SystemC
  now := CurrentSimulinkTime
  t := now - CurrentSystemCTime
  if t > 0 then
    RunSimulation(t)
  end if
  if current block is import gateway then
    Transfer Simulink input signal value to SystemC
  else if current block is export gateway then
    while GetTimeOfNextEvent() = now do
      RunSimulation(0)   {executes a single delta cycle}
    end while
    Transfer SystemC signal value to Simulink
  end if
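In C++/SystemC terms, the synchronization step of Algorithm 1 could be rendered roughly as follows. This is only a sketch: the function name and the way the gateway obtains the current Simulink time, its signal and the Simulink-side buffers are assumptions made for illustration.

#include <systemc.h>

// Sketch of Algorithm 1, executed when a gateway block enters the
// "calculate outputs" phase.
void synchronize_gateway(bool is_import_gateway, const sc_time& simulink_now,
                         sc_signal<double>& gateway_sig,
                         double simulink_in, double& simulink_out)
{
    // Advance the SystemC kernel up to the current Simulink time.
    sc_time delta = simulink_now - sc_time_stamp();
    if (delta > SC_ZERO_TIME)
        sc_start(delta);

    if (is_import_gateway) {
        // Import gateway: push the Simulink input value into the SystemC model.
        gateway_sig.write(simulink_in);
    } else {
        // Export gateway: run single delta cycles until no process is pending
        // at the current time, so combinational paths have settled.
        while (sc_pending_activity_at_current_time())
            sc_start(SC_ZERO_TIME);
        simulink_out = gateway_sig.read();
    }
}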


Figure 4. Internal representation of the Simulink blocks representing SystemC modules: (a) during model construction, the Simulink solver propagates the sc_signal references/pointers between the CoSimImport gateway, the FirFilter block and the CoSimExport gateway; (b) after binding is completed, all ports are bound to the shared sc_signal instances.

C. Middleware
The communication between a component server and a client is done via a middleware. In the case where the server and client are compiled together, no additional middleware software is used and communication is done by sharing pointers to the same memory space. In the case of TCP/IP communication, an open source project called Remote Call Framework (RCF) [16] was used. Shared memory communication was implemented with an open source Boost IPC library [17]. Both middleware implementations provide convenient and powerful function calls for inter-process communication.

V. RESULTS
We used the co-simulation framework for the verification of a variable-length FIR filter, an algorithm commonly used in DSP applications. The filter was modeled as a SystemC module with one input and one output port. The length and coefficients of the filter must be given as parameters when the class is instantiated. Our model is approximately timed in the sense that we assume data is processed at a constant rate. The SystemC model could later be refined by adding timing information to the model. The FIR SystemC module, with the required preprocessor macros declaring it as automatable (see Section IV-A2), was compiled into the repository of a component server. Three variants of the component server were generated according to Figure 1.

Figure 5 shows how the verification of the SystemC FIR filter was performed. We used as reference the Digital Filter Design block from Simulink's Signal Processing toolbox to generate a 16-tap passband filter along with its coefficients. The coefficients were saved in an array and given as parameters to the SystemC model. We simulated the three component server variants and performed the verification by inspecting the spectra calculated by the FFT blocks. We were able to easily verify our SystemC application in a couple of minutes. This process would have required considerably more time and effort if a designer had to manually code SystemC test benches.

A certain simulation time overhead is expected due to the numerous synchronizations that must be performed between Simulink and SystemC.
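As an illustration of the kind of module verified here, a variable-length FIR filter could be sketched in SystemC roughly as follows. This is a minimal, untimed sketch under our own assumptions; the actual port types, timing and coefficient handling of the verified model are not given in the paper.

#include <systemc.h>
#include <vector>

// Minimal sketch of a variable-length FIR filter module: length and
// coefficients are constructor parameters, one data input, one data output.
SC_MODULE(FirFilter) {
    sc_in<double>  in;
    sc_out<double> out;

    SC_HAS_PROCESS(FirFilter);
    FirFilter(sc_module_name name, const std::vector<double>& coeffs)
        : sc_module(name), coeffs_(coeffs), taps_(coeffs.size(), 0.0) {
        SC_METHOD(compute);
        sensitive << in;
    }

private:
    void compute() {
        // Shift the delay line and accumulate the weighted taps.
        taps_.insert(taps_.begin(), in.read());
        taps_.pop_back();
        double acc = 0.0;
        for (size_t i = 0; i < coeffs_.size(); ++i)
            acc += coeffs_[i] * taps_[i];
        out.write(acc);
    }

    std::vector<double> coeffs_;
    std::vector<double> taps_;
};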

Figure 5. Verification of a FIR filter developed in SystemC: an FDATool Digital Filter Design block and the SystemC FirFilter (accessed through input and output gateways) both process the same random source, and their outputs are compared with B-FFT spectrum scopes.

The total number of synchronizations in a simulation is calculated from the number of input/output gateways and their sampling rates. For our tests, we reused the three component server variants used for the FIR validation shown in Figure 5. The simulation time for each component server variant was measured and its performance calculated as the ratio of simulation time per synchronization event. In the case of TCP/IP communication, the component server was run on the local host and later on a remote host connected to our LAN. In all cases a standard desktop computer (Intel Core 2 Quad CPU) running Windows 7 was used.

The performance results are shown in Figure 6. The results are presented as the simulation time in seconds per synchronization, in relation to the total number of synchronizations in a simulation. In all cases, the performance of a simulation improves (i.e., less time per synchronization) as the total number of synchronizations increases, until it stabilizes at a constant value. Our results can help a designer decide which communication scheme to use according to the total number of expected synchronizations in a simulation. The performance of the single address space variant diverges from the rest after 100 synchronizations and reaches its maximum, approximately 20 times faster than the other variants, after 100k synchronizations. To our surprise, the performance of the shared memory and TCP/IP localhost variants is almost the same. We believe this is because the Boost [17] library implementation used for shared memory IPC is not efficient. We would expect better performance if native Windows functions were used for shared memory IPC instead. Finally, the simulation performance of the TCP/IP remote host variant is naturally lower and may be affected by delays in the network.



REFERENCES
[1] J.-F. Boland, C. Thibeault, and Z. Zilic, "Using Matlab and Simulink in a SystemC verification environment," in Proceedings of the Design and Verification Conference, DVCon'05, 2005.
[2] K. Keutzer, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-Vincentelli, "System-level design: orthogonalization of concerns and platform-based design," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12, pp. 1523-1543, 2000.
[3] SystemVision. [Online]. Available: www.mentor.com/systemvision
[4] System Generator. [Online]. Available: http://www.xilinx.com/tools/sysgen.htm
[5] P. Gerin, S. Yoo, G. Nicolescu, and A. A. Jerraya, "Scalable and flexible cosimulation of SoC designs with heterogeneous multi-processor target architectures," in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC 2001), 2001, pp. 63-68.
[6] N. Pouillon, A. Becoulet, A. V. de Mello, F. Pecheux, and A. Greiner, "A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs," in Proc. IEEE/IFIP Int. Symp. Rapid System Prototyping (RSP '09), 2009, pp. 116-122.
[7] G. Nicolescu, S. Yoo, A. Bouchhima, and A. A. Jerraya, "Validation in a component-based design flow for multicore SoCs," in Proc. 15th Int. System Synthesis Symp., 2002, pp. 162-167.
[8] F. Bouchhima, M. Briere, G. Nicolescu, M. Abid, and E. Aboulhamid, "A SystemC/Simulink co-simulation framework for continuous/discrete-events simulation," in Behavioral Modeling and Simulation Workshop, Proceedings of the 2006 IEEE International, 2006, pp. 1-6.
[9] F. Bouchhima, G. Nicolescu, M. Aboulhamid, and M. Abid, "Discrete-continuous simulation model for accurate validation in component-based heterogeneous SoC design," in Proc. 16th IEEE Int. Workshop Rapid System Prototyping (RSP 2005), 2005, pp. 181-187.
[10] W. Hassairi, M. Bousselmi, M. Abid, and C. Sakuyama, "Using Matlab and Simulink in SystemC verification environment by JPEG algorithm," in Electronics, Circuits, and Systems, 2009. ICECS 2009. 16th IEEE International Conference on, 2009, pp. 912-915.
[11] K. Hylla, J.-H. Oetjens, and W. Nebel, "Using SystemC for an extended Matlab/Simulink verification flow," in Proc. Forum on Specification, Verification and Design Languages (FDL 2008), 2008, pp. 221-226.
[12] D. Berner, J.-P. Talpin, H. Patel, D. A. Mathaikutty, and E. Shukla, "SystemCXML: An extensible SystemC front end using XML," in Proceedings of the Forum on Specification and Design Languages (FDL), 2005.
[13] C. Köllner, G. Dummer, A. Rentschler, and K. Müller-Glaser, "Designing a graphical domain-specific modelling language targeting a filter-based data analysis framework," in Object/Component/Service-Oriented Real-Time Distributed Computing Workshops, IEEE International Symposium on, 2010, pp. 152-157.
[14] M. Moy, F. Maraninchi, and L. Maillet-Contoz, "Pinapa: An extraction tool for SystemC descriptions of systems-on-a-chip," in EMSOFT, September 2005, pp. 317-324.
[15] Matlab Simulink. [Online]. Available: http://www.mathworks.com/help/toolbox/simulink/
[16] RCF - Interprocess Communication for C++. [Online]. Available: http://www.codeproject.com
[17] Boost C++ Libraries. [Online]. Available: www.boost.org
[18] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele, "Scalably distributed SystemC simulation for embedded applications," in Industrial Embedded Systems, 2008. SIES 2008. International Symposium on, 2008, pp. 271-274.

Figure 6. Simulation performance results according to the number of synchronizations between Simulink and SystemC (log-log plot of simulation time per synchronization [sec] versus total number of synchronizations, for the single address space, shared memory, TCP/IP localhost and TCP/IP remote variants).

VI. CONCLUSIONS AND OUTLOOK
Our work shows that, thanks to the open source nature of SystemC, the principles and benefits of SLD, which have proven effective in the SoC market, can also be applied to the traditional design of embedded systems in order to rapidly create virtual platforms. Our work demonstrates that it is possible, from a designer's point of view, to seamlessly create and verify SystemC models within Simulink. The complexity of our approach is hidden by an automated framework that generates servers, which provide a library of SystemC modules, and clients attached to Simulink, which control them.

In our current framework version, Simulink does not allow signal loops on SystemC blocks. Loops are allowed as long as they are broken up by at least one block with delaying behavior, for example a register, or, in Simulink terms, if at least one of the involved ports is declared as non-direct feedthrough. The challenge is to find out when it is safe to declare a port as non-direct feedthrough. A heuristic solution would be to analyze the sensitivity lists of all processes of a SystemC module. Ports that trigger any process, for example a clock signal, must be defined as direct feedthrough and the rest as non-direct feedthrough. As the latter case implies register behavior, the module outputs would not be immediately affected. Another solution would be to oblige the SystemC module designer to mark non-direct feedthrough ports with special meta tags. However, this topic requires further consideration.

Our approach could easily be extended to support parallelized, distributed simulation of SystemC models as done in [18]. In this way, we could increase the simulation performance by distributing the simulation across multiple CPU cores. Further work includes support for TLM 2.0 interfaces, which should be possible since the reference propagation scheme could equally be applied to TLM interfaces.


Extension of Component-Based Models for Control and Monitoring of Embedded Systems at Runtime
Tobias Schwalb and Klaus D. Müller-Glaser
Karlsruhe Institute of Technology, Institute for Information Processing Technologies, Germany Email: {tobias.schwalb, klaus.mueller-glaser}@kit.edu
Abstract: To allow rapid abstract development and reuse in embedded systems, component-based system development is often used nowadays. However, control and monitoring at runtime for adjustment and error identification usually take place in different domains or tools. These are in general more concrete, and therefore the user needs a deeper understanding of the system. This paper, in contrast, presents a continuous concept that raises the abstraction level and integrates runtime control and monitoring into component-based models. The concept is based on an extended component-based meta model and on libraries which describe the available components as well as their interfaces and parameters. During design time, source code is generated based on the model designed by the user. At runtime, control commands are sent to the embedded target according to user modifications in the model, and acquired monitoring data is back-annotated and displayed at model level. The concept is demonstrated and evaluated using a reconfigurable hardware platform.

I. INTRODUCTION
The development of embedded systems becomes more and more complex due to increasing demands and the pressure to meet productivity targets. Dannenberg et al. [1], for example, predict a growth of 150% for the market of electric/electronic automotive components, up to a total of 316 billion Euro in 2015, with an exponential rate in the future. To master this complexity, many systems are built using component-based design methodologies that allow easy reuse of existing design parts. However, while this method supports a rapid and abstract design, control and monitoring for runtime adjustment and error identification are normally performed on a more detailed level using specific tools. Therefore, the user needs a deeper understanding of the system, which increases costs and development time.

In this paper we present a continuous concept to extend component-based models for runtime control and monitoring, to support abstract adjustment and error identification. The method is based on an extended meta model for component-based systems and on libraries which store predefined components. In contrast to current methods (see Section II), it allows the user to work on the same abstract level at design time and at runtime. After the design phase, source code is automatically generated for the embedded target. During runtime, the components of the embedded target can be controlled and monitored using special parameters. Other parameters display the monitored status of the embedded system, both within the same abstract component-based model. Therefore, the user does not need a detailed understanding of individual components for fast prototyping.

In this context, we first describe the state of the art in model-based design, configuration, control and monitoring in Section II. The next section gives an overview of the concept and describes the flow when using the method. Section IV illustrates the developed meta model, while Section V describes the actions for configuration, control and monitoring and shows our implemented model-based development environment. An example, based on the use of reconfigurable hardware, is presented in Section VI, including practical tests and results in Section VII. We close with conclusions and an outlook on future work in Section VIII.

II. STATE OF THE ART
In this section we concentrate on specific model-based design methods as part of the V-Model [3] for embedded system development. We describe the state of the art concerning system design with a focus on component-based design methods and their possibilities. Further, we show current methods concerning model-based control and monitoring, because we integrate these into component-based models.

For system planning and design in the early development phases of embedded systems, the Systems Modeling Language (SysML) [4] or the Modeling and Analysis of Real Time and Embedded systems profile (MARTE) [5] are used. Both are based on the Unified Modeling Language (UML) [6] and allow abstract specification, analysis and design of complex and real-time embedded systems. However, SysML and MARTE do not support runtime functionalities. In connection with component-based design, SysML and MARTE can be seen earlier in the design process.

Component-based design [7] is located in the implementation phase and is well known in the software domain. It describes a concept of separating systems into components. Thereby, an individual component is often regarded as a software module that encapsulates a set of related functions (or data). Components communicate with each other via interfaces and are configured using parameters. Components are normally stored in libraries to allow rapid reuse. The design can be performed text-based or model-based; in the latter the components are displayed as graphical objects. The main purpose of components is to enable reuse of already implemented functionality in different versions and configurations of a product as well as in other projects [8]. Therefore, it reduces the time-to-market and increases the quality, because


Fig. 1. Concept of design, control and monitoring (left) and the related Meta Object Facility (MOF) levels [2] (right): the component model (M1-Level) is mapped to generated source code and a binary on the target platform (M0-Level), while control commands and back-annotated runtime data connect the running target with the model level; the component meta model with its parameter extensions (M2-Level) is an instance of the Ecore meta meta model (M3-Level).

already used components are mostly well engineered and tested. Different requirements have to be considered when using component-based design methodologies, including scalability, maintainability and interoperability as well as applicability to real systems. These also apply to the embedded systems domain; more details are outlined in [9].

Once an embedded system is implemented, the developer normally needs separate tools for runtime control and monitoring. These tools allow adjusting the behavior of the implemented software, for example an offset correction. Others are used for debugging purposes, monitoring internal signals or variables. These tools normally perform control and monitoring on a detailed level compared to the abstract component-based design. Furthermore, they are often specific to the embedded target and sometimes adapted to the implemented system.

A further domain is rich component models [10], used as a uniform representation of different design entities to support the management of functional and non-functional parts in the development process. A component-based hierarchical software platform for automotive electronics is shown in [11]. It provides a series of tools for model-driven development, visual configuration and automatic code generation. In [12], Gu et al. present an end-to-end tool chain for model-based design and analysis of component-based embedded real-time software in the avionics domain. It includes the configuration of source code as well as runtime instrumentation and statistics that are fed back into models. The management of distributed systems and their configuration is discussed in [13]. A concept based on design-time modeling, model transformation and management policies is presented. A framework for design and runtime debugging of component-based systems is presented in [14]. It enables validating the interactions between components by automatically propagating checks from the specifications to the application code. The usage of different MDE tools at runtime is described in [15]. Thereby, different runtime models are discussed and a tool to design them is presented. Monitoring of embedded systems in terms of model-based debugging is presented in [16], [17] and [18]. These works concentrate on real-time monitoring of functional models (e.g. statecharts).

In our concept we follow the known issues and methods concerning component-based design, i.e. reuse, source code generation, interfaces, parameters, etc. However, in this first version we do not consider specialties such as product line techniques, analysis or distributed systems. In contrast to current methods, we integrate control and monitoring directly into component-based models, not using functional models, specific runtime models or external (low-level) tools. Thereby, the component-based model refers to the architecture of the system and does not represent its functionality. As a result, the control and monitoring possibilities are limited and need to be considered already during the design of the components.

III. CONCEPT
In our concept we extend component-based models to include runtime control and monitoring of embedded systems. Thereby, we follow the Meta Object Facility (MOF) of the Object Management Group (OMG) [2]. The flow of our method is depicted in Figure 1 on the left; the related MOF levels are shown on the right.

The platform-independent component-based meta model, which corresponds to the M2-Level of the MOF, characterizes a system that comprises components with different attributes, interfaces and parameters. It has been extended with special parameters concerning configuration, control and monitoring (for more details see Section IV). According to this meta model, libraries store the implemented platform-dependent components. The user uses these components to build his system in a component-based model, which corresponds to the M1-Level of the MOF. Thereby, the user also connects components using their interfaces and adjusts them according to their parameters.

In the next step, based on the component-based model, the source code of the system is generated using templates. The code generation thereby includes the components, their connections and their adjustments according to the set parameter values. Additionally, the templates integrate functionality allowing control and monitoring at runtime, including further components which handle the communication between the embedded target and a design tool on a PC.


This generated source code lies between the M1- and M0-Level, because it is normally high-level code (VHDL, C, ...) which is not used to directly program the embedded target. For example, VHDL code is used in a further step to generate a platform-specific binary file. This implementation (M0-Level) is integrated into the embedded target.

To control the embedded target, the user modifies the component-based model (M1-Level). According to his actions and general information gained from the components, commands are generated and sent to the embedded target via the integrated communication components. For control, the user cannot change every part at model level; only the values of specific control parameters, integrated in the components, can be modified. For monitoring, using the same communication components, information read from the embedded target is transferred to a PC. On the PC the information, which corresponds to the M0-Level, is interpreted using mapping information and displayed in the component-based model (M1-Level). Thereby, the information is associated with the corresponding monitoring parameters.

As a result, during design, configuration, control and monitoring, the user works on the same abstract level using the same component-based model. There are only different views on the model at design time and at runtime. At design time, algorithms map the interactions at model level to the generated source code for the embedded target. At runtime, the algorithms generate control commands according to user modifications in the model and interpret received data to display it at model level. More details on the mechanisms for design, configuration, code generation, control and monitoring are described in Section V. The component-based meta model is based on the Ecore meta meta model (M3-Level), because it offers a general description and our model-based development environment (see also Section V) is built using the Eclipse Modeling Framework [19].

IV. COMPONENT-BASED META MODEL
The extended platform-independent component-based meta model is depicted in Figure 2. The model which is instantiated from this meta model later stores information from different perspectives (e.g. design time and runtime). Therefore, the meta model has to be able to describe, on the one hand, a system comprising components with individual interfaces and parameters (including their attributes). On the other hand, it holds the information the user adds during assembly, design, configuration and control, as well as the data received during monitoring of the embedded target.

The classes on the top right of the meta model describe a simplified standard component-based meta model. They outline a system that is composed of components and connections. Thereby, the attribute id of the class Component is unique and allows identifying a single component. The type and version specify the type of component, which is later used in correspondence with the libraries. In addition, a component has interfaces, which are modeled as Input and Output classes.

Fig. 2. Extended component-based meta model.

Thereby, the attribute type is used to clarify which outputs can be connected to which inputs. The boolean attribute fixed is needed to signal whether an output or input is compulsory, i.e. whether it needs a connection. Outputs and inputs of the components are connected with each other as source and target using connections.

The remaining meta model describes three different kinds of component parameters, for configuration, control and monitoring. The parameters have been split to allow a clear differentiation and to implement different functionality in the development environment. The class ConfigurationParameter describes parameters intended for the configuration of a component during design time. The class has two attributes, which describe the name and value of the parameter. In addition, the top class has two child classes, which describe different parameter forms, i.e. a numerical or a text-based parameter. The ...Number class forms a parameter in a numerical format, i.e. an integral or floating point number. The min and max attributes are the limits for the value. The ...List class describes text-based parameters; the value is selected from a list of predefined values. These possible values are stored in the ...ListValue class. In this context, a distinction is made between the displayed value at model level and the coded value used in conjunction with the embedded target. These two forms have been chosen because they represent the most commonly used parameters in general embedded applications. Furthermore, the parameters allow abstract and easy adjustment of the embedded system from the user perspective, as well as limiting the possible inputs and avoiding failures.

The ControlParameter class describes parameters used at runtime to control a component on the embedded target or, respectively, adjust its behavior. The layout of this class and its subclasses is similar to the classes for configuration. The only difference is the additional command attribute, which stores the command sent to the embedded target to modify the parameter at runtime.
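The meta model itself is specified in Ecore; purely to illustrate the class structure just described, the parameter classes could be sketched in C++ as follows. Only the class and attribute names come from the text; the concrete types and defaults are assumptions.

#include <string>
#include <vector>

// ConfigurationParameter: name and value, set at design time.
struct ConfigurationParameter {
    std::string name;
    std::string value;
};

// Numerical form: the value is limited by min and max.
struct ConfigurationParameterNumber : ConfigurationParameter {
    double min = 0.0;
    double max = 0.0;
};

// Text-based form: the value is chosen from a list of predefined values,
// each with a displayed form (model level) and a coded form (target).
struct ListValue {
    std::string displayed;
    std::string coded;
};
struct ConfigurationParameterList : ConfigurationParameter {
    std::vector<ListValue> values;
};

// ControlParameter: same layout, plus the command that is sent to the
// embedded target when the parameter is modified at runtime.
struct ControlParameter {
    std::string name;
    std::string value;
    std::string command;
};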


Fig. 3. Integration flow for design, configuration, control and monitoring: libraries (component descriptions and source code templates) instantiate the meta model; the user designs and configures the component-based model; code is generated and integrated on the embedded target; at runtime, control commands and monitoring data are exchanged between the model and the target.

The MonitoringParameter class describes parameters monitored during runtime. The layout of the classes for monitoring is similar to the classes for control. The only difference is that there are no minimal and maximal limits for monitored numerical parameters, as the user cannot modify the value of these parameters (it is read from the embedded target). The classes for monitoring parameters are designed in the same way to allow displaying numerical values as well as interpreting received data and displaying its abstract representation at model level.

V. CONFIGURATION, CONTROL AND MONITORING
The flow of design, configuration, control and monitoring in combination with our implemented development environment and the embedded target is depicted in Figure 3. In the first step the development environment loads the libraries with the component descriptions and source code templates. In the next step the user assembles and designs the system at model level in a component-based model. For assembling, he uses the predefined components described in the libraries. These create objects in the component-based model that are instances of classes in the meta model. The library, for example, stores a multiplexer component, which can directly be used and inserted in the model. Thereby, the component is automatically instantiated and displayed with all its interfaces and parameters. A component may be integrated multiple times. During the design, the user also connects the components to each other using their interfaces.

In the third step the components are configured according to their parameters. Thereby, the user adjusts the values of the configuration parameters of the components in the model. The value of a numerical parameter can only be set according to its limits (predefined in the component description). For text-based parameters only an element from the predefined

list can be chosen. These parameters adjust the behavior of the individual component and can only be modified during design time, because they influence the generated source code of the component.

After system design and configuration are completed, the components and their connections are checked before the source code is generated. The generated source code is in general split into multiple files according to the individual components and the structure of the system. Additionally, communication components are integrated to allow controlling and monitoring the system at runtime. After integration on the embedded target, in the last step the user controls and monitors the system during runtime using the same component-based model he used to design the system. For controlling, the user modifies the values of the control parameters. Thereby, the predefined restrictions on numerical and text-based parameters also apply. According to the modifications, algorithms generate commands and send these to the embedded target. The values of the control parameters can also be adjusted during design time and thereby form predefined values for runtime execution. For monitoring at runtime, the values of the monitoring parameters are periodically read back from the embedded target, back-annotated and displayed in the model. Thereby, the respective interpretation and coding of the parameters is used.

We integrated the functionality for design, configuration, control and monitoring in a model-based integrated development environment (IDE) depicted in Figure 4. The development environment is based on the open source platform Eclipse [19]. For model-based design, the Eclipse Modeling Framework (EMF) and the Graphical Modeling Framework (GMF) are used. The generation of the source code is performed with the xPand framework and the checks in the model using the integrated Check language. The component-based meta model (see Section IV) is integrated as an Ecore model in EMF. Based on this model, we created three models in GMF concerning the graphical model-level editor. The first model describes the palette in the editor, i.e. the tools available to build the model. The gmfgraph model describes the graphical representation of the elements in the model, i.e. their shape, color, etc. The third model lays out a mapping between the three models, i.e. it creates relations between elements in the meta model, the palette and the graphical representations. After creation of these models, a model-based IDE can be generated by the framework.

The result is depicted in Figure 4. The modeling area can be seen in the middle with the tool palette on the right (which already includes components from the libraries). The project management with projects and corresponding files is on the left side. The window at the bottom in the middle shows model-based properties of the currently selected element. The additional functionality to allow all steps of configuration, control and monitoring is integrated in the window at the bottom right, which has been manually implemented as an Eclipse plug-in. The window is used to load the libraries, access the different parameters of a component, generate the source code and communicate with the embedded target.


Fig. 4. Integrated model-based development environment (1 - modeling area; 2 - component palette; 3 - properties window; 4 - parameter dialog).

A library is loaded from an XML file; thereby the components get directly integrated into the palette. When a component is drawn in the model, it is automatically instantiated with all its parameters and interfaces, which can be directly used for connections to other components. If a component is selected in the model, its parameters are displayed in the table and the user can modify them. The numerical parameters can be specified directly; the text-based parameters are displayed with a drop-down menu that offers the predefined values (monitoring parameters cannot be changed). A hardware connection to the embedded system can be established after specifying the connection settings. If the connection is active and the user changes a control parameter in the table, corresponding commands are automatically generated and transmitted. Additionally, the monitoring parameters of the selected component are periodically read back, interpreted and displayed. The configuration parameters cannot be changed during an active connection, because this would require a regeneration and reintegration of the system. For easier handling, the window offers additional filters and search functions to locate parameters more easily.

VI. FPGA INTEGRATION
To demonstrate the functionality and integrity of the concept and the IDE, we implemented it along with a system for design, configuration, control and monitoring of Field Programmable Gate Array (FPGA) systems. FPGAs were chosen because of their high computing power, the possibility to run processes in parallel and their easy extensibility. In the example, the components used for the component-based design of the system are integrated in two libraries.

The first library describes general hardware components which are often used in FPGA systems, for example multiplexers, timers, AND gates, OR gates, etc. External inputs and outputs are also integrated as components to allow connections to external peripherals. A second library describes specific components for sensor and actuator control. Using these components, we built as an example a small cooling control system (depicted in Figure 5), which reads data from a temperature and a humidity sensor and controls a cooling fan. A sensor is connected to the system using an External Input component associated with a Sensor Preprocessing component, which is used to read the sensor value and correct it if necessary. The humidity sensor is integrated twice and connected to a Multiplexer and a Sensor Check component to switch automatically if a sensor fails. The Cooling Algorithm component takes the preprocessed sensor values and controls a fan using an additional Actuator Control component. In the libraries, the components are designed with compatible interfaces and communication protocols, or they adapt automatically (e.g. the multiplexer) during code generation according to the connected components.

The respective component-based model of the system is shown in the modeling area of the IDE in Figure 4. In comparison to the embedded system, the component-based model does not include the additional components and buses for runtime control and monitoring, which are described below. All other components are described with their interfaces and parameters. As an example, the parameters of the Sensor Preprocessing component for the temperature sensor are depicted in the table of the parameter window in Figure 4. For configuration, the Input Protocol and Input Address can be modified.


Fig. 5. FPGA system integration: the component-based cooling control system (external inputs, sensor preprocessing, multiplexer, sensor check, cooling algorithm, actuator control, external output) together with the RS232 communication interface and the microprocessor.
During runtime, the Offset and Slope Correction parameters can be controlled, and the Status, Sensor Value and Output Value of the component are monitored. The Input Address, for example, is a list parameter, therefore the user can only choose from a list of predefined addresses. The parameter Status is also a list parameter, therefore the coded value read from the embedded target is interpreted and displayed in common language (not using complex error codes).

All components are integrated on the embedded target using VHDL templates in xPand. The system architecture (i.e. connections and buses) is generated in an additional xPand template as a top-level VHDL structural description file. In addition to the library components, further components and buses are automatically integrated during generation of the system to allow control and monitoring at runtime. These components include an RS232 interface for communication with a PC and an 8-bit microprocessor for processing commands. The microprocessor is connected to the components with three 8-bit buses. The first bus is used to identify the component, the second for sending commands and the last one for reading the status of the components. The buses are separated from the connections between the components and can be used independently, therefore they do not influence the functionality of the system.

To use the described model-based development environment (see Section V) for the FPGA integration, the communication interface is adjusted to communicate over an RS232 connection. Furthermore, compatible mapping algorithms are integrated to generate commands for control and to interpret received monitoring data. The commands are split into three parts: the first part is the ID of the component, the second is the command associated with the actual parameter and the third is the value (for monitoring commands the third part is empty). These parts are sent together in one string via the interface to the microprocessor inside the FPGA. The microprocessor separates these parts and sets the ID on the first bus to address the respective component. In the next step, the command and value are sent on the command bus. If a monitoring command was sent, the answer of the component is read from the corresponding bus and sent to the PC.
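On the PC side, assembling such a three-part command could look roughly like the sketch below. The exact string format, the separator and the transport function are not given in the paper, so they are assumptions made for illustration only.

#include <string>

// Sketch: build the command string (component ID, parameter command, value)
// described above; for monitoring commands the value part stays empty.
std::string buildCommand(const std::string& componentId,
                         const std::string& parameterCommand,
                         const std::string& value = "")
{
    // The ';' separator is an assumption, not the tool's actual format.
    return componentId + ";" + parameterCommand + ";" + value;
}

// Hypothetical usage (IDs, command names and transport are assumed):
//   sendOverRs232(buildCommand("03", "SET_OFFSET", "17"));   // control
//   sendOverRs232(buildCommand("03", "GET_STATUS"));         // monitoring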


VII. TESTS AND RESULTS
Different tests were carried out to evaluate the functionality and integrity of the concept and the developed IDE. The tests were mainly performed using a XUP Virtex-II Pro development system, including a Xilinx Virtex-II Pro FPGA [20] as well as interfaces for communication and programming. The size and speed of the implemented system depend on the type and number of components and connections. The test system (see Section VI) uses around 6% of the logic resources of the FPGA and runs at a frequency of 100 MHz, limited by the layout of the used components. The resources of the additional microprocessor and communication components are fixed at approximately 1%. The resources for the additional buses as well as the functions for control and monitoring depend on the number and layout of the components. The maximum speed is in general limited by the integrated microprocessor to approximately 150 MHz, because the components communicate directly with the microprocessor and therefore need the same clock signal. The communication could also be designed independently to allow different clock frequencies, but this would increase the logic resources for buses and interfaces.

In the tests, the communication, including processing modifications and sending commands as well as receiving monitored data and displaying it in the model, worked as described in Section V. There is only a time delay of up to approximately 250 ms between a change in the model and the reaction of the embedded target, as well as between a change in the embedded target and its display in the model. The reasons are the slow RS232 communication interface and the mapping algorithms. In addition, as the development environment is designed multithreaded, it cannot be determined when the thread responsible for communication or processing is executed. Therefore, while the hardware runs in real time, real-time control and monitoring are not possible.

As a result, the component-based model allows a rapid design of the system and reuse of existing components. After assembly and configuration of the components, the generated VHDL code can be directly integrated using the Xilinx IDE tools. During runtime the components can be controlled and monitored using the implemented IDE, supporting adjustment and monitoring on an abstract level.

The functionality for configuration, control and monitoring of the individual components already needs to be considered during the design of each component. This is a challenge, because in the component-based design process only existing parameters can be used on the abstract level. For example, if a parameter is not integrated in the design of a component, the user needs to work on a low level, manually adjusting the source code or using standard methods for control and monitoring. Regarding control and monitoring, there is an additional consideration, because every parameter normally increases the size and complexity of the component and may reduce its speed.


However, with regard to rapid prototyping systems, the size and speed do not directly take effect, as the system is integrated on a high-performance computing platform and not on the final target platform. Furthermore, the tests showed that, besides the dependencies concerning the interfaces, there are further dependencies between different components and also between parameters of the same component. For example, one parameter can influence the value or the availability of another parameter. Currently, these dependencies are implemented manually for each individual component and checked mainly using the Check language. This is error-prone and does not follow the concept of a continuous model-based approach. Moreover, the tests showed that the expandability, in terms of new components, is not convenient, because the library and the templates need to be modified manually.

VIII. CONCLUSION AND OUTLOOK
In this paper we presented a concept for expanding component-based models for runtime control and monitoring, to support abstract adjustment and debugging of embedded systems. Thereby, the practicability of a continuous abstract development has been increased. In comparison to existing techniques, the user not only designs at model level, but also controls and monitors the system from the same abstract component-based model and does not need to use low-level domains or tools. A single model describes the structure of the system, allows adjustment and shows the status of its components during design time and runtime. All intermediate steps, from model level to the embedded target and vice versa, are carried out by algorithms in the background. By generating the appropriate source code, the concept can be applied to rapid prototyping systems and allows adjusting, controlling and monitoring systems at runtime.

The proposed meta model is capable of abstractly describing different aspects of a system and its components. The user uses libraries with predefined components to rapidly build the system. The components are connected using their interfaces and configured according to their parameters. During runtime the components are controlled and monitored using different parameters in the same model. The integrated development environment allows performing all steps of design, configuration, control and monitoring at model level. The concept has been implemented along with an FPGA integration to show its functionality and feasibility for real systems. Different tests have been carried out to evaluate size, speed and maintainability.

In the future, the control and monitoring parameters will become optional for implementation, so the user can decide about their usage and the additional resources. In this connection, the IDE will be expanded to allow an easier specification of new libraries, as this is currently performed manually. In this context, the automatic integration of external modules as black-box objects will also be added. Additionally, a method will be implemented to check whether the system and components on the embedded target match the component-based model in the IDE. Moreover, the

concept will be integrated along with other platforms and more complex systems to evaluate scalability and performance. The current meta model will be enhanced with respect to dependencies between components and parameters as well as possibilities for hierarchical structures.

REFERENCES
[1] J. Dannenberg and C. Kleinhans, "The coming age of collaboration in the automotive industry," Mercer Management Journal, vol. 17, pp. 88-97, 2004.
[2] Object Management Group (OMG), Meta Object Facility (MOF) 2.0 Core Specification, 2004.
[3] IABG, V-Model, 1997. [Online]. Available: http://www.vmodell.iabg.de/vm97.html
[4] A. Korff, Modellierung von eingebetteten Systemen mit UML und SysML. Spektrum Akademischer Verlag, 2008.
[5] Object Management Group (OMG), UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, Specification, Version 1.0, 2009. [Online]. Available: http://www.omgmarte.org/
[6] Object Management Group (OMG), Unified Modeling Language (UML) Specification, Version 2.2, 2008. [Online]. Available: http://www.uml.org/
[7] G. Heineman and W. T. Councill, Component-Based Software Engineering. Addison-Wesley Longman, Amsterdam, 2001.
[8] J. Kalaoja, E. Niemela, and H. Perunka, "Feature modelling of component-based embedded software," in Software Technology and Engineering Practice, 1997. Proceedings, Eighth IEEE International Workshop on [incorporating Computer Aided Software Engineering], 1997, pp. 444-451.
[9] D. Hammer and M. Chaudron, "Component-based software engineering for resource-constraint systems: what are the needs?" in Object-Oriented Real-Time Dependable Systems, 2001. Proceedings, Sixth International Workshop on, 2001, pp. 91-94.
[10] W. Damm, A. Votintseva, E. Metzner, and B. Josko, "Boosting re-use of embedded automotive applications through rich components," Proceedings, FIT 2005 - Foundations of Interface Technologies, 2005.
[11] H. Li, P. Lu, M. Yao, and N. Li, "SmartSAR: A Component-Based Hierarchy Software Platform for Automotive Electronics," in Embedded Software and Systems, 2009. ICESS '09. International Conference on, 2009, pp. 164-170.
[12] Z. Gu, S. Wang, S. Kodase, and K. Shin, "Multi-view modeling and analysis of embedded real-time software with meta-modeling and model transformation," in High Assurance Systems Engineering, Proceedings, Eighth IEEE International Symposium on, 2004, pp. 32-41.
[13] S. Illner, A. Pohl, H. Krumm, I. Luck, D. Manka, and T. Sparenberg, "Automated runtime management of embedded service systems based on design-time modeling and model transformation," in Industrial Informatics, 2005. INDIN '05. 2005 3rd IEEE International Conference on, 2005, pp. 134-139.
[14] G. Waignier, S. Prawee, A.-F. Le Meur, and L. Duchien, "A Framework for Bridging the Gap Between Design and Runtime Debugging of Component-Based Applications," in 3rd International Workshop on Models@run.time, Toulouse, France, 2008.
[15] H. Song, G. Huang, F. Chauvel, and Y. Sun, "Applying MDE Tools at Runtime: Experiments upon Runtime Models," in Proceedings of the 5th International Workshop on Models at Run Time, Oslo, Norway, 2010.
[16] P. Graf and K. D. Müller-Glaser, "ModelScope: inspecting executable models during run-time," in ICSE Companion '08: Companion of the 30th International Conference on Software Engineering. New York, NY, USA: ACM, 2008, pp. 935-936.
[17] T. Schwalb, P. Graf, and K. D. Müller-Glaser, "Architektur für das echtzeitfähige Debugging ausführbarer Modelle auf rekonfigurierbarer Hardware," in Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen. Universitätsbibliothek Berlin, 2009, pp. 127-137.
[18] T. Schwalb, P. Graf, and K. D. Mueller-Glaser, "Monitoring Executions on Reconfigurable Hardware at Model Level," in 5th International MODELS Workshop on Models@run.time, Oslo, Norway, Oct. 2010.
[19] Eclipse Foundation, Eclipse Modeling Project, 2010. [Online]. Available: http://www.eclipse.org/modeling/
[20] Xilinx, Virtex-II Pro and Virtex-II Pro X FPGA User Guide, v4.2, November 2007.


A model-driven based framework for rapid parallel SoC FPGA prototyping


Mouna Baklouti, Manel Ammar, Philippe Marquet, Mohamed Abid and Jean-Luc Dekeyser
LIFL, Univ. Lille 1, INRIA Lille Nord Europe, UMR 8022 CNRS, F-59650 Villeneuve d'Ascq, France
Email: {mouna.baklouti,philippe.marquet,jean-luc.dekeyser}@lifl.fr
CES Laboratory, Univ. Sfax, ENIS School, BP 1173, Sfax 3038, Tunisia
Email: manel.ammar@ceslab.org, mohamed.abid@enis.rnu.tn

Abstract: Model-Driven Engineering (MDE) based approaches have been proposed as a solution to cope with the inefficiency of current design methods. In this context, this paper presents an MDE-based framework for rapid SIMD (Single Instruction Multiple Data) parametric parallel SoC (System-on-Chip) prototyping, to deal with the ever-growing complexity of the design process of such embedded systems. The design flow covers the design phases from system-level modeling to FPGA prototyping. The proposed framework allows the designer to easily and automatically generate a VHDL parallel SoC configuration from a high-level system specification model using the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile. It is based on an IP (Intellectual Property) library and a basic parallel SoC model. The generated parallel configuration can be adapted to the requirements of the data-parallel application. In an experimental setting, four steps are needed to generate a parallel SoC: data-parallel programming, SoC modeling, deployment and the generation process. Experimental results for a video application validate the approach and demonstrate that the proposed framework facilitates parallel SoC exploration.

I. INTRODUCTION
With the rising complexity of multimedia and radar/sonar signal processing applications, parallel programming techniques and multi-core Systems-on-Chip (SoC) are used more and more. Single Instruction Multiple Data (SIMD) systems have proven to be powerful executers for data-intensive applications [1], especially in the pixel processing domain [2]. Many SIMD on-chip architectures, in particular based on FPGA (Field Programmable Gate Array) devices, have emerged to accelerate specific applications [3]-[6]. Compared to ASIC (Application Specific Integrated Circuit), FPGA devices are characterized by an increased capacity, smaller non-recurring engineering costs, and programmability [7]. Facing the ever-growing challenge of parallel SoC design, most of the proposed SIMD solutions are application-specific SoCs which lack flexibility: changing a SoC configuration may necessitate extensive redesign. While these specific systems provide good performance, they require long design cycles. The size of a parallel SoC and the complexity involved in its design are continuously outpacing designer productivity. An important challenge is to find adequate design methodologies that

efficiently address the issues of large and complex SoCs. Nowadays, Computer-Aided Design tools are imperative to automate complex SoC design and reduce the time-to-market. Two approaches have been proposed to cope with this problem. Firstly, IP (Intellectual Property) reuse and platform-based design [8] are used to maximize the reuse of predesigned components and to allow the customization of the system according to system requirements. Secondly, the Model-Driven Engineering (MDE) [9] approach has been introduced to raise the design abstraction level and to reduce design complexity. It stresses the use of models in the embedded systems development life cycle and argues for automation via model transformation and code generation techniques. Complex systems can be easily understood thanks to such abstract and simplified representations. Approaches based on MDE have been proposed as an efficient methodology for embedded systems design [10], [11]. An interesting model specification language is UML (Unified Modeling Language) [12], which proposes general concepts allowing the expression of both behavioral and structural aspects of a system. The latest release of UML (2.0) supports profiles that enable the language to be applied to particular application and platform domains with sophisticated extension mechanisms. As an example, the MARTE (Modeling and Analysis of Real-Time and Embedded systems) standard profile [13] is proposed by the OMG to add capabilities to UML for model-driven development of real-time and embedded systems. The MARTE profile enhances the possibility to model SW, HW and the relations between them.

Using the proposed framework, the designer focuses on modeling his needed SIMD configuration and not on how to implement it, since the system modeling is independent of any implementation detail. Models are specified in a unified language. The presented design flow is a library-based method that hides unnecessary details from the high-level design phases and provides an automated path from UML design entry to FPGA prototyping. It can therefore be easily used by non-HW experts for on-chip system implementation, which makes our approach better than relying on clever VHDL coding alone. System concerns are represented in separate dimensions:


data-parallel coding, SoC modeling, IP selection and implementation. The implementation is performed via the generation tool based on a model-to-text transformation using Acceleo [14]. The framework uses an IP library with various components (processors, memories, interconnection networks, ...) that can be selected in the deployment process to generate the needed SIMD configuration. The modeled SoC has to conform to a basic parallel SoC model, proposed in previous work [15], which is parametric, flexible and programmable. In an experimental setting that validates our approach, we consider a video color conversion application for which we explore different parallel system configurations and decide on the best one to run the application. Experimental results show that the proposed framework considerably reduces design costs and facilitates modifying the system model and regenerating the implementation without relying on costly re-implementation cycles. Using the framework, we can create SIMD implementations that are fast enough to meet demanding processing requirements, are automatically generated from a high-level specification model to reduce the time-to-market, and can easily be updated to provide different functionality.

The remainder of this paper is organized as follows. Section 2 discusses related work on model-based approaches to generate on-chip multi-processor or massively parallel systems. Section 3 presents the proposed MDE framework. A case study, which illustrates and validates the framework, is described in Section 4. The FPGA platform is chosen as the target platform since it is a good alternative to test and implement various parallel SoC configurations. Finally, Section 5 draws the main conclusions and proposes future research directions.

II. RELATED WORK

Fig. 1. Parallel SIMD SoC conguration: 4 PEs, a 2D mesh neighboring network and a crossbar based mpNoC

The high-level SoC design methodology is a rapid emerging research area. There are many recent research efforts on embedded systems design using an MDE approach. In this context, different high-level synthesis approaches are currently being studied for different specication languages. For example, xtUML [11] denes an executable and translatable UML subset for embedded real time systems, allowing the simulation of UML models and the code generation for C oriented to different microcontroller platforms. In [16], an approach using VHDL synthesis from UML behavioral models is presented. The UML models are rst translated into textual code in a III. SIMD F RAMEWORK language called SMDL. This latter can be then compiled into a The proposed framework is dedicated to generate different target language as VHDL. The translation from UML models to SMDL is performed using the aUML toolkit. In [17], a SIMD congurations derived form the based parallel SoC transformation tool, called MODCO, which takes a UML state model [15]. These congurations can be then directly simudiagram as input and generates HDL output suitable for use lated using available simulation tools or prototyped on FPGA in FPGA circuit design, is presented. A HW/SW co-design devices using appropriate synthesis tools. Figure 1 illustrates a is performed based on the MDA approach. XML is used to SIMD parallel SoC conguration composed of four Processing generate HDL from high-level UML diagrams. In these two Elements (PE) connected in a 2D mesh topology. To handle works, only state machines HW designs are described. In [18], parallel I/O transfers and point-to-point communications, a a UML-based multiprocessor SoC design framework, called crossbar based mpNoC (massively parallel Network on Chip) Koski, is described. An automated architecture exploration [19] is integrated. To accelerate and facilitate a SIMD conbased on the system models in UML, as well as the automatic guration design, a model-driven framework is proposed. The back and forward annotation of information in the design ow framework allows the designer to model his needed congucould be performed. The proposed design ow provides an ration derived from the basic provided SIMD SoC model. He 150

automated path from UML design entry to FPGA prototyping. The nal implementation is application-specic. The proposed approach is based on synthesizable library components that are automatically tuned for specic application according to the results of the architecture exploration. Our approach is related to the design of massively parallel SoC and covers the design phases from system-level modeling and parallel programming to FPGA prototyping using the notion of transformations between models. The DaRT [10] (Data Parallelism to Real Time) project also proposes an MDAbased approach for SoC design that has many similarities with our approach in terms of the use of meta-modeling concepts. The DaRT work denes MOF-based meta-models to specify application, architecture, and SW/HW association and uses transformations between models as code transformations to optimize an association model. In DaRT, no data-parallel coding is specied and the code generation for RT (Register Transfer) levels is dedicated to specic HW accelerators. The proposed framework, presented in this paper, takes advantage of the MDE notion of transformation between models to generate a complete SIMD parallel SoC at RT level dedicated to compute data-intensive applications. Our approach is based on synthesizable library components and few model transformations to generate the synthesizable VHDL code of the modeled SIMD SoC.

Fig. 3.

MDE-based design ow

Fig. 2.

Framework concepts

has to specify the systems parameters (number of PE, memory size, neighboring topology) and the different components that will be integrated (mpNoC, neighborhood network, devices). A. Data-parallel programming The designer has also to code his data-parallel program using The designer has to write his data-parallel program using the specied data-parallel instruction set depending on the the provided data-parallel instruction set. Based on available chosen processor IP. A help manual is in fact provided to the designer to facilitate the parallel programming and describe the processor compilers (miniMIPS, OpenRisc 1200 and NIOS II) different instructions to use according to the chosen processor. in the IP library and the developed special parallel instructions, The framework, in particular the deployment phase, is based the designer can generate his parallel programs binary. For on an IP library which contains dedicated IP that can be the miniMIPS processor, an extended parallel MIPS assembler directly integrated in the system. Providing an extensive library language [21] is developed. For the OpenRisc and NIOS requires a signicant effort. Currently, the IP library contains processors, high-level asm macros are dened and they can be processors (MIPS, OpenRisc, NIOS II), networks (crossbar, used in any C program for control and communication instrucshared bus and multi-stage networks), memories and some tions. The NIOSII IDE (Integrated Development Environment) devices. To add new IP resources, the IP provider must and the OR1Ksim [22] tools are used respectively with the adapt the IP to the architecture dedicated specic interface NIOS and OpenRisc processors. The developed SW chain is (described in the help manual). Thus, a new component can be a multi-compiler chain that is responsible of generating the put into the library by following the requirements for interfaces SW code depending on the specied target processor. Some particular instructions are specied to be used in the formats. To assemble processors in the SIMD design, we programs as delimiters for parallel and sequential code. Table I distinguish two methodologies: reduction and replication. The reduction consists on reducing an available processor in order shows three examples of instructions from the provided datato build a PE with a small reduced size that can be tted parallel instruction set. It is clearly that these instructions in large quantities into an FPGA device. The replication depend on the processor instruction set. At this step, a SW consists on implementing the ACU as well as the PE by library is provided. It includes pre-implemented application the same processor IP so that the design process is faster. algorithms such as matrices multiplication, FIR (Finite ImWe clearly notice that there is a compromise between the pulse Response) lter, reduction algorithm, image rotation, design time and the number of integrated PEs in the SIMD color conversion (RGB to YIQ, RGB to CMYK), etc. After generating the executable SW, the second step consists conguration depending on the applied design methodology. on modeling the HW system. The designer can select the suitable methodology according to his application constraints. The three processors of the IP B. SoC modeling library are provided with the two methodologies. At this step, the designer can generate different implemenThe designer must specify the architecture models using any tations while integrating different IPs. 
The deployment is also UML 2.0 compliant tool with applying the MARTE prole. responsible of loading the binary data-parallel program in the The most important UML diagrams used in our approach to ACU instruction memory. The SIMD generation approach is specify the system are Class, Structure composite and Deploydepicted in Figure 2. This approach allows a exible and rapid ment diagrams. The modeling of SIMD SoC congurations platform development and platform end-user productivity. relies on the use of UML and the MARTE prole. Three To generate SIMD conguration at RT level, an MDE based MARTE packages are used: the Hardware Resource Modeling design ow, presented in Figure 3, is developed. The proposed (HRM), the Repetitive Structure Modeling (RSM) and the ow uses two meta-models: the MARTE meta-model and the Generic Component Model (GCM) packages [23]. The HRM Deployed meta-model. All meta-model concepts are specied intends to describe the HW platform by specifying its different as UML classes and then converted into Eclipse/EMF models elements. At the end, the HW modeled resources present the [20]. The generation process is based on model transforma- whole system. In our approach, only the HRM HW Logical 151

tions implemented as QVT (Query, Views, Transformations) resources, standardized by OMG. The designer can generate a SIMD massively parallel SoC conguration in four steps: data-parallel programming, SoC modeling, deployment and then implementation generation.

TABLE I SIMD PARALLEL MACROS ASM Macro


P REG SEND (reg,dir,dis,adr)

Description miniMIPS
Neighboring SEND: send data (in reg) from source to destination via the neighboring network. p p p p p p p p p p p addi r1,r0,dir addi r1,r1,dis addi r1,r1,adr SW reg,0(r1) addi r1,r0,dir addi r1,r1,dis addi r1,r1,adr LW reg,0(r1) lui r1,0x2 ori r1,r1,0 LW reg,0(r1)

OpenRisc

Coding NIOS
IOWR (WRP B,addr, data) Where: addr(11)=0 and addr(10:3)=dis and addr(2:0)=dir. data=IORD (WRP B,addr) Where: addr(11)=0 and addr(10:3)=dis and addr(2:0)=dir. NIOS2 READ CPUID(id)

P REG REC (reg,dir,dis,adr)

Neighboring RECEIVE: receive data (in reg) from the source.

P GET IDENT (reg)

read identity

l.addi r1,r0,dir l.addi r1,r1,dis l.addi r1,r1,adr l.sw 0x0(r1),reg l.addi r1,r0,dir l.addi r1,r1,dis l.addi r1,r1,adr l.lwz reg,0x0(r1) l.movhi r1,0x2 l.lwz reg,0x0(r1)

PU

Main_architecture
InterRepetition

:Local_memory [1]

:Elementary_processor [1]

<<flowPort>>

West elementary_processor local_memory


<<flowPort>>

:ACU_data_memory [1] ACU

:ACU [1] Data_mem Inst_mem mpNoc_in PU reshape mpNoC_out West ACU

shaped :PU East

East
:ACU_Instruction_memory [1] InstMem
<<flowPort>>

mpNoC_out

mpNoC_in

mpNoC_in

<<flowPort>> mpNoC_out

<<flowPort>> ACU

reshape PU_out :mpNoC_router [1] PU_in

reshape

:Device [1] mpNoC_in mpNoC_out

Fig. 4.

PU modeling in the case of a linear conguration

ACU_out ACU_in

Device_in Device_out

sub-package is used. It allows to describe information about Fig. 5. 1D conguration modeling the kind of components (HwRAM, HwProcessor, HwBus, etc.), their characteristics, and how they are connected to each other. The architecture is graphically specied at a high chooses to integrate the mpNoC in the SIMD conguration, he abstraction level with HRM. Multidimensional data arrays and must add two ports mpNoC in and mpNoC out to assure powerful constructs of data dependencies are managed thanks the communications through the mpNoC. to the use of the RSM package. It denes stereotypes and We distinguish between 1D and 2D mppSoC congurations. notations to describe in a compact way the regularity of a They differ in the modeling of the interconnections between systems structure or topology. The structures considered are PUs. In the case of 1D conguration, the number of PEs is composed of repetitions of structural elements interconnected equal to the tagged value Shape of the stereotype Shaped via a regular connection pattern. It provides the designer a way applied on the PU class. To model a linear neighboring to efciently and explicitly express models with a high number network, the interconnection link between the East and West of identical components. The concepts found in this package ports is stereotyped InterRepetition. Since the PU on the edge allow to concisely model large regular HW architectures as is not connected to the PU on the opposite edge, the tagged multi-processor architectures. Finally, the GCM package is value isModulo is set to false. The repetitionSpaceDependence used to specify the nature of ow-oriented communication attribute is used to specify the neighbor position of the element paradigm between SoC components. on which the inter-repetition dependency is dened. In this The modeling process is done in an incremental way. The case, its value is equal to {1} since each PE[i] is connected to designer begins by modeling the elementary components: PE, PE[i+1]. Figure 5 shows the mppSoC conguration modeling ACU, memories, mpNoC and I/O device. Then, the whole integrating a linear neighboring network and the mpNoC. The conguration is modeled through successive compositions. link connector, stereotyped Reshape, between the PU and the Figure 4 illustrates the elementary processing unit (PU). It ACU shows that each PU is connected to the ACU in order to is composed of a PE and its local data memory. The class receive the execution orders. To connect PUs with the mpNoC, named Elementary processor is stereotyped HwResource in two Reshape connectors are expressed between the two ports the case of the reduction methodology or HwProcessor in the of the PU and the corresponding ports of the mpNoC. This case of the replication methodology. It has a bidirectional port latter has a multiplicity equal to 1. The repetitionSpace tag is stereotyped FlowPort to connect the data memory. The class equal to the number of PEs. The patternShape tag is equal to 1 Local memory is stereotyped hwMemory with a paramet- indicating that the mpNoC port is distributed among the ports ric tagged value adressSize. In the same manner, the ACU of the PEs. The same modeling is followed in the case of a memories have a parametric size. The PU has one port to ring neighboring network. The only difference is the modulo communicate with the ACU and a number of neighboring ports tagged value which is set to true. equal to the number of its neighboring connections. 
In Figure In the same manner, we can model a 2D SIMD congura4, it has two neighboring ports since each PE can communicate tion. We need just to know how to model the neighboring links with its neighbor in the east or west directions. If the designer based on the MARTE prole. Figure 6 presents a conguration 152

Main_architecture
InterRepetition North :ACU_data_memory [1] ACU shaped :PU West PU reshape mpNoC_out mpNoC_out mpNoC_in South ACU East

Elementary_processor

local_memory

:ACU [1] Data_mem Inst_mem mpNoc_in

implements virtualIP VPE implements

:ACU_Instruction_memory [1] InstMem PU_out :mpNoC_router [1] ACU_out ACU_in Device_in Device_out

reshape PU_in

reshape

InterRepetition

:Device [1] mpNoC_in mpNoC_out

implements hardwareIP PEImplem

Fig. 6.

2D conguration modeling (with a mesh neighboring network)

modeling integrating a 2D mesh neighboring network. We notice that the PU class is modeled with 4 ports dedicated to inter-PE communications in east, west, north and south directions. In this case, the repetitionSpaceDependence tagged value is equal to {1,0} indicating that each PE[i,j] is connected to its neighbor PE[i+1,j] to assure east and west links. In addition, this tagged value is equal to {0,1} for north and south links to assure that each PE[i,j] is connected to its neighbor PE[i,j+1]. For a mesh topology the tagged value isModulo is set to false since there are no connections on the edges. However, it is set to true in the case of a torus topology. The Xnet network is modeled like the 2D mesh. The designer has just to model the links on the diagonals. C. Deployment As described in the previous subsection, a SIMD conguration can be modeled at a high abstraction level. To generate an executable low level model, the elementary modeled components should be associated with an existing implementation based on the provided IP library. The deployment allows to move from a general platform (Platform Independent Model) to a specic platform (Platform Specic Model) according to the MDA approach. At this step, the designer can generate and evaluate different congurations. In fact, the deployment enables to precise a specic implementation for each elementary concept among a set of possibilities. It concerns the processor IP, the instruction memory, the mpNoC interconnection network and the I/O device IP if it exists. At this stage the binary data-parallel program is specied as the memory initialisation le of the main instruction memory. In fact, in our case we deal with a single data-parallel program (one of the advantages of a SIMD architecture) so no mapping of tasks needs to be performed. Thus, the mapping of the application to the hardware architecture is systematic. Figure 7 expresses the deployment of a hardwareIP on the Elementary processor. The concept of codeFile is used to specify the code. A nal transformation chain MARTE2VHDL is developed to generate the synthesizable VHDL implementation of the modeled SIMD conguration. D. Implementation generation

Fig. 7.

Deployment of the PE

synthesizable VHDL implementation depending on the modeled conguration. A model conformed to the Deployed metamodel is generated via the transformation UML2MARTE. This model is then analysed in order to deduce the specied parameters. The number of PEs, memory size, processor design methodology and the topology of the neighboring network are extracted from the UML diagrams. The other congurable components (processor IP, mpNoC interconnection network, etc.) are specied from the deployment step. The developed transformation model-to-text is based on templates. It uses the Acceleo tool [14] which is part of the Eclipse Model to Text project and provides an implementation of the MOF Model to Text OMG standard. The following code example illustrates how to deduce the type of the processor (getPeCodeFile) in the generation step:
[ query public getPECodeFile (m:Model):CodeFile=self.ownedElement->select (oclIsKindOf(CodeFile) and name=PEImpl codele)->asOrderedSet()->rst() ]

Using an MDE based framework, the SIMD SoC design is accelerated. The VHDL implementation can be automatically generated based on model transformations. The SoC model is independent of any implementation detail making the design ow easy to use. The proposed framework also facilitates SoC exploration and helps the user choose the best conguration for a given application. The next section illustrates the use of this framework in a real application context. IV. C ASE STUDY A color conversion RGB to CMYK application widely used in color printers, extracted from the EEMBC benchmark [24] has been developed based on the provided data-parallel instruction set. The program is written using high-level macros (table I). The binary is then generated depending on the used processor by selecting the corresponding compiler. The proposed framework allowed to generate different SIMD suitable congurations. An FPGA is used to do real experimentations. A. HW platform

The used development board is the Altera D2-70 [25] equipped with a CycloneII EP2C70F896C6 FPGA which has The MARTE2VHDL transformation is based on the De- 68416 Logic Elements (LEs) and 250 M4K RAM blocks. The ployed model and the IP library to generate the corresponding used SW tools are the Quartus II V9.0 that allows synthesizing 153

TABLE II S YNTHESIS RESULTS PE IP Proc. design rep. red. rep. red. rep. LEs (%) 71 93 91 98 79 Memory PE (bytes) 1024 2048 1024 4096 512

8 32 8 16 48

miniMIPS miniMIPS OpenRisc OpenRisc NIOS

ACU (bytes) 4096 4096 4096 4096 8192

% 18 66 22 36 87

and prototyping the design on the FPGA, and the ModelSimAltera v.6.4a that allows simulating and debugging the design. To test the color-conversion application, two peripherals are used: a 1M pixel camera TRDB D5M and a 800RGB480 pixel TRDB LTM LCD displayer. The two external SDRAM and SRAM memories are also used. In fact, the implemented VHDL camera driver directly stores the captured data to the SDRAM to be read by PEs as required. A VHDL SRAM controller is implemented. It allows to store the processed data in the SRAM and fetches it as it is required by the LCD. B. SIMD congurations

Fig. 9.

Execution times for different SIMD congurations TABLE III D IFFERENCES BETWEEN TWO DESIGN SOLUTIONS

SIMD cong. Design time using the framework Design time without using the framework

Generic implem. with a reduced processor 15 minutes 1 month

Generic implem. with a replicated processor 40 seconds 7 days

For the tested application, only the mpNoC has been integrated in the system model (no neighboring network) FPGA. So, these times are measured running the colorsince we need to assure parallel data transfers: all PEs need conversion application on parallel FPGA based congurations. to read data from the SDRAM and then write data to the The different SIMD SoC congurations perform good results SRAM. In this example, each pixel processing should not while increasing the number of PEs working in parallel. exceed 10.42 Ns in order to assure real-time processing. The performance of the system is also closely related to the Therefore, a 800480 pixel frame must be processed within processor type and the design methodology. The experimental 4 Ms. The same system model is used for all implementation results show that a SIMD conguration composed of more generations. It is described in composite structure diagram as than 8 PEs is needed to assure real-time processing. Accordillustrated in Figure 8. It models all hardware components ing to these results, we can choose the best conguration. composing the system as well as their connections. Only The proposed approach easily allows exploration of several the deployment diagram changes from one conguration to platform architecture alternatives. another in order to use different processors, memories and In order to illustrate the efciency of the model-based interconnection networks. The modication from one SIMD framework, Table III compares the implementation design time conguration to a new one just needs few milliseconds and using the framework with results obtained from a conventional the re-generation process is rapidly performed. The low-level manual implementation method done by the same designer synthesizable models from the IP library are used for the without using any framework. The measured design time for nal implementation. The generated congurations could be the second conguration (using replication methodology) is directly simulated to measure execution time and decide the just the time needed to modify the rst conguration (with performance of the SIMD modeled systems. reduction methodology). The results in Table III show that Table II shows the obtained synthesis results varying the the proposed framework is a better solution to accelerate SIMD parameters and components while integrating the max- the design of specic SIMD parallel SoC according to the imum number of PEs targeting the Cyclone II FPGA. All estimated design time compared to a manual design. Two these congurations integrate a crossbar based mpNoC since months were necessary to reduce an open-source processor to the crossbar allows fast and non-blocking parallel data trans- obtain a small PE (with only execution units) [21]. Observing fers, necessary for real-time image processing applications. the results, we can conclude that the model-based design We clearly notice that the reduction methodology allows framework allows a very fast SIMD implementation. integrating a bigger number of PEs on the chip than the This case study illustrates a design framework which fareplication methodology. Since the miniMIPS is smaller than cilitates SIMD SoC implementation to run data-parallel apthe OpenRisc, we can reach 32 PEs on the FPGA compared plications. Through the Model-Driven Engineering approach to 16 PEs when using the OpenRisc IP. 
The implementation for parallel SoC design presented in this work, a designer can results prove that the NIOS processor is optimized for the specify the needed SIMD conguration using UML models Altera FPGA. We can integrate more than 48 PEs on the chip. and the MARTE prole at high abstraction level and automatFigure 9 shows the execution time results obtained when ically generate its implementation at RT level. The designer prototyping the generated congurations on the CycloneII can easily and rapidly generate different SoC congurations 154

Main_architecture shaped : PU

: ACU_memory [1]

: ACU [1]
Data_mem

InstMem

Inst_mem mpNoC_out mpNoc_in

PU reshape ACU mpNoC_out mpNoC_in

reshape PU_out PU_in

reshape

: mpNoC_router [1]
ACU_in

device: Device [1]


mpNoC_in mpNoC_out

Device_in

ACU_out Device2_out

Device_out Device2_in

device2: Device2 [1]


mpnoc_out mpnoc_in

Fig. 8.

SIMD conguration composite structure diagram

to look for the best alternative for a given application. V. C ONCLUSIONS AND FUTURE WORK A Model-Driven Engineering (MDE) approach for SIMD SoC design was presented. The proposed ow design is composed of four steps: application programming, system modeling, deployment and then implementation generation. The MDE fundamental notion of transformation between models is used to generate a SIMD conguration at register transfer level from its model at a high abstraction level. The framework facilitates the exploration by rapidly generating different SoC congurations in order to choose the most adequate one that better fullls the application requirements. Experimental results show that the proposed framework strongly contributes to the increase of the designers productivity. The case study with a video processing application proved that the presented design ow can facilitate the design of parallel SIMD SoC systems. The design ow allows reducing implementation costs. Besides, the use of UML and MDE promotes the reusability of application and system high-level models. One of the future directions to be considered is the modeling of a data-parallel application. We also intend to develop a high-level exploration step to automatically generate the most suitable application-specic SIMD SoC conguration. R EFERENCES
[1] W. C. Meilander, J. W. Baker, and M. Jin, Importance of SIMD Computation Reconsidered, in International Parallel and Distributed Processing Symposium, 2003. [2] R. Kleihorst and al., An SIMD smart camera architecture for real-time face recognition, in Abstracts of the SAFE & ProRISC/IEEE Workshops on Semiconductors, Circuits and Systems and Signal Processing, 2003. [3] R. Rosas, A. de Luca, and F. Santillan, SIMD Architecture for Image Segmentation using Sobel Operators Implemented in FPGA Technology, in Proc. of the 2nd International Conference on Electrical and Electronics Engineering ((ICEEE05), 2005. [4] P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch, and P. Gauget, Denition and SIMD implementation of a multi-processing architecture approach on FPGA, in Proc. of DATE, 2008. [5] F. Schurz and D. Fey, A Programmable Parallel Processor Architecture in FPGA for Image Processing Sensors, in Integrated Design and Process Technology, IDPT, 2007. [6] X. Xizhen and S. G. Ziavras, H-SIMD machine: congurable parallel computing for matrix multiplication, in International Conf. on Computer Design: VLSI in Computers and Processors, 2005, pp. 671676.

[7] P. Paulin, DATE panel: Chips of the future: soft, crunchy or hard? in Proc. Design, Automation and Test in Europe, 2004, pp. 844849. [8] A. Sangiovanni-Vincentelli, L. Carloni, F. D. Bernardinis, and M. Sgroi, Benets and challenges for platform-based design, in Proc. DAC, 2004, pp. 409414. [9] D. Schmidt, Model-driven Engineering, IEEE Computer, vol. 39, no. 2, 2006. [10] C. D. L. Bond and J.-L. Dekeyser, Metamodels and MDA transformations for embedded systems, in FDL04, Lille, France, 2004. [11] S. Mellor and M. Balcer, Executable UML: A foundation for Model Driven Architecture. Boston: Addison-Wesley, 2002. [12] O. M. Group. (2004, october) Uml 2 superstructure (available specication). [Online]. Available: http://www.omg.org/cgi-bin/doc?ptc [13] L. Rioux, T. Saunier, S. Gerard, A. Radermacher, R. de Simone, T. Gautier, Y. Sorel, J. Forget, J.-L. Dekeyser, A. Cuccuru, C. Dumoulin, and C. Andre, MARTE: A new prole RFP for the modeling and analysis of real-time embedded systems, in UML-SoC05, DAC 2005 Workshop UML for SoC Design, Anaheim, CA, June 2005. [14] Acceleo. (2009). [Online]. Available: http://www.acceleo.org [15] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, IP based congurable SIMD massively parallel SoC, in PhD Forum of 20 International Conference on Field Programmable Logic and Applications (FPL), Milano, Italy, August 2010. [16] D. Bjorklund and J. Lilius, From UML Behavioral Models to Efcient Synthesizable VHDL, in 20 IEEE NORCHIP Conference, Copenhagen, Denmark, November 2002. [17] F. P. Coyle and M. A. Thornton, From UML to HDL: a Model Driven Architectural Approach to Hardware-Software Co-Design, Information Systems: New Generations Conference (ISNG), pp. 8893, April 2005. [18] T. Kangas, P. Kukkala, H. Orsila, E. Salminen, M. Hannikainen, and T. Hamalainen, UML-based multiprocessor SoC design framework, ACM Trans. Embedded Computing Systems (TECS), vol. 5, no. 2, pp. 8893, May 2006. [19] M. Bakouti, Y. Aydi, P. Marquet, M. Abid, and J.-L. Dekeyser, Scalable mpNoC for Massively Parallel Systems - Design and Implementation on FPGA, Journal of Systems Architecture (JSA), vol. 56, pp. 278292, 2010. [20] EMF. Eclipse modeling framework. [Online]. Available: http://www. eclipse.org/emf [21] M. Bakouti, P. Marquet, M. Abid, and J.-L. Dekeyser, A design and an implementation of a parallel based SIMD architecture for SoC on FPGA, in Conference on Design and Architectures for Signal and Image Processing DASIP08, Bruxelles, Belgium, November 2008. [22] OpenCores. Or1200 openrisc processor. [Online]. Available: http: //opencores.org/openrisc,or1200 [23] O. M. Group. UML Prole for MARTE: Modeling and Analysis of Real-Time Embedded Sys- tems, version 1.0. [Online]. Available: http://www.omg.org/spec/MARTE/1.0/PDF/. [24] EEMBC. (2010) The Embedded Microprocessor Benchmark Consortium. [Online]. Available: http://www.eembc.org/home.php [25] Terasic. (2010) Altera DE2-70 Board. [Online]. Available: http://www. terasic.com.tw/cgi-bin/page/archive.pl?Language=English&No=226

155

A State-Based Modeling Approach for Fast Performance Evaluation of Embedded System Architectures
Sbastien Le Nours, Anthony Barreteau, Olivier Pasquier
Univ Nantes, IREENA, EA1770, Polytech-Nantes, rue C. Pauc, Nantes, F-44000 France {sebastien.le-nours, anthony.barreteau, olivier.pasquier}@univ-nantes.fr
Abstract Abstract models are means to assist system architects in the evaluation process of hardware/software architectures and then to cope with the still increasing complexity of embedded systems. Efficient methods are necessary to correctly model system architectures and to make possible early performance evaluation and fast exploration of the design space. In this paper, we present the use of a specific modeling approach to improve evaluation of non-functional properties of embedded systems. The contribution is about a computation method defined to improve modeling of properties used for assessment of architecture performances. This method favors creation of abstract transaction level models and leads to significantly reducing simulation time but still preserving accuracy of results. The benefits of the proposed approach for evaluation of performances of system architectures are highlighted through analysis of two specific case studies.

design process of embedded systems [4]. However, achievable simulation speed of transaction level models is limited by the amount of required transactions and integration of nonfunctional properties can significantly reduce simulation speed. Therefore, specific modeling techniques are required to correctly abstract non-functional properties and improve efficiency of simulation. In this paper, an approach for creation of efficient transaction level models for performance evaluation of system architectures is presented. The contribution is about a specific computation method proposed to improve expression of nonfunctional properties assessed for performance evaluation. This proposal is based on the distinction between the description of system evolution, driven by transactions, and the description of non-functional properties. This separation of concerns leads to reducing the number of events in transaction level models and favors creation of abstract models. Simulation speed-up is then achieved due to significant reduction of required transactions and the proposed method still preserves accuracy in evaluation of performances. This method has been validated through the use of a specific modeling framework based on the SystemC language [5]. The proposed approach provides fast evaluation of performances and allows efficient exploration of different configurations of architectures. The benefits of this approach are highlighted through two case studies. Created models are simulated to evaluate performances in terms of processing resources and memory cost in order to correctly fix platform properties. The remainder of this paper is structured as follows. Section II analyzes related modeling and simulation approaches used for evaluation of performances of embedded systems. In Section III, the proposed modeling approach is presented. In Section IV, we detail the computation method used to improve simulation speed of models. In Section V, we describe the implementation of the proposed approach in the considered modeling environment. Section VI highlights the benefits of the approach through two case studies. Finally conclusions are drawn in section VII II. RELATED WORK

I.

INTRODUCTION

High performance applications supported by modern embedded devices imply definition of heterogeneous multiprocessor platforms. The process of system architecting consists of optimally defining organization and performances of such platforms in terms of processing, communication and memory resources according to functional and non-functional requirements. Typical non-functional requirements under consideration for embedded systems are timing constraints, power consumption and cost. Fast exploration of the design space and evaluation of non-functional properties early in the development process have then become mandatory to avoid costly iterations [1]. In this context, abstract models are then needed to maintain the still increasing complexity of embedded systems. As reported in [2], models for performance evaluation are usually created applying the principles of the Y-chart model. Following this approach, a model of the system application is mapped onto a model of the considered platform and the resulting description is then analyzed analytically or by simulation. Modeling of computation and modeling of communication can be strongly separated and defined at various abstraction levels on both application and platform sides [3]. Raising level of design abstraction above Register Transfer Level (RTL), Transaction Level Modeling (TLM) offers a good trade-off between modeling accuracy and simulation speed and it has then emerged recently in the
978-1-4577-0660-8/11/$26.00 2011 IEEE

Performance evaluation of embedded systems has been approached in many ways at different levels of abstraction. A good survey of various methods, tools and environments for

156

early design space exploration is presented in [6]. Typically, performance models capture characteristics of system architectures and they are used to gain reliable data of resource usage. For this purpose, performance evaluation can be performed without considering a complete description of system functionalities. This abstraction enables efficient simulation speed and favors early performance evaluation. Workload models are then defined to represent computation and communication loads applications cause on platforms when executed. Workload models are mapped onto platform models and resulting architecture models are simulated to obtain performance data. Among simulation-based approaches, TLM has recently received wide interest in industrial and research communities in order to improve system design and productivity. Transaction level models make possible to hide unnecessary details of communication and computation. Formally, a transaction has been defined in [4] as the data transfer or synchronization between two modules at an instant determined by the hardware/software system specification. The different levels of abstraction considered in TLM approaches are classified according to granularity of computation and communication and time accuracy [3][4]. TLM is supported by languages such as SystemC [5] and SystemVerilog [7], notably through the TLM2.0 standard promoted by OSCI [8]. Works presented in [9] give a quantitative analysis of the speed/accuracy trade-off in transaction level models. Typically, using the SystemC language, simulation speed is related to the number of thread context switches which usually grows with increasing number of modules. The approach presented in [10] attends to transform the structural description of designs with concurrency alignment along modules in order to minimize switches. The transformation technique re-assigns the concurrency along the dataflow keeping functionality of the model unchanged. In [11], a method is presented to minimize the number of synchronization points in system description by optimizing the granularity of transactions. In the following, we adopt a similar approach in order to reduce the amount of required events in transaction level models created for evaluation of non-functional properties. Among existing approaches for performance evaluation of embedded systems, different optimization objectives are addressed in order to assist designers to fix platform parameters early in the development process. The design framework presented in [12] supports system modeling at different levels of abstraction. The architecture exploration step supported mainly focuses on optimization of allocation and partitioning. System architecture consists of processing elements and memories. These components are selected by the system designer as part of the decision making. In [13], the proposed methodology allows for architectural exploration at different levels of abstraction. Candidate architectures are selected using analytical modeling and multi-objective optimization taking into account parameters such as processing capacities, power consumption and cost. Potential solutions are then simulated at transaction level using SysemC. In [14], performance evaluation is performed by a combined simulation, associating functionalities and timing in one single simulation model. The performance of each feasible implementation is then assessed with respect to a given set of stimuli and by means of average

latency and average throughput. The design framework proposed in [1] aims at evaluating non-functional properties such as power consumption and temperature. In this approach, the description of the application is done through a model called communication dependency graph. This description is completed by SystemC models of non-functional properties. Simulation is then performed to obtain an evaluation of achieved power consumption. Both approaches presented in [15] and [16] combine UML2 description for application modeling and platform modeling in SystemC for performance evaluation. Applications are modeled in terms of services required from the underlying platform. Workload models are defined to illustrate the load an application causes to a platform when executed in terms of processing and communication. These workload models do not contain timing information; it is left to the platform model to find out how long it takes to process the workloads. Our approach mainly differs from the above as to the way the system architecture is modeled and the models of workload are defined. Besides, we pay a specific attention to the optimization of models in order to improve the related simulation speed. III. CONSIDERED MODELING APPROACH

The considered modeling approach aims at creating approximately-timed models used for evaluation of properties related to system architectures. It is based on a single view that combines structural description of system under study and nonfunctional properties relevant to considered hardware and software resources. This approach is illustrated in Figure 1.
Model of system architecture A1 A2 M K M t = Tj t = Tk s3 t = Tl s0 CcA2=0;

McA2=0;

s1 CcA2=Ccs1; s2 CcA2=0; McA2=Mcs2;


CcA2=Ccs3; McA2=Mcs3;

McA2=Mcs1;

Considered system architecture

F11 P1

F12

Node P2 Mem1

F2

Figure 1. Considered modeling approach for evaluation of non-functional properties of system architectures.

The lower part of Figure 1 depicts a typical platform made of communication nodes, memories and processing resources. Processing resources are classified as processors and dedicated hardware resources. In Figure 1, F11, F12 and F2 represent functions of the system application. They are allocated on the processing resources P1 and P2 to form the system architecture. For clarity reason, communications and memory accesses

157

induced by this allocation are not represented. The upper part of the figure depicts the model of the system architecture. This model exhibits transactions exchanged between activities and utilization of the platform resources. This model is based on an activity diagram notation inspired from [17]. Following this notation, single arrow links correspond to transactions exchanged between activities and the communication is in conformity with the rendezvous protocol. The behavior of each activity exhibits waiting conditions on input transactions and production of output transactions. It also expresses the use of processing and memory resources considering the allocation of functions. Transitions between states are expressed as waiting transactions, time conditions or logical conditions on internal variables. In Figure 1, the use of processing resources due to execution of function F2 on P2 is modeled by evolution of the parameter denoted CcA2. In this simple example, Ccs1 operations are first executed for a duration set to Tj after reception of transaction M. Production of transaction Q is done once state s3 finished. Parameter McA2 describes evolution of the amount of memory required during execution of the activity A2. The different internal variables related to each activity can be influenced by data associated to the input transaction M. Following this approach, resulting model incorporates quantitative properties defined analytically and relevant to the use of processing resources, communication nodes and memories. These analytical expressions of quantitative properties and related time properties are directly influenced by the characteristics of resources considered to support function execution. These expressions are provided by estimations and measurements. Using languages as SystemC, created models can then be simulated to evaluate time evolution of performances obtained considering a given set of stimuli. Various platform configurations and function allocations can be compared considering different descriptions of the behavior of activities. In the following, the presented contribution is about the optimization of the descriptions of activities in order to improve the simulation time of such models. IV. PROPOSED COMPUTATION METHOD OF NONFUNCTIONAL PROPERTIES OF SYSTEM ARCHITECTURES As previously discussed, simulation speed of transaction level models can be significantly improved by avoiding context switches between threads. The proposed computation method relies on the same principle as temporal decoupling supported by the loosely-timed coding style defined by OSCI. Using this coding style, parts of the model are permitted to run ahead in a local time until they reach the point when they need to synchronize with the rest of the model. The proposed method can be seen as an application of this principle to create models for evaluation of architecture performances. It aims at minimizing number of transactions required for description of properties assessed for evaluation of performances Figure 2 illustrates application of proposed computation method.

Figure 2. Comparison of two modeling approaches in order to minimize the amount of required transactions in models used for performance evaluation.

Figure 2 depicts transactions exchanged between two activities and the behavior of the receiving activity denoted A2. The upper part of the figure corresponds to a description with 4 successive transactions. The durations between the successive transactions are denoted t1, t2 and t3 and they are relevant to the communication node used for transfer of data. In this transaction-based modeling approach, the considered property cA2 evolves each time a transaction is received. The lower part of the figure considers the description of the activity A2 at a higher abstraction level. Only one transaction occurs and the content of the transaction is defined at higher granularity. However, evolution of the property cA2 can be preserved by considering a separation with evolution of the activity behavior. In that case, the duration Ts corresponds to the time elapsed between the first transaction and the last transaction considered in the upper part of the figure. It is locally computed relatively to the arrival time of the input transaction M and it defines the next output event. In Figure 2, this is denoted by action ComputeAfterM. The time condition is evaluated during state s0 considering evolution of the simulation time denoted ts. Besides, evolution of property cA2 between two external events is also done during state s0. The successive values, denoted cs0, are evaluated in zero time according to the simulation time. This means that no SystemC wait primitives are used, leading to no thread context switches. Resulting observations correspond to the values cs0 and the associated timestamps To. Timestamps values are considered relatively to what we call the observed time denoted to. Using this technique, the evolution of the considered property can then be computed locally between external transactions. Compared to the previous transaction-based approach, the second modeling approach with the related computation technique can be considered as a state-based approach. Non-functional properties are then locally computed in the same state which reduces the number of required transactions. Figure 3 represents time evolution of property cA2 considering the two modeling approach illustrated in Figure 2.

158

according to the considered approach. Figure 4 illustrates the graphical modeling adopted in CoFluent Studio to implement the proposed computation method. It corresponds to the specific case illustrated in Figure 2 with one input transaction and one output transaction.

Figure 4. Graphical modeling in the CoFluent Studio framework to implement the proposed computation method.

Figure 3. Evolution of property cA2 considering, (a), a transaction-based modeling approach and, (b), the proposed state-based modeling approach.

The upper part of Figure 3 illustrates time evolution of property cA2 with 4 successive input transactions. During simulation of the model each transaction implies a thread context switch between activities and cA2 evolves according to the simulation time. In the lower part of the figure, successive values of property cA2 and associated timestamps are computed at reception of transaction M. Evolution is depicted according to the observed time to. Improved simulation time is achieved due to the amount of context switches avoided. More generally, we can consider that when the number of transaction is reduced by a factor of N, a simulation speed-up by the same factor could be achieved. This computation method favors creation of abstract models and utilization of platform resources can be computed at finer level with low influence on simulation time. We have considered the implementation of the proposed computation method in a specific modeling framework in order to analyze its influence on simulation time of models. V. IMPLEMENTATION OF THE COMPUTATION METHOD IN A SPECIFIC FRAMEWORK

In Figure 4, the function denoted A2 is activated once the input transaction M has been received. The production instant of the output transaction Q is computed in the operation denoted OpPerformanceAnalysis. Duration of operation OpPerformanceAnalysis corresponds to duration Ts defined in Figure 2. The other operations OpInit and OpUpdating are executed in zero time according to the simulation time. The loop with a boolean condition on the internal variable denoted Wait_Input is added to manage possible output transactions produced successively before waiting a new input transaction. The operation OpPerformanceAnalysis is described in a C/C++ sequential code to define the computation of properties and display. The example given bellow corresponds to the required instructions to obtain the observations depicted in the lower part of Figure 3.
{ To = CurrentUserTime(ns); CofDisplay(to=%f ns, cA2=%f op/s, To, To = To+t1; CofDisplay(to=%f ns, cA2=%f op/s, To, To = To+t2; CofDisplay(to=%f ns, cA2=%f op/s, To, c3); To = To+t3; CofDisplay(to=%f ns, cA2=%f op/s, To, To = To+Tl; CofDisplay(to=%f ns, cA2=%f op/s, To, OpDuration = To-CurrentUserTime(ns); } c1); c2);

c4); 0);

The proposed computation method has been implemented in the framework CoFluent Studio [18]. This environment supports creation of transaction level models of system applications and architectures. Graphical models captured and associated codes are then automatically generated in a SystemC description and they are simulated to analyze model execution and to assess performances. We used the so-called TimedBehavioral Modeling part of this framework to create models

Procedure CurrentUserTime is used in CoFluentStudio to get current simulation time. In our case, it is used to get the reception time of input transactions and to compute values of durations To and Ts. Procedure CofDisplay is used to display variables in a Y=f(X) chart. In our case, it is used to

159

display studied properties according to the observed time. The keyword OpDuration defines the duration of the operation OpPerformanceAnalysis and it is evaluated according to the simulation time. Successive values of cA2 and timestamps are provided by estimations and could also be computed according to data associated to the input transaction M. This model has been extended to the case of functions with multiple input and output transactions. In the following, we consider this implementation of the proposed method to create executable models. Models are then simulated to assess performances of considered architectures. VI. CASE STUDIES

model makes possible to analyze the use of processing resources according to the rate of input transactions. Figure 6 depicts possible observations obtained with the CoFluent Studio simulation tool. In the considered example, input transactions are successively received with a period set to 0.125 ms.

A. Modeling of a pipeline architecture First case study aims at illustrating proposed modeling approach and simulation speed-up obtained with the computation method presented in Section IV through a didactic case study. Application considered is about a Fast Fourier Transform (FFT) algorithm which is widely used in digital signal processing. A pipeline architecture based on hardware resources is analyzed. To easily illustrate proposed modeling approach, an 8-point FFT is considered. Modeling approach is used to estimate resource utilization and computation method is used to reduce simulation time of performance model. Figure 5 illustrates pipeline architecture considered and related performance model.
Model of system architecture
InputSymbol

Figure 6. Time evolution of computational complexity (in KOPS) of the considered system architecture.
Stage2
InputStage3

Stage1

InputStage2

Stage3

OutputSymbol

Considered system architecture Stage1 Stage2 Stage3

Input Symbol

+ - w

+ - w

+ - w

Output Symbol

Figure 5. Modeling of a 3-stage pipeline architecture.

The upper part of Figure 6 gives an example of time evolution of global computational complexity per time unit required for the complete architecture with three successive executions. The lower part gives an illustration of processed input and output transactions as could be observed in the timeline view proposed by the CoFluent Studio simulation tool. For clarity reason, only the first input transactions and the last output transactions produced are depicted. The behavior of each activity is relevant to the architecture of each stage and to the time constraints allocated to process each complex symbol. In the considered configuration, estimated computational complexity per time unit is evaluated to 120 KOPS. Considering the state-based modeling approach depicted in Figure 3, we have defined a model of the system with the same structure but with a higher level of data granularity. Transactions are made of eight complex symbols and only one transaction is required to execute an iteration of the complete architecture. Start time corresponds to reception of the first complex symbol and other instants are locally computed relatively to this value. Evolution of the load of processing resources for each transaction is computed considering the method previously presented and a similar observation of computational complexity is obtained. The average simulation speed-up measured is about 7.62, compared to a theoretical factor of 8. We can then notice the weak influence of the computation method on the improvement of the simulation time. Similar observations have been obtained by increasing number of stages in pipeline architecture.

The lower part of Figure 5 describes typical pipeline architecture as implemented in most commercial FFT IPs. This architecture enables to simultaneously perform transform calculations on a current frame of 8 complex symbols, load input data for next frame, and unload 8 output complex symbols of previous frame. Each stage is made of processing (adders, multipliers) and memory resources. The upper part gives the structural description of the associated model. The behavior of activities Stage1, Stage2, and Stage3 describe utilization of processing resources each time an input transaction is received. Behavior of each activity is described following modeling approach presented in Section III. Architecture has first been modeled following a transactionbased modeling approach. Input transactions are made of one single complex symbol. Eight input transactions are then required to process one iteration of FFT algorithm. This model has been captured in the CoFluent Studio framework following the previously presented modeling approach. The created

160

B. Modeling of a pipeline architecture Second case study considered here concerns the creation of a transaction level model for analysis of processing functions involved at the physical layer of the 3GPP LTE protocol [19]. The aim of the model is to study required computational complexity and memory cost according to the various possible parameters associated to each function. In the following, we consider the reception part of a downlink transmission in a single input single output (SISO) configuration. The structural representation of this system is given in Figure 7.

Figure 7. Activity diagram of communication receiver studied.

In the configuration depicted in Figure 7, input transactions are successively received each 1 ms. They are made of 14 OFDM symbols which size can vary according to considered throughput. Based on a detailed analysis of processing and memory resources required for each function [20], we have defined analytical expressions for each activity. These expressions give relation between functional parameters and resulting computational complexity in terms of arithmetic operations and required memory resources. For example, the number of sub-carriers directly gets influence on the computational complexity of the OFDM demodulator function. We used proposed modeling approach to describe each elementary activity depicted in the lower part of Figure 7. The behavior of each activity exhibits the way processing and memory resources are used. The computation method is used to locally compute time evolution of computational complexity and memory cost related to each activity. Time properties defined for each activity depend on the architecture evaluated. In the following, results are presented considering a platform made of dedicated hardware resources to implement each function. We captured the complete model using the CoFluent Studio tool. Each activity has been captured following approach illustrated in Figure 4. The captured LTE receiver model represents 3850 lines of SystemC code, with 22 % automatically generated by the tool CoFluent Studio. Rest of the code corresponds to the sequential C/C++ code defined for computation of studied non-functional properties and display. This model makes possible to observe the evolution of the computational complexity per time unit for each activity and for the complete architecture. The observation given in Figure 8 represents obtained evolution of computational complexity according to various configurations of the input frames.

Figure 8. Observation of estimated computational complexity (in GOPS) of the receiver architecture according to various configurations of LTE sub-frames.

In Figure 8, we observe the evolution of the computational complexity during the reception of successive LTE sub-frames. The system configuration evolves during execution according to various parameters: the number of blocks of data allocated per user is denoted NbRB, the size of the OFDM symbol is denoted NFFT, and the number of iterations of the channel decoder is NbIterTurboDecod. These parameters vary from one frame to the next, and the modulation scheme can also change during system execution; in Figure 8 the modulation schemes are QPSK, 64QAM and 16QAM. We observe that the global computational complexity varies strongly during system execution and that the estimated maximum value is 70 giga-operations per second (GOPS) for the three configurations evaluated. Most of the computational complexity is due to the activity of the channel decoder function. The same model is used to evaluate the memory cost associated with the receiver system. This observation is given in Figure 9.

Figure 9. Observation of the estimated memory cost (in KByte) of the receiver architecture according to various configurations of LTE sub-frames.


Figure 9 illustrates the evolution of the memory cost during the successive computation of LTE sub-frames. The maximum value reached is estimated at 570 KBytes. The observations given in Figures 8 and 9 are used to estimate the expected resources of the architecture. Executing the created model for 1000 input frames took 11 s on a 2.66 GHz Intel Core 2 Duo machine. This is fast enough for performance evaluation and for simulating multiple configurations of architectures. The time properties and quantitative properties defined for each activity can easily be modified to evaluate various configurations of the architecture. We also used this approach to evaluate properties related to a heterogeneous architecture made of dedicated hardware resources and one processor core.
VII. CONCLUSION
The creation of abstract models represents a reliable solution to manage the design complexity of embedded systems and to enable the architecting of complex hardware and software resources. In this paper, we have presented an approach for the creation of transaction level models for performance evaluation. According to this approach, the system architecture is modeled as an activity diagram and the description of activities incorporates properties relevant to resource usage. The contribution is a specific computation method that favors the creation of more abstract transaction level models. Simulation speed-up is achieved through a significant reduction in the number of transactions in the models, and architecture properties are computed in zero simulation time. This method makes it possible to significantly increase the simulation speed of models while still preserving the accuracy of observations. The experimentation of this method has been illustrated through the use of the CoFluent Studio framework. However, the presented modeling approach is not limited to this specific environment and could be applied to other SystemC-based frameworks. Further research is directed towards applying the same modeling principle to other non-functional properties such as dynamic power consumption.
REFERENCES
[1] A. Viehl, B. Sander, O. Bringmann, and W. Rosenstiel, Integrated requirement evaluation of non-functional system-on-chip properties, in Proceedings of the Forum of specification and Design Languages (FDL08), Stuttgart, Germany, September 2008.

[2] D. Densmore, R. Passerone, and A. Sangiovanni-Vincentelli, A platform-based taxonomy for ESL design, IEEE Design and Test of Computers, vol. 23, no. 5, pp. 359-374, September/October 2006.
[3] L. Cai and D. Gajski, Transaction level modeling: an overview, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS03), Newport Beach, October 2003.
[4] F. Ghenassia, Transaction-level modeling with SystemC: TLM concepts and applications for embedded systems, Springer, 2005.
[5] Open SystemC Initiative (OSCI), Functional specification for SystemC 2.0, http://www.systemc.org
[6] M. Gries, Methods for evaluating and covering the design space during early design development, Integration, the VLSI Journal, vol. 38, no. 2, pp. 131-183, 2004.
[7] SystemVerilog, http://www.systemverilog.org
[8] Open SystemC Initiative TLM Working Group, Transaction Level Modeling Standard 2 (TLM 2), June 2008.
[9] G. Schirner and R. Dömer, Quantitative analysis of the speed/accuracy trade-off in transaction level modeling, ACM Transactions on Embedded Computing Systems, vol. 8, no. 4, pp. 1-29, 2008.
[10] N. Savoiu, S. K. Shukla, and R. K. Gupta, Automated concurrency reassignment in high level system models for efficient system-level simulation, in Proceedings of Design, Automation and Test in Europe, 2002.
[11] J. Cornet, F. Maraninchi, and L. Maillet-Contoz, A method for the efficient development of timed and untimed transaction-level models of systems-on-chip, in Proceedings of Design, Automation and Test in Europe (DATE08), Munich, Germany, March 2008.
[12] R. Dömer, A. Gerstlauer, J. Peng, et al., System-on-chip environment: a SpecC-based framework for heterogeneous MPSoC design, EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[13] A. D. Pimentel, C. Erbas, and S. Polstra, A systematic approach to exploring embedded system architectures at multiple abstraction levels, IEEE Transactions on Computers, vol. 55, no. 2, pp. 99-111, 2006.
[14] C. Haubelt, J. Falk, J. Keinert, et al., A SystemC-based design methodology for digital signal processing systems, EURASIP Journal on Embedded Systems, vol. 2007, 2007.
[15] J. Kreku, M. Hoppari, T. Kestilä, et al., Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems, EURASIP Journal on Embedded Systems, vol. 2008, 2008.
[16] T. Arpinen, E. Salminen, T. Hämäläinen, and M. Hännikäinen, Performance evaluation of UML2-modeled embedded streaming applications with system-level simulation, EURASIP Journal on Embedded Systems, vol. 2009, 2009.
[17] J. P. Calvez, Embedded real-time systems: a specification and design methodology, John Wiley & Sons, May 1993.
[18] CoFluent Design, http://www.cofluentdesign.com/
[19] E. Dahlman, S. Parkvall, J. Skold, and P. Beming, 3G Evolution, HSPA and LTE for Mobile Broadband, Academic Press, 2008.
[20] J. Berkmann, C. Carbonelli, F. Dietrich, C. Drewes, and W. Xu, On 3G LTE terminal implementation: standard, algorithms, complexities and challenges, in Proceedings of the International Wireless Communications and Mobile Computing Conference (IWCMC08), August 2008.


Session 6: Software for Embedded Devices


Task Mapping on NoC-Based MPSoCs with Faulty Tiles


Evaluating the Energy Consumption and the Application Execution Time
Alexandre M. Amory, César A. M. Marcon, Fernando G. Moraes
FACIN, Faculdade de Informática, PUCRS Catholic University, Porto Alegre, Brazil. {alexandre.amory, cesar.marcon, fernando.moraes}@pucrs.br
Abstract - The use of spare tiles in a network-on-chip based multiprocessor chip can improve the yield, reducing the cost of the chip and maintaining the system functionality even if the chip is defective. However, the impact of this approach on application characteristics, such as energy consumption and execution time, is not documented. For instance, on one hand the application tasks might be mapped onto any tile of a defect-free chip; on the other hand, a chip with defective tiles needs a special task mapping that avoids the faulty tiles. This paper presents a task mapping approach aware of faulty tiles, where an alternative task mapping can be generated and evaluated in terms of energy consumption and execution time. The results show that faults on tiles have, on average, a small effect on energy consumption but no significant effect on execution time. This demonstrates that spare tiles can improve yield with a small impact on the application requirements. Keywords: MPSoC, task mapping, yield, energy consumption, execution time.

Marcelo S. Lubaszewski
PPGC, Instituto de Informática, UFRGS Federal University, Porto Alegre, Brazil. luba@eletro.ufrgs.br

I. INTRODUCTION

A multiprocessor system-on-chip (MPSoC) is typically a very large scale integrated system that incorporates most or all of the components necessary for an application, including multiple processors [1]. A network-on-chip (NoC) is the preferred intrachip communication infrastructure for MPSoCs due to its superior performance, scalability, and modularity. MPSoCs that use NoCs as the communication infrastructure are also called NoC-based MPSoCs. NoCs can consume more than one third of the total chip energy [2][3]. On the other hand, the shrinking feature sizes of newer technologies and the supply voltage scaling [4][5] increase the defect rate in chip manufacturing and reduce the yield. High manufacturability, low latency and low energy consumption are conflicting design goals, thus all these requirements have to be jointly evaluated to optimize a NoC-based MPSoC design. The task mapping problem determines an association of each application task to a tile so as to minimize some given cost function. This paper presents a tool that finds an optimal task mapping in terms of energy consumption and application execution time, given a set of tiles with manufacturing defects. This way, even chips with defects can be sold, perhaps with some performance degradation, targeting low-end markets. The goals of this paper are to present the aforementioned task mapping tool and to investigate the energy consumption and application execution time degradations assuming different application classes. The contributions of this paper are (i) a task mapping tool for NoC-based MPSoCs, which considers faulty tiles to perform the mapping; (ii) the evaluation of energy consumption and application execution time under the presence of faulty tiles; and (iii) a statistical method to generate fault scenarios for very large SoCs. The paper is organized as follows: Section II presents the motivation, the usage of the proposed approach, and the main assumptions. Section III describes the related work. Section IV describes the task mapping tool and its models. Section V describes the experimental setup, the evaluated applications, and the fault scenarios. Section VI discusses the results. Section VII concludes the paper.

II. PRELIMINARIES

A. System Model and Assumptions
This paper assumes that the target MPSoC consists of a set of identical (or homogeneous) tiles connected by a mesh-based NoC with an XY routing algorithm. Each tile contains three main components: a network interface, a processor, and a memory block. A tile supports one task only (no multitasking). This system model is equivalent, for instance, to the underlying model of the HeMPS MPSoC [6] with the Hermes NoC [7]. The present work assumes faults only on the tiles, since we assume that the tile area is at least 90% of the router; therefore, the communication infrastructure is assumed fault-free. A faulty tile is completely shut down, thus it neither consumes energy nor generates traffic in the network. The faults are the result of defects created during chip manufacturing. These defects are expected to become more common due to the evolution of deep submicron technologies, thus multiple faults on the chip are considered. The proposed task mapping is executed at design time for several fault scenarios, such that an overall picture of the relationship between the fault location and the performance metrics can be drawn.

B. Motivating Example
Redundant hardware is commonly used to tackle the yield problem. It has been successfully applied to all sorts of regular and repetitive hardware, like different types of memories, programmable logic arrays, field programmable gate arrays, and recently to MPSoCs [5]. In the context of MPSoCs, the application task located in a faulty tile can be mapped (at design time) or migrated (at run-time) to a spare tile, keeping the chip functional. Shamshiri and Cheng [5] proposed a yield and cost analysis framework to evaluate the use of spare tiles in MPSoCs, which can be used to determine the amount of redundancy required to achieve a minimum cost. For instance, given some input parameters detailed in [5], the yield of a block is 94% and that of a NoC link is 72%, resulting in a system yield of just 21% for a 3x3 mesh NoC, i.e. there is a 79% probability of having at least one faulty block in the system. By including three spare tiles to


the system, increasing the number of tiles from 9 to 12, the system yield increases to 99% since only 9 out of 12 tiles are actually required to have a functional system. Moreover, the manufacturing cost is 3.2 times less than the original system, since the additional silicon area of the spare tiles is compensated by the increased yield. Given these motivating results, we decided to investigate the use of spare tiles by evaluating the side effects of multiple faulty tiles on the energy consumption and application execution time.

C. Usage of the Proposed Approach
Figure 1 illustrates the proposed test approach, which starts as soon as the chip is manufactured. If the tested chip fails, a diagnosis step is performed to locate the faulty tiles. Let n be the number of system tiles and m be the number of tiles necessary to implement the system's functionality; then n-m is the number of spare tiles. If the number of faulty tiles is less than or equal to n-m, the locations of these faulty tiles are sent to the task mapping tool, otherwise the faulty chip is discarded. The task mapping tool, presented in Section IV, loads a NoC model and the application task graph to determine a new task mapping that avoids the faulty tiles. Finally, the tool is able to estimate the energy consumption and the application execution time of the resulting task mapping. Depending on the resulting overhead, chips with up to n-m faulty tiles can still be sent to the market, perhaps targeting low-end markets.
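As an illustration of this decision flow, the short C sketch below encodes the comparison of the number of faulty tiles against the n-m spare tiles. The type and function names are hypothetical and only stand in for the real test and mapping steps.

#include <stdio.h>

typedef enum { SELL_HIGH_END, SELL_LOW_END, DISCARD } chip_fate;

/* Stub standing in for the CAFES-based mapping step of Section IV. */
static void run_task_mapping(const int *faulty, int n) { (void)faulty; (void)n; }

static chip_fate decide_chip_fate(int n_tiles, int m_required,
                                  int n_faulty, const int *faulty_tiles)
{
    int spares = n_tiles - m_required;       /* n - m spare tiles        */
    if (n_faulty == 0) return SELL_HIGH_END; /* fault-free chip          */
    if (n_faulty <= spares) {                /* re-map around the faults */
        run_task_mapping(faulty_tiles, n_faulty);
        return SELL_LOW_END;
    }
    return DISCARD;                          /* not enough spare tiles   */
}

int main(void)
{
    int faulty[] = { 5, 10 };
    printf("fate = %d\n", decide_chip_fate(12, 9, 2, faulty));
    return 0;
}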
Figure 1. Proposed test flow for NoC-based MPSoCs with spare tiles.

III. RELATED WORK

There are several papers presenting approaches to improve the reliability of NoC-based SoCs. These papers can be broadly classified (the classes are not exhaustive) into: (i) fault-tolerant circuitry for NoCs and MPSoCs [8][9]; (ii) fault-tolerant NoC routing algorithms used to explore different packet routes in case of network faults [10]; (iii) system-level reliability assessment [5]; and (iv) system-level reliability co-optimization [11]. This paper fits best in the system-level reliability co-optimization category, where two main approaches are found: dynamic approaches executed at run-time, and static approaches executed at design time. The dynamic system-level reliability co-optimization approach is commonly based on on-line task mapping and task migration to better accommodate new incoming tasks on the fly, assuming the chip might have faults. It can also be used to react, for instance, to a run-time fault generated by transient effects or by permanent faults due to wear-out or aging [15][16]. In this case the tasks located at the faulty resources are moved at run-time to healthy resources. These approaches are out of the scope of this paper, since our goal is to improve the yield of chip manufacturing: manufacturing defects are not dynamic and do not appear during run-time. For this reason this paper is best related to static system-level reliability co-optimization approaches, based on static task scheduling executed at design time. Typically these approaches are largely used in design space exploration targeting the optimization of metrics such as application execution time, latency, thermal constraints, and energy consumption [12][13][14]. Recently these approaches have also begun to co-optimize reliability-related metrics. Manolache et al. [11] address the reliability problem at the application level. They propose a way to combine spatially and temporally redundant message transmission, where energy and latency overheads are minimized. Tornero et al. [17] propose a multi-objective optimization strategy, which minimizes energy consumption and maximizes a robustness index, called path diversity, that explores the multiple paths between a pair of nodes. In case of a faulty link, a NoC with adaptive or source-based routing algorithms could explore these multiple paths, improving the chip robustness. Choudhury et al. [18] introduce a new task mapping whose objective is to minimize the variance of the system power and latency when faults occur and to maximize the probability that the actual system will work when deployed. Huang et al. [19] argue that some processors might age much faster than others, reducing the system's lifetime. They propose an analytical model to estimate the lifetime reliability of MPSoCs. This model is integrated into a task mapping algorithm that minimizes the energy consumption of the system and satisfies a system lifetime reliability constraint. Huang and Xu [20] expand their previous task mapping tool [19] to support multi-mode embedded systems. Huang and Xu [21] argue that an exponential lifetime distribution can be inaccurate, thus they further refine the lifetime reliability model to support arbitrary lifetime distributions, improving the accuracy of the simulation results.

IV. TASK MAPPING AWARE OF FAULTY TILES

The CAFES task mapping framework [22] is composed of high-level models, algorithms and tools, whose goal is to map application tasks onto the target architecture tiles aiming to save energy and to minimize the execution time. Figure 2 illustrates a partial mapping flow and the main elements used here.

Figure 2. Mapping flow used to obtain optimal application mappings.

Based on the description of an application already partitioned into tasks ti, the designer may extract the relevant computation and communication aspects. The Communication Dependence and Computation Graph (CDCG) is a model used to describe the application. Each CDCG vertex models a communication with its source and target tasks, the communication volume, and the computation time, i.e. the period between the moment all dependences are solved and the beginning of the communication. CDCG edges represent the communication dependences, i.e. all vertices are connected to their dependences with edges. The CDCG is similar to a schedule graph, but focuses on communication aspects instead of computation, which makes it easy to explore several requirements of the communication architecture. Figure 3 depicts a small example of a CDCG, containing three communications {C1, C2, C3}. C1 and C3 are concurrent communications and neither has dependences, since the dependences of


the Start vertex (dStart_1 and dStart_3) are always solved. Thus, the C1 and C3 communications start immediately after their respective computation times: 10 and 20 clock cycles, respectively. Communication C1 states that t1 sends 100 bytes to t2, and C3 states that t3 sends 100 bytes to t1. As soon as the last byte of C1 is inserted into the NoC, d1_2 is solved. On the other hand, d3_2 is solved only when the last byte of the C3 communication arrives at the processor where t1 is mapped.
Figure 3. CDCG example.

The target architecture topology is modeled by means of a Communication Resource Graph (CRG), which consists of tiles (graph nodes) and links (graph edges). The energy and execution time parameters are extracted from the target architecture synthesized for a given technology. The faulty tile list is generated by the diagnosis flow presented in Figure 1. Based on the application description, the NoC energy parameters, the NoC execution time parameters, the NoC topology and the faulty tile list, the task mapping tool estimates the NoC energy consumption and the application execution time of different mappings, enabling the evaluation of the impact of faulty tiles. The next sections detail the underlying algorithms and the timing and energy models.
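A possible in-memory representation of these two models is sketched below in C; the field names are illustrative assumptions and do not reproduce the actual CAFES data structures.

/* Hypothetical sketch of the application (CDCG) and architecture (CRG)
 * models; field names are invented for illustration. */
typedef struct {
    int src_task, dst_task;   /* communicating tasks                        */
    int volume_bytes;         /* communication volume                       */
    int computation_cycles;   /* time between dependences solved and start  */
} cdcg_vertex;                /* one communication (CDCG vertex)            */

typedef struct {
    int from_comm, to_comm;   /* dependence edge between two communications */
} cdcg_edge;

typedef struct {
    int x, y;                 /* tile position in the mesh                  */
    int faulty;               /* 1 if the tile must be avoided              */
    int is_spare;             /* 1 if reserved as a spare                   */
} crg_tile;                   /* CRG node; CRG edges are the mesh links     */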

A. Mapping Algorithm
As stated before, the mapping problem consists in finding an association of each application task to a processor placed in a given tile that minimizes the global energy consumption and the application execution time. Let n be the number of tiles; this problem allows n! possible solutions. Given that future MPSoCs may contain hundreds of tiles, an exhaustive search of the solution space is clearly unfeasible. Thus, optimal implementations of such SoCs require the development of efficient mapping heuristics. Exhaustive analyses of some small applications mapped on NoC-based MPSoCs show that task mapping is clearly a problem with self-similar [23] behavior. In other words, there are several very different mappings with the same cost, i.e. the same energy consumption and execution time. Therefore, exploring not all but very different random mappings, followed by some refinements (new mappings with few changes), normally results in an optimized solution. Due to its two nested loops (an external one, which looks for very different solutions, and an internal one, which looks for a local minimum), Simulated Annealing (SA) is an algorithm well suited to finding solutions for self-similar problems. Our SA mapping algorithm searches for mappings that result in an MPSoC with minimum energy consumption and low execution time. To explore these requirements in the same cost function, the execution time requirement is expressed in terms of energy consumption: the static power dissipation is multiplied by the application execution time (texec), detailed in Section IV.B, giving the static portion of the energy consumption. As a result, both dynamic and static energy consumption are considered to compute the mapping cost function. To improve the yield, the SA algorithm searches for mappings with minimum cost, avoiding the tiles that are marked as spares or faulty. However, when a tile is marked as faulty, the algorithm replaces the faulty tile with a spare tile, which is fault-free.

B. Timing Model
The total packet delay (dijq) of a wormhole routing algorithm is composed of the routing delay (dRijq) and the packet delay (dPijq) of the remaining flits. The routing delay is the time necessary to create the communication path, which is determined during the traversal of the packet header. The packet delay depends on the number of remaining flits. Let nabq be the number of flits of the q-th packet from pa to pb, obtained by dividing wabq by the link width. Let T be the period of a clock cycle and ηij the number of routers traversed between tiles i and j, let tr be the number of cycles needed to route a packet inside a router, and let tl be the number of cycles needed to transmit a flit through a link (between tiles or between a processor and a router). The routing delay (dRijq) and the packet delay (dPijq) of the q-th packet from tile i to tile j are given by Equations (1) and (2), considering that the packet goes through the routers without contention. Contentions can only be determined at execution time.

dRijq = (ηij × (tr + tl) + tl) × T    (1)

dPijq = tl × (nabq - 1) × T    (2)

Equation (3) expresses the total packet delay (dijq), i.e. the packet latency, obtained from the sum of dRijq and dPijq.

dijq = (ηij × (tr + tl) + tl × nabq) × T    (3)

For example, when applying Equation (3) to a packet with 10 flits (nabq = 10) sent from tile 1 to tile 2 (two neighboring tiles, i.e. ηij = 2), and considering T = 1 ns, tr = 3 and tl = 1 clock cycles, the packet latency is 18 ns. The application execution time (texec) depends on both the application computation and communication. However, a simple equation does not express texec, since several communications and computations are often parallel. In addition, some communications may compete for the same communication resource (e.g. links and buffers) at the same time, which may cause contentions that increase the overall execution time. Contentions also make a single equation more complex. Therefore, texec is computed during the mapping algorithm execution, which repeatedly uses dijq and the time spent in each computation.
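The following C sketch is a minimal reading of Equation (3) under the no-contention assumption and reproduces the 18 ns example above; the function name and parameter names are ours.

#include <stdio.h>

/* dijq = (eta*(tr + tl) + tl*nabq) * T, assuming no contention. */
static double packet_latency_ns(int eta, int n_flits,
                                int tr, int tl, double T_ns)
{
    return (eta * (tr + tl) + tl * n_flits) * T_ns;
}

int main(void)
{
    /* Worked example from the text: 10 flits, two neighboring tiles,
     * T = 1 ns, tr = 3, tl = 1  ->  18 ns. */
    printf("%.1f ns\n", packet_latency_ns(2, 10, 3, 1, 1.0));
    return 0;
}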

C. Energy Model
The dynamic energy consumption is modeled using the concept of bit energy (EBit), similarly to the model described in [24]. For several communication architectures, EBit can be expressed as a function of four variable quantities, as depicted by Equation (4).

EBit = function(Es, Eb, Ec, El)    (4)

Es is the dynamic energy consumption of a single bit on the wires and logic gates of each router. Eb is the bit dynamic energy consumption on the router buffers. Ec is the dynamic energy consumption of a single bit on the links between routers and the local module. El is the bit dynamic energy consumption on the links between routers. Equation (5) illustrates how EBit models a 2D direct mesh NoC. It computes the dynamic energy consumed by a bit traversing such a NoC from tile i to tile j, where ηij corresponds to the number of routers that the bit traverses.


EBitij = ηij × (Es + Eb) + 2 × Ec + (ηij - 1) × El    (5)

Let wabq be the total amount of bits of a packet pabq going from pa to pb (i.e. processors a and b, respectively), which are mapped on tiles i and j, respectively. Then, the dynamic energy consumed by all k packets of the pa-pb communication is given by Equation (6).

EBitab = Σ (q = 1 to k) wabq × EBitij    (6)

Hence, Equation (7) gives the total dynamic energy consumed by the NoC (EDyNoC), where y represents the total number of communications between different processors pa and pb.

EDyNoC = Σ (i = 1 to y) EBitab,  with pa, pb ∈ processors set    (7)

The static power dissipation of each router (PRouter) is proportional to the number of gates that compose the router and can be estimated by electrical simulation. With n representing the number of tiles, Equation (8) computes the NoC static power dissipation (PNoC).

PNoC = n × PRouter    (8)

Using texec, explained in Section IV.B, Equation (9) computes the NoC static energy consumption (EsNoC).

EsNoC = PNoC × texec    (9)

Finally, Equation (10) gives the overall energy consumption of the NoC (ENoC), considering both static and dynamic effects, which the SA algorithm uses as the cost function to search for optimal mappings.

ENoC = EsNoC + EDyNoC    (10)
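The sketch below shows how Equations (5) to (10) could be combined into the SA cost function; all numeric values are placeholders and the helper names are assumptions, since the calibrated Hermes parameters are not listed here.

#include <stdio.h>

typedef struct { double Es, Eb, Ec, El; } bit_energy;   /* per-bit energies (J) */

static double ebit_ij(const bit_energy *e, int eta_ij)
{
    /* Eq. (5): energy of one bit crossing eta_ij routers */
    return eta_ij * (e->Es + e->Eb) + 2.0 * e->Ec + (eta_ij - 1) * e->El;
}

static double ebit_ab(const bit_energy *e, int eta_ij, int k, double w_abq)
{
    /* Eq. (6), simplified to k identical packets of w_abq bits each */
    return k * w_abq * ebit_ij(e, eta_ij);
}

int main(void)
{
    bit_energy e = { 1.0e-12, 1.2e-12, 0.5e-12, 0.8e-12 };  /* made-up values */
    /* Eq. (7) sums EBitab over all communications; a single one is shown.   */
    double EDyNoC = ebit_ab(&e, 2, 100, 8 * 64.0);
    double PNoC   = 12 * 1.0e-3;       /* Eq. (8): n tiles times PRouter (W)  */
    double EsNoC  = PNoC * 2.0e-3;     /* Eq. (9): PNoC times texec (s)       */
    double ENoC   = EsNoC + EDyNoC;    /* Eq. (10): the SA cost function      */
    printf("ENoC = %.3e J\n", ENoC);
    return 0;
}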

D. Model Calibration
The Hermes NoC [7], configured with a 16-bit phit and input buffers with four positions, was used to validate the timing and energy models. The Hermes VHDL description was synthesized to an ASIC standard cell library. The library also supplies energy values for the cells, which are used to extract the energy parameters. The synthesis result is a logic gate netlist. This netlist is associated with a customized VHDL library, which enables fast and accurate energy consumption and timing estimations. A testbench applies both random and typical traffic to the netlist, and the results achieved by VHDL simulation are compared to those obtained from the high-level mapping tool. Our experiments showed average errors below 30.5% and 14% for energy consumption and execution time estimations, respectively.

V. EXPERIMENTAL SETUP
This section presents the methods used to generate the combinations of faulty tiles, called fault scenarios. The first is an exhaustive method used for small NoCs and the second is a statistical method used for bigger NoCs. Finally, we present the application classes evaluated in this paper.

A. Exhaustive Fault Generation Method
Faulty tiles are exhaustively generated for all combinations of faulty tile locations, assuming a system with 1 to 3 faulty tiles. Thus, Equation (11) defines the total number of injected fault scenarios as the sum of all combinations of 1, 2, up to nfaults simultaneous faults in the x × y tiles, where C(n, k) denotes the binomial coefficient. For instance, a 3 × 4 mesh NoC requires 298 fault scenarios (12 single faults, 66 double faults, and 220 triple faults).

scens(x × y, nfaults) = C(x·y, 1) + C(x·y, 2) + ... + C(x·y, nfaults)    (11)

B. Statistical Fault Generation Method
The exhaustive fault generation method is precise; however, it might not be possible to perform exhaustive fault simulation due to the long CPU time. The main reason is that the total number of executions required to perform exhaustive fault simulation, defined in Equation (11), grows exponentially with the NoC size (x × y) and the maximum number of simultaneous faults (nfaults). Moreover, the CPU time of a single execution of the task mapping tool grows with the NoC size. For instance, a 3 × 4 mesh NoC with up to 3 simultaneous faults requires 298 task mapping executions (about 3 minutes of CPU) to perform exhaustive fault simulation. However, a bigger NoC like a 5 × 5 mesh with up to 3 faults requires 2625 executions, in about 60 hours of CPU time. The same 5 × 5 mesh NoC with up to 4 simultaneous faults requires 15275 executions, which we estimate would require about 14 days of CPU use. Even if the economical motivation for spare tiles is appealing, it might be unfeasible to perform an exhaustive fault simulation, since the CPU time becomes an issue for bigger NoCs with multiple faults. This section presents a statistical approach, called sample size estimation [25], used to determine the minimal number of fault scenarios required to obtain satisfactory results, close to the ones achieved by the exhaustive approach. This way, CPU time can be drastically reduced while the results remain accurate. Moreover, this method enables trading off CPU time against result accuracy. Before executing the sample size estimation, a pilot simulation is performed with a sample of small size. A sample represents a set of executions of the task mapping tool, where each execution assumes that the faulty tiles were randomly selected. Each execution of this pilot results in a different mapping with different energy consumptions and execution times. If the energy consumption is the value to be estimated, then this pilot gives the population's estimated standard deviation s of the energy consumed in the presence of randomly located faulty tiles. The population in this context represents the entire set of fault scenarios, as determined by Equation (11). The goal of the sample size estimation is to estimate the population average μ, i.e. the average energy consumption of the entire population of fault scenarios. Equation (12) is typically used for this purpose, where s is the estimated standard deviation of the sample, (x̄ - μ) is the difference between the estimated sample average x̄ and μ, which represents the acceptable error between the sample and the population, and tα,df is the value from Student's t-distribution table [25], where (1 - α) is the confidence level and df is the degree of freedom, defined as df = n - 1.

n = s² × (tα,df)² / (x̄ - μ)²    (12)

Since n is unknown, one can select an initial value of n to obtain tα,df. This value is used in Equation (12) to find a new n and a new tα,df. This calculation is performed iteratively until the value of n stabilizes. The stable value of n is the minimal sample size required to estimate the population average with the expected accuracy of results.
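A small C sketch of Equations (11) and (12) is given below. The Student-t value is passed in as a constant for illustration (the paper iterates because tα,df depends on df = n - 1); all names are ours.

#include <stdio.h>

/* Binomial coefficient C(n, k). */
static double choose(int n, int k)
{
    double r = 1.0;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;
    return r;
}

/* Eq. (11): number of fault scenarios with 1..nfaults faults in x*y tiles. */
static double scens(int x, int y, int nfaults)
{
    double total = 0.0;
    for (int f = 1; f <= nfaults; f++) total += choose(x * y, f);
    return total;
}

/* Eq. (12): n = s^2 * t^2 / (xbar - mu)^2. A single evaluation with a
 * fixed t value is shown, instead of the iterative refinement of t. */
static int sample_size(double s, double max_error, double t)
{
    double n = (s * s) * (t * t) / (max_error * max_error);
    return (int)(n + 0.999);   /* round up */
}

int main(void)
{
    printf("scens(3x4, 3) = %.0f fault scenarios\n", scens(3, 4, 3)); /* 298 */
    printf("sample size: %d\n", sample_size(3.9, 4.0, 2.0)); /* t=2.0 placeholder */
    return 0;
}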

C. System Application
We explore several parallel applications with distinct features, aiming to determine what kind of application increases the overhead in energy and execution time in the presence of faulty tiles. A synthetic application generator, detailed in [22], is used to create random CDCGs. This generator can build several application classes by varying parameters such as: (i) the number of processors, which allows investigating different target architecture dimensions; (ii) the number of graph levels, which specifies the number of dependent communications an application has; (iii) the dependence degree, which defines the probability that a vertex has more than one dependence, keeping in mind that dependent communications cannot compete for NoC resources; (iv) the probability of end meeting, which defines whether a vertex has dependences or is a final communication; (v) the computation time, i.e. the period, associated with each source task, between the moment all dependences are solved and the moment the communication of the source task starts; (vi) the communication volume, i.e. the quantity of bytes transmitted in each communication; and (vii) the number of parallel communications, which describes the minimum quantity of parallel communications an application has. For instance, varying the relation between computation time and communication volume, the application may change from IO-bounded to CPU-bounded; varying the relation between the number of graph levels and the dependence degree, the applications may be dataflow or concurrent. We built 39 synthetic applications, which enables us to explore applications classified as (i) IO- or CPU-bounded; (ii) dataflow with different levels of parallelism; and (iii) strongly parallel or sequential applications with different levels of concurrency for the communication architecture.

VI. EXPERIMENTAL RESULTS
This section evaluates (i) the application execution time under exhaustive fault scenarios, (ii) the average energy consumption under exhaustive fault scenarios, and (iii) the proposed statistical fault generation method used to estimate energy consumption, comparing it to the exhaustive method.

A. Evaluating the Application Execution Time
All application classes have been evaluated in terms of execution time using the exhaustive fault generation method. The result is that, independently of the application class, the execution time is not affected by the presence of faulty tiles. On average, the variation of execution time between the fault-free chip and the chips with up to 3 faulty tiles is close to 0%. The reason lies in the timing model, presented in Section IV.B, more specifically in Equation (3). The total application time consists of the computation time plus the communication time. The communication time consists of the routing delay, which depends on the distance between the communicating elements, plus the packet delay, which depends on the packet size. If the computation time is much greater than the communication time, then the task mapping has very little influence on the application execution time. Even if the computation and communication times are equivalent, if the packet size is big (hundreds of flits) the routing delay has a very small impact on the communication time (since the NoC works as a pipeline), and thus also a small impact on the application execution time. Since in typical scenarios an application has more computation than communication and applications use packets with hundreds of flits, the impact of the routing delay on the overall application execution time is almost negligible. This claim can be demonstrated with the following example. Let us assume a given application on a 3x4 mesh NoC, whose normal behavior is to have more computation than communication. This application is modified such that it has three variations: low communication (packets of one flit), low computation (CPU time of 1 clock cycle), and both communication and computation low. Exhaustive fault generation is performed for these cases, generating Figure 4, which shows the difference between the average execution time of the population of faulty chips and the execution time of the fault-free chip.

Figure 4. The average overhead of application execution time of faulty chips.

This figure demonstrates that faulty tiles have a significant influence on the chip execution time only if both the computation and the communication are low, which is not the typical situation. Most actual applications typically have bigger packet sizes and more computation than communication.

B. Evaluating the Energy Consumption
The energy consumption is evaluated for each class of application described in Section V.C. The result is that, regardless of the application class, only the proportion of good to faulty tiles affects the energy consumption. For instance, a chip with 15 tiles where two of them are faulty consumes more energy than the same chip with only one faulty tile. These results are illustrated in Figure 5 for a 3x5 mesh and an application with 12 tasks and 3 spare tiles. The average impact of a faulty tile on energy consumption is worse in the center of the NoC and it increases if there are more faulty tiles in the chip (Figure 5(a)). This impact gradually decreases as the distance from the center tiles increases (Figure 5(b)). However, if we map the same application on a 4x4 mesh NoC, then there are 12 tasks and 4 spare tiles. Figure 6 compares the energy profile of this application on a 3x5 against a 4x4 mesh NoC, assuming three faults in each of them. It can be observed that the energy overhead in the 4x4 is lower. The reason is the proportion of good to faulty tiles: in a 3x5 with 3 faults the proportion is 15/3, while in a 4x4 it is 16/3. This extra tile of the 4x4 gives more freedom to the task mapping tool to determine a good mapping, exploiting the self-similarity effect (Section IV.A) and resulting in a better task mapping.


Figure 5. The average energy consumption overhead of faulty chips.

Figure 6. The energy overhead with three faults on a 3x5 and a 4x4 mesh.

C. Evaluating the Statistical Fault Generation Method
This section demonstrates the fault generation method proposed in Section V.B. For this experiment, we assume that a small NoC is used, such as a 3x4 mesh NoC, because the total CPU time for both the statistical and exhaustive fault generation methods is not too high. An application with 9 tasks is used for this experiment, even though all other applications presented very similar results. Let us assume that the goal of this experiment is to estimate the average energy overhead when a fault hits a given tile, considering scenarios with 3 simultaneous faults. First, the exhaustive method is executed, running all combinations of 3 faults in 12 tiles, i.e. scens(3 x 4, 3) = 298 (Eq. 11) fault scenarios. It took about 3 minutes of CPU time to execute them. These results are considered the target results, i.e. the results we want to achieve with the statistical method. The second step is to execute a pilot experiment with a small number of randomly selected fault scenarios per router. This pilot experiment is used solely to extract the standard deviation of the energy consumed by the chips with three random faulty tiles. The estimated standard deviation is 3.9% of deviation on energy consumption. The proposed sample size estimation approach is executed assuming two situations: (i) standard deviation of 3.9, confidence interval of 95%, and maximum error of 4%; and (ii) standard deviation of 3.9, confidence interval of 98%, and maximum error of 2%. The estimated sample sizes for these situations are 8 and 23, respectively. This means that each tile must be in at least 8 or 23 fault scenarios. From now on, the first situation is called sample8 and the second is called sample23. TABLE 1 presents the obtained results in terms of CPU time, total number of scenarios and the maximum error observed for each tile.

TABLE 1. RESULTS FOR THE STATISTICAL FAULT GENERATION METHOD.

                      Exhaustive   Sample8   Sample23
CPU time (s)          192          27        63
# scenarios           220          39        101
max obs. error (%)    -            2.4       0.8

Figure 7 illustrates the three situations and their respective heat charts, representing the energy overhead when a fault is found at each tile. Each square represents the average energy consumption for each tile. It can be observed that the exhaustive method produces the expected results (the energy gradually decreases from the center to the borders). Sample23 produces almost the same results as the exhaustive method, with a small error but with much less CPU time. Sample8 produces large errors, indicating that the sample size is not sufficient to accurately estimate the energy overhead for each router. Even if the exhaustive results are not available, it is still possible to check the accuracy of the sample by visually analyzing the heat chart shown in Figure 7. For instance, the expected appearance of a good heat chart is like that of the exhaustive test set, even if we assume NoCs of different sizes and different applications. Note that the heat chart for sample8 deviates from the expected appearance, indicating that one should increase the sample size, if possible, to increase the accuracy of the results.

Figure 7. Visual analysis of the statistical fault generation method (exhaustive, sample8, and sample23).

Figure 8 overlaps the average results for the three situations. By comparing the exhaustive set with the other test sets, it can be seen that the biggest error for sample8, located in tile [2, 1], is 2.4% (see 1), which is below the maximum error stipulated for this set of experiments (4%). The biggest errors for sample23, located in tiles [1, 1] and [2, 0] (see 2), are around 0.8%, which is below the maximum error stipulated for this set of experiments (2%).

Figure 8. Close analysis of the resulting error by overlapping the average results for exhaustive, sample8, and sample23.

The example presented in this section demonstrates that the proposed fault generation approach enables trading off CPU time against result accuracy by selecting different values of the acceptable difference (x̄ - μ) and of the confidence level (1 - α).

VII. FINAL REMARKS


Previous papers have demonstrated that the use of spare tiles can significantly improve yield and reduce the manufacturing cost of NoC-based MPSoCs. The tool presented in this paper determines the task mapping for NoC-based MPSoCs with faulty tiles, minimizing the energy consumption and the application execution time. This way, defective chips can still execute the application, perhaps with some performance degradation, but can at least be sold to a lower-end market, for example. This paper evaluates the energy consumption and application execution time of faulty chips compared to fault-free chips. We evaluated several different classes of applications to check whether any particular application feature could affect the energy consumption or application execution time under faulty tiles. The results show that the spare tile approach has a small impact on energy consumption, and this impact can be even smaller if the proportion of good to faulty tiles is higher. The existence of faulty tiles on the chip has, on average, no significant influence on the application execution time. Based on these results, we conclude that the spare tile approach can increase yield and reduce cost with small penalties on the application requirements. Finally, this paper also proposed a statistical fault generation approach targeting very large MPSoCs. This approach demonstrates that a small sample of fault scenarios is sufficient to obtain a reasonably accurate estimation of the energy consumption, and it enables trading off CPU time against result accuracy.

VIII. ACKNOWLEDGMENT
Alexandre is supported by postdoctoral scholarships from Capes-PNPD and FAPERGS-ARD, grants number 02388/09-0 and 10/0701-2, respectively. Fernando Moraes is supported by CNPq and FAPERGS, projects 301599/2009-2 and 10/0814-9, respectively. Cesar Marcon and Marcelo Lubaszewski are partially supported by CNPq scholarships, grants number 308924/20088 and 478200/2008-0, respectively.

IX. REFERENCES

[1] Wolf, W.; Jerraya, A. A.; Martin, G. Multiprocessor system-on-chip (MPSoC) technology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(10), pp. 1701-1713, 2008.
[2] Kahng, A.; et al. ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. DATE, pp. 423-428, 2009.
[3] Lee, S. E.; et al. A high level power model for network-on-chip (NoC) router. Computers & Electrical Engineering, 35(6), 2009.
[4] Refan, F.; et al. Reliability in application specific mesh-based NoC architectures. IEEE International On-Line Testing Symposium, pp. 207-212, 2008.
[5] Shamshiri, S.; Cheng, K.-T. Yield and cost analysis of a reliable NoC. VLSI Test Symposium, pp. 173-178, 2009.
[6] Carara, E. A.; et al. HeMPS - a framework for NoC-based MPSoC generation. ISCAS, pp. 1345-1348, 2009.
[7] Moraes, F.; et al. HERMES: an infrastructure for low area overhead packet-switching networks on chip. Integration, the VLSI Journal, 38(1), pp. 69-93, 2004.
[8] Bertozzi, D.; Benini, L.; De Micheli, G. Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 4(6), pp. 818-831, 2005.
[9] Ejlali, A.; et al. Performability/energy tradeoff in error-control schemes for on-chip networks. IEEE Transactions on Very Large Scale Integration Systems, 18(1), pp. 1-14, 2010.
[10] Zhang, Z.; Greiner, A.; Taktak, S. A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. DAC, pp. 441-446, 2008.
[11] Manolache, S.; Eles, P.; Peng, Z. Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC. DAC, pp. 266-269, 2005.
[12] Hu, J.; Marculescu, R. Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. DATE, pp. 234-239, 2004.
[13] Lei, T.; Kumar, S. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. Euromicro Symposium on Digital System Design, pp. 180-187, 2003.
[14] Murali, S.; et al. Mapping and configuration methods for multi-use-case networks on chips. ASP-DAC, pp. 146-151, 2006.
[15] Lee, C.; et al. A task remapping technique for reliable multi-core embedded systems. CODES/ISSS, pp. 307-316, 2010.
[16] Ababei, C.; Katti, R. Achieving network on chip fault tolerance by adaptive remapping. International Symposium on Parallel & Distributed Processing, pp. 1-4, 2009.
[17] Tornero, R.; et al. A multi-objective strategy for concurrent mapping and routing in networks on chip. International Symposium on Parallel & Distributed Processing, pp. 1-8, 2009.
[18] Choudhury, A.; et al. Yield enhancement by robust application-specific mapping on network-on-chips. NoCArc, pp. 37-42, 2009.
[19] Huang, L.; et al. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. DATE, pp. 51-56, 2009.
[20] Huang, L.; Xu, Q. Energy-efficient task allocation and scheduling for multi-mode MPSoCs under lifetime reliability constraint. DATE, pp. 1584-1589, 2010.
[21] Huang, L.; Xu, Q. AgeSim: a simulation framework for evaluating the lifetime reliability of processor-based SoCs. DATE, pp. 51-56, 2010.
[22] Marcon, C.; et al. CAFES: a framework for intrachip application modeling and communication architecture design. Journal of Parallel and Distributed Computing, 71(5), pp. 714-728, 2011.
[23] Mandelbrot, B. How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science, 156(3775), pp. 636-638, 1967.
[24] Ghadiry, M.; Nadi, M.; Rahmati, D. New approach to calculate energy on NoC. International Conference on Computer and Communication Engineering, pp. 1098-1104, 2008.
[25] Hill, T.; Lewicki, P. Statistics: methods and applications: a comprehensive reference for science, industry, and data mining. StatSoft, 832 p., 2006.

Me3D: A Model-driven Methodology Expediting Embedded Device Driver Development


TIMA Laboratory (CNRS, Grenoble INP, UJF), 46 av. Félix Viallet, 38031 Grenoble, France. {hui.chen, guillaume.godet-bar, frederic.rousseau, frederic.petrot}@imag.fr
Abstract - Traditional development of reliable device drivers for multiprocessor system on chip (MPSoC) is a complex and demanding process, as it requires interdisciplinary knowledge in the fields of hardware and software. This problem can be alleviated by an advanced driver generation environment. We have achieved this by systematically synthesizing drivers from a device features model and specifications of hardware and in-kernel interfaces, thereby lessening the impact of human error on driver reliability and reducing the development costs. We present the methodology called Me3D and confirm the feasibility of the driver generation environment by manually converting sources of information captured in different formalisms to a Multimedia Card Interface (MCI) driver for a real MPSoC under a lightweight operating system (OS).

Hui Chen, Guillaume Godet-Bar, Frédéric Rousseau, and Frédéric Pétrot

Fig. 1. Device driver as a low-level module in the OS structure

I. I NTRODUCTION Nowadays, a typical multiprocessor system on chip (MPSoC) project takes place under ever-increasing time-to-market pressure. Hardware and software are often regularly redesigned for new versions of a product. As it is wellacknowledged, the software development cycle consumes considerable time and effort. On the software side, device driver development causes a serious bottleneck. It is intrinsically complex and errorprone due to the necessity of interdisciplinary knowledge in the elds of engineering and computer science. In other words, device driver developers require in-depth understanding of innumerable peripherals that exist in a typical embedded system, programming tools, operating systems (OSes), bus protocols, network programming, system management [1], etc. Delivering a high-quality and thoroughly tested device driver is laborious. For instance, the LH7A404 system on chip (SoC) from NXP Semiconductors contains 16 peripherals, and corresponding drivers have more than 78,000 physical source lines of code (SLOC) [2] (requiring around 19.6 person-years as development effort, estimated with SLOCCount1 ). Software re-usability and automation methods are hence eagerly required to reduce design effort and ameliorate productivity. The difculty in design and implementation of reliable device drivers is notorious. Drivers in the Linux kernel 2.6.9 account for 53% of bugs [3]. Similarly, 85% of unexpected system crashes originate from driver problems, pursuant to a recent report from Microsoft [4]. With this in mind, a new methodology addressing reliability is strongly expected. The contribution presented in this paper is a exible device driver generation environment, able to produce the nal C code of a software driver, starting from a device features model.
1 SLOCCount v2.26 by David A. Wheeler, www.dwheeler.com/sloccount/

This environment is composed of a device driver generation tool and a validation ow. To evaluate the generation environment we conducted our case study on the Multimedia Card Interface (MCI) device driver for the Atmel D940 [5] MPSoC. We created a features model for the MCI device, specications of the MCI device and the D940 board, and an in-kernel interface specication for an ad-hoc OS, all of which are then systematically converted into the MCI driver. Afterwards, we validated the generated device driver with a validation ow. Experimentation results demonstrate the feasibility of implementing the generation environment, and the expected efciency in developing device drivers for MPSoC. The paper is organized as follows. Section II presents the anatomy of a device driver. Section III reviews related work. Section IV introduces our methodology for accelerating embedded device driver development. Section V evaluates the proposed methodology. The last section concludes the paper and identies future work based on the ndings provided here. II. D EVICE DRIVER OVERVIEW The term device, as used in this paper, does not refer to the primary central processing unit (CPU), or main memory, but a specic hardware resource for a dedicated task. The device is either attached to or embedded in a computer system architecture and can interact with the CPU and other hardware resources in the system via a single system bus or through a bus hierarchy. A device driver is a low-level software component in the OS, which allows upper-level software to interact with a device. It can be considered, from an abstract point of view, as a brick in the OS chart (Fig. 1). Device drivers mentioned here mostly target embedded systems, which differ from personal computers (PCs) in having broader adoption of SoCs and a greater variety of buses.




A device driver implements an interface to the kernel and/or application developers for an underlying device, and provides a lower-level communication channel to the device. It acts as a translator from the kernel interface to the hardware interface. As a form of communication, it requires kernel services, and often also offers services to other kernel components. The communication channels to the device can be provided by lower-level drivers. This leads to cascaded drivers [6]. The upper-level drivers provide an abstract view of the execution platform, while the lower-level drivers are more concrete and provide a transparent communication to the devices interfaces. An example is the Inter-Integrated Circuit (I2C) driver stack in the Linux kernel 2.6 [7]. Device driver interface can be separated into four parts (Fig. 1): a) The driver requires kernel services like memory allocation, and also offers services (e.g., hardware initialization) to the kernel. b) User application sends general commands to the driver using exported driver interface, while c) libraries provide the driver with some services like string manipulation. d) Lastly, the hardware abstraction layer (HAL) accommodates hardware access methods, which are used by the driver. One of the most elementary pieces of information about a device and the driver that manages it, is what function the device accomplishes. Different devices carry out different tasks. There are devices that play sound samples, devices that read and write data on a magnetic disk, and still other devices that display graphics to a video screen, and so on. For each type of functionality, there may be many different devices that carry out similar tasks. For instance, when displaying graphical information on a video device, the display controller may be a simple Video Graphics Array (VGA) controller, or it may be a modern video card running on Peripheral Component Interconnect Express (PCIe), with several gigabytes of graphics memory. Nevertheless, in each case, the high-level purpose of the device is the same. The device driver organization involves a set of driver entry points, a number of data structures, and possibly also global symbols and constants. A typical driver entry point encompasses the hardware programming part (gray blocks in Fig. 2) and the kernel-driver interaction part. Composing a driver entry point requires up to eight pieces of information: 1) HALrelated (e.g., register access primitives), 2) platform-related (e.g., device base address), 3) device-related (e.g., register and bit eld offsets), 4) device features (e.g., register programming sequences), 5) kernel-driver interface (e.g., return type and argument list of the driver entry point), 6) kernel services (e.g., memory allocator), 7) device class-related (e.g., access protocols), and 8) libraries-related (e.g., string manipulators). Thus, because of the various and interrelated sources of information, driver generation is intrinsically complex. III. R ELATED WORK We will briey discuss related work in the area of device driver development methodology. They can be classied into three categories: device driver synthesis methodologies, device interface languages, and hardware specication languages. Early device driver synthesis methods, as part of hardware/software co-design efforts, attempt to synthesize OS-

Fig. 2. Looking into the driver entry points for an OS written in C (annotated entry-point code in which each fragment is tagged with its information source: 1 HAL-related, 2 platform-related, 3 device-related, 4 device feature, 5 kernel-driver interface, 6 kernel services, 7 device class-related, 8 C library-related; e.g., a bit field accessor such as registerA_bitfieldX_rd() built from a read primitive, IO_BASE, REGISTER_A_OFFSET, and BIT_FIELD_X_MASK).
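To make the anatomy of Fig. 2 concrete, the following is a minimal, self-contained sketch of a driver entry point in C. All names (MYDEV_*, hal_read32, mydev_read) are hypothetical placeholders chosen for this illustration, not Me3D output or the API of any particular kernel; the comments tag which of the eight information sources each fragment would come from.

#include <stdint.h>
#include <string.h>                           /* 8: C library services (memset here)   */

#define MYDEV_BASE   0xFFFA0000u              /* 2: platform-related (base address)    */
#define MYDEV_SR     0x40u                    /* 3: device-related (register offset)   */
#define MYDEV_RDR    0x30u                    /* 3: device-related (register offset)   */
#define MYDEV_SR_RDY (0x1u << 1)              /* 3: device-related (bit field mask)    */

/* 1: HAL-related low-level access primitive (volatile memory-mapped I/O read).        */
static inline uint32_t hal_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

/* 5: the kernel-driver interface dictates this signature; 0 means success here.       */
/* 6/7: a real entry point would also call kernel services (allocator, interrupt       */
/* masking) and follow the device-class protocol; omitted to keep the sketch           */
/* self-contained.                                                                     */
int mydev_read(uint32_t *buffer, int32_t word_count)
{
    memset(buffer, 0, (size_t)word_count * sizeof *buffer);
    for (int32_t i = 0; i < word_count; ++i) {
        /* 4: device feature, poll the ready bit before reading the data register.     */
        while ((hal_read32(MYDEV_BASE + MYDEV_SR) & MYDEV_SR_RDY) == 0)
            ;
        buffer[i] = hal_read32(MYDEV_BASE + MYDEV_RDR);
    }
    return 0;
}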

based device drivers for embedded systems [8]. However, these devices are different from those targeted by Me3D, as they have a simple internal structure and a small set of input/output (I/O) signals. Moreover, the synthesized driver only runs with a platform-specific real-time operating system (RTOS). Therefore, these approaches do not take on some of the issues addressed here, including the separation of the in-kernel interface and the hardware specifications. Wang et al. propose a tool [9] for synthesizing embedded device drivers. This approach does not separate the in-kernel and hardware interfaces of the driver, forcing the driver developer to detail the complete driver behavior for every device. In addition, they assume that the driver functionality can be split into non-overlapping control and data parts. This is the case for some simple drivers; in more complex drivers, the control and data paths are tightly interleaved. Termite [10] synthesizes device drivers by merging two state machines, one of the OS and one of the device. This may unavoidably lead to state explosion and a large final code size. These limitations are tackled in this paper. In addition, we believe that a device class specification, which solely defines a set of events shared between the OS and the device specification, is not necessary. Bombieri et al. [11] propose a device driver generation methodology based on the register transfer level (RTL) test bench of an intellectual property (IP). However, device drivers cannot be generated unless the code of the RTL test bench is available. In contrast, we propose a methodology that is applicable without involving RTL test benches. Languages such as Laddie [12] and NDL [13] offer some constructs to describe a device interface. However, the first approach does not really deal with device driver problems, but is limited to generating register access functions. The NDL approach requires a change in how device drivers are written, but does not offer a solution for legacy drivers. Hardware specification languages such as IP-XACT and the Unified Modeling Language (UML) MARTE are able to describe some parts of electronic components and designs. UML MARTE is widely used, while IP-XACT is an IEEE standard that describes not only structural information about hardware devices and designs, but also some contents such as a register map. To demonstrate the feasibility of our methodology, we have modeled the device and the hardware platform in IP-XACT.


transitions. The latter defines the driver's desired reactions to requests, concerning the hardware events that must happen before the driver sends a notification of completion to the kernel.
C. Device and platform specifications
Hardware vendors often release user manuals that describe the interface and operations of a device and the architecture of a hardware board. Such documentation is intended to provide sufficient information for driver developers. However, it is usually informal and written in a natural language. To automate driver development, we require device and hardware board specifications that provide not only structural information, but also some contents like a register map. Specifications in a format such as IP-XACT or UML MARTE are available from some hardware vendors or can be obtained from informal device or board documentation. A device specification describes the following driver-related properties of a device: i) device name and ID, ii) register file information (e.g., register widths and offsets, bit field widths and masks, register/bit field accessibilities, reset values, etc.), and iii) port information. A platform specification provides some driver-related information as well, i.e., i) device instantiations, ii) I/O offsets, iii) interrupt connections (which indicate whether the interrupt pin of the target device is used or not), iv) bus (e.g., bus clock, bus type, data bus width, data transfer type, device access type, transport mode), and v) processor (e.g., byte ordering, clock frequency, name, word length). In general, IP-XACT includes most of the features mentioned above, although, to the best of our knowledge, it still lacks some information such as the data transfer type (e.g., x8, x16).
D. Device features model
Reading from or writing to a certain register may cause a side effect. For instance, writing a value to the length register of a given direct memory access (DMA) controller may start the DMA transfer. Often, some other registers must be programmed before the side effect takes place. For instance, before writing the length register of this DMA controller, the source address register and the destination address register shall be set with the desired values so as to achieve successful DMA operations. Such a register programming sequence needs to be modeled to ensure correct device operation. Hence, we introduce a device features model to capture the way of interacting with the device. The device features model contains a set of predefined device features, such as init, read, write, etc. This model can be translated to C functions. The translation process is explained in more detail in the following section.
E. Driver generation
The driver generation flow is broken down into four steps. This section explains each step of the driver generation.
Step 1: Parsing and inline functions generation. The device features model, along with the hardware specifications and the HAL library, is mainly used to generate bit field access functions (Fig. 5). These inline functions, containing

Fig. 3. Abstract view of the Me3D methodology (the device features model, hardware specifications, in-kernel interface specification, libraries, and driver configuration parameters are used by the driver generation to produce the device driver; the driver is then validated, and validation results feed back into the driver configuration parameters).


Fig. 4. Basic library (abstract primitives such as StrCpy and StrCat are bound to a selected implementation: Newlib's strcpy/strcat, uClibc's strcpy/strcat, or the ad-hoc dna_strcpy/dna_strcat of an in-house C library).
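One plausible realization of the basic-library indirection shown in Fig. 4 is a thin macro layer selected at build time. Everything below, in particular the BASICLIB_* selection macros, is a hypothetical sketch written for this illustration and not the actual Me3D library.

/* basic_lib.h, sketch of the basic-library abstraction of Fig. 4.
 * The build system defines exactly one BASICLIB_* macro to pick an implementation. */
#if defined(BASICLIB_NEWLIB) || defined(BASICLIB_UCLIBC)
#  include <string.h>
#  define StrCpy(dst, src) strcpy((dst), (src))   /* standard C library implementation */
#  define StrCat(dst, src) strcat((dst), (src))
#elif defined(BASICLIB_DNA)
char *dna_strcpy(char *dst, const char *src);     /* ad-hoc C library (prototypes only) */
char *dna_strcat(char *dst, const char *src);
#  define StrCpy(dst, src) dna_strcpy((dst), (src))
#  define StrCat(dst, src) dna_strcat((dst), (src))
#else
#  error "select a basic-library implementation"
#endif

With such an indirection, selecting a different C library amounts to changing one build flag, which is how the memory-footprint and performance exploration discussed in Section IV-A can be driven.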

IV. DRIVER GENERATION ENVIRONMENT
Device drivers such as the upper-level driver stacks in cascaded drivers may not have direct access to the underlying hardware. In the remainder of this paper, it is assumed that a driver sits directly above the HAL. Fig. 3 shows an abstract view of the Me3D methodology. The generation environment requires a device features model, hardware specifications, an in-kernel interface specification, libraries, and driver configuration parameters to produce device drivers. The device driver in binary format is validated on a real MPSoC or on virtual platforms. During the validation phase, performance results can be extracted, as shown in the referenced paper [14], for tuning the driver configuration parameters. With modified parameters, a new version of the driver is generated.
A. Basic and HAL libraries
The basic library contains an abstraction layer for usual data manipulation methods (Fig. 4). For instance, the StrCpy primitive (Fig. 4) may be linked to the strcpy function of a standard C library implementation, such as Newlib [15] or uClibc [16], or to the dna_strcpy function of an ad-hoc C library. Introducing these primitives allows the exploration of memory footprint and performance through the selection of different C library implementations. The basic library may contain source and/or object files. The HAL library contains the implementation of low-level hardware access primitives (e.g., primitives to read from or write to registers). It allows the development and integration of support for new hardware architectures to be carried out separately from the generation tools, thereby increasing the flexibility of the environment and the re-usability of the components. It may include source and/or object files.
B. In-kernel interface specification
In order to reflect in-kernel interface changes during kernel evolution, and to differentiate kernel-driver interaction among differing device classes, we propose an in-kernel interface specification dedicated to a certain device class for a given kernel. The in-kernel interface specification mainly contains the kernel data structures (if any) to be used, software events, and


Fig. 5. Step 1: parameters parsing and inline functions generation (the device features model, the hardware specifications, and the HAL library produce the bit field access functions, e.g., static inline uint32_t registerA_bitfieldX_rd() { return <read primitive> (IO_BASE + REGISTER_A_OFFSET) & BIT_FIELD_X_MASK; }, together with the device-related parameters).
Fig. 6. Step 2: device features generation (the bit field access functions, the device-related parameters, and the device features model produce the device features, e.g., void access_hardware(...) { registerB_bitfieldY_wr($VAL); ... $VAL2 = registerA_bitfieldX_rd(); }).
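For illustration, a read/write accessor pair of the kind Step 1 emits (cf. Fig. 5.a) could look as follows. The read primitive cpu_read_UINT32 appears in Fig. 13.b; the write primitive cpu_write_UINT32, the read-modify-write policy of the write accessor, and the concrete constant values are assumptions made for this sketch only.

#include <stdint.h>

/* Assumed HAL primitives (the read one appears in Fig. 13.b).                        */
uint32_t cpu_read_UINT32(uintptr_t addr);
void     cpu_write_UINT32(uintptr_t addr, uint32_t value);

/* Device/platform parameters parsed in Step 1 (values illustrative only).            */
#define IO_BASE            0xFFFA8000u
#define REGISTER_A_OFFSET  0x40u
#define BIT_FIELD_X_MASK   (0x1u << 1)

static inline uint32_t registerA_bitfieldX_rd(void)
{
    return cpu_read_UINT32(IO_BASE + REGISTER_A_OFFSET) & BIT_FIELD_X_MASK;
}

static inline void registerA_bitfieldX_wr(uint32_t value)
{
    /* Read-modify-write so the other bit fields of the register are preserved;       */
    /* a field wider than one bit would additionally need a shift by its offset.      */
    uint32_t reg = cpu_read_UINT32(IO_BASE + REGISTER_A_OFFSET);
    reg = (reg & ~BIT_FIELD_X_MASK) | (value & BIT_FIELD_X_MASK);
    cpu_write_UINT32(IO_BASE + REGISTER_A_OFFSET, reg);
}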



bitwise operations (e.g., not, bit shift), are responsible for accessing the bit fields. It is not difficult to generate these bit field access functions. An example of a bit field read function is shown in Fig. 5.a. Producing a bit field read function requires the names of the register and of the bit field that appear in the device features model, a read primitive from the HAL library, an I/O base address from the platform specification, and the offsets and widths of the register and bit field from the device specification. If the HAL of the OS already provides bit-level manipulation operations, then our code simply calls these functions. There are two reasons for generating bit field access functions. Firstly, the low-level bit operations account for a great part of the bug sources in manual driver development. Secondly, introducing bit field access functions enhances readability to some extent. Apart from these bit field access functions and string constant macros, this step also parses out device-related parameters (e.g., device cardinality).
Step 2: Device features generation. This step makes use of the products (inline functions and parameters) of the previous step and requires a device features model. The reason for using this model is explained in subsection IV-D. The device features model reflects product dependencies (if any), hardware configurations, and device operations (e.g., data transfer operations, command/response operations) described in a natural language or as functional flow charts, which are traditionally provided by hardware vendors as part of the user manual. Thus, hardware vendors are expected to write the device features model. The device features model can be written in C or in an alternative tiny language, DFDL (Device Features Description Language). Our experience shows that the latter is easier to interpret because it has simpler semantics. Instead of writing for (int i = 0; i < 5; i++) in C/C++, one just writes foreach i (0, 5) in DFDL; the foreach loop in DFDL is simpler and less error-prone. The in-house language-based device features model is only used to define register programming sequences, not to write device drivers; this model could evolve into an intermediate format or even be eliminated once a device specification becomes capable of capturing these sequences. Using the products (inline functions and parameters) of Step 1, the device features model is translated to device functionalities (Fig. 6) in C. This translation is feasible, as the tiny language only uses high-level constructs for logical control; e.g., the await construct (see subsection V-D) refers

Fig. 7. Step 3: a) driver and b) Makefile generation (the device features, libraries, in-kernel interface specification, driver configuration parameters, and platform specification drive dependency computation, EFSM synthesis, hardware event mapping, EFSM translation, and header-inclusion generation, producing the driver .c/.h sources; the compilation environment produces the Makefile).

to the do-while structure in C.
Step 3: Driver source and Makefile generation. This step makes use of the product (device features) of Step 2, and requires the basic and HAL libraries, the driver configuration parameters, the hardware specifications, and an in-kernel interface specification. The libraries are used to produce some #include commands (Fig. 7). The in-kernel interface specification describes how the driver interacts with the kernel and adjacent drivers, while the driver configuration parameters determine tunable elements such as the synchronization method (interrupt or polling). With the in-kernel interface specification and the driver configuration parameters, it is easy to synthesize an extended finite state machine (EFSM) after dependency computation. Then the hardware events in the synthesized EFSM are mapped to the device features (generated in the previous step). Afterwards, the EFSM is translated into device drivers in C. In addition, the in-kernel interface specification and the platform specification define the compiler flavor and the processor type respectively, allowing the compilation environment to produce the Makefile.
Step 4: Source code compilation. In this step the make command is iteratively executed for a certain CPU architecture, using the previously generated Makefile as an input. If the HAL and basic libraries are provided in the form of source files, they are compiled as well.
F. Driver configuration and space exploration
Device driver development involves a series of decision-making processes. For instance, a write Application Programming Interface (API) can be implemented as synchronous or asynchronous. Likewise, a DMA driver can use either a circular buffer or a linked list. Different decisions result in diverse C code and differing driver performance. We call this driver space exploration.
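As a toy illustration of how one configuration attribute can steer the generated code, the fragment below switches a read entry point between an interrupt-driven wait and a polling loop at compile time. The DRIVER_SYNC_INTERRUPT flag and the kernel_wait_for_irq, status_ready, and data_read helpers are hypothetical names chosen for this sketch, not part of Me3D or of any particular kernel.

#include <stdint.h>

/* Hypothetical synchronization hooks; a real kernel would provide equivalents.      */
void     kernel_wait_for_irq(int irq_line);
uint32_t status_ready(void);   /* stands for a generated status bit field accessor   */
uint32_t data_read(void);      /* stands for a generated data register accessor      */

#define MYDEV_IRQ 13           /* illustrative interrupt line number                  */

/* The generation environment would emit this flag from the driver configuration.    */
#define DRIVER_SYNC_INTERRUPT 1

int generated_read(uint32_t *buffer, int32_t word_count)
{
    for (int32_t i = 0; i < word_count; ++i) {
#if DRIVER_SYNC_INTERRUPT
        kernel_wait_for_irq(MYDEV_IRQ);  /* sleep until the device raises its IRQ     */
#else
        while (status_ready() == 0)      /* busy-wait on the ready bit field          */
            ;
#endif
        buffer[i] = data_read();
    }
    return 0;
}

Flipping the attribute in the driver configuration and regenerating then yields the other variant, which is exactly the kind of exploration depicted in Fig. 8.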


Fig. 8. Code generation possibilities (the same device features, in-kernel interface specification, and libraries, combined with driver configuration parameters 1 or 2, are fed to the driver generation and yield two different sets of generated .c/.h sources).
Fig. 10. In-kernel interfaces for a MCI driver (the MCI driver interacts with the DNA-OS kernel, a generic MMC module, and the application).

G. Driver validation
In our experimentation, we used a real board to validate the driver (along with an application program, an OS, and the hardware platform's HAL). If a real board is not available, we propose two simulation models to validate the driver. The functionality is validated with an abstract SystemC simulation model, called transaction accurate. The performance of the driver is validated on a low-abstraction SystemC simulation model, called cycle accurate bit accurate. Due to limited space, these simulation models are not presented in this paper.
V. EVALUATION
In this section, we evaluate the applicability of Me3D as a methodology for expediting embedded device driver development on the Atmel D940 [5] MPSoC.
A. Evaluation points
The points of evaluation are: 1) the feasibility of describing device features and in-kernel interfaces; 2) the feasibility of systematically converting the device features model and the specifications to a device driver. To evaluate point 1, we chose a well-adopted specification language for the hardware platform and devices (IP-XACT in this case, although the methodology is not limited to it), specified the in-kernel interfaces with the in-house IISL (In-kernel Interface Specification Language), and modeled the device features with DFDL (Device Features Description Language). To evaluate point 2, we manually converted the device features model, the hardware specifications, and the in-kernel interface specification to the device driver according to the proposed methodology. The open source DNA-OS [17] is used by the converted software.
B. Hardware specifications
Device specifications. In order to bring the Multimedia Card Interface (MCI) (Fig. 9) into operation, some registers of the power management controller (PMC),

of the programmable input output (PIO), and optionally of the programmable DMA controller (PDC), must be configured. In other words, we need information about the register layouts of these devices. Hence, we have modeled the specifications of the MCI, the PMC, and the PIO (in IP-XACT for this experimentation). The PDC specification is not modeled, because the native MCI driver does not use DMA.
Platform specification. The D940 MPSoC contains many peripheral devices. It is not necessary to specify the whole platform; in practice, we have only modeled the parts related to device instantiations, interrupt numbering, etc.
C. In-kernel interface specification
In this paper, we do not introduce the grammar of our in-house IISL language, but we briefly present what the in-kernel interface specification covers. It specifies the driver interfaces (Fig. 10) toward the application, the kernel, and a generic MultiMediaCard (MMC) module. The generic MMC module is responsible for card property discovery and for the MMC protocol implementation. It offers some services (denoted by the lollipop connectors on the module side) to the MCI driver, and defines APIs (e.g., read_low). The in-kernel interface specification presents a partial EFSM (Fig. 11) describing the interactions between the driver, its adjacent drivers (if any), and the kernel. To describe these interactions, we use messages. A message is a token sent to the driver from its adjacent drivers or the kernel, or vice versa. The former, called an inbound message, can be a kernel request (e.g., initialize hardware, publish device, etc.), a DMA request, etc., whereas the latter, called an outbound message, is the driver's response to the sender of the token. In Fig. 11, downward dashed arrows denote inbound messages, whereas upward dashed arrows represent outbound messages. As shown in Fig. 11, starting from OS boot (the start state), the device driver receives inbound messages from the kernel in succession, which bring the driver from one state to another sequentially, until it reaches the idle state. In the INIT_HW (hardware initialization) state, the driver sends back



Conventional driver space exploration refers to iteratively refactoring the driver code. The choice of driver version is usually driven by performance, power consumption, or binary size. In order to explore the driver space efficiently and effectively, we use high-level configuration parameters (Fig. 8) for the driver generation environment. A change in a design decision then requires only the modification of an attribute of the driver configuration, and a new version of the driver can be generated.

Fig. 9. MCI device and its neighborhood (the MCI sits on the APB behind the APB bridge, alongside the PDC, PMC, PIO, and the interrupt controller; MCI signals include MCCK, MCCDA, MCDA0 to MCDA3, and the MCI interrupt).


Fig. 11. EFSM of kernel-driver interaction (states: start, INIT_HW, IDLE, READ, WRITE, end; inbound messages: init_hw, read, write, exit; outbound messages: success/status_ok, read_done/status, write_done/status).
Fig. 13. Step 1: macros and inline functions generation. From the MCI device features model, the HAL library, and the hardware specifications, a) device-related macros are emitted into d940_mci.h (e.g., #define MCI_BASE 0xFFFA8000, #define MCI_SR 0x40, #define MCI_SR_RXRDY (0x1 << 1)), with similar d940_pio.h and d940_pmc.h headers, and b) bit field access functions are emitted into reg_access.h (e.g., static inline uint32_t MCI_SR_RXRDY_rd() { return cpu_read_UINT32(MCI_BASE + MCI_SR) & MCI_SR_RXRDY; }).
Fig. 12. MCI device features model (the read device feature in DFDL):
1 read {
2   in void *buffer
3   in int32_t word_count
4
5   foreach i (0, word_count):
6     await (MCI_SR.RXRDY == 1)
7     ((uint32_t *)buffer)[i] = MCI_RDR
8 }
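The await on line 6 of Fig. 12 has no single-statement C equivalent. Using the macros of Fig. 13.a and the HAL read primitive, it can be expanded into a polling loop roughly as follows; the helper name await_mci_rxrdy is ours, and this is the polling variant (the interrupt configuration would block on the MCI interrupt instead).

#include <stdint.h>

uint32_t cpu_read_UINT32(uintptr_t addr);        /* HAL read primitive (cf. Fig. 13.b) */

#define MCI_BASE     0xFFFA8000u                 /* values as listed in d940_mci.h     */
#define MCI_SR       0x40u
#define MCI_SR_RXRDY (0x1u << 1)

/* Expansion of the DFDL construct: await (MCI_SR.RXRDY == 1)
 * i.e. spin until the RXRDY bit field of the MCI status register becomes set.
 * A production driver would typically add a timeout. */
static void await_mci_rxrdy(void)
{
    uint32_t sr;
    do {
        sr = cpu_read_UINT32(MCI_BASE + MCI_SR); /* read the status register           */
    } while ((sr & MCI_SR_RXRDY) == 0);          /* mask and test the bit field        */
}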

the status_ok message to the OS in the case of a successful initialization. When the driver is in the idle state, it waits for an inbound message. A read inbound message sends the driver to the READ state for reading data from the MMC card. At the end of the read, the driver sends a message with the read status to the kernel, and returns to the idle state. An exit inbound message brings the driver to the end state.
D. MCI device features model
The device features model for the MCI device consists of seven features (e.g., read, write, etc.) and the definition of the block length. It is described with our in-house language DFDL. Fig. 12 presents the read feature. This feature has two incoming arguments, i.e., the buffer pointer and the word count. It waits until the RXRDY (receive ready) bit field of MCI_SR (the status register) equals 1, then stores the value of MCI_RDR (the read register) into the specified buffer. This process iterates word_count times. DFDL is a tiny ad-hoc language with some constructs tailored to modeling device features. The await construct in Fig. 12 abbreviates a do-while loop in C/C++. A C/C++ equivalent of line 6, Fig. 12 (waiting for a bit field to reach a value) contains several statements: reading the register value and logically ANDing it with a bit field mask, comparing the masked result with the specified value, and iterating these steps until the bit field value equals the given one.
E. Conversion
At first we analyzed which registers and bit fields appear in the MCI device features model. Then we parsed out the offsets and widths of these registers and bit fields (Fig. 13.a) from the device specifications, and the MCI base address from the D940 platform specification. Information related to the MCI is gathered in the d940_mci.h file. Likewise, information related to the PIO and the PMC is collected in d940_pio.h and d940_pmc.h respectively. In addition, inline functions for accessing some bit fields are also produced (Fig. 13.b). Afterwards, the macros and inline functions, along with the MCI device features model, are used to produce the MCI

Fig. 14. Step 2: MCI device features generation (the macros and inline functions of Step 1, plus the MCI device features model, yield, e.g., read(void *buffer, int32_t word_count) { for (int32_t i = 0; i < word_count; ++i) { while (MCI_SR_RXRDY_rd() != 1); ((uint32_t *)buffer)[i] = MCI_RDR_rd(); } }).
Fig. 15. Step 3.a: synthesize the EFSM and derive driver functionalities. a) Driver configuration parameters (e.g., INTERRUPT); b) in-kernel interface specification excerpt (process CHOICES { ... || read_low; dev.read; read_low-done[$status == DNA_OK]; }); together with the MCI device features and libraries, these produce, e.g., status_t read_low(void *buffer, int32_t word_count) { for (int32_t i = 0; i < word_count; ++i) { while (MCI_SR_RXRDY_rd() != 1); ((uint32_t *)buffer)[i] = MCI_RDR_rd(); } return DNA_OK; }.

device features (Fig. 14). It must be mentioned that although the read device feature looks like a C function, it is not one, as it is not qualified with a return type. Finally, the driver configuration parameters (Fig. 15.a), the MCI device features, the libraries, and the in-kernel interface specification (Fig. 15.b) are used to synthesize the EFSM and derive the driver functionalities. The INTERRUPT parameter (Fig. 15.a) selects interrupts as the synchronization mechanism. In the in-kernel interface specification there is a process describing how the driver interacts with neighboring components when it is in the idle state and receives a message. For instance, when the driver is in the idle state and receives a read_low message, it waits until the read operation terminates, then returns the status DNA_OK to the kernel in the case of a successful read. The dev.read hardware event is mapped to the read device feature. Note that the || construct (Fig. 15.b) is used to separate different transition conditions. Table I summarizes the conversion results. The SLOC column refers to the source code size (excluding de-


TABLE I. SOURCE LINES OF CODE AND BINARY SIZES OF THE NATIVE AND GENERATED MCI DRIVERS (EXCLUDING DEBUGGING FUNCTIONS AND STATEMENTS)
Driver      SLOC   Binary (contains application, OS, & HAL) in KB
Native      407    421.9
Generated   362    421.4

utilization achieved by the generated and native drivers are similar.
VI. CONCLUSION AND FUTURE WORK
Device drivers are crucial software elements with a considerable impact on both design productivity and quality. Device driver development has traditionally been error-prone and quite time-consuming. On that account, we propose an advanced device driver generation environment to shorten driver development time and improve driver quality. An experiment generating a Multimedia Card Interface (MCI) driver for the Atmel D940 multiprocessor system on chip (MPSoC) achieves favorable results regarding code size. In the future, we plan to evaluate the methodology on several OSes, introduce an intermediate format for device driver generation and validation, and develop an automatic tool for driver generation. We will also study optimization issues such as performance and power consumption, and consider other constraints (e.g., upper-bound timing imposed by critical or real-time systems) as future research subjects.
ACKNOWLEDGMENT
The authors would like to thank the MEDEA+ and CATRENE offices and the French Ministry of Industry for supporting this work via the MEDEA+/CATRENE SoftSoC project.
REFERENCES
[1] Hewlett-Packard Company. (2010) HP Tru64 UNIX Operating System Version 5.1B-6. [Online]. h18004.www1.hp.com/products/quickspecs/13868 div/13868 div.pdf
[2] NXP Semiconductors. (2007) LH7A404 Board Support Package V1.01. [Online]. ics.nxp.com/support/documents/microcontrollers/zip/code.package.lh7a404.sdk7a404.zip
[3] Coverity, Inc. (2005) Analysis of the Linux Kernel. [Online]. www.coverity.com/library/pdf/linux report.pdf
[4] N. Ganapathy. (2008) Introduction to Developing Drivers with the Windows Driver Foundation. [Online]. www.microsoft.com/whdc/driver/wdf/wdfbook intro.mspx
[5] Atmel Corp. (2008) DIOPSIS 940HF AT572D940HF Preliminary. [Online]. www.atmel.com/dyn/resources/prod documents/doc7010.pdf
[6] K. J. Lin and J. T. Lin, "Automated development tools for Linux USB drivers," in 14th ISCE, Braunschweig, Germany, 2010, pp. 1-4.
[7] G. Kroah-Hartman, "I2C Drivers, Part II," Linux Journal, Feb. 2004.
[8] M. O'Nils and A. Jantsch, "Device Driver and DMA Controller Synthesis from HW/SW Communication Protocol Specifications," Design Automation for Embedded Systems, vol. 6, no. 2, pp. 177-205, 2001.
[9] S. Wang, S. Malik, and R. A. Bergamaschi, "Modeling and Integration of Peripheral Devices in Embedded Systems," in DATE'03, Munich, Germany, 2003, pp. 10 136-10 141.
[10] L. Ryzhyk, P. Chubb, I. Kuz, E. Le Sueur, and G. Heiser, "Automatic device driver synthesis with Termite," in 22nd SOSP, Big Sky, MT, 2009.
[11] N. Bombieri, F. Fummi, G. Pravadelli, and S. Vinco, "Correct-by-construction generation of device drivers based on RTL testbenches," in DATE'09, Nice, France, 2009, pp. 1500-1505.
[12] L. Wittie, C. Hawblitzel, and D. Pierret, "Generating a statically-checkable device driver I/O interface," in Workshop on Automatic Program Generation for Embedded Systems, Salzburg, Austria, 2007.
[13] C. L. Conway and S. A. Edwards, "NDL: A Domain-Specific Language for Device Drivers," SIGPLAN Not., vol. 39, pp. 30-36, Jun. 2004.
[14] X. Guérin, K. Popovici, W. Youssef, F. Rousseau, and A. Jerraya, "Flexible Application Software Generation for Heterogeneous Multi-Processor System-on-Chip," in 31st COMPSAC, Beijing, China, 2007.
[15] Newlib. (2010) Red Hat, Inc. [Online]. sources.redhat.com/newlib
[16] uClibc. (2011) Erik Andersen. [Online]. www.uclibc.org/downloads
[17] X. Guérin and F. Pétrot, "A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs," in 20th ASAP, Boston, MA, USA, 2009, pp. 153-160. [Online] timasls.imag.fr/viewgit/apes


TABLE II. EFFORT FOR MCI DRIVER DEVELOPMENT WITH AND WITHOUT ME3D
                                     Effort in person-days   SLOC
Device specifications                2                       2573
Platform specification               1                       219
In-kernel interface specification    2                       188
Device features model                1                       64
Total effort using Me3D              6                       -
Total effort without Me3D            21                      -


bugging functions and statements) of the native and converted MCI drivers, while the last column shows the size of the driver binaries. Notice that the systematically generated driver source is slightly smaller than the native one. The reason is that the native MCI driver is written without optimizations, whereas the generation optimizes code size. For instance, the native driver represents registers as union structures; in contrast, the generated code only defines the offsets of the bit fields that are actually used. A disadvantage of the union structure is that reserved bit fields have to be specified too. Though unions may have advantages for code review, more efficient validation is feasible by checking the high-level intermediate models. Table II shows the results of generating the MCI driver using Me3D and of developing it manually. Using the Me3D methodology for generating drivers results in a 350% improvement in productivity. Considering that the native driver was developed by a highly experienced kernel developer with about 21 person-days of effort (around 26 to 40 person-days of effort using the Intermediate COCOMO2 formula with coefficients for embedded software projects and typical values for an effort adjustment factor), we can expect the acceleration of driver development to be greater than 350%. It must be noted that some specifications are suitable for reuse: the platform specification is usable for all devices, and the in-kernel interface specification for a certain device class. Incidentally, around 500 lines from the IP-XACT specifications of the devices and the platform (18% of their total size) are used for the driver generation.
F. Performance
We compared the performance of the native MCI driver for DNA-OS against that of the generated one. Performance values were captured using a Secure Digital (SD) card (Kingston SD/1GB). We measured a benchmark that performs a sequence of unbuffered reads from the SD card connected to the MCI device. As a result, the transfer rate and CPU
2 Intermediate COCOMO by B. Boehm, en.wikipedia.org/wiki/COCOMO
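For orientation only, that range can be reproduced from the standard Intermediate COCOMO effort equation; the coefficients below are Boehm's published embedded-mode values, not figures from this paper, and roughly 19 working days per person-month are assumed:

Effort [person-months] = a x (KLOC)^b x EAF, with a = 2.8 and b = 1.20 for the embedded mode.

With KLOC of about 0.41 (407 SLOC), the nominal effort (EAF = 1) is about 2.8 x 0.41^1.2, i.e. roughly 0.96 person-months or 18 person-days; effort adjustment factors of roughly 1.4 to 2.2, plausible for embedded cost drivers, give the quoted 26 to 40 person-days.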


Session 7: Tools and Designs for Configurable Architectures


Schedulers-Driven Approach for Dynamic Placement/Scheduling of multiple DAGs onto SoPCs


Ikbel Belaid, Fabrice Muller
University of Nice Sophia-Antipolis LEAT-CNRS, France e-mail: {Ikbel.Belaid, Fabrice.Muller}@unice.fr AbstractWith the advent of System on Programmable Chips (SoPCs), there is a serious need for placing and scheduling algorithms which can allow multiple Directed Acyclic Graphs (DAGs) structured applications to compete for the computational resources provided by SoPCs. A runtime scheme for distributed scheduling and placement of DAG-based real time tasks on SoPCs is described in this paper. In the proposed distributed approach, called Schedulers-Driven, each scheduler associated to a DAG makes its own placement/scheduling decisions and collaborates with the available placers corresponding to SoPCs in the system. The placers focus in managing free resource space for the requirements of elected tasks. SchedulersDriven aims at optimizing the DAG slowdowns and reducing the rejection ratio of real-time DAGs. Other important goals are attained by this approach, which are the reduction of placement and scheduling overheads ensured by the techniques of prefetch and reuse, and the efficiency of resource utilization guaranteed by the reuse technique and the slickness of placement method. Keywords-real-time DAGs; Schedulers-Driven placement/scheduling; reuse; prefetch; heterogeneous device; run-time reconfiguration. I. INTRODUCTION In the recent years, the reconfigurable computing has advanced at a phenomenal rate. This new concept emerges the SoPCs to satisfy all the demands of embedded systems designers working under many tight constraints. The SoPCs present a mixture of two parts: general-purpose processors and reconfigurable hardware resources. Despite their flexibility and their high performance, the SoPCs reveal a number of challenges that must be addressed. One of them is the dynamic scheduling of parallel real-time jobs modeled by directed acyclic graphs (DAG) onto the reconfigurable resources. Hence, it is reasonable to envisage a scenario where more than one DAG compete to be scheduled onto a high density of reconfigurable resources at the same time. The purpose of our work is to provide a dynamic scheduling and placement for DAGs, as they arrive at a heterogeneous system. The objective of dynamic placement/scheduling is i) to fit tasks within

Maher Benjemaa
National engineering school of Sfax, Univeristy of Sfax Tunisia e-mail: Maher.Benjemaa@enis.rnu.tn DAGs efficiently on reconfigurable units partitioned on the SoPCs, respecting their heterogeneities and taking advantage of run-time reconfiguration mechanism and ii) to order their execution so that task precedence and realtime requirements may be satisfied. Many dynamic scheduling schemes have been introduced in parallel computing systems. One simple and efficient type of scheduling method is to dynamically construct a combined DAG, composed of DAGs arrived at the system and then to schedule the composite DAG by one among efficient single-DAG algorithms in the literature. Some methods which fall into this category include those presented by Zhao and Sakellariou in [1], who focus in achieving a certain level of quality of service for the given DAGs defined by the slowdown that each DAG would experience. The idea of combining dynamic DAGs is also proposed in [2]. This paper develops Serve On Time and First Come First Serve algorithms that schedule each arrived DAG with the unfinished DAGs. The objective of these algorithms is to properly add the new submitted DAG into running DAGs, forming a new integrated DAG. In [3], the dynamic DAGs are scheduled with periodic real-time jobs running on the heterogeneous system. The proposed scheduling scheme introduces admission control for DAGs and schedules globally the tasks of each arrived DAG by modeling the spare capability left by the periodic jobs in the system. Then, each scheduled task is received by a machine where it will be scheduled locally by EDF algorithm. [4] presents a hierarchical matching and scheduling framework to execute multiple DAGs on computational resources. Based upon a client-server model and DHS algorithm, each DAG is associated with a client machine and independently, determines when a scheduling decision should be made. Through load estimates, each client machine matches its tasks to the suitable group of server machine. When the application chooses a particular group of servers to execute a given task, the low-level scheduler determines the most appropriate member of the group to execute the received task. [5] deals with parallel jobs arriving at the system following a Poisson process and takes into account the reliability measure as well as the overheads of scheduling and dispatching tasks to processors. Using admission control for real-time jobs, the paper presents DAEAP,


DALAP and DRCD scheduling algorithms to enhance the reliability of the system. Several researchers have developed dynamic placement methods of tasks on reconfigurable devices. The placement in [6] is considered the baseline placement algorithm. The placement is based on KAMER method that partitions the free space into Maximal Empty Rectangles (MER) and employs the bin-packing rules to fit tasks into MERs. [7] presents the on-the-fly partitioning. [8] employs the staircase method to manage the free space. Unlike the previous works, [9] manages the occupied space instead of the free space and proposes Nearest Possible Position algorithm to fit tasks while optimizing inter-task communication. To the best of our knowledge, none of these existing methods of placement and scheduling is suitable for the environment used in this paper, as most of them are proposed for purely software context or they are not applicable for real-time DAGs. In this paper, a new dynamic competitive placement/scheduling approach is proposed to execute real-time DAGs on SoPCs. The remainder of this paper is organized as follows. Section 2 details our proposed approach of placement/scheduling DAGs onto SoPCs. The experimental results are given in Section 3 followed by conclusions in Section 4. II. SCHEDULERS-DRIVEN PLACEMENT/SCHEDULING Throughout the paper, the Xilinx heterogeneous column-based FPGA was used as a reference for the SoPC. The heterogeneous system is composed of n SoPCs. Each one is composed of a set of reconfigurable hardware resources depicted by {RBk} where k denotes the number of resource type. There are NP types of reconfigurable resources in SoPCs. As mentioned in Fig. 1, the execution system is constituted by a set of m local schedulers (Sched i) associated to arrived DAGs. Local schedulers are communicating with n placers. Each placer is assigned to a SoPC and makes its own decision in managing reconfigurable resource space. Besides the m local schedulers and n placers, we introduce two other structures in the system: Recover and Pending. In the distributed Schedulers-Driven placement/scheduling, all the structures: m local schedulers, n placers, Recover and Pending operate to make decisions about scheduling and placement of real-time tasks. The real-time DAGs are submitted dynamically and periodically according to a fixed inter-arrival interval. A real-time DAG is defined by the pair (N,E). N is the set of nodes representing nonpreemptive tasks in the DAG and E is the set of edges linking the dependent tasks. Each real-time task in the DAG is characterized by its worst case execution time (CA), its relative deadline (DA) and its release time (RA). The release time is the time when the task is ready for execution and it receives all its required data from its

predecessors. RA is determined according to the arrival time of the DAG to which the task belongs and to the time of execution achievement of its predecessors. Moreover, each task (A) is presented as a set of reconfigurable resources (RBk) which are required to achieve its execution on the SoPCs and defines the RB-model of the task as expressed in (1). (1)
Figure 1. System overview (m local schedulers Sched 1 ... Sched m, each with its List_scheduler, plus the Recover and Pending structures with their List_recover and List_pending, communicate with n placers, placer 1 ... placer n, each placer managing one SoPC, SoPC1 ... SoPCn).

Under hardware environment, the placement and scheduling problems are highly interlinked. Effectively, the placers must satisfy the resource requirements of each task elected by the schedulers, and the scheduler decisions must be made according to the ability of placers to provide sufficient RBs for tasks, respecting their precedence and real-time constraints. Thus, the major challenge in this environment is to reduce the rejection rate as much as possible. Both following sections detail our proposed algorithms for placement/scheduling DAGs on reconfigurable devices (SoPCs). A. On-line Placement Algorithm Placement problem consists of two sub-functions: i) partitioning which handles the free space of resources in the SoPC and identifies the Maximal Empty Rectangles (MER) enabling task execution. MER are the empty rectangles that are not contained within any other empty rectangle and ii) fitting which selects the best feasible placement solution within MERs by maintaining the resource efficiency. As stated above, as shown in Fig. 2, we are based on 2D column-based architectures represented by a matrix (Yi,j), LineNumber depicts the line number in the SoPC and ColumnNumber denotes its column number. (2) To achieve partitioning sub-function, we have to define the Max_widthi,j and Max_heighti,j for each Yi,j. Max_widthi,j is the number of free RBs found throughout the line of Yi,j by starting from Yi,j and without crossing an occupied RB. Max_heighti,j is the number of free RBs counted from Yi,j throughout its column till the first occupied RB. Max_widthi,j and Max_heighti,j are null for the occupied RBs (Yi,j = 0). The search of MERs also


claims the search of key RBs. Key RBs are the free RBk which provide the upper left vertices of MERs. A key RB is an RBk (Yi,j) having an occupied RB on its left (Yi,j-1) or the free RB on its left has a Max_heighti,j-1 inferior to that of RBk. Moreover, a key RB RBk must have an occupied RB above (Yi-1,j) or the free RB above has a Max_widthi-1,j inferior to that of RBk. In Fig. 2, the RBs in the SoPC with the star symbol are the key RBs and the values in parentheses are their Max_width and Max_height.
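For illustration, Max_width and Max_height can be computed from the occupancy matrix in a single backward sweep. The sketch below assumes a binary matrix Y (1 for a free RB, 0 for an occupied RB), the 6-line by 7-column device size used in the experiments of Section III, and that the count proceeds rightward along the line and downward along the column, as the example of Fig. 2 suggests.

#include <stdint.h>

#define LINES   6                      /* LineNumber of the SoPC model                 */
#define COLUMNS 7                      /* ColumnNumber of the SoPC model               */

/* Y[i][j] = 1 if the reconfigurable block is free, 0 if it is occupied.
 * max_width[i][j]:  free RBs from (i,j) along line i up to the first occupied RB.
 * max_height[i][j]: free RBs from (i,j) along column j up to the first occupied RB.   */
void compute_max_extents(const int Y[LINES][COLUMNS],
                         int max_width[LINES][COLUMNS],
                         int max_height[LINES][COLUMNS])
{
    for (int i = LINES - 1; i >= 0; --i) {
        for (int j = COLUMNS - 1; j >= 0; --j) {
            if (!Y[i][j]) {                        /* occupied: both extents are null   */
                max_width[i][j]  = 0;
                max_height[i][j] = 0;
            } else {
                max_width[i][j]  = 1 + (j + 1 < COLUMNS ? max_width[i][j + 1]  : 0);
                max_height[i][j] = 1 + (i + 1 < LINES   ? max_height[i + 1][j] : 0);
            }
        }
    }
}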
Algorithm 1. MER search.
Figure 2. Key RB and MER search (column-based SoPC with 4 RB types RB1-RB4; the starred key RBs are annotated with their (Max_width, Max_height) values, e.g., (1,4), (2,3), (3,3), (2,4), (3,1); MER1 and MER2 originate from Y2,2, MER3 from Y1,2, MER4 from Y2,3, and the MER added by Y2,3 is nested in MER1).

1) Partitioning: Based on key RB, Max_width and Max_height of RBs, this first sub-function of placement problem consists in extracting MERs to enable the placement of elected tasks on reconfigurable device. Partitioning is conducted through Algorithm 1. Algorithm 1 deals with each key RB independenly. At the beginning of Algorithm 1, to avoid MER nesting, line 5 and the test ensured by line 10 select the RBs throughout the Max_height of the key RB to be considered for SCAN function (line 12). These selected RBs must provide Max_width greater than that of RB above the current key RB (line 5). In fact the Max_widths of RBs which do not satisfy this condition are inevitably taken by the previous key RB. Throughout the Max_height of each key RB (line 6), Algorithm 1 scans all the RBs and each time, it takes the Max_width of the current RB (lines 7,8) as the current width of a new MER: MER_width (line 11) and checks whether there are RBs above this current RB and below the current key RB having Max_width inferior or equal to that of the current RB (lines 12-21). Should this be the case, the current RB would not be considered. For example, in Fig. 2, for the key RB Y1,2, the Max_widths 3 given by Y2,2 and Y3,2 and the Max_width 2 given by Y4,2 are not considered as Y1,2 above these RBs has a Max_width of 1. This test avoids the duplication of MERs as well as it checks the feasibility of MER construction. If the Max_width of the

current RB is accepted, Algorithm 1 determines the height of the MER by booking MER_width RBs on all the lines between the current key RB and the last RB having Max_width superior to MER_width (lines 22-27). Once the construction of the MER is finished (line 28) another tests of MER nesting is performed by line 29. The first test Validity_Left searches the MERs added by the key RBs situated on the same line as the current key RB on its left. If one of these old MERs has the same height as the new MER and if the upper right and the bottom right vertices of the old MER are greater than or equal to the upper left and the bottom left vertices of the new MER, the new MER is necessarily encapsulated in the old MER. In this case, the new MER will not be inserted. For example, in Fig. 2, the gray MER added by Y2,3 is nested in MER1. MER1 is created by the key RB Y2,2 on the left of the key RB Y2,3, both situated on the same line. As both MERs have the same height 2, and the upper left and the bottom left vertices of this new MER are inferior to the bottom right and the upper right vertices of MER1, the new MER of Y2,3 is deleted. Similarly, the second test Validity_Up avoids the insertion of new MER having the same width as an old MER provided by a key RB located above in the same column as the current key RB and its bottom left and


bottom right vertices are greater than or equal to the upper left and the upper right vertices of the new MER. Consequetly, Algorithm 1 guarantees the search of all possible MERs in the SoPCs without duplication nor nesting. 2) Fitting: During scheduling, each elected task from each local scheduler will be fitted by a placer. Thus, according to the found MERs in their correponding SoPCs, n placers provide the best Reconfigurable Physical Blocs (RPBs) for the elected tasks in the SoPCs. The placers search all the valid MERs for tasks and provide the closest RPBs in order to minimize the internal fragmentation. Valid MER must include all the types of RBs required by the task and the needed number of these RBs as specified in the RB-model of the task to enable its execution. Based on the column-based architecture, our proposed best fitting for a given task A and by a given placer is described by Algorithm 2. Algorithm 2 starts RPB search from the upper left vertex of each valid MER. Algorithm 2 hugely relies on the column-based architecture and only scans the first line of the valid MER. It searches the first column in the MER containing an RBk included in A_RB and not yet scanned (line 7). From this current first column (line 8), it scans the whole MER line horizontally to search the remaining RB types required by A_RB (lines 10-22). Max_RB represents the height of the RPB according to the required number of RBs in the hardware task and the height of the valid MER. If the required number of one RB type exceeds the height of the valid MER, Max_RB is equal to the MER_height (line 18) and the remaining number of this RB type (line 17) will be searched in the following columns of the valid MER. Otherwise, the required number of the current RB type is attained (line 14) and the Max_RB is adjusted to the last highest value (line 15). Then, Algorithm 2 checks whether all the RB types included in A_RB are found and their required number are achieved starting from this current first column (line 23). Should this be the case, it books the computed necessary number (Max_RB) (line 25) for this new possible RPB. Among all possible RPBs (Possible_RPB) extracted by scanning all the columns from overall valid MERs in a given SoPC and by a given placer, the closest one to the RB-model of A will be selected as the best fitting for A in the SoPC (line 31). B. On-line Scheduling Algorithm In this section, based on the on-line placement presented in the previous section, we define our proposed Schedulers-Driven placement/scheduling. Fig. 3 illustrates the possible states for tasks in the arrived DAGs. Schedulers-Driven placement/scheduling is performed by means of Algorithm 3 and 4. Every tick (T time units), Algorithm 3 uses all the previous algorithms to move tasks between the various states. We assume that there are DAG_number DAGs arriving at the system with fixed

Algorithm 2. Best fitting of task A.

inter-arrival interval. The arrived DAGs are assigned to the idle local schedulers. The Schedulable tasks in each DAG are fetched by its Local_scheduler and inserted in its List_scheduler. A task in a DAG is considered schedulable if either the task has no predecessors or if all its predecessors have been placed/scheduled. A task is accepted by a SoPC if its deadline and RB requirements for that SoPC remain guaranteed. If a task is not accepted by any SoPC during its laxity time then it is rejected. Consequently, the DAG that the task belongs to is rejected. At the beginning, Algorithm 3 checks if the tasks in List_scheduler, List_recover and List_pending still guarantee their deadlines (lines 19-20). If a task misses its deadlines, it is transferred to Rejected state. A DAG is accepted only if all its composite tasks are acceptable. When a task is rejected, all the schedulable tasks in the List_scheduler of the rejected task, all the recovered, pending and placed/scheduled tasks that belong to the DAG of the rejected task are deleted from their housed lists and from their assigned SoPCs (line 21). Then the Schedulers-Driven detects the SoPCs that have sustained MER modification after task completion or task rejection (line 25). The current time t is kept if the SoPC has experienced MER modification (line 26). When some deleted tasks were scheduled and placed as the last tasks to be executed in RPBs, their elimination from the system could enable the placement/scheduling of pending tasks. In addition, the completion of the last tasks in RPBs frees additional resources in SoPCs which could allow the placement/scheduling of pending tasks. Thus, in these cases, the pending tasks are transmitted to List_recover by saving the time of their recovering (lines 29-32) and their states become Recovered. In the case the rejected tasks are not the latest tasks for execution in the RPBs (line 33), Schedulers-Driven checks the possibility of replacing some of these rejected tasks by pending tasks while respecting


Algorithm 3. On-line Schedulers-Driven placer/scheduler.
Figure 3. Task states (Schedulable, Selected, Placed and scheduled, Pending, Recovered, Rejected, Rejected DAG; a deadline miss leads to rejection; transition conditions: (1) valid MER OR (valid occupied RPBs && Ts respects the deadline); (2) no valid MER && (valid occupied RPBs and Ts does not respect the deadline, or invalid occupied RPBs); (3) (valid MER with End_task or Reject_last) OR (valid occupied RPBs && Ts respects the deadline with Reject_last)).
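A minimal C rendering of the per-task bookkeeping implied by Fig. 3 could look as follows. The type and field names are ours, and the feasibility test is only one plausible reading of the admission check described for Algorithm 3, not the paper's exact condition.

#include <stdint.h>

#define NP 4                              /* number of reconfigurable resource types   */

enum task_state {                         /* task states of Fig. 3                     */
    SCHEDULABLE, SELECTED, PLACED_SCHEDULED,
    PENDING, RECOVERED, REJECTED
};

struct rt_task {
    uint32_t CA;                          /* worst case execution time (in T units)    */
    uint32_t DA;                          /* relative deadline                         */
    uint32_t RA;                          /* release time (DAG arrival + predecessors) */
    uint32_t rb_model[NP];                /* required number of RBs of each type       */
    enum task_state state;
};

/* One plausible check that a task can still meet its deadline at time `now`. */
static int deadline_still_feasible(const struct rt_task *t, uint32_t now)
{
    uint32_t start = now > t->RA ? now : t->RA;
    return start + t->CA <= t->RA + t->DA;
}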

their release times, their deadlines and their RB-models (line 34). If this latter replacing is feasible for some tasks, their states change to Placed/Scheduled and the successors are searched to be the new schedulable tasks and inserted in the list of Local_scheduler to which these substitute tasks belong. Therefore, each Local-scheduler and Recover picks the schedulable task with the earliest deadline from its list List_scheduler and List_recover (lines 36-38). The state of elected tasks is changed to Selected state. When the selected task is taken from List_recover, only the placers, that sustain their MER modification at a time greater than or equal to the Recover_time of the elected task, are selected to deal with this task (lines 39-40) otherwise, all the n placers are considered to place and schedule this task (line 42). Then, each Local-scheduler and Recover calls the on-line placers described by Algorithm 4 and performed by the selected placers (line 44). Each selected placer manages its free space by Algorithm 1 detailed in the previous section (lines 56-59). If the selected placer affords valid MERs for the selected task, it searches its fittest RPB in its free RB space by means of Algorithm 2 and the start time of the task in its associated SoPC is obtained by the maximum between the release time of the task and the current time (lines 60-63). In the case that the SoPC does not include valid MERs for the selected task, it attempts to place and schedule it in its occupied RPBs (line 65). If the possible start time (Ts) provided by an occupied RPB is superior to the release time of the task (line 66), the corresponding placer checks if this start time maintains the deadline of the selected task, should this be the case, it verifies if this occupied RPB satisfies the RB requirements of the task. If the occupied RPB respects the real-time requirements and RB-model of the task (line 67), it is accepted (lines 68-69). When the Ts of the occupied RPB is inferior to the release time of the task (line 71), only RB requirements are checked (line 72) as the start time of the selected task in this occupied RPB will be its release time (lines 73-74). Among all the accepted occupied RPBs, the earliest start time for the task is chosen and the corresponding RPB is selected (lines 77-78). When several RPBs ensure the earliest start time, the fittest RPB is kept. The placers data are sent to the Local-schedulers and the Recover which make the final decision towards their

Algorithm 4. On-line placers.


selected tasks. If the placers provide feasible placement for a selected task by guaranteeing its real-time constraints, among all possible RPBs, its Local_scheduler or the Recover picks the fittest RPB that enables the earliest start time for task execution and the new possible schedulable tasks are searched and inserted in the list of Local_scheduler of the selected task (lines 45-46). The state of the selected tasks is then termed Placed/Scheduled. If there are no available RPBs for a selected task, it will be transmitted to Pending state (line 48), as some other task rejections or completions could allow its placement/scheduling in SoPCs. Schedulers-Driven achieves the operating phase by updating limit that controls the existence of unscheduled DAGs (line 51). III. SIMULATION RESULTS In order to analyze the feasibility of our SchedulersDriven placement/scheduling approach and to prove its performance, several simulation experiments are conducted. 10 DAG sets are generated by means of TGFF3-5 tool [10]. DAG set features are described in TABLE I. In each DAG set the inter-arrival interval of DAGs is fixed to 50 T time units. The empirical chosen values for local scheduler number and placer number are m=6 and n=4. n should be inferior to m in purpose of creating low-cost designs with high resource efficiency, we cannot produce a number of SoPCs as great as the number of arrived DAGs to satisfy their physical requirements. However, the number of local schedulers should be as big as possible in order to place and schedule several DAGs simultaneously and to exploit as much as possible the SoPC resources. We created 4 heterogeneous columnbased SoPCs of 6 lines and 7 columns having 4 RB types. In all DAG sets, the average RB heterogeneity rate in DAG tasks is 2.31 (i.e 2.31 RB types among the 4 RB types are averagely used by each task). Fig. 4 shows the run time of Schedulers-Driven placement/scheduling for the 10 DAG sets. DAG_SET6, DAG_SET8 and DAG_SET9 give the highest run times as they are composed of the biggest numbers of DAGs (30, 24, 27). Moreover, they produce also high execution times (23-25) which explain the slowness of the run time. Indeed, due to the longest execution times of tasks, the occupied RPBs remain in execution for a long time and the lateness of their releasing drives tasks to List_Pending many times. Thus, the DAGs lie for a long time in the system. The slowdown of one DAG is defined by: Slowdown(DAG) = Msingle(DAG)/Mmultiple(DAG), where Msingle is the makespan of the DAG up to its last placed/ scheduled task by Schedulers-Driven and when it has the available SoPCs on its own, and Mmultiple is the current makespan of the same DAG when it is placed/scheduled by Schedulers-Driven onto SoPCs along with all the other DAGs. The DAG_SETs comparing slowdown by

TABLE I. FEATURES OF DAG SETS
            DAG number   Average size/DAG (Task)   Average deadline (T)   Average execution (T)
DAG_SET1    20           8.7                       68.91                  20.79
DAG_SET2    11           15.8                      62.91                  21.86
DAG_SET3    15           10.53                     69.04                  17.66
DAG_SET4    22           11.27                     71.37                  21.58
DAG_SET5    12           14.16                     65.31                  22.03
DAG_SET6    30           12.76                     76.53                  23.41
DAG_SET7    18           14.5                      88.15                  30.99
DAG_SET8    24           16.33                     76.41                  25.07
DAG_SET9    27           14.5                      78.10                  25.14
DAG_SET10   10           16.7                      92.71                  27.66

Figure 4. Run time measurements.

Schedulers-Driven placer/scheduler is shown in Fig. 5. For better performance in the system, the slowdown should be closer to 1. As expected, the DAG_SETs having lower DAG number, lower average size and shorter execution, afford more fairness to their composed DAGs such as DAG_SETs1-4 and consequently, they result in smaller slowdown. DAG_SETs7,8,9,10 produce also small slowdown due to RPB reuse and placement efficiency. DAG_SETs5,6 show the highest slowdowns since they are constructed by big numbers of DAGs of high average sizes and raised heterogeneity rate which cause conflicts between DAGs to use SoPCs and increases their slowdowns. Our real-time DAG-based placement/scheduling on the heterogeneous SoPCs suffers from the problem of task rejection due to missed deadlines and the lack of free RB space for a given selected task. Fig. 6 presents the guarantee ratio (i.e percentage of DAGs guaranteed to meet their deadlines) measured for the DAG_SETs. For all DAG_SETs, Fig. 6 shows a guarantee ratio superior to 51 %. Highly relaxed average deadline combined with low average execution time and small average size within DAG_SET, have noticeable impact on increasing guarantee ratio. We observe 100 % of DAGs accepted in DAG_SET 1,3,4 as these latter parameters are suitably chosen. In [5], by using 8 homogeneous processors, the attained guarantee ratio for 5 DAGs of 20 tasks is 70 %. Our approach outperforms [5] as for the DAG_SET nearly similar to that studied in [5]: DAG_SET5 composed of 12 DAGs with an average size for each one of 14.16 tasks Schedulers-Driven can place and schedule 83 % of


Figure 5. Average slowdown measurements.

Figure 6. Guarantee ratio measurements.

real-time DAGs in the system. Under strict physical resource constraints, SchedulersDriven placement/scheduling predicts the placement and scheduling of tasks often before their release times. As shown in Fig. 7, this advantage provided by our proposed approach benefits up to 91 % of placement/scheduling phases in all DAG_SETs to prefetch the schedule and placement of tasks before their release times. These remarkable prefetch ratios hugely reduce the placement and scheduling overheads. Thanks to prefetch technique, almost all the configuration operations are hidden, which will lead to improving the system performance. For better placement quality in the system, the resource efficiency should be closer to 1. Thanks to the slickness provided by our optimal placement method, we reached 0.6 of resource efficiency in all DAG_SETs. This relevant resource efficiency shows how much the used RPBs, where tasks are fitted, are closer to their RB-models. In addition, for all DAG_SETs, based on the run-time reconfiguration mechanism, by reusing the occupied RPBs in 45-75 % of placement/scheduling phases, the placement overhead is totally revoked, the configuration overhead is highly reduced and the resource efficiency is immensely enhanced by freeing more RB space for the future arriving DAGs. IV. CONCLUSION AND FUTURE WORK This paper presents a novel placement/scheduling approach for real-time DAGs with non-deterministic behavior on heterogeneous SoPCs. We think that this paper reveals an initial only study of heuristics for multiple DAG placement/scheduling onto SoPCs. Moreover, it addresses the most challenging problems disrupting the embedded systems which are the achievement of high performance expressed by run time, slowdown, reaching the highest resource efficiency and reducing configuration overhead. Further research focuses on other approaches with task preemption and with other notions of quality of service by exploiting the unused middle slot times within the RPBs.
Figure 7. Placement/Scheduling prefetch measurements.

REFERENCES
[1] L. Zhu, Z. Sun, W. Guo, Y. Jin, W. Sun and W. Hu, "Dynamic Multi DAG Scheduling Algorithm for Optical Grid Environment," Proceedings of SPIE, Vol. 6784, Part 1, p. 67841F, 2007.
[2] L. He, S. Jarvis, D. Spooner and G. Nudd, "Dynamic, capability-driven scheduling of DAG-based real-time jobs in heterogeneous clusters," International Journal of High Performance Computing and Networking, Vol. 2, pp. 165-177, March 2004.
[3] M. Iverson and F. Ozguner, "Hierarchical, competitive scheduling of multiple DAGs in a dynamic heterogeneous environment," Distributed Systems Engineering Journal, Vol. 6, No. 3, pp. 112-120, July 1999.
[4] X. Qin and H. Jiang, "Dynamic, reliability-driven scheduling of parallel real-time jobs in heterogeneous systems," International Conference on Parallel Processing, pp. 113-122, 2001.
[5] K. Bazargan, R. Kastner and M. Sarrafzadeh, "Fast Template Placement for Reconfigurable Computing Systems," IEEE Design and Test, Vol. 17, pp. 68-83, January 2000.
[6] C. Steiger, H. Walder, M. Platzner and L. Thiele, "Online scheduling and placement of real-time tasks to partially reconfigurable devices," International Real-Time Systems Symposium, pp. 224-235, December 2003.
[7] M. Handa and R. Vemuri, "An Efficient Algorithm for Finding Empty Space for Online FPGA Placement," Design Automation Conference, pp. 960-965, June 2004.
[8] A. Ahmadinia, C. Bobda, M. Bednara and J. Teich, "A New Approach for On-line Placement on Reconfigurable Devices," International Parallel and Distributed Processing Symposium, p. 134, April 2004.
[9] H. Zhao and R. Sakellariou, "Scheduling multiple DAGs onto heterogeneous systems," Parallel and Distributed Processing Symposium, p. 130, April 2006.
[10] http://ziyang.eecs.umich.edu/~dickrp/tgff/

185

Generation of emulation platforms for NoC exploration on FPGA


Junyan TAN, Virginie FRESSE
Hubert Curien Laboratory UMR CNRS 5516, University of Jean Monnet - University of Lyon
18 Rue du Professeur Benoît Lauras, 42000 Saint-Etienne, FRANCE
firstname.surname@univ-st-etienne.fr

Frédéric ROUSSEAU
TIMA Laboratory, UJF/CNRS/Grenoble INP, SLS Group
46, Avenue Félix Viallet, 38000 Grenoble, FRANCE
firstname.surname@imag.fr

Abstract - NoC (Network on Chip) architecture exploration is an up-to-date problem with today's multimedia applications and platforms. The presented methodology gives a solution to easily evaluate timing and resource performances by tuning several architectural parameters, in order to find the appropriate NoC architecture with a unique emulation platform. In this paper, a design flow that generates NoC-based emulation platforms on FPGA is presented. From specified traffic scenarios, our tool automatically inserts appropriate IP blocks (emulation blocks and routing algorithm) and generates an RTL NoC model with specific and tunable components that is synthesized on FPGA.

I. INTRODUCTION
Systems-on-Chip (SoC) based on Networks-on-Chip (NoC) architectures are one of the most appropriate solutions for media processing embedded applications. With the growing complexity of consumer embedded systems, the emerging SoC architectures integrate numerous components such as memories, DSPs, specialized processors, microcontrollers and IPs. The ever increasing number of components, amount of data and size of the data to transfer required by the algorithm lead to the design of efficient ad hoc NoC architectures matching the algorithm specifications. An ad hoc NoC offers high bandwidth and high scalability with lower power and lower complexity [1]. However, designing an ad hoc NoC means making several architectural choices, such as buffer sizing, flow control policies, topology selection and routing algorithm selection. These choices must be made at design time, keeping in mind that the final NoC must fulfill a set of critical constraints that depend on the target application, such as latency, energy consumption and design time. The design space being very wide, automation of the design flow must be considered to ensure a rapid evaluation and test of each solution. In the last years, several approaches [2][3] were proposed to automate the design space exploration of the architecture. All these approaches can be categorized into two types: formal approaches and experimental approaches. A formal approach aims to construct a mathematical formulation to predict the NoC behavior. An experimental approach uses either simulation or emulation tools. In approaches that use software simulation, the NoC can be modeled at different levels of abstraction, the abstraction level being a tradeoff between the desired accuracy and the validation speed. FPGAs (Field Programmable Gate Arrays) are commonly used as reconfigurable devices for emulation and test. FPGAs are programmable logic devices used in various applications requiring rapid prototyping of digital electronics (telecommunication, image processing, aeronautics). Modern FPGAs are now able to host processor cores or DSPs, as well as several IP blocks, to perform efficient prototyping of embedded systems. Today, several NoC-based emulation platforms on FPGA have been proposed, such as [4][5][6]. Nevertheless, these emulation platforms are not adapted to image and signal processing applications. Their emulation blocks cannot emulate all data transfers, such as data transmissions from one initiator to several destinations with automatic data rate injection variation. In this paper, we propose a generic design flow for the emulation of large NoC-based MPSoCs on an FPGA platform. This design flow automatically builds the emulation architecture based on the NoC architecture, the type of emulation and the routing algorithm. According to the requirements of the application, this emulation architecture provides emulations of data transmission performances from one or several initiators to one or several destinations with automatic data rate injection. In addition, it is implemented on an FPGA platform and supplies a statistics report for the future design of the whole system. The whole emulation architecture is designed as a hierarchical VHDL description, fully synthesizable and FPGA-independent. This paper is organized as follows. Section 2 describes some related work. In Section 3, we introduce the generic design flow for the automatic generation of the emulation platform on FPGA. This section details all required components inserted in the design flow. Section 4 presents the design flow adapted to the Hermes NoC on a Xilinx platform; experiments are presented and analyzed in this section. Section 5 contains the conclusion.

II. RELATED WORK
During the last years, there has been an impressive evolution and development of NoC architectures on embedded platforms. These existing NoCs do not resolve one of the principal challenges of these communication architectures: finding the optimum, or a set of optimum, NoC architectures for a target application. Several simulation and

186

emulation models are proposed at different abstraction levels. These models do not permit the exploration of the design space: exploration remains a manual task requiring the experience of the designer. Today, several NoC architectures have successfully been implemented on FPGA devices, such as Hermes [7], SoCIN [8], PNoC [9], HIBI [10] and the Extended Mesh [11]. The placed and routed (P&R) architecture for the FPGA implementation is generated from the design flow associated with the NoC. These FPGA-based tools or environments are based on simulation in VHDL or SystemC, or on a combination of specification, simulation, analysis and generation of NoCs at different levels of abstraction. In [14], a SystemC-based platform for modeling, simulating and evaluating an MPSoC NoC including a real-time operating system is presented. In [15], a mixed design flow is proposed, based on SystemC simulation and VHDL implementation of the NoC structure, called NoCGen; this platform uses a template router to simulate several interconnection networks with SystemC. In [17], a modeling environment is described for custom NoC topologies based on SystemC. However, these approaches are limited by their level of accuracy in the estimations and their level of synthesis on the FPGA. Increasing the level of accuracy significantly increases the simulation time. These simulations have a much larger execution time compared to a NoC platform emulated on an FPGA device: the simulation time with SystemC or ModelSim for 10^9 packets can range from 5 days to 36 days [4]. Only the NoC structure is implemented on the FPGA; the emulation platform cannot be implemented. Emulation on FPGA is proposed to obtain faster simulation times and a higher accuracy of the functional validation. In [4], the authors present a mixed HW-SW NoC emulation platform implemented on FPGA. The VHDL-based NoC is implemented on a Virtex-II FPGA. This architecture contains a communication network, traffic generators, traffic receptors and a control module. A hard-core processor (PowerPC) is connected to the emulation hardware platform as a global controller that defines the parameters of the emulation. A fast network-on-chip emulation framework on a Virtex-II FPGA is presented in [16]; it supplies a fast synthesis process by using several hard cores for partial reconfiguration. These frameworks integrate one or several hard-core processors used for emulation only. They control the communication architecture, which costs part of the limited FPGA resources. All the emulation platforms presented previously are used only for one-to-one or multi-to-one communication. Image and signal processing algorithms require sending data to multiple destinations, which is not supported by existing emulation platforms. In order to solve these problems, the proposed NoC emulation platform consists of data emulation blocks

(traffic generator and traffic receptor) and a synthesizable NoC architecture. This platform can emulate any traffic required by image and signal processing applications.

III. DESIGN FLOW FOR THE GENERATION OF THE FPGA EMULATION PLATFORM
A generic design flow is proposed to generate the emulation architecture for the FPGA platform. The design flow takes as inputs:
- the NoC architecture with several varying parameters,
- the routing algorithms,
- the emulation blocks: traffic generators and traffic receptors, according to the type of emulation,
- the initiators and receptors with their data transfer specifications.
1. Design Flow

The design flow depicted in Figure 1 automatically generates the emulation architecture for the FPGA platform. It is organized as a hierarchical VHDL structure to ensure component instantiation at each level of the design flow. Several packages (routing, data_transfers) are designed for the parameterization of the architectures. Generic IP blocks are provided for component insertion: the routing component, the traffic generators (TG) and the traffic receptors (TR).

Figure 1. Design flow for the generation of the emulation platform on FPGA.
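To give an idea of what this hierarchical, package-driven structure looks like in practice, the following sketch shows a top-level architecture that conditionally instantiates a TG at each node through generate statements. It is only an illustration of the mechanism under assumed names: the entity, component and signal identifiers (top_noc, traffic_generator, is_initiator, tg_data) and the constant values do not come from the generated platform itself.

library ieee;
use ieee.std_logic_1164.all;

entity top_noc is
  port (clk, rst : in std_logic);
end entity top_noc;

architecture emulation of top_noc is
  constant NB_NODES : integer := 16;  -- e.g. a 4x4 mesh
  -- In the generated platform this information comes from the Data_Transfer
  -- package; here it is a hard-coded placeholder marking nodes 3 and 7 as initiators.
  constant is_initiator : std_logic_vector(0 to NB_NODES-1) := (3 => '1', 7 => '1', others => '0');

  component traffic_generator is  -- generic TG IP block of the emulation library
    port (clk, rst : in std_logic;
          data_out : out std_logic_vector(15 downto 0));
  end component;

  type data_array is array (0 to NB_NODES-1) of std_logic_vector(15 downto 0);
  signal tg_data : data_array;
begin
  gen_nodes : for i in 0 to NB_NODES-1 generate
    -- A TG is instantiated only where the node acts as an initiator;
    -- switches, TRs and the NoC links are inserted in the same way.
    gen_tg : if is_initiator(i) = '1' generate
      u_tg : traffic_generator
        port map (clk => clk, rst => rst, data_out => tg_data(i));
    end generate gen_tg;
  end generate gen_nodes;
end architecture emulation;

The same conditional-instantiation pattern is what the following steps of the design flow rely on when they insert routing components and emulation blocks.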

The designer first selects, in the package, the NoC structure that he wants to explore. He specifies the number of switches and the size of the bus. The design flow is based on an existing NoC structure implemented on an FPGA platform and on its associated design flow. Any existing parameterized synchronous NoC structure can be used as the input, as long as its HDL description is available. The NoC structure should contain switches, buffers, links and

187

a flow control protocol. The design flow takes as input the VHDL description of the NoC structure. The first step concerns the routing algorithm; this decision is taken by the designer, who selects in the routing package the routing algorithm to be used with the NoC structure. The selected routing algorithm is set to 1 and the unused routing algorithms are set to 0, as depicted in Figure 2.
Routing_XY:=1; Routing_NFNM:=0; Routing_WFM:=0;
Figure 2. Example of routing algorithm selection.
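As a concrete illustration of the routing package of Figure 2 and of the routing functions stored in the routing IP block library, a minimal VHDL sketch is given below. Only the Routing_XY, Routing_NFNM and Routing_WFM flags come from Figure 2; the package name, the port type, the coordinate width and the body of the XY function are assumptions made for the example, not the tool's actual code.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package routing_pkg is
  -- Selection flags, as in Figure 2 (1 = selected, 0 = unused).
  constant Routing_XY   : integer := 1;
  constant Routing_NFNM : integer := 0;
  constant Routing_WFM  : integer := 0;

  type port_t is (LOCAL, NORTH, SOUTH, EAST, WEST);

  -- Example routing function of the library: deterministic XY routing.
  function route_xy (cur_x, cur_y, dst_x, dst_y : unsigned(3 downto 0)) return port_t;
end package routing_pkg;

package body routing_pkg is
  function route_xy (cur_x, cur_y, dst_x, dst_y : unsigned(3 downto 0)) return port_t is
  begin
    if dst_x > cur_x then
      return EAST;   -- correct the X offset first
    elsif dst_x < cur_x then
      return WEST;
    elsif dst_y > cur_y then
      return NORTH;  -- then correct the Y offset
    elsif dst_y < cur_y then
      return SOUTH;
    else
      return LOCAL;  -- the packet has reached its destination node
    end if;
  end function route_xy;
end package body routing_pkg;

In the switch control block, an if-generate statement guarded by Routing_XY = 1 would then instantiate the logic that calls route_xy, which corresponds to the insertion step described in the next paragraph.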

The design flow then inserts the corresponding routing IP component from the routing IP block library; we assume that this library contains the VHDL functions of the routing algorithms. The routing algorithm is inserted by instantiating the appropriate routing function in the switch control block of the switch architecture. The design flow adds these routing IP blocks to all switches of the NoC structure to obtain the complete NoC architecture. At this level, the communication architecture is complete and the nodes can be inserted. The parameterized emulation blocks are added to the platform in the emulation block insertion step, according to the type of emulation. Emulation blocks are traffic generators and traffic receptors designed as VHDL IP blocks. The parameters of all emulation blocks are specified in the Data_Transfer package. This emulation block insertion step is a generic VHDL component instantiation that uses the Data_Transfer package to allocate the generic VHDL IP blocks. The complete emulation architecture for the FPGA platform is generated as an HDL description, and the system is then synthesized and implemented with the synthesis and place-and-route tools of the target platform. In the presented design flow, the designer tunes the emulation platform by setting the routing IP components in the routing IP block library and by specifying the scenarios in the Data_Transfer package. Two inputs are reused from existing algorithms/blocks: the NoC structure and the routing algorithm. These inputs can be reused immediately if they have previously been described in a hardware language for synthesis purposes. Emulation blocks and data transfer specifications are inputs designed specifically for this design flow. Descriptions and parameterizations are detailed in the following sections.
2. NoC architecture

NoC architectures [2] are communication architectures that improve the flexibility of the SoC communication subsystem, offering high scalability, high performance and energy-efficient customized solutions. A NoC architecture is composed of several basic elements: Network Interfaces (NI), switches, links and resources. These basic elements are connected using a topology to constitute the NoC architecture. Data transmitted in NoC architectures are sent through messages. Several data items can be sent within one message, and one data item can be sent with several messages. One message is a set of packets and a packet is a set of flits (Flow Control Units). The flit is the basic element transferred by the NoC. The designer selects the size of the flits, the number of flits per packet and the size of the packets. Packets are sent according to the data injection rate. The data injection rate is defined as the ratio between the amount of data received by a link and its capacity to carry data; a 50% data injection rate indicates that the packets use 50% of the bandwidth. Sizing the NoC has a direct impact on the timing performances and on the resources used. It is therefore important to evaluate the performance of the NoC according to the size of the flits, the size of the packets and the size of the messages. Any NoC structure containing all components except the routing algorithm can be used as an input of the design flow. The structure should be described in a Hardware Description Language (HDL).
3. Routing algorithm
Several routing algorithms are implemented on FPGA and ASIC devices. Routing algorithms define the path taken by a packet between the source and target switches. Three types of routing algorithms exist: deterministic, partially adaptive and fully adaptive [3]. 2D meshes and k-ary n-cubes are popular on FPGA as their regular topologies simplify routing. In a 2D mesh, there are four directions, eight 90-degree turns, and two abstract cycles of four turns. The most commonly used algorithms are deterministic routing algorithms, because of their simplicity and their lower resource usage; XY is a commonly used deterministic routing algorithm. Other routing algorithms are semi-deterministic (or semi-adaptive) algorithms; one example is the West First algorithm. Fully adaptive routing algorithms have also been proposed. Such algorithms prevent livelock and deadlock [20], but they are not used for the FPGA implementation because they require more resources and a more complex algorithm than the other two types. Any routing algorithm can be inserted in the design flow if it is described in an HDL language. Some routing IP blocks have already been developed in VHDL and are inserted in the design flow.
4. Traffic generators

For the emulation of the NoC, IP blocks or other components connected to the NoC are replaced by traffic generators (TG) designed in parameterized VHDL entities. Deterministic traffic generators are widely used in NoC

188

emulation. These traffic generators simulate the traffic flow between IP blocks inside the NoC. They generate stochastic traffic distributions (packet size, injection time, idle interval duration and packet destination) in order to reproduce the behavior of a real IP block for a given application. Several TGs have been designed previously [4][5][6][7]. The format of the packets sent by these traffic generators is not suitable for image and signal processing applications. For example, existing TGs can send packets to one destination node only [4][7] or in broadcast. For most TGs, there is no information about the address of the source node when the data are received. Image processing applications require unicast, multicast and broadcast transfers. Another required piece of information is the address of the source node when data are received by the destination node: most nodes perform computations between two types of data coming from two different nodes, so the destination node must be able to extract the address of the incoming data. The proposed packet format also inserts information about timing performances (latency) to implement a complete synchronous emulation platform, as depicted in Figure 3.

Figure 3. Format of packets generated by TGs.

In our emulation platform, the packets contain a header part and a data part with the following information:
- Address of the destination cores (Dest): any initiator core can send data to one or several destination cores.
- Address of the initiator core (Source).
- Init clock (Clk_init): this flit is reserved for the latency evaluation; when the packet is sent, the sending time is loaded.
- Size of the transmitted packet (Sz_pckt).
- Number of packets (Nb_pckt).
The generic TG is depicted in Figure 4. Each TG generates the control signals and the packet on the data_in output, whose size is equal to the flit size. The parameters of a TG are the address of the destination node (Address), the coordinates of the source node (IP_address_X and IP_address_Y), the size and number of packets (Size_packet, Nbre_packet) and the data injection rate between packets (Idle_packet). All this information is used to generate the packet format depicted in Figure 3. The global clock of the NoC is connected to the clk signal of the TG block to constitute a synchronous platform.

Figure 4. Signals and parameters for generic Traffic Generators.

The number and format of the packets depend on the traffic scenario specified in the data_transfer package.
5. Traffic Receptors
Traffic flows generated by the traffic generators are sent through the NoC and then received by traffic receptors (TR). The proposed traffic receptors are parameterized VHDL entities. A TR analyzes the received packets and extracts the latencies of the NoC. Two types of traffic receptors exist. The first type performs global analyses and statistics on the executed emulation in hardware: the global analysis consists of testing all packets and extracting the latency from each received packet (the latency is computed online and inserted in the 3rd flit of the packet). The latency and the global analysis are sent to a single LCD. The second type only generates a continuous report of the received traces, with detailed values, on the LCD available on the FPGA board. As the emulation platform is designed to emulate with the highest precision the behavior of the final system to be implemented, the emulation components for output data are restricted to the LCD. The designer can add components or interfaces for analysis, but should keep in mind that the emulated structure is then modified (changing the performances of the system). Both traffic receptors are parameterized VHDL blocks, to ensure an automatic generation of the emulation platform.
6. Data transfer specification
TG and TR blocks are inserted according to the data transfer specification given in the Data_Transfer package. Data transfers are given at the highest level of the emulation architecture description (top_NoC) with generic values. The designer specifies the size (size_packet) and number (nb_packet) of packets sent by each TG, together with the data injection rate (the idle time between packets, expressed with idle_packet). Data have the same format for a given TG. The designer indicates all destination nodes with the destination value: 1 indicates that the node is a TR and 0 that the node does not receive any data. Last_destination is a value indicating the number of TRs for every TG. The number of packets received by every TR is given by total_packet. The links between TGs and TRs are then given with destination_links. The example depicted in Figure 5 indicates that switches (3,4) and (4,4) are traffic generators (called TG1 and TG2), as they both send 10 packets of 15 flits (the size of the flits is automatically extracted from the NoC structure). The idle time between packets is 20 clock cycles for both TGs. Switches (0,0), (1,0), (2,0) and (0,1) receive packets. TG1 sends packets to 2 TRs and TG2 sends packets to 3 TRs. The number of packets received by switch (0,0) is 20 packets, and 10 packets for all other TRs. TG1 sends packets to switches (0,0) and (2,0); TG2 sends packets to switches (0,0), (1,0) and (0,1).
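To make these parameters more tangible, here is a rough VHDL sketch of what a Data_Transfer package and the generic TG interface of Figure 4 could look like. Only the parameter names quoted in the text (size_packet, nb_packet, idle_packet, destination, Address, IP_address_X/Y, Size_packet, Nbre_packet, Idle_packet) come from the paper; the package name, types, widths, default values and the idle_from_load helper, including its formula, are assumptions added for illustration and are not the tool's actual code or equation.

library ieee;
use ieee.std_logic_1164.all;

package data_transfer_pkg is
  constant FLIT_SIZE : integer := 16;   -- flit width, extracted from the NoC structure
  constant NB_NODES  : integer := 16;   -- number of switches in the mesh

  -- Scenario values for one TG (placeholder numbers, mirroring the Figure 5 example).
  constant size_packet : integer := 15; -- flits per packet
  constant nb_packet   : integer := 10; -- packets sent by the TG
  constant idle_packet : integer := 20; -- idle clock cycles between packets

  -- '1' marks a node that acts as a TR for this TG ("destination" in the text).
  type dest_vector is array (0 to NB_NODES-1) of std_logic;

  -- Illustrative helper only: one plausible way to derive idle_packet from a
  -- target load (in %, load > 0), assuming nbcyclesflit cycles per flit.
  function idle_from_load (load, nbcyclesflit : integer) return integer;
end package data_transfer_pkg;

package body data_transfer_pkg is
  function idle_from_load (load, nbcyclesflit : integer) return integer is
    constant busy_cycles : integer := size_packet * nbcyclesflit;
  begin
    -- idle cycles such that busy / (busy + idle) equals the requested load
    return (busy_cycles * (100 - load)) / load;
  end function idle_from_load;
end package body data_transfer_pkg;

library ieee;
use ieee.std_logic_1164.all;

entity traffic_generator is
  generic (
    Address      : integer := 0;   -- destination node
    IP_address_X : integer := 0;   -- source switch coordinates
    IP_address_Y : integer := 0;
    Size_packet  : integer := 15;  -- flits per packet
    Nbre_packet  : integer := 10;  -- number of packets to send
    Idle_packet  : integer := 20   -- idle cycles between packets (injection rate)
  );
  port (
    clk     : in  std_logic;                      -- global NoC clock (synchronous platform)
    rst     : in  std_logic;
    tx      : out std_logic;                      -- control signal toward the switch (assumed name)
    data_in : out std_logic_vector(15 downto 0)   -- flit output, named data_in as in the text
  );
end entity traffic_generator;

Per-pair information such as last_destination, total_packet and destination_links would be added to the same package in a similar way.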

189

Figure 5. Initiator and receptor with data traffic specification.

According to the data transfer specification, the design flow inserts the corresponding emulation blocks using the TG and TR library. A TG is inserted at a switch if the associated node is an initiator (i.e. the node sends at least one data item to another node). A TR is inserted at a switch if the associated node is a receptor of any initiator (i.e. the node receives at least one data item from an initiator). In all other cases, no TR or TG is instantiated in the NoC architecture. The design flow can insert any other type of emulation (random accesses, random sizes, parameterized latency for sending data); such emulation blocks are described in HDL and inserted in the emulation block library. TGs generate several packets that are sent sequentially to the NoC architecture. These packets are generated according to the data traffic specification given in the top_NoC package. If data are sent to several destination nodes, the TG generates several packets (one packet per node). TGs can send data with different packet sizes and numbers of packets to one or more destination nodes. Considering that the designer may not know the data injection rate of a TG, two types of TGs are proposed: (i) TGs that generate packets with a varying data injection rate; the data injection rate is automatically and dynamically generated from a 0% to a 100% load, and corresponds to the idle_packet parameter of the TG, which is automatically computed in the top_NoC entity from the target load and from nbcyclesflit, the number of cycles needed to transfer one flit; (ii) TGs that generate packets with a given data injection rate, specified by the designer as a constant value for each TG.

IV. DESIGN FLOW FOR THE HERMES NOC
The design flow previously presented is adapted to the Hermes NoC and its associated design tools on a Xilinx FPGA device. The generated emulation platforms are used to explore the design space of these NoC architectures, as presented in the following section. The experimental study aims to evaluate the performances of the Hermes NoC according to the routing algorithms and to the position of the nodes. The experimental platform is the ML506 evaluation board, which contains a Virtex-5 XC5VSX50 FPGA. The development tools are Xilinx ISE 10.1 with Precision RTL Synthesis. The following experimental studies are based on the average latency and on the number of resources.
1. Hermes NoC and ATLAS tool
Hermes is a NoC created by the Catholic University of Rio Grande do Sul (Porto Alegre, Brazil) [7]. This NoC is a 2D packet-switched mesh. The main components of this infrastructure are the Hermes switch and the IP cores. The Hermes switch has routing control logic and five bidirectional ports; the local port builds the connection between the switch and its local IP core. All ports possess input buffers for provisional storage of information. Hermes uses wormhole flow control. ATLAS is an open-source environment designed to automate the generation of the Hermes VHDL structure. Several features can be parameterized in the ATLAS environment: size of flit, buffer depth, number of virtual channels and flow control strategies. These parameters are easily set by the designer to match the specifications of the algorithm. In the design flow of Figure 1, ATLAS is the NoC generation tool and the VHDL IP blocks of the Hermes NoC form the NoC VHDL IP block. The 4x4 mesh NoC architecture with a 1 initiator (I) to 3 receptors (R) scheme is used for the experiments.
2. Routing algorithms
The routing algorithms used are XY, West First (WFM), North-Last (NLNM) and Negative First (NFNM). All these routing algorithms are designed as VHDL IP blocks for immediate insertion in the switches.
3. Impact of the routing algorithm on the number of resources

The first experiment, depicted in Figure 6, shows the number of LUTs according to the size of the NoC for several routing algorithms. The resources concern the NoC only; the emulation blocks are not considered. The number of LUTs is almost identical for all routing algorithms. Experiments were also made on the number of registers but are not depicted in this paper; the number of registers is almost identical whatever the routing algorithm used. The first observation is that the number of resources (LUTs and registers) depends on the size of the NoC. An in-depth analysis of the number of LUTs and registers is depicted in Figure 7 and Figure 8, respectively, for the XY and NLNM algorithms. These routing algorithms are chosen as they use respectively the lowest number and

190

highest number of resources. The biggest difference in the number of LUTs is 71 for a 4-node NoC and 850 for a 36-node NoC. This number may seem high, but it remains insignificant compared to the number of resources required for the NoC itself (3.2% and 3.4% of added LUTs, respectively).

Figure 6. Number of LUTs according to the size of the NoC.
Figure 7. Difference of LUTs for XY and NLNM routing algorithms.
Figure 8. Difference of registers for XY and NLNM routing algorithms.

The same analysis is made with the number of registers, as depicted in Figure 8. The difference in the number of registers is lower than for the LUTs (18 for a 4-node NoC and 196 for a 36-node NoC). These extra registers represent between 2.79% and 3.65% of the registers required for the NoC architecture itself. Therefore, the difference in the number of LUTs and registers is not significant in the choice of the routing algorithm for the Hermes NoC when such a structure is implemented on FPGA. The functionality and advantages are more important in the choice of the routing algorithm than resource optimization. It is therefore wiser to implement a routing algorithm that avoids deadlocks and livelocks than to save a few resources.
4. Impact of the emulation blocks and routing algorithms on the timing performances
For the following experiments, 4 TGs (TG1 to TG4) and 3 TRs (TR1 to TR3) are used, as depicted in Figure 9. Data transfers are based on 40 packets of 500 flits per packet, with 16-bit flits. The XY, NFM and NFNM routing algorithms are used. The position of the nodes is chosen to ensure data transfers from right to left in order to compare the different routing algorithms (as NFNM uses the XY routing algorithm for left-to-right data transfers). The design flow immediately and automatically generates three emulation platforms. Each emulation platform contains its routing algorithm (and switches with the routing IP block, shown as R blocks), as depicted in Figure 9.

Figure 9. Emulation NoC platform generated by the design flow.

The first exploration is the comparison of the XY and NFNM routing algorithms. TG1 sends all its data to three TRs. Then TG2, TG3 and TG4 successively use the same scenario. The data injection rate is 25%. The total latencies are depicted in Figure 10. The routing algorithm selected does not affect the number of cycles required to transmit data when the traffic is small compared to the capacity of the communication architecture.

Figure 10. Total latency (number of cycles) for the TG1-3TRs scheme according to the position of the TR (with a 25% data injection rate).

The second exploration is the evaluation of the end-to-end latency for all TGs sending data to all TRs (Figure 9). The exploration is made with a 25% data injection rate.

191

The latency depends on the routing algorithm and on the position of the TRs. The results depicted in Figure 11 show that XY gives a lower latency for TR2 and TR3, while NFM is more efficient for TR1, for a 25% data injection rate. Both routing algorithms can be used.

Figure 11. End-to-end latency for three TRs with a 25% data injection rate for three routing algorithms.

The last exploration concerns the impact of the data injection rate. Figure 12 shows the end-to-end latency for two injection rates (50% and 75%). XY is the best adapted for sending data to TR1 with a 50% data injection rate, but the least adapted for sending data to TR3 with a 75% data injection rate. Depending on the position of the TRs and TGs (more precisely, on the number of hops required), no single routing algorithm always gives the best latency. These experiments highlight the need for an exploration-aid tool, as online emulations of all scenarios are required to evaluate the best timing performances and select the most appropriate routing algorithm. Such a tool will help designers to quickly build their emulation platforms and evaluate different scenarios.

Figure 12. End-to-end latency (number of cycles) for a 3-3 scheme according to the position of the TG and the data injection rate.

V. CONCLUSION
This paper presents a generic design flow for the automatic generation of NoC exploration platforms on FPGA. Based on an existing NoC structure, all required components are inserted in the design flow. Appropriate emulation blocks are developed and scenario traffic specifications are proposed to target a wide range of image and signal processing applications. The designer can easily generate and implement several emulation platforms and can explore the NoC structure in a short time. With the immediate generation of emulation platforms designed for design space exploration on FPGA, the designer can explore many architectural solutions and specify or modify the number and position of the initiators and receptors to extract the best timing performances according to the routing algorithms, the data injection rates and the position of initiators and receptors. The experiments show that the performances of the final system significantly depend on all these parameters.

REFERENCES
[1] B. M. Al-Hashimi, System-on-Chip: Next Generation Electronics, Circuits, Devices and Systems, 2006.
[2] L. Benini, "Application Specific NoC Design," in DATE, 2006.
[3] J. Chan and S. Parameswaran, "NoCGEN: A Template Based Reuse Methodology for Networks on Chip Architecture," in Proc. 17th Int. Conference on VLSI Design, pp. 717-720, 2004.
[4] N. Genko, D. Atienza, G. De Micheli et al., "A Complete Network-On-Chip Emulation Framework," in DATE, 2005.
[5] Y. E. Krasteva, F. Criado et al., "A Fast Emulation-based NoC Prototyping Framework," International Conference on Reconfigurable Computing and FPGAs, pp. 211-216, 2008.
[6] P. Liu, C. Xiang et al., "A NoC Emulation/Verification Framework," Sixth International Conference on Information Technology: New Generations, pp. 859-864, 2009.
[7] F. Moraes, N. Calazans and A. Mello, "HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks on Chip," Integration, the VLSI Journal, vol. 38, no. 1, Oct. 2004.
[8] C. A. Zeferino and A. A. Susin, "SoCIN: A Parametric and Scalable Network-on-Chip," in Proc. 16th Symposium on Integrated Circuits and Systems Design, pp. 169-174, 2003.
[9] C. Hilton and B. Nelson, "PNoC: a flexible circuit-switched NoC for FPGA-based systems," in Field Programmable Logic, Aug. 2005.
[10] E. Salminen et al., "HIBI Communication Network for System-on-Chip," Journal of VLSI Signal Processing Systems, vol. 43, issue 2-3, pp. 185-205, June 2006.
[11] U. Y. Ogras et al., "Communication Architecture Optimization: Making the Shortest Path Shorter in a Regular Networks-on-Chip," in DATE, 2006.
[12] OPNET, www.opnet.com
[13] J. Chan and S. Parameswaran, "NoCGEN: a template based reuse methodology for network on chip," VLSI Design, 2004.
[14] S. Mahadevan, K. Virk and J. Madsen, "ARTS: A SystemC-based framework for modelling multiprocessor systems-on-chip," Design Automation of Embedded Systems, 2006.
[15] J. Chan et al., "NoCGen: a template based reuse methodology for NoC architecture," in Proc. ICVLSI, 2004.
[16] Y. E. Krasteva, F. Criado et al., "A fast emulation-based NoC prototyping framework," in Reconfigurable Computing and FPGAs, pp. 211-216, 2008.
[17] A. Jalabert et al., "Xpipes Compiler: a tool for instantiating application specific networks on chip," in DATE, 2004.
[18] U. Y. Ogras et al., "Communication Architecture Optimization: Making the Shortest Path Shorter in Regular Networks-on-Chip," in DATE, 2006.
[19] OPNET, www.opnet.com
[20] J. Liang, S. Swaminathan and R. Tessier, "aSOC: A Scalable, Single-Chip Communications Architecture," in IEEE International Conference on Parallel Architectures and Compilation Techniques, pp. 37-46, Oct. 2000.

192

Arbitration and Routing Impact on NoC Design

Edson I. Moreno, Cesar A. M. Marcon, Ney L. V. Calazans, Fernando G. Moraes
Faculty of Informatics, PUCRS, Porto Alegre, Brazil
{edson.moreno, cesar.marcon, ney.calazans, fernando.moraes}@pucrs.br

Abstract - The increasing number of processing elements packed inside integrated circuits requires communication architectures such as Networks-on-Chip (NoCs) to deal with scalability, bandwidth and energy consumption goals. Many different NoC architectures have been proposed, and several experiments reveal that routing and arbitration schemes are key design features for NoC performance. Therefore, this work proposes a routing scheme called planned source routing, which is implemented in a NoC architecture with distributed arbitration called Hermes-SR. The paper compares Hermes-SR to the Hermes NoC that employs distinct arbitration and routing mechanisms and algorithms. One set of experiments enables to confront design time planned source routing and runtime distributed routing. Additionally, the paper presents the advantages of using deadlock free adaptive routing algorithms as basis for balancing the overall communication load in both routing mechanisms. Another experiment reveals the tradeoffs between using centralized or distributed arbitration. A last evaluation exposes the performance advantages of combining distributed arbiters with planned source routing. Results enforce that design time planned source routing tends to avoid NoC congestion and contributes for average latency reduction, while distributed arbitration optimizes NoC saturation figures.

.7! .28+94:38.92) ;<) -"'=$(-) *5(6$%>) '?) %"&(6$6%'"6) 05") 6$@$1'() ,($%) &"5&) 5(&#@5*) %A5) $/0@5/5(%&%$'() '?) &) 1'/0@5%5) 6>6%5/) '() &) 6$(-@5)*$5B)%A5)6')1&@@5*)C>6%5/D'(D&D3A$0)EC'3F7)C'36)%&"-5%) A$-A) 05"?'"/&(15) =$%A) 6/&@@) ?''%0"$(%) &(*) @'=) 5(5"->) 1'(D 6,/0%$'()=A5()1'/0&"5*)%')%A5)6&/5)6>6%5/)$/0@5/5(%5*)#>) &()5G,$H&@5(%) 65%)'?)1A$067) !) C'3)$6)1'/0'65*)#>)&)0'66$#@>) @&"-5) &/',(%) '?) 0"'1566$(-) 5@5/5(%6) EI<6FB) *'J5(6) '") 5H5() A,(*"5*6)$(%5"1'((51%5*)#>)&()'()1A$0)1'//,($1&%$'()&"1A$D %51%,"57)!6)%A5)1'/0@5K$%>)'?)&00@$1&%$'(6)?$%%$(-)$(6$*5)&)6$(D -@5)C'3)"&$656B)61&@&#$@$%>)&(*)?@5K$#$@$%>)&"5)&1A$5H5*)%A"',-A) %A5),65)'?)/,@%$0"'1566'")6>6%5/6)'()1A$0)ELIC'36FB)&)6051$&@) 1&65)'?)C'36)=A5"5)/'6%)'")&@@)I<6)&"5)0"'-"&//&#@5)0"'156D 6'"6)MNOB)$(1"5&6$(-)%A5)C'3)&"1A$%51%,"5)?@5K$#$@$%>7) C'3)&(*)LIC'3)*56$-(6)"5@>)'()%A5)/&66$H5)"5,65)'?)0"5D *56$-(5*) I<67) 8A5) 1'//,($1&%$'() &"1A$%51%,"5B) '() %A5) '%A5") A&(*) $6) 6051$?$1&@@>) #,$@%) %') ?,@?$@@) %A5) &00@$1&%$'() "5G,$"5D /5(%6B)/&P$(-)%A5)*56$-()'?)%A565)1'/0'(5(%6)1'//,($1&%$'() 15(%"$17) 8"&*$%$'(&@) 1'//,($1&%$'() &"1A$%51%,"56) 6,1A) &6) 6A&"5*) #,6656) &(*) %A'65) #&65*) '() *5*$1&%5*) 0'$(%) %') 0'$(%) $(%5"1'((51%$'(6) *') ('%) 61&@5) =5@@) =$%A) %A5) 5H5"D-"'=$(-) &/',(%) '?) 0&"&@@5@) *&%&) %"&(6/$66$'() MQO7) 45*$1&%5*) 0'$(%) %') 0'$(%)@$(P6)@5&*)%')1'//,($1&%$'()&"1A$%51%,"56)%A&%)&"5)*$??$D 1,@%) %') "5,65) &(*) 5(A&(15) $() 6,#65G,5(%) *56$-() "5H$6$'(67) R,6656)/&>)#51'/5)#'%%@5(51P6B)$(1"5&6$(-)@&%5(1>)&(*)0'=5") *$66$0&%$'(7)4560$%5)%A5)?&1%)%A&%)A$5"&"1A$1&@)#,6656)*')6,00'"%) 0&"&@@5@) 1'//,($1&%$'(6B) 61&@&#$@$%>) 6,??5"6) &(*) 1'(%5(%$'() $(1"5&656)=A5()1'//,($1&%$'()#5%=55()I<6)@'1&%5*)&%)*$??5"D

5(%)6$*56)'?)&)#"$*-5)$6)(55*5*7)2'36)&"5)1,""5(%@>)1'(6$*5"5*) &6)&)#5%%5")&00"'&1A)?'")5(A&(1$(-)61&@&#$@$%>)&(*)0'=5")*$66$D 0&%$'()5??$1$5(1>)MSO7) 8A5)*56$-()'?)2'3D#&65*)LIC'36)/,6%)%&P5)$(%')&11',(%) 65H5"&@)1'//,($1&%$'()&"1A$%51%,"5)&6051%6B)$(1@,*$(-)%'0'@'D ->B) #,??5") *$/5(6$'($(-B) ',%0,%) 65@51%$'() E$757) "',%$(-) &@-'D "$%A/6F)&(*)$(0,%)65@51%$'()E$757)&"#$%"&%$'()&@-'"$%A/6F7).()%A$6) 0&05"B)&)!"#)$6)&)1'//,($1&%$'()&"1A$%51%,"5)1'/0'65*)#>)&) 65%) '?) 6=$%1A$(-) 5@5/5(%6) 1&@@5*) $"%&'$() %A&%) 5/0@'>) 0&1P5%) 6=$%1A$(-) 1'//,($1&%$'(7) +',%5"6) $(%5"1'((51%5*) #>) @$(P6) ?'"/)%A5)2'3)%'0'@'->7)C'/5)'")&@@)"',%5"6)/&>)&@6')1'((51%) %')I<6)%A"',-A)&)(5%='"P)$(%5"?&15B) =A$1A)$6) ('%)$%65@?)1'(6$D *5"5*)0&"%)'?)%A5)2'37)8A5)#&6$1)?,(1%$'()'?)"',%5"6)$6)%')/'(D $%'") $(1'/$(-) 0&1P5%6) ?"'/) $%6) $(0,%) 0'"%6B) 65@51%) '(5) ',%0,%) 0'"%)&(*)?'"=&"*)0&1P5%6)%A"',-A)6'/5)$(%5"(&@)0&%A7)!"#$%"&D %$'() &(*) "',%$(-) '"1A56%"&%5) "',%5") $(%5"(&@) "56',"156) &11566T0"$'"$%>)&(*)*$"51%$'(6)*51$6$'(7) 8A5) "',%$(-) #5A&H$'") /&>) #5) 5$%A5") &) *5%5"/$($6%$1) '") &() &*&0%$H5) ?,(1%$'(7) 45%5"/$($6%$1) "',%$(-) *5?$(56) %A5) ',%0,%) 0'"%)%A&%)&)0&1P5%)=$@@)%&P5)#&65*)'()$%6)6',"15)&(*)*56%$(&%$'(B) $""56051%$H5) '?) %"&??$1) 1A&"&1%5"$6%$167) !() 5K&/0@5) '?) &) 2'3) &00@>$(-) ",(%$/5) *5%5"/$($6%$1) "',%$(-) $6) ;5"/56) MSO7) !*&0D %$H5) "',%$(-) &@@'=6) 1'(6$*5"$(-) /'"5) %A&() '(5) 1&(*$*&%5) ',%D 0,%) 0'"%) ?'") &) -$H5() $(0,%) 0'"%) &%) 5&1A) "',%5"B) =A$1A) 1&() #5) $/0'"%&(%)?'")05"?'"/&(15B)1'(-56%$'()1'(%"'@)&(*)?&,@%)%'@5"D &(15B) &6) *51$6$'(6) /&>) 1'(6$*5") %A5) $(6%&(%&(5',6) (5%='"P) 6%&%,6) MUO7) !() 5K&/0@5) '?) &) 2'3) &"1A$%51%,"5) 0"'0'6$(-) %A5) ,65)'?)",(%$/5)&*&0%$H5)"',%$(-)$6)4>!4)MVO7) C$(15)5&1A)"',%5")*5&@6)=$%A)65H5"&@)6$/,@%&(5',6)"5G,56%6) %') ?'"=&"*) 0&1P5%6B) &() &"#$%"&%$'() 6%"&%5->) $6) (51566&">7) 8=') 1A'$156) &"5) %') *5&@) =$%A) '(5) "5G,56%) &%) &) %$/5B) 1&@@5*))'*&$+, -./'01 +$2.&$+&."*B) '") %')*5&@)=$%A) &) 65%) '?) "5G,56%6) $() 0&"&@@5@B) 1&@@5*)0.(&$.2%&'01+$2.&$+&."*7)8A565)1A'$156)@5&*)%')%A5)-5(5"D $1)"',%5")&"1A$%51%,"56)*50$1%5*)$()W$-,"5)N7).()#'%A)&"#$%"&%$'() 6%"&%5-$56)1'/05%$%$'()?'")"',%5")"56',"156)/&>)'11,"7))

)
Figure 1 – Two generic NoC router architectures based on the arbitration choice: (a) centralized arbitration; (b) distributed arbitration.

35(%"&@$J5*)&"#$%"&%$'()0"'*,156)"',%5"6)=A$1A)&"5)6$/0@5"B) =A$@5)*$6%"$#,%5*)&"#$%"&%$'()%"&*56)$(1"5&65*)"',%5")1'/0@5K$%>)

193

?'") 5(A&(15*) 05"?'"/&(15) MXO7) 35(%"&@$J5*) &"#$%"&%$'() ,6,&@@>) $/0@$56) %A&%) %A5) "',%5") 1'(%&$(6) '(@>) '(5) 6$(-@5) "',%$(-) ,($%B) ?'") =A$1A) &@@) $(0,%) 0'"%6) 1'/05%57) !"#$%"&%$'() &(*) "',%$(-) *5?$(5)&)1'((51%$'()#5%=55()&()$(0,%)&(*)&()',%0,%)0'"%B)&?%5") =A$1A) %"&(6/$66$'() $() %A&%) 1'((51%$'() 6%&"%6) &(*) %A5) "',%$(-) ,($%) $6) "5@5&65*) %') 65"H5) '%A5") 05(*$(-) $(0,%) 0'"%) "5G,56%67) 4$6%"$#,%5*)&"#$%"&%$'(B)'()%A5)'%A5")A&(*B)$/0@$56)%A&%)1'/05D %$%$'() ?'") "56',"156) '11,"6) '(@>) &%) %A5) ',%0,%) 0'"%67) 8A$6) "5D G,$"56)6'/5) A&"*=&"5) ,($%)"50@$1&%$'()&%)%A5)$(0,%)&(*)',%0,%) 0'"%6) E"',%$(-) &(*) &"#$%5"6B) "56051%$H5@>FB) #,%) /&>) $(1"5&65) 05"?'"/&(15) *"&/&%$1&@@>7) 8A5) "56,@%6) $() %A$6) 0&05") 6,00'"%) %A$6)6%&%5/5(%7) 8A5)&,%'/&%5*)*56$-()'?)2'36)/&>)G,$1P@>)0"'*,15)&)-5D (5"$1) 1'//,($1&%$'() &"1A$%51%,"5) 6'@,%$'(7) ;'=5H5"B) 2'3TC'3) 6$@$1'() &"5&B) 0'=5") *$66$0&%$'() &(*) 05"?'"/&(15) /&>) #5) '0%$/$J5*) $?) &"1A$%51%,"5) 1'(?$-,"&%$'() &(*) ,6&-5) &"5) 0@&((5*)MYO7)8A$6)$6)56051$&@@>)%",5)=A5()%A5)&00@$1&%$'()1'/D /,($1&%$'() 0&%%5"(6) &"5) P('=() $() &*H&(157) 3'(6$*5"$(-) %A$6) 6$%,&%$'() &(*) "56%"$1%$(-) &%%5(%$'() %') 0&%A) 65@51%$'() &/'(-) 1'//,($1&%$(-) 0&$"6) ?'") 6',"15) "',%$(-) 2'36B) $%) $6) ,6,&@) %') 5/0@'>)#&(*=$*%A)@$/$%6)&6)&)1"$%5"$'()%')*5?$(5)&)65%)'?)1'/D /,($1&%$'() "',%56) MZO) M[O7) R&(*=$*%A) @$/$%6) &"5) &() 5??$1$5(%) =&>)%')60"5&*)%A5)'H5"&@@)@$(P)@'&*7);'=5H5"B),6$(-)'(@>)%A$6) 1"$%5"$'() /&>) "56,@%) $() 1'//,($1&%$'() @'&*6) %A&%) &"5) #&*@>) *$6%"$#,%5*)$()%$/5B)=A$1A)/&>)$()%,"()1'/0"'/$65)%A5)'H5"&@@) 2'3) 05"?'"/&(157) L'"5'H5"B) %A5) &*H&(%&-5) '?) *56$-() %$/5) 0&%A) 65@51%$'() =A5() 1'/0&"5*) %') ",(%$/5) *$6%"$#,%5*) "',%$(-) &@-'"$%A/6) $6) ('%) 1@5&"7) 8A$6) 0&05") "50'"%6) $6'@&%5*) &(*) \'$(%) 1'/0&"$6'(6) #5%=55() ("%$)') H5"6,6) 0.(&$.2%&'01 $"%&.*3) &(*) )'*&$+-./'0)H5"6,6)0.(&$.2%&'01+$2.&$+&."*)6%"&%5-$567) +5/5/#5")%A&%)/'6%B)$?)('%)&@@B)"',%$(-)&@-'"$%A/6),65*)$() 2'36)&"5)*5&*@'1P)?"55)#51&,65)%A5>)0"51@,*5)%A5),65)'?)6'/5) "',%56) ?"'/) 6',"15) %') *56%$(&%$'(7) 9?%5(B) %A5) \,6%$?$1&%$'() %') ,65)&*&0%$H5)"',%$(-)&@-'"$%A/6)$()0@&15)'?)*5%5"/$($6%$1)'(56) 1'/56)?"'/)%A5)1&0&1$%>)'?)%A5)?'"/5")%')&H'$*)1'(%5(%$'()#>) ,6$(-)&@%5"(&%5)"',%56)=A5()&)1'(?@$1%)'11,"6)&%)",(%$/57)8A$6) 0&05") H&@,56) &('%A5") $/0'"%&(%) 0"'05"%>) '?) &*&0%$H5) "',%$(-) &@-'"$%A/6B)=A$1A)$6)%A5)"$1A5")65%)'?)0'66$#@5)"',%56)&)0&1P5%) 1&(),65)%')-')?"'/)6',"15)%')*56%$(&%$'(7)8A$6)0"'05"%>)/&P56) 6',(*) 1'/#$($(-) 6',"15) &(*) &*&0%$H5) "',%$(-) /51A&($6/6) =A5()%"&??$1)0&%%5"(6)&"5)P('=()&%)*56$-()%$/57)8A5)"$1A5")65%) '?)"',%56)?&1$@$%&%56)'H5"&@@)@'&*)#&@&(1$(-)$()%A5)1'//,($1&D %$'()&"1A$%51%,"5B)=A$@5)*5&*@'1P)?"55*'/)-,&"&(%556)%A5)$(%5D -"$%>)'?)'05"&%$'()?'")%A5)1'//,($1&%$'()&"1A$%51%,"57) 8A5) "5/&$(*5") '?) %A$6) 0&05") $6) '"-&($J5*) &6) ?'@@'=67) C51D %$'()..)0"565(%6)%A5)0"'1566)&*'0%5*)?'")"',%5)/&00$(-)1',0@5*) %') %A5) ;5"/56DC+) 2'37) C51%$'() ...) *50$1%6) %A5) ;5"/56DC+) &"1A$%51%,"57)C51%$'().])*561"$#56)%A5)5K05"$/5(%&@) 65%,0)&(*) "56,@%6B) =A$@5) C51%$'() ]) *$60@&>6) &) 65%) '?) 1'(1@,6$'(6) &(*) *$"51%$'(6)?'")?,%,"5)='"P7) ..7! 
+9:8<)L!II.2^) 8A5)*5?$($%$'()'?)1'//,($1&%$'()"',%56)/&>)#5)-,$*5*)#>) *$??5"5(%) "5G,$"5/5(%6B) &$/$(-) &%) 0'=5") *$66$0&%$'(B) &"5&) '") 05"?'"/&(157) 8A$6) ='"P) 1'(6$*5"6) &6) P5>) "5G,$"5/5(%) 1'/D /,($1&%$'()05"?'"/&(15B)/5&6,"5*)#>)%A5)"5*,1%$'()'?)0'%5(D %$&@)1'(-56%$'(B)&(*)5H&@,&%5*)#>)&H5"&-5)0&1P5%)@&%5(1>7)R&6$D 1&@@>B) 1'(-56%$'() $6) *5%51%5*) =A5() %A5) &/',(%) '?) $(1'/$(-) *&%&)E'")"5G,56%)?'")$(1'/$(-)*&%&F)$6)@&"-5")%A&()%A5)',%-'$(-) *&%&) ?"'/) &) -$H5() 1'//,($1&%$'() 5@5/5(%) E57-7) &) "',%5"F7) !)

"5&6'() ?'") %A$6) $6) &) #&*) *$6%"$#,%$'() '?) 1'//,($1&%$'() ?@'=6B) =A$1A) /&>) $/0@>) 'H5"@'&*5*) 1A&((5@6B) 1&@@5*) 4"&(5"&(7) 8') &H'$*)A'%60'%6B)$%)$6)/&(*&%'">)%'_)E$F)'65-"$'1+-&'$*+&.7'15+&4() ?'")5&1A)1'//,($1&%$(-) /'*,@56B)E$$F))"82.*'15+&4(1"91)"8, 8%*.)+&.*315+.$() $() &) %"&??$1) 615(&"$') &(*) E$$$F) '7+-%+&') $"%&'1 8+55.*3(1?'")5&1A)%"&??$1)615(&"$'7) :;& <65-"$.*31:-&'$*+&.7'1=+&4(1 !) 0&%A) $6) *5?$(5*) A5"5) &6) %A5) 65G,5(15) '?) "',%5") ',%0,%) 0'"%6),65*)%')%"&(6/$%)0&1P5%6)?"'/)&)6',"15)%')&)*56%$(&%$'(7) 4505(*$(-)'()%A5)"',%$(-)&@-'"$%A/B)/'"5)%A&()'(5)&@%5"(&%$H5) 0&%A) /&>) 5K$6%) #5%=55() &) -$H5() 6',"15) &(*) *56%$(&%$'(7) 8A5) 5K0@'"&%$'() '?) &@%5"(&%$H5) 0&%A6) A&6) %') -,&"&(%55) %A&%) &%) @5&6%) '(5) 0&%A) 5K$6%6) &/'(-) 5&1A) 1'//,($1&%$(-) 0&$") &(*) %A&%) (') *5&*@'1P) =$@@) '11,") =A5() 0&%A6) &"5) 1'/#$(5*) $(%') %"&??$17) 8A5"5) &"5) %=') =&>6) %') '#%&$() *5&*@'1P) ?"55*'/_) E$F) %A"',-A) ?'"/&@) H5"$?$1&%$'() '") E$$F) %A"',-A) %A5) &*'0%$'() '?) *5&*@'1PD ?"55) "',%$(-) &@-'"$%A/6) &6) #&6$6) ?'") 0&%A) 1'/0,%&%$'(7) 8A5) 0"565(%)='"P),656)%A5)651'(*)&00"'&1A7).%)5/0@'>6)?',")*$??5"D 5(%) "',%$(-) &@-'"$%A/6B) I,"5) `a) E`aFB) &(*) %A5) %A"55) %,"() /'*5@) H&"$&%$'(6_) 25-&%$H5) W$"6%) E2WFB) b56%) W$"6%) EbWF) &(*) 2'"%A) c&6%) E2cF) MUO7) 8A5) ?$"6%) $6) *5%5"/$($6%$1B) =A$@5) %A5) "5D /&$($(-) &"5) &*&0%$H5) &@-'"$%A/6) $/0@5/5(%5*) $() %=') ?@&H'"6_) /$($/&@) E2WLB) bWLB) 2cLF) &(*) ('() /$($/&@) E2W2LB) bW2LB) 2c2LF7) W'") `a) "',%$(-B) 5K&1%@>) &) 0&%A) 5K$6%6) #5D %=55() %=') 1'//,($1&%$(-) 5(%$%$567) W'") /$($/&@) &*&0%$H5) &@-'"$%A/6B)%A5)(,/#5")'?)*$6%$(1%)0&%A6)E*+&$,#F)$6)5$%A5")N)'") $6) *5?$(5*) #>) <G,&%$'() ENFB) =A5"5) -) &(*) .) "50"565(%) %A5) *$6%&(15)$()A'06)&@'(-)%A5)1'""560'(*$(-)&K$6)E-)'").F)#5%=55() 6',"15)&(*)*56%$(&%$'(7) (- + . ) d ) ENF) *+&$,# = -d .d W'")('()/$($/&@)&*&0%$H5)&@-'"$%A/6B)*+&$,#)*505(*6)'() %A5)"5@&%$H5)6',"15)&(*)*56%$(&%$'()0'6$%$'(6)&(*)'()%A5)"',%$(-) &@-'"$%A/) ",@567) W'") 5K&/0@5B) ?'") %A5) ('() /$($/&@) (5-&%$H5) ?$"6%)&@-'"$%A/)E2W2LFB)$%)$6)0'66$#@5)%'),65)<G,&%$'()EQF7))

*+&$,# = /0

e -% = - #

e .% = . #

(-/0 + ./0 )d
-/0 d./0 d

EQF1

bA5() %A5) *56%$(&%$'() '?) %A5) 0&1P5%) $6) &%) &) 0'$(%) E-1B.1FB) =A$1A)$6)&#'H5)&(*)%')%A5)"$-A%)'?)%A5)6',"15)1''"*$(&%56)E-#B) .#FB) <G,&%$'() EQF) $6) H&@$*7) .() %A$6) <G,&%$'(B) %A5) 0&$") E-%B) .%F) "50"565(%6) %A5) 0'6$%$'() "56,@%$(-) ?"'/) %A5) *$60@&15/5(%) ?"'/) %A5)6',"15)%')%A5)/'6%)(5-&%$H5)0'6$%$'()$()%A5)(5%='"P7)!@6'B) -/0) &(*) ./0) "50"565(%) %A5) *$6%&(15) $() %A5) -) "5607) .) &K56) #5D %=55() 0'6$%$'() E-%B! .%F) &(*) %A5) *56%$(&%$'(B) $757) -/0)f)-%)g)-1) &(*) ./0)f).%)g).17)W'")%A5)'%A5")('()/$($/&@)&*&0%$H5)"',%$(-) &@-'"$%A/6)6$/$@&")5G,&%$'(6)&(*)1'(6$*5"&%$'(6)&00@>7) >;& #"82.*.*31=+&4(1"91#"88%*.)+&.*31=+.$(1 !)"',%5)/&00$(-)$6)*5?$(5*)=A5()?'")&@@)0&$"6)'?)1'//,($D 1&%$(-) /'*,@56B) '(5) &(*) '(@>) '(5) 0&%A) $6) 1A'65() ?'") 5&1A) 6',"15) &(*) *56%$(&%$'() /'*,@5) E"50"565(%5*) #>) %A5) $%5"&%$'() H&"$&#@5) 2F7) 8A5) &/',(%) '?) 0'66$#@5) "',%5) /&00$(-6) *505(*6) '() %A5) (,/#5") '?) 1'//,($1&%$(-) 0&$"6) &(*) %A5) "',%$(-) &@-'D "$%A/7)8A5)-"5&%5")%A5)(,/#5")'?)&@%5"(&%$H5)0&%A6)05")1'//,D ($1&%$(-)0&$"B)%A5)-"5&%5")%A5)(,/#5")'?)&1A$5H&#@5)"',%5)/&0D 0$(-67) <G,&%$'() ESF) *5?$(56) %A5) /&K$/,/) (,/#5") '?) "',%5)

194

/&00$(-6) E*3&442/5FB) =A5"5) 2) $*5(%$?$56) &) 1'//,($1&%$(-) 0&$"B) *+&2%#1 $6) %A5) %'%&@) (,/#5") '?) 1'//,($1&%$(-) 0&$"6) &(*) *+&$,#627) $6) %A5) (,/#5") '?) &@%5"(&%$H5) 0&%A6) ?'") 27) !6) &() 5KD &/0@5B) *3&442/5) $6) &@=&>6) 5G,&@) %') N) =A5() `a) "',%$(-) $6) &*'0%5*B)6$(15)%A5"5)$6)'(@>)&)0'66$#@5)0&%A)?'")5&1A)1'//,($D 1&%$(-)0&$"7)

*3&442/5 =

*+&2%# 2 =N

*+&$,#E2 F )

ESF1

#;& <7+-%+&."*1"91?"%&'1@+55.*3(1 8A5)5H&@,&%$'()'?)"',%5)/&00$(-6)$6)#&65*)'(_)E$F)%A5))"8, 8%*.)+&."*1 )4+$+)&'$.(&.)) %A&%) $6) /'*5@5*) #>) &@@) 1'//,($1&D %$'(6)'?)%A5)&00@$1&%$'()$()%5"/6)'?)6',"15)/'*,@5B)%&"-5%)/'*D ,@5) &(*) %"&(6/$66$'() "&%5B) E$$F) %A5) +-&'$*+&.7'1 5+&4() ?'") 5&1A) 1'//,($1&%$(-)0&$")%A&%)$6)/'*5@5*)#>)&)-"&0A)1'(%&$($(-)&@@) 0'66$#@5) 0&%A6) ?"'/) 5&1A) 6',"15) %') 5&1A) *56%$(&%$'() /'*,@5) &(*)E$$$F)%A5)65@51%5*))"(&19%*)&."*B)=A$1A)1'(6$*5"6_)%A5)&H5"D &-5)"&%5)'?)0&%A)'11,0&(1>B)%A5)05&P),6&-5)'?)%A5)0&%A)&(*)%A5) 0&%A) @5(-%A7) C/&@@5") H&@,56) ?'") %A565) &6051%6) @5&*) %') #5%%5") 0&%A67) .($%$&@@>B) &) H&@$*) 0&%A) $6) "&(*'/@>) &66$-(5*) %') 5&1A) 1'/D /,($1&%$(-) 0&$"7) .() %A$6) 6%50B) (') &**$%$'(&@) 1&"5) $6) %&P5() ?'") 0&%A) #$(*$(-7) 8A5) '(@>) -,&"&(%55) '??5"5*) #>) %A$6) &66$-(/5(%) 0"'1566) $6) %A5) 5K$6%5(15) '?) %A5) 0&%A) '() %A5) @$6%) '?) &@%5"(&%$H5) 0&%A6)?'")%A5)-$H5()1'//,($1&%$(-)0&$"7)!?%5")&)0&%A)A&6)#55() &66$-(5*)%')5&1A)1'//,($1&%$(-)0&$"B)2'3)'11,0&%$'()$6)56%$D /&%5*)#>)&11,/,@&%$(-)%A5)%"&(6/$66$'()"&%5)'?)5&1A)1'//,D ($1&%$(-)0&$"7)8A5)(5K%)6%506)655P)%')'0%$/$J5)%A$6)$($%$&@)"',%5) /&00$(-7) 8A5) "',%5) /&00$(-) '0%$/$J&%$'() $6) 1&""$5*) #>) H&">$(-) %A5) 0&%A6)'?)5&1A)1'//,($1&%$(-)0&$"B)%">$(-)&@@)&@%5"(&%$H5)0&%A67) bA5()&)1'//,($1&%$(-)0&$")$6)#5$(-)5H&@,&%5*B)%A5)"5/&$($(-) 0&$"6)A&H5)%A5$")0&%A)?$K5*7) !)(5=)0&%A)?'")&)1'//,($1&%$(-)0&$")$6)&66,/5*)$?B)1'/D 0&"5*)%')%A5)1,""5(%)"',%5)/&00$(-_)E$F)%A5)&H5"&-5)"&%5)'?)0&%A) '11,0&(1>)$6)@'=5")&(*)$%6)05&P),6&-5)$6)@'=5")'")5G,&@B)'")E$$F) %A5)&H5"&-5)"&%5)'?)0&%A)'11,0&(1>)$6)5G,&@)&(*)%A5)05&P),6&-5) $6) @'=5"B) '") E$$$F) %A5) &H5"&-5) "&%5) '?) 0&%A) '11,0&(1>) $6) 5G,&@B) %A5)05&P),6&-5)$6)5G,&@)&(*)%A5)0&%A)@5(-%A)$6)6A'"%5"7)8A5)?$"6%) &00"'&1A)-,&"&(%556)&)#5%%5")*$6%"$#,%$'()'?)2'3)1'//,($1&%D $(-) ?@'=6B)5G,&@$J$(-)%A5)1'//,($1&%$'()1A&((5@)'11,0&%$'(7) 8A5)651'(*)&00"'&1A)-,&"&(%556)%A&%)$?)&)@'=)1'(-56%$'()J'(5) 1&(('%) #5) ?',(*) &%) @5&6%) A'%60'%6) &"5) &H'$*5*B) #"$(-$(-) 05&P) ,6&-5)*'=(7)W$(&@@>B)=A5(),6$(-)&)('()/$($/&@)"',%$(-)&@-'D "$%A/B)$?)%A5)6&/5)&H5"&-5)"&%5)'?)0&%A)'11,0&(1>)1&()#5)?',(*) $() &) 6A'"%5") 0&%AB) %A5() @'=5") 0'=5") *$66$0&%$'() /&>) #5) 'H5"D @''P5*7) 8A5) 0"'1566) '?) "',%5) /&00$(-) '0%$/$J&%$'() ?$($6A56) $() %A"55) 0'66$#@5) 6$%,&%$'(6_) E$F) =A5() %A5"5) $6) '(@>) '(5) 0'66$#@5) "',%5)/&00$(-)E57)-7)=A5(),6$(-)`a)"',%$(-FB)E$$F)&?%5")"5&1AD $(-)&)-$H5()(,/#5")'?)%"$56B)=A$1A)$6)0&"&/5%5"$J&#@5B)&(*)E$$$F) =A5() (') ?,"%A5") '0%$/$J&%$'() 1&() #5) '#%&$(5*) &?%5") &@@) 1'/D /,($1&%$(-)0&$"6)A&H5)#55()5H&@,&%5*)&%)@5&6%)'(157)8A5)"56,@%D $(-)"',%5)/&00$(-)$6)1&@@5*)48&//91(#:;%'9(%:;$2/5)EIC+F7) ...7! 8;<);<+L<CDC+)!+3;.8<38:+<) R5?'"5)*561"$#$(-);5"/56DC+B)$%)$6) (51566&">)%')&00"'&1A) %=')'%A5")2'3)&"1A$%51%,"56),65*)A5"5)&6)&)#&6$6)?'")1'/0&"$D 6'(6B);5"/56)&(*);5"/56DL7);5"/56)&(*);5"/56DL)&"5)Q4D

/56A) %'0'@'->) 2'36) =$%A) "',%5"6) ,6$(-) 1"5*$%) #&65*) ?@'=) 1'(%"'@B)$(0,%)#,??5"$(-B) ='"/A'@5)0&1P5%)6=$%1A$(-)&(*)15(D %"&@$J5*) &"#$%"&%$'() E"',(*D"'#$() &@-'"$%A/F7) ;'=5H5"B) =A$@5) %A5)"',%$(-)61A5/5)'?);5"/56)$6)*$6%"$#,%5*B);5"/56DL),656)&) 0@&((5*)6',"15)"',%$(-)61A5/57) ;5"/56DC+) *$??5"6) ?"'/) ;5"/56DL) #51&,65) $%) 5/0@'>6) &) 0.(&$.2%&'01 +$2.&$+&."*) 61A5/5) =$%A) &) ?$"6%) 1'/5D?$"6%) 65"H5*) EW3WCF)&@-'"$%A/B)=A$1A)-,&"&(%556)$(D'"*5")0&1P5%)65"H$1$(-7) 4505(*$(-) '() %A5) I<) /&00$(-B) '() %A5) 6051$?$1) E&*&0%$H5F) "',%$(-) &@-'"$%A/) &(*) '() 6'/5) &00@$1&%$'() 1A&"&1%5"$6%$16B) "',%$(-) /&>) 'H5"@'&*) 65H5"&@) (5%='"P) "5-$'(6B) $/0@>$(-) &) *51"5&65)'()%A5)1'//,($1&%$'()&"1A$%51%,"5)5??$1$5(1>B)*,5)%') %A5)$(1"5&65)'?)0&1P5%)@&%5(1$567) 8A5)0"5*5?$($%$'()'?)0&%A6)6,00'"%5*)#>);5"/56DC+)6',"15) "',%$(-)-,&"&(%556)&)P('=()='"6%)1&65)?'")@$(P)@'&*67)!@6'B)$%) /&>) A5@0) '0%$/$J$(-) 2'3) &"5&) %A"',-A) %A5) 5@$/$(&%$'() '?) ,(,65*)@$(P6)&(*)#,??5")*$/5(6$'($(-B)#'%A)'?)=A$1A)&"5)',%D 6$*5)%A5)61'05)'?)%A$6)='"P7) !@@) 2'36) $() %A$6) ='"P) &66,/5) &) 6$/0@5) 0&1P5%) 6%",1%,"5B) 1'/0'65*) #>) &) A5&*5") 1'(%&$($(-) *56%$(&%$'() &(*) 6$J5) $(?'"D /&%$'() &(*) &) 0&>@'&*7) !@@) %A"55) 2'36) 6,00'"%) &"#$%"&">) ?@$%) 6$J56B) &@%A',-A) &@@) 5K05"$/5(%6) A5"5) ,65) '(@>NXD#$%) ?@$%67) 4$?D ?5"5(%)2'3)0&1P5%6)6@$-A%@>)*$??5")$()%A5$")A5&*5")6%",1%,"57).() ;5"/56B) %A5) QD?@$%) 0&1P5%) A5&*5") 6%'"56) %A5) "',%5") *56%$(&%$'() &**"566)&6)?$"6%)?@$%B)?'@@'=5*)#>)&)?@$%)=$%A)%A5)6$J5)'?)%A5)0&>@D '&*7) .() ;5"/56DC+) &(*) ;5"/56DLB) %A5) A5&*5") 6%&"%6) =$%A) &() $() '"*5") 65G,5(15) '?) ',%0,%) 0'"%6) (51566&">) %') &""$H5) &%) %A5) *56%$(&%$'(B)?'@@'=5*)#>)%A5)6$(-@5D?@$%)0&>@'&*)6$J57)bA$@5)%A5) ;5"/56)A5&*5")'11,0$56)5K&1%@>)%=')?@$%6B)%A5)'%A5")%=')2'36h) H&"$&#@5) 6$J5) A5&*5"6) 1'/0"$65) &%) @5&6%) %A"55) ?@$%6B) *,5) %') %A5) ,65)'?)&()&**$%$'(&@)?@$%),65*)&6)"',%5)%5"/$(&%'")?@&-7) 3'(15"($(-) &"#$%"&%$'() $() ;5"/56DC+B) 5&1A) "',%5") $(0,%) 0'"%) *$"51%@>) ('%$?$56) %A5) *56$"5*) ',%0,%) 0'"%) %') %"&(6/$%) &) 0&1P5%7) .@@,6%"&%5*) $() W$-,"5) NE#FB) %A$6) &00"'&1A) 5(&#@56) %') 65"H5)/,@%$0@5)"5G,56%6)%')*$6%$(1%)0'"%6)$()0&"&@@5@7)8"&(6/$6D 6$'() "5G,56%6) &"5) 6%'"5*) &(*) 65"H5*) $() &""$H&@) '"*5") #>) 5&1A) ',%0,%)0'"%7)W$-,"5)NE&F)$@@,6%"&%56)%A5)'%A5")&00"'&1AB),65*)$() ;5"/56) &(*) ;5"/56DLB) =A$1A) 5/0@'>6) 15(%"&@$J5*) "',(*D "'#$() &"#$%"&%$'(7) ;5"5B) 5&1A) $(0,%) 0'"%) "5G,56%6) "',%$(-) ?'") &) 1'(%"'@),($%)&(*)=&$%6)5$%A5")?'")&()',%0,%)0'"%)&66$-(/5(%)'") ?'") &) *5($&@) '?) &66$-(/5(%B) $?) %A5) "5G,56%5*) 0'"%) $6) &@"5&*>) #,6>7).()&(>)1&65B)$?)&"#$%"&%$'()65"H56)&)0'"%)$%)@'656)$%6)0"$'"$D %>7) R51&,65) '?) %A$6B) $(0,%) 0'"%6) /&>) 6,??5") ?"'/) 6%&"H&%$'(B) *505(*$(-)'()%A5)(5%='"P)@'&*)&(*)%A5)&/',(%)'?)1'/05%$%$'() &/'(-)1'//,($1&%$'()?@'=67) .]7! <`I<+.L<28!c)C<8:I)!24)+<C:c8C) ;5"/56DC+B) ;5"/56) &(*) ;5"/56DL) =5"5) *561"$#5*) $() 6>(%A56$J&#@5)+8c)];4c)=$%A)?$K5*)*$/5(6$'($(-)EVKVFB)?@$%) 6$J5)ENX)#$%6F)&(*)&)VeL;J)'05"&%$(-)?"5G,5(1>B)"56,@%$(-)$()&) #&(*=$*%A) '?) ZeeL#06) 05") @$(P7) 8A5) 5K05"$/5(%6) 5/0@'>5*) H&"$',6)"',%$(-)&@-'"$%A/)&(*)#,??5")6$J5)1'/#$(&%$'(67) C5H5"&@)6>(%A5%$1)&(*)"5&@)%"&??$1)0&%%5"(6)&@@'=5*)5H&@,&%D $(-) &"#$%"&%$'() &(*) "',%$(-) 61A5/567) bA$@5) "5&@) %"&??$1) 615(&D "$'6) &@@'=) &66566$(-) %A5) #5A&H$'") '?) 6051$?$1) &00@$1&%$'(6B) 6>(%A5%$1) %"&??$1) 615(&"$'6) 5(&#@5) %') 5K0@'"5) %A5) @$/$%6) '?) %A5) 2'36B)6,1A)&6)6&%,"&%$'()&(*)#5A&H$'"),(*5")1'(-56%$'(7) !) 65%) '?) 
%5K%) ?$@56) *561"$#5) 5&1A) %"&??$1) 0&%%5"() &6) &) 65%) '?) 0&1P5%6)EA5&*5")i)0&>@'&*F)&(*)%A5).0'+-1.*A')&."*)8"8'*&1?'") 5&1A)0&1P5%7)8A$6)$6)&()$(%5-5")$(?'"/$(-)%A5)(,/#5")'?)1@'1P)

195

1>1@56) &?%5") 6$/,@&%$'() 6%&"%7) .(\51%$'() 6',"156) &"5) $/0@5D /5(%5*) #>) 1>1@5D) &(*) 0$(D&11,"&%5) C>6%5/3) $(0,%) /'*,@56B) "560'(6$#@5) ?'") $(%5"0"5%$(-) %"&??$1) ?$@56) &(*) $(\51%$(-) 0&1P5%6) $(%') %A5) 2'37) C>6%5/3) ',%0,%) /'*,@56) &#6'"#) 0&1P5%6) ?"'/) %A5)2'3)',%0,%6B)6%'"$(-)%A5$")1'(%5(%6)&(*)&""$H&@)/'/5(%)?'") 6%&%$6%$1&@)5H&@,&%$'(7) c&%5(1>) H&@,56) 0"565(%5*) A5"5) &"5) ('%) @$/$%5*) %') %A5) 2'3) %"&(6/$66$'()*5@&>7)W$-,"5)Q)*$??5"5(%$&%56)%"&(6/$66$'()@&%5(D 1$56)#&65*)'()$(\51%$'()&(*)"5150%$'()*$6%"$#,%$'(67)!)5-+**'01 .*A')&."*) *$6%"$#,%$'() $6) *5?$(5*) $() %A5) %"&??$1) 615(&"$'6) %5K%) ?$@56B)&(*)*50$1%6)%A5)$*5&@)$(\51%$'()/'/5(%)?'")5&1A)0&1P5%)27) 8A5) +))"85-.(4'01 .*A')&."*) *$6%"$#,%$'() 1'(6$*5"6) %A5) &1%,&@) 0&1P5%)$(65"%$'()/'/5(%)$(%')%A5)2'3B)=A$1A)1&()#5)*5@&>5*) #>)1'(%5(%$'()&%)%A5)0&1P5%)6',"157)8A5).0'+-1$')'5&."*)*$6%"$D #,%$'() "50"565(%6) %A5) 5K051%5*) *5@$H5">) /'/5(%6) '?) 0&1P5%6B) %&P$(-) (5%='"P) 6%&%,6) $(%') &11',(%) '") ('%7) 8A5) +))"85-.(4'01 $')'5&."*)*$6%"$#,%$'()"50"565(%6)%A5)"5&@)/'/5(%)=A5"5)0&1PD 5%6) &"5) *5@$H5"5*) %') %A5$") *56%$(&%$'(7) 2') 1'(%5(%$'() &%) %A5) *56%$(&%$'()$6)1'(6$*5"5*7)

8A5)?$"6%)5K05"$/5(%)EW$-,"5)SF)&66,/5*)&)%"&??$1)615(&"$') =A5"5) 6',"156) $(\51%) Sej) '?) 1A&((5@) #&(*=$*%A) 1&0&1$%>7) .%) /&$(@>) 1'/0&"56) ;5"/56) ",(($(-) 4+) &(*) ;5"/56DL) ,6$(-) IC+) =$%A) *$??5"5(%) "',%$(-) &@-'"$%A/67) R'%A) &"1A$%51%,"56) 5/0@'>)&)15(%"&@$J5*)&"#$%"&%$'()61A5/5)E"',(*D"'#$(F7)


Figure 3 – Latency results obtained when comparing distributed routing (DR) versus planned source routing (PSR) approaches.

Figure 2 – Communication latency types.

!)A>0'%A5%$1&@)*$6%"$#,%$'()'?)6,1A)$(\51%$'()&(*)"5150%$'() 615(&"$'6) $6) $@@,6%"&%5*) $() W$-,"5) Q7) B0'+-1 -+&'*)C) $6) %A5) /$($D /,/)(,/#5")'?)1>1@56)&)0&1P5%)(55*6)%')"5&1A)$%6)*56%$(&%$'(7) 8A$6)$6)#&65*)'()%A5)*$??5"5(15)'?)%A5)$*5&@)$(\51%$'() /'/5(%) &(*) %A5) 5K051%5*) *5@$H5">) /'/5(%7) !'&D"$E) -+&'*)C) $6) %A5) *5@&>) H5"$?$5*) #>) %A5) 0&1P5%) *,"$(-) $%6) %"&??$1) ?"'/) 6',"15) %') *56%$(&%$'(B)=A$1A)/&>)#5)$(?@,5(15*)#>)1'/05%$%$'()?'")2'3) "56',"156)E57-7)@$(P6B)#,??5"6B)&"#$%"&%$'(B)"',%$(-F7):55-.)+&."*1 -+&'*)C)('"/&@@>)#"$(-6)%A5)/'6%)$/0'"%&(%)$/0&1%)'()%A5)$*5&@) 1'//,($1&%$'() 05"?'"/&(157) 8A$6) $6) 1'/0,%5*) &6) %A5) *$??5"D 5(15)#5%=55()%A5)$*5&@)$(\51%$'()/'/5(%)'?)0&1P5%6)&(*)%A5$") 5??51%$H5) *5@$H5">) /'/5(%) &%) %A5) *56%$(&%$'(7) !00@$1&%$'() @&%5(1>)$6)%A5)H&@,5)&66,/5*)?'")1'/0&"$6'()$()%A5)(5K%)5K05D "$/5(%67) :;& <7+-%+&.*31='$9"$8+*)'1%*0'$1:--1&"1:--1F$+99.)1=+&&'$*1 8=') 1@&6656) '?) 5K05"$/5(%6) =5"5) 05"?'"/5*) %') 5H&@,&%5) 05"?'"/&(15)'0%$/$J&%$'(7)8A5)?$"6%)1@&66)1'/0&"56)%A5),6&-5) '?)IC+)&-&$(6%)*$6%"$#,%5*)"',%$(-)E4+F)=$%A)%A5)"56,@%6)6,/D /&"$J5*) $() W$-,"5) S7) 8A5) 651'(*) 1@&66) ?'1,6) '() 15(%"&@$J5*) H5"6,6)*$6%"$#,%5*)&"#$%"&%$'(B)=$%A)"56,@%6)*50$1%5*)$()W$-,"5)U7) 8A5) 1'/0&"$6'() H&@,56) =5"5) 1&0%,"5*) ?"'/) ?$H5) *$6%$(1%) %"&??$1) 615(&"$'6B) =$%A) $(\51%$'() "&%56) NejB) QejB) SejB) Uej) &(*)Vej)'?)%A5)1A&((5@)#&(*=$*%A)1&0&1$%>B)1'""560'(*$(-)%') &#6'@,%5) "&%56) '?) "56051%$H5@>) ZeL#06B) NXeL#06B) QUeL#06B) SQeL#06) &(*) UeeL#067) 8A5) %5/0'"&@) *$6%"$#,%$'() '?) 0&1P5%) $(\51%$'() $6) ,($?'"/7) I&1P5%6) ?'") ;5"/56) A&H5) Qe) ?@$%6B) =A$@5) ?'") ;5"/56DC+) &(*) ;5"/56DL) 2'36B) %A5) 6$J5) H&"$56) &"',(*) %A$6)H&@,5B)*505(*$(-)'()%A5)&/',(%)'?)A'06)%')"5&1A)%A5)*56D %$(&%$'(7)8A5)60&%$&@)*$6%"$#,%$'()$6)'(5)%')&@@)?"'/)5&1A)$(\51D %$'()6',"15B)0"'*,1$(-)&()&@@)%')&@@)2'3)%"&??$1)0&%%5"(7)

W$-,"5)S)*50$1%6)%A&%)&@@)IC+)@5*)%')$(1"5&65*)@&%5(1>)=A5() 1'/0&"5*) %') 4+) ?'") 2W2L) &(*) 2WLB) 5K150%) ?'") '(5) 1&65B) =A$1A) "56,@%5*) $() &) 6/&@@) -&$() EN7NQj) D) &"',(*) SVe) 1@'1P) 1>1@56) ?&6%5")$()&H5"&-5F7);'=5H5"B)?'")%A5)"5/&$($(-)"',%$(-) &@-'"$%A/6) @&%5(1$56) =5"5) "5*,15*) $() &@@) 1&656) ?'") IC+) =A5() 1'/0&"5*)%')4+7) 8A5) #5A&H$'") '?) bW) &(*) 2c) A&6) %A"55) 5K0@&(&%$'(67) 8A5) ?$"6%)$6)%A5)*5-"55)'?) ?"55*'/)0"'H$*5*)#>)bW)&(*)2c) =A5() 1'/0&"5*) %') 2W) ?'") "',%5) /&00$(-) 5K0@'"&%$'(7) 8A$6) 1&() #5) ?'"/&@@>) *5/'(6%"&%5*B) #,%) $(%,$%$H5@>) 605&P$(-) bW) &(*) 2c) *5%5"/$(5)&)6$(-@5)*$"51%$'()%A&%)/,6%)#5)5/0@'>5*)&%)%A5)6%&"%) EbWF)'")5(*)E2cF)'?)%A5)"',%$(-)0"'1566B)=A$@5)2W)*5%5"/$(56) %A&%)'(@>)%=')*$"51%$'(6)E%A5)(5-&%$H5)'(56F)1&()#5)%&P5()&%)%A5) 6%&"%) '?) %A5) "',%$(-) 0"'15667) 8A5) 651'(*) 5K0@&(&%$'() $6) %A5) -@'#&@) P('=@5*-5) '?) 1A&((5@6) @'&*) =A5() &*'0%$(-) 0@&((5*) "',%$(-7)8A5)%A$"*)$6)%A5)#&*)*51$6$'()%A&%)4+)/&>)%&P5B)6$(15) \,*-/5(%6) &"5) /&*5) #&65*) '() @'1&@@>) &H&$@&#@5) $(?'"/&%$'() '(@>)%')"56'@H5)1'(-56%$'(B)0'66$#@>)*5H$&%$(-)0&1P5%6)%')'%A5") 1'(-56%5*)"5-$'(67) 4,"$(-) 6$/,@&%$'(B) 4+) &@=&>6) &1A$5H5*) @'=5") @&%5(1$56) =A5()1'/0&"$(-)bW2L)%')2c2L)&(*)bWL)%')2cL7);'=D 5H5"B) =A5() ,6$(-)IC+)@&%5(1$56)&"5)&@=&>6)@'=5") =A5()1'/D 0&"$(-) 2c2L) %') bW2L) &(*) 2cL) %') bWL7) 8A565) "56,@%6) 6A'=)%A&%)%A5)1A'$15)'?)"',%$(-)&@-'"$%A/)6%"'(-@>)*505(*6)'() %A5)1A'$15)'?)"',%$(-)6%"&%5->7) W$-,"5) U) 6A'=6) &) 651'(*) 5H&@,&%$'() 5K05"$/5(%) %A&%) &6D 6,/56) %"&??$1) 615(&"$'6) =$%A) 0&1P5%) $(\51%$'() "&%56) H&">$(-) ?"'/) Nej) %') Vej7)8A$6) 5H&@,&%$'() 5665(%$&@@>) 1'/0&"56) 15(D %"&@$J5*) E;5"/56) '") ;5"/56DLF) &(*) *$6%"$#,%5*) E;5"/56DC+F) &"#$%"&%$'()61A5/567) +',%56)=5"5)*5?$(5*)=$%A)%A5)`a)&@-'"$%A/B)-,&"&(%55$(-) %A5)6&/5)0&1P5%)*$6%"$#,%$'() ?'")&@@)2'367)2')6$-($?$1&(%)*$?D ?5"5(15)$6)'#65"H5*),0)%')Qej)'?)$(\51%$'()"&%57)!%)%A5)Sej)'?) $(\51%$'()"&%5)&(*)&#'H5B)$%)$6)('%$15&#@5)%A&%)*$6%"$#,%5*)&"#$D

196

*)9'(#:!,B:)4(';5!

%"&%$'() 1&() "5*,15) %A5) "',%5") 1'(%"'@) 1'(-56%$'(B) 6$(15) @&%5(D 1$56)&"5)6$-($?$1&(%@>)"5*,15*)=A5()1'/0&"5*)%')&)15(%"&@$J5*) &00"'&1A7) !**$%$'(&@@>B) $%) $6) '#65"H&#@5) %A&%) %A5) #$--5") %A5) #,??5")6$J56B)%A5)@'=5")%A5)&H5"&-5)@&%5(1>B)$()&@@)1&6567);'=D 5H5"B) $%) 1&() &@6') #5) '#65"H5*) %A&%) &%) 5&1A) $(\51%$'() "&%5B) %A5) @'=56%) &H5"&-5) @&%5(1>) '#%&$(5*) =A5() 15(%"&@$J5*) &"#$%"&%$'() 5/0@'>6)%A5)#$--56%)#,??5")6$J5)ESQD?@$%)#,??5"FB)$6)-"5&%5")%A&() %A5) &H5"&-5) @&%5(1>) '#%&$(5*) =$%A) %A5) 6&/5) $(\51%$'() "&%5) '?) *$6%"$#,%5*) &"#$%"&%$'() ,6$(-) %A5) 6/&@@56%) #,??5") 6$J5) EUD?@$%) #,??5"FB)5K150%)&%)&)Sej)$(\51%$'()"&%57)8A$6)1&65)$6)('%)"5&@@>) "5@5H&(%) #51&,65) %A5) *$??5"5(15) $6) 6@$-A%) E&"',(*) Ne) 1@'1P) 1>1@56)$()&H5"&-5)g)@566)%A&()Nej)'?)*$??5"5(15F7)
:$&3()2%H$5!A*!Q%*3(%.,3$5!G(.%3()3%/&)
;ST!G2+/(%3#-<)

").2$!J!K7/:!)A$()+$!2)3$&'%$*!'/-1)(%*/&!;%&!'2/'4!'='2$*<?!

1)2! ! *)9'(#:!6&;%5%! A4=('4,'()#!6&;%5% DE! 1O! 1O7 RO! RO7! 1S 1S7!

3%45%6! <(6'4(=9'%<!

3%45%6.7!

3%45%6.8*

6)94&%!>?8*@!

&%#'4,B(C%<! <(6'4(=9'%<! /FGHIJKHL! /FGHIJKHL MGJILKNL PGPLIKFL! IGIIPKNI! FG/IMKII! //GHM/KIL! QGJPHKMF! MGHFPK//! /GMNIKFJ /GLNJKQL PGLJLKJM FGPIIKJJ QGHFHK/H QG/HPKNN FMQKQN FLJKHL /GHNFKIP QGPMQKQI JFIK/M JPPKHL

) ;)<!

!"#$%&'()"*+

JUV!

,(-$%(./$"*+

!"#$%&'()"*+

NUV!

,(-$%(./$"*

;.<

) )

!"#$%&'()"*+

OUV!

,(-$%(./$"*

;'<!

) bA5() 1'/0&"$(-) ;5"/56) &(*) ;5"/56DL) 1'@,/(6) '?) 8&D #@5) N) g) &4'1 (+8'1 +$2.&$+&."*1 +*01 0.99'$'*&1 $"%&.*31 ()4'8'() g) ?'") &@@) "',%$(-) &@-'"$%A/6) E5K150%) `aF) %A5) 5??51%$H5(566) '?) IC+) 6%&(*6) ',%7) 8A5) 5K05"$/5(%) 6A'=5*) @&%5(1>) "5*,1%$'(6) ?"'/)QU7[Sj)E2cF),0)%')YS7ZYj)E2cLF)$()%A565)1&6567) bA5() 1'/0&"$(-) ;5"/56DL) &(*) ;5"/56DC+) 1'@,/(6) '?) 8&#@5)N)g)&4'1(+8'1$"%&.*31+*010.99'$'*&1+$2.&$+&."*1()4'8'()g) %A5)5??51%$H5(566)'?)*$6%"$#,%5*)&"#$%"&%$'()$6)A$-A@$-A%5*7)8A5) 5K05"$/5(%)6A'=5*)&()&H5"&-5)@&%5(1>)"5*,1%$'()'?)VS7Q[j7) !**$%$'(&@@>B) $%) $6) ('%$15&#@5) %A5) #5(5?$%) '?) 1'/#$($(-) 0@&((5*) 6',"15) "',%$(-) &(*) *$6%"$#,%5*) &"#$%"&%$'(B) &6) 6,0D 0'"%5*) #>) ;5"/56DC+7) R>) 1'/0&"$(-) %A5) ?$"6%) &(*) @&6%) 1'@D ,/(6B) %A5) &H5"&-5) @&%5(1>) $6) ,0) %') NN) %$/56) 6/&@@5") E2cLF) &(*)%A5)&H5"&-5)@&%5(1>)"5*,1%$'()?'")&@@)1&656)$6)Ye7Qej7) W$-,"5)V)$@@,6%"&%56)%A5)@$(P)='"P@'&*)56%$/&%$'()=A5()*$?D ?5"5(%)"',%$(-)&@-'"$%A/6)&"5),65*)&6)#&6$6)?'")"',%5)/&00$(-6) $() ;5"/56DC+7) 8A5) `a) 0$1%,"5) 0"565(%6) %=') 05&P6) '?) @'&*) 1'(15(%"&%$'(B) "56,@%$(-) $() A$-A) 1'/05%$%$'() &(*) 1'(65G,5(%) 05"?'"/&(15) *5-"&*&%$'(7) 2WB) 2WLB) 2c) &(*) 2cL) &"5) %A5) /'6%) 6,$%&#@5) &@-'"$%A/6) %') *$6%"$#,%5) %A5) ='"P@'&*) $(%') %A5) 2'3)@$(P6B)?'@@'=5*)#>)bW)&(*)bWL7)

!"#$%&'()"*+

WUV!

,(-$%(./$"*+

!"#$%&'()"*+

XUV!

,(-$%(./$"*

I%+,($!W!K!GA$()+$!2)3$&'%$*!0/(!'$&3()2%H$5!;C$(-$*!)&5!C$(-$*9@<!)&5! 5%*3(%.,3$5!)(.%3()3%/&!;C$(-$*9DE<?!Y&Z$'3%/&!()3$*!A)(=!0(/-!;)<!JUV9 NUV>!;.<!OUV!3/!;'<!WUV9XUV?!

>;& <7+-%+&.*31='$9"$8+*)'1%*0'$1G"&(5"&1F$+99.)1=+&&'$*1 8A$6) 5K05"$/5(%) &@@'=6) 5H&@,&%$(-) %A5) *$6%"$#,%5*) '") 15(D %"&@$J5*) &"#$%"&%$'() &"1A$%51%,"&@) 1A'$156B) &(*) *$6%"$#,%5*) '") 6',"15)"',%$(-)*51$6$'(67)!@6'B)$%)5(&#@56)/5&6,"$(-)%A5)5??51D %$H5(566)'?)6%&%$1&@@>)0@&((5*)"',%$(-7)!)A'%60'%)%"&??$1)615(&D "$') $6) ,65*B) =A5"5) %=') ('*56) 1'(15(%"&%5) &@@) 0&1P5%) *56%$(&D %$'(6) '() %A5) 2'37) !) Nee) L#06) $(\51%$'() "&%5) =$%A) ,($?'"/) %5/0'"&@)*$6%"$#,%$'()05")6',"15)$6),65*7)8&#@5)N)0"565(%6)%A5) &H5"&-5)@&%5(1>)1'/0,%5*)*,"$(-)6$/,@&%$'(7).()%A5)`a)&@-'D "$%A/)@$(5)'?)8&#@5)N)'(@>)&"1A$%51%,"&@)*51$6$'(6)1&()#5)&(&D @>J5*B)6$(15)%A5)6&/5)"',%56)&"5),65*)?'")&@@)2'367)3'/0&"$(-) ;5"/56)&(*);5"/56DL)1'@,/(6)$%)1&()#5)655()%A&%)(')-&$()$6) &1A$5H5*)=A5()&*'0%$(-)*$6%"$#,%5*)'")0@&((5*)6',"15)"',%$(-) $()%A5)`a)@$(57);'=5H5"B)%A5)1'/0&"$6'()'?);5"/56DC+)1'@D ,/(6) =$%A)%A5)%=')0"5H$',6)'(56)6A'=6)%A&%)*$6%"$#,%5*)&"#$D %"&%$'()@5*)%')/'"5)%A&()Uej)'?)@&%5(1>)"5*,1%$'()=A5()1'/D 0&"5*)%')%A5)15(%"&@$J5*)&00"'&1A7)

)
I%+,($!X!9!P%&4*!8/(42/)5!$*3%-)3%/&!8#$&!$-12/=%&+!5%*3%&'3!(/,3%&+! )2+/(%3#-*!0/(!1)3#!*$2$'3%/&!%&!)!12)&&$5!*/,('$!(/,3%&+!)112%$5!3/!)!XBX! C$(-$*9DE!7/:?!"#$!1$)4!)3!3#$!3/1!/0!$)'#!*6,)($!($1($*$&3*!3#$!-)B9 %-,-!8/(42/)5!($)'#$5!5,(%&+!*%-,2)3%/&?!

#;& <7+-%+&.*31:$'+1#"(&(1 8&#@5)Q)*50$1%6)%A5)&"5&),6&-5)'?);5"/56)&(*);5"/56DC+) 15(%"&@)"',%5"6)E=$%A)V)$(0,%)&(*)',%0,%)0'"%6F)$()&)`3V]c`Se) ]$"%5K) V) WI^!7) !"5&) $6) 5K0"5665*) $() %5"/6) '?) (,/#5") '?) c:86) &(*) ?@$0D?@'06) ?'") ?',") 1'(?$-,"&%$'(6) '?) #,??5") 6$J57) C>(%A56$6) "56,@%6) 1'/5) ?"'/) %A5) ,65) '?) %A5) `C8) %''@B) 0&"%)'?)

197

%A5) `$@$(K) .C<) [7Q$) %''@65%7) 2'%A$(-) #,%) *5?&,@%) 0&"&/5%5"6) =5"5)&66,/5*)$()%A5)6>(%A56$6)%''@7);5"/56DL)$6)('%)5K0@$1$%) "5?5"5(15*)$()8&#@5)QB)6$(15)$%6)$/0@5/5(%&%$'()$6)G,$%5)6$/$@&") %');5"/567)8A5)*$??5"5(15)$/0@$5*)#>)%A5)"',%$(-) /51A&($6/) $/0@5/5(%&%$'()A&6)(')6$-($?$1&(%)$/0&1%)$()%5"/6)'?)&"5&7)
").2$!N!K!C$(-$*!)&5!C$(-$*9DE!)($)!'/-1)(%*/&!;X91/(3!(/,3$(<?!

T9UU%4! 6(C%! I! M! /L! PQ!

3%45%6! 3%45%6.8*! SVW6! OB(+.UB)+6! SVW6! OB(+.UB)+6 /HLI! Q/Q! /IPJ! QIH //QM! QPF! /FHF! QLH /QQJ! QFI! /LPI! QMH /FPQ! QLL! /N/F! PHH

) !@%A',-A);5"/56DC+)$/0"'H56)05"?'"/&(15)$%)1@5&"@>)05D (&@$J56)&"5&B)6$(15)$%)0"565(%6)&()&H5"&-5)'?)SN7Yj)/'"5)c:86) &(*) NN7Yj) /'"5) ?@$0D?@'06) %A&() ;5"/567) ;'=5H5"B) W$-,"5) X) E5K%"&1%5*)?"'/)W$-,"5)UF)0'$(%6)',%)%A&%)*$6%"$#,%5*)&"#$%"&%$'() 2'36B) 5H5() $/0@5/5(%5*) =$%A) UD?@$%) #,??5"6B) &1A$5H56) #5%%5") 05"?'"/&(15) %A&() 15(%"&@$J5*) &"#$%"&%$'() 2'36) $/0@5/5(%5*) =$%A)SQD?@$%)#,??5")?'")&@@)$(\51%$'()"&%56B)5K150%)Sej7)3'/0&"D $(-) &) UD?@$%) #,??5") ;5"/56DC+) 2'3) =$%A) &) SQD?@$%) #,??5") ;5"/56B)%A5)&H5"&-5)@&%5(1>)?'")&@@)$(\51%$'()"&%56)"5*,156)#>) &00"'K$/&%5@>)SV7Qj7)

X!*%69B'6!U)4!,!F.+)4'!4)9'%4!

#&65*)'()0"5H$',6@>)P('=()&00@$1&%$'()%"&??$1)#5A&H$'"B)=A$1A) -,&"&(%556)&#65(15)'?)*5&*@'1P)&(*)/'"5)#&@&(15*)1'//,($D 1&%$'()@'&*7) 8A5);5"/56DC+)2'3)&"1A$%51%,"5)=&6)$/0@5/5(%5*)%')5KD 0@'"5)%A5),65)'?)*$6%"$#,%5*)&"#$%"&%$'()&(*)6',"15)"',%$(-7)8A$6) 2'3) 65"H5*) %') 1'/0&"5) %A5) %"&*5D'??6) $(H'@H5*) $() *51$*$(-) &#',%)%A5),65)'?)6',"15)H5"6,6)*$6%"$#,%5*)"',%$(-)6%"&%5-$56B)&6) =5@@)&6)$()%A5),65)'?)15(%"&@$J5*)H5"6,6)*$6%"$#,%5*)&"#$%"&%$'() 6%"&%5-$567) !**$%$'(&@@>B) %A5) 0&05") 0"'0'656) &) "',%5) /&00$(-) 0"'1566) %A&%) 1&() #5) &*H&(%&-5',6) ?'") 5(&#@$(-) %A5) 56%$/&%$'() '?) @$(P) '11,0&(1>) $() 2'36) &(*) %A5) 1'(65G,5(%) 05"?'"/&(15) $/0"'H5/5(%7) 8A$6) &@@'=6) 6'@H$(-) 1'(-56%$'() /$%$-&%$'() %A"',-A) A'%60'%6) &H'$*&(157) 8A5) "56,@%6) 0'$(%) ',%) %') &) 2'3) *56$-() =$%A) 05"?'"/&(15) $/0"'H5/5(%B) 0'=5") *$66$0&%$'() "5*,1%$'()&(*)&"5&)6&H$(-7) 9(-'$(-)='"P)15(%5"6)'()*>(&/$1)%"&??$1)615(&"$'6B)=A5"5) &00@$1&%$'(6) &"5) @'&*5*) &%) ",(%$/5) &(*) 1'//,($1&%$'() "5D G,$"5/5(%6)&"5)"5G,56%5*)'()%A5)?@>7)C5@?)&*&0%$H5)2'36)#&65*) '() -@'#&@) $(?'"/&%$'() P('=@5*-5) &"5) &() $(%5"56%$(-) 1A'$15B) 6$(15)"56,@%6)*5?$($%5@>)6A'=5*)%A&%)*51$6$'(6)#&65*)'()@'1&@@>) &1G,$"5*)$(?'"/&%$'()/&>)@5&*)%')#&*)05"?'"/&(15)"56,@%67) !3k29bc<4^<L<28C) 8A5)!,%A'"6)&1P('=@5*-5)%A5)6,00'"%)'?)%A5)32IG)%A"',-A) "565&"1A)-"&(%6)NUNQUYTQeeVDSB)SeZ[QUTQeeZDZB)Se[QVVTQeeZD QB)SeNV[[TQee[DQ)&(*)'?)%A5)W!I<+^C)-"&(%)NeTeZNUD[7) +<W<+<23<C)
MNO! b'@?B) b7) 5%) &@7) lL,@%$0"'1566'") C>6%5/D'(D3A$0) ELIC'3F) 851A('@'D ->m7) .<<<) 8"&(6&1%$'(6) '() 3'/0,%5"D!$*5*) 456$-() '?) .(%5-"&%5*) 3$"D 1,$%6)&(*)C>6%5/6B)QYENeFB)91%7)QeeZB)007)NYeNDNYNS7) MQO! I&6"$1A&B)C7n)4,%%B)27)l9(D3A$0)3'//,($1&%$'()!"1A$%51%,"56)g)C>6%5/) '()3A$0).(%5"1'((51%m7)L'"-&()k&,?/&(()C1$5(15B)QeeZB)VUU07) MSO! L'"&56B) W7) 5%) &@7) l;5"/56_) &() $(?"&6%",1%,"5) ?'") @'=) &"5&) 'H5"A5&*) 0&1P5%D6=$%1A$(-) (5%='"P6) '() 1A$0m7) .(%5-"&%$'() ]cC.) o',"(&@B) SZENFB) 91%7)QeeUB)007)X[D[S7)

)
I%+,($![!K!GA$()+$!2)3$&'%$*!8#$&!'/-1)(%&+!5%*3(%.,3$5!)(.%3()3%/&!%&! W902%3!.,00$(!C$(-$*9DE!3/!'$&3()2%H$5!ON902%3!.,00$(!C$(-$*!0/(!%&Z$'3%/&! ()3$*!A)(=%&+!0(/-!JUV!3/!XUV?!

MUO! ^@&66B)37n)2$B)c7)l8A5)8,"()L'*5@)?'")!*&0%$H5)+',%$(-m7)o',"(&@)'?)%A5) !66'1$&%$'()?'")3'/0,%$(-)L&1A$(5">B)UNEVFB)C507)N[[UB)007)ZYUD[eQ7) MVO! ;,B)o7n)L&"1,@561,B)+7)l4>!4)D)C/&"%)+',%$(-)?'")25%='"P6D'(D3A$0m7) .(_)4!3heUB)QeeUB)007)QXeDQXS7) MXO! k,@/&@&B)!7)5%)&@7)l4$6%"$#,%5*)#,6)&"#$%"&%$'()&@-'"$%A/)1'/0&"$6'()'() WI^!D#&65*)LI<^DU)/,@%$0"'1566'")6>6%5/)'()1A$0m7).<8)3'/0,%5"6) p)4$-$%&@)851A($G,56B)o,@7)QeeZB)QEUFB)007)SNUDSQV7) MYO! R5"%'JJ$B) 47n) R5($($B) c7) l`0$056_) !) 25%='"PD'(D1A$0) !"1A$%51%,"5) ?'") ^$-&61&@5) C>6%5/6D'(D3A$0m7) .<<<) 3$"1,$%6) &(*) C>6%5/6) L&-&J$(5B) UEQFB)QeeUB)007)NZDSN7) MZO! W5(B)^7n)2$(-B)b7)l!)L$($/,/DI&%A)L&00$(-)!@-'"$%A/)?'")Q4)/56A) 25%='"P)'()3A$0)!"1A$%51%,"5m7).(_)!I33!CheZB)QeeZB)007)NVUQDNVUV7) M[O! R'@'%$(B)<7)5%)&@7)l+',%$(-)8&#@5)L$($/$J&%$'()?'").""5-,@&")L56A)2'3m7) .(_)4!8<heYB)QeeYB)007)[UQD[UY7) MNeO! R&(5"\55B) 27) 5%) &@7) l!) I'=5") &(*) I5"?'"/&(15) L'*5@) ?'") 25%='"PD'(D 3A$0)!"1A$%51%,"56m7).(_)4!8<heUB)QeeUB)007)NQVeDNQVV7) MNNO! I&@/&B)o7)5%)&@7)lL&00$(-)</#5**5*)C>6%5/6)'(%')2'36)g)8A5)8"&??$1) <??51%) '() 4>(&/$1) <(5"->) <6%$/&%$'(m7) .(_) CR33.heVB) QeeVB) 007) N[XD QeN7) )

bA5() 1'(6$*5"$(-) &() $/0@5/5(%&%$'() '?) &) UD?@$%) #,??5") ;5"/56DC+)&(*)&()$/0@5/5(%&%$'()'?)&)SQD?@$%)#,??5");5"/56B) $%) $6) H$6$#@5) %A&%) 1A''6$(-) ;5"/56DC+) $/0@$56) &"5&) 1'(6,/0D %$'() "5*,1%$'() EX7Qj) @566) c:86) &(*) [7Zj) @566) ?@$0D?@'06F7) C$(15) %A5) &H5"&-5) @&%5(1>) $6) &@6') "5*,15*) $() ;5"/56DC+) $/D 0@5/5(%&%$'(B)5H5()=$%A)6,1A)&)6/&@@)#,??5")6$J5B)$%)$6)5&6>)%') 1'(1@,*5) %A&%) 15(%"&@$J5*) &"#$%"&%$'() =$%A) 0@&((5*) 6',"15) "',%$(-) $6) &) -''*) *56$-() 1A'$15B) "5*,1$(-) 6$J5) &(*) @&%5(1>7) L'"5'H5"B) 6'/5) ='"P6) 6A'=) %A&%) #,??5") 6$J5) 6%"'(-@>) 1'(%"$D #,%5) %') 0'=5") *$66$0&%$'() MNeOB) =A$@5) '%A5"6) 6A'=) %A&%) 2'3) #,??5"6)/&>)&11',(%)?'")&"',(*)[ej)'?)%A5)0'=5")*$66$0&%$'() $()&)"',%5")MNNO7)8A5"5?'"5B)%A5)"5*,1%$'()'?)SQD?@$%)#,??5")%')UD ?@$%)#,??5")0"'#&#@>)&@6')$/0@$56)6$-($?$1&(%)5(5"->)6&H$(-67) ]7! 3923c:C.92C)!24)92^9.2^)b9+k) .() %A$6) 0&05"B) %A5) /&$() 1'(%"$#,%$'(6) 1'/5) ?"'/) %A5) 05"D ?'"/&(15) 5H&@,&%$'() '?) "',%$(-) &(*) &"#$%"&%$'() &"1A$%51%,"&@) *51$6$'(6)&(*)%A5$")$/0&1%)'()&"5&)1'6%67) !)651'(*&">)1'(%"$D #,%$'() $6) %A5) 0"'0'6$%$'() '?) &) /5%A'*) ?'") 0&%A) 1'/0,%&%$'(B)

198

On-Chip Efcient Round-Robin Scheduler for High-Speed Interconnection


Pongyupinpanich Surapong and Manfred Glesner
Microelectronic Systems Research Group, Technische Universit t Darmstadt, a Darmstadt, Germany Email:{surapong; glesner}@mes.tu-darmstadt.de

AbstractDue to the simplicity of scheduling, the buffered crossbar is becoming attractive for high-speed communication system. Although the previously proposed Round-Robin algorithms achieve 100% throughput under uniform trafc, they can not achieve a satisfactory performance under non-uniform trafc. In this paper, we propose an efcient Round-Robin scheduling algorithm based on binary-tree scheme where service policy is applied to improve Quality-of-Service. With the proposed scheduling algorithm, the searching time-complexity of O(1) (one clock cycle) and 100% throughput under non-uniform trafc can be obtained. Based on a binary-tree structure, the design achieves high-speed data rate at T bps, and simpler design with combinational circuits. The design has been simulated on both FPGA-based (Virtex 5) and Silicon-based technology (0.18 um). The synthesis results show that consumed resources varied from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 4 4 to 128 128. Critical path delays from 0.72 to 4.52 ns for FPGA-based and from 1.33 to 4.0 ns for silicon-based have obtained for the design.

I. I NTRODUCTION Performance and efciency of a generic buffered crossbar depends on input-, internal-, and output-scheduling mechanisms [1]. It is composed of three main structures: input ports, output port and a switch fabric interconnecting the input and the output ports. The complexity of scheduler numbers located on all crosspoints to manage the data-queue is O(logN 2 ), where N is the number of input ports based on a symmetrical structure [1]. Thus, improving the performance and efciency of a scheduler is attractive for interconnection designers. Scheduling schemes are divided into two main categories: weighted algorithms and Round-Robin algorithms. T. Javadi et al [2] and M. Nabeshima [3] introduced LQF-RR and OCFOCF to match inputs to outputs. Since their basic building blocks of matching operations are integer comparators and multiplexers, their complexities are O(N logN ). To reduce this complexity, the internal information structure, SCBF [4], was proposed with O(logN ). However, it has unstable regions for the states of input virtual-output-queues (VOQs) and its complexity is still too sensitive to the crossbar size. Therefore, the schedulers based on weighted algorithms have limitations for building high-speed and large capacity crossbars. Due to simplicity, fairness, 100% throughput and contentionless, Round-Robin-based mechanism was proposed on RR-RR [5]. It has been improved with DRR [6] and DRR-k [1]. These two versions applied the double-pointers updating mechanism to overcome the limited performance of the Round-Robin scheme. However, since the position of doublepointer has to be updated as fast as possible, their design structure based on comparator and counter functions is a
978-1-4577-0660-8/11/$26.00 c 2011 IEEE

too complex to support data rates up to Terabits per second. Chauo [7] proposed a structure based on a binary-tree arbiter which can perform the arbitration in a fast and efcient way. However, this framework can not guarantee fairness to all inputs during non-uniform trafc. In this paper, we explore the design of an efcient RoundRobin scheduler based on binary-tree structure which guarantees fairness, 100% throughput, without contention on nonuniform trafcs. The design achieves very high-speed data rates with low time-complexity (O(1)). A service function has been included to improve Quality-of-Service (QoS) for all input ports. The rest of this paper is organized as follows: the efcient Round-Robin algorithm is explained in section II. Section III introduces the hardware implementation of an 8 8 efcient Round-Robin scheduler based on 8 8 buffered crossbar. Performance and efciency of the design are reported in section IV and compared with the related work. Finally, conclusions are presented in section IV. II. A N E FFICIENT ROUND -ROBIN A LGORITHM TABLE I: Binary-tree selection on a Leaf- and Root-Node [9].
State 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 PL 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 RL 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 PR 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 RR 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 Leaf-Node right right right left left left right left right left left Root-Node right right left right left right left left -

A. Binary-Tree algorithm A time optimized Round-Robin algorithm can be realized by applying the binary-tree arbitration [7]. With a N N buffered crossbar structure, the binary-tree level equals to log(N + 1) as shown in [8]. Since the basic element of any node on the binary tree has four inputs and two outputs [9], outputs can be dened by association of input priority and input request on either left or right side respectively. Assuming that PL or R and RL or R are input priority and input request, table I determines

199

the selective state of Leaf-Node and Root-Node under their possible actions. For example, we suppose that we have four inputs comprised of priorities and requests, PL3 RL3 PR2 RR2 and PL1 RL1 PR0 RR0 respectively equal to 0101 and 0011. Conforming to the binary-tree selection table I, the RR0 is selected as shown in gure 1.

Leaf-node

01

11

Fig. 2: Design entity of the efcient Round-Robin scheduler for eight requests.
11

01

01

00

Fig. 1: A binary-tree selection conforming to table I, where node PR0 RR0 is selected when PL3 RL3 PR2 RR2 =0101 and PL1 RL1 PR0 RR0 =0011. B. Round-Robin-based scheduling mechanism with service function In this section, we introduce a Round-Robin algorithm based on binary-tree searching scheme where a service function is applied to all input ports in order to improve QoS during nonuniform trafc dened by [1]. TABLE II: A Round-Robin-Based scheduling mechanism with service function in order to improve QoS on non-uniform trafc.

Fig. 3: Block diagram of the efcient Round-Robin Scheduler, where service-ratios (S), a 8-bit VOQ Req. and credit buffer (CB) are its input.

Based on this mechanisms and under non-uniform trafc (hot-spot and unbalance data rate), the service-ratio (Servicei ) for each input port i comes from the data rate itself. Assuming that Vi is the data rate at input port i, Servicei = Vi Mechanism : Round-Robin-Based scheduling mechanism M in(V1 ,VN ) where N is the number of input ports. For Input : Credit Buffer (CBi ), Request (Reqi ), service-ratio (Servicei ) example, if the data rates of four input ports are 50, 100, Output : Grant(Granti ) 150 and 200 KBit/sec, their service-ratios will be 1, 2, 3 and Internal : Priority (P rii ), counter, start, enable, i 4 respectively. 1) Beginning P ri0 = 1, counter = 0, i = 0, enable = 1, start = 0
2) 3) 4) 5) 6) 7) 8) 9) 10) 11) 12) 13) 14) 15) 16) 17) 18) 19) while (1) loop if enable = 1 then Reading all CBs, Reqs and Services, start = 0 Grant = binary-tree function(Req,CB,P ri) if Grant > 0 then enable = 0, start = 1 for j = 1 to N i = j where Grantj = 1 end for end if end if if counter = Servicei then counter = 0, enable = 1 P ri= cyclic shift left(Grant) elsif CBi > 0 and start = 1 then counter + + end if end loop

III. H ARDWARE I MPLEMENTATION A. Design Architecture The efcient Round-Robin algorithm proposed in Section II has been implemented in hardware. Since our goal is an optimal time-complexity, combinational circuits are applied as much as possible to reduce the processing time of search and grant state of binary-tree mechanism. Shift registers are used to maintain pointers information. For simplication reasons, a 8 8 buffered crossbar with four-credit levels per internal crosspoint buffer (CB) has been used as design case where eight requests become the inputs of the scheduler. Figure 2 and 3 depict a design entity and a block diagram of the expected design. The circuit has three inputs and one output. The inputs are: 1) a 8-bit VOQ request (Req) vector containing input requests; 2) a 10-bit vector array representing service-ratio (S); 3) a 2-bit vector array called credit (CB) contains the level of internal crosspoint buffer (00=full, 11=empty). The output of the circuit is a 8-bit vector containing the grant decision (GRAN T ). B. Searching Mechanism Architecture Fig.4 shows the detail of the Searching Mechanism block diagram with eight requests (Reqs), eight credit buffer arrays (CBs) and eight grants (GRAN T s). In CM P block, 1-bit

At the beginning, the internal parameters are set as Line 1). Within the loop, if enable is in enabled status, the algorithm will start to read CB, Reqs and Service information. Meanwhile, start is set in disabled status. Grant is processed by the binary-tree function as Line 5). On Line 6), if Grant is more than zero, enable and start will be in disabled and enabled status. Afterward, i is determined by corresponding to Grants value. Between Line 13) and Line 18), counter is used to count up when start is in enabled status. When counter reachs Servicei , enable is in enabled status; afterward the loop will restart.

200

Fig. 4: Simple block diagram of the binary-tree searching mechanism for eight requests.

Fig. 6: Combinational logic circuit of a L-Circuit module comprising of 8 ANDs, 4 ORs and 4 NOTs.

Fig. 7: Combinational logic circuit of a root node comprising of 6 ANDs, 4 ORs and 4 NOTs. Fig. 5: Block diagram of Leaf-Node based on two multiplexers and a L-Circuit module. (Reqi ) and 2-bit CBi operate by this function: Outi = 1 when (CBi > 00)and(Reqi = 1) else 0 (1) With the possible actions reported in table I, the notations right, lef t and are mapped to the values 01, 10 and 00 respectively. Therefore, the combinational logic circuit of a Leaf-Node can be optimized by logic technique. The input and output relations of a Leaf-Node are specied by the boolean equations 2 to 5. Figure 5 and 6 show the block diagram of the Leaf-Node and its combinational logic circuit. Gout0 = Gout1 = Sel(0) = Sel(1) = Gin Sel(0) (2) Gin Sel(1) (3) in1 Pin0 Rin0 + Pin1 Pin0 (Rin1 + Rin0 ) (4) R Rin1 Pin0 + Rin0 (Pin1 Pin0 + Pin1 Rin1 ) (5) Fig. 8: Timing diagram of the efcient Round-Robin scheduler. gure 3 under non-uniform trafc, where the 5th input port has higher data rate. We assume that the packets within all crosspoint buffers can be selected in any time-slots; thus, all CBs equal to 11 (3 decimal). Service-ratio of input ports are 1, but the Service-ratio of input port 5th is 1111111111 (1024 decimal). Priority of input port 1st set 1, and all the others are 0. After the 8-bit VOQ Req, 1101 1001, has been arrived, the binary-tree searches and generates the GRAN T corresponding to the priorities, Service-ratios, CBs and VOQ Reqs. According to this gure, input port 1 and 4 are granted for next two clock cycles; afterwards the service will be occupied by input port 5 for 1024 clock cycles, and then occupied by input port 7 on the next clock cycle.

By the same way of the Leaf-Node, combinational logic circuit of Root-Node can be specied by following boolean functions, Equ.6 and Equ.7, and illustrated in Fig.7.

IV. P ERFORMANCE AND C OMPARISON Gout0 = Pin1 Pin0 (Rin1 + Rin0 ) + Pin1 Rin1 Pin0 Rin0 (6) In this section, we present the synthesis result of the pro Gout1 = Rin1 (Pin1 Pin0 Rin0 + Pin1 Pin0 ) + Pin1 Pin0 Rin0 (7) posed Round-Robin structure based on two most commonly used technologies; FPGA-based and silicon-based. C. Timing Diagram Figure 8 illustrates the timing diagram of the efcient Round-Robin scheduling algorithm architecture depicted in A. FPGA-based Technology We implement SA according to gures of [10] and synthesize it by using Xilinx ISE tool targeting the Xilinx Virtex 5

201

device. Table III show the synthesized results of the proposed structure and SA in term of slices and critical path delay (ns)of N = 4, 8, 16, 32, 64, and 128. TABLE III: Synthesized result in terms of slices and critical path delay (ns) of efcient Round-Robin scheduler on 5vlx330ff.
Design Proposed SA [10] Report Slices ns Slices ns N=4 11 0.72 124 4.19 N=8 25 1.42 192 6.45 N=16 62 2.10 476 6.94 N=32 130 2.80 1137 7.33 N=64 264 3.60 8781 15.7 N=128 533 4.52 16527 22.23

Synopsis tool. However the proposed design guarantees the fairness with service function under non-uniform trafc, but SA can not. For comparison purposes, consider a buffered crossbar of size N = 128 and assume that the cell size is 64bytes, where the line rate is determined by 64 8. The line rates that a scheduler using SA and the proposed design are 15.2 Tbps and 12.8 Tbps. The area results of all designs grow linearly with N as shown in Table V. The proposed design consumes signicantly fewer 2-NAND gates than the PPE, PPA and SA, but more 2-NAND gates than PRRA and IPRRA for all range of N . Compared with its critical path delay, the slightly larger area of the proposed design is neglectable. V. C ONCLUSION In this paper, we propose an efcient Round-Robin scheduling algorithm based on binary-tree scheme where QoS is improved by applying service policy. The proposed scheduling algorithm can achieve the searching time-complexity of (O(1)) under non-uniform trafc. By using the service policy, 100 % throughput can be attained, corresponding to improvement of scheduling performance. The design has been simulated on both FPGA-based (Virtex 5) and Silicon-based technology (0.18 um). The synthesis results show that consumed resources varied from 11 to 533 slices and from 46 to 1686 2-NAND gates for crossbars of size 44 to 128128. Critical path delays from 0.72 to 4.52 ns for FPGA-based and from 1.33 to 4.0 ns for silicon-based have obtained for the design. R EFERENCES
[1] Y. Zheng, C. Shao, An Efcient Round-Robin Algorithm for Combined Input-Crosspoint Queued Switches, IEEE ICAS, 2005. [2] T. Javadi, R. Magill, and T. Hrabik, A high-throughput algorithm for buffered crossbar switch fabric, in Proc. IEEE ICC, June 2001, pp. 1581-1591. [3] M. Nabeshima, Performance evaluation of combined input-and crosspoint-queued switch, IEICE Trans. Commum., Col. E83-B, no.3, mar 2000. [4] X. Zhang and L. N. Bhuyan, An efcient algorithm for combined inputcrosspoint-queue (CICQ) switches, IEEE Globecom 2004, pp. 11681173. [5] R. Rojas-Cessa, E. Oki, Z. Jing and H.J. Chao, CIXB-1:Combined input one-cell-crosspoint buffered switch, Proc. 2001 IEEE WHPSR, pp. 324329. [6] J. Z. Luo, Y. Lee, J. Wu, DRR: A fast high-throughput scheduling algorithm for combined input-crosspoint-queued(CICQ) switches, IEEE MASCOTS, 2005 pp. 329-332. [7] H. J. Chao, C. H. Lam, X. Guo, Fast ping-pong arbitration for inputoutput queued packet switches, international Journal of Communication systems, 2001 pp. 663-678. [8] H. J. Chao, C. H. Lam, X. Guo, Fast fair arbitration design in packet switches, IEEE, 2005 pp. 472-476. [9] S. Q. Z, M. Yang, Algorithm-Hardware Codesign of Fast Parallel RoundRobin Arbiters, IEEE transactions on parallel and distributed systems, 2007 pp. 84-94. [10] P. Gupta, N. McKeown, Desiging and Implementing a Fast Crossbar Scheduler, IEEE Micro, vol. 19, no.1, 1999 pp. 20-29.

As shown in table III, the proposed design conforming to binary-tree structure was implemented based on combinational circuits; therefore, the consumed slices are signicantly lower than SA. The critical path delay of the proposed design optimized by the Xilinx ISE tool is varied from 0.72 to 4.52 ns for N = 4, 8, 16, 32, 64, and 128 because of the combinational circuits where the synthesizer can simply map the design with logic elements. B. Silicon-based Technology The previous works, ERR [10], PRRA [9], IPRRA [9], PPE [10], PPA [10], and SA [10], had been analyzed and synthesized based on Silicon technology 0.18 um standard cell under the same operating conditions and area optimization. For fairness, we also analyze and synthesize our design on the same setting environment. TableIV shows the critical path delays (in nanoseconds) of these designs. TableV shows the area cost in number of two-input NAND gates for N = 4, 8, 16, 32, 64, and 128. Although the results depend on the standard cell library used, they present the relative performance of these designs. TABLE IV: Critical path delay of PPE, PPA, SA, PRRA, and IPRRA in terms of ns.
Design PPE PPA SA PRRA IPRRA Proposed N=4 1.67 1.7 1.36 1.47 1.29 1.33 N=8 2.73 2.53 1.51 2.52 1.89 1.40 N=16 3.8 3.66 1.79 3.58 2.68 1.93 N=32 5.07 4.54 2.26 4.63 3.68 2.10 N=64 6.31 5.67 2.72 5.68 4.56 2.95 N=128 7.2 6.54 3.35 6.74 5.01 4.0

TABLE V: Area result of PPE, PPA, SA, PRRA, and IPRRA (number of NAND2 gates).
Design PPE PPA SA PRRA IPRRA Proposed N=4 53 63 89 31 31 46 N=8 150 143 292 72 82 112 N=16 349 313 641 155 173 255 N=32 812 644 1318 320 356 576 N=64 1826 1316 2372 651 723 867 N=128 4010 2649 4780 1312 1455 1686

As shown in table IV, the critical path delay of SA and the proposed design grow with log4 N , while the critical path delay of PPE, PPA, PRRA, IPRRA grow with log2 N , which are consistent with the analysis of these designs. SA and the proposed design operate the fastest with shortest level of basic components and combinational circuits synthesized by

202

Author Index
Abid, Mohamed, 149 Ammar, Manel, 149 Baghdadi, Amer, 79 Baldassin, Alexandro, 99 Belaid, Ikbel, 179 Bhattacharyya, Shuvra, 67 Bois, Guy, 92 Calazans, Ney, 193 Castelfranco, Antonino, 45 Chan, King, 38 Cheung, Ray, 38 Cox, Charles, 45 Deschenes, Justin, 30 Fresse, Virginie, 186 Godet-Bar, Guillaume, 171 Gu, Zonghua, 23 Heinz, Matthias, 53 Hillenbrand, Martin, 53 Kent, Kenneth, 9, 30 Koellner, Christian, 135 Lal, Sundeep, 2 Lekuch, Scott, 45 Lowry, Michael, 121 Marcon, Csar, 164 Moraes, Fernando, 164, 193 Muller, Fabrice, 179 Muscedere, Roberto, 2 Nine, Harmon, 121 Pasareanu, Corina, 121 Philipp, Franois, 85 Pressburger, Tom, 121 Ptrot, Frdric, 106, 171 Samman, Faizal, 85 Tan, Junyan, 186 Williams, Jeremy, 30 Zhang, Ming, 23

Aguiar, Alexandra, 113 Alkhayat, Rachid, 79 Amory, Alexandre, 164 Azevedo, Rodolfo, 99 Baklouti, Mouna, 149 Balasubramanian, Daniel, 121 Barreteau, Anthony, 156 Becker, Juergen, 135 Benjemaa, Maher, 179 Beyrouthy, Taha, 59 Bobda, Christophe, 16 Bochem, Alexander, 9, 30 Boland, Jean-Franois, 92 Brehm, Christian, 74 Callanan, Owen, 45 Carmel-Veilleux, Tennessee, 92 Centoducatte, Paulo, 99 Champagne, David, 38 Chen, Hui, 171 Chen, Yu-Yuan, 38 Chowdhury, Sazzadur, 2 Clancy, Charles, 67 Crawford, Catherine, 45 Dekeyser, Jean-Luc, 149 Eoin, Creedon, 45 Fesquet, Laurent, 59 Gladigau, Jens, 128 Glesner, Manfred, 199, 85 Gohring De Magalhaes, Felipe, 113 Grohans, Michael, 16 Haubelt, Christian, 128 Hedde, Damien, 106 Herpers, Rainer, 9 Hessel, Fabiano, 113 Jezequel, Michel, 79 Karsai, Gabor, 121 Klein, Felipe, 99 Klindworth, Kai, 53 Kutzer, Philipp, 128 Kuykendall, John, 67 Le Nours, Sebastien, 156 Lee, Ruby, 38 Li, Will, 38 Losier, Yves, 30 Lubaszewski, Marcelo, 164 Marcon, Cesar, 193 Marquet, Philippe, 149 Mendoza, Francisco, 135 Moreira, Joo, 99 Moreno, Edson, 193 Muller, Kay, 45 Murugappa, Purushotham, 79 Mhlbauer, Felix, 16 Mller-Glaser, Klaus D., 53, 135, 14 Nutter, Mark, 45 Pap, Gabor, 121 Pasquier, Olivier, 156 Penner, Hartmut, 45 Plishker, William, 67 Pongyupinpanich, Surapong, 199 Purcell, Brian, 45 Purcell, Mark, 45 Rigo, Sandro, 99 Rousseau, Frdric, 171, 186 Schwalb, Tobias, 142 Szefer, Jakub, 38 Teich, Jrgen, 128 Wehn, Norbert, 74 Xenidis, Jimi, 45 Zaki, George, 67 Zhang, Wei, 38

203

Potrebbero piacerti anche