
A Reconfigurable and Scalable Architecture for Security Coprocessor

Chao Li (1, 2), Jun Zhou (1, 3), Yuan Jiang (1, 2), Canfeng Chen (4), Yongjun Xu (1), Zuying Luo (2)
1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2. Beijing Normal University, Beijing, China
3. Beihang University, Beijing, China
4. Nokia Research Center, Beijing, China
E-mail: {lichao, zhoujun, xyj}@ict.ac.cn, {jiangyuan, luozy}@bnu.edu.cn, canfeng-david.chen@nokia.com
Abstract - A security coprocessor can offer specialized and customized security guarantees for computing and communication systems. In most application fields the coprocessor should be small in area, low in power and cost, and highly available; it should provide a choice of encryption algorithms while delivering performance acceptable for the application. In this context, we propose a new system architecture for security coprocessors based on reconfiguration, scalability and a tightly coupled structure, which supports various encryption operations and is easy to upgrade and reconfigure. The security coprocessor also mitigates data bus contention and raises the data transfer rate through an optimized internal data bus network. The new architecture reduces the area and power consumption of the security coprocessor and improves the adaptability, flexibility and security of the whole system. To verify these ideas, a security coprocessor is implemented in 0.13 µm CMOS technology, in which different encryption algorithms are realized by changing the content of the instruction set registers. The core area of the improved security module is 56K gates at 100 MHz, with an average power of 83 mW. Compared with previous work, we achieve equivalent security functions with lower power, smaller size and easier reconfiguration.
Keywords - reconfigurable; instruction set scalable; security coprocessor

I. INTRODUCTION
With the rapid development and wide application of information technology, people have paid particular attention to information and communication security. Over time, various encryption techniques have been developed. They can be roughly divided into two major types: software encryption and hardware encryption based on chip-level implementation. However, as the requirements placed on encryption in practical applications keep growing, the traditional techniques can no longer meet the rising security demand, and a new encryption technique is needed as an alternative. Software encryption has been widely used for a long time, but many encryption algorithms perform inefficiently on a general-purpose microprocessor [1]. Hardware encryption based on ASICs (Application Specific Integrated Circuits) can raise the data throughput, but it inevitably increases the cost and complexity of the whole system; furthermore, the system efficiency is limited by the transmission capacity of the data buses. In addition, because of its limited configurability and scalability, ASIC hardware encryption can only realize a single, fixed function. The traditional security coprocessor therefore fails to meet current requirements in many respects, for example in applications with constraints on flexibility, module area, power consumption and throughput. Consequently, it is important to propose a new security scheme with smaller module area, lower power consumption and greater flexibility.

Given the current state of technical development and of traditional security methods, we propose a reconfigurable and scalable security coprocessor. The logic structure and function of its circuits can be reconfigured, so it can provide security functions matching different encryption algorithms. Moreover, by changing the content of the instruction set registers, users can conveniently upgrade to personalized encryption algorithms. The coprocessor can thus realize a wide variety of encryption algorithms, which makes it flexible, adaptable and scalable.

In this architecture, by analyzing popular algorithms such as AES/SMS4 and ECC/RSA, we identified the operations they share, which reduces the number of computing modules in the security coprocessor to only three. This reduces both the module area and the system power consumption. Furthermore, a specific instruction set is realized in the main controller of the security coprocessor, and it can be reconfigured by setting the instruction set registers. Additionally, to increase the system throughput, a multi-bus pipelined network is adopted in the coprocessor, and every data path is one-way. Therefore, the conflict of data transmission can be mitigated

978-1-4244-5046-6/10/$26.00 © 2010 IEEE


greatly, and the data transfer rate and the parallel computing ability will be improved significantly.

II. ARCHITECTURE OF THE RECONFIGURABLE AND SCALABLE SECURITY COPROCESSOR

We adopt several popular algorithms, SMS4/AES and RSA/ECC, as examples to test the performance of the new security coprocessor. These algorithms are implemented in the computing unit. The SMS4 algorithm was first made public in January 2006. SMS4 is a block cipher used in the Chinese National Standard for Wireless LAN, WAPI (WLAN Authentication and Privacy Infrastructure). AES, also a block cipher, is a symmetric encryption algorithm. It has been analyzed extensively and is now used worldwide. RSA, an asymmetric encryption algorithm for public-key cryptography, is widely applied in electronic commerce protocols. It is believed to be secure given sufficiently long keys and up-to-date implementations. Elliptic curve cryptography (ECC) is also an asymmetric encryption approach, based on the algebraic structure of elliptic curves over finite fields.

As illustrated in Figure 1, our reconfigurable security coprocessor consists of four parts: main controller, computing unit, internal bus network and register file. Compared with the traditional security processor architecture, our security coprocessor adopts a tightly coupled instruction set extension method, which has the characteristics of smaller area, lower power consumption and good flexibility [17].

Fig.1 System Architecture of Reconfigurable and Scalable Security Coprocessor

The main controller is responsible for system control. The computing unit performs the encryption and decryption operations. Optimizing the internal bus network improves the efficiency of data transmission and therefore the performance of the whole system. All data is stored in the register file. The details of the four modules are described below.

A. Main Controller
1). Architecture of Main Controller
The main controller is made up of four parts: Finite-State Machine (FSM), Instruction Set Register (Ins_res), Instruction Decoder (Ins_dec) and Bus Controller. The detailed structure is shown in Figure 2.

Finite-State Machine (FSM): the FSM drives the automatic execution of application programs. It has two main states: run and reset. When the instruction-execution-enable signal (ins_exe_en) is active, the FSM enters the run state. In this state, the FSM reads one instruction per clock cycle from the instruction set register, according to the instruction address, and sends it to the instruction decoder. If ins_exe_en is inactive, the FSM stops reading instructions and enters the reset state.

Instruction Set Register (Ins_res): the instruction set of the four encryption algorithms is stored in the instruction set register file, and ample code space is reserved for expanding the instruction set.

Instruction Decoder (Ins_dec): the instruction decoder receives and decodes the instructions read from the instruction set register.

Bus Controller: the bus controller controls the allocation of the data buses, prevents bus contention and raises the data transfer rate. When it is working, it sends the control signals from the instruction decoder to the computing unit through the control bus. When the computing unit receives the control signals and starts to work, the bus controller can send the addresses of the corresponding data and instructions to the main controller.

Fig.2 Architecture of Main Controller

2). Instruction Set
A specific instruction set for four popular algorithms is defined in this section, and all of the instructions are stored in the instruction set register of the main controller. Because encryption algorithms are updated frequently, ample code space is reserved in the instruction set register to extend the instruction set, which makes it convenient to add new algorithms and ensures reconfigurability and flexibility.
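The run/reset behaviour of the main controller's FSM can be illustrated with a minimal Python sketch. It is a behavioural model only, not the authors' Verilog: the class name `MainControllerFSM`, the string encoding of instructions and the choice to restart from address 0 after a reset are assumptions made for illustration; only the signal name `ins_exe_en` and the one-instruction-per-cycle fetch come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class MainControllerFSM:
    """Behavioural model of the run/reset FSM (hypothetical, for illustration)."""
    ins_res: list                      # contents of the instruction set register
    state: str = "reset"
    address: int = 0                   # next instruction address (assumed counter)
    issued: list = field(default_factory=list)

    def clock(self, ins_exe_en: bool):
        """Advance one clock cycle; return the instruction sent to the decoder."""
        if not ins_exe_en:
            # Enable inactive: stop fetching and fall back to the reset state.
            # Restarting from address 0 is an assumption, not stated in the text.
            self.state = "reset"
            self.address = 0
            return None
        self.state = "run"
        if self.address >= len(self.ins_res):
            return None                # program exhausted, nothing to issue
        instruction = self.ins_res[self.address]
        self.address += 1
        self.issued.append(instruction)
        return instruction


fsm = MainControllerFSM(ins_res=["LOAD", "LUT", "STR"])
first = fsm.clock(ins_exe_en=True)     # fetches "LOAD"
second = fsm.clock(ins_exe_en=True)    # fetches "LUT"
fsm.clock(ins_exe_en=False)            # back to the reset state
```

Loading a different list into `ins_res` changes the program the controller runs, which mirrors, in a simplified way, how the text describes switching algorithms by rewriting the instruction set register.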

According to the features of SMS4/AES and RSA/ECC, we define a specific instruction set. It contains 11 instructions in total (shown in Table I). In

2010 5th IEEE Conference on Industrial Electronics and Applications


order to improve the code density, all instructions are conditionally executed. The specific instruction set includes data processing instructions (SHIL, SHIR, SHILC, SHIRC, CMP and LUT), access instructions (LOAD and STR), the multi-cycle modular multiplication instruction SCAL, and the modular multiplication adjustment instruction MMR, which can be used to implement modular multiplication and modular inversion.
Table I. Application-specific instruction set
Instruction   Comment
LOAD          Load data into register
STR           Store data into register
PE            Permutation
CMP           Compare
SCAL          Montgomery modular multiplication
MMR           Modular multiplication modulus regulation
SHIL          Shift left
SHIR          Shift right
SHILC         Circular shift left
SHIRC         Circular shift right
LUT           Look-up table
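To make the four shift instructions of the instruction set concrete, the following Python sketch models them on a 32-bit word. The 32-bit width matches the circular shifts the paper lists for SMS4 and AES, but the function signatures and the shift-amount operand are assumptions, since the paper does not give the instruction encoding. The helper `sms4_linear` uses the linear transformation L from the published SMS4 specification as a usage example; it is not taken from this paper.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data path


def shil(x: int, n: int) -> int:
    """SHIL: logical shift left, bits shifted out are discarded."""
    return (x << n) & MASK32


def shir(x: int, n: int) -> int:
    """SHIR: logical shift right, zeros shifted in."""
    return (x & MASK32) >> n


def shilc(x: int, n: int) -> int:
    """SHILC: circular shift left (rotate) by n bits."""
    n %= 32
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def shirc(x: int, n: int) -> int:
    """SHIRC: circular shift right (rotate) by n bits."""
    n %= 32
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def sms4_linear(b: int) -> int:
    """SMS4 linear transformation L(B), built only from rotates and XORs."""
    return b ^ shilc(b, 2) ^ shilc(b, 10) ^ shilc(b, 18) ^ shilc(b, 24)


print(hex(shilc(0x80000001, 1)))   # 0x3
print(hex(sms4_linear(0x00000001)))  # 0x1040405
```

The usage example shows why a rotate instruction plus XOR in the ALU is enough for the diffusion layer of SMS4: no dedicated hardware block is needed for it.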

implementations of the S-box. Processing is also fast, so we adopt the LUT to realize the S-box transformations of SMS4 and AES. The S-box transformation includes two steps: first, take the multiplicative inverse of the byte in the finite field GF(2^8); second, apply an affine transformation. For its realization, refer to [2, 3]: the 8*8 substitution is realized by a 256*8 RAM. The S-boxes of the two algorithms are both 8*8 and have identical interfaces, so the S-box can be designed as a reconfigurable part. In the implementation, we use a memory unit, similar to the LUT resources within an FPGA: the 8-bit input data is taken as the memory address, and the result of the S-box transformation is stored at the corresponding address so that the system can be reconfigured. Assigning different initial values to the memory realizes different S-box transformations. An S-box realized through memory can change its configuration data flexibly, and the S-box transformation is unknown before configuration, so both the flexibility and the security of the S-box transformation are enhanced.

2). Montgomery Modular Multiplication Unit
A scalable multiplier architecture for the finite fields GF(p) and GF(2^m) is presented in [10], in which efficient dual-field adders are used to implement a modified Montgomery multiplication that supports RSA and ECC operations. As described in [10], the main characteristics of the multiplier are: (1) the ability to work on any given operand precision at the kernel level, (2) suitability for any module area, and (3) a pipelined organization that reduces the impact on signal loads caused by the high precision of the operands. For details, see [10]. We simplify the modular multiplication unit of [10] and use two dual-field adders connected in series to realize the modular multiplication operation. The structure is shown in Figure 3:

B. Computing Unit of Security Coprocessor
The computing unit of the security coprocessor is made up of LUT, MM and ALU. The LUT is mainly used to realize the S-box transformation in the block cipher algorithms, MM realizes Montgomery multiplication, and the ALU realizes basic logic and arithmetic operations.

1). LUT
The basic operations of SMS4 and AES are shown in Table II. Most operations of SMS4 and AES are similar: both use circular shift, XOR and S-box transformation.
Table II. Basic operations of SMS4 and AES
Algorithm          SMS4                        AES (divided into groups of 128 bits)
Basic operations   32-bit circular shift       32-bit circular shift
                   8*8 S-box transformation    8*8 S-box transformation
                   32-bit XOR                  128-bit and 32-bit XOR
                                               MixColumn
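The reconfigurable S-box shared by SMS4 and AES, an 8-bit input addressing a 256 x 8 memory, can be sketched as follows. The class `SBoxLUT` and its method names are hypothetical. The table is filled here with the AES S-box, computed from the standard definition (inverse in GF(2^8) followed by the affine transformation); an SMS4 configuration would load that cipher's published table into the same memory, which is not reproduced here.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), with 0 mapped to 0 (a^254 = a^-1)."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def aes_sbox_table() -> list:
    """Precompute all 256 S-box outputs: inversion, then affine transform."""
    table = []
    for x in range(256):
        inv = gf_inv(x)
        table.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                     ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return table


class SBoxLUT:
    """256 x 8 memory used as a reconfigurable S-box (hypothetical model)."""

    def __init__(self):
        self.mem = [0] * 256

    def configure(self, table):
        if len(table) != 256:
            raise ValueError("an 8*8 S-box needs exactly 256 entries")
        self.mem = [entry & 0xFF for entry in table]

    def lookup(self, byte: int) -> int:
        # The 8-bit input is used directly as the memory address.
        return self.mem[byte & 0xFF]


lut = SBoxLUT()
lut.configure(aes_sbox_table())
print(hex(lut.lookup(0x53)))  # 0xed, the example given in FIPS 197
```

Reconfiguring the unit for another cipher amounts to calling `configure` with a different 256-entry table; the lookup path itself does not change.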

Fig.3 Montgomery Modular Multiplication Unit
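As a rough illustration of what the Montgomery modular multiplication unit computes, the sketch below implements the textbook bit-serial (radix-2) Montgomery multiplication over the integers. It covers only the GF(p) case; the GF(2^m) mode of a dual-field unit is not modelled, and the function names are assumptions. Each loop iteration performs at most two additions, which is at least consistent with the two adders in series described in the text, although that mapping is our reading rather than something the paper states.

```python
def mont_mul(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * 2^(-n) mod m for odd m < 2^n and 0 <= a, b < m."""
    if m % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    s = 0
    for i in range(n):
        if (a >> i) & 1:
            s += b              # first addition: accumulate the partial product
        if s & 1:
            s += m              # second addition: make the sum even
        s >>= 1                 # exact division by 2
    if s >= m:
        s -= m                  # single final correction, since s < 2m
    return s


def mod_mul(a: int, b: int, m: int, n: int) -> int:
    """Ordinary a * b mod m, computed through the Montgomery domain."""
    r2 = (1 << (2 * n)) % m     # R^2 mod m with R = 2^n
    a_bar = mont_mul(a, r2, m, n)
    b_bar = mont_mul(b, r2, m, n)
    c_bar = mont_mul(a_bar, b_bar, m, n)
    return mont_mul(c_bar, 1, m, n)   # convert back out of the domain


print(mod_mul(45, 76, 97, 7) == (45 * 76) % 97)  # True
```

The division by 2 in every step replaces the trial division by the modulus, which is why this form suits hardware built from adders and shifters.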

The S-box transformation accounts for more than 30% of the operations in a block cipher algorithm [4], and its cryptographic strength determines the security of the whole algorithm. The reconfiguration of the S-box is therefore very important to the reconfiguration of the whole security coprocessor [6]. The S-boxes of AES and SMS4 can be realized either with a logic circuit or with a LUT [5]. An S-box realized with a logic circuit is not only non-configurable and complicated to implement, but also slow in processing data. It takes up substantial resources and is prone to becoming a speed bottleneck for the system. In the LUT realization, the 256 possible results of the S-box transformation are precomputed into a substitution table, and each state byte is then replaced by table look-up. This approach is simple to design and easy to implement, and it has become the main method of hardware

3). ALU
MixColumn, the most complicated operation in AES, requires modular multiplication and addition in GF(2^8). The operation can be expressed as a matrix multiplication which, given the specific parameters chosen by the algorithm, can be implemented with shifts and XORs. The ALU provides the shift and XOR operations used in the algorithms.

C. Internal Bus Network
The internal bus network is the data transmission path of our security coprocessor. Its connectivity, data transmission rate and scale are significant for the flexibility, scalability and performance of the security coprocessor. If the output of any cryptographic computing module in the network can be transmitted through some path to the input port of another cryptographic computing module, the network is said to be connected, otherwise


disconnected. The network connectivity largely determines the adaptability and scalability of the reconfigurable computing unit [14]. Each encryption algorithm supported by the reconfigurable and scalable security coprocessor is implemented by a particular combination and connection of the computing units. If the internal bus network is connected, the existing computing units can be combined and connected in any way, which allows more encryption algorithms to be realized. In other words, this design helps the reconfigurable and scalable security coprocessor achieve maximum flexibility and connectivity. Therefore, the primary design principle of the coprocessor's internal bus network is to guarantee connectivity. The topology of a traditional network [11] is shown below in Figure 4:

greatly reduces the parallelism and severely affects the performance of the overall system. To address these defects, we propose a new architecture, whose topology is shown below in Figure 5.

Fig.5 Optimized Topology
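The connectivity requirement stated for the internal bus network (the output of any computing module must be able to reach the input of any other) can be checked mechanically on a directed-graph model of the one-way buses. The wiring in `OPTIMIZED` is a guess at a topology with two one-way input buses and two one-way output buses; the node names and the exact connections are assumptions, because the figure itself is not reproduced in the text.

```python
from collections import deque


def reachable(graph: dict, start: str) -> set:
    """Breadth-first search over one-way links; returns every reachable node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_connected(graph: dict, modules: list) -> bool:
    """True if every module's output can reach every other module's input."""
    return all(dst in reachable(graph, src)
               for src in modules for dst in modules if src != dst)


MODULES = ["LUT", "MM", "ALU"]

# Hypothetical wiring: register file -> input buses -> units -> output buses.
OPTIMIZED = {
    "REG": ["IN_BUS_0", "IN_BUS_1"],
    "IN_BUS_0": MODULES,
    "IN_BUS_1": MODULES,
    "LUT": ["OUT_BUS_0", "OUT_BUS_1"],
    "MM": ["OUT_BUS_0", "OUT_BUS_1"],
    "ALU": ["OUT_BUS_0", "OUT_BUS_1"],
    "OUT_BUS_0": ["REG"],
    "OUT_BUS_1": ["REG"],
}

# Same wiring with the write-back links removed, to show a disconnected case.
BROKEN = dict(OPTIMIZED, OUT_BUS_0=[], OUT_BUS_1=[])

print(is_connected(OPTIMIZED, MODULES))  # True
print(is_connected(BROKEN, MODULES))     # False
```

In this model every path between two computing units passes through the register file, so one-way buses do not cost connectivity as long as the output buses write back to the registers that feed the input buses.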

Unlike the traditional topology, we use a pipelined bus architecture with multiple one-way buses. Because there are multiple buses, each bus only needs to carry part of the total data, so bus congestion is reduced considerably. In addition, because each bus is one-way, far fewer arbitrations are needed, which enhances the data transfer efficiency and parallelism. The new architecture adopts two one-way input buses and two one-way output buses, which improves the throughput and data transfer rate and reduces the module area. Moreover, the multiple-bus structure reduces the possibility of bus contention, which allows the security coprocessor to work in a pipelined manner. According to the topology in Figure 5, the system works as follows. First, the system reads data from the register and sends it to the computing unit through the input buses. Then the computing unit encrypts the data, sends the results to the output buses, and eventually delivers them back to the register. In this way, the modules can work in parallel, which improves the efficiency of data transmission and information processing.

III. WORKFLOW OF RECONFIGURABLE AND SCALABLE SECURITY COPROCESSOR

Fig.4 Traditional Topology

The traditional architecture consists of four parts: cryptographic engines, a 32-bit data bus, cryptochannels and a PCI interface. Each part is connected to the others by a bidirectional data path. Although this architecture was popular for a period of time, the defects of its single bidirectional data bus make it difficult to meet current requirements. First, the internal bus network of the traditional architecture has only one data bus, so all data in the system must be transmitted through it, inevitably resulting in a heavy load. Worse, because the data bus is bidirectional, it must perform both reads and writes, and a large number of arbitrations are needed. As a result, the system throughput and the data transmission rate are severely affected. Besides, due to these defects of the traditional internal data bus, bus contention is more likely to happen. This prevents the system from working effectively in pipeline mode, which also

Under the control of the host CPU, the main controller of the security coprocessor selects the instruction set of the chosen encryption algorithm, decodes it, and obtains the control signals and the source and destination addresses of the operation code. On the basis of the source address of the


operation code, the main controller fetches the source operands from the register and sends them to the computing unit. The computing unit executes the corresponding operations according to the control signals and operands, and sends the results to the register. After encryption, the security coprocessor sends an interrupt signal to the host CPU and enters sleep mode. The bus controller is responsible for scheduling among the data buses. The private and public keys are kept in the register of the security coprocessor and are accessible to the host CPU in sleep mode only. For encryption, the main controller does not need to move the keys among the registers, and less information will be lost.

The operational mode of the security coprocessor is controlled by the RESET signal. When RESET is 0, the security coprocessor is in sleep mode and the host CPU can read and write the register contents using the data bus, address bus and control signals. When RESET is 1, the security coprocessor is in run mode and executes one of the four algorithms based on the preloaded contents of the instruction set register.

IV. EXPERIMENT AND VERIFICATION
The encryption algorithms are all implemented in Verilog HDL. We used ModelSim and Synplify for simulation and synthesis, and then an Actel ProASIC3 flash FPGA for verification. The analysis tools of Synplify are used to assess the power consumption. The synthesized design has a maximum clock frequency of 200 MHz. The simulated performance is shown in Table III. The power and energy consumption of the processor are reported for an operating frequency of 100 MHz. The average power includes switching, internal and leakage power. By adopting a reconfigurable design, the computing unit greatly improves the integration level of the system. It also reduces the system area by improving the reuse rate; its area is approximately 30K gates. The main controller's size is about 5K gates.
Table Ref. 2002 [10] Comparison of Security Coprocessors Power (mW) N/A 73.3 K 1.063 G 66.89 K 130.5 M 2.656G 256 M 70 6.56 K 330.5 M 200.83 M 83 56 K 100 32 K 60 385.3 525 K 83 Area (gates) 40 K Freq. (MHz) 220

RSA (1024-bit) ECC (256-bit)

6.12 K 60.13 K

From Table III we can learn that: (1) our design supports more symmetric and asymmetric encryption algorithms, which gives it better versatility; (2) our design needs fewer gates to achieve its target function; (3) the power consumption of our design is lower; (4) our design can achieve a higher operating frequency.

V. CONCLUSIONS
In this paper, we introduced a new security coprocessor architecture that is reconfigurable, scalable and tightly coupled. It integrates system resources effectively and reduces module area while guaranteeing reasonable data throughput. The tightly coupled instruction set extension method and the programmable instruction set for the encryption algorithms offer an easy way to upgrade the security coprocessor. Adding a new encryption algorithm to the system only requires changing the instruction set stored in the register and the contents of the LUT. The whole module achieves good flexibility and scalability at low hardware cost. Finally, the advantages of this security coprocessor in information handling capacity and area have been verified through experiments.

The Internet of Things has grown fast in recent years and includes many practical applications, such as bar codes and RFID [16]. Its security problems have also attracted extensive concern. However, because of the limitations of actual conditions, these techniques place different requirements on the security unit in each system. Our next research topic is how to design a more universal and portable security coprocessor, to further improve the communication quality and trustworthiness of the Internet of Things.

ACKNOWLEDGMENT
This paper is supported in part by the High-Tech Research and Development Program of China (863) under grant No. 2007AA12Z321, the National Basic Research Program of China (973 Program) under grant No. 2006CB303000, and the National Natural Science Foundation of China (NSFC) under grants No. 60772070, 60873244 and 60876025.

REFERENCES
[1] Joan Daemen, Vincent Rijmen. AES Proposal: Rijndael, Version 2, September 3, 1999.
[2] Verbauwhede I M, Schaumont P R, Kuo H. Design and performance testing of a 2.29 Gb/s Rijndael processor, IEEE Journal of Solid-State Circuits, vol. 38, no. 3, March 2003, pp. 569-572.
[3] Henry Kuo, Ingrid Verbauwhede. Architectural optimization for a 1.82 Gbit/s VLSI implementation of the AES Rijndael algorithm, Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, 2001, pp. 51-64.

Performance (bit/s) RSA (512-bit) ECC (160-bit) AES (128 -bit) RSA (1024-bit) SHA1 TRNG AES (128-bit) RSA (1024-bit) SMS4 AES (128-bit) 276 K

2005 [11]

2008 [12]

ours


[4] Anna Labbe, Annie Perez, Jean Michel Portal. Efficient hardware implementation of crypto memory based on AES algorithm and SRAM architecture, IEEE International Symposium on Circuits and Systems, May 2004, pp. 65-71.
[5] Morioka S., Satoh A. An Optimized S-Box Circuit Architecture for Low Power AES Design, Cryptographic Hardware and Embedded Systems (CHES 2002), San Francisco Bay, CA, 2002, vol. 2523, pp. 172-186.
[6] Gueron S., Kounavis M. E. Vortex: A new family of one-way hash functions based on AES rounds and carry-less multiplication, Lecture Notes in Computer Science, 2008, vol. 5222, pp. 331-340.
[7] P. L. Montgomery. Modular multiplication without trial division, Mathematics of Computation, vol. 44, no. 170, pp. 519-521, April 1985.
[8] A. F. Tenca, C. K. Koc. A scalable architecture for Montgomery multiplication, Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, Springer, Berlin, Germany, 1999, no. 1717, pp. 94-108.
[9] C. K. Koc, T. Acar. Montgomery multiplication in GF(2^k), Designs, Codes and Cryptography, April 1998, vol. 14, no. 1, pp. 57-69.
[10] M-C. Sun, C-P. Su, et al. Design of a scalable RSA and ECC Cryptoprocessor, Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), 2003, pp. 495-498.
[11] C. Wang, C. Lo, M. Lee, et al. A network security processor design based on an integrated SoC design and test platform, IEEE/ACM Design Automation Conf. (DAC'06), 2006, pp. 490-495.

[12] Ronghua Lu, Jun Han, Xiaoyang Zeng. A Low-Cost Cryptographic Processor for Security Embedded System, Asia and South Pacific Design Automation Conference (ASP-DAC), Seoul, South Korea, Jan. 21-24, 2008, pp. 113-114.
[13] Esam Khan, M. Watheq El-Kharashi, et al. Network Processors for Communication Security: A Review, IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), Aug. 28-30, 2003, Victoria, Canada, pp. 173-176.
[14] Hartej Singh, Lee Ming-hau, Lu Guang-ming, et al. MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications, IEEE Transactions on Computers, 2000, vol. 49, no. 5, pp. 465-481.
[15] Regazzoni F., Cevrero A., Standaert F. X., et al. A Design Flow and Evaluation Framework for DPA-Resistant Instruction Set Extensions, 11th International Workshop on Cryptographic Hardware and Embedded Systems, Lausanne, Switzerland, Sep. 06-09, pp. 205-219.
[16] DAI Hang-yang, XU Hong-bing. Overview of Security in Wireless Sensor Networks (WSN), Computer Applications, no. 7, pp. 12-18.
[17] Alireza Hodjat, Ingrid Verbauwhede. High Throughput Programmable Crypto Coprocessor, IEEE Micro Magazine, May/June 2004, vol. 24, no. 3, pp. 34-45.
