Abstract—The Advanced Encryption Standard (AES) has received significant interest over the past decade due to its performance and security level, and many hardware implementations have been proposed. Most previous works implement SubBytes and inverse SubBytes with the lookup table method. In this paper we use combinational logic instead, which makes pipelining inside each round efficient. Furthermore, composite field arithmetic helps to reduce the area. Using the proposed architecture, a fully sub-pipelined encryptor/decryptor with 3 sub-stages in each round achieves a throughput of 25.89 Gbps on a Xilinx XC5VLX110T-1 device, which is faster and 48.78% more efficient in throughput per slice than the fastest previous FPGA implementations known to date. Our ASIC implementation achieves 58.18 Gbps, which is faster than previous ASIC implementations. The design was implemented in Verilog HDL and synthesized with RTL Compiler using TSMC's 90 nm standard cell library. Physical design was done with SOC Encounter and reached a maximum throughput of 58.18 Gbps.

Keywords—AES, pipelined AES, sub-pipelined design, ASIC, FPGA, VLSI.

I. INTRODUCTION
The large and growing number of internet and wireless communication users has led to an increasing demand for security measures and devices that protect user data transmitted over open channels. Two types of ...

The rest of the paper is organized as follows. Section II describes the basic AES algorithm. Section III describes the novel on-the-fly key expansion module. Section IV describes the pipeline design. Section V compares the results with previous work. Section VI concludes the paper.

II. AES ALGORITHM
The AES algorithm is a symmetric block cipher that processes data blocks of 128 bits using a cipher key of 128, 192, or 256 bits. AES is an iterative algorithm: each iteration is called a round, and the total number of rounds, Nr, is 10, 12, or 14 when the key length is 128, 192, or 256 bits, respectively. Table 1 shows the number of rounds as a function of key length.

TABLE 1
Number of rounds as a function of key length

            Key length    Block size    Number of
            (Nk words)    (Nb words)    rounds (Nr)
AES-128     4             4             10
AES-192     6             4             12
AES-256     8             4             14

The 128-bit data block is divided into 16 bytes. These bytes are mapped to a 4x4 array called the State, and the State undergoes all the internal operations of the AES algorithm. Every byte in the State is denoted by $S_{i,j}$ ($0 \le i, j < 4$) and is considered an element of $GF(2^8)$. Although different irreducible polynomials can be used to construct $GF(2^8)$, the irreducible polynomial used in the AES algorithm is $p(x) = x^8 + x^4 + x^3 + x + 1$.
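The relation between key length and round count, and the mapping of a 128-bit block onto the State, can be illustrated with a short software sketch. It is not part of the proposed hardware. The function names are hypothetical, the column-by-column fill order is the FIPS-197 convention (the text above does not state it), and Nr = Nk + 6 is simply the pattern visible in Table 1.

```python
def num_rounds(key_bits: int) -> int:
    """Return Nr for a 128-, 192-, or 256-bit key (Table 1: Nr = Nk + 6)."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192, or 256 bits")
    nk = key_bits // 32  # key length in 32-bit words
    return nk + 6


def block_to_state(block: bytes) -> list[list[int]]:
    """Map 16 bytes to the 4x4 State; state[i][j] is row i, column j.

    Assumption: bytes fill the State column by column (FIPS-197 order).
    """
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes")
    return [[block[i + 4 * j] for j in range(4)] for i in range(4)]


def state_to_block(state: list[list[int]]) -> bytes:
    """Inverse of block_to_state: read the State back column by column."""
    return bytes(state[i][j] for j in range(4) for i in range(4))
```

For example, `block_to_state(bytes(range(16)))` places byte 1 at row 1 of column 0 and byte 4 at row 0 of column 1.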
Figure 1. AES data flow from PlainTxt (128 bit) to CipherTxt (128 bit) with round keys K0, Ki, and KNr. 1(a) Encryption: Sub Bytes(), Shift Rows(), Mix Column(), Add Round Key. 1(b) Decryption: Inv Shift Rows(), Inv Sub Bytes(), Add Round Key, Inv Mix Column().

The MixColumns() transformation operates on the State column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ by a fixed polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$.

The function xtime() represents multiplication by '02' modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$. It is implemented as a left shift followed by a conditional XOR with '1B'. Fig. 4 shows the mixed column ...
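The xtime() and MixColumns() operations described above can be sketched in software as follows. This is an illustrative model only, not the authors' Verilog. The helper names `mul3` and `mix_single_column` are hypothetical, and multiplication by '03' is built as xtime(x) XOR x.

```python
def xtime(b: int) -> int:
    """Multiply a GF(2^8) element by '02' modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:      # a bit was shifted out of the byte
        b ^= 0x11B     # clear bit 8 and XOR the low byte with '1B'
    return b


def mul3(b: int) -> int:
    """Multiply by '03', i.e. ('02' * b) XOR b."""
    return xtime(b) ^ b


def mix_single_column(col: list[int]) -> list[int]:
    """Apply MixColumns() to one State column [S0c, S1c, S2c, S3c]."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]
```

Because each output byte needs only shifts and XORs, the same structure maps directly onto combinational logic.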
In matrix form, the MixColumns() transformation of column c is:

\[
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix},
\qquad 0 \le c < 4.
\]

... tables is greater than the other logic. With the LUT method it is difficult to use a sub-pipeline structure with two pipeline stages, which prevents further speedup. An alternative is to use combinational logic, which is faster than the LUT and can also be divided into two pipeline stages, allowing further speedup. In the non-LUT method, SubBytes is implemented by finding the multiplicative inverse followed by the affine transform. Similarly, inverse SubBytes is implemented by the inverse affine transform followed by the multiplicative inverse. Because the multiplicative inverse is common to both, a single structure can implement both SubBytes and inverse SubBytes, as shown in Fig. 2.
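The shared-inverse idea can be checked with a small software model. This is a sketch under stated assumptions: the multiplicative inverse is computed here by plain exponentiation (a^254) in GF(2^8), not by the composite field arithmetic used in the proposed hardware, and all function names are hypothetical. The affine constants ('63' and '05') are those of the standard AES S-box.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; by convention 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    """Forward affine transform of the AES S-box."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y: int) -> int:
    """Inverse affine transform of the AES S-box."""
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sub_byte(x: int) -> int:
    """SubBytes: multiplicative inverse, then affine transform."""
    return affine(gf_inv(x))


def inv_sub_byte(y: int) -> int:
    """Inverse SubBytes: inverse affine transform, then the same inverse."""
    return gf_inv(inv_affine(y))
```

Both directions call the same `gf_inv`, which is the block a combined encryptor/decryptor can share.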
... store all the round keys, and the other one is to produce them on-the-fly. The first approach consumes more area. In the second approach, the initial key is divided into Nk words ($key_0, key_1, \ldots, key_{Nk-1}$), which are used as the initial words. The remaining words are generated iteratively from these initial words. Nk is 4, 6, or 8 when the key length is 128, 192, or 256 bits, respectively. Each round key consists of four consecutive words:

Roundkey(i) = $\{w_{4i}, w_{4i+1}, w_{4i+2}, w_{4i+3}\}$.
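The iterative word generation and the Roundkey(i) grouping can be sketched in software. The sketch below follows the standard FIPS-197 key schedule, which is an assumption: it is not the on-the-fly hardware module proposed here, and the names `expand_key` and `round_key` are hypothetical. The S-box is derived from the multiplicative inverse and affine transform so that the block is self-contained.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def _sbox_entry(x: int) -> int:
    """S-box value: inverse (x^254, 0 -> 0) followed by the affine transform."""
    inv = 1
    for _ in range(254):
        inv = _gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key: bytes) -> list[list[int]]:
    """Return the words w[0 .. 4(Nr+1)-1], each a list of 4 bytes."""
    nk = len(key) // 4
    if len(key) % 4 or nk not in (4, 6, 8):
        raise ValueError("key must be 16, 24, or 32 bytes")
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]  # w_i = key_i
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = list(w[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]          # rotate the word by one byte
            temp = [SBOX[b] for b in temp]      # substitute each byte
            temp[0] ^= rcon                     # add the round constant
            rcon = _gf_mul(rcon, 0x02)
        elif nk > 6 and i % nk == 4:
            temp = [SBOX[b] for b in temp]      # extra step for 256-bit keys
        w.append([a ^ b for a, b in zip(w[i - nk], temp)])
    return w


def round_key(w: list[list[int]], i: int) -> bytes:
    """Roundkey(i) = {w[4i], w[4i+1], w[4i+2], w[4i+3]}."""
    return bytes(b for word in w[4 * i:4 * i + 4] for b in word)
```

A stored-key design would keep the whole list `w`; an on-the-fly design needs only the previous Nk words to produce the next ones.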
Figure 5. Data path for key generator (blocks: Sbox(Rot(Y)), Rcon[i])

The key expansion procedure can be described by the pseudo code listed below:

    for i = 0 to Nk-1
        w_i = key_i
    end
    for i = Nk to 4(Nr + 1)-1
        temp = w_{i-1}
        ...

... with r sub stages. Each round unit is divided into r sub-stages with equal delays.

In the LUT method, sub-pipelining is limited to only two sub-stages, whereas combinational logic can be divided into more sub-stages with equal delays. In these pipelined and sub-pipelined architectures, a plaintext block is received at each clock cycle through the input register. The number of clock cycles needed to complete a single round depends on the number of sub-stages. Round keys are generated by the key expansion module and supplied to each round. At each clock cycle the data is shifted to the next stage, and the first output appears only after the end of the ((10*r)+10)th clock cycle, where r is the number of sub-pipeline stages. The advantage of this structure is that the second output is available in the very next clock cycle after the first. Each round contains SubBytes, ShiftRows, MixColumns, and AddRoundKey, which are explained in the previous sections.
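The latency behaviour described above can be expressed as a small calculation. This is only a restatement of the (10*r)+10 formula from the text; the function names and the `rounds` parameter are hypothetical, and generalizing beyond 10 rounds is an assumption that the text does not make.

```python
def first_output_cycle(r: int, rounds: int = 10) -> int:
    """Clock cycle at which the first result appears: (rounds * r) + rounds."""
    if r < 0:
        raise ValueError("number of sub-stages must be non-negative")
    return rounds * r + rounds


def output_cycle(block_index: int, r: int, rounds: int = 10) -> int:
    """Cycle at which block `block_index` (0-based) leaves the pipeline.

    Once the pipeline is full, one block completes per clock cycle.
    """
    return first_output_cycle(r, rounds) + block_index
```

With r = 3 the first result appears at cycle 40 and the second at cycle 41, so the initial latency is paid only once for a stream of blocks.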
Figure 7. Sub pipelining architecture

V. RESULTS COMPARISON
The AES architecture was implemented in Verilog HDL and simulated using Cadence ncsim. We implemented two designs. AES(LUT) is a pipelined implementation using the lookup table method, with an initial latency of 10 clock cycles. AES(SP) is a fully sub-pipelined implementation using the non-LUT method, with 3 sub-stages in each round and an initial latency of 40 clock cycles. Compared to the LUT method, the non-LUT implementation results in less area. The FPGA implementation of this design targets the Xilinx XC5VLX110T-1, and the corresponding results are tabulated in TABLE 2. The fully sub-pipelined 128-bit architecture with 10 round units was synthesized in RTL Compiler using TSMC's 90 nm standard cells, and the corresponding results are tabulated in TABLE 3. This fully sub-pipelined design achieves a throughput of 58.18 Gbps, which is faster than previous ASIC implementations. The backend of the design was done in SOC Encounter, and the final chip layout is shown in Fig. 8.
TABLE 2
FPGA comparison results

Design                  Device          Fmax     Throughput   Slices   BRAMs   Mbps/slice
                                        (MHz)    (Gbps)
Elbirt et al.*          XCV1000-4       31.8     1.938        10992    0       0.176
McLoone et al.*         XCV812E-8       93.9     12.02        2000     244     0.362
Jarvinen*               XCV1000E-8      129.2    16.5         11719    0       1.4
Saggese*                XCV2000E-8      158      20.3         5810     100     1.09
Standaert*              XCV3200E-8      145      18.5         15112    0       1.28
Parhi (r = 3)*          XCV812E-8       93.5     11.965       9406     0       1.272
Parhi (r = 7)*          XCV1000E-8      168.4    21.556       11022    0       1.956
Ours (LUT-pipelined)    XC5VLX110T-1    103.4    13.238       4611     60      1.077
Ours (sub-pipelined)    XC5VLX110T-1    202.26   25.89        8896     0       2.91
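The throughput and efficiency figures reported for our designs can be cross-checked with a few lines of arithmetic. The sketch assumes that a fully pipelined design emits one 128-bit block per clock cycle, and that one BRAM is counted as 128 slices in the Mbps/slice column. The second assumption is inferred because it reproduces the tabulated values for the designs that use BRAMs; the text does not state it. All names are hypothetical.

```python
BLOCK_BITS = 128        # one AES block per clock once the pipeline is full
SLICES_PER_BRAM = 128   # assumption inferred from the Mbps/slice column


def throughput_gbps(fmax_mhz: float) -> float:
    """Throughput of a fully pipelined design in Gbps."""
    return BLOCK_BITS * fmax_mhz / 1000.0


def fmax_from_critical_path(critical_path_ns: float) -> float:
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1000.0 / critical_path_ns


def mbps_per_slice(gbps: float, slices: int, brams: int = 0) -> float:
    """Efficiency in Mbps per slice, counting each BRAM as equivalent slices."""
    return gbps * 1000.0 / (slices + brams * SLICES_PER_BRAM)


# Sub-pipelined FPGA design: 202.26 MHz, 8896 slices, no BRAMs.
fpga_sp = throughput_gbps(202.26)
eff_sp = mbps_per_slice(25.89, 8896)

# Relative efficiency gain over the best previous entry (1.956 Mbps/slice).
gain = eff_sp / 1.956 - 1.0

# ASIC AES(SP) at 90 nm: 2.2 ns critical path (TABLE 3).
asic_sp = throughput_gbps(fmax_from_critical_path(2.2))
```

Under these assumptions the values come out at about 25.89 Gbps, 2.91 Mbps/slice, a gain of roughly 48.8%, and 58.18 Gbps, which agrees with the reported numbers.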
TABLE 3
Synthesis results (ASIC)

Design               AES(LUT)    AES(SP)    AES(SP)
Technology           90 nm       90 nm      180 nm
Area (um2)           740870      564036     2258469
Power (mW)           136.995     147.78     655.5
Critical path        3.9 ns      2.2 ns     4.2 ns
Fmax (MHz)           256.4       454.5      238
Throughput (Gbps)    32.82       58.18      30.47

VI. CONCLUSION
In this paper, we presented a hardware implementation of an efficient pipelined AES architecture that includes both encryption and decryption. The sub-pipelining architecture helped us to achieve higher throughput than earlier implementations. The design is modeled in Verilog HDL and simulated with Cadence NCsim. Synthesis is done with RTL Compiler v9.10 and physical design with SOC Encounter. With the proposed sub-pipelining architecture, the throughput reaches 58.18 Gbps.
Figure 8. Final chip layout

REFERENCES
[1] J. Daemen and V. Rijmen, "AES Proposal: Rijndael, AES algorithm submission," September 3, available: http://www.nist.gov/CryptoToolkit.
[2] "Draft FIPS for the AES," available from: ...
[12] X. Zhang and K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on VLSI Systems, vol. 12, Sep. 2004.
[13] N. Sklavos and O. Koufopavlou, "Architectures and VLSI Implementations of the AES-Proposal Rijndael," IEEE Trans. on Computers, vol. 51, Issue 12, pp. 1454-1459, 2002.
[14] R. Karri, K. Wu, P. Mishra, and Y. Kim, "Concurrent Error Detection Schemes for Fault-Based Side-Channel Cryptanalysis of Symmetric Block Ciphers," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, No. 12, Dec. 2002.
[15] C.-H. Yen, T.-Y. Pai, and B.-F. Wu, "The implementations of the re-configurable Rijndael algorithm with throughput of 4.9 Gbps," in Proc. 16th VLSI Des./CAD Symp., Hualien, Taiwan, Aug. 2005.
[16] M. Alam, W. Badawy, and G. Jullien, "A novel pipelined threads architecture for AES encryption algorithm," in Proc. IEEE Int. Conf. Appl.-Specific Syst., Architectures, Process., San Jose, CA, Jul. 2002, pp. 296–302.