
INSTRUCTION SET EXTENSIONS FOR
ENHANCING THE PERFORMANCE OF SYMMETRIC KEY

CRYPTOGRAPHIC ALGORITHMS
BY
SEAN R. OMELIA
BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
UNIVERSITY OF MASSACHUSETTS LOWELL
Signature of Author Date
Dr. Adam J. Elbirt
Thesis Advisor
Prof. George P. Cheney
Thesis Committee Member
Dr. Dalila B. Megherbi
Thesis Committee Member
ABSTRACT OF A THESIS SUBMITTED TO THE FACULTY OF THE
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE IN ENGINEERING
UNIVERSITY OF MASSACHUSETTS LOWELL
2007
Thesis Advisor: Dr. Adam J. Elbirt
Assistant Professor, Department of Computer Science
ABSTRACT
In this thesis, instruction set extensions for a RISC processor are presented to improve the performance in software of the Data Encryption Standard (DES), Triple-DES, International Data Encryption Algorithm (IDEA), and Advanced Encryption Standard (AES) algorithms. The most computationally intensive operations of each algorithm are handled by a set of new instructions. The hardware supporting these instructions is integrated into the processor's datapath. For each of the targeted algorithms, comparisons are presented between traditional software implementations and new implementations that take advantage of the extended instruction set architecture. Results show that utilization of the proposed instructions significantly reduces program code size and improves encryption and decryption throughput. The additional hardware resources required by all of the custom hardware increase the total area of the processor by less than fifty percent.
ACKNOWLEDGEMENTS
There are several people I wish to thank for their assistance and support
in the completion of this thesis. I would like to express many thanks to my advisor,
Dr. Adam J. Elbirt, who has been an excellent guide throughout all stages of the
research, and Prof. George Cheney and Dr. Dalila Megherbi for their membership
on the defense committee. I received a great deal of support on technical matters
from Gaisler Research, the creator of the LEON2 processor, and the members of the
Instruction Set Extensions for Cryptography Project at Graz University of Technology.
Their advice was most helpful for understanding the LEON2 model and how the
processor architecture can be extended. I would also like to thank all of my friends
and family, who have encouraged me throughout the course of my work.
Contents
List of Figures
List of Tables
1 INTRODUCTION
2 PREVIOUS WORK
3 THE LEON2 PROCESSOR
3.1 Overview
3.2 VHDL Model Hierarchy
3.3 Processor Architecture
3.4 SPARC® V8 Instruction Model
3.5 Customization
3.6 Synthesis and Simulation
3.7 Software Development Tools
4 TARGET ALGORITHMS
4.1 Triple-DES
4.1.1 The DES Algorithm
4.1.2 DES Key Schedule
4.1.3 The Triple-DES Algorithm
4.1.4 Triple-DES Key Schedule
4.1.5 Performance
4.2 IDEA
4.2.1 Mathematical Background
4.2.2 Algorithm Description
4.2.3 Key Schedule
4.2.4 Performance
4.3 AES
4.3.1 Mathematical Background
4.3.2 Algorithm Description
4.3.3 Key Schedule
4.3.4 Performance
4.4 Modes of Operation
5 PROPOSED INSTRUCTION SET EXTENSIONS
5.1 DES and Triple-DES
5.1.1 Initial and Final Permutations
5.1.2 Set Encryption Direction
5.1.3 Key Loading
5.1.4 Round Core (f) Function
5.1.5 New DES and Triple-DES Algorithm Implementations
5.2 IDEA
5.2.1 Multiplication modulo 2^16 + 1
5.2.2 New IDEA Algorithm Implementation
5.3 AES
5.3.1 SubBytes Operations
5.3.2 GF(2^m) Matrix Multiplier Constant Loading
5.3.3 GF(2^m) Matrix Multiplication
5.3.4 New AES Algorithm Implementations
6 LEON2 HARDWARE AND SOFTWARE TOOLCHAIN MODIFICATIONS
6.1 Custom Hardware Units
6.1.1 DES Permutation Unit
6.1.2 DES Round f-function Unit
6.1.3 DES Key Generator
6.1.4 Modulo (2^16 + 1) Multiplier
6.1.5 AES S-Boxes
6.1.6 Galois Field Fixed Field Constant Multiplier
6.2 Architecture Modifications
6.3 Modifications to Software Development Tools
7 RESULTS AND ANALYSIS
7.1 Testing Methodology
7.2 Software Code Size
7.3 Algorithm Execution Times
7.4 Hardware Utilization
7.5 Throughput to Area Comparisons
8 CONCLUSIONS AND FUTURE WORK
REFERENCES
Appendix A: VHDL Source for Custom Functional Units
Appendix B: Modifications to LEON2 VHDL Model and Development Tools
Appendix C: Test Vectors for Functional Evaluation
Appendix D: Example Source Code for Functional and Performance Evaluations
About the Author
List of Figures
1 Structure of SPARC® V8 Format 3 instructions
2 Block diagram for standard block ciphers
3 The Data Encryption Standard algorithm
4 The DES f-function
5 The computation graph for IDEA
6 The AES encryption process
7 The state representation of data blocks in AES
8 The AES decryption process
9 The key expansion process for AES
10 DES encryption routine with custom instructions
11 DES decryption routine with custom instructions
12 Triple-DES encryption routine with custom instructions
13 Triple-DES decryption routine with custom instructions
14 IDEA algorithm routine with custom instructions
15 AES encryption routine with aessb and gfmmul instructions
16 AES decryption routine with aessb and gfmmul instructions
17 AES encryption routine with aessb4 and gfmmul instructions
18 AES decryption routine with aessb4 and gfmmul instructions
19 DES permutation unit
20 DES key generator
List of Tables
1 LEON2 VHDL Model File Hierarchy
2 The Initial Permutation IP
3 The Expansion Operation E
4 The DES S-Boxes
5 The Pre-Output Permutation P
6 The Final Permutation FP
7 Permuted Choice 1 (PC-1)
8 Rotations for the DES key schedule
9 Permuted Choice 2 (PC-2)
10 IDEA key schedule
11 Interpretation of the asi field for the DES permutation instructions
12 Usage of the simm13 field by the aessb and aessb4 instructions
13 Code size in bytes for DES
14 Code size in bytes for Triple-DES
15 Code size in bytes for IDEA
16 Code size in bytes for AES without gfmmul instruction
17 Code size in bytes for AES with gfmmul instruction
18 Execution cycles for DES
19 Execution cycles for Triple-DES
20 Execution cycles for IDEA
21 Execution cycles for AES without gfmmul instruction
22 Execution cycles for AES with gfmmul instruction
23 Comparison with ISEC Extensions in C/Inline Assembly
24 Comparison with ISEC Extensions in Pure Assembly
25 Hardware utilization on the Xilinx XC4VLX25 FPGA
26 Throughput to area ratios for algorithm implementations
1 INTRODUCTION
With more than 188 million Americans connected to the Internet [1], information security has become a top priority. Many applications, such as electronic mail, electronic banking, medical databases, and electronic commerce, require the exchange of private information. For example, when engaging in electronic commerce, customers provide credit card numbers when purchasing products. If the connection is not secure, an attacker can easily obtain this sensitive data. To implement a comprehensive security plan that guarantees the security of a connection on a given network, the following services must be provided [2], [3], [4]:
Confidentiality: Information cannot be observed by an unauthorized party. This is accomplished via public-key and private-key encryption.
Data Integrity: Transmitted data within a given communication cannot be altered in transit, whether by error or by an unauthorized party. This is accomplished via the use of hash functions and Message Authentication Codes.
Authentication: Parties within a given communication session must provide certifiable proof of their identity. This is accomplished via the use of digital signatures.
Non-repudiation: Neither the sender nor the receiver of a message may deny transmission. This is accomplished via digital signatures and third party notary services.
Cryptographic algorithms used to ensure confidentiality fall within one of two categories: private-key (also known as symmetric-key) and public-key. Symmetric-key algorithms use the same key for both encryption and decryption. Conversely, public-key algorithms use a public key for encryption and a private key for decryption. In a typical session, a public-key algorithm will be used for the exchange of a session key and to provide authenticity through digital signatures. The session key is then used in conjunction with a symmetric-key algorithm. Symmetric-key algorithms tend to be significantly faster than public-key algorithms and as a result are typically used in bulk data encryption [3]. The two types of symmetric-key algorithms are block ciphers and stream ciphers. Block ciphers operate on a block of data while stream ciphers encrypt individual bits. Block ciphers are typically used when performing bulk data encryption, and the data transfer rate of the connection then depends directly on the throughput of the implemented algorithm.
High-throughput encryption and decryption are becoming increasingly important in the area of high-speed networking. Many applications demand the creation of networks that are both private and secure while using public data-transmission links. These systems, known as Virtual Private Networks (VPNs), can demand encryption throughputs at speeds exceeding Asynchronous Transfer Mode (ATM) rates of 622 million bits per second (Mbps). Increasingly, security standards and applications are defined to be algorithm independent. Although context switching between algorithms can be easily realized via software implementations, the task is significantly more difficult when using hardware implementations. The advantages of a software implementation include ease of use, ease of upgrade, ease of design, portability, and flexibility. However, a software implementation offers only limited physical security, especially with respect to key storage [3], [5]. Conversely, cryptographic algorithms that are implemented in hardware are by nature more physically secure as they cannot easily be read or modified by an outside attacker when the key is stored in special memory internal to the device [5]. As a result, the attacker does not have easy access to the key storage area and cannot discover or alter its value in a straightforward manner [3].
When using a general-purpose processor, even the fastest software implementations of block ciphers cannot satisfy the required bulk data encryption data rates for high-end applications [6], [7], [8], [9], [10]. As a result, hardware implementations are necessary for block ciphers to achieve this required performance level. Although traditional hardware implementations lack flexibility with respect to algorithm and parameter switching, configurable hardware devices offer a promising alternative for the implementation of processors via the use of IP cores in Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technology. To illustrate, Altera Corporation offers IP core implementations of the Intel 8051 microcontroller and the Motorola 68000 processor in addition to their own Nios®-II embedded processor [11]. Similarly, Xilinx Inc. offers IP core implementations of the PowerPC processor in addition to their own MicroBlaze™ and PicoBlaze™ embedded processors [12]. ASIC and FPGA technologies provide the opportunity to augment the existing datapath of a processor implemented via an IP core to add acceleration modules supported through newly defined instruction set extensions targeting performance-critical functions [13], [14], [15]. Moreover, many licensable and extendible processor cores are also available for the same purpose [16], [17], [18], [19].
The use of instruction set extensions follows the hardware/software co-design paradigm to achieve the performance and physical security associated with hardware implementations while providing the portability and flexibility traditionally associated with software implementations [20]. Moreover, when considering alternative solutions, instruction set extensions result in significant performance improvements versus traditional software implementations with considerably reduced logic resource requirements versus hardware-only solutions such as co-processors [21], [22], [23], [24], [25], [26], [27], [28], [29]. It is the goal of this research to demonstrate a set of instruction set extensions for a reduced instruction set computing (RISC) processor that enhance the performance of symmetric-key algorithms in software implementations.
Chapter 2 discusses related work on the various methods of speeding up symmetric-key algorithms in software. These include optimization of pure software implementations, off-loading of cryptographic algorithm execution to co-processors, and other instruction set extensions. It will be shown that advances in technology have fueled trends towards increased reconfigurability in embedded systems, resulting in instruction set extensions becoming a more viable and attractive option when the performance of symmetric-key algorithms is critical.
Chapter 3 describes the target processor, the LEON2 RISC processor, whose architecture is based on the SPARC® architecture. The LEON2 processor was chosen for its robust design, ease of configurability, and customization through the full availability of the model's hardware description language (HDL) source code.
In Chapter 4, the cryptographic algorithms that are the focus of this research are explained. These include the Data Encryption Standard (DES) and Triple-DES, the International Data Encryption Algorithm (IDEA), and the Advanced Encryption Standard (AES). The various performance bottlenecks that are commonly encountered in software implementations of these algorithms are also discussed.
Syntax and encoding of the newly developed instructions are presented in Chapter 5, followed by a description of the modifications to the LEON2 processor and its associated development tools in Chapter 6. To evaluate the effectiveness of the instruction set extensions, Chapter 7 presents data on the logic utilization of the custom hardware, as well as throughput data for the target algorithms both with and without the use of the custom instructions. The thesis concludes in Chapter 8 with recommendations for investigating further architectural enhancements to the LEON2 processor for supporting symmetric-key algorithms.
2 PREVIOUS WORK
Most traditional methods for improving the throughput of pure software implementations of symmetric-key algorithms fall into one of two categories. One option is to construct memory-based look-up tables where results of some of the basic operations of the algorithm have been pre-computed and stored. The substitution boxes, or S-Boxes, of the DES and AES algorithms are commonly stored in look-up tables in software implementations. Look-up tables may also be used to combine operations used in the DES and AES algorithms. An implementation of DES in [30] combines the S-Box table look-up with the subsequent 32-bit permutation and uses tables as part of the Initial and Final Permutations. The AES algorithm requires several complicated mathematical operations that are time-consuming on general-purpose processors. Therefore, in some implementations, large look-up tables, called T-tables, are employed that combine several of these complex operations into a single table access [20]. A look-up table based implementation is a viable option for systems with a large memory space and low memory access times. However, area-constrained systems suffer large performance penalties under these implementations [20], [29], so look-up tables are generally not employed in those environments.
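The idea of folding a substitution and the permutation that follows it into one pre-computed table can be sketched on a toy scale. The 4-bit substitution table and bit permutation below are arbitrary values chosen for illustration; they are assumptions of this sketch and are not the tables used by DES, AES, or the implementation in [30].

```python
# Toy illustration only: an arbitrary 4-bit substitution and bit permutation
# (hypothetical values, not taken from DES or AES).
TOY_SBOX = [0x6, 0xB, 0x0, 0x4, 0xD, 0x9, 0xF, 0x2,
            0x8, 0x1, 0xE, 0x7, 0x3, 0xC, 0x5, 0xA]

# TOY_PERM[i] is the input bit index (0 = least significant) feeding output bit i.
TOY_PERM = [2, 0, 3, 1]


def permute4(x):
    """Rearrange the four bits of x according to TOY_PERM."""
    out = 0
    for out_bit, in_bit in enumerate(TOY_PERM):
        out |= ((x >> in_bit) & 1) << out_bit
    return out


def substitute_then_permute(x):
    """Reference path: one table access followed by a bit-by-bit permutation."""
    return permute4(TOY_SBOX[x])


# Pre-compute the combined table once; each later use is a single table access.
COMBINED = [substitute_then_permute(x) for x in range(16)]
```

The combined table trades memory for time: the per-block work shrinks to one indexed load, but the table must be resident in memory, which is the cost noted above for constrained systems.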
Another method for speeding up software implementations of cryptographic algorithms involves taking advantage of mathematical or structural properties of the particular algorithm. The Initial and Final Permutations of the DES algorithm have regular structures that make it possible to execute a series of matrix transformations and exclusive-OR operations as demonstrated in [31]. This translates into a sequence of instructions that is much smaller than the traditional sequence required to perform the Initial and Final Permutations. In previous work on improving the performance of the AES algorithm on 32-bit systems, it has been shown that transforming a block of plaintext from a column-oriented matrix to a row-oriented matrix reduces the number of instructions required to complete the cipher rounds. In particular, a row-oriented representation allows for a more efficient implementation of the Galois Field matrix multiplication operations required for AES encryption and decryption [32].
In order to extend the cryptographic capabilities of an embedded system without modifying the main processor, a co-processor solution can be adopted. When there is data that must be encrypted or decrypted through the chosen symmetric-key algorithm, the main processor sends the data and key material to the co-processor, and the co-processor performs the algorithm, sending the processed data back over the interface to the main processor. Most co-processor solutions have tended to combine a number of different algorithms to provide a multi-faceted security solution. Co-processors have generally achieved high throughput values compared to traditional software implementations and therefore are much more capable of meeting demands for speed-critical network communications. However, this type of solution is generally associated with considerable overhead in terms of hardware area utilization, data transfer latency, and complex interfaces to the main processor [23], [33], [34], [35], [36], [37], [38], [39], [40].
There has been previous work on instruction set extensions for general permutations that are useful for improving the performance of permutations for the DES algorithm. Shi and Lee [41] presented two new instructions for general and dynamically specified permutations. The input and a string of configuration bits are specified in the source operands and the result is stored in the destination register. These instructions, along with two new instructions, are discussed in [21]. In general, permutations of n bits require $\log_2(n)$ issues of the custom instructions, as well as several loads into registers of configuration bits. The MOSES platform developed by a group based at NEC Research Laboratories is based on the Xtensa T1040, a RISC-like processor designed to be easily extended with additional custom hardware and supporting instructions. Throughput improvement factors of 31.0 for DES, 33.9 for Triple-DES, and 17.4 for AES were reported for this custom architecture [42], [43].
The effect of custom instructions that support the AES cipher has been studied extensively. Most of this work targets the memory look-ups and multiplications that are needed to perform the encryption rounds and key schedule. The Instruction Set Extensions for Cryptography (ISEC) project conducted at the Graz University of Technology in Graz, Austria, has investigated instruction set extensions that perform the mathematical operations in the AES rounds using custom functional units integrated into the targeted processor's datapath [29]. Earlier work in the ISEC project demonstrated the effectiveness of instruction set extensions for elliptic curve cryptography [27] in improving the performance of binary extension Galois Field arithmetic [28].
3 THE LEON2 PROCESSOR
3.1 Overview
The target processor for this work is the LEON2, a RISC central processing unit (CPU) that was produced by Gaisler Research [44] (note that at the time of this writing, support for the LEON2 processor has been discontinued in favor of the newer LEON3 processor model). The LEON2 processor is implemented in VHDL and is fully synthesizable. The model is highly configurable, allowing for adjustments to many features of the processor using a graphical configuration utility. The entire source code is freely available under the GNU General Public License, which enables custom modifications and enhancements to the architecture. Information presented in this chapter is derived from the LEON2 documentation [45], [46].
3.2 VHDL Model Hierarchy
The source code for the LEON2 processor has the directory structure shown
in Table 1. The top-level folder /leon2/ is used as an example; it is permissible for
the root directory to have any name.
3.3 Processor Architecture
LEON2 is based on the Scalable Processor Architecture (SPARC®). SPARC® was first developed in 1985 at Sun Microsystems and is based on the work that produced the RISC I and RISC II architectures at the University of California at Berkeley during the early 1980s [47]. The LEON2 processor attained full certification of compliance with the SPARC® V8 architecture in 2003 [48].
Folder            Description                        Refer to
leon2/            Top directory                      Sec. 3.2
leon2/boards/     FPGA board support files           Sec. 3.6
leon2/doc/        User manuals                       [46]
leon2/leon/       LEON2 processor VHDL model         Sec. 3.3
leon2/pmon/       Simple boot-monitor                Not discussed
leon2/sim/        Simulator support files            Sec. 3.6
leon2/syn/        Synthesis support files            Sec. 3.6
leon2/tbench/     LEON2 VHDL test bench              Sec. 3.6
leon2/tkconfig/   Graphical configuration utility    Sec. 3.5
leon2/tsource/    LEON2 test bench (C source)        Sec. 3.6
Table 1: LEON2 VHDL Model File Hierarchy
Features of the LEON2 processor coding style include fully synchronous design with a single clock, use of multiplexers for loading of pipeline registers, separate combinational and sequential processes, and record types for interconnection of component I/O signals. LEON2 provides support for on-chip peripherals such as a floating-point unit (FPU), Peripheral Component Interconnect (PCI), and Ethernet; co-processor support is also available in accordance with the SPARC® model. However, these features are outside the scope of this research and are therefore not discussed in any further detail. The main focus of the LEON2 architecture with regard to the proposed instruction set extensions is the pipelined integer unit (IU). The IU pipeline consists of five stages: fetch, decode, execute, memory, and write back.
The VHDL model implements each stage in its own process. A process statement in VHDL is a closed block of code that runs sequentially. The inputs are specified by a sensitivity list. The process statement executes whenever a signal in the sensitivity list changes state. Processes are used for behavioral VHDL code, a high-level coding style commonly used for describing sequential logic [49].
3.4 SPARC® V8 Instruction Model
All SPARC® V8 instructions are implemented in the LEON2 processor architecture. Instructions are grouped according to the values of the various fields in the instruction operation code. Arithmetic, logic, and memory operations have the Format 3 structure [47] shown in Figure 1.
op | rd | op3 | rs1 | i=0 | asi | rs2
op | rd | op3 | rs1 | i=1 | simm13
Figure 1: Structure of SPARC® V8 Format 3 instructions
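The two Format 3 layouts can be made concrete with a small decoder. This is a minimal sketch: the function name is hypothetical, and the bit positions (op in bits 31:30, rd in 29:25, op3 in 24:19, rs1 in 18:14, i in bit 13, asi in 12:5, rs2 in 4:0, simm13 in 12:0) are taken from the SPARC V8 architecture manual rather than from Figure 1, which shows only the field order.

```python
def decode_format3(word):
    """Split a 32-bit SPARC V8 Format 3 instruction word into its fields."""
    fields = {
        "op":  (word >> 30) & 0x3,    # bits 31:30
        "rd":  (word >> 25) & 0x1F,   # bits 29:25, destination register
        "op3": (word >> 19) & 0x3F,   # bits 24:19, operation selector
        "rs1": (word >> 14) & 0x1F,   # bits 18:14, first source register
        "i":   (word >> 13) & 0x1,    # bit 13 selects register or immediate form
    }
    if fields["i"] == 0:
        fields["asi"] = (word >> 5) & 0xFF   # bits 12:5
        fields["rs2"] = word & 0x1F          # bits 4:0, second source register
    else:
        simm13 = word & 0x1FFF               # bits 12:0
        if simm13 & 0x1000:                  # sign-extend the 13-bit immediate
            simm13 -= 0x2000
        fields["simm13"] = simm13
    return fields
```

For instance, the word 0x86004002, which encodes add %g1, %g2, %g3, decodes to op = 2, rd = 3, op3 = 0, rs1 = 1, i = 0, asi = 0, rs2 = 2.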
3.5 Customization
Most of the available features of the LEON2 processor can be enabled, disabled, or adjusted by using the graphical configuration utility. For the purposes of this work, a basic configuration is used with no FPU, PCI, Ethernet, co-processor interface, or hardware multiplier or divider. To extend the LEON2 architecture beyond the scope of the standard model, additional VHDL code is required. The specific files that must be modified depend on what functionality is to be added, but if the instruction set is to be extended, the module containing the SPARC® V8 opcode constants must be updated, and these instructions must follow the SPARC® V8 architecture specification [47]. The graphical configuration utility may also be modified to provide an easy interface for adjusting parameters of the custom functionality.
3.6 Synthesis and Simulation
The LEON2 VHDL implementation is a fully synthesizable processor that can be targeted to any type of FPGA or ASIC technology. There are pre-made packages for several synthesis tools such as XST, Synplify, Synopsys, and Leonardo in the /syn/ sub-folder of the LEON2 directory structure. These packages enable use of technology-specific cells to directly instantiate or automatically infer the register files, caches, PCI FIFOs, and I/O pads. There are also a number of packages in the /boards/ sub-folder that support programming of physical FPGA boards [50] with the LEON2 architecture.
Functional verification of programs built for the LEON2 architecture can be performed with the provided generic test bench. The VHDL source for the test bench is located in the /tbench/ sub-folder of the LEON2 directory structure. Software code is placed in the /tsource/ sub-folder in a format readable by the test bench VHDL code. The software can then be read and executed by the test bench for purposes of functional verification and performance evaluation.
3.7 Software Development Tools
In order to facilitate the development of programs targeting the LEON2 processor, Gaisler Research has provided a series of compilers and simulators that may be chosen depending on the software environment. For stand-alone applications, the Bare C Compiler (BCC) is recommended. BCC is based on the GNU Compiler Collection (GCC) and GNU binutils. The BCC development tools are used in the same way as those included in the standard GCC and binutils packages. The actual names of the executables have a sparc-elf- prefix.
Packages containing the binaries for Linux and Cygwin environments are available, as well as the full source code for developers who wish to take advantage of an expanded LEON2 architecture.
4 TARGET ALGORITHMS
4.1 Triple-DES
4.1.1 The DES Algorithm
Many block ciphers may be characterized as Feistel networks [3]. Feistel networks were invented by Horst Feistel [51] and are a general method of transforming a function into a permutation. The basic Feistel network divides the data into two halves where one half operates upon the other [52]. The f-function uses one of the halves of the data block and a key to create a pseudo-random bit stream that is used to encrypt or decrypt the other half of the data block. Therefore, encrypting or decrypting both halves requires two iterations of the Feistel network.
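The basic Feistel structure described above can be sketched in a few lines. The round function toy_f, its constant, and the 32-bit half width are assumptions made for illustration only; the point of the sketch is that the network is invertible for any choice of f, and that running the same routine with the round keys in reverse order undoes the encryption.

```python
MASK32 = 0xFFFFFFFF


def toy_f(half, key):
    """Stand-in round function (hypothetical); it does not need to be invertible."""
    return ((half * 0x9E3779B1) ^ key ^ (half >> 7)) & MASK32


def feistel(left, right, round_keys, f=toy_f):
    """Run one Feistel round per key on two 32-bit halves."""
    for k in round_keys:
        # One half passes through unchanged; the other is masked with f's output.
        left, right = right, left ^ f(right, k)
    # Undo the final swap so the same routine decrypts with reversed keys.
    return right, left


keys = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
cipher = feistel(0x01234567, 0x89ABCDEF, keys)
plain = feistel(cipher[0], cipher[1], keys[::-1])  # recovers the original halves
```

Each round modifies only one half, which is why at least two rounds are needed before every bit of the block has been processed.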
A generalization of the basic Feistel network allows for the support of larger data blocks. Generalization occurs by considering the data swap as a circular right shift. This allows for the use of the same f-function but requires multiple rounds to input all of the sub-blocks to the f-function [53]. Figure 2 from [53] details the block diagram for block ciphers employing both the basic Feistel network and generalized Feistel networks of three and four blocks. The f-function is represented by the box and the ⊕ symbol represents a bit-wise XOR operation.
Figure 2: Block diagram for standard block ciphers (panels: basic Feistel network on halves L and R; generalized Feistel networks on three blocks A, B, C and on four blocks A, B, C, D; round keys k0 through k3)
The f-function employs confusion and diffusion to obscure redundancies in a plaintext message [54]. Confusion obscures the relationship between the plaintext, the ciphertext, and the key. S-Box look-up tables are an example of a confusion operation. Diffusion spreads the influence of individual plaintext or key bits over as much of the ciphertext as possible. Expansion and permutation functions are examples of diffusion operations [3]. The basic operations that may be found within an f-function include:
Bitwise XOR, AND, or OR.
Modular addition or subtraction.
Shift or rotation by a constant number of bits.
Data-dependent rotation by a variable number of bits.
Modular multiplication.
Multiplication in a Galois field.
Modular inversion.
Look-up-table substitution.
DES is a sixteen-round Feistel network block cipher. A block diagram of
the entire operation is given in Figure 3 from [55]. The DES cipher takes as input a
64-bit key, where 8 of the 64 bits are used for parity and the other 56 bits comprise
the actual key material. The input and output both have a size of 64 bits for both
encryption and decryption. The procedures for encryption and decryption are almost
exactly the same; the only difference is that the key schedule for decryption is the
reverse of that used for encryption.
Figure 3: The Data Encryption Standard algorithm
Throughout the rest of this section, bit ordering is denoted for an n-bit
vector such that bit 1 is the most significant bit and n is the least significant bit.
In all tables that show the bit assignments for DES permutations, the numbers
correspond to input bits that are mapped to a specific position in the output, starting
with output bits 1,2,3,... in the top row and ending with output bits ...,n-2,n-1,n in
the bottom row.
The first part of DES encryption is an Initial Permutation (IP) on the input
block. The IP rearranges the input according to Table 2. The output of the IP is
divided into a left half $L_0$ and a right half $R_0$, which become the input to the first
round. For each round iteration i from 1 to 16:
$$L_i = R_{i-1}$$
$$R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$$
58 50 42 34 26 18 10 2
60 52 44 36 28 20 12 4
62 54 46 38 30 22 14 6
64 56 48 40 32 24 16 8
57 49 41 33 25 17 9 1
59 51 43 35 27 19 11 3
61 53 45 37 29 21 13 5
63 55 47 39 31 23 15 7
Table 2: The Initial Permutation IP
Figure 4 is an illustration of the round function and shows the individual
blocks of the f-function, which is the core operation of each round.
Figure 4: The DES f-function
The following tables show the mappings of input bits to output bits of the
E and P operations, as well as the eight S-Boxes. The E expansion duplicates some
of the bits of the 32-bit input to the f-function as shown in Table 3 and outputs
a 48-bit value. The result of $E(R_{i-1}) \oplus K_i$ is partitioned into eight 6-bit values.
The S-Boxes output a 4-bit number based on a 6-bit input. The input is in the form
$a_5 a_4 a_3 a_2 a_1 a_0$; the row index into the S-Box is the number formed from $a_5 a_0$ and the
column index is the number formed from $a_4 a_3 a_2 a_1$. The outputs of the S-Boxes are
concatenated to form the input to the P permutation. The P permutation rearranges
the 32 bits of the combined S-Box outputs, and the result is XOR-ed with $L_{i-1}$ to
obtain $R_i$ for the current round.
32 1 2 3 4 5
4 5 6 7 8 9
8 9 10 11 12 13
12 13 14 15 16 17
16 17 18 19 20 21
20 21 22 23 24 25
24 25 26 27 28 29
28 29 30 31 32 1
Table 3: The Expansion Operation E
S1
14 4 13 1 2 15 11 8 3 10 6 12 5 9 0 7
0 15 7 4 14 2 13 1 10 6 12 11 9 5 3 8
4 1 14 8 13 6 2 11 15 12 9 7 3 10 5 0
15 12 8 2 4 9 1 7 5 11 3 14 10 0 6 13
S2
15 1 8 14 6 11 3 4 9 7 2 13 12 0 5 10
3 13 4 7 15 2 8 14 12 0 1 10 6 9 11 5
0 14 7 11 10 4 13 1 5 8 12 6 9 3 2 15
13 8 10 1 3 15 4 2 11 6 7 12 0 5 14 9
S3
10 0 9 14 6 3 15 5 1 13 12 7 11 4 2 8
13 7 0 9 3 4 6 10 2 8 5 14 12 11 15 1
13 6 4 9 8 15 3 0 11 1 2 12 5 10 14 7
1 10 13 0 6 9 8 7 4 15 14 3 11 5 2 12
S4
7 13 14 3 0 6 9 10 1 2 8 5 11 12 4 15
13 8 11 5 6 15 0 3 4 7 2 12 1 10 14 9
10 6 9 0 12 11 7 13 15 1 3 14 5 2 8 4
3 15 0 6 10 1 13 8 9 4 5 11 12 7 2 14
S5
2 12 4 1 7 10 11 6 8 5 3 15 13 0 14 9
14 11 2 12 4 7 13 1 5 0 15 10 3 9 8 6
4 2 1 11 10 13 7 8 15 9 12 5 6 3 0 14
11 8 12 7 1 14 2 13 6 15 0 9 10 4 5 3
S6
12 1 10 15 9 2 6 8 0 13 3 4 14 7 5 11
10 15 4 2 7 12 9 5 6 1 13 14 0 11 3 8
9 14 15 5 2 8 12 3 7 0 4 10 1 13 11 6
4 3 2 12 9 5 15 10 11 14 1 7 6 0 8 13
S7
4 11 2 14 15 0 8 13 3 12 9 7 5 10 6 1
13 0 11 7 4 9 1 10 14 3 5 12 2 15 8 6
1 4 11 13 12 3 7 14 10 15 6 8 0 5 9 2
6 11 13 8 1 4 10 7 9 5 0 15 14 2 3 12
S8
13 2 8 4 6 15 11 1 10 9 3 14 5 0 12 7
1 15 13 8 10 3 7 4 12 5 6 11 0 14 9 2
7 11 4 1 9 12 14 2 0 6 10 13 15 3 5 8
2 1 14 7 4 10 8 13 15 12 9 0 3 5 6 11
Table 4: The DES S-Boxes
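The row and column indexing of the S-Boxes can be sketched as follows. This is a minimal illustration that uses only S1 from Table 4; the function name `sbox_lookup` is an assumption introduced here and is not taken from any DES library.

```python
# S1 copied from Table 4: four rows of sixteen 4-bit outputs.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]

def sbox_lookup(table, six_bits):
    """Map a 6-bit input a5..a0 to the 4-bit S-Box output."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 0b01)  # bits a5 and a0
    col = (six_bits >> 1) & 0b1111                      # bits a4 a3 a2 a1
    return table[row][col]

example = sbox_lookup(S1, 0b011011)  # row 01, column 1101
```

The split of one 6-bit value into a 2-bit row and a 4-bit column is what makes a direct byte-addressed table awkward on a general purpose processor.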
16 7 20 21
29 12 28 17
1 15 23 26
5 18 31 10
2 8 24 14
32 27 3 9
19 13 30 6
22 11 4 25
Table 5: The Pre-Output Permutation P
After the final round, the left and right halves of the 64-bit block, $L_{16}$ and
$R_{16}$, are swapped, and then subjected to a Final Permutation (FP). This operation is
simply the inverse of the IP. The bit mapping for this operation is shown in Table 6;
bit positions are represented in the same manner as Table 2 for the IP.
40 8 48 16 56 24 64 32
39 7 47 15 55 23 63 31
38 6 46 14 54 22 62 30
37 5 45 13 53 21 61 29
36 4 44 12 52 20 60 28
35 3 43 11 51 19 59 27
34 2 42 10 50 18 58 26
33 1 41 9 49 17 57 25
Table 6: The Final Permutation FP
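The table convention used above (bit 1 is the most significant bit, and each entry names the input bit that feeds the next output bit) can be sketched with a generic routine. The helper name `permute` is an assumption for illustration; the tables are copied from Table 2 and Table 6.

```python
IP = [58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
      62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
      57, 49, 41, 33, 25, 17, 9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
      61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]

FP = [40, 8, 48, 16, 56, 24, 64, 32, 39, 7, 47, 15, 55, 23, 63, 31,
      38, 6, 46, 14, 54, 22, 62, 30, 37, 5, 45, 13, 53, 21, 61, 29,
      36, 4, 44, 12, 52, 20, 60, 28, 35, 3, 43, 11, 51, 19, 59, 27,
      34, 2, 42, 10, 50, 18, 58, 26, 33, 1, 41, 9, 49, 17, 57, 25]

def permute(value, table, in_width):
    """Build the output MSB-first; entry k selects input bit k (1 = MSB)."""
    out = 0
    for src in table:
        out = (out << 1) | ((value >> (in_width - src)) & 1)
    return out

block = 0x0123456789ABCDEF
round_trip = permute(permute(block, IP, 64), FP, 64)
```

Each output bit costs a shift, a mask, and an OR in this form, which illustrates why bit-level permutations are expensive in software.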
4.1.2 DES Key Schedule
The key schedule for DES operates on the 64-bit master key to produce
a series of 48-bit round keys, each used one at a time for the sixteen rounds of the
cipher. Initially, the bits of the master key are arranged by Permuted Choice 1 (PC-
1) into two 28-bit vectors, C and D (note that every eighth bit is a parity bit and is
discarded). Table 7 depicts the PC-1 operation.
C_0                      D_0
57 49 41 33 25 17  9     63 55 47 39 31 23 15
 1 58 50 42 34 26 18      7 62 54 46 38 30 22
10  2 59 51 43 35 27     14  6 61 53 45 37 29
19 11  3 60 52 44 36     21 13  5 28 20 12  4
Table 7: Permuted Choice 1 (PC-1)
For each round of the cipher, a bit rotation is performed separately on the
C and D values. Rotation moves to the left for encryption, and to the right for
decryption. The rotation amount depends on the round; these amounts are
given for encryption in Table 8. These amounts are carried out in reverse order for
the right-rotations of the decryption key schedule.
Round 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Rotate amount 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1
Table 8: Rotations for the DES key schedule
The round key is the result of passing the current state of the C and D bit
vectors through Permuted Choice 2 (PC-2). This operation maps the concatenation
of C and D to the 48-bit round key as shown in Table 9.
14 17 11 24 1 5
3 28 15 6 21 10
23 19 12 4 26 8
16 7 27 20 13 2
41 52 31 37 47 55
30 40 51 45 33 48
44 49 39 56 34 53
46 42 50 36 29 32
Table 9: Permuted Choice 2 (PC-2)
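The encryption key schedule described in this section can be sketched end to end using Tables 7, 8, and 9. The function names are assumptions made for this illustration, and the sample key is an arbitrary value chosen here, not one taken from this work.

```python
PC1 = [57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18,
       10, 2, 59, 51, 43, 35, 27, 19, 11, 3, 60, 52, 44, 36,      # C half
       63, 55, 47, 39, 31, 23, 15, 7, 62, 54, 46, 38, 30, 22,
       14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]       # D half
PC2 = [14, 17, 11, 24, 1, 5, 3, 28, 15, 6, 21, 10,
       23, 19, 12, 4, 26, 8, 16, 7, 27, 20, 13, 2,
       41, 52, 31, 37, 47, 55, 30, 40, 51, 45, 33, 48,
       44, 49, 39, 56, 34, 53, 46, 42, 50, 36, 29, 32]
ROTATIONS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]

def permute(value, table, in_width):
    """Entry k of the table selects input bit k, where bit 1 is the MSB."""
    out = 0
    for src in table:
        out = (out << 1) | ((value >> (in_width - src)) & 1)
    return out

def rotl28(value, count):
    """Rotate a 28-bit vector left by count bits."""
    return ((value << count) | (value >> (28 - count))) & 0xFFFFFFF

def des_round_keys(key64):
    """Return the sixteen 48-bit round keys for encryption."""
    cd = permute(key64, PC1, 64)          # drops the eight parity bits
    c, d = cd >> 28, cd & 0xFFFFFFF
    keys = []
    for count in ROTATIONS:
        c, d = rotl28(c, count), rotl28(d, count)
        keys.append(permute((c << 28) | d, PC2, 56))
    return keys

round_keys = des_round_keys(0x133457799BBCDFF1)
```

For decryption the same sixteen keys are consumed in reverse order, so a software implementation may simply reverse the returned list.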
4.1.3 The Triple-DES Algorithm
The Triple-DES algorithm has been suggested as a more secure alternative
to DES [55]. As the name suggests, this cipher sequentially executes the DES
algorithm three times with keys $K_1$, $K_2$, and $K_3$, where two or all three of these keys may
be equivalent. The following rules are used for encryption and decryption, where PT
is the plaintext, CT is the ciphertext, $E_{K_i}$ is a DES encryption using $K_i$, and $D_{K_i}$ is
a DES decryption using $K_i$:
$$CT = E_{K_3}(D_{K_2}(E_{K_1}(PT)))$$
$$PT = D_{K_1}(E_{K_2}(D_{K_3}(CT)))$$
Since the output ciphertext from implementations 1 and 2 of DES is used as
the input plaintext to implementations 2 and 3 respectively, and the Initial and Final
Permutations are inverses of each other, the inner Initial and Final Permutations may
be removed from the algorithm.
There are three keying options commonly used for Triple-DES [55]:
Keying Option 1: $K_1$, $K_2$, and $K_3$ are independent.
Keying Option 2: $K_1 = K_3$ and $K_2$ is independent from $K_1$ and $K_3$.
Keying Option 3: $K_1 = K_2 = K_3$.
Note that Keying Option 3 is equivalent to a single iteration of DES.
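The encrypt-decrypt-encrypt composition and the collapse of Keying Option 3 can be sketched as follows. To keep the example short, DES is replaced by a hypothetical invertible 64-bit toy cipher (`toy_encrypt` and `toy_decrypt` are assumptions and offer no security); only the composition order is the point.

```python
MASK64 = (1 << 64) - 1

def toy_encrypt(key, block):
    """Hypothetical stand-in for a DES encryption (not DES itself)."""
    block ^= key
    block = ((block << 13) | (block >> 51)) & MASK64
    return (block + key) & MASK64

def toy_decrypt(key, block):
    """Exact inverse of toy_encrypt."""
    block = (block - key) & MASK64
    block = ((block >> 13) | (block << 51)) & MASK64
    return block ^ key

def ede_encrypt(k1, k2, k3, pt):
    return toy_encrypt(k3, toy_decrypt(k2, toy_encrypt(k1, pt)))

def ede_decrypt(k1, k2, k3, ct):
    # The stages are undone in the opposite order: K3 first, K1 last.
    return toy_decrypt(k1, toy_encrypt(k2, toy_decrypt(k3, ct)))

k1, k2, k3 = 0x0123456789ABCDEF, 0x23456789ABCDEF01, 0x456789ABCDEF0123
pt = 0x5468652071756663
ct = ede_encrypt(k1, k2, k3, pt)
```

With all three keys equal, the inner decryption cancels the first encryption, so the result equals a single encryption; this is what keeps Keying Option 3 backward compatible with single DES.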
4.1.4 Triple-DES Key Schedule
To perform the key schedule for Triple-DES, the key expansion must be
performed on each unique key that is used in the chosen implementation of the algo-
rithm. This means that three key expansions are required for Keying Option 1, two
are required for Keying Option 2, and one is required for Keying Option 3.
4.1.5 Performance
Software implementations of DES tend to be significantly slower than hardware
implementations. Bit-level manipulations such as those contained in the
permutation, expansion, permuted choice, and Cyclic Left/Right Shift units do not map well
to general purpose processors. General purpose processor instruction sets operate on
multiple bits at a time based on the processor word size. Moreover, the DES S-Boxes
do not use memory in an efficient manner. Software look-up tables would appear
to be the obvious implementation choice for the DES S-Boxes. However, the DES
S-Boxes have 6-bit addresses and 4-bit output data while most memories associated
with general purpose processors use byte addressing with either 8-bit or 32-bit output
data. As a result, many software implementations of DES exhibit throughputs that
are at least a full order of magnitude slower than hardware implementations.
Even the best software implementations are only capable of throughputs in
the range of 100 to 200 Mbps. Most of these implementations recommend storing the $L_i$
and $R_i$ data as a 48-bit padded word within a 64-bit processor word and implementing
the permutations and S-Boxes as precomputed look-up tables. Additionally, there is
general agreement that the look-up table implementation for the S-Boxes is most
effective when the size of the look-up tables is minimized, guaranteeing that the
data will fit entirely in on-chip cache. Size minimization of the S-Box look-up tables
is achieved by implementing each S-Box in its own look-up table. Finally, one key
software optimization is the unrolling of software loops to increase performance. Even
when software loops are too cumbersome to unroll, using loop counters that decrement
to zero in place of loop counters that increment to a terminal count is shown to
greatly increase the performance of software implementations of the DES algorithm.
However, the unrolling of software loops must be done with great care such that the
total data storage space does not exceed the size of the on-chip cache as this would
cause extreme performance degradation [56], [57], [58].
DES hardware implementations are easily realized in a single chip, such as
an FPGA or an ASIC. Supporting encryption or decryption of a new block of data
every clock cycle in an implementation operating in a non-feedback mode, such as
Electronic Code Book (ECB) or counter mode, also requires that the chip have at least
128 input pins (for the input data and key) and 64 output pins (for the output data).
Once again, FPGA and ASIC technology provide more than enough I/O pins to meet
these requirements. As a result, numerous fast and efficient DES implementations
have been reported, reaching throughputs in the Gbps range when targeting either FPGAs
or ASICs. Examples of such implementations may be found in [2], [59], [60]. When
operating in feedback modes (such as Cipher Block Chaining (CBC) mode), DES
does not map nicely to pipelined hardware implementations because of the chaining
of blocks. The chaining requires ciphertext block $y_{i-1}$ to process plaintext block $x_i$
and thus simultaneous processing of the two blocks is impossible, requiring that the
pipeline be stalled until generation of the ciphertext block $y_{i-1}$ is completed. However,
the stalling of the pipeline may be avoided in an environment with multiple data
streams, such as in a network processor. In such a situation, the pipeline may be fully
utilized by interleaving the data streams. For a fully pipelined DES implementation
where the atomic unit of the pipeline is the DES round function, the pipeline will
have sixteen stages, thus requiring sixteen interleaved data streams. Let $x_0^{S_0}$ denote
plaintext block 0 from data stream 0, $x_0^{S_1}$ denote plaintext block 0 from data stream
1, etc. Using this notation, the pipeline is filled with blocks $x_0^{S_0}, x_0^{S_1}, x_0^{S_2}, \ldots, x_0^{S_{15}}$.
When $x_0^{S_0}$ has passed the final stage of the pipeline to yield $y_0^{S_0}$, $x_1^{S_0}$ is
ready to enter the first stage of the pipeline and is combined with $y_0^{S_0}$ via the XOR
operation to perform CBC mode chaining. Thus each data stream is encrypted and
decrypted in CBC mode while also maintaining full pipeline utilization, maximizing
the performance of the implementation. Note that such an implementation must also
maintain sixteen Initialization Vectors, one for each data stream, to be combined with
the first plaintext block $x_0$ of the associated data stream via the XOR operation.
The earliest VLSI implementations of DES [61], [62] achieved throughputs
ranging from 20 to 32 Mbps using 3 µm technology. The variances in throughput are
compared and contrasted based upon speed versus area tradeoffs. The implementations
support multiple modes of operation, including ECB, CBC, Cipher Feedback
(CFB), and Output Feedback (OFB) (see [63] for a detailed description of DES modes
of operation). Other ASIC implementations of DES [64], [65] achieve a throughput
of 1 Gbps using 0.8 µm Gallium Arsenide (GaAs) technology. More recently, a DES
ASIC implementation has been demonstrated to operate at up to 10 Gbps using 0.6
µm technology [59].
4.2 IDEA
4.2.1 Mathematical Background
The International Data Encryption Algorithm (IDEA) was originally published
as the Proposed Encryption Standard (PES) by Xuejia Lai and James Massey
[66]. The computations involved in IDEA are based on operations from three different
mathematical groups:
16-bit bitwise exclusive-OR, denoted by $\oplus$,
Addition modulo $2^{16}$, denoted by $\boxplus$,
Multiplication modulo $(2^{16} + 1)$, denoted by $\odot$.
For the third operation, an input of 0x0000 represents the value $2^{16}$. This
is because the $\odot$ operation is performed over the multiplicative group $Z^{*}_{2^{16}+1}$, where
zero is not a member but $2^{16}$ is a member of the group. The value $2^{16}$ is therefore
denoted by 0x0000 so that only sixteen bits are required to represent all possible
values for the inputs to each operation.
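The three group operations can be sketched directly on 16-bit integers. The function names are assumptions for illustration; the multiplication below uses a plain remainder operation rather than any optimized reduction.

```python
def xor16(a, b):
    """16-bit bitwise exclusive-OR."""
    return a ^ b

def add16(a, b):
    """Addition modulo 2**16."""
    return (a + b) & 0xFFFF

def mul16(a, b):
    """Multiplication modulo 2**16 + 1, where 0x0000 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    product = (a * b) % 0x10001
    return product & 0xFFFF  # maps the value 2**16 back to 0x0000

examples = (mul16(0, 0), mul16(0, 1), mul16(2, 3), add16(0xFFFF, 1))
```

Because 2^16 is congruent to -1 modulo 2^16 + 1, multiplying the encoded value 0x0000 by itself yields 1, which the first example illustrates.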
The security of IDEA is based not only on its large key size but also on the
fact that the output of one group operation is never used as an input to the same
operation. Further details are available in the original proposal for PES [66]. IDEA
evolved into its final form [67] due to modifications required to strengthen the cipher
against differential cryptanalysis attacks [68]. IDEA is used in many commercial
applications, such as Pretty Good Privacy (PGP). Like DES, IDEA operates across
64-bit blocks. However, while DES requires a 56-bit key, IDEA requires a 128-bit key,
accounting for the increased security of the cipher as compared to DES.
4.2.2 Algorithm Description
IDEA operates on 64-bit plaintext blocks and has a key size of 128 bits.
The algorithm consists of eight rounds followed by a final transformation to obtain
the output. Similar to DES, the procedure (referred to as a computation graph; see
Figure 5) is the same for both encryption and decryption but different key schedules
are used.
Figure 5 shows the input text as four 16-bit sub-blocks $X_1$, $X_2$, $X_3$, and $X_4$.
These text sub-blocks are combined with the six 16-bit sub-blocks of the round key
for the current round r, labeled $Z_1^{(r)}$ through $Z_6^{(r)}$, using the mathematical operations
noted above.
4.2.3 Key Schedule
The IDEA key schedule for encryption is based on a series of left rotations
of the 128-bit master key. The master key is first partitioned into eight 16-bit blocks;
these are the first eight key sub-blocks: $Z_1^{(1)}$, $Z_2^{(1)}$, $Z_3^{(1)}$, $Z_4^{(1)}$, $Z_5^{(1)}$, $Z_6^{(1)}$, $Z_1^{(2)}$, $Z_2^{(2)}$.
The next eight key blocks are obtained by rotating the key to the left by 25 bits,
then performing the partition again. This process is repeated until all 52 key blocks
are generated (six blocks for each of the eight rounds and four blocks for the final
transformation).
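The rotate-and-partition procedure can be sketched as follows. The function name and the sample key are assumptions chosen for illustration; the key is held as a single 128-bit Python integer.

```python
def idea_encrypt_subkeys(key128):
    """Return the 52 16-bit encryption sub-blocks from a 128-bit master key."""
    mask128 = (1 << 128) - 1
    subkeys = []
    while len(subkeys) < 52:
        # Partition the current key state into eight 16-bit blocks, MSB first.
        for i in range(8):
            subkeys.append((key128 >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the whole 128-bit key left by 25 bits for the next batch.
        key128 = ((key128 << 25) | (key128 >> 103)) & mask128
    return subkeys[:52]  # the seventh batch supplies only four blocks

sample_key = 0x00010002000300040005000600070008
subkeys = idea_encrypt_subkeys(sample_key)
```

In this layout, `subkeys[6 * (r - 1) + (n - 1)]` holds sub-block n of round r, and the last four entries belong to the final transformation.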
Figure 5: The computation graph for IDEA.
The key schedule for decryption is based on the encryption key schedule.
Table 10 shows the relationship between the decryption key blocks and the encryption
key blocks, where $Z_n^{(r)-1}$ represents the multiplicative inverse modulo $(2^{16} + 1)$ of $Z_n^{(r)}$,
and $-Z_n^{(r)}$ represents the additive inverse modulo $2^{16}$ of $Z_n^{(r)}$.
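The two inverses used by the decryption key schedule can be sketched as below. The helper names are assumptions; the multiplicative inverse relies on the fact that 2^16 + 1 = 65537 is prime, so Fermat's little theorem applies. This is one of several possible methods and is not necessarily the one used in this work.

```python
def mul16(a, b):
    """Multiplication modulo 2**16 + 1, where 0x0000 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF

def mul_inverse(z):
    """Multiplicative inverse modulo 2**16 + 1 (0x0000 stands for 2**16)."""
    value = z or 0x10000
    # For a prime modulus p, value**(p - 2) is the inverse of value.
    return pow(value, 0x10001 - 2, 0x10001) & 0xFFFF

def add_inverse(z):
    """Additive inverse modulo 2**16."""
    return (-z) & 0xFFFF

checks = [mul16(z, mul_inverse(z)) for z in (0, 1, 2, 0x1234, 0xFFFF)]
```

Every entry of `checks` equals 1, confirming that each sub-block multiplied by its inverse gives the group identity.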
Round | Encrypt keys | Decrypt keys
1 | Z_1^{(1)} Z_2^{(1)} Z_3^{(1)} Z_4^{(1)} Z_5^{(1)} Z_6^{(1)} | Z_1^{(9)-1} -Z_2^{(9)} -Z_3^{(9)} Z_4^{(9)-1} Z_5^{(8)} Z_6^{(8)}
2 | Z_1^{(2)} Z_2^{(2)} Z_3^{(2)} Z_4^{(2)} Z_5^{(2)} Z_6^{(2)} | Z_1^{(8)-1} -Z_3^{(8)} -Z_2^{(8)} Z_4^{(8)-1} Z_5^{(7)} Z_6^{(7)}
3 | Z_1^{(3)} Z_2^{(3)} Z_3^{(3)} Z_4^{(3)} Z_5^{(3)} Z_6^{(3)} | Z_1^{(7)-1} -Z_3^{(7)} -Z_2^{(7)} Z_4^{(7)-1} Z_5^{(6)} Z_6^{(6)}
4 | Z_1^{(4)} Z_2^{(4)} Z_3^{(4)} Z_4^{(4)} Z_5^{(4)} Z_6^{(4)} | Z_1^{(6)-1} -Z_3^{(6)} -Z_2^{(6)} Z_4^{(6)-1} Z_5^{(5)} Z_6^{(5)}
5 | Z_1^{(5)} Z_2^{(5)} Z_3^{(5)} Z_4^{(5)} Z_5^{(5)} Z_6^{(5)} | Z_1^{(5)-1} -Z_3^{(5)} -Z_2^{(5)} Z_4^{(5)-1} Z_5^{(4)} Z_6^{(4)}
6 | Z_1^{(6)} Z_2^{(6)} Z_3^{(6)} Z_4^{(6)} Z_5^{(6)} Z_6^{(6)} | Z_1^{(4)-1} -Z_3^{(4)} -Z_2^{(4)} Z_4^{(4)-1} Z_5^{(3)} Z_6^{(3)}
7 | Z_1^{(7)} Z_2^{(7)} Z_3^{(7)} Z_4^{(7)} Z_5^{(7)} Z_6^{(7)} | Z_1^{(3)-1} -Z_3^{(3)} -Z_2^{(3)} Z_4^{(3)-1} Z_5^{(2)} Z_6^{(2)}
8 | Z_1^{(8)} Z_2^{(8)} Z_3^{(8)} Z_4^{(8)} Z_5^{(8)} Z_6^{(8)} | Z_1^{(2)-1} -Z_3^{(2)} -Z_2^{(2)} Z_4^{(2)-1} Z_5^{(1)} Z_6^{(1)}
Final transform | Z_1^{(9)} Z_2^{(9)} Z_3^{(9)} Z_4^{(9)} | Z_1^{(1)-1} -Z_2^{(1)} -Z_3^{(1)} Z_4^{(1)-1}
Table 10: IDEA key schedule
4.2.4 Performance
In terms of the core operations of IDEA, the bit-wise XOR and addition
are easily implemented with one instruction each in software. For the reduction
modulo $2^{16}$, a processor such as the LEON2 that only performs arithmetic on 32-bit
register operands requires an additional logic instruction to mask out the bits
that may overflow into the sixteen most significant bits of the destination register.
The major performance bottleneck for a software implementation of the IDEA cipher
is the multiplication modulo $(2^{16} + 1)$. The reason for this is that multiplication
in general may take several clock cycles to complete on the processor running the
algorithm (especially those without hardware multipliers), and the modular reduction,
which is commonly implemented using the Low-High Lemma [66], requires additional
execution time.
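A minimal sketch of the Low-High approach is shown below, checked against a direct remainder computation. The function names are assumptions, and the code follows the commonly published form of the technique rather than any implementation from this work: the 32-bit product is split into its low and high 16-bit halves, and the result is their difference, corrected by one when the low half is smaller.

```python
def mul_direct(a, b):
    """Reference: multiplication modulo 2**16 + 1 using a remainder operation."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & 0xFFFF

def mul_low_high(a, b):
    """Same operation without a division or remainder instruction."""
    if a == 0:                       # a encodes 2**16, congruent to -1
        return (1 - b) & 0xFFFF
    if b == 0:                       # b encodes 2**16, congruent to -1
        return (1 - a) & 0xFFFF
    product = a * b
    low, high = product & 0xFFFF, product >> 16
    # 2**16 is congruent to -1, so the product is congruent to low - high.
    return (low - high + (1 if low < high else 0)) & 0xFFFF

sample = list(range(0, 65536, 257)) + [1, 2, 0xFFFF]
agree = all(mul_low_high(a, b) == mul_direct(a, b) for a in sample for b in sample)
```

Even in this form the operation needs a multiply, a shift, a mask, a compare, and a subtract, plus the two zero tests, which is consistent with it dominating the software cost of IDEA.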
Several software implementations of the IDEA algorithm take advantage
of advanced processor architectures that employ instruction parallelism or functional
units for multimedia support. A four-way parallel implementation on a 166 MHz
Pentium MMX processor [69] achieved a throughput of approximately 72 Mbps.
Throughput values ranging from 421 Mbps to 550 Mbps have been achieved on the Itanium
platform running at 733 MHz [70]. The performance evaluations reported in [71]
include a comparison of IDEA software implementations on processors with various
word sizes, clock frequencies, and cache sizes. Execution times for IDEA encryption
ranged from 2555 µs on the 8-bit 4 MHz Atmega 103 to 9 µs on the 64-bit 440 MHz
UltraSparc2 with instruction and data cache sizes of 16 kbytes. The ability to
perform fast multiplications was shown to be a major factor in the performance of
the IDEA algorithm.
Implementations of IDEA on reconfigurable computing platforms and systems
with co-processors have shown improved performance. An implementation on a
SRC-6E platform [72] achieved throughputs of approximately 590 Mbps for end-to-end
software time for bulk data processing. Comparisons have been made between
the performance of IDEA on Digital Signal Processing (DSP) chips, cryptographic
co-processors, and hardware implementations on FPGAs in a hardware-software
co-design system that makes use of encryption in a mobile device. Reported
performance figures ranged from 32 Mbps on the DEC SA-110 and 53.1 Mbps on the TI
TMX320C6x DSP chips, to 180 Mbps using the VINCI cryptographic co-processor,
to 528 Mbps with an FPGA-based implementation [73].
A VLSI implementation of PES [74] achieved a throughput of 44 Mbps using
1.5 µm technology. This implementation was limited in clock frequency to maintain
compatibility with the Sun Microsystems SBus. The earliest VLSI implementations
of IDEA [75], [76] achieved throughputs of 177 Mbps using 1.2 µm technology. More
recent VLSI implementations [77] achieve a throughput of 355 Mbps using 0.8 µm
technology. When using 0.7 µm technology, a throughput of 424 Mbps was achieved
in a single chip solution [78]. However, the performance of these implementations
was significantly reduced when operating in feedback modes.
4.3 AES
4.3.1 Mathematical Background
Joan Daemen and Vincent Rijmen proposed the Rijndael algorithm to NIST
as a candidate for the Advanced Encryption Standard [79]. One of the most significant
features of the algorithm is the extensive use of finite field, or Galois Field, arithmetic.
The particular field used in the AES algorithm is the Galois Field GF($2^8$). Values
are represented by polynomials of the form
$$a(x) = a_7x^7 + a_6x^6 + a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0,$$
or in bit vector notation, $a_7a_6a_5a_4a_3a_2a_1a_0$, where each $a_i$ is a
coefficient in the Galois Field GF(2). Addition is done by computing the sum modulo
2 of coefficients in the same bit positions; this can be accomplished by applying
a bit-wise exclusive-OR on the coefficients. Multiplication works in much the same
way as ordinary polynomial multiplication, but there is an additional step to make
a modular reduction of the product by an irreducible polynomial so that the final
product is in the Galois Field GF($2^8$). For the AES algorithm, this polynomial is
$m(x) = x^8 + x^4 + x^3 + x + 1$.
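Multiplication in GF($2^8$) with reduction by m(x) can be sketched with a shift-and-add loop. The function name `gf_mul` is an assumption; the constant 0x11B is the bit vector form of m(x), including the $x^8$ term.

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # addition in GF(2^8) is a bit-wise XOR
        a <<= 1                  # multiply a by x
        if a & 0x100:
            a ^= 0x11B           # subtract m(x) when the degree reaches 8
        b >>= 1
    return result

product = gf_mul(0x57, 0x83)
```

The loop handles one coefficient of b per iteration, so a bit-serial software multiply takes eight conditional XOR and shift steps per byte pair.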
4.3.2 Algorithm Description
AES always operates on a block size of 128 bits, but key sizes of 128, 192,
or 256 bits are allowed. The number of rounds used in the cipher is dependent on the
key size: ten rounds for a 128-bit key, twelve rounds for a 192-bit key, and fourteen
rounds for a 256-bit key. This research focuses on a 128-bit key implementation but
is easily extended for use in implementations with larger key sizes.
Encryption of one plaintext block in AES requires the sequence of operations
shown in Figure 6. The word data type is a 32-bit value. In the AES algorithm
specification [80], the plaintext is arranged into a 4 × 4 matrix of 8-bit values called
the state, depicted in Figure 7.
Encrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[0, 3])
    for round = 1 step 1 to 9
        SubBytes(state)
        ShiftRows(state)
        MixColumns(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
    end for
    SubBytes(state)
    ShiftRows(state)
    AddRoundKey(state, k[40, 43])
    out = state
end
Figure 6: The AES encryption process
s_{0,0}  s_{0,1}  s_{0,2}  s_{0,3}
s_{1,0}  s_{1,1}  s_{1,2}  s_{1,3}
s_{2,0}  s_{2,1}  s_{2,2}  s_{2,3}
s_{3,0}  s_{3,1}  s_{3,2}  s_{3,3}
Figure 7: The state representation of data blocks in AES
The four types of operations performed on the state are:
SubBytes: substitutes each byte $a$ in the state with a new value according to
the following procedure:
(1) Compute the multiplicative inverse in the Galois Field GF($2^8$), denoted as
$a^{-1}$ (except for the value 0x00, which is mapped to itself);
(2) Perform the following affine transformation over the Galois Field GF(2) on
$a^{-1}$:
$$\begin{bmatrix} b_7 \\ b_6 \\ b_5 \\ b_4 \\ b_3 \\ b_2 \\ b_1 \\ b_0 \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a^{-1}_7 \\ a^{-1}_6 \\ a^{-1}_5 \\ a^{-1}_4 \\ a^{-1}_3 \\ a^{-1}_2 \\ a^{-1}_1 \\ a^{-1}_0 \end{bmatrix} +
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$
The result $b$ is copied into the position of $a$ in the state.
ShiftRows: performs cyclic left-shifts on each row in the state. The number
of bytes by which to shift depends on the row: zero for the top row, one for the
second row, two for the third row, and three for the bottom row.
MixColumns: each column of the state is treated as a vector of four polynomials
in the Galois Field GF($2^8$) in this operation. Each of the four columns
is multiplied by a 4 × 4 constant matrix with coefficients in the Galois Field
GF($2^8$) reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$. For each column c from
0 to 3,
$$\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.$$
AddRoundKey: like MixColumns, AddRoundKey operates on individual
columns of the state. Each column $C_i$ is combined by a bit-wise exclusive-OR
operation with a 32-bit word $k_{4r+i}$ from the current round key (the key schedule
is explained in Section 4.3.3).
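The SubBytes, ShiftRows, and MixColumns definitions above can be sketched computationally. All function names are assumptions for illustration, and the code favors clarity over speed: the field inverse is found by repeated multiplication, and the affine step is written as XORs of byte rotations, which is equivalent to the matrix form combined with the constant 0x63.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Inverse as a**254; the value 0x00 maps to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sub_byte(a):
    """Field inversion followed by the affine transformation."""
    x = gf_inv(a)
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63

def shift_rows(state):
    """state is a list of four rows; row r rotates left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def mix_column(col):
    """Multiply one column by the constant MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
            a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
            a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
            gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3)]

mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Practical software replaces `sub_byte` with a 256-entry table, which is the look-up cost discussed in Section 4.3.4.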
Decryption of a ciphertext block incorporates the inverses of the operations
used in the encryption process as shown in Figure 8. Note that AddRoundKey is
its own inverse since it involves only the bitwise exclusive-OR operation.
Decrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[40, 43])
    for round = 9 step -1 downto 1
        InvShiftRows(state)
        InvSubBytes(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
        InvMixColumns(state)
    end for
    InvShiftRows(state)
    InvSubBytes(state)
    AddRoundKey(state, k[0, 3])
    out = state
end
Figure 8: The AES decryption process
InvSubBytes: reverses the transformation performed by SubBytes by first
applying an affine transformation using the inverse of the 8 × 8 matrix used for
SubBytes followed by calculation of the multiplicative inverse in the Galois
Field GF($2^8$) modulo m(x).
InvShiftRows: performs cyclic right-shifts on each row in the state in the
same amounts as ShiftRows.
InvMixColumns: performs multiplication of the state by the inverse of the
Galois Field constant matrix from MixColumns:
$$\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix} =
\begin{bmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.$$
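That the two constant matrices are inverses of each other over GF($2^8$) can be checked with a short sketch. The names `mat_vec`, `MIX`, and `INV_MIX` are assumptions for illustration, and the sample column is an arbitrary value.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]

def mat_vec(matrix, column):
    """Matrix-vector product where addition is XOR and products use gf_mul."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out

column = [0xDB, 0x13, 0x53, 0x45]
restored = mat_vec(INV_MIX, mat_vec(MIX, column))
```

The inverse matrix has larger coefficients than the forward matrix, so a bit-serial InvMixColumns needs more shift and XOR steps per byte than MixColumns.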
4.3.3 Key Schedule
For the 128-bit key size implementation of the AES algorithm, the master
key is expanded into a linear array of forty-four 4-byte words (eleven 128-bit round
keys) using the process presented in Figure 9. There are two operations and an array
of constants used specifically for the key schedule:
KeyExpansion(byte key[16], word w[44])
begin
word temp
i = 0
while (i < 4)
w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
i = i + 1
end while
i = 4
while (i < 44)
temp = w[i-1]
if (i mod 4 = 0)
temp = SubWord(RotWord(temp)) xor Rcon[i/4]
end if
w[i] = w[i-4] xor temp
i = i + 1
end while
end
Figure 9: The key expansion process for AES
SubWord: applies a substitution to each of the four bytes in the input word
using the same S-Box that is used in encryption.
RotWord: performs a cyclic left rotation by one byte on the input word
$a_0a_1a_2a_3$ to produce an output of $a_1a_2a_3a_0$.
Rcon[ ]: the round constant array with a size of ten words. For i from 1 to 10,
$$Rcon[i] = [02^{i-1}, 00, 00, 00],$$
where the powers $02^{i-1}$ are in the Galois Field GF($2^8$).
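The key expansion of Figure 9 can be sketched in runnable form. All function names are assumptions; the S-Box is generated from the field inverse and affine transformation rather than stored as constants, and the sample key is an arbitrary 128-bit value chosen for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def make_sbox():
    """Build the 256-entry S-Box: field inverse, then affine transformation."""
    sbox = []
    for a in range(256):
        x = 1
        for _ in range(254):          # a**254 is the inverse; 0 maps to 0
            x = gf_mul(x, a)
        for n in (1, 2, 3, 4):        # XOR of rotations equals the affine matrix
            x ^= (((sbox_in := gf_inv_cache(a)) << n) | (sbox_in >> (8 - n))) & 0xFF
        sbox.append(x ^ 0x63)
    return sbox

def gf_inv_cache(a):
    """Plain inverse helper used by make_sbox (recomputed, not actually cached)."""
    x = 1
    for _ in range(254):
        x = gf_mul(x, a)
    return x

SBOX = make_sbox()

def rot_word(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def key_expansion(key):
    """Expand a 16-byte key into forty-four 32-bit words."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)  # next power of 02 in GF(2^8)
        w.append(w[i - 4] ^ temp)
    return w

words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Words 4r through 4r+3 form the round key for round r, matching the indexing used by AddRoundKey.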
4.3.4 Performance
Rijndael software performance bottlenecks typically occur in the SubBytes
and MixColumns transformations, one or both of which are usually implemented
via 8-bit to 8-bit look-up tables. Often most of the Rijndael round transformations
(SubBytes, ShiftRows, and MixColumns) are combined into large look-up tables
termed T-tables. Such implementations require up to three T-tables whose size
may be either 1 kbyte or 4 kbytes, where the smaller tables require performing an
additional rotation operation. The goal of the T-tables is to avoid performing the
MixColumns and InvMixColumns transformations as these operations perform Galois
Field fixed field constant multiplication, an operation which maps poorly to general
purpose processors. However, the use of T-tables has significant disadvantages. The
T-tables significantly increase code size, their performance is dependent on the memory
system architecture as well as cache size, and their use causes key expansion for
Rijndael decryption to become significantly more complex. As an alternative to the
T-tables implementation method, it is also feasible to have the processor perform all of
the Rijndael round transformations. Row-based implementations have been demonstrated
to allow for greater efficiency in the implementation of the MixColumns and
InvMixColumns transformations versus column-based implementations. However, the
SubBytes transformation still remains as a bottleneck, requiring separate 256-byte
look-up tables for encryption and decryption [20], [29], [32], [81], [82], [83], [84].
Numerous co-processors have been developed to accelerate cryptographic
algorithm implementations. The CryptoManiac VLIW co-processor was developed
as a result of instruction set extensions designed to accelerate the performance of
a number of the AES candidate algorithms. CryptoManiac features the execution
of up to four instructions per cycle and the use of instructions with up to three
operands to allow for the combination of short latency instructions for single cycle
execution. Similarly, the Cryptonite co-processor is also VLIW based, with two 64-bit
datapaths and special instructions combined with dedicated memories to support
Rijndael implementations. Both co-processors improve the performance of Rijndael
implementations versus implementations targeting general purpose processors. Other
implementations couple FPGA co-processors with a LEON-2 processor core. The
co-processors connect to the processor core via either a dedicated interface or as a
memory-mapped peripheral and were able to significantly improve the performance
of Rijndael implementations [23], [33], [35], [85], [86].
Multiple implementations of Rijndael have been presented targeting a wide
range of hardware technologies. These implementations use specific Galois Field fixed
field constant multipliers resulting in either logic equations or look-up tables being
generated to perform the multiplication. Implementations based on logic equations
are optimized for area and require a moderate number of logic levels. Implementations
based on look-up tables are optimized for speed at the cost of additional logic resources,
though the performance of these implementations, like the software implementations
employing T-tables, is highly dependent on the memory system and cache organization
and size. In the case of the Galois Field fixed field constant multipliers used in
the MixColumns transformation, the 8-bit to 8-bit look-up tables may be replaced by
8-bit × 8-bit mapping matrices, reducing the associated memory requirements
by a factor of nearly 20 [87], [88]. Look-up tables may also be replaced with logic
equation implementations for the SubBytes and MixColumns transformations,
significantly reducing the hardware resource requirements. To illustrate the significant
reduction in logic resource requirements, in the case of the SubBytes transformation,
a reduction in gate count by as much as a factor of 4.66 has been realized using logic
equations in place of a look-up table. When performing sixteen SubBytes transformations
in parallel in a single round of Rijndael (assuming a 128-bit implementation),
this equates to a savings of over 38,000 gate equivalents. For a pipelined implementation
of 128-bit AES, this savings increases to over 380,000 gate equivalents [29].
Encryption, decryption, and key scheduling are all easily pipelined in non-feedback
modes of operation while single-round implementations are typically used when
operating in feedback modes. Depending on the implementation methodology, Rijndael
throughputs as high as 70 Gbps when operating in non-feedback modes and 2.29
Gbps when operating in feedback modes have been reported [89], [90], [91], [92], [93],
[94], [95], [96], [97], [98], [99], [100], [101], [102], [103], [104], [105], [106], [107].
Instruction set extensions are an interesting implementation option that
bridges the gap between hardware and software. Significantly improved performance
of software implementations has been demonstrated as a result of adding functionality
to a processor's datapath and corresponding control logic to decode new
instructions. Instruction set extensions designed to accelerate the performance of
software implementations of Rijndael have been proposed for a wide range of processors.
These extensions minimize the number of memory accesses, usually by combining
the SubBytes and MixColumns transformations into one T-table look-up operation
to speed up algorithm execution. While T-table performance is heavily dependent
upon available cache size, these extensions have been shown to result in performance
improvements of up to a factor of 3.68 versus Rijndael implementations without the
use of the instruction set extensions [20], [42], [85], [86], [88], [108], [109].
4.4 Modes of Operation
All of the symmetric-key algorithms targeted in this research support many
different modes of operation, which are methods that specify the information entered
into the algorithm. Two modes of operation are of particular interest here:
Electronic Code Book (ECB): Each block of plaintext $x_i$ is input directly into
the encryption function to form the ciphertext $y_i$. Encrypting a specific value
for $x_i$ always produces the same value for $y_i$ and the result is not affected by
previous plaintext blocks.
Cipher Block Chaining (CBC): The first plaintext block is combined with an
initialization vector (IV) using the bitwise exclusive-OR operation, and the
result is encrypted to form the first ciphertext block. Subsequent blocks of
plaintext are exclusive-ORed with the last ciphertext block computed. In this
mode, every block of ciphertext depends on the preceding runs of the algorithm.
ECB mode may be employed to create pipelined implementations of block
ciphers, leading to very high throughput. However, a disadvantage of ECB mode is
that identical plaintext blocks encrypted with the same key always produce the same
ciphertext. Also, because data blocks encrypted in ECB mode have no chaining
or feedback mechanism, the encrypted data is vulnerable to a substitution attack.
In this type of attack, encrypted blocks in the original data stream are swapped
by the attacker with other data blocks that are encrypted with the same key. CBC
mode is not vulnerable to such attacks, but because the current block to be encrypted
depends directly on the previous ciphertext, CBC mode is not well suited to pipelined
implementations.
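The difference between the two modes can be illustrated with a minimal sketch. The code below assumes a hypothetical 32-bit toy cipher (toy_encrypt) standing in for DES, IDEA, or AES; it is not secure and is used only to make the chaining visible. All function names are illustrative and do not come from the test code of this research.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy "block cipher": XOR with the key, then rotate left by 7.
   It only stands in for a real cipher so that the chaining is observable. */
static uint32_t toy_encrypt(uint32_t block, uint32_t key) {
    uint32_t x = block ^ key;
    return (x << 7) | (x >> 25);
}

/* ECB: every block is encrypted independently of all other blocks. */
void ecb_encrypt(const uint32_t *pt, uint32_t *ct, size_t n, uint32_t key) {
    for (size_t i = 0; i < n; i++)
        ct[i] = toy_encrypt(pt[i], key);
}

/* CBC: each plaintext block is XORed with the previous ciphertext block
   (the IV for the first block) before it is encrypted. */
void cbc_encrypt(const uint32_t *pt, uint32_t *ct, size_t n,
                 uint32_t key, uint32_t iv) {
    uint32_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        ct[i] = toy_encrypt(pt[i] ^ prev, key);
        prev = ct[i];   /* feedback: block i+1 cannot start before block i ends */
    }
}
```

The loop-carried dependency on prev in cbc_encrypt is what prevents pipelining across blocks, whereas the iterations of ecb_encrypt are independent of one another.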
5 PROPOSED INSTRUCTION SET EXTENSIONS
This chapter specifies the new instructions that are intended to enhance the
performance of the algorithms described in Chapter 4. All of the custom instructions
are intended to comply with the SPARC® V8 instruction model [47]. In particular,
the instructions have the Format 3 structure described in Section 3.3.
The sub-sections below indicate the syntax and encoding, as well as a brief
description, of each instruction. All instructions that write to a register execute in
one clock cycle, except for the mmul16 instruction, which takes two clock cycles.
For those instructions that store data directly into registers contained in the custom
hardware, the data is available at the start of the next cycle, after instruction
execution has completed.
5.1 DES and Triple-DES
5.1.1 Initial and Final Permutations
Instruction Syntax:
desipl rs1,rs2,rd
desipr rs1,rs2,rd
desfpl rs1,rs2,rd
desfpr rs1,rs2,rd
Instruction Encoding:
op   rd   op3      rs1   i   asi        rs2
10   rd   001101   rs1   0   XXXXXXnn   rs2
The desipl and desipr instructions produce the left and right halves of the
DES IP, respectively. Similarly, the desfpl and desfpr instructions produce the left
and right halves of the DES FP, respectively. The left half of the input block must
be located in the rs1 register and the right half must be located in the rs2 register.
The specific instruction to be executed is determined by the value of bits
[1:0] of the asi field as given in Table 11 below. Bits [7:2] of the asi field are ignored
by all of the DES permutation instructions.
asi[1:0]   Instruction
00         desipl
01         desipr
10         desfpl
11         desfpr
Table 11: Interpretation of the asi field for the DES permutation instructions
Inclusion of these instructions allows for the IP and FP for DES and Triple-DES
to be completed in two instructions each. Traditional implementations of the
IP and FP in software require a series of bit mask setup, shift, logical AND, and
logical OR operations for each bit for a total of 256 instructions [41]. The improved
permutation algorithm used in the reference code [110] requires 44 instructions to
complete on a SPARC® V8 processor such as the LEON2, which is still significantly
larger than the instruction count required to perform the permutations as proposed
in this research.
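The behavior of the four permutation instructions can be sketched as a bit-level software model. The code below assumes the standard IP table from the DES specification (FIPS 46-3) and computes the FP as its inverse; the C function names mirror the instruction mnemonics but are otherwise hypothetical, and the model says nothing about how the hardware routes the bits.

```c
#include <stdint.h>

/* DES initial permutation (FIPS 46-3). Entries are 1-based; bit 1 is the MSB. */
static const uint8_t DES_IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2,  60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6,  64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1,  59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5,  63, 55, 47, 39, 31, 23, 15, 7
};

/* IP: output bit i+1 is input bit DES_IP[i]. */
static uint64_t des_ip(uint64_t in) {
    uint64_t out = 0;
    for (int i = 0; i < 64; i++)
        out = (out << 1) | ((in >> (64 - DES_IP[i])) & 1u);
    return out;
}

/* FP is the inverse of IP: input bit i+1 moves to output position DES_IP[i]. */
static uint64_t des_fp(uint64_t in) {
    uint64_t out = 0;
    for (int i = 0; i < 64; i++)
        out |= ((in >> (63 - i)) & 1u) << (64 - DES_IP[i]);
    return out;
}

/* rs1 holds the left half of the input block, rs2 the right half. */
static uint64_t join(uint32_t rs1, uint32_t rs2) {
    return ((uint64_t)rs1 << 32) | rs2;
}

uint32_t desipl(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_ip(join(rs1, rs2)) >> 32); }
uint32_t desipr(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_ip(join(rs1, rs2)); }
uint32_t desfpl(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_fp(join(rs1, rs2)) >> 32); }
uint32_t desfpr(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_fp(join(rs1, rs2)); }
```

Each call corresponds to one custom instruction, so a full 64-bit permutation costs two calls, one per output half.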
5.1.2 Set Encryption Direction
Instruction Syntax:
desdir imm
Instruction Encoding:
op   rd      op3      rs1     i   simm13
10   00000   001001   00000   1   (dir)
Set up the DES key generator to output round keys in either encryption or
decryption order. The imm operand is set to zero for encryption, one for decryption.
This instruction also resets the round counter of the key generator according to the
chosen direction to ensure that output of the round keys may be immediately carried
out in the proper order. It is not necessary to re-load the master key after this in-
struction is executed. The desdir instruction is used in conjunction with the deskey
and desf instructions as explained in Sections 5.1.3 and 5.1.4.
5.1.3 Key Loading
Instruction Syntax:
deskey rs1,rs2
Instruction Encoding:
op   rd      op3      rs1   i   asi        rs2
10   00000   001001   rs1   0   (unused)   rs2
The deskey instruction loads the 64-bit master key for DES. The left half
of the master key must be contained in the rs1 register, and the right half in the rs2
register.
5.1.4 Round Core (f) Function
Instruction Syntax:
desf rs1,rd
Instruction Encoding:
10   00000   001001   rs1   1   0x1XXX
This instruction takes the right half of a round output block stored in the
rs1 register and stores the output of the core (f) function into the rd register. The
round key is not specified here since the round key output of the DES key generator
is hard-wired to the f-function circuit's round key input. After completion of this
instruction, the key generator is signaled to generate the key for the next round. Due
to the logic of the DES key generator (see Section 6.1.3), the desf instruction may
not be followed by another desf instruction. However, this is not expected to cause
a performance bottleneck due to the additional instruction required for swapping the
values of the left and right halves of the round input block.
Implementation of the desdir, deskey, and desf instructions removes the
need for storage of the sixteen round keys and S-Boxes in memory. All round keys
are generated on-the-fly in the custom hardware. An implementation of the DES
algorithm using these instructions requires two instructions for key scheduling and four
instructions for each of the sixteen rounds (one desf, one exclusive-OR, and two
register data transfers for swapping the left and right halves of the round function output).
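The four-instruction round can be sketched in C to show why the same loop serves both encryption and decryption. The code below is only a structural model: toy_f is a hypothetical stand-in for the desf instruction (it is not the DES f-function), and the keys array stands in for the round keys that the hardware key generator would supply in the order selected by desdir.

```c
#include <stdint.h>

/* Hypothetical stand-in for desf: any function of (R, round key) is enough
   to demonstrate the Feistel structure. This is NOT the DES f-function. */
static uint32_t toy_f(uint32_t r, uint32_t k) {
    uint32_t x = r + k;
    return (x << 5) | (x >> 27);
}

/* Sixteen Feistel rounds followed by the swap of the two halves.
   dir = 0 consumes keys[0..15] in order (encryption);
   dir = 1 consumes them in reverse order (decryption), like desdir. */
void feistel16(uint32_t *l, uint32_t *r, const uint32_t keys[16], int dir) {
    for (int i = 0; i < 16; i++) {
        uint32_t k = keys[dir ? 15 - i : i];
        uint32_t temp = *r;          /* mov  r, temp       */
        *r = *l ^ toy_f(*r, k);      /* desf r, r ; xor l, r, r */
        *l = temp;                   /* mov  temp, l       */
    }
    uint32_t t = *l;                 /* output is (R16, L16) */
    *l = *r;
    *r = t;
}
```

Running feistel16 with dir = 0 and then again with dir = 1 on the result restores the original halves, regardless of the function used for toy_f.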
5.1.5 New DES and Triple-DES Algorithm Implementations
The following figures show instruction sequences that may be used to implement
the DES and Triple-DES algorithms for encryption and decryption. All operand
names are symbolic and do not represent the names of physical registers of the LEON2
processor.
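desipl %[ptextl], %[ptextr], %[l]      ! l = left half of IP(plaintext)
desipr %[ptextl], %[ptextr], %[r]      ! r = right half of IP(plaintext)
deskey %[keyl], %[keyr]                ! load the 64-bit master key
desdir 0                               ! round keys in encryption order
mov 1, %[i]                            ! round counter
round:
mov %[r], %[temp]                      ! save the current right half
desf %[r], %[r]                        ! r = f(r, round key)
xor %[l], %[r], %[r]                   ! new right half = l xor f(r, round key)
mov %[temp], %[l]                      ! new left half = old right half
cmp %[i], 16
blu round                              ! repeat while i < 16
add %[i], 1, %[i]                      ! branch delay slot: i = i + 1
desfpl %[r], %[l], %[ctextl]           ! FP is applied to the swapped halves
desfpr %[r], %[l], %[ctextr]
Figure 10: DES encryption routine with custom instructions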
desipl %[ctextl], %[ctextr], %[l]
desipr %[ctextl], %[ctextr], %[r]
deskey %[keyl], %[keyr]
desdir 1
mov 1, %[i]
round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ptextl]
desfpr %[r], %[l], %[ptextr]
Figure 11: DES decryption routine with custom instructions
desipl %[ptextl], %[ptextr], %[l]
desipr %[ptextl], %[ptextr], %[r]
deskey %[key1l], %[key1r]
desdir 0
mov 1, %[i]
d1round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d1round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key2l], %[key2r]
desdir 1
mov 1, %[i]
d2round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d2round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key3l], %[key3r]
desdir 0
mov 1, %[i]
d3round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d3round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ctextl]
desfpr %[r], %[l], %[ctextr]
Figure 12: Triple-DES encryption routine with custom instructions
desipl %[ctextl], %[ctextr], %[l]
desipr %[ctextl], %[ctextr], %[r]
deskey %[key3l], %[key3r]
desdir 1
mov 1, %[i]
d1round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d1round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key2l], %[key2r]
desdir 0
mov 1, %[i]
d2round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d2round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key1l], %[key1r]
desdir 1
mov 1, %[i]
d3round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d3round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ptextl]
desfpr %[r], %[l], %[ptextr]
Figure 13: Triple-DES decryption routine with custom instructions
5.2 IDEA
5.2.1 Multiplication modulo $2^{16} + 1$
Instruction Syntax:
mmul16 rs1, rs2, rd
Instruction Encoding:
op   rd   op3      rs1   i   asi        rs2
10   rd   101101   rs1   0   (unused)   rs2
This instruction calculates $\mathrm{rs1} \times \mathrm{rs2} \bmod (2^{16} + 1)$ and stores the product
in the rd register. Both sources must be in the lower sixteen bits of their respective
registers. The 16-bit product is stored in the lower sixteen bits of the rd register.
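A minimal reference model of the arithmetic is sketched below. It assumes the usual IDEA convention that a 16-bit operand of zero represents the value $2^{16}$ (and that a result of $2^{16}$ is returned as zero); the function name mirrors the mnemonic, but the model uses a plain remainder operation and does not reflect the adder-based hardware.

```c
#include <stdint.h>

/* Software model of multiplication modulo 2^16 + 1.
   Assumption (standard IDEA convention): an operand of 0 stands for 2^16,
   and a result of 2^16 is returned as 0 after truncation to 16 bits. */
uint16_t mmul16(uint16_t rs1, uint16_t rs2) {
    uint64_t x = rs1 ? rs1 : 0x10000u;
    uint64_t y = rs2 ? rs2 : 0x10000u;
    return (uint16_t)((x * y) % 0x10001u);
}
```

For example, mmul16(0, 0) evaluates $2^{16} \cdot 2^{16} \bmod (2^{16}+1) = 1$, since $2^{16} \equiv -1$.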
5.2.2 New IDEA Algorithm Implementation
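t0_1 = in[0]; t0_2 = in[1]; t0_3 = in[2]; t0_4 = in[3];
for(i=0;i<8;i++)
{
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "=r" (t1_1) : [a] "r" (t0_1) , [b] "r" (key[6*i+0]));
t1_2 = (t0_2 + key[6*i+1]) & 0x0000FFFF;
t1_3 = (t0_3 + key[6*i+2]) & 0x0000FFFF;
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "=r" (t1_4) : [a] "r" (t0_4) , [b] "r" (key[6*i+3]));
t2_1 = t1_1 ^ t1_3;
t2_2 = t1_2 ^ t1_4;
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "=r" (t3_1) : [a] "r" (t2_1) , [b] "r" (key[6*i+4]));
t3_2 = (t2_2 + t3_1) & 0x0000FFFF;
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "=r" (t3_4) : [a] "r" (t3_2) , [b] "r" (key[6*i+5]));
t3_3 = (t3_1 + t3_4) & 0x0000FFFF;
t0_1 = t1_1 ^ t3_4;
t0_2 = t1_3 ^ t3_4;
t0_3 = t1_2 ^ t3_3;
t0_4 = t1_4 ^ t3_3;
}
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "+r" (t0_1) : [a] "r" (t0_1) , [b] "r" (key[48]));
ttt = t0_2;
t0_2 = (t0_3 + key[49]) & 0x0000FFFF;
t0_3 = (ttt + key[50]) & 0x0000FFFF;
asm("mmul16 %[a], %[b], %[p]\n\t" : [p] "+r" (t0_4) : [a] "r" (t0_4) , [b] "r" (key[51]));
Figure 14: IDEA algorithm routine with custom instructions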
5.3 AES
5.3.1 SubBytes Operations
Instruction Syntax:
aessb rs1, imm, rd
aessb4 rs1, imm, rd
Instruction Encoding:
op   rd   op3      rs1   i   simm13
10   rd   101100   rs1   1   (see below)
These instructions perform the AES SubBytes and InvSubBytes operations
on either one (aessb) or all four (aessb4) of the bytes in the rs1 register. Only one
of these two instructions may be implemented in the hardware at a time.
The value specified in the simm13 field determines the actual operation
performed. The least significant bit is set to zero for SubBytes, or one for
InvSubBytes. The value composed of bits 5 and 4 indicates the byte to be substituted for
the aessb instruction as shown in Table 12. These bits are not used by the aessb4
instruction.
simm13[5:4]   Substituted byte
00            rs1[31:24]
01            rs1[23:16]
10            rs1[15:8]
11            rs1[7:0]
Table 12: Usage of the simm13 field by the aessb and aessb4 instructions
Bits [12:6] and [3:1] of simm13 are ignored by both the aessb and aessb4
instructions.
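The decoding of the simm13 field can be sketched with a small software model. The S-box is computed from its standard definition (inversion in GF($2^8$) followed by the affine transformation) rather than stored as a table. One point is an assumption not stated above: for aessb, the three bytes that are not selected are modeled as passing through unchanged. The function names mirror the mnemonics and are otherwise illustrative.

```c
#include <stdint.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1u) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0u));
        b >>= 1;
    }
    return p;
}

/* Forward S-box: x^254 is the inverse of x (and maps 0 to 0), then affine. */
static uint8_t sbox_fwd(uint8_t x) {
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++) inv = gf_mul(inv, x);
    uint8_t s = inv;
    for (int n = 1; n <= 4; n++)
        s ^= (uint8_t)((inv << n) | (inv >> (8 - n)));   /* rotate left by n */
    return (uint8_t)(s ^ 0x63u);
}

/* Inverse S-box by exhaustive search; adequate for a reference model. */
static uint8_t sbox_inv(uint8_t y) {
    for (int x = 0; x < 256; x++)
        if (sbox_fwd((uint8_t)x) == y) return (uint8_t)x;
    return 0;   /* not reached: the S-box is a bijection */
}

static uint8_t sub(uint8_t b, uint32_t simm13) {
    return (simm13 & 1u) ? sbox_inv(b) : sbox_fwd(b);   /* bit 0: direction */
}

/* aessb: substitute the byte selected by simm13[5:4] (Table 12).
   Assumption: the other three bytes are passed through unchanged. */
uint32_t aessb(uint32_t rs1, uint32_t simm13) {
    unsigned shift = 24u - 8u * ((simm13 >> 4) & 3u);
    uint32_t out = sub((uint8_t)(rs1 >> shift), simm13);
    return (rs1 & ~(0xFFu << shift)) | (out << shift);
}

/* aessb4: substitute all four bytes; simm13[5:4] is ignored. */
uint32_t aessb4(uint32_t rs1, uint32_t simm13) {
    uint32_t rd = 0;
    for (unsigned shift = 0; shift < 32; shift += 8)
        rd |= (uint32_t)sub((uint8_t)(rs1 >> shift), simm13) << shift;
    return rd;
}
```

Under this model, one aessb4 call does the work of four aessb calls with immediates 0x00, 0x10, 0x20, and 0x30.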
5.3.2 GF($2^m$) Matrix Multiplier Constant Loading
Instruction Syntax:
gfmkld rs1,rs2
Instruction Encoding:
op   rd      op3      rs1   i   asi        rs2
10   00000   011001   rs1   0   (unused)   rs2
The gfmkld instruction is used to load one of the sixteen constants into
the constant matrix of the Galois Field fixed field constant matrix multiplier. The
constant matrix has the following structure:
\[
\begin{bmatrix}
K_{00} & K_{01} & K_{02} & K_{03} \\
K_{10} & K_{11} & K_{12} & K_{13} \\
K_{20} & K_{21} & K_{22} & K_{23} \\
K_{30} & K_{31} & K_{32} & K_{33}
\end{bmatrix}
\]
The first constants to be loaded are those in the first row, from $K_{00}$ to $K_{03}$.
The loading process continues for each row in descending order, from left to right,
until the last constant $K_{33}$ has been loaded. Due to the logic that has been added to
the multiplier for inclusion into the LEON2 processor datapath (see Section 6.1.6),
instances of the gfmkld instruction may not be issued consecutively.
5.3.3 GF($2^m$) Matrix Multiplication
Instruction Syntax:
gfmmul rs1, rd
Instruction Encoding:
op   rd   op3      rs1   i   asi        rs2
10   rd   011101   rs1   0   (unused)   00000
This instruction performs the Galois Field fixed field constant matrix multiplication
on the input in the rs1 register and stores the result in the rd register.
5.3.4 New AES Algorithm Implementations
// First add round key
state[0] = plaintext[0] ^ key_schedule[0][0];
// The nine rounds
for (i = 1; i < Nr; i++)
{
// SubBytes + ShiftRows
asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// MixColumns
asm(
"gfmmul %[t0], %[s0]\n\t"
: [s0] "=r" (state[0]) , [s1] "=r" (state[1]) , [s2] "=r" (state[2]) , [s3] "=r" (state[3])
: [t0] "r" (tmp[0]) , [t1] "r" (tmp[1]) , [t2] "r" (tmp[2]) , [t3] "r" (tmp[3])
);
// Add round key
state[0] = state[0] ^ key_schedule[i][0];
}
// Final round
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// Add round key
ciphertext[0] = tmp[0] ^ key_schedule[Nr][0];
}
Figure 15: AES encryption routine with aessb and gfmmul instructions
// AddRoundKey
state[0] = ciphertext[0] ^ key_schedule[Nr][0];
for (i = Nr-1; i > 0; i--)
{
// InvShiftRows + InvSubBytes
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
// AddRoundKey
tmp[0] ^= key_schedule[i][0];
// InvMixColumns
asm(
);
}
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
// AddRoundKey
plaintext[0] = tmp[0] ^ key_schedule[i][0];
}
Figure 16: AES decryption routine with aessb and gfmmul instructions
// The nine rounds
for (i = 1; i < Nr; i++)
{
asm( "aessb4 %[s0], 0, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 0, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 0, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 0, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// MixColumns
asm(
);
// Add round key
}
// Final round
asm( "aessb4 %[s0], 0, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 0, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 0, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 0, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// Add round key
}
Figure 17: AES encryption routine with aessb4 and gfmmul instructions
// AddRoundKey
for (i = Nr-1; i > 0; i--)
{
asm( "aessb4 %[s0], 1, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 1, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 1, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 1, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
// AddRoundKey
// InvMixColumns
asm(
);
}
asm( "aessb4 %[s0], 1, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 1, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 1, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 1, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
// AddRoundKey
}
Figure 18: AES decryption routine with aessb4 and gfmmul instructions
6 LEON2 HARDWARE AND SOFTWARE TOOLCHAIN
MODIFICATIONS
This chapter describes additions to the LEON2 processor architecture and
software development tools that were made to support the cryptographic instruction
set extensions of Chapter 5. The VHDL code for all custom hardware modules is
available in Appendix A. None of the custom logic circuits rely on external memory
for their functionality. All inputs and outputs are read from and written to the
LEON2 IU register file.
6.1 Custom Hardware Units
6.1.1 DES Permutation Unit
This module implements the IP and FP for DES. A block diagram of this
unit is shown in Figure 19. The inputs in_l and in_r are loaded from the source
registers specified in the permutation instruction. The inputs are passed through two
stages of 2-to-1 multiplexers. The first stage selects the re-arranged bits of either the
IP or the FP output based on the control signal pmt_type. A value of logic zero
for pmt_type selects the IP permutation, and logic one selects the FP permutation.
The output of the selected permutation is represented by the pair of 32-bit vectors
out_l and out_r. The second stage of multiplexers sets the output based on the signal
out_half: if this signal is logic zero, the left half is the output; for logic one, the right
half is the output.
Figure 19: DES permutation unit
6.1.2 DES Round f-function Unit
The f-function that is used in each round of the DES cipher is implemented
with this module. The structure is the same as that of Figure 4 presented in
Section 4.1.1. The 32-bit input port for $R_{i-1}$ is loaded from the source register operand
specified in the desf instruction. The 48-bit input port for the current round key is
driven by the DES key generator module described in Section 6.1.3. The E and
P sections of Figure 4 are implemented by re-routing the inputs as described in
Section 4.1.1, and the S-Boxes, $S_1$ through $S_8$, are each defined in separate logic-based
look-up tables. The output of the f-function unit is stored in the destination register
specified in the rd field of the desf instruction.
6.1.3 DES Key Generator
The key generator module was designed to work in conjunction with the
f-function unit described in Section 6.1.2. The block diagram for this module
is shown in Figure 20. All registers are sensitive to the rising edge of the clock.
Presence of the desdir instruction in the LEON2 execute stage sets the setdir input
to logic one; otherwise, setdir is held at logic zero. When this input is asserted, the
dir register gets the value of dirin, which is specified by the least significant bit of
simm13 in the desdir instruction.
A 64-bit master key is loaded by issuing the deskey instruction. When this
instruction is in the execute stage, the load input is set to logic one; otherwise this
signal is held at logic zero. The 64 key bits (loaded from the source registers specified
in the deskey instruction) are re-arranged according to the PC-1 mapping of Table
7 and loaded into the C0 and D0 registers when the load bit is set (logic one).
The C and D registers are also loaded when the load bit is asserted, but
the values depend not only on C0 and D0, but also on the value of the dir register.
When dir is logic zero (encryption), the C and D registers are loaded with the values
of C0 and D0 rotated left by 1 bit since the first values of $C_i$ and $D_i$ used in the
encryption key schedule are $C_1$ and $D_1$. When dir is logic one (decryption), the C
and D registers are loaded with the exact values of C0 and D0 because the final $C_i$
and $D_i$ values used in the encryption key schedule are the first values used in the
decryption key schedule ($C_{16} = C_0$ and $D_{16} = D_0$) [55].
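The relationship $C_{16} = C_0$ follows from the DES rotation schedule, whose sixteen left rotations total 28 bit positions, the full width of each register. The sketch below assumes the standard schedule from the DES specification (FIPS 46-3); it is a software illustration, not a description of the VHDL, and des_c_sequence is a hypothetical helper name. The same reasoning applies to the D register.

```c
#include <stdint.h>

/* Standard DES left-rotation amounts for rounds 1..16 (FIPS 46-3). */
static const int DES_SHIFTS[16] = {1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1};

/* Rotate a 28-bit value by n positions (0 < n < 28). */
static uint32_t rotl28(uint32_t v, int n) {
    return ((v << n) | (v >> (28 - n))) & 0x0FFFFFFFu;
}
static uint32_t rotr28(uint32_t v, int n) {
    return ((v >> n) | (v << (28 - n))) & 0x0FFFFFFFu;
}

/* Fill c[0..15] with the C register value used for each successive round.
   dir = 0: encryption order C1, C2, ..., C16.
   dir = 1: decryption order C16 (= C0), C15, ..., C1. */
void des_c_sequence(uint32_t c0, int dir, uint32_t c[16]) {
    uint32_t cur = c0 & 0x0FFFFFFFu;
    for (int i = 0; i < 16; i++) {
        if (dir == 0)
            cur = rotl28(cur, DES_SHIFTS[i]);
        else if (i > 0)
            cur = rotr28(cur, DES_SHIFTS[16 - i]);   /* undo one encryption step */
        c[i] = cur;
    }
}
```

In decryption order the first value is the unrotated C0, which is why no rotation is applied when the registers are loaded with dir set to logic one.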
When the desf instruction is in the LEON2 execute stage, the advin input
is set to logic one. The adv register keeps track of the last value of advin. When
the present state of advin is logic zero and the state of adv is logic one, the key
generator interprets this condition to mean that the processor has completed the desf
instruction. On the next rising edge of the clock, the key generator will output the
round key for the next round of the cipher.
Figure 20: DES key generator
6.1.4 Modulo ($2^{16} + 1$) Multiplier
Multiplication modulo $(2^{16} + 1)$ is handled by this module. The design is
based on work by Beuchat [111]. Specifically, the adder-based modular multiplier
in [111] is used as the reference for the multiplier in this research. This is because
the other designs in [111] take advantage of embedded multipliers in Xilinx Virtex-2
FPGAs. The hardware developed in this research is designed to accommodate any
type of target hardware and therefore does not force the use of any device's special
features.
The modular multiplier first generates partial products reduced modulo
$(2^{16} + 1)$ as described in Zimmerman's investigation of efficient architectures for
arithmetic modulo $(2^n \pm 1)$ [112]. Each partial product is determined by the formula:
\[
PP_i = x_i \cdot y_{15-i} \cdots y_0 \, \bar{y}_{15} \cdots \bar{y}_{16-i} + \bar{x}_i \cdot 0 \cdots 0 1 \cdots 1
\]
where the vector $0 \cdots 0 1 \cdots 1$ contains $16 - i$ zeros and $i$ ones. In order to
handle cases where x = 0 or y = 0, a correction term k is defined:
\[
k =
\begin{cases}
2 & \text{if } x = 0 \text{ and } y = 0, \\
x + 3 & \text{if } x = 0 \text{ and } y \neq 0, \\
y + 3 & \text{if } x \neq 0 \text{ and } y = 0, \\
1 & \text{if } x \neq 0 \text{ and } y \neq 0.
\end{cases}
\]
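An intermediate sum s is then calculated: $s = k + \sum_{i=0}^{15} PP_i$. The final step
is a reduction modulo $(2^{16} + 1)$ of s. Defining $s_L$ to be the sixteen least significant bits
and $s_H$ to be the remaining high-order bits of s, the result of the multiplication
is:
\[
s \bmod (2^{16} + 1) = (s_L + 2^{16} s_H) \bmod (2^{16} + 1) = (s_L + \bar{s}_H + 2) \bmod (2^{16} + 1)
\]
Here $\bar{s}_H$ denotes the 16-bit bitwise complement of $s_H$; the second equality holds
because $2^{16} \equiv -1$ and $-s_H \equiv \bar{s}_H + 2 \pmod{2^{16} + 1}$.
By the Low-High Lemma for reduction modulo $(2^{16} + 1)$ [66]:
\[
(s_L + \bar{s}_H + 2) \bmod (2^{16} + 1) =
\begin{cases}
(s_L + \bar{s}_H + 2) \bmod 2^{16} & \text{if } s_L + \bar{s}_H + 1 < 2^{16}, \\
(s_L + \bar{s}_H + 1) \bmod 2^{16} & \text{if } s_L + \bar{s}_H + 1 \geq 2^{16}.
\end{cases}
\]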
Additions are implemented in the modular multiplier with a carry-propagate
adder tree. A generic model is specified in a VHDL source file separate from the
multiplier source file. The generic component design allows for adjustable width of the
inputs. For example, when the width n = 16, the output is a 17-bit number where
the carry bit has been integrated into the sum as the most significant bit.
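The final reduction step can be checked in software. The sketch below reads the $s_H$ term of the reduction as its 16-bit bitwise complement, which is the reading under which the identity $2^{16} \equiv -1 \pmod{2^{16}+1}$ makes the formula hold; this reading, the function name, and the restriction to a 32-bit sum are assumptions of the sketch. A result of $2^{16}$ appears as zero after truncation to sixteen bits.

```c
#include <stdint.h>

/* Reduce s modulo 2^16 + 1 using only a 16-bit complement, additions,
   and a comparison against 2^16 (the carry out of a 16-bit adder).
   Assumption: s fits in 32 bits, so s_H is at most 16 bits wide. */
uint16_t reduce_mod_65537(uint32_t s) {
    uint32_t s_l     = s & 0xFFFFu;              /* sixteen least significant bits */
    uint32_t s_h_bar = (~(s >> 16)) & 0xFFFFu;   /* 16-bit complement of s_H       */
    uint32_t t       = s_l + s_h_bar + 1u;
    if (t < 0x10000u)
        return (uint16_t)((t + 1u) & 0xFFFFu);   /* (s_L + ~s_H + 2) mod 2^16 */
    return (uint16_t)(t & 0xFFFFu);              /* (s_L + ~s_H + 1) mod 2^16 */
}
```

The comparison with 0x10000 corresponds to inspecting the carry bit of the adder, so no division is required.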
6.1.5 AES S-Boxes
The forward and inverse S-Boxes are implemented as hardware-based look-
up tables. The dir signal selects the S-Box output.
6.1.6 Galois Field Fixed Field Constant Multiplier
This functional unit performs the MixColumns or InvMixColumns operation
required in the AES cipher Rijndael. The architecture of this multiplier is
described in [87], [88]. Recall from Chapter 4 that the MixColumns and InvMixColumns
operations are a matrix multiplication over the Galois Field GF($2^8$) on each column
of the state by a 4 × 4 fixed field constant matrix. This means that a total of sixteen
multiplications in the Galois Field GF($2^8$) must be performed to complete the entire
operation. Each product must then be reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$,
the irreducible polynomial specified for AES. To accomplish the multiplication and
modular reduction simultaneously, the operation can be represented as an 8 × 8
matrix multiplication over the Galois Field GF(2). The constants in the inner matrix
are determined by the constant factor in the multiplication and the polynomial m(x).
As an example, consider the representative Galois Field GF($2^8$), used by
AES in the MixColumns and InvMixColumns transformations. Note that $[A_3 : A_0]$
are the input bytes and $[B_3 : B_0]$ are the output bytes [89]:
\[
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \end{bmatrix}
=
\begin{bmatrix}
K_{00} & K_{01} & K_{02} & K_{03} \\
K_{10} & K_{11} & K_{12} & K_{13} \\
K_{20} & K_{21} & K_{22} & K_{23} \\
K_{30} & K_{31} & K_{32} & K_{33}
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix}
\]
The core operation in this fixed field multiplication is an 8-bit inner product
that must be performed sixteen times, four per row. The four inner products of each
row are then XORed to form the final output word. For a known irreducible polynomial
p(x), k(x) (representing the 8-bit constant), and a generic input a(x), we create a
polynomial equation of the form $b(x) = a(x) \cdot k(x) \bmod p(x)$ where each coefficient
of b(x) is a function of a(x). This results in an 8-bit × 8-bit matrix representing the
coefficients of b(x) in terms of a(x) [89]. To illustrate the creation of this matrix, the
following example is provided.
Let:
\[
\begin{aligned}
k(x) &= (02)_{16} = (00000010)_2 = x \\
p(x) &= x^8 + x^4 + x^3 + x + 1 \\
a(x) &= a_7 x^7 + a_6 x^6 + a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0
\end{aligned}
\]
Therefore, we see that:
\[
b(x) = \left( a_7 x^8 + a_6 x^7 + a_5 x^6 + a_4 x^5 + a_3 x^4 + a_2 x^3 + a_1 x^2 + a_0 x \right) \bmod p(x)
\]
Reducing modulo p(x) results in:
\[
b(x) = a_6 x^7 + a_5 x^6 + a_4 x^5 + [a_7 + a_3] x^4 + [a_7 + a_2] x^3 + a_1 x^2 + [a_7 + a_0] x + a_7
\]
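This yields the resultant mapping:
\[
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \end{bmatrix}
\]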
An 8-bit × 8-bit matrix must be generated for each $K_{xy}$, resulting in a
total of sixteen matrices. Note that this analysis holds true for Galois Fields other
than GF($2^8$) with corresponding adjustments to the mapping matrix used to calculate
$b(x) = a(x) \cdot k(x) \bmod p(x)$.
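The construction of a mapping matrix for an arbitrary constant can be sketched in software: column j of the matrix is the product $k(x) \cdot x^j \bmod p(x)$, and each output bit is the parity of one row ANDed with the input. In the sketch below, each row is packed into a byte whose bit j holds the coefficient of $a_j$; this packing and the function names are assumptions made for illustration and do not describe how gfmkld operands are laid out.

```c
#include <stdint.h>

/* Multiplication in GF(2^8) modulo p(x) = x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1u) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0u));
        b >>= 1;
    }
    return p;
}

/* Build the 8 x 8 GF(2) matrix for the constant k.
   rows[i] bit j is the coefficient of a_j in the expression for b_i. */
void build_const_matrix(uint8_t k, uint8_t rows[8]) {
    for (int i = 0; i < 8; i++) rows[i] = 0;
    for (int j = 0; j < 8; j++) {
        uint8_t col = gf_mul(k, (uint8_t)(1u << j));   /* k(x) * x^j mod p(x) */
        for (int i = 0; i < 8; i++)
            if (col & (1u << i)) rows[i] |= (uint8_t)(1u << j);
    }
}

/* Multiply the matrix by the input bit vector a over GF(2). */
uint8_t apply_const_matrix(const uint8_t rows[8], uint8_t a) {
    uint8_t b = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t v = rows[i] & a, parity = 0;
        while (v) { parity ^= (uint8_t)(v & 1u); v >>= 1; }   /* XOR of selected bits */
        b |= (uint8_t)(parity << i);
    }
    return b;
}
```

For k = 0x02 the rows come out as 0x80, 0x81, 0x02, 0x84, 0x88, 0x10, 0x20, 0x40 for $b_0$ through $b_7$, which is the mapping matrix of the example when its columns are read as $a_0$ through $a_7$.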
In order to ensure correct constant loading using the gfmkld instruction, a
slight modification was made to the constant matrix load enable logic. The original
architecture would load one of the sixteen constant matrices on each rising clock edge
provided that the enable input was set to logic one. However, it was found that
the gfmkld instruction may be held in the execute stage for more than one clock
cycle when integrated into the LEON2 datapath. This led to more than one constant
matrix receiving the same configuration bits. Therefore, the constant loading
enable logic now holds the previous state of the enable input in a register. When the
current input is logic zero and the previous input is logic one, this is interpreted as
a gfmkld instruction leaving the execute stage. This transition may be forced, for
example, by inserting nop instructions between consecutive gfmkld instructions. The
constants provided in the source operands are then stored into the appropriate constant
matrix, and the other matrices may be loaded in the same manner. Although this
mechanism increases the start-to-end time for full configuration of the multiplier, this
configuration needs to be done only once on initialization and each time the cipher
direction is reversed. Therefore, this modification does not affect the performance of
the MixColumns and InvMixColumns operations.
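The modified enable logic amounts to a falling-edge detector on the enable input. The cycle-level sketch below models only that behavior; the structure and function names are hypothetical and are not taken from the VHDL source.

```c
/* State of the load-enable logic: the registered previous value of the
   enable input and a count of constant matrices loaded so far. */
typedef struct {
    int prev_enable;
    int loads;
} load_ctrl_t;

/* Advance the model by one rising clock edge.
   Returns 1 when a constant matrix is latched, which happens only when
   enable was 1 on the previous edge and is 0 on this one. */
int load_ctrl_clock(load_ctrl_t *c, int enable) {
    int fire = (enable == 0 && c->prev_enable == 1);
    c->prev_enable = enable;
    if (fire) c->loads++;
    return fire;
}
```

Holding enable at logic one for several cycles therefore yields a single load, while two gfmkld instructions issued back to back would keep enable high and be seen as one; the intervening nop provides the low cycle that separates them.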
6.2 Architecture Modifications
The new module /leon/ext_config.vhd has been included with the VHDL
source for the LEON2 processor architecture to provide an easy way to select specific
extensions to be included. For the AES S-Box extensions, the available options are
none for no S-Boxes, sbox for one S-Box (to implement the aessb instruction), and
sbox4 for four S-Boxes (to implement the aessb4 instruction). For all other types
of extensions, setting the configuration variables to true includes extensions into the
architecture; a value of false excludes them from the architecture.
All of the custom functional units that support the proposed instruction set
extensions have been included into the IU, described in the module /leon/iu.vhd.
Component declarations were added for each of the custom units and instantiated as
part of the arithmetic logic unit (ALU). The decode stage of the IU pipeline sets flags
for the custom instructions and generates source register and immediate data. On the
next clock cycle, the execute stage passes the input operands to the appropriate
functional unit based on which instruction flag has been set. The result is then read from
the functional unit when the instruction specifies a destination register to receive an output.
The SPARC® assembly instruction opcodes are defined in the /leon/sparcv8.vhd module
[46]. The lines added to this module include the values for the op3 field of the new
instructions.
Provided in the LEON2 base package is a generic test bench with disassembly
support. During a functional simulation, assembly instructions are printed
out to the simulation software's console window as they are executed. The module
/leon/debug.vhd contains the functions that handle display of the instruction strings,
and disassembly support for the new instructions has been added. When the op and
op3 fields match those of one of the custom instructions, the instruction name
is printed followed by any applicable source/destination registers or immediate data
for that instruction. Refer to Appendix B for specific edits to the above mentioned
VHDL source code.
6.3 Modifications to Software Development Tools
Building the set of development tools from the source files is necessary when
extending the instruction set of the LEON2 processor. The source code archive for
BCC v1.0.29c includes specific modifications made to two different versions of GNU
binutils to support the LEON2 processor. For this research the newer supported
version of binutils, v2.16.1, is used. After extracting these files to the binutils source
folder, the file /opcodes/sparc-opc.c was further edited to include op3 values for the
custom instructions. The specific additions are located in Appendix B.
7 RESULTS AND ANALYSIS
7.1 Testing Methodology
An extensive test suite has been constructed to verify the functionality of
the custom instruction set extensions and to obtain algorithm execution times. The
test programs can be built to perform either encryption or decryption. Separate
main test program source files are maintained for the ECB and CBC modes of
operation. In order to see the effect of each individual extension, tests included all valid
combinations of extensions for each algorithm.
Building each of the test programs produces an ELF dump file called ram.dat
and an S-record file called sdram.rec. These files are copied into the /tsource/ folder
for use with the supplied testbench. After starting ModelSim and compiling the
LEON2 VHDL source, the command vsim tb_func32_disas is issued to load the test
program.
For each of the four target algorithms there were two general types of tests,
written in the C language with inline SPARC® assembly code:
• Functional verification tests read a set of test vectors one at a time from
  a table compiled into the executable. There are twelve test vectors for each
  algorithm and they are all specific to the ECB mode of operation.
• Performance test programs are built to perform one iteration of the target
  algorithm. The execution time is then measured from within the ModelSim test
  bench by subtracting the simulation start time of the routine from the end time
  of the routine.
Separate source files are maintained for each of the following types of
algorithm implementation:
• Encryption in ECB mode
• Encryption in CBC mode
• Decryption in ECB mode
• Decryption in CBC mode
Compile-time parameters select which custom instructions are included for the
algorithm under test. Refer to Appendix C for all test vectors and
Appendix D for test source code. The test source files specific to AES are based on
code used with permission from Stefan Tillich of the ISEC Project at Graz University
of Technology.
7.2 Software Code Size
The following tables present sizes in bytes of the executable code required
to implement the algorithms with different combinations of custom instructions. The
changes to program code sizes relative to the baseline implementation are also given
in these tables.
                Baseline    permute instrs.      key and f instrs.    all extensions
                Size        Size   Imp. factor   Size   Imp. factor   Size   Imp. factor
Key Schedule    1524        1524   1.000            8   190.500          8   190.50
Encrypt ECB     2036        1600   1.273          556   3.662          112   18.17
Encrypt CBC     2088        1636   1.276          632   3.222          176   11.86
Decrypt ECB     2036        1596   1.276          556   3.662          112   18.17
Decrypt CBC     2088        1640   1.273          652   3.202          192   10.88
Table 13: Code size in bytes for DES

                Baseline    permute instrs.      key and f instrs.    all extensions
                Size        Size   Imp. factor   Size   Imp. factor   Size   Imp. factor
Key Schedule    1524        1524   1.000           24   63.500          24   63.500
Encrypt ECB     2176        1740   1.251          664   3.277          224   9.714
Encrypt CBC     2208        1760   1.255          692   3.191          252   8.762
Decrypt ECB     2176        1740   1.251          664   3.277          224   9.714
Decrypt CBC     2220        1776   1.250          704   3.153          256   8.672
Table 14: Code size in bytes for Triple-DES

The code size data for the DES and Triple-DES algorithms show that both
types of extension instructions have significant effects on cipher operations. The IP
and FP instructions decrease the total code size by 20% to 21.5% but have no effect
on the key schedule since IP and FP are not used in computing the round keys.
Instructions supporting the round key generation and round function have a much
more pronounced impact on code size. When these instructions are used, all of
the lengthy permutation routines and memory-based S-Boxes are no longer needed.
Encryption and decryption code size is reduced by approximately 70% on average
with the use of these instructions alone. Key schedules are handled by just two
instructions, making the respective code size just a small percentage of the remaining
program code needed to implement the encryption and decryption routines.
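The improvement factors and percentage reductions quoted for the code size tables can be reproduced with a short calculation. The following is a minimal sketch, assuming the improvement factor is simply the baseline size divided by the size with extensions; the function names are illustrative and do not come from the test suite.

```python
def improvement_factor(baseline_bytes: int, extended_bytes: int) -> float:
    """Ratio of the baseline code size to the code size with extensions."""
    return baseline_bytes / extended_bytes


def percent_reduction(baseline_bytes: int, extended_bytes: int) -> float:
    """Percentage by which the code shrinks relative to the baseline."""
    return 100.0 * (baseline_bytes - extended_bytes) / baseline_bytes


# DES ECB encryption sizes in bytes, copied from Table 13.
DES_ENCRYPT_ECB = {"baseline": 2036, "permute": 1600, "key_and_f": 556}

permute_factor = improvement_factor(
    DES_ENCRYPT_ECB["baseline"], DES_ENCRYPT_ECB["permute"]
)
key_f_reduction = percent_reduction(
    DES_ENCRYPT_ECB["baseline"], DES_ENCRYPT_ECB["key_and_f"]
)

print(f"permute instrs.: improvement factor {permute_factor:.4f}")
print(f"key and f instrs.: code size reduced by {key_f_reduction:.1f}%")
```

For this row the key and f instructions alone remove roughly 73% of the code, in line with the approximately 70% average stated above.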
                          Baseline   with mmul16
                          Size       Size   Imp. factor
Key Schedule (encrypt)     436        436   1.000
Key Schedule (decrypt)     844        760   1.111
Encrypt ECB                596        228   2.614
Encrypt CBC                688        272   2.566
Decrypt ECB                612        244   2.508
Decrypt CBC                736        328   2.244
Table 15: Code size in bytes for IDEA

In the code size figures for IDEA, the decryption key schedule byte count
includes that of the encryption key schedule. This is because the decryption keys are
determined from the encryption keys as detailed in Chapter 4. The key schedule for
decryption sees a slight decrease in code size with the use of the mmul16 instruction
because multiplicative inverses modulo $2^{16} + 1$ are calculated by repeated
square-and-multiply operations. The addition of the mmul16 instruction significantly
decreases the required code size for both encryption and decryption in nearly equal
proportions. Due to the absence of a hardware multiplier in the LEON2 integer unit,
the multiplication is performed by a library function whose code size is included in
the figures given in the above table (hardware multipliers supplied with the LEON2
processor package are not included in the scope of this research).
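For reference, the arithmetic that the mmul16 instruction replaces can be modelled in a few lines. The sketch below is a software model of IDEA multiplication modulo $2^{16} + 1$ under the usual IDEA convention that the all-zero 16-bit word represents $2^{16}$. It is not the library routine or the hardware design used in this work, and the Python function name is only borrowed from the instruction for readability.

```python
MODULUS = (1 << 16) + 1  # 65537, which is prime


def mmul16(a: int, b: int) -> int:
    """Multiply two 16-bit IDEA operands modulo 2^16 + 1.

    The 16-bit word 0 stands for the value 2^16, so every operand is a
    nonzero residue modulo 65537 and always has an inverse.
    """
    a = (a & 0xFFFF) or 0x10000   # decode: word 0 means 2^16
    b = (b & 0xFFFF) or 0x10000
    product = (a * b) % MODULUS   # result lies in 1..65536
    return product & 0xFFFF       # encode: 2^16 goes back to word 0


if __name__ == "__main__":
    print(hex(mmul16(0x0003, 0x0005)))
    print(hex(mmul16(0x0000, 0x0002)))
```

Because $2^{16} \equiv -1 \pmod{2^{16} + 1}$, multiplying by the zero word negates the other operand modulo 65537, which is a convenient sanity check for any implementation.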
                Baseline   with aessb            with aessb4
                Size       Size   Imp. factor    Size   Imp. factor
Key Schedule     216        172   1.256           160   1.350
Encrypt ECB     1512       1132   1.336          1016   1.488
Encrypt CBC     1592       1212   1.314          1096   1.567
Decrypt ECB     2236       1632   1.370          1516   1.475
Decrypt CBC     2340       1736   1.348          1620   1.444
Table 16: Code size in bytes for AES without gfmmul instruction

                Baseline   with gfmmul           aessb + gfmmul        aessb4 + gfmmul
                Size       Size   Imp. factor    Size   Imp. factor    Size   Imp. factor
Key Schedule     216        216   1.000           172   1.256           160   1.350
Encrypt ECB     1512       1336   1.132           944   1.602           804   1.881
Encrypt CBC     1592       1416   1.124          1024   1.555           884   1.801
Decrypt ECB     2236       1604   1.394           932   2.399           828   2.700
Decrypt CBC     2340       1708   1.370          1036   2.259           932   2.511
Table 17: Code size in bytes for AES with gfmmul instruction

For the AES instruction set extensions that support the SubBytes operations,
a decrease in program code size by about 20% is observed for the key schedule,
and by 24-27% for encryption and decryption. The improvement is more pronounced
for decryption due to the need for both the forward and inverse S-Boxes; only the
forward S-Box is used in encryption and in the key schedule. When only the gfmmul
instruction is incorporated, code size decreases by about 11% for encryption but by
over 27% for decryption. The original routine used to perform the InvMixColumns
operation requires many more operations than MixColumns. Each operation needs
only four instances of the gfmmul instruction, one for each column of the state.
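The per-column matrix products behind MixColumns and InvMixColumns can be illustrated in software. The following is a minimal sketch of constant-matrix multiplication over GF(2^8) with the AES reduction polynomial. It models only the arithmetic, not the operand format or configuration registers of the gfmmul functional unit, and all names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B
    return product


# Constant matrices of the AES MixColumns and InvMixColumns transformations.
MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
INV_MIX = [[14, 11, 13, 9], [9, 14, 11, 13], [13, 9, 14, 11], [11, 13, 9, 14]]


def gf_matrix_mul(matrix, column):
    """Multiply a 4x4 constant matrix by one 4-byte state column."""
    result = []
    for row in matrix:
        acc = 0
        for coefficient, byte in zip(row, column):
            acc ^= gf_mul(coefficient, byte)   # addition in GF(2^8) is XOR
        result.append(acc)
    return result


if __name__ == "__main__":
    column = [0xDB, 0x13, 0x53, 0x45]
    mixed = gf_matrix_mul(MIX, column)
    print([hex(b) for b in mixed])
    print([hex(b) for b in gf_matrix_mul(INV_MIX, mixed)])
```

The inverse matrix has larger coefficients (9, 11, 13, 14 instead of 1, 2, 3), which is why a software InvMixColumns routine needs noticeably more shift-and-XOR steps than MixColumns, while a matrix multiplier treats both cases identically.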
7.3 Algorithm Execution Times
The following tables present the number of clock cycles required to complete
a full iteration of each algorithm with all possible combinations of custom instruc-
tions for that algorithm. The speedup factors are relative to the baseline software
implementation.
                Baseline   permute instrs.    key and f instrs.   all extensions
                Cycles     Cycles   Speedup   Cycles   Speedup    Cycles   Speedup
Key Schedule     3522       3522    1.00         10    352.20        10    352.20
Encrypt ECB      5416       5089    1.06        464    11.67        142    38.14
Encrypt CBC      5448       5121    1.06        504    10.81        175    31.13
Decrypt ECB      5397       5091    1.06        464    11.63        142    38.01
Decrypt CBC      5494       5172    1.06        538    10.21        184    29.86
Table 18: Execution cycles for DES

                Baseline   permute instrs.    key and f instrs.   all extensions
                Cycles     Cycles   Speedup   Cycles   Speedup    Cycles   Speedup
Key Schedule     9392       9392    1.00         30    313.10        30    313.10
Encrypt ECB     14161      14017    1.01        722    19.61        449    31.54
Encrypt CBC     14341      14032    1.02        786    18.25        470    30.51
Decrypt ECB     14166      14016    1.01        733    19.33        449    31.55
Decrypt CBC     14389      14058    1.02        757    19.01        462    31.15
Table 19: Execution cycles for Triple-DES

As indicated in the data for throughput improvements for DES and Triple-DES,
the instructions supporting round key generation and the round function have
the greatest impact on decreases in execution cycles. This effect is more pronounced
for Triple-DES. The opposite is true for the permutation instructions: although there
is a slight overall improvement for both DES and Triple-DES, each of these algorithms
uses the IP and FP only once per data block. The longer execution time for Triple-DES
leads to a less noticeable effect from the permutation instructions.
                          Baseline   with mmul16
                          Cycles     Cycles   Speedup
Key Schedule (encrypt)      1155       1155   1.00
Key Schedule (decrypt)     41127       5886   6.99
Encrypt ECB                 2646        485   5.46
Encrypt CBC                 2840        553   5.14
Decrypt ECB                 2621        486   5.39
Decrypt CBC                 2679        552   4.85
Table 20: Execution cycles for IDEA

The use of the mmul16 instruction has decreased the execution cycle count
by a factor of about 5.2 on average for the encryption and decryption routines. The
decryption key schedule achieves approximately a seven-fold speedup because the
square-and-multiply process uses thirty issues of the mmul16 instruction as opposed
to thirty calls to the multiplication subroutine, which takes several more clock cycles
to complete.
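One way to arrive at a count of thirty multiplications per inverse is left-to-right square-and-multiply with the exponent $2^{16} - 1$, since Fermat's little theorem gives $x^{-1} \equiv x^{65535} \pmod{2^{16} + 1}$. The sketch below is an assumption about the structure of the routine rather than a transcription of the test code. It works on residues in the range 1 to 65536 and uses a plain modular multiply as a stand-in for one mmul16 issue.

```python
from typing import Tuple

MODULUS = (1 << 16) + 1  # 65537 is prime


def mul_mod(a: int, b: int) -> int:
    """Stand-in for a single mmul16 issue: multiply modulo 2^16 + 1."""
    return (a * b) % MODULUS


def inverse_square_and_multiply(x: int) -> Tuple[int, int]:
    """Return (x^-1 mod 65537, number of modular multiplications used).

    The exponent 65535 consists of sixteen 1-bits. The leading bit is
    absorbed by initialising the result to x; each of the remaining
    fifteen bits costs one squaring and one multiplication.
    """
    exponent = MODULUS - 2
    result = x
    multiplications = 0
    for bit in range(exponent.bit_length() - 2, -1, -1):
        result = mul_mod(result, result)       # square
        multiplications += 1
        if (exponent >> bit) & 1:
            result = mul_mod(result, x)        # multiply
            multiplications += 1
    return result, multiplications


if __name__ == "__main__":
    inverse, count = inverse_square_and_multiply(3)
    print(inverse, count, mul_mod(3, inverse))
```

With fifteen squarings and fifteen multiplications per inverse, replacing each subroutine call with a single instruction accounts for most of the reduction in the decryption key schedule.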
                Baseline   with aessb         with aessb4
                Cycles     Cycles   Speedup   Cycles   Speedup
Key Schedule      824        589    1.40        547    1.51
Encrypt ECB      2363       2014    1.17       1821    1.30
Encrypt CBC      2472       2112    1.17       1900    1.30
Decrypt ECB      3818       3374    1.13       3228    1.18
Decrypt CBC      3963       3535    1.12       3397    1.17
Table 21: Execution cycles for AES without gfmmul instruction

                Baseline   with gfmmul        aessb + gfmmul     aessb4 + gfmmul
                Cycles     Cycles   Speedup   Cycles   Speedup   Cycles   Speedup
Key Schedule      824        824    1.00        598    1.40        560    1.47
Encrypt ECB      2363       1457    1.62        978    2.42        760    3.11
Encrypt CBC      2472       1549    1.60       1090    2.27        886    2.79
Decrypt ECB      3818       1583    2.41        960    3.98        779    4.90
Decrypt CBC      3963       1713    2.31       1129    3.51        965    4.11
Table 22: Execution cycles for AES with gfmmul instruction

The data in Tables 21 and 22 for the AES algorithm indicate that
implementation of logic-based look-up tables for the S-Boxes to perform the SubBytes
and InvSubBytes operations can introduce throughput increases ranging from 12% to
30%. There is overall a more significant impact from the Galois Field matrix
multiplication, especially for decryption. The differences in speedup factors for
encryption and decryption can be explained by the same reason as the differences in
code size reduction: the original InvMixColumns operation requires many more steps
to complete than MixColumns.
In the following two tables, execution cycle counts for some combinations
of extensions are compared with results published by the ISEC project at the Graz
University of Technology [29]. Their work includes the use of look-up tables as a choice
of hardware implementation for the S-Boxes. The sbox and sbox4 instructions from
the ISEC project are equivalent to the aessb and aessb4 instructions presented in
this research. The mixcol4 instruction takes an entire 32-bit word as input into
an AES-specific MixColumns/InvMixColumns functional unit. In order to address
the ShiftRows operation, the instructions sbox4s and mixcol4s combine an implicit
ShiftRows operation with SubBytes and MixColumns respectively. Table 23 shows
results of the ISEC extensions for AES used in software implementations written in C
with inline assembly, and Table 24 shows results from ISEC based on pure assembly
implementations.

                                 Cycles              Improvement Factor
Implementation                   Encrypt   Decrypt   Encrypt   Decrypt
No Extensions [ISEC]               1637      1955     1.00      1.00
No Extensions [This work]          2363      3818     1.00      1.00
sbox4 [ISEC]                       1020      1435     1.60      1.36
aessb4                             1821      3228     1.30      1.18
mixcol4 [ISEC]                      939       970     1.74      2.02
gfmmul                             1457      1583     1.62      2.41
sbox and mixcol4 [ISEC]             458       458     3.57      4.27
aessb and gfmmul                    978       960     2.42      3.98
sbox4s and mixcol4s [ISEC]          458       459     3.57      4.26
aessb4 and gfmmul                   760       779     3.11      4.90
Table 23: Comparison with ISEC Extensions in C/Inline Assembly

                                 Cycles              Improvement Factor
Implementation                   Encrypt   Decrypt   Encrypt   Decrypt
No Extensions [ISEC]               1637      1955     1.00      1.00
No Extensions [This work]          2363      3818     1.00      1.00
sbox4 [ISEC]                        718      1061     2.28      1.84
aessb4                             1821      3228     1.30      1.18
sbox and mixcol4 [ISEC]             337       330     4.86      5.92
aessb and gfmmul                    978       960     2.42      3.98
sbox4s and mixcol4s [ISEC]          196       196     8.35      9.97
aessb4 and gfmmul                   760       779     3.11      4.90
Table 24: Comparison with ISEC Extensions in Pure Assembly

These results show that the instruction set extensions for AES proposed
in this research compare well with the custom instructions proposed in [29] when
employed in the same type of software implementation. However, the pure assembly
implementations of AES using the ISEC extensions exhibit significantly better
performance. This supports the notion that hand-coded assembly can have improved
performance over C-based implementations due to compiler limitations. The evaluation
test programs used in this research could hypothetically produce lower cycle counts
overall if converted to pure assembly implementations.
7.4 Hardware Utilization
The following table shows the component usage of the Xilinx Virtex-4
XC4VLX25 FPGA device by each custom functional unit. The total utilizations of
the LEON2 with combinations of all extensions for each targeted algorithm are also
presented. The choice of FPGA was based on the device's large amount of available
logic resources. Synthesis was performed with the XST synthesis tool from within
the Xilinx ISE 8.1i development environment.

                                                     4-input   Max. Freq.   Max. comb.
                               Slices   Flip-flops   LUTs      (MHz)        path delay (ns)
FPGA available resources       10752    10752        21504     N/A          N/A
Base LEON2                      4395     1798         7142     130.037      No path found
DES permutation unit              32        0           64     No period    6.359
DES key generator                175      153          325     320.015      No path found
DES f-function unit               92        0          176     No period    6.639
modulo (2^16 + 1) multiplier     229       37          431     No period    No path found
AES S-boxes                      133        0          264     No period    8.216
GF matrix multiplier            1023     1057         1841     265.647      6.359
LEON2 w/extensions:
  DES perm. unit                4521     1806         7384     125.016      No path found
  DES keygen/f-func. units      4756     1940         7827     126.644      No path found
  all DES units                 4809     1921         7908     122.063      No path found
  mod (2^16 + 1) mult.          4768     1849         7838     120.195      No path found
  AES S-Box                     4648     1808         7617     122.106      No path found
  4 AES S-Boxes                 5000     1790         8306     127.008      No path found
  GF matrix mult.               5011     2867         8096     124.690      No path found
  S-Box + GF mult.              5254     2866         8526     122.947      No path found
  4 S-Boxes + GF mult.          5559     2847         9154     122.529      No path found
LEON2 with all extensions       6351     3063        10633     116.968      No path found
Table 25: Hardware utilization on the Xilinx XC4VLX25 FPGA

Most of the individual custom hardware units require an amount of logic
resources that is less than 3% of the total available resources of the XC4VLX25. The
exception is the Galois Field matrix multiplier; due to the number of storage bits
needed for matrix configuration and combinational logic used to compute the matrix
product, the Galois Field matrix multiplier takes up about 10% of the available FPGA
area. Note that LEON2 configurations with extensions have greater logic utilization
than the sum of the base LEON2 configuration plus the utilization of individual
custom hardware units. This is due to the additional logic required to handle the
custom instructions in the LEON2 IU pipeline stages. A configuration of the LEON2
with all of the proposed instruction set extensions leads to a total area increase over
the base configuration by approximately 45%.
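The area percentages quoted above follow directly from the slice counts in Table 25. The sketch below recomputes them; the dictionary keys and variable names are illustrative, and only slice counts are considered (flip-flops and LUTs are ignored).

```python
FPGA_SLICES = 10752            # XC4VLX25 total, from Table 25
BASE_LEON2_SLICES = 4395
ALL_EXTENSIONS_SLICES = 6351

# Slice counts of the stand-alone functional units, from Table 25.
UNIT_SLICES = {
    "DES permutation unit": 32,
    "DES key generator": 175,
    "DES f-function unit": 92,
    "modulo (2^16 + 1) multiplier": 229,
    "AES S-boxes": 133,
    "GF matrix multiplier": 1023,
}


def share_of_device(slices: int) -> float:
    """Percentage of the FPGA's slices occupied by a unit."""
    return 100.0 * slices / FPGA_SLICES


area_increase = (
    100.0 * (ALL_EXTENSIONS_SLICES - BASE_LEON2_SLICES) / BASE_LEON2_SLICES
)

# Extra slices needed to integrate the DES permutation unit into the
# pipeline, beyond the base LEON2 plus the stand-alone unit.
des_perm_integration_overhead = 4521 - (BASE_LEON2_SLICES + 32)

for name, slices in UNIT_SLICES.items():
    print(f"{name}: {share_of_device(slices):.1f}% of device")
print(f"all extensions: +{area_increase:.1f}% over base LEON2")
print(f"DES perm. unit integration overhead: {des_perm_integration_overhead} slices")
```

The last figure illustrates the note above: integrating even the smallest unit costs more slices in pipeline glue logic than the unit itself occupies.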
Applying the proposed extensions to the LEON2 processor also caused a
slight decrease in the maximum operating frequency. This can be attributed both
to the combinational path delays of the custom hardware, and the additional logic
required for the custom hardware to interface with the remainder of the LEON2 IU.
The modulo $(2^{16} + 1)$ multiplier for IDEA was the single largest contributor to
increases in minimum clock period. A pure combinational design of the multiplier has
large path delays due to the carry-propagate adder structure. Synthesis of the LEON2
with a pure combinational implementation of the modulo $(2^{16} + 1)$ multiplier
reported a maximum operating frequency of only 72 MHz. The multiplier was enhanced
with registers to hold intermediate values in order to decrease the effect on the
processor clock period. With all extensions implemented, the LEON2 processor
synthesized on the target FPGA has a clock frequency of about 117 MHz, a decrease of
about 10% compared to the base LEON2 processor architecture.
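The frequency figures can also be read as minimum clock periods, which makes the cost of the purely combinational multiplier easier to see. This is a small worked calculation using the frequencies reported above and in Table 25; the constant names are illustrative.

```python
def clock_period_ns(frequency_mhz: float) -> float:
    """Minimum clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / frequency_mhz


BASE_MHZ = 130.037               # base LEON2, Table 25
ALL_EXTENSIONS_MHZ = 116.968     # LEON2 with all extensions, Table 25
COMBINATIONAL_MULT_MHZ = 72.0    # LEON2 with unregistered multiplier

frequency_drop_percent = 100.0 * (BASE_MHZ - ALL_EXTENSIONS_MHZ) / BASE_MHZ

print(f"base period:            {clock_period_ns(BASE_MHZ):.2f} ns")
print(f"all extensions period:  {clock_period_ns(ALL_EXTENSIONS_MHZ):.2f} ns")
print(f"combinational mult.:    {clock_period_ns(COMBINATIONAL_MULT_MHZ):.2f} ns")
print(f"frequency decrease:     {frequency_drop_percent:.2f}%")
```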
7.5 Throughput to Area Comparisons
This section presents throughput figures for baseline implementations and
implementations with extensions of each algorithm when performing ECB encryption.
The results are compared to the hardware utilization required to run each
implementation. The intent of these comparisons is to show the cost in hardware of
improving the execution time of the targeted algorithms using the proposed instruction
set extensions. Throughput figures are given in Megabits per second (Mbps), hardware
usage is given in FPGA logic slices, and throughput/area ratios are given in bits per
second per slice.

              baseline                              w/ extensions
Algorithm     Throughput   HW usage   T/A ratio     Throughput   HW usage   T/A ratio
DES           1.537        4395       349.7         55.014       4809       11439.8
Triple-DES    0.588        4395       133.8         17.399       4809       3618.0
IDEA          3.145        4395       715.6         15.861       4768       3326.6
AES           7.044        4395       1602.7        20.636       5559       3712.1
Table 26: Throughput to area ratios for algorithm implementations

The data in Table 26 show that the DES algorithm has improved by far the
most in terms of throughput/area ratio. This is explained by the fact that almost all
of the functionality for performing the algorithm has been off-loaded to the custom
hardware. As expected, throughput/area values for Triple-DES are approximately
one-third of the corresponding values for DES because Triple-DES performs three DES
operations per block while requiring the same hardware in order to be implemented
with the instruction set extensions. IDEA and AES have improvement factors of about
4.65 and 2.32 respectively. The relatively lower throughput/area ratio for AES can be
attributed to the high logic utilization of the Galois Field fixed field constant
matrix multiplier. The Triple-DES, IDEA, and AES algorithms have similar
throughput/area ratios when the custom instructions are employed. Each of these
algorithms is a popular choice for bulk data encryption, and performance would not be
a major concern when selecting one or more of these algorithms as part of an embedded
security solution.
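The entries of Table 26 are consistent with computing throughput as block size times clock frequency divided by cycles per block, using the ECB encryption cycle counts from Table 18 and the synthesized clock frequencies from Table 25. The sketch below reproduces the DES row under that assumption; the formula is inferred from the tabulated values rather than quoted from the measurement procedure, and the function names are illustrative.

```python
def throughput_mbps(block_bits: int, cycles_per_block: int, clock_mhz: float) -> float:
    """Throughput in Mbps for one block processed every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block


def throughput_per_slice(mbps: float, slices: int) -> float:
    """Throughput/area ratio in bits per second per slice."""
    return mbps * 1e6 / slices


DES_BLOCK_BITS = 64

# Baseline: 5416 cycles (Table 18), base LEON2 at 130.037 MHz, 4395 slices.
des_base = throughput_mbps(DES_BLOCK_BITS, 5416, 130.037)
# All DES extensions: 142 cycles, 122.063 MHz, 4809 slices (Table 25).
des_ext = throughput_mbps(DES_BLOCK_BITS, 142, 122.063)

print(f"DES baseline:     {des_base:.3f} Mbps, "
      f"{throughput_per_slice(des_base, 4395):.1f} bps/slice")
print(f"DES w/extensions: {des_ext:.3f} Mbps, "
      f"{throughput_per_slice(des_ext, 4809):.1f} bps/slice")
```

The ratios computed this way differ from the table in the last digit because Table 26 appears to derive them from the already rounded throughput values.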
8 CONCLUSIONS AND FUTURE WORK
Instruction set extensions for improving software implementations of
symmetric-key algorithms have been proposed. Existing literature on the subject of
enhancing the performance of symmetric-key algorithms was discussed, followed by
detailed descriptions of the targeted processor (the LEON2) and the targeted
cryptographic algorithms (DES, Triple-DES, IDEA, and AES). Descriptions of the
custom instructions were given along with the functional units that implement the
underlying logical and arithmetic operations.
The results of the performance evaluations show that the proposed custom
instructions have a significant positive effect on program code size for all of the
targeted algorithms. The DES and Triple-DES algorithms experienced the greatest
percentage drop in code size, shrinking to as little as 5% of the original baseline code.
Implementations of IDEA and AES were decreased to 37% to 75% of the original code
size using the proposed extensions, depending on the specific instructions implemented
and the cipher direction.
Execution time for all algorithms was improved several times over. Speedup
factors for DES and Triple-DES ranged from 31 to 38 compared to the baseline
implementations. Encryption and decryption with IDEA were observed to run
approximately five times faster on average by using the proposed instruction for
modulo $(2^{16} + 1)$ multiplication. The total execution cycle count for AES was
decreased the most by the introduction of the Galois Field constant matrix multiplier
into the processor architecture. Analysis of the performance of the cryptographic
algorithms using individual instructions is consistent with expectations as to where
the largest impacts to data throughput would be observed.
Improvements for each algorithm required only a small increase in the logic
utilization of the LEON2 processor on the chosen FPGA device, and an implementation
of the LEON2 including all extensions increased the total processor area by less
than half while decreasing the maximum clock frequency by approximately 10%. The
DES and Triple-DES algorithms received the most benefit in terms of throughput
per FPGA logic slice used, while the corresponding values for the AES and IDEA
algorithms increased by factors of about 2.3 and 4.7 respectively.
Future work on this research may include investigating the effect these proposed
instruction set extensions may have on other extensible processor architectures and
systems with different memory subsystems. Discussions in Chapters 2 and 4 showed the
impact of pipelining, cache size, memory hierarchy, and instruction parallelism on
traditional software implementations of the targeted symmetric-key algorithms.
Although the instruction set extensions described in this research were designed to be
compatible with any 32-bit architecture, most of the extensions can be easily modified
for integration into the datapath of, for example, a 64-bit processor. Investigation
of differences in execution time for each algorithm running on systems with various
memory structures may also be of interest in determining what types of systems
benefit the most from these extensions.
REFERENCES
[1] P. Gil. How Big is the Internet? World Wide Web:
http://netforbeginners.about.com/cs/technoglossary/f/FAQ3.htm, 2005.
[2] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied
Cryptography. CRC Press, Boca Raton, Florida, USA, 1997.
[3] B. Schneier. Applied Cryptography. John Wiley & Sons Inc., New York, New
York, USA, 2nd edition, 1996.
[4] W. Stallings. Network and Internetwork Security: Principles and Practice.
Prentice Hall, Englewood Cliffs, New Jersey, USA, 1995.
[5] R. Doud. Hardware Crypto Solutions Boost VPN. Electronic Engineering
Times, (1056):57-64, April 12 1999.
[6] K. Aoki and H. Lipmaa. Fast Implementations of AES Candidates. In The
Third Advanced Encryption Standard Candidate Conference, pages 106-122,
New York, New York, USA, April 13-14 2000. National Institute of Standards
and Technology.
[7] L. Bassham, III. Efficiency Testing of ANSI C Implementations of Round 2
Candidate Algorithms for the Advanced Encryption Standard. In The Third
Advanced Encryption Standard Candidate Conference, pages 136-148, New York,
New York, USA, April 13-14 2000. National Institute of Standards and
Technology.
[8] J. Dray. NIST Performance Analysis of the Final Round Java™ AES
Candidates. In The Third Advanced Encryption Standard Candidate Conference,
pages 149-160, New York, New York, USA, April 13-14 2000. National Institute
of Standards and Technology.
[9] A. Sterbenz and P. Lipp. Performance of the AES Candidate Algorithms in
Java™. In The Third Advanced Encryption Standard Candidate Conference,
[10] T. Wollinger, M. Wang, J. Guajardo, and C. Paar. How Well Are High-End
DSPs Suited for the AES Algorithms? In The Third Advanced Encryption
Standard Candidate Conference, pages 94-105, New York, New York, USA,
April 13-14 2000. National Institute of Standards and Technology.
[11] Altera Corporation. Altera IP MegaStore Embedded Processors. World
Wide Web: http://www.altera.com/products/ip/processors/ipm-index.jsp.
[12] Xilinx Inc. Xilinx Embedded Processing Technology Solutions. World Wide
Web: http://www.xilinx.com/products/design_resources/proc_central/.
[13] M. Gschwind. Instruction Set Selection for ASIP Design. In A. A. Jerraya,
L. Lavagno, and F. Vahid, editors, Proceedings of the Seventh International
Symposium on Hardware/Software Codesign - CODES'99, pages 7-11, Rome,
Italy, March 1999. ACM Press.
[14] K. Küçükçakar. An ASIP Design Methodology for Embedded Systems. In
A. A. Jerraya, L. Lavagno, and F. Vahid, editors, Proceedings of the Seventh
International Symposium on Hardware/Software Codesign - CODES'99, pages
17-21, Rome, Italy, March 1999. ACM Press.
[15] A. Wang, E. Killian, D. E. Maydan, and C. Rowen. Hardware/Software
Instruction Set Configurability for System-On-Chip Processors. In Proceedings
of the 38th Design Automation Conference - DAC 2001, pages 184-188, Las
Vegas, Nevada, USA, June 18-22 2001. ACM Press.
[16] P. Faraboschi, G. M. Brown, J. A. Fisher, G. Desoli, and M. O. Homewood.
Lx: A Technology Platform for Customizable VLIW Embedded Processing. In
Proceedings of the 27th Annual International Symposium on Computer
Architecture - ISCA 2000, pages 203-213, Vancouver, British Columbia, Canada,
June 10-14 2000. ACM Press.
[17] R. E. Gonzalez. Xtensa: A Configurable and Extensible Processor. IEEE Micro,
20(2):60-70, March/April 2000.
[18] ARC International. Technical Summary of the ARCtangent™-A4
Processor Core. World Wide Web:
http://www.arc.com/upload/download/ARCIntl_0311_TechSummary_DS.pdf, 2001.
[19] MIPS Technologies Inc. Pro Series™ Processor Cores. World Wide Web:
http://www.mips.com/ProductCatalog/P_ProSeriesFamily/proseries.pdf,
2003.
[20] S. Tillich, J. Großschädl, and A. Szekely. An Instruction Set Extension for Fast
and Memory-Efficient AES Implementation. In J. Dittmann, S. Katzenbeisser,
and A. Uhl, editors, Proceedings of the Ninth International Conference on
Communications and Multimedia Security - CMS 2005, volume LNCS 3677, pages
11-21, Salzburg, Austria, September 19-21 2005. Springer-Verlag.
[21] R. B. Lee, Z. Shi, and X. Yang. Efficient Permutation Instructions for Fast
Software Cryptography. IEEE Micro, 21(6):56-69, November/December 2001.
[22] R. B. Lee. Accelerating Multimedia with Enhanced Microprocessors. IEEE
Micro, 15(2):22-32, April 1995.
[23] J. Burke, J. McDonald, and T. M. Austin. Architectural Support for Fast
Symmetric-Key Cryptography. In Proceedings of the Ninth International
Conference on Architectural Support for Programming Languages and Operating
Systems - ASPLOS 2000, pages 178-189, Cambridge, Massachusetts, USA,
November 12-15 2000.
[24] J. Großschädl. Instruction Set Extension for Long Integer Modulo Arithmetic on
RISC-Based Smart Cards. In Proceedings of the 14th Symposium on Computer
Architecture and High Performance Computing - SBAC-PAD'02, pages 13-19,
Vitoria, Espirito Santo, Brazil, October 28-30 2002.
[25] J. Großschädl and G.-A. Kamendje. Optimized RISC Architecture for
Multiple-Precision Modular Arithmetic. In First International Conference on
Security in Pervasive Computing, volume LNCS 2802, pages 253-270, Boppard,
Germany, March 12-14 2003. Springer-Verlag.
[26] J. Großschädl and G.-A. Kamendje. Architectural Enhancements for
Montgomery Multiplication on Embedded RISC Processors. In J. Zhou, M. Yung,
and Y. Han, editors, Applied Cryptography and Network Security - ACNS
2003, volume LNCS 2846, pages 418-434. Springer-Verlag, 2003.
[27] J. Großschädl and E. Savaş. Instruction Set Extensions for Fast Arithmetic
in Finite Fields GF(p) and GF(2^m). In M. Joye and J.-J. Quisquater,
editors, Workshop on Cryptographic Hardware and Embedded Systems - CHES
2004, volume LNCS 3156, pages 133-147, Cambridge, Massachusetts, USA,
August 11-13 2004. Springer-Verlag.
[28] S. Tillich and J. Großschädl. Accelerating AES Using Instruction Set Extensions
for Elliptic Curve Cryptography. In O. Gervasi, M. L. Gavrilova, V. Kumar,
A. Laganà, H. P. Lee, Y. Mun, D. Taniar, and C. J. K. Tan, editors,
International Conference on Computational Science and Its Applications - ICCSA
2005, volume LNCS 3481, pages 665-675, Singapore, May 9-12 2005.
Springer-Verlag.
[29] S. Tillich and J. Großschädl. Instruction Set Extensions for Efficient AES
Implementation on 32-bit Processors. In L. Goubin and M. Matsui, editors,
Workshop on Cryptographic Hardware and Embedded Systems - CHES 2006,
Yokohama, Japan, October 10-13 2006. Springer-Verlag.
[30] D. C. Feldmeier. A High-Speed Software DES Implementation. Technical
report, Computer Communication Research Group, 1989.
[31] D. A. Osvik. Efficient Implementation of the Data Encryption Standard. Thesis
for the Degree of Candidatus Scientiarum, Universitatis Bergensis, Apr 2003.
[32] G. Bertoni, L. Breveglieri, P. Fragneto, M. Macchetti, and S. Marchesin.
Efficient Software Implementation of AES on 32-Bit Platforms. In B. S. Kaliski
Jr., Ç. K. Koç, and C. Paar, editors, Workshop on Cryptographic Hardware
and Embedded Systems - CHES 2002, volume LNCS 2523, pages 159-171,
Redwood Shores, California, USA, August 13-15 2002. Springer-Verlag.
[33] L. Wu, C. Weaver, and T. Austin. CryptoManiac: A Fast Flexible Architecture
for Secure Communication. In B. Werner, editor, Proceedings of the 28th Annual
International Symposium on Computer Architecture - ISCA-2001, pages
110-119, Göteborg, Sweden, June 30-July 4 2001.
[34] A. V. Garcia and J.-P. Seifert. On the Implementation of the Advanced
Encryption Standard on a Public-key Crypto-Coprocessor. In CARDIS, pages
135-146. USENIX, 2002.
[35] D. Oliva, R. Buchty, and N. Heintze. AES and the Cryptonite Crypto Processor.
In J. H. Moreno, P. K. Murthy, T. M. Conte, and P. Faraboschi, editors,
Proceedings of the 2003 International Conference on Compilers, Architecture and
Synthesis for Embedded Systems - CASES 2003, pages 198-209, San Jose,
California, USA, October 30-November 1 2003. ACM Press.
[36] W.-M. Lim and M. Benaissa. Design space exploration of a hardware-software
co-designed GF(2^m) Galois field processor for forward error correction and
cryptography. In CODES+ISSS '03: Proceedings of the 1st IEEE/ACM/IFIP
International Conference on Hardware/software Codesign and System Synthesis,
pages 53-58, New York, NY, USA, 2003. ACM Press.
[37] H. W. Kim and S. Lee. Design and implementation of a private and public key
crypto processor and its application to a security system. IEEE Transactions
on Consumer Electronics, 50(1):214-224, 2004.
[38] F. Crowe, A. Daly, T. Kerins, and W. Marnane. Single-chip FPGA
Implementation of a Cryptographic Co-processor. In Proceedings of the 2004 IEEE
International Conference on Field-Programmable Technology, pages 279-285,
2004.
[39] A. Hodjat, D. D. Hwang, B. Lai, K. Tiri, and I. Verbauwhede. A 3.84 GBits/s
AES Crypto Coprocessor with Modes of Operation in a 0.18-μm CMOS
Technology. In GLSVLSI '05: Proceedings of the 15th ACM Great Lakes symposium
on VLSI, pages 60-63, New York, NY, USA, 2005. ACM Press.
[40] G. A. Sathishkumar and C. Prasanna. A Novel VLSI Architecture for an
Integrated Crypto Processor. In 2005 Annual IEEE INDICON, pages 272-275,
Dec 2005.
[41] Z. Shi and R. B. Lee. Bit Permutation Instructions for Accelerating Software
Cryptography. In IEEE International Conference on Application-specific
Systems, Architectures and Processors - ASAP 2000, pages 138-148. IEEE
Computer Society, 2000.
[42] S. Ravi, A. Raghunathan, N. Potlapally, and M. Sankaradass. System Design
Methodologies for a Wireless Security Processing Platform. In Proceedings of
the 2002 Design Automation Conference - DAC 2002, pages 777-782, New
Orleans, Louisiana, USA, June 10-14 2002.
[43] S. Ravi, A. Raghunathan, and N. Potlapally. Securing wireless data: system
architecture challenges. In Proceedings of the 15th International Symposium
on System Synthesis (ISSS-02), pages 195-200, New York, October 2-4 2002.
ACM Press.
[44] Gaisler Research Web site. World Wide Web: http://www.gaisler.com.
[45] J. Gaisler. Bare-C Cross Compiler User's Manual. Gaisler Research, version
1.0.29 edition, Feb 2007.
[46] Gaisler Research. LEON2 Processor User's Manual, v1.0.30 XST edition, Jul
2005.
[47] SPARC International Inc. The SPARC Architecture Manual, version 8 edition,
1992.
[48] LEON2 SPARC V8 Compliance Certification. Available at
http://www.gaisler.com/images/leoncert.gif.
[49] D. Pellerin and D. Taylor. VHDL Made Easy! Prentice Hall, 1996.
[50] Gaisler Research LEON Development Boards. World Wide Web:
http://www.gaisler.com/cms/index.php?option=com_content&task=section&id=9&Itemid=29.
[51] H. Feistel. Cryptography and Computer Privacy. Scientific American,
228(5):15-23, May 1973.
[52] B. Schneier and J. Kelsey. Unbalanced Feistel Networks and Block Cipher
Design. In D. Gollmann, editor, Third International Workshop on Fast Software
Encryption, volume LNCS 1039, Berlin, Germany, 1996. Springer-Verlag.
Conference Location: Cambridge, UK.
[53] C. Adams. The CAST-256 Encryption Algorithm. In First Advanced Encryption
Standard (AES) Conference, Ventura, California, USA, 1998.
[54] C. E. Shannon. Communication Theory of Secrecy Systems. Bell System
Technical Journal, 27(4):656-715, 1949.
[55] National Institute of Standards and Technology. FIPS PUB 46-3: Data
Encryption Standard (DES). National Institute for Standards and Technology,
October 1999. Supersedes FIPS 46-2.
[56] E. Biham. A Fast New DES Implementation in Software. In E. Biham, editor,
Fourth International Workshop on Fast Software Encryption, volume LNCS
1267, pages 260-272, Haifa, Israel, January 20-22 1997. Springer-Verlag.
[57] J. Hughes. Implementation of NBS/DES Encryption Algorithm in Software. In
Colloquium on Techniques and Implications of Digital Privacy and
Authentication Systems, 1981.
[58] A. Pfitzmann and R. Assman. More Efficient Software Implementations of
(Generalized) DES. Computers & Security, 12(5):477-500, 1993.
[59] D. C. Wilcox, L. Pierson, P. Robertson, E. Witzke, and K. Gass. A DES
ASIC Suitable for Network Encryption at 10 Gbps and Beyond. In C . Ko c and
C. Paar, editors, Workshop on Cryptographic Hardware and Embedded Systems
CHES 1999, volume LNCS 1717, pages 3748, Worcester, Massachusetts,
USA, August 1213 1999. Springer-Verlag.
[60] S. Trimberger, R. Pang, and A. Singh. A 12 Gbps DES Encryptor/Decryptor
Core in an FPGA. In C . K. Ko c and C. Paar, editors, Workshop on Crypto-
graphic Hardware and Embedded Systems CHES 2000, volume LNCS 1965,
pages 156163, Worcester, Massachusetts, USA, August 1718 2000. Springer-
Verlag.
[61] M. Davio, Y. Desmedt, J. Goubert, F. Hoornaert, and J. J. Quisquater. Ecient
Hardware and Software Implementations for the DES. In G. R. Blakley and
D. Chaum, editors, Advances in Cryptology CRYPTO 84, volume LNCS
196, pages 144146, Berlin, Germany, 1985. Springer-Verlag.
[62] I. Verbauwhede, F. Hoornaert, J. Vandewalle, and H. De Man. Security Consid-
erations in the Design and Implementation of a New DES Chip. In D. Chaum
and W. L. Price, editors, Advances in Cryptology - EUROCRYPT 87, volume
LNCS 304, pages 287300, Berlin, Germany, 1987. Springer-Verlag.
[63] DES Modes of Operation, FIPS, Federal Information Processing Standard, Pub
No. 81. Available at http://csrc.nist.gov/fips/change81.ps, December 1980.
[64] H. Eberle. A High-speed DES Implementation for Network Applications. In
E. F. Brickell, editor, Advances in Cryptology - CRYPTO '92, volume LNCS
740, pages 521-539, Berlin, Germany, August 16-20 1993. Springer-Verlag.
Conference Location: Santa Barbara, California, USA.
[65] H. Eberle and C. P. Thacker. A 1 Gbit/second GaAs DES Chip. In Proceedings
of the IEEE Custom Integrated Circuits Conference, pages 19.7.1-19.7.4,
Boston, Massachusetts, USA, May 3-6 1992. IEEE, Inc.
[66] X. Lai and J. Massey. A Proposal for a New Block Encryption Standard. In
Ivan B. Damgård, editor, Advances in Cryptology - EUROCRYPT '90, volume
LNCS 473, pages 389-404, Berlin, Germany, May 1990. Springer-Verlag.
[67] X. Lai and Y. Massey. Markov Ciphers and Differential Cryptanalysis. In
D. W. Davies, editor, Advances in Cryptology - EUROCRYPT '91, volume
LNCS 547, Berlin, Germany, 1991. Springer-Verlag.
[68] E. Biham and A. Shamir. Differential Cryptanalysis of DES-like Cryptosystems.
In A. J. Menezes and S. A. Vanstone, editors, Advances in Cryptology
- CRYPTO '90, volume LNCS 537, pages 2-21, Santa Barbara, California, USA,
August 11-15 1990. Springer-Verlag.
[69] H. Lipmaa. IDEA: A Cipher for Multimedia Architectures? In SAC: Annual
International Workshop on Selected Areas in Cryptography. LNCS, 1998.
[70] J.-O. Haenni. Architecture EPIC et jeux d'instructions multimédias pour applications
cryptographiques. PhD thesis, Swiss Federal Institute of Technology
Lausanne, 2002.
[71] P. Ganesan, R. Venugopalan, P. Peddabachagari, A. Dean, F. Mueller, and
M. Sichitiu. Analyzing and modeling encryption overhead for sensor network
nodes. In WSNA '03: Proceedings of the 2nd ACM international conference
on Wireless sensor networks and applications, pages 151-159, New York, NY,
USA, 2003. ACM Press.
[72] A. Michalski, K. Gaj, and D. A. Buell. High-Throughput Reconfigurable Computing:
A Design Study of an IDEA Encryption Cryptosystem on the SRC-6e
Reconfigurable Computer. In Tero Rissa, Steven J. E. Wilton, and Philip
Heng Wai Leong, editors, Proceedings of the 2005 International Conference on
Field Programmable Logic and Applications (FPL), Tampere, Finland, August
24-26, 2005, pages 681-686. IEEE, 2005.
[73] O. Mencer, M. Morf, and Michael J. Flynn. Hardware Software Tri-Design
of Encryption for Mobile Communication Units. In Proceedings of the 1998
IEEE International Conference on Acoustics, Speech, and Signal Processing,
volume 5, pages 3045-3048, May 1998.
[74] H. Bonnenberg, A. Curiger, N. Felber, H. Kaeslin, and X. Lai. VLSI Implementation
of a New Block Cipher. In Proceedings of the IEEE International Conference
on Computer Design, pages 510-513, Los Alamitos, California, USA,
1991. IEEE Computer Society Press.
[75] H. Bonnenberg, A. Curiger, N. Felber, H. Kaeslin, R. Zimmermann, and
W. Fichtner. VINCI: Secure Test of a VLSI High-Speed Encryption System.
In Proceedings: International Test Conference, pages 782-790, October 1993.
IEEE Cat Num: 93CH3356-3 ISBN: 0-7803-1430-1.
[76] R. Zimmermann, A. Curiger, H. Bonnenberg, H. Kaeslin, N. Felber, and
W. Fichtner. A 177 Mb/s VLSI Implementation of the International Data
Encryption Algorithm. IEEE Journal of Solid-State Circuits, 29(3):303-307,
March 1994.
[77] S. Wolter, H. Matz, A. Schubert, and R. Laur. On the VLSI Implementation of
the International Data Encryption Algorithm IDEA. In IEEE Symposium on
Circuits and Systems, volume 1, pages 397-400, New York, New York, USA,
1995. IEEE, Inc.
[78] S. L. C. Salomao, V. C. Alves, and E. M. C. Filho. HiPCrypto: A High-Performance
VLSI Cryptographic Chip. In Proceedings of the Eleventh Annual
IEEE International ASIC Conference, pages 7-11, Rochester, New York, USA,
September 1998.
[79] J. Daemen and V. Rijmen. AES Proposal: The Rijndael Block Cipher.
World Wide Web: http://csrc.nist.gov/CryptoToolkit/aes/rijndael/Rijndael-ammended.pdf, 1999.
[80] NIST FIPS PUB 197. Specification for the Advanced Encryption Standard
(AES). Federal Information Processing Standards, National Bureau of Stan-
dards, U.S. Department of Commerce, November 26 2001.
[81] P. S. L. M. Barreto. Optimized Rijndael C Code v3.0.
http://homes.esat.kuleuven.be/~rijmen/rijndael-fst-3.0.zip.
[82] J. Daemen and V. Rijmen. AES Proposal: Rijndael.
http://csrc.nist.gov/CryptoToolkit/aes/rijndael/Rijndael.pdf, 1999.
[83] J. Daemen and V. Rijmen. The Design of Rijndael. Springer, New York, New
York, USA, 2002.
[84] V. Rijmen. Rijndael Reference Code in ANSI C v2.2.
http://homes.esat.kuleuven.be/~rijmen/rijndaelref.zip.
[85] A. Hodjat and I. Verbauwhede. Interfacing a High Speed Crypto Accelerator to
an Embedded CPU. In Proceedings of the 38th Asilomar Conference on Signals,
Systems, and Computers, volume 1, pages 488-492, Los Angeles, California,
USA, November 7-10 2004.
[86] P. Schaumont, K. Sakiyama, A. Hodjat, and I. Verbauwhede. Embedded Software
Integration for Coarse-Grain Reconfigurable Systems. In Proceedings of
the Eighteenth International Parallel and Distributed Processing Symposium
- IPDPS 2004, pages 137-142, Santa Fe, New Mexico, USA, April 26-30 2004.
[87] A. J. Elbirt. Efficient Implementation of Galois Field Fixed Field Constant
Multiplication. In Proceedings of the International Conference on Information
Technology: New Generation - ITNG '06, pages 172-177, Las Vegas, Nevada,
USA, April 10-12 2006.
[88] A. J. Elbirt. Fast and Efficient Implementation of AES Via Instruction Set
Extensions. In Proceedings of the Third IEEE International Symposium on
Security in Networks and Distributed Systems, pages 396-403, Niagara Falls,
Canada, May 21-23 2007.
[89] A. J. Elbirt. Reconfigurable Computing for Symmetric-Key Algorithms. PhD
thesis, Worcester Polytechnic Institute, Worcester, Massachusetts, USA, April
2002. Available at http://faculty.uml.edu/aelbirt/thesis.pdf.
[90] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar. An FPGA Implementation
and Performance Evaluation of the AES Block Cipher Candidate Algorithm
Finalists. In The Third Advanced Encryption Standard Candidate Conference,
New York, New York, USA, April 13-14 2000. National Institute of Standards
and Technology.
[91] A. J. Elbirt, W. Yip, B. Chetwynd, and C. Paar. An FPGA-Based Performance
Evaluation of the AES Block Cipher Candidate Algorithm Finalists. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 9(4):545-557,
August 2001.
[92] K. Gaj and P. Chodowiec. Comparison of the Hardware Performance of the
AES Candidates Using Reconfigurable Hardware. In The Third Advanced Encryption
Standard Candidate Conference, pages 40-54, New York, New York,
USA, April 13-14 2000. National Institute of Standards and Technology.
[93] A. Hodjat and I. Verbauwhede. A 21.54 Gbit/s Fully Pipelined AES Processor
on FPGA. In Proceedings of the Twelfth Annual IEEE Symposium on Field-Programmable
Custom Computing Machines - FCCM 2004, pages 308-309,
Napa, California, USA, April 20-23 2004. IEEE, Inc.
[94] A. Hodjat and I. Verbauwhede. Minimum Area Cost for a 30 to 70 Gbits/s AES
Processor. In Proceedings of the IEEE Computer Society Annual Symposium
on VLSI Emerging Trends in VLSI System Design - ISVLSI'04, pages 83-88,
Lafayette, Louisiana, USA, February 19-20 2004. IEEE, Inc.
[95] T. Ichikawa, T. Kasuya, and M. Matsui. Hardware Evaluation of the AES
Finalists. In The Third Advanced Encryption Standard Candidate Conference,
New York, New York, USA, April 13-14 2000. National Institute of Standards
and Technology.
[96] K. Järvinen, M. Tommiska, and J. Skyttä. A Fully Pipelined Memoryless 17.8
Gbps AES-128 Encryptor. In ACM/SIGDA International Symposium on Field
Programmable Gate Arrays 2003 - FPGA '03, pages 207-215, Monterey, California,
USA, February 23-25 2003. ACM Press.
[97] H. Kuo, I. Verbauwhede, and P. Schaumont. A 2.29 Gbits/sec, 56 mW Non-Pipelined
Rijndael AES Encryption IC in a 1.8V, 0.18 μm CMOS Technology.
In Proceedings of the IEEE 2002 Custom Integrated Circuits Conference, pages
147-150, Orlando, Florida, USA, May 12-15 2002.
[98] H. Li and Z. Friggstad. An Efficient Architecture for the AES Mix Columns
Operation. In Proceedings of the 2005 IEEE International Symposium on Circuits
and Systems - ISCAS 2005, pages 4637-4640, Kobe, Japan, May 23-26
2005. IEEE, Inc.
[99] M. McLoone and J. V. McCanny. Rijndael FPGA Implementations Utilising
Look-Up Tables. Journal of VLSI Signal Processing Systems for Signal, Image,
and Video Technology, 34(3):261-275, July 2003.
[100] T. Pionteck, T. Staake, T. Stiefmeier, L. D. Kabulepa, and M. Geisner. Design
of a Reconfigurable AES Encryption/Decryption Engine for Mobile Terminals.
In Proceedings of the 2004 Symposium on Circuits and Systems - ISCAS 2004,
volume 2, pages 545-548, Vancouver, Canada, May 23-26 2004.
[101] N. A. Saqib, F. Rodriguez-Henriquez, and A. Diaz-Perez. AES Algorithm Implementation
- An Efficient Approach for Sequential and Pipeline Architectures.
In Proceedings of the Fourth Mexican International Conference on Computer
Science - ECC'03, pages 126-130, Apizaco, Mexico, September 8-12 2003.
IEEE, Inc.
[102] F. X. Standaert, G. Rouvroy, J. J. Quisquater, and J. D. Legat. Efficient Implementation
of Rijndael Encryption in Reconfigurable Hardware: Improvements
and Design Tradeoffs. In Workshop on Cryptographic Hardware and Embedded
Systems - CHES 2003, volume LNCS 2778, pages 334-350, Cologne, Germany,
September 7-10 2003. Springer-Verlag.
[103] K. Stevens and O. A. Mohamed. Single-Chip FPGA Implementation of a
Pipelined, Memory-Based AES Rijndael Encryption Design. In Proceedings of
the Eighteenth Annual Canadian Conference on Electrical and Computer Engineering
- CCECE'05, pages 1296-1299, Saskatoon, Saskatchewan, Canada,
May 1-4 2005. IEEE, Inc.
[104] N. Weaver and J. Wawrzynek. A Comparison of the AES Candidates Amenability
to FPGA Implementation. In The Third Advanced Encryption Standard
Candidate Conference, pages 28-39, New York, New York, USA, April 13-14
2000. National Institute of Standards and Technology.
[105] B. Weeks, M. Bean, T. Rozylowicz, and C. Ficke. Hardware Performance
Simulations of Round 2 Advanced Encryption Standard Algorithms. In The
Third Advanced Encryption Standard Candidate Conference, pages 286-304,
New York, New York, USA, April 13-14 2000. National Institute of Standards
and Technology.
[106] S.-M. Yoo, D. Kotturi, D. W. Pan, and J. Blizzard. An AES Crypto Chip Using
a High-Speed Parallel Pipelined Architecture. Microprocessors and Microsystems,
29(7):317-326, September 2005.
[107] X. Zhang and K. K. Parhi. Implementation Approaches for the Advanced Encryption
Standard Algorithm. IEEE Circuits and Systems Magazine, 2(4):25-46, 2002.
[108] J. Irwin and D. Page. Using Media Processors for Low-Memory AES Implementation.
In Proceedings of the Fourteenth IEEE International Conference
on Application-Specific Systems, Architectures and Processors - ASAP 2003,
pages 144-154, The Hague, The Netherlands, June 24-26 2003.
[109] K. Nadehara, M. Ikekawa, and I. Kuroda. Extended Instructions for the AES
Cryptography and Their Efficient Implementation. In Proceedings of the Eighteenth
IEEE Workshop on Signal Processing Systems - SIPS 2004, pages 152-157,
Austin, Texas, USA, October 13-15 2004.
[110] P. Karn. DES Software Implementation. Included with source code package for
Advanced Packet Vault: http://www.citi.umich.edu/projects/apv/.
[111] J. Beuchat. Modular Multiplication for FPGA Implementation of the IDEA
Block Cipher. In Proceedings of the Fourteenth IEEE International Conference
on Application-Specific Systems, Architectures and Processors - ASAP 2003,
pages 412-422. IEEE Computer Society, 2003.
[112] R. Zimmermann. Efficient VLSI Implementation of Modulo (2^n ± 1) Addition
and Multiplication. World Wide Web:
http://www.stud.ee.ethz.ch/~zimmi/publications/modulo_arith.ps.gz, April 22
1999.
Appendix A: VHDL Source for Custom Functional Units
DES permutation unit (des_pmt.vhd)
library ieee;
use ieee.std_logic_1164.all;
entity des_pmt is
port (
pmt_type : in std_logic; -- IP=0, FP=1
out_half : in std_logic; -- LEFT=0, RIGHT=1
in1, in2 : in std_logic_vector(1 to 32);
dout : out std_logic_vector(1 to 32)
);
end des_pmt;
architecture behav of des_pmt is
signal x, y : std_logic_vector(1 to 32);
begin
process(pmt_type, in1, in2)
begin
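case pmt_type is
when '0' => -- initial permutation (IP)
x <= in2(26) & in2(18) & in2(10) & in2(2) & in1(26) & in1(18) & in1(10) & in1(2) &
in2(28) & in2(20) & in2(12) & in2(4) & in1(28) & in1(20) & in1(12) & in1(4) &
in2(30) & in2(22) & in2(14) & in2(6) & in1(30) & in1(22) & in1(14) & in1(6) &
in2(32) & in2(24) & in2(16) & in2(8) & in1(32) & in1(24) & in1(16) & in1(8) ;
y <= in2(25) & in2(17) & in2( 9) & in2(1) & in1(25) & in1(17) & in1( 9) & in1(1) &
in2(27) & in2(19) & in2(11) & in2(3) & in1(27) & in1(19) & in1(11) & in1(3) &
in2(29) & in2(21) & in2(13) & in2(5) & in1(29) & in1(21) & in1(13) & in1(5) &
in2(31) & in2(23) & in2(15) & in2(7) & in1(31) & in1(23) & in1(15) & in1(7) ;
when others => -- final permutation (FP)
x <= in2(8) & in1(8) & in2(16) & in1(16) & in2(24) & in1(24) & in2(32) & in1(32) &
in2(7) & in1(7) & in2(15) & in1(15) & in2(23) & in1(23) & in2(31) & in1(31) &
in2(6) & in1(6) & in2(14) & in1(14) & in2(22) & in1(22) & in2(30) & in1(30) &
in2(5) & in1(5) & in2(13) & in1(13) & in2(21) & in1(21) & in2(29) & in1(29) ;
y <= in2(4) & in1(4) & in2(12) & in1(12) & in2(20) & in1(20) & in2(28) & in1(28) &
in2(3) & in1(3) & in2(11) & in1(11) & in2(19) & in1(19) & in2(27) & in1(27) &
in2(2) & in1(2) & in2(10) & in1(10) & in2(18) & in1(18) & in2(26) & in1(26) &
in2(1) & in1(1) & in2( 9) & in1( 9) & in2(17) & in1(17) & in2(25) & in1(25) ;
end case;
end process;
process(out_half, x, y)
begin
case out_half is
when '0' => dout <= x;
when others => dout <= y;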
end case;
end process;
end behav;
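The permutation unit can be exercised with a small self-checking testbench. The following is a minimal sketch, not part of the original design files: the entity name des_pmt_tb and the chosen test vectors are assumptions made for illustration. It assumes that in1 carries block bits 1 to 32 and in2 carries block bits 33 to 64, which is the numbering the listing above follows. A single set input bit is traced through IP and through FP.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (name and vectors are illustrative assumptions).
entity des_pmt_tb is
end des_pmt_tb;

architecture sim of des_pmt_tb is
  signal pmt_type, out_half : std_logic := '0';
  signal in1, in2           : std_logic_vector(1 to 32) := (others => '0');
  signal dout               : std_logic_vector(1 to 32);
begin
  dut : entity work.des_pmt
    port map (pmt_type => pmt_type, out_half => out_half,
              in1 => in1, in2 => in2, dout => dout);

  stimulus : process
    variable expected : std_logic_vector(1 to 32);
  begin
    -- IP moves block bit 1 (in1(1)) to output position 40,
    -- which is bit 8 of the right half.
    in1      <= (1 => '1', others => '0');
    in2      <= (others => '0');
    pmt_type <= '0';
    out_half <= '1';
    wait for 10 ns;
    expected := (8 => '1', others => '0');
    assert dout = expected report "IP: right half mismatch" severity error;

    -- The left half must stay all zero for the same input.
    out_half <= '0';
    wait for 10 ns;
    expected := (others => '0');
    assert dout = expected report "IP: left half mismatch" severity error;

    -- FP takes output bit 1 from block bit 40 (in2(8)).
    in1      <= (others => '0');
    in2      <= (8 => '1', others => '0');
    pmt_type <= '1';
    out_half <= '0';
    wait for 10 ns;
    expected := (1 => '1', others => '0');
    assert dout = expected report "FP: left half mismatch" severity error;

    report "des_pmt_tb finished" severity note;
    wait;
  end process;
end sim;
```

Because the unit is purely combinational, no clock is needed; the 10 ns waits only give the simulator time to propagate the new inputs.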
DES key generator (des_keygen.vhd)
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned."+";
use ieee.std_logic_unsigned."-";
entity des_keygen is
port (
clk : in std_logic;
setdir : in std_logic;
load : in std_logic;
dirin : in std_logic;
advin : in std_logic;
keyl : in std_logic_vector(1 to 32);
keyr : in std_logic_vector(1 to 32);
rndkey : out std_logic_vector(1 to 48)
);
end des_keygen;
architecture behav of des_keygen is
signal dir : std_logic; -- ENCRYPT = 0, DECRYPT = 1
signal adv : std_logic; -- last state of "advance" flag
signal rnd, rndnext : std_logic_vector(3 downto 0);
signal pkey : std_logic_vector(1 to 56); -- round key before PC-2
signal c, c0, d, d0 : std_logic_vector(1 to 28);
signal cnext, dnext : std_logic_vector(1 to 28);
begin
process(clk)
begin
if (rising_edge(clk)) then
c <= cnext;
d <= dnext;
rnd <= rndnext;
adv <= advin;
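if(setdir='1') then
dir <= dirin;
end if;
if(load='1') then
-- LOAD KEY WITH PERMUTED CHOICE 1
c0 <= keyr(25) & keyr(17) & keyr( 9) & keyr( 1) &
keyl(25) & keyl(17) & keyl( 9) & keyl( 1) &
keyr(26) & keyr(18) & keyr(10) & keyr( 2) &
keyl(26) & keyl(18) & keyl(10) & keyl( 2) &
keyr(27) & keyr(19) & keyr(11) & keyr( 3) &
keyl(27) & keyl(19) & keyl(11) & keyl( 3) &
keyr(28) & keyr(20) & keyr(12) & keyr( 4) ;
d0 <= keyr(31) & keyr(23) & keyr(15) & keyr( 7) &
keyl(31) & keyl(23) & keyl(15) & keyl( 7) &
keyr(30) & keyr(22) & keyr(14) & keyr( 6) &
keyl(30) & keyl(22) & keyl(14) & keyl( 6) &
keyr(29) & keyr(21) & keyr(13) & keyr( 5) &
keyl(29) & keyl(21) & keyl(13) & keyl( 5) &
keyl(28) & keyl(20) & keyl(12) & keyl( 4) ;
end if;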
end if;
end process;
process(dir, dirin, rnd, adv, advin, setdir, load)
begin
if(setdir='1' or load='1') then
if(dirin='0') then
rndnext <= "0001";
else
rndnext <= "0000";
end if;
elsif(advin='0' and adv='1') then
if(dir='0') then
rndnext <= rnd + 1;
else
rndnext <= rnd - 1;
end if;
end if;
end process;
process(dir, adv, advin, c, d, c0, d0, rnd)
begin
if(advin='0' and adv='1') then
if(dir='0') then
if(rnd="0001" or rnd="1000" or rnd="1111" or rnd="0000") then
cnext(1 to 27) <= c(2 to 28);
cnext(28) <= c(1);
dnext(1 to 27) <= d(2 to 28);
dnext(28) <= d(1);
else
cnext(1 to 26) <= c(3 to 28);
cnext(27 to 28) <= c(1 to 2);
dnext(1 to 26) <= d(3 to 28);
dnext(27 to 28) <= d(1 to 2);
end if;
else
if(rnd="0001" or rnd="0010" or rnd="1001" or rnd="0000") then
cnext(2 to 28) <= c(1 to 27);
cnext(1) <= c(28);
dnext(2 to 28) <= d(1 to 27);
dnext(1) <= d(28);
else
cnext(3 to 28) <= c(1 to 26);
cnext(1 to 2) <= c(27 to 28);
dnext(3 to 28) <= d(1 to 26);
dnext(1 to 2) <= d(27 to 28);
end if;
end if;
elsif(dir='0' and rnd="0001") then
cnext <= c0(2 to 28) & c0(1);
dnext <= d0(2 to 28) & d0(1);
elsif(dir='1' and rnd="0000") then
cnext <= c0; dnext <= d0;
else
cnext <= c; dnext <= d;
end if;
end process;
pkey <= c & d;
-- SEND OUTPUT WITH PERMUTED CHOICE 2
rndkey <= pkey(14) & pkey(17) & pkey(11) & pkey(24) & pkey( 1) & pkey( 5) &
pkey( 3) & pkey(28) & pkey(15) & pkey( 6) & pkey(21) & pkey(10) &
pkey(23) & pkey(19) & pkey(12) & pkey( 4) & pkey(26) & pkey( 8) &
pkey(16) & pkey( 7) & pkey(27) & pkey(20) & pkey(13) & pkey( 2) &
pkey(41) & pkey(52) & pkey(31) & pkey(37) & pkey(47) & pkey(55) &
pkey(30) & pkey(40) & pkey(51) & pkey(45) & pkey(33) & pkey(48) &
pkey(44) & pkey(49) & pkey(39) & pkey(56) & pkey(34) & pkey(53) &
pkey(46) & pkey(42) & pkey(50) & pkey(36) & pkey(29) & pkey(32) ;
end behav;
DES round function core unit (des_fcore.vhd)
library ieee;
USE ieee.std_logic_1164.all;
entity des_fcore is
port (
din : in std_logic_vector(1 to 32);
rndkey : in std_logic_vector(1 to 48);
dout : out std_logic_vector(1 to 32)
);
end des_fcore;
architecture dflow of des_fcore is
type sbox_inputs is array(1 to 8) of std_logic_vector(5 downto 0);
signal edata, kdata : std_logic_vector(1 to 48);
signal sbox_in : sbox_inputs;
signal sbox_out : std_logic_vector(1 to 32);
begin
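-- EXPANSION PERMUTATION E
edata(1) <= din(32);
e_row1: for i in 2 to 6 generate
edata(i) <= din(i-1);
end generate;
e_row2: for i in 7 to 12 generate
edata(i) <= din(i-3);
end generate;
e_row3: for i in 13 to 18 generate
edata(i) <= din(i-5);
end generate;
e_row4: for i in 19 to 24 generate
edata(i) <= din(i-7);
end generate;
e_row5: for i in 25 to 30 generate
edata(i) <= din(i-9);
end generate;
e_row6: for i in 31 to 36 generate
edata(i) <= din(i-11);
end generate;
e_row7: for i in 37 to 42 generate
edata(i) <= din(i-13);
end generate;
e_row8: for i in 43 to 47 generate
edata(i) <= din(i-15);
end generate;
edata(48) <= din(1);
kdata <= edata XOR rndkey;
-- S-BOX INPUTS: ROW BITS (1 AND 6) FIRST, THEN COLUMN BITS (2 TO 5)
sbox_rc: for i in 1 to 8 generate
sbox_in(i)(5) <= kdata(6*i-5);
sbox_in(i)(4) <= kdata(6*i);
sbox_in(i)(3) <= kdata(6*i-4);
sbox_in(i)(2) <= kdata(6*i-3);
sbox_in(i)(1) <= kdata(6*i-2);
sbox_in(i)(0) <= kdata(6*i-1);
end generate;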
with sbox_in(1) select
sbox_out(1 to 4) <= "0000" when "001110" | "010000" | "101111" | "111101" ,
"0001" when "000011" | "010111" | "100001" | "110110" ,
"0010" when "000100" | "010101" | "100110" | "110011" ,
"0011" when "001000" | "011110" | "101100" | "111010" ,
"0100" when "000001" | "010011" | "100000" | "110100" ,
"0101" when "001100" | "011101" | "101110" | "111000" ,
"0110" when "001010" | "011001" | "100101" | "111110" ,
"0111" when "001111" | "010010" | "101011" | "110111" ,
"1000" when "000111" | "011111" | "100011" | "110010" ,
"1001" when "001101" | "011100" | "101010" | "110101" ,
"1010" when "001001" | "011000" | "101101" | "111100" ,
"1011" when "000110" | "011011" | "100111" | "111001" ,
"1100" when "001011" | "011010" | "101001" | "110001" ,
"1101" when "000010" | "010110" | "100100" | "111111" ,
"1110" when "000000" | "010100" | "100010" | "111011" ,
"1111" when "000101" | "010001" | "101000" | "110000" ,
"0000" when others;
with sbox_in(2) select
sbox_out(5 to 8) <= "0000" when "001101" | "011001" | "100000" | "111100" ,
"0001" when "000001" | "011010" | "100111" | "110011" ,
"0010" when "001010" | "010101" | "101110" | "110111" ,
"0011" when "000110" | "010000" | "101101" | "110100" ,
"0100" when "000111" | "010010" | "100101" | "110110" ,
"0101" when "001110" | "011111" | "101000" | "111101" ,
"0110" when "000100" | "011100" | "101011" | "111001" ,
"0111" when "001001" | "010011" | "100010" | "111010" ,
"1000" when "000010" | "010110" | "101001" | "110001" ,
"1001" when "001000" | "011101" | "101100" | "111111" ,
"1010" when "001111" | "011011" | "100100" | "110010" ,
"1011" when "000101" | "011110" | "100011" | "111000" ,
"1100" when "001100" | "011000" | "101010" | "111011" ,
"1101" when "001011" | "010001" | "100110" | "110000" ,
"1110" when "000011" | "010111" | "100001" | "111110" ,
"1111" when "000000" | "010100" | "101111" | "110101" ,
"0000" when others;
with sbox_in(3) select
sbox_out(9 to 12) <= "0000" when "000001" | "010010" | "100111" | "110011" ,
"0001" when "001000" | "011111" | "101001" | "110000" ,
"0010" when "001110" | "011000" | "101010" | "111110" ,
"0011" when "000101" | "010100" | "100110" | "111011" ,
"0100" when "001101" | "010101" | "100010" | "111000" ,
"0101" when "000111" | "011010" | "101100" | "111101" ,
"0110" when "000100" | "010110" | "100001" | "110100" ,
"0111" when "001011" | "010001" | "101111" | "110111" ,
"1000" when "001111" | "011001" | "100100" | "110110" ,
"1001" when "000010" | "010011" | "100011" | "110101" ,
"1010" when "000000" | "010111" | "101101" | "110001" ,
"1011" when "001100" | "011101" | "101000" | "111100" ,
"1100" when "001010" | "011100" | "101011" | "111111" ,
"1101" when "001001" | "010000" | "100000" | "110010" ,
"1110" when "000011" | "011011" | "101110" | "111010" ,
"1111" when "000110" | "011110" | "100101" | "111001" ,
"0000" when others;
with sbox_in(4) select
sbox_out(13 to 16) <= "0000" when "000100" | "010110" | "100011" | "110010" ,
"0001" when "001000" | "011100" | "101001" | "110101" ,
"0010" when "001001" | "011010" | "101101" | "111110" ,
"0011" when "000011" | "010111" | "101010" | "110000" ,
"0100" when "001110" | "011000" | "101111" | "111001" ,
"0101" when "001011" | "010011" | "101100" | "111010" ,
"0110" when "000101" | "010100" | "100001" | "110011" ,
"0111" when "000000" | "011001" | "100110" | "111101" ,
"1000" when "001010" | "010001" | "101110" | "110111" ,
"1001" when "000110" | "011111" | "100010" | "111000" ,
"1010" when "000111" | "011101" | "100000" | "110100" ,
"1011" when "001100" | "010010" | "100101" | "111011" ,
"1100" when "001101" | "011011" | "100100" | "111100" ,
"1101" when "000001" | "010000" | "100111" | "110110" ,
"1110" when "000010" | "011110" | "101011" | "111111" ,
"1111" when "001111" | "010101" | "101000" | "110001" ,
"0000" when others;
with sbox_in(5) select
sbox_out(17 to 20) <= "0000" when "001101" | "011001" | "101110" | "111010" ,
"0001" when "000011" | "010111" | "100010" | "110100" ,
"0010" when "000000" | "010010" | "100001" | "110110" ,
"0011" when "001010" | "011100" | "101101" | "111111" ,
"0100" when "000010" | "010100" | "100000" | "111101" ,
"0101" when "001001" | "011000" | "101011" | "111110" ,
"0110" when "000111" | "011111" | "101100" | "111000" ,
"0111" when "000100" | "010101" | "100110" | "110011" ,
"1000" when "001000" | "011110" | "100111" | "110001" ,
"1001" when "001111" | "011101" | "101001" | "111011" ,
"1010" when "000101" | "011011" | "100100" | "111100" ,
"1011" when "000110" | "010001" | "100011" | "110000" ,
"1100" when "000001" | "010011" | "101010" | "110010" ,
"1101" when "001100" | "010110" | "100101" | "110111" ,
"1110" when "001110" | "010000" | "101111" | "110101" ,
"1111" when "001011" | "011010" | "101000" | "111001" ,
"0000" when others;
with sbox_in(6) select
sbox_out(21 to 24) <= "0000" when "001000" | "011100" | "101001" | "111101" ,
"0001" when "000001" | "011001" | "101100" | "111010" ,
"0010" when "000101" | "010011" | "100100" | "110010" ,
"0011" when "001010" | "011110" | "100111" | "110001" ,
"0100" when "001011" | "010010" | "101010" | "110000" ,
"0101" when "001110" | "010111" | "100011" | "110101" ,
"0110" when "000110" | "011000" | "101111" | "111100" ,
"0111" when "001101" | "010100" | "101000" | "111011" ,
"1000" when "000111" | "011111" | "100101" | "111110" ,
"1001" when "000100" | "010110" | "100000" | "110100" ,
"1010" when "000010" | "010000" | "101011" | "110111" ,
"1011" when "001111" | "011101" | "101110" | "111000" ,
"1100" when "000000" | "010101" | "100110" | "110011" ,
"1101" when "001001" | "011010" | "101101" | "111111" ,
"1110" when "001100" | "011011" | "100001" | "111001" ,
"1111" when "000011" | "010001" | "100010" | "110110" ,
"0000" when others;
with sbox_in(7) select
sbox_out(25 to 28) <= "0000" when "000101" | "010001" | "101100" | "111010" ,
"0001" when "001111" | "010110" | "100000" | "110100" ,
"0010" when "000010" | "011100" | "101111" | "111101" ,
"0011" when "001000" | "011001" | "100101" | "111110" ,
"0100" when "000000" | "010100" | "100001" | "110101" ,
"0101" when "001100" | "011010" | "101101" | "111001" ,
"0110" when "001110" | "011111" | "101010" | "110000" ,
"0111" when "001011" | "010011" | "100110" | "110111" ,
"1000" when "000110" | "011110" | "101011" | "110011" ,
"1001" when "001010" | "010101" | "101110" | "111000" ,
"1010" when "001101" | "010111" | "101000" | "110110" ,
"1011" when "000001" | "010010" | "100010" | "110001" ,
"1100" when "001001" | "011011" | "100100" | "111111" ,
"1101" when "000111" | "010000" | "100011" | "110010" ,
"1110" when "000011" | "011000" | "100111" | "111100" ,
"1111" when "000100" | "011101" | "101001" | "111011" ,
"0000" when others;
with sbox_in(8) select
sbox_out(29 to 32) <= "0000" when "001101" | "011100" | "101000" | "111011" ,
"0001" when "000111" | "010000" | "100011" | "110001" ,
"0010" when "000001" | "011111" | "100111" | "110000" ,
"0011" when "001010" | "010101" | "101101" | "111100" ,
"0100" when "000011" | "010111" | "100010" | "110100" ,
"0101" when "001100" | "011001" | "101110" | "111101" ,
"0110" when "000100" | "011010" | "101001" | "111110" ,
"0111" when "001111" | "010110" | "100000" | "110011" ,
"1000" when "000010" | "010011" | "101111" | "110110" ,
"1001" when "001001" | "011110" | "100100" | "111010" ,
"1010" when "001000" | "010100" | "101010" | "110101" ,
"1011" when "000110" | "011011" | "100001" | "111111" ,
"1100" when "001110" | "011000" | "100101" | "111001" ,
"1101" when "000000" | "010010" | "101011" | "110111" ,
"1110" when "001011" | "011101" | "100110" | "110010" ,
"1111" when "000101" | "010001" | "101100" | "111000" ,
"0000" when others;
dout <= sbox_out(16) & sbox_out( 7) & sbox_out(20) & sbox_out(21) &
sbox_out(29) & sbox_out(12) & sbox_out(28) & sbox_out(17) &
sbox_out( 1) & sbox_out(15) & sbox_out(23) & sbox_out(26) &
sbox_out( 5) & sbox_out(18) & sbox_out(31) & sbox_out(10) &
sbox_out( 2) & sbox_out( 8) & sbox_out(24) & sbox_out(14) &
sbox_out(32) & sbox_out(27) & sbox_out( 3) & sbox_out( 9) &
sbox_out(19) & sbox_out(13) & sbox_out(30) & sbox_out( 6) &
sbox_out(22) & sbox_out(11) & sbox_out( 4) & sbox_out(25) ;
end dflow;
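A quick sanity check of the round function core is to drive both inputs with zeros. Every S-box then sees the input "000000", and the eight outputs read from the tables above (14, 15, 10, 7, 2, 12, 4, 13) pass through the final permutation to give the word D8D8DBBC. The sketch below assumes the entity exposes a 32-bit dout port, as the final assignment in the listing implies; the testbench name des_fcore_tb is hypothetical, and the expected value was worked out by hand from the listing rather than taken from the thesis.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (name and vector are illustrative assumptions).
entity des_fcore_tb is
end des_fcore_tb;

architecture sim of des_fcore_tb is
  signal din    : std_logic_vector(1 to 32) := (others => '0');
  signal rndkey : std_logic_vector(1 to 48) := (others => '0');
  signal dout   : std_logic_vector(1 to 32);
begin
  dut : entity work.des_fcore
    port map (din => din, rndkey => rndkey, dout => dout);

  stimulus : process
  begin
    -- All-zero data and round key: each S-box input is "000000".
    wait for 10 ns;
    assert dout = x"D8D8DBBC"
      report "des_fcore: f(0, 0) should be D8D8DBBC" severity error;

    report "des_fcore_tb finished" severity note;
    wait;
  end process;
end sim;
```

A single vector only confirms the first entry of each S-box and the wiring of the final permutation; a fuller check would compare complete encryptions against published DES test vectors.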
Modulo (2^16 + 1) multiplier for IDEA (idea_modmul.vhd)
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned."+";
entity idea_modmul is
port (
clk : in std_logic;
x, y : in std_logic_vector(15 downto 0);
p : out std_logic_vector(15 downto 0)
);
end idea_modmul;
architecture behav of idea_modmul is
component cpamod is
generic(n : integer := 16);
port(
a, b : in std_logic_vector(n-1 downto 0);
sum : out std_logic_vector(n downto 0)
);
end component;
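type part_prod_array is array(15 downto 0) of std_logic_vector(15 downto 0);
type sum_level1 is array(1 to 8) of std_logic_vector(16 downto 0);
type sum_level2 is array(1 to 4) of std_logic_vector(17 downto 0);
type sum_level3 is array(1 to 2) of std_logic_vector(18 downto 0);
signal pp : part_prod_array;
signal s, pp_sum : std_logic_vector(19 downto 0);
signal k : std_logic_vector(16 downto 0);
signal sl : std_logic_vector(15 downto 0);
signal sh : std_logic_vector( 3 downto 0);
signal not_x_exp : std_logic_vector(16 downto 0);
signal not_y_exp : std_logic_vector(16 downto 0);
signal not_sh_exp : std_logic_vector(15 downto 0);
signal t0, t1, t2 : std_logic_vector(16 downto 0);
signal s1 : sum_level1;
signal s2 : sum_level2;
signal s3 : sum_level3;
signal pp_sum_reg : std_logic_vector(19 downto 0);
signal k_reg : std_logic_vector(16 downto 0);
begin
-- ADDER TREE: 16 PARTIAL PRODUCTS -> 8 -> 4 -> 2 -> 1
A16_1 : cpamod generic map (n=>16) port map (pp( 0),pp( 1),s1(1));
A16_2 : cpamod generic map (n=>16) port map (pp( 2),pp( 3),s1(2));
A16_3 : cpamod generic map (n=>16) port map (pp( 4),pp( 5),s1(3));
A16_4 : cpamod generic map (n=>16) port map (pp( 6),pp( 7),s1(4));
A16_5 : cpamod generic map (n=>16) port map (pp( 8),pp( 9),s1(5));
A16_6 : cpamod generic map (n=>16) port map (pp(10),pp(11),s1(6));
A16_7 : cpamod generic map (n=>16) port map (pp(12),pp(13),s1(7));
A16_8 : cpamod generic map (n=>16) port map (pp(14),pp(15),s1(8));
A17_1 : cpamod generic map (n=>17) port map (s1(1),s1(2),s2(1));
A17_2 : cpamod generic map (n=>17) port map (s1(3),s1(4),s2(2));
A17_3 : cpamod generic map (n=>17) port map (s1(5),s1(6),s2(3));
A17_4 : cpamod generic map (n=>17) port map (s1(7),s1(8),s2(4));
A18_1 : cpamod generic map (n=>18) port map (s2(1),s2(2),s3(1));
A18_2 : cpamod generic map (n=>18) port map (s2(3),s2(4),s3(2));
A19_1 : cpamod generic map (n=>19) port map (s3(1),s3(2),pp_sum);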
process(x, y)
begin
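if(x(0) = '1') then pp(0) <= y(15 downto 0);
else pp(0) <= "0000000000000000";
end if;
if(x(1) = '1') then pp(1) <= y(14 downto 0) & not(y(15));
else pp(1) <= "0000000000000001";
end if;
if(x(2) = '1') then pp(2) <= y(13 downto 0) & not(y(15 downto 14));
else pp(2) <= "0000000000000011";
end if;
if(x(3) = '1') then pp(3) <= y(12 downto 0) & not(y(15 downto 13));
else pp(3) <= "0000000000000111";
end if;
if(x(4) = '1') then pp(4) <= y(11 downto 0) & not(y(15 downto 12));
else pp(4) <= "0000000000001111";
end if;
if(x(5) = '1') then pp(5) <= y(10 downto 0) & not(y(15 downto 11));
else pp(5) <= "0000000000011111";
end if;
if(x(6) = '1') then pp(6) <= y(9 downto 0) & not(y(15 downto 10));
else pp(6) <= "0000000000111111";
end if;
if(x(7) = '1') then pp(7) <= y(8 downto 0) & not(y(15 downto 9));
else pp(7) <= "0000000001111111";
end if;
if(x(8) = '1') then pp(8) <= y(7 downto 0) & not(y(15 downto 8));
else pp(8) <= "0000000011111111";
end if;
if(x(9) = '1') then pp(9) <= y(6 downto 0) & not(y(15 downto 7));
else pp(9) <= "0000000111111111";
end if;
if(x(10) = '1') then pp(10) <= y(5 downto 0) & not(y(15 downto 6));
else pp(10) <= "0000001111111111";
end if;
if(x(11) = '1') then pp(11) <= y(4 downto 0) & not(y(15 downto 5));
else pp(11) <= "0000011111111111";
end if;
if(x(12) = '1') then pp(12) <= y(3 downto 0) & not(y(15 downto 4));
else pp(12) <= "0000111111111111";
end if;
if(x(13) = '1') then pp(13) <= y(2 downto 0) & not(y(15 downto 3));
else pp(13) <= "0001111111111111";
end if;
if(x(14) = '1') then pp(14) <= y(1 downto 0) & not(y(15 downto 2));
else pp(14) <= "0011111111111111";
end if;
if(x(15) = '1') then pp(15) <= y(0) & not(y(15 downto 1));
else pp(15) <= "0111111111111111";
end if;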
end process;
not_x_exp <= '0' & not(x);
not_y_exp <= '0' & not(y);
process(x, y, not_x_exp, not_y_exp)
begin
if(x = "0000000000000000" and y = "0000000000000000") then
k <= "00000000000000010";
elsif(x = "0000000000000000" and y /= "0000000000000000") then
k <= not_y_exp + 3;
elsif(x /= "0000000000000000" and y = "0000000000000000") then
k <= not_x_exp + 3;
else
k <= "00000000000000001";
end if;
end process;
process(clk)
begin
if(rising_edge(clk)) then
pp_sum_reg <= pp_sum;
k_reg <= k;
end if;
end process;
not_sh_exp <= "000000000000" & not(sh);
AT_1: cpamod generic map (n=>16) port map (sl,not_sh_exp,t0);
s <= ("000"&k_reg) + pp_sum_reg;
sl <= s(15 downto 0);
sh <= s(19 downto 16);
t1 <= t0 + 1;
t2 <= t0 + 2;
process(t1, t2)
begin
if(t1(16) = '1') then
p <= t1(15 downto 0);
else
p <= t2(15 downto 0);
end if;
end process;
end behav;
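The multiplier can be checked against the definition of IDEA multiplication: the product is taken modulo 2^16 + 1, and the all-zero word stands for 2^16. The testbench below is a minimal sketch under that definition; its name, clock period, and test vectors are assumptions made for illustration and do not come from the original design files. It relies on the single register stage visible in the listing, so each result is sampled shortly after the rising clock edge that follows the inputs.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (name, clock period and vectors are assumptions).
entity idea_modmul_tb is
end idea_modmul_tb;

architecture sim of idea_modmul_tb is
  signal clk  : std_logic := '0';
  signal x, y : std_logic_vector(15 downto 0) := (others => '0');
  signal p    : std_logic_vector(15 downto 0);
  signal done : boolean := false;
begin
  dut : entity work.idea_modmul
    port map (clk => clk, x => x, y => y, p => p);

  -- 10 ns clock that stops once the stimulus process has finished.
  clk <= not clk after 5 ns when not done else '0';

  stimulus : process
    -- Apply one operand pair and compare the registered result.
    procedure check(a, b, expected : in std_logic_vector(15 downto 0)) is
    begin
      x <= a;
      y <= b;
      wait until rising_edge(clk);  -- partial-product sum is registered here
      wait for 1 ns;                -- let the logic after the register settle
      assert p = expected
        report "idea_modmul: unexpected product" severity error;
    end procedure;
  begin
    check(x"0001", x"0001", x"0001");  -- 1 * 1 = 1
    check(x"0003", x"0005", x"000F");  -- 3 * 5 = 15
    check(x"0000", x"0000", x"0001");  -- 2^16 * 2^16 mod (2^16 + 1) = 1
    check(x"8000", x"0002", x"0000");  -- 2^15 * 2 = 2^16, encoded as zero
    done <= true;
    wait;
  end process;
end sim;
```

The last two vectors cover the special encoding of 2^16 on the input side and on the output side, which are the cases most likely to expose an error in the correction term k.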
Carry-propagate adder for modular multiplier (cpamod.vhd)
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
entity cpamod is
generic(n : integer := 16);
port(
a, b : in std_logic_vector(n-1 downto 0);
sum : out std_logic_vector(n downto 0)
);
end cpamod;
architecture dflow of cpamod is
begin
sum <= conv_std_logic_vector((conv_integer(a) + conv_integer(b)),n+1);
end dflow;
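Since cpamod is a plain n-bit adder with an (n+1)-bit result, its carry-out behavior is easy to confirm in isolation. The following sketch instantiates the 16-bit case and checks one sum that overflows into bit 16 and one that does not. The testbench name cpamod_tb and the operand values are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench (name and vectors are illustrative assumptions).
entity cpamod_tb is
end cpamod_tb;

architecture sim of cpamod_tb is
  signal a, b : std_logic_vector(15 downto 0) := (others => '0');
  signal sum  : std_logic_vector(16 downto 0);
begin
  dut : entity work.cpamod
    generic map (n => 16)
    port map (a => a, b => b, sum => sum);

  stimulus : process
  begin
    -- FFFF + 0001 = 10000 (hex): carry into bit 16, low word zero.
    a <= x"FFFF";
    b <= x"0001";
    wait for 10 ns;
    assert sum(16) = '1' and sum(15 downto 0) = x"0000"
      report "cpamod: carry-out case failed" severity error;

    -- 1234 + 0F0F = 2143 (hex): no carry into bit 16.
    a <= x"1234";
    b <= x"0F0F";
    wait for 10 ns;
    assert sum(16) = '0' and sum(15 downto 0) = x"2143"
      report "cpamod: no-carry case failed" severity error;

    report "cpamod_tb finished" severity note;
    wait;
  end process;
end sim;
```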
AES forward and inverse S-Boxes (aes_sbox.vhd)
library ieee;
use ieee.std_logic_1164.all;
entity aes_sbox is
port (
dir : in std_logic;
a : in std_logic_vector(7 downto 0);
b : out std_logic_vector(7 downto 0)
);
end aes_sbox;
architecture dflow of aes_sbox is
signal b_enc, b_dec : std_logic_vector(7 downto 0);
begin
with a select
b_enc <=
"01100011" when "00000000",-- 0x00 -> 0x63
"01111100" when "00000001",
"01110111" when "00000010",
"01111011" when "00000011",
"11110010" when "00000100",
"01101011" when "00000101",
"01101111" when "00000110",
"11000101" when "00000111",
"00110000" when "00001000",-- 0x08 -> 0x30
"00000001" when "00001001",
"01100111" when "00001010",
"00101011" when "00001011",
"11111110" when "00001100",
"11010111" when "00001101",
"10101011" when "00001110",
"01110110" when "00001111",
"11001010" when "00010000",-- 0x10 -> 0xca
"10000010" when "00010001",
"11001001" when "00010010",
"01111101" when "00010011",
"11111010" when "00010100",
"01011001" when "00010101",
"01000111" when "00010110",
"11110000" when "00010111",
"10101101" when "00011000",-- 0x18 -> 0xad
"11010100" when "00011001",
"10100010" when "00011010",
"10101111" when "00011011",
"10011100" when "00011100",
"10100100" when "00011101",
"01110010" when "00011110",
"11000000" when "00011111",
"10110111" when "00100000",-- 0x20 -> 0xb7
"11111101" when "00100001",
"10010011" when "00100010",
"00100110" when "00100011",
"00110110" when "00100100",
"00111111" when "00100101",
"11110111" when "00100110",
"11001100" when "00100111",
"00110100" when "00101000",-- 0x28 -> 0x34
"10100101" when "00101001",
"11100101" when "00101010",
"11110001" when "00101011",
"01110001" when "00101100",
"11011000" when "00101101",
"00110001" when "00101110",
"00010101" when "00101111",
"00000100" when "00110000",-- 0x30 -> 0x04
"11000111" when "00110001",
"00100011" when "00110010",
"11000011" when "00110011",
"00011000" when "00110100",
"10010110" when "00110101",
"00000101" when "00110110",
"10011010" when "00110111",
"00000111" when "00111000",-- 0x38 -> 0x07
"00010010" when "00111001",
"10000000" when "00111010",
"11100010" when "00111011",
"11101011" when "00111100",
"00100111" when "00111101",
"10110010" when "00111110",
"01110101" when "00111111",
"00001001" when "01000000",-- 0x40 -> 0x09
"10000011" when "01000001",
"00101100" when "01000010",
"00011010" when "01000011",
"00011011" when "01000100",
"01101110" when "01000101",
"01011010" when "01000110",
"10100000" when "01000111",
"01010010" when "01001000",-- 0x48 -> 0x52
"00111011" when "01001001",
"11010110" when "01001010",
"10110011" when "01001011",
"00101001" when "01001100",
"11100011" when "01001101",
"00101111" when "01001110",
"10000100" when "01001111",
"01010011" when "01010000",-- 0x50 -> 0x53
"11010001" when "01010001",
"00000000" when "01010010",
"11101101" when "01010011",
"00100000" when "01010100",
"11111100" when "01010101",
"10110001" when "01010110",
"01011011" when "01010111",
"01101010" when "01011000",-- 0x58 -> 0x6a
"11001011" when "01011001",
"10111110" when "01011010",
"00111001" when "01011011",
"01001010" when "01011100",
"01001100" when "01011101",
"01011000" when "01011110",
"11001111" when "01011111",
"11010000" when "01100000",-- 0x60 -> 0xd0
"11101111" when "01100001",
"10101010" when "01100010",
"11111011" when "01100011",
"01000011" when "01100100",
"01001101" when "01100101",
"00110011" when "01100110",
"10000101" when "01100111",
"01000101" when "01101000",-- 0x68 -> 0x45
"11111001" when "01101001",
"00000010" when "01101010",
"01111111" when "01101011",
"01010000" when "01101100",
"00111100" when "01101101",
"10011111" when "01101110",
"10101000" when "01101111",
"01010001" when "01110000",-- 0x70 -> 0x51
"10100011" when "01110001",
"01000000" when "01110010",
"10001111" when "01110011",
"10010010" when "01110100",
"10011101" when "01110101",
"00111000" when "01110110",
"11110101" when "01110111",
"10111100" when "01111000",-- 0x78 -> 0xbc
"10110110" when "01111001",
"11011010" when "01111010",
"00100001" when "01111011",
"00010000" when "01111100",
"11111111" when "01111101",
"11110011" when "01111110",
"11010010" when "01111111",
"11001101" when "10000000",-- 0x80 -> 0xcd
"00001100" when "10000001",
"00010011" when "10000010",
"11101100" when "10000011",
"01011111" when "10000100",
"10010111" when "10000101",
"01000100" when "10000110",
"00010111" when "10000111",
"11000100" when "10001000",-- 0x88 -> 0xc4
"10100111" when "10001001",
"01111110" when "10001010",
"00111101" when "10001011",
"01100100" when "10001100",
"01011101" when "10001101",
"00011001" when "10001110",
"01110011" when "10001111",
"01100000" when "10010000",-- 0x90 -> 0x60
"10000001" when "10010001",
"01001111" when "10010010",
"11011100" when "10010011",
"00100010" when "10010100",
"00101010" when "10010101",
"10010000" when "10010110",
"10001000" when "10010111",
"01000110" when "10011000",-- 0x98 -> 0x46
"11101110" when "10011001",
"10111000" when "10011010",
"00010100" when "10011011",
"11011110" when "10011100",
"01011110" when "10011101",
"00001011" when "10011110",
"11011011" when "10011111",
"11100000" when "10100000",-- 0xa0 -> 0xe0
"00110010" when "10100001",
"00111010" when "10100010",
"00001010" when "10100011",
"01001001" when "10100100",
"00000110" when "10100101",
"00100100" when "10100110",
"01011100" when "10100111",
"11000010" when "10101000",-- 0xa8 -> 0xc2
"11010011" when "10101001",
"10101100" when "10101010",
"01100010" when "10101011",
"10010001" when "10101100",
"10010101" when "10101101",
"11100100" when "10101110",
"01111001" when "10101111",
"11100111" when "10110000",-- 0xb0 -> 0xe7
"11001000" when "10110001",
"00110111" when "10110010",
"01101101" when "10110011",
"10001101" when "10110100",
"11010101" when "10110101",
"01001110" when "10110110",
"10101001" when "10110111",
"01101100" when "10111000",-- 0xb8 -> 0x6c
"01010110" when "10111001",
"11110100" when "10111010",
"11101010" when "10111011",
"01100101" when "10111100",
"01111010" when "10111101",
"10101110" when "10111110",
"00001000" when "10111111",
"10111010" when "11000000",-- 0xc0 -> 0xba
"01111000" when "11000001",
"00100101" when "11000010",
"00101110" when "11000011",
"00011100" when "11000100",
"10100110" when "11000101",
"10110100" when "11000110",
"11000110" when "11000111",
"11101000" when "11001000",-- 0xc8 -> 0xe8
"11011101" when "11001001",
"01110100" when "11001010",
"00011111" when "11001011",
"01001011" when "11001100",
"10111101" when "11001101",
"10001011" when "11001110",
"10001010" when "11001111",
"01110000" when "11010000",-- 0xd0 -> 0x70
"00111110" when "11010001",
"10110101" when "11010010",
"01100110" when "11010011",
"01001000" when "11010100",
"00000011" when "11010101",
"11110110" when "11010110",
"00001110" when "11010111",
"01100001" when "11011000",-- 0xd8 -> 0x61
"00110101" when "11011001",
"01010111" when "11011010",
"10111001" when "11011011",
"10000110" when "11011100",
"11000001" when "11011101",
"00011101" when "11011110",
"10011110" when "11011111",
"11100001" when "11100000",-- 0xe0 -> 0xe1
"11111000" when "11100001",
"10011000" when "11100010",
"00010001" when "11100011",
"01101001" when "11100100",
"11011001" when "11100101",
"10001110" when "11100110",
"10010100" when "11100111",
"10011011" when "11101000",-- 0xe8 -> 0x9b
"00011110" when "11101001",
"10000111" when "11101010",
"11101001" when "11101011",
"11001110" when "11101100",
"01010101" when "11101101",
"00101000" when "11101110",
"11011111" when "11101111",
"10001100" when "11110000",-- 0xf0 -> 0x8c
"10100001" when "11110001",
"10001001" when "11110010",
"00001101" when "11110011",
"10111111" when "11110100",
"11100110" when "11110101",
"01000010" when "11110110",
"01101000" when "11110111",
"01000001" when "11111000",-- 0xf8 -> 0x41
"10011001" when "11111001",
"00101101" when "11111010",
"00001111" when "11111011",
"10110000" when "11111100",
"01010100" when "11111101",
"10111011" when "11111110",
"00010110" when "11111111",
"00000000" when others;
with a select
b_dec <=
"01010010" when "00000000",-- 0x00 -> 0x52
"00001001" when "00000001",
"01101010" when "00000010",
"11010101" when "00000011",
"00110000" when "00000100",
"00110110" when "00000101",
"10100101" when "00000110",
"00111000" when "00000111",
"10111111" when "00001000",-- 0x08 -> 0xbf
"01000000" when "00001001",
"10100011" when "00001010",
"10011110" when "00001011",
"10000001" when "00001100",
"11110011" when "00001101",
"11010111" when "00001110",
"11111011" when "00001111",
"01111100" when "00010000",-- 0x10 -> 0x7c
"11100011" when "00010001",
"00111001" when "00010010",
"10000010" when "00010011",
"10011011" when "00010100",
"00101111" when "00010101",
"11111111" when "00010110",
"10000111" when "00010111",
"00110100" when "00011000",-- 0x18 -> 0x34
"10001110" when "00011001",
"01000011" when "00011010",
"01000100" when "00011011",
"11000100" when "00011100",
"11011110" when "00011101",
"11101001" when "00011110",
"11001011" when "00011111",
"01010100" when "00100000",-- 0x20 -> 0x54
"01111011" when "00100001",
"10010100" when "00100010",
"00110010" when "00100011",
"10100110" when "00100100",
"11000010" when "00100101",
"00100011" when "00100110",
"00111101" when "00100111",
"11101110" when "00101000",-- 0x28 -> 0xee
"01001100" when "00101001",
"10010101" when "00101010",
"00001011" when "00101011",
"01000010" when "00101100",
"11111010" when "00101101",
"11000011" when "00101110",
"01001110" when "00101111",
"00001000" when "00110000",-- 0x30 -> 0x08
"00101110" when "00110001",
"10100001" when "00110010",
"01100110" when "00110011",
"00101000" when "00110100",
"11011001" when "00110101",
"00100100" when "00110110",
"10110010" when "00110111",
"01110110" when "00111000",-- 0x38 -> 0x76
"01011011" when "00111001",
"10100010" when "00111010",
"01001001" when "00111011",
"01101101" when "00111100",
"10001011" when "00111101",
"11010001" when "00111110",
"00100101" when "00111111",
"01110010" when "01000000",-- 0x40 -> 0x72
"11111000" when "01000001",
"11110110" when "01000010",
"01100100" when "01000011",
"10000110" when "01000100",
"01101000" when "01000101",
"10011000" when "01000110",
"00010110" when "01000111",
"11010100" when "01001000",-- 0x48 -> 0xd4
"10100100" when "01001001",
"01011100" when "01001010",
"11001100" when "01001011",
"01011101" when "01001100",
"01100101" when "01001101",
"10110110" when "01001110",
"10010010" when "01001111",
"01101100" when "01010000",-- 0x50 -> 0x6c
"01110000" when "01010001",
"01001000" when "01010010",
"01010000" when "01010011",
"11111101" when "01010100",
"11101101" when "01010101",
"10111001" when "01010110",
"11011010" when "01010111",
"01011110" when "01011000",-- 0x58 -> 0x5e
"00010101" when "01011001",
"01000110" when "01011010",
"01010111" when "01011011",
"10100111" when "01011100",
"10001101" when "01011101",
"10011101" when "01011110",
"10000100" when "01011111",
"10010000" when "01100000",-- 0x60 -> 0x90
"11011000" when "01100001",
"10101011" when "01100010",
"00000000" when "01100011",
"10001100" when "01100100",
"10111100" when "01100101",
"11010011" when "01100110",
"00001010" when "01100111",
"11110111" when "01101000",-- 0x68 -> 0xf7
"11100100" when "01101001",
"01011000" when "01101010",
"00000101" when "01101011",
"10111000" when "01101100",
"10110011" when "01101101",
"01000101" when "01101110",
"00000110" when "01101111",
"11010000" when "01110000",-- 0x70 -> 0xd0
"00101100" when "01110001",
"00011110" when "01110010",
"10001111" when "01110011",
"11001010" when "01110100",
"00111111" when "01110101",
"00001111" when "01110110",
"00000010" when "01110111",
"11000001" when "01111000",-- 0x78 -> 0xc1
"10101111" when "01111001",
"10111101" when "01111010",
"00000011" when "01111011",
"00000001" when "01111100",
"00010011" when "01111101",
"10001010" when "01111110",
"01101011" when "01111111",
"00111010" when "10000000",-- 0x80 -> 0x3a
"10010001" when "10000001",
"00010001" when "10000010",
"01000001" when "10000011",
"01001111" when "10000100",
"01100111" when "10000101",
"11011100" when "10000110",
"11101010" when "10000111",
"10010111" when "10001000",-- 0x88 -> 0x97
"11110010" when "10001001",
"11001111" when "10001010",
"11001110" when "10001011",
"11110000" when "10001100",
"10110100" when "10001101",
"11100110" when "10001110",
"01110011" when "10001111",
"10010110" when "10010000",-- 0x90 -> 0x96
"10101100" when "10010001",
"01110100" when "10010010",
"00100010" when "10010011",
"11100111" when "10010100",
"10101101" when "10010101",
"00110101" when "10010110",
"10000101" when "10010111",
"11100010" when "10011000",-- 0x98 -> 0xe2
"11111001" when "10011001",
"00110111" when "10011010",
"11101000" when "10011011",
"00011100" when "10011100",
"01110101" when "10011101",
"11011111" when "10011110",
"01101110" when "10011111",
"01000111" when "10100000",-- 0xa0 -> 0x47
"11110001" when "10100001",
"00011010" when "10100010",
"01110001" when "10100011",
"00011101" when "10100100",
"00101001" when "10100101",
"11000101" when "10100110",
"10001001" when "10100111",
"01101111" when "10101000",-- 0xa8 -> 0x6f
"10110111" when "10101001",
"01100010" when "10101010",
"00001110" when "10101011",
"10101010" when "10101100",
"00011000" when "10101101",
"10111110" when "10101110",
"00011011" when "10101111",
"11111100" when "10110000",-- 0xb0 -> 0xfc
"01010110" when "10110001",
"00111110" when "10110010",
"01001011" when "10110011",
"11000110" when "10110100",
"11010010" when "10110101",
"01111001" when "10110110",
"00100000" when "10110111",
"10011010" when "10111000",-- 0xb8 -> 0x9a
"11011011" when "10111001",
"11000000" when "10111010",
"11111110" when "10111011",
"01111000" when "10111100",
"11001101" when "10111101",
"01011010" when "10111110",
"11110100" when "10111111",
"00011111" when "11000000",-- 0xc0 -> 0x1f
"11011101" when "11000001",
"10101000" when "11000010",
"00110011" when "11000011",
"10001000" when "11000100",
"00000111" when "11000101",
"11000111" when "11000110",
"00110001" when "11000111",
"10110001" when "11001000",-- 0xc8 -> 0xb1
"00010010" when "11001001",
"00010000" when "11001010",
"01011001" when "11001011",
"00100111" when "11001100",
"10000000" when "11001101",
"11101100" when "11001110",
"01011111" when "11001111",
"01100000" when "11010000",-- 0xd0 -> 0x60
"01010001" when "11010001",
"01111111" when "11010010",
"10101001" when "11010011",
"00011001" when "11010100",
"10110101" when "11010101",
"01001010" when "11010110",
"00001101" when "11010111",
"00101101" when "11011000",-- 0xd8 -> 0x2d
"11100101" when "11011001",
"01111010" when "11011010",
"10011111" when "11011011",
"10010011" when "11011100",
"11001001" when "11011101",
"10011100" when "11011110",
"11101111" when "11011111",
"10100000" when "11100000",-- 0xe0 -> 0xa0
"11100000" when "11100001",
"00111011" when "11100010",
"01001101" when "11100011",
"10101110" when "11100100",
"00101010" when "11100101",
"11110101" when "11100110",
"10110000" when "11100111",
"11001000" when "11101000",-- 0xe8 -> 0xc8
"11101011" when "11101001",
"10111011" when "11101010",
"00111100" when "11101011",
"10000011" when "11101100",
"01010011" when "11101101",
"10011001" when "11101110",
"01100001" when "11101111",
"00010111" when "11110000",-- 0xf0 -> 0x17
"00101011" when "11110001",
"00000100" when "11110010",
"01111110" when "11110011",
"10111010" when "11110100",
"01110111" when "11110101",
"11010110" when "11110110",
"00100110" when "11110111",
"11100001" when "11111000",-- 0xf8 -> 0xe1
"01101001" when "11111001",
"00010100" when "11111010",
"01100011" when "11111011",
"01010101" when "11111100",
"00100001" when "11111101",
"00001100" when "11111110",
"01111101" when "11111111",
"00000000" when others;
with dir select
b <= b_enc when '0', b_dec when '1', "00000000" when others;
end dflow;
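The two lookup tables above (b_enc and b_dec) can be cross-checked in software. The following is a minimal Python sketch, not part of the thesis sources, that assumes the tables are the standard AES S-box and its inverse (multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, followed by the affine transform with constant 0x63). The helper names gf_mul, gf_inv, rotl8, SBOX and INV_SBOX are hypothetical.

```python
def gf_mul(x, y):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x100:
            x ^= 0x11B
        y >>= 1
    return r


def gf_inv(x):
    """Multiplicative inverse in GF(2^8); 0 is mapped to 0 by convention."""
    if x == 0:
        return 0
    r = 1
    for _ in range(254):  # x^254 equals x^-1 in a field with 256 elements
        r = gf_mul(r, x)
    return r


def rotl8(v, n):
    """Rotate an 8-bit value left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def sbox_entry(x):
    """Forward S-box: field inverse followed by the affine transform."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [sbox_entry(x) for x in range(256)]

# The inverse table is the forward table read backwards.
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x
```

Each "value when address" pair in the listing can then be compared against SBOX[address] (for b_enc) or INV_SBOX[address] (for b_dec).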
Galois field fixed field constant matrix multiplier (gf_k_mult.vhd)
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY gf_k_mult IS
PORT ( din1, din2 : IN std_logic_vector (31 DOWNTO 0);
rst, enable : IN std_logic;
clk : IN std_logic;
b : OUT std_logic_vector (31 DOWNTO 0));
END gf_k_mult;
ARCHITECTURE behav OF gf_k_mult IS
SIGNAL gf_enable, int_enable : std_logic_vector (15 DOWNTO 0);
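SIGNAL a0_r0, a0_r1, a0_r2, a0_r3 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL a1_r0, a1_r1, a1_r2, a1_r3 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL a2_r0, a2_r1, a2_r2, a2_r3 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL a3_r0, a3_r1, a3_r2, a3_r3 : std_logic_vector ( 7 DOWNTO 0);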
SIGNAL t0_1, t0_2, t1_1, t1_2 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL t2_1, t2_2, t3_1, t3_2 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL enable_prev : std_logic;
COMPONENT gf2x8 IS
PORT ( din1, din2 : IN std_logic_vector (31 DOWNTO 0);
clk, rst : IN std_logic;
enable : IN std_logic;
a : IN std_logic_vector ( 7 DOWNTO 0);
b : OUT std_logic_vector ( 7 DOWNTO 0));
END COMPONENT;
BEGIN
-- Assumes a 32-bit processor
-- Allows each 8x8 matrix to be configured in two instructions
-- Allows input a to be loaded with result computed in one instruction
-- 16 8x8 matrices and a input require 17 instructions
-- Row 0
GFK00A0: gf2x8 PORT MAP (din1 => din1, din2 => din2, clk => clk, rst => rst, a => din1 (31 DOWNTO
24), enable => gf_enable ( 0), b => a0_r0);
GFK03A3: gf2x8 PORT MAP (din1 => din1, din2 => din2, clk => clk, rst => rst, a => din1 ( 7 DOWNTO
-- Row 1
-- Row 2
8), enable => gf_enable (10), b => a2_r2);
-- Row 3
t0_1 <= a0_r0 XOR a1_r0;
t0_2 <= a2_r0 XOR a3_r0;
b (31 DOWNTO 24) <= t0_1 XOR t0_2;
t1_1 <= a0_r1 XOR a1_r1;
t1_2 <= a2_r1 XOR a3_r1;
b (23 DOWNTO 16) <= t1_1 XOR t1_2;
t2_1 <= a0_r2 XOR a1_r2;
t2_2 <= a2_r2 XOR a3_r2;
b (15 DOWNTO 8) <= t2_1 XOR t2_2;
t3_1 <= a0_r3 XOR a1_r3;
t3_2 <= a2_r3 XOR a3_r3;
b ( 7 DOWNTO 0) <= t3_1 XOR t3_2;
PROCESS(clk, rst)
BEGIN
IF (rst='0') THEN
int_enable <= "0000000000000001";
ELSIF (clk='0' AND clk'EVENT) THEN
enable_prev <= enable;
IF (enable='0' AND enable_prev='1') THEN
int_enable <= int_enable (14 DOWNTO 0) & int_enable (15);
END IF;
END IF;
END PROCESS;
-- K30A0
WITH enable SELECT
gf_enable <= int_enable WHEN '1',
"0000000000000000" WHEN OTHERS;
END behav;
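The one-hot enable ring in the clocked process above can be modelled in software. The following is a minimal Python sketch, not part of the thesis sources; the class name EnableRing and its method names are hypothetical, and enable_prev is assumed to start at 0 (the VHDL leaves it unassigned at reset).

```python
class EnableRing:
    """Software model of int_enable / gf_enable in gf_k_mult."""

    def __init__(self):
        # Reset state: only the first gf2x8 instance is selected.
        self.int_enable = 0x0001
        self.enable_prev = 0  # assumption: the VHDL does not reset this signal

    def falling_clock(self, enable):
        """Model one falling clock edge with the given enable level (0 or 1)."""
        # The comparison uses the previous value of enable_prev, as VHDL
        # signal assignments only take effect after the process suspends.
        if enable == 0 and self.enable_prev == 1:
            # int_enable (14 DOWNTO 0) & int_enable (15): rotate left by one
            self.int_enable = ((self.int_enable << 1) | (self.int_enable >> 15)) & 0xFFFF
        self.enable_prev = enable

    def gf_enable(self, enable):
        """Combinational output: the one-hot select is gated by enable."""
        return self.int_enable if enable == 1 else 0
```

Under this model, each pulse of enable loads the currently selected matrix and then advances the selection by one position, so sixteen pulses return the ring to its reset state.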
Galois field inner product multiplier (gf2x8.vhd)
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_unsigned.ALL;
ENTITY gf2x8 IS
PORT ( din1, din2 : IN std_logic_vector (31 DOWNTO 0);
clk, rst : IN std_logic;
enable : IN std_logic;
a : IN std_logic_vector ( 7 DOWNTO 0);
b : OUT std_logic_vector ( 7 DOWNTO 0));
END gf2x8;
ARCHITECTURE behav OF gf2x8 IS
TYPE ram_data IS ARRAY (7 DOWNTO 0) OF std_logic_vector (7 DOWNTO 0);
SIGNAL data : ram_data;
SIGNAL row0, row1, row2, row3 : std_logic_vector ( 7 DOWNTO 0);
SIGNAL row4, row5, row6, row7 : std_logic_vector ( 7 DOWNTO 0);
BEGIN
PROCESS(rst,clk)
BEGIN
IF (rst='0') THEN
FOR i IN 0 to 7 LOOP
data(i) <= "00000000";
END LOOP;
ELSIF (clk='1' AND clk'EVENT) THEN
IF (enable='1') THEN
data(0) <= din1 (31 DOWNTO 24);
data(3) <= din1 ( 7 DOWNTO 0);
data(7) <= din2 ( 7 DOWNTO 0);
END IF;
END IF;
END PROCESS;
row0 <= data(0);
row1 <= data(1);
row2 <= data(2);
row3 <= data(3);
row4 <= data(4);
row5 <= data(5);
row6 <= data(6);
row7 <= data(7);
-- perform matrix multiplication M * a_i where M is an 8x8 matrix of one bit elements and a_i is an
-- 8 bit column vector with a_i(7) being the MSB and top entry of the column and a_i(0) being the
-- LSB and bottom entry of the column
b (7) <= (row0(7) AND a (7)) XOR (row0(6) AND a (6)) XOR (row0(5) AND a (5)) XOR (row0(4) AND a (4))
XOR (row0(3) AND a (3)) XOR (row0(2) AND a (2)) XOR (row0(1) AND a (1)) XOR (row0(0) AND a (0));
END behav;
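The matrix-vector product described in the comment above can be sketched in Python. This is an illustrative model, not part of the thesis sources. It assumes that output bit 7 - i is the GF(2) inner product of row i with a, which is the pattern the listing shows for b (7) and row0. The names parity, gf2x8, matrix_for and xtime are hypothetical, and multiplication by 0x02 modulo 0x11B is chosen only as an example of a GF(2)-linear map.

```python
def parity(v):
    """XOR of all bits of v."""
    return bin(v).count("1") & 1


def gf2x8(rows, a):
    """Multiply an 8x8 bit matrix by an 8-bit column vector over GF(2).

    rows[0] is the top row and produces output bit 7, matching b (7) in the
    listing; rows[7] produces output bit 0 (assumed by symmetry).
    """
    b = 0
    for i, row in enumerate(rows):
        b |= parity(row & a) << (7 - i)
    return b


def matrix_for(f):
    """Build the row bytes for any GF(2)-linear byte function f."""
    rows = []
    for i in range(8):
        out_bit = 7 - i
        row = 0
        for j in range(8):
            # Column j of the matrix is f applied to the j-th unit vector.
            if (f(1 << j) >> out_bit) & 1:
                row |= 1 << j
        rows.append(row)
    return rows


def xtime(x):
    """Multiply by 0x02 in GF(2^8) modulo 0x11B (example constant only)."""
    x <<= 1
    return (x ^ 0x11B) & 0xFF if x & 0x100 else x


ROWS_X2 = matrix_for(xtime)
```

With ROWS_X2 loaded as the eight row bytes, gf2x8(ROWS_X2, a) gives the same result as xtime(a) for every byte a, which is the property a fixed-constant multiplier relies on.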
Instruction set extension configuration package (ext_config.vhd)
NOTE: Assigned values are shown as an example
package ext_config is
type aes_sbox_config_type is (none, sbox, sbox4);
constant des_pmt_en : boolean := false;
constant des_key_f_en : boolean := false;
constant idea_mmul_en : boolean := false;
constant aes_sbox_config : aes_sbox_config_type := none;
constant gfmmul_en : boolean := false;
end ext_config;
Appendix B: Modifications to LEON2 VHDL Model and Development Tools
SPARC opcode package (sparcv8.vhd)
----------------------------------------------------------------------------
-- This file is a part of the LEON2 HDL model extended with custom
-- instructions for symmetric key cryptography.
--
-- Based on LEON VHDL model
-- Copyright (C): 1999, European Space Agency (ESA)
--
-- Modifications by Sean R O'Melia
-- 2007, University of Massachusetts Lowell
-- Center for Network and Information Security
-----------------------------------------------------------------------------
-- This library is free software; you can redistribute it and/or
-- modify it under the terms of the GNU Lesser General Public
-- License as published by the Free Software Foundation; either
-- version 2 of the License, or (at your option) any later version.
--
-- See the file COPYING.LGPL for the full details of the license.
-----------------------------------------------------------------------------
-----------------------------------------------------------------------------
-- Package: sparcv8
-- File: sparcv8.vhd
-- Authors: Jiri Gaisler - ESA/ESTEC
-- Sean R O'Melia - UML CNIS
-- Description: Package with SPARC V8 instruction definitions
------------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.conv_unsigned;
use work.config.all;
use work.ext_config.all; -- SRO (07/21/07)
package sparcv8 is
-- These are computed automatically - do not change!!
constant CWPMIN : std_logic_vector(NWINLOG2-1 downto 0) := (others => '0');
constant CWPMAX : std_logic_vector(NWINLOG2-1 downto 0) :=
std_logic_vector(conv_unsigned(NWINDOWS-1, NWINLOG2));
constant R0ADDR : std_logic_vector(RABITS-5 downto 0) :=
std_logic_vector(conv_unsigned(NWINDOWS + FPREG/16, RABITS-4));
constant F0ADDR : std_logic_vector(RABITS-5 downto 0) :=
std_logic_vector(conv_unsigned(NWINDOWS, RABITS-4));
-- OP codes (INST[31..30])
constant CALL : std_logic_vector(1 downto 0) := "01";
constant FMT2 : std_logic_vector(1 downto 0) := "00";
constant FMT3 : std_logic_vector(1 downto 0) := "10";
constant LDST : std_logic_vector(1 downto 0) := "11";
-- OP2 codes (INST[24..22])
constant UNIMP : std_logic_vector(2 downto 0) := "000";
constant BICC : std_logic_vector(2 downto 0) := "010";
constant SETHI : std_logic_vector(2 downto 0) := "100";
constant FBFCC : std_logic_vector(2 downto 0) := "110";
constant CBCCC : std_logic_vector(2 downto 0) := "111";
-- OP3 codes (INST[24..19])
constant IADD : std_logic_vector(5 downto 0) := "000000";
constant IAND : std_logic_vector(5 downto 0) := "000001";
constant IOR : std_logic_vector(5 downto 0) := "000010";
constant IXOR : std_logic_vector(5 downto 0) := "000011";
constant ISUB : std_logic_vector(5 downto 0) := "000100";
constant ANDN : std_logic_vector(5 downto 0) := "000101";
constant ORN : std_logic_vector(5 downto 0) := "000110";
constant IXNOR : std_logic_vector(5 downto 0) := "000111";
constant ADDX : std_logic_vector(5 downto 0) := "001000";
constant UMUL : std_logic_vector(5 downto 0) := "001010";
constant SMUL : std_logic_vector(5 downto 0) := "001011";
constant SUBX : std_logic_vector(5 downto 0) := "001100";
constant UDIV : std_logic_vector(5 downto 0) := "001110";
constant SDIV : std_logic_vector(5 downto 0) := "001111";
constant ADDCC : std_logic_vector(5 downto 0) := "010000";
constant ANDCC : std_logic_vector(5 downto 0) := "010001";
constant ORCC : std_logic_vector(5 downto 0) := "010010";
constant XORCC : std_logic_vector(5 downto 0) := "010011";
constant SUBCC : std_logic_vector(5 downto 0) := "010100";
constant ANDNCC : std_logic_vector(5 downto 0) := "010101";
constant ORNCC : std_logic_vector(5 downto 0) := "010110";
constant XNORCC : std_logic_vector(5 downto 0) := "010111";
constant ADDXCC : std_logic_vector(5 downto 0) := "011000";
constant UMULCC : std_logic_vector(5 downto 0) := "011010";
constant SMULCC : std_logic_vector(5 downto 0) := "011011";
constant SUBXCC : std_logic_vector(5 downto 0) := "011100";
constant UDIVCC : std_logic_vector(5 downto 0) := "011110";
constant SDIVCC : std_logic_vector(5 downto 0) := "011111";
constant TADDCC : std_logic_vector(5 downto 0) := "100000";
constant TSUBCC : std_logic_vector(5 downto 0) := "100001";
constant TADDCCTV : std_logic_vector(5 downto 0) := "100010";
constant TSUBCCTV : std_logic_vector(5 downto 0) := "100011";
constant MULSCC : std_logic_vector(5 downto 0) := "100100";
constant ISLL : std_logic_vector(5 downto 0) := "100101";
constant ISRL : std_logic_vector(5 downto 0) := "100110";
constant ISRA : std_logic_vector(5 downto 0) := "100111";
constant RDY : std_logic_vector(5 downto 0) := "101000";
constant RDPSR : std_logic_vector(5 downto 0) := "101001";
constant RDWIM : std_logic_vector(5 downto 0) := "101010";
constant RDTBR : std_logic_vector(5 downto 0) := "101011";
constant WRY : std_logic_vector(5 downto 0) := "110000";
constant WRPSR : std_logic_vector(5 downto 0) := "110001";
constant WRWIM : std_logic_vector(5 downto 0) := "110010";
constant WRTBR : std_logic_vector(5 downto 0) := "110011";
constant FPOP1 : std_logic_vector(5 downto 0) := "110100";
constant FPOP2 : std_logic_vector(5 downto 0) := "110101";
constant CPOP1 : std_logic_vector(5 downto 0) := "110110";
constant CPOP2 : std_logic_vector(5 downto 0) := "110111";
constant JMPL : std_logic_vector(5 downto 0) := "111000";
constant TICC : std_logic_vector(5 downto 0) := "111010";
constant FLUSH : std_logic_vector(5 downto 0) := "111011";
constant RETT : std_logic_vector(5 downto 0) := "111001";
constant SAVE : std_logic_vector(5 downto 0) := "111100";
constant RESTORE : std_logic_vector(5 downto 0) := "111101";
constant UMAC : std_logic_vector(5 downto 0) := "111110";
constant SMAC : std_logic_vector(5 downto 0) := "111111";
-- BEGIN SRO (07/21/07)
-- opcodes for custom cryptographic instructions
constant DESKDF : std_logic_vector(5 downto 0) := "001001";
constant DESPMT : std_logic_vector(5 downto 0) := "001101";
constant GFMKLD : std_logic_vector(5 downto 0) := "011001";
constant GFMMUL : std_logic_vector(5 downto 0) := "011101";
constant AESSB : std_logic_vector(5 downto 0) := "101100";
constant MMUL16 : std_logic_vector(5 downto 0) := "101101";
-- END SRO (07/21/07)
constant LD : std_logic_vector(5 downto 0) := "000000";
constant LDUB : std_logic_vector(5 downto 0) := "000001";
constant LDUH : std_logic_vector(5 downto 0) := "000010";
constant LDD : std_logic_vector(5 downto 0) := "000011";
constant LDSB : std_logic_vector(5 downto 0) := "001001";
constant LDSH : std_logic_vector(5 downto 0) := "001010";
constant LDSTUB : std_logic_vector(5 downto 0) := "001101";
constant SWAP : std_logic_vector(5 downto 0) := "001111";
constant LDA : std_logic_vector(5 downto 0) := "010000";
constant LDUBA : std_logic_vector(5 downto 0) := "010001";
constant LDUHA : std_logic_vector(5 downto 0) := "010010";
constant LDDA : std_logic_vector(5 downto 0) := "010011";
constant LDSBA : std_logic_vector(5 downto 0) := "011001";
constant LDSHA : std_logic_vector(5 downto 0) := "011010";
constant LDSTUBA : std_logic_vector(5 downto 0) := "011101";
constant SWAPA : std_logic_vector(5 downto 0) := "011111";
constant LDF : std_logic_vector(5 downto 0) := "100000";
constant LDFSR : std_logic_vector(5 downto 0) := "100001";
constant LDDF : std_logic_vector(5 downto 0) := "100011";
constant LDC : std_logic_vector(5 downto 0) := "110000";
constant LDCSR : std_logic_vector(5 downto 0) := "110001";
constant LDDC : std_logic_vector(5 downto 0) := "110011";
constant ST : std_logic_vector(5 downto 0) := "000100";
constant STB : std_logic_vector(5 downto 0) := "000101";
constant STH : std_logic_vector(5 downto 0) := "000110";
constant ISTD : std_logic_vector(5 downto 0) := "000111";
constant STA : std_logic_vector(5 downto 0) := "010100";
constant STBA : std_logic_vector(5 downto 0) := "010101";
constant STHA : std_logic_vector(5 downto 0) := "010110";
constant STDA : std_logic_vector(5 downto 0) := "010111";
constant STF : std_logic_vector(5 downto 0) := "100100";
constant STFSR : std_logic_vector(5 downto 0) := "100101";
constant STDFQ : std_logic_vector(5 downto 0) := "100110";
constant STDF : std_logic_vector(5 downto 0) := "100111";
constant STC : std_logic_vector(5 downto 0) := "110100";
constant STCSR : std_logic_vector(5 downto 0) := "110101";
constant STDCQ : std_logic_vector(5 downto 0) := "110110";
constant STDC : std_logic_vector(5 downto 0) := "110111";
-- BICC codes
constant BA : std_logic_vector(3 downto 0) := "1000";
-- FPOP1
constant FITOS : std_logic_vector(8 downto 0) := "011000100";
constant FITOD : std_logic_vector(8 downto 0) := "011001000";
constant FSTOI : std_logic_vector(8 downto 0) := "011010001";
constant FDTOI : std_logic_vector(8 downto 0) := "011010010";
constant FSTOD : std_logic_vector(8 downto 0) := "011001001";
constant FDTOS : std_logic_vector(8 downto 0) := "011000110";
constant FMOVS : std_logic_vector(8 downto 0) := "000000001";
constant FNEGS : std_logic_vector(8 downto 0) := "000000101";
constant FABSS : std_logic_vector(8 downto 0) := "000001001";
constant FSQRTS : std_logic_vector(8 downto 0) := "000101001";
constant FSQRTD : std_logic_vector(8 downto 0) := "000101010";
constant FADDS : std_logic_vector(8 downto 0) := "001000001";
constant FADDD : std_logic_vector(8 downto 0) := "001000010";
constant FSUBS : std_logic_vector(8 downto 0) := "001000101";
constant FSUBD : std_logic_vector(8 downto 0) := "001000110";
constant FMULS : std_logic_vector(8 downto 0) := "001001001";
constant FMULD : std_logic_vector(8 downto 0) := "001001010";
constant FSMULD : std_logic_vector(8 downto 0) := "001101001";
constant FDIVS : std_logic_vector(8 downto 0) := "001001101";
constant FDIVD : std_logic_vector(8 downto 0) := "001001110";
-- FPOP2
constant FCMPS : std_logic_vector(8 downto 0) := "001010001";
constant FCMPD : std_logic_vector(8 downto 0) := "001010010";
constant FCMPES : std_logic_vector(8 downto 0) := "001010101";
constant FCMPED : std_logic_vector(8 downto 0) := "001010110";
-- ALU operation codes
constant ALU_AND : std_logic_vector(2 downto 0) := "000";
constant ALU_XOR : std_logic_vector(2 downto 0) := "001";-- must be equal to ALU_PASS2
constant ALU_OR : std_logic_vector(2 downto 0) := "010";
constant ALU_XNOR : std_logic_vector(2 downto 0) := "011";
constant ALU_ANDN : std_logic_vector(2 downto 0) := "100";
constant ALU_ORN : std_logic_vector(2 downto 0) := "101";
constant ALU_DIV : std_logic_vector(2 downto 0) := "110";
constant ALU_PASS1 : std_logic_vector(2 downto 0) := "000";
constant ALU_PASS2 : std_logic_vector(2 downto 0) := "001";
constant ALU_STB : std_logic_vector(2 downto 0) := "010";
constant ALU_STH : std_logic_vector(2 downto 0) := "011";
constant ALU_ONES : std_logic_vector(2 downto 0) := "100";
constant ALU_RDY : std_logic_vector(2 downto 0) := "101";
constant ALU_FSR : std_logic_vector(2 downto 0) := "110";
constant ALU_FOP : std_logic_vector(2 downto 0) := "111";
constant ALU_SLL : std_logic_vector(2 downto 0) := "001";
constant ALU_SRL : std_logic_vector(2 downto 0) := "010";
constant ALU_SRA : std_logic_vector(2 downto 0) := "100";
constant ALU_NOP : std_logic_vector(2 downto 0) := "000";
-- ALU result select
constant ALU_RES_ADD : std_logic_vector(1 downto 0) := "00";
constant ALU_RES_SHIFT : std_logic_vector(1 downto 0) := "01";
constant ALU_RES_LOGIC : std_logic_vector(1 downto 0) := "10";
constant ALU_RES_MISC : std_logic_vector(1 downto 0) := "11";
-- ALU operand 2 codes
constant ALU_RS2 : std_logic := '0';
constant ALU_SIMM : std_logic := '1';
-- Load types
constant LDBYTE : std_logic_vector(1 downto 0) := "00";
constant LDHALF : std_logic_vector(1 downto 0) := "01";
constant LDWORD : std_logic_vector(1 downto 0) := "10";
constant LDDBL : std_logic_vector(1 downto 0) := "11";
-- Trap types
constant IAEX_TT : std_logic_vector(5 downto 0) := "000001";
constant IINST_TT : std_logic_vector(5 downto 0) := "000010";
constant PRIV_TT : std_logic_vector(5 downto 0) := "000011";
constant FPDIS_TT : std_logic_vector(5 downto 0) := "000100";
constant WINOF_TT : std_logic_vector(5 downto 0) := "000101";
constant WINUF_TT : std_logic_vector(5 downto 0) := "000110";
constant UNALA_TT : std_logic_vector(5 downto 0) := "000111";
constant FPEXC_TT : std_logic_vector(5 downto 0) := "001000";
constant DAEX_TT : std_logic_vector(5 downto 0) := "001001";
constant TAG_TT : std_logic_vector(5 downto 0) := "001010";
constant WATCH_TT : std_logic_vector(5 downto 0) := "001011";
constant CPDIS_TT : std_logic_vector(5 downto 0) := "100100";
constant CPEXC_TT : std_logic_vector(5 downto 0) := "101000";
constant DIV_TT : std_logic_vector(5 downto 0) := "101010";
constant DSEX_TT : std_logic_vector(5 downto 0) := "101011";
constant TICC_TT : std_logic_vector(5 downto 0) := "111111";
-- ASI types
constant LASI_IFLUSH : std_logic_vector(3 downto 0) := "0101";
constant LASI_DFLUSH : std_logic_vector(3 downto 0) := "0110";
constant LASI_UINST : std_logic_vector(3 downto 0) := "1000";
constant LASI_SINST : std_logic_vector(3 downto 0) := "1001";
constant LASI_UDATA : std_logic_vector(3 downto 0) := "1010";
constant LASI_SDATA : std_logic_vector(3 downto 0) := "1011";
constant LASI_ITAG : std_logic_vector(3 downto 0) := "1100";
constant LASI_IDATA : std_logic_vector(3 downto 0) := "1101";
constant LASI_DTAG : std_logic_vector(3 downto 0) := "1110";
constant LASI_DDATA : std_logic_vector(3 downto 0) := "1111";
constant ASI_IFLUSH : std_logic_vector(4 downto 0) := "00101";
constant ASI_DFLUSH : std_logic_vector(4 downto 0) := "00110";
constant ASI_UINST : std_logic_vector(4 downto 0) := "01000";
constant ASI_SINST : std_logic_vector(4 downto 0) := "01001";
constant ASI_UDATA : std_logic_vector(4 downto 0) := "01010";
constant ASI_SDATA : std_logic_vector(4 downto 0) := "01011";
constant ASI_ITAG : std_logic_vector(4 downto 0) := "01100";
constant ASI_IDATA : std_logic_vector(4 downto 0) := "01101";
constant ASI_DTAG : std_logic_vector(4 downto 0) := "01110";
constant ASI_DDATA : std_logic_vector(4 downto 0) := "01111";
constant ASI_FLUSH_PAGE : std_logic_vector(4 downto 0) := "10000"; -- 0x10 i/dcache flush page
constant ASI_FLUSH_CTX : std_logic_vector(4 downto 0) := "10011"; -- 0x13 i/dcache flush ctx
constant ASI_DCTX : std_logic_vector(4 downto 0) := "10100"; -- 0x14 dcache ctx
constant ASI_ICTX : std_logic_vector(4 downto 0) := "10101"; -- 0x15 icache ctx
constant ASI_MMUFLUSHPROBE : std_logic_vector(4 downto 0) := "11000"; -- 0x18 i/dtlb flush/(probe)
constant ASI_MMUREGS : std_logic_vector(4 downto 0) := "11001"; -- 0x19 mmu regs access
constant ASI_MMU_BP : std_logic_vector(4 downto 0) := "11100"; -- 0x1c mmu Bypass
constant ASI_MMU_DIAG : std_logic_vector(4 downto 0) := "11101"; -- 0x1d mmu diagnostic
constant ASI_MMU_DSU : std_logic_vector(4 downto 0) := "11111"; -- 0x1f mmu diagnostic
-- FSR ftt codes
constant FPIEEE_ERR : std_logic_vector(2 downto 0) := "001";
constant FPSEQ_ERR : std_logic_vector(2 downto 0) := "100";
end;
Instruction disassembly support function disas() (from debug.vhd, lines 424-828)
function disas(insn : debug_info) return string is
constant STMAX : natural := 9;
constant bl2 : string(1 to 2) := (others => ' ');
constant bb : string(1 to (4)) := (others => ' ');
variable op : std_logic_vector(1 downto 0);
variable op2 : std_logic_vector(2 downto 0);
variable op3 : std_logic_vector(5 downto 0);
variable opf : std_logic_vector(8 downto 0);
variable cond : std_logic_vector(3 downto 0);
variable rs1, rs2, rd : std_logic_vector(4 downto 0);
variable addr : std_logic_vector(31 downto 0);
variable annul : std_logic;
variable i : std_logic;
variable simm : std_logic_vector(12 downto 0);
variable disen : boolean := true;
begin
if disen then
op := insn.op(31 downto 30);
op2 := insn.op(24 downto 22);
op3 := insn.op(24 downto 19);
opf := insn.op(13 downto 5);
cond := insn.op(28 downto 25);
annul := insn.op(29);
rs1 := insn.op(18 downto 14);
rs2 := insn.op(4 downto 0);
rd := insn.op(29 downto 25);
i := insn.op(13);
simm := insn.op(12 downto 0);
case op is
when CALL =>
addr := insn.pc + (insn.op(29 downto 0) & "00");
return(tostf(insn.pc) & bb & "call" & bl2 & tost(addr));
when FMT2 =>
case op2 is
when UNIMP => return(tostf(insn.pc) & bb & "unimp");
when SETHI =>
if rd = "00000" then
return(tostf(insn.pc) & bb & "nop");
else
return(tostf(insn.pc) & bb & "sethi" & bl2 & "%hi(" &
tost(insn.op(21 downto 0) & "0000000000") & "), " & regdec(rd));
end if;
when BICC | FBFCC =>
addr(31 downto 24) := (others => '0');
addr(23 downto 2) := insn.op(21 downto 0);
-- sign-extend the 22-bit word displacement
if addr(23) = '1' then
addr(31 downto 24) := (others => '1');
else
addr(31 downto 24) := (others => '0');
end if;
addr := addr + insn.pc;
if op2 = BICC then
if insn.op(29) = '1' then
return(tostf(insn.pc) & bb & 'b' & branchop(insn) & ",a" & bl2 &
tost(addr));
else
return(tostf(insn.pc) & bb & 'b' & branchop(insn) & bl2 &
tost(addr));
end if;
else
if insn.op(29) = '1' then
return(tostf(insn.pc) & bb & "fb" & fbranchop(insn) & ",a" & bl2 &
tost(addr));
else
return(tostf(insn.pc) & bb & "fb" & fbranchop(insn) & bl2 &
tost(addr));
end if;
end if;
-- when CBCCC => cptrap := 1;
when others => return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
end case;
when FMT3 =>
case op3 is
when IAND => return(tostf(insn.pc) & bb & "and" & bl2 & regres(insn,hex));
when IADD => return(tostf(insn.pc) & bb & "add" & bl2 & regres(insn,dec));
when IOR =>
if ((i = '0') and (rs1 = "00000") and (rs2 = "00000")) then
return(tostf(insn.pc) & bb & "clr" & bl2 & regdec(rd));
elsif ((i = '1') and (simm = "0000000000000")) or (rs1 = "00000") then
return(tostf(insn.pc) & bb & "mov" & bl2 & regres(insn,hex));
else
return(tostf(insn.pc) & bb & "or " & bl2 & regres(insn,hex));
end if;
when IXOR => return(tostf(insn.pc) & bb & "xor" & bl2 & regres(insn,hex));
when ISUB => return(tostf(insn.pc) & bb & "sub" & bl2 & regres(insn,dec));
when ANDN => return(tostf(insn.pc) & bb & "andn" & bl2 & regres(insn,hex));
when ORN => return(tostf(insn.pc) & bb & "orn" & bl2 & regres(insn,hex));
when IXNOR =>
if ((i = '0') and ((rs1 = rd) or (rs2 = "00000"))) then
return(tostf(insn.pc) & bb & "not" & bl2 & regdec(rd));
else
return(tostf(insn.pc) & bb & "xnor" & bl2 & regdec(rd));
end if;
when ADDX => return(tostf(insn.pc) & bb & "addx" & bl2 & regres(insn,dec));
when SUBX => return(tostf(insn.pc) & bb & "subx" & bl2 & regres(insn,dec));
when ADDCC => return(tostf(insn.pc) & bb & "addcc" & bl2 & regres(insn,dec));
when ANDCC => return(tostf(insn.pc) & bb & "andcc" & bl2 & regres(insn,hex));
when ORCC => return(tostf(insn.pc) & bb & "orcc" & bl2 & regres(insn,hex));
when XORCC => return(tostf(insn.pc) & bb & "xorcc" & bl2 & regres(insn,hex));
when SUBCC => return(tostf(insn.pc) & bb & "subcc" & bl2 & regres(insn,dec));
when ANDNCC => return(tostf(insn.pc) & bb & "andncc" & bl2 & regres(insn,hex));
when ORNCC => return(tostf(insn.pc) & bb & "orncc" & bl2 & regres(insn,hex));
when XNORCC => return(tostf(insn.pc) & bb & "xnorcc" & bl2 & regres(insn,hex));
when ADDXCC => return(tostf(insn.pc) & bb & "addxcc" & bl2 & regres(insn,hex));
when UMAC => return(tostf(insn.pc) & bb & "umac" & bl2 & regres(insn,dec));
when SMAC => return(tostf(insn.pc) & bb & "smac" & bl2 & regres(insn,dec));
when UMUL => return(tostf(insn.pc) & bb & "umul" & bl2 & regres(insn,dec));
when SMUL => return(tostf(insn.pc) & bb & "smul" & bl2 & regres(insn,dec));
when UMULCC => return(tostf(insn.pc) & bb & "umulcc" & bl2 & regres(insn,dec));
when SMULCC => return(tostf(insn.pc) & bb & "smulcc" & bl2 & regres(insn,dec));
when SUBXCC => return(tostf(insn.pc) & bb & "subxcc" & bl2 & regres(insn,dec));
when UDIV => return(tostf(insn.pc) & bb & "udiv" & bl2 & regres(insn,dec));
when SDIV => return(tostf(insn.pc) & bb & "sdiv" & bl2 & regres(insn,dec));
when UDIVCC => return(tostf(insn.pc) & bb & "udivcc" & bl2 & regres(insn,dec));
when SDIVCC => return(tostf(insn.pc) & bb & "sdivcc" & bl2 & regres(insn,dec));
when TADDCC => return(tostf(insn.pc) & bb & "taddcc" & bl2 & regres(insn,dec));
when TSUBCC => return(tostf(insn.pc) & bb & "tsubcc" & bl2 & regres(insn,dec));
when TADDCCTV => return(tostf(insn.pc) & bb & "taddcctv" & bl2 & regres(insn,dec));
when TSUBCCTV => return(tostf(insn.pc) & bb & "tsubcctv" & bl2 & regres(insn,dec));
when MULSCC => return(tostf(insn.pc) & bb & "mulscc" & bl2 & regres(insn,dec));
when ISLL => return(tostf(insn.pc) & bb & "sll" & bl2 & regres(insn,dec));
when ISRL => return(tostf(insn.pc) & bb & "srl" & bl2 & regres(insn,dec));
when ISRA => return(tostf(insn.pc) & bb & "sra" & bl2 & regres(insn,dec));
-- BEGIN SRO (07/21/07)
-- disassembly support for custom instructions
when DESKDF =>
if(des_key_f_en = false) then
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
else
if(i = '0') then
return(tostf(insn.pc) & bb & "deskey" & bl2 & regimm(insn,dec,false));
else
if(simm(12) = '0') then
return(tostf(insn.pc) & bb & "desdir" & bl2 & regimm(insn,dec,false));
else
return(tostf(insn.pc) & bb & "desf" & bl2 & regres(insn,dec));
end if;
end if;
end if;
when DESPMT =>
if(des_pmt_en = false) then
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
else
if(i = '0') then
case insn.op(7 downto 5) is
when "000" => return(tostf(insn.pc) & bb & "desipl" & bl2 & regres(insn,dec));
when "001" => return(tostf(insn.pc) & bb & "desipr" & bl2 & regres(insn,dec));
when "010" => return(tostf(insn.pc) & bb & "desfpl" & bl2 & regres(insn,dec));
when "011" => return(tostf(insn.pc) & bb & "desfpr" & bl2 & regres(insn,dec));
when others => return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
end case;
else
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
end if;
end if;
when GFMKLD =>
if(gfmmul_en = false) then
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
else
return(tostf(insn.pc) & bb & "gfmkld" & bl2 & regres(insn,dec));
end if;
when GFMMUL =>
if(gfmmul_en = false) then
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
else
return(tostf(insn.pc) & bb & "gfmmul" & bl2 & regres(insn,dec));
end if;
when AESSB =>
if(aes_sbox_config = sbox) then
return(tostf(insn.pc) & bb & "aessb" & bl2 & regres(insn,dec));
elsif(aes_sbox_config = sbox4) then
return(tostf(insn.pc) & bb & "aessb4" & bl2 & regres(insn,dec));
else
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
end if;
when MMUL16 =>
if(idea_mmul_en = false) then
return(tostf(insn.pc) & bb & "unknown opcode: " & tost(insn.op));
else
return(tostf(insn.pc) & bb & "mmul16" & bl2 & regres(insn,dec));
end if;
-- END SRO (07/21/07)
when RDY =>
if rs1 /= "00000" then
return(tostf(insn.pc) & bb & "mov" & bl2 & "%asr" &
tost(rs1) & ", " & regdec(rd));
else
return(tostf(insn.pc) & bb & "mov" & bl2 & "%y, " & regdec(rd));
end if;
when RDPSR => return(tostf(insn.pc) & bb & "mov" & bl2 & "%psr, " & regdec(rd));
when RDWIM => return(tostf(insn.pc) & bb & "mov" & bl2 & "%wim, " & regdec(rd));
when RDTBR => return(tostf(insn.pc) & bb & "mov" & bl2 & "%tbr, " & regdec(rd));
when WRY =>
if (rs1 = "00000") or (rs2 = "00000") then
if rd /= "00000" then
return(tostf(insn.pc) & bb & "mov" & bl2
& regimm(insn,hex,false) & ", %asr" & tost(rd));
else
return(tostf(insn.pc) & bb & "mov" & bl2 & regimm(insn,hex,false) & ", %y");
end if;
else
if rd /= "00000" then
return(tostf(insn.pc) & bb & "wr " & bl2 & "%asr"
& regimm(insn,hex,false) & ", %asr" & tost(rd));
else
return(tostf(insn.pc) & bb & "wr " & bl2 & regimm(insn,hex,false) & ", %y");
end if;
end if;
when WRPSR =>
if (rs1 = "00000") or (rs2 = "00000") then
return(tostf(insn.pc) & bb & "mov" & bl2 & regimm(insn,hex,false) & ", %psr");
else
return(tostf(insn.pc) & bb & "wr " & bl2 & regimm(insn,hex,false) & ", %psr");
end if;
when WRWIM =>
if (rs1 = "00000") or (rs2 = "00000") then
return(tostf(insn.pc) & bb & "mov" & bl2 & regimm(insn,hex,false) & ", %wim");
else
return(tostf(insn.pc) & bb & "wr " & bl2 & regimm(insn,hex,false) & ", %wim");
end if;
when WRTBR =>
if (rs1 = "00000") or (rs2 = "00000") then
return(tostf(insn.pc) & bb & "mov" & bl2 & regimm(insn,hex,false) & ", %tbr");
else
return(tostf(insn.pc) & bb & "wr " & bl2 & regimm(insn,hex,false) & ", %tbr");
end if;
when JMPL =>
if (rd = "00000") then
if (i = '1') and (simm = "0000000001000") then
if (rs1 = "11111") then
return(tostf(insn.pc) & bb & "ret");
elsif (rs1 = "01111") then
return(tostf(insn.pc) & bb & "retl");
else
return(tostf(insn.pc) & bb & "jmp" & bl2 & regimm(insn,dec,true));
end if;
else
return(tostf(insn.pc) & bb & "jmp" & bl2 & regimm(insn,dec,true));
end if;
else
return(tostf(insn.pc) & bb & "jmpl" & bl2 & regres(insn,dec));
end if;
when TICC =>
return(tostf(insn.pc) & bb & 't' & branchop(insn) & bl2 & regimm(insn,hex,false));
when FLUSH =>
return(tostf(insn.pc) & bb & "flush" & bl2 & regimm(insn,hex,false));
when RETT =>
return(tostf(insn.pc) & bb & "rett" & bl2 & regimm(insn,dec,false));
when RESTORE =>
if (rd = "00000") then
return(tostf(insn.pc) & bb & "restore");
else
return(tostf(insn.pc) & bb & "restore" & bl2 & regres(insn,hex));
end if;
when SAVE =>
if (rd = "00000") then
return(tostf(insn.pc) & bb & "save");
else
return(tostf(insn.pc) & bb & "save" & bl2 & regres(insn,dec));
end if;
when FPOP1 =>
case opf is
when FITOS => return(tostf(insn.pc) & bb & "fitos" & bl2 & freg2(insn));
when FITOD => return(tostf(insn.pc) & bb & "fitod" & bl2 & freg2(insn));
when FSTOI => return(tostf(insn.pc) & bb & "fstoi" & bl2 & freg2(insn));
when FDTOI => return(tostf(insn.pc) & bb & "fdtoi" & bl2 & freg2(insn));
when FSTOD => return(tostf(insn.pc) & bb & "fstod" & bl2 & freg2(insn));
when FDTOS => return(tostf(insn.pc) & bb & "fdtos" & bl2 & freg2(insn));
when FMOVS => return(tostf(insn.pc) & bb & "fmovs" & bl2 & freg2(insn));
when FNEGS => return(tostf(insn.pc) & bb & "fnegs" & bl2 & freg2(insn));
when FABSS => return(tostf(insn.pc) & bb & "fabss" & bl2 & freg2(insn));
when FSQRTS => return(tostf(insn.pc) & bb & "fsqrts" & bl2 & freg2(insn));
when FSQRTD => return(tostf(insn.pc) & bb & "fsqrtd" & bl2 & freg2(insn));
when FADDS => return(tostf(insn.pc) & bb & "fadds" & bl2 & freg3(insn));
when FADDD => return(tostf(insn.pc) & bb & "faddd" & bl2 & freg3(insn));
when FSUBS => return(tostf(insn.pc) & bb & "fsubs" & bl2 & freg3(insn));
when FSUBD => return(tostf(insn.pc) & bb & "fsubd" & bl2 & freg3(insn));
when FMULS => return(tostf(insn.pc) & bb & "fmuls" & bl2 & freg3(insn));
when FMULD => return(tostf(insn.pc) & bb & "fmuld" & bl2 & freg3(insn));
when FSMULD => return(tostf(insn.pc) & bb & "fsmuld" & bl2 & freg3(insn));
when FDIVS => return(tostf(insn.pc) & bb & "fdivs" & bl2 & freg3(insn));
when FDIVD => return(tostf(insn.pc) & bb & "fdivd" & bl2 & freg3(insn));
when others => return(tostf(insn.pc) & bb & "unknown Fopcode: " & tost(insn.op));
end case;
when FPOP2 =>
case opf is
when FCMPS => return(tostf(insn.pc) & bb & "fcmps" & bl2 & fregc(insn));
when FCMPD => return(tostf(insn.pc) & bb & "fcmpd" & bl2 & fregc(insn));
when FCMPES => return(tostf(insn.pc) & bb & "fcmpes" & bl2 & fregc(insn));
when FCMPED => return(tostf(insn.pc) & bb & "fcmped" & bl2 & fregc(insn));
when others => return(tostf(insn.pc) & bb & "unknown Fopcode: " & tost(insn.op));
end case;
when CPOP1 =>
return(tostf(insn.pc) & bb & "cpop1" & bl2 & tost("000"&opf) & ", " &creg3(insn));
when CPOP2 =>
return(tostf(insn.pc) & bb & "cpop2" & bl2 & tost("000"&opf) & ", " &creg3(insn));
end case;
when LDST =>
case op3 is
when STC =>
return(tostf(insn.pc) & bb & "st" & bl2 & stparcp(insn, rd, dec));
when STF =>
return(tostf(insn.pc) & bb & "st" & bl2 & stparf(insn, rd, dec));
when ST =>
if rd = "00000" then
return(tostf(insn.pc) & bb & "clr" & bl2 & stparc(insn, rd, dec));
else
return(tostf(insn.pc) & bb & "st" & bl2 & stpar(insn, rd, dec));
end if;
when STB =>
if rd = "00000" then
return(tostf(insn.pc) & bb & "clrb" & bl2 & stparc(insn, rd, dec));
else
return(tostf(insn.pc) & bb & "stb" & bl2 & stpar(insn, rd, dec));
end if;
when STH =>
if rd = "00000" then
return(tostf(insn.pc) & bb & "clrh" & bl2 & stparc(insn, rd, dec));
else
return(tostf(insn.pc) & bb & "sth" & bl2 & stpar(insn, rd, dec));
end if;
when STDC =>
return(tostf(insn.pc) & bb & "std" & bl2 & stparcp(insn, rd, dec));
when STDF =>
return(tostf(insn.pc) & bb & "std" & bl2 & stparf(insn, rd, dec));
when STCSR =>
return(tostf(insn.pc) & bb & "st" & bl2 & "%csr, [" & regimm(insn,dec,true) & "]");
when STFSR =>
return(tostf(insn.pc) & bb & "st" & bl2 & "%fsr, [" & regimm(insn,dec,true) & "]");
when STDCQ =>
return(tostf(insn.pc) & bb & "std" & bl2 & "%cq, [" & regimm(insn,dec,true) & "]");
when STDFQ =>
return(tostf(insn.pc) & bb & "std" & bl2 & "%fq, [" & regimm(insn,dec,true) & "]");
when ISTD =>
return(tostf(insn.pc) & bb & "std" & bl2 & stpar(insn, rd, dec));
when STA =>
return(tostf(insn.pc) & bb & "sta" & bl2 & stpara(insn, rd, dec));
when STBA =>
return(tostf(insn.pc) & bb & "stba" & bl2 & stpara(insn, rd, dec));
when STHA =>
return(tostf(insn.pc) & bb & "stha" & bl2 & stpara(insn, rd, dec));
when STDA =>
return(tostf(insn.pc) & bb & "stda" & bl2 & stpara(insn, rd, dec));
when LDC =>
return(tostf(insn.pc) & bb & "ld" & bl2 & ldparcp(insn, rd, dec));
when LDF =>
return(tostf(insn.pc) & bb & "ld" & bl2 & ldparf(insn, rd, dec));
when LDCSR =>
return(tostf(insn.pc) & bb & "ld" & bl2 & "[" & regimm(insn,dec,true) & "]" & ", %csr");
when LDFSR =>
return(tostf(insn.pc) & bb & "ld" & bl2 & "[" & regimm(insn,dec,true) & "]" & ", %fsr");
when LD =>
return(tostf(insn.pc) & bb & "ld" & bl2 & ldpar(insn, rd, dec));
when LDUB =>
return(tostf(insn.pc) & bb & "ldub" & bl2 & ldpar(insn, rd, dec));
when LDUH =>
return(tostf(insn.pc) & bb & "lduh" & bl2 & ldpar(insn, rd, dec));
when LDDC =>
return(tostf(insn.pc) & bb & "ldd" & bl2 & ldparcp(insn, rd, dec));
when LDDF =>
return(tostf(insn.pc) & bb & "ldd" & bl2 & ldparf(insn, rd, dec));
when LDD =>
return(tostf(insn.pc) & bb & "ldd" & bl2 & ldpar(insn, rd, dec));
when LDSB =>
return(tostf(insn.pc) & bb & "ldsb" & bl2 & ldpar(insn, rd, dec));
when LDSH =>
return(tostf(insn.pc) & bb & "ldsh" & bl2 & ldpar(insn, rd, dec));
when LDSTUB =>
return(tostf(insn.pc) & bb & "ldstub" & bl2 & ldpar(insn, rd, dec));
when SWAP =>
return(tostf(insn.pc) & bb & "swap" & bl2 & ldpar(insn, rd, dec));
when LDA =>
return(tostf(insn.pc) & bb & "lda" & bl2 & ldpara(insn, rd, dec));
when LDUBA =>
return(tostf(insn.pc) & bb & "lduba" & bl2 & ldpara(insn, rd, dec));
when LDUHA =>
return(tostf(insn.pc) & bb & "lduha" & bl2 & ldpara(insn, rd, dec));
when LDDA =>
return(tostf(insn.pc) & bb & "ldda" & bl2 & ldpara(insn, rd, dec));
when LDSBA =>
return(tostf(insn.pc) & bb & "ldsba" & bl2 & ldpara(insn, rd, dec));
when LDSHA =>
return(tostf(insn.pc) & bb & "ldsha" & bl2 & ldpara(insn, rd, dec));
when LDSTUBA =>
return(tostf(insn.pc) & bb & "ldstuba" & bl2 & ldpara(insn, rd, dec));
when SWAPA =>
return(tostf(insn.pc) & bb & "swapa" & bl2 & ldpara(insn, rd, dec));
end case;
end case;
end if;
end;
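The field selection performed by disas() for the custom instructions can be mirrored in software when checking hand-assembled opcodes. The following C fragment is a minimal sketch under stated assumptions: the helper names bits(), deskdf_mnemonic(), and despmt_mnemonic() are hypothetical and do not appear in the LEON2 sources, and only the sub-decoding visible in the listing above (the i bit, simm(12), and bits 7 downto 5) is modeled. The op3 values assigned to DESKDF and DESPMT, and the enable generics, are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of a 32-bit instruction word. */
static uint32_t bits(uint32_t insn, unsigned hi, unsigned lo)
{
    uint64_t mask = (UINT64_C(1) << (hi - lo + 1u)) - 1u;
    return (uint32_t)((insn >> lo) & mask);
}

/* Mnemonic selected inside the shared DESKDF opcode slot:
 *   i = 0               -> deskey
 *   i = 1, simm(12) = 0 -> desdir
 *   i = 1, simm(12) = 1 -> desf
 */
static const char *deskdf_mnemonic(uint32_t insn)
{
    if (bits(insn, 13, 13) == 0) {
        return "deskey";
    }
    return bits(insn, 12, 12) == 0 ? "desdir" : "desf";
}

/* DES permutation variant selected by insn(7 downto 5) when i = 0.
 * Returns NULL for encodings the disassembler does not name. */
static const char *despmt_mnemonic(uint32_t insn)
{
    static const char *const names[4] = {
        "desipl", "desipr", "desfpl", "desfpr"
    };
    uint32_t sel = bits(insn, 7, 5);

    if (bits(insn, 13, 13) != 0 || sel > 3) {
        return NULL;
    }
    return names[sel];
}
```

For example, an instruction word with bit 13 set and bit 12 clear is reported as desdir, while a word with both bits set is reported as desf.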
IU execute stage signals (from iu.vhd, lines 103-132)
type execute_stage_type is record
write_cwp, write_icc, write_reg, write_y, rst_mey : std_logic;
cwp : std_logic_vector(NWINLOG2-1 downto 0); -- current window pointer
icc : std_logic_vector(3 downto 0); -- integer condition codes
alu_cin : std_logic; -- ALU carry-in
ymsb : std_logic; -- MULSCC Y(msb)
rs1data : std_logic_vector(31 downto 0); -- source operand 1
rs2data : std_logic_vector(31 downto 0); -- source operand 2
aluop : std_logic_vector(2 downto 0); -- Alu operation
alusel : std_logic_vector(1 downto 0); -- Alu result select
aluadd : std_logic; -- add/sub select
mulstep : std_logic; -- MULSCC
mulinsn : std_logic; -- SMUL/UMUL
ldbp1, ldbp2 : std_logic; -- load bypass enable
ctrl : pipeline_control_type;
result : std_logic_vector(31 downto 0); -- data forward from execute stage
micc : std_logic_vector(3 downto 0); -- icc for multiply insn
licc : std_logic_vector(3 downto 0); -- icc to me stage
-- BEGIN SRO (07/21/07)
-- these members indicate the respective instruction being in the EX stage
deskey_instr : std_logic;
despmt_instr : std_logic; -- flag for all 4 DES permute insns.
desdir_instr : std_logic;
desf_instr : std_logic;
gfmkld_instr : std_logic;
gfmmul_instr : std_logic;
aessb_instr : std_logic;
aessb4_instr : std_logic;
mmul16_instr : std_logic;
-- END SRO (07/21/07)
end record;
IU component declarations (from iu.vhd, lines 269-324)
-- BEGIN SRO (07/31/07)
component des_pmt
port (
pmt_type : in std_logic;
out_half : in std_logic;
in1, in2 : in std_logic_vector(1 to 32);
);
end component;
component des_fcore
port (
din : in std_logic_vector(1 to 32);
rndkey : in std_logic_vector(1 to 48);
);
end component;
component des_keygen is
port (
clk : in std_logic;
setdir : in std_logic;
load : in std_logic;
dirin : in std_logic;
advin : in std_logic;
keyl : in std_logic_vector(1 to 32);
keyr : in std_logic_vector(1 to 32);
rndkey : out std_logic_vector(1 to 48)
);
end component;
component idea_modmul is
port(
clk : in std_logic;
x, y : in std_logic_vector(15 downto 0);
p : out std_logic_vector(15 downto 0)
);
end component;
component aes_sbox is
port (
dir : in std_logic;
a : in std_logic_vector(7 downto 0);
b : out std_logic_vector(7 downto 0)
);
end component;
component gf_k_mult is
port (
rst, enable : IN std_logic;
clk : IN std_logic;
b : OUT std_logic_vector (31 DOWNTO 0));
end component;
-- END SRO (07/31/07)
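The idea_modmul component takes two 16-bit operands and returns a 16-bit product. Its internals are not shown in this listing, so the following C fragment is only a reference model, assuming the component implements the standard IDEA multiplication modulo 2^16 + 1 in which the all-zero operand represents 2^16. The function name idea_mul() is hypothetical, and the clocked behavior of the hardware is not modeled.

```c
#include <stdint.h>

/* Software reference model of IDEA multiplication modulo 2^16 + 1.
 * A 16-bit value of 0 is interpreted as 2^16 on input, and a result
 * of 2^16 is returned as 0. */
static uint16_t idea_mul(uint16_t x, uint16_t y)
{
    uint32_t a = x ? x : 0x10000u;              /* 0 stands for 2^16 */
    uint32_t b = y ? y : 0x10000u;
    uint64_t p = ((uint64_t)a * b) % 0x10001u;  /* 0x10001 = 2^16 + 1 */

    /* p lies in 1..2^16 because 2^16 + 1 is prime and a, b are nonzero;
     * truncation to 16 bits maps 2^16 back to 0. */
    return (uint16_t)p;
}
```

A model of this kind can be used to generate expected values for a testbench: idea_mul(0, 0) yields 1, since 2^16 is congruent to -1 modulo 2^16 + 1.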
IU decode stage process (from iu.vhd, lines 449-1637)
-------------------------------------------------------------------------------
-- Instruction decode stage
-------------------------------------------------------------------------------
decode_stage : process(rst, fe, de, ex, me, mein, wrin, wr, sregs, ico, rfo,
sregsin, fpo, cpo, dco, holdn, fpu_reg, mulo, divo, tr,
iui, dsur)
variable rfenable1, rfenable2 : std_logic; -- regfile enable strobes
variable cond : std_logic_vector(3 downto 0);
variable rs1, rs2, rd : std_logic_vector(4 downto 0);
variable write_cwp, write_icc, write_reg, write_y : std_logic;
variable cnt : std_logic_vector(1 downto 0); -- cycle number
variable cwp_new : std_logic_vector(NWINLOG2-1 downto 0);
variable icc, br_icc : std_logic_vector(3 downto 0);
variable alu_cin : std_logic;
variable immediate_data : std_logic_vector(31 downto 0);
variable n, z, v, c : std_logic; -- temporary condition codes
variable i : std_logic; -- immediate data bit
variable su : std_logic; -- local supervisor bit;
variable et : std_logic; -- local enable trap bit
variable inull, annul, annul_current : std_logic;
variable branch, annul_next, bres, branch_true: std_logic;
variable aluop : std_logic_vector(2 downto 0);
variable alusel : std_logic_vector(1 downto 0);
variable aluadd : std_logic;
variable mulstep : std_logic;
variable mulinsn : std_logic;
variable y0 : std_logic;
variable branch_address : std_logic_vector(31 downto PCLOW);
variable rs1data, rs2data : std_logic_vector(31 downto 0);
variable operand2_select : std_logic;
variable read_addr1, read_addr2, chkrd : std_logic_vector(RABITS-1 downto 0);
variable hold_pc : std_logic; -- Hold PC during multi-cycle ops
variable pv : std_logic; -- PC valid
variable ldlock, ldcheck1, ldcheck2, ldcheck3 : std_logic; -- load interlock
variable ldchkex, ldchkme : std_logic; -- load interlock for ex and me
variable illegal_inst : std_logic; -- illegal instruction
variable privileged_inst : std_logic; -- privileged instruction trap
variable cp_disabled : std_logic; -- CP disable trap
variable fp_disabled : std_logic; -- FP disable trap
variable watchpoint_exc : std_logic; -- watchpoint trap
variable winovf_exception : std_logic; -- window overflow trap
variable winunf_exception : std_logic; -- window underflow trap
variable ticc_exception : std_logic; -- TICC trap
variable fp_exception : std_logic; -- STDFQ trap
variable ctrl : pipeline_control_type;
variable ldbp1, ldbp2 : std_logic; -- load bypass enable
variable mulcnt : std_logic_vector(4 downto 0); -- multiply cycle number
variable ymsb : std_logic; -- next msb of Y during MUL
variable rst_mey : std_logic; -- reset me stage Y register
variable fpld, fpst, fpop : std_logic; -- FPU instructions
variable fpmov : std_logic; -- FPU instructions
variable fbres, fbranch_true : std_logic; -- FBCC branch result
variable cbres, cbranch_true : std_logic; -- CBCC branch result
variable fcc : std_logic_vector(1 downto 0); -- FPU condition codes
variable ccc : std_logic_vector(1 downto 0); -- CP condition codes
variable bicc_hold, icc_check : std_logic;
variable fsr_ld, fsr_ld_check, fsr_check, fsr_lock : std_logic;
variable fpexin : fpu_ctrl1_type;
variable rs1mod : std_logic;
variable step : std_logic;
variable divstart,mulstart : std_logic; -- start multiply or divide
variable cpldlock, fpldlock, annul_current_cp : std_logic;
-- BEGIN SRO (07/21/07)
-- variables for custom insn flags to be set for EX stage
variable deskey_instr : std_logic;
variable despmt_instr : std_logic;
variable desdir_instr : std_logic;
variable desf_instr : std_logic;
variable gfmkld_instr : std_logic;
variable gfmmul_instr : std_logic;
variable aessb_instr : std_logic;
variable aessb4_instr : std_logic;
variable mmul16_instr : std_logic;
-- END SRO (07/21/07)
variable dk_advin_var : std_logic; -- SRO (07/25/07)
constant RDOPT : boolean := FASTDECODE;-- optimise dest reg address generation
constant RS1OPT : boolean := FASTDECODE;-- optimise src1 reg address generation
function regdec(cwp, regin : std_logic_vector; fp : std_logic)
return std_logic_vector is
variable reg : std_logic_vector(4 downto 0);
variable ra : std_logic_vector(RABITS -1 downto 0);
begin
reg := regin; ra(4 downto 0) := reg;
if (FPIFTYPE = serial) and (fp = '1') then
ra(RABITS -1 downto 5) := F0ADDR(RABITS-5 downto 1);
elsif reg(4 downto 3) = "00" then ra(RABITS -1 downto 4) := R0ADDR;
else
-- pragma translate_off
if not (is_x(cwp & ra(4))) then
-- pragma translate_on
ra(NWINLOG2+3 downto 4) := (cwp + ra(4));
if CWPOPT then ra(RABITS-1) := '0';
elsif ra(RABITS-1 downto 4) = R0ADDR then
ra(RABITS-1 downto 4) := (others => '0');
end if;
else
ra := (others => '0');
end if;
end if;
return(ra);
end;
begin
-- instruction bit-field decoding
op := de.inst(31 downto 30);
op2 := de.inst(24 downto 22);
op3 := de.inst(24 downto 19);
opf := de.inst(13 downto 5);
cond := de.inst(28 downto 25);
annul := de.inst(29);
rs1 := de.inst(18 downto 14);
rs2 := de.inst(4 downto 0);
rd := de.inst(29 downto 25);
i := de.inst(13);
-- common initialisation
ctrl.annul := de.annul; ctrl.cnt := de.cnt; ctrl.pv := de.pv; pv := '1';
cnt := "00"; ctrl.tt := "000000"; ctrl.ld := '0'; ctrl.rett := '0';
ctrl.pc := de.pc; ctrl.inst := de.inst; mulcnt := de.mulcnt;
write_y := '0'; fpld := '0'; fpst := '0'; fpop := '0';
fp_exception := '0'; fpmov := '0'; step := '0';
fpexin.fpop := "00"; fpexin.dsz := '0'; fpexin.ldfsr := '0';
winovf_exception := '0'; winunf_exception := '0';
write_cwp := '0'; cwp_new := de.cwp; rs1mod := '0';
rfenable1 := '0'; rfenable2 := '0';
-- BEGIN SRO (07/21/07)
deskey_instr := '0';
despmt_instr := '0';
desdir_instr := '0';
desf_instr := '0';
gfmkld_instr := '0';
gfmmul_instr := '0';
aessb_instr := '0';
aessb4_instr := '0';
mmul16_instr := '0';
-- END SRO (07/21/07)
dk_advin_var := '0'; -- SRO (07/25/07)
-- detect RETT instruction in the pipeline and set the local psr.su and psr.et
if ((ex.ctrl.rett and not ex.ctrl.annul) or (me.ctrl.rett and not me.ctrl.annul) or
(wr.ctrl.rett and not wr.ctrl.annul)) = '1'
then
su := sregs.ps; et := '1';
else
su := sregs.s; et := sregs.et;
end if;
-- Check for illegal and privileged instructions
illegal_inst := '0'; privileged_inst := '0'; cp_disabled := '0';
fp_disabled := '0';
case op is
when CALL => null;
when FMT2 =>
case op2 is
when SETHI | BICC => null;
when FBFCC =>
if FPEN then fp_disabled := not sregs.ef;
else fp_disabled := '1'; end if;
when CBCCC =>
if (not CPEN) or (sregs.ec = '0') then cp_disabled := '1'; end if;
when others => illegal_inst := '1';
end case;
when FMT3 =>
case op3 is
when IAND | ANDCC | ANDN | ANDNCC | IOR | ORCC | ORN | ORNCC | IXOR |
XORCC | IXNOR | XNORCC | ISLL | ISRL | ISRA | MULSCC | IADD | ADDX |
ADDCC | ADDXCC | TADDCC | TADDCCTV | ISUB | SUBX | SUBCC | SUBXCC |
TSUBCC | TSUBCCTV | FLUSH | JMPL | TICC | SAVE | RESTORE | RDY => null;
-- BEGIN SRO (07/31/07)
when DESKDF =>
if(des_key_f_en = false) then illegal_inst := '1'; end if;
when DESPMT =>
if(des_pmt_en = false or i = '1') then illegal_inst := '1'; end if;
when GFMKLD =>
if(gfmmul_en = false or i = '1') then illegal_inst := '1'; end if;
when AESSB =>
if(aes_sbox_config = none or i = '0') then illegal_inst := '1'; end if;
when GFMMUL =>
if(gfmmul_en = false or i = '1') then illegal_inst := '1'; end if;
when MMUL16 =>
if(idea_mmul_en = false or i = '1') then illegal_inst := '1'; end if;
-- END SRO (07/31/07)
when UMAC | SMAC =>
if not MACEN then illegal_inst := '1'; end if;
when UMUL | SMUL | UMULCC | SMULCC =>
if MULTIPLIER = none then illegal_inst := '1'; end if;
when UDIV | SDIV | UDIVCC | SDIVCC =>
if DIVIDER = none then illegal_inst := '1'; end if;
when RETT => illegal_inst := et; privileged_inst := not su;
when RDPSR | RDTBR | RDWIM => privileged_inst := not su;
when WRY =>
if not ((rd = "00000") or ((rd = "10010") and MACEN) or
((rd(4 downto 3) = "11") and (WATCHPOINTS > 0)))
then
illegal_inst := '1';
end if;
when WRPSR =>
privileged_inst := not su;
when WRWIM | WRTBR => privileged_inst := not su;
when FPOP1 | FPOP2 =>
if FPEN then fp_disabled := not sregs.ef; fpop := '1';
else fp_disabled := '1'; fpop := '0'; end if;
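when CPOP1 | CPOP2 =>
if (not CPEN) or (sregs.ec = '0') then cp_disabled := '1'; end if;
when others => illegal_inst := '1';
end case;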
when others =>-- LDST
case op3 is
when LDD | ISTD => illegal_inst := rd(0); -- trap if odd destination register
when LD | LDUB | LDSTUB | LDUH | LDSB | LDSH | ST | STB | STH | SWAP =>
null;
when LDDA | STDA =>
illegal_inst := i or rd(0); privileged_inst := not su;
when LDA | LDUBA| LDSTUBA | LDUHA | LDSBA | LDSHA | STA | STBA | STHA |
SWAPA =>
illegal_inst := i; privileged_inst := not su;
when LDDF | STDF | LDF | LDFSR | STF | STFSR =>
if FPEN then fp_disabled := not sregs.ef;
else fp_disabled := '1'; end if;
when STDFQ =>
if (not FPEN) or (sregs.ef = '0') then fp_disabled := '1'; end if;
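when STDCQ =>
if (not CPEN) or (sregs.ec = '0') then cp_disabled := '1'; end if;
when LDC | LDCSR | LDDC | STC | STCSR | STDC =>
if (not CPEN) or (sregs.ec = '0') then cp_disabled := '1'; end if;
when others => illegal_inst := '1';
end case;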
end case;
-- branch address adder
branch_address := (others => '0');
if op = CALL then branch_address(31 downto 2) := de.inst(29 downto 0);
else branch_address(31 downto 2) := de.inst(21) & de.inst(21) & de.inst(21) &
de.inst(21) & de.inst(21) & de.inst(21) & de.inst(21) &
de.inst(21) & de.inst(21 downto 0);
end if;
if not (is_x(branch_address) or is_x(de.pc)) then
branch_address := branch_address + de.pc; -- address adder (branch)
else
branch_address := (others => 'X');
end if;
fecomb.branch_address <= branch_address;
-- ICC pipeline and forwarding
if (me.write_icc and not me.ctrl.annul) = '1' then icc := wrin.icc;
elsif (wr.write_icc and not wr.ctrl.annul) = '1' then icc := wr.icc;
else icc := sregs.icc; end if;
br_icc := icc;
if ((ex.write_icc and not ex.ctrl.annul) = '1') then
icc := ex.icc;
if not ICC_HOLD then br_icc := icc; end if;
end if;
write_icc := '0'; alu_cin := '0';
case op is
when FMT3 =>
case op3 is
when SUBCC | TSUBCC | TSUBCCTV =>
write_icc := '1';
when ADDCC | ANDCC | ORCC | XORCC | ANDNCC | ORNCC | XNORCC | MULSCC |
TADDCC | TADDCCTV =>
write_icc := '1';
when UMULCC | SMULCC =>
if MULTIPLIER = iterative then
if de.cnt /= "11" then write_icc := '1'; end if;
end if;
if MULTIPLIER = m32x32 then write_icc := '1'; end if;
when ADDX | SUBX =>
alu_cin := icc(0);
when ADDXCC | SUBXCC =>
write_icc := '1'; alu_cin := icc(0);
when others => null;
end case;
when others => null;
end case;
exin.write_icc <= write_icc; exin.alu_cin <= alu_cin;
-- BICC/TICC evaluation
n := br_icc(3); z := br_icc(2); v := br_icc(1); c := br_icc(0);
case cond(2 downto 0) is
when "000" => bres := '0'; -- bn, ba
when "001" => bres := z; -- be, bne
when "010" => bres := z or (n xor v); -- ble, bg
when "011" => bres := n xor v; -- bl, bge
when "100" => bres := c or z; -- bleu, bgu
when "101" => bres := c; -- bcs, bcc
when "110" => bres := n; -- bneg, bpos
when others => bres := v; -- bvs, bvc
end case;
branch_true := cond(3) xor bres;
-- FBFCC evaluation
if FPEN then
if FPIFTYPE = serial then
if (fpu_reg.me.fpop = "10") and (me.ctrl.annul = '0') then
fcc := fpu_reg.me.fcc;
elsif (fpu_reg.wr.fpop = "10") and (wr.ctrl.annul = '0') then
fcc := fpu_reg.wr.fcc;
else fcc := fpu_reg.fsr.fcc; end if;
else fcc := fpo.cc; end if;
case cond(2 downto 0) is
when "000" => fbres := '0'; -- fba, fbn
when "001" => fbres := fcc(1) or fcc(0);
when "010" => fbres := fcc(1) xor fcc(0);
when "011" => fbres := fcc(0);
when "100" => fbres := (not fcc(1)) and fcc(0);
when "101" => fbres := fcc(1);
when "110" => fbres := fcc(1) and not fcc(0);
when others => fbres := fcc(1) and fcc(0);
end case;
fbranch_true := cond(3) xor fbres;
-- decode some FPU instruction types
case opf is
when FMOVS | FABSS | FNEGS => fpmov := '1';
when FITOD | FSTOD | FSQRTD | FADDD | FSUBD | FMULD | FDIVD =>
fpexin.dsz := '1';
when others => null;
end case;
end if;
-- CBCCC evaluation
if CPEN then
ccc := cpo.cc;
case cond(2 downto 0) is
when "000" => cbres := '0';
when "001" => cbres := ccc(1) or ccc(0);
when "010" => cbres := ccc(1) xor ccc(0);
when "011" => cbres := ccc(0);
when "100" => cbres := (not ccc(1)) and ccc(0);
when "101" => cbres := ccc(1);
when "110" => cbres := ccc(1) and not ccc(0);
when others => cbres := ccc(1) and ccc(0);
end case;
cbranch_true := cond(3) xor cbres;
end if;
-- Alu operation generation
aluop := ALU_NOP; alusel := ALU_RES_MISC; aluadd := '1';
mulstep := '0'; mulinsn := '0';
case op is
when CALL => alusel := ALU_RES_ADD;
when FMT2 =>
case op2 is
when SETHI => aluop := ALU_PASS2;
when others =>
end case;
when FMT3 =>
case op3 is
when IADD | ADDX | ADDCC | ADDXCC | TADDCC | TADDCCTV | SAVE | RESTORE |
TICC | JMPL | RETT => alusel := ALU_RES_ADD;
when ISUB | SUBX | SUBCC | SUBXCC | TSUBCC | TSUBCCTV =>
alusel := ALU_RES_ADD; aluadd := '0';
when MULSCC => alusel := ALU_RES_ADD; mulstep := '1';
when UMUL | UMULCC =>
if MULTIPLIER = iterative then
case de.cnt is
when "00" => aluop := ALU_XOR; alusel := ALU_RES_MISC;
when "01" | "10" => alusel := ALU_RES_ADD; mulinsn := '1';
when others => alusel := ALU_RES_ADD;
end case;
end if;
if MULTIPLIER > iterative then mulinsn := '1'; end if;
when SMUL | SMULCC =>
if MULTIPLIER = iterative then
case de.cnt is
when "00" => aluop := ALU_XOR; alusel := ALU_RES_MISC;
when "01" | "10" => alusel := ALU_RES_ADD; mulinsn := '1';
when others => alusel := ALU_RES_ADD; aluadd := '0';
end case;
end if;
if MULTIPLIER > iterative then mulinsn := '1'; end if;
when UMAC | SMAC =>
if MACEN then mulinsn := '1'; end if;
when UDIV | UDIVCC | SDIV | SDIVCC =>
if DIVIDER /= none then aluop := ALU_DIV; alusel := ALU_RES_LOGIC; end if;
when IAND | ANDCC => aluop := ALU_AND; alusel := ALU_RES_LOGIC;
when ANDN | ANDNCC => aluop := ALU_ANDN; alusel := ALU_RES_LOGIC;
when IOR | ORCC => aluop := ALU_OR; alusel := ALU_RES_LOGIC;
when ORN | ORNCC => aluop := ALU_ORN; alusel := ALU_RES_LOGIC;
when IXNOR | XNORCC => aluop := ALU_XNOR; alusel := ALU_RES_LOGIC;
when XORCC | IXOR | WRPSR | WRWIM | WRTBR | WRY =>
aluop := ALU_XOR; alusel := ALU_RES_LOGIC;
when RDPSR | RDTBR | RDWIM => aluop := ALU_PASS2;
when RDY => aluop := ALU_RDY;
when ISLL => aluop := ALU_SLL; alusel := ALU_RES_SHIFT;
when ISRL => aluop := ALU_SRL; alusel := ALU_RES_SHIFT;
when ISRA => aluop := ALU_SRA; alusel := ALU_RES_SHIFT;
when FPOP1 | FPOP2 =>
if ((FPIFTYPE = serial) and FPEN) then
if de.cnt /= "00" then
if opf(1) = '1' then rs1(0) := '1'; rs2(0) := '1'; end if;
if fpexin.dsz = '1' then rd(0) := '1'; end if;
end if;
if op3 = FPOP1 then fpexin.fpop := "01"; else fpexin.fpop := "10"; end if;
if fpmov = '1' then aluop := ALU_FOP; fpexin.fpop := "11";
else aluop := ALU_PASS2; end if;
end if;
-- BEGIN SRO (07/21/07)
when DESKDF =>
if(des_key_f_en = true) then
if (i = '0') then deskey_instr := '1';
else
if(de.inst(12) = '0') then desdir_instr := '1';
else
desf_instr := '1';
dk_advin_var := '1'; -- SRO (07/25/07)
end if;
end if;
end if;
when DESPMT =>
if(des_pmt_en = true and i = '0') then despmt_instr := '1'; end if;
when GFMKLD =>
if(gfmmul_en = true and i = '0') then gfmkld_instr := '1'; end if;
when GFMMUL =>
136
if(gfmmul_en = true and i = 0) then gfmmul_instr := 1; end if;
when AESSB =>
if(aes_sbox_config = sbox and i = 1) then aessb_instr := 1;
elsif(aes_sbox_config = sbox4 and i = 1) then aessb4_instr := 1;
end if;
when MMUL16 =>
if(idea_mmul_en = true and i = 0) then mmul16_instr := 1; end if;
-- END SRO (07/21/07)
when others =>
end case;
case de.cnt is
when "00" =>
alusel := ALU_RES_ADD;
if FPEN then fpld := (op3(5) and not op3(2));
else fpld := 0; end if;
when "01" =>
if (op3(2) and not op3(3)) = 1 then -- ST
rs1 := rd; rs1mod := 1;
end if;
case op3 is
when LDD | LDDA | LDDC =>
rd(0) := 1; alusel := ALU_RES_ADD;
when LDDF =>
rd(0) := 1; alusel := ALU_RES_ADD;
if FPEN then fpld := 1; end if;
when STFSR => if ((FPIFTYPE = serial) and FPEN) then aluop := ALU_FSR; end if;
when SWAP | SWAPA | LDSTUB | LDSTUBA =>
alusel := ALU_RES_ADD;
when STF | STDF =>
aluop := ALU_PASS1; fpst := 1;
end if;
when others =>
aluop := ALU_PASS1;
if op3(2) = 1 then -- ST
if op3(1 downto 0) = "01" then -- store byte
aluop := ALU_STB;
elsif op3(1 downto 0) = "10" then -- store halfword
aluop := ALU_STH;
end if;
end if;
end case;
when "10" =>
aluop := ALU_PASS1;
rs1 := rd; rs1mod := 1;
if op3(2) = 1 then -- ST
if (op3(3) and not op3(1))= 1 then aluop := ALU_ONES; -- LDSTUB/A
elsif op3(3 downto 0) = "0111" then
rs1(0) := 1; -- STD/F/A
if ((FPIFTYPE = serial) and FPEN) and (op3(5) = 1) then fpst := 1; end if;
end if;
end if;
when others =>
end case;
end case;
exin.aluop <= aluop; exin.alusel <= alusel; exin.aluadd <= aluadd;
exin.mulstep <= mulstep; exin.mulinsn <= mulinsn;
-- Alu operand select
operand2_select := ALU_SIMM;
case op is
when FMT2 =>
case op2 is
when SETHI => operand2_select := ALU_SIMM;
when others => operand2_select := ALU_RS2;
end case;
when FMT3 =>
case op3 is
when RDWIM | RDPSR | RDTBR => operand2_select := ALU_SIMM;
when FPOP1 | FPOP2 => if ((FPIFTYPE = serial) and FPEN) then operand2_select := ALU_RS2; end if;
when others =>
if (de.inst(13) = 1) then operand2_select := ALU_SIMM;
else operand2_select := ALU_RS2; end if;
end case;
when LDST =>
if (de.inst(13) = 1) then operand2_select := ALU_SIMM;
else operand2_select := ALU_RS2; end if;
when others => operand2_select := ALU_RS2;
end case;
-- CWP generation, pipelining and forwarding
-- Also check for window underflow/overflow conditions
if (op = FMT3) and ((op3 = RETT) or (op3 = RESTORE) or (op3 = SAVE)) then
write_cwp := 1;
if (op3 = SAVE) then
if not is_x(de.cwp) then
if (not CWPOPT) and (de.cwp = CWPMIN) then cwp_new := CWPMAX;
else cwp_new := de.cwp - 1 ; end if;
else
cwp_new := (others => X);
end if;
else
if not is_x(de.cwp) then
if (not CWPOPT) and (de.cwp = CWPMAX) then cwp_new := CWPMIN;
else cwp_new := de.cwp + 1; end if;
else
cwp_new := (others => X);
end if;
end if;
if sregs.wim(conv_integer(0 & cwp_new)) = 1 then
if op3 = SAVE then winovf_exception := 1;
else winunf_exception := 1; end if;
end if;
end if;
exin.write_cwp <= write_cwp;
exin.cwp <= cwp_new;
-- Immediate data generation
immediate_data := (others => 0);
case op is
when FMT2 =>
immediate_data := de.inst(21 downto 0) & "0000000000";
when FMT3 =>
case op3 is
when RDPSR => immediate_data(31 downto 5) := std_logic_vector(IMPL) &
std_logic_vector(VER) & icc & "000000" & sregs.ec & sregs.ef &
sregs.pil & su & sregs.ps & et;
immediate_data(NWINLOG2-1 downto 0) := de.cwp;
when RDTBR => immediate_data(31 downto 4) := sregs.tba & sregs.tt;
when RDWIM => immediate_data(NWINDOWS-1 downto 0) := sregs.wim;
when others =>
immediate_data := de.inst(12) & de.inst(12) & de.inst(12) & de.inst(12) &
de.inst(12) & de.inst(12) & de.inst(12) & de.inst(12) & de.inst(12) &
de.inst(12 downto 0);
end case;
immediate_data := de.inst(12) & de.inst(12) & de.inst(12) & de.inst(12) &
de.inst(12 downto 0);
end case;
-- register read address generation
if RS1OPT then
if rs1mod = 1 then
read_addr1 := regdec(de.cwp, de.inst(29 downto 26) & rs1(0), (fpst or fpop));
else
read_addr1 := regdec(de.cwp, de.inst(18 downto 15) & rs1(0), (fpst or fpop));
end if;
else
read_addr1 := regdec(de.cwp, rs1, (fpst or fpop));
end if;
read_addr2 := regdec(de.cwp, rs2, fpop);
-- register write address generation
write_reg := 0; fsr_ld := 0;
case op is
when CALL =>
write_reg := 1; rd := "01111"; -- CALL saves PC in r[15] (%o7)
when FMT2 =>
if (op2 = SETHI) then write_reg := 1; end if;
when FMT3 =>
case op3 is
if MULTIPLIER = none then write_reg := 1; end if;
if MULTIPLIER = m32x32 then write_reg := 1; end if;
if de.cnt = "10" then write_reg := 1; end if;
end if;
if DIVIDER /= none then write_reg := 0; else write_reg := 1; end if;
when RETT | WRPSR | WRY | WRWIM | WRTBR | TICC | FLUSH => null;
when FPOP1 | FPOP2 => null;
when CPOP1 | CPOP2 => null;
-- BEGIN SRO (07/21/07)
when DESKDF =>
if(des_key_f_en = true) then
if(i = '0') then
write_reg := '0';
else
if(de.inst(12) = '0') then
write_reg := '0';
else
write_reg := '1';
end if;
end if;
end if;
when DESPMT => if(des_pmt_en = true) then write_reg := '1'; end if;
when GFMMUL => if(gfmmul_en = true) then write_reg := '1'; end if;
when AESSB => if(aes_sbox_config /= none) then write_reg := '1'; end if;
when MMUL16 =>
if(idea_mmul_en = true) then
end if;
-- END SRO (07/21/07)
when others => write_reg := 1;
end case;
when LDST =>
ctrl.ld := not op3(2);
if (op3(2) = 0) and
not ((CPEN or (FPIFTYPE = parallel)) and (op3(5) = 1))
then write_reg := 1; end if;
case op3 is
when SWAP | SWAPA | LDSTUB | LDSTUBA =>
when LDFSR => if ((FPIFTYPE = serial) and FPEN) then write_reg := 0; fsr_ld := 1; end if;
end case;
end case;
if (rd = "00000") and not (((FPIFTYPE = serial) and FPEN) and (fpld = 1)) then
write_reg := 0;
end if;
ctrl.rd := regdec(cwp_new, rd, (fpld or fpop));
if RDOPT then chkrd := regdec(de.cwp, rd, (fpld or fpop));
else chkrd := ctrl.rd; end if;
-- LD/BICC/TICC delay interlock generation
ldcheck1 := 0; ldcheck2 := 0; ldcheck3 := 0; ldlock := 0;
ldchkex := 1; ldchkme := 1; bicc_hold := 0; icc_check := 0;
fsr_check := 0; fsr_ld_check := 0; fsr_lock := 0;
if (de.annul = 0) then
case op is
when FMT2 =>
if (op2 = BICC) and (cond(2 downto 0) /= "000") then
icc_check := 1;
end if;
when FMT3 =>
ldcheck1 := 1; ldcheck2 := not i;
case op3 is
when TICC =>
if (cond(2 downto 0) /= "000") then icc_check := 1; end if;
when RDY | RDWIM | RDTBR =>
ldcheck1 := 0; ldcheck2 := 0;
when RDPSR =>
if MULTIPLIER = m32x32 then icc_check := 1; end if;
when ADDX | ADDXCC | SUBX | SUBXCC =>
if MULTIPLIER = m32x32 then icc_check := 1; end if;
if (de.cnt = "00") then ldcheck1 := 1; end if;
if (de.cnt = "01") then ldcheck2 := not i; end if;
end if;
case opf is
when FITOS | FITOD | FSTOI | FDTOI | FSTOD | FDTOS | FMOVS |
FNEGS | FABSS | FSQRTS | FSQRTD =>
ldcheck2 := 1;
when others => ldcheck1 := 1; ldcheck2 := 1;
end case;
if de.cnt /= "00" then ldchkex := 0; end if;
fsr_ld_check := 1;
end if;
when others =>
end case;
when LDST =>
ldcheck1 := 1; ldchkex := 0;
case de.cnt is
when "00" => -- check store data dependency if 2-cycle load delay
if (LDDELAY = 2) and (op3(2) = 1) and not (((FPIFTYPE = serial) and FPEN) and
(op3 = STFSR))
then ldcheck3 := 1; end if;
ldcheck2 := not i; ldchkex := 1;
when "01" => ldcheck2 := not i;
when others => ldchkme := 0;
end case;
if ((FPIFTYPE = serial) and FPEN) and ((op3 = LDFSR) or (op3 = STFSR)) then
fsr_check := 1;
if (op3 = STFSR) then fsr_ld_check := 1; end if;
end if;
end case;
end if;
-- MAC has two-cycle latency, check for data dependencies
if MACEN then
if ((ex.mulinsn and ex.ctrl.inst(24) and ldchkex and not ex.ctrl.annul) = 1) and
(((ldcheck1 = 1) and (ex.ctrl.rd = read_addr1)) or
((ldcheck2 = 1) and (ex.ctrl.rd = read_addr2)) or
((ldcheck3 = 1) and (ex.ctrl.rd = chkrd)))
then ldlock := 1; end if;
end if;
if MACEN or (MULTIPLIER = m32x32) then
bicc_hold := icc_check and ex.write_icc and ex.mulinsn and not ex.ctrl.annul;
end if;
if ICC_HOLD then
bicc_hold := bicc_hold or (icc_check and ex.write_icc and not ex.ctrl.annul);
end if;
if ((ex.ctrl.ld and ex.write_reg and ldchkex and not ex.ctrl.annul) = '1') and
(((ldcheck1 = '1') and (ex.ctrl.rd = read_addr1)) or
((ldcheck2 = '1') and (ex.ctrl.rd = read_addr2)) or
((ldcheck3 = '1') and (ex.ctrl.rd = chkrd)))
then ldlock := '1'; end if;
if ((me.ctrl.ld and me.write_reg and ldchkme and not me.ctrl.annul) = '1') and
((LDDELAY = 2) or ((fsr_ld_check and not fsr_check) = '1')) and
(((ldcheck1 = '1') and (me.ctrl.rd = read_addr1)) or
((ldcheck2 = '1') and (me.ctrl.rd = read_addr2)))
then ldlock := '1'; end if;
if ((FPIFTYPE = serial) and FPEN) then
if (fsr_check = '1') then
fsr_lock := ((xorv(fpu_reg.ex.fpop) and not ex.ctrl.annul) or
(xorv(fpu_reg.me.fpop) and not me.ctrl.annul) or
(xorv(fpu_reg.wr.fpop) and not wr.ctrl.annul));
end if;
if fsr_ld_check = 1 then
fsr_lock := fsr_lock or (fpu_reg.ex.ldfsr and not ex.ctrl.annul)
or (fpu_reg.me.ldfsr and not me.ctrl.annul)
or (fpu_reg.wr.ldfsr and not wr.ctrl.annul);
end if;
end if;
ldlock := ldlock or bicc_hold or fsr_lock;
cpldlock := ldlock; fpldlock := ldlock;
if CPEN then
if FPIFTYPE = parallel then cpldlock := cpldlock or fpo.ldlock; end if;
ldlock := ldlock or cpo.ldlock;
end if;
if FPIFTYPE = parallel then
if CPEN then fpldlock := fpldlock or cpo.ldlock; end if;
ldlock := ldlock or fpo.ldlock;
end if;
-- data forwarding detection. Forward data if destination and source
-- registers are equal and destination register will be written.
ldbp1 := 0;
if (rs1 = "00000") and not (((FPIFTYPE = serial) and FPEN) and ((fpop or fpst) = 1)) then
rs1data := (others => 0);
elsif ldcheck1 = 1 then
if ((ex.write_reg and ldchkex and not ex.ctrl.annul) = 1) and
(read_addr1 = ex.ctrl.rd)
then
rs1data := ex.result;
else
if ((me.write_reg and ldchkme and not me.ctrl.annul) = 1) and (read_addr1 = me.ctrl.rd) then
rs1data := mein.bpresult;
if LDDELAY = 1 then ldbp1 := me.ctrl.ld; end if;
elsif ((wr.write_reg and not wr.ctrl.annul) = 1) and (read_addr1 = wr.ctrl.rd) then
rs1data := wr.result;
else rfenable1 := 1; rs1data := rfo.data1(31 downto 0); end if;
end if;
else rs1data := rfo.data1(31 downto 0); end if;
ldbp2 := 0;
if (operand2_select = ALU_SIMM) then
rs2data := immediate_data;
elsif (rs2 = "00000") and not (((FPIFTYPE = serial) and FPEN) and (fpop = '1')) then
rs2data := (others => '0');
elsif ldcheck2 = 1 then
if ((ex.write_reg and ldchkex and not ex.ctrl.annul) = 1) and (read_addr2 = ex.ctrl.rd) then
rs2data := ex.result;
else
if ((me.write_reg and ldchkme and not me.ctrl.annul) = 1) and (read_addr2 = me.ctrl.rd) then
rs2data := mein.bpresult;
if LDDELAY = 1 then ldbp2 := me.ctrl.ld; end if;
elsif ((wr.write_reg and not wr.ctrl.annul) = 1) and (read_addr2 = wr.ctrl.rd) then
rs2data := wr.result;
else rfenable2 := 1; rs2data := rfo.data2(31 downto 0); end if;
end if;
else rs2data := rfo.data2(31 downto 0); end if;
-- multiply operand generation
if (ex.write_y and not ex.ctrl.annul) = 1 then
y0 := mein.y(0);
elsif (me.write_y and not (me.ctrl.annul or me.ctrl.trap)) = 1 then
y0 := me.my(0);
else
y0 := wr.y(0);
end if;
ymsb := '-';
-- mul/div unit
divi.y <= (wr.y(31) and op3(0)) & wr.y;
divstart := 0; mulstart := 0;
case op is
when FMT3 =>
case op3 is
when MULSCC =>
ymsb := rs1data(0); rs1data := (icc(3) xor icc(1)) & rs1data(31 downto 1);
if y0 = 0 then
rs2data := (others => 0); rfenable2 := 0; ldbp2 := 0;
end if;
case de.cnt is
when "00" =>
rs2data := (others => 0); ymsb := rs1data(0); rfenable2 := 0;
when "01" | "10" =>
ymsb := ex.result(0);
rs1data := (ex.micc(3) xor ex.micc(1)) & ex.result(31 downto 1);
if (mein.y(0) = 0) or (de.cnt = "10") then
rs2data := (others => 0); ldbp2 := 0; rfenable2 := 0;
end if;
when others =>
if (op3 = UMUL) or (op3 = UMULCC) then
rs2data := ex.result; rfenable2 := 1;
if rfo.data2(31) = 0 then
end if;
else
rs1data := ex.result; rfenable1 := 1;
if rfo.data1(31) = 0 then
rs2data := (others => 0); rfenable2 := 0;
end if;
end if;
end case;
end if;
end case;
end case;
exin.ldbp1 <= ldbp1; exin.ldbp2 <= ldbp2;
-- PC generation
branch := 0; annul_next := 0; annul_current := 0;
inull := not Rst; hold_pc := 0; ticc_exception := 0;
fpop := 0; fpld := 0;
if ((ldlock or de.annul) = 0) then
case op is
when CALL =>
branch := 1;
if mein.inull = 1 then
hold_pc := 1; annul_current := 1;
end if;
when FMT2 =>
if (op2 = BICC) or (FPEN and (op2 = FBFCC)) or (CPEN and (op2 = CBCCC)) then
if (FPEN and (op2 = FBFCC)) then
branch := fbranch_true;
if (FPIFTYPE = parallel) and (fpo.ccv /= 1) then
hold_pc := 1; annul_current := 1;
end if;
elsif (CPEN and (op2 = CBCCC)) then
branch := cbranch_true;
if cpo.ccv /= 1 then hold_pc := 1; annul_current := 1; end if;
else branch := branch_true; end if;
if hold_pc = 0 then
if (branch = 1) then
if (cond = BA) and (annul = 1) then annul_next := 1; end if;
else annul_next := annul; end if;
if mein.inull = 1 then -- contention with JMPL
hold_pc := 1; annul_current := 1; annul_next := 0;
end if;
end if;
end if;
when FMT3 =>
case op3 is
case de.cnt is
when "00" =>
if (opf(1) or fpexin.dsz) = 1 then
hold_pc := 1; pv := 0; cnt := "01";
end if;
if (opf(1) or fpmov) = 0 then fpop := holdn; end if;
if op3 = FPOP1 then write_reg := not (opf(1) and not fpexin.dsz); end if;
when others =>
if op3 = FPOP1 then write_reg := 1; end if;
fpop := opf(1) and holdn; cnt := "00";
end case;
end if;
case de.cnt is
when "00" =>
cnt := "01"; hold_pc := 1; mulcnt := (others => 0); pv := 0;
when "01" =>
hold_pc := 1; pv := 0; cnt := "01"; mulcnt := mulcnt + 1;
if (de.mulcnt = "11111") then cnt := "10"; end if;
when "10" =>
cnt := "11"; pv := 0; hold_pc := 1;
when "11" =>
cnt := "00";
end case;
end if;
if (MULTIPLIER > iterative) and (MULTIPLIER /= m32x32) then
case de.cnt is
when "00" =>
mulstart := 1;
when "01" =>
if mulo.ready = 1 then cnt := "00";
else cnt := "01"; pv := 0; hold_pc := 1; end if;
end case;
end if;
if DIVIDER /= none then
case de.cnt is
when "00" =>
divstart := 1;
when "01" =>
if divo.ready = 1 then cnt := "00";
else cnt := "01"; pv := 0; hold_pc := 1; end if;
end case;
end if;
when TICC =>
if branch_true = 1 then ticc_exception := 1; end if;
when RETT =>
ctrl.rett := 1; su := sregs.ps;
-- BEGIN SRO (07/21/07)
when MMUL16 =>
case de.cnt is
when "00" =>
cnt := "01"; hold_pc := '1'; pv := '0';
when "01" => cnt := "00";
end case;
-- END SRO (07/21/07)
end case;
when LDST =>
case de.cnt is
when "00" =>
if (op3(2) = 1) or (op3(1 downto 0) = "11") then -- ST/LDST/SWAP/LDD
cnt := "01"; hold_pc := 1; pv := 0;
end if;
when "01" =>
if (op3(2 downto 0) = "111") or (op3(3 downto 0) = "1101") or
((CPEN or FPEN) and ((op3(5) & op3(2 downto 0)) = "1110"))
then -- LDD/STD/LDSTUB/SWAP
cnt := "10"; pv := 0; hold_pc := 1;
else
cnt := "00";
end if;
when "10" =>
cnt := "00";
end case;
end case;
end if;
muli.start <= mulstart;
divi.start <= divstart;
-- instruction watchpoints
watchpoint_exc := 0;
for i in 0 to WATCHPOINTS-1 loop
if ((tr(i).exec and not de.annul) = 1) then
if (((tr(i).addr xor de.pc(31 downto 2)) and tr(i).mask) = Zero32(31 downto 2)) then
watchpoint_exc := 1;
end if;
end if;
end loop;
if DEBUG_UNIT then
if ((iui.debug.dsuen and iui.debug.bwatch and not de.annul) = 1) then
watchpoint_exc := de.pv and (watchpoint_exc or iui.debug.dbreak or de.step);
end if;
end if;
-- prioritise traps
ctrl.trap := de.mexc or privileged_inst or illegal_inst or fp_disabled or
cp_disabled or ticc_exception or winunf_exception or
winovf_exception or fp_exception or watchpoint_exc;
if de.mexc = 1 then ctrl.tt := IAEX_TT;
elsif privileged_inst = 1 then ctrl.tt := PRIV_TT;
elsif illegal_inst = 1 then ctrl.tt := IINST_TT;
elsif fp_disabled = 1 then ctrl.tt := FPDIS_TT;
elsif cp_disabled = 1 then ctrl.tt := CPDIS_TT;
elsif watchpoint_exc = 1 then ctrl.tt := WATCH_TT;
elsif winovf_exception = 1 then ctrl.tt := WINOF_TT;
elsif winunf_exception = 1 then ctrl.tt := WINUF_TT;
elsif fp_exception = 1 then ctrl.tt := FPEXC_TT;
elsif ticc_exception = 1 then ctrl.tt := TICC_TT;
end if;
hold_pc := (hold_pc or ldlock) and not wr.annul_all;
if hold_pc = 1 then dein.pc <= de.pc;
else dein.pc <= fe.pc; end if;
annul_current_cp := annul_current;
annul_current := (annul_current or ldlock or wr.annul_all);
ctrl.annul := de.annul or wr.annul_all or annul_current;
pv := pv and not ((mein.inull and not hold_pc) or wr.annul_all);
annul_next := (mein.inull and not hold_pc) or annul_next or wr.annul_all
or (ldlock and de.annul);
if (annul_next = 1) or (rst = 0) then
cnt := (others => 0); mulcnt := (others => 0);
end if;
if DEBUG_UNIT then
step := iui.debug.step and pv and not de.annul;
end if;
fecomb.hold_pc <= hold_pc;
fecomb.branch <= branch;
dein.annul <= annul_next;
dein.cnt <= cnt;
dein.mulcnt <= mulcnt;
dein.step <= step;
dein.pv <= pv;
-- pv means that the corresponding pc can be saved on a trap
ctrl.pv := de.pv and
not ((de.annul and not de.pv) or wr.annul_all or annul_current);
inull := inull or mein.inull or hold_pc or wr.annul_all;
ici.nullify <= inull;
ici.su <= su;
exin.ctrl <= ctrl;
exin.write_reg <= write_reg;
-- latch next cwp
if wr.trapping = 1 then
dein.cwp <= sregsin.cwp;
elsif (write_cwp and not ctrl.annul) = 1 then
dein.cwp <= cwp_new;
elsif (ex.write_cwp and not ex.ctrl.annul) = 1 then
dein.cwp <= ex.cwp;
elsif (me.write_cwp and not me.ctrl.annul) = 1 then
dein.cwp <= me.cwp;
elsif (wr.write_cwp and not wr.ctrl.annul) = 1 then
dein.cwp <= wr.cwp;
else
dein.cwp <= sregs.cwp;
end if;
-- y-register write select and forwarding
rst_mey := 0;
case op is
when FMT3 =>
case op3 is
when MULSCC => write_y := 1;
when WRY => if rd = "00000" then write_y := 1; end if;
when UMAC | SMAC =>
if MACEN then write_y := 1; end if;
if de.cnt = "00" then rst_mey := 1; end if;
if de.cnt = "11" then write_y := 1; end if;
end if;
if MULTIPLIER = m32x32 then write_y := 1; end if;
end case;
end case;
-- debug unit diagnostic regfile read
if DEBUG_UNIT and (dsur.dmode and iui.debug.denable) = 1 then
read_addr1 := iui.debug.daddr(RABITS+1 downto 2); rfenable1 := 1;
end if;
exin.write_y <= write_y;
exin.rst_mey <= rst_mey;
exin.rs1data <= rs1data;
exin.rs2data <= rs2data;
exin.ymsb <= ymsb;
rfi.rd1addr <= read_addr1; rfi.rd2addr <= read_addr2;
-- BEGIN SRO (07/21/07)
-- pass insn. flags to EX stage
exin.deskey_instr <= deskey_instr;
exin.despmt_instr <= despmt_instr;
exin.desdir_instr <= desdir_instr;
exin.desf_instr <= desf_instr;
exin.gfmkld_instr <= gfmkld_instr;
exin.gfmmul_instr <= gfmmul_instr;
exin.aessb_instr <= aessb_instr;
exin.aessb4_instr <= aessb4_instr;
exin.mmul16_instr <= mmul16_instr;
-- END SRO (07/21/07)
dk_advin <= dk_advin_var; -- SRO (07/25/07)
-- CP/FPU interface
if (FPIFTYPE = serial) then
fpu_regin.fpop <= fpop and (not fpu_reg.fpld) and not ctrl.annul;
fpexin.ldfsr := fsr_ld;
fpu_regin.ex <= fpexin;
end if;
if CPEN then
-- cpi.dannul <= annul_current_cp or cpldlock or wr.annul_all or de.annul;
cpi.dannul <= cpldlock or wr.annul_all or de.annul;
cpi.dtrap <= ctrl.trap;
end if;
if FPIFTYPE = parallel then
-- fpi.dannul <= annul_current_cp or fpldlock or wr.annul_all or de.annul;
fpi.dannul <= fpldlock or wr.annul_all or de.annul;
fpi.dtrap <= ctrl.trap;
fpi.fdata <= ico.data;
fpi.frdy <= (not ico.mds) or (holdn and not hold_pc);
end if;
if RF_LOWPOW = false then rfenable1 := 1; rfenable2 := 1; end if;
rfi.ren1 <= rfenable1; rfi.ren2 <= rfenable2;
end process;
IU execute stage process (from iu.vhd, lines 1639-2334)
-------------------------------------------------------------------------------
-- execute stage
-------------------------------------------------------------------------------
-- BEGIN SRO (07/21/07)
-- added outputs of custom hardware units to sensitivity list
execute_stage : process(rst, de, ex, me, wr, wrin, sregs, fpu_reg, fpu_regin,
fpo, cpo, fpuo, mulo, divo, tr, iui, dsur, sum32,
dp_dout, df_dout, im_p, as0_b, as1_b, as2_b, as3_b, gm_b)
-- END SRO (07/21/07)
variable rs1 : std_logic_vector(4 downto 0);
variable inull, jump, link_pc : std_logic;
variable dcache_write : std_logic; -- Load or store cycle
variable memory_load : std_logic;
variable signed : std_logic;
variable enaddr : std_logic;
variable force_a2 : std_logic; -- force A(2) in second LDD cycle
variable addr_misal : std_logic; -- misaligned address (JMPL/RETT)
variable ld_size : std_logic_vector(1 downto 0); -- Load size
variable read : std_logic;
variable su : std_logic; -- Local supervisor bit
variable asi : std_logic_vector(7 downto 0); -- Local ASI
variable ctrl : pipeline_control_type;
variable res, y : std_logic_vector(31 downto 0);
variable icc, licc, micc : std_logic_vector(3 downto 0);
variable addout : std_logic_vector(31 downto 0);
variable shiftout : std_logic_vector(31 downto 0);
variable logicout : std_logic_vector(31 downto 0);
variable miscout : std_logic_vector(31 downto 0);
variable edata : std_logic_vector(31 downto 0);
variable aluresult : std_logic_vector(31 downto 0);
variable eaddress : std_logic_vector(31 downto 0);
variable nexty : std_logic_vector(31 downto 0);
variable aluin1, aluin2 : std_logic_vector(31 downto 0);
variable shiftin : std_logic_vector(63 downto 0);
variable shiftcnt : std_logic_vector(4 downto 0);
variable ymsb : std_logic; -- next msb of Y during MUL
variable write_reg, write_icc, write_y : std_logic;
variable lock : std_logic;
variable dsu_cache : std_logic;
variable fpmein : fpu_ctrl2_type;
variable mulop1, mulop2 : std_logic_vector(32 downto 0);
variable wpi : integer range 0 to 3; -- watchpoint index
variable addin2 : std_logic_vector(31 downto 0);
variable cin : std_logic;
-- BEGIN SRO (07/21/07)
variable dp_pmt_type_var : std_logic;
variable dp_out_half_var : std_logic;
variable dp_in1_var : std_logic_vector(1 to 32);
variable dp_in2_var : std_logic_vector(1 to 32);
variable df_din_var : std_logic_vector(1 to 32);
variable dk_load_var : std_logic;
variable dk_setdir_var : std_logic;
variable dk_dirin_var : std_logic;
-- BEGIN SRO (07/25/07)
-- handling this signal has been moved to decode stage
-- variable dk_advin_var : std_logic;
-- END SRO (07/25/07)
variable dk_keyl_var : std_logic_vector(1 to 32);
variable dk_keyr_var : std_logic_vector(1 to 32);
variable im_x_var : std_logic_vector(15 downto 0);
variable im_y_var : std_logic_vector(15 downto 0);
variable as_dir_var : std_logic;
variable as0_a_var : std_logic_vector(7 downto 0);
-- SRO (07/31/07)
variable gm_din1_var : std_logic_vector (31 downto 0);
variable gm_din2_var : std_logic_vector (31 downto 0);
variable gm_enable_var : std_logic;
-- END SRO (07/21/07)
begin
-- op-code decoding
op := ex.ctrl.inst(31 downto 30);
op3 := ex.ctrl.inst(24 downto 19);
opf := ex.ctrl.inst(13 downto 5);
rs1 := ex.ctrl.inst(18 downto 14);
-- common initialisation
ctrl := ex.ctrl; memory_load := 0;
ctrl.annul := ctrl.annul or wr.annul_all;
read := not op3(2);
dcache_write := 0; enaddr := 0; wpi := 0;
ld_size := LDWORD; signed := 0; addr_misal := 0; lock := 0;
write_reg := ex.write_reg;
write_icc := ex.write_icc;
write_y := ex.write_y;
fpmein.fpop := fpu_reg.ex.fpop;
fpmein.dsz := fpu_reg.ex.dsz;
fpmein.ldfsr := fpu_reg.ex.ldfsr;
fpmein.cexc := fpuo.excep(4 downto 0);
fpmein.fcc := fpuo.ConditionCodes;
muli.mac <= op3(5);
dsu_cache := 0;
-- BEGIN SRO (07/21/07)
dp_pmt_type_var := '-';
dp_out_half_var := '-';
dp_in1_var := (others => '-');
dp_in2_var := (others => '-');
df_din_var := (others => '-');
dk_load_var := '0';
dk_setdir_var := '0';
dk_dirin_var := '-';
-- dk_advin_var := '0'; -- SRO (07/25/07)
dk_keyl_var := (others => '-');
dk_keyr_var := (others => '-');
im_x_var := (others => '-');
im_y_var := (others => '-');
as_dir_var := '-';
as0_a_var := (others => '-');
-- SRO (07/31/07)
gm_din1_var := (others => '-');
gm_din2_var := (others => '-');
gm_enable_var := '0';
-- END SRO (07/21/07)
-- load/store size decoding
case op is
when LDST =>
case op3 is
when LDUB | LDUBA => ld_size := LDBYTE;
when LDSTUB | LDSTUBA => ld_size := LDBYTE; lock := 1;
when LDUH | LDUHA => ld_size := LDHALF;
when LDSB | LDSBA => ld_size := LDBYTE; signed := 1;
when LDSH | LDSHA => ld_size := LDHALF; signed := 1;
when LD | LDA | LDF | LDC => ld_size := LDWORD;
when SWAP | SWAPA => ld_size := LDWORD; lock := 1;
when LDD | LDDA | LDDF | LDDC => ld_size := LDDBL;
when STB | STBA => ld_size := LDBYTE;
when STH | STHA => ld_size := LDHALF;
when ST | STA | STF => ld_size := LDWORD;
when ISTD | STDA => ld_size := LDDBL;
when STDF | STDFQ => if FPEN then ld_size := LDDBL; end if;
when STDC | STDCQ => if CPEN then ld_size := LDDBL; end if;
end case;
end case;
link_pc := 0; jump:= 0; inull :=0; force_a2 := 0;
-- load/store control decoding
if (ctrl.annul = 0) then
case op is
when CALL =>
link_pc := 1;
when FMT3 =>
case op3 is
when JMPL =>
jump := 1; link_pc := 1;
inull := me.ctrl.annul or not me.jmpl_rett;
when RETT =>
jump := 1; inull := me.ctrl.annul or not me.jmpl_rett;
end case;
when LDST =>
if (ctrl.trap or (wrin.ctrl.trap and not wrin.ctrl.annul)) = 0 then
case ex.ctrl.cnt is
when "00" =>
memory_load := op3(3) or not op3(2); -- LD/LDST/SWAP
read := memory_load; enaddr := 1;
when "01" =>
memory_load := not op3(2); -- LDD
enaddr := memory_load;
force_a2 := memory_load;
if op3(3 downto 2) = "01" then -- ST/STD
dcache_write := 1;
end if;
if op3(3 downto 2) = "11" then -- LDST/SWAP
enaddr := 1;
end if;
when "10" => -- STD/LDST/SWAP
dcache_write := 1;
end case;
end if;
end case;
end if;
-- supervisor bit generation
if ((wr.ctrl.rett and not wr.ctrl.annul) = 1) then su := sregs.ps;
else su := sregs.s; end if;
if su = 1 then asi := "00001011"; else asi := "00001010"; end if;
if (op3(4) = 1) and ((op3(5) = 0) or not CPEN) then
asi := ex.ctrl.inst(12 downto 5);
end if;
-- load data bypass in case (LDDELAY = 1)
aluin1 := ex.rs1data; aluin2 := ex.rs2data; ymsb := ex.ymsb;
if LDDELAY = 1 then
if ex.ldbp1 = 1 then aluin1 := wr.result; ymsb := wr.result(0); end if;
if ex.ldbp2 = 1 then aluin2 := wr.result; end if;
end if;
-- bypassed operands to multiplier
muli.signed <= op3(0); divi.signed <= op3(0);
mulop1 := (aluin1(31) and op3(0)) & aluin1;
mulop2 := (aluin2(31) and op3(0)) & aluin2;
if (ex.mulinsn = 0) and not INFER_MULT then -- try to minimise power
mulop1 := (others => 0); mulop2 := (others => 0);
end if;
muli.op1 <= mulop1; muli.op2 <= mulop2;
divi.op1 <= (aluin1(31) and op3(0)) & aluin1;
divi.op2 <= (aluin2(31) and op3(0)) & aluin2;
-- ALU add/sub
icc := "0000";
if not (is_x(aluin1) or is_x(aluin2)) then
cin := ex.alu_cin; addin2 := aluin2;
if ex.aluadd = 0 then
addin2 := not aluin2; cin := not cin;
end if;
-- addout := aluin1 + addin2 + cin;
if FASTADD then addout := sum32;
else
if ex.aluadd = 0 then addout := aluin1 - aluin2 - ex.alu_cin;
else addout := aluin1 + aluin2 + ex.alu_cin; end if;
end if;
else
addout := (others => X);
end if;
add32in1 <= aluin1;
add32in2 <= addin2;
add32cin <= cin;
-- fast address adders if enabled
if FASTJUMP then
if not (is_x(aluin1) or is_x(aluin2)) then
fecomb.jump_address <= aluin1(31 downto PCLOW) + aluin2(31 downto PCLOW);
if (aluin1(1 downto 0) + aluin2(1 downto 0)) = "00" then
addr_misal := 0;
else
addr_misal := 1;
end if;
else
addr_misal := X;
fecomb.jump_address <= (others => X);
end if;
else
fecomb.jump_address(31 downto PCLOW) <= addout(31 downto PCLOW);
if addout(1 downto 0) = "00" then
addr_misal := 0;
else
addr_misal := 1;
end if;
end if;
res := (others => '-');
-- alu ops which set icc
case ex.aluop is
when ALU_OR => logicout := aluin1 or aluin2;
when ALU_ORN => logicout := aluin1 or not aluin2;
when ALU_AND => logicout := aluin1 and aluin2;
when ALU_ANDN => logicout := aluin1 and not aluin2;
when ALU_XOR => logicout := aluin1 xor aluin2;
when ALU_XNOR => logicout := aluin1 xor not aluin2;
when ALU_DIV =>
if DIVIDER /= none then logicout := aluin2;
else logicout := (others => '-'); end if;
when others => logicout := (others => '-');
end case;
-- generate condition codes
if (ex.alusel(1) = 0) then
res := addout;
if ex.aluadd = 0 then
icc(0) := ((not aluin1(31)) and aluin2(31)) or -- Carry
(addout(31) and ((not aluin1(31)) or aluin2(31)));
icc(1) := (aluin1(31) and (not aluin2(31)) and not addout(31)) or -- Overflow
(addout(31) and (not aluin1(31)) and aluin2(31));
else
icc(0) := (aluin1(31) and aluin2(31)) or -- Carry
((not addout(31)) and (aluin1(31) or aluin2(31)));
icc(1) := (aluin1(31) and aluin2(31) and not addout(31)) or -- Overflow
(addout(31) and (not aluin1(31)) and (not aluin2(31)));
end if;
else
res := logicout;
icc(1 downto 0) := "00";
end if;
if res = zero32 then -- Zero
icc(2) := 1;
else
icc(2) := 0;
end if;
icc(3) := res(31); -- Negative
-- select Y
if (me.write_y and not (me.ctrl.annul or me.ctrl.trap)) = 1
then y := me.my; else y := wr.y; end if;
-- alu ops which don't set icc
miscout := (others => '-'); edata := (others => '-');
case ex.aluop is
when ALU_STB => edata := aluin1(7 downto 0) & aluin1(7 downto 0) &
aluin1(7 downto 0) & aluin1(7 downto 0);
miscout := edata;
when ALU_STH => edata := aluin1(15 downto 0) & aluin1(15 downto 0);
miscout := edata;
when ALU_PASS1 => miscout := aluin1; edata := aluin1;
when ALU_PASS2 => miscout := aluin2;
when ALU_ONES => miscout := (others => 1); edata := (others => 1);
when ALU_RDY =>
miscout := y;
if (WATCHPOINTS > 0) and (rs1(4 downto 3) = "11") then
wpi := conv_integer(unsigned(rs1(2 downto 1)));
if rs1(0) = 0 then miscout := tr(wpi).addr & 0 & tr(wpi).exec;
else miscout := tr(wpi).mask & tr(wpi).load & tr(wpi).store; end if;
end if;
when ALU_FSR =>
edata := fpu_reg.fsr.rd & "00" & fpu_reg.fsr.tem & "000" &
std_logic_vector(FPUVER) & fpu_reg.fsr.ftt & "00" & fpu_reg.fsr.fcc &
fpu_reg.fsr.aexc & fpu_reg.fsr.cexc;
miscout := edata;
end if;
when ALU_FOP =>
miscout := aluin2;
case opf(3 downto 2) is
when "01" => miscout(31) := not miscout(31);
when "10" => miscout(31) := 0;
end case;
end if;
end case;
-- shifter
shiftin := zero32 & aluin1;
shiftcnt := aluin2(4 downto 0);
if ex.aluop = ALU_SLL then
shiftin(31 downto 0) := zero32;
shiftin(63 downto 31) := 0 & aluin1;
shiftcnt := not shiftcnt;
elsif ex.aluop = ALU_SRA then
if aluin1(31) = 1 then
shiftin(63 downto 32) := (others => 1);
else
shiftin(63 downto 32) := zero32;
end if;
end if;
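if shiftcnt (4) = '1' then
shiftin(47 downto 0) := shiftin(63 downto 16);
end if;
if shiftcnt (3) = '1' then
shiftin(39 downto 0) := shiftin(47 downto 8);
end if;
if shiftcnt (2) = '1' then
shiftin(35 downto 0) := shiftin(39 downto 4);
end if;
if shiftcnt (1) = '1' then
shiftin(33 downto 0) := shiftin(35 downto 2);
end if;
if shiftcnt (0) = '1' then
shiftin(31 downto 0) := shiftin(32 downto 1);
end if;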
shiftout := shiftin(31 downto 0);
-- generate overflow for tagged add/sub
case op is
when FMT3 =>
case op3 is
when TADDCC | TADDCCTV | TSUBCC | TSUBCCTV =>
icc(1) := aluin1(0) or aluin1(1) or aluin2(0) or aluin2(1) or icc(1);
end case;
end case;
-- BEGIN SRO (07/21/07)
-- assign inputs to custom units
if(ex.deskey_instr = '1') then
dk_load_var := '1';
dk_keyl_var := aluin1;
dk_keyr_var := aluin2;
elsif(ex.despmt_instr = '1') then
dp_in1_var := aluin1;
dp_in2_var := aluin2;
dp_pmt_type_var := ex.ctrl.inst(6);
dp_out_half_var := ex.ctrl.inst(5);
elsif(ex.desdir_instr = '1') then
dk_setdir_var := '1';
dk_dirin_var := aluin2(0);
elsif(ex.desf_instr = '1') then
df_din_var := aluin1;
-- dk_advin_var := '1'; -- SRO (07/25/07)
elsif(ex.aessb_instr = '1') then
as_dir_var := aluin2(0);
case aluin2(5 downto 4) is
when "00" => as0_a_var := aluin1(31 downto 24);
when "01" => as0_a_var := aluin1(23 downto 16);
when "10" => as0_a_var := aluin1(15 downto 8);
when "11" => as0_a_var := aluin1( 7 downto 0);
end case;
elsif(ex.aessb4_instr = '1') then
as_dir_var := aluin2(0);
as0_a_var := aluin1(31 downto 24);
as1_a_var := aluin1(23 downto 16);
as2_a_var := aluin1(15 downto 8);
as3_a_var := aluin1( 7 downto 0);
elsif(ex.gfmkld_instr = '1') then
--BEGIN SRO (07/31/07)
gm_din1_var := aluin1;
gm_enable_var := '1';
elsif(ex.gfmmul_instr = '1') then
gm_enable_var := '0';
--END SRO (07/31/07)
elsif(ex.mmul16_instr = '1') then
im_x_var := aluin1(15 downto 0);
im_y_var := aluin2(15 downto 0);
end if;
dp_pmt_type <= dp_pmt_type_var;
dp_out_half <= dp_out_half_var;
dp_in1 <= dp_in1_var;
dp_in2 <= dp_in2_var;
df_din <= df_din_var;
dk_load <= dk_load_var;
dk_setdir <= dk_setdir_var;
dk_dirin <= dk_dirin_var;
-- dk_advin <= dk_advin_var; -- SRO (07/25/07)
dk_keyl <= dk_keyl_var;
dk_keyr <= dk_keyr_var;
im_x <= im_x_var;
im_y <= im_y_var;
as_dir <= as_dir_var;
as0_a <= as0_a_var;
as1_a <= as1_a_var;
as2_a <= as2_a_var;
as3_a <= as3_a_var;
gm_din1 <= gm_din1_var;
gm_din2 <= gm_din2_var;
gm_enable <= gm_enable_var;
-- END SRO (07/21/07)
-- select alu output
aluresult := (others => 0);
if link_pc = 1 then
aluresult := ex.ctrl.pc(31 downto 2) & "00"; -- save PC during jmpl
else
case ex.alusel is
when ALU_RES_ADD => aluresult := addout;
when ALU_RES_SHIFT => aluresult := shiftout;
when ALU_RES_LOGIC => aluresult := logicout;
when others => aluresult := miscout;
end case;
end if;
ex.icc <= icc;
-- BEGIN SRO (07/21/07)
if(ex.despmt_instr = '1') then
aluresult := dp_dout;
elsif(ex.desf_instr = '1') then
aluresult := df_dout;
elsif(ex.gfmmul_instr = '1') then
aluresult := gm_b;
elsif(ex.aessb_instr = '1') then
case aluin2(5 downto 4) is
when "00" => aluresult := as0_b & aluin1(23 downto 0);
when "01" => aluresult := aluin1(31 downto 24) & as0_b & aluin1(15 downto 0);
when "10" => aluresult := aluin1(31 downto 16) & as0_b & aluin1( 7 downto 0);
when "11" => aluresult := aluin1(31 downto 8) & as0_b;
when others => null;
end case;
elsif(ex.aessb4_instr = '1') then
aluresult := as0_b & as1_b & as2_b & as3_b;
elsif(ex.mmul16_instr = '1') then
aluresult := "0000000000000000" & im_p;
end if;
-- END SRO (07/21/07)
-- FPU interface
if is_x(aluin1) then aluin1 := (others => '0'); end if;
if is_x(aluin2) then aluin2 := (others => '0'); end if;
if is_x(de.inst(19) & de.inst(13 downto 5)) then
fpui.FpInst <= (others => '0');
else
fpui.FpInst <= de.inst(19) & de.inst(13 downto 5);
end if;
if is_x(fpu_reg.fsr.rd) then fpui.RoundingMode <= (others => '0');
else
fpui.RoundingMode <= fpu_reg.fsr.rd;
end if;
if (ex.ctrl.cnt = "00") or (opf(1) = '0') then
fpui.fprf_dout1 <= aluin1 & aluin1;
fpui.fprf_dout2 <= aluin2 & aluin2;
else
fpui.fprf_dout1 <= fpu_reg.op1h & aluin1;
fpui.fprf_dout2 <= me.result & aluin2;
end if;
fpu_regin.op1h <= aluin1;
if fpu_reg.ex.fpop = "01" and (ex.write_reg = '1') then
if fpu_reg.ex.dsz = '1' then
if (ex.ctrl.cnt /= "00") then
aluresult := fpuo.FracResult(34 downto 3);
end if;
else
aluresult := fpuo.SignResult & fpuo.ExpResult(7 downto 0) &
fpuo.FracResult(54 downto 32);
end if;
end if;
fpu_regin.me <= fpmein;
end if;
if (MULTIPLIER = m32x32) and (ex.mulinsn = '1') then
aluresult := mulo.result(31 downto 0);
end if;
if MACEN then
if ex.aluop = ALU_RDY then
if rs1 = "10010" then
if ((me.mulinsn and me.ctrl.inst(24)) = '1') then
else aluresult := wr.asr18; end if;
else
if ((me.mulinsn and me.ctrl.inst(24)) = '1') then
end if;
end if;
end if;
end if;
ex.result <= aluresult;
-- generate Y
micc := icc;
licc := icc;
nexty := y;
if ex.mulstep = '1' then
nexty := ymsb & y(31 downto 1);
elsif (ex.mulinsn = '1') and (MULTIPLIER = iterative) then
case ex.ctrl.cnt is
when "00" => nexty := y;
when "01" => nexty := ymsb & me.y(31 downto 1);
when "10" =>
aluresult := ymsb & me.y(31 downto 1);
licc(3) := ymsb; licc(1 downto 0) := "00";
if aluresult = zero32 then
licc(2) := '1'; else licc(2) := '0';
end if;
nexty := me.y;
when others => null;
end case;
elsif ((ex.rst_mey or ex.write_y) = '1') and (MULTIPLIER = iterative) then
if ex.ctrl.cnt = "11" then nexty := addout;
else nexty := logicout; end if;
elsif ex.write_y = '1' then
nexty := logicout;
end if;
if (MULTIPLIER = iterative) then
micc(3) := icc(3) and not ex.rst_mey;
micc(1) := icc(1) and not ex.rst_mey;
end if;
-- data address generation
eaddress := miscout;
if ex.alusel = ALU_RES_ADD then
addout(2) := addout(2) or force_a2; eaddress := addout;
end if;
if CPEN and (op = LDST) and ((op3(5 downto 4) & op3(2)) = "111") and
(ex.ctrl.cnt /= "00")
then
dci.edata <= cpo.data; -- store co-processor
aluresult := cpo.data;
elsif (FPIFTYPE = parallel) and (op = LDST) and
((op3(5 downto 4) & op3(2)) = "101") and (ex.ctrl.cnt /= "00")
then
dci.edata <= fpo.data; -- store fpu co-processor
aluresult := fpo.data;
else
dci.edata <= edata;
aluresult(2) := aluresult(2) or force_a2;
end if;
if (MULTIPLIER > iterative) and (MULTIPLIER /= m32x32) then
write_reg := write_reg or mulo.ready;
write_y := write_y or mulo.ready;
write_icc := write_icc or (mulo.ready and op3(4));
end if;
if DIVIDER /= none then
write_reg := write_reg or divo.ready;
write_icc := write_icc or (divo.ready and op3(4));
end if;
-- debug unit cache access
if DEBUG_UNIT then
if dsur.dmode = '1' then
dcache_write := '0'; read := '1'; dsu_cache := '0';
if (iui.debug.denable and iui.debug.daddr(20)) = '1' then
enaddr := '1';
asi(4 downto 0) := iui.debug.daddr(17) & "11" & iui.debug.daddr(19 downto 18);
if M_EN and iui.debug.daddr(21) = '1' then
-- ASI_ICTX "10101" 0x90300000-0x90340000
-- ASI_DCTX "10100" 0x90380000-0x903c0000
asi(4 downto 0) := "1010" & not (iui.debug.daddr(19));
end if;
dsu_cache := '1'; ld_size := LDWORD;
if iui.debug.dwrite = '1' then
dcache_write := '1'; read := '0';
end if;
if (dsur.dsuen and iui.debug.dwrite) = '1' then aluresult := iui.debug.ddata;
else aluresult(21 downto 2) := iui.debug.daddr; end if;
end if;
end if;
end if;
mein.y <= nexty;
dciin.enaddr <= enaddr;
dciin.read <= read;
dciin.write <= dcache_write;
dciin.asi <= asi;
dciin.lock <= lock and not ctrl.annul;
dci.eenaddr <= enaddr;
dci.eaddress <= eaddress;
dci.dsuen <= dsur.dsuen;
dci.esu <= su;
fecomb.jump <= jump;
ex.micc <= micc;
mein.inull <= inull;
mein.icc <= licc;
mein.memory_load <= memory_load;
mein.ld_size <= ld_size;
mein.signed <= signed;
mein.addr_misal <= addr_misal;
mein.result <= aluresult;
mein.write_reg <= write_reg;
mein.write_icc <= write_icc;
mein.write_y <= write_y;
mein.ctrl <= ctrl;
mein.su <= su;
end process;
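The result multiplexer in the listing above gives aessb a read-modify-write behavior on a single byte of the first operand: bits 5:4 of the second operand select the byte lane, bit 0 carries the direction flag, and the S-box output is merged back between the untouched bytes. A minimal C model of that merge is sketched below. The names aessb_model, sub_fn and toy_sub are hypothetical, and the substitution is passed in as a callback rather than a real AES S-box table, so the sketch illustrates lane selection only.

```c
#include <stdint.h>

/* Hypothetical callback standing in for the aes_sbox unit:
   byte in, direction flag (operand bit 0), byte out. */
typedef uint8_t (*sub_fn)(uint8_t byte, unsigned dir);

/* Toy substitution used only for illustration (NOT the AES S-box). */
static uint8_t toy_sub(uint8_t byte, unsigned dir)
{
    (void)dir;
    return (uint8_t)~byte;
}

/* Model of the aessb result mux: operand bits 5:4 pick the byte lane
   ("00" = bits 31:24 ... "11" = bits 7:0); the other lanes pass through. */
static uint32_t aessb_model(uint32_t rs1, uint32_t op2, sub_fn sub)
{
    unsigned lane  = (op2 >> 4) & 0x3u;
    unsigned dir   = op2 & 0x1u;
    unsigned shift = 24u - 8u * lane;
    uint8_t  in    = (uint8_t)(rs1 >> shift);
    uint32_t out   = (uint32_t)sub(in, dir) << shift;

    return (rs1 & ~((uint32_t)0xFFu << shift)) | out;
}
```

With the toy substitution, aessb_model(0x11223344, 0x10, toy_sub) replaces only the second byte and yields 0x11DD3344.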
IU custom hardware instantiation (from iu.vhd, lines 3100-3140)
-- BEGIN SRO (07/21/07)
-- custom hardware component instantiation
des_pmt_gen : if(des_pmt_en = true) generate
des_pmt_unit : des_pmt
port map (dp_pmt_type, dp_out_half, dp_in1, dp_in2, dp_dout);
end generate;
des_key_f_gen: if(des_key_f_en = true) generate
des_fcore_unit : des_fcore
port map (df_din, df_rndkey, df_dout);
des_keygen_unit : des_keygen
port map (clk, dk_setdir, dk_load, dk_dirin, dk_advin, dk_keyl, dk_keyr, dk_rndkey);
end generate;
df_rndkey <= dk_rndkey;
idea_modmul_gen : if(idea_mmul_en = true) generate
idea_modmul_unit : idea_modmul
port map (clk, im_x, im_y, im_p);
end generate;
aes_sbox_gen : if(aes_sbox_config /= none) generate
as0 : aes_sbox
port map (as_dir, as0_a, as0_b);
aes_sbox_gen_others : if(aes_sbox_config = sbox4) generate
as1 : aes_sbox
port map (as_dir, as1_a, as1_b);
as2 : aes_sbox
port map (as_dir, as2_a, as2_b);
as3 : aes_sbox
port map (as_dir, as3_a, as3_b);
end generate;
end generate;
gf_k_mult_gen: if(gfmmul_en = true) generate
gf_k_mult_unit : gf_k_mult
port map (gm_din1, gm_din2, rst, gm_enable, clk, gm_b);
end generate;
-- END SRO (07/21/07)
Added SPARC opcodes for binutils v2.16.1 (from sparc-opc.c, lines 1036-1052)
/* BEGIN SRO (07/14/07) */
/* opcodes for DES and Triple-DES */
{ "deskey", F3(2, 0x09, 0), F3(~2, ~0x09, ~0)|RD_G0, "1,2", 0, v8 },
{ "desdir", F3(2, 0x09, 1), F3(~2, ~0x09, ~1)|SIMM13(0x1000)|RS1_G0|RD_G0, "i", 0, v8 },
{ "desf", F3(2, 0x09, 1)|SIMM13(0x1000), F3(~2, ~0x09, ~1), "1,d", 0, v8 },
{ "desipl", F3(2, 0x0d, 0)|ASI(0), F3(~2, ~0x0d, ~0)|ASI(~0), "1,2,d", 0, v8 },
{ "desipr", F3(2, 0x0d, 0)|ASI(1), F3(~2, ~0x0d, ~0)|ASI(~1), "1,2,d", 0, v8 },
{ "desfpl", F3(2, 0x0d, 0)|ASI(2), F3(~2, ~0x0d, ~0)|ASI(~2), "1,2,d", 0, v8 },
{ "desfpr", F3(2, 0x0d, 0)|ASI(3), F3(~2, ~0x0d, ~0)|ASI(~3), "1,2,d", 0, v8 },
/* opcodes for AES */
{ "gfmkld", F3(2, 0x19, 0), F3(~2, ~0x19, ~0)|RD_G0, "1,2", 0, v8 },
{ "gfmmul", F3(2, 0x1d, 0), F3(~2, ~0x1d, ~0)|ASI_RS2(~0), "1,d", 0, v8 },
{ "aessb" , F3(2, 0x2c, 1), F3(~2, ~0x2c, ~1), "1,i,d", 0, v8 },
{ "aessb4", F3(2, 0x2c, 1), F3(~2, ~0x2c, ~1), "1,i,d", F_ALIAS, v8 },
/* opcodes for IDEA */
{ "mmul16", F3(2, 0x2d, 0), F3(~2, ~0x2d, ~0), "1,2,d", 0, v8 },
/* END SRO (07/14/07) */
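Each table entry above pairs a mnemonic with the fixed bits of a SPARC format-3 instruction word. As a rough illustration of how those bits combine with the register fields, the sketch below re-creates the field macros in the style of sparc-opc.c and assembles a register-register encoding. The helper name encode_f3_rr is hypothetical, and the macro definitions are written out here from the SPARC V8 field layout rather than copied from binutils.

```c
#include <stdint.h>

/* Field helpers following the SPARC V8 format-3 layout. */
#define OP(x)   (((uint32_t)(x) & 0x3u)  << 30)  /* bits 31:30 */
#define RD(x)   (((uint32_t)(x) & 0x1fu) << 25)  /* bits 29:25 */
#define OP3(x)  (((uint32_t)(x) & 0x3fu) << 19)  /* bits 24:19 */
#define RS1(x)  (((uint32_t)(x) & 0x1fu) << 14)  /* bits 18:14 */
#define F3I(x)  (((uint32_t)(x) & 0x1u)  << 13)  /* bit 13: immediate flag */
#define ASI(x)  (((uint32_t)(x) & 0xffu) << 5)   /* bits 12:5 */
#define RS2(x)  ((uint32_t)(x) & 0x1fu)          /* bits 4:0 */
#define F3(x, y, z) (OP(x) | OP3(y) | F3I(z))

/* Hypothetical helper: encode a register-register format-3 instruction
   with op = 2, as used by every entry in the table above. */
static uint32_t encode_f3_rr(unsigned op3, unsigned asi,
                             unsigned rd, unsigned rs1, unsigned rs2)
{
    return F3(2, op3, 0) | ASI(asi) | RD(rd) | RS1(rs1) | RS2(rs2);
}
```

Under these definitions, mmul16 %g1, %g2, %g3 encodes as 0x87684002, and desfpr %g1, %g2, %g3 encodes as 0x86684062. In the latter, ASI(3) sets instruction bits 6 and 5, which are the bits the execute stage taps as dp_pmt_type and dp_out_half.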
Appendix C: Test Vectors for Functional Evaluation
DES
source:
Phil Karn's DES source, available from Advanced Packet Vault package apv100.tar.gz
URL: http://www.citi.umich.edu/projects/apv/
KEY PLAINTEXT CIPHERTEXT
49793EBC79B3258F 437540C8698F3CFA 6FBF1CAFCFFD0556
4FB05E1515AB73A7 072D43A077075292 2F22E49BAB7CA1AC
49E95D6D4CA229BF 02FE55778117F12A 5A6B612CC26CCE4A
018310DC409B26D6 1D9D5C5018F728C2 5F4C038ED12B2E41
1C587F1C13924FEF 305532286D6F295A 63FAC0D034D9F793
0101010101010101 0123456789ABCDEF 617B3A0CE8F07100
1F1F1F1F0E0E0E0E 0123456789ABCDEF DB958605F8C8C606
E0FEE0FEF1FEF1FE 0123456789ABCDEF EDBFD1C66C29CCC7
0000000000000000 FFFFFFFFFFFFFFFF 355550B2150E2451
FFFFFFFFFFFFFFFF 0000000000000000 CAAAAF4DEAF1DBAE
0123456789ABCDEF 0000000000000000 D5D44FF720683D0D
FEDCBA9876543210 FFFFFFFFFFFFFFFF 2A2BB008DF97C2F2
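The last four rows of the table form two pairs related by the DES complementation property: complementing both the key and the plaintext complements the ciphertext. A small C check of that relationship on the table values is sketched below as a sanity test for transcription errors. The struct and function names are hypothetical, and the check does not run DES itself.

```c
#include <stdint.h>

struct des_vector {
    uint64_t key;
    uint64_t plaintext;
    uint64_t ciphertext;
};

/* The last four rows of the DES table above. */
static const struct des_vector rows[4] = {
    { 0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL, 0x355550B2150E2451ULL },
    { 0xFFFFFFFFFFFFFFFFULL, 0x0000000000000000ULL, 0xCAAAAF4DEAF1DBAEULL },
    { 0x0123456789ABCDEFULL, 0x0000000000000000ULL, 0xD5D44FF720683D0DULL },
    { 0xFEDCBA9876543210ULL, 0xFFFFFFFFFFFFFFFFULL, 0x2A2BB008DF97C2F2ULL },
};

/* Returns 1 when every field of b is the bitwise complement of the
   corresponding field of a, and 0 otherwise. */
static int is_complement_pair(const struct des_vector *a,
                              const struct des_vector *b)
{
    return b->key        == (uint64_t)~a->key
        && b->plaintext  == (uint64_t)~a->plaintext
        && b->ciphertext == (uint64_t)~a->ciphertext;
}
```

Rows 9 and 10 form one complement pair and rows 11 and 12 form the other, so a single-digit typo in any of these twelve values would make the check fail.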
Triple-DES
source:
First six rows are from the test vectors for DES listed above
Last six rows are from DES test vectors for the NESSIE project
(New European Schemes for Signatures, Integrity, and Encryption)
Includes vectors for two-key and three-key Triple-DES
URL: https://www.cosic.esat.kuleuven.be/nessie/testvectors/bc/des/
49793EBC79B3258F 49793EBC79B3258F 49793EBC79B3258F 437540C8698F3CFA 6FBF1CAFCFFD0556
4FB05E1515AB73A7 4FB05E1515AB73A7 4FB05E1515AB73A7 072D43A077075292 2F22E49BAB7CA1AC
49E95D6D4CA229BF 49E95D6D4CA229BF 49E95D6D4CA229BF 02FE55778117F12A 5A6B612CC26CCE4A
FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFF 0000000000000000 CAAAAF4DEAF1DBAE
0123456789ABCDEF 0123456789ABCDEF 0123456789ABCDEF 0000000000000000 D5D44FF720683D0D
FEDCBA9876543210 FEDCBA9876543210 FEDCBA9876543210 FFFFFFFFFFFFFFFF 2A2BB008DF97C2F2
0001020304050607 08090A0B0C0D0E0F 0001020304050607 0011223344556677 D117BD6373549FAA
2BD6459F82C5B300 952C49104881FF48 2BD6459F82C5B300 EA024714AD5C4D84 C616ACE843958247
0001020304050607 08090A0B0C0D0E0F 0001020304050607 E4FC19D69463B783 0011223344556677
2BD6459F82C5B300 952C49104881FF48 2BD6459F82C5B300 8598538A8ECF117D EA024714AD5C4D84
0001020304050607 08090A0B0C0D0E0F 1011121314151617 0011223344556677 97A25BA82B564F4C
0001020304050607 08090A0B0C0D0E0F 1011121314151617 982662605553244D 0011223344556677
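The first six rows reuse the single-DES vectors because, in the usual encrypt-decrypt-encrypt (EDE) arrangement of Triple-DES, three equal keys make the first two stages cancel. The sketch below shows only that wiring. It assumes the EDE ordering, and it uses a toy invertible function in place of DES, so the names toy_encrypt, toy_decrypt, ede_encrypt and ede_decrypt are illustrative rather than taken from the thesis code.

```c
#include <stdint.h>

typedef uint64_t (*block_fn)(uint64_t block, uint64_t key);

/* Toy invertible "cipher" used only to exercise the EDE wiring (NOT DES). */
static uint64_t toy_encrypt(uint64_t b, uint64_t k)
{
    b ^= k;
    return (b << 7) | (b >> 57);        /* rotate left by 7 */
}

static uint64_t toy_decrypt(uint64_t b, uint64_t k)
{
    b = (b >> 7) | (b << 57);           /* rotate right by 7 */
    return b ^ k;
}

/* Encrypt with k1, decrypt with k2, encrypt with k3. */
static uint64_t ede_encrypt(uint64_t b, uint64_t k1, uint64_t k2, uint64_t k3,
                            block_fn enc, block_fn dec)
{
    return enc(dec(enc(b, k1), k2), k3);
}

/* Inverse of ede_encrypt: decrypt with k3, encrypt with k2, decrypt with k1. */
static uint64_t ede_decrypt(uint64_t b, uint64_t k1, uint64_t k2, uint64_t k3,
                            block_fn enc, block_fn dec)
{
    return dec(enc(dec(b, k3), k2), k1);
}
```

With k1 = k2 = k3, ede_encrypt returns the same value as a single call to the underlying encryption, which is the relationship the first six rows rely on.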
IDEA
source:
First row comes from
http://www.cc.gatech.edu/classes/AY2001/cs8803g spring/crypto8 1.ppt
All other rows are from Crypto++ v5.5.1 URL: http://www.cryptopp.com/
7CA110454A1A6E5701A1D6D039776742 690F5B0D9A26939B 1BDDB24214237EC7
00010002000300040005000600070008 0000000100020003 11FBED2B01986DE5
00010002000300040005000600070008 0102030405060708 540E5FEA18C2F8B1
00010002000300040005000600070008 0019324B647D96AF 9F0A0AB6E10CED78
00010002000300040005000600070008 F5202D5B9C671B08 CF18FD7355E2C5C5
00010002000300040005000600070008 FAE6D2BEAA96826E 85DF52005608193D
00010002000300040005000600070008 0A141E28323C4650 2F7DE750212FB734
00010002000300040005000600070008 050A0F14191E2328 7B7314925DE59C09
0005000A000F00140019001E00230028 0102030405060708 3EC04780BEFF6E20
3A984E2000195DB32EE501C8C47CEA60 0102030405060708 97BCD8200780DA86
006400C8012C019001F4025802BC0320 05320A6414C819FA 65BE87E7A2538AED
9D4075C103BC322AFB03E7BE6AB30006 0808080808080808 F5DB1AC45E5EF9F9
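These vectors exercise IDEA's multiplication modulo 2^16 + 1, the operation that the mmul16 instruction targets. A software reference model is sketched below, assuming the standard IDEA convention that the 16-bit value 0 represents 2^16. The function name idea_mul_model is hypothetical and the sketch is not derived from the idea_modmul hardware description.

```c
#include <stdint.h>

/* Multiply modulo 2^16 + 1 with IDEA's encoding: an operand of 0 stands
   for 2^16, and a result of 2^16 is returned as 0. */
static uint16_t idea_mul_model(uint16_t x, uint16_t y)
{
    uint64_t a = x ? x : 0x10000u;
    uint64_t b = y ? y : 0x10000u;

    /* The product can reach 2^32, so it is formed in 64 bits. */
    return (uint16_t)((a * b) % 0x10001u);
}
```

For example, idea_mul_model(0, 0) is 1, because 2^16 is congruent to -1 modulo 2^16 + 1.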
AES
source:
Known Answer Test vectors for Rijndael algorithm submitted to NIST
URL: http://csrc.nist.gov/CryptoToolkit/aes/rijndael/rijndael-vals.zip
CASE KEY PLAINTEXT CIPHERTEXT
1 000102030405060708090A0B0C0D0E0F 000102030405060708090A0B0C0D0E0F 0A940BB5416EF045F1C39458C653EA5A
2 00010203050607080A0B0C0D0F101112 506812A45F08C889B97F5980038B8359 D8F532538289EF7D06B506A4FD5BE9C9
3 14151617191A1B1C1E1F202123242526 5C6D71CA30DE8B8B00549984D2EC7D4B 59AB30F4D4EE6E4FF9907EF65B1FB68C
4 28292A2B2D2E2F30323334353738393A 53F3F4C64F8616E4E7C56199F48F21F6 BF1ED2FCB2AF3FD41443B56D85025CB1
5 3C3D3E3F41424344464748494B4C4D4E A1EB65A3487165FB0F1C27FF9959F703 7316632D5C32233EDCB0780560EAE8B2
6 50515253555657585A5B5C5D5F606162 3553ECF0B1739558B08E350A98A39BFA 408C073E3E2538072B72625E68B8364B
7 64656667696A6B6C6E6F707173747576 67429969490B9711AE2B01DC497AFDE8 E1F94DFA776597BEACA262F2F6366FEA
8 78797A7B7D7E7F80828384858788898A 93385C1F2AEC8BED192F5A8E161DD508 F29E986C6A1C27D7B29FFD7EE92B75F1
9 8C8D8E8F91929394969798999B9C9D9E B5BF946BE19BEB8DB3983B5F4C6E8DDB 131C886A57F8C2E713ABA6955E2B55B5
10 A0A1A2A3A5A6A7A8AAABACADAFB0B1B2 41321EE10E21BD907227C4450FF42324 D2AB7662DF9B8C740210E5EEB61C199D
11 B4B5B6B7B9BABBBCBEBFC0C1C3C4C5C6 00A82F59C91C8486D12C0A80124F6089 14C10554B2859C484CAB5869BBE7C470
12 C8C9CACBCDCECFD0D2D3D4D5D7D8D9DA 7CE0FD076754691B4BBD9FAF8A1372FE DB4D498F0A49CF55445D502C1F9AB3B5
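AES MixColumns depends on multiplication in GF(2^8) modulo the polynomial x^8 + x^4 + x^3 + x + 1, which is the arithmetic the Galois field instructions are meant to accelerate. A bitwise software reference is sketched below for cross-checking. The names xtime and gf_mul follow common usage in AES descriptions and are not identifiers from the thesis sources.

```c
#include <stdint.h>

/* Multiply by x (0x02) in GF(2^8), reducing by 0x11B on overflow. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80u) ? 0x1Bu : 0x00u));
}

/* Shift-and-add multiplication in GF(2^8): for each set bit of b,
   accumulate the correspondingly shifted copy of a. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;

    while (b != 0) {
        if (b & 1u)
            p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}
```

The worked example in FIPS-197 gives 0x57 multiplied by 0x83 as 0xC1, which this routine reproduces.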
Appendix D: Example Source Code for Functional and Performance Evaluations
DES and Triple-DES
des_keysched.s
#if USE_KEY_F==0
.section ".text"
/*
des_keysched: perform DES key schedule.
The round key array is arranged in encryption order,
so it is the main program's responsibility to use the keys
in the correct order when doing the rounds.
arguments:
%i0 = pointer to master key
%i1 = starting address of round key array
(each key is stored in 2 words, the 8 LSBs in each word are not used)
locals:
%l0, %l1 = Ci, Di
%l2 = temp register for shifted data
%l3 = bit mask
%o0, %o1 = left/right halves of master key
%o2, %o3 = left/right halves of round key
*/
.align 4
.global des_keysched
.type des_keysched, #function
.proc 04
des_keysched:
save %sp, -128, %sp
ld [%i0], %o0
ld [%i0+4], %o1
/* PC-1 */
sethi %hi(0x80000000), %l3
sll %o1, 24, %l2
and %l2, %l3, %l2
or %g0, %l2, %l0
sethi %hi(0x40000000), %l3
sll %o1, 15, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x20000000), %l3
sll %o1, 6, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x10000000), %l3
srl %o1, 3, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x08000000), %l3
sll %o0, 20, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x04000000), %l3
sll %o0, 11, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x02000000), %l3
sll %o0, 2, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x01000000), %l3
srl %o0, 7, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00800000), %l3
sll %o1, 17, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00400000), %l3
sll %o1, 8, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00200000), %l3
srl %o1, 1, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00100000), %l3
srl %o1, 10, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00080000), %l3
sll %o0, 13, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00040000), %l3
sll %o0, 4, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00020000), %l3
srl %o0, 5, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00010000), %l3
srl %o0, 14, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00008000), %l3
sll %o1, 10, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00004000), %l3
sll %o1, 1, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00002000), %l3
srl %o1, 8, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sethi %hi(0x00001000), %l3
srl %o1, 17, %l2
and %l2, %l3, %l2
or %l0, %l2, %l0
sll %o0, 6, %l2
and %l2, 0x800, %l2
or %l0, %l2, %l0
srl %o0, 3, %l2
and %l2, 0x400, %l2
or %l0, %l2, %l0
srl %o0, 12, %l2
and %l2, 0x200, %l2
or %l0, %l2, %l0
srl %o0, 21, %l2
and %l2, 0x100, %l2
or %l0, %l2, %l0
sll %o1, 3, %l2
and %l2, 0x80, %l2
or %l0, %l2, %l0
srl %o1, 6, %l2
and %l2, 0x40, %l2
or %l0, %l2, %l0
srl %o1, 15, %l2
and %l2, 0x20, %l2
or %l0, %l2, %l0
srl %o1, 24, %l2
and %l2, 0x10, %l2
or %l0, %l2, %l0
sethi %hi(0x80000000), %l3
sll %o1, 30, %l2
and %l2, %l3, %l2
or %g0, %l2, %l1
sethi %hi(0x40000000), %l3
sll %o1, 21, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x20000000), %l3
sll %o1, 12, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x10000000), %l3
sll %o1, 3, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x08000000), %l3
sll %o0, 26, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x04000000), %l3
sll %o0, 17, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x02000000), %l3
sll %o0, 8, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x01000000), %l3
srl %o0, 1, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00800000), %l3
sll %o1, 21, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00400000), %l3
sll %o1, 12, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00200000), %l3
sll %o1, 3, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00100000), %l3
srl %o1, 6, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00080000), %l3
sll %o0, 17, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00040000), %l3
sll %o0, 8, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00020000), %l3
srl %o0, 1, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00010000), %l3
srl %o0, 10, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00008000), %l3
sll %o1, 12, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00004000), %l3
sll %o1, 3, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00002000), %l3
srl %o1, 6, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sethi %hi(0x00001000), %l3
srl %o1, 15, %l2
and %l2, %l3, %l2
or %l1, %l2, %l1
sll %o0, 8, %l2
and %l2, 0x800, %l2
or %l1, %l2, %l1
srl %o0, 1, %l2
and %l2, 0x400, %l2
or %l1, %l2, %l1
srl %o0, 10, %l2
and %l2, 0x200, %l2
or %l1, %l2, %l1
srl %o0, 19, %l2
and %l2, 0x100, %l2
or %l1, %l2, %l1
sll %o0, 3, %l2
and %l2, 0x80, %l2
or %l1, %l2, %l1
srl %o0, 6, %l2
and %l2, 0x40, %l2
or %l1, %l2, %l1
srl %o0, 15, %l2
and %l2, 0x20, %l2
or %l1, %l2, %l1
srl %o0, 24, %l2
and %l2, 0x10, %l2
or %l1, %l2, %l1
or %g0, 0, %g1
/* rotate 1 or 2 bits depending on round
(round numbers decreased by 1 for indexing purposes) */
rotate:
cmp %g1, 0
be rotate1
cmp %g1, 1
be rotate1
cmp %g1, 8
be rotate1
cmp %g1, 15
be rotate1
rotate2:
sll %l0, 2, %l2
srl %l0, 26, %l3
and %l3, 0x30, %l3
sll %l1, 2, %l4
srl %l1, 26, %l5
and %l5, 0x30, %l5
ba,a rotate_done
rotate1:
sll %l0, 1, %l2
srl %l0, 27, %l3
and %l3, 0x10, %l3
sll %l1, 1, %l4
srl %l1, 27, %l5
and %l5, 0x10, %l5
rotate_done:
or %l2, %l3, %l0
or %l4, %l5, %l1
/* PC-2 */
sethi %hi(0x80000000), %l3
sll %l0, 13, %l2
and %l2, %l3, %l2
or %g0, %l2, %o0
sethi %hi(0x40000000), %l3
sll %l0, 15, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x20000000), %l3
sll %l0, 8, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x11000000), %l3
sll %l0, 20, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x0a400000), %l3
srl %l0, 4, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x04000000), %l3
srl %l0, 1, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00800800), %l3
or %l3, %lo(0x00800800), %l3
sll %l0, 6, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00280000), %l3
sll %l0, 10, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00100400), %l3
or %l3, %lo(0x00100400), %l3
srl %l0, 2, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00040000), %l3
sll %l0, 5, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00022000), %l3
srl %l0, 3, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00010000), %l3
srl %l0, 12, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00008000), %l3
sll %l0, 9, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00004200), %l3
or %l3, %lo(0x00004200), %l3
srl %l0, 10, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
sethi %hi(0x00001000), %l3
srl %l0, 13, %l2
and %l2, %l3, %l2
or %o0, %l2, %o0
srl %l0, 22, %l2
and %l2, 0x100, %l2
or %o0, %l2, %o0
sethi %hi(0x80010000), %l3
sll %l1, 12, %l2
and %l2, %l3, %l2
or %g0, %l2, %o1
sethi %hi(0x40000000), %l3
sll %l1, 22, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x20000000), %l3
and %l1, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x10000000), %l3
sll %l1, 5, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x08800000), %l3
sll %l1, 14, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x04000000), %l3
sll %l1, 21, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x02000000), %l3
srl %l1, 5, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x01000000), %l3
sll %l1, 4, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00444000), %l3
sll %l1, 7, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00201000), %l3
srl %l1, 6, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00100000), %l3
sll %l1, 8, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00080000), %l3
sll %l1, 3, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00020000), %l3
srl %l1, 4, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00008000), %l3
srl %l1, 11, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sethi %hi(0x00002000), %l3
srl %l1, 1, %l2
and %l2, %l3, %l2
or %o1, %l2, %o1
sll %l1, 1, %l2
and %l2, 0x800, %l2
or %o1, %l2, %o1
srl %l1, 14, %l2
and %l2, 0x400, %l2
or %o1, %l2, %o1
srl %l1, 22, %l2
and %l2, 0x200, %l2
or %o1, %l2, %o1
srl %l1, 20, %l2
and %l2, 0x100, %l2
or %o1, %l2, %o1
/* store round key */
sll %g1, 3, %o2
st %o0, [%i1+%o2]
add %o2, 4, %o2
st %o1, [%i1+%o2]
/* do rotate and PC-2 16 times */
cmp %g1, 15
bne rotate
inc %g1
ret
restore
.size des_keysched, .-des_keysched
#endif
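The rotate1 and rotate2 paths in the routine above rotate each 28-bit key half left by one or two positions while the half sits in bits 31:4 of a 32-bit register. The C sketch below models that step and the per-round rotation amounts. The names des_rot and rotl28_hi are hypothetical, and the register layout is inferred from the shift and mask constants in the listing.

```c
#include <stdint.h>

/* Left-rotation amount for each of the 16 rounds (index 0 = round 1).
   Indices 0, 1, 8 and 15 rotate by one bit, all others by two. */
static const unsigned des_rot[16] = {
    1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1
};

/* Rotate left by n (1 or 2) a 28-bit value held in bits 31:4 of v,
   with bits 3:0 assumed to be zero. */
static uint32_t rotl28_hi(uint32_t v, unsigned n)
{
    /* Bits shifted out of the top re-enter just above the unused nibble. */
    uint32_t wrapped = (v >> (28u - n)) & (((1u << n) - 1u) << 4);

    return (uint32_t)(v << n) | wrapped;
}
```

The rotation amounts sum to 28, so applying all 16 steps returns each half to its starting value.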
des_sbox.s
.section ".data"
.type S1, #object
.size S1, 64
S1:
.byte 14
.byte 4
.byte 13
.byte 1
.byte 2
.byte 15
.byte 11
.byte 8
.byte 3
.byte 10
.byte 6
.byte 12
.byte 5
.byte 9
.byte 0
.byte 7
.byte 0
.byte 15
.byte 7
.byte 4
.byte 14
.byte 2
.byte 13
.byte 1
.byte 10
.byte 6
.byte 12
.byte 11
.byte 9
.byte 5
.byte 3
.byte 8
.byte 4
.byte 1
.byte 14
.byte 8
.byte 13
.byte 6
.byte 2
.byte 11
.byte 15
.byte 12
.byte 9
.byte 7
.byte 3
.byte 10
.byte 5
.byte 0
.byte 15
.byte 12
.byte 8
.byte 2
.byte 4
.byte 9
.byte 1
.byte 7
.byte 5
.byte 11
.byte 3
.byte 14
.byte 10
.byte 0
.byte 6
.byte 13
.type S2, #object
.size S2, 64
S2:
.byte 15
.byte 1
.byte 8
.byte 14
.byte 6
.byte 11
.byte 3
.byte 4
.byte 9
.byte 7
.byte 2
.byte 13
.byte 12
.byte 0
.byte 5
.byte 10
.byte 3
.byte 13
.byte 4
.byte 7
.byte 15
.byte 2
.byte 8
.byte 14
.byte 12
.byte 0
.byte 1
.byte 10
.byte 6
.byte 9
.byte 11
.byte 5
.byte 0
.byte 14
.byte 7
.byte 11
.byte 10
.byte 4
.byte 13
.byte 1
.byte 5
.byte 8
.byte 12
.byte 6
.byte 9
.byte 3
.byte 2
.byte 15
.byte 13
.byte 8
.byte 10
.byte 1
.byte 3
.byte 15
.byte 4
.byte 2
.byte 11
.byte 6
.byte 7
.byte 12
.byte 0
.byte 5
.byte 14
.byte 9
.type S3, #object
.size S3, 64
S3:
.byte 10
.byte 0
.byte 9
.byte 14
.byte 6
.byte 3
.byte 15
.byte 5
.byte 1
.byte 13
.byte 12
.byte 7
.byte 11
.byte 4
.byte 2
.byte 8
.byte 13
.byte 7
.byte 0
.byte 9
.byte 3
.byte 4
.byte 6
.byte 10
.byte 2
.byte 8
.byte 5
.byte 14
.byte 12
.byte 11
.byte 15
.byte 1
.byte 13
.byte 6
.byte 4
.byte 9
.byte 8
.byte 15
.byte 3
.byte 0
.byte 11
.byte 1
.byte 2
.byte 12
.byte 5
.byte 10
.byte 14
.byte 7
.byte 1
.byte 10
.byte 13
.byte 0
.byte 6
.byte 9
.byte 8
.byte 7
.byte 4
.byte 15
.byte 14
.byte 3
.byte 11
.byte 5
.byte 2
.byte 12
.type S4, #object
.size S4, 64
S4:
.byte 7
.byte 13
.byte 14
.byte 3
.byte 0
.byte 6
.byte 9
.byte 10
.byte 1
.byte 2
.byte 8
.byte 5
.byte 11
.byte 12
.byte 4
.byte 15
.byte 13
.byte 8
.byte 11
.byte 5
.byte 6
.byte 15
.byte 0
.byte 3
.byte 4
.byte 7
.byte 2
.byte 12
.byte 1
.byte 10
.byte 14
.byte 9
.byte 10
.byte 6
.byte 9
.byte 0
.byte 12
.byte 11
.byte 7
.byte 13
.byte 15
.byte 1
.byte 3
.byte 14
.byte 5
.byte 2
.byte 8
.byte 4
.byte 3
.byte 15
.byte 0
.byte 6
.byte 10
.byte 1
.byte 13
.byte 8
.byte 9
.byte 4
.byte 5
.byte 11
.byte 12
.byte 7
.byte 2
.byte 14
.type S5, #object
.size S5, 64
S5:
.byte 2
.byte 12
.byte 4
.byte 1
.byte 7
.byte 10
.byte 11
.byte 6
.byte 8
.byte 5
.byte 3
.byte 15
.byte 13
.byte 0
.byte 14
.byte 9
.byte 14
.byte 11
.byte 2
.byte 12
.byte 4
.byte 7
.byte 13
.byte 1
.byte 5
.byte 0
.byte 15
.byte 10
.byte 3
.byte 9
.byte 8
.byte 6
.byte 4
.byte 2
.byte 1
.byte 11
.byte 10
.byte 13
.byte 7
.byte 8
.byte 15
.byte 9
.byte 12
.byte 5
.byte 6
.byte 3
.byte 0
.byte 14
.byte 11
.byte 8
.byte 12
.byte 7
.byte 1
.byte 14
.byte 2
.byte 13
.byte 6
.byte 15
.byte 0
.byte 9
.byte 10
.byte 4
.byte 5
.byte 3
.type S6, #object
.size S6, 64
S6:
.byte 12
.byte 1
.byte 10
.byte 15
.byte 9
.byte 2
.byte 6
.byte 8
.byte 0
.byte 13
.byte 3
.byte 4
.byte 14
.byte 7
.byte 5
.byte 11
.byte 10
.byte 15
.byte 4
.byte 2
.byte 7
.byte 12
.byte 9
.byte 5
.byte 6
.byte 1
.byte 13
.byte 14
.byte 0
.byte 11
.byte 3
.byte 8
.byte 9
.byte 14
.byte 15
.byte 5
.byte 2
.byte 8
.byte 12
.byte 3
.byte 7
.byte 0
.byte 4
.byte 10
.byte 1
.byte 13
.byte 11
.byte 6
.byte 4
.byte 3
.byte 2
.byte 12
.byte 9
.byte 5
.byte 15
.byte 10
.byte 11
.byte 14
.byte 1
.byte 7
.byte 6
.byte 0
.byte 8
.byte 13
.type S7, #object
.size S7, 64
S7:
.byte 4
.byte 11
.byte 2
.byte 14
.byte 15
.byte 0
.byte 8
.byte 13
.byte 3
.byte 12
.byte 9
.byte 7
.byte 5
.byte 10
.byte 6
.byte 1
.byte 13
.byte 0
.byte 11
.byte 7
.byte 4
.byte 9
.byte 1
.byte 10
.byte 14
.byte 3
.byte 5
.byte 12
.byte 2
.byte 15
.byte 8
.byte 6
.byte 1
.byte 4
.byte 11
.byte 13
.byte 12
.byte 3
.byte 7
.byte 14
.byte 10
.byte 15
.byte 6
.byte 8
.byte 0
.byte 5
.byte 9
.byte 2
.byte 6
.byte 11
.byte 13
.byte 8
.byte 1
.byte 4
.byte 10
.byte 7
.byte 9
.byte 5
.byte 0
.byte 15
.byte 14
.byte 2
.byte 3
.byte 12
.type S8, #object
.size S8, 64
S8:
.byte 13
.byte 2
.byte 8
.byte 4
.byte 6
.byte 15
.byte 11
.byte 1
.byte 10
.byte 9
.byte 3
.byte 14
.byte 5
.byte 0
.byte 12
.byte 7
.byte 1
.byte 15
.byte 13
.byte 8
.byte 10
.byte 3
.byte 7
.byte 4
.byte 12
.byte 5
.byte 6
.byte 11
.byte 0
.byte 14
.byte 9
.byte 2
.byte 7
.byte 11
.byte 4
.byte 1
.byte 9
.byte 12
.byte 14
.byte 2
.byte 0
.byte 6
.byte 10
.byte 13
.byte 15
.byte 3
.byte 5
.byte 8
.byte 2
.byte 1
.byte 14
.byte 7
.byte 4
.byte 10
.byte 8
.byte 13
.byte 15
.byte 12
.byte 9
.byte 0
.byte 3
.byte 5
.byte 6
.byte 11
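Each table above stores one S-box as four rows of 16 entries. DES selects the row with the outer two bits of the 6-bit input and the column with the middle four bits. The C sketch below shows that indexing with the S1 values from this file. The function names des_sbox_index and des_s1 are hypothetical.

```c
#include <stdint.h>

/* S1 in the same row-major order as the .byte listing above. */
static const uint8_t S1[64] = {
    14,  4, 13,  1,  2, 15, 11,  8,  3, 10,  6, 12,  5,  9,  0,  7,
     0, 15,  7,  4, 14,  2, 13,  1, 10,  6, 12, 11,  9,  5,  3,  8,
     4,  1, 14,  8, 13,  6,  2, 11, 15, 12,  9,  7,  3, 10,  5,  0,
    15, 12,  8,  2,  4,  9,  1,  7,  5, 11,  3, 14, 10,  0,  6, 13
};

/* Map a 6-bit S-box input b5..b0 to a table offset:
   row = (b5, b0), column = b4..b1, offset = row * 16 + column. */
static unsigned des_sbox_index(unsigned six)
{
    unsigned row = ((six >> 4) & 0x2u) | (six & 0x1u);
    unsigned col = (six >> 1) & 0xFu;

    return (row << 4) | col;
}

static uint8_t des_s1(unsigned six)
{
    return S1[des_sbox_index(six & 0x3Fu)];
}
```

For the input 011011, the row is 01 and the column is 1101, so des_s1(0x1B) returns 5.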
des_subproc.s
#if USE_KEY_F==0
.include "des_sbox.s"
#endif
.section ".text"
#if USE_PMT==0
/* des_ip : DES initial permutation
arguments:
%i0 = starting address of two-word input buffer
%i1 = starting address of two-word output buffer
locals:
%l0 = left half of input
%l1 = right half of input
%l2 = "work" register
%l3 = bit mask
*/
.align 4
.global des_ip
.type des_ip, #function
.proc 04
des_ip:
save %sp, -128, %sp
ld [%i0], %l0
ld [%i0+4], %l1
/* work = ((left >> 4) ^ right) & 0x0f0f0f0f */
srl %l0, 4, %l2
xor %l2, %l1, %l2
sethi %hi(0x0f0f0f0f), %l3
or %l3, %lo(0x0f0f0f0f), %l3
and %l2, %l3, %l2
xor %l1, %l2, %l1 /* right ^= work */
/* left ^= work << 4 */
sll %l2, 4, %l4
xor %l0, %l4, %l0
/* work = ((left >> 16) ^ right) & 0x0000ffff */
srl %l0, 16, %l2
xor %l2, %l1, %l2
sethi %hi(0x0000ffff), %l3
or %l3, %lo(0x0000ffff), %l3
and %l2, %l3, %l2
xor %l1, %l2, %l1 /* right ^= work */
/* left ^= work << 16 */
sll %l2, 16, %l4
xor %l0, %l4, %l0
/* work = ((right >> 2) ^ left) & 0x33333333 */
srl %l1, 2, %l2
xor %l2, %l0, %l2
sethi %hi(0x33333333), %l3
or %l3, %lo(0x33333333), %l3
and %l2, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
/* right ^= (work << 2) */
sll %l2, 2, %l4
xor %l1, %l4, %l1
/* work = ((right >> 8) ^ left) & 0x00ff00ff */
srl %l1, 8, %l2
xor %l2, %l0, %l2
sethi %hi(0x00ff00ff), %l3
or %l3, %lo(0x00ff00ff), %l3
and %l2, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
/* right ^= (work << 8) */
sll %l2, 8, %l4
xor %l1, %l4, %l1
/* right <<<= 1 */
sll %l1, 1, %l5
srl %l1, 31, %l4
or %l5, %l4, %l1
/* work = (left ^ right) & 0xaaaaaaaa */
xor %l0, %l1, %l4
sethi %hi(0xaaaaaaaa), %l3
or %l3, %lo(0xaaaaaaaa), %l3
and %l4, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
xor %l1, %l2, %l1 /* right ^= work */
/* in this implementation, left and right are NOT to be rotated
since the SP-boxes are not used */
/* right >>>= 1 */
srl %l1, 1, %l5
sll %l1, 31, %l4
or %l5, %l4, %l1
st %l0, [%i1]
st %l1, [%i1+4]
ret
restore
.size des_ip, .-des_ip
/* des_fp : DES final permutation
arguments:
%i0 = starting address of two-word input buffer
%i1 = starting address of two-word output buffer
locals:
%l0 = left half of input
%l1 = right half of input
%l3 = bit mask
*/
.align 4
.global des_fp
.type des_fp, #function
.proc 04
des_fp:
save %sp, -128, %sp
ld [%i0], %l0
ld [%i0+4], %l1
/* left <<<= 1 */
sll %l0, 1, %l5
srl %l0, 31, %l4
or %l5, %l4, %l0
/* work = (left ^ right) & 0xaaaaaaaa */
xor %l0, %l1, %l4
sethi %hi(0xaaaaaaaa), %l3
or %l3, %lo(0xaaaaaaaa), %l3
and %l4, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
xor %l1, %l2, %l1 /* right ^= work */
/* left >>>= 1 */
srl %l0, 1, %l5
sll %l0, 31, %l4
or %l5, %l4, %l0
/* work = ((left >> 8) ^ right) & 0x00ff00ff */
srl %l0, 8, %l2
xor %l2, %l1, %l2
sethi %hi(0x00ff00ff), %l3
or %l3, %lo(0x00ff00ff), %l3
and %l2, %l3, %l2
xor %l1, %l2, %l1 /* right ^= work */
/* left ^= work << 8 */
sll %l2, 8, %l4
xor %l0, %l4, %l0
/* work = ((left >> 2) ^ right) & 0x33333333 */
srl %l0, 2, %l2
xor %l2, %l1, %l2
sethi %hi(0x33333333), %l3
or %l3, %lo(0x33333333), %l3
and %l2, %l3, %l2
xor %l1, %l2, %l1 /* right ^= work */
/* left ^= work << 2 */
sll %l2, 2, %l4
xor %l0, %l4, %l0
/* work = ((right >> 16) ^ left) & 0x0000ffff */
srl %l1, 16, %l2
xor %l2, %l0, %l2
sethi %hi(0x0000ffff), %l3
or %l3, %lo(0x0000ffff), %l3
and %l2, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
/* right ^= work << 16 */
sll %l2, 16, %l4
xor %l1, %l4, %l1
/* work = ((right >> 4) ^ left) & 0x0f0f0f0f */
srl %l1, 4, %l2
xor %l2, %l0, %l2
sethi %hi(0x0f0f0f0f), %l3
or %l3, %lo(0x0f0f0f0f), %l3
and %l2, %l3, %l2
xor %l0, %l2, %l0 /* left ^= work */
/* right ^= work << 4 */
sll %l2, 4, %l4
xor %l1, %l4, %l1
st %l1, [%i1]
st %l0, [%i1+4]
ret
restore
.size des_fp, .-des_fp
#endif
#if USE_KEY_F==0
/* des_f: perform the f-function for one round of DES
arguments:
%i0 = right-half of current round block
%i1 = pointer to current round key (stored in 2 words)
locals:
%l0 = left half of expanded block
%l1 = right half of expanded block
%l3 = bit mask
*/
.align 4
.global des_f
.type des_f, #function
.proc 04
des_f:
save %sp, -128, %sp
/* start with E expansion */
sll %i0, 31, %l2
or %g0, %l2, %l0
sethi %hi(0x7c000000), %l3
srl %i0, 1, %l2
and %l2, %l3, %l2
or %l2, %l0, %l0
sethi %hi(0x03f00000), %l3
srl %i0, 3, %l2
and %l2, %l3, %l2
or %l2, %l0, %l0
sethi %hi(0x000fc000), %l3
srl %i0, 5, %l2
and %l2, %l3, %l2
or %l2, %l0, %l0
sethi %hi(0x00003f00), %l3
or %l3, %lo(0x00003f00), %l3
srl %i0, 7, %l2
and %l2, %l3, %l2
or %l2, %l0, %l0
sethi %hi(0xfc000000), %l3
sll %i0, 15, %l2
and %l2, %l3, %l2
or %l2, %g0, %l1
sethi %hi(0x03f00000), %l3
sll %i0, 13, %l2
and %l2, %l3, %l2
or %l2, %l1, %l1
sethi %hi(0x000fc000), %l3
sll %i0, 11, %l2
and %l2, %l3, %l2
or %l2, %l1, %l1
sethi %hi(0x00003e00), %l3
or %l3, %lo(0x00003e00), %l3
sll %i0, 9, %l2
and %l2, %l3, %l2
or %l2, %l1, %l1
srl %i0, 23, %l2
and %l2, 0x100, %l2
or %l2, %l1, %l1
/* perform XOR with round key */
ld [%i1], %l4
ld [%i1+4], %l5
xor %l0, %l4, %l0
xor %l1, %l5, %l1
/* input result into S-Boxes */
/* %l2 holds value of P input */
sethi %hi(S1), %g1
or %g1, %lo(S1), %g1
srl %l0, 30, %l4
and %l4, 0x2, %l4
srl %l0, 26, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l0, 27, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 28, %l6
or %g0, %l6, %l2
sethi %hi(S2), %g1
or %g1, %lo(S2), %g1
srl %l0, 24, %l4
and %l4, 0x2, %l4
srl %l0, 20, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l0, 21, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 24, %l6
or %l2, %l6, %l2
sethi %hi(S3), %g1
or %g1, %lo(S3), %g1
srl %l0, 18, %l4
and %l4, 0x2, %l4
srl %l0, 14, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l0, 15, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 20, %l6
or %l2, %l6, %l2
sethi %hi(S4), %g1
or %g1, %lo(S4), %g1
srl %l0, 12, %l4
and %l4, 0x2, %l4
srl %l0, 8, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l0, 9, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 16, %l6
or %l2, %l6, %l2
sethi %hi(S5), %g1
or %g1, %lo(S5), %g1
srl %l1, 30, %l4
and %l4, 0x2, %l4
srl %l1, 26, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l1, 27, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 12, %l6
or %l2, %l6, %l2
sethi %hi(S6), %g1
or %g1, %lo(S6), %g1
srl %l1, 24, %l4
and %l4, 0x2, %l4
srl %l1, 20, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l1, 21, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 8, %l6
or %l2, %l6, %l2
sethi %hi(S7), %g1
or %g1, %lo(S7), %g1
srl %l1, 18, %l4
and %l4, 0x2, %l4
srl %l1, 14, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l1, 15, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
sll %l6, 4, %l6
or %l2, %l6, %l2
sethi %hi(S8), %g1
or %g1, %lo(S8), %g1
srl %l1, 12, %l4
and %l4, 0x2, %l4
srl %l1, 8, %l5
and %l5, 0x1, %l5
or %l5, %l4, %l4
srl %l1, 9, %l5
and %l5, 0xF, %l5
sll %l4, 4, %l4
or %l4, %l5, %l4
ldub [%l4+%g1], %l6
or %l2, %l6, %l2
/* do P permutation to obtain output */
sethi %hi(0x80000000), %l3
sll %l2, 15, %l4
and %l4, %l3, %l4
or %g0, %l4, %i0
sethi %hi(0x40402400), %l3
or %l3, %lo(0x40402400), %l3
sll %l2, 5, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x30000000), %l3
sll %l2, 17, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x08000000), %l3
sll %l2, 24, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x04000000), %l3
sll %l2, 6, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x02000000), %l3
sll %l2, 21, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x01000000), %l3
sll %l2, 9, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00880000), %l3
srl %l2, 8, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00200000), %l3
sll %l2, 12, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00100000), %l3
sll %l2, 14, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00040000), %l3
sll %l2, 4, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00020000), %l3
sll %l2, 16, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00011080), %l3
or %l3, %lo(0x00011080), %l3
srl %l2, 6, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00008100), %l3
or %l3, %lo(0x00008100), %l3
srl %l2, 15, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sethi %hi(0x00004000), %l3
srl %l2, 10, %l4
and %l4, %l3, %l4
or %i0, %l4, %i0
sll %l2, 11, %l4
and %l4, 0x800, %l4
or %i0, %l4, %i0
srl %l2, 20, %l4
and %l4, 0x200, %l4
or %i0, %l4, %i0
srl %l2, 13, %l4
and %l4, 0x40, %l4
or %i0, %l4, %i0
sll %l2, 3, %l4
and %l4, 0x20, %l4
or %i0, %l4, %i0
srl %l2, 22, %l4
and %l4, 0x10, %l4
or %i0, %l4, %i0
srl %l2, 7, %l4
and %l4, 0x9, %l4
or %i0, %l4, %i0
srl %l2, 19, %l4
and %l4, 0x4, %l4
or %i0, %l4, %i0
srl %l2, 27, %l4
and %l4, 0x2, %l4
or %i0, %l4, %i0
ret
restore
.size des_f, .-des_f
#endif
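The eight S-box lookups in des_f all follow one pattern: a 6-bit group is split into a 2-bit row (its two outer bits) and a 4-bit column (its four inner bits), and the row-major index 16*row + col selects a byte from the table. A minimal C sketch of that index computation is given here; the helper name `des_sbox_index` and its `lsb` parameter (the position of the group's lowest bit) are assumptions for illustration and do not appear in the thesis code.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the thesis code: mirrors the
 * srl/and/or/sll sequence that des_f runs before each ldub. */
unsigned des_sbox_index(uint32_t word, unsigned lsb)
{
    unsigned six = (word >> lsb) & 0x3F;              /* isolate the 6-bit group */
    unsigned row = ((six >> 4) & 0x2) | (six & 0x1);  /* outer bits b5 and b0    */
    unsigned col = (six >> 1) & 0xF;                  /* inner bits b4..b1       */
    return (row << 4) | col;                          /* 16*row + col            */
}
```

For the S3 lookup in the listing the group occupies bits 19..14 of %l0, so the corresponding call would be des_sbox_index(x, 14).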
des_ext_ecb.c
#include "des_test_vectors.h"
#if USE_PMT==0
void des_ip(unsigned long *, unsigned long *);
void des_fp(unsigned long *, unsigned long *);
#endif
#if USE_KEY_F==0
void des_keysched(unsigned long *, unsigned long[][2]);
unsigned long des_f(unsigned long, unsigned long *);
#endif
int main(void)
{
register unsigned long l, r, rtol_temp;
int i, k;
#if USE_PMT==1
register unsigned long res0, res1;
#else
unsigned long lr[2], result[2];
#endif
#if USE_KEY_F==0
unsigned long rndkeys[16][2];
#endif
#if SINGLE_TEST==0
for(k=0;k<VECTORS;k++)
{
#else
k=0;
#endif
#if USE_KEY_F==1
asm("deskey %[mkeyl], %[mkeyr]\n\t"
: : [mkeyl] "r" (*(TEST_KEY(k))), [mkeyr] "r" (*(TEST_KEY(k)+1)));
#if DIR==0
asm("desdir 0\n\t");
#else
#endif
#else
des_keysched(TEST_KEY(k), rndkeys);
#endif
#if USE_PMT==1
#if DIR==0
asm("desipl %[l_in], %[r_in], %[l]\n\t"
: [l] "=r" (l) : [l_in] "r" (*(TEST_PT(k))), [r_in] "r" (*(TEST_PT(k)+1)));
asm("desipr %[l_in], %[r_in], %[r]\n\t"
: [r] "=r" (r) : [l_in] "r" (*(TEST_PT(k))), [r_in] "r" (*(TEST_PT(k)+1)));
#else
asm("desipl %[l_in], %[r_in], %[l]\n\t"
: [l] "=r" (l) : [l_in] "r" (*(TEST_CT(k))), [r_in] "r" (*(TEST_CT(k)+1)));
asm("desipr %[l_in], %[r_in], %[r]\n\t"
: [r] "=r" (r) : [l_in] "r" (*(TEST_CT(k))), [r_in] "r" (*(TEST_CT(k)+1)));
#endif
#else
#if DIR==0
lr[0] = *(TEST_PT(k));
lr[1] = *(TEST_PT(k)+1);
#else
lr[0] = *(TEST_CT(k));
lr[1] = *(TEST_CT(k)+1);
#endif
des_ip(lr, lr);
l = lr[0]; r = lr[1];
#endif
#if USE_KEY_F==1
for(i = 0; i < 16; i++)
{
rtol_temp = r;
asm ("desf %[rin], %[rout]\n\t" : [rout] "+r" (r) : [rin] "r" (r));
r = r ^ l;
l = rtol_temp;
}
#else
#if DIR==0
for(i = 0; i < 16; i++)
#else
for(i = 15; i >= 0; i--)
#endif
{
rtol_temp = r;
r = des_f(r, rndkeys[i]) ^ l;
l = rtol_temp;
}
#endif
#if USE_PMT==1
// these instructions won't get compiled in under the usual build settings
// unless res0 and res1 get used!
asm("desfpl %[r], %[l], %[res0]\n\t"
: [res0] "=r" (res0) : [r] "r" (r), [l] "r" (l));
asm("desfpr %[r], %[l], %[res1]\n\t"
: [res1] "=r" (res1) : [r] "r" (r), [l] "r" (l));
asm("cmp %0, %1\n\t" : : "r" (res0), "r" (res1) );
#else
lr[0] = l; lr[1] = r;
des_fp(lr, result);
#endif
#if FUNC_TEST
// if the result doesn't match the ciphertext,
// trap so we can exit immediately
#if USE_PMT==1
#if DIR==0
asm("cmp %0, %1\n\t"
: : "r" (res0), "r" (*(TEST_CT(k))));
asm("tne 0\n\tnop\n\t");
asm("cmp %0, %1\n\t"
: : "r" (res1), "r" (*(TEST_CT(k)+1)));
asm("tne 0\n\tnop\n\t");
#else
asm("cmp %0, %1\n\t"
: : "r" (res0), "r" (*(TEST_PT(k))));
asm("tne 0\n\tnop\n\t");
asm("cmp %0, %1\n\t"
: : "r" (res1), "r" (*(TEST_PT(k)+1)));
asm("tne 0\n\tnop\n\t");
#endif
#else
#if DIR==0
asm("cmp %0, %1\n\t" : : "r" (result[0]), "r" (*(TEST_CT(k))));
asm("cmp %0, %1\n\t" : : "r" (result[1]), "r" (*(TEST_CT(k)+1)));
#else
asm("cmp %0, %1\n\t" : : "r" (result[0]), "r" (*(TEST_PT(k))));
asm("cmp %0, %1\n\t" : : "r" (result[1]), "r" (*(TEST_PT(k)+1)));
#endif
#endif
#endif
#if SINGLE_TEST==0
}
#endif
// test(s) passed, now make it stop!
asm("ta 0\n\tnop\n\t");
return 0;
}
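The round loop in main() is a standard Feistel ladder: each round saves r, replaces r with f(r) XOR l, and moves the saved value into l. In the USE_PMT==1 build the final permutation instructions then receive r before l, that is, the halves in swapped order. With that structure, running the same ladder with the round keys in reverse order undoes it, which is why the DIR==1 software build only reverses the loop index. The self-contained sketch below demonstrates the property. The round function `toy_f` is a made-up stand-in and is not the DES f-function; `feistel16` and `feistel_roundtrip_ok` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

#define ROUNDS 16

/* Stand-in round function (an assumption for illustration only);
 * any function of (r, key) works, it need not be invertible. */
static uint32_t toy_f(uint32_t r, uint32_t k)
{
    return ((r << 3) | (r >> 29)) ^ (r * 0x9E3779B1u) ^ k;
}

/* Same ladder as the loop in main(): decrypt != 0 walks the keys backwards. */
void feistel16(uint32_t *l, uint32_t *r, const uint32_t keys[ROUNDS], int decrypt)
{
    uint32_t left = *l, right = *r, tmp;
    int i;

    for (i = 0; i < ROUNDS; i++) {
        uint32_t k = keys[decrypt ? ROUNDS - 1 - i : i];
        tmp = right;
        right = toy_f(right, k) ^ left;
        left = tmp;
    }
    /* Hand the halves back swapped, as the final permutation expects. */
    *l = right;
    *r = left;
}

/* Returns 1 if decrypting an encrypted block restores the input. */
int feistel_roundtrip_ok(uint32_t l, uint32_t r)
{
    uint32_t keys[ROUNDS], l2 = l, r2 = r;
    int i;

    for (i = 0; i < ROUNDS; i++)
        keys[i] = 0x01010101u * (uint32_t)(i + 1);   /* arbitrary test keys */

    feistel16(&l2, &r2, keys, 0);   /* encrypt */
    feistel16(&l2, &r2, keys, 1);   /* decrypt */
    return l2 == l && r2 == r;
}
```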
tdes_ext_ecb_e.c
#include "tdes_test_vectors.h"
#if USE_PMT==0
void des_ip(unsigned long *, unsigned long *);
void des_fp(unsigned long *, unsigned long *);
#endif
#if USE_KEY_F==0
void des_keysched(unsigned long *, unsigned long[][2]);
unsigned long des_f(unsigned long, unsigned long *);
#endif
int main(void)
{
register unsigned long l, r, rtol_temp;
int i, k;
#if USE_PMT==1
register unsigned long res0, res1;
#else
unsigned long lr[2], result[2];
#endif
#if USE_KEY_F==0
unsigned long rndkeys[3][16][2];
#else
register unsigned long k1l, k1r, k2l, k2r, k3l, k3r;
#endif
#if SINGLE_TEST==0
for(k=0;k<VECTORS;k++)
{
#else
k=10;
#endif
#if USE_PMT==1
asm("desipl %[l_in], %[r_in], %[l]\n\t"
: [l] "=r" (l) : [l_in] "r" (*(TEST_PT(k))), [r_in] "r" (*(TEST_PT(k)+1)));
asm("desipr %[l_in], %[r_in], %[r]\n\t"
: [r] "=r" (r) : [l_in] "r" (*(TEST_PT(k))), [r_in] "r" (*(TEST_PT(k)+1)));
#else
lr[0] = *(TEST_PT(k));
lr[1] = *(TEST_PT(k)+1);
des_ip(lr, lr);
l = lr[0]; r = lr[1];
#endif
#if USE_KEY_F==0
des_keysched(TEST_KEY1(k), rndkeys[0]);
// DES iteration #1
for(i = 0; i < 16; i++)
{
rtol_temp = r;
r = des_f(r, rndkeys[0][i]) ^ l;
l = rtol_temp;
}
rtol_temp = r;
r = l;
l = rtol_temp;
// DES iteration #2
for(i = 15; i >= 0; i--)
{
rtol_temp = r;
r = des_f(r, rndkeys[1][i]) ^ l;
l = rtol_temp;
}
rtol_temp = r;
r = l;
l = rtol_temp;
// DES iteration #3
for(i = 0; i < 16; i++)
{
rtol_temp = r;
r = des_f(r, rndkeys[2][i]) ^ l;
l = rtol_temp;
}
#else // USE_KEY_F==1
k1l = *(TEST_KEY1(k)); k1r = *(TEST_KEY1(k)+1);
// DES iteration #1
asm("deskey %[mkeyl], %[mkeyr]\n\t"
: : [mkeyl] "r" (k1l), [mkeyr] "r" (k1r));
for(i = 0; i < 16; i++)
{
rtol_temp = r;
asm ("desf %[rin], %[rout]\n\t" : [rout] "+r" (r) : [rin] "r" (r));
r = r ^ l;
l = rtol_temp;
}
rtol_temp = r;
r = l;
l = rtol_temp;
// DES iteration #2
for(i = 0; i < 16; i++)
{
rtol_temp = r;
asm ("desf %[rin], %[rout]\n\t" : [rout] "+r" (r) : [rin] "r" (r));
r = r ^ l;
l = rtol_temp;
}
rtol_temp = r;
r = l;
l = rtol_temp;
// DES iteration #3
for(i = 0; i < 16; i++)
{
rtol_temp = r;
asm ("desf %[rin], %[rout]\n\t" : [rout] "+r" (r) : [rin] "r" (r));
r = r ^ l;
l = rtol_temp;
}
#endif
#if USE_PMT==1
// these instructions won't get compiled in under the usual build settings
// unless res0 and res1 get used!
asm("desfpl %[r], %[l], %[res0]\n\t"
: [res0] "=r" (res0) : [r] "r" (r), [l] "r" (l));
asm("desfpr %[r], %[l], %[res1]\n\t"
: [res1] "=r" (res1) : [r] "r" (r), [l] "r" (l));
#else
lr[0] = l; lr[1] = r;
des_fp(lr, result);
#endif
#if FUNC_TEST
#if USE_PMT==1
asm("cmp %0, %1\n\t"
: : "r" (res0), "r" (*(TEST_CT(k))));
asm("cmp %0, %1\n\t"
: : "r" (res1), "r" (*(TEST_CT(k)+1)));
#else
asm("cmp %0, %1\n\t" : : "r" (result[0]), "r" (*(TEST_CT(k))));
asm("cmp %0, %1\n\t" : : "r" (result[1]), "r" (*(TEST_CT(k)+1)));
#endif
#else
#if USE_PMT==1
asm("cmp %0, %1\n\t" : : "r" (res0), "r" (res1) );
#endif
#endif
#if SINGLE_TEST==0
}
#endif
return 0;
}
IDEA
mmul16.s
.section ".text"
/*
mmul16: calculate %i0 * %i1 mod (2^16+1)
used when custom "mmul16" instruction is not available
NOTE: 0x0000 must be treated as 2^16!
*/
.align 4
.global mmul16
.type mmul16, #function
.proc 04
mmul16:
save %sp, -128, %sp
/*
p = a * b;
pm = p & 0xFFFF; // pm = p mod (2^16)
pd = p >> 16; // pd = p div (2^16)
if(pm < pd)
pm += 0x10001;
p = pm - pd;
*/
sethi %hi(0x10001), %l2
or %l2, %lo(0x10001), %l2
sethi %hi(0xFFFF), %l3
or %l3, %lo(0xFFFF), %l3
! handle zero inputs
cmp %i0, 0
beq a_z
cmp %i1, 0
beq a_nz_b_z
ab_nz:
! %i0 != 0, %i1 != 0
mov %i0, %o0
call .umul, 0
mov %i1, %o1
and %l3, %o0, %l0
srl %o0, 16, %l1
cmp %l0, %l1
blu,a mdsub
add %l0, %l2, %l0
mdsub:
sub %l0, %l1, %i0
ret
restore
a_z:
cmp %i1, 0
beq a_z_b_z
! %i0 == 0, %i1 != 0 --> p = 0x10001 - %i1
sub %l2, %i1, %i0
b,a zreturn
a_nz_b_z:
! %i0 != 0, %i1 == 0 --> p = 0x10001 - %i0
sub %l2, %i0, %i0
b,a zreturn
a_z_b_z:
! %i0 == 0, %i1 == 0 --> p = 1
or %g0, 1, %i0
zreturn:
and %i0, %l3, %i0
ret
restore
.size mmul16, .-mmul16
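The routine above implements the usual low-high reduction for multiplication modulo 2^16+1. Because 2^16 is congruent to -1, a 32-bit product p is congruent to (p mod 2^16) - (p div 2^16), and an operand of 0x0000 (which stands for 2^16) turns the product into the negation of the other operand. The C sketch below restates that logic so values can be checked on a host machine. The names `mmul16_lowhigh` and `mmul16_naive` are illustrative assumptions; the naive version exists only as a cross-check and is not how the thesis code works.

```c
#include <stdint.h>

/* Low-high reduction, following the structure of mmul16.s. */
uint16_t mmul16_lowhigh(uint16_t a, uint16_t b)
{
    uint32_t p, lo, hi;

    if (a == 0)                       /* 2^16 * b == -b (mod 2^16+1) */
        return (uint16_t)(0x10001u - b);
    if (b == 0)                       /* a * 2^16 == -a (mod 2^16+1) */
        return (uint16_t)(0x10001u - a);

    p  = (uint32_t)a * b;
    lo = p & 0xFFFFu;                 /* p mod 2^16 */
    hi = p >> 16;                     /* p div 2^16 */
    if (lo < hi)
        lo += 0x10001u;               /* keep the difference non-negative */
    return (uint16_t)(lo - hi);       /* 2^16 is returned as 0x0000 */
}

/* Cross-check: map 0 to 2^16, multiply in 64 bits, reduce directly. */
uint16_t mmul16_naive(uint16_t a, uint16_t b)
{
    uint64_t x = a ? a : 0x10000u;
    uint64_t y = b ? b : 0x10000u;
    return (uint16_t)((x * y) % 0x10001u);
}
```

Both operands equal to zero fall through the first branch and yield 0x10001 truncated to 16 bits, which is 1, matching the a_z_b_z case of the assembly.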
idea_keysched.c
#include <string.h>
void key_sched_enc(unsigned short *mk, unsigned short *ek)
{
unsigned short t1[8], t2[8];
int i;
memcpy(t1, mk, 16);
for(i=0;i<6;i++)
{
memcpy(ek+(i*8),t1,16);
t2[0] = ((t1[1] << 9) | (t1[2] >> 7)) & 0xFFFF;
t2[1] = ((t1[2] << 9) | (t1[3] >> 7)) & 0xFFFF;
t2[2] = ((t1[3] << 9) | (t1[4] >> 7)) & 0xFFFF;
t2[3] = ((t1[4] << 9) | (t1[5] >> 7)) & 0xFFFF;
t2[4] = ((t1[5] << 9) | (t1[6] >> 7)) & 0xFFFF;
t2[5] = ((t1[6] << 9) | (t1[7] >> 7)) & 0xFFFF;
t2[6] = ((t1[7] << 9) | (t1[0] >> 7)) & 0xFFFF;
t2[7] = ((t1[0] << 9) | (t1[1] >> 7)) & 0xFFFF;
memcpy(t1,t2,16);
}
memcpy(ek+48,t1,8);
}
#if DIR==1
#if USE_MODMUL==1
unsigned short inverse(unsigned short a)
{
unsigned short p;
int i;
/* if a = 0x0000 (2^16) or a = 0x0001, then inv(a) = a */
if((a >> 1) == 0)
return a;
p = a;
for(i=1;i<16;i++)
{
if(p == 0x0000) // (2^16)^2 mod (2^16+1) = 1
p = 0x0001;
else
asm("mmul16 %[p1], %[p2], %[p]\n\t" : [p] "+r" (p) : [p1] "r" (p) , [p2] "r" (p));
if(p == 0x0000) // (2^16) * a mod (2^16+1) = -a mod (2^16+1)
p = 0x10001 - a;
else
asm("mmul16 %[p1], %[p2], %[p]\n\t" : [p] "+r" (p) : [p1] "r" (p) , [p2] "r" (a));
}
return p;
}
#else
unsigned short mmul16(unsigned short, unsigned short);
unsigned short inverse(unsigned short a)
{
unsigned short p;
int i;
/* if a = 0x0000 (2^16) or a = 0x0001, then inv(a) = a */
if((a >> 1) == 0)
return a;
p = a;
for(i=1;i<16;i++)
{
if(p == 0x0000) // (2^16)^2 mod (2^16+1) = 1
p = 0x0001;
else
p = mmul16(p,p);
if(p == 0x0000) // (2^16) * a mod (2^16+1) = -a mod (2^16+1)
p = 0x10001 - a;
else
p = mmul16(p,a);
}
return p;
}
#endif
void key_sched_dec(unsigned short *ek, unsigned short *dk)
{
int i;
dk[0] = inverse(ek[48]);
dk[1] = - ek[49];
dk[2] = - ek[50];
dk[3] = inverse(ek[51]);
for(i=1;i<8;i++)
{
dk[6*(i-1)+4] = ek[6*(8-i)+4];
dk[6*(i-1)+5] = ek[6*(8-i)+5];
dk[6*i+0] = inverse(ek[6*(8-i)+0]);
dk[6*i+1] = - ek[6*(8-i)+2];
dk[6*i+2] = - ek[6*(8-i)+1];
dk[6*i+3] = inverse(ek[6*(8-i)+3]);
}
dk[46] = ek[4];
dk[47] = ek[5];
dk[48] = inverse(ek[0]);
dk[49] = - ek[1];
dk[50] = - ek[2];
dk[51] = inverse(ek[3]);
}
#endif
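The inverse() routine relies on Fermat's little theorem: 2^16+1 = 65537 is prime, so a^(65537-2) = a^(2^16-1) is the multiplicative inverse of a. Starting from p = a, each of the 15 loop iterations squares p and then multiplies by a, which maps the exponent e to 2e+1 and reaches 2^16-1 after the last pass. The sketch below restates the loop in portable C, using a plain 64-bit modular multiply in place of the mmul16 instruction. The names `modmul`, `idea_inverse` and `idea_inverse_ok` are illustrative assumptions, not thesis code.

```c
#include <stdint.h>

/* Multiply modulo 2^16+1, with 0x0000 standing for 2^16. */
static uint16_t modmul(uint16_t a, uint16_t b)
{
    uint64_t x = a ? a : 0x10000u;
    uint64_t y = b ? b : 0x10000u;
    return (uint16_t)((x * y) % 0x10001u);
}

/* a^(2^16 - 1) mod (2^16 + 1) by repeated square-and-multiply. */
uint16_t idea_inverse(uint16_t a)
{
    uint16_t p = a;
    int i;

    for (i = 1; i < 16; i++) {
        p = modmul(p, p);   /* exponent e  -> 2e     */
        p = modmul(p, a);   /* exponent 2e -> 2e + 1 */
    }
    return p;
}

/* Returns 1 if a * inverse(a) == 1 (mod 2^16+1). */
int idea_inverse_ok(uint16_t a)
{
    return modmul(a, idea_inverse(a)) == 1;
}
```

Because `modmul` handles the 0x0000 encoding itself, the explicit special cases in the listing are not needed in this sketch.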
idea_ext_ecb.c
#include "idea_test_vectors.h"
void key_sched_enc(unsigned short *, unsigned short *);
void key_sched_dec(unsigned short *, unsigned short *);
#if USE_MODMUL==0
unsigned short mmul16(unsigned short, unsigned short);
#endif
int main()
{
unsigned short encrypt_keys[52], *key;
#if DIR==1
unsigned short decrypt_keys[52];
#endif
register unsigned long t0_1, t0_2, t0_3, t0_4,
t1_1, t1_2, t1_3, t1_4,
t2_1, t2_2, ttt,
t3_1, t3_2, t3_3, t3_4;
int i, k;
#if SINGLE_TEST==0
for(k=0;k<VECTORS;k++)
{
#else
k=1;
#endif
key_sched_enc(TEST_KEY(k), encrypt_keys);
#if DIR==1
key_sched_dec(encrypt_keys, decrypt_keys);
#endif
#if DIR==0
t0_1 = *(TEST_PT(k)); t0_2 = *(TEST_PT(k)+1); t0_3 = *(TEST_PT(k)+2); t0_4 = *(TEST_PT(k)+3);
key = encrypt_keys;
#else
t0_1 = *(TEST_CT(k)); t0_2 = *(TEST_CT(k)+1); t0_3 = *(TEST_CT(k)+2); t0_4 = *(TEST_CT(k)+3);
key = decrypt_keys;
#endif
for(i=0;i<8;i++)
{
#if USE_MODMUL==1
#else
t1_1 = mmul16(t0_1,key[6*i+0]);
#endif
t1_2 = (t0_2 + key[6*i+1]) & 0x0000FFFF;
t1_3 = (t0_3 + key[6*i+2]) & 0x0000FFFF;
#if USE_MODMUL==1
#else
t1_4 = mmul16(t0_4,key[6*i+3]);
#endif
t2_1 = t1_1 ^ t1_3;
t2_2 = t1_2 ^ t1_4;
#if USE_MODMUL==1
#else
t3_1 = mmul16(t2_1,key[6*i+4]);
#endif
t3_2 = (t2_2 + t3_1) & 0x0000FFFF;
#if USE_MODMUL==1
#else
t3_4 = mmul16(t3_2,key[6*i+5]);
#endif
t3_3 = (t3_1 + t3_4) & 0x0000FFFF;
t0_1 = t1_1 ^ t3_4;
t0_2 = t1_3 ^ t3_4;
t0_3 = t1_2 ^ t3_3;
t0_4 = t1_4 ^ t3_3;
}
#if USE_MODMUL==1
#else
t0_1 = mmul16(t0_1,key[48]);
#endif
ttt = t0_2;
t0_2 = (t0_3 + key[49]) & 0x0000FFFF;
t0_3 = (ttt + key[50]) & 0x0000FFFF;
#if USE_MODMUL==1
#else
t0_4 = mmul16(t0_4,key[51]);
#endif
#if FUNC_TEST==1
#if DIR==0
asm("cmp %0, %1\n\t" : : "r" (t0_1), "r" (*(TEST_CT(k))));
asm("cmp %0, %1\n\t" : : "r" (t0_2), "r" (*(TEST_CT(k)+1)));
#else
asm("cmp %0, %1\n\t" : : "r" (t0_1), "r" (*(TEST_PT(k))));
asm("cmp %0, %1\n\t" : : "r" (t0_2), "r" (*(TEST_PT(k)+1)));
#endif
#else
// use the operands so the compiler keeps all the instructions in
asm("cmp %0, %1\n\t" : : "r" (t0_1), "r" (t0_2));
asm("cmp %0, %1\n\t" : : "r" (t0_3), "r" (t0_4));
#endif
#if SINGLE_TEST==0
}
#endif
return 0;
}
AES
aes_common.c
// aes_common.c -- common data for encrypt and decrypt routines
#if !(USE_SBOX==1 || USE_SBOX==4)
const unsigned char sbox[256]={
0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,
0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,
0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,
0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,
0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,
0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,
0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,
0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,
0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,
0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,
0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,
0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,
0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,
0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,
0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,
0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,
0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16 };
const unsigned char inv_sbox[256] = {
0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38,
0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb,
0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87,
0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb,
0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d,
0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e,
0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2,
0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25,
0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16,
0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92,
0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda,
0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84,
0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a,
0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06,
0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02,
0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b,
0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea,
0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73,
0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85,
0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e,
0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89,
0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b,
0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20,
0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4,
0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31,
0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f,
0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d,
0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef,
0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0,
0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61,
0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26,
0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d };
#endif
const unsigned char rcon[12]={ 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20,
0x40, 0x80, 0x1b, 0x36, 0x6c };
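The rcon table holds successive powers of x in GF(2^8), reduced by the AES polynomial 0x11b, with a leading 0x00 so that the table can be indexed by round number starting at 1. A short sketch that regenerates the entries is shown below; the helper names `gf_xtime` and `rcon_entry` are assumptions introduced for this example.

```c
#include <stdint.h>

/* Multiply one GF(2^8) element by x, reducing by x^8+x^4+x^3+x+1 (0x11b). */
static uint8_t gf_xtime(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1b : 0x00));
}

/* Entry i of the table: 0x00 for i == 0, otherwise x^(i-1). */
uint8_t rcon_entry(unsigned i)
{
    uint8_t v = 0x01;

    if (i == 0)
        return 0x00;
    while (--i)
        v = gf_xtime(v);
    return v;
}
```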
aes_keysched.c
#include "aes_common.h"
// Functions for precomputed key schedule
/**
* Precompute key schedule for 128 bit key.
* @param key Must be an array of length 4 of columns of the 128 bit
* encryption key.
* @param key_schedule An array of (Nr+1) round keys, each in row-wise
* form. For 128 bit keys (Nr+1) == 11.
*/
void AES_keySchedule128(unsigned long* key, unsigned long key_schedule[][4])
{
int i;
// The encryption key is the first round key
key_schedule[0][0] = key[0];
key_schedule[0][1] = key[1];
key_schedule[0][2] = key[2];
key_schedule[0][3] = key[3];
for (i = 1; i <= 10; i++)
{
#if USE_SBOX==1
asm(
"sll %[kp3], 8, %%l0\n\t"
"srl %[kp3], 24, %%l1\n\t"
"or %%l0, %%l1, %%l0\n\t"
"sll %[rcon_i], 24, %%l2\n\t"
"aessb %%l0, 0x00, %%l0\n\t"
"aessb %%l0, 0x10, %%l0\n\t"
"aessb %%l0, 0x20, %%l0\n\t"
"aessb %%l0, 0x30, %%l0\n\t"
"xor %%l0, %%l2, %%l0\n\t"
"xor %%l0, %[kp0], %[kc0]\n\t"
: [kc0] "=r" (key_schedule[i][0])
: [kp0] "r" (key_schedule[i-1][0]), [kp3] "r" (key_schedule[i-1][3]), [rcon_i] "r"
((unsigned long)rcon[i])
: "%l0", "%l1", "%l2"
);
#elif USE_SBOX==4
asm(
"sll %[kp3], 8, %%l0\n\t"
"srl %[kp3], 24, %%l1\n\t"
"or %%l0, %%l1, %%l0\n\t"
"sll %[rcon_i], 24, %%l2\n\t"
"aessb4 %%l0, 0, %%l0\n\t"
"xor %%l0, %%l2, %%l0\n\t"
"xor %%l0, %[kp0], %[kc0]\n\t"
: [kc0] "=r" (key_schedule[i][0])
: [kp0] "r" (key_schedule[i-1][0]), [kp3] "r" (key_schedule[i-1][3]), [rcon_i] "r"
((unsigned long)rcon[i])
: "%l0", "%l1", "%l2"
);
#else
key_schedule[i][0] = key_schedule[i-1][0] ^
(
((unsigned long)sbox[(unsigned char)(key_schedule[i-1][3] >> 24)]) |
(((unsigned long)sbox[(unsigned char)(key_schedule[i-1][3] >> 16)]) << 24) |
(((unsigned long)sbox[(unsigned char)(key_schedule[i-1][3] >> 8)]) << 16) |
(((unsigned long)sbox[(unsigned char)(key_schedule[i-1][3])]) << 8)
) ^
(((unsigned long)rcon[i]) << 24);
#endif
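key_schedule[i][1] = key_schedule[i-1][1] ^ key_schedule[i][0];
key_schedule[i][2] = key_schedule[i-1][2] ^ key_schedule[i][1];
key_schedule[i][3] = key_schedule[i-1][3] ^ key_schedule[i][2];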
}
}
aes_encrypt.c
// Helper function for MixColumns
#if USE_GFMMUL==0
/**
* Interprets the parameter word as four elements of GF(2^8) and
* doubles them using the irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x11b)
*/
static inline
unsigned long fourGFdouble(unsigned long word)
{
unsigned long reduction_mask = (word & 0x80808080) >> 7;
word = (word & 0x7f7f7f7f) << 1;
word ^= (reduction_mask * 0x1b);
return word;
}
#endif
void AES_encrypt128Precomputed_inline(unsigned long* plaintext, unsigned long key_schedule[][4],
unsigned long* ciphertext)
{
const int Nr = 10;
int i;
unsigned long tmp[4];
#if USE_GFMMUL==0
unsigned long tmp0, tmp1;
#endif
// ciphertext is used as state throughout the encryption
// unsigned long* state = ciphertext;
// NOTE: It is more beneficial to just work on local arrays
// as the compiler can hold the values in registers
unsigned long tmp2[4];
unsigned long* state = tmp2;
// On P4, memcpy and direct assignment result in same speed
// memcpy required additional includes, so we avoid it
//memcpy(state, plaintext, 16);
// The nine rounds
for (i = 1; i < Nr; i++)
{
#if USE_SBOX==1
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
#elif USE_SBOX==4
asm( "aessb4 %[s0], 0, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 0, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 0, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 0, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
#else
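tmp[0] = ((unsigned long)sbox[(unsigned char)(state[0] >> 24)]) << 24;
tmp[0] |= ((unsigned long)sbox[(unsigned char)(state[1] >> 16)]) << 16;
tmp[0] |= ((unsigned long)sbox[(unsigned char)(state[2] >> 8)]) << 8;
tmp[0] |= ((unsigned long)sbox[(unsigned char)(state[3])]);
tmp[1] = ((unsigned long)sbox[(unsigned char)(state[1] >> 24)]) << 24;
tmp[1] |= ((unsigned long)sbox[(unsigned char)(state[2] >> 16)]) << 16;
tmp[1] |= ((unsigned long)sbox[(unsigned char)(state[3] >> 8)]) << 8;
tmp[1] |= ((unsigned long)sbox[(unsigned char)(state[0])]);
tmp[2] = ((unsigned long)sbox[(unsigned char)(state[2] >> 24)]) << 24;
tmp[2] |= ((unsigned long)sbox[(unsigned char)(state[3] >> 16)]) << 16;
tmp[2] |= ((unsigned long)sbox[(unsigned char)(state[0] >> 8)]) << 8;
tmp[2] |= ((unsigned long)sbox[(unsigned char)(state[1])]);
tmp[3] = ((unsigned long)sbox[(unsigned char)(state[3] >> 24)]) << 24;
tmp[3] |= ((unsigned long)sbox[(unsigned char)(state[0] >> 16)]) << 16;
tmp[3] |= ((unsigned long)sbox[(unsigned char)(state[1] >> 8)]) << 8;
tmp[3] |= ((unsigned long)sbox[(unsigned char)(state[2])]);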
#endif
// MixColumns
#if USE_GFMMUL==1
asm(
);
#else
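// Note: fourGFdouble is still a separate function
// Column 0
tmp0 = fourGFdouble(tmp[0]);
tmp1 = tmp0 ^ tmp[0];
state[0] = tmp0 ^ ((tmp1 >> 24) | (tmp1 << 8)) ^ ((tmp[0] >> 16) | (tmp[0] << 16)) ^
((tmp[0] >> 8) | (tmp[0] << 24));
// Column 1
tmp0 = fourGFdouble(tmp[1]);
tmp1 = tmp0 ^ tmp[1];
state[1] = tmp0 ^ ((tmp1 >> 24) | (tmp1 << 8)) ^ ((tmp[1] >> 16) | (tmp[1] << 16)) ^
((tmp[1] >> 8) | (tmp[1] << 24));
// Column 2
tmp0 = fourGFdouble(tmp[2]);
tmp1 = tmp0 ^ tmp[2];
state[2] = tmp0 ^ ((tmp1 >> 24) | (tmp1 << 8)) ^ ((tmp[2] >> 16) | (tmp[2] << 16)) ^
((tmp[2] >> 8) | (tmp[2] << 24));
// Column 3
tmp0 = fourGFdouble(tmp[3]);
tmp1 = tmp0 ^ tmp[3];
state[3] = tmp0 ^ ((tmp1 >> 24) | (tmp1 << 8)) ^ ((tmp[3] >> 16) | (tmp[3] << 16)) ^
((tmp[3] >> 8) | (tmp[3] << 24));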
#endif
// Add round key
}
// Final round
#if USE_SBOX==1
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
#elif USE_SBOX==4
asm( "aessb4 %[s0], 0, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 0, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 0, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 0, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
#else
#endif
// Add round key
}
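In the software path (USE_GFMMUL==0), MixColumns is computed one 32-bit column at a time: the column is doubled byte-wise with fourGFdouble, the tripled value is formed by XORing the doubled and original words, and byte rotations line up the 2, 3, 1, 1 coefficients. The sketch below isolates that computation for a single column so it can be checked on a host machine. It uses `uint32_t` rather than `unsigned long` so that the rotations stay 32 bits wide on 64-bit hosts; the names `four_gf_double` and `mix_column` are assumptions made for this example.

```c
#include <stdint.h>

/* Double four packed GF(2^8) elements modulo 0x11b. */
static uint32_t four_gf_double(uint32_t word)
{
    uint32_t reduction_mask = (word & 0x80808080u) >> 7;
    word = (word & 0x7f7f7f7fu) << 1;
    return word ^ (reduction_mask * 0x1bu);
}

/* MixColumns on one column; the most significant byte is row 0. */
uint32_t mix_column(uint32_t col)
{
    uint32_t x2 = four_gf_double(col);   /* 2 * col */
    uint32_t x3 = x2 ^ col;              /* 3 * col */

    return x2
         ^ ((x3 >> 24) | (x3 << 8))      /* rotate left 8:  3 * (row + 1) */
         ^ ((col >> 16) | (col << 16))   /* rotate by 16:   1 * (row + 2) */
         ^ ((col >> 8) | (col << 24));   /* rotate right 8: 1 * (row + 3) */
}
```

With the commonly quoted test column db 13 53 45, mix_column(0xdb135345) evaluates to 0x8e4da1bc.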
aes_ext_ecb_e.c
// AES implementations including instruction set extensions
// developed by Sean R. O'Melia UML CNIS
// based on reference code provided by Stefan Tillich, IAIK, Graz Univ. of Technology
#include "aes_test_vectors.h"
void AES_encrypt128Precomputed_inline(unsigned long* plaintext, unsigned long key_schedule[][4],
unsigned long* ciphertext);
int main(void)
{
unsigned long state0[4], state1[4];
unsigned long rndkeys[11][4];
int k;
#if USE_GFMMUL==1
asm(
"set 0x80402010, %%l2\n\t"
"set 0x08040201, %%l3\n\t"
"set 0x40201088, %%l4\n\t"
"set 0x84028180, %%l5\n\t"
"set 0xc0603098, %%l6\n\t"
"set 0x8c068381, %%l7\n\t"
"gfmkld %%l4, %%l5\n\tnop\n\t"
: : : "%l2","%l3","%l4","%l5","%l6","%l7");
#endif
#if SINGLE_TEST==0
for(k=0;k<VECTORS;k++)
{
#else
k=0;
#endif
AES_keySchedule128( TEST_KEY(k), rndkeys );
state0[0] = *(TEST_PT(k));
state0[1] = *(TEST_PT(k)+1);
state0[2] = *(TEST_PT(k)+2);
state0[3] = *(TEST_PT(k)+3);
AES_encrypt128Precomputed_inline(state0, rndkeys, state1);
#if FUNC_TEST==1
asm( "cmp %0, %1\n\t" : : "r" (state1[0]), "r" (*(TEST_CT(k))) );
asm( "cmp %0, %1\n\t" : : "r" (state1[1]), "r" (*(TEST_CT(k)+1)) );
#endif
#if SINGLE_TEST==0
}
#endif
return 0;
}
aes_decrypt.c
// Helper function for MixColumns
#if USE_GFMMUL==0
/**
* Interprets the parameter word as four elements of GF(2^8) and
* doubles them using the irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x11b)
*/
static inline
unsigned long fourGFdouble(unsigned long word)
{
unsigned long reduction_mask = (word & 0x80808080) >> 7;
word = (word & 0x7f7f7f7f) << 1;
word ^= (reduction_mask * 0x1b);
return word;
}
#endif
void AES_decrypt128Precomputed_inline(unsigned long* ciphertext, unsigned long key_schedule[][4],
unsigned long* plaintext)
{
int Nr = 10;
int i;
unsigned long tmp[4];
// plaintext is used as state throughout the decryption
// unsigned long* state = plaintext;
// NOTE: It is more beneficial to just work on local arrays
// as the compiler can hold the values in registers
unsigned long tmp2[4];
unsigned long* state = tmp2;
#if USE_GFMMUL==0
unsigned long g2, g4, g9;
#endif
// AddRoundKey
for (i = Nr-1; i > 0; i--)
{
#if USE_SBOX==1
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
#elif USE_SBOX==4
asm( "aessb4 %[s0], 1, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 1, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 1, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 1, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
#else
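tmp[0] = ((unsigned long)inv_sbox[(unsigned char)(state[0] >> 24)]) << 24;
tmp[0] |= ((unsigned long)inv_sbox[(unsigned char)(state[3] >> 16)]) << 16;
tmp[0] |= ((unsigned long)inv_sbox[(unsigned char)(state[2] >> 8)]) << 8;
tmp[0] |= ((unsigned long)inv_sbox[(unsigned char)(state[1])]);
tmp[1] = ((unsigned long)inv_sbox[(unsigned char)(state[1] >> 24)]) << 24;
tmp[1] |= ((unsigned long)inv_sbox[(unsigned char)(state[0] >> 16)]) << 16;
tmp[1] |= ((unsigned long)inv_sbox[(unsigned char)(state[3] >> 8)]) << 8;
tmp[1] |= ((unsigned long)inv_sbox[(unsigned char)(state[2])]);
tmp[2] = ((unsigned long)inv_sbox[(unsigned char)(state[2] >> 24)]) << 24;
tmp[2] |= ((unsigned long)inv_sbox[(unsigned char)(state[1] >> 16)]) << 16;
tmp[2] |= ((unsigned long)inv_sbox[(unsigned char)(state[0] >> 8)]) << 8;
tmp[2] |= ((unsigned long)inv_sbox[(unsigned char)(state[3])]);
tmp[3] = ((unsigned long)inv_sbox[(unsigned char)(state[3] >> 24)]) << 24;
tmp[3] |= ((unsigned long)inv_sbox[(unsigned char)(state[2] >> 16)]) << 16;
tmp[3] |= ((unsigned long)inv_sbox[(unsigned char)(state[1] >> 8)]) << 8;
tmp[3] |= ((unsigned long)inv_sbox[(unsigned char)(state[0])]);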
#endif
// AddRoundKey
// InvMixColumns
#if USE_GFMMUL==1
asm(
);
#else
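// Column 0
g2 = fourGFdouble(tmp[0]);
g4 = fourGFdouble(g2);
g9 = tmp[0] ^ fourGFdouble(g4);
g4 ^= g9;
state[0] = tmp[0] ^ g2 ^ g4 ^
(((unsigned long)(g2^g9) >> 24) | ((unsigned long)(g2^g9) << 8)) ^
(((unsigned long)(g4) >> 16) | ((unsigned long)(g4) << 16)) ^
(((unsigned long)(g9) >> 8) | ((unsigned long)(g9) << 24));
// Column 1
g2 = fourGFdouble(tmp[1]);
g4 = fourGFdouble(g2);
g9 = tmp[1] ^ fourGFdouble(g4);
g4 ^= g9;
state[1] = tmp[1] ^ g2 ^ g4 ^
(((unsigned long)(g2^g9) >> 24) | ((unsigned long)(g2^g9) << 8)) ^
(((unsigned long)(g4) >> 16) | ((unsigned long)(g4) << 16)) ^
(((unsigned long)(g9) >> 8) | ((unsigned long)(g9) << 24));
// Column 2
g2 = fourGFdouble(tmp[2]);
g4 = fourGFdouble(g2);
g9 = tmp[2] ^ fourGFdouble(g4);
g4 ^= g9;
state[2] = tmp[2] ^ g2 ^ g4 ^
(((unsigned long)(g2^g9) >> 24) | ((unsigned long)(g2^g9) << 8)) ^
(((unsigned long)(g4) >> 16) | ((unsigned long)(g4) << 16)) ^
(((unsigned long)(g9) >> 8) | ((unsigned long)(g9) << 24));
// Column 3
g2 = fourGFdouble(tmp[3]);
g4 = fourGFdouble(g2);
g9 = tmp[3] ^ fourGFdouble(g4);
g4 ^= g9;
state[3] = tmp[3] ^ g2 ^ g4 ^
(((unsigned long)(g2^g9) >> 24) | ((unsigned long)(g2^g9) << 8)) ^
(((unsigned long)(g4) >> 16) | ((unsigned long)(g4) << 16)) ^
(((unsigned long)(g9) >> 8) | ((unsigned long)(g9) << 24));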
#endif
}
#if USE_SBOX==1
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
#elif USE_SBOX==4
asm( "aessb4 %[s0], 1, %[s0]\n\t"
: [s0] "+r" (state[0]) );
asm( "aessb4 %[s1], 1, %[s1]\n\t"
: [s1] "+r" (state[1]) );
asm( "aessb4 %[s2], 1, %[s2]\n\t"
: [s2] "+r" (state[2]) );
asm( "aessb4 %[s3], 1, %[s3]\n\t"
: [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[1] & 0x000000FF ;
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[0] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[2] & 0x000000FF ;
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[1] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[3] & 0x000000FF ;
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[2] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[0] & 0x000000FF ;
#else
#endif
// AddRoundKey
}
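The software InvMixColumns path builds the 9, 11, 13 and 14 multiples of a column from three byte-wise doublings: g2 = 2t, g4 = 4t and g9 = t XOR 8t, after which g4 XOR g9 gives 13t, g2 XOR g9 gives 11t, and t XOR g2 XOR 13t gives 14t. A host-side sketch of one column follows, again using `uint32_t` so the rotations stay 32 bits wide. The names `four_gf_double` and `inv_mix_column` are assumptions made for this example.

```c
#include <stdint.h>

/* Double four packed GF(2^8) elements modulo 0x11b. */
static uint32_t four_gf_double(uint32_t word)
{
    uint32_t reduction_mask = (word & 0x80808080u) >> 7;
    word = (word & 0x7f7f7f7fu) << 1;
    return word ^ (reduction_mask * 0x1bu);
}

/* InvMixColumns on one column; the most significant byte is row 0. */
uint32_t inv_mix_column(uint32_t t)
{
    uint32_t g2  = four_gf_double(t);          /* 2t  */
    uint32_t g4  = four_gf_double(g2);         /* 4t  */
    uint32_t g9  = t ^ four_gf_double(g4);     /* 9t  */
    uint32_t g13 = g4 ^ g9;                    /* 13t */
    uint32_t g11 = g2 ^ g9;                    /* 11t */

    return (t ^ g2 ^ g13)                      /* 14t on the diagonal      */
         ^ ((g11 >> 24) | (g11 << 8))          /* rotate left 8:  11t term */
         ^ ((g13 >> 16) | (g13 << 16))         /* rotate by 16:   13t term */
         ^ ((g9 >> 8) | (g9 << 24));           /* rotate right 8:  9t term */
}
```

Applied to 0x8e4da1bc, the MixColumns image of the test column db 13 53 45, the function returns 0xdb135345.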
aes_ext_ecb_d.c
// AES implementations including instruction set extensions
// developed by Sean R. O'Melia UML CNIS
// based on reference code provided by Stefan Tillich, IAIK, Graz Univ. of Technology
#include "aes_test_vectors.h"
void AES_decrypt128Precomputed_inline(unsigned long* ciphertext, unsigned long key_schedule[][4],
unsigned long* plaintext);
int main(void)
{
unsigned long state0[4], state1[4];
unsigned long rndkeys[11][4];
int k;
#if USE_GFMMUL==1
asm(
"set 0x90c8e472, %%l0\n\t"
"set 0xa9c46221, %%l1\n\t"
"set 0xd0e8f4fa, %%l2\n\t"
"set 0x2dc6e3a1, %%l3\n\t"
"set 0xb0d86cb6, %%l4\n\t"
"set 0xeb45a261, %%l5\n\t"
"set 0x70b85c2e, %%l6\n\t"
"set 0x674321e0, %%l7\n\t"
: : : "%l0","%l1","%l2","%l3","%l4","%l5","%l6","%l7");
#endif
#if SINGLE_TEST==0
for(k=0;k<VECTORS;k++)
{
#else
k=0;
#endif
AES_keySchedule128( TEST_KEY(k), rndkeys );
state0[0] = *(TEST_CT(k));
state0[1] = *(TEST_CT(k)+1);
state0[2] = *(TEST_CT(k)+2);
state0[3] = *(TEST_CT(k)+3);
AES_decrypt128Precomputed_inline(state0, rndkeys, state1);
#if FUNC_TEST==1
// if the result doesn't match the plaintext,
// trap so we can exit immediately
asm( "cmp %0, %1\n\t" : : "r" (state1[0]), "r" (*(TEST_PT(k))) );
asm( "cmp %0, %1\n\t" : : "r" (state1[1]), "r" (*(TEST_PT(k)+1)) );
#endif
#if SINGLE_TEST==0
}
#endif
return 0;
}
About the Author
Sean R. O'Melia earned the Bachelor of Science degree in the field of
Computer Engineering in 2005 from the University of Massachusetts Lowell. He has
been a member of the Institute of Electrical and Electronics Engineers (IEEE) since
2001 and of the IEEE Computer Society since 2006. From 2005 until the time of
publication he has worked in the area of software quality assurance for Analogic
Corporation in Peabody, Massachusetts.
