TUKE DP Petrvalsky 2012

Technical University of Košice
Faculty of Electrical Engineering and Informatics
ECC Cryptographic Library for ARM Processors
Master's Thesis
2012
Bc. Martin Petrvalský
Study Programme: Infoelectronics
Field of study: 5.2.13 Electronics
Department: Department of Electronics and Multimedia Communications (KEMT)
Supervisor: doc. Ing. Miloš Drutarovský, CSc.
Consultant(s):
Košice 2012
Abstract
This Master's thesis deals with the creation of a cryptographic library based on elliptic curves for ARM processors. It briefly describes the hardware and software used and the theoretical mathematical background that this type of cryptography requires, for example Galois fields and the theory of elliptic curves itself. The core of the thesis deals with building the library for these processors. The library consists of functions for addition, subtraction and multiplication of numbers, routines for Galois field arithmetic and elliptic curve point operations. At the top of the library are protocols for encryption and decryption. The aim was to use the C language in the library as much as possible. The next part of the thesis examines options for speed optimization of the library as well as countermeasures against well-known attacks. The end of the thesis explains how the test algorithms were created and tests all parts of the library. The testing consists of checking that the results are correct and of measuring the number of cycles needed for the operations. The thesis compares test results for different platforms, different algorithms and third-party implementations, and it evaluates the results achieved.
Keywords: ECC, library, ARM, processors
Abstract (translation of the Slovak abstract)
This diploma thesis deals with the creation of a cryptographic library based on elliptic curves for ARM processors. It briefly describes the hardware and software resources used and the theoretical knowledge from mathematics that is needed for this type of cryptography, for example Galois fields and the theory of elliptic curves itself. The core of the thesis deals with the creation of the library for the mentioned microprocessors. It contains the simplest functions for addition, subtraction and multiplication of numbers, more complex routines for arithmetic in Galois fields and mathematical operations with points on elliptic curves. The last part of the library consists of the protocols themselves, which enable encryption and decryption. The aim was to keep as much of the library as possible in the C language. The thesis then examines in detail the options for optimizing the library with respect to computation speed and for protecting the algorithms against known attacks. Finally, it explains the creation of the test algorithms and the testing of all parts of the library. This testing consists of verifying the correctness of the computations and measuring the number of cycles needed for the individual operations. Tests are compared for different platforms, different algorithms and different third-party implementations, and the results achieved are evaluated.
Keywords: ECC, library, ARM, processors
Declaration
I hereby declare that this thesis is my own work and effort. Where other sources of information have been used, they have been acknowledged.
Košice, April 27, 2012
...........................
Signature
Acknowledgement
I would like to express my sincere thanks to my supervisor, doc. Ing. Miloš Drutarovský, CSc. I also want to thank the companies Elcom and ST Microelectronics, which provided me with the hardware equipment necessary for my thesis. Special mention should go to Ing. Jaroslav Bn for his help with the optimization of the multiplication routines and to Ing. Marin Baa for his advice on linker and start-up scripts. I also have to thank prof. Peter Husr, who allowed me to finish my thesis at Ilmenau University of Technology. To all others who lent a hand, I say thank you very much.
Preface
Cryptography plays a huge role in today's world. A great part of our lives would be unimaginable without it. It appears on the internet, in banking, diplomacy, the military, telecommunications and many other areas of everyday life. The real beauty of cryptography is that it connects a huge amount of theoretical knowledge from mathematics with practical subjects in a very elegant way. Cryptographers have to solve many problems. One of the biggest and most obvious is security. Only a secure cryptographic system can survive and make a profit for its developer. The bit length of the keys used for encryption and the software implementation on the selected hardware are both tightly connected with security. The longer the keys, the more secure the whole system. Breaking a code with large keys could take many years, but key length has to be considered together with the particular implementation. The implementation needs to be bulletproof, because if secret information leaks the system cannot be used. One of the newest approaches in cryptography is based on elliptic curves, which are purely mathematical objects; it is called Elliptic Curve Cryptography (ECC). It is an asymmetric method with different keys for encryption and decryption. It can be used for encryption, decryption, key exchange and digital signatures. Its main advantage is that ECC uses short keys for a given security level. Small key sizes imply a faster implementation with lower power consumption, which can be useful in many applications, especially in micro- and nanotechnology. ARM processors have made a huge step forward in recent years. They can be found in microcontroller units, but today they are also widespread in cell phones, tablets and other portable electronics. They even appear in personal computers, which until now have been based only on the x86 or amd64 architecture.
If our goal is to create a secure, fast, low-power and reliable micro system, it is wise to choose the combination of elliptic curve cryptography and ARM microcontrollers.
Contents
Introduction
1 Hardware and software
1.1 Processors
1.1.1 ARM7TDMI
1.1.2 ARM Cortex-M0
1.1.3 ARM Cortex-M1
1.1.4 ARM Cortex-M3
1.2 Microcontroller units and evaluation boards
1.2.1 Philips LPC2138 and Keil MCB2130
1.2.2 NXP LPC1113 and NXP LPCXpresso1113
1.2.3 STM32, Keil MCBSTM32 and KEMT STM32
1.3 In-circuit debuggers
1.4 Development tools
1.4.1 Keil µVision IDE
1.4.2 Compilers
1.4.3 Low level system libraries
2 Elliptic curve cryptography
2.1 Finite field arithmetic
2.1.1 Prime field arithmetic
2.2 Elliptic curve arithmetic
2.2.1 Equations of elliptic curves
2.2.2 Point operations
2.2.3 NIST curves
2.2.4 Other coordinate systems
2.3 Cryptographic protocols
2.3.1 Signature schemes
2.3.2 Public-key encryption
2.3.3 Key establishment
3 Library
3.1 Basic routines
3.1.1 Pseudo-random number generator
3.2 Prime field arithmetic routines
3.3 Point operation routines
3.4 Protocol routines
3.5 Reference C implementation
3.6 Optimization of low-level routines
3.6.1 Instruction set analysis
3.6.2 Optimization of addition and subtraction
3.6.3 Optimization of multiplication
3.6.4 Optimization of multiplication for ARM Cortex-M0 and Cortex-M1
3.6.5 Optimization of multiplication for ARM7TDMI and ARM Cortex-M3
3.7 Optimization of high-level routines
3.8 Library testing
3.8.1 Test of multiplication routines
3.8.2 Test of inversion routines
3.8.3 Test of point operation routines
3.8.4 Test of cryptographic protocols routines
3.8.5 Comparisons
4 Conclusion
Bibliography
Appendices
Appendix A
Appendix B
Appendix C
Appendix D
List of Figures
2-1 Pyramidal hierarchy of elliptic curve cryptography. W represents the bit length of the multi-precision numbers used. Every level of the pyramid depends on the one below it. Together the levels create a support system for the protocols.
2-2 Graphical representation of ECC point addition (a) and point doubling (b). The example curves in the figure are defined over the real numbers. Real ECC applications use finite fields.
3-1 Pyramidal hierarchy of elliptic curve cryptography with labels showing which parts of the library are written in C and which will be optimized in assembler.
3-2 Graphical representation of multi-precision schoolbook multiplication C = A * B. This implementation is not very useful for our purpose because of the high number of memory accesses.
3-3 Comba multiplication is better for our application. It calculates the output number one whole word at a time and writes each word to memory. This approach can be good in some cases but it still needs many operand memory accesses.
3-4 One of the best implementations for our library is the hybrid multiplication method. It uses the whole register set that ARM processors provide. It loads multiple operand words from memory (e.g. for Cortex-M3 it is two 32-bit words for each operand) and performs schoolbook multiplication on these operand words stored in registers. The M3 processor has space for two words from each operand, so it computes a 64 × 64 bit multiplication. These multiplications are then combined into the whole multi-precision multiplication using the Comba method. In this picture 4 words from each operand are used. The method minimizes memory accesses, which speeds up the multiplication.
3-5 Comparison of speed and code size of 256 × 256 bit multiplication on different types of CPU. ARM7TDMI and Cortex-M3 have large code sizes and high speed. Cortex-M0 and Cortex-M1 have small code sizes but, on the other hand, are not very fast. This is caused by the different instruction sets of the processors. The GNU compiler with level 2 (speed) optimization was used for this test.
3-6 Comparison of speed of 256 × 256 bit multiplication on different targets for different processors. The simulator is always the fastest. Execution from RAM is second fastest and execution from Flash is slowest. ARM7TDMI and Cortex-M3 processors show very small differences between targets. For Cortex-M0 and M1 it is much faster to execute code from RAM than from Flash. The GNU compiler with level 2 (speed) optimization was used for this test; for the RAM tests the RealView compiler was used because of the lack of linker scripts for GNU projects.
3-7 Comparison of speed of 256 × 256 bit multiplication using the fastest C algorithm and the fastest assembler algorithm on different processors. For ARM7TDMI and ARM Cortex-M3 we managed to reduce the number of cycles to approximately half. Cortex-M0 and M1 have a simpler instruction set, so the C compiler can optimize the code better. Assembly code for these processors cannot be optimized as much as for ARM7TDMI or Cortex-M3. The GNU compiler with level 2 (speed) optimization was used for this test.
4-1 Photo of programming and debugging of the MCB2130 evaluation board (in the middle). The connector on the left side of the board is a USB type B connector which provides the power supply. At the bottom there is a COM connector with a USB-to-Serial converter attached. The Ulink2 USB-JTAG programmer and debugger is connected to the right side of the board.
4-2 Photo of programming and debugging of the KEMT_STM32 evaluation board (in the middle). The connector on the left side of the board is a COM connector with a USB-to-Serial converter attached. At the bottom there is a USB type B connector which provides the power supply. The ST-Link USB-JTAG programmer and debugger is connected to the right side of the board.
4-3 Photo of programming and debugging of the LPCXpresso evaluation board (in the middle left). It has no connectors except the mini USB power supply connector. If we want to connect a serial interface we need to use the pin headers and an external receiver/transceiver and connector (here we used the one on the MCBSTM32 board on the left side of the photo). The output pins Rx and Tx on the header of the LPCXpresso have to be connected to the PA10 and PA9 pins of the MCBSTM32 board, in this order. The MCU on the MCBSTM32 board needs to be switched off so that it does not interfere. We can, for example, start it in RAM mode, or program an empty infinite loop into it and start it from Flash. An easier solution would be to use a Serial(3v3)-to-USB interface. If we need to use our own debugger (Ulink2 on the right side) we have to connect it to the holes on the board prepared for the debugger connector. We use the SWD protocol for debugging and code loading. We have to connect the SWDIO, SWCLK, RESET, VCC and GND pins of the debugger to the corresponding holes on the LPCXpresso board. These holes can be found in the middle of the board, between the R-link and MCU parts of the board.
4-4 Detail of the connection between the LPCXpresso (right) and the MCBSTM32 (left). Pin 9 (Tx) on the pin header of the LPCXpresso is connected to the PA9 hole on the MCBSTM32 with a black wire. Pin 10 (Rx) is connected to PA10 with a red wire. The jumpers BOOT0 and BOOT1 on the MCBSTM32 are set to boot in RAM mode. This makes the processor on the board inactive after we plug in the power supply.
4-5 Detail of the connection between the LPCXpresso (left) and the Ulink2 (right). The 20-pin ARM connector on the Ulink2 is connected to the holes for the debug connector on the board. Pins 1, 4, 7, 9 and 15 (VCC, GND, SWDIO, SWCLK and RST) are connected to holes 2, 15, 3, 5 and 11 of the debug connector on the board (twisted red and black, green, white and black wires). This allows us to debug the MCU and load code into it.
4-6 Performance analyzer included in the Keil µVision IDE. With this tool we are able to find the speed bottlenecks of our implementation. The state of the analyzer shown here covers part of an ECC point multiplication using 256-bit numbers.
List of Tables
3-1 Table of the critical functions in the C library. It shows the name of each function, the number of calls and the time the program spent inside the function. The measurement was performed during point multiplication on the ARM7TDMI target with the EC192 elliptic curve. Only functions that take 1% of the running time or more are shown.
3-2 Table for 224 and 256 bit implementations of ECIES operations on the processors used. The numbers in the table are the numbers of cycles each operation takes. Decryption takes approximately half the time of encryption.
4-1 Table with the number of cycles needed for ECIES operations on the processors used. Decryption takes approximately half the time of encryption.
List of Symbols and Abbreviations

GF(2^m) Galois Field - using binary polynomials
GF(p) Galois Field - using prime numbers
GF(p^m) Galois Field - optimal extension field
EC192 Elliptic Curve (e.g. NIST 192 bit)
P192 Point on Elliptic Curve (e.g. NIST 192 bit)
p192 Prime Number (e.g. NIST 192 bit)
3DES Triple Data Encryption Standard
AEL ARM ECC Library
AES Advanced Encryption Standard
ARM Advanced RISC Machine
CAN Controller Area Network
CMSIS Cortex Microcontroller Software Interface Standard
DPA Differential Power Analysis
ECC Elliptic Curve Cryptography
ECDLP Elliptic Curve Discrete Logarithm Problem
ECDSA Elliptic Curve Digital Signature Algorithm
ECIES Elliptic Curve Integrated Encryption Scheme
ECMQV Elliptic Curve Menezes-Qu-Vanstone
FPGA Field Programmable Gate Array
GCC GNU Compiler Collection
GNU GNU's Not Unix
IDE Integrated Development Environment
JTAG Joint Test Action Group
KDF Key Derivation Function
LSB Least Significant Bit
LSW Least Significant Word
MAC Message Authentication Code
MCU Micro Controller Unit
MSB Most Significant Bit
MSW Most Significant Word
NIST National Institute of Standards and Technology
NVIC Nested Vectored Interrupt Controller
PRNG Pseudo-Random Number Generator
RISC Reduced Instruction Set Computer
RSA Rivest Shamir Adleman
RTOS Real-Time Operating System
SPA Simple Power Analysis
SWD Serial Wire Debug
USART Universal Synchronous/Asynchronous Receiver/Transmitter
USB Universal Serial Bus
XOR eXclusive OR
XTEA eXtended Tiny Encryption Algorithm
FEI
KEMT
Introduction
Elliptic curve cryptography has a big advantage over other public-key cryptography protocols: it uses small keys. The goal of this thesis is to create a cryptographic library for deeply embedded systems such as chip cards or micro sensor systems. These systems need power consumption as low as possible. The RSA algorithm, for example, needs keys five times longer (see Table 1.1 in [1]). The most common and also the most critical operation in cryptographic systems is multiplication, so we can compare multiplication of the numbers used by ECC and RSA: smaller keys mean faster operations. In conclusion, when we consider equal security level, speed and power consumption, ECC is the best option among public-key algorithms.
The parameters of the processors used also have to satisfy the requirements of deeply embedded systems. We chose low-end processors from ARM. They are relatively new and they have low power consumption. ARM7TDMI was chosen in order to compare our implementation with the algorithms in [14]. Then there are ARM Cortex-M0, ARM Cortex-M3 and ARM Cortex-M1. All these processors have special sleep modes to save power. The Cortex-M1 is especially interesting because it is a soft-core processor that can be implemented in FPGA circuits.
Nowadays we can find some libraries using elliptic curve cryptography that are designed especially for ARM processors, for example INVIA - ECC Software Library IP [12], TinyECC [13] or the cryptographic library for ARM7TDMI processors by Jaroslav Bn [14]. These libraries often provide very high speed optimization. For example, in [14] almost every part of the library is programmed in assembler. This is a serious disadvantage when the library needs to be ported to another target.
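The claim in this introduction that smaller keys mean faster multiplication can be illustrated with a rough operation count. The sketch below assumes a plain schoolbook multiplication on 32-bit words. The function names and the example sizes of 192 and 960 bits (the latter simply five times the former) are assumptions chosen for illustration and are not taken from Table 1.1 in [1].

```c
#include <stdint.h>

/* Number of 32-bit words needed to hold an operand of the given bit length. */
uint32_t words_for_bits(uint32_t bits)
{
    return (bits + 31u) / 32u;
}

/* Number of word-by-word multiplications in one schoolbook multiplication
 * of two operands of the given bit length (n words times n words). */
uint32_t schoolbook_word_muls(uint32_t bits)
{
    uint32_t n = words_for_bits(bits);
    return n * n;
}
```

Under these assumptions a 192-bit operand occupies 6 words and needs 36 word multiplications, while an operand five times longer occupies 30 words and needs 900, so the cost of one multiplication grows roughly with the square of the key length.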
Our goal is to create an ARM ECC library (AEL) and find the optimum between portability, speed and security of the implementation. We take a bottom-up approach, so we develop the low-level functions first and the highest-level functions last. We create the implementation in C first in order to have a reference. Later we can compare the optimized parts of the library with this reference implementation. We chose to implement only NIST curves. These curves have a speed advantage because fast algorithms can be used for them. We also chose only one Galois field: GF(p) suits our processors best when we consider speed and code size. Our goal was also to minimize the memory requirements, so we implement only general functions. We could also use window methods or unpacked algorithms, but that is not the goal.
The next objective was optimization in terms of speed and security. First we need to find the bottlenecks of the implementation. For this purpose we can use the performance analyzer included in the Keil µVision IDE. This tool can show us the routines with the highest CPU time. Then we use assembler code to speed up these bottlenecks as much as we can. After speed optimization we turn to security. We need to find the vulnerable parts of the library and secure them against known attacks.
The last goal is to test the library. We make sure that all routines give correct results. We check the speed and code size of all algorithms. At the end we compare our implementation with others.
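To give an idea of what a low-level routine in a portable C reference implementation might look like, here is a minimal sketch of multi-precision addition. It assumes numbers are stored as arrays of 32-bit words with the least significant word first. The name mp_add and this storage layout are assumptions for illustration and are not necessarily those used in AEL.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two n-word numbers a and b (least significant word first),
 * stores the n-word sum in r and returns the final carry (0 or 1). */
uint32_t mp_add(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64-bit temporary holds the word sum plus the incoming carry */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)t;            /* low 32 bits are the result word */
        carry = (uint32_t)(t >> 32);   /* bit 32 is the outgoing carry */
    }
    return carry;
}
```

A routine of this kind is a natural candidate for later assembler optimization, because the carry propagation maps directly onto add-with-carry instructions.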
1 Hardware and software
This chapter briefly introduces the hardware and software resources used in this thesis. First we take a look at the processor cores and describe the main features of, and differences between, the ARM7TDMI [28], ARM Cortex-M0 [29], ARM Cortex-M1 [30] and ARM Cortex-M3 [31] cores. Then we list some of the MCUs using these cores and the evaluation boards used for testing on real hardware. At the end of the chapter we present the software resources, such as the IDE and toolchains used.
1.1 Processors
1.1.1 ARM7TDMI
The ARM7TDMI [28] processor uses ARM architecture version 4 (ARMv4T; this architecture is now obsolete). It is a common mistake to confuse ARM7TDMI with the ARMv7 [33] architecture. The abbreviation TDMI stands for:
- Thumb - 16-bit Thumb instruction set
- Debugging - on-chip debugger
- Multiplier - powerful multiplier with 64-bit result
- ICE - EmbeddedICE hardware
It has quite a good ratio of performance to power consumption. The register set consists of sixteen 32-bit registers, with R13-R15 used for special purposes (R13 is the stack pointer, R14 the link register and R15 the program counter). It uses a 3-stage pipeline and offers a wide selection of addressing modes. Other features are:
- Orthogonal instruction set
- 32-bit fixed-width instructions
- Conditional execution of almost all instructions
- Barrel shifter, which allows quick shifts
- Dedicated stack pointer
- STM and LDM instructions for fast storing or loading of multiple registers
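The multiplier with a 64-bit result matters for multi-precision arithmetic because one 32 × 32 bit product, one accumulator word and one carry word always fit into 64 bits. The following C sketch shows such a multiply-accumulate row. The name mp_mul_add_word is hypothetical, and whether a compiler maps the inner step to the long multiply instructions of the core (UMULL, UMLAL) depends on the compiler and its settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Computes r[0..n-1] += a[0..n-1] * b and returns the carry word that
 * leaves the most significant position. Words are least significant first. */
uint32_t mp_mul_add_word(uint32_t *r, const uint32_t *a, uint32_t b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this sum cannot overflow */
        uint64_t t = (uint64_t)a[i] * b + r[i] + carry;
        r[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
    return carry;
}
```

Calling this routine once for every word of the second operand, each time shifted by one word position in r, gives a simple schoolbook multiplication.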
1.1.2 ARM Cortex-M0
The Cortex-M0 [29] processor is designed especially for low power consumption and a low gate count. This is required for deeply embedded applications, for which this processor suits best. Unlike the ARM7TDMI core, this processor implements the ARMv6-M architecture [32]. The M0 is a multistage, configurable, 32-bit RISC processor. It executes Thumb code and it is compatible with other Cortex-M cores. The ARM Cortex-M0 also has:
- Very tight integration of peripherals in order to reduce area and development costs
- Power control optimization of system components
- Sleep mode functions for power savings
- Configurable hardware multiplier with 32-bit result
- Serial wire debug
- Deterministic and high-performance interrupt handling for time-critical applications
The Cortex-M0 has a register set that consists of sixteen 32-bit registers. R13-R15 are used for special purposes, as on ARM7TDMI. The rest of the registers are divided into two groups. The lower 8 registers (R0-R7) are general-purpose registers which can be used with almost any instruction. The remaining 5 registers (R8-R12) are high registers and can be used only with some instructions (e.g. ADD, MOV). ARM has also developed the new Cortex-M0+, which it presents as its most energy-efficient processor. Information can be found at http://www.arm.com/products/processors/cortex-m/cortex-m0plus.php
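Because the Cortex-M0 multiplier delivers only a 32-bit result, a full 32 × 32 → 64 bit product has to be assembled from smaller pieces. A minimal sketch of one common way to do this is to split each operand into 16-bit halves, so that every partial product fits into 32 bits. The function name mul32x32_64 is hypothetical and the code is not the routine used in the library.

```c
#include <stdint.h>

/* Computes the full 64-bit product of a and b using only multiplications
 * whose result fits into 32 bits (16 x 16 -> 32). The upper and lower
 * halves of the product are returned through hi and lo. */
void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint32_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint32_t p00 = a0 * b0;   /* weight 2^0  */
    uint32_t p01 = a0 * b1;   /* weight 2^16 */
    uint32_t p10 = a1 * b0;   /* weight 2^16 */
    uint32_t p11 = a1 * b1;   /* weight 2^32 */

    /* Middle column: at most 0xFFFF + 0xFFFF + 0xFFFE0001 = 0xFFFFFFFF,
     * so the sum cannot overflow 32 bits. */
    uint32_t mid = (p00 >> 16) + (p01 & 0xFFFFu) + p10;

    *lo = (mid << 16) | (p00 & 0xFFFFu);
    *hi = p11 + (p01 >> 16) + (mid >> 16);
}
```

Four multiplications and several additions replace the single long multiply available on ARM7TDMI and Cortex-M3, which is one reason multi-precision multiplication is slower on this core.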
1.1.3 ARM Cortex-M1
Cortex-M1 [30] is the first ARM processor designed especially for FPGA implementation. It can be used on all major FPGA targets and includes support for leading FPGA synthesis tools. It uses the ARMv6-M [32] Thumb instruction set with some Thumb-2 instructions, similar to the Cortex-M0 processor. ARM Cortex-M1 is compatible with these tools:
- Actel Libero
- Altera Quartus-II
- Synopsys Synplify Pro
- Mentor Precision
- Xilinx ISE
A big advantage of the M1 compared to other Cortex processors is its configurability. We can choose the parts of the processor needed for our application. We can change the sizes of the instruction and data memories, the debug logic (debug, watchpoint and breakpoint units are removable), the little/big endian option and the hardware multiplier.
1.1.4 ARM Cortex-M3
The last processor on our list is the ARM Cortex-M3 [31] with the ARMv7-M [33] architecture. It is a low-power processor with low interrupt latency and low-cost debug, useful mainly for deeply embedded applications where interrupt latency is critical. More features of the Cortex-M3:
- Thumb and subset of Thumb-2 instruction set
- Hardware divide instructions
- Banked stack pointer (SP) only
- Handler and Thread modes, Thumb and Debug states
- Memory protection unit
- JTAG and SWD for low-cost debugging
The instruction set is very similar to that of ARM7TDMI, so porting to Cortex-M3 is easy and we do not have to learn many new instructions. Two changes are needed, however. First, the programmer has to put all conditionally executed instructions into an IF-THEN (IT) block, which is a new feature of the M3 instruction set. Second, there are no LDMDA, STMDA, LDMIB or STMIB instructions in the Cortex-M3 instruction set, so the programmer has to replace them with the variants that have the IA and DB suffixes. Handling these two main differences was enough for me to port the complex multi-precision multiplication routines from the ARM7TDMI core to the Cortex-M3 core.
1.2 Microcontroller units and evaluation boards
In this section we take a tour of the microcontrollers and evaluation boards used for testing on real hardware. All of this hardware is based on the processors mentioned above. Later we use the boards for cycle measurement and for checking that the results match those from the simulator.
1.2.1 Philips LPC2138 and Keil MCB2130
Philips (now NXP) LPC2138 is a microcontroller based on the ARM7TDMI-S processor. It provides 512 kB of Flash, 32 kB of RAM, two analog-to-digital converters, two UARTs and a real-time clock. A complete list of peripherals and more information about the LPC2138 can be found in [34]. The maximum core frequency is 60 MHz.
MCB2130 [35] is an evaluation board by Keil. It allows us to test application programs for the LPC2138. The connectors on the board provide easy access to many on-chip peripherals:
- Dual serial ports
- Low frequency amplifier
- Analog voltage control for ADC
- JTAG download and debug
1.2.2 NXP LPC1113 and NXP LPCXpresso1113
NXP (formerly Philips) LPC1113 is a low-cost microcontroller based on the ARM Cortex-M0 core. It is designed mainly for simple applications with low power consumption. The maximum CPU frequency is 50 MHz. Memory sizes are 32 kB of Flash and 8 kB of RAM. More about the LPC1113 can be found in [36].
LPCXpresso is a low-cost development platform for LPC microcontrollers. Its key features are:
1. JTAG debugger as a part of the evaluation board
2. Eclipse-based IDE
3. Easy upgrade options
The LPCXpresso1113 evaluation board [37] is based on the LPC1113 MCU. It has no connectors, only a test field that gives access to all CPU pins. The board also provides the LPC-Link JTAG debugger, which has one micro-USB connector. This debugger can be completely separated from the board and used on its own.
1.2.3 STM32, Keil MCBSTM32 and KEMT STM32
STM32 microcontroller units [38] are manufactured by ST Microelectronics. Their cores are based on ARM Cortex-M3 processors. STM32 MCUs can be divided into several categories:
- Performance Line with all peripherals, 72 MHz highest clock
- Access Line with only basic peripherals, 36 MHz highest clock
- Value Line also with basic peripherals, 24 MHz highest clock
- Connectivity Line similar to Performance Line but in addition it has USB On-The-Go, Ethernet and dual CAN peripherals
STM32 microcontrollers are available with different SRAM and Flash memory sizes. The range for SRAM is 4 - 96 kB and for Flash it is 16 kB - 1 MB. Common peripherals included in all lines are, for example, timers, USB, CAN, USART, analog-to-digital converters and many more. ST Microelectronics has also introduced a new STM32 family based on ARM Cortex-M0, the STM32F0 family [39].
The Keil MCBSTM32 evaluation board is well suited for all kinds of application testing. It includes the STM32 Performance Line microcontroller STM32F103RB. It has CAN, USB, RS232, SD card and JTAG connectors, an 8 MHz high-speed oscillator, a 32.768 kHz low-speed oscillator for the real-time clock and a 2 × 16 character LCD display. More information can be found in [40].
The next board was developed at the Department of Electronics and Multimedia Communications of the Technical University of Košice. KEMT STM32 is based on the STM32F103C8 microcontroller. Further information can be found in [41].
1.3 In-circuit debuggers
For programming and in-circuit debugging we use two USB-JTAG adapters. The first is the Ulink2 [42] in-circuit programmer/debugger developed by Keil. It works very well with the Keil µVision development environment. Programming and debugging work without any problems, in both JTAG [46] and SWD [45] mode.
The second adapter is the STlink [43] by ST Microelectronics. Compared to Ulink2 it is a very simple and cheap tool. It is designed especially for programming and debugging STM32 and STM8 devices. Debugging works correctly, but some features are missing, such as register watching, peripheral inspection and exact timing information. Flash programming from the Keil µVision environment gives wrong results. In order to load a program into STM32 Flash correctly with STlink we have to use special software by ST called ST-Link Utility [44].
Another possibility for Flash programming is to load the program using the bootloader, which is present in almost all microcontrollers. First we have to start it (often we need to set some pins to required values at reset). Then we can choose the way of loading. Often there are many possibilities, such as loading code through CAN, USART or USB.
1.4 Development tools
This section introduces the software resources used in this work. The first one is the IDE (Integrated Development Environment). Then we take a closer look at the toolchains used, the RealView and GCC compilers. At the end we describe low-level software such as CMSIS and standard libraries for MCUs.
1.4.1 Keil µVision IDE
The µVision development environment by Keil [47] is part of the MDK-ARM package [48]. For compiling the library and test routines we use Keil µVision version 4.50 with ARM CC compiler V4.1.0.894. This environment combines 4 features which are needed or helpful for program development:
- Project management and editing
- Compilation options
- Source code editor
- Environment for debugging and program simulation
Keil µVision is easy to understand and lets us develop our applications effectively. The evaluation version is free to download from the manufacturer's website but it has some restrictions (they can be found at http://www.keil.com/demo/limits.asp ).
1.4.2 Compilers
The MDK-ARM package also includes the RealView compiler. It is a professional and powerful tool, but the evaluation version is restricted to 32 kB of code. If we want to create a bigger program we have to pay for the full version or switch to another compiler.
In this thesis we also use the GCC [49] compiler. It is distributed under the GNU license, so it comes almost without any restrictions. The version which suits us best is included in Sourcery CodeBench Lite for ARM by Mentor Graphics [50]. It is fully specialized for ARM processors and their architecture. For compiling the library and test routines we use the GNU compiler from Sourcery CodeBench Lite 2011.09-69. It cooperates with the IDE very well. In the past, "glue" software was needed to make GCC work with Keil µVision. Nowadays this is no longer a problem and it works as it is. In comparison with RealView it generates slightly larger code and the resulting program is slightly slower, but this is the price to pay for a relatively free compiler.
1.4.3
Low level system libraries
In this section we introduce some hardware-specific software for processors and microcontrollers. The first one is the Cortex Microcontroller Software Interface Standard, CMSIS [51]. It defines a common way to access core peripheral registers and to set exception vectors. It also defines names for registers and core exception vectors. CMSIS provides a device-independent interface for RTOS kernels, including a debug channel.
There are two ways of using peripherals in microcontrollers. In simple applications we can just access the registers directly. The main advantages of this approach are small code size and higher application speed. When the application grows large, direct register access becomes disorganized and unreadable. Almost every manufacturer provides a standard peripheral library which allows the user to organize the code and keep it readable. If we use the standard library we do not have to learn all registers of all peripherals we use. We simply take a library function and use it, but we have to be cautious because manufacturers' libraries can contain bugs too.
The last paragraph of this section takes a closer look at startup files and linker scripts. A startup file is code which is executed at the start of the MCU. It contains the interrupt vector table, calls the processor setup routine and then jumps to the main program. It can be written in C or, more often, in assembly language, so we need different startup files for different compilers. Linker scripts determine where code, variables and all other data are stored in memory. A linker script has to describe all memory regions and it provides all the information the linker needs. When using the MDK-ARM package there is no need for an external script, because the linker uses the information from the target options window. GCC, on the other hand, cannot use that data, so it needs linker script files. During my work I found that, in order to make a program work properly, all three components (compiler, linker script and startup file) have to fit together and cooperate well.
Elliptic curve cryptography
Elliptic Curve Cryptography (ECC) is spreading very fast today. The main reason is that it is an asymmetric cryptosystem whose key lengths are comparable to those of the AES [4] cipher at an equivalent security level. For example, the RSA algorithm needs a key at least 10 times larger to reach the same security level. Small key sizes lead to faster and more efficient algorithms with lower power consumption. All public-key cryptographic algorithms rely on mathematical problems (one-way functions) which can be computed quickly in one direction only. In these functions it is easy to compute the product from the operands but very hard to get the operands (or even one operand when the other is known) from the product. Cryptographic algorithms very often use the integer factorization problem or the discrete logarithm problem. For ECC the typical mathematical problem is the Elliptic Curve Discrete Logarithm Problem (ECDLP), which is described later in this chapter.
Figure 2-1 Elliptic curve cryptography pyramidal hierarchy. The levels, from top to bottom, are: ECC Protocols; EC Point Multiplication; EC Point Addition and Doubling; Finite Field Operations; Basic W-bit Operations. W represents the bit length of the multi-precision numbers used. Every level of the pyramid depends on the one below it. Together the levels create a support system for the protocols.
Next we analyze all prerequisites which we need for elliptic curve cryptography. The first part is about finite field arithmetic, especially about prime fields. The second section describes elliptic curves, points on these curves and point operations. Elliptic curve arithmetic, especially point multiplication, is essential for the last subchapter, which deals with cryptographic protocols. With protocols we can encrypt and decrypt messages and share secret keys. All these prerequisites form a pyramidal hierarchy as shown in Figure 2-1.
2.1
Finite field arithmetic
All operations used by elliptic curve arithmetic are finite field operations, so it is very important to understand them and to implement them as efficiently as possible. There are three kinds of finite field arithmetic which suit ECC best: prime field arithmetic $GF(p)$; binary field arithmetic $GF(2^m)$; optimal extension field arithmetic $GF(p^m)$. For our implementation we choose only prime field arithmetic. It is well suited for our type of processors. Binary fields could be considered faster because they need no integer multiplications (they use only shifts, AND and XOR), but ARM processors have hardware multipliers which remove this advantage. $GF(2^m)$ fields are well suited for hardware implementations and for general-purpose processors without fast multiplication instructions. The next section is about prime field arithmetic. For further information about the field arithmetic used in ECC see [1], Chapter 2.
2.1.1
Prime field arithmetic
Let $p$ be a prime number. The integers modulo $p$ form the set $F_p = \{0, 1, 2, \ldots, p-1\}$. In order to create a field, besides the set we need addition and multiplication (denoted by $+$ and $\cdot$) that satisfy the usual arithmetic properties:
$(F_p, +)$ is an abelian group with (additive) identity denoted by 0.
$(F_p \setminus \{0\}, \cdot)$ is an abelian group with (multiplicative) identity denoted by 1.
The distributive law holds: $(a + b) \cdot c = (a \cdot c) + (b \cdot c)$ for all $a, b, c$ from $F_p$.
Because $F_p$ is a finite set created with a prime number, the field $(F_p, +, \cdot)$ is called the Galois field of prime order (denoted by $GF(p)$). In this field we perform all operations that are used later. These are:
1. Addition: $c = a + b \bmod p$ (or just $c = a + b$)
2. Subtraction: $c = a - b \bmod p$ (or just $c = a - b$)
3. Multiplication: $c = a \cdot b \bmod p$ (or just $c = a \cdot b$)
4. Inversion: $c = a^{-1}$ such that $1 = a \cdot a^{-1} \bmod p$ (or just $c = a^{-1}$)
Every operation must produce a result that lies in the set defined by the prime number. If we add two numbers and the result falls outside the set, all we need to do is subtract the prime number to correct the result. The same holds for subtraction, except that if the result falls outside the set we have to add the prime number. Multiplication (often written without the "$\cdot$", e.g. $c = a^2 b^3$) is more complicated. The result is twice as long as the prime number and a single subtraction does not help us. We need to use special reduction algorithms such as Barrett reduction (see Chapter 2.2.4 in [1]). The last operation we need is division: first we compute the inverse of the divisor and then we multiply by it.
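The four operations above can be illustrated with a minimal C sketch for a prime that fits into one machine word. The function names (gfp_add, gfp_sub, gfp_mul, gfp_inv, gfp_div) and the single-word representation are assumptions made only for this illustration. The library itself works with multi-precision numbers and uses dedicated reduction routines instead of the % operator. Inversion is done here with Fermat's little theorem, $a^{-1} = a^{p-2} \bmod p$.

```c
#include <stdint.h>

/* Toy GF(p) arithmetic for a prime p < 2^31 that fits into one word.
   All inputs are assumed to be already reduced to [0, p-1]. */

uint32_t gfp_add(uint32_t a, uint32_t b, uint32_t p)
{
    uint32_t c = a + b;               /* cannot overflow because p < 2^31 */
    return (c >= p) ? c - p : c;      /* one subtraction of p corrects the result */
}

uint32_t gfp_sub(uint32_t a, uint32_t b, uint32_t p)
{
    return (a >= b) ? a - b : a + p - b;   /* add p back on underflow */
}

uint32_t gfp_mul(uint32_t a, uint32_t b, uint32_t p)
{
    /* the product is twice as long as the operands, so it must be reduced */
    return (uint32_t)(((uint64_t)a * b) % p);
}

uint32_t gfp_inv(uint32_t a, uint32_t p)
{
    /* Fermat's little theorem: a^(p-2) = a^(-1) mod p for a != 0 */
    uint32_t result = 1, e = p - 2;
    while (e != 0) {
        if (e & 1u)
            result = gfp_mul(result, a, p);
        a = gfp_mul(a, a, p);
        e >>= 1;
    }
    return result;
}

uint32_t gfp_div(uint32_t a, uint32_t b, uint32_t p)
{
    /* division = multiplication by the inverse of the divisor */
    return gfp_mul(a, gfp_inv(b, p), p);
}
```

For example, with p = 23 the call gfp_sub(3, 7, 23) wraps around to 19, and gfp_inv(5, 23) gives 14 because 5 * 14 = 70 = 3 * 23 + 1.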
2.2
Elliptic curve arithmetic
The goal of this section is to briefly explain what elliptic curves are and which operations can be performed on them. At the end we explain how point multiplication on elliptic curves works. Point multiplication is the powerful operation used in all elliptic curve cryptographic protocols such as ECIES [3] or ECDSA (Chapter 4.4.1 in [1]).
2.2.1
Equations of elliptic curves
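An elliptic curve EC over a field $K$ is defined by the Weierstrass equation:
$$EC: y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \qquad (2.1)$$
where $a_1, a_2, a_3, a_4, a_6 \in K$ and $\Delta \neq 0$, where $\Delta$ is the discriminant of EC and is defined as follows:
$$\Delta = -d_2^2 d_8 - 8 d_4^3 - 27 d_6^2 + 9 d_2 d_4 d_6$$
$$d_2 = a_1^2 + 4 a_2$$
$$d_4 = 2 a_4 + a_1 a_3$$
$$d_6 = a_3^2 + 4 a_6$$
$$d_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2 \qquad (2.2)$$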
For elliptic curve cryptography it is not necessary to use the whole of (2.1). If we take some simplifying steps (described in [1], Chapter 3.1.1) we get the simplified Weierstrass equation which is always used in ECC. For prime field arithmetic where $p > 3$ the simplified Weierstrass equation has the form:
$$EC: y^2 = x^3 + a x + b \qquad (2.3)$$
where $a, b \in K$ and $\Delta = -16(4a^3 + 27b^2)$. Examples of elliptic curves defined over the real numbers are shown in Figure 2-2. We must define one more point on the curves. It is the point at infinity (denoted by $\infty$), which is the only point on the line at infinity that satisfies the elliptic curve equation. With this point we can define the group law and also the operations with points on curves. More information about the group law, group order and structure can be found in [1], Chapter 3.1.2.
2.2.2
Point operations
Points on an elliptic curve are points whose coordinates satisfy (2.3) (from now on we consider only prime field arithmetic). With these points and the defined group law we can perform addition, subtraction, doubling and, most importantly, multiplication. The point at infinity serves as the identity for addition ($P + \infty = \infty + P = P$). The negative of a point $P(x, y)$ is the point $-P(x, -y)$. The point $P(x, y)$ is written in affine coordinates. There are also other types of coordinates, which are explained later in Section 2.2.4. Addition of two points can be easily explained geometrically. In order to add point P to point Q we draw a line through these points. Then we take the third intersection of the line with the curve and reflect it about the x-axis. The resulting point R is the sum of the points P and Q. We can write $R(x_3, y_3) = P(x_1, y_1) + Q(x_2, y_2)$ where:
$$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2$$
$$y_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3) - y_1 \qquad (2.4)$$
The addition is depicted in Figure 2-2 (a). When we want to subtract a point, all we have to do is add the negative point. In the same way we can perform point doubling (Figure 2-2 (b)). The only change compared to addition is that we first draw a tangent at the doubled point; the rest of the construction is the same as for point addition. For doubling the point P we can write $R(x_3, y_3) = 2 * P(x_1, y_1)$ where:
$$x_3 = \left(\frac{3 x_1^2 + a}{2 y_1}\right)^2 - 2 x_1$$
$$y_3 = \left(\frac{3 x_1^2 + a}{2 y_1}\right)(x_1 - x_3) - y_1 \qquad (2.5)$$
Figure 2-2 Graphical representation of ECC point addition (a) and point doubling (b). Panel (a), Addition: P + Q = R, with P = (x1, y1), Q = (x2, y2) and R = (x3, y3). Panel (b), Doubling: P + P = R, with P = (x1, y1) and R = (x3, y3). The example curves in the figure are defined over the real numbers. In real ECC applications finite fields are used.
The most important point operation in elliptic curve cryptography is point multiplication. It dominates the execution time of ECC schemes. We can imagine the multiplication $d * P$ (where $d$ is an integer from $GF(p)$ and $P$ is a point on the elliptic curve) as repeated addition of $P$:
$$d * P = P + P + P + \ldots + P \qquad (2.6)$$
where the number of added points $P$ is exactly $d$. This view of point multiplication is a good way to imagine how it works, but it is inefficient. In real applications we should use other algorithms, such as the left-to-right or right-to-left binary method or the Montgomery ladder. These algorithms are described later in Section 3.7.
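Equations (2.4) to (2.6) can be tried out on a very small curve. The following sketch is only an illustration under stated assumptions: the toy curve $y^2 = x^3 + 2x + 2$ over $GF(17)$ with the point (5, 1) of order 19 is not one of the curves used in the library, and the names ec_point, ec_add, ec_mul_naive and ec_mul_binary are hypothetical. It compares repeated addition with the left-to-right binary method.

```c
#include <stdint.h>

/* Toy curve y^2 = x^3 + 2x + 2 over GF(17); the point (5, 1) has order 19. */
enum { EC_P = 17, EC_A = 2 };

typedef struct { int x, y, inf; } ec_point;   /* inf != 0 marks the point at infinity */

static int mod_p(int v) { v %= EC_P; return v < 0 ? v + EC_P : v; }

static int inv_p(int v)          /* inverse by exhaustive search, fine for a tiny p */
{
    for (int i = 1; i < EC_P; i++)
        if (mod_p(v * i) == 1) return i;
    return 0;
}

ec_point ec_add(ec_point p, ec_point q)
{
    ec_point r = { 0, 0, 1 };
    int lambda;
    if (p.inf) return q;                          /* infinity is the identity */
    if (q.inf) return p;
    if (p.x == q.x) {
        if (mod_p(p.y + q.y) == 0) return r;      /* P + (-P) = infinity */
        /* P == Q: slope of the tangent, equation (2.5) */
        lambda = mod_p((3 * p.x * p.x + EC_A) * inv_p(2 * p.y));
    } else {
        /* slope of the line through P and Q, equation (2.4) */
        lambda = mod_p((q.y - p.y) * inv_p(q.x - p.x));
    }
    r.x = mod_p(lambda * lambda - p.x - q.x);
    r.y = mod_p(lambda * (p.x - r.x) - p.y);
    r.inf = 0;
    return r;
}

/* Repeated addition as in (2.6): d point additions starting from infinity. */
ec_point ec_mul_naive(uint32_t d, ec_point p)
{
    ec_point r = { 0, 0, 1 };
    while (d--) r = ec_add(r, p);
    return r;
}

/* Left-to-right binary method: one doubling per bit, one addition per set bit. */
ec_point ec_mul_binary(uint32_t d, ec_point p)
{
    ec_point r = { 0, 0, 1 };
    for (int i = 31; i >= 0; i--) {
        r = ec_add(r, r);
        if ((d >> i) & 1u) r = ec_add(r, p);
    }
    return r;
}
```

Both functions return the same point, but for a 192-bit multiplier the naive loop would need on the order of $2^{192}$ additions, while the binary method needs at most 192 doublings and 192 additions.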
2.2.3
NIST curves
NIST curves [7] are curves specially chosen by the NIST organization. There are 5 recommended prime fields determined by 5 prime numbers:
$$p_{192} = 2^{192} - 2^{64} - 1$$
$$p_{224} = 2^{224} - 2^{96} + 1$$
$$p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$$
$$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$$
$$p_{521} = 2^{521} - 1$$
For each field a random elliptic curve is generated with cofactor $h = 1$ and coefficient $a = -3$ (according to (2.3)). These curves and fields are chosen with regard to bit length, security and the possibility of fast algorithms. More information, exact NIST curve parameters, algorithms and test vectors can be found in [7]. Example parameters of the elliptic curve EC384 over $GF(p_{384})$ are:

    p  = ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
         ffffffff fffffffe ffffffff 00000000 00000000 ffffffff

    a  = ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
         ffffffff fffffffe ffffffff 00000000 00000000 fffffffc

    b  = b3312fa7 e23ee7e4 988e056b e3f82d19 181d9c6e fe814112
         0314088f 5013875a c656398d 8a2ed19d 2a85c8ed d3ec2aef

    // base point P
    xP = aa87ca22 be8b0537 8eb1c71e f320ad74 6e1d3b62 8ba79b98
         59f741e0 82542a38 5502f25d bf55296c 3a545e38 72760ab7

    yP = 3617de4a 96262c6f 5d9e98bf 9292dc29 f8f41dbd 289a147c
         e9da3113 b5f0b8c0 0a60b1ce 1d7e819d 7a431d7c 90ea0e5f

    // order of P ( = order of EC )
    n  = ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
         c7634d81 f4372ddf 581a0db2 48b0a77a ecec196a ccc52973
2.2.4
Other coordinate systems
Sometimes affine coordinates $(x, y)$ do not suit certain algorithms. These algorithms often use projective coordinates instead, which are denoted as $(X : Y : Z)$. There are several types of these coordinates. Standard projective coordinates use one more coordinate compared to affine coordinates. It is called the $Z$-coordinate. The $X$ coordinate is equal to $x * Z$ and $Y$ is equal to $y * Z$. For example, the point (3, 5) in affine coordinates corresponds to the point (3 : 5 : 1) in projective coordinates. Jacobian projective coordinates are very similar to the standard ones. The difference is that the $X$ coordinate is equal to $x * Z^2$ and $Y = y * Z^3$. Another type of coordinate system is Co-Z coordinates. Here the points taking part in an operation share the same $Z$ coordinate, so it does not have to be calculated separately for each point during the operations. The last system is called the OnlyX coordinate system. It is a Co-Z system without the $Y$ coordinate. All operations are done without $Y$ and at the end we recover it. More information about the last two systems can be found in [17].
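The relation between affine and Jacobian coordinates can be sketched in a few lines of C. This is a minimal illustration for a single-word prime, not the multi-precision code of the library. The type and function names (affine_pt, jacobian_pt, affine_to_jacobian, jacobian_to_affine) are assumptions, and the inverse is computed with Fermat's little theorem for simplicity.

```c
#include <stdint.h>

typedef struct { uint32_t x, y; } affine_pt;
typedef struct { uint32_t X, Y, Z; } jacobian_pt;

static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t p)
{
    return (uint32_t)(((uint64_t)a * b) % p);
}

static uint32_t invmod(uint32_t a, uint32_t p)     /* a^(p-2) mod p, p prime */
{
    uint32_t r = 1, e = p - 2;
    while (e != 0) {
        if (e & 1u) r = mulmod(r, a, p);
        a = mulmod(a, a, p);
        e >>= 1;
    }
    return r;
}

/* (x, y) -> (x*Z^2 : y*Z^3 : Z) for any nonzero Z; Z = 1 needs no multiplication. */
jacobian_pt affine_to_jacobian(affine_pt a, uint32_t z, uint32_t p)
{
    uint32_t z2 = mulmod(z, z, p);
    jacobian_pt j;
    j.X = mulmod(a.x, z2, p);
    j.Y = mulmod(a.y, mulmod(z2, z, p), p);
    j.Z = z;
    return j;
}

/* (X : Y : Z) -> (X/Z^2, Y/Z^3). Returns 0 when Z == 0 (point at infinity). */
int jacobian_to_affine(jacobian_pt j, uint32_t p, affine_pt *out)
{
    uint32_t zi, zi2;
    if (j.Z == 0) return 0;
    zi  = invmod(j.Z, p);              /* the only field inversion needed */
    zi2 = mulmod(zi, zi, p);
    out->x = mulmod(j.X, zi2, p);
    out->y = mulmod(j.Y, mulmod(zi2, zi, p), p);
    return 1;
}
```

Over $GF(17)$ the affine point (3, 5) with Z = 2 becomes (12 : 6 : 2), and converting back gives (3, 5) again. The benefit of projective systems is that point additions and doublings can be done without inversions, so only one inversion is needed at the very end.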
2.3
Cryptographic protocols
This section introduces signature schemes, public-key encryption/decryption and key establishment schemes, all based on elliptic curves. All protocols rely on the Elliptic Curve Discrete Logarithm Problem (ECDLP). The ECDLP is: given an elliptic curve EC defined over $GF(p)$, a point $P$ on the curve EC of order $n$ and a point $Q \in \langle P \rangle$ (the set generated by $P$), find the integer $d \in [0, n-1]$ such that $Q = d * P$. The integer $d$ is called the discrete logarithm of $Q$ to the base $P$, denoted $d = \log_P Q$.
The most naive way of finding the discrete logarithm is exhaustive search. Its running time is $n$ (order) steps in the worst case and $n/2$ on average. The countermeasure to exhaustive search is to choose $n$ sufficiently large (e.g. $n \geq 2^{80}$). The most effective mathematical attack yet discovered is the combination of the Pohlig-Hellman algorithm and Pollard's rho algorithm (see [1], Chapter 4.1). It has a fully exponential running time. It can be defeated by choosing $n$ divisible by a sufficiently large prime number, larger than $2^{160}$. If the system is also resistant to other types of attacks (see Section 3.7), these steps secure it against breaking with today's computer technology.
Before we can proceed to cryptographic protocols, the domain parameters of elliptic curves and key pair generation have to be explained. Domain parameters have to be chosen in such a way that the whole system is resistant to all known attacks. They are:
1. $q$ - field order
2. $FR$ - field representation (e.g. GF(p))
3. $S$ - a seed, if the curve was randomly generated
4. $a, b$ - two coefficients in the EC equation
5. $x_P, y_P$ - two field elements which represent the base point $P(x_P, y_P)$ of prime order on the curve EC
6. $n$ - order of the point $P$
7. $h$ - cofactor equal to $\#EC/n$ ($\#EC$ = order of the curve)
The first step in cryptography is key generation. In public-key cryptography we have to produce a key pair, consisting of a private and a public key. An ECC key pair is always associated with domain parameters. We can obtain a public key by choosing a random point $Q$ from the set of points generated by the base point $P$. The corresponding private key is $d = \log_P Q$. We can notice that computing the private key from the public one is exactly the ECDLP. In practical applications we first randomly choose the private key $d$ from the interval $[1, n-1]$. Then we compute the public key $Q$ by the multiplication $Q = d * P$. At the end we validate the resulting key pair (e.g. the public key cannot be the point at infinity).
2.3.1
Signature schemes
Signature schemes are the digital form of handwritten signatures. Their main purpose is to identify the authority which signed a message and to ensure data integrity. A signature scheme consists of four parts:
1. Domain parameters generation. It can be omitted if fixed parameters are used.
2. Key generation. Using the domain parameters it generates a key pair $(d, Q)$.
3. Signature generation. It takes the domain parameters, the private key $d$ and a message $m$ and creates a signature.
4. Signature verification. It takes the domain parameters, the public key $Q$ and the signature, and the result is acceptance or rejection of the signature.
The most popular signature schemes are the Elliptic Curve Digital Signature Algorithm (ECDSA) and the Elliptic Curve Korean Certificate-based Digital Signature Algorithm (EC-KCDSA).
2.3.2
Public-key encryption
Public-key encryption schemes provide confidentiality. Examples are the Elliptic Curve Integrated Encryption Scheme (ECIES) [3] and the Provably Secure Elliptic Curve encryption scheme (PSEC) (Chapter 4.5.2 in [1]). They are much slower than symmetric-key algorithms such as AES [4], 3DES [5] or XTEA [6]. Their purpose is to encrypt only small amounts of data (credit card number, PIN) or to transport symmetric keys which are then used by symmetric encryption algorithms. Public-key encryption schemes consist of four algorithms. They are:
1. Domain parameters generation. It can be omitted if fixed parameters are used.
2. Key generation. Using the domain parameters it generates a key pair $(d, Q)$.
3. Encryption. It takes the domain parameters, the public key $Q$ and a plaintext message $m$ and creates a ciphertext $c$.
4. Decryption. It takes the domain parameters, the private key $d$ and the ciphertext $c$ and either rejects $c$ as invalid or produces the message $m$.
2.3.3
Key establishment
The last group of schemes is used for key establishment. The goal is to establish a shared secret among all users. ECIES can be considered a key establishment scheme between two parties if the secret message is the key. The established secret can then be used to achieve cryptographic goals like confidentiality or data integrity. There are two types of protocols used in key establishment schemes. In a key transport protocol, one entity creates the secret key and transports it confidentially to the others. In a key agreement protocol, all entities participate in establishing the secret key. The Station-To-Station protocol (STS) is a key establishment scheme based on the discrete logarithm problem. In elliptic curve cryptography a modified version of this protocol is used. The last protocol to be mentioned is the ECMQV (Elliptic Curve Menezes-Qu-Vanstone) algorithm (Chapter 4.6.2 in [1]). An interesting property of this protocol is that it is a three-pass key establishment scheme.
Library
We build the library from the bottom up. First we create the whole library in the C language: basic multi-precision routines, $GF(p)$ arithmetic, elliptic curve point operations and finally the ECIES protocol to demonstrate the usage of the library. After having the entire library written in C we can continue with the next goal, which is the optimization of the low-level routines. In all public-key cryptographic protocols the most frequently used algorithm is multi-precision multiplication. It takes most of the processor computing time, so our effort is to maximize the speed of this algorithm. This can be done by optimization in assembly language. Then we focus on securing the library against side-channel attacks. The heart of ECC is point multiplication and it is very vulnerable to this kind of attack. We use algorithms and techniques which minimize side-channel leaks and can also improve other security aspects. The end of this chapter is devoted to library testing. We design test programs to prove that the results are correct. We also measure the speed of the algorithms and compare the results with other implementations.
3.1
Basic routines
First we have to build the base of the pyramid (Figure 2-1). We briefly describe each function and point out interesting issues that we needed to solve. All processors used have 32-bit arithmetic, so we have to create functions which handle long multi-precision numbers: adding, subtracting, multiplying, comparing, halving etc. Multi-precision numbers are represented by an array with the least significant word (LSW) at the first position and the most significant word (MSW) at the last position.
AEL_cmp() - Function for comparing two numbers. It takes 2 multi-precision numbers and compares them. It returns 0, 1 or 2 if the first number is bigger, the second number is bigger or they are equal, respectively.
AEL_copy() - Function for copying multi-precision numbers. It takes the provided number and copies it to the desired destination. At this point we should note that for the copying and comparing routines we could use standard libraries, which may be better optimized for the processors. Our decision is to avoid them in order to preserve the uniformity of the library. These functions are not critical, so the overall speed is not decreased.
AEL_testx() - This function tests whether the input multi-precision number is equal to a given single-precision number. It is often used when we want to know whether a multi-precision number is zero or one. It is much faster than preparing a multi-precision zero or one and then using the AEL_cmp() function mentioned above.
AEL_add_multi() and AEL_sub_multi() - The first approach was inspired by Algorithms 2.5 and 2.6 in [1]. I created a routine to add/subtract single-precision words with carry/borrow and then I used these routines in multi-precision addition/subtraction. This approach was not very fast and it had a big code size. The final solution was inspired by the C code for multi-precision addition/subtraction in PolarSSL [27]. I improved this code in such a way that the compilers for our processors produce faster code than for the original PolarSSL implementation. As an example, the addition code looks like this:
    uint i;                       // loop counter
    uint tc, tmp;                 // tmp result and tmp carry

    *e = 0;                       // clear first carry
    // add all the words in vector
    for (i = 0; i < t; i++) {
        tc = a[i] + b[i];         // count c = a + b
        tmp = (tc < a[i]);        // if overflow set tmp (new) carry
        c[i] = *e + tc;           // add old carry to result
        tmp += (c[i] < *e);       // if overflow set tmp (new) carry
        *e = tmp;                 // copy tmp (new) carry to output
    }
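The fragment above can be completed into a self-contained function. The following sketch is an assumption about how such a routine could look as a whole. The names mp_add and mp_sub and their signatures are hypothetical and are not the actual AEL_add_multi() and AEL_sub_multi() prototypes. The carry is detected without any wider data type: an unsigned sum that wrapped around is smaller than either of its operands.

```c
#include <stdint.h>
#include <stddef.h>

/* c = a + b for t-word numbers stored least significant word first.
   Returns the carry out of the most significant word (0 or 1).
   c may be the same array as a or b. */
uint32_t mp_add(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t t)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < t; i++) {
        uint32_t tc  = a[i] + b[i];     /* word sum, may wrap around */
        uint32_t tmp = (tc < a[i]);     /* wrap-around means a new carry */
        uint32_t s   = tc + carry;      /* add the carry from the previous word */
        tmp += (s < carry);             /* adding the old carry may wrap as well */
        c[i] = s;
        carry = tmp;
    }
    return carry;
}

/* c = a - b for t-word numbers, returns the final borrow (0 or 1). */
uint32_t mp_sub(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t t)
{
    uint32_t borrow = 0;
    for (size_t i = 0; i < t; i++) {
        uint32_t d   = a[i] - b[i];     /* word difference, may wrap around */
        uint32_t tmp = (a[i] < b[i]);   /* wrap-around means a new borrow */
        uint32_t r   = d - borrow;      /* subtract the borrow from the previous word */
        tmp += (d < borrow);            /* subtracting the old borrow may wrap as well */
        c[i] = r;
        borrow = tmp;
    }
    return borrow;
}
```

Adding 1 to the two-word number with all bits set gives zero in both words and a returned carry of 1. The two carry conditions inside one iteration can never be true at the same time, so the carry stays 0 or 1.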
AEL_hlv_multi() - This function performs halving of a multi-precision number. It returns the halved number and also the remainder.
mdmp_mul() - The inputs of the multiplication routine are two multi-precision multiplicands. The product is twice as long as the multiplicands. In this implementation of multi-precision multiplication I used the schoolbook method of multiplication (Chapter 3.1 in [16]). This method was previously implemented by Ing. Jaroslav Bn and we combined it with multiplication using a 16 x 16 bit core. This may be useful for processors which do not have a 32 x 32 bit multiply instruction with a 64-bit result in their instruction set (e.g. ARMv6-M architecture processors such as ARM Cortex-M0 and ARM Cortex-M1).
polar_mul() - This is a pure C implementation of multiplication from the PolarSSL library [27]. It can be optimized depending on whether the compiler/processor supports the long long variable type. In general it is faster than the mdmp_mul() routine. It also performs accumulation, so if we only need to multiply we must clear the output vector first.
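The combination of the schoolbook method with a 16 x 16 bit multiplication core can be sketched as follows. This is not the code of mdmp_mul() or polar_mul(). It is a minimal illustration under the assumption that only 32-bit arithmetic is available, and the names mul32_16x16 and mp_mul_schoolbook are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* 32x32 -> 64-bit multiply assembled from four 16x16 -> 32-bit products. */
static void mul32_16x16(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;
    uint32_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
    /* middle column: upper half of ll plus the lower halves of both cross products */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);
    *lo = (ll & 0xFFFFu) | (mid << 16);
    *hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}

/* c = a * b: a and b have t words, c has 2*t words, all stored LSW first.
   c must not overlap a or b. */
void mp_mul_schoolbook(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t t)
{
    for (size_t i = 0; i < 2 * t; i++)
        c[i] = 0;
    for (size_t i = 0; i < t; i++) {
        uint32_t carry = 0;
        for (size_t j = 0; j < t; j++) {
            uint32_t hi, lo;
            mul32_16x16(a[j], b[i], &hi, &lo);
            lo += carry;     hi += (lo < carry);      /* add the incoming carry */
            lo += c[i + j];  hi += (lo < c[i + j]);   /* accumulate into the result */
            c[i + j] = lo;
            carry = hi;                               /* high word goes to the next column */
        }
        c[i + t] = carry;
    }
}
```

The high word never overflows, because a[j] * b[i] plus the carry plus c[i + j] is at most $(2^{32}-1)^2 + 2(2^{32}-1) = 2^{64}-1$. On processors with a multiply instruction that returns a 64-bit result, the helper can be replaced by a single instruction, which is one reason why multiplication is the first candidate for assembler optimization.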
3.1.1
Pseudo-random number generator
We also need a Pseudo-Random Number Generator (PRNG). It is used, for example, to create test numbers. In places where we need to pick random numbers (e.g. key pair creation) it would be better to have a true random number generator. None of the microcontrollers used has hardware support for such a generator, so we use only a PRNG. There are a lot of PRNG implementations (see [24], [25] and [26]). We implement the Marsaglia multiply-with-carry generator (Marsaglia-Multicarry-RNG) mentioned in [25]. It is a simple and robust PRNG. The core of the generator is really small:
    #define znew ((z = 36969 * (z & 65535) + (z >> 16)) << 16)
    #define wnew ((w = 18000 * (w & 65535) + (w >> 16)) & 65535)
    #define IUNI (znew + wnew)
    #define UNI  (znew + wnew) * 2.328306e-10
    static unsigned long z = 362436069, w = 521288629;
    void setseed(unsigned long i1, unsigned long i2) { z = i1; w = i2; }
AEL_prng() - The implemented Marsaglia-Multicarry-RNG returns a 32-bit number. The seed function AEL_prng_init() needs to be called before we can use this function for the first time.
AEL_prng_multi() - We usually need numbers longer than 32 bits. This function generates a multi-precision number by calling AEL_prng() multiple times.
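How a multi-precision number can be filled from the 32-bit generator is sketched below. The functions prng_init, prng_next and prng_multi are hypothetical stand-ins for AEL_prng_init(), AEL_prng() and AEL_prng_multi(), whose exact prototypes are not given here. The sketch uses uint32_t instead of unsigned long so that the arithmetic wraps at 32 bits also on a PC where unsigned long may be 64 bits wide. Such a generator is suitable for test data only, not for real keys.

```c
#include <stdint.h>
#include <stddef.h>

/* State of the Marsaglia multiply-with-carry generator (default seeds). */
static uint32_t mwc_z = 362436069u, mwc_w = 521288629u;

void prng_init(uint32_t seed1, uint32_t seed2)
{
    mwc_z = seed1;
    mwc_w = seed2;
}

/* One 32-bit output: two 16-bit multiply-with-carry generators combined. */
uint32_t prng_next(void)
{
    mwc_z = 36969u * (mwc_z & 65535u) + (mwc_z >> 16);
    mwc_w = 18000u * (mwc_w & 65535u) + (mwc_w >> 16);
    return (mwc_z << 16) + (mwc_w & 65535u);
}

/* Fill a t-word multi-precision number (LSW first) with pseudo-random words. */
void prng_multi(uint32_t *out, size_t t)
{
    for (size_t i = 0; i < t; i++)
        out[i] = prng_next();
}
```

Because the generator is deterministic, seeding it twice with the same values reproduces the same multi-precision numbers, which is convenient when a failing test case has to be repeated.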
3.2
Prime field arithmetic routines
The second level of the ECC pyramid consists of $GF(p)$ routines. It has to implement addition, subtraction, multiplication and inversion in prime field arithmetic. If we want to perform multiplication in $GF(p)$ we also need reduction routines.
AEL_GFp_add() and AEL_GFp_sub() - The operations of adding and subtracting in $GF(p)$ first evaluate the basic addition or subtraction of the input numbers. Then, if the result overflows or underflows the interval $[0, p-1]$, we need to subtract or add the prime number. See Algorithms 2.7 and 2.8 in [1].
AEL_GFp_bininv() and AEL_GFp_RSbininv() - These functions compute the multiplicative inverse of the input number. They use the binary or the right-shift binary inversion algorithm, which increases speed because it needs neither division (as in the Euclidean inversion algorithm) nor exponentiation (as when using Fermat's little theorem for inversion).
AEL_GFp_FERMinv() - Inversion routine that uses Fermat's little theorem. We use it in testing functions to check that inversion gives correct results.
AEL_GFp_mul() - Multiplication in $GF(p)$. It multiplies two multi-precision numbers. In general the result is twice as long as the multiplicands, so we then need to reduce the product to the interval $[0, p-1]$. This is done by reduction algorithms (general or fast).
AEL_GFp_bred() - Barrett's function for number reduction. It is not yet working in the library. All elliptic curves that we use in the library are NIST curves, so we can always use the fast reduction algorithms. If we need general curves in the future we have to implement this function first.
AEL_GFp_fredXXX() - Fast reduction routines are based on the fact that a special prime is used. All fast reductions implemented in AEL are for the NIST curves [7] and follow Algorithms 2.27-2.31 in [1].
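The idea of fast reduction can be shown for $p_{192} = 2^{192} - 2^{64} - 1$. Since $2^{192} \equiv 2^{64} + 1 \pmod{p_{192}}$, the upper half of a 384-bit product can be folded into the lower half using only additions. The sketch below follows the structure of Algorithm 2.27 in [1], but it is not the code of AEL_GFp_fred192(): it assumes 64-bit limbs for brevity, whereas the library works with 32-bit words, and the names fred192, add_mod and the helpers are hypothetical.

```c
#include <stdint.h>

/* p192 = 2^192 - 2^64 - 1 as three 64-bit limbs, least significant limb first */
static const uint64_t P192[3] = {
    0xFFFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFEull, 0xFFFFFFFFFFFFFFFFull
};

static int geq3(const uint64_t *a, const uint64_t *b)        /* a >= b ? */
{
    for (int i = 2; i >= 0; i--)
        if (a[i] != b[i]) return a[i] > b[i];
    return 1;
}

static uint64_t add3(uint64_t *r, const uint64_t *a, const uint64_t *b)
{
    uint64_t carry = 0;
    for (int i = 0; i < 3; i++) {
        uint64_t s  = a[i] + b[i];
        uint64_t c1 = (s < a[i]);
        r[i] = s + carry;
        carry = c1 + (r[i] < s);
    }
    return carry;                                             /* carry out */
}

static void sub3(uint64_t *r, const uint64_t *a, const uint64_t *b)
{
    uint64_t borrow = 0;
    for (int i = 0; i < 3; i++) {
        uint64_t d  = a[i] - b[i];
        uint64_t b1 = (a[i] < b[i]);
        r[i] = d - borrow;
        borrow = b1 + (d < borrow);
    }
}

/* r = (r + s) mod p192, assuming r < p192 and s < 2^192 */
static void add_mod(uint64_t *r, const uint64_t *s)
{
    uint64_t t[3] = { s[0], s[1], s[2] };
    if (geq3(t, P192)) sub3(t, t, P192);          /* bring s into [0, p-1] */
    if (add3(r, r, t) || geq3(r, P192))           /* sum is below 2p */
        sub3(r, r, P192);                         /* so one subtraction suffices */
}

/* c is a 384-bit number in six 64-bit limbs; r receives c mod p192 */
void fred192(uint64_t r[3], const uint64_t c[6])
{
    const uint64_t s1[3] = { c[0], c[1], c[2] };  /* low half */
    const uint64_t s2[3] = { c[3], c[3], 0 };     /* c3 * (2^64 + 1) */
    const uint64_t s3[3] = { 0, c[4], c[4] };     /* c4 * (2^128 + 2^64) */
    const uint64_t s4[3] = { c[5], c[5], c[5] };  /* c5 * (2^128 + 2^64 + 1) */
    r[0] = r[1] = r[2] = 0;
    add_mod(r, s1);
    add_mod(r, s2);
    add_mod(r, s3);
    add_mod(r, s4);
}
```

As a check, the input $2^{192}$ (only c[3] = 1) reduces to $2^{64} + 1$, and the input $2^{384} - 1$ reduces to $2^{128} + 2^{65}$, which agrees with $(2^{64} + 1)^2 - 1$. No division and no multiplication is needed, which is why the fast routines are preferred over a general reduction whenever the prime has this special form.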
3.3
Point operation routines
This part of the library represents the third and fourth level of our ECC pyramid. The goal is to provide a routine for point multiplication, which is used in the next section for ECC protocols.
AEL_ECC_GFp_pointEC() - This routine takes the domain parameters of a curve and one point and tests whether the point is on the curve. Technically it uses only the simplified Weierstrass equation (2.3). After substitution it checks whether both sides of the equation are equal. The point at infinity ($\infty$) is always on the curve.
AEL_ECC_GFp_AFtoJC() - We often need a transformation between different coordinate systems. The optimized functions for point addition, doubling and multiplication described later in this section have different input and output coordinate types. The easiest transformation is from affine to Jacobian coordinates. To the affine point $(x, y)$ we just add a third coordinate equal to 1, so on the output we get the point $(x : y : 1)$ in Jacobian coordinates.
AEL_ECC_GFp_JCtoAF() - This transformation changes Jacobian coordinates back to affine. The input point $(X : Y : Z)$ transforms to the point $(X/Z^2, Y/Z^3)$ with affine coordinates. If $Z$ is zero it returns $\infty$, and if all numbers are zero then the point cannot be transformed and the error must be reported to higher-level routines.
AEL_ECC_GFp_add() - This routine takes two points as inputs. One is in the affine and the second in the Jacobian coordinate system. The result is also in Jacobian coordinates and it is evaluated using a modified (2.4). The modification (using Jacobian coordinates) provides more speed. The function is an implementation of Algorithm 3.22 from [1]. It requires the coefficient $a$ from the domain parameters to be equal to $-3$ (or $p - 3$). In the future, when we want to use randomly generated curves, we need to add support for curves with $a \neq -3$.
AEL_ECC_GFp_dbl() - This function takes a point in Jacobian coordinates and doubles it (modified (2.5)). The result is in Jacobian coordinates. As for addition, this function is implemented only for $a = -3$. Algorithm 3.23 from [1] was used.
AEL_ECC_GFp_mul() - Point multiplication is the most important function in this section. It is the heart of ECC. The basic implementation for evaluating $R = d * P$ uses point addition, point doubling and bit scanning of the m-bit multiplier $d$:
    set R to the point at infinity
    for i from m-1 down to 0 do
        R = 2R                            (using point doubling)
        if d_i == 1 then R = R + P        (using point addition)
    return R
However, this algorithm is very vulnerable to side-channel attacks (e.g. [19]). If the multiplier $d$ holds secret information, this information can be obtained by measuring time or power consumption, because the duration of one loop iteration depends on the currently scanned bit of the multiplier. Countermeasures are discussed and implemented in Section 3.7.
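The source of the leak can be made visible by counting the point operations instead of executing them. The following sketch is only an illustration: op_count and count_point_ops are hypothetical names, and the multiplier is assumed to be stored least significant word first in 32-bit words, as described for the basic routines.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { unsigned doublings, additions; } op_count;

/* Counts the point operations which the basic left-to-right method performs
   for an m-bit multiplier d stored LSW first in 32-bit words. */
op_count count_point_ops(const uint32_t *d, size_t m)
{
    op_count n = { 0, 0 };
    for (size_t i = m; i-- > 0; ) {               /* scan bits m-1 down to 0 */
        n.doublings++;                            /* executed for every bit */
        if ((d[i / 32] >> (i % 32)) & 1u)
            n.additions++;                        /* executed only for 1 bits */
    }
    return n;
}
```

For the 4-bit multipliers 1011 and 0001 the number of doublings is 4 in both cases, but the number of additions is 3 and 1. The running time therefore reveals the Hamming weight of the secret multiplier, and a power trace can reveal the position of each individual 1 bit.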
3.4
Protocol routines
In this section we create one simple protocol, ECIES. It is used as an example. Other protocols such as ECDSA can be added in the future.
AEL_key_gen() - This function generates key pairs and also performs the key verification. As input it only needs the domain parameters. In the first step we choose a random number $d$ from $GF(p)$ which is lower than the order $n$. This number represents the private key. If we use AEL_prng_multi() to generate the random number we need to check that the generated number is smaller than the order $n$. In the second step we evaluate $Q = d * P$. The point $Q$ is called the public key. It is very important to check that key pairs are not vulnerable. The verification consists of 4 steps:
1. Verify that $Q \neq \infty$
2. Verify that $x_Q, y_Q \in GF(p)$
3. Verify that $x_Q, y_Q$ satisfy the simplified Weierstrass equation (2.3) with the coefficients $a, b$ from the domain parameters.
4. Verify that $n * Q = \infty$
If the cofactor $h = 1$ (which is true for all NIST curves) then the first 3 steps imply the fourth one and there is no need to perform the expensive point multiplication. If the verification fails we must change the generated private key and repeat all the steps one more time.
AEL_order_corr() - This function is used when any of the verifications used in the protocol routines fails. It decrements the input number using $GF(n)$ arithmetic in order to obtain a number lower than the order. This decremented number is used for a new attempt to pass the verification process.
AEL_ECIES_enc() - This is the encryption algorithm of the ECIES protocol. The inputs of the function are the domain parameters, the public key $Q$ and a plain-text message $m$. It returns a cipher-text which consists of 3 parts: $R$ - a point on the elliptic curve, $C$ - the ciphered message and $t$ - the MAC code. We need to have 3 cryptographic primitives prepared:
KDF - Key derivation function which derives 2 keys from the input using a hash function.
ENC, DEC - Encryption and decryption algorithm using a symmetric key.
MAC - Message authentication code such as HMAC.
In this implementation of ECIES we use XTEA-based cryptographic primitives. The implementation of the XTEA algorithms was taken from the code accompanying an article published in Phrack #63. The ECIES encryption algorithm can be performed in 4 steps:
1. Create a random number $k \in [1, n-1]$.
2. Evaluate $R = k * P$ and $Z = h * k * Q$. If $Z = \infty$ then go to step 1. Actually we do not need to create the random number again. We can just correct the first one using the order correction function.
3. Create keys $k_1, k_2$ with the KDF using the points $Z$ and $R$ as input.
4. Compute $C = ENC_{k_1}(m)$ and $t = MAC_{k_2}(C)$.
AEL_ECIES_dec() - The reverse function to ECIES encryption. It uses the same cryptographic primitives. The inputs are the domain parameters, the private key $d$ and the cipher-text $(R, C, t)$. It decrypts the message or rejects it if the verifications fail. The algorithm is:
1. Validate the point $R$ (like public key validation). If the validation fails, reject the message.
2. Evaluate $Z = h * d * R$. If $Z = \infty$, reject the message.
3. Create keys $k_1, k_2$ with the KDF using the points $Z$ and $R$ as input.
4. Compute $t' = MAC_{k_2}(C)$. If $t' \neq t$, reject the message.
5. Decrypt the message $m = DEC_{k_1}(C)$ and return the plain-text message $m$.
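The first three steps of the public key verification described for AEL_key_gen(), which are also used to validate the point R during decryption, can be sketched for a single-word prime as follows. This is an illustration only. The types toy_curve and toy_point and the function validate_public_key are assumptions, the prime is assumed to be smaller than $2^{31}$, and step 4 is skipped because the cofactor is assumed to be 1.

```c
#include <stdint.h>

typedef struct { uint32_t p, a, b; } toy_curve;         /* y^2 = x^3 + a*x + b over GF(p) */
typedef struct { uint32_t x, y; int inf; } toy_point;   /* inf != 0: point at infinity */

/* Returns 1 if Q passes verification steps 1 to 3, otherwise 0.
   Assumes p < 2^31 and a, b already reduced modulo p. */
int validate_public_key(const toy_curve *ec, const toy_point *q)
{
    uint64_t lhs, x2, rhs;

    if (q->inf)
        return 0;                                   /* 1. Q must not be infinity */
    if (q->x >= ec->p || q->y >= ec->p)
        return 0;                                   /* 2. coordinates must be in GF(p) */

    lhs = ((uint64_t)q->y * q->y) % ec->p;          /* y^2 */
    x2  = ((uint64_t)q->x * q->x) % ec->p;          /* x^2 */
    rhs = (x2 * q->x + (uint64_t)ec->a * q->x + ec->b) % ec->p;
    return lhs == rhs;                              /* 3. equation (2.3) must hold */
}
```

On the toy curve $y^2 = x^3 + 2x + 2$ over $GF(17)$ the point (5, 1) is accepted, while (5, 2) is rejected because it does not satisfy the equation and (17, 1) is rejected because its x coordinate is not a field element. Rejecting such points matters, because computing with a point that is not on the intended curve can leak information about the private key.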
3.5
Reference C implementation
Now we have all routines of the library prepared in C code. We can use this implementation as a reference implementation and compare our results against it during the process of optimization. We can check whether our results are correct and we can measure the speed-up of the library. This reference implementation is also ported to the PC platform. It allows us to test the library on a different CPU architecture and to make sure that our implementation is fully portable. The executables can be found on the CD in Appendix A.
3.6
Optimization of low-level routines
The main purpose of this optimization is to speed up the whole library. Our goal is to optimize the library for speed using assembler and at the same time keep as much of the code as possible in C. First we have to find the performance bottlenecks so that we can optimize them. The Keil µVision IDE contains a very useful tool called Performance Analyzer. During program simulation it shows the time spent inside each routine. With this tool we can find which routines are the most critical. Table 3-1 shows the times spent in the pure C code routines. The testing target in the IDE simulator is ARM7TDMI and the analyzer was set to show times during point multiplication. When we take a look at Table 3-1 we can see that the multiplication routine takes most of the computing time. Basic addition and subtraction are also significant.
Table 3-1 Critical functions in the C library. The table shows the name of the function, the number of calls and the share of time that the program spent inside that function. The measurement was performed during point multiplication on the ARM7TDMI target with the EC192 elliptic curve. Only functions that take 1% of the running time or more are shown.
Function           Calls    Time in %
polar_mul          2,529    51%
AEL_add_multi      9,404    21%
AEL_sub_multi      3,836    8%
AEL_GFp_add        8,633    6%
AEL_GFp_fred192    2,529    5%
AEL_GFp_mul        2,529    4%
AEL_cmp            6,107    1%
AEL_ECC_GFp_dbl    192      1%
AEL_GFp_sub        1,310    1%
These functions are optimized in this section to be as fast as possible. For the best optimization we need to develop assembler code for each processor individually. The other functions are not critical so we do not deal with them here and we use the prepared C code, as shown in Figure 3-1.
[Figure 3-1: pyramid with layers, from top to bottom: ECC Protocols; EC Point Multiplication; EC Point Addition and Doubling; Finite Field Operations; Basic W-bit Operations. Layers are marked as C or ASM.]
Figure 3-1 Pyramidal hierarchy of elliptic curve cryptography with labels showing which part of the library is written in C code and which part will be optimized in assembler code.
3.6.1
Instruction set analysis
The first step in optimizing code with assembler is to become familiar with the instruction set of the processor we are using. Let us start with the ARM Cortex-M0 and Cortex-M1 processors. They use the same architecture and the same instruction set (see [29] and [30]). It is a subset of the Thumb and Thumb2 sets. It contains only 56 instructions so it can be mastered easily. Compared to the Cortex-M3 or ARM7TDMI instruction sets it does not include advanced conditionally executed instructions. Only the branch instruction (B) can have condition suffixes. Also, register shifts cannot be performed as a part of another instruction. A multiplication can be done in 1 instruction cycle but it produces only a 32 bit result. These properties make the instruction set easy to learn and use. On the other hand almost all instructions can work only with the lower 8 registers. Only the MOV, ADD and CMP instructions can also work with the higher registers. In general, more complex assembler code needs more instructions and more cycles than on ARM7TDMI or Cortex-M3.
ARM7TDMI uses the ARMv4T architecture, which is no longer supported. It contains instructions from the ARM and also from the Thumb instruction set. The programmer can switch between these two modes. This makes assembler programming for ARM7TDMI a little bit complicated. The benefits are that we can use a multiplier with a 64 bit result, we can shift registers at no time cost and a lot of instructions can have suffixes for conditional execution.
The Cortex-M3 processor uses the ARMv7-M architecture. Its instruction set has almost all instructions from the Thumb and Thumb2 sets. The benefits are similar to ARM7TDMI. Moreover, Cortex-M3 does not have an ARM mode so it is much easier for the programmer to optimize the code.
Porting code between different processors is quite difficult in some cases. The porting
between M0 and M1 cores is without problems because they have the same instruction set. Code for M0/M1 works on the M3 core and should work on ARM7TDMI in Thumb mode, but this is not a very wise solution: M3 and ARM7TDMI have more powerful instructions, and if we use only the M0/M1 subset the final code will not be well optimized. A harder task is to port ARM7TDMI code to Cortex-M3. The ARM mode code needs to be rewritten to Thumb instructions. For example, the LDM and STM instructions (load or store multiple words) can have the suffixes IA, IB, DA or DB on ARM7TDMI (increment after/before, decrement after/before). Cortex-M3 has only the LDMIA, STMIA, LDMDB and STMDB instructions, so the remaining variants need to be rewritten using these 4. If we need a conditional instruction in Cortex-M3 assembler code we must put it into an IF-THEN (IT) block. The hardest task is to port ARM7TDMI or Cortex-M3 code to M1 or M0. In almost all cases a direct port is impossible, so for these cores we need to develop new optimized code with a completely different approach.
3.6.2
Optimization of addition and subtraction
As was said in Section 3.1, the C code for basic addition and subtraction is already optimized. Now we try to optimize these routines in assembler. Multi-precision addition and subtraction are very simple operations so we use inline assembler and put the assembler code directly into the C functions. The main idea is to set the terminating condition, read 1 word from both operands, add/subtract with carry/borrow, store the result and check the terminating condition. The next two listings show the structure of the inline assembler entry in C code for ARM Cortex-M0 and Cortex-M3 so we can compare them and notice the differences. The first listing is the addition for Cortex-M0:
    asm volatile (
        "    lsl %[tmp], %[tmp], #2   \n\t"  // logical left shift of tmp by 2
        "    add %[tmp], %[a]         \n\t"  // store ending register into len
        "    add %[tmp], #0           \n\t"  // clear carry

        "0:  ldm %[a]!, {%[ta]}       \n\t"  // load word from a
        "    ldm %[b]!, {%[tb]}       \n\t"  // load word from b
        "    adc %[ta], %[tb]         \n\t"  // addition with carry
        "    stm %[c]!, {%[ta]}       \n\t"  // store result to c
        "    lsl %[ta], %[a], #0      \n\t"  // move register a to register ta
        "    eor %[ta], %[tmp]        \n\t"  // test if a reaches end (len)
        "    bne 0b                   \n\t"  // jump to 0: if it does not
        "    mov %[tmp], #0           \n\t"  // set tmp to 0
        "    adc %[tmp], %[tmp]       \n\t"  // store new carry to car
        : [c] "+l" (c), [a] "+l" (a), [b] "+l" (b),
          [tmp] "+l" (tmp), [ta] "=l" (ta), [tb] "=l" (tb)
        :
        : "cc", "memory"
    );
Second code is for Cortex-M3:

    asm volatile (
        "    add  %[tmp], %[a], %[tmp], LSL #2  \n\t"  // store ending register into len
        "    adds %[tmp], #0                    \n\t"  // clear carry

        "0:  ldm  %[a]!, {%[ta]}                \n\t"  // load word from a
        "    ldm  %[b]!, {%[tb]}                \n\t"  // load word from b
        "    adcs %[ta], %[tb]                  \n\t"  // addition with carry
        "    stm  %[c]!, {%[ta]}                \n\t"  // store result to c
        "    teq  %[tmp], %[a]                  \n\t"  // test if (a) reaches end (tmp)
        "    bne  0b                            \n\t"  // jump to 0: if it does not
        "    mov  %[tmp], #0                    \n\t"  // set tmp to 0
        "    adc  %[tmp], #0                    \n\t"  // store new carry to tmp
        : [c] "+r" (c), [a] "+r" (a), [b] "+r" (b),
          [tmp] "+r" (tmp), [ta] "=r" (ta), [tb] "=r" (tb)
        :
        : "cc", "memory"
    );
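For comparison, the same loop can be written in portable C. The following is a minimal sketch and not the library's AEL_add_multi() routine: the name mp_add and the convention that the least significant word is stored first are our assumptions. A 64 bit intermediate takes the role of the carry flag used by ADC.

```c
#include <stdint.h>
#include <stddef.h>

/* c = a + b over len 32-bit words, least significant word first.
 * Returns the carry out of the most significant word (0 or 1). */
uint32_t mp_add(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t len)
{
    uint32_t carry = 0;

    for (size_t i = 0; i < len; i++) {
        /* 64-bit sum of two words and the incoming carry */
        uint64_t t = (uint64_t)a[i] + b[i] + carry;
        c[i] = (uint32_t)t;            /* low 32 bits are the result word */
        carry = (uint32_t)(t >> 32);   /* bit 32 is the outgoing carry    */
    }
    return carry;
}
```

The assembler versions do the same work per word with one ADC instruction, which is why the inline assembler pays off on all three cores.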
3.6.3
Optimization of multiplication
We pay a great deal of attention to the multiplication routine. According to Table 3-1 it takes most of the processor time. Our effort is to speed this bottleneck up as much as we can so the whole library will be fast. We have to develop assembler code for each of the processors we have. Using inline assembler would be complicated so we create separate assembler files. The code for ARM Cortex-M0 and Cortex-M1 is the same. ARM7TDMI can run the code for M0/M1 but it will not be optimized, so we need to create code designed especially for ARM7TDMI. We can try to avoid ARM mode so porting to Cortex-M3 will be easier. There are many approaches to multi-precision multiplication [16]:
Operand scanning method - also referred to as schoolbook multiplication; it is depicted in Figure 3-2.
Product scanning method - also referred to as Comba multiplication; it is depicted in Figure 3-3.
Hybrid method - a combination of schoolbook and Comba multiplication; it is depicted in Figure 3-4.
Operand caching method - focused on minimizing accesses to memory. In some cases it could be the fastest method, but its big disadvantages are the complexity of the code and the number of registers used. We do not use this method in the library.
[Figure 3-2: diagram of partial products A[0]B[0] to A[7]B[7] and result words C[0] to C[14], processed row by row.]
Figure 3-2 Graphical representation of multi-precision schoolbook multiplication C=A*B. This implementation is not very useful for our purpose because of the high number of memory accesses.
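The operand scanning method can be sketched in C as follows. This is only an illustration under our own assumptions (function name, 32 bit words, least significant word first) and not the library code. Every inner step reads and rewrites a word of the result, which is the source of the many memory accesses.

```c
#include <stdint.h>
#include <stddef.h>

/* c[0..2*len-1] = a[0..len-1] * b[0..len-1], operand scanning (schoolbook).
 * Words are 32 bits wide, least significant word first. */
void mp_mul_operand_scanning(uint32_t *c, const uint32_t *a,
                             const uint32_t *b, size_t len)
{
    for (size_t i = 0; i < 2 * len; i++)
        c[i] = 0;

    for (size_t i = 0; i < len; i++) {        /* one row per word of b */
        uint32_t carry = 0;
        for (size_t j = 0; j < len; j++) {
            /* a[j]*b[i] + c[i+j] + carry always fits into 64 bits */
            uint64_t t = (uint64_t)a[j] * b[i] + c[i + j] + carry;
            c[i + j] = (uint32_t)t;           /* result word is rewritten */
            carry = (uint32_t)(t >> 32);
        }
        c[i + len] = carry;
    }
}
```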
[Figure 3-3: diagram of partial products A[0]B[0] to A[7]B[7] and result words C[0] to C[14], processed column by column.]
Figure 3-3 Comba multiplication is better for our application. It calculates the output number word by word, finishing each word before writing it to memory. This approach can be good in some cases but it still has a lot of operand memory accesses.
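A minimal C sketch of the product scanning (Comba) method is shown below. The function name and the word order (least significant word first) are our assumptions and the code is not taken from the library. Each result word is accumulated in three registers and stored exactly once.

```c
#include <stdint.h>
#include <stddef.h>

/* c[0..2*len-1] = a[0..len-1] * b[0..len-1], product scanning (Comba).
 * Requires len >= 1. Words are 32 bits, least significant word first. */
void mp_mul_comba(uint32_t *c, const uint32_t *a, const uint32_t *b, size_t len)
{
    uint32_t r0 = 0, r1 = 0, r2 = 0;   /* three-word column accumulator */

    for (size_t k = 0; k < 2 * len - 1; k++) {
        size_t i_min = (k < len) ? 0 : k - len + 1;
        size_t i_max = (k < len) ? k : len - 1;

        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            uint64_t t = (uint64_t)r0 + (uint32_t)p;
            r0 = (uint32_t)t;
            t = (uint64_t)r1 + (uint32_t)(p >> 32) + (t >> 32);
            r1 = (uint32_t)t;
            r2 += (uint32_t)(t >> 32);
        }
        c[k] = r0;                     /* column finished: one store */
        r0 = r1;
        r1 = r2;
        r2 = 0;
    }
    c[2 * len - 1] = r0;
}
```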
3.6.4
Optimization of multiplication for ARM Cortex-M0 and Cortex-M1
The first processor for which we optimize multiplication is Cortex-M0. Its multiplication instruction produces only a 32 bit result. There are two approaches which we can follow. The first one is to create a multiplication core with a 64 bit result and then use the Comba method with this core. The second approach treats all numbers as sequences of 16 bit words. All load, store and multiplication instructions use 16 bit operands and with these 16 bit
[Figure 3-4: diagram of partial products A[0]B[0] to A[7]B[7] and result words C[0] to C[14], grouped into blocks.]
Figure 3-4 One of the best implementations for our library is the hybrid multiplication method. It uses the whole register set that ARM processors can provide. It loads multiple operand words from memory (e.g. for the Cortex-M3 application it is two 32 bit words from each operand). It performs schoolbook multiplication with these operand words stored in registers. The M3 processor has space for two words from each operand so it computes a 64 × 64 bit multiplication. These multiplications are combined into the whole multi-precision multiplication using the Comba method. In this picture 4 words from each operand are used. The method minimizes memory accesses, which leads to a speed-up of the multiplication.
instructions we use the hybrid method of multiplication. The two assembler functions are:
AEL_mulc() - This function uses the first approach. It multiplies two multi-precision numbers and the result of the multiplication is twice as long as the multiplicands. The core multiplies 32 × 32 bit numbers into a 64 bit result. The hybrid method cannot be used with this core because of the lack of registers, so we must use the Comba method. Operand loading is shown in this code:
    mulc_h1_lp2:
        ldmia   pxt!, {x1}      @ read one word from x to x1 reg
                                @ and increment x ptr
        sub     pyt, #4         @ decrement y ptr (one word)
        ldr     y1, [pyt]       @ read one word from y to y1
AEL_mulh() - This function uses the second approach. We can call this method half-hybrid because it is a hybrid method using 16 bit half words. It is slightly slower than the first approach. Memory load and store instructions handle only 16 bit half words so we need twice as many of them as in the Comba method. Also, using 32 bit registers to store 16 bit words is very inefficient. Loading of 16 bit words is shown in the
next code. As you can see, operand loading takes many more instructions than in the Comba method above.
    mulh_h1_lp2:
        ldrh    xL, [pxt]       @ load low halfword to xL
        ldrh    xH, [pxt, #2]   @ load high halfword to xH
        add     pxt, pxt, #4    @ increment x pointer
        sub     pyt, pyt, #4    @ decrement y pointer
        ldrh    yL, [pyt]       @ load low halfword to yL
        ...
        ldrh    yH, [pyt, #2]   @ load high halfword to yH
Both of these assembler codes can also be used for the ARM Cortex-M1 processor because of the identical instruction set. No changes are needed. All results, code sizes and numbers of cycles needed for the algorithms can be found in Chapter 4.
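The first approach needs a 32 × 32 bit multiplication core with a 64 bit result, built from multiplications that return only 32 bits. The idea can be sketched in C as below. This is an illustration under our own assumptions and not the AEL_mulc() core: the function name is hypothetical and the real routine keeps everything in registers.

```c
#include <stdint.h>

/* 32 x 32 -> 64 bit product using only 32-bit-result multiplications,
 * as on a core whose MUL instruction returns the low 32 bits. */
void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t aL = a & 0xFFFFu, aH = a >> 16;
    uint32_t bL = b & 0xFFFFu, bH = b >> 16;

    /* four 16 x 16 bit products, each fits into 32 bits */
    uint32_t ll = aL * bL;
    uint32_t lh = aL * bH;
    uint32_t hl = aH * bL;
    uint32_t hh = aH * bH;

    /* sum of the two middle products, weight 2^16; may wrap once */
    uint32_t mid = lh + hl;
    uint32_t mid_carry = (mid < lh);          /* wrapped: worth 2^48 */

    uint32_t low = ll + (mid << 16);
    uint32_t low_carry = (low < ll);          /* carry into the high word */

    *lo = low;
    *hi = hh + (mid >> 16) + (mid_carry << 16) + low_carry;
}
```

For example, multiplying 0xFFFFFFFF by itself gives the high word 0xFFFFFFFE and the low word 0x00000001.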
3.6.5
Optimization of multiplication for ARM7TDMI and ARM Cortex-M3
Optimization in assembler for ARM7TDMI was already done by Ing. Bán in [14]. He used the Comba method, which is not the best choice for this kind of architecture. I used the hybrid method as suggested in [15] instead. The core of the hybrid multiplication:
    core1:
        ldmia   src1!, {mlp0, mlp1}     @ load 2 words from a (aL, aH)
        ldmda   src2!, {mlp2}           @ load first word from b (bH)

        @ multiply high words (aH * bH)
        umull   pro0, pro1, mlp1, mlp2
        adds    acu2, acu2, pro0        @ add result to upper 64 bit of acc
        adcs    acu3, acu3, pro1
        addcs   carc, carc, #0x10000    @ store carry to third byte of CC
        @ mult. low with high word (aL * bH)
        umull   pro0, pro1, mlp0, mlp2
        adds    acu1, acu1, pro0        @ add result to 32-95 bit of acc
        adcs    acu2, acu2, pro1
        addcs   carc, carc, #0x100      @ store carry to second byte of CC

        ldmda   src2!, {mlp2}           @ load second word from b (bL)

        @ mult. high with low word (aH * bL)
        umull   pro0, pro1, mlp1, mlp2
        adds    acu1, acu1, pro0        @ add result to 32-95 bit of acc
        adcs    acu2, acu2, pro1
        addcs   carc, carc, #0x100      @ store carry to second byte of CC
        @ multiply 2 lower words (aL * bL)
        umull   pro0, pro1, mlp0, mlp2
        adds    acu0, acu0, pro0        @ add result to lower 64 bit of acc
        adcs    acu1, acu1, pro1
        adc     carc, carc, #0          @ add carry to carry catcher (CC)
The core of this implementation of the hybrid method takes 4 words from the operands (2 and 2) and multiplies them together. The result is stored in a 128 bit accumulator formed by 4 registers (acu0, acu1, acu2 and acu3). Carries are stored in a compressed carry catcher register (carc). The compression is achieved by placing the carry bits of different accumulator words into different bytes of the carry catcher register. It allows us to save registers and use the hybrid method. Another problem with this implementation occurs when the operands have odd lengths. We can use zero padding or we can multiply the rest of the numbers separately. The routines in the library for ARM7TDMI multiplication are:
acl_p_mul_c() - Original Comba implementation by Jaroslav Bán [14]. It is not very fast. It was used as a reference. Our goal was to beat the time of this routine.
AEL_mulh() - This function uses the hybrid method. The core of the multiplication is shown in the previous listing. As expected, this implementation is faster than the acl_p_mul_c() function. This function cannot handle odd lengths so the input numbers need to be zero padded to an even length.
acl_p_mul_hE() - Jaroslav Bán's implementation of the hybrid method. It is even faster than AEL_mulh() but it still cannot multiply operands with odd length.
acl_p_mul_h() - The best implementation of multiplication on ARM7TDMI. Odd length support was added by Jaroslav Bán and a further speed-up by Martin Petrvalský.
To port these ARM7TDMI algorithms to Cortex-M3 we first need to replace the unsupported instructions. For example, this code for ARM7TDMI:
        ldmia   src1!, {pro1, pro2}     @ load 2 words from src1
        ldmda   src2!, {pro3, pro4}     @ load 2 words from src2
        ...
                                        @ if carry set then do:
        addcs   carry, #1               @ add 1 to carry catcher (cc)
It needs to be changed to code for Cortex-M3:

        ldmia   src1!, {pro1, pro2}     @ load 2 words from src1
        add     src2, #4                @ used 3 instructions as a ...
        ldmdb   src2!, {pro3, pro4}     @ ... replacement of ldmda ...
        sub     src2, #4                @ ... instruction to load 2 words
                                        @ tmp1, sum1 <- pro1 * pro2
        ...
        it      cs                      @ if carry set then do:
        addcs   carry, #1               @ add 1 to carry catcher (cc)
As you can see, we cannot use the LDMDA instruction and we are forced to use LDMDB. The second part of the code shows that conditionally executed instructions must be placed in an IT block. All cycle counts and code sizes of these routines can be found in Chapter 4. The routines for ARM Cortex-M3 are:
AEL_mulc - This function is a port of acl_p_mul_c() to the Cortex-M3 processor. The Comba method is used.
AEL_mulh - This function is a port of acl_p_mul_h() to the Cortex-M3 processor. The hybrid method with support for odd lengths is used.
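The compressed carry catcher of the hybrid core can be modelled in C to see how the carries are collected and folded back. The sketch below is only a model under our own assumptions (the names hyb_acc, hyb_core and hyb_fold are hypothetical, and the accumulator is folded right after one core step), not the assembler routine. Each byte of carc counts carries out of one accumulator word, so it must be folded before any byte exceeds 255.

```c
#include <stdint.h>

typedef struct {
    uint32_t acu[4];   /* 128-bit accumulator, acu[0] least significant   */
    uint32_t carc;     /* byte 0: carries out of acu[1], byte 1: out of   */
                       /* acu[2], byte 2: out of acu[3]                   */
} hyb_acc;

/* acu[lo+1]:acu[lo] += a * b; a carry out of acu[lo+1] adds unit to carc */
static void hyb_mac(hyb_acc *s, int lo, uint32_t a, uint32_t b, uint32_t unit)
{
    uint64_t p = (uint64_t)a * b;
    uint64_t t = (uint64_t)s->acu[lo] + (uint32_t)p;
    s->acu[lo] = (uint32_t)t;
    t = (uint64_t)s->acu[lo + 1] + (uint32_t)(p >> 32) + (t >> 32);
    s->acu[lo + 1] = (uint32_t)t;
    if (t >> 32)
        s->carc += unit;
}

/* One 64 x 64 bit core step: four partial products into the accumulator. */
void hyb_core(hyb_acc *s, uint32_t aL, uint32_t aH, uint32_t bL, uint32_t bH)
{
    hyb_mac(s, 2, aH, bH, 0x10000u);
    hyb_mac(s, 1, aL, bH, 0x100u);
    hyb_mac(s, 1, aH, bL, 0x100u);
    hyb_mac(s, 0, aL, bL, 0x1u);
}

/* Add the caught carries back; returns the carry out of acu[3]. */
uint32_t hyb_fold(hyb_acc *s)
{
    uint64_t t = (uint64_t)s->acu[2] + (s->carc & 0xFFu);
    s->acu[2] = (uint32_t)t;
    t = (uint64_t)s->acu[3] + ((s->carc >> 8) & 0xFFu) + (t >> 32);
    s->acu[3] = (uint32_t)t;
    uint32_t top = ((s->carc >> 16) & 0xFFu) + (uint32_t)(t >> 32);
    s->carc = 0;
    return top;
}
```

Starting from a cleared state, one core step followed by hyb_fold() leaves the full 128 bit product of the two 64 bit inputs in acu[0..3].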
3.7
Optimization of high-level routines
This optimization is focused mainly on security rather than on speed. We discussed mathematical attacks in the second chapter; now we take a look at side channel attacks, other attacks and their countermeasures. The heart of ECC is the point multiplication. There are often confidential data on the input of this routine so we need to secure it against all known attacks. The attacks that we discuss in this section are:
Power analysis [19]. There are two types of power analysis. The first one is Simple Power Analysis (SPA). It is a measurement of the power consumption during operations with confidential data. The second one is Differential Power Analysis (DPA), which uses statistical methods to guess the confidential information.
Safe error attack [23]. The basic principle is to inject erroneous data during the computation. If the result is still correct it means that the affected code was a dummy operation (e.g. when the current bit of the confidential information is 0). If the result is incorrect it means that the injected data were used (e.g. when the bit is 1). This way we can extract the secret information.
The basic function for ECC point scalar multiplication, AEL_ECC_GFp_mul(), is vulnerable even to the SPA attack. When we take a closer look at the main part of the multiplication code we can see that when the bit of the multiplier is 0 it performs only a point doubling. If the bit is 1 it must do a point doubling and an addition. When we measure the power consumption of the processor we should clearly see where the bits of the multiplier are 0 or 1. This way we can obtain the whole confidential value.
    for (j = len - 1; j >= 0; j--)              // 2. from length of k downto 0
        for (i = 8*sizeof(uint) - 1; i >= 0; i--)
        {
            // 2.1. Q <- 2Q
            AEL_ECC_GFp_dbl(Q, Q, ec, tmp);
            // 2.2. if k[j-word](i-bit) is 1
            if (((k[j] >> i) & 0x00000001) == 1)
                // do Q <- Q + P
                AEL_ECC_GFp_add(Q, Q, P, ec, tmp);
        }
We can use a dummy operation to balance the time and power consumption. This secures the code against SPA but makes it vulnerable to the safe error attack. There is an algorithm called the Montgomery ladder ([17], [20], [21] and [22]) which solves this problem. We can see in the next code that the ladder must perform both operations, point addition and point doubling, independently of the current bit of the multiplier k. Pseudo-code of the Montgomery ladder core:
    Through all bits of multiplier k do:
        if i-th bit of k is 0
            Q = Q + S and S = 2*S
        else
            S = S + Q and Q = 2*Q
    end.
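To make the control flow of both algorithms easy to try out, the sketch below runs them in a toy additive group (integers modulo TOY_M) instead of on an elliptic curve. A "point" is just a number, addition is modular addition and the neutral element is 0. All names and the modulus are our assumptions for illustration. The ladder starts with S = 0 and Q = P and keeps Q - S = P, so the result k*P ends in S.

```c
#include <stdint.h>

#define TOY_M 1000003u   /* toy modulus, stands in for the curve group */

static uint32_t toy_add(uint32_t x, uint32_t y)
{
    return (uint32_t)(((uint64_t)x + y) % TOY_M);
}

static uint32_t toy_dbl(uint32_t x)
{
    return toy_add(x, x);
}

/* Basic double-and-add: the addition happens only for 1 bits of k. */
uint32_t toy_mul_basic(uint32_t k, uint32_t P)
{
    uint32_t Q = 0;                       /* neutral element */
    for (int i = 31; i >= 0; i--) {
        Q = toy_dbl(Q);
        if ((k >> i) & 1u)
            Q = toy_add(Q, P);
    }
    return Q;
}

/* Montgomery ladder: one addition and one doubling for every bit of k. */
uint32_t toy_mul_ladder(uint32_t k, uint32_t P)
{
    uint32_t S = 0, Q = P;                /* invariant: Q - S = P */
    for (int i = 31; i >= 0; i--) {
        if (((k >> i) & 1u) == 0) {
            Q = toy_add(Q, S);
            S = toy_dbl(S);
        } else {
            S = toy_add(S, Q);
            Q = toy_dbl(Q);
        }
    }
    return S;
}
```

Both functions return (k * P) mod TOY_M. The ladder always executes 32 additions and 32 doublings, while the basic method executes 32 doublings and one addition per set bit of k.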
This countermeasure still leaves the algorithm vulnerable to DPA, which is a very powerful attack tool. As suggested in [17], we can build a countermeasure against DPA by randomizing the input data. This cancels the correlations and makes the algorithm strong against DPA. Finally, if we want to defeat fault injection attacks, at the end we need to check that the resulting point satisfies the curve equation, i.e. that it lies on the curve. In order to reduce memory requirements it is suggested to switch the point coordinates to Co-Z ([17], [22]). The Co-Z system uses the same Z coordinate for all points. This can save RAM and speed up the implementation because some of the operations can be evaluated in advance. In [17] only one coordinate is used in practice (X coordinate only system) so the speed and memory savings are even better. More speed can also be achieved if we combine point addition with point doubling. This is no problem when using the Montgomery ladder because it needs to perform both operations in every step. If we combine all the information of this section we get optimized functions for point multiplication which are an application of (3.1).
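$$
\begin{aligned}
X_1' &= V\left[(X_1 + X_2)(X_1^2 + X_2^2 - U + 2T_a) + T_b\right] - T_D' \\
X_2' &= U\left[(X_2^2 - T_a)^2 - 2X_2T_b\right] \\
T_D' &= UV \cdot T_D, \qquad T_a' = (UV)^2\,T_a, \qquad T_b' = (UV)^3\,T_b
\end{aligned}
\tag{3.1}
$$
where
$$
T_a = aZ^2, \quad T_b = 4bZ^3, \quad T_D = x_D Z, \quad U = (X_1 - X_2)^2, \quad V = 4X_2(X_2^2 + T_a) + T_b
$$
and $x_D$ is the affine x coordinate of the difference of the two points.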
These equations combine point addition and doubling. The two operations can share some of the intermediate results, which saves computing time. On the other hand we need to store these pre-computed values so more RAM is used. Compared to the classic methods of point addition and doubling, space for one more multi-precision number is needed. This suits the Montgomery ladder very well, because in every step of the ladder we need to do an addition and a doubling. In the classic method of point multiplication with an $n$ bit multiplier we need $n$ doublings and $n/2$ additions on average. In the ladder we have to use both operations $n$ times (which means $n/2$ more additions). When we use the joined add and double operation in the ladder we can save time. If we assume that an addition takes exactly the same time as a doubling, the classic method needs $1.5n$ operations and the ladder $2n$, so the classic method saves 25% of the ladder's running time (the ladder takes about 33% longer). However, if we use the joined equations the ladder is slower by only approximately 10%, which is a great speed-up.
AEL_ECC_GFp_Xonlyadddbl() - Joined operation of point addition and point doubling. This implementation uses Co-Z with the X coordinate only system. It is faster and it needs less RAM than separate add and double functions because some numbers can be prepared in advance and shared within one function.
AEL_ECC_GFp_XonlytoAF() - This function recovers the affine coordinates from the X coordinate only system with Co-Z. First it converts these coordinates to standard projective coordinates. Then it checks if the point lies on the curve (checks using the curve
equation) as a countermeasure against fault injection attacks. Finally it converts the projective coordinates to affine ones.
AEL_ECC_GFp_Xonlymul() - Implementation of secure scalar point multiplication. The algorithm was taken from [17] (Algorithm 3). First it (optionally) performs point randomization as a countermeasure against DPA. Then it performs the point multiplication using the Montgomery ladder and the AEL_ECC_GFp_Xonlyadddbl() function. At the end it recovers the affine coordinates using the AEL_ECC_GFp_XonlytoAF() routine.
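The joined addition and doubling with X coordinates only and a common Z can be tried out on a very small curve. The sketch below is our own illustration and not AEL_ECC_GFp_Xonlyadddbl(): it uses the toy curve $y^2 = x^3 + x + 1$ over the field with 23 elements and the standard differential addition and doubling formulas, with the state $(X_1, X_2, T_D, T_a, T_b)$ where $T_D = x_D Z$, $T_a = aZ^2$ and $T_b = 4bZ^3$. All names are assumptions. With $P = (3, 10)$ we computed by hand $2P = (7, 12)$, $3P = (19, 5)$, $4P = (17, 3)$ and $5P = (9, 16)$.

```c
#include <stdint.h>

#define FP 23u   /* toy prime, all values are kept reduced modulo FP */

static uint32_t fmul(uint32_t x, uint32_t y) { return (x * y) % FP; }
static uint32_t fadd(uint32_t x, uint32_t y) { return (x + y) % FP; }
static uint32_t fsub(uint32_t x, uint32_t y) { return (x + FP - y) % FP; }

typedef struct {
    uint32_t X1, X2;   /* X coordinates of P1 and P2 with common Z */
    uint32_t TD;       /* x_D * Z, x_D = x coordinate of P1 - P2   */
    uint32_t Ta;       /* a * Z^2                                  */
    uint32_t Tb;       /* 4 * b * Z^3                              */
} xz_state;

/* (P1, P2) -> (P1 + P2, 2*P2); the new common Z is U*V*Z (not stored). */
void xonly_adddbl(xz_state *s)
{
    uint32_t d   = fsub(s->X1, s->X2);
    uint32_t U   = fmul(d, d);                       /* (X1 - X2)^2 */
    uint32_t X1s = fmul(s->X1, s->X1);
    uint32_t X2s = fmul(s->X2, s->X2);
    uint32_t V   = fadd(fmul(4, fmul(s->X2, fadd(X2s, s->Ta))), s->Tb);
    uint32_t UV  = fmul(U, V);
    uint32_t TDn = fmul(UV, s->TD);                  /* x_D * Z'    */

    /* addition: V[(X1+X2)(X1^2 + X2^2 - U + 2Ta) + Tb] - TD' */
    uint32_t t   = fadd(fsub(fadd(X1s, X2s), U), fmul(2, s->Ta));
    uint32_t n1  = fadd(fmul(fadd(s->X1, s->X2), t), s->Tb);
    uint32_t X1n = fsub(fmul(V, n1), TDn);

    /* doubling: U[(X2^2 - Ta)^2 - 2*X2*Tb] */
    uint32_t e   = fsub(X2s, s->Ta);
    uint32_t n2  = fsub(fmul(e, e), fmul(2, fmul(s->X2, s->Tb)));
    uint32_t X2n = fmul(U, n2);

    uint32_t UV2 = fmul(UV, UV);
    s->X1 = X1n;
    s->X2 = X2n;
    s->TD = TDn;
    s->Ta = fmul(UV2, s->Ta);                        /* a * Z'^2    */
    s->Tb = fmul(fmul(UV2, UV), s->Tb);              /* 4b * Z'^3   */
}
```

Starting with $Z = 1$ from $P_1 = 3P$, $P_2 = 2P$ and $x_D = 3$, i.e. the state (19, 7, 3, 1, 4), one step gives $X_1' = 8$ and $X_2' = 10$ with the new common $Z' = 6$. Since $9 \cdot 6 \equiv 8$ and $17 \cdot 6 \equiv 10 \pmod{23}$, these are the x coordinates 9 of $5P$ and 17 of $4P$.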
3.8
Library testing
The last section in this chapter is dedicated to testing of the library. We test the 4 main parts of the library: multiplication, inversion, point operations and cryptographic protocols. Each of these parts has its own test routine which is called from main. The results of the tests can be found in Appendix C. The test projects are included on the CD in Appendix A. At the end of the section we take a closer look at the timing for each of the processors used. We discuss C code versus assembler, we compare simulations with on-board tests and finally we compare our library with other libraries.
3.8.1
Test of multiplication routines
First we discuss the routine for multiplication testing, test_MUL(). This routine measures the number of cycles for all multiplication algorithms in the library. It is possible to change the length of the word and the multiplied numbers. The numbers can be set manually or generated by a PRNG. The test routine writes the results of the multiplication and the number of cycles to the USART so we can examine if the results are correct and we can see the
time of execution in cycles. This testing tells us which algorithm is the fastest and which algorithm we should use. At the end of the routine we can run a long test. This test runs in an infinite loop. It takes two numbers from the PRNG and multiplies them using two different methods. Then it compares the results and if they are the same it continues with the next numbers from the PRNG. This process can test millions of numbers per second and allows us to discover errors which occur very rarely. Indeed, we have discovered such errors and they were always related to carry propagation.
3.8.2
Test of inversion routines
test_INV() is a function prepared especially for inversion testing. It takes a random number from the PRNG and inverts it. Then it checks if the product of the original number and the inverted number is equal to one. The results and the speed in processor cycles are written to the USART so it is possible to check if the results are correct. The routines that are tested are:
AEL_GFp_bininv()
AEL_GFp_RSbininv()
In the same way as in the multiplication test routine, there is a long test at the end of the inversion test function. It generates a number with the PRNG. Then it inverts the number using one of the inversion routines. Finally it inverts the number using Fermat's little theorem and compares the results. If they are the same it continues with another number. If not, the input and the inverted numbers are written to the USART. The inversion function which uses Fermat's little theorem is AEL_GFp_FERMinv().
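The idea of this cross-check can be sketched for single-word numbers in C. The code below is only an illustration with hypothetical names, not the library routines: inv_binary() is the textbook binary inversion algorithm for an odd prime p, inv_fermat() computes a^(p-2) mod p, and we assume p is smaller than 2^32 so that products fit into 64 bits.

```c
#include <stdint.h>

/* (a * b) mod p; valid for p < 2^32 and a, b < p */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (a * b) % p;
}

/* Inversion with Fermat's little theorem: a^(p-2) mod p, p prime. */
uint64_t inv_fermat(uint64_t a, uint64_t p)
{
    uint64_t r = 1, e = p - 2;
    a %= p;
    while (e) {
        if (e & 1)
            r = mulmod(r, a, p);
        a = mulmod(a, a, p);
        e >>= 1;
    }
    return r;
}

/* Binary inversion for an odd prime p and 0 < a < p.
 * Invariants: a*x1 = u (mod p) and a*x2 = v (mod p). */
uint64_t inv_binary(uint64_t a, uint64_t p)
{
    uint64_t u = a, v = p, x1 = 1, x2 = 0;

    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {
            u >>= 1;
            x1 = (x1 & 1) ? (x1 + p) >> 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + p) >> 1 : x2 >> 1;
        }
        if (u >= v) {
            u -= v;
            x1 = (x1 >= x2) ? x1 - x2 : x1 + p - x2;
        } else {
            v -= u;
            x2 = (x2 >= x1) ? x2 - x1 : x2 + p - x1;
        }
    }
    return (u == 1) ? x1 : x2;
}
```

A long test then loops over random inputs a, checks that both functions agree and that a multiplied by the inverse is 1 modulo p, and reports the input on the first mismatch.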
3.8.3
Test of point operation routines
The routine test_ECC() performs all point operations which are included in the library. It uses test vectors from paper [8]. The vectors are prepared especially for testing operations on elliptic curves such as addition, doubling, multiplication and double multiplication ($k \cdot P + l \cdot Q$). For every routine this test function takes the input vectors, performs the operation and checks if the result is correct. It also measures the speed of the algorithms in cycles and writes all this information to the USART. We put a special emphasis on the point multiplication algorithms. We developed two routines with different algorithms, different speed and different security. The first one is the basic point multiplication AEL_ECC_GFp_mul(). The second one is a special multiplication using the Montgomery ladder, Co-Z with the X coordinate only and some security features. Testing showed that the second method with the AEL_ECC_GFp_Xonlymul() routine is almost as fast as the basic method. The difference in speed is only around 10% (for exact numbers see Appendix C). When using the Montgomery ladder we need an addition and a doubling for every bit of the multiplier. In the basic method we always need a point doubling but we need a point addition only for half of the bits on average. Assuming doubling and addition have the same speed, that is 1.5 against 2 point operations per bit, so the predicted advantage of the basic method would be 25% of the ladder's running time. The fact that we have only a 10% difference is caused by the optimization of the add and double function. The two operations are joined into one function and they share some intermediate results so together they reach a higher speed. It is an excellent result to have a quite secure algorithm with just a small cost in speed. There is a special test at the end of test_ECC(). It takes one point and multiplies it one hundred times. The result is then compared with a point pre-computed in the Magma tool ([9], [10], [11]). All Magma scripts can be found in Appendix D. The algorithm tests the point multiplication very thoroughly and if the test passes there is a
very little chance of having a wrong multiplication routine. The algorithm for Magma is shown in the next code:
    p := 0xfffffffffffffffffffffffffffffffeffffffffffffffff;
    b := 0x64210519e59c80e70fa7e9ab72243049feb8deecc146b9b1;
    a := 0xfffffffffffffffffffffffffffffffefffffffffffffffc;

    Sx := 0xd458e7d127ae671b0c330266d246769353a012073e97acf8;
    Sy := 0x325930500d851f336bddc050cf7fb11b5673a1645086df3b;
    d := 0xa78a236d60baec0c5dd41b33a542463a8255391af64c74ee;
    e := 0xc4be3d53ec3089e71e4de8ceab7cce889bc393cd85b972bc;

    testSx := 0x0DEBD92ED7B80DDB645E2FCF3A6B3D3E0D47CA0DEA062BBC;
    testSy := 0x62B9E3C19E55FFFBE37634CBCCC0A9E20994132CFF0B603C;

    "This is test for ECC point multiplication using P192 NIST curve.";

    E := EllipticCurve([GF(p)|a, b]);
    S := E![Sx, Sy];

    "Do 100 times S = d*S with pre-update d=(d*e) mod p:";

    for i := 1 to 100 do
        d := (d*e) mod p;
        S := d*S;
    end for;

    testS := E![testSx, testSy];
    "Result:";
    elem := ElementToSequence(S);
    x := IntegerRing()!elem[1];
    y := IntegerRing()!elem[2];
    x_hex := IntegerToString(x, 16);
    y_hex := IntegerToString(y, 16);
    "Sx (hexa) = ";
    x_hex;
    "Sy (hexa) = ";
    y_hex;

    "Test if result is correct:";
    S eq testS;

    "End of test.";
Magma is a very powerful mathematical tool which covers many fields of mathematics, even new ones. It is very suitable for our purposes because it has very good high-level support for GF(p) and ECC operations. We used the evaluation version of Magma which can be found online: http://magma.maths.usyd.edu.au/calc/. Calculations in this version are restricted to 120 seconds. All our tests were performed with the online evaluation version Magma V2.18-4.
3.8.4
Test of cryptographic protocols routines
The last test we use is test_PRO(). It tests the protocol routines created for this library: key generation, ECIES encryption and ECIES decryption. First it creates a pair of keys which is used later for encryption and decryption. The second step is to encrypt a simple message using the ECIES protocol and the generated public key. Then it tries to decrypt the message with the private key, and if the original message and the decrypted message are equal the test passes. All necessary information is written to the USART so we can watch the whole process. At the end of the testing routine there is an infinite loop which tests message encryption, decryption and key generation. This little test bench first creates a random key pair. Then it creates a secret message using the PRNG. It encrypts the message, decrypts it and finally checks if the result is the same as the original message. If they are equal it takes another key and message and the test continues. If the test fails the routine notifies us with a message on the USART.
3.8.5
Comparisons
This section compares different targets and codes. We take the multi-precision multiplication routine as a good reference algorithm for comparison. We discuss these four comparisons:
- Processors comparison.
- Simulator, Flash and RAM comparison.
- C and assembler codes comparison.
- Comparison with other ECC libraries.
The processors work at different frequencies so measured time intervals are not very good for comparison. We take the number of cycles as the reference unit and we compare the fastest multiplication routines for each CPU. ARM Cortex-M0 and ARM Cortex-M1 have the most primitive instruction set so their speed is the worst of all processors. On the other hand they have a very small code size. ARM7TDMI and Cortex-M3 have quite similar instruction sets which are much richer than that of M0/M1, thus they can be faster, but the code size is bigger, as shown in Figure 3-5.
[Figure 3-5: bar chart "Cycles/Code" showing cycles and code size in bytes (Code(B)) for Cortex-M0/M1, Cortex-M3 and ARM7TDMI; axis from 0 to 3500.]
Figure 3-5 Comparison of speed and code size of 256 × 256 bit multiplication on different types of CPU. ARM7TDMI and Cortex-M3 have big code sizes and high speed. Cortex-M0 and Cortex-M1 have small code sizes but on the other hand they are not very fast. This is caused by the different instruction sets of the processors. The GNU compiler with level 2 (speed) optimization was used for this testing.
The second comparison can be done between different targets. We can compare the speed of multiplication during simulation in the Keil µVision IDE, the speed when executing code from Flash and the speed when the code is executed from RAM. The simulation is always the fastest because the simulator does not include any wait states. For Cortex-M0/M1 execution from RAM is faster than from Flash, as shown in Figure 3-6. ARM7TDMI and Cortex-M3 have a special fetching mechanism which makes execution of code from
Flash really fast. The speeds of multiplication on these processors are almost the same whether the code is executed from Flash or from RAM.
[Figure 3-6: bar chart of cycles (axis from 0 to 4500) for Simulator, Flash and RAM on ARM7TDMI, Cortex-M0/M1 and Cortex-M3.]
Figure 3-6 Comparison of speed of 256 × 256 bit multiplication on different targets of different processors. The simulator is always the fastest. The second fastest is execution from RAM and the slowest is execution from Flash. ARM7TDMI and Cortex-M3 processors show very small differences between targets. For Cortex-M0 and M1 it is much faster to execute code from RAM than from Flash. The GNU compiler with level 2 (speed) optimization was used for this testing. For RAM testing the RealView compiler was used because of the lack of linker scripts for GNU projects.
Thirdly we compare C code and assembler. C code has the advantage of portability and it is easy to understand. On the other hand, in assembler we can optimize code for speed and also for code size. Figure 3-7 shows a speed comparison of the fastest algorithms in C code and in assembly code.
[Figure 3-7: bar chart of cycles (axis from 0 to 4500) for ASM and C on ARM7TDMI, Cortex-M0/M1 and Cortex-M3.]
Figure 3-7 Comparison of speed of 256 × 256 bit multiplication using the fastest C code algorithm and the fastest algorithm in assembler on different processors. For ARM7TDMI and ARM Cortex-M3 we managed to reduce the number of cycles approximately to half. Cortex-M0 and M1 have a simpler instruction set so the C compiler can optimize the code better. Assembly code for these processors cannot be optimized as much as for ARM7TDMI or Cortex-M3. The GNU compiler with level 2 (speed) optimization was used for this testing.
Now we compare the whole library with other available libraries. We start with the features of the other libraries and at the end we provide some information about our library and make a comparison.
INVIA - ECC Software Library IP:
- Optimized for SPARC V8 and ARM Cortex-M0, M3.
- ECC 256 bit completed in 86 ms on ARM Cortex-M3 at 100 MHz in stand-alone mode.
- Code size: 16 kB on ARM Cortex-M3.
- Memory utilization: 4 kB for 256 bit key on ARM Cortex-M3.
ECC part from ACL by Jaroslav Bán:
- Optimized for ARM7TDMI.
- ECC 256 bit point multiplication completed in 6,727,332 cycles.
- Code size: 19,000 B - only parts comparable with our library.
Information about the duration of ECIES operations in our library can be found in Table 3-2. It contains timings of encryption and decryption for the NIST curves used. Code sizes of the implementations on the processors are as follows:
- ARM7TDMI: 12,690 B
- ARM Cortex-M0 and Cortex-M1: 8,360 B
- ARM Cortex-M3: 7,948 B
Table 3-2 Number of cycles needed for the 224 and 256 bit ECIES operations on the used processors. Decryption takes approximately half the time of encryption.
CPU             ENC224        ENC256        DEC224        DEC256
ARM7TDMI        12,606,214    26,921,818    6,528,066     14,302,535
Cortex-M0/M1    24,309,132    43,920,133    12,600,655    23,333,680
Cortex-M3       11,279,067    23,900,818    5,839,810     12,695,380
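The statement that decryption takes about half the time of encryption can be checked directly from the cycle counts in Table 3-2. The short sketch below only recomputes the ratios from the published numbers; the dictionary layout and the name `dec_enc_ratio` are illustrative. A plausible reason for the factor, stated here as an assumption and not something the table itself shows, is that ECIES encryption performs two point multiplications while decryption performs one.

```python
# Cycle counts copied from Table 3-2.
CYCLES = {
    "ARM7TDMI":     {"ENC224": 12_606_214, "ENC256": 26_921_818,
                     "DEC224": 6_528_066,  "DEC256": 14_302_535},
    "Cortex-M0/M1": {"ENC224": 24_309_132, "ENC256": 43_920_133,
                     "DEC224": 12_600_655, "DEC256": 23_333_680},
    "Cortex-M3":    {"ENC224": 11_279_067, "ENC256": 23_900_818,
                     "DEC224": 5_839_810,  "DEC256": 12_695_380},
}


def dec_enc_ratio(cpu, bits):
    """Return decryption cycles divided by encryption cycles for one CPU and key size."""
    row = CYCLES[cpu]
    return row[f"DEC{bits}"] / row[f"ENC{bits}"]


if __name__ == "__main__":
    for cpu in CYCLES:
        for bits in (224, 256):
            print(f"{cpu:<13} {bits} bit: DEC/ENC = {dec_enc_ratio(cpu, bits):.3f}")
```

All six ratios come out slightly above one half, which matches the remark in the table caption.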
RAM utilization for ECIES encryption depends on the curve used and on some other parameters. It can be calculated as shown in (3.2):
$$\mathit{RAM} = \mathit{LEN} \cdot 17 + \mathit{LENM}/2 + \mathit{KEY} \cdot 2 + \mathit{MAC} \cdot 2 + 7 \qquad (3.2)$$
where LEN is the number of 32 bit words used to store one multi-precision number, LENM is the length of the message in chars, and KEY and MAC are the length of the key and the length of the MAC code generated by the KDF function. The maximum RAM is used for 521 bit long words and it is 2.5 kB.
To compare the results of our library with the ACL library we need the speed of ECC point multiplication on ARM7TDMI. Using 256 bit long words, the multiplication takes 13,744,840 cycles, and with the security optimization it takes 15,033,796 cycles. This is more than two times slower than the ACL implementation, but it is the price to pay for good portability: almost all of the ACL point multiplication code was programmed in assembler. In terms of code size the libraries are comparable. If we subtract the code size of the curves that were used in ACL, we get almost the same result.
A comparison with the INVIA library is also interesting. The results they offer are for the ARM Cortex-M3 core. Encryption (using ECIES and 256 bit long words) in 86 ms on ARM Cortex-M3 at 100 MHz corresponds to 8,600,000 cycles. This is about 3 times faster than our implementation, but the code size and the RAM used are at least 2 times larger. We can assume that the INVIA implementation uses speed-up methods that take a lot of space in RAM and in Flash. For example, it may use windowing methods or precomputed values stored in memory, which make the code much faster but also take a lot of Flash and RAM space.
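The conversion from a run time to a cycle count, and the two speed ratios quoted in this comparison, can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in the text and in Table 3-2; the function name `cycles_from_time` is illustrative.

```python
def cycles_from_time(milliseconds, clock_mhz):
    """Convert a run time in ms at a clock in MHz to a cycle count."""
    # 1 MHz corresponds to 1000 cycles per millisecond.
    return milliseconds * clock_mhz * 1000


# INVIA: ECC 256 bit in 86 ms on Cortex-M3 at 100 MHz.
invia_cycles = cycles_from_time(86, 100)

# Our library: ECIES 256 bit encryption on Cortex-M3 (Table 3-2).
ael_enc256_m3 = 23_900_818

# Point multiplication with 256 bit words on ARM7TDMI.
ael_pmul_arm7 = 13_744_840
ael_pmul_arm7_secure = 15_033_796
acl_pmul_arm7 = 6_727_332

if __name__ == "__main__":
    print("INVIA cycles:", invia_cycles)
    print("AEL / INVIA:", round(ael_enc256_m3 / invia_cycles, 2))
    print("AEL / ACL:", round(ael_pmul_arm7 / acl_pmul_arm7, 2))
    print("AEL (secure) / ACL:", round(ael_pmul_arm7_secure / acl_pmul_arm7, 2))
```

The ratios are close to 2.8 against INVIA and slightly above 2 against ACL, consistent with the "about 3 times" and "more than two times" statements.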
Conclusion
We have created an elliptic curve cryptography library for various types of processors: ARM7TDMI, ARM Cortex-M3, M1 and M0. It supports all NIST curves over prime fields. The library is optimized for speed and also for security. Almost the whole library is written in C; only a few low-level functions are optimized in assembler for better speed. It implements the ECIES algorithm for encryption and decryption.
The library was developed using freely available software tools. We used the GNU compiler from Sourcery CodeBench Lite for ARM processors and the Keil µVision IDE. The library was compiled using the GNU compiler and linked to GNU and RealView test projects. One of the successes of this work is that the library can be linked to projects using both kinds of compilers.
Demo software for testing was also implemented in this thesis. The tests helped us to find errors and bugs. They were also useful for measuring the speed and memory requirements of the library and for comparisons with other libraries.
A big effort was made to create and speed up the basic multi-precision multiplication algorithms for all processors. In my opinion we are very close to the maximum achievable on the architectures of these processors. Of course, we use general purpose algorithms and we do not count special unrolled assembler codes, which can be even faster. In comparison with other libraries the whole library is relatively slow, but our goal was also to maintain portability (almost all code is written in C). On the other hand, in terms of code size and memory used the library provides very good results. The Flash and RAM requirements are as good as in other libraries, and in some cases they are much better.
Future work can be divided into several parts:
- Add support for general curves.
- Add support for binary field arithmetic, $GF(2^m)$.
- Add more ECC protocols.
- Improve speed further by applying new algorithms for multiplication, reduction and inversion.
- Add support for more processors such as ARM Cortex-M4 or Cortex-R4.
- Implement the library on soft cores in FPGA circuits.
- Use a true random number generator to strengthen security.
- Test the reliability and efficiency of the implemented countermeasures by practical measurement in the department's new laboratory [18].
References
[1] HANKERSON, D., MENEZES, A. and VANSTONE, S. 2004. Guide to Elliptic Curve Cryptography. New York: Springer-Verlag, 2004. ISBN 0-387-95273-X.
[2] MENEZES, A., VAN OORSCHOT, P. and VANSTONE, S. 1997. Handbook of Applied Cryptography. CRC Press, 1997. [referenced 27.4.2012]. Available online at: http://www.cacr.math.uwaterloo.ca/hac
[3] MARTINEZ, V., ALVAREZ, F. and ENCINAS, L. 2010. A Comparison of the Standardized Versions of ECIES. [referenced 27.4.2012]. Available online at: http://digital.csic.es/bitstream/10261/32674/1/Gayoso_A%20Comparison%20of%20the%20Standardized%20Versions%20of%20ECIES.pdf
[4] FIPS, 2001. AES encryption standard. [referenced 27.4.2012]. Available online at: http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[5] FIPS, 2011. Triple Data Encryption Standard. [referenced 27.4.2012]. Available online at: http://csrc.nist.gov/publications/fips/fips140-2/fips1402annexa.pdf
[6] NEEDHAM, R. and WHEELER, D. 1997. eXtensions to Tiny Encryption Algorithm. [referenced 27.4.2012]. Available online at: http://www.cix.co.uk/~klockstone/xtea.pdf
[7] BROWN, M., HANKERSON, D., LOPEZ, J. and MENEZES, A. 2001. Software implementation of the NIST elliptic curves over prime fields. Topics in Cryptology - CT-RSA, 2001. [referenced 27.4.2012]. Available online at: http://www.dms.auburn.edu/faculty/hankerson/full-paper.pdf
[8] NSA, 2010. Mathematical routines for the NIST prime elliptic curves. [referenced 27.4.2012]. Available online at: http://www.nsa.gov/ia/_files/nistroutines.pdf
[9] University of Sydney, 2012. Magma. [referenced 27.4.2012]. Available online at: http://magma.maths.usyd.edu.au/magma
[10] University of Sydney, 2012. Magma Handbook. [referenced 27.4.2012]. Available online at: http://magma.maths.usyd.edu.au/magma/handbook/
[11] BOSMA, W. and CANNON, J. 2006. Discovering Mathematics with Magma. Volume 19, Springer, 2006. ISBN 978-3-540-37632-3. [referenced 27.4.2012]. Available online at: http://www.springerlink.com/content/978-3-540-376347#section=397120&page=1
[12] INVIA, 2012. ECC Software Library IP. [referenced 27.4.2012]. Available online at: http://www.invia.fr/ECC-Software-Library-36.html
[13] NCSU, 2012. TinyECC. [referenced 27.4.2012]. Available online at: http://discovery.csc.ncsu.edu/software/TinyECC/
[14] BÁN, J. 2007. Cryptographic library for ARM7TDMI processors. Master's Thesis. Košice: Technical University of Košice, Faculty of Electrical Engineering and Informatics, 2007. 238 p. [referenced 27.4.2012]. Available online at: http://www.kemt.fei.tuke.sk/personal/drutarovsky/students/pdfs/ban2007.pdf
[15] SCOTT, M. and SZCZECHOWIAK, P. 2007. Optimizing Multiprecision Multiplication for Public Key Cryptography. [referenced 27.4.2012]. Available online at: http://eprint.iacr.org/2007/299.pdf
[16] HUTTER, M. and WENGER, E. 2011. Fast Multi-precision Multiplication for Public-Key Cryptography on Embedded Microprocessors. Springer Berlin / Heidelberg, 2011. ISBN 978-3-642-23950-2. [referenced 27.4.2012]. Available online at: http://www.springerlink.com/content/166hl834k55r5454/fulltext.pdf
[17] HUTTER, M., JOYE, M. and SIERRA, Y. 2011. Memory-Constrained Implementations of Elliptic Curve Cryptography in Co-Z Coordinate Representation. Springer, 2011. [referenced 27.4.2012]. Available online at: http://www.springerlink.com/content/v48358k011x82311/fulltext.pdf
[18] VARCHOLA, M. and DRUTAROVSKÝ, M. 2012. The Differential Power Analysis Laboratory Setup. In: Radioelektronika 2012: Proceedings of 22nd International Conference, April 17-18, 2012, Brno, Czech Republic. Brno: Brno University of Technology, 2012. P. 1-4. 1 electronic optical disc (CD-ROM). ISBN 978-80-214-4469-0.
[19] AIGNER, M. and OSWALD, E. 2000. Power Analysis Tutorial. [referenced 27.4.2012]. Available online at: http://www.iaik.tugraz.at/content/research/implementation_attacks/introduction_to_impa/dpa_tutorial.pdf
[20] RIVAIN, M. 2011. Fast and Regular Algorithms for Scalar Multiplication over Elliptic Curves. [referenced 27.4.2012]. Available online at: http://eprint.iacr.org/2011/338.pdf
[21] KARAKLAJIĆ, D. et al. 2011. Low-cost fault detection method for ECC using Montgomery Powering Ladder. [referenced 27.4.2012]. Available online at: http://www.cosic.esat.kuleuven.be/publications/article-1518.pdf
[22] GOUNDAR, R., JOYE, M. and MIYAJI, A. 2010. Co-Z Addition Formulae and Binary Ladders on Elliptic Curves. [referenced 27.4.2012]. Available online at: http://joye.site88.net/papers/GJM10zcoord.pdf
[23] YEN, S. and JOYE, M. 2000. Checking before output may not be enough against fault-based cryptanalysis. IEEE Transactions on Computers, 49(9):967-970, 2000.
[24] JONES, D. 2010. Good Practice in (Pseudo) Random Number Generation for Bioinformatics Applications. May, 2010. [referenced 27.4.2012]. Available online at: http://www.cs.ucl.ac.uk/staff/d.jones/GoodPracticeRNG.pdf
[25] GHOSH, A. 2006. Materials for subject Empirical Methods for Computer Science Research. 2006. [referenced 27.4.2012]. Available online at: http://www.public.iastate.edu/~apghosh/Stat430x-S06/Notes/simul-1.pdf
[26] ROSE, G. 2011. KISS: A bit too simple. 2011. [referenced 27.4.2012]. Available online at: http://eprint.iacr.org/2011/007.pdf
[27] Polar: PolarSSL [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://polarssl.org/
[28] ARM: ARM7TDMI [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/processors/classic/arm7/index.php
[29] ARM: Cortex-M0 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/processors/cortex-m/cortex-m0.php
[30] ARM: Cortex-M1 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/processors/cortex-m/cortex-m1.php
[31] ARM: Cortex-M3 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/processors/cortex-m/cortex-m3.php
[32] ARM: ARMv6 Architecture Reference Manual [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0419c/index.html
[33] ARM: ARMv7 Architecture Reference Manual [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html
[34] NXP: LPC2138 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://ics.nxp.com/products/lpc2000/all/~LPC2138/
[35] Keil: MCB2130 Evaluation Board [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.keil.com/mcb2130/
[36] NXP: LPC1113 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://ics.nxp.com/products/lpc1000/all/~LPC1113/
[37] NXP: LPCXpresso1113 Evaluation Board [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://ics.nxp.com/lpcxpresso/~LPC1113/
[38] ST Microelectronics: STM32 F1 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.st.com/internet/mcu/subclass/1169.jsp
[39] ST Microelectronics: STM32 F0 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.st.com/internet/mcu/subclass/1588.jsp
[40] Keil: MCBSTM32 Evaluation Board [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.keil.com/mcbstm32/
[41] VARGA, T. CAN senzorová sieť pre monitorovanie vybraných parametrov v automobiloch: Diplomová práca. Košice: TUKE FEI, May, 2010. 60 p.
[42] Keil: Ulink2 [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.keil.com/ulink2
[43] ST Microelectronics: ST-Link [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.st.com/internet/evalboard/product/219866.jsp
[44] ST Microelectronics: ST-Link Utility [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/USER_MANUAL/CD00262073.pdf
[45] ARM: Single Wire Debug [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/system-ip/debug-trace/coresight-soc-components/serial-wire-debug.php
[46] IEEE: JTAG [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://standards.ieee.org/findstds/standard/1149.1-1990.html
[47] Keil: µVision [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.keil.com/uvision
[48] Keil: MDK-ARM [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.keil.com/arm/mdk.asp
[49] GNU: GCC [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://gcc.gnu.org
[50] Mentor Graphics: Sourcery CodeBench Lite Edition [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.mentor.com/embeddedsoftware/sourcery-tools/sourcery-codebench/editions/lite-edition/
[51] ARM: CMSIS [online]. April, 2012. [referenced 27.4.2012]. Available online at: http://www.arm.com/products/processors/cortex-m/cortex-microcontroller-softwareinterface-standard.php
Appendices
Appendix A CD with this thesis in electronic form and test projects
Appendix B Photos and pictures of the measuring process
Appendix C Tables with the library testing information
Appendix D Source codes for Magma
Appendix A
CD with this thesis in electronic form and test projects
Structure of the directories:
MT (contains the thesis in electronic form)
- petrvalsky_MT.pdf (PDF format of the thesis)
- petrvalsky_MT.rar (compressed native LaTeX source files)
Projects (contains test projects for all targets)
- testAEL.rar (compressed file which contains test projects)
Appendix B
Photos, pictures of the measuring process
Figure 4-1 Programming and debugging of the MCB2130 evaluation board (in the middle). The connector on the left side of the board is a USB type B connector, which provides the power supply. At the bottom is the COM connector with a USB-to-Serial converter attached. The Ulink2 USB-JTAG programmer and debugger is connected to the right side of the board.
Figure 4-2 Programming and debugging of the KEMT_STM32 evaluation board (in the middle). The connector on the left side of the board is the COM connector with a USB-to-Serial converter attached. At the bottom is the USB type B connector, which provides the power supply. The ST-Link USB-JTAG programmer and debugger is connected to the right side of the board.
Figure 4-3 Programming and debugging of the LPCXpresso evaluation board (middle left). Its only connector is the mini USB power supply connector. To connect a serial interface we need to use the pin headers and an external receiver, transceiver and connector (we used the ones on the MCBSTM32 board on the left side of the photo). The output pins Rx and Tx on the header of the LPCXpresso have to be connected to the PA10 and PA9 pins of the MCBSTM32 board, in this order. The MCU on the MCBSTM32 board needs to be kept inactive so that it does not interfere with our processes. We can, for example, start it in RAM mode, or program an empty infinite loop into it and start it from Flash. An easier solution would be to use a Serial(3v3)-to-USB interface. If we want to use our own debugger (Ulink2 on the right side), we have to connect it to the holes on the board prepared for the debugger connector. We use the SWD protocol for debugging and code loading. We have to connect the SWDIO, SWCLK, RESET, VCC and GND pins of the debugger to the corresponding holes on the LPCXpresso board. These holes can be found in the middle of the board between the R-link and the MCU part of the board.
Figure 4-4 Detail of the connection between LPCXpresso (right) and MCBSTM32 (left). Pin 9 (Tx) on the pin header of LPCXpresso is connected to the PA9 hole on MCBSTM32 with a black wire. Pin 10 (Rx) is connected to PA10 with a red wire. The jumpers BOOT0 and BOOT1 on MCBSTM32 are set to boot in RAM mode. This keeps the processor on the board inactive after it is connected to the power supply.
Figure 4-5 Detail of the connection between LPCXpresso (left) and Ulink2 (right). The 20-pin ARM connector on Ulink2 is connected to the holes for the debug connector on the board. Pins 1, 4, 7, 9 and 15 (VCC, GND, SWDIO, SWCLK and RST) are connected to holes 2, 15, 3, 5 and 11 (twisted red and black, green, white and black wires) of the debug connector on the board. This allows us to debug and load code to the MCU.
Figure 4-6 Performance analyzer included in the Keil µVision IDE. With this tool we are able to find the speed bottlenecks of our implementation. The analyzer state shown here covers part of an ECC point multiplication using 256 bit long words.
Example of USART output of testMUL() routine:

Test of multiplication rutines.

JB + MD + MP: 0x00000C2C cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

PolarSSL: 0x000008A8 cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

Ban Comba ASM: 0x000004D4 cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

MP Hybrid (even) ASM: 0x00000440 cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

Ban Hybrid (even) ASM: 0x000003E0 cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

Ban Hybrid ASM: 0x000003E0 cycles.
Result of multiplication:
12 41 CC 16 FC 71 26 97 30 E2 D8 D4 3E 4C AE C6
10 1B 97 3A DA 5E C5 3C D0 31 75 7A 0C 3C 83 7A
6F 0C 32 C5 75 E5 EF 8D 9F 6D CD D7 5A 55 A1 03
60 A0 A6 B3 44 8B 5F C6 BF 56 55 CB 15 61 68 4A

Long testing (it takes some time).
0x00000001 * 10000 numbers OK.
0x00000002 * 10000 numbers OK.
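The cycle counts in the testMUL() output are printed in hexadecimal. The small sketch below converts them to decimal and relates each routine to the slowest one; the routine labels and hexadecimal values are copied from the log, while the dictionary layout is only illustrative.

```python
# Routine labels and hexadecimal cycle counts as printed by testMUL().
LOG_CYCLES = {
    "JB + MD + MP": "0x00000C2C",
    "PolarSSL": "0x000008A8",
    "Ban Comba ASM": "0x000004D4",
    "MP Hybrid (even) ASM": "0x00000440",
    "Ban Hybrid (even) ASM": "0x000003E0",
    "Ban Hybrid ASM": "0x000003E0",
}

# int(text, 16) accepts the "0x" prefix used in the log.
DECIMAL_CYCLES = {name: int(value, 16) for name, value in LOG_CYCLES.items()}

if __name__ == "__main__":
    slowest = max(DECIMAL_CYCLES.values())
    for name, cycles in DECIMAL_CYCLES.items():
        print(f"{name:<22} {cycles:>5} cycles  ({slowest / cycles:.2f}x vs slowest)")
```

The decimal value for "MP Hybrid (even) ASM" is 1088 cycles, the same number that Appendix C lists for AEL_MULH on the ARM7TDMI simulator with 256 bit operands.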
Example of USART output of testINV() routine:

Test of inversion rutines.

Binary inversion: 0x000368A8 cycles.
Result of inversion:
10 23 FB E6 63 12 1B F8 7D 05 2A A1 D5 2C 0D 20
2D B7 C4 43 FE F4 60 05 73 59 7F A6 E2 2F 30 99

Test if inversion was success (should be number 1):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01

Right shift binary inversion: 0x000371E4 cycles.
Result of inversion:
10 23 FB E6 63 12 1B F8 7D 05 2A A1 D5 2C 0D 20
2D B7 C4 43 FE F4 60 05 73 59 7F A6 E2 2F 30 99

Test if inversion was success (should be number 1):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01

Inversion using Fermat's theorem: 0x001529FC cycles.
Result of inversion:
10 23 FB E6 63 12 1B F8 7D 05 2A A1 D5 2C 0D 20
2D B7 C4 43 FE F4 60 05 73 59 7F A6 E2 2F 30 99

Test if inversion was success (should be number 1):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01

Test of binary inversion using Fermat's theorem: (it takes long time)
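The testINV() output compares binary inversion with inversion by Fermat's theorem and checks each result by multiplying it with the input, which must give 1. The sketch below illustrates both approaches and the same check in Python. It is an assumption-laden illustration, not the library's code: the modulus is taken to be the NIST P-256 prime, the input value is arbitrary, the binary algorithm follows the textbook version from the Guide to Elliptic Curve Cryptography [1], and the function names are hypothetical.

```python
# Assumed modulus: the NIST P-256 prime (the log does not state which prime was used).
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1


def fermat_inverse(a, p):
    """Inverse by Fermat's little theorem: a^(p-2) mod p for prime p."""
    return pow(a, p - 2, p)


def binary_inverse(a, p):
    """Binary inversion for 1 <= a < p and odd prime p.

    Invariants: a * x1 == u (mod p) and a * x2 == v (mod p).
    Only halving, addition and subtraction are needed.
    """
    u, v = a, p
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u //= 2
            x1 = x1 // 2 if x1 % 2 == 0 else (x1 + p) // 2
        while v % 2 == 0:
            v //= 2
            x2 = x2 // 2 if x2 % 2 == 0 else (x2 + p) // 2
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return x1 % p if u == 1 else x2 % p


if __name__ == "__main__":
    a = 0x1234567890ABCDEF1234567890ABCDEF  # arbitrary test value
    inv = binary_inverse(a, P256)
    print(inv == fermat_inverse(a, P256))
    print((a * inv) % P256)  # the "should be number 1" check from the log
```

Fermat's method costs a full modular exponentiation, which is in line with the log, where it needs several times more cycles than either binary variant.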
Example of USART output of testECC() routine:

Test of ECC point operations.

Point addition R = S + T: 0x00009988 cycles.
Result of addition:
Rx =
72 B1 3D D4 35 4B 6B 81 74 51 95 E9 8C C5 BA 69
70 34 91 91 AC 47 6B D4 55 3C F3 5A 54 5A 06 7E
Ry =
8D 58 5C BB 2E 13 27 D7 52 41 A8 A1 22 D7 62 0D
C3 3B 13 31 5A A5 C9 D4 6D 01 30 11 74 4A C2 64
The result is correct.

Point doubling R=2*S: 0x000077C0 cycles.
Result of doubling:
Rx =
76 69 E6 90 16 06 EE 3B A1 A8 EE F1 E0 02 4C 33
DF 6C 22 F3 B1 74 81 B8 2A 86 0F FC DB 61 27 B0
Ry =
FA 87 81 62 18 7A 54 F6 C3 9F 6E E0 07 2F 33 DE
38 9E F3 EE CD 03 02 3D E1 0C A2 C1 DB 61 D0 C7
The result is correct.

Point multiplication R = d*S: 0x00D1BAC8 cycles.
Result of multiplying:
Rx =
51 D0 8D 5F 2D 42 78 88 29 46 D8 8D 83 C9 7D 11
E6 2B EC C3 CF C1 8B ED AC C8 9B A3 4E EC A0 3F
Ry =
75 EE 68 EB 8B F6 26 AA 5B 67 3A B5 1F 6E 74 4E
06 F8 FC F8 A6 C0 CF 30 35 BE CA 95 6A 7B 41 D5
The result is correct.

Point multiplication R = d*S using OnlyX: 0x00E565C4 cycles.
Result of multiplication:
Rx =
51 D0 8D 5F 2D 42 78 88 29 46 D8 8D 83 C9 7D 11
E6 2B EC C3 CF C1 8B ED AC C8 9B A3 4E EC A0 3F
Ry =
75 EE 68 EB 8B F6 26 AA 5B 67 3A B5 1F 6E 74 4E
06 F8 FC F8 A6 C0 CF 30 35 BE CA 95 6A 7B 41 D5
The result is correct.

Point multiplications and add R = d*S + e*T: 0x01A04384 cycles.
Result of multiplication:
Rx =
D8 67 B4 67 92 21 00 92 34 93 92 21 B8 04 62 45
EF CF 58 41 3D AA CB EF F8 57 B8 58 83 41 F6 B8
Ry =
F2 50 40 55 C0 3C ED E1 2D 22 72 0D AD 69 C7 45
10 6B 66 07 EC 7E 50 DD 35 D5 4B D8 0F 61 52 75
The result is correct.

Point multiplications testing (takes long time).
0x0000000A MULTIPLICATIONS.
Example of USART output of testPRO() routine:

1 Test of ECC protocols routines . 2 3 Public key : 4 x = 5 F0 76 CD E5 6 47 E1 0 B 7 F 7 y = 8 E0 8 C 57 2 F 9 0 C 5 E 1 E A4 10 Private key : 11 z = 12 D2 0 B E1 3 D 13 BD 6 C 5 B D9 14 15 ECIES encryption : 16 Plaintext : m = 51 6 F CF B3 DF BF A8 9 D 29 F9 38 10 96 04 95 E8 74 CA BF 8 F B3 A5 2 F FD 85 72 7 F 08 96 2 D 3 F 39 0 B 44 F8 C3 CB 39 C8 B4 86 C2 87 2 C D3 30 56 ED D8 D0 65 99 5 E 80 ED 8 D 3 A 87 99 85 3 F 25 D2 60 8 E 8 C 6 D C5 02 F4 21 42
17 HELLO ECC WORLD !!! Mates je najlepsi na svete !!! : o ) 18 In hexa : 19 00 00 00 00 20 20 61 6 E 20 21 73 65 74 61 22 43 45 20 4 F 23 24 Ciphering ... 25 Ciphertext : 26 Point R 1 st coord : Rx = 27 C5 DF 70 79 28 89 F8 38 56 11 7 B 34 5 C 7 E 01 43 12 24 19 55 E5 42 C1 5 A 1 E 21 6 C 2 D 9 F 7 C 35 3 D 1 A 29 6 F 3 A 20 69 73 70 65 4 D 20 21 21 4 C 4 C 45 48 21 21 21 65 6 C 6 A 61 6 E 21 44 4 C 52 74 65 76 73 20 65 6 A 20 4 F 57 20 43
29 Point R 2 nd coord : Ry = 30 E2 73 62 C7 31 78 CB 96 05 32 33 Encrypted message in hexa : 34 DB BA BD F7 35 E7 4 C 2 B 76 36 C2 7 C 4 A 7 C 37 10 AE 97 02 38 39 Generated MAC code : t = 40 22 9 F 19 83 41 42 ECIES decryption : 43 Decrypted message : m = D5 26 5 F DA 84 8 D AA D0 02 DD 1 A FF DE F7 74 35 89 25 66 98 C = 81 63 27 52 B8 D2 A9 17 BB 00 16 B5 7 C 0 F B3 BC AF DE F3 F7 62 CD 13 E3 C4 20 36 E3 E8 C2 41 0 A 3 B A1 14 47 7 D FD D1 90 6 F 44 79 E5 B1 E4 C1 F3
44 HELLO ECC WORLD !!! Mates je najlepsi na svete !!! : o ) 45 46 ECIES test : 47 48 Test OK for 0 x00000001 messages and keys ( in hexa ) . 49 50 Test OK for 0 x00000002 messages and keys ( in hexa ) .
Appendix C
Tables with the library testing information
Table of all basic multi-precision multiplication measurements:
ARM7TDMI, TARGET: Simulator

ACL_P_MUL_H: assembler multiprecision hybrid multiplication by Jaroslav Bán
SIZE: 0688 B
CYCLES: 0988 c (256 bit * 256 bit -> 512 bit, number from PRNG)
        0796 c (256 bit * 256 bit -> 512 bit, all zeros number)
        0772 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        0720 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fastest, most effective and medium code size
      ARM7TDMI routine can multiply numbers with len >= 4

AEL_MULH: assembler implementation of multiplication using hybrid method, MP
SIZE: 0448 B
CYCLES: 1088 c (256 bit * 256 bit -> 512 bit, number from PRNG)
        0892 c (256 bit * 256 bit -> 512 bit, all zeros number)
INFO: fast, effective algorithm and medium code size
      ARM7TDMI routine can multiply numbers only with even len >= 4

ACL_P_MUL_HE: assembler multiprecision hybrid multiplication
SIZE: 0340 B
INFO: fastest, most effective and small code

ACL_P_MUL_C: assembler comba method multiprecision multiplication by JB
SIZE: 0284 B
CYCLES: 0972 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        0824 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: medium speed and smallest code size
      len >= 3

POLAR_MUL_P: C multiprecision multiplication using PolarSSL
SIZE: 1700 B
CYCLES: 1968 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        1824 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: slow, and even more ineffective without long long support
      big code size for all MCUs
      realizes the MAC operation for multiplication only, so the input vector must be cleared

MDMP_MUL_P: multiplication created by MP and MD
CYCLES: 3188 c (256 bit * 256 bit -> 512 bit, number from PRNG)
        3396 c (256 bit * 256 bit -> 512 bit, all zeros number)
        2540 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        2700 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: smaller code size compared to PolarSSL
      slow implementation for all MCUs
      a little faster than PolarSSL when multiplying 32 bit * 32 bit or 64 bit * 64 bit

ARM7TDMI, TARGET: LPC2138 Flash

ACL_P_MUL_H: SIZE: 0688 B
AEL_MULH: SIZE: 0448 B
ACL_P_MUL_HE: SIZE: 0340 B
MDMP_MUL_P:
CYCLES: 3316 c (224 bit * 224 bit -> 448 bit, all zeros number)

ARM7TDMI, TARGET: LPC2138 RAM (RV compiler)

ACL_P_MUL_H: SIZE: 0688 B
AEL_MULH: SIZE: 0448 B
ACL_P_MUL_HE: SIZE: 0340 B
MDMP_MUL_P:
CYCLES: 2588 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        2744 c (224 bit * 224 bit -> 448 bit, all zeros number)

CORTEX M0, TARGET: Simulator

AEL_MULH_M0: assembler multiplication using "half" hybrid method by MP
SIZE: 0264 B
CYCLES: 2632 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        2638 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fast, effective

AEL_MULC_M0: multiplication by MP

CORTEX M0, TARGET: LPC1114/302 Flash

AEL_MULH_M0: SIZE: 0264 B
INFO: fast, effective

CORTEX M0, TARGET: LPC1114/302 RAM (RV compiler)

AEL_MULH_M0: SIZE: 0264 B
CYCLES: 2695 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        2697 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fast, effective

CORTEX M3, TARGET: Simulator

Assembler multiprecision hybrid multiplication by JB and MP
SIZE: 0678 B
CYCLES: 0737 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        0639 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fastest, most effective and medium code size
      routine can multiply numbers with len >= 4

AEL_MULc_M3: assembler multiprecision multiplication
SIZE: 0678 B
CYCLES: 0823 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fast, effective
      len >= 3

CORTEX M3, TARGET: STM32F10x Flash

Assembler multiprecision hybrid multiplication
SIZE: 0678 B
INFO: medium code size, len >= 4

Assembler multiprecision comba multiplication by JB and MP
SIZE: 0678 B
CYCLES: 1073 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        0981 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fast, effective
      len >= 3

CORTEX M3, TARGET: STM32F10x RAM (RV compiler)

Assembler multiprecision hybrid multiplication
SIZE: 0678 B
INFO: medium code size, len >= 4

Assembler multiprecision comba multiplication by JB and MP
SIZE: 0678 B
CYCLES: 1064 c (224 bit * 224 bit -> 448 bit, number from PRNG)
        0978 c (224 bit * 224 bit -> 448 bit, all zeros number)
INFO: fast, effective
      len >= 3
Code sizes of the AEL library on the processors are as follows:
- ARM7TDMI: 12,690 B
- ARM Cortex-M0 and Cortex-M1: 8,360 B
- ARM Cortex-M3: 7,948 B
Table 4-1: Number of cycles needed for ECIES operations on the processors used. Decryption takes approximately half the time of encryption.

OP / CPU    ARM7TDMI       Cortex-M0/M1    Cortex-M3
ENC192       8,456,210      24,075,141      7,620,669
ENC224      12,606,214      24,309,132     11,279,067
ENC256      26,921,818      43,920,133     23,900,818
ENC384      58,019,306     111,811,266     52,115,110
ENC521      71,376,787     200,143,564     66,734,812
DEC192       4,360,325       7,694,749      3,927,787
DEC224       6,528,066      12,600,655      5,839,810
DEC256      14,302,535      23,333,680     12,695,380
DEC384      30,165,158      58,163,964     27,097,361
DEC521      37,231,212     104,425,639     34,806,555
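
The statement in the caption of Table 4-1 can be checked with simple arithmetic. The following Python sketch is only an illustration and is not part of the AEL library: the names `CURVES`, `CYCLES` and `dec_enc_ratio` are introduced here for the example, and the cycle counts are copied from the table as printed.

```python
# Cycle counts copied from Table 4-1 (ENC = encryption, DEC = decryption).
# CURVES, CYCLES and dec_enc_ratio are illustrative names, not library symbols.
CURVES = (192, 224, 256, 384, 521)

CYCLES = {
    "ARM7TDMI": {
        "ENC": (8456210, 12606214, 26921818, 58019306, 71376787),
        "DEC": (4360325, 6528066, 14302535, 30165158, 37231212),
    },
    "Cortex-M0/M1": {
        "ENC": (24075141, 24309132, 43920133, 111811266, 200143564),
        "DEC": (7694749, 12600655, 23333680, 58163964, 104425639),
    },
    "Cortex-M3": {
        "ENC": (7620669, 11279067, 23900818, 52115110, 66734812),
        "DEC": (3927787, 5839810, 12695380, 27097361, 34806555),
    },
}


def dec_enc_ratio(cpu):
    """Return {curve size in bits: DEC cycles / ENC cycles} for one processor."""
    enc = CYCLES[cpu]["ENC"]
    dec = CYCLES[cpu]["DEC"]
    return {bits: d / e for bits, e, d in zip(CURVES, enc, dec)}


if __name__ == "__main__":
    for cpu in CYCLES:
        ratios = dec_enc_ratio(cpu)
        print(cpu, " ".join(f"P{bits}: {r:.2f}" for bits, r in ratios.items()))
```

With the figures as printed, the ratio lies between 0.51 and 0.54 in all cases except Cortex-M0/M1 on the 192-bit curve, where it is about 0.32. The ENC192 entry for that processor is therefore worth re-checking against the measurement logs.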
Appendix D
Source codes for Magma
This is the first Magma code for step-by-step debugging of the Only-X Co-Z implementation:

"Control test for OnlyX implementation.";
"CURVE NIST p-224.";

p := 0xffffffffffffffffffffffffffffffff000000000000000000000001;
b := 0xb4050a850c04b3abf54132565044b0b7d7bfd8ba270b39432355ffb4;
a := 0xfffffffffffffffffffffffffffffffefffffffffffffffffffffffe;
xG := 0xb70e0cbd6bb4bf7f321390b94a03c1d356c21122343280d6115c1d21;
yG := 0xbd376388b5f723fb4c22dfe6cd4375a05a07476444d5819985007e34;

Sx := 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b;
Sy := 0xef4b497f0963bc8b6aec0ca0f259b89cd80994147e05dc6b64d7bf22;

X1 := 0x0;
X2 := Sx;
TD := Sx;
Ta := a;
Tb := (4*b) mod p;

// adddbl_1
U := ((X1 - X2)^2) mod p;
V := (4 * X2 * (X2^2 + Ta) + Tb) mod p;
W := (V * U) mod p;
TD_new := (TD * W) mod p;
Ta_new := (Ta * W^2) mod p;
Tb_new := (Tb * W^3) mod p;
TMP := (X1^2 + X2^2 - U + 2*Ta) mod p;
X1_new := (V * ((X1 + X2) * TMP + Tb) - TD_new) mod p;
TMP := (X2^2 - Ta) mod p;
X2_new := (U * (TMP^2 - 2*X2*Tb)) mod p;

// copy new var 1
X1 := X1_new;
X2 := X2_new;
TD := TD_new;
Ta := Ta_new;
Tb := Tb_new;

X := (4*Sx*Sy*TD^2*X1) mod p;
Y := (Sx^3*(Tb + 2*(TD*X1 + Ta)*(X1 + TD) - 2*X2*(X1 - TD)^2)) mod p;
Z := (4*Sy*TD^3) mod p;

TMP := (Z*(Y^2 - b*Z^2)) mod p;
IntegerToString(TMP, 16);

E := EllipticCurve([GF(p) | a, b]);
P := E![xG, yG];

ord := #E;
k := ord - 1;
kP := k * P;
elem := ElementToSequence(kP);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
x_hex;
y_hex;

infinity1 := kP + P;
infinity2 := ord * P;
infinity1 eq infinity2;
Second Magma code for step-by-step debugging of the Only-X Co-Z implementation.

"Control test for OnlyX implementation.";
"CURVE NIST p-224.";
p = 0xffffffffffffffffffffffffffffffff000000000000000000000001

Gx = 0xb70e0cbd6bb4bf7f321390b94a03c1d356c21122343280d6115c1d21
Gy = 0xbd376388b5f723fb4c22dfe6cd4375a05a07476444d5819985007e34

a = 0xfffffffffffffffffffffffffffffffefffffffffffffffffffffffe
b = 0xb4050a850c04b3abf54132565044b0b7d7bfd8ba270b39432355ffb4


INIT STATE:
Sx = 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b
Sy = 0xef4b497f0963bc8b6aec0ca0f259b89cd80994147e05dc6b64d7bf22

X1 = 0x0
X2 = 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b
TD = 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b
Ta = 0xfffffffffffffffffffffffffffffffefffffffffffffffffffffffe
Tb = 0xD0142A143012CEAFD504C9594112C2E15EFF62E89C2CE50C8D57FECE

1 STEP (EQ):
U = 0x4B73C58D76B06FB26468A008C50790E9F1E1E7B60ABE8FB6EDAF82BB
V = 0x6290EC3AFDC7FA0412C5EF843BADBC97000D42F6B1B5B747D54DE58F
W = 0x6FBC4B0577647E300496E793E3399E8F4657308B39A49105840218D1

X1 = 0x8B4C92D4FA948581219814170C8628AD905B117FF2B5228FB0FF6601
X2 = 0x2C3292AC511490054D4441CEAA983345E6F66478554817EC007A3BF7
TD = 0x942A2A22B71C5107ED527EB78B3B792CF6116A4C1A8CDE22F26FBAB5
Ta = 0x8B354D0400499BBFE91A93B6D85CB54E3D6575BB20356E9E91EAFF09
Tb = 0x6E226BF0E77B0ED0E26C92CA1E58BA149004B9346A2845A8CF344AE3

1 STEP (MCU):
X1 =
X2 =
TD =
Ta =
Tb = 0x91357EB4 5A656CF7 BC237EB1 22936824 6AAE720C 39020E91 65EF44A6
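
The initial state and the step formulas of the first listing can also be reproduced outside Magma. The Python sketch below is only an illustration under stated assumptions: the function names `init_state` and `adddbl` are hypothetical, the prime is written as 2^224 - 2^96 + 1 (the value of `p` used in the listings), and the step is a direct transcription of the formulas marked `adddbl_1`, not code taken from the library.

```python
# NIST P-224 parameters as used in the Magma listings.
p = 2**224 - 2**96 + 1          # equals the hexadecimal value of p in the listings
a = p - 3                       # equals the hexadecimal value of a in the listings
b = 0xb4050a850c04b3abf54132565044b0b7d7bfd8ba270b39432355ffb4
Sx = 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b


def init_state(sx):
    """Initial (X1, X2, TD, Ta, Tb) as set up before the first step."""
    return 0, sx, sx, a % p, (4 * b) % p


def adddbl(X1, X2, TD, Ta, Tb):
    """One step, transcribed from the formulas marked adddbl_1."""
    U = (X1 - X2) ** 2 % p
    V = (4 * X2 * (X2 * X2 + Ta) + Tb) % p
    W = V * U % p
    TD_new = TD * W % p
    Ta_new = Ta * W * W % p
    Tb_new = Tb * pow(W, 3, p) % p
    tmp = (X1 * X1 + X2 * X2 - U + 2 * Ta) % p
    X1_new = (V * ((X1 + X2) * tmp + Tb) - TD_new) % p
    tmp = (X2 * X2 - Ta) % p
    X2_new = U * (tmp * tmp - 2 * X2 * Tb) % p
    return X1_new, X2_new, TD_new, Ta_new, Tb_new


if __name__ == "__main__":
    state = init_state(Sx)
    print("INIT Tb =", hex(state[4]))
    for name, value in zip(("X1", "X2", "TD", "Ta", "Tb"), adddbl(*state)):
        print(name, "=", hex(value))
```

The value of `Tb` in the initial state is 4b reduced modulo p, which gives the INIT STATE entry listed above. The five values printed for the step can be compared by hand with the `1 STEP (EQ)` block.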
Magma script of test for ECC point multiplication using P192 NIST curve.
p := 0xfffffffffffffffffffffffffffffffeffffffffffffffff;
b := 0x64210519e59c80e70fa7e9ab72243049feb8deecc146b9b1;
a := 0xfffffffffffffffffffffffffffffffefffffffffffffffc;

Sx := 0xd458e7d127ae671b0c330266d246769353a012073e97acf8;
Sy := 0x325930500d851f336bddc050cf7fb11b5673a1645086df3b;
d := 0xa78a236d60baec0c5dd41b33a542463a8255391af64c74ee;
e := 0xc4be3d53ec3089e71e4de8ceab7cce889bc393cd85b972bc;

testSx := 0x0DEBD92ED7B80DDB645E2FCF3A6B3D3E0D47CA0DEA062BBC;
testSy := 0x62B9E3C19E55FFFBE37634CBCCC0A9E20994132CFF0B603C;

"This is test for ECC point multiplication using P192 NIST curve.";

E := EllipticCurve([GF(p) | a, b]);
S := E![Sx, Sy];

"Do 100 times S = d*S with pre-update d=(d*e) mod p:";

for i := 1 to 100 do
    d := (d * e) mod p;
    S := d * S;
end for;

testS := E![testSx, testSy];
"Result:";
elem := ElementToSequence(S);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
"Sx (hexa) =";
x_hex;
"Sy (hexa) =";
y_hex;

"Test if result is correct:";
S eq testS;

"End of test.";
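
The test loop of this script does not depend on Magma and can be repeated with plain affine arithmetic as a cross-check. The Python sketch below is an illustration only and is not part of the AEL library: the function names `on_curve`, `add`, `mul` and `run_test` are hypothetical, the prime is written as 2^192 - 2^64 - 1 (the value of `p` in the script), and the double-and-add loop is a textbook method that is not constant time and is not the algorithm used in the library. It needs Python 3.8 or newer for the modular inverse via `pow`.

```python
# NIST P-192 parameters and test vectors copied from the Magma script.
p = 2**192 - 2**64 - 1          # equals the hexadecimal value of p in the script
a = p - 3                       # equals the hexadecimal value of a in the script
b = 0x64210519e59c80e70fa7e9ab72243049feb8deecc146b9b1
S0 = (0xd458e7d127ae671b0c330266d246769353a012073e97acf8,
      0x325930500d851f336bddc050cf7fb11b5673a1645086df3b)
d0 = 0xa78a236d60baec0c5dd41b33a542463a8255391af64c74ee
e = 0xc4be3d53ec3089e71e4de8ceab7cce889bc393cd85b972bc
TEST_S = (0x0DEBD92ED7B80DDB645E2FCF3A6B3D3E0D47CA0DEA062BBC,
          0x62B9E3C19E55FFFBE37634CBCCC0A9E20994132CFF0B603C)


def on_curve(P):
    """True if P satisfies y^2 = x^3 + a*x + b (None is the point at infinity)."""
    if P is None:
        return True
    x, y = P
    return (y * y - (x * x * x + a * x + b)) % p == 0


def add(P, Q):
    """Affine point addition; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def mul(k, P):
    """Left-to-right double-and-add scalar multiplication (checking only)."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R


def run_test(rounds=100):
    """Repeat d = (d*e) mod p; S = d*S, as in the loop of the Magma script."""
    d, S = d0, S0
    for _ in range(rounds):
        d = d * e % p
        S = mul(d, S)
    return S


if __name__ == "__main__":
    S = run_test()
    print("Sx (hexa) =", format(S[0], "X"))
    print("Sy (hexa) =", format(S[1], "X"))
    print("Test if result is correct:", S == TEST_S)
```

The last line mirrors the comparison `S eq testS` of the script. Because every intermediate point can be checked with `on_curve`, the sketch also helps to locate the round in which a faulty implementation leaves the curve.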
Magma script of test for ECC point multiplication using P224 NIST curve.

p := 0xffffffffffffffffffffffffffffffff000000000000000000000001;
b := 0xb4050a850c04b3abf54132565044b0b7d7bfd8ba270b39432355ffb4;
a := 0xfffffffffffffffffffffffffffffffefffffffffffffffffffffffe;

Sx := 0x6eca814ba59a930843dc814edd6c97da95518df3c6fdf16e9a10bb5b;
Sy := 0xef4b497f0963bc8b6aec0ca0f259b89cd80994147e05dc6b64d7bf22;
d := 0xa78ccc30eaca0fcc8e36b2dd6fbb03df06d37f52711e6363aaf1d73b;
e := 0x54d549ffc08c96592519d73e71e8e0703fc8177fa88aa77a6ed35736;

testSx := 0xB8ED1835920A296CCA7E99278C7F7419592D32A76C769E3C7F085AA4;
testSy := 0xA9878E34CC9556639186C74DCBF90A6E55EEEFFE8189B006B0B9F3E4;

"This is test for ECC point multiplication using P224 NIST curve.";

E := EllipticCurve([GF(p) | a, b]);
S := E![Sx, Sy];

"Do 100 times S = d*S with pre-update d=(d*e) mod p:";

for i := 1 to 100 do
    d := (d * e) mod p;
    S := d * S;
end for;

testS := E![testSx, testSy];
"Result:";
elem := ElementToSequence(S);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
"Sx (hexa) =";
x_hex;
"Sy (hexa) =";
y_hex;

"Test if result is correct:";
S eq testS;

"End of test.";
Magma script of test for ECC point multiplication using P256 NIST curve.

p := 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff;
b := 0x5ac635d8aa3a93e7b3ebbd55769886bc651d06b0cc53b0f63bce3c3e27d2604b;
a := 0xffffffff00000001000000000000000000000000fffffffffffffffffffffffc;

Sx := 0xde2444bebc8d36e682edd27e0f271508617519b3221a8fa0b77cab3989da97c9;
Sy := 0xc093ae7ff36e5380fc01a5aad1e66659702de80f53cec576b6350b243042a256;
d := 0xc51e4753afdec1e6b6c6a5b992f43f8dd0c7a8933072708b6522468b2ffb06fd;
e := 0xd37f628ece72a462f0145cbefe3f0b355ee8332d37acdd83a358016aea029db7;

testSx := 0x0482FFD416B25AC0F55C1C01E6ECCE3223EFF8A323690F1163F2C653833A0B08;
testSy := 0x331296BD3A1041F61B7F10641C6CD00F779D3F0E8E062139729B644966D4F85C;

"This is test for ECC point multiplication using P256 NIST curve.";

E := EllipticCurve([GF(p) | a, b]);
S := E![Sx, Sy];

"Do 100 times S = d*S with pre-update d=(d*e) mod p:";

for i := 1 to 100 do
    d := (d * e) mod p;
    S := d * S;
end for;

testS := E![testSx, testSy];
"Result:";
elem := ElementToSequence(S);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
"Sx (hexa) =";
x_hex;
"Sy (hexa) =";
y_hex;

"Test if result is correct:";
S eq testS;

"End of test.";
Magma script of test for ECC point multiplication using P384 NIST curve.

p := 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffeffffffff0000000000000000ffffffff;
b := 0xb3312fa7e23ee7e4988e056be3f82d19181d9c6efe8141120314088f5013875ac656398d8a2ed19d2a85c8edd3ec2aef;
a := 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffeffffffff0000000000000000fffffffc;

Sx := 0xfba203b81bbd23f2b3be971cc23997e1ae4d89e69cb6f92385dda82768ada415ebab4167459da98e62b1332d1e73cb0e;
Sy := 0x5ffedbaefdeba603e7923e06cdb5d0c65b22301429293376d5c6944e3fa6259f162b4788de6987fd59aed5e4b5285e45;
d := 0xa4ebcae5a665983493ab3e626085a24c104311a761b5a8fdac052ed1f111a5c44f76f45659d2d111a61b5fdd97583480;
e := 0xafcf88119a3a76c87acbd6008e1349b29f4ba9aa0e12ce89bcfcae2180b38d81ab8cf15095301a182afbc6893e75385d;

testSx := 0x94F4937AF66EFFA2651C586170291E833CF90860717FE2BA238840E338F99CAA3F37D17B4403BCDCE9F500975DAFF232;
testSy := 0xE28247C8793BC867E1D93B74EE0EC0E264EBE3DBE9731C30567C671168A91C6352D60604A5F1BE963EF4D224F41F268F;

"This is test for ECC point multiplication using P384 NIST curve.";

E := EllipticCurve([GF(p) | a, b]);
S := E![Sx, Sy];

"Do 100 times S = d*S with pre-update d=(d*e) mod p:";

for i := 1 to 100 do
    d := (d * e) mod p;
    S := d * S;
end for;

testS := E![testSx, testSy];
"Result:";
elem := ElementToSequence(S);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
"Sx (hexa) =";
x_hex;
"Sy (hexa) =";
y_hex;

"Test if result is correct:";
S eq testS;

"End of test.";
Magma script of test for ECC point multiplication using P521 NIST curve.

p := 0x000001ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff;
b := 0x00000051953eb9618e1c9a1f929a21a0b68540eea2da725b99b315f3b8b489918ef109e156193951ec7e937b1652c0bd3bb1bf073573df883d2c34f1ef451fd46b503f00;
a := 0x000001fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffc;

Sx := 0x000001d5c693f66c08ed03ad0f031f937443458f601fd098d3d0227b4bf62873af50740b0bb84aa157fc847bcf8dc16a8b2b8bfd8e2d0a7d39af04b089930ef6dad5c1b4;
Sy := 0x00000144b7770963c63a39248865ff36b074151eac33549b224af5c8664c54012b818ed037b2b7c1a63ac89ebaa11e07db89fcee5b556e49764ee3fa66ea7ae61ac01823;
d := 0x000001eb7f81785c9629f136a7e8f8c674957109735554111a2a866fa5a166699419bfa9936c78b62653964df0d6da940a695c7294d41b2d6600de6dfcf0edcfc89fdcb1;
e := 0x00000137e6b73d38f153c3a7575615812608f2bab3229c92e21c0d1c83cfad9261dbb17bb77a63682000031b9122c2f0cdab2af72314be95254de4291a8f85f7c70412e3;

testSx := 0x0000017E917D4DB4E7C54370B808EA6CE8FD1248E2C657A463273AC9091D77AD03FD2C8DDE002ED730381E01104C88CEAE8230BFEE7258DB4F25C255109EB30467CD7338;
testSy := 0x000001A9E1CA9025307974BF69FC9A979A2AFB3A41959BDF32573D99B3DDC022396632342CEC7F8FE0C743261B8E83178C28EAC586032C4421A2A3F3F5BD4A95A907FDD9;

"This is test for ECC point multiplication using P521 NIST curve.";

E := EllipticCurve([GF(p) | a, b]);
S := E![Sx, Sy];

"Do 100 times S = d*S with pre-update d=(d*e) mod p:";

for i := 1 to 100 do
    d := (d * e) mod p;
    S := d * S;
end for;

testS := E![testSx, testSy];
"Result:";
elem := ElementToSequence(S);
x := IntegerRing() ! elem[1];
y := IntegerRing() ! elem[2];
x_hex := IntegerToString(x, 16);
y_hex := IntegerToString(y, 16);
"Sx (hexa) =";
x_hex;
"Sy (hexa) =";
y_hex;

"Test if result is correct:";
S eq testS;

"End of test.";

