HIGH PERFORMANCE RISC ARCHITECTURE – ARM
OVERVIEW:
Acorn RISC Machine
Architectural Inheritance, Core & Architectures
Registers, Pipeline, Interrupts
ARM organization
ARM processor family, Co-processors
ARM instruction set, Thumb Instruction set, Instruction cycle timings
The ARM Programmer’s model
ARM Development tools
ARM Assembly Language Programming, C programming
Optimizing ARM Assembly Code – Optimized Primitives.
REFERENCES:
1. Steve Furber, "ARM System-on-Chip Architecture", Addison-Wesley, 2000.
2. Valvano, "Embedded Microcomputer Systems", Thomson Asia Pvt. Ltd., first reprint 2001.
INTRODUCTION
ARM Ltd.
ARM founded in November 1990
• Advanced RISC Machines
[Figure: ARM cores shipped to date; surviving labels: 100+ billion, 30+ billion]
Typ./Max. Power Dissipation: 212 mW / 106 mW @ 59 MHz; < 200 mW @ 133 MHz
ARM Powered Applications
• Automotive
- ComRoad, empeg, Raytheon, Marine, SENA
• Consumer Multimedia
- Sega, Sharp, Sony, Toshiba, Pace
• Embedded Control
- Conexant, Gemplus, IBM, Olivetti
• Handheld Computing
- Apple, Ericsson, Hewlett Packard, Psion
• Internet Appliances
- Daewoo, Oracle, RCA, Samsung
• Networking
- 3Com, Ericsson, Virata, VLSI
• Portable Telephony
- Ericsson, Hitachi, Nokia, Philips, Qualcomm
The ARM Architecture Inheritance
• The ARM chip was designed based on Berkeley RISC I and II and the Stanford MIPS
(Microprocessor without Interlocking Pipeline Stages)
• Features Used from Berkeley RISC design
- a load-store architecture
- fixed length 32-bit instructions
- 3-address instruction formats
• Features Rejected
- Register windows
- Delayed branches
- Single-cycle execution of all instructions
The ARM Architecture Inheritance
• Based upon RISC Architecture with enhancements to meet requirements of
embedded applications
Little Endian
In little-endian byte order, the least significant byte is stored at the lowest address. For example, the 32-bit value 0x90AB12CD stored at address 1000 looks like this:
Address Value
1000 CD
1001 12
1002 AB
1003 90
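The byte layout in the table can be reproduced with a short Python sketch. It assumes the four bytes at addresses 1000 to 1003 form the single 32-bit value 0x90AB12CD; the names `value`, `base_address` and `layout` are illustrative only.

```python
import struct

value = 0x90AB12CD      # assumed 32-bit value behind the table above
base_address = 1000     # first address used in the table

# "<I" packs an unsigned 32-bit integer in little-endian byte order,
# so the least significant byte (0xCD) comes first.
little = struct.pack("<I", value)
layout = {base_address + i: f"{byte:02X}" for i, byte in enumerate(little)}

# ">I" is the big-endian layout, shown only for contrast.
big = struct.pack(">I", value)

for address, byte in layout.items():
    print(address, byte)
```

Reading the bytes back with `struct.unpack("<I", little)` returns the original value, which is a quick way to check which byte order a memory dump uses.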
The performance of a processor core can be improved by:
It assumes that the memory it is connected to can deliver one word in a clock cycle and
deliver the next sequential word half a cycle later.
A 64-bit wide memory has the required characteristics, but delaying the arrival of
the second word by half a clock cycle allows a 32-bit bus to be used and can save area
since routing a 32-bit bus requires less area than routing a 64-bit bus.
ARMS applications:
A load or store instruction that cannot complete does not stall pipeline execution.
Processor Core Vs CPU Core
Processor Core
– The engine that fetches instructions and executes them
– E.g.: ARM7TDMI, ARM9TDMI, ARM9E-S
CPU Core
– Consists of the ARM processor core and some tightly coupled function
blocks
– Cache and memory management blocks
– E.g.: ARM710T, ARM720T, ARM740T, ARM920T, ARM922T, ARM940T,
ARM946E-S, and ARM966E-S
ARM CPU CORES
• Although some ARM applications use a simple integer processor core as
the basic processing component, others require tightly coupled functions
such as cache memory and memory management hardware. ARM Limited
offers a range of such 'CPU' configurations based around its integer cores.
• The ARM CPU cores described here include the ARM710T, 720T and 740T,
the ARM810 (now superseded by the ARM9 series), the StrongARM, the
ARM920T and 940T, and the ARM1020E.
The ARM710T, ARM720T and ARM740T
• The ARM710T, ARM720T and ARM740T are based upon the ARM7TDMI
processor core, to which an 8 Kbyte mixed instruction cache and data
cache has been added.
• External memory and peripherals are accessed via an AMBA (Advanced
Microcontroller Bus Architecture) bus master unit. A write buffer and a
memory management unit (ARM710T and 720T) or memory protection
unit (ARM740T) are also incorporated.
ARM710T cache
• Since the ARM7TDMI processor core has a single memory port it is
logical for it to be paired with a unified instruction and data cache.
The ARM710T incorporates such a cache, with a capacity of 8 Kbytes.
Double-bandwidth cache
Register Organization Summary
Mode: registers shared with User mode; own (banked) registers
• User: r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr
• FIQ: User mode r0-r7, r15 and cpsr; own r8-r12, r13 (sp), r14 (lr), spsr
• IRQ: User mode r0-r12, r15 and cpsr; own r13 (sp), r14 (lr), spsr
• SVC: User mode r0-r12, r15 and cpsr; own r13 (sp), r14 (lr), spsr
• Undef: User mode r0-r12, r15 and cpsr; own r13 (sp), r14 (lr), spsr
• Abort: User mode r0-r12, r15 and cpsr; own r13 (sp), r14 (lr), spsr
Thumb state: r0-r7 are the low registers; the registers from r8 upward are the high registers.
• The current processor mode governs which of several banks is accessible. Each mode can access
• a particular set of r0-r12 registers
• a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
• the program counter, r15 (pc)
• the current program status register, cpsr
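The banking rule can be sketched as a lookup in Python. This is a minimal model: the mode names and the `r13_irq` style labels are hypothetical conventions chosen here, and it assumes the usual ARM arrangement in which FIQ banks r8-r14 and the other exception modes bank only r13 and r14.

```python
# Registers that each mode replaces with its own private (banked) copy.
BANKED = {
    "user": set(),
    "fiq": {8, 9, 10, 11, 12, 13, 14},
    "irq": {13, 14},
    "svc": {13, 14},
    "undef": {13, 14},
    "abort": {13, 14},
}

def physical_register(mode, n):
    """Return a label for the physical register that rN names in `mode`."""
    if not 0 <= n <= 15:
        raise ValueError("register number must be 0..15")
    if n in BANKED[mode]:
        return f"r{n}_{mode}"   # private copy for this mode
    return f"r{n}"              # shared with User mode

def has_spsr(mode):
    """Every exception mode has an spsr; User mode does not."""
    return mode != "user"

print(physical_register("fiq", 8), physical_register("irq", 8))
```

Because FIQ has private copies of r8-r12, a fast interrupt handler can use them without first saving the interrupted program's values.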
Program status register layout (bit 31 down to bit 0):
N (31) Z (30) C (29) V (28) Q (27) | J (24) | undefined | I (7) F (6) T (5) | mode (4:0)
Fields: f (flags, bits 31:24), s (status, bits 23:16), x (extension, bits 15:8), c (control, bits 7:0)
• Condition code flags
• N = Negative result from ALU
• Z = Zero result from ALU
• C = ALU operation Carried out
• V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
• Architecture 5TE/J only
• Indicates if saturation has occurred
• J bit
• Architecture 5TEJ only
• J = 1: Processor in Jazelle state
• Interrupt Disable bits
• I = 1: Disables the IRQ
• F = 1: Disables the FIQ
• T Bit
• Architecture xT only
• T = 0: Processor in ARM state
• T = 1: Processor in Thumb state
• Mode bits
• Specify the processor mode
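The fields described above can be pulled out of a 32-bit status word with shifts and masks. The sketch below is illustrative: the bit positions and the 5-bit mode encodings come from the ARM architecture rather than from this text, and `decode_cpsr` is a hypothetical helper name.

```python
# Mode-bit encodings (CPSR bits 4:0).
MODES = {
    0b10000: "User", 0b10001: "FIQ", 0b10010: "IRQ", 0b10011: "SVC",
    0b10111: "Abort", 0b11011: "Undef", 0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit status register value into its named fields."""
    def bit(n):
        return (cpsr >> n) & 1
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "Q": bit(27), "J": bit(24),
        "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(cpsr & 0x1F, "reserved"),
    }

# Example word: Z and C set, IRQ and FIQ disabled, ARM state, SVC mode.
fields = decode_cpsr(0x600000D3)
print(fields)
```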
Program Counter (r15)
• When the processor is executing in ARM state:
• All instructions are 32 bits wide
• All instructions must be word aligned
• Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as instructions cannot be halfword or
byte aligned).
Exceptions result whenever the normal flow of a program has to be halted temporarily,
for example to service an interrupt from a peripheral.
Before attempting to handle an exception, the ARM preserves the current processor
state so that the original program can resume when the handler routine has finished.
Exception Handling
On entering an exception, the ARM core:
• Saves the address of the next instruction in the
appropriate LR
ARM Instruction Set
• Data processing instructions
• Data transfer instructions
• Multiply instructions
• Software interrupt instructions
Conditional execution:
• Every instruction carries a 4-bit condition code and executes only if the CPSR flags satisfy it
• Result: fewer branches, so instructions flow smoothly through the pipeline
• 16 condition codes, for example:
EQ equal; MI negative; HI unsigned higher; GT signed greater than
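How a condition code is tested against the N, Z, C and V flags can be sketched in Python. The flag expressions below are the standard ARM definitions, not taken from this text, and `condition_passed` is a hypothetical helper name.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if condition `cond` holds for the given flag values (0 or 1)."""
    table = {
        "EQ": z,                    # equal
        "NE": not z,                # not equal
        "CS": c,                    # carry set / unsigned higher or same
        "CC": not c,                # carry clear / unsigned lower
        "MI": n,                    # negative
        "PL": not n,                # positive or zero
        "VS": v,                    # overflow
        "VC": not v,                # no overflow
        "HI": c and not z,          # unsigned higher
        "LS": (not c) or z,         # unsigned lower or same
        "GE": n == v,               # signed greater than or equal
        "LT": n != v,               # signed less than
        "GT": (not z) and n == v,   # signed greater than
        "LE": z or n != v,          # signed less than or equal
        "AL": True,                 # always
    }
    return bool(table[cond])

# Flags after comparing two equal values: Z = 1, C = 1.
print(condition_passed("EQ", 0, 1, 1, 0), condition_passed("HI", 0, 1, 1, 0))
```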
• Arithmetic operations:
• ADD, ADC, SUB, SBC, RSB, RSC
• Bit-wise logical operations:
• AND, EOR, ORR, BIC
• Register movement operations:
• MOV, MVN
• Comparison operations:
• TST, TEQ, CMP, CMN
Data Processing Instructions
Condition codes + data processing instructions + barrel shifter
= powerful tools for efficiently coded programs
Data Processing Instructions
e.g.:
if (Z==1)
R1=R2+(R3*4)
compiles to
ADDEQS R1,R2,R3,LSL #2
(SINGLE INSTRUCTION!)
Data Processing Instructions
1. Simple register operands
2. Immediate operands
3. Shifted register operands
4. Multiply
Simple Register Operands (1/2)
• Arithmetic Operations
ADD r0,r1,r2 ;r0:=r1+r2
ADC r0,r1,r2 ;r0:=r1+r2+C
SUB r0,r1,r2 ;r0:=r1–r2
SBC r0,r1,r2 ;r0:=r1–r2+C–1
RSB r0,r1,r2 ;r0:=r2–r1, reverse subtraction
RSC r0,r1,r2 ;r0:=r2–r1+C–1
• By default, data processing operations do not affect the condition
flags; append S to the mnemonic (e.g. ADDS) to update them
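The carry-aware forms above can be modelled in Python to show why ADC and SBC exist: they let a 64-bit operation be built from two 32-bit steps. This is a sketch with hypothetical helper names (`adds`, `subs`, `add64`); operands are assumed to be unsigned values in the range 0 to 0xFFFFFFFF.

```python
MASK = 0xFFFFFFFF

def adds(a, b, carry_in=0):
    """Model ADDS/ADCS: return (result, flags) for a + b + carry_in on 32 bits."""
    full = a + b + carry_in
    result = full & MASK
    flags = {
        "N": result >> 31,                                   # sign bit of result
        "Z": int(result == 0),                               # result is zero
        "C": int(full > MASK),                               # carry out of bit 31
        "V": int(((a ^ result) & (b ^ result)) >> 31 & 1),   # signed overflow
    }
    return result, flags

def subs(a, b, carry_in=1):
    """Model SUBS/SBCS: a - b + C - 1, computed as a + NOT(b) + C."""
    return adds(a, (~b) & MASK, carry_in)

def add64(a_lo, a_hi, b_lo, b_hi):
    """64-bit add from two 32-bit halves: ADDS on the low words, ADC on the high."""
    lo, flags = adds(a_lo, b_lo)
    hi, _ = adds(a_hi, b_hi, flags["C"])
    return lo, hi

print(add64(0xFFFFFFFF, 0, 1, 0))
```

After a subtraction, C = 1 means no borrow occurred, which is why SBC is written as r1 - r2 + C - 1.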
• Comparison Operations
• Do not produce a result, so the destination is omitted from the format
• Just set the condition code bits (N, Z, C and V) in CPSR
CMP r1,r2 ;set cc on r1 - r2, compare
CMN r1,r2 ;set cc on r1 + r2, compare negated
TST r1,r2 ;set cc on r1 AND r2, bit test
TEQ r1,r2 ;set cc on r1 XOR r2, test equal
Immediate Operands
• Replace the second source operand with an immediate operand,
which is a literal constant, preceded by “#”
ADD r3,r3,#1 ;r3:=r3+1
AND r8,r7,#&FF ;r8:=r7[7:0], &:hexadecimal
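• ADD r5,r5,r3,LSL r2 ;r5:=r5+r3*2^r2
• MOV r12,r4,ROR r3 ;r12:=r4 rotated right by the value in r3
[Figure: ROR #5 and RRX rotate operations on bits 31 to 0; RRX rotates through the carry flag C]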
Using the Barrel Shifter: the 2nd Operand
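The shift and rotate operations the barrel shifter applies to the second operand can be modelled in Python. This is a minimal sketch for shift amounts 0 to 31; the function names simply mirror the ARM shift mnemonics, and `rrx` returns the new carry as a second value.

```python
MASK = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: multiply by 2**n, discarding bits above bit 31."""
    return (x << n) & MASK

def lsr(x, n):
    """Logical shift right: zeros are shifted in at the top."""
    return (x & MASK) >> n

def asr(x, n):
    """Arithmetic shift right: the sign bit is replicated at the top."""
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32          # reinterpret as a negative number
    return (x >> n) & MASK

def ror(x, n):
    """Rotate right: bits leaving bit 0 re-enter at bit 31."""
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    """Rotate right extended: a 33-bit rotate by one place through the carry flag."""
    x &= MASK
    return (carry << 31) | (x >> 1), x & 1

print(hex(ror(0xFF, 5)), rrx(0x1, 1))
```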
• Multiply-Accumulate
MLA r4,r3,r2,r1 ;r4:=(r3*r2+r1)[31:0]
Multiply Instructions
• Accumulation is denoted by “+=”
• Example: form a scalar product of two vectors
MOV r11,#20 ;initialize loop counter
MOV r10,#0 ;initialize total
Loop LDR r0,[r8],#4 ;get first component
LDR r1,[r9],#4 ;get second component
MLA r10,r0,r1,r10 ;accumulate product
SUBS r11,r11,#1 ;decrement loop counter
BNE Loop
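For comparison, the same loop can be written in Python. This is a sketch: the lists stand in for the memory addressed by r8 and r9, and the result is wrapped to 32 bits because MLA keeps only the least significant 32 bits of the sum.

```python
MASK = 0xFFFFFFFF

def scalar_product(a, b):
    """Accumulate a[i] * b[i], keeping only the low 32 bits like MLA."""
    total = 0                              # MOV r10,#0
    for x, y in zip(a, b):                 # LDR r0,[r8],#4 / LDR r1,[r9],#4
        total = (total + x * y) & MASK     # MLA r10,r0,r1,r10
    return total                           # loop ends when the counter hits 0

print(scalar_product([1, 2, 3], [4, 5, 6]))
```

The assembly version counts down from 20 with SUBS and BNE, so it always processes 20 components; the Python loop simply runs over whatever length the lists have.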
Multiplication by a Constant
• Multiplication by a constant equal to (a power of 2) +/- 1 can be done in a single cycle
• Using MOV, ADD or RSB with an inline shift
• Example: r0 = r1 * 5 = r1 + (r1 * 4)
• ADD r0,r1,r1,LSL #2
• Several instructions can be combined to carry out other multiplies
• Example: r2 = r3 * 119 = r3 * 17 * 7 = r3 * (16 + 1) * (8 - 1)
• ADD r2,r3,r3,LSL #4 ;r2:=r3*17
• RSB r2,r2,r2,LSL #3 ;r2:=r2*7
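The shift-and-add decompositions can be checked numerically. The sketch below mirrors each instruction with one Python line, with results wrapped to 32 bits as in a register; `times5` and `times119` are hypothetical names.

```python
MASK = 0xFFFFFFFF

def times5(r1):
    """ADD r0,r1,r1,LSL #2 : r0 = r1 + (r1 << 2)"""
    return (r1 + (r1 << 2)) & MASK

def times119(r3):
    """Two-instruction multiply by 119 = 17 * 7."""
    r2 = (r3 + (r3 << 4)) & MASK    # ADD r2,r3,r3,LSL #4 ; r2 = r3 * 17
    r2 = ((r2 << 3) - r2) & MASK    # RSB r2,r2,r2,LSL #3 ; r2 = r2 * 7
    return r2

print(times5(6), times119(3))
```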
Multiply Instructions
• 32-bit product (Least Significant)
• MUL{<cond>}{S} Rd,Rm,Rs
• MLA{<cond>}{S} Rd,Rm,Rs,Rn
• 64-bit Product
• <mul>{<cond>}{S} RdHi,RdLo,Rm,Rs
• <mul> is UMULL, UMLAL, SMULL or SMLAL
Examples
LDMIA r3!, {r0,r4}
LDMIA r5!, {r0-r7}
STMIA r0!, {r6,r7}
STMIA r3!, {r3,r5,r7}
Incorrect examples
LDMIA r3!,{r0,r9} ; high registers not allowed
STMIA r5!, {} ; must be at least one register in list
STMIA r5!,{r1-r6} ; value stored from r5 is unpredictable
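The increment-after behaviour with base writeback can be modelled in Python. This is a simplified sketch: registers and memory are plain dictionaries, the helper names are hypothetical, and the case where the base register also appears in the register list is not modelled.

```python
def check_thumb_list(reg_list):
    """Enforce the two restrictions shown in the incorrect examples above."""
    if not reg_list:
        raise ValueError("register list must contain at least one register")
    if any(r > 7 for r in reg_list):
        raise ValueError("only low registers r0-r7 are allowed")

def stmia(regs, memory, base, reg_list):
    """STMIA Rbase!, {list}: store lowest register first, then advance by 4."""
    check_thumb_list(reg_list)
    address = regs[base]
    for r in sorted(reg_list):
        memory[address] = regs[r]
        address += 4
    regs[base] = address            # writeback requested by "!"

def ldmia(regs, memory, base, reg_list):
    """LDMIA Rbase!, {list}: load lowest register first, then advance by 4."""
    check_thumb_list(reg_list)
    address = regs[base]
    for r in sorted(reg_list):
        regs[r] = memory[address]
        address += 4
    regs[base] = address            # writeback requested by "!"

regs = {n: 0 for n in range(8)}
memory = {}
regs[0], regs[6], regs[7] = 0x100, 11, 22
stmia(regs, memory, 0, [6, 7])      # STMIA r0!, {r6,r7}
regs[3] = 0x100
ldmia(regs, memory, 3, [0, 4])      # LDMIA r3!, {r0,r4}
```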
Data Transfer Instructions-summary
• Load/store instructions
• Used to move signed and unsigned Word, Half Word and Byte to and from registers
• Can be used to load PC (if target address is beyond branch instruction range)
[Figure: STM stores a block of registers (up to R14, R15) to consecutive memory locations (..., Mi+14, Mi+15)]
Swap Instruction
• Exchanges a word between a register and memory
• Two cycles but a single atomic action
• Support for RT semaphores
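The value of an atomic swap for semaphores can be shown with a small Python model. SWP reads a word from memory and writes a register value to the same location as one indivisible action; the sketch imitates that with a single function. The names `swp`, `try_lock`, `LOCKED` and `FREE` are hypothetical, and the model is single-threaded, so it shows only the logic, not real atomicity.

```python
FREE, LOCKED = 0, 1

def swp(memory, address, new_value):
    """Model SWP: return the old memory word and store new_value in its place."""
    old = memory[address]
    memory[address] = new_value
    return old

def try_lock(memory, address):
    """Claim the semaphore; succeed only if it was free before the swap."""
    return swp(memory, address, LOCKED) == FREE

def unlock(memory, address):
    """Release the semaphore with an ordinary store."""
    memory[address] = FREE

memory = {0x40: FREE}
first = try_lock(memory, 0x40)     # semaphore was free, so this claims it
second = try_lock(memory, 0x40)    # already locked, so this fails
```

Because the read and the write cannot be separated, two contenders can never both see FREE.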
Modifying the Status Registers
• Only indirectly
• MRS moves contents from CPSR/SPSR to a selected GPR
• MSR moves contents from a selected GPR to CPSR/SPSR
• Only in privileged modes
Branching Instructions
• The Thumb instruction set does not include some instructions that are needed for
exception handling, so ARM code needs to be used for at least the top-level exception
handlers.
Thumb State
The Thumb state register set is a subset of the ARM state set. The programmer has
direct access to:
• Eight general registers r0 - r7
• The program counter PC
• A Stack pointer SP
• A Link register LR
• The current program status register CPSR
In Thumb state, the high registers (r8 - r15) are not part of the standard register set.
Thumb vs. ARM
How to change into Thumb state, then back
Example
;start off in ARM state
CODE32
ADR r0,Into_Thumb+1 ;generate branch target
;address & set bit 0
;hence arrive Thumb state
BX r0 ;branch exchange to Thumb
…
CODE16 ;assemble subsequent as Thumb
Into_Thumb …
ADR r5,Back_to_ARM ;generate branch target to
;word-aligned address,
;hence bit 0 is cleared.
BX r5 ;branch exchange to ARM
…
CODE32 ;assemble subsequent as ARM
Back_to_ARM …
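The rule BX applies to its register operand can be sketched in Python: bit 0 of the target selects the state and is then cleared to form the new pc. The function name `bx` and the addresses used are illustrative only.

```python
def bx(target):
    """Model BX Rn: bit 0 of the target selects Thumb (1) or ARM (0) state."""
    state = "Thumb" if target & 1 else "ARM"
    pc = target & ~1 & 0xFFFFFFFF   # bit 0 is not part of the address
    return state, pc

# Into_Thumb+1 has bit 0 set; Back_to_ARM is word aligned, so bit 0 is clear.
print(bx(0x8001), bx(0x9000))
```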
ARM Assembly Language Programming
Introduction to assembly language programming
• The following is a simple example which illustrates some of the core constituents of an
ARM assembler module:
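[Figure: example source line annotated with its fields: label, opcode, operands (the objects operated on by the opcode), comment, and an assembler directive]
Assembly Instruction
• One line of code: first ADD r1,r2,r3
• first is the label (optional); ADD is the opcode
[Figure: the code and the resulting register contents]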
Assembly Language Programming
• Delimiters
• Labels
• Operation code(Mnemonics)
• Directives
Delimiters:
Directives:
Define Constant:
• It allows the programmer to enter fixed data into program memory.
• Assembler:
; compute and test condition
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b
LDR r1,[r4] ; get value for b
CMP r0,r1 ; compare a with b
BLE fblock ; if a <= b, branch to false block
If statement, cont’d.
; true block
MOV r0,#5 ; generate value for x
ADR r4,x ; get address for x
STR r0,[r4] ; store x
ADR r4,c ; get address for c
LDR r0,[r4] ; get value of c
ADR r4,d ; get address for d
LDR r1,[r4] ; get value of d
ADD r0,r0,r1 ; compute y
ADR r4,y ; get address for y
STR r0,[r4] ; store y
B after ; branch around false block
If statement, cont’d.
; false block
fblock ADR r4,c ; get address for c
LDR r0,[r4] ; get value of c
ADR r4,d ; get address for d
LDR r1,[r4] ; get value for d
SUB r0,r0,r1 ; compute c-d
ADR r4,x ; get address for x
STR r0,[r4] ; store value of x
after ...
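Read as written, the listing branches to the false block when a <= b, so the true block runs when a > b. The high-level statement it implements is not shown in the text, so the Python equivalent below is an assumption reconstructed from the instructions and their comments.

```python
def if_statement(a, b, c, d, x=0, y=0):
    """Assumed source form of the assembly listing above."""
    if a > b:            # CMP r0,r1 / BLE fblock skips this block when a <= b
        x = 5            # MOV r0,#5 ... STR r0,[r4]
        y = c + d        # ADD r0,r0,r1 ... STR r0,[r4]
    else:                # false block
        x = c - d        # SUB r0,r0,r1 ... STR r0,[r4]
    return x, y

print(if_statement(2, 1, 3, 4), if_statement(1, 2, 3, 4))
```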
EXAMPLE PROGRAMS
16 Bit data transfer
Some primitives are supported directly by the ARM instruction set, including
32-bit addition and multiplication.
• ARM cores don’t have hardware support for division. To divide two
numbers you must call a software routine that calculates the result
using standard arithmetic operations. If you can’t avoid a division
then you need access to very optimized division routines.
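A software division routine of the kind described can be sketched with the classic shift-and-subtract method. This is an unoptimized illustration of the principle, not one of the optimized primitives the text refers to; `udiv32` is a hypothetical name and the operands are assumed to be unsigned 32-bit values.

```python
def udiv32(n, d):
    """Unsigned 32-bit division using only shifts, compares and subtracts."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(31, -1, -1):
        # Bring down the next bit of the numerator.
        remainder = (remainder << 1) | ((n >> i) & 1)
        if remainder >= d:
            remainder -= d
            quotient |= 1 << i      # this quotient bit is 1
    return quotient, remainder

print(udiv32(100, 7))
```

The loop runs 32 times regardless of the operands, which is why optimized routines try to skip leading zero bits or unroll the loop.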
The image format files can be built to include the debug tables
required by the ARM symbolic debugger (ARMsd), which can load, run
and debug programs either on hardware such as the ARM
Development Board or using a software emulation of the ARM (the
ARMulator).
The ARM assembler
• The ARM assembler is a full macro assembler which produces ARM
object format output that can be linked with output from the C
compiler.
• Assembly source language is near machine-level, with most assembly
instructions translating into single ARM (or Thumb) instructions.
The linker
• The linker takes one or more object files and combines them into an
executable program.
• It resolves symbolic references between the object files and extracts
object modules from libraries as needed by the program.
ARMsd
• The ARM symbolic debugger is a front-end interface to assist in
debugging programs running either under emulation (on the
ARMulator) or remotely on a target system such as the ARM
development board. The remote system must support the
appropriate remote debug protocols either via a serial line or through
a JTAG test interface.
• At its most basic, ARMsd allows an executable program to be loaded
into the ARMulator or a development board and run. It allows the
setting of breakpoints, which are addresses in the code that, if
executed, cause execution to halt so that the processor state can be
examined.
ARMulator
• The ARMulator (ARM emulator) is a suite of programs that models the
behaviour of various ARM processor cores in software on a host
system. It can operate at various levels of accuracy:
• Instruction-accurate modelling gives the exact behavior of the system
state without regard to the precise timing characteristics of the
processor.
• Cycle-accurate modelling gives the exact behavior of the processor on
a cycle-by-cycle basis, allowing the exact number of clock cycles that a
program requires to be established.
• Timing-accurate modelling presents signals at the correct time within
a cycle, allowing logic delays to be accounted for.
ARM development board