ARM Final

1
The ARM Architecture

ARM PROCESSOR,
ESRTP,
SEM-8,B.E.(ELECTRONICS)

WITH BEST OF LUCK FROM:
Prof. Vidya Gogate
SAKEC, Chembur
2
THE ARCHITECTURE FOR THE DIGITAL WORLD
The ARM Architecture
3
Agenda
Introduction to ARM Ltd
Programmers Model
Instruction Set
System Design
Development Tools

4
ARM Ltd
Founded in November 1990
Spun out of Acorn Computers

Designs the ARM range of RISC processor
cores
Licenses ARM core designs to semiconductor
partners who fabricate and sell to their
customers.
ARM does not fabricate silicon itself

Also develops technologies to assist with the
design-in of the ARM architecture
Software tools, boards, debug hardware,
application software, bus architectures,
peripherals etc
5
ARM Partnership Model
6
ARM Powered Products
7
Latest NEWS
For 30 years, Intel basically had the market to itself and as a result, its
chips were priced much higher than ARM chips.

But ARM's entry into Intel's territory is a real threat to Intel's dominance.

In fact, Microsoft's first version of the Surface tablet/keyboard device
uses ARM chips. The Intel-based versions will come out about three
months later.

8
Latest NEWS
There has been an important new development in the processor world
lately. The folks behind the ARM processor (the chip that powers most
smart phones and tablets today) decided to scale up this processor
technology to run at speeds that could be used in advanced tablets and,
more importantly, laptops and even desktops.

And ARM processors got a major boost when Microsoft made the
decision to create an ARM-based version of Windows 8. For the first time,
Microsoft broke away from what's been called the Win-Tel monopoly.
9
Latest NEWS

But Microsoft's new operating system for ARM, called Windows RT, has
touched off a new battlefront in processor wars.

This has forced Intel to try to make its x86 processors more energy
efficient; the company hopes to have chips that are on par with ARM by mid-2013.

Read more: http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/#ixzz23XHpDxkn

10
RISC Design Philosophy
Instructions: RISC processors have a reduced number of instruction
classes which provide simple operations that can each execute in a
single cycle.
In contrast, in CISC processors the instructions are often of variable
size and take many cycles to execute.
Pipelines: The processing of instructions is broken down into smaller
units that can be executed in parallel by pipelines.
There is no need for an instruction to be executed by a mini-program
called microcode as on CISC processors.
Registers: RISC machines have a large general-purpose register set.
Any register can contain either data or an address. Registers act as the
fast local memory store for all data processing operations.
In contrast, CISC processors have dedicated registers for specific
purposes.
11

Load-store architecture: The processor operates on
data held in registers. Separate load and store instructions
transfer data between the register bank and external
memory.

Memory accesses are costly, so separating memory
accesses from data processing provides an advantage
because you can use data items held in the register bank
multiple times without needing multiple memory accesses.

In contrast, with a CISC design the data processing
operations can act on memory directly.

12
These design rules allow a RISC processor to be simpler,
and thus the core can operate at higher clock
frequencies.

In contrast, traditional CISC processors are more complex
and operate at lower clock frequencies.
13
ARM Design Philosophy
Portable embedded systems require some form of battery power. The
ARM processor has been specifically designed to be small to reduce
power consumption and extend battery operation, which is essential for
applications such as PDAs.
Embedded systems have limited memory due to cost and/or
physical size restrictions, so high code density is a useful feature of ARM
for applications that have limited on-board memory. The ability to use
low-cost memory devices produces substantial savings.
For a single-chip solution, the smaller the area used by the embedded
processor (reduced die size), the more space is available for specialized
peripherals. This in turn reduces the cost of the design and
manufacturing since fewer discrete chips are required for the end
product.
14
ARM Design Philosophy
ARM has incorporated hardware debug technology
within the processor so that software engineers can view
what is happening while the processor is executing code.

With greater visibility, software engineers can resolve
issues faster, which has a direct effect on the time to
market and reduces overall development costs.
15
Instruction set FEATURES

Variable cycle execution for certain instructions:
Load-store-multiple instructions vary in the number of execution cycles
depending upon the number of registers being transferred.
The transfer can occur on sequential memory addresses, which
increases performance since sequential memory accesses are often
faster than random accesses.
Code density is also improved since multiple register transfers are
common operations at the start and end of functions.
Inline barrel shifter leading to more complex instructions:
The inline barrel shifter is a hardware component that preprocesses
one of the input registers before it is used by an instruction.
This expands the capability of many instructions to improve core
performance and code density.

16
Instruction set FEATURES
Thumb 16-bit instruction set: The 16-bit instructions improve
code density by about 30% over 32-bit fixed-length instructions.

Conditional execution: An instruction is only executed when a
specific condition has been satisfied. This feature improves
performance and code density by reducing branch instructions.

Enhanced instructions: The enhanced digital signal processor
(DSP) instructions were added to the standard ARM instruction set to
support fast 16 x 16-bit multiplier operations and saturation. These
instructions allow a faster-performing ARM processor in some cases to
replace the traditional combinations of a processor plus a DSP.
17
Nomenclature
Instruction set architecture (ISA) is upward compatible.
ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}
x: family
y: memory management/protection unit
z: cache
T: Thumb 16-bit decoder
D: JTAG debug
M: fast multiplier
I: Embedded ICE macro-cell
E: enhanced instructions (assumes TDMI)
J: Jazelle
F: vector floating-point unit
S: synthesizable version
18
Nomenclature

All ARM cores after the ARM7TDMI include the TDMI
features even though they may not include those letters
after the ARM label.

The processor family is a group of processor
implementations that share the same hardware
characteristics.

For example, the ARM7TDMI, ARM740T, and ARM720T all
share the same family characteristics and belong to the
ARM7 family.
19
Nomenclature

JTAG is described by IEEE 1149.1 Standard Test Access Port
and boundary scan architecture.
It is a serial protocol used by ARM to send and receive
debug information between the processor core and test
equipment.

Embedded ICE macro-cell is the debug hardware built into
the processor that allows breakpoints and watch-points to
be set.
Synthesizable means that the processor core is supplied as
source code that can be compiled into a form easily used
by EDA tools.

20
Syllabus

ARM processor fundamentals: introduction to ARM and THUMB
instruction set
Processor and memory organization: CPU bus configuration,
ARM bus, memory devices, input/output devices
Component interfacing, designing with microprocessor,
development and debugging
Design example
Instruction set with enhanced DSP features with ARM core, mixed
mode programming as Thumb + ARM core
Assembly programming concept; comparison of ARM7, ARM9,
ARM11 with new feature additions

21
PIN DIAGRAM
22
Architecture block diagram
23
Hardware Fundamentals
The ARM processor can be abstracted
into eight components:
ALU, barrel shifter, MAC, register file,
instruction decoder, address register,
incrementer, and sign extend.
24
Data Sizes and Instruction Sets
The ARM is a 32-bit architecture.

When used in relation to the ARM:
Byte means 8 bits
Halfword means 16 bits (two bytes)
Word means 32 bits (four bytes)

Most ARMs implement two instruction sets
32-bit ARM Instruction Set
16-bit Thumb Instruction Set

Jazelle cores can also execute Java bytecode
25
Processor Modes
The ARM has seven basic operating modes:

User : unprivileged mode under which most tasks run

FIQ : entered when a high priority (fast) interrupt is raised

IRQ : entered when a low priority (normal) interrupt is raised

Supervisor : entered on reset and when a Software Interrupt
instruction is executed

Abort : used to handle memory access violations

Undef : used to handle undefined instructions

System : privileged mode using the same registers as user mode
26
The ARM Register Set
[Figure: the ARM register set, drawn once per current mode, with the current visible registers highlighted and the banked out registers shown alongside.]
User mode: r0-r12, r13 (sp), r14 (lr), r15 (pc), and cpsr are visible. Banked out: the FIQ bank (r8-r12, r13 (sp), r14 (lr), spsr) and the IRQ, SVC, Undef, and Abort banks (r13 (sp), r14 (lr), spsr each).
FIQ mode: r0-r7, r15 (pc), and cpsr are shared with User mode; r8-r12, r13 (sp), r14 (lr), and spsr come from the FIQ bank. Banked out: the User r8-r12, r13 (sp), r14 (lr), and the IRQ, SVC, Undef, and Abort banks.
IRQ mode, SVC mode, Undef mode, Abort mode: r0-r12, r15 (pc), and cpsr are shared with User mode; r13 (sp), r14 (lr), and spsr come from the mode's own bank. Banked out: the User r13 (sp) and r14 (lr), and the banks of the other exception modes.
27
Register Organization Summary
Registers   User    FIQ      IRQ      SVC      Undef    Abort
r0-r7       User    User     User     User     User     User
r8-r12      User    banked   User     User     User     User
r13 (sp)    User    banked   banked   banked   banked   banked
r14 (lr)    User    banked   banked   banked   banked   banked
r15 (pc)    User    User     User     User     User     User
cpsr        User    User     User     User     User     User
spsr        -       banked   banked   banked   banked   banked
"User" means the mode uses the User mode register; "banked" means the mode has its own copy.
Thumb state low registers: r0-r7
Thumb state high registers: r8-r15
Note: System mode uses the User mode register set
28
The Registers
ARM has 37 registers, all of which are 32 bits long.
1 dedicated program counter
1 dedicated current program status register
5 dedicated saved program status registers
30 general purpose registers

The current processor mode governs which of several banks is
accessible. Each mode can access
a particular set of r0-r12 registers
a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
the program counter, r15 (pc)
the current program status register, cpsr

Privileged modes (except System) can also access
a particular spsr (saved program status register)
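The banking scheme above can be checked with a small model. This is a sketch in Python rather than ARM code, and the physical register names such as r13_irq are made up for illustration (they are an assumption, not notation from the slides). It only counts registers and lists which ones a given mode sees.

```python
# Hypothetical naming: "<reg>_<bank>" marks a banked physical register.
EXCEPTION_MODES = ["fiq", "irq", "svc", "abort", "undef"]


def physical_registers():
    """List every physical register in the model."""
    regs = [f"r{i}" for i in range(8)] + ["r15", "cpsr"]   # shared by all modes
    regs += [f"r{i}_usr" for i in range(8, 15)]            # User/System r8-r14
    regs += [f"r{i}_fiq" for i in range(8, 13)]            # FIQ-only r8-r12
    for mode in EXCEPTION_MODES:                           # sp, lr, spsr per mode
        regs += [f"r13_{mode}", f"r14_{mode}", f"spsr_{mode}"]
    return regs


def visible_registers(mode):
    """Return the physical registers that a mode sees as r0-r15, cpsr, spsr."""
    physical = set(physical_registers())
    bank = "usr" if mode in ("user", "system") else mode
    regs = [f"r{i}" for i in range(8)]
    for i in range(8, 15):
        own = f"r{i}_{bank}"
        # Fall back to the User copy when the mode has no banked copy.
        regs.append(own if own in physical else f"r{i}_usr")
    regs += ["r15", "cpsr"]
    if bank != "usr":
        regs.append(f"spsr_{bank}")
    return regs


if __name__ == "__main__":
    print(len(physical_registers()))
    print(visible_registers("irq"))
```

User and System mode see 17 registers in this model and each exception mode sees 18, because only the exception modes have an spsr.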
29
Program Status Registers
Condition code flags
N = Negative result from ALU
Z = Zero result from ALU
C = ALU operation Carried out
V = ALU operation oVerflowed

Sticky Overflow flag - Q flag
Architecture 5TE/J only
Indicates if saturation has occurred

J bit
Architecture 5TEJ only
J = 1: Processor in Jazelle state

Interrupt Disable bits.
I = 1: Disables the IRQ.
F = 1: Disables the FIQ.

T Bit
Architecture xT only
T = 0: Processor in ARM state
T = 1: Processor in Thumb state

Mode bits
Specify the processor mode
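Bit layout (bit 31 down to bit 0):
31 N, 30 Z, 29 C, 28 V, 27 Q, 24 J, 7 I, 6 F, 5 T, 4-0 mode; the remaining bits are undefined
Fields: f = bits 31-24, s = bits 23-16, x = bits 15-8, c = bits 7-0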
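As an illustration of the layout, a status register value can be unpacked bit by bit. The following is a Python sketch, not ARM code. The 5-bit mode encodings in MODES are taken from the ARM architecture and are not shown on the slide, so treat them as an added assumption; the function name decode_cpsr is hypothetical.

```python
# Mode field encodings (bits 4-0); not listed on the slide.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_cpsr(value):
    """Split a 32-bit program status register value into named fields."""
    def bit(n):
        return (value >> n) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "Q": bit(27),            # sticky overflow (5TE/J only)
        "J": bit(24),            # Jazelle state (5TEJ only)
        "I": bit(7), "F": bit(6),  # 1 = interrupt disabled
        "T": bit(5),             # 1 = Thumb state
        "mode": MODES.get(value & 0x1F, "unknown"),
    }


if __name__ == "__main__":
    print(decode_cpsr(0x600000D3))
```

For example, 0x600000D3 decodes as Z and C set, IRQ and FIQ disabled, ARM state, Supervisor mode.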
30
Program Counter (r15)
When the processor is executing in ARM state:
All instructions are 32 bits wide
All instructions must be word aligned
Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as
an instruction cannot be halfword or byte aligned).

When the processor is executing in Thumb state:
All instructions must be halfword aligned
Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as
an instruction cannot be byte aligned).

When the processor is executing in Jazelle state:
Processor performs a word access to read 4 instructions at once
31
Exception Handling
When an exception occurs, the ARM:
Copies CPSR into SPSR_<mode>
Sets appropriate CPSR bits:
  Change to ARM state
  Change to exception mode
  Disable interrupts (if appropriate)
Stores the return address in LR_<mode>
Sets PC to vector address
To return, exception handler needs to:
Restore CPSR from SPSR_<mode>
Restore PC from LR_<mode>
This can only be done in ARM state.

Vector Table
0x1C  FIQ
0x18  IRQ
0x14  (Reserved)
0x10  Data Abort
0x0C  Prefetch Abort
0x08  Software Interrupt
0x04  Undefined Instruction
0x00  Reset
Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices
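The entry sequence and the vector table can be put together in a small model. This is a hedged Python sketch, not ARM code. The mapping from exception to mode and the rule that only Reset and FIQ also disable FIQ come from the ARM architecture rather than from the slide. The return address is passed in by the caller, because the exact value stored in LR differs per exception and is not given here. The dictionary-based state and all function names are hypothetical.

```python
# exception -> (vector offset, mode entered, also disables FIQ)
VECTORS = {
    "reset":          (0x00, "svc",   True),
    "undefined":      (0x04, "undef", False),
    "swi":            (0x08, "svc",   False),
    "prefetch_abort": (0x0C, "abort", False),
    "data_abort":     (0x10, "abort", False),
    "irq":            (0x18, "irq",   False),
    "fiq":            (0x1C, "fiq",   True),
}
MODE_BITS = {"fiq": 0x11, "irq": 0x12, "svc": 0x13, "abort": 0x17, "undef": 0x1B}
I_BIT, F_BIT = 0x80, 0x40


def enter_exception(state, kind, return_address, high_vectors=False):
    """Return a new state dict after taking the given exception."""
    offset, mode, mask_fiq = VECTORS[kind]
    new = dict(state)
    new["spsr_" + mode] = state["cpsr"]   # copy CPSR into SPSR_<mode>
    cpsr = state["cpsr"] & ~0x3F          # clear T bit (ARM state) and mode bits
    cpsr |= MODE_BITS[mode]               # change to exception mode
    cpsr |= I_BIT                         # IRQ is disabled on every exception
    if mask_fiq:
        cpsr |= F_BIT
    new["cpsr"] = cpsr
    new["lr_" + mode] = return_address    # store return address in LR_<mode>
    new["pc"] = (0xFFFF0000 if high_vectors else 0) + offset
    return new


if __name__ == "__main__":
    before = {"cpsr": 0x60000030}         # User mode, Thumb state, Z and C set
    after = enter_exception(before, "irq", 0x8004)
    print({k: hex(v) for k, v in after.items()})
```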
32
Instruction set
ARM has three instruction sets: ARM, Thumb, and Jazelle.
The register file contains 37 registers, but only 17 or 18 registers
are accessible at any point in time; the rest are banked
according to processor mode.
The current processor mode is stored in the CPSR.
It holds the current status of the processor core as well as
interrupt masks, condition flags, and state.
The state determines which instruction set is being executed.
33
Conditional Execution and Flags
ARM instructions can be made to execute conditionally by postfixing
them with the appropriate condition code field.
This improves code density and performance by reducing the number of
forward branch instructions.
    CMP r3,#0               CMP r3,#0
    BEQ skip                ADDNE r0,r1,r2
    ADD r0,r1,r2
skip

By default, data processing instructions do not affect the condition code
flags, but the flags can be optionally set by using "S". CMP does not
need "S".
loop

    SUBS r1,r1,#1           ; decrement r1 and set flags
    BNE loop                ; if Z flag clear then branch
34
Condition Codes
The possible condition codes are listed below:
Suffix   Description               Flags tested
EQ       Equal                     Z=1
NE       Not equal                 Z=0
CS/HS    Unsigned higher or same   C=1
CC/LO    Unsigned lower            C=0
MI       Minus                     N=1
PL       Positive or Zero          N=0
VS       Overflow                  V=1
VC       No overflow               V=0
HI       Unsigned higher           C=1 & Z=0
LS       Unsigned lower or same    C=0 or Z=1
GE       Greater or equal          N=V
LT       Less than                 N!=V
GT       Greater than              Z=0 & N=V
LE       Less than or equal        Z=1 or N!=V
AL       Always
Note AL is the default and does not need to be specified
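The flag tests in the condition code table can be written out directly. The sketch below is Python, not ARM code. The helper cmp_flags models how CMP sets N, Z, C, and V for a 32-bit subtraction, which is standard ARM behaviour but is not spelled out on the slide. Both function names are hypothetical.

```python
MASK = 0xFFFFFFFF

CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}
CONDITIONS["HS"] = CONDITIONS["CS"]   # alternative spellings
CONDITIONS["LO"] = CONDITIONS["CC"]


def cmp_flags(a, b):
    """Return (N, Z, C, V) as set by CMP a, b on 32-bit values."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                           # carry set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31        # signed overflow of a - b
    return n, z, c, v


def condition_passed(suffix, n, z, c, v):
    """True if an instruction with this condition suffix would execute."""
    return CONDITIONS[suffix](n, z, c, v)


if __name__ == "__main__":
    flags = cmp_flags(0xFFFFFFFF, 1)          # -1 compared with 1
    print(condition_passed("LT", *flags), condition_passed("HI", *flags))
```

The example compares 0xFFFFFFFF with 1: as signed values the first is less than the second (LT passes), while as unsigned values it is higher (HI passes).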
35
Examples of conditional execution
Use a sequence of several conditional instructions
if (a==0) func(1);
CMP r0,#0
MOVEQ r0,#1
BLEQ func

Set the flags, then use various condition codes
if (a==0) x=0;
if (a>0) x=1;
CMP r0,#0
MOVEQ r1,#0
MOVGT r1,#1

Use conditional compare instructions
if (a==4 || a==10) x=0;
CMP r0,#4
CMPNE r0,#10
MOVEQ r1,#0
36
Branch instructions
Branch : B{<cond>} label
Branch with Link : BL{<cond>} subroutine_label

Encoding:
bits 31-28  Cond     Condition field
bits 27-25  1 0 1
bit 24      L        Link bit: 0 = Branch, 1 = Branch with link
bits 23-0   Offset

The processor core shifts the offset field left by 2 positions, sign-extends
it and adds it to the PC
±32 Mbyte range
How to perform longer branches?
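The offset arithmetic for B and BL can be traced with a short model. This is a Python sketch, not ARM code. It assumes the usual ARM-state rule that the PC reads as the address of the branch instruction plus 8, which the slide does not state. The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def is_link(instruction):
    """True for BL (bit 24 set), False for B."""
    return bool((instruction >> 24) & 1)


def branch_target(instruction, address):
    """Destination of a B/BL instruction located at `address`."""
    offset = sign_extend(instruction & 0x00FFFFFF, 24) << 2   # word offset to bytes
    return (address + 8 + offset) & 0xFFFFFFFF                # PC = address + 8


if __name__ == "__main__":
    # 0xEAFFFFFE encodes an unconditional branch to itself.
    print(hex(branch_target(0xEAFFFFFE, 0x8000)))
```

The 24-bit signed word offset gives a reach of about 2^23 words in each direction, which is where the 32 Mbyte range comes from.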
37
Data processing Instructions
Consist of :
Arithmetic: ADD ADC SUB SBC RSB RSC
Logical: AND ORR EOR BIC
Comparisons: CMP CMN TST TEQ
Data movement: MOV MVN

These instructions only work on registers, NOT memory.

Syntax:

<Operation>{<cond>}{S} Rd, Rn, Operand2

Comparisons set flags only - they do not specify Rd
Data movement does not specify Rn

Second operand is sent to the ALU via barrel shifter.
38
The Barrel Shifter
LSL : Logical Left Shift. Multiplication by a power of 2. Zeros enter at the LSB; the last bit shifted out of the MSB goes to CF.
LSR : Logical Shift Right. Division by a power of 2. Zeros enter at the MSB; the last bit shifted out of the LSB goes to CF.
ASR : Arithmetic Right Shift. Division by a power of 2, preserving the sign bit. The last bit shifted out of the LSB goes to CF.
ROR : Rotate Right. Bit rotate with wrap around from LSB to MSB. The last bit rotated goes to CF.
RRX : Rotate Right Extended. Single bit rotate with wrap around from CF to MSB.
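The five shift operations can be modelled on 32-bit values as follows. This is a Python sketch, not ARM code. It assumes shift amounts from 0 to 31 and, to stay short, ignores the carry-out for every operation except RRX. The function names simply mirror the mnemonics.

```python
MASK = 0xFFFFFFFF


def lsl(value, n):
    """Logical shift left: multiply by 2**n, discarding overflow."""
    return (value << n) & MASK


def lsr(value, n):
    """Logical shift right: unsigned divide by 2**n."""
    return (value & MASK) >> n


def asr(value, n):
    """Arithmetic shift right: divide by 2**n, keeping the sign bit."""
    value &= MASK
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret as a negative number
    return (value >> n) & MASK


def ror(value, n):
    """Rotate right: bits leaving the LSB re-enter at the MSB."""
    n %= 32
    value &= MASK
    return ((value >> n) | (value << (32 - n))) & MASK


def rrx(value, carry):
    """Rotate right by one through carry; returns (result, new carry)."""
    value &= MASK
    return (carry << 31) | (value >> 1), value & 1


if __name__ == "__main__":
    print(hex(asr(0x80000000, 4)), hex(ror(0xFF, 8)))
```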
39
Using the Barrel Shifter: The Second Operand
[Figure: Operand 1 feeds the ALU directly; Operand 2 passes through the barrel shifter before reaching the ALU, which produces the result.]

Register, optionally with shift operation
Shift value can be either:
5 bit unsigned integer
Specified in bottom byte of another
register.
Used for multiplication by constant

Immediate value
8 bit number, with a range of 0-255.
Rotated right through even number of
positions
Allows increased range of 32-bit
constants to be loaded directly into
registers
40
Immediate constants (1)
No ARM instruction can contain a 32 bit immediate constant
All ARM instructions are fixed as 32 bits long
The data processing instruction format has 12 bits available for operand2
bits 11-8  rot      rotate value, multiplied by 2 and applied as ROR by the shifter
bits 7-0   immed_8  8 bit immediate

4 bit rotate value (0-15) is multiplied by two to give range 0-30 in steps of 2
Rule to remember is "8-bits shifted by an even number of bit positions".
Quick Quiz:
0xe3a004ff
MOV r0, #???
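The expansion of the 12-bit operand2 field can be sketched as below. This is Python, not ARM code, and the function names are hypothetical. It implements only the rule stated on the slide: an 8-bit value rotated right by twice the 4-bit rotate field.

```python
MASK = 0xFFFFFFFF


def ror(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    value &= MASK
    return ((value >> amount) | (value << (32 - amount))) & MASK


def decode_operand2_immediate(operand2):
    """Expand the 12-bit immediate field (bits 11-0 of the instruction)."""
    rot = (operand2 >> 8) & 0xF       # bits 11-8
    immed_8 = operand2 & 0xFF         # bits 7-0
    return ror(immed_8, 2 * rot)


def decode_mov_immediate(instruction):
    """Immediate value of a 32-bit MOV Rd, #imm instruction word."""
    return decode_operand2_immediate(instruction & 0xFFF)


if __name__ == "__main__":
    print(hex(decode_mov_immediate(0xE3A004FF)))
```

Running the script on the quiz word 0xe3a004ff is one way to check an answer worked out by hand.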
41
Immediate constants (2)
Examples:
ror #0   range 0-0x000000ff  step 0x00000001
ror #8   range 0-0xff000000  step 0x01000000
ror #30  range 0-0x000003fc  step 0x00000004

The assembler converts immediate values to the rotate form:
MOV r0,#4096 ; uses 0x40 ror 26
ADD r1,r2,#0xFF0000 ; uses 0xFF ror 16

The bitwise complements can also be formed using MVN:
MOV r0, #0xFFFFFFFF ; assembles to MVN r0,#0

Values that cannot be generated in this way will cause an error.
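The assembler's search for a rotate form can be approximated as follows. This is a Python sketch of one possible search, not the algorithm of any particular assembler, and the function names are hypothetical. It tries the 16 even rotations in ascending order, so when a constant has several valid encodings it returns the one with the smallest rotation, which may differ from the encoding quoted in an example.

```python
MASK = 0xFFFFFFFF


def rol(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= MASK
    return ((value << amount) | (value >> (32 - amount))) & MASK


def ror(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    return rol(value, 32 - (amount % 32))


def encode_immediate(value):
    """Return the 12-bit rot/immed_8 field for value, or None if impossible."""
    value &= MASK
    for rot in range(16):
        immed_8 = rol(value, 2 * rot)      # undo a ROR by 2*rot
        if immed_8 <= 0xFF:
            return (rot << 8) | immed_8
    return None


def decode_immediate(field):
    """Inverse of encode_immediate: expand a 12-bit field to 32 bits."""
    return ror(field & 0xFF, 2 * ((field >> 8) & 0xF))


def assemble_mov(value):
    """Choose MOV or MVN for a constant, mirroring the slide's rule."""
    field = encode_immediate(value)
    if field is not None:
        return "MOV", field
    field = encode_immediate(~value)       # try the bitwise complement
    if field is not None:
        return "MVN", field
    raise ValueError(f"{value:#x} cannot be encoded as an immediate")


if __name__ == "__main__":
    print(assemble_mov(0xFFFFFFFF), encode_immediate(0x55555555))
```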
42
Loading 32 bit constants
To allow larger constants to be loaded, the assembler offers a pseudo-instruction:
LDR rd, =const
This will either:
Produce a MOV or MVN instruction to generate the value (if possible).
or
Generate a LDR instruction with a PC-relative address to read the constant
from a literal pool (Constant data area embedded in the code).
For example
LDR r0,=0xFF => MOV r0,#0xFF
LDR r0,=0x55555555 => LDR r0,[PC,#Imm12]

                      DCD 0x55555555
This is the recommended way of loading constants into a register
43
Multiply
Syntax:
MUL{<cond>}{S} Rd, Rm, Rs Rd = Rm * Rs
MLA{<cond>}{S} Rd,Rm,Rs,Rn Rd = (Rm * Rs) + Rn
[U|S]MULL{<cond>}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := Rm*Rs
[U|S]MLAL{<cond>}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := (Rm*Rs)+RdHi,RdLo

Cycle time
Basic MUL instruction
2-5 cycles on ARM7TDMI
1-3 cycles on StrongARM/XScale
2 cycles on ARM9E/ARM102xE
+1 cycle for ARM9TDMI (over ARM7TDMI)
+1 cycle for accumulate (not on 9E though result delay is one cycle longer)
+1 cycle for long

Above are general rules - refer to the TRM for the core you are using
for the exact details
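The results of the multiply instructions above can be modelled on 32-bit values. This is a Python sketch of the arithmetic only, not ARM code. It ignores flags and cycle counts, and the function names mirror the mnemonics for convenience.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Interpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul(rm, rs):
    """MUL: low 32 bits of Rm * Rs."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """MLA: low 32 bits of (Rm * Rs) + Rn."""
    return (rm * rs + rn) & MASK32


def umull(rm, rs):
    """UMULL: unsigned 64-bit product as (RdLo, RdHi)."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def smull(rm, rs):
    """SMULL: signed 64-bit product as (RdLo, RdHi)."""
    product = (to_signed(rm) * to_signed(rs)) & ((1 << 64) - 1)
    return product & MASK32, product >> 32


if __name__ == "__main__":
    print([hex(x) for x in umull(0xFFFFFFFF, 2)])
    print([hex(x) for x in smull(0xFFFFFFFF, 2)])
```

The same bit pattern 0xFFFFFFFF gives different high words under UMULL and SMULL, which is why the long multiplies come in unsigned and signed forms.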
44
Single register data transfer
LDR STR Word
LDRB STRB Byte
LDRH STRH Halfword
LDRSB Signed byte load
LDRSH Signed halfword load

Memory system must support all access sizes

Syntax:
LDR{<cond>}{<size>} Rd, <address>
STR{<cond>}{<size>} Rd, <address>

e.g. LDREQB
45
Address accessed
Address accessed by LDR/STR is specified by a base register plus an
offset
For word and unsigned byte accesses, offset can be
An unsigned 12-bit immediate value (ie 0 - 4095 bytes).
LDR r0,[r1,#8]
A register, optionally shifted by an immediate value
LDR r0,[r1,r2]
LDR r0,[r1,r2,LSL#2]
This can be either added or subtracted from the base register:
LDR r0,[r1,#-8]
LDR r0,[r1,-r2]
LDR r0,[r1,-r2,LSL#2]
For halfword and signed halfword / byte, offset can be:
An unsigned 8 bit immediate value (ie 0-255 bytes).
A register (unshifted).
Choice of pre-indexed or post-indexed addressing
46
Pre or Post Indexed Addressing?
Pre-indexed: STR r0,[r1,#12]
Base register r1 = 0x200, offset 12, source register for STR r0 = 0x5.
The value 0x5 is stored at address 0x20c; r1 remains 0x200.
Auto-update form: STR r0,[r1,#12]!
As pre-indexed, but the base register r1 is also updated to 0x20c.
Post-indexed: STR r0,[r1],#12
Original base register r1 = 0x200, offset 12, source register for STR r0 = 0x5.
The value 0x5 is stored at address 0x200; the updated base register r1 = 0x20c.
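The three store forms can be compared with a small model. This is a Python sketch, not ARM code. Memory is a plain dictionary keyed by address, and the mode strings "pre", "pre_writeback", and "post" are made-up labels for STR r0,[r1,#12], STR r0,[r1,#12]! and STR r0,[r1],#12 respectively.

```python
def store_word(memory, regs, rd, rn, offset, mode):
    """Model STR with an immediate offset; returns the address written."""
    base = regs[rn]
    if mode == "post":
        address = base                 # use the base first, add offset after
    else:
        address = base + offset        # add offset before the access
    memory[address] = regs[rd]
    if mode in ("pre_writeback", "post"):
        regs[rn] = base + offset       # base register is updated
    return address


if __name__ == "__main__":
    for mode in ("pre", "pre_writeback", "post"):
        memory, regs = {}, {"r0": 0x5, "r1": 0x200}
        address = store_word(memory, regs, "r0", "r1", 12, mode)
        print(mode, hex(address), hex(regs["r1"]))
```

With r0 = 0x5, r1 = 0x200 and an offset of 12, the model reproduces the addresses used in the slide's example.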
47
LDM / STM operation
Syntax:
<LDM|STM>{<cond>}<addressing_mode> Rb{!}, <register list>
4 addressing modes:
LDMIA / STMIA increment after
LDMIB / STMIB increment before
LDMDA / STMDA decrement after
LDMDB / STMDB decrement before
Example:
LDMxx r10, {r0,r1,r4}
STMxx r10, {r0,r1,r4}
[Figure: memory drawn with increasing address upward, showing where r0, r1, and r4 are placed relative to the base register (Rb) r10 for each of IA, IB, DA, and DB.]
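The addresses touched by each addressing mode can be computed as below. This is a Python sketch, not ARM code, and the function name is hypothetical. It relies on the ARM rule that the lowest-numbered register always goes to the lowest address, which the figure implies but the text does not state.

```python
def transfer_addresses(base, registers, mode):
    """Return ({register number: address}, base value after writeback)."""
    regs = sorted(registers)               # lowest register, lowest address
    n = len(regs)
    start = {
        "IA": base,                        # increment after
        "IB": base + 4,                    # increment before
        "DA": base - 4 * (n - 1),          # decrement after
        "DB": base - 4 * n,                # decrement before
    }[mode]
    addresses = {reg: start + 4 * i for i, reg in enumerate(regs)}
    new_base = base + 4 * n if mode in ("IA", "IB") else base - 4 * n
    return addresses, new_base


if __name__ == "__main__":
    for mode in ("IA", "IB", "DA", "DB"):
        addresses, new_base = transfer_addresses(0x1000, [0, 1, 4], mode)
        print(mode, {f"r{r}": hex(a) for r, a in addresses.items()}, hex(new_base))
```

The second return value is the base the instruction would write back when the `!` suffix is used.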
48
Software Interrupt (SWI)
Causes an exception trap to the SWI hardware vector
The SWI handler can examine the SWI number to decide what operation
has been requested.
By using the SWI mechanism, an operating system can implement a set
of privileged operations which applications running in user mode can
request.
Syntax:
SWI{<cond>} <SWI number>
Encoding:
bits 31-28  Cond         Condition Field
bits 27-24  1 1 1 1
bits 23-0   SWI number (ignored by processor)
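Because the processor ignores the SWI number, a handler has to extract it from the instruction word itself. The sketch below shows that step in Python rather than ARM code. The handler table and its entries are purely hypothetical and stand in for whatever services an operating system chooses to offer.

```python
def is_swi(instruction):
    """True if bits 27-24 of an ARM instruction word are 1111."""
    return ((instruction >> 24) & 0xF) == 0xF


def swi_number(instruction):
    """Extract the 24-bit SWI number from bits 23-0."""
    return instruction & 0x00FFFFFF


# Hypothetical service table: SWI number -> operation name.
HANDLERS = {0x00: "write_char", 0x11: "exit"}


def dispatch(instruction):
    """Look up the requested operation, as a SWI handler might."""
    if not is_swi(instruction):
        raise ValueError("not a SWI instruction")
    return HANDLERS.get(swi_number(instruction), "unknown")


if __name__ == "__main__":
    print(dispatch(0xEF000011))
```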
49
PSR Transfer Instructions
MRS and MSR allow contents of CPSR / SPSR to be transferred to / from
a general purpose register.
Syntax:
MRS{<cond>} Rd,<psr> ; Rd = <psr>
MSR{<cond>} <psr[_fields]>,Rm ; <psr[_fields]> = Rm
where
<psr> = CPSR or SPSR
[_fields] = any combination of fsxc
Also an immediate form
MSR{<cond>} <psr_fields>,#Immediate
In User Mode, all bits can be read but only the condition flags (_f) can be
written.
PSR fields: f = bits 31-24 (N Z C V Q and J), s = bits 23-16, x = bits 15-8, c = bits 7-0 (I F T mode)
50
ARM Branches and Subroutines
B <label>
PC relative. ±32 Mbyte range.
BL <subroutine>
Stores return address in LR
Returning implemented by restoring the PC from LR
For non-leaf functions, LR will have to be stacked

                    func1                     func2
    :               STMFD sp!,{regs,lr}       :
    :               :                         :
    BL func1        BL func2                  :
    :               :                         :
    :               LDMFD sp!,{regs,pc}       MOV pc, lr
51
Thumb
Thumb is a 16-bit instruction set
Optimised for code density from C code (~65% of ARM code size)
Improved performance from narrow memory
Subset of the functionality of the ARM instruction set
Core has additional execution state - Thumb
Switch between ARM and Thumb using BX instruction
32-bit ARM Instruction (bits 31 to 0): ADDS r2,r2,#1
16-bit Thumb Instruction (bits 15 to 0): ADD r2,#1
For most instructions generated by compiler:
Conditional execution is not used
Source and destination registers identical
Only Low registers used
Constants are of limited size
Inline barrel shifter not used
53
Example ARM-based System
[Figure: an ARM core connected to 16 bit RAM, 32 bit RAM, 8 bit ROM, I/O, and peripherals, with an interrupt controller driving the nFIQ and nIRQ inputs.]
54
ARM Based microcontroller
55
A Basic ARM MEMORY SYSTEM.

56
AMBA
AMBA
Advanced Microcontroller Bus Architecture
ADK
Complete AMBA Design Kit
ACT
AMBA Compliance Testbench
PrimeCell
ARM's AMBA compliant peripherals
[Figure: an AMBA system with a system bus (AHB or ASB) and a peripheral bus (APB) joined by a bridge. Blocks shown: ARM, on-chip RAM, TIC, arbiter, decoder, bus interface, external bus interface with external ROM and external RAM, timer, interrupt controller, remap/pause, and reset.]
57
System Design-Hardware
An embedded system includes the following hardware
components:
ARM processors are found embedded in chips.
Programmers access peripherals through memory-mapped
registers.
There is a special type of peripheral called a controller,
which embedded systems use to configure higher-level
functions such as memory and interrupts.
The AMBA on-chip bus is used to connect the processor
and peripherals together.
58
System Design-Software
An embedded system also includes the following software
components:

Initialization code configures the hardware to a known state.
Once configured, operating systems can be loaded and executed.
Operating systems provide a common programming environment for
the use of hardware resources and infrastructure.
Device drivers provide a standard interface to peripherals.
An application Program performs the task-specific duties of an
embedded system.

60
The RealView Product Families
Debug Tools
AXD (part of ADS)
Trace Debug Tools
Multi-ICE
Multi-Trace
Platforms
ARMulator (part of ADS)
Integrator Family
Compilation Tools
ARM Developer Suite (ADS)
Compilers (C/C++ ARM & Thumb),
Linker & Utilities

RealView Compilation Tools (RVCT)

RealView Debugger (RVD)
RealView ICE (RVI)
RealView Trace (RVT)
RealView ARMulator ISS (RVISS)

61
ARM Debug Architecture
[Figure: blocks labelled ARM core, EmbeddedICE Logic, ETM, TAP controller, trace port, JTAG port, Ethernet, and debugger (+ optional trace tools).]
EmbeddedICE Logic
Provides breakpoints and processor/system
access
JTAG interface (ICE)
Converts debugger commands to JTAG
signals
Embedded trace Macrocell (ETM)
Compresses real-time instruction and data
access trace
Contains ICE features (trigger & filter logic)
Trace port analyzer (TPA)
Captures trace in a deep buffer
63
Thumb instruction set
The Thumb instruction set encodes a subset of the 32-bit ARM
instructions into a 16-bit instruction set space, so it has
higher code density: about 30% less memory.

Thumb has higher performance than ARM on a
processor with a 16-bit data bus, but
lower performance than ARM on a 32-bit data bus.
Use Thumb for memory-constrained systems.
64
Thumb Instruction set

65
Thumb Instruction set
66
Thumb register usage
67
Code Density

68
Thumb instruction decoding
69
Thumb Instructions limitations
Only the branch relative instruction can be conditionally executed.
The limited space available in 16 bits causes the barrel shift
operations ASR, LSL, LSR, and ROR to be separate instructions
in the Thumb ISA.
There is no direct access to the CPSR or SPSR, so there are no
MSR- and MRS-equivalent Thumb instructions.
To alter the CPSR or SPSR, you must switch into ARM state to use
MSR and MRS.
Similarly, there are no coprocessor instructions in Thumb state.
You need to be in ARM state to access the coprocessor for
configuring cache and memory management.
70
ARM-Thumb interworking
Interworking is the method of linking ARM and Thumb code together for both
assembly and C/C++.

It handles the transition between the two states. Extra code, called a
veneer, is sometimes needed to carry out the transition.

ATPCS defines the ARM and Thumb procedure call standards.
To call a Thumb routine from an ARM routine, the core has to change
the state of the T bit of the CPSR.

The BX and BLX branch instructions cause a switch between ARM and
Thumb state while branching to a routine.
71
MIX Mode Programming
branch instructions
There are two versions of the BX or BLX instructions: an ARM
instruction and a Thumb equivalent.
The ARM BX instruction enters Thumb state only if bit 0 of the address
in Rm is set to binary 1; otherwise it enters ARM state.
The Thumb BX instruction does the same.
Syntax: BX Rm
BLX Rm | label
Unlike the ARM version, the Thumb BX instruction cannot be
conditionally executed.
The conditional branch instruction is the only conditionally executed
instruction in Thumb state.
B branch
BL branch with link: lr = (instruction address after the BL) + 1
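The role of bit 0 in BX and in the Thumb link register can be shown with a short model. This is a Python sketch, not ARM code, and the function names are hypothetical. It assumes halfword alignment of the resulting PC by clearing bit 0, and does not model any further alignment that ARM state requires.

```python
def bx(target):
    """Model BX Rm: return (new pc, new state) for the value held in Rm."""
    state = "Thumb" if target & 1 else "ARM"   # bit 0 selects the state
    pc = target & 0xFFFFFFFE                   # bit 0 is not part of the address
    return pc, state


def thumb_return_lr(address_after_bl):
    """lr written by a Thumb BL: the return address with bit 0 set."""
    return address_after_bl | 1


if __name__ == "__main__":
    lr = thumb_return_lr(0x8004)
    print(hex(lr), bx(lr))
```

Setting bit 0 in lr means that a later BX lr returns to the caller in Thumb state.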
72
The Thumb data processing instructions are a subset of the ARM
data processing instructions.
Most Thumb data processing instructions operate on low registers
and update the cpsr. The exceptions are
MOV Rd,Rn ADD Rd,Rm CMP Rn,Rm
ADD sp, #immediate SUB sp, #immediate
ADD Rd,sp,#immediate ADD Rd,pc,#immediate
which can operate on the higher registers r8-r14 and the pc.
These instructions, except for CMP, do not update the condition flags
in the cpsr when using the higher registers.
The CMP instruction, however, always updates the cpsr.
73
Single Register Load Store Instructions

The Thumb instruction set supports loading and storing registers, using LDR and
STR.
These instructions use two pre-indexed addressing modes: offset by register
and offset by immediate.
Load/store register [Rn, Rm]
Base register + offset [Rn, #immediate]
Relative [pc|sp, #immediate]
The offset by register uses a base register Rn + the register offset Rm.
The second uses the same base register Rn + a 5-bit immediate or a value
dependent on the data size.
The 5-bit offset encoded in the instruction is multiplied by one for byte
accesses, two for 16-bit accesses, and four for 32-bit accesses.
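The scaling of the 5-bit immediate can be written as a one-line rule. The following is a Python sketch, not ARM code. The size names "byte", "halfword", and "word" and the function name are chosen for illustration.

```python
SCALE = {"byte": 1, "halfword": 2, "word": 4}


def thumb_offset(imm5, size):
    """Byte offset encoded by a 5-bit immediate for the given access size."""
    if not 0 <= imm5 <= 31:
        raise ValueError("immediate must fit in 5 bits")
    return imm5 * SCALE[size]


if __name__ == "__main__":
    for size in SCALE:
        print(size, thumb_offset(31, size))
```

The largest reachable offsets are therefore 31 bytes for byte accesses, 62 for halfword accesses, and 124 for word accesses.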

74
Multiple Register Load-Store I
The Thumb versions of the load-store multiple instructions are reduced
forms of the ARM load-store multiple instructions.
They only support the increment after (IA) addressing mode.
Syntax : <LDM|STM>IA Rn!, {low Register list}
LDMIA load multiple registers
{Rd}*N <- mem32[Rn + 4*N], Rn = Rn + 4*N
STMIA save multiple registers
{Rd}*N -> mem32[Rn + 4*N], Rn = Rn + 4*N
Here N is the number of registers in the list of registers.
These instructions always update the base register Rn after execution.
The base register and list of registers are limited to the low registers r0
to r7.
75
Stack instructions
The Thumb stack operations are different from the equivalent ARM
instructions because they use the more traditional POP and PUSH
concept.

Syntax: POP {low_register_list{, pc}}

PUSH {low_register_list{, lr}}

POP pop registers from the stack: Rd*N <- mem32[sp + 4*N], sp = sp + 4*N

PUSH push registers on to the stack: Rd*N -> mem32[sp + 4*N], sp = sp - 4*N
76
There is no stack pointer in the instruction because the stack pointer
is fixed as register r13 in Thumb operations, and sp is
automatically updated.
The list of registers is limited to the low registers r0 to r7.

The PUSH register list can also include the link register lr.
Similarly, the POP register list can include the pc.
This provides support for subroutine entry and exit.
The stack instructions only support full descending stack
operations.
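A full descending stack can be modelled in a few lines. This is a Python sketch, not ARM code. Memory is a dictionary keyed by address, values are given in ascending register order, and the function names are hypothetical. "Full" means sp points at the last item pushed, and "descending" means the stack grows toward lower addresses.

```python
def push(memory, sp, values):
    """PUSH: store values (ascending register order) and return the new sp."""
    sp -= 4 * len(values)                  # stack grows downward
    for i, value in enumerate(values):
        memory[sp + 4 * i] = value         # lowest register at lowest address
    return sp


def pop(memory, sp, count):
    """POP: load `count` words and return (values, new sp)."""
    values = [memory[sp + 4 * i] for i in range(count)]
    return values, sp + 4 * count


if __name__ == "__main__":
    memory = {}
    sp = push(memory, 0x1000, [1, 2, 3])   # e.g. PUSH {r0, r1, r2}
    print(hex(sp), pop(memory, sp, 3))
```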

77
software interrupt.
Similar to the ARM equivalent, the Thumb software interrupt (SWI)
instruction causes a software interrupt exception.
If any interrupt or exception flag is raised in Thumb state, the
processor automatically reverts back to ARM state to handle the
exception.
Syntax: SWI immediate
The Thumb SWI instruction has the same effect and nearly the same
syntax as the ARM equivalent.
It differs in that the SWI number is limited to the range 0 to 255 and it
is not conditionally executed.
78
ARM7 Family
One significant variation in the ARM7 family is the ARM7TDMI-S, which is
synthesizable.
The ARM720T includes an MMU, making it capable of handling the Linux and
Microsoft embedded platform operating systems.
The processor also includes a unified 8K cache. The vector table can
be relocated to a higher address by setting a coprocessor 15 register.
Another variation is the ARM7EJ-S processor, also synthesizable. It
provides both Java acceleration and the enhanced instructions but
without any memory protection.
The ARM7EJ-S is quite different since it includes a five-stage pipeline
and executes ARMv5TEJ instructions.
79
ARM PROCESSOR VARIANTS
80
ARM7,ARM9,ARM11
The performance figures of ARM7, ARM9, ARM10 and ARM11 cores depend directly on
the type and geometry of the manufacturing process, which has a direct effect
on the frequency (MHz) and power consumption (watts).
An ARM processor is an implementation of a specific instruction set
architecture (ISA).
The ISA has been continuously improved from the first ARM processor design.
Processors are grouped into implementation families (ARM7, ARM9, ARM10,
and ARM11) with similar characteristics.
81
ARM9 Family
The ARM9 family was announced in 1997. With its five-stage
pipeline, the ARM9 can run at higher clock frequencies than the ARM7.
The extra stages improve the overall performance of the
processor.
The memory system has been redesigned to follow the Harvard
architecture, which separates the data (D) and instruction (I)
buses.
The first processor in the ARM9 family was the ARM920T, which
includes a separate D + I cache and an MMU for virtual memory
support.
The ARM922T is a variation on the ARM920T but with half the D + I
cache size.
82
ARM9 Family
The ARM940T includes a smaller D + I cache and an MPU designed
for applications that do not require a platform operating system.
Both ARM920T and ARM940T execute the architecture v4T
instructions.
The next processors are based on the ARM9E-S core, a synthesizable
version of the ARM9 core with the E extensions.
There are two variations: the ARM946E-S and the ARM966E-S. Both
execute architecture v5TE instructions.
They also support the optional embedded trace macro-cell (ETM),
which allows a developer to trace instruction and data execution in real
time on the processor.
This is important when debugging applications with time-critical
segments.
83
ARM9 Family
The ARM946E-S includes TCM, cache, and an MPU. The sizes of the TCM
and caches are configurable.
This processor is designed for use in embedded control applications that
require deterministic real-time response.
In contrast, the ARM966E does not have the MPU and cache extensions but
does have configurable TCMs.
The latest core in the ARM9 product line is the ARM926EJ-S synthesizable
processor core, announced in 2000. It was the first ARM processor core to include
the Jazelle technology, which accelerates Java byte-code execution.
It is designed for use in small portable Java-enabled devices such as 3G
phones and personal digital assistants (PDAs).
It features an MMU, configurable TCMs, and D + I caches with zero or
nonzero wait state memories.
84
ARM10 Family
The ARM10, announced in 1999, was designed for performance.
It extends the ARM9 pipeline to six stages.
It also supports an optional vector floating-point (VFP) unit, which
adds a seventh stage to the ARM10 pipeline.
The VFP significantly increases floating-point performance and is
compliant with the IEEE 754-1985 floating-point standard.
85
ARM10 Family

The ARM1020E is the first processor to use an ARM10E core.
Like the ARM9E, it includes the enhanced E instructions. It has separate
32K D + I caches, an optional vector floating-point unit, and an MMU.

The ARM1020E also has a dual 64-bit bus interface for
increased performance.
ARM1026EJ-S is very similar to the ARM926EJ-S but with both
MPU and MMU.
This processor has the performance of the ARM10 with the
flexibility of an ARM926EJ-S.

86
ARM11 Family
The ARM1136J-S, announced in 2003, was designed for high
performance and power-efficient applications.
It was the first processor implementation to execute architecture
ARMv6 instructions.
It incorporates an eight-stage pipeline with separate load-store
and arithmetic pipelines.
Included in the ARMv6 instructions are single instruction multiple
data (SIMD) extensions for media processing, specifically
designed to increase video processing performance.
The ARM1136JF-S is an ARM1136J-S with the addition of the
vector floating-point unit for fast floating-point operations.
87
Enhanced DSP features
Processing digitized signals requires high memory bandwidths and fast
multiply accumulate operations.
A single-core design can reduce cost and power consumption over a
two-core solution.
DSP applications are typically multiply and load-store intensive.
A basic operation is a multiply accumulate: multiplying two 16-bit
signed numbers and accumulating the product onto a 32-bit signed
accumulator.
The ARMv5TE extensions available in the ARM9E and later cores
provide efficient multiply accumulate operations.
With careful coding, the ARM9E processor will perform decently on
the DSP parts of an application while outperforming a DSP on the
control parts of the application.
88
Generations suitable for DSP
applications.
89
DSP Algorithms Characteristics
Because of their high data bandwidth and performance requirements,
DSP algorithms often have to be coded in hand-written assembly.

We need fine control of register allocation and instruction
scheduling to achieve the best performance.

Filtering is probably the most commonly used signal processing
operation. It can be used to remove noise, to analyze signals, or in
signal compression.

Another very common algorithm is the Discrete Fourier Transform
(DFT), which converts a signal from a time representation to a
frequency representation or vice versa.

90
How to represent a signal on the
ARM
Use a floating-point representation for prototyping algorithms. Do not
use floating point in applications where speed is critical. Most ARM
implementations do not include hardware floating-point support.
Use a fixed-point representation for DSP applications where speed
is critical with moderate dynamic range. The ARM cores provide good
support for 8-, 16- and 32-bit fixed-point DSP.
For applications requiring speed and high dynamic range, use a block-
floating or logarithmic representation.
The key idea is to use block algorithms that calculate several results
at once, and thus require less memory bandwidth, increase
performance and decrease power consumption compared with
calculating single results.
91
Figure shows a sine wave signal digitized at the sampling points 0, 1, 2,
3, and so on.
92
Dynamic Range & Accuracy
There are two things to worry about when choosing a representation of x[t]:
1. The dynamic range of the signal: the maximum fluctuation in the
signal, defined by Equation (A).
For a signed signal we are interested in the maximum absolute value M
possible. For this example, let's take M = 1 volt.
M = max |x[t]| over all t = 0, 1, 2, 3, ... (A)

2. The accuracy required in the representation, sometimes given as
a proportion of the maximum range.
For example, an accuracy of 100 parts per million means that each x[t]
needs to be represented within an error of
E = M × 0.0001 = 0.0001 volts
93
Suitable Representation
We could use a floating-point representation for x[t ].
1) This would certainly meet our dynamic range and accuracy
constraints, and
2) it would also be easy to manipulate using the C type float.
However, most ARM cores do not support floating point in
hardware, and so a floating-point representation would be very slow.

A better choice for fast code is a fixed-point representation.
A fixed-point representation uses an integer to represent a fractional
value by scaling the fraction.
94
Error Vs Accuracy
A common error is to think that floating point is more accurate than fixed point. This
is false!
For the same number of bits, a fixed-point representation gives greater accuracy.
The floating-point representation gives higher dynamic range at the expense of
lower absolute accuracy.
For example, if you use a 32-bit integer to hold a fixed-point value scaled to full
range, then the maximum error in a representation is 2^-32. However,
single-precision 32-bit floating-point values give a relative error of 2^-24.
The single-precision floating-point mantissa is 24 bits. The leading 1 of the mantissa
is not stored, so 23 bits of storage are actually used. For values near the maximum,
the fixed-point representation is 2^(32-24) = 256 times more accurate!
The 8-bit floating-point exponent is of little use when you are interested in maximum
error rather than relative accuracy.
95
Better representation
To summarize, a fixed-point representation is best when there is a
clear bound to the strength of the signal and when maximum error is
important.

When there is no clear bound and you require a large dynamic range,
then floating point is better.

You can also use the other representations, which give more dynamic
range than fixed point while still being more efficient to implement than
floating point.

96
General rules on writing DSP
algorithms for the ARM.
The standard ARM arithmetic instructions do not saturate automatically (the
ARMv5TE E extensions add explicit saturating instructions such as QADD and QSUB).
Design the DSP algorithm so that saturation is not required because saturation
will cost extra cycles.
ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit
operations very well. Use extended-precision arithmetic or additional scaling
rather than saturation.
The ARM core is not a dedicated DSP. There is no single instruction that
issues a multiply accumulate and data fetch in parallel. However, by reusing
loaded data you can achieve a respectable DSP performance.
Design the DSP algorithm to minimize loads and stores. Once you load a data
item, then perform as many operations that use the datum as possible. You
can often do this by calculating several output results at once.
Another way of increasing reuse is to concatenate several operations. For
example, you could perform a dot product and signal scale at the same time,
while only loading the data once.
97
Guidelines for writing DSP code.
From ARM9 onwards, ARM implementations use a multistage execute pipeline
for loads and multiplies, which introduces potential processor interlocks.
If you load a value and then use it in either of the following two instructions, the
processor may stall for a number of cycles waiting for the loaded value to
arrive.
Similarly if you use the result of a multiply in the following instruction, this may
cause stall cycles. It is particularly important to schedule code to avoid these
stalls.
Write ARM assembly to avoid processor interlocks. The results of load and
multiply instructions are often not available to the next instruction without
adding stall cycles.
Sometimes the results will not be available for several cycles.
There are 14 registers available for general use on the ARM, r0 to r12 and r14.
Design the DSP algorithm so that the inner loop will require 14 registers or
fewer.
98
An example- a DOT Product
A dot-product is one of the simplest DSP operations and highlights the
difference among different ARM implementations.
A dot-product combines N samples from two signals x(t) and c(t) to
produce a correlation value a: a = sum of c[i] × x[i] for i = 0 to N-1
The C interface to the dot-product function is
int dot_product(sample *x, coefficient *c, unsigned int N);
where
sample is the type to hold a 16-bit audio sample, usually a short
coefficient is the type to hold a 16-bit coefficient, usually a short
x[i] and c[i] are two arrays of length N (the data and coefficients)
the function returns the accumulated 32-bit integer dot product a

99
DSP rating of ARM7TDMI
This example shows a 16-bit dot-product optimized for the ARM7TDMI.
Each MLA takes a worst case of four cycles.
We store the 16-bit input samples in 32-bit words so that we can use the
LDM instruction to load them efficiently.
This code assumes that the number of samples N is a multiple of five.
Therefore we can use a five-word load multiple to increase data bandwidth.
The cost per load is 7/5 = 1.4 cycles compared to 3 cycles per load if we had
used LDR or LDRSH.
The inner loop requires a worst case of 7 + 7 + 5 × 4 + 1 + 3 = 38 cycles to
process each block of 5 products from the sum.
This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap for a
16-bit dot-product.
100
Assembly code for DOT Product
x RN 0 ; input array x[]
c RN 1 ; input array c[]
N RN 2 ; number of samples (a multiple of 5)
acc RN 3 ; accumulator
x_0 RN 4 ; elements from array x[]
x_1 RN 5
x_2 RN 6
x_3 RN 7
x_4 RN 8
c_0 RN 9 ; elements from array c[]
c_1 RN 10
c_2 RN 11
c_3 RN 12
c_4 RN 14
; int dot_16by16_arm7m(int *x, int *c, unsigned N)
dot_16by16_arm7m
STMFD sp!, {r4-r11, lr}
MOV acc, #0
loop_7m ; accumulate 5 products
LDMIA x!, {x_0, x_1, x_2, x_3, x_4}
LDMIA c!, {c_0, c_1, c_2, c_3, c_4}
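MLA acc, x_0, c_0, acc ; acc += x_0 * c_0
MLA acc, x_1, c_1, acc ; one MLA for each of the five loaded pairs
MLA acc, x_2, c_2, acc
MLA acc, x_3, c_3, acc
MLA acc, x_4, c_4, acc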
SUBS N, N, #5
BGT loop_7m
MOV r0, acc
LDMFD sp!, {r4-r11, pc}
101
DSP Rating: 16-bit DOT Product

ARM7TDMI: The inner loop requires 38 cycles to process 5 taps, giving a
rating of 38/5 = 7.6 cycles per tap.
ARM9TDMI: The inner loop requires 28 cycles to process 4 taps, giving
28/4 = 7 cycles per tap.
StrongARM: The inner loop uses 19 cycles to process 4 taps, giving
a rating of 19/4 = 4.75 cycles per tap.
ARM9E: The inner loop requires 20 cycles to accumulate 8 products, a
rating of 20/8 = 2.5 cycles per tap.
ARM10E: The inner loop requires 25 cycles to process 10 samples, or
2.5 cycles per tap.
Intel XScale: The inner loop requires 14 cycles to accumulate 8
products, a rating of 1.75 cycles per tap.

102
Performance improvement in DSP
The block filter algorithm gives a much better performance per tap if you
are calculating multiple products.
103
THANK YOU
