
UNIT II

HIGH PERFORMANCE RISC ARCHITECTURE – ARM
OVERVIEW:
Acorn RISC Machine
Architectural Inheritance, Core & Architectures
Registers, Pipeline, Interrupts
ARM organization
ARM processor family, Co-processors
ARM instruction set, Thumb Instruction set, Instruction cycle timings
The ARM Programmer’s model
ARM Development tools
ARM Assembly Language Programming, C programming
Optimizing ARM Assembly Code – Optimized Primitives.
REFERENCES:
1. Steve Furber, "ARM System-on-Chip Architecture", Addison-Wesley, 2000.
2. Valvano, "Embedded Microcomputer Systems", Thomson Asia Pvt Ltd, first reprint 2001.
INTRODUCTION
ARM Ltd.
• Founded in November 1990 as Advanced RISC Machines
• Company headquarters in Cambridge, UK
• Best known for its range of RISC processor core designs
• Other products: fabric IP, software models and development tools, graphics cores and peripherals to help partners develop and ship ARM-based SoCs
• ARM does not manufacture silicon
• More information about ARM: http://www.arm.com/aboutarm


History of ARM
• Acorn Computers: a British computer company founded in Cambridge, England, in
1978, by Hermann Hauser and Chris Curry. The company produced a number of
computers which were especially popular in the UK.
• These included the Acorn Electron, the BBC Micro and the Acorn Archimedes. Acorn's
BBC Micro computer dominated the UK educational computer market during the 1980s
and early 1990s.
• VLSI Technology, Inc. produced the first ARM processor based on Acorn designs.
• ARM-based PCs did not sell well. Acorn was acquired by Olivetti in 1985.
• ARM was contracted to develop a processor for the Apple Newton handheld, built by VLSI.
• The company was broken up into several independent operations in 2000, one of which, notably, was ARM Holdings.
• ARM Holdings' primary business model is to license its RISC-based designs to other manufacturers.
Where ARM Products Play?
• ARM supplies system-level IP for the chip, along with physical IP to make sure it is manufacturable
• Good range of software development tools
• ARM does not necessarily feature in:
  • Other inputs and activities needed to produce the finished product, such as industrial design, packaging, case work, operating systems, peripheral IP and so on
  • These are done by ARM partners
ARM Partnership Model
ARM Powered Products
Huge Opportunity For ARM Technology
[Chart: ARM-based cores shipped, time axis 1998 / 2015 / 2020; labels: "30+ billion cores to date" and "100+ billion"]


The ARM Range of Processor Cores
Features

• Architectural simplicity, which allows very small implementations, which result in very low power consumption
• Typical RISC architecture:
• Large uniform register file
• Load/store architecture
• Simple addressing modes
• Uniform and fixed-length instruction fields
Results:
• High performance
• Low code size
• Low power consumption
• Low silicon area
ARM Microprocessor
                        ARM7 family       ARM9 family     ARM10 family    StrongARM (SA1100)
CPU speed (internal)    59 ~ 66 MHz       120 ~ 200 MHz   300 MHz         133 ~ 220 MHz
Bus interface           32 or 16 bits     32 or 16 bits   32 or 16 bits   32 or 16 bits
Typ./max. power         212 mW / 106 mW   -               -               < 200 mW
dissipation             @ 59 MHz                                          @ 133 MHz
Performance             53 MIPS           220 MIPS        400 MIPS        150 MIPS
                        @ 59 MHz          @ 200 MHz       @ 300 MHz       @ 133 MHz
Processor Core Comparison
Which Architecture is my processor?
ARM Architecture v7 profiles
ARM Powered Applications

• Automotive
- ComRoad, empeg, Raytheon, Marine, SENA
• Consumer Multimedia
- Sega, Sharp, Sony, Toshiba, Pace
• Embedded Control
- Conexant, Gemplus, IBM, Olivetti
• Handheld Computing
- Apple, Ericsson, Hewlett Packard, Psion

• Internet Appliances
- Daewoo, Oracle, RCA, Samsung

• Networking
- 3Com, Ericsson, Virata, VLSI

• Portable Telephony
- Ericsson, Hitachi, Nokia, Philips, Qualcomm
The ARM Architecture Inheritance
• The ARM chip was designed based on Berkeley RISC I and II and the Stanford MIPS
(Microprocessor without Interlocking Pipeline Stages)
• Features used from the Berkeley RISC design:
  - a load-store architecture
  - fixed-length 32-bit instructions
  - 3-address instruction formats
• Features rejected:
  - register windows
  - delayed branches
  - single-cycle execution of all instructions
• Based upon RISC Architecture with enhancements to meet requirements of
embedded applications

• A Large uniform register file


• Load-store architecture
• Uniform and fixed length instructions
• 32-bit processor
• Instructions are 32-bit long
• Good speed/power consumption ratio
• High Code Density
Register windows.
• The register banks on the Berkeley RISC processors incorporated a
large number of registers, 32 of which were visible at any time.
• Procedure entry and exit instructions moved the visible 'window' to
give each procedure access to new registers, thereby reducing the
data traffic between the processor and memory resulting from
register saving and restoring
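• The principal problem with register windows is the large chip area occupied by the large number of registers. The feature was therefore rejected on cost grounds, although the shadow registers used to handle exceptions on the ARM are not too different in concept.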
Delayed branches.
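• Branches cause pipeline problems since they interrupt the smooth flow of instructions. Many RISC processors reduce the problem with delayed branches, where the branch takes effect only after the following instruction has executed.
• On the original ARM, delayed branches were not used because they made exception handling more complex.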
Single-cycle execution of all instructions.
• Although the ARM executes most data processing instructions in a
single clock cycle, many other instructions take multiple clock cycles.
The rationale here was based on the observation that with a single
memory for both data and instructions, even a simple load or store
instruction requires at least two memory accesses (one for the
instruction and one for the data).
• Therefore single cycle operation of all instructions is only possible
with separate data and instruction memories
• Instead of single-cycle execution of all instructions, the ARM was
designed to use the minimum number of cycles required for memory
accesses.
ARM CORE AND PROCESSORS
ARM PROCESSOR CORES
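seq indicates that the memory address will be sequential to the one used in the previous cycle.
mreq indicates a processor cycle which requires a memory access.

Consider the 32-bit value 0x90AB12CD. Its 4 bytes are 90, AB, 12, CD, where each byte requires 2 hex digits.
There are two ways to store this value in memory.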
Big Endian
In big endian, you store the most significant byte in the smallest address. Here's how it would look:
Address Value
1000 90
1001 AB
1002 12
1003 CD

Little Endian
In little endian, you store the least significant byte in the smallest address. Here's how it would look:
Address Value
1000 CD
1001 12
1002 AB
1003 90
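The two byte orders above can be reproduced with a short Python sketch. It uses the standard `struct` module; the `dump` helper and the base address 1000 are illustrative only and mirror the tables above.

```python
import struct

value = 0x90AB12CD

# ">I" packs a 32-bit unsigned integer most significant byte first,
# "<I" packs it least significant byte first.
big = struct.pack(">I", value)
little = struct.pack("<I", value)

def dump(label, data, base=1000):
    """Print one 'address value' line per byte (hypothetical helper)."""
    for offset, byte in enumerate(data):
        print(f"{label:6} {base + offset} {byte:02X}")

dump("big", big)
dump("little", little)
```

The same 4 bytes are stored in both cases; only the order of addresses they occupy differs.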
The performance of a processor core can be improved by:
• Increasing the clock rate. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased.
• Reducing the CPI (clock cycles per instruction). This requires either that instructions which occupy more than one pipeline slot in an ARM7 are re-implemented to occupy fewer slots, or that pipeline stalls caused by dependencies between instructions are reduced, or a combination of both.
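Both options act on the usual relation: execution time = instruction count × CPI / clock rate. The sketch below only illustrates that arithmetic; the instruction count, CPI values and clock rates are made-up numbers, not measurements of any ARM core.

```python
def execution_time_s(instructions, cpi, clock_hz):
    """Execution time in seconds for a given instruction count, CPI and clock rate."""
    return instructions * cpi / clock_hz

# Hypothetical baseline: 1 million instructions, CPI of 2.0, 50 MHz clock.
baseline = execution_time_s(1_000_000, 2.0, 50e6)

# Option 1: raise the clock rate (deeper pipeline), CPI unchanged.
faster_clock = execution_time_s(1_000_000, 2.0, 100e6)

# Option 2: reduce the CPI, clock rate unchanged.
lower_cpi = execution_time_s(1_000_000, 1.5, 50e6)

print(baseline, faster_clock, lower_cpi)
```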
Reducing the CPI:
• The fundamental problem with reducing the CPI relative to an ARM7 core is related to the von Neumann bottleneck: any stored-program computer with a single instruction and data memory will have its performance limited by the available memory bandwidth.
• To get a significantly better CPI than ARM7, the memory system must deliver more than one value in each clock cycle, either by delivering more than 32 bits per cycle from a single memory or by having separate memories for instruction and data accesses.
Double-bandwidth memory:
• The ARM8 retains a unified memory and achieves double bandwidth from that single memory.
• It assumes that the memory it is connected to can deliver one word in a clock cycle and deliver the next sequential word half a cycle later.
• A 64-bit wide memory has the required characteristics, but delaying the arrival of the second word by half a clock cycle allows a 32-bit bus to be used and can save area, since routing a 32-bit bus requires less area than routing a 64-bit bus.
ARM8 applications:
• The ARM8 was designed as a general-purpose processor core that can readily be manufactured by ARM Limited's many licensees, so it is not highly optimized for a particular process technology.
• It offers significantly (two to three times) higher performance than the simpler ARM7 cores for a similar increase in silicon area, and requires the support of double-bandwidth on-chip memory if it is to realize its full potential.
• One application of the ARM8 core is to build a high-performance CPU such as the ARM810.
ARM9TDMI:
• The ARM9TDMI core takes the functionality of the ARM7TDMI up to a significantly higher performance level.
• Like the ARM7TDMI (and unlike the ARM8) it includes support for the Thumb instruction set and an EmbeddedICE module for on-chip debug support.
• The performance improvement is achieved by adopting a 5-stage pipeline to increase the maximum clock rate and by using separate instruction and data memory ports to allow an improved CPI (clocks per instruction, the average number of clock cycles the processor needs to complete an instruction).
Thumb decoding

The ARM7TDMI implements the Thumb instruction set by 'decompressing' Thumb instructions into ARM instructions using slack time in the ARM7 pipeline.
The ARM9TDMI pipeline is much tighter and does not have sufficient slack time to allow Thumb instructions to be first translated into ARM instructions and then decoded; instead it has hardware to decode both ARM and Thumb instructions directly.
Branch Prediction:
• Instructions are fetched at a rate of two per clock cycle.
• The branch prediction unit predicts the outcome of a branch before the branch is executed.
• A load or store instruction that cannot complete immediately does not stall pipeline execution.
Processor Core Vs CPU Core
• Processor Core
  – The engine that fetches instructions and executes them
  – E.g.: ARM7TDMI, ARM9TDMI, ARM9E-S
• CPU Core
  – Consists of the ARM processor core and some tightly coupled function blocks
  – Cache and memory management blocks
  – E.g.: ARM710T, ARM720T, ARM740T, ARM920T, ARM922T, ARM940T, ARM946E-S, and ARM966E-S
ARM CPU CORES
• Although some ARM applications use a simple integer processor core as
the basic processing component, others require tightly coupled functions
such as cache memory and memory management hardware. ARM Limited
offers a range of such 'CPU' configurations based around its integer cores.

• The ARM CPU cores described here include the ARM710T, 720T and 740T,
the ARM810 (now superseded by the ARM9 series), the StrongARM, the
ARM920T and 940T, and the ARM1020E.
The ARM710T, ARM720T and ARM740T
• The ARM710T, ARM720T and ARM740T are based upon the ARM7TDMI
processor core, to which an 8 Kbyte mixed instruction cache and data
cache has been added.
• External memory and peripherals are accessed via an AMBA (Advanced Microcontroller Bus Architecture) bus master unit. A write buffer and a memory management unit (ARM710T and 720T) or memory protection unit (ARM740T) are also incorporated.
ARM710T cache
• Since the ARM7TDMI processor core has a single memory port it is
logical for it to be paired with a unified instruction and data cache.
The ARM710T incorporates such a cache, with a capacity of 8 Kbytes.

ARM710T and ARM720T organization


The organization of the ARM710T cache:
• The cache is organized with 16-byte lines and is 4-way set associative. A random
replacement algorithm selects which of the four possible locations will be
overwritten by new data on a cache miss.
• Bits [10:4] of the virtual address are used to index into each of the four tag stores.
• The tags contain bits [31:11] of the virtual addresses of the corresponding data,
so these tags are compared with bits [31:11] of the current virtual address. If one
of the tags matches, the cache has hit and the corresponding line can be accessed
from the data RAM using the same index (bits [10:4] of the virtual address)
together with two bits which encode the number of the tag store which produced
the matching tag.
• Virtual address bits [3:2] select the word from the line and, if a byte or half-word
access is requested, bits [1:0] select the byte or half-word from the word.
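The address split described above can be checked with a small sketch. The bit ranges and cache parameters come from the text; the function name `split_address` and the sample address are illustrative.

```python
LINE_BYTES = 16   # 16-byte lines
WAYS = 4          # 4-way set associative
SETS = 128        # 7 index bits, [10:4]

def split_address(va):
    """Split a 32-bit virtual address into the ARM710T cache fields."""
    return {
        "tag": (va >> 11) & 0x1FFFFF,  # bits [31:11], compared with the stored tags
        "index": (va >> 4) & 0x7F,     # bits [10:4], selects the set in each tag store
        "word": (va >> 2) & 0x3,       # bits [3:2], selects the word within the line
        "byte": va & 0x3,              # bits [1:0], selects the byte within the word
    }

print(split_address(0x00001234))
print(SETS * WAYS * LINE_BYTES)  # total capacity in bytes
```

The product of sets, ways and line size gives 8192 bytes, which agrees with the 8 Kbyte capacity stated for the cache.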
The ARM710T cache organization
Cache speed
• High-associativity caches give the best hit rate, but require sequential CAM then RAM accesses, which limits how fast the cycle time can become. Caches with a lower associativity can perform parallel tag and data accesses to give faster cycle times, although a direct-mapped cache has a significantly lower hit rate than a fully associative one.
• However, a fully associative CAM-RAM cache is much simpler than a 4-way associative RAM-RAM cache.
Cache power
• CAM is somewhat power-hungry, requiring a parallel comparison with
every entry on each cycle. Segmenting the cache by reducing the
associativity a little and activating only a subsection of the CAM reduces
the power cost significantly for a small increase in complexity.
• In a static RAM the main users of power are the analogue sense-amplifiers.
A 4-way cache must activate four times as many sense-amplifiers in the tag
store as a direct-mapped cache.
• Wasted power can be minimized by using self-timed power-down circuits to turn off the sense-amplifiers as soon as the data is valid, but the power used in the sense-amplifiers is still significant.
Sequential accesses
• Where the processor is accessing memory locations which fall within
the same cache line it should be possible to bypass the tag look-up for
all but the first access. The ARM generates a signal which indicates
when the next memory access will be sequential to the current one.
• Where an access will be in the same line, bypassing the tag look-up
increases the access speed and saves power. Potentially, sequential
accesses could use slower sense-amplifiers and save considerable
power
Power optimization
• The cache designer must remember that the goal is to minimize the overall system power, not just the cache power. Off-chip accesses cost a lot more energy than on-chip accesses, so the first priority must be to find a cache organization which gives a good hit rate. Deciding between a highly associative CAM-RAM organization or a set-associative RAM-RAM organization requires a detailed investigation of all of the design issues.
• Exploiting sequential accesses to save power and to increase
performance is always a good idea. Typical dynamic execution
statistics from the ARM show that 75% of all accesses are sequential.
• If nonsequential accesses take two clock cycles, this reduces performance by only about 25%, since they account for only about a quarter of all accesses.
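The 25% figure follows from a weighted average. The sketch below assumes that sequential accesses take one cycle and nonsequential accesses take two, as in the text; the function name is illustrative.

```python
def average_cycles(seq_fraction, seq_cycles=1, nonseq_cycles=2):
    """Average cycles per access for a given fraction of sequential accesses."""
    return seq_fraction * seq_cycles + (1 - seq_fraction) * nonseq_cycles

avg = average_cycles(0.75)   # 75% of accesses are sequential
overhead = avg - 1.0         # extra cycles relative to an all-one-cycle cache
print(avg, overhead)
```

With 75% sequential accesses the average is 1.25 cycles per access, i.e. about 25% more cycles than if every access took a single cycle.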
ARM710T MMU
• The translation look-aside buffer (TLB) is a 64-entry associative cache
of recently used translations which accelerates the translation process
by removing the need for the 2-stage table look-up in a high
proportion of accesses.
ARM710T write buffer
• The write buffer holds four addresses and eight data words. The
memory management unit defines which addresses are bufferable.
Each address may be associated with any number of the data words,
so the write buffer may hold one word (or byte) of data to write to
one address and seven words to write to another address
• The write buffer gives a performance benefit of around 15% for a
modest hardware cost.
ARM720T
The ARM720T is very similar to the ARM710T with the following extensions:
• Virtual addresses in the bottom 32 Mbytes of the address space can be relocated to the 32 Mbyte memory area specified in the ProcessID register (CP15 register 13).
• The exception vectors can be moved from the bottom of memory to 0xFFFF0000, thereby preventing them from being translated by the above mechanism. This function is controlled by the 'V' bit in CP15 register 1.
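A minimal sketch of the relocation idea follows. It assumes, for illustration only, that the ProcessID value is a small integer selecting one of the 32 Mbyte areas of the address space; the register layout is not given in the text, and the names `relocate` and `process_id` are hypothetical.

```python
AREA_SIZE = 32 * 1024 * 1024  # 32 Mbytes

def relocate(virtual_address, process_id):
    """Relocate addresses in the bottom 32 Mbytes to the area selected by process_id.

    Assumption: process_id is the index of a 32 Mbyte area. Addresses outside
    the bottom 32 Mbytes are passed through unchanged.
    """
    if virtual_address < AREA_SIZE:
        return virtual_address + process_id * AREA_SIZE
    return virtual_address

print(hex(relocate(0x00001000, 3)))   # relocated into area 3
print(hex(relocate(0xFFFF0000, 3)))   # high vectors are not relocated
```

This also shows why moving the exception vectors to 0xFFFF0000 keeps them out of the relocation: that address is not in the bottom 32 Mbytes.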
ARM740T
The ARM740T differs from the ARM710T only in having a simpler
memory protection unit in place of the 710T's memory management
unit.
AMBA BUS
Types of AMBA Bus
Three distinct buses are defined within the AMBA specification:
• The Advanced High-performance Bus (AHB)
• The Advanced System Bus (ASB)
• The Advanced Peripheral Bus (APB)
ARM810 CPU CORE:
• The ARM810 is a high-performance ARM CPU chip with an on-chip cache and memory management unit.
• It was the first implementation of the ARM instruction set developed by ARM Limited to use a fundamentally different pipeline structure from that used on the original ARM chip designed at Acorn Computers and carried through to ARM6 and ARM7.
• The ARM810 has now been superseded by the ARM9 series.
ARM810 organization
The ARM810 includes an 8 Kbyte virtually addressed unified instruction and data cache using a copy-back (or write-through, controlled by the page table entry) write strategy and offering a double-bandwidth capability as required by the ARM8 core. The cache is 64-way associative.

Double-bandwidth cache

The core's double-bandwidth requirement is satisfied by the cache; external memory accesses use conventional line refill and individual data transfer protocols. Double bandwidth is available from the cache only for sequential memory accesses.
Processor Modes
Data Sizes and Instruction Sets
• The ARM is a 32-bit architecture.

• When used in relation to the ARM:
  • Byte means 8 bits
  • Halfword means 16 bits (two bytes)
  • Word means 32 bits (four bytes)
• Most ARM cores implement two instruction sets:
  • 32-bit ARM Instruction Set
  • 16-bit Thumb Instruction Set
• Jazelle cores can also execute Java bytecode

Processor Modes
• The ARM has seven basic operating modes:
  • User : unprivileged mode under which most tasks run
  • FIQ : entered when a high priority (fast) interrupt is raised
  • IRQ : entered when a low priority (normal) interrupt is raised
  • Supervisor : entered on reset and when a Software Interrupt instruction is executed
  • Abort : used to handle memory access violations
  • Undef : used to handle undefined instructions
  • System : privileged mode using the same registers as user mode

The ARM Register Set
[Figure: the current visible registers in each mode (r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr, plus spsr in the exception modes), shown alongside the registers banked out in that mode, for User, FIQ, IRQ, SVC, Undef and Abort modes.]
Register Organization Summary
Register    User        FIQ         IRQ         SVC         Undef       Abort
r0-r7       r0-r7       User        User        User        User        User
r8-r12      r8-r12      banked      User        User        User        User
r13 (sp)    r13 (sp)    banked      banked      banked      banked      banked
r14 (lr)    r14 (lr)    banked      banked      banked      banked      banked
r15 (pc)    r15 (pc)    User        User        User        User        User
cpsr        cpsr        User        User        User        User        User
spsr        -           banked      banked      banked      banked      banked

"User" means the mode uses the User mode register; "banked" means the mode has its own private copy.
In Thumb state, r0-r7 are the low registers and the registers from r8 upward are the high registers.

Note: System mode uses the User mode register set


The Registers
• ARM has 37 registers, all of which are 32 bits long.
  • 1 dedicated program counter
  • 1 dedicated current program status register
  • 5 dedicated saved program status registers
  • 30 general purpose registers
• The current processor mode governs which of several banks is accessible. Each mode can access
  • a particular set of r0-r12 registers
  • a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
  • the program counter, r15 (pc)
  • the current program status register, cpsr
• Privileged modes (except System) can also access
  • a particular spsr (saved program status register)
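The total of 37 can be cross-checked by counting the banked registers per mode. The sketch below is only a counting exercise based on the register organization described in this section; the dictionary layout is illustrative.

```python
# General purpose registers visible in User (and System) mode: r0-r14.
user_registers = [f"r{i}" for i in range(15)]

# Private (banked) general purpose registers of each exception mode.
banked = {
    "FIQ": ["r8", "r9", "r10", "r11", "r12", "r13", "r14"],
    "IRQ": ["r13", "r14"],
    "SVC": ["r13", "r14"],
    "Undef": ["r13", "r14"],
    "Abort": ["r13", "r14"],
}

general_purpose = len(user_registers) + sum(len(regs) for regs in banked.values())
spsr_count = len(banked)              # one spsr per exception mode
total = general_purpose + 1 + 1 + spsr_count  # plus pc and cpsr

print(general_purpose, spsr_count, total)
```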
Program Status Registers
Bits:    31  30  29  28  27 | 24 | 23 ... 8   | 7  6  5 | 4 ... 0
Field:   N   Z   C   V   Q  | J  | undefined  | I  F  T | mode
Field mask groups: f = bits [31:24], s = bits [23:16], x = bits [15:8], c = bits [7:0]

• Condition code flags
  • N = Negative result from ALU
  • Z = Zero result from ALU
  • C = ALU operation Carried out
  • V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
  • Architecture 5TE/J only
  • Indicates if saturation has occurred
• J bit
  • Architecture 5TEJ only
  • J = 1: Processor in Jazelle state
• Interrupt Disable bits
  • I = 1: Disables the IRQ
  • F = 1: Disables the FIQ
• T Bit
  • Architecture xT only
  • T = 0: Processor in ARM state
  • T = 1: Processor in Thumb state
• Mode bits
  • Specify the processor mode
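A short sketch of decoding a program status register value into these fields. The bit positions follow the layout above; the mode bit patterns in `MODES` are the standard ARM encodings and are not listed in this section (only 10011 for Supervisor appears later in the notes), so treat them as an added assumption.

```python
# Mode field encodings, bits [4:0] (assumed standard ARM values).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}

def decode_psr(psr):
    """Return the flag, control and mode fields of a 32-bit PSR value."""
    return {
        "N": (psr >> 31) & 1,
        "Z": (psr >> 30) & 1,
        "C": (psr >> 29) & 1,
        "V": (psr >> 28) & 1,
        "Q": (psr >> 27) & 1,
        "J": (psr >> 24) & 1,
        "I": (psr >> 7) & 1,
        "F": (psr >> 6) & 1,
        "T": (psr >> 5) & 1,
        "mode": MODES.get(psr & 0x1F, "reserved"),
    }

print(decode_psr(0x600000D3))
```

The sample value 0x600000D3 decodes as Z and C set, IRQ and FIQ disabled, ARM state, Supervisor mode.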
Program Counter (r15)
• When the processor is executing in ARM state:
  • All instructions are 32 bits wide
  • All instructions must be word aligned
  • Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as instructions cannot be halfword or byte aligned).
• When the processor is executing in Thumb state:
  • All instructions are 16 bits wide
  • All instructions must be halfword aligned
  • Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as instructions cannot be byte aligned).
• When the processor is executing in Jazelle state:
  • All instructions are 8 bits wide
  • Processor performs a word access to read 4 instructions at once
Exception Handling
• When an exception occurs, the ARM:
  • Copies CPSR into SPSR_<mode>
  • Sets appropriate CPSR bits
    • Change to ARM state
    • Change to exception mode
    • Disable interrupts (if appropriate)
  • Stores the return address in LR_<mode>
  • Sets PC to vector address
• To return, exception handler needs to:
  • Restore CPSR from SPSR_<mode>
  • Restore PC from LR_<mode>
  This can only be done in ARM state.

Vector Table
Address   Exception
0x1C      FIQ
0x18      IRQ
0x14      (Reserved)
0x10      Data Abort
0x0C      Prefetch Abort
0x08      Software Interrupt
0x04      Undefined Instruction
0x00      Reset
Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices.
PIPELINE
INTERRUPTS
Operating Modes
• Seven operating modes:
  • User
  • Privileged:
    • System (version 4 and above)
    • Exception modes:
      • FIQ
      • IRQ
      • Abort
      • Undefined
      • Supervisor
Operating Modes
User mode:
• Normal program execution mode
• System resources unavailable
• Mode changed by exception only
Exception modes:
• Entered upon exception
• Full access to system resources
• Mode changed freely
INTERRUPTS:
• 32 interrupt request inputs
• 16 vectored IRQ interrupts
• 16 priority levels dynamically assigned to interrupt requests
• Software interrupt generation
• The Vectored Interrupt Controller (VIC) takes 32 interrupt request inputs and programmably assigns them to 3 categories:
  • FIQ
  • vectored IRQ
  • non-vectored IRQ
• The programmable assignment scheme means that priorities of interrupts from the various peripherals can be dynamically assigned and adjusted.
• Fast Interrupt reQuest (FIQ) requests have the highest priority. If more
than one request is assigned to FIQ, the VIC ORs the requests to
produce the FIQ signal to the ARM processor.
• The fastest possible FIQ latency is achieved when only one request is
classified as FIQ, because then the FIQ service routine can simply start
dealing with that device.
• But if more than one request is assigned to the FIQ class, the FIQ service
routine can read a word from the VIC that identifies which FIQ source(s) is
(are) requesting an interrupt.
• The FIQ (Fast Interrupt reQuest) exception is externally generated by taking
the nFIQ input LOW.
• Vectored IRQs have the middle priority, but only 16 of the 32 requests
can be assigned to this category. Any of the 32 requests can be assigned
to any of the 16 vectored IRQ slots
• Non-vectored IRQs have the lowest priority.
• The IRQ (Interrupt ReQuest) exception is a normal interrupt caused by a
LOW level on the nIRQ input. It has a lower priority than FIQ
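The classification described above can be sketched as a small software model. This is not the controller's register interface: the function `vic_outputs` and its arguments are hypothetical, and the model assumes that vectored slot 0 has the highest priority among the 16 slots.

```python
def vic_outputs(requests, fiq_select, vectored_slots):
    """Model the VIC decision for one set of pending requests.

    requests:       32-bit mask of active interrupt request inputs
    fiq_select:     32-bit mask of request numbers assigned to FIQ
    vectored_slots: list of up to 16 request numbers; index 0 is assumed
                    to be the highest priority vectored IRQ slot
    Returns (fiq, irq) where fiq is a bool and irq describes the IRQ to service.
    """
    # All requests classified as FIQ are ORed into a single FIQ signal.
    fiq = (requests & fiq_select) != 0
    irq_pending = requests & ~fiq_select & 0xFFFFFFFF

    # Vectored IRQs come next, in slot order.
    for slot, source in enumerate(vectored_slots[:16]):
        if irq_pending & (1 << source):
            return fiq, ("vectored", slot, source)

    # Anything left is a non-vectored IRQ (lowest priority).
    if irq_pending:
        return fiq, ("non-vectored", None, None)
    return fiq, None

# Requests 4, 9 and 20 are active; request 20 is assigned to FIQ.
print(vic_outputs((1 << 4) | (1 << 9) | (1 << 20), 1 << 20, [9, 4]))
```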
Software Interrupt
• The software interrupt instruction (SWI) is used for getting into Supervisor mode, usually to request a particular supervisor function.
• When a SWI is executed, the ARM7 performs the following:
  (1) Saves the address of the SWI instruction plus 4 in R14_svc; saves CPSR in SPSR_svc
  (2) Forces M[4:0]=10011 (Supervisor mode) and sets the I bit in the CPSR
  (3) Forces the PC to fetch the next instruction from address 0x08
• To return from a SWI, use MOVS PC,R14_svc. This will restore the PC and CPSR and return to the instruction following the SWI.
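The three entry steps and the return can be traced with a small state model. The dictionary keys and function names are illustrative; the model covers only what the steps above describe.

```python
MODE_MASK = 0x1F
MODE_SVC = 0b10011
I_BIT = 1 << 7

def take_swi(state, swi_address):
    """Apply the SWI entry steps to a copy of the processor state."""
    new = dict(state)
    new["r14_svc"] = swi_address + 4        # (1) return address
    new["spsr_svc"] = state["cpsr"]         # (1) save CPSR
    new["cpsr"] = ((state["cpsr"] & ~MODE_MASK) | MODE_SVC | I_BIT) & 0xFFFFFFFF  # (2)
    new["pc"] = 0x08                        # (3) SWI vector
    return new

def return_from_swi(state):
    """Model MOVS PC,R14_svc: restore the PC and the CPSR together."""
    new = dict(state)
    new["pc"] = state["r14_svc"]
    new["cpsr"] = state["spsr_svc"]
    return new

user = {"cpsr": 0x10, "pc": 0x8000}        # User mode, interrupts enabled
inside = take_swi(user, swi_address=0x8000)
after = return_from_swi(inside)
print(hex(inside["cpsr"]), hex(inside["pc"]), hex(after["pc"]), hex(after["cpsr"]))
```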
Exceptions

Exceptions result whenever the normal flow of a program has to be halted temporarily,
for example to service an interrupt from a peripheral.

Before attempting to handle an exception, the ARM preserves the current processor
state so that the original program can resume when the handler routine has finished.
Exception Handling
On entering an exception, the ARM core:
• Saves the address of the next instruction in the appropriate LR
• Copies the CPSR into the appropriate SPSR
• Sets appropriate CPSR bits
  • Interrupt disable bits
  • Mode field bits
  • If running in Thumb state, enters ARM state
• Forces PC to fetch next instruction from relevant exception vector
Leaving Exception

To leave an exception, the exception handler must:
• Copy SPSR back into CPSR
• Move contents of current LR minus offset to PC
  • Offset varies according to type of exception: 0, 4 or 8


Multiple Exceptions
• Exception priorities: when multiple exceptions arise at the same time, a fixed priority system determines the order in which they are handled.
Exceptions
Exception               Mode         Priority   IV Address
Reset                   Supervisor   1          0x00000000
Undefined instruction   Undefined    6          0x00000004
Software interrupt      Supervisor   6          0x00000008
Prefetch Abort          Abort        5          0x0000000C
Data Abort              Abort        2          0x00000010
Interrupt               IRQ          4          0x00000018
Fast interrupt          FIQ          3          0x0000001C

Table 1 - Exception types, sorted by Interrupt Vector addresses
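The fixed priority scheme can be illustrated by a lookup over Table 1. The data below is copied from the table; the function `next_exception` is a hypothetical helper that picks the pending exception with the lowest priority number (1 is the highest priority).

```python
# name: (mode, priority, vector address), taken from Table 1
EXCEPTIONS = {
    "Reset": ("Supervisor", 1, 0x00000000),
    "Undefined instruction": ("Undefined", 6, 0x00000004),
    "Software interrupt": ("Supervisor", 6, 0x00000008),
    "Prefetch Abort": ("Abort", 5, 0x0000000C),
    "Data Abort": ("Abort", 2, 0x00000010),
    "Interrupt": ("IRQ", 4, 0x00000018),
    "Fast interrupt": ("FIQ", 3, 0x0000001C),
}

def next_exception(pending):
    """Return the pending exception that is handled first."""
    return min(pending, key=lambda name: EXCEPTIONS[name][1])

first = next_exception(["Interrupt", "Data Abort", "Fast interrupt"])
mode, priority, vector = EXCEPTIONS[first]
print(first, mode, hex(vector))
```

Undefined instruction and Software interrupt share priority 6 in the table; the sketch does not try to order those two against each other.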


COPROCESSORS
• The ARM7TDMI core instruction set enables you to implement specialized additional instructions using coprocessors to extend functionality.
• Coprocessors are separate processing units that are tightly coupled to the ARM7TDMI processor. A typical coprocessor contains:
• an instruction pipeline (pipeline follower)
• instruction decoding logic
• handshake logic
• a register bank
• special processing logic, with its own data path.
• A coprocessor is connected to the same data bus as the ARM7TDMI
processor in the system, and tracks the pipeline in the ARM7TDMI
processor.
• This means that the coprocessor can decode the instructions in the
instruction stream, and execute those that it supports.
• Each instruction progresses down both the ARM7TDMI core pipeline
and the coprocessor pipeline at the same time.
• The execution of instructions is shared between the ARM7TDMI core
and the coprocessor
The ARM7TDMI processor:
1. Evaluates the instruction type and the condition codes to determine whether the instructions are executed by the coprocessor, and communicates this to any coprocessors in the system, using nCPI.
2. Generates any addresses that are required by the instruction,
including prefetching the next instruction to refill the pipeline.
3. Takes the undefined instruction trap if no coprocessor accepts the
instruction.
The coprocessor:
1. Decodes instructions to determine whether it can accept the instruction.
2. Indicates whether it can accept the instruction by using CPA and CPB.
3. Fetches any values required from its own register bank.
4. Performs the operation required by the instruction.
Coprocessor availability
• Up to 16 coprocessors can be referenced by a system, each with a
unique coprocessor ID number to identify it. The ARM7TDMI core
contains one internal coprocessor:
• CP14, the Debug Communications Channel (DCC) coprocessor. Other
coprocessor numbers have also been reserved.
Debug Communications Channel
The EmbeddedICE-RT logic contains a Debug Communications
Channel (DCC) for passing information between the target and the host
debugger. This is implemented as coprocessor 14.
The DCC comprises:
• a 32-bit wide communications data read register
• a 32-bit wide communications data write register
• a 32-bit wide (only 6 bits are used) communications control register
for synchronized handshaking between the processor and the
asynchronous debugger.
Connecting a single coprocessor
An example of how to connect a coprocessor into an ARM7TDMI processor system, if you are using a bidirectional bus, is shown in the figure.
Connecting multiple coprocessors
If you have multiple coprocessors in your system, connect the
handshake signals as follows:
• nCPI
Connect this signal to all coprocessors present in the system.
• CPA and CPB
The individual CPA and CPB outputs from each coprocessor must be
ANDed together, and connected to the CPA and CPB inputs on the
ARM7TDMI processor.
If you are not using an external coprocessor
• If you are implementing a system that does not include any external
coprocessors, you must tie both CPA and CPB HIGH. This indicates
that no external coprocessors are present in the system.
• If any coprocessor instructions are received, they take the undefined
instruction trap so that they can be emulated in software if required.
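The effect of ANDing the handshake outputs can be shown with a small model. The text only states that CPA and CPB tied HIGH means no coprocessor is present; the reading of the other combinations in `interpret` (CPA as "coprocessor absent", CPB as "coprocessor busy") is an assumption based on the signal names, and both function names are illustrative.

```python
def combine_handshake(coprocessors):
    """AND together the (CPA, CPB) outputs of every coprocessor.

    An empty list models a system with no external coprocessor,
    where both inputs are tied HIGH.
    """
    cpa = all(cp[0] for cp in coprocessors)
    cpb = all(cp[1] for cp in coprocessors)
    return int(cpa), int(cpb)

def interpret(cpa, cpb):
    """Assumed meaning of the combined signals as seen by the processor."""
    if cpa:
        return "absent: take undefined instruction trap"
    if cpb:
        return "present but busy: wait"
    return "accepted"

# Two coprocessors: the first ignores the instruction, the second accepts it.
signals = combine_handshake([(1, 1), (0, 0)])
print(signals, interpret(*signals))
print(combine_handshake([]), interpret(*combine_handshake([])))
```

Because the outputs are ANDed, a single coprocessor driving a signal LOW is enough to pull the combined input LOW, while coprocessors that do not recognise the instruction leave both signals HIGH.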
INSTRUCTION SET
Instruction Set

Two instruction sets:


• ARM
• Standard 32-bit instruction set
• THUMB
• 16-bit compressed form
• Code density better than most CISC
• Dynamic decompression in pipeline
Features of the ARM Instruction Set
• Load-store architecture
• Process values which are in registers
• Load, store instructions for memory data accesses
• 3-address data processing instructions
• Conditional execution of every instruction
• Load and store multiple registers
• Shift, ALU operation in a single instruction
Thumb instruction set
• Thumb is a 16-bit instruction set
  • Optimized for code density from C code
  • Improved performance from narrow memory
  • Subset of the functionality of the ARM instruction set
• Core has two execution states: ARM and Thumb
  • Switch between them using the BX instruction
• Thumb has characteristic features:
  • Most Thumb instructions are executed unconditionally
  • Many Thumb data processing instructions use a 2-address format
  • Thumb instruction formats are less regular than ARM instruction formats, as a result of the dense encoding.
ARM Instruction Set
The ARM instruction set is made up of:
• Data processing instructions
• Data transfer instructions
• Block transfer instructions
• Branching instructions
• Multiply instructions
• Software interrupt instructions
ARM Instruction Set
Conditional execution:
• Each data processing instruction is prefixed by a condition code
• Result: smooth flow of instructions through the pipeline
• 16 condition codes:
  EQ  equal                     MI  negative           HI  unsigned higher                GT  signed greater than
  NE  not equal                 PL  positive or zero   LS  unsigned lower or same         LE  signed less than or equal
  CS  unsigned higher or same   VS  overflow           GE  signed greater than or equal   AL  always
  CC  unsigned lower            VC  no overflow        LT  signed less than               NV  special purpose
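Each condition code is a test on the N, Z, C and V flags. The mapping below is the standard ARM one and is not spelled out in these notes, so treat it as an added assumption; NV is left out because it is reserved for special purposes.

```python
def condition_passed(cond, n, z, c, v):
    """Return True if condition code `cond` passes for the given flag values."""
    table = {
        "EQ": z,                    "NE": not z,
        "CS": c,                    "CC": not c,
        "MI": n,                    "PL": not n,
        "VS": v,                    "VC": not v,
        "HI": c and not z,          "LS": (not c) or z,
        "GE": n == v,               "LT": n != v,
        "GT": (not z) and n == v,   "LE": z or n != v,
        "AL": True,
    }
    return bool(table[cond])

# Flags as left by comparing 5 with 3: result positive, non-zero, no borrow.
flags = dict(n=0, z=0, c=1, v=0)
for cond in ("EQ", "NE", "HI", "GT", "LE"):
    print(cond, condition_passed(cond, **flags))
```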


Data Processing Instruction
• Consist of
• Arithmetic (ADD, SUB, RSB)
• Logical (BIC, AND)
• Compare (CMP, TST)
• Register movement (MOV, MVN)
• All operands are 32-bit wide; come from registers or specified as literal in the
instruction itself
• Second operand sent to ALU via barrel shifter
• 32-bit result placed in register; long multiply instruction produces 64-bit result
• 3-address instruction format
Data Processing Instructions

• Arithmetic operations:
• ADD, ADC, SUB, SBC, RSB, RSC
• Bit-wise logical operations:
• AND, EOR, ORR, BIC
• Register movement operations:
• MOV, MVN
• Comparison operations:
• TST, TEQ, CMP, CMN
Data Processing Instructions
Condition codes + data processing instructions + barrel shifter = powerful tools for efficiently coded programs
Data Processing Instructions

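e.g.:

if (Z == 1)
    R1 = R2 + (R3 * 4)
compiles to
ADDEQS R1,R2,R3,LSL #2
(a single instruction: the EQ condition tests the Z flag, LSL #2 multiplies R3 by 4, and the S suffix updates the flags)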
Data Processing Instructions
1. Simple register operands
2. Immediate operands
3. Shifted register operands
4. Multiply
Simple Register Operands (1/2)
• Arithmetic Operations
ADD r0,r1,r2 ;r0:=r1+r2
ADC r0,r1,r2 ;r0:=r1+r2+C
SUB r0,r1,r2 ;r0:=r1–r2
SBC r0,r1,r2 ;r0:=r1–r2+C–1
RSB r0,r1,r2 ;r0:=r2–r1, reverse subtraction
RSC r0,r1,r2 ;r0:=r2–r1+C–1
• By default, data processing operations do not affect the condition flags
• Bit-wise Logical Operations
AND r0,r1,r2 ;r0:=r1ANDr2
ORR r0,r1,r2 ;r0:=r1ORr2
EOR r0,r1,r2 ;r0:=r1XORr2
BIC r0,r1,r2 ;r0:=r1AND (NOT r2), bit clear
Simple Register Operands (2/2)
• Register Movement Operations
• Omit 1st source operand from the format
MOV r0,r2 ;r0:=r2
MVN r0,r2 ;r0:=NOT r2, move 1’s complement

• Comparison Operations
• Do not produce a result; omit the destination from the format
• Just set the condition code bits (N, Z, C and V) in CPSR
CMP r1,r2 ;set cc on r1 - r2, compare
CMN r1,r2 ;set cc on r1 + r2, compare negated
TST r1,r2 ;set cc on r1 AND r2, bit test
TEQ r1,r2 ;set cc on r1 XOR r2, test equal
Immediate Operands
• Replace the second source operand with an immediate operand,
which is a literal constant, preceded by “#”
ADD r3,r3,#1 ;r3:=r3+1
AND r8,r7,#&FF ;r8:=r7[7:0], &:hexadecimal

• Since the immediate value is coded within the 32 bits of the
instruction, it is not possible to enter every possible 32-bit value as an
immediate.
Shifted Register Operands
• ADD r3,r2,r1,LSL #3 ;r3 := r2 + 8 * r1
• A single instruction executed in a single cycle
• LSL: Logical Shift Left by 0 to 31 places, 0 filled at the lsb end
• LSR, ASL (Arithmetic Shift Left), ASR, ROR (Rotate Right), RRX (Rotate Right extended by 1 place)
• ADD r5,r5,r3,LSL r2 ;r5 := r5 + r3 * 2^r2
• MOV r12,r4,ROR r3 ;r12 := r4 rotated right by value of r3
[Figure: shift operations on a 32-bit word (bit 31 to bit 0): LSL #5, LSR #5, ASR #5 with a positive operand, ASR #5 with a negative operand, ROR #5, and RRX through the carry flag C]
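The shift types above can be modelled in C as follows. This is a sketch under the assumption that the shift amount is between 1 and 31 (so every C shift is well defined); the function names are illustrative only.

```c
#include <stdint.h>

/* All helpers assume 1 <= n <= 31. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }   /* zeros enter at the lsb end */
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }   /* zeros enter at the msb end */

/* ASR: copies of the sign bit (bit 31) enter at the msb end. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if (x & 0x80000000u)
        shifted |= ~(0xFFFFFFFFu >> n);   /* set the top n bits */
    return shifted;
}

/* ROR: bits leaving the lsb end re-enter at the msb end. */
uint32_t ror(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* RRX: rotate right by one place through the carry flag. */
uint32_t rrx(uint32_t x, unsigned carry_in, unsigned *carry_out)
{
    *carry_out = x & 1u;
    return (x >> 1) | ((uint32_t)(carry_in & 1u) << 31);
}
```

With these helpers, `asr(0x80000000, 5)` fills the top five bits with ones, while `lsr` on the same value fills them with zeros.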
Using the Barrel Shifter: the 2nd Operand

• Register, optionally with shift operation applied
• Shift value can be either
• 5-bit unsigned integer
• Specified in bottom byte of another
register
• Used for multiplication by constant
• Immediate value
• 8-bit number, with a range of 0 - 255
• Rotated right through even number of
positions
• Allows increased range of 32-bit
constants to be loaded directly into
registers
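Whether a given 32-bit constant fits the "8-bit value rotated right through an even number of positions" rule can be checked with a short C routine. This is a sketch; `is_arm_immediate` and `rotl32` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate left by n places (n taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Returns true if value equals some 8-bit number rotated right
   by an even amount (0, 2, 4, ..., 30). */
bool is_arm_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left undoes a rotate right by the same amount;
           if only the low 8 bits remain set, the value is encodable. */
        if (rotl32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For instance, 0xFF000000 and 0x3FC pass the check, whereas 0x101 and 0x1FE do not: the first spans nine bits and the second would need an odd rotation.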
Multiply
• Multiply
MUL r4,r3,r2 ;r4:=(r3*r2)[31:0]

• Multiply-Accumulate
MLA r4,r3,r2,r1 ;r4:=(r3*r2+r1)[31:0]
Multiply Instructions
• Accumulation is denoted by “+=”
• Example: form a scalar product of two vectors
MOV r11,#20 ;initialize loop counter
MOV r10,#0 ;initialize total
Loop LDR r0,[r8],#4 ;get first component
LDR r1,[r9],#4 ;get second component
MLA r10,r0,r1,r10 ;accumulate product
SUBS r11,r11,#1 ;decrement loop counter
BNE Loop
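For comparison, the same scalar product loop can be written in C. This is an illustrative equivalent, not compiler output; the function name and the `count` parameter (20 in the assembly) are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums a[i] * b[i], keeping the low 32 bits as MLA does. */
uint32_t dot_product(const uint32_t *a, const uint32_t *b, size_t count)
{
    uint32_t total = 0;            /* MOV r10,#0 */
    while (count != 0) {
        total += *a++ * *b++;      /* two post-indexed LDRs, then MLA */
        count--;                   /* SUBS r11,r11,#1 ; BNE Loop */
    }
    return total;
}
```

Unlike the assembly loop, which always runs at least once, this version also handles a zero-length vector.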
Multiplication by a Constant
• Multiplication by a constant equal to ((power of 2) +/- 1) can be done in a single cycle
• Using MOV, ADD or RSBs with an inline shift
• Example: r0 = r1 * 5
• Example: r0 = r1 + (r1 * 4)
• ADD r0,r1,r1,LSL #2
• Several instructions can be combined to carry out other multiplies
• Example: r2 = r3 * 119
• Example: r2 = r3 * 17 * 7
• Example: r2 = r3 * (16 + 1) * (8 - 1)
• ADD r2,r3,r3,LSL #4 ;r2:=r3*17
• RSB r2,r2,r2,LSL #3 ;r2:=r2*7
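The shift-and-add decompositions above can be checked in C. This is a small sketch with hypothetical function names; each line mirrors one of the instructions shown.

```c
#include <stdint.h>

/* r0 = r1 * 5, as ADD r0,r1,r1,LSL #2 */
uint32_t times5(uint32_t r1)
{
    return r1 + (r1 << 2);
}

/* r2 = r3 * 119 = r3 * (16 + 1) * (8 - 1) */
uint32_t times119(uint32_t r3)
{
    uint32_t r2 = r3 + (r3 << 4);   /* ADD r2,r3,r3,LSL #4 : r2 = r3 * 17 */
    r2 = (r2 << 3) - r2;            /* RSB r2,r2,r2,LSL #3 : r2 = r2 * 7  */
    return r2;
}
```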
Multiply Instructions
• 32-bit product (Least Significant)
• MUL{<cond>}{S} Rd,Rm,Rs
• MLA{<cond>}{S} Rd,Rm,Rs,Rn
• 64-bit Product
• <mul>{<cond>}{S} RdHi,RdLo,Rm,Rs
• <mul> is UMULL, UMLAL, SMULL or SMLAL

Opcode [23:21]  Mnemonic  Meaning                              Effect
000             MUL       Multiply (32-bit result)             Rd := (Rm * Rs) [31:0]
001             MLA       Multiply-accumulate (32-bit result)  Rd := (Rm * Rs + Rn) [31:0]
100             UMULL     Unsigned multiply long               RdHi:RdLo := Rm * Rs
101             UMLAL     Unsigned multiply-accumulate long    RdHi:RdLo += Rm * Rs
110             SMULL     Signed multiply long                 RdHi:RdLo := Rm * Rs
111             SMLAL     Signed multiply-accumulate long      RdHi:RdLo += Rm * Rs
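The 64-bit forms in the table can be modelled in C with a 64-bit intermediate. This is a sketch; the `RegPair` type standing in for RdHi:RdLo and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the RdHi:RdLo register pair. */
typedef struct { uint32_t hi, lo; } RegPair;

static RegPair split64(uint64_t v)
{
    RegPair r = { (uint32_t)(v >> 32), (uint32_t)v };
    return r;
}

/* UMULL: RdHi:RdLo := Rm * Rs (unsigned) */
RegPair umull(uint32_t rm, uint32_t rs)
{
    return split64((uint64_t)rm * rs);
}

/* SMULL: RdHi:RdLo := Rm * Rs (signed) */
RegPair smull(int32_t rm, int32_t rs)
{
    return split64((uint64_t)((int64_t)rm * rs));
}

/* UMLAL: RdHi:RdLo += Rm * Rs (unsigned) */
RegPair umlal(RegPair acc, uint32_t rm, uint32_t rs)
{
    uint64_t sum = (((uint64_t)acc.hi << 32) | acc.lo) + (uint64_t)rm * rs;
    return split64(sum);
}
```

The signed and unsigned forms differ only in how the operands are widened: multiplying 0xFFFFFFFF by 1 gives a high word of 0 with `umull` but 0xFFFFFFFF with `smull`, where the operand is read as -1.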
Data Processing Instructions - Summary
Opcode [24:21]  Mnemonic  Meaning                        Effect
0000            AND       Logical bit-wise AND           Rd := Rn AND Op2
0001            EOR       Logical bit-wise exclusive OR  Rd := Rn EOR Op2
0010            SUB       Subtract                       Rd := Rn - Op2
0011            RSB       Reverse subtract               Rd := Op2 - Rn
0100            ADD       Add                            Rd := Rn + Op2
0101            ADC       Add with carry                 Rd := Rn + Op2 + C
0110            SBC       Subtract with carry            Rd := Rn - Op2 + C - 1
0111            RSC       Reverse subtract with carry    Rd := Op2 - Rn + C - 1
1000            TST       Test                           Scc on Rn AND Op2
1001            TEQ       Test equivalence               Scc on Rn EOR Op2
1010            CMP       Compare                        Scc on Rn - Op2
1011            CMN       Compare negated                Scc on Rn + Op2
1100            ORR       Logical bit-wise OR            Rd := Rn OR Op2
1101            MOV       Move                           Rd := Op2
1110            BIC       Bit clear                      Rd := Rn AND NOT Op2
1111            MVN       Move negated                   Rd := NOT Op2
Data Processing Instructions
• The S bit gives direct control of whether or not the condition codes are affected
(condition codes unchanged when S = 0)
• N = 1 if the result is negative; 0 otherwise (i.e. N = bit 31 of the result)
• Z = 1 if the result is zero; 0 otherwise
• C = carry out from the ALU for ADD, ADC, SUB, SBC, RSB, RSC, CMP or CMN; otherwise the carry out
from the shifter
• V = 1 if overflow from bit 30 to bit 31; 0 if no overflow
(V is preserved in non-arithmetic operations)
• PC may be used as a source operand (address of the instruction plus 8) except
when a register-specified shift amount is used
• PC may be specified as the destination register; the instruction is then a form of branch
(for example, a return from a subroutine)
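The flag rules for arithmetic instructions with the S bit set can be illustrated in C. This is a sketch for ADDS and SUBS only; the `Nzcv` type and function names are hypothetical. For subtraction the carry flag is modelled as "no borrow", which is the ARM convention.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the condition flags. */
typedef struct { bool n, z, c, v; } Nzcv;

/* ADDS: returns a + b and sets the flags. */
uint32_t adds(uint32_t a, uint32_t b, Nzcv *f)
{
    uint32_t r = a + b;
    f->n = (r >> 31) != 0;                      /* bit 31 of the result */
    f->z = (r == 0);
    f->c = (r < a);                             /* unsigned carry out */
    f->v = ((~(a ^ b) & (a ^ r)) >> 31) != 0;   /* same-sign inputs, different-sign result */
    return r;
}

/* SUBS: returns a - b and sets the flags. */
uint32_t subs(uint32_t a, uint32_t b, Nzcv *f)
{
    uint32_t r = a - b;
    f->n = (r >> 31) != 0;
    f->z = (r == 0);
    f->c = (a >= b);                            /* C = 1 when no borrow occurs */
    f->v = (((a ^ b) & (a ^ r)) >> 31) != 0;    /* different-sign inputs, result sign flips */
    return r;
}
```

Adding 1 to 0x7FFFFFFF sets N and V (signed overflow) but not C, while adding 1 to 0xFFFFFFFF sets Z and C but not V.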
Data transfer instructions
The ARM has 3 types of data transfer instruction:
• single register loads and stores
flexible byte, half-word and word transfers
• multiple register loads and stores
less flexible, multiple words, higher transfer rate
• single register - memory swap
mainly for system use
Data transfer instructions
Addressing memory
All ARM data transfer instructions use register-indirect addressing.
Examples of load and store instructions:
LDR r0, [r1] ; r0 := mem[r1]
STR r0, [r1] ; mem[r1] := r0
Therefore, before any data transfer is possible,
a register must be initialized with an address close to the target.
Initializing an address pointer
• Any register can be used for an address
• The assembler has special ‘pseudo instructions’ to initialize address registers:
ADR r1, TABLE1 ; r1 points to TABLE1
..
TABLE1 ; LABEL
• ADR will result in a single ARM instruction
ADRL r1, TABLE1
• ADRL will handle cases that ADR can’t
Single register loads and stores
• the simplest form is just register indirect:
LDR r0, [r1] ; r0 := mem[r1]
• this is a special form of ‘base plus offset’:
LDR r0, [r1,#4] ; r0 := mem[r1+4]
– the offset is within ± 4 Kbytes
• auto-indexing is also possible:
LDR r0, [r1,#4]! ; r0 := mem[r1+4]
; r1 := r1 + 4
Single register loads and stores (..ctd)
• another form uses post-indexing
LDR r0, [r1],#4 ; r0 := mem[r1]
; r1 := r1 + 4
• finally, a byte or half-word can be loaded instead of a word (with some
restrictions):
LDRB r0, [r1] ; r0 := mem8[r1]
LDRSH r0, [r1] ; r0 := mem16[r1](signed)
• stores (STR) have the same forms
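The difference between the three single-register addressing forms (plain offset, auto-indexing and post-indexing) can be sketched in C. Memory is modelled here as an array of words indexed by byte address divided by 4; that model and all function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical memory model: mem[addr / 4] is the word at byte address addr. */
static uint32_t load_word(const uint32_t *mem, uint32_t addr)
{
    return mem[addr / 4];
}

/* LDR r0,[r1,#off] : base register unchanged */
uint32_t ldr_offset(const uint32_t *mem, uint32_t r1, int32_t off)
{
    return load_word(mem, r1 + (uint32_t)off);
}

/* LDR r0,[r1,#off]! : base updated first, then used (auto-indexing) */
uint32_t ldr_pre(const uint32_t *mem, uint32_t *r1, int32_t off)
{
    *r1 += (uint32_t)off;
    return load_word(mem, *r1);
}

/* LDR r0,[r1],#off : base used first, then updated (post-indexing) */
uint32_t ldr_post(const uint32_t *mem, uint32_t *r1, int32_t off)
{
    uint32_t value = load_word(mem, *r1);
    *r1 += (uint32_t)off;
    return value;
}
```

Starting from the same base address, the auto-indexed and post-indexed forms leave the base register with the same final value but load different words.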
Multiple register loads and stores
• ARM also supports instructions which transfer several registers:
LDMIA r1, {r0,r2,r5} ; r0 := mem[r1]
; r2 := mem[r1+4]
; r5 := mem[r1+8]
– in ARM state the {..} list may contain any or all of r0 – r15
– the lowest register always uses the lowest address, and so on, in
increasing order
• it doesn’t matter how the registers are ordered in {..}
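The "lowest register uses the lowest address" rule can be modelled in C by representing the register list as a bit mask, which also shows why the written order of the list is irrelevant. This is a sketch; the function name, the memory model (word array indexed by byte address divided by 4) and the return value are assumptions.

```c
#include <stdint.h>

/* Models LDMIA: bit i of reglist set means ri is in the list.
   Returns the address after the last word transferred. */
uint32_t ldmia(uint32_t regs[16], const uint32_t *mem,
               uint32_t base, uint16_t reglist)
{
    uint32_t addr = base;
    for (unsigned i = 0; i < 16; i++) {      /* always lowest register first */
        if (reglist & (1u << i)) {
            regs[i] = mem[addr / 4];
            addr += 4;                       /* increment after each transfer */
        }
    }
    return addr;
}
```

Calling it with the mask for {r0,r2,r5} and a base of 0 loads r0, r2 and r5 from the first three words and returns 12.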
In Thumb state the syntax is:
op Rn!, {reglist}
op is either:
  LDMIA  Load multiple, increment after
  STMIA  Store multiple, increment after
Rn is the register containing the base address. Rn must be in the range r0-r7.
reglist is a comma-separated list of low registers or low-register ranges.

Examples
LDMIA r3!, {r0,r4}
LDMIA r5!, {r0-r7}
STMIA r0!, {r6,r7}
STMIA r3!, {r3,r5,r7}
Incorrect examples
LDMIA r3!,{r0,r9} ; high registers not allowed
STMIA r5!, {} ; must be at least one register in list
STMIA r5!,{r1-r6} ; value stored from r5 is unpredictable
Data Transfer Instructions-summary
• Load/store instructions
• Used to move signed and unsigned words, half-words and bytes to and from registers
• Can be used to load PC (if target address is beyond branch instruction range)

LDR    Load Word              STR   Store Word
LDRH   Load Half Word         STRH  Store Half Word
LDRSH  Load Signed Half Word  (no signed store; STRH stores the low 16 bits)
LDRB   Load Byte              STRB  Store Byte
LDRSB  Load Signed Byte       (no signed store; STRB stores the low 8 bits)
Block Transfer Instructions
• Load/Store Multiple instructions (LDM/STM)
• Whole register bank or a subset copied to memory or restored with single instruction
[Figure: LDM and STM transferring registers R0 to R15 to and from consecutive memory words Mi to Mi+15]
Swap Instruction
• Exchanges a word between a register and memory
• Two cycles but single atomic action
• Support for RT semaphores
[Figure: register bank R0 to R15]
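The use of an atomic swap for semaphores can be sketched in portable C11, where `atomic_exchange` plays the role that the swap instruction plays on the ARM. This is an analogy, not ARM-specific code; the function names and the "0 = free, 1 = taken" convention are assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical binary semaphore: 0 = free, 1 = taken. */

/* Writes 1 and reads the previous value in one indivisible step.
   The caller owns the semaphore only if the old value was 0. */
bool try_acquire(atomic_uint *sem)
{
    return atomic_exchange(sem, 1u) == 0u;
}

/* Marks the semaphore as free again. */
void release(atomic_uint *sem)
{
    atomic_store(sem, 0u);
}
```

Because the read and the write cannot be separated, two tasks that race for the semaphore can never both see the old value 0.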
Modifying the Status Registers
• Only indirectly
• MRS moves contents from CPSR/SPSR to selected GPR
• MSR moves contents from selected GPR to CPSR/SPSR
• Only in privileged modes
[Figure: MRS copies CPSR/SPSR into a register of the bank R0 to R15; MSR copies a register into CPSR/SPSR]
Branching Instructions

• Branch (B): jumps forwards/backwards up to 32 MB
• Branch link (BL): same + saves the return address (the address of the instruction after the BL) in LR
• Suitable for function call/return
• Condition codes for conditional branches
• Branch exchange (BX) and Branch link exchange (BLX): same as B/BL + exchange
instruction set (ARM <-> Thumb)
• Only way to swap sets
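The 32 MB branch range follows from the encoding: B and BL hold a signed 24-bit word offset relative to the PC, which reads as the branch address plus 8. The C sketch below computes that field; the function name is hypothetical and the 24-bit field width is standard ARM encoding knowledge rather than something stated on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Computes the 24-bit offset field of an ARM B/BL instruction.
   Returns false if the target is misaligned or out of range. */
bool encode_branch_offset(uint32_t branch_addr, uint32_t target_addr,
                          uint32_t *field)
{
    /* The offset is measured from the branch address plus 8. */
    int64_t delta = (int64_t)target_addr - ((int64_t)branch_addr + 8);

    if (delta & 3)
        return false;                       /* targets must be word aligned */

    int64_t words = delta / 4;
    if (words < -(1 << 23) || words > (1 << 23) - 1)
        return false;                       /* beyond the signed 24-bit range */

    *field = (uint32_t)words & 0x00FFFFFFu;
    return true;
}
```

A branch to itself, for example, needs an offset of -2 words, which appears in the field as 0xFFFFFE.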
Thumb state
The Thumb instruction set is a re-encoded subset of the ARM instruction set.
Thumb instructions are half the size of ARM instructions (16 bits)
Greater code density can usually be achieved by using the Thumb instruction set
instead of the ARM instruction set.
The Thumb instruction set is always used in conjunction with a suitable version of
the ARM instruction set.
Its presence is denoted by the variant letter T, and it is not valid prior to ARM
architecture version 4.
Thumb state
• Two limitations of the Thumb instruction set compared with the ARM
instruction set are:
• Thumb code usually uses more instructions for the same job, so ARM code is usually
best for maximizing the performance of time-critical code.

• The Thumb instruction set does not include some instructions that are needed for
exception handling, so ARM code needs to be used for at least the top-level exception
handlers.
Thumb State

Set of instructions re-coded into 16 bits


• Improved code density by ~ 30%
• Saving program memory space
In Thumb state only the program code is 16-bit wide
• After the 16-bit instructions are fetched from memory, they are decompressed to 32-bit
instructions before they are decoded and executed, so all operations are still 32-bit
operations.
Registers in Thumb State

The Thumb state register set is a subset of the ARM state set. The programmer has
direct access to:
• Eight general registers r0 - r7
• The program counter PC
• A Stack pointer SP
• A Link register LR
• The current program status register CPSR
In Thumb state, the high registers (r8 - r15) are not part of the standard register set.
Thumb vs. ARM
How to change into Thumb state, then back
Example
;start off in ARM state
CODE32
ADR r0,Into_Thumb+1 ;generate branch target
;address & set bit 0
;hence arrive Thumb state
BX r0 ;branch exchange to Thumb

CODE16 ;assemble subsequent as Thumb
Into_Thumb …
ADR r5,Back_to_ARM ;generate branch target to
;word-aligned address,
;hence bit 0 is cleared.
BX r5 ;branch exchange to ARM

CODE32 ;assemble subsequent as ARM
Back_to_ARM …
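The role of bit 0 in the BX target address can be modelled in a few lines of C. This is an illustrative sketch only; the `BxResult` type and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of a BX: the new PC and the new state. */
typedef struct { uint32_t pc; bool thumb; } BxResult;

/* Bit 0 of the register selects the instruction set;
   the remaining bits give the branch target. */
BxResult bx(uint32_t rm)
{
    BxResult r;
    r.thumb = (rm & 1u) != 0;   /* 1 = continue in Thumb state, 0 = ARM state */
    r.pc = rm & ~1u;            /* bit 0 is not part of the address */
    return r;
}
```

This is why the example adds 1 to `Into_Thumb` before the first BX, and relies on a word-aligned address (bit 0 clear) for the return to ARM state.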
ARM Assembly Language Programming
Introduction to assembly language
programming
• The following is a simple example which illustrates some of the core constituents of an
ARM assembler module:

[Figure: annotated listing of an assembler module, marking the assembler directives, label, opcode, operands (the objects operated on by the opcode) and comment]
Assembly Instruction
• One line of code - first : ADD r1,r2,r3
[Figure: fields of the line: label (optional), opcode, operand 1, operand 2, operand 3]
General purpose register R0-R12 usage
• MOV r0,#15
• Opcode : MOV Operands : r0,#15
• In C, you may write int a = 15;
• Meaning : MOVe the value 15 (decimal) into register R0.
• “MOV” means “to move”
• R0 is register 0 (32-bit)
• # (hash) means it is an immediate value, defined by the number
following #.
• If you add ‘0x’ as a prefix, the number is in hexadecimal
[Figure: the code MOV r0,#15 and the resulting value 15 in register R0]
Assembly Language Programming

• Delimiters

• Labels

• Operation code(Mnemonics)

• Directives
Delimiters:
Directives:
Define Constant:
• It allows the programmer to enter fixed data into program memory.

DCD - Word (32 bit), DCW - Half-word (16 bit), DCB - Byte (8 bit)

EQUATE directive:
• It allows programmer to equate names with address or data
Eg:
TTY EQU 5
LAST EQU 5000
LAST EQU FINAL
AREA Directive:
• It allows the programmer to specify the memory locations where the
programs, subroutines, or data will reside.
Housekeeping directives:
Example: C assignments
• C:
x = (a + b) - c;
• Assembler:
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b, reusing r4
LDR r1,[r4] ; get value of b
ADD r3,r0,r1 ; compute a+b
ADR r4,c ; get address for c
LDR r2,[r4] ; get value of c
C assignment, cont’d.
SUB r3,r3,r2 ; complete computation of x
ADR r4,x ; get address for x
STR r3,[r4] ; store value of x
Example: C assignment
• C:
y = a*(b+c);
• Assembler:
ADR r4,b ; get address for b
LDR r0,[r4] ; get value of b
ADR r4,c ; get address for c
LDR r1,[r4] ; get value of c
ADD r2,r0,r1 ; compute partial result
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
C assignment, cont’d.
MUL r2,r2,r0 ; compute final value for y
ADR r4,y ; get address for y
STR r2,[r4] ; store y
Example: C assignment
• C:
z = (a << 2) | (b & 15);
• Assembler:
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
MOV r0,r0,LSL #2 ; perform shift
ADR r4,b ; get address for b
LDR r1,[r4] ; get value of b
AND r1,r1,#15 ; perform AND
ORR r1,r0,r1 ; perform OR
C assignment, cont’d.
ADR r4,z ; get address for z
STR r1,[r4] ; store value for z
Example: if statement
• C:
if (a > b)
{
x = 5;
y = c + d;
}
else
x = c - d;

• Assembler:
; compute and test condition
ADR r4,a ; get address for a
LDR r0,[r4] ; get value of a
ADR r4,b ; get address for b
LDR r1,[r4] ; get value for b
CMP r0,r1 ; compare a with b
BLE fblock ; if a <= b, branch to false block
If statement, cont’d.
; true block
MOV r0,#5 ; generate value for x
ADR r4,x ; get address for x
STR r0,[r4] ; store x
ADR r4,c ; get address for c
LDR r0,[r4] ; get value of c
ADR r4,d ; get address for d
LDR r1,[r4] ; get value of d
ADD r0,r0,r1 ; compute y
ADR r4,y ; get address for y
STR r0,[r4] ; store y
B after ; branch around false block
If statement, cont’d.
; false block
fblock ADR r4,c ; get address for c
LDR r0,[r4] ; get value of c
ADR r4,d ; get address for d
LDR r1,[r4] ; get value of d
SUB r0,r0,r1 ; compute c-d
ADR r4,x ; get address for x
STR r0,[r4] ; store value of x
after ...
EXAMPLE PROGRAMS
16 Bit data transfer

The TTL directive inserts a title at the start of each page of a listing file.


The SUBT directive places a subtitle on the pages of a listing file.
One’s complement:
32 bit addition:
Write an assembly language program to calculate Y=((A+B)-C)*2
Addition of two numbers:(offset addressing)
Shift left one bit:
Largest of two numbers:
Write an assembly language program to find the largest of
three given input numbers.
64 bit addition:
Sum of Numbers:
Finding number of negative numbers in series of numbers:
Find the length of null terminated string:
INSTRUCTION CYCLE TIMINGS
o Timings can vary between different revisions of an implementation and
are also affected by external events such as interrupts, memory speed,
and cache misses.

o ARM cores use pipelined implementations. The number of cycles that an
instruction takes may depend on the previous and following instructions.

o When you optimize code, you need to be aware of these interactions.


Use the following steps to calculate the number of
cycles taken by an instruction:
1. Find which ARM core you are using. For example, ARM7xx parts
usually contain an ARM7TDMI core; ARM9xx parts, an ARM9TDMI
core; and ARM9xxE parts, an ARM9E core.
2. Find the instruction description table for the ARM core you are
using.
3. Read the value in the “Cycles” column. This is the number of cycles
the instruction usually takes, assuming the instruction passes its
condition codes and there are no interactions with other
instructions. The cycle count may depend on one of the
abbreviations defined for that table.
4. If the “Notes” column contains any notes of the form +k if condition,
then add on to your cycle count all the additions that apply.
5. Look for interlock conditions that will cause the processor to stall.
These are occasions where an instruction attempts to use the result of a
previous instruction before it is ready.
Unless otherwise stated, input registers are required on the first cycle of
the instruction and output results are available at the end of the last
cycle of the instruction.
6. If your instruction fails its condition codes, then it is not executed.
Usually this costs one cycle. However, on some implementations,
instructions may cost multiple cycles even if they are not executed.
Look for a note of the form “[k cycles if not executed]”.
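The six steps above can be summarised as a small calculation. The C sketch below is only a simplified model: the `TimingRow` type, its fields and the function name are hypothetical, and the real values must come from the instruction description table of the core in use.

```c
#include <stdbool.h>

/* Hypothetical row of an instruction timing table. */
typedef struct {
    unsigned cycles;               /* value of the "Cycles" column (step 3) */
    unsigned cycles_not_executed;  /* cost when the condition fails (step 6), usually 1 */
} TimingRow;

/* note_extras: sum of the "+k if condition" notes that apply (step 4).
   interlock_stalls: stall cycles caused by interlocks (step 5). */
unsigned estimate_cycles(TimingRow row, bool cond_passed,
                         unsigned note_extras, unsigned interlock_stalls)
{
    if (!cond_passed)
        return row.cycles_not_executed;   /* instruction is not executed */
    return row.cycles + note_extras + interlock_stalls;
}
```

As a purely illustrative case, an instruction listed at 3 cycles with one applicable "+1" note and a 2-cycle interlock would be estimated at 6 cycles.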
OPTIMIZED PRIMITIVES
A primitive is a basic operation that can be used in a wide variety of different
algorithms and programs.

For example, addition, multiplication, division, and random number
generation are all primitives.

Some primitives are supported directly by the ARM instruction set, including
32-bit addition and multiplication.

However, many primitives are not supported directly by instructions, and we
must write routines to implement them (for example, division and random
number generation).
Division:

• The ARM cores covered here (such as the ARM7TDMI and ARM9TDMI) don’t have
hardware support for division. To divide two numbers you must call a software
routine that calculates the result using standard arithmetic operations. If you
can’t avoid a division then you need access to very optimized division routines.

• Suppose we need to calculate the quotient q = n/d and remainder r =
n % d for unsigned integers n and d.
unsigned udiv_simple(unsigned d, unsigned n, unsigned N)
{
    /* N is the number of quotient bits to compute (N >= 1) */
    unsigned q=0, r=n;
    do
    { /* calculate next quotient bit */
        N--; /* move to next bit */
        if ( (r >> N) >= d ) /* if r>=d*(1 << N) */
        {
            r -= (d << N); /* update remainder */
            q += (1 << N); /* update quotient */
        }
    } while (N);
    return q;
}
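A self-contained variant of the same shift-and-subtract idea, returning both quotient and remainder for 32-bit operands, might look as follows. This is a sketch, not an optimized routine: the `DivResult` type and the name `udiv32` are hypothetical, and the divisor is assumed to be non-zero.

```c
#include <stdint.h>

/* Hypothetical result type: quotient q and remainder r. */
typedef struct { uint32_t q, r; } DivResult;

/* Computes n / d and n % d one quotient bit at a time. Assumes d != 0. */
DivResult udiv32(uint32_t n, uint32_t d)
{
    uint32_t q = 0, r = n;
    for (unsigned bit = 32; bit-- > 0; ) {
        /* (r >> bit) >= d is equivalent to r >= d * 2^bit,
           and guarantees that d << bit does not overflow. */
        if ((r >> bit) >= d) {
            r -= d << bit;     /* update remainder */
            q |= 1u << bit;    /* set this quotient bit */
        }
    }
    return (DivResult){ q, r };
}
```

For example, `udiv32(100, 7)` yields a quotient of 14 and a remainder of 2.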
Square Roots
• Square roots can be handled by the same techniques we used for division. We calculate the square root
of a 32-bit unsigned integer d. The answer is a 16-bit unsigned integer q and a 17-bit unsigned remainder
r.
unsigned usqr_simple(unsigned d, unsigned N)
{
    /* N is the number of root bits to compute (N >= 1) */
    unsigned t, q=0, r=d;
    do
    { /* calculate next root bit */
        N--; /* move down to next bit */
        t = 2*q+(1 << N); /* new r = old r - (t<<N) */
        if ( (r >> N) >= t ) /* if (r >= (t << N)) */
        {
            r -= (t << N); /* update remainder */
            q += (1 << N); /* update root */
        }
    } while (N);
    return q;
}
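The same routine can be specialised to a 32-bit input and made to return the remainder as well. The sketch below keeps the invariant r = d - q*q: trying to set bit N of the root changes q*q by (2*q + 2^N) * 2^N, which is exactly the `t << N` that is subtracted. The name `usqrt32` and the `rem` output parameter are hypothetical.

```c
#include <stdint.h>

/* Returns floor(sqrt(d)) and stores d - root*root in *rem. */
uint32_t usqrt32(uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = d;
    for (unsigned bit = 16; bit-- > 0; ) {
        uint32_t t = 2 * q + (1u << bit);   /* cost of setting this root bit */
        if ((r >> bit) >= t) {              /* same as r >= (t << bit) */
            r -= t << bit;                  /* update remainder */
            q += 1u << bit;                 /* update root */
        }
    }
    *rem = r;
    return q;
}
```

For d = 26 this gives a root of 5 with remainder 1; for the largest input, 0xFFFFFFFF, the root is 65535 and the remainder 131070, which needs 17 bits as stated above.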
ARM development tools
• Software development for the ARM is supported by a coherent range
of tools developed by ARM Limited, and there are also many third
party and public domain tools available, such as an ARM back-end for
the gcc C compiler.
• Since the ARM is widely used as an embedded controller where the
target hardware will not make a good environment for software
development, the tools are intended for cross-development (that is,
they run on a different architecture from the one for which they
produce code) from a platform such as a PC running Windows or a
suitable UNIX workstation.
C or assembler source files are compiled or assembled into ARM
object format (.aof) files, which are then linked into ARM image
format (.aif) files.

The image format files can be built to include the debug tables
required by the ARM symbolic debugger (ARMsd), which can load, run
and debug programs either on hardware such as the ARM
Development Board or using a software emulation of the ARM (the
ARMulator).
The ARM assembler
• The ARM assembler is a full macro assembler which produces ARM
object format output that can be linked with output from the C
compiler.
• Assembly source language is near machine-level, with most assembly
instructions translating into single ARM (or Thumb) instructions.
The linker
• The linker takes one or more object files and combines them into an
executable program.
• It resolves symbolic references between the object files and extracts
object modules from libraries as needed by the program.
ARMsd
• The ARM symbolic debugger is a front-end interface to assist in
debugging programs running either under emulation (on the
ARMulator) or remotely on a target system such as the ARM
development board. The remote system must support the
appropriate remote debug protocols either via a serial line or through
a JTAG test interface.
• At its most basic, ARMsd allows an executable program to be loaded
into the ARMulator or a development board and run. It allows the
setting of breakpoints, which are addresses in the code that, if
executed, cause execution to halt so that the processor state can be
examined.
ARMulator
• The ARMulator (ARM emulator) is a suite of programs that models the
behaviour of various ARM processor cores in software on a host
system. It can operate at various levels of accuracy:
• Instruction-accurate modelling gives the exact behavior of the system
state without regard to the precise timing characteristics of the
processor.
• Cycle-accurate modelling gives the exact behavior of the processor on
a cycle-by-cycle basis, allowing the exact number of clock cycles that a
program requires to be established.
• Timing-accurate modelling presents signals at the correct time within
a cycle, allowing logic delays to be accounted for.
ARM development board

• The ARM Development Board is a circuit board incorporating a range of
components and interfaces to support the development of ARM-based
systems.

• It includes an ARM core (for example, an ARM7TDMI), memory
components which can be configured to match the performance and
bus-width of the memory in the target system, and electrically
programmable devices which can be configured to emulate application-
specific peripherals.
