ON
SCAN-BASED EMULATION: A METHODOLOGY FOR
MULTIPROCESSOR REAL-TIME ANALYSIS
OF DSP APPLICATION
FOR
CSFEST-‘08
AT
AUTHORS:
The ability to analyze the proper execution of real-time applications in an embedded system is critical to their development and deployment. This applies to real-time applications ranging from mission critical to multimedia.

In embedded systems the ability to perform Real-Time Analysis (RTA) can involve a dedicated hardware and software capability with an end-to-end methodology that supports the transferring of data between the host and [...]

With a traditional emulator, the CPU to be emulated is usually removed from its socket and replaced with an emulator pod. The emulator pod typically has a replacement CPU, plus various amounts of random logic and memory to monitor what is happening on the CPU pins. With modern CPUs such as DSPs, the traditional approach has several problems. The first problem is the speed of newer DSP chips. Bus cycle times can be 25 nanoseconds or shorter, and all instructions typically
execute in a single cycle. This makes it difficult for a traditional emulator to allow emulation at full speed. The number of pins to monitor can be staggering, with chips having multiple 32-bit address and data buses, making a traditional emulator expensive. The second, and more serious, problem is that DSPs often have on-chip caches, pipelines, memory and peripherals. Sometimes a whole algorithm can execute without any activity on the CPU pins.

The solution to these problems is scan-based emulation. With scan-based emulation, the CPU is never removed from the socket; in fact it can be soldered directly onto the board. Instead, the CPU has a serial scan interface, allowing the emulator to scan the internals of the device through a standard connector.

The scan-based approach to emulation has many advantages:

• Emulation at full device speed - Since there is no logic needed to monitor what happens on the CPU bus, the emulator can allow the device to execute programs at full speed.
• Non-intrusive emulation - Since no logic is attached to any CPU pin, the CPU bus is not affected at all by the emulation process. The emulator will not affect the operation of the bus, as is so often the case with traditional emulators.
• In-circuit emulation - The CPU can be soldered to the board while emulating. This makes denser packaging possible, and also makes the emulator a manufacturing test tool.
• Full access to internal memory, caches, pipelines and registers - The complete state of the processor is visible to the outside through a scan interface.
• Complete access to the system from the CPU - Any peripheral or memory that the CPU can access in the system can also be accessed through the scan interface. The emulator can look at the system "through the eyes of the CPU". This makes it possible to debug and diagnose a system where nothing is working except the CPU itself.

The JTAG Interface

The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for testing individual devices on printed circuit boards, without the need to remove the devices from the board. This is accomplished by a method called boundary scan, whereby the state of each pin of each device (with some special logic on the device) is serially scanned out from the device.
Multiple devices can be daisy chained, and an entire PC board can therefore be scanned in a single scan chain. The same method can be used to scan out not only the state of a device's pins but also any internal information from the device, such as register values or memory locations; scan-based emulation was born as a consequence. The JTAG specification does not include the pin out for the JTAG connector. The extension to JTAG defines a 14-pin, 2-row, 0.1" spacing JTAG connector header, with pin out and physical dimensions common to all DSPs that support JTAG involved in this methodology [2]. During JTAG emulation, the emulator supplies the clock that scans the device. This means that the target clock speed is completely independent of the emulation clock, and the emulator can support targets running at any clock speed.

The boundary scan mechanism

Device architecture

The JTAG device architecture is based on the IEEE 1149.1 architecture. In this specification, there are four dedicated pins, plus an optional reset pin, collectively known as the Test Access Port (TAP). They are:

• Test Data In (TDI)
• Test Data Out (TDO)
• Test Mode Select (TMS)
• Test Clock (TCK)
• Test Reset (TRST, optional)

A boundary scan cell is connected to each boundary scan register on each device that is being scanned. The architecture further specifies a finite state machine, the TAP controller, with inputs TMS and TCK. There is an Instruction Register (IR) holding the current instruction, a bypass register, and an optional 32-bit identification register for permanent identification.

Principles of Boundary Scan

Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel load operations cause signals from the core logic to be loaded into the output cells. Parallel unload operations cause the signals to be loaded from the input cells to the core logic. Data is shifted in serial mode by daisy chaining devices.

The figure below shows the TDO of each device connected to the TDI of the next device in the scan chain. It is possible to avoid scanning any device by placing it in bypass mode.

Typically, the system architect is responsible for determining the type of arrangement of devices (homogeneous or heterogeneous), their order in the scan chain, and whether they will be placed in bypass.
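
As a rough illustration of the TAP controller and the daisy-chained scan path described above, the following Python sketch models the sixteen-state TAP state machine of IEEE 1149.1 (stepped by TMS on each TCK) and a chain of devices in which a bypassed device contributes a single bit to the scan path. The names `TAP_NEXT`, `tap_step`, `Device` and `shift_chain`, as well as the register lengths, are hypothetical and chosen for illustration; this is a software model under simplifying assumptions, not the emulator's actual implementation.

```python
# Sketch of an IEEE 1149.1 TAP controller and a daisy-chained scan path.
# All names and register sizes here are illustrative assumptions.

# TAP state -> (next state when TMS=0, next state when TMS=1), one step per TCK.
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def tap_step(state, tms):
    """Advance the TAP controller by one TCK cycle for the given TMS value."""
    return TAP_NEXT[state][1 if tms else 0]


class Device:
    """One device in the chain; its selected data register sits between TDI and TDO."""

    def __init__(self, name, boundary_bits, bypass=False):
        self.name = name
        # In bypass mode the TDI-to-TDO path is a single-bit register.
        self.reg = [0] if bypass else list(boundary_bits)

    def shift(self, tdi):
        """Clock one bit in at TDI and return the bit that leaves at TDO."""
        tdo = self.reg[-1]
        self.reg = [tdi] + self.reg[:-1]
        return tdo


def shift_chain(devices, bits_in):
    """Shift bits through the chain; each device's TDO feeds the next device's TDI."""
    bits_out = []
    for bit in bits_in:
        for dev in devices:
            bit = dev.shift(bit)  # outgoing bit becomes the next device's input
        bits_out.append(bit)      # TDO of the last device returns to the emulator
    return bits_out


if __name__ == "__main__":
    # Walk from reset into Shift-DR with the TMS sequence 0, 1, 0, 0.
    state = "Test-Logic-Reset"
    for tms in (0, 1, 0, 0):
        state = tap_step(state, tms)
    print("TAP state:", state)

    # Device B is bypassed, so it adds one bit to the scan path instead of four.
    chain = [
        Device("A", [1, 0, 1]),
        Device("B", [0, 0, 1, 1], bypass=True),
        Device("C", [1, 1]),
    ]
    length = sum(len(d.reg) for d in chain)
    print("scan path length:", length)
    print("bits scanned out:", shift_chain(chain, [0] * length))
```

In this model, placing device B in bypass shortens the scan path from nine bits to six, which is the reason bypass is useful for skipping devices that are not of interest in a long chain.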
Real-Time Analysis (RTA)

The following application in the domain of high energy physics illustrates the necessity for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron Collider generates 15 million particle collisions per second. These particle collisions result in the creation of subatomic particles that travel through a spectrometer. The data output from the spectrometer is on the order of terabytes per second and must be analyzed in real time. The analysis engine comprises a massively parallel arrangement of heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of applying algorithms that reconstruct and filter the collision data. The result is a select set of interesting collisions from which physicists can study some of the remaining mysteries of matter and antimatter in the universe [3].

Analysis of real-time embedded applications is necessary at several points during the software life cycle: during development, as a means to debug; towards the end of development, for tuning performance; and after the application is deployed, for failure analysis. Logic analyzers have been used for many years to clamp onto the data buses of the target and monitor the data flow of the application in order to analyze application behavior. Aside from the fact that logic analyzers are expensive ($15K to $60K for a DSP), the increase in system-level integration over the years has resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern microprocessors are architected with specialized hardware counters that can be programmed for the purpose of tracing applications.

Traditionally these counter registers have been used to inform the design of microarchitectural features such as caches and TLBs. Although they can be used to trace the behavior of applications at a very fine level of granularity, they cannot easily be used as an RTA mechanism. An ancillary yet significant issue is that such analysis requires the user to have advanced knowledge of the target microarchitecture in order to interpret the data.
Fig 2: Debugger with real-time data exchange
Performance
An important consideration in
providing a methodology for
multiprocessor RTA is
Performance