Lecture 4

Introduction to Digital Signal Processors (DSPs)
Dr. Konstantinos Tatas
Outline/objectives
Identify the most important DSP processor architecture features and how they relate to DSP applications
Understand the types of code appropriate for DSP implementation
ACOE343 - Embedded Real-Time Processor Systems Frederick University
What is a DSP?
A specialized microprocessor for real-time DSP applications, such as:
Digital filtering (FIR and IIR)
FFT
Convolution, matrix multiplication, etc.
[Figure: ANALOG INPUT -> ADC -> DIGITAL INPUT -> DSP -> DIGITAL OUTPUT -> DAC -> ANALOG OUTPUT]
Hardware used in DSP

                    ASIC        FPGA      GPP      DSP
Performance         Very high   High      Medium   Medium-high
Flexibility         Very low    High      High     High
Power consumption   Very low    Low       Medium   Low-medium
Development time    Long        Medium    Short    Short
Common DSP features

Harvard architecture
Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)
Single Instruction - Multiple Data (SIMD)
Very Long Instruction Word (VLIW) architecture
Pipelining
Saturation arithmetic
Zero overhead looping
Hardware circular addressing
Cache
DMA
Harvard Architecture
Physically separate memories and paths for instruction and data
[Figure: CPU connected by separate paths to DATA MEMORY and PROGRAM MEMORY]
Single-Cycle MAC unit

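[Figure: a Multiplier forms each product $a_i x_i$; an Adder adds it to the running sum held in a Register, which then holds $a_i x_i + a_{i-1} x_{i-1} + \dots$]

$$\sum_{i=0}^{n-1} a_i x_i$$

Can compute a sum of n products in n cycles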
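
The accumulation performed by the MAC unit can be sketched in C as a dot product. This is a minimal illustration, not code from the lecture: the function name `mac_dot`, the 16-bit operand type and the 32-bit accumulator are all assumptions. On a DSP with a hardware MAC unit, each loop iteration could map to one single-cycle MAC instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of n products a[i] * x[i], one multiply-accumulate per iteration.
 * Assumption: the 32-bit accumulator is wide enough for the chosen n. */
int32_t mac_dot(const int16_t *a, const int16_t *x, size_t n)
{
    int32_t acc = 0;                     /* plays the role of the register */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * x[i];     /* acc = acc + a[i] * x[i] */
    }
    return acc;
}
```

For example, with a = {1, 2, 3} and x = {4, 5, 6} the function returns 1*4 + 2*5 + 3*6 = 32.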
Single Instruction - Multiple Data (SIMD)

A technique for data-level parallelism by employing a number of processing elements working in parallel
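
As a rough model of the idea, the sketch below applies one operation (addition) to several data elements per step. It is plain portable C rather than real SIMD intrinsics; the lane count `LANES` and the names `simd_add4` and `vector_add` are hypothetical. On SIMD hardware the inner loop would be a single instruction executed by parallel processing elements.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4   /* hypothetical number of processing elements */

/* Models one SIMD instruction: the same add applied to LANES elements. */
static void simd_add4(const int16_t *b, const int16_t *c, int16_t *a)
{
    for (int lane = 0; lane < LANES; lane++) {
        a[lane] = (int16_t)(b[lane] + c[lane]);
    }
}

/* Element-wise a = b + c. Assumption: n is a multiple of LANES. */
void vector_add(const int16_t *b, const int16_t *c, int16_t *a, size_t n)
{
    for (size_t i = 0; i < n; i += LANES) {
        simd_add4(b + i, c + i, a + i);   /* n / LANES steps instead of n */
    }
}
```

The pattern fits SIMD because every element undergoes the same operation and no element depends on another.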
Very Long Instruction Word (VLIW)

A technique for instruction-level parallelism: instructions without dependencies (known at compile time) are executed in parallel
Example of a single VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
[Figure: one VLIW instruction holding F=a+b, c=e/g, d=x&y and w=z*h; each operation is sent to its own PU, which takes two operands as inputs and produces F, c, d or w]
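
Whether a group of operations fits in one VLIW instruction depends on their dependencies. The sketch below counts how many VLIW instructions a short program needs under a deliberately simple, assumed policy: operations are packed in program order, and a new VLIW instruction starts when the current one is full or when an operation reads or writes a variable written earlier in the same VLIW instruction. The `Instr` type and `count_bundles` are hypothetical names; a real compiler may also reorder operations.

```c
#include <stddef.h>

typedef struct {
    char dst;          /* variable written by the operation */
    char src1, src2;   /* variables read by the operation */
} Instr;

/* Number of VLIW instructions needed for prog[0..n-1] when at most
 * `width` operations fit in one VLIW instruction (in-order packing). */
size_t count_bundles(const Instr *prog, size_t n, size_t width)
{
    size_t bundles = 0;
    size_t start = 0;                      /* first op of the current bundle */
    for (size_t i = 0; i < n; i++) {
        int conflict = (i - start >= width);           /* bundle is full */
        for (size_t j = start; j < i && !conflict; j++) {
            if (prog[j].dst == prog[i].src1 ||         /* read after write */
                prog[j].dst == prog[i].src2 ||
                prog[j].dst == prog[i].dst) {          /* write after write */
                conflict = 1;
            }
        }
        if (i == 0 || conflict) {          /* open a new bundle */
            bundles++;
            start = i;
        }
    }
    return bundles;
}
```

With the four operations F=a+b; c=e/g; d=x&y; w=z*h and a width of 4, the count is 1, since none of them uses a result of another. A pair such as p=a+b; q=p*c needs 2, because the second operation reads p.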
CISC vs. RISC vs. VLIW
Pipelining
DSPs commonly feature deep pipelines
TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):
Fetch
    Program Address Generate (PG)
    Program Address Send (PS)
    Program ready wait (PW)
    Program receive (PR)
Decode
    Dispatch (DP)
    Decode (DC)
Execute
    6 to 10 phases
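
A common back-of-the-envelope estimate for pipelined execution can be written as a one-line function. This is a sketch under idealized assumptions that are not stated in the slides: one instruction enters the pipeline per cycle, there are no stalls, and every instruction passes through all phases. The name `pipeline_cycles` is hypothetical.

```c
/* Cycles needed to complete n back-to-back instructions in an ideal
 * pipeline with `depth` phases: the first instruction finishes after
 * `depth` cycles, and each later one finishes one cycle after that. */
unsigned pipeline_cycles(unsigned depth, unsigned n)
{
    if (n == 0) {
        return 0;               /* nothing to execute */
    }
    return depth + n - 1;
}
```

For instance, 10 instructions in a 5-phase pipeline take 5 + 10 - 1 = 14 cycles under these assumptions, compared with 50 cycles without pipelining.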
Saturation Arithmetic
A fixed range for operations such as addition and multiplication
Results that would overflow or underflow are clamped to the maximum or minimum allowed value, respectively
Associativity and distributivity no longer apply
Signed one-byte saturation arithmetic examples:
64 + 69 = 127
-127 - 5 = -128
(64 + 70) - 25 = 102
64 + (70 - 25) = 109
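
Saturating addition on signed bytes can be sketched in C as follows. DSPs provide this in hardware; the helper name `sat_add8` is hypothetical and the code only models the behaviour.

```c
#include <stdint.h>

/* Signed 8-bit saturating addition: results outside [-128, 127]
 * are clamped to the nearest limit instead of wrapping around. */
int8_t sat_add8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;    /* widen so the true sum cannot overflow */
    if (sum > INT8_MAX) {
        return INT8_MAX;          /* clamp to +127 */
    }
    if (sum < INT8_MIN) {
        return INT8_MIN;          /* clamp to -128 */
    }
    return (int8_t)sum;
}
```

Here `sat_add8(64, 69)` gives 127 and `sat_add8(-127, -5)` gives -128. The loss of associativity shows up when the grouping changes: `sat_add8(sat_add8(64, 70), -25)` gives 102, while `sat_add8(64, sat_add8(70, -25))` gives 109.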
Examples
Perform the following operations using one-byte saturation arithmetic
0x77 + 0x99 =
0x4 * 0x42 =
0x3 * 0x51 =
Zero Overhead Looping

Hardware support for loops with a constant number of iterations, using hardware loop counters and loop buffers
No branching
No loop overhead
No pipeline stalls or branch prediction
No need for loop unrolling
Hardware Circular Addressing

A circular buffer is a data structure implementing a fixed-length queue of fixed-size objects: objects are added at the head of the queue and removed from the tail
Requires at least 2 pointers (head and tail)
Extensively used in digital filtering:
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: circular buffer holding X[n], X[n-1], X[n-2], X[n-3] between Head and Tail, shown at Cycle 1 and Cycle 2]
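
The filter equation can be evaluated over a circular buffer as sketched below. The wrap-around is done in software here with a modulo and a conditional, whereas hardware circular addressing does it at no extra cost. The names `FirState` and `fir_step` and the filter length `TAPS` are assumptions. Because the buffer is always full, the sketch keeps only a head index; the tail is the slot the next sample overwrites.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 4   /* hypothetical filter length, k + 1 in the equation */

typedef struct {
    int16_t x[TAPS];   /* delay line holding the last TAPS samples */
    size_t head;       /* index of the most recent sample x[n] */
} FirState;

/* Store one new sample and return y[n] = a[0]*x[n] + ... + a[k]*x[n-k].
 * Assumption: the state was zero-initialized, e.g. FirState s = {0}; */
int32_t fir_step(FirState *s, const int16_t a[TAPS], int16_t sample)
{
    s->head = (s->head + 1) % TAPS;    /* advance head with wrap-around */
    s->x[s->head] = sample;            /* overwrite the oldest sample */

    int32_t acc = 0;
    size_t idx = s->head;
    for (size_t j = 0; j < TAPS; j++) {
        acc += (int32_t)a[j] * s->x[idx];        /* a[j] * x[n-j] */
        idx = (idx == 0) ? TAPS - 1 : idx - 1;   /* step back, wrapping */
    }
    return acc;
}
```

Feeding a unit impulse (1 followed by zeros) through a filter with coefficients {1, 2, 3, 4} returns the coefficients in order, which is a quick way to check the indexing.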
Direct Memory Access (DMA)

The feature that allows peripherals to access main memory without the intervention of the CPU
Typically, the CPU initiates the DMA transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete
Can create cache coherency problems (the data in the cache may differ from the data in the external memory after DMA)
Requires a DMA controller
Cache memory
Separate instruction and data L1 caches (Harvard architecture)
Cache coherence protocols required, since most systems use DMA
DSP vs. Microcontroller

DSP
    Harvard architecture
    VLIW/SIMD (parallel execution units)
    No bit-level operations
    Hardware MACs
    DSP applications
Microcontroller
    Mostly von Neumann architecture
    Single execution unit
    Flexible bit-level operations
    No hardware MACs
    Control applications
Examples
Estimate how long the following code fragment will take to execute on:
A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication and 1 cycle for addition
A DSP running at 500 MHz with zero overhead looping, 6 independent ALUs and 2 independent single-cycle MAC units

for (i=0; i<8; i++) {
    a[i] = 2*i + 3;
    b[i] = 3*i + 5;
}
Review Questions
Which of the following code fragments is appropriate for SIMD implementation?
(a) a[0]=b[0]+c[0]; a[2]=b[2]+c[2]; a[4]=b[4]+c[4]; a[6]=b[6]+c[6];
(b) a[0]=b[0]&c[0]; a[0]=b[0]%c[0]; a[0]=b[0]+c[0]; a[0]=b[0]/c[0];
Can the following instructions be merged into one VLIW instruction? If not, how many VLIW instructions are needed?
a=b+c; d=c/e; f=d&a; g=b%c;
Review Questions
Which of the following is not a typical DSP feature?
Dedicated multiplier/MAC
Von Neumann memory architecture
Pipelining
Saturation arithmetic
Which implementation would you choose for lowest power consumption?

ASIC
FPGA
General-Purpose Processor
DSP
Examples
How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions per VLIW word? How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the TMS320C6x pipeline architecture described under Pipelining, with 6 execute phases?

ADD a1,a2,a3    ; a3 = a1+a2
SUB b1,b3,b4    ; b4 = b1-b3
MUL a2,a3,a5    ; a5 = a2*a3
MUL b3,b4,b2    ; b2 = b3*b4
AND a7,a0,a1    ; a1 = a7 AND a0
MUL a3,a4,a5    ; a5 = a3*a4
OR  a6,a3,a2    ; a2 = a6 OR a3
References
R. Chassaing, DSP Applications Using C and the TMS320C6x DSK, Wiley, 2002
Texas Instruments, TMS320C64x datasheets
Analog Devices, ADSP-21xx Processors