Feipei Lai, 33664924, flai@ntu.edu.tw, CSIE 419
Grade: Mid-term 30%, Paper presentation 40%, Final 30%
Key references
Intl Conf. on CAD
Intl Symp. on Low Power Electronics and Design
IEEE Trans. on CAD
ACM Trans. on DAES
IEEE/ACM DAC
Intl Symp. on Circuits and Systems
IEEE Intl Solid-State Circuits Conference
Outline
1. Low-Power CMOS VLSI Design
2. Physics of Power Dissipation in CMOS FET Devices
3. Power Estimation
4. Synthesis for Lower Power
5. Low Voltage CMOS Circuits
6. Low-Power SRAM Architectures
7. Energy Recovery Techniques
8. Software Design for Low Power
9. Low Power SOC design
10. Embedded Software
Motivation
Energy-efficient computing is required by:
Mobile electronic systems
Large-scale electronic systems
The quest for energy efficiency affects all aspects of system design
Packaging costs; cooling costs
Power supply rail design
Noise immunity
Technology directions
Year                 1999   2002   2005   2008   2011   2014
Feature size (nm)       -    130    100     70     50     35
Mtrans/cm2              -     26     47    115    284    701
Chip size (mm2)       170    214    235    269    308    354
Signal pins/chip      768   1024   1204   1280   1408   1472
Clock rate (MHz)      600    800   1100   1400   1800   2200
Wiring levels           7      8      9      9     10     10
Voltage (V)           1.8    1.5    1.2    0.9    0.6    0.6
Power (W)              90    130    160    170    174    183
Just as with the replacement of HBTs (heterojunction bipolar transistors) by CMOS, a lower-performance, lower-power technology will ultimately deliver superior system throughput because of the higher integration it enables.
The International Technology Roadmap for Semiconductors (ITRS) projects that MOSFETs with an equivalent oxide thickness of 5 Å and junction depths of less than 10 nm will be in production in the next decade. While MOSFETs with 6 nm gate lengths have been demonstrated, performance and manufacturability problems remain.
Design:
HW: computation, storage and communication
SW: application and system software
Run-time management:
Run-time system management and control of all units including peripherals
Examples
Modeling:
Choice of algorithm
Application-specific hardware vs. programmable hardware (software) implementation
Word-width and precision
Design:
Structural trade-off
Resource sharing and logic supplies
Management:
Operating system
Dynamic power management
System models
Modeling is an abstraction:
Represent important features and hide unnecessary details
Functional models:
Capture functionality and requirements
Executable models:
Support hw and/or sw compilation and simulation
Implementation models:
Describe target realization
Algorithm selection
Inputs
A target macro-architecture
Abstract functional/executable spec.
Constraints
Library of algorithms
Objective
Select the most energy-efficient algorithm that satisfies constraints
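The selection step can be sketched as a filter-then-minimize pass over the algorithm library. This is only a minimal sketch: the `Algorithm` record, the entry names, and all energy and latency figures are hypothetical placeholders, and a single latency bound stands in for the full constraint set.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Algorithm:
    name: str          # hypothetical library entry
    energy_mj: float   # estimated energy per run on the target macro-architecture
    latency_ms: float  # estimated execution time per run


def select_algorithm(library: Iterable[Algorithm],
                     max_latency_ms: float) -> Optional[Algorithm]:
    """Return the lowest-energy algorithm that meets the latency bound, or None."""
    feasible = [a for a in library if a.latency_ms <= max_latency_ms]
    return min(feasible, key=lambda a: a.energy_mj, default=None)


# Illustrative numbers only, not measurements.
LIBRARY = [
    Algorithm("direct_form", energy_mj=4.0, latency_ms=2.0),
    Algorithm("fast_transform", energy_mj=2.5, latency_ms=5.0),
    Algorithm("table_lookup", energy_mj=1.0, latency_ms=12.0),
]

best = select_algorithm(LIBRARY, max_latency_ms=6.0)
```

With these placeholder figures the cheapest entry misses the 6 ms bound, so the selection falls back to the next cheapest feasible one.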
Approximate processing
Introducing well-controlled errors can be advantageous for power
Reduced data width (coarse discretization)
Layered algorithms (successive approximations)
Lossy communication
Processing elements
Several classes of PEs
General-purpose processors (e.g. RISC core)
Digital signal processors (e.g. VLIW core)
Programmable logic (e.g. LUT-based FPGA)
Specialized processors (e.g. custom DCT core)
Constrained optimization
Design space
Who does what and when (binding & scheduling)
Supply voltage of the various PEs:
$T_{CLK} = K \cdot V_{dd} / (V_{dd} - V_t)^2$
Design target
Minimize power
Performance constraint (e.g. T_iteration = 21 sec)
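The trade-off between supply voltage and cycle time can be made concrete with the delay model T_CLK = K·Vdd/(Vdd − Vt)^2. The sketch below solves that model for the lowest Vdd that still meets a cycle-time bound. The values of K, Vt and the nominal Vdd are assumed example numbers, and the Vdd^2 energy scaling is the usual first-order model for dynamic power, not something taken from a particular process.

```python
import math


def clock_period(vdd: float, k: float, vt: float) -> float:
    """T_CLK = K * Vdd / (Vdd - Vt)^2, valid for Vdd > Vt."""
    if vdd <= vt:
        raise ValueError("Vdd must exceed Vt")
    return k * vdd / (vdd - vt) ** 2


def min_vdd_for_period(t_max: float, k: float, vt: float) -> float:
    """Smallest Vdd (> Vt) whose clock period does not exceed t_max.

    Solves t_max * (V - Vt)^2 = K * V for V and takes the root above Vt.
    """
    b = 2.0 * t_max * vt + k
    disc = b * b - 4.0 * (t_max * vt) ** 2   # always positive for K, Vt, t_max > 0
    return (b + math.sqrt(disc)) / (2.0 * t_max)


# Assumed example values (arbitrary units for K).
K, VT, VDD_NOM = 1.0, 0.5, 3.3
t_nom = clock_period(VDD_NOM, K, VT)
vdd_slow = min_vdd_for_period(2.0 * t_nom, K, VT)   # tolerate a 2x longer cycle
energy_ratio = (vdd_slow / VDD_NOM) ** 2            # dynamic energy scales with Vdd^2
```

Because the clock period falls monotonically as Vdd rises above Vt, the root returned is the minimum feasible supply, so `energy_ratio` is the best case for that PE under this model.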
Datasheet Analysis
PDA          #Comp   Vdd   Iidle   Ion    %on    %idle   I(mA)
Processor    1       3.3   0.5     50     0.7    0.3     36.15
DRAM         1       3.3   0.1     12     0.7    0.3     8.43
FLASH        5       3.3   0.0     9      0.7    0.3     31.5
IR           1       3.3   0.0     64     0.05   0.95    3.2
RTC          1       3.3   0.0     0.1    1      0       0.1
DC-DC        1       -     0.1     5.5    0.99   0.01    5.44
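The I(mA) column appears to be a duty-cycle-weighted average of the on and idle currents, multiplied by the component count. The sketch below recomputes it under that assumption, also assuming Ion and Iidle are given in mA. It reproduces the DRAM, FLASH, IR and RTC entries exactly and the DC-DC entry to within rounding; for the processor it gives 35.15 rather than the listed 36.15, so either that entry or the assumed formula is slightly off.

```python
# (name, count, i_idle_mA, i_on_mA, frac_on, frac_idle), copied from the datasheet figures
COMPONENTS = [
    ("Processor", 1, 0.5, 50.0, 0.70, 0.30),
    ("DRAM",      1, 0.1, 12.0, 0.70, 0.30),
    ("FLASH",     5, 0.0,  9.0, 0.70, 0.30),
    ("IR",        1, 0.0, 64.0, 0.05, 0.95),
    ("RTC",       1, 0.0,  0.1, 1.00, 0.00),
    ("DC-DC",     1, 0.1,  5.5, 0.99, 0.01),
]


def average_current_ma(count, i_idle, i_on, frac_on, frac_idle):
    """Assumed model: count * (Ion * %on + Iidle * %idle)."""
    return count * (i_on * frac_on + i_idle * frac_idle)


per_component = {name: average_current_ma(*rest) for name, *rest in COMPONENTS}
total_ma = sum(per_component.values())
```

Sorting `per_component` by value shows where the current budget goes, which is the point of a datasheet analysis: it identifies the components worth optimizing first.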
System Design
Input
The output of the conceptualization phase
A macro-architectural template
A hardware-software partition
Component by component constraints
Output
Complete hardware design
Design process
Specify computation, storage, template components, and software
Synergic process
Multi-frequency clocks
Application-specific processors
Parameterized processors tailored to a specific application
Optimally exploit parallelism
Eliminate unneeded features
Just-in-time computation
Stretch execution time up to the max tolerable
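One way to picture just-in-time computation is to pick, from a set of frequency/voltage operating points, the slowest one that still meets the deadline. The sketch below does this; the operating points are hypothetical, and energy is modelled to first order as cycles times Vdd^2.

```python
from typing import Optional, Sequence, Tuple

# (frequency in MHz, supply voltage in V); hypothetical operating points
OPERATING_POINTS: Sequence[Tuple[float, float]] = [
    (50.0, 1.0),
    (100.0, 1.5),
    (200.0, 2.5),
]


def slowest_feasible_point(cycles: float, deadline_us: float,
                           points: Sequence[Tuple[float, float]] = OPERATING_POINTS
                           ) -> Optional[Tuple[float, float]]:
    """Pick the lowest-frequency point that still finishes `cycles` by the deadline."""
    # cycles / MHz gives the execution time in microseconds.
    feasible = [(f, v) for f, v in points if cycles / f <= deadline_us]
    return min(feasible, default=None)   # tuples compare by frequency first


def relative_energy(cycles: float, vdd: float) -> float:
    """Dynamic energy up to a constant factor: cycles * Vdd^2."""
    return cycles * vdd ** 2


point = slowest_feasible_point(cycles=10_000, deadline_us=150.0)
fastest = max(OPERATING_POINTS)
saving = 1.0 - relative_energy(10_000, point[1]) / relative_energy(10_000, fastest[1])
```

Running at the fastest point and then idling would finish early but spend more energy per cycle; stretching the execution to the deadline trades that slack for a lower supply voltage.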
Variable-supply architecture
High-efficiency adjustable DC-DC converter
Adjustable synchronization
Variable-frequency clock generator
Self-timed circuits
Memory optimization
Custom data processors
Computation is less critical than data storage (for data-dominated applications)
General-purpose processors
A significant fraction of system power is consumed by memories
Optimization approaches
Fixed memory access patterns
Optimize memory architecture
Data encoding
Theoretical results:
Bounds on transition activity reduction:
The higher the entropy rate of the source, the lower the gain achievable by coding
Practical applications:
Processor-memory (and other) busses
Data busses, address busses
When INV = 1
Data is the complement of the remaining bus lines
Performance:
Peak: at most n/2 bus lines switch
Average: code is optimal. No other code with 1-bit redundancy can do better
Implementation issues:
Difference (XOR) of two data samples and majority vote
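The XOR-plus-majority-vote scheme can be sketched in a few lines. This is a minimal software model, assuming the bus starts with all lines low and counting the INV line among the lines that may switch; the function names are illustrative.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def bus_invert_encode(words, n_bits):
    """Encode a sequence of n-bit words; returns a list of (bus_lines, inv) pairs."""
    mask = (1 << n_bits) - 1
    lines, inv = 0, 0            # assumed initial bus state: all lines low
    encoded = []
    for word in words:
        word &= mask
        # Lines that would toggle if the word were sent uninverted (INV = 0).
        toggles = popcount(lines ^ word) + inv
        if 2 * toggles > n_bits:            # majority vote
            lines, inv = word ^ mask, 1     # send the complement, assert INV
        else:
            lines, inv = word, 0
        encoded.append((lines, inv))
    return encoded


def bus_invert_decode(encoded, n_bits):
    mask = (1 << n_bits) - 1
    return [(lines ^ mask) if inv else lines for lines, inv in encoded]


def transitions(encoded):
    """Per-transfer count of switching lines, INV included, starting from all-low."""
    prev_lines, prev_inv, counts = 0, 0, []
    for lines, inv in encoded:
        counts.append(popcount(prev_lines ^ lines) + (prev_inv ^ inv))
        prev_lines, prev_inv = lines, inv
    return counts


words = [0x00, 0xFF, 0x0F, 0xF0, 0xAA, 0x55, 0xFE, 0x01]
encoded = bus_invert_encode(words, 8)
```

For an even bus width n, no transfer in this model toggles more than n/2 lines, which is the peak bound stated for the code.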
Word-oriented machines:
Increments by 4 (32 bit) or by 8 (64 bit).
Modify Gray code to switch 1 bit per increment
Gray code adder for jumps
Harder to partition
Convert to Gray code after update
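One simple way to get a single bit flip per word-sized increment is to Gray-code only the word index and pass the byte-offset bits through unchanged. The sketch below takes the "convert after update" route: the address is updated in binary and then encoded. The shifted encoding is an assumed realization of the modified Gray code, not necessarily the one the slides have in mind.

```python
def to_gray(b: int) -> int:
    """Standard binary-reflected Gray code."""
    return b ^ (b >> 1)


def from_gray(g: int) -> int:
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def word_gray_encode(addr: int, shift: int = 2) -> int:
    """Gray-code the word index and keep the byte-offset bits unchanged.

    shift=2 matches increments of 4 (32-bit words); shift=3 matches increments of 8.
    """
    offset = addr & ((1 << shift) - 1)
    return (to_gray(addr >> shift) << shift) | offset


def word_gray_decode(code: int, shift: int = 2) -> int:
    offset = code & ((1 << shift) - 1)
    return (from_gray(code >> shift) << shift) | offset


# Bit flips on the bus for a run of sequential 32-bit word addresses.
flips = [bin(word_gray_encode(a) ^ word_gray_encode(a + 4)).count("1")
         for a in range(0, 64, 4)]
```

Every entry of `flips` is 1, whereas plain binary addresses toggle several lines whenever a carry ripples.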
T0 Code
Add redundant line INC to bus
When INC = 0
Address is equal to remaining bus lines
When INC = 1
Transmitter freezes other bus lines
Receiver increments previously transmitted address by a parameter called stride
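The T0 rules translate directly into a transmitter/receiver pair. The sketch below models both ends in software; the stride default of 4 and the initial value on the address lines are assumptions made for the example.

```python
def t0_encode(addresses, stride=4):
    """Return a list of (bus_lines, inc) pairs for a stream of addresses."""
    encoded = []
    prev_addr = None
    lines = 0                       # assumed initial value on the address lines
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            inc = 1                 # in-sequence: freeze the address lines
        else:
            inc = 0
            lines = addr            # out-of-sequence: drive the address itself
        encoded.append((lines, inc))
        prev_addr = addr
    return encoded


def t0_decode(encoded, stride=4):
    """Receiver side: rebuild the address stream from (bus_lines, inc) pairs."""
    addresses = []
    prev_addr = None
    for lines, inc in encoded:
        addr = prev_addr + stride if inc else lines
        addresses.append(addr)
        prev_addr = addr
    return addresses


stream = [0x100, 0x104, 0x108, 0x10C, 0x200, 0x204, 0x100]
encoded = t0_encode(stream)
```

During the in-sequence run only the INC line changes once and the address lines stay frozen, so a long sequential burst costs almost no bus transitions.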
Dual encoding:
Good for time-multiplexed address busses
Use redundant line SEL:
SEL = 1 denotes addresses
SEL is already present in the bus interface
Dual T0:
Use T0 code when SEL is asserted.
Dual T0_BI:
Use T0 when SEL is asserted; otherwise use BI (bus-invert)
Impact of software
For a given hardware platform, the energy to realize a function depends on software
Operating system
Different algorithms to embody a function
Different coding styles
Application software compilation
Coding styles
Use processor-specific instruction style:
Function calls style
Conditionalized instructions (for ARM)
Instruction-level analysis
Analyze loop execution containing specific instructions
Loop should be long enough to neglect overhead and short enough to avoid cache misses
About 200 instructions
Cold scheduling:
Reorder instructions to reduce inter-instruction effects on instruction bus
Consider instruction op-codes
Inter-instruction cost is op-code Hamming distance
Use list scheduler where priority criterion is tied to Hamming distance
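A list scheduler with a Hamming-distance priority can be sketched as a greedy loop over the ready set. This is a simplified illustration, not a production scheduler: the op-code bit patterns and the dependency sets are hypothetical, latencies and resource limits are ignored, and the instruction bus is assumed to start at all zeros.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def cold_schedule(opcodes, deps):
    """Greedy list scheduling that favours small op-code Hamming distance.

    opcodes: dict instruction name -> op-code bits (int)
    deps:    dict instruction name -> set of names that must be issued first
    """
    remaining = dict(opcodes)
    scheduled, prev = [], 0          # assumed: instruction bus starts at all zeros
    while remaining:
        ready = [n for n in remaining
                 if deps.get(n, set()) <= set(scheduled)]
        if not ready:
            raise ValueError("dependency cycle")
        # Priority: fewest bit flips relative to the previously issued op-code;
        # the name breaks ties so the result is deterministic.
        pick = min(ready, key=lambda n: (hamming(prev, remaining[n]), n))
        scheduled.append(pick)
        prev = remaining.pop(pick)
    return scheduled


def switching_cost(order, opcodes):
    """Total op-code bit flips on the instruction bus for a given issue order."""
    prev, total = 0, 0
    for name in order:
        total += hamming(prev, opcodes[name])
        prev = opcodes[name]
    return total


# Hypothetical 4-bit op-codes; "d" must follow "b".
OPCODES = {"a": 0b0001, "b": 0b1110, "c": 0b0011, "d": 0b1111}
DEPS = {"d": {"b"}}

order = cold_schedule(OPCODES, DEPS)
```

On this toy block the program order a, b, c, d costs 10 bit flips, while the greedy order costs 6 and still issues "b" before "d". A greedy choice is not guaranteed to be optimal in general.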
Software pipelining
Decreases the number of stalls by fetching instructions from different iterations
Power manager:
Observes and controls the system
Power consumption of power manager is negligible
Components may:
Self-manage state transitions
Be controlled externally
Power manager:
Abstraction of power control unit
May be realized in hardware or software
Predictive techniques
Observe time-varying workload
Predict idle period Tpred ~ Tidle
Go to sleep state if Tpred is long enough to amortize state transition cost
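The "long enough to amortize" test is usually expressed as a break-even time. The sketch below pairs a simple energy-balance break-even with an exponential-average predictor. The predictor is one common choice picked here for illustration, and the power and transition figures are assumed example values (mJ, ms, and mJ/ms).

```python
def break_even_time(e_transition, t_transition, p_idle, p_sleep):
    """Idle length above which sleeping uses less energy than staying idle.

    Sleeping through an idle period T is modelled as
    e_transition + p_sleep * (T - t_transition); staying idle costs p_idle * T.
    The result is never shorter than the transition itself.
    """
    t_energy = (e_transition - p_sleep * t_transition) / (p_idle - p_sleep)
    return max(t_transition, t_energy)


class ExponentialAveragePredictor:
    """Tpred[n+1] = a * Tidle[n] + (1 - a) * Tpred[n]  (one possible predictor)."""

    def __init__(self, a=0.5, initial=0.0):
        self.a = a
        self.t_pred = initial

    def observe(self, t_idle):
        self.t_pred = self.a * t_idle + (1.0 - self.a) * self.t_pred
        return self.t_pred


def should_sleep(t_pred, t_be):
    return t_pred > t_be


# Assumed example: 10 mJ and 5 ms per sleep/wake round trip, 1.0 vs 0.1 mJ/ms.
T_BE = break_even_time(e_transition=10.0, t_transition=5.0, p_idle=1.0, p_sleep=0.1)

predictor = ExponentialAveragePredictor(a=0.5)
decisions = [should_sleep(predictor.observe(t), T_BE) for t in (20.0, 20.0, 2.0)]
```

The predictor needs two long idle periods before it crosses the break-even time, and one short period is enough to pull it back below, which illustrates how the smoothing factor trades responsiveness against mispredictions.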
Implementations of DPM
Shut down idle components
Gate clock of idle units
Clock setting and voltage setting
Support multiple-voltage multiple-frequency components
Components with multiple working power states