1 Low-Power

Low Power System Design
Feipei Lai
33664924
flai@ntu.edu.tw
CSIE 419
Grade: Mid-term 30%, Paper presentation 40%, Final 30%
Key references
Intl Conf. on CAD
Intl Symp. on Low Power Electronics and Design
IEEE Trans. on CAD
ACM Trans. on DAES
IEEE/ACM DAC
Intl Symp. on Circuits and Systems
IEEE Intl Solid-State Circuits Conference
Outline
1. Low-Power CMOS VLSI Design
2. Physics of Power Dissipation in CMOS FET Devices
3. Power Estimation
4. Synthesis for Lower Power
5. Low Voltage CMOS Circuits
6. Low-Power SRAM Architectures
7. Energy Recovery Techniques
8. Software Design for Low Power
9. Low Power SOC design
10. Embedded Software
Motivation
Energy-efficient computing is required by:
  Mobile electronic systems
  Large-scale electronic systems
The quest for energy efficiency affects all aspects of system design
  Packaging costs; cooling costs
  Power supply rail design
  Noise immunity
Technology directions
Year                   1999   2002   2005   2008   2011   2014
Feature size (nm)       180    130    100     70     50     35
M trans/cm2               7     26     47    115    284    701
Chip size (mm2)         170    214    235    269    308    354
Signal pins             768   1024   1204   1280   1408   1472
Clock rate (MHz)        600    800   1100   1400   1800   2200
Wiring levels             7      8      9      9     10     10
Voltage (V)             1.8    1.5    1.2    0.9    0.6    0.6
Power (W)                90    130    160    170    174    183
Just as with the replacement of HBTs (heterojunction bipolar transistors) by CMOS, a lower-performance, lower-power technology will ultimately deliver superior system throughput because of the higher integration it enables.
The International Technology Roadmap for Semiconductors (ITRS) projects that MOSFETs with an equivalent oxide thickness of 5 Å and junction depths of less than 10 nm will be in production in the next decade. While MOSFETs with 6 nm gate lengths have been demonstrated, performance and manufacturability problems remain.
Electronic system design

Conceptualization and modeling:
  From idea to model
Design:
  HW: computation, storage and communication
  SW: application and system software
Run-time management:
  Run-time system management and control of all units including peripherals
Examples
Modeling:
  Choice of algorithm
  Application-specific hardware vs. programmable hardware (software) implementation
  Word-width and precision
Design:
  Structural trade-off
  Resource sharing and logic supplies
Management:
  Operating system
  Dynamic power management
System models
Modeling is an abstraction:
  Represent important features and hide unnecessary details
Functional models:
  Capture functionality and requirements
Executable models:
  Support hw and/or sw compilation and simulation
Implementation models:
  Describe target realization
Algorithm selection
Inputs
  A target macro-architecture
  Abstract functional/executable spec.
  Constraints
  Library of algorithms
Objective
  Select the most energy-efficient algorithm that satisfies constraints
Issues in algorithm selection

Applicable only to general-purpose primitives with many alternative implementations
Pre-characterization on target architecture
Limited search space exploration
Approximate processing
Introducing well-controlled errors can be advantageous for power
  Reduced data width (coarse discretization)
  Layered algorithms (successive approximations)
  Lossy communication
Processing elements
Several classes of PEs
  General-purpose processors (e.g. RISC core)
  Digital signal processors (e.g. VLIW core)
  Programmable logic (e.g. LUT-based FPGA)
  Specialized processors (e.g. custom DCT core)
Tradeoff flexibility vs. efficiency
  Specialized is faster and power-efficient
  General-purpose is flexible and inexpensive
Constrained optimization
Design space
  Who does what and when (binding & scheduling)
  Supply voltage of the various PEs:
    $T_{CLK} = K \, V_{dd} / (V_{dd} - V_t)^2$
Design target
  Minimize power
  Performance constraint (e.g. T_iteration = 21 sec)
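The supply-voltage choice can be illustrated with a minimal sketch, assuming the delay model T_CLK = K Vdd/(Vdd - Vt)^2 from the slide. The function names and the numeric values of K, Vt and the maximum supply are hypothetical and chosen only for illustration. Because the clock period falls monotonically as Vdd rises above Vt, a bisection search finds the lowest supply voltage that still meets a clock-period constraint.

```python
def clock_period(vdd, k, vt):
    """Delay model from the slide: T_CLK = K * Vdd / (Vdd - Vt)^2."""
    return k * vdd / (vdd - vt) ** 2


def min_supply_voltage(t_max, k, vt, v_max, tol=1e-6):
    """Lowest Vdd in (vt, v_max] whose clock period does not exceed t_max."""
    if clock_period(v_max, k, vt) > t_max:
        raise ValueError("constraint cannot be met even at v_max")
    lo, hi = vt, v_max  # the period grows without bound as vdd approaches vt
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if clock_period(mid, k, vt) <= t_max:
            hi = mid  # constraint met: try a lower voltage
        else:
            lo = mid  # too slow: need a higher voltage
    return hi


# Hypothetical numbers: K = 1.0 ns*V, Vt = 0.5 V, maximum supply 3.3 V.
v_min = min_supply_voltage(t_max=1.5, k=1.0, vt=0.5, v_max=3.3)
# Dynamic energy per operation scales with Vdd^2.
energy_ratio = (v_min / 3.3) ** 2
```

With these assumed values a 1.5 ns period is met at about 1.5 V, and the dynamic energy per operation falls to roughly a fifth of its value at 3.3 V.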
Datasheet Analysis
PDA
Component   #Comp   Vdd (V)   Iidle (mA)   Ion (mA)   %on    %idle   I (mA)
Processor   1       3.3       0.5          50         0.7    0.3     36.15
DRAM        1       3.3       0.1          12         0.7    0.3     8.43
FLASH       5       3.3       0.0          9          0.7    0.3     31.5
IR          1       3.3       0.0          64         0.05   0.95    3.2
RTC         1       3.3       0.0          0.1        1      0       0.1
DC-DC       1                 0.1          5.5        0.99   0.01    5.44
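The I(mA) column appears to follow I = #Comp x (Ion x %on + Iidle x %idle). That formula is inferred from the datasheet figures rather than stated in the slides, and the names below are hypothetical. A minimal sketch of the calculation:

```python
# (name, count, i_idle_mA, i_on_mA, fraction_on, fraction_idle) from the PDA datasheet figures
COMPONENTS = [
    ("Processor", 1, 0.5, 50.0, 0.70, 0.30),
    ("DRAM",      1, 0.1, 12.0, 0.70, 0.30),
    ("FLASH",     5, 0.0,  9.0, 0.70, 0.30),
    ("IR",        1, 0.0, 64.0, 0.05, 0.95),
    ("RTC",       1, 0.0,  0.1, 1.00, 0.00),
    ("DC-DC",     1, 0.1,  5.5, 0.99, 0.01),
]


def average_current_ma(count, i_idle, i_on, frac_on, frac_idle):
    """Duty-cycle-weighted average current of `count` identical parts (assumed formula)."""
    return count * (i_on * frac_on + i_idle * frac_idle)


per_component = {
    name: average_current_ma(count, i_idle, i_on, f_on, f_idle)
    for name, count, i_idle, i_on, f_on, f_idle in COMPONENTS
}
total_ma = sum(per_component.values())
```

Under this assumption the sketch reproduces the listed values for DRAM, FLASH, IR, RTC and DC-DC to within rounding; for the Processor row it gives 35.15 mA rather than the listed 36.15 mA.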
System Design
Input
  The output of the conceptualization phase
    A macro-architectural template
    A hardware-software partition
    Component by component constraints
Output
  Complete hardware design
Design process
Specify computation, storage, template components, and software
Synergic process
Fundamental tradeoff: general-purpose vs. application-specific
  Flexibility has a cost in terms of power
Application-specific computational units

Synthesized from high-level executable specification (behavioral synthesis)
  Supply voltage reduction
  Load capacitance reduction
  Minimization of switching activity
CMOS Gate Power equations

$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$
Dynamic term: $C_L V_{DD}^2 f_{0 \to 1}$
Short-circuit term: $t_{sc} V_{DD} I_{peak} f_{0 \to 1}$
Leakage term: $V_{DD} I_{leakage}$
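As a worked illustration of the three terms (dynamic, short-circuit and leakage), the following minimal sketch evaluates each one separately. All parameter values are hypothetical and chosen only to show typical relative magnitudes; they are not taken from the slides.

```python
def gate_power(c_load, v_dd, f_switch, t_sc, i_peak, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts for one CMOS gate."""
    dynamic = c_load * v_dd ** 2 * f_switch          # C_L * V_DD^2 * f(0->1)
    short_circuit = t_sc * v_dd * i_peak * f_switch  # t_sc * V_DD * I_peak * f(0->1)
    leakage = v_dd * i_leak                          # V_DD * leakage current
    return dynamic, short_circuit, leakage


# Hypothetical gate: 50 fF load, 1.8 V supply, 100 MHz switching rate,
# 50 ps short-circuit interval, 100 uA peak current, 1 nA leakage current.
dynamic, short_circuit, leakage = gate_power(
    c_load=50e-15, v_dd=1.8, f_switch=100e6,
    t_sc=50e-12, i_peak=100e-6, i_leak=1e-9,
)
total = dynamic + short_circuit + leakage
```

For these assumed numbers the dynamic term dominates, which is why the techniques that follow target supply voltage, load capacitance and switching activity.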
Power-driven voltage scaling

Turn a faster design into a more power-efficient one by scaling down the supply voltage
Traditional speed-enhancing transformations can be exploited for low power design
  Pipelining
  Parallelization
  Loop unrolling
  Re-timing
Advanced voltage scaling

Multiple voltages
  Slow down non-critical paths with a lower supply voltage
  Two or more power grids
  High-efficiency voltage converters
Clock frequency reduction

Reducing f_clk does not decrease energy
  But it may increase battery life
  It reduces power
Multi-frequency clocks
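A small sketch makes the distinction between power and energy concrete. It assumes only dynamic power (C V^2 f), a fixed supply voltage and a task with a fixed cycle count; the function name and all numbers are hypothetical.

```python
def run_task(cycles, f_clk, c_eff, v_dd):
    """Return (power_W, time_s, energy_J) for a fixed-cycle task, dynamic power only."""
    power = c_eff * v_dd ** 2 * f_clk
    time = cycles / f_clk
    return power, time, power * time


# Hypothetical task: 1e9 cycles, 1 nF effective switched capacitance, 1.8 V supply.
fast = run_task(cycles=1e9, f_clk=200e6, c_eff=1e-9, v_dd=1.8)
slow = run_task(cycles=1e9, f_clk=100e6, c_eff=1e-9, v_dd=1.8)
```

Halving the clock halves the power and doubles the run time, so the energy for the task is unchanged. Energy only drops if the slower clock is used to lower the supply voltage as well.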
Reducing load capacitance

Reduce wiring capacitance
  Reduce local loads
  Reduce global interconnect
  Global interconnect can be reduced by improving spatial locality: trade off communication for computation
Reduce switching activity

Improve correlation between consecutive inputs to functional macros
Reduce glitching
All basic high-level-synthesis steps have been modified
  A synergistic approach leads to the best results
Application-specific processors
Parameterized processors tailored to a specific application
  Optimally exploit parallelism
  Eliminate unneeded features
Applied to different architectures
  Single-issue cores: instruction subsetting
  Superscalar cores: number and type of functions
  VLIW cores: functions and compiler
Low power core processors

Low voltage
Reduce wasted switching
Specialized modes of operations/instructions
Variable voltage supply
Exploiting variable supply

Supply voltage can be dynamically changed during system operation
  Quadratic power savings
  Circuit slowdown
Just-in-time computation
  Stretch execution time up to the max tolerable
Variable-supply architecture
High-efficiency adjustable DC-DC converter
Adjustable synchronization
  Variable-frequency clock generator
  Self-timed circuits
Memory optimization
Custom data processors
  Computation is less critical than data storage (for data-dominated applications)
General-purpose processors
  A significant fraction of system power is consumed by memories
Key idea: exploit locality
  Hierarchical memory
  Partitioned memory
Optimization approaches
Fixed memory access patterns
  Optimize memory architecture
Fixed memory architecture
  Optimize memory access patterns
Concurrently optimize memory architecture and accesses
Optimize memory architecture

Data replication to localize accesses
  Implicit: multi-level caches
  Explicit: buffers
Partitioning to minimize cost per access
  Multi-bank caches
  Partitioned memories
Optimize memory accesses

Sequentialize memory accesses
  Reduce address bus transitions
  Exploit multiple small memories
Localize program execution
  Fit frequently executed code into a small instruction buffer (or cache)
Reduce storage requirements
Design of communication units

Trends:
  Faster computation blocks, larger chips
  Communication speed is critical
  Energy cost of communication is significant
Multifaceted design approach:
  On chip, networks, wireless
  Protocol stack
Optimize memory architecture and access patterns

Two-phase process
  Specification (program) transformations
    Reduce memory requirements
    Improve regularity of accesses
  Build optimized memory architecture
Data encoding
Theoretical results:
  Bounds on transition activity reduction:
    The higher the entropy rate of the source, the lower the gain achievable by coding
Practical applications:
  Processor-memory (and other) busses
    Data busses, address busses
Transition activity reduction does not guarantee energy savings
Bus-Invert coding for data busses

Add redundant line INV to bus
When INV=0
  Data is equal to remaining bus lines
When INV=1
  Data is complement of remaining bus lines
Performance:
  Peak: at most n/2 bus lines switch
  Average: code is optimal. No other code with 1-bit redundancy can do better
Average switching reduction is bus-width dependent:
  Ex: 3.27 transitions per cycle on average for an 8-bit bus (4 without coding)
  The average switching reduction decreases as busses get wider
    Use partitioned codes
    No longer optimal (among redundant codes)
Implementation issues:
  Difference (XOR) of two consecutive data samples and majority vote
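The bus-invert scheme can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes the XOR-plus-majority-vote decision is taken over the n data lines together with the INV line, and all function names are hypothetical.

```python
def popcount(x):
    return bin(x).count("1")


def bus_invert_step(prev_lines, prev_inv, word, n=8):
    """Encode one word; return the new (data_lines, inv) driven onto the bus."""
    mask = (1 << n) - 1
    # Lines that would toggle if the word were sent unchanged with INV = 0.
    distance = popcount((prev_lines ^ word) & mask) + prev_inv
    if distance > n / 2:           # majority vote: cheaper to send the complement
        return (~word) & mask, 1
    return word & mask, 0


def bus_invert_encode(words, n=8):
    lines, inv, out = 0, 0, []
    for word in words:
        lines, inv = bus_invert_step(lines, inv, word, n)
        out.append((lines, inv))
    return out


def bus_invert_decode(symbols, n=8):
    mask = (1 << n) - 1
    return [(lines ^ mask) if inv else lines for lines, inv in symbols]


def transitions(prev_lines, prev_inv, lines, inv):
    """Number of bus wires (data lines plus INV) that toggle in one cycle."""
    return popcount(prev_lines ^ lines) + (prev_inv ^ inv)


# Average over all 8-bit words and both previous INV states (uniform data).
average = sum(
    transitions(0, inv, *bus_invert_step(0, inv, word))
    for inv in (0, 1) for word in range(256)
) / 512
```

For uniformly distributed 8-bit data, `average` evaluates to about 3.27 toggling wires per cycle, and no single cycle toggles more than n/2 = 4 wires.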
Encoding instruction addresses

Most instruction addresses are consecutive
  Use Gray code
Word-oriented machines:
  Increments by 4 (32 bit) or by 8 (64 bit)
  Modify Gray code to switch 1 bit per increment
Gray code adder for jumps
  Harder to partition
  Convert to Gray code after update
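One possible way to obtain a single bit change per word increment is to Gray-encode only the word index and leave the byte-offset bits untouched. The sketch below assumes a power-of-two word size; the function names are hypothetical.

```python
def to_gray(n):
    """Standard reflected binary Gray code."""
    return n ^ (n >> 1)


def from_gray(g):
    """Inverse of to_gray: XOR of all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def encode_word_address(addr, word_bytes=4):
    """Gray-encode the word index of addr; keep the byte-offset bits as they are."""
    shift = word_bytes.bit_length() - 1   # log2(word_bytes) for a power of two
    offset = addr & (word_bytes - 1)
    return (to_gray(addr >> shift) << shift) | offset


def decode_word_address(code, word_bytes=4):
    shift = word_bytes.bit_length() - 1
    offset = code & (word_bytes - 1)
    return (from_gray(code >> shift) << shift) | offset
```

With `word_bytes=4`, stepping from any word-aligned address to the next one flips exactly one bus line, whereas plain binary can flip many lines on a carry.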
T0 Code
Add redundant line INC to bus
When INC = 0
  Address is equal to remaining bus lines
When INC = 1
  Transmitter freezes other bus lines
  Receiver increments previously transmitted address by a parameter called stride
Asymptotically zero transitions for in-sequence addresses
  Better than Gray code
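The T0 rules above can be sketched as an encoder and decoder pair. This is an illustrative sketch with hypothetical function names; the stride of 4 assumes a 32-bit word-oriented machine.

```python
def t0_encode(addresses, stride=4):
    """Return a list of (bus_lines, inc) symbols for an address stream."""
    bus, prev, out = 0, None, []
    for addr in addresses:
        if prev is not None and addr == prev + stride:
            out.append((bus, 1))   # in sequence: freeze the address lines
        else:
            bus = addr             # jump: drive the address itself
            out.append((bus, 0))
        prev = addr
    return out


def t0_decode(symbols, stride=4):
    """Rebuild the address stream at the receiver."""
    prev, out = None, []
    for bus, inc in symbols:
        addr = prev + stride if inc else bus
        out.append(addr)
        prev = addr
    return out


stream = [0x100, 0x104, 0x108, 0x200, 0x204]
symbols = t0_encode(stream)
```

In this example the address lines change only at the jump to 0x200; during in-sequence runs only the INC line is asserted, which is why long sequential runs cost almost no transitions.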
Mixed bus encoding techniques

T0_BI:
  Use two redundant lines: INC and INV
  Good for shared address/data busses
Dual encoding:
  Good for time-multiplexed address busses
  Use redundant line SEL:
    SEL = 1 denotes addresses
    SEL is already present in the bus interface
Dual T0:
  Use T0 code when SEL is asserted
Dual T0_BI:
  Use T0 when SEL is asserted; otherwise use BI
Impact of software
For a given hardware platform, the energy to realize a function depends on software
  Operating system
  Different algorithms to embody a function
  Different coding styles
  Application software compilation
Coding styles
Use processor-specific instruction style:
  Function calls style
  Conditionalized instructions (for ARM)
Follow general guidelines for software coding
  Use table look-up instead of conditionals
  Make local copies of global variables so that they can be assigned to registers
  Avoid multiple memory look-ups with pointer chains
Example: ARM variable types

Default int variable type is 18.2% more energy efficient than char or short
  Sign or zero extending is needed for shorter variable types
ARM conditional execution

All ARM instructions are conditional
Conditional execution reduces the number of branches
Instruction-level analysis
Analyze loop execution containing specific instructions
  Loop should be long enough to neglect overhead and short enough to avoid cache misses
  About 200 instructions
Measure instruction base cost
Measure inter-instruction effects
Compilation for low-power operation scheduling

Reorder instructions:
  Reduce inter-instruction effects
  Switching in control part
Cold scheduling:
  Reorder instructions to reduce inter-instruction effects on instruction bus
    Consider instruction op-codes
    Inter-instruction cost is op-code Hamming distance
    Use list scheduler where priority criterion is tied to Hamming distance
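A greedy list scheduler in the spirit of cold scheduling can be sketched as below. It is a simplified illustration rather than the published algorithm: the priority is just the Hamming distance to the previously issued op-code, and the op-code encodings, instruction names and dependence graph are all hypothetical.

```python
def hamming(a, b):
    return bin(a ^ b).count("1")


def cold_schedule(opcodes, deps):
    """opcodes: name -> integer encoding; deps: name -> set of predecessor names."""
    pending = {name: set(deps.get(name, ())) for name in opcodes}
    order, last = [], 0                     # assume the bus initially holds all zeros
    while pending:
        ready = sorted(n for n, preds in pending.items() if not preds)
        if not ready:
            raise ValueError("dependence cycle")
        # Priority: smallest op-code Hamming distance to the last issued instruction.
        pick = min(ready, key=lambda n: hamming(opcodes[n], last))
        order.append(pick)
        last = opcodes[pick]
        del pending[pick]
        for preds in pending.values():
            preds.discard(pick)
    return order


def bus_cost(order, opcodes):
    """Total instruction-bus toggles for a given issue order, starting from zero."""
    cost, last = 0, 0
    for name in order:
        cost += hamming(opcodes[name], last)
        last = opcodes[name]
    return cost


opcodes = {"a": 0b0001, "b": 0b1110, "c": 0b0011, "d": 0b1111}
deps = {"d": {"b"}}                         # d must follow b
schedule = cold_schedule(opcodes, deps)
```

For this toy block the scheduler issues a, c, b, d, which costs 6 bus toggles against 10 for the original order a, b, c, d, while still respecting the dependence of d on b.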
Scheduling to reduce off-chip traffic

Schedule instructions to minimize Hamming distance
Scheduling algorithm:
  Operations within each basic block
  Searches for linear orders consistent with dataflow
  Prunes search space by avoiding redundant solutions (hash sub-trees) and heuristically limits the number of sub-trees
Compilation for low-power register assignment

Minimize spills to memory
Register labeling
  Reduce switching in instruction register/bus and register file decoder by encoding
  Reduce Hamming distance between addresses of consecutive register accesses
  Approach is complementary to cold scheduling
Other compiler optimizations

Loop unrolling to reduce overhead
  Contra: increased code space
Software pipelining
  Decreases the number of stalls by fetching instructions from different iterations
Eliminate tail recursion
  Reduce overhead and use of stack
Dynamic power management

Systems are:
  Designed to deliver peak performance
  Not needing peak performance most of the time
Components are idle at times
Dynamic power management (DPM)
  Put components in low-power non-operational states when idle
Power manager:
  Observes and controls the system
  Power consumption of power manager is negligible
Structure of power-manageable systems

Systems consist of several components:
  E.g., Laptop: Processor, memory, disk, display
  E.g., SOC: CPU, DSP, FPU, RF unit
Components may:
  Self-manage state transitions
  Be controlled externally
Power manager:
  Abstraction of power control unit
  May be realized in hardware or software
Power manageable components

Components with several internal states
  Corresponding to power and service levels
Abstracted as a power state machine
  State diagram with:
    Power and service annotation on states
    Power and delay annotation on edges
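A power state machine can be represented directly as annotated states and edges. The sketch below is illustrative only: the class name, state names and all power and delay figures are hypothetical, and the service annotation on states is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class PowerStateMachine:
    state_power: dict   # state name -> power in watts
    edges: dict         # (src, dst) -> (power in watts, delay in seconds)
    state: str
    energy: float = 0.0   # joules consumed so far
    elapsed: float = 0.0  # seconds elapsed so far

    def stay(self, seconds):
        """Remain in the current state for the given time."""
        self.energy += self.state_power[self.state] * seconds
        self.elapsed += seconds

    def go(self, dst):
        """Take the edge to dst, paying its power for the duration of its delay."""
        power, delay = self.edges[(self.state, dst)]
        self.energy += power * delay
        self.elapsed += delay
        self.state = dst


psm = PowerStateMachine(
    state_power={"active": 1.0, "sleep": 0.1},
    edges={("active", "sleep"): (0.5, 0.2), ("sleep", "active"): (1.5, 0.4)},
    state="active",
)
psm.stay(1.0)      # 1.0 J in the active state
psm.go("sleep")    # 0.1 J spent on the transition
psm.stay(2.0)      # 0.2 J asleep
psm.go("active")   # 0.6 J to wake up
```

After this trace the component has used 1.9 J over 3.6 s. The wake-up edge costs more than two seconds of sleeping saves relative to nothing, which shows why transition costs matter when deciding whether to shut down.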
Predictive techniques
Observe time-varying workload
  Predict idle period: T_pred ~ T_idle
  Go to sleep state if T_pred is long enough to amortize state transition cost
Main issue: prediction accuracy
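One simple way to realize such a policy is to predict the next idle period as an exponentially weighted average of past idle periods and to sleep only when the prediction reaches a break-even time. The sketch below assumes that predictor and a given break-even time; both are choices made for illustration, and the slides do not prescribe a particular predictor. All names and numbers are hypothetical.

```python
def predictive_shutdown(idle_periods, t_break_even, alpha=0.5, initial_prediction=0.0):
    """Return one sleep decision (True or False) per observed idle period."""
    prediction = initial_prediction
    decisions = []
    for t_idle in idle_periods:
        # Decide at the start of the idle period, before its length is known.
        decisions.append(prediction >= t_break_even)
        # Update the prediction once the actual idle length has been observed.
        prediction = alpha * t_idle + (1 - alpha) * prediction
    return decisions


def count_mispredictions(idle_periods, decisions, t_break_even):
    """Count decisions that disagree with what hindsight would have chosen."""
    return sum(
        slept != (t_idle >= t_break_even)
        for t_idle, slept in zip(idle_periods, decisions)
    )


idle = [10, 10, 1, 1, 10]
decisions = predictive_shutdown(idle, t_break_even=4)
errors = count_mispredictions(idle, decisions, t_break_even=4)
```

On this trace the predictor is wrong in four of five periods because the workload switches abruptly between long and short idle times, which illustrates why prediction accuracy is the main issue.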
When to use predictive techniques

When workload has memory
Implementing predictive schemes
  Predictor families must be chosen based on workload types
  Predictor parameters must be tuned to the instance-specific workload statistics
  When workload is non-stationary or unknown, on-line adaptation is required
Operating system-based power management

In systems with an operating system (OS)
  The OS knows of tasks running and waiting
  The OS should perform the DPM decisions
Advanced Configuration and Power Interface (ACPI)
  Open standard to facilitate design of OS-based power management
Implementations of DPM
Shut down idle components
Gate clock of idle units
Clock setting and voltage setting
  Support multiple-voltage multiple-frequency components
  Components with multiple working power states