IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 7, JULY 2014
Novel Class of Energy-Efficient Very High-Speed Conditional Push-Pull Pulsed Latches
Elio Consoli, Gaetano Palumbo, Fellow, IEEE, Jan M. Rabaey, Fellow, IEEE,
and Massimo Alioto, Senior Member, IEEE
Abstract: In this paper, a new class of pulsed latches is
introduced and experimentally assessed in 65-nm CMOS. Its
conditional push-pull pulsed latch topology is based on a
push-pull final stage driven by two split paths with a conditional
pulse generator. Two circuit implementations of the concept
are discussed, with their main difference being in the pulse
generator, which can be either shared (CSP³L) or not (CP³L).
Measurements show that the proposed topology is very fast,
as it outperforms the well-known transmission gate pulsed
latch (TGPL) [1] by 1.5-2×; hence the proposed pulsed latch
has the highest performance ever reported. The proposed pulsed
latch is also shown to significantly improve the energy efficiency
compared to the state of the art. Indeed, a 2.3× improvement in
ED³ product (energy × delay³) over TGPL was found for designs
targeting minimum ED³. For designs targeting minimum ED,
a 1.3× improvement was found in ED product. This comes at
the cost of a 1.15-1.35× cell area penalty, which translates
into an overall area increase well below 1% in typical systems.
Measurements on 256 replicas confirm that the above benefits are
kept in the presence of variations. Accordingly, the proposed class
of pulsed latches goes beyond the current state of the art and is
well suited for VLSI systems that require both high performance
and energy efficiency.
Fig. 1. Pareto-optimal energy-delay curve of existing FF topologies for a
typical load of 16 minimum inverters (energy per cycle and D-Q delay are
in arbitrary units).
Index Terms: Clocking, energy efficiency, energy-delay
tradeoff, flip-flops (FFs), high speed, low power, nanometer
CMOS, pulsed latches, VLSI.
I. INTRODUCTION
FLIP-FLOPS (FFs) and latches are well known to be
responsible for a large fraction of the power budget of
microprocessors and VLSI systems [1]-[7]. Typically, they
dissipate 80% of the total clock power [5], and 30% of the
overall power budget [2]. Energy efficiency of FFs and latches
is nowadays even more critical than in the past, considering
that speed can be increased only through improvements in
energy efficiency, since VLSI systems are power limited
[2], [8], [9]. Therefore, the search for novel topologies with a
targeted speed under a relatively low consumption (with their
tradeoff quantified by composite E^i D^j metrics [10]-[12]) is
crucial.
Manuscript received February 12, 2013; revised July 11, 2013; accepted
July 28, 2013. Date of publication September 9, 2013; date of current version
June 23, 2014.
E. Consoli is with Maxim Integrated Products, Catania 92100, Italy (e-mail:
elioconsoli83@gmail.com).
G. Palumbo is with the DIEEI, Università di Catania, Catania I-95125, Italy
(e-mail: gaetano.palumbo@dieei.unict.it).
J. M. Rabaey is with the Electrical Engineering and Computer Science
Department, University of California, Berkeley, CA 94720 USA (e-mail:
jan@eecs.berkeley.edu).
M. Alioto is with the Electronics and Computer Engineering Department,
National University of Singapore, 117576 Singapore (e-mail:
malioto@ieee.org).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2013.2276100
Fig. 2. (a) TGPL topology. (b) Pulse generator topology (area in dashed line
is shareable among multiple cells).
Among state-of-the-art topologies, pulsed latches typically
exhibit the best energy efficiency from moderate to high
performance design targets, among the existing classes of
FFs [10]-[15]. In particular, from moderate to very high
performance targets, only very few topologies belong to the
Pareto-optimal curve of designs having minimum energy for
a given performance [10], [11]. As recalled in Fig. 1, the
transmission gate pulsed latch (TGPL) [1] (see Fig. 2) used
in various Intel microprocessors is the most energy-efficient
FF in a rather wide portion of the Pareto-optimal curve,
ranging from high-speed (i.e., points with minimum ED^j
product with j > 1) to energy-efficient designs (i.e., points
with minimum ED). Only the skew-tolerant FF (STFF) is able
to outperform TGPL for extremely
high-speed design targets [16] (i.e., points with minimum ED^j
for j ≥ 5). In this region, the STFF speed advantage in terms
of D-Q delay is typically about 10%, at the cost of a 2×
greater energy [11]. Hence, although STFF is slightly better
than TGPL in terms of pure performance, its significantly
worse energy efficiency makes it less competitive than
TGPL in applications where energy efficiency is a concern.
In the following, TGPL will therefore be adopted as a reference
for high-speed energy-efficient designs. When slower design
targets are considered, master-slave FFs exhibit
better energy efficiency. The traditional transmission gate FF (TGFF) [17] and the
recently proposed Toshiba ACFF [18] are, respectively, the
most efficient among designs with balanced energy-delay (i.e.,
minimum ED) and ultralow energy designs (i.e., minimum
E^j D with j > 1).
In this paper, a novel class of pulsed latches (conditional
push-pull pulsed latch) is introduced. The main idea is to
adopt a push-pull output stage, which is driven by two split
paths for rise and fall output transitions, with the explicit aim
of reducing both the path effort and the parasitic delay [19].
In addition, the capacitance at the output of the first stage is
further reduced by adopting half-latches in the split paths and
moving the cross-coupled inverters to the output node.
Two versions are presented, respectively, without (CP³L)
and with (CSP³L) shareable conditional pulse generator.
Measurements on a 65-nm test chip demonstrate 1.3-2.3× better
energy efficiency compared to TGPL, as well as 1.5-2×
D-Q delay improvement even in the presence of process
variations. The proposed pulsed latches have a 1.15-1.35×
larger area than TGPL, with a resulting increase in the area
of practical VLSI systems that is well below 1%.
This paper is organized as follows. In Section II, the
rationale behind the proposed novel topologies and their
operation is described, and their detailed circuit implementation
is discussed in Section III. The potential speed advantage
compared to TGPL is analytically evaluated in Section IV,
and aspects related to physical design and layout parasitics
are discussed in Section V. Details on the 65-nm test chip
and the adopted delay-energy testing circuitry are provided in
Section VI. Measurement results and comparison with
state-of-the-art topologies are discussed in Section VII. Conclusions
are reported in Section VIII, and an Appendix presents a
detailed logical effort analysis.
II. CONDITIONAL PUSH-PULL PULSED LATCH:
MAIN IDEAS AND OPERATION
In the proposed class of pulsed latches shown in Fig. 3,
a push-pull output stage is adopted (M7-M8) as opposed to
the traditional output inverter stage employed in most existing
topologies [10]-[18] (see M5-M6 in TGPL in Fig. 2). Such a
technique allows for reducing the load of the driving circuitry
by a factor of 2-3, thereby making it faster and more
energy-efficient. This also allows M7-M8 in Fig. 3 to be up-sized,
and hence have a faster output stage.
The push-pull output stage in Fig. 3 is driven by two
split paths that generate the active-high reset R (active-low set S)
pulsed signal, which resets (sets) the output when active.
Pulses R and S are alternatively generated to enable a fall/rise
output transition, respectively. These pulses are generated at
the falling clock edge by the conditional pulse generator in
Fig. 3, and are transferred to the output stage by either the
half latch M1-M3 or M4-M6, depending on whether input D
is, respectively, low or high (see below for detailed description
of pulse waveforms). These half latches in the first stage within
the D-Q critical path have less parasitics compared to typical
clocked inverters or inverters with cascaded transmission
gate [10]-[18] (see M1-M4 in Fig. 2). The input D drives
two different paths, respectively, through an nMOS (M5) and
a pMOS (M2) transistor in Fig. 3, which is equivalent to the
load of a traditional input inverter stage (see M1-M2 in TGPL
in Fig. 2).
Fig. 3. General scheme of the proposed class of pulsed latches.
The operation of the scheme in Fig. 3 is explained in detail
in Fig. 4, which depicts the main waveforms of the internal
signals. After the falling clock edge (cycle 1 in Fig. 4), the
pulse generator checks if the previous output¹ Q_D in Fig. 3
is high or low. If the previous output is Q_D = 1, the next output Q
can stay at the same value or make a falling transition, hence
a pulse is generated in the fall path in Fig. 3 through the
active-low signal CP_f, whereas nothing changes in the rise
path (active-high signal CP_r is kept low, thus latch M4-M6
keeps S high and maintains M8 OFF). Subsequently, if the input
stays at the previous value D = 1, the latch M1-M3 is not
enabled; hence R is dynamically kept at the previous value
R = 0 (then, it is statically tied to ground once the pulse
expires). On the other hand, if the input changes to D = 0, the
latch M1-M3 is enabled and the CP_f pulse determines a high
pulse in R, which turns M7 ON and brings the output Q to
low. Afterwards, its delayed output replica Q_D experiences
the same transition.
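The conditional behavior of the scheme in Fig. 3 (the Q_D = 1 case above and the dual Q_D = 0 case) can be summarized with a small behavioral model. This is only a cycle-level sketch under simplifying assumptions: it ignores all timing, treats Q_D as the output of the previous cycle, and the names `cp3l_step` and `LatchStep` are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class LatchStep:
    pulse: str  # conditional pulse that fires: "CP_f" (fall path) or "CP_r" (rise path)
    R: int      # active-high reset seen by M7 (1 means a reset pulse is issued)
    S: int      # active-low set seen by M8 (0 means a set pulse is issued)
    Q: int      # output value after the transparency window


def cp3l_step(d: int, q_prev: int) -> LatchStep:
    """Cycle-level model of one falling clock edge; q_prev plays the role of Q_D."""
    if q_prev == 1:
        # Output is high: only a falling transition is possible, so the fall path is pulsed.
        pulse = "CP_f"
        r = 1 if d == 0 else 0  # half latch M1-M3 passes the pulse only when D = 0
        s = 1                   # rise path idle: S stays high, M8 stays OFF
    else:
        # Output is low: only a rising transition is possible, so the rise path is pulsed.
        pulse = "CP_r"
        r = 0                   # fall path idle: R stays low, M7 stays OFF
        s = 0 if d == 1 else 1  # half latch M4-M6 passes the pulse only when D = 1
    if r == 1:
        q = 0        # M7 pulls the output down
    elif s == 0:
        q = 1        # M8 pulls the output up
    else:
        q = q_prev   # neither output transistor is ON: the keeper holds the value
    return LatchStep(pulse, r, s, q)


if __name__ == "__main__":
    q = 1
    for d in [1, 0, 0, 1]:
        step = cp3l_step(d, q)
        print(d, step)
        q = step.Q
```

In this model at most one of R and S is active in a given cycle, and the output always ends up equal to D, which is the property the split paths are meant to preserve.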
If the previous output is Q_D = 0, right after the falling clock
edge (cycle 2 in Fig. 4), a pulse is generated in the rise path
through the active-high signal CP_r (nothing changes in the
fall path). If the input stays at the previous value D = 0, the latch
M4-M6 is disabled and S is kept high, so that nothing changes
in the rise path. If the input changes to D = 1, the latch M4-M6 is
enabled and the CP_r pulse pulls down S, thereby turning M8
ON and bringing Q to high. Afterwards, the delayed output
replica Q_D experiences the same transition.
At the steady state, R (S) in Fig. 3 is set to 0 (1), thereby
turning OFF the output transistors M7-M8, with the output
being maintained at the desired value by a keeper. In other
words, the memory element within the proposed topology in
Fig. 3 is actually placed at the output node, as opposed to most
of the existing topologies where it is placed before the output
stage (see the gated cross-coupled inverter pair in Fig. 2, which
is connected to the input of the output stage M5-M6). This
moves the parasitics associated with the memory
element to the output node, thereby making the input node
of the output stage lightly loaded, and hence faster and more
energy efficient.
¹ More precisely, the delayed version Q_D of the output Q is fed back to
the conditional pulse generator. As explained below, feeding back Q_D (rather
than Q) reduces the internal activity and hence the energy per cycle.
Fig. 4. Waveforms of internal signals of the general scheme in Fig. 3.
Fig. 5. CP³L topology (area in dashed line is shareable among multiple
cells).
III. IMPLEMENTATION OF THE CONDITIONAL
PUSH-PULL PULSED LATCH CONCEPT: CP³L AND
CSP³L TOPOLOGIES
As discussed above, the proposed class of pulsed latch
in Fig. 3 tends to have a lightly loaded D-Q critical
path, thereby making it potentially fast and energy-efficient.
Such features can be implemented in different ways. In the
following, we present two versions, respectively, without
(Section III-A) and with (Section III-B) shareable pulse
generator.
A. CP³L: Conditional Push-Pull Pulsed Latch
The schematic of the CP³L topology is depicted in Fig. 5.

The keeper (M9-M12 in Fig. 5) drives the output Q and comprises
a cross-coupled inverter pair, whose forward inverter
is gated to avoid current contention with the output stage
M7-M8. Indeed, if R = 1 the pull-down M7 of the output
stage is ON and the pull-up network of the keeper is OFF
through M11. Analogously, if S = 0 the pull-up M8 of the
output stage is ON and the pull-down network of the keeper
is OFF through M10.
As an additional advantage brought by placing the keeper
after the output stage rather than before, CP³L has a lighter load
on its critical path since the half latch M1-M3 (M4-M6) in the
first stage has to drive the single transistor M11 (M10). Also,
since the two pulses R and S are alternatively generated, only one of
M10 and M11 in the keeper is actually subject to a transition
of the gate terminal in a given cycle. In contrast, the first stage
of traditional topologies must drive two transistors associated
with the keeper, and both of them are subject to transitions
[10]-[18] (see transistors M11-M12 in Fig. 2, which load
transistors M3-M4 lying in the critical path). This clearly
reduces the parasitic load of the first stage of CP³L and reduces
activity at the keeper capacitances, thereby making the first
stage faster and potentially more energy efficient.
Regarding the pulse generator, it comprises a clock
phase generator, a pseudo-NAND gate for the fall path
(M15-M19 in Fig. 5), and a pseudo-NOR gate for the
rise path (M20-M24). Operation is summarized in Fig. 6,
which depicts the waveforms of the signals involved in
the generation of the CP_f and CP_r pulses. Generally, the
pseudo-NAND (pseudo-NOR) gate sets signal CP_f (CP_r) high
(low), since signals CK_N^(I) and CK^(IV) (CK and CK_N^(III))
are complementary and thus keep either transistor M18 or
M19 (M20 or M21) ON. However, after the falling clock
edge, signals CK_N^(I) and CK^(IV) (CK and CK_N^(III)) are both
temporarily high (low) due to the transitions of the four
inverters within the clock phase generator (in the example in
Fig. 6, each inverter is assumed to have the same delay τ_inv
for simplicity). Accordingly, during the time slot from τ_inv to 4τ_inv in
Fig. 6, the pseudo-NAND temporarily sets CP_f low through
transistors M15-M17 if Q_D = 1 (otherwise, CP_f remains
high). Similarly, during the time slot from 0 to 3τ_inv in Fig. 6, the
pseudo-NOR temporarily sets CP_r high through transistors
M22-M24 if Q_D = 0 (otherwise, CP_r remains low). Hence,
the clock phase generator and the pseudo-NAND/NOR gates
implement a conditional pulse generator, which alternatively
produces a pulse on either CP_f or CP_r, as determined by the
previous output value Q_D. The clock phase generator can be
shared among multiple latches to amortize its overhead.
Fig. 6. Clock phase generator and waveforms defining CP_r and CP_f pulses.
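The pulse windows described above follow directly from the four-inverter chain. The sketch below computes them under the same simplification used in Fig. 6 (every inverter has the same delay τ_inv) and ignores the delay of the pseudo-NAND/NOR gates themselves; the function names are hypothetical.

```python
def clock_phase_edges(tau_inv: float, n_inverters: int = 4) -> list[float]:
    """Times, measured from the falling CK edge, at which each inverter output toggles."""
    return [k * tau_inv for k in range(1, n_inverters + 1)]


def pulse_windows(tau_inv: float) -> dict[str, tuple[float, float]]:
    """Time slots in which the conditional pulses may be asserted (idealized)."""
    t1, _t2, t3, t4 = clock_phase_edges(tau_inv)
    return {
        # CK_N^(I) rises at t1 and CK^(IV) falls at t4: both are high in between.
        "CP_f": (t1, t4),
        # CK falls at 0 and CK_N^(III) rises at t3: both are low in between.
        "CP_r": (0.0, t3),
    }


if __name__ == "__main__":
    for name, (start, stop) in pulse_windows(tau_inv=10.0).items():
        print(f"{name}: from {start} to {stop} (width {stop - start})")
```

Under these assumptions both windows have the same width, 3τ_inv, so changing the inverter delay scales the transparency window of the latch proportionally.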
It is useful to observe that the width of the CP_f and CP_r pulses
determines the width of the transparency window of the CP³L
latch, in which the input can affect the output. From a design
point of view, the width of the transparency window can be
modified by changing the delay of the inverters within the
clock phase generator in Fig. 5. The effect of process variations
on timing can be compensated through post-silicon tuning of
the pulse width, possibly sharing the tuning circuitry among
multiple latches [1], [20], [21]. In this paper, no tunability
is added to the considered pulsed latches since the addition
of such a feature would impact area/energy of any pulsed latch
equally. Indeed, almost all existing pulsed latches adopt the
same pulse generator topology (e.g., cascaded inverters as in
Figs. 2, 5, and 6) [10], [11].
The delay stage in the feedback path in Figs. 3-5 generates
a delayed replica Q_D of the output Q, and is implemented by
the two inverters M13-M14 and M25-M26 in Fig. 5. Actually,
only slow transistors M25-M26 are added to implement such
delay, as the inverter M13-M14 is already available (i.e.,
M13-M14 are used to both latch and delay the output).
This delay stage makes sure that Q_D is kept stable at its previous
value during the transparency window, thereby preventing
glitches in CP_r and CP_f and reducing dynamic energy, as
discussed in the following.
Without the delay stage, the output Q would be connected
directly to the pseudo-NAND/NOR in Fig. 5, hence any output
transition within the transparency window would immediately
trigger the generation of an additional (undesired) pulse.
As shown in detail in Fig. 7, which refers to the case
where Q is directly connected to the pseudo-NAND/NOR, a
falling transition of Q following the same input transition
immediately triggers a high pulse in CP_r, as the pseudo-NOR
in Fig. 5 temporarily has all pMOS transistors M22-M24
ON during the transparency window (i.e., the CP_r time
slot in Fig. 6). Observe that this glitch in the CP_r pulse increases
the dynamic energy, but it does not affect correct operation.
Indeed, if the previous output was Q = 1 and the current input is
D = 0 as in Fig. 7, the CP_r glitch cannot propagate through
the half latch M4-M6 since M5 is OFF. On the other hand,
if the previous output was Q = 1 and the current input is
D = 1, the CP_r glitch propagates through the half latch
M4-M6 and temporarily sets S = 0, but it does not affect
the output anyway since the latter is kept at the desired value
Q = 1 through M8. Dual considerations hold for glitches in
CP_f when no delay stage is inserted. As a result, the delay
stage in Figs. 3-5 is not strictly necessary, but its insertion
reduces the activity in CP_r and CP_f and hence energy.
Fig. 7. Glitch in CP_r occurring if no delay stage is inserted in the feedback
path in Figs. 3-5.
B. CSP³L: Conditional Shareable Push-Pull Pulsed Latch
In CP³L, the pulse generator cannot be shared among
multiple latches since the pseudo-NOR/NAND gates are driven by Q_D,
which is different for each latch. In this subsection, we present
a different implementation of the same concept by integrating
the conditional logic in the latch so that the whole pulse
generator can be shared. The resulting conditional shareable
push-pull pulsed latch (CSP³L) topology is depicted in Fig. 8.
In CSP³L, static NAND/NOR gates are introduced in the
shareable pulse generator to generate the pulses CP_f,ext and
CP_r,ext that are distributed to multiple latches and have the
same role as CP_f and CP_r had in CP³L. In each latch, such
external pulses are enabled through the switches implemented
by M16-M22 in Fig. 8, which implement the conditional
pulse selection logic. The latter comprises two transmission
gates and two small keepers to maintain the same operation
as before. As discussed above, the delay stage M23-M26 is
introduced in the feedback path (two more transistors than in CP³L, since
the transmission gates need complementary control signals).
The resulting transistor count is the same as in CP³L, hence
the CSP³L area is expected to be roughly the same as that of CP³L
(excluding the shareable part).
Since CSP³L is based on the same concept as CP³L, operation
is very similar. The main difference is in the conditional
pulse selection logic, which enables the propagation of either
CP_f,ext or CP_r,ext to the half latches, according to the value
of the delayed output replica Q_D. In particular, if Q_D = 1
(Q_D = 0) the fall (rise) path is activated, as the transmission
gate M15-M16 (M19-M20) transfers the CP_f,ext (CP_r,ext)
pulse to the input of the half latch M1-M3 (M4-M6), similar
to the pseudo-NAND (pseudo-NOR) of CP³L in Fig. 5. As a
minor difference from CP³L, the input capacitance seen from
CP_f,ext and CP_r,ext in CSP³L depends on Q, which may lead
to data-dependent clock skew (see Fig. 8). In practical cases,
this is not a concern considering that pulsed latches inherently
tolerate a significant amount of skew.
Fig. 8. CSP³L topology (area in dashed line is shareable among multiple
cells).
IV. ANALYSIS OF SPEED POTENTIAL
In this section, CP³L and CSP³L are compared with TGPL
in terms of maximum achievable performance
through logical effort analysis [22]. According to the analysis
under the assumptions in the Appendix, the minimum D-Q
delay normalized to the technology-dependent constant τ [22]
for the CP³L, CSP³L, and TGPL topology is

$$D_{\min,\mathrm{CP^3L}} \approx D_{\min,\mathrm{CSP^3L}} \approx \sqrt{\frac{4}{3}\,\frac{C_L}{C_{in}}} + \frac{5}{3} \qquad (1)$$

$$D_{\min,\mathrm{TGPL}} \approx \sqrt{\frac{5}{3}\,\frac{C_L}{C_{in}}} + \frac{34}{9} \qquad (2)$$

where C_L and C_in are, respectively, the load and the input
capacitance of the pulsed latch. From (1)-(2), CP³L and
CSP³L have basically the same minimum D-Q delay, as is
expected by considering that they have the same D-Q critical
path (M1-M8 in Figs. 5 and 8).
From (1)-(2), CP³L and CSP³L are always faster than
TGPL. Their theoretical maximum speed advantage is about
2.3× and is obtained at light loads (i.e., electrical effort
C_L/C_in ≪ 1). For typical electrical efforts ranging from
10 to 30, the potential speed advantage is 1.4-1.5×, and
decreases to 1.3× for 60 or more. Although this analysis does
not account for wire parasitics, which will be included in the
next section, it suggests that the potential advantage of CP³L
and CSP³L over TGPL typically ranges from 1.4× to 2×.
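The trend of the speed advantage with electrical effort can be explored numerically. The sketch below assumes that (1) and (2) take the form sqrt(g·h) + p, with h = C_L/C_in, g = 4/3 and p = 5/3 for CP³L/CSP³L, and g = 5/3 and p = 34/9 for TGPL; the square-root form and the function names are assumptions made here for illustration, and the exact expressions are those derived in the Appendix.

```python
import math


def d_min_cp3l(h: float) -> float:
    """Normalized minimum D-Q delay of CP3L/CSP3L for electrical effort h (assumed form)."""
    return math.sqrt(4.0 / 3.0 * h) + 5.0 / 3.0


def d_min_tgpl(h: float) -> float:
    """Normalized minimum D-Q delay of TGPL for electrical effort h (assumed form)."""
    return math.sqrt(5.0 / 3.0 * h) + 34.0 / 9.0


def speed_advantage(h: float) -> float:
    """Ratio of the TGPL delay to the CP3L delay at the same electrical effort."""
    return d_min_tgpl(h) / d_min_cp3l(h)


if __name__ == "__main__":
    for h in (0.01, 1, 10, 30, 60):
        print(f"h = {h:>5}: advantage = {speed_advantage(h):.2f}x")
```

Under this assumed form the ratio tends to (34/9)/(5/3), about 2.27, as h approaches zero, and decreases monotonically as the load grows, which is consistent with the trend described in the text.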
The above speed improvement is justified by the lighter
load of the stages lying in the critical path, as was discussed
in detail in the previous section. Logical effort analysis in
the Appendix quantifies the advantages of CP³L and
CSP³L in each critical path stage. Comparison of (A1)-(A5)
and (A2)-(A7) clearly shows that CP³L and CSP³L have a
speed advantage over TGPL both in the first and second stage.
In particular, the first stage has 1.25× lower logical effort
and 2× lower parasitic delay thanks to the lighter loading
effect of parasitics, compared to TGPL. In addition, the second
stage has 1.5× lower logical effort thanks to the push-pull
configuration.
The potential performance improvement enabled by CP³L
and CSP³L is kept also in the presence of layout parasitics,
as will be discussed in the next section, and can be traded off
for significantly lower energy at iso-performance, as will be
demonstrated in Section VII.
V. LAYOUT-AWARE SIZING METHODOLOGY AND
PHYSICAL-LEVEL CONSIDERATIONS
Although the analysis in the previous section shows a
clear advantage of CP³L and CSP³L in terms of maximum
performance, in practical cases transistors are optimized to
have a reasonable balance with energy. To explore the
energy-delay tradeoff in meaningful design cases, the pulsed latches
were sized to minimize the energy-delay ED^j figure of merit
[10], [11]. The resulting designs are energy optimal, in the
sense that they belong to the Pareto-optimal energy-delay
curve [23]. In the following, we focus on the design points
with minimum ED and ED³, which are, respectively,
representative of applications targeting balanced energy-delay and
high speed [10], [11].
Layout parasitics are well known to be comparable to
(or dominate) device parasitics even in internal nodes of
standard cells, and hence they have a considerable impact
on the optimal transistor sizes that minimize energy for a
given performance constraint [24]. However, most of the
previous work on FF/latch sizing and comparison presents
sub-optimal designs in terms of energy efficiency, as they
include layout parasitics only after transistors are actually
sized [13], [17], [25]-[28]. In contrast, in this paper, layout
parasitics are explicitly included into the circuit design loop
by resorting to the layout-aware sizing methodology proposed
by the same authors in [23]. In short, a preliminary layout
organization was set up in the form of a stick diagram, then
layout parasitics were estimated from the stick diagram as a
function of transistor sizes. Subsequently, transistor sizes
were optimized for the targeted energy-delay figure of merit
by including estimated layout parasitics in the optimization.
Fig. 9(a)-(c) shows the layout of a TGPL, CP³L, and CSP³L
for a minimum energy-delay target and a load of 16 minimum-sized
inverters. In these figures, the dashed line defines the
portion that can be shared and amortized among multiple
latches (i.e., the pulse generator). These figures show that
the shareable portion occupies a significant fraction of the
overall latch area. According to data in Table I, TGPL is
confirmed to have a very low area, as is well known from
the comparison with other existing topologies [10], [11].
In particular, TGPL offers a 1.15× (1.28×) area reduction over
CP³L (CSP³L), when the pulse generator is not shared. When the
latter is shared, TGPL has a 1.33× (1.35×) area reduction over
CP³L (CSP³L), which is similar to the advantage that TGPL
has compared to most of the area-efficient existing topologies
[10], [11]. As a numerical example, in typical microprocessor
systems, latches typically occupy less than 5% of the total
area²; hence the adoption of CP³L (CSP³L) as replacement of
all FFs would lead to a 1.6% (1.7%) area increase compared
to TGPL. In practical cases, the area increase resulting from
the adoption of CP³L or CSP³L is well below 1%, since
more traditional topologies are typically used in noncritical
paths [28].
² This estimate is based on the consideration that the cache memory takes
up more than 50% of the area of today's processors [29], and flip-flops/latches
typically account for 1% of the total gate count of a core [30], with their
size being in the same order of magnitude as other cells.
Fig. 9. Layout under sizing for minimum ED. (a) TGPL. (b) CP³L.
(c) CSP³L. (Area in dashed line is shareable among multiple cells.)
TABLE I
AREA COMPARISON (65 nm, STD CELL HEIGHT: 3.9 μm)
Fig. 10. Architecture of the test chip.
Finally, post-layout parasitic extraction on different design
points showed that the intracell wire parasitics at the output
of the first and second stage of CP³L and CSP³L are very
similar (within a few percent) to those of TGPL for a given
energy-delay target. In other words, the qualitative
energy-performance advantages expected from Section III and the
quantitative performance benefits estimated in Section IV are
expected to hold in practical cases where layout parasitics are
included (details will be provided in Section VII).
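As a quick check of the system-level area estimate given earlier in this section, the arithmetic can be written out explicitly. The sketch below takes the 5% latch area share and the 1.33× and 1.35× cell area ratios from the text; the function name and the `replaced_fraction` parameter (the share of latches actually replaced) are hypothetical, and the 30% value used in the last line is purely illustrative.

```python
def system_area_increase(latch_area_fraction: float,
                         cell_area_ratio: float,
                         replaced_fraction: float = 1.0) -> float:
    """Relative increase in total chip area when latches grow by cell_area_ratio.

    latch_area_fraction: share of total area occupied by latches (0.05 in the text).
    cell_area_ratio: new cell area divided by TGPL cell area (1.33 or 1.35 in the text).
    replaced_fraction: hypothetical share of latches that are actually replaced.
    """
    return latch_area_fraction * replaced_fraction * (cell_area_ratio - 1.0)


if __name__ == "__main__":
    print(f"CP3L,  all latches replaced: {system_area_increase(0.05, 1.33):.2%}")
    print(f"CSP3L, all latches replaced: {system_area_increase(0.05, 1.35):.2%}")
    # Illustrative only: replacing 30% of the latches (critical paths) with CP3L.
    print(f"CP3L,  30% replaced:         {system_area_increase(0.05, 1.33, 0.3):.2%}")
```

With all latches replaced, the result stays below 2% of the total area, in line with the figures quoted above; replacing only a fraction of the latches scales the overhead down proportionally.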
VI. ON-CHIP TESTING HARNESS
A test chip in 65-nm CMOS was prototyped to validate
the proposed class of pulsed latches. The test chip contains
CP³L, CSP³L, and TGPL latches implemented in two versions,
respectively, targeting minimum ED and minimum ED³
design, with a 16× and 64× load (1× is equivalent to the
input capacitance of a minimum-sized symmetric inverter).
Each latch version was implemented in 64 replicas. To the best
of our knowledge, this is the first paper that validates novel
latch topologies through measurements of multiple replicas.
The architecture of the test chip is shown in Fig. 10.
A scan chain (SC) scans in the test settings and scans out
measurement data. A test controller (TC) manages and applies
settings throughout the latch arrays and steers measurement
data. For each measurement, TC triggers the generation of
a pulse with appropriate width through the pulse generator,
and the delay generator (DG) generates delayed versions of
test signals, whose delay can be tuned with a 1.8-ps step
(its use is discussed below). Each latch (test unit in Fig. 10)
is excited after a preliminary coarse selection to reduce the
number of cells switching at the same time, and then a final
selection to steer measurements to SC. The testing harness
measures timing parameters and energy of the above latches.
Two different arrays are used to measure energy and delay,
so that the reconfigurability (i.e., parasitics) added to better
measure one of the two parameters does not interfere with the
other. The die photo is in Fig. 11.
TABLE II
AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF INTEREST (256 REPLICAS, MIN.-ED DESIGN)
TABLE III
COMPARISON WITH STATE OF THE ART
Fig. 11. Die photo.
Fig. 12. Block diagram of test unit for timing characterization.
The energy is measured through an external 1-V pin supplying
power to nine replicas. The activity factor can be tuned
among the following values, which widely cover practical
applications: 6.25%, 12.5%, 25%, 50%, and 100%. To separate
dynamic and leakage energy components, leakage of a bank of
latches is measured through a separate pin. Regarding timing
parameters, the testing harness proposed in [31] has been
implemented to measure D-CK, CK-Q, and D-Q delay of
each latch replica. Each delay is measured as the difference of
the arrival times of the involved signals (e.g., D and CK for
D-CK delay). The test unit for timing characterization of a
single latch is schematized in Fig. 12, which shows that D,
CK, and Q of the latch under test are MUXed and captured by
a master-slave (transmission-gate) flip-flop clocked by clock
CKMS. The arrival time of each signal relative to the CKMS edge
is measured by sweeping the tunable delay between it and
CKMS, and checking when the capturing FF starts failing [31]
(beyond that point, data is incorrectly captured for greater
delays). For example, D-CK delay can be measured according
to the following steps [31].
1) Select CK through MUX, sweep tunable delay between
CK and CKMS. Beyond a certain delay, CK is captured
incorrectly. The delay that was applied immediately
before defines the arrival time of CK relative to CKMS.
2) Repeat same steps by sweeping delay between D and
CKMS. This defines arrival time of D relative to CKMS.
3) Evaluate D-CK delay as the difference between the two
arrival times.
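The three steps above can be sketched in software as a sweep over the tunable delay. This is a minimal model, not the on-chip controller: the capture outcome is abstracted as a hypothetical callback `captured_ok(delay_ps)` that reports whether the capturing FF still samples the selected signal correctly, the 1.8-ps step is the DG tuning step quoted in the text, and the sign convention of the difference is an assumption.

```python
from typing import Callable

STEP_PS = 1.8  # tuning step of the delay generator (DG) quoted in the text


def arrival_time(captured_ok: Callable[[float], bool], max_steps: int = 1000) -> float:
    """Sweep the tunable delay and return the last value that is still captured correctly."""
    last_good = None
    for k in range(max_steps):
        delay_ps = k * STEP_PS
        if not captured_ok(delay_ps):
            break  # first failing delay: the previous setting defines the arrival time
        last_good = delay_ps
    if last_good is None:
        raise ValueError("capture fails even with zero applied delay")
    return last_good


def d_to_ck_delay(ck_captured_ok: Callable[[float], bool],
                  d_captured_ok: Callable[[float], bool]) -> float:
    """Steps 1-3: D-CK delay as the difference of the two arrival times (assumed sign)."""
    t_ck = arrival_time(ck_captured_ok)  # step 1: CK selected through the MUX
    t_d = arrival_time(d_captured_ok)    # step 2: D selected through the MUX
    return t_ck - t_d                    # step 3


if __name__ == "__main__":
    # Toy capture models: capture is assumed to succeed up to a fixed delay threshold.
    print(d_to_ck_delay(lambda t: t <= 50.0, lambda t: t <= 20.0))
```

Because both arrival times are quantized to the same step, the measured difference is itself a multiple of 1.8 ps in this idealized model.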
Differently from [31], we characterized an array of replicas
by using a single DG to test all replicas, to enable full
timing characterization of the pulse generator. To keep layout
parasitics small and have good control of the latch load, each
replica has its own local measurement unit (see Fig. 12) and
they are placed next to each other.
Fig. 13. Setup characteristic: CK-Q and D-Q delay versus CK-D of latches
designed for (a) minimum ED (16× load) and (b) minimum ED³ (64× load).
Fig. 14. Energy per cycle versus data activity.
VII. MEASUREMENT RESULTS
The above-described testing harness was used to characterize
the CP³L, CSP³L, and TGPL latches.
A. Performance, Hold Time Robustness, and Energy
The measured setup curves (i.e., CK-Q and D-Q delay
versus CK-D) are reported in Fig. 13(a) and (b), respectively,
for the design targeting minimum ED and ED³. The
measured replica is representative of the nominal corner in
this process. From Fig. 13(a) and (b), CP³L and CSP³L have
very similar minimum D-Q delay, as expected. The D-Q delay of
CP³L (CSP³L) is 17.3 ps (17.9 ps) for minimum-ED sizing,
while it is 15.6 ps (16.1 ps) for minimum-ED³ sizing. From the
same figures, the TGPL latch under the same conditions,
respectively, achieves 34.6 and 24 ps. Accordingly, the TGPL
is slower than CP³L (CSP³L) by 2.03× (1.92×) for the
minimum-ED design, and 1.54× (1.47×) for the minimum-ED³ design.
This is particularly interesting, considering that TGPL is well
known for being the fastest existing topology among those
with reasonably high energy efficiency [11] (and only 10%
slower than the very fastest).
The hold time of CP³L (CSP³L) is 90.5 ps (99.3 ps)
for the minimum-ED design, and 123.6 ps (130.1 ps) for the
minimum-ED³ design. On the
other hand, TGPL has a hold time of 121.9 ps and 171.1 ps
for the minimum-ED and minimum-ED³ design, respectively.
Hence, CP³L and CSP³L have a hold time that is slightly
better than TGPL (by about 1.3×) for both the minimum-ED
and minimum-ED³ design.
The transient energy per cycle E_TRAN (i.e., dynamic and short-circuit) is plotted in Fig. 14 versus data activity. This figure shows that the energy of CP³L and CSP³L is from 40% to 60% higher than that of TGPL, depending on the specific activity. Energy by itself is clearly not representative of energy efficiency, as it should be evaluated at iso-performance. The energy-delay tradeoff of the above topologies for the different design targets is depicted in Fig. 15(a), which shows that the minimum-ED CP³L and CSP³L (which are once again very close to each other) are even faster and consume less energy than the minimum-ED³ TGPL. More quantitatively, the energy of CP³L, CSP³L, and TGPL for 25% data activity is, respectively, 42, 41.5, and 26.1 fJ for the minimum-ED design, hence CP³L and CSP³L exhibit a 1.3× better energy-delay product compared to TGPL. For the minimum-ED³ design, the energy of CP³L, CSP³L, and TGPL is 73.7, 75.7, and 46.1 fJ, hence CP³L and CSP³L improve ED³ by 2.3× compared to TGPL. From Fig. 14, similar or better energy efficiency is expected at other realistic values of data activity. The energy improvement enabled by CP³L and CSP³L is intuitively explained by considering that these topologies are significantly faster than TGPL (see Section IV). Hence, CP³L and CSP³L tend to have smaller transistor sizes for a given performance target, which in turn translates into smaller dynamic and leakage energy compared to TGPL.
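As an illustration of how the energy-delay figures of merit are formed, the following minimal sketch recomputes E·D and E·D³ from the nominal energy and D-Q delay values quoted above. The dictionary layout and the function names are assumptions made for this example only, not part of the measurement flow. Because only rounded nominal values are used, the CP³L ratios come out close to, though not exactly at, the 1.3× and 2.3× figures quoted in the text.

```python
# Nominal measured values quoted in the text (25% data activity).
# Each tuple is (energy per cycle in fJ, minimum D-Q delay in ps).
MEASURED = {
    "min-ED": {"CP3L": (42.0, 17.3), "CSP3L": (41.5, 17.9), "TGPL": (26.1, 34.6)},
    "min-ED3": {"CP3L": (73.7, 15.6), "CSP3L": (75.7, 16.1), "TGPL": (46.1, 24.0)},
}


def energy_delay_metric(energy_fj, delay_ps, delay_exponent):
    """Return E * D**n (n = 1 for the ED product, n = 3 for ED^3)."""
    return energy_fj * delay_ps ** delay_exponent


def improvement_over_tgpl(target, latch, delay_exponent):
    """Ratio of the TGPL metric to the latch metric (> 1 means the latch is better)."""
    reference = energy_delay_metric(*MEASURED[target]["TGPL"], delay_exponent)
    own = energy_delay_metric(*MEASURED[target][latch], delay_exponent)
    return reference / own


if __name__ == "__main__":
    # Evaluate each design target with the metric it was optimized for.
    for target, exponent in (("min-ED", 1), ("min-ED3", 3)):
        for latch in ("CP3L", "CSP3L"):
            ratio = improvement_over_tgpl(target, latch, exponent)
            print(f"{target:8s} {latch:6s} E*D^{exponent} improvement: {ratio:.2f}x")
```

The same helper can be reused for other exponents (e.g., ED²) if a different energy-delay design target is of interest.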
Leakage can also be a concern in FFs and latches, for example, in VLSI systems operating in standby mode while retaining information in registers and power gating all other gates [32]. The leakage current under equiprobable inputs for CP³L, CSP³L, and TGPL is 316, 401.6, and 424.6 nA, respectively, for a minimum-ED design. As shown in Fig. 15(b), this translates into a more favorable leakage-delay tradeoff, with a 2.7× improvement in the leakage-delay product. For the minimum-ED³ design, the leakage of CP³L, CSP³L, and TGPL is 561.7, 685.7, and 832.5 nA, which translates into a 5.4× improvement in the leakage-delay³ product.
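The leakage-delay factors can be checked by hand from the nominal values quoted in the text. The worked arithmetic below is a sketch that assumes the minimum D-Q delays reported earlier in this section (17.3 and 34.6 ps for minimum ED, 15.6 and 24 ps for minimum ED³) are the delays entering the products; with that assumption, the CP³L values reproduce the quoted factors.

```latex
% Leakage-delay figures of merit, CP3L versus TGPL (nominal values from the text)
\begin{align*}
\frac{(I_{leak}\,D)_{TGPL}}{(I_{leak}\,D)_{CP^3L}}
  &= \frac{424.6\,\text{nA} \times 34.6\,\text{ps}}{316\,\text{nA} \times 17.3\,\text{ps}}
   \approx 2.7 \\[4pt]
\frac{(I_{leak}\,D^3)_{TGPL}}{(I_{leak}\,D^3)_{CP^3L}}
  &= \frac{832.5\,\text{nA} \times (24\,\text{ps})^3}{561.7\,\text{nA} \times (15.6\,\text{ps})^3}
   \approx 5.4
\end{align*}
```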
Fig. 15. (a) Energy-delay tradeoff. (b) Leakage-delay tradeoff.
Fig. 16. Histogram of CP³L D-Q delay for (a) minimum-ED design and (b) minimum-ED³ design (256 measurements).
B. Variations and Comparison With the State of the Art

The above measurements were repeated on 256 replicas of each version of the considered pulsed latches (over four dice). As an example, Fig. 16 reports the resulting histogram of the D-Q delay for the CP³L in its minimum-ED and minimum-ED³ versions (CSP³L histograms are very similar). The variability of these and other parameters of interest is summarized in Tables II and III for the minimum-ED and minimum-ED³ designs.
From Tables II and III, CP³L and CSP³L have a 1.7× lower standard deviation σ of D-Q delay compared to the TGPL in the minimum-ED design, whereas there is no significant difference in the minimum-ED³ case. Despite these comparable or smaller absolute variations, the relative variability σ/μ of CP³L and CSP³L is 1.4× worse than that of TGPL, due to the much lower average delay μ of CP³L and CSP³L. This variability difference does not significantly affect the above-mentioned speed advantage of CP³L and CSP³L. Indeed, the 3-sigma worst-case value of their D-Q delay is better than the TGPL counterpart by 1.4× to 1.9×, depending on the design target. This is close to the above results in the nominal corner (1.5× to 2×). Hence, CP³L and CSP³L are confirmed to be largely faster than TGPL in the presence of variations.
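The quantities discussed above (mean μ, standard deviation σ, relative variability σ/μ, and the 3-sigma worst case μ + 3σ) can be computed from a set of replica measurements as in the minimal sketch below. Since Tables II and III are not reproduced here, the sample data are synthetic: the means are the nominal D-Q delays from the text, whereas the standard deviations (1.0 and 1.7 ps) are hypothetical placeholders chosen only to mimic a 1.7× difference in σ. The function names are likewise assumptions for this example.

```python
import random
import statistics


def delay_statistics(samples_ps):
    """Summarize a list of measured delays (in ps)."""
    mean = statistics.fmean(samples_ps)
    sigma = statistics.stdev(samples_ps)  # sample standard deviation
    return {
        "mean": mean,
        "sigma": sigma,
        "variability": sigma / mean,       # relative variability sigma/mu
        "worst_3sigma": mean + 3 * sigma,  # 3-sigma worst-case delay
    }


def synthetic_replicas(mean_ps, sigma_ps, count=256, seed=0):
    """Generate placeholder Gaussian samples; NOT measured data."""
    rng = random.Random(seed)
    return [rng.gauss(mean_ps, sigma_ps) for _ in range(count)]


if __name__ == "__main__":
    # Hypothetical sigma values; only the means come from the text.
    cp3l = delay_statistics(synthetic_replicas(17.3, 1.0, seed=0))
    tgpl = delay_statistics(synthetic_replicas(34.6, 1.7, seed=1))
    for name, stats in (("CP3L", cp3l), ("TGPL", tgpl)):
        print(name, {key: round(value, 3) for key, value in stats.items()})
    print("3-sigma worst-case speedup:",
          round(tgpl["worst_3sigma"] / cp3l["worst_3sigma"], 2))
```

The sketch shows how a latch can have a smaller σ yet a larger σ/μ when its mean delay is much lower, while still retaining a clear advantage in the 3-sigma worst-case delay.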
CP³L and CSP³L have approximately the same variability as TGPL in regard to setup time and leakage, from Tables II and III. On the other hand, CP³L and CSP³L have similar or 2× worse variability of CK-Q delay, compared to TGPL. From the perspective of VLSI system timing, the above-discussed D-Q delay variations are more impactful than CK-Q delay variations. Indeed, from Tables II and III, CK-Q variations are smaller than D-Q delay variations. In addition, critical paths typically go through a D-Q delay, rather than a CK-Q delay (late computations are finished during the transparency window). As expected, energy variations were found to be extremely small (on the order of 1%), hence related results are omitted for brevity. From Tables II and III, CP³L and CSP³L also have 1.7× to 2.6× less variation in hold time, which translates into a proportionally lower number of buffers inserted by place and route tools at the timing closure design phase.
TABLE IV
AVERAGE AND STANDARD DEVIATION OF MAIN PARAMETERS OF INTEREST (256 REPLICAS, MIN.-ED³ DESIGN)
For completeness, the proposed class of pulsed latches was also compared to other existing topologies that cover a much wider range of applications, from very high performance to very low energy. In addition to TGPL, we thus considered STFF for its very high performance [16], TGFF for its high energy efficiency at moderate performance [17], and ACFF for its high energy efficiency at low performance targets [18]. The results of the comparison are summarized in Table IV, where data are normalized to the best, and the results from simulations are based on post-layout extraction. From this table, the proposed class of pulsed latches exhibits the lowest D-Q delay, as it is 1.5× lower than that of the very fast STFF (which is confirmed to be slightly faster than TGPL for the minimum-ED³ design). Also, CP³L and CSP³L largely improve energy efficiency at high-performance design targets (i.e., minimum ED³), compared to such high-speed topologies. Indeed, CP³L and CSP³L reduce ED³ by 2.2× and 2.8× compared to TGPL and STFF, respectively. CP³L and CSP³L exhibit a significantly better energy efficiency even compared to topologies that are typically used for moderate to low-speed design targets. More specifically, from Table IV, CP³L and CSP³L designed for minimum ED improve the energy-delay product by 1.4× compared to the energy-efficient TGFF topology, and by 1.9× compared to the ultralow energy ACFF.
Summarizing, the proposed class of pulsed latches outperforms the state of the art in terms of pure performance, with D-Q delay improvements on the order of 1.5× or more. In current power-limited VLSI systems, the more exploitable advantage of CP³L and CSP³L is their high energy efficiency, as they outperform the state of the art by more than 2× when compared to topologies targeting high speed. In addition, the proposed pulsed latches exhibit a better energy efficiency (1.4× to 1.9×) even when compared to topologies targeting very low energy.
VIII. CONCLUSION
In this paper, a new class of pulsed latches has been introduced. Its push-pull final stage and split paths in the first stage enable a significant reduction in path and parasitic effort. Measurements on a 65-nm test chip demonstrated a 1.5× to 2× speed improvement compared to TGPL, which makes the proposed topologies the fastest ever reported. To the best of the authors' knowledge, for the first time the proposed latches are validated through measurements on 256 replicas. Measurements confirm the above advantages in the presence of variations.
More importantly, the energy efficiency of the proposed pulsed latches enables a significant improvement beyond the state of the art. Indeed, a 2.3× improvement over TGPL was found in terms of ED³ product, and a 1.3× improvement in the ED product. The area penalty paid by the proposed latches is 1.15× to 1.35× compared to TGPL, which is among the smallest existing latches. The proposed pulsed latches also exhibit a better energy efficiency (1.4× to 1.9×) compared to state-of-the-art topologies that target ultralow energy operation.
Finally, the CP³L and CSP³L were shown to be equivalent in terms of energy and performance, hence both topologies are equally worth considering when designing highly energy-efficient systems. The choice between CP³L and CSP³L is driven by preliminary design decisions on the clocking scheme. Indeed, CP³L does not allow for sharing a pulse generator, but has lower area than CSP³L if the pulse generator is included. Hence, CP³L is preferable when only a small subset of FFs needs to be replaced by a pulsed latch (i.e., in pipeline stages that have few critical paths, as might occur in random logic). In this case, latches tend to be far from each other, hence it does not make sense to share their pulse generator. On the other hand, CSP³L is preferable in systems where a significant number of FFs need to be replaced (e.g., pipeline stages with many critical paths, as occurs in regular modules).
APPENDIX
In this appendix, transistor sizing for maximum speed (i.e., minimum D-Q delay) is analytically discussed for CP³L, CSP³L, and TGPL. In the following, all transistor channel widths are normalized to the minimum allowed by the technology, the PN ratio is equal to two, and capacitances are normalized to the input capacitance of a minimum symmetric inverter (about 0.3 fF at 1 V in this technology). Also, transmission gates were sized with equally sized pMOS and nMOS transistors, keeper transistors are all minimum sized to reduce dissipation, channel lengths are generally minimum, and series transistors are equally sized. Different sizing (e.g., non-minimum channel length) and PN ratio were allowed in the pulse generator to adjust the transparency window width, while ensuring pulses with symmetrical rise/fall time.
A. Logical Effort Optimization of TGPL
The transistor sizes of TGPL to be optimized under the above assumptions are reported in Fig. 17. From this figure, the two independent sizes $W_1$ and $W_2$ need to be optimized in the critical path.
Fig. 17. Transistor sizes in TGPL.
From Fig. 17, the first stage of the critical D-Q path comprises the input inverter and the subsequent transmission gate. From usual logical effort calculations, the timing of the first stage is characterized by the following logical effort parameters:
$$g_1 = \frac{5}{3} \tag{A.1a}$$
$$h_1 = \frac{3W_2 + 4}{3W_1} \tag{A.1b}$$
$$p_1 = \frac{25}{9}. \tag{A.1c}$$
Similarly, the second stage is a simple inverter and hence has parasitic delay $p_2 = 1$ and logical effort $g_2 = 1$, whereas its electrical effort is immediately found to be
$$h_2 = \frac{W_L}{3W_2} \tag{A.2}$$
where $W_L$ is the load expressed as equivalent transistor width [22] (i.e., the transistor width such that its gate capacitance equals the load capacitance $C_L$ in Fig. 17), normalized to the minimum channel width. By setting $g_1 h_1 = g_2 h_2$, and neglecting the small contribution of minimum-sized transistors connected to the output of the first stage (i.e., transistors with normalized width equal to 1 in Fig. 17), from (A.1) and (A.2) the optimum $W_2$ that minimizes the D-Q delay is found to be $W_2 = (W_L W_1/5)^{1/2}$. By substituting $W_2$, (A.1) and (A.2) lead to the following minimum achievable delay:
$$D_{\min,\mathrm{TGPL}} = 2\sqrt{\frac{5}{3}\cdot\frac{3W_2 + 4}{3W_1}\cdot\frac{W_L}{3W_2}} + \frac{25}{9} + 1 \approx 2\sqrt{\frac{5}{9}\frac{W_L}{W_1}} + \frac{34}{9} = 2\sqrt{\frac{5}{3}\frac{C_L}{C_{in}}} + \frac{34}{9} \tag{A.3}$$
where we considered that the input capacitance $C_{in}$ in Fig. 17 is equal to the gate capacitance of a transistor with width $3W_1$, and the load capacitance $C_L$ is by definition the gate capacitance of a transistor with width $W_L$.
Finally, the detailed pulse generator sizing is very simple and is herein omitted, as the transistors of the output NAND gate in Fig. 2 must simply be sized to ensure the targeted slope (i.e., rise/fall time) of signal CP. Commonly adopted values of the clock slope range from FO3 to FO4 [3], where FOX is the slope of the output waveform of an inverter loaded by X inverters with the same size. Subsequently, inverters are easily sized to obtain the targeted transparency window.
B. Logical Effort Optimization of CP³L and CSP³L
The critical D-Q path of CP³L and CSP³L (they both have exactly the same path) with the related transistor sizes under the above assumptions is shown in Fig. 18. From this figure, CP³L and CSP³L have two independent D-Q paths with two stages: the first stage is a half latch (top latch for the fall path; bottom for the rise path), and the second is a transistor of the second push-pull stage (nMOS for the fall path; pMOS for the rise path).
Fig. 18. Transistor sizes in the critical path of CP³L and CSP³L.
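For the fall path, logical effort analysis leads to
$$g_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4a}$$
$$h_{1,\mathrm{FALL}} = \frac{W_2 + W_{PPR} + 1}{2W_1} \tag{A.4b}$$
$$p_{1,\mathrm{FALL}} = \frac{4}{3} \tag{A.4c}$$
whereas the rise path has the following parameters:
$$g_{1,\mathrm{RISE}} = \frac{2}{3} \tag{A.5a}$$
$$h_{1,\mathrm{RISE}} = \frac{2W_2 + W_{NPR} + 1}{W_1} \tag{A.5b}$$
$$p_{1,\mathrm{RISE}} = \frac{2}{3}. \tag{A.5c}$$
For the second stage, analysis for the fall path leads to
$$g_{2,\mathrm{FALL}} = \frac{1}{3} \tag{A.6a}$$
$$h_{2,\mathrm{FALL}} = \frac{4 + W_L}{W_2} \tag{A.6b}$$
$$p_{2,\mathrm{FALL}} = 1 \tag{A.6c}$$
whereas the rise path, whose pMOS transistor has width $2W_2$, has
$$g_{2,\mathrm{RISE}} = \frac{2}{3} \tag{A.7a}$$
$$h_{2,\mathrm{RISE}} = \frac{4 + W_L}{2W_2} \tag{A.7b}$$
$$p_{2,\mathrm{RISE}} = 1. \tag{A.7c}$$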
By imposing equal stage efforts, the minimum delay of the fall path is found to be
$$D_{\min,\mathrm{CP^3L},\mathrm{FALL}} = 2\sqrt{\frac{2\,(W_2 + W_{PPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{7}{3} \approx 2\sqrt{\frac{2C_L}{3C_{in}}} + \frac{7}{3} \tag{A.8}$$
where we neglected the small capacitance associated with the minimum-sized transistors in the keeper in Fig. 18 (i.e., $W_L \gg 4$) and the capacitance of the small precharge transistors (i.e., $W_2 \gg W_{PPR} + 1$). Under the same approximations
$$D_{\min,\mathrm{CP^3L},\mathrm{RISE}} = 2\sqrt{\frac{2\,(2W_2 + W_{NPR} + 1)}{3W_1}\cdot\frac{4 + W_L}{3W_2}} + \frac{5}{3} \approx 2\sqrt{\frac{4C_L}{3C_{in}}} + \frac{5}{3}. \tag{A.9}$$
From comparison of (A.8) and (A.9), the worst-case D-Q delay of CP³L and CSP³L is the rise path delay, given by (A.9).
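The closed-form results of this appendix can be cross-checked numerically. The sketch below is an illustration under stated assumptions, not the authors' sizing flow: it evaluates (A.3), (A.8), and (A.9) as functions of the electrical effort h = C_L/C_in (delays in normalized logical-effort units), and it verifies the TGPL optimum by a brute-force scan of W2 using (A.1) and (A.2) with the minimum-sized loads neglected. All function names, the grid step, and the example values W1 = 2 and h = 8 are hypothetical choices for this example.

```python
import math


def tgpl_delay(w1, w2, w_load):
    """Normalized TGPL D-Q delay from (A.1)-(A.2), neglecting the '+4' in h1."""
    stage1_effort = (5 / 3) * (3 * w2) / (3 * w1)  # g1 * h1
    stage2_effort = 1.0 * w_load / (3 * w2)        # g2 * h2
    return stage1_effort + stage2_effort + 25 / 9 + 1  # plus p1 + p2


def tgpl_min_delay(h):
    """Closed form (A.3), with h = C_L / C_in."""
    return 2 * math.sqrt(5 * h / 3) + 34 / 9


def cp3l_fall_min_delay(h):
    """Closed form (A.8), with h = C_L / C_in."""
    return 2 * math.sqrt(2 * h / 3) + 7 / 3


def cp3l_rise_min_delay(h):
    """Closed form (A.9), with h = C_L / C_in."""
    return 2 * math.sqrt(4 * h / 3) + 5 / 3


def brute_force_tgpl(w1, w_load, step=0.001, w2_max=50.0):
    """Scan W2 on a uniform grid; return (best W2, delay at best W2)."""
    candidates = (k * step for k in range(1, int(w2_max / step) + 1))
    best_w2 = min(candidates, key=lambda w2: tgpl_delay(w1, w2, w_load))
    return best_w2, tgpl_delay(w1, best_w2, w_load)


if __name__ == "__main__":
    w1, h = 2.0, 8.0       # hypothetical input size and electrical effort
    w_load = 3 * w1 * h    # C_in corresponds to width 3*W1, so W_L = 3*W1*h
    w2_best, d_best = brute_force_tgpl(w1, w_load)
    print("W2 (scan / analytic):", w2_best, math.sqrt(w_load * w1 / 5))
    print("TGPL delay (scan / A.3):", d_best, tgpl_min_delay(h))
    for effort in (1, 4, 16):
        worst = max(cp3l_fall_min_delay(effort), cp3l_rise_min_delay(effort))
        print(f"h={effort}: CP3L worst-case {worst:.2f}, "
              f"TGPL {tgpl_min_delay(effort):.2f}")
```

The scan agreeing with the analytic optimum $W_2 = (W_L W_1/5)^{1/2}$ confirms that equalizing the two stage efforts minimizes the delay, so that the two stages together contribute twice the geometric mean of their efforts.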
ACKNOWLEDGMENT
The authors would like to thank the sponsors of the Berkeley Wireless Research Center, STMicroelectronics, for chip
fabrication, and Prof. D. Blaauw and D. Sylvester for testing
support.
REFERENCES
[1] S. Naffziger and G. Hammond, "The implementation of the next-generation 64b Itanium microprocessor," in Proc. IEEE ISSCC, Feb. 2002, pp. 276–504.
[2] B. Dally, "Architectures and circuits for energy-efficient computing," in Proc. CICC, Sep. 2012, pp. 1–10.
[3] M. Alioto, E. Consoli, and G. Palumbo, "Flip-flop energy/performance versus clock slope and impact on the clock network design," IEEE Trans. Circuits Syst., vol. 57, no. 6, pp. 1273–1286, Jun. 2010.
[4] C. Giacomotto, N. Nedovic, and V. Oklobdzija, "The effect of the system specification on the optimal selection of clocked storage elements," IEEE J. Solid-State Circuits, vol. 42, no. 6, pp. 1392–1404, Jun. 2007.
[5] T. Fischer, S. Arekapudi, E. Busta, C. Dietz, M. Golden, S. Hilker, A. Horiuchi, K. A. Hurd, D. Johnson, H. McIntyre, S. Naffziger, J. Vinh, J. White, and K. Wilcox, "Design solutions for the Bulldozer 32nm SOI 2-core processor module in an 8-core CPU," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 78–80.
[6] P. Gronowski, W. Bowhill, R. Preston, M. Gowan, and R. Allmon, "High-performance microprocessor design," IEEE J. Solid-State Circuits, vol. 33, no. 5, pp. 676–686, May 1998.
[7] D. Bailey and B. Benschneider, "Clocking design and analysis for a 600-MHz Alpha microprocessor," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1627–1633, Nov. 1998.
[8] S. Naffziger, "High-performance processors in a power-limited world," in Proc. Symp. VLSI Circuits, Jun. 2006, pp. 93–97.
[9] (2011). International Technology Roadmap for Semiconductors [Online]. Available: http://www.itrs.net
[10] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part I - Methodology and design strategies," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 725–736, May 2011.
[11] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part II - Results and figures of merit," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 737–750, May 2011.
[12] M. Alioto, E. Consoli, and G. Palumbo, "From energy-delay metrics to constraints on the design of digital circuits," Int. J. Circuit Theory Appl., vol. 40, no. 8, pp. 815–834, Aug. 2012.
[13] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. ISLPED, Aug. 2001, pp. 147–152.
[14] V. Stojanovic and V. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," IEEE J. Solid-State Circuits, vol. 34, no. 4, pp. 536–548, Apr. 1999.
[15] H. Partovi, "Clocked storage elements," in Design of High-Performance Microprocessor Circuits. Piscataway, NJ, USA: IEEE Press, 2001, pp. 207–234.
[16] N. Nedovic, V. Oklobdzija, and W. Walker, "A clock skew absorbing flip-flop," in IEEE ISSCC Dig. Tech. Papers, Feb. 2003, pp. 342–497.
[17] D. Markovic, B. Nikolic, and R. Brodersen, "Analysis and design of low-energy flip-flops," in Proc. Int. Symp. Low Power Electron. Design, Aug. 2001, pp. 52–55.
[18] C. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77% energy-saving 22-transistor single-phase-clocking D-flip-flop with adaptive-coupling configuration in 40nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 338–340.
[19] E. Consoli, M. Alioto, G. Palumbo, and J. Rabaey, "Conditional push-pull pulsed latch with 726 fJ·ps energy delay product in 65nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 482–483.
[20] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003, pp. 702–705.
[21] M. Wieckowski, Y. M. Park, C. Tokunaga, D. W. Kim, Z. Foo, D. Sylvester, and D. Blaauw, "Timing yield enhancement through soft edge flip-flop based design," in Proc. CICC, Sep. 2008, pp. 543–546.
[22] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Mateo, CA, USA: Morgan Kaufmann Publishers, 1999.
[23] M. Alioto, E. Consoli, and G. Palumbo, "General strategies to design nanometer flip-flops in the energy-delay space," IEEE Trans. Circuits Syst., vol. 57, no. 7, pp. 1583–1596, Jul. 2010.
[24] R. Ho, K. W. Mai, and M. A. Horowitz, "The future of wires," Proc. IEEE, vol. 89, no. 4, pp. 490–504, Apr. 2001.
[25] T. Lang, E. Musoll, and J. Cortadella, "Individual flip-flops with gated clocks for low power datapaths," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 44, no. 6, pp. 507–516, Jun. 1997.
[26] S. Heo and K. Asanovic, "Load-sensitive flip-flop characterization," in Proc. CSW-VLSI, Apr. 2001, pp. 87–92.
[27] S. Heo, R. Krashinsky, and K. Asanovic, "Activity-sensitive flip-flop and latch selection for reduced energy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 9, pp. 1060–1064, Sep. 2007.
[28] V. Oklobdzija, V. Stojanovic, D. Markovic, and N. Nedovic, Digital System Clocking: High-Performance and Low-Power Aspects. New York, NY, USA: Wiley, 2003.
[29] J. D. Warnock, Y. H. Chan, W. V. Huott, S. M. Carey, M. F. Fee, H. Wen, M. J. Saccamango, F. Malgioglio, P. J. Meaney, D. W. Plass, Y. Chan, M. D. Mayo, G. Mayer, L. J. Sigal, D. L. Rude, R. Averill, M. Wood, T. Strach, H. H. Smith, B. W. Curran, E. M. Schwarz, L. Eisen, D. Malone, S. Weitzel, P. K. Mak, T. J. McPherson, and C. F. Webb, "A 5.2GHz microprocessor chip for the IBM zEnterprise system," in Proc. IEEE ISSCC Dig. Tech. Papers, Feb. 2011, pp. 70–72.
[30] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, T. Motokurumada, S. Okada, H. Yamashita, Y. Satsukawa, A. Konmoto, R. Yamashita, and H. Sugiyama, "A 1.3GHz fifth generation SPARC64 microprocessor," in Proc. DAC, Jun. 2003, pp. 702–705.
[31] N. Nedovic, W. Walker, and V. Oklobdzija, "A test circuit for measurement of clocked storage element characteristics," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1294–1304, Aug. 2004.
[32] S. G. Narendra and A. Chandrakasan, Leakage in Nanometer CMOS Technologies. New York, NY, USA: Springer-Verlag, 2006.
Elio Consoli was born in Catania, Italy, in 1983. He received the master's degree in microelectronic engineering from the University of Catania, Catania, in 2008, and the Ph.D. degree from the Department of Electrical, Electronic and Information Engineering, University of Catania, in 2012.
He was a Visiting Scholar with the Berkeley Wireless Research Center, UC Berkeley, Berkeley, CA, USA, in 2010. In 2011, he joined Maxim Integrated, Catania DC, as a Designer of analog and mixed-signal ICs for interface, switch and protection products. He is the co-author of several scientific papers in refereed international journals and conferences. His current research interests include clocking strategies and energy-efficient design for high-performance and low-power digital VLSI systems in nanometer CMOS technologies, as well as the definition of novel circuits and design techniques to be employed in ultralow-power duty-cycled wireless sensor nodes.
Gaetano Palumbo (F'07) was born in Catania, Italy, in 1964. He received the Laurea degree in electrical engineering and the Ph.D. degree from the University of Catania, Catania, in 1988 and 1993, respectively.
He has conducted courses on electronic devices, electronics for digital systems, and basic electronics since 1993. In 1994, he joined the Dipartimento Elettrico Elettronico e Sistemistico, now Dipartimento di Ingegneria Elettrica Elettronica e dei Sistemi, University of Catania, as a Researcher, becoming an Associate Professor in 1998. Since 2000, he has been a Full Professor with the same department. He carries out some of his research activities in collaboration with STMicroelectronics, Catania. He was the co-author of three books, CMOS Current Amplifiers; Feedback Amplifiers: Theory and Design; and Model and Design of Bipolar and MOS Current-Mode Logic (CML, ECL and SCL Digital Circuits) (Kluwer Academic Publishers, 1999, 2001, and 2005), and of a textbook on electronic devices in 2005. He is the author of over 380 scientific papers in refereed international journals (more than 150) and conferences. He is the co-author of several patents. His research has embraced digital circuits with emphasis on bipolar and MOS current-mode digital circuits, adiabatic circuits, and high-performance building blocks focused on achieving optimum speed within the constraint of low power operation. His current research interests include analog circuits with particular emphasis on feedback circuits, compensation techniques, current-mode approach, and low-voltage circuits.
Dr. Palumbo served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I for the topics "Analog Circuits and Filters" and "Digital Circuits and Systems" from 1999 to 2001 and from 2004 to 2005. From 2006 to 2007, he served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART II. From 2008 to 2011, he served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART I. In 2005, he was one of the 12 panelists in the scientific-disciplinary area 09 (industrial and information engineering) of the Committee for Evaluation of Italian Research, which has the aim of evaluating the Italian research from 2001 to 2003. In 2003, he received the Darlington Award. Since 2011, he has been a member of the Board of Governors of the IEEE CAS Society.
Jan M. Rabaey (M'83–SM'92–F'95) received the Ph.D. degree in applied sciences from Katholieke Universiteit Leuven, Leuven, Belgium.
He joined the faculty of the Electrical Engineering and Computer Science Department, University of California, Berkeley, CA, USA, in 1987, where he holds the Donald O. Pederson Distinguished Professorship. He is currently the Scientific Co-Director of the Berkeley Wireless Research Center, Berkeley, CA, USA, and the Director of the Berkeley Ubiquitous SwarmLab, Berkeley, CA, USA. His current research interests include the conception and implementation of next-generation integrated wireless systems.
Prof. Rabaey has received a wide range of major awards. He is a member of the Royal Flemish Academy of Sciences and Arts of Belgium.
Massimo Alioto (M'01–SM'07) was born in Brescia, Italy, in 1972. He received the Laurea (M.Sc.) degree in electronics engineering and the Ph.D. degree in electrical engineering from the University of Catania, Catania, Italy, in 1997 and 2001, respectively.
He is an Associate Professor with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. He was an Associate Professor with the Department of Information Engineering, University of Siena, Siena, Italy. In 2013, he was a Visiting Scientist with Intel Labs CRL, Hillsboro, OR, USA, working on ultra-scalable microarchitectures. From 2011 to 2012, he was a Visiting Professor with the University of Michigan, Ann Arbor, MI, USA, investigating active techniques for resiliency in near-threshold processors, error-aware VLSI design for wide energy scalability, and self-powered circuits. From 2009 to 2011, he was a Visiting Professor with BWRC, University of California, Berkeley, CA, USA, investigating next-generation ultra-low power circuits and wireless nodes. In 2007, he was a Visiting Professor with EPFL, Lausanne, Switzerland. He has authored or co-authored over 180 publications in journals (60+, mostly IEEE Transactions) and conference proceedings. He is the co-author of two books, Flip-Flop Design in Nanometer CMOS - from High Speed to Low Energy (Springer, 2013) and Model and Design of Bipolar and MOS Current-Mode Logic: CML, ECL and SCL Digital Circuits (Springer, 2005). His current research interests include ultra-low power VLSI circuits, self-powered and wireless nodes, near-threshold circuits for green computing, error-aware and widely energy-scalable VLSI circuits, and circuit techniques for emerging technologies.
Prof. Alioto was a member of the HiPEAC Network of Excellence (EU) and the MuSyC FCRP Center, USA. From 2010 to 2012, he was the Chair of the VLSI Systems and Applications Technical Committee of the IEEE Circuits and Systems Society, for which he was a Distinguished Lecturer from 2009 to 2010 and a member of the DLP Coordinating Committee from 2011 to 2012. He currently serves as an Associate Editor-in-Chief of the IEEE TRANSACTIONS ON VLSI SYSTEMS, and served as a Guest Editor of various journal special issues (including the issue on Ultra-Low Voltage Circuits and Systems for Green Computing published in 2012 in the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS PART II). He serves or has served as an Associate Editor of a number of journals (IEEE TRANSACTIONS ON VLSI SYSTEMS, ACM Transactions on Design Automation of Electronic Systems, IEEE TRANSACTIONS ON CAS - PART I, Microelectronics Journal, Integration, the VLSI Journal, Journal of Circuits, Systems, and Computers, Journal of Low Power Electronics, and Journal of Low Power Electronics and Applications). He was a Technical Program Chair of the ICECS in 2013, NEWCAS in 2012, and ICM in 2010 conferences, and a Track Chair in a number of conferences (ICCD, ISCAS, ICECS, VLSI-SoC, APCCAS, ICM).
