
IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. 23, NO. 3, JUNE 2013 1700605


16-Bit Wave-Pipelined Sparse-Tree RSFQ Adder
Mikhail Dorojevets, Member, IEEE, Christopher L. Ayala, Member, IEEE, Nobuyuki Yoshikawa, Member, IEEE,
and Akira Fujimaki, Member, IEEE
Abstract: In this paper, we discuss the architecture, design, and
testing of the first 16-bit asynchronous wave-pipelined sparse-tree
superconductor rapid single flux quantum adder implemented using
the ISTEC 10 kA/cm² ADP2.1 fabrication process. Compared
to the Kogge-Stone adder, our parallel-prefix sparse-tree adder
has better energy efficiency with significantly reduced complexity
(at the expense of latency) and almost no decrease in operation
frequency. The 16-bit adder core (without SFQ-to-dc and dc-to-SFQ
converters) has 9941 Josephson junctions occupying an area
of 8.5 mm². It is designed for the target operation frequency
of 30 GHz with the expected latency of 352 ps at the bias voltage of
2.5 mV. The adder chip was fabricated and successfully tested at
low frequency for all test patterns with measured bias margins of
+9.8%/−10.7%.
Index Terms: Adders, digital arithmetic, superconducting integrated
circuits, superconducting logic circuits.
I. INTRODUCTION
One of the most universal digital circuits for almost any
application is the adder. It is the fundamental building
block of Arithmetic Logic Units (ALUs) in general-purpose
and special-purpose digital signal microprocessors. Currently,
in the CMOS domain, the design space of adder structures
has been nearly exhausted, with only minimal improvements
shown over previous designs. In contrast, emerging digital
circuit technologies such as superconducting Rapid Single Flux
Quantum (RSFQ) logic open a way for researchers to explore
new design methodologies for extremely fast, energy-efficient
adders.
In RSFQ logic, most adder designs demonstrated to date
are bit-serial or digit-serial architectures which operate on a
single bit or a small group of bits sequentially at a very high
processing rate [1]–[6]. Such designs allow for simple clocking
and compact structures. However, the latency of serial adders
scales as O(n), where n is the number of bits per operand, which
leads to long latencies for 32-/64-bit operations in
general-purpose processors.
In the past, parallel architectures in RSFQ have been limited
to small data widths or relatively long latency ripple-carry
adders [7]–[9]. One study evaluated 32-/64-bit parallel
Kogge-Stone RSFQ adders using co-flow clocking [10].
Manuscript received September 28, 2012; accepted December 9, 2012. Date
of publication December 12, 2012; date of current version January 12, 2013.
M. Dorojevets and C. L. Ayala are with the Department of Electrical and
Computer Engineering, Stony Brook University, Stony Brook, NY 11794-2350
USA (e-mail: mdorojevets@gmail.com; chris.ayala@ieee.org).
N. Yoshikawa is with the Department of Electrical and Computer Engineer-
ing, Yokohama National University, Yokohama 240-8501 Japan. He is also
with CREST, Japan Science and Technology Agency (JST), Japan (e-mail:
yoshi@yoshilab.dnj.ynu.ac.jp).
A. Fujimaki is with the Department of Quantum Engineering, Nagoya
University, Nagoya 464-8603 Japan (e-mail: fujimaki@nuee.nagoya-u.ac.jp).
Digital Object Identifier 10.1109/TASC.2012.2233846
In the effort of realizing scalable, high-performance, fully
parallel designs, a new technique of asynchronous hybrid
wave-pipelining for RSFQ circuits has been developed at Stony
Brook University (SBU) [11], [12]. Later, as a result of the
collaboration between the SBU and HYPRES designers, an
8-bit wave-pipelined ALU was successfully designed and
fabricated, and demonstrated correct operation at the rate of
20 GHz [13], [14].
In this paper, we present the design of the first 16-bit
asynchronous parallel adder implemented in RSFQ logic. It builds
upon the proven hybrid wave-pipelining techniques to provide
16-bit wide processing and synchronization. It incorporates an
energy-efficient, low-complexity sparse-tree structure with a very
high processing rate. The work is based on a design study for
a scalable 32-bit wave-pipelined sparse-tree adder conducted at
SBU [15].
II. 16-BIT ADDER MICROARCHITECTURE
The microarchitecture of the adder has two main features:
asynchronous hybrid wave-pipelined processing and a prefix
sparse-tree carry generate-propagate structure for arithmetic.
A. Hybrid Wave-Pipelining
The use of data and clock waves in two-dimensional RSFQ
circuits such as multipliers with counter-flow clocking was
first proposed in [16]. In data-driven wave-pipelining, data
waves self-propagate through combinational (non-clocked)
logic gates without any need for clock signals. The data waves
are followed by reset waves that clean up the residual logic
states of the gates before the next data wave arrival.
The maximum clock rate is limited by (1) the time
differences in data propagation paths through combinational circuits,
and (2) the minimum time gap between data and reset signals.
The latter constraint is similar to the one in co-flow clocking
that deals with setup and hold time requirements of clocked
gates [17]. In co-flow clocking, data are pushed through the
circuits by clock signals, so the computation process involves
setup time overhead at each stage. In contrast, wave-pipelined
circuits are data-driven, not clock-driven, and as a result, they
have no setup time contribution to the latency of operations.
In processor design, the most viable approach is to use
asynchronous wave-pipelining for units with regular structure,
and co-flow clocking for irregular circuits such as control logic.
B. Prefix Sparse-Tree Core
High-performance parallel adders typically use prefix trees
which generate carries in $\log_2(n)$ time, where n is the number
of bits of the datapath. The Kogge-Stone adder (KSA) [18] is
considered to be the fastest among parallel-prefix adders. A
KSA was successfully used in the 20 GHz 8-bit RSFQ
ALU [14]. However, KSAs have very high complexity and
a tremendous amount of wiring congestion. Further
enhancements to the KSA prefix structure such as the sparse-tree
configuration have been proposed and used in high-performance
Intel processors [19].
1051-8223/$31.00 © 2012 IEEE
Fig. 1. Structural diagram of the 16-bit sparse-tree adder. The carry-out (Cout) is the left-most bit and to the right of it is the most significant bit of the Sum
result (bit 15). The right-most bit is the least significant bit (bit 0).
In our 16-bit RSFQ adder design, we chose the sparse-tree
structure to reduce the number of Josephson junctions (JJs)
needed for its implementation without any significant effect on
its processing rate. As a side effect, this will also lead to a more
energy-efficient design by reducing the total bias current and
power consumption. Fig. 1 illustrates the structural diagram of
our sparse-tree adder. It consists of the following three stages:
Initialization, Prefix-Tree and Summation.
The Initialization stage receives two 16-bit data operands A
and B to create bitwise Generate (G) and Propagate (P) signals
which will be merged in a logarithmic manner in the Prefix-Tree
stage.
The Prefix-Tree stage consists of Carry-Merge (CM) blocks
to merge the prefix signals and provide a group carry to each
4-bit summation block. In contrast, the Kogge-Stone prefix
tree provides a carry to every individual bit of the adder.
DFF (D flip-flop) buffers appropriately delay prefix and
bitwise P signals until they are ready to be merged or
processed at the Summation stage, respectively. The first three
levels of the Prefix-Tree also perform the ripple-carry addition
within each 4-bit group before data arrive at the Summation
stage.
The Summation stage computes the final sum with 4-bit
carry-skip adders [20]. The lower half of the adder (bits 7:0)
can start the Summation stage early because all appropriate
signals are ready. The upper half of the adder (bits 15:8) must
wait until carries for this upper half are calculated by the very
last level of the Prefix-Tree stage.
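The three stages can be illustrated with a purely functional sketch. It is not the authors' circuit: the function names `merge` and `sparse_tree_add` are hypothetical, timing, DFF buffering and reset waves are ignored, and the Summation stage is simplified to a ripple inside each 4-bit group rather than the carry-skip blocks of [20]. It only shows how bitwise G/P signals, a prefix merge over 4-bit groups, and per-group summation fit together.

```python
WIDTH = 16   # operand width in bits
GROUP = 4    # size of each summation group


def merge(hi, lo):
    """Carry-merge of two (G, P) pairs; `hi` covers the more significant span."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sparse_tree_add(a, b):
    """Return (sum, cout) for two WIDTH-bit unsigned operands (carry-in = 0)."""
    # Initialization: bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(WIDTH)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]

    # (G, P) of each 4-bit group, merged from its LSB upward.
    groups = []
    for base in range(0, WIDTH, GROUP):
        gp = (g[base], p[base])
        for i in range(base + 1, base + GROUP):
            gp = merge((g[i], p[i]), gp)
        groups.append(gp)

    # Log-depth prefix over the group signals only (the "sparse" part):
    # afterwards prefix[k][0] is the carry out of group k.
    prefix = list(groups)
    dist = 1
    while dist < len(prefix):
        prefix = [merge(prefix[k], prefix[k - dist]) if k >= dist else prefix[k]
                  for k in range(len(prefix))]
        dist *= 2

    # Summation: each group uses its group carry-in and ripples locally.
    total = 0
    for k, base in enumerate(range(0, WIDTH, GROUP)):
        c = prefix[k - 1][0] if k > 0 else 0
        for i in range(base, base + GROUP):
            total |= (p[i] ^ c) << i
            c = g[i] | (p[i] & c)
    return total, prefix[-1][0]
```

Only four group carries are produced by the prefix step here, whereas a Kogge-Stone tree would produce one carry per bit, which is the complexity reduction the sparse tree trades against some extra latency inside each group.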
III. 16-BIT ADDER CHIP DESIGN
The design of the 30 GHz 16-bit adder has been done by
the SBU designers using the CONNECT cell library developed
for the ISTEC 1.0 µm 10 kA/cm² Advanced Process (ADP)
[21] by Yokohama National University and Nagoya
University. Through a hierarchical design process, smaller sub-blocks
were thoroughly evaluated and tested for functionality and for
satisfying strict timing requirements. These sub-blocks have a
horizontal bit pitch of 150 µm. These smaller sub-blocks were
then assembled to create larger modules that form the 16-bit
adder.
The Initialization stage consists of GPR_INIT logic blocks,
one for each bit. The GPR_INIT creates the bitwise prefix
functions described as $G_i = A_i \cdot B_i$ and $P_i = A_i \oplus B_i$, where
i is the bit index column ranging from 15 down to 0 in the 16-bit
adder. These functions are easily realized through clocked AND
and XOR gates in a co-flow clocking arrangement. The clock is
the Rdy signal provided to all bits through a distribution tree
built with passive transmission lines (PTLs). Additionally, it is
necessary to create the trailing reset signal R which will be used
to reset the asynchronous elements in the Prefix-Tree. Signal R
is a copy of the Rdy signal for each bit with JJ-based delay lines
to ensure data signals are processed before reset follows in the
asynchronously wave-pipelined Prefix-Tree.
The Prefix-Tree stage is built with CM blocks to merge the
prefix signals as shown in Fig. 1. Merging of the prefix signals
is described in [18]. It is implemented with CFFs (resettable
Muller C-flip-flop gates based on the Muller C-element [22],
[23]) and confluence buffers used as asynchronous OR gates.
The CFFs provide the following functions. First, they behave
as asynchronous AND gates. Second, they are used as key
re-synchronization elements for wave-pipelining allowing data
waves to wait until all their appropriate signals arrive. Due to
the encoding of the prefix signals, confluence buffers can be
safely used as asynchronous OR gates without any danger of
violating the time separation requirement of their input pulses.
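The two roles of the CFF described above (asynchronous AND, and waiting point for late signals) can be pictured with a small behavioral model. This is a sketch under stated assumptions, not the CONNECT cell: the class name `ResettableCElement` and its methods are hypothetical, pulses are modeled as method calls, and the element is assumed to return to its empty state after it fires.

```python
class ResettableCElement:
    """Behavioral model: fires one output pulse once both inputs have pulsed."""

    def __init__(self):
        self.seen_a = False
        self.seen_b = False

    def pulse(self, port):
        """Register an input pulse on port 'a' or 'b'; return True if the output fires."""
        if port == "a":
            self.seen_a = True
        elif port == "b":
            self.seen_b = True
        else:
            raise ValueError("port must be 'a' or 'b'")
        if self.seen_a and self.seen_b:
            # Both inputs arrived: emit the output and clear (assumed behavior).
            self.seen_a = self.seen_b = False
            return True
        return False

    def reset(self):
        """Trailing reset wave: discard any stored single input."""
        self.seen_a = self.seen_b = False
```

In this model a data wave that delivers a pulse on only one input produces no output (logical AND yields 0), and the trailing reset clears the stored pulse so that it cannot pair with an input from the next wave.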
Fig. 2. Simulated dc bias margins for the 16-bit adder design measured from
19.2 GHz (52 ps cycle time) to 40.0 GHz (25 ps cycle time).
The Summation stage has a 4-bit carry-skip adder block [20]
for each 4-bit group. In our carry-skip adders, the generation of
a carry-in to the two most significant bits of the group is done
in parallel with the calculation of the two least significant sum
bits in the group.
For each group, the signal Propagate-Propagate (PP)
obtained from the Prefix-Tree denotes whether a carry-in will
propagate through the lower 2 bits of the group or not. This
propagation is done by a clocked AND gate which creates the
correct carry-in of the upper 2 bits of the group, thus skipping
the slower ripple-carry produced from the lower 2 bits. The
summation in the upper and lower halves of the 4-bit groups
is done by T1 (toggle flip-flops with clocked output [24]) and
clocked XOR gates. Trailing reset signals are used to clock the
T1 and XOR gates in the Summation blocks.
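The skip path can be written out as Boolean logic. The following is a minimal functional sketch, assuming that the carry generated inside the lower two bits (called `g_low` here) is available alongside PP; the function name `carry_skip_group` and all variable names are illustrative, and the gate-level mapping to T1 and clocked XOR cells is not modeled. In the actual design the group carries come from the Prefix-Tree; `cout` is computed locally here only so the block can be checked on its own.

```python
def carry_skip_group(a, b, cin):
    """Add two 4-bit values with carry-in; return (sum4, cout)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]

    # Signals of the lower two bits that do not depend on cin.
    g_low = g[1] | (p[1] & g[0])   # carry generated within bits 1:0
    pp = p[1] & p[0]               # Propagate-Propagate over bits 1:0

    # Skip path: carry into bit 2 without waiting for the lower ripple.
    c2 = g_low | (pp & cin)

    # Lower half: sum bits 0 and 1 ripple from cin.
    s0 = p[0] ^ cin
    c1 = g[0] | (p[0] & cin)
    s1 = p[1] ^ c1

    # Upper half: sum bits 2 and 3 start from the skipped carry c2.
    s2 = p[2] ^ c2
    c3 = g[2] | (p[2] & c2)
    s3 = p[3] ^ c3
    cout = g[3] | (p[3] & c3)

    return s0 | (s1 << 1) | (s2 << 2) | (s3 << 3), cout
```

Because `c2` depends on `cin` through a single AND with `pp`, the upper two sum bits can be evaluated in parallel with the lower two, which is the point of the skip.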
To facilitate high-speed on-chip testing, we designed three
supplemental circuits, namely: a clock generator, an input shift
register, and an output compressor.
The clock generator uses a 16 pulse-train design where a
single clock pulse is sequentially split 16 times and fed back
into a ladder structure of confluence buffers. An appropriate
number of JTL delays are inserted to achieve a particular
range of clock frequencies. Two such structures are integrated
into the clock generator to provide two frequency ranges:
15–25.5 GHz and 23.5–41 GHz. By adjusting the bias voltage
of the clock generator, these two ranges cover the full spectrum
of clock frequencies that the 16-bit adder is expected to operate
at. The clock generator characteristics were obtained through
Verilog simulation using gate timing parameters from JJ
circuit-level modeling. An additional one-input pulse, one-output pulse
low-frequency mode is also available.
A 16-bit parallel-load/parallel-output shift register provides
test vectors to the inputs of the adder at high speed. The
high-speed outputs of the adder are captured by the output
compressor to create an XOR signature of the results using T1
gates. A Verilog testbench simulation has been written to verify
functional correctness and obtain frequency-dependent dc bias
margins. Fig. 2 shows that the 16-bit adder can operate up to
38.5 GHz. It has +20%/−16% dc bias margins at the target
clock rate of 30 GHz.
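The frequencies quoted here map to cycle times by simple reciprocals, which a short calculation makes explicit. The helper names `cycle_time_ps` and `waves_in_flight` are hypothetical, and the second quantity is only an illustrative estimate: it divides the expected 352 ps latency by the cycle time, assuming a new operand pair is issued every cycle.

```python
def cycle_time_ps(freq_ghz):
    """Cycle time in picoseconds for a clock frequency given in GHz."""
    return 1000.0 / freq_ghz


def waves_in_flight(latency_ps, freq_ghz):
    """Estimated number of data waves inside the adder at the same time."""
    return latency_ps / cycle_time_ps(freq_ghz)


if __name__ == "__main__":
    # Frequencies mentioned in the text and in the Fig. 2 caption.
    for f in (19.2, 30.0, 38.5, 40.0):
        print(f"{f:5.1f} GHz -> {cycle_time_ps(f):5.2f} ps per cycle")
    # Expected latency of 352 ps at the 30 GHz target rate.
    print(f"waves in flight at 30 GHz: {waves_in_flight(352.0, 30.0):.2f}")
```

At the 30 GHz target the cycle time is about 33.3 ps, so under the stated assumption roughly ten to eleven operand waves would be propagating through the adder concurrently.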
Fig. 3. Photograph of the fabricated 16-bit adder with on-chip high-frequency
test circuits. The labeled components are the following: (A) dc-SFQ converters;
(B) clock generator; (C) input shift register; (D) PTL-based splitter tree;
(E) adder core; (F) output compressor; and (G) TFF-based SFQ-dc converters.
TABLE I
SUMMARY OF THE RSFQ ADDER CHARACTERISTICS
The entire layout of the chip with high-frequency test
circuits including moats, I/O, clock generator, input shift register,
output compressor, and the adder core fits within a 4.25 mm ×
7.00 mm area (Fig. 3) with the adder core circuit occupying the
area of 8.52 mm² (2.73 mm × 3.12 mm). Table I summarizes
the characteristics of the 16-bit adder.
IV. TESTING
Two test chips were fabricated on two separate tape-outs.
The first chip consists of the 16-bit adder for low-frequency
testing in the interest of meeting the tape-out schedule and
area constraints. The second chip consists of the same design
with additional high-frequency testing circuits described in the
previous section. The SBU team tested several samples of each
chip design using the measurement facilities at Yokohama.
For low-frequency testing of the first chip, we supplied
16-bit data operands A and B in parallel, followed by an
appropriately delayed Rdy signal to begin the operation. A
total of five oscilloscopes (from three different manufacturers)
captured and displayed the 16-bit output result, along with Cout
and Rout (Rdy output) from the adder created by TFF-based
SFQ-to-dc circuits [21].
Performing an exhaustive test by covering the entire range
of two 16-bit inputs is impractical, so we opted for a white-box
testing approach. First, we handpicked critical test vectors to
exercise all critical intergroup propagation paths of the
Prefix-Tree. Second, we tested the very important case when a carry
generated from the least significant bit propagates all the way
to the Cout output, producing all zeroes for the Sum output and
a carry-out from the most significant bit. Third, the output sum
bits of all individual 4-bit groups were experimentally tested
for the cases when carries were generated within 4-bit groups.
Besides these critical tests, we also applied several random
vectors as a sanity check. In total, 29 critical test vectors were
used in addition to 10 random vectors.
Fig. 4. Output waveforms for three random tests: (a) 56811 + 14643 =
71454; (b) 8724 + 50892 = 59616; (c) 13982 + 64973 = 78955. Sum[0]
is the least significant bit. Each output result is accompanied by Rout. Tests
were sent twice to show two transitions for outputs that are logical 1. The lack
of a nearby ground pad to couple with the Cout signal led to its noisy output.
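To make the categories of test vectors concrete, the sketch below builds a few representative operand pairs and their expected 16-bit Sum and Cout values. The specific vectors are illustrative assumptions, not the 29 critical vectors actually used (which are not listed here), and the function names are hypothetical; only the three random cases are taken from Fig. 4. For scale, an exhaustive test would need 2^32 (about 4.3 billion) operand pairs.

```python
MASK = 0xFFFF  # 16-bit Sum output


def expected(a, b):
    """Reference (sum, cout) for a 16-bit unsigned addition."""
    total = a + b
    return total & MASK, total >> 16


def full_propagate_vector():
    """Carry generated at bit 0 propagates through every bit up to Cout."""
    return 0xFFFF, 0x0001


def group_generate_vectors():
    """One illustrative pair per 4-bit group: the carry is generated at the
    group's lowest bit and propagates through the rest of that group."""
    return [(0xF << base, 0x1 << base) for base in range(0, 16, 4)]


# Random cases reported in Fig. 4.
FIG4_RANDOM = [(56811, 14643), (8724, 50892), (13982, 64973)]

if __name__ == "__main__":
    vectors = [full_propagate_vector()] + group_generate_vectors() + FIG4_RANDOM
    for a, b in vectors:
        s, c = expected(a, b)
        print(f"A={a:#06x} B={b:#06x} -> Sum={s:#06x} Cout={c}")
```

The full-propagate pair yields Sum = 0 with Cout = 1, matching the all-zero Sum case described above.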
All test vectors in our test plan successfully produced the
expected output for the low-frequency testing of the first
16-bit adder chip. Fig. 4 shows the output waveforms for three
of the tested random cases. We measured the overlapping dc
bias margin across all test cases to be +9.8%/−10.7%.
In the low-frequency mode of operation of the second chip
with high-frequency testing circuits, the serial output of the
input shift register showed several but not all bits functioning
correctly for operand A. The serial output for operand B could
not be stabilized. Individual testing of the 16 vectors revealed
incorrect or unstable output across a majority of the bits. At
high frequency, we observed similar results.
Our second chip was on the ADP627 tape-out and
measurements have shown that the circuit parameters appeared to be at
a lower standard of quality than usual for this fabrication run
[25]. We tentatively attribute the failure of the high-speed test
to fabrication difficulties.
V. CONCLUSION
We have designed, fabricated, and tested the first 16-bit
wave-pipelined sparse-tree RSFQ adder chip with the core
complexity of 9941 JJs and the target operation rate of 30 GHz.
We have successfully demonstrated the correct operation of the
chip at low frequency, passing all carefully chosen test vectors
with a measured bias margin of +9.8%/−10.7%. Another
adder chip consisting of 12785 junctions with additional
on-chip circuits for 30 GHz testing was also fabricated but its
testing showed the need for another fabrication run.
ACKNOWLEDGMENT
The National Institute of Advanced Industrial Science and
Technology (Japan) has partially contributed to the circuit fab-
rication. The authors are also grateful to Dr. M. Tanaka, Nagoya
University, and Taiichi Kato and Y. Shimamura, Yokohama
National University, for their assistance with SFQ CAD tools
and testing equipment.
REFERENCES
[1] M. Tanaka, H. Akaike, A. Fujimaki, Y. Yamanashi, N. Yoshikawa,
S. Nagasawa, K. Takagi, and N. Takagi, "100-GHz single-flux-quantum
bit-serial adder based on 10-kA/cm² niobium process," IEEE Trans. Appl.
Supercond., vol. 21, no. 3, pp. 792–796, Jun. 2011.
[2] A. F. Kirichenko and O. A. Mukhanov, "Implementation of novel
push-forward RSFQ Carry-Save Serial Adders," IEEE Trans. Appl.
Supercond., vol. 5, no. 2, pp. 3010–3013, Jun. 1995.
[3] A. Y. Kidiyarova-Shevchenko, K. Y. Platov, E. M. Tolkacheva, and
I. A. Kataeva, "RSFQ asynchronous serial multiplier and spreading codes
generator for multiuser detector," IEEE Trans. Appl. Supercond., vol. 13,
no. 2, pp. 429–432, Jun. 2003.
[4] S. V. Polonsky and A. V. Rylyakov, "RSFQ arithmetic blocks for DSP
applications," IEEE Trans. Appl. Supercond., vol. 5, no. 2, pp. 2823–2826,
Jun. 1995.
[5] H. Park, Y. Yamanashi, N. Yoshikawa, M. Tanaka, and A. Fujimaki,
"Design of fast digit-serial adders using SFQ logic circuits," IEICE
Electronics Express, vol. 6, no. 19, pp. 1408–1413, 2009.
[6] S. V. Polonsky, V. K. Semenov, P. I. Bunyk, A. F. Kirichenko,
A. Y. Kidiyarov-Shevchenko, O. A. Mukhanov, P. N. Shevchenko,
D. F. Schneider, D. Y. Zinoviev, and K. K. Likharev, "New RSFQ circuits
Josephson junction digital devices," IEEE Trans. Appl. Supercond., vol. 3,
no. 1, pp. 2566–2577, Mar. 1993.
[7] J. Y. Kim, S. Kim, and J. Kang, "Construction of an RSFQ 4-bit ALU with
half adder cells," IEEE Trans. Appl. Supercond., vol. 15, no. 2,
pp. 308–311, Jun. 2005.
[8] Q. P. Herr, N. Vukovic, C. A. Mancini, K. Gaj, V. Adler, E. G. Friedman,
A. Krasniewski, M. F. Bocko, and M. J. Feldman, "Design and low speed
testing of a four-bit RSFQ multiplier-accumulator," IEEE Trans. Appl.
Supercond., vol. 7, no. 2, pp. 3168–3171, Jun. 1997.
[9] R. Nakamoto, S. Sakuraba, T. Onomi, S. Sato, and K. Nakajima, "4-bit
SFQ Multiplier Based on Booth Encoder," IEEE Trans. Appl. Supercond.,
vol. 21, no. 3, pp. 852–855, Jun. 2011.
[10] P. Bunyk and P. Litskevitch, "Case study in RSFQ design: Fast pipelined
parallel adder," IEEE Trans. Appl. Supercond., vol. 9, no. 2,
pp. 3714–3720, Jun. 1999.
[11] M. Dorojevets, C. Ayala, and A. Kasperek, "Development and evaluation
of design techniques for high-performance wave-pipelined wide datapath
RSFQ processors," in Proc. 12th Int. Supercond. Electron. Conf.,
Fukuoka, Japan, 2009, SP-P46.
[12] M. Dorojevets, C. L. Ayala, and A. K. Kasperek, "Data-flow
microarchitecture for wide datapath RSFQ processors: Design study," IEEE Trans.
Appl. Supercond., vol. 21, no. 3, pp. 787–791, Jun. 2011.
[13] T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and
O. Mukhanov, "8-bit asynchronous wave-pipelined RSFQ arithmetic-logic
unit," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 847–851,
Jun. 2011.
[14] T. V. Filippov, A. Sahu, A. F. Kirichenko, I. V. Vernik, M. Dorojevets,
C. L. Ayala, and O. A. Mukhanov, "20 GHz operation of an asynchronous
wave-pipelined RSFQ arithmetic-logic unit," Phys. Proc., vol. 36,
pp. 59–65, 2012.
[15] M. Dorojevets and C. L. Ayala, "Design and evaluation of a 20 GHz
32-bit wave-pipelined sparse tree adder," Ultra-High-Speed-Comput. Lab
Tech. Rep., Stony Brook Univ., Stony Brook, NY, 2010.
[16] K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new
Josephson-junction technology for sub-terahertz-clock-frequency digital
systems," IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3–28,
Mar. 1991.
[17] K. Gaj, E. G. Friedman, and M. J. Feldman, "Timing of multi-gigahertz
rapid single flux quantum digital circuits," J. VLSI Signal Process.,
vol. 276, pp. 247–276, Nov. 1997.
[18] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient
solution of a general class of recurrence equations," IEEE Trans. Comput.,
vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[19] S. Mathew, M. Anders, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz
130-nm address generation unit with 32-bit sparse-tree adder core," IEEE
J. Solid-State Circuits, vol. 38, no. 5, pp. 689–695, May 2003.
[20] A. G. M. Strollo and E. Napoli, "A fast and area efficient complementary
pass-transistor logic carry-skip adder," in Proc. 21st Int. Conf.
Microelectron., Sep. 1997, vol. 2, pp. 701–704.
[21] H. Akaike, M. Tanaka, K. Takagi, I. Kataeva, R. Kasagi, A. Fujimaki,
M. Igarashi, H. Park, Y. Yamanashi, N. Yoshikawa, K. Fujiwara,
S. Nagasawa, M. Hidaka, and N. Takagi, "Design of single flux quantum
cells for a 10-Nb-layer process," Phys. C, Supercond., vol. 469, no. 15–20,
pp. 1670–1673, Oct. 2009.
[22] O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii,
"RSFQ logic arithmetic," IEEE Trans. Magn., vol. 25, no. 2, pp. 857–860,
Mar. 1989.
[23] Z. J. Deng, N. Yoshikawa, J. A. Tierno, S. R. Whiteley, and T. van Duzer,
"Asynchronous circuits and systems in superconducting RSFQ digital
technology," in Proc. 4th Int. Symp. Adv. Res. Asynchronous Circuits Syst.,
Apr. 1998, pp. 274–285.
[24] S. Polonsky, V. K. Semenov, and A. F. Kirichenko, "Single flux, quantum
B flip-flop and its possible applications," IEEE Trans. Appl. Supercond.,
vol. 4, no. 1, pp. 9–18, Mar. 1994.
[25] M. Hidaka, S. Nagasawa, K. Hinode, and T. Satoh, "Device yield in
Nb-nine-layer circuit fabrication process," IEEE Trans. Appl. Supercond.,
vol. 23.
