IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. 23, NO. 3, JUNE 2013, 1700605
16-Bit Wave-Pipelined Sparse-Tree RSFQ Adder

Mikhail Dorojevets, Member, IEEE, Christopher L. Ayala, Member, IEEE, Nobuyuki Yoshikawa, Member, IEEE, and Akira Fujimaki, Member, IEEE

Abstract: In this paper, we discuss the architecture, design, and testing of the first 16-bit asynchronous wave-pipelined sparse-tree superconductor rapid single flux quantum adder implemented using the ISTEC 10 kA/cm² ADP2.1 fabrication process. Compared to the Kogge-Stone adder, our parallel-prefix sparse-tree adder has better energy efficiency with significantly reduced complexity (at the expense of latency) and almost no decrease in operation frequency. The 16-bit adder core (without SFQ-to-dc and dc-to-SFQ converters) has 9941 Josephson junctions occupying an area of 8.5 mm². It is designed for the target operation frequency of 30 GHz with the expected latency of 352 ps at the bias voltage of 2.5 mV. The adder chip was fabricated and successfully tested at low frequency for all test patterns with measured bias margins of +9.8%/-10.7%.

Index Terms: Adders, digital arithmetic, superconducting integrated circuits, superconducting logic circuits.

I. INTRODUCTION

One of the most universal digital circuits for almost any application is the adder. It is the fundamental building block of Arithmetic Logic Units (ALUs) in general-purpose and special-purpose digital signal microprocessors. Currently, in the CMOS domain, the design space of adder structures has been nearly exhausted, with only minimal improvements shown over previous designs. In contrast, emerging digital circuit technologies such as superconducting Rapid Single Flux Quantum (RSFQ) logic open a way for researchers to explore new design methodologies for extremely fast, energy-efficient adders.

In RSFQ logic, most adder designs demonstrated to date are bit-serial or digit-serial architectures, which operate on a single bit or a small group of bits sequentially at a very high processing rate [1]-[6].
Such designs allow for simple clocking and compact structures. However, the latency of serial adders scales as O(n), where n is the number of bits per operand, which leads to long latencies for 32-/64-bit operations in general-purpose processors.

In the past, parallel architectures in RSFQ have been limited to small data widths or relatively long-latency ripple-carry adders [7]-[9]. One study evaluated 32-/64-bit parallel Kogge-Stone RSFQ adders using co-flow clocking [10].

Manuscript received September 28, 2012; accepted December 9, 2012. Date of publication December 12, 2012; date of current version January 12, 2013.
M. Dorojevets and C. L. Ayala are with the Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794-2350 USA (e-mail: mdorojevets@gmail.com; chris.ayala@ieee.org).
N. Yoshikawa is with the Department of Electrical and Computer Engineering, Yokohama National University, Yokohama 240-8501 Japan. He is also with CREST, Japan Science and Technology Agency (JST), Japan (e-mail: yoshi@yoshilab.dnj.ynu.ac.jp).
A. Fujimaki is with the Department of Quantum Engineering, Nagoya University, Nagoya 464-8603 Japan (e-mail: fujimaki@nuee.nagoya-u.ac.jp).
Digital Object Identifier 10.1109/TASC.2012.2233846

In the effort of realizing scalable, high-performance, fully parallel designs, a new technique of asynchronous hybrid wave-pipelining for RSFQ circuits has been developed at Stony Brook University (SBU) [11], [12]. Later, as a result of the collaboration between the SBU and HYPRES designers, an 8-bit wave-pipelined ALU was successfully designed and fabricated, and it demonstrated correct operation at the rate of 20 GHz [13], [14].

In this paper, we present the design of the first 16-bit asynchronous parallel adder implemented in RSFQ logic. It builds upon the proven hybrid wave-pipelining techniques to provide 16-bit wide processing and synchronization.
It incorporates an energy-efficient, low-complexity sparse-tree structure with a very high processing rate. The work is based on a design study for a scalable 32-bit wave-pipelined sparse-tree adder conducted at SBU [15].

II. 16-BIT ADDER MICROARCHITECTURE

The microarchitecture of the adder has two main features: asynchronous hybrid wave-pipelined processing and a prefix sparse-tree carry generate-propagate structure for arithmetic.

A. Hybrid Wave-Pipelining

The use of data and clock waves in two-dimensional RSFQ circuits such as multipliers with counter-flow clocking was first proposed in [16]. In data-driven wave-pipelining, data waves self-propagate through combinational (non-clocked) logic gates without any need for clock signals. The data waves are followed by reset waves that clean up the residual logic states of the gates before the next data wave arrives.

The maximum clock rate is limited by (1) the time differences in data propagation paths through combinational circuits, and (2) the minimum time gap between data and reset signals. The latter constraint is similar to the one in co-flow clocking that deals with setup and hold time requirements of clocked gates [17]. In co-flow clocking, data are pushed through the circuits by clock signals, so the computation process involves setup time overhead at each stage. In contrast, wave-pipelined circuits are data-driven, not clock-driven, and as a result, they have no setup time contribution to the latency of operations.

In processor design, the most viable approach is to use asynchronous wave-pipelining for units with regular structure, and co-flow clocking for irregular circuits such as control logic.

B. Prefix Sparse-Tree Core

High-performance parallel adders typically use prefix trees which generate carries in $\log_2(n)$ time, where n is the number of bits of the datapath. The Kogge-Stone adder (KSA) [18] is considered to be the fastest among parallel-prefix adders. A KSA was successfully used in the 20 GHz 8-bit RSFQ ALU [14]. However, KSAs have very high complexity and a tremendous amount of wiring congestion. Further enhancements to the KSA prefix structure such as the sparse-tree configuration have been proposed and used in high-performance Intel processors [19]. In our 16-bit RSFQ adder design, we chose the sparse-tree structure to reduce the number of Josephson junctions (JJs) needed for its implementation without any significant effect on its processing rate. As a side effect, this also leads to a more energy-efficient design by reducing the total bias current and power consumption.

1051-8223/$31.00 © 2012 IEEE

Fig. 1. Structural diagram of the 16-bit sparse-tree adder. The carry-out (Cout) is the left-most bit and to the right of it is the most significant bit of the Sum result (bit 15). The right-most bit is the least significant bit (bit 0).

Fig. 1 illustrates the structural diagram of our sparse-tree adder. It consists of the following three stages: Initialization, Prefix-Tree, and Summation. The Initialization stage receives two 16-bit data operands A and B to create bitwise Generate (G) and Propagate (P) signals, which will be merged in a logarithmic manner in the Prefix-Tree stage.

The Prefix-Tree stage consists of Carry-Merge (CM) blocks to merge the prefix signals and provide a group carry to each 4-bit summation block. In contrast, the Kogge-Stone prefix tree provides a carry to every individual bit of the adder. DFF (D flip-flop) buffers appropriately delay prefix and bitwise P signals until they are ready to be merged or processed at the Summation stage, respectively. The first three levels of the Prefix-Tree also perform the ripple-carry addition within each 4-bit group before data arrive at the Summation stage.

The Summation stage computes the final sum with 4-bit carry-skip adders [20].
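
The three stages described above can be illustrated with a small bit-level software model. The following Python sketch is only a functional approximation written for this purpose: the function names `merge` and `sparse_tree_add` are hypothetical, the group carries are merged with a generic logarithmic prefix scheme, and the bits inside each 4-bit group are summed with a plain ripple rather than the carry-skip blocks of the actual design. It models logic values only, not pulses, timing, or reset waves.

```python
def merge(hi, lo):
    """Carry-merge of two adjacent spans, each given as a (G, P) pair.

    `hi` is the more significant span, `lo` the less significant one.
    """
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sparse_tree_add(a, b, width=16, group=4):
    """Return (sum, carry_out) of two `width`-bit operands.

    Illustrative model only: carries are produced per `group`-bit block
    (sparse tree) instead of per bit (full Kogge-Stone tree).
    """
    # Initialization stage: bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Fold the bits of each group into one (G, P) pair per group.
    n_groups = width // group
    spans = []
    for k in range(n_groups):
        acc = (g[k * group], p[k * group])
        for i in range(k * group + 1, (k + 1) * group):
            acc = merge((g[i], p[i]), acc)
        spans.append(acc)

    # Prefix tree over the group pairs: log2(n_groups) merge levels.
    dist = 1
    while dist < n_groups:
        spans = [
            merge(spans[k], spans[k - dist]) if k >= dist else spans[k]
            for k in range(n_groups)
        ]
        dist *= 2

    # spans[k][0] is now the carry out of group k (carry-in of 0 assumed).
    group_cin = [0] + [spans[k][0] for k in range(n_groups - 1)]
    carry_out = spans[-1][0]

    # Summation stage (simplified): ripple inside each group from its carry-in.
    total = 0
    for k in range(n_groups):
        c = group_cin[k]
        for i in range(k * group, (k + 1) * group):
            total |= (p[i] ^ c) << i
            c = g[i] | (p[i] & c)
    return total, carry_out


if __name__ == "__main__":
    print(sparse_tree_add(0xFFFF, 0x0001))  # (0, 1)
```

With 16 bits and 4-bit groups, the model needs only two merge levels across the four group pairs, which reflects why a sparse tree requires far fewer merge cells than a tree that produces a carry for every bit.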
The lower half of the adder (bits 7:0) can start the Summation stage early because all appropriate signals are ready. The upper half of the adder (bits 15:8) must wait until carries for this upper half are calculated by the very last level of the Prefix-Tree stage.

III. 16-BIT ADDER CHIP DESIGN

The design of the 30 GHz 16-bit adder has been done by the SBU designers using the CONNECT cell library developed for the ISTEC 1.0 µm 10 kA/cm² Advanced Process (ADP) [21] by Yokohama National University and Nagoya University. Through a hierarchical design process, smaller sub-blocks were thoroughly evaluated and tested for functionality and for satisfying strict timing requirements. These sub-blocks have a horizontal bit pitch of 150 µm. They were then assembled to create larger modules that form the 16-bit adder.

The Initialization stage consists of GPR_INIT logic blocks, one for each bit. The GPR_INIT creates the bitwise prefix functions described as $G_i = A_i \cdot B_i$ and $P_i = A_i \oplus B_i$, where i is the bit index ranging from 15 down to 0 in the 16-bit adder. These functions are easily realized through clocked AND and XOR gates in a co-flow clocking arrangement. The clock is the Rdy signal provided to all bits through a distribution tree built with passive transmission lines (PTLs). Additionally, it is necessary to create the trailing reset signal R, which will be used to reset the asynchronous elements in the Prefix-Tree. Signal R is a copy of the Rdy signal for each bit with JJ-based delay lines to ensure data signals are processed before reset follows in the asynchronously wave-pipelined Prefix-Tree.

The Prefix-Tree stage is built with CM blocks to merge the prefix signals as shown in Fig. 1. Merging of the prefix signals is described in [18]. It is implemented with CFFs (resettable Muller C-flip-flop gates based on the Muller C-element [22], [23]) and confluence buffers used as asynchronous OR gates. The CFFs provide the following functions.
First, they behave as asynchronous AND gates. Second, they are used as key re-synchronization elements for wave-pipelining, allowing data waves to wait until all their appropriate signals arrive. Due to the encoding of the prefix signals, confluence buffers can be safely used as asynchronous OR gates without any danger of violating the time separation requirement of their input pulses.

Fig. 2. Simulated dc bias margins for the 16-bit adder design measured from 19.2 GHz (52 ps cycle time) to 40.0 GHz (25 ps cycle time).

The Summation stage has a 4-bit carry-skip adder block [20] for each 4-bit group. In our carry-skip adders, the generation of a carry-in to the two most significant bits of the group is done in parallel with the calculation of the two least significant sum bits in the group.

For each group, the signal Propagate-Propagate (PP) obtained from the Prefix-Tree denotes whether a carry-in will propagate through the lower 2 bits of the group or not. This propagation is done by a clocked AND gate, which creates the correct carry-in of the upper 2 bits of the group, thus skipping the slower ripple carry produced from the lower 2 bits. The summation in the upper and lower halves of the 4-bit groups is done by T1 (toggle flip-flops with clocked output [24]) and clocked XOR gates. Trailing reset signals are used to clock the T1 and XOR gates in the Summation blocks.

To facilitate high-speed on-chip testing, we designed three supplemental circuits: a clock generator, an input shift register, and an output compressor. The clock generator uses a 16 pulse-train design where a single clock pulse is sequentially split 16 times and fed back into a ladder structure of confluence buffers. An appropriate number of JTL delays are inserted to achieve a particular range of clock frequencies. Two such structures are integrated into the clock generator to provide two frequency ranges: 15 to 25.5 GHz and 23.5 to 41 GHz.
By adjusting the bias voltage of the clock generator, these two ranges cover the full spectrum of clock frequencies at which the 16-bit adder is expected to operate. The clock generator characteristics were obtained through Verilog simulation using gate timing parameters from JJ circuit-level modeling. An additional one-input pulse, one-output pulse low-frequency mode is also available.

A 16-bit parallel-load/parallel-output shift register provides test vectors to the inputs of the adder at high speed. The high-speed outputs of the adder are captured by the output compressor to create an XOR signature of the results using T1 gates.

A Verilog testbench simulation has been written to verify functional correctness and obtain frequency-dependent DC bias margins. Fig. 2 shows that the 16-bit adder can operate up to 38.5 GHz. It has +20%/-16% DC bias margins at the target clock rate of 30 GHz.

Fig. 3. Photograph of the fabricated 16-bit adder with on-chip high-frequency test circuits. The labeled components are the following: (A) dc-SFQ converters; (B) clock generator; (C) input shift register; (D) PTL-based splitter tree; (E) adder core; (F) output compressor; and (G) TFF-based SFQ-dc converters.

TABLE I. SUMMARY OF THE RSFQ ADDER CHARACTERISTICS

The entire layout of the chip with high-frequency test circuits, including moats, I/O, clock generator, input shift register, output compressor, and the adder core, fits within a 4.25 mm × 7.00 mm area (Fig. 3), with the adder core circuit occupying an area of 8.52 mm² (2.73 mm × 3.12 mm). Table I summarizes the characteristics of the 16-bit adder.

IV. TESTING

Two test chips were fabricated on two separate tape-outs. The first chip consists of the 16-bit adder for low-frequency testing, in the interest of meeting the tape-out schedule and area constraints. The second chip consists of the same design with the additional high-frequency testing circuits described in the previous section.
The SBU team tested several samples of each chip design using the measurement facilities at Yokohama.

For low-frequency testing of the first chip, we supplied 16-bit data operands A and B in parallel, followed by an appropriately delayed Rdy signal to begin the operation. A total of 5 oscilloscopes (from 3 different manufacturers) captured and displayed the 16-bit output result, along with Cout and Rout (Rdy output), as produced from the adder outputs by TFF-based SFQ-to-DC circuits [21].

Performing an exhaustive test by covering the entire range of two 16-bit inputs is impractical, so we opted for a white-box testing approach. First, we handpicked critical test vectors to exercise all critical intergroup propagation paths of the Prefix-Tree. Second, we tested the very important case when a carry generated from the least significant bit propagates all the way to the Cout output, producing all zeroes for the Sum output and a carry-out from the most significant bit. Third, the output sum bits of all individual 4-bit groups were experimentally tested for the cases when carries were generated within 4-bit groups. Besides these critical tests, we also applied several random vectors as a sanity check. In total, 29 critical test vectors were used in addition to 10 random vectors.

Fig. 4. Output waveforms for three random tests: (a) 56811 + 14643 = 71454; (b) 8724 + 50892 = 59616; (c) 13982 + 64973 = 78955. Sum[0] is the least significant bit. Each output result is accompanied by Rout. Tests were sent twice to show two transitions for outputs that are logical 1. The lack of a nearby ground pad to couple with the Cout signal led to its noisy output.

All test vectors in our test plan successfully produced the expected output for the low-frequency testing of the first 16-bit adder chip. Fig. 4 shows the output waveforms for three of the tested random cases.
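
To relate the decimal results quoted in the Fig. 4 caption to the 16 Sum outputs and the Cout output, the expected bit patterns can be computed with a short script. The Python sketch below is an illustration only: the helper name `expected_outputs` is hypothetical, and the full-propagation vector (A = 0xFFFF, B = 0x0001) is one operand pair that matches the description of the second critical test, not necessarily the vector used in the measurements.

```python
WIDTH = 16  # operand width of the adder


def expected_outputs(a, b, width=WIDTH):
    """Return (sum_bits, cout) expected from a `width`-bit adder."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


# Operand pairs and full results as listed in the Fig. 4 caption.
fig4_cases = [
    (56811, 14643, 71454),
    (8724, 50892, 59616),
    (13982, 64973, 78955),
]

for a, b, full in fig4_cases:
    s, cout = expected_outputs(a, b)
    # The quoted decimal result is the 17-bit value {Cout, Sum[15:0]}.
    assert (cout << WIDTH) | s == full
    print(f"{a} + {b}: Cout={cout}, Sum={s:016b}")

# Assumed example of the full carry-propagation case: a carry generated
# at bit 0 ripples through 15 propagating bits to Cout.
s, cout = expected_outputs(0xFFFF, 0x0001)
assert (s, cout) == (0, 1)
```

Cases (a) and (c) exceed 65535, so they exercise the Cout output, while case (b) leaves Cout at 0.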
We measured the overlapping DC bias margin across all test cases to be +9.8%/-10.7%.

In the low-frequency mode of operation of the second chip with high-frequency testing circuits, the serial output of the input shift register showed several but not all bits functioning correctly for operand A. The serial output for operand B could not be stabilized. Individual testing of the 16 vectors revealed incorrect or unstable output across a majority of the bits. At high frequency, we observed similar results.

Our second chip was on the ADP627 tape-out, and measurements have shown that the circuit parameters appeared to be at a lower standard of quality than usual for this fabrication run [25]. We tentatively attribute the failure of the high-speed test to fabrication difficulties.

V. CONCLUSION

We have designed, fabricated, and tested the first 16-bit wave-pipelined sparse-tree RSFQ adder chip with a core complexity of 9941 JJs and a target operation rate of 30 GHz. We have successfully demonstrated the correct operation of the chip at low frequency, passing all carefully chosen test vectors with a measured bias margin of +9.8%/-10.7%. Another adder chip consisting of 12785 junctions with additional on-chip circuits for 30 GHz testing was also fabricated, but its testing showed the need for another fabrication run.

ACKNOWLEDGMENT

The National Institute of Advanced Industrial Science and Technology (Japan) has partially contributed to the circuit fabrication. The authors are also grateful to Dr. M. Tanaka, Nagoya University, and Taiichi Kato and Y. Shimamura, Yokohama National University, for their assistance with SFQ CAD tools and testing equipment.

REFERENCES

[1] M. Tanaka, H. Akaike, A. Fujimaki, Y. Yamanashi, N. Yoshikawa, S. Nagasawa, K. Takagi, and N. Takagi, "100-GHz single-flux-quantum bit-serial adder based on 10-kA/cm² niobium process," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 792-796, Jun. 2011.
[2] A. F. Kirichenko and O. A.
Mukhanov, "Implementation of novel push-forward RSFQ Carry-Save Serial Adders," IEEE Trans. Appl. Supercond., vol. 5, no. 2, pp. 3010-3013, Jun. 1995.
[3] A. Y. Kidiyarova-Shevchenko, K. Y. Platov, E. M. Tolkacheva, and I. A. Kataeva, "RSFQ asynchronous serial multiplier and spreading codes generator for multiuser detector," IEEE Trans. Appl. Supercond., vol. 13, no. 2, pp. 429-432, Jun. 2003.
[4] S. V. Polonsky and A. V. Rylyakov, "RSFQ arithmetic blocks for DSP applications," IEEE Trans. Appl. Supercond., vol. 5, no. 2, pp. 2823-2826, Jun. 1995.
[5] H. Park, Y. Yamanashi, N. Yoshikawa, M. Tanaka, and A. Fujimaki, "Design of fast digit-serial adders using SFQ logic circuits," IEICE Electronics Express, vol. 6, no. 19, pp. 1408-1413, 2009.
[6] S. V. Polonsky, V. K. Semenov, P. I. Bunyk, A. F. Kirichenko, A. Y. Kidiyarov-Shevchenko, O. A. Mukhanov, P. N. Shevchenko, D. F. Schneider, D. Y. Zinoviev, and K. K. Likharev, "New RSFQ circuits Josephson junction digital devices," IEEE Trans. Appl. Supercond., vol. 3, no. 1, pp. 2566-2577, Mar. 1993.
[7] J. Y. Kim, S. Kim, and J. Kang, "Construction of an RSFQ 4-bit ALU with half adder cells," IEEE Trans. Appl. Supercond., vol. 15, no. 2, pp. 308-311, Jun. 2005.
[8] Q. P. Herr, N. Vukovic, C. A. Mancini, K. Gaj, V. Adler, E. G. Friedman, A. Krasniewski, M. F. Bocko, and M. J. Feldman, "Design and low speed testing of a four-bit RSFQ multiplier-accumulator," IEEE Trans. Appl. Supercond., vol. 7, no. 2, pp. 3168-3171, Jun. 1997.
[9] R. Nakamoto, S. Sakuraba, T. Onomi, S. Sato, and K. Nakajima, "4-bit SFQ Multiplier Based on Booth Encoder," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 852-855, Jun. 2011.
[10] P. Bunyk and P. Litskevitch, "Case study in RSFQ design: Fast pipelined parallel adder," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3714-3720, Jun. 1999.
[11] M. Dorojevets, C. Ayala, and A.
Kasperek, "Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors," in Proc. 12th Int. Supercond. Electron. Conf., Fukuoka, Japan, 2009, SP-P46.
[12] M. Dorojevets, C. L. Ayala, and A. K. Kasperek, "Data-flow microarchitecture for wide datapath RSFQ processors: Design study," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 787-791, Jun. 2011.
[13] T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, "8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 847-851, Jun. 2011.
[14] T. V. Filippov, A. Sahu, A. F. Kirichenko, I. V. Vernik, M. Dorojevets, C. L. Ayala, and O. A. Mukhanov, "20 GHz operation of an asynchronous wave-pipelined RSFQ arithmetic-logic unit," Phys. Proc., vol. 36, pp. 59-65, 2012.
[15] M. Dorojevets and C. L. Ayala, "Design and evaluation of a 20 GHz 32-bit wave-pipelined sparse tree adder," Ultra-High-Speed-Comput. Lab Tech. Rep., Stony Brook Univ., Stony Brook, NY, 2010.
[16] K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3-28, Mar. 1991.
[17] K. Gaj, E. G. Friedman, and M. J. Feldman, "Timing of multi-gigahertz rapid single flux quantum digital circuits," J. VLSI Signal Process., vol. 276, pp. 247-276, Nov. 1997.
[18] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. C-22, no. 8, pp. 786-793, Aug. 1973.
[19] S. Mathew, M. Anders, R. K. Krishnamurthy, and S. Borkar, "A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core," IEEE J. Solid-State Circuits, vol. 38, no. 5, pp. 689-695, May 2003.
[20] A. G. M. Strollo and E. Napoli, "A fast and area efficient complementary pass-transistor logic carry-skip adder," in Proc. 21st Int. Conf. Microelectron., Sep. 1997, vol.
2, pp. 701-704.
[21] H. Akaike, M. Tanaka, K. Takagi, I. Kataeva, R. Kasagi, A. Fujimaki, M. Igarashi, H. Park, Y. Yamanashi, N. Yoshikawa, K. Fujiwara, S. Nagasawa, M. Hidaka, and N. Takagi, "Design of single flux quantum cells for a 10-Nb-layer process," Phys. C, Supercond., vol. 469, no. 15-20, pp. 1670-1673, Oct. 2009.
[22] O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii, "RSFQ logic arithmetic," IEEE Trans. Magn., vol. 25, no. 2, pp. 857-860, Mar. 1989.
[23] Z. J. Deng, N. Yoshikawa, J. A. Tierno, S. R. Whiteley, and T. van Duzer, "Asynchronous circuits and systems in superconducting RSFQ digital technology," in Proc. 4th Int. Symp. Adv. Res. Asynchronous Circuits Syst., Apr. 1998, pp. 274-285.
[24] S. Polonsky, V. K. Semenov, and A. F. Kirichenko, "Single flux, quantum B flip-flop and its possible applications," IEEE Trans. Appl. Supercond., vol. 4, no. 1, pp. 9-18, Mar. 1994.
[25] M. Hidaka, S. Nagasawa, K. Hinode, and T. Satoh, "Device yield in Nb-nine-layer circuit fabrication process," IEEE Trans. Appl. Supercond., vol. 23.