This document discusses techniques for reducing computation time in two's complement multipliers, specifically for short bit-width designs. It presents a method to reduce the maximum height of the partial product array generated by a radix-4 Modified Booth encoded multiplier by one row, without increasing delay. This reduction could allow for faster compression of the partial product array and regular layouts. The proposed approach is evaluated through theoretical analysis and logic synthesis, showing improvements in area and delay over other possible solutions.
Original title: Reducing Computation Time for Short Bit-Width Two's Complement Multiplier
ABSTRACT: Two's complement multipliers are important for a wide range of applications. In this paper, we present a technique to reduce by one row the maximum height of the partial product array generated by a radix-4 Modified Booth Encoded multiplier, without any increase in the delay of the partial product generation stage. This reduction may allow for a faster compression of the partial product array and regular layouts. The technique is of particular interest in all multiplier designs, but especially in short bit-width two's complement multipliers for high-performance embedded cores. The proposed method is general and can be extended to higher radix encodings, as well as to square multipliers of any size and to m times n rectangular multipliers. We evaluated the proposed approach by comparison with some other possible solutions; the results, based on a rough theoretical analysis and on logic synthesis, showed its efficiency in terms of both area and delay.

Introduction About Verilog

Overview: Hardware description languages such as Verilog differ from software programming languages because they include ways of describing the propagation of time and signal dependencies (sensitivity). There are two assignment operators: a blocking assignment (=) and a non-blocking (<=) assignment. The non-blocking assignment allows designers to describe a state-machine update without needing to declare and use temporary storage variables (in a general programming language, we would need to define temporary storage for the operands to be operated on subsequently; those are the temporary storage variables). Since these concepts are part of Verilog's language semantics, designers can quickly write descriptions of large circuits in a relatively compact and concise form.
At the time of Verilog's introduction (1984), Verilog represented a tremendous productivity improvement for circuit designers who were already using graphical schematic capture software and specially written software programs to document and simulate electronic circuits. The designers of Verilog wanted a language with syntax similar to the C programming language, which was already widely used in engineering software development. Verilog is case-sensitive, has a basic preprocessor (though less sophisticated than that of ANSI C/C++), equivalent control flow keywords (if/else, for, while, case, etc.), and compatible operator precedence. Syntactic differences include variable declaration (Verilog requires bit-widths on net/reg types), demarcation of procedural blocks (begin/end instead of curly braces {}), and many other minor differences. A Verilog design consists of a hierarchy of modules. Modules encapsulate design hierarchy, and communicate with other modules through a set of declared input, output, and bidirectional ports. Internally, a module can contain any combination of the following: net/variable declarations (wire, reg, integer, etc.), concurrent and sequential statement blocks, and instances of other modules (sub-hierarchies). Sequential statements are placed inside a begin/end block and executed in sequential order within the block. But the blocks themselves are executed concurrently, qualifying Verilog as a dataflow language. Verilog's concept of "wire" consists of both signal values (4-state: "1, 0, floating, undefined") and strengths (strong, weak, etc.). This system allows abstract modeling of shared signal lines, where multiple sources drive a common net. When a wire has multiple drivers, the wire's (readable) value is resolved by a function of the source drivers and their strengths.
A subset of statements in the Verilog language is synthesizable. Verilog modules that conform to a synthesizable coding style, known as RTL (register transfer level), can be physically realized by synthesis software. Synthesis software algorithmically transforms the (abstract) Verilog source into a netlist, a logically equivalent description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology. Further manipulations to the netlist ultimately lead to a circuit fabrication blueprint (such as a photo mask set for an ASIC, or a bit-stream file for an FPGA).

Verilog HDL History

Beginning: Verilog was the first modern hardware description language to be invented. It was created by Phil Moorby and Prabhu Goel during the winter of 1983/1984 at Automated Integrated Design Systems (later renamed Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation was purchased by Cadence Design Systems in 1990. Cadence now has full proprietary rights to Gateway's Verilog and to Verilog-XL, the HDL simulator that would become the de facto standard (of Verilog logic simulators) for the next decade. Originally, Verilog was intended to describe and allow simulation; only afterwards was support for synthesis added.

Verilog-95: With the increasing success of VHDL at the time, Cadence decided to make the language available for open standardization. Cadence transferred Verilog into the public domain under the Open Verilog International (OVI) (now known as Accellera) organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95. In the same time frame, Cadence initiated the creation of Verilog-A to put standards support behind its analog simulator Spectre. Verilog-A was never intended to be a standalone language and is a subset of Verilog-AMS, which encompassed Verilog-95.
Verilog-2001: Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that users had found in the original Verilog standard. These extensions became IEEE Standard 1364-2001, known as Verilog-2001. Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support for (2's complement) signed nets and variables. Previously, code authors had to perform signed operations using awkward bit-level manipulations (for example, the carry-out bit of a simple 8-bit addition required an explicit description of the Boolean algebra to determine its correct value). The same function under Verilog-2001 can be more succinctly described by one of the built-in operators: +, -, /, *, >>>. A generate/endgenerate construct (similar to VHDL's generate/end generate) allows Verilog-2001 to control instance and statement instantiation through normal decision operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an array of instances, with control over the connectivity of the individual instances. File I/O has been improved by several new system tasks. And finally, a few syntax additions were introduced to improve code readability (e.g., always @*, named parameter override, C-style function/task/module header declaration). Verilog-2001 is the dominant flavor of Verilog supported by the majority of commercial EDA software packages.

Introduction About Multiplication: Multiplication (often denoted by the cross symbol "×") is the mathematical operation of scaling one number by another. It is one of the four basic operations in elementary arithmetic (the others being addition, subtraction and division).

Multiplication: If a positional numeral system is used, a natural way of multiplying numbers is taught in schools as long multiplication, sometimes called grade-school multiplication: multiply the multiplicand by each digit of the multiplier and then add up all the properly shifted results.
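The long multiplication recipe just described can be sketched in a few lines. Below is a minimal Python illustration (the function name `long_multiply` is ours, not from the text): with `base = 10` it performs grade-school multiplication, one properly shifted partial product per multiplier digit; with `base = 2` it becomes the shift-and-add variant that computers use.

```python
def long_multiply(x: int, y: int, base: int = 10) -> int:
    """Grade-school long multiplication: one shifted partial product
    per digit of the multiplier, then a final sum."""
    product, shift = 0, 0
    while y > 0:
        digit = y % base                        # next multiplier digit
        product += digit * x * base**shift      # properly shifted partial product
        y //= base
        shift += 1
    return product

# The worked example from the text: 23,958,233 x 5,830
print(long_multiply(23958233, 5830))            # 139676498390
# base 2 gives the shift-and-add algorithm used by computers
print(long_multiply(23958233, 5830, base=2))    # 139676498390
```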
It requires memorization of the multiplication table for single digits. This is the usual algorithm for multiplying larger numbers by hand in base 10. Computers normally use a very similar shift-and-add algorithm in base 2. A person doing long multiplication on paper will write down all the products and then add them together; an abacus user will sum the products as soon as each one is computed.

Example: This example uses long multiplication to multiply 23,958,233 (multiplicand) by 5,830 (multiplier) and arrives at 139,676,498,390 for the result (product).

      23958233
  ×       5830
  ------------
      00000000   (=  23,958,233 ×     0)
     71874699    (=  23,958,233 ×    30)
    191665864    (=  23,958,233 ×   800)
   119791165     (=  23,958,233 × 5,000)
  ------------
  139676498390   (= 139,676,498,390)

Multiplication algorithm: A multiplication algorithm is an algorithm (or method) to multiply two numbers. Depending on the size of the numbers, different algorithms are in use. Efficient multiplication algorithms have existed since the advent of the decimal system.

Types of multiplication algorithms:
1. Booth's Algorithm
2. Modified Booth's Algorithm
3. Wallace Tree Algorithm

Booth's Algorithm: Booth's algorithm is a multiplication algorithm that works for two's complement numbers. It is similar to the paper-and-pencil method, except that it looks at the current as well as the previous bit in order to decide what to do. Here are the steps: if the current multiplier digit is 1 and the earlier digit is 0 (i.e., a 10 pair), shift and sign extend the multiplicand, then subtract it from the previous result. If it is a 01 pair, add the multiplicand to the previous result. If it is a 00 pair or a 11 pair, do nothing. Let's look at a few examples.

4 bits:

     0110   <- 6
   × 0010   <- 2
  -------------
   00000000
  −    0110        (10 pair at bit 1: subtract)
  -------------
   11110100
  +   0110         (01 pair at bit 2: add)
  -------------
(1)00001100   <- 12 (overflow bit ignored)

8 bits: In Booth's algorithm, if the multiplicand and multiplier are n-bit two's complement numbers, the result is considered as a 2n-bit two's complement value.
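The pair rules above translate directly into code. Here is an illustrative Python sketch (helper names are ours) that reproduces the 4-bit examples: a 10 pair subtracts the shifted multiplicand, a 01 pair adds it, and the result is kept to 2n bits.

```python
def booth_multiply(x: int, y: int, n: int) -> int:
    """Booth's algorithm for n-bit two's complement operands: scan
    (y_i, y_{i-1}) pairs; 10 subtracts, 01 adds the shifted multiplicand."""
    if x & (1 << (n - 1)):            # interpret the multiplicand as signed
        x -= 1 << n
    result, prev = 0, 0               # prev models the assumed (-1)-th bit
    for i in range(n):
        cur = (y >> i) & 1
        if (cur, prev) == (1, 0):     # 10 pair: subtract, aligned with the 1
            result -= x << i
        elif (cur, prev) == (0, 1):   # 01 pair: add, aligned with the 0
            result += x << i
        prev = cur
    return result & ((1 << (2 * n)) - 1)   # 2n-bit result, overflow ignored

def to_signed(v: int, bits: int) -> int:
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

print(booth_multiply(0b0110, 0b0010, 4))                  # 12
print(to_signed(booth_multiply(0b1011, 0b1101, 4), 8))    # 15, i.e. (-5) x (-3)
```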
The overflow bit (outside the 2n bits) is ignored. The reason the above computation works is that 0110 × 0010 = 0110 × (−0010 + 0100) = −01100 + 011000 = 1100.

Example 2: 0010 × 0110 = 0010 × (−0010 + 1000) = −00100 + 0010000 = 1100.

Example 3, (−5) × (−3), with 4-bit operands and an 8-bit result:

    1011  -> −5  (4-bit two's complement)
  × 1101  -> −3
  start:                                                     00000000
  10 pair at bit 0: subtract 11111011 (−5, sign extended) -> 00000101
  01 pair at bit 1: add      11110110 (−5 shifted once)   -> 11111011
  10 pair at bit 2: subtract 11101100 (−5 shifted twice)  -> 00001111  -> +15

A longer example, 10011100 (−100) × 01100011 (99):

  start:                                          00000000 00000000
  10 pair at bit 0: subtract −100              -> 00000000 01100100
  01 pair at bit 2: add −100 shifted by 2      -> 11111110 11010100
  10 pair at bit 5: subtract −100 shifted by 5 -> 00001011 01010100
  01 pair at bit 7: add −100 shifted by 7      -> 11011001 01010100  <- −9,900

Note that the multiplicand and multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns: a 10 pair causes a subtraction, aligned with the 1; a 01 pair causes an addition, aligned with the 0. In both cases, the operation aligns with the bit on the left. The algorithm starts with the 0th bit; we should assume that there is a (−1)th bit having value 0.

Booth algorithm advantages and disadvantages: these depend on the architecture. A potential advantage is that it might reduce the number of 1's in the multiplier. In the multipliers that we have seen so far, it does not save speed (we still have to wait for the critical path, e.g., the shift-add delay in a sequential multiplier), and it increases area (recoding circuitry and subtraction).

Modified Booth: Booth 2 is modified to produce at most n/2 + 1 partial products. Algorithm (for unsigned numbers):
1. Pad the LSB with one zero.
2. Pad the MSB with 2 zeros if n is even and 1 zero if n is odd.
3. Divide the multiplier into overlapping groups of 3 bits.
4. Determine the partial product scale factor from the modified Booth 2 encoding table.
5. Compute the multiplicand multiples.
6. Sum the partial products.

The digits can be encoded by looking at three bits at a time. Booth recoding table: 1. We must be able to add the multiplicand times −2, −1, 0, 1 and 2. 2. Since Booth recoding got rid of 3's, generating the partial products is not that hard (shifting and negating).

  y_{i+1}  y_i  y_{i−1} | add
     0      0      0    |  0 × M
     0      0      1    |  1 × M
     0      1      0    |  1 × M
     0      1      1    |  2 × M
     1      0      0    | −2 × M
     1      0      1    | −1 × M
     1      1      0    | −1 × M
     1      1      1    |  0 × M

Booth 2 is modified to produce at most n/2 + 1 partial products. Algorithm (for unsigned numbers):
1. Pad the LSB with one zero.
2. If n is even, don't pad the MSB (n/2 PP's); if n is odd, sign extend the MSB by 1 bit ((n+1)/2 PP's).
3. Divide the multiplier into overlapping groups of 3 bits.
4. Determine the partial product scale factor from the modified Booth 2 encoding table.
5. Compute the multiplicand multiples.
6. Sum the partial products.

Interpretation of the Booth recoding table:

  y_{i+1}  y_i  y_{i−1} | add    | Explanation
     0      0      0    |  0 × M | No string of 1's in sight
     0      0      1    |  1 × M | End of a string of 1's
     0      1      0    |  1 × M | Isolated 1
     0      1      1    |  2 × M | End of a string of 1's
     1      0      0    | −2 × M | Beginning of a string of 1's
     1      0      1    | −1 × M | End one string, begin a new one
     1      1      0    | −1 × M | Beginning of a string of 1's
     1      1      1    |  0 × M | Continuation of a string of 1's

Grouping the multiplier bits into pairs is an idea orthogonal to Booth recoding: it reduces the number of partial products to half, but if Booth recoding is not used, we have to be able to multiply by 3 (hard: shift plus add). Applying the grouping idea to Booth gives Modified Booth Recoding (Encoding). Since we already got rid of sequences of 1's, there is no multiplication by 3: we just negate, and shift once or twice. The scheme uses a high radix to reduce the number of intermediate addition operands. We can go higher: radix-8, radix-16. Radix-8 would have to implement ×3, ×(−3), ×4, ×(−4), so recoding and partial product generation become more complex. The scheme can automatically take care of signed multiplication.
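As a sanity check of the recoding table, here is a small Python sketch (names are ours) that recodes an 8-bit multiplier into radix-4 digits in {0, ±1, ±2} by scanning a 3-bit window with a stride of 2. Note that the digits reconstruct the multiplier's value and that there are n/2 of them:

```python
# digit selected by the (y_{i+1}, y_i, y_{i-1}) window, per the table above
MBE_TABLE = {
    (0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def radix4_recode(y: int, n: int) -> list:
    """Scan the n-bit multiplier with a 3-bit window, stride 2
    (one zero padded below the LSB); returns radix-4 digits, LSD first."""
    digits, prev = [], 0                   # prev: the padded zero
    for i in range(0, n, 2):
        window = ((y >> (i + 1)) & 1, (y >> i) & 1, prev)
        digits.append(MBE_TABLE[window])
        prev = (y >> (i + 1)) & 1
    return digits

digits = radix4_recode(0b01100011, 8)      # the multiplier 99
print(digits)                              # [-1, 1, -2, 2]
# the digits reconstruct the multiplier: sum of d_i * 4^i
print(sum(d * 4**i for i, d in enumerate(digits)))   # 99
```

The same scan also recodes a two's complement pattern into its signed value, which is why MBE handles signed multiplication automatically.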
Wallace Tree: A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers, devised by the Australian computer scientist Chris Wallace in 1964. [1] The Wallace tree has three steps:
1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n² results. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 32 (see the explanation of weights below).
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires in two numbers, and add them with a conventional adder. [2]
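The reduction in steps 2 and 3 can be simulated by counting wires per weight. The following Python sketch (names are ours) applies the layering rule detailed next — three wires of a weight go into a full adder, two into a half adder, a lone wire passes through — until at most two wires remain at every weight; for n = 4 it reproduces the layer-by-layer counts worked out below.

```python
def wallace_layers(n: int):
    """Count wires per weight index w (weight value 2^w) through Wallace
    reduction layers until at most two wires remain at every weight."""
    # step 1: one AND gate per bit pair; weight index i+j gets one wire
    counts = {}
    for i in range(n):
        for j in range(n):
            counts[i + j] = counts.get(i + j, 0) + 1
    layers = [dict(counts)]
    while max(counts.values()) > 2:
        nxt = {}
        for w, c in sorted(counts.items()):
            full, rest = divmod(c, 3)           # full adders take 3 wires
            half = 1 if rest == 2 else 0        # half adders take 2
            passed = rest - 2 * half            # lone wires pass through
            nxt[w] = nxt.get(w, 0) + full + half + passed
            if full + half:                     # carries go one weight up
                nxt[w + 1] = nxt.get(w + 1, 0) + full + half
        counts = nxt
        layers.append(dict(counts))
    return layers

for layer in wallace_layers(4):                 # three snapshots for n = 4
    print(layer)
```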
The second phase works as follows. As long as there are three or more wires with the same weight, add a following layer: Take any three wires with the same weight and input them into a full adder. The result will be an output wire of the same weight and an output wire with a higher weight for each three input wires. If there are two wires of the same weight left, input them into a half adder. If there is just one wire left, connect it to the next layer. The benefit of the Wallace tree is that there are only O(log n) reduction layers, and each layer has O(1) propagation delay. As making the partial products is O(1) and the final addition is O(log n), the multiplication is only O(log n), not much slower than addition (however, much more expensive in gate count). Naively adding the partial products with regular adders would require O(log² n) time. From a complexity-theoretic perspective, the Wallace tree algorithm puts multiplication in the class NC¹. These computations only consider gate delays and don't deal with wire delays, which can also be very substantial. The Wallace tree can also be represented by a tree of 3:2 or 4:2 adders. It is sometimes combined with Booth encoding.

Weights explained: The weight of a wire is the radix (to base 2) of the digit that the wire carries. In general, a_n and b_m have indexes n and m; and since 2^n · 2^m = 2^(n+m), the weight of a_n·b_m is 2^(n+m).

Example: n = 4, multiplying a3a2a1a0 by b3b2b1b0:
1. First we multiply every bit by every bit:
   o weight 1  – a0b0
   o weight 2  – a0b1, a1b0
   o weight 4  – a0b2, a1b1, a2b0
   o weight 8  – a0b3, a1b2, a2b1, a3b0
   o weight 16 – a1b3, a2b2, a3b1
   o weight 32 – a2b3, a3b2
   o weight 64 – a3b3
2. Reduction layer 1:
   o Pass the only weight-1 wire through, output: 1 weight-1 wire
   o Add a half adder for weight 2, outputs: 1 weight-2 wire, 1 weight-4 wire
   o Add a full adder for weight 4, outputs: 1 weight-4 wire, 1 weight-8 wire
   o Add a full adder for weight 8, and pass the remaining wire through, outputs: 2 weight-8 wires, 1 weight-16 wire
   o Add a full adder for weight 16, outputs: 1 weight-16 wire, 1 weight-32 wire
   o Add a half adder for weight 32, outputs: 1 weight-32 wire, 1 weight-64 wire
   o Pass the only weight-64 wire through, output: 1 weight-64 wire
3. Wires at the output of reduction layer 1:
   o weight 1 – 1
   o weight 2 – 1
   o weight 4 – 2
   o weight 8 – 3
   o weight 16 – 2
   o weight 32 – 2
   o weight 64 – 2
4. Reduction layer 2:
   o Add a full adder for weight 8, and half adders for weights 4, 16, 32, 64
5. Outputs:
   o weight 1 – 1
   o weight 2 – 1
   o weight 4 – 1
   o weight 8 – 2
   o weight 16 – 2
   o weight 32 – 2
   o weight 64 – 2
   o weight 128 – 1
6. Group the wires into a pair of integers and use an adder to add them.

Two's complement: The two's complement of a binary number is defined as the value obtained by subtracting the number from a large power of two (specifically, from 2^N for an N-bit two's complement). The two's complement of the number then behaves like the negative of the original number in most arithmetic, and it can coexist with positive numbers in a natural way. A two's-complement system, or two's-complement arithmetic, is a system in which negative numbers are represented by the two's complement of the absolute value; [1] this system is the most common method of representing signed integers on computers. [2] In such a system, a number is negated (converted from positive to negative or vice versa) by computing its ones' complement and adding one.
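The two equivalent definitions of the two's complement (subtraction from 2^N, and ones' complement plus one) can be checked against each other. A minimal Python sketch (names are ours):

```python
def twos_complement(value: int, n: int) -> int:
    """Two's complement of an n-bit number: subtract it from 2^n,
    or equivalently invert the bits (ones' complement) and add one."""
    assert 0 <= value < (1 << n)
    from_subtraction = ((1 << n) - value) % (1 << n)
    from_invert_add_one = ((~value & ((1 << n) - 1)) + 1) % (1 << n)
    assert from_subtraction == from_invert_add_one   # the two definitions agree
    return from_subtraction

# 8-bit example: the two's complement of 5 represents -5
print(bin(twos_complement(5, 8)))          # 0b11111011
# adding it to 5 wraps to zero modulo 2^8: a single representation of zero
print((5 + twos_complement(5, 8)) % 256)   # 0
```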
An N-bit two's-complement numeral system can represent every integer in the range −2^(N−1) to 2^(N−1) − 1, while ones' complement can only represent integers in the range −(2^(N−1) − 1) to 2^(N−1) − 1. The two's-complement system has the advantage of not requiring that the addition and subtraction circuitry examine the signs of the operands to determine whether to add or subtract. This property makes the system both simpler to implement and capable of easily handling higher precision arithmetic. Also, zero has only a single representation, obviating the subtleties associated with negative zero, which exists in ones'-complement systems.

REDUCING THE COMPUTATION TIME IN (SHORT BIT-WIDTH) TWO'S COMPLEMENT MULTIPLIERS

1. INTRODUCTION: In multimedia, 3D graphics and signal processing applications, performance, in most cases, strongly depends on the effectiveness of the hardware used for computing multiplications, since multiplication is, besides addition, massively used in these environments. The high interest in this application field is witnessed by the large number of algorithms and implementations of the multiplication operation that have been proposed in the literature (for a representative set of references, see [1]). More specifically, short bit-width (8–16 bits) two's complement multipliers with single-cycle throughput and latency have emerged and become very important building blocks for high-performance embedded processors and DSP execution cores [2], [3]. In this case, the multiplier must be highly optimized to fit within the required cycle time and power budgets. Another relevant application for short bit-width multipliers is the design of SIMD units supporting different data formats [3], [4]. In this case, short bit-width multipliers often play the role of basic building blocks. Two's complement multipliers of moderate bit-width (less than 32 bits) are also being used massively in FPGAs.
All of the above translates into a high interest and motivation on the part of industry for the design of high-performance short or moderate bit-width two's complement multipliers. The basic algorithm for multiplication is based on the well-known paper-and-pencil approach [1] and passes through three main phases: 1) partial product (PP) generation, 2) PP reduction, and 3) final (carry-propagated) addition. During PP generation, a set of rows is generated, where each one is the result of the product of one bit of the multiplier by the multiplicand. For example, if we consider the multiplication X × Y with both X and Y on n bits and of the form x_{n−1} ... x_0 and y_{n−1} ... y_0, then the i-th row is, in general, a proper left shifting of y_i × X, i.e., either a string of all zeros when y_i = 0, or the multiplicand X itself when y_i = 1. In this case, the number of PP rows generated during the first phase is clearly n.
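Phase 1 for the simple (non-Booth, unsigned) case can be sketched as follows; the helper name is ours. Row i is y_i × X left-shifted by i, so the array has exactly n rows, and phases 2 and 3 amount to summing them:

```python
def pp_rows(x: int, y: int, n: int):
    """PP generation for an unsigned n-bit X x Y: row i is
    (y_i * X) << i, giving exactly n rows."""
    return [((y >> i) & 1) * (x << i) for i in range(n)]

rows = pp_rows(0b1011, 0b0110, 4)   # 11 x 6: rows [0, 22, 44, 0]
print(rows)
print(sum(rows))                    # 66, the product
```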
Modified Booth Encoding (MBE) is a technique that has been introduced to reduce the number of PP rows, while still keeping the generation process of each row both simple and fast enough. One of the most commonly used schemes is radix-4 MBE, for a number of reasons, the most important being that it allows for the reduction of the size of the partial product array by almost half, and it is very simple to generate the multiples of the multiplicand. More specifically, the classic two's complement n × n bit multiplier using the radix-4 MBE scheme generates a PP array with a maximum height of ⌈n/2⌉ + 1 rows, each row before the last one being one of the following possible values: all zeros, ±X, ±2X. The last row, which is due to the negative encoding, can be kept very simple by using specific techniques integrating two's complement and sign extension prevention [1].
PP reduction is the process of adding all PP rows by using a compression tree [6], [7]. Since knowledge of the intermediate addition values is not important, the outcome of this phase is a result represented in redundant carry-save form, i.e., as two rows, which allows for much faster implementations. The final (carry-propagated) addition has the task of adding these two rows and of presenting the final result in a nonredundant form, i.e., as a single row.
In this work, we introduce an idea to overlap, to some extent, the PP generation and the PP reduction phases. Our aim is to produce a PP array with a maximum height of ⌈n/2⌉ rows, which is then reduced by the compressor tree stage.
As we will see, for the common case of values of n that are powers of two, the above reduction can lead to an implementation where the delay of the compressor tree is reduced by one XOR2 gate while keeping a regular layout. Since we are focusing on small values of n and fast single-cycle units, this reduction might be important in cases where, for example, high computation performance is required through the assembly of a large number of small processing units with limited computation capabilities, such as 8 × 8 or 16 × 16 multipliers [8]. A similar study, aimed at the reduction of the maximum height to ⌈n/2⌉ but using a different approach, has recently presented interesting results in [9] and, previously, by the same authors, in [10]. Thus, in the following, we will evaluate and compare the proposed approach with the technique in [9]. Additional details of our approach, besides the main results presented here, can be found in [11].
The paper is organized as follows: in Section 2, the multiplication algorithm based on MBE is briefly reviewed and analyzed. In Section 3, we describe related works. In Section 4, we present our scheme to reduce the maximum height of the partial product array by one unit during the generation of the PP rows. Finally, in Section 5, we provide evaluations and comparisons.

2. MODIFIED BOOTH RECODED MULTIPLIERS: In general, a radix-B = 2^b MBE leads to a reduction of the number of rows to about ⌈n/b⌉ while, on the other hand, it introduces the need to generate all the multiples of the multiplicand X, at least from −(B/2) × X to (B/2) × X. As mentioned above, radix-4 MBE is of particular interest since, for radix 4, it is easy to create the multiples of the multiplicand 0, ±X, ±2X. In particular, ±2X can be simply obtained by single left shifting of the corresponding terms ±X. It is clear that the MBE can be extended to higher radices (see [12] among others), but the advantage of getting a higher reduction in the number of rows is paid for by the need to generate more multiples of X. In this paper, we focus our attention on radix-4 MBE, although the proposed method can be easily extended to any radix-B MBE [11]. From an operational point of view, it is well known that the radix-4 MBE scheme consists of scanning the multiplier operand with a three-bit window and a stride of two bits (radix-4). For each group of three bits (y_{2i+1}, y_{2i}, y_{2i−1}), only one partial product row is generated according to the encoding in Table 1. A possible implementation of the radix-4 MBE and of the corresponding partial product generation is shown in Fig. 1, which comes from a small adaptation of [10, Fig. 12b]. For each partial product row, Fig. 1a produces the one, two, and neg signals. These signals are then exploited by the logic in Fig. 1b, along with the appropriate bits of the multiplicand, in order to generate the whole partial product array.
Other alternatives for the implementation of the recoding and partial product generation can be found in [13], [14], [15], among others.
As introduced previously, the use of radix-4 MBE allows for the (theoretical) reduction of the PP rows to ⌈n/2⌉, with the possibility for each row to host a multiple y_i × X, with y_i ∈ {0, ±1, ±2}. While it is straightforward to generate the positive terms 0, X, and 2X (the latter through a left shift of X), some attention is required to generate the terms −X and −2X which, as observed in Table 1, can arise from three configurations of the y_{2i+1}, y_{2i}, and y_{2i−1} bits. To avoid computing the negative encodings −X and −2X explicitly, the two's complement of the multiplicand is generally used. From a mathematical point of view, the use of two's complement requires extension of the sign to the leftmost part of each partial product row, with the consequence of an extra area overhead. Thus, a number of strategies for preventing sign extension have been developed. For instance, the scheme in [1] relies on the observation that the string of replicated sign bits can be replaced by a short string of constant ones plus the complemented sign bit of each row. The array resulting from the application of the sign extension prevention technique in [1] to the partial product array of an 8 × 8 MBE multiplier [5] is shown in Fig. 2.
The use of two's complement requires a neg signal (e.g., neg0, neg1, neg2, and neg3 in Fig. 2) to be added in the LSB position of each partial product row for generating the two's complement, as needed. Thus, although for an n × n multiplier only ⌈n/2⌉ partial products are generated, the maximum height of the partial product array is ⌈n/2⌉ + 1. When 4-to-2 compressors are used, which is a widely used option because of the high regularity of the resultant circuit layout for n a power of two, the reduction of the extra row may require an additional delay of two XOR2 gates. By properly connecting partial product rows and using a Wallace reduction tree [7], the extra delay can be further reduced to one XOR2 [16], [17]. However, the reduction still requires additional hardware, roughly a row of n half adders. This issue is of special interest when n is a power of two, which is by far a very common case, and the multiplier's critical path has to fit within the clock period of a high-performance processor. For instance, in the design presented in [2], for n = 16, the maximum column height of the partial product array is nine, with an equivalent delay for the reduction of six XOR2 gates [16], [17]. For a maximum height of the partial product array of 8, the delay of the reduction tree would be reduced by one XOR2 gate [16], [17]. Alternatively, with a maximum height of eight, it would be possible to use 4-to-2 adders, with a delay of the reduction tree of six XOR2 gates, but with a very regular layout.

3. RELATED WORK: Some approaches have been proposed aiming to add the ⌈n/2⌉ + 1 rows, possibly in the same time as the ⌈n/2⌉ rows. The solution presented in [14] is based on the use of different types of counters, that is, it operates at the level of the PP reduction phase. Kang and Gaudiot propose a different approach in [9] that manages to achieve the goal of eliminating the extra row before the PP reduction phase.
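To see how the neg signals work end to end, the following Python sketch (names are ours, and it is purely behavioral, not the paper's circuits) builds the ⌈n/2⌉ radix-4 MBE rows and, for a negative digit, uses the complemented multiple plus a neg bit added at the row's LSB position, reproducing a signed product:

```python
# radix-4 MBE digit selected by the (y_{2i+1}, y_{2i}, y_{2i-1}) window
MBE = {(0, 0, 0): 0,  (0, 0, 1): 1,  (0, 1, 0): 1,  (0, 1, 1): 2,
       (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}

def mbe_multiply(x: int, y: int, n: int) -> int:
    """Behavioral radix-4 MBE product: ceil(n/2) recoded rows; a negative
    digit contributes the complemented multiple plus a neg bit (the +1)."""
    mask = (1 << (2 * n)) - 1
    if x & (1 << (n - 1)):              # interpret the multiplicand as signed
        x -= 1 << n
    total, prev = 0, 0                  # prev: the zero padded below the LSB
    for i in range(0, n, 2):            # 3-bit window, stride of 2
        d = MBE[((y >> (i + 1)) & 1, (y >> i) & 1, prev)]
        prev = (y >> (i + 1)) & 1
        row = d * x
        if row < 0:                     # two's complement: invert, then ...
            row = ((~(-row)) & mask) + 1   # ... the neg bit supplies the 1
        total += row << i               # the row's weight is 4^(i/2) = 2^i
    return total & mask

def to_signed(v: int, bits: int) -> int:
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

print(to_signed(mbe_multiply(0b10011100, 0b01100011, 8), 16))   # -9900
```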
This approach is based on computing the two's complement of the last partial product, thus eliminating the need for the last neg signal, in logarithmic time. A special tree structure (basically an incrementer implemented as a prefix tree [18]) is used in order to produce the two's complement (Fig. 3), by decoding the MBE signals through a 3-5 decoder (Fig. 4a). Finally, a row of 4-1 multiplexers with implicit zero output is used (Fig. 4b) to produce the last partial product row directly in two's complement, without the need for the neg signal. The goal is to produce the two's complement in parallel with the computation of the partial products of the other rows, with maximum overlap. In such a case, it is expected to have no, or only a small, time penalization in the critical path. The architecture in [9], [18] is a logarithmic version of the linear method presented in [19] and [20]. With respect to [19], [20], the approach in [9] is more general, and shows better adaptability to any word size. An example of the partial product array produced using the above method is depicted in Fig. 5. In this work, we present a technique that also aims at producing only ⌈n/2⌉ rows, but relying on a different approach than [9].

4. BASIC IDEA: The case of n × n square multipliers is quite common, as is the case of n being a power of two. Thus, we start by focusing our attention on square multipliers, and then present the extension to the general case of m × n rectangular multipliers.

4.1 Square Multipliers:
The proposed approach is general and, for the sake of clarity, will be explained through the practical case of 8 × 8 multiplications (as in the previous figures). As briefly outlined in the previous sections, the main goal of our approach is to produce a partial product array with a maximum height of ⌈n/2⌉ rows, without introducing any additional delay. Let us consider, as the starting point, the form of the simplified array reported in Fig. 2, for all the partial product rows except the first one. As depicted in Fig. 6a, the first row is temporarily considered as being split into two subrows, the first one containing the partial product bits (from right to left) from pp00 to the complemented bit pp80, and the second one with two bits set at "one" in positions 9 and 8. Then, the bit neg3, related to the fourth partial product row, is moved to become a part of the second subrow. The key point of this "graphical" transformation is that the second subrow, now also containing the bit neg3, can easily be added to the first subrow with a constant short carry propagation of three positions (further denoted as "3-bit addition"), a value which is easily shown to be general, i.e., independent of the length of the operands, for square multipliers. In fact, with reference to the notation of Fig. 6, due to the particular value of the second operand, i.e., 0 1 1 0 neg3, in [11] we have observed that this addition requires a carry propagation only across the least significant three positions, a fact that can also be seen from the implementation shown in Fig. 7. It is worth observing that, in order not to have delay penalizations, it is necessary that the generation of the other rows is done in parallel with the generation of the first row cascaded by the computation of the bits qq70 and qq60 in Fig. 6b. In order to achieve this, we must simplify and differentiate the generation of the first row with respect to the other rows.
We observe that the Booth recoding for the first row is computed more easily than for the other rows, because the y_{-1} bit used by the MBE is always equal to zero. In order to have a preliminary analysis which is as independent as possible of technological details, we refer to the circuits in the following figures: Fig. 1, slightly adapted from [10, Fig. 12], for the partial product generation using MBE; Fig. 7, obtained through manual synthesis (aimed at modularity and area reduction without compromising the delay), for the addition of the last neg bit to the three most significant bits of the first row; Fig. 8, obtained by simplifying Fig. 1 (since, in the first row, y_{2i-1} = 0), for the partial product generation of the first row only using MBE; and Fig. 9, obtained through manual synthesis of a combination of the two parts of Fig. 8 and aimed at decreasing the delay of Fig. 8 with no or very small area increase, for the partial product generation of the first row only using MBE. In particular, we observe that, by direct comparison of Figs. 1 and 8, the generation of the MBE signals for the first row is simpler, and theoretically allows for the saving of the delay of one NAND3 gate. In addition, the implementation in Fig. 9 has a delay that is smaller than the two parts of Fig. 8, although it could require a small amount of additional area. As we see in the following, this issue hardly has any significant impact on the overall design, since this extra hardware is used only for the three most significant bits of the first row, and not for all the other bits of the array. The high-level description of our idea is as follows:

1. Generation of the three most significant bit weights of the first row, plus addition of the last neg bit: possible implementations can use three replicas of the circuit of Fig. 9 (one for each of the three most significant bits of the first row), cascaded by the circuit of Fig. 7 to add the neg signal;
2. Parallel generation of the other bits of the first row: possible implementations can use instances of the circuitry depicted in Fig. 8, for each bit of the first row except the three most significant;

3. Parallel generation of the bits of the other rows: possible implementations can use the circuitry of Fig. 1, replicated for each bit of the other rows.

All items 1 to 3 are independent, and therefore can be executed in parallel. Clearly if, as assumed and expected, item 1 is not the bottleneck (i.e., the critical path), then the implementation of the proposed idea reaches the goal of not introducing time penalties.

4.2 Extension to Rectangular Multipliers:

A number of potential extensions to the proposed method exist, including rectangular multipliers, higher radix MBE, and multipliers with fused accumulation [11]. Here, we briefly focus on m x n rectangular multipliers. With no loss of generality, we assume m >= n, i.e., m = n + m' with m' >= 0, since this leads to a smaller number of rows; for simplicity, and also with no loss of generality, in the following we assume that both m and n are even. Now, we have seen in Fig. 6a that, for m' = 0, the last neg bit, i.e., neg_{n/2-1}, belongs to the same column as a first-row partial product bit. We observe that the first partial product row has bits only up to position n; therefore, in order to also include in the first row the contribution of the last neg bit, due to the particular nature of the operands it is necessary to perform a longer carry propagation (i.e., an (m' + 3)-bit addition, which reduces to the 3-bit addition of the square case for m' = 0). Thus, for rectangular multipliers, the proposed approach can be applied at the cost of an (m' + 3)-bit addition. The complete or even partial execution overlap of the first row with the generation of the other rows clearly depends on a number of factors, including the value of m' and the way that the (m' + 3)-bit addition is implemented, but the proposed approach still offers an interesting alternative that can be explored for designing and implementing rectangular multipliers.
5. EVALUATION AND COMPARISONS:

In this section, the proposed method, based on the addition of the last neg signal to the first row, is first evaluated. The designed architecture is then compared with an implementation based on the computation of the two's complement of the last row (referred to as the "Two's complement" method), using the designs for the 3-5 decoders, 4-1 multiplexers, and two's complement tree in [9]. Moreover, in the analysis, the standard MBE implementations for the first and for a generic partial product row are also taken into account (as summarized in Table 2). For all the implementations, we explicitly evaluate the most common case of an n x n multiplier, although we have shown in Section 4 that the proposed approach can also be extended to m x n rectangular multipliers. While studying the framework of possible implementations, we considered the first phase of the multiplication algorithm (i.e., the partial product generation), and we focused our attention on the issues of area occupancy and modular design, since it is reasonable to expect that they lead to a possibly small multiplier with a regular layout. The detailed results of some extensive evaluations and comparisons, based both on theoretical analysis and on related implementations, are reported in [11]. Results encompass the following:

1. Theoretical analysis based on the concept of equivalent gates from Gajski's analysis [21] (as in [9]);

2. Theoretical analysis based on delay and area costs for elementary gates in a standard cell library;

3. Theoretical analysis showing that the proposed approach, in the version minimizing area, can very likely overlap the generation of the first row with the generation of the other rows; and

4. Validation by logic synthesis and technology mapping to an industrial cell library.

All the results show the feasibility of the proposed approach.
Here, for the sake of simplicity, we briefly summarize the results of the theoretical analysis and check the validity of our estimations through logic synthesis and simulation.

5.1 High-Level Remarks and Theoretical Analysis:

As can be seen from Fig. 6, the generation of the first row differs from the generation of the other rows for basically two reasons:

1. The first row needs to assimilate the last neg signal, an operation which requires an addition over the three most significant bit weights;
2. The first row can take advantage of a simpler Booth recoding, as the y_{-1} bit used by the MBE is always equal to zero (Section 4).

As seen before, Fig. 8 shows a possible implementation for generating the first row which takes into account the simpler generation of the MBE signals. We have seen that by combining the two parts of Fig. 8 we get Fig. 9, which is faster than Fig. 8, at a possibly slightly larger area cost that is certainly very marginal with respect to the global area of all the partial product bits coming from the other rows. We have done some rough simulations and found that a good trade-off could be to have the generation of the first bits of the first row carried out by the circuit of Fig. 9, followed by the cascaded addition provided by Fig. 7 (Section 4). Based on all of the above, our architecture has been designed to perform the following operations:

1. Generation of the three most significant bit weights of the first row (through the very small and regular circuitry of Fig. 9) and addition to these bits of the neg signal (by means of the circuitry of Fig. 7);

2. Generation of the other bits of the first row, using the circuitry depicted in Fig. 8; and

3. Generation of the bits of the other rows, using the circuitry of Fig. 1.

As these three operations can be carried out in parallel, the overall critical path of the proposed architecture is given by the largest delay among the above paths. Critical path and area cost for the proposed architecture, as well as for the other implementations in Table 2, were computed with reference to a 130 nm HCMOS standard cell library from STMicroelectronics [22] (later used also for obtaining the overall synthesis results). In this analysis, the contribution of wires was neglected, and a buffer-free configuration was considered. Nonetheless, details regarding buffer stage locations and sizes are discussed in [11].
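Since the three operations above run in parallel, the critical path is simply the maximum of the three path delays, and the "no penalty" condition is that operation 1 (the 3-MSB generation plus neg addition) is not the slowest. This bookkeeping can be sketched trivially (illustrative Python with made-up placeholder delays, not library data):

```python
def overall_delay(d_msb_gen, d_neg_add, d_first_row, d_other_rows):
    """Critical path of the proposed partial product generation stage."""
    path1 = d_msb_gen + d_neg_add   # op 1: three MSBs of first row + neg bit
    path2 = d_first_row             # op 2: remaining bits of the first row
    path3 = d_other_rows            # op 3: bits of all the other rows
    return max(path1, path2, path3)
```

For instance, with placeholder unit delays where the other rows dominate, `overall_delay(2, 1, 3, 4)` returns 4: the neg addition is completely hidden.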
Data concerning area and delay for the elementary cells used in this work (as well as in [9]) are reported in Table 3. Results are reported in Tables 4 and 5, respectively. It is worth observing that results may vary depending on the specific parameters selected for the synthesis, such as logic implementation, optimization strategies, and target libraries. We observe that the "Two's Complement" approach has a delay that is longer than the delay to generate the standard partial product rows, becoming even longer as the size n of the multiplier increases (e.g., exceeding the delay of an XNOR2 gate starting from n = 16). On the other hand, according to the theoretical estimations, we can see that the delay for generating the first row in the proposed method is estimated to be lower than the delay for generating the standard rows. This means that the extra row is eliminated without any penalty on the overall critical path. With respect to area costs, it can be observed that the proposed method hardly introduces any area overhead with respect to the standard generation of a partial product row. On the other hand, the "Two's Complement" approach requires additional hardware, which increases with the size of the multiplier.

5.2 Implementation Results:

In order to further check the validity of our estimations in an implementation technology, we implemented the designs in Table 2 through logic synthesis and technology mapping to an industrial standard cell library. Specifically, for the logic synthesis we used Synopsys Design Compiler, and the designs were mapped to a 130 nm HCMOS industrial library from STMicroelectronics [22]. To perform the evaluation, we obtained the area-delay space for the sole generation of the partial product row of interest (i.e., the first row in the proposed approach, the last row in the implementation presented in [9]). In order to support the comparison, the area-delay space for the generation of the partial product rows using standard MBE
implementations was also evaluated, by considering the first row and the other rows of the partial product array separately (Table 2). The results, obtained for n = 8, 16, and 32, are depicted in Fig. 10. The delays are shown both in absolute units (ns) and normalized to the delay of an inverter with a fan-out of four (68 ps for the technology used, under worst-case conditions). Accordingly, the area is presented both in absolute units (um^2) and normalized to equivalent gates using the area of a NAND2 gate (4.39 um^2 for the technology used). We obtained several design points (using different target delays) for each approach, and the minimum delay shown corresponds to the fastest design that the tool was capable of synthesizing. We observe that the "Proposed method" implementation produces a curve in the delay-area graph bounded by the curve for the generation of a standard partial product (upper bound) and by the curve for the standard generation of the first partial product (lower bound) for the three values of n considered. Moreover, the minimum delay that is achieved is very similar to that of the generation of a standard partial product for n = 8, 16 (with our approach it is about 0.5-0.7 FO4 higher), and is even lower for n = 32, due to the predominant effect of the higher loading of the control signals. Therefore, our scheme does not introduce any additional delay in the partial product generation stage for target delays higher than about 5 FO4. The curve for our scheme gets closer to the curve corresponding to the standard generation of the first partial product as n increases. This is due to the fact that, as n increases, the short addition of the leading part achieves more overlap with the generation of the rest of the partial product (whose input load capacitance grows with n).
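The normalizations used in Fig. 10 follow directly from the two library constants quoted above (a 68 ps FO4 inverter delay and a 4.39 um^2 NAND2 area for this 130 nm technology). A minimal sketch, with helper names of our own choosing:

```python
FO4_PS = 68.0           # FO4 inverter delay, worst case, 130 nm library
NAND2_AREA_UM2 = 4.39   # NAND2 cell area used for equivalent-gate counts

def delay_in_fo4(delay_ns):
    """Convert an absolute delay in ns to FO4 units."""
    return delay_ns * 1000.0 / FO4_PS

def area_in_equiv_gates(area_um2):
    """Convert an absolute area in um^2 to equivalent NAND2 gates."""
    return area_um2 / NAND2_AREA_UM2
```

For example, a 0.34 ns path normalizes to 0.34 * 1000 / 68 = 5 FO4, the threshold above which the proposed scheme shows no extra delay.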
The "Two's Complement" scheme achieves minimum delays between 7 and 10 FO4, at the cost of requiring, at that design point, more than four times the area of the "Proposed method" approach. Most importantly, its delay is much higher than that of any standard row.

6. CONCLUSIONS:

Two's complement n x n multipliers using radix-4 Modified Booth Encoding produce n/2 partial products, but due to the sign handling, the partial product array has a maximum height of n/2 + 1. We presented a scheme that produces a partial product array with a maximum height of n/2, without introducing any extra delay in the partial product generation stage. With the extra hardware of a (short) 3-bit addition, and the simpler generation of the first partial product row, we have been able to keep the delay of the proposed scheme within the bound of the delay of a standard partial product row generation. The outcome is that the reduction of the maximum height of the partial product array by one unit may simplify the partial product reduction tree, both in terms of delay and regularity of the layout. This is of special interest for all multipliers, and especially for single-cycle short bit-width multipliers for high-performance embedded cores, where short bit-width multiplications are common operations. We have also compared our approach with a recent proposal with the same aim, considering results obtained with a widely used industrial synthesis tool and a modern industrial technology library, and concluded that our approach may improve both the performance and area requirements of square multiplier designs. The proposed approach also applies, with minor modifications, to rectangular and to higher radix Modified Booth Encoding multipliers.

7. References:

1. M.D. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann Publishers, 2003.
2. S.K. Hsu, S.K. Mathew, M.A. Anders, B.R. Zeydel, V.G. Oklobdzija, R.K. Krishnamurthy, and S.Y. Borkar, "A
110GOPS/W 16-Bit Multiplier and Reconfigurable PLA Loop in 90-nm CMOS," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 256-264, Jan. 2006.
3. H. Kaul, M.A. Anders, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, and S. Borkar, "A 300 mV 494GOPS/W Reconfigurable Dual-Supply 4-Way SIMD Vector Processing Accelerator in 45 nm CMOS," IEEE J. Solid-State Circuits, vol. 45, no. 1, pp. 95-101, Jan. 2010.
4. M.S. Schmookler, M. Putrino, A. Mather, J. Tyler, H.V. Nguyen, C. Roth, M. Sharma, M.N. Pham, and J. Lent, "A Low-Power, High-Speed Implementation of a PowerPC Microprocessor Vector Extension," Proc. 14th IEEE Symp. Computer Arithmetic, pp. 12-19, 1999.
5. O.L. MacSorley, "High Speed Arithmetic in Binary Computers," Proc. IRE, vol. 49, pp. 67-91, Jan. 1961.
6. L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, vol. 34, pp. 349-356, May 1965.
7. C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.
8. D.E. Shaw, "Anton: A Specialized Machine for Millisecond-Scale Molecular Dynamics Simulations of Proteins," Proc. 19th IEEE Symp. Computer Arithmetic, p. 3, 2009.
9. J.-Y. Kang and J.-L. Gaudiot, "A Simple High-Speed Multiplier Design," IEEE Trans. Computers, vol. 55, no. 10, pp. 1253-1258, Oct. 2006.
10. J.-Y. Kang and J.-L. Gaudiot, "A Fast and Well-Structured Multiplier," Proc. Euromicro Symp. Digital System Design, pp. 508-515, Sept. 2004.
11. F. Lamberti, N. Andrikos, E. Antelo, and P. Montuschi, "Speeding-Up Booth Encoded Multipliers by Reducing the Size of Partial Product Array," internal report, http://arith.polito.it/ir_mbe.pdf, pp. 1-14, 2009.
12. E.M. Schwarz, R.M. Averill III, and L.J. Sigal, "A Radix-8 CMOS S/390 Multiplier," Proc. 13th IEEE Symp. Computer Arithmetic, pp. 2-9, 1997.
13. W.-C. Yeh and C.-W. Jen, "High-Speed Booth Encoded Parallel Multiplier Design," IEEE Trans. Computers, vol. 49, no. 7, pp. 692-701, July 2000.
14. Z. Huang and M.D.
Ercegovac, "High-Performance Low-Power Left-to-Right Array Multiplier Design," IEEE Trans. Computers, vol. 54, no. 3, pp. 272-283, Mar. 2005.
15. R. Zimmermann and D.Q. Tran, "Optimized Synthesis of Sum-of-Products," Proc. Conf. Record of the 37th Asilomar Conf. Signals, Systems and Computers, vol. 1, pp. 867-872, 2003.
16. V.G. Oklobdzija, D. Villeger, and S.S. Liu, "A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach," IEEE Trans. Computers, vol. 45, no. 3, pp. 294-306, Mar. 1996.
17. P.F. Stelling, C.U. Martel, V.G. Oklobdzija, and R. Ravi, "Optimal Circuits for Parallel Multipliers," IEEE Trans. Computers, vol. 47, no. 3, pp. 273-285, Mar. 1998.
18. J.-Y. Kang and J.-L. Gaudiot, "A Logarithmic Time Method for Two's Complementation," Proc. Int'l Conf. Computational Science, pp. 212-219, 2005.
19. K. Hwang, Computer Arithmetic: Principles, Architectures, and Design. Wiley, 1979.
20. R. Hashemian and C.P. Chen, "A New Parallel Technique for Design of Decrement/Increment and Two's Complement Circuits," Proc. 34th Midwest Symp. Circuits and Systems, vol. 2, pp. 887-890, 1991.
21. D. Gajski, Principles of Digital Design. Prentice-Hall, 1997.
22. STMicroelectronics, "130nm HCMOS9 Cell Library," http://www.st.com/stonline/products/technologies/soc/evol.htm, 2010.

Syntax report

Started : "Check Syntax for PartialProduct".
=========================================================
                     HDL Compilation
=========================================================
Compiling verilog file "pg8b.v" in library work
Compiling verilog file "pg8a.v" in library work
Module <pg8b> compiled
Compiling verilog file "pg1b.v" in library work
Module <pg8a> compiled
Compiling verilog file "pg1a.v" in library work
Module <pg1b> compiled
Compiling verilog file "pg1.v" in library work
Module <pg1a> compiled
Module <PartialProduct> compiled
No errors in compilation
Analysis of file <"PartialProduct.prj"> succeeded.