
Chapter-1

INTRODUCTION
1.1 Introduction
The datapath is the core of every microprocessor, digital signal processor (DSP) and application-specific integrated circuit (ASIC). The heart of this datapath, in turn, comprises various arithmetic units performing computation-intensive arithmetic functions, such as adders, multipliers and multiply-and-accumulate units. High-performance DSPs make extensive use of the Multiply and Accumulate (MAC) unit. The MAC is the most crucial element as it lies in the critical path of the datapath circuit. Therefore, the main motivation behind this thesis is to accelerate the MAC and, consequently, enhance the speed of the DSP. A MAC performs two important functions: multiplication and accumulation.

In DSP applications, multiplication is the most critical operation, as the switching activity and computational load of a multiplier are quite high compared with the other datapath units of a processing architecture. Hence, for all multiplication algorithms implemented in DSPs, latency and throughput are the two major concerns from a delay perspective. Latency is the real delay of a computing function; it measures how long after the inputs to a device are stable the final results appear at the output. Throughput, on the other hand, is the number of multiplications performed in a given period of time. Real-time signal processing therefore requires high-speed, high-throughput multiplier units that consume low power, which is always key to achieving high overall DSP performance. Consequently, the development of high-throughput multipliers has been a subject of interest for decades.

The most common multiplication algorithms implemented in digital systems include array multiplication, Wallace tree multiplication and Booth encoding. The array multiplier is a fast multiplier because the partial products are generated in parallel, which increases the execution speed. The delay associated with the array multiplier is the time taken by the signals to propagate through the gates forming the multiplication array. The main disadvantage of this multiplier is that the worst-case delay increases with the width of the array, so it is limited to small bit-width multiplications.

The Wallace tree multiplier sums the partial products using carry save adders and consequently reduces delay, but it has a more complex layout than the array multiplier as it uses a number of irregular wires. The Booth-encoded multiplier offers an advantage over the array multiplier as it reduces the computational complexity by reducing the number of partial products generated, but the increased speed comes at the cost of increased circuit complexity.

First and foremost, some ancient and basic multiplication algorithms are discussed to explore computer arithmetic from a different point of view. Then, some world-renowned Vedic mathematics algorithms known for yielding quicker results are discussed. In general, an N x N bit multiplication needs N^2 bit multiplications; the Karatsuba algorithm brought this complexity down to about N^1.58. The Urdhvatiryakbhyam sutra, or "vertically and crosswise" algorithm, is then discussed; it reduces the time complexity by breaking the multiplication of two N-bit numbers into N/2 x N/2 bit multiplications, and the process continues until multiplicands of size 2 x 2 are reached. This sutra deals efficiently with large numbers. This work presents a systematic design methodology using the Urdhvatiryakbhyam sutra for developing a high-throughput, speed- and area-efficient multiplexer-based Vedic multiplier.

In order to fulfil the motivation of enhanced speed, the proposed MAC is realized using a multiplier based on the Urdhvatiryakbhyam sutra. The accumulator is again a computation-intensive arithmetic function (CIAF) involving large-operand additions. The work therefore deals with the implementation of various conventional adders and parallel prefix adders. Study and analysis of these adders reveal that once the operand size reaches 32 bits or more, conventional adders such as ripple carry adders and carry look-ahead adders suffer from reduced speed, fan-in limitations and area complexity. To enhance the performance of wider adders, the Han-Carlson parallel prefix adder proves to be a promising adder in terms of speed, so the proposed MAC has been implemented using the Han-Carlson PPA.

The work also presents the logic optimization of the adder. As the adder is a key element of the Vedic multiplier, particular stress is laid on its optimization. Full adders with ten different factorized expressions for carry and sum have been implemented in this thesis. The study reveals that multiplexer-based implementation of circuits reduces the slice utilization of the adder and makes the design area efficient.

Further, the concept of pipelining has been introduced to increase the throughput of the multiplier and hence reduce the power dissipation in the device, ultimately improving its performance. Pipelining reduces the effective critical path by introducing pipelining latches along the datapath and thus reduces the average time per instruction.

1.2 Problem Statement

The main objective of this Dissertation is to implement a high-throughput 32x32 bit Vedic multiplier which fulfils the motivation of reduced area, reduced delay and reduced power consumption.

1.3 Objective of Dissertation

I. Qualitative evaluation of four different existing 32 * 32 urdhvatiryakbhyam multipliers.

II. Implementation of a novel urdhva multiplier architecture using BEC-1.

III. Realization of a multi-operand carry save adder in terms of multiplexers using a logic-optimized full adder circuit.

IV. Comparison of area, delay and power results of proposed multiplier architecture with
various existing urdhva multiplier architectures.

V. Pipelining of the proposed 32*32 bit vedic multiplier to increase the throughput.

VI. Implementation of a 64-bit Han-Carlson adder in the accumulator unit and its performance evaluation.

1.4 Organization of Dissertation

This Dissertation is organized as follows:

Chapter-1: Gives a brief introduction to the use of the MAC and the importance of Vedic multipliers in enhancing the overall performance of a DSP.

Chapter-2: Deals with the literature review for the presented work and examines a comprehensive background of related research, drawing on IEEE papers and other refereed journals which relate the present work to recent research being carried out worldwide and assure the consistency of the work performed.

Chapter-3: Discusses various ancient multiplication algorithms such as Urdhvatiryakbhyam and Nikhilam. It also throws light on the performance of the Vedic multiplier in terms of speed, area and power, briefly explains the logic optimization of digital circuits and, lastly, discusses the technique of pipelining to improve the throughput of the system.

Chapter-4: Deals with the Analysis of four existing architectures of high speed urdhva
multipliers and gives details of each of them.

Chapter-5: Describes the proposed work on the MUX-based implementation of a high-throughput MAC using the Han-Carlson adder and explains the design and basic building blocks of the proposed 32x32 bit Vedic MAC. The chapter also deals with the logic optimization and multiplexer-based implementation of the proposed module. Synthesis results of all the modules incorporated in the design of the proposed module are shown for comparison with the existing urdhva architectures. Lastly, it throws light on the pipelining approach used in the MAC to accelerate its performance.

Chapter-6: Deals with the simulation results obtained after the successful implementation of the proposed work in VHDL. The chapter compares the performance of the proposed Vedic multiplier with the existing architectures at different operand sizes in terms of area, power, delay and space complexity. It also compares various adders and shows the significance of the Han-Carlson adder in the proposed MAC, and compares the performance of the proposed multiplier with and without the use of pipeline registers.

Chapter-7: Gives the conclusion and future scope of the Dissertation.

1.5 Tools Used:

Software: Xilinx ISE 12.4 (Integrated Software Environment) has been used for synthesis and verification; ISim M.81d has been used for simulation.

Hardware used: Xilinx Spartan-3 family, device XC3S400, speed grade -5, package PQ208 FPGA.

Chapter-2

LITERATURE REVIEW
A study of various multipliers reveals that multiplication is a fundamental arithmetic operation. Multiplication-based operations such as Multiply and Accumulate (MAC) and inner product are among the most frequently used Computation-Intensive Arithmetic Functions (CIAF) currently implemented in many Digital Signal Processing (DSP) applications, such as convolution, the Fast Fourier Transform (FFT) and filtering, and in the arithmetic and logic units of microprocessors [1]. Since multiplication dominates the execution time of most DSP algorithms, there is a need for high-speed multipliers. Currently, multiplication time is still the dominant factor in determining the instruction cycle time of a DSP chip. The demand for high-speed processing has been increasing as a result of expanding computer and signal processing applications. Higher-throughput arithmetic operations are important to achieve the desired performance in many real-time signal and image processing applications [2]. One of the key arithmetic operations in such applications is multiplication, and the development of fast multiplier circuits has been a subject of interest for decades. Reducing the time delay and power consumption are essential requirements for many applications [2, 3]. Digital multipliers are core components of all digital signal processors (DSPs), and the speed of a DSP is largely determined by the speed of its multipliers.

Various methods exist for reducing the computation time of a multiplier, with other factors as trade-offs. Multipliers fall broadly into three types: (a) shift-and-add multipliers, which generate the partial products sequentially and accumulate them; these require the least hardware but are the slowest, since the classical shift-and-add technique performs two subtasks, addition and shifting of the bits, and hence consumes 2 to 8 clock cycles per multiplication; (b) multipliers that generate all the partial product bits in parallel and accumulate them using a multi-operand adder; these are called parallel multipliers and use techniques such as the Wallace tree and the Booth algorithm; and (c) multipliers that use arrays of almost identical cells for the generation of the bit products and their accumulation [4].

The most common multiplication algorithms followed in digital hardware are the array multiplication algorithm and the Booth multiplication algorithm. The computation time of the array multiplier is comparatively low because the partial products are calculated independently in parallel; the delay associated with it is the time taken by the signals to propagate through the gates that form the multiplication array. Booth multiplication is another important multiplication algorithm. Large Booth arrays are required for high-speed multiplication and exponential operations, which in turn require large partial sum and partial carry registers. Multiplication of two n-bit operands using a radix-4 Booth recoding multiplier requires approximately n/(2m) clock cycles to generate the least significant half of the final product, where m is the number of Booth recoder adder stages. Thus, a large propagation delay is associated with this case [5]. Another type of multiplier, the Wallace tree multiplier, is considered faster than a simple array multiplier. A Wallace tree multiplier is a parallel multiplier which uses the carry save addition algorithm to reduce latency, but it has a complex layout compared with the array multiplier as it uses a number of irregular wires [6].

These multipliers involve n^2 bit multiplications for multiplying two n-bit multiplicands, which makes them complex. Vedic mathematics provides a number of algorithms which yield quicker results compared with the above multipliers. Vedic Mathematics is the ancient system of mathematics which was rediscovered early in the last century by Sri Bharati Krishna Tirthaji (1884-1960) [7]. The Sanskrit word "Veda" means "knowledge". He organized and classified the whole of Vedic Mathematics into 16 formulae, also called sutras. Among these techniques the most preferred method is the Urdhvatiryakbhyam method; its methodology [8], hardware architecture [9] and implementation details [10] have been presented in the literature.

Various architectures of Urdhvatiryakbhyam-sutra-based multipliers have come into existence. The conventional Vedic multiplier architecture was proposed in [11], with ripple carry adders in the Vedic multiplication unit for 4-bit binary numbers. In this approach, three 4-bit ripple carry adders are used and the combinational path delay is found to be 13.102 ns. When the results are compared with the array and Booth multipliers, it is observed that the execution time is reduced for the Vedic multiplier, which thus proves to be better; however, since the ripple carry adder waits for the carry from the previous stage, it restricts the speed of the multiplier. This architecture was extended up to 32 bits [12] and was found to be faster than the existing multipliers. The modified architecture [13] replaced the ripple carry adder with a carry look-ahead adder (CLA), and it was found that Vedic multipliers using the CLA are more power efficient than the conventional architecture. Carry look-ahead adders are fast as they do not depend on the carry from the previous stage, so the rippling effect is eliminated [14].
The work in [15] presents the efficiency of the Urdhva Tiryagbhyam Vedic method for multiplication, which differs in the actual process of multiplication itself. The modified architecture enables the parallel generation of partial products and eliminates unwanted multiplications. An analysis of some commonly available adders is carried out, and the best adder is used for adding the partial products generated in the Vedic multiplication technique, reducing the combinational delay in the critical path; the summation of the partial products is also done in a parallel fashion, which further reduces the combinational path delay. Since adders are the main units which restrict the speed of a multiplier, fast adders have to be implemented to increase the speed. Parallel prefix adders are the fastest adders available. The Han-Carlson adder is a blend of the Brent-Kung and Kogge-Stone adders: it uses one Brent-Kung stage at the beginning, followed by Kogge-Stone stages, and terminates with another Brent-Kung stage to compute the odd-numbered prefixes [17]. It provides better performance than the Kogge-Stone adder for smaller adders. A high-performance Vedic multiplier using the Han-Carlson adder was proposed in [16]; the benefit of using the Han-Carlson adder is its high operational speed. Synthesis results show that performance parameters such as area and delay are reduced compared with a multiplier using a Kogge-Stone adder for lower numbers of bits, which also makes it more power efficient. Due to its regular and parallel structure the proposed design can be realized on silicon as well. The proposed multiplier is very useful for microprocessors and DSP processors whose performance depends on the efficiency of the multiplier.

A simple digital multiplier architecture based on the Urdhvatiryakbhyam (vertically and crosswise) sutra of Vedic mathematics is presented in [18], in which an improved technique for a low-power and high-speed multiplier of two binary numbers (16 bits each) is developed. The algorithm is implemented in 16 nm CMOS technology; the designed 16x16 bit multiplier dissipates 0.17 mW and the propagation delay of the proposed architecture is 27.15 ns. These results are significant improvements in power dissipation and delay compared with the other architectures in the literature. A further improved structure using just two carry save adders was proposed in [19]; this paper presented a detailed study of different multipliers based on the array multiplier, constant coefficient multiplication (KCM) and multiplication based on Vedic mathematics. All the multipliers are then compared on the basis of LUTs (look-up tables) and path delays, and the results show that the urdhva multiplier is the fastest multiplier with the least path delay. A high-speed multiplier architecture for the multiplication of two 8-bit numbers that combines the advantages of compressor-based adders [20, 21] and the ancient Vedic mathematics methodology has also been proposed. The proposed multiplier is compared with the existing urdhva multiplier and two other popular multipliers; after comparing the speed and the area occupied, it is deduced that the proposed compressor-based Vedic multiplier is better than the conventional multipliers used in several complex VLSI circuits.

Vedic mathematics thus improves the performance of the multiplier in terms of speed. The throughput of a multiplier can be further increased by the use of pipeline registers [26]. Using this technique, RTL implementations of 4x4 Vedic multipliers with and without pipelining [22] have been compared; the area, delay and power analysis of the multipliers was performed in Cadence RTL Compiler (rc), and the delay of the pipelined architecture was reduced by 300 ps.

Adders form an almost obligatory component of every contemporary integrated circuit. The prerequisite for an adder is that it is primarily fast and secondarily efficient in terms of power consumption and chip area; therefore, careful optimization of the adder is of the greatest importance. This optimization can be attained at two levels: circuit optimization or logic optimization. In circuit optimization the sizes of the transistors are manipulated, whereas in logic optimization the Boolean equations are rearranged (or manipulated) to optimize speed, area and power consumption. The paper [23] focuses on the optimization of the adder through technology-independent mapping. The work presents 20 different logical constructions of a 1-bit adder cell in CMOS logic, and their performance is analyzed in terms of transistor count, delay and power dissipation. These performance issues are analyzed using Tanner EDA with TSMC MOSIS 250 nm technology. From this analysis the optimized equation is chosen to construct a full adder circuit in terms of multiplexers. These logic-optimized multiplexer-based adders are incorporated into selected existing adders such as the ripple carry adder, carry look-ahead adder, carry skip adder, carry select adder, carry increment adder and carry save adder [25], and their performance is analyzed in terms of area (slices used) and maximum combinational path delay as a function of operand size.

Chapter-3

VEDIC ALGORITHMS AND LOGIC OPTIMIZATION

3.1 History of Vedic Mathematics

Jagadguru Shankaracharya Bharati Krishna Teerthaji Maharaja (1884-1960) constructed 16 sutras (formulae) and 16 upa-sutras (sub-formulae) after extensive research into the Atharva Veda [6, 21]. Vedic mathematics has created wonders in mathematics, which is why it has been such a wide area of research. Vedic mathematics (VM) deals with numerous basic as well as complex mathematical operations, as it provides simple yet powerful methods for fast calculations.

The word "Vedic" is derived from the word "veda", which means the store-house of all knowledge. Vedic mathematics is mainly based on 16 sutras (or aphorisms) dealing with various branches of mathematics such as arithmetic, algebra and geometry [27]. These sutras, along with their brief meanings, are listed below alphabetically [21, 27].

1) (Anurupye) Shunyamanyat – If one is in ratio, the other is zero.

2) Chalana-Kalanabyham – Differences and Similarities.

3) Ekadhikina Purvena – By one more than the previous One.

4) Ekanyunena Purvena – By one less than the previous one.

5) Gunakasamuchyah – The factors of the sum are equal to the sum of the factors.

6) Gunitasamuchyah – The product of the sum is equal to the sum of the products.

7) Nikhilam Navatashcaramam Dashatah – All from 9 and last from 10.

8) Paraavartya Yojayet – Transpose and adjust.

9) Puranapuranabyham – By the completion or noncompletion.

10) Sankalana- vyavakalanabhyam – By addition and by subtraction.

11) Shesanyankena Charamena – The remainders by the last digit.

12) Shunyam Saamyasamuccaye – When the sum is the same that sum is zero.

13) Sopaantyadvayamantyam – The ultimate and twice the penultimate.

14) Urdhva-tiryakbhyam – Vertically and crosswise.

15) Vyashtisamanstih – Part and Whole.

16) Yaavadunam – Whatever the extent of its deficiency.

These methods can be directly applied to trigonometry, geometry, calculus and applied mathematics. The beauty of Vedic mathematics lies in the fact that it reduces the complex and tedious calculations of conventional mathematics to very simple ones [27]. It is a very research-intensive field and provides some of the most effective algorithms, which can then be applied to digital signal processing operations.

Multiplier architectures are usually classified into the following three categories. The first is the serial multiplier, which focuses on minimum hardware and minimum chip area. The second is the parallel multiplier, which focuses on delay reduction but has the serious drawback of larger chip area. The third is the serial-parallel multiplier, which offers a good trade-off between the time-consuming serial multiplier and the area-inefficient parallel multiplier.

3.2 Vedic Mathematics Algorithms

The proposed Vedic multiplier is based on Vedic mathematics sutras (formulae). These sutras have traditionally been employed for the multiplication of two numbers in the decimal number system. In this work, the same sutras are applied to the binary number system to reduce the complexity of the proposed algorithm. A few of the Vedic multiplication algorithms are discussed below:

3.2.1 Urdhvatiryakbhyam sutra

The multiplier is based on the Urdhva Tiryakbhyam (vertical and crosswise) algorithm of ancient Indian Vedic Mathematics. The Urdhva Tiryakbhyam sutra is a recursive multiplication process which literally means "vertically and crosswise". It is based on a unique concept through which all the partial products are generated in parallel, followed by the concurrent addition of these partial products [8]. The parallelism in the generation of the partial products and their summation is obtained using Urdhva Tiryakbhyam as explained in Figure 3.2. The algorithm can be generalized for n x n bit numbers. Since the partial products and their sums are calculated in parallel, the multiplier is independent of the clock frequency of the processor [19]. Hence, the net advantage is a reduced need for microprocessors to operate at increasingly high clock frequencies, since a higher clock frequency results in increased power dissipation and therefore higher device operating temperatures. By adopting the Vedic multiplier, microprocessor designers can easily overcome these problems and avoid device failures [8]. The processing power of the multiplier can easily be increased by increasing the input and output data bus widths, since it has quite a regular structure [27]. Due to its regular structure, it can easily be laid out on a silicon chip. The multiplier has the striking feature that, as the number of bits increases, gate delay and area increase very slowly compared with other multipliers. Therefore it is time, space and power efficient.

3.2.1.1 Multiplication of two decimal numbers 41 x 53

Figure 3.1 illustrates the Urdhva Tiryakbhyam multiplication scheme for the two numbers 41 and 53. In the first step, the LSBs of both numbers are multiplied; this generates a result digit and a carry digit. This carry is added in the next stage, step 2, where crosswise multiplication is done. In each step, the units digit acts as the result digit while the higher digits act as the carry, and the process goes on. The initial carry is taken as zero.

Figure 3.1: Multiplication of two-digit decimal number using Urdhvatiryakbhyam sutra.

3.2.1.2 Urdhvatriyakbhyam multiplication for binary numbers.

Now we extend this sutra to the binary number system. To illustrate the multiplication algorithm, let us consider the multiplication of two binary numbers X3X2X1X0 and Y3Y2Y1Y0. As the result of this multiplication would be more than 4 bits, we express it as P7P6P5P4P3P2P1P0. The line diagram for the multiplication of two 4-bit numbers is shown in Figure 3.2 [8]. For the sake of simplicity, each bit is represented by a circle. The least significant bit P0 is obtained by multiplying the least significant bits of the multiplicand and the multiplier. The process follows the steps shown in Figure 3.2 [8]. As in the previous case, the digits on both sides of each line are multiplied and added with the carry from the previous step. This generates one of the bits of the result (Pn) and a carry (say Cn). This carry is added in the next step, and hence the process goes on. If more than one line is present in a step, all the results are added to the previous carry.

Figure 3.2: Line diagram showing 4*4 multiplication using urdhvatriyakbhyam sutra[8]

In each step, the least significant bit acts as the result bit and the other bits act as the carry. For example, if in some intermediate step we get 110, then 0 will act as the result bit and 11 as the carry (referred to as Cn in this text). It should be clearly noted that Cn may be a multi-bit number. Thus we get the following expressions:

P0 = X0Y0

C0P1 = X0Y1 + Y0X1

C1P2 = X0Y2 + X1Y1 + X2Y0 + C0

C2P3 = X3Y0 + X2Y1 + X1Y2 + Y3X0 + C1

C3P4 = X3Y1 + X2Y2 + X1Y3 + C2

C4P5 = X3Y2 + X2Y3 + C3

P7P6 = X3Y3 + C4
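As a sanity check of these expressions (the arithmetic below is worked out here, not reproduced from [8]), take X3X2X1X0 = 1011 and Y3Y2Y1Y0 = 1101:

\begin{aligned}
P_0 &= 1\cdot 1 = 1\\
C_0P_1 &= 1\cdot 0 + 1\cdot 1 = 01 &&\Rightarrow P_1 = 1,\ C_0 = 0\\
C_1P_2 &= 1\cdot 1 + 1\cdot 0 + 0\cdot 1 + 0 = 01 &&\Rightarrow P_2 = 1,\ C_1 = 0\\
C_2P_3 &= 1\cdot 1 + 0\cdot 0 + 1\cdot 1 + 1\cdot 1 + 0 = 11 &&\Rightarrow P_3 = 1,\ C_2 = 1\\
C_3P_4 &= 1\cdot 0 + 0\cdot 1 + 1\cdot 1 + 1 = 10 &&\Rightarrow P_4 = 0,\ C_3 = 1\\
C_4P_5 &= 1\cdot 1 + 0\cdot 1 + 1 = 10 &&\Rightarrow P_5 = 0,\ C_4 = 1\\
P_7P_6 &= 1\cdot 1 + 1 = 10 &&\Rightarrow P_6 = 0,\ P_7 = 1
\end{aligned}

so the product P7...P0 is 10001111, i.e. 11 x 13 = 143, as expected.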

In this approach, we observe that as the number of bits increases, the number of stages through which the carry has to ripple also increases and hence the delay increases. So an alternative approach can be used to implement the Urdhva Tiryakbhyam algorithm, as shown in Figure 3.3. This technique uses a "divide and conquer" approach in which the 4x4 multiplication is divided into four 2x2 bit multiplications that can be performed in parallel. This parallel implementation style of the Urdhva Tiryakbhyam sutra reduces the number of logic levels and thus reduces the delay of the multiplier, making it a better and faster implementation. The beauty of this approach is that it divides a large bit stream (N bits) into smaller streams of length N/2 = n, and the recursive process continues until multiplicands of size 2 are reached; these are then multiplied in parallel, providing increased operational speed.

Figure 3.3: Alternative Efficient Approach for Urdhvatiryakbhyam Binary Multiplication

Finally, the 2x2 multiplication follows the traditional "vertically and crosswise" technique shown in Figure 3.4, which is the same as the 2-digit decimal urdhva-tiryakbhyam multiplication. Let the two numbers be x1x0 and y1y0 and the final product be P3P2P1P0. Thus we get the following expressions:

Figure 3.4: 2*2 bit vedic multiplication

P0 = x0 y0 (vertically) (1)

C0P1= x0y1 + x1y0 (crosswise) (2)

P3P2 = C0 + x1y1 (vertically) (3)

3.2.2 Nikhilam Sutra

The Nikhilam sutra effectively deals with large-number multiplications, since it first finds the complement of each large number from its nearest base and then carries out the further multiplication on these complements. The complexity involved in the computation decreases to a large extent as the bit length increases. Nikhilam-based multiplication is illustrated with the following example. Consider the multiplication of the two numbers 96 x 93, where 100 is chosen as the base since it is nearest to, and greater than, both numbers [8].

Figure 3.5: Nikhilam sutra multiplication illustration[8]

The right hand side (RHS) of the product is obtained by multiplying the numbers of Column 2 (the complemented forms): 7 x 4 = 28. The left hand side (LHS) of the product is found by cross-subtracting the second number of Column 2 from the first number of Column 1, or vice versa, i.e. 96 - 7 = 89 or 93 - 4 = 89. The final result is obtained by concatenating the LHS and the RHS (answer = 8928) [8].
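The reason the trick works can be stated as a one-line identity (added here for completeness; it is not spelt out in [8]). With base B and deficiencies a = B - x and b = B - y,

\[
x\,y = (B-a)(B-b) = B\,(x-b) + a\,b ,
\]

so for 96 x 93 with B = 100 the left-hand part is 96 - 7 = 89, the right-hand part is 7 x 4 = 28, and the product is 89 x 100 + 28 = 8928.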

3.3 Performance of Vedic Multiplier


3.3.1. Power

The Vedic multiplier requires fewer gates for a given 8x8 bit multiplication. Fewer gates lead to less power dissipation; another factor responsible for low power consumption is that the switching activity of the Vedic multiplier is lower than that of the array and Booth multipliers, which reduces the dynamic power. Hence, the overall power dissipation is reduced.

3.3.2. Speed

Vedic multipliers are faster than the existing array or Booth-encoded multiplier architectures. As we move towards higher-order bit widths, i.e. from 8x8 to 16x16 bits, the timing advantage of the Vedic multiplier architectures becomes more pronounced: the delay for a 16x16 bit Vedic multiplier is 25 ns, whereas it is 37 ns for the Booth multiplier and 43 ns for the array multiplier [28]. This reduction in gate delay makes Vedic multipliers well suited for signal processing and other operations. The Vedic multiplier offers the advantage that, as the number of bits increases, the gate delay increases only slowly. The speed improvements are attained by parallelizing the generation of the partial products and their concurrent addition.

3.3.3 Area
The area needed for the Vedic multiplier is very small compared with other multiplier architectures: the number of devices used in the Vedic multiplier is 259, while the Booth and array multipliers use 592 and 495 devices respectively, for a 16x16 bit multiplication implemented on a Spartan FPGA [28]. The number of gates required for a Vedic multiplier is small compared with the Booth and array multipliers; as the number of gates reduces, there is a substantial decrease in transistor count and fewer routing resources are used. Hence, the area occupied by the Vedic multiplier is very small.

Due to the regularity of its structure, this architecture can easily be realized on silicon and can work at high speed without increasing the clock frequency. It has the advantage that, as the number of bits increases, the area increases very slowly compared with other multiplier architectures.

3.4 Logic Optimization


The multi-operand carry save adder (CSA) forms an obligatory component of the proposed Vedic multiplier, since the overall performance of any multiplier depends on the performance of its adder. Hence, the prerequisite for this CSA is that it should primarily be fast and secondarily efficient in terms of power and chip area. Therefore, careful optimization of this CSA is an essential task. This optimization can be achieved at two levels [31]:

1. Circuit-level optimization: the circuit is manipulated in terms of transistor sizing, i.e. the W/L aspect ratios of the NMOS and PMOS transistors are varied.

2. Logic-level optimization: the Boolean expression is rearranged and restructured by deriving a logic form with the minimum number of literals, thereby reducing the chip area occupied by the design.

The work presented in this thesis deals with logic optimization only. Reduction in transistor count is a primary design criterion for modern digital circuits and affects the design complexity of a multiplier. Hence, the dominant principle in digital design still revolves around reducing the cost and hardware of the circuit. The best way to achieve this is logic optimization, which involves the rearrangement of a Boolean expression to obtain the fewest literals [23], since the number of transistors required to implement a logic expression is directly proportional to the number of literals. In other words, a minimum-cost circuit has to be implemented to reduce the area.

A literal can be defined as follows: a given product term consists of some number of variables, each of which may appear in either complemented or uncomplemented form, and each such occurrence of a variable is called a literal. Reduction in the number of literals leads to a cost-minimized circuit, where cost represents the total gate count plus the total number of inputs. Typically, logic optimization is done in two phases [23]:

1. Technology-independent phase: in this phase the logic is optimized by applying various Boolean laws to simplify the expressions, or factoring of the expressions is undertaken to overcome the major drawback of fan-in limitations [32]. Similarly, the complexity of the logic circuit in terms of logic gates and wiring can be reduced by decomposing the circuit into small sub-circuits; these sub-circuits can then be reused at several places in the circuit, removing redundant gates. This reduces the gate count and hence the chip area, and consequently the power consumption.

2. Technology-dependent phase: this phase takes into account the peculiarities and properties of the intended implementation architecture, or target technology, on which the circuit is to be implemented [23]. The technology-independent description resulting from the first phase is translated into a gate-level netlist in this phase. The technology-dependent phase is less flexible than the first phase, which facilitates restructuring and rebuilding of the circuit logic to obtain the minimum number of nodes and literals and thus reduce area.
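As a concrete illustration of logic-level optimization, the VHDL sketch below shows one common multiplexer-based factorization of a full adder, built around the propagate signal a xor b. It is only an illustrative example; it is not claimed to be one of the ten factorized expressions actually implemented in this thesis or in [23].

library ieee;
use ieee.std_logic_1164.all;

entity fa_mux is
  port (a, b, cin : in  std_logic;
        sum, cout : out std_logic);
end entity;

architecture rtl of fa_mux is
  signal p : std_logic;              -- propagate signal, p = a xor b
begin
  p    <= a xor b;
  sum  <= p xor cin;
  cout <= cin when p = '1' else a;   -- 2:1 multiplexer replaces the AND-OR carry logic
end architecture;

When p is '1' exactly one of a and b is '1', so the carry out equals cin; when p is '0' we have a = b and the carry out equals a. The multiplexer therefore reproduces the carry function with fewer literals than the usual sum-of-products form.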

3.5 Pipelining Approach

Pipelining is one of the most prominent approaches used in a wide variety of digital circuits to improve the throughput of logic modules with long critical paths. It basically increases the achievable frequency of operation, since the longest register-to-register path is shortened. The minimal clock period that still assures correct evaluation is given by the following expression [31]:

Tmin = Tc-q + Tpd,logic + Tsu, where Tc-q is the clock-to-output delay of the pipeline registers, Tpd,logic is the worst-case propagation delay of the combinational logic between two registers, and Tsu is the register setup time.
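A purely hypothetical numeric instance (the figures are illustrative, not measured values from this work): with Tc-q = 1 ns, Tpd,logic = 8 ns and Tsu = 1 ns,

\[
T_{min} = 1 + 8 + 1 = 10\ \text{ns} \quad (f_{max} = 100\ \text{MHz}),
\]

whereas splitting the same logic into two balanced 4 ns stages with a pipeline register gives Tmin = 1 + 4 + 1 = 6 ns (about 167 MHz), at the cost of one extra cycle of latency and one extra rank of registers.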

The pipelining technique decomposes a sequential process into sub-processes, each of which is executed by a dedicated segment that operates simultaneously with all the other segments. A pipeline can also be visualized as a collection of processing modules through which binary data flows: the result obtained from the computation in one logic module is transferred to the next logic block in the pipeline, and so on, as shown in Figure 3.6. In simple words, this technique breaks a computationally complex block into discrete blocks separated by clocked storage elements such as latches, flip-flops and registers. A clock is then applied to all the registers to assure correct evaluation. Pipelining offers the following advantages:

1. High throughput: pipelining increases the functional throughput of the digital system manifold, as the introduction of registers between the logic blocks shortens the maximum combinational delay of the circuit. These intermediate registers cause the computation on a single set of input data to be spread over a number of cycles. This increased speed and throughput is bought at the expense of latency, as the gain in speed is achieved by clocking the sub-circuits faster, with delay equalization provided by the registers.

Figure 3.6: Pipelining in a Logic Circuit

2. Resource utilization: in a conventional (non-pipelined) process a single set of input data has to pass through the whole series of combinational blocks, and only when it has passed through the entire chain and the result has been computed is the next input fetched. In this case, even when the initial logic blocks have finished their processing they remain idle until the entire process completes, which leads to under-utilization of resources. Pipelining increases the resource utilization: the first instruction enters the first pipe stage; during the next clock cycle the output of the first stage is sent to the next stage while the first stage is filled with the next input set, and the process goes on. Pipelining increases the throughput, but the area overhead increases due to the number of registers involved.

Hence, the work presented uses this concept of pipelining in order to accomplish high performance in the proposed multiply and accumulate (MAC) unit.
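A minimal VHDL sketch of this idea, applied to a multiply-then-add datapath, is shown below. The entity name, operand widths and the two-stage split are illustrative assumptions made here; the sketch is not the pipelined MAC actually proposed in Chapter 5.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipe2_mac is
  port (clk     : in  std_logic;
        a, b, c : in  unsigned(7 downto 0);
        y       : out unsigned(17 downto 0));
end entity;

architecture rtl of pipe2_mac is
  signal m_r : unsigned(15 downto 0);   -- pipeline register after stage 1
  signal c_r : unsigned(7 downto 0);    -- balances the delay of operand c
begin
  process (clk)
  begin
    if rising_edge(clk) then
      m_r <= a * b;                     -- stage 1: multiply
      c_r <= c;
      y   <= resize(m_r, 18) + c_r;     -- stage 2: add the delayed operand
    end if;
  end process;
end architecture;

Without the register m_r the clock period would have to cover the multiplier plus the adder; with it, each period only needs to cover the slower of the two stages, at the cost of one extra cycle of latency.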

Chapter-4

EXISTING URDHVA MULTIPLIER ARCHITECTURES

Before introducing the proposed 32x32 bit Vedic multiplier, it is necessary to review the various pre-existing architectures of 32x32 bit Vedic multipliers based on the Urdhvatiryakbhyam sutra and to show how the proposed architecture differs from the existing structures. Four architectures are discussed below:

4.1 Basic 2x2 bit multiplier based on Urdhvatiryakbhyam (UT) algorithm

The Urdhvatiryakbhyam multiplication algorithm can be extended to binary numbers as explained in the previous chapter. A simple 1-bit binary multiplication is performed by the logical AND of the two bits. Using this and the UT method, a 2x2 multiplication of x1x0 and y1y0 is implemented using just two half adders, and the resultant bits are P3P2P1P0 as shown in Fig. 4.1. The corresponding equations are given below:

P0 = x0 y0 (vertically) (1)

C0P1= x0y1 + x1y0 (crosswise) (2)

P3P2 = C0 + x1y1 (vertically) (3)

Figure 4.1: Block diagram of 2*2 bit vedic multiplier

Higher-order binary multiplications can also be built up with the help of lower-order multiplication units and adder units.

4.2 Efficient 4*4 Vedic multiplication based on Urdhvatiryakbhyam algorithm

Let us assume the multiplicand A = A3A2A1A0, the multiplier B = B3B2B1B0 and the output S = S7S6S5S4S3S2S1S0. Let us divide A and B into two parts, say "A3 A2" and "A1 A0" for A, and "B3 B2" and "B1 B0" for B. Using the fundamentals of Vedic multiplication, taking two bits at a time and using the 2x2 bit multiplier block shown in Figure 4.1, we obtain the structure for 4x4 bit multiplication shown in Figure 4.2 [11].

Figure 4.2: Structure of 4*4 Multiplication[11]

Each block shown above is a 2x2 bit multiplier generating partial products in the manner discussed above. The first 2x2 multiplier has inputs "A1 A0" and "B1 B0", the last block is a 2x2 bit multiplier with inputs "A3 A2" and "B3 B2", and the middle section shows two 2x2 bit multipliers with inputs "A3 A2" & "B1 B0" and "A1 A0" & "B3 B2". Once the partial products from all four 2x2 bit multipliers have been generated, they are added using suitable N-bit adders to obtain the final 8-bit result "S7 S6 S5 S4 S3 S2 S1 S0".

In a generalized way we can say that the individual multiplication products are obtained by the same recursive partitioning method: large bit streams of N x N bit numbers are divided into bit streams of length N/2, which are further divided into bit streams of length N/4, until multiplicands of size 2x2 are obtained, and ultimately the 2x2 bit multiplication method of Figure 4.1 is used to obtain the final product. For an N x N multiplication unit we require four N/2 x N/2 bit multipliers and N-bit adders, as shown in Figure 4.3. The speed of the multiplier ultimately depends on the speed of the adders used in the circuit.
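The recursion can also be stated algebraically (a standard identity, written out here for clarity rather than quoted from [11]): writing an N-bit operand as X = X_H·2^{N/2} + X_L,

\[
X \times Y = (X_H 2^{N/2} + X_L)(Y_H 2^{N/2} + Y_L)
           = X_H Y_H\, 2^{N} + (X_H Y_L + X_L Y_H)\, 2^{N/2} + X_L Y_L ,
\]

which is exactly four N/2 x N/2 multiplications whose results are shifted and summed by the N-bit adders of Figure 4.3.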

Figure 4.3: Generalized NXN Bit Urdhvatriyakbhyam Multiplication Block Diagram

4.3. Existing Approaches for High Speed Urdhvatiryakbhyam Multiplier

4.3.1 Conventional 32X32 Bit Vedic Multiplier Using RCA (Architecture 1)

Pushpalata [11] proposed an architecture with ripple carry adders in the Vedic multiplication unit for 4x4 bit multiplication. This architecture can be extended to higher widths such as 8-, 16- and 32-bit multiplications. The 4x4 multiplier is implemented using 2x2 multiplier units and ripple carry adders as shown in Figure 4.4.

Three 4-bit ripple carry (RC) adders are required. In this proposal, the first 4-bit RC adder is used to add the two 4-bit operands obtained from the cross multiplication of the two middle 2x2 bit multiplier modules. The second 4-bit RC adder adds two 4-bit operands, i.e. a concatenated 4-bit operand and the 4-bit output sum of the first RC adder; its carry "ca1" is forwarded to the third RC adder. The third 4-bit RC adder then adds two 4-bit operands, i.e. a concatenated 4-bit operand (carry ca1, "0" and the two most significant output sum bits of the second RC adder, as shown in Figure 4.4) and the 4-bit output sum of the left-most 2x2 multiplier module.

Three 4-bit ripple carry adders are used and the combinational path delay is found to be 13.102 ns. The results are compared with the array and Booth multipliers, and it is observed that the execution time is reduced for the Vedic multiplier, which thus proves to be better [11, 21].

Figure 4.4: Block Diagram of Conventional 4-Bit Urdhva Multiplier[11]

The hardware implementation of the 32x32 bit conventional Urdhvatiryakbhyam multiplier shown in Figure 4.5 is an extension of the 4x4 multiplier shown in Figure 4.4. The architecture is decomposed into the following two major blocks:

1. 16 * 16 bit conventional vedic multiplier
2. 32- bit ripple carry adder (RCA)

Figure 4.5: Conventional 32*32 Bit Urdhva Multiplier Architecture

4.3.1.1 16x16 bit Conventional Vedic multiplier


The 32-bit input bit streams of the multiplicand A(31-0) and the multiplier B(31-0) are divided into two equal bit streams of 16-bit length. Four 16x16 bit conventional multipliers are used, as shown in Figure 4.5, to generate the partial products using the Urdhvatiryakbhyam technique, in a fashion similar to that of the 4x4 conventional Vedic multiplier explained earlier.

4.3.1.2 Ripple Carry Adder (RCA)

Here, three 32-bit ripple carry adders are used to add the partial products and generate the final product P(63-0). An N-bit (here N = 32) ripple carry adder consists of N-1 full adders and one half adder. This adder is also called a parallel adder because the full and half adders are arranged in such a way that each adder stage generates a sum bit and a carry bit; the sum bit is taken as a result bit and the carry is transmitted to the next adder stage as an input. RCAs are the simplest adders, with a compact layout and the lowest power consumption, but they are not efficient for wide operands, since the delay of an RCA increases linearly with the length of the addend and augend. One of the major factors affecting its speed is that each stage has to wait for the carry to ripple in from the previous stage before it can complete its operation. Hence, its speed is restricted, which eventually restricts the speed of the multiplier. The area complexity of the RCA is O(n) and the time complexity is O(n), where n is the operand size in bits.
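For reference, a generic ripple carry adder of this kind can be sketched in VHDL as below; the generic width, entity and port names are illustrative choices and do not reproduce the exact RCA description used in [11] or in this work.

library ieee;
use ieee.std_logic_1164.all;

entity rca is
  generic (N : integer := 32);
  port (a, b : in  std_logic_vector(N-1 downto 0);
        cin  : in  std_logic;
        sum  : out std_logic_vector(N-1 downto 0);
        cout : out std_logic);
end entity;

architecture rtl of rca is
  signal c : std_logic_vector(N downto 0);   -- internal ripple carry chain
begin
  c(0) <= cin;
  gen : for i in 0 to N-1 generate
    sum(i) <= a(i) xor b(i) xor c(i);                           -- sum bit of stage i
    c(i+1) <= (a(i) and b(i)) or (c(i) and (a(i) xor b(i)));    -- carry ripples to stage i+1
  end generate;
  cout <= c(N);
end architecture;

The carry chain c(0) to c(N) is what makes the delay grow linearly with the operand width.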

When this 32x32 bit conventional multiplier was synthesized in Xilinx ISE 12.4 and simulated in ISim, the combinational delay of this approach was found to be 42.268 ns.

4.3.2 Modified 32X32 bit Vedic multiplier (Architecture 2)


The 4x4 multiplier comprises four 2x2 bit Vedic urdhva multipliers, as explained earlier. The multiplicands are of size 4 bits and the result is of 8-bit length. The inputs A and B are broken into chunks of size N/2, i.e. 2 bits. These 2-bit chunks are given as inputs to the 2x2 bit multipliers to generate partial products of size 4 bits. These outputs are then sent for addition to the addition tree shown in Figure 4.6 [29]. A modified Wallace-tree-like addition reduces the levels of addition to two instead of three [29]. The two LSBs of q0 are resultant bits, while the two MSBs of q0 are sent to the tree for further addition. This design is found to be more area efficient than the conventional design.

Figure 4.6: Modified 4X4 Bit Vedic Multiplier Architecture 2[29].

This architecture can be extended to 8-, 16- and 32-bit multiplications. Figure 4.7 shows the modified architecture for 32x32 bit Vedic multiplication. It has four 16x16 bit multipliers and three adders of variable size. The 32-bit multiplicands are divided into N/2 = 16 bit chunks for both A and B. The 16-bit binary input streams are sent to the 16x16 bit multipliers, where the 16-bit streams are partitioned into 8-bit groups, then into even smaller chunks of four bits, and the process ends with chunks of size 2 bits, which are sent to the 2x2 multiplier blocks. Finally, partial products of size 32 bits are obtained as the outputs of the 16x16 bit Vedic multipliers, which are then sent to the Wallace-tree-like structure for further addition and final product generation. This structure, when simulated and synthesized using Xilinx ISE 12.4 and ISim, gave the following results for various operand sizes.

Figure 4.7: Modified 32*32 Bit Vedic Multiplier 2

4.3.3 32*32 bit Vedic Multiplier using Carry Save Adder (Architecture 3)
Yet another architecture, for an 8x8 multiplier, was proposed in [19], as shown in the block diagram in Fig. 4.8. It can easily be implemented using four 4x4 bit Vedic multiplier modules as discussed previously. Let us analyze the 8x8 multiplication of A = A7 A6 A5 A4 A3 A2 A1 A0 and B = B7 B6 B5 B4 B3 B2 B1 B0. The output line for the multiplication result is 16 bits wide, P = S15 S14 S13 S12 S11 S10 S9 S8 S7 S6 S5 S4 S3 S2 S1 S0. The 8-bit multiplicand A can be decomposed into the pair of 4-bit halves AH-AL; similarly, multiplicand B can be decomposed into BH-BL. The 16-bit product can then be written as:

P = A x B = (AH-AL) x (BH-BL) = (AH x BH)·2^8 + (AH x BL + AL x BH)·2^4 + AL x BL

Using the fundamentals of Vedic multiplication, taking four bits at a time and using the 4-bit multiplier block discussed above, we can perform the multiplication. The outputs of the 4x4 bit multipliers are added appropriately to obtain the final product, so two adders are also required in the final stage. Here, four 4x4 bit Vedic multipliers and two carry save adders are used to implement the 8x8 Vedic multiplier [19].

Figure 4.8: 8x8 Vedic Multiplier Module Using Carry Save Adder architecture 3 [19]

The above architecture can be extended to 32x32 bit multiplication as shown in Figure 4.9. The hardware implementation of the 32x32 bit multiplier using carry save adders is divided into three blocks:
1. 16x16 bit vedic multiplier
2. 2-operand carry save adder
3. 3-operand carry save adder.

Figure 4.9: 32 Bit Multiplier using CSA[19]

The outputs of the 16x16 bit multipliers are added appropriately to obtain the 64-bit final product, so two adders are also required in the final stage. The first carry save adder adds three operands: the 16 MSBs of the first multiplier's output, concatenated with 16 zeroes, are added to the partial products generated by the second and third 16x16 bit multipliers. The 16 LSBs of the sum generated by this carry save adder become resultant bits of the product, while its MSBs are added to the partial product generated by the fourth multiplier in another 2-operand carry save adder (CSA). The output of this CSA gives the final result.

This architecture was simulated and synthesized using Xilinx ISE 12.4 and ISim. The combinational delay was found to be 60.436 ns, but the area occupied by the design was 2052 slices out of 3584, which is quite a lot less than architectures 1 and 2. The memory utilization of this architecture is 187364 kB, which increases its space complexity.

4.3.4 32*32 Bit Low Power Vedic Multiplier Using CLA (Architecture 4)
A fast and low-power 16-bit multiplier architecture was proposed by R.K, R.S, S. Sarkar and Rajesh [18], replacing the ripple carry adder with a carry look-ahead adder (CLA), as shown in Fig. 4.10. Since the carry is generated in advance in this adder, the carry propagation time decreases and the architecture improves the operational speed. The first step in the design of the 16x16 block is grouping the 8 bits (a byte) of each 16-bit input. These lower and upper byte pairs of the two inputs form the vertical and crosswise product terms. Each input byte is handled by a separate 8x8 Vedic multiplier to produce sixteen partial product rows. These partial product rows are then added in a 16-bit carry look-ahead adder to optimally generate the final product bits. Figure 4.10 shows the schematic of a 16x16 block designed using 8x8 blocks.

Figure 4.10: Low Power and High Speed Vedic Multipliers Using CLA architecture 4[18]

The 16x16 bit multiplier has a reduced combinational delay of 27.148 ns and consumes 0.169 W of power [18], which is quite low compared with all the existing multipliers.

The above architecture can also be extended up to 32x32 bit multiplication, as shown in Figure 4.11.

Figure 4.11: Low Power & High Speed 32x32 Bit Vedic Multiplier using CLA architecture 4

Chapter-5

DESIGN IMPLEMENTATION AND SYNTHESIS

5.1 Proposed 32-Bit Multiplier and Accumulator Unit (MAC)

Multiplication and accumulation are the two most tedious and computationally intensive operations in any DSP architecture. Hence, it is the MAC unit which decides the overall performance, most specifically the speed, of the system, as it is present in the critical path. Therefore, the development of a high-speed MAC is crucial. This thesis concentrates on the two major bottlenecks of the MAC, i.e. a fast multiplication network and a fast accumulator, as both of these stages require the addition of large operands, which involves long carry propagation paths, the major factor in determining the speed of the MAC.

Figure 5.1: Proposed MAC architecture

So, the proposed MAC shown in Figure 5.1 uses the urdhva Vedic multiplication network, as it has been identified as the fastest algorithm for multiplication, while the accumulator is based on the Han-Carlson parallel prefix adder, which has proved to be the fastest of the parallel prefix adders. The proposed MAC has two basic blocks:

1. BEC-1 based 32x32 bit Vedic multiplier

2. Han-Carlson adder based accumulator

5.2 Proposed Vedic Multiplier Architecture

The proposed high-throughput Vedic multiplier comprises the following sub-modules:

1. Vedic multiplier
2. Multi-operand carry save adder
3. Binary to excess-1 code converter
4. Multiplexer

All these modules have been coded in VHDL.

5.2.1 Vedic multiplier

The proposed Vedic multiplier design is based on the Urdhvatiryakbhyam sutra, as it is the most preferred technique amongst all the Vedic multiplication techniques. It literally means "vertically and crosswise", and its algebraic principle is based on the multiplication of polynomials. A Vedic multiplier using UT generates 2N-1 cross products of different widths, which when combined form log2(N) + 1 partial products. UT offers the advantage of parallel generation of the partial products and their concurrent addition.

This property of parallel partial product generation offers various advantages. Firstly, it makes the multiplier independent of the clock frequency of the processor, which leads to less power dissipation; hence the design becomes energy efficient. Secondly, it can be implemented with a reduced number of gates, full adders and half adders, and due to the compact and regular architecture of this multiplier it can easily be laid out on a silicon chip. Thirdly, since the partial products are obtained by vertical or crosswise operations, the delay is equal to the delay of the adder; in other words, the critical path consists of the adders adding the maximum number of bits in a cross product.

The multiplier has the unique advantage that, with an increase in the number of bits, the proportionate increase in area and gate delay is quite low in comparison with other multipliers; therefore it is time, area and power efficient. There is a striking difference in the method of computation of the Vedic multiplier, as it handles an array of large numbers (N x N bits) by dividing them into smaller numbers of size N/2 = n, which are divided into further smaller numbers of size n/2, and so on, until multiplicands of size 2x2 are obtained for the parallel generation of partial products. Thus, it performs multiplication with a minimum number of steps, which in turn reduces the delay and the hardware requirements.

So 2x2 bit Vedic multipliers form the fundamental building blocks; using these, 4x4 multipliers are made, from which 8x8 bit multipliers are built, followed by 16x16 bit multipliers, and finally the 32x32 bit multiplier. The device selected for synthesis is the Spartan-3 XC3S400, package PQ208, speed grade -5.

Let us start with the synthesis of the 2x2 bit Vedic multiplier.

5.2.1.1 2x2 Bit Vedic multiplier

A simple 1-bit binary multiplication is performed by the logical AND of the two bits. Using this and the UT method, a 2x2 multiplication of x1x0 and y1y0 is implemented using just two half adders, and the resultant bits are P3P2P1P0 as shown in Fig. 5.2. The corresponding equations are given below:

P0 = x0 y0 (vertically) (1)
C0P1 = x0y1 + x1y0 (crosswise) (2)
P3P2 = C0 + x1y1 (vertically) (3)

In the 2x2 multiplier the input range of each multiplicand is (00-11) and the output lies in the set {0000, 0001, 0010, 0011, 0100, 0110, 1001}. Consider the following illustration: let the multiplicands a and b be 11 and 10. In the first step, vertical multiplication of the two LSBs is done. In the second step, crosswise multiplication and addition of the partial products take place. The third step carries out the vertical multiplication of the MSBs of both multiplicands.
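A behavioural VHDL sketch of this 2x2 block, written directly from equations (1)-(3), is given below; the port and signal names are illustrative and need not match the actual thesis code.

library ieee;
use ieee.std_logic_1164.all;

entity vedic2x2 is
  port (x, y : in  std_logic_vector(1 downto 0);
        p    : out std_logic_vector(3 downto 0));
end entity;

architecture rtl of vedic2x2 is
  signal t1, t2, t3, c0 : std_logic;
begin
  p(0) <= x(0) and y(0);       -- vertical:  P0             (equation 1)
  t1   <= x(0) and y(1);       -- crosswise cross products  (equation 2)
  t2   <= x(1) and y(0);
  t3   <= x(1) and y(1);       -- vertical:  x1*y1          (equation 3)
  p(1) <= t1 xor t2;           -- half adder 1: sum
  c0   <= t1 and t2;           -- half adder 1: carry
  p(2) <= t3 xor c0;           -- half adder 2: sum
  p(3) <= t3 and c0;           -- half adder 2: carry
end architecture;

For x = "11" and y = "10", as in the illustration above, the sketch gives p = "0110", i.e. 3 x 2 = 6.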

Figure 5.2(a)

Figure 5.2(b)

Figure 5.2 (a) and (b): Illustration and Hardware implementation of 2x2 UT multiplier

5.2.1.2 Proposed 4x4 Bit Vedic Multiplier

The proposed 4x4 multiplier consists of four 2x2 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are of size 4 bits each and the resultant product is of size 8 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 2; these newly created chunks are then given as inputs to the 2x2 multiplier blocks, which generate 4-bit results. The outputs of these 2x2 multiplier blocks are then sent to a 4-bit multi-operand carry save adder (CSA), as shown in Figure 5.3, for the summation of the partial products. The sum bits of the CSA become resultant bits, whereas the carry bit of the CSA acts as the selection line of a multiplexer. If the carry is '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSB bits of the resultant product, whereas if the carry is '0', the MSB 2 bits of the result generated by the vertical multiplication of the MSBs of A and B form the resultant bits.

Figure 5.3: Proposed 4x4 Bit Vedic Multiplier

The addition can be explained by the diagram in Figure 5.4, where q0, q1, q2 and q3 are the partial products generated by the 2x2 multiplier blocks and p(7-0) is the sum.

Figure 5.4: Addition of Partial Products In 4x4 Block

5.2.1.3 Proposed 8x8 Bit Vedic Multiplier

The proposed 8x8 multiplier consists of four proposed 4x4 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are of size 8 bits each and the resultant product is of size 16 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 4; these newly created chunks are then given as inputs to the 4x4 multiplier blocks, where they are again broken into even smaller bit streams of size n/4 = 2 and fed to the 2x2 multiply blocks. Finally, each 4x4 bit multiplier block generates an 8-bit result. The outputs of these 4x4 multiplier blocks are then sent to an 8-bit multi-operand carry save adder (CSA), as shown in Figure 5.5, for the summation of the partial products. The sum bits of the CSA become resultant bits, whereas the carry bit of the CSA acts as the selection line of a multiplexer. If the carry is '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSB bits of the resultant product, whereas if the carry is '0', the MSB 4 bits of the result generated by the vertical multiplication of the MSBs of A and B form the resultant bits.

Figure 5.5: Proposed 8x8 bit Vedic multiplier

The addition can be explained by the diagram in Figure 5.6, where q0, q1, q2 and q3 are the partial products generated by the 4x4 multiplier blocks and p(15-0) is the sum.

Figure 5.6: Addition of Partial Products In 8x8 Block

5.2.1.4 Proposed 16x16 Bit Vedic Multiplier

The proposed 16x16 multiplier consists of four proposed 8x8 Urdhvatiryakbhyam multiplier blocks. Here, the multiplicands A and B are of size 16 bits each and the resultant product is of size 32 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 8; these newly created chunks are then given as inputs to the 8x8 multiplier blocks, where they are again broken into even smaller bit streams of size n/4 = 4 and fed to the 4x4 multiply blocks, just as in the case of the 8x8 multiply block. These chunks are again divided in half to get chunks of size 2, which are fed to the 2x2 multiply blocks. Finally, each 8x8 bit multiplier block generates a 16-bit result. The outputs of these 8x8 multiplier blocks are then sent to a 16-bit multi-operand carry save adder (CSA), as shown in Figure 5.7, for the summation of the partial products. The sum bits of the CSA become resultant bits, whereas the carry bit of the CSA acts as the selection line of a multiplexer. If the carry is '1', the output of the binary to excess-1 converter (BEC-1) is taken as the MSB bits of the resultant product, whereas if the carry is '0', the MSB 8 bits of the result generated by the vertical multiplication of the MSBs of A and B form the resultant bits.

Figure 5.7: Proposed 16x16 Bit Vedic Multiplier

The addition can be explained by the diagram in figure 5.8. Let q0, q1, q2, q3 be the
partial products generated by the 8x8 multiplier blocks and p(31-0) be the sum.

Figure 5.8: Addition of Partial Products In 16x16 Block

5.2.1.5 Proposed 32 x 32 Bit Vedic Multiplier

The proposed 32x32 multiplier consists of four 16x16 proposed Urdhvatiryakbhyam multiplier
blocks. Here, the multiplicands A and B are of size 32 bits each and the resultant product is
of size 64 bits. The input bit streams A and B are broken into smaller chunks of size n/2 = 16;
these newly created chunks are then given as inputs to the 16x16 multiplier blocks, where
again the chunks are broken into even smaller bit streams of size n/4 = 8 and fed to the 8x8
multiply blocks, just as in the 16x16 multiply block. Further, these chunks are divided to get
bit streams of size 4, which are given as inputs to the 4x4 blocks. Again, the new chunks are
divided in half to get chunks of size 2, which are fed to the 2x2 multiply blocks. Finally, each
16x16 bit multiplier block generates a 32-bit result. The outputs of these 16x16 multiplier
blocks are then sent to a 32-bit multi-operand carry save adder (CSA), as shown in figure 5.9,
for the summation of the partial products. The sum bits of the CSA become the resultant bits,
whereas the carry bit of the CSA acts as the selection line of a multiplexer. If carry = '1', the
output of the binary to excess-1 converter (BEC-1) is taken as the MSB bits of the resultant
product, whereas if carry = '0', the MSB 16 bits of the result generated by the vertical
multiplication of the MSBs of A and B form the resultant bits.

Figure 5.9: Proposed 32x32 bit Vedic multiplier

5.2.2 Conditional Binary to Excess-1 Code Converter (BEC-1)

The proposed BEC-1 Vedic multiplier architecture employs a binary to excess-1 converter as
its vital element. The major purpose of using BEC-1 is to obtain lower area and improved speed
of operation. This logic can be implemented for the different bit widths used in the modified
design. The major advantages of the BEC logic stem from the following facts:

1. It uses a lesser number of logic gates than the n-bit full adder (FA) structure. So,
when compared to the conventional architecture, the 32-bit ripple carry adder in
figure 4.5 is replaced by BEC-1 in the proposed design. Hence, the area occupied
by the proposed design is much less. The structure of the 4-bit BEC-1 is shown in
figure 5.10 [30], and its RTL view is shown in figure 5.11.

Figure 5.10: BEC-1 Block Diagram[30]

Figure 5.11 RTL View of BEC-1
2. BEC-1 in the proposed design is used to generate two possible partial results in
parallel. A multiplexer is used to select either the BEC-1 output or the direct input
according to the selection line, as shown in figure 5.12 [30]. Since the results are
generated in parallel, the delay of the proposed design is reduced. A short
behavioural sketch of the converter follows figure 5.12.

Figure 5.12 BEC-1 with Multiplexer[30]
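
For reference alongside figures 5.10-5.12, a minimal behavioural sketch of an unconditional 4-bit binary to excess-1 converter (X = B + 1) is given below. The entity name bec4_sketch is illustrative only; the conditional BEC-1 actually used in the design additionally gates every output with the CSA carry, as listed in Appendix A (entity BEC_16).

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity bec4_sketch is
Port ( b : in STD_LOGIC_VECTOR (3 downto 0);
x : out STD_LOGIC_VECTOR (3 downto 0));
end bec4_sketch;
architecture Behavioral of bec4_sketch is
begin
x(0) <= not b(0);                              -- invert the LSB
x(1) <= b(1) xor b(0);                         -- propagate the increment
x(2) <= b(2) xor (b(1) and b(0));
x(3) <= b(3) xor (b(2) and b(1) and b(0));
end Behavioral;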

Unlike the proposed design, existing architecture 3 uses a 2-operand CSA which has to wait
for the carry generated by the preceding 3-operand CSA; only after that carry is generated can
it add it in to obtain the final result. Similarly, architecture 4 uses a half adder assembly
which again has to wait for the carry from the preceding CLA units before it can add the carry
to obtain the final product.
Hence, the use of BEC-1 in the proposed Vedic multiplier fulfills both our motivations of
reduced area and reduced delay.

5.2.3 Multi-Operand Carry Save Adder

A carry save adder (CSA) is a design particularly suited to speedy multi-operand addition. A
CSA consists of a ladder of stand-alone full adder circuits; in other words, a CSA comprises
n disjoint full adders, each of which individually computes the sum and carry bits for the
corresponding 3 input bits. So, basically, it has two main units, as shown in figure 5.13:

Figure 5.13 K-Bit Carry Save Adder Adding 3 K-Bit Operands

1. CSA unit: It consists of n disjoint full adders which take three n-bit binary input
numbers and produce two outputs, i.e. an n-bit partial sum and an n-bit partial carry.
Therefore, it reduces the addition of three numbers to the addition of two numbers, and
it is also known as a 3:2 counter for the same reason. The expression governing the
carry save adder is given by the following equation (a short worked example follows
this list):
A + B + C = Sum + 2 * Carry
2. CPA unit: CPA stands for the carry propagate adder unit. The resultant sum of the
carry save adder is obtained by adding the n-bit partial sum and the n-bit partial carry.
Either a ripple carry adder or a carry look-ahead adder can be used as the CPA to
compute the final sum.
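
As a short worked example of the above equation, take A = 0101 (5), B = 0011 (3) and C = 0110 (6). Adding the three operands column-wise, without any carry rippling between columns, gives Sum = 0000 (the bitwise XOR of the three operands) and Carry = 0111 (the bitwise majority), so that Sum + 2 * Carry = 0 + 14 = 14 = 5 + 3 + 6.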

The carry save adder has been used in the proposed Vedic multiplier as a compressor circuit to
add the partial products. Figure 5.14 makes the use of the CSA in the proposed technique quite
clear; the 8x8 bit multiplication 11111111 x 11111111 has been illustrated. The 3:2 compressor
in the proposed technique offers the following advantages over the adder chains present in the
existing architectures.

Figure 5.14: Illustration of 8x8 proposed vedic multiplier technique.

1. Delay: Speed enhancement of a circuit requires minimal carry propagation. Unlike
the other adders, where two carry propagation chains are required to add 3 operands, the
CSA focuses on parallel partial sum and carry generation, in which each column adds
up without waiting for the carry from the previous stage. Hence, the time complexity of
the CSA unit is O(1), where the delay equals the delay of a single full adder circuit. It
has been implemented and analyzed that the addition of 3 numbers using a CSA and a
ripple carry adder is much faster than the addition using two ripple carry adders as in
the conventional Vedic multiplier architecture 1, since the complexity of 2 RCAs is 2n
(n is the operand size) whereas for a CSA followed by an RCA it is (n+1). Table 5.1
shows the area and time complexity of the CSA, CLA and RCA.

TABLE 5.1
AREA AND TIME COMPLEXITIES OF VARIOUS ADDERS

Adder                        Area complexity   Time complexity
Ripple carry adder           O(n)              O(n)
Carry look ahead adder       O(n logn)         O(logn)
3-operand carry save adder   O(n)              O(logn)

Since ripple carry adders are relatively slow adders (each full adder has to wait for
the carry from the previous stage, so the complexity is O(n)), a carry look-ahead adder
can be used instead so that the complexity of the multi-operand adder further reduces to
(logn + 1). It has also been observed from the synthesis results that the CSA in the
proposed architecture reduced the combinational delay to 15.553 ns, as against the use
of two CLAs in architecture 4 discussed earlier, which takes 20 ns.
2. Area: The area occupied by the CSA is given by the following equation:
A(CSA) = N x A(FA) + A(CPA)
It is observed that the use of carry look-ahead adders in architectures 2 and 3
increases the number of slices occupied, as the area of a CLA is larger than that of an
RCA; the area complexity of the CLA is O(n logn) whereas for the RCA it is O(n).
Hence, the proposed design shows a 37.5% reduction in the number of slices utilized
as compared to architecture 4.

5.2.4 Flowchart of Proposed Vedic Multiplier Architecture For NXN Bits

The following flowchart explains the architecture of the proposed multiplier when extended to
NxN bit multiplication.

Figure 5.15: Flowchart of NXN Bit Proposed Multiplier

5.3 Accumulator
The MAC performs the multiplication of two numbers, i.e. the multiplier and the multiplicand,
and adds the product obtained to the result stored in the accumulator to attain the final
result. The output of the register is fed back as an input to the adder; then, on each cycle,
the output of the multiplier is added to the contents present in the accumulator. Figure 5.16
shows the accumulator unit. Since, in this case, the accumulator has to perform the addition of
large 64-bit operands, which involves long carry propagation chains, fast adders need to be
implemented to improve the performance of the whole MAC unit.


Figure 5.16: Block Diagram of Accumulator
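
A minimal behavioural sketch of the accumulate register of figure 5.16 is given below. The port names follow the accumulator component declared in Appendix A, while the entity name accumulator_sketch and the active-low reset behaviour are assumptions made for illustration; the 64-bit adder that produces x sits outside this register and receives the latched result as feedback.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity accumulator_sketch is
Port ( clk, resetn : in STD_LOGIC;
x : in STD_LOGIC_VECTOR (63 downto 0);
result : out STD_LOGIC_VECTOR (63 downto 0));
end accumulator_sketch;
architecture Behavioral of accumulator_sketch is
begin
process (clk, resetn)
begin
if resetn = '0' then
result <= (others => '0');        -- clear the running sum
elsif rising_edge(clk) then
result <= x;                      -- latch the adder output on every cycle
end if;
end process;
end Behavioral;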


5.3.1 Adder design

The MAC requires fast-response adders. Hence, implementation of the accumulator with
conventional adders like the ripple carry adder (RCA) or the regular carry look-ahead (CLA)
adder will degrade the performance of the MAC due to the following reasons:

1. The RCA suffers from a serious drawback of increased delay with an increase in the
number of bits, as each of the consecutive full adders has to wait for the carry from
the previous stage. Since it takes a long time for carry propagation, it will reduce the
speed if used in the MAC.
2. The CLA is a fast adder in contrast to the RCA, as it calculates the carry in advance
by simply looking at the input bits, which results in a reduced carry propagation delay.
But the use of the CLA is restricted to smaller-width adders because, for wider adders,
there will be substantial loading capacitance and therefore large delay and large power
consumption.

So, in order to attain high performance, implementation of the MAC with parallel prefix adders
(PPA) is one of the most promising solutions to the above-mentioned drawbacks.

5.3.1.1 Parallel Prefix Adders

Parallel prefix adders are considered to be among the fastest adders and are flexible for VLSI
implementation; they perform high-speed additions by pre-computing generate and propagate
signals. PPA operation has the following 3 stages [16]:

1. Pre-processing stage: This stage involves the computation of the generate and
propagate signals that are used to produce the carry input of each adder; A and B are
the inputs. These signals are given by equations (1) and (2) [17]:
Pi = Ai ⊕ Bi   (1)
Gi = Ai . Bi   (2)
2. Carry generation network: The carry corresponding to each bit is computed in this
stage. The generate and propagate signals are treated as intermediate signals and the
carries are computed in parallel. This stage is governed by equations (3) and (4) [17];
a small VHDL sketch of this combine step follows the list.
P(i:k) = P(i:j) . P(j-1:k)   (3)
G(i:k) = G(i:j) + (G(j-1:k) . P(i:j))   (4)
3. Post-processing stage: This stage computes the final sum and carry bits following
equations (5) and (6) [17]:
Si = Pi ⊕ Ci   (5)
Ci+1 = (Pi . C0) + Gi   (6)
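
The combine step of equations (3) and (4) can be visualised with the following minimal VHDL sketch of a single prefix cell. The entity name prefix_cell and its port names are illustrative only and do not appear in the thesis RTL; the complete 64-bit Han-Carlson adder is listed in Appendix A.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity prefix_cell is
Port ( g_hi, p_hi, g_lo, p_lo : in STD_LOGIC;
g_out, p_out : out STD_LOGIC);
end prefix_cell;
architecture Behavioral of prefix_cell is
begin
g_out <= g_hi or (p_hi and g_lo);   -- G(i:k) = G(i:j) + (G(j-1:k) . P(i:j))
p_out <= p_hi and p_lo;             -- P(i:k) = P(i:j) . P(j-1:k)
end Behavioral;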

The most commonly used PPAs are the Brent-Kung, Kogge-Stone and Han-Carlson adders. The
Kogge-Stone adder is the fastest reported PPA and generates the carry signal in O(logn) time
[31], as it has minimum logic depth and minimum fan-out, but this increase in speed comes with
an increase in area. The Brent-Kung adder is area efficient and occupies less routing area,
O(n logn), but it is slower than the Kogge-Stone adder, with T = 2logn - 1 [31].

The proposed MAC uses the Han-Carlson adder as it is a compromise between the Kogge-Stone and
Brent-Kung adders: it uses a Brent-Kung stage at the beginning, followed by Kogge-Stone stages,
and another Brent-Kung stage at the termination. It generates the carry in T = logn + 1 time
and occupies area O(n logn) [17]. Figure 5.17 [16] shows the Han-Carlson adder.

Figure 5.17: 16-bit Han Carlson Adder[16]

Implementation of the MAC unit with the Han-Carlson adder proves to be faster than the hybrid
carry select adder which is frequently used nowadays for MAC implementation. This increase in
speed comes with an increase in area, as examined from the analysis shown in chapter 6.

5.4 Logic Optimization of CSA-Multiplexer Based Approach

The work presented here deals with technology-independent logic optimization of the
multi-operand carry save adder. Since a multi-operand CSA is a ladder of multiple stand-alone
full adders, the optimization of the full adder is the major step in the logic optimization of
the CSA.

5.4.1 Full-Adder Implementation Methodologies

The full adder, being the most obligatory component of the CSA, needs careful optimization. A
full adder is a circuit which adds 3 bits, the two inputs A and B and the carry from the
previous stage, and produces two outputs, SUM and the output CARRY. The work presents 10
different logic constructs of the full adder, as shown below.

5.4.1.1 1-Bit Adder Using XOR, AND, OR (Method 1)

The general expressions for the full adder are:

SUM = A ⊕ B ⊕ CIN

CARRY = A.B + B.CIN + CIN.A

Figure 5.18: Full Adder by Method 1
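
A behavioural sketch of this construct is given below, written directly from the SUM and CARRY expressions above; the entity name full_m1 is illustrative (the carry save adders in Appendix A instantiate the equivalent component simply as full).

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity full_m1 is
Port ( a, b, cin : in STD_LOGIC;
sum, cout : out STD_LOGIC);
end full_m1;
architecture Behavioral of full_m1 is
begin
sum <= a xor b xor cin;
cout <= (a and b) or (b and cin) or (cin and a);
end Behavioral;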


5.4.1.2 1-Bit Adder using NAND (Method 2)
SUM = A ⊕ B ⊕ CIN
CARRY = A.B + B.CIN + CIN.A

Figure 5.19: Full adder using NAND gates

5.4.1.3 1-Bit Adder Using NOR (Method 3)


SUM = A ⊕ B ⊕ CIN
CARRY = A.B + B.CIN + CIN.A

Figure 5.20: 1-Bit Adder Using NOR
5.4.1.4 1-Bit Adder Using XOR, NOR, NOT, OR (Method 4)
SUM = A ⊕ B ⊕ CIN

CARRY =

Figure 5.21: 1-Bit Adder using Method 4


5.4.1.5 Full Adder Using XNOR, NOT, AND, OR (Method 5)
SUM = ((A XNOR B)' XNOR CIN)'
CARRY = A.B + (A XNOR B)'.CIN

Figure 5.22: 1-Bit Adder using Method 5

5.4.1.6 Full Adder Using XOR, AND (Method 6)

SUM = A ⊕ B ⊕ CIN
CARRY = ((A ⊕ B).CIN) ⊕ (A.B)

Figure 5.23 1-Bit Adder using Method 6

5.4.1.7 Full Adder Using XOR, NAND, NOR (Method 7)


SUM = A ⊕ B ⊕ CIN
CARRY = ((A.B)' . (B.CIN)' . (CIN.A)')'

Figure 5.24 1-Bit Adder using Method 7

5.4.1.8 Full Adder Using XOR, NAND, NOT, NOR (Method 8)


SUM = A ⊕ B ⊕ CIN
CARRY = ((A.B)' . (((A+B)')' . CIN)')'

Figure 5.25 1-Bit Adder using Method 8

5.4.1.9 Full Adder Using XOR, XNOR, MUX (Method 9)

Figure 5.26: 1-bit adder using method 9
5.4.1.10 Full adder using MUX (Method 10)

(a) (b)
Figure 5.27: MUX-based full adder (a) RTL View (b) Block Diagram
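
A possible behavioural rendering of the Method 10 construct is sketched below. The internal select signal x and the entity name fa_mux are assumptions made for illustration, since the thesis presents this method only through its RTL view and block diagram.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity fa_mux is
Port ( a, b, cin : in STD_LOGIC;
sum, cout : out STD_LOGIC);
end fa_mux;
architecture Behavioral of fa_mux is
signal x : STD_LOGIC;
begin
x <= a xor b;                              -- selection line for both multiplexers
sum <= cin when x = '0' else not cin;      -- 2:1 mux choosing between cin and cin'
cout <= a when x = '0' else cin;           -- 2:1 mux choosing between a and cin
end Behavioral;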
A full adder built with each of the above 10 methodologies has been incorporated in carry save
adders of various operand sizes (4, 8, 16 and 32 bits), and the performance of the
multi-operand carry save adder has been analyzed in terms of cost, slice utilization and
maximum combinational delay. From the analysis of the carry save adder using all 10 logic
constructs, the optimized expression of the full adder in terms of multiplexers has proved to
be the best. Hence, logic-optimized multiplexer-based carry save adders have been employed in
the proposed Vedic multiplier to improve its overall performance. The RTL schematic of the
multiplexer-based CSA is shown in figure 5.28.

Figure 5.28: RTL View of Multi-operand Carry Save Adder


5.5 Pipelining of Proposed Vedic Multiplier
In order to reduce the combinational delay of the proposed 32x32 bit Vedic multiplier, the
concept of pipelining has been used in this thesis. The 32x32 bit multiplication process is
divided into small chunks of 16x16 multiplications, which are further divided into 8x8 bit
multiplications, which in turn are broken into 4x4 multiplications, and these 4x4
multiplications are divided into 2x2 multiplications. Each of these multiplications is
performed by a dedicated segment, and the segments operate concurrently. The 2x2, 4x4, 8x8 and
16x16 multiply blocks are separated by parallel-in parallel-out (PIPO) registers, as shown in
figure 5.29; a behavioural sketch of such a register follows the step list below.

1. During the first clock cycle, the data enters the 2x2 multiply block, which processes
the data and stores the result in the first pipe.
2. On the second clock cycle, the 4x4 multiply block receives the input from the first
pipe, processes the data and stores the result in the second pipe.

Figure 5.29: Pipelining in 32x32 Bit Proposed Multiplier


3. During the third clock cycle, the 8x8 combinational block receives the input from the
second pipe, produces the partial product and stores the data in the third pipe.
4. During the fourth clock period, the data from the third pipe is fed as an input to the
16x16 logic block, which in turn produces the partial product, and the fourth pipe
holds this data.
5. On the fifth clock period, the data enters the 32x32 multiply block and the final
product is generated. So, we get the output at the end of the fifth cycle.
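
A behavioural sketch of the PIPO pipeline register is given below. The generic width k and the port names follow the pipo component declared in Appendix A, while the entity name pipo_sketch and the active-low reset behaviour are assumptions made for illustration.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity pipo_sketch is
generic (k : integer := 64);
Port ( clk, resetn : in STD_LOGIC;
din : in STD_LOGIC_VECTOR (k-1 downto 0);
qout : out STD_LOGIC_VECTOR (k-1 downto 0));
end pipo_sketch;
architecture Behavioral of pipo_sketch is
begin
process (clk, resetn)
begin
if resetn = '0' then
qout <= (others => '0');     -- clear the pipe on reset
elsif rising_edge(clk) then
qout <= din;                 -- capture the whole stage output each clock
end if;
end process;
end Behavioral;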

The advantage of pipelining becomes quite apparent when the minimum clock period of the
proposed 32x32 bit multiplier with pipelining is examined: it has reduced to 10.520 ns. Of
course, this increased performance comes with an area overhead due to the insertion of the
PIPO registers.

5.6 Synthesis of Proposed Vedic Multiplier

This section deals with the synthesis of the proposed Vedic multiplier modules. The FPGA
used is the Xilinx Spartan 3 family, device XC3S400, package PQ208.

Here, the RTL view, its description and the device utilization summary are given for each module.

5.6.1 2X2 Multiply Block

5.6.1.1 Device utilization summary and Delay

Figure 5.30: RTL View of 2x2 multiply block

5.6.2 4x4 Multiply Block

5.6.2.1 Device Utilization Summary and Delay

Delay: 12.954ns (Levels of Logic = 7)

Figure 5.31: RTL View 4x4 multiply block

5.6.3 8x8 Multiply Block

5.6.3.1 Device Utilization Summary and Delay

Delay: 18.139ns (Levels of Logic = 11)

Figure 5.32 RTL View of 8x8 multiply Block

5.6.4 16x16 Multiply Block

5.6.4.1 Device Utilization Summary and Delay

Delay: 25.235ns (Levels of Logic = 16)

Figure 5.33 RTL view of 16x16 multiply block

5.6.5 32x32 Proposed Vedic Multiplier

Figure 5.34 RTL View of 32x32 Vedic Multiplier

5.6.5.1 Device Utilization Summary and Delay

Logic Levels: 23

5.6.6 32x32 Bit Pipelined Multiplier

5.6.6.1 Device Utilization Summary and Delay

Figure 5.35: RTL View of Pipelined 32x32 bit Vedic Multiplier

5.6.7 Proposed Pipelined MAC with Han Carlson Accumulator

5.6.7.1 Device Utilization Summary and Delay

Figure 5.36: RTL View Accumulator

Figure 5.37 : RTL View of Proposed MAC with Han-Carlson Based Accumulator

Chapter-6

RESULTS AND COMPARISON

6.1 Results

Simulation of all the modules has been done using the ISIM (8.1d) software. Dynamic power
results of the 32x32 bit Vedic multipliers have been analyzed using the Xilinx Power Analyzer
(XPA) tool. Area has been analyzed in terms of slice utilization, and delay, taken as the sum
of route delay and logic delay, has been estimated from the synthesis report generated by
Xilinx 12.4. Hardware implementation has been done on the Spartan 3 XC3S400 device, PQ208
package.

6.1.1 Simulation of 32X32 bit Proposed Vedic Multiplier

Figure 6.1 Simulation of 32X32 bit Proposed Vedic Multiplier

Description

Ain1: 32-bit input multiplicand    Bin1: 32-bit input multiplier    Sout: 64-bit output

6.1.2 Simulation of 32X32 bit Proposed Vedic Multiplier with Pipelining

Figure 6.2(a)

Figure 6.2(b)

Figure 6.2: (a) and (b) Simulation of 32X32 bit Proposed Vedic Multiplier with Pipelining

Ain1: 32-bit input multiplicand Sout: 64-bit output

Bin1: 32-bit input multiplier Clk: Clock frequency of multiplier

6.1.3 Simulation of 32x32 Proposed Multiplier and Accumulator Unit (MAC)

Figure 6.3: Simulation Output of Proposed MAC

Ain1: 32-bit input multiplicand Sout: 64-bit output

Bin1: 32-bit input multiplier Clk: Clock frequency of multiplier

6.1.4 FPGA Implementation of Proposed Architecture with 8 Bit Operand Size

Hardware implementation of the proposed architecture has been done for the 8x8 module, for
different input vectors. A glowing test LED represents logic state '1', and an unlit LED
represents logic state '0'. Inputs have been applied through the input switches.

Input A =”00000000” Input B = “00000000” Output = “0000000000000000”

Figure 6.4: FPGA Implementation of Proposed 8x8 Architecture

Input A= “00000001”

Input B= “11111111”

Output = “0000000011111111”

Figure 6.5: FPGA Implementation of Proposed 8x8 Architecture

6.2 Comparison

6.2.1 Comparison of Proposed Vedic Multiplier with Different Existing Architectures

All the existing Urdhva multiplier architectures discussed in Chapter 4 have been extended up
to 32x32 bits and have been synthesized and simulated using Xilinx 12.4 and ISIM (8.1d)
respectively. Coding for all the modules has been done in VHDL. Qualitative evaluation of all
the architectures at operand sizes 4, 8, 16 and 32 has been carried out over a wide range of
parameters including area, delay, power, logic depth and memory utilization. The results
obtained are compared with those of the proposed technique.

TABLE 6.1
COMPARISON OF PROPOSED 4X4 MULTIPLIER WITH EXISTING 4X4 MULTIPLIER ARCHITECTURES

Multiplier        Total delay (ns)   Logic delay (ns)   Route delay (ns)   Logic levels   Slices   4-input LUTs   Memory (Kb)
Architecture 1    13.574             8.333              5.241              8              21       41             134772
Architecture 2    13.079             8.080              4.999              7              23       43             134772
Architecture 3    17.862             9.517              8.345              10             19       34             137844
Architecture 4    13.676             8.498              5.178              8              15       29             134772
Proposed          13.159             8.141              5.018              7              20       39             134772

TABLE 6.2
COMPARISON OF PROPOSED 8X8 MULTIPLIER WITH EXISTING 8X8 MULTIPLIER ARCHITECTURES

Multiplier        Total delay (ns)   Logic delay (ns)   Route delay (ns)   Logic levels   Slices   4-input LUTs   Memory (Kb)
Architecture 1    21.676             10.893             10.751             13             119      227            13680
Architecture 2    26.337             10.414             9.923              12             111      213            136820
Architecture 3    21.608             11.372             10.236             14             104      204            137844
Architecture 4    19.007             10.249             8.758              12             88       172            136820
Proposed          18.139             9.996              8.143              11             100      196            136820

TABLE 6.3
COMPARISON OF PROPOSED 16X16 MULTIPLIER WITH EXISTING 16X16 MULTIPLIER ARCHITECTURES

Multiplier        Total delay (ns)   Logic delay (ns)   Route delay (ns)   Logic levels   Slices   4-input LUTs   Memory (Kb)
Architecture 1    30.631             14.621             16.010             21             569      1087           145012
Architecture 2    31.765             14.935             16.830             22             547      1065           145012
Architecture 3    36.160             17.269             18.891             27             432      845            14736
Architecture 4    28.830             13.602             14.778             19             451      868            148084
Proposed          25.235             12.287             12.948             16             485      948            143988

TABLE 6.4
COMPARISON OF PROPOSED 32X32 MULTIPLIER WITH EXISTING 32X32 MULTIPLIER ARCHITECTURES

Multiplier        Total delay (ns)   Logic delay (ns)   Route delay (ns)   Logic levels   Slices   4-input LUTs   Memory (Kb)   Power (mW)
Architecture 1    42.268             18.217             24.051             29             2217     4319           175784        99
Architecture 2    46.762             20.954             25.880             35             2148     4168           179428        93
Architecture 3    60.436             26.763             33.673             47             2052     3985           187364        86
Architecture 4    38.855             17.242             21.613             26             2031     3940           195300        83
Proposed          34.652             15.579             19.073             23             1882     366            174620        80

The following results are obtained from tables 6.1, 6.2, 6.3 and 6.4:

1) The delay of the proposed architecture reduces at all operand sizes, as this architecture
uses a single carry save adder for the summation of partial products, as against the
conventional architecture which uses 2 ripple carry adders and the modified architectures
which use at least 2 carry look-ahead adders for the addition of 3 operands, so that the
carry has to propagate twice for the whole operation. Thus, the time complexity of the
architectures with 2 RCAs is 2n (n = operand size) and that with two CLAs is 2logn, whereas
the proposed architecture uses a single carry save adder which reduces the time complexity
to (1 + logn). Consequently, the logic depth reduces, which reduces the logic delay of the
proposed design.
2) The second reason for the reduced delay of this architecture is that it uses a binary to
excess-1 code converter circuit with a multiplexer, which produces two parallel results
without waiting for the carry generation from the previous unit. Thus, when the carry is
generated by the CSA, it simply acts as a selection line to select the correct output. This
is in contrast to the half adder assembly in architecture 4, the ripple carry adder in
architecture 1 and the 2-operand carry save adder in architecture 3, which all require the
carry from the previous unit to compute the final result. The delay comparison can be
understood in a better way through the graph shown in figure 6.6.

Figure 6.6 : Graph showing the delay comparison of various architectures

6.2.2 Comparison of Carry Save Adder Using Various Full Adder Topologies

Multi-operand Carry save adder consists of a ladder of full adders. These full adders have
been implemented using 6 methodologies discussed in Chapter-5. Performance of CSA by
incorporating primitive adders with different full adder circuits has been analyzed at different
operand Size 4,8,16, 32 in terms of Slice utilization (area) and combinational path (delay).

TABLE 6.5
DELAY COMPARISON OF CSA FOR DIFFERENT OPERAND SIZES USING DIFFERENT FULL-ADDER METHODOLOGIES

             DELAY (ns)
             4-BIT     8-BIT     16-BIT    32-BIT
Method 1     9.083     12.324    16.537    19.0381
Method 2     10.161    11.885    18.053    19.360
Method 3     9.88      12.974    17.677    20.008
Method 4     10.17     12.197    17.442    18.398
Method 5     10.462    11.711    14.746    15.623
Method 6     10.462    11.275    14.546    16.167
Method 7     10.255    11.745    14.546    16.167
Method 8     10.462    11.745    14.995    16.167
Method 9     9.787     11.591    14.995    17.456
Method 10    8.923     9.59      13.7      15.553

The synthesis results obtained by implementing the full adder in all 10 styles show that:

1) Implementation of the carry save adder using the MUX approach gives the least worst-case
delay and occupies the least number of slices when compared with CSAs implemented using only
NAND, only NOR, or the AND, OR and XOR style. Thus, a CSA using multiplexers is a better
approach to reduce the area as compared to the regular style. This CSA, when used in the
proposed multiplier, significantly reduced the overall area occupied by the proposed design.
2) The multiplexer-based approach has reduced the space complexity of the Urdhvatiryakbhyam
technique, as is evident from the memory utilization (in Kb) comparison in table 6.4, which
makes it quite clear that logic-optimized primitive adders prove to be the best choice so far.
3) The MUX-based approach has the minimum cost, 35, which is the lowest in comparison to the
other implementations.

TABLE 6.6
AREA COMPARISON OF CSA FOR DIFFERENT OPERAND SIZES USING DIFFERENT FULL-ADDER METHODOLOGIES

             4-bit CSA           8-bit CSA           16-bit CSA          32-bit CSA
             Slices   LUTs       Slices   LUTs       Slices   LUTs       Slices   LUTs
Method 1     7        13         20       38         45       87         104      202
Method 2     11       22         23       45         58       114        141      277
Method 3     8        16         22       43         62       122        135      264
Method 4     7        13         19       38         45       87         104      202
Method 5     7        13         20       38         45       88         104      203
Method 6     7        13         20       38         45       87         104      202
Method 7     7        14         19       38         45       87         104      202
Method 8     7        14         19       38         45       87         104      202
Method 9     7        14         20       39         45       87         103      201
Method 10    8        15         19       37         47       92         103      201

6.2.3 Comparison of Proposed Vedic Multiplier with and Without Pipelining

TABLE 6.7
COMPARISON OF PROPOSED 32X32 BIT MULTIPLIER WITH AND WITHOUT PIPELINING

                                               Delay                                       Area (slice utilization)
Proposed Vedic multiplier without pipelining   34.658 ns                                   1882/3584
Proposed Vedic multiplier with pipelining      10.520 ns (max. frequency: 95.057 MHz)      1957/3584

From table 6.7 it can be inferred that the insertion of pipeline registers has proved to be an
effective technique to increase the throughput of the system. The shrink in the delay is
evident from the above table, and the increase in the frequency of operation confirms it
further. Although the pipeline registers increase the area of the multiplier, the delay is
reduced by about 70%, which gives faster operation and higher throughput compared to the
conventional Vedic multipliers.

6.2.4 Comparison of Various Adders in Terms of Area and Delay

Table 6.8 shows the comparison of various 64-bit adders in terms of area and delay. The
synthesis results show that the Han-Carlson adder proves to be the fastest, as it increases the
speed by more than 70% in comparison to the conventional CLA and by up to 45% in comparison to
the hybrid carry select adders which are currently used for wider bit lengths, although this
increase in speed comes with an increase in area. Hence, the proposed MAC uses the Han-Carlson
adder in the accumulator for high throughput and increased speed. Figure 6.7 shows the delay
comparison of these adders.

TABLE 6.8
AREA AND DELAY COMPARISON OF VARIOUS 64-BIT ADDERS

Adder                       Total delay (ns)   Logic delay (ns)   Route delay (ns)   Logic levels   Slices   4-input LUTs
Ripple carry adder          94.280             35.801             58.479             65             73       127
Carry look ahead adder      87.398             35.801             51.597             65             73       127
Hybrid carry select adder   44.553             19.036             25.517             30             110      193
Han Carlson adder           26.053             12.809             13.244             17             170      302

6.2.5 Area and Delay Results of Proposed MAC Using Han Carlson Adder

TABLE 6.9
AREA AND DELAY RESULTS OF THE PROPOSED MAC

Device                      Min. period (ns)   Max. frequency (MHz)   Min. input arrival before clock (ns)   Max. output after clock (ns)   Logic levels   Slices      4-input LUTs
SPARTAN-3 XC3S400-PQ208     10.557             94.771                 4.795                                  6.530                          7              1953/3584   3520/7168
SPARTAN-3E XC3S500E-FG320   9.181              108.917                4.469                                  4.519                          7              1961/4656   3514/9312

Chapter-7
CONCLUSION AND FUTURE SCOPE
This thesis introduces a novel technique for an Urdhvatiryakbhyam Vedic multiplier and
accumulator unit, which is the key element of most digital signal processing operations. The
speed of operation of several parallel architectures, such as ALUs and DSP chips, depends on
the datapath, which in turn depends on individual elements like multipliers.
Synthesis and simulation results of the proposed multiplier have been put forward in the
thesis. Along with the proposed multiplier, the existing Urdhva multiplier architectures using
the ripple carry adder, the carry look-ahead adder and the modified Vedic multiplier have also
been implemented, and the difference in their overall performance in terms of delay, area and
power has been observed. Hence, our motivation for introducing the new multiplier is
fulfilled.
The design and implementation of the work presented in the thesis have taken into account
several factors like slice utilization, delay, power, and time and space complexity. One of
the major enhancement techniques used in this thesis to reduce the delay and the logic depth
is the use of a single carry save adder to compress all the partial products and obtain the
final product. BEC-1 has been incorporated in the design for parallel generation of the final
product bits. This combination of CSA and BEC-1 has reduced the delay by up to 42% when
compared with the latest existing architecture using 2 carry save adders.
The use of an optimized expression for the primitive adder and implementation of the carry
save adder with the MUX-based approach has decreased the area by 15% as compared to the
regular-style implementation.
The thesis applies pipelining to increase the throughput, as pipelining increases the
frequency of operation. Hence, we observe a shrink in the delay, and the minimum period
reduces to 10.52 ns. Further, a MAC has been designed utilizing the proposed pipelined Vedic
multiplier and the Han-Carlson adder. The use of the Han-Carlson adder in the design has in
turn improved the performance of the MAC to a large extent and has helped to achieve a
maximum frequency of 108.917 MHz.
In future, the speed can be further optimized by using PPAs which are more area efficient. In
order to reduce the transistor count of this unit, the gate diffusion input technique can be
used at circuit level. Since a number of partial products are generated by the 2x2 modules in
this unit, compressors can be used at the initial level to reduce the number of partial
products.

REFERENCES
1. Purushottam D. Chidgupkar and Mangesh T. Karad, “The Implementation of Vedic
Algorithms in Digital Signal Processing”, Global J. of Engng. Educ., Vol.8, No.2 2004
UICEE Published in Australia.

2. Himanshu Thapliyal and Hamid R. Arabnia, “A Time-Area- Power Efficient


Multiplier and Square Architecture Based On Ancient Indian Vedic Mathematics”,
Department of Computer Science, The University of Georgia, 415 Graduate Studies
Research Center Athens, Georgia 30602-7404, U.S.A.

3. E. Abu-Shama, M. B. Maaz, M. A. Bayoumi, “A Fast and Low Power Multiplier


Architecture”, The Center for Advanced Computer Studies, The University of
Southwestern Louisiana Lafayette, LA 70504.

4. M.E.Paramasivam, Dr.R.S.Sabeenian,” An Efficient Bit Reduction Binary Multiplication


Algorithm using Vedic Methods”, 2010 IEEE 2nd International Advance Computing
Conference

5. Mr. Virendra Babanrao Magar, “Area And Speed Wise Superior Multiply And Accumulate
Unit Based On Vedic Multiplier”, Journal of Engineering Research and Application, Vol. 3,
Issue 6, Nov-Dec 2013, pp.994-999

6. D. Jaina, Sethi, "Vedic Mathematics based multiply and accumulate unit", 2011 IEEE
Conference on Computational Intelligence and Communication Systems.

7. Jagadguru Swami Sri Bharati Krisna Tirthaji Maharaja, "Vedic Mathematics: Sixteen Simple
Mathematical Formulae from the Vedas".

8. Harpreet Singh Dhillon, Abhijit Mitra, “A Reduced-Bit Multiplication Algorithm for


Digital Arithmetic‟s” International Journal of Computational and Mathematical Sciences,
spring 2008, pp.64-69.

9. S. S. Kerur, Prakash Narchi, Jayashree C. N., Harish M. Kittur and Girish V. A.,
"Implementation of Vedic Multiplier for Digital Signal Processing", International Journal of
Computer Applications, 2011, vol. 16, pp. 1-5.

10. M. Pradhan and R. Panda, "Design and Implementation of Vedic Multiplier", A.M.S.E.
Journal, Computer Science and Statistics, France, vol. 15, July 2010, pp. 1-19.

11. Verma, P.: “Design of 4X4 bit Vedic Multiplier using EDA Tool,” International Journal
of Computer Application (IJCA), Vol. 8, June, 2012.

12. Abhyarthana Bisoy, Mitu Barat, “Comparison of a 32-Bit Vedic Multiplier With A
Conventional Binary Multiplier”, 2014 IEEE International Conference on Advanced
Communication Control and Computing Technologies (lCACCCT)

13. S.Vijayakumar, Dr.J.Sundararajan” Low power multiplier using VEDIC carry lookahead
Adder”, Research gate conference march 2015.

14. Mi Lu, "Arithmetic and Logic in Computer Systems", Wiley Publications.

15. Aneesh R, Sarin K Mohan, “Design and Analysis of High Speed, Area Optimized 32x32-Bit
Multiply Accumulate Unit Based on Vedic Mathematics”, International Journal of Engineering
Research & Technology (IJERT) Vol. 3 Issue 4, April – 2014

16.Gijin V George, Anoop Thomas, “High Performance Vedic Multiplier Using Han-Carlson Adder”,
International Journal of Engineering Research & Technology (IJERT) Vol. 3 Issue 3, March - 2014

17. T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", Proceedings of the 8th IEEE
Symposium on Computer Arithmetic, 1987.

18.R.K. Bathija,RS Meena “Low Power High Speed 16x16 bit Multiplier usingVedic
Mathematics”,International Journal of Computer Applications (0975 – 8887) Volume 59– No.6,
December 2012

19. Mohammed Hasmat Ali, Anil Kumar Sahani, "Study, Implementation and Comparison of Different
Multipliers based on Array, KCM and Vedic Mathematics Using EDA Tools", International Journal
of Scientific and Research Publications, Volume 3, Issue 6, June 2013.

20. Harish Kumar, Hemanth Kumar A R, “Design and Implementation of Vedic Multiplier using
Compressors”, International Journal of Engineering Research & Technology (IJERT), Vol. 4 Issue 06,
June-2015

21. Yogita Bansal Charu Madhu Pardeep Kaur. “HIGH SPEED VEDIC MULTIPLIER DESIGNS- A
REVIEW”, Proceedings of 2014 RAECS UIET Panjab University Chandigarh, 06 – 08 March, 2014

22. Harish Babu N, Satish Reddy N, "Pipelined Architecture for Vedic Multiplier", Advances in
Electrical Engineering, 2014 IEEE Conference.

23. R.Uma and P.Dhavachelvan,” Logic Optimization Using Technology Independent Mux Based
Adders In Fpga” International Journal of VLSI design & Communication Systems (VLSICS) Vol.3,
No.4, August 2012

24 .Abdulkarim Al-Sheraidah, Yingtao Jiang,” A Novel Low Power Multiplexer-Based Full


Adder”, ECCTD‟01 - European Conference on Circuit Theory and Design, August 28-31,
2001, Espoo, Finland

25. Taewhan ,Kim , Jao, W. , Tjiang, S. : “Circuit Optimization Using Carry–Save–Adder


Cells,” IEEE Transactions on “Computer-Aided Design of Integrated Circuits and Systems”,
Vol. 17, No. 10,1998, pp. 974-984

26. Prakash Pawar, Varun. R,” Implementation Of High Speed Pipelined Vedic Multiplier”,
International Journal of Engineering Research & Technology (IJERT), Vol. 2 Issue 5, May -
2013

27. Harpreet Singh Dhillon and Abhijit Mitra,” A Reduced-Bit Multiplication Algorithm for
Digital Arithmetic”, World Academy of Science, Engineering and TechnologyVol:19 2008-
07-25

28. Abhijeet Kumar, Dilip Kumar, Siddhi, “Hardware Implementation of 16*16 bit Multiplier
and Square using Vedic Mathematics”, Design Engineer, CDAC, Mohali.

29. Ch. Harish Kumar,” implementation and Analysis of Power, Area and Delay of Array,
Urdhva, Nikhilam Vedic Multipliers”, International Journal of Scientific and Research
Publications Volume 3, Issue 1, January 2013

30. Kala Priya, K. S. N. Raju, "Carry select adder using BEC and RCA", International Journal
of Advanced Research in Computer and Communication Engineering, Vol. 3, Issue 10, October
2014.

31. Jan M. Rabaey, "Digital Integrated Circuits", Second Edition.

32. R.UMA,Vidya Vijayan,M. Mohanapriya, Sharon Paul,” Area, Delay and Power
Comparison of Adder Topologies, “International Journal of VLSI design & Communication
Systems(VLSICS) Vol.3, No.1, February 2012.

APPENDIX A

VHDL CODING
32-BIT PROPOSED VEDIC MAC

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity vedmac_pipe_32 is
Port ( clk , resetn: in std_logic;
ain1 : in STD_LOGIC_VECTOR (31 downto 0);
bin1 : in STD_LOGIC_VECTOR (31 downto 0);
sout : inout STD_LOGIC_VECTOR (63 downto 0));
end vedmac_pipe_32;
architecture Behavioral of vedmac_pipe_32 is
component caresave_n is
generic( n: integer := 32);
Port ( a2,b2,c2 : in STD_LOGIC_VECTOR (n-1 downto 0);
s2 : out STD_LOGIC_VECTOR (n-1 downto 0);
cot : out STD_LOGIC);
end component;
component accumulator is
Port ( clk, resetn : in STD_LOGIC;
x : in STD_LOGIC_vector (63 downto 0);
result : inout std_logic_vector (63 downto 0) );
end component;
component BEC_16 is
generic( p : integer := 16);
Port ( B : in STD_LOGIC_VECTOR (p-1 downto 0);
X : out STD_LOGIC_VECTOR (p-1 downto 0);
ccarry: in std_logic);
end component;

component vedmine16_clk is
Port ( ainput, binput : in STD_LOGIC_VECTOR (15 downto 0);
soutp : out STD_LOGIC_VECTOR (31 downto 0);
clk: in std_logic);
end component;
component hancarlson_64 is
Port ( ainp,binp : in STD_LOGIC_VECTOR (63 downto 0);
soutp : out STD_LOGIC_VECTOR (63 downto 0);
carry : out STD_LOGIC);
end component;
component pipo is
generic (k : integer:= 64);
Port ( clk , resetn: in STD_LOGIC;
din : in STD_LOGIC_VECTOR (k-1 downto 0);
qout : inout STD_LOGIC_VECTOR (k-1 downto 0));
end component;
component dff is
Port ( d : in STD_LOGIC;
clk : in STD_LOGIC;
q : out STD_LOGIC);
end component;
signal p : std_logic;
signal m : std_logic_vector(128 downto 0);
signal sout1, sum : std_logic_vector(63 downto 0);
signal n: std_logic_vector (15 downto 0);
signal l: std_logic_vector (63 downto 16);
signal j: std_logic_vector (63 downto 16);
begin
v1: vedmine16_clk port map ( ain1( 15 downto 0), bin1(15 downto 0),m(31 downto 0), clk);
v2: vedmine16_clk port map( ain1( 31 downto 16), bin1(31 downto 16),m(63 downto 32),
clk);
v3: vedmine16_clk port map( ain1( 31 downto 16), bin1(15 downto 0),m(95 downto 64),
clk);
v4: vedmine16_clk port map( ain1( 15 downto 0), bin1(31 downto 16),m(127 downto 96),
clk);
c1 : caresave_n port map( m(47 downto 16), m(127 downto 96), m(95 downto 64), l(47
downto 16), m(128));
bec3 : bec_16 port map ( m(63 downto 48), l(63 downto 48), m(128));

f: for i in 0 to 15 generate
d: dff port map (m(i), clk, sout1(i));
end generate;
f3: for i in 16 to 63 generate
d4: dff port map (l(i), clk, sout1(i));
end generate;
add : hancarlson_64 port map( sout1( 63 downto 0) , sout(63 downto 0) , sum (63 downto 0),
p);
resg : pipo port map (clk, resetn, sum (63 downto 0 ), sout (63 downto 0 ));
end Behavioral;
16X16 Vedic Multiplier
entity vedmine16_clk is
Port ( ainput : in STD_LOGIC_VECTOR (15 downto 0);
binput : in STD_LOGIC_VECTOR (15 downto 0);
soutp : out STD_LOGIC_VECTOR (31 downto 0);
clk: in std_logic);
end vedmine16_clk;

architecture Behavioral of vedmine16_clk is


component vedmine8_clk is
Port ( ain : in STD_LOGIC_vector(7 downto 0);
bin : in STD_LOGIC_vector (7 downto 0);
sout : out STD_LOGIC_VECTOR (15 downto 0);
clk: in std_logic);
end component;
component caresave_16 is
generic( n: integer := 16);
Port ( a2 : in STD_LOGIC_VECTOR (n-1 downto 0);
b2 : in STD_LOGIC_VECTOR ( n-1 downto 0);

c2 : in STD_LOGIC_VECTOR (n-1 downto 0);
s2 : out STD_LOGIC_VECTOR (n-1 downto 0);
cot : out STD_LOGIC);
end component;
component dff is
Port ( d : in STD_LOGIC;
clk : in STD_LOGIC;
q : out STD_LOGIC);
end component;
component BEC is
generic( p : integer := 8);
Port ( B : in STD_LOGIC_VECTOR (p-1 downto 0);
X : out STD_LOGIC_VECTOR (p-1 downto 0);
ccarry: in std_logic);
end component;
signal m : std_logic_vector(64 downto 0);
signal n: std_logic_vector (7 downto 0);
signal l: std_logic_vector (31 downto 8);
begin
v1: vedmine8_clk port map ( ainput( 7 downto 0), binput(7 downto 0),m(15 downto 0),clk);
v2: vedmine8_clk port map( ainput( 7 downto 0), binput(15 downto 8),m(47 downto 32),
clk);
v3: vedmine8_clk port map( ainput( 15 downto 8), binput(7 downto 0),m(63 downto 48),
clk);
v4: vedmine8_clk port map( ainput( 15 downto 8), binput(15 downto 8),m(31 downto 16),
clk);
c1 : caresave_16 port map( m(23 downto 8), m(47 downto 32), m(63 downto 48), l (23
downto 8), m(64));
bec1 : bec port map ( m(31 downto 24), n (7 downto 0), m(64));
l (31 downto 24) <= n(7 downto 0) when m(64) = '1' else m(31 downto 24);
f: for i in 0 to 7 generate

d1: dff port map ( m(i) , clk, soutp(i));
end generate;
f2: for i in 8 to 31 generate
d2: dff port map ( l(i) , clk, soutp(i));
end generate;
end Behavioral;
8X8 Vedic Multiplier
entity vedmine8_clk is
Port ( ain : in STD_LOGIC_vector(7 downto 0);
bin : in STD_LOGIC_vector (7 downto 0);
sout : out STD_LOGIC_VECTOR (15 downto 0);
clk: in std_logic);
end vedmine8_clk;
architecture Behavioral of vedmine8_clk is
component caresave_n8 is
generic( n: integer := 8);
Port ( a2,b2,c2 : in STD_LOGIC_VECTOR (n-1 downto 0);
s2 : out STD_LOGIC_VECTOR (n-1 downto 0);
cot : out STD_LOGIC);
end component;
component vedmine4_clk is
Port ( a,b : in STD_LOGIC_VECTOR (3 downto 0);
p : out STD_LOGIC_VECTOR (7 downto 0);
clk: in std_logic);
end component;
component dff is
Port ( d : in STD_LOGIC;
clk : in STD_LOGIC;
q : out STD_LOGIC);

end component;

signal m: std_logic_vector(36 downto 0);


signal l: std_logic_vector(15 downto 4);
begin
v1: vedmine4_clk port map ( ain(3 downto 0), bin( 3 downto 0), m(7 downto 0),clk);
v2: vedmine4_clk port map ( ain(7 downto 4), bin( 7 downto 4), m(15 downto 8), clk);
v3: vedmine4_clk port map ( ain(7 downto 4), bin( 3 downto 0), m(23 downto 16),clk);
v4: vedmine4_clk port map ( ain(3 downto 0), bin( 7 downto 4), m(31 downto 24), clk);
l(12) <= not(m(12)) when m(32) = '1' else m(12);
l(13) <= (m(13) xor m(12)) when m(32)= '1' else m(13);
m(33) <= m(12 ) and m(13);m(34) <= m(33) and m(14);
l(14) <= (m(14) xor m(33)) when m(32) = '1' else m(14);
l(15) <= (m(15) xor m(34)) when m(32) = '1' else m(15);
care1 : caresave_n8 port map(m(11 downto 4), m(23 downto 16), m(31 downto 24), l(11
downto 4), m(32));
f: for i in 0 to 3 generate
h : dff port map ( m(i), clk, sout(i));
end generate;
f1: for i in 4 to 15 generate
h1 : dff port map (l(i), clk, sout(i));
end generate;
end Behavioral;
4X4 Vedic Multiplier
entity vedmine4_clk is
Port ( a : in STD_LOGIC_VECTOR (3 downto 0);
b : in STD_LOGIC_VECTOR (3 downto 0);
p : out STD_LOGIC_VECTOR (7 downto 0);
clk : in std_logic);

end vedmine4_clk;
architecture Behavioral of vedmine4_clk is
signal n: std_logic_vector (17 downto 0);
signal l: std_logic_vector (7 downto 2);
signal cot :std_logic;
component ved is
Port ( a1 : in std_logic_vector (1 downto 0);
b1 : in std_logic_vector (1 downto 0);
s1 : out std_logic_vector (3 downto 0);
clk: in std_logic);
end component;
component ripple_8 is
generic (n : integer := 8);
Port ( a2 ,b2: in STD_LOGIC_VECTOR (n-1 downto 0);
s2 : out STD_LOGIC_VECTOR (n-1 downto 0);
carry1 : out STD_LOGIC);
end component;
component caresave_4 is
Port ( a2 ,b2,c2: in STD_LOGIC_VECTOR (3 downto 0);
s2 : out STD_LOGIC_VECTOR (3 downto 0);
cot : out STD_LOGIC);
end component;
component half is
Port ( x : in STD_LOGIC;
y : in STD_LOGIC;
z : out STD_LOGIC;
co : out STD_LOGIC);
end component;
component dff is

Port ( d : in STD_LOGIC;
clk : in STD_LOGIC;
q : out STD_LOGIC);
end component;
begin
v1: ved port map (a(1 downto 0), b(1 downto 0), n(3 downto 0),clk);
v2: ved port map (a(3 downto 2), b(1 downto 0), n(15 downto 12),clk);
v3: ved port map (b(3 downto 2), a(1 downto 0), n(11 downto 8),clk);
v4: ved port map (a(3 downto 2), b(3 downto 2), n(7 downto 4),clk);
care1: caresave_4 port map(n(5 downto 2), n(11 downto 8), n(15 downto 12), l(5 downto 2),
n(16));
f: for i in 0 to 1 generate
d : dff port map( n(i) , clk , p(i));
end generate;
f1: for i in 2 to 7 generate
d1 : dff port map(l(i) , clk , p(i));
end generate;
h1: half port map( n(6), n(16), l(6),n(17));
l(7) <= (n(17) xor n(7) )when n(16) = '1'
else n(7);
end behavioral;
2X2 Vedic Multiplier
entity ved is
Port ( a1 : in STD_LOGIC_VECTOR (1 downto 0);
b1 : in STD_LOGIC_VECTOR (1 downto 0);
s1 : out STD_LOGIC_VECTOR (3 downto 0);
clk : in STD_LOGIC);
end ved;

architecture Behavioral of ved is

component dff is
Port ( d : in STD_LOGIC;
clk : in STD_LOGIC;
q : out STD_LOGIC);
end component;

begin
process ( clk, a1, b1)
variable m: std_logic_vector (8 downto 1);
begin
m(5) := a1(0) and b1(0); m(1) := a1(1) and b1(0); m(2) := a1(0) and b1(1); m(3) := a1(1) and
b1(1); m(6) := m(1) xor m(2); m(4) := m(1) and m(2); m(7) := m(3) xor m(4); m(8) := m(3)
and m(4);
f: for i in 0 to 3 loop
if (clk' event and clk = '1') then
s1 (i) <= m(i + 5);
end if;
end loop;
end process;
end Behavioral;
Binary to Excess-1 Code Converter
entity BEC_16 is
generic( p : integer := 16);
Port ( B : in STD_LOGIC_VECTOR (p-1 downto 0);
X : out STD_LOGIC_VECTOR (p-1 downto 0);
ccarry: in std_logic);
end BEC_16;
architecture Behavioral of BEC_16 is

signal m: std_logic_vector(( p-1) downto 1);
begin
X(0) <= (NOT B(0)) when ccarry = '1' else b(0);
x(1) <= b(1) xor b(0) when ccarry = '1' else b(1);
m(1) <= b(0) and b(1);
g: for i in 2 to p-1 generate
m(i)<= m(i-1) and B(i);
X(i) <=( m(i-1) xor B(i)) when ccarry = '1' else b(i);
end generate ;
end behavioral;
Multioperand Carrysave Adder
entity caresave_n is
generic( n: integer := 32);
Port ( a2,b2,c2 : in STD_LOGIC_VECTOR (n-1 downto 0);
s2 : out STD_LOGIC_VECTOR (n-1 downto 0);
cot : out STD_LOGIC);
end caresave_n;
architecture Behavioral of caresave_n is
component half is
Port ( x,y : in STD_LOGIC;
z ,co: out STD_LOGIC);
end component;
component full is
Port ( x1,y1,cin : in STD_LOGIC;
z1 ,co1: out STD_LOGIC);
end component;
signal s, c: std_logic_vector(n-1 downto 0);
signal l: std_logic_vector( n-2 downto 0);
begin

g: for i in 0 to n-1 generate
f: full port map (a2(i), b2(i), c2(i), s(i), c(i));
end generate;
s2(0)<= s(0);
h1 : half port map ( c(0), s(1), s2(1), l(0));
fu: for i in 2 to n-1 generate
f: full port map (s(i), c(i-1), l(i-2), s2(i), l(i-1));
end generate;
cot <= c(n-1) or l(n-2);
end Behavioral;
64-BIT HANCARLSON ADDER
entity hancarlson_64 is
Port ( ainp,binp : in STD_LOGIC_VECTOR (63 downto 0);
soutp : out STD_LOGIC_VECTOR (63 downto 0);
carry : out STD_LOGIC);
end hancarlson_64;
architecture Behavioral of hancarlson_64 is
begin
process (ainp, binp)
variable p, p0, p1, p2, p3,g, g0, g1, g2, g3, cint, g4 , p4, p5, g5: std_logic_vector(63 downto
0);
begin
q1: for i in 0 to 63 loop
p0(i) := ainp(i) xor binp(i);
g0(i) := ainp(i) and binp(i);
end loop;
q2: for i in 0 to 31 loop
p(2 *i + 1) := (p0 (2 *i + 1) and p0( 2*i));
g(2 *i + 1) := (g0(2*i + 1) or (p0(2*i + 1) and g0(2* i)));

end loop;
q3: for j in 1 to 31 loop
p1(2*j + 1) := ((p(2*j + 1) and p( 2*j - 1)));
g1(2*j + 1) := (g(2*j + 1) or (p(2*j + 1) and g(2*j - 1)));
end loop;
p2(5) := p1(5) and p(1);
g2(5) := (g1(5) or (p1(5) and g(1)));
q4: for i in 3 to 31 loop
p2(2 *i + 1) := (p1 (2 *i + 1) and p1( 2*i - 3));
g2(2 *i + 1) := g1(2 *i + 1) or (p1(2 *i + 1) and g1(2* i - 3));
end loop;
p3(9) := p2(9) and p(1);g3(9) := g2(9) or( p2(9) and g(1));
p3(11) := p2(11) and p1(3);g3(11) := g2(11) or( p2(11) and g1(3));
q5: for i in 6 to 31 loop
p3(2 *i + 1) := (p2 (2 *i + 1) and p2( 2*i - 7));
g3(2 *i + 1) := g2(2 *i + 1) or (p2(2 *i + 1) and g2(2* i - 7));
end loop;
p4(17) := p3(17)and p(1);g4(17) := g3(17) or( p3(17) and g(1));
p4(19) := p3(19)and p1(3);g4(19) := g3(19) or( p3(19) and g1(3));
p4(21) := p3(21)and p2(5);g4(21) := g3(21) or( p3(21) and g2(5));
p4(23) := p3(23)and p2(7);g4(23) := g3(23) or( p3(23) and g2(7));
q6: for i in 12 to 31 loop
p4(2 *i + 1) := (p3 (2 *i + 1) and p3( 2*i - 15));
g4(2 *i + 1) := g3(2 *i + 1) or (p3(2 *i + 1) and g3(2* i - 15));
end loop;
p5(33) := p4(33)and p(1);g5(33) := g4(33) or( p4(33) and g(1));
p5(35) := p4(35)and p1(3);g5(35) := g4(35) or( p4(35) and g1(3));
p5(37) := p4(37)and p2(5);g5(37) := g4(37) or( p4(37) and g2(5));
p5(39) := p4(39)and p2(7);g5(39) := g4(39) or( p4(39) and g2(7));

p5(41) := p4(41)and p3(9);g5(41) := g4(41) or( p4(41) and g3(9));
p5(43) := p4(43)and p3(11);g5(43) := g4(43) or( p4(43) and g3(11));
p5(45) := p4(45)and p3(13);g5(45) := g4(45) or( p4(45) and g3(13));
p5(47) := p4(47)and p3(15);g5(47) := g4(47) or( p4(47) and g3(15));
q7: for i in 24 to 31 loop
p5(2 *i + 1) := (p4 (2 *i + 1) and p4( 2*i - 31));
g5(2 *i + 1) := g4(2 *i + 1) or (p4(2 *i + 1) and g4(2* i - 31));
end loop;
g5(31) := g4(31); g5(29) := g4(29); g5(27) := g4(27); g5(25) := g4(25); g5(23) := g4(23);
g5(21) := g4(21); g5(19) := g4(19); g5(17) := g4(17); g5(15) := g3(15); g5(13) := g3(13);
g5(11) := g3(11); g5(9) := g3(9); g5(7) := g2(7); g5(5) := g2(5); g5(3) := g1(3); g5(1) := g(1);
cint(0) := g0(0);
q8: for i in 0 to 30 loop
cint(2*i + 1) := g5(2*i + 1);
end loop;
q9: for i in 1 to 31 loop
cint( 2*i) := (g0(2*i) or (p0(2*i) and g5(2*i - 1)));
end loop;
soutp(0) <= p0(0) xor '0';
q10: for i in 1 to 63 loop
soutp(i) <= p0(i) xor cint(i-1);
end loop;
carry <= g5(63);
end process;
end Behavioral;

APPENDIX B

LIST OF PUBLICATION
1) Prachi Devpura, Anurag Paliwal, “High throughput Vedic Multiplier using Binary to
Excess-1 Code Converter”, International Journal of Advance Research in Electronics
and Communication Engineering ,Volume 4 issue 6,june 2015 ; PP 1771-1774.

2) Prachi Devpura , Anurag Paliwal, “High Throughput BEC-1 Based 32*32 bit Vedic
Multiplier”, International Journal of Research in Electronics and Computer Engineering
(IJRECE), Vol. 3 Issue 3 sept.2015; PP 24-27

