
Why do interconnect delays pose problems with shrinking technologies?

Describe in brief the design stages in the development of a chip?

The overall design process is represented by two stages. 1. The first stage is referred to as logic design, where the desired functional operation of an integrated circuit is initially defined and tested. 2. The second stage is referred to as physical design, where the logic design created during the first stage is processed to select actual circuit components that implement the functions defined in the logic design, and to lay out the components on an integrated circuit and route the interconnects between them. The interconnections between circuit elements are often referred to as nets, and the nets are generally routed after the circuit components have been placed at specific locations on the integrated circuit.

What is the significance of adding buffers in a long interconnect wire?

When you insert buffers in a long interconnect, each driver sees only the capacitance of its own wire segment rather than that of the whole line, so the Elmore delay (and the actual delay) decreases, improving timing. This is the basic result behind the well-known van Ginneken algorithm for buffer placement.
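A rough illustration with a simplified model (the lumped-RC treatment and the ideal-buffer assumption are mine): treat the wire as a distributed RC line with total resistance R and total capacitance C, so its Elmore delay is about R*C/2. Splitting it into two equal segments with an ideal buffer in the middle gives two segments of delay (R/2)*(C/2)/2 each, i.e. about R*C/4 in total. Unbuffered delay therefore grows quadratically with wire length, while buffered delay grows only linearly (plus the buffer delays), which is why optimally placed buffers improve timing on long interconnect.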

What can be interpreted if a flip flop has negative hold time?

In a digital circuit, the hold time is the minimum time that an input signal must remain stable after the active edge of the clock in order to assure that the input is correctly recognized. If a circuit has a negative hold time, this means that the input can change before the clock edge and nevertheless the old level will be correctly recognized. This can be produced by internal delay of the clock signal. For example, if a D flip-flop has a hold time of -1 ns, the level present at the D input up to 1 ns before the clock edge is the level captured, provided it was stable up to that moment.

What can be interpreted if a flip flop has negative setup time?

Setup time is the minimum time that an input must be stable at its logical level before the active edge of the clock in order to assure that the input is correctly recognized. If a circuit has a negative setup time, this means that the input can change after the clock edge and nevertheless the new level will be correctly recognized. This can be produced by internal delay of the clock signal. For example, if a D flip-flop has a setup time of -1 ns, the level present at the D input from 1 ns after the clock edge is the level captured, provided it remains stable from that moment.

What are High-Vt and Low-Vt cells?

High-Vt cells are MOS devices with less leakage, due to the high Vt, but they have higher delay than low-Vt cells, whereas low-Vt cells have less delay but high leakage. The threshold voltage dictates the transistor switching speed: the smaller the threshold voltage that puts the transistor into its active state, the faster we can switch it. The disadvantage is that such a transistor sits close to the sub-threshold region, so it leaks current even when nominally off, which in turn wastes power.

What does useful-skew mean?

Useful skew is the concept of delaying the clock path of the capturing flip-flop; this helps in meeting the setup requirement of the launch-to-capture timing path. But the hold requirement still has to be met for the design.

What is body effect? Is it due to parallel or serial connection of MOSFETs?

An increase in Vt (threshold voltage) due to an increase in Vsb (source-to-body voltage) is called the body effect. It arises with the series connection. In general, multiple MOS devices are made on a common substrate, so the substrate voltage of all devices is normally equal. However, when the devices are connected in series, the source-to-substrate voltage increases as we proceed up the series chain, which results in Vth2 > Vth1.

What is latch up in CMOS design and ways to prevent it?

Latch-up pertains to a failure mechanism wherein a parasitic thyristor (such as a parasitic silicon-controlled rectifier, or SCR) is inadvertently created within a circuit, causing a high amount of current to flow through it continuously once it is accidentally triggered or turned on. Depending on the circuits involved, the amount of current produced by this mechanism can be large enough to result in permanent destruction of the device due to electrical overstress (EOS). It can be prevented by placing plenty of well and substrate taps (guard rings) close to the transistors, increasing the spacing between NMOS and PMOS devices, and using an epitaxial substrate, all of which reduce the gain of the parasitic bipolar structure.

What is Noise Margin? Relate it with an inverter.

NMH = VOH - VIH; NML = VIL - VOL.

What happens to delay if you increase load capacitance?

Delay increases.

For CMOS logic, give the various techniques you know to minimize power consumption.

Dynamic power dissipation = f * C * VDD^2, so minimize the load capacitance C, the supply voltage VDD and the operating frequency f.

All of us know how an inverter works. What happens when the PMOS and NMOS are interchanged with one another in an inverter?

The output gives a degraded 1 and a degraded 0. It is similar to a pMOS transferring a degraded 0 and an nMOS transferring a degraded 1.

Give 5 important design techniques you would follow when doing a layout for digital circuits.

1. In digital design, decide the height of the standard cells you want to lay out; it depends on how big your transistors will be. Have a reasonable width for the VDD and GND metal paths. Maintaining a uniform height for all the cells is very important, since this helps the place-and-route tool, and if you want to do manual connection of the blocks it saves a lot of area.
2. Use one metal in one direction only (this does not apply to metal 1). Say you are using metal 2 for horizontal connections; then use metal 3 for vertical connections, metal 4 for horizontal, metal 5 for vertical, etc.
3. Place as many substrate contacts as possible in the empty spaces of the layout.
4. Do not use poly over long distances, as it has huge resistance, unless you have no other choice.
5. Use fingered transistors as and when you feel necessary.
6. Try to maintain symmetry in your design, and try to get the design in a bit-sliced manner.

Give two ways of converting a two input NAND gate to an inverter?

Short the 2 inputs of the NAND gate and apply the single input to it. Alternatively, tie one input to logic 1 (VDD) and apply the input signal to the other.

Convert a D flip-flop into a divide-by-2 circuit. What is the max clock frequency the circuit can handle, given the following information: setup time = 6 ns, hold time = 2 ns, propagation delay = 10 ns?

Connect Qbar to D, apply the clock to the flip-flop's clock input and take the output at Q; it gives frequency/2. The maximum frequency of operation is: 1 / (propagation delay + setup time) = 1/16 ns = 62.5 MHz.
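A minimal Verilog sketch of the divide-by-2 connection (signal names are illustrative):

module div_by_2 (input clk, input rst_n, output reg q);
  // Qbar fed back to D: the flop toggles on every clock, so q runs at clk/2
  always @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 1'b0;
    else        q <= ~q;
endmodule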

What is a false path? Give an example.

The paths which are never exercised during normal circuit operation, for any set of inputs.

What are multi-cycle paths? Give an example.

Multi-cycle paths are data paths that require more than one clock cycle to latch data at the destination register. For example, a register may be required to capture data only on every second or third rising clock edge. The figure below shows an example of a multi-cycle path between a multiplier's input registers and its output register, where the destination latches data on every other clock edge.
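A minimal sketch of a destination register that captures only on every other edge (the names and the multiplier datapath are my assumptions, not the original figure; for the path to be a safe 2-cycle path the sources feeding a and b would also have to update only every other cycle, and the exception would still be declared to the timing tool):

module mult_mcp (input clk, input rst_n,
                 input  [7:0]  a, b,
                 output reg [15:0] p);
  reg phase;                       // toggles every clock
  always @(posedge clk or negedge rst_n)
    if (!rst_n) phase <= 1'b0;
    else        phase <= ~phase;
  always @(posedge clk)
    if (phase)                     // destination captures every other edge,
      p <= a * b;                  // so the a*b path can be treated as a 2-cycle path
endmodule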

How to decide number of pads in chip level design?

No. of pads = dynamic power / [no. of sides * core voltage * max current per pad]. Effectively, it is an even distribution of power over the 4 sides of the chip.

What do corner cells contain?

It has a metal layer for the continuity of the power and ground network.

What is the difference between core filler cells and metal fillers?

Core filler cells are used for the continuity of the power rails in the core area. Metal fillers are used to meet metal-density rules.

Two capacitors are connected in parallel through a switch. C1 = 1 uF, C2 = 0.25 uF. Initially the switch is open and C1 is charged to 10 V. What happens if we close the switch?

Since the charge is conserved: U1*C1 + U2*C2 = U3*(C1 + C2), so U3 = (U1*C1 + U2*C2) / (C1 + C2) = (10 * 1 + 0 * 0.25) / (1 + 0.25) = 8, i.e. U3 = 8 V.

What will be the voltage level between the 2 capacitors? The Vcc = 10 V DC.

U2 = C1U / (C1+C2) = 4v

Suppose, you work on a specification for a system with some digital parameters. The specification table has Min, Typ and Max columns for each parameter. In what column would you put a Setup time and a Hold time?

SETUP time goes into the Min column, and HOLD time into the Min column too. Example: usually the data must be set up at least (minimum) X ns before the clock and held at least (minimum) Y ns after the clock. You need to specify Min setup and Min hold time.

Design a simple circuit based on combinational logic to double the output frequency.
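A simulation-only sketch of the classic answer (XOR of the clock with a delayed copy of itself; the delay value here is arbitrary and would be an inverter/buffer chain in real hardware):

module freq_doubler (input clk_in, output clk_x2);
  wire clk_dly;
  assign #2 clk_dly = clk_in;       // delayed copy of the clock (illustrative delay)
  assign clk_x2 = clk_in ^ clk_dly; // a pulse is produced on every edge of clk_in
endmodule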

An 8-bit ADC with parallel output converts the input signal into digital numbers. You have to come up with the idea of a circuit that finds the MAX of every 10 numbers at the output of the ADC.

Since we need to find the MAX of every 10 samples, we place after the ADC a FIFO 8 bits wide and 10 words deep. It will require 8 * 10 flip-flops. Every two stages of the FIFO are fed to a comparator and a multiplexer; the comparator compares two 8-bit numbers and enables the multiplexer to choose the larger of the two. It requires 9 comparator/multiplexer pairs to find the MAX number, so for every new clock there will be a new MAX value.
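A simpler sequential alternative (my own sketch, not the FIFO/comparator-tree scheme described above): keep a running maximum and a sample counter, and produce the result on every 10th sample.

module max_of_10 (input clk, input rst_n,
                  input  [7:0] adc_data,
                  output reg [7:0] max_out);
  reg [3:0] count;
  reg [7:0] running_max;
  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      count <= 4'd0; running_max <= 8'd0; max_out <= 8'd0;
    end else if (count == 4'd9) begin          // 10th sample of the group
      max_out     <= (adc_data > running_max) ? adc_data : running_max;
      running_max <= 8'd0;                     // start a new group of 10
      count       <= 4'd0;
    end else begin
      if (adc_data > running_max) running_max <= adc_data;
      count <= count + 4'd1;
    end
endmodule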

Implement a comparator that compares two 2-bit numbers A and B. The comparator should have 3 outputs: A > B, A < B, A = B (law of trichotomy). Do it two ways: 1. using combinational logic; 2. using multiplexers. Write HDL code for your schematic at RTL and gate level.

A1 A0 B1 B0 | A>B B>A A=B
 0  0  0  0 |  0   0   1
 0  0  0  1 |  0   1   0
 0  0  1  0 |  0   1   0
 0  0  1  1 |  0   1   0
 0  1  0  0 |  1   0   0
 0  1  0  1 |  0   0   1
 0  1  1  0 |  0   1   0
 0  1  1  1 |  0   1   0
 1  0  0  0 |  1   0   0
 1  0  0  1 |  1   0   0
 1  0  1  0 |  0   0   1
 1  0  1  1 |  0   1   0
 1  1  0  0 |  1   0   0
 1  1  0  1 |  1   0   0
 1  1  1  0 |  1   0   0
 1  1  1  1 |  0   0   1

Here is what the behavioral model of the comparator looks like:

module comp0 (y1, y2, y3, a, b);
  input  [1:0] a, b;
  output y1, y2, y3;
  wire   y1, y2, y3;
  assign y1 = (a > b)  ? 1 : 0;
  assign y2 = (b > a)  ? 1 : 0;
  assign y3 = (a == b) ? 1 : 0;
endmodule
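A combinational, gate-level-style version derived from the truth table above (this reduction and the names are my own, not taken from the original answer):

module comp_gates (output gt, lt, eq, input [1:0] a, b);
  wire eq1 = ~(a[1] ^ b[1]);                         // high bits equal
  wire eq0 = ~(a[0] ^ b[0]);                         // low bits equal
  assign eq = eq1 & eq0;                             // A = B
  assign gt = (a[1] & ~b[1]) | (eq1 & a[0] & ~b[0]); // A > B
  assign lt = (~a[1] & b[1]) | (eq1 & ~a[0] & b[0]); // A < B
endmodule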

You have an 8-bit ADC clocking data out every 1 ms. Design a system that will sort the output data and keep statistics of how often each binary number appears at the output of the ADC.

The diagram shows the basic idea of a possible solution: using a RAM to store the statistics. The digital number to be recorded is used as the RAM address. Once digital data at the output of the ADC becomes available, the memory controller generates a RD signal, and the contents of the memory cell addressed by the ADC output are latched into the D register. A "1" in the D register enables the WR signal to the next memory cell. To calculate how many times a certain number appeared at the output of the ADC, it is necessary to sum the contents of all the memory cells.

The circle can rotate clockwise and counterclockwise. Use minimum hardware to build a circuit to indicate the direction of rotation.

2 sensors are required to find out the direction of rotation. They are placed as in the drawing. One of them is connected to the data input of a D flip-flop, and the second one to the clock input. If the circle rotates such that the clock sensor sees the light first while the D input (the second sensor) is zero, the output of the flip-flop equals zero; if the D-input sensor "fires" first, the output of the flip-flop becomes high.
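The same idea in a line of Verilog (sensor names assumed):

module rotation_dir (input sensor_clk, input sensor_d, output reg dir);
  // sensor_clk strobes the flop; whichever sensor sees the mark first sets the direction bit
  always @(posedge sensor_clk)
    dir <= sensor_d;
endmodule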

An IC device draws higher current when temperature gets: - higher - lower

A device draws higher current when the temperature gets lower.

To enter the office, people have to pass through a corridor. Once someone gets into the office, the light turns on; it goes off when no one is present in the room. There are two registration sensors in the corridor. Build a state machine diagram and design a circuit to control the light.

Draw a transistor schematic of a NOR gate, its layout, and a cross-section of the layout.

The silicon of a new device has a memory leak. When all "0"s are written into the RAM, it reads back all "0"s without any problem. When all "1"s are written, only 80% of the memory cells are read back correctly. What could possibly be the problem with the RAM?

Design a FIFO 1 byte wide and 13 words deep. The FIFO is interfacing 2 blocks with different clocks. On the rising edge of CLK the FIFO stores data and increments write pointer. On the rising edge of CLKB the data is put on the b-output, the read pointer points to the next data to be read. If the FIFO is empty, the b-output data is not valid. When the FIFO is full the existing data should not be overridden. When rst_N is asserted, the FIFO pointers are asynchronously reset.

module fifo1 (full, empty, clk, clkb, ain, bout, rst_N);
  output [7:0] bout;
  input  [7:0] ain;
  input  clk, clkb, rst_N;
  output empty, full;
  reg [3:0] wptr, rptr;
  ...
endmodule

We have a FIFO which clocks data in at 100 MHz and clocks data out at 80 MHz. On the input there are only 80 data words in any order during each 100 clocks; in other words, 100 input clocks will carry only 80 data words, and the other twenty clocks carry no data (the data is scattered in any order). How big does the FIFO need to be to avoid data over/under-run?
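One common back-of-the-envelope sizing (the worst-case data arrangement assumed here is mine, not stated in the question): let the 80 valid words of one 100-clock window arrive in its last 80 clocks and the 80 words of the next window in its first 80 clocks, giving 160 back-to-back writes. Those writes take 160 * 10 ns = 1600 ns, during which the 80 MHz read side can drain at most 1600 ns / 12.5 ns = 128 words, so the FIFO must absorb about 160 - 128 = 32 words. A depth of 32, plus a small margin for synchronization latency, is the usual answer.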

In scan chains if some flip flops are positive edge triggered and remaining flip flops are negative edge triggered how does it behave?

For designs with both positive- and negative-edge clocked flops, the scan insertion tool will always route the scan chain so that the negative-edge clocked flops come before the positive-edge flops in the chain. This avoids the need for a lockup latch. For the same clock domain, the negative-edge flops will always capture the data just captured into the positive-edge flops on the positive edge of the clock. For multiple clock domains, it all depends on how the clock trees are balanced. If the clock domains are completely asynchronous, ATPG has to mask the receiving flops.

What is the difference between a normal buffer and a clock buffer?

The clock net is one of the high-fanout nets (HFNs). Clock buffers are designed with special properties like high drive strength and low delay. Clock buffers have equal rise and fall times; this prevents the duty cycle of the clock signal from changing as it passes through a chain of clock buffers. Normal buffers are designed with a W/L ratio such that the sum of rise time and fall time is minimum; they too are designed for higher drive strength.

During placement, optimization may make the scan chain difficult to route due to congestion, so the tool will re-order the chain to reduce congestion. This sometimes increases hold-time problems in the chain; to overcome this, buffers may have to be inserted into the scan path. The tool may not be able to maintain the scan-chain length exactly, and it cannot swap cells between different clock domains. Because of scan-chain reordering, the patterns generated earlier are of no use, but this is not a problem as ATPG can be redone by reading the new netlist.

On what basis do we determine the clock frequency of a design?

There are several factors, such as:
1. Input and output data rate
2. Power
3. Accuracy of the results required
4. Technology
5. Target platform

What is the difference between Mealy and Moore state-machines?

In a Mealy state machine, the output is a function of both the present state and the inputs. In a Moore state machine, the output is a function of the present state only; the inputs affect only the next state.

How do you solve setup and hold violations in the design?

To solve setup violations, we can:
1. Optimize/restructure the combinational logic between the flops.
2. Tweak flops to offer lesser setup delay.
3. Tweak the launch flop to have a better slew at the clock pin; this makes the CK-to-Q of the launch flop faster, thereby helping to fix setup violations.
4. Play with skew (tweak the clock network delay: slow down the clock to the capturing flop and speed up the clock to the launch flop), otherwise called useful skew.

To solve hold violations, we can:
1. Add delay/buffers (as a buffer offers less delay, we go for special delay cells whose functionality is Y = A, but with more delay).
2. Delay the clock reaching the launch flop.
3. Add lockup latches (in cases where the hold-time requirement is very large, basically to avoid data slip).

What is antenna Violation? Describe ways to prevent it.

During the process of plasma etching, charges accumulate along the metal strips. The longer the strips are, the more charge is accumulated. If a small transistor gate is connected to these long metal strips, the gate oxide can be destroyed (a large electric field across a very thin oxide). This is called an antenna violation. One way to prevent it is by jogging the metal line through a layer that is at least one metal above the layer to be protected: if we want to remove an antenna violation on metal2, we need to jog it up through metal3, not metal1. The reason is that while metal2 is being etched, metal3 has not yet been laid out, so the two pieces of metal2 are disconnected and only the piece connected to the gate delivers charge to it. When metal3 is later laid down, the charge collected on the remaining portion is added through metal3; this is called the accumulative antenna effect. Another way of preventing it is adding reverse-biased diodes at the gates.

We have multiple instances in RTL (Register Transfer Level); do you do anything special during the synthesis stage?

While writing RTL, say in Verilog or VHDL, we don't write the same module functionality again and again; we use a concept called instantiation, where an instantiated module behaves like the parent module in terms of functionality. During the synthesis stage we need the full code so that the synthesis tool can study the logic, structure it and map it to library cells, so we use a synthesis command called "uniquify", which replaces the instantiations with the real logic. Once we are at the synthesis stage we have to visualize real cells, no longer a model for functionality alone; we need to think in terms of the physical implementation as well.

What are Tie-high and Tie-low cells? And where are they used?

Tie-high and tie-low cells are used to connect the gate of a transistor to either power or ground. In deep submicron processes, if the gate were connected directly to power/ground, the transistor might be turned on/off due to power or ground bounce, so the suggestion from the foundry is to use tie cells for this purpose. These cells are part of the standard-cell library. Cells that require a constant 1 connect to the tie-high cell (so a tie-high cell is effectively a constant power supply), while cells that require a constant 0 (Vss) connect to the tie-low cell.

What is the difference between latch-based and flip-flop-based designs?

Latches are level-sensitive and flip-flops are edge-sensitive. The difference between latch-based and flop-based design is that a latch allows time borrowing, which a traditional flop does not; that makes latch-based design more efficient. But at the same time, latch-based design is more complicated and has more issues with min timing (races). Its STA with time borrowing in deep pipelining can be quite complex.

What does LEF mean?

LEF (Library Exchange Format) is an ASCII data format from Cadence, used to describe a standard-cell library. It includes the design rules for routing and the abstract layout of the cells. A LEF file contains the following: Technology: layers, design rules, via definitions, metal capacitance. Site: site definition. Macros: cell descriptions, cell dimensions, layout of pins and blockages, capacitances.

What does DEF mean?

DEF (Design Exchange Format) is an ASCII data format from Cadence, used to describe design-related information.

What are the steps involved in designing an optimal pad ring?

1. Make sure you have corner pads across all the corners of the pad ring; this is mainly to have power continuity as well as lower resistance.
2. Ensure that the pad ring fulfills the ESD requirements: identify the power domains, split the domains, and ensure a common ground across all the domains.
3. Ensure the pad ring fulfills the SSN (Simultaneous Switching Noise) requirement.
4. Place transfer-cell pads between the power domains and between pads of different heights, to maintain rail connectivity.
5. Ensure that the design has sufficient core power pads.
6. Choose the drive strength of the pads based on the current requirements and timing.
7. Ensure that there are separate analog ground and power pads.
8. A no-connection pad is used to fill out the pad frame if there is no requirement for I/Os; extra VDD/GND pads could also be used. Ensure that no input/output pads are left with unconnected inputs, as they consume power if the inputs float.
9. Ensure that oscillator pads are used for clock inputs.

10. In case the design requires source-synchronous circuits, make sure that the clock and data pads have the same drive strength.
11. Breaker pads are used to break the power ring and to isolate the power structure across the pads.
12. Ensure that the metal wire connected to the pin can carry a sufficient amount of current; check whether more than one metal layer is necessary to carry the maximum current provided at the pin.
13. If required, place pads with built-in capacitance.

What is metastability? Describe steps to prevent it.

Metastability is an unknown state: it is neither zero nor one. Metastability happens when a design violates the setup or hold time requirements. Setup time is the requirement that the data be stable before the clock edge, and hold time is the requirement that the data be stable after the clock edge. Potential setup and hold violations can happen when the data is purely asynchronous but is clocked synchronously. Steps to prevent metastability:
1. Use proper synchronizers (two-stage or three-stage) as soon as the data comes from an asynchronous domain; synchronizers allow recovery from a metastable event.
2. Use synchronizers between clock-domain crossings to reduce the probability of metastability.
3. Use faster flip-flops (which have a narrower metastable window).
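A minimal two-stage synchronizer sketch (names are illustrative):

module sync2 (input clk, input rst_n, input async_in, output reg sync_out);
  reg meta;                              // first stage, the one allowed to go metastable
  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      meta     <= 1'b0;
      sync_out <= 1'b0;
    end else begin
      meta     <= async_in;              // sample the asynchronous input
      sync_out <= meta;                  // second stage gets a full cycle to resolve
    end
endmodule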

What do local-skew, global-skew, and useful-skew mean?

Local skew: the difference between the clock arrival at the launching flop and the clock arrival at the destination flop of a timing path.
Global skew: the difference between the earliest-arriving and latest-arriving flop clocks for the same clock domain.
Useful skew: the concept of delaying the capturing flip-flop's clock path; this helps in meeting the setup requirement of the launch-and-capture timing path, but the hold requirement still has to be met for the design.

What are the various timing paths which we should take care of in STA runs?

1. Timing path starting from an input port and ending at an output port (purely combinational path).
2. Timing path starting from an input port and ending at a register.
3. Timing path starting from a register and ending at an output port.
4. Timing path starting from a register and ending at a register.

What are the various components of leakage power?

What are the various yield-losses in the design?

The yield loss in a design is characterized by: 1. functional yield losses, mainly caused by spot defects (especially shorts and opens); 2. parametric yield losses, due to process variations.

How to build a 4:1 Mux using only 2:1 Mux?
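A structural Verilog sketch (module and signal names are my own):

module mux2 (input a, b, sel, output y);
  assign y = sel ? b : a;
endmodule

module mux4_from_mux2 (input d0, d1, d2, d3, input [1:0] sel, output y);
  wire y_low, y_high;
  mux2 m0 (.a(d0),    .b(d1),     .sel(sel[0]), .y(y_low));  // picks between d0/d1
  mux2 m1 (.a(d2),    .b(d3),     .sel(sel[0]), .y(y_high)); // picks between d2/d3
  mux2 m2 (.a(y_low), .b(y_high), .sel(sel[1]), .y(y));      // final 2:1 stage
endmodule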

What is glitch? What causes it (explain with waveform)? How to overcome it?

The following figure shows a synchronous alternative to the gated clock using a data path. The flip-flop is clocked on every clock cycle and the data path is controlled by an enable. When the enable is low, a multiplexer feeds the output of the register back onto its input. When the enable is high, new data is fed to the flip-flop and the register changes its state.

Latch-based designs are sometimes used for high-speed digital circuits. One possible configuration places combinational logic between a pair of latches with opposite polarities (one latch is active-high and the other is active-low) that use the same clock signal.

Which of the two latches is active high and which is active low?

State whether each of the following statements is True or False.

1. When doing RTL design, all flip-flops need to have a reset input that is synchronized with the clock. False
2. Because there are two types of memory operations (Read and Write), there are four different types of data dependencies that can exist between memory operations. True
3. If the hold time of a flip-flop is violated, a possible solution would be to add buffers at the input of that flop. True
4. Voltage scaling is a power reduction technique that relies on reducing the supply voltage of a circuit without affecting any of the other circuit parameters. False
5. Because of the small number of transitions between codes, a 32-state finite state machine that uses Gray coding will consume less power than one that uses binary coding. False
6. If a circuit contains some redundant components, all faults in the redundant circuitry are undetectable. False

Pipelining is a particular form of retiming where the goal is to increase the throughput (number of results per second) of a circuit. Consider the circuit diagram below; the solid rectangles represent registers, the squares are blocks of combinational logic:

Each combinational block in the diagram is annotated with its propagation delay in ns. For this problem assume that the registers are "ideal", i.e., they have zero propagation delay and zero setup and hold times.
1. What are the latency and throughput of the circuit above? Latency is how long it takes for a value in the input register to be processed and the result to appear at the output register. Throughput is the number of results per second.
2. Pipeline the circuit above for maximum throughput. Pipelining adds additional registers to a circuit; we'll do this by adding additional registers at the output and then using the retiming transformation to move them between the combinational blocks. What are the latency and throughput of your resulting circuit?

1. The register-to-register TPD is the TPD of the longest (time-wise) path through the combinational logic = 53 ns. A value in the input register is processed in one clock cycle (latency = 53 ns), and the circuit can produce an output every cycle (throughput = 1 answer / 53 ns).

2. To maximize the throughput, we need to get the "30" block into its own pipeline stage. So we'll draw the retiming contours like so:

Note there is an alternative way we could have drawn the contours to reach the goal of isolating the "30" block. Similarly, we could have instead added registers at the input and used retiming to move them into the circuit. The contours above lead to the following pipelined circuit diagram:

A good check to see if the retiming is correct is to verify that there are the same number of registers on every path from the input(s) to the output(s). The register-to-register TPD is now 30ns, so the throughput of the pipelined circuit is 1/30ns. The latency has increased to 3 clock cycles, or 90ns. In general increasing the throughput through pipelining always leads to an increase in latency. Fortunately latency is not an issue for many digital processing circuits -- e.g., microprocessors

where we care more about how many results we get per second much more than how long it takes to process an individual result.

In thinking about the propagation delay of a ripple-carry adder, we see that the higher-order bits are "waiting" for their carry-ins to propagate up from the lower-order bits. Suppose we split off the high-order bits and create two separate adders: one assuming that the carry-in was 0 and the other assuming the carry-in was 1. Then when the correct carry-in was available from the low-order bits, it could be used to select which high-order sum to use. The diagram below shows this strategy applied to an 8-bit adder:

1. Compare the latency of the 8-bit carry-select adder shown above to a regular 8-bit ripple-carry adder.

2. The logic shown for C8 seems a bit odd. One might have expected C8 = C8,0 + (C4 * C8,1). Explain why both implementations are equivalent and suggest why the logic shown above might be preferred. Hint: what can we say about C8,1 when C8,0 = 1?

Solution: 1. For the low-order 4 bits, the latency is the same for both implementations: TPD,4-BIT ADDER. But with the carry-select adder, the remaining latency is the propagation delay of the 4-bit 2:1 multiplexer (TPD,2:1 MUX) instead of the longer time it takes for the carry to ripple through another 4 bits of adder (TPD,4-BIT ADDER). If we consider an N-bit adder, the latencies are: TPD,N-BIT RIPPLE = 2 * TPD,(N/2)-BIT RIPPLE; TPD,N-BIT CARRY-SELECT = TPD,(N/2)-BIT RIPPLE + TPD,2:1 MUX.

As N gets large the carry-select adder is almost twice as fast as a ripple-carry adder, since the delay through the 2:1 mux is independent of N. The carry-select strategy can be applied recursively to the (N/2)-bit ripple-carry adder, and so on, ultimately producing an adder with O(log2 N) latency.

2. If we think about carry generation, it's easy to see that if C8,0 = 1 then C8,1 = 1, i.e., if we get a carry with CIN = 0, we'll also get a carry when CIN = 1. Using this fact we can do a little Boolean algebra (C8,1' denotes the complement of C8,1):
C8 = C8,0 + (C4 * C8,1)
   = C8,0 * (C8,1 + C8,1') + (C4 * C8,1)
   = (C8,0 * C8,1) + (C8,0 * C8,1') + (C4 * C8,1)
   = (C8,0 * C8,1) + 0 + (C4 * C8,1)          [C8,0 * C8,1' = 0, since C8,0 implies C8,1]
   = (C8,0 + C4) * C8,1
In the worst case (the carry-in rippling all the way up), C8,1 will take a little longer to compute than C8,0, so the logic for C8 shown in the diagram will be a little faster since it provides the shorter path to C8.
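A behavioral Verilog sketch of the 8-bit carry-select scheme discussed above (names are illustrative):

module carry_select_adder8 (input  [7:0] a, b, input cin,
                            output [7:0] sum, output cout);
  wire c4;
  wire [3:0] sum_hi0, sum_hi1;
  wire c8_0, c8_1;
  assign {c4, sum[3:0]}  = a[3:0] + b[3:0] + cin;    // low half: plain ripple add
  assign {c8_0, sum_hi0} = a[7:4] + b[7:4] + 1'b0;   // high half assuming carry-in 0
  assign {c8_1, sum_hi1} = a[7:4] + b[7:4] + 1'b1;   // high half assuming carry-in 1
  assign sum[7:4] = c4 ? sum_hi1 : sum_hi0;          // select with the real C4
  assign cout     = (c8_0 | c4) & c8_1;              // the equivalent form derived above
endmodule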

Occasionally you will come across a CMOS circuit where the complementary nature of the n-channel pull-downs and p-channel pull-ups is not obvious, as in the circuit shown below:

1. Construct a table that gives the on-off status of each transistor in the circuit above for all combinations of inputs A and B.

2. Compute the output, Y, for each input combination and describe the function of the above circuit.

Solution:

The output Y is connected to four pairs of transistors in series, so each of these pairs can affect the output. When A=0 and B=0, transistors T4 and T5 are on, so Y=0. When A=0 and B=1, transistors T6 and T7 are on, so Y=1. When A=1 and B=0, transistors T2 and T3 are on, so Y=1. When A=1 and B=1, transistors T8 and T9 are on, so Y=0. Putting this together, we conclude that Y = XOR(A, B).

Suppose we are building circuits using only the following three components:
Inverter: tcd = 0.5 ns, tpd = 1.0 ns, tr = tf = 0.7 ns
2-input NAND: tcd = 0.5 ns, tpd = 2.0 ns, tr = tf = 1.2 ns
2-input NOR: tcd = 0.5 ns, tpd = 2.0 ns, tr = tf = 1.2 ns
Consider the following circuit constructed from an inverter and four 2-input NOR gates:

1. What is tpd for this circuit?
2. What is tcd for this circuit?
3. What is tpd of the fastest equivalent circuit (i.e., one that implements the same function) built using only the three components listed above?

Solution: 1. tpd for the circuit is the maximum cumulative propagation delay considering all paths from any input to any output. In this circuit, the longest path involves three 2-input NOR gates with a cumulative tpd = 6 ns. 2. tcd for the circuit is the minimum cumulative contamination delay considering all paths from any input to any output. In this circuit, the shortest path involves two 2-input NOR gates with a cumulative tcd = 1 ns. 3. The most straightforward way to determine the functionality of a circuit is to build a truth table:

A B | OUT
0 0 |  1
0 1 |  0
1 0 |  1
1 1 |  0

From which we can see that OUT = not (B). We can implement this with a single inverter that has a tpd = 1ns.

Suppose that each component in the circuit below has a propagation delay (tpd) of 10ns, a contamination delay (tcd) of 1ns, and negligible rise and fall times. Suppose initially that all four inputs are 1 for a long time and then the input D changes to 0. Draw a waveform plot showing how X, Y, Z, W and Q change with time after the input transition on D.

The following graph plots the voltage transfer characteristic for a device with one input and one output. Can this device be used as a combinational device in a logic family with 0.75V noise margins?

Design a FSM to detect a sequence 10110.
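A compact sketch that detects the serial pattern 10110 (a full minimized FSM is the usual exam answer; this shift-register version is my own equivalent, and it naturally handles overlapping matches):

module seq_detect_10110 (input clk, input rst_n, input din, output detected);
  reg [4:0] shift;
  always @(posedge clk or negedge rst_n)
    if (!rst_n) shift <= 5'b00000;
    else        shift <= {shift[3:0], din};   // oldest bit ends up in shift[4]
  assign detected = (shift == 5'b10110);      // last five inputs were 1,0,1,1,0
endmodule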

What is a Microprocessor?

A microprocessor is a program-controlled device which fetches instructions from memory, decodes and executes them. Most microprocessors are single-chip devices.

Give examples of 8 / 16 / 32 bit microprocessors.

8-bit processors: 8085 / Z80 / 6800. 16-bit processors: 8086 / 68000 / Z8000. 32-bit processors: 80386 / 80486.

Why is the 8085 processor called an 8-bit processor?

Because the 8085 processor has an 8-bit ALU (Arithmetic Logic Unit). Similarly, the 8086 processor has a 16-bit ALU.

What is 1st / 2nd / 3rd / 4th generation processor?

A processor made with PMOS / NMOS / HMOS / HCMOS technology is called a 1st / 2nd / 3rd / 4th generation processor, and these are 4 / 8 / 16 / 32 bit devices respectively.

Define HCMOS.

HCMOS: high-density, n-type, complementary metal-oxide-silicon field-effect transistor technology.

What does microprocessor speed depend on?

The processing speed depends on the data bus width.

Is the address bus unidirectional?

The address bus is unidirectional because the address information is always given by the microprocessor to address a memory location or an input/output device.

Is the data bus bi-directional?

The data bus is bi-directional because the same bus is used for the transfer of data between the microprocessor and memory or input/output devices, in both directions.

What is the disadvantage of the microprocessor?

It has limitations on the size of data. Most microprocessors do not support floating-point operations.

What is the difference between a microprocessor and a microcontroller?

A microprocessor has more op-codes and few bit-handling instructions, whereas a microcontroller has fewer op-codes and more bit-handling instructions. A microcontroller is also defined as a device that includes a microprocessor, memory, and input/output signal lines on a single chip.

Why does a microprocessor contain ROM chips?

A microprocessor system contains a ROM chip because the ROM holds the instructions that the processor executes.

What is the difference between primary & secondary storage device?

In a primary storage device the storage capacity is limited and it is volatile memory. In a secondary storage device the storage capacity is larger and it is non-volatile memory. Primary devices are: RAM / ROM. Secondary devices are: floppy disk / hard disk.

What is the difference between static and dynamic RAM?

Static RAM: No refreshing, 6 to 8 MOS transistors are required to form one memory cell, Information stored as voltage level in a flip flop. Dynamic RAM: Refreshed periodically, 3 to 4 transistors are required to form one memory cell, Information is stored as a charge in the gate to substrate capacitance. What is interrupt?

An interrupt is a signal sent by an external device to the processor to request that the processor perform a particular task.

What is cache memory?

Cache memory is a small, high-speed memory used for temporary storage of data and instructions between the main memory and the CPU (central processing unit). Cache memory is built from RAM.

What is called the scratch pad of the computer?

Cache memory is the scratch pad of the computer.

Which transistor is used in each cell of an EPROM?

A Floating-gate Avalanche-injection MOS (FAMOS) transistor is used in each cell of an EPROM.

Differentiate between RAM and ROM.

RAM: Read / Write memory, High Speed, Volatile Memory. ROM: Read only memory, Low Speed, Non Volatile Memory.

What is a compiler?

A compiler is used to translate a high-level language program into machine code all at once. It doesn't require a special instruction to store the code in memory; it stores it automatically. The execution time is less compared to an interpreter.

Which processor structure is pipelined?

All x86 processors have a pipelined structure.

What is a flag?

A flag is a flip-flop used to store information about the status of the processor and the status of the most recently executed instruction.

What is the stack?

The stack is a portion of RAM used for saving the contents of the program counter and the general-purpose registers.

What is NV-RAM?

Non-volatile read/write memory, also called flash memory. It is also known as shadow RAM.

Describe a simple circuit for multiplication of 3-bit unsigned numbers.
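A minimal sketch using shifted partial products (names are mine):

module mult3x3 (input [2:0] a, b, output [5:0] product);
  wire [5:0] pp0 = b[0] ? {3'b000, a}        : 6'b0;  // a * b[0]
  wire [5:0] pp1 = b[1] ? {2'b00,  a, 1'b0}  : 6'b0;  // a * b[1], shifted left by 1
  wire [5:0] pp2 = b[2] ? {1'b0,   a, 2'b00} : 6'b0;  // a * b[2], shifted left by 2
  assign product = pp0 + pp1 + pp2;                   // sum of partial products
endmodule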

How do you calculate delay for a combinational logic block?

We use static timing analysis (STA) for digital combinational circuits. The timing analysis is carried out in an input-independent manner and finds the worst-case delay of the circuit over all possible input combinations. The computational efficiency of such an approach has resulted in its widespread use. A method commonly referred to as PERT (Program Evaluation and Review Technique) is popularly used in static timing analysis; it is more properly the CPM (Critical Path Method) that is widely used in project management. The CPM procedure is now an integral part of most fast algorithms for circuit delay calculation. While CPM-based methods are the dominant ones in use today, other methods for traversing circuit graphs have been used by various timing analyzers. CPM completely ignores the Boolean relationships in a circuit and works with purely topological properties. As a result, it is possible that the critical path found by CPM cannot be excited, and in general the critical path delay found using CPM is pessimistic. As an example, consider the circuit shown:

The circuit has three inputs a, b and c, and one output, out. Assume, for simplicity, that the multiplexer and inverter have zero delays, and that the four blocks whose delays are shown are purely combinational. It can easily be verified that the worst-case delay for this circuit, computed using the critical path method, is 4 units. However, by enumerating both possible logic values for the multiplexer, namely c=0 and c=1, it can be seen that the delay in both cases is 3 units, implying that the circuit delay is 3 units. The reason for this discrepancy is simple: the path with a delay of 4 units can never be sensitized because of the restrictions placed by the Boolean dependencies between the inputs.

What is the purpose of adding De-coupling capacitances in SoC?

Decap cells are basically capacitors used for decoupling. The gates in a circuit consume most of their (dynamic) power at the clock edges. No voltage source is perfect, hence glitches are produced on the power line due to the huge current draw at the clock edges. Decap filler cells are small capacitors placed between Vdd and ground all over the layout. All these small capacitors add up to a big capacitor between Vdd and ground, which helps to smooth out the glitches. Apart from decap fillers, decap macros are also placed in a big design to give additional decoupling.

Give a simplified circuit diagram of a scannable D flip-flop.

What are the characteristics of a Good Layout?

Good layout should minimize diffusion capacitance. Cells must be designed compactly to minimize wire parasitics.

What is the key observation in reducing the power with delay constraints in mind?

The observation is that for a device size x > x1, the delay reduction is not significant but the power consumption increases because of the larger device size. We want to size down the over-sized devices in the circuit, while still satisfying the delay constraints, to reduce wasted power in the design.

Briefly describe the concept of clock-gating.

Power has become a primary consideration during hardware design. Dynamic power can contribute up to 50% of the total power dissipation. Clock-gating is the most common RTL optimization for reducing dynamic power.

Most clock-gating is done at the Register Transfer Level (RTL). RTL clock-gating algorithms can be grouped into three categories: system-level, sequential and combinational. System-level clock-gating stops the clock for an entire block, effectively disabling all functionality. In contrast, combinational and sequential clock-gating selectively suspend clocking while the block continues to produce output.

Combinational clock-gating is a straightforward substitution to the RTL code. It reduces power by disabling the clock on registers when the output is not changing. Opportunities to insert combinational clock-gating can be found by looking for conditional assignments in the code. Clock-gating logic is substituted when code like "if (condition) out <= in" is present. Combinational clock-gating is now a feature in the RTL compilers. Power aware synthesis tools identify RTL coding patterns and make the appropriate substitution. Hardware designers only need to understand some simple RTL coding guidelines to gain the benefits of combinational clock-gating.
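A hedged sketch of the enable pattern and of one common gated-clock substitution (the latch-and-AND structure shown is a generic integrated-clock-gating style of my own, not a specific library cell):

module clkgate_example (input clk, input enable, input d, output reg q);
  // RTL enable pattern that clock-gating tools recognize:
  //   always @(posedge clk) if (enable) q <= d;
  // Typical substitution: a transparent-low latch holds the enable while the
  // clock is high, so the gated clock cannot glitch.
  reg  enable_latched;
  wire gclk;
  always @(clk or enable)
    if (!clk)
      enable_latched <= enable;        // latch is transparent while clk is low
  assign gclk = clk & enable_latched;  // glitch-free gated clock
  always @(posedge gclk)
    q <= d;                            // the flop's clock pin toggles only when enabled
endmodule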

Since combinational clock-gated flops maintain a one-to-one state mapping with the original RTL, Combinational Equivalence Checking tools can be used for functional verification. Since the switching activity is eliminated only when data is not changing, the actual power savings are limited. In typical designs, combinational clock-gating can reduce dynamic power by about 5 to 10%.

Sequential clock-gating alters the RTL micro-architecture without affecting design functionality. Power is optimized by identifying unused computations, data-dependent functions and don't-care cycles in the original code. There are many types of sequential clock-gating transformations. Identifying opportunities for sequential clock-gating is difficult, requiring sequential analysis. One example of a sequential optimization is turning off subsequent pipeline stages based on a propagated valid condition. Because of the additional logic, this transformation makes sense only if the data path is multiple bits wide.

Sequential clock-gating

Sequential clock-gating is a multi-cycle optimization with multiple implementation tradeoffs and RTL modifications. Consequently there is a greater demand on functional verification resources. On the other hand sequential clock-gating can save significant power, typically reducing switching activity by 15-to-25% on a given block.

Since sequential optimizations change the state of the design, Combinational Equivalence Checking Tools cannot be used for verification. This is not the case for Sequential Equivalence Checking (SEC). SEC tools can comprehensively verify sequential changes to RTL like clock-gating.

System-level clock-gating is designed into the original hardware architecture and coded as part of the RTL functionality. For example, sleep modes in a cell phone may strategically disable the display, keyboard or radio depending on the phone's current operational mode. System-level clock-gating shuts off entire RTL blocks. Because large sections of logic are not switching for many cycles, it has the most potential to save power.

Sequential clock-gating in the design flow

A standard practice is for design teams to create a block-wise power budget at the beginning of a project. As blocks are implemented, designers optimize those blocks that are over budget. Accurate power analysis for technologies at 90 nm and below depends on physical place-and-route information. Unfortunately this information is not available until late in the design flow. This means sequential clock-gating is done late in the project, further highlighting the importance of comprehensive verification.

Power optimization in a high-performance microprocessor

A hard requirement for the superscalar, high-performance PowerPC design was that it must use a standard package with a cooling fan while operating at rates up to 667 MHz. Factors like dynamic voltage scaling, multiple Vt, sleep modes, low-power memories and sequential clock-gating were considered.

Sequential clock-gating optimizations were performed by senior engineers on "hot areas" of the design. The first challenge was to identify the "don't care" states in the pipeline. Using this information and analyzing backwards across three to four pipeline stages, an opportunity for reducing power was identified. In this specific optimization, previous pipeline stages are disabled in earlier cycles when it is determined that the output is not used in the current cycle.

Disabling pipeline stages when the output is not used

As a side note, it would be simpler to only clock gate the final flop with vld_2. However, this would not reduce the switching activity associated with PKT1 and PKT2 flops and combinational logic.

The difficulty implementing this optimization is keeping track of the signals that crossed pipeline stages and contribute to the enable condition. There are no automated tools that can provide this information so this analysis is done manually by reading the RTL source code and waveforms.

RTL clock-gating is a common technique for reducing dynamic power. Today there are no automated tools to identify or make sequential RTL clock-gating optimizations. Such optimizations require experienced engineers who know when and how to apply the appropriate sequential change. Since this is a manual transformation, verification is critical. Sequential Equivalence Checking can verify clock-gating, giving designers the confidence to make aggressive power optimizations late in the design process. The result is a lower-power, higher-quality design.

What is the relation between delay and leakage power for Low-Vt and High-Vt cells?

Cell      Delay   Leakage power
Low-Vt    Less    More
High-Vt   More    Less

On What factors does the Short-Circuit power of a circuit depend upon?

The short-circuit power consumption of an inverter gate is proportional to the input ramp time, the load and the transistor sizes of the gate. The maximum short-circuit current flows when there is no load; this current decreases with the load.

Sometimes it is necessary to use both the rising and the falling edge of the clock to sample the data. This is needed in many DDR applications. The dual-edge flop is sometimes depicted like this:

The simplest design one can imagine would be to use two flip flops. One sensitive to the rising edge of the clock, the other to the falling edge and to MUX the outputs of both, using the clock itself as the select. This approach is shown below:
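A Verilog sketch of that two-flop-plus-mux approach (names are illustrative; as discussed next, it puts the clock in the data path):

module ddr_capture (input clk, input d, output q);
  reg q_pos, q_neg;
  always @(posedge clk) q_pos <= d;   // captures on the rising edge
  always @(negedge clk) q_neg <= d;   // captures on the falling edge
  assign q = clk ? q_pos : q_neg;     // the clock itself selects the freshly captured value
endmodule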

What's wrong with the above approach? Well, in an ideal world it is OK, but we have to remember that semi-custom tools/users don't like to have the clock in the data path. This requirement is justified and can cause a lot of headaches later when doing clock tree synthesis and when analyzing the timing reports. It is a good idea to avoid such constructions unless they are absolutely necessary. This recommendation also applies to the reset net - try not to combine the reset net into your logic clouds. Here is a cool circuit that can help solve this problem:

Replication is an extremely important technique in digital design. The basic idea is that under some circumstances it is useful to take the same logic cloud or the same flip-flops and produce more instances of them, even though only a single copy would normally be enough from a logical point of view. Imagine the situation on the picture below. The darkened flip-flop has to drive 3 other nets all over the chip and due to the physical placement of the capturing flops it cannot be placed close by to all of them. The layout tool finds as a compromise some place in the middle, which in turn will generate a negative slack on all the paths.

We notice that in the above example the logic cloud just before the darkened flop has a positive slack or in other words, some time to give. We now use this and produce a copy of the darkened flop, but this time closer to each of the capturing flops.

Yet another option is to duplicate the entire logic cloud plus the sending flop, as pictured below. This will usually generate even better results.

Notice that we also reduce the fan-out of the driving flop, thus further improving timing. It is important to take care, while writing the HDL code, that the paths are really separated. This means that when we want to replicate flops and logic clouds, we should make sure to give the registers/signals/wires different names. It is a good idea to keep some sort of naming convention for replicated paths, so that in the future, when a change is made on one path, it is easy to mirror that change on the other replications.

LFSR width and random numbers for your test bench

Say you designed a pretty complicated block or even a system in HDL and you wish to test it by injecting some random numbers at the inputs (just for the heck of it). For simplicity let's assume your block receives an integer with a value between 1 and 15. You think to yourself that it would be pretty neat to use a 4-bit LFSR, which generates all possible values between 1 and 15 in a pseudo-random order, and just repeat the sequence over and over again. Together with the other types of noise you inject into the system, this should be pretty thorough, right? Well, not really! Imagine for a second what the sequence looks like: each number will always be followed by one specific other number! For example, you will never be able to verify the case where the same number is injected immediately again into the block. To verify all the other cases (at least all different pairs of numbers) you would need an LFSR with a larger width (how much larger?). What you need to do then is pick up only 4 bits of this bigger LFSR and inject them into your block.

The average digital designer will be happy once he has finished his HDL coding, simulated it and verified it is working fine. Next he will run it through synthesis to see if timing is OK, and the job is done, right? Wrong! There are many problems that simply can't surface during synthesis. To name a few: routing congestion, crosstalk effects, parasitics, etc. This post concentrates on another issue which is much easier to understand, but which usually surfaces too late in the design to do something radical about it: the physical placement of flip-flops. The picture below shows a hypothetical architecture of a design, which is very representative of the problem I want to describe.

Flop A is forced to be placed close to the analog interface at the bottom, to have a clean interface to the digital core. In the same way, flop B is placed near the top, to have a clean interface to the analog part at the top. The signal between them needs to physically cross the entire chip. The layout tools will place many buffers to keep clean, sharp edges, but in many cases timing is violated. If this signal has to get through within one clock period, you are in trouble. Many times it does not have to, and pipeline stages can be added along the way, or a multi-cycle path can be defined. Most designers choose to introduce pipeline stages and have a cleaner synthesis flow (fewer special constraints). Some common solutions are:
1. Using local decoding as described.
2. Reducing the width of your register bus (costs in register read/write time).
3. Defining registers as quasi-static: changeable only during the power-up sequence, static during normal operation.

Sometimes we can achieve low power by replicating the hardware and operating the copies at lower Vdd values, i.e., applying parallelism. Shifting to SOI technologies as technologies shrink lowers the leakage power drastically. For lowering capacitances, people go for low-K dielectrics.

Describe the functionality of a Johnson counter.

The Johnson counter is made of a simple shift register with an inverted feedback - as can be seen below.

Johnson counters have 2N states (where N is the number of flip-flops) compared to 2^N states for a normal binary counter. Since each time only a single bit is changing - Johnson counter states form a sort of a Gray code. The next picture shows the 12 states of a 6 bit Johnson counter as an example.
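A 4-bit example in Verilog (the width here is arbitrary):

module johnson4 (input clk, input rst_n, output reg [3:0] q);
  always @(posedge clk or negedge rst_n)
    if (!rst_n) q <= 4'b0000;
    else        q <= {q[2:0], ~q[3]};   // shift left, inverted MSB fed back to the LSB
endmodule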

What factors determine the threshold voltage of a transistor?

i. The work-function difference between the gate and the channel.
ii. The voltage needed to change the surface potential at the silicon surface.
iii. The voltage needed to offset the depletion-region charge.
iv. The voltage needed to offset the non-ideal charge at the interface between the gate oxide and the silicon.

How does the leakage vary with the number of transistors in a stack?

Draw the graph on how the delay at the output node of a NOR gate varies with difference in the arrival times of input A and B of the NOR gate.

Often the design of a microarchitecture is very important in power and delay tradeoffs. Qualitatively depict how the design of different microarchitectures would affect power and delay.

Show how the scan chain could be reordered to reduce power consumed during testing.

Design an LFSR with the characteristic polynomial 1 + x^2 + x^3.
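One possible Fibonacci-style realization (the tap/shift convention and the seed are my choices; either primitive polynomial of degree 3 gives a maximal-length, 7-state sequence):

module lfsr3 (input clk, input rst_n, output reg [2:0] lfsr);
  wire fb = lfsr[2] ^ lfsr[1];           // taps corresponding to 1 + x^2 + x^3
  always @(posedge clk or negedge rst_n)
    if (!rst_n) lfsr <= 3'b001;          // any non-zero seed
    else        lfsr <= {lfsr[1:0], fb}; // shift and feed back
endmodule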

How can you design a Test Access Mechanism for simultaneous testing of different cores?

What are the different possible cases of inducing delay fault in two adjacent wires?
