
Fixed- and floating-point packages for VHDL 2005

David Bishop, Eastman Kodak Company, Rochester, NY

Abstract
The pending update to the VHDL LRM contains several new packages and functions. The new packages include support for both fixed-point and floating-point binary math. These fully synthesizable packages will raise the level of abstraction in VHDL. DSP applications, which previously needed an independent processor core or required very difficult manual translation, can now be implemented directly in your VHDL source code. In addition, schematic-based DSP algorithms can now be translated directly to VHDL. This paper describes these packages and gives examples of their use.

Introduction:
For the past 15 years we have been using HDLs to increase the level of abstraction in our ASIC and FPGA designs. HDL was a major leap from schematics. What have we done since? Little. Attempts have been made; E, SystemC and SystemVerilog are good examples. These are great ideas, but they do not give the designer the control and tool maturity that VHDL and Verilog provide. Why not simply increase the level of abstraction in a language that is already well known?

The potential of VHDL has not yet been fully tapped. Designed from the ground up as a software language, it is easily extendable and flexible. Constructed at a higher level than Verilog, it can provide higher levels of abstraction directly, with already mature tools.

Typically designers use integer math in their RTL code. For fixed point they tend to just remember where the decimal point is. For floating point they use a DSP, which may even be off chip. Designers tend to choose math solutions in the order integer, fixed point, floating point: about 80% of designs are done in integer math, and about 80% of the remaining 20% are done in fixed point. Note that fixed-point math is not much more complex than integer math, but floating point is about 3x as complex as integer math.

The integer math problem has been effectively solved with the NUMERIC_STD packages (1076.3, now part of VHDL200X-FT). This package has been widely adopted and has been in use for many years. In this paper, I describe a new set of packages that are being added to the VHDL language in the VHDL-2005 update. These packages include VHDL overloads that allow you to do fixed- and floating-point math directly, without having to perform any conversions. They raise the level of abstraction in VHDL and still give the user the flexibility and power of an HDL.

Fixed-point package:
Fixed-point math is basically integer math with numbers that can be less than 1.0. A fixed-point number has an assigned width and an assigned location for the decimal point. As long as the number is wide enough to provide the required precision, fixed point is fine for most DSP applications. Since it is based on integer math, it is extremely efficient as long as the data does not vary too much in magnitude.

The fixed-point math packages are based on the VHDL 1076.3 numeric_std package and use the signed and unsigned arithmetic from within that package. This makes them highly efficient, as the numeric_std package is well supported by simulation and synthesis tools. This package defines two new types: ufixed, which is unsigned fixed point, and sfixed, which is signed fixed point. Usage model:

    use ieee.fixed_pkg.all;
    ....
    signal a, b : sfixed (7 downto -6);
    signal c : sfixed (8 downto -6);
    begin
    ....
    a <= to_sfixed (-3.125, 7, -6);
    b <= to_sfixed (inp1, b'high, b'low);
    c <= a + b;

The two data types are defined as follows:

    type ufixed is array (integer range <>) of std_logic; -- base unsigned fixed-point type, downto direction assumed
    type sfixed is array (integer range <>) of std_logic; -- base signed fixed-point type, downto direction assumed

This data type uses a negative index to show you where the decimal point is. The decimal point is assumed to be between the "0" and "-1" index. Thus, if we take "signal y : ufixed (4 downto -5)" as the data type (unsigned fixed point, 10 bits wide, 5 bits of fraction), then y = 6.5 = "00110.10000", or simply:

    y <= "0011010000";

You can also say:

    y <= to_ufixed (6.5, 4, -5);

where "4" is the upper index and "-5" is the lower index, so you could also say:

    y <= to_ufixed (6.5, y'high, y'low);

The signed version uses two's complement to represent a negative number, just like the "numeric_std" package. Any non-zero index range is valid. Thus:

    signal z : ufixed (-2 downto -3);
    z <= "11"; -- 0.375 = 0.011
    signal x : sfixed (4 downto 1);
    x <= "1111"; -- -2 = 11110.0

The data widths in the fixed-point package were designed (by Ryan Hilton) so that there is no possibility of an overflow. This is a departure from the numeric_std model, which simply throws away underflow and overflow bits.
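The negative-index convention can be checked by hand by weighting each bit with 2 raised to its index. The Python sketch below is only a model for verifying values; it is not part of the VHDL packages, and the helper name fixed_value is made up for this illustration.

```python
from fractions import Fraction


def fixed_value(bits: str, high: int, low: int, signed: bool = False) -> Fraction:
    """Value of a bit string indexed (high downto low); bit i weighs 2**i."""
    if len(bits) != high - low + 1:
        raise ValueError("bit string length does not match the index range")
    total = Fraction(0)
    for offset, bit in enumerate(bits):
        index = high - offset
        weight = Fraction(2) ** index
        if signed and index == high:
            weight = -weight  # two's complement: the top bit counts negative
        if bit == "1":
            total += weight
    return total


print(float(fixed_value("0011010000", 4, -5)))        # ufixed(4 downto -5): 6.5
print(float(fixed_value("11", -2, -3)))               # ufixed(-2 downto -3): 0.375
print(float(fixed_value("1111", 4, 1, signed=True)))  # sfixed(4 downto 1): -2.0
```

Exact fractions are used so that no rounding from the host language leaks into the check.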
For unsigned fixed point:

    ufixed(a downto b) + ufixed(c downto d) = ufixed(max(a,c)+1 downto min(b,d))
    ufixed(a downto b) - ufixed(c downto d) = ufixed(max(a,c)+1 downto min(b,d))
    ufixed(a downto b) * ufixed(c downto d) = ufixed(a+c+1 downto b+d)
    ufixed(a downto b) / ufixed(c downto d) = ufixed(a-d+1 downto b-c-1)
    reciprocal (ufixed(a downto b)) = ufixed(a-b+1 downto b-a-1)
    ufixed(a downto b) rem ufixed(c downto d) = ufixed(c downto d)
    ufixed(a downto b) mod ufixed(c downto d) = ufixed(a downto b)

For signed fixed point:

    sfixed(a downto b) + sfixed(c downto d) = sfixed(max(a,c)+1 downto min(b,d))
    sfixed(a downto b) - sfixed(c downto d) = sfixed(max(a,c)+1 downto min(b,d))
    sfixed(a downto b) * sfixed(c downto d) = sfixed(a+c downto b+d)
    sfixed(a downto b) / sfixed(c downto d) = sfixed(a-d downto b-c)
    reciprocal (sfixed(a downto b)) = sfixed(a-b downto b-a)
    sfixed(a downto b) rem sfixed(c downto d) = sfixed(c downto d)
    sfixed(a downto b) mod sfixed(c downto d) = sfixed(a downto b)

Unsigned Example:

    signal x : ufixed (7 downto -3);
    signal y : ufixed (2 downto -9);

If we multiply x by y we get a signal of type:

    x * y = ufixed (7+2+1 downto -3+(-9)) or ufixed (10 downto -12)

Signed Example:

    signal x : sfixed (-1 downto -3);
    signal y : sfixed (3 downto 1);

If we divide x by y we get a signal of type:

    x / y = sfixed (-1-1 downto -3-3) or sfixed (-2 downto -6)

The resize function can be used to fix the size of the output. However, rounding and saturate rules are applied:

    x <= resize (x * y, x'high, x'low);

What about an accumulator? An accumulator is a fixed-width number that you continually add to. To implement an accumulator with the fixed-point packages, you can use the resize function as follows:

    signal X : ufixed (7 downto 0);
    X <= resize (X + 1, X'high, X'low, false, false);

The first false is the round_style. Since we do not need to do any rounding, we set this to false. The second false is the overflow_style. If this is set to true, we saturate, or go to the maximum possible number. When it is set to false we wrap, meaning that the uppermost bit is dropped and the number simply recycles. Note that the default for both overflow_style and round_style is true.

All operators are overloaded for integer and real, thus you can say:

    signal x : sfixed (4 downto -5);
    signal y : real;
    Z := x + y;

When an operation includes both a fixed-point number and an integer or real, the sizing rules are modified. A real is converted to a fixed-point number of the same size as the fixed-point number passed as the other argument. Thus in the above example:

    Z := x + to_sfixed (y, 4, -5);

would be called, which would result in Z being an sfixed (5 downto -5) type. An integer is also converted to a fixed-point number, but the size is only downto 0, as an integer can never have a fraction.
Thus, if y were an integer the above example would look like:

    Z := x + to_sfixed (y, 4, 0);

which in this case would not affect the size of the result. However, this has a fairly large effect on the size of the output numbers in the multiply and divide routines.

The following operations are defined for ufixed:
    +, -, *, /, rem, mod, =, /=, <, >, >=, <=, sll, srl, rol, ror, sla, sra
The following functions are defined for ufixed:
    divide, reciprocal, scalb, maximum, minimum, find_lsb, find_msb, resize, To_01, Is_X
Conversion functions are defined for ufixed:
    to_ufixed (natural), to_ufixed (real), to_ufixed (unsigned), to_ufixed (signed), remove_sign (sfixed), to_unsigned, to_real, to_integer, to_UFix

The following operations are defined for sfixed:
    +, -, *, /, rem, mod, =, /=, <, >, >=, <=, sll, srl, rol, ror, sla, sra, abs, - (unary)
The following functions are defined for sfixed:
    divide, reciprocal, scalb, maximum, minimum, find_lsb, find_msb, resize, to_01, Is_X
Conversion functions are defined for sfixed:
    to_sfixed (natural), to_sfixed (real), to_sfixed (unsigned), to_sfixed (signed), add_sign (ufixed), to_signed, to_real, to_integer, to_Fix
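The sizing rules listed earlier in this section are easy to tabulate and check mechanically. The following Python sketch simply encodes those rules as written in this paper; the function names are made up for the illustration and nothing here is part of the VHDL packages.

```python
def ufixed_result_range(op, a, b, c, d):
    """(high, low) of ufixed(a downto b) <op> ufixed(c downto d)."""
    rules = {
        "+": (max(a, c) + 1, min(b, d)),
        "-": (max(a, c) + 1, min(b, d)),
        "*": (a + c + 1, b + d),
        "/": (a - d + 1, b - c - 1),
        "rem": (c, d),
        "mod": (a, b),
    }
    return rules[op]


def sfixed_result_range(op, a, b, c, d):
    """(high, low) of sfixed(a downto b) <op> sfixed(c downto d)."""
    rules = {
        "+": (max(a, c) + 1, min(b, d)),
        "-": (max(a, c) + 1, min(b, d)),
        "*": (a + c, b + d),
        "/": (a - d, b - c),
        "rem": (c, d),
        "mod": (a, b),
    }
    return rules[op]


# The unsigned multiply and signed divide examples from the text.
print(ufixed_result_range("*", 7, -3, 2, -9))   # (10, -12)
print(sfixed_result_range("/", -1, -3, 3, 1))   # (-2, -6)

# x : sfixed(4 downto -5) plus an integer converted to (4 downto 0).
print(sfixed_result_range("+", 4, -5, 4, 0))    # (5, -5)
```

The last line shows why adding an integer does not change the result size in that example, while the same conversion feeding "*" or "/" would.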

All of the operators are overloaded for real and integer data types. In each case the number is converted into fixed point before the operation is done. Thus the fixed-point operand must be of a format large enough to accommodate the converted input, or a "vector truncated" warning is produced. In the case of an integer, the number is converted in the form integer_width downto 0, which causes the size of the output vector to change accordingly. In these conversions overflow_style is always true (saturate), regardless of what the fixed_saturate constant is set to.

This package defines 3 constants that are used to manipulate fixed-point numbers:

    constant fixed_round : boolean := true;      -- round or truncate
    constant fixed_saturate : boolean := true;   -- saturate or wrap
    constant fixed_guard_bits : natural := 3;    -- guard bits for rounding

These constants are defaults, and can be overridden everywhere they are used.

"round_style" defaults to fixed_round (true), which turns on the rounding routines. If false, the number is truncated. If the MSB of the remainder is a '1' AND either the LSB of the unrounded result is a '1' or the lower bits of the remainder include a '1', then the result is rounded up. This is similar to the floating-point round_nearest style.

"overflow_style" defaults to fixed_saturate (true), which returns the maximum possible number if the number is too large to represent; otherwise a "wrap" routine is used, which simply truncates the top bits. Unlike the way it is done in numeric_std, the sign bit is not preserved when wrapping. Thus it is possible to get a positive result when resizing a negative number in this mode.

Finally, "guard_bits" defaults to "fixed_guard_bits", which defaults to 3. Guard bits are used in the rounding routines. If guard_bits is set to 0, the rounding is automatically turned off. These extra bits are added to the end of numbers in the division and to_real functions to make the results more accurate.
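The rounding rule and the two overflow styles described above can be modeled on plain integers. The Python sketch below is an illustration under simplifying assumptions (non-negative input for the rounding helper, integer bits only for the overflow helper); the function names are invented here and are not part of the fixed-point package.

```python
def round_drop_bits(value: int, drop: int) -> int:
    """Drop `drop` low bits of a non-negative integer, rounding as described."""
    if drop <= 0:
        return value
    kept = value >> drop
    remainder = value & ((1 << drop) - 1)
    remainder_msb = (remainder >> (drop - 1)) & 1
    remainder_rest = remainder & ((1 << (drop - 1)) - 1)
    kept_lsb = kept & 1
    # Round up when the remainder is above one half, or exactly one half
    # with an odd kept value (so the result ends up even).
    if remainder_msb == 1 and (kept_lsb == 1 or remainder_rest != 0):
        kept += 1
    return kept


def narrow_signed(value: int, width: int, saturate: bool = True) -> int:
    """Fit a signed integer into `width` two's-complement bits."""
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    if lowest <= value <= highest:
        return value
    if saturate:
        return highest if value > 0 else lowest
    wrapped = value & ((1 << width) - 1)  # keep only the low `width` bits
    return wrapped - (1 << width) if wrapped > highest else wrapped


print(round_drop_bits(0b10110, 2))           # 5.5 rounds to 6
print(round_drop_bits(0b10010, 2))           # 4.5 rounds to 4
print(narrow_signed(-9, 4, saturate=True))   # -8
print(narrow_signed(-9, 4, saturate=False))  # 7: the sign is lost when wrapping
```

The last line reproduces the caveat in the text: in wrap mode a negative input can come out positive.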
The resize function is defined as follows:

    function resize (
      arg : sfixed;
      constant integer_width : INTEGER;
      constant fraction_width : INTEGER;
      constant round_style : BOOLEAN := fixed_round;
      constant overflow_style : BOOLEAN := fixed_saturate)
      return sfixed;

In saturate mode (where overflow_style is true), if the output size is smaller than the input number then the number will saturate. An unsigned number saturates to all '1's, a positive signed number saturates to all '1's with the top bit '0', and a negative signed number saturates to all '0's with the top bit '1'. In wrap mode (where overflow_style is false) the number is truncated. In this case the top of the number is simply truncated without regard to the sign bit, so you can truncate a negative number into a positive one. The rounding routines are left intact in wrap mode. If round_style is true, the rounding routines are turned on. Otherwise the number is simply truncated.

Shift operators are functionally the same as the 1076-1993 shift operators, with the exception of the arithmetic shift operations. An arithmetic shift (sra or sla) on an unsigned number is the same as a logical shift. An arithmetic shift on a signed number is a logical shift if you are shifting left, and an arithmetic shift (sign bit replicated) if you are shifting right.

The divide function is defined as follows:

    function divide (
      l, r : sfixed;
      guard_bits : NATURAL := fixed_guard_bits;
      round_style : BOOLEAN := fixed_round)
      return sfixed;

The output is sized with the same rules as the / operator. The function allows you to override the number of guard bits and the rounding operation. Note that the output size is calculated so that overflow is not possible. The reciprocal function is defined in a very similar manner to the divide function:

    function reciprocal (
      arg : ufixed;
      guard_bits : NATURAL := fixed_guard_bits;
      round_style : BOOLEAN := fixed_round)
      return ufixed;

This function performs a 1/X operation, with the output vector following the sizing rules noted above. It is very useful for dividing by a constant. For example:

    A := B/Cons;

can be rewritten as:

    A := B*reciprocal(Cons);

Since a multiply uses less logic than a divide, this can save significant hardware resources.

The scalb function is a fixed-point version of a very common floating-point function. The function looks like this:

    function scalb (y : ufixed; N : SIGNED) return ufixed;

This function computes y * 2**N without computing 2**N, by using a shift operator. The size of the output number is the same as the input. For this function overflow and rounding are ignored, as it is treated like a shift operator. The N input is also overloaded for the type INTEGER.

The maximum and minimum functions do a compare operation and return the appropriate value. These functions are not overloaded for integer and real inputs. The sizes of the inputs do not need to match.

The find_lsb and find_msb functions are used to find the least significant or most significant bit of a fixed-point number. The function looks like the following:

    function find_msb (arg : ufixed; y : STD_ULOGIC) return INTEGER;

In this case, y can be any std_ulogic value. These functions search for the first occurrence of y in the fixed-point number. find_msb starts at the MSB (arg'high) and goes down. find_lsb starts at the LSB (arg'low) and goes up. If the value is not found in the find_msb function, then arg'low-1 is returned. If the value is not found in the find_lsb function, then arg'high+1 is returned.

to_01 and Is_X are similar in function to the numeric_std functions with the same names.

Most synthesis tools do not support any I/O format other than std_logic_vector and std_logic.
Thus functions have been created to convert between std_logic_vector and ufixed or sfixed, and vice versa:

    Uf7_3 <= to_ufixed (slv7, uf7_3'high, uf7_3'low);

and

    Slv7 <= to_slv (sf7_3);

One of the changes made to all packages in VHDL-2005 is that the read and write routines for all data types are now defined in the same package that defines that type. Thus the READ, WRITE, HREAD, HWRITE, OREAD, and OWRITE routines are defined for fixed-point data types. A "." separator is added between the integer part and the fractional part of the fixed-point number. Thus if you write out our 6.5 example from above you will get the string "00110.10000", which you can also read into that data type.

New to VHDL-2005 are the functions to_string, to_ostring and to_hstring. These are very useful in assert statements. Example:

    assert x = y
      report to_string(x) & " /= " & to_string(y)
      severity error;

Or, if you prefer to see the numbers as real numbers, you can use:

    assert x = y
      report to_string(to_real(x)) & " /= " & to_string(to_real(y))
      severity error;

MathWorks Simulink is these days the most common way to define a fixed-point DSP algorithm. It would seem to be a major step into the past, as it is schematic based. In Simulink an unsigned fixed-point number is described as ufix[14,10], which specifies a 14-bit word with 10 bits after the binary point. This translates into ufixed (3 downto -10) in the unsigned fixed-point type. The Simulink sfix notation translates much better because of the extra sign bit that must be generated. Sfix(14, 10) will translate into sfixed (4 downto -10) in the notation of the fixed_pkg.

Issues:

A negative or "to" index is flagged as an error by the fixed-point routines. Thus if you define a number as ufixed (-1 to 5) the routines will automatically error out.

String literals are also a problem. By default, if you do the following:

    Z <= a + "011011";

the index range of the fixed-point literal is undefined. The VHDL compiler will assume that this literal has the range integer'low to integer'low+5, making it very small. To avoid crashing the simulator with a 32,000 bit wide number, this also will automatically error out.
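Returning to the find_msb and find_lsb functions described earlier in this section, the search order and the "not found" return values can be modeled in a few lines. This Python sketch is only an illustration of the behavior as described in the text; the bit string stands in for a ufixed value indexed (high downto low), and the default search value of "1" is an assumption made for the example.

```python
def find_msb(bits: str, high: int, low: int, y: str = "1") -> int:
    """Index of the first `y` searching from high down; low - 1 if absent."""
    for offset, bit in enumerate(bits):
        if bit == y:
            return high - offset
    return low - 1


def find_lsb(bits: str, high: int, low: int, y: str = "1") -> int:
    """Index of the first `y` searching from low up; high + 1 if absent."""
    for offset, bit in enumerate(reversed(bits)):
        if bit == y:
            return low + offset
    return high + 1


# 6.5 stored in ufixed(4 downto -5) is "00110.10000".
print(find_msb("0011010000", 4, -5))  # 2
print(find_lsb("0011010000", 4, -5))  # -1
print(find_msb("0000", 3, 0))         # -1, one below the lowest index
print(find_lsb("0000", 3, 0))         # 4, one above the highest index
```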

Floating-point numbers:
After fixed point, the next step is floating point. Floating-point numbers are well defined by the IEEE-754 (32 and 64 bit) and IEEE-854 (variable width) specifications. Floating point has been used in processors and IP for years and is a well-understood format. There are many concepts in floating point that make it different from our familiar signed and unsigned number notations. These come from how a floating-point number is defined. Let's first take a look at a 32-bit floating-point number:

    bit:    31   30 ...... 23   22 ................... 0
    field:  S    EEEEEEEE       FFFFFFFFFFFFFFFFFFFFFFF
            +/-  exp.           fraction

Basically, a floating-point number comprises a sign bit (+ or -), a normalized exponent, and a fraction. To convert this number back into a real value, the following equation can be used:

    S * (1.0 + Fraction/Max_fraction) * 2**(exponent - exponent_base)

where exponent_base is 2**(exponent_width - 1) - 1 (127 for an 8-bit exponent), Max_fraction is 2**fraction_width, and Fraction/Max_fraction is always a number less than one. Thus for 32-bit floating point an example would be:

    0 10000001 10100000000000000000000
      = +1 * 2**(129 - 127) * (1.0 + 5242880/8388608)
      = +1 * 4.0 * 1.625
      = 6.5

There are also denormal numbers, which are numbers smaller than can be represented with this structure. The tag for a denormal number is that the exponent is 0. This forces you to invoke another formula:

    0 00000000 10000000000000000000000
      = +1 * 2**-126 * (4194304/8388608)
      = +1 * 2**-126 * 2**-1
      = 2**-127

Next are the constants that exist in the floating-point context:

    0 00000000 00000000000000000000000 = 0
    1 00000000 00000000000000000000000 = -0 (which = 0)
    0 11111111 00000000000000000000000 = positive infinity
    1 11111111 00000000000000000000000 = negative infinity

If you get a number with an infinite (all 1s) exponent and anything other than an all-zero fraction, then it is said to be a NAN, or Not A Number. NANs come in two types, signaling and non-signaling (quiet). For the purposes of these packages I chose a fraction with an MSB of 1 to be a signaling NAN and anything else to be a quiet NAN.
Thus you wind up with the following classes (or states) that each floating-point number can fall into:

    nan           Signaling NaN
    quiet_nan     Quiet NaN
    neg_inf       Negative infinity
    neg           Negative normalized nonzero
    neg_denormal  Negative denormalized
    neg_zero      -0
    zero          +0
    denormal      Positive denormalized
    normal        Positive normalized nonzero
    infinity      Positive infinity
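The 32-bit decoding formulas and the class list above can be cross-checked against a host language's own IEEE-754 handling. The Python sketch below is a model for illustration only: the function names are invented, the class names are the ones listed in the text, and the signaling/quiet split follows the convention stated in this paper (fraction MSB of 1 is treated as signaling).

```python
import struct


def fields(bits32: int):
    """Split a 32-bit pattern into (sign, exponent, fraction)."""
    return (bits32 >> 31) & 1, (bits32 >> 23) & 0xFF, bits32 & 0x7FFFFF


def classify(bits32: int) -> str:
    sign, exponent, fraction = fields(bits32)
    if exponent == 0xFF:
        if fraction == 0:
            return "neg_inf" if sign else "infinity"
        # Convention from the text: fraction MSB set means signaling.
        return "nan" if (fraction >> 22) & 1 else "quiet_nan"
    if exponent == 0:
        if fraction == 0:
            return "neg_zero" if sign else "zero"
        return "neg_denormal" if sign else "denormal"
    return "neg" if sign else "normal"


def value(bits32: int) -> float:
    """Numeric value of a finite pattern, using the two formulas in the text."""
    sign, exponent, fraction = fields(bits32)
    s = -1.0 if sign else 1.0
    if exponent == 0:
        return s * 2.0 ** -126 * (fraction / 2 ** 23)
    return s * 2.0 ** (exponent - 127) * (1.0 + fraction / 2 ** 23)


def host_value(bits32: int) -> float:
    """The same pattern as interpreted by the host's IEEE-754 support."""
    return struct.unpack(">f", struct.pack(">I", bits32))[0]


print(value(0x40D00000), classify(0x40D00000))  # 6.5 normal
print(value(0x00400000) == 2.0 ** -127)         # True
print(classify(0xFF800000))                     # neg_inf
```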

In the packages I use these states to both examine and create numbers needed for floating-point operations. This defines the type valid_fpstate. The constants zerofp, nanfp, qnanfp, pos_inffp, neginf_fp and neg_zerofp are also defined.

Rounding comes in 4 different flavors:

    Round nearest
    Round positive infinity
    Round negative infinity
    Round zero

Round nearest has the extra caveat that if the remainder is exactly 1/2 then you need to round so that the LSB of the number you get is a zero. The implementation of this feature requires two compare operations, but they can be consolidated. Round negative infinity always rounds down, and round positive infinity always rounds up. Round zero is a mix of the two, and has the effect of doing a truncation (no rounding).
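The four rounding flavors differ only in how they treat the discarded remainder, which is easiest to see when rounding a few exact values to integers. The Python sketch below is an illustration; the mode labels ("nearest", "pos_inf", "neg_inf", "zero") are shorthand chosen for this example and are not identifiers from the packages.

```python
import math
from fractions import Fraction


def round_value(x: Fraction, mode: str) -> int:
    """Round an exact value to an integer using one of the four flavors."""
    floor = math.floor(x)
    if mode == "zero":
        return math.trunc(x)  # truncation, toward zero
    if mode == "neg_inf":
        return floor          # always down
    if mode == "pos_inf":
        return math.ceil(x)   # always up
    if mode == "nearest":
        remainder = x - floor
        if remainder > Fraction(1, 2):
            return floor + 1
        if remainder < Fraction(1, 2):
            return floor
        # Exactly one half: pick the neighbor whose LSB is zero (even).
        return floor if floor % 2 == 0 else floor + 1
    raise ValueError("unknown rounding mode: " + mode)


for mode in ("nearest", "pos_inf", "neg_inf", "zero"):
    print(mode, round_value(Fraction(5, 2), mode), round_value(Fraction(-5, 2), mode))
```

For 2.5 and -2.5 the four modes give (2, -2), (3, -2), (2, -3) and (2, -2) respectively, which shows round zero agreeing with round negative infinity for positive inputs and with round positive infinity for negative ones.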

The floating point packages:


The new floating-point packages take advantage of a new feature in VHDL-2005 called package generics. The 32-bit floating-point package looks like the following:

    package fphdl32_pkg is new IEEE.fphdl_pkg
      generic map (
        fp_fraction_width => 23,            -- 23 bits of fraction
        fp_exponent_width => 8,             -- exponent 8 bits
        fp_round_style    => round_nearest, -- round nearest algorithm
        fp_denormalize    => true,          -- turn on denormalized numbers
        fp_check_error    => true,          -- turn on NAN and overflow processing
        fp_guard_bits     => 3);            -- number of guard bits

Package generics allow you to specify any data width or size of floating-point number you like. The resulting data type will be called fp. Thus you have the following use model:

    signal a, b, c : fp;
    signal x : unsigned (5 downto 0);
    constant PI : real := 3.14;
    begin
    b <= to_fp (x);
    c <= a + PI;

The actual floating-point type is defined as follows:

    type fp is array (fp_exponent_width downto -fp_fraction_width) of STD_LOGIC;

Once again we are using the negative index trick to separate the fraction part of the floating-point number from the exponent. The top bit is the sign bit (high), the next bits are the exponent (high-1 downto 0), and the negative bits are the fraction (-1 downto low). For a 32-bit representation that specification makes the number look as follows:

    0   00000000   00000000000000000000000
    8   7      0   -1                  -23
    +/- exp.       fraction

where the sign is bit 8, the exponent is contained in bits 7 downto 0 (8 bits) with bit 7 being the MSB, and the mantissa is contained in bits -1 downto -23 (32 - 8 - 1 = 23 bits) where bit -1 is the MSB. The negative index format turns out to be a very natural format for the floating-point number, as the fraction is always assumed to be a number between 1.0 and 2.0 (unless we are denormalized). Thus the implied 1.0 can be assumed on the positive side of the index, and the negative side represents the fraction less than one. Valid values for fp_exponent_width and fp_fraction_width are 3 and up.
Thus the smallest (width-wise) number that can be made is fp (3 downto -3), a 7-bit floating-point number.
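The index layout described above (sign at the high index, exponent from high-1 downto 0, fraction from -1 downto low) can be sketched as a small slicing helper. This Python model is only an illustration of the indexing; the function name split_fp is invented, and the bit string stands in for an fp value written MSB first.

```python
def split_fp(bits: str, exponent_width: int, fraction_width: int):
    """Split bits indexed (exponent_width downto -fraction_width) into fields."""
    high, low = exponent_width, -fraction_width
    if len(bits) != high - low + 1:
        raise ValueError("bit string length does not match the index range")

    def at(index: int) -> str:
        return bits[high - index]  # string position of a VHDL-style index

    sign = at(high)
    exponent = "".join(at(i) for i in range(high - 1, -1, -1))
    fraction = "".join(at(i) for i in range(-1, low - 1, -1))
    return sign, exponent, fraction


# 6.5 as fp(8 downto -23): sign, 8 exponent bits, 23 fraction bits.
pattern = "0" + "10000001" + "101" + "0" * 20
print(split_fp(pattern, 8, 23))

# The smallest legal format, fp(3 downto -3), is 7 bits wide.
print(split_fp("0101110", 3, 3))  # ('0', '101', '110')
```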

A generic called "fp_denormalize" is also provided for all operations. This parameter allows you to disable the creation of denormalized numbers. In normal (aka poor man's) floating point, the number closest to "0" consists of an exponent of "1" and a mantissa of "0" (2**-126 in the 32-bit case). Denormal numbers allow for numbers smaller than this by assuming that if the exponent is "0" then the mantissa represents a fraction less than 1. This adds a great deal of overhead to the floating-point operations, so it was left as an option that defaults to "true" in the IEEE 32 and 64 bit implementations but can be shut off.

Setting fp_check_error to false turns off overflow and NAN processing. As every number must go through this check for every operation according to IEEE-754, this represents a significant hardware savings.

fp_guard_bits are bits that are added to the end of every operation to maintain precision. Most implementations of floating point use 3 bits. Any number of bits (including 0) is valid. Note that setting the number of guard bits to 0 is similar to turning off rounding with the round_zero round style.

Defined operations for floating-point numbers are:

    Unary -, abs, +, -, *, /, rem, mod, =, /=, <, >, <=, >=

All of these operations are overloaded for integer and real types. The non floating-point operand is first converted into floating point and then the operation is performed. If the number is out of bounds for the floating-point format, the appropriate infinity or zero is returned. Errors from these routines are treated as described in IEEE-754.

Defined functions for floating-point numbers are:

    dividebyp2 (divide by a power of 2), reciprocal (1/x), maximum, minimum, to_unsigned, to_signed, to_ufixed, to_sfixed, to_real, to_integer, To_fp(SIGNED), To_fp(UNSIGNED), To_fp(ufixed), To_fp(sfixed), To_fp(integer), To_fp(real), to_01

These functions operate silently; that is to say, they give no warnings for overflow or underflow.
Errors in the to_fp routines are signaled by outputting either infinity or NAN. Errors from the routines that read FP numbers are returned the same way.

Functions recommended by IEEE-854:

    Copysign (x, y)   Returns x with the sign of y.
    Scalb (y, N)      Returns y*(2**N) (where N is an integer or SIGNED) without computing 2**N.
    Logb (x)          Returns the unbiased exponent of x.
    Nextafter (x, y)  Returns the next representable number after x in the direction of y.
    Finite (x)        Boolean, true if x is not positive or negative infinity.
    Isnan (x)         Boolean, true if x is a NAN or quiet NAN.
    Unordered (x, y)  Boolean, returns true if either x or y is some type of NAN.
    Class (x)         valid_fpstate, returns the type of floating-point number (see the valid_fpstate definition above).

Two extra functions named break_number and normalize are also provided. break_number takes a floating-point number and returns a SIGNED exponent (biased by 1) and a ufixed fixed-point number. normalize takes a SIGNED exponent and a fixed-point number and returns a floating-point number. These functions are useful for times when you want to operate on the fraction of a floating-point number without having to do the shifts on every operation.

To_slv (aliased to to_std_logic_vector and to_StdLogicVector) as well as to_fp(std_logic_vector) are used to convert between std_logic_vector and fp types. These should be used on the interface of your designs. The result of to_slv is a std_logic_vector with the length of the input fp type.

Procedures for reading and writing floating-point numbers are also included in this package. Procedures read, write, oread, owrite (octal), bread, bwrite (binary), hread and hwrite (hex) are defined. To_string, to_ostring, and to_hstring are also provided for string results. Floating-point numbers are written in the format 0:000:000 (for a 7-bit FP). They can be read as a simple string of bits, or with a "." or ":" separator.
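Several of the IEEE-854 recommended functions listed above have close analogues in ordinary software math libraries, which makes their intended behavior easy to try out. The Python sketch below uses only the standard math module; the thin wrappers scalb, logb and unordered are written for this illustration and model the listed behavior on host doubles, not on the VHDL fp type.

```python
import math


def scalb(y: float, n: int) -> float:
    """y * 2**n, computed by adjusting the exponent rather than forming 2**n."""
    return math.ldexp(y, n)


def logb(x: float) -> int:
    """Unbiased exponent of a nonzero finite x, with the mantissa in [1, 2)."""
    if x == 0.0 or not math.isfinite(x):
        raise ValueError("logb is modeled here only for nonzero finite inputs")
    mantissa, exponent = math.frexp(x)  # frexp puts the mantissa in [0.5, 1)
    return exponent - 1


def unordered(x: float, y: float) -> bool:
    """True if either argument is a NaN."""
    return math.isnan(x) or math.isnan(y)


print(scalb(6.5, 3))                   # 52.0
print(logb(6.5))                       # 2, since 6.5 = 1.625 * 2**2
print(math.copysign(3.0, -0.0))        # -3.0: the sign of -0 is still a sign
print(math.isfinite(float("inf")))     # False
print(unordered(float("nan"), 1.0))    # True
```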
Changing from one floating point format to another can be done through the resize function provided. Example:

    use ieee.fphdl32_pkg.all;
    architecture RTL of XXX is
      alias fp32 is ieee.fphdl32_pkg.fp; -- or just fp
      alias fp64 is ieee.fphdl64_pkg.fp;
      signal x : fp32;
      signal y : fp64;
    begin
      y <= ieee.fphdl64_pkg.resize (
             arg            => x,
             exponent_width => fp_exponent_width,
             fraction_width => fp_fraction_width,
             denormalize    => fp_denormalize,
             round_style    => fp_round_style);
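What a change of format does to a value can be previewed on a host machine: widening from 32 to 64 bits is exact, while narrowing from 64 to 32 bits has to round. The Python sketch below uses the standard struct module to round-trip host doubles through the 32-bit format; it is an analogy for the resize between fp formats, not a model of the VHDL function, and the helper names are made up.

```python
import struct


def to_float32(x: float) -> float:
    """Round a host double to the nearest 32-bit float and widen it back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def hex32(x: float) -> str:
    """Bit pattern of x in the 32-bit format, as hex."""
    return struct.pack(">f", x).hex()


def hex64(x: float) -> str:
    """Bit pattern of x in the 64-bit format, as hex."""
    return struct.pack(">d", x).hex()


print(hex32(6.5), hex64(6.5))   # 40d00000 401a000000000000
print(to_float32(6.5) == 6.5)   # True: 6.5 fits in 23 fraction bits
print(to_float32(0.1) == 0.1)   # False: narrowing 0.1 loses low-order bits
```

The two hex patterns for 6.5 show the same sign and fraction bits "101..." sitting behind an 8-bit versus an 11-bit exponent.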

Challenges for Synthesis vendors:


Now that we are bringing numbers that are less than 1.0 into the realm of synthesis, the type REAL becomes meaningful. To_fp (MATH_PI) will now evaluate to a string of bits. This means that synthesis vendors will now have to understand not only the real type, but also the functions in the math_real IEEE package.

Both the fixed-point and floating-point packages depend on a negative index. Basically, everything at an index less than zero is assumed to be to the right of the decimal point. By doing this we were able to avoid using record types. This also represents a challenge for some synthesis vendors, but it makes these functions portable to Verilog.

References
1. Floating point for VHDL and Verilog - David Bishop, Eastman Kodak - DVCon 2003.
2. IEEE Std 754-1985 - IEEE Standard for Binary Floating-Point Arithmetic.
3. IEEE Std 854-1987 - IEEE Standard for Radix-Independent Floating-Point Arithmetic.
4. Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic - Prof. W. Kahan, University of California.
5. What Every Computer Scientist Should Know About Floating-Point Arithmetic - David Goldberg.
6. Floating point types for Synthesis - Dr. Alex Zamfirescu.
7. RSVP based bandwidth allocation - Ananda Rangan and Vignesh Nandakumar, Washington University in St. Louis.
8. http://babbage.cs.qc.edu/courses/cs341/IEEE-754.html - IEEE-754 Floating-Point Conversion.
9. http://www.markworld.com/showfloat.html - Decompose IEEE Floating Point Number.
10. http://www.ecs.umass.edu/ece/koren/arith/simulator/FPAdd/ - Floating-point addition and subtraction.
11. IEEE 1076.3 - VHDL Standard Synthesis packages.
12. Cadence's Verilog-XL Reference Manual.
