
Question

Question 3

Represent each of the following values in the IEEE 754 32-bit format, truncating inexact values. Indicate the percent error of the value stored compared to the exact value. This will require you to convert the 32-bit floating point number back to decimal and compare it to the original value.

a) -37 1/4
b) +12 1/5

Question 4

Do Clements, Question 1.7: Register Transfer Language (RTL) notation. Reminder: This is a sequence of operations, and you should provide the final value of all affected memory locations, as well as the effect of each statement.

Explanation / Answer

A floating point number consists of three parts: a sign bit telling whether the number is positive or negative, an exponent giving its order of magnitude, and a mantissa specifying the actual digits of the number. Using single-precision floats as an example, here is the bit layout, most significant bit first:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
(1 sign bit, 8 exponent bits, 23 mantissa bits)

The value of the number is the mantissa times 2^x, where x is the exponent. Notice that we are dealing with binary fractions, so that 0.1 (the leftmost mantissa bit) means 1/2 (the place values to the right of the decimal point are 2^-1, 2^-2, etc., just as we have 10^-1, 10^-2, etc. in decimal).
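To make this concrete, here is a small C sketch that pulls the three fields out of a float; the variable names are mine, and it assumes the usual case where float and uint32_t are both 32 bits wide:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = -37.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bits */

        uint32_t sign     = bits >> 31;          /* 1 bit */
        uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits, "shift-127" encoded (see below) */
        uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, the "m" of the layout above */

        printf("0x%08X: sign=%u exponent=%u mantissa=0x%06X\n",
               (unsigned)bits, (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
        return 0;
    }

For -37.25 this prints 0xC2150000: sign=1 exponent=132 mantissa=0x150000.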

Notice further that there's a potential problem with storing both a mantissa and an exponent: 2x10^-1 = 0.2x10^0 = 0.02x10^1 and so on. This would correspond to lots of different bit patterns representing the same quantity, which would be a huge waste (it would probably also make it harder and slower to implement math operations in hardware). This problem is circumvented by interpreting the whole mantissa as being to the right of the decimal point, with an implied "1" always present to the left of the decimal. I'll refer to this as a "1.m" representation. "But wait!" you cry. "What if I don't want a 1 there?" Think of it as follows: imagine writing a real number in binary. Unless it's zero, it's gotta have a 1 somewhere. Shift your decimal point to just after the first 1, then don't bother to store that 1, since we know it's always implied to be there. Now all you have to do is set the exponent correctly to reproduce the original quantity.
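For example, in binary, 5.5 is 101.1. Shifting the point to just after the first 1 gives 1.011 x 2^2, so we store the mantissa bits 011 (followed by zeros), let the leading 1 be implied, and set the exponent to 2.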

But what if the number is zero? The good people at the IEEE standards committee solve this by making zero a special case: if every bit is zero (the sign bit being irrelevant), then the number is considered zero.

Oh dear. Take a moment to think about that last sentence. Now it would seem we have no way to represent humble 1.0, which would have to be 1.0x2^0 (an exponent of zero, times the implied one). But that would be stored as all-zero bits, which we just defined to mean zero! The way out of this is that the interpretation of the exponent bits is not straightforward either. The exponent of a single-precision float is "shift-127" encoded, meaning that the actual exponent is eeeeeeee minus 127. So thankfully, we can get an exponent of zero by storing 127 (0x7f). Of course, simply shifting the range of the exponent is not a panacea; something still has to give somewhere. What gives is the low extreme of the spectrum of representable magnitudes: the smallest exponent would otherwise be -127 (stored as all zeros), but that bit pattern is reserved for zero and its relatives (see below), so the lowest possible exponent is actually -126 (a stored value of 1, minus 127). It seems wise, to me, to give up the smallest exponent instead of giving up the ability to represent 1 or zero!
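For example, 1.0 = 1.0 x 2^0 is stored with an all-zero mantissa and exponent bits 01111111 (0 + 127 = 127): 0 01111111 00000000000000000000000, or 0x3F800000.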

Zero is not the only "special case" float. There are also representations for positive and negative infinity, and for a not-a-number (NaN) value, for results that do not make sense (for example, non-real numbers, or the result of an operation like infinity times zero). How do these work? A number is infinite if every bit of the exponent is set and every bit of the mantissa is clear (yep, we lose another exponent value), and it is NaN if every bit of the exponent is set and any mantissa bit is set. The sign bit still distinguishes +/-inf and +/-NaN.

To review, here are some sample floating point representations:
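  s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
  0 00000000 00000000000000000000000  =  0.0   (0x00000000)
  0 01111111 00000000000000000000000  =  1.0   (0x3F800000)
  1 01111111 00000000000000000000000  = -1.0   (0xBF800000)
  0 01111110 00000000000000000000000  =  0.5   (0x3F000000)
  0 10000000 10000000000000000000000  =  3.0   (0x40400000)
  0 11111111 00000000000000000000000  = +inf   (0x7F800000)
  1 11111111 00000000000000000000000  = -inf   (0xFF800000)
  0 11111111 10000000000000000000000  =  NaN   (0x7FC00000 is one common pattern)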

As a programmer, it is important to know certain characteristics of your FP representation. These are listed below, with example values for both single- and double-precision IEEE floating point numbers:

    Property                                    Value for float    Value for double
    Largest representable number                3.402823466e+38    1.7976931348623157e+308
    Smallest number without losing precision    1.175494351e-38    2.2250738585072014e-308
    Smallest representable number(*)            1.401298464e-45    5e-324
    Mantissa bits                               23                 52
    Exponent bits                               8                  11
    Epsilon(**)                                 1.1920929e-7       2.220446049250313e-16

Note that all numbers in the text of this article assume single-precision floats; doubles are included above for comparison and reference purposes.

(*)
Just to make life interesting, here we have yet another special case. It turns out that if you set the exponent bits to zero, you can represent numbers other than zero by setting mantissa bits. As long as we have an implied leading 1, the smallest number we can get is clearly 2^-126, so to get these lower values we make an exception. The "1.m" interpretation disappears, and the number is interpreted as 0.m times 2^-126: its magnitude is determined only by which bit positions are set, so if you shift the mantissa to the right, the apparent exponent will change (try it!). It may help clarify matters to point out that 1.401298464e-45 = 2^(-126-23) = 2^-149, in other words the smallest exponent minus the number of mantissa bits.

However, as I have implied in the table above, when using these extra-small numbers you sacrifice precision. When there is no implied 1, all bits to the left of the lowest set bit are leading zeros, which add no information to a number (as you know, you can write zeros to the left of any number all day long if you want). Therefore the absolute smallest representable number (1.401298464e-45, with only the lowest bit of the FP word set) has an appalling single bit of precision!
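You can see both limits on your own machine with a few lines of C (a sketch; it assumes C99's nextafterf and hardware that doesn't flush denormals to zero):

    #include <stdio.h>
    #include <float.h>   /* FLT_MIN */
    #include <math.h>    /* nextafterf */

    int main(void)
    {
        float smallest_normal   = FLT_MIN;                /* 2^-126: full 24-bit precision */
        float smallest_denormal = nextafterf(0.0f, 1.0f); /* 2^-149: one bit of precision */

        printf("smallest normal:   %g\n", smallest_normal);       /* 1.17549e-38 */
        printf("smallest denormal: %g\n", smallest_denormal);     /* 1.4013e-45 */
        printf("half of that:      %g\n", smallest_denormal / 2); /* rounds to 0 */
        return 0;
    }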

(**)
Epsilon is the smallest x such that 1+x > 1. It is the place value of the least significant bit of the mantissa when the exponent is zero (i.e., stored as 0x7f): 2^-23 for floats, available as FLT_EPSILON in float.h.
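A quick check (the volatile stores force rounding back to float, in case your compiler evaluates float arithmetic in higher precision):

    #include <stdio.h>
    #include <float.h>   /* FLT_EPSILON == 2^-23 */

    int main(void)
    {
        volatile float just_enough = 1.0f + FLT_EPSILON;     /* strictly > 1 */
        volatile float too_little  = 1.0f + FLT_EPSILON / 2; /* rounds back to 1 */
        printf("%d %d\n", just_enough > 1.0f, too_little == 1.0f); /* prints: 1 1 */
        return 0;
    }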

A common idiom is to consider two floats equal if the distance between them is less than some small cutoff. People usually call this distance EPSILON, even though it is not the epsilon of the FP representation. I'll use EPSILON (all caps) to refer to such a constant, and epsilon (lower case) to refer to the actual epsilon of FP numbers.
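In C the idiom usually looks something like this (the function name and the tolerance here are arbitrary):

    #include <math.h>

    #define EPSILON 1.0e-7f   /* supposedly "close enough" for this application */

    /* the classic fabs-less-than idiom */
    int nearly_equal(float a, float b)
    {
        return fabsf(a - b) < EPSILON;
    }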

This technique sometimes works, so it has caught on and become idiomatic. In reality, though, it can be very bad, and you should be aware of whether it is appropriate for your application or not. The problem is that it does not take the exponents of the two numbers into account; it assumes that the exponents are close to zero. Why does that matter? Because the precision of a float is not determined by magnitude ("On this CPU, results are always within 1.0e-7 of the answer!") but by the number of correct bits. The EPSILON above is a tolerance; it is a statement of how much precision you expect in your results. And precision is measured in significant digits, not in magnitude; it makes no sense to talk of "1.0e-7 of precision". A quick example makes this obvious: say we have the numbers 1.25e-20 and 2.25e-20. Their difference is 1e-20, much less than EPSILON, but clearly we do not mean them to be equal. If, however, the numbers were 1.2500000e-20 and 1.2500001e-20, then we might intend to call them equal.

The take-home message is that when you're defining how close is close enough, you need to say how many significant digits you want to match. Answering this question might require some experimentation; try out your algorithm and see how close "equal" results actually get. If you are only dealing with numbers whose exponents are near zero, then comparing with an EPSILON as above might be just the thing. If not, read on.

Here is a routine that can tell you if two floats are equal to a certain number of significant decimal digits:
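What follows is a minimal sketch of such a routine, matching the description below; the name flt_equals is illustrative, and it relies on C's frexpf, which splits a float into a mantissa in [0.5, 1) and a power-of-two exponent:

    #include <math.h>   /* frexpf, fabsf, powf */

    /* nonzero if a and b agree to sigfigs significant decimal digits */
    int flt_equals(float a, float b, int sigfigs)
    {
        int a_exp, b_exp;
        float a_mant, b_mant;

        if (a == b)
            return 1;                 /* exactly equal: done */

        a_mant = frexpf(a, &a_exp);   /* mantissa with the exponent forced to 0 */
        b_mant = frexpf(b, &b_exp);
        if (a_exp != b_exp || (a < 0.0f) != (b < 0.0f))
            return 0;                 /* different exponents or signs: cannot match */

        /* the fabs-less-than idiom, on exponent-free mantissas;
           0.5 * 10^-sigfigs provides for rounding */
        return fabsf(a_mant - b_mant) < 0.5f * powf(10.0f, (float)-sigfigs);
    }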

It's worthwhile to take a look at ieee754.h if it exists on your system.

How does this function work? First, if the numbers are exactly equal, then we know the answer right away. Next, we check the exponents and signs; if the numbers have different exponents then they cannot be equal (because of the 1.m representation). In the rest of the function, we use the old fabs-less-than idiom, but we validate its assumption by actually setting the exponents of the two numbers to zero. We compare against 0.5*10^-sigfigs, which provides for rounding: imagine sigfigs==2; then the cutoff is 0.005, and two mantissas within 0.005 of each other agree to two significant decimal digits.
