Calculate by hand the values of double precision(dp) dp(9.4) and dp(.4) as well
Question
Calculate by hand the values of double precision (dp) dp(9.4) and dp(0.4), as well as the associated absolute and relative roundoff for each. Finally, calculate dp(9.4 - 9 - 0.4) and dp(9.4 - 0.4 - 9), and then verify these values in MATLAB. Explain why (in this example) the order in which the subtractions are done impacts the value we see.
Explanation / Answer
Double precision (dp) refers to a type of floating-point number that has more precision (that is, more significant digits) than a single-precision number. The term double precision is something of a misnomer because the precision is not exactly doubled: IEEE single precision carries 24 significant bits, while double precision carries 53. The word double derives from the fact that a double-precision number uses twice as many bits as a single-precision number: 64 bits instead of 32.
The extra bits increase not only the precision but also the range of magnitudes that can be represented. The exact amount by which the precision and range are increased depends on the format the program uses to represent floating-point values. Most computers use the standard IEEE 754 floating-point format, in which a double has 1 sign bit, 11 exponent bits, and 52 stored mantissa bits.
dp(9.4)
Most accurate representation = 9.40000000000000035527136788005E0
0x4022CCCCCCCCCCCD = 01000000 00100010 11001100 11001100
11001100 11001100 11001100 11001101
dp(0.4)
Most accurate representation = 4.00000000000000022204460492503E-1
0x3FD999999999999A = 00111111 11011001 10011001 10011001
10011001 10011001 10011001 10011010
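The hexadecimal patterns and decimal expansions above can be cross-checked on any machine that uses IEEE 754 doubles. The following is a minimal sketch in Python (an assumption of this example; the question itself asks for MATLAB), relying on the fact that a Python float is a 64-bit double. The helper name double_hex is made up for illustration.

```python
import struct
from decimal import Decimal

def double_hex(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of x as 16 uppercase hex digits."""
    # ">d" packs x as a big-endian double, so the sign bit comes first.
    return struct.pack(">d", x).hex().upper()

print(double_hex(9.4))  # 4022CCCCCCCCCCCD
print(double_hex(0.4))  # 3FD999999999999A

# Decimal(float) converts the stored double exactly, with no further rounding,
# so these print the full decimal expansion of dp(9.4) and dp(0.4).
print(Decimal(9.4))
print(Decimal(0.4))
```

Both stored values come out slightly larger than 9.4 and 0.4, which is the roundoff the question asks about.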
http://babbage.cs.qc.cuny.edu/IEEE-754.old/Decimal.html
dp (9.4) :
solution :
step 1:
First, convert to binary (base 2) the integer part: 9. Divide the number repeatedly by 2, keeping track of each remainder, until we get a quotient that is equal to zero:
division = quotient + remainder;
9 ÷ 2 = 4 + 1;
4 ÷ 2 = 2 + 0;
2 ÷ 2 = 1 + 0;
1 ÷ 2 = 0 + 1;
step 2:
Construct the base 2 representation of the integer part of the number, by taking all the remainders starting from the bottom of the list constructed above:
9(10) = 1001(2)
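The repeated-division procedure in steps 1 and 2 can be sketched in a few lines of Python. The function name int_to_binary is hypothetical and only mirrors the hand calculation; it is not part of the original answer.

```python
def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder of n / 2
        remainders.append(str(remainder))
    # The remainders are read from the bottom of the list to the top.
    return "".join(reversed(remainders))

print(int_to_binary(9))  # 1001
```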
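step 3:
Convert to binary (base 2) the fractional part: 0.4. Multiply it repeatedly by 2, keeping track of the integer part of each result. For 0.4 the fractional part never becomes zero (the pattern 0110 repeats forever), so we stop once we have enough bits to fill the 52 bit mantissa plus a few extra bits to decide the rounding; 53 steps are shown here: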
#) multiplying = integer + fractional part;
1) 0.4 × 2 = 0 + 0.8;
2) 0.8 × 2 = 1 + 0.6;
3) 0.6 × 2 = 1 + 0.2;
4) 0.2 × 2 = 0 + 0.4;
5) 0.4 × 2 = 0 + 0.8;
6) 0.8 × 2 = 1 + 0.6;
7) 0.6 × 2 = 1 + 0.2;
8) 0.2 × 2 = 0 + 0.4;
9) 0.4 × 2 = 0 + 0.8;
10) 0.8 × 2 = 1 + 0.6;
11) 0.6 × 2 = 1 + 0.2;
12) 0.2 × 2 = 0 + 0.4;
13) 0.4 × 2 = 0 + 0.8;
14) 0.8 × 2 = 1 + 0.6;
15) 0.6 × 2 = 1 + 0.2;
16) 0.2 × 2 = 0 + 0.4;
17) 0.4 × 2 = 0 + 0.8;
18) 0.8 × 2 = 1 + 0.6;
19) 0.6 × 2 = 1 + 0.2;
20) 0.2 × 2 = 0 + 0.4;
21) 0.4 × 2 = 0 + 0.8;
22) 0.8 × 2 = 1 + 0.6;
23) 0.6 × 2 = 1 + 0.2;
24) 0.2 × 2 = 0 + 0.4;
25) 0.4 × 2 = 0 + 0.8;
26) 0.8 × 2 = 1 + 0.6;
27) 0.6 × 2 = 1 + 0.2;
28) 0.2 × 2 = 0 + 0.4;
29) 0.4 × 2 = 0 + 0.8;
30) 0.8 × 2 = 1 + 0.6;
31) 0.6 × 2 = 1 + 0.2;
32) 0.2 × 2 = 0 + 0.4;
33) 0.4 × 2 = 0 + 0.8;
34) 0.8 × 2 = 1 + 0.6;
35) 0.6 × 2 = 1 + 0.2;
36) 0.2 × 2 = 0 + 0.4;
37) 0.4 × 2 = 0 + 0.8;
38) 0.8 × 2 = 1 + 0.6;
39) 0.6 × 2 = 1 + 0.2;
40) 0.2 × 2 = 0 + 0.4;
41) 0.4 × 2 = 0 + 0.8;
42) 0.8 × 2 = 1 + 0.6;
43) 0.6 × 2 = 1 + 0.2;
44) 0.2 × 2 = 0 + 0.4;
45) 0.4 × 2 = 0 + 0.8;
46) 0.8 × 2 = 1 + 0.6;
47) 0.6 × 2 = 1 + 0.2;
48) 0.2 × 2 = 0 + 0.4;
49) 0.4 × 2 = 0 + 0.8;
50) 0.8 × 2 = 1 + 0.6;
51) 0.6 × 2 = 1 + 0.2;
52) 0.2 × 2 = 0 + 0.4;
53) 0.4 × 2 = 0 + 0.8;
step 4:
Construct the base 2 representation of the fractional part of the number, by taking all the integer parts of the multiplying operations, starting from the top of the constructed list above:
0.4(10) = 0.0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0(2)
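The repeated-multiplication procedure in steps 3 and 4 can be sketched as follows. This is an illustrative Python version, assuming exact rational arithmetic through the standard fractions module so that the sketch itself introduces no rounding; the name fraction_bits is hypothetical.

```python
from fractions import Fraction

def fraction_bits(frac: Fraction, count: int) -> str:
    """Return the first `count` binary digits of a fraction in [0, 1)."""
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)       # integer part of the product: 0 or 1
        bits.append(str(bit))
        frac -= bit           # keep only the fractional part
    return "".join(bits)

# 0.4 is exactly 2/5; its binary expansion repeats the block 0110 forever.
print(fraction_bits(Fraction(2, 5), 53))
```

The 53 digits printed are thirteen copies of 0110 followed by a single 0, the same digits as in the list above.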
Step 5 :
Normalize the binary representation of the number by shifting the binary point 3 positions to the left, so that only one nonzero digit remains to the left of it:
9.4(10) =
1001.0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0(base 2) =
1001.0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0110 0(base 2) × 2^0 =
1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100(base 2) × 2^3
Up to this moment, there are the following elements that would feed into the 64 bit double precision IEEE 754 binary floating point representation:
Sign: 0 (a positive number)
Exponent (unadjusted): 3
Mantissa (not normalized): 1.0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100
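As a rough check of the normalization in step 5, the sketch below finds the exponent that brings a positive value into the range from 1 (inclusive) to 2 (exclusive). It is written in Python with exact fractions; the function name normalize is an assumption of this example.

```python
from fractions import Fraction

def normalize(value: Fraction):
    """Return (exponent, significand) with 1 <= significand < 2
    and value == significand * 2**exponent. Requires value > 0."""
    if value <= 0:
        raise ValueError("value must be positive")
    exponent = 0
    while value >= 2:   # shift the binary point to the left
        value /= 2
        exponent += 1
    while value < 1:    # shift the binary point to the right
        value *= 2
        exponent -= 1
    return exponent, value

print(normalize(Fraction(47, 5)))  # 9.4 = 1.175 * 2**3
print(normalize(Fraction(2, 5)))   # 0.4 = 1.6 * 2**-2
```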
step 6:
Adjust the exponent in 11 bit excess/bias notation and then convert it from decimal (base 10) to 11 bit binary, by using the same technique of repeatedly dividing by 2:
Exponent (adjusted) =
Exponent (unadjusted) + 2^(11-1) - 1 =
3 + 2^(11-1) - 1 =
(3 + 1023)(base 10) = 1026(base 10)
division = quotient + remainder;
1026 ÷ 2 = 513 + 0;
513 ÷ 2 = 256 + 1;
256 ÷ 2 = 128 + 0;
128 ÷ 2 = 64 + 0;
64 ÷ 2 = 32 + 0;
32 ÷ 2 = 16 + 0;
16 ÷ 2 = 8 + 0;
8 ÷ 2 = 4 + 0;
4 ÷ 2 = 2 + 0;
2 ÷ 2 = 1 + 0;
1 ÷ 2 = 0 + 1;
Exponent (adjusted) =
1026(10) =
100 0000 0010(2)
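The exponent adjustment in step 6 amounts to adding the bias 2^(11-1) - 1 = 1023 and writing the result with 11 bits. A minimal Python sketch follows; the names EXPONENT_BITS, BIAS and biased_exponent_field are made up for this illustration.

```python
EXPONENT_BITS = 11
BIAS = 2 ** (EXPONENT_BITS - 1) - 1  # 1023 for double precision

def biased_exponent_field(unbiased: int) -> str:
    """Return the 11-bit exponent field for a given unbiased exponent."""
    # Format the biased value in binary, zero-padded to EXPONENT_BITS digits.
    return format(unbiased + BIAS, f"0{EXPONENT_BITS}b")

print(biased_exponent_field(3))   # exponent field of 9.4 and 9
print(biased_exponent_field(-2))  # exponent field of 0.4
```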
step 7:
Normalize the mantissa: remove the leading (leftmost) bit, since it is always 1 (along with the binary point), then reduce its length to 52 bits. IEEE 754 does not simply drop the excess bits; by default it rounds to the nearest representable value. Here the excess bits begin with 1100, which is more than half a unit in the last place, so the mantissa is rounded up and its final group changes from 1100 to 1101. This rounding is where the roundoff error comes from:
Mantissa (normalized) =
1. 0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 =
0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101
Conclusion:
The three elements that make up the number's 64 bit double precision IEEE 754 binary floating point representation:
Sign (1 bit) =
0 (a positive number)
Exponent (11 bits) =
100 0000 0010
Mantissa (52 bits) =
0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101
Number 9.4, converted from the decimal system (base 10) to 64 bit double precision IEEE 754 binary floating point:
0 - 100 0000 0010 - 0010 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 1101
9.4(10) =
0-10000000010-0010110011001100110011001100110011001100110011001101
(64 bits IEEE 754)
This agrees with the hexadecimal pattern 0x4022CCCCCCCCCCCD given above.
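To cross-check the three fields, one option is to read them back from the machine's own representation of the number. The sketch below is in Python and assumes the platform stores floats as IEEE 754 doubles; the function name ieee754_fields is hypothetical.

```python
import struct

def ieee754_fields(x: float):
    """Split a double into its (sign, exponent, mantissa) bit strings."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    pattern = format(bits, "064b")
    # 1 sign bit, 11 exponent bits, 52 mantissa bits.
    return pattern[0], pattern[1:12], pattern[12:]

sign, exponent, mantissa = ieee754_fields(9.4)
print(sign, exponent, mantissa)
```

For 9.4 the mantissa read back this way ends in 1101, not 1100, because the hardware rounds to nearest instead of truncating.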
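Repeat the same steps for 0.4 and 9. For 0.4 the excess bits again round the mantissa up, so its last group is 1010 (matching 0x3FD999999999999A above), while 9 is represented exactly:
(0.4) (base 10) = 0-01111111101-1001100110011001100110011001100110011001100110011010
and
9 (base 10) = 0-10000000010-0010000000000000000000000000000000000000000000000000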
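The remaining parts of the question (the roundoff errors and the two subtraction orders) can be checked numerically. The sketch below uses Python, not MATLAB, on the assumption that both use IEEE 754 doubles for ordinary arithmetic, so the results should agree. The helper name roundoff is made up for this example, and exact fractions are used so that measuring the error adds no error of its own.

```python
from fractions import Fraction

def roundoff(text: str):
    """Return (absolute, relative) roundoff of storing the decimal `text` as a double."""
    exact = Fraction(text)          # exact rational value, e.g. 47/5 for "9.4"
    stored = Fraction(float(text))  # exact value of the nearest double
    absolute = abs(stored - exact)
    return absolute, absolute / abs(exact)

abs_94, rel_94 = roundoff("9.4")
abs_04, rel_04 = roundoff("0.4")
print(float(abs_94), float(rel_94))  # 0.2 * 2**-49 and 2**-49 / 47
print(float(abs_04), float(rel_04))  # 0.1 * 2**-52 and 2**-54

first = 9.4 - 9 - 0.4   # evaluated left to right: (9.4 - 9) - 0.4
second = 9.4 - 0.4 - 9  # evaluated left to right: (9.4 - 0.4) - 9
print(first)   # 3 * 2**-53, about 3.33e-16
print(second)  # 0.0
```

The order matters because of where rounding happens. In the first order, dp(9.4) - 9 is computed exactly and leaves 0.4 + 0.2 × 2^-49; subtracting dp(0.4) = 0.4 + 0.1 × 2^-52 is also exact and exposes the difference of the two roundoff errors, 3 × 2^-53. In the second order, dp(9.4) - dp(0.4) = 9 + 1.5 × 2^-52 must itself be rounded to a double near 9, where neighboring doubles are 2^-49 apart, so the small excess is rounded away and the result is exactly 9; subtracting 9 then gives 0.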