18. How does a computer represent non-integer quantities?#

18.1. Prepare work#

  1. In fractionalbinary.md use 8 bits to represent the following numbers, using 4 bits for the integer part as usual (8, 4, 2, 1) and the other 4 bits for the fractional part (1/2, 1/4, 1/8, 1/16):

  • 1.5

  • 12.75

  • 7.5625

  • 15.9375

  2. Add to your file some notes about the limitations of representing non-integer values this way. How much would using more bits help, and which limitations are not resolved by adding more bits? For example, how could you represent .1?

  • 1.5 = 0001.1000

  • 12.75 = 1100.1100

  • 7.5625 = 0111.1001

  • 15.9375 = 1111.1111
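We can double check these answers with a little Python; this is a minimal sketch, where fixed_4_4 is a hypothetical helper that reads this 4.4 format:

def fixed_4_4(bits):
    # the integer part is plain binary; the fraction bits count sixteenths
    whole, frac = bits.split('.')
    return int(whole, 2) + int(frac, 2) / 16

for bits in ['0001.1000', '1100.1100', '0111.1001', '1111.1111']:
    print(bits, '=', fixed_4_4(bits))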

Today's topics:

  • floating point

  • IEEE double format

18.2. Prelude#

What do we mean by representation?

18.2.1. Negative numbers#

We use one bit to represent the sign, and then we have two choices for the rest of the bits:

| representation | how to negate | how to read |
|---|---|---|
| ones complement | flip all the bits | note the sign, flip all the bits back |
| two’s complement | flip the bits, add 1 | note the sign, flip all the bits, add one |

Try it out on integer exposed

For example, set it to 3 and then use the ~ button to flip the bits; then add one by flipping the 0 in the ones place to a 1. That ends up at -3.

So in an 8-bit signed integer, 3 is 00000011 and -3 is 11111101.
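We can double check this with a little Python, masking with 0xFF to keep only the low 8 bits (a quick sketch):

format(3, '08b')          # '00000011'
format(-3 & 0xFF, '08b')  # '11111101', the two's complement pattern for -3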

Try a few other cases in integer exposed

Two’s complement is the most common.

18.3. What about a fixed point for fractions?#

Let’s experiment with an 8-bit representation with 1 bit for the sign, the next 4 bits for the integer part, and the last 3 bits for the fractional part (a tenths digit).

For example, 8.1 would be:

0-1000-001

sign 0 (positive), integer part 1000 = 8, fractional part 001 = 1 tenth

18.4. How many total numbers can we represent?#

In this scheme we can represent 0-7 in the fractional part (the right hand side) and 0-15 in the integer part (the left hand side), so we get:

We can visualize values with memory spy.

We can visualize this in code using a little Python if we want:

import itertools

# every combination of integer part (0-15) and tenths digit (0-7, since 3 bits)
num_list = [str(n + f) for n, f in itertools.product(range(16), [i / 10 for i in range(8)])]
', '.join(num_list) + ', '.join(['-' + n for n in num_list])
len(num_list) * 2

This is a far smaller range of numbers than we could represent with 1 bit for sign and 7 bits to represent integers, even though the total number of bit patterns is the same:

(2**7)*2

Another way we could represent numbers with a fixed point is to use base two fractions: \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\), etc.

In this fixed point we would have, for example: 0101.1010 would be \(0 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 + 1 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} + 0 \cdot 2^{-4} = 4 + 1 + \frac{1}{2} + \frac{1}{8} = 5 + \frac{5}{8} = 5.625\)

In this case we still have a small range of numbers, but it’s a different set of possible numbers.
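For example, with 4 base two fraction bits, every representable fraction is some number of sixteenths, so .1 can still never be exact; a minimal sketch:

# every value representable with 4 binary fraction bits is k/16 for some k
sixteenths = [k / 16 for k in range(16)]
min(sixteenths, key=lambda v: abs(v - 0.1))   # 0.125 is the closest we can get to .1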

18.5. Floating Point Notation#

We can write numbers in many different forms. We have written integers in many different bases so far.

For example in scientific contexts, we often write numbers (in base 10) like:

\[ 3.45 \times 10^2 = 345 \]

We can use a similar form to represent numbers in binary, using base 2 instead of base 10.
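For example, the number ten written in binary scientific notation:

\[ 1.01_2 \times 2^3 = 1010_2 = 10_{10} \]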

18.6. Floating point numbers are not all exact#

Let’s look at an example where we add .1 together 3 times:

.1 + .1 + .1

However, if we check whether it equals .3, it does not:

.1 + .1 + .1 == .3

This is because the floating point representation is an approximation: many numbers, including .1, cannot be expressed exactly, so the closest representable value is stored instead. The display usually rounds to a few significant digits, which can hide the difference.
Even rounding does not make the equality work:

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)

Important

This matters because we cannot rely on equality tests for floats

Instead, we use subtraction and a tolerance:

(.1 + .1 + .1 - .3) < .0001
True
round(.1 + .1 + .1, 1)
0.3
num = .1
num
0.1
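The standard library can also do the tolerance comparison for us; math.isclose uses a relative tolerance by default:

import math

math.isclose(.1 + .1 + .1, .3)   # True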

18.7. Floating point IEEE standard#

Now, let’s see how these numbers are actually represented.

float.exposed

IEEE Floating Point is what basically everyone uses, but it is technically a choice hardware manufacturers can make.

  • Initially released in 1985

  • Revised in 2008 after a 7-year process that expanded the standard

  • Revised in 2019 after a 4-year process that mostly added clarifications

  • The next revision is projected for 2028.

Note

IBM mainframes use their own representation based on hexadecimal.

This is a double precision float, or binary64 in the current standard.

It was officially called double in the original standard, so it is commonly still called that, even though it is no longer the formally correct name.

In this case we use 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction part.

float bit image

18.7.1. How do we read a number like this?#

If the sign bit is \(s\), the number represented by the exponent bits is \(e\), and the 52 fraction bits are numbered \(b_{51}\) down to \(b_0\) from left to right, then the value is:

\[ (-1)^s \times \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023} \]

Note the \(2^{-i}\): we are working with fractions instead of integers in the sum.

So if we had:

0 01111111111 0000000000000000000000000000000000000000000000000000

it would represent:

\[ (-1)^0 \times \left(1 + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + \ldots + 0 \cdot 2^{-51} + 0 \cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0 \]

So, for

0 01111111111 0100000000000000000000000000000000000000000000000000

following the equation, we get:

\[ (-1)^0 \times \left(1 + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \ldots + 0\cdot 2^{-51} + 0\cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times \left(1 + \frac{1}{4}\right) \times 2^0 = 1.25 \times 1 = 1.25 \]
0b01111111111           # 1023: the exponent bits from the examples above
float.hex(1.5)          # '0x1.8000000000000p+0'
0b01111111000 - 1023    # -7: an exponent pattern of 01111111000 would mean 2**-7
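We can also check our reading of the 1.25 example against Python itself. This sketch uses the struct module to reinterpret the 64 bits as a double:

import struct

bits = '0' + '01111111111' + '0100000000000000000000000000000000000000000000000000'
# pack the bit pattern into 8 bytes, then unpack those bytes as a float64
struct.unpack('>d', int(bits, 2).to_bytes(8, 'big'))[0]   # 1.25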

18.7.2. Why use e-1023 for the exponent?#

2**11
2048

So 1023 is one less than half of 2048. This means we split the possible range of exponents between positive and negative powers, so the whole expression \(2^{e-1023}\) ranges from large (\(2^{1024}\)) to small (\(2^{-1023}\)).
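We can see those limits reflected in what Python reports for the largest and smallest positive normal doubles:

import sys

sys.float_info.max   # about 1.8e308, just under 2**1024
sys.float_info.min   # about 2.2e-308, which is 2**-1022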

18.7.3. How do we get numbers into this form?#

Now, let’s return to our example of .1.

First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.

Let’s take the equation we had from the standard:

\[ \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}\]

Think about the fraction part and how we add fractions: by reaching a common denominator. Since the denominators are all powers of two, we can use the largest one, \(2^{52}\).

We can expand the sum to a different notation for visual clarity:

\[ \left( 1 + \frac{b_{51}}{2^{1}} + \frac{b_{50}}{2^{2}} + \cdots + \frac{b_{1}}{2^{51}} + \frac{b_{0}}{2^{52}} \right) \times 2^{e-1023} \]

And apply a common denominator:

\[ \left(\frac{2^{52}}{2^{52}} + \frac{2^{51} b_{51}}{2^{52}} + \frac{2^{50} b_{50}}{2^{52}} + \cdots + \frac{2^{1} b_{1}}{2^{52}} + \frac{2^0 b_{0}}{2^{52}} \right) \times 2^{e-{1023}} \]

So the numerator becomes a binary number with 53 bits (the 52 fraction bits plus the leading 1) over a denominator of \(2^{52}\). Let’s call that numerator \(J\), so we have:

\[ \frac{J}{2^{52}} \times 2^{e-{1023}} \]

We can then combine the powers of 2:

\[ \frac{J}{2^{52-e+1023}} = \frac{J}{2^{N}} \]

So, to return to the .1 that we want to represent: we can write it as a fraction and then approximate it in the form above.

\[\frac{1}{10} \approx \frac{ J }{2^N}\]

(the \(\approx\) means approximately equal to)

After some manipulation:

\[J \approx \frac{2^N}{10}\]

Since we want to use exactly 53 bits to represent \(J\), we want \(\frac{2^N }{10}\) to be greater than or equal to \(2^{52}\) and less than \(2^{53}\).

\[ 2^{52} \leq \frac{2^N }{10} < 2^{53} \]

To figure out the \(N\) above, let’s focus on the center term and the 10. First we note that the center is not a power of two, but the bounds both are. So we find the powers of two just above and just below 10:

\[8<10<16\]

which can be re-written: \(2^3<10<2^4\)

and again we can take reciprocals to rewrite that (which flips the inequalities):

\[ \frac{1 }{2^4} < \frac{1 }{10} < \frac{1 }{2^3} \]

(a larger denominator means a smaller number when the numerators are the same)

We can then multiply all three parts by \(2^N\):

\[ \frac{2^N }{2^4} < \frac{2^N }{10} < \frac{2^N }{2^3} \]

and simplify the outer ones to only have a single exponent

\[ 2^{N-4} < \frac{2^N }{10} < 2^{N-3} \]

so, returning to our original goal:

\[ 2^{52} \leq \frac{2^N }{10} < 2^{53} \]

and combining it with the above, we can say \(2^{52}=2^{N-4}\) and \(2^{N-3}=2^{53}\); since both of these work out to \(N=56\), that is the best \(N\).

We can confirm this in a calculator:

2**52 <= 2**56/10 < 2**53
True

Now we can get the number we will use for \(J\). We want our fraction to be as close to .1 as possible, so we take the quotient and remainder; the quotient is the number we will use, and the remainder tells us whether to round up.

q,r = divmod(2**56,10)

Then we check the remainder to decide if we should round up by 1 or not.

r

\( 6 > 5 = \frac{10}{2}\) so we round up

q+1

Then we can check the length in bits of that number:

len(bin(q+1)) - 2
53

bin(q+1)

We need \(52-e+1023 = 56\), so \(e\) is:

52-56+1023

Now let’s confirm what we found with the actual representation.

Python doesn’t provide a binary representation of floats, but it does provide a hex representation.

float.hex(.1)

That matches this for the part before the p: 0b1 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 1010 (the hex digits 1.999999999999a written out in binary).

After the p is the exponent: \(-4 = 1019-1023\), which matches the approximation we found.

We can also test by computing

(q+1)/2**56 == .1

This confirms that the approximation we found is the same as the float representation of .1. Displaying 17 digits also shows that the stored value is not exactly .1:

format(.1, '.17f')
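The decimal module can also display the exact value that the float for .1 actually stores:

from decimal import Decimal

Decimal(.1)   # Decimal('0.1000000000000000055511151231257827021181583404541015625')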

18.8. Questions#

18.8.1. How did they find the equation for the floating point standard?#

Over time, people tried different things and this one stuck.

More detail on the development could be a good explore topic.

18.8.2. Did we guess and check to find the 56?#

No, we derived it. I added some extra detail on the steps above.

18.8.3. Why are there different formats for binary floating point, given the IEEE standard?#

Standards are typically, but not always, followed; there are generally multiple ways to do something.

IBM mainframes use another format; more detail is a good explore topic.

18.8.4. What is the difference between float and double?#

Both are in the same standard: float historically was 32-bit and double was 64-bit, but now that computers mostly do 64-bit processing, the distinction is ambiguous.

18.8.5. Where can I learn more?#

The Wikipedia article on floating point is a good place to start, and then its references.

18.8.6. How are legacy systems able to interact with current ones in some cases, even if they may use fewer bits?#

Data gets lost, and not all systems are actually compatible.