20. How does a computer represent non-integer quantities?#

20.1. Prelude#

What do we mean by representation?

We talked about how file permissions are represented: `rwxrwxrw-`. The leftmost three characters are the user permissions, the next three are the group permissions, and the rightmost three characters are the permissions for all other users.

If we think about the values that represent each permission, what are the options for each one of them? For each set of permissions we have either r or -, either w or -, and either x or -.

What is another way that we can represent these permissions?

Since each character has two options, we can replace each one with a binary digit.

So for `rwxrwxrw-` we can write `111111110`.

What else?

Well, we are already grouping them in sets of 3. We know that when converting from binary to octal, each group of three bits can be replaced by one octal digit.

So it makes sense to represent the permissions in octal as well:

`rwx_rwx_rw-` \(\rightarrow\) `111_111_110` \(\rightarrow\) `7_7_6`

In this, we interpret each bit to represent two different things:

  • read, write, execute permissions

  • 4,2,1 place values
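
We can confirm the binary-to-octal conversion with a little Python (a quick sketch):

oct(0b111111110)   # rwxrwxrw- as a binary literal, shown in octal
'0o776'

This is the same 776 you would pass to chmod.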

Sometimes, we mix representations within a single string of bits:

  • object instructions could use the first few bits to represent the operation and the rest to represent a value (so we can encode an instruction in a single register)

  • negative numbers

20.1.1. A toy object instruction#

Let's say we have a CPU that can do 4-bit operations and has 4 control bits (that determine what operation it will do across the ALU, plus controls for the A and D registers and the PC).

Then we might have an 8-bit instruction ROM.

The operation @3 might be represented as 01000011 if @ is indicated by 0100 on the controller, and then the last 4 bits, 0011, represent the value 3.
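
A minimal sketch of packing such an instruction in Python, assuming the 0100 opcode from above:

OP_AT = 0b0100                       # assumed control bits for the @ operation
value = 3                            # the 4-bit operand
instruction = (OP_AT << 4) | value   # opcode in the high 4 bits, value in the low 4
format(instruction, '08b')
'01000011'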

20.1.2. Negative numbers#

Let's visualize the 8-bit integer 3: `00000011`. What about -3?


We use one bit to represent the sign, and then we have two choices for the rest of the bits:

| representation | how to negate | how to read |
|---|---|---|
| one's complement | flip all the bits | note the sign, flip all the bits back |
| two's complement | flip the bits, add 1 | note the sign, flip all the bits, add one |

Try it out on integer.exposed.

For example, set it to 3 and then use the ~ button to flip the bits, then add one by flipping the 0 in the ones place to a 1; that ends up at -3.

So in an 8-bit signed integer, 3 is `00000011` and -3 is `11111101`.
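
We can check this with a little Python, masking to 8 bits since Python integers are not fixed width (a quick sketch):

x = 3
neg = (~x + 1) & 0xFF                 # two's complement negation, masked to 8 bits
format(x, '08b'), format(neg, '08b')
('00000011', '11111101')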

Try a few other cases in integer.exposed.

Two’s complement is the most common.

Note

The following are not rules. They are thought processes that lead to the rules adopted later.

20.2. What about a fixed point for fractions?#

Let’s experiment with an 8-bit representation with 1 bit for the sign, the next 4 bits for the integer part of the number, and the last 3 bits for the fractional part.

With that rule, what does the following represent?

01000001

would be

0-1000-001

positive; the integer part `1000` is 8 and the fraction part `001` is 1, placed after the point: 8.1
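
A sketch of decoding a bit string under this toy rule (the variable names here are made up for illustration):

s = '01000001'
sign = '-' if s[0] == '1' else '+'
whole = int(s[1:5], 2)                # the next 4 bits, read as an integer
frac = int(s[5:], 2)                  # the last 3 bits, read as a digit after the point
sign + str(whole) + '.' + str(frac)
'+8.1'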

20.3. How many total numbers can we represent?#

With this rule, we can represent the digits 0-7 on the right-hand side of the point (3 bits) and 0-15 on the left-hand side (4 bits).

We can visualize values with memory spy.

We can visualize this in code using a little Python if we want:

import itertools
# 4 bits for the integer part (0-15) and 3 bits for the digit after the point (0-7)
num_list = [str(n + f) for n, f in itertools.product(range(16), [i/10 for i in range(8)])]
', '.join(num_list + ['-' + n for n in num_list])
len(num_list) * 2
256

This is exactly as many bit patterns as we could represent with 1 bit for the sign and 7 bits for an integer, but the largest value is now 15.7 instead of 127:

(2**7)*2
256

In this case we still have a small range of numbers, but it’s a different set of possible numbers.

20.4. Floating Point Notation#

We can write numbers in many different forms. We have written integers in many different bases so far.

For example in scientific contexts, we often write numbers (in base 10) like:

\[ 3.45 \times 10^2 = 345 \]

We can use a similar form to represent numbers in binary, using base 2 instead of base 10.

Another way we could represent numbers with a fixed point is to use base-two fractions: \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\), etc.

In this fixed point we would have, for example: 0101.1010 would be

\[0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times \frac{1}{2^1} + 0 \times \frac{1}{2^2} + 1 \times \frac{1}{2^3} + 0 \times \frac{1}{2^4} = 4 + 1 + \frac{1}{2} + \frac{1}{8} = 5\frac{5}{8} = 5.625\]
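
We can check this in Python by parsing the bits as an integer and then rescaling (a quick sketch):

int('01011010', 2) / 2**4   # drop the point, then divide by 2^4 to put it back
5.625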

20.5. Floating point numbers are not all exact#

Let’s look at an example: we add .1 together 3 times, and we get almost, but not exactly, the expected result.

.1 + .1 + .1
0.30000000000000004

However, if we check for equality, it does not equal .3:

.1 + .1 + .1 == .3
False
The same thing happens with larger numbers:

1.1 + 1.1 + 1.1
3.3000000000000003

This is because the floating point representation is an approximation: many numbers cannot be expressed exactly, so they are stored as the closest value that can be. However, the display rounds the value, since usually we want only a few significant digits.
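
We can ask Python for more digits to see the stored approximation (a quick check):

f'{0.1:.20f}'   # 0.1 is stored as the nearest representable double
'0.10000000000000000555'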

Even rounding the inputs does not make it work, though rounding the final sum does:

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)
False
round(.1 + .1+ .1 ,1) == 0.3
True

Important

This matters because we cannot rely on equality tests for floats

Instead, we use the absolute difference and a tolerance:

abs(.1 + .1 + .1 - .3) < .0001
abs(.1 + .1 + .1 - .3) < .000000001
abs(.1 + .1 + .1 - .3) < 0.00000000000000001   # the difference is about 5.5e-17
False

The more precision we try to get, the sooner we will reach a False.
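
Python's standard library has this comparison built in; math.isclose checks within a relative tolerance (1e-09 by default):

import math
math.isclose(.1 + .1 + .1, .3)   # True because the values agree within the tolerance
True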

20.6. Floating point IEEE standard#

Now, let's see how these numbers are actually represented.

float.exposed

IEEE Floating Point is what basically everyone uses, but it is technically a choice hardware manufacturers can make.

  • Initially released in 1985

  • Revised in 2008 after a 7-year process that expanded the standard

  • Revised in 2019 after a 4-year process that mostly added clarifications

  • The next revision is projected for 2028.

This is a double-precision float, or binary64 as it is named in the current standard.

It was officially called double in the original standard, so it is still commonly called that, even though it is no longer the formally correct name.

In this case we use 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction part.

[figure: the 64 bits of a float: 1 sign bit, 11 exponent bits, 52 fraction bits]
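
Python floats use this binary64 layout; we can confirm its parameters from the standard library (a quick check):

import sys
sys.float_info.mant_dig, sys.float_info.max_exp   # 52 fraction bits plus the implicit leading 1
(53, 1024)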

20.6.1. How do we read a number like this?#

If the sign bit is \(s\), the number represented by the exponent bits is \(e\), and the 52 fraction bits are numbered \(b_{51}\) down to \(b_0\) from left to right (so the rightmost bit is \(b_0\)), then the value is:

\[ (-1)^s \times \left(1 + \sum_{i=1}^{52} b_{52-i}\, 2^{-i}\right) \times 2^{e-1023} \]

Note that the exponents in the sum are negative (\(2^{-i}\)), so we are working with fractions instead of integers.

So if we had:

0 01111111111 0000000000000000000000000000000000000000000000000000

it would represent:

\[ (-1)^0 \times \left(1 + 0 \times 2^{-1} + 0 \times 2^{-2} + \ldots + 0 \times 2^{-51} + 0 \times 2^{-52} \right) \times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0 \]

You try it for:

0 01111111111 0100000000000000000000000000000000000000000000000000

So, for

0 01111111111 0100000000000000000000000000000000000000000000000000

following the equation, we get:

\[(-1)^0 \times \left(1 + 0\times 2^{-1} + 1\times 2^{-2} + 0\times 2^{-3} + \ldots + 0\times 2^{-51} + 0\times 2^{-52} \right) \times 2^{1023-1023} = 1 \times (1 + \frac{1}{4}) \times 2^0 = 1.25 \times 1 = 1.25\]
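
We can pull the raw bits out of a real Python float to confirm this (a sketch using the standard struct module):

import struct
bits = struct.unpack('>Q', struct.pack('>d', 1.25))[0]   # reinterpret the 8 bytes as an integer
format(bits, '064b')
'0011111111110100000000000000000000000000000000000000000000000000'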
We can also check the exponent arithmetic directly:

0b01111111111            # the exponent bits from the 1.0 example
1023
float.hex(1.5)           # the hex form shows the significand and the power of 2
'0x1.8000000000000p+0'
0b01111111000 - 1023     # these exponent bits give 2**-7
-7

20.6.2. Why use e-1023 for the exponent?#

2**11
2048

So 1023 is half of 2048, minus 1. This means we split the possible range of the exponent into a positive and a negative half, so the whole expression \(2^{e-1023}\) ranges from large (\(2^{1024}\)) to small (\(2^{-1023}\)).
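
We can compute that range directly (a quick sketch; note that in the full standard the largest and smallest exponent values are reserved for special cases like infinity, NaN, and subnormals):

e_min, e_max = 0, 2**11 - 1      # the 11 exponent bits hold 0 through 2047
(e_min - 1023, e_max - 1023)     # after subtracting the bias
(-1023, 1024)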