
How does a computer represent non-integer quantities?

Fixed Point

We practice translating some numbers using a fixed point representation with 8 bits, so there are 8 places:

| Places | Value |
| --- | --- |
| Eights | $2^3 = 8$ |
| Fours | $2^2 = 4$ |
| Twos | $2^1 = 2$ |
| Ones | $2^0 = 1$ |
| Halves | $2^{-1} = \frac{1}{2}$ |
| Quarters | $2^{-2} = \frac{1}{4}$ |
| Eighths | $2^{-3} = \frac{1}{8}$ |
| Sixteenths | $2^{-4} = \frac{1}{16}$ |

Example numbers in Fixed point

1.5

0001.1000

or:

$$0*2^3 + 0*2^2 + 0*2^1 + 1*2^0 + 1*2^{-1} + 0*2^{-2} + 0*2^{-3} + 0*2^{-4} =$$

$$0*8 + 0*4 + 0*2 + 1*1 + 1*\frac{1}{2} + 0*\frac{1}{4} + 0*\frac{1}{8} + 0*\frac{1}{16} =$$

$$1 + .5$$

12.75

1100.1100

or:

$$1*2^3 + 1*2^2 + 0*2^1 + 0*2^0 + 1*2^{-1} + 1*2^{-2} + 0*2^{-3} + 0*2^{-4} =$$

$$1*8 + 1*4 + 0*2 + 0*1 + 1*\frac{1}{2} + 1*\frac{1}{4} + 0*\frac{1}{8} + 0*\frac{1}{16} =$$

$$8 + 4 + .5 + .25$$

7.5625

0111.1001

or:

$$0*2^3 + 1*2^2 + 1*2^1 + 1*2^0 + 1*2^{-1} + 0*2^{-2} + 0*2^{-3} + 1*2^{-4} =$$

$$0*8 + 1*4 + 1*2 + 1*1 + 1*\frac{1}{2} + 0*\frac{1}{4} + 0*\frac{1}{8} + 1*\frac{1}{16} =$$

$$4 + 2 + 1 + .5 + .0625$$

15.9375

1111.1111

or:

$$1*2^3 + 1*2^2 + 1*2^1 + 1*2^0 + 1*2^{-1} + 1*2^{-2} + 1*2^{-3} + 1*2^{-4} =$$

$$1*8 + 1*4 + 1*2 + 1*1 + 1*\frac{1}{2} + 1*\frac{1}{4} + 1*\frac{1}{8} + 1*\frac{1}{16} =$$

$$8 + 4 + 2 + 1 + .5 + .25 + .125 + .0625$$
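The conversions above can be checked with a short helper. This is a sketch: the function name and the `IIII.FFFF` string format are my own convention, not anything standard.

```python
def fixed_point_value(bits):
    """Interpret an 8-bit fixed point string like '0001.1000'
    (4 integer places, 4 fractional places) as a decimal value."""
    integer, fraction = bits.split('.')
    value = int(integer, 2)  # the integer places are ordinary binary
    # fractional place i (counting from the point, 1-indexed) is worth 2**-i
    for i, bit in enumerate(fraction, start=1):
        value += int(bit) * 2 ** -i
    return value

print(fixed_point_value('0001.1000'))  # 1.5
print(fixed_point_value('1100.1100'))  # 12.75
print(fixed_point_value('1111.1111'))  # 15.9375
```

All of the example values here are sums of powers of two, so the results are exact, not approximations.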

Prelude

What do we mean by representation?

Computers are electrical devices: there are circuits that calculate everything, using gates that represent logical operations.

In order to get the data (the things we want to compute) through the circuits, we have to represent the values somehow.

We discussed previously how we could represent success/failure with 0/1.

You may have seen ASCII codes before: those are numbers to represent each character, that is how characters are represented in a computer.

Sometimes, we mix representations within a single bit string:

a toy object instruction

let's say we have a CPU that can do 4 bit operations and has 4 control bits (that determine what operation it will do across the ALU + controls for the A, D registers & PC)

then we might have an 8 bit instruction stored in ROM.

the operation @3 might be represented as 01000011, if @ is indicated by 0100 on the controller and then the last 4 bits 0011 represent the value 3
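Splitting such an instruction back into its parts is just shifting and masking. A minimal sketch, using the hypothetical `0100` encoding for @ from the example above:

```python
instruction = 0b01000011  # "@3" in the toy encoding above

opcode = instruction >> 4     # top 4 control bits -> 0b0100
operand = instruction & 0xF   # bottom 4 value bits -> 3

print(format(opcode, '04b'), operand)  # 0100 3
```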

negative numbers

we use one bit to represent the sign, and then we have two choices for the rest of the bits:

| representation | how to negate | how to read |
| --- | --- | --- |
| one’s complement | flip all the bits | note the sign, flip all the bits back |
| two’s complement | flip the bits, add 1 | note the sign, flip all the bits, add one |

Try it out on integer exposed

For example, set it to 3 and then use the ~ button to flip the bits, then add one by flipping the 0 in the ones place to a 1; that ends up at -3

so in 8 bit signed integer 3 is 00000011 and -3 is 11111101.

Try a few other cases in integer exposed

Two’s complement is the most common.
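Since Python integers are unbounded, we can emulate an 8-bit two's complement register by masking down to 8 bits. A sketch (the function name is mine):

```python
def to_twos_complement(value, bits=8):
    """Return the two's complement bit pattern of `value` as a string."""
    mask = (1 << bits) - 1  # e.g. 0xFF for 8 bits
    return format(value & mask, f'0{bits}b')

print(to_twos_complement(3))   # 00000011
print(to_twos_complement(-3))  # 11111101
```

This matches the hand-worked example: flipping the bits of 3 and adding one gives 11111101.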

Floating Point Notation

We can write numbers in many different forms. We have written integers in many different bases so far.

For example in scientific contexts, we often write numbers (in base 10) like:

$$3.45 \times 10^2 = 345$$

We can use a similar form to represent numbers in binary, using base 2 instead of base 10.
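For example, a small worked instance of my own: $1.1_2$ is $1 + \frac{1}{2} = 1.5$, so

$$1.1_2 \times 2^2 = 110_2 = 6$$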

Floating point numbers are not all exact

Let’s look at an example: we add .1 together 3 times, and the result is not quite what we expect.

We’ll use python here, but you can do this in any programming language.

.1 + .1 + .1
0.30000000000000004

However, if we check what it’s equal to, it does not equal .3

.1 + .1 + .1 == .3
False

This is because the floating point representation is an approximation: most real numbers cannot be expressed exactly, so any given number is stored as the closest representable value. However, the display truncates, since usually we want only a few significant digits.
Even rounding does not make it work.

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)
False

Instead, we compare the absolute difference to a tolerance:

abs(.1 + .1 + .1 - .3) < .0001
True
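The standard library wraps this comparison up as `math.isclose`, which uses a relative tolerance (default `rel_tol=1e-09`):

```python
import math

print(math.isclose(.1 + .1 + .1, .3))  # True
```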

Floating point IEEE standard

Now, let’s see how these numbers are actually represented.

float.exposed

IEEE Floating Point is what basically everyone uses, but it is technically a choice hardware manufacturers can make.

this is a double precision float, or binary64 in the current standard.

it was officially called double in the original standard, so it is still commonly called that, even though it is no longer the formally correct name.

In this case we use 1 bit for the sign, 11 for the exponent, and 52 for the fraction part

Figure 1: a visual representation of a 64 bit float as 64 small rectangles in a horizontal line: 1 in light blue labeled sign, 11 in green labeled exponent, and 52 in red labeled fraction:

  • the first bit is the sign bit

  • the next 11 bits are the exponent

  • the 52 bits of the fraction are numbered from right to left (the rightmost is $b_0$ and the leftmost is $b_{51}$).

How do we read a number like this?

If the sign bit is $s$, the number represented by the exponent bits is $e$, and the fraction bits are numbered so that the rightmost is $b_0$ and the leftmost is $b_{51}$, then the value is:

$$(-1)^s \times \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$$

So, for example:

0 01111111111 0000000000000000000000000000000000000000000000000000

we have $s=0$ and $e=1023$ and all of the $b_i=0$.

we can follow the equation, and interpret the bits to get:

$$(-1)^0 \times \left(1 + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + \ldots + 0 \cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0$$

and for

0 01111111111 0100000000000000000000000000000000000000000000000000

we have $s=0$ and $e=1023$ and all of the $b_i=0$, except $b_{50}=1$. Following the equation, we get:

$$(-1)^0 \times \left(1 + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \ldots + 0\cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times \left(1 + \frac{1}{4}\right) \times 2^0 = 1.25 \times 1 = 1.25$$
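We can sketch this interpretation in Python and check it against the two bit patterns above. This handles normal numbers only (exponent fields of 0 and 2047 are special cases in the standard), and the function name is my own:

```python
def decode_float64(bits):
    """Apply the IEEE 754 formula to a 64-bit pattern given as an int.
    Normal numbers only (exponent field not 0 or 2047)."""
    s = bits >> 63                      # 1 sign bit
    e = (bits >> 52) & 0x7FF            # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)   # 52 fraction bits
    return (-1) ** s * (1 + fraction / 2 ** 52) * 2.0 ** (e - 1023)

one = int('0' + '01111111111' + '0' * 52, 2)
print(decode_float64(one))  # 1.0

one_and_a_quarter = int('0' + '01111111111' + '01' + '0' * 50, 2)
print(decode_float64(one_and_a_quarter))  # 1.25
```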

Why use e-1023 for the exponent?

2**11
2048

so 1023 is half of 2048, minus 1. This means we split the possible range of exponents into positive and negative, so the whole expression $2^{e-1023}$ ranges from large ($2^{1024}$) to small ($2^{-1023}$).

How do we get numbers into this form?

Now, let’s return to our example of .1.

First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.

Let’s take the equation we had from the standard:

$$\left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$$

Think about the fraction part and how we add fractions: by reaching a common denominator. Since the denominators are all powers of two, we can use the largest one, $2^{52}$.

and we can expand the sum to a different notation for visual clarity

$$\left( 1 + \frac{b_{51}}{2^{1}} + \frac{b_{50}}{2^{2}} + \cdots + \frac{b_{1}}{2^{51}} + \frac{b_{0}}{2^{52}} \right) \times 2^{e-1023}$$

And apply a common denominator:

$$\left(\frac{2^{52}}{2^{52}} + \frac{2^{51} b_{51}}{2^{52}} + \frac{2^{50} b_{50}}{2^{52}} + \cdots + \frac{2^{1} b_{1}}{2^{52}} + \frac{2^0 b_{0}}{2^{52}} \right) \times 2^{e-1023}$$

So then the numerator becomes a binary number with 53 bits (the 52 fraction bits plus the leading 1) over a denominator of $2^{52}$; let’s call that numerator $J$, so we have:

$$\frac{J}{2^{52}} \times 2^{e-1023}$$

we can then combine the powers of 2, with $N = 52 - e + 1023$:

$$\frac{J}{2^{52-e+1023}} = \frac{J}{2^{N}}$$

So, to return to the .1 that we want to represent, we can write it as a fraction and then approximate it in the form above.

$$\frac{1}{10} \approx \frac{J}{2^N}$$

($\approx$ means approximately equal to)

after some manipulation:

$$J \approx 2^N / 10$$

Since we want to use exactly 53 bits to represent $J$, we want $\frac{2^N}{10}$ to be greater than or equal to $2^{52}$ and less than $2^{53}$:

$$2^{52} \le \frac{2^N}{10} < 2^{53}$$

Since $8<10<16$ we can say $2^3<10<2^4$

and we can also say that

$$\frac{1}{2^4} < \frac{1}{10} < \frac{1}{2^3}$$

(larger denominators give smaller numbers when the numerators are the same)

We can put the same numerator on all three:

$$\frac{2^N}{2^4} < \frac{2^N}{10} < \frac{2^N}{2^3}$$

and simplify the outer ones to only have a single exponent

$$2^{N-4} < \frac{2^N}{10} < 2^{N-3}$$

If we combine this with our earlier inequality, we can see that the best $N$ is 56, by taking either side: $N-4=52$ or $N-3=53$.

We can confirm this in a calculator

2**52 <=2**56/10 < 2**53
True

Now we can get the number we will use for $J$.

We want our fraction to be as close to .1 as possible, so we take the quotient and remainder: the quotient will be the number we use, and the remainder tells us whether to round up.

q,r = divmod(2**56, 10)

Now, we check the remainder to decide if we should round up by 1 or not.

r
6

$6 > 5 = \frac{10}{2}$, so we round up

J = q+1
J
7205759403792794

then we can check the length in bits of that number

bin(J)
'0b11001100110011001100110011001100110011001100110011010'

we could count the digits above, or, since there are 2 extra characters at the start telling us it is binary, take the len and subtract 2:

len(bin(J) )-2
53

This is, as expected 53 bits!

The significand is actually only 52 bits, though; the representation just always has an implicit leading 1. See the 1 added before the sum of the bits in the equation above, or check on float.exposed: with the exponent set to 1023 (the green bits, so that the multiplicative part is $2^{1023-1023} = 2^0 = 1$) and the significand set to 0 (the blue bits), the value is exactly 1.

significand = J-2**52
significand
2702159776422298

We have $N = 56 = 52 - e + 1023$, so to solve for $e$ we get $e = 52 - 56 + 1023$

e = 52-56+1023
e
1019
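As a cross-check, we can pack the sign, exponent, and significand back into a 64-bit pattern and reinterpret it with `struct`. This is a sketch; the literals below are the `significand` and `e` values computed above:

```python
import struct

significand = 2702159776422298  # J - 2**52, computed above
e = 1019                        # exponent field, computed above
sign = 0                        # .1 is positive

# assemble the 64-bit layout: 1 sign bit, 11 exponent bits, 52 fraction bits
bits = (sign << 63) | (e << 52) | significand

# reinterpret those 8 bytes as a big-endian double
value = struct.unpack('>d', bits.to_bytes(8, 'big'))[0]
print(value == 0.1)  # True
```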

Now let’s confirm what we found with the actual representation:

First, we can plug into float.exposed and see that it is very close to .1, and that if we increase the significand at all it gets farther away.

Second, we can check in Python. Python doesn’t provide a binary representation of floats, but it does provide a hex representation.

float.hex(.1)
'0x1.999999999999ap-4'
hex(J)
'0x1999999999999a'

Third, we can test by computing the approximation:

(J)/2**56 ==.1
True

We can also see how close the numbers one above and one below are:

(J-1)/2**56 - .1
-1.3877787807814457e-17
(J+1)/2**56 - .1
1.3877787807814457e-17

these are the closest numbers we can represent to .1, meaning we cannot, for example, represent $.1 \pm 1.2 \times 10^{-18}$

.1 - 1.2*10**-18
0.1
.1 + 1.2*10**-18
0.1

Adding or subtracting these tiny values does not change the value at all.

.1 - 1.2*10**-18 ==.1
True

The same thing happens for very large numbers, 4503599627370496 is the beginning of a range in float64 where the closest numbers we can represent are 1 apart, so for example:

4503599627370496.0 +.5
4503599627370496.0

doesn’t work; it cannot represent 4503599627370496.5

Above 9007199254740992 we cannot even represent odd integers, only even for a while and then it gets more and more sparse.

9007199254740992.0 +1
9007199254740992.0
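Python can report this spacing directly: `math.ulp` (Python 3.9+) gives the gap between a float and the next representable one:

```python
import math

print(math.ulp(0.1) == 2 ** -56)       # True: the gap near .1 we found above
print(math.ulp(4503599627370496.0))    # 1.0, neighbors are one apart
print(math.ulp(9007199254740992.0))    # 2.0, only even integers from here
```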
The floating point number line is not uniform. source

Prepare for Next Class

  1. think about what you know about networking

  2. make sure you have PuTTY if using Windows

  3. get the big ideas of hpc, by reading this IBM intro page and some hypothetical people who would attend an HPC carpentry workshop. Make a list of key terms as an issue comment

Badges

Review
Practice
  1. Review the notes from today

  2. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that the issue we saw with .1 is related to the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include the program in a code cell in cdouble.md

Experience Report Evidence

Nothing separate

Questions After Today’s Class

How do calculators get exact decimals if the computer cannot store all the exact values?

Calculators also have to approximate numbers if they are digital calculators.

this is a good explore topic

Is it possible for a number to accidentally reach infinity?

In theory, yes, but in practice most programming languages handle this in some way, for example Python just does nothing:

1.79769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368*10**308 + 1
1.7976931348623157e+308

but if you try to increase this number at all, it goes to infinity
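We can trigger that overflow deliberately; `sys.float_info.max` is the largest finite float64 (the value printed above):

```python
import sys

big = sys.float_info.max        # 1.7976931348623157e+308
print(big * 2)                  # inf: doubling overflows to infinity
print(big * 2 == float('inf'))  # True
```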

Do neural networks need larger floats?

Typically no; you can learn more about quantization though.

Are there circumstances where using a different representation that is not floating point is necessary?

yes! we will see some in lab next week.