
How does a computer represent non-integer quantities?

Fixed Point

We practice translating some numbers using a fixed point representation with 8 bits, so there are 8 places:

| Places | Value |
| --- | --- |
| Eights | $2^3 = 8$ |
| Fours | $2^2 = 4$ |
| Twos | $2^1 = 2$ |
| Ones | $2^0 = 1$ |
| Halves | $2^{-1} = \frac{1}{2}$ |
| Quarters | $2^{-2} = \frac{1}{4}$ |
| Eighths | $2^{-3} = \frac{1}{8}$ |
| Sixteenths | $2^{-4} = \frac{1}{16}$ |

Example numbers in Fixed point

1.5

0001.1000

or:

$$0*2^3 + 0*2^2 + 0*2^1 + 1*2^0 + 1*2^{-1} + 0*2^{-2} + 0*2^{-3} + 0*2^{-4} =$$

$$0*8 + 0*4 + 0*2 + 1*1 + 1*\frac{1}{2} + 0*\frac{1}{4} + 0*\frac{1}{8} + 0*\frac{1}{16} =$$

$$1 + .5$$

12.75

1100.1100

or:

$$1*2^3 + 1*2^2 + 0*2^1 + 0*2^0 + 1*2^{-1} + 1*2^{-2} + 0*2^{-3} + 0*2^{-4} =$$

$$1*8 + 1*4 + 0*2 + 0*1 + 1*\frac{1}{2} + 1*\frac{1}{4} + 0*\frac{1}{8} + 0*\frac{1}{16} =$$

$$8 + 4 + .5 + .25$$

7.5625

0111.1001

or:

$$0*2^3 + 1*2^2 + 1*2^1 + 1*2^0 + 1*2^{-1} + 0*2^{-2} + 0*2^{-3} + 1*2^{-4} =$$

$$0*8 + 1*4 + 1*2 + 1*1 + 1*\frac{1}{2} + 0*\frac{1}{4} + 0*\frac{1}{8} + 1*\frac{1}{16} =$$

$$4 + 2 + 1 + .5 + .0625$$

15.9375

1111.1111

or:

$$1*2^3 + 1*2^2 + 1*2^1 + 1*2^0 + 1*2^{-1} + 1*2^{-2} + 1*2^{-3} + 1*2^{-4} =$$

$$1*8 + 1*4 + 1*2 + 1*1 + 1*\frac{1}{2} + 1*\frac{1}{4} + 1*\frac{1}{8} + 1*\frac{1}{16} =$$

$$8 + 4 + 2 + 1 + .5 + .25 + .125 + .0625$$
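The conversions above can be checked with a short helper. This is a sketch: the function name and the `IIII.FFFF` string format are my own convention, not anything standard.

```python
def fixed_point_value(bits):
    """Interpret an 8-bit fixed point string like '0001.1000'
    (4 integer places, 4 fractional places) as a decimal value."""
    integer, fraction = bits.split('.')
    value = int(integer, 2)  # the integer places are ordinary binary
    # fractional place i (counting from the point, 1-indexed) is worth 2**-i
    for i, bit in enumerate(fraction, start=1):
        value += int(bit) * 2 ** -i
    return value

print(fixed_point_value('0001.1000'))  # 1.5
print(fixed_point_value('1100.1100'))  # 12.75
print(fixed_point_value('1111.1111'))  # 15.9375
```

All of the example values here are sums of powers of two, so the results are exact, not approximations.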

Prelude

What do we mean by representation?

Computers are electrical devices: there are circuits that calculate everything, using gates that represent logical operations.

In order to get the data (the things we want to compute) through the circuits, we have to represent the values somehow.

We discussed previously how we could represent success/failure with 0/1.

You may have seen ASCII codes before: those are numbers to represent each character, that is how characters are represented in a computer.

Sometimes, we mix representations within a single bit string:

a toy object instruction

let's say we have a CPU that can do 4 bit operations and has 4 control bits (that determine what operation it will do across the ALU + controls for the A, D registers & PC)

then we might have an 8 bit instruction stored in ROM.

the operation @3 might be represented as 01000011, if @ is indicated by 0100 on the controller and then the last 4 bits 0011 represent the value 3
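Splitting such an instruction back into its parts is just shifting and masking. A minimal sketch, using the hypothetical `0100` encoding for @ from the example above:

```python
instruction = 0b01000011  # "@3" in the toy encoding above

opcode = instruction >> 4     # top 4 control bits -> 0b0100
operand = instruction & 0xF   # bottom 4 value bits -> 3

print(format(opcode, '04b'), operand)  # 0100 3
```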

negative numbers

we use one bit to represent the sign, and then we have two choices for the rest of the bits:

| representation | how to negate | how to read |
| --- | --- | --- |
| one’s complement | flip all the bits | note the sign, flip all the bits back |
| two’s complement | flip the bits, add 1 | note the sign, flip all the bits, add one |

Try it out on integer exposed

For example, set it to 3 and then use the ~ button to flip the bits, then add one by flipping the 0 in the ones place to a 1; that ends up at -3

so in 8 bit signed integer 3 is 00000011 and -3 is 11111101.

Try a few other cases in integer exposed

Two’s complement is the most common.
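Since Python integers are unbounded, we can emulate an 8-bit two's complement register by masking down to 8 bits. A sketch (the function name is mine):

```python
def to_twos_complement(value, bits=8):
    """Return the two's complement bit pattern of `value` as a string."""
    mask = (1 << bits) - 1  # e.g. 0xFF for 8 bits
    return format(value & mask, f'0{bits}b')

print(to_twos_complement(3))   # 00000011
print(to_twos_complement(-3))  # 11111101
```

This matches the hand-worked example: flipping the bits of 3 and adding one gives 11111101.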

Floating Point Notation

We can write numbers in many different forms. We have written integers in many different bases so far.

For example in scientific contexts, we often write numbers (in base 10) like:

$$3.45 \times 10^2 = 345$$

We can use a similar form to represent numbers in binary, using base 2 instead of base 10.
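For example, a small worked instance of my own: $1.1_2$ is $1 + \frac{1}{2} = 1.5$, so

$$1.1_2 \times 2^2 = 110_2 = 6$$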

Floating point numbers are not all exact

Let’s look at an example: we add .1 together 3 times, and the result is not quite what we expect.

We’ll use python here, but you can do this in any programming language.

.1 + .1 + .1
0.30000000000000004

However, if we check what it’s equal to, it does not equal .3

.1 + .1 + .1 == .3
False

This is because the floating point representation is an approximation: most real numbers cannot be expressed exactly, so any given number is stored as the closest representable value. However, the display truncates, since usually we want only a few significant digits.
Even rounding does not make it work.

round(.1,1) + round(.1,1) + round(.1,1) == round(.3,1)
False

Instead, we compare the absolute difference to a tolerance:

abs(.1 + .1 + .1 - .3) < .0001
True
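The standard library wraps this comparison up as `math.isclose`, which uses a relative tolerance (default `rel_tol=1e-09`):

```python
import math

print(math.isclose(.1 + .1 + .1, .3))  # True
```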

Floating point IEEE standard

Now, let’s see how these numbers are actually represented.

float.exposed

IEEE Floating Point is what basically everyone uses, but it is technically a choice hardware manufacturers can make.

this is a double precision float, or binary64 in the current standard.

it was officially called double in the original standard, so it is still commonly called that, even though it is no longer the formally correct name.

In this case we use 1 bit for the sign, 11 for the exponent, and 52 for the fraction part

Figure 1: a visual representation of a 64 bit float as 64 small rectangles in a horizontal line: 1 in light blue labeled sign, 11 in green labeled exponent, and 52 in red labeled fraction:

  • the first bit is the sign bit

  • the next 11 bits are the exponent

  • the 52 bits of the fraction are numbered from right to left (the rightmost is $b_0$ and the leftmost is $b_{51}$).

How do we read a number like this?

If the sign bit is $s$, the number represented by the exponent bits is $e$, and the fraction bits are numbered so that the rightmost is $b_0$ and the leftmost is $b_{51}$, then the value is:

$$(-1)^s \times \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$$

So, for example:

0 01111111111 0000000000000000000000000000000000000000000000000000

we have $s=0$ and $e=1023$ and all of the $b_i=0$.

we can follow the equation, and interpret the bits to get:

$$(-1)^0 \times \left(1 + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + \ldots + 0 \cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times (1 + 0) \times 2^0 = 1 \times 1 = 1.0$$

and for

0 01111111111 0100000000000000000000000000000000000000000000000000

we have $s=0$ and $e=1023$ and all of the $b_i=0$, except $b_{50}=1$. Following the equation, we get:

$$(-1)^0 \times \left(1 + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \ldots + 0\cdot 2^{-52} \right) \times 2^{1023-1023} = 1 \times \left(1 + \frac{1}{4}\right) \times 2^0 = 1.25 \times 1 = 1.25$$
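We can sketch this interpretation in Python and check it against the two bit patterns above. This handles normal numbers only (exponent fields of 0 and 2047 are special cases in the standard), and the function name is my own:

```python
def decode_float64(bits):
    """Apply the IEEE 754 formula to a 64-bit pattern given as an int.
    Normal numbers only (exponent field not 0 or 2047)."""
    s = bits >> 63                      # 1 sign bit
    e = (bits >> 52) & 0x7FF            # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)   # 52 fraction bits
    return (-1) ** s * (1 + fraction / 2 ** 52) * 2.0 ** (e - 1023)

one = int('0' + '01111111111' + '0' * 52, 2)
print(decode_float64(one))  # 1.0

one_and_a_quarter = int('0' + '01111111111' + '01' + '0' * 50, 2)
print(decode_float64(one_and_a_quarter))  # 1.25
```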

Why use e-1023 for the exponent?

2**11
2048

so 1023 is half of 2048, minus 1. This means we split the possible range of exponents into positive and negative, so the whole expression $2^{e-1023}$ ranges from large ($2^{1024}$) to small ($2^{-1023}$).

How do we get numbers into this form?

Now, let’s return to our example of .1.

First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.

Let’s take the equation we had from the standard:

$$\left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$$

Think about the fraction part and how we add fractions: by reaching a common denominator. Since the denominators are all powers of two, we can use the largest one, $2^{52}$.

and we can expand the sum to a different notation for visual clarity

$$\left( 1 + \frac{b_{51}}{2^{1}} + \frac{b_{50}}{2^{2}} + \cdots + \frac{b_{1}}{2^{51}} + \frac{b_{0}}{2^{52}} \right) \times 2^{e-1023}$$

And apply a common denominator:

$$\left(\frac{2^{52}}{2^{52}} + \frac{2^{51} b_{51}}{2^{52}} + \frac{2^{50} b_{50}}{2^{52}} + \cdots + \frac{2^{1} b_{1}}{2^{52}} + \frac{2^0 b_{0}}{2^{52}} \right) \times 2^{e-1023}$$

So then the numerator becomes a binary number with 53 bits (the 52 fraction bits plus the leading 1) over a denominator of $2^{52}$; let’s call that numerator $J$, so we have:

$$\frac{J}{2^{52}} \times 2^{e-1023}$$

we can then combine the powers of 2, with $N = 52 - e + 1023$:

$$\frac{J}{2^{52-e+1023}} = \frac{J}{2^{N}}$$

So, to return to the .1 that we want to represent, we can write it as a fraction and then approximate it in the form above.

$$\frac{1}{10} \approx \frac{J}{2^N}$$

($\approx$ means approximately equal to)

after some manipulation:

$$J \approx 2^N / 10$$

Since we want to use exactly 53 bits to represent $J$, we want $\frac{2^N}{10}$ to be greater than or equal to $2^{52}$ and less than $2^{53}$:

$$2^{52} \le \frac{2^N}{10} < 2^{53}$$

Since $8<10<16$ we can say $2^3<10<2^4$

and we can also say that

$$\frac{1}{2^4} < \frac{1}{10} < \frac{1}{2^3}$$

(larger denominators give smaller numbers when the numerators are the same)

We can put the same numerator on all three:

$$\frac{2^N}{2^4} < \frac{2^N}{10} < \frac{2^N}{2^3}$$

and simplify the outer ones to only have a single exponent

$$2^{N-4} < \frac{2^N}{10} < 2^{N-3}$$

If we combine this with our earlier inequality, we can see that the best $N$ is 56, by taking either side: $N-4=52$ or $N-3=53$.

We can confirm this in a calculator

2**52 <=2**56/10 < 2**53
True

Now we can get the number we will use for $J$.

We want our fraction to be as close to .1 as possible, so we take the quotient and remainder: the quotient will be the number we use, and the remainder tells us whether to round up.

q,r = divmod(2**56, 10)

Now, we check the remainder to decide if we should round up by 1 or not.

r
6

$6 > 5 = \frac{10}{2}$, so we round up

J = q+1
J
7205759403792794

then we can check the length in bits of that number

bin(J)
'0b11001100110011001100110011001100110011001100110011010'

we could count the digits above, or, since there are 2 extra characters at the start telling us it is binary, take the len and subtract 2:

len(bin(J) )-2
53

This is, as expected 53 bits!

The significand is actually only 52 bits, though; the representation just always has an implicit leading 1. See the 1 added before the sum of the bits in the equation above, or check on float.exposed: with the exponent set to 1023 (the green bits, so that the multiplicative part is $2^{1023-1023} = 2^0 = 1$) and the significand set to 0 (the blue bits), the value is exactly 1.

significand = J-2**52
significand
2702159776422298

We have $N = 56 = 52 - e + 1023$, so to solve for $e$ we get $e = 52 - 56 + 1023$

e = 52-56+1023
e
1019
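As a cross-check, we can pack the sign, exponent, and significand back into a 64-bit pattern and reinterpret it with `struct`. This is a sketch; the literals below are the `significand` and `e` values computed above:

```python
import struct

significand = 2702159776422298  # J - 2**52, computed above
e = 1019                        # exponent field, computed above
sign = 0                        # .1 is positive

# assemble the 64-bit layout: 1 sign bit, 11 exponent bits, 52 fraction bits
bits = (sign << 63) | (e << 52) | significand

# reinterpret those 8 bytes as a big-endian double
value = struct.unpack('>d', bits.to_bytes(8, 'big'))[0]
print(value == 0.1)  # True
```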

Now let’s confirm what we found with the actual representation:

First, we can plug into float.exposed and see that it is very close to .1, and that if we increase the significand at all it gets farther away.

Second, we can check in Python. Python doesn’t provide a binary representation of floats, but it does provide a hex representation.

float.hex(.1)
'0x1.999999999999ap-4'
hex(J)
'0x1999999999999a'

Third, we can test by computing the approximation:

(J)/2**56 ==.1
True

We can also see how close the numbers one above and one below are:

(J-1)/2**56 - .1
-1.3877787807814457e-17
(J+1)/2**56 - .1
1.3877787807814457e-17

these are the closest numbers we can represent to .1, meaning we cannot, for example, represent $.1 \pm 1.2 \times 10^{-18}$

.1 - 1.2*10**-18
0.1
.1 + 1.2*10**-18
0.1

Adding or subtracting these tiny values does not change the value at all.

.1 - 1.2*10**-18 ==.1
True

The same thing happens for very large numbers, 4503599627370496 is the beginning of a range in float64 where the closest numbers we can represent are 1 apart, so for example:

4503599627370496.0 +.5
4503599627370496.0

doesn’t work; it cannot represent 4503599627370496.5

Above 9007199254740992 we cannot even represent odd integers, only even for a while and then it gets more and more sparse.

9007199254740992.0 +1
9007199254740992.0
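Python can report this spacing directly: `math.ulp` (Python 3.9+) gives the gap between a float and the next representable one:

```python
import math

print(math.ulp(0.1) == 2 ** -56)       # True: the gap near .1 we found above
print(math.ulp(4503599627370496.0))    # 1.0, neighbors are one apart
print(math.ulp(9007199254740992.0))    # 2.0, only even integers from here
```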
The floating point number line is not uniform. source

Prepare for Next Class

  1. think about what you know about networking

  2. make sure you have PuTTY if using Windows

  3. get the big ideas of hpc, by reading this IBM intro page and some hypothetical people who would attend an HPC carpentry workshop. Make a list of key terms as an issue comment

Badges

Review
Practice
  1. Review the notes from today

  2. Write a C program to compare values as doubles and as floats (single precision/32 bit) to see that the issue we saw with .1 is related to the IEEE standard and is not language specific. Make notes and comparisons about its behavior and include the program in a code cell in cdouble.md

Experience Report Evidence

Nothing separate

Questions After Today’s Class

How do calculators get exact decimals if the computer cannot store all the exact values?

Calculators also have to approximate numbers if they are digital calculators.

this is a good explore topic

Is it possible for a number to accidentally reach infinity?

In theory, yes, but in practice most programming languages handle this in some way, for example Python just does nothing:

1.79769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368*10**308 + 1
1.7976931348623157e+308

but if you try to increase this number at all, it goes to infinity
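We can trigger that overflow deliberately; `sys.float_info.max` is the largest finite float64 (the value printed above):

```python
import sys

big = sys.float_info.max        # 1.7976931348623157e+308
print(big * 2)                  # inf: doubling overflows to infinity
print(big * 2 == float('inf'))  # True
```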

Do neural networks need larger floats?

Typically no; you can learn more about quantization though.

Are there circumstances where using a different representation that is not floating point is necessary?

yes! we will see some in lab next week.