19. How does a computer represent non integer quantities?#
19.1. Prelude#
What do we mean by representation?
Last class we talked about how we could express file permissions in two ways:
rwxrwxrw-
776
In this, we interpret each bit to represent two different things:
read, write, execute permissions
4,2,1 place values
Sometimes, we mix representations within a single bit string:
object instructions could use the first few bits to represent the operation and the rest to represent a value (so we can encode an instruction in a single register)
negative numbers
19.1.1. a toy object instruction#
let say we have am CPU that can do 4 bit operations and has 4 control bits (that determine what operation it will do across the ALU + controls for A,D registers & PC)
then we might have an 8 bit register ROM.
the operation @3
might represented as 01000011
if @
is indicated by 0100
on the controler and then the last 4 bits 0011
represent the value 3
19.1.2. negative numbers#
we use one bit to represent the sign. and then e have two choices for the rest of the bits:
representation |
how to negate |
how to read |
ones complement |
flip all the bits |
note the sign, flip all the bits back |
two’s complement |
flip the bits, add 1 |
note the sign, flip all the bits, add one |
Try it out on integer exposed
For example, set it to 3 and then use the ~
button to flip the bits and add one by flipping the 0 in the ones place to a 1 (add one), that ends up at -3
so in 8 bit signed integer 3 is 00000011 and -3 is 11111101.
Try a few other cases in integer exposed
19.2. What about a fixed point for fractions?#
we could represent numbers with a fixed point, is to use base two fractions: \(\frac{1}{2}\), \(\frac{1}{4}\), \(\frac{1}{8}\) etc.
In this fixed point we would have, for example:
0101.1010
would be $\(0*2^3 + 1*2^2 + 0*2^2 + 1*2^0 + 1*\frac{1}{2^{-1}} + 0*\frac{1}{2^{-2}} + 1*\frac{1}{2^{-}} + 0*\frac{1}{2^{-4}} = 4 + 1 + \frac{1}{2} + \frac{1}{8} = 5 + \frac{5}{8}= 5.625\)$
19.3. Floating Point Notation#
We can write numbers in many different forms. We have written integers through many different bases so far.
For example in scientific contexts, we often write numbers (in base 10) like:
We can use a similiar form to represent numbers in binary. Using base 2 instead of base 10.
19.4. Floating point numbers are not all exact#
Let’s look at an example, we add .1
together 3 times, and we get the expected result, approximately.
.1+.1+.1
0.30000000000000004
However, if we check what it’s equal to, it does not equal .3
.1+.1+.1 == .3
False
This is because the floating point representation is an approximation and there are multiple full numbers that are close to any given number that cannot be expressed exactly.
However, the display truncates since usually we want only a few significant digits.
Even rounding does not make it work.
Important
This matters because we cannot rely on equality tests for floats
Instead, we use subtraction and a tolerance:
(.1+.1+.1 - .3)< .0001
True
19.5. Floating point IEEE standard#
IEEE Floating Point is what basically everyone uses, but it is technically a choice hardware manufacturers can make.
Initially in 1985
Reviesd in 2008 after 7 year process that expanded
revised in 2019 after 4 year process that mostly added clarifications
Next revision is projected to 2028.
Note
IBM mainframes use their own representiton based on Hex*
this is a double precision float or binary64 inthe current standard.
it was called double in the original, offically, so it is commonly called that still, even though it is no longer the formally correct name.
In this case we will 1 bit for sign, 11 for exponent and 52 for the fraction part
19.5.1. How do we read a number like this?#
if the sign bit is \(s\) and the number represented by the exponent bits is \(e\) and the 52 bits are number from right most is 0 and the last one is 52.
Note that this is \(2^{-1}\) so we are working with fractions instead of integers in the sum.
So if we had:
0 01111111111 0000000000000000000000000000000000000000000000000000
it would represent:
So, for
0 01111111111 0100000000000000000000000000000000000000000000000000
we get $\( (-1)*0 + \left(1 + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 0\cdot 2^{-3} + \ldots + 0\cdot 2^{-51} + 0\cdot 2^{-52} \right) \times 2^{1023-1023} = 0 + (1 + \frac{1}{4}) \times 2^0 = 1.25 \times 1 = 1.25 \)$
19.5.2. Why use e-1023 for the exponent?#
2**11
2048
so 1023 is half -1, this means that we split the possible range of numbers that we can represent in the exponent to positive and negative, meaning the whole expression \(2^{e-1023}\) ranges from large(\(2^{1025}\)) to small(\(2^{-1023}\))
19.5.3. How do we get numbers into this form?#
Now, let’s return to our example of .1.
First we take the sign of the number for the sign bit, but then to get the exponent and fraction, we have more work to do.
Let’s take the equation we had from the standard:
If we think about the fraction and how we add fractions, by reaching a common denominator. Since they’re all powers of two, we can use the last one.
and we can expand the sum to a different notation for visual clarity
And apply a common denominator:
So then this becomes a binary number with 53 bits (the 52 + 1) and a denominator of \(2^{52}\), let’s call that number \(J\) so we have:.
we can then combine the powers of 2.
So in order to return to our .1
that we want to represent, we can represent it as a fraction and then approximated it in the form above.
\(\approx\) means approximately equal to
after some manipulation:
Since we want to use exactly 53 bits to represent \(J\), we want \(\frac{2^N }{10}\) to be greater than or equal to \(2^{52}\) and less than \(2^{53}\).
This is 2 equations and 2 unknowns, so we can solve now.
Since \(8<10<16\) we can say \(2^3<10<2^4\)
and we can also say that
(larger denominators is smaller numbers if the numerators are the same)
We can put any numerator that is the same across all : $\( \frac{2^N }{2^4} < \frac{2^N }{10} < \frac{2^N }{2^3} \)$
and simplify the outer ones to only have a single exponent
so if we want:
then best \(N\) is 56,.
We can confirm this in a calculator
2**52 <= 2**46/10 < 2**53
False
2**52 <= 2**56/10 < 2**53
True
Now we can get the number we will use for \(J\). We want to get as close to .1 with our fraction as possible, so we get the quotient and remainder, the quotient will be the number we use, but we use the remainder to see if we round up or not
q,r = divmod(2**56,10)
q
7205759403792793
Then we check the remainder to decide if we should round up by 1 or not.
r
6
\( 6 > 5 = \frac{10}{2}\) so we round up
J = q+1
We need \(52-e+1023 = 56\) so
e = 52-56+1023
e
1019
Now lets’ confirm what we found with the actual representation:
Python doesn’t provide a binary reprsentation of floats, but it does provide a hex representation.
float.hex(.1)
'0x1.999999999999ap-4'
bin(J)
'0b11001100110011001100110011001100110011001100110011010'
THat matches this for the part before the p.
after the p is the exponent of \(-4 = 1019-1023\). Which matches the approximation we found.
We can also test by computing
(q+1)/(2**56) ==.1
True
19.6. Prepare for Next Class#
Create operators.md and make some notes about what you know about operators. What kinds of operators are you familiar with? Which have you seen in programming? math?
Refresh your knowledge of logical operations/ logic gates: and, or, xor, not. For an example reference see the logicemu emulator we will use in class
19.7. Badges#
Write a C program to compare values as doubles and as float (single precision/32bit) to see that this comparison issue is related to the IEEE standard and is not language specific. Make notes and comparison around its behavior and include the program in a code cell in cdouble.md
Write a C program to compare values as doubles and as float (single precision/32bit) to see that this comparison issue is related to the IEEE standard and is not language specific. Make notes and comparison around its behavior and include the program in a code cell in cdouble.md
In floatexpt.md design an experiment using the
fractions.Fraction
class in Python that shows helps illustrate why.1*3 == .3
evaluates toFalse
but.1*4 ==.4
evaluates toTrue
. Your file must include both code and interpretation of the results.
19.8. Experience Report Evidence#
19.9. Questions After Today’s Class#
Note
these will be added later, sorry!