11. What is a commit number?#
cd gh-inclass-brownsarahm/
git log
commit e8ab736a7c843e8515e484b136ffcbccfc162618 (HEAD -> organization)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date: Tue Oct 1 13:33:22 2024 -0400
add more stuff
commit 1e2ab9259651a73ad277e826d602514d28969c86 (tag: 0.0.1)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date: Tue Sep 24 13:30:44 2024 -0400
include readmen content
commit 87c72aeca9bd16700fc8fd8ee719136c13e83e01 (origin/organization)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date: Tue Sep 24 13:23:36 2024 -0400
stop tracking
11.1. What is a hash?#
a hash is:
a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table
Common examples of hashing are lookup tables and encryption with a cyrptographic hash.
A hashing function could be really simple, to read off a hash table, or it can be more complex.
For example:
Hash |
content |
0 |
Success |
1 |
Failure |
If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.
This lookup table hash works here.
In a more complex scenario, imagine trying to hash all of the new terms you learn in class.
A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.
A cyrptographic hash is additionally:
unique
not reversible
similar inputs hash to very different values so they appear uncorrelated
Now lets go through each of these properties
11.1.1. Cryptographic Hashes are unique#
This means that two different values we put in should give different results.
For this property alone, a simple function could work:
def basic_unique(input):
return input
but this is not a hash because its length would not be constant and not a cryptographic has because it is easily reversible.
11.1.2. Cryptographic Hashes are not reversible#
This means that given the hash (output), we cannot compute the message(input).
Any function that gives the same output for two (or more) values meets this criteria.
for example modulus:
13%3
1
10%3
1
It can be any function that gives the same output for two (or more) values.
but this is not a cryptographic hash
11.1.4. Hashes are fixed length#
So, no matter the size of the input, we get back the same length.
This is good for memory allocation reasons.
We could again write a function that only does this simply:
def fixed_len(input):
'''
pad or trim
'''
len_target=100
str_in = str(input)
if len(str_in)< len_target:
return str_in.ljust(len_target,'-')
else:
return str_in[:len_target]
11.2. How can hashes be used?#
Hashes can then be used for a lot of purposes:
message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same has)
password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)
file or data identifier (eg in git)
11.2.1. Hashing in passwords#
Passowrds can be encrypted and the encrypted information is stored, then when you submit a candidate password it can compare the hash of the submitted password to the hash that was stored. Since the hashing function is nonreversible, they cannot see the password.
password breach
Some sites are negligent and store passwords unencrypted, if your browser warns you about such a site, proceed with caution and definitely do not reuse a password you ever use. (you should never reuse passwords, but especially do not if there is a warning)
An attacker who gets one of those databases, could build a lookup table. For example, “password” is a bad password because it has been hashed in basically every algorithm and then the value of it can be reversed. Choosing an uncommon password makes it less likely that your password exists in a lookup table.
For example, in SHA-1 the hashing algorithm that git uses
echo "password" | git hash-object --stdin
f3097ab13082b70f67202aab7dd9d1b35b7ceac2
11.3. Hashing in Git#
In git we hash both the content directly to store it in the database (.git) directory and the commit information.
Recall, when we were working in our toy repo we created an empty repository and
then added content directly, we all got the same hash, but when we used git
commit our commits had different hashes because we have different names and
made the commits at different seconds. We also saw that two entries were
created in the .git
directory for the commit.
Git was originally designed to use SHA-1.
Then the SHA-1 collision attack was discovered
Git switched to hardened HSA-1 in response to a collision.
In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from.
git uses the SHA hash primarily for uniuqeness, not privacy
It does provide some security assurances, because we can check the content against the hash to make sure it is what it matches.
This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical options can directly decrypt an SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.
This means it can produce \(2^{160}\) different hashes. Which makes the probability of a collision very low.
The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about \(2^{80}\) (the formula for determining collision probability is
p = n(n-1)/2) * (1/2^160))$. $2^{80}
) is 1.2 x 1024 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.
11.3.1. Workign with git hashes#
Important
We had discussed this bit before, while answering other questions, so I skipped it today, but added it here for completeness.
Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:
minimum 4
must be unique
by using some options on git log
to make it easier to read a set of them
git log --abbrev-commit --pretty=oneline
e8ab736 (HEAD -> organization) add more stuff
1e2ab92 (tag: 0.0.1) include readmen content
87c72ae (origin/organization) stop tracking
d2d1fac organized files into foleders and ignore private
a3904a0 Revert "start organizing"
4ceb150 fill in decritoin more
9120d9d start organizing
f17e276 explain files
72b85c7 add a note
991ee65 (origin/main, origin/HEAD, main) Merge pull request #5 from compsys-progtools/organizing_ac
1ee9c9f (origin/organizing_ac) add files for organizing activity
0c12714 (my_branch_checkedoutb, my_branch) Merge pull request #4 from compsys-progtools/1-create-a-readme
c7375fa (origin/1-create-a-readme, 1-create-a-readme) create readme closes #1
0e7c990 start about file closes #3
0373342 Initial commit
For most project 7 characters is enough and by default, git will give you 7 digits if you use --abbrev-commit
and git will automatically use more if needed.
11.4. How are hashes computed?#
Hashes are computed on the binary representation of the data, most treat that as numbers and then perform mathematical operations.
you can see the psuedocode to SHA-1
Important
this is an example we skipped in class, but I wanted to give some more info
we’ll work a small example, Python can convert characters to ints:
ord('s')
115
and ints to binary
bin(ord('s'))
'0b1110011'
with some extra at the start, so if we just want the binary string
str(bin(ord('s')))[2:]
'1110011'
We can use this to make a function that performs an xor based hash.
def xor_hash(message,n):
'''
a simple hashing algorithm based on xor that can take a message of
any length and return a version of it with a specific bit length n
'''
bitstring = ''.join([str(bin(ord(c)))[2:] for c in message])
# zero pad to make it a multiple of n
extra = len(bitstring)%n
num_pad = (n-extra)*int(extra >0)
padded_bitstring = bitstring + num_pad*'0'
# split into length n chunks
split_bitstring = [padded_bitstring[i:i+n] for i in range(0, len(padded_bitstring), n)]
# now make them back to integers so we can xor
bin_int_strings = [int("0b" + bits,2) for bits in split_bitstring]
# now xor
out_hash_int = 0
for n_int in bin_int_strings:
out_hash_int = out_hash_int^n_int
# we would need to left zero pad the binary to keep it correct
bin_str = str(bin(out_hash_int))[2:]
extra_out = len(bin_str)%n
num_pad_out = (n-extra_out)*int(extra_out >0)
padded_out = num_pad_out*'0' + bin_str
return padded_out
You can see this visualized on python tutor
We can test that we get something the right length
xor_hash('hello',8)
'00001101'
for different lengths of input
xor_hash('it is almost break',8)
'00110000'
This is a sort of meta function, so we could make simpler, more specific functions for each length or we can use it with different ones to see
xor_hash('it is almost break',13)
'0010111001001'
This is not cryptographic though because similar inputs have correlated outputs.
11.5. How do we get to the alphanumeric hashes we see?#
Let’s build this up with some review
11.5.1. What is a Number ?#
a mathematical object used to count, measure and label
11.6. What is a number system?#
While numbers represent quantities that conceptually, exist all over, the numbers themselves are a cultural artifact. For example, we all have a way to represent a single item, but that can look very different.
for example I could express the value of a single item in different ways:
1
I
11.6.1. Hindu Arabic Number system#
In modern, western cultures, our number system is called the hindu-arabic system, it consists of a set of numerals: 0,1,2,3,4,5,6,7,8,9 and is a place based system with base 10.
11.6.1.1. Where does the name come from?#
invented by Hindu mathematicians in India 600 or earlier
called “Arabic” numerals in the West because Arab merchants introduced them to Europeans
slow adoption
11.6.1.2. Numerals#
are the visual characters used to represent quantities
11.6.1.3. Place based#
We use a place based system. That means that the position or place of the symbol changes its meaning. So 1, 10, and 100 are all different values. This system is also a decimal system, or base 10. So we refer to the places and the ones (\(10^0\)), the tens (\(10^1\)), the hundreds(\(10^2\)), etc for all powers of 10.
Number systems can use different characters, use different strategies for representing larger quantities, or both.
Not all systems are place based, for example Roman numerals.
11.6.2. Roman Numerals#
is a number system that uses different numerals and is not a place based system
There are symbols for specific values: 1=I, V=5, X=10, L =50, C = 100, D=500, M = 1000.
In this system the subsequent symbols are either added or subtracted, with no (nonidentity) multipliers based on position.
Instead if the symbol to right is the same or smaller, add the two together, if the left symbol is smaller, subtract it from the one on the right.
For example
roman numerals |
translation |
value (hindu-arabic) |
III |
1+1+1 |
3 |
IV |
-1 + 5 |
4 |
VI |
5 +1 |
4 |
XLIX |
-10 + 50 -1 + 10 |
49 |
11.7. Comparing Bases#
11.7.1. Decimal#
To represent larger numbers than we have digits on we have a base (10) and then.
we have the ones (\(10^0\)) place, tens (\(10^1\)) place, hundreds (\(10^2\)) place etc.
11.7.2. Binary#
Binary is any base two system, and it can be represented using any two distinct numerals.
Binary number systems have origins in ancient cultures:
Egypt (fractions) 1200 BC
China 9th century BC
India 2nd century BC
In computer science we use binary because mechanical computers began using relays (open/closed) to implement logical (boolean; based on true/false, also binary) operations and then digital computers use on and off in their circuits.
We typically represent binary using the same hindu-arabic symbols that we use for other numbers, but only the 0 and 1(the first two). We also keep it as a place-based number system so the places are the ones(\(2^0\)), twos (\(2^1\)), fours (\(2^2\)), eights (\(2^3\)), etc for all powers of 2.
so in binary, the number of characters in the word binary
is 110
.
so 10 in binary is 2 in decimal
11.7.3. Octal#
Is base 8.
We use hindu-arabic symbols, 0,1,2,3,4,5,6,7 (the first eight). Then nine is represented as 11.
In computer science, this numbering system was popular in 6 bit and 12 bit computers, but is has origins before that.
Native Americans using the Yuki Language (based in what is now California)used an octal system because they count using the spaces between fingers and speakers of the Pamean languages in Mexico count on knuckles in a closed fist. Europeans debated using decimal vs octal in the 1600-1800s for various reasons because 8 is better for math mostly. It is also found in Chinese texts dating to 1000BC.
Some examples:
- \[ 10 \rightarrow 1*8^1 + 0*8^0 = 1*8+0*1 = 8\]
- \[ 401 \rightarrow 4*8^2 + 0*8^1 + 1*8^0 = 4*64 + 1*0 = 257\]
In computer science we use octal a lot because it reduces every 3 bits of a number in binary to a single character. So for a large number, in binary say 101110001100
we can change to 5614
which is easier to read, for a person.
101110001100
\(\rightarrow\) 101 110 001 100
\(\rightarrow\) 5614
11.7.4. Hexadecimal#
base 16, common in CS because its 4 bits. we use 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.
This is commonly used for representing colors
Some examples:
This is how the git hash is 160 bits, or 20 bytes (one byte is 8 bits) but we represent it as 40 characters. 160/4=40.
python
Python 3.12.4 (v3.12.4:8e8a4baf65, Jun 6 2024, 17:33:18) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len('f3097ab13082b70f67202aab7dd9d1b35b7ceac2')
40
>>> exit()
11.8. Prepare for Next Class#
(before lab on Tuesday ideally) start commenting/expressing interest on build/explore ideas.
Create a file
gitcommandsbreakdown.md
and for each command in the template below break down what steps it must do based on what we have learned so far about git objects. I started the first one as an example. Next class, we will make a commit using plumbing commands, so thinking about what you already know about commits will prepare you to learn this material.
# What git commands do
## `git status`
- check the branch of the HEAD pointer
- compare the HEAD pointer to the FETCH_HEAD, if different trace back through parent commits to find out how many commits apart they are and which is ahead (or if both ahead and behind)
- compare the snapshot at the HEAD pointer's commit to the current working directory
- if staging is not empty, compare that to the working directory
## `git commit`
-
## `git add`
-
11.9. Badges#
Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts.md with a title
# Git counts
. How many files would have to exist to reach that number of objects assuming every fiile was edited in each of two commits? If you get stuck, outline what you know and then request a review.
Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts_scenarios.md with a title
# Git counts
. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number? How could you get to that number of objects in the fewest number of commits? What might be a typical way to get there? Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.Read about the Learn more about the SHA-1 collision attach
Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below gittransition.md
gittransition
# transition questions
1. Why make the switch? (in detail, not just *an attack*)
2. What impact will the switch have on how git works?
3. Which developers will have the most work to do because of the switch?
11.10. Experience Report Evidence#
11.11. Questions After Today’s Class#
11.11.1. Questions that are good explore badge topics:#
How will quantum computing affect cryptography?
11.11.2. How can hashes be used in programming tasks?#
often for storing or for encrypting.
11.11.3. is hashing used in creating IP Addresses?#
Not in the addressing, but the content that is sent is sometimes hashed.
11.11.4. How do you convert Roman numerals into their corresponding decimal values?#
take what each symbol represents and add or subtract based on their order. the notes should be more clear than in class was.