We have said beforeA commit number is the hash of of the content.
Today we’ll break this down and start discussing representation
What is a hash?¶
a hash is:
a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table
Common examples of hashing are lookup tables and encryption with a cyrptographic hash.
A hashing function could be really simple, to read off a hash table, or it can be more complex.
For example:
Hash | content |
0 | Success |
1 | Failure |
If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.
This lookup table hash works here.
In a more complex scenario, imagine trying to hash all of the new terms you learn in class.
A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.
A cyrptographic hash is additionally:
unique
not reversible
similar inputs hash to very different values so they appear uncorrelated
Now lets go through each of these properties
Cryptographic Hashes are unique¶
This means that two different values we put in should give different results.
For this property alone, a simple function could work:
def basic_unique(input):
return inputbut this is not a hash because its length would not be constant and not a cryptographic has because it is easily reversible.
Cryptographic Hashes are not reversible¶
This means that given the hash (output), we cannot compute the message(input).
Any function that gives the same output for two (or more) values meets this criteria.
for example modulus:
13%3110%31It can be any function that gives the same output for two (or more) values.
but this is not a cryptographic hash
Similar inputs lead to seemingly uncorrelated outpus¶
We can see this using git:
cd ..echo "it's rainy today" | git hash-object --stdinb9d8213c58e0c764af09be950a15f985287c034eand whenw e make smalle changes
echo "it is rainy today" | git hash-object --stdin5f1c15fb282ed50ab8dfec131f50bafbb880bc86the hash looks completely differnt
echo 'it"s rainy today' | git hash-object --stdinffd4e3675f47d1afe0e6273fbc1dcbc9d14d9ccfHashes are fixed length¶
So, no matter the size of the input, we get back the same length.
This is good for memory allocation reasons.
We could again write a function that only does this simply:
def fixed_len(input):
'''
pad or trim
'''
len_target=100
str_in = str(input)
if len(str_in)< len_target:
return str_in.ljust(len_target,'-')
else:
return str_in[:len_target]How can hashes be used?¶
Hashes can then be used for a lot of purposes:
message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same has)
password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)
file or data identifier (eg in git)
Hashing in passwords¶
Passowrds can be encrypted and the encrypted information is stored, then when you submit a candidate password it can compare the hash of the submitted password to the hash that was stored. Since the hashing function is nonreversible, they cannot see the password.
password breach
Some sites are negligent and store passwords unencrypted, if your browser warns you about such a site, proceed with caution and definitely do not reuse a password you ever use. (you should never reuse passwords, but especially do not if there is a warning)
An attacker who gets one of those databases, cannot actually read the passwords, but they could build a lookup table. For example, “password” is a bad password because it has been hashed in basically every algorithm and then the value of it can be reversed. Choosing an uncommon password makes it less likely that your password exists in a lookup table.
For example, in SHA-1 the hashing algorithm that git uses
echo "password" | git hash-object --stdinf3097ab13082b70f67202aab7dd9d1b35b7ceac2Hashing in Git¶
In git we hash both the content directly to store it in the database (.git) directory and the commit information.
Recall, when we were working in our toy repo we created an empty repository and
then added content directly, we all got the same hash, but when we used git
commit our commits had different hashes because we have different names and
made the commits at different seconds. We also saw that two entries were
created in the .git directory for the commit.
Git was originally designed to use SHA-1.
Then the SHA-1 collision attack was discovered
Git switched to hardened HSA-1 in response to a collision.
In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from.
git uses the SHA hash primarily for uniuqeness, not privacy
It does provide some security assurances, because we can check the content against the hash to make sure it is what it matches.
This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical options can directly decrypt an SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.
This means it can produce 2160 different hashes. Which makes the probability of a collision very low.
The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about 280 (the formula for determining collision probability is
p = n(n-1)/2) * (1/2^160))$. $2^{80}) is 1.2 x 1024 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.
Workign with git hashes¶
Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:
minimum 4
must be unique
cd gh-inclass-fa25-brownsarahm/By default git log shows the full hash
git logcommit 99f86bf7112debc934e7fa4504232a48266d90e4 (HEAD, origin/1-add-a-readme, 1-add-a-readme)
Author: Sarah M Brown <brownsarahm@uri.edu>
Date: Thu Sep 11 13:41:24 2025 -0400
create a readme closes #1
commit c8f4926313ba8f6c5bfad3857b7479a666328e6d
Author: github-classroom[bot] <66690702+github-classroom[bot]@users.noreply.github.com>
Date: Thu Sep 11 14:49:01 2025 +0000
Initial commitby using some options on git log to make it easier to read a set of them
git log --abbrev-commit --pretty=oneline99f86bf (HEAD, origin/1-add-a-readme, 1-add-a-readme) create a readme closes #1
c8f4926 Initial commitFor most project 7 characters is enough and by default, git will give you 7 digits if you use --abbrev-commit and git will automatically use more if needed.
How are hashes computed?¶
Hashes are computed on the binary representation of the data, most treat that as numbers and then perform mathematical operations.
you can see the psuedocode to SHA-1
we’ll work a small example, Python can convert characters to ints:
ord('s')115and ints to binary
bin(ord('s'))'0b1110011'with some extra at the start, so if we just want the binary string
str(bin(ord('s')))[2:]'1110011'We can use this to make a function that performs an xor based hash.
def xor_hash(message,n):
'''
a simple hashing algorithm based on xor that can take a message of
any length and return a version of it with a specific bit length n
'''
bitstring = ''.join([str(bin(ord(c)))[2:] for c in message])
# zero pad to make it a multiple of n
extra = len(bitstring)%n
num_pad = (n-extra)*int(extra >0)
padded_bitstring = bitstring + num_pad*'0'
# split into length n chunks
split_bitstring = [padded_bitstring[i:i+n] for i in range(0, len(padded_bitstring), n)]
# now make them back to integers so we can xor
bin_int_strings = [int("0b" + bits,2) for bits in split_bitstring]
# now xor
out_hash_int = 0
for n_int in bin_int_strings:
out_hash_int = out_hash_int^n_int
# we would need to left zero pad the binary to keep it correct
bin_str = str(bin(out_hash_int))[2:]
extra_out = len(bin_str)%n
num_pad_out = (n-extra_out)*int(extra_out >0)
padded_out = num_pad_out*'0' + bin_str
return padded_outYou can see this visualized on python tutor
We can test that we get something the right length
xor_hash('hello',8)'00001101'for different lengths of input
xor_hash('it is almost break',8)'00110000'This is a sort of meta function, so we could make simpler, more specific functions for each length or we can use it with different ones to see
xor_hash('it is almost break',13)'0010111001001'How do we get to the alphanumeric hashes we see?¶
Let’s build this up with some review
What is a Number ?¶
a mathematical object used to count, measure and label
What is a number system?¶
While numbers represent quantities that conceptually, exist all over, the numbers themselves are a cultural artifact. For example, we all have a way to represent a single item, but that can look very different.
for example I could express the value of a single item in different ways:
1
I
Hindu Arabic Number system¶
In modern, western cultures, our number system is called the hindu-arabic system, it consists of a set of numerals: 0,1,2,3,4,5,6,7,8,9 and is a place based system with base 10.
Where does the name come from?¶
invented by Hindu mathematicians in India 600 or earlier
called “Arabic” numerals in the West because Arab merchants introduced them to Europeans
slow adoption
Numerals¶
are the visual characters used to represent quantities
Place based¶
We use a place based system. That means that the position or place of the symbol changes its meaning. So 1, 10, and 100 are all different values. This system is also a decimal system, or base 10. So we refer to the places and the ones (100), the tens (101), the hundreds(102), etc for all powers of 10.
Number systems can use different characters, use different strategies for representing larger quantities, or both.
Not all systems are place based, for example Roman numerals.
Roman Numerals¶
is a number system that uses different numerals and is not a place based system
There are symbols for specific values: 1=I, V=5, X=10, L =50, C = 100, D=500, M = 1000.
In this system the subsequent symbols are either added or subtracted, with no (nonidentity) multipliers based on position.
Instead if the symbol to right is the same or smaller, add the two together, if the left symbol is smaller, subtract it from the one on the right.
For example
roman numerals | translation (base 10 hindu arabic) | value (hindu-arabic) |
III | 1+1+1 | 3 |
IV | -1 + 5 | 4 |
VI | 5 +1 | 4 |
XLIX | -10 + 50 -1 + 10 | 49 |
Comparing Bases¶
Decimal¶
To represent larger numbers than we have digits on we have a base (10) and then.
we have the ones (100) place, tens (101) place, hundreds (102) place etc.
Binary¶
Binary is any base two system, and it can be represented using any two distinct numerals.
Binary number systems have origins in ancient cultures:
Egypt (fractions) 1200 BC
China 9th century BC
India 2nd century BC
In computer science we use binary because mechanical computers began using relays (open/closed) to implement logical (boolean; based on true/false, also binary) operations and then digital computers use on and off in their circuits.
We typically represent binary using the same hindu-arabic symbols that we use for other numbers, but only the 0 and 1(the first two). We also keep it as a place-based number system so the places are the ones(20), twos (21), fours (22), eights (23), etc for all powers of 2.
so in binary, the number of characters in the word binary is 110.
8 | 4 | 2 | 2 |
|---|---|---|---|
1 | 0 | 0 | 1 |
so 10 in binary is 2 in decimal
Octal¶
Is base 8.
We use hindu-arabic symbols, 0,1,2,3,4,5,6,7 (the first eight). Then nine is represented as 11.
In computer science, this numbering system was popular in 6 bit and 12 bit computers, but is has origins before that.
Native Americans using the Yuki Language (based in what is now California)used an octal system because they count using the spaces between fingers and speakers of the Pamean languages in Mexico count on knuckles in a closed fist. Europeans debated using decimal vs octal in the 1600-1800s for various reasons because 8 is better for math mostly. It is also found in Chinese texts dating to 1000BC.
Some examples:
In computer science we use octal a lot because it reduces every 3 bits of a number in binary to a single character. So for a large number, in binary say 101110001100 we can change to 5614 which is easier to read, for a person.
101110001100 101 110 001 100 5614
Hexadecimal¶
base 16, commonin CS because its 4 bits. we use 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.
This is commonly used for representing colors
This is how the git hash is 160 bits, or 20 bytes (one byte is 8 bits) but we represent it as 40 characters. 160/4=40.
Some examples:
How can we manipulate bits?¶
We have several bitwise operators:
&: and|: or^: xor~: not>>: shift right<<: shift left
Let’s review truth tables for and, or, and xor.
Table 4:AND
a | b | output |
|---|---|---|
0 | 0 | 0 |
0 | 1 | 0 |
1 | 0 | 0 |
1 | 1 | 1 |
Table 5:OR
a | b | output |
|---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 1 |
Table 6:XOR
a | b | output |
|---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
In order to implement more complex calculations, using gates, we can use these tables as building blocks compared to the required output.
python3We can do xor:
1^011^104&542>>202<<28## Prepare for Next Class
```{include} ../_prepare/2025-10-16.mdBadges¶
Review the notes from today
Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts.md with a title
# Git counts. How many commits would you be able to make before you have to use more than the minimum number of characters to refer to any git object if each commit added a new folder with 1021 files in it? If you get stuck, outline what you know and then request a review.
Review the notes from today
Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts_scenarios.md with a title
# Git counts. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number? How could you get to that number of objects in the fewest number of commits? What is the maximum number of files you can edit per commit and still take have at least 128 commits? Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.Read about the Learn more about the SHA-1 collision attach
Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below gittransition.md
gittransition¶
# transition questions
1. Why make the switch? (in detail, not just *an attack*)
2. What impact will the switch have on how git works?
3. Which developers will have the most work to do because of the switch?Experience Report Evidence¶
save your command history to 2023-10-19-log.txt and put that file in your KWL repo
Questions After Today’s Class¶
How does Git transition from SHA-1 to SHA-256 while keeping older repositories compatible?¶
That work is not yet done, but it is started
git 3.0 will include it, but that piece not being done is one of the main things left to do discussion
How are binary and hex conversions actually handled inside Git’s C code when computing hashes?¶
This is a good explore badge topic. You can view the source code.
how does the computer actually read binary¶
This is all a computer does; everything we see is the result of it reading the binary data.
We will unpack this over a few more class sessions.
Is there a system for the low chance that there is a collision in hashes.¶
the hardened algorithm does some. Otherwise it’s just so low chance that we trust it won’t happen.
Why hasn’t GitHub added support for SHA-256 repositories? You can already create them locally using git init --object-format=256¶
I’m not sure why, just that it has not been done.
https://
We saw the constants used for some operations in SHA-1. How did they come up with them?¶
This is a good explore badge
Are they more number systems used commonly in computing beyond binary, octal, and hexadecimal?¶
These are the main ones.
why did git switch from the regular SHA-1¶
How are hashing algorithms created?¶
This is what the feild of cryptography does!
Professor Lamanga has taught a course on that, and may again in the spring.