What is a commit number?

13. What is a commit number?#

Important

spring break is like a time pause (you get an extra week on things assigned last week and this week)

13.1. What is a hash?#

a hash is:

a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table

Common examples of hashing are lookup tables and encryption with a cyrptographic hash.

A hashing function could be really simple, to read off a hash table, or it can be more complex.

If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.

For example:

Hash	content
0	Success
1	Failure

In a more complex scenario, imagine trying to hash all of the new terms you learn in class.

A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.

A cyrptographic hash is additionally:

unique
not reversible
similar inputs hash to very different values so they appear uncorrelated

Now lets go through each of these properties

13.1.1. Hashes are fixed length#

So, no matter the size of the input, we get back the same length.

This is good for memory allocation reasons.

13.1.2. Hashes are unique#

This means that two different values we put in should give different results.

For this property alone, a simple function could work:

def basic_unique(input):
    return input

but this is not a hash because its length would not be constant and not a cryptographic has because it is easily reversible.

13.1.3. Hashes are not reversible#

This means that given the hash, we cannot compute the input.

An example of a nonreversible function is modulus:

13%3

10%3

It can be any function that gives the same output for two (or more) values.

13.1.4. Similar inputs yeild uncorrelated outputs#

We can see this using git:

echo "it's almost spring" | git hash-object --stdin

1327354a0d47111e361ec52acc46b9509bed69a0

and we get a hash

and then we change it just a little bit:

echo "it's almsot spring" | git hash-object --stdin

f798473089b4a00684cd0f13a6f2d1d7a0e033ba

and now the hash is completely different

13.2. How can hashes be used?#

Hashes can then be used for a lot of purposes:

message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same has)
password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)
file or data identifier (eg in git)

13.2.1. Hashing in passwords#

Passowrds can be encrypted and the encrypted information is stored, then when you submit a candidate password it can compare the hash of the submitted password to the hash that was stored. Since the hashing function is nonreversible, they cannot see the password.

Some sites are negligent and store passwords unencrypted, if your browser warns you about such a site, proceed with caution and definitely do not reuse a password you ever use. (you should never reuse passwords, but especially do not if there is a warning)

An attacker who gets an encyrpted database, cannot actually read the passwords, but they could build a lookup table. For example, “password” is a bad password because it has been hashed in basically every algorithm and then the value of it can be reversed. Choosing an uncommon password makes it less likely that your password exists in a lookup table.

For example, in SHA-1 the hashing algorithm that git uses

echo "password" | git hash-object --stdin

f3097ab13082b70f67202aab7dd9d1b35b7ceac2

13.2.2. Hashing in Git#

In git we hash both the content directly to store it in the database (.git) directory and the commit information.

Recall, when we were working in our toy repo we created an empty repository and then added content directly, we all got the same hash, but when we used git commit our commits had different hashes because we have different names and made the commits at different seconds. We also saw that two entries were created in the .git directory for the commit.

Git was originally designed to use SHA-1.

Then the SHA-1 collision attack was discovered

Git switched to hardened HSA-1 in response to a collision.

In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash. from.

they will change again soon

git uses the SHA hash primarily for uniuqeness, not privacy

It does provide some security assurances, because we can check the content against the hash to make sure it is what it matches.

This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical options can directly decrypt an SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.

This means it can produce $2^160$ different hashes. Which makes the probability of a collision very low.

The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about $2^{80}$ (the formula for determining collision probability is $p = (n(n-1)/2) * (1/2^160))$. $2^{80}$) is 1.2 x 1024 or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.

—A SHORT NOTE ABOUT SHA-1 in the Git Documentation

13.3. Working with git hashes#

Mostly, a shorter version of the commit is sufficient to be unique, so we can use those to refer to commits by just a few characters:

minimum 4
must be unique

we’ll see this in our github inclass repo

cd gh-inclass-sp24-brownsarahm/

by using some options on git log to make it easier to read a set of them

git log --abbrev-commit --pretty=oneline

9f39946 (HEAD) Merge pull request #5 from compsys-progtools/organizing_ac
fca59e8 (tag: v0.1, origin/organizing_ac) add files for organizing
1e2a45f (my_branch_cehckoutb, my_branch) Merge pull request #4 from compsys-progtools/2-create-an-about-file
81c6f18 (origin/2-create-an-about-file, 2-create-an-about-file) create about file close s #2
faef6af start readme, closes #3
98cff65 Initial commit

For most project 7 characters is enough and by default, git will give you 7 digits if you use --abbrev-commit and git will automatically use more if needed.

13.4. How are hashes computed?#

Important

I did not have this explicitly in class, but it was implied and I thought it was better to add it here

Hashes are computed on the binary representation of the data, most treat that as numbers and then perform mathematical operations.

you can see the psuedocode to SHA-1

we’ll work a small example, Python can convert characters to ints:

ord('s')

and ints to binary

bin(ord('s'))

'0b1110011'

with some extra at the start, so if we just want the binary string

str(bin(ord('s')))[2:]

'1110011'

We can use this to make a function that performs an xor based hash.

def xor_hash(message,n):
    '''
    a simple hashing algorithm based on xor that can take a message of 
    any length and return a version of it with a specific bit length n
    '''
    bitstring = ''.join([str(bin(ord(c)))[2:] for c in message])

    # zero pad to make it a multiple of n
    extra = len(bitstring)%n
    num_pad = (n-extra)*int(extra >0)
    padded_bitstring = bitstring + num_pad*'0'

    # split into length n chunks
    split_bitstring = [padded_bitstring[i:i+n] for i in range(0, len(padded_bitstring), n)]

    # now make them back to integers so we can xor
    bin_int_strings = [int("0b" + bits,2) for bits in split_bitstring]
    
    # now xor
    out_hash_int = 0
    for n_int in bin_int_strings:
        out_hash_int = out_hash_int^n_int

    # we would need to left zero pad the binary to keep it correct
    bin_str = str(bin(out_hash_int))[2:]
    extra_out = len(bin_str)%n
    num_pad_out = (n-extra_out)*int(extra_out >0)
    padded_out =  num_pad_out*'0' + bin_str
    return padded_out

You can see this visualized on python tutor

We can test that we get something the right length

xor_hash('hello',8)

'00001101'

for different lengths of input

xor_hash('it is almost break',8)

'00110000'

This is a sort of meta function, so we could make simpler, more specific functions for each length or we can use it with different ones to see

xor_hash('it is almost break',13)

'0010111001001'

This is not cryptographic though because similar inputs have correlated outputs.

13.5. How do we get to the alphanumeric hashes we see?#

The actual hashing algorithms are mathematical, they work by treating the binary input

Let’s build this up with some review

13.5.1. What is a Number ?#

a mathematical object used to count, measure and label

13.5.2. What is a number system?#

While numbers represent quantities that conceptually, exist all over, the numbers themselves are a cultural artifact. For example, we all have a way to represent a single item, but that can look very different.

for example I could express the value of a single item in different ways:

1
I

13.5.3. Hindu Arabic Number system#

In modern, western cultures, our number system is called the hindu-arabic system, it consists of a set of numerals: 0,1,2,3,4,5,6,7,8,9 and uses a place based system with base 10.

13.5.3.1. Where does the name come from?#

invented by Hindu mathematicians in India 600 or earlier
called “Arabic” numerals in the West because Arab merchants introduced them to Europeans
slow adoption

13.5.3.2. Numerals#

are the visual characters used to represent quantities

13.5.3.3. Place based#

We use a place based system. That means that the position or place of the symbol changes its meaning. So 1, 10, and 100 are all different values. This system is also a decimal system, or base 10. So we refer to the places and the ones ($10^0$), the tens ($10^1$), the hundreds($10^2$), etc for all powers of 10.

Number systems can use different characters, use different strategies for representing larger quantities, or both.

13.5.4. Roman Numerals#

is a number system that uses different numerals and is not a place based system

There are symbols for specific values: 1=I, V=5, X=10, L =50, C = 100, D=500, M = 1000.

Not all systems are place based, for example Roman numerals. In this system the subsequent symbols are either added or subtracted, with no (nonidentity) multipliers based on position. Instead if the symbol to right is the same or smaller, add the two together, if the left symbol is smaller, subtract it from the one on the right.

roman numerals	translation	value (hindu-arabic)
III	1+1+1	3
IV	-1 + 5	4
VI	5 +1	4
XLIX	-10 + 50 -1 + 10	49

13.6. Comparing Bases#

13.6.1. Decimal#

To represent larger numbers than we have digits on we have a base (10) and then.

\[10 = 10*1 + 1*0\]

\[22 = 10*2 + 1*2 \]

we have the ones ($10^0$) place, tens ($10^1$) place, hundreds ($10^2$) place etc.

13.6.2. Binary#

Binary is any base two system, and it can be represented using any different characters.

Binary number systems have origins in ancient cultures:

Egypt (fractions) 1200 BC
China 9th century BC
India 2nd century BC

In computer science we use binary because mechanical computers began using relays (open/closed) to implement logical (boolean) operations and then digital computers use on and off in their circuits.

We represent binary using the same hindu-arabic symbols that we use for other numbers, but only the 0 and 1(the first two). We also keep it as a place-based number system so the places are the ones($2^0$), twos ($2^1$), fours ($2^2$), eights ($2^3$), etc for all powers of 2.

so in binary, the number of characters in the word binary is 110.

\[ 10 => 1*2^1 + 0*2^0 = 1*2 + 0*1 = 2\]

so 10 in binary is 2 in decimal

Convert $ 1001 $ in binary to decimal

\[ 1001 => 1*8 + 0*4 + 0*2 + 1*1 = 9\]

13.6.3. Octal#

Is base 8.

We use hindu-arabic symbols, 0,1,2,3,4,5,6,7 (the first eight). Then nine is represented as 11.

In computer science, this numbering system was popular in 6 bit and 12 bit computers, but is has origins before that.

Native Americans using the Yuki Language (based in what is now California)used an octal system because they count using the spaces between fingers and speakers of the Pamean languages in Mexico count on knuckles in a closed fist. Europeans debated using decimal vs octal in the 1600-1800s for various reasons because 8 is better for math mostly. It is also found in Chinese texts dating to 1000BC.

Some examples: $$ 10 = > 1*8^1 + 0*8^0 = 1*8+0*1 = 8$$ so 10 in octal is 8 in decimal

\[ 401 => 4*8^2 + 0*8^1 + 1*8^0 = 4*64 + 1*0 = 257\]

In computer science we use octal a lot because it reduces every 3 bits of a number in binary to a single character. So for a large number, in binary say 101110001100 we can change to 5614 which is easier to read, for a person.

13.6.4. Hexadecimal#

base 16, common in CS because its 4 bits. we use 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.

This is commonly used for representing colors

Some examples:

\[ A5 ==> (9+1)*16^1 + 5*16^0 = 10*16 + 5*1 = 165\]

\[ E40 ==> 14*16^2 + 4*16^1 + 0*16^0 = 14*256 + 4*16 +0*1 = 3648\]

It is also is how the git hash is 160 bits, or 20 bytes (one byte is 8 bits) but we represent it as 40 characters. 160/4=40.

13.7. Prepare for Next Class#

Be sure to get some rest during the break. Come to class ready to learn.

If you do work during the break, be sure to update your ide notes.

13.8. Badges#

Review

Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts.md with a title # Git counts. Describe the scenario that would get you to that number of objects with the maximum or minimum number of commits in terms of what types of objects would exist. Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.

Practice

Analyze the xor hashing algorithm in the notes to determine which properties of a cryptographic hash are/not met. Include your analysis in xorhash.md
find 2 more real world examples of using other number systems (either different bases or different symbols and bases) not mentioned in class that are currently used. Describe the number system and its usage in numbers.md. Include links to your sources and be sure that the sources are trustworthy.
Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object and include that number in gitcounts_scenarios.md with a title # Git counts. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number, how many files could you have? How could you get to that number of objects in the fewest number of commits? What might be a typical way to get there? Assume normal git use with porcelain commands, not atypical cases with plubming commands. If you get stuck, outline what you know and then request a review.
Read about the Learn more about the SHA-1 collision attach
Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below gittransition.md

gittransition

# transition questions

Why make the switch? (in detail, not just *an attack*)
What impact will the switch have on how git works?
Which developers will have the most work to do because of the switch?

13.9. Experience Report Evidence#

13.10. Questions After Today’s Class#

13.10.1. Will we be approved for requests to join teams for builds from the discussion posts made?#

Yes

13.10.2. want to learn more about how to use sha-256#

This is a good community badge or explore option.

You can start with the git transition plan

13.10.3. What are all the parts of the SHA hashing algroithm?#

You can see SHA-1 psuedocode on wikipedia

13.10.4. What makes git’s algorithm “hardened”?#

It uses the regular hashing algorithm, then checks if the resulting hash is vulnerable with a detection algorithm and if so, it computes a different hash. accodring to shattered.

13.11. Are there any benefits of a number system like Roman Numerals over the Hindu-Arabic system?#

It is just different, it was something the Romans developed before Europeans had encountered a system like the Hindu-Arabic system.

13.11.1. Does git have a plan to change its hashing algorithm in the future?#

Yes and a detailed transition plan

13.11.2. what is the most secure hash algorithm?#

This is a good community badge as is, or by comparing several, can be an explore

13.11.3. Do you think it would be as easy for us to do math with a octal (base 8) system as it is with a decimal (base 10) system is if we learned it just as early?#

Yes, if that is what you were used to, that would be it.

We would learn times tables in different ways possibly or different tricks because we represented the quantities differently, but yes!

Working out examples of this could be a good explore badge.

13.11.4. Will quantum computing be able to decript things such as hash encriptions ?#

It is good for random things, so it will probably not be better at this than current computers.

13.11.5. Are the git hashes unique universally, or just across the repository?#

They only have to be unique withing a single repo, for example in the test repo, we all had some identical hashes.

The blob and tree hashes are more likely to repeat across repos than commit hashes though because commit hashes cannot really repeat easily because it requires the same person to commit in two different repos (well the same logged in identity) at the same time.