15. Why did we learn the plumbing commands?#
You will not typically use them on a day to day basis, but they are a good way to see what happens at the interim steps and make sure that you have the right understanding of what git does.
A correct understanding is essential for using more advanced features.
While there is of course some content that we want you to know after this course, my goal is also to teach you process, by modeling it.
No one will ever know all of the things but you can be fast or slow at finding answers.
And you can find correct answers, incorrect answers, or looks-okay-but-you-will-regret-this-later answers.
You do not want to become the colleague that everyone regrets working with.
My goal is that you get good at quickly finding correct answers.
Large language models will not do that for you.
15.1. Today’s Questions#
What are references?
How can I release and share my code?
How else can git help me?
When do the contents of a file get hashed?
[ ] every time you edit a file, git hashes it automatically
[x] when a file is staged for commit
[ ] any time a staged file is edited
[ ] when the commit is created
What type of git object stores the name of a file?
[x] tree
[ ] blob
[ ] commit
[ ] branch
15.2. What does git status do?#
compares the working directory to the current state of the active branch and the index
we can see the working directory with:
ls
we can see the active branch in the HEAD file
What is its status?
Recall that we:
created a blob object directly
created a file
hashed the file, to create its blob object
added the file to the index
wrote a tree from files in the index
created a commit object with a tree and a commit message
copied the commit hash and placed it inside a new file, .git/refs/heads/main, using echo and >
Let’s use inspection to review where our repo is left off from last week.
git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: test.txt
15.3. Branches are references#
Branches are not git objects; they are references.
We can see that by where they are stored.
How can we inspect to see where references are stored?
using ls and/or find
We can see our list of objects with find
find .git/objects/ -type f
.git/objects//0c/1e7391ca4e59584f8b773ecdbbb9467eba1547
.git/objects//d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
.git/objects//d8/329fc1cc938780ffdd9f94e0d364e0ea74f579
.git/objects//e3/ba10cb02de504d4f48b9af4934ddcc4d0be3df
.git/objects//83/baae61804e65cc73a7201a7252750c76066a30
Since most of these objects are not commits, most of the hashes are the same for everyone, but one hash is unique to each of us, so we worked together to identify that one.
Then, for that unique hash, we confirmed it was a commit using git cat-file to view its type:
git cat-file -t e3ba
commit
as expected
Now let’s look at what the HEAD pointer says to try to understand why it does not see that commit, since we know that git status works from HEAD
cat .git/HEAD
ref: refs/heads/main
Important
git updates the file for the branch each time you add a commit with git commit; the first time git commit is run, it also creates the file.
If we look in the folder, we can see what is in there
ls .git/refs/heads/
main
cat .git/refs/heads/main
e3ba10cb02de504d4f48b9af4934ddcc4d0be3df
15.3.1. Updating a branch manually#
The branch file is named for the branch (here main), is stored in the .git/refs/heads/ folder, and contains the full hash of the commit that the branch points to.
Since we made the commit manually, we need to move the branch manually too. We will use echo, copying the full hash from above (I copied the part after the last / in the list of objects above and then typed the e3 before pasting).
Mine looks like:
echo e3ba10cb02de504d4f48b9af4934ddcc4d0be3df > .git/refs/heads/main
but your commit hash will be different than mine.
Now we check with git again:
git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: test.txt
no changes added to commit (use "git add" and/or "git commit -a")
and now we can see that it no longer sees the staged file and does see our commit.
git cat-file -p 188a
tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
author Ayman Sandouk <Ayman_sandouk@uri.edu> 1677177139 -0500
committer Ayman Sandouk <Ayman_sandouk@uri.edu> 1677177139 -0500
first commit
git log
So we now have HEAD -> main and main -> our commit -> tree -> blob.
Branches in git are references to specific commits.
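Because references are plain files, the chain from HEAD to a commit can be followed without any git command. The sketch below is only an illustration: `resolve_head` is a hypothetical helper name, and it assumes the usual layout where `.git/HEAD` holds a line such as `ref: refs/heads/main` and the branch file holds a full commit hash. It does not handle packed refs.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Follow HEAD -> branch file -> commit hash by reading plain files.

    Returns (ref_name, commit_hash). ref_name is None for a detached HEAD,
    and commit_hash is None when the branch file does not exist yet.
    """
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    prefix = "ref: "
    if head.startswith(prefix):
        ref_name = head[len(prefix):]       # e.g. refs/heads/main
        ref_file = git_dir / ref_name       # e.g. .git/refs/heads/main
        if not ref_file.exists():
            # no commit on this branch yet (or the ref is packed)
            return ref_name, None
        return ref_name, ref_file.read_text().strip()
    # detached HEAD: the file holds a commit hash directly
    return None, head
```

Calling `resolve_head(".git")` from the top level of a repository should report the same branch and hash that `cat .git/HEAD` and `cat .git/refs/heads/main` show.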
15.5. What does this mean?#
We tend to think of commits like this:
In reality
15.6. We can move pointers around freely#
flowchart BT
blob1
blob2
blob3
subgraph A[first commit]
direction TB
commitA
%% file1v1
treeA
commitA-->treeA
end
subgraph B[second commit]
direction TB
%% file1v2
%% file2v1
%% file3v1
commitB
treeB
commitB-->treeB
end
%% subgraph C[third commit]
%% direction TB
%% %% file1v2
%% %% file2v1
%% %% file3v1
%% commitC
%% commitC-->treeC
%% end
B--> A
%% treeC-->blob1
%% treeC-->blob2
treeB-->blob3
%% B--> A
commitB -->commitA
15.7. Experience Report Evidence#
Write your git status and object list to a file.
15.8. What is a hash?#
a hash is:
a fixed size value that can be used to represent data of arbitrary sizes
the output of a hashing function
often fixed to a hash table
Common examples of hashing are lookup tables and encryption with a cryptographic hash.
A hashing function could be really simple, to read off a hash table, or it can be more complex.
For example:
Hash | content
0 | Success
1 | Failure
If we want to represent the status of a program running it has two possible outcomes: success or failure. We can use the following hash table and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 and 1 as a single bit of information that corresponds to the outcomes.
This lookup table hash works here.
In a more complex scenario, imagine trying to hash all of the new terms you learn in class.
A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.
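The contrast between the two approaches can be sketched in a few lines. Everything here is hypothetical and only for illustration: `table_hash` reads the success/failure table above, while `general_hash` is a toy strategy (sum of the bytes, modulo an assumed number of buckets) that works for terms it has never seen. It is not a cryptographic hash.

```python
# Lookup-table hash: only works for content we listed ahead of time.
STATUS_TABLE = {"Success": 0, "Failure": 1}


def table_hash(content):
    """Return the hash for a known outcome by reading it off the table."""
    return STATUS_TABLE[content]


def general_hash(term, num_buckets=16):
    """Toy general strategy: add up the UTF-8 bytes, then take the remainder.

    Works for any string without knowing the full list of terms in advance.
    """
    return sum(term.encode("utf-8")) % num_buckets


print(table_hash("Success"), table_hash("Failure"))
print(general_hash("blob"), general_hash("tree"), general_hash("commit"))
```

The table version raises a `KeyError` for any term that is not already in the table, which is the limitation described above.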
A cryptographic hash is additionally:
unique
not reversible
similar inputs hash to very different values so they appear uncorrelated
Now let’s go through each of these properties.
15.8.1. Cryptographic Hashes are unique#
This means that two different values we put in should give different results.
For this property alone, a simple function could work:
def basic_unique(value):
    # different inputs always give different outputs, because nothing changes
    return value
but this is not a hash because its length would not be constant, and it is not a cryptographic hash because it is easily reversible.
15.8.2. Cryptographic Hashes are not reversible#
This means that given the hash (output), we cannot compute the message (input).
Any function that gives the same output for two (or more) values meets this criterion.
For example, modulus:
13%3
1
10%3
1
Modulus gives the same output for more than one input, so it meets this criterion, but it is not a cryptographic hash.
We can use the git hashing algorithm without writing to the repo too:
Then we get the hash back. Try changing the input just a little and running the hashing algorithm again.
Do similar inputs have similar hashes?
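One way to try this without touching a repository is to reproduce the blob hash in Python. This is a sketch, not git's own code: `git_blob_hash` is a hypothetical helper that assumes the SHA-1 object format, where git hashes a header of the form `blob <size>` plus a null byte, followed by the content. The sample strings are arbitrary, and the trailing newline mimics what `echo` adds.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way git hashes a blob: header + null byte + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two inputs that differ by a single character.
first = git_blob_hash(b"test content\n")
second = git_blob_hash(b"test content!\n")

print(first)
print(second)
# Count the positions where the two 40-character hashes agree.
print(sum(a == b for a, b in zip(first, second)), "of 40 characters match")
```

If the helper matches git, the first value should equal what `echo "test content" | git hash-object --stdin` prints.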
15.8.4. Hashes are fixed length#
So, no matter the size of the input, we get back the same length.
This is good for memory allocation reasons.
We could again write a function that only does this simply:
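def fixed_len(value):
    '''
    pad or trim the string form of value to exactly len_target characters
    '''
    len_target = 100
    str_in = str(value)
    if len(str_in) < len_target:
        # too short: pad on the right with '-'
        return str_in.ljust(len_target, '-')
    else:
        # long enough: keep only the first len_target characters
        return str_in[:len_target]
But, again, this is not a cryptographic hash.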
Try fixed_len("lkjhlkjhlkjhlkjlkjhlkjhlkjhlkjhkljhjlkhlkjhlkjhlkjhlkjhkjlhkjlhhhhhjlhlfgfgfgfgfgfgfgfgfkjhlkjhljkhnbh123")
Try fixed_len("lkjhlkjhlkjhlkjlkjhlkjhlkjhlkjhkljhjlkhlkjhlkjhlkjhlkjhkjlhkjlhhhhhjlhlfgfgfgfgfgfgfgfgfkjhlkjhljkhnbh258")
What are some ways a hash could be used?
15.9. How can hashes be used?#
Hashes can then be used for a lot of purposes:
message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same hash)
password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)
file or data identifier (eg in git)
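The message integrity case can be sketched as follows. This is a minimal illustration, not a real protocol: `digest` and `is_intact` are hypothetical names, and SHA-256 from Python's `hashlib` is an assumed choice of hashing function.

```python
import hashlib


def digest(message: bytes) -> str:
    """Hash a message (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(message).hexdigest()


def is_intact(received_message: bytes, sent_hash: str) -> bool:
    """Re-hash what arrived and compare it to the hash that was sent along."""
    return digest(received_message) == sent_hash


# The sender transmits both the message and its hash.
message = b"meet at noon"
sent_hash = digest(message)

print(is_intact(b"meet at noon", sent_hash))  # message arrived unchanged
print(is_intact(b"meet at moon", sent_hash))  # one character was altered
```

The same compare-the-hashes pattern underlies the password check described next: only hashes are compared, never the original values.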
15.9.1. Hashing in passwords#
Passwords can be hashed and the hash is stored; then, when you submit a candidate password, the site can compare the hash of the submitted password to the hash that was stored. Since the hashing function is not reversible, the site cannot see the password.
Some sites are negligent and store passwords unencrypted. If your browser warns you about such a site, proceed with caution and definitely do not reuse a password you use anywhere else. (You should never reuse passwords, but especially do not if there is a warning.)
An attacker who gets a database of hashed passwords cannot actually read the passwords, but they could build a lookup table. For example, “password” is a bad password because it has been hashed with basically every algorithm, so its hash can be looked up to recover the value. Choosing an uncommon password makes it less likely that your password exists in a lookup table.
For example, with SHA-1, the hashing algorithm that git uses:
echo "password" | git hash-object --stdin
f3097ab13082b70f67202aab7dd9d1b35b7ceac2
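The lookup table idea can be sketched as below. The list of common passwords and the helper names are hypothetical. The sketch uses plain SHA-1 from `hashlib`, so its values differ from the `git hash-object` output, because git adds a header before hashing and `echo` adds a trailing newline.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Plain SHA-1 of a string (no git header, no trailing newline)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical list of common passwords an attacker might precompute.
COMMON_PASSWORDS = ["password", "123456", "qwerty", "letmein"]

# Precomputed table: hash -> original password.
LOOKUP = {sha1_hex(p): p for p in COMMON_PASSWORDS}


def look_up(stored_hash: str):
    """Return the password if its hash is in the table, otherwise None."""
    return LOOKUP.get(stored_hash)


print(look_up(sha1_hex("password")))               # found in the table
print(look_up(sha1_hex("plum-tractor-9-violin")))  # not in the table
```

No hash is reversed here; the table only recognizes hashes of passwords that were hashed ahead of time.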
15.10. Hashing in Git#
In git we hash both the content itself, to store it in the database (the .git directory), and the commit information.
Recall that when we were working in our test repo, we created an empty repository and then added content directly, and we all got the same hash. When we used git commit, our commits had different hashes because we have different names and made the commits at different seconds.
We also saw that two entries were created in the .git directory for the commit.
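The difference between shared blob hashes and unique commit hashes can be sketched in Python. This is an illustration under assumptions, not git's implementation: it assumes the SHA-1 object format (a `<type> <size>` header, a null byte, then the body) and the commit layout printed by `git cat-file -p`. The two authors are hypothetical.

```python
import hashlib


def git_object_hash(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>' + null byte + body, as git does for its objects."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, author: str, timestamp: int, message: str) -> bytes:
    """Build commit text in the layout that git cat-file -p prints."""
    ident = f"{author} {timestamp} -0500"
    text = f"tree {tree}\nauthor {ident}\ncommitter {ident}\n\n{message}\n"
    return text.encode("utf-8")


tree = "d8329fc1cc938780ffdd9f94e0d364e0ea74f579"

# Same file content gives the same blob hash for everyone.
blob_a = git_object_hash("blob", b"version 1\n")
blob_b = git_object_hash("blob", b"version 1\n")

# Same tree and message, but different (hypothetical) authors and seconds.
commit_a = git_object_hash(
    "commit", commit_body(tree, "Student A <a@example.com>", 1677177139, "first commit"))
commit_b = git_object_hash(
    "commit", commit_body(tree, "Student B <b@example.com>", 1677177140, "first commit"))

print(blob_a == blob_b)      # identical content, identical hash
print(commit_a == commit_b)  # author and time are part of what is hashed
```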
Git was originally designed to use SHA-1.
Then the SHA-1 collision attack was discovered.
Git switched to hardened SHA-1 in response to a collision.
In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash.
GitHub uses git; it is not an alternative implementation or a fork, so yes, it will switch too. The developers at GitHub and other git hosts are among the most impacted by the change since they write code that directly interacts with git objects.
git uses the SHA hash primarily for uniqueness, not privacy.
It does provide some security assurances, because we can check the content against the hash to make sure it matches.
This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical operations can directly reverse a SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.
The SHA-1 Algorithm hashes content into a fixed length of 160 bits.
how many different possible hashes can it produce?
[ ] 160*160
[x] 2^160
[ ] 160*2
[ ] 160^2
This means it can produce \(2^{160}\) different hashes, which makes the probability of a collision very low.
The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about \(2^{80}\) (the formula for determining collision probability is \(p = \frac{n(n-1)}{2} \times \frac{1}{2^{160}}\)). \(2^{80}\) is \(1.2 \times 10^{24}\) or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.
– A SHORT NOTE ABOUT SHA-1 in the Git Documentation
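The numbers in the quote can be checked with exact integer arithmetic. The sketch below simply evaluates the quoted approximation p = (n(n-1)/2) * (1/2^160); the function name is hypothetical.

```python
from fractions import Fraction

HASH_BITS = 160  # SHA-1 produces 160-bit hashes


def collision_probability(n: int, bits: int = HASH_BITS) -> Fraction:
    """Evaluate the quoted approximation p = (n(n-1)/2) * (1/2^bits)."""
    return Fraction(n * (n - 1), 2) * Fraction(1, 2 ** bits)


n = 2 ** 80
print(2 ** HASH_BITS)                     # number of possible hashes
print(n)                                  # about 1.2 x 10^24 objects
print(float(collision_probability(n)))    # close to one half
print(float(collision_probability(10 ** 9)))  # a billion objects: tiny
```

Even a billion objects leaves the probability far below anything that would matter in normal use.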
15.11. Prepare for Next Class#
[ ] Think about what you know about networking
[ ] Get the big ideas of HPC by reading this IBM intro page and about some hypothetical people who would attend an HPC carpentry workshop. Make a list of key terms as an issue comment.
[ ] Look over build/explore ideas if you plan to do any on discussion repo. Build ideas. Explore ideas
15.12. Badges#
[ ] Read about the SHA-1 collision attack
[ ] Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object (the minimum is 4; that’s usually enough for us to use something like the cat-file command, git tag, or even git checkout) and include that number in gitcounts.md with a title # Git counts. How many files would have to exist to reach that number of objects assuming every file was edited in each of two commits? If you get stuck, outline what you know and then request a review.
[ ] Create tagtypeexplore.md with the template below. Determine how many of the tags in the course website are annotated vs lightweight using. (You may need to use git pull --tags in your clone of the course website)
# Tags
<!-- short definition/description in your own words of what a tag is and what it is for -->
## Inspecting tags
Course website tags by type:
- annotated:
- lightweight:
[ ] Create tagtypes.md with the template below. Include an experiment that shows which, if either, type of tag creates a new git object. There are two types; try creating one of each: a lightweight tag (provide only the tag name, what we did in class) and an annotated tag (provide a name and a message with -m).
[ ] Determine how many of the tags in the course website are annotated vs lightweight using. (You may need to use git pull --tags in your clone of the course website)
# Tags
<!-- short definition/description in your own words of what a tag is and what it is for -->
## Comparing tag types
<!-- include your experiment terminal history and interpretation -->
## Inspecting tags
Course website tags by type:
- annotated:
- lightweight:
<!-- include lists of tags for each type -->
[ ] Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object (the minimum is 4; that’s usually enough for us to use something like the cat-file command, git tag, or even git checkout) and include that number in gitcounts_scenarios.md with a title # Git counts. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number? How could you get to that number of objects in the fewest number of commits? What might be a typical way to get there? Assume normal git use with porcelain commands, not atypical cases with plumbing commands. If you get stuck, outline what you know and then request a review.
[ ] Read about the SHA-1 collision attack
[ ] Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below in gittransition.md
gittransition
# transition questions
1. Why make the switch? (in detail, not just *an attack*)
2. What impact will the switch have on how git works?
3. Which developers will have the most work to do because of the switch?