15. Why did we learn the plumbing commands?#

You will not typically use them on a day to day basis, but they are a good way to see what happens at the interim steps and make sure that you have the right understanding of what git does.

A correct understanding is essential for using more advanced features.

While there is of course some content that we want you to know after this course, my goal is also to teach you process, by modeling it.

No one will ever know all of the things but you can be fast or slow at finding answers.

And you can find correct answers, incorrect answers, or looks-okay-but-you-will-regret-this-later answers.

You do not want to become the colleague that everyone regrets working with.

My goal is that you get good at quickly finding correct answers.

Large language models will not do that for you.

15.1. Today’s Questions#

  • What are references?

  • How can I release and share my code?

  • How else can git help me?

When do the contents of a file get hashed?

  • [ ] every time you edit a file, git hashes it automatically

  • [x] when a file is staged for commit

  • [ ] any time a staged file is edited

  • [ ] when the commit is created

What type of git object stores the name of a file?

  • [x] tree

  • [ ] blob

  • [ ] commit

  • [ ] branch

15.2. What does git status do?#

It compares the working directory to the current state of the active branch and the index.

  • we can see the working directory with: ls

  • we can see the active branch in the HEAD file

  • what is its status?

Recall that we:

  • created a blob object directly

  • created a file

  • hashed the file, to create its blob object

  • added the file to the index

  • wrote a tree from files in the index

  • created a commit object with a tree and a commit message

  • copied the commit hash and placed it inside a new file .git/refs/heads/main using echo and >

Let’s use inspection to review where we left our repo last week.

git status
On branch main


Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   test.txt

15.3. Branches are references#

Branches are not git objects; they are references.

We can see that by where they are stored.

How can we inspect to see where references are stored?

using ls and/or find

We can see our list of objects with find

find .git/objects/ -type f
.git/objects//0c/1e7391ca4e59584f8b773ecdbbb9467eba1547
.git/objects//d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
.git/objects//d8/329fc1cc938780ffdd9f94e0d364e0ea74f579
.git/objects//e3/ba10cb02de504d4f48b9af4934ddcc4d0be3df
.git/objects//83/baae61804e65cc73a7201a7252750c76066a30

Since most of the objects we made are not commits, most of the hashes are the same for everyone. One of the hashes is unique to each person, so we worked together to identify that one.

Then for that unique hash, we confirmed it was a commit using git cat-file to view its type

git cat-file -t e3ba
commit

as expected

Now let’s look at what the HEAD pointer says to try to understand why it does not see that commit, since we know that git status works from HEAD

cat .git/HEAD 
ref: refs/heads/main

Important

git updates the file for the branch each time you add a commit with git commit. The first time git commit is run, it also creates the file.

If we look in the folder, we can see what is in there

ls .git/refs/heads/
main
cat .git/refs/heads/main
e3ba10cb02de504d4f48b9af4934ddcc4d0be3df

15.3.1. Updating a branch manually#

The branch file is named as the branch (here main) and stored in the .git/refs/heads/ folder and contains the full hash of the commit that branch is pointing to.
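The chain from HEAD to a commit can be sketched in Python. This is a minimal sketch, not how git itself is implemented: the function name resolve_head is hypothetical, the directory is a throwaway folder laid out like .git rather than a real repo, and it assumes HEAD holds either a line of the form `ref: <path>` or a bare hash.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit hash, the way we did by hand with cat."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a branch file; that file holds the full commit hash
        ref_path = git_dir / head[len("ref: "):]
        return ref_path.read_text().strip()
    # otherwise HEAD already holds a commit hash directly
    return head


# Build a throwaway directory laid out like .git (no real repo needed)
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    (git_dir / "refs" / "heads" / "main").write_text(
        "e3ba10cb02de504d4f48b9af4934ddcc4d0be3df\n"
    )
    resolved = resolve_head(git_dir)
    print(resolved)
```

Real repositories can also keep references in a packed-refs file, which this sketch ignores.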

Since we made the commit manually, we need to move the branch manually too. We will use echo, by copying the full hash from above (I copied the part after the last / in the list of objects above and then typed the e3 before pasting).

Mine looks like:

echo e3ba10cb02de504d4f48b9af4934ddcc4d0be3df > .git/refs/heads/main

but your commit hash will be different than mine.

Now we check with git again:

git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   test.txt

no changes added to commit (use "git add" and/or "git commit -a")

and now we can see that it no longer sees the staged file and does see our commit.

git cat-file -p 188a
tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
author Ayman Sandouk <Ayman_sandouk@uri.edu> 1677177139 -0500
committer Ayman Sandouk <Ayman_sandouk@uri.edu> 1677177139 -0500

first commit
git log

So we now have HEAD -> main and main -> our commit -> tree -> blob.

Branches in git are references to specific commits.

15.4. Git Tags#

To practice this, let’s go to the GitHub inclass repo

cd gh-

(this is not the full name of the directory, but starting with this and then pressing tab will get you there)

The other type of git reference is called a tag. There are two types of tags:

  • a lightweight tag is like a branch that does not move

  • an annotated tag is a git object, almost like a commit.

We can see where they are stored in the refs folder:

ls .git/refs/
heads	remotes	tags

Remember:

  • heads are branches (like main)

  • remotes are servers that we can push to (eg https://github.com/...)

Before we make a tag, let’s make sure we are on the most up to date main. First we check the branch and working directory:

git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

main, clean working tree is good.

Next we pull to update from github

git pull
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), 955 bytes | 136.00 KiB/s, done.
From https://github.com/compsys-progtools/gh-inclass-sp24-Ayman_sandouk
 * [new branch]      1-create-a-add-a-classmate -> origin/1-create-a-add-a-classmate
Already up to date.

and we get the updates

Now, to make a tag we use git tag <name of tag> to create a lightweight tag

git tag v3.0

This command has no output, but we will inspect

ls .git/refs/tags/
v3.0

we see this created a file for the tag,

we will look at this too

cat .git/refs/tags/v3.0
0d6ef1d3db0729088d515b35b588f39af5b770fd

the lightweight tag is just like a branch, another pointer to the commit.

The difference is that this does not move when we create new commits, so this is like a shortcut to a specific commit.
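To make the difference concrete, here is a toy model of references as a Python dictionary. It is only a sketch of the behavior described above, not git's implementation: the function make_commit and the new hash abc1234 are hypothetical, and the hashes are shortened for readability.

```python
def make_commit(refs: dict, head: str, new_hash: str) -> None:
    """Model what a new commit does to references: only the branch that HEAD names moves."""
    if head.startswith("ref: "):
        refs[head[len("ref: "):]] = new_hash


refs = {
    "refs/heads/main": "0d6ef1d",  # the branch
    "refs/tags/v3.0": "0d6ef1d",   # lightweight tag created at the same commit
}
head = "ref: refs/heads/main"

make_commit(refs, head, "abc1234")  # hypothetical hash for a new commit

print(refs["refs/heads/main"])  # the branch moved to the new commit
print(refs["refs/tags/v3.0"])   # the tag still points at the old commit
```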

We can see the tag relative to branches in the log as well:

git log
commit 0d6ef1d3db0729088d515b35b588f39af5b770fd (HEAD -> main, tag: v3.0, origin/main, origin/HEAD)
Merge: 9f39946 e3b192a
Author: AymanBx <Ayman_sandouk@uri.edu>
Date:   Thu Feb 8 13:43:19 2024 -0500

    Merge pull request #6 from compsys-progtools/organization
    
    first pass organizing

commit e3b192aa0cd490226e8adcd81d3d0b95adb5676b (origin/organization, organization)
Author: Ayman Sandouk <Ayman_sandouk@uri.edu>
Date:   Tue Feb 6 13:41:53 2024 -0500

    oranizizng

commit 260c9c309922970f80bfa2c93cc23bcfbb962740
Author: Ayman Sandouk <Ayman_sandouk@uri.edu>
Date:   Tue Feb 6 13:06:20 2024 -0500

    describe the files

commit 29ffc88519103085ed3a2ab01cffb3c99d70fc6a
Author: Ayman Sandouk <Ayman_sandouk@uri.edu>
Date:   Tue Feb 6 12:59:20 2024 -0500

If we push:

git push
Everything up-to-date

we see nothing happens initially.

Tags have to be pushed explicitly

git push --tags

We can look at how tags are represented on github. After opening the repo up in browser:

gh repo view --web

Tags enable releases, which are defined on GitHub as:

Releases are deployable software iterations you can package and make available for a wider audience to download and use.

We can create a release for the tag we made

15.4.1. We can checkout any commit, not just branches#

git checkout 188a
Note: switching to '188a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 188a75e first commit

we can follow from this warning.

cat .git/HEAD 
188a75ef66b6a85be0ab68d8575ec27808881dfc

This changes the head pointer directly to the commit.
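The two forms that the HEAD file can take can be told apart with a few lines of Python. This is a sketch under the assumption that an attached HEAD contains `ref: refs/heads/<branch>` and a detached HEAD contains a bare commit hash; describe_head is a hypothetical helper, not a git command.

```python
BRANCH_PREFIX = "ref: refs/heads/"


def describe_head(head_contents: str) -> str:
    """Report whether HEAD points at a branch or directly at a commit."""
    text = head_contents.strip()
    if text.startswith(BRANCH_PREFIX):
        return f"on branch {text[len(BRANCH_PREFIX):]}"
    # no "ref:" line, so HEAD holds a commit hash: detached HEAD
    return f"detached at {text[:7]}"


attached = describe_head("ref: refs/heads/test\n")
detached = describe_head("188a75ef66b6a85be0ab68d8575ec27808881dfc\n")
print(attached)
print(detached)
```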

ls
test.txt

and checkout also updates the working directory

cat test.txt 
version 1
git checkout test
Switched to branch 'test'
ls
test.txt
cat test.txt 
version 1

15.5. What does this mean?#

We tend to think of commits like this:

flowchart RL
    subgraph A[first commit]
        %% commitA
        file1v1
        %% treeA
        %% blob1
    end
    subgraph B[second commit]
        file1v2
        file2v1
        file3v1
        %% commitB
        %% treeB
        %% blob2
    end
    B--> A

In reality

flowchart RL
    subgraph A[first commit]
        commitA
        %% file1v1
        treeA
        blob1
        commitA-->treeA
        treeA-->blob1
    end
    subgraph B[second commit]
        %% file1v2
        %% file2v1
        %% file3v1
        commitB
        treeB
        blob1v2
        blob2
        blob3
        commitB-->treeB
        treeB-->blob1v2
        treeB
        treeB-->blob3
    end
    %% B--> A
    commitB -->commitA

15.6. We can move pointers around freely#

flowchart BT
    blob1
    blob2
    blob3
    subgraph A[first commit]
        direction TB
        commitA
        %% file1v1
        treeA
        commitA-->treeA

    end 
    subgraph B[second commit]
        direction TB
        %% file1v2
        %% file2v1
        %% file3v1
        commitB
        treeB
        commitB-->treeB
    end
%% subgraph C[third commit]
%%     direction TB
%%     %% file1v2
%%     %% file2v1
%%     %% file3v1
%%     commitC
%%     commitC-->treeC
%% end
B--> A

%% treeC-->blob1
%% treeC-->blob2
treeB-->blob3
%% B--> A
commitB -->commitA

15.7. Experience Report Evidence#

Write your git status and object list to a file.

15.8. What is a hash?#

a hash is:

  • a fixed size value that can be used to represent data of arbitrary sizes

  • the output of a hashing function

  • often fixed to a hash table

Common examples of hashing are lookup tables and encryption with a cryptographic hash.

A hashing function could be really simple, to read off a hash table, or it can be more complex.

For example:

Hash    content
0       Success
1       Failure

Suppose we want to represent the status of a program that has run: it has two possible outcomes, success or failure. We can use the hash table above and a function that takes in the content and returns the corresponding hash. Then we could pass around the 0 or 1 as a single bit of information that corresponds to the outcome.

This lookup table hash works here.

In a more complex scenario, imagine trying to hash all of the new terms you learn in class.

A table would be hard for this, because until you have seen them all, you do not know how many there will be. A more effective way to hash this, is to derive a hashing function that is a general strategy.
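As an illustration of a hashing function that is a general strategy, here is a toy example in Python. The function simple_hash and the choice of 1000 buckets are assumptions made for this sketch; it works for any term without knowing the terms in advance, but it is far from a cryptographic hash.

```python
def simple_hash(term: str, n_buckets: int = 1000) -> int:
    """Toy hash: sum the byte values of the term, then wrap into a fixed range."""
    return sum(term.encode("utf-8")) % n_buckets


for term in ["blob", "tree", "commit"]:
    print(term, simple_hash(term))
```

Because it only sums the bytes, any two terms with the same letters in a different order get the same value, so this strategy is not unique.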

A cryptographic hash is additionally:

  • unique

  • not reversible

  • similar inputs hash to very different values so they appear uncorrelated

Now let’s go through each of these properties.

15.8.1. Cryptographic Hashes are unique#

This means that two different values we put in should give different results.

For this property alone, a simple function could work:

def basic_unique(input):
    return input

but this is not a hash because its length would not be constant, and not a cryptographic hash because it is easily reversible.

15.8.2. Cryptographic Hashes are not reversible#

This means that given the hash (output), we cannot compute the message (input).

Any function that gives the same output for two (or more) values meets this criterion.

For example, modulus:

13%3
1
10%3
1

Because the output is the same for several inputs, we cannot tell which input produced it. But modulus alone is not a cryptographic hash, because it fails the other properties.

We can use the git hashing algorithm without writing to the repo too, by piping content to git hash-object --stdin and leaving off the -w option.

Then we get the hash back. Try changing the input just a little and running the hashing algorithm again.

Do similar inputs have similar hashes?

15.8.3. Similar inputs lead to seemingly uncorrelated outputs#

We can see this using git:

echo "it's almost spring" | git hash-object --stdin
1327354a0d47111e361ec52acc46b9509bed69a0

and we get a hash

and then we change it just a little bit:

echo "it's almsot spring" | git hash-object --stdin
f798473089b4a00684cd0f13a6f2d1d7a0e033ba

and now the hash is completely different
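One way to quantify "completely different" is to count how many of the 160 bits differ between the two hashes. The sketch below uses Python's hashlib with plain SHA-1 on the text, which is an assumption of this example: it does not reproduce git's object hashing, so the hex values it prints will not match the git output above. The helper names are hypothetical.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Plain SHA-1 of the text, as 40 hex characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions where two hex digests disagree."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


first = sha1_hex("it's almost spring")
second = sha1_hex("it's almsot spring")

print(first)
print(second)
print(differing_bits(first, second), "of 160 bits differ")
```

If the outputs were correlated with the inputs, swapping two letters would flip only a few bits; for a good hash we expect roughly half of them to change.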

15.8.4. Hashes are fixed length#

So, no matter the size of the input, we get back the same length.

This is good for memory allocation reasons.

We could again write a function that only does this simply:

def fixed_len(input):
    '''
    pad or trim 
    '''
    len_target=100
    str_in = str(input)
    if len(str_in)< len_target:
        return str_in.ljust(len_target,'-')
    else:
        return str_in[:len_target]

But this is still not a cryptographic hash.

Try fixed_len("lkjhlkjhlkjhlkjlkjhlkjhlkjhlkjhkljhjlkhlkjhlkjhlkjhlkjhkjlhkjlhhhhhjlhlfgfgfgfgfgfgfgfgfkjhlkjhljkhnbh123")

Try fixed_len("lkjhlkjhlkjhlkjlkjhlkjhlkjhlkjhkljhjlkhlkjhlkjhlkjhlkjhkjlhkjlhhhhhjlhlfgfgfgfgfgfgfgfgfkjhlkjhljkhnbh258")

What are some ways a hash could be used?

15.9. How can hashes be used?#

Hashes can then be used for a lot of purposes:

  • message integrity (when sending a message, the unhashed message and its hash are both sent; the message is real if the sent message can be hashed to produce the same hash)

  • password verification (password selected by the user is hashed and the hash is stored; when attempting to login, the input is hashed and the hashes are compared)

  • file or data identifier (eg in git)
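The message integrity use can be sketched in a few lines of Python. This is a minimal illustration, assuming SHA-256 from hashlib as the hashing function (the notes do not name one here); the function names and messages are hypothetical.

```python
import hashlib


def digest(message: bytes) -> str:
    """Hash the message; the sender transmits this alongside the message."""
    return hashlib.sha256(message).hexdigest()


def is_intact(received_message: bytes, received_digest: str) -> bool:
    """The receiver re-hashes the message and compares to the hash that was sent."""
    return digest(received_message) == received_digest


sent_digest = digest(b"meet at noon")

ok = is_intact(b"meet at noon", sent_digest)        # unchanged message
tampered = is_intact(b"meet at moon", sent_digest)  # one letter changed in transit
print(ok, tampered)
```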

15.9.1. Hashing in passwords#

Passwords can be hashed and the hash stored; then, when you submit a candidate password, the site can compare the hash of the submitted password to the hash that was stored. Since the hashing function is nonreversible, the site cannot see the password.

Password Breach blog post

Some sites are negligent and store passwords unencrypted. If your browser warns you about such a site, proceed with caution and definitely do not reuse a password that you use anywhere else. (You should never reuse passwords, but especially do not if there is a warning.)

An attacker who gets a database of hashed passwords cannot read the passwords directly, but they could build a lookup table. For example, “password” is a bad password because it has been hashed with basically every algorithm, so its hash can be looked up to recover the original value. Choosing an uncommon password makes it less likely that your password exists in a lookup table.

For example, using SHA-1, the hashing algorithm that git uses:

echo "password" | git hash-object --stdin
f3097ab13082b70f67202aab7dd9d1b35b7ceac2
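A lookup table attack can be sketched as follows. This is an illustration only: it assumes passwords were stored as plain, unsalted SHA-1 hashes computed with hashlib (so the values differ from the git hash-object output above), and the short password list is made up for the example.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Unsalted SHA-1 of a password (an assumption of this sketch)."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


# The attacker hashes common passwords ahead of time: hash -> password
common_passwords = ["password", "123456", "qwerty", "letmein"]
lookup = {sha1_hex(p): p for p in common_passwords}

# A hash found in a breached database is "reversed" by looking it up
stolen_hash = sha1_hex("qwerty")
recovered = lookup.get(stolen_hash)

# An uncommon password is not in the table, so the lookup finds nothing
uncommon = lookup.get(sha1_hex("correct horse battery staple"))
print(recovered, uncommon)
```

The hash function was never reversed; the attacker only matched a hash they had already computed.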

15.10. Hashing in Git#

In git we hash the content directly to store it in both the database (.git) directory and the commit information.

Recall that when we were working in our test repo, we created an empty repository and then added content directly. We all got the same hashes for that content, but when we used git commit, our commits had different hashes because we have different names and made the commits at different seconds.
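The reason identical content gives everyone the same hash can be seen by computing a blob hash outside of git. The sketch below assumes git's SHA-1 object format, where the hashed bytes are the header `blob <size>` followed by a null byte and then the content; git_blob_hash is a hypothetical helper name.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way git hashes a blob: header, null byte, then the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo adds a trailing newline, so it is included here
blob_hash = git_blob_hash(b"test content\n")
print(blob_hash)
```

Nothing about the author or the time goes into a blob, only the content and its size, so anyone hashing the same bytes gets the same result.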

We also saw that two entries were created in the .git directory for the commit.

Git was originally designed to use SHA-1.

Then the SHA-1 collision attack was discovered.

Git switched to hardened SHA-1 in response to the collision.

In that case it adjusts the SHA-1 computation to result in a safe hash. This means that it will compute the regular SHA-1 hash for files without a collision attack, but produce a special hash for files with a collision attack, where both files will have a different unpredictable hash.

Git will change its hashing algorithm again soon.

GitHub uses git; it is not an alternative implementation or a fork, so yes, it will switch too. The developers at GitHub and other git hosts are among the most impacted by the change since they write code that directly interacts with git objects.

Git uses the SHA hash primarily for uniqueness, not privacy.

It does provide some security assurances, because we can check the content against the hash to make sure it matches.

This is a Secure Hashing Algorithm that is derived from cryptography. Because it is secure, no set of mathematical operations can directly decrypt a SHA-1 hash. It is designed so that any possible content that we put in it returns a unique key. It uses a combination of bit level operations on the content to produce the unique values.

The SHA-1 Algorithm hashes content into a fixed length of 160 bits.

How many different possible hashes can it produce?

  • [ ] 160*160

  • [x] 2^160

  • [ ] 160*2

  • [ ] 160^2

This means it can produce \(2^{160}\) different hashes, which makes the probability of a collision very low.

The number of randomly hashed objects needed to ensure a 50% probability of a single collision is about \(2^{80}\) (the formula for determining collision probability is \(p = \frac{n(n-1)}{2} \times \frac{1}{2^{160}}\)). \(2^{80}\) is \(1.2 \times 10^{24}\) or 1 million billion billion. That’s 1,200 times the number of grains of sand on the earth.

A SHORT NOTE ABOUT SHA-1 in the Git Documentation
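The numbers in the quote can be checked with exact arithmetic. This sketch evaluates the quoted approximation, the number of pairs n(n-1)/2 divided by the \(2^{160}\) possible hashes, at n equal to \(2^{80}\); the function name is hypothetical.

```python
from fractions import Fraction

HASH_BITS = 160
hash_space = 2 ** HASH_BITS  # number of possible SHA-1 hashes


def collision_probability(n: int) -> Fraction:
    """Quoted approximation: number of pairs of objects divided by the hash space."""
    return Fraction(n * (n - 1), 2) / hash_space


n = 2 ** 80
p = collision_probability(n)

print(n)                # the exact value of 2^80, about 1.2e24
print(f"{float(n):.1e}")
print(float(p))         # very close to one half
```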


15.11. Prepare for Next Class#

  • [ ] Think about what you know about networking

  • [ ] Get the big ideas of HPC, by reading this IBM intro page and some hypothetical people who would attend an HPC carpentry workshop. Make a list of key terms as an issue comment

  • [ ] Look over build/explore ideas if you plan to do any on discussion repo. Build ideas. Explore ideas

15.12. Badges#

  • [ ] Learn more about the SHA-1 collision attack

  • [ ] Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object (the minimum is 4. That’s usually enough for us to use something like cat-file command, git tag or even git checkout) and include that number in gitcounts.md with a title # Git counts. How many files would have to exist to reach that number of objects assuming every file was edited in each of two commits? If you get stuck, outline what you know and then request a review.

  • [ ] Create tagtypeexplore.md with the template below. Determine how many of the tags in the course website are annotated vs lightweight using. (You may need to use git pull --tags in your clone of the course website)

# Tags

<!-- short definition/description in your own words of what a tag is and what it is for -->


## Inspecting tags

Course website tags by type: 
- annotated:
- lightweight: 
  • [ ] Create tagtypes.md with the template below. Include an experiment that shows which, if either, type of tag creates a new git object. There are two types; try creating one of each: a lightweight tag (provide only the tag name, which is what we did in class) and an annotated tag (provide a name and a message with -m).

  • [ ] Determine how many of the tags in the course website are annotated vs lightweight using. (You may need to use git pull --tags in your clone of the course website)

# Tags

<!-- short definition/description in your own words of what a tag is and what it is for -->


## Comparing tag types 

<!-- include your experiment terminal history  and interpretation -->

## Inspecting tags

Course website tags by type: 
- annotated:
- lightweight: 
<!-- include lists of tags for each type -->
  • [ ] Calculate the maximum number of git objects that a repo can have without requiring you to use more than the minimum number of characters to refer to any object (the minimum is 4. That’s usually enough for us to use something like cat-file command, git tag or even git checkout) and include that number in gitcounts_scenarios.md with a title # Git counts. Describe 3 scenarios that would get you to that number of objects in terms of what types of objects would exist. For example, what is the maximum number of commits you could have without exceeding that number? How could you get to that number of objects in the fewest number of commits? What might be a typical way to get there? Assume normal git use with porcelain commands, not atypical cases with plumbing commands. If you get stuck, outline what you know and then request a review.

  • [ ] Learn more about the SHA-1 collision attack

  • [ ] Learn more about how git is working on changing from SHA-1 to SHA-256 and answer the transition questions below in gittransition.md

gittransition

# transition questions

1. Why make the switch? (in detail, not just *an attack*)
2. What impact will the switch have on how git works?
3. Which developers will have the most work to do because of the switch?

15.13. Questions#

15.13.1. How do tags work exactly?#

Similar to branches, an individual tag will point at a specific commit (by holding a commit’s hash in its corresponding file). So one can easily use a tag to mark an important state of their project that they never want to lose. The difference between a branch and a tag is that a branch moves with each commit or can be moved with merges, rebases, pushes, pulls, etc. A tag cannot.