How git actually tracks objects: snapshotting and the illusion of git diff

The Illusion of Git Diff

When you modify a file and commit it, git stores a full snapshot of the entire file. NO diffs of modified lines are written to the filesystem, or anywhere else, at commit time. The “diff illusion” took root in engineers’ minds because we are used to seeing file modifications as diffs in many tools, such as the VCS view in IDEA, gitk, etc. The actual diff between commits is calculated on the fly and presented in a form that is easy for us to read.

I will demonstrate git’s snapshotting model with the following commit-by-commit walkthrough.

1. Init git repository and add first.txt to the repo

$ git init
$ ls -la .git
total 16
drwxr-xr-x  7 rtsypuk  staff  224 Jan 20 15:21 .
drwxr-xr-x  3 rtsypuk  staff   96 Jan 20 15:21 ..
-rw-r--r--  1 rtsypuk  staff   23 Jan 20 15:21 HEAD
-rw-r--r--  1 rtsypuk  staff  137 Jan 20 15:21 config
drwxr-xr-x  5 rtsypuk  staff  160 Jan 20 15:21 hooks
drwxr-xr-x  4 rtsypuk  staff  128 Jan 20 15:21 objects
drwxr-xr-x  4 rtsypuk  staff  128 Jan 20 15:21 refs

Adding first.txt with the single line “First line”

echo "First line" >> first.txt
git add .
git commit -m "first commit"

Now we can see that there are 3 new files created in the .git/objects folder.

Content of .git/objects directory

$ find .git/objects -type f
.git/objects/2a/ae0a6496aba985dcf01e7ddf684c49ada7c5db
.git/objects/96/49cde946d8d0896fa80977b9bcd76439f99e6b
.git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d

We can check an object’s type with the -t option of the cat-file command and its content with the -p option. The three new objects are a commit, a blob and a tree.

Checking object types with cat-file -t

$ git cat-file -t 2aae0a6496aba985dcf01e7ddf684c49ada7c5db
commit
$ git cat-file -t 9649cde946d8d0896fa80977b9bcd76439f99e6b
blob
$ git cat-file -t 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
tree
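
Instead of querying objects one by one, all of them can be listed together with their types. The following is a minimal sketch, assuming a throwaway repository in a temp directory and a made-up commit identity (Demo); it relies on the --batch-check and --batch-all-objects options of cat-file, so it needs a reasonably recent git.

```shell
# Throwaway repository; the path and the identity below are illustrative only.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "First line" > first.txt
git add first.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "first commit"

# Print "<sha1> <type> <size>" for every object stored in the repository.
objects=$(git cat-file --batch-check --batch-all-objects)
echo "$objects"
```

The commit id will differ from run to run because it depends on the author, committer and timestamp, while the blob id depends only on the file content.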

Each object’s filename is derived from its SHA1 hash, but git uses a small trick to map the hash onto the directory structure.

E.g. .git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d has the actual hash 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d. Git uses SHA1, a 160-bit hash written as 40 hexadecimal characters, to form filenames. The SHA1 of the object is split into 2 chunks: the first 2 chars name the folder and the remaining 38 chars form the filename placed inside that folder. This spreads objects across up to 256 subdirectories, so git does not have to scan one huge .git/objects directory (which is what would happen if every object were placed there with no subdirectories). If you come from the Java world, you can think of it as a HashMap: the 2-char prefix is the key that selects a bucket, and the bucket holds the list of objects. Looking up an object this way is much more efficient than searching one full-length list.
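
The split can be reproduced with plain shell tools. This is only an illustration: object_path is a hypothetical helper name, not a git command.

```shell
# Map a 40-character object id to the path of its loose object file.
# object_path is a made-up helper for this example, not part of git.
object_path() {
  sha=$1
  dir=$(printf '%s' "$sha" | cut -c1-2)    # first 2 hex chars: subdirectory
  file=$(printf '%s' "$sha" | cut -c3-40)  # remaining 38 chars: filename
  printf '.git/objects/%s/%s\n' "$dir" "$file"
}

object_path 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
```

For the tree object from the listing above, this prints .git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d.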

If you calculate the SHA1 of the added file yourself and compare it with the SHA1 stored by git, the two will not match:

$ shasum first.txt
9b6a082673a0e60b1804797a367e01671cfdb92b  first.txt

That is because git uses the standard SHA1 algorithm but does not apply it directly to the file’s content. Instead, it hashes a header made of the keyword blob, a space, the file size in bytes and a “\0” character, followed by the content of the entire file. So the actual “git-based SHA1” can be calculated using the following formula:

\[\text{gitSHA} = \text{SHA1}(\text{"blob "} + \text{filesize} + \backslash 0 + \text{filecontent})\]

Git has a command that computes this hash, i.e. the git-like SHA1 of a file:

$ git hash-object first.txt
9649cde946d8d0896fa80977b9bcd76439f99e6b
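
The header-plus-content rule can be cross-checked by hand. The sketch below assumes either sha1sum or shasum is installed and works in a temp directory; it builds the "blob <size>" header followed by a NUL byte, hashes it together with the file content, and compares the result with git hash-object.

```shell
work=$(mktemp -d) && cd "$work" || exit 1
echo "First line" > first.txt

# Pick whichever SHA1 tool is installed (sha1sum on most Linux, shasum on macOS).
if command -v sha1sum >/dev/null 2>&1; then sha1=sha1sum; else sha1="shasum -a 1"; fi

size=$(wc -c < first.txt | tr -d ' ')   # content size in bytes (11 for this file)

# Header "blob <size>" + NUL byte, followed by the raw file content.
manual=$( { printf 'blob %s\0' "$size"; cat first.txt; } | $sha1 | cut -d' ' -f1 )
from_git=$(git hash-object first.txt)

echo "manual:   $manual"
echo "from git: $from_git"
```

Both lines should show the same id, and for this exact content it should be the 9649cde946d8d0896fa80977b9bcd76439f99e6b value from the listing above.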

1.1. Commit object

Content of 2aae0a6496aba985dcf01e7ddf684c49ada7c5db commit object

$ git cat-file -p 2aae0a6496aba985dcf01e7ddf684c49ada7c5db
tree 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
author Roman Tsypuk <tsypuk.conf@gmail.com> 1579528773 +0200
committer Roman Tsypuk <tsypuk.conf@gmail.com> 1579528773 +0200
first commit

The commit object contains a reference to a tree plus the commit details: author, committer and commit message. If you sign commits with PGP, the signature is also stored here. The same commit SHA1 and details appear in the output of the regular git log command, e.g.:

Output of git log

$ git log
commit 2aae0a6496aba985dcf01e7ddf684c49ada7c5db (HEAD -> master)
Author: Roman Tsypuk <tsypuk.conf@gmail.com>
Date:   Mon Jan 20 15:59:33 2020 +0200
first commit

1.2. Tree object

Content of 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d tree object

$ git cat-file -p 5b9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	first.txt

The tree object lists the entries of a directory as of the current commit, including files that were not modified. Each entry is a line that starts with the Unix file mode, followed by blob or tree depending on whether the entry is a file (blob) or a directory (tree). Next come the git SHA1 of the object and its filename.
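
A tree can also be inspected without looking up its id first. The following sketch assumes a throwaway repository, a made-up identity and a hypothetical second file other.txt; it resolves the root tree of the latest commit with HEAD^{tree} and prints its entries with git ls-tree.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
# Small helper so commits do not depend on a configured identity (Demo is made up).
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "First line" > first.txt && git add . && commit "first commit"
echo "Other text" > other.txt && git add . && commit "second commit"

# HEAD^{tree} names the root tree of the latest commit.
tree=$(git rev-parse 'HEAD^{tree}')
entries=$(git ls-tree "$tree")
echo "$entries"
```

The tree of the second commit contains an entry for first.txt as well, even though that commit only added other.txt.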

1.3. Blob object

Contains the actual content of the file:

Content of 9649cde946d8d0896fa80977b9bcd76439f99e6b blob object

$ git cat-file -p 9649cde946d8d0896fa80977b9bcd76439f99e6b
First line

The objects form a DAG (Directed Acyclic Graph): the commit points to the tree, and the tree points to the blob.

2. Adding second.txt file with the same content to repository

Adding second.txt

echo "First line" >> second.txt
git add .
git commit -m "second commit"

Content of .git/objects directory

$ find .git/objects -type f
.git/objects/73/d4372bc1049b80935ee2b36d4ade2d7187afe9
.git/objects/2a/ae0a6496aba985dcf01e7ddf684c49ada7c5db
.git/objects/6b/3a79276e9bfa58e64cb7e5def587f593e7ae20
.git/objects/96/49cde946d8d0896fa80977b9bcd76439f99e6b
.git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d

Now we have 2 new objects:

  • 73/d4372bc1049b80935ee2b36d4ade2d7187afe9
  • 6b/3a79276e9bfa58e64cb7e5def587f593e7ae20

Commit2: .git/objects/6b/3a79276e9bfa58e64cb7e5def587f593e7ae20

$ git cat-file -p 6b3a79276e9bfa58e64cb7e5def587f593e7ae20
tree 73d4372bc1049b80935ee2b36d4ade2d7187afe9
parent 2aae0a6496aba985dcf01e7ddf684c49ada7c5db
author Roman Tsypuk <tsypuk.conf@gmail.com> 1579530262 +0200
committer Roman Tsypuk <tsypuk.conf@gmail.com> 1579530262 +0200
second commit

Tree2: .git/objects/73/d4372bc1049b80935ee2b36d4ade2d7187afe9

$ git cat-file -p 73d4372bc1049b80935ee2b36d4ade2d7187afe9
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	first.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	second.txt

As we can see from the new tree object, git now manages 2 records, first.txt and second.txt, but both of them reference the same blob. Git has identified that the content of first.txt and second.txt is exactly the same (the SHA1 hashes are equal), so there is no need to keep a duplicate of the same object. Remember that only the file size and the file content are used in the SHA1 calculation. The filename plays no role here.
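
The deduplication is already visible at staging time, before any commit exists. A minimal sketch, assuming a throwaway repository in a temp directory:

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
echo "First line" > first.txt
echo "First line" > second.txt
git add first.txt second.txt

# Both index entries carry the same blob id.
git ls-files -s

# Only one object file exists so far (there is no tree or commit yet).
loose=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "object files: $loose"
```

Two filenames are staged, yet a single blob file is written under .git/objects.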

Git commits form a DAG: each new commit holds a reference to its parent commit, which creates the graph structure. A tree object is a node of the graph that can point to different blob objects.

At the second commit step, we see that commit2 (sha1 6b3a79276e9bfa58e64cb7e5def587f593e7ae20) has a reference to its parent commit1 (sha1 2aae0a6496aba985dcf01e7ddf684c49ada7c5db) as well as a reference to the tree2 object (sha1 73d4372bc1049b80935ee2b36d4ade2d7187afe9), which holds two entries, first.txt and second.txt, both mapped to the same blob object.
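
The parent links can be followed from the command line as well. The sketch below assumes a throwaway repository and a made-up identity; it records the id of the first commit and then checks that the second commit points back to it.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
# Helper so commits do not depend on a configured identity (Demo is made up).
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "First line" > first.txt && git add . && commit "first commit"
first=$(git rev-parse HEAD)
echo "First line" > second.txt && git add . && commit "second commit"

# Each line: a commit id followed by its parent ids (the root commit has none).
git rev-list --parents HEAD

parent=$(git rev-parse 'HEAD^')   # first parent of the latest commit
echo "parent of HEAD: $parent"
```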

Git repository snapshot:

  • 2 commits
  • 2 trees
  • 1 blob
  • 2 refs: HEAD,refs/heads/master

3. Adding third.txt file with the same content to git repo

Adding third.txt

echo "First line" >> third.txt
git add .
git commit -m "third commit"

Content of .git/objects directory

$ find .git/objects -type f
.git/objects/73/d4372bc1049b80935ee2b36d4ade2d7187afe9
.git/objects/2a/ae0a6496aba985dcf01e7ddf684c49ada7c5db
.git/objects/6b/3a79276e9bfa58e64cb7e5def587f593e7ae20
.git/objects/96/49cde946d8d0896fa80977b9bcd76439f99e6b
.git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
.git/objects/d4/9117a72fc08b1991572d363bd43bdfe6c56c9f
.git/objects/e9/c3f660695d448fd33fa0b3e731f4ea7d9d4eed

We have 2 new objects:

  • d4/9117a72fc08b1991572d363bd43bdfe6c56c9f
  • e9/c3f660695d448fd33fa0b3e731f4ea7d9d4eed

Tree3: .git/objects/d4/9117a72fc08b1991572d363bd43bdfe6c56c9f

$ git cat-file -p d49117a72fc08b1991572d363bd43bdfe6c56c9f
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	first.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	second.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	third.txt

Commit3: .git/objects/e9/c3f660695d448fd33fa0b3e731f4ea7d9d4eed

$ git cat-file -p e9c3f660695d448fd33fa0b3e731f4ea7d9d4eed
tree d49117a72fc08b1991572d363bd43bdfe6c56c9f
parent 6b3a79276e9bfa58e64cb7e5def587f593e7ae20
author Roman Tsypuk <tsypuk.conf@gmail.com> 1579531622 +0200
committer Roman Tsypuk <tsypuk.conf@gmail.com> 1579531622 +0200
third commit

The same thing happens again: after adding third.txt with the same content (“First line”), we get only 2 new objects, tree3 and commit3. Tree3 still references the same blob object.

Git repository snapshot:

  • 3 commits
  • 3 trees
  • 1 blob
  • 2 refs: HEAD,refs/heads/master

4. Appending first.txt with “Second line”

Now let’s make several commits that modify the actual file content and track the modifications in git objects.

Append “Second line” to first.txt

echo "Second line" >> first.txt
git add .
git commit -m "fourth commit"

Content of .git/objects directory

$ find .git/objects -type f
.git/objects/04/c8335e36222638af3c52ad7cb90d2f8ff68ad6
.git/objects/a5/ed01df2450ce61d01d05803ec75983a38eaff6
.git/objects/7d/91453217afc429984c4706e8df22aaac47c9ce
.git/objects/73/d4372bc1049b80935ee2b36d4ade2d7187afe9
.git/objects/2a/ae0a6496aba985dcf01e7ddf684c49ada7c5db
.git/objects/6b/3a79276e9bfa58e64cb7e5def587f593e7ae20
.git/objects/96/49cde946d8d0896fa80977b9bcd76439f99e6b
.git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
.git/objects/d4/9117a72fc08b1991572d363bd43bdfe6c56c9f
.git/objects/e9/c3f660695d448fd33fa0b3e731f4ea7d9d4eed

We have 3 new objects:

  • .git/objects/04/c8335e36222638af3c52ad7cb90d2f8ff68ad6
  • .git/objects/a5/ed01df2450ce61d01d05803ec75983a38eaff6
  • .git/objects/7d/91453217afc429984c4706e8df22aaac47c9ce

Commit4: .git/objects/04/c8335e36222638af3c52ad7cb90d2f8ff68ad6

$ git cat-file -t 04c8335e36222638af3c52ad7cb90d2f8ff68ad6
commit
$ git cat-file -p 04c8335e36222638af3c52ad7cb90d2f8ff68ad6
tree a5ed01df2450ce61d01d05803ec75983a38eaff6
parent e9c3f660695d448fd33fa0b3e731f4ea7d9d4eed
author Roman Tsypuk <tsypuk.conf@gmail.com> 1579532082 +0200
committer Roman Tsypuk <tsypuk.conf@gmail.com> 1579532082 +0200
fourth commit

Tree4: .git/objects/a5/ed01df2450ce61d01d05803ec75983a38eaff6

$ git cat-file -t a5ed01df2450ce61d01d05803ec75983a38eaff6
tree
$ git cat-file -p a5ed01df2450ce61d01d05803ec75983a38eaff6
100644 blob 7d91453217afc429984c4706e8df22aaac47c9ce	first.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	second.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	third.txt

Modified first.txt

$ git cat-file -t 7d91453217afc429984c4706e8df22aaac47c9ce
blob
$ git cat-file -p 7d91453217afc429984c4706e8df22aaac47c9ce
First line
Second line

Git repository snapshot:

  • 4 commits
  • 4 trees
  • 2 blobs
  • 2 refs: HEAD,refs/heads/master

A new Blob2 with “First line” and “Second line” has been created. The new Tree4 object references it for first.txt and still references Blob1 for second.txt and third.txt. At the same time, the old version of first.txt (with “First line” only) is still stored as Blob1 and is referenced by Commit1 through Commit3. So, for modified files git creates a new blob with a full snapshot of the file’s content. There are no diffs of modified lines!
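
One way to convince yourself that diffs are computed on demand is to count the object files before and after running git diff. This is a rough sketch, assuming a throwaway repository and a made-up identity:

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
# Helper so commits do not depend on a configured identity (Demo is made up).
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "First line" > first.txt && git add . && commit "first commit"
echo "Second line" >> first.txt && git add . && commit "second commit"

before=$(find .git/objects -type f | wc -l | tr -d ' ')
patch=$(git diff HEAD~1 HEAD -- first.txt)   # computed from the two stored snapshots
after=$(find .git/objects -type f | wc -l | tr -d ' ')

echo "$patch"
echo "object files before/after diff: $before/$after"
```

The patch shows the added line, while the number of stored objects stays the same: two blobs, two trees and two commits.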

5. Appending first.txt with “Third line”

Appending “Third line” to first.txt

echo "Third line" >> first.txt
git add .
git commit -m "fifth commit"

Content of .git/objects directory

$ find .git/objects -type f
.git/objects/04/c8335e36222638af3c52ad7cb90d2f8ff68ad6
.git/objects/3c/fa938b257bccd12320093fd62e5e5c3b0c05fc
.git/objects/a5/ed01df2450ce61d01d05803ec75983a38eaff6
.git/objects/7d/91453217afc429984c4706e8df22aaac47c9ce
.git/objects/73/d4372bc1049b80935ee2b36d4ade2d7187afe9
.git/objects/2a/ae0a6496aba985dcf01e7ddf684c49ada7c5db
.git/objects/6b/3a79276e9bfa58e64cb7e5def587f593e7ae20
.git/objects/91/ddeac3958b9c3f335c723852ed2e3869fad6ba
.git/objects/96/49cde946d8d0896fa80977b9bcd76439f99e6b
.git/objects/5b/9ddc8ff39ad75d0f98e71780cbbf56478dbb8d
.git/objects/6d/a4d3e0a797240aefaa9c8009e49c57ded3b59e
.git/objects/d4/9117a72fc08b1991572d363bd43bdfe6c56c9f
.git/objects/e9/c3f660695d448fd33fa0b3e731f4ea7d9d4eed

We have 3 new objects:

  • .git/objects/3c/fa938b257bccd12320093fd62e5e5c3b0c05fc
  • .git/objects/91/ddeac3958b9c3f335c723852ed2e3869fad6ba
  • .git/objects/6d/a4d3e0a797240aefaa9c8009e49c57ded3b59e

Tree5: .git/objects/3c/fa938b257bccd12320093fd62e5e5c3b0c05fc

$ git cat-file -t 3cfa938b257bccd12320093fd62e5e5c3b0c05fc
tree
$ git cat-file -p 3cfa938b257bccd12320093fd62e5e5c3b0c05fc
100644 blob 6da4d3e0a797240aefaa9c8009e49c57ded3b59e	first.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	second.txt
100644 blob 9649cde946d8d0896fa80977b9bcd76439f99e6b	third.txt

Commit5: .git/objects/91/ddeac3958b9c3f335c723852ed2e3869fad6ba

$ git cat-file -t 91ddeac3958b9c3f335c723852ed2e3869fad6ba
commit
$ git cat-file -p 91ddeac3958b9c3f335c723852ed2e3869fad6ba
tree 3cfa938b257bccd12320093fd62e5e5c3b0c05fc
parent 04c8335e36222638af3c52ad7cb90d2f8ff68ad6
author Roman Tsypuk <tsypuk.conf@gmail.com> 1579533662 +0200
committer Roman Tsypuk <tsypuk.conf@gmail.com> 1579533662 +0200
fifth commit

All 3 versions of first.txt are present in git

$ git cat-file -t 6da4d3e0a797240aefaa9c8009e49c57ded3b59e
blob
$ git cat-file -p 6da4d3e0a797240aefaa9c8009e49c57ded3b59e
First line
Second line
Third line
$ git cat-file -p 7d91453217afc429984c4706e8df22aaac47c9ce
First line
Second line
$ git cat-file -p 9649cde946d8d0896fa80977b9bcd76439f99e6b
First line

At the same time, the oldest version of the file is still available as the 9649cde946d8d0896fa80977b9bcd76439f99e6b object.

Git repository snapshot:

  • 5 commits
  • 5 trees
  • 3 blobs
  • 2 refs: HEAD,refs/heads/master

As you can see, all three versions of first.txt are now stored in separate objects, Blob1 through Blob3. By checking out the desired commit we can switch between the versions of the file attached to it.
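
The stored versions can also be read without switching the working tree, by addressing a file as <revision>:<path>. A minimal sketch, assuming a throwaway repository and a made-up identity:

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
# Helper so commits do not depend on a configured identity (Demo is made up).
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "First line"  >  first.txt && git add . && commit "first commit"
echo "Second line" >> first.txt && git add . && commit "second commit"
echo "Third line"  >> first.txt && git add . && commit "third commit"

# Print the full snapshot of first.txt as recorded by each commit.
for rev in HEAD~2 HEAD~1 HEAD; do
  echo "== first.txt at $rev"
  git cat-file -p "$rev:first.txt"
done
```

Each revision yields a complete file, because every commit reaches a full blob through its tree.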

Summary

Following this hands-on, step-by-step walkthrough, we have seen that when a file is modified in a commit, git stores a whole-file snapshot and maps it through a tree object. At commit time it does not store a delta or a diff; exactly the latest snapshot is stored. Another interesting behavior is that files with identical content are stored as a single object on the filesystem, and tree objects simply reference it multiple times under different filenames.

Git also has a compaction mechanism, the pack file format, which is triggered by git gc. It reorganizes the object storage more efficiently and uses deltas between objects. I will give an overview of pack files in the next article.
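
As a small preview, the effect of git gc on loose objects can be observed with git count-objects. This sketch assumes a throwaway repository and a made-up identity, and it only shows where the objects end up, not how deltas are chosen.

```shell
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q
# Helper so commits do not depend on a configured identity (Demo is made up).
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q -m "$1"; }

echo "First line" > first.txt && git add . && commit "first commit"
echo "Second line" >> first.txt && git add . && commit "second commit"

git count-objects -v   # before: objects are reported as loose ("count")
git gc -q
git count-objects -v   # after: objects are reported as packed ("in-pack")

loose=$(git count-objects -v | sed -n 's/^count: //p')
packed=$(git count-objects -v | sed -n 's/^in-pack: //p')
echo "loose: $loose, packed: $packed"
```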

This post is licensed under CC BY 4.0 by the author.