Git Intuition

January 10, 2021

In this post I want to moti­vate the fun­da­men­tal data struc­tures that under­lie Git. It will not be an in-depth treat­ment — just enough to have an intu­ition to know how Git can seem­ing­ly travel through time. I seemed to have used up my quota of tech­ni­cal jargon in the pre­vi­ous post so I’ll restraint myself here. (I promise not to use the term ”Merkle trees” any­where!)

Maybe let’s start with the dumb­est pos­si­ble way you could imple­ment some­thing like Git. Imag­ine I have some direc­to­ry whose con­tents I want to keep track of. When­ev­er I want to save a check­point of my work, I zip the entire direc­to­ry and name the zip file as the cur­rent time­stamp. Over time, I’ll end up with a bunch of zip files of the direc­to­ry at dif­fer­ent points in time.

That’s really what Git does. Every­thing else is an opti­miza­tion of space, so that you don’t even­tu­al­ly run out of space on your com­put­er after a while.


A direc­to­ry can be rep­re­sent­ed as a tree. The root of the tree rep­re­sents the direc­to­ry which is being kept under ver­sion con­trol, inter­nal nodes rep­re­sent sub­di­rec­to­ries, and leaves rep­re­sent files:

1

And me cre­at­ing a zip file of the entire direc­to­ry at dif­fer­ent times looks like this:

2

If I had a huge direc­to­ry con­tain­ing thou­sands of files, even if I’ve only changed one file, I’d still have a zip file that’s almost the same size as the pre­vi­ous one, since it con­tains a dupli­cate of all the other files that haven’t changed.

What if we could have a way to ref­er­ence parts of the tree that didn’t change?

3


So the way Git does this is to basi­cal­ly have a way to track which parts of the tree has changed. It does this by using a hash func­tion to hash files and sub­di­rec­to­ries.

Let’s see how the hash­ing process works. Files are hashed one by one (here I’m using an imag­i­nary hash func­tion):

4

Sub­di­rec­to­ries are only slight­ly trick­i­er. The con­tents of a sub­di­rec­to­ry is listed in a text file, with the name of each item and its hash on each line:

For dir1, it looks like this:

b.txt b123
c.txt c123

and for the root direc­to­ry code, it looks like this:

dir1 d1123
a.txt a123
dir2 d2123

5

Notice that code has a mix of files and other sub­di­rec­to­ries. That’s fine — each file and sub­di­rec­to­ry will have its own hash. This text file is then itself hashed.

Now, recall that each file’s hash changes when the con­tents change, and each direc­to­ry’s hash is based off the con­tents of the direc­to­ry. So, it stands to reason that if any­thing in the direc­to­ry changes (and this includes any­thing in the direc­to­ry’s sub­di­rec­to­ries), then the direc­to­ry’s hash will change too.

Per­haps more impor­tant­ly, the con­verse is true as well — if a direc­to­ry’s hash is the same, then we can be sure that every­thing in the direc­to­ry, includ­ing all its sub­di­rec­to­ries, has not changed.

So, the way Git shares parts of the tree is by only cre­at­ing new pieces of infor­ma­tion for those parts of the tree that’s changed:

6

Recall that the text file that rep­re­sent­ed code looked like this:

dir1 d1123
a.txt a123
dir2 d2123

After we mod­i­fied a.txt, the hash of the mod­i­fied file has been changed from a123 to a124, so the text file that rep­re­sents code now looks like this:

dir1 d1123
a.txt a124
dir2 d2123

Notice that dir1 and dir2’s hashes are the same, mean­ing that we can “reuse” them from the pre­vi­ous commit. That text file is then hashed, pro­duc­ing a dif­fer­ent string c567.


Since we’ve cov­ered trees and files, where do com­mits come into the pic­ture?

Com­mits are just anoth­er piece of infor­ma­tion that’s asso­ci­at­ed with each top-level tree object. Com­mits are what ties trees togeth­er and gives them con­ti­nu­ity through time by stor­ing its par­en­t’s hash, besides also con­tain­ing the commit mes­sage, commit author, and commit date:

7

And that’s it!


Now that we know how Git works, let’s try some­thing inter­est­ing.

Do this in your ter­mi­nal:

mkdir -p test-git/dir1/dir2/dir3/dir4/dir5
cd test-git
git init
echo "Test" > dir1/dir2/dir3/dir4/dir5/a.txt
git add .
git commit -m "Initial commit"
du -sh .git

You should see a direc­to­ry size of 124K:

124K	.git

Create a few more com­mits:

echo "Test" >> dir1/dir2/dir3/dir4/dir5/a.txt
git commit -am "Second commit"
du -sh .git
156K	.git
echo "Test" >> dir1/dir2/dir3/dir4/dir5/a.txt
git commit -am "Third commit"
du -sh .git
188K	.git

Notice how the direc­toy size of .git seems to increase as a mul­ti­ple of 32K every commit even though we’ve only changed 1 file? Here’s why it’s hap­pen­ing:

8

From what we know about sub­di­rec­to­ry hashes chang­ing, it makes sense, because in this case Git has to create new tree objects for the entire chain. In this case, it has to create 8 new objects, each 4K in size. To con­firm, you can see the number of files increas­ing in .git/objects:

du -h .git/objects

(rm -rf test-git to cleanup after you’re done!)


I’m fudg­ing over some of the details here, but recall that the pur­pose of this post is just to give some intu­ition on how it works on a high level. I’ll sug­gest Chap­ter 10 of Pro Git for a deeper dive!