Since its inception, Git has fast become one of the most popular distributed version control systems in use. Despite its pervasive use, Git often still comes across as arcane, with obtuse commands, many of which seem to do similar things. In this article series we will attempt to unravel the mysteries of Git by taking a deep dive into its internals. We will explore the core data structure Git uses to store our repository’s history and then look at a few commands to see how they mutate and manipulate this data structure. This will give us a better understanding of the workings of Git, and allow us to better leverage Git for our daily use.
As you know, Git is a distributed version control system.
Git stores all of the repository’s history inside the
.git directory which is usually found at the root level of the Git repository.
We will start our exploration of Git by first taking a peek inside the .git directory.
Before we begin, let us initialize a new Git repository by using Git’s git-init command.
Be sure to navigate to a scratch directory prior to running the following command:
$ git init gitsGuts
# Initialized empty Git repository in /Users/looselytyped/Documents/articles/gitsGuts/.git/
$ cd gitsGuts
$ (master) ls -al
....
total 0
drwxr-xr-x 3 looselytyped staff 102 Jul 6 14:57 .
drwxr-xr-x 14 looselytyped staff 476 Jul 6 14:57 ..
drwxr-xr-x 9 looselytyped staff 306 Jul 6 14:57 .git
....
Now that we have our repository set up let us take a quick look at the
.git directory’s structure.
We can use the Unix tree command to see the structure of the .git directory:
$ (master) tree .git
....
.git
├── HEAD (1)
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects (2)
│   ├── info
│   └── pack
└── refs (3)
    ├── heads
    └── tags
....
<1> Symbolic Reference
<2> Object datastore
<3> References
Some of the files and directories found within the
.git directory serve to help configure and customize the Git repository.
To help us out, I have highlighted a few files and directories that will be of particular interest for us in this article series.
If this seems to be unfamiliar territory, worry not: we will be more than acquainted before we are finished here.
Now that we have a Git repository, let us get a high level overview of the core constructs that make up Git’s datastore.
The Git datastore
The Git datastore is made up of four different kinds of objects: blobs, trees, commits, and tags.
For the purposes of our discussion it will suffice to look only at blobs, trees and commits. Before we look at these individually, let us talk about these objects from a 20,000-foot view.
As one with an object-oriented background, I remember how my ears perked up when I heard of “Git objects” — I was already thinking of what their API might look like.
But Git objects are nothing like the objects you may be used to in OO-land.
Rather, when you think of Git objects just think of them as “opaque” (that is “not plain text”) records that are stored on the file system (in this case that would be the .git directory, or specifically, inside the .git/objects directory).
Each of these objects is compressed prior to being persisted on disk, and Git uses a SHA-1 hash not only to uniquely identify each object, but also to decide where the object is stored.
I realize that this all seems a little abstract, so let us deep-dive into each object individually and perhaps some of this will come into perspective.
We will start with blobs first.
Blobs in Git store the contents of files. Say it with me: blobs in Git store the contents of files. To put it another way, no metadata about the file is stored in a blob: no names, paths, or types of files (regular, executable, symlink). None of that is stored in a blob. When Git creates a blob it uses the contents of a file (prefixed with a short header that records the object type and size) to produce a SHA-1 hash. It then uses this hash both to fingerprint the blob and to determine where to store it. Let us see some of this in action. We will start by creating some content, and we will ask Git for the hash it will use to represent that content within the datastore.
$ (master) echo 'Hello Git!' | git hash-object --stdin
# 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
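To make the idea concrete, here is a minimal Python sketch of what git-hash-object computes for a blob. It assumes Git’s loose-object convention, in which the hash covers a short header (the object type, the content length in bytes, and a NUL byte) followed by the content itself. The function name git_blob_hash is made up for this illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git would assign to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes, not the bytes alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # echo appends a trailing newline, so the content is 11 bytes long.
    print(git_blob_hash(b"Hello Git!\n"))
```

If the header assumption holds for your version of Git, this should print the same hash that git hash-object reported above. Notice that the file name never enters the calculation.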
Of course this is not usually how we use Git.
So let us write a file with the same content and
git-add the file so that Git adds it to its datastore.
We will then use the Unix tree command to inspect the .git/objects directory.
git-add a file to Git
$ (master) echo 'Hello Git!' > README.md
$ (master) git add README.md
$ (master) tree .git/objects/
....
.git/objects/
├── 10
│   └── 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e
├── info
└── pack

3 directories, 1 file
....
Recall that the hash that Git created to represent “Hello Git!” was 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e.
After we ran git-add on README.md to add the file to Git’s index, we see that Git has created a hierarchy containing one folder and one file under .git/objects.
The name of the directory just happens to be the first two characters of the hash that represents the content, and the name of the file happens to be the remaining 38 characters.
6287c47fd25ad9a0874670a0d5c6eacf1bfe4e happens to be the blob that Git created to store the contents of README.md.
The blob, as I mentioned earlier, is a compressed file that contains the contents of README.md.
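The two-character directory, the 38-character file name, and the compression step can all be sketched in a few lines of Python. This is only an approximation of how a loose object is laid out on disk, assuming zlib compression of the header plus the content. The function names are invented, and a temporary directory stands in for .git/objects.

```python
import hashlib
import os
import tempfile
import zlib
from typing import Tuple


def object_path(objects_dir: str, sha: str) -> str:
    """First two hex characters name the directory, the remaining 38 the file."""
    return os.path.join(objects_dir, sha[:2], sha[2:])


def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object in the loose-object layout and return its hash."""
    raw = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00" + body
    sha = hashlib.sha1(raw).hexdigest()
    path = object_path(objects_dir, sha)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(zlib.compress(raw))  # header and body are compressed together
    return sha


def read_loose_object(objects_dir: str, sha: str) -> Tuple[str, bytes]:
    """Decompress an object file and split it into (type, body)."""
    with open(object_path(objects_dir, sha), "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, body = raw.partition(b"\x00")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
        sha = write_loose_object(objects_dir, "blob", b"Hello Git!\n")
        print(os.path.relpath(object_path(objects_dir, sha), objects_dir))
        print(read_loose_object(objects_dir, sha))
```

Reading the object back is roughly what git cat-file does for us: the first field of the header answers the -t question, and the body answers the -p question.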
Let us use Git to find out a little more about this hash.
$ (master) git cat-file -t 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
# blob
$ (master) git cat-file -p 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e
# Hello Git!
$ (master) git cat-file -p 106287
# Hello Git!
We use the git-cat-file command to ask Git for the type of object (using the -t flag) that 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e represents, and Git reports it as a blob.
No surprise there.
We can use the same command to ask Git to pretty-print the contents that the hash represents (this time using the
-p flag) and again, no surprise.
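Most Git commands that accept hashes as arguments can be supplied with an abbreviated hash. Any prefix of at least four characters works as long as it is unambiguous within the repository, and the first six or seven characters are usually sufficient for Git to know which hash you mean.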
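As a rough illustration of how an abbreviated hash can be expanded, the following Python sketch looks for the single known hash that starts with a given prefix. The resolve_prefix function is invented for this example, and the second hash in the sample list is made up purely to force an ambiguity. The four-character minimum mirrors the lower bound Git documents for abbreviations.

```python
from typing import Iterable


def resolve_prefix(prefix: str, known_hashes: Iterable[str]) -> str:
    """Expand an abbreviated hash to the single full hash it identifies."""
    if len(prefix) < 4:
        raise ValueError("prefix too short: at least 4 characters are needed")
    matches = [h for h in known_hashes if h.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise KeyError(f"{prefix!r} is ambiguous: {len(matches)} candidates")
    return matches[0]


if __name__ == "__main__":
    objects = [
        "106287c47fd25ad9a0874670a0d5c6eacf1bfe4e",  # our README.md blob
        "1062aa0000000000000000000000000000000000",  # hypothetical, for illustration
    ]
    print(resolve_prefix("106287", objects))
    try:
        resolve_prefix("1062", objects)
    except KeyError as err:
        print(err)
```

The longer a repository’s history grows, the more likely it is that a short prefix matches several objects, which is why six or seven characters is a comfortable default for small projects.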
One final note — if you have ever heard anyone call Git a content-addressable storage then perhaps you see why — Git uses the contents of a file to determine where it is to be stored.
Feel free to repeat this experiment with another piece of content. Use
git-hash-object to see what hash Git will generate for it, then see if you can predict where Git will store the blob.
Then simply create a new file with the exact same content, and
git-add it to the index.
Inspect the .git/objects directory to see if your guess was correct.
To summarize, blobs represent contents of files. They are identified by SHA-1 hashes that are generated using the contents of the files themselves, and Git uses this hash to determine where to store the blob. They contain no metadata about the file itself — so where does this information get stored? The answer lies in the tree objects. Let us look at those next.
Blobs represent the contents of files; trees represent the directory structure of those files. A tree has pointers to all of the blobs that make up that tree, and perhaps to other trees if there happen to be subdirectories.
Before we dig deeper let us add a bit more structure to our Git repository.
$ (master) mkdir src (1)
$ (master) touch src/Main.java (2)
$ (master) echo '// This is my source code' > src/Main.java (3)
$ (master) git add src/Main.java (4)
$ (master) tree (5)
....
.
├── README.md
└── src
    └── Main.java
....
<1> Add a src sub-directory
<2> Add a file to the sub-directory
<3> Put some contents in the newly created file
<4> git-add the file to the repository
<5> Inspect working directory structure
Quick! How many blobs exist within our Git datastore? If you guessed two then that is absolutely correct. Well done :)
Here is another (albeit trickier) question — how many directories exist within our working directory?
The correct answer to that question is two!
We have the src directory, and we have the working directory itself (represented by . in the output of the tree command above).
We will now ask Git to write the directory structure to the datastore so we can see what tree objects look like.
$ (master) git write-tree (1)
# b81f10b16a08debe2624bdc0233a4c2fe2032616
$ (master) tree .git/objects/ (2)
....
.git/objects/
├── 10
│   └── 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e
├── 75
│   └── 460e5f3dd6fa1688922a2b6737dc1143d9bb3f
├── b8
│   └── 1f10b16a08debe2624bdc0233a4c2fe2032616
├── df
│   └── 5044438d88195ccf896bdad3eef8940b31e7de
├── info
└── pack

6 directories, 4 files
....
<1> Add the tree to the datastore
<2> Inspect the .git/objects directory
We use yet another command (git-write-tree) from Git’s repertoire, one that causes Git to write the directory structure currently staged in the index to the datastore.
Git replies with yet another hash (this time b81f10b16a08debe2624bdc0233a4c2fe2032616); this hash represents the root of the current working directory.
Just as it does with blobs, Git stores the tree under the .git/objects directory: it takes the first two characters of the hash to create a folder (if it does not exist already) and then creates a file named with the remaining 38 characters.
We know that there are two directories in our working directory (the root, and
src) and we have two files.
We confirm this by inspecting the .git/objects directory, which now contains four objects.
We know that 6287c47fd25ad9a0874670a0d5c6eacf1bfe4e contains the contents of README.md (in compressed format) and 1f10b16a08debe2624bdc0233a4c2fe2032616 represents the root directory. The obvious question is: how does Git represent a directory structure? Let us find out.
$ (master) git cat-file -t b81f10b16a08debe2624bdc0233a4c2fe2032616 (1)
# tree
$ (master) git cat-file -p b81f10b16a08debe2624bdc0233a4c2fe2032616 (2)
# 100644 blob 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e README.md
# 040000 tree 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f src
<1> Ask for the type of object that b81f10b16a08debe2624bdc0233a4c2fe2032616 represents
<2> Pretty-print (-p) it
We once again use git-cat-file to ask for the type of object that b81f10b16a08debe2624bdc0233a4c2fe2032616 represents, and Git tells us it is a tree object.
Pretty printing the same hash reveals something that looks a lot like a directory listing!
Looking over the contents of the pretty print we see a few items that should be familiar.
We know that 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e is a blob representing README.md.
We also see an entry for a tree with the name src with a hash of 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f.
Let us inspect that before we proceed to see what actually happened when Git wrote the tree.
$ (master) git cat-file -p 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f (1)
# 100644 blob df5044438d88195ccf896bdad3eef8940b31e7de Main.java
$ (master) git cat-file -p df5044438d88195ccf896bdad3eef8940b31e7de (2)
# // This is my source code
<1> Pretty-print 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f
<2> Pretty-print df5044438d88195ccf896bdad3eef8940b31e7de
Pretty printing 75460e5f3dd6fa1688922a2b6737dc1143d9bb3f reveals a string much like we saw for b81f10b16a08debe2624bdc0233a4c2fe2032616, except this one has only one entry in it.
Pretty printing the blob contained within the src directory reveals that it represents the contents of Main.java.
How does this work?
When we asked Git to write the tree to the datastore it started recursively inspecting the working directory from the root.
It realized that there was a sub-directory (src) under the root directory and first calculated the hash for that directory.
It did so by creating a string that looked like 100644 blob df5044438d88195ccf896bdad3eef8940b31e7de Main.java and then using the SHA-1 algorithm to generate a hash from that string.
It then stuffed that very string (after compressing it) in a file called 460e5f3dd6fa1688922a2b6737dc1143d9bb3f under the 75 directory under .git/objects.
The following listing highlights the constituent parts of the string that represent a tree (or a directory) within Git.
The mode: 100644 represents a regular non-executable file (Git uses several other codes such as 100755 to represent executable files, and 040000 to represent sub-directories, a.k.a. sub-trees)
The type: blob, tree, etc.
The hash of the current entry
The name of the entry
Perhaps now you see where the file (or blob) metadata is stored — it is in the tree! Furthermore, Git uses the hash of the blobs (and sub-trees) within a tree to calculate the hash of the tree itself!
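The string we looked at is the pretty-printed view of a tree. As a rough sketch of what actually gets hashed, the Python below assumes Git’s binary tree layout: each entry is the mode, a space, the name, a NUL byte, and the raw 20-byte hash of the entry, with the type implied by the mode and sub-directories stored as 40000 (no leading zero). The helper names are invented for illustration.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex hash of the blob or sub-tree)


def hash_object(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>\\0' followed by the body, as Git does for every object."""
    header = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def sort_key(entry: Entry) -> bytes:
    """Entries are ordered by name; directories compare as if they ended in '/'."""
    mode, name, _sha = entry
    return (name + "/" if mode == "40000" else name).encode("utf-8")


def tree_body(entries: List[Entry]) -> bytes:
    """Serialize each entry as '<mode> <name>\\0' plus the 20 raw hash bytes."""
    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(sha)
    return body


if __name__ == "__main__":
    main_java = hash_object("blob", b"// This is my source code\n")
    src_tree = hash_object("tree", tree_body([("100644", "Main.java", main_java)]))
    print(main_java)
    print(src_tree)
```

If these assumptions match your Git version, the two printed values should line up with the hashes of Main.java and the src tree from the listings above. Either way, the important point survives: the tree’s hash is derived from the names, modes, and hashes of its entries.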
Now that Git knows the hash of the
src directory it traverses up to the parent directory (or the root directory in our case) and writes out another string that lists all the blobs and trees within that directory.
It uses that string to calculate the hash of the root directory and, just like before, stuffs that string in a file called 1f10b16a08debe2624bdc0233a4c2fe2032616 under the b8 directory under .git/objects.
Let us restate what we learned here. Trees in Git store the metadata (the type, hashes, and names) about the blobs that are contained within them. The hash of the tree is calculated using a string that looks very much like a directory listing. If a tree contains a sub-directory, then the hash of the sub-tree is first calculated and used to calculate the hash of the parent directory.
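The bottom-up order described here lends itself to a short recursive sketch. The Python below models a working directory as a nested dictionary (an assumption made purely for illustration) and keeps the objects in an in-memory dictionary instead of .git/objects. It treats every file as a regular 100644 file, and the names store_object and write_tree are made up.

```python
import hashlib
from typing import Dict


def store_object(store: Dict[str, bytes], obj_type: str, body: bytes) -> str:
    """Hash header plus body, remember the object under that hash, return the hash."""
    raw = obj_type.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\x00" + body
    sha = hashlib.sha1(raw).hexdigest()
    store[sha] = raw
    return sha


def write_tree(directory: Dict[str, object], store: Dict[str, bytes]) -> str:
    """Hash sub-directories first, then use their hashes to build the parent tree."""
    entries = []
    for name, node in directory.items():
        if isinstance(node, dict):
            entries.append(("40000", name, write_tree(node, store)))  # recurse first
        else:
            entries.append(("100644", name, store_object(store, "blob", node)))
    # Entries are ordered by name, with directories compared as if they ended in "/".
    entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode("utf-8"))
    body = b"".join(
        mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in entries
    )
    return store_object(store, "tree", body)


if __name__ == "__main__":
    working_dir = {
        "README.md": b"Hello Git!\n",
        "src": {"Main.java": b"// This is my source code\n"},
    }
    store: Dict[str, bytes] = {}
    root = write_tree(working_dir, store)
    print(root)
    print(len(store), "objects")  # two blobs and two trees
```

Because the hash of src is baked into the root tree’s body, changing a single byte in Main.java changes the hash of src, which in turn changes the hash of the root.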
Phew! Almost there. Let us look at commits next.
Commits are the level of abstraction that we as developers using Git are most familiar with.
The help page of git-commit-tree (git help commit-tree) tells us:
While a tree represents a particular directory state of a working directory, a commit represents that state in “time,” and explains how to get there.
In other words a commit is a snapshot of the working directory at the time the commit was made.
Just so we are on the same page, let us check our Git status:
$ (master) git status
# On branch master
....
Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

    new file:   README.md
    new file:   src/Main.java
....
Excellent! We have two files staged, and ready to participate in the next commit. Shall we commit?
$ (master) git commit -m "Initial commit"
....
[master (root-commit) 917408c] Initial commit (1)
 2 files changed, 2 insertions(+)
 create mode 100644 README.md
 create mode 100644 src/Main.java
....
<1> Git reports the hash of the commit
If you are playing along, you will get a different hash even if you have the same commit message as mine.
On a successful commit, Git reports the hash (albeit only the first seven characters) of the newly created commit. Fear not: this is the truncated form of the hash, and in most Git operations that require a hash, only the first six or seven characters need be supplied.
If you are curious to know the full hash you can use yet another Git command
git-rev-parse like so:
$ (master) git rev-parse 917408c
# 917408c8318bb3dc86c3a6d1095e27b97d14f637
Pop quiz time!
Based on what we have learned so far, where do you think Git will store the commit?
If your answer is a sub-directory within the .git/objects directory with the name 91 and a file called 7408c8318bb3dc86c3a6d1095e27b97d14f637, then you are absolutely correct!
Go ahead — take a look inside
.git/objects and see for yourself.
Of course the next question to answer is: “What does the file contain?”
Let us ask Git.
We will once again request the services of our helpful friend
git-cat-file to examine the commit.
$ (master) git cat-file -t 917408c8318bb3dc86c3a6d1095e27b97d14f637 (1)
# commit
$ (master) git cat-file -p 917408c8318bb3dc86c3a6d1095e27b97d14f637 (2)
....
tree b81f10b16a08debe2624bdc0233a4c2fe2032616
author Raju Gandhi <firstname.lastname@example.org> 1405795376 -0400
committer Raju Gandhi <email@example.com> 1405795376 -0400

Initial commit
....
<1> Ask for the type
<2> Pretty print it
The type of object that 917408c8318bb3dc86c3a6d1095e27b97d14f637 represents is a commit. Again, no surprise there.
Pretty printing it reveals a few details about the commit. We see the hash of the tree that we created earlier using
git-write-tree. We also see some author and committer information. This is followed by a blank line followed by the commit message we supplied when we created the commit.
Any guesses as to how the hash of the commit was calculated?
Let us step into Git’s shoes and see what happens when we make a commit.
Keep in mind that the first thing we do is to add all the files (via
git-add) to the index that we want to commit.
This we know will trigger Git to calculate the blobs to represent each of the files.
On the commit (via
git-commit), Git will internally write the tree to the datastore and then write the commit.
In order to calculate the hash of the commit Git will take the hash of the tree (as reported by git-write-tree), the author information (as provided by Git’s configuration), the committer information (which in our case happens to be the same as the author information, since we are both making the changes and committing them to Git), the current timestamp, and finally the commit message.
It then proceeds to write out a string that looks like so:
tree <tree hash>
author <author name> <author email> <timestamp>
committer <committer name> <committer email> <timestamp>

Commit message
It proceeds by hashing this string to create the hash of the commit. Finally, it compresses this string and writes it to a file whose path is dictated by the hash it created.
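Here is a hedged Python sketch of that recipe. It assumes the commit is hashed like any other object, with a header of the form commit, a space, the size, and a NUL byte in front of the text, and that the message ends with a newline. The two identities are placeholders invented for the example, and so are the function names.

```python
import hashlib


def commit_body(tree: str, author: str, committer: str, message: str) -> bytes:
    """Lay out the commit text: headers, a blank line, then the message."""
    lines = [
        f"tree {tree}",
        f"author {author}",
        f"committer {committer}",
        "",
        message,
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def commit_hash(body: bytes) -> str:
    """Hash 'commit <size>\\0' followed by the commit text."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    tree = "b81f10b16a08debe2624bdc0233a4c2fe2032616"
    # Placeholder identities: name, email, Unix timestamp, and timezone offset.
    ada = "Ada Example <ada@example.com> 1405795376 -0400"
    bob = "Bob Example <bob@example.com> 1405795376 -0400"
    print(commit_hash(commit_body(tree, ada, ada, "Initial commit")))
    print(commit_hash(commit_body(tree, bob, bob, "Initial commit")))
```

Both commits point at the same tree and carry the same message, yet the two printed hashes differ because the identity lines are part of the hashed text. The same would happen if only the timestamp changed.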
Just like the hash of a tree is a function of all the blobs and trees beneath it, the hash of a commit is a function of the tree that was written when the commit was created.
I mentioned earlier that if you were playing along you will see a different hash than mine. How was I so sure? This is because the hash of the commit is a function of a lot more than just the tree hash! And hopefully, email addresses are unique! :)
The Git DAG
We now know how a Git commit is created. We know that the hash of a Git commit is representative of the tree it points to, which in turn is representative of all the blobs and sub-trees it contains.
But there is one more component to a Git commit. Before we proceed we should note that the commit we made was the first commit in our newly created repository. Let us make a minor change and make another commit to record that change. We will then interrogate the hash of the commit to see what it looks like.
$ (master) echo "Making another commit" >> README.md (1)
$ (master) git add README.md (2)
$ (master) git commit -m "Second commit" (3)
# [master e4e4b13] Second commit (4)
# 1 file changed, 1 insertion(+)
$ (master) git cat-file -p e4e4b13 (5)
....
tree e257f1322a6d1eff27c146860e5bf3db286eceef
parent 917408c8318bb3dc86c3a6d1095e27b97d14f637
author Raju Gandhi <firstname.lastname@example.org> 1405799643 -0400
committer Raju Gandhi <email@example.com> 1405799643 -0400

Second commit
....
<1> Make a change to README.md
<2> Add README.md to the staging area
<3> Make a commit
<4> Git reports back the hash of the newly created commit
<5> Examine the commit
Compare the output of git cat-file for e4e4b13 against the one we saw previously for 917408c8318bb3dc86c3a6d1095e27b97d14f637.
We see that e4e4b13 has one more entry in it for parent.
Furthermore, the hash against the parent is the hash of our first commit.
In essence, a Git commit not only points to the tree that represents the working directory, it also points to the hash of the commit that was made just before it. If a commit does not have a parent, Git knows it to be the initial commit in a repository.
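To see how the parent entry ties commits together, here is a small Python sketch that reads the parent lines out of a commit’s text and follows them back to the root. The commit texts mirror the two commits in this article, but the author lines are placeholders, the second commit is keyed by its abbreviated hash because that is all we have seen of it, and the function names are invented.

```python
from typing import Dict, List

COMMITS: Dict[str, str] = {
    "917408c8318bb3dc86c3a6d1095e27b97d14f637": (
        "tree b81f10b16a08debe2624bdc0233a4c2fe2032616\n"
        "author A U Thor <author@example.com> 1405795376 -0400\n"
        "committer A U Thor <author@example.com> 1405795376 -0400\n"
        "\n"
        "Initial commit\n"
    ),
    "e4e4b13": (
        "tree e257f1322a6d1eff27c146860e5bf3db286eceef\n"
        "parent 917408c8318bb3dc86c3a6d1095e27b97d14f637\n"
        "author A U Thor <author@example.com> 1405799643 -0400\n"
        "committer A U Thor <author@example.com> 1405799643 -0400\n"
        "\n"
        "Second commit\n"
    ),
}


def parents_of(commit_text: str) -> List[str]:
    """Collect the hashes on 'parent' header lines; a root commit has none."""
    headers = commit_text.split("\n\n", 1)[0]  # headers end at the first blank line
    return [line.split(" ", 1)[1] for line in headers.splitlines() if line.startswith("parent ")]


def history(start: str, commits: Dict[str, str]) -> List[str]:
    """Follow first-parent links from start back to the root commit."""
    chain = []
    current = start
    while current is not None:
        chain.append(current)
        parents = parents_of(commits[current])
        current = parents[0] if parents else None
    return chain


if __name__ == "__main__":
    for commit in history("e4e4b13", COMMITS):
        print(commit)
```

The walk stops as soon as it meets a commit with no parent line, which is how the sketch recognizes the initial commit.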
To better visualize this I have created an illustration that might help cement this idea:
The red circles in Figure 1 represent commits in our repository, the triangles represent trees and rectangles represent blobs, and time flows up — the child commits appear above their predecessors (just like you see them in Git’s logs).
Our first commit consisted of the README.md file at the root, and the Main.java inside the src directory.
Our second commit only updated the README.md file.
Here is where things get interesting — recall that a commit is a snapshot of the working directory at the time the commit was made.
Git knows of
Main.java at the time of the second commit, but also realizes that the file was not modified.
So it simply reuses the blob it created the first time around.
But it does record the state of the working directory in every commit.
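The reuse of unchanged blobs falls out of content addressing, and a toy model shows why. The Python sketch below simplifies heavily: it records each snapshot as a flat mapping from path to blob hash instead of real tree objects, and it keeps blobs in a dictionary. The names blob_id and snapshot are invented for illustration.

```python
import hashlib
from typing import Dict


def blob_id(content: bytes) -> str:
    """Hash 'blob <size>\\0' plus the content, as Git does for blobs."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def snapshot(files: Dict[str, bytes], store: Dict[str, bytes]) -> Dict[str, str]:
    """Record every file, but only add blobs the store has not seen before."""
    listing = {}
    for path, content in files.items():
        sha = blob_id(content)
        store.setdefault(sha, content)  # identical content maps to the same key
        listing[path] = sha
    return listing


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    first = snapshot(
        {"README.md": b"Hello Git!\n", "src/Main.java": b"// This is my source code\n"},
        store,
    )
    second = snapshot(
        {
            "README.md": b"Hello Git!\nMaking another commit\n",
            "src/Main.java": b"// This is my source code\n",
        },
        store,
    )
    print(len(store), "blobs stored for two snapshots")
    print(first["src/Main.java"] == second["src/Main.java"])
```

Both snapshots list every file, yet the store only grows by one blob for the second snapshot, because Main.java hashes to a key that is already present.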
You can see in Figure 1 that the commits form a DAG, or directed acyclic graph. The graph is directed because children point towards their parents but never the other way around, and acyclic because following those pointers can never lead back to the commit you started from.
Therefore, each commit is not only a function of the state of the working tree (along with other information) but also of the commits that came before it.
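One consequence is worth seeing in code: because a child’s text contains its parent’s hash, altering an ancestor changes the hash of every descendant. The Python sketch below assumes the commit layout discussed earlier and uses a fixed placeholder identity so that only the parent link varies. The commit_id function is made up for this example.

```python
import hashlib
from typing import Optional


def commit_id(tree: str, message: str, parent: Optional[str] = None) -> str:
    """Hash a commit whose text optionally names a parent commit."""
    lines = [f"tree {tree}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    # Placeholder identity; a real commit records name, email, and timestamp here.
    lines += [
        "author A U Thor <author@example.com> 0 +0000",
        "committer A U Thor <author@example.com> 0 +0000",
        "",
        message,
    ]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    tree1 = "b81f10b16a08debe2624bdc0233a4c2fe2032616"
    tree2 = "e257f1322a6d1eff27c146860e5bf3db286eceef"

    first = commit_id(tree1, "Initial commit")
    second = commit_id(tree2, "Second commit", parent=first)

    # Reword only the first commit; the second commit's own content is untouched.
    reworded_first = commit_id(tree1, "Initial commit, reworded")
    second_again = commit_id(tree2, "Second commit", parent=reworded_first)

    print(second)
    print(second_again)
```

The second commit has the same tree and the same message in both cases, yet the two printed hashes differ, since the parent line it was hashed with is different.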
We know Git hashes are going to be unique — so if two different repositories have the same files with the same names and the same content in the same directory structure (which leads to the same tree hash) the commits will be unique merely as a function of the authors/committers being different.
Git’s power comes from simplicity.
Understanding how commits are created and how they participate in the DAG is foundational to the understanding of Git.
In this article we saw how Git stores the history of our repository within a Directed Acyclic Graph of commits, and how the
git-commit command adds to this graph.
In part II of this article series we will take a look at a few more commands such as
git-merge to see how they manipulate this graph.
Understanding how a command alters the DAG, and being able to visualize both the current and the final state of the graph as a function of executing such a command will lift the veil of obscurity that seemingly surrounds Git, and is the key to mastery.
Till we meet again, keep “add-ing” to your experience with Git and stay “commit-ted” to learning more. :)