A Non-Tech Introduction: Git and GitHub

15 April 2014

After joining a startup, one cool tech concept I had to learn about right away was version control. At work, we use a tool called GitHub, which is built on the Git version control system and allows many coders to work on one codebase. I've found Git fascinatingly elegant, and GitHub an incredibly fun tool to use. As a sequel to my first "Non-Tech Introduction", this post will provide an overview of Git and GitHub assuming no tech background and with minimal use of jargon.

The problem: Why do we need version control?

Git is one solution to the problem of "version control". Generally speaking, this is the problem of keeping track of what changes have been made to certain files over time.

Why do coders care about what files of code looked like in the past, instead of just the current state? One reason is that it provides a backup. If a major bug is discovered in a set of code changes that was just rolled out to live users, version control allows you to more easily "roll back" to a previous, unbroken state of the codebase.

Another reason is that it is helpful for keeping track of who did what and who might understand different parts of the codebase. If I have to tackle a problem in a new part of the codebase that I'm not familiar with, I can look at the history to see which one of my colleagues worked with this code last and ask him for help.

Isn't this like "Track Changes" in Word?

The idea of version control is analogous to the "Track Changes" feature in Microsoft Word. Word lets you see what parts of a text document were changed by the last editor, including any additions, deletions, or comments. However, you might have experienced how a document's changes can get overwhelming as soon as there are multiple rounds of edits. Track Changes doesn't do a good job of separating one set of changes from another. On top of that, Word doesn't make it easy to see what a document looked like at a certain point in the past. For that, you might be implementing your own type of version control by re-saving each major version of the file with a new name that has a timestamp.

Git is much smarter about tracking changes, and it's able to do this because of one big advantage: code is divisible into separate lines. Normal writing can be divided into sentences or paragraphs, but both of these are more complex "units" of text that are longer and harder to trace over time than lines of code. This simplification is a key reason why Git works so well with code, though it could conceivably work for text documents as well.

The Diff

A helpful way to think about Git is that it keeps track of a document not as a pile of whole-document snapshots taken at various points in time, but as sets of differences, or "diffs", between one version and the next. (Under the hood, Git does store snapshots and works out the diffs from them, but diffs are what you see and work with day to day.) Below is a visual example of how Git sees a change in some code:

[Image: a change in line 53]

In this example, someone came along and made a change to line 53, which is registered by Git; there was no change to line 54, so that does not make it into the diff.

And that's all Git needs to describe the change! Each new version can be understood as a diff with respect to the last known version, which itself is the result of all the diffs going back in time. A diff can include combinations of changes to different files, as well as file additions, deletions and renamings.
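For the curious, here is a small sketch in Python of what computing a diff looks like. It uses Python's standard difflib module rather than Git itself, and the two versions of the file and the make_diff name are made up for illustration.

```python
import difflib

def make_diff(old_lines, new_lines):
    """Return a unified diff with no surrounding context lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="before", tofile="after",
        lineterm="",  # our lines carry no trailing newline
        n=0,          # leave out unchanged lines entirely
    ))

# Two hypothetical versions of the same three-line file.
before = [
    "def greet(name):",
    "    print('Hello ' + name)",
    "    return name",
]
after = [
    "def greet(name):",
    "    print('Hello, ' + name + '!')",
    "    return name",
]

diff = make_diff(before, after)
for line in diff:
    print(line)
```

The printed diff marks the old version of the changed line with a minus sign and the new version with a plus sign. The unchanged lines do not appear at all, just like line 54 in the example above.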

A set of files that belong to one codebase is called a "repository", or "repo". Git allows you to see all the diffs that have been made to a repo since it was created.

Remote vs. Local Versions

Another problem that version control has to solve is maintaining a consistent codebase that any member of a team can access, while allowing coders to work on their own computers without necessarily publishing their changes to the common codebase.

This is solved with a distinction between "remote" and "local" versions of the code. The remote version is the canonical one that anybody on the team has access to. The local version is what exists on a coder's own laptop. The local version is not accessible to other team members and serves as a "work in progress".

When a coder is finished with a set of code changes, she can publish (or "push") her local work to the remote version. She can also get the latest version of the remote codebase by "pulling" it down to update her local files.

Pushing or pulling code is really about pushing or pulling diffs. When the coder pushes a new diff, the remote version of the code adds that diff to the code's diff history, and vice versa for pulling the code into the local environment.
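To make the idea of pushing and pulling diffs concrete, here is a toy model in Python. It is a sketch under heavy simplification, not how Git is implemented: a repo is just an ordered list of diff names, and ToyRepo, push and pull are hypothetical names invented for this example.

```python
class ToyRepo:
    """A toy stand-in for a repository: just an ordered list of diff names."""

    def __init__(self, history=None):
        self.history = list(history or [])


def push(local, remote):
    """Send the remote any diffs it is missing; refuse if the remote has moved on."""
    n = len(remote.history)
    if local.history[:n] != remote.history:
        raise RuntimeError("remote has diffs you have not pulled yet")
    remote.history.extend(local.history[n:])


def pull(local, remote):
    """Fetch diffs the local copy is missing (simple case: nothing unpublished locally)."""
    n = len(local.history)
    if remote.history[:n] != local.history:
        raise RuntimeError("local and remote have diverged; a merge is needed")
    local.history.extend(remote.history[n:])


# Everyone starts from the same two diffs.
remote = ToyRepo(["A", "B"])
alice = ToyRepo(["A", "B"])
bob = ToyRepo(["A", "B"])

alice.history.append("C")  # Alice makes a change on her laptop
push(alice, remote)        # ...and publishes it
pull(bob, remote)          # Bob catches up

print(remote.history)
print(bob.history)
```

After the push and the pull, the remote version and Bob's local version both contain Alice's diff. The error branches hint at the harder case, where two people have both made changes.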

Merging and Branches

What happens when two coders are working on the same chunk of code, and create two diffs that both get pushed to the remote version? This situation will likely require a "merge" between the two diffs.

Let's say Alice and Bob are both working on the same few lines of code, and Alice pushes her new diff first. When Bob tries to push his diff, Git will refuse if his local files do not reflect Alice's newest changes (i.e. Bob hasn't pulled them yet). When he pulls them, he might hit a "merge conflict", meaning that something in his diff clashes with a change that Alice made to the same lines. In that case, Git will flag the conflict, and Bob will have to figure out what the "right" final version of the code should be. After changing his diff accordingly, Bob will be able to push it.
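Here is a rough Python sketch of how a conflict between Alice and Bob could be spotted. It assumes a much simpler picture than real Git, which compares regions of text against a shared earlier version: each diff here is just a made-up mapping from line number to new text, and find_conflicts is a hypothetical helper.

```python
def find_conflicts(diff_a, diff_b):
    """Return the line numbers that both diffs change in different ways."""
    shared_lines = diff_a.keys() & diff_b.keys()
    return sorted(line for line in shared_lines if diff_a[line] != diff_b[line])


# Hypothetical diffs: line number -> new content of that line.
alice_diff = {53: "total = price * quantity", 54: "print(total)"}
bob_diff = {53: "total = price + tax", 70: "return total"}

conflicts = find_conflicts(alice_diff, bob_diff)
print(conflicts)
```

Both diffs change line 53 in different ways, so that line is reported as a conflict for a human to resolve. Lines 54 and 70 are each touched by only one person, so they can be combined without any trouble.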

Programming a new feature for an app rarely takes just one diff. To keep things clean, Git has the concept of "branches". If the public-ready version of the codebase is considered the trunk of a tree, a branch is a series of diffs coming off of the trunk that introduce the new feature.

Let's say that the trunk of my codebase already has three "commits" (Git's name for saved diffs): A, B and C. At this point, Alice branches off to work on a new feature, and creates commits X, Y and Z. While she's doing this, her colleagues have already pushed commits D and E to the main remote version.

This might seem like it will create a problem, because both Alice's commit X and her colleagues' commit D represent a set of changes off of commit C. Git is smart enough to handle this situation, however. One way to solve this is to add Alice's commits as if D and E came first, which creates a history of commits A, B, C, D, E, X, Y and Z. This essentially acts like Alice created the branch right after commit E. When "rebasing" her branch like this, Alice might have to resolve any conflicts that were caused by the introduction of commits D and E.
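The reordering described above can be sketched in a few lines of Python. This is only a toy model that treats a history as a list of commit names; real Git re-applies each of Alice's changes and creates new commits as it goes. The rebase function here is a hypothetical illustration, not Git's own code.

```python
def rebase(trunk, branch, fork_point):
    """Replay the branch's own commits on top of the latest trunk."""
    # Everything after the fork point belongs to the branch alone.
    own_commits = branch[branch.index(fork_point) + 1:]
    return trunk + own_commits


trunk = ["A", "B", "C", "D", "E"]          # colleagues added D and E
alice_branch = ["A", "B", "C", "X", "Y", "Z"]  # Alice branched off at C

rebased = rebase(trunk, alice_branch, fork_point="C")
print(rebased)
```

The result is the history A, B, C, D, E, X, Y, Z, as if Alice had started her branch right after commit E.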

Other Features of GitHub

GitHub is a popular app built on Git that amplifies the collaborative capabilities of coders. On top of the Git version control system, GitHub has a number of helpful features, including:

That's Git in a nutshell; for more information, you can check out some documentation or GitHub's own Git tutorial.
