Git: Splitting Repos and Scrubbing Sensitive Data

Last updated on November 15, 2023

This website, or rather its many parts, lives in a monolithic GitHub repository. I wanted to split out the guides (like the one you are reading right now) into their own public GitHub repository.

At the same time, I wanted to keep the rest private and somehow end up with a unified repository, where I directly "link" the public one into the private one.

If you’re curious about how to do that and also how to remove sensitive data from any Git repo, this post is for you.

Splitting Repositories

Before you start, you’ll need to install git-filter-repo, a handy Python script that lets you do all the things you never even knew you wanted to do with a Git repository.

Follow the project's installation instructions. Essentially, you need to download the git-filter-repo Python script and put it somewhere on your $PATH.

Then, do a full, new clone of your repository and cd into it.

git clone https://github.com/{username}/{repository name}

cd {repository name}

In my case, I wanted to make one specific subfolder of this repository (marcobehler-guides/eins/zwei/drei) the ROOT for my new repository.

marcobehler-guides
  eins
    zwei
      drei
        git.adoc
        spring.adoc
        maven.adoc

The following command did the trick:

git filter-repo --subdirectory-filter {relative-folder-path}

// e.g. git filter-repo --subdirectory-filter marcobehler-guides/eins/zwei/drei

You’ll end up with a new Git repository that only contains the files from your specified subdirectory.

marcobehler-guides
  git.adoc
  spring.adoc
  maven.adoc
  ...

As a bonus, this command keeps the entire Git history for all those files!

git log

...

commit 53b84195d1197773b3c8969dc2ea07faef6041c7
Author: Marco Behler <marco@marcobehler.com>
Date:   Mon Nov 13 17:15:32 2018 +0100

...
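If you'd rather try the subdirectory filter on something disposable first, here is a minimal sketch. It builds a throwaway repository in a temp directory and splits out one subfolder. The folder and file names (guides/eins, git.adoc, index.html) are made up for the demo, and the rewrite is skipped if git-filter-repo is not on your $PATH.

```shell
#!/bin/sh
# Sketch: try --subdirectory-filter on a throwaway repository.
# All paths and file names below are made up for the demo.
set -eu

# Identity for the demo commits, so no global Git config is needed.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main

# Two commits inside the subfolder, one commit outside of it.
mkdir -p guides/eins
echo "= Git" > guides/eins/git.adoc
git add . && git commit -q -m "Add git guide"
echo "private" > index.html
git add . && git commit -q -m "Add private site file"
echo "= Spring" > guides/eins/spring.adoc
git add . && git commit -q -m "Add spring guide"

if command -v git-filter-repo >/dev/null 2>&1; then
    # --force is only needed because this repository is not a fresh clone.
    git filter-repo --force --subdirectory-filter guides/eins
    git ls-files          # files that now sit at the repository root
    git log --oneline     # only the commits that touched the subfolder
else
    echo "git-filter-repo not found on PATH, skipping the rewrite"
fi
```

On a real repository you would work on a fresh clone, as described above, and leave out --force.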

Subtree Merges

Now that I had two repositories, I asked myself how I could link the two, i.e. end up with one unified repository. Or put another way: I wanted to include the new repository in my old repository.

marcobehlercom (old repo)
  some_folder
    ...
  another_folder
    ...
  marcobehler-guides (new repo)
    git.adoc
    spring.adoc
    maven.adoc

There seem to be two choices for this: Git submodules and subtree merges.

I went down the subtree path. If you have experience with submodules, please let me know in the comments. For subtrees, you’ll want to execute these steps:

  1. Add the URL to your new repository as a remote to your (old) repository.

    cd old-repository
    git remote add -f {remote name} {url}
    // e.g. git remote add -f marcobehler-guides https://github.com/marcobehler/marcobehler-guides.git
  2. Tell your old repository that you want to merge a possibly unrelated history into it, without committing yet.

    git merge -s ours --no-commit --allow-unrelated-histories {remote name}/{branch name}
    // e.g. git merge -s ours --no-commit --allow-unrelated-histories marcobehler-guides/main
    > Automatic merge went well; stopped before committing as requested
  3. Copy the new repository’s content into a subfolder of your old repository.

     git read-tree --prefix={relative subfolder path} -u {remote name}/{branch name}
     // e.g. git read-tree --prefix=marcobehler-guides/ -u marcobehler-guides/main
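  4. Commit the merge, because step 2 stopped before committing.

     git commit -m "Merge marcobehler-guides as a subtree"
  5. Tada! The files are now in your unified (old) repository.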
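To see the steps above end to end without touching a real repository, here is a minimal sketch that runs them against two throwaway local repositories. The repository names (public, private), the remote name and prefix (guides) and the files are all made up for the demo. The final git commit is there because the merge was started with --no-commit.

```shell
#!/bin/sh
# Sketch: the subtree merge steps, run against two throwaway local repositories.
# Repository names, file names and the "guides" remote/prefix are made up.
set -eu

# Identity for the demo commits, so no global Git config is needed.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Stand-in for the new (public) repository.
git init -q "$work/public"
cd "$work/public"
git symbolic-ref HEAD refs/heads/main
echo "= Git" > git.adoc
git add . && git commit -q -m "Add git guide"

# Stand-in for the old (private) repository.
git init -q "$work/private"
cd "$work/private"
git symbolic-ref HEAD refs/heads/main
echo "private" > index.html
git add . && git commit -q -m "Add private site file"

# Step 1: add the new repository as a remote and fetch it right away (-f).
git remote add -f guides "$work/public"

# Step 2: start a merge of the unrelated history without changing any files yet.
git merge -s ours --no-commit --allow-unrelated-histories guides/main

# Step 3: read the remote branch's tree into the guides/ subfolder.
git read-tree --prefix=guides/ -u guides/main

# Finally: commit, since the merge from step 2 stopped before committing.
git commit -q -m "Merge guides as a subtree"

git ls-files                 # guides/git.adoc and index.html
git log --oneline --graph    # a merge commit joining both histories
```

The temp directory is left in place so you can poke around in it; delete it when you are done.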

Challenges with the subtree approach:

  • If there are new changes in the public repo, you’ll have to manually sync the changes.

    git pull -s subtree {remote name} {branch name}
    
    // e.g. git pull -s subtree marcobehler-guides main
  • If you create a fresh clone of your unified repository in the future, the files will be there, but the remote will not. Remotes are local configuration and are not cloned, so you’ll have to add the remote again (step 1) before you can pull.
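Both points can be tried out in one go. The following is a rough sketch, again with throwaway repositories and made-up names (public, unified, fresh, guides): it sets up a unified repository, adds a change to the public one, and then syncs that change into a fresh clone. I pass --no-rebase and --no-edit so the pull runs without prompts; that is my choice for the demo, not something the subtree strategy requires.

```shell
#!/bin/sh
# Sketch: pull later changes from the public repository into the subtree,
# starting from a fresh clone of the unified repository.
# All names ("guides", the repositories, the files) are made up for the demo.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Setup: a public repo, and a unified repo that already contains it under guides/.
git init -q "$work/public"
cd "$work/public"
git symbolic-ref HEAD refs/heads/main
echo "= Git" > git.adoc
git add . && git commit -q -m "Add git guide"

git init -q "$work/unified"
cd "$work/unified"
git symbolic-ref HEAD refs/heads/main
echo "private" > index.html
git add . && git commit -q -m "Add private site file"
git remote add -f guides "$work/public"
git merge -s ours --no-commit --allow-unrelated-histories guides/main
git read-tree --prefix=guides/ -u guides/main
git commit -q -m "Merge guides as a subtree"

# A new change lands in the public repository.
cd "$work/public"
echo "More text" >> git.adoc
git commit -q -a -m "Expand git guide"

# A fresh clone has the guides/ files, but not the "guides" remote.
git clone -q "$work/unified" "$work/fresh"
cd "$work/fresh"
git remote                   # prints only "origin"

# Re-add the remote, then pull with the subtree strategy.
# --no-rebase asks for a merge; --no-edit accepts the default merge message.
git remote add guides "$work/public"
git pull -q --no-rebase --no-edit -s subtree guides main

cat guides/git.adoc          # now includes "More text"
```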

Does anyone know any better ways for the syncing?

Removing Sensitive Data

Along the way I noticed I wanted to remove a couple of files from my new repository and also remove any trace of these files/contents from the Git history. (It might even have been the case that a friend asked me how to get rid of a leaked credential in his repository.)

While you can also use git filter-repo (see above) for that job, I used BFG Repo-Cleaner, because it seems to be simpler and faster (the website claims 10-720x faster than git filter-branch; who wouldn’t want that for a single run ;) ).

bfg is a good, old Java program, so you’ll need to have a JDK installed. Then simply download the .jar file and you can run it like so:

java -jar bfg.jar --delete-files {name of the file with sensitive data}

//e.g. java -jar bfg.jar --delete-files passwords.txt

Note that --delete-files matches on the file name, not on the file's path within the repository.

Important note: I erroneously assumed that BFG would also delete the file from my current commit. Not so.

By default, BFG treats your latest commit as protected and only cleans up the history before it. Which means you’ll first need to remove the file yourself (git rm) and commit that change so it’s gone. Then run BFG to clean up the history of the file.

Now there won’t be any trace of your sensitive data left.
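If you want to convince yourself why git rm on its own is not enough, or check a repository before and after the cleanup, here is a small sketch using plain Git on a throwaway repository. The path and the "secret" are made up for the demo.

```shell
#!/bin/sh
# Sketch: a file removed with git rm is still reachable through history.
# The repository, path and secret below are made up for the demo.
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main

mkdir mysubDir
echo "hunter2" > mysubDir/passwords.txt
git add . && git commit -q -m "Add config"

# Remove the file and commit, as you would before running BFG.
git rm -q mysubDir/passwords.txt
git commit -q -m "Remove passwords.txt"

# The file is gone from the working tree ...
[ -e mysubDir/passwords.txt ] || echo "working tree: file is gone"

# ... but the commits that touched it are still listed,
git log --all --oneline -- mysubDir/passwords.txt

# and its content can still be read from the older commit.
leaked=$(git show HEAD~1:mysubDir/passwords.txt)
echo "still in history: $leaked"
```

After a history rewrite with BFG or git filter-repo, the git log command above should print nothing. The BFG documentation also suggests running git reflog expire --expire=now --all followed by git gc --prune=now --aggressive afterwards, so the old objects are actually dropped from your local repository.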

Fin

That’s all. I have the feeling I’ll need another couple of years to fully understand what Git, or rather tools like git filter-repo, are capable of doing. It almost looks like a runner-up to ffmpeg in terms of complexity. So, stay tuned for more Git posts!

Meanwhile, you might enjoy my Git: Merge, Cherry-Pick & Rebase guide. Or, if you prefer video and are using IntelliJ IDEA, check out 5 great Git & IntelliJ IDEA Tricks.

let mut author = ?

I'm @MarcoBehler and I share everything I know about making awesome software through my guides, screencasts, talks and courses.

Follow me on Twitter to find out what I'm currently working on.