xargs and the unruly tags

A tale of two commands

I thought I was really clever when I configured my CI/CD pipeline to tag commits that got deployed and push the tags back into the repo, but I’m rarely as clever as I like to think: I had forgotten to put the proper checks in place to avoid these tag pushes triggering subsequent runs of the pipeline, and things got a little … out of hand.

I’d gone to bed just after pushing an update, and when I arose to check on it, I found that the deploy tagging stage had been running over and over and over and over and … you get the point. Thankfully, it had failed after about 130 rounds, so it could have been a lot worse, but I was left with a large number of useless and unwanted tags in the remote repo.

So how do you fix something like this? Yup, xargs to the rescue!

Where there’s a will …

At first, I didn’t really know how I’d go about it. I was hoping git would have some nice, built-in functionality for mass-deleting remote tags, but while I have found in retrospect that it does (see the postmortem), I couldn’t find it at the time.

However, because all the tags were for a specific commit, I did know that I could list all the relevant tags separated by newlines, using git tag --contains <SHA>.

So, with some helpful advice from Stack Overflow and this guy, I constructed this little command which sorted me out just fine:
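Pieced together from the explanation that follows (the -I % placeholder and the :refs/tags/ prefix), the command would have had roughly the shape sketched here. The list_tags function and its tag names are made-up stand-ins for git tag --contains <SHA>, and the echo in front of git is an addition that keeps this a dry run which only prints what it would push.

```shell
# Hypothetical stand-in for `git tag --contains <SHA>`:
# prints one tag name per line.
list_tags() {
  printf '%s\n' deploy-001 deploy-002 deploy-003
}

# -I % runs the command once per input line, substituting the line for %.
# The leading `echo` makes this a dry run; remove it to really push.
list_tags | xargs -I % echo git push origin :refs/tags/%
```

Against the real repository, the pipeline would start with git tag --contains <SHA> instead of list_tags.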

Now, I’d come across xargs before, even done the ol’ copying and pasting from Stack Overflow trick, but it had always looked really complicated and no-one had ever told me why I’d need it or what it does; so I just carried on in blissful ignorance. Not this time, though. It was time to figure out what was going on.

Grokking xargs

The way xargs was sold to me was: “execute a command for each item in a list”. It’s actually more powerful than that, but that’s a great place to start.

Let’s use the man page to find out what that -I % bit means:

-I

replace-str: “Replace occurrences in the initial-arguments with names read from standard input”

The string to use to indicate where to place arguments in the command to run. In the command above, we chose to use %, but you’re not limited to this.

Similar to printf and format strings in general, this places your arguments at your desired place in the command. In our case it both limits us to using one argument (tag) at a time, and it lets us append it to :refs/tags/ without being separated by a space.

That means that in the above snippet, xargs would, for each tag listed, run the command git push origin :refs/tags/<tag_name>, which pushes an empty source to that tag’s ref on the remote, thereby deleting it.

If all you want is to put the argument at the end of the command, you can even do without the -I. Say you want to recursively delete all the .swp files in a directory:
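A minimal sketch of that idea, run inside a throwaway directory created with mktemp so that nothing real gets deleted; the file names are invented for the demonstration.

```shell
# Work in a scratch directory so the example is harmless.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub"
touch "$demo_dir/a.swp" "$demo_dir/sub/b.swp" "$demo_dir/keep.txt"

# find prints one path per line; xargs appends those paths to the end of `rm`.
find "$demo_dir" -name '*.swp' | xargs rm
```

For file names that may contain spaces or newlines, pairing find -print0 with xargs -0 is the safer variant.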

Be aware, though, that without -I or -n (which limits the number of arguments used for each command), xargs will split the list you give it into sizeable chunks and pass as many arguments to the command as it can each time. That means that in this case, it’d likely end up looking something like this:

which is usually fine and what you want, but keep this in mind for when it isn’t.
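The batching is easy to see with echo standing in as a dry run. The three file names are invented; with so few arguments everything fits into a single invocation, while -n 1 forces one invocation per name.

```shell
# Default behaviour: as many arguments as fit go into one command.
printf '%s\n' a.swp b.swp c.swp | xargs echo rm
# prints: rm a.swp b.swp c.swp

# -n 1 caps each invocation at one argument, so rm would run three times.
printf '%s\n' a.swp b.swp c.swp | xargs -n 1 echo rm
# prints:
#   rm a.swp
#   rm b.swp
#   rm c.swp
```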

This is only scratching the surface of what xargs can do, but it’s enough to make it do some pretty heavy lifting. It might not be something to reach for very often, but for when you do need it, it’s a great tool to have in your belt.

Postmortem ⚰️

Now, you might have noticed that I did a git push for each tag that I was deleting, and you might be thinking that for over a hundred tags, it must have taken quite some time. You would be right. Luckily, I was working on something else, so I could happily let it run in the background. But we can do better!

xargs has an option -P or --max-procs, which you can use to decide how many processes to run in parallel. The default is 1, but if you set it to 0, it will run as many as it can. This could have saved us quite some time, assuming git would let us run multiple push operations from the same repo at the same time. But there is an even better way:

As outlined in this Stack Overflow response, you can use a whitespace-separated list of tag names (<tags>) with git push; so we could have run git push --delete origin <tags> to achieve the same outcome as deleting them one by one.

If we rewrite the command from earlier, we can both simplify it and do it all in a single push:
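Assuming the same pipeline shape as before, the rewritten command would look something like the following sketch. list_tags is a hypothetical stand-in for git tag --contains <SHA>, and echo keeps it a dry run.

```shell
# Hypothetical stand-in for `git tag --contains <SHA>`.
list_tags() {
  printf '%s\n' deploy-001 deploy-002 deploy-003
}

# No -I this time: xargs appends every tag to the end of one command.
list_tags | xargs echo git push --delete origin
# prints: git push --delete origin deploy-001 deploy-002 deploy-003
```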

… yeah, that would have been a lot more efficient 😅


Docking pains

What to do when the whale is too big
Moby Dock, the Docker logo: A blue whale carrying eight containers on its back.

Big fish, big problems.

You know how some things are a lot more difficult than they seem? In an attempt to speed up deployments for this blog, I wanted to look into building a Docker image with Hakyll and all the required build dependencies available. To be able to do this effectively, I figured I’d need to have the ability to work with Docker locally. Turns out this was one of those things.

The goal
Enable Docker virtualisation and development on a NixOS system
Challenges
The root partition—which is where Docker stores data—keeps running out of space

In theory, it’s simple:

  1. enable Docker
  2. configure it to store data somewhere that is not /var/lib/docker

In practice, it turns out to be a bit more difficult than expected, but don’t worry: We’ll figure it out together!

Enabling

According to the NixOS manual and the Wiki article on Docker, there’s really not much to it: To enable Docker, all you need to do is update your configuration.nix to include the line virtualisation.docker.enable = true;

As pure and simple as Nix should be.

According to the manual: “This option enables docker, a daemon that manages linux containers. Users in the “docker” group can interact with the daemon (e.g. to start or stop containers) using the docker command line tool.”

Read that last line carefully: “Users in the “docker” group can interact with the daemon […]”. Yup. That means we need to make sure our user is in the correct group, which in configuration.nix means adding "docker" to users.users.<name>.extraGroups.

What isn’t immediately obvious is this: You must log out and back in before this setting change takes effect. Let’s repeat that to make sure we understand:

You must log out and back in before this setting change takes effect.

From what I can tell, this goes for any change to a user’s groups, but it isn’t particularly well documented anywhere. (Psst: I am not the only one to have run into this.)
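One way to check whether the current session has picked up the new group is to look at the output of id. This is a small sketch, and in_group is a made-up helper name rather than anything NixOS or Docker ships.

```shell
# Succeeds if the current login session already carries the given group.
# `id -nG` lists the group names of the running process, one per word.
in_group() {
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

if in_group docker; then
  echo "docker group is active in this session"
else
  echo "docker group not active yet: log out and back in"
fi
```

If the second message shows up right after a rebuild, the configuration is probably fine and only the session is stale.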

But wait; there’s more! This is only vaguely referenced in the manual (“using the docker command line tool”), but to have access to the Docker CLI, you’re going to have to install Docker (pkgs.docker) for your user, either by putting it in configuration.nix’s systemPackages or by using a solution such as this.

And that’s it. If all you wanted to do was set up docker to run with the default configuration, you’re done now. Congrats! Have a donut. You’ve earned it.

The space race

Ah, yes, disk space … We’ve got Docker in place now, and, assuming the group change setting has taken effect, we can start playing with it. That’s what I did. For a day or so. As per the usual NixOS song and dance, I wanted to change some configuration settings, so I tried to rebuild my system and got this fateful message:

Now, this isn’t anything new. I’ve realized since setting up the OS, that I should have probably allocated more space for the root partition (someone once told me that NixOS “trades disk space for sanity”). “Oh, well,” I thought. “Guess I have to delete some old generations again.” So I ran the garbage collector. This usually frees up about 7–10GB of space, but now it was hardly removing two! I tried all the tricks that I knew of, but nothing seemed to make a difference. And then the thought struck me: “Docker is installed as a system service. That means it probably stores images system-wide too!”. And indeed, after looking through the ‘docks’ (har har), I found that the default place Docker stores data is in /var/lib/docker.

So I killed all of my containers, deleted all of my images, and lo and behold: My root partition suddenly had nearly 10GB more free space! Superb!
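To see where the space goes before and after a clean-up like that, a few standard commands are enough. The sketch assumes a POSIX df and awk; free_kb is a hypothetical helper, and the docker call is skipped when the CLI is not installed.

```shell
# Print the free space (in KiB) of the filesystem that holds a path.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

echo "free on /: $(free_kb /) KiB"

# Docker's own summary of the space used by images, containers and volumes.
if command -v docker >/dev/null 2>&1; then
  docker system df || echo "docker daemon not reachable"
fi
```

Running it once before and once after deleting images gives a rough idea of how much of the root partition Docker was holding on to.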

The next step, then, would be to figure out how to store the data somewhere else. Luckily, the docs (NixOS and Docker) are quite clear on this point: For your Docker configuration, you can specify an option, --data-root, and have the data stored there instead. In general, I prefer not to mess around with where things are stored too much, but in some cases it makes life easier (until I get around to repartitioning my drive, anyway), so I decided I’d put it under /home/docker for now. This is easily done like this:

This setting means that my /home partition carries some extra data, but it’s got more than enough space to deal with it.

Putting it into practice

Now, having experienced first-hand how space-hungry Docker can be, and having read through the documentation, I found that there are other options that might come in handy. For now, I decided to have the system aggressively auto-prune on a weekly basis. This should keep me from running into space issues any time soon, and if it gets annoying I can always change the settings.

At the end of this little adventure, the resulting configuration.nix should look something like this:

In summary, these are the steps needed:

  1. Enable virtualisation.docker
  2. Make sure your user is in the "docker" group. (Log out and back in!)
  3. Install Docker for your user
  4. (Optional) If, like me, you have issues with space, change the data-root to somewhere else, such as a different partition or an external drive.

So there you have it, folks! It really is quite simple … once you figure out all the tricky parts.


Hello, World!

Experiences from setting up my first blog.

As a first little post and an introduction to the blog, I thought it might be cute to have a little overview of how it’s made. This has been my first time setting up most of the surrounding architecture and I have learned more than a few things in the process (though there is, of course, lots left to learn). This post won’t go into great detail about any particular points, but will serve as more of an overview of what ‘the system’ looks like at the time of writing.

That is to say: don’t expect any brilliant insights from this post, but read on if you’re interested in how I’ve organized things.

Hakyll

Let’s start with the most important part of this whole thing, shall we? The truth is that without Hakyll, this blog would not be up and running now. I have been wanting to get into writing for a bit, but I have been lacking a platform that met my criteria:

  • I wanted to be able to write my posts using Emacs’ Org mode
  • I wanted to be able to put them in a version controlled repo and have the blog auto-update whenever I pushed a new update.
  • I wanted it to be low-ish effort—at the very least I didn’t want to mess with servers and so on—but I also wanted to be able to host it myself, so that I would not be dependent on some other platform.

Maybe I didn’t look hard enough, but I couldn’t really find any alternatives that would let me tick all these boxes. The closest was GitHub Pages, which would have been fine, except that I would have had to restrict myself to Markdown. But then a friend of mine told me about Hakyll, and I found this blog post on using it with Org mode. That was all I needed to set out.

Notable changes and additions

While Hakyll comes with a lot of great features out of the box, I found that I wanted just a little bit more out of it, and these things had to be configured:

Feeds (RSS and Atom)
I’ve never used an RSS reader before, but in setting up this project I wanted to find out how the format works and ended up getting addicted myself. If you’ve not tried it out: you might never wanna go back. Overall, this was quite easy, though it did require some tricksy loading of posts and applying different URL treatment.
Sitemap
Another thing I’ve never looked twice at. This project provided me a reason to understand what it is, why you might want it, and how you could set it up. Oh, and how you’d link it in your robots.txt. Pretty simple.
Pagination
For some reason, it was quite important for me to be able to have a paginated stream of all blog posts (yup, that’s the blog page). Luckily, it was fairly easy to set up, though it did require me to change how I store post drafts.

Where I want to go

I think I’ve got it mostly where I want it now, but I’m still not happy with how I store drafts. At the moment, they’re in a drafts directory under the posts directory. In an ideal world I’d be able to store them along with all other posts, with the only way to tell whether they’ve been published being whether they have a published tag in their metadata. However, this causes some issues with Hakyll’s build process when it tries to process unpublished posts for different parts of the site. It should be as simple as loading the correct snapshot based on a filtered group of posts, but I haven’t quite gotten around to that yet.

Netlify

Overall I’ve been very happy with what Netlify has offered me thus far. Their documentation is great, their CLI tool is pretty nice, and they seem to have their stuff together.

External domain providers

Stupidly, I bought the domain from a different domain provider and then went to Netlify later. If I’d had the foresight to check with Netlify first, I wouldn’t have had to deal with setting up name servers and so on. It hasn’t been that big a deal, and at least now I know you can, but it would just have been easier if I’d gone all in on Netlify, I think. Maybe next time.

It might be worth pointing out that by changing DNS records to have it all managed by Netlify, I got HTTPS support for ‘free’ (when the alternative would have been to pay for it, had I stayed with my domain provider).

Haskell support (or lack thereof)

However, even though Netlify does offer a pretty solid service, their CI/CD system does not, at the time of writing, offer Haskell support[1]. This means that for now, I’m stuck using an external service to build and then publish to their systems. While this isn’t too bad, it does mean that I miss out on certain benefits that using their integrated system gives you, including minification and dynamic image serving.

GitLab

As Netlify doesn’t support Haskell static site generators out of the box, I have to work around them and use an external CI/CD system instead. This is actually a big part of the reason why I chose GitLab for hosting this site: I’ve used their systems a fair bit lately and find them rather nice to work with.

For my CI/CD setup, there were a couple of rules that I wanted to have in place:

  1. If I push an update to anything that would impact the site (styles, posts, templates), the system should deploy.
  2. If I push changes that include new posts, the system should add the appropriate timestamps to the relevant posts before deploying. The timestamp additions should be committed and pushed back to the main repo before deploying.
  3. I should be able to manually deploy, either by going through the GitLab UI, or by pushing a commit that includes a specific tag ([deploy]).
  4. I should be able to say that a commit should skip CI, even if it would normally trigger it, using a tag ([noci]).
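Rules 1, 3 and 4 boil down to a small decision on the commit message and on whether site files changed. The sketch below is purely illustrative and not the actual pipeline configuration: decide is a hypothetical function, the yes/no flag stands in for a real check of changed paths, and letting [noci] win over [deploy] is an assumption.

```shell
# decide MESSAGE SITE_CHANGED -> prints "deploy" or "skip".
# SITE_CHANGED is "yes" when styles, posts or templates were touched.
decide() {
  message=$1
  site_changed=$2
  case "$message" in
    *'[noci]'*)   echo skip;   return ;;  # rule 4: explicit opt-out (assumed to win)
    *'[deploy]'*) echo deploy; return ;;  # rule 3: forced deploy
  esac
  # rule 1: deploy only when something that affects the site changed
  if [ "$site_changed" = yes ]; then
    echo deploy
  else
    echo skip
  fi
}

decide "Add new post" yes          # prints: deploy
decide "Fix typo [noci]" yes       # prints: skip
decide "Bump tooling [deploy]" no  # prints: deploy
```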

Most of this sounds pretty basic, but it was a surprising amount of effort to get things set up right. The first big snag was that GitLab doesn’t have built-in support for pushing changes back to the repo from the CI pipeline. There are ways to do it, but it was surprisingly convoluted for something I would expect to be reasonably common.

The second thing, and this is probably the most pressing matter relating to this site at the moment, is that building the site for deployment takes almost an hour. From what I can tell, this is because stack (the Haskell build tool) has to download and compile Hakyll and its dependencies before compiling the site. However, there are solutions to this: The most promising one I found is the one documented in this blog post by Saksham Sharma. I did try using the provided Docker image, but due to an issue with locale info being unset in pure Nix shells[2] (I suspect), the site won’t build properly. I expect I’ll be looking into this pretty soon.

Python helper scripts

As any good software developer, I want to automate as much of the process as possible, and the first obstacle I ran into was related to timestamping posts, both for first publish and for subsequent updates. Beyond that, I also wanted an easy way to move files from the drafts directory to the published posts, adding the required data while doing so.

add_timestamps.py

A pretty simple script meant to be run in CI. It goes through all the posts with ‘published’ tags that have changed since the last push and adds the current timestamp. If a file has no value for the ‘published’ tag, the current time is added; otherwise, the script adds or updates the ‘modified’ tag with the current time.

publish.py

To automate moving files between directories and adding the ‘published’ tag to drafts, I wrote a simple script that does just that. Super easy. This led me to my first real revelation using nix-shell (still just scratching the surface here), where I could package this script and use it as a command anywhere! I was super excited until I failed at packaging it correctly due to the dependencies on hakyll.py … This hasn’t been resolved yet, but I’ll figure it out at some point!

hakyll.py

These are just some shared functions that deal specifically with Hakyll and my specific system. Most notably, they let you get metadata as a dict, update metadata values, sort tags, and decide whether a post is new or updated and whether it is published at all. Also my first time using type annotations in Python (I don’t actually know whether I did it right, but the system isn’t complaining so far?), so that’s a win.

In closing

In short, this has been a very enlightening and rewarding experience, and while I’m certainly glad to have gone through it, I am looking forward to not having to worry about it for a while (hah, as if).

Now, let’s get writing!

Footnotes


  1. GitHub issue: Add support for Haskell

  2. GitHub issue: “hakyll can’t handle unicode?”
