Gitlab CI: Lessons learned so far

Sean Löfgren

5 April 2020

Since day one, 9fin has been on the Gitlab platform. Over time, we’ve built a workflow that helps our engineers ship product faster than ever before. I’ve got my DevOps hat on today and so I would like to share with you the evolution of our continuous integration pipeline and the little quick wins you can make in terms of speed and cost.

Before I continue, I’ll let Martin Fowler describe Continuous Integration (CI):

Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.

To clarify, I’m not going to dive into the git workflow model that we use because that in itself is a whole other blog post. For now, let’s just say that we merge to develop and release updates quite frequently. We want to focus on the “as quickly as possible” aspect. If you’re interested in our workflow, you can check out a talk we did at Gitlab Commit in 2019.

I’m going to touch upon the Gitlab CI pipeline configuration, Gitlab runner(s), and a little bit of Docker. There is an abundance of blog posts and documentation on all three.

Starting from scratch

Our story started with a simple .gitlab-ci.yml configuration for one of our internal services, made up of just two jobs.

If you’re using Gitlab’s hosted service (we still do), these two jobs would, by default, be executed sequentially by Gitlab’s runners. A runner is a process that executes your jobs and returns the results back to Gitlab. We started off by using the shared runners Gitlab hosts and manages for everyone.

This ran smoothly for us at the time. We did run into occasional delays in job execution, but Gitlab was very transparent with their Prometheus monitoring dashboards, so we could always see when things were getting backed up.

Hosting a Runner

Whilst fetching the repo in the runner was fine, we were slightly concerned with holding deployment credentials on Gitlab and then having them pass those credentials over. We erred on the side of caution and started hosting our own runners.

With no experience of Gitlab’s runner binary and to avoid blocking the rest of the engineering team, we quickly spun up an on-demand t2.medium EC2 instance in our own cloud infrastructure and set it up with the docker executor. This basically means that every job execution is isolated in a container. Gitlab’s documentation compares the pros and cons of the executor types.

The EC2 instance type gave enough bang for the buck and our projects were quite small and few at the time. We hoped that this hosted runner would relieve some of our security concerns and the delayed job executions that we were experiencing.

Once we established a stable Gitlab runner, we started enabling more projects, adding build/deploy jobs, adding environment scoped variables, the list goes on. We made sure to get the most out of Gitlab’s CI capability.

Scaling: why have one problem when you can have many!

Unfortunately, we started building up a bottleneck on our side. We were hiring more engineers, creating more projects, and the codebases were getting much bigger. With Gitlab’s shared runners, we had been able to parallelize jobs across multiple runners. Now we had only one EC2 instance, which could execute multiple jobs in parallel but was limited by its 2 virtual CPUs and 4 GiB of memory. We went for the quick solution and spun up an exact replica. We now had two on-demand instances running 24 hours a day, with the majority of those hours wasted since our engineering team is currently all based in London.
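As a rough back-of-the-envelope illustration of that waste, the sketch below estimates the idle share of always-on instances. The working pattern (10 hours a day, 5 days a week) is an assumption for illustration only, not a measured figure, and it generously treats both runners as fully busy during those hours.

```python
# Rough estimate of idle time for always-on CI runners.
# ASSUMPTION: jobs are only triggered during a 10-hour working day,
# Monday to Friday. These numbers are illustrative, not measured.

HOURS_PER_WEEK = 24 * 7


def idle_fraction(instances: int, busy_hours_per_day: float, busy_days_per_week: int) -> float:
    """Return the share of paid instance-hours that fall outside working hours."""
    paid_hours = instances * HOURS_PER_WEEK
    busy_hours = instances * busy_hours_per_day * busy_days_per_week
    return 1 - busy_hours / paid_hours


if __name__ == "__main__":
    share = idle_fraction(instances=2, busy_hours_per_day=10, busy_days_per_week=5)
    print(f"Idle share of paid hours: {share:.0%}")
```

Under these assumptions roughly 70% of the paid hours fall outside working time, which is the kind of gap that on-demand scaling is meant to close.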

Along the way, we were caught by a few issues of our own making. One day, all of our CI jobs started failing. The storage volumes for our two Gitlab runners were filling up due to all the pulled/built docker images. A cron job to clear these dangling/unused images and a volume size increase (storage is cheap) was our solution.
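A minimal sketch of such a clean-up script, intended to be run from cron on the runner host. The one-week retention window, the function name, and the `--run` switch are assumptions made for illustration; `docker image prune` with `--all`, `--force`, and an `until` filter is the standard Docker CLI command for removing unused images.

```python
#!/usr/bin/env python3
"""Remove unused Docker images older than a cut-off (intended for cron)."""
import subprocess
import sys
from typing import List


def prune_command(max_age_hours: int = 168) -> List[str]:
    """Build the docker CLI call that deletes unused images older than max_age_hours."""
    return [
        "docker", "image", "prune",
        "--all",    # also remove unused tagged images, not just dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",
    ]


if __name__ == "__main__":
    command = prune_command()
    if "--run" in sys.argv[1:]:
        # check=True makes the script exit non-zero if docker fails,
        # so cron can report the error.
        subprocess.run(command, check=True)
    else:
        # Default to a dry run so the script is safe to try out.
        print("dry run:", " ".join(command))
```

Defaulting to a dry run keeps the script harmless until the cron entry passes `--run` explicitly.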

Here at 9fin, most of our projects are built with a Python/React stack. We also maintain some Golang/Rust projects. The problem we ran into was that our compilation steps for our React and Rust codebases were taking longer and longer. As each codebase was growing in size with more and more dependencies, build times went up.

If we were to increase our runners’ hardware specs, it would allow our Rust compiler to parallelize compilation across dependencies. On the JavaScript side, multiple CPUs also improve build performance for bigger bundles (we use Webpack to bundle our React projects). Our t2.medium instances weren’t exactly taking advantage of multi-CPU compilation, so we upgraded to t2.xlarge (4 vCPUs / 16 GiB) instances.

Where we are now

As you may have noticed, we were always reacting to issues with solutions that kept us afloat but weren’t very cost effective. These two EC2 machines would continue to scrape by until the next problem hit us. The easy solution would be to continue spinning up beefy on-demand instances but we’re a startup, not Amazon.

With everything we’ve learned so far, we set about building a system that would be on-demand and scale automatically, so that we always maximised utilisation but minimised costs. Auto-scaling spot EC2 instances fit the bill and Gitlab documents this process very well.

We went ahead and dropped our t2.xlarge instance for a t2.micro to act as the manager of this process. This manager instance is responsible for launching and auto-scaling spot EC2 instances that will handle job execution. The manager has the runner binary installed and uses Docker Machine with the amazonec2 driver to manage a fleet of spot instances based on a peak/off-peak schedule.

A benefit of this manager approach is that we can create different EC2 and autoscaling configurations. Each configuration can be assigned tag(s).

For example, our heavy compilation jobs get tagged with beastmode and the manager makes sure to spin up a more powerful c5 instance. Increasing spot count or adding more configurations is quite easy. Some downsides with this solution are that we need to keep an eye on available spot capacity (we’ve added some alarms) and we lose that docker build cache on new EC2 instances.

On a side note, if Gitlab is still listening, it would be nice to enable/disable group and shared runners across projects on the API level. Please.

Lessons learned

Enough with our story! Tell me what I need to know! If you’re in the position of setting up a CI workflow (with/without Gitlab), here is what you should consider:

Take advantage of those free CI minutes where available. Gitlab provides 2,000 minutes of shared CI runner usage for free. Host a single on-demand Gitlab runner if you have to. You want to iron out all the quirks in your pipeline and also gain more understanding of all the things your CI server is capable of doing (e.g. job scheduling, caching etc.). There is an issue for local testing of the Gitlab pipeline that has been dragging along in the backlog.
Cache job dependencies where possible. Every CI service will have some caching mechanism in place. With Gitlab runners, the cache location is local unless specified. If you use multiple runners, use a distributed cache (e.g. Amazon S3). Gitlab’s documentation covers caching best practices.

One thing to remember with Gitlab caching is that the runner will only cache things inside the project directory. We explicitly set our Rust dependency manager’s location inside the project so that the registry index can be cached.
Keep your pipeline configuration files clean! It’s alright in the beginning when you’ve only got a few steps in the pipeline but when it starts getting out of control, debugging becomes a pain! If you’re using YAML, take advantage of anchors and aliases. If you need inspiration, check out Gitlab’s own pipeline configuration.
If you use docker, take advantage of docker caching and keep in mind the images you use for your jobs. The time it takes to pull a fat image rather than a slim one accumulates. If you notice you’re installing the same dependencies over and over again in your job step, consider building a custom docker image and hosting it on a registry. On docker caching: if you’re spinning up new docker machines, you’re going to have an empty build cache. Unless you want a pure no-cache clean build for production, use the --cache-from flag and point it to the last image you pushed up. One thing that will throw people off is multi-stage docker builds. The flag will actually not cache the first stage (which is usually the computationally expensive bit). You’re going to need to use BuildKit with at least docker 19.03.
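The --cache-from advice can be sketched as a small helper that assembles the build invocation. The image name and the function name are hypothetical; `DOCKER_BUILDKIT=1`, `--cache-from`, and the `BUILDKIT_INLINE_CACHE=1` build argument are standard Docker mechanisms for re-using layers from a previously pushed image. Whether intermediate stages of a multi-stage build are re-used depends on how the cache was exported, so treat this as a starting point rather than a complete fix.

```python
import os
from typing import Dict, List, Tuple


def cached_build(image: str, context: str = ".") -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for a BuildKit build that seeds its cache from `image`."""
    # Opt in to BuildKit for this one invocation only.
    env = dict(os.environ, DOCKER_BUILDKIT="1")
    argv = [
        "docker", "build",
        "--cache-from", image,                     # re-use layers from the last pushed image
        "--build-arg", "BUILDKIT_INLINE_CACHE=1",  # embed cache metadata in the new image
        "--tag", image,
        context,
    ]
    return argv, env


if __name__ == "__main__":
    # Hypothetical image name; printing instead of running keeps the sketch side-effect free.
    argv, _env = cached_build("registry.example.com/app:latest")
    print(" ".join(argv))
```

To run the build for real, pass the result to `subprocess.run(argv, env=env, check=True)` and push the image afterwards so the next job on a fresh machine can use it as its cache source.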

Future tasks

Our process is working smoothly at the moment. Having said that, there are plenty of things we can still implement/investigate. This includes:

Configuring the docker engine to use external credential helpers. We use Amazon’s Elastic Container Registry for our application images. For every build and deploy step that we have, we currently pull tokens (12hr lifetime) from ECR and pass them on to the docker daemon (via the login command). This could allow token leakage (internally) via process lists or shell history. For ECR, we can drop the login and token-refresh logic in favour of Amazon’s ECR Docker Credential Helper.
When we spin up a new EC2 spot instance, we start off with a fresh local docker registry. That means any jobs that run on that new machine have to pull in any required images from the docker registries, which just increases job time. There are a few possible solutions for us to investigate.
Explore using AWS Fargate instead of EC2 Spot instances. This isn’t ready yet, and it would require us to modify our runner manager to use a custom executor and driver to communicate with Fargate. (Aside: see this other post on using Fargate to power vanishing application builds.)
At the end of the day, how can we measure any of these changes without any form of observability? We need to build a process to pull in Gitlab job wait and execution times via the Jobs API. Whilst we can measure CPU usage of EC2 instances by default via Cloudwatch, we will need to add the Cloudwatch agent onto the machines to measure other metrics such as memory utilization. For cost, we can take advantage of AWS tags.
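As a starting point for the job timing idea, here is a minimal sketch that derives wait and execution times from the `created_at`, `started_at`, and `finished_at` fields the Jobs API returns for each job. The function names are hypothetical, pagination is left out, and "wait" is approximated as `started_at` minus `created_at`, which also includes time spent waiting for earlier pipeline stages.

```python
import json
import urllib.request
from datetime import datetime
from typing import Dict, List, Optional

GITLAB_API = "https://gitlab.com/api/v4"


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp from the API; None stays None."""
    if value is None:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def job_timings(job: Dict) -> Optional[Dict[str, float]]:
    """Return wait and run time in seconds, or None if the job has not finished."""
    created = parse_ts(job.get("created_at"))
    started = parse_ts(job.get("started_at"))
    finished = parse_ts(job.get("finished_at"))
    if created is None or started is None or finished is None:
        return None
    return {
        "wait_seconds": (started - created).total_seconds(),
        "run_seconds": (finished - started).total_seconds(),
    }


def fetch_jobs(project_id: int, token: str) -> List[Dict]:
    """Fetch the first page of jobs for a project (no pagination in this sketch)."""
    request = urllib.request.Request(
        f"{GITLAB_API}/projects/{project_id}/jobs",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Feeding `job_timings` over the result of `fetch_jobs` on a schedule and pushing the numbers into whatever metrics store is already in place would give a baseline to compare runner changes against.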

We would love to hear your comments and critiques. We are learning as we go so if you’ve got some ideas/pointers on how we can improve our pipeline, please send them our way.