The ArgoCD version in this article is 2.8.x.
In ArgoCD Monorepo Performance Optimization Journey, we covered the first part of our ArgoCD monorepo performance optimization: reducing ArgoCD load by optimizing the Git repo and disabling Application Auto Sync, triggering application syncs through a GitHub Actions workflow instead.
Although those optimizations keep ArgoCD free of performance problems in 99% of daily use, some extreme scenarios still cause excessively long sync times (normally a sync takes about 25s at p90, but occasionally it takes 3min at p99).
This article continues with further strategies to improve the speed and stability of ArgoCD sync and to reduce resource usage during sync, thereby enhancing ArgoCD's ability to sync many Applications concurrently.
The following are the reasons why ArgoCD sync suddenly becomes very slow, together with our optimization strategies:
Reasons
- When syncing an App, the Kubernetes cluster cache in the ArgoCD application-controller may have expired. ArgoCD must refresh the cluster cache before it can sync the app. If the cluster contains too many resources, the cache refresh takes too long and the sync becomes excessively slow.
- If the Application uses a sidecar plugin, the ArgoCD repo-server compresses the entire repo into a tarball and transfers it to the sidecar plugin on every sync. If the repo is too large, compression and decompression take too long and consume a large amount of CPU and memory, leaving the repo-server without enough resources to process multiple sync requests at the same time.
- If a repo-server has not received a request for a long time, its local Git copy may fall far behind, so when a sync request finally arrives it spends a long time on git fetch (how long depends on how frequently the repo is updated; a monorepo can receive a large number of commits per day), increasing the overall sync time.
Optimization Strategies
1. Optimize the cluster cache strategy in the ArgoCD application-controller
The ArgoCD application-controller provides the environment variable ARGOCD_CLUSTER_CACHE_RESYNC_DURATION to control how long the cluster cache is valid. By default all cluster caches expire after 12h; the cache for the cluster an Application lives in is then not rebuilt until a new sync or refresh request arrives, at which point ArgoCD pulls all resources in the Kubernetes cluster along with cluster metadata, etc. This process is synchronous: while the cluster cache is being refreshed, every incoming app sync request waits until the refresh completes.
If the cluster contains a very large number of resources, say 200k, refreshing the cluster cache can take 3~5min.
In that case, you can increase the value of ARGOCD_CLUSTER_CACHE_RESYNC_DURATION to lower the refresh frequency (or set it to 0 so the cluster cache never expires), which reduces how often a sync has to wait for a cache refresh.
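As a minimal sketch, assuming the standard installation where the application-controller runs as the argocd-application-controller StatefulSet, the variable can be set directly on the controller container; the 24h value here is only an illustration, tune it (or use "0") for your own cluster:

```yaml
# Excerpt of the argocd-application-controller StatefulSet (standard install).
# The 24h value is an example only; choose a duration that fits your cluster,
# or "0" to stop the cluster cache from expiring, as described above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CLUSTER_CACHE_RESYNC_DURATION
              value: "24h"
```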
2. Optimize the sidecar plugin compression strategy in the ArgoCD repo-server
This is actually the main factor affecting ArgoCD performance. If there are only a few repo-server replicas, multiple sync requests can land on the same repo-server pod and quickly saturate its CPU. How severe this is depends on the size of the Git repo; if the repo is small, the problem is barely noticeable. As mentioned in the previous article, our repo is currently 150Mi, each tarball takes 3s, and a single request fully occupies the pod's CPU (4 vCPU).
This can of course be alleviated by adding more repo-server replicas, but load imbalance is still possible. Moreover, adding replicas does not help when a team batch-syncs every service in the cluster: once enough apps are being synced at the same time, all repo-server pods are inevitably busy and subsequent sync requests are blocked.
Unfortunately, ArgoCD does not provide an environment variable to tune the sidecar plugin's compression strategy, so we had to modify the ArgoCD source code.
ArgoCD has long provided the argocd.argoproj.io/manifest-generate-paths annotation, which can be set on an ArgoCD Application. It was originally intended for handling webhook requests: if ArgoCD is connected to a GitHub webhook, GitHub sends a push event to the argocd-server whenever a user pushes code, and ArgoCD checks whether the changed files fall under the argocd.argoproj.io/manifest-generate-paths paths. If they do, ArgoCD triggers a refresh of the Application; otherwise it ignores the event.
The annotation value typically covers the files needed to render all Kubernetes resources in the Application.
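For illustration, here is a hypothetical Application using the annotation; the repo URL, paths, and names are examples only. "." refers to the Application's own spec.source.path, a "/"-prefixed path is relative to the repository root, and multiple paths are separated by semicolons:

```yaml
# Hypothetical Application; repo URL, paths, and names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-app
  namespace: argocd
  annotations:
    # Only changes under these paths affect this Application's manifests:
    # "." = the Application's spec.source.path, "/charts/common" = a path
    # relative to the repository root; entries are separated by ";".
    argocd.argoproj.io/manifest-generate-paths: .;/charts/common
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/monorepo.git
    targetRevision: main
    path: services/demo-app
  destination:
    server: https://kubernetes.default.svc
    namespace: demo-app
```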
So we reused this annotation and modified the ArgoCD code: when an Application has argocd.argoproj.io/manifest-generate-paths set, the repo-server no longer compresses the entire repo, but only compresses and transmits the files under the annotated paths.
After this optimization, our ArgoCD repo-server CPU usage dropped by about 95%, tarball creation went from 3s -> 400ms, and sync speed also improved significantly (20s -> 3s).
3. Optimize the git fetch strategy in the ArgoCD repo-server
From ArgoCD's Grafana dashboard, we noticed that the git fetch metrics were occasionally abnormal: fetching code sometimes took more than 20s (the repo changes frequently and has 450k+ commits), while it normally takes less than 1s.
We therefore suspected that the repo-server pod being hit had not received a request for a long time, so its local Git copy had fallen far behind and the next sync request needed a long git fetch to catch up.
So we also made some changes to the ArgoCD repo-server code.
When the Git repo in a repo-server has not been updated for a long time (more than 5min), we proactively trigger a git fetch so that the local copy stays up to date, keeping subsequent git fetches fast.
4. Optimize the git gc strategy in the ArgoCD repo-server
Because the monorepo changes frequently, the Git directory in the repo-server can accumulate many loose objects and pack files, which may trigger Git's automatic gc. We do not want git gc to kick in during normal use and make the repo-server unavailable, so we rebuilt the ArgoCD repo-server image and added a .gitconfig that disables automatic gc.
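We baked the file into the image; as a rough alternative sketch, the same settings could also be delivered via a ConfigMap mounted as the repo-server user's .gitconfig (the /home/argocd home directory is an assumption based on the upstream image):

```yaml
# Sketch of the gitconfig that turns off Git's automatic gc, delivered here
# as a ConfigMap instead of an image rebuild. Mount it at
# /home/argocd/.gitconfig on the argocd-repo-server Deployment (path assumed).
apiVersion: v1
kind: ConfigMap
metadata:
  name: repo-server-gitconfig
  namespace: argocd
data:
  .gitconfig: |
    [gc]
        # Disable gc triggered by the number of loose objects / pack files.
        auto = 0
        autoPackLimit = 0
```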
We then modified the ArgoCD code to add a gRPC method that triggers git gc on demand. While gc runs, the repo-server deliberately fails its readinessProbe, waits for the pod to be removed from the Kubernetes Service endpoints, executes git gc, and rejoins the Service once gc finishes.
Finally, a Kubernetes CronJob runs the git gc task once a day: it lists all repo-server pods and calls the gRPC method against each pod IP in turn.
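A minimal sketch of such a CronJob follows; the trigger image, its arguments, and the service account are hypothetical placeholders rather than anything shipped with ArgoCD:

```yaml
# Hypothetical CronJob; image, args, and service account are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: repo-server-git-gc
  namespace: argocd
spec:
  schedule: "0 3 * * *"        # once a day, during off-peak hours
  concurrencyPolicy: Forbid    # never run two gc rounds at the same time
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: repo-server-gc   # needs RBAC to list repo-server pods
          restartPolicy: Never
          containers:
            - name: trigger-gc
              image: example.com/argocd-git-gc-trigger:latest   # hypothetical image
              # The (hypothetical) tool lists argocd-repo-server pods and calls the
              # custom gc gRPC method on each pod IP one at a time, so that the
              # other replicas stay in the Service while one pod is gc'ing.
              args:
                - --namespace=argocd
                - --selector=app.kubernetes.io/name=argocd-repo-server
```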
Summary
With the above strategies, we further improved ArgoCD's ability to handle a large number of concurrent app syncs and increased the speed of each sync, bringing ArgoCD in line with our performance and stability requirements.