GitOps in Production: Lessons from 12 Months of ArgoCD

GitOps promises a single source of truth for deployments. After running ArgoCD in production across multiple clusters, here's what worked, what broke, and what we'd do differently.

The promise vs the reality

GitOps is a compelling idea: your Git repository is the single source of truth for everything deployed to your cluster. Any change to the system goes through a pull request. Drift is detected and corrected automatically. Rollback is a revert commit.

After running ArgoCD in production across multiple clusters and teams for twelve months, most of that promise holds up. But there are failure modes the documentation doesn't warn you about, and operational patterns that took us longer to settle on than they should have.

Here's what we actually learned.

What worked exactly as advertised

Drift detection is genuinely valuable. ArgoCD continuously reconciles the desired state in Git against the actual state in the cluster. When something drifts — an engineer manually patched a deployment, a node was replaced with different labels, a ConfigMap was edited in-cluster — ArgoCD flags it immediately. Before GitOps, we discovered drift during incidents. Now we discover it before it matters.

Rollback is fast and low-stress. Reverting a deployment is a Git revert and a sync. No one is scrambling to remember what the previous image tag was. No rollback runbooks needed. This alone justifies the complexity of the setup.

Audit trail is automatic. Every change to every workload is a commit with an author, a timestamp, and a diff. When a post-incident review asks "what changed in the hour before this broke," the answer is in the Git log.

What broke, and why

Secret management was the first friction point. ArgoCD syncs manifests from Git, but you cannot commit plaintext secrets to Git and keep any meaningful security posture. We went through three approaches before settling on External Secrets Operator pulling from AWS Secrets Manager. Sealed Secrets works, but the key rotation story is painful. SOPS is fine for small teams but doesn't scale well across multiple clusters with different key material.

If you are setting up ArgoCD today: decide on secret management before you write your first Application manifest. It affects your entire repo structure.
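To make the External Secrets Operator approach concrete, here is a minimal sketch of what lives in Git instead of the secret itself. Every name in it (the store `aws-secrets-manager`, the `payments` namespace, the AWS key `prod/payments/db`) is hypothetical, and the `apiVersion` is an assumption that depends on the operator release you run (newer releases also serve `external-secrets.io/v1`).

```yaml
# Sketch only: all names and paths below are illustrative, not our real config.
apiVersion: external-secrets.io/v1beta1   # assumption: check the version your operator serves
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h              # how often the operator re-reads the value from AWS
  secretStoreRef:
    kind: ClusterSecretStore       # assumes a store holding AWS credentials is defined separately
    name: aws-secrets-manager
  target:
    name: payments-db              # name of the Kubernetes Secret the operator creates
  data:
    - secretKey: password          # key inside the resulting Kubernetes Secret
      remoteRef:
        key: prod/payments/db      # secret name in AWS Secrets Manager
        property: password         # JSON field within that secret
```

Git holds only the reference; the secret value stays in AWS Secrets Manager. This is also why the choice shapes the repo: each cluster needs its own store definition alongside the workloads that reference it.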

App-of-Apps got complicated faster than expected. The App-of-Apps pattern — using an ArgoCD Application to manage other ArgoCD Applications — is the standard way to bootstrap a cluster. It works, but the dependency ordering is implicit and the failure modes when something in the bootstrap chain breaks are not obvious. We added explicit sync waves (using the `argocd.argoproj.io/sync-wave` annotation) to every app in our bootstrap, which helped significantly.
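As an illustration of explicit sync waves, the sketch below shows two child Applications that a parent bootstrap Application might manage. The app names, repo URL, and paths are hypothetical; only the annotation key comes from the setup described above.

```yaml
# Sketch only: app names, repoURL, and paths are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # lower waves are applied first; the value must be a quoted string
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/bootstrap.git
    targetRevision: main
    path: apps/cert-manager
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"    # applied after wave -1
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/bootstrap.git
    targetRevision: main
    path: apps/ingress-nginx
  destination:
    server: https://kubernetes.default.svc
    namespace: ingress-nginx
```

One caveat worth checking against the ArgoCD documentation for your version: a wave only waits for the previous wave's resources to be healthy, and recent ArgoCD releases may need a health check for the Application resource configured explicitly before a parent will wait on its children.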

Notifications and alerting needed deliberate setup. Out of the box, ArgoCD shows sync status in the UI. It does not proactively tell anyone when a sync fails in production at 3am. ArgoCD Notifications (now part of the main project) is the right tool, but it requires configuration — Slack webhooks, templates, triggers. Budget time for this before you go to production.
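A minimal sketch of that configuration is below, assuming the Slack service with a bot token (a plain webhook service is an alternative). The template name `app-sync-failed`, the trigger name `on-sync-failed`, and the channel `prod-deploy-alerts` are hypothetical choices for illustration.

```yaml
# Sketch only: template, trigger, and channel names are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # $slack-token is resolved from the argocd-notifications-secret Secret
  service.slack: |
    token: $slack-token
  # What the message says
  template.app-sync-failed: |
    message: |
      Sync failed for {{.app.metadata.name}}: {{.app.status.operationState.message}}
  # When the message is sent
  trigger.on-sync-failed: |
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]
---
# Fragment of an Application's metadata: subscribes that app to the trigger above
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-sync-failed.slack: prod-deploy-alerts
```

The subscription is a per-Application annotation, so it is easy to forget on new apps. Setting it in whatever template generates your Applications avoids silent gaps.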

Cluster upgrades require coordination. Each Kubernetes release can remove API versions that were previously deprecated. If your manifests still reference a removed API, ArgoCD will report sync errors after the upgrade. We now run `kubent` (Kube No Trouble) against every cluster before a Kubernetes version bump. Fifteen minutes of scanning saves hours of post-upgrade cleanup.

Patterns we settled on

One Git repo per environment tier, not per cluster. We tried a single repo with environment-specific overlays (Kustomize), and we tried per-cluster repos. The pattern that works best for us is one repo for non-production (dev and staging share it with path-based separation) and a separate, more locked-down repo for production. Promotion is an explicit act of copying manifests, which is the right amount of friction.

ApplicationSets for multi-cluster. If you're managing more than two clusters, ApplicationSets — which generate ArgoCD Applications from a template — reduce the repetition significantly. One ApplicationSet definition can target a dozen clusters. We resisted this for a while because it looked like additional complexity, but the operational simplification is real.
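For a sense of scale, here is a sketch of a single ApplicationSet using the cluster generator. The `tier: non-prod` label, the repo URL, and the app name are hypothetical; `{{name}}` and `{{server}}` are the parameters the cluster generator fills in for each registered cluster.

```yaml
# Sketch only: label, repoURL, and app name are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ingress-nginx
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            tier: non-prod            # matches labels on the cluster Secrets registered in ArgoCD
  template:
    metadata:
      name: '{{name}}-ingress-nginx'  # one Application per matching cluster
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/non-prod.git
        targetRevision: main
        path: apps/ingress-nginx
      destination:
        server: '{{server}}'          # API server URL of the matched cluster
        namespace: ingress-nginx
```

Adding a cluster then means registering it with the right label, not writing another Application manifest.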

Sync policies: automated in non-prod, manual in prod. Non-production clusters sync automatically when their repo changes. Production requires a human to initiate the sync (or approve it, via a separate tooling layer). This is a judgment call, but we've found that fully automated production syncs make engineers less attentive to what they're merging, and manual sync keeps the "this is going to production" moment explicit.
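In manifest terms, the difference is whether `syncPolicy.automated` is present. The two fragments below are a sketch of Application specs under that assumption; the `prune`, `selfHeal`, and `CreateNamespace` settings are illustrative choices, not a recommendation.

```yaml
# Non-production Application (fragment): sync automatically on every repo change
spec:
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual changes made in the cluster
---
# Production Application (fragment): no `automated` block, so ArgoCD reports
# OutOfSync and waits for someone to trigger the sync
spec:
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```

With the production variant, drift and pending changes still show up as OutOfSync in the UI and metrics; nothing is applied until a sync is triggered.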

What we'd do differently

We'd set up the observability stack (Prometheus, Grafana, alerting) before we set up ArgoCD, not after. The two weeks we spent operating ArgoCD without proper metrics meant we were flying blind on sync latency, reconciliation errors, and controller performance.
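As a starting point for the kind of alerting we were missing, here is a sketch of two rules built on metrics the ArgoCD application controller exposes. It assumes the Prometheus Operator is installed and already scraping ArgoCD; the alert names, durations, and severities are hypothetical.

```yaml
# Sketch only: assumes Prometheus Operator; thresholds and names are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-alerts
  namespace: argocd
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoAppOutOfSync
          # argocd_app_info is 1 per application, labelled with its current sync status
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.name }} has been OutOfSync for 15 minutes"
        - alert: ArgoAppSyncFailed
          # argocd_app_sync_total counts sync operations by resulting phase
          expr: increase(argocd_app_sync_total{phase=~"Error|Failed"}[10m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "A sync of {{ $labels.name }} failed in the last 10 minutes"
```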

We'd also document the "break-glass" procedure for when ArgoCD itself is broken before we needed it. When your GitOps controller is degraded, you need a clear, pre-approved process for making emergency changes directly to the cluster and reconciling them later. That process should be written down and tested before the emergency.

The bottom line

ArgoCD is production-ready and we recommend it. The learning curve is real but front-loaded: after the first two months of setup pain, the day-to-day operational experience is significantly smoother than anything we had before. Drift detection and rollback alone are worth the investment.

Go in with a plan for secrets, notifications, and your break-glass procedure, and you will have a much better first three months than we did.