
Behavioral Story 3

Learning a new domain (GCP infra from scratch)


Owned full cloud infrastructure from scratch — and debugged a production 500 under pressure

No dedicated DevOps at an early-stage startup. Picked up GCP end-to-end. When a deployment broke prod within minutes, caught it fast, rolled back first, fixed root cause, redeployed cleanly.

Learning new domain · Full-stack ownership · Production debugging · GCP / DevOps · Observability · Startup environment
S — Situation
Early-stage startup with no dedicated DevOps engineer. As a full-stack SWE I was also responsible for all deployment infrastructure: GCP Cloud Run, App Engine, load balancing, secret management, and GitHub Actions CI/CD, none of which I had real hands-on experience with before.
T — Task
Learn and own the full infra stack — build it, maintain it, and respond when things broke. No senior infra engineer to escalate to.
A — Action
Started by mapping the full architecture before touching anything. Tackled one service at a time: read the docs, build it, break it, fix it. When automated E2E tests fired within minutes of a deployment — server returning 500s from a secret access issue — I pulled logs, reproduced locally, and traced it to the exact deployment change. Rolled back to stable first, then fixed the root cause and redeployed cleanly.
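The rollback-first move above could look like this on Cloud Run — a minimal sketch, not the actual commands from the incident; the service name `api`, the region, and the revision names are all hypothetical, and the `gcloud` calls are left as comments since they require live credentials:

```shell
# Pick the previous revision given a newest-first list of revision names.
# (Plain POSIX shell; the gcloud plumbing around it is illustrative only.)
previous_revision() {
  shift            # drop the newest (broken) revision
  printf '%s\n' "$1"
}

# In a real incident (hypothetical names; requires gcloud auth):
#   REVS=$(gcloud run revisions list --service=api --region=us-central1 \
#            --format="value(name)")
#   PREV=$(previous_revision $REVS)
#   # Rollback first — stop the bleeding for users:
#   gcloud run services update-traffic api --region=us-central1 \
#       --to-revisions="$PREV=100"
#   # ...only then fix the root cause and redeploy cleanly.
```

The point the sketch makes is the ordering: shifting traffic back to a known-good revision is a one-command, low-risk action, so it comes before any root-cause work.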
R — Result
Incident resolved with minimal user impact. Full GCP stack owned solo. Core takeaway: good observability and automated testing are what let you move fast without fear.
incident caught in minutes · full GCP stack owned solo · rollback before fix — right order
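The "automated testing lets you move fast" takeaway can be made concrete with a post-deploy smoke gate — a sketch, assuming a health endpoint; the URL and pipeline wiring are hypothetical, and the `curl` call is commented out since it needs a live service:

```shell
# Classify a health-endpoint HTTP status: 2xx/3xx passes, anything else
# (including the 500s from the story) fails the pipeline within minutes.
smoke_check() {
  case "$1" in
    2*|3*) echo "healthy"; return 0 ;;
    *)     echo "unhealthy: HTTP $1"; return 1 ;;
  esac
}

# In CI, right after deploy (hypothetical endpoint; requires a live service):
#   STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
#              "https://api.example.com/healthz")
#   smoke_check "$STATUS" || exit 1   # fail fast, trigger the rollback path
```

A gate like this is what turns "we found out from users" into "the pipeline caught it in minutes."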
90-second version — ready to say out loud
"When I joined, we didn't have a dedicated DevOps engineer, so as a full-stack SWE I was also responsible for the entire deployment infrastructure — GCP Cloud Run, App Engine, load balancing, secret management, GitHub Actions CI/CD. None of it was something I had real hands-on experience with before.

My approach was to start with a plan — map out what the infrastructure needed to look like end to end before touching anything. Then I tackled it one service at a time: read the docs, build it, break it, fix it.

There was one incident that stood out. After a deployment, our automated E2E tests started firing within a couple of minutes — the server was returning 500s. I pulled the logs, found the suspicious output, and reproduced the issue locally to confirm. Traced it back to the exact change in the deployment: a secret access misconfiguration. My first move was to roll back to the last stable version — to stop the bleeding for users — and only then fix the root cause and redeploy cleanly.

After going through experiences like that, infra work stopped feeling intimidating. I can now own the full stack from frontend to cloud infrastructure. And the biggest thing that experience taught me: good observability and automated testing are what let you move fast without being afraid."
"Learning a new technology quickly"
Lead with: no DevOps, had to own the whole stack
Primary question — the breadth of the GCP stack is the hook
"Debugging production under pressure"
Lead with: E2E tests fire, 500s, no one to escalate to
Rollback first, fix second — that sequence shows composure
"Outside your comfort zone / job description"
Lead with: hired as SWE, ended up owning all of infra
Good for startups and companies that value versatility
"Engineering reliability / observability"
Lead with: why automated E2E caught it in minutes
End on the principle: observability = move fast without fear
Two stories in one: a learning story and a production incident. The incident is the climax — makes the learning concrete and high-stakes. The sequence matters: roll back first, fix second. That detail signals senior instincts.