
Behavioral Story 3

Learning a new domain (GCP infra from scratch)


Owned full cloud infrastructure from scratch — and debugged a production 500 under pressure

No dedicated DevOps at an early-stage startup. Picked up GCP end-to-end. When a deployment broke prod within minutes, caught it fast, rolled back first, fixed root cause, redeployed cleanly.

Learning new domain · Full-stack ownership · Production debugging · GCP / DevOps · Observability · Startup environment
S — Situation
Early-stage startup with no dedicated DevOps engineer. As a full-stack SWE I was also responsible for all deployment infrastructure: GCP Cloud Run, App Engine, load balancing, secret management, and GitHub Actions CI/CD, none of which I had real hands-on experience with before.
T — Task
Learn and own the full infra stack — build it, maintain it, and respond when things broke. No senior infra engineer to escalate to.
A — Action
Started by mapping the full architecture before touching anything. Tackled one service at a time: read the docs, build it, break it, fix it. When automated E2E tests fired within minutes of a deployment — server returning 500s from a secret access issue — I pulled logs, reproduced locally, and traced it to the exact deployment change. Rolled back to stable first, then fixed the root cause and redeployed cleanly.
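The rollback-first move above could look like this on Cloud Run — a minimal sketch, not the actual commands from the incident; the service name `api`, the region, and the revision names are all hypothetical, and the `gcloud` calls are left as comments since they require live credentials:

```shell
# Pick the previous revision given a newest-first list of revision names.
# (Plain POSIX shell; the gcloud plumbing around it is illustrative only.)
previous_revision() {
  shift            # drop the newest (broken) revision
  printf '%s\n' "$1"
}

# In a real incident (hypothetical names; requires gcloud auth):
#   REVS=$(gcloud run revisions list --service=api --region=us-central1 \
#            --format="value(name)")
#   PREV=$(previous_revision $REVS)
#   # Rollback first — stop the bleeding for users:
#   gcloud run services update-traffic api --region=us-central1 \
#       --to-revisions="$PREV=100"
#   # ...only then fix the root cause and redeploy cleanly.
```

The point the sketch makes is the ordering: shifting traffic back to a known-good revision is a one-command, low-risk action, so it comes before any root-cause work.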
R — Result
Incident resolved with minimal user impact. Full GCP stack owned solo. Core takeaway: good observability and automated testing are what let you move fast without fear.
incident caught in minutes · full GCP stack owned solo · rollback before fix — right order
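The "automated testing lets you move fast" takeaway can be made concrete with a post-deploy smoke gate — a sketch, assuming a health endpoint; the URL and pipeline wiring are hypothetical, and the `curl` call is commented out since it needs a live service:

```shell
# Classify a health-endpoint HTTP status: 2xx/3xx passes, anything else
# (including the 500s from the story) fails the pipeline within minutes.
smoke_check() {
  case "$1" in
    2*|3*) echo "healthy"; return 0 ;;
    *)     echo "unhealthy: HTTP $1"; return 1 ;;
  esac
}

# In CI, right after deploy (hypothetical endpoint; requires a live service):
#   STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
#              "https://api.example.com/healthz")
#   smoke_check "$STATUS" || exit 1   # fail fast, trigger the rollback path
```

A gate like this is what turns "we found out from users" into "the pipeline caught it in minutes."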
90-second version — ready to say out loud
"When I joined, we didn't have a dedicated DevOps engineer, so as a full-stack SWE I was also responsible for the entire deployment infrastructure — GCP Cloud Run, App Engine, load balancing, secret management, GitHub Actions CI/CD. None of it was something I had real hands-on experience with before.

My approach was to start with a plan — map out what the infrastructure needed to look like end to end before touching anything. Then I tackled it one service at a time: read the docs, build it, break it, fix it.

There was one incident that stood out. After a deployment, our automated E2E tests started firing within a couple of minutes — the server was returning 500s. I pulled the logs, found the suspicious output, and reproduced the issue locally to confirm. Traced it back to the exact change in the deployment: a secret access misconfiguration. My first move was to roll back to the last stable version — to stop the bleeding for users — and only then fix the root cause and redeploy cleanly.

After going through experiences like that, infra work stopped feeling intimidating. I can now own the full stack from frontend to cloud infrastructure. And the biggest thing that experience taught me: good observability and automated testing are what let you move fast without being afraid."
"Learning a new technology quickly"
Lead with: no DevOps, had to own the whole stack
Primary question — the breadth of the GCP stack is the hook
"Debugging production under pressure"
Lead with: E2E tests fire, 500s, no one to escalate to
Rollback first, fix second — that sequence shows composure
"Outside your comfort zone / job description"
Lead with: hired as SWE, ended up owning all of infra
Good for startups and companies that value versatility
"Engineering reliability / observability"
Lead with: why automated E2E caught it in minutes
End on the principle: observability = move fast without fear
Two stories in one: a learning story and a production incident. The incident is the climax — makes the learning concrete and high-stakes. The sequence matters: roll back first, fix second. That detail signals senior instincts.