Projects

Selected technical work

Infrastructure builds, platform overhauls and reliability engineering initiatives — presented without employer context or dates.

Infrastructure

Cloud & platform projects

Production environments built from scratch or significantly rearchitected.

AWS Middle East Infrastructure

Production Kubernetes platform — UAE region, zero to live

Outcome: Launched a full production-grade cloud platform in the AWS Middle East (UAE) region from scratch, enabling the product to serve customers across the Middle East with regional data residency.

Built the complete production infrastructure stack for a SaaS product in the AWS UAE region: EKS cluster, RDS database, Application Load Balancers, DNS via Route 53, structured log collection in CloudWatch, Prometheus metrics exported to a remote Grafana instance, Alertmanager for proactive incident detection, and a full CI/CD pipeline in Azure DevOps — all Terraform-managed and stood up from nothing.

AWS EKS, RDS, ALB, Route 53, CloudWatch, Terraform, Prometheus, Grafana, Alertmanager, Azure DevOps

GCP GKE Architecture Overhaul

High-availability Kubernetes redesign for a multi-tenant SaaS platform

Outcome: Eliminated single points of failure and environment-parity gaps across a GKE-hosted SaaS product, improving availability and giving the team a reliable non-production environment to test in.

Comprehensive architecture uplift for a multi-tenant SaaS application running on GKE. Redesigned node topology for multi-zone availability, added Pod Disruption Budgets and topology spread constraints to eliminate single points of failure in scheduling, packaged the entire application stack with custom Helm charts, segregated workloads by function (API servers, background processors, report generation), migrated secrets from ad-hoc environment variables to GCP Secret Manager, and built a complete non-production environment from scratch.

GCP GKE, Helm, Kustomize, GCP Secret Manager, Prometheus, Grafana, Terraform, Azure DevOps

EC2 → Kubernetes Migration

Lift, containerize and modernize: microservices moved to EKS

Outcome: Migrated a fleet of EC2-hosted microservices to Kubernetes on AWS EKS, introducing standardized health checks, HPA-driven autoscaling and full observability, with zero downtime across all services.

Planned and executed the migration of a set of EC2-deployed microservices to Kubernetes on AWS EKS. Each service was containerized, health checks and readiness probes were defined, resource requests/limits were profiled and set, HPA policies were configured for traffic-driven autoscaling, and full observability (logs, metrics, traces) was wired in before traffic was cut over. The migration delivered improved reliability, faster deployments and a standardized operational model across all services.

AWS EKS, Docker, Helm, HPA, AWS ALB, Prometheus, Grafana, AWS CloudWatch
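For readers unfamiliar with how HPA-driven autoscaling decides on a replica count, the sketch below reproduces the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds and the example numbers are illustrative assumptions, not values from this migration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Illustrative version of the documented HPA scaling rule.

    All parameter defaults except the 0.1 tolerance are hypothetical.
    """
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minimum and maximum replica counts.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical service: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Because the rule is proportional, accurate resource requests matter: CPU utilization targets are measured relative to each pod's request, which is why profiling requests and limits comes before configuring HPA policies.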

Engineering practices

Observability, reliability & security

Cross-cutting initiatives that raised the operational maturity of the platforms I've worked on.

Multi-Cloud Observability Unification

Single pane of glass across AWS and GCP workloads

Outcome: Replaced a fragmented toolset spread across CloudWatch, Elastic Cloud and New Relic with a single unified observability stack, achieving full metrics, logs, traces and alerting coverage across multi-cloud infrastructure.

Evaluated, selected and implemented a unified observability platform to replace a fragmented toolset that had grown organically across three vendors. Consolidated metrics, logs, traces and alerting from both AWS and GCP workloads into a single Grafana stack, refined alerting rules to reduce noise, and established SLI dashboards tied to defined SLOs. Resulted in 100% observability coverage across all production infrastructure and measurably faster mean-time-to-detect.

Prometheus, Grafana, Alertmanager, AWS CloudWatch, GCP operations suite, Remote write, SigNoz

Disaster Recovery Framework

SLO-driven DR with defined RPO/RTO for a mission-critical fintech platform

Outcome: Took a fintech platform from no formal DR posture to a fully documented, drill-tested framework with defined SLOs, error budgets, and automated runbooks covering all critical infrastructure components.

Designed and implemented a Disaster Recovery framework from the ground up for a U.S. fintech SaaS platform with mission-critical uptime requirements. Defined SLOs and error budgets for all critical paths, documented RTO/RPO targets per service tier, built automated DR runbooks, conducted the first formal DR drill across the production fleet, and established a regular drill cadence. The framework became the foundation for the platform's compliance reporting posture.

AWS EKS, AWS RDS, Terraform, Runbook automation, SLO / Error budgets, Incident management
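As a small illustration of how an SLO translates into an error budget, the sketch below converts an availability target into allowed downtime over a window and reports how much of that budget remains. The 99.9% target, the 30-day window and the function names are assumptions chosen for the example. They are not the platform's actual targets.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical tier: 99.9% availability over 30 days.
    print(round(error_budget_minutes(0.999), 1))    # allowed downtime, minutes
    print(round(budget_remaining(0.999, 10.8), 2))  # fraction of budget left
```

A per-tier table of targets like this gives the RTO discussion a concrete anchor: a single outage longer than the whole window's budget means the recovery objective and the SLO are inconsistent with each other.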

Observability Platform Migration

Three-vendor consolidation into a single unified stack

Outcome: Replaced a three-platform observability sprawl (CloudWatch + Elastic Cloud + New Relic) with a single unified solution, cutting tooling costs and achieving 100% infrastructure coverage.

Led the evaluation, selection and full implementation of a replacement observability platform for a fintech SaaS infrastructure. The prior state was three separate tools with overlapping and incomplete coverage. Consolidated all metrics, logs, traces and events into one platform, migrated existing dashboards and alert rules, and trained the team on the new tooling. Achieved 100% observability coverage across all production infrastructure — the first time the organization had a complete picture.

Prometheus, Grafana, SigNoz, AWS CloudWatch, Elastic Cloud, New Relic, Kloudfuse

SOC-2 Type II Certification

Security and compliance posture for a fintech SaaS platform

Outcome: Contributed the infrastructure and cloud security components of a successful SOC-2 Type II audit, establishing repeatable evidence collection and compliance-as-code practices.

Owned the DevOps and cloud infrastructure scope of a SOC-2 Type II certification effort for a U.S. fintech platform. Work included hardening IAM policies, enabling GuardDuty and Inspector with alerting pipelines, configuring audit logging across all AWS services, integrating StrongDM for privileged access management, and building automated evidence collection to support the audit cycle. Established practices that reduced the manual effort for subsequent audit periods.

AWS IAM, AWS GuardDuty, AWS Inspector, StrongDM, Vanta, SOC-2, SAML / OAuth