Work Experience
DevOps Engineer (2023 – Present)
Architected Hub-and-Spoke centralised observability platform aggregating telemetry from
distributed Kubernetes workload clusters using Grafana, Loki, Thanos, and Tempo.
Engineered production CI/CD pipelines with GitHub Actions, embedding Trivy, CodeQL, KICS,
and TruffleHog security scanning. Maintained Helm charts and Terraform/Terragrunt code
across five isolated environments. Designed on-premises Kubernetes multi-cluster
architecture using Kubeadm with Calico CNI.
Junior DevOps Engineer (2021 – 2023)
Owned end-to-end AWS cloud infrastructure for a production AI conversational chatbot.
Managed MongoDB Atlas, MySQL RDS, Redis, and ElastiCache clusters. Built Jenkins CI/CD
pipelines with auto-scaling groups and ALBs. Executed complex database migrations to
AWS Aurora and MongoDB Atlas. Developed AWS Lambda functions for off-hours environment
scale-down, reducing monthly cloud costs significantly.
Blog — AI-Driven Infrastructure: The Shift from Ops to AI-Ops
For most of the past decade, infrastructure operations ran on a simple contract: humans
defined the desired state, automation enforced it, and alert managers woke someone up when
things went wrong. That model worked well at modest scale. It breaks catastrophically at
cloud-native scale, where a single distributed application can generate tens of thousands of
metrics, hundreds of log streams, and dozens of trace spans every second.
The core problem is signal-to-noise ratio. A production Kubernetes cluster running 50
microservices across three availability zones might fire 300 alerts on a busy Tuesday. A
human on-call engineer cannot meaningfully triage 300 simultaneous alerts. Alert fatigue
combined with manual correlation across disconnected dashboards is consistently identified
as a root cause in production post-mortems. This is not an operations problem — it is a
data problem. AIOps addresses this gap through three capabilities: ML-driven anomaly
detection that replaces static thresholds with learned baselines, alert correlation that
reduces an alert storm of 40 notifications to a single root cause event, and predictive
remediation that identifies failure patterns before they manifest. Teams that adopt AIOps
consistently report reductions of 50-80% in mean time to detect (MTTD) and of 30-60% in
mean time to resolve (MTTR).
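
The "learned baseline" idea is easier to see in code. The sketch below is illustrative only
and not from the original post: a rolling z-score detector stands in for the ML models, and
the window size, minimum history, and threshold are arbitrary assumptions.

```python
# Illustrative sketch: a rolling z-score detector standing in for "learned baselines".
# Real AIOps platforms use far richer models; window and threshold here are assumptions.
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    def __init__(self, window: int = 288, z_threshold: float = 3.0):
        self.samples = deque(maxlen=window)   # e.g. 288 five-minute samples ~= 24h
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates strongly from the learned baseline."""
        anomalous = False
        if len(self.samples) >= 5:            # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True
        self.samples.append(value)
        return anomalous

# A static threshold flags every sample above a fixed number (say 75% CPU); the baseline
# detector flags deviations from what is normal for this particular series.
detector = BaselineDetector()
for cpu in [40, 42, 41, 39, 43, 95]:          # toy series: the last value is a spike
    print(cpu, detector.observe(cpu))
```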
Blog — Maximizing Cloud Efficiency with Predictive Scaling
Reactive auto-scaling operates in the past. By the time your monitoring system detects
that CPU has breached 75%, users are already experiencing degraded performance. By the
time a new EC2 instance launches, passes health checks, and registers with the load
balancer, two to five minutes have elapsed. For high-traffic workloads, those minutes
represent thousands of failed requests. The standard workaround — keeping 30% spare
capacity at all times — is expensive and only partially effective against true demand spikes.
Predictive scaling inverts the model. AWS Predictive Scaling uses ML models trained on
your CloudWatch metrics history to generate capacity forecasts up to 48 hours ahead,
identifying diurnal and weekly traffic patterns and pre-provisioning capacity before peaks
arrive. On Kubernetes, KEDA combined with cron-based scaling triggers and custom
Prometheus metric forecasting achieves the same effect. Organizations implementing
predictive alongside reactive scaling typically reduce baseline instance counts by 15-25%
and eliminate error spikes during planned demand peaks entirely.
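
As a rough illustration of the AWS side, a predictive scaling policy can be attached to an
Auto Scaling group with a few lines of boto3. The group name, region, and 40% CPU target
below are placeholders, not values from the original post.

```python
# Minimal boto3 sketch of enabling AWS Predictive Scaling on an Auto Scaling group.
# "ForecastOnly" lets you review the 48-hour forecast before the policy changes capacity.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")  # assumed region

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-prod-asg",          # hypothetical ASG name
    PolicyName="predictive-cpu-40",
    PolicyType="PredictiveScaling",
    PredictiveScalingConfiguration={
        "MetricSpecifications": [
            {
                "TargetValue": 40.0,              # keep forecasted CPU around 40%
                "PredefinedMetricPairSpecification": {
                    "PredefinedMetricType": "ASGCPUUtilization"
                },
            }
        ],
        "Mode": "ForecastOnly",                   # switch to "ForecastAndScale" once trusted
        "SchedulingBufferTime": 300,              # pre-provision 5 minutes before the peak
    },
)
```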
Blog — Why Firebase is the Secret Weapon for AI Landing Pages
Firebase Hosting sits on Google's global CDN infrastructure, distributing static assets
across Points of Presence worldwide. For React SPAs, this means Largest Contentful Paint
(LCP) times consistently below 1.5 seconds and time to first byte (TTFB) under 50 ms —
metrics that directly influence both user
experience and Google Search rankings. Every deployment is atomic and creates an
immutable versioned snapshot. Rollback takes under 30 seconds. Preview channels provide
shareable staging URLs per pull request without impacting production. SSL certificates
are provisioned and renewed automatically. For portfolios and landing pages serving under
10 GB monthly transfer, Firebase Hosting costs nothing, while eliminating all the
operational overhead of VPS management, certbot, and manual CDN configuration.
Blog — Building a Hub-and-Spoke Observability Platform with Thanos
Running six Kubernetes clusters — production, staging, UAT, QA, development, and a
management plane — without centralised visibility means engineers tab between six separate
Grafana instances, correlating incidents manually. The Hub-and-Spoke pattern solves this:
a dedicated management cluster (the Hub) hosts Thanos Query, Loki, Tempo, Alertmanager,
and Grafana. Spoke clusters run lightweight agents — Prometheus Agent or Sidecar, Fluent
Bit, and an OpenTelemetry Collector — that push telemetry to the Hub. The total agent
overhead per spoke is under 500m CPU and 2Gi memory. Engineers interact exclusively with
the Hub, gaining cross-cluster dashboards, unified alert routing, and the ability to
correlate cascading failures across environments from a single pane of glass.
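
What the single pane of glass looks like in practice: one PromQL query against the Hub's
Thanos Query endpoint fans out to every spoke. The endpoint URL and the cluster external
label in this sketch are assumptions that depend on how the spokes are configured.

```python
# Sketch: querying the Hub's Thanos Query API (Prometheus-compatible) for per-cluster CPU.
import requests

THANOS_QUERY = "http://thanos-query.monitoring.svc:9090"   # hypothetical Hub endpoint

resp = requests.get(
    f"{THANOS_QUERY}/api/v1/query",
    params={"query": 'sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))'},
    timeout=10,
)
resp.raise_for_status()

# One response, one row per spoke cluster — no tabbing between six Grafana instances.
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("cluster", "unknown"), series["value"][1])
```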
Blog — GitOps with ArgoCD: A Production Deployment Playbook
GitOps defines the desired state of infrastructure and applications declaratively in Git,
with an automated system continuously reconciling live cluster state to that definition.
ArgoCD watches one or more Git repositories and syncs the cluster state to match the
manifests. Separating application source code from deployment configuration repositories
enforces clean boundaries between developer and platform engineer concerns. ApplicationSet
controllers eliminate per-environment Application resource definitions. Sync waves manage
deployment ordering: database migrations in wave -1 complete before application servers
in wave 0 receive traffic. The argocd-vault-plugin integrates secret injection from
HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager without storing secret
values in Git. Rollback is a git revert: instantaneous and fully auditable.
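
To make the wave ordering concrete, the sketch below shows the annotation that places a
migration Job in wave -1. The annotation key is real ArgoCD behaviour; the Job name, image,
and command are hypothetical, and building the manifest from a Python dict is simply a
convenient way to show the structure here.

```python
# Sketch of a wave -1 resource: ArgoCD applies it, waits for it to become healthy
# (a Job is healthy when it completes), and only then syncs wave 0 resources.
import yaml

migration_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        "name": "db-migrate",                          # hypothetical Job name
        "annotations": {
            "argocd.argoproj.io/sync-wave": "-1",      # runs before wave 0 resources
        },
    },
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {"name": "migrate", "image": "app:1.4.2", "command": ["./migrate"]}
                ],
            }
        }
    },
}

print(yaml.safe_dump(migration_job, sort_keys=False))
```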
Blog — Kubernetes Secrets Management with HashiCorp Vault
Kubernetes Secrets are base64-encoded, not encrypted. Anyone with namespace read access
can retrieve and decode them. Without encryption at rest configured on etcd, secrets are
stored in plaintext — and etcd backups frequently contain secret data. HashiCorp Vault
addresses these gaps with fine-grained access control policies, a complete audit log of
every secret access event, automatic secret rotation, and dynamic secrets with
configurable TTLs. The Kubernetes auth method allows Pods to authenticate using their
automatically mounted Service Account token, eliminating the bootstrap secret problem.
The Vault Agent Injector injects secrets as files into Pod volumes without application
code changes. The database secrets engine generates unique, time-limited credentials per
requester, making credential exposure time-bounded and fully attributable.
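
A minimal sketch of that flow using the hvac client, assuming a hypothetical in-cluster
Vault address, auth role, and a database role named "readonly":

```python
# Sketch: a Pod authenticates to Vault with its mounted Service Account token, then
# requests short-lived database credentials from the database secrets engine.
import hvac

SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

with open(SA_TOKEN_PATH) as f:
    jwt = f.read()

client = hvac.Client(url="http://vault.vault.svc:8200")      # hypothetical Vault address
client.auth.kubernetes.login(role="payments-app", jwt=jwt)   # role bound to this SA/namespace

# Dynamic secrets: Vault creates a unique, time-limited DB user for this caller.
creds = client.secrets.database.generate_credentials(name="readonly")
username = creds["data"]["username"]
password = creds["data"]["password"]
lease_ttl = creds["lease_duration"]                          # seconds until expiry/rotation
print(f"received user {username}, valid for {lease_ttl}s")
```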