- Designed and implemented monitoring and observability solutions for multi-cluster, multi-cloud Kubernetes environments, collecting metrics with Prometheus / Mimir and visualizing them centrally in Grafana.
- Managed multiple Kubernetes clusters and cloud infrastructure components on Azure, with all infrastructure defined as code in Terraform.
- Automated and monitored build and release processes using CI/CD tools (Jenkins, GitHub, SonarQube), driving faster feedback and quicker failure resolution.
- Drove a ~40% reduction in cloud costs by optimizing AKS, EKS and databases in collaboration with development and engineering teams.
- Provisioned and managed Azure / AWS resources such as AKS, VNets, load balancers and storage accounts using Terraform modules integrated into CI/CD pipelines.
- Owned security reporting for the infrastructure and implemented vulnerability scanning of Docker images with Trivy.
- Performed disaster recovery and infrastructure upgrades of application components with zero downtime and minimal data loss.
- Improved alerting for cloud infrastructure services on Azure and created cost analysis dashboards.
- Handled Site Reliability Engineering activities, including incident management, on-call rotations, incident resolution and documentation that better prepared the team for future incidents.
- Provided input to architectural planning and decision-making processes for infrastructure and platform initiatives.
- Created and managed cloud infrastructure on AWS and Azure, following DevOps methodologies and best practices.
- Automated configuration management and CI/CD pipelines using Ansible, Jenkins, Git, Maven, SonarQube and JFrog, and containerized applications with Docker and Kubernetes (AKS).
- Engineered real-time observability dashboards in Splunk for developers and operations teams.
- Created OBIEE dashboards to translate technical metrics into strategic business reports, enabling the business to measure impact and performance.