Oscar Baldenebro

Cloud Platform & AI-Driven Operations

SRE · Observability · Kubernetes · Incident Response

Cloud Platform Engineer with 15+ years operating revenue-critical, highly regulated systems in financial services and large enterprises. Specialized in cloud platform engineering, observability, AI-driven operations, and Kubernetes-based infrastructure. Proven track record of reducing outages, improving MTTR, and translating engineering work into measurable risk reduction and business outcomes.

01 Core Expertise

Reliability & Operations

Incident Response · On-Call · Root Cause Analysis · SLA/SLO Management · Change & Risk Management

Observability

Prometheus · Grafana · Splunk · Dynatrace

Cloud & Platforms

AKS (Azure) · GCP (GKE) · AWS · Kubernetes · Docker

GitOps & Automation

ArgoCD · Argo Workflows · Rundeck · Terraform · Ansible

AI-Driven Operations

Claude / Anthropic · MCP Servers · AI Skills · AIOps

Automation & Scripting

Python · Bash · PowerShell

Operating Systems

Linux (enterprise) · Windows (enterprise)

Data & Systems

Oracle · SQL Server · API-driven platforms

Security & Compliance

Vulnerability remediation · CVE management · Regulated environments

02 Professional Experience

Cloud Platform Engineer

Cloud Platform & AI-Driven Operations

MIC Customs Solutions

Oct 2025 – Present
  • Operate and scale production AKS platform infrastructure, ensuring reliable Kubernetes services for critical enterprise workloads
  • Built and maintain Prometheus/Grafana observability capabilities, improving visibility into service performance, alerting quality, and incident response readiness
  • Introduced AI-powered operational tooling through custom Claude MCP servers and reusable engineering skills, enabling faster troubleshooting, guided runbook execution, and improved support efficiency
  • Delivered GitOps automation with ArgoCD and Argo Workflows, standardizing deployments and reducing release friction across environments
  • Automated recurring operational tasks through Rundeck, reducing manual intervention and enabling controlled self-service workflows for support and on-call engineers
  • Led incident response, production support, and on-call processes, creating runbooks and escalation frameworks that improved operational consistency and reduced service disruption
AKS · Kubernetes · Prometheus · Grafana · ArgoCD · Rundeck · Claude / MCP · GitOps

Application Lead Engineer

Senior SRE / Platform Focus

TEKsystems — Client: Bank of America

Nov 2024 – Oct 2025
  • Owned reliability, performance, and operational stability of the iManage Document Management platform supporting a global financial user base
  • Designed and expanded enterprise observability strategy using Splunk and Dynatrace, improving proactive detection by 35% and reducing incident noise
  • Led AIOps adoption to enable anomaly detection and predictive failure prevention in production systems
  • Directed large-scale infrastructure initiatives including iManage RAVN expansion from 6 → 48 servers and 95 TB corporate PST migration with full audit and compliance integrity
  • Partnered with Global Information Security to remediate vulnerabilities, achieving 100% CVE compliance and reducing exposure risk by 40%
  • Acted as senior escalation point during high-severity incidents, ensuring calm execution and rapid recovery
  • Led and mentored cross-functional onshore/offshore teams, improving operational consistency and delivery quality
Splunk · Dynatrace · AIOps · iManage · CVE Management

DevOps Engineer / Site Reliability Engineer

TEKsystems — Client: Bank of America

Aug 2022 – Aug 2024
  • Automated AWS infrastructure provisioning using Terraform, reducing manual changes and deployment risk
  • Built Python and PowerShell automation leveraging iManage APIs, eliminating ~50% of repetitive operational work
  • Served as Tier-3 / escalation owner for critical incidents, improving MTTR by 50% and sustaining 98% SLA compliance
  • Developed automated ingestion pipelines for Outlook PST migrations, preserving full metadata for regulatory compliance
  • Designed automated document purging workflows (Python + SQL), reclaiming ~500 GB/month in storage and reducing costs
  • Optimized deployment workflows across Linux and Windows servers, reducing downtime by 40% and improving service continuity
Terraform · AWS · Python · PowerShell · SQL

Systems Engineer II

SRE / Platform Support

Charter Communications

Nov 2019 – Aug 2022
  • Supported production GKE clusters delivering enterprise-scale services
  • Improved billing and data workflows using Python, Pandas, and SQL, increasing performance by ~30%
  • Designed executive dashboards integrating Python, Oracle, and visualization tools to support C-suite decision-making
  • Resolved complex Tier-3 production incidents, reducing resolution time by 40%
  • Designed and implemented a new call-billing process that eliminated revenue leakage from previously unbilled services
GKE · Python · Pandas · Oracle · Kubernetes

Systems Administrator

MIC Customs Solutions

May 2018 – Nov 2019
  • Supported SaaS and on-prem customer environments with high availability requirements
  • Implemented logging and monitoring using Elasticsearch and New Relic
  • Supported CI/CD pipelines with Jenkins and Ansible
  • Automated operational tasks using Python, Bash, and SQL
  • Reduced recurring incidents by ~30% through root cause analysis and proactive remediation
Elasticsearch · New Relic · Jenkins · Ansible

IT Operations & Infrastructure Leader

Club Premier (Aeroméxico Loyalty Program)

Jun 2014 – Apr 2018
  • Led IT Operations and on-call rotations for customer-facing platforms
  • Designed and implemented disaster recovery architecture in AWS
  • Improved platform security posture across applications, servers, and databases
  • Migrated Oracle databases from Windows to Linux, improving performance and reducing licensing costs
  • Increased production uptime from 99.3% → 99.8% for ClubPremier.com and CRM systems
AWS · Oracle · Linux · DR Architecture

Earlier Roles

  • IT Operations Specialist — Club Premier Aeroméxico
  • Junior Developer — GoNet (Aeroméxico)

03 Education

Bachelor of Engineering in Electronics

Instituto Tecnológico de Sonora

Information Security Diploma

Instituto Tecnológico de Estudios Superiores de Monterrey (ITESM)

04 Certifications

Introduction to Site Reliability Engineering

Google

SRE Fundamentals

Google