Own end-to-end resolution of complex production incidents for Python web applications and APIs, including deep debugging across the UI, API, and data layers, with a strong focus on restoring service and preventing recurrence.
Lead root cause analysis (RCA) and problem management for recurring/critical issues; define corrective and preventive actions, track them to closure, and communicate outcomes to stakeholders.
Perform advanced troubleshooting and performance optimization (latency, throughput, memory/CPU, concurrency, timeouts), leveraging logs, APM traces, and metrics to identify code and configuration improvements.
Support and govern API reliability and security, including authentication/authorization troubleshooting, token/claims validation, and investigation of data integrity issues in API flows.
Drive database and data-layer improvements (SQL and Azure Cosmos DB) through query tuning, indexing/partitioning guidance, throughput/RU optimization, and validation of end-to-end data consistency.
Requirements
6+ years of experience in Python application development and/or L2/L3 production support for web applications and APIs in enterprise environments.
Demonstrated ownership of critical incidents, RCA, and problem management with measurable improvements to stability and MTTR.
Hands-on experience with Azure PaaS-hosted applications and a strong understanding of production operations (monitoring, on-call, change management).
Ability to mentor others and lead technical discussions with engineering, DevOps, and business stakeholders.
Bachelor’s degree in Computer Science/Engineering or equivalent practical experience.