Investigate, diagnose, and resolve complex production incidents across distributed systems
Perform in-depth technical root cause analysis (RCA) for customer-reported issues
Develop, maintain, and improve internal tools and services using Python
Collaborate with support engineers to reproduce, analyze, and prioritize escalated issues
Work directly with the operations team to support production workloads running on Linux-based systems
Analyze application logs, system metrics, and networking-level behaviors to pinpoint issues
Implement code fixes, performance improvements, and reliability enhancements
Participate in an on-call rotation for critical issue escalation (if applicable)
Document findings, create knowledge base entries, and automate repetitive troubleshooting steps
Contribute to CI/CD workflows, monitoring improvements, and infrastructure quality initiatives
Must learn and support the entire infrastructure, not just records management
Requirements
3+ years of professional experience with Python development
Moderate programming experience required, preferably in Python, with ability to write clean, maintainable code and develop internal tools for diagnostics and automation
Strong hands-on experience with Linux systems
Shell scripting, system diagnostics, process management, file system debugging
Deep Unix troubleshooting expertise required, supporting multiple clients and modern deployments