TTC is a global specialist software testing company focused on transforming software delivery quality. The company is seeking an operations-focused engineer to ensure the reliability and operational excellence of critical third-party enterprise platforms, collaborating with engineering and infrastructure teams.
Responsibilities:
- Serve in an on-call rotation and lead incident response for production issues: triage, mitigation, escalation, and restoration
- Drive operational excellence: improve alert quality, reduce toil, document runbooks, and create repeatable operational processes
- Perform root cause analysis for incidents and recurring issues; drive corrective and preventive actions to completion
- Execute and coordinate maintenance activities (upgrades, patching, configuration changes) with minimal risk and downtime
- Build and maintain monitoring, dashboards, and health checks to detect issues early and reduce mean time to recovery
- Automate routine operational workflows using scripts and small tools; improve reliability through safe incremental change
- Partner cross-functionally (security, networking, storage, compute, vendor/third-party partners) to resolve complex issues
- Maintain accurate system documentation, operational standards, and service ownership practices across supported platforms
Requirements:
- 3+ years of experience in production operations, SRE, systems engineering, or production support for enterprise services
- Strong Linux/systems troubleshooting skills (processes, logs, performance, networking basics)
- Experience participating in or leading on-call and handling production incidents with clear communication
- Proficiency in scripting/automation (e.g., Python and/or shell) and comfort with change management / peer review workflows
- Strong written and verbal communication; able to write clear runbooks and incident summaries
- Experience operating third-party enterprise platforms (integration middleware, identity/auth systems, web/app tiers, databases, batch/scheduled jobs)
- Familiarity with vulnerability remediation and patch management practices in production environments
- Demonstrated track record of reducing operational toil and improving reliability metrics (MTTR, alert noise, incident recurrence)
- Experience coordinating complex incidents across multiple teams and stakeholders
- Experience using Capirca for network provisioning, Chef for configuration management, and infrastructure as code and containers for deployment