TTC is a global specialist software testing company focused on transforming software delivery quality. They are seeking an operations-focused engineer to ensure the reliability and operational excellence of critical third-party enterprise platforms, collaborating with engineering and infrastructure teams.

Responsibilities:

Serve in an on-call rotation and lead incident response for production issues: triage, mitigation, escalation, and restoration
Drive operational excellence: improve alert quality, reduce toil, document runbooks, and create repeatable operational processes
Perform root cause analysis for incidents and recurring issues; drive corrective and preventive actions to completion
Execute and coordinate maintenance activities (upgrades, patching, configuration changes) with minimal risk and downtime
Build and maintain monitoring, dashboards, and health checks to detect issues early and reduce mean time to recovery
Automate routine operational workflows using scripts and small tools; improve reliability through safe incremental change
Partner cross-functionally (security, networking, storage, compute, vendor/third-party partners) to resolve complex issues
Maintain accurate system documentation, operational standards, and service ownership practices across supported platforms

Requirements:

3+ years of experience in production operations, SRE, systems engineering, or production support for enterprise services
Strong Linux/systems troubleshooting skills (processes, logs, performance, networking basics)
Experience participating in or leading on-call and handling production incidents with clear communication
Proficiency in scripting/automation (e.g., Python and/or shell) and comfort with change management / peer review workflows
Strong written and verbal communication; able to write clear runbooks and incident summaries
Experience operating third-party enterprise platforms (integration middleware, identity/auth systems, web/app tiers, databases, batch/scheduled jobs)
Familiarity with vulnerability remediation and patch management practices in production environments
Demonstrated track record of reducing operational toil and improving reliability metrics (MTTR, alert noise, incident recurrence)
Experience coordinating complex incidents across multiple teams and stakeholders
Experience using Capirca for network provisioning, Chef for configuration management, and Infrastructure as Code and containers for deployment

Operations-Focused Engineer
