Point72 is a leading global alternative investment firm reimagining the future of investing through advanced technology. The Scheduling Reliability Engineer will serve as a Subject-Matter Expert for enterprise scheduling platforms, ensuring system stability and high availability while developing automation solutions and collaborating with cross-functional teams.

Responsibilities:

Serve as the Subject-Matter Expert (SME) for our enterprise scheduling platforms
Maintain, tune, and upgrade the scheduling environment to ensure stability and high availability
Develop and enhance automation solutions using PowerShell and other scripting languages to streamline workload orchestration
Build, configure, and refine monitoring dashboards, alerts, and reports to track system health, throughput, and performance
Lead incident response for high-priority scheduling failures, including troubleshooting, resolution, and root-cause analysis
Define, establish, and report on SLIs, SLOs, and SLAs for critical business workflows
Collaborate with cross-functional teams to onboard new workflows, optimize job dependencies, and implement best practices
Create and maintain comprehensive documentation, runbooks, and training materials for end users and support teams
Participate in a rotational on-call schedule to support 24/7 operations and critical incident management

Requirements:

Bachelor's degree in computer science, engineering, or a related field, or equivalent work experience
5+ years of experience with enterprise scheduling or workload automation tools such as ActiveBatch, CA Workload Automation, Control-M, and/or AutoSys
2+ years of experience in a site reliability, DevOps, or production support role with exposure to SLA management and SLI/SLO frameworks
Hands-on expertise with ActiveBatch or similar workload automation tools, including job scheduling, calendars, dependencies, security, and versioning
Familiarity with Apache Airflow concepts, including DAG design, operators, executors, and deployment patterns
Strong scripting skills in PowerShell
Proven track record of troubleshooting complex, distributed workflows and performing root-cause analysis
Experience building and managing monitoring solutions using tools such as Splunk, Datadog, and/or Prometheus/Grafana
Ability to partner with application owners, business analysts, and infrastructure teams to drive continuous improvements
Excellent communication skills with the ability to translate technical concepts for non-technical stakeholders
Commitment to the highest ethical standards
