Point72 is a leading global alternative investment firm reimagining the future of investing through advanced technology. The Scheduling Reliability Engineer will serve as a Subject-Matter Expert for enterprise scheduling platforms, ensuring system stability and high availability while developing automation solutions and collaborating with cross-functional teams.
Responsibilities:
- Serve as the Subject-Matter Expert (SME) for our enterprise scheduling platforms
- Maintain, tune, and upgrade the scheduling environment to ensure stability and high availability
- Develop and enhance automation solutions using PowerShell and other scripting languages to streamline workload orchestration
- Build, configure, and refine monitoring dashboards, alerts, and reports to track system health, throughput, and performance
- Lead incident response for high-priority scheduling failures, including troubleshooting, resolution, and root-cause analysis
- Define, establish, and report on SLIs, SLOs, and SLAs for critical business workflows
- Collaborate with cross-functional teams to onboard new workflows, optimize job dependencies, and implement best practices
- Create and maintain comprehensive documentation, runbooks, and training materials for end users and support teams
- Participate in a rotational on-call schedule to support 24/7 operations and critical incident management
Requirements:
- Bachelor's degree in computer science, engineering, or a related field, or equivalent work experience
- 5+ years of experience with enterprise scheduling or workload automation tools such as ActiveBatch, CA Workload Automation, Control-M, and/or Autosys
- 2+ years of experience in a site reliability, DevOps, or production support role with exposure to SLA management and SLI/SLO frameworks
- Hands-on expertise with ActiveBatch or similar workload automation tools including job scheduling, calendars, dependencies, security, and versioning
- Familiarity with Apache Airflow concepts, including DAG design, operators, executors, and deployment patterns
- Strong scripting skills in PowerShell
- Proven track record of troubleshooting complex, distributed workflows and performing root-cause analysis
- Experience building and managing monitoring solutions using tools such as Splunk, Datadog, and/or Prometheus/Grafana
- Ability to partner with application owners, business analysts, and infrastructure teams to drive continuous improvements
- Excellent communication skills with the ability to translate technical concepts for non-technical stakeholders
- Commitment to the highest ethical standards