Design and implement scalability, reliability, and observability strategies for cloud and on-premises environments
Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Review and analyze complex, large-scale technology solutions against tactical and strategic business objectives, the enterprise technology environment, and technical challenges that require in-depth evaluation of multiple factors, including intangible or unprecedented ones
Drive adoption of non-functional requirements (NFRs) and best practices for quality and compliance across observability and performance engineering
Ensure high availability and performance of production systems through proactive monitoring and incident response
Collaborate and consult with key technical experts, the senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects and teams, or serve as a peer mentor
Requirements
5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience leading observability and monitoring tooling (Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry)
5+ years of infrastructure support experience (Windows and Linux)
5+ years of proven success in toil reduction initiatives
5+ years of experience in cloud application management, especially on OpenShift Container Platform
5+ years of experience in SRE, public and private cloud technologies, Java performance tuning, and capacity optimization for mission-critical applications
Working knowledge of multiple programming languages and related technologies (e.g., Java, JavaScript, Ruby, Python, JSON, Angular, Node.js)
Hands-on experience with cloud and platform technologies such as AWS, PCF, PKS, Kubernetes, OpenShift, Linux, Azure, Windows, and VMware
Strong verbal, written, and interpersonal communication skills for effective collaboration across teams
Ability to engage with and influence stakeholders at various organizational levels