Hewlett Packard Enterprise is a global edge-to-cloud company that helps organizations connect, protect, analyze, and act on their data. They are seeking a Site Reliability Engineer to enhance their production environment for rapid scaling and outstanding performance, focusing on maintaining system uptime and reliability while collaborating with software developers.
Responsibilities:
- Apply your passion for infrastructure as code and continuous deployment to build scalable, highly reliable systems
- Define and own KPIs around system availability, quality and scale
- Partner with our developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems
- Ensure system availability and business continuity by implementing redundant servers/services
- Manage after-hours infrastructure updates and maintenance
- Proactively research and propose the use of new concepts, processes, technologies, and tools
- Partner with software developers to create Mist standards for Microservices (APIs, schemas, serialization, data stores and best practices)
- Run secure, scalable applications in highly available, multi-region AWS and GCP deployments
- Ship code several times per week
- Be a part of our On-Call rotation
- Own disaster recovery and business continuity plans
Requirements:
- An extensive background in developing and operating large-scale cloud-based distributed applications
- Direct experience developing/running applications on AWS or Google Cloud
- Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, security, software maintainability, and operational excellence
- The ability to 'fix the plane while in flight' (not just support greenfield solutions)
- The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off
- Experience delivering web-scale infrastructure for a global market at high release velocity
- A deep understanding of distributed system design and dependency management
- Solid experience with at least two of the following languages: Go, Java, Python
- 10+ years of industry experience managing infrastructure
- 5 years of Kubernetes administration in a large-scale SaaS environment
- 5 years of maintaining production systems on AWS or GCP
- 3 years of implementing, managing, and monitoring metrics specific to SaaS applications
- 3 years of using infrastructure as code software (e.g., Terraform, AWS CloudFormation, Google Cloud Deployment Manager)
- 5 years of experience with continuous integration practices and tools (Jenkins, Travis CI, CircleCI, etc.)
- Experience with Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis, ZooKeeper, Nginx, Airflow
- Experience working with or contributing directly to open source projects
- Understanding of, and experience in, leading or managing technology products
- An understanding of machine learning techniques and tools, and the ability to translate business requirements into data models and implement them in scalable, production-ready systems
- Experience with failure-based testing
- Experience working in a test-driven development environment