TwinStream focuses on delivering technical excellence and high-quality service to clients, particularly government organizations. The company is seeking an experienced Site Reliability Engineer to ensure the availability, performance, and cost-effectiveness of its services, collaborate with development teams, and improve infrastructure and delivery pipelines.
Responsibilities:
- Collaborate with Software Engineers to improve reliability and performance in their subsystems
- Partner with System Administrators to automate toil and eliminate unnecessary alerts
- Evolve observability and monitoring capabilities to identify and solve problems before they impact the business
- Support development environments to help us achieve our delivery and quality goals
- Research and evaluate technologies, tools and services to influence buy-vs-build decisions
- Develop expertise in diverse technical and business domains
- Expand your knowledge of the technical stacks used
Requirements:
- Experience using AWS
- Experience using modern configuration management tools (such as Ansible, Chef or similar)
- Experience working with Terraform
- Experience working with Docker containers and container orchestration tools (such as Kubernetes, OpenShift or Docker Swarm)
- Experience both using and maintaining CI/CD tools (such as Jenkins or similar)
- Experience with monitoring tools such as InfluxDB, Prometheus or Grafana
- Experience with event-driven integration using message queues (RabbitMQ or a similar AMQP solution)
- Good understanding of relational databases and SQL
- Experience with the Linux command line, administration and shell scripting
- Working knowledge of network security protocols
- Experience using, developing with and maintaining cloud hosting services (ideally AWS EC2, RDS, S3, Lambda)
- Industry experience writing well-tested code in one of our platform languages (Java, Go, Python or similar)
- Knowledge of cross-domain principles and technologies
- Experience working in a service management environment
- Practical experience applying observability patterns in previous systems
- Experience creating and monitoring system availability metrics and using them to drive work that reduces downtime