NinjaOne is passionate about building unified IT solutions that simplify the way IT organizations work. They are currently looking for a Senior Site Reliability Engineer to join their SRE team and help scale their products to millions of end-users by focusing on automation and observability.
Responsibilities:
- Diagnose and resolve complex application and infrastructure issues
- Participate in our 24x7 on-call rotation, Scrum ceremonies, and deployment planning
- Perform Root Cause Analysis (RCA) and provide recommendations for application teams
- Improve availability and reduce customer impact using industry-leading observability tools
- Ensure best-practice and security-minded architecture by influencing design decisions
- Create and maintain technical documentation and SOPs
- Develop software, scripts, or tooling to improve efficiency and reduce delivery time of applications and infrastructure
- Other duties as needed
Requirements:
- 10+ years' experience in DevOps and/or Site Reliability Engineering roles
- 3+ years' experience with an object-oriented language (preferably Java, .NET, or C++)
- Intermediate+ level Linux administration, scripting, and troubleshooting
- Demonstrable knowledge of observability tools (New Relic, Splunk, Datadog)
- Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route 53, Fargate, ALB/NLB distributions, etc.)
- Experience with cloud automation and infrastructure-as-code (IaC) toolsets, primarily CloudFormation but also including Terraform, Helm, and Ansible; CDK experience is a plus
- Good understanding of containers, Fargate, Kubernetes, and overall distributed microservice architectures
- Passion for automation, security, and self-service environments/portals
- Hands-on experience with CI/CD and SDLC (Software Development Life Cycle) processes
- Effective communication skills, both verbal and written