As a Senior Site Reliability Engineer (SRE) on our team, you will leverage platform engineering principles to ensure that Shippo's services are reliable, scalable, and performant.
You will be a hybrid software development and operations engineer, responsible for designing, building, and maintaining the infrastructure that supports our applications.
Your work will directly impact our ability to meet and exceed SLAs, and you will collaborate closely with other engineering teams to create services that are automatable, measurable, and resilient to failure.
Design, scale, and secure infrastructure to stay ahead of business needs through fault-tolerant architecture design; performance testing, profiling, and tuning; and capacity planning.
Design, build, deploy, and maintain automation, monitoring, and alerting systems, as well as design, implement, and test disaster recovery solutions.
Ensure scalability and maintainability through microservices adoption, decoupling of concerns and data models, job queuing, and application layering.
Enhance and maintain our CI/CD pipeline for smooth and safe production releases via automated testing and verification.
Verify and ensure performance and correctness of systems in response time and throughput.
Participate in peer reviews, testing, and design reviews for new features, products, and systems, and contribute to automated test suites.
Participate in an on-call rotation.
Requirements
Experience developing, managing and troubleshooting highly available distributed systems, including operational experience with Kubernetes in a production environment
Extensive expertise with at least one public cloud provider (AWS, GCP, Azure)
Exceptional verbal, written, and interpersonal communication skills
Interest in and understanding of best-in-class security practices, and automation and testing methods
Familiarity with configuration and maintenance of common infrastructure components such as Redis, Elasticsearch, and Hadoop
Deep understanding of customer needs and passion for customer success
BS or MS degree in Computer Science or equivalent experience
Bonus
Advanced knowledge of managing and optimizing PostgreSQL server configuration
3+ years of experience in software development
Experience with:
Managing service meshes (e.g. Istio)
Defining and monitoring Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) to ensure that systems meet reliability and performance targets
Monitoring tools such as New Relic, Prometheus, Grafana, and/or Datadog
OpenTelemetry for distributed tracing and metrics collection, including using it in production environments
Managing Python and Golang applications in production
Microservices architectures
DevOps tooling such as Docker, Terraform, Argo CD, Argo Workflows, CircleCI, GitHub Actions, New Relic, PagerDuty, etc.
AWS/Cloud services such as EKS, EC2, S3, Lambda, Route 53, CloudFront, Cloudflare, IAM, etc.
Tech Stack
AWS
Azure
Cloud
Distributed Systems
Docker
EC2
Elasticsearch
Google Cloud Platform
Grafana
Hadoop
Kubernetes
Microservices
Postgres
Prometheus
Python
Redis
Terraform
Go
Benefits
Healthcare coverage for medical, dental, and vision (90% covered by the company, incl. dependents).
Pet coverage is also available!
Take-as-much-as-you-need vacation policy & flexible working hours
One-week company-wide winter slowdown
3 Volunteer Days Off (VTOs)
WFH stipend to set up your home office
Charity donation match up to $100
Dedicated programs, coaching, tools, and resources for your professional and career growth as well as an individual learning stipend for your personal and focused growth
Fun in-person team time through our Shippos Everywhere program, which includes regular team and company off-sites as well as local Shippos gatherings throughout the year