OneTrust’s mission is to enable innovation through the responsible use of data and AI. They are seeking a Senior Software Engineer to contribute to all phases of the development lifecycle, ensuring high availability and performance of mission-critical applications.
Responsibilities:
- Contribute to all phases of the development lifecycle
- Write well-designed, testable, efficient code
- Ensure designs comply with specifications
- Own your code in production, responding to incidents as they occur and participating in retrospectives to identify improvements
- Prepare and produce releases of software components
- Support continuous improvement by investigating alternative approaches and technologies and presenting them for architectural review
- Engage and partner with various Engineering, Operations, and Product teams to design, deliver, and maintain a highly available and performant application platform
- Build and implement application observability and platform monitoring tools to continuously improve the customer experience
- Eliminate toil by automating processes, tuning alerts, and improving code where it is most needed
- Frequently evaluate new ideas and trends to identify potentially useful tools and techniques
- Collaborate with different functional groups to identify gaps, prioritize, and resolve issues
- Define, implement, and maintain SLIs and SLOs aligned with customer experience
- Design and instrument SLIs such as latency, error rates, and availability across critical services
- Manage and enforce error budgets to balance system reliability with product feature velocity
- Improve alert quality by reducing noise and focusing on actionable, high-signal alerts
- Embed with product teams to review architectures and catch reliability risks early
- Share your knowledge and experience with the Engineering organization
- Share your findings with technical leadership and senior management
- Build scripts in Python, Bash, Java, or Ruby for operational automation and incident response
Requirements:
- BE/BTech/MS degree in Computer Science Engineering or a related subject
- Experience in software application development using Java, Spring, and Hibernate
- Experience with Spring Boot and microservices is a plus
- Strong knowledge of algorithms, data structures, and design patterns
- Experience with SQL and NoSQL technologies
- Sound understanding of RESTful service concepts
- Solid understanding of, and experience with, application server and middleware technologies
- Familiarity with Unix/Linux environments and OS fundamentals
- Bachelor's degree in Computer Science, Engineering, or a related technical or business field
- 4+ years of application development experience with Java or an equivalent language
- Experience with the Spring environment
- Experience with cloud-based infrastructure (Azure, AWS, GCP, etc.)
- Experience with the factors that affect software application performance at different levels
- An understanding of the importance of centralized logging, metrics dashboards, and alerting
- Good working knowledge of databases (ideally SQL/NoSQL)
- Hands-on experience with observability tools (Datadog, Prometheus, Grafana, etc.)
- Knowledge of CI/CD pipelines and infrastructure-as-code (Terraform, Helm, Jenkins, GitLab)
- Experience building and operating AI-assisted incident response systems (root cause analysis, log summarization, anomaly triage)
- Experience developing or integrating LLM-based tools to reduce MTTR and improve alert quality
- Experience applying machine learning techniques for anomaly detection, capacity prediction, or failure pattern analysis
- Experience deploying AI systems in production (not just experimentation)
- Knowledge of vector databases, embeddings, or RAG architectures for operational intelligence
- Strong grasp of prompt engineering and of evaluating LLM outputs in reliability workflows
- Experience with Kubernetes and container orchestration (EKS/AKS/GKE)
- Experience with distributed systems at scale
- Familiarity with service meshes and microservices architectures
- Experience with chaos engineering tools (Gremlin, Chaos Monkey)
- Background in high-traffic, product-facing services
- Understanding of incident management platforms, including PagerDuty for alerting and Datadog for monitoring