Fundraise Up is a global fundraising platform dedicated to making donations to nonprofits seamless and accessible. The DevOps Engineer/SRE will be responsible for the stability, performance, and security of server infrastructure, working primarily with on-premise systems and driving automation projects.
Responsibilities:
- Work primarily with on‑premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
- Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
- Own automation projects end‑to‑end (design → rollout → maintenance)
- Improve how we operate: harden and tune systems, and raise the team's operational hygiene
- Keep the platform stable, fast, and secure: servers, web servers, databases, queues
- Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers, and participate in post‑mortems
- Participate in on‑call rotations
- Use AI in all aspects of day‑to‑day work: researching, troubleshooting, developing
Requirements:
- 4+ years as a DevOps Engineer / SRE (or in a role with very similar responsibilities)
- Real, hands-on experience with servers (VMs, bare metal) at the OS level and below: configuring, troubleshooting, digging into 'why it's broken'
- Confident Linux skills (we use Ubuntu). We expect you to be comfortable with the core tools from Linux Crisis Tools
- Solid understanding of networking basics; ability to configure and troubleshoot iptables
- Ansible + Git
- Experience with Bash or Python scripting for automation/observability
- Production/on‑call experience: diagnosing incidents, restoring service, participating in post‑mortems
- Ownership and attention to detail. Downtime is expensive: five years ago, 10 minutes of downtime cost us $100k, and today it costs even more
- ClickHouse, MongoDB: what each database is used for, monitoring, troubleshooting performance and slow queries, sharding
- Kafka: operating clusters at scale (topic moves, broker replacements, tuning)
- Redis: high‑load tuning, replication, sharding, performance monitoring
- Elasticsearch: configuration, scaling, sharding/cluster management
- HAProxy / Nginx: load balancing, SSL/TLS, caching, reverse proxying, performance monitoring
- OS tuning: kernel/network stack/filesystem parameters for high‑load systems
- Full disk encryption on LVM (we use Clevis + Tang in production)
- Infrastructure Security: Teleport, HashiCorp Vault
- VictoriaMetrics and how it differs from the Prometheus stack
- Complex CI/CD pipelines. We use scripted Jenkins pipelines
- Bare‑metal Kubernetes: provisioning, networking (MetalLB or alternatives), isolation from the internet, scaling across providers (like OVH, Hetzner) and integration with existing infrastructure
- Flux and GitOps
- Terraform