Fundraise Up is a global fundraising platform that aims to make donating to nonprofits fast and accessible. The role involves ensuring the stability, performance, and security of server infrastructure, focusing on hands-on management of on-premise systems and automation projects.
Responsibilities:
- Work primarily with on‑premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
- Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
- Own automation projects end‑to‑end (design → rollout → maintenance)
- Improve how we operate: harden and tune systems, and raise the team's standard of operational hygiene
- Keep the platform stable, fast, and secure: servers, web servers, databases, queues
- Investigate production incidents across the OS, networking, and infrastructure layers; apply temporary mitigations, coordinate with developers, and participate in post-mortems
- Participate in on‑call rotations
- Use AI in all aspects of day‑to‑day work: researching, troubleshooting, developing
- At Fundraise Up, AI is a default tool, not an experimental one. We expect every team member to actively use AI in their day-to-day work, identify where AI can change the shape of problems in their function, and grow their fluency as the tools evolve. You should already be using AI meaningfully in your work and understand where it adds value and how it can improve the way you operate
Requirements:
- 4+ years as a DevOps Engineer / SRE (or in a role with very similar responsibilities)
- Real, hands-on experience with servers (VMs, bare metal) at the OS level and below: configuring, troubleshooting, digging into 'why it's broken'
- Confident Linux skills (we use Ubuntu). We expect you to be comfortable with the core tools from Linux Crisis Tools
- Solid understanding of networking basics; ability to configure and troubleshoot iptables
- Ansible + Git
- Experience with Bash or Python scripting for automation/observability
- Production/on-call experience: diagnosing incidents, restoring service, participating in post-mortems
- Ownership and attention to detail. Downtime is expensive: five years ago, 10 minutes of downtime cost us $100k, and today it costs even more
- ClickHouse, MongoDB: what each database is used for, monitoring, troubleshooting performance and slow queries, sharding
- Kafka: operating clusters at scale (topic moves, broker replacements, tuning)
- Redis: high-load tuning, replication, sharding, performance monitoring
- Elasticsearch: configuration, scaling, sharding/cluster management
- HAProxy / Nginx: load balancing, SSL/TLS, caching, reverse proxying, performance monitoring
- OS tuning: kernel/network stack/filesystem parameters for high-load systems
- Full-disk encryption on LVM: we use Clevis + Tang in production
- Infrastructure Security: Teleport, HashiCorp Vault