Fundraise Up is a global fundraising platform that aims to make donating to nonprofits fast and accessible. The role involves ensuring the stability, performance, and security of server infrastructure, focusing on hands-on management of on-premise systems and automation projects.

Responsibilities:

Work primarily with on‑premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
Own automation projects end‑to‑end (design → rollout → maintenance)
Improve how we operate: harden and tune systems, and raise the team's operational hygiene
Keep the platform stable, fast, and secure: servers, web servers, databases, queues
Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers, and participate in post‑mortems
Participate in on‑call rotations
Use AI in all aspects of day‑to‑day work: researching, troubleshooting, developing
At Fundraise Up, AI is a default tool, not an experimental one. We expect every team member to actively use AI in their day-to-day work, identify where AI can change the shape of problems in their function, and grow their fluency as the tools evolve. You should already be using AI meaningfully in your work and understand where it adds value and how it can improve the way you operate.

Requirements:

4+ years as a DevOps Engineer / SRE (or very close responsibilities)
Real, hands-on experience with servers (VMs, bare metal) at the OS level and below: configuring, troubleshooting, digging into 'why it's broken'
Confident Linux skills (we use Ubuntu). We expect you to be comfortable with the core tools from Linux Crisis Tools
Solid understanding of networking basics; ability to configure and troubleshoot iptables
Ansible + Git
Experience with Bash or Python scripting for automation/observability
Production/on-call experience: diagnosing incidents, restoring service, participating in post-mortems
Ownership and attention to detail. Downtime is expensive: five years ago, 10 minutes of downtime cost us $100k — today it's even more
ClickHouse, MongoDB: what each database is used for, monitoring, troubleshooting performance and slow queries, sharding
Kafka: operating clusters at scale (topic moves, broker replacements, tuning)
Redis: high-load tuning, replication, sharding, performance monitoring
Elasticsearch: configuration, scaling, sharding/cluster management
HAProxy / Nginx: load balancing, SSL/TLS, caching, reverse proxying, performance monitoring
OS tuning: kernel/network stack/filesystem parameters for high-load systems
Full disk encryption on LVM: we use Clevis + Tang in production
Infrastructure Security: Teleport, HashiCorp Vault

DevOps Engineer / SRE