Site Reliability Engineer

Now Hiring
Intern
Level 1
Level 2
Senior
Work From

Fly.io is a platform that takes container images and converts them into fleets of Firecracker VMs running on our own hardware around the world. It’s then easy to run applications near users, whether they’re in Singapore, Seattle, or São Paulo. We’re also just a very pleasant place to deploy to. Try it out; if you’ve got a working container already, it can be running here in less than 10 minutes.

Anyways, we’re hiring for our SRE team.

You can read that introduction and get a good idea of how intense our SRE challenge is. To further set the scene, two important true things about ops at Fly.io: ops is a very big deal here, and, because we’re a small team where everyone does ops work, there’s a lot of room for ideas on how to do ops better.

Our platform stack is, of course, Linux, plus the HashiCorp stack (Nomad, Consul, and Vault), plus Firecracker (Amazon’s Rust micro-VM engine), plus WireGuard (Jason Donenfeld’s amazing lightweight VPN), which is what our network fabric is built on. Our users drive Fly.io through a Rails-based GraphQL API. We host a heavy-duty Prometheus-style metrics cluster, an Elasticsearch cluster for logging, a monitoring system using Sensu Go, BGP4 peering with BIRD… the list goes on.

Importantly: while we need you to be comfortable working on all of this stuff, we don’t need you to know all of it coming in the door.

We think Fly.io is pretty ambitious, and often fun to work on. We think there are people out there for whom the idea of keeping a system like ours running smoothly is a cool problem. Some things you might want to know about us:

  • We’re a small team, almost entirely technical, and product engineering and ops are tightly integrated, which isn’t something we’re looking to change. We don’t have a culture of developers throwing things over the wall for ops to keep running.
  • We’re remote, with team members in Colorado, Quebec, Chicago, London, Virginia and Utah.
  • We’re an unusually public team, with an online community (at community.fly.io) that we try to be chatty with; you’d want to be comfortable not working secretively in a dark room (you can work noisily in a dark room if that’s your thing).
  • We’re a team, not a family, but we have families and want to be the kind of place where work doesn’t get in the way of that.
  • We have an on-call rotation, because we’re a platform. We all share it, and will for the foreseeable future.
  • We’re all developers, and we’re all doing our own ops (Steve owns ops, but is gradually being sucked into product development, since any big ops project we do is something we’re going to try to figure out how to turn into a feature, which is something else you might want to be aware of).
  • We’re a real company – we hope that goes without saying – and this is a real, according-to-Hoyle full-time job with health care for US employees, flexible vacation time, hardware/phone allowances, the standard stuff.

Fly.io is weird about hiring. We’re skeptical both of resumes and of interviews. We respect career experience but try not to be hypnotized by it, and are actively excited by the prospect of discovering new talent. Here is a collection of hats we need you to be OK with wearing:

  • The hat where you manage to get a fairly large fleet of servers running on a new kernel configuration while absolutely minimizing downtime for apps in any particular location.
  • The hat where you get metric and log alerting configured so that Kurt gets paged reliably when something goes wrong, while at the same time making sure that everything that pages Kurt is actionable.
  • The hat where you build a process for quickly shipping new features we build to prod with canaries or blues and greens or whatever the cool kids are doing, because some of what we deploy right now is scary enough to slow us down a bit.
  • The hat where you can keep enough of Linux networking’s idiosyncrasies in your head to diagnose problems, especially when we’ve managed to BPF in new dumb idiosyncrasies.
  • The hat where you turn up new data centers for us in random parts of the world, so our users can deploy applications close to penguins or castles or really good muffuletta sandwiches.

You want to be comfortable coding in some programming language (Python and Ruby are fine; if you don’t know Go already, you probably will soon after joining).

There are other hats, too. But if these sound like hats you’d be happy to wear, here’s our process:

  1. You’ll reach out and let us know a little about yourself (you could send a resume if you like). In your mail, tell us a story about any computer disaster you’ve survived — whatever comes to mind!
  2. We’ll invite you to a chat to pitch the company and answer all your questions about both the company and our hiring process.
  3. If you’re sold on the gig after that, we’ll get you some lightweight ops challenge problems, and any information you need to shine on them (we don’t do gotcha problems). You’ll do these on your own time, in whatever snippets of time you have available — we’re not looking over your shoulder — and we’re aiming for this to eat substantially less time than a typical tech company interview gauntlet would.
  4. If, after doing all this, it’s clear that this is the kind of work you want to be doing, we’ll do a larger paid project ($1000 flat rate) that is representative of the kind of work we do.
  5. We’ll evaluate all the paid projects we get over a 3-4 week period and let you know how things went.
  6. We’ll set up another video chat and try to convince you not to join. It’ll be fun!
  7. We’ll get you an offer.

We’re biased but think this is a pretty swell role, doing visible work for an appreciative and enthusiastic user base. We’d be thrilled if you reached out to ask about it. You can’t waste our time!

Interested? Email us at jobs+servers@fly.io.