
JP Phillips is off to greener, or at least calmer, pastures. He joined us 4 years ago to build the next generation of our orchestration system, and has been one of the anchors of our engineering team. His last day is today. We wanted to know what he was thinking, and figured you might too.
Question 1: Why, JP? Just why?
LOL. When I looked at what I wanted to see from here in the next 3-4 years, it didn’t really match up with where we’re currently heading. Specifically, with our new focus on MPG [Managed Postgres] and [llm] [llm].
Editorial comment: Even I don’t know what [llm] is.
The Fly Machines platform is more or less finished, in the sense of being capable of supporting the next iteration of our products. My original desire to join Fly.io was to make Machines a product that would rid us of HashiCorp Nomad, and I feel like that’s been accomplished.
Where were you hoping to see us headed?
More directly positioned as a cloud provider, rather than a platform-as-a-service; further along the customer journey from “developers” and “startups” to large established companies.
And, it’s not that I disagree with PaaS work or MPG! Rather, it’s not something that excites me in a way that I’d feel challenged and could continue to grow technically.
Follow up question: does your family know what you’re doing here? Doing to us? Are they OK with it?
Yes, my family was very involved in the decision, before I even talked to other companies.
What’s the thing you’re happiest about having built here? It cannot be “all of `flyd`”.
We’ve enabled developers to run workloads from an OCI image and an API call all over the world. On any other cloud provider, the knowledge of how to pull that off comes with a professional certification.
In what file in our `nomad-firecracker` repository would I find that code?
https://docs.machines.dev/#tag/machines/post/apps/{app_name}/machines
So you mean, literally, the whole Fly Machines API, and `flaps`, the API gateway for Fly Machines?
Yes, all of it. The `flaps` API server, the `flyd` RPCs it calls, the `flyd` finite state machine system, the interface to running VMs.
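Editorial sidebar: “an OCI image and an API call” is meant literally. Here’s a sketch in Go of what that one call looks like. The app name, token, image, and guest sizing below are placeholder values; the request shape follows the public Machines API docs.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// createMachineRequest builds the POST request that creates a Fly Machine.
// The body shape follows the public Machines API docs; the values passed
// in are illustrative placeholders, not anything from Fly's codebase.
func createMachineRequest(app, token, image string) (*http.Request, error) {
	body := map[string]any{
		"name": "example-machine",
		"config": map[string]any{
			"image": image,
			"guest": map[string]any{
				"cpu_kind":  "shared",
				"cpus":      1,
				"memory_mb": 256,
			},
		},
	}
	buf, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("https://api.machines.dev/v1/apps/%s/machines", app)
	req, err := http.NewRequest("POST", url, bytes.NewReader(buf))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := createMachineRequest("my-app", os.Getenv("FLY_API_TOKEN"), "registry-1.docker.io/library/nginx:latest")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Swap in a real app name and token, hand the request to `http.DefaultClient.Do`, and you have a VM running somewhere in the world.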
Is there something you especially like about that design?
I like that it for the most part doesn’t require any central coordination. And I like that the P90 for Fly Machine `create` calls is sub-5-seconds for pretty much every region except for Johannesburg and Hong Kong.
I think the FSM design is something I’m proud of; if I could take any code with me, it’d be the `internal/fsm` package in the `nomad-firecracker` repo.
You can read more about the `flyd` orchestrator JP led over here. But, a quick decoder ring: `flyd` runs independently, without any central coordination, on thousands of “worker” servers around the globe. It’s structured as an API server for a bunch of finite state machine invocations, where an FSM might be something like “start a Fly Machine” or “create a new Fly Machine” or “cordon off a Fly Machine so we can update it”. Each FSM invocation is composed of a bunch of steps; each of those steps has callbacks into the `flyd` code, and each step is logged in a BoltDB database.
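Editorial sidebar, continued: that shape is easier to see in code than in prose. A toy sketch follows; it is not actual `flyd` code. Real steps call back into orchestrator code and checkpoint to BoltDB, while this stand-in checkpoints to a map.

```go
package main

import "fmt"

// A Step is one unit of an FSM invocation, e.g. "reserve resources"
// or "boot VM". In real flyd, steps call back into orchestrator code
// and record completion durably in BoltDB; this sketch uses a map.
type Step struct {
	Name string
	Run  func() error
}

// FSM runs steps in order, checkpointing each completed step so a
// restarted process (say, after a deploy) resumes where it left off.
type FSM struct {
	Steps []Step
	Done  map[string]bool // stand-in for the durable step log
}

func (f *FSM) Resume() error {
	for _, s := range f.Steps {
		if f.Done[s.Name] {
			continue // completed before the restart; skip it
		}
		if err := s.Run(); err != nil {
			return fmt.Errorf("step %q: %w", s.Name, err)
		}
		f.Done[s.Name] = true // checkpoint
	}
	return nil
}

func main() {
	var ran []string
	fsm := &FSM{
		Steps: []Step{
			{"reserve", func() error { ran = append(ran, "reserve"); return nil }},
			{"boot", func() error { ran = append(ran, "boot"); return nil }},
		},
		// Pretend "reserve" finished before a deploy restarted us.
		Done: map[string]bool{"reserve": true},
	}
	if err := fsm.Resume(); err != nil {
		panic(err)
	}
	fmt.Println(ran) // only the unfinished step runs
}
```

The `Done` log is the whole trick: a redeployed process can pick up a half-finished invocation without redoing the steps that already happened.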
Thinking back, there are like two archetypes of insanely talented developers I’ve worked with. One is the kind that belts out ridiculous amounts of relatively sophisticated code on a whim, at like 3AM. Jerome [who leads our `fly-proxy` team] is that type. The other comes to projects with what feels like fully-formed, coherent designs that are not super intuitive, and the whole project just falls together around that design. Did you know you were going to do the FSM log thing when you started `flyd`?
I definitely didn’t have any specific design in mind when I started on `flyd`. I think the FSM stuff is a result of work I did at Compose.io / MongoHQ (where it was called “recipes”/“operations”) and the work I did at HashiCorp using Cadence.
Once I understood what the product needed to do and look like, having a way to perform deterministic and durable execution felt like a good design.
Cadence?
Cadence is the child of AWS Step Functions and the predecessor to Temporal (the company).
One of the biggest gains, with how it works in `flyd`, is knowing we would need to deploy `flyd` all day, every day. If `flyd` was in the middle of doing some work, it needed to pick back up right where it left off, post-deploy.
OK, next question. What’s the most impressive thing you saw someone else build here? To make things simpler and take some pressure off the interview, we can exclude any of my works from consideration.
Probably `corrosion2`.
Sidebar: `corrosion2` is our state distribution system. While `flyd` runs individual Fly Machines for users, each instance is solely responsible for its own state; there’s no global scheduler. But we have platform components, most obviously `fly-proxy`, our Anycast router, that need to know what’s running where. `corrosion2` is a Rust service that does SWIM gossip to propagate information from each worker into a CRDT-structured SQLite database. `corrosion2` essentially means any component on our fleet can do SQLite queries to get near-real-time information about any Fly Machine around the world.
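A sidebar on the sidebar: “CRDT-structured” is doing a lot of work in that sentence. `corrosion2`’s actual merge rules aren’t shown here, but the core intuition (gossiped updates converging to the same view regardless of delivery order) can be sketched with a last-write-wins map in a few lines of Go. Everything below is illustrative, not `corrosion2` code.

```go
package main

import "fmt"

// Update is a gossiped fact about a Machine: "machine X was in state S
// at version V". Versions are monotonically increasing per writer.
type Update struct {
	MachineID string
	State     string
	Version   uint64
}

// Merge applies an update with last-write-wins semantics: keep the
// highest-version fact seen for each machine. Because taking the max
// is commutative, associative, and idempotent, every node converges
// on the same view no matter what order gossip delivers updates in.
// (corrosion2's real CRDT is finer-grained than this toy.)
func Merge(view map[string]Update, u Update) {
	if cur, ok := view[u.MachineID]; !ok || u.Version > cur.Version {
		view[u.MachineID] = u
	}
}

func main() {
	a := map[string]Update{}
	b := map[string]Update{}
	updates := []Update{
		{"m1", "created", 1},
		{"m1", "started", 2},
		{"m2", "started", 1},
	}
	// Deliver the same updates in opposite orders to two "nodes".
	for _, u := range updates {
		Merge(a, u)
	}
	for i := len(updates) - 1; i >= 0; i-- {
		Merge(b, updates[i])
	}
	fmt.Println(a["m1"].State, b["m1"].State) // both converge to "started"
}
```

Put a view like that behind a SQLite table and you get the “just query SQLite” property: any reader sees a converged snapshot without talking to a coordinator.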
If for no other reason than that we deployed `corrosion`, learned from it, and were able to make significant and valuable improvements — and then migrate to the new system in a short period of time.
Having a “just SQLite” interface for changes replicated asynchronously around the world in seconds is pretty powerful.
If we invested in Antithesis or TLA+ testing, I think there’s potential for other companies to get value out of `corrosion2`.
Just as a general-purpose gossip-based SQLite CRDT system?
Yes.
OK, you’re being too nice. What’s your least favorite thing about the platform?
GraphQL. No, Elixir. It’s a tie between GraphQL and Elixir.
But probably GraphQL, by a hair.
That’s not the answer I expected.
GraphQL slows everyone down, and everything. Elixir only slows me down.
The rest of the platform, you’re fine with? No complaints?
I’m happier now that we have `pilot`.
`pilot` is our new `init`. When we launch a Fly Machine, `init` is our foothold in the machine; this is unlike a normal OCI runtime, where “pid 1” is often the user’s entrypoint program. Our original `init` was so simple people dunked on it and said it might as well have been a bash script; over time, `init` has sprouted a bunch of new features. `pilot` consolidates those features, and, more importantly, is itself a complete OCI runtime; `pilot` can natively run containers inside of Fly Machines.
Before `pilot`, there really wasn’t any contract between `flyd` and `init`. And `init` was just “whatever we wanted `init` to be”. That limited its ability to serve us.
Having `pilot` be an OCI-compliant runtime with an API for `flyd` to drive is a big win for the future of the Fly Machines API.
Was I right that we should have used SQLite for `flyd`, or were you wrong to have used BoltDB?
I still believe Bolt was the right choice. I’ve never lost a second of sleep worrying that someone is about to run a SQL update statement on a host, or across the whole fleet, and mangle all our state data. And limiting the storage interface, by not using SQL, kept `flyd`’s scope manageable.
On the engine side of the platform, which is what `flyd` is, I still believe SQL is too powerful for what `flyd` does.
If you had this to do over again, would Bolt be precisely what you’d pick, or is there something else you’d want to try? Some cool-ass new KV store?
Nah. But, I’d maybe consider a SQLite database per-Fly-Machine. Then the scope of danger is about as small as it could possibly be.
Whoah, that’s an interesting thought. People sleep on the “keep a zillion little SQLites” design.
Yeah, with per-Machine SQLite, once a Fly Machine is destroyed, we can just zip up the database and stash it in object storage. The biggest hold-up I have about it is how we’d manage the schemas.
OpenTelemetry: were you right all along?
One hundred percent.
I basically attribute oTel at Fly.io to you.
Without oTel, it’d be a disaster trying to troubleshoot the system. I’d have ragequit trying.
I remember there being a cost issue, with how much Honeycomb was going to end up charging us to manage all the data. But that seems silly in retrospect.
For sure. It was 100% part of the decision and the conversation. But: we didn’t have the best track record running a logs/metrics cluster at this fidelity. It was worth the money to pay someone else to manage tracing data.
Strong agree. I think my only issue is just the extent to which it cruds up code. But I need to get over that.
Yes, it’s very explicit. I think the next big part of oTel is going to be auto-instrumentation, for profiling.
You’re a veteran Golang programmer. Say 3 nice things about Rust.
Most of our backend is in Go, but `fly-proxy`, `corrosion2`, and `pilot` are in Rust.
- `Option`.
- `match`.
- Serde macros.
Even I can’t say shit about `Option` and `match`.
`match` is so much better than anything in Go.
Elixir, Go, and Rust. An honest take on that programming cocktail.
Three’s a crowd, Elixir can stay home.
If you could only lose one, you’d keep Rust.
I’ve learned its shortcomings, and the productivity far outweighs having to deal with the Rust compiler.
You’d be unhappy if we moved the `flaps` API code from Go to Elixir.
Correct.
I kind of buy the idea of doing orchestration and scheduling code, which is policy-intensive, in a higher-level language.
Maybe. If Ruby had a better concurrency story, I don’t think Elixir would have a place for us.
Here I need to note that Ruby is functionally dead here, and Elixir is ascendant.
We have an idiosyncratic management structure. We’re bottom-up, but ambiguously so. We don’t have roadmaps, except when we do. We have minimal top-down technical direction. Critique.
It’s too easy to lose sight of whether your current focus [in what you’re building] is valuable to the company.
The first thing I warn every candidate about on our “do-not-work-here” calls.
I think it comes down to execution, and accountability to actually finish projects. I spun a lot trying to figure out what would be the most valuable work for Fly Machines.
You don’t have to be so nice about things.
We struggle a lot with consistent communication. We change direction a little too often. It got to the point where I didn’t see the point in devoting time and effort to projects, because I wouldn’t be able to show enough value quickly enough.
I see things paying off later than we’d hoped or expected they would. Our secret storage system, Pet Semetary, is a good example of this. Our K8s service, FKS, is another obvious one, since we’re shipping MPG on it.
This is your second time working with Kurt, at a company where he’s the CEO. Give him a 1-4 star rating. He can take it! At least, I think he can take it.
2022: ★★★★
2023: ★★
2024: ★★✩
2025: ★★★✩
On a four-star scale.
Whoah. I did not expect a histogram. Say more about 2023!
We hired too many people, too quickly, and didn’t have the guardrails and structure in place for everybody to be successful.
Also: GPUs!
Yes. That was my next comment.
Do we secretly agree about GPUs?
I think so.
Our side won the argument in the end! But at what cost?
They were a killer distraction.
Final question: how long will you remain in the first-responder on-call rotation after you leave? I assume at least until August. I have a shift this weekend; can you swap with me? I keep getting weekends.
I am going to be asleep all weekend if any of my previous job changes are indicative.
I sleep through on-call too! But nobody can yell at you for it now. I think you have the comparative advantage over me in on-calling.
Yes I will absolutely take all your future on-call shifts, you have convinced me.
All this aside: it has been a privilege watching you work. I hope your next gig is 100x more relaxing than this was. Or maybe I just hope that for myself. Except: I’ll never escape this place. Thank you so much for doing this.
Thank you! I’m forever grateful for having the opportunity to be a part of Fly.io.