Cell-based architecture

Previously know as shared-nothing-architecture, the concept was rediscovered and rebranded as cell-based by Amazon.

This blueprint shows you how to implement a scenario where each user of your application is assigned a new Machine. These Machines, by default, will automatically stop when not in use and be restarted when accessed, so the primary expense will be for volumes and actual usage.

Warning: if you follow this example, applications will be deployed without their databases being replicated. This exposes your application to volume and Machine failures. Backups become your responsibility. Some approaches to addressing this requirement are listed at the bottom of this page.

The example given here will be based on Rails, but the approach applies to all frameworks.

Start With a Single Machine Application

If you don’t have one, the following will do:

rails new blog --css tailwind
cd blog
bin/rails generate scaffold Post title:string body:text
fly launch --name blog-$USER-$RANDOM

Feel free to substitute a different name for your application.

Configure Your Routes

Modify config/routes.rb. Wrap your routes with:

    scope ENV.fetch("FLY_MACHINE_ID", "") do
...
end

  

If present, move get "up" line outside of this block.

Uncomment the root line.

Your final result will look something like this:

    Rails.application.routes.draw do
  scope ENV.fetch("FLY_MACHINE_ID", "") do
    resources :posts
    # Define your application routes per the DSL in https://guides.rubyonrails.org/routing.html

    # Defines the root path route ("/")
    root "posts#index"
  end

  # Reveal health status on /up that returns 200 if the app boots with no exceptions, otherwise 500.
  # Can be used by load balancers and uptime monitors to verify that the app is live.
  get "up" => "rails/health#show", as: :rails_health_check
end

  

What this will do is place your entire application, minus the health check endpoint, into a namespace when deployed to fly.io. Access to your application will need to be prefixed by your Machine id. When run in environments where the FLY_MACHINE_ID environment variable is not set, no prefix is required.

You can get a list of Machine ids using:

fly machines list -q

Starting a Second Machine

In order to prepare for a multiple Machine deployment, we are going to need to ensure that each request is routed to the correct Machine. We can accomplish this via middleware that makes use of the fly-replay response header.

Place the following into config/initializers/fly_router.rb:

    if ENV["FLY_MACHINE_ID"]
  class FlyInstanceRouter
    def initialize(app)
      @app = app
    end

    def call(env)
      segments = env["PATH_INFO"].split('/')
      instance = segments[1]

      if instance =~ /^[0-9a-f]{14}$/ and instance != ENV["FLY_MACHINE_ID"]
        return [409, {"Fly-Replay" => "instance=#{instance}"}, [""]]
      end

      @app.call(env)
    end
  end

  Rails.application.config.middleware.use(FlyInstanceRouter)
end

  

Deploy this change using fly deploy, then use the following command to create a second Machine:

fly machine clone

At this point, you have two instances of the blog application, each on its own Machine, with its own volume and sqlite3 database.

Note that fly machine clone has a number of options that can be used to specify such things as the region, vm sizes, and whether the volume requires a unique zone.

Regions, in particular, can enable you to put both applications and data close to users.

Optimizing your routes

This section is optional, and depends on a unique feature of Rails’ Hotwire implementation.

At this point, many requests will end up making two hops: first to a nearby Machine then to the desired Machine. This increases latency of responses, particularly when it is necessary to wake two Machines to process the first request.

Not much can be done to avoid this for the first request, but subsequent requests can route to the correct Machine using the fly-force-instance-id request header.

Place the following into app/javascript/controllers/fly_router_controller.js:

    import { Controller } from "@hotwired/stimulus"

// Connects to data-controller="fly-router"
export default class extends Controller {
  connect() {
    console.log('fly router connected')
    this.instance = this.element.dataset.instance;

    document.documentElement.addEventListener(
     'turbo:before-fetch-request',
     this.beforeFetchRequest
    )
  }

  disconnect() {
    document.documentElement.removeEventListener(
     'turbo:before-fetch-request',
     this.beforeFetchRequest
    )
  }

  beforeFetchRequest = event => {
    console.log('injecting ' + this.instance);
    event.detail.fetchOptions.headers['fly-force-instance-id'] = this.instance;
  }
}

  

Finally, modify the <body> line in app/views/layouts/application.html.erb:

      <body<% if ENV['FLY_MACHINE_ID'] %> data-instance="<%= ENV['FLY_MACHINE_ID'] %>" data-controller="fly-router"<% end %>>>

  

Deploy this change using fly deploy.

Next steps

The above merely amounts to a proof of concept. A production application will likely make use of one or more of the following:

While putting the Machine id into the URL path is effective, it isn’t very ergonomic. A better solution would be have a registry of paths and their associated Machines.
Implementing routing in middleware gets the job done, but separating this out to a separate process using an application like nginx can have a number of advantages:
- fly-replay is limited to request payloads of one megabyte or less, which would affect features like file uploads. A reverse proxy might be a better solution.
- support for custom subdomains as an alternative to path namespaces and scopes.
- authentication and serving static assets can be performed outside of your application.
- Add-ons like Phusion Passenger can enable you to run multiple instances of your application in one Machine.
Monitoring becomes more crucial when you have hundreds or even dozens of independent Machines. Fly.io will combine your logs and while this addresses many problems, it doesn’t help deal with logs that are missing due to application or network problems. Consider having applications emit a heartbeat, and write monitoring software that looks for missing heartbeats and reports them as issues. Sentry can help here.
While applying updates when all of the services for a single Machine are self contained is an easier problem then upgrading potentially interdependent services running live in production, it does come at a cost: a blue/green deployment is not possible so a focus on reducing startup times is necessary.
Consider building an administrative web based interface for your application. User registration and Machine assignments are likely coupled in your workflow. While Fly.io provides a Machines API, you can go a long way with old-fashioned scripting using the flyctl command.

See Showcase Architecture for an example of an application running in production that implements many of these ideas.

Backups

As mentioned above, backups are crucial. Items to explore:

LiteFS - Distributed SQLite. By itself, it supports failover situations naturally. Additional disaster recovery options are available.
Sqlite3 databases are just files. Build a separate application to host backups and have your application periodiacally POST copies there.
Rsync is a utility available with Linux distributions that can be used to efficiently copy changes between Machines.
Run multiple Machines per cell so that you get the full benefits of traditional high availability configurations. If you have implemented an admin web UI, you can automate the deployment of new clusters easily.

Report an issue or edit this page on GitHub