Execute Third Party Code in a Rails App

Train traveling past conspicuous warning signs on tracks leading into the distance
Image by Annie Ruygt

Imagine inviting random strangers from the internet to bring along their code and run it on your servers in a Rails app. Sounds like a security nightmare, doesn’t it? Where do you even start?

If you run into someone from Fly.io, they’ll probably mention “fast-booting VMs”, but what does that mean beyond faster deployments?

It turns out that when an entire machine can boot in 2 seconds or less, it becomes possible to boot a server from a Rails background job, analyze a stranger’s code within the confines of a virtual machine, and shut the machine down when the job is complete.

Sounds complicated, right? It is, but Fly.io built the Machines API to manage all that complexity so you can spend your time and energy sweating the details of your app.

The Problem

Inspecting or executing arbitrary code from third parties comes with a lot of risks.

First, there’s a threat to the application that has to run the code: attackers could introduce malicious code that exploits it to bring the app down, extract passwords, and so on.

Then, there’s a risk to the customers of such an application: they could be exposed to malicious code from other customers that aims to extract their data or intellectual property.

The former is mainly a security issue for the application’s operation, while the latter is business-critical because it undermines trust between customer and SaaS provider.

Luckily, Fly.io boasts a solution that provides a safe environment to deploy such workloads and is simple to manage: Fly Machines.

The Context

Attractor is a code quality analysis tool that relies on churn and complexity metrics to measure how tech debt evolves in a typical Ruby (on Rails) or JavaScript app.

At its heart lies a GitHub app that clones, inspects, and optionally runs third-party code. It conducts a static analysis and reports the results back to the main app.

Since any paying customer can connect any GitHub repository, cloning user code onto the main app’s machine could compromise the application (and customer data). We needed a way to inspect, and possibly run, that code safely.

The Solution

Attractor is a Ruby on Rails app deployed on Fly.io with

  • One process running the app server (puma)
  • Two worker (sidekiq) queues: default and sandbox
  • A Fly Machines app to create and run machines on the fly. Important: make sure you run this app in a private network for true isolation, as pointed out here.

When new code changes come in via a pull request, Attractor encapsulates the resulting workload in a SandboxRun model:

class SandboxRun < ApplicationRecord
  has_secure_token

  belongs_to :github_pull_request, class_name: "Github::PullRequest"

  after_create_commit :start

  def invalidate
    self.invalidated_at = Time.now.utc
  end

  def invalidate!
    invalidate
    save!
  end

  def invalidated?
    !!invalidated_at
  end

  private

  def start
    return if invalidated?

    SandboxRunJob.perform_later(self)
  end
end
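Stripped of Rails, the lifecycle this model encodes is small: each run gets an unguessable token, and a one-way invalidation flag retires it. The plain-Ruby sketch below is illustrative only; SandboxRunSketch and accepts? are hypothetical names, and SecureRandom.alphanumeric merely stands in for has_secure_token (which generates its own 24-character random token). The accepts? method mirrors the check the webhook controller performs later on.

```ruby
require "securerandom"

# Hypothetical plain-Ruby stand-in for the SandboxRun model (not Attractor code).
class SandboxRunSketch
  TOKEN_LENGTH = 24

  attr_reader :id, :token, :invalidated_at

  def initialize(id)
    @id = id
    # Stand-in for has_secure_token: a random, unguessable per-run secret
    @token = SecureRandom.alphanumeric(TOKEN_LENGTH)
    @invalidated_at = nil
  end

  # One-way switch: once set, the run is considered finished
  def invalidate!
    @invalidated_at = Time.now.utc
  end

  def invalidated?
    !!invalidated_at
  end

  # Mirrors the webhook controller: the token must match and the run
  # must not have been invalidated yet
  def accepts?(presented_token)
    !invalidated? && presented_token == token
  end
end

run = SandboxRunSketch.new(1)
run.accepts?(run.token) # => true
run.invalidate!
run.accepts?(run.token) # => false
```

Because invalidation is one-way, a token is useful for exactly one report, which is what lets the controller treat it as a short-lived credential.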

Once a SandboxRun record has been committed, it kicks itself off (unless it has already been invalidated) by enqueueing a SandboxRunJob:

class SandboxStartupError < StandardError; end

class SandboxRunJob < ApplicationJob
  queue_as :sandbox

  def perform(sandbox_run)
    @sandbox_run = sandbox_run

    boot_sandbox
  end

  private

  def boot_sandbox
    # Pass a Hash and let the :json request middleware encode it; this
    # avoids hand-written JSON (and its trailing-comma pitfalls)
    res_create = conn.post "apps/my-app-machines/machines", {
      name: "sandbox-machine-#{@sandbox_run.id}",
      config: {
        image: "my-sandbox-image:latest",
        guest: {
          memory_mb: 512,
          cpu_kind: "shared",
          cpus: 1
        },
        # never restart: each sandbox runs exactly once
        restart: {
          policy: "no"
        },
        # identify (and authenticate) the run when the sandbox reports back
        env: {
          SANDBOX_RUN_ID: @sandbox_run.id.to_s,
          SANDBOX_RUN_TOKEN: @sandbox_run.token
        }
      }
    }

    # abort processing if machine creation failed
    if res_create.status >= 400
      raise SandboxStartupError, res_create.body["error"]
    end

    @sandbox_run.fly_machine_id = res_create.body["id"]
    @sandbox_run.save!
  end

  def conn
    @conn ||= Faraday.new(
      url: ENV.fetch("FLY_API_URL", "http://_api.internal:4280/v1")
    ) do |conn|
      conn.request :authorization, "Bearer", ENV["FLY_API_TOKEN"]
      conn.request :json  # encode Hash bodies as JSON and set Content-Type
      conn.response :json # parse JSON responses into Hashes
    end
  end
end

This job boots a sandbox by issuing a POST request to the Machines API for the Fly app my-app-machines. The machine boots from a Docker image (my-sandbox-image:latest) that has to be present in your organization’s registry. It also receives two environment variables (SANDBOX_RUN_ID and SANDBOX_RUN_TOKEN) that identify the sandbox run. Critically, the restart policy is set to no, so a sandbox that exits (or crashes) is never restarted, which avoids endless restart loops.
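For reference, here is a stdlib-only sketch of the same request, assuming the app name, image, and internal API URL used in the job above. The helpers machine_create_body and build_create_request are hypothetical names. Serializing a Ruby Hash with JSON.generate, rather than interpolating values into a JSON string by hand, guarantees a well-formed body. The sketch only builds the request, since _api.internal is reachable only from inside the Fly private network.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed internal Machines API endpoint (same default as the job above)
FLY_API_URL = "http://_api.internal:4280/v1"

# Hypothetical helper: the machine config as a plain Hash
def machine_create_body(run_id, run_token)
  {
    name: "sandbox-machine-#{run_id}",
    config: {
      image: "my-sandbox-image:latest",
      guest: { memory_mb: 512, cpu_kind: "shared", cpus: 1 },
      restart: { policy: "no" }, # run once; never restart after exit
      env: {                     # env values are sent as strings
        SANDBOX_RUN_ID: run_id.to_s,
        SANDBOX_RUN_TOKEN: run_token
      }
    }
  }
end

# Hypothetical helper: an authenticated POST /apps/:app/machines request
def build_create_request(app, run_id, run_token, api_token)
  uri = URI("#{FLY_API_URL}/apps/#{app}/machines")
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  req["Authorization"] = "Bearer #{api_token}"
  req.body = JSON.generate(machine_create_body(run_id, run_token))
  req
end

req = build_create_request("my-app-machines", 42, "run-token", "fly-api-token")
# Sending it (from inside the Fly private network):
# api = URI(FLY_API_URL)
# Net::HTTP.start(api.hostname, api.port) { |http| http.request(req) }
```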

The logic that runs inside the sandbox is secondary here: it simply sends its results as a JSON payload in a POST request to an incoming webhooks controller:

class SandboxWebhooksController < ApplicationController
  # some details omitted
  before_action :authenticate_token!

  def create
    SandboxWebhook.create(data: JSON.parse(request.body.read)).process_async
    render json: {status: "OK"}, status: :created
  end

  private

  def authenticate_token!
    @sandbox_run ||= sandbox_run_from_token

    head :unauthorized unless @sandbox_run.present? && !@sandbox_run.invalidated?
  end

  def sandbox_run_from_token
    SandboxRun.find_by(token: token_from_header)
  end

  def token_from_header
    request.headers.fetch("Authorization", "").split(" ").last
  end
end

Note that the sandbox run is authenticated via the unique secure token we passed to the sandbox machine as an environment variable (SANDBOX_RUN_TOKEN). Optionally, further precautions can be taken to make this endpoint accessible only from the internal network.
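On the sandbox side, reporting back boils down to one authenticated POST. Below is a minimal Net::HTTP sketch, assuming a hypothetical webhook URL and a hypothetical results key in the payload; build_report_request and report! are illustrative names. The sandbox_run.id key matches what SandboxWebhook#process reads, and the token travels as a Bearer credential, the format token_from_header expects.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical sandbox-side helper: builds the webhook POST from the env
# vars SandboxRunJob passed to the machine
def build_report_request(webhook_url, env, results)
  req = Net::HTTP::Post.new(URI(webhook_url), "Content-Type" => "application/json")
  # The controller takes the last space-separated word of this header as the token
  req["Authorization"] = "Bearer #{env.fetch("SANDBOX_RUN_TOKEN")}"
  req.body = JSON.generate(
    "sandbox_run" => { "id" => env.fetch("SANDBOX_RUN_ID") }, # read by SandboxWebhook#process
    "results" => results                                      # hypothetical key
  )
  req
end

# Hypothetical sender; defaults to the real process environment
def report!(webhook_url, results, env: ENV)
  req = build_report_request(webhook_url, env, results)
  uri = URI(webhook_url)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(req)
  end
end
```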

class SandboxShutdownError < StandardError; end

class SandboxWebhook < ApplicationRecord
  # module includes omitted (`conn` is assumed to be a Faraday connection
  # configured like the one in SandboxRunJob)

  def process
    @sandbox_run = SandboxRun.find(data["sandbox_run"]["id"])
    # already processed: its machine is gone, so there is nothing to tear down
    return if @sandbox_run.invalidated?

    begin
      # process incoming payload

      @sandbox_run.invalidate!
    ensure
      # remove the machine even if processing raised
      teardown_sandbox
    end
  end

  private

  def teardown_sandbox
    # block until the machine has stopped; only then can it be destroyed
    conn.get "apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}/wait",
      {
        state: "stopped",
        instance_id: machine_instance_id
      }

    res_delete = conn.delete "apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}"

    if res_delete.status >= 400
      raise SandboxShutdownError, res_delete.body["error"]
    end

    res_delete
  end

  # the instance ID differs from the machine ID and has to be fetched separately
  def machine_instance_id
    res_machine = conn.get("apps/my-app-machines/machines/#{@sandbox_run.fly_machine_id}")

    res_machine.body["instance_id"]
  end
end

The actual payload processing happens in the SandboxWebhook model and isn’t of interest here. We do have to make sure, though, that the corresponding sandbox run is invalidated, so it isn’t processed a second time and its token is no longer accepted.

The more salient part of this model, for the purposes of this article, is tearing down the sandbox machine. We want to clean up after the sandbox has run; otherwise dangling machines would add to our bill. Before a machine can be destroyed, though, it has to be stopped. We wait for that via the special /wait endpoint, passing it the desired state and the machine’s instance ID.

Beware: the instance ID is different from the machine’s ID, which is why we have to call another endpoint to obtain it.

The /wait call blocks until the machine reaches the desired state. Afterwards we can destroy the machine, raising an error if the deletion fails.
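The whole teardown sequence can also be sketched with the standard library alone. In this hypothetical version, teardown_machine receives the HTTP transport as a callable, an assumption made so the sequence can run without a network: the callable takes a Net::HTTP request and returns the status and the parsed JSON body. The three calls mirror the model above: fetch the instance ID, wait for the stopped state, then delete.

```ruby
require "json"
require "net/http"
require "uri"

class SandboxShutdownError < StandardError; end

# Hypothetical helper; send_request.call(req) must return [status, parsed_body]
def teardown_machine(app, machine_id, send_request, base_url: "http://_api.internal:4280/v1")
  machine_uri = URI("#{base_url}/apps/#{app}/machines/#{machine_id}")

  # 1. /wait needs the instance ID, which differs from the machine ID
  _status, machine = send_request.call(Net::HTTP::Get.new(machine_uri))
  instance_id = machine.fetch("instance_id")

  # 2. Block until the machine reports the "stopped" state
  wait_uri = URI("#{machine_uri}/wait")
  wait_uri.query = URI.encode_www_form(state: "stopped", instance_id: instance_id)
  send_request.call(Net::HTTP::Get.new(wait_uri))

  # 3. Destroy the stopped machine and surface API errors
  status, body = send_request.call(Net::HTTP::Delete.new(machine_uri))
  raise SandboxShutdownError, body["error"] if status >= 400

  status
end

# Usage with a fake transport that records the calls in order
calls = []
fake = lambda do |req|
  calls << "#{req.method} #{req.path}"
  req.path.end_with?("/m-123") && req.method == "GET" ? [200, { "instance_id" => "inst-1" }] : [200, {}]
end
teardown_machine("my-app-machines", "m-123", fake)
```

Injecting the transport keeps the ordering easy to check; in production the callable would wrap Net::HTTP.start against the internal API with the Bearer token set.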

Wrap-up

To separate user code from our own application, we picked Fly Machines to run ephemeral, isolated workloads. We’ve shown a way to integrate these sandboxes, and the results they produce, into an idiomatic Rails workflow. Hopefully, an official Fly SDK for creating, starting, and destroying machines will eventually replace the verbosity of this API integration.