AWS without Access Keys

A giant cartoon bird-person skeleton with points of red light in the eye sockets looms menacingly over a much smaller living bird-person who's fallen backward onto a giant book.
Image by Annie Ruygt

It’s dangerous to go alone. Fly.io runs full-stack apps by transmuting Docker containers into Fly Machines: ultra-lightweight hardware-backed VMs. You can run all your dependencies on Fly.io, but sometimes, you’ll need to work with other clouds, and we’ve made that pretty simple. Try Fly.io out for yourself; your Rails or Node app can be up and running in just minutes.

Let’s hypothesize you an app serving generative AI cat images based on the weather forecast, running on a g4dn.xlarge ECS task in AWS us-east-1. It’s going great; people didn’t realize how dependent their cat pic prefs are on barometric pressure, and you’re all anyone can talk about.

Word reaches Australia and Europe, but you’re not catching on, because the… latency is too high? Just roll with us here. Anyways: fixing this is going to require replicating ECS tasks and ECR images into ap-southeast-2 and eu-central-1 while also setting up load balancing. Nah.

This is the O.G. Fly.io deployment story; one deployed app, one versioned container, one command to get it running anywhere in the world.

But you have a problem: your app relies on training data, it’s huge, your giant employer manages it, and it’s in S3. Getting this to work will require AWS credentials.

You could ask your security team to create a user, give it permissions, and hand over the AWS keypair. Then you could wash your neck and wait for the blade. Passing around AWS keypairs is the beginning of every horror story told about cloud security, and your security team ain’t having it.

There’s a better way. It’s drastically more secure, so your security people will at least hear you out. It’s also so much easier on Fly.io that you might never bother creating an IAM service account again.

Let’s Get It out of the Way

We’re going to use OIDC to set up strictly limited trust between AWS and Fly.io.

  1. In AWS: we’ll add Fly.io as an Identity Provider in AWS IAM, giving us an ID we can plug into any IAM Role.
  2. Also in AWS: we’ll create a Role, give it access to the S3 bucket with our tokenized cat data, and then attach the Identity Provider to it.
  3. In Fly.io, we’ll take the Role ARN we got from step 2 and set it as an environment variable in our app.

Our machines will now magically have access to the S3 bucket.

What the What

A reasonable question to ask here is, “where’s the credential”? Ordinarily, to give a Fly Machine access to an AWS resource, you’d use fly secrets set to add an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to the environment in the Machine. Here, we’re not setting any secrets at all; we’re just adding an ARN — which is not a credential — to the Machine.

Here’s what’s happening.

Fly.io operates an OIDC IdP at oidc.fly.io. It issues OIDC tokens, exclusively to Fly Machines. AWS can be configured to trust these tokens, on a role-by-role basis. That’s the “secret credential”: the pre-configured trust relationship in IAM, and the public keypairs it manages. You, the user, never need to deal with these keys directly; it all happens behind the scenes, between AWS and Fly.io.

A diagram: STS trusts OIDC.fly.io. OIDC.fly.io trusts flyd. flyd issues a token to the Machine, which proffers it to STS. STS sends an STS cred to the Machine, which then uses it to retrieve model weights from S3.

The key actor in this picture is STS, the AWS Security Token Service. STS’s main job is to vend short-lived AWS credentials, usually through some variant of an API called AssumeRole. Specifically, in our case: AssumeRoleWithWebIdentity tells STS to cough up an AWS keypair given an OIDC token (that matches a pre-configured trust relationship).
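For the curious, here’s roughly what that call looks like on the wire. This is a minimal Python sketch, not what Fly.io or the AWS SDK actually runs: `build_sts_request` is a hypothetical helper, and the role name and token are placeholders. The interesting bit is what’s missing. Unlike most AWS calls, this one isn’t signed with existing AWS credentials; the OIDC token is the only proof of identity in the request.

```python
from urllib.parse import urlencode

STS_ENDPOINT = "https://sts.amazonaws.com/"  # global STS endpoint


def build_sts_request(role_arn: str, session_name: str, oidc_token: str,
                      duration_seconds: int = 900) -> bytes:
    """Build the form-encoded body for an AssumeRoleWithWebIdentity call."""
    params = {
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": oidc_token,
        # 900 seconds is the shortest session STS will issue.
        "DurationSeconds": str(duration_seconds),
    }
    return urlencode(params).encode("ascii")


if __name__ == "__main__":
    body = build_sts_request(
        "arn:aws:iam::123456123456:role/cat-bucket",  # hypothetical role name
        "weather-cat",
        "<OIDC token goes here>",
    )
    print(body.decode("ascii"))
```

POST that body to `STS_ENDPOINT` and, if the trust relationship checks out, the XML response carries a temporary access key, secret key, and session token.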

That still leaves the question: how does your code, which is reaching out to the AWS APIs to get cat weights, drive any of this?

The Init Thickens

Every Fly Machine boots up into an init we wrote in Rust. It has slowly been gathering features.

One of those features, which has been around for a while, is a server for a Unix socket at /.fly/api, which exports a subset of the Fly Machines API to privileged processes in the Machine. Think of it as our answer to the EC2 Instance Metadata Service. How it works is, every time we boot a Fly Machine, we pass it a Macaroon token locked to that particular Machine; init’s server for /.fly/api is a proxy that attaches that token to requests.

What’s neat about this, in addition to the API proxy being tricky to SSRF to, is that the credential that drives /.fly/api is doubly protected:

  1. The Fly.io platform won’t honor it unless it comes from that specific Fly Machine (flyd, our orchestrator, knows who it’s talking to), and
  2. Ordinary code running in a Fly Machine never gets a copy of the token to begin with.

You could rig up a local privilege escalation vulnerability and work out how to steal the Macaroon, but you can’t exfiltrate it productively.

So now you have half the puzzle worked out: OIDC is just part of the Fly Machines API (specifically: /v1/tokens/oidc). A Fly Machine can hit a Unix socket and get an OIDC token tailored to that machine:

{
  "app_id": "3671581",
  "app_name": "weather-cat",
  "aud": "sts.amazonaws.com",
  "image": "image:latest",
  "image_digest": "sha256:dff79c6da8dd4e282ecc6c57052f7cfbd684039b652f481ca2e3324a413ee43f",
  "iss": "https://oidc.fly.io/example",
  "machine_id": "3d8d377ce9e398",
  "machine_name": "ancient-snow-4824",
  "machine_version": "01HZJXGTQ084DX0G0V92QH3XW4",
  "org_id": "29873298",
  "org_name": "example",
  "region": "yyz",
  "sub": "example:weather-cat:ancient-snow-4824"
} // some OIDC stuff trimmed

Look upon this holy blob, sealed with a published key managed by Fly.io’s OIDC vault, and see that there lies within it enough information for AWS STS to decide to issue a session credential.
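If you want to poke at this yourself from inside a Machine, a rough sketch follows. It assumes the endpoint takes a POST with a JSON body naming the audience and answers with the raw JWT as the response body; `fetch_oidc_token` and `decode_claims` are hypothetical helper names. Note that `decode_claims` skips signature verification entirely, which is fine for looking and not for trusting.

```python
import base64
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self._socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self._socket_path)


def fetch_oidc_token(aud: str = "sts.amazonaws.com",
                     socket_path: str = "/.fly/api") -> str:
    """Ask init's API proxy for an OIDC token (only works in a Fly Machine)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        # Assumed request shape: JSON body carrying the desired audience.
        conn.request("POST", "/v1/tokens/oidc",
                     body=json.dumps({"aud": aud}),
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        if resp.status != 200:
            raise RuntimeError(f"token request failed: HTTP {resp.status}")
        return resp.read().decode("ascii").strip()
    finally:
        conn.close()


def decode_claims(jwt: str) -> dict:
    """Decode a JWT's payload segment WITHOUT verifying the signature."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))
```

Inside a Machine, `decode_claims(fetch_oidc_token())["sub"]` should hand back something shaped like the `sub` field above.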

We have still not completed the puzzle, because while you can probably now see how you’d drive this process with a bunch of new code that you’d tediously write, you are acutely aware that you have not yet endured that tedium — e pur si muove!

One init feature remains to be disclosed, and it’s cute.

If, when init starts in a Fly Machine, it sees an AWS_ROLE_ARN environment variable set, it initiates a little dance; it:

  1. goes off and generates an OIDC token, the way we just described,
  2. saves that OIDC token in a file, and
  3. sets the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_SESSION_NAME environment variables for every process it launches.

The AWS SDK, linked to your application, does all the rest.

Let’s review: you add an AWS_ROLE_ARN variable to your Fly App, launch a Machine, and have it go fetch a file from S3. What happens next is:

  1. init detects AWS_ROLE_ARN is set as an environment variable.
  2. init sends a request to /v1/tokens/oidc via /.fly/api.
  3. init writes the response to /.fly/oidc_token.
  4. init sets AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_SESSION_NAME.
  5. The entrypoint boots, and (say) runs aws s3api get-object.
  6. The AWS SDK runs through the credential provider chain.
  7. The SDK sees that AWS_WEB_IDENTITY_TOKEN_FILE is set and calls AssumeRoleWithWebIdentity with the file contents.
  8. AWS verifies the token against https://oidc.fly.io/example/.well-known/openid-configuration, which references a key Fly.io manages on isolated hardware.
  9. AWS vends STS credentials for the assumed Role.
  10. The SDK uses the STS credentials to access the S3 bucket.
  11. AWS checks the Role’s IAM policy to see if it has access to the S3 bucket.
  12. AWS returns the contents of the bucket object.
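Steps 6 and 7 are the part the SDK does for you. Here’s a simplified sketch of that one link in the chain. It is not the actual SDK code: `web_identity_request` is a hypothetical function, the fallback session name is made up, and the real SDKs also handle caching, refresh, and every other provider in the chain.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def web_identity_request(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return AssumeRoleWithWebIdentity parameters, or None to fall through
    to the next credential provider."""
    role_arn = env.get("AWS_ROLE_ARN")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        return None  # web identity isn't configured; try something else

    # Read the token at call time; this sketch does no caching.
    token = Path(token_file).read_text().strip()
    return {
        "RoleArn": role_arn,
        # Hypothetical fallback name for when no session name is provided.
        "RoleSessionName": env.get("AWS_ROLE_SESSION_NAME", "default-session"),
        "WebIdentityToken": token,
    }
```

The point is how little the application has to know: three environment variables and a file are the whole interface between init and the SDK.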

How Much Better Is This?

It is a lot better.

The credentials you get this way asymptotically approach the security properties of Macaroon tokens.

Most importantly: AWS STS credentials are short-lived. Because they’re generated dynamically, rather than stored in a configuration file or environment variable, they’re already a little bit annoying for an attacker to recover. But they’re also dead in minutes. They have a sharply limited blast radius. They rotate themselves, and fail closed.

They’re also easier to manage. This is a rare instance where you can reasonably drive the entire AWS side of the process from within the web console. Your cloud team adds Roles all the time; this is just a Role with an extra snippet of JSON. The resulting ARN isn’t even a secret; your cloud team could just email or Slack message it back to you.

Finally, they offer finer-grained control.

To understand the last part, let’s look at that extra snippet of JSON (the “Trust Policy”) your cloud team is sticking on the new cat-bucket Role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456123456:oidc-provider/oidc.fly.io/example"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.fly.io/example:aud": "sts.amazonaws.com"
                },
                "StringLike": {
                    "oidc.fly.io/example:sub": "example:weather-cat:*"
                }
            }
        }
    ]
}

The aud check guarantees STS will only honor tokens that Fly.io deliberately vended for STS.

Recall the OIDC token we dumped earlier; much of what’s in it, we can match in the Trust Policy. Every OIDC token Fly.io generates is going to have a sub field formatted org:app:machine, so we can lock IAM Roles down to organizations, or to specific Fly Apps, or even specific Fly Machine instances.
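To make the scoping concrete, here’s a small sketch that generates trust policies at each of those three levels. `sub_pattern`, `trust_policy`, and `sub_matches` are hypothetical helpers, and the account ID and names are the ones from the examples above. One caveat: `fnmatchcase` is only a rough local approximation of IAM’s StringLike (IAM understands `*` and `?`; fnmatch additionally treats `[...]` specially), so use it for intuition, not as an authorization check.

```python
from fnmatch import fnmatchcase
from typing import Optional


def sub_pattern(org: str, app: Optional[str] = None,
                machine: Optional[str] = None) -> str:
    """Build an org:app:machine pattern, wildcarding whatever is left out."""
    if machine and not app:
        raise ValueError("a machine-level pattern needs an app name")
    return f"{org}:{app or '*'}:{machine or '*'}"


def trust_policy(account_id: str, org: str, app: Optional[str] = None,
                 machine: Optional[str] = None) -> dict:
    """Trust policy letting matching Fly Machines assume the role."""
    issuer = f"oidc.fly.io/{org}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {f"{issuer}:aud": "sts.amazonaws.com"},
                "StringLike": {f"{issuer}:sub": sub_pattern(org, app, machine)},
            },
        }],
    }


def sub_matches(pattern: str, sub: str) -> bool:
    """Rough local stand-in for IAM's StringLike comparison."""
    return fnmatchcase(sub, pattern)
```

Calling `trust_policy("123456123456", "example", app="weather-cat")` reproduces the policy shown above; drop the `app` argument to trust the whole organization, or add `machine="ancient-snow-4824"` to pin the Role to a single Machine.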


And So

In case it’s not obvious: this pattern works for any AWS API, not just S3.

Our OIDC support on the platform and in Fly Machines will set arbitrary OIDC audience strings, so you can use it to authenticate to any OIDC-compliant cloud provider. It won’t be as slick on Azure or GCP, because we haven’t done the init features to light their APIs up with a single environment variable — but those features are easy, and we’re just waiting for people to tell us what they need.

For us, the gold standard for least-privilege, conditional access tokens remains Macaroons, and it’s unlikely that we’re going to do a bunch of internal stuff using OIDC. We even snuck Macaroons into this feature. But the security you’re getting from this OIDC dance closes a lot of the gap between hardcoded user credentials and Macaroons, and it’s easy to use — easier, in some ways, than it is to manage role-based access inside of a legacy EC2 deployment!