Multiple Logs for Resiliency

Hooking into NATS to capture logs
Image by Annie Ruygt

You’ve done everything right. You are well aware of Murphy’s Law. You have multiple redundant machines. You’ve set up a regular backup schedule for your database, perhaps you’re even using LiteFS Cloud. You ship your logs to Logtail or some other provider so you can do forensic analysis should anything go wrong…

Then the unexpected happens. A major network outage causes your application to misbehave. What’s worse, your logs are missing crucial data from this point, perhaps because of the same network outage. Maybe this time you are lucky and can find the data you need in the copies of your logs available via flyctl logs or the Monitoring tab on the Fly.io dashboard before they disappear forever.

So, what is going on here? Let’s look at the steps. Your application writes logs to STDOUT. Fly.io will take that output and send it to NATS. The Log Shipper will take that data and hand it to Vector. From there it is shipped to your third party logging provider. That’s a lot of moving parts.

All that is great, but just as you have redundant machines in case of failures, you may want redundant logs in addition to the ones Fly.io and the log shipper provide. Below are two strategies for doing just that. You can use either or both, and best of all, the logs you create will be in addition to your existing logs.

Logging to multiple places

The following approach is likely the most failsafe, but often the least convenient: have your primary application on each machine write to a separate log file in addition to standard out. This does mean that when you need this data you will have to fetch it from each machine, and it will likely be rather raw. But at least it will be there, even in the face of network failures.

For best results, put these logs on a volume so that they survive a restart, and be prepared to rotate them as they grow so that they don’t eventually fill up that volume.

This approach is necessarily framework specific, but most frameworks provide some ability to do this. A Rails example:

logger = ActiveSupport::Logger.new(STDOUT)
logger.formatter = config.log_formatter
# also log to a file on the volume, keeping up to 3 rotated files
volume_logger = ActiveSupport::Logger.new("/logs/production.log", 3)
logger = logger.extend ActiveSupport::Logger.broadcast(volume_logger)

You probably already have the first two lines in your config/environments/production.rb file. Adjust and add the last two lines. That’s it! You now have redundant logs.

See the Ruby Logger documentation for details on how to handle log rotation.
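For example, both size-based and time-based rotation are built into Logger’s constructor. This is a sketch: the file names are relative here so it runs anywhere, while in the setup above they would live under the volume (e.g. /logs/production.log), and the limits are illustrative:

```ruby
require "logger"

# Keep up to 3 old files, rotating once the current file reaches ~1 MB.
# ActiveSupport::Logger accepts the same arguments.
size_logger = Logger.new("production.log", 3, 1024 * 1024)
size_logger.info("rotated by size")

# Or rotate by period instead: "daily", "weekly", or "monthly".
daily_logger = Logger.new("daily.log", "daily")
daily_logger.info("rotated daily")
```

With size-based rotation, old generations are renamed production.log.0, production.log.1, and so on, and the oldest is deleted once the limit is reached.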

Other frameworks provide similar facilities; check your framework’s logging documentation for broadcast or multi-destination support.

Custom log shipper

This approach is less bulletproof but may produce more immediately usable results. Instead of using the Log Shipper, Vector, and a third party, you can subscribe directly to NATS and process log entries yourself.

What you are going to want is a separate app running on a separate machine, so that it doesn’t go down when there are problems with the machine you are monitoring, or while you are deploying a new version. If the code you write will be writing to disk, you will also want a volume.
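A sketch of what that separate app’s fly.toml might contain, assuming a volume named logs mounted at the path the example below writes to; the app and volume names are placeholders:

```toml
# Hypothetical fly.toml for the log-consumer app
app = "my-log-consumer"

[mounts]
  source = "logs"
  destination = "/log"
```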

Also, as with the log shipper, you will want to set the following secret:

fly secrets set ACCESS_TOKEN=$(fly auth token)

Here’s a minimal JavaScript example that can be run using Node or Bun:

import { connect, StringCodec } from "nats";
import fs from 'node:fs';

// tailor these two constants for your needs
const LOG_FILE = "/log/production.log";
const ORGANIZATION = "your-organization-name";

// create a connection to a nats-server
const nc = await connect({
  servers: "[fdaa::3]:4223",
  user: ORGANIZATION,
  pass: process.env.ACCESS_TOKEN
});

// open log file
const file = fs.openSync(LOG_FILE, 'a+');

// create a codec
const sc = StringCodec();

// create a simple subscriber and iterate over messages
// matching the subscription
const sub = nc.subscribe("logs.>");
for await (const msg of sub) {
  const data = JSON.parse(sc.decode(msg.data));

  // build log file entry
  const log = [
    data.timestamp.padEnd(30),
    `[${data.fly.app.instance}]`,
    data.fly.region,
    `[${data.log.level}]`,
    data.message
  ].join(' ') + "\n";

  // write entry to disk
  fs.write(file, log, error => {
    if (error) console.error(error);
  });
}

The above is pretty straightforward. It connects to NATS, opens a file, subscribes to logs, parses each message, and writes selected data to disk. This example is in JavaScript, but feel free to reimplement the basic approach in your favorite language; NATS supports plenty.
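One refinement to that parse-and-write loop is to skip low-value entries before they reach the disk. A sketch, assuming the message shape shown above; the level names are assumptions, so check what your apps actually emit:

```javascript
// Rank log levels so entries below a threshold can be skipped.
const LEVELS = { debug: 0, info: 1, warn: 2, error: 3 };

// Decide whether a parsed NATS message is worth persisting.
function shouldKeep(data, minLevel = "warn") {
  const rank = LEVELS[data.log?.level];
  // keep anything with an unknown level rather than silently dropping it
  if (rank === undefined) return true;
  return rank >= LEVELS[minLevel];
}
```

Inside the loop, you would guard the write with something like if (shouldKeep(data)) before building the log entry.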

Things to watch out for: you don’t want recursive errors when exceptions occur during a write. You will want to capture errors and reconnect to NATS when the connection goes down. You may even want to filter messages. A more complete example implementing a number of these features can be found here.
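The reconnect logic can be sketched as a retry wrapper. Here consume() is a hypothetical name standing in for the connect-subscribe-write loop above, and the backoff numbers are arbitrary:

```javascript
// Retry a long-running consumer with capped exponential backoff.
// Resolves true if consume() eventually completes without throwing,
// false once the attempts are exhausted.
async function runWithRetry(consume, { attempts = Infinity, maxDelayMs = 30_000 } = {}) {
  let delay = 250;
  for (let i = 0; i < attempts; i++) {
    try {
      await consume();
      return true; // clean exit (e.g. deliberate shutdown)
    } catch (error) {
      // log locally only; rethrowing here would defeat the retry
      console.error("log consumer failed:", error);
    }
    await new Promise(resolve => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
  }
  return false;
}
```

In the real shipper you would use the default attempts: Infinity so it keeps reconnecting. Note also that the nats client library can reconnect on its own; a wrapper like this is a backstop for failures the library doesn’t recover from.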

Conclusion

Log failures are not common, and perhaps the redundant logs that Fly.io already keeps will be sufficient for your needs. But it may be worth reviewing your exposure, and how to mitigate it, should your logs fail at the worst possible time.

Hopefully the approaches listed above give you ideas on how to ensure that you will always have the log data you need, even in the most hostile conditions.