Exploring Options for Storing Custom Data in Ecto

Image by Annie Ruygt

Fly.io runs applications by transmogrifying Docker containers into Firecracker micro-VMs running on our hardware around the world, connected with WireGuard to a global Anycast network. Your Elixir app could be one of them! Check us out: you can deploy in minutes.

Ever wanted to store a blob of custom data on a database record? The data we want to store might be complex too, made up of multiple fields and even lists. When the data is stored using Ecto there are several ways we can do it. This article explores two paths and discusses some of the pros and cons for each approach.

Recently I was confronted with this choice again. This choice doesn’t come along that often for me and I had to re-discover the pros and cons for each option. This article is intended to be a sign posted at the fork in the road to let travelers know a bit more about what lies beyond in the paths before them.

Both in life and in programming, we can be presented with a choice of two directions to go. If you are familiar with Robert Frost’s poem The Road Not Taken, then you may know that it’s actually about indecision, the desire to take both paths because deciding things can be scary! In this article, we will venture a little way down each path before finally deciding which to take.

Before we discuss the choice of paths, we need to cover the problem being solved. Let’s talk about the kind of data being stored.

Our custom data

In our scenario, we have a service that offers many games that members can play. The games can be saved in various states of play and be resumed later. An Account has many GameSave records.

We didn’t want to create a database table for every type of game that could be saved. We are continually adding new games and occasionally retiring old ones that didn’t catch on.

Instead of creating a table for each game type, we created a single table that records a saved game of any game type. It looks something like this:

Account has many GameSave. GameSave has one GameData. ERD diagram

Every game needs to store its data in a different way. For example, our saved Tic-Tac-Toe game will be stored very differently from a chess game or a memory game. The structure and details of the GameData are defined by the game itself.

How can we do this using Postgres and Ecto?

Path One: Store data as a map

The easiest and first thought might be to use an Ecto :map field type. In Postgres, this is stored as a jsonb field. In MSSQL it is stored as text. Other databases will vary, so we need to understand how it works in our particular database. In this situation, it’s Postgres.

The Ecto.Schema docs on the :map field type are very helpful if you plan to go in this direction.

Let’s see what it would look like for our GameSave to use a :map field.

Migration

It’s easy to create this in our migration:

defmodule Core.Repo.Migrations.CreateGameSaves do
  use Ecto.Migration

  def change do
    create table(:game_saves) do
      add :account_id, references(:accounts, on_delete: :delete_all), null: false
      # ...
      # Game data stored as a map
      add :data, :map, null: false, default: %{}

      timestamps()
    end

  end
end

The data type used is :map. We can even assign a default value of an empty map. Cool!

Ecto Schema

Our schema also makes easy use of the :map field type.

defmodule Core.GameSaves.GameSave do
  use Ecto.Schema

  schema "game_saves" do
    # ...
    field :data, :map, default: %{}, required: true

    belongs_to :account, Core.Accounts.Account
    timestamps()
  end
end

With our storage and data structures ready, let’s look at how we use it.

Writing data

When writing the data, a common approach is to use a normal map. By “normal”, we mean that it uses atoms for keys and uses atoms for field values as well. Like this:

data = %{
  turn: 1,
  group_types: [:hero, :mage, :elf],
  characters: [
    %{name: "Herman the Hero", type: :hero, health: 130},
    %{name: "Morgan the Mage", type: :mage, health: 100},
    %{name: "Edward the Elf", type: :elf, health: 100},
  ]
}

{:ok, save} = Core.GameSaves.create_game_save(player_account.id, data)
save

After inserting the data, our struct is returned looking like this:

%GameSave{
  id: 1,
  data: %{
    turn: 1,
    group_types: [:hero, :mage, :elf],
    characters: [
      %{name: "Herman the Hero", type: :hero, health: 130},
      %{name: "Morgan the Mage", type: :mage, health: 100},
      %{name: "Edward the Elf", type: :elf, health: 100},
    ]
  }
}

The thing to note here is the data field. As you would expect, it looks like the data we assigned.

However, when we load that same record back from the database, it looks very different.

%GameSave{
  id: 1,
  data: %{
    "turn" => 1,
    "group_types" => ["hero", "mage", "elf"],
    "characters" => [
      %{"name" => "Herman the Hero", "type" => "hero", "health" => 130},
      %{"name" => "Morgan the Mage", "type" => "mage", "health" => 100},
      %{"name" => "Edward the Elf", "type" => "elf", "health" => 100},
    ]
  }
}

Everything in data was serialized through the Jason library. We lost all the atoms, both as keys and values.

This behavior is expected and documented, but it still catches people because at first it looks like it works just how you’d expect.

From the documentation:

Keep in mind that we advise the map keys to be strings or integers instead of atoms. Atoms may be accepted depending on how maps are serialized but the database will always convert atom keys to strings due to security reasons.

The inconsistency of assigning a map with atoms and later restoring with string keys can create accidental bugs. If we write code that uses the data right after creation when it still has atom keys, those same functions will not work when used on data after it’s read from the database.

Take a look at this function. It works fine right after we create the record. When we load it back, the characters key will be a string and this function won’t match!

def get_group_size(%GameSave{data: %{characters: characters}}) do
  length(characters)
end

Hopefully our tests would catch this early.

So let’s assume that we are careful and only create maps with string keys and no atoms. What are some pros and cons to this approach?

Pros to map approach

  • Postgres allows us to query data in the JSON map. This is perhaps the biggest benefit! We can actually query for data based on the contents inside the Postgres jsonb field!
  • Postgres tools like pgAdmin know how to display the data and support reading and writing it.
  • This approach is officially encouraged.

Cons to map approach

  • Data needs to be created using string keys and strings for atoms. Depending on where the data is coming from, it may need to be converted manually.
  • Limited to only store what JSON can represent.
  • The potential for accidental bugs when setting data using atoms.

Path Two: Store data as an Erlang encoded binary

There is another less used approach available to us. We can natively serialize our Elixir data structures and store them directly in the database. Doing this has a different set of trade-offs which we’ll explore further.

First, it’s important to know that you can do this:

data = %{custom_data: "Yes!", valid_states: [:on, :off]}

term = :erlang.term_to_binary(data)
#=> <<131, 116, 0, 0, 0, 2, 100, 0, 11, 99, 117, 115, 116, 111, 109, 95, 100, 97,
#=>   116, 97, 109, 0, 0, 0, 4, 89, 101, 115, 33, 100, 0, 12, 118, 97, 108, 105,
#=>   100, 95, 115, 116, 97, 116, 101, 115, 108, 0, 0, 0, 2, 100, ...>>

restored = :erlang.binary_to_term(term)
#=> %{custom_data: "Yes!", valid_states: [:on, :off]}

In the above code, we create an Elixir map and convert it to a binary term format. Then we convert from the binary format back to our original data.

Using :erlang.term_to_binary/1 we can convert any Elixir term to a binary format. The binary data is restored using :erlang.binary_to_term/1.

Serializing through the External Term Format is a documented official feature. It is used for transferring data outside of Elixir/Erlang systems and back. Hey! That’s what we’re doing!

If we wanted to try this out on an Ecto.Schema, how would we do it?

Migration

Postgres and Ecto both support binary column types.

defmodule Core.Repo.Migrations.CreateGameSaves do
  use Ecto.Migration

  def change do
    create table(:game_saves) do
      add :account_id, references(:accounts, on_delete: :delete_all), null: false
      # ...
      # Game data stored as binary
      add :data, :binary, null: false

      timestamps()
    end

  end
end

Notice that our data field has the type :binary.

Ecto Schema

To make it easier to store our binary data in our schema, we can create a custom Ecto.Type. To do what we want, this is what ours would look like:

defmodule Core.EctoErlangBinary do
  @moduledoc """
  A custom Ecto type for handling the serialization of arbitrary
  data types stored as binary data in the database. Requires the
  underlying DB field to be a binary.
  """
  use Ecto.Type
  def type, do: :binary

  @doc """
  Provides custom casting rules for params. Nothing changes here.
  We only need to handle deserialization.
  """
  def cast(:any, term), do: {:ok, term}
  def cast(term), do: {:ok, term}

  @doc """
  Convert the raw binary value from the database back to
  the desired term.
  """
  def load(raw_binary) when is_binary(raw_binary),
    do: {:ok, :erlang.binary_to_term(raw_binary)}

  @doc """
  Converting the data structure to binary for storage.
  """
  def dump(term), do: {:ok, :erlang.term_to_binary(term)}
end

This implements the callbacks for the Ecto.Type behaviour. The important bits here are the load and dump functions.

With our custom Ecto.Type defined, we can use it in our schema.

defmodule Core.GameSaves.GameSave do
  use Ecto.Schema
  alias Core.EctoErlangBinary

  schema "game_saves" do
    # ...
    field :data, EctoErlangBinary, required: true

    belongs_to :account, Core.Accounts.Account
    timestamps()
  end
end

The field data uses the type EctoErlangBinary. Now, what does it look like to use this?

Writing data

What happens when we use our same map example?

data = %{
  turn: 1,
  group_types: [:hero, :mage, :elf],
  characters: [
    %{name: "Herman the Hero", type: :hero, health: 130},
    %{name: "Morgan the Mage", type: :mage, health: 100},
    %{name: "Edward the Elf", type: :elf, health: 100},
  ]
}

{:ok, save} = Core.GameSaves.create_game_save(player_account.id, data)
save

After inserting the data, our struct is returned looking like this:

%GameSave{
  id: 1,
  data: %{
    turn: 1,
    group_types: [:hero, :mage, :elf],
    characters: [
      %{name: "Herman the Hero", type: :hero, health: 130},
      %{name: "Morgan the Mage", type: :mage, health: 100},
      %{name: "Edward the Elf", type: :elf, health: 100},
    ]
  }
}

That’s what we would expect. The data field has the Elixir map we assigned it.

What happens when we load it from the database?

Core.GameSaves.get_game_save!(1)

%GameSave{
  id: 1,
  data: %{
    turn: 1,
    group_types: [:hero, :mage, :elf],
    characters: [
      %{name: "Herman the Hero", type: :hero, health: 130},
      %{name: "Morgan the Mage", type: :mage, health: 100},
      %{name: "Edward the Elf", type: :elf, health: 100},
    ]
  }
}

After reading back from the database, it stayed the same! Our atom keys stayed atoms and so did our atom values.

Another Example

For another concrete example, let’s talk about modeling a standard card game with face cards. We start by creating the game logic. During the design of the game, we find it makes the most sense to model the cards like this:

  • {:club, 4} for a 4 of Clubs
  • {:heart, "K"} for a King of Hearts
  • {:spade, "A"} for an Ace of Spades

With this data structure, a players hand can be expressed as a simple keyword list!

player_1 = [club: 2, club: 8, spade: 8, heart: 4, heart: "J"]

For writing the game logic, a data structure like this makes pattern matching easy. After our game logic is implemented, we turn to serializing the data so it can be recovered when it crashes or a node is shutdown as part of a new deploy.

If we store the data as a map in a jsonb field, then we can’t serialize a keyword list. That doesn’t translate automatically to JSON. We can convert it to a JSON friendly map like this:

card = %{
  "suit" => "spade",
  "value" => 8
}

While this data structure works well for JSON, we have to handle transforming our data into this map for storage and then later transform it back to the keyword list format.

If we store the data as a binary using :erlang.term_to_binary, we can directly store our keyword list.

Storing our data in binary is a lot simpler!

Security considerations

The Ecto documentation explained that it goes through JSON for security reasons. It can be dangerous to deserialize malicious data back into Elixir code.

It comes down to the risk of converting unknown strings into atoms and possibly exhausting the atom table and crashing the system. So we really need to think about where the stored data comes from. We don’t want to store data returned from an external service or anything supplied by a user that we don’t explicitly control.

In this situation, the data is generated by our game code and we define what the values can be. With the current design, a user can’t mess with the data stored in the system. However, it’s still important to keep this potential risk in mind going forward!

Don’t store that!

Because it’s now possible to store any Elixir data structure doesn’t mean you should. It may be tempting to store a full Elixir struct in the database but that would be a mistake. The problem comes after we’ve deployed new versions of our application and the struct in our project no longer looks like the struct stored in the database. This can create unexpected problems for our code!

Imagine how much our data structures can change over 2 years of active development! What kinds of things can go wrong when restoring an old structure? Lots!

Here are a few tips when deciding what to store:

  • Instead of storing structs, define a map that stores only enough data to restore things later.
  • Instead of storing fully nested records, store IDs that can be used to reload current versions of those records.
  • Avoid storing extra runtime state that isn’t needed for restoring a game. An example of this is something like a timer reference.

We don’t want to just store the whole blob of game state as-is! We still want to be thoughtful and selective about what we store.

Version the data!

This tip is also true when storing data as a :map. The key is to add this to our data from the very beginning so we can always count on it being there.

The idea is, store a version along with our data. It might look like this:

data = %{
  version: 1,
  turn: 10,
  # ...
}

This lets us do on-the-fly data migration from older versions.

The version: 1 data can easily be pattern matched when we are restoring old data in the future. Our code may now be on version 3, but when we see version: 1, we know how to migrate that data to version: 2 and from version: 2 we can migrate to version: 3.

Let’s review some of pros and cons we’ve identified for storing our data as a binary field.

Pros to the binary approach

  • Atoms stay atoms.
  • Whatever Elixir data we store is returned. This includes tuples, keyword lists, atoms, binaries, floats, etc.
  • More options on what we want to store and how to store it.
  • Once we add a custom Ecto type, it’s very easy to do.

Cons to the binary approach

  • Postgres doesn’t understand the binary format. Tools like pgAdmin show a meaningless binary blob.
  • Postgres is unable to search and query within the data.
  • Presents a potential security problem if you can’t trust the source of the data.
  • May be tempted to store data we shouldn’t.

Fly ❤️ Elixir

Fly is an awesome place to run your Phoenix apps. It’s really easy to get started. You can be running in minutes.

Try Fly for free

Closing thoughts

The longer we are in tech, the more we realize that the answer to a question like, “Which is better, X or Y?” is often, “It depends.”

Here we explored two options for storing custom complex data on a our Ecto.Schema. It’s like when “Two paths diverged in a wood”, the path you choose depends on things like:

  • Where do you want to end up?
  • What features do you value?

Path One: Use the fully supported :map data type. It’s a great option. It’s actually the recommended approach to take by default. It has some gotchas to be aware of but with a little extra work those can be handled.

Path Two: Encode Elixir data structures as :binary and store it on the record. This makes it easy to store anything we want with less work. This comes with a number of cautions and conditions.

Together we’ve peered down the two paths a bit and hopefully now we can quickly determine which path we want next time we come to this fork in the road.

As for me, I chose the binary field approach. It fit my needs and had the features I valued. The path you choose to take is up to you!

Safe travels!

Two paths diverged in a project, and I—
I took the one less traveled by,
And that has made all the difference.