How LiteFS Works

LiteFS is a distributed file system specifically built for replicating SQLite databases. This allows each application node to have a full, local copy of the database and respond to requests with minimal latency. LiteFS is built for high availability so that operations can continue even in the event of catastrophic failures.

Capturing SQLite Transactions

SQLite works by updating a subset of pages in the database file on every transaction. It does this by using the file system API and the exact steps depend on whether you’re using a rollback journal or a write-ahead log.

LiteFS acts as a passthrough file system which intercepts these API calls and copies out the page sets for each transaction. These page sets are stored in an internal format called LTX that performs extensive consistency checking to ensure correctness.

By applying these page sets in the LTX files in order, we can reconstruct the state of our SQLite database at any point in time. We can also copy them to another nodes in order to replicate the changes in real-time to our cluster.

Rollback journal handling

The following steps occur when SQLite creates a transaction with either the DELETE, PERSIST, or TRUNCATE journal modes:

  1. Obtain an exclusive lock on the RESERVED byte.
  2. Create a journal file to start the transaction.
  3. Update page in the main database file while copying old pages to the journal.
  4. Invalidate the journal file to finish the transaction.
  5. Release the lock on the RESERVED byte.

LiteFS maintains a set of dirty database pages based on write(2) calls to the database file. When the journal file is invalidated, LiteFS will copy out the dirty pages in sequential order to an LTX transaction file.

Write-ahead log (WAL) handling

The following steps occur when SQLite creates a transaction with the WAL journal mode:

  1. Acquire an exclusive lock on the WAL_WRITE_LOCK.
  2. Write a WAL frame to the end of the WAL for each new or updated page.
  3. On commit, write the new size of the database to the commit field in the last WAL frame header.
  4. Release the exclusive lock on the WAL_WRITE_LOCK.

Because all page changes are in the write-ahead log, LiteFS does not need to maintain a dirty page set. Instead, it reads from the previous ending offset of the WAL to the end to capture all new & changed pages into an LTX file.

Replication Position

Each database within the LiteFS directory contains a replication position. This position is a combination of the transaction ID (TXID) and the rolling checksum of the database contents. The TXID is an integer that is incremented with every write transaction that occurs.

You can track the relative position between nodes by reading from the database’s position file in the LiteFS mount directory. For example, if you have a database called /mnt/db then you can read the position from /mnt/db-pos file:

$ cat /mnt/db-pos

The output is a hex-encoded TXID followed by a slash and then the database rolling checksum.

Cluster management using leases

SQLite operates as a single-writer database which means only one transaction can write at a time. Because of this, we utilize a lease system whereby a single node acts as the “primary” node at any given time. All writes should be directed to that node and it will propagate changes to the rest of the cluster.

However, nodes can die or become disconnected from the cluster so the primary must be able to change dynamically. LiteFS is built to be run on ephemeral environments where cluster membership can change quickly and frequently so it utilizes leases via Consul to determine the current primary node.

When a LiteFS node starts up, it checks a key within the Consul server for the current primary node. If no primary node exists, then the LiteFS node attempts to obtain a lease from Consul to become a primary node itself. If successful, it updates the Consul key to share its API URL so other LiteFS nodes can connect to it.

Handling split brain

LiteFS uses asynchronous replication which means that it does not confirm that writes are committed to other nodes before continuing to the next transaction. This tradeoff is made to increase write throughput at the cost of some durability.

In the event of an unexpected primary node failure, the primary node may become out of sync with the rest of the cluster if it has unreplicated writes it has applied locally but have not been copied to the other nodes. If a new primary is elected then the old primary could corrupt its data if it tried to apply transactions from the new primary.

Fortunately, LiteFS maintains a rolling checksum of the full contents of the database so it will detect this split and automatically snapshot from the new primary to the out-of-sync node.

Synchronous replication & time-bound asynchronous replication are features that are planned for future development.