WAL Mode in LiteFS

Author

Name: Ben Johnson
@benbjohnson: @benbjohnson

LiteFS is a distributed file system that magically replicates your SQLite databases. Make an update on one server and, voilà, your change is instantly available to your other servers on the edge. Take a look at our Getting Started guide to see how to add LiteFS to your application.

By and large, SQLite is configuration-free. You can get pretty far by just using the default settings. As your application grows and you start tweaking settings, one of the first knobs you’ll come across is the journal mode. This setting determines how SQLite performs transactions on disk and there are essentially two modes: the rollback journal & the write-ahead log, or WAL.

The rollback journal was the original transaction mechanism and it’s still the default. The WAL mode is the shiny new transaction mode. If you start reading blog posts & forums about SQLite, one tip you will repeatedly hear is, “use WAL mode!”

If your database is slow, you should use the WAL mode.

If you have concurrent users, you should use the WAL mode.

WAL mode. WAL mode. WAL mode.

In the SQLite world, the write-ahead log is as close to a silver bullet as you can find. It’s basically magic fairy dust that makes your database better and you should almost always use it.

However, LiteFS, our distributed SQLite file system, only supported the rollback journal mode. Until now! With the release of LiteFS v0.3.0, we now support all journaling modes.

Quick primer on journal modes

We’ve written about the internals of the rollback journal and the WAL mode in previous posts, but here’s a refresher.

With the rollback journal, SQLite:

Writes new pages directly to the database file.
Copies the previous contents to a separate rollback journal file.
Deletes the rollback journal on commit.

Because the pages in the database file are moving around and being deleted, this mode does not allow read transactions & write transactions to occur at the same time.

The WAL works the opposite way:

New pages are written to a separate write-ahead log file.
The last page written has a “commit” flag to indicate the end of the transaction.

Since the original data is never changed during the transaction, readers can continue running in parallel while another process is writing to the database. In addition to improved concurrency, the WAL also tends to have better write performance.

Databases as a history of change sets

Most developers think of databases as just a collection of tables & rows. And that’s how you should view it when you’re building an application. However, when designing database tooling like LiteFS, it’s better to think in terms of change sets.

A good analogy is baseball card collections. You might start off buying a pack of cards to start your collection. Over time, you may buy more packs or you might trade cards with friends. Each of these actions is a “change set”, adding and/or removing a set of cards from your collection.

Eventually, word gets out about your sweet baseball card collection and your friends want to have the same set. So each time you make a change, you send each friend a list of which cards were added and removed so they can update their collections. Now everyone has the same collection just by communicating change sets.

That, in a nutshell, is how LiteFS nodes keep distributed copies of your database in sync. However, instead of baseball cards, these LiteFS nodes communicate change sets of fixed-sized blocks called pages.

SQLite applies these change sets of pages safely & atomically by using either a rollback journal or the write-ahead log. These two methods have a different approach but, at the end of they day, they both transactionally update a set of pages in a SQLite database.

In LiteFS, we track the beginning and end of these transactions through the file system API. We can see which pages have changed and bundle them up in an internal file format called LTX.

Detecting page sets with the rollback journal

The rollback journal is a simple mechanism, which makes it easy for LiteFS to determine when write transactions start & end. From a high-level, SQLite implements transactions like this:

Obtain an exclusive lock on the SHARED & RESERVED lock bytes.
Create a -journal file.
Write changes to the database file with write(2) & copy old versions of pages to the journal.
fsync(2) the database file & unlink(2) the journal file.
Release the locks.

LiteFS acts as a passthrough file system so it can see all these file system calls. On the initial journal creation, it begins watching for page changes. On write(2), it marks a page as changed. And finally, on unlink(2) it will copy the page change set to an LTX file and then delete the journal.

Detecting page sets with the WAL

SQLite’s operations when it uses the WAL mode are a bit more complicated but it still has similar start & end triggers.

Obtain the SHARED lock byte in the database file but also obtain WAL-specific locks such as WAL_WRITE_LOCK.
Write new pages to the end of the WAL using write(2).
On the last page write, the commit field is set in the WAL frame header. This indicates the end of the transaction and also the ending size of the database.
Release locks.

LiteFS can read the list of changed pages from the WAL and copy them out to an LTX file when the final WAL write for the transaction comes in. Again, both the rollback journal and WAL are implementation details so we end up with the same LTX format with either one.

In the WAL mode, SQLite will also maintain a shared-memory file (aka SHM) and uses it as an index to look up pages in the WAL. This piece is managed by SQLite so LiteFS doesn’t touch it during a write.

Applying transactions to the replica

Once an LTX file is created on the primary LiteFS node, it will send it to all connected replica LiteFS nodes. These replicas will validate the file, perform some consistency checks, and then apply the change set to the SQLite database.

The LiteFS replica imitates a SQLite client and takes the same locks in order to apply the transaction. That means it looks like just another SQLite client doing an update so it’s safe across other processes using the database.

Bootstrapping made easy

Previously, it was tough to convert an existing application to use LiteFS. You’d need to create a SQL dump of your database and import in using the sqlite3 command line. That was a pain.

We’ve improved this workflow with the new litefs import command. This command lets you remotely send a SQLite database to your LiteFS cluster and it will transactionally replace it. That means you can start a cluster with an existing database or you can even revert to an old snapshot on a live application.

$ litefs import -name my.db /path/to/my.db

Reworking checksumming

LiteFS uses a fast, incremental checksum for ensuring the state of the entire database is consistent across all nodes at every transaction. The method is simple: we XOR the CRC64 checksums of every page in the database together. This approach let us incrementally update individual pages by XOR'ing out the old checksum for a page and XOR'ing in the new checksum for the page. That’s pretty cool.

However, in practice, it was difficult to ensure we were calculating the correct previous checksum for a page every time we performed an update as page data is spread across the database file, journal file, & WAL file. The edge cases for determining the previous page data were too easy to get wrong.

So in v0.3.0, we decided to rework the database checksum. It still uses the same algorithm of XOR'ing page checksums but now we maintain a map of the current checksum of every page in the database so they can be XOR’d together on commit. We no longer need to track the previous checksum and this change made a lot of edge cases disappear.

This approach is not without its trade-offs though. First, it requires additional memory. The map keys are 4-byte unsigned integers and the values are 8-byte hash values so we need about 12 bytes per page. SQLite uses 4KB pages by default so that’s 262,144 pages per gigabyte. Our total memory overhead for our map of page hashes ends up being about 3MB of RAM per gigabyte of on-disk SQLite database data. LiteFS targets database sizes between 1 to 10 GB so that seemed like a reasonable trade-off.

Second, this approach adds CPU overhead after each commit. Map iteration and XOR computation are quite fast but these do begin to show up in performance profiles as the database grows. In our tests, we’ve found it adds about 5ms per gigabyte of SQLite data. That’s pretty high. Fortunately, much of this iteration can be cached since XORs are associative. We’ll be implementing this cache in the next version of LiteFS.

Improving debugging with the trace log

One benefit to having checksum bugs in v0.2.0 was that it gave us plenty of time to get our hands dirty with debugging. The best tools come out of necessity and the LiteFS trace log is one of those tools.

Debugging a failed database or distributed system is a bit like a murder mystery in that you know how it ended but you need to put the pieces together to figure out how it happened.

In the previous version of LiteFS, we didn’t have many clues when one of these failures happened so it required a Sherlock Holmes level of deductive reasoning to figure out the mystery. The trace log simplifies this process by writing out every internal event to a log file so we can see where things went awry after the fact.

SQLite uses the POSIX file system API so debugging with a normal strace would look like a series of seemingly opaque system calls. LiteFS translates these system calls back into SQLite related actions such as WriteDatabase() or LockSHM(). When we write those events to the trace log, we can decorate the log lines with additional information such as page numbers and checksums. All this makes reading the trace much more straightforward.

The trace log is not without its costs though. It will increase I/O to your disk as there are a lot of events that are written. It’s typical to see your disk I/O double when you enable the trace log. However, it does cap the total size of the trace log by using a rolling log so you don’t need much space available. By default, it will roll over to a new log file every 64MB and it will retain the last 10 logs in a gzipped format.

The trace log is disabled by default, however, you review the trace log documentation if you need it to debug any LiteFS issues.

Upcoming work

The WAL support & stability improvements have been huge steps in moving LiteFS to be production ready but there’s still more work to come. In the next release, we’ll be focused on making LiteFS easier to integrate into your application by adding support for write forwarding. That will let you write to your database from any node and have LiteFS automatically forward those writes to the primary instead of having your application redirect writes.

We’ll also be making performance improvements by adding LZ4 compression to the LTX files. This will reduce latency between nodes and it will significantly cut down on bandwidth costs.

Thank you!

Finally, we’d like to give a huge shoutout for everyone who has tried LiteFS and given feedback. It makes a world of difference! Kent C. Dodds even live streamed his experience with LiteFS and it gave us incredible, detailed feedback. Thank you!

Next post ↑: Fly.io is in Johannesburg
Previous post ↓: Launching Redis by Upstash