SQLite Internals: Pages & B-trees

Author

Name: Ben Johnson
@benbjohnson: @benbjohnson

A parliament of owls meticulously inspecting various pages while in a tree. There's also two feet. — Image by Annie Ruygt

Fly.io runs apps close to users around the world, by taking containers and upgrading them to full-fledged virtual machines running on our own hardware around the world. Sometimes those containers run SQLite and we make that easy too. Give us a whirl and get up and running quickly.

Ok, I’ll admit it—I’m a SQLite shill. There are few holes that I don’t try to put a SQLite-shaped peg into. It’s not that I dislike other databases, they’re great. But SQLite is so easy to use and, more importantly, it’s simple. Simplicity leads to reliability and I don’t know of a more reliable database than SQLite.

There’s a comfort with being able to read through a spec or a code repository and know that you’ve covered the full breadth of a tool. That’s why I love Go’s simple 100-page language specification and that’s why I love SQLite’s 130KLOC core code base. It’s something that you can read through in a long weekend if, ya know, that’s what you do on weekends.

This constrained size means that SQLite doesn’t include every bell and whistle. It’s careful to include the 95% of what you need in a database—strong SQL support, transactions, windowing functions, CTEs, etc—without cluttering the source with more esoteric features. This limited feature set also means the structure of the database can stay simple and makes it easy for anyone to understand.

We here at Fly.io have an unreasonable affinity for explanations involving sandwiches and this post will continue in that sacred tradition.

I have binders full of sandwiches

Recently, I tried to remember a sandwich I ate last week but to no avail. Like any 10x engineer, I quickly over-engineered a solution by making a SQLite database to track every sandwich consumed.

We’ll start off with our table definition:

CREATE TABLE sandwiches (
    id INTEGER PRIMARY KEY,
    name TEXT,
    length REAL,
    count INTEGER
)

Now we’ll have a record of every sandwich, its size in inches, and the number eaten.

I live in Denver where we were known in the early 2000s for toasted sandwiches, and then in the 2010s for bad toasted sandwiches so we’ll kick off with a toasted Italian sub.

INSERT INTO sandwiches (name, length, count) VALUES ('Italian', 7.5, 2)

Voila! Our data is safe on disk. It’s easy to gloss over all the steps it takes to get from hitting the “enter” key to bytes getting saved to disk. SQLite virtually guarantees that your database will never be corrupted or that your transaction will be half-written. But instead of glossing over, let’s dive in deep and see how our Italian sub looks on disk.

Efficient sandwich encoding

Our row of sandwich data exists as an array of bytes inside SQLite that’s encoded using its record format. For our inserted row, we see the following bytes on disk:

15 01 05 00 1b 07 01 49 74 61 6c 69 61 6e 40 1e 00 00 00 00 00 00 02

Let’s break these bytes down to see what’s going on.

The header & variable-length integers

The first byte of 0x15 is the size of our row’s payload, in bytes. After this is our rowid which is used as our PRIMARY KEY. Since this is the first row, its id has a value of 0x01.

These first two fields use what’s called a variable-length integer (“varint”) encoding. This encoding is used so that we don’t use a huge 8-byte field for every integer and waste a bunch of space. It’d be like if a sandwich shop packaged every sandwich in enormous 6-foot party sub containers because they only could use one size of container. That’d make no sense! Instead, each size of sandwich gets its own container size.

Varints use a simple trick. The high bit is used as a flag to indicate if there are more bytes to be read and the other 7 bits are our actual data. So if we wanted to represent 1,000, we start with its binary representation split into 7 bit chunks:

0000111 1101000

Then we add a “1” flag bit to the beginning of the first chunk to indicate we have more chunks, and a “0” flag bit to the beginning of the second chunk to indicate we don’t have any more chunks:

10000111 01101000

With varints, we can now store our integer in 2 bytes instead of 8. This may seem small but many SQLite databases have a lot of integers so it’s a huge win!

The next two bytes after the rowid specify the data that is not spilled to overflow pages but the explanation is lengthy so we’re gonna wave our hands over that part.

Type encoding

Next, we have a list of column types for our name, size, and count fields. Each data type has a different encoding that’s specified as a varint.

For our name column, the 0x1b value specifies that it is a TEXT type and has a length of 7 bytes. Type values that are odd and are greater or equal to 13 are TEXT fields and can be calculated with the formula (n*2) + 13. So our 7-byte string is (7*2) + 13 which is 27, or 0x1b in hex.

BLOB fields are similar except they’re even numbers calculated as (n*2) + 12. SQLite alternates these TEXT and BLOB type values so small lengths of both types can be encoded efficiently as varints.

Next, we have our “length” field which is a floating-point number. These are always encoded as a 0x07.

After that, we have our “count” field which is an integer. These get packed down similar to varints but in a slightly different format. Integers that can fit in an 8-bit integer are represented with a type value of 0x01. 16-bit integers are 0x02, 24-bit integers are 0x03 and so on.

Value encoding

Once our types are all encoded, we just need to pack our data in. The text value of “Italian” is represented as UTF-8:

49 74 61 6c 69 61 6e

Then our length of 7.5 is represented as an IEEE-754-2008 floating-point number. SQLite can optimize integer floating-point values by storing them as pure integer fields but since we have a decimal place it is stored with 8-bytes:

40 1e 00 00 00 00 00 00

And finally we use a single byte to hold our count of 2:

Congrats! You’re now an expert on SQLite record formatting.

E_TOOMANYSANDWICHES

As my sandwich addiction continues unabated, I fill my SQLite database with more and more rows. I even make friends on the /r/eatsandwiches subreddit and start collecting their sandwiches. My sandwich database seems to grow without bound.

Surprisingly though, adding or updating rows is still nearly as instantaneous as when I had a single row. So how does SQLite update a multi-gigabyte sandwich database in a fraction of a second? The answer is pages & b-trees.

A naive approach to a database would just be to pack records in sequentially in a file. However, there’s no way to insert or update rows in the middle of the file without shifting and rewriting all the bytes after the new row.

Instead, SQLite groups rows together into 4KB chunks called “pages”. Why 4KB? That’s what file systems typically use as their page size so keeping everything aligned reduces page fetches from disk. Disks are usually the slowest part of a database so limiting page fetches can have a huge performance win.

Inspecting the page format

If we inspect our page, we can see its header:

0d xx xx 00 03 xx xx xx

There are several parts of this header but I masked out several bytes so we can focus on two particularly important fields. The first byte, 0x0d, indicates the page type. This page type is a table leaf. We’ll talk about those more with b-trees.

The second important field is the cell count, which is 0x0003. This tells us that 3 records exist in this page.

After the header, we have the cell pointer index:

0f e9 0f d2 0f bb

This is a list of 2-byte values that represent offsets in the page for each record. The first record exists at offset 4,073 (0x0fe9), the second record exists at offset 4,050 (0x0fd2), etc. SQLite packs rows at the end of the page and then works backwards.

After the index, we have a bunch of zeros until we reach the content area of the page which holds our row data in record format.

Structuring pages into trees

Now that we’ve chunked our data into pages, we can update a subset of our data without having to rewrite the whole file. That’s great for writing but now we have a list of pages to search through if we want to query our data and that won’t scale as-is.

The simplest approach would be to start at the first page and then search every page for a given row. This is called a “table scan” and it can be really slow—especially if your data is at the end of your table. If you’re into “big-O” notation, it’s also referred to as “linear time complexity”, or O(n). That means the amount of time it takes to search for a record has a linear relationship to the number of records you have, which is referred to as “n”.

Searching faster with binary search

If you are searching by your primary key, you could perform a binary search since the table is sorted by primary key. For a binary search, you search the page right in the middle of your table for your sandwich record. If the record exists, great! You’re done.

If the sandwich you’re searching for is before that page, then you find a new “middle” page in the first half of your table. If it’s after the page, then you find a new middle page in the second half of your table. Keep slicing the search space in half and searching until you find your sandwich. If you squint a bit, this slicing and subdividing has the feel of a tree-like structure.

A binary search has a logarithmic time complexity, or O(log n). That’s considered pretty good for data structures since it means you can scale up to a large number of records, n, while the cost grows at a much slower rate.

Improving binary search by persisting the tree

While logarithmic time complexity is great, we still have a problem. Let’s run some numbers.

If we have a small 2MB database with 4KB pages then that means we have 512 pages. A binary search of that many pages would have a worst-case scenario of log₂(512), or 9 pages. That means we might have to read nine pages in our tiny database just to find one record! Page fetches are painfully slow in databases so we want to reduce that as much as possible.

SQLite is structured as a b-tree, which is a data structure where each node can point to two or more child nodes and the values in these child nodes are all sorted. There are a ton of different variants of b-trees but the one that SQLite uses is called a b+tree. This type of tree stores all data in the leaf nodes, which is what our sorted list of pages represents, but also have an index of key ranges in the branch pages. SQLite refers to these branch pages as “interior pages”.

To illustrate this, let’s say our leaf pages hold sandwich records that are each 40 bytes. The record also contains an integer primary key that is 4 bytes on average. That means we can fit about 100 records into one 4KB page. If we have less than 100 records, we only need one page. Easy peasy.

But what happens when we add a 101st sandwich and it doesn’t fit anymore? SQLite will split the page into two leaf pages and add an interior page as the root of our b+tree that points to our child leaf pages. This interior page stores the key ranges for the leaf pages so that when you search, you can see what ranges each child page holds without having to actually read that child page.

This doesn’t seem like a big improvement over our binary search until we start adding more data. Interior pages are also 4KB and they store pairs of child primary keys and their page numbers so each entry in our interior page takes about 8 bytes. That means we can store about 500 entries in an interior page. For our 2MB database, that means we can hold the key range for all of our leaf pages in a single interior page. To find a given record, we only need to search the root interior page to find the correct leaf page. That means our worst case search is only 2 pages.

Growing a tree

What happens when our root interior page fills up and we need a database bigger than 2MB? Similar to leaf pages, we split the interior page in two and add a new root parent interior page that points to the split interior pages. This means that we need to search the new root interior page to find the correct second-level interior page and then we search page that to find our leaf page. We now have a tree depth of 3.

Since our new root can hold about 500 references to second-level interior pages and those second-level pages hold about 500 references to leaf pages, we can store about 250,000 pages, or about 1GB of data. If we continue adding rows, we’ll need to split the root page again and increase our tree depth to 4.

At a tree depth of 4, we can hold about 500³ leaf pages, or about a 476GB database. That means we only need to read 4 pages to find a given record—even in a huge database!

OK, But Why?

While it’s interesting to noodle around with internals, how does this actually apply to real-world scenarios?

Well, knowing about record formatting tells us that storing integers instead of floating-point numbers is wildly more efficient as SQLite doesn’t compress floats.

Or perhaps knowing that b-tree time cost grows logarithmically will let you feel more comfortable designing an application with a multi-gigabyte table.

Or maybe, just maybe, discovering the /r/eatsandwiches subreddit will inspire your dinner tonight.

Learning about the internals of our tools lets us feel comfortable with them and use them confidently. Hopefully low-level features like SQLite’s PRAGMAs seem a little less opaque now.

I’ll be writing more on SQLite internals in future posts—from rollback journals to write-ahead logs to the SQLite virtual machine. Want to know more about a specific topic? Hit me up on the Fly Community forum or ping us on Twitter.

Next post ↑: Volume Expansion and Snapshot Restores
Previous post ↓: Launching Laravel Bytes