The Spanner Paper: Google's Quest for a Globally Consistent Database
Or, how to have your Bigtable (scale) and eat your MySQL (transactions) too.
In the GFS and Bigtable papers, we saw a clear pattern in Google’s design philosophy: build systems that solve one problem really well, even if it means giving up other features. GFS gave them a file system that scaled across thousands of cheap machines, but it didn’t offer a standard POSIX API. Bigtable gave them a database with incredible scale, but it didn’t have cross-row transactions.
This trade-off worked great for something like web search. But what happens when you need that scale, but you also need transactions?
That’s the problem Spanner was built to solve. Google’s ads team was stuck. They were running on a manually sharded MySQL database, and their last resharding effort took over two years. They needed to scale, but Bigtable wasn’t an option: it has no cross-row transactions, and its cross-datacenter replication is only eventually consistent. You can’t tell an advertiser you might have over-billed them because of eventual consistency.
Spanner was designed to bridge this gap. It was Google’s attempt to build a single, global database that was both scalable and strongly consistent. They pulled it off not with a single magic bullet, but with a series of brilliant, pragmatic trade-offs. The foundation for all of them was a new way to think about time.
The Foundation: A Clock Built on Uncertainty
The big problem with distributed transactions is ordering, and you can’t trust server clocks to provide it. Spanner’s solution wasn’t a perfectly accurate clock but one that is honest about its own error: an API called TrueTime.
TrueTime is a different kind of clock API. Instead of returning a single number, TT.now() returns an interval, [earliest, latest], and that interval is a guarantee: the “true” absolute time lies somewhere inside it, and Google’s infrastructure keeps the window small (typically under 10ms). Companion methods TT.after(t) and TT.before(t) let callers ask whether a timestamp has definitely passed or definitely not yet arrived. The designers realized you don’t need to know the exact time; you just need to know the bounds of your uncertainty. That guarantee is the bedrock for everything that follows.
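To make that concrete, here’s a minimal sketch of the TrueTime interface in Python. The method names (now, after, before) follow the paper; the EPSILON constant and the use of the local system clock are stand-ins for the real implementation, which derives its bound from GPS receivers and atomic clocks in each datacenter.

```python
from dataclasses import dataclass
import time

@dataclass
class TTInterval:
    """Interval returned by TT.now(): the true absolute time is guaranteed
    to lie somewhere in [earliest, latest]."""
    earliest: float
    latest: float

class TrueTime:
    """Minimal sketch of the TrueTime API from the paper. EPSILON stands in
    for the uncertainty bound the real system derives from GPS receivers
    and atomic clocks."""
    EPSILON = 0.007  # hypothetical ~7 ms bound

    def now(self) -> TTInterval:
        t = time.time()  # stand-in for the machine's synchronized clock
        return TTInterval(earliest=t - self.EPSILON, latest=t + self.EPSILON)

    def after(self, t: float) -> bool:
        # True only if absolute time t has definitely passed.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if absolute time t has definitely not yet arrived.
        return self.now().latest < t
```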
Spanner’s Core Trade-offs
Spanner’s genius lies in the bargains it makes. It willingly accepts a small, calculated cost in one area to gain a huge advantage in another.
Trade-off #1: Trading Milliseconds of Latency for Global Consistency
How do you use a “fuzzy” clock to guarantee transaction order? By making a trade: you wait. This is the “Commit Wait” rule.
When a transaction commits, Spanner’s leader assigns it a timestamp, the latest edge of the current TrueTime interval, and then forces itself to pause. It holds all the transaction’s locks and waits until TT.after() confirms the absolute time has passed that timestamp; in practice, that pause is on the order of the uncertainty window, a few milliseconds.
That’s the deal. Spanner trades a few milliseconds of latency on every single write. In return, it gets a mathematical guarantee of external consistency. It’s the price it pays to ensure that if transaction T1 commits before T2 starts, T1’s timestamp is provably smaller than T2’s, across the entire globe.
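Here’s that rule as a sketch, reusing the hypothetical TrueTime class above. The function name and the sleep interval are my own; the two essential steps (pick the latest edge of the interval as the timestamp, then block until TT.after() says it has passed) are the rule from the paper.

```python
import time

def assign_and_commit_wait(tt: TrueTime) -> float:
    """Sketch of the commit-wait rule. In Spanner the leader holds the
    transaction's locks across this whole function; locking is elided here."""
    # 1. Choose the commit timestamp: the latest edge of the current
    #    uncertainty interval, so it is at least the true time right now.
    s = tt.now().latest

    # 2. Commit wait: block until TrueTime guarantees s is in the past.
    #    The expected wait is roughly the width of the interval, a few ms.
    while not tt.after(s):
        time.sleep(0.0005)

    # 3. Only now are the writes made visible and the locks released, so any
    #    transaction that starts later is guaranteed a larger timestamp.
    return s

# commit_ts = assign_and_commit_wait(TrueTime())
```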
Trade-off #2: Trading Heavyweight Writes for Lightweight Reads
The “commit wait” is expensive, but it unlocks a massive payoff by enabling Spanner to operate at two different speeds. This is the second trade-off: making writes slower to make reads much, much faster.
The Heavyweights: Read-Write Transactions. When you need to change data, Spanner uses traditional two-phase locking and pays the “Commit Wait” cost. This is the slow, safe, and correct path.
The Lightweights: Snapshot Reads. Because Spanner is a multiversion database (like Bigtable), it can offer lock-free reads. A read-only transaction picks a fixed timestamp and simply reads the versions of data as of that moment in the past. It takes no locks, never blocks (or is blocked by) a write, and can be served by any replica that’s sufficiently up to date, so it’s blazing fast.
This is how Spanner supports high-throughput applications. It concentrates the cost of consistency on the writes, allowing the vast majority of read traffic to fly.
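To make the two speeds concrete, here’s a toy multiversion store in Python. It’s my own illustration, not Spanner’s storage format: the heavyweight path from the previous sketch appends a new version at its commit timestamp, while a snapshot read just scans for the newest version at or before its read timestamp and never touches a lock.

```python
class MultiversionStore:
    """Toy multiversion key-value store, purely illustrative. Each write is
    kept as a (commit_ts, value) version rather than overwriting in place."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value) in commit order

    def apply_write(self, key, value, commit_ts):
        # Called by the heavyweight path, after locking and commit wait.
        self.versions.setdefault(key, []).append((commit_ts, value))

    def snapshot_read(self, key, read_ts):
        # The lightweight path: no locks, no coordination. Return the newest
        # version whose commit timestamp is at or before read_ts.
        newest = None
        for ts, value in self.versions.get(key, []):
            if ts <= read_ts:
                newest = value
        return newest

store = MultiversionStore()
store.apply_write("balance:alice", 100, commit_ts=10.0)
store.apply_write("balance:alice", 90, commit_ts=12.5)
assert store.snapshot_read("balance:alice", read_ts=11.0) == 100  # old version
assert store.snapshot_read("balance:alice", read_ts=13.0) == 90   # new version
```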
Trade-off #3: Trading a Pure Relational Model for Physical Locality
Even with TrueTime, a transaction that spans datacenters is slow. The fastest distributed transaction is one that isn’t distributed at all. This led to the final, pragmatic trade-off.
Spanner’s schema has a feature called INTERLEAVE IN PARENT: a directive from the developer telling Spanner to physically store child rows (like a user’s Albums) together with their parent row (the User).
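In the paper’s schema language it looks roughly like this (reproduced from the paper’s Users/Albums example from memory, so treat the exact syntax as approximate):

```sql
-- Roughly the Users/Albums example from the Spanner paper.
CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;
```

Every Albums row with a given uid lives inside that user’s key range, so a transaction touching one user and their albums typically stays within a single group of servers.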
This isn’t a “pure” relational model, where data location is abstract. It’s a deliberate choice to trade that purity for performance. By co-locating related data, the most common transactions (updating a single user’s data) become fast, single-site operations that don’t need a slow, global two-phase commit. It’s the same practical spirit as Bigtable’s single-row transactions.
So, Was It Worth It?
The story of F1, the ads backend Google rebuilt on top of Spanner, says it all. After moving from a manually sharded MySQL database to Spanner, the team’s operations became massively simpler. Spanner’s automatic sharding and failover worked so well that when datacenters failed, the event was “nearly invisible” to them.
Spanner completes the story that GFS and Bigtable started. It’s the final piece of the puzzle, built on a series of smart bargains. It proves you can have both scale and consistency, all for the price of a few atomic clocks and a few milliseconds of waiting.