Google’s spanner handles trillions of rows of data and Google is shifting away from NoSQL and to NewSQL. Google believes it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
A complicating factor for an Open Source effort is that Spanner includes the use of GPS and Atomic clock hardware.
Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
The servers in a Spanner universe.
A zone has one zonemaster and between one hundred and several thousand spanservers. The former assigns data to spanservers; the latter serve data to clients. The
per-zone location proxies are used by clients to locate the spanservers assigned to serve their data. The universe master and the placement driver are currently singletons. The universe master is primarily a console that displays status information about all the zones for interactive debugging.
Amazon is somewhat competitive with datacenter reliability but they charge to replicate across clouds. Amazon does failover automatically across clouds and data centers.
A few decades ago Toyota and Japanese car makers had several times more reliability and quality than competing car companies. This required having a different company culture. Orders of magnitude greater reliability and quality can be competitive weapons that enable things to be done that are impossible for those without the quality and reliability. Google also operates at levels of scale that competitors cannot match.
To summarize, Spanner combines and extends on ideas from two research communities: from the database community, a familiar, easy-to-use, semi-relational interface, transactions, and an SQL-based query language; from the systems community, scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution. Since Spanner’s inception, we have taken more than 5 years to iterate to the current design and implementation. Part of this long iteration phase was due to a slow realization that Spanner should do more than tackle the problem of a globally replicated namespace, and should also focus on database features that Bigtable was missing.
One aspect of our design stands out: the linchpin of Spanner’s feature set is TrueTime. We have shown that reifying clock uncertainty in the time API makes it possible to build distributed systems with much stronger time semantics. In addition, as the underlying system enforces tighter bounds on clock uncertainty, the overhead of the stronger semantics decreases. As a community, we should no longer depend on loosely synchronized clocks and weak time APIs in designing distributed algorithms.
We have spent most of the last year working with the F1 team to transition Google’s advertising backend from MySQL to Spanner. We are actively improving its monitoring and support tools, as well as tuning its performance. In addition, we have been working on improving the functionality and performance of our backup/restore system. We are currently implementing the Spanner schema language, automatic maintenance of secondary indices, and automatic load-based resharding. Longer term, there are a couple of features that we plan to investigate. Optimistically doing reads in parallel may be a valuable strategy to pursue, but initial experiments have indicated that the right implementation is non-trivial. In addition, we plan to eventually support direct changes of Paxos conﬁgurations.
Given that we expect many applications to replicate their data across datacenters that are relatively close to each other, TrueTime may noticeably affect performance. We see no insurmountable obstacle to reducing below 1ms. Time-master-query intervals can be reduced, and better clock crystals are relatively cheap. Time-master query latency could be reduced with improved networking technology, or possibly even avoided through alternate time-distribution technology.
Finally, there are obvious areas for improvement. Although Spanner is scalable in the number of nodes, the node-local data structures have relatively poor performance on complex SQL queries, because they were designed for simple key-value accesses. Algorithms and data structures from DB literature could improve singlenode performance a great deal. Second, moving data automatically between datacenters in response to changes in client load has long been a goal of ours, but to make that goal effective, we would also need the ability to move client-application processes between datacenters in an automated, coordinated fashion. Moving processes raises the even more difﬁcult problem of managing resource acquisition and allocation between datacenters
This new database, Spanner, is a database unlike anything we’ve seen. It’s a database that embraces ACID, SQL, and transactions, that can be distributed across thousands of nodes spanning multiple data centers across multiple regions. The paper dwells on two main features that define this database:
Schematized Semi-relational Tables — A hierarchical approach to grouping tables that allows Spanner to co-locate related data into directories that can be easily stored, replicated, locked, and managed on what Google calls spanservers. They have a modified SQL syntax that allows for the data to be interleaved, and the paper mentions some changes to support columns encoded with Protobufs.
“Reification of Clock Uncertainty” — This is the real emphasis of the paper. The missing link in relational database scalability was a strong emphasis on coordination backed by a serious attempt to minimize time uncertainty. In Google’s new global-scale database, the variable that matters is epsilon — time uncertainty. Google has achieved very low overhead (14ms introduced by Spanner in this paper for datacenters at 1ms network distance) for read-write (RW) transactions that span U.S. East Coast and U.S. West Coast (data centers separated by around 2ms of network time) by creating a system that facilitates distributed transactions bound only by network distance (measured in milliseconds) and time uncertainty (epsilon).