We often read about the “data explosion” phenomenon, but in parallel with that, there has also been a “database explosion” in recent years. Cassandra… MongoDB… NoSQL… NewSQL… the database technology landscape in 2015 can be daunting and confusing. From Oracle to VoltDB it has been a 30+ year journey, which we will revisit with the intent to better understand how this technology has evolved over time and what choices are available today when designing the persistence layer of an enterprise software system.
Let’s begin by rewinding the clock to where it all started.
The Dawn of SQL
1970: at IBM San Jose Research Lab, Edgar F. Codd conceptualized a model for data access consisting of a set of tables containing rows, with each row representing an entity of interest. Data can be normalized to eliminate duplication, and relationships between tables can be enforced by foreign keys. The relational model was born.
Codd’s research started from a simple premise: to be able to ask the computer for information, and then let the computer figure out how to retrieve it. No low-level knowledge of data structures should be required. His idea spawned the creation of an industry-standard computer language for working with relational databases: the Structured Query Language, or SQL.
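The ideas above can be sketched in a few lines. Using SQLite (via Python’s standard library) as a stand-in for any RDBMS, the snippet below builds two normalized tables linked by a foreign key, then reconstructs the relationship declaratively with a join, with no knowledge of how the data is physically stored. The table and column names are illustrative, not from any particular system.

```python
import sqlite3

# Two normalized tables: customer data is stored once, and each order
# references its customer through a foreign key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
                    id INTEGER PRIMARY KEY,
                    customer_id INTEGER REFERENCES customer(id),
                    amount REAL)""")
conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")

# We ask *what* we want; the engine decides *how* to retrieve it.
row = conn.execute("""SELECT c.name, o.amount
                      FROM orders o
                      JOIN customer c ON o.customer_id = c.id""").fetchone()
print(row)  # ('Alice', 25.0)
```

The join here is the key point: relationships live in the data itself, not in application code.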
At that time, computer memory on commodity hardware was a fairly limited resource, so the only option was to design relational database software as disk-based systems. This forced choice would have a significant impact on performance, as accessing data on disk is several orders of magnitude slower than accessing data in memory.
The first commercially available relational database management system (RDBMS) was Oracle, released in 1979 by Relational Software (now Oracle Corporation). Other notable early RDBMS commercial releases were Informix (1981), IBM DB2 (1983) and Sybase (1987).
Apart from supporting SQL and the relational model, the commercial RDBM systems also provide support for transactions, multi-user concurrent access, and failure recovery.
Simply put, a transaction consists of a collection of data updates that the RDBMS treats as a unit of work: all the updates within the transaction either succeed (commit) or fail (rollback). For example, in a financial application, if you move money from account A to account B, you want both account balances to be updated if all is well, or in case of an exception, no updates should occur. With correct use of transactions, the database is always left in a consistent state, with minimal effort on the part of the programmer. Any application with integrity constraints greatly benefits from transaction support.
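The account-transfer scenario can be illustrated with a short sketch, again using SQLite as a stand-in for any transactional RDBMS. The simulated failure is contrived for illustration; the point is that after a rollback, neither balance has changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 100.0), ("B", 0.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money as one unit of work: both updates commit, or neither does."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if amount > 50:                      # simulate a crash mid-transaction
            raise RuntimeError("simulated failure")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()                        # both updates become permanent
    except Exception:
        conn.rollback()                      # the partial update is undone

transfer(conn, "A", "B", 60.0)               # fails -> rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'A': 100.0, 'B': 0.0} -- still consistent
```

With the failing update rolled back, no money has been created or destroyed, and the programmer wrote no compensation logic to achieve that.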
Transactions, concurrent access and failure recovery help to build applications that are robust, accurate and durable, but at a cost. RDBM systems spend a large share of CPU cycles on the locking, latching and logging required to support these features, which slows down performance.
However, the flexibility of the relational model and SQL, the robustness provided by transaction support and an adequate performance for most business applications have made RDBM systems very successful and the unchallenged persistence technology for more than two decades.
But something was about to change.
The Data Explosion and the Arrival of NoSQL
2000: The evolution of the World Wide Web has created global-reaching organizations in the fields of Internet search, social media, blogging, collaboration and online retail which had to cope with massive amounts of data, growing at an unprecedented scale.
This scenario represented a challenge for the traditional SQL relational databases. To start with, RDBM systems were originally designed to run as standalone servers. A single machine cannot scale beyond the physical hardware limits of disk space, memory and processing power. What happens when you reach the limit? The only solution is to combine the resources of several machines together in a cluster.
A cluster is a set of computers that perform the same task, controlled and coordinated by software in such a way that it can be viewed as a single system from a client’s perspective.
Some of the traditional SQL RDBM systems (e.g. Oracle RAC, Microsoft SQL Server) can be clustered using shared-disk architecture. In this configuration the database relies on disk storage that is shared by all the computers in the cluster and accessed via the network. This configuration does offer a degree of scalability, but the disk subsystem still represents a single point of failure which limits availability.
The level of scale and availability required by the Web 2.0 and other Big Data use cases demanded clusters with shared-nothing architecture. In shared-nothing clusters, each machine uses its own local hardware resources (processor, memory and disk) and, as the name implies, it shares nothing with the other machines in the cluster. Requests from clients are routed to the machine that owns the resource.
The relational model used by traditional SQL RDBM systems was also not well suited to distribution across a cluster without heavy management from application code outside the database.
Because of these limitations, some Web giants, primarily Google and Amazon, started to work on a new class of databases that would become known under the umbrella term of NoSQL with the following design goals:
- Shared-nothing clustering
- Fast access to large amounts of data
- Data models that are cluster-friendly
The NoSQL family of databases did not adopt a standard data model or a standard query language. The only common denominator is that they are non-relational. In practice, the majority of NoSQL implementations use either an aggregate-oriented or a graph data model.
An aggregate, as defined in domain-driven design, is a collection of objects that we wish to treat as a unit for data manipulation and consistency management. For example, an order with its associated line-item, payment and delivery objects could be treated as a single aggregate. Aggregates are used in key-value, column family and document data models.
The graph data model, on the other hand, allows the user to store entities and the relationship between those entities. It is suitable for models with a large number of small records and complex relationships.
Aggregates allow the fast retrieval of large chunks of data. In key-value databases, for example, a single lookup for a given key is all that is needed to retrieve an aggregate of any size. Achieving the same in a relational model would typically require several queries or joins, and a much longer execution time.
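A toy sketch makes the contrast concrete. Below, a plain dictionary stands in for a key-value store, and a whole order aggregate (order, line items and payment together) is serialized under a single key; the field names are invented for illustration. One lookup returns everything that several relational joins would otherwise have to assemble.

```python
import json

kv_store = {}  # stand-in for a key-value database

# The entire aggregate is written as one value under one key.
order_aggregate = {
    "order_id": 42,
    "customer": "Alice",
    "line_items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "payment": {"method": "card", "amount": 35.0},
}
kv_store["order:42"] = json.dumps(order_aggregate)

# One key lookup retrieves the aggregate, whatever its size -- no joins.
aggregate = json.loads(kv_store["order:42"])
print(aggregate["payment"]["amount"])  # 35.0
```

The trade-off is that the database only understands opaque values: queries *within* an aggregate (say, all orders containing SKU "A1") are no longer the store’s job.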
To make data access even faster, the NoSQL designers also decided it was acceptable to sacrifice consistency in exchange for performance, adopting an “eventual consistency” model. In practice this means that, for a period of time following a database update, some nodes in the cluster may serve stale data, which most applications can tolerate.
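The idea can be illustrated with a deliberately simplified simulation (not a real replication protocol): a write lands on one replica first, other replicas only catch up when asynchronous replication runs, and a read in between returns the old value.

```python
# Three replicas of the same record, initially in agreement.
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write(key, value):
    replicas[0][key] = value      # the write is acknowledged by one node

def replicate():
    for r in replicas[1:]:        # asynchronous catch-up, some time later
        r.update(replicas[0])

write("x", 2)
stale = replicas[2]["x"]          # read before replication: old value
replicate()
fresh = replicas[2]["x"]          # after convergence: new value
print(stale, fresh)  # 1 2
```

“Eventual” is the operative word: once updates stop flowing, all replicas converge on the same value, but there is a window in which different clients can observe different answers.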
The pioneering work by Google and Amazon spawned a number of NoSQL implementations; some examples (by data model) include:
- Key-value: Riak (Basho Technologies), Oracle NoSQL Database (Oracle Corporation), Redis (Pivotal Software)
- Document: MongoDB (MongoDB Inc.), CouchDB (Apache Foundation)
- Column family: Cassandra (Apache Foundation), HBase (Apache Foundation)
- Graph: Neo4j (Neo Technology), InfiniteGraph (Objectivity Inc.)
The NoSQL databases support highly scalable and highly available architectures. Some implementations can efficiently store unstructured data, such as video and audio files, which are not handled very well by the traditional RDBM systems. With NoSQL, a wider range of data models became available, catering for specialized use cases that relational models and SQL handle poorly.
Despite these advantages, there are downsides. The trade-off between consistency and latency cannot be accepted by all applications; the lack of support for transactions places a heavy burden on developers, who must work around it in application code; and the absence of a standard query language means specialized knowledge is required to extract data.
The NewSQL Era
First used in 2011, the NewSQL designation refers to a class of contemporary relational database management systems designed to offer the same scalability as NoSQL stores, via shared-nothing clustering, while maintaining the transaction support and consistency guarantees of traditional SQL RDBM systems. This new class of databases aims to yield better performance by adopting new architectures that vastly reduce the overhead required to support transactions, consistency and failure recovery. For example, most implementations use replication instead of transaction logs to achieve high availability and durability, which is much more efficient.
The amount of memory now available on commodity hardware means that most NewSQL stores can be implemented as main-memory databases, which vastly reduces data access times.
All NewSQL databases support the relational data model, as well as the standard SQL query language. To make the relational model more easily distributable over a cluster, most implementations provide a middleware layer that automatically splits relational data across multiple nodes, a technique called sharding.
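The routing half of sharding can be sketched in a few lines: a thin layer hashes each row’s key to decide which node of a shared-nothing cluster owns it, so reads and writes for the same key always land on the same node. This is a minimal illustration of the idea, not any particular product’s middleware; the node names are made up.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]        # hypothetical cluster members
shards = {node: {} for node in NODES}         # each node's local storage

def node_for(key: str) -> str:
    """Deterministically map a key to the node that owns it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

def insert(key: str, row: dict):
    shards[node_for(key)][key] = row          # route the write to the owner

def lookup(key: str) -> dict:
    return shards[node_for(key)][key]         # the same hash routes the read

insert("user:17", {"name": "Alice"})
print(lookup("user:17"))  # {'name': 'Alice'}
```

Real systems add considerable machinery on top of this, such as rebalancing when nodes join or leave and routing of queries that span shards, but the core mapping of key to owning node is the same.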
Some implementations of this class of databases are:
- MySQL Cluster (Oracle Corporation)
- Spanner (Google)
- VoltDB (VoltDB Inc)
- MemSQL (MemSQL Inc)
- SQLFire (Pivotal Labs)
2015: as the amount of data produced in the digital world continues to grow at an exponential rate, it is reasonable to assume that database technology will continue to evolve at a similar pace.
We left out of this discussion third-party hosted alternatives such as cloud databases and Database as a Service (DBaaS). There are also multi-model configurations, which combine two or more database types coordinated by a common manager, and hybrid solutions where a caching technology (such as Oracle Coherence) creates an in-memory key-value layer on top of a traditional disk-based RDBMS.
My aim is to follow up and report developments in this space. In another post, we will look at what considerations should be applied when selecting a persistence technology for your project.