Databases

Credit: Much of the content in this page is based on Vaibhav Aparimit “Fundamentals of System Design” posts.

Applications need data stored somewhere to be retrieved when needed. This data can range from temporary data, which can be easily stored on caches based on files in filesystems or memory, to permanent data, like user personal data (name, home address, etc.) that is usually stored in files, being those files pure plain “text” files or part of data management system like a datatabase or any other proprietary format (i.e: excel file, SAS file, etc.).

In any case, any data storage has to optimize the main requirements from its consumer, like high reads, or high writes, or both, as well as taking care, if needed, of transactionality, consistency, availability. Tihs is ever more true when thinking on distributed systems (see ** CAP ** theorem and NALS ????????? section)

To fulfill these requirements, persistent date needs a good data model

Here Mico Maco will focus mainly on the relational database model and SQL, with some comments on some non-structured/non-SQL models.

Relational model

The (still) prevalent data modeling technique is using relational tables. In relational tables, data is organized into records of a table. Tables are related to one another using primary key foreign key. There are a lot of reasons why your should choose relational table.

1 - You are just building out v1 of your app and data access patterns are not quite clear yet. Relational schemas are always a good first default choice. v1 of all apps are generally shit ( rightly told to me by a senior back-end developer at my office :) ), and no one will fault you for starting with relational tables, instead of fancier NoSQL or Graph DB.

2 - You need to enforce strict schema on write constraints.

3 - You want to maintain zero data redundancy. Normalization of schema in relational models has the effect of shredding information into many tables

4- Your data model has many to one and many to many relationships. In other words you know which joins will be performed before hand. Joining and querying relational databases using a declarative language like SQL is one of the greatest secret sauce of relational databases. A lot of research and effort has gone into making relational queries super fast. An application developer just has to specify the expected data pattern of the query. The query engine will convert the SQL query into an optimized code to fetch/write data.

NoSQL Database

Application development is done using object oriented programming. However when data storage is done using tables, there is a translation required from objects to “shredded” relational tables. ORM frameworks provide boilerplate code to reduce the effort required for this translation, but still there is work to be done.

NoSQL solves this tables to objects translation problem by representing a record as a self contained JSON document. For example, let’s say we need to store patient demographic information along with his current conditions. One way to represent this record in a NoSQL database can be :

{ “first_name”: “John”, “last_name”: “Doe”, “conditions”:[ {“name” : “T2DM”, “onset”: “12–12–1990”}]}

If you need read profile data for a patient in your app, you need not issue multiple joins as all data is inside one NoSQL document. Typically, if your data model exhibits a tree like, one to many relationships, using a NoSQL database might make more sense.

In the patient example above, what if we wanted to store ICD10 standard codes for conditions instead of name of condition. This is a little troublesome in document based NoSQL databases as they have little support for joins. You can still do a join at the application layer but this will be always suboptimal compared to the joins done at a typical relational database layer. NoSQL databases become less desirable in this case.

Lastly in case your data model does not have a fixed schema, going the NoSQL route might make more sense. Consider, the patient example above and say we need to also store patient’s date of birth. In NoSQL case, we can add a new field, ‘dob’ to new documents. At the application level we can also add code to handle reading old documents without dob field. In a relational database, the solution to handle dob would be to alter schema and make data migrations. Data migrations are slow and require downtime and consequently generally avoided.

Graph Database

Graph database makes a lot of sense when your application’s data model needs to support many ‘many-many’ relationships. The relational model can handle a few many to many relationships, but beyond a point all the relational joins become messy and slow. Using graph databases also provides an added advantage of easily extending relationships between heterogeneous objects.

A graph consists of two kinds of objects — nodes and edges. Nodes contain description of objects or entities. Edges contain description of relationships between nodes. For example, say a person suffers from an allergic reaction because of exposure to an substance. You can model person, allergic reaction and substance as nodes. You can also model relationships between nodes in a graph database. For example a relationship between person and allergy can be a unidirectional relationship “person-has-allergy” from person node to allergy node. Relationship between allergy and and substance can be a unidirectional relationship “triggered-by-exposure-to” from allergy to substance.

Why not do this in relational tables ? Well you can. You can create three tables — person, allergy and substance and set up the appropriate Pk-FK constraints. Like I said earlier, using graph databases makes sense when we have a lot of many-many relationships. For example, let’s say you introduce a location object in the overall scheme of things so that you can capture location of the person where the allergy reaction occurred. Location can be neighbourhood, city, state, country, continent or hemisphere. Basically, location information can be available at various levels of granularity. Using SQL to create a declarative query will be messy. SQL needs to know in advance which joins will be part of the query. In graph database, on the other hand, you can traverse many nodes and edges before arriving at the target node. You can express the fact of traversing a graph once or many time quite concisely using a graph database declarative query language like Cypher, for Neo4j graph database.

General concepts

Indexes are an esential part of relational databases. They an abstraction to favor high read throughput, but make writes slower. This is why not all in a database is indexed.

B-trees is the most common way to index data in the relational database word. Other databases, like ElasticSearch and Cassandra, use the indexing system defined in Google’s BigTable, known as Log Structured Merge Tree or LSM Tree

As Google’s paper define, Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes

As a thumb rule, LSM trees can handle higher write workloads and B-trees are good for high read workload. This is because writes in SSTables are always sequential unlike B trees, where random writes take place (B-tree pages need not be sequentially arranged on disk)

Sometimes when database is experiencing very high written workload, compaction and merging of LSM trees adversely affects reads as many more SSTables need to be read. This is a big disadvantage of LSM trees that, in very high write throughputs scenarios, its performance can get unstable, unlike that of B-trees which exudes generally stable performance.

If transactional semantics are of paramount importance, then B-trees are more preferable. In LSM trees, same key can be present in multiple SSTable. In B-tree, one key is present at only place and it’s value is updated in-place. As a result, transactional isolation is easily achieved in Btrees

Be also aware the in ElasticSearch terminology, and index is the equivalent of a database in the relational world. As stated in the official blog site:

The easiest and most familiar layout clones what you would expect from a relational database. You can (very roughly) think of an index like a database.

MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indices => Types => Documents with Properties

An Elasticsearch cluster can contain multiple Indices (databases), which in turn contain multiple Types (tables). These types hold multiple Documents (rows), and each document has Properties(columns).

ACID

ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc. In the context of databases, a sequence of database operations that satisfies the ACID properties (and these can be perceived as a single logical operation on the data) is called a transaction.

CRUD

CRUD (create, read, update, and delete) are the four basic functions of persistent storage.

The CRUD cycle is designed as a method of functions for enhancing persistent storage, for instance with a database of records. As the name suggests, persistent storage outlives the processes that created it. These functions embody all the hallmarks of a relational database application.

Be aware that some people mix CRUD and REST concepts. While REST is a robust API architecture, CRUD is a cycle for keeping records current and permanent. Read this interesting post to understand the differences among them