Have you read part I of this blog? If so… Carry on!
If not, here’s the link: https://bit.ly/3dBZWl0
Right now you should be pretty sure that your data isn’t relational. If so, NoSQL might be a good fit for you. The term itself though, seems to cause a lot of confusion. It’s a very broad term, which in reality, tells us very little about a database. Except that it’s…Well…Not (only) SQL.
NoSQL is just the overarching name we use for types of databases that each fill their own niche. If you’re interested in some actually valuable information, look for the NoSQL type. Cassandra, the one we went with in one of the previous blogs, is a wide-column store and this subtype is not to be confused with key-value stores, document stores or graph stores. NoSQL’s vague definition actually leads to a lot of discussion online whether we should even consider graph stores to be NoSQL. To make things easier, let’s assume that we do.
Before you consider your options in choosing the right database, you should at least know what your options are. We’ll touch on each one briefly, but feel free to skip the NoSQL subtypes you already know.
Type 1: the key-value store
The key-value store is the least complex type and lends some of its aspects to the others, so let’s start there. A basic example would be a phonebook: a list of names, their corresponding phone numbers and nothing else.
The unique key, the name, is what you would use to look up what you are actually after. In this case, that value would be the phone number. To store data, the key goes through a hash function and the result determines where the entry lands in a hash table. This blog won’t go too low-level, but most importantly, this allows for blazingly fast lookups, because you always know exactly where in the hash table your key is located.
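To make the bucket idea a bit more tangible, here is a minimal Python sketch of a phonebook backed by a hash table. The bucket count, the names and the phone numbers are made up for illustration, and a real key-value store is far more sophisticated than this.

```python
import zlib

NUM_BUCKETS = 8  # assumption: a tiny, fixed-size table, just for illustration


def bucket_for(key: str) -> int:
    """Hash the key and map the result onto one of the buckets."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


class PhoneBook:
    """A toy key-value store: name (key) -> phone number (value)."""

    def __init__(self) -> None:
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def put(self, name: str, number: str) -> None:
        bucket = self.buckets[bucket_for(name)]
        for i, (existing, _) in enumerate(bucket):
            if existing == name:
                bucket[i] = (name, number)  # overwrite an existing key
                return
        bucket.append((name, number))

    def get(self, name: str):
        # Only one bucket is inspected, no matter how large the book gets.
        for existing, number in self.buckets[bucket_for(name)]:
            if existing == name:
                return number
        return None


book = PhoneBook()
book.put("Alex Smith", "555-0100")
book.put("Peter Smith", "555-0101")
print(book.get("Alex Smith"))
```

The lookup never walks the whole phonebook: the hash tells it which bucket to open, and that is where the speed comes from.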
In its most basic form, a hash table is unsorted. That is not quite what we want, because varying first names could put all Smiths in entirely non-adjacent buckets (slots that point to the actual data). Quickly filtering on all people that have a certain last name seems like something you would want from a database, so preferably related keys can be found together. Popular key-value stores like Redis offer extra data structures on top of plain key-value pairs, such as sorted sets, to make this kind of lookup possible.
Type 2: the document store
The document store is without doubt the most popular NoSQL database type. For the most part, this seems to be the result of a great marketing campaign by MongoDB. Using MongoDB libraries in popular frameworks like Spring Boot seems oddly familiar. Whipping up proofs of concept, you could easily mistake it for a relational database. Interfacing seems the same and surely, it’s not going to complain about your relational data model. At least not when your database only has a few records in it. It may seem to work even better, a breeze really, when it’s not constantly bothering you with annoying little terms like foreign key constraints and data integrity. However, it is not optimized or even meant to store relational data. The fact that it will let you do it without protest sounds more like a bug than a feature.
So if it’s not a relational database, then what is it? Essentially, it is like the key-value store. You have a uniquely generated key, which in the MongoDB example above is the ‘_id’ field, and the value is the document itself. The big difference with a simple key-value store is that the value in itself consists of fully searchable key-value pairs that you can nest many levels deep. The document is semi-structured in the sense that it does have fields, but those fields are not predefined by a schema. For example, the blog document does not have to have the ‘creation_date’ field. In a relational database, the ‘creation_date’ field could be empty, but it would need to be there if it was defined in the schema.
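As a rough sketch of what such blog documents could look like, here they are as plain Python dictionaries. Only ‘_id’, ‘creation_date’ and ‘comments’ come from this blog; every other field name and all values are assumptions made for illustration, and a dictionary is of course only a stand-in for a real document store.

```python
# Two documents in the same collection, with different fields.
blog_with_date = {
    "_id": "blog-1",  # hypothetical id; MongoDB would normally generate one
    "title": "Choosing a database",
    "creation_date": "2020-05-01",
    "comments": [
        {"user": "alex", "text": "Nice read!"},
        {
            "user": "peter",
            "text": "What about graph stores?",
            # Values can be nested further: a comment holding its replies.
            "replies": [{"user": "alex", "text": "See type 3."}],
        },
    ],
}
blog_without_date = {
    "_id": "blog-2",
    "title": "Another post",
    "comments": [],  # no creation_date at all, and nobody complains
}

# The collection itself behaves like a key-value store: _id -> document.
collection = {doc["_id"]: doc for doc in (blog_with_date, blog_without_date)}


def blogs_commented_on_by(collection: dict, user: str) -> list:
    """Search inside the documents instead of only on the top-level key."""
    return [
        doc["_id"]
        for doc in collection.values()
        if any(c.get("user") == user for c in doc.get("comments", []))
    ]


print(collection["blog-2"].get("creation_date"))
print(blogs_commented_on_by(collection, "peter"))
```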
Another thing that sets it apart from a relational database is that you should really limit relations to other documents. The ‘comments’ field for example, would be a completely separate table in a relational database and you would have a foreign key referencing the blog id. Here, it’s just thrown into the same document, because it makes sense. You would very rarely want to look up comments outside of the context of the blog. If you actually wanted to show a user page with all the comments a user has made, you would have to do wildly inefficient joins on the application level. That’s right, MongoDB has no joins in the relational sense. Its aggregation framework does offer a $lookup stage (a left outer join, available since version 3.2), but outside of that you’ll have to simulate joins by querying the first collection, then querying the second one with the applicable key values of the documents returned by the first one. Or, as an alternative, you would have to duplicate data. For example by appending comments to both the blog document and a user document. In that case, maintaining integrity is in the hands of the gods… or the application.
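The two-step querying described above can be sketched in a few lines of Python. The two lists stand in for two collections, and all field names and values are hypothetical; against a real database, each step would be a separate round trip.

```python
# Hypothetical collections, kept in memory for the sake of the example.
users = [
    {"_id": 1, "name": "Alex"},
    {"_id": 2, "name": "Peter"},
]
comments = [
    {"_id": 10, "user_id": 1, "blog_id": "blog-1", "text": "Nice read!"},
    {"_id": 11, "user_id": 2, "blog_id": "blog-1", "text": "What about graphs?"},
    {"_id": 12, "user_id": 1, "blog_id": "blog-2", "text": "Good follow-up."},
]


def comments_by_user_name(name: str) -> list:
    """Simulate a join in the application: two queries instead of one."""
    # Query 1: find the matching users in the first collection.
    user_ids = {user["_id"] for user in users if user["name"] == name}
    # Query 2: use the keys from query 1 to filter the second collection.
    return [c for c in comments if c["user_id"] in user_ids]


for comment in comments_by_user_name("Alex"):
    print(comment["blog_id"], comment["text"])
```

Every extra "join" adds another query and more glue code, which is exactly the work a relational database would have done for you.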
So remember, yes, you could put anything you want in a document, but with great flexibility comes great responsibility. The fact that you don’t have to think about schemas, doesn’t mean that you shouldn’t.
In the comments example we just gave, it would probably make the most sense to go for the last option, duplicating your data. Duplicating data means denormalizing your database. Ideally, you would decide on this at the drawing board and you would design your data model very carefully around that. All of a sudden that takes away a big part of that flexibility that MongoDB seemingly offered in the first place.
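To illustrate what "in the hands of the application" means, here is a small sketch of duplicated comments. The dictionaries stand in for a blog collection and a user collection, and every name in it is an assumption made for illustration.

```python
# Hypothetical collections: the same comment is stored in two places.
blogs = {"blog-1": {"_id": "blog-1", "comments": []}}
users = {"alex": {"_id": "alex", "comments": []}}


def add_comment(blog_id: str, user_id: str, text: str) -> None:
    """Write the comment twice: once per document that needs to show it."""
    comment = {"blog_id": blog_id, "user_id": user_id, "text": text}
    blogs[blog_id]["comments"].append(dict(comment))
    users[user_id]["comments"].append(dict(comment))


def is_consistent(user_id: str) -> bool:
    """Check that each comment on the user document also exists on its blog."""
    return all(
        comment in blogs[comment["blog_id"]]["comments"]
        for comment in users[user_id]["comments"]
    )


add_comment("blog-1", "alex", "Nice read!")
print(is_consistent("alex"))

# Nothing in the database stops code from updating only one of the copies.
blogs["blog-1"]["comments"][0]["text"] = "Nice read! (edited)"
print(is_consistent("alex"))
```

The second check fails because only one copy was edited. Keeping both copies in sync is a rule your own code has to enforce on every write path.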
Type 3: the graph store
The graph store is something that is definitely gaining a lot of traction, although it is still quite rare to spot one in production outside of the bigger tech companies. It is all about relations, yet just like a document store, it should not be mistaken for a relational database. Relations mean something entirely different here. I will come back to that later, but it’s important to know that the focus is on relations between entities, not necessarily on the entities themselves.
Take the example above. The two circles represent two entities, which we call nodes in a graph database. In this case, the nodes each represent a person. A person has a unique identifier and some attributes in key-value pairs. That’s not so different from what we would see in a Person table in a relational data model. Just like in other NoSQL databases though, each entity does not necessarily need to have the same key-value pairs. The biggest difference here, is that we don’t really care all that much about the properties of an individual person. We’re interested in what connects them.
For example: Alex and Peter apparently became friends in 2005. We call that a bidirectional relationship, since they kind of both need to acknowledge this for a friendship to exist. Each relationship is stored in the database and has its own properties. At some point, Peter started a company and asked Alex to join. This introduced a new unidirectional relationship. Unfortunately, things took a dark turn, as Peter had to let his friend go in 2019. Another unidirectional relationship now directly and indirectly influences the other two. He certainly isn’t Alex’s boss anymore, but are they still friends?
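The story of Alex and Peter could be modelled roughly like this in Python. The class names and the relationship labels (FRIENDS_WITH, EMPLOYS, FIRED) are assumptions for illustration; a real graph store has its own storage format and query language.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    kind: str
    start: int  # node_id where the relationship starts
    end: int    # node_id where the relationship ends
    directed: bool = True
    properties: dict = field(default_factory=dict)


alex = Node(1, {"name": "Alex"})
peter = Node(2, {"name": "Peter"})

# Relationships are stored as first-class records with their own properties.
relationships = [
    Relationship("FRIENDS_WITH", alex.node_id, peter.node_id,
                 directed=False, properties={"since": 2005}),
    Relationship("EMPLOYS", peter.node_id, alex.node_id),
    Relationship("FIRED", peter.node_id, alex.node_id,
                 properties={"year": 2019}),
]


def outgoing(node: Node, kind: str) -> list:
    """Relationships of one kind that can be followed starting from `node`."""
    return [
        r for r in relationships
        if r.kind == kind
        and (r.start == node.node_id
             or (not r.directed and r.end == node.node_id))
    ]


print(len(outgoing(alex, "FRIENDS_WITH")), len(outgoing(peter, "FRIENDS_WITH")))
print(len(outgoing(alex, "FIRED")), len(outgoing(peter, "FIRED")))
```

The friendship can be followed from either side, while the FIRED relationship only starts at Peter.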
You see the point. When it’s not the entities themselves that matter, but the relations between them, it becomes very hard to organically keep up with that in a conventional database. An SQL wizard might grunt that it’s definitely possible and that there is no need to look at unfamiliar technologies. He’s right. What about when indirect relationships are involved? Is he still so sure if we turn things up a notch?
Are you still feeling confident when we’re coming up with questions on the fly? What do employee A and employee B have in common? A graph store can easily identify them as having the same boss, but even that SQL wizard might admit that getting the same answer from a conventional database will be a daunting task.
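A question like "what do these two have in common?" boils down to comparing the relationships of two nodes. Below is a minimal sketch with a hypothetical edge list; the node names, the relationship labels and the cities are all made up, and a real graph store would answer this with its own query language rather than a Python loop.

```python
# Hypothetical edges: (source node, relationship kind, target node).
edges = [
    ("employee_a", "REPORTS_TO", "boss"),
    ("employee_b", "REPORTS_TO", "boss"),
    ("employee_a", "LIVES_IN", "city_x"),
    ("employee_b", "LIVES_IN", "city_y"),
]


def neighbours(node: str) -> set:
    """Everything a node points at, together with the kind of relationship."""
    return {(kind, target) for source, kind, target in edges if source == node}


def in_common(a: str, b: str) -> set:
    """Relationships that both nodes share with the very same target."""
    return neighbours(a) & neighbours(b)


print(in_common("employee_a", "employee_b"))
```

Note that the question did not have to be known in advance: no join table or extra index was designed for "same boss", the answer simply falls out of the relationships.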
Type 4: a wide-column store
The last of the NoSQL types is a wide-column store and we’ll explain the wide-column store from a Cassandra perspective. Like most NoSQL databases, it uses some functionality of a key-value store. Essentially, you could see it as a hybrid between a key-value store and a table-like database.
At first sight, you can see why. It has keys like ‘company’, ‘last_name’, ‘id’ and so forth, and their values are written underneath. It’s also presented somewhat like a table, because each person has its own row. The second thing that you notice is that typical flexible schema. Peter doesn’t seem to have a phone number and there’s no space reserved for it in the table like in a relational database. Smith has a phone number, but there is no mention of a first name.
Cassandra is distributed, which means that there might be many database nodes, each containing a portion, but not all of the data. I’ll follow up later on how Cassandra manages distribution. For now, it’s important to know that it happens, because it heavily influences the data modelling seen in the image above.
Although the three rows seemingly belong to the same ‘table’, the database will likely separate them. It does this because company is the chosen partition key. A partition key is not what makes a person unique; that’s the job of the primary key. Instead, it bundles everything that is often queried together. Cassandra is even less flexible in this than MongoDB. If your data cannot be effectively denormalized, do not expect this technology to work out for you.
Other than querying all people that work for the same company, I might also expect people to be interested in finding out where every Smith works. In that case, another denormalized table like the one below is needed. Why? As mentioned, Cassandra is a distributed database and it uses the partition key to distribute data. The original table used Company as partition key. This guarantees that data belonging to the same company lives in the same partition. Therefore, if I wanted to retrieve all Smiths without the extra table, I would have to look through every single partition in my cluster to get the data. This is called a cluster scan and it is highly inefficient. By default, Cassandra will actively stop you from doing this. You can enable it, but don’t.
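The difference between reading one partition and scanning all of them can be sketched with two dictionaries. The rows below are invented for illustration (the original tables are images), and the dictionaries only mimic how rows are grouped by partition key; they say nothing about how Cassandra stores data on disk.

```python
# Hypothetical rows; note that not every row has the same columns.
rows = [
    {"company": "Archers", "id": 1, "first_name": "Peter", "last_name": "Archer"},
    {"company": "Archers", "id": 2, "last_name": "Smith", "phone": "555-0100"},
    {"company": "Bakers", "id": 3, "first_name": "Alex", "last_name": "Smith"},
]


def build_table(rows: list, partition_key: str) -> dict:
    """Group rows into partitions, one partition per partition key value."""
    table = {}
    for row in rows:
        table.setdefault(row[partition_key], []).append(row)
    return table


# The same data, denormalized into one table per query we want to serve.
people_by_company = build_table(rows, "company")
people_by_last_name = build_table(rows, "last_name")


def read_partition(table: dict, key: str) -> list:
    """The cheap path: go straight to the single partition that has the key."""
    return table.get(key, [])


def cluster_scan(table: dict, column: str, value: str) -> tuple:
    """The expensive path: visit every partition and filter along the way."""
    visited, found = 0, []
    for partition in table.values():
        visited += 1
        found.extend(row for row in partition if row.get(column) == value)
    return found, visited


print(read_partition(people_by_last_name, "Smith"))
print(cluster_scan(people_by_company, "last_name", "Smith"))
```

Both calls find the same two Smiths, but the scan had to open every partition to get there. With two partitions that is harmless; with millions spread over many nodes it is not.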
None of this really speaks to its advantage, but its built-in distribution and replication just might. Spreading data across all available nodes is not random, as this would mean painfully slow read-times. On top of that, there are no master-slave relationships in Cassandra, because we don’t want a single point of failure. Each node needs to be able to serve requests. Taken together, this means Cassandra needs a transparent and lightweight mechanism to calculate where it can write and read your data.
The Partitioner is responsible for this and each node has one. It takes its name from the partition key we mentioned earlier. The Partitioner hashes that key and just like hash buckets in a key-value store, it knows exactly which node holds a certain hash range. In Cassandra we call them tokens and token ranges.
A practical example might clear things up. Let’s assume that we have 4 servers in a datacenter that each run one database node. In the first image, we are not replicating our data, so each node actually holds a fourth of all available data. We represented the token ranges in percentages, as the actual possible token ranges are pretty large and inconvenient to work with.
Think back to our first denormalized view, where we made company the partition key. If a consumer asks Node 1 to return all employees that work for Archers and Archers translates to a token value of 28, Node 1 knows that it needs to go looking for it in Node 2. After it gets the answer, it just passes it along to the consumer, pretending that it knew the answer all along.
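Here is a minimal sketch of that lookup, using the percentage-style tokens from the example. It assumes the four nodes split the range evenly and in order (Node 1 up to 25, Node 2 up to 50, and so on), and it takes the token value of 28 for Archers as given instead of computing a real hash.

```python
import bisect

NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]
RANGE_ENDS = [25, 50, 75, 100]  # assumption: evenly split, in node order

# Assumption: tokens are looked up instead of hashed, to match the example.
TOKENS = {"Archers": 28}


def owner(token: int) -> str:
    """Find the node whose token range contains the token."""
    if not 0 <= token <= 100:
        raise ValueError("token must be between 0 and 100 in this toy ring")
    return NODES[bisect.bisect_left(RANGE_ENDS, token)]


def coordinate(asked_node: str, partition_key: str) -> str:
    """Any node can take the request and work out who really has the data."""
    target = owner(TOKENS[partition_key])
    if target == asked_node:
        return f"{asked_node} answers from its own data"
    return f"{asked_node} forwards the request to {target}"


print(coordinate("Node 1", "Archers"))
```

No node needs to ask around: the token alone tells any node where the data lives.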
Cassandra has a flexible replication factor, so the second image explains what would happen if we chose to duplicate data by factor 2. Now the same data can be found on 2 nodes. Image 1 is obviously not a realistic scenario. What if a node goes down? Your data might be inaccessible or worse, lost. Always build your systems for failure.
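Replication can be added to the same toy ring. This sketch assumes that extra copies go to the next nodes along the ring, which is a simplification; how replicas are really placed depends on the replication strategy you configure, and the image may show a different placement.

```python
import bisect

NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]
RANGE_ENDS = [25, 50, 75, 100]  # assumption: same evenly split toy ring


def replicas(token: int, replication_factor: int) -> list:
    """Owner of the token plus the next nodes on the ring (assumed placement)."""
    if not 1 <= replication_factor <= len(NODES):
        raise ValueError("replication factor must be between 1 and the node count")
    first = bisect.bisect_left(RANGE_ENDS, token)
    return [NODES[(first + i) % len(NODES)] for i in range(replication_factor)]


def reachable(token: int, replication_factor: int, down: set) -> list:
    """Replicas that can still answer when some nodes are down."""
    return [n for n in replicas(token, replication_factor) if n not in down]


# Without replication, losing Node 2 makes token 28 unreachable.
print(reachable(28, 1, down={"Node 2"}))
# With replication factor 2, another node still holds a copy.
print(reachable(28, 2, down={"Node 2"}))
```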
This built-in functionality is where Cassandra really shines. Your database can’t handle the load? Just add a node. Cassandra does everything else. Each individual node only has to carry a small part of the full load on the database. Due to the known storage place of each token range and the fact that each node can act as a Partitioner, you never need a full cluster scan to get what you need. It scales linearly, and incredibly well. That is something that definitely cannot be said of a conventional database. You really need some incredible loads to outpace most SQL databases these days though. So do not make this choice lightly, because the architectural investment is not to be underestimated.
With your data spread over different nodes, multiple datacenters, perhaps even continents, concerns about performance start popping up. You don’t want to read each single node that has parts of the data you need and wait for them to respond. Cassandra has something called a tunable consistency to alleviate those concerns.
In theory, you could choose a consistency level for each individual read. Perhaps you don’t really care about consistency at all? In that case, you could only require the response of a single node. You could even wait for a response of all nodes that have your requested data, but of course, performance will drop significantly. Even then, you won’t always be fully consistent. Which response do you pick if not all responses are equal?
This goes for writing too. A write could be marked successful as soon as a single node has received the data. This allows for very fast writes, but at the same time, a read right after it would need to be much slower to be sure your data is consistent. Cassandra’s tunable consistency is great in allowing you to find something that’s exactly right for your use case.
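The trade-off between fast writes and slow reads can be captured in a common rule of thumb: a read is sure to touch at least one replica that has the latest write when the number of nodes read plus the number of nodes written exceeds the replication factor. The function below is a sketch of that rule only; its name and parameters are made up, and it ignores failures and timing details.

```python
def read_overlaps_write(read_nodes: int, write_nodes: int,
                        replication_factor: int) -> bool:
    """True when every read must hit at least one node that saw the write."""
    for count in (read_nodes, write_nodes):
        if not 1 <= count <= replication_factor:
            raise ValueError("node counts must be between 1 and the replication factor")
    return read_nodes + write_nodes > replication_factor


# Replication factor 3, write acknowledged by a single node:
print(read_overlaps_write(read_nodes=1, write_nodes=1, replication_factor=3))
print(read_overlaps_write(read_nodes=3, write_nodes=1, replication_factor=3))
# A middle ground: two nodes for both reads and writes.
print(read_overlaps_write(read_nodes=2, write_nodes=2, replication_factor=3))
```

A write that only waits for one node has to be paired with a read of all replicas, which is the slow read mentioned above. Spreading the cost over both sides is often the more comfortable choice.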
There’s something called quorum read, which sounds like a happy medium for most. In a quorum read, the coordinating node sends out a request to all nodes that have the requested data. However, it only waits for a majority of them to respond, meaning more than half. If your replication factor were 5, for example, the coordinating node waits for 3 of the 5 nodes to respond.
If the data those three nodes respond with is not consistent, it chooses the response with the most recent timestamp. This does not guarantee consistency, but it comes very close.
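A quorum read as described here can be sketched as follows. The Response class, the node names and the timestamps are hypothetical, and the list order stands in for the order in which responses arrive at the coordinating node.

```python
from dataclasses import dataclass


@dataclass
class Response:
    node: str
    value: str
    timestamp: int  # when this replica last wrote the value


def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def quorum_read(responses: list, replication_factor: int) -> str:
    """Take the first majority of responses and keep the newest value."""
    needed = quorum(replication_factor)
    if len(responses) < needed:
        raise RuntimeError("not enough replicas responded for a quorum")
    answered_in_time = responses[:needed]  # the coordinator stops waiting here
    return max(answered_in_time, key=lambda r: r.timestamp).value


# Hypothetical responses, in order of arrival, for replication factor 5.
responses = [
    Response("Node 1", "old phone number", timestamp=100),
    Response("Node 2", "new phone number", timestamp=200),
    Response("Node 3", "old phone number", timestamp=100),
    Response("Node 4", "newest phone number", timestamp=300),
    Response("Node 5", "old phone number", timestamp=100),
]

print(quorum(5))
print(quorum_read(responses, replication_factor=5))
```

In this made-up scenario the newest value lives on Node 4, which answers too late to be part of the quorum. That is the "very close, but not guaranteed" part in practice.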
In this blog, we have talked about NoSQL and distribution. You may know by now if those concepts can mean something for you and your project. Contrary to that, you might also feel reassured that you stuck with SQL.
So yeah, which database should you use? If you wanted a clear-cut answer, this might be disappointing. As you might’ve guessed by now, it depends. More choice and new technologies are a good thing, but don’t be so quick to rule out relational databases once and for all. Just because they are old, doesn’t mean they’re not still the right tool for the job.