CockroachDB : NewSQL as a Choice of Database

Timothy Agustian
Level Up Coding
Published in
5 min readJun 18, 2020

--

Choosing which database for your system might be one of the crucial decisions to make. Whether relational databases such as PostgreSQL and MySQL that provide ACID transactions and standardized SQL syntax, or NoSQL such as MongoDB that provides horizontally scaling storage and high availability.

Both have their own advantages against each other and have been a solid choices for all developers. But what If we want a solid distributed database that still provides ACID Transactions yet still horizontally scalable?

Yeah, you can still use Master-Slave Replication in Relational Databases, but does it really scales? It only scales for the read capability, while insert, update and delete still go to Master.

We could also do vertical scaling in relational databases by upgrading the specification of the servers. But isn’t it costly? And it does lack the elasticity of scales itself. So we will be having a hard time to grow and shrink the database, adapting with the needs of the system.

In the end, relational databases are not meant and will always not meant to hosted on a distributed system. While NoSQL also doesn’t provide ACID Transactions that is commonly needed in most of the system.

Introducing NewSQL

GIF from CockroachDB Interactive Demo

NoSQL such as CockroachDB is a distributed database system that builds on RocksDB (Persistent Key Value Store Database) and provides horizontally scalable system, ACID Transactions using standardized familiar SQL for querying.

CockroachDB is inspired by Google’s Spanner and F1 technologies, and fortunately, this is an open source databases that could be run on multiple nodes across multiple server. As the name suggests, it provides a resilience of the cockroach itself.

Main Advantages of CockroachDB

  1. High Availability

Powered with consistent replication and repair ability, any failures that occurred in software or hardware level, short or long, from a node restart to datacenter problem should not bear any worries.

ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

You could also decide the replication factor of each data, by default it’s only 3.
and the distribution of the replicas will be handled by each node itself, preventing any downtime or data loss if any disaster happens.

For the short term outage, generally the process will be stored only in majority of replica and when the node is up, the replicating process will continue.

For the long term outage, the cluster will find another available nodes to replicate all the value from dead node by copy-ing from the others replica of the dead node’s data.

That’s why it is suggested to set up a system consist of multiple nodes across multiple datacenters.

2. Massively, intelligently, and simply Scale

CockroachDB automatically balances ranges of data across a cluster to avoid failure domains and help address heavy workloads. When we set up a new one or take down a node, CockroachDB will automatically rebalance the distribution data without any manual intervention. This gives easiness and elasticity in scales up/down the system. All we have to do is appoint the node as a part of the cluster or take down any node, and it will intelligently restabilize itself.

3. Distributed Transaction

CockroachDB distributes transactions across your cluster, whether it’s a few servers in a single location or many servers across multiple datacenter. Unlike with sharded setups, you don’t need to know the precise location of data; you just talk to any node in your cluster and CockroachDB gets your transaction to the right place seamlessly.

At What Cost?

  1. Tradeoff for availability

As we set the replication factor, it will decide the availability of the data itself. The bigger the replication factor, The more available the data it is when any nodes having an outage. But that means, any process of Insert, Update, Delete will be triggered as much as the replication factor is. This will lead to higher latency of query that manipulates the records and also will lead to higher capacity consumption. The more available the data that means we need the query latency and storage size as the price.

2. The wrong query is a fatal query

This means that you need to carefully design the database according to the system usage. Any badly design table and indexes will cause the distributed system sucks more than a relational database. Since all the queries executed will be exchanged in multiple nodes, unindexed queries and wrongly design tables will create a chaotic result. Optimizing the indexes, interleaving features and column families might be the answer to this problem.

In My Humble Opinion

After using and managing multiple database in cockroachDB system, I find it quite easier in doing several things such as :

  • Scale up and down by adding or take down a node without any downtime
  • Upgrading the version, you could easily upgrade it 1 node at a time without any downtime
  • Isolation tuning, we could differentiate each setting for each nodes to do any benchmark test for the best set up for our system
  • Less Database downtime, Yes I mention it three times
  • Cool Built-in Admin UI
  • Great open source community

Also it is cheaper to have 3 to 5 low end servers for each node compared to having a 1 high end server

In Summary

The database you choose will be based on the behaviour of your system, If you do need the highly available, horizontally scalable while still providing standardized SQL and ACID transactions, then NewSQL such as CockroachDB might be a choice to be considered. Still the price of latency and storage might be another thing to think about.

As always, we have an opening at Tokopedia.
We are an Indonesian technology company with a mission to democratize commerce through technology and help everyone achieve more.
Find your Dream Job with us in Tokopedia!
https://www.tokopedia.com/careers/

--

--