Apache Cassandra backups (and restores!) – Part I
When I first came across the task of automating Cassandra backups, I thought it would be an easy and straightforward process (to some extent it is). I did a quick search for “cassandra + backup + recovery” with my preferred search engine (https://duckduckgo.com/) and found some basic articles and a couple of quick & dirty scripts. The more I researched, the more I understood it was not going to be a quick task – especially if I wanted to do it right.
Things to keep in mind
Cassandra does not maintain strong consistency like a traditional MySQL database, which means your backup won’t be as simple as locking tables and running mysqldump. What it offers instead is so-called eventual consistency, which is influenced by a number of factors, for example:
- replica count
- network latency
- write buffers
- CPU or I/O load
For the sake of simplicity, imagine your Cassandra cluster has 5 nodes in a single datacenter, the Replication Factor (RF) is set to 3 and your write Consistency Level (CL) is set to QUORUM ( (RF // 2) + 1 ). That means a successful write (a broad term, as writes might fail at different stages) needs to be acknowledged (data written into the commitlog and a memtable) by 2 of the 3 replica nodes. Later the data will also be distributed to the third replica to satisfy the Replication Factor of 3. However, if right after a successful write to two replicas you take a cluster-wide snapshot, you could still miss the data that was supposed to be distributed to the third replica. Or perhaps not all the nodes were online while you took the snapshot, and some replicas were missing. You need to take these (and other) factors into consideration to perform backups in a consistent way and to be able to recover your entire cluster, or a single node’s data, properly.
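The quorum arithmetic above is easy to get wrong, so here is a short Python sketch that makes it concrete (the `quorum` function is my own illustration, not a Cassandra API):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a QUORUM read/write:
    (RF // 2) + 1, i.e. a strict majority of the replicas."""
    return replication_factor // 2 + 1

# With RF = 3, as in the example above, a QUORUM write is
# acknowledged by 2 of the 3 replicas -- regardless of how
# many nodes the cluster has in total.
for rf in (1, 2, 3, 5):
    print(f"RF={rf} -> QUORUM={quorum(rf)}")
# RF=1 -> QUORUM=1
# RF=2 -> QUORUM=2
# RF=3 -> QUORUM=2
# RF=5 -> QUORUM=3
```

Note that the quorum is computed over the replicas (RF), not over the total node count – which is why a 5-node cluster with RF = 3 still needs only 2 acknowledgements.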
I do believe that every backup strategy should be driven by recovery requirements. It doesn’t matter how great and efficient your backup strategy is if you cannot properly restore your data. When thinking about a backup strategy, you need to clearly define your restore requirements, for example: “Do I need to restore a single node, or should I restore the entire cluster?”. If your backup strategy treats each node as an individual backup item, how do you recover from a situation where half the nodes of your 500-node cluster died? Do you go through each node’s recovery process individually, one by one? Or, once you reach a given threshold of failed nodes, would you rather recover the entire cluster? Perhaps you’d like to test new functionality, or benchmark a new schema using production data, but do it on a test cluster – being able to restore a production cluster into a test cluster would be very helpful. The same goes for: “Do I restore just the Cassandra data directory, or do I reinstall the entire VM?” – in which case the Cassandra node ID might change, while its IP may or may not. These things will influence your recovery process, and you should be aware of them before you start crafting your backup strategy.
To be continued in Part II…