Why Cassandra?

Apache Cassandra is an open-source distributed database management system that is popular with enterprises for its horizontal scalability, fault tolerance with no single point of failure, peer-to-peer architecture, cost-effectiveness and distribution of data across many nodes. Cassandra has many highlights, but the one that stands out is its ability to keep serving requests without interruption even when one or more nodes are down.

In this blog, we will discuss Apache Cassandra’s highlight features, types of Cassandra backup methods and ways to restore data from backup, with examples.

How does Cassandra work?

Cassandra can provide uninterrupted service even when one or more nodes malfunction because it replicates data among multiple nodes, across multiple data centers. Cassandra keeps data in SSTable files, which are stored in a per-keyspace directory under the data directory path specified by the ‘data_file_directories’ parameter in the cassandra.yaml file.

The default SSTable directory path is: /var/lib/cassandra/data/
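To check which data directories a node actually uses, the setting can be read straight from the configuration file. The sketch below is only an illustration: the function name list_data_dirs is made up for this example, it assumes the usual block-list layout of data_file_directories, and it is not a full YAML parser.

```shell
# Print each entry listed under data_file_directories in a cassandra.yaml file.
# Assumption: entries use the usual block-list form ("    - /path").
list_data_dirs() {
  awk '
    /^data_file_directories:/ { in_list = 1; next }
    in_list && /^[ \t]*-/ {
      sub(/^[ \t]*-[ \t]*/, "")        # strip the leading "- "
      print
      next
    }
    in_list && /^[^ \t#]/ { in_list = 0 }   # the next top-level key ends the list
  ' "$1"
}
```

A call might look like list_data_dirs /etc/cassandra/cassandra.yaml, where the config path is an assumption that depends on how Cassandra was installed.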

When will you need a Cassandra backup?

Though Cassandra sounds invincible, backups are still necessary to recover from the following scenarios:

  • Disk failure
  • Corrupted data
  • Accidental deletions
  • Errors made in data by client applications
  • Rolling back the cluster to a known good state
  • Catastrophic failure that requires rebuilding the entire cluster

Types of Cassandra backup

Cassandra provides two types of backup methods:

  • Snapshot based backup
  • Incremental backup

Snapshot based backup method

Cassandra provides the nodetool utility, a command-line interface for managing a cluster. Its nodetool snapshot command flushes memtables to disk and then creates a snapshot by hard-linking the SSTables, which are immutable. The command works on a per-node basis. To snapshot an entire cluster, run nodetool snapshot on every node using a parallel SSH utility such as pssh, or take the snapshot of each node one by one. A snapshot can cover all keyspaces on a node, selected keyspaces, or a single table in a keyspace. Because a snapshot consists of hard links, it takes almost no extra space at first, but it keeps old SSTables on disk after compaction would otherwise have removed them, so make sure the node has enough free disk space.

Note that the schema is not backed up by this method; it must be backed up manually and separately. Some examples follow:
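Since nodetool snapshot acts on one node at a time, a cluster-wide snapshot comes down to running the same command on every node with the same snapshot name. The sketch below does this sequentially over plain ssh rather than pssh. The function name snapshot_cluster and the hosts file are hypothetical, and it assumes key-based SSH access to each node with nodetool on the remote PATH.

```shell
# Take the same named snapshot on every node listed in a hosts file
# (one hostname or IP per line).
# Assumptions: key-based SSH access to each node, nodetool on the remote PATH.
snapshot_cluster() {
  local hosts_file="$1" tag="$2" failed=0
  while IFS= read -r host; do
    [ -n "$host" ] || continue          # skip blank lines
    # -n stops ssh from consuming the rest of the hosts file on stdin
    if ! ssh -n "$host" "nodetool snapshot -t '$tag'"; then
      echo "snapshot failed on $host" >&2
      failed=1
    fi
  done < "$hosts_file"
  return "$failed"
}
```

For example, snapshot_cluster hosts.txt 2017.05.31 would request a snapshot named 2017.05.31 on each host in hosts.txt and return non-zero if any node failed. Using one shared name makes it easier to collect the matching snapshot directories from all nodes later.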

a. All keyspaces snapshot

If you want to take a snapshot of all keyspaces on the node, run the command below:

$ nodetool snapshot

The following message will appear:

Requested creating snapshot(s) for [all keyspaces] with snapshot name [1496225100]
Snapshot directory: 1496225100

The snapshot directory is /var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/1496225100
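A snapshot of all keyspaces creates one such directory per table, so it helps to list them before copying anything. A minimal sketch, assuming the default data directory layout described above (find_snapshot_dirs is a name chosen for this example):

```shell
# List every directory on this node that holds the given snapshot.
# Assumption: data_dir follows the keyspace/table-UUID/snapshots/<name> layout.
find_snapshot_dirs() {
  local data_dir="$1" tag="$2"
  find "$data_dir" -type d -path "*/snapshots/$tag"
}
```

Running find_snapshot_dirs /var/lib/cassandra/data 1496225100 should print one path per table included in the snapshot. The nodetool listsnapshots command gives a similar overview without touching the file system directly.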

b. Single keyspace snapshot

Assume you have created the keyspace university. To take a snapshot of this keyspace and give the snapshot a name of your choice, run the command below:

$ nodetool snapshot -t 2017.05.31 university

The following output will appear:

Requested creating snapshot(s) for [university] with snapshot name [2017.05.31]
Snapshot directory: 2017.05.31

c. Single table snapshot

If you want to take a snapshot of only the student table in the university keyspace, run the command below:

$ nodetool snapshot --table student university

The following message will appear:

Requested creating snapshot(s) for [university] with snapshot name [1496228400] 
Snapshot directory: 1496228400 

After the snapshot completes, you can move the snapshot files to another location such as AWS S3, Google Cloud or MS Azure. However, you must also back up the schema, because Cassandra can only restore data from a snapshot when the table schema exists.
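One way to keep the schema alongside the snapshot is to dump it with cqlsh. The sketch below is an assumption about how you might script this, not a built-in backup feature: the function name backup_schema, the output file and the default host are all illustrative.

```shell
# Write the CQL definition of one keyspace (and its tables) to a file.
# Assumptions: cqlsh is on the PATH and can reach the node without extra
# authentication options; host defaults to the local node.
backup_schema() {
  local keyspace="$1" out_file="$2" host="${3:-127.0.0.1}"
  cqlsh -e "DESCRIBE KEYSPACE ${keyspace};" "$host" > "$out_file"
}
```

For example, backup_schema university university_schema.cql saves the schema of the university keyspace. On a fresh cluster, the saved file can be replayed with cqlsh -f university_schema.cql before the data is restored.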

Pros & Cons of Snapshot based backup

Advantages:

  • Simple and easy to manage
  • The nodetool utility provides the nodetool clearsnapshot command, which removes snapshot files

Disadvantages:

  • For large datasets, it may be hard to take a daily backup of the entire keyspace
  • It is expensive to transfer large snapshot data to a safe location like AWS S3
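Because snapshots hold on to disk space until they are cleared, a common pattern is to remove a snapshot only after its copy is confirmed to exist elsewhere. The sketch below illustrates that guard. The function name and the archive argument are hypothetical, and the archive file simply stands in for whatever check proves that the snapshot was shipped.

```shell
# Remove a named snapshot once a copy of it has been confirmed.
# Assumption: "archive" is a local file standing in for "the copy exists";
# adapt the check to wherever the snapshot was actually transferred.
clear_snapshot_if_archived() {
  local tag="$1" archive="$2"
  if [ ! -s "$archive" ]; then
    echo "refusing to clear snapshot $tag: $archive is missing or empty" >&2
    return 1
  fi
  nodetool clearsnapshot -t "$tag"
}
```

For example, clear_snapshot_if_archived 2017.05.31 /backup/2017.05.31.tar.gz clears the snapshot named 2017.05.31 on the local node only if that archive file is present and non-empty.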

Incremental backup method

By default, incremental backup is disabled in Cassandra. It can be enabled by changing the value of “incremental_backups” to “true” in the cassandra.yaml file. Once enabled, each time a memtable is flushed to an SSTable, Cassandra creates a hard link to that SSTable in a backups directory under the table's directory within the keyspace data directory. Incremental backups contain only the SSTable files written since the last snapshot, so they depend on that snapshot for a full restore. They also require less disk space, as they hold links only to the new SSTable files generated since the last full snapshot.
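To see whether incremental backups are switched on and what they have produced so far, the configuration and the backups directories can be inspected directly. This is a rough sketch with made-up function names. It assumes the setting appears as a plain top-level key and that the data directory follows the default layout.

```shell
# Print the value of incremental_backups from a cassandra.yaml file.
# Assumption: the key appears once, at the top level, as "incremental_backups: <value>".
incremental_backups_setting() {
  awk -F'[: \t]+' '/^incremental_backups:/ { print $2 }' "$1"
}

# List the hard-linked SSTable files sitting in the backups directories.
list_incremental_files() {
  find "$1" -type f -path '*/backups/*'
}
```

nodetool also offers statusbackup and enablebackup commands for checking and switching on incremental backups on a running node. A change made that way is not written to cassandra.yaml, so the file still needs updating if the setting should survive a restart.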

Pros & Cons of Incremental backup method

Advantages:

  • Reduces disk space requirements
  • Reduces transfer cost

Disadvantages:

  • Cassandra does not automatically clear incremental backup files. Removing the hard-link files requires writing your own script, as there is no built-in tool to clear them
  • Creates many small files in the backup, making file management and recovery cumbersome
  • Cannot select a subset of column families for incremental backup
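The cleanup script mentioned in the first disadvantage can be quite small. The sketch below removes incremental backup files by age. It is one possible policy rather than a recommended one: the function name and the age threshold are assumptions, and it presumes that files this old have already been copied off the node or are covered by a newer snapshot.

```shell
# Delete incremental backup files older than a given number of days.
# Assumption: anything that old has already been copied elsewhere or is
# covered by a newer snapshot; verify this before running it on real data.
prune_incremental_backups() {
  local data_dir="$1" max_age_days="$2"
  # Only files inside a backups directory are matched; live SSTables are untouched.
  find "$data_dir" -type f -path '*/backups/*' -mtime +"$max_age_days" -exec rm -f {} +
}
```

For example, prune_incremental_backups /var/lib/cassandra/data 7 removes backup files last modified more than seven days ago. Since the files are hard links, deleting them frees space only once the live SSTable they point to has been compacted away.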

Cassandra Data Restore Methods

Backups are meaningful only when they are restorable, especially when a keyspace gets deleted, a new cluster is launched from the backup data or a node is replaced. Data can be restored from snapshots; if you are using incremental backups, you also need all incremental backup files created after the snapshot.

Ways to restore data from backup

There are two primary ways to restore data from backup:

  • Using nodetool refresh
  • Using sstableloader

Restore using nodetool refresh:

The nodetool refresh command loads newly placed SSTables into the system without a restart. This method is used when a new node replaces a node that is not recoverable. Restoring data from a snapshot is possible only if the table schema exists. Assuming you have created a new node, follow the steps below:

  • Create the schema if not created already
  • Truncate the table, if necessary
  • Locate the snapshot folder (/var/lib/cassandra/data/keyspace_name/table_name-UUID/snapshots/snapshot_name) and copy the snapshot's SSTable files to the /var/lib/cassandra/data/keyspace_name/table_name-UUID directory
  • Run nodetool refresh keyspace_name table_name
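The copy and refresh steps above can be sketched as a small helper. This is an illustration under stated assumptions, not the only way to do it: the function name is made up, the schema is assumed to exist already, both directories are passed in explicitly, and the copied files must end up readable by the user Cassandra runs as.

```shell
# Copy snapshot SSTables into the live table directory and load them.
# Assumptions: the schema already exists, both directories are on this node,
# and file ownership is fixed up separately if the copy runs as another user.
restore_with_refresh() {
  local snapshot_dir="$1" table_dir="$2" keyspace="$3" table="$4"
  if [ ! -d "$snapshot_dir" ] || [ ! -d "$table_dir" ]; then
    echo "snapshot or table directory not found" >&2
    return 1
  fi
  cp -p "$snapshot_dir"/* "$table_dir"/ || return 1
  # nodetool refresh takes the keyspace and table whose new SSTables to load
  nodetool refresh "$keyspace" "$table"
}
```

The directories are separate arguments because a table that was dropped and recreated gets a new UUID, so the snapshot may sit under a different table_name-UUID directory than the one Cassandra currently reads from.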

Restore using sstableloader:

The sstableloader utility loads a set of SSTable files into a Cassandra cluster. It can be used for:

  • Loading external data
  • Loading existing SSTables
  • Restoring snapshots

To restore using sstableloader, follow the steps below:

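  • Create the schema if it does not exist
  • Truncate the table, if necessary
  • Bring your backup data to a node from AWS S3 or Google Cloud or MS Azure (Example: download your backup data to /home/data/keyspace_name/table_name; sstableloader takes the keyspace and table names from the last two directories of the path)
  • Run the command below, where ip is the address of a node in the target cluster
    sstableloader -d ip /home/data/keyspace_name/table_name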
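Because sstableloader takes the target keyspace and table from the last two components of the directory path, it is worth confirming what they resolve to before starting a long load. A minimal sketch, with a made-up function name and illustrative arguments:

```shell
# Run sstableloader and first report which keyspace.table the path maps to.
# Assumption: sstable_dir ends in .../<keyspace>/<table>, the layout
# sstableloader itself uses to decide where the data goes.
load_sstables() {
  local hosts="$1" sstable_dir="${2%/}"   # drop one trailing slash, if any
  local table keyspace
  table=$(basename "$sstable_dir")
  keyspace=$(basename "$(dirname "$sstable_dir")")
  echo "Loading into ${keyspace}.${table} via ${hosts}" >&2
  sstableloader -d "$hosts" "$sstable_dir"
}
```

For example, load_sstables 10.0.0.1 /home/data/university/student would report university.student as the target and then stream the files through the node at 10.0.0.1. The address and path here are placeholders.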

About the author:

The author Sebabrata Ghosh is a Data Engineer at SecureKloud Technologies.