Zero-downtime Cassandra upgrade

Two critical requirements when offering a recommendation service to some of the largest Brazilian web stores are availability and scalability. After all, if our service becomes unavailable, the stores’ customers won’t see our recommendations when they shop online, which upsets both the customers and the partners that rely on our service. We take these requirements very seriously at Chaordic, using the latest technologies to keep our service available 24/7/365, including during peak periods such as Black Friday.

One of the systems we use for increased availability and scalability is Apache Cassandra. Cassandra is an open source distributed storage solution that provides several mechanisms for data distribution and fault tolerance. The Cassandra community is very active, and new releases come out pretty much every month with bugfixes and awesome new features. That makes it quite hard to keep pace and upgrade as soon as each new release becomes available.

A big challenge when you have to upgrade a production distributed system is to make sure your service remains available during the upgrade. Luckily, Cassandra makes it really simple to upgrade a production cluster without impacting application performance. We’ve done upgrades a few times already without “major” problems. Thor save us on the next upgrade coming soon! ;]

DataStax provides great upgrade instructions, but we will share some additional tips that may help devops engineers perform a zero-downtime Cassandra upgrade. Even though these tips come from the upgrade from version 1.1 to 1.2, most of them should be valid for more recent upgrades.

Planning the Upgrade

  • Familiarize yourself with the upgrade process, by reading the NEWS.txt, CHANGES.txt and the DataStax upgrade documentation.

  • Create a simple document with the upgrade plan. This document will guide you through the process, allowing you to know which steps to take and to monitor the progress of the upgrade.

  • Define in which order the machines will be upgraded. If your cluster doesn’t use VNodes, a good practice is to upgrade at most one machine per replication group at a time, ensuring no replicas will be lost during the upgrade.

  • Define what time of day the upgrade will be done, and estimate how long it will take. It is advisable to upgrade when the system is NOT under heavy load, typically late at night.

Before the Upgrade

  • Make sure all nodes are on the same schema by checking the output of the following command in cassandra-cli on every node:

        describe cluster;

  • Back up the cluster schema in case of catastrophe. That is unlikely to happen, but better safe than sorry. You can easily back up the schema (based on thrift) using the following command:

        echo -e "show schema;\n" | cassandra-cli -h <seed_node> > "schema-`date +%m-%d-%Y`.cdl"

  • Back up the ring tokens as well, just in case:

        nodetool ring > ring-`date +%m-%d-%Y`.txt

  • Compare the old and new cassandra.yaml files, carefully reviewing attribute changes between versions. For example, the default partitioner changed in Cassandra 1.2 (from RandomPartitioner to Murmur3Partitioner), and the new default is incompatible with data written under the old one, so you need to make sure the old partitioner is specified in the new configuration file. The same applies to initial_token and other custom attributes.
  • Prepare an upgrade script with the main upgrade steps, and distribute it to the cluster machines. You don't want to think too much in the middle of an upgrade. Below is a sample upgrade script:
  • Prepare a rollback script in case things go wrong and you need to roll back fast. Below is a sample rollback script:
  • When you're ready to start, install the new Cassandra version binaries on the cluster nodes according to the official upgrade instructions.
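
A minimal sketch of what cassandra_upgrade.sh and cassandra_rollback.sh could look like, combined in one file. It is an illustration under assumptions that will differ between installations, not a definitive script: the directory names, the symlink layout, the `service cassandra` init script, and the snapshot tag `pre-upgrade` are all hypothetical. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of an upgrade/rollback script for a single node.
# ASSUMPTIONS (hypothetical, adjust to your installation):
#   - each Cassandra version is unpacked in its own directory, with its
#     reviewed cassandra.yaml already in place (OLD_HOME, NEW_HOME)
#   - CASSANDRA_LINK is a symlink that the init script starts Cassandra from
#   - the node is managed with `service cassandra start|stop`
OLD_HOME="${OLD_HOME:-/opt/apache-cassandra-1.1}"
NEW_HOME="${NEW_HOME:-/opt/apache-cassandra-1.2}"
CASSANDRA_LINK="${CASSANDRA_LINK:-/opt/cassandra}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default)

# Print a command, and execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Stop the node cleanly, point the symlink at directory $1, start it again.
switch_version() {
  run nodetool drain || return 1              # flush memtables, stop accepting writes
  run service cassandra stop || return 1
  run ln -sfn "$1" "$CASSANDRA_LINK" || return 1
  run service cassandra start || return 1
}

upgrade_node() {
  # Snapshot first, so there is something to restore from if needed.
  run nodetool snapshot -t pre-upgrade || return 1
  switch_version "$NEW_HOME"
}

rollback_node() {
  switch_version "$OLD_HOME"
}

# Dispatch on the first argument; with no argument nothing is executed.
case "${1:-}" in
  upgrade)  upgrade_node ;;
  rollback) rollback_node ;;
esac
```

Running the file with `upgrade` or `rollback` prints the plan, and setting `DRY_RUN=0` executes it. Keep in mind that switching the binaries back may not be enough on its own: SSTables written by the newer version are generally not readable by the older one, so a real rollback may also involve restoring the snapshot.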

During the Upgrade

  • Suspend all scheduled maintenance tasks for the duration of the upgrade, such as repairs, cleanups, or upgradesstables.

  • Run the upgrade script (cassandra_upgrade.sh) on each of your nodes according to the schedule defined in the upgrade plan.

  • Check the Cassandra logs for errors or warnings and take the necessary action (e.g. roll back, wait) if they seem critical. Some exceptions are not critical, which is why it’s important to be familiar with NEWS.txt and CHANGES.txt and know what has changed between versions.

  • In case something goes wrong and you need to roll back to the previous version, execute the rollback script (cassandra_rollback.sh) on the nodes that were already upgraded.
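
The log check above can be partly automated. The sketch below pulls the ERROR and WARN lines out of a node's log and prints a one-line summary. It assumes log4j-style lines that begin with the level (possibly padded with spaces), and the default log path is a hypothetical one; both may differ on your installation. Deciding whether a given exception is critical still requires reading it.

```shell
#!/usr/bin/env bash
# ASSUMPTION: log lines start with the level, optionally space-padded, e.g.
#   " WARN [main] 2013-05-01 02:00:01,000 ..."
#   "ERROR [ReadStage:1] 2013-05-01 02:00:02,000 ..."
LOG_FILE="${LOG_FILE:-/var/log/cassandra/system.log}"   # hypothetical default path

# Print the ERROR and WARN lines of log file $1.
scan_log() {
  grep -E '^ *(ERROR|WARN) ' "$1"
}

# Print "errors=<n> warnings=<m>" for log file $1.
summarize_log() {
  errors=$(grep -cE '^ *ERROR ' "$1")
  warnings=$(grep -cE '^ *WARN ' "$1")
  echo "errors=$errors warnings=$warnings"
}
```

For example, `summarize_log "$LOG_FILE"` right after a node restarts gives a quick signal, and `scan_log "$LOG_FILE"` shows the lines to review. Both read the whole file, so entries from before the restart are included too.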

After the Upgrade

  • Run nodetool upgradesstables on all nodes to rewrite the SSTables in the new version's format. Since this can be a heavyweight operation, be careful not to overload your cluster.

  • Verify your application metrics (throughput, response time, exceptions, etc.) to confirm they have stabilized (or hopefully improved) after the upgrade.

  • Enjoy your newly upgraded Cassandra cluster! ;-)
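
One way to keep the upgradesstables step from overloading the cluster is to run it on one node at a time. The sketch below does that. The host names in `NODES` are hypothetical, and it assumes that nodetool can reach every node from the machine where it runs and that the command returns only once the rewrite on that node has finished. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# ASSUMPTIONS: NODES is a hypothetical space-separated host list, and
# nodetool can reach each host's JMX port from where this script runs.
NODES="${NODES:-cass1 cass2 cass3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default)

# Rewrite SSTables on one host at a time, stopping at the first failure.
upgrade_sstables_serially() {
  for host in $NODES; do
    echo "+ nodetool -h $host upgradesstables"
    if [ "$DRY_RUN" != "1" ]; then
      nodetool -h "$host" upgradesstables || return 1
    fi
  done
}
```

Calling `upgrade_sstables_serially` with `DRY_RUN=0` walks the list in order, so at most one node carries the extra I/O load at any moment.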

Special thanks to Charles Brophy and Robert Coli from the Cassandra mailing list for sharing some of the tips described here.