How-to: Migrate Elasticsearch Cluster to Amazon Virtual Private Cloud (VPC) with ZERO downtime.

Vishnu Narang


Scalability is the need of the hour, especially for growing businesses. Recognizing this, we took on the task of upgrading our Elasticsearch nodes. Here are some insights and takeaways.

To give you some background, I am a software engineer at Mighty Networks. Mighty Networks provides a platform for anyone to host their own #MightyNetwork and bring their content, community, brand, business, and now their courses together in one place.

High availability, security, and high performance are important for our system. Migrating Elasticsearch from EC2-Classic to a Virtual Private Cloud (VPC) with zero downtime was therefore challenging, and the approach I describe below is what worked best for us.

A few more reasons to consider migrating from EC2-Classic to VPC:

  • Better security
  • Access to more computing power, with newer classes of EC2 instances available only in a VPC.
  • Cost savings: VPC machines are cheaper than EC2 Classic for comparable computing power.

The AWS documentation comparing EC2-Classic and VPC covers these advantages and other details further.

Our architecture comprises a fault-tolerant Elasticsearch cluster with 3 master nodes backing 5 data nodes. A useful resource about architecting and designing an Elasticsearch cluster can be found here.

STEPS:

While the steps below are high level and should be applicable to most infrastructure setups, I have added some details specific to our setup at Mighty Networks.

1. Set up a new cluster inside a VPC with the right security group and subnet settings

The Old Architecture:

The old setup comprised 3 Elasticsearch master nodes and 5 Elasticsearch data nodes (3 for the application and 2 for analytics).

The master nodes ran on m3.large EC2-Classic instances, while the data nodes ran on m3.2xlarge EC2-Classic instances.

You can see more details for these previous generation instances here: https://aws.amazon.com/ec2/previous-generation/

The new VPC Architecture:

The new architecture has 3 Elasticsearch master nodes and 6 Elasticsearch data nodes (4 for the application and 2 for analytics). Even with an extra data node and bigger machines, the cost of the new architecture is lower than that of the old system (see the cost comparison for the instances below).

The master nodes run on m4.xlarge instances and the data nodes run on m4.2xlarge instances.

Cost comparison (as of July 2018):

References:

  1. Amazon features and prices for m3 EC2 instances
  2. Amazon features and prices for EC2 instances

Valuable Takeaways:

  • Double-check the inbound rules for the VPC security group. It should not have any rule that leaves your app vulnerable to attack (a scripted sketch of this setup follows this list).
  • For most services you should have at least 2 subnets within the VPC. For example, the Aurora PostgreSQL service from AWS won’t work with a single subnet.
  • Make sure the route table is associated with all the subnets in the VPC.
  • Try to spread the architecture across multiple availability zones for better failover protection.
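
If you would rather script this setup than click through the console, a minimal sketch using the aws-sdk-ec2 Ruby gem might look like the following. The region, CIDR blocks, and names are illustrative placeholders, not our actual values.

    require 'aws-sdk-ec2' # AWS SDK for Ruby

    ec2 = Aws::EC2::Client.new(region: 'us-east-1')

    # Create the VPC and two subnets in different availability zones.
    vpc_id   = ec2.create_vpc(cidr_block: '10.0.0.0/16').vpc.vpc_id
    subnet_a = ec2.create_subnet(vpc_id: vpc_id, cidr_block: '10.0.1.0/24',
                                 availability_zone: 'us-east-1a').subnet.subnet_id
    subnet_b = ec2.create_subnet(vpc_id: vpc_id, cidr_block: '10.0.2.0/24',
                                 availability_zone: 'us-east-1b').subnet.subnet_id

    # Security group for the Elasticsearch nodes: allow port 9200 only
    # from inside the VPC, nothing from the public internet.
    sg_id = ec2.create_security_group(group_name: 'elasticsearch-vpc',
                                      description: 'Elasticsearch nodes',
                                      vpc_id: vpc_id).group_id
    ec2.authorize_security_group_ingress(
      group_id: sg_id,
      ip_permissions: [{ ip_protocol: 'tcp', from_port: 9200, to_port: 9200,
                         ip_ranges: [{ cidr_ip: '10.0.0.0/16' }] }]
    )

    puts "vpc=#{vpc_id} subnets=#{subnet_a},#{subnet_b} sg=#{sg_id}"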

2. Enable ClassicLink

For the Amazon EC2-Classic instances to be able to communicate with the EC2 instances in the VPC, we need to enable ClassicLink for all the workers and frontends in the current production architecture. You can do this from the AWS console using the Actions dropdown for an EC2 instance (a scripted alternative is sketched after the list below).

We need this for 2 reasons:

  • Migration depends on this communication.
  • It lets you migrate one component/service at a time.
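
The console works fine for a handful of instances; with many workers, the same thing can be scripted. A rough sketch with the aws-sdk-ec2 Ruby gem follows; the VPC, security group, and instance IDs are placeholders.

    require 'aws-sdk-ec2'

    ec2 = Aws::EC2::Client.new(region: 'us-east-1')

    # ClassicLink must first be enabled on the VPC itself.
    ec2.enable_vpc_classic_link(vpc_id: 'vpc-0123456789abcdef0')

    # Link each EC2-Classic worker/frontend to the VPC, attaching a VPC
    # security group that lets it reach the Elasticsearch nodes.
    ['i-0aaa1111', 'i-0bbb2222'].each do |instance_id|
      ec2.attach_classic_link_vpc(
        instance_id: instance_id,
        vpc_id:      'vpc-0123456789abcdef0',
        groups:      ['sg-0123456789abcdef0'] # VPC security group IDs
      )
    end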

3. Verify Cluster is Running

We use Ansible to automate provisioning and setting up our AWS instances. Before proceeding further, it’s good practice to make sure Elasticsearch is installed properly and that the 3 master nodes and 6 data nodes have formed a proper Elasticsearch cluster. To do this, we used tunneling and the head plugin.

The head plugin lets you browse the indices, perform queries, see the mappings, etc., for your Elasticsearch cluster. For a local Elasticsearch instance, you can open http://localhost:9200/_plugin/head/ in your browser.

To access the production instance from your machine, add an inbound rule that allows your workplace or home public IP to access the VPC.

After setting up the Elasticsearch cluster, use SSH tunneling (command below) for the port (9200 by default for Elasticsearch) so the head plugin works against the remote cluster.

ssh -f -N -q -L 9200:localhost:9200 <production-ES-endpoint>; ps aux | grep 'ssh -f' | grep -v grep

Opening http://localhost:9200/_plugin/head/ in your browser will now give you insights into your new Elasticsearch cluster.

Once you’ve confirmed the cluster is up and running, you can clean up the inbound rules that are no longer needed.
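
If you prefer a scripted check over a browser, the cluster health API over the same tunnel tells you whether every node has joined. A minimal sketch with the elasticsearch Ruby gem (the expected counts are for our 3 master + 6 data node cluster; adjust for yours):

    require 'elasticsearch'

    # Talks to the new cluster through the local SSH tunnel set up above.
    client = Elasticsearch::Client.new(url: 'http://localhost:9200')

    health = client.cluster.health
    puts "status: #{health['status']}"               # expect 'green'
    puts "nodes:  #{health['number_of_nodes']}"      # expect 9 (3 master + 6 data)
    puts "data:   #{health['number_of_data_nodes']}" # expect 6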

4. Set up empty index mappings

Before migrating data, we first need to set up the index mappings. You can run a one-off migration script (for example, a rake task in a Ruby application) that sets up the mappings on the new cluster.
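
As a rough illustration, such a rake task can boil down to something like the snippet below. The posts index, its post type, and the field definitions are made-up placeholders (this sketch assumes an Elasticsearch 2.x-style mapping with types, matching the head-plugin era described here); use your application’s real mappings.

    require 'elasticsearch'

    new_cluster = Elasticsearch::Client.new(url: 'http://new-es-endpoint:9200')

    # Create each index with its settings and mappings up front, so documents
    # written by the application land in a correctly mapped index.
    unless new_cluster.indices.exists(index: 'posts')
      new_cluster.indices.create(
        index: 'posts',
        body: {
          settings: { number_of_shards: 5, number_of_replicas: 1 },
          mappings: {
            post: {
              properties: {
                title:      { type: 'string' },
                network_id: { type: 'long' },
                created_at: { type: 'date' }
              }
            }
          }
        }
      )
    end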

5. Enable Double Writes for your application

For a zero-downtime migration, the important part is to keep both databases/clusters in sync. To do that, the application needs to write to both clusters, starting before the migration and continuing until the final cutover to the new cluster is complete.

This can be a very easy task if the codebase is well structured and modular and follows good code practices.

If the codebase isn’t well structured for Elasticsearch and the code that writes to Elasticsearch is spread across the codebase, this is a good opportunity to refactor.
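
What this looks like depends entirely on your application, but the idea is a single indexing entry point that fans writes out to both clusters while reads stay on the old one. A hypothetical sketch (the class name and environment variables are illustrative, not our actual code):

    require 'elasticsearch'

    # One thin wrapper that the rest of the app uses for all Elasticsearch access.
    class SearchIndexer
      OLD_CLUSTER = Elasticsearch::Client.new(url: ENV['OLD_ES_URL'])
      NEW_CLUSTER = Elasticsearch::Client.new(url: ENV['NEW_ES_URL'])

      # Reads keep going to the old cluster until the cutover in step 8.
      def self.search(index:, body:)
        OLD_CLUSTER.search(index: index, body: body)
      end

      # Writes and deletes go to both clusters to keep them in sync.
      def self.index_document(index:, type:, id:, body:)
        [OLD_CLUSTER, NEW_CLUSTER].each do |client|
          client.index(index: index, type: type, id: id, body: body)
        end
      end

      def self.delete_document(index:, type:, id:)
        [OLD_CLUSTER, NEW_CLUSTER].each do |client|
          client.delete(index: index, type: type, id: id)
        end
      end
    end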

6. Finish the migration

The application reads from the original cluster but writes to both clusters at this point. We need to migrate all the data from before double-writes were enabled. There are 2 steps:

  1. Perform simple migration for recent data and live indexes
  • Execute a task that reads from the old cluster and writes to the new cluster, covering everything up to the point when double writes were enabled (a sketch of such a task follows this list). This can take a considerable amount of time, but if you batch and parallelize the operation, even a huge amount of data can be moved in a few hours.
  • Also, since double writes are enabled, we can afford to take our time with this migration.
  • We cannot use snapshot and restore for the live indexes, as the restore step will delete an index before restoring it.
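
A backfill task along these lines can be built from the scroll and bulk APIs: read batches out of the old cluster and bulk-index them into the new one. A simplified, single-threaded sketch (the index name and batch size are illustrative; in practice you would restrict the query to documents from before double writes were enabled and parallelize across indexes):

    require 'elasticsearch'

    old_cluster = Elasticsearch::Client.new(url: ENV['OLD_ES_URL'])
    new_cluster = Elasticsearch::Client.new(url: ENV['NEW_ES_URL'])

    # Open a scroll over the old index and copy documents across in batches.
    response = old_cluster.search(index: 'posts', scroll: '5m', size: 500,
                                  body: { query: { match_all: {} } })

    while response['hits']['hits'].any?
      actions = response['hits']['hits'].map do |hit|
        { index: { _index: hit['_index'], _type: hit['_type'],
                   _id: hit['_id'], data: hit['_source'] } }
      end
      new_cluster.bulk(body: actions)

      response = old_cluster.scroll(scroll_id: response['_scroll_id'], scroll: '5m')
    end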

2. Perform snapshot and restore for indexes that are no longer being updated but whose data you want to keep (see the sketch below).

  • Example: if you create a separate events index for each month, all events older than a month live in indexes that are no longer being written to.
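
A hedged sketch of what this looks like through the Ruby client, assuming both clusters have an S3 snapshot repository plugin installed and can reach the same bucket (the repository, bucket, and index names are placeholders):

    require 'elasticsearch'

    old_cluster = Elasticsearch::Client.new(url: ENV['OLD_ES_URL'])
    new_cluster = Elasticsearch::Client.new(url: ENV['NEW_ES_URL'])

    repo = { type: 's3', settings: { bucket: 'my-es-snapshots', region: 'us-east-1' } }

    # Register the same S3 repository on both clusters.
    old_cluster.snapshot.create_repository(repository: 'migration', body: repo)
    new_cluster.snapshot.create_repository(repository: 'migration', body: repo)

    # Snapshot the frozen monthly index on the old cluster...
    old_cluster.snapshot.create(repository: 'migration', snapshot: 'events_2018_05',
                                body: { indices: 'events-2018-05' },
                                wait_for_completion: true)

    # ...and restore it on the new cluster. The restore replaces the target index,
    # which is why this is only safe for indexes that are no longer written to.
    new_cluster.snapshot.restore(repository: 'migration', snapshot: 'events_2018_05',
                                 body: { indices: 'events-2018-05' },
                                 wait_for_completion: true)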

7. Verify Data

Once everything is migrated, verify that the data is in sync, using the head plugin as before. Double writes are still enabled, keeping both clusters in sync.
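
Beyond eyeballing the indexes in the head plugin, a quick scripted spot check of document counts per index is worth doing. A minimal sketch (index names are placeholders; counts may differ by a handful of in-flight writes at any given instant):

    require 'elasticsearch'

    old_cluster = Elasticsearch::Client.new(url: ENV['OLD_ES_URL'])
    new_cluster = Elasticsearch::Client.new(url: ENV['NEW_ES_URL'])

    %w[posts comments events-2018-05].each do |index|
      old_count = old_cluster.count(index: index)['count']
      new_count = new_cluster.count(index: index)['count']
      status    = old_count == new_count ? 'OK' : 'MISMATCH'
      puts format('%-18s old=%-10d new=%-10d %s', index, old_count, new_count, status)
    end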

8. Cutover to new Cluster endpoint

Perform the cutover to use the new cluster as the primary. Again, a well-structured codebase makes this a very easy task. At this point, reads start coming from your new cluster inside the VPC. Keep ClassicLink enabled for the EC2 instances that are communicating with VPC resources.
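
Continuing the hypothetical SearchIndexer wrapper from step 5, the cutover can be as small as pointing reads at the new cluster while writes keep going to both until step 9:

    class SearchIndexer
      # After cutover: reads now come from the new cluster inside the VPC.
      def self.search(index:, body:)
        NEW_CLUSTER.search(index: index, body: body) # was OLD_CLUSTER
      end

      # index_document and delete_document stay unchanged: writes still go to
      # both clusters until double writes are disabled in step 9.
    end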

Use AWS monitoring to verify that all reads are happening only from the new cluster and that the old cluster is receiving only writes. Verify the data once again as an extra precaution.

9. Disable double writes

You can now disable double writes for your application and clean up any application and infrastructure code that is no longer needed.

Conclusion:

I hope this article gives you a viable strategy to perform a zero downtime migration for an Elasticsearch cluster. The double-writes technique can be used for most database migrations if there aren’t existing tools for a zero-downtime migration.

Endnotes:

If you want to learn more about Mighty Networks and see a Mighty Network in action, the Mighty Hosts network is a perfect place to start. Mighty Hosts is a Mighty Network dedicated to entrepreneurs, marketers, and leaders building the most compelling and successful new digital brands.
