Scaling our data infrastructure

When we first began, our Customer Data Platform was managing around one million pieces of content per day. Today, that figure has climbed to 30 million. We’ve completed a major infrastructure renovation that delivers long-term benefits and, in the short term, major speed and stability improvements for our clients.

As the dataset we manage expands, our vision for our infrastructure expands with it, so this month we set about upgrading the environment to ensure scalability, security, and speed for our clients. If you’re involved in managing large, complex datasets, we hope you find this helpful.

Laying the foundation

We were indexing 2.7 billion tweets, comments, messages, articles, and blog posts. Our first step was to move this data into smaller indexes, so we rearranged it into 90 indexes of about 30 million objects each. These smaller indexes streamlined the migration and also allowed our clients to make more efficient requests in our products.
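The post doesn’t say how objects were assigned to the 90 indexes; the sketch below shows just one common approach, deterministic hash-based routing, with a hypothetical “social-NN” naming scheme.

```python
import hashlib

NUM_INDEXES = 90  # ~30 million objects each, per the figures above


def index_for(object_id: str) -> str:
    """Route an object to one of the 90 smaller indexes by hashing its ID.

    The "social-NN" naming scheme is hypothetical; the post doesn't say
    how objects were actually partitioned.
    """
    bucket = int(hashlib.md5(object_id.encode()).hexdigest(), 16) % NUM_INDEXES
    return f"social-{bucket:02d}"


print(index_for("tweet:714382910"))  # the same ID always maps to the same index
```

The useful property here is that routing is deterministic, so reads and writes for the same object always hit the same index without any lookup table.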

The next step was to make sure our searches remained compatible: we updated our query generation library so that the queries generated in the interface (e.g. term matches or author searches) would work on the new search cluster.

Our concern here was not so much with common queries like keyword searches, but with the more complicated ones, like geolocation filters. For example, our interface allows users to draw a box on a map; that box becomes four lat/long points, which are converted into a query and run against our data cluster. We needed to be certain the library was turning the user’s input into something the new cluster could understand, so we rigorously tested all of the query combinations and possibilities until we were 200% sure this functionality would carry over.
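The post never names the search engine, but the “reindex” and snapshot-and-restore facilities described below match Elasticsearch’s APIs, so the sketches in this piece assume Elasticsearch conventions. Here, a minimal sketch of turning the four corner points from the map widget into a geo_bounding_box filter (the “location” field name is an assumption):

```python
def bounding_box_query(points):
    """Convert the four lat/long corners drawn on the map into an
    Elasticsearch-style geo_bounding_box filter.

    The "location" field name is hypothetical; `points` is a list of
    (lat, lon) tuples as produced by the map widget.
    """
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        "location": {
                            "top_left": {"lat": max(lats), "lon": min(lons)},
                            "bottom_right": {"lat": min(lats), "lon": max(lons)},
                        }
                    }
                }
            }
        }
    }


# A box drawn roughly around London
print(bounding_box_query([(51.7, -0.5), (51.7, 0.3), (51.3, -0.5), (51.3, 0.3)]))
```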

Moving in

Upgrading a data asset this large without any impact on service or availability is a huge challenge, so we took our time moving in.

We experimented with two approaches to performing the upgrade: using the “reindex” API to move data from our old environment to our new one, and performing a snapshot and restore from the old cluster to the new. The reindex API proved far too slow to migrate such a large amount of data in a reasonable time frame, especially as we wanted to minimise the length of time we had to run two clusters in parallel.
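For reference, this is roughly what the reindex approach looks like, assuming Elasticsearch’s _reindex API with a remote source (host and index names are hypothetical). It copies documents one batch at a time over HTTP, which is what makes it slow at this scale:

```python
import requests

# Kick off a cross-cluster reindex of one of the 90 indexes as a
# background task; host and index names are hypothetical.
resp = requests.post(
    "http://new-cluster:9200/_reindex",
    params={"wait_for_completion": "false"},  # don't block; poll the task instead
    json={
        "source": {
            "remote": {"host": "http://old-cluster:9200"},
            "index": "social-00",
        },
        "dest": {"index": "social-00"},
    },
    timeout=30,
)
print(resp.json())  # {"task": "..."}: a task ID to poll for progress
```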

The snapshot and restore method proved much more practical. We used a snapshot facility that can copy index data to one of many destinations (in our case, Amazon S3) and then restore it very quickly into a different cluster. Once configuration was complete and we’d conducted a range of tests, we ran a full disaster recovery scenario, going from absolutely nothing to a brand new cluster holding our complete 2.7B-object dataset in just 40 minutes.
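A minimal sketch of that flow, again assuming Elasticsearch’s snapshot API with the repository-s3 plugin installed (repository, bucket, host, and snapshot names are all hypothetical):

```python
import requests

OLD = "http://old-cluster:9200"   # hypothetical hosts
NEW = "http://new-cluster:9200"

# 1. Register the same S3 repository on both clusters.
repo = {"type": "s3", "settings": {"bucket": "our-index-snapshots"}}
for base in (OLD, NEW):
    requests.put(f"{base}/_snapshot/s3_backup", json=repo).raise_for_status()

# 2. Snapshot everything from the old cluster into S3.
requests.put(
    f"{OLD}/_snapshot/s3_backup/full-backup",
    params={"wait_for_completion": "true"},
).raise_for_status()

# 3. Restore the snapshot into the new cluster.
requests.post(
    f"{NEW}/_snapshot/s3_backup/full-backup/_restore",
    json={"indices": "social-*"},
).raise_for_status()
```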

By going down the snapshot-and-restore route, we came away not only with an upgraded cluster but also with a robust backup system that backs up our entire social dataset every 15 minutes, and a quick, reliable disaster recovery procedure.
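The post doesn’t say how the 15-minute cadence is driven; in practice it could be a cron job or a small scheduler hitting the same snapshot API. A sketch, with hypothetical names:

```python
import time
import requests

BASE = "http://new-cluster:9200"  # hypothetical host

# Take a timestamped snapshot every 15 minutes. Snapshots into the same
# repository are incremental, so each run only copies segments written
# since the previous snapshot, which keeps the cadence cheap.
while True:
    name = time.strftime("snap-%Y%m%d-%H%M")
    requests.put(f"{BASE}/_snapshot/s3_backup/{name}").raise_for_status()
    time.sleep(15 * 60)
```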

A lovely housewarming

After the upgrade the benefits were immediately obvious: general query performance improved by 13%, and some of our most frequently used features (those that show the context and history of a social object) saw a much more dramatic improvement of 30% to 40%.

To put that into perspective, these features are currently used over 8,000 times a day, and this translates to a time saving of 6 hours a day across all our clients (roughly 2.7 seconds saved per use)!

Looking forward

From a process perspective, we’ve emerged from this project with a better approach to performing upgrades, an even more robust backup system, and a tested disaster recovery procedure for the unlikely event of an outage along the way.

As our quest to house more bytes and consume more data sources goes global, we need a scalable and secure place to store all this new data. Our new infrastructure environment delivers exactly that, making it easier for us to get more data into our Customer Data Platform and ensuring our clients can access it at lightning speed.