Triggered Messaging

Blog


Lessons Learnt From Using MongoDB With High Volume Data

Here at Triggered Messaging, we process high volumes of real-time data.

Whilst building our product, which provides real-time emails to recover abandoned carts for eCommerce sites, we've found a few things:

Drop, don't delete

We take high-volume streams of incoming event data from our clients' shopping cart systems. Initially, we insert these into a collection, waiting to be processed. Once they're processed, we have no further use for them. This left us with some options for how to clear up the temporary data :

  1. Remove each record after we've processed it
  2. Remove all records at the end
  3. Drop the collection at the end
  4. Drop the database at the end

I created a set of tests on github that create a database with 100,000 documents (~1.5GB) in it, then attempt to get rid of them.

Test Average Time
Remove Each1 66.9s
Remove All 23.0s
Drop Collection 0.01s
Drop Database 0.22s

Note that there's some overhead in this test due to looping through each ObjectID to call remove against it. However, you'd be doing something similar in real life.

The test is fairly artificial - all the data items are the same shape and contain the same size of data for consistency, however it does show that, for larger collections / databases, it can often be more efficient to drop the whole collection or database than remove the items within it.

Our system is architected so we can drop all the collections in a database at once, so it's more efficient for us to drop the whole database and all the collections within it. This allows us to rotate out old data at low cost.

Schemaless is Great

One of the main attractions of a No SQL database like Mongo is the lack of pre-defined schemas. This works well for us, as we have different versions of our scripts pushing in different shapes of data from different cart systems, with different fields present.

Thankfully the DB needs to know very little about the data (bar any indexes) and our code just needs to cope gracefully with missing fields.

Schemas are Great

At the same time, having a schema can be a very useful safety net - what happens if one of our data sources is missing fields? What if there's a typo in a fieldname somewhere? Our data is doomed!?!

We wrote and use a schema validator based on https://github.com/JamesCropcho/variety/ to allow us to run tests on our scripts, then check the shape of the data against a set of expectations.

Fork it here: http://dhendo.github.com/node-mongodb-schema-validator/

 

Posted in Technology on August 8, 2012 at 10:48 by David Henderson  |  Permalink  | 
  • E-mail
  • Facebook
  • Twitter
  • Google+
  • Reddit
  • Tumblr

Why would you use a DBMS instead a messaging system like RabbitMQ or ZeroMQ?

“We take high-volume streams of incoming event data from our clients’ shopping cart systems. Initially, we insert these into a collection, waiting to be processed. Once they’re processed, we have no further use for them.”

Sounds like a really appropriate usage of a messaging system.

@Oliver yes, we did consider that, and we do use 0MQ extensively to provide smart socket functionality elsewhere in the system.

We could have used a more traditional queue system like Rabbit to hold this incoming data, but the way we process the data is more involved than a simple FIFO, so the extra flexibility and querying functionality a DBMS gives us is very useful.

There’s another reason to drop whole databases and not collections: Mongo stores all collections on disk together, dropping a collection only creates holes and will not reclaim disk space. Worst case you end up with a badly fragmented DB after a few rounds. Dropping whole DBs gets around this completely.

@Theo. Good comment. We were cautious when chosing an architecture and have the freedom to do as you suggest. Triggered Messaging uses logical collection names plus a “discovery” service which maps collections to Mongo databases. This allows our infrastructure team to configure collections with similar lifespans together in each database, so they can drop a whole database when it’s no longer needed. We’re currently investigating how it works in practice.

Add Comment