Wireclub Development

Migrating from Microsoft SQL Server to MongoDB - Lessons Learned

Your database is one of your most important technical stakeholders and will influence every major decision you make during the lifetime of your product. Once you have made your choice and built on top of it, chances are that you would never consider migrating to a radically different solution.

As difficult of a decision as it was, early this year we migrated all our data from Microsoft SQL Server, a traditional relational database, to MongoDB which is a “NoSQL” document-based database.

In this article I will share the challenges we faced and the lessons we have learned from this experience.

Who are we?

Wireclub is a vibrant online community with over 3 million users world-wide. Everyday we serve over a million database-intensive pageviews and service tens of millions of API calls. Our users are very active, exchanging over 1.2 million messages every day. We focus on providing our users tools for real-time interaction therefore performance is very important to us.

Who should read this article?

This article is packed with useful information for anyone planning to use MongoDB in a production environment with large amounts of data. While I will focus on our experience adapting a popular, pre-existing service to use MongoDB, I will also cover several key concepts I believe all MongoDB users should at least be aware of.

Why did we migrate to MongoDB?

Wireclub was born as a hobby and bootstrapped since its inception. It was not until we reached one million users that we saw it as a viable business. By that time we had already committed ourselves to a set of a sub-optimal technologies that were chosen not for being the best solution but simply because they represented the path of least resistance. Using SQL Server as our database was one of those decisions.

Fast forward to 2010 and we found ourselves growing fast and constantly struggling to fit a square peg in a round hole to maintain acceptable performance levels using our original, organic platform.

After much consideration, we finally decided to bite the bullet and revisit all the technology decisions we had made up to that point. This was when we, among other things, systematically reviewed all available alternatives to SQL Server and finally decided that MongoDB was our best bet.

This is not an article on the virtues of NoSQL over relational databases. We are a very pragmatic team (we are not tech-hipsters, we are tech-spartans) and we didn’t pick MongoDB for what it isn’t (SQL) but for what it is: a powerful storage solution optimized for the specific kind of workload generated by web applications.

It was simply a better tool for the job at hand.

Trade-offs of replacing Microsoft SQL Server with MongoDB

Beyond the obvious fact that MongoDB is a document-oriented database and SQL Server is a relational database, there are several other key practical differences to keep in mind.

MongoDB is free: If you just have a handful of servers, licensing costs are unlikely to be a major concern. However if you plan to scale your business aggressively, using a free database may help you to maintain your profit margins while you expand your operations.

MongoDB is fast: When used correctly, MongoDB is incredibly fast. So fast that sometimes we questioned even if it was actually doing anything (yes, it was). If you provision enough RAM to fit all your indexes in memory, the performance is absolutely incredible. While using SQL Server, we worried a lot about caching data outside the database and that added considerable complexity to our code. With our current MongoDB (on SSD) setup we have a database that is almost as fast as any caching system could be.

Freedom from rigid schemas: Coming from a RDMS background, I can’t stress enough how liberating it is to work with a database that does not require the explicit declaration of data schemas. Documents contained within a MongoDB collection can be as homogeneous or as heterogeneous as you wish. This means that augmenting your data models with extra fields is a breeze, and that storing multiple object types inside the same collection is not an issue.

ObjectIDs are expressive and handy: In MongoDB all documents in a collection must contain an unique ID field and by default this field is an ObjectID which is Mongo’s idea of an identity column. The great thing about ObjectIDs is that unlike SQL Server identity sequences, it does not require that all nodes in a cluster coordinate to generate the next unique, monotonically increasing identity value. They also carry the document’s creation time which would be normally stored in a separate field in a SQL database.

GridFS for distributed file storage: We were initially skeptical about GridFS and it turned out to be one of the best features gained from switching to MongoDB. Now thanks to GridFS, all our user contributed data, including pictures are stored in the database, seamlessly replicated across our cluster and augmented with rich application-specific metadata.

For example, when someone uploads an image to Wireclub, we augment the document with image analysis metadata such as location of all faces detected, the percentage of skin-colored pixels (to identify inappropriate images), the percentage of the frame that is covered by faces (to identify portraits), etc. By indexing these fields we are able to search our images for “well framed portraits, taken outdoors with a dSLR camera”.

Developed in the open: MongoDB is open source and 10Gen does a great job at keeping the community in the loop regarding the future of the platform.

Things we will miss from SQL Server

Maturity: SQL Server is a very mature, robust storage solution. Also, when it comes down to durability, we trusted SQL Server completely and absolutely - and we are not quite there with MongoDB yet.

Tools: SQL Server comes with an incredible set of management and development tools which is something MongoDB lacks. The SQL Server Management Studio is an incredibly valuable tool we will sorely miss, not only it provides everything you need to manage your database, it also provides basic features for data exploration as well a robust query and database profiler. Having good tools can save you a lot of time!

Query expressiveness: Microsoft’s Transact-SQL is a Turing-complete variant of SQL which is itself a forth-generation programming language created with the sole purpose of accessing and manipulating data stored within a relational database. Being both Turing-complete and a specialized language, it goes without saying that T-SQL is incredibly expressive within its domain.

In contrast, MongoDB offers a robust albeit limited set of operators that cover all basic requests. If you need anything that is not standard, you may express your logic using Javascript by using the $where clause. While Javascript is also Turing complete, when using $where all operations must be element-wise and that will make several queries either non-trivial or impossible to express.

In practical terms, you should not count on MongoDB for anything other than retrieving data using its built-in operators. This is not necessarily a weakness, just a reflection of Mongo’s philosophy that your business logic should reside within your application.

Transactions: In SQL Server you can execute arbitrary blocks of T-SQL as an atomic transaction, this means that either the whole block is successful or all changes will be rolled back in case of failure. This is very handy when performing complex, multi-step operations.

MongoDB does not offer a transactional mechanism mostly because it is intended for massively parallel deployments and in such environment, the checks necessary to ensure atomicity of transactions would incur a significant performance cost. It does however provide a set of very useful atomic modifiers for operations in single documents.

Joins: In SQL Server it is trivial to retrieve data from multiple tables as one single consolidated set. This is not possible in MongoDB as each query may only act over a single collection. This means that when denormalization isn’t viable operations must be performed in multiple round-trips to the database which can significantly hurt your performance.

One particularly sub-optimal case we identified is that there is no performance-friendly way to query collection (A) for a reasonably large list of items (IDs) that must be retrieved from collection (B) and then paginate the results while sorting for a field in (B). This may sound like an edge case but it turned out to be much more common than we expected.

At the end of the day, eliminating joins makes scaling horizontally much simpler but it also require a significant mind shift from RDMS practices.

Case-insensitive indexes on text fields: In MongoDB all indexes are case sensitive. This means that if you have a field that is case sensitive (for display purposes for example) that is also searchable you must store two versions of the data.

Moving your data into MongoDB

Once you have made your decision to migrate, the next step is to start thinking about how you are going to reshape your relational data to make the most of MongoDB’s document-based storage.

Data model changes: You could simply map relational tables to document collections but we would strongly advise against that. Because in MongoDB you cannot rely on triggers and transactions to keep your normalized data consistent across multiple collections you will have to extensively re-engineer your data models into the appropriate document forms, denormalizing as needed.

Your database is likely to grow: MongoDB collections are schema-less and each document must define its own schema. If you have an element (field) named with 10 byte long identifier in a collection with one million documents that would amount to a 9.5MB storage overhead for that field name alone, never mind the data.

This may sound wasteful and honestly it is. There are several strategies to mitigate this overhead but realistically, you should expect you data to at least double in size when you migrate from SQL Server to MongoDB.

One strategy to mitigate the issue is to keep field names as short as possible. Another is to embrace the fact that in MongoDB not all documents must have all elements and make sure default values are never written to the database. When you restore an object, if a field is missing, simply assume it has the default value.

Treat your data migration code as production code: If everything goes according to plan, you will only ever run your migration code once. It can be tempting to hack your migration logic together and do everything the quick and dirty way. The bad news is that things are unlikely to go according to plan and having a decent codebase to work with is a necessity when you are under pressure and your site is down.

Test your migration code extensively against a full copy of your production database, also make sure you are running your tests on machines that are equivalent or inferior than the ones that you will use for the actual data migration.

Plan for extended downtime: Depending on the size of your database, your migration may take several hours or maybe even days. It is important to have a detailed migration plan, listing all migration steps to make sure you will have all the resources you need once you go offline. Even so. things will probably go wrong and your team will have to adapt to the situation on the fly.

Our migration took approximately 17 hours with dedicated servers for both SQL Server and MongoDB, it would have taken days if both databases were running on the same machine, competing for resources.

Make sure your migration code can be stopped and resumed: This is very important, imagine that your site has been down for hours and you are almost done another migration step that has been running for the last 4 hours and it fails. If you don’t plan your migration code so the work is done in batches and the migration can resume from the last successful batch in case disaster strikes, you may find yourself in a very unpleasant predicament.

General performance considerations: Odds are that SQL Server can read data much faster than MongoDB can write it to the disk. MongoDB will be under constant I/O pressure and it is important to monitor it during the whole process. At some points during our migration we had to give Mongo some time to catch up flushing data to the disk before initiating the next step.

When inserting data into MongoDB, it is imperative that you use your driver’s batch insert methods otherwise you will experience low insertion throughput. When possible, you should also gather all the data you need to assemble a document in memory and write it to MongoDB in one step (batch insert) instead of relying on updates/upserts.

Deployment Considerations

Durability: We ran our SQL database in a centralized server and thanks to ACID compliance, our data was always guaranteed to be consistent when we used shadow-copy snapshots to back it up.

MongoDB on the other hand provides durability through journaling and replication. Simply put if you can’t afford to run MongoDB with journaling on - it is optional and it comes with a performance cost - and in a cluster of at least 3 machines (a replica set), you can’t afford to run it at all.

The good news is that not all nodes must be a full featured server. In our case, we simply commissioned resources in other machines in our network to act as replication targets.

MongoDB also offers a lot of control over how your data is replicated through your cluster. For example you can designate a node to replicate with a delay and maintain multiple snapshots of your data. If you are not planning to query a slave node, you can also reduce its resource requirements by replicating only the data without the indexes. Overall, maintaining a MongoDB cluster is a breeze.

Issues running on Windows: At this time I strongly advise you against running MongoDB on Windows. This is not an statement about Windows fitness as a server OS by any means, all the problems we faced lie squarely within 10Gen’s jurisdiction. It seems like Windows is considered a second-class citizen by MongoDB developers.

Here is a rundown of the issues we experienced running MongoDB on a Windows server:

Rampant memory use to the detriment of all other running processes. Over time a MongoDB instance will take over all available RAM and it will never release it back to the OS.
Spotty, unreliable performance that degrades as run time increases. Probably related to the memory issue but the throughput of a MongoDB instance on Windows will steadily decrease over time. This situation is so real that in one of our backup nodes that is Windows server we had to schedule MongoDB to restart twice a day otherwise it wouldn’t be able to keep up with the master node. Restarting the MongoDB process seems to restore throughput back to normal levels and allow the node to "catch up" with the master.
Excessive locking: we ran our benchmark test on the same machine under both Ubuntu 11.04 and Windows Server 2008 R2. Locking percentages on Linux were constantly under 10% while the same tests caused the MongoDB running on Windows to register lock percentages in around 70-80%.
Unexplained crashes where no useful diagnostic information is provided: a truly terrifying situation to find yourself in, not only the database would crash and exit without providing any useful information but the data would also be left in a corrupted state. While this only happened to us during high-throughput write tests, it is the kind of thing that should never happen (ever).

Issues running on Linux: MongoDB works flawlessly on Linux with the only exception being that if your server has a NUMA architecture you may sporadically experience a deadlock condition. This issue is so severe that we seriously considered giving up on MongoDB because of it.

We worked closely with 10Gen to diagnose this problem and they were very helpful but ultimately unable to help us. Basically weeks before the big relaunch day, we found ourselves with a big problem and no one that could help us.

Eventually it was an Oracle support article that indirectly helped us identify and solve the problem. If you have a machine with a NUMA architecture and you are using a up to date version of Linux (which you should) it is very likely that your kernel has NUMA optimizations turned on.

As it turns out, those optimizations not only greatly harm MongoDB’s performance, it also frequently leads to an unrecoverable deadlock state. The solution for us was to disable all NUMA optimizations at the kernel level.

The take home lesson from our experience is that when it comes down to adopting new technologies with a relatively small installed base, you are more likely to find yourself in the uncomfortable predicament of needing help and not having anyone to help you. This is a risk factor that should not be ignored and you should have a contingency plan for - in our case we were ready to fork and adapt MongoDB to our needs if necessary.

Again I must stress that 10Gen was incredibly supportive and that they did their best to help us, they just were unable to at that time.

Maximizing MongoDB Performance

Our team has extensive experience building high-performance applications, before we created Wireclub, we have worked on real-time simulations, high-frequency trading and other similarly demanding fields. We cultivate a culture of pragmatic performance and as a result we maintained an average request fulfilment time of 50ms while servicing over 2.5 millions users from a single low-cost server with Microsoft SQL Server.

Before migrating to MongoDB we allocated time to figure out everything we could do to maximize throughput. Following are the most significant ways we have identified to boost MongoDB’s performance:

Go Solid State: If you can afford it, invest on solid state storage - it is absolutely worth every penny.

Use multiple SSDs/HDDs: Instead of buying one big disk, buy several smaller ones, have your OS and swap running on its own dedicated disk, your journal on a separate one and then spread your document collections across 1-2 additional physical volumes. By doing so you are maximizing the SATA/SAS bandwidth available to MongoDB.

Keep your indexes in RAM: Make sure your server has enough RAM available to keep all your important indexes in memory, this can greatly increase MongoDB’s query/update throughput.

Use MongoDB’s safe mode wisely: There are two ways to write to a MongoDB collection: with “safe mode = on” you submit a request and wait for a result which requires a round trip to the server. Alternatively if you don’t need to know the result of a given operation you can set “safe mode = off” and your request will return immediately - although you will not know if it was successful or not.

Only retrieve the fields you need: Retrieving whole documents is the equivalent of “select * from table” in SQL and it is just as bad for performance. When you are fulfilling hundred or thousands of request per second, it is certainly advantageous to only retrieve the fields you need.

Profile your application often: If your stack doesn’t offer a good, free profiler, I recommend you invest in one. If there are no profilers available for your stack, I am afraid you picked the wrong stack. Optimizing without a profiler is akin to developing without a debugger - a silly endeavor for people that don’t like to get things done.

Instrumentalize your driver: MongoDB drivers come with source code which makes it straightforward to trap and time all calls to the database. Be careful to make sure you capture the network time which might not be trivial if your driver uses asynchronous methods.

By doing this you could easily build a real-time query profiler and make sure you only spend time optimizing the queries that matter most.

Here is what our realtime query profiler looks like:

Start testing performance early: The idea isn’t to optimize prematurely but simply rule out queries and data models that might cause unreasonable I/O pressure later on.

Work with full-size databases: Often the performance characteristics of database interactions are dependent on the actual data being manipulated. I recommend that you do all your performance tests against a copy of your production database because this will avoid many unpleasant surprises later on.

Finally, always keep in mind that synthetic benchmarks don’t aways translate into real world performance. Things will come up once you go live, be prepared to address unforeseen throughput issues as they arise.

Conclusion

At Wireclub Media we really love MongoDB. The migration process was at times arduous but ultimately it paid off handsomely. Our new system is faster, more resilient and it is based on a data model which is much more flexible - all thanks to MongoDB.

I hope this article was useful to you!

- Rod (@rfurlan)

ps1. We held a Q&A on HackerNews about our experience with MongoDB.

ps2. I am also answering questions about this article on Reddit.