Struggling against the limits of even the most cutting-edge big-data tools, developers at Facebook have announced Corona, a new open-source tool set designed to enable data processing on the incredible scale required by the massive social-media network.
Admittedly, there aren't a lot of companies with big data this big, but Corona should have spill-over benefits for the entire Hadoop-using big data community. Increasing Hadoop's overall efficiency should make big data even more accessible and faster to use, enabling big data applications to disrupt even more industries.
Conventional databases that reach into the terabytes are difficult and expensive to work with. It's not just the size of databases that causes problems, either. Growth (of the data itself, of systems and of queries) can waylay even the most expertly managed database. It's hard to keep all of the data tables and relationships intact across the new machines you have to keep adding to store all that data.
Really, Really Big Data
Keep that in mind, and then ponder Facebook's problem: Its largest collection of data is 100 petabytes. That's 104,857,600 gigabytes. Think you're pretty badass with your 3TB hard drive? This is the equivalent of over 34,000 such drives. And that's just Facebook's largest collection of data, not all of it.
And growth? According to Facebook engineers, about half a petabyte comes in from the social network every 24 hours. That's 512 terabytes every single day.
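For readers who want to check the arithmetic, here is a minimal sketch of the unit conversions behind these figures. It assumes binary units (1 petabyte = 1,024 terabytes, 1 terabyte = 1,024 gigabytes), which is the convention that turns half a petabyte into 512 terabytes; the variable names are purely illustrative.

```python
# Assumption: binary units, so 1 PB = 1024 TB and 1 TB = 1024 GB.
TB_PER_PB = 1024
GB_PER_TB = 1024

largest_pb = 100        # Facebook's largest data collection, per the article
drive_tb = 3            # the consumer hard drive used for comparison
daily_intake_pb = 0.5   # "about half a petabyte" arriving per day

largest_tb = largest_pb * TB_PER_PB
largest_gb = largest_tb * GB_PER_TB
drives_needed = largest_tb / drive_tb
daily_intake_tb = daily_intake_pb * TB_PER_PB

print(f"{largest_gb:,} GB")                  # 104,857,600 GB
print(f"{drives_needed:,.0f} drives")        # 34,133 drives
print(f"{daily_intake_tb:,.0f} TB per day")  # 512 TB per day
```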
This is the kind of data load that would make all but the most expensive database software and hardware on the planet go pfft! with a tiny puff of smoke and a barely audible scream of pain from the server room.
The Rise Of Corona
In standard Hadoop MapReduce, a single job tracker both manages the cluster's resources and schedules and tracks every job. Corona divides the job tracker's responsibilities in two. First, a new cluster manager handles cluster resources and keeps an eye on what's available in that cluster. In grocery-store terms, Corona lets more shoppers into the store and opens more checkout lanes.
At the same time, Corona creates a dedicated job tracker for each job, which means the job tracker no longer has to be tied to the cluster. With Corona, smaller jobs can be processed right on the requester's own machine. Continuing the shopping example, it's like putting an automat outside the store that will give you anything in the store, but only a few items at a time.
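The split described above can be pictured with a toy model. This is only a sketch of the idea, not Corona's actual code or API: the class names `ClusterManager` and `JobTracker`, their methods, and the node and task names are all hypothetical, and real schedulers deal with failures, fairness and data locality that are ignored here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterManager:
    """Hypothetical resource manager: knows free slots, not job progress."""
    free_slots: Dict[str, int]

    def grant(self, wanted: int) -> List[str]:
        """Hand out up to `wanted` slots, returning the node name of each."""
        granted: List[str] = []
        for node in self.free_slots:
            while self.free_slots[node] > 0 and len(granted) < wanted:
                self.free_slots[node] -= 1
                granted.append(node)
        return granted

    def release(self, nodes: List[str]) -> None:
        """Return slots to the pool once their tasks have finished."""
        for node in nodes:
            self.free_slots[node] += 1


@dataclass
class JobTracker:
    """Hypothetical per-job tracker: one instance follows one job's tasks."""
    job_name: str
    tasks: List[str]
    done: List[str] = field(default_factory=list)

    def run(self, manager: ClusterManager) -> None:
        pending = list(self.tasks)
        while pending:
            nodes = manager.grant(len(pending))
            if not nodes:
                raise RuntimeError("no free slots in the cluster")
            # Pair as many pending tasks as there are granted slots.
            batch, pending = pending[:len(nodes)], pending[len(nodes):]
            self.done.extend(f"{task}@{node}" for task, node in zip(batch, nodes))
            manager.release(nodes)


manager = ClusterManager(free_slots={"node-a": 2, "node-b": 1})
job = JobTracker("daily-report", tasks=["map-0", "map-1", "map-2", "map-3"])
job.run(manager)
print(job.done)
```

The point of the design is that the manager never needs to know what a job is doing, and each tracker never needs to know about any job but its own, so neither one becomes a bottleneck as the number of jobs grows.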
Facebook has been using Corona for some time, according to the engineering team's blog post, and is still working to improve it, as well as to apply its improved scheduling features to other types of applications. More important, the team revealed that their efforts have been open-sourced, which means other developers can take this work and apply it to their own Hadoop systems.
Improved job scheduling in Hadoop is a pretty big deal, if only because any Hadoop system can be improved with this kind of functionality. It's something the Apache Hadoop developer team has already been working on in MapReduce 2.0, its new version of MapReduce, with the introduction of the YARN ResourceManager. But at the scale Facebook is working at now, the engineering team could not wait for YARN.
Given that Corona is also open source, there's a good possibility that work on MapReduce 2.0 will benefit from Facebook's efforts, improving everyone's Hadoop experience. That's something anyone in big data can Like.