Hundreds of millions of tweets are created every day and turn into over 1 trillion events for Twitter’s data center to process, which is why Twitter is one of the largest Hadoop users in the world.
Hadoop stores those events and performs analytics on the data. A typical Hadoop cluster at Twitter can have over 100,000 hard disk drives constantly in use, but hard disks weren't delivering enough IOPS for applications to get fast access to the data. HDFS data and temporary data managed by YARN often flowed to the same drives at the same time, creating a performance bottleneck. Something had to change.
With help from Intel, Twitter developed a new Hadoop solution using Intel® Cache Acceleration Software (Intel® CAS) to selectively cache the temporary YARN files on a fast solid state drive.
The two data streams were no longer competing, so hard disk drive utilization dropped, and Hadoop could serve up data faster.
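The idea of separating the two streams can be illustrated with a hedged configuration sketch. In a stock Hadoop deployment, YARN's NodeManager writes shuffle and other intermediate files to the directories listed under `yarn.nodemanager.local-dirs`, while HDFS block storage lives under `dfs.datanode.data.dir`. Pointing the YARN directories at an SSD-backed (or Intel CAS-accelerated) mount keeps temporary data off the hard drives. The mount paths below are hypothetical, and this sketch only shows the directory split, not the Intel CAS block-level caching itself.

```xml
<!-- yarn-site.xml (sketch): route YARN's temporary/intermediate data
     to an SSD-backed mount so it no longer competes with HDFS traffic.
     /mnt/ssd-cache is a hypothetical mount point. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/ssd-cache/yarn/local</value>
</property>

<!-- hdfs-site.xml (sketch): HDFS DataNode directories stay on the
     hard disk drives. Paths are hypothetical. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/hdd1/hdfs/data,/mnt/hdd2/hdfs/data</value>
</property>
```

With the streams split this way, the hard drives serve only HDFS reads and writes, which is what allowed their utilization to drop.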
Removing the storage I/O bottleneck allowed Twitter to reduce the total number of racks in the cluster, decreasing the data center footprint. Using fewer but larger hard drives reduced the number of hard disk drives in a cluster by 75 percent without negatively impacting performance.
Twitter could now take advantage of more CPU horsepower, moving from 4-core processors to 24-core processors. Fewer systems, hard drives, and racks in the Hadoop clusters meant reduced maintenance costs and less energy needed to produce the same results.
Optimizing the storage performance resulted in much faster runtimes and a lower total cost of ownership (TCO). As a result, Twitter's Hadoop cluster can continue to scale as its data grows, while still delivering the great experience its users expect.