Advantages of Facebook’s heat map

In a blog post, Sean Lynch, an engineer in Facebook’s cache performance group, wrote that when something goes wrong on the social network, the fault is very difficult to track down because Facebook has grown in both size and complexity.

Heat maps hold promise for overseeing data centre operations. One Facebook engineer investigated their use for quickly finding likely issues in the data centre. Whenever a technical problem arises on the social networking site, the cache performance team has to make sure that the caching mechanism is not the cause. A heat map can efficiently represent the operational status of a large number of components. Each component is drawn as a cell in a large matrix, and the colour of the cell represents the component’s health. A green cell might indicate a node that is working properly, whereas a red cell indicates a node that is not.

Facebook runs two main cache systems: Memcache and TAO, a caching graph database. Both produce abundant performance metrics, including request rates, error rates and various latency statistics. According to Lynch, the caching team was already using a generic heat map to monitor performance. However, that software could not easily fit the visual data onto a single screen. It used colours to stand for different values but gave no intuitive indication of whether a server was working properly. Nor did it expose the underlying data that would show whether a particular host was healthy.

Claspin, the heat map tool Lynch designed, is named after a protein. In Claspin, every cluster of servers has its own heat map, with hosts ordered by rack number within the data centre. Problems at the cluster level and the rack level can therefore be identified just by looking at the heat map.
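
The cluster-and-rack layout described above can be sketched in a few lines. This is only an illustration, not Claspin's actual code: the function name layout_by_cluster, the host names and the (cluster, rack, hostname) input shape are all assumptions, and Python is used here for brevity.

```python
from collections import defaultdict


def layout_by_cluster(hosts):
    """Group hosts into one grid per cluster, with one row per rack.

    `hosts` is an iterable of (cluster, rack_number, hostname) tuples.
    This input shape is an assumption made for the sketch.
    """
    grids = defaultdict(lambda: defaultdict(list))
    for cluster, rack, name in hosts:
        grids[cluster][rack].append(name)
    # Sort racks numerically and hosts by name so that a host always
    # lands in the same cell from one refresh to the next.
    return {
        cluster: [sorted(racks[rack]) for rack in sorted(racks)]
        for cluster, racks in grids.items()
    }


# Hypothetical inventory: two clusters, three racks in total.
hosts = [
    ("cluster-a", 2, "a-host3"),
    ("cluster-a", 1, "a-host2"),
    ("cluster-a", 1, "a-host1"),
    ("cluster-b", 7, "b-host1"),
]
grids = layout_by_cluster(hosts)
```

Because each row corresponds to a rack, a run of unhealthy cells along one row would suggest a rack-level problem, while unhealthy cells scattered across a whole grid would point at the cluster.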

Lynch said that 10,000 hosts can fit on a 30-inch screen at once, with 30 or more stats updated within seconds. The code that analyses and assembles the operational metrics is written in JavaScript, and the heat maps are rendered in SVG format.
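
To show why SVG suits this kind of display, here is a minimal sketch that turns a grid of hosts into SVG markup, one small rectangle per host. It is not Claspin's code (which, per Lynch, is JavaScript); the function render_svg, the cell size and the input shape are assumptions chosen for illustration.

```python
from xml.sax.saxutils import escape

CELL = 10  # cell edge in pixels (hypothetical value)


def render_svg(rows):
    """Render a heat map as an SVG string.

    `rows` is a list of rows, each a list of (hostname, colour) pairs.
    Colours are expected to come from a small fixed set such as
    "green", "red" and "black".
    """
    width = max((len(row) for row in rows), default=0) * CELL
    height = len(rows) * CELL
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for y, row in enumerate(rows):
        for x, (name, colour) in enumerate(row):
            # The <title> child lets a browser show the host name on hover.
            parts.append(
                f'<rect x="{x * CELL}" y="{y * CELL}" '
                f'width="{CELL}" height="{CELL}" fill="{colour}">'
                f"<title>{escape(name)}</title></rect>"
            )
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_svg([[("h1", "green"), ("h2", "red")], [("h3", "black")]])
```

With 10-pixel cells, a grid of 100 by 100 hosts occupies 1,000 by 1,000 pixels, which gives a rough sense of how 10,000 hosts can share one large screen.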

In the heat map, a green box means that the host is performing well, a black box means that the host is down, and a red box indicates that some aspect of its operation, such as a large number of timeouts, is outside the acceptable range. Claspin provides a quick visual overview of operations and also lets users drill down to the precise metrics.
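
The colour rules just described amount to a small classification step. The sketch below is an assumption about how such a rule could look, not Facebook's implementation: the function classify, the stat names and the threshold values in LIMITS are all hypothetical. Returning the reasons alongside the colour mirrors the idea of drilling down from a cell to the metrics behind it.

```python
def classify(metrics, limits):
    """Return (colour, reasons) for one host.

    `metrics` maps stat name to value, or is None if the host did not report.
    `limits` maps stat name to the largest value considered acceptable.
    """
    if metrics is None:
        return "black", ["host is down or not reporting"]
    reasons = []
    for stat, limit in limits.items():
        value = metrics.get(stat)
        if value is not None and value > limit:
            reasons.append(f"{stat}={value} exceeds {limit}")
    # Any stat over its limit turns the cell red; otherwise it is green.
    return ("red" if reasons else "green"), reasons


# Hypothetical thresholds, for illustration only.
LIMITS = {"timeouts_per_min": 50, "error_rate": 0.01}

healthy = classify({"timeouts_per_min": 3, "error_rate": 0.001}, LIMITS)
failing = classify({"timeouts_per_min": 120, "error_rate": 0.001}, LIMITS)
down = classify(None, LIMITS)
```

Here healthy is ("green", []), failing is ("red", ["timeouts_per_min=120 exceeds 50"]), and down is coloured black.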

Lynch said that the tool has so far been a success within Facebook. Other engineering groups have started to use heat maps to watch over their servers as well. As engineers began to use Claspin, more and more servers became operational. He said that he faced many obstacles when he first introduced Claspin, but over time the tool became easier to use and more and more people adopted it.

Facebook is now considering releasing Claspin as open source, as it has done with other internally developed tools.
