Photo caching for a system like Facebook.

If you've been reading about cloud computing systems, you'll know that systems like Facebook achieve such high performance by implementing very large scale caching services that keep huge amounts of data in memory (mostly, images that are likely to be requested soon). For example, in his 2013 SOSP paper on Facebook photo caching, Qi Huang explains that Facebook keeps only two versions of any given image in its disk system: a thumbnail and a high resolution copy. Any other sizes needed are created dynamically and then held in cache for a while. Facebook also needs to deal with viral photos that become very popular very suddenly.
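The resize-then-cache idea above can be sketched in a few lines. This is a toy model, not Facebook's actual code: only two sizes are assumed to live on disk, every other size is recomputed on a miss, and a small LRU keeps recent results. All names and the capacity are illustrative.

```python
from collections import OrderedDict

class ResizeCache:
    """Toy model of derived-size caching: only a thumbnail and a
    full-resolution copy are stored durably; any other size is
    computed on demand and kept in a bounded LRU cache."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.lru = OrderedDict()   # (photo_id, size) -> bytes
        self.resize_calls = 0      # counts expensive recomputations

    def _resize(self, photo_id, size):
        self.resize_calls += 1
        return f"{photo_id}@{size}".encode()  # stand-in for real image scaling

    def get(self, photo_id, size):
        key = (photo_id, size)
        if key in self.lru:                  # hit: refresh recency
            self.lru.move_to_end(key)
            return self.lru[key]
        data = self._resize(photo_id, size)  # miss: recompute, then cache
        self.lru[key] = data
        if len(self.lru) > self.capacity:
            self.lru.popitem(last=False)     # evict least recently used
        return data
```

Counting `resize_calls` before and after enabling the cache is one easy way to show the before/after benefit the project asks for.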

Read about Facebook caching at Facebook Caching.

By reading the paper, you'll have learned that the Facebook caching service runs at multiple locations. Each instance of it is basically a cluster of computing systems that resides at some data center and handles the requests that reach that data center, and it needs to be adaptive to handle rapidly evolving loads and demands. The popularity graphs in the paper would allow you to create synthetic load traces.
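Since photo popularity in the paper's graphs follows a long-tailed distribution, one plausible way to build a synthetic trace is to draw requests from a Zipf-like power law. The sketch below does that; the `alpha` exponent and the photo count are illustrative choices, not values taken from the paper.

```python
import random

def synthetic_trace(num_photos, num_requests, alpha=1.1, seed=42):
    """Generate a request stream whose popularity follows a Zipf-like
    power law: the photo with rank r gets weight 1 / r**alpha."""
    rng = random.Random(seed)  # seeded so demo runs are repeatable
    weights = [1.0 / (r ** alpha) for r in range(1, num_photos + 1)]
    photo_ids = [f"photo-{r}" for r in range(1, num_photos + 1)]
    return rng.choices(photo_ids, weights=weights, k=num_requests)
```

A trace like this lets you replay the same workload against your system with caching disabled and enabled, and against different cache policies.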

With this in mind, design and build a distributed and scalable caching solution for a company like Facebook. Your solution should have the following features:
  • It must be possible to test and demo the solution and to show before/after benefit of caching.
  • The solution should accept streams of requests from client systems and spread the work in some sensible way over the cache servers that participate.
  • Each cache server needs to either resolve a request, or pass it on to Haystack, the Facebook backend image store (for your purposes, use any kind of collection of photos on a disk). The server should have a policy it uses to decide what to cache. For example, the paper describes the S4LRU policy used at Facebook.
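As a starting point for the caching policy, here is a sketch of S4LRU as the paper describes it: four stacked LRU segments, where a miss enters segment 0, a hit promotes an item one segment up (capped at the top segment), and overflow from segment i demotes its oldest item to segment i-1, with segment 0 evicting outright. Segment sizing and class names are my own simplifications.

```python
from collections import OrderedDict

class S4LRU:
    """Sketch of the S4LRU policy: four stacked LRU segments.
    Misses enter segment 0; a hit promotes an item one segment up;
    overflow demotes downward, and segment 0 evicts entirely."""

    def __init__(self, capacity, segments=4):
        self.seg_cap = capacity // segments      # equal split, for simplicity
        self.segs = [OrderedDict() for _ in range(segments)]

    def _find(self, key):
        for i, seg in enumerate(self.segs):
            if key in seg:
                return i
        return None

    def access(self, key):
        """Return True on a hit, False on a miss (the key is admitted)."""
        i = self._find(key)
        if i is not None:
            val = self.segs[i].pop(key)
            self._insert(min(i + 1, len(self.segs) - 1), key, val)
            return True
        self._insert(0, key, True)
        return False

    def _insert(self, i, key, val):
        self.segs[i][key] = val                  # newest at the end
        while len(self.segs[i]) > self.seg_cap:
            old_key, old_val = self.segs[i].popitem(last=False)
            if i > 0:
                self._insert(i - 1, old_key, old_val)  # demote one level
            # else: evicted from the cache entirely
```

Running your synthetic trace through both plain LRU and S4LRU and comparing hit ratios would make a convincing demo.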

So far you probably won't have needed Vsync. But now think about ways that Vsync can help you with tasks such as:
  • Tracking which servers are running the caching system.
  • Failure detection and reconfiguration (you may need to fine-tune the Vsync detection parameters and also make sure that your servers fail "gracefully" so that Vsync learns of a failure instantly!)
  • Load balancing to relieve hot-spots.
  • Dynamically changing the replication levels for certain popular items. For example, if a photo of Angelina Jolie goes viral, you might want to cache it at two or four servers, not just one.
  • Tracking various image access statistics over time to dynamically predict which images are about to go viral. You might do this in a faked (synthetic) way when demoing to your professor or TA, while keeping a "real way" under wraps and ready to go for when you deploy your solution at Facebook for real.
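The load-balancing and dynamic-replication bullets above could be combined in a consistent-hash ring: each photo normally maps to one cache server, and when your statistics flag a photo as hot, you raise its replication factor so the next servers around the ring also hold it. This is one possible design, not the paper's; server names and factors are illustrative, and in a real system the hot-set would come from Vsync-replicated state.

```python
import bisect
import hashlib

class CacheRing:
    """Consistent-hash ring over cache servers. replicas(photo_id)
    returns the servers responsible for a photo; hot photos get a
    higher replication factor and so land on extra ring successors."""

    def __init__(self, servers, vnodes=64):
        # vnodes: virtual nodes per server, to smooth the load split.
        self.ring = sorted(
            (self._h(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.hot = {}  # photo_id -> replication factor

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def mark_hot(self, photo_id, factor):
        self.hot[photo_id] = factor

    def replicas(self, photo_id):
        n = self.hot.get(photo_id, 1)  # default: a single cache copy
        i = bisect.bisect(self.ring, (self._h(photo_id),))
        out = []
        for j in range(len(self.ring)):  # walk the ring clockwise
            s = self.ring[(i + j) % len(self.ring)][1]
            if s not in out:
                out.append(s)
            if len(out) == n:
                break
        return out
```

Note that raising the factor only appends replicas; the primary server for a photo never changes, so already-cached copies stay useful.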

Facebook has a great internship program. If you do this project, you might consider applying, and if that works out, perhaps you can actually experiment with your caching service using traces from the real Facebook system under load. Obviously summer interns don't get to change the real caching service in such a drastic way, but you could probably learn a lot about the "real" issues these systems need to deal with that way!

Last edited Nov 19, 2015 at 8:22 PM by birman, version 2