Crawling, Analyzing, and Ranking 20 Million Photos a Day
Pixable aggregates photos from across your different social networks and finds the best ones so you never miss an important moment. That means currently processing the metadata of more than 20 million new photos per day: crawling, analyzing, ranking, and sorting them along with the other 5+ billion that are already stored in our database. Making sense of all that data has challenges, but two in particular rise above the rest:
- How to access millions of photos per day from Facebook, Twitter, Instagram, and other services in the most efficient manner.
- How to process, organize, index, and store all the metadata related to those photos.
Sure, Pixable’s infrastructure is changing continuously, but there are some things that we have learned over the last year. As a result, we have been able to build a scalable infrastructure that takes advantage of today’s tools, languages and cloud services, all hosted on Amazon Web Services, where we run more than 80 servers. This document provides a brief introduction to those lessons.
Backend architecture – where everything happens
Infrastructure – loving Amazon EC2
We maintain all of our servers on Amazon EC2 using a wide range of instances, from t1.micro to m2.2xlarge, with CentOS Linux. After setting up each type of server, we create our own internal AMI for it. We always have these AMIs ready for immediate deployment when the load increases, thus maintaining a minimum performance standard at all times.
To compensate for these load fluctuations, we developed our own auto-scaling technology that forecasts how many servers we need of each type according to the current and historic load for a particular time of the day. Then we launch or terminate instances just to keep the right level of provisioning. In this way, we are able to save money by scaling down our servers when they’re unnecessary. Auto-scaling in Amazon is not easy, since there are many variables to consider.
For example: it makes no sense to terminate an instance that has been running for just half an hour, since Amazon charges for whole hours. Also, Amazon can take 20+ minutes to launch a spot-instance. So for sudden spikes in traffic, we do some clever launch scheduling of on-demand instances (which launch much faster), and then swap them out for spot-instances in the next hour. This is the result of pure operational research, the objective of which is to extract the best performance for the right amount of money. Think of it like the film “Moneyball”, but with virtualized servers instead of baseball players.
Our web servers currently run Apache + PHP 5.3 (lately we have been fine-tuning some web servers to run nginx + php-fpm, which will soon become our standard configuration). The servers are evenly distributed across different availability zones behind Amazon’s Elastic Load Balancer, so we can absorb both downtime and price fluctuations on Amazon. Our static content is served through Amazon CloudFront, and we use Amazon Route 53 for DNS services. Yes indeed… we love Amazon.
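The two scaling rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not our production scheduler: the function names, the 50/50 blend of current and historic load, and the five-minute grace window are all hypothetical.

```python
import math

BILLING_PERIOD_MIN = 60  # whole-hour billing, as described in the text


def forecast_servers(current_load, historic_load, capacity_per_server,
                     historic_weight=0.5):
    """Blend current and historic load (assumed weighting), round up to servers."""
    expected = historic_weight * historic_load + (1 - historic_weight) * current_load
    return max(1, math.ceil(expected / capacity_per_server))


def should_terminate(uptime_min, grace_min=5):
    """Only terminate near the end of a billing hour that is already paid for."""
    minutes_left = BILLING_PERIOD_MIN - (uptime_min % BILLING_PERIOD_MIN)
    return minutes_left <= grace_min


def plan(running_uptimes_min, needed):
    """Return ('launch', n), ('terminate', [indices]) or ('hold', 0)."""
    surplus = len(running_uptimes_min) - needed
    if surplus < 0:
        return ("launch", -surplus)
    candidates = [i for i, up in enumerate(running_uptimes_min)
                  if should_terminate(up)]
    if surplus > 0 and candidates:
        return ("terminate", candidates[:surplus])
    # Either provisioning is right, or no surplus instance is close
    # enough to the end of its paid hour to be worth terminating.
    return ("hold", 0)
```

With one surplus server and uptimes of 30, 57 and 118 minutes, `plan` picks the instance at 57 minutes, because its paid hour is nearly used up, and leaves the half-hour-old instance alone.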
Work queue – jobs for crawling and ranking photos, sending notifications and more
Virtually all processing at Pixable is done via an asynchronous job (e.g., crawling new photos from different users from Facebook, sending push notifications, calculating friend rankings, etc). We have several dozen worker servers crawling metadata from photos from different services and processing that data. This is a continuous, around-the-clock process.
As expected, we have different types of jobs: some with high priority, such as real time user calls, messaging, and crawling photos for currently active users. Lower priority jobs include offline crawling and long data-intensive deferred tasks. Although we use the very capable beanstalkd as our job queue server, we have developed our own management framework on top of it. We call it the Auto-Pilot, and it automatically manages the handling of priorities, e.g. by devoting the job server time to the high priority jobs and pausing the lower priority ones when certain sets of conditions on the platform-wide level are met.
We developed very sophisticated rules to handle these priorities, considering metrics that affect the performance of the system and impact the user-perceived speed. These range from the obvious, such as the average waiting time of jobs or the lag of the slaves (ssshhh, we never have lag on our slaves 🙂 ), to more complex metrics such as the status of our own PHP synchronization mutex locks for distributed environments. We do as much as possible to make an equitable trade-off between efficiency and performance.
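To make the idea of pausing low-priority work more concrete, here is a minimal in-memory sketch. It does not use beanstalkd or reproduce the Auto-Pilot rules; the class name, the metric keys (`avg_wait_s`, `replica_lag_s`) and the thresholds are assumptions chosen only for illustration.

```python
import heapq
import itertools

HIGH, LOW = 0, 1  # assumed convention: smaller number = more urgent


class AutoPilotQueue:
    """Toy priority queue that holds back LOW jobs while the platform is busy."""

    def __init__(self, max_avg_wait_s=2.0, max_replica_lag_s=1.0):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority
        self.max_avg_wait_s = max_avg_wait_s
        self.max_replica_lag_s = max_replica_lag_s

    def put(self, priority, job):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def low_priority_paused(self, metrics):
        # Hypothetical platform-wide conditions; real rules would be richer.
        return (metrics["avg_wait_s"] > self.max_avg_wait_s
                or metrics["replica_lag_s"] > self.max_replica_lag_s)

    def reserve(self, metrics):
        """Return the next runnable job, or None if nothing may run right now."""
        if not self._heap:
            return None
        priority, _, job = self._heap[0]
        if priority == LOW and self.low_priority_paused(metrics):
            return None  # leave the job queued until conditions improve
        heapq.heappop(self._heap)
        return job
```

While the reported average wait is above the threshold, `reserve` keeps handing out high-priority jobs and returns `None` once only low-priority jobs remain. The paused jobs stay in the queue and run as soon as the metrics recover.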
Crawling Engine – crawl new photos across Facebook, Twitter and more 24/7
We are constantly improving our crawling technology, which is a complex parallel algorithm that uses a mutex locking library, developed in-house, to synchronize all the processes for a particular user. This algorithm has helped us to improve our Facebook crawling speed by at least 5x since launch. We can now easily fetch in excess of 20 million new photos every day. This is quite remarkable, considering the fact that any large data query to the Facebook API can take several seconds. We’ll get deeper into our crawling engine in a subsidiary document.
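The per-user synchronization can be illustrated with a small single-process sketch. Our real library is written in PHP and works across servers; here ordinary Python thread locks stand in for it, and every name (`lock_for`, `crawl_album`, `crawl_user`) as well as the fake album data is hypothetical.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

_registry_guard = threading.Lock()
_user_locks = defaultdict(threading.Lock)  # one lock per user id


def lock_for(user_id):
    """Return the lock for a user, creating it safely on first use."""
    with _registry_guard:
        return _user_locks[user_id]


def crawl_album(user_id, album, store):
    # Stand-in for a slow remote API call; runs in parallel, outside the lock.
    photos = [f"{album}/photo{i}" for i in range(3)]
    # Only the update of the user's shared state is serialized.
    with lock_for(user_id):
        store.setdefault(user_id, []).extend(photos)


def crawl_user(user_id, albums, store, workers=4):
    """Crawl all albums of one user in parallel; return the photo count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(crawl_album, user_id, a, store) for a in albums]
        for future in futures:
            future.result()  # re-raise any worker exception
    return len(store[user_id])
```

The design point is that the slow fetches overlap freely, while writes for the same user never interleave. Workers handling different users take different locks and do not block each other.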
Data Storage – indexing photos and metadata
Naturally, our data storage grows every day. Currently we store 90% of our data in MySQL (with a memcached layer on top of it), using two groups of servers. The first group is a 2 master – 2 slave configuration that stores the more normalized data accessed by virtually every system, such as user profile information, global category settings, and other system parameters.
The second server group contains manually sharded servers in which we store the data related to user photos, such as photo URLs. This metadata is highly de-normalized to the point where we virtually run the storage as a NoSQL solution like MongoDB, only in MySQL tables (NoSQL-in-MySQL). So you can guess where the other 10% of our data is stored, right? Yes, in MongoDB! We are moving parts of our storage to MongoDB, mainly because of the simple-yet-flexible sharding and replication solutions it offers.
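As a rough picture of what manual sharding with de-normalized rows can look like, consider the sketch below. The shard map, the host names, the user-id ranges and the row fields are all invented for the example; the text above does not specify how our shards are actually keyed.

```python
import bisect

# Hypothetical, hand-maintained shard map:
# (first user_id served by the shard, shard host name)
SHARD_MAP = [
    (0, "photos-shard-01"),
    (1_000_000, "photos-shard-02"),
    (5_000_000, "photos-shard-03"),
]
_STARTS = [start for start, _ in SHARD_MAP]


def shard_for(user_id):
    """Find the shard whose user-id range contains user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    idx = bisect.bisect_right(_STARTS, user_id) - 1
    return SHARD_MAP[idx][1]


def photo_row(user_id, photo_id, url, source):
    """De-normalized row: everything needed to show the photo, no joins."""
    return {"user_id": user_id, "photo_id": photo_id,
            "url": url, "source": source}
```

Because every row carries all the fields a reader needs, a lookup touches exactly one shard and one table. That is what makes the MySQL tables behave much like a document store.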
Logging, Profiling and Analytics
We developed a highly flexible logging and profiling framework that allows us to record events with high granularity–down to a single line of code. Every log event is categorized by a set of labels that we later query (e.g. event of user X, or calls in module Y). On top of that, we can dynamically profile the time between any of the logging events, allowing us to build real-time profiling of our entire system. The logging and profiling system weighs very heavily on the storage system (several thousand updates per second), so we developed a two-level arrangement of MySQL tables: a fast memory-based buffer that serves as a leaky bucket for the incoming data, combined with partitioned MySQL tables that are filled asynchronously later. This architecture allows us to process more than 15,000 log entries per second. We also have our own event tracking system, wherein every user action, from logins to shares to individual clicks, is recorded so it can later be analyzed with complex queries.
We also rely heavily on the wonderful Mixpanel service, an event-based tracking system in which we perform most of our high-level analysis and reports.
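The two-level logging store described above can be approximated in a few lines. This sketch keeps everything in Python data structures rather than MySQL tables, and the class name, the daily partitioning and the label strings are assumptions made for illustration.

```python
from collections import defaultdict, deque

SECONDS_PER_DAY = 86400


class TwoLevelLog:
    """Level 1: fast in-memory buffer. Level 2: partitions filled later."""

    def __init__(self, flush_batch=1000):
        self.buffer = deque()
        self.partitions = defaultdict(list)  # assumed partition key: day number
        self.flush_batch = flush_batch

    def log(self, ts, message, labels):
        # Cheap append only; no partition is touched on the hot path.
        self.buffer.append({"ts": ts, "message": message, "labels": set(labels)})

    def flush(self):
        """Drain at most flush_batch events into partitions (the 'leak')."""
        moved = 0
        while self.buffer and moved < self.flush_batch:
            event = self.buffer.popleft()
            self.partitions[int(event["ts"] // SECONDS_PER_DAY)].append(event)
            moved += 1
        return moved

    def query(self, label):
        return [e for part in self.partitions.values()
                for e in part if label in e["labels"]]

    def elapsed(self, start_label, end_label):
        """Seconds between the earliest events carrying each label."""
        start = min(e["ts"] for e in self.query(start_label))
        end = min(e["ts"] for e in self.query(end_label))
        return end - start
```

Writers only ever append to the buffer, so bursts are absorbed in memory. A background task calling `flush` at a steady rate moves events into the partitions, where label queries and timing between events can run.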
Front End – simple visualization devices
Pixable runs on multiple (front-end) devices, the most popular being the iPhone and the iPad. We also have web and mobile web sites that load a simple index page, and everything else is performed in the client by extensive use of jQuery and Ajax calls. All our web frontends will soon run a single codebase that automatically adapts to mobile or desktop screens (give it a try! http://new.pixable.com). This way, we can run the same code on our main site, on an Android device, or in a Chrome browser extension! In fact, our Android app is a combination of a native app with our mobile web frontend. The Android app renders a frame with some minimal controls, and simply presents the mobile web view inside of it.
It sounds a bit harsh, but all our frontends are “dumb”. All the heavy lifting is performed in the backend, with everything connected via our own private API. This allows us to do rapid development and deployment without having to change the installed user base. Agile, baby!
API – connecting our front end and back end
The API is the glue that keeps everything working together. Using Tonic PHP, we developed our own private RESTful API that exposes our backend functionality. Also, we extended Tonic to make it support built-in versioning. That way, it’s really easy for us to maintain backward compatibility when developing new features or changing response formats in the API. We just configure what versions of the API calls are supported for which device version. We have a no-device-left-behind policy, no matter how old the version is!
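The versioning idea can be shown with a small sketch. It does not use Tonic or reflect how our extension is written; the version table, the handler names and the response fields are hypothetical and only illustrate mapping a device version to a supported API version.

```python
# Hypothetical table: minimum client app version -> API version it is served.
# Entries are kept in ascending order.
SUPPORTED = [
    ((1, 0), "v1"),
    ((2, 0), "v2"),
]


def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def api_version_for(client_version):
    client = parse_version(client_version)
    chosen = SUPPORTED[0][1]  # oldest API is the fallback: no device left behind
    for minimum, api in SUPPORTED:
        if client >= minimum:
            chosen = api
    return chosen


def photo_v1(photo):
    return {"url": photo["url"]}


def photo_v2(photo):
    # Newer response format adds a field; old clients never see it.
    return {"url": photo["url"], "rank": photo["rank"]}


HANDLERS = {"v1": photo_v1, "v2": photo_v2}


def render_photo(client_version, photo):
    """Pick the response format that this client version understands."""
    return HANDLERS[api_version_for(client_version)](photo)
```

An old 1.x client keeps receiving the original response shape, while a 2.x or later client gets the extended one. Changing a response format then only requires adding a handler and a table entry.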
But the API does not do any actual work. It simply relies on the information from the front-ends, passing it to the actual Pixable heart: the backend.