Arguably, good images drive dynamic and engaging service-based web content. Managing those images properly, by ensuring fast load times, good clarity and more, can be the difference between a frustrating web experience and a rewarding one for consumers.
After joining TrueCar, my first project was to replace an ImageProcessor that was disrupting the user experience. The legacy ImageProcessor had scalability and performance issues: it could not finish its daily workload within 24 hours, and it was not resilient to Internet-related errors such as HTTP 404 and HTTP 403 responses.
Our image processing pipeline is responsible for:
- Acquiring images of vehicles
- Identifying updates/deletes to images
- Generating unique URLs for images
- Cropping and resizing images
- Copying images to AssetServers
- De-duplication of images
In the figure above, the left side shows a raw image from a dealer feed, while the right side is a snippet of the same vehicle listing on our website. In the listing, we show two image resolutions. All of these images, which come from various dealer feeds, are acquired, resized, named and stored by the ImageProcessor.
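The post does not describe how ImageProcessor names or de-duplicates images. One common approach, shown here purely as an illustrative assumption, is to derive both the unique URL and the duplicate check from a hash of the image bytes. The base URL and the function names below are hypothetical.

```python
import hashlib

# Hypothetical base URL, used only for illustration.
ASSET_BASE_URL = "https://assets.example.com/images"


def image_key(image_bytes):
    """Return a content-derived key: identical bytes always give the same key."""
    return hashlib.sha256(image_bytes).hexdigest()


def unique_url(image_bytes, width, height):
    """Build a URL that is unique per image content and per resolution."""
    return f"{ASSET_BASE_URL}/{image_key(image_bytes)}_{width}x{height}.jpg"


def deduplicate(images):
    """Keep the first copy of each distinct image, preserving input order."""
    seen = {}
    for data in images:
        seen.setdefault(image_key(data), data)
    return list(seen.values())
```

With this scheme, two dealer feeds that deliver byte-identical photos map to the same key, so the image only needs to be stored once.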
At TrueCar, our image catalogue is quite large: close to 400 million images. On any given day, we process more than 30 million image URLs. Fetching these URLs from the Internet has very high latency, ranging from 200ms to 5000ms with an average of 300ms. At that average, processing 30 million URLs in one day takes 30,000,000 × 0.3s = 9,000,000s, so we would need at least 2500 hours of CPU time.
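The estimate above can be reproduced in a few lines of Python. The sketch also works out how many fetches would need to run in parallel to fit into a 24-hour window; that second step is our own extension of the arithmetic, not a figure from the original design.

```python
import math

# Figures quoted in the text above.
URLS_PER_DAY = 30_000_000   # image URLs processed per day
AVG_LATENCY_S = 0.3         # average fetch latency (300ms)


def total_fetch_hours(urls, avg_latency_s):
    """Total hours needed if every fetch ran one after another."""
    return urls * avg_latency_s / 3600


def min_parallel_fetchers(total_hours, window_hours=24.0):
    """Smallest number of concurrent fetchers that fits the work into the window."""
    return math.ceil(total_hours / window_hours)


hours = total_fetch_hours(URLS_PER_DAY, AVG_LATENCY_S)
fetchers = min_parallel_fetchers(hours)
print(f"{hours:.0f} hours of work, at least {fetchers} parallel fetchers")
```

In other words, a single-threaded process cannot come close to finishing in a day; the work has to be spread across roughly a hundred or more concurrent fetchers.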
Storage and access can also make or break the whole operation. Storing and retrieving hundreds of millions of images in near real-time requires a scalable database. Based on the above challenges and requirements, we decided to build ImageProcessor using the following components: MapReduce, HBase and Kafka.
Above is an overview of our new ImageProcessor pipeline using MapReduce, HBase and Kafka. Below, we discuss why we selected each component.
- HBase: Datastore for storing images and historical data for images
HBase is great at storing very large datasets and provides near real-time reads and writes of data in HDFS. It also supports dynamic schemas. Vehicles can have a varying number of images, and this fits perfectly with HBase's dynamic data model. HBase is not generally considered a good fit for storing large files, but our average image file size of 200KB (ranging from 5KB to 5000KB overall) was well within limits.
(PS: We are awaiting the HBase Medium Objects (MOB) feature for better support for storing files up to 5MB.)
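To illustrate why a dynamic schema suits a varying number of images per vehicle, here is a minimal in-memory sketch of a wide-row layout. It deliberately does not use the HBase client API; the row-key choice (a vehicle ID), the `img:` column prefix and the helper names are all assumptions made for illustration.

```python
from collections import defaultdict

# table[row_key][column] -> value, mimicking a sparse, wide-column row.
# Each row can hold a different set of columns; no schema change is needed.
table = defaultdict(dict)


def put_image(vehicle_id, index, data):
    """Store one image in its own column of the vehicle's row."""
    table[vehicle_id][f"img:{index:03d}"] = data


def get_images(vehicle_id):
    """Return all images for a vehicle, ordered by column name."""
    row = table.get(vehicle_id, {})
    return [row[col] for col in sorted(row) if col.startswith("img:")]


put_image("vehicle-1", 0, b"front")
put_image("vehicle-1", 1, b"rear")
put_image("vehicle-2", 0, b"front")   # a row with fewer columns is fine
```

A vehicle with two photos and a vehicle with forty simply end up as rows with two and forty image columns respectively.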
- MapReduce: Computation engine for ImageProcessor
As discussed above, ImageProcessor needed 2500 hours of CPU time every day to process all of our images. MapReduce is really good at parallelizing this kind of workload. By using MapReduce, we avoided writing complex multithreaded application code to get 2500 hours of CPU time every day. It also has high fault tolerance and is ready to use, more or less, right out of the box.
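The shape of such a job can be sketched in plain Python. This is not Hadoop code: the map, shuffle and reduce steps are ordinary functions, and `fake_fetch` is a hypothetical stand-in for a real HTTP download so that the example runs offline. The point it illustrates is that the map step turns HTTP errors such as 404 and 403 into ordinary output records instead of failing the job.

```python
from collections import defaultdict


class FetchError(Exception):
    """Raised by a fetcher when the server answers with an error status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def map_fetch(url, fetch):
    """Map step: fetch one URL and emit (outcome, url) without raising."""
    try:
        fetch(url)
        return ("ok", url)
    except FetchError as err:
        return (f"http_{err.status}", url)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(groups):
    """Reduce step: count how many URLs ended in each outcome."""
    return {key: len(values) for key, values in groups.items()}


def fake_fetch(url):
    """Hypothetical fetcher used only for this example."""
    if url.endswith("missing.jpg"):
        raise FetchError(404)
    if url.endswith("private.jpg"):
        raise FetchError(403)
    return b"image-bytes"


urls = [
    "http://dealer.example/a.jpg",
    "http://dealer.example/missing.jpg",
    "http://dealer.example/private.jpg",
    "http://dealer.example/b.jpg",
]
summary = reduce_count(shuffle(map_fetch(u, fake_fetch) for u in urls))
```

Because each URL is handled independently in the map step, the framework is free to spread the fetches across as many workers as the cluster provides.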
- Kafka: Publisher/Subscriber for pushing images to AssetServers
Once the images are processed, we need a queue to deliver them to our AssetServers. The AssetServers are the servers that serve images to our front end. The output from the MapReduce jobs would quickly overwhelm a single AssetServer or even a farm of servers, but a Kafka cluster can hold the image outputs while the AssetServers download them at their own pace. Another benefit of a Kafka cluster is that we can scale the inputs from the MapReduce jobs independently from the outputs to the AssetServers. If the AssetServers are unable to keep up, we can simply build more of them.
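The decoupling described above can be sketched with Python's standard `queue` and `threading` modules. This is only an analogy: Kafka's partitions, offsets and consumer groups are not modelled, and the function names are hypothetical. The producer writes as fast as it likes, the buffer absorbs the burst, and any number of consumers drain it at their own pace.

```python
import queue
import threading


def produce(buffer, images):
    """Stand-in for the MapReduce output: push every image into the buffer."""
    for image in images:
        buffer.put(image)


def consume(buffer, delivered, lock):
    """Stand-in for one AssetServer: pull images until a sentinel arrives."""
    while True:
        image = buffer.get()
        if image is None:   # sentinel: no more work for this consumer
            break
        with lock:
            delivered.append(image)


def run_pipeline(images, num_asset_servers):
    """Run one producer and several consumers; return everything delivered."""
    buffer = queue.Queue()
    delivered = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consume, args=(buffer, delivered, lock))
        for _ in range(num_asset_servers)
    ]
    for worker in workers:
        worker.start()
    produce(buffer, images)
    for _ in workers:
        buffer.put(None)    # one sentinel per consumer
    for worker in workers:
        worker.join()
    return delivered
```

Changing `num_asset_servers` speeds up or slows down draining without touching the producer, which mirrors the independent scaling of inputs and outputs described above.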
All these pieces work together in a pipeline that is fault tolerant, highly available and scalable. It also finishes our daily workload in a very timely manner. As a result, TrueCar.com's image coverage has improved from roughly 70% to 99.99%, and both performance and customer satisfaction have increased. The figure below illustrates the impact of the Hadoop ImageProcessor on TrueCar.com. You can see on the “Before” side that we were missing some images; the “After” image, however, shows the kind of coverage we’ve gotten with the Hadoop ImageProcessor.
In the next three blog posts, we will go into great detail about the design internals of MapReduce, HBase and Kafka as they relate to building ImageProcessor.