How Delivery Hero Serves AI Generated Food Photos
The Business Case for AI Food Imagery
Iaroslav Amerkhanov, who leads AI solutions development at Delivery Hero, shared the story of an impactful generative AI project. It all began with a simple business hypothesis: the quality of menu content positively influences customer conversion rates. A data analysis confirmed this, revealing that a significant number of products lacked either an image or a description. The team quickly realized that images were the critical missing piece, as only 14% of products were purchased without a visual.
This insight sparked the initiative to use generative AI to create high-quality food images. The team brainstormed two primary solutions: a straightforward text-to-image generation process and a more sophisticated inpainting technique that would replace food items within existing vendor photos.
From MVP to a Scalable Solution
The Minimum Viable Product (MVP) was built on the Google Cloud Platform (GCP), utilizing Cloud Run, Postgres, and Vertex AI Pipelines—a wrapper over Kubeflow that simplified model orchestration. Initially, the team leveraged OpenAI's DALL·E, one of the most advanced models at the time. The pipeline extracted data, ran it through the generative flow, and presented the resulting images to content teams via Google Forms for selection.
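To make that flow concrete, here is a minimal sketch of what such a Kubeflow-style pipeline could look like with the KFP SDK. The component names, base image, and stubbed generation step are illustrative assumptions, not Delivery Hero's actual code.

```python
# Minimal sketch of a Vertex AI (Kubeflow) pipeline for the MVP flow.
# Component names and the stubbed steps are illustrative assumptions.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.10")
def extract_products(output_csv: dsl.Output[dsl.Dataset]):
    # Pull products that are missing images from the catalogue (stubbed here).
    with open(output_csv.path, "w") as f:
        f.write("product_id,product_name\n42,margherita pizza\n")

@dsl.component(base_image="python:3.10")
def generate_images(products: dsl.Input[dsl.Dataset], images: dsl.Output[dsl.Artifact]):
    # Call the image-generation model (DALL-E at the MVP stage) for each product.
    ...

@dsl.pipeline(name="menu-image-generation")
def image_pipeline():
    products = extract_products()
    generate_images(products=products.outputs["output_csv"])

if __name__ == "__main__":
    compiler.Compiler().compile(image_pipeline, "pipeline.yaml")
```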
The positive feedback and growing demand from various entities within Delivery Hero made it clear that the project needed to scale. This led to a crucial decision: to reduce costs and gain more control, they needed to host their own model internally. The team chose to work with Stable Diffusion.
Mastering Stable Diffusion for Production
To understand their optimization efforts, it's helpful to know how Stable Diffusion works. It primarily consists of three components:
- Variational AutoEncoder (VAE): Maps images into a compressed vector space (latent space) and back.
- U-Net: A neural network trained to predict and remove noise from the image in the latent space.
- CLIP (Contrastive Language-Image Pre-Training): Aligns text descriptions and images within the same latent space, allowing the model to generate images based on text prompts.
The generation process involves adding noise to an image representation (the forward diffusion process) and then systematically removing that noise over several steps, guided by the text prompt, to create a new image (the reverse process).
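As a point of reference, this is roughly what that loop looks like when driven through the Hugging Face diffusers library. The checkpoint and prompt below are placeholders, not the production setup.

```python
# Minimal text-to-image sketch with Hugging Face diffusers.
# The checkpoint and prompt are placeholders, not the production setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "margherita pizza on a white plate, studio lighting, top-down view"
# The pipeline encodes the prompt with CLIP, starts from latent noise,
# runs the U-Net denoising loop, and finally decodes the latents with the VAE.
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("pizza.png")
```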
Overcoming Technical Hurdles with CLIP
While implementing Stable Diffusion, the team encountered unexpected challenges with the CLIP component. Firstly, the model struggled to generate accurate images for products with non-Latin names, such as those in Chinese or Arabic, despite CLIP's multilingual training. The straightforward solution was to translate all non-Latin product names to English via an API, which resolved the issue.
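The talk does not name the translation service, but the step itself is a single API call before the prompt is built. A minimal sketch, assuming the Google Cloud Translation client:

```python
# Sketch: normalise non-Latin product names to English before prompting.
# Assumes the Google Cloud Translation API; the talk does not specify the service.
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_english(product_name: str) -> str:
    result = client.translate(product_name, target_language="en")
    return result["translatedText"]

prompt_name = to_english("宫保鸡丁")  # e.g. "Kung Pao chicken"
```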
A more intricate problem was CLIP's context length limit of 77 tokens. Many of their detailed prompts—including ingredients, lighting, and positioning—exceeded this limit. To overcome this, they used a library called Compel, which chunks larger prompts, vectorizes each part separately, and then combines them for the model.
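A minimal sketch of wiring Compel into a diffusers pipeline follows; the long prompt is illustrative, and the pipeline object is assumed to be set up as in the earlier sketch.

```python
# Sketch: handling prompts longer than CLIP's 77-token limit with Compel.
# Assumes `pipe` is the StableDiffusionPipeline from the earlier sketch.
from compel import Compel

compel = Compel(
    tokenizer=pipe.tokenizer,
    text_encoder=pipe.text_encoder,
    truncate_long_prompts=False,  # chunk instead of cutting off at 77 tokens
)

long_prompt = (
    "beef burger with cheddar, lettuce, tomato, caramelised onions, brioche bun, "
    "served on a wooden board, soft natural lighting, shallow depth of field, "
    "centred composition, appetising, high-detail food photography"
)
# Compel chunks the prompt, encodes each chunk, and concatenates the embeddings.
conditioning = compel(long_prompt)
negative = compel("")
[conditioning, negative] = compel.pad_conditioning_tensors_to_same_length(
    [conditioning, negative]
)
image = pipe(
    prompt_embeds=conditioning,
    negative_prompt_embeds=negative,
    num_inference_steps=50,
).images[0]
```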
Scaling the Infrastructure and Measuring Success
With the core technical issues solved, the team scaled up their infrastructure. They made a strategic decision to migrate parts of the service to Amazon Web Services (AWS) for cost reasons, creating a multi-cloud environment. This introduced integration complexities, which were ultimately managed with a robust CI/CD pipeline built on GitHub Actions.
The new architecture, running on AWS Elastic Kubernetes Service (EKS) with KEDA for message-based autoscaling, could scale from 0 to 36 GPU nodes, enabling the generation of around 100,000 images daily. Restaurant owners could now see AI-generated suggestions directly in their app and select the best ones for their menus.
Measuring the quality of a generative model is notoriously difficult due to the absence of ground truth. The team developed an innovative benchmarking framework. They created a suite of computer vision models to score generated images based on internal product guidelines for composition, coloring, and content. This allowed them to objectively compare different model versions and configurations.
The Economics of Self-Hosting AI Models
The primary motivation for self-hosting was cost efficiency. The team conducted rigorous tests to find the optimal hardware, discovering that NVIDIA L4 GPUs offered the best cost-per-image, reducing it by 50% compared to other options and beating the price of the DALL·E API.
Further optimizations focused on reducing inference time without sacrificing quality:
- Precision: Switching from float32 to float16 precision yielded a 4x speed improvement.
- VAE: Using a Tiny Variational AutoEncoder saved an additional 0.5 seconds per image.
- Parameters: Adjusting the Classifier-Free Guidance Scale and reducing the number of generation steps from 50 to 40 brought significant time savings.
- Compilation: Applying `torch.compile` to the model produced optimized kernels for faster inference (a combined sketch of these optimizations follows this list).
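A combined sketch of these optimizations applied to a diffusers pipeline, where the Tiny VAE checkpoint and the exact guidance value are assumptions rather than the production configuration:

```python
# Sketch combining the optimisations above on a diffusers pipeline.
# The Tiny VAE checkpoint and guidance value are assumptions.
import torch
from diffusers import StableDiffusionPipeline, AutoencoderTiny

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,          # float16 instead of float32
).to("cuda")

# Tiny VAE: a distilled decoder that shaves time off the latent-to-image step.
pipe.vae = AutoencoderTiny.from_pretrained(
    "madebyollin/taesd", torch_dtype=torch.float16
).to("cuda")

# torch.compile builds optimized kernels for the U-Net on first use.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe(
    "margherita pizza on a white plate, studio lighting",
    num_inference_steps=40,             # reduced from 50
    guidance_scale=6.0,                 # tuned classifier-free guidance (illustrative)
).images[0]
```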
These efforts resulted in an 85% decrease in computation time, bringing the cost per image down to less than 0.3 cents—an 8x reduction compared to DALL·E 2. At a scale of tens of millions of images, this translated to millions of euros in savings.
Fine-Tuning for Hyper-Local Cuisine
While the model excelled at generating common items like pizzas and burgers, it failed spectacularly with exotic local dishes like 'stuffed pigeon' or 'salad with ant eggs'. The initial outputs were often bizarre and unusable. This presented an opportunity to explore fine-tuning.
The team chose full model fine-tuning over creating thousands of individual LoRAs for each local dish, as managing the latter would be too complex. Using the OneTrainer framework, they prepared a high-quality dataset of local dishes, applied image augmentations, and trained the U-Net component of the Stable Diffusion model. The fine-tuned model produced excellent results for local cuisine, and by serving multiple model versions in production, they could route requests appropriately based on the product type.
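The exact routing logic was not detailed in the talk; conceptually it can be as simple as a lookup on the product type. A hypothetical sketch, where the keyword list and model names are placeholders:

```python
# Sketch: routing generation requests between the base model and the
# fine-tuned local-cuisine model. The keyword check and model names are
# hypothetical, not Delivery Hero's actual routing logic.
LOCAL_DISH_KEYWORDS = {"stuffed pigeon", "ant eggs"}

def pick_model(product_name: str) -> str:
    name = product_name.lower()
    if any(keyword in name for keyword in LOCAL_DISH_KEYWORDS):
        return "stable-diffusion-local-cuisine"   # fine-tuned U-Net
    return "stable-diffusion-base"                # general-purpose model

print(pick_model("Stuffed pigeon with rice"))  # -> stable-diffusion-local-cuisine
```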
Ensuring Quality with an AI Safety System
Even with a powerful model, statistical anomalies can produce disturbing or nonsensical images. To protect the brand and its partners, Delivery Hero built a comprehensive AI safety system to filter out these failures before anyone saw them. This system included several components:
- Creature and People Detection: A customized vision model to flag images containing unexpected figures.
- Optical Character Recognition (OCR): An OCR model to detect and flag malformed or gibberish text.
- Composition Analysis: An object detection model to ensure the product was centered and properly framed according to guidelines.
- Color and Quality Analysis: A system to detect issues like over-blurriness or low contrast by calculating the image's Laplacian gradient.
Each image received a weighted score, and those falling below a certain threshold were automatically discarded.
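A sketch of that weighted-score idea, using the standard variance-of-Laplacian blur check from OpenCV; the weights, threshold, and stubbed detectors are illustrative placeholders rather than the production values.

```python
# Sketch of the weighted safety score. The Laplacian-variance blur check is a
# common OpenCV technique; weights, threshold, and stubbed detectors are
# illustrative placeholders.
import cv2
import numpy as np

def blur_score(image_bgr: np.ndarray) -> float:
    # Variance of the Laplacian: low values indicate an over-blurred image.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def safety_score(image_bgr: np.ndarray) -> float:
    checks = {
        "no_creatures": 1.0,        # stub for the creature/people detector
        "no_gibberish_text": 1.0,   # stub for the OCR check
        "composition_ok": 1.0,      # stub for the framing/centering model
        "sharpness_ok": float(blur_score(image_bgr) > 100.0),  # illustrative cut-off
    }
    weights = {"no_creatures": 0.4, "no_gibberish_text": 0.2,
               "composition_ok": 0.2, "sharpness_ok": 0.2}
    return sum(weights[name] * value for name, value in checks.items())

THRESHOLD = 0.8  # illustrative; images scoring below this are discarded
image = cv2.imread("generated.png")
if safety_score(image) < THRESHOLD:
    print("discard image")
```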
The Tangible Business Impact
The project was a resounding success. Over 100,000 products on the platform are now represented by AI-generated images. Most importantly, the initial hypothesis was proven correct. A/B testing revealed a 6% to 8% increase in the conversion rate from menu to cart for products that had an AI-generated image added.
Key Lessons from the Trenches
Iaroslav concluded with three critical takeaways from this journey:
- Avoid cross-cloud architectures unless absolutely necessary, as integration can cause significant delays.
- Invest time in model optimization. For large-scale inference, every second saved translates into substantial financial savings.
- Automate quality measurement first. Before you start fine-tuning, establish a robust, automated way to measure the quality of your generative AI output. You can't optimize what you can't measure.