
Google Experts Unpack AI Inference

2025-06-24 · Molly McHugh-Johnson · 5 minute read
Tags: Inference, AI, Google

Google recently unveiled Ironwood, its seventh-generation Tensor Processing Unit (TPU), specifically engineered for the demands of generative AI inference. While TPUs, the specialized chips powering AI systems, aren't a new concept at Google, Ironwood marks a significant step. It aims to propel AI systems from being merely responsive to becoming proactive. This leap is powered by inference, the critical process enabling AI systems to leverage models for generating knowledge-based outputs. To shed light on this pivotal aspect of AI, Google's senior product manager Niranjan Hira and distinguished engineer Fenghui Zhang offer their insights.

What is AI Inference?

Question: The term 'inference' generally means drawing conclusions from given information. Does this apply to AI as well?

Niranjan Hira: In a way, yes. While it's an oversimplification, it's helpful to think of inference as pattern matching. When discussing generative AI and inference, we're essentially asking: can AI models recognize patterns to predict what you're looking for? For instance, if you said 'peanut butter and ____' to an American audience, they'd likely complete it with 'jelly.' That's inference over language patterns, and AI inference works much the same way, though its scope extends far beyond completing phrases.
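To make that concrete, here is a toy Python sketch, nothing like a production model, that completes a phrase purely from counted patterns in a tiny corpus, in the spirit of the 'peanut butter and jelly' example:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the patterns a model learns from historical data.
corpus = [
    "peanut butter and jelly",
    "peanut butter and jelly",
    "peanut butter and honey",
    "salt and pepper",
]

# Count which word tends to follow each three-word context.
next_word = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(3, len(words)):
        context = tuple(words[i - 3:i])
        next_word[context][words[i]] += 1

def complete(prompt: str) -> str:
    """'Inference': match the prompt against learned patterns and predict."""
    context = tuple(prompt.split()[-3:])
    candidates = next_word.get(context)
    return candidates.most_common(1)[0][0] if candidates else "?"

print(complete("peanut butter and"))  # -> "jelly"
```

Real generative models learn far richer patterns from vastly more data, but the inference step, matching a context against learned patterns to predict what comes next, is the same idea.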

Fenghui Zhang: Broadly, inference is how we utilize a trained model to perform useful tasks. An AI model comprises parameters, architecture, and configuration—the code necessary for task execution. Inference is what allows us to take all these components and put them to practical use.
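Zhang's breakdown maps neatly onto code. In the minimal NumPy sketch below (random numbers stand in for trained values), the parameters are the weight arrays, the architecture is the forward function that wires them together, and inference is simply running that function on a new input:

```python
import numpy as np

# "Parameters": numbers learned during training (random here for illustration).
rng = np.random.default_rng(0)
params = {
    "w1": rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "w2": rng.normal(size=(8, 3)), "b2": np.zeros(3),
}

# "Architecture": the computation that wires the parameters together.
def forward(params, x):
    hidden = np.maximum(0, x @ params["w1"] + params["b1"])  # ReLU layer
    logits = hidden @ params["w2"] + params["b2"]
    return logits

# "Inference": running the trained pieces on a new input to get a prediction.
x = rng.normal(size=(1, 4))               # one new example with 4 features
prediction = forward(params, x).argmax(axis=-1)
print(prediction)                          # predicted class index
```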

How Does Inference Work in AI Models?

Question: What kinds of AI models rely on inference?

Fenghui Zhang: Deep learning AI, such as language models, image generation models, and audio models, all employ inference. They make predictions about what will 'happen' next based on patterns learned from historical data.

Niranjan Hira: Recommendation models also heavily utilize inference.

Question: Can you give an example of a recommendation model?

Fenghui Zhang: Most advertising models are recommendation models, as is the model that suggests YouTube videos. These are often termed 'traditional' or 'classical' AI—distinct from generative AI like LLMs or image/video generation models—and they have been using inference for a long time.
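For illustration, here is a heavily simplified sketch of the classic embedding-based approach used by many recommenders: score each candidate item by the dot product between a user vector and the item vectors, then recommend the top scorers. Production systems behind ads or YouTube suggestions are vastly more elaborate; this shows only the inference step.

```python
import numpy as np

rng = np.random.default_rng(1)

# Learned embeddings: one vector per user and per video (random for illustration).
user_embedding = rng.normal(size=(16,))
video_embeddings = rng.normal(size=(1000, 16))   # 1,000 candidate videos

# Inference: score every candidate against this user and recommend the top 5.
scores = video_embeddings @ user_embedding
top5 = np.argsort(scores)[::-1][:5]
print("recommended video ids:", top5)
```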

The Evolution and Improvement of Inference

Question: So, inference isn't new to AI, but it has significantly improved as AI has become more capable?

Fenghui Zhang: Precisely. Inference enables AI models not only to predict but also to classify: a model can label items based on its learned knowledge. A well-known example dates back many years, to when Google challenged an AI model to identify a cat in an image. By processing data and using inference, the model effectively taught itself what a cat looks like and picked it out.
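Classification is the same forward pass with labels attached to the outputs. A minimal sketch, assuming a model has already produced scores (logits) over three labels for one image:

```python
import numpy as np

labels = ["cat", "dog", "bird"]

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable form)."""
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

# Pretend the model's forward pass produced these logits for one image.
logits = np.array([3.1, 0.4, -1.2])
probs = softmax(logits)

# Inference as classification: pick the most probable label.
print(labels[int(probs.argmax())], float(probs.max()))  # -> cat, ~0.93
```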

Niranjan Hira: More recently, you might recall discussions a couple of years back about AI-generated images that seemed to disregard the laws of physics—incorrectly depicted hands were a common issue. Today's models are much better at rendering physics and textures, among other things. The same progress is evident in text translation. Previously, language translation was largely statistical; it was usable but not always accurate and certainly not conversational. Statistical translation paved the way for generative AI translation, which many now confidently use, even in customer-facing products. The underlying process is still inference, but the AI itself and our computational capacity have improved dramatically.

[Illustration: a hand placing a red puzzle piece, surrounded by other puzzle pieces and dashed lines.]

Measuring and Enhancing Inference

Question: Can you measure how well inference works?

Fenghui Zhang: We can, by measuring how well a model performs on specific tasks. We also use inference to evaluate and train models, making them progressively better: during training, we continuously run inference to track and improve model quality.
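As a toy illustration of that loop, the sketch below trains a one-parameter linear model by gradient descent and periodically runs inference on held-out data to measure quality as training proceeds:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: y = 3x + noise, split into training and held-out evaluation sets.
x = rng.normal(size=200)
y = 3 * x + rng.normal(scale=0.1, size=200)
x_train, y_train, x_eval, y_eval = x[:150], y[:150], x[150:], y[150:]

w = 0.0  # the single "parameter" being trained
for step in range(201):
    grad = 2 * np.mean((w * x_train - y_train) * x_train)  # MSE gradient
    w -= 0.1 * grad                                        # training update
    if step % 50 == 0:
        # Inference on held-out data to measure quality mid-training.
        eval_mse = np.mean((w * x_eval - y_eval) ** 2)
        print(f"step {step}: eval MSE {eval_mse:.4f}")
```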

Question: And because of these training setups, inference is presumably improving on industry benchmarks?

Niranjan Hira: Yes. However, there's also the crucial aspect of human perception—how much have we, as users, noticed these improvements? Generally, it's quite significant. Another area of focus for Google when working on inference is privacy. We are meticulous about what data needs to be stored for these experiences to function effectively.

Inference in Action: Google's AI

Question: What are some examples of Google AI where we can see improved inference?

Fenghui Zhang: One of the prime use cases for inference at Google is AI Overviews in Search. When you type a query, a complex system distributes it to multiple models to retrieve results. It uses inference to understand your query, determine the desired answer, and then summarizes its findings into a useful overview. Inference is also vital to much of the agentic work Google is pursuing. With AI agents, beyond asking a model to deliver information based on its inference, you can also have it perform actions for you. This represents an extension of how we previously understood inference.
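The shape of such a pipeline can be sketched schematically. Everything below is purely illustrative, with hypothetical function names; none of it corresponds to Google's actual system:

```python
def understand(query: str) -> dict:
    """Inference step 1: interpret what the user is asking for."""
    return {"topic": query.lower(), "wants": "summary"}

def retrieve(intent: dict) -> list[str]:
    """Fan the query out to retrieval systems and collect results."""
    return [f"document about {intent['topic']} #{i}" for i in range(3)]

def summarize(intent: dict, documents: list[str]) -> str:
    """Inference step 2: condense the findings into one overview."""
    return f"Overview of {intent['topic']}, drawn from {len(documents)} sources."

intent = understand("how does ai inference work")
docs = retrieve(intent)
print(summarize(intent, docs))
```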

The Future of AI Inference

Question: So, inference is getting better at using data, or knowledge, to offer answers and even take action. How else is it changing?

Fenghui Zhang: A critically important factor is cost. We are striving to make inference as affordable as possible. For instance, if we aim to make a smaller, more accessible version of Gemini available, we would work on the model's inference. This involves finding ways to alter the computation paradigm or the model's underlying code without changing its semantics—its principal task—to reduce cost. Essentially, it's about creating a smaller, more efficient version of the model so more people can access its capabilities.
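One widely used technique in this family, offered here as an illustration rather than a description of what the Gemini team does, is weight quantization: store parameters as 8-bit integers instead of 32-bit floats, cutting memory roughly four-fold while approximately preserving what the computation means:

```python
import numpy as np

rng = np.random.default_rng(3)
weights = rng.normal(size=(512, 512)).astype(np.float32)  # "full" model weights

# Quantize: map floats to int8 with a single scale factor per tensor.
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize at inference time; the result is close to the original weights.
restored = q_weights.astype(np.float32) * scale
print("bytes:", weights.nbytes, "->", q_weights.nbytes)  # 4x smaller
print("max error:", np.abs(weights - restored).max())    # small
```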

Question: How is this cost reduction achieved?

Fenghui Zhang: Optimizing hardware is one way, which is why Ironwood is set for release this year. It's designed with an inference-first approach, offering more compute power, increased memory, and optimization for specific numeric types. On the software side, we're enhancing our compilers and frameworks. Our long-term goal is for AI inference to be more efficient. We want improved quality, but with a smaller operational footprint that costs less to deliver helpful results.
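The 'specific numeric types' point is easy to see in terms of memory footprint. The sketch below compares storage per parameter at different precisions; NumPy has no bfloat16 (the 16-bit type TPUs favor), so float16 stands in:

```python
import numpy as np

# The same number of parameters stored at different precisions.
n_params = 1_000_000
for dtype in (np.float32, np.float16, np.int8):
    bytes_needed = np.zeros(n_params, dtype=dtype).nbytes
    print(f"{np.dtype(dtype).name}: {bytes_needed / 1e6:.1f} MB")

# float32: 4.0 MB, float16: 2.0 MB, int8: 1.0 MB. Less memory per parameter
# means more of the model fits close to the compute, reducing serving cost.
```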
