Machine intelligence

Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, much of our work on language, speech, translation, visual processing, ranking and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, applying learning algorithms to understand and generalize.

Machine Intelligence at Google raises deep scientific and engineering challenges, allowing us to contribute to the broader academic research community through technical talks and publications in major conferences and journals. Contrary to much of current theory and practice, the statistics of the data we observe shifts rapidly, the features of interest change as well, and the volume of data often requires enormous computation capacity. When learning systems are placed at the core of interactive services in a fast changing and sometimes adversarial environment, combinations of techniques including deep learning and statistical models need to be combined with ideas from control and game theory.

Recent Publications

VISTA: A Test-Time Self-Improving Video Generation Agent
Xuan Long Do
Hootan Nakhost
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (to appear) (2026)
Preview abstract Despite rapid advances in text-to-video (T2V) synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. To address this, we introduce VISTA, a novel multi-agent system that autonomously refines prompts to improve video generation. VISTA operates in an iterative loop, first decomposing a user's idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. To rigorously evaluate our proposed approach, we introduce MovieGen-Bench, a new benchmark of diverse single- and multi-scene video generation tasks. Experiments show that while prior methods yield inconsistent gains, VISTA consistently improves video quality, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA's outputs in 68% of comparisons. View details
Vibe Coding XR: Accelerating AI + XR Prototyping with XR Blocks and Gemini
Benjamin Hersh
Jiahao Ren
Xingyue Chen
Robert Timothy Bettridge
Faraz Faruqi
Anthony 'Xiang' Chen
Steve Toh
Google XR, Google (2026)
Preview abstract While large language models have accelerated software development through "vibe coding", prototyping intelligent Extended Reality (XR) experiences remains inaccessible due to the friction of complex game engines and low-level sensor integration. To bridge this gap, we contribute XR Blocks, an open-source, modular WebXR framework that abstracts spatial computing complexities into high-level, human-centered primitives. Building upon this foundation, we present Vibe Coding XR, an end-to-end rapid prototyping workflow that leverages LLMs to translate natural language intent directly into functional XR software. Using a web-based interface, creators can transform high-level prompts (e.g., "create a dandelion that reacts to hand") into interactive WebXR applications in under a minute. We provide a preliminary technical evaluation on a pilot dataset (VCXR60) alongside diverse application scenarios highlighting mixed-reality realism, multi-modal interaction, and generative AI integrations. By democratizing spatial software creation, this work empowers practitioners to bypass low-level hurdles and rapidly move from "idea to reality." Code and live demos are available at https://xrblocks.github.io/gem and https://github.com/google/xrblocks. View details
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
Han Zhou
Shariq Iqbal
Ivan Vulić
Anna Korhonen
International Conference on Learning Representations (ICLR) (2026)
Preview abstract Large language models (LLMs), employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with {prompts} that declare their functionality, along with the {workflows} that orchestrate interactions within a structured flow. Designing prompts and workflows for multi-agent systems is inherently complex, especially when addressing a new task. It often demands expert-level knowledge and involves significant trial and error. Gaining a deep understanding of the factors that contribute to effective multi-agent systems is essential for automating the entire process. Motivated by this, we first conduct an in-depth analysis of the design spaces for multi-agent systems, focusing on the impact of prompts, scaling the number of agents, and common types of agentic modules. Our findings reveal that top-performing systems often emerge from simpler design spaces, where prompts play a critical role in enhancing agent functionality and enabling more effective scaling. Based on the insights, we propose Multi-Agent System Search (MASS), a multi-stage optimization framework that performs the optimization in a pruned design space, with prompts and an influential subset of modules. We show that MASS-optimized multi-agent systems outperform existing alterntives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems. View details
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Yehonathan Refael
Amit Aides
Aviad Barzilai
Vered Silverman
Bolous Jaber
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops (2026), pp. 886-894
Preview abstract Open-vocabulary object detection (OVD) models offer remarkable flexibility applications by enabling object detection from arbitrary text queries. Still, the zero-shot performance of the pre-trained models is hampered by the inherent semantic ambiguity of natural language, result to low precision, leading to insufficient crucial downstream applications. For instance, in the remote sensing (RS) domain, a query for "ship" can yield varied and contextually irrelevant results. To address this, for real time applications, we propose a novel cascaded architecture that synergizes the broad capabilities of a large, pre-trained OVD model with a lightweight, few-shot classifier. Our approach utilizes the frozen weights of the zero-shot model to generate initial, high-recall object-embedding proposals, which are then refined by a compact classifier trained in real-time on a handful of user-annotated examples. The core of our contribution is an efficient one step active learning strategy for selecting the most informative samples for user annotation. Our method identifies (extremely) small amount of an uncertain candidates near the theoretical decision boundary using density estimation and then applies clustering to ensure a diverse training set. This targeted sampling enables our cascaded system to elevate performance on standard remote sensing benchmarks. Our work thus presents a practical and resource-efficient framework for adapting foundational models to specific user needs, drastically reducing annotation overhead while achieving high accuracy without costly full-model fine-tuning. View details
Preview abstract The field of Human-Computer Interaction is approaching a critical inflection point, moving beyond the era of static, deterministic systems into a new age of self-evolving systems. We introduce the concept of Adaptive generative interfaces that move beyond static artifacts to autonomously expand their own feature sets at runtime. Rather than relying on fixed layouts, these systems utilize generative methods to morph and grow in real-time based on a user’s immediate intent. The system operates through three core mechanisms: Directed synthesis (generating new features from direct commands), Inferred synthesis (generating new features for unmet needs via inferred commands), and Real-time adaptation (dynamically restructuring the interface's visual and functional properties at runtime). To empirically validate this paradigm, we executed a within-subject (repeated measures) comparative study (N=72) utilizing 'Penny,' a digital banking prototype. The experimental design employed a counterbalanced Latin Square approach to mitigate order effects, such as learning bias and fatigue, while comparing Deterministic interfaces baseline against an Adaptive generative interfaces. Participant performance was verified through objective screen-capture evidence, with perceived usability quantified using the industry-standard System Usability Scale (SUS). The results demonstrated a profound shift in user experience: the Adaptive generative version achieved a System Usability Scale (SUS) score of 84.38 ('Excellent'), significantly outperforming the Deterministic version’s score of 53.96 ('Poor'). With a statistically significant mean difference of 30.42 points (p < 0.0001) and a large effect size (d=1.04), these findings confirm that reducing 'navigation tax' through adaptive generative interfaces directly correlates with a substantial increase in perceived usability. We conclude that deterministic interfaces are no longer sufficient to manage the complexity of modern workflows. The future of software lies not in a fixed set of pre-shipped features, but in dynamic capability sets that grow, adapt, and restructure themselves in real-time to meet the specific intent of the user. This paradigm shift necessitates a fundamental transformation in product development, requiring designers to transcend traditional, linear workflows and evolve into 'System Builders'—architects of the design principles and rules that facilitate this new age of self-evolving software. View details
Preview abstract Validating conversational artificial intelligence (AI) for regulated medical software applications may present challenges, as static test datasets and manual review may be limited in identifying emergent, conversational anomalies. A multi-agent AI system may be configured in a closed-loop for automated validation. The system can, for example, utilize an end user persona simulator agent to generate prompts for a target model and a domain /regulatory expert adjudicator agent to evaluate the target model’s responses against a configurable rubric. A meta-analysis agent can analyze anomalies to identify underlying vulnerabilities, which may then be used to programmatically synthesize new adversarial personas. This adaptive process can generate evidence to support regulatory compliance and continuous performance monitoring for medical software algorithms systems. View details
×