October 29, 2025
Jon E. Froehlich, Visiting Faculty Researcher, and Shaun Kane, Research Scientist, Google Research
We introduce StreetReaderAI, a new accessible street view prototype using context-aware, real-time AI and accessible navigation controls.
Interactive streetscape tools, available today in every major mapping service, have revolutionized how people virtually navigate and explore the world, from previewing routes and inspecting destinations to remotely visiting world-class tourist locations. But to date, screen readers have not been able to interpret street view imagery, and alt text is unavailable. We now have an opportunity to use multimodal AI and image understanding to make this immersive streetscape experience inclusive for all. This could eventually allow a service like Google Street View, which has over 220 billion images spanning 110+ countries and territories, to be more accessible to people in the blind and low-vision community, offering an immersive visual experience and opening up new possibilities for exploration.
In “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI”, presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept accessible street view prototype that uses context-aware, real-time AI and accessible navigation controls. StreetReaderAI was designed iteratively by a team of blind and sighted accessibility researchers, drawing on previous work in accessible first-person gaming and navigation tools, such as Shades of Doom, BlindSquare, and SoundScape. Key capabilities include:
StreetReaderAI provides a context-aware description of the street view scene by inputting geographic information sources and the user’s current field-of-view into Gemini. For the full audio-video experience, including sound, please refer to this YouTube video.
StreetReaderAI uses Gemini Live to provide a real-time, interactive conversation about the scene and local geographic features.
StreetReaderAI offers an immersive, first-person exploration experience, much like a video game where audio is the primary interface.
StreetReaderAI provides seamless navigation through both keyboard and voice interaction. Users can explore their surroundings using the left and right arrow keys to shift their view. As the user pans, StreetReaderAI shares audio feedback, voicing the current heading as a cardinal or intercardinal direction (e.g., “Now facing: North” or “Northeast”). It also expresses whether the user can move forward and if they are currently facing a nearby landmark or place.
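The heading announcement can be illustrated with a small sketch. This is not the prototype's code: the function names and the choice of eight 45-degree sectors are assumptions made for illustration, with headings measured in degrees clockwise from north.

```python
# Hypothetical sketch: names and sector boundaries are assumptions,
# not taken from StreetReaderAI itself.
COMPASS_POINTS = [
    "North", "Northeast", "East", "Southeast",
    "South", "Southwest", "West", "Northwest",
]


def heading_to_compass(heading_degrees: float) -> str:
    """Map a heading (0 = north, clockwise) to a cardinal or intercardinal name."""
    # Normalize to [0, 360), then shift by half a sector (22.5 degrees) so that
    # each compass point sits at the center of its 45-degree sector.
    sector = int((heading_degrees % 360 + 22.5) // 45) % 8
    return COMPASS_POINTS[sector]


def announce_heading(heading_degrees: float) -> str:
    """Build the phrase that would be spoken as the user pans."""
    return f"Now facing: {heading_to_compass(heading_degrees)}"


if __name__ == "__main__":
    for heading in (0, 44, 180, 359):
        print(heading, announce_heading(heading))
```

Centering each sector on its compass point means a heading of 359 degrees is still voiced as North rather than flipping to Northwest.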
To move, the user can take “virtual steps” using the up arrow or move backward with the down arrow. As a user moves through the virtual streetscape, StreetReaderAI describes how far the user traveled and key geographic information, such as nearby places. Users can also use “jump” or “teleport” features to quickly move to new locations.
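One way to report how far a virtual step moved the user is to compute the great-circle distance between the coordinates of consecutive panoramas. The sketch below assumes each panorama exposes a latitude and longitude; the function names and the message wording are hypothetical, and the post does not say how the prototype computes distance.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def describe_step(prev: tuple, curr: tuple, nearby_places: list) -> str:
    """Hypothetical spoken summary after moving from one panorama to the next."""
    distance = haversine_m(*prev, *curr)
    message = f"Moved {round(distance)} meters."
    if nearby_places:
        message += " Nearby: " + ", ".join(nearby_places) + "."
    return message


if __name__ == "__main__":
    # Illustrative coordinates only; not locations from the study.
    print(describe_step((0.0, 0.0), (0.0001, 0.0), ["Cafe"]))
```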
The core of StreetReaderAI is its two underlying AI subsystems backed by Gemini: AI Describer and AI Chat. Both subsystems take in a static prompt and optional user profile as well as dynamic information about the user’s current location, such as nearby places, road information, and the current field-of-view image (i.e., what’s being shown in Street View).
AI Describer functions as a context-aware scene description tool that combines dynamic geographic information about the user’s virtual location along with an analysis of the current Street View image to generate a real-time audio description.
It has two modes: a “default” prompt emphasizing navigation and safety for blind pedestrians, and a “tour guide” prompt that provides additional tourism information (e.g., historic and architectural context). We also use Gemini to predict likely follow-up questions specific to the current scene and local geography that may be of interest to blind or low-vision travelers.
A diagram of how AI Describer combines multimodal data to support context-aware scene descriptions.
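To make the combination of inputs concrete, here is a minimal sketch of how a request might be assembled from a static prompt, an optional user profile, dynamic geographic context, and the current field-of-view image. Everything in it is an assumption for illustration: the prompt wording, the `GeoContext` fields, and the `build_request` function are hypothetical, and no real Gemini client call is shown.

```python
from dataclasses import dataclass, field
from typing import Optional

# Placeholder prompt text; the prototype's actual prompts are not given in the post.
STATIC_PROMPTS = {
    "default": "Describe the scene, emphasizing navigation and safety for blind pedestrians.",
    "tour_guide": "Describe the scene and add historic and architectural context.",
}


@dataclass
class GeoContext:
    """Hypothetical container for the dynamic, location-specific inputs."""
    heading: str
    nearby_places: list = field(default_factory=list)
    road_info: str = ""


def build_request(mode: str, geo: GeoContext, image_bytes: bytes,
                  user_profile: Optional[str] = None) -> list:
    """Return an ordered list of text parts followed by the image bytes."""
    if mode not in STATIC_PROMPTS:
        raise ValueError(f"Unknown mode: {mode}")
    parts = [STATIC_PROMPTS[mode]]
    if user_profile:
        parts.append(f"User profile: {user_profile}")
    parts.append(f"Current heading: {geo.heading}")
    if geo.nearby_places:
        parts.append("Nearby places: " + ", ".join(geo.nearby_places))
    if geo.road_info:
        parts.append(f"Road information: {geo.road_info}")
    # The field-of-view image goes last; a multimodal client would send it
    # alongside the text parts.
    parts.append(image_bytes)
    return parts


if __name__ == "__main__":
    geo = GeoContext(heading="North", nearby_places=["Bus stop"], road_info="Main St")
    for part in build_request("tour_guide", geo, b"<image bytes>"):
        print(part)
```

Keeping the static prompt separate from the dynamic parts lets the same assembly step serve both the "default" and "tour guide" modes.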
AI Chat builds on AI Describer but allows users to ask questions about their current view, past views, and nearby geography. The chat agent uses Google's Multimodal Live API, which supports real-time interaction and function calling, and temporarily retains memory of all interactions within a single session. We track and send each pan or movement interaction along with the user's current view and geographic context (e.g., nearby places, current heading).
What makes AI Chat so powerful is its ability to hold a temporary “memory” of the user's session: the context window is set to a maximum of 1,048,576 input tokens, roughly equivalent to more than 4,000 input images. Because AI Chat receives the user's view and location with every virtual step, it accumulates a record of where the user has been and what was visible there. A user can virtually walk past a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can recall its previous context, analyze the current geographic input, and answer, “The bus stop is behind you, approximately 12 meters away.”
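The geometry behind an answer like the bus stop example can be sketched as follows. In the prototype the answer comes from the AI agent, so this is only an illustration of the underlying calculation, assuming the user's position and heading and the place's coordinates are known. All function names, the four 90-degree direction sectors, and the sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlambda = math.radians(lon2 - lon1)
    y = math.sin(dlambda) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlambda))
    return math.degrees(math.atan2(y, x)) % 360


def relative_direction(user_heading, target_bearing):
    """Classify the target into one of four 90-degree sectors around the user."""
    offset = (target_bearing - user_heading) % 360
    if offset < 45 or offset >= 315:
        return "ahead of you"
    if offset < 135:
        return "to your right"
    if offset < 225:
        return "behind you"
    return "to your left"


def locate(name, user_pos, user_heading, target_pos):
    """Compose a spoken-style answer about where a remembered place is."""
    direction = relative_direction(user_heading, bearing_deg(*user_pos, *target_pos))
    meters = round(distance_m(*user_pos, *target_pos))
    return f"The {name} is {direction}, approximately {meters} meters away."


if __name__ == "__main__":
    # Illustrative values: the user faces north and the stop lies just to the south.
    print(locate("bus stop", (0.0, 0.0), 0.0, (-0.000108, 0.0)))
```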
To evaluate StreetReaderAI, we conducted an in-person lab study with eleven blind screen reader users. During the sessions, participants learned about StreetReaderAI and used it to explore multiple locations and evaluate potential walking routes to destinations.
A blind participant using StreetReaderAI to explore potential travel to a bus stop and inquire about bus stop features, such as the existence of benches and a shelter.
Overall, participants reacted positively to StreetReaderAI, rating its overall usefulness at a mean of 6.4 (median=7; SD=0.9) on a Likert scale from 1–7 (where 1 was ‘not at all useful’ and 7 was ‘very useful’), emphasizing the interplay between virtual navigation and AI, the seamlessness of the interactive AI Chat interface, and the value of information provided. In their qualitative feedback, participants consistently described StreetReaderAI as a significant accessibility advancement for navigation, noting that existing street view tools lack this level of accessibility. The interactive AI Chat feature was also described as making conversations about streets and places both engaging and helpful.
During the study, participants visited over 350 panoramas and made over 1,000 AI requests. Interestingly, AI Chat was used six times more often than AI Describer, indicating a clear preference for personalized, conversational inquiries. While participants found value in StreetReaderAI and adeptly combined virtual world navigation with AI interactions, there is room for improvement: participants sometimes struggled with orienting themselves, judging the veracity of AI responses, and determining the limits of AI knowledge.
In one study task, participants were given the instruction, “Find out about an unfamiliar playground to plan a trip with your two young nieces.” This video clip illustrates the diversity of questions asked and the responsiveness of StreetReaderAI.
As the first study of an accessible street view system, our research also provides the first-ever analysis of the types of questions blind people ask about streetscape imagery. We analyzed all 917 AI Chat interactions and annotated each with up to three tags drawn from an emergent list of 23 question type categories. The four most common question types included:
Because StreetReaderAI relies so significantly on AI, a critical challenge is response accuracy. Of the 816 questions that participants asked AI Chat:
Of the 32 incorrect responses:
More work is necessary to explore how StreetReaderAI performs in other contexts and beyond lab settings.
StreetReaderAI is a promising first step toward making streetscape tools accessible to all. Our study highlights what information blind users desire from and ask about streetscape imagery and the potential for multimodal AI to answer their questions.
There are several other opportunities to expand on this work:
Though a “proof-of-concept” research prototype, StreetReaderAI helps demonstrate the potential of making immersive streetscape environments accessible.
This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We thank Project Astra and the Google Geo teams for their feedback as well as our participants. Diagram icons are from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.