
Enabling physician-centered oversight for AMIE
August 12, 2025
David Stutz, Research Scientist, Google DeepMind, and Natalie Harris, Software Engineer, Google Research
We introduce guardrailed-AMIE (g-AMIE), a diagnostic AI designed for history-taking. g-AMIE operates with a guardrail that prohibits it from giving individualized medical advice, instead generating a summary for an overseeing physician to review.
Recent work demonstrated that Articulate Medical Intelligence Explorer (AMIE), our research AI system for medical reasoning and diagnostic dialogue, can provide accurate medical advice in text-based simulations of patient visits. However, individual patient diagnoses and treatment plans are regulated activities that must be reviewed and approved by licensed medical professionals before they are communicated to the patient. At the same time, oversight is an established paradigm in medicine: care team members operate with a degree of autonomy while overseeing primary care physicians (PCPs) retain accountability for the patient’s care. Inspired by this, our current research explores a framework for physician oversight of AMIE.
In “Towards physician-centered oversight of conversational diagnostic AI”, we introduce an extension of our AMIE research system, guardrailed-AMIE (g-AMIE), built as a multi-agent setup based on Gemini 2.0 Flash. g-AMIE gathers patient information through dialogue (i.e., history taking) and generates a body of information for a clinician to review: a summary of the information gathered, a proposed differential diagnosis and management plan, and a draft message to the patient. We design g-AMIE with guardrail constraints that prevent it from sharing any individualized medical advice, i.e., any diagnosis or management plan tailored to the patient. This information is reviewed, and can be edited, by an overseeing PCP through a purpose-built web interface called the clinician cockpit. Decoupling history taking from medical decision-making allows the overseeing PCP to review cases asynchronously. In a randomized, blinded, virtual objective structured clinical examination (OSCE), we compared g-AMIE’s performance with that of nurse practitioners (NPs), physician assistants/associates (PAs), and PCPs operating under the same guardrail constraints. We found that g-AMIE’s diagnostic performance and management plans were preferred by overseeing PCPs and independent physician raters, and that its patient messages were preferred by patient actors. While this represents an important milestone towards human–AI collaboration with AMIE, the results need to be interpreted with care, especially when making comparisons to clinicians: the workflow was designed for the unique characteristics of AI systems, and clinicians have not been trained to operate within it.

Asynchronous oversight framework. 1. g-AMIE as well as NP/PA and PCP control groups perform history taking within guardrails, abstaining from individualized medical advice. 2. g-AMIE and control groups generate differential diagnoses (DDx) and management plans. 3. Overseeing physician revises DDx & management plan to ensure patient safety and accountability. 4. Overseeing PCP shares a revised message with the patient. “g-PCP” and “g-NP/PA” refer to providers operating under the same guardrail constraints as g-AMIE.
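
To make the hand-off concrete, the sketch below outlines the asynchronous oversight loop in Python. All names here (CasePacket, OversightDecision, the overseeing_pcp object and its methods) are illustrative assumptions for exposition, not part of AMIE’s actual implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto


class OversightDecision(Enum):
    """Outcomes of the overseeing PCP's asynchronous review (hypothetical labels)."""
    APPROVE_MESSAGE = auto()    # (edited) patient message is released to the patient
    REQUEST_FOLLOW_UP = auto()  # a follow-up consultation is needed instead


@dataclass
class CasePacket:
    """What g-AMIE hands to the overseeing PCP after guardrailed history taking."""
    transcript: str             # dialogue with the patient, no individualized advice
    soap_note: str              # summary, proposed DDx, and management plan
    draft_patient_message: str  # only reaches the patient after PCP approval


def asynchronous_oversight(packet: CasePacket, overseeing_pcp) -> str | None:
    """One review cycle: the PCP edits the packet, then approves or defers.

    `overseeing_pcp` stands in for the human reviewer working in the
    clinician cockpit; their decision gates all patient communication.
    """
    edited = overseeing_pcp.review_and_edit(packet)      # step 3 in the figure above
    if overseeing_pcp.decide(edited) is OversightDecision.APPROVE_MESSAGE:
        return edited.draft_patient_message              # step 4: message shared with patient
    return None                                          # otherwise, schedule a follow-up consultation
```

The key design choice captured here is that g-AMIE never returns advice directly; everything patient-facing passes through the overseeing PCP.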
A clinician cockpit for oversight
To enable physician oversight, g-AMIE produces a detailed medical note that is then reviewed by the overseeing PCP using our clinician cockpit interface, which we developed in a co-design study with 10 outpatient physicians. The co-design consisted of semi-structured interviews with potential users, followed by a thematic analysis to identify crucial components; the results were then shared with a UI designer to draft the interface. The cockpit is based on the widely used SOAP note format, which includes sections for Subjective (the patient’s perspective on their condition), Objective (observable and measurable patient data, such as vital signs or lab data), Assessment (differential diagnosis with justification), and Plan (management strategy).

Our clinician cockpit, inspired by the SOAP note format.
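
For reference, the SOAP structure displayed in the cockpit can be thought of as a simple record. The field names below are our own shorthand for illustration, not the cockpit’s actual schema.

```python
from dataclasses import dataclass


@dataclass
class SOAPNote:
    """The four SOAP sections plus the draft patient message reviewed in the cockpit."""
    subjective: str       # the patient's perspective on their condition
    objective: str        # observable and measurable data, e.g., vital signs or lab data
    assessment: str       # differential diagnosis with justification
    plan: str             # proposed management strategy
    patient_message: str  # draft communication, released only after PCP approval
```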
History taking and medical note generation
For g-AMIE to respect its guardrails during history taking and generate high-quality, accurate SOAP notes, we developed a multi-agent system consisting of a dialogue agent, a guardrail agent, and a SOAP note agent. The dialogue agent aims to perform high-quality history taking in three phases: (1) general history taking, (2) targeted validation of an initial differential diagnosis, and (3) a conclusion phase that addresses patient questions. The guardrail agent verifies that each response from the dialogue agent avoids any individualized medical advice, rephrasing responses as needed. The SOAP note agent performs sequential multi-step generation, separating the summarization tasks (Subjective and Objective) from the inferential tasks (Assessment and Plan) and from the patient message generation.

A. g-AMIE performs history taking using a three-phase dialogue agent, including general history taking, validating its differential diagnosis using targeted questions, and allowing the patient to ask questions; each response is verified using a guardrail agent to ensure that g-AMIE does not provide individualized medical advice. B. For medical note generation, g-AMIE follows a sequential multi-step approach to generate all sections of the SOAP note format along with a patient message.
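
A minimal sketch of how such a multi-agent setup could be wired together is shown below. The generate helper, the prompts, and the phase handling are illustrative assumptions for exposition; they are not the actual AMIE prompts or Gemini API calls.

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to the underlying LLM (e.g., Gemini 2.0 Flash)."""
    raise NotImplementedError


def guardrailed_reply(dialogue_history: list[str], phase: str) -> str:
    """Dialogue agent drafts a reply; the guardrail agent rephrases it if needed."""
    history_text = "\n".join(dialogue_history)
    draft = generate(
        f"You are taking a patient history (current phase: {phase}). "
        f"Continue the conversation:\n{history_text}"
    )
    verdict = generate(
        "Does the reply below contain individualized medical advice, i.e., a "
        f"diagnosis or management plan tailored to this patient? Answer yes or no.\n{draft}"
    )
    if verdict.strip().lower().startswith("yes"):
        draft = generate(
            "Rephrase this reply so that it gathers information without giving "
            f"any individualized medical advice:\n{draft}"
        )
    return draft


def write_soap_note(transcript: str) -> dict[str, str]:
    """Sequential multi-step generation: summarize first, then infer, then draft the message."""
    note: dict[str, str] = {}
    note["subjective"] = generate(f"Summarize the patient's own account:\n{transcript}")
    note["objective"] = generate(f"Summarize objective findings mentioned:\n{transcript}")
    # Inferential sections receive the earlier sections as context.
    note["assessment"] = generate(
        f"Propose a differential diagnosis with justification:\n{transcript}\n{note}"
    )
    note["plan"] = generate(f"Propose a management plan:\n{transcript}\n{note}")
    note["patient_message"] = generate(
        "Draft a message to the patient summarizing the visit and next steps, "
        f"to be reviewed and approved by a physician before sending:\n{note}"
    )
    return note
```

The separation in write_soap_note mirrors the idea described above: summarization (Subjective, Objective) is kept apart from inference (Assessment, Plan) and from drafting the patient message.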
Randomized OSCE study
We conducted a virtual OSCE study to contextualize g-AMIE’s performance across 60 case scenarios developed by an OSCE laboratory. The study included patient actors, g-AMIE, and two control groups: PCPs with 5 years or less of independent practice experience, and a group of NP/PAs. Both control groups were instructed to follow the same guardrails as g-AMIE and to draft SOAP notes and patient messages after their history taking. We refer to these control groups as guardrail PCPs and guardrail NP/PAs. In a second step, a group of overseeing PCPs, each with at least 5 years of experience including experience supervising healthcare team members, reviewed the SOAP notes, made edits, and either approved the (edited) patient message or decided that a follow-up consultation was needed. Finally, independent physicians rated the overall quality of the transcripts, SOAP notes, and the overseeing PCPs’ decisions. Patient actors, overseeing PCPs, and independent raters were blinded as to whether history taking had been performed by g-AMIE or a control group. We extended evaluation rubrics from previous work, evaluating SOAP note quality using a modified QNote rubric and adding oversight-specific rubrics.

Our virtual OSCE study with oversight follows three steps: A. Patient actors interact with g-AMIE or a control group during history taking. SOAP notes are generated or obtained through a questionnaire after history taking. B. Transcripts and SOAP notes are displayed in our clinician cockpit for overseeing PCPs to make edits and to decide on whether or not to approve the (edited) patient message. C. Independent experts rate the quality of the consultations, including oversight edits and decisions, using various evaluation rubrics.
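
The review side of the study hinges on blinding: overseeing PCPs and independent raters see only the case content, not who produced it. A rough sketch of that packaging is below; the labels and helpers are our own assumptions, not the study’s actual tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ConsultArm(Enum):
    """Who performed the guardrailed history taking for a given case."""
    G_AMIE = auto()
    G_PCP = auto()
    G_NP_PA = auto()


@dataclass
class BlindedCase:
    """Content shown to overseeing PCPs and independent raters, without the arm label."""
    case_id: str
    transcript: str
    soap_note: str
    draft_patient_message: str


def blind(case_id: str, arm: ConsultArm, transcript: str, soap_note: str,
          draft_patient_message: str) -> tuple[BlindedCase, ConsultArm]:
    """Separate the content given to raters from the arm label used in analysis."""
    shown = BlindedCase(case_id, transcript, soap_note, draft_patient_message)
    return shown, arm  # the label is re-joined with ratings only after review
```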

In our virtual OSCE study, g-AMIE outperforms groups of PCPs, NPs and PAs operating within the same guardrailed workflow (g-PCPs, g-NP/PAs).
Evaluation
History taking, SOAP notes, and oversight experience
We observed that g-AMIE successfully performs history taking within guardrails, with no consultation rated as definitely containing individualized medical advice. Moreover, g-AMIE’s ability to elicit key information from the patient was rated higher than that of the control groups, and its SOAP notes were rated as more complete, accurate, and readable. Overseeing PCPs preferred overseeing g-AMIE and were more likely to accept patient messages crafted by g-AMIE.

A. History taking safety and quality as assessed by independent physicians, showing that g-AMIE follows guardrails and performs high-quality history taking. B. Completeness and accuracy of generated SOAP notes and patient messages. C. Oversight actions and experience as rated by the overseeing PCPs, showing that g-AMIE’s patient messages are accepted in the vast majority of cases and that overseeing PCPs prefer the experience with g-AMIE over either control group. Asterisks indicate statistical significance (*: p<0.05, **: p<0.01, ***: p<0.001; n.s.: not significant).
Diagnostic quality and composite performance
We found that g-AMIE’s history taking and medical notes led to differential diagnoses and management plans that were rated as more appropriate. Moreover, raters indicated that g-AMIE enabled overseeing PCPs to make an appropriate decision on patient follow-up more often, with medical notes that were more often rated as sufficient for downstream care.

A. g-AMIE produces DDx and management plans that are more appropriate compared to the control groups. B. Evaluation of composite performance, showing that g-AMIE leads to more appropriate decisions by the overseeing PCP (o-PCP) and produces SOAP notes and patient messages that are more often rated as sufficient for downstream patient care. Asterisks indicate statistical significance (*: p<0.05, **: p<0.01, ***: p<0.001; n.s.: not significant).
Discussion and limitations
While g-AMIE is able to follow guardrails in the vast majority of cases, there are caveats and nuances in classifying individualized medical advice. Our results are based on a single rating per case, even though we observed significant disagreement among raters in previous studies. Moreover, the comparison to both control groups should not be taken as commentary on their ability to follow the supplied guardrails; PCPs in particular are not used to withholding medical advice in consultations. Considerable further development of AI oversight paradigms in real-world settings is required to ensure generalization of our proposed framework.
While g-AMIE’s SOAP notes included confabulations in a few cases, we found that such confabulations occur at a similar rate as misremembering by both guardrail PCPs and guardrail NP/PAs. It is noteworthy, however, that g-AMIE’s notes are considerably more verbose, which leads to longer oversight times and a higher rate of edits focused on reducing verbosity. In interviews with overseeing PCPs, we also found that oversight is mentally demanding, which is consistent with prior work on cognitive load of AI-assisted decision support systems.
On the other hand, during history taking, we believe this verbosity contributes to g-AMIE’s higher ratings for how information is explained and rapport is built. Patient actors and independent physicians preferred g-AMIE’s patient messages and its demonstration of patient empathy. These findings highlight the need for future work on finding the right verbosity trade-off across history taking, medical notes, and patient messages.
We also found that NPs and PAs consistently outperform PCPs in history-taking quality, guardrail adherence, and diagnostic quality. However, these differences should not be extrapolated to meaningful indicators of relative performance in the real world. The tested workflow was designed to explore a paradigm of AI oversight, and both control groups are provided primarily to contextualize g-AMIE’s performance. Neither group received specific training for this workflow, which does not account for several real-world professional needs and therefore likely underestimates clinicians’ capabilities significantly. Moreover, the recruited NPs and PAs had more experience and may be more familiar with patient-focused history taking. PCPs, in contrast, are trained to explicitly link history taking to the diagnostic process, using questions for direct hypothesis testing, and the proposed workflow would likely have significantly impacted their consultation performance.
Finally, patient actors cannot act as an exact substitute for real patients, especially in combination with our 60 constructed scenario packs. While these cover a range of conditions and demographics, they are not representative of real clinical practice.
Conclusion
We introduce a paradigm for asynchronous oversight of conversational diagnostic AI systems such as AMIE. While preserving its conversational capabilities, AMIE can operate within guardrails, performing history taking without providing individualized medical advice. The latter, including diagnosis and management planning, is deferred to an overseeing physician. This disentangles history taking from decision-making, helping ensure patient safety while the overseeing physician remains accountable. In a virtual, randomized OSCE study, we show that our system, termed guardrailed-AMIE, can perform high-quality history taking and medical note generation, and leads to better overall diagnostic decisions than PCPs, NPs, and PAs operating under the same guardrails. Our results should not be interpreted to mean that g-AMIE is superior to clinicians, who have not been trained in this workflow. Nevertheless, our work marks a significant step towards a framework for responsible and scalable use of conversational diagnostic AI systems in healthcare.
Acknowledgements
The research described here is joint work across many teams at Google Research and Google DeepMind. We are grateful to all our co-authors: Elahe Vedadi, David Barrett, Natalie Harris, Ellery Wulczyn, Shashir Reddy, Roma Ruparel, Mike Schaekermann, Tim Strother, Ryutaro Tanno, Yash Sharma, Jihyeon Lee, Cian Hughes, Dylan Slack, Anil Palepu, Jan Freyberg, Khaled Saab, Valentin Liévin, Wei-Hung Weng, Tao Tu, Yun Liu, Nenad Tomasev, Kavita Kulkarni, S. Sara Mahdavi, Kelvin Guu, Joelle Barral, Dale R. Webster, James Manyika, Avinatan Hassidim, Katherine Chou, Yossi Matias, Pushmeet Kohli, Adam Rodman, Vivek Natarajan, Alan Karthikesalingam, and David Stutz.
Labels:
- Generative AI
- Health & Bioscience