Max McKinnon

Authored Publications
    Voice activity detection (VAD) plays a vital role in enabling applications such as speech recognition. We analyze the impact of window size on the accuracy of three VAD algorithms, Silero, WebRTC, and Root Mean Square (RMS), across a set of diverse real-world digital audio streams. We additionally explore the use of hysteresis on top of each VAD output. Our results offer practical references for optimizing VAD systems. Silero significantly outperforms WebRTC and RMS, and hysteresis provides a benefit for WebRTC.
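The hysteresis post-processing mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and thresholds are hypothetical:

```python
def apply_hysteresis(scores, on_threshold=0.6, off_threshold=0.4):
    """Turn per-frame speech probabilities into binary decisions with
    hysteresis: entering the speech state requires the score to exceed
    on_threshold, but once in the speech state, it is kept until the
    score falls below the lower off_threshold. This suppresses rapid
    on/off flicker around a single threshold."""
    speech = False
    decisions = []
    for s in scores:
        if speech:
            speech = s >= off_threshold  # stay on until score drops low
        else:
            speech = s > on_threshold    # turn on only on a strong score
        decisions.append(speech)
    return decisions

# The middle frames (0.5, 0.45) stay "speech" even though they are
# below the on-threshold, because hysteresis keeps the state on.
print(apply_hysteresis([0.1, 0.7, 0.5, 0.45, 0.3, 0.2]))
# -> [False, True, True, True, False, False]
```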
    The ability of large language models (LLMs) to recall and retrieve information from long contexts is critical for many real-world applications. Prior work (Liu et al., 2023) reported that LLMs suffer significant drops in retrieval accuracy for facts placed in the middle of large contexts, an effect known as "Lost in the Middle" (LITM). We find the model Gemini 2.5 Flash can answer needle-in-a-haystack questions with great accuracy regardless of document position, including when the document is nearly at the input context limit. Our results suggest that the "Lost in the Middle" effect is not present for simple factoid Q&A in Gemini 2.5 Flash, indicating substantial improvements in long-context retrieval.
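A needle-in-a-haystack probe of the kind described above can be sketched as follows. This is a generic illustration of the evaluation setup, not the paper's exact harness; the function name and filler text are hypothetical:

```python
def build_haystack(needle, filler_sentence, total_sentences, needle_position):
    """Embed a 'needle' fact at a relative position (0.0 = start,
    1.0 = end) inside a long run of filler sentences. The resulting
    prompt is given to the model along with a question answerable
    only from the needle, and accuracy is measured as a function of
    needle_position and context length."""
    idx = int(needle_position * total_sentences)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(idx, needle)
    return " ".join(sentences)

# Place the fact halfway through 1,000 filler sentences.
prompt = build_haystack(
    needle="The secret code is 4217.",
    filler_sentence="The sky was a pleasant shade of blue that day.",
    total_sentences=1000,
    needle_position=0.5,
)
```

Sweeping `needle_position` over a grid (e.g. 0.0, 0.1, ..., 1.0) at several context lengths yields the position-versus-accuracy curves used to test for the LITM effect.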
    Bone Conducted Signal Guided Speech Enhancement For Voice Assistant on Earbuds
    Jens Heitkaemper
    Joe Caroselli
    Nathan Howard
    ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE
    In this work we present a multi-modal, streaming enhancement network to improve speech recognition for voice assistants. The proposed model is guided by the bone conducted signal (BCS) to separate the interfering sources from the target speaker signal. We trained the model on a simulated speech enhancement training set with a simulated BCS and finetuned it on a small earbuds-specific training set consisting of less than 7 hours of speech. To account for distorted BCS, the enhancement module is complemented by a voice activity-based decision to discard the enhanced output for BCS without speech information. A possibility to preprocess the BCS to account for the low-pass characteristic of the bone conduction is evaluated to lower the required transmission bandwidth from the earbuds to the recognition device. The results show that the BCS bandwidth can be reduced to 500 Hz with only a small loss in word error rate (WER). The systems with and without bandwidth reduction are compared to a state-of-the-art multi-channel enhancement method on a realistic test set and outperform the multi-channel model for most of the considered sets.
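Band-limiting the BCS as described above can be sketched with an ideal FFT low-pass. This is a minimal illustration assuming a 16 kHz sample rate; the paper's actual filter design and preprocessing are not specified here:

```python
import numpy as np

def band_limit(signal, sample_rate, cutoff_hz=500):
    """Zero out all spectral content above cutoff_hz (an ideal
    brick-wall low-pass via the real FFT), simulating transmission
    of only the lowest cutoff_hz of the bone conducted signal."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# Example: a 200 Hz tone (below the cutoff) survives,
# while a 2 kHz tone is removed.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 2000 * t)
y = band_limit(x, sr, cutoff_hz=500)
```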