Overlap-Native Speaker Diarization
A Foundation for Wearable Audio-Visual Memory Systems
Overlapping speech — two or more people speaking simultaneously — remains a critical failure mode in speaker diarization and transcription systems. When overlap occurs, conventional pipelines drop words, truncate turns, or misattribute content to the wrong speaker, fundamentally undermining trust in any downstream automation.
This paper presents Sonal's technical approach: overlap-native speaker diarization designed for a wearable device that captures everyday life conversations — not just meetings, but walks, calls, family moments, errands, worship, and spontaneous interactions.
We survey established overlap-aware techniques including End-to-End Neural Diarization (EEND), which formulates diarization as multi-label classification, and Target-Speaker Voice Activity Detection (TS-VAD), which estimates per-speaker activity conditioned on speaker embeddings. We propose a reproducible baseline architecture built on open foundations, and outline an evaluation methodology that reports performance separately on overlap versus non-overlap regions.
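To make the contrast with single-label diarization concrete, the following is a minimal sketch of EEND-style multi-label decoding and of splitting frames into overlap and non-overlap regions for separate scoring. The posterior values and the 0.5 threshold are illustrative assumptions, not outputs of any real model.

```python
import numpy as np

# EEND-style output: per-frame, per-speaker activity posteriors
# (T frames x S speakers). Values are hypothetical placeholders;
# a trained model would produce them from audio features.
posteriors = np.array([
    [0.9, 0.1],   # speaker 0 active
    [0.8, 0.7],   # both active -> overlap frame
    [0.2, 0.9],   # speaker 1 active
    [0.1, 0.1],   # silence
])

# Multi-label decoding: each speaker is thresholded independently,
# so several speakers may be active in the same frame (unlike
# single-label clustering, which assigns one speaker per frame).
activity = posteriors > 0.5            # boolean matrix of shape (T, S)

# Region masks used to report metrics separately, as described above.
speakers_per_frame = activity.sum(axis=1)
overlap_mask = speakers_per_frame >= 2      # two or more active speakers
single_mask = speakers_per_frame == 1      # exactly one active speaker

print("overlap frames:", int(overlap_mask.sum()))
print("single-speaker frames:", int(single_mask.sum()))
```

Scoring overlap and non-overlap regions with separate masks like these is what allows an evaluation to expose failure modes that an aggregate error rate would average away.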
The framework is the foundation of Presence AI, our overlap-native LLM, and the prerequisite for every Sonal memory product. Everything downstream — summaries, reports, actions — inherits the accuracy of this first layer.