Overlap-Native Speaker Diarization
A Foundation for Wearable Audio-Visual Memory Systems
Overlapping speech — two or more people speaking simultaneously — remains a critical failure mode in speaker diarization and transcription systems. When overlap occurs, conventional pipelines drop words, truncate turns, or misattribute content to the wrong speaker, fundamentally undermining trust in any downstream automation.
This paper presents Sonal's technical approach: overlap-native speaker diarization designed for a wearable device that captures everyday life conversations — not just meetings, but walks, calls, family moments, errands, worship, and spontaneous interactions.
We survey established overlap-aware techniques including End-to-End Neural Diarization (EEND), which formulates diarization as multi-label classification, and Target-Speaker Voice Activity Detection (TS-VAD), which estimates per-speaker activity conditioned on speaker embeddings. We propose a reproducible baseline architecture built on open foundations, and outline an evaluation methodology that reports performance separately on overlap versus non-overlap regions.
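To make the contrast with single-label diarization concrete, the following is a minimal sketch of EEND-style multi-label decoding and of splitting frames into overlap and non-overlap regions for separate scoring. The posterior values and the 0.5 threshold are illustrative assumptions, not outputs of any real model.

```python
import numpy as np

# EEND-style output: per-frame, per-speaker activity posteriors
# (T frames x S speakers). Values are hypothetical placeholders;
# a trained model would produce them from audio features.
posteriors = np.array([
    [0.9, 0.1],   # speaker 0 active
    [0.8, 0.7],   # both active -> overlap frame
    [0.2, 0.9],   # speaker 1 active
    [0.1, 0.1],   # silence
])

# Multi-label decoding: each speaker is thresholded independently,
# so several speakers may be active in the same frame (unlike
# single-label clustering, which assigns one speaker per frame).
activity = posteriors > 0.5            # boolean matrix of shape (T, S)

# Region masks used to report metrics separately, as described above.
speakers_per_frame = activity.sum(axis=1)
overlap_mask = speakers_per_frame >= 2      # two or more active speakers
single_mask = speakers_per_frame == 1      # exactly one active speaker

print("overlap frames:", int(overlap_mask.sum()))
print("single-speaker frames:", int(single_mask.sum()))
```

Scoring overlap and non-overlap regions with separate masks like these is what allows an evaluation to expose failure modes that an aggregate error rate would average away.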
The framework is the foundation of Presence AI, our overlap-native LLM, and the prerequisite for every Sonal memory product. Everything downstream — summaries, reports, actions — inherits the accuracy of this first layer.