Multimodal Speaker Separation in Audio-Visual Scene using Spherical Microphone Array

Environment Baseline
Click Play to hear the unseparated Ground Truth mixture of all speakers.
Meeting room environment
Reconstructed Scene after Multimodal Speaker Separation
Click on any speaker's body to hear their cleanly separated audio stream.
Ground Truth Reference
Ground Truth Mixture
Speaker 1 Ground Truth
Speaker 2 Ground Truth
Speaker 3 Ground Truth
Convolution with Room Impulse Response (RIR)
Speaker 1
Speaker 2
Speaker 3
Input Mixture to Proposed System
Beamformer Output
Towards Speaker 1
Towards Speaker 2
Towards Speaker 3
Final Separated Output
Speaker 1 Extracted
Speaker 2 Extracted
Speaker 3 Extracted

System Architecture

Abstract — Separating overlapping speakers in meeting-room environments remains challenging due to reverberation, source ambiguity, and dynamic speaker interactions. Audio-only approaches degrade under adverse acoustic conditions, while existing audio-visual methods predominantly rely on explicit geometric calibration or depth estimation. This paper proposes a geometry-aware audio-visual speaker separation framework that addresses these limitations by integrating 360-degree visual perception with spherical microphone array (SMA) processing. A co-located 360-degree camera and SMA provide natural audio-visual alignment without explicit calibration. The visual front-end applies YOLOv8l-pose to equirectangular video to extract speaker directions from nose-keypoint centroids, with a confidence-gated bounding-box fallback. These directions steer a null-steering beamformer in three-dimensional space, followed by SepFormer-based neural speech separation. Visual direction extraction on the STARSS23 dataset achieves a mean angular error of 2.218°, with 96.62% of detections within 5°. The full pipeline achieves an average SDR of 18.27 dB and a PESQ of 3.093 in the multi-speaker case.
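
The mapping from equirectangular pixels to steering directions can be illustrated with a short sketch. This is a minimal illustration rather than the paper's implementation: the azimuth/elevation convention, the keypoint index, and the confidence threshold (NOSE_CONF_THRESHOLD) are assumptions, and the nose-keypoint centroid is approximated here by a single nose keypoint per detection.

```python
# Assumed values for illustration; the paper's exact gate and keypoint handling may differ.
NOSE_KEYPOINT_INDEX = 0      # COCO layout used by YOLOv8-pose: keypoint 0 is the nose
NOSE_CONF_THRESHOLD = 0.5    # hypothetical confidence gate before falling back to the box

def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (azimuth, elevation) in degrees.

    Assumes azimuth spans [-180, 180) across the image width and elevation
    spans [+90, -90] from top to bottom. Because the 360-degree camera and
    the SMA are co-located, the same angles can steer the beamformer directly.
    """
    azimuth = (u / width) * 360.0 - 180.0
    elevation = 90.0 - (v / height) * 180.0
    return azimuth, elevation

def speaker_direction(keypoints_xy, keypoints_conf, bbox_xyxy, width, height):
    """Nose keypoint with a confidence-gated bounding-box-centre fallback."""
    if keypoints_conf[NOSE_KEYPOINT_INDEX] >= NOSE_CONF_THRESHOLD:
        u, v = keypoints_xy[NOSE_KEYPOINT_INDEX]
    else:
        x1, y1, x2, y2 = bbox_xyxy
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return pixel_to_direction(u, v, width, height)
```

With the Ultralytics YOLOv8 API, keypoints_xy and keypoints_conf would come from results[0].keypoints.xy and results[0].keypoints.conf, and bbox_xyxy from results[0].boxes.xyxy; these attribute names reflect the Ultralytics library, not the paper's code.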
System Block Diagram

Block diagram of the proposed multimodal speaker separation framework.
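
To make the null-steering stage of the diagram concrete, the sketch below implements a narrowband, free-field (plane-wave) null-steering beamformer in the STFT domain: a distortionless constraint toward the target direction and nulls toward the interfering speakers. It is a simplified stand-in for the paper's spherical-microphone-array processing; the microphone geometry, STFT settings, and diagonal loading are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

C_SOUND = 343.0  # speed of sound in m/s

def unit_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing toward a source at the given azimuth/elevation (degrees)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def steering_vector(mic_pos, direction, freq):
    """Free-field plane-wave steering vector for one frequency; shape (M,)."""
    delays = mic_pos @ direction / C_SOUND          # per-microphone arrival-time advance
    return np.exp(2j * np.pi * freq * delays)

def null_steering_weights(mic_pos, target_dir, interferer_dirs, freq):
    """Unit response toward the target, nulls toward each interferer."""
    cols = [steering_vector(mic_pos, target_dir, freq)]
    cols += [steering_vector(mic_pos, d, freq) for d in interferer_dirs]
    C = np.stack(cols, axis=1)                      # constraint matrix, (M, 1 + n_interferers)
    g = np.zeros(C.shape[1])
    g[0] = 1.0                                      # distortionless target, zeros at interferers
    reg = 1e-3 * np.eye(C.shape[1])                 # diagonal loading for low-frequency stability
    # Minimum-norm weights satisfying C^H w = g:  w = C (C^H C)^{-1} g
    return C @ np.linalg.solve(C.conj().T @ C + reg, g)

def null_steering_beamformer(mics, fs, target_dir, interferer_dirs, mic_pos):
    """mics: (M, T) multichannel time signal -> (T',) beamformed signal toward the target."""
    f_axis, _, X = stft(mics, fs=fs, nperseg=1024)  # X: (M, n_freqs, n_frames)
    Y = np.zeros(X.shape[1:], dtype=complex)
    for k, f in enumerate(f_axis):
        w = null_steering_weights(mic_pos, target_dir, interferer_dirs, f)
        Y[k] = w.conj() @ X[:, k, :]                # y = w^H x per time-frequency bin
    _, y = istft(Y, fs=fs, nperseg=1024)
    return y

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    mic_pos = rng.normal(scale=0.04, size=(32, 3))  # placeholder 32-microphone geometry (metres)
    mics = rng.standard_normal((32, fs))            # 1 s of placeholder multichannel audio
    target = unit_vector(30.0, 10.0)                # e.g. direction of Speaker 1
    nulls = [unit_vector(-75.0, 5.0), unit_vector(140.0, 0.0)]
    y = null_steering_beamformer(mics, fs, target, nulls, mic_pos)
```

In the full system, the per-speaker beamformer outputs would then be passed to the SepFormer-based separation stage to produce the final extracted streams.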