Multimodal Speaker Separation in Audio-Visual Scene using Spherical Microphone Array

Environment Baseline
Click Play to hear the unseparated Ground Truth mixture of all speakers.
Meeting room environment
Reconstructed Scene after Multimodal Speaker Separation
Click on any speaker's body to hear their cleanly separated audio stream.
Ground Truth Reference
Ground Truth Mixture
Speaker 1 Ground Truth
Speaker 2 Ground Truth
Speaker 3 Ground Truth
Convolution with Room Impulse Response (RIR)
Speaker 1
Speaker 2
Speaker 3
Input Mixture to Proposed System
Beamformer Output
Towards Speaker 1
Towards Speaker 2
Towards Speaker 3
Final Separated Output
Speaker 1 Extracted
Speaker 2 Extracted
Speaker 3 Extracted

System Architecture

Abstract — Separating overlapping speakers in meeting-room environments remains challenging due to reverberation, source ambiguity, and dynamic speaker interactions. Audio-only approaches degrade under adverse acoustic conditions, while existing audio-visual methods predominantly rely on explicit geometric calibration or depth estimation. This paper proposes a geometry-aware audio-visual speaker separation framework that addresses these limitations by integrating 360-degree visual perception with spherical microphone array (SMA) processing. A co-located 360-degree camera and SMA provide natural audio-visual alignment without explicit calibration. The visual front-end applies YOLOv8l-pose to equirectangular video to extract speaker directions from nose-keypoint centroids, with a confidence-gated bounding-box fallback. These directions steer a null-steering beamformer in three-dimensional space, followed by SepFormer-based neural speech separation. Visual direction extraction on the STARSS23 dataset achieves a mean angular error of 2.218°, with 96.62% of detections within 5°. The full pipeline achieves an average SDR of 18.27 dB and a PESQ of 3.093 in the multi-speaker case.
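
The mapping from equirectangular pixels to steering directions can be illustrated with a short sketch. This is a minimal illustration rather than the paper's implementation: the azimuth/elevation convention, the keypoint index, and the confidence threshold (NOSE_CONF_THRESHOLD) are assumptions, and the nose-keypoint centroid is approximated here by a single nose keypoint per detection.

```python
# Assumed values for illustration; the paper's exact gate and keypoint handling may differ.
NOSE_KEYPOINT_INDEX = 0      # COCO layout used by YOLOv8-pose: keypoint 0 is the nose
NOSE_CONF_THRESHOLD = 0.5    # hypothetical confidence gate before falling back to the box

def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (azimuth, elevation) in degrees.

    Assumes azimuth spans [-180, 180) across the image width and elevation
    spans [+90, -90] from top to bottom. Because the 360-degree camera and
    the SMA are co-located, the same angles can steer the beamformer directly.
    """
    azimuth = (u / width) * 360.0 - 180.0
    elevation = 90.0 - (v / height) * 180.0
    return azimuth, elevation

def speaker_direction(keypoints_xy, keypoints_conf, bbox_xyxy, width, height):
    """Nose keypoint with a confidence-gated bounding-box-centre fallback."""
    if keypoints_conf[NOSE_KEYPOINT_INDEX] >= NOSE_CONF_THRESHOLD:
        u, v = keypoints_xy[NOSE_KEYPOINT_INDEX]
    else:
        x1, y1, x2, y2 = bbox_xyxy
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return pixel_to_direction(u, v, width, height)
```

With the Ultralytics YOLOv8 API, keypoints_xy and keypoints_conf would come from results[0].keypoints.xy and results[0].keypoints.conf, and bbox_xyxy from results[0].boxes.xyxy; these attribute names reflect the Ultralytics library, not the paper's code.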
System Block Diagram

Block diagram of the proposed multimodal speaker separation framework.
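
To make the null-steering stage of the diagram concrete, the sketch below implements a narrowband, free-field (plane-wave) null-steering beamformer in the STFT domain: a distortionless constraint toward the target direction and nulls toward the interfering speakers. It is a simplified stand-in for the paper's spherical-microphone-array processing; the microphone geometry, STFT settings, and diagonal loading are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

C_SOUND = 343.0  # speed of sound in m/s

def unit_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing toward a source at the given azimuth/elevation (degrees)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def steering_vector(mic_pos, direction, freq):
    """Free-field plane-wave steering vector for one frequency; shape (M,)."""
    delays = mic_pos @ direction / C_SOUND          # per-microphone arrival-time advance
    return np.exp(2j * np.pi * freq * delays)

def null_steering_weights(mic_pos, target_dir, interferer_dirs, freq):
    """Unit response toward the target, nulls toward each interferer."""
    cols = [steering_vector(mic_pos, target_dir, freq)]
    cols += [steering_vector(mic_pos, d, freq) for d in interferer_dirs]
    C = np.stack(cols, axis=1)                      # constraint matrix, (M, 1 + n_interferers)
    g = np.zeros(C.shape[1])
    g[0] = 1.0                                      # distortionless target, zeros at interferers
    reg = 1e-3 * np.eye(C.shape[1])                 # diagonal loading for low-frequency stability
    # Minimum-norm weights satisfying C^H w = g:  w = C (C^H C)^{-1} g
    return C @ np.linalg.solve(C.conj().T @ C + reg, g)

def null_steering_beamformer(mics, fs, target_dir, interferer_dirs, mic_pos):
    """mics: (M, T) multichannel time signal -> (T',) beamformed signal toward the target."""
    f_axis, _, X = stft(mics, fs=fs, nperseg=1024)  # X: (M, n_freqs, n_frames)
    Y = np.zeros(X.shape[1:], dtype=complex)
    for k, f in enumerate(f_axis):
        w = null_steering_weights(mic_pos, target_dir, interferer_dirs, f)
        Y[k] = w.conj() @ X[:, k, :]                # y = w^H x per time-frequency bin
    _, y = istft(Y, fs=fs, nperseg=1024)
    return y

if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    mic_pos = rng.normal(scale=0.04, size=(32, 3))  # placeholder 32-microphone geometry (metres)
    mics = rng.standard_normal((32, fs))            # 1 s of placeholder multichannel audio
    target = unit_vector(30.0, 10.0)                # e.g. direction of Speaker 1
    nulls = [unit_vector(-75.0, 5.0), unit_vector(140.0, 0.0)]
    y = null_steering_beamformer(mics, fs, target, nulls, mic_pos)
```

In the full system, the per-speaker beamformer outputs would then be passed to the SepFormer-based separation stage to produce the final extracted streams.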