
ICMI2008 (International Conference on Multimodal Interfaces 2008)

 

A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization

 

K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato
NTT Communication Science Laboratories

 

[Abstract]

This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e. “who is looking at whom”, in addition to speaker diarization, i.e. “who is speaking and when”. First, a novel tabletop sensing device for roundtable meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people’s faces are estimated by STCTracker (Sparse Template Condensation Tracker); it realizes realtime robust tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by a VAD (Voice Activity Detection) and a DOA (Direction of Arrival) estimation followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
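
As an illustration of what “who is looking at whom” can mean once face positions and poses are available, here is a minimal geometric sketch. It is not the method used in the paper: it simply assumes each face is reduced to a 2-D position and a head yaw angle in a common table-plane coordinate frame, and picks the other participant whose bearing lies closest to that yaw. The function name `estimate_vfoa`, the coordinate convention, and the `max_angle_deg` threshold are all hypothetical choices made for this example.

```python
import math


def estimate_vfoa(positions, yaws, max_angle_deg=30.0):
    """Return, for each person, the index of the person they appear to look at.

    positions: list of (x, y) face positions in a shared plane (assumed units).
    yaws: list of head yaw angles in radians, in the same coordinate frame.
    max_angle_deg: hypothetical tolerance; beyond it, the result is None.
    """
    targets = []
    for i, ((xi, yi), yaw) in enumerate(zip(positions, yaws)):
        best, best_err = None, math.radians(max_angle_deg)
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            # Direction from person i to person j.
            bearing = math.atan2(yj - yi, xj - xi)
            # Absolute angular difference, wrapped into [0, pi].
            diff = bearing - yaw
            err = abs(math.atan2(math.sin(diff), math.cos(diff)))
            if err < best_err:
                best, best_err = j, err
        targets.append(best)
    return targets


# Three people: person 0 faces person 1, persons 1 and 2 both face person 0.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
yaws = [0.0, math.pi, -math.pi / 2]
print(estimate_vfoa(positions, yaws))  # [1, 0, 0]
```

A nearest-bearing rule like this ignores gaze direction relative to head pose and any temporal smoothing, so it should be read only as a way to make the input and output of VFOA estimation concrete.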

 

[Download]

Poster PDF(0.32MB)

Paper PDF(3.3MB)

© ACM, (2008). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICMI'08, October 20-22, 2008, Chania, Crete, Greece. Copyright 2008 ACM 978-1-60558-198-9/08/10

 

[Movies]

Movie of Fig.2 (wmv file) 1.9MB

 

Movie of Fig.2 (wmv file) 3.1MB
Another view (tracking results)

 

Movie of Fig.7 (wmv)

 

Fig.8(a).wmv

Fig.8(b).wmv

Fig.8(c)(d).wmv

 

For more information, visit our page on the Realtime Multimodal System for Conversation Scene Analysis.

 

Any reproduction, modification, distribution, or republication of materials contained on this Web site, without prior explicit permission of the copyright holder, is strictly prohibited.

 

All rights reserved, Copyright(C) 2008 NTT Communication Science Laboratories

 
