ICMI2008 (International Conference on Multimodal Interfaces 2008)
A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization
K. Otsuka, S. Araki, K. Ishizuka, M. Fujimoto, M. Heinrich, and J. Yamato
NTT Communication Science Laboratories
[Abstract]
This paper presents a realtime system for analyzing group meetings that uses a novel omnidirectional camera-microphone system. The goal is to automatically discover the visual focus of attention (VFOA), i.e., “who is looking at whom”, in addition to speaker diarization, i.e., “who is speaking and when”. First, a novel tabletop sensing device for roundtable meetings is presented; it consists of two cameras with two fisheye lenses and a triangular microphone array. Second, from high-resolution omnidirectional images captured with the cameras, the position and pose of people’s faces are estimated by STCTracker (Sparse Template Condensation Tracker), which achieves robust realtime tracking of multiple faces by utilizing GPUs (Graphics Processing Units). The face position/pose data output by the face tracker is used to estimate the focus of attention in the group. Using the microphone array, robust speaker diarization is carried out by VAD (Voice Activity Detection) and DOA (Direction of Arrival) estimation, followed by sound source clustering. This paper also presents new 3-D visualization schemes for meeting scenes and the results of an analysis. Using two PCs, one for vision and one for audio processing, the system runs at about 20 frames per second for 5-person meetings.
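The step from face position/pose data to "who is looking at whom" can be illustrated with a minimal sketch. It is not the method used in the paper: it assumes each participant is reduced to a 2-D tabletop position and a head yaw angle, and it assigns the focus of attention to the person whose bearing lies closest to that yaw, within a threshold. The function name `estimate_vfoa`, the 2-D simplification, and the 30-degree threshold are all hypothetical choices made for illustration.

```python
import math


def estimate_vfoa(positions, yaws, max_angle_deg=30.0):
    """Return, for each person, the index of the person they appear to look at.

    positions: list of (x, y) tabletop coordinates, one per participant.
    yaws: list of head yaw angles in radians, measured like math.atan2.
    max_angle_deg: assumed tolerance; beyond it the person is looking at no one.
    """
    threshold = math.radians(max_angle_deg)
    targets = []
    for i, (pos, yaw) in enumerate(zip(positions, yaws)):
        best, best_diff = None, threshold
        for j, other in enumerate(positions):
            if i == j:
                continue
            # Direction from person i to person j.
            bearing = math.atan2(other[1] - pos[1], other[0] - pos[0])
            # Absolute angular difference, wrapped into [0, pi].
            delta = bearing - yaw
            diff = abs(math.atan2(math.sin(delta), math.cos(delta)))
            if diff < best_diff:
                best, best_diff = j, diff
        targets.append(best)
    return targets


if __name__ == "__main__":
    positions = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
    yaws = [math.pi, math.radians(45), math.radians(90)]
    print(estimate_vfoa(positions, yaws))
```

In this toy layout, person 0 faces person 1, person 1 faces person 2, and person 2 faces away from the table, so the last entry is `None`. A practical system would also need to handle pose noise and temporal smoothing, which this sketch ignores.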
[Download]
© ACM, (2008). This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ICMI’08, October 20–22, 2008, Chania, Crete, Greece. Copyright 2008 ACM 978-1-60558-198-9/08/10
[Movies]

Movie of Fig.2 (wmv file) 1.9MB
Movie of Fig.2 (wmv file) 3.1MB
Another view (tracking results)


For more information, visit our Realtime Multimodal System for Conversation Scene Analysis
Any reproduction, modification, distribution, or republication of materials contained on this Web site, without prior explicit permission of the copyright holder, is strictly prohibited.
All rights reserved, Copyright(C) 2008 NTT Communication Science Laboratories