Conversation Scene Analysis
We aim to develop an automatic computer system that can understand conversation scenes, based on data collected by cameras, microphones, and sensors. Our initial goals are as follows.
-Who is talking to whom, when?
-Who is looking at whom? Who attracts attention from others?
-Who is responding to whom, when, and how?
-Who is influenced, and by whom? Who is the most influential person in a meeting?
-How are conversations organized by accumulating interactions among people?
-How does a discussion evolve over time and reach a conclusion?
Humans find it easy to answer these questions. Current computers, however, lack even this ability. These questions are closely related to the mechanisms of human communication, and developing a computer that can answer them is essential if we are to realize truly user-friendly communication systems such as teleconferencing systems, meeting archives, and social agents/robots. So far, we have focused mainly on low-level states of conversations such as “who is talking to whom” and “who is looking at whom”. In the future, however, we believe computers will be able to understand higher-level states such as “Who made him angry?” and “Why is she crying?”.
Below is the history of our research. A complete list of publications is here.
The information below omits some Japanese publications/presentations. See the complete list here (in Japanese).
2004 and earlier
I was working on an automatic video editing system for conversation scenes with Yoshinao Takemae, Ph.D., who was with NTT CS Labs. at that time.
Our video editing approach is based on the idea that the person who attracts more gaze than the others is the key person at that moment, and that his/her face should be on screen. A person looks at another because he/she thinks that such attention is needed to acquire something important for understanding or participating in the meeting. Therefore, the more people look at the same target, the more important the target is. We thought that this assumption should also hold for remote viewers of the meeting, that is, people who did not attend it. Accordingly, we believed that gaze-based video switching is an effective way of editing meeting scenes, and this prediction was verified in experiments.
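The selection rule described above can be illustrated with a minimal sketch. This is not the actual editing system: the function name `select_shot`, the participant labels, and the tie-breaking rule (stay on the current shot when it is tied for the most gaze) are assumptions made only for illustration.

```python
from collections import Counter

def select_shot(gaze_targets, current_shot=None):
    """Pick the participant who receives the most gaze in one frame.

    gaze_targets maps each participant to the person they look at,
    or to None when they look at nobody.
    """
    counts = Counter(t for t in gaze_targets.values() if t is not None)
    if not counts:
        # Nobody is being looked at: keep whatever is on screen.
        return current_shot
    best = max(counts.values())
    leaders = sorted(p for p, c in counts.items() if c == best)
    # Assumed tie-break: do not cut away if the current shot is tied for first.
    if current_shot in leaders:
        return current_shot
    return leaders[0]

# Hypothetical frame: A, C, and D look at B, while B looks at C.
frame = {"A": "B", "B": "C", "C": "B", "D": "B"}
print(select_shot(frame))
```

In this hypothetical frame, B receives three of the four gazes and would be put on screen.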
Find related papers here.
2005
First, I targeted four-person conversations and formulated the problem of determining “conversation structures”, a term we define as the basic structure of a conversation, such as “who is talking to whom”. I came up with the idea of probabilistic conversation models. The features of this model are: 1) The higher states of conversation (I call them conversation regimes) govern how people interact with each other, and people’s behaviors appear probabilistically depending on these higher states. 2) Gaze direction can indicate the structure of conversations. 3) Head direction can indicate gaze direction. I then solved this inference problem using Markov chain Monte Carlo (MCMC).
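The idea that a hidden regime governs observable gaze can be illustrated with a toy example. The sketch below is not the model from the papers: the regime set (everyone converging on one person, or "divergence"), the probability `P_LOOK_AT_CENTER`, the uniform prior, and the function names are all hypothetical, and the posterior is computed by exact enumeration with Bayes' rule rather than by MCMC, because the toy state space is tiny.

```python
PEOPLE = ["A", "B", "C", "D"]
P_LOOK_AT_CENTER = 0.8  # hypothetical: chance of looking at the focal person

def gaze_likelihood(regime, gaze):
    """P(observed gaze | regime), with each person's gaze independent given the regime.

    regime is either a person's name (everyone else tends to look at them)
    or the string "divergence" (no shared focus).
    """
    prob = 1.0
    for person, target in gaze.items():
        n_others = len(PEOPLE) - 1
        if regime == "divergence" or regime == person:
            # No focal target for this person: uniform over the others.
            prob *= 1.0 / n_others
        elif target == regime:
            prob *= P_LOOK_AT_CENTER
        else:
            prob *= (1.0 - P_LOOK_AT_CENTER) / (n_others - 1)
    return prob

def regime_posterior(gaze):
    """Posterior over regimes under a uniform prior, by Bayes' rule."""
    regimes = PEOPLE + ["divergence"]
    prior = 1.0 / len(regimes)
    joint = {r: prior * gaze_likelihood(r, gaze) for r in regimes}
    total = sum(joint.values())
    return {r: v / total for r, v in joint.items()}

# Hypothetical observation: A, C, and D look at B; B looks at A.
observed = {"A": "B", "B": "A", "C": "B", "D": "B"}
posterior = regime_posterior(observed)
print(max(posterior, key=posterior.get))
```

With these made-up numbers, the most probable regime is the one centered on B. A sampling method such as MCMC becomes useful when the regimes, gaze directions, and head directions must be inferred jointly over long sequences, where enumeration is no longer practical.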
I presented the first paper in this field at the 1st International Workshop on Conversational Informatics, which was hosted by Prof. Nishida at Kyoto University. Then, I extended this work and presented the results at ICMI2005 and in the Transactions of IPSJ.
A Probabilistic Inference of Multiparty-Conversation Structure Based on Markov-Switching Models of Gaze Patterns, Head Directions, and Utterances
ACM Int. Conf. Multimodal Interfaces (ICMI)'05, October, 2005.
[Abstract][Paper][Presentation][Movies]
2006
The method proposed last year needed sensors for measuring head directions. This year, we tried to replace the body-mounted sensors with a vision-based face tracking method. This method was presented at ICME2006 and MIRU2006. Also, I gave a talk at an MIT seminar and presented an invited talk at JSAI-SLUD.
Conversation Scene Analysis with Dynamic Bayesian Network based on Visual Head Tracking
IEEE ICME'06, July, 2006
[Demo movies]
Modeling and Probabilistic Inference of Conversation Structures in Multiparty Face-to-Face Setting based on Visual Head Tracking
MIRU 2006, July 2006
Note: Japanese Domestic Conference. Content is the same as ICME'06
[Demo movies]
Communication Scene Analysis based on Probabilistic Modeling of Human Gaze Behavior
MIT CSAIL HCI Seminar Series Spring 2006
[Abstract][Presentation]
As an application of estimated conversation structure and gaze directions, we proposed a measure for interpersonal influence in conversation. This work was presented at the CHI poster session.
Quantifying Interpersonal Influence in Face-to-face Conversations based on Visual Attention Patterns
ACM CHI (Work-In-Progress Session), April, 2006
[Abstract][Paper][Poster]
2007
We extended the previous method to infer the action-reaction relationship in conversations, i.e. "who responds to whom". I presented this work at ICMI2007 and received the Outstanding Paper Award.
Automatic Inference of Cross-modal Nonverbal Interactions in Multiparty Conversations
Proc. ACM ICMI2007, Nov. 2007.
[Abstract][Paper][Presentation][Movies]
On November 30, I gave a talk at IEICE Verbal-Nonverbal Communication, held at the University of Tokyo.
Probabilistic Inference of Conversation Structures based on Nonverbal Behaviors ---Toward automatic understanding face-to-face conversations---
[Abstract][Presentation][Movies] (In Japanese)
In addition, on December 5, I was invited to the Nonverbal Knowledge Workshop, hosted by Prof. Mase at Nagoya University.
Recognition and Understanding of Face-to-face Conversations based on Nonverbal Behaviors
[Presentation PDF (360kB)] (In Japanese)
Recently, I have been working with intern students from domestic and foreign universities. Their topics are fast and accurate face tracking in video and facial expression recognition. We presented a demo system for face tracking at MIRU2007.
Simultaneous Real-time 3D Visual Tracking of Multiple Objects using a Stream Processor
Meeting on Image Recognition and Understanding (MIRU2007), DS-01 (2007)
[Paper]
Shiro Kumano, who is an intern, presented a poster at MIRU2007 and gave a talk at IPSJ CVIM and ACCV2007. He received an Honorable Mention at ACCV2007.
Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates
Proc. Asian Conference on Computer Vision, 2007
[To Shiro’s page]
2008
On May 7, I gave a talk at JSAI SLUD. The content was the Japanese version of the ICMI2007 paper.
Automatic Estimation of Nonverbal Interaction Structures in Multiparty Conversations ---Who Responds to Whom and How?---
In April, Oscar Mateo Lozano, a former intern, gave a talk at ICASSP2008. The paper proposed GPU-based face tracking.
Simultaneous and Fast 3D Tracking of Multiple Faces in Video by GPU-based Stream Processing
ICASSP2008 (The 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing)
[Related Info.]
On July 12, 2008, a paper coauthored with Oscar Mateo Lozano was published on Springer’s website.
Real-time visual tracker by Stream processing ---Simultaneous and fast 3D tracking of multiple faces in video sequences by using a particle filter---
Journal of VLSI Signal Processing Systems
http://www.springerlink.com/content/pk22n1632859082k/
Open access for anyone. [Related Info.]
Several websites picked up this paper. Many thanks!
NVIDIA CUDA ZONE
GPGPU Homepage
Geeks3D.com
Impress PC-watch, NVISION08 report (In Japanese)
From May 29 to 30, 2008, we exhibited our Realtime Meeting Analysis System at NTT CS Labs. OpenHouse2008. I was also a panelist in a panel discussion session, Understanding Communications, at OpenHouse2008.
-The official website is here. Video archives and PPT slides are available from here.
-An overview of our demo system is here.
-The audio technologies used in the system are here.
-The face pose tracker used in the system is here.
On September 9, 2008, I will present a paper at MLMI2008 in Utrecht, The Netherlands. This paper reports that we applied a face tracker, STCTracker, to meeting analysis and confirmed its effectiveness.
Fast and Robust Face Tracking for Analyzing Multiparty Face-to-Face Meetings
5th Joint Workshop on Machine Learning and Multimodal Interaction (MLMI2008)
[Paper][Presentation][Demo Movies]
In October 2008, I will present a paper at ICMI2008. This paper describes the technical content of the realtime system for conversation scene analysis, which we demonstrated at CS Labs. OpenHouse2008.
A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization
Proc. ACM 10th Int. Conf. Multimodal Interfaces (ICMI2008)
[Paper][Presentation][Demo Video]
In November 2008, I will give a talk at IEICE MVE at Osaka University. This talk is the Japanese version of the ICMI2008 paper. If possible, I will bring part of our system to the venue.
Any reproduction, modification, distribution, or republication of materials contained on this Web site, without prior explicit permission of the copyright holder, is strictly prohibited.
All rights reserved, Copyright(C) 2005, 2006, 2007, 2008 NTT Communication Science Laboratories