Conversation Scene Analysis


We aim to develop an automatic computer system that can understand conversation scenes, based on data collected by cameras, microphones, and sensors. Our initial goals are as follows.

-Who is talking to whom, when?

-Who is looking at whom? Who attracts attention from others?

-Who is responding to whom, when, and how?

-Who is influenced, and by whom? Who is the most influential person in a meeting?

-How are conversations organized by accumulating interactions among people?

-How does a discussion evolve over time and reach a conclusion?

Humans find it easy to answer these questions, but current computers cannot yet do so. These questions are closely related to the mechanisms of human communication, and developing a computer that can answer them is essential if we are to realize truly user-friendly communication systems such as teleconferencing systems, meeting archives, and social agents/robots. So far, we have mainly focused on low-level states of conversations, such as who is talking to whom and who is looking at whom. In the future, however, we believe computers will be able to understand higher-level states, such as "Who made him angry?" and "Why is she crying?"

Below is the history of our research. A complete list of publications is here.

The information below omits some Japanese publications/presentations. See the complete list here (in Japanese).


2004 and earlier

I was working on an automatic video editing system for conversation scenes with Yoshinao Takemae, Ph.D., who was with NTT CS Labs. at that time.

Our video editing approach is based on the idea that the person who attracts more gaze than the others is the key person at that moment, and that his/her face should be on screen. A person looks at another because he/she thinks that such attention is needed to acquire something important for understanding or participating in the meeting. Therefore, the more people look at the same target, the more important the target is. We thought that this assumption should also hold for remote viewers of the meeting, that is, people who did not attend it. Accordingly, we believed that gaze-based video switching is an effective way of editing meeting scenes, and this prediction was verified in experiments.
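The switching rule described above can be sketched roughly as follows. This is only an illustrative sketch, not the editing system itself: the function name `select_shots`, the per-frame dictionary format, and the `min_hold` parameter (a minimum shot length to avoid rapid cuts) are all assumptions introduced here.

```python
from collections import Counter

def select_shots(gaze_frames, min_hold=3):
    """Pick, for each frame, the participant whose face is shown.

    gaze_frames: list of dicts mapping each participant to the person
    he/she is looking at (or None). Hypothetical input format.
    min_hold: assumed minimum number of frames before a cut is allowed.
    """
    shots = []
    current, held = None, 0
    for gaze in gaze_frames:
        # Count how many people look at each target in this frame.
        counts = Counter(t for t in gaze.values() if t is not None)
        candidate = current
        if counts:
            top, top_n = counts.most_common(1)[0]
            if current is None:
                candidate = top
            elif (top != current and top_n > counts.get(current, 0)
                  and held >= min_hold):
                # Cut only when someone else attracts strictly more gaze
                # and the current shot has been held long enough.
                candidate = top
        if candidate != current:
            current, held = candidate, 1
        else:
            held += 1
        shots.append(current)
    return shots

# Four participants: B attracts the most gaze first, then C.
frames = ([{"A": "B", "B": "A", "C": "B", "D": "B"}] * 2
          + [{"A": "C", "B": "C", "C": "A", "D": "C"}] * 3)
shots = select_shots(frames, min_hold=3)
print(shots)
```

In this example the shot stays on B for one extra frame after the gaze shifts to C, because of the assumed minimum hold time.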

Find related papers here.

2005

First, I targeted four-person conversations and formulated the problem of determining conversation structures; we use this term to denote the basic structure of a conversation, such as who is talking to whom. I came up with the idea of probabilistic conversation models. The features of this model are: 1) The higher-level states of conversation (I call them conversation regimes) govern how people interact with each other, and people's behaviors appear probabilistically depending on these states. 2) Gaze directions can indicate the structure of conversations. 3) Head directions can indicate gaze directions. I then solved this inference problem using Markov chain Monte Carlo (MCMC).
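To give a feel for this kind of inference, here is a heavily simplified sketch, not the model from the papers. It assumes only two hidden regimes, a single discrete observation per frame (how many of the other three participants look at the same person), and fixed, hand-picked transition and emission probabilities. The regime sequence is sampled with Gibbs sampling, one MCMC method. The function name `gibbs_regimes` and all numeric values are illustrative assumptions.

```python
import random

def gibbs_regimes(obs, trans, emit, n_iter=500, burn_in=100, seed=0):
    """Estimate per-frame posterior probabilities of hidden regimes.

    obs:   list of discrete observation symbols, one per frame
    trans: trans[i][j] = P(regime j at t+1 | regime i at t)
    emit:  emit[k][x]  = P(observation x | regime k)
    All parameters are assumed known here (a simplification).
    """
    rng = random.Random(seed)
    K, T = len(trans), len(obs)
    z = [rng.randrange(K) for _ in range(T)]      # random initial regimes
    counts = [[0] * K for _ in range(T)]
    for it in range(n_iter):
        for t in range(T):
            weights = []
            for k in range(K):
                w = emit[k][obs[t]]               # observation likelihood
                w *= trans[z[t - 1]][k] if t > 0 else 1.0 / K
                if t < T - 1:
                    w *= trans[k][z[t + 1]]       # link to the next frame
                weights.append(w)
            # Resample this frame's regime given its neighbors.
            z[t] = rng.choices(range(K), weights=weights)[0]
        if it >= burn_in:
            for t in range(T):
                counts[t][z[t]] += 1
    n = n_iter - burn_in
    return [[c / n for c in row] for row in counts]

# Regime 0: gaze converges on one person; regime 1: gaze is scattered.
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.05, 0.10, 0.25, 0.60],   # P(0..3 people look at same target | regime 0)
        [0.50, 0.30, 0.15, 0.05]]   # same, for regime 1
obs = [3, 3, 2, 3, 3, 0, 1, 0, 0, 1]
posterior = gibbs_regimes(obs, trans, emit)
for t, row in enumerate(posterior):
    print(t, [round(p, 2) for p in row])
```

With these assumed parameters, the sampler should assign the early frames mostly to regime 0 and the later frames mostly to regime 1. A fuller model would also treat gaze directions as hidden and infer them from head directions and utterances.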

I presented the first paper in this field at the 1st International Workshop on Conversational Informatics, which was hosted by Prof. Nishida at Kyoto University. Then, I extended this work and presented the results at ICMI2005 and in the Transactions of IPSJ.

A Probabilistic Inference of Multiparty-Conversation Structure Based on
Markov-Switching Models of Gaze Patterns, Head Directions, and Utterances

ACM Int. Conf. Multimodal Interfaces (ICMI)'05, October, 2005.
[Abstract][Paper][Presentation][Movies]

2006

The method proposed last year needed sensors for measuring head directions. This year, we tried to replace the body-mounted sensors with a vision-based face tracking method. This method was presented at ICME2006 and MIRU2006. Also, I gave a talk at an MIT seminar and presented an invited talk at JSAI-SLUD.


Conversation Scene Analysis with Dynamic Bayesian Network based on Visual Head Tracking
IEEE ICME'06, July, 2006
[Demo movies]

Modeling and Probabilistic Inference of Conversation Structures in Multiparty Face-to-Face Setting based on Visual Head Tracking
MIRU 2006, July 2006
Note: Japanese Domestic Conference. Content is the same as ICME'06
[Demo movies]

Communication Scene Analysis based on Probabilistic Modeling of Human Gaze Behavior

MIT CSAIL HCI Seminar Series Spring 2006
[Abstract][Presentation]

As an application of estimated conversation structure and gaze directions, we proposed a measure for interpersonal influence in conversation. This work was presented at the CHI poster session.


Quantifying Interpersonal Influence in Face-to-face Conversations based on Visual Attention Patterns
ACM CHI (Work-In-Progress Session), April, 2006
[Abstract][Paper][Poster]

2007

We extended the previous method to infer the action-reaction relationship in conversations, i.e. "who responds to whom". I presented this work at ICMI2007 and received the Outstanding Paper Award.

Automatic Inference of Cross-modal Nonverbal Interactions in Multiparty Conversations
Proc. ACM ICMI2007, Nov. 2007.
[Abstract][Paper][Presentation][Movies]

On November 30, I gave a talk at the IEICE Verbal-Nonverbal Communication meeting, held at the University of Tokyo.

 

Probabilistic Inference of Conversation Structures based on Nonverbal Behaviors ---Toward automatic understanding face-to-face conversations---

[Abstract][Presentation][Movies] (In Japanese)

 

In addition, on December 5, I was invited to the Nonverbal Knowledge Workshop, hosted by Prof. Mase at Nagoya University.

Recognition and Understanding of Face-to-face Conversations based on Nonverbal Behaviors
[Presentation PDF (360kB)]  (In Japanese)

Recently, I have been working with intern students from domestic and foreign universities. Their topics are fast and accurate face tracking in video and facial expression recognition. We presented a demo system for face tracking at MIRU2007.

Simultaneous Real-time 3D Visual Tracking of Multiple Objects using a Stream Processor
Meeting on Image Recognition and Understanding
MIRU2007DS-01 (2007)
[Paper]

Shiro Kumano, who is an intern, presented a poster at MIRU2007 and gave a talk at IPSJ CVIM and ACCV2007. He received an Honorable Mention at ACCV2007.

Pose-Invariant Facial Expression Recognition Using Variable-Intensity Templates
Proc. Asian Conference on Computer Vision, 2007
[To Shiro's page]

2008

On May 7, I gave a talk at JSAI SLUD. The content was the Japanese version of the ICMI2007 paper.

Automatic Estimation of Nonverbal Interaction Structures in Multiparty Conversations ---Who Responds to Whom and How?---

In April, Oscar Mateo Lozano, a former intern, gave a talk at ICASSP2008. His paper proposed GPU-based face tracking.

Simultaneous and Fast 3D Tracking of Multiple Faces in Video by GPU-based Stream Processing
ICASSP2008 (The 33rd IEEE International Conference on Acoustics, Speech, and Signal Processing)
[Related Info.]

On July 12, 2008, a paper coauthored with Oscar Mateo Lozano was published on Springer's website.

Real-time visual tracker by Stream processing ---Simultaneous and fast 3D tracking of multiple faces in video sequences by using a particle filter ---
Journal of VLSI Signal Processing Systems
http://www.springerlink.com/content/pk22n1632859082k/
Open access. [Related Info.]

Several websites picked up this paper. Many thanks!

NVIDIA CUDA ZONE
GPGPU Homepage
Geeks3D.com
Impress PC-watch, NVISION08 report (In Japanese)


On May 29 and 30, 2008, we exhibited our Realtime Meeting Analysis System at NTT CS Labs. OpenHouse2008. I was also a panelist in the panel discussion session "Understanding Communications" at OpenHouse2008.

-The official website is here. Video archives and PPT are available from here.

-An overview of our demo system is here.

-The audio technologies used in the system are here.

-The face pose tracker used in the system is here.

 

On September 9, 2008, I will present a paper at MLMI2008 in Utrecht, The Netherlands. This paper reports that we applied a face tracker, STCTracker, to meeting analysis and confirmed its effectiveness.

Fast and Robust Face Tracking for Analyzing Multiparty Face-to-Face Meetings
5th Joint Workshop on Machine Learning and Multimodal Interaction (MLMI2008)
[Paper][Presentation][Demo Movies]

 

In October 2008, I will present a paper at ICMI2008. This paper describes the technical details of the realtime system for conversation scene analysis that we demonstrated at CS Labs. OpenHouse2008.

A Realtime Multimodal System for Analyzing Group Meetings by Combining Face Pose Tracking and Speaker Diarization

Proc. ACM 10th Int. Conf. Multimodal Interfaces (ICMI2008)
[Paper][Presentation][Demo Video]

In November 2008, I will give a talk at IEICE MVE at Osaka University. This talk is the Japanese version of the ICMI2008 paper. If possible, I will bring part of our system to the venue.

 


 

Any reproduction, modification, distribution, or republication of materials contained on this Web site, without prior explicit permission of the copyright holder, is strictly prohibited.

All rights reserved, Copyright(C) 2005, 2006, 2007, 2008 NTT Communication Science Laboratories