"Robust Media Search" technology for searching for audio/video data in our surroundings
When you hear a good song on the street, or see an interesting item on TV, do you ever want to find out more information about what you’re hearing or seeing? Searching for information on a distorted signal captured by a mobile phone or camera is possible with NTT’s Robust Media Search technology.
Media Search technology
A media search examines the similarity between a “query signal,” which is a fragment of audio or video, and each section of the “stored signals” in a database, in order to extract information related to the query (Fig. 1).
In a media search, the main goals are accuracy and speed.
Fig. 1 Media Search Technology.
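The basic idea of comparing a query with each section of the stored signals can be sketched in a few lines of Python. This is only an illustration under simple assumptions: signals are short lists of numbers, similarity is measured by a mean absolute difference, and the names `distance`, `media_search`, `song_a`, and `song_b` are hypothetical. It is not NTT's implementation, and a brute-force scan like this one shows why speed is a goal in its own right.

```python
from typing import Dict, List, Tuple


def distance(a: List[float], b: List[float]) -> float:
    """Mean absolute difference between two equally long sections (0.0 = identical)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def media_search(query: List[float],
                 database: Dict[str, List[float]]) -> Tuple[str, int, float]:
    """Slide the query over every stored signal and return the closest section
    as (title, offset, distance). Brute force: every offset is examined."""
    best = None
    for title, stored in database.items():
        for offset in range(len(stored) - len(query) + 1):
            section = stored[offset:offset + len(query)]
            d = distance(query, section)
            if best is None or d < best[2]:
                best = (title, offset, d)
    if best is None:
        raise ValueError("no stored signal is long enough for this query")
    return best


# Hypothetical stored signals (assumed data, for illustration only).
database = {
    "song_a": [0.0, 0.1, 0.9, 0.4, 0.7, 0.2, 0.0, 0.3],
    "song_b": [0.5, 0.5, 0.1, 0.8, 0.3, 0.6, 0.9, 0.2],
}
# A slightly distorted copy of song_a[3:6].
query = [0.42, 0.69, 0.21]

title, offset, dist = media_search(query, database)
print(title, offset, round(dist, 3))
```

The search returns the title and the position of the best-matching section, which is the information a real system would use to look up related data.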
NTT has studied media search from an early stage and led the world in developing Fast Media Search technologies, such as the Time-series Active Search and the Learning-based Active Search.
Now we have developed a much faster and more accurate search technology -- the Robust Media Search.
Robust Media Search Technology
Robust Media Search Technology has two main twists: one is to use a locally defined feature of a signal, and the other is to make the feature itself robust.
Let’s go into the details one by one.
Divide-And-Locate (DAL) method
Let’s say you want to find out the title of a song from a fragment of a TV program in which the song was used as background music. But the fragment is hard to identify because the song overlaps with narration. NTT has developed the DAL method for such situations by using the local feature, which is the first twist of the Robust Media Search. When the fragment is analyzed as a spectrogram, with frequency on the vertical axis and time on the horizontal axis, part of the spectrogram shows the narration signal. A close look, however, reveals small gaps where the feature of the background music remains (Fig. 2).
To extract the feature, the spectrogram is divided into small regions (Fig. 3). The feature extracted from each region is then matched against the features of the signals in the music database to locate the part with which many regions match. The features of the stored signals are calculated and categorized in advance, which makes the search extremely fast.
Fig. 2 A close look at the frequencies reveals small gaps where the feature of the background music remains.
Fig. 3 The spectrogram is divided into small regions.
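The divide-and-locate idea can be illustrated with a rough Python sketch. Everything specific here is an assumption made for the example: the spectrograms are tiny integer grids, the local feature is simply the raw values of a 2 x 2 region (a stand-in for whatever feature the real system uses), and `build_index` and `locate` are hypothetical names. The point is only to show an index prepared in advance and regions voting for a location.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Spectrogram = List[List[int]]  # spectrogram[frame][frequency_bin]

SPAN = 2  # time frames per region (assumed)
BAND = 2  # frequency bins per region (assumed)


def region_feature(spec: Spectrogram, t: int, f: int) -> Tuple[int, ...]:
    """Stand-in local feature: the raw values of one SPAN x BAND region."""
    return tuple(spec[t + dt][f + df] for dt in range(SPAN) for df in range(BAND))


def build_index(database: Dict[str, Spectrogram]):
    """Prepared in advance: map (band, feature) to every place it occurs."""
    index = defaultdict(list)
    for title, spec in database.items():
        for t in range(len(spec) - SPAN + 1):
            for f in range(0, len(spec[0]) - BAND + 1, BAND):
                index[(f, region_feature(spec, t, f))].append((title, t))
    return index


def locate(query: Spectrogram, index) -> Optional[Tuple[str, int, int]]:
    """Each query region votes for (title, time offset); the most voted wins."""
    votes = Counter()
    for qt in range(0, len(query) - SPAN + 1, SPAN):
        for f in range(0, len(query[0]) - BAND + 1, BAND):
            for title, t in index.get((f, region_feature(query, qt, f)), []):
                votes[(title, t - qt)] += 1
    if not votes:
        return None
    (title, offset), count = votes.most_common(1)[0]
    return title, offset, count


# Hypothetical music database (assumed data).
database = {
    "song_a": [[1, 5, 2, 0], [4, 0, 3, 3], [2, 2, 5, 1],
               [0, 3, 1, 4], [5, 1, 0, 2], [3, 4, 4, 5]],
    "song_b": [[0, 1, 1, 2], [2, 3, 0, 5], [5, 5, 4, 0],
               [1, 0, 2, 3], [4, 2, 3, 1], [3, 1, 5, 4]],
}
# Frames 2 to 5 of song_a, with "narration" overwriting some cells (values 7-9).
query = [[9, 8, 5, 1], [7, 9, 1, 4], [5, 1, 0, 2], [3, 4, 9, 5]]

result = locate(query, build_index(database))
print(result)
```

Two of the four query regions are spoiled by the narration, yet the two untouched regions agree on the same song and offset. Because the index is a lookup table built beforehand, each region is matched without scanning the database.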
Binary Area Matching (BAM) Method
Fig. 4 shows a sound transmitted via mobile phone from a moving car. The original feature of the signal is distorted and it is difficult for the DAL method to handle such signals.
The Binary Area Matching, or BAM, method was developed to allow media searches in such cases by using the second twist: making the signal feature itself robust. The BAM method first divides the signal into small segments (Fig. 5), and then selects the portions in which the signal changes greatly over time. In other words, it selects only the segments that are relatively insensitive to real-world environmental influences. Next, the selected segments are quantized coarsely and represented by numbers.
Fig. 4 A sound spectrogram via mobile phone from a moving car.
By comparing signals after coarse rather than fine quantization, the influence of distortion is weakened, which makes the search robust. Such signals were difficult to identify with the conventional method, but easy to identify with Binary Area Matching. In an experiment, the BAM method achieved a high recognition accuracy of 93.4%, compared with only 15.9% for the conventional method.
Fig. 5 Coarse, rather than fine, quantization by the BAM method.
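A minimal Python sketch of the select-then-quantize idea follows. It rests on assumptions that are not from NTT: the segments are single spectrogram cells, "changes greatly" means the frame-to-frame difference exceeds a fixed threshold, the coarse code is one bit (rising or falling), and `bam_feature` and `bam_score` are hypothetical names.

```python
from typing import Dict, List, Tuple

Spectrogram = List[List[float]]  # spectrogram[frame][frequency_bin]
Cell = Tuple[int, int]           # (frame, frequency_bin)


def bam_feature(spec: Spectrogram, threshold: float) -> Dict[Cell, int]:
    """Keep only cells whose energy changes strongly from one frame to the
    next, and reduce each to a single bit: 1 = rising, 0 = falling."""
    feature = {}
    for t in range(len(spec) - 1):
        for f in range(len(spec[0])):
            change = spec[t + 1][f] - spec[t][f]
            if abs(change) >= threshold:
                feature[(t, f)] = 1 if change > 0 else 0
    return feature


def bam_score(reference: Dict[Cell, int], query: Spectrogram) -> float:
    """Fraction of the selected cells whose bit is the same in the query."""
    if not reference:
        return 0.0
    agree = 0
    for (t, f), bit in reference.items():
        query_bit = 1 if query[t + 1][f] - query[t][f] > 0 else 0
        agree += int(query_bit == bit)
    return agree / len(reference)


# Hypothetical clean recording (assumed data).
clean = [[1.0, 5.0, 2.0], [6.0, 5.2, 0.5], [1.5, 9.0, 0.6], [1.4, 3.0, 4.0]]
# The same sound after a lossy channel: quieter, with small errors everywhere.
distorted = [[0.7, 2.4, 1.3], [2.9, 2.9, 0.2], [0.9, 4.3, 0.5], [0.6, 1.7, 2.2]]
# An unrelated sound.
other = [[3.0, 1.0, 4.0], [4.0, 2.0, 5.0], [2.0, 1.0, 1.0], [5.0, 4.0, 0.0]]

feature = bam_feature(clean, threshold=1.0)
score_distorted = bam_score(feature, distorted)
score_other = bam_score(feature, other)
print(len(feature), score_distorted, round(score_other, 2))
```

No value in the distorted copy equals the clean one, so a fine-grained comparison would fail, but the direction of every large change survives and the coarse bits still agree.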
Coarsely-quantized Area Matching (CAM) Method
Most of the images in our surroundings, such as those on TV, have been edited to include superimposed images or other visual effects. The Coarsely-quantized Area Matching, or CAM, method is an extension of the DAL and BAM methods for video searching. The CAM method uses the color change of each pixel instead of the sound spectrum. As in the case of the BAM method, the CAM method selects, based on statistical criteria, the segments that are relatively insensitive to distortions, such as the segments where the color changes greatly over time. Through matching after coarse quantization, a very robust media search was achieved.
Even from a picture with a superimposed image, the original video was found correctly using the CAM method. Even with big graphics overlaid with a color change, or with an obstacle in front of the object, the original data was successfully identified.
Fig. 6 Coarsely coded video features by CAM method.
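The same idea carries over to video, as the following Python sketch suggests. The details are assumptions chosen for illustration, not the CAM method itself: a frame is a flat list of RGB pixels, the statistical criterion is "change larger than the average change," the coarse code records only which color channel changed most and in which direction, and `cam_feature` and `cam_score` are hypothetical names.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue)
Video = List[List[Pixel]]     # video[frame][pixel_index]
Key = Tuple[int, int]         # (frame, pixel_index)


def color_change(a: Pixel, b: Pixel) -> Tuple[int, ...]:
    """Per-channel difference between the same pixel in consecutive frames."""
    return tuple(y - x for x, y in zip(a, b))


def coarse_code(change: Tuple[int, ...]) -> Tuple[int, int]:
    """Coarse code: (channel that changed most, +1 if it rose else -1)."""
    channel = max(range(3), key=lambda c: abs(change[c]))
    return channel, (1 if change[channel] > 0 else -1)


def cam_feature(video: Video) -> Dict[Key, Tuple[int, int]]:
    """Keep pixels whose color change is above the average, coarsely coded."""
    changes = {(t, p): color_change(video[t][p], video[t + 1][p])
               for t in range(len(video) - 1) for p in range(len(video[0]))}
    magnitude = {k: sum(abs(c) for c in v) for k, v in changes.items()}
    mean = sum(magnitude.values()) / len(magnitude)
    return {k: coarse_code(changes[k]) for k in changes if magnitude[k] > mean}


def cam_score(reference: Dict[Key, Tuple[int, int]], query: Video) -> float:
    """Fraction of selected pixels whose coarse code matches in the query."""
    agree = sum(coarse_code(color_change(query[t][p], query[t + 1][p])) == code
                for (t, p), code in reference.items())
    return agree / len(reference)


# Hypothetical two-frame clip of six pixels (assumed data).
original = [
    [(10, 10, 10), (200, 50, 50), (30, 30, 30), (0, 0, 255), (100, 100, 100), (20, 200, 20)],
    [(12, 10, 10), (50, 60, 50), (30, 32, 30), (0, 200, 250), (100, 100, 220), (220, 190, 20)],
]
# Edited copy: darker overall, with a white caption covering the last pixel.
edited = [
    [(8, 8, 8), (160, 40, 40), (24, 24, 24), (0, 0, 204), (80, 80, 80), (255, 255, 255)],
    [(10, 8, 8), (40, 48, 40), (24, 26, 24), (0, 160, 200), (80, 80, 176), (255, 255, 255)],
]

feature = cam_feature(original)
score_edited = cam_score(feature, edited)
print(sorted(feature), score_edited)
```

The caption hides one of the four selected pixels, but the other three still carry the same coarse code, so the edited copy remains recognizable.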
What are the possibilities of the Robust Media Search? For one, it's possible to search for the copyright owner of music or video content. This makes it possible to check for pirated copies of music and video on the Internet and support proper use on video posting sites. For another, by searching via the Internet with just a shot of an item seen on a TV program, you can get information on it -- not only what’s introduced in the program but also other related information. The media search has many more possibilities.
Looking to the near future, NTT is extending its development to Media Induction Technology, which searches media by complementing or generating audio/video signals that have common attributes with the query signal even though their appearances differ. It may not be long before we can retrieve information from the sounds of nature, such as a bird’s twittering, by applying the Robust Media Search Technology.