
2014 | Book

MultiMedia Modeling

20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6-10, 2014, Proceedings, Part II

Edited by: Cathal Gurrin, Frank Hopfgartner, Wolfgang Hurst, Håvard Johansen, Hyowon Lee, Noel O’Connor

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

The two-volume set LNCS 8325 and 8326 constitutes the thoroughly refereed proceedings of the 20th Anniversary International Conference on Multimedia Modeling, MMM 2014, held in Dublin, Ireland, in January 2014. The 46 revised regular papers, 11 short papers, and 9 demonstration papers were carefully reviewed and selected from 176 submissions. In addition, 28 special session papers and 6 papers from the Video Browser Showdown workshop are included in the proceedings. The papers in these two volumes cover a diverse range of topics including: applications of multimedia modelling, interactive retrieval, image and video collections, 3D and augmented reality, temporal analysis of multimedia content, compression and streaming. Special session papers cover the following topics: Mediadrom: artful post-TV scenarios, MM analysis for surveillance video and security applications, 3D multimedia computing and modeling, social geo-media analytics and retrieval, and multimedia hyperlinking and retrieval.

Table of Contents

Frontmatter

Special Session: Mediadrom: Artful Post-TV Scenarios

Organising Crowd-Sourced Media Content via a Tangible Desktop Application

This paper presents a framework for collaborative design environments for the manipulation of multimedia content. The proposed desktop system can be regarded as an imitation of a physical desktop with three additional properties: a tangible timeline layer, and projections of 2D plan and 3D perspective views that can be controlled through specified augmented reality markers. The potential of this desktop application, such as the unfoldable experience of time via a timeline and the multi-touch manipulation and organisation of 2D and 3D visual data, is explored with a view to further augmented reality studies.

Sema Alaçam, Yekta İpek, Özgün Balaban, Ceren Kayalar
Scenarizing Metropolitan Views: FlanoGraphing the Urban Spaces

The recent decade has seen a rapid evolution in the field of digital media. Mobile devices are now being integrated into every aspect of urban life. GPS, sensor technologies and augmented reality have transformed the new generation of mobile devices from a communication and information platform into a navigational tool, fostering new ways of perceiving reality and image building. Touch sensor technology has changed the screen into a joint input and display device. In this paper we present the FlanoGraph, an application for smartphones and tablets designed to take advantage of the changes induced by mobile devices. We first briefly outline the conceptual background, evoking the work of researchers in the fields of ‘Non Representational Theory’, mobile media, and computational data processing. We then present and describe the FlanoGraph through a set of use cases. Finally, we conclude by discussing some techniques necessary for the development of the application.

Bénédicte Jacobs, Laure-Anne Jacobs, Christian Frisson, Willy Yvart, Thierry Dutoit, Sylvie Leleu-Merviel
Scenarizing CADastre Exquisse: A Crossover between Snoezeling in Hospitals/Domes, and Authoring/Experiencing Soundful Comic Strips

This paper provides scenarios for the design of environments for authoring and experiencing interactive soundful comic strips. One setting is a virtual immersive environment made of a dome with spherical projection and surround sound, where visitors lying comfortably on an interactive mattress can appreciate exquisite corpses floating on the ceiling of the dome, animated and with sound, dependent on the overall behavior of the visitors. On tabletops, creators can generate comic-strip-like creatures by collage or sketching, and associate audiovisual behaviors and soundscapes with them. This creation system will be used in hospitals as a living lab to comfort patients in accepting their healthcare journey. Both settings are inspired by snoezelen methods. These crossover scenarios associate a project by L’Art-Chétype, selected to be featured for Mons 2015 EU Capital of Culture, with other partners aiming at designing an environment for experiencing/authoring interactive comic strips augmented with sound.

Cédric Sabato, Aurélien Giraudet, Virginie Delattre, Yves Desnos, Christian Frisson, Rudi Giot, Willy Yvart, François Rocca, Stéphane Dupont, Guy Vandem Bemden, Sylvie Leleu-Merviel, Thierry Dutoit
An Interactive Device for Exploring Thematically Sorted Artworks

This Mediadrom artful post-TV scenario consists of sketching the user interface of an interactive media content browsing system for exploring thematically sorted artworks from the art field of plastic theater, merging art pieces at the intersection of the visual and the performing arts. Combining a touchscreen and a hypermedia browser of image and video content with expert annotations, this system can be installed in venues such as museums and media libraries, and in performance spaces as a satellite installation to plastic theater performances.

Aurélie Baltazar, Pascal Baltazar, Christian Frisson

Special Session: MM Analysis for Surveillance Video and Security Applications

Hierarchical Audio-Visual Surveillance for Passenger Elevators

Modern elevators are equipped with closed-circuit television (CCTV) cameras that record videos for post-incident investigation rather than providing proactive event monitoring. While there have been some attempts at automated video surveillance, events such as urinating, vandalism, and crimes that involve vulnerable targets may not exhibit significant visual cues. On the contrary, such events are more discernible from audio cues. In this work, we propose a hierarchical audio-visual surveillance framework for elevators. An audio analytics module acts as the front-line detector to monitor for such events; that is, the audio cue is the main source for inferring that an event has occurred. The secondary inference process involves queries to a visual analytics module to build up the evidence leading to event detection. We validated the performance of our system at a residential trial site, and the initial results are promising.

Teck Wee Chua, Karianto Leman, Feng Gao
An Evaluation of Local Action Descriptors for Human Action Classification in the Presence of Occlusion

This paper examines the impact that the choice of local descriptor has on human action classifier performance in the presence of static occlusion. This question is important when applying human action classification to surveillance video that is noisy, crowded, complex and incomplete. In real-world scenarios, it is natural for a human to be occluded by an object while carrying out different actions. However, it is unclear how the performance of the proposed action descriptors is affected by the associated loss of information. In this paper, we evaluate and compare the classification performance of state-of-the-art local human action descriptors in the presence of varying degrees of static occlusion. We consider four different local action descriptors: Trajectory (TRAJ), Histogram of Oriented Gradients (HOG), Histogram of Oriented Flow (HOF) and Motion Boundary Histogram (MBH). These descriptors are combined with a standard bag-of-features representation and a Support Vector Machine classifier for action recognition. We investigate the performance of these descriptors and their possible combinations with respect to varying amounts of artificial occlusion in the KTH action dataset. This preliminary investigation shows that MBH in combination with TRAJ has the best performance in the case of partial occlusion, while TRAJ in combination with MBH achieves the best results in the presence of heavy occlusion.

Iveel Jargalsaikhan, Cem Direkoglu, Suzanne Little, Noel E. O’Connor
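The bag-of-features step named in the abstract can be sketched as follows. The toy 2-D descriptors and three-word codebook below are invented for illustration; the paper quantizes TRAJ/HOG/HOF/MBH descriptors against a learned codebook instead.

```python
import numpy as np

def bag_of_features(descriptors, codebook):
    """Quantize local descriptors to their nearest codebook word
    and return an L1-normalized histogram of word counts."""
    # Pairwise squared distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy descriptors and a 3-word codebook (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
desc = np.array([[0.1, -0.1], [0.9, 1.2], [5.1, 4.8], [4.9, 5.2]])
hist = bag_of_features(desc, codebook)
print(hist)  # histogram over the 3 visual words: [0.25, 0.25, 0.5]
```

The resulting fixed-length histogram is what gets fed to the SVM, regardless of how many local descriptors the video produced.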
Online Identification of Primary Social Groups

Online group identification is a challenging task, due to the inherently dynamic nature of groups. In this paper, a novel framework is proposed that combines the individual trajectories produced by a tracker with a prediction of their evolution, in order to identify existing groups. In addition to the widely known criteria used in the literature for group identification, we present a novel one, which exploits the motion pattern of the trajectories. The proposed framework utilizes the past, present and predicted states of groups within a scene to provide robust online group identification. Experiments were conducted to demonstrate the effectiveness of the proposed method, with promising results.

Dimitra Matsiki, Anastasios Dimou, Petros Daras
Gait Based Gender Recognition Using Sparse Spatio Temporal Features

A gender-balanced dataset of 101 pedestrians on a treadmill is presented. Gait is analysed for gender classification using a modification of a framework which has previously proven effective in behaviour recognition experiments. Sparse spatio-temporal features from the video clips are classified using Support Vector Machines. Tuning parameters are investigated to find an effective feature descriptor for gender separation, and an accuracy of 87% is achieved.

Matthew Collins, Paul Miller, Jianguo Zhang
Perspective Multiscale Detection and Tracking of Persons

The efficient detection and tracking of persons in videos has widespread applications, especially in CCTV systems for surveillance or forensics. In this paper we present a new method for people detection and tracking based on knowledge of the perspective information of the scene. It alleviates two main drawbacks of existing methods: (i) the high, even excessive, computational cost associated with multiscale detection-by-classification methods; and (ii) the inherent difficulty of CCTV footage, in which partial and full occlusions predominate and intra-class variability is very high. During the detection stage, we propose to use the homography of the dominant plane to compute the expected sizes of persons at different positions in the image, and thus dramatically reduce the number of evaluations in the multiscale sliding window detection scheme. To achieve robustness against false positives and negatives, we use a combination of full-body and upper-body detectors, as well as a Data Association Filter (DAF) inspired by the well-known Rao-Blackwellized particle filters (RBPF). Our experiments demonstrate the benefit of the proposed perspective multiscale approach compared to conventional sliding window approaches, and also that this perspective information can lead to useful combinations of full-body and upper-body detectors.

Marcos Nieto, Juan Diego Ortega, Andoni Cortes, Seán Gaines
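The perspective-driven size prior can be illustrated with a toy ground-plane homography. The camera parameters below are invented for the example: map the foot pixel to the ground plane, step half a metre sideways, and map back to measure the local pixels-per-metre scale. Treating the vertical image scale as roughly equal to this lateral ground scale (a reasonable assumption for an upright person and a level camera) gives the expected pixel height at each image row, so a sliding-window detector only needs a narrow band of scales per position.

```python
import numpy as np

# Toy world-to-image homography for a level camera at height 2.5 m:
# ground-plane point (X, Z) maps to pixel (u, v) with u = f*X/Z + cx
# and v = cy + f*h/Z (focal length f, principal point (cx, cy)).
f, h, cx, cy = 800.0, 2.5, 320.0, 240.0
H_w2i = np.array([[f,   cx,  0.0],
                  [0.0, cy,  f * h],
                  [0.0, 1.0, 0.0]])
H_i2w = np.linalg.inv(H_w2i)

def to_world(u, v):
    p = H_i2w @ np.array([u, v, 1.0])
    return p[:2] / p[2]

def to_image(X, Z):
    p = H_w2i @ np.array([X, Z, 1.0])
    return p[:2] / p[2]

def expected_person_height_px(u, v, person_m=1.7, step_m=0.5):
    """Expected pixel height of an upright person whose feet are at (u, v)."""
    X, Z = to_world(u, v)
    du = to_image(X + step_m, Z) - to_image(X, Z)
    px_per_m = np.hypot(*du) / step_m  # local ground-plane scale at (u, v)
    return person_m * px_per_m

# Feet at depth Z = 5 m project to v = 240 + 800*2.5/5 = 640:
print(expected_person_height_px(320, 640))  # ~272 px (near the camera)
print(expected_person_height_px(320, 340))  # ~68 px  (Z = 20 m, far away)
```

The detector then evaluates windows only around the predicted height at each row, instead of the full scale pyramid everywhere.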
Human Action Recognition in Video via Fused Optical Flow and Moment Features – Towards a Hierarchical Approach to Complex Scenario Recognition

This paper explores using motion features for human action recognition in video, as the first step towards hierarchical complex event detection for surveillance and security. We compensate for the low resolution and noise characteristic of many CCTV modalities by generating optical flow feature descriptors which view motion vectors as a global representation of the scene, as opposed to a set of pixel-wise measurements. Specifically, we combine existing optical flow features with a set of moment-based features which not only capture the orientation of motion within each video scene, but also incorporate spatial information regarding the relative locations of directed optical flow magnitudes. Our evaluation, using a benchmark dataset, considers their diagnostic capability when recognizing human actions under varying feature set parameterizations and signal-to-noise ratios. The results show that human actions can be recognized with a mean accuracy of 93.3% across all actions. Furthermore, we illustrate that precision degrades less in low signal-to-noise images when our moment-based features are utilized.

Kathy Clawson, Min Jing, Bryan Scotney, Hui Wang, Jun Liu

Special Session: 3D Multimedia Computing and Modeling

Sparse Patch Coding for 3D Model Retrieval

3D shape retrieval is a fundamental task in many domains such as multimedia, graphics, CAD, and entertainment. In this paper, we propose a 3D object retrieval approach that effectively utilizes low-level patches with initial semantics of 3D shapes, which are similar to superpixels in images. These patches are first obtained by stably over-segmenting the 3D shape, and we adopt five representative geometric features, such as the shape diameter function, average geodesic distance, and heat kernel signature, to characterize these low-level patches. A large number of patches collected from shapes in a dataset are encoded into visual words by means of sparse coding, and an input query is compared with the 3D models in the dataset via the probability distribution of visual words. Experiments show that the proposed method achieves retrieval performance comparable to state-of-the-art methods.

Zhenbao Liu, Shuhui Bu, Junwei Han, Jun Wu
3D Object Classification Using Deep Belief Networks

Extracting features with strong expressive and discriminative ability is one of the key factors in the effectiveness of a 3D model classifier. A large body of research has illustrated that deep belief networks (DBN) are powerful enough to represent the distributions of input data. In this paper, we apply a DBN to extract features of 3D models. After training with the contrastive divergence method, we obtain a well-trained DBN that can powerfully represent the input data, and the feature is taken from the output of the last layer. This procedure is unsupervised. Due to the limited amount of labeled data, a semi-supervised method is utilized to recognize 3D objects using the feature obtained from the trained DBN. The experiments are conducted on the publicly available Princeton Shape Benchmark (PSB), and the experimental results demonstrate the effectiveness of our method.

Biao Leng, Xiangyang Zhang, Ming Yao, Zhang Xiong
Pursuing Detector Efficiency for Simple Scene Pedestrian Detection

Detector accuracy is the key focus of most existing pedestrian detection algorithms, especially for cluttered scenes. However, it is not always necessary, and sometimes over-fitted, to directly apply such detectors in scenarios with simple scene compositions. To this end, limited work has been done on systematic detector simplification that balances speed and accuracy. In this paper, we study this problem by investigating two mutually correlated issues, i.e., fast edge-based feature extraction and detector score computation. For the first issue, a simple Structured Local Edge Pattern (SLEP) is proposed to extract local edge cues and encode them, very efficiently, into a histogram. For the second, an integral image based acceleration is proposed for fast classifier score computation by transforming the classifier score into a linear sum of weights. Experimental results on the CASIA gait recognition dataset show that our proposed method is more efficient than most existing detectors, and is even faster than the practical OpenCV pedestrian detector.

De-Dong Yuan, Jie Dong, Song-Zhi Su, Shao-Zi Li, Rong-Rong Ji
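The integral image mentioned above, shown here in generic form rather than with the authors' SLEP-specific weights, turns any rectangular sum (and hence a classifier score expressed as a linear sum over cells) into four table lookups:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended, so that
    ii[r, c] is the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) via four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(20.0).reshape(4, 5)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 4))  # equals img[1:3, 1:4].sum() = 57.0
```

After one O(HW) pass to build the table, every sliding-window score query costs constant time regardless of window size, which is where the speedup over naive evaluation comes from.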
Multi-view Action Synchronization in Complex Background

This paper addresses the temporal synchronization of human actions captured from multiple views. Many researchers have focused on frame-by-frame alignment to synchronize such multi-view videos, exploiting features such as interest point trajectories or 3D human motion features to detect individual events. However, since backgrounds are complex and dynamic in the real world, traditional image-based features are not well suited to video representation. We explore an approach that uses robust spatio-temporal features and self-similarity matrices to represent actions across views. Multiple sequences are aligned over temporal patches (sliding windows) using the Dynamic Time Warping algorithm hierarchically, and measured by meta-action classifiers. Two datasets, the Pump dataset and the Olympic dataset, are used as test cases. Experiments show the effectiveness of the method and its suitability for general video event datasets.

Longfei Zhang, Shuo Tang, Shikha Singhal, Gangyi Ding
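The hierarchical alignment above builds on standard Dynamic Time Warping. A minimal 1-D version looks like this; the paper applies it to self-similarity descriptors of sliding windows rather than raw scalars:

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic O(len(a)*len(b)) Dynamic Time Warping distance."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j]: minimal cost of aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # best of diagonal match, insertion, deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 (the repeated 2 is absorbed)
print(dtw([1, 2, 3], [2, 2, 2]))     # 2.0
```

Because DTW tolerates local stretching and compression of the time axis, it can align the same action performed at slightly different speeds in different views.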
Parameter-Free Inter-view Depth Propagation for Mobile Free-View Video

As a result of the rapidly improving performance of mobile devices, there is an inevitable trend towards applications with stereoscopic functions. For these applications, especially those that can be viewed from free angles, a set of depth images is usually needed. To fit bit-rate-limited working conditions such as mobile terminals, depth propagation is often exploited to decrease transmission cost. Traditionally, accurate propagation needs camera parameters to warp one view to another, but these are usually unavailable. This paper proposes a parameter-free inter-view depth propagation method for free-view video with multi-view texture videos plus a single-view depth video. Firstly, the depth is estimated for each view from the corresponding texture video. Then, inter-view relationships of depth views are investigated using the neighboring estimated depths. Finally, the obtained inter-view relationship is used to propagate the depth from one given high-quality depth view to the other views. Experimental results demonstrate that our scheme achieves better quality in both depth and virtual viewpoint generation.

Binbin Xiong, Weimin Wu, Haojie Li, Hongtao Yu, Hanzi Mao
Coverage Field Analysis to the Quality of Light Field Rendering

In light field rendering (LFR), the geometric configuration of the cameras affects the rendering quality of virtual views. A mathematical model of the coverage field (CF) is proposed in this paper to quantify the relationship between rendering quality and the geometric configuration of cameras. We analyze the impact of changes in the CF on rendering quality over a set of virtual view positions and camera configurations. An optimization algorithm is also presented to optimize the geometric configuration of cameras with the help of the CF. The experimental results show that the proposed CF can effectively quantify the quality of LFR, and can be used to optimize the geometric configuration of cameras.

Changjian Zhu, Li Yu, Peng Zhou

Special Session: Social Geo-Media Analytics and Retrieval

Personalized Recommendation by Exploring Social Users’ Behaviors

With the popularity and rapid development of social networks, more and more people enjoy sharing their experiences, such as reviews, ratings and moods. This creates great opportunities to address the cold-start and sparse-data problems using new social network factors such as interpersonal influence and interest based on circles of friends. Several algorithmic models and social factors have been proposed in this domain, but they have not been fully explored. In this paper, two social factors, interpersonal rating-behavior similarity and interpersonal interest similarity, are fused into a consolidated personalized recommendation model based on probabilistic matrix factorization. These two factors enhance the inner link between features in the latent space. We conduct a series of experiments on the Yelp dataset, and the experimental results show that the proposed approach outperforms existing methods.

Guoshuai Zhao, Xueming Qian, He Feng
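The probabilistic-matrix-factorization backbone of such a model, shown without the two social-similarity terms that are the paper's actual contribution, can be sketched with plain SGD on the observed ratings. The toy rating matrix below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item rating matrix; 0 marks an unobserved entry.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = np.argwhere(R > 0)

k, lr, reg = 2, 0.01, 0.02  # latent rank, learning rate, L2 weight
U = 0.1 * rng.standard_normal((R.shape[0], k))  # user latent factors
V = 0.1 * rng.standard_normal((R.shape[1], k))  # item latent factors

def rmse():
    err = [(R[u, i] - U[u] @ V[i]) ** 2 for u, i in observed]
    return float(np.sqrt(np.mean(err)))

before = rmse()
for _ in range(500):  # SGD sweeps over the observed entries
    for u, i in observed:
        e = R[u, i] - U[u] @ V[i]
        U[u] += lr * (e * V[i] - reg * U[u])
        V[i] += lr * (e * U[u] - reg * V[i])
after = rmse()
print(before, "->", after)  # training RMSE drops substantially
```

The social factors in the paper enter as extra regularization terms that pull the latent vectors of similar users together; the SGD loop above stays structurally the same.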
Where Is the News Breaking? Towards a Location-Based Event Detection Framework for Journalists

The rise of user-generated content (UGC) as a source of information in the journalistic lifecycle is driving the need for automated methods to detect, filter, contextualise and verify citizen reports of breaking news events. In this position paper we outline the technological challenges in incorporating UGC into news reporting and describe our proposed framework for exploiting UGC from social media for location-based event detection and filtering, to reduce the workload of journalists covering breaking and ongoing news events. News organisations increasingly rely on manually curated UGC. Manual monitoring, filtering, verification and curation of UGC, however, is a time- and effort-consuming task, and our proposed framework takes a first step in addressing many of the issues surrounding these processes.

Bahareh Rahmanzadeh Heravi, Donn Morrison, Prashant Khare, Stephane Marchand-Maillet
Location-Aware Music Artist Recommendation

Current advances in music recommendation underline the importance of multimodal and user-centric approaches in order to transcend limits imposed by methods that solely use audio, web, or collaborative filtering data. We propose several hybrid music recommendation algorithms that combine information on the music content, the music context, and the user context, in particular integrating geospatial notions of similarity. To this end, we use a novel standardized data set of music listening activities inferred from microblogs (MusicMicro) and state-of-the-art techniques to extract audio features and contextual web features. The multimodal recommendation approaches are evaluated for the task of music artist recommendation. We show that traditional approaches (in particular, collaborative filtering) benefit from adding a user context component, geolocation in this case.

Markus Schedl, Dominik Schnitzer
Task-Driven Image Retrieval Using Geographic Information

As large-scale online geo-tagged image collections come into view, it is important to leverage geographic information for web image retrieval. In this paper, a geo-metadata based image retrieval system is proposed that uses both textual tags and visual features. This image retrieval system is especially useful for tourism related tasks such as tourism recommendation and tourism guidance. First, the requested image retrieval task is classified into one of three types according to the retrieval purpose, so that it can be handled with a specific method. Second, a WordNet hierarchy based semantic similarity is developed to measure the similarity between different cities; this semantic similarity is broadly consistent with visual similarity. Finally, a high-level image representation method is proposed to narrow the semantic gap between low-level visual features and high-level image concepts. The proposed algorithm is evaluated on an image set consisting of 177,158 images in total, covering the 120 most popular cities around the world, collected from Flickr, and the experiments have provided very positive results.

Peixiang Dong, Kuizhi Mei, Ji Zhang, Hao Lei, Jianping Fan
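The hierarchy-based similarity can be illustrated with a Wu-Palmer-style measure on a toy geographic taxonomy. Both the taxonomy and the particular formula are illustrative assumptions here; the paper uses the actual WordNet hierarchy:

```python
# Toy is-a hierarchy: child -> parent, rooted at "entity".
parent = {
    "location": "entity", "city": "location", "region": "location",
    "paris": "city", "rome": "city", "tuscany": "region",
}

def path_to_root(node):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path  # node, ..., "entity"

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wup_similarity(a, b):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First node on b's path to the root that is also an ancestor of a
    # is the least common subsumer (lcs).
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(wup_similarity("paris", "rome"))     # 0.75 (LCS is "city")
print(wup_similarity("paris", "tuscany"))  # 0.5  (LCS is "location")
```

Concepts that share a deeper common ancestor score higher, which is the property the system exploits when comparing cities.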
The Evolution of Research on Multimedia Travel Guide Search and Recommender Systems

The importance of multimedia travel guide search and recommender systems has led to a substantial amount of research spanning different computer science and information systems disciplines in recent years. The five core research streams we identify here incorporate several multimedia computing and information retrieval problems that relate to alternative perspectives on algorithm design for optimizing search/recommendation quality, and to different methodological paradigms for assessing system performance at large scale. They are: (1) query analysis, (2) diversification based on different criteria, (3) ranking and reranking, (4) personalization and (5) evaluation. Based on a comprehensive discussion and analysis of these streams, this survey evaluates the recent major contributions to theory and system development, and makes some predictions about the road ahead for multimedia computing and information retrieval (IR) researchers in both academia and industry.

Junge Shen, Zhiyong Cheng, Jialie Shen, Tao Mei, Xinbo Gao

Special Session: Multimedia Hyperlinking and Retrieval

Average Precision: Good Guide or False Friend to Multimedia Search Effectiveness?

Approaches to multimedia search often evolve from existing approaches with strong average precision. However, work on search evaluation shows that average precision does not always capture effectiveness in terms of satisfying user needs, because it ignores the diversity of search results. This paper investigates whether search approaches with diverse results have been neglected within the multimedia retrieval research agenda because they are overshadowed by search approaches with strong average precision. To this end, we compare 361 search approaches applied to the TRECVid benchmarks between 2005 and 2007. We motivate two criteria, based on measure correlation and statistical equivalence, to estimate whether search approaches with diverse results have been neglected. We show that the hypothesized effect indeed occurs in the examined collections. As a consequence, the research community would benefit from reconsidering existing approaches in the light of diversity.

Robin Aly, Dolf Trieschnigg, Kevin McGuinness, Noel E. O’Connor, Franciska de Jong
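For reference, non-interpolated average precision over a ranked list is just the mean of the precision values at the ranks of the relevant items (assuming here, for simplicity, that every relevant document appears in the list). Note how the measure rewards rank order alone and says nothing about how different the returned items are from one another, which is exactly the blind spot the paper examines:

```python
def average_precision(relevance):
    """relevance: list of 0/1 judgements in ranked order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Two relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6
print(average_precision([1, 0, 1, 0]))  # 0.8333...
```

Two result lists with identical AP can thus differ wildly in diversity, which motivates the correlation analysis the authors perform.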
An Investigation into Feature Effectiveness for Multimedia Hyperlinking

The increasing amount of archival multimedia content available online is creating new opportunities for users who are interested in exploratory search behaviour such as browsing. The user experience with online collections could therefore be improved by enabling navigation and recommendation within multimedia archives, supported by allowing a user to follow a set of hyperlinks created within or across documents. The main goal of this study is to compare the performance of different multimedia features for automatic hyperlink generation. In our work we construct multimedia hyperlinks by indexing and searching textual and visual features extracted from the blip.tv dataset. A user-driven evaluation strategy is then applied using the Amazon Mechanical Turk (AMT) crowdsourcing platform, since we believe that AMT workers represent a good example of “real world” users. We conclude that textual features exhibit better performance than visual features for multimedia hyperlink construction. In general, a combination of ASR transcripts and metadata provides the best results.

Shu Chen, Maria Eskevich, Gareth J. F. Jones, Noel E. O’Connor
Mining the Web for Multimedia-Based Enriching

As the amount of social media shared on the Internet continues to grow, it becomes possible to explore a topic from a novel, people-based viewpoint. We aim to enrich topics using media items mined from social media sharing platforms. Nevertheless, such data collected from the Web is likely to contain noise, hence the need to further process the collected documents to ensure relevance. To this end, we designed an approach to automatically propose a cleaned set of media items related to events mined from search trends. Events are described using word tags, and a pool of videos is linked to each event in order to propose relevant content. This pool is first filtered to remove non-relevant data using information retrieval techniques. We report the results of our approach by automatically illustrating popular moments of four celebrities.

Mathilde Sahuguet, Benoit Huet

Short Papers

Spatial Similarity Measure of Visual Phrases for Image Retrieval

Spatial information plays an essential role in the accurate matching of local features in applications such as image retrieval. Despite previous work, extracting appropriate spatial information remains a challenging problem. We propose an image retrieval framework based on visual phrases. By encoding spatial information into the similarity measure of visual phrases, our approach is able to capture accurate spatial relations between visual words. Furthermore, the image-specific visual phrase selection process helps to reduce the large number of redundant visual phrases. We have conducted experiments on two datasets, UKbench and TRECVID, which show that our ideas significantly improve performance in image retrieval applications.

Jiansong Chen, Bailan Feng, Bo Xu
Semantic Based Background Music Recommendation for Home Videos

In this paper, we propose a new background music recommendation scheme for home videos and two new features describing the short-term motion/tempo distribution in visual/aural content. Unlike previous research that merely matched the visual and aural contents in a perceptual way, we incorporate textual semantics and content semantics when determining the matching degree of a video and a song. The key idea is that the recommended music should contain semantics related to those in the input video, and that the rhythm of the music and the visual motion of the video should be sufficiently harmonious. Accordingly, a few user-given tags and automatically annotated tags are used to compute their relation to the lyrics of songs for selecting candidate music. Then, we use the proposed motion-direction histogram (MDH) and pitch tempo pattern (PTP) for a second-round selection. The user's preferred music genre is also taken into account as a filtering mechanism at the beginning. A preliminary user evaluation shows that the proposed scheme is promising.

Yin-Tzu Lin, Tsung-Hung Tsai, Min-Chun Hu, Wen-Huang Cheng, Ja-Ling Wu
Smoke Detection Based on a Semi-supervised Clustering Model

Video-based smoke detection is regarded as an effective way to detect fire in open spaces. In this paper, a classification model based on a semi-supervised clustering method is introduced to improve the performance of smoke detection. In our model, we present a novel method to automatically determine the number of clusters K. Considering the randomness of the initial centers in K-means++, a voting strategy is proposed to maintain stable clustering performance. In addition, scene-related information is added to our clustering data to obtain a self-adaptive model. Finally, the experimental results show that our classification model outperforms other state-of-the-art methods and generalizes well (i.e., it can adapt to unknown scenes).

Haiqian He, Liqun Peng, Deshun Yang, Xiaoou Chen
Empirical Exploration of Extreme SVM-RBF Parameter Values for Visual Object Classification

This paper presents a preliminary exploration showing the surprising effect of extreme parameter values used by Support Vector Machine (SVM) classifiers for identifying objects in images. The Radial Basis Function (RBF) kernel used with SVM classifiers is considered a state-of-the-art approach in visual object classification. Standard tuning approaches apply a relatively narrow window of values when determining the main kernel size parameters. We evaluated the effect of setting an extremely small kernel size and discovered that, contrary to expectations, in the context of visual object classification these small kernels can demonstrate good classification performance for some object and feature combinations. The evaluation is based on experiments on the TRECVid 2013 Semantic INdexing (SIN) training dataset and provides initial indications that can be used to better understand the optimisation of RBF kernel parameters.

Rami Albatal, Suzanne Little
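The effect of an extremely small RBF kernel size can be seen directly in the Gram matrix: as sigma shrinks, K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) collapses towards the identity matrix, so the SVM tends towards a memorizing, nearest-neighbour-like regime. The toy points below are invented for illustration:

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

print(np.round(rbf_gram(X, sigma=1.0), 3))   # smooth cross-similarities
print(np.round(rbf_gram(X, sigma=0.05), 3))  # ~identity: off-diagonals vanish
```

The conventional expectation is that a near-identity Gram matrix overfits badly; the paper's observation is that for some object and feature combinations this regime nevertheless classifies well.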
Real-World Event Detection Using Flickr Images

This paper proposes a real-world event detection method that uses the time and location information and text tags attached to images in Flickr. Events can generally be detected by extracting images captured at the events, which are annotated with text tags frequently used only at specific times and locations. However, such an approach cannot detect events where only a small number of images were captured. We focus on the fact that semantically related events often occur around the same time at different locations. Considering a group of such events as an event class, the proposed method first detects event classes from all images in Flickr based on the similarity of their capture times and text tags. Then, from the images constituting each event class, events are detected based on the similarity of their capture locations. This two-step approach enables us to detect events for which only a small number of images were captured.

Naoko Nitta, Yusuke Kumihashi, Tomochika Kato, Noboru Babaguchi
Spectral Classification of 3D Articulated Shapes

The large number of 3D models distributed on the Internet has created demand for automatic shape classification. This paper presents a novel classification method for 3D mesh shapes. Each shape is represented by the eigenvalues of an appropriately defined affinity matrix, forming a spectral embedding that achieves invariance against rigid-body transformations, uniform scaling, and shape articulation. The AdaBoost algorithm, chosen for its immunity to overfitting, is then applied to classify the 3D models in the spectral space. We evaluate the approach on the McGill 3D shape benchmark and compare the results with a previous classification method, achieving higher classification accuracy. This method is suitable for automatic classification of 3D articulated shapes.

Zhenbao Liu, Feng Zhang, Shuhui Bu
Improving Scene Detection Algorithms Using New Similarity Measures

The creation process of interactive non-linear videos requires the definition of scenes which are connected in a scene graph. These might be available in the form of raw material (shots) or need to be extracted from existing films. In the latter case, the scenes have to be defined by the author in a time-consuming process. A semi-automated scene extraction helps the user perform this task. Detected shots and scenes may be corrected in a graphical user interface which is integrated in our authoring tool. To provide satisfying results in the automated scene detection process, our main goal was to improve an existing algorithm. Different shot comparison functions were evaluated with regard to the overall performance index. The commonly used color histogram intersection was outperformed by a combination of χ²-distance and complexity comparison, which unexpectedly attained the best results.

Stefan Zwicklbauer, Britta Meixner, Harald Kosch
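The χ²-distance that outperformed histogram intersection in the abstract above can be sketched as follows. The paper does not state its exact formulation, so a common symmetric variant over normalized color histograms is assumed here:

```python
def chi2_distance(h1, h2, eps=1e-10):
    """Symmetric chi-squared distance between two normalized histograms.

    Per-bin squared differences are weighted by the combined bin mass,
    so differences in sparsely populated bins count more; eps guards
    against division by zero for empty bins.
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

h_a = [0.5, 0.3, 0.2]  # illustrative 3-bin color histograms
h_b = [0.4, 0.4, 0.2]

print(chi2_distance(h_a, h_a))  # identical histograms -> 0.0
print(chi2_distance(h_a, h_b))  # small positive distance
```

In practice each shot's key frame would yield a much larger histogram (e.g. per-channel color bins), but the comparison function itself stays this simple.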
EvoTunes: Crowdsourcing-Based Music Recommendation

In recent years, there have been many attempts to automatically recommend music clips that a listener is expected to like. In this paper, we present a novel music recommendation system that automatically gathers listeners' direct responses about their satisfaction when two specific songs are played one after the other, and evolves accordingly for enhanced music recommendation. Our music streaming web service, called "EvoTunes," is described in detail. Experimental results using the service demonstrate that the success rate of recommendation increases over time through the proposed evolution process.

Jun-Ho Choi, Jong-Seok Lee
Affect Recognition Using Magnitude Models of Motion

The analysis of human affective behavior has attracted increasing attention from researchers in psychology, computer science, neuroscience, and related disciplines. We focus on the recognition of the affective state of a single person from video streams. We create a model that allows us to estimate the state of four affective dimensions of a person: arousal, anticipation, power, and valence. This sequence model is composed of a magnitude model of motion constructed from a set of points of interest tracked using optical flow. The state of each affective dimension is then predicted using an SVM. The experiments were performed on a standard dataset and showed promising results.

Oussama Hadjerci, Adel Lablack, Ioan Marius Bilasco, Chaabane Djeraba
Effects of Audio Compression on Chord Recognition

Feature analysis of audio compression is necessary to achieve high accuracy in musical content recognition and content-based music information retrieval (MIR). Bit rate differences are expected to adversely affect musical content analysis and content-based MIR results because the frequency response might be changed by the encoding. In this paper, we specifically examine the effect of compression on the chroma vector, a commonly used feature vector in music signal processing. We analyze features extracted from music files encoded at different bit rates and compare them with the chroma features of the original songs, using datasets for chord recognition.

Aiko Uemura, Kazumasa Ishikura, Jiro Katto
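The chroma vector studied above aggregates spectral energy into 12 pitch classes. A minimal sketch of the underlying frequency-to-pitch-class mapping (an illustration of the concept, not the authors' implementation):

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz, ref_a4=440.0):
    """Map a frequency to one of the 12 chroma (pitch-class) bins.

    Computes semitones relative to A4 and folds octaves together,
    which is exactly why chroma is octave-invariant.
    """
    semitones = round(12 * math.log2(freq_hz / ref_a4))
    return PITCH_CLASSES[(semitones + 9) % 12]  # offset so index 0 is C

print(pitch_class(440.0))   # A4
print(pitch_class(261.63))  # middle C
print(pitch_class(880.0))   # one octave above A4 folds into the same bin
```

A full chroma vector sums spectral magnitude into these 12 bins per analysis frame; lossy encoding that alters the frequency response perturbs those per-bin sums, which is the effect the paper measures.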
The Perceptual Characteristics of 3D Orientation

The rapid development of 3D video stimulates demand for 3D audio and promotes research on the perceptual characteristics of the 3D sound field. Traditional studies of 3D orientation characteristics measured sensitivity on site for a few specific locations, so their efficiency and results were poorly suited to practical application. This paper first designs a new device for collecting measurements at arbitrary positions in order to establish a 3D sound database, and proposes a test method that can be applied at large scale. The method ensures the consistency of the experimental data and is free from interference by the experimenters, and an adaptive computer-based procedure improves the efficiency and accuracy of the experiment. Discovering the relative contributions of frequency and position to selectivity for azimuth and distance is a significant way to explore the perceptual mechanism of 3D orientation, and this research provides theoretical support for 3D audio acquisition, coding, reconstruction, and playback.

Wang Heng, Zhang Cong, Hu Ruimin, Tu Weiping, Wang Xiaochen

Demonstrations

Folkioneer: Efficient Browsing of Community Geotagged Images on a Worldwide Scale

In this paper, we introduce Folkioneer, a novel approach for browsing and exploring community-contributed geotagged images. Initially, images are clustered based on the embedded geographical information by applying an enhanced version of the CURE algorithm, and characteristic geodesic shapes are derived using Delaunay triangulation. Next, images of each geographical cluster are analyzed and grouped according to visual similarity using SURF and restricted homography estimation. At the same time, LDA is used to extract representative topics from the provided tags. Finally, the extracted information is visualized in an intuitive and user-friendly manner with the help of an interactive map.

Hatem Mousselly-Sergieh, Daniel Watzinger, Bastian Huber, Mario Döller, Elöd Egyed-Zsigmond, Harald Kosch
Muithu: A Touch-Based Annotation Interface for Activity Logging in the Norwegian Premier League

Annotation of content is a key enabling technology for multimedia systems development. In this demonstration, we present a real-time activity annotation interface designed to be intuitive while imposing minimum effort on the user. Our solution is to use a smartphone and implement a tile-based touch interface. The interface was developed as a part of a larger project in collaboration with Tromsø IL, one of the top soccer teams in Norway. In this demonstration submission we present and evaluate the annotation interface of Muithu.

Magnus Stenhaug, Yang Yang, Cathal Gurrin, Dag Johansen
FoodCam: A Real-Time Mobile Food Recognition System Employing Fisher Vector

In this demo, we demonstrate a mobile food recognition system with Fisher Vector and linear one-vs-rest SVMs which enables us to record our food habits easily. In experiments with 100 kinds of food categories, we achieved a 79.2% classification rate for the top 5 category candidates when the ground-truth bounding boxes are given. The prototype system is open to the public as an Android-based smartphone application.

Yoshiyuki Kawano, Keiji Yanai
The LIRE Request Handler: A Solr Plug-In for Large Scale Content Based Image Retrieval

Big data in the visual domain and large scale image retrieval are current and pressing topics. In this demo paper we describe a specific implementation that allows for large scale content based image retrieval using the Apache Solr search server. The combination of the robust and well accepted Apache Solr search server with the LIRE content based search framework provides an easy-to-use and well-performing combination of two popular open source tools. We demonstrate the usefulness of our plug-in based on a real-life scenario of visual trademark search with more than 1,800,000 images.

Mathias Lux, Glenn Macstravic
M3 + P3 + O3 = Multi-D Photo Browsing

Collections of digital media, personal, professional or social, have been growing ever larger, leaving users overwhelmed with data. Learning from the success of multi-dimensional analysis (MDA) for On-Line Analytical Processing (OLAP) applications, we demonstrate the M3 multi-dimensional model for media browsing and the associated P3 photo browser and O3 media server prototypes.

Björn Thór Jónsson, Áslaug Eiríksdóttir, Ólafur Waage, Grímur Tómasson, Hlynur Sigurthórsson, Laurent Amsaleg
Tools for User Interaction in Immersive Environments

REVERIE – REal and Virtual Engagement in Realistic Immersive Environments – is a large scale collaborative project co-funded by the European Commission targeting novel research in the general domain of Networked Media and Search Systems. The project aims to bring about a revolution in 3D media and virtual reality by developing technologies for safe, collaborative, online environments that can enable realistic interpersonal communication and interaction in immersive environments. To date, project partners have been developing component technologies for a variety of functionalities related to the aims of REVERIE prior to integration into an end-to-end system. In this demo submission, we first introduce the project in general terms, outlining the high-level concept and vision before briefly describing the suite of demonstrations that we intend to present at MMM 2014.

N. E. O’Connor, D. Alexiadis, K. Apostolakis, Petros Daras, E. Izquierdo, Y. Li, D. S. Monaghan, F. Rivera, C. Stevens, S. Van Broeck, J. Wall, H. Wei
RESIC: A Tool for Music Stretching Resistance Estimation

In this demonstration, we present a useful tool that estimates the stretching REsistance of muSIC (RESIC). The tool takes advantage of both music characteristics and human factors by incorporating audio content features, musical genre, and a user-tagged music stretching resistance data set to provide reliable estimation. For better understanding of this tool, two front-ends are introduced. Our work fills the gap of music stretching resistance estimation, which aids music resizing techniques in parameter selection, and also expands user manipulation of music.

Jun Chen, Chaokun Wang
A Visual Information Retrieval System for Radiology Reports and the Medical Literature

The enormous amount of visual data in Picture Archival and Communication Systems (PACS) and in the medical literature is growing exponentially. In the proposed demo, the medical image search of the KHRESMOI project is presented to solve some of the challenges of medical data management and retrieval. The system allows searching for visual information by combining content-based image retrieval (CBIR) and text retrieval in several languages using semantic concepts. 3D visual retrieval in internal hospital sources is supported by marking volumes of interest (VOI) in the data, and connections to the medical literature are established to allow further investigation of interesting cases. The system is demonstrated on 5 TB of radiology reports with associated images and articles of the biomedical literature with over 1.7M images.

Dimitrios Markonis, René Donner, Markus Holzer, Thomas Schlegl, Sebastian Dungs, Sascha Kriewel, Georg Langs, Henning Müller
Eolas: Video Retrieval Application for Helping Tourists

In this paper, a video retrieval application for the Android mobile platform is described. The application utilises computer vision technologies that, given a photo of a landmark of interest, will automatically locate online videos about that landmark. Content-based video retrieval technologies are adopted to find the most relevant videos based on visual similarity of video content. The system has been evaluated using a custom test collection with human-annotated ground truth. We show that our system is effective, both in terms of speed and accuracy. This application is proposed for demonstration at MMM 2014, and we believe it would benefit tourists both when planning travel and while travelling.

Zhenxing Zhang, Yang Yang, Ran Cui, Cathal Gurrin

Video Browser Showdown

Audio-Visual Classification Video Browser

This paper presents our third participation in the Video Browser Showdown. Building on the experience that we gained while participating in this event, we compete in the 2014 showdown with a more advanced browsing system based on incorporating several audio-visual retrieval techniques. This paper provides a short overview of the features and functionality of our new system.

David Scott, Zhenxing Zhang, Rami Albatal, Kevin McGuinness, Esra Acar, Frank Hopfgartner, Cathal Gurrin, Noel E. O’Connor, Alan F. Smeaton
Content-Based Video Browsing with Collaborating Mobile Clients

A system comprised of collaborating mobile clients and a server is introduced in order to solve known item search tasks. The clients query the server for small video sequences according to some search criteria and are kept informed about the actions of collaborating participants (viewed frames, queries submitted, bookmarks set). The results are browsed and refined through a GUI that takes advantage of modern tablets’ capabilities.

Claudiu Cobârzan, Marco A. Hudelist, Manfred Del Fabro
Browsing Linked Video Collections for Media Production

This paper describes a video browsing tool for media (post-) production, enabling users to efficiently find relevant media items for redundant and sparsely annotated content collections. Users can iteratively cluster the content set by different features, and restrict the content set by selecting a subset of clusters. In addition to clustering by features, similarity search by different features is supported, and a set of linked video segments can be explored for a segment of interest. Desktop and Web-based variants of the user interface, including temporal preview functionality, are available.

Werner Bailer, Wolfgang Weiss, Christian Schober, Georg Thallinger
VERGE: An Interactive Search Engine for Browsing Video Collections

This paper presents the VERGE interactive video retrieval engine, which is capable of searching and browsing video content. The system integrates several content-based analysis and retrieval modules, such as video shot segmentation and scene detection, concept detection, clustering, and visual similarity search, into a user-friendly interface that supports the user in browsing through the collection in order to retrieve the desired clip.

Anastasia Moumtzidou, Konstantinos Avgerinakis, Evlampios Apostolidis, Vera Aleksić, Fotini Markatopoulou, Christina Papagiannopoulou, Stefanos Vrochidis, Vasileios Mezaris, Reinhard Busch, Ioannis Kompatsiaris
Signature-Based Video Browser

In this paper, we present a new signature-based video browser tool relying on the natural human ability to perceive and memorize visual stimuli of color regions in video frames. The tool utilizes feature signatures based on color and position, extracted from the key frames in a preprocessing phase. This content representation makes it easy for users to draw simple query sketches and enables effective and efficient processing of those sketches. Besides simple user-drawn sketches of desired scenes, the tool also supports several additional automatic content-based analysis techniques, enabling restrictions to various concepts such as faces or shapes.

Jakub Lokoč, Adam Blažek, Tomáš Skopal
NII-UIT: A Tool for Known Item Search by Sequential Pattern Filtering

This paper presents an interactive tool for searching for a known item in a video or a video archive. To rapidly select the relevant segment, we use query patterns formulated by users for filtering. The patterns can be formulated by drawing color sketches or selecting predefined concepts. In particular, our tool supports users in defining patterns for sequences of consecutive segments, for instance, sequences of occurrences of concepts. Such patterns are called sequential patterns and are more expressive in describing users' search intentions. Besides that, the user interface is organized in a coarse-to-fine manner, so that users can quickly scan the set of candidate segments. By using color-based and concept-based filters, our tool can deal with both visual and descriptive known-item search.

Thanh Duc Ngo, Vu Hoang Nguyen, Vu Lam, Sang Phan, Duy-Dinh Le, Duc Anh Duong, Shin’ichi Satoh
Backmatter
Metadata
Title
MultiMedia Modeling
Edited by
Cathal Gurrin
Frank Hopfgartner
Wolfgang Hurst
Håvard Johansen
Hyowon Lee
Noel O’Connor
Copyright year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-04117-9
Print ISBN
978-3-319-04116-2
DOI
https://doi.org/10.1007/978-3-319-04117-9
