2019 | Book

MultiMedia Modeling

25th International Conference, MMM 2019, Thessaloniki, Greece, January 8–11, 2019, Proceedings, Part I

Editors: Ioannis Kompatsiaris, Dr. Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, Dr. Stefanos Vrochidis

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

The two-volume set LNCS 11295 and 11296 constitutes the thoroughly refereed proceedings of the 25th International Conference on MultiMedia Modeling, MMM 2019, held in Thessaloniki, Greece, in January 2019.

Of the 172 submitted full papers, 49 were selected for oral presentation and 47 for poster presentation; in addition, 6 demonstration papers, 5 industry papers, 6 workshop papers, and 6 papers for the Video Browser Showdown 2019 were accepted. All papers presented were carefully reviewed and selected from 204 submissions.

Table of Contents

Frontmatter
Correction to: MultiMedia Modeling

In the original version of the book, the following belated corrections have been incorporated:

Ioannis Kompatsiaris, Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, Stefanos Vrochidis

Regular and Special Session Papers

Frontmatter
Sentiment-Aware Multi-modal Recommendation on Tourist Attractions

For tourist attraction recommendation, three essential aspects must be considered: tourist preferences, attraction themes, and sentiments on attraction themes. Using the vast multi-modal media available on the Internet, this paper aims to develop an efficient tourist attraction recommendation solution that covers all three aspects. To achieve this goal, we propose a probabilistic generative model called the Sentiment-aware Multi-modal Topic Model (SMTM), whose advantages are fourfold: (1) we separate tourists and attractions into two domains to better recover tourist topics and attraction themes; (2) we investigate tourists' sentiments on topics to retain the preferred ones; (3) the recommended attraction is guaranteed to have positive sentiment on the related attraction themes; (4) the multi-modal data are utilized to enhance recommendation accuracy. Qualitative and quantitative evaluation results validate the effectiveness of our method.

Junyi Wang, Bing-Kun Bao, Changsheng Xu
SCOD: Dynamical Spatial Constraints for Object Detection

One-stage detectors are widely used in real-world computer vision applications nowadays due to their competitive accuracy and very fast speed. However, for high resolution (e.g., $$512 \times 512$$) input, most one-stage detectors run too slowly to process such images in real time. In this paper, we propose a novel one-stage detector called Dynamical Spatial Constraints for Object Detection (SCOD). We apply dynamical spatial constraints to address multiple detections of the same object and use two parallel classifiers to address the serious class imbalance. Experimental results show that SCOD makes a significant improvement in speed and achieves competitive accuracy on the challenging PASCAL VOC2007 and PASCAL VOC2012 benchmarks. On VOC2007 test, SCOD runs at 41 FPS with a mAP of 80.4%, which is $$2.2\times$$ faster than SSD that runs at 19 FPS with a mAP of 79.8%. On VOC2012 test, SCOD runs at 71 FPS with a mAP of 75.4%, which is $$1.8\times$$ faster than YOLOv2 that runs at 40 FPS with a mAP of 73.4%.

Kai-Jun Zhang, Cheng-Hao Guo, Zhong-Han Niu, Lu-Fei Liu, Yu-Bin Yang
STMP: Spatial Temporal Multi-level Proposal Network for Activity Detection

We propose a network for unconstrained scene activity detection called STMP, a deep learning method that encodes effective multi-level spatiotemporal information and performs accurate temporal activity localization and recognition. To encode meaningful spatial information and generate high-quality activity proposals at a fixed temporal scale, a spatial feature hierarchy is introduced in this approach. Meanwhile, to deal with activities of various time scales, a temporal feature hierarchy is proposed to represent activities of different temporal scales. The core component of STMP is STFH, a unified network that implements the Spatial and Temporal Feature Hierarchy. On each level of STFH, an activity proposal detector is trained to detect activities at its inherent temporal scale, which allows STMP to make full use of multi-level spatiotemporal information. Most importantly, STMP is a simple, fast and end-to-end trainable model due to its pure and unified framework. We evaluate STMP on two challenging activity detection datasets: we achieve state-of-the-art results on THUMOS’14 (about 9.3% absolute improvement over the previous state-of-the-art approach R-C3D [1]) and obtain comparable results on ActivityNet1.3.

Guang Chen, Yuexian Zou, Can Zhang
Hierarchical Vision-Language Alignment for Video Captioning

We have witnessed promising advances in video captioning in recent years. It is a challenging task because it is hard to capture the semantic correspondences between visual content and language descriptions. Different granularities of language components (e.g. words, phrases and sentences) correspond to different granularities of visual elements (e.g. objects, visual relations and regions of interest). These correspondences can provide multi-level alignments and complementary information for transforming visual content into language descriptions. Therefore, we propose an Attention Guided Hierarchical Alignment (AGHA) approach for video captioning. In the proposed approach, hierarchical vision-language alignments, including object-word, relation-phrase, and region-sentence alignments, are extracted from a well-learned model that suits multiple tasks related to vision and language. They are then embedded into parallel encoder-decoder streams to provide multi-level semantic guidance and rich complementary information for description generation. In addition, multi-granularity visual features are exploited to obtain a coarse-to-fine understanding of complex video content, where an attention mechanism is applied to extract comprehensive visual discrimination to enhance video captioning. Experimental results on the widely used MSVD dataset demonstrate that AGHA achieves promising improvement on popular evaluation metrics.

Junchao Zhang, Yuxin Peng
Task-Driven Biometric Authentication of Users in Virtual Reality (VR) Environments

In this paper, we provide an approach for authenticating users in virtual reality (VR) environments by tracking the behavior of users as they perform goal-oriented tasks, such as throwing a ball at a target. With the pervasion of VR in mission-critical applications such as manufacturing, navigation, military training, education, and therapy, validating the identity of users using VR systems is becoming paramount to prevent tampering of the VR environments, and to ensure user safety. Unlike prior work, which uses PIN and pattern based passwords to authenticate users in VR environments, our approach authenticates users based on their natural interactions within the virtual space by matching the 3D trajectory of the dominant hand gesture controller in a display-based head-mounted VR system to a library of trajectories. To handle natural differences in wait times between multiple parts of an action such as picking a ball and throwing it, our matching approach uses a symmetric sum-squared distance between the nearest neighbors across the query and library trajectories. Our work enables seamless authentication without requiring the user to stop their activity and enter specific credentials, and can be used to continually validate the identity of the user. We conduct a pilot study with 14 subjects throwing a ball at a target in VR using the gesture controller and achieve a maximum accuracy of 92.86% by comparing to a library of 10 trajectories per subject, and 90.00% by comparing to 6 trajectories per subject.
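The matching step described in this abstract can be illustrated with a minimal sketch. It assumes trajectories are lists of 3D controller positions and that the "symmetric sum-squared distance between the nearest neighbors" is the sum of both one-way nearest-neighbor costs; the function names and the absence of any normalization are assumptions for illustration, not details taken from the paper.

```python
from math import inf
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Trajectory = Sequence[Point]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def one_way_cost(source: Trajectory, target: Trajectory) -> float:
    """Sum over source points of the squared distance to the nearest target point."""
    return sum(min(squared_distance(p, q) for q in target) for p in source)


def symmetric_nn_distance(a: Trajectory, b: Trajectory) -> float:
    """Symmetric cost: nearest-neighbor cost from a to b plus from b to a.

    Because each point is matched to its nearest neighbor regardless of time,
    repeated points caused by a pause do not add to the cost.
    """
    if not a or not b:
        raise ValueError("trajectories must be non-empty")
    return one_way_cost(a, b) + one_way_cost(b, a)


def identify(query: Trajectory, library: Dict[str, List[Trajectory]]) -> str:
    """Return the user whose library trajectory is closest to the query."""
    best_user, best_cost = "", inf
    for user, trajectories in library.items():
        for trajectory in trajectories:
            cost = symmetric_nn_distance(query, trajectory)
            if cost < best_cost:
                best_user, best_cost = user, cost
    return best_user
```

In this sketch a query that pauses at the start of the throw still matches the unpaused library trajectory with zero cost, which is the property the abstract attributes to nearest-neighbor matching.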

Alexander Kupin, Benjamin Moeller, Yijun Jiang, Natasha Kholgade Banerjee, Sean Banerjee
Deep Neural Network Based 3D Articulatory Movement Prediction Using Both Text and Audio Inputs

Robust and accurate prediction of articulatory movements has various important applications, such as human-machine interaction. Various approaches have been proposed to solve the acoustic-articulatory mapping problem. However, their precision is not high enough when only acoustic features are available. Recently, deep neural networks (DNNs) have brought tremendous success in many fields. To increase the accuracy, on the one hand, we propose a new network architecture called bottleneck squeeze-and-excitation recurrent convolutional neural network (BSERCNN) for articulatory movement prediction. By introducing the squeeze-and-excitation (SE) module, BSERCNN can model the interdependencies and relationships between channels, which makes the model more efficient. On the other hand, phoneme-level text features and acoustic features are integrated as inputs to BSERCNN for better performance. Experiments show that BSERCNN achieves a state-of-the-art root-mean-squared error (RMSE) of 0.563 mm and a correlation coefficient of 0.954 with both text and audio inputs.

Lingyun Yu, Jun Yu, Qiang Ling
Subjective Visual Quality Assessment of Immersive 3D Media Compressed by Open-Source Static 3D Mesh Codecs

While studies on objective and subjective evaluation of the visual quality of compressed 3D meshes have been discussed in the literature, those studies covered the evaluation of 3D meshes either created by 3D artists or generated by a computationally expensive reconstruction process applied to high quality 3D scans. With the advent of RGB-D sensors operating at high frame rates and the utilization of fast 3D reconstruction algorithms, humans can be captured and reconstructed into a 3D representation in real time, enabling new (tele-)immersive experiences. The produced 3D mesh content is structurally different in the two cases. The first type of content is nearly perfect and clean, while the second type is much more irregular and noisy. Evaluating compression artifacts on this new type of immersive 3D media constitutes a yet unexplored scientific area. In this paper, we conduct a survey to subjectively assess the perceived fidelity of 3D meshes subjected to compression using three open-source static 3D mesh codecs compared to the original uncompressed models. The subjective evaluation of the content is conducted in a Virtual Reality setting, using the forced-choice pairwise comparison methodology with an existing reference. The results of this study are two-fold: first, the design of an experimental setup that can be used for the subjective evaluation of 3D media, and second, a mapping of the compared conditions to a continuous ranking scale. The latter can be used when selecting codecs and optimizing their compression parameters to achieve an optimum balance between bandwidth and perceived quality in tele-immersive platforms.

Kyriaki Christaki, Emmanouil Christakis, Petros Drakoulis, Alexandros Doumanoglou, Nikolaos Zioulis, Dimitrios Zarpalas, Petros Daras
Joint EPC and RAN Caching of Tiled VR Videos for Mobile Networks

In recent years, 360-degree VR (Virtual Reality) video has brought an immersive way to consume content. People can watch matches, play games and view movies by wearing VR headsets. To provide such online VR video services anywhere and anytime, the VR videos need to be delivered over wireless networks. However, due to the huge data volume and the frequent viewport updating of VR video, its delivery over mobile networks is extremely difficult. One of the difficulties for VR video streaming is latency, i.e., the necessary viewport data cannot be updated in time to keep pace with the rapid viewport motion while viewing VR videos. To deal with this problem, this paper presents a joint EPC (Evolved Packet Core) and RAN (Radio Access Network) tile-caching scheme that pushes duplicates of VR video tiles near the user end. Based on the predicted viewport popularity of the VR video, the collaborative tile data caching between the EPC and RAN is formulated as a 0-1 knapsack problem and then solved by a genetic algorithm (GA). Experimental results show that the proposed scheme achieves great improvements in saved transmission bandwidth and latency over traditional full-size video caching and over the scheme in which tiles are cached only in the EPC.
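To make the "0-1 knapsack solved by a genetic algorithm" idea concrete, here is a minimal sketch for a single cache. It assumes each tile has a value (for example, predicted popularity) and a size, and that the cache has one capacity limit; the joint EPC and RAN formulation in the paper is richer than this, and the function names, operators (truncation selection, one-point crossover, bit-flip mutation) and parameters are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def cache_value(bits: Sequence[int], values: Sequence[float],
                sizes: Sequence[float], capacity: float) -> float:
    """Total value of the selected tiles, or 0.0 if they overflow the cache."""
    used = sum(s for b, s in zip(bits, sizes) if b)
    if used > capacity:
        return 0.0
    return float(sum(v for b, v in zip(bits, values) if b))


def ga_knapsack(values: Sequence[float], sizes: Sequence[float], capacity: float,
                pop_size: int = 40, generations: int = 200,
                mutation_rate: float = 0.05, seed: int = 0) -> Tuple[List[int], float]:
    """Search for a good 0-1 selection of tiles; returns (selection, value)."""
    rng = random.Random(seed)
    n = len(values)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    # The empty cache is always feasible, so it is a safe initial answer.
    best, best_value = [0] * n, 0.0
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda bits: cache_value(bits, values, sizes, capacity),
                        reverse=True)
        top_value = cache_value(ranked[0], values, sizes, capacity)
        if top_value > best_value:
            best, best_value = ranked[0][:], top_value
        # Truncation selection: the better half survives and breeds.
        parents = ranked[: max(2, pop_size // 2)]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n) if n > 1 else 0  # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return best, best_value
```

A GA gives no optimality guarantee; the sketch only guarantees that the returned selection fits within the capacity, since infeasible selections score zero and never replace the stored best.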

Kedong Liu, Yanwei Liu, Jinxia Liu, Antonios Argyriou, Ying Ding
Foveated Ray Tracing for VR Headsets

In this work, we propose a real-time foveated ray tracing system, which mimics the non-uniform and sparse characteristic of the human retina to reduce spatial sampling. Fewer primary rays are traced in the peripheral regions of vision, while sampling frequency for the fovea region tracked by the eye tracker is maximised. Our GPU-accelerated ray tracer uses a sampling mask to generate a non-uniformly distributed set of pixels. Then, the regular Cartesian image is reconstructed using a GPU-accelerated triangulation method with barycentric interpolation. Temporal anti-aliasing is applied to reduce flickering artefacts. We perform a user study in which people evaluate the visibility of artefacts in the peripheral region of vision where sampling is reduced. This evaluation is conducted for a number of sampling masks that mimic the sensitivity to contrast of the human eye but also test different sampling strategies. The sampling that follows the gaze-dependent contrast sensitivity function is reported to generate images of the best quality. We test the performance of the whole system on a VR headset. The achieved frame rate is twice that of typical Cartesian sampling, and the sparse sampling causes only barely visible degradation of image quality.
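The reconstruction step mentioned above relies on barycentric interpolation inside triangles formed by the sparse samples. The following is a small CPU sketch of that interpolation for one pixel and one triangle; the paper's version runs on the GPU, and the function names and scalar sample values here are assumptions for illustration.

```python
from typing import Sequence, Tuple

Point2 = Tuple[float, float]


def barycentric_weights(p: Point2, a: Point2, b: Point2, c: Point2) -> Tuple[float, float, float]:
    """Barycentric coordinates of p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b


def interpolate(p: Point2, triangle: Sequence[Point2], samples: Sequence[float]) -> float:
    """Value at pixel p, blended from the three traced samples at the triangle's vertices."""
    weights = barycentric_weights(p, *triangle)
    return sum(w * s for w, s in zip(weights, samples))
```

A pixel lies inside the triangle exactly when all three weights are non-negative, which is how a reconstruction pass can decide which triangle covers a given output pixel.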

Adam Siekawa, Michał Chwesiuk, Radosław Mantiuk, Rafał Piórkowski
Preferred Model of Adaptation to Dark for Virtual Reality Headsets

The human visual system has the ability to adapt to various lighting conditions. In this work, we simulate the dark adaptation process using a custom virtual reality framework. The high dynamic range (HDR) image is rendered, tone mapped and displayed in the head-mounted-display (HMD) equipped with the eye tracker. Observer’s adaptation state is predicted by analysing the HDR image in the surrounding of his/her gaze point. This state is applied during tone mapping to simulate how an observer would see the whole scene being adapted to an arbitrary luminance level. We take into account the spatial extent of the visual adaptation, loss of colour vision, and time course of adaptation. Our main goal is to mimic the adaptation process naturally implemented by the human visual system. However, we prove in the psychophysical experiments that people prefer shorter adaptation while watching a virtual environment. We also justify that a complex perceptual model of adaptation to dark can be replaced with simpler linear formulas.

Marek Wernikowski, Radosław Mantiuk, Rafał Piórkowski
From Movement to Events: Improving Soccer Match Annotations

Match analysis has become an important task in everyday work at professional soccer clubs in order to improve team performance. Video analysts regularly spend up to several days analyzing and summarizing matches based on tracked and annotated match data. Although extensive capabilities already exist to track the movement of players and the ball from multimedia data sources such as video recordings, there is no capability to sufficiently detect dynamic and complex events within these data. As a consequence, analysts have to rely on manually created annotations, which are very time-consuming and expensive to create. We propose a novel method for the semi-automatic definition and detection of events based entirely on movement data of players and the ball. Incorporating Allen’s interval algebra into a visual analytics system, we enable analysts to visually define as well as search for complex, hierarchical events. We demonstrate the usefulness of our approach by quantitatively comparing our automatically detected events with manually annotated events from a professional data provider as well as through several expert interviews. The results of our evaluation show that the required annotation time for complete matches can be reduced to a few seconds with our system while achieving a similar level of performance.
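Allen's interval algebra, which the abstract uses to define events, distinguishes thirteen relations between two time intervals. The sketch below classifies a pair of intervals and shows how a composite event could be expressed with those relations; the `is_pass` rule and its interval names are hypothetical examples, not event definitions from the paper.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return which of Allen's thirteen relations holds between a and b."""
    (a0, a1), (b0, b1) = a, b
    if not (a0 < a1 and b0 < b1):
        raise ValueError("intervals must satisfy start < end")
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    return "overlaps" if a0 < b0 else "overlapped-by"


def is_pass(possession_a: Interval, ball_free: Interval, possession_b: Interval) -> bool:
    """Hypothetical composite event: possession, then free ball, then possession."""
    return (allen_relation(possession_a, ball_free) == "meets"
            and allen_relation(ball_free, possession_b) == "meets")
```

Hierarchical events follow the same pattern: the interval spanned by a detected composite event can itself be fed into `allen_relation` against other intervals.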

Manuel Stein, Daniel Seebacher, Tassilo Karge, Tom Polk, Michael Grossniklaus, Daniel A. Keim
Multimodal Video Annotation for Retrieval and Discovery of Newsworthy Video in a News Verification Scenario

This paper describes the combination of advanced technologies for social-media-based story detection, story-based video retrieval and concept-based video (fragment) labeling under a novel approach for multimodal video annotation. This approach involves textual metadata, structural information and visual concepts, together with a multimodal analytics dashboard that enables journalists to discover videos of news events, posted to social networks, in order to verify the details of the events shown. The paper outlines the characteristics of each individual method and describes how these techniques are blended to facilitate the content-based retrieval, discovery and summarization of (parts of) news videos. A set of case-driven experiments conducted with the help of journalists indicates that the proposed multimodal video annotation mechanism, combined with a professional analytics dashboard which presents the collected and generated metadata about the news stories and their visual summaries, can support journalists in their content discovery and verification work.

Lyndon Nixon, Evlampios Apostolidis, Foteini Markatopoulou, Ioannis Patras, Vasileios Mezaris
Integration of Exploration and Search: A Case Study of the M$$^3$$ Model

Effective support for multimedia analytics applications requires exploration and search to be integrated seamlessly into a single interaction model. Media metadata can be seen as defining a multidimensional media space, casting multimedia analytics tasks as exploration, manipulation and augmentation of that space. We present an initial case study of integrating exploration and search within this multidimensional media space. We extend the M$$^3$$ model, initially proposed as a pure exploration tool, and show that it can be elegantly extended to allow searching within an exploration context and exploring within a search context. We then evaluate the suitability of relational database management systems, as representatives of today’s data management technologies, for implementing the extended M$$^3$$ model. Based on our results, we finally propose some research directions for scalability of multimedia analytics.

Snorri Gíslason, Björn Þór Jónsson, Laurent Amsaleg
Face Swapping for Solving Collateral Privacy Issues in Multimedia Analytics

A wide range of components of multimedia analytics systems relies on visual content that is used for supervised (e.g., classification) and unsupervised (e.g., clustering) machine learning methods. This content may contain privacy sensitive information, e.g., show faces of persons. In many cases it is just an inevitable side-effect that persons appear in the content, and the application may not require identification – a situation which we call “collateral privacy issues”. We propose de-identification of faces in images by using a generative adversarial network to generate new face images, and use them to replace faces in the original images. We demonstrate that face swapping does not impact the performance of visual descriptor matching and extraction.

Werner Bailer
Exploring the Impact of Training Data Bias on Automatic Generation of Video Captions

A major issue in machine learning is the availability of training data. While this historically referred to the availability of a sufficient volume of training data, recently this has shifted to the availability of sufficient unbiased training data. In this paper we focus on the effect of training data bias on an emerging multimedia application, the automatic captioning of short video clips. We use subsets of the same training data to generate different models for video captioning using the same machine learning technique, and we evaluate the performance of the different training data subsets using a well-known video caption benchmark, TRECVid. We train using the MSR-VTT video-caption pairs, and we prune this set to make the captions describing a video more homogeneously similar, or more diverse, or we prune randomly. We then assess the effectiveness of caption generation models trained with these variations using automatic metrics as well as direct assessment by human assessors. Our findings are preliminary and show that randomly pruning captions from the training data yields the worst performance and that pruning to make the data more homogeneous, or diverse, does improve performance slightly when compared to random. Our work points to the need for more training data: more video clips and, more importantly, more captions for those videos.

Alan F. Smeaton, Yvette Graham, Kevin McGuinness, Noel E. O’Connor, Seán Quinn, Eric Arazo Sanchez
Fashion Police: Towards Semantic Indexing of Clothing Information in Surveillance Data

Indexing and retrieval of clothing based on style, similarity and colour has been extensively studied in the field of fashion with good results. However, retrieval of real-world clothing examples based on witness descriptions is of great interest for security and law enforcement applications. Manually searching databases or CCTV footage to identify matching examples is time consuming and ineffective. Therefore we propose using machine learning to automatically index video footage based on general clothing types and evaluate the performance using existing public datasets. The challenge is that these datasets are highly sanitised with clean backgrounds and front-facing examples and are insufficient for training detectors and classifiers for real-world video footage. In this paper we highlight the deficiencies of using these datasets for security applications and propose a methodology for collecting a new dataset, as well as examining several ethical issues.

Owen Corrigan, Suzanne Little
CNN-Based Non-contact Detection of Food Level in Bottles from RGB Images

In this paper, we present an approach that detects the level of food in store-bought containers using deep convolutional neural networks (CNNs) trained on RGB images captured using an off-the-shelf camera. Our approach addresses three challenges—the diversity in container geometry, the large variations in shapes and appearances of labels on store-bought containers, and the variability in color of container contents—by augmenting the data used to train the CNNs using printed labels with synthetic textures attached to the training bottles, interchanging the contents of the bottles of the training containers, and randomly altering the intensities of blocks of pixels in the labels and at the bottle borders. Our approach provides an average level detection accuracy of 92.4% using leave-one-out cross-validation on 10 store-bought bottles of varying geometries, label appearances, label shapes, and content colors.

Yijun Jiang, Elim Schenck, Spencer Kranz, Sean Banerjee, Natasha Kholgade Banerjee
Personalized Recommendation of Photography Based on Deep Learning

The key to the picture recommendation problem lies in the representation of image features. Many methods exist for image feature description, and some are mature. However, due to the particular nature of the photographic works we are concerned with, traditional recommendation based on original features or labels does not give good results. For our problem, the discovery of image style features is very important. Our main contribution is an optimized feature representation method for unlabeled data sets, trained with a deep convolutional neural network (CNN) and then used for recommendation. Combined with the latent factor model, the user features and image style features are deeply characterized. Through extensive experiments, we show that our method achieves better recommendation results than other mainstream recommendation algorithms on unlabeled data sets.

Zhixiang Ji, Jie Tang, Gangshan Wu
Two-Level Attention with Multi-task Learning for Facial Emotion Estimation

The Valence-Arousal model can represent complex human emotions, including slight changes of emotion. Most prior work on facial emotion estimation only considered laboratory data and used video, speech or other multi-modal features. The effect of these methods when applied to static images in the real world is unknown. In this paper, a two-level attention with multi-task learning (MTL) framework is proposed for facial emotion estimation on static images. The features of the corresponding regions are automatically extracted and enhanced by a first-level attention mechanism. We then design a practical structure to process the features extracted by the first-level attention. Next, we utilize a Bi-directional Recurrent Neural Network (Bi-RNN) with self-attention (second-level attention) to make full use of the relationships among these features adaptively. This can be summarized as a combination of global and local information. In addition, we exploit MTL to estimate the values of valence and arousal simultaneously, which makes use of the correlation between the two tasks. The quantitative results on the AffectNet dataset demonstrate the superiority of the proposed framework. In addition, extensive experiments were carried out to analyze the effectiveness of the different components.

Xiaohua Wang, Muzi Peng, Lijuan Pan, Min Hu, Chunhua Jin, Fuji Ren
User Interaction for Visual Lifelog Retrieval in a Virtual Environment

Efficient retrieval of lifelog information is an ongoing area of research due to the multifaceted nature, and ever increasing size of lifelog datasets. Previous studies have examined lifelog exploration on conventional hardware platforms, but in this paper we describe a novel approach to lifelog retrieval using virtual reality. The focus of this research is to identify what aspects of lifelog retrieval can be effectively translated from a conventional to a virtual environment and if it provides any benefit to the user. The most widely available lifelog datasets for research are primarily image-based and focus on continuous capture from a first-person perspective. These large image corpora are often enhanced by image processing techniques and various other metadata. Despite the rapidly maturing nature of virtual reality as a platform, there has been very little investigation into user interaction within the context of lifelogging. The experiment outlined in this work seeks to evaluate four different virtual reality user interaction approaches to lifelog retrieval. The prototype system used in this experiment also competed at the Lifelog Search Challenge at ACM ICMR 2018 where it ranked first place.

Aaron Duane, Cathal Gurrin
Query-by-Dancing: A Dance Music Retrieval System Based on Body-Motion Similarity

This paper presents Query-by-Dancing, a dance music retrieval system that enables a user to retrieve music using dance motions. When dancers search for music to play when dancing, they sometimes find it by referring to online dance videos in which the dancers use motions similar to their own dance. However, previous music retrieval systems could not support retrieval specialized for dancing because they do not accept dance motions as a query. Therefore, we developed our Query-by-Dancing system, which uses a video of a dancer (user) as the input query to search a database of dance videos. The query video is recorded using an ordinary RGB camera that does not obtain depth information, like a smartphone camera. The poses and motions in the query are then analyzed and used to retrieve dance videos with similar poses and motions. The system then enables the user to browse the music attached to the videos it retrieves so that the user can find a piece that is appropriate for their dancing. An interesting problem here is that a simple search for the most similar videos based on dance motions sometimes includes results that do not match the intended dance genre. We solved this by using a novel measure similar to tf-idf to weight the importance of dance motions when retrieving videos. We conducted comparative experiments with 4 dance genres and confirmed that the system gained an average of 3 or more evaluation points for 3 dance genres (waack, pop, break) and that our proposed method was able to deal with different dance genres.
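The abstract says only that the weighting measure is "similar to tf-idf", so the following sketch shows plain tf-idf as a point of reference. It assumes that dance motions have already been quantized into discrete motion tokens per video; that quantization, the token names, and the function name are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def motion_tfidf(videos: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Weight each motion token per video by term frequency times inverse document frequency.

    Tokens that occur in every video (generic motions) get weight 0,
    while tokens that are rare across the database get higher weight.
    """
    n_videos = len(videos)
    doc_freq: Counter = Counter()
    for tokens in videos.values():
        doc_freq.update(set(tokens))  # count each token once per video
    weights: Dict[str, Dict[str, float]] = {}
    for name, tokens in videos.items():
        counts = Counter(tokens)
        weights[name] = {
            token: (count / len(tokens)) * math.log(n_videos / doc_freq[token])
            for token, count in counts.items()
        }
    return weights
```

Under this weighting, a motion shared by all genres contributes nothing to similarity, which matches the stated goal of keeping results within the intended dance genre.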

Shuhei Tsuchida, Satoru Fukayama, Masataka Goto
Joint Visual-Textual Sentiment Analysis Based on Cross-Modality Attention Mechanism

Recently, many researchers have focused on joint visual-textual sentiment analysis since it can better extract user sentiments toward events or topics. In this paper, we propose that visual and textual information should differ in their contribution to sentiment analysis. Our model learns a robust joint visual-textual representation by incorporating a cross-modality attention mechanism and semantic embedding learning based on a bidirectional recurrent neural network. Experimental results show that our model outperforms existing state-of-the-art models in sentiment analysis on real datasets. In addition, we investigate different variants of the proposed model and analyze the effects of semantic embedding learning and the cross-modality attention mechanism in order to provide deeper insight into how these two techniques help the learning of a joint visual-textual sentiment classifier.

Xuelin Zhu, Biwei Cao, Shuai Xu, Bo Liu, Jiuxin Cao
Deep Hashing with Triplet Labels and Unification Binary Code Selection for Fast Image Retrieval

With the significant breakthrough of computer vision using convolutional neural networks, deep learning has been applied to image hashing algorithms for efficient image retrieval on large-scale datasets. Inspired by the Deep Supervised Hashing (DSH) algorithm, we propose to use a triplet loss function with an online training strategy that takes three images as training inputs to learn compact binary codes. A relaxed triplet loss function is designed to maximize the discriminability while taking into account the balance property of the output space. In addition, a novel unification binary code selection algorithm is proposed to represent the scalable binary code in an efficient way, which fixes the problem that conventional deep hashing methods must be retrained to generate binary codes of different lengths. Experiments on two well-known datasets, CIFAR-10 and NUS-WIDE, show that the proposed DSH with unification binary code selection can achieve promising performance compared with conventional image hashing and CNN-based hashing algorithms.
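For readers unfamiliar with triplet-based hashing, the sketch below shows the standard margin-based triplet loss on real-valued network outputs, followed by sign binarization and Hamming comparison. This is the textbook form, not the relaxed loss or the unification binary code selection proposed in the paper; the margin value and function names are assumptions.

```python
from typing import List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two real-valued code vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Hinge loss that is zero once the negative is farther than the positive by the margin."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


def binarize(code: Sequence[float]) -> List[int]:
    """Turn real-valued outputs into a binary hash code by thresholding at zero."""
    return [1 if x >= 0 else 0 for x in code]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits; used to rank database images at retrieval time."""
    return sum(x != y for x, y in zip(a, b))
```

Training pushes the loss toward zero for every (anchor, similar image, dissimilar image) triplet, so that after binarization similar images end up with a small Hamming distance.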

Chang Zhou, Lai-Man Po, Mengyang Liu, Wilson Y. F. Yuen, Peter H. W. Wong, Hon-Tung Luk, Kin Wai Lau, Hok Kwan Cheung
Incremental Training for Face Recognition

Many applications require the identification of persons in video. However, the set of persons of interest is not always known in advance, e.g., in applications for media production and archiving. Additional training samples may be added during the analysis, or groups of faces of one person may need to be identified retrospectively. In order to avoid re-running the face recognition, we propose an approach that supports fast incremental training based on a state of the art face detection and recognition pipeline using CNNs and an online random forest as a classifier. We also describe an algorithm to use the incremental training approach to automatically train classifiers for unknown persons, including safeguards to avoid noise in the training data. We show that the approach reaches state of the art performance on two datasets when using all training samples, but performs better with few or even only one training sample.

Martin Winter, Werner Bailer
Character Prediction in TV Series via a Semantic Projection Network

The goal of this paper is to automatically recognize characters in popular TV series. In contrast to conventional approaches, which rely on weak supervision afforded by transcripts, subtitles or character facial data, we formulate the problem as a multi-label classification task that requires only label-level supervision. We propose a novel semantic projection network consisting of two stacked subnetworks with specially designed constraints. The first subnetwork is a contractive autoencoder which focuses on reconstructing feature activations extracted from a pre-trained single-label convolutional neural network (CNN). The second subnetwork functions as a region-based multi-label classifier which produces character labels for the input video frame and also reconstructs the input visual feature from the mapped semantic label space. Extensive experiments show that the proposed model achieves state-of-the-art performance in comparison with recent approaches on three challenging TV series datasets (The Big Bang Theory, The Defenders and Nirvana in Fire).

Ke Sun, Zhuo Lei, Jiasong Zhu, Xianxu Hou, Bozhi Liu, Guoping Qiu
A Test Collection for Interactive Lifelog Retrieval

There is a long history of repeatable and comparable evaluation in Information Retrieval (IR). However, thus far, no shared test collection exists that has been designed to support interactive lifelog retrieval. In this paper we introduce the LSC2018 collection, which is designed to evaluate the performance of interactive retrieval systems. We describe the features of the dataset and we report on the outcome of the first Lifelog Search Challenge (LSC), which used the dataset in an interactive competition at ACM ICMR 2018.

Cathal Gurrin, Klaus Schoeffmann, Hideo Joho, Bernd Munzer, Rami Albatal, Frank Hopfgartner, Liting Zhou, Duc-Tien Dang-Nguyen
SEPHLA: Challenges and Opportunities Within Environment - Personal Health Archives

It is well known that environment and human health have a close relationship. Many researchers have pointed out the strong association between the condition of an environment (e.g. pollutant concentrations, weather variables) and the state of health (e.g. cardio-respiratory, psychophysiology) [1, 10]. While environment information can be recorded accurately by sensors installed in stations, most of the health information comes from interviews, surveys, or records from medical organizations. The common approach for collecting and analyzing data to discover the association between environment and health outcomes is to first isolate a predefined location and then collect all related data inside that location. The size of this location can be scaled from local (e.g. city, province, country) to global (e.g. region, worldwide) scopes. Nevertheless, this approach cannot give a close-up perspective at the individual scale (i.e. the reaction of an individual’s health to his/her surrounding environment during his/her lifetime). To fill this gap, we create SEPHLA: the surrounding-environment personal-health lifelog archive. The purpose of this archive is to provide a dataset at the individual scale by collecting psychophysiological (e.g. perception, heart rate), pollutant concentration (e.g. $$PM_{2.5}$$, $$NO_{2}$$, $$O_{3}$$), weather variable (e.g. temperature, humidity), and urban nature (e.g. GPS, images, comments) data via wearable sensors and smart-phones/lifelog-cameras attached to each person. We explore and exploit this archive to better understand the impact of an environment on human health at the individual level. We also address the challenges of organizing, extending, and searching the SEPHLA archive.

Tomohiro Sato, Minh-Son Dao, Kota Kuribayashi, Koji Zettsu
Athens Urban Soundscape (ATHUS): A Dataset for Urban Soundscape Quality Recognition

Soundscape can be regarded as the auditory landscape, conceived individually or at a collaborative level. This paper presents ATHUS (ATHens Urban Soundscape), a dataset of audio recordings of ambient urban sounds, which has been annotated in terms of the corresponding perceived soundscape quality. To build our dataset, several users have recorded sounds using a simple smartphone application, which they also used to annotate the recordings in terms of the perceived quality of the soundscape (i.e. level of “pleasantness”), in a range of 1 (unbearable) to 5 (optimal). The dataset has been made publicly available (at http://users.iit.demokritos.gr/~tyianak/soundscape ) in an audio feature representation form, so that it can be used directly in a supervised machine learning pipeline without the need for feature extraction. In addition, this paper presents and publicly provides ( https://github.com/tyiannak/soundscape_quality ) a baseline approach, which demonstrates how the dataset can be used to train a supervised model to predict soundscape quality levels. Experiments under various setups using this library have demonstrated that Support Vector Machine Regression outperforms SVM Classification for the particular task, which is to be expected if we consider the gradual nature of the soundscape quality labels. The goal of this paper is to provide machine learning engineers working on audio analytics with a first step towards the automatic recognition of soundscape quality in urban spaces, which could lead to powerful assessment tools in the hands of policy makers with regard to noise pollution and sustainable urban living.

Theodoros Giannakopoulos, Margarita Orfanidi, Stavros Perantonis
V3C – A Research Video Collection

With the widespread use of smartphones as recording devices and the massive growth in bandwidth, the number and volume of video collections have increased significantly in recent years. This poses novel challenges to the management of these large-scale video data and especially to the analysis of and retrieval from such video collections. At the same time, existing video datasets used for research and experimentation are either not large enough to represent current collections or do not reflect the properties of video commonly found on the Internet in terms of content, length, or resolution. In this paper, we introduce the Vimeo Creative Commons Collection, in short V3C, a collection of 28’450 videos (with an overall length of about 3’800 h) published under a Creative Commons license on Vimeo. V3C comes with a shot segmentation for each video, together with the resulting keyframes in original as well as reduced resolution and additional metadata. It is intended to be used from 2019 at the International large-scale TREC Video Retrieval Evaluation campaign (TRECVid).

Luca Rossetto, Heiko Schuldt, George Awad, Asad A. Butt
Image Aesthetics Assessment Using Fully Convolutional Neural Networks

This paper presents a new method for assessing the aesthetic quality of images. Based on the findings of previous works on this topic, we propose a method that addresses the shortcomings of existing ones by: (a) making it possible to feed higher-resolution images into the network, by introducing a fully convolutional neural network as the classifier; (b) maintaining the original aspect ratio of images at the input of the network, to avoid distortions caused by re-scaling; and (c) combining local and global features from the image to assess its aesthetic quality. The proposed method is shown to achieve state-of-the-art results on a standard large-scale benchmark dataset.

Konstantinos Apostolidis, Vasileios Mezaris
Detecting Tampered Videos with Multimedia Forensics and Deep Learning

User-Generated Content (UGC) has become an integral part of the news reporting cycle. As a result, the need to verify videos collected from social media and Web sources is becoming increasingly important for news organisations. While video verification is attracting a lot of attention, there has been limited effort so far in applying video forensics to real-world data. In this work we present an approach for automatic video manipulation detection inspired by manual verification approaches. In a typical manual verification setting, video filter outputs are visually interpreted by human experts. We use two such forensics filters designed for manual verification, one based on Discrete Cosine Transform (DCT) coefficients and a second based on video requantization errors, and combine them with Deep Convolutional Neural Networks (CNN) designed for image classification. We compare the performance of the proposed approach to other works from the state of the art, and discover that, while competing approaches perform better when trained with videos from the same dataset, one of the proposed filters demonstrates superior performance in cross-dataset settings. We discuss the implications of our work and the limitations of the current experimental setup, and propose directions for future research in this area.

Markos Zampoglou, Foteini Markatopoulou, Gregoire Mercier, Despoina Touska, Evlampios Apostolidis, Symeon Papadopoulos, Roger Cozien, Ioannis Patras, Vasileios Mezaris, Ioannis Kompatsiaris
Improving Robustness of Image Tampering Detection for Compression

The task of verifying the originality and authenticity of images puts numerous constraints on tampering detection algorithms. Since most images are acquired on the internet, there is a significant probability that they have undergone transformations such as compression, noising, resizing and/or filtering, both before and after the possible alteration. Therefore, it is essential to improve the robustness of tampered image detection algorithms for such manipulations. As compression is the most common type of post-processing, we propose in our work a robust framework against this particular transformation. Our experiments on benchmark datasets show the contribution of our proposal for camera model identification and image tampering detection compared to recent literature approaches.

Boubacar Diallo, Thierry Urruty, Pascal Bourdon, Christine Fernandez-Maloigne
Audiovisual Annotation Procedure for Multi-view Field Recordings

Audio and video parts of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet, automatic analyses of these documents are usually based on separate audio and video annotations. With regard to the audiovisual content, these annotations could be incomplete or not relevant. Besides, the expanding possibilities of creating audiovisual documents lead us to consider different kinds of content, including videos filmed in uncontrolled conditions (i.e. field recordings), or scenes filmed from different points of view (multi-view). In this paper we propose an original procedure to produce manual annotations in different contexts, including multi-modal and multi-view documents. This procedure, based on using both audio and video annotations, ensures consistency when considering audio or video only, and additionally provides audiovisual information at a richer level. Finally, different applications are made possible when considering such annotated data. In particular, we present an example application in a network of recordings in which our annotations allow multi-source retrieval using mono or multi-modal queries.

Patrice Guyot, Thierry Malon, Geoffrey Roman-Jimenez, Sylvie Chambon, Vincent Charvillat, Alain Crouzil, André Péninou, Julien Pinquier, Florence Sèdes, Christine Sénac
A Robust Multi-Athlete Tracking Algorithm by Exploiting Discriminant Features and Long-Term Dependencies

This paper addresses the multi-athlete tracking problem. Athlete tracking is key to making sports video analysis more effective and practical. One great challenge faced by multi-athlete tracking is that athletes, especially those in the same team, share a very similar appearance; thus, most existing multi-object tracking (MOT) approaches are hardly applicable to this task. To address this problem, we put forward a novel triple-stream network which can capture long-term dependencies by exploiting pose information to better distinguish different athletes. The method is motivated by the fact that the poses of athletes are distinct from each other over a period of time because they play different roles in the team, and thus can be used as a strong feature to match the correct athletes. We design our Multi-Athlete Tracking (MAT) model on top of the online tracking-by-detection paradigm, whereby bounding boxes from the output of a detector are connected across video frames, and improve it in two respects. Firstly, we propose a Pose-based Triple Stream Network (PTSN) based on Long Short-Term Memory (LSTM) networks, which are capable of modeling and capturing more subtle differences between athletes. Secondly, based on PTSN, we propose a multi-athlete tracking algorithm that is robust to noisy detection and occlusion. We demonstrate the effectiveness of our method on a collection of volleyball videos by comparing it with recent advanced multi-object trackers.

Nan Ran, Longteng Kong, Yunhong Wang, Qingjie Liu
Early Identification of Oil Spills in Satellite Images Using Deep CNNs

Oil spill pollution poses a significant threat to oceanic and coastal ecosystems. A continuous monitoring framework with automatic detection capabilities could be valuable as an early warning system to minimize the response time of the authorities and prevent environmental disasters. The use of Synthetic Aperture Radar (SAR) data acquired from satellites has received considerable attention in remote sensing and image analysis applications for disaster management, due to the wide area coverage and the all-weather capabilities. Over the past few years, multiple solutions have been proposed to identify oil spills over the sea surface by processing SAR images. In addition, deep convolutional neural networks (DCNN) have shown remarkable results in a wide variety of image analysis applications and could be deployed to surpass the performance of previously proposed methods. This paper describes the development of an image analysis approach utilizing the benefits of a deep CNN combined with SAR imagery to establish an early warning system for oil spill pollution identification. SAR images are semantically segmented into multiple areas of interest including oil spills, look-alikes, land areas, sea surface and ships. The model was trained and tested using multiple SAR images, acquired from the Copernicus Open Access Hub and manually annotated. The dataset is a result of Sentinel-1 missions and EMSA records for the relevant pollution events. The conducted experiments demonstrate that the deployed DCNN model can accurately discriminate oil spills from other instances, providing the relevant authorities with a valuable tool to manage an upcoming disaster effectively.

Marios Krestenitis, Georgios Orfanidis, Konstantinos Ioannidis, Konstantinos Avgerinakis, Stefanos Vrochidis, Ioannis Kompatsiaris
Point Cloud Colorization Based on Densely Annotated 3D Shape Dataset

This paper introduces DensePoint, a densely sampled and annotated point cloud dataset containing over 10,000 single objects across 16 categories, built by merging different kinds of information from two existing datasets. Each point cloud in DensePoint contains 40,000 points, and each point is associated with two sorts of information: RGB value and part annotation. In addition, we propose a method for point cloud colorization by utilizing Generative Adversarial Networks (GANs). The network makes it possible to generate colours for point clouds of single objects given only the point cloud itself. Experiments on DensePoint show that there exist clear boundaries in point clouds between different parts of an object, suggesting that the proposed network is able to generate reasonably good colours. Our dataset is publicly available on the project page ( http://rwdc.nagao.nuie.nagoya-u.ac.jp/DensePoint ).

Xu Cao, Katashi Nagao
evolve2vec: Learning Network Representations Using Temporal Unfolding

In the past few years, various methods have been developed that attempt to embed graph nodes (e.g. users that interact through a social platform) onto low-dimensional vector spaces, exploiting the relationships (commonly displayed as edges) among them. The extracted vector representations of the graph nodes are then used to effectively solve machine learning tasks such as node classification or link prediction. These methods, however, focus on the static properties of the underlying networks, neglecting the temporal unfolding of those relationships. This affects the quality of representations, since the edges don’t encode the response times (i.e. speed) of the users’ (i.e. nodes’) interactions. To overcome this limitation, we propose an unsupervised method that relies on temporal random walks unfolding at the same timescale as the evolution of the underlying dataset. We demonstrate its superiority over state-of-the-art techniques on the tasks of hidden link prediction and future link forecasting. Moreover, by interpolating between the fully static and fully temporal setting, we show that the incorporation of topological information of past interactions can further increase our method’s efficiency.

Nikolaos Bastas, Theodoros Semertzidis, Apostolos Axenopoulos, Petros Daras
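To make the notion of a temporal random walk in the abstract above more concrete, here is a minimal sketch of a generic time-respecting walk over timestamped edges: from the current node, the walk may only follow an edge whose timestamp is not earlier than that of the edge it arrived on. The edge format (source, target, timestamp) and the function names are assumptions for illustration, and this is not claimed to be the sampling procedure used by evolve2vec.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Group timestamped directed edges (u, v, t) by their source node."""
    adjacency = defaultdict(list)
    for u, v, t in edges:
        adjacency[u].append((v, t))
    return adjacency


def temporal_walk(adjacency, start, length, rng):
    """Sample a walk of at most `length` nodes that never goes back in time."""
    walk = [start]
    current, now = start, float("-inf")
    for _ in range(length - 1):
        # Only edges at or after the current time are admissible.
        candidates = [(v, t) for v, t in adjacency.get(current, []) if t >= now]
        if not candidates:
            break  # no time-respecting continuation exists
        current, now = rng.choice(candidates)
        walk.append(current)
    return walk
```

For example, with edges a→b at time 1, b→c at time 2 and c→a at time 0, a walk started at a stops at c, because the only edge leaving c lies in the past. Walks sampled this way could then be passed to a skip-gram style embedding model, as is common for random-walk-based node embeddings.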
The Impact of Packet Loss and Google Congestion Control on QoE for WebRTC-Based Mobile Multiparty Audiovisual Telemeetings

While previous expensive and complex desktop video conferencing solutions had a restricted reach, the emergence of the WebRTC (Web Real-Time Communication) open framework has provided an opportunity to redefine the video conferencing communication landscape. In particular, technological advances in terms of high resolution displays, cameras, and high speed wireless access networks have set the ground for emerging multiparty video telemeeting solutions realized via mobile devices. However, deploying multiparty video communication solutions on smart phones requires optimizing video encoding parameters due to limited device processing power and dynamic wireless network conditions. In this paper, we report on a subjective user study involving 30 participants taking part in three-party audiovisual telemeetings on mobile devices. We conduct an experimental investigation of the Google Congestion Control (GCC) algorithm in light of packet loss and under various video codec configurations, with the aim of observing the impact on end user Quality of Experience (QoE). Results provide insights related to QoE-driven video encoding adaptation (in terms of bit rate, resolution, and frame rate), and show that in certain cases, adaptation invoked by GCC leads to video interruption. In the majority of other cases, we observed that it took approximately 25 s for the video stream to recover to an acceptable quality level after the temporary occurrence of network packet loss.

Dunja Vucic, Lea Skorin-Kapov
Hierarchical Temporal Pooling for Efficient Online Action Recognition

Action recognition in videos is a difficult and challenging task. Recently developed deep learning-based action recognition methods have achieved state-of-the-art performance on several action recognition benchmarks. However, these methods are inefficient, since they have a large model size and require a long runtime, which restricts their practical applications. In this study, we focus on improving the accuracy and efficiency of action recognition following the two-stream ConvNets by investigating effective video-level representations. Our motivation stems from the observation that redundant information widely exists in adjacent frames in the videos and humans do not recognize actions based on frame-level features. Therefore, to extract effective video-level features, a Hierarchical Temporal Pooling (HTP) module is proposed and a two-stream action recognition network termed HTP-Net (Two-stream) is developed, which is carefully designed to obtain effective video-level representations by hierarchically incorporating the temporal motion and spatial appearance features. It is worth noting that all two-stream action recognition methods using optical flow as one of the inputs are computationally inefficient, since calculating optical flow is time-consuming. To improve the efficiency, in our study, we also consider a variant that does not use optical flow but takes only raw RGB as input, termed HTP-Net (RGB) for a clear and concise presentation. Extensive experiments have been conducted on two benchmarks: UCF101 and HMDB51. Experimental results demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance and HTP-Net (RGB) offers competitive action recognition accuracy but is approximately 1-2 orders of magnitude faster than other state-of-the-art single stream action recognition methods. Specifically, our HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.

Can Zhang, Yuexian Zou, Guang Chen
Generative Adversarial Networks with Enhanced Symmetric Residual Units for Single Image Super-Resolution

In this paper, we propose a new generative adversarial network (GAN) with enhanced symmetric residual units for single image super-resolution (ERGAN). ERGAN consists of a generator network and a discriminator network. The former reconstructs a super-resolution image that is as similar as possible to the original image, so that the discriminator network cannot distinguish whether an image comes from the training data or is a generated sample. By using residual units in the generator network, ERGAN can retain the high-frequency features and alleviate the difficulty of training deep networks. Moreover, we construct symmetric skip-connections in the residual units, which reuse features generated at the low level and learn more high-frequency content. ERGAN reconstructs the super-resolution image at four times the length and width of the original image and exhibits better visual characteristics. Experimental results on extensive benchmark evaluation show that ERGAN significantly outperforms state-of-the-art approaches in terms of accuracy and visual quality.

Xianyu Wu, Xiaojie Li, Jia He, Xi Wu, Imran Mumtaz
3D ResNets for 3D Object Classification

During the last few years, deeper and deeper networks have been constantly proposed for addressing computer vision tasks. Residual Networks (ResNets) are the latest advancement in the field of deep learning that led to remarkable results in several image recognition and detection tasks. In this work, we modify two variants of the original ResNets, i.e. Wide Residual Networks (WRNs) and Residual of Residual Networks (RoRs), to work on 3D data and investigate for the first time, to our knowledge, their performance in the task of 3D object classification. We use a dataset containing volumetric representations of 3D models so as to fully exploit the underlying 3D information and present evidence that ‘3D ResNets’ constitute a valuable tool for classifying objects on 3D data as well.

Anastasia Ioannidou, Elisavet Chatzilari, Spiros Nikolopoulos, Ioannis Kompatsiaris
Four Models for Automatic Recognition of Left and Right Eye in Fundus Images

Fundus image analysis is crucial for eye condition screening and diagnosis and consequently for personalized health management in the long term. This paper targets left and right eye recognition, a basic module for fundus image analysis. We study how to automatically assign left-eye/right-eye labels to fundus images of the posterior pole. For this under-explored task, four models are developed. Two of them are based on optic disc localization, using an extremely simple max intensity approach and the more advanced Faster R-CNN, respectively. The other two models require no localization, but perform holistic image classification using classical Local Binary Patterns (LBP) features and a fine-tuned ResNet-18, respectively. The four models are tested on a real-world set of 1,633 fundus images from 834 subjects. The fine-tuned ResNet-18 has the highest accuracy of 0.9847. Interestingly, the LBP based model, with the trick of left-right contrastive classification, performs close to the deep model, with an accuracy of 0.9718.

Xin Lai, Xirong Li, Rui Qian, Dayong Ding, Jun Wu, Jieping Xu
On the Unsolved Problem of Shot Boundary Detection for Music Videos

This paper discusses open problems of detecting shot boundaries for music videos. The number of shots per second and the type of transition are considered to be discriminating features for music videos and potential multi-modal music features. By providing an extensive list of effects and transition types that are rare in cinematic productions but common in music videos, we emphasize the artistic use of transitions in music videos. Using examples, we discuss in detail the shortcomings of state-of-the-art approaches and provide suggestions to address these issues.

Alexander Schindler, Andreas Rauber
Enhancing Scene Text Detection via Fused Semantic Segmentation Network with Attention

Scene text detection (STD) in natural images is still challenging since text objects exhibit vast diversity in fonts, scales and orientations. Deep learning based state-of-the-art STD methods are promising; for example, PixelLink has achieved 85% accuracy on the ICDAR 2015 benchmark. Our preliminary experimental results with PixelLink have shown that its detection errors come mainly from two sources: failing to detect small-scale text objects and failing to detect ambiguous ones. In this paper, following the powerful PixelLink framework, we try to improve the STD performance by designing a new fused semantic segmentation network with attention. Specifically, an inception module is carefully designed to extract multi-scale receptive field features, aiming at enhancing feature representation. Besides, a hierarchical feature fusion module is cascaded with the inception module to capture multi-level inception features and obtain more semantic information. Finally, to suppress background disturbance and better locate the text objects, an attention module is developed to learn a probability heat map of texts, which helps to accurately infer the texts even when they are ambiguous. Experimental results on three public benchmarks demonstrate the effectiveness of our proposed method compared with the state of the art. We note that our proposed method obtains the highest F-measure on ICDAR 2015, ICDAR 2013 and MSRA-TD500, but at a higher computational cost.

Chao Liu, Yuexian Zou, Dongming Yang
Exploiting Incidence Relation Between Subgroups for Improving Clustering-Based Recommendation Model

Matrix factorization (MF) has attracted much attention in recommender systems due to its extensibility and high accuracy. Recently, several clustering-based MF recommendation methods have been proposed to capture the associations between related users (items). However, these methods only use the subgroup data to build local models, so they suffer from the over-fitting problem caused by insufficient data in the process of training. In this paper, we analyse the incidence relation between subgroups of users (items) and then propose two single improved clustering-based MF models. By exploiting these relations between subgroups, the local model in each subgroup can obtain global information from other subgroups, which can mitigate the over-fitting problem. Above all, we generate an ensemble model by combining the two single models to capture associations between users and associations between items at the same time. Experimental results on different scales of MovieLens datasets demonstrate that our method outperforms state-of-the-art clustering-based recommendation methods, especially on sparse datasets.

Zhipeng Wu, Hui Tian, Xuzhen Zhu, Shaoshuai Fan, Shuo Wang
Hierarchical Bayesian Network Based Incremental Model for Flood Prediction

To minimize the negative impacts brought by floods, researchers pay special attention to the problem of flood prediction. In this paper, we propose a hierarchical Bayesian network based incremental model to predict floods for small rivers. The proposed model not only appropriately embeds hydrology expert knowledge with Bayesian network for high rationality and robustness, but also designs an incremental learning scheme to improve the self-improving and adaptive ability of the proposed model. Following the idea of a famous hydrology model, i.e., XAJ model, we firstly present the construction of hierarchical Bayesian network as local and global network construction. After that, we propose an incremental learning scheme, which selects proper incremental data to improve the completeness of prior knowledge and updates parameters of Bayesian network to prevent training from scratch. We demonstrate the accuracy and effectiveness of the proposed model by conducting experiments on a collected dataset with one comparative method.

Yirui Wu, Weigang Xu, Qinghan Yu, Jun Feng, Tong Lu
A New Female Body Segmentation and Feature Localisation Method for Image-Based Anthropometry

A growing demand for bespoke services for buying clothes online presents a new challenge of how to efficiently and precisely acquire anthropometric data of distant customers. The conventional 2D anthropometric methods are efficient but face a problem of imperfect body segmentation because they cannot automatically deal with arbitrary backgrounds. To address this problem, this paper, aimed at female anthropometry, proposes to segment the female body out of an orthogonal photo pair with deep learning, and to extract a group of body feature points according to the curvature and bending direction of the segmented body contour. With the located feature points we estimate six body parameters with two existing mathematical models and assess their pros and cons in this paper.

Dan Wang, Yun Sheng, GuiXu Zhang
Greedy Salient Dictionary Learning for Activity Video Summarization

Automated video summarization is well-suited to the task of analysing human activity videos (e.g., from surveillance feeds), mainly as a pre-processing step, due to the large volume of such data and the small percentage of actually important video frames. Although key-frame extraction remains the most popular way to summarize such footage, its successful application for activity videos is obstructed by the lack of editing cuts and the heavy inter-frame visual redundancy. Salient dictionary learning, recently proposed for activity video key-frame extraction, models the problem as the identification of a small number of video frames that, simultaneously, can best reconstruct the entire video stream and are salient compared to the rest. In previous work, the reconstruction term was modelled as a Column Subset Selection Problem (CSSP) and a numerical, SVD-based algorithm was adapted for solving it, while video frame saliency, in the fastest algorithm proposed up to now, was also estimated using SVD. In this paper, the numerical CSSP method is replaced by a greedy, iterative one, properly adapted for salient dictionary learning, while the SVD-based saliency term is retained. As proven by the extensive empirical evaluation, the resulting approach significantly outperforms all competing key-frame extraction methods with regard to speed, without sacrificing summarization accuracy. Additionally, computational complexity analysis of all salient dictionary learning and related methods is presented.

Ioannis Mademlis, Anastasios Tefas, Ioannis Pitas
Accelerating Topic Detection on Web for a Large-Scale Data Set via Stochastic Poisson Deconvolution

Organizing webpages into hot topics is one of the key steps to understand the trends from multi-modal web data. To handle this pressing problem, Poisson Deconvolution (PD), a state-of-the-art method, was recently proposed to rank the interestingness of web topics on a similarity graph. Nevertheless, in terms of scalability, PD optimized by expectation-maximization is not sufficiently efficient for a large-scale data set. In this paper, we develop a Stochastic Poisson Deconvolution (SPD) to deal with large-scale web data sets. Experiments demonstrate the efficacy of the proposed approach in comparison with the state-of-the-art methods on two public data sets and one large-scale synthetic data set.

Jinzhong Lin, Junbiao Pang, Li Su, Yugui Liu, Qingming Huang
Automatic Segmentation of Brain Tumor Image Based on Region Growing with Co-constraint

Image segmentation remains an ongoing challenge in medical image processing research. Owing to the inhomogeneous structure and blurred boundaries of brain tumors, the segmentation of brain tumor images is not always ideal. Therefore, we propose a novel region growing model that segments brain tumor images accurately and automatically. The model mainly improves the selection of seed points and the growth rules. A method that fuses information from multimodal MRI images is described for selecting the seed point automatically, which makes the segmentation algorithm more robust. Furthermore, in order to retain as much as possible of the local features and the boundary information of the brain tumor, a spatial texture feature is constructed in this study. Based on the above model, an automatic brain tumor image segmentation algorithm is established, which uses region growing with the co-constraint of intensity and spatial texture. In terms of performance evaluation, the proposed method not only outperforms other segmentation algorithms in accuracy, but also has a lower computational cost. This is undoubtedly a worthy method of brain tumor image segmentation.
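Region growing under two simultaneous constraints can be sketched as follows. This is only an illustration of the general idea, not the model of the paper: the seed is given by hand rather than selected from multimodal MRI, and the texture cue `local_range` (intensity range in a 3x3 neighbourhood) is a hypothetical stand-in for the spatial texture feature. The function names, thresholds, and toy image are all assumptions.

```python
from collections import deque


def local_range(image, r, c):
    """Hypothetical texture cue: max minus min intensity in the 3x3 neighbourhood."""
    rows, cols = len(image), len(image[0])
    values = [
        image[i][j]
        for i in range(max(0, r - 1), min(rows, r + 2))
        for j in range(max(0, c - 1), min(cols, c + 2))
    ]
    return max(values) - min(values)


def region_grow(image, seed, max_intensity_diff, max_texture):
    """Grow a 4-connected region from seed; a pixel joins only if it
    satisfies both the intensity constraint and the texture constraint."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in region:
                continue
            close_in_intensity = abs(image[nr][nc] - seed_value) <= max_intensity_diff
            smooth_enough = local_range(image, nr, nc) <= max_texture
            if close_in_intensity and smooth_enough:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy image: a bright 3x3 blob on a dark background.
image = [
    [10, 10, 10, 10, 10],
    [10, 90, 92, 91, 10],
    [10, 91, 90, 93, 10],
    [10, 92, 91, 90, 10],
    [10, 10, 10, 10, 10],
]
loose = region_grow(image, (2, 2), max_intensity_diff=5, max_texture=100)
strict = region_grow(image, (2, 2), max_intensity_diff=5, max_texture=5)
```

With a loose texture bound the whole blob is recovered; with a strict bound growth stops immediately, because every neighbour of the seed touches the high-contrast boundary. This shows how the second constraint changes the result even when the intensity test alone would accept a pixel.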

Siming Cui, Xuanjing Shen, Yingda Lyu
Proposal of an Annotation Method for Integrating Musical Technique Knowledge Using a GTTM Time-Span Tree

This paper proposes an annotation method for integrating the knowledge of musical techniques and musical structures. We have attempted to support musical instrument performances from the viewpoint of knowledge engineering. We focused on classical guitar, which requires many techniques, and developed a guitar rendition ontology that can serve as a guideline for classical guitar performances at teaching and learning sites. In order to use the ontology knowledge effectively at these sites, we need to connect it with musical structures so that the ontology data can be integrated with musical score information. Therefore, we propose a method that annotates time-span trees, obtained from time-span analysis based on the generative theory of tonal music (GTTM), with knowledge related to musical techniques. We experimented with several bars of four guitar pieces and investigated how much of the knowledge that is executed with more than two notes can be added to time-span trees. Our results showed that about 76% of the ontology knowledge corresponded with the structure of time-span trees.

Nami Iino, Mayumi Shimada, Takuichi Nishimura, Hideaki Takeda, Masatoshi Hamanaka
A Hierarchical Level Set Approach for RGBD Image Matting

This paper presents a novel method for the image matting of RGBD data, using a Hierarchical Level Set. The approach has four main steps. First, the color and depth channels are preprocessed: noise is eliminated using a Directional Joint Bilateral Filter and holes are removed from the depth map. Second, color cues and depth cues are integrated to segment the image using a Hierarchical Level Set Framework. Third, the segmentation of the color and depth cues is used to generate a trimap. Finally, an extended alpha matting approach is used to obtain the final matting result, with the color image, depth image and trimap serving as input. Experiments using complex natural images demonstrate that the proposed RGBD matting approach is able to generate good matting results.

Wenliang Zeng, Ji Liu
A Genetic Programming Approach to Integrate Multilayer CNN Features for Image Classification

Fusing information extracted from multiple layers of a convolutional neural network has been proven effective in several domains. Common fusion techniques include feature concatenation and Fisher embedding. In this work, we propose to fuse multilayer information by genetic programming (GP). With the evolutionary strategy, we iteratively fuse multilayer information in a systematic manner. In the evaluation, we verify the effectiveness of discovered GP-based representations on three image classification datasets, and discuss characteristics of the GP process. This study is one of the few works to fuse multilayer information based on an evolutionary strategy. The reported preliminary results not only demonstrate the potential of the GP fusion scheme, but also inspire future study in several aspects.

Wei-Ta Chu, Hao-An Chu
Improving Micro-expression Recognition Accuracy Using Twofold Feature Extraction

Micro-expressions are generated involuntarily on a person’s face and are usually a manifestation of the person’s repressed feelings. Micro-expressions are characterised by short duration, involuntariness and low intensity. Because of these characteristics, micro-expressions are difficult to perceive and interpret correctly, and they are profoundly challenging to identify and categorise automatically. Previous work on micro-expression recognition has used hand-crafted features such as LBP-TOP, Gabor filters, HOG and optical flow. Recent work has also demonstrated the possible use of deep learning for micro-expression recognition. This paper is the first work to explore the combined use of a hand-crafted feature descriptor and a deep feature descriptor for the micro-expression recognition task. The aim is to use the hand-crafted and deep learning feature descriptors to extract features and integrate them to construct a large feature vector that describes a video. Through experiments on the CASME, CASME II and CASME+2 databases, we demonstrate that our proposed method can achieve promising micro-expression recognition accuracy with larger training samples.

Madhumita A. Takalkar, Haimin Zhang, Min Xu
An Effective Dual-Fisheye Lens Stitching Method Based on Feature Points

A fisheye lens is a super-wide-angle lens that is very light. Usually, two cameras with such lenses can shoot 360-degree panoramic images. However, the limited overlap between their fields of view makes it hard to stitch the images at the boundaries. This paper introduces a novel method for dual-fisheye camera stitching based on feature points. We also put forward the idea of extending the method to video. Results show that this method can be used to produce high-quality panoramic images by stitching the original images of the dual-fisheye camera Samsung Gear 360.

Li Yao, Ya Lin, Chunbo Zhu, Zuolong Wang
3D Skeletal Gesture Recognition via Sparse Coding of Time-Warping Invariant Riemannian Trajectories

3D skeleton-based human representation for gesture recognition has increasingly attracted attention due to its invariance to camera view and environment dynamics. Existing methods typically utilize absolute coordinates to represent human motion features. However, gestures are independent of the performer’s location, and the features should be invariant to the body size of the performer. Moreover, temporal dynamics can significantly distort the distance metric when comparing and identifying gestures. In this paper, we represent each skeleton as a point in the product space of the special orthogonal group SO(3), which explicitly models the 3D geometric relationships between body parts. Then, a gesture skeletal sequence can be characterized by a trajectory on a Riemannian manifold. Next, we generalize the transported square-root vector field to obtain a re-parametrization invariant metric on the product space of SO(3); the goal of comparing trajectories in a time-warping invariant manner is thereby realized. Furthermore, we present a sparse coding of skeletal trajectories that explicitly considers the labeling information of each atom to enforce the discriminant validity of the dictionary. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on three challenging benchmarks for gesture recognition.

Xin Liu, Guoying Zhao
Efficient Graph Based Multi-view Learning

Graph-based learning methods, especially multi-graph-based methods, have attracted considerable research interest in the past decades. In these methods, traditional graph models are used to build adjacency relationships for samples within different views. However, owing to their huge time complexity, they are inefficient for large-scale datasets. In this paper, we propose a method named multi-anchor-graph learning (MAGL), which aims to utilize anchor graphs for the adjacency estimation. MAGL can not only sufficiently explore the complementarity of multiple graphs built upon different views but also keep an acceptable time complexity. Furthermore, we show that the proposed method can be implemented through an efficient iterative process. Extensive experiments on six publicly available datasets have demonstrated both the effectiveness and efficiency of our proposed approach.
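To make the anchor-graph idea concrete, the following minimal sketch builds a standard single-view anchor graph: each sample is connected only to its nearest anchors, and sample-to-sample adjacency is derived from those weights. Whether MAGL uses exactly this construction is an assumption; the Gaussian kernel, the parameters `s` and `bandwidth`, the function names, and the toy data are illustrative, and the multi-view combination step is not shown.

```python
import math


def anchor_weights(samples, anchors, s=2, bandwidth=1.0):
    """Return Z, where Z[i][j] is the normalised Gaussian-kernel weight
    between sample i and anchor j, nonzero only for the s nearest anchors."""
    Z = []
    for x in samples:
        sq_dists = [sum((a - b) ** 2 for a, b in zip(x, u)) for u in anchors]
        nearest = sorted(range(len(anchors)), key=sq_dists.__getitem__)[:s]
        kernel = {j: math.exp(-sq_dists[j] / (2 * bandwidth ** 2)) for j in nearest}
        total = sum(kernel.values())
        Z.append([kernel.get(j, 0.0) / total for j in range(len(anchors))])
    return Z


def anchor_graph_adjacency(Z):
    """Sample-to-sample adjacency W = Z * diag(column sums)^-1 * Z^T."""
    n, m = len(Z), len(Z[0])
    col_sums = [sum(Z[i][k] for i in range(n)) for k in range(m)]
    return [
        [
            sum(Z[i][k] * Z[j][k] / col_sums[k] for k in range(m) if col_sums[k] > 0)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy view: two samples near the first anchor, two near the second.
samples = [(0.0, 0.0), (0.1, 0.0), (6.0, 6.0), (6.1, 6.0)]
anchors = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
Z = anchor_weights(samples, anchors, s=2)
W = anchor_graph_adjacency(Z)
```

The efficiency argument is that Z has only `s` nonzero entries per sample, so distances are computed against a small set of anchors rather than between all pairs of samples. The dense W is materialised here only because the toy data is tiny.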

Hengtong Hu, Richang Hong, Weijie Fu, Meng Wang
DANTE Speaker Recognition Module. An Efficient and Robust Automatic Speaker Searching Solution for Terrorism-Related Scenarios

The amount of data with terrorism-related content crossing the net, including voice, is so immense that powerful filtering/detection tools with great discriminative capacity become essential. Although the analysis of this content often ends with some manual inspection, a first filtering step is fundamental. In this direction, we propose a speaker clustering solution based on a speaker identification system. We show that both the speaker clustering and the speaker recognition solution can be used individually to efficiently solve searching tasks in several terrorism-related scenarios.

Jesús Jorrín, Luis Buera
Backmatter
Metadata
Title: MultiMedia Modeling
Editors: Ioannis Kompatsiaris, Dr. Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, Dr. Stefanos Vrochidis
Copyright Year: 2019
Electronic ISBN: 978-3-030-05710-7
Print ISBN: 978-3-030-05709-1
DOI: https://doi.org/10.1007/978-3-030-05710-7