
2018 | Book

MultiMedia Modeling

24th International Conference, MMM 2018, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part II

Edited by: Prof. Dr. Klaus Schoeffmann, Thanarat H. Chalidabhongse, Chong Wah Ngo, Prof. Dr. Supavadee Aramvith, Noel E. O’Connor, Yo-Sung Ho, Moncef Gabbouj, Prof. Ahmed Elgammal

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

The two-volume set LNCS 10704 and 10705 constitutes the thoroughly refereed proceedings of the 24th International Conference on Multimedia Modeling, MMM 2018, held in Bangkok, Thailand, in February 2018.

Of the 185 full papers submitted, 46 were selected for oral presentation and 28 for poster presentation; in addition, 5 papers were accepted for Multimedia Analytics: Perspectives, Techniques, and Applications, 12 extended abstracts for demonstrations, and 9 papers for the Video Browser Showdown 2018. All papers presented were carefully reviewed and selected from the 185 submissions.

Table of Contents

Frontmatter

Full Papers Accepted for Poster Presentation

Frontmatter
A New Accurate Image Denoising Method Based on Sparse Coding Coefficients

Although the sparse coding error has been introduced to improve the performance of sparse representation-based image denoising, the resulting estimate of the sparse coding noise is still not tight enough. To suppress the sparse coding noise, we exploit a pair of images to estimate the unknown sparse code. There are two main contributions in this paper: the first is to use a reference denoised image and an intermediate denoised image to estimate the sparse coding coefficients of the original image; the second is to set a threshold to rule out blocks of low similarity and thereby improve the accuracy of the estimation. Our experimental results show improvements over several state-of-the-art denoising methods on a collection of 12 generic natural images.

Kai Lin, Ge Li, Yiwei Zhang, Jiaxing Zhong
A Novel Frontal Facial Synthesis Algorithm Based on Individual Residual Face

The frontal facial synthesis results of current mainstream methods tend to be smooth and lack personal characteristics, which greatly influences the subjective impression. For faces with large deflection angles in particular, the occluded side provides only few features for reconstruction, and the deformation of facial elements makes it harder to obtain exact features of the target frontal face, so that the synthesis results tend to look alike and resemble the mean face of the database. In this paper, to solve these problems, we propose a novel two-step face synthesis method. In the first step, we utilize the basic symmetry of the human face to predict the missing patches from the visible side and generate an interim facial image. In the following step, we introduce the individual residual facial image between the interim result and the mean face to compensate for the lost personal features of the synthesis result, because the residual image carries more individual characteristics of the input face. Experimental results show that the proposed method outperforms other state-of-the-art methods in both objective and subjective evaluations.

Xin Ding, Ruimin Hu, Zhen Han, Zhongyuan Wang
A Text Recognition and Retrieval System for e-Business Image Management

The ongoing growth of e-business has resulted in companies having to manage an ever-increasing number of product, packaging and promotional images. Systems for indexing and retrieving such images are required in order to ensure image libraries can be managed and fully exploited as valuable business resources. In this paper, we explore the power of text recognition for e-business image management and propose an innovative system based on photo OCR. Photo OCR has been actively studied for scene text recognition but has not been exploited for e-business digital image management. Besides the well-known difficulties in scene text recognition, such as variations in text size, location and orientation and cluttered backgrounds, e-business images typically feature text with extremely diverse fonts, and the characters are often artistically modified in shape, colour and arrangement. To address these challenges, our system takes advantage of the combinatorial power of deep neural networks and MSER processing. The cosine distance and n-gram vectors are used during retrieval for matching detected text to queries, providing tolerance to the inevitable transcription errors in text recognition. To evaluate our proposed system, we prepared a novel dataset designed specifically to reflect the challenges associated with text in e-business images. We compared our system with two other approaches for scene text recognition, and the results show our system outperforms the other state-of-the-art approaches on the new, challenging dataset. Our system demonstrates that recognizing text embedded in images can be hugely beneficial for digital asset management.

Jiang Zhou, Kevin McGuinness, Noel E. O’Connor
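
To make the retrieval step above concrete, here is a minimal sketch of matching a noisy OCR transcription to a query by cosine similarity over character n-gram vectors; the n-gram length, lowercasing, and the example strings are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-grams of a string (n=3 is an assumed choice)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    common = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A noisy OCR transcription still matches the intended query reasonably well.
detected = char_ngrams("chocolatte biscuits")   # hypothetical OCR output with an error
query = char_ngrams("chocolate biscuits")
print(cosine_similarity(detected, query))       # high score despite the typo
```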
Accurate Detection for Scene Texts with a Cascaded CNN Networks

We propose a text detection algorithm that accurately and reliably determines the bounding regions of text in natural scenes. Cascaded convolutional neural networks are aggregated in our system in order to obtain accurate precision, recall and F-score (PRF) for text detection. The first fully convolutional network, acting as a coarse detector, is in charge of detecting and segmenting text-like areas. The second network filters out non-text segment blocks and accurately determines each text line within the remaining blocks. In order to make the best use of the advantages of the two networks, we propose an intermediate-processing mechanism. The whole system is capable of detecting squeezed lines with very tiny words as well as text of different sizes, especially small text. Our experimental system is based on a Titan X GPU and achieves a precision of 0.92, recall of 0.83 and F-score of 0.87, which ranks 22nd among all published results on the ICDAR 2013 Focused Scene Text benchmark.

Jianjun Li, Chenyan Wang, Zhenxing Luo, Zhuo Tang, Haojie Li
Cloud of Line Distribution and Random Forest Based Text Detection from Natural/Video Scene Images

Text detection in natural and video scene images is still considered challenging due to the unpredictable nature of scene text. This paper presents a new method based on Cloud of Line Distribution (COLD) and a random forest classifier for text detection in both natural and video images. The proposed method extracts the unique shapes of text components by studying the relationship between dominant points, such as straight or cursive, over the contours of text components, which is called COLD, in the polar domain. We consider edge components as text candidates if the edge components obtained from the Canny and Sobel edge maps of an input image share the COLD property. For each text candidate, we further study its COLD distribution at the component level to extract statistical features and angle-oriented features. Next, these features are fed to a random forest classifier to eliminate false text candidates, which results in representatives. We then perform grouping using the representatives to form text lines based on the distances between edge components in the edge image. The statistical and angle-oriented features are finally extracted at the word level for eliminating false positives, which results in text detection. The proposed method is tested on standard databases, namely the SVT, ICDAR 2015 scene, ICDAR 2013 scene and video databases, to show its effectiveness and usefulness compared with the existing methods.

Wenhai Wang, Yirui Wu, Palaiahnakote Shivakumara, Tong Lu
CNN-Based DCT-Like Transform for Image Compression

This paper presents a block transform for image compression, where the transform is inspired by discrete cosine transform (DCT) but achieved by training convolutional neural network (CNN) models. Specifically, we adopt the combination of convolution, nonlinear mapping, and linear transform to form a non-linear transform as well as a non-linear inverse transform. The transform, quantization, and inverse transform are jointly trained to achieve the overall rate-distortion optimization. For the training purpose, we propose to estimate the rate by the $l_1$-norm of the quantized coefficients. We also explore different combinations of linear/non-linear transform and inverse transform. Experimental results show that our proposed CNN-based transform achieves higher compression efficiency than fixed DCT, and also outperforms JPEG significantly at low bit rates.

Dong Liu, Haichuan Ma, Zhiwei Xiong, Feng Wu
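
As a rough illustration of the joint rate-distortion training described above, the sketch below combines a reconstruction (distortion) term with an $l_1$-norm rate estimate on the quantized coefficients. The `encoder`/`decoder` modules, the additive-uniform-noise quantization proxy, and the trade-off weight `lam` are placeholder assumptions, not the authors' actual architecture.

```python
import torch

def rate_distortion_loss(x, encoder, decoder, lam=0.01):
    """Joint rate-distortion objective: distortion + lam * l1 rate surrogate.

    `encoder`/`decoder` stand in for the learned forward/inverse transforms;
    quantization is approximated by additive uniform noise during training,
    a common trick (an assumption, not necessarily the paper's choice).
    """
    coeffs = encoder(x)
    coeffs_q = coeffs + torch.empty_like(coeffs).uniform_(-0.5, 0.5)  # quantization proxy
    x_hat = decoder(coeffs_q)
    distortion = torch.mean((x - x_hat) ** 2)   # reconstruction error
    rate = torch.mean(torch.abs(coeffs_q))      # l1-norm rate estimate
    return distortion + lam * rate
```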
Coarse-to-Fine Image Super-Resolution Using Convolutional Neural Networks

Convolutional neural networks (CNNs) have been widely applied to computer vision due to their excellent performance. CNN-based single image super-resolution (SR) methods have also been put into practice and outperform previous methods. In this paper, we propose a coarse-to-fine CNN method to boost existing CNN-based SR methods. We design a cascaded CNN architecture with three stages. The first stage takes the low-resolution (LR) image as input and outputs a high-resolution (HR) image, then the next stage similarly takes this high-resolution result as input and produces a finer HR image. Finally, the last stage obtains the finest HR image. Our architecture is trained as one entire CNN that combines three loss functions to optimize the gradient descent procedure. Experiments on ImageNet-based training samples validate the effectiveness of our method on the public benchmark datasets.

Liguo Zhou, Zhongyuan Wang, Shu Wang, Yimin Luo
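
A minimal sketch of the coarse-to-fine idea above: three refinement stages trained as one network with a summed per-stage loss. The single-conv-block stages and the assumption that the LR input has already been upscaled to the target resolution are illustrative choices; the paper's actual stage design is not given in the abstract.

```python
import torch
import torch.nn as nn

class CascadeSR(nn.Module):
    """Three refinement stages trained jointly, each producing a finer HR estimate."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(64, 3, 3, padding=1))
            for _ in range(3)
        ])

    def forward(self, lr_upscaled):
        # Assumes the LR image was bicubically upscaled to HR size beforehand.
        outputs, x = [], lr_upscaled
        for stage in self.stages:
            x = x + stage(x)        # each stage refines the previous estimate
            outputs.append(x)
        return outputs

def combined_loss(outputs, hr_target):
    # Sum of per-stage reconstruction losses, optimized as one objective.
    return sum(nn.functional.mse_loss(o, hr_target) for o in outputs)
```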
Data Augmentation for EEG-Based Emotion Recognition with Deep Convolutional Neural Networks

Emotion recognition is the task of recognizing a person’s emotional state. EEG, as a physiological signal, can provide more detailed and complex information for the emotion recognition task. Moreover, since EEG signals cannot be intentionally altered or hidden, EEG-based emotion recognition can achieve more effective and reliable results. Unfortunately, due to the cost of data collection, most EEG datasets contain only a small number of EEG recordings. The lack of data makes it difficult to predict emotion states with deep models, which require a sufficient amount of training data. In this paper, we propose to use a simple data augmentation method to address the issue of data shortage in EEG-based emotion recognition. In experiments, we explore the performance of emotion recognition with shallow and deep computational models before and after data augmentation on two standard EEG-based emotion datasets. Our experimental results show that the simple data augmentation method can effectively improve the performance of emotion recognition based on deep models.

Fang Wang, Sheng-hua Zhong, Jianfeng Peng, Jianmin Jiang, Yan Liu
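
The abstract does not name the augmentation method, so the sketch below shows one simple possibility only: enlarging a small EEG training set by adding low-amplitude Gaussian noise to copies of the trials. The noise level and number of copies are arbitrary illustrative values, not the paper's settings.

```python
import numpy as np

def augment_with_noise(eeg_data, labels, copies=4, sigma=0.01, seed=0):
    """Enlarge a small EEG training set by adding low-amplitude Gaussian noise.

    eeg_data: array of shape (n_trials, n_channels, n_samples).
    `sigma` and `copies` are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    noisy = [eeg_data + rng.normal(0.0, sigma, eeg_data.shape) for _ in range(copies)]
    aug_data = np.concatenate([eeg_data] + noisy, axis=0)
    aug_labels = np.concatenate([labels] * (copies + 1), axis=0)
    return aug_data, aug_labels
```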
Domain Invariant Subspace Learning for Cross-Modal Retrieval

Due to the rapid growth of multimodal data, cross-modal retrieval has drawn growing attention in recent years; it aims to take one type of data as the query to retrieve relevant data of another type. To enable direct matching between different modalities, the key issue in cross-modal retrieval is to eliminate the heterogeneity between modalities. A number of existing approaches directly project the samples of multimodal data into a common latent subspace under the supervision of class label information, and different samples within the same class contribute uniformly to the subspace construction. However, the subspace constructed by these methods may not reveal the true importance of each sample or the discrimination of different class labels. To tackle this problem, in this paper we regard different modalities as different domains and propose a Domain Invariant Subspace Learning (DISL) method to associate multimodal data. Specifically, DISL simultaneously minimizes the classification error with sample-wise weighting coefficients and preserves the structural similarity within and across modalities through graph regularization. Therefore, the subspace learned by DISL can well reflect the sample-wise importance and capture the discrimination of different class labels in multimodal data. Extensive experiments on three public datasets demonstrate the superiority of the proposed method over several state-of-the-art algorithms for cross-modal retrieval tasks such as image-to-text and text-to-image.

Chenlu Liu, Xing Xu, Yang Yang, Huimin Lu, Fumin Shen, Yanli Ji
Effective Action Detection Using Temporal Context and Posterior Probability of Length

In this paper, we focus on human action detection for untrimmed long videos. We propose an effective action detection system aimed at solving two difficulties in existing works. First, we propose to take temporal context information into account in model learning to tackle the problem of high-quality proposal generation. Second, we propose to utilize the posterior probability of proposal length to adjust the selection criterion of action proposals. This effectively encourages proposals with reasonable lengths and suppresses high-classification-score proposals with unreasonable lengths. We test our method on the THUMOS14 dataset, and the experimental results show that our action detection system improves performance by about 4% compared with the state-of-the-art methods.

Xinran Liu, Yan Song, Jinhui Tang
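
One way to realize the length-based adjustment described above is to multiply each proposal's classification score by the posterior probability of its length under the predicted action class, estimated from a training-set histogram; the multiplicative combination rule and the histogram binning in this sketch are assumptions.

```python
import numpy as np

def rescore_proposals(proposals, length_hist, bins):
    """Re-rank action proposals by combining the classification score with a
    length prior learned from training data (combination rule assumed).

    proposals: list of dicts with 'score', 'class', 'length' (in seconds).
    length_hist: {class_id: normalized histogram of proposal lengths}.
    bins: shared bin edges for all histograms.
    """
    for p in proposals:
        idx = int(np.clip(np.digitize(p["length"], bins) - 1, 0, len(bins) - 2))
        p_len = length_hist[p["class"]][idx]   # posterior probability of this length
        p["adjusted"] = p["score"] * p_len     # suppress implausible lengths
    return sorted(proposals, key=lambda p: p["adjusted"], reverse=True)
```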
Efficient Two-Layer Model Towards Cover Song Identification

So far, few cover song identification systems aim at practical application. On one hand, existing sequence alignment methods achieve high precision at the expense of high time cost. On the other hand, for large-scale identification, researchers attempt to exploit fixed low-dimensional features to reduce time cost. However, such highly compressed representations often result in worse accuracy. In this paper, we propose an efficient two-layer system that combines the advantages of both kinds of methods. The proposed approach outperforms existing approaches and achieves high precision with relatively low time complexity.

Xiaoshuo Xu, Yao Cheng, Xiaoou Chen, Deshun Yang
Food Photo Recognition for Dietary Tracking: System and Experiment

Tracking dietary intake is an important task for health management, especially for chronic diseases such as obesity, diabetes, and cardiovascular diseases. Given the popularity of personal hand-held devices, mobile applications provide a promising low-cost solution for tackling this key risk factor through diet monitoring. In this work, we propose a photo-based dietary tracking system that employs deep image recognition algorithms to recognize food and analyze nutrition. The system is beneficial for patients managing their dietary and nutritional intake, and for medical institutions intervening in and treating chronic diseases. To the best of our knowledge, there are no popular applications on the market that provide high-performance food photo recognition like ours, which is a more convenient and intuitive way to enter food than textual typing. We conducted experiments evaluating the recognition accuracy on laboratory data and real user data for Singapore local food, which shed light on deploying lab-trained image recognition models in real applications. In addition, we conducted a user study verifying that our proposed method has the potential to foster a higher user engagement rate compared to existing app-based dietary tracking approaches.

Zhao-Yan Ming, Jingjing Chen, Yu Cao, Ciarán Forde, Chong-Wah Ngo, Tat Seng Chua
Fusion Networks for Air-Writing Recognition

This paper presents a fusion framework for air-writing recognition. By modeling a hand trajectory using both spatial and temporal features, the proposed network can learn more information than state-of-the-art techniques. The proposed network combines elements of CNN and BLSTM networks to learn isolated air-writing characters. Its performance was evaluated on the alphabet and numeric databases of the public 6DMG dataset. We first evaluate the accuracy of the fusion network using CNN, BLSTM, and another fusion network as references. The results confirm that the average accuracy of the fusion network outperforms all of the references. With 40 BLSTM units, the best accuracy of the proposed network is 99.27% and 99.33% on the alphabet and numeric gestures, respectively. Compared with previous work, the accuracy of the proposed network improves by 0.70% and 0.34% on the alphabet and numeric gestures, respectively. We also examine the performance of the proposed network by varying the number of BLSTM units. The experiments demonstrate that the accuracy improves as the number of BLSTM units increases, but plateaus once the number of units exceeds 20; beyond that point, adding more learning capacity improves the accuracy only insignificantly.

Buntueng Yana, Takao Onoye
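
A minimal Keras-style sketch of a CNN + BLSTM fusion over a hand-trajectory sequence, in the spirit of the network above; the layer sizes, sequence length, and per-sample feature dimension are illustrative placeholders rather than the paper's configuration.

```python
from tensorflow.keras import layers, Model

def build_fusion_model(seq_len=100, feat_dim=6, n_classes=26, lstm_units=40):
    """Two branches over the same trajectory sequence: a 1D CNN for spatial
    patterns and a bidirectional LSTM for temporal dynamics, concatenated
    before the classifier. All sizes here are illustrative assumptions."""
    inp = layers.Input(shape=(seq_len, feat_dim))

    cnn = layers.Conv1D(64, 5, activation="relu", padding="same")(inp)
    cnn = layers.GlobalMaxPooling1D()(cnn)

    rnn = layers.Bidirectional(layers.LSTM(lstm_units))(inp)

    fused = layers.Concatenate()([cnn, rnn])
    out = layers.Dense(n_classes, activation="softmax")(fused)
    return Model(inp, out)

model = build_fusion_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```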
Global and Local C3D Ensemble System for First Person Interactive Action Recognition

Action recognition in first person videos is different from that in third person videos. In this paper, we aim to recognize interactive actions in first person videos. First person interactive actions contain two kinds of motion: the ego-motion of the observer and the motion of the actor. To enable an observer to understand “what activity others are doing to me”, we propose a twin-stream network architecture based on 3D convolutional networks. The global action C3D learns interactions together with ego-motion, while the local salient motion C3D analyzes the motion of the actor in a salient region, especially when the action happens at a distance from the observer. We also propose a sampling method to extract clips as input to the C3D models and investigate different C3D architectures to improve the performance of C3D. We carry out experiments on the JPL first-person interaction benchmark dataset. Experimental results show that the ensemble of global and local networks increases accuracy over the state-of-the-art methods by 3.26%.

Lingling Fa, Yan Song, Xiangbo Shu
Implicit Affective Video Tagging Using Pupillary Response

Psychological research has found that human eyes can serve as a sensitive indicator of emotional response. Pupillary response has been used to analyze affective video content in previous studies, but the performance has not been good enough. In this paper, we propose a novel method for implicit affective video tagging using pupillary response. The issue of pupil size differences between subjects has not been effectively solved, which seriously affects the performance of implicit affective video tagging. In our method, we first define a pupil diameter baseline for each subject to diminish individual differences in pupil size. In addition, a probabilistic support vector machine (SVM) and a long short-term memory (LSTM) network are used to extract valuable information and output probability estimates based on the proposed global features and sequence features obtained from the pupil dilation ratio time-series data, respectively. The final decision is made by combining the probability estimates from these two models using the sum rule. For empirical validation, we evaluate the proposed method on the standard MAHNOB-HCI dataset. The experimental results show that the proposed method achieves better classification accuracy compared with the existing method.

Dongdong Gui, Sheng-hua Zhong, Zhong Ming
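
Two of the concrete steps above, baseline normalization of the pupil diameter and sum-rule fusion of the two models' probability estimates, can be sketched as follows; the equal weighting of the two models is an assumption.

```python
import numpy as np

def pupil_dilation_ratio(pupil_series, baseline):
    """Normalize a pupil-diameter time series by the subject's baseline
    diameter to reduce inter-subject differences in pupil size."""
    return np.asarray(pupil_series, dtype=float) / baseline

def sum_rule_fusion(p_svm, p_lstm):
    """Combine per-class probability estimates from the two models by the
    sum rule and pick the class with the highest combined score."""
    combined = (np.asarray(p_svm) + np.asarray(p_lstm)) / 2.0
    return int(np.argmax(combined)), combined

# Example with hypothetical probability estimates for three affect classes.
label, scores = sum_rule_fusion([0.2, 0.5, 0.3], [0.1, 0.7, 0.2])
```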
k-Labelsets for Multimedia Classification with Global and Local Label Correlation

Multimedia data, e.g., text and images, can be associated with more than one label. Existing methods for multimedia data classification either consider label correlation globally, by assuming that it is shared by all instances, or consider label correlations locally, by assuming a pairwise label correlation shared only within a local group of instances. In fact, both global and local correlations may occur in real-world applications, and label correlation cannot be confined to pairwise labels. In this paper, a novel and effective multi-label learning approach named GLkEL is proposed for multimedia data categorization. Briefly, a High-Order Label Correlation Assessment strategy named HOLCA is proposed using approximated joint mutual information; GLkEL then breaks the original label set into several of the most correlated and distinct combinations of k labels (called k-labELsets) according to the HOLCA strategy and learns global and local label correlations simultaneously based on a label correlation matrix. Comprehensive experiments across 8 datasets from different multimedia domains indicate that GLkEL manifests competitive performance against other well-established multi-label learning methods.

Yan Yan, Shining Li, Xiao Zhang, Anyi Wang, Zhigang Li, Jingyu Zhang
LVFS: A Lightweight Video Storage File System for IP Camera-Based Surveillance Applications

Surveillance video data are characterized by a high-volume, write-oriented workload and cyclic use of storage space at full capacity. In addition, the video data needs indexing to support accurate queries. Data archiving for such applications, however, is complicated by the increasing demands of higher-resolution cameras and longer video-retention times. Due to the constant data streaming and nearly 100% write activity, general-purpose file systems do not suffice for this purpose. We therefore design and implement a lightweight video storage file system (LVFS) for IP camera-based surveillance applications. LVFS provides a recycled storage platform to meet the retention requirements of surveillance applications, and delivers high performance for concurrent stream uploading from multiple cameras and accurate data retrieval. Results of a multi-workload experiment show that LVFS is able to archive 40 HD or 16 full-HD cameras on an individual hard disk while processing queries in nearly constant time, indicating that LVFS successfully meets the system requirements and performs better than general-purpose file systems.

Chong Wang, Ke Zhou, Zhongying Niu, Ronglei Wei, Hongwei Li
Person Re-id by Incorporating PCA Loss in CNN

This paper proposes an algorithm, in particular a loss function and its end-to-end learning manner, for the person re-identification task. The main idea is to take full advantage of the labels in a batch during training, and to employ PCA to extract discriminative features. Deriving from the classic eigenvalue computation problem in PCA, our method incorporates an extra term in the loss function with the purpose of minimizing the relatively large eigenvalues. The derivative with respect to the designed loss can be back-propagated through the deep network by stochastic gradient descent (SGD). Experiments show the effectiveness of our algorithm on several re-id datasets.

Kaixuan Zhang, Yang Xu, Li Sun, Song Qiu, Qingli Li
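
A hedged sketch of the extra loss term described above: penalizing the largest eigenvalues of the batch feature covariance so that the gradient can flow back through the embedding network. Which covariance matrix is used and how many eigenvalues count as "relatively large" are assumptions not specified in the abstract.

```python
import torch

def pca_penalty(features, k=5):
    """Penalize the largest eigenvalues of the batch feature covariance.

    features: (batch, dim) tensor of embeddings, batch size > 1.
    The choice of covariance and the number k of penalized eigenvalues are
    illustrative assumptions.
    """
    centered = features - features.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / (features.size(0) - 1)
    eigvals = torch.linalg.eigvalsh(cov)   # ascending order, differentiable
    return eigvals[-k:].sum()

# total_loss = id_loss + alpha * pca_penalty(embeddings)  # alpha: assumed weight
```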
Robust and Real-Time Visual Tracking Based on Complementary Learners

Correlation filter based tracking methods have achieved impressive performance in recent years, showing high efficiency and robustness in challenging situations that exhibit illumination variations and motion blur. However, how to reduce the model drift phenomenon, which is usually caused by object deformation, abrupt motion, heavy occlusion and out-of-view targets, is still an open problem. In this paper, we exploit low-dimensional complementary features and an adaptive online detector with the average peak-to-correlation energy to improve tracking accuracy and time efficiency. Specifically, we integrate several complementary features in the correlation filter based discriminative framework and combine them with a global color histogram to further boost the overall performance. In addition, we adopt the average peak-to-correlation energy to determine whether to activate and update an online CUR filter for re-detecting the target. We conduct extensive experiments on the challenging OTB-15 benchmark datasets, and the experimental results demonstrate that the proposed method achieves promising results in terms of efficiency, accuracy and robustness while running at 46 FPS.

Xingzhou Luo, Dapeng Du, Gangshan Wu
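
The average peak-to-correlation energy (APCE) used above to gate the re-detection module is commonly computed from the correlation response map as the squared peak-to-minimum gap divided by the mean squared deviation from the minimum; a small sketch, assuming that standard definition.

```python
import numpy as np

def apce(response_map):
    """Average peak-to-correlation energy of a correlation response map.

    A low APCE relative to its running average typically indicates occlusion
    or drift, which here would trigger the re-detection module.
    """
    r = np.asarray(response_map, dtype=float)
    f_max, f_min = r.max(), r.min()
    return (f_max - f_min) ** 2 / np.mean((r - f_min) ** 2)
```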
Room Floor Plan Generation on a Project Tango Device

This article presents a method to ease the generation of room floor plans with a Project Tango device. Our method takes range images as well as camera poses as input. It is based on the extraction of vertical planar surfaces, which we label as wall or clutter. While the user is scanning the scene, we estimate the room layout from the labeled planar surfaces by solving a shortest path problem. The user can intervene in this process and affect the estimated layout by changing the labels of the extracted planar surfaces. We compare our approach with other mobile applications and demonstrate its validity.

Vincent Angladon, Simone Gasparini, Vincent Charvillat
Scalable Bag of Selected Deep Features for Visual Instance Retrieval

Recent studies show that aggregating the activations of convolutional layers of CNN models into a global descriptor leads to promising performance for instance retrieval. However, due to the global pooling strategy adopted, the generated feature representation lacks discriminative local structure information and is degraded by irrelevant image patterns and background clutter. In this paper, we propose a novel Bag-of-Deep-Visual-Words (BoDVW) model for instance retrieval. Activations of convolutional feature maps are extracted as a set of individual semantic-aware local features. An energy-based feature selection is adopted to filter out features on homogeneous background with poor distinctiveness. To achieve scalability of local feature-level cross matching, the local deep CNN features are quantized to fit the inverted index structure. A new cross-matching metric is defined to measure image similarity. Our approach achieves respectable performance in comparison to other state-of-the-art methods. In particular, it proves to be more effective and efficient on large-scale datasets.

Yue Lv, Wengang Zhou, Qi Tian, Houqiang Li
SeqSense: Video Recommendation Using Topic Sequence Mining

This paper examines content-based recommendation in domains exhibiting sequential topical structure. An example is educational video, including Massive Open Online Courses (MOOCs) in which knowledge builds within and across courses. Conventional content-based or collaborative filtering recommendation methods do not exploit courses’ sequential nature. We describe a system for video recommendation that combines topic-based video representation with sequential pattern mining of inter-topic relationships. Unsupervised topic modeling provides a scalable and domain-independent representation. We mine inter-topic relationships from manually constructed syllabi that instructors provide to guide students through their courses. This approach also allows the inclusion of multi-video sequences among the recommendation results. Integrating the resulting sequential information with content-level similarity provides relevant as well as diversified recommendations. Quantitative evaluation indicates that the proposed system, SeqSense, recommends fewer redundant videos than baseline methods, and instead emphasizes results consistent with mined topic transitions.

Chidansh Bhatt, Matthew Cooper, Jian Zhao
ShapeCreator: 3D Shape Generation from Isomorphic Datasets Based on Autoencoder

With the development of 3D digital geometry media, the creation of 3D content is becoming more and more important. This paper presents a novel method for 3D shape generation from isomorphic examples based on an autoencoder. A structure-aware shape representation is first built from the given examples of the same category. The representation describes the shapes in a unified manner no matter how the shape structure varies. Then, an autoencoder model is introduced to establish a bidirectional mapping between the high-dimensional representation space and a 2D latent space. This bridges the existing examples and the latent generated shapes. On the one hand, the sample data in the representation space is transferred to a lower dimension by the encoder of the autoencoder to form a latent space, and the latent space is then checked and visualized to guarantee that the created shapes are meaningful. On the other hand, the decoder of the autoencoder is able to transform new data from the latent space back to the isomorphic representation, and novel structures are constructed from the decoded representation. This scheme greatly facilitates 3D content creation. Experimental results prove the effectiveness of the proposed approach.

Yunjie Wu, Zhengxing Sun, Youcheng Song, Hongyan Li
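
As a rough sketch of the bidirectional mapping described above, the toy autoencoder below compresses a fixed-length structure-aware shape representation to a 2D latent code and decodes it back; the representation dimension and hidden layer sizes are placeholders, and only the 2D latent space comes from the abstract.

```python
import torch
import torch.nn as nn

class ShapeAutoencoder(nn.Module):
    """Maps a fixed-length shape representation to a 2D latent space and back."""
    def __init__(self, repr_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(repr_dim, 256), nn.ReLU(),
            nn.Linear(256, 2))          # 2D latent space, easy to visualize
        self.decoder = nn.Sequential(
            nn.Linear(2, 256), nn.ReLU(),
            nn.Linear(256, repr_dim))   # back to the isomorphic representation

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# New shapes can be generated by decoding points sampled from the 2D latent space.
```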
Source Distortion Estimation for Wyner-Ziv Distributed Video Coding

Distributed video coding (DVC), which can move the computational complexity burden from the encoder to the decoder, is an effective source coding paradigm for promising video applications over wireless networks, e.g. wireless video surveillance and wireless video sensor networks. For these video applications, it is crucial to provide an efficient way to accurately assess the quality of the reconstructed videos. However, due to the absence of the original frames at the decoder, estimating the reconstructed video quality of DVC remains a challenging task. In this paper, we propose a source distortion estimation method for DVC, in which the distortion incurred by quantization and reconstruction is taken into account. Focusing on the statistical distortion of a transformed coefficient in each Wyner-Ziv (WZ) frame, the proposed method measures the average distortion of WZ frames using only the coding information available at the decoder, i.e. the coefficients of the side information (SI) frames and the decoded coefficients output by the low-density parity-check (LDPC) decoder. In addition, we propose an estimation algorithm for the probability distribution parameters that uses an approximation principle to handle the case where all the coefficients of a sub-band are zero. Experiments have been conducted to validate the accuracy of our estimation method. Since it does not require the original WZ frames at the decoder, the presented method is suitable for real-time video applications.

Zhenhua Tang, Sunguo Huang, Hongbo Jiang
SRN: The Movie Character Relationship Analysis via Social Network

Video character relationship mining, as a kind of video semantic analysis, has become a hot topic. Based on the temporal and spatial context and the semantic information of video scenes, this paper proposes a method for exploiting character co-occurrence relationships and constructs the characters’ social network, called SRN, from the quantitative relationships among the characters. Based on the SRN network, we can overcome the limitations of traditional feature-based approaches and carry out more in-depth video semantic analysis. By analyzing the characters in the SRN network, a community identification method based on core characters is proposed, which can automatically identify the core characters in a video and discover the communities around them. We conduct experiments on a large number of movie videos, and the experimental results show the effectiveness of the SRN model and the community identification method.

Jingmeng He, Yuxiang Xie, Xidao Luan, Lili Zhang, Xin Zhang
The Long Tail of Web Video

Web video continues to gain importance not only in many areas of computer science but in society in general. With the growth in the numbers of videos, viewers, and views, several technical challenges arise. In order to address them effectively, the properties of web video in general need to be known. There is, however, comparatively little analysis of these properties. In this paper, we present insights gained from the analysis of a dataset containing the metadata of over 100 million videos from YouTube. We were able to confirm common wisdom about the relationship between video duration and user engagement and show the extremely long tail of the distribution of video views overall. Such data can be beneficial for making informed decisions regarding strategies for large-scale video storage, delivery, processing and retrieval.

Luca Rossetto, Heiko Schuldt
Vehicle Semantics Extraction and Retrieval for Long-Term Carpark Video Surveillance

Carpark video surveillance data provides plenty of semantically rich information such as vehicle color, trajectory, speed, and type, which can be tapped into and extracted for video and data analytics. We present methods for extracting and retrieving color and motion semantics from long-term carpark video surveillance. This is a challenging task in outdoor scenarios due to ever-changing illumination and weather conditions, while retrieval time also increases as the data size grows. To address these challenges, we subdivide the search space into smaller chunks by introducing spatio-temporal cubes, or atoms, which can store and retrieve these semantics with ease. The proposed method was tested on 2 days of continuous data from an outdoor carpark under various lighting and weather conditions. We report the precision, recall and $F_1$ scores to determine the overall performance of the system.

Clarence Weihan Cheong, Ryan Woei-Sheng Lim, John See, Lai-Kuan Wong, Ian K. T. Tan, Azrin Aris
Venue Prediction for Social Images by Exploiting Rich Temporal Patterns in LBSNs

Location (or equivalently, “venue”) is a crucial facet of user-generated images in social media (aka social images) for describing the events of people’s daily lives. While many existing works focus on predicting the venue category based on image content, we tackle the grand challenge of predicting the specific venue of a social image. Simply using the visual content of a social image is insufficient for this purpose due to its high diversity. In this work, we leverage users’ check-in histories in location-based social networks (LBSNs), which contain rich temporal movement patterns, to complement the limitations of using visual signals alone. In particular, we explore the transition patterns of successive check-ins and the periodical patterns of venue categories from users’ check-in behavior in Foursquare. For example, users tend to check in to nearby cinemas after having meals at a restaurant (transition patterns), and frequently check in to churches every Sunday morning (periodical patterns). To incorporate such rich temporal patterns into the venue prediction process, we propose a generic embedding model that fuses the visual signal from image content and various temporal signals from LBSN check-in histories. We conduct extensive experiments on Instagram social images, demonstrating that by properly leveraging the temporal patterns latent in Foursquare check-ins, we can significantly boost the accuracy of venue prediction.

Jingyuan Chen, Xiangnan He, Xuemeng Song, Hanwang Zhang, Liqiang Nie, Tat-Seng Chua

Demonstrations

Frontmatter
A Virtual Reality Interface for Interactions with Spatiotemporal 3D Data

Traditional interfaces for interacting with 3D models in virtual environments lack support for spatiotemporal 3D models such as point clouds and meshes generated by markerless capture systems. We present a virtual reality (VR) interface that enables the user to perform spatial and temporal interactions with spatiotemporal 3D models. To accommodate the high volume of spatiotemporal data, we provide a data format for spatiotemporal 3D models that achieves an average speedup of 3.84 and a space reduction of 43.9% over traditional model file formats. We enable the user to manipulate spatiotemporal 3D data using gestures that are intuitive from real-world experience, or through a VR user interface similar to traditional 2D visual interactions.

Hunter Quant, Sean Banerjee, Natasha Kholgade Banerjee
ActionVis: An Explorative Tool to Visualize Surgical Actions in Gynecologic Laparoscopy

Appropriate visualization of endoscopic surgery recordings has a huge potential to benefit surgical work life. For example, it enables surgeons to quickly browse medical interventions for purposes of documentation, medical research, discussion with colleagues, and training of young surgeons. Current literature on automatic action recognition for endoscopic surgery covers domains where surgeries follow a standardized pattern, such as cholecystectomy. However, there is a lack of support in domains where such standardization is not possible, such as gynecologic laparoscopy. We provide ActionVis, an interactive tool enabling surgeons to quickly browse endoscopic recordings. Our tool analyzes the results of a post-processing of the recorded surgery. Information on individual frames is aggregated temporally into a set of scenes representing frequent surgical actions in gynecologic laparoscopy, which helps surgeons navigate within endoscopic recordings in this domain.

Stefan Petscharnig, Klaus Schoeffmann
AR DeepCalorieCam: An iOS App for Food Calorie Estimation with Augmented Reality

A food photo generally includes several kinds of food dishes. In order to recognize multiple dishes in a food photo, we need to detect each dish in the food image. Meanwhile, in recent years, the accuracy of object detection has improved drastically with the advent of convolutional neural networks (CNNs). In this demo, we present two automatic calorie estimation apps, DeepCalorieCam and AR DeepCalorieCam, running on iOS. DeepCalorieCam can estimate food calories by detecting dishes in the video stream captured by the built-in camera of an iPhone. We use YOLOv2 [1], a state-of-the-art CNN-based object detector, as a dish detector to detect each dish in a food image, and the food calories of each detected dish are estimated by image-based food calorie estimation [2, 3]. AR DeepCalorieCam combines calorie estimation with augmented reality (AR); it is an AR version of DeepCalorieCam.

Ryosuke Tanno, Takumi Ege, Keiji Yanai
Auto Accessory Segmentation and Interactive Try-on System

The convenience and diversity of online shopping make many consumers willing to buy apparel or accessories on the web. In order to make products more attractive to users, many virtual try-on systems have been developed for e-commerce applications. This paper proposes an interactive virtual try-on system combined with automatic accessory segmentation. Our system automatically extracts hats from images and stores them in the try-on system for subsequent selection by users. When a user selects the hat that he or she wants to try on, the hat is placed in the proper position on the user in the image. In the accessory segmentation stage, we perform background elimination and super-pixel segmentation. Based on the color information of the hat image, a feature vector generated from the color histogram is used to select the super-pixels that belong to the accessory. In the try-on stage, we use Kinect, which provides skeleton information, to track the user’s face and gestures. When the user selects a hat, the proposed system reads the corresponding hat information and places the hat in the appropriate location based on the results of the face tracking. The proposed try-on system reaches a real-time speed of 30 fps on a personal computer.

Yi-Xuan Zeng, Yu-Hang Kuo, Hsu-Yung Cheng
Automatic Smoke Classification in Endoscopic Video

Medical smoke evacuation systems enable proper, filtered removal of toxic fumes during surgery, while stabilizing internal pressure during endoscopic interventions. Typically activated manually, they, however, are prone to inefficient utilization: tardy activation enables smoke to interfere with ongoing surgeries and late deactivation wastes precious resources. In order to address such issues, in this work we demonstrate a vision-based tool indicating endoscopic smoke – a first step towards automatic activation of said systems and avoiding human misconduct. In the back-end we employ a pre-trained convolutional neural network (CNN) model for distinguishing images containing smoke from others.

Andreas Leibetseder, Manfred Jürgen Primus, Klaus Schoeffmann
Depth Representation of LiDAR Point Cloud with Adaptive Surface Patching for Object Classification

Object segmentation and classification from light detection and ranging (LiDAR) point clouds are increasingly important in 3D mapping and autonomous mobile systems. Even though distance measurement and object localization from laser pulses are more accurate and robust to environmental variations than from an image, the reflected points in each frame are sparse and lack semantic information. An appropriate representation that can extract object characteristics from a single-frame point cloud is important for segmenting a moving object before it leaves a trail in the reconstruction. We propose depth projection and an adaptive surface patch to extract and emphasize the shape, curves, and some texture of the object point cloud for classification. The projection plane is based on the sensor position to ensure that the projected image contains fine details of the object surface. An adaptive surface patch is used to construct an object surface from a sparse point cloud at any distance. The experimental results indicate that this object representation can be used to classify an object by means of an existing image classification method [1].

Kanokphan Lertniphonphan, Satoshi Komorita, Kazuyuki Tasaka, Hiromasa Yanagihara
ImageX - Explore and Search Local/Private Images

In this paper we present a system to visually explore and search large sets of untagged images, running on common operating systems and consumer hardware. High-quality image descriptors are computed using the activations of a convolutional neural network. By applying normalization and a principal component analysis to the activations, compact feature vectors of only 64 bytes are generated. The L1-distances for these feature vectors can be calculated very fast using a novel computation approach, which allows search-by-example queries to be processed in fractions of a second. We further show how entire image collections can be transferred into hierarchical image graphs and describe a scheme to explore this complex data structure in an intuitive way. To enable keyword search for untagged images, reference features for common keywords are generated. These features are constructed by collecting and clustering example images from the web.

Nico Hezel, Kai Uwe Barthel, Klaus Jung
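
The 64-byte descriptors and fast L1 matching described above might look roughly like the following sketch; the specific normalization, the byte quantization scheme, and the assumption that the PCA basis (`components`, `mean`) was fitted offline are illustrative choices, not the authors' exact pipeline.

```python
import numpy as np

def compress_features(activations, components, mean):
    """Project CNN activations to 64 dimensions and quantize each value to one
    byte, giving 64-byte descriptors. `components` has shape (64, dim) and
    `mean` shape (dim,); both are assumed to come from an offline PCA fit.
    """
    x = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    x = (x - mean) @ components.T                        # PCA to 64 dimensions
    x = np.clip(x * 127.0 / np.abs(x).max() + 128, 0, 255)  # crude byte quantization
    return x.astype(np.uint8)

def l1_distances(query, database):
    """L1 distances between one 64-byte query and all database descriptors."""
    diff = np.abs(database.astype(np.int16) - query.astype(np.int16))
    return diff.sum(axis=1)
```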
Lifelog Exploration Prototype in Virtual Reality

Efficiently exploring large lifelog datasets is the subject of much research in the lifelogging community. In this paper we describe a pioneering lifelog interaction prototype developed for virtual reality. This prototype was created as part of a larger research effort to explore the feasibility and potential of exploring visual lifelogs in virtual environments. We describe the prototype and its design.

Aaron Duane, Cathal Gurrin
Multi-camera Microenvironment to Capture Multi-view Time-Lapse Videos for 3D Analysis of Aging Objects

We present a microenvironment of multiple cameras to capture multi-viewpoint time-lapse videos of objects showing spatiotemporal phenomena such as aging. Our microenvironment consists of four synchronized Raspberry Pi v2 cameras triggered by four corresponding Raspberry Pi v3 computers that are controlled by a central computer. We provide a graphical user interface for users to trigger captures and visualize multiple viewpoint videos. We show multiple viewpoint captures for objects such as fruit that depict shape changes due to water volume loss and appearance changes due to enzymatic browning.

Lintao Guo, Hunter Quant, Nikolas Lamb, Benjamin Lowit, Natasha Kholgade Banerjee, Sean Banerjee
Ontlus: 3D Content Collaborative Creation via Virtual Reality

3D content creation is usually done with commercial software like Maya and 3ds Max, or open-source software like Blender. However, most current 3D modeling software edits content via a 2D screen, which is neither efficient nor intuitive. In this work, we develop a virtual reality (VR) system in which users can not only create 3D content easily and smoothly but also collaborate with other users in the VR environment to accomplish their 3D design projects. We use the HTC VIVE to display the immersive design environment and realize the interaction between the user and the 3D content. Through the VIVE controllers, users can perform painting, sculpting, coloring, transformation, and multi-editor collaboration in the 3D virtual space. Compared with traditional 3D modeling software, directly editing 3D content in the 3D VR environment is more user-friendly.

Chien-Wen Chen, Jain-Wei Peng, Chia-Ming Kuo, Min-Chun Hu, Yuan-Chi Tseng
Programmatic 3D Printing of a Revolving Camera Track to Automatically Capture Dense Images for 3D Scanning of Objects

Low-cost 3D scanners and automatic photogrammetry software have brought the digitization of objects into 3D models within reach of the consumer. However, these digitization techniques are either tedious, disruptive to the scanned object, or expensive. We create a novel 3D scanning system using consumer-grade hardware that revolves a camera around the object of interest. Our approach does not disturb the object during capture and allows us to scan delicate objects that can deform under motion, such as potted plants. Our system consists of a Raspberry Pi camera and computer, a stepper motor, a 3D printed camera track, and control software. Our 3D scanner allows the user to gather image sets for 3D model reconstruction using photogrammetry software with minimal effort. We scale 3D scanning to objects of varying sizes by designing our scanner using programmatic modeling, allowing the user to change the physical dimensions of the scanner without redrawing each part.

Nikolas Lamb, Natasha Kholgade Banerjee, Sean Banerjee
Video Browsing on a Circular Timeline

The emerging ubiquity of videos in all aspects of society demands innovative and efficient browsing and navigation mechanisms. We propose a novel visualization and interaction paradigm that replaces the traditional linear timeline with a circular timeline. The main advantages of this new concept are (1) significantly increased and dynamic navigation granularity, (2) minimized spatial distances between arbitrary points on the timeline, as well as (3) the possibility to efficiently utilize the screen space for bookmarks or other supplemental information associated with points of interest. The demonstrated prototype implementation proves the expedience of this new concept and includes additional navigation and visualization mechanisms, which altogether create a powerful video browser.

Bernd Münzer, Klaus Schoeffmann

Video Browser Showdown

Frontmatter
Competitive Video Retrieval with vitrivr

This paper presents the competitive video retrieval capabilities of vitrivr. The vitrivr stack is the continuation of the IMOTION system, which has participated in the Video Browser Showdown competitions since 2015. The primary focus of vitrivr and its participation in this competition is to simplify and generalize the system’s individual components, making them easier to deploy and use. The entire vitrivr stack is made available as open source software.

Luca Rossetto, Ivan Giangreco, Ralph Gasser, Heiko Schuldt
Enhanced VIREO KIS at VBS 2018

The VIREO Known-Item Search (KIS) system joined the Video Browser Showdown (VBS) [1] evaluation benchmark for the first time in 2017. Building on the experience gained, the second version of VIREO KIS is presented in this paper. For color-sketch based retrieval, we propose a simple grid-based approach to color queries. This method allows the aggregation of color distributions in video frames into a shot representation and generates pre-computed rank lists for all available queries, which reduces computational resources and favors a recommendation module. For concept-based retrieval, we modify the multimedia event detection system from TRECVID 2015 used in VIREO KIS 2017. This year, the concept bank of VIREO KIS has been upgraded to 14K concepts. An adaptive concept selection, combination and expansion mechanism, which assists the user in picking the right concepts and logically combining them to form more expressive queries, has been developed. In addition, metadata is included for textual queries, and some interface designs have been revised to provide a flexible view of the results to the user.

Phuong Anh Nguyen, Yi-Jie Lu, Hao Zhang, Chong-Wah Ngo
Fusing Keyword Search and Visual Exploration for Untagged Videos

Video collections often cannot be searched by keywords because most videos are poorly annotated. We present a system that allows untagged videos to be searched by sketches, example images and keywords. Having analyzed the most frequent search terms and the corresponding images from the Pixabay stock photo agency, we derived visual features that allow searching for 20,000 keywords. For each keyword we use several image features to be able to cope with large visual and conceptual variations. As the intention of a user searching for an image is unknown, we retrieve thousands of result images (video scenes), which are shown as a visually sorted hierarchical image map. The user can easily find images of interest by dragging and zooming. The visual arrangement of the images is performed with an improved version of a self-sorting map, which allows organizing thousands of images in fractions of a second. If an image similar to the search query has been found, further zooming will show more related images, retrieved from a precomputed image graph. The new approach helps to find untagged images very quickly in an exploratory, incremental way.

Kai Uwe Barthel, Nico Hezel, Klaus Jung
Revisiting SIRET Video Retrieval Tool

The known-item and ad-hoc video search tasks still represent challenging problems for the video retrieval community. In recent years, the Video Browser Showdown has identified several promising approaches that can improve the effectiveness of interactive video retrieval tools focusing on these tasks. We present a major revision of the SIRET interactive video retrieval tool that follows these findings. The new version employs three different query initialization approaches and provides several result visualization methods for effective navigation and browsing in sets of ranked keyframes.

Jakub Lokoč, Gregor Kovalčík, Tomáš Souček
Sketch-Based Similarity Search for Collaborative Feature Maps

Past editions of the annual Video Browser Showdown (VBS) event have brought forward many tools targeting a diverse set of techniques for interactive video search, among which sketch-based search showed promising results. Aiming to explore this direction further, we present a custom approach for tackling the problem of finding similarities in the TRECVID IACC.3 dataset via hand-drawn pictures, using color compositions together with contour matching. The proposed methodology is integrated into the established Collaborative Feature Maps (CFM) system, which was first utilized in the VBS 2017 challenge.

Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann
Sloth Search System

In this paper, we present the Sloth Search System (SSS) for large-scale video browsing. Our key concept is to apply object recognition and scene classification to generate keyword tags from video images. This indexing process is performed only on selected frames for faster processing. The keyword tags are used to retrieve videos from a text-based query. Additional feature signatures are also used to extract spatial and color information. These proposed signatures are stored as binary codes for a compact representation and fast search. Such a representation allows users to search by drawing a sketch or a bounding box of a specific object.

Sitapa Rujikietgumjorn, Nattachai Watcharapinchai, Sanparith Marukatat
The ITEC Collaborative Video Search System at the Video Browser Showdown 2018

We present our video search system for the Video Browser Showdown (VBS) 2018 competition. It is based on the collaborative system used in 2017, which already performed well but also revealed high potential for improvement. Hence, based on our experience we introduce several major improvements, particularly (1) a strong optimization of similarity search, (2) various improvements for concept-based search, (3) a new flexible video inspector view, and (4) extended collaboration features, as well as numerous minor adjustments and enhancements, mainly concerning the user interface and means of user interaction. Moreover, we present a spectator view that visualizes the current activity of the team members to the audience to make the competition more attractive.

Manfred Jürgen Primus, Bernd Münzer, Andreas Leibetseder, Klaus Schoeffmann
VERGE in VBS 2018

This paper presents the VERGE interactive video retrieval engine, which is capable of browsing and searching video content. The system integrates several content-based analysis and retrieval modules including concept detection, clustering, visual and textual similarity search, query analysis and reranking, as well as multimodal fusion.

Anastasia Moumtzidou, Stelios Andreadis, Foteini Markatopoulou, Damianos Galanopoulos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris, Ioannis Patras
Video Search Based on Semantic Extraction and Locally Regional Object Proposal

In this paper, we propose a semantic concept-based video browsing system that mainly exploits the spatial information of both object and action concepts. In a video frame, we soft-assign each locally regional object proposal to the cells of a grid. For action concepts, we also collect a dataset with about 100 actions. In many cases, actions can be predicted from a still image, not necessarily from a video shot. Therefore, we consider actions as object concepts and use a deep neural network based on the YOLO detector for action detection. Moreover, instead of densely extracting concepts from a video shot, we focus on high-saliency objects and remove noisy concepts. To further improve interaction, we develop a color-based sketch board to quickly remove irrelevant shots and an instant search panel to improve the recall of the system. Finally, metadata, such as the video’s title and summary, is integrated into our system to boost its precision and recall.

Thanh-Dat Truong, Vinh-Tiep Nguyen, Minh-Triet Tran, Trang-Vinh Trieu, Tien Do, Thanh Duc Ngo, Dinh-Duy Le
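
The soft assignment of an object proposal to grid cells mentioned above could, for instance, weight each cell by its overlap area with the proposal box; the grid size and the area-based weighting in this sketch are assumptions, not details from the paper.

```python
import numpy as np

def soft_assign_box(box, img_w, img_h, grid=(4, 4)):
    """Distribute one detected object box over the cells of a spatial grid in
    proportion to the overlap area, producing a per-concept spatial map.
    Grid size and area-based weighting are illustrative assumptions.
    """
    rows, cols = grid
    cell_w, cell_h = img_w / cols, img_h / rows
    x1, y1, x2, y2 = box
    assignment = np.zeros(grid)
    for r in range(rows):
        for c in range(cols):
            cx1, cy1 = c * cell_w, r * cell_h
            cx2, cy2 = cx1 + cell_w, cy1 + cell_h
            inter = max(0.0, min(x2, cx2) - max(x1, cx1)) * \
                    max(0.0, min(y2, cy2) - max(y1, cy1))
            assignment[r, c] = inter
    total = assignment.sum()
    return assignment / total if total > 0 else assignment
```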
Backmatter
Metadata
Title
MultiMedia Modeling
Edited by
Prof. Dr. Klaus Schoeffmann
Thanarat H. Chalidabhongse
Chong Wah Ngo
Prof. Dr. Supavadee Aramvith
Noel E. O’Connor
Yo-Sung Ho
Moncef Gabbouj
Prof. Ahmed Elgammal
Copyright year
2018
Electronic ISBN
978-3-319-73600-6
Print ISBN
978-3-319-73599-3
DOI
https://doi.org/10.1007/978-3-319-73600-6
