
2019 | Book

MultiMedia Modeling

25th International Conference, MMM 2019, Thessaloniki, Greece, January 8–11, 2019, Proceedings, Part II

Edited by: Ioannis Kompatsiaris, Dr. Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, Dr. Stefanos Vrochidis

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

The two-volume set LNCS 11295 and 11296 constitutes the thoroughly refereed proceedings of the 25th International Conference on MultiMedia Modeling, MMM 2019, held in Thessaloniki, Greece, in January 2019.

Of the 172 submitted full papers, 49 were selected for oral presentation and 47 for poster presentation; in addition, 6 demonstration papers, 5 industry papers, 6 workshop papers, and 6 papers for the Video Browser Showdown 2019 were accepted. All papers presented were carefully reviewed and selected from 204 submissions.

Table of Contents

Frontmatter

Regular and Special Session Papers

Frontmatter
Photo-Realistic Facial Emotion Synthesis Using Multi-level Critic Networks with Multi-level Generative Model

In this paper, we propose photo-realistic facial emotion synthesis using a novel multi-level critic network with a multi-level generative model. We devise a new facial emotion generator containing the proposed multi-level decoder to synthesize a facial image with a desired variation. The proposed multi-level decoder and multi-level critic network help the generator produce a photo-realistic and variation-realistic facial image in generative adversarial learning. The multi-level critic network consists of two discriminators: a photo-realistic discriminator and a variation-realistic discriminator. The photo-realistic discriminator determines whether the multi-resolution facial image generated from the latent feature of the multi-level decoding module is photo-realistic or not. The variation-realistic discriminator determines whether the multi-resolution facial image has natural variation or not. Experimental results show that the proposed facial emotion synthesis method outperforms existing methods in terms of both qualitative performance and quantitative expression recognition performance.

Minho Park, Hak Gu Kim, Yong Man Ro
Adaptive Alignment Network for Person Re-identification

Person re-identification aims at identifying a target pedestrian across non-overlapping camera views. Pedestrian misalignment, which mainly arises from inaccurate person detection and pose variations, is a critical challenge for person re-identification. To address this, this paper proposes a new Adaptive Alignment Network (AAN), towards robust and accurate person re-identification. AAN automatically aligns pedestrian images from coarse to fine by learning both patch-wise and pixel-wise alignments, leading to effective pedestrian representation invariant to the variance of human pose and location across images. In particular, AAN consists of a patch alignment module, a pixel alignment module and a base network. The patch alignment module estimates the alignment offset for each image patch and performs patch-wise alignment with the offsets. The pixel alignment module is for fine-grained pixel-wise alignment. It learns the subtle local offset for each pixel and produces finely aligned feature map. Extensive experiments on three benchmarks, i.e., Market1501, DukeMTMC-reID and MSMT17 datasets, have demonstrated the effectiveness of the proposed approach.

Xierong Zhu, Jiawei Liu, Hongtao Xie, Zheng-Jun Zha
Visual Urban Perception with Deep Semantic-Aware Network

Visual urban perception has received a lot of attention for its importance in many fields. In this paper we transform it into a ranking task by pairwise comparison of images, and use deep neural networks to predict the perceptual score of each image. In contrast to existing research, we highlight the important role of object semantic information in visual urban perception through the attribute activation maps of images. Based on this concept, we combine object semantic information with generic image features in our method. In addition, we use visualization techniques to obtain the correlations between objects and visual perception attributes from the well-trained neural network, which further supports our conjecture. Experimental results on a large-scale dataset validate the effectiveness of our method.

Yongchao Xu, Qizheng Yang, Chaoran Cui, Cheng Shi, Guangle Song, Xiaohui Han, Yilong Yin
Deep Reinforcement Learning for Automatic Thumbnail Generation

An automatic thumbnail generation method based on deep reinforcement learning (called RL-AT) is proposed in this paper. Differing from previous saliency-based and deep learning-based methods, which predict the location and size of a rectangular region, our method models thumbnail generation as predicting a rectangular region by cutting along its four edges. We cast the thumbnail cutting operations as a four-step Markov decision process in the framework of deep reinforcement learning. The best crop location in each cutting step is learned using a deep Q-network. The deep Q-network observes the current image, selects an action from the action space, and then receives feedback on the selected action as a reward. The action space and reward function are specifically designed for the thumbnail generation problem. A dataset with more than 70,000 thumbnail annotations is used to train our RL-AT model. The RL-AT model can efficiently generate thumbnails with low computational complexity, needing only 0.09 s to generate a thumbnail image. Experiments show that our RL-AT model outperforms related methods in thumbnail generation.
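To make the four-step cutting formulation concrete, the sketch below shows how a greedy rollout of such a policy could look, assuming a trained Q-network. The callable `q_network`, the candidate cut fractions and the edge ordering are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def generate_thumbnail(image, q_network, cut_fractions=np.linspace(0.0, 0.3, 11)):
    """Four-step cropping loop over a H x W (x C) NumPy image: at each step the
    agent cuts along one edge (left, top, right, bottom) at a position chosen
    greedily from Q-values. `q_network(observation, step)` is a hypothetical
    callable returning one value per candidate cut; the fraction grid is an assumption."""
    h, w = image.shape[:2]
    left, top, right, bottom = 0, 0, w, h
    for step in range(4):                        # one decision per edge of the rectangle
        crop = image[top:bottom, left:right]     # observation = current cropped image
        action = int(np.argmax(q_network(crop, step)))
        frac = cut_fractions[action]             # fraction of the current side to cut away
        if step == 0:
            left += int(frac * (right - left))
        elif step == 1:
            top += int(frac * (bottom - top))
        elif step == 2:
            right -= int(frac * (right - left))
        else:
            bottom -= int(frac * (bottom - top))
    return image[top:bottom, left:right]
```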

Zhuopeng Li, Xiaoyan Zhang
3D Object Completion via Class-Conditional Generative Adversarial Network

Many robotic tasks require accurate shape models in order to properly grasp or interact with objects. However, it is often the case that sensors produce incomplete 3D models due to several factors such as occlusion or sensor noise. To address this problem, we propose a semi-supervised method that can recover the complete shape of a broken or incomplete 3D object model. We formulate a hybrid of a 3D variational autoencoder (VAE) and a generative adversarial network (GAN) to recover the complete voxelized 3D object. Furthermore, we incorporate a separate classifier in the GAN framework, making it a three-player game instead of two, which helps stabilize the training of the GAN and guides the shape completion process to follow the object class labels. Our experiments show that our model produces 3D object reconstructions with high similarity to the ground truth and outperforms several baselines in both quantitative and qualitative evaluations.

Yu-Chieh Chen, Daniel Stanley Tan, Wen-Huang Cheng, Kai-Lung Hua
Video Summarization with LSTM and Deep Attention Models

In this paper we propose two video summarization models based on the recently proposed vsLSTM and dppLSTM deep networks, which allow modeling of frame relevance and similarity. The proposed deep learning architectures additionally incorporate an attention mechanism to model user interest. The proposed models are compared to the original ones in terms of prediction accuracy and computational complexity. The proposed vsLSTM+Att method with an attention model outperforms the original methods when evaluated on common public datasets. Additionally, results obtained on a real video dataset containing terrorist-related content are provided to highlight the challenges faced in real-life applications. The proposed method yields outstanding results in this complex scenario when compared to the original methods.

Luis Lebron Casas, Eugenia Koblents
Challenges in Audio Processing of Terrorist-Related Data

Much information in multimedia data related to terrorist activity can be extracted from the audio content. Our work in ongoing projects aims to provide a complete description of the audio portion of multimedia documents. The information that can be extracted can be derived from diarization, classification of acoustic events, language and speaker segmentation and clustering, as well as automatic transcription of the speech portions. An important consideration is ensuring that the audio processing technologies are well suited to the types of data of interest to the law enforcement agencies. While language identification and speech recognition may be considered as ’mature technologies’, our experience is that even state-of-the-art systems require customisation and enhancements to address the challenges of terrorist-related audio documents.

Jodie Gauvain, Lori Lamel, Viet Bac Le, Julien Despres, Jean-Luc Gauvain, Abdel Messaoudi, Bianca Vieru, Waad Ben Kheder
Identifying Terrorism-Related Key Actors in Multidimensional Social Networks

Identifying terrorism-related key actors in social media is of vital significance for law enforcement agencies and social media organizations in their effort to counter terrorism-related online activities. This work proposes a novel framework for the identification of key actors in multidimensional social networks formed by considering several different types of user relationships/interactions in social media. The framework is based on a mechanism which maps the multidimensional network to a single-layer network, where several centrality measures can then be employed for detecting the key actors. The effectiveness of the proposed framework for each centrality measure is evaluated by using well-established precision-oriented evaluation metrics against a ground truth dataset, and the experimental results indicate the promising performance of our key actor identification framework.
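As an illustration of the general idea of collapsing a multidimensional network into a single-layer graph and ranking actors by centrality, the sketch below builds a weighted single-layer graph with NetworkX and ranks nodes by PageRank; the layer-weighting scheme and the choice of centrality measure are assumptions, not the paper's specific mechanism.

```python
import networkx as nx

def key_actors(layers, layer_weights, top_k=10):
    """Collapse a multidimensional social network into one weighted single-layer
    graph and rank actors by a centrality measure. `layers` maps a relationship
    type (e.g. 'mentions', 'retweets') to a list of (user, user) edges."""
    g = nx.Graph()
    for name, edges in layers.items():
        w = layer_weights.get(name, 1.0)
        for u, v in edges:
            prev = g.get_edge_data(u, v, default={"weight": 0.0})["weight"]
            g.add_edge(u, v, weight=prev + w)   # interactions across layers reinforce the edge
    ranking = nx.pagerank(g, weight="weight")   # any centrality measure could be plugged in here
    return sorted(ranking, key=ranking.get, reverse=True)[:top_k]
```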

George Kalpakis, Theodora Tsikrika, Stefanos Vrochidis, Ioannis Kompatsiaris
Large Scale Audio-Visual Video Analytics Platform for Forensic Investigations of Terroristic Attacks

The forensic investigation of a terrorist attack poses a huge challenge to the investigative authorities, as several thousand hours of video footage need to be sifted through. To assist law enforcement agencies (LEAs) in identifying suspects and securing evidence, we present a platform which fuses information from surveillance cameras and video uploads from eyewitnesses. The platform integrates analytical modules for different input modalities on a scalable architecture. Videos are analyzed according to their acoustic and visual content. Specifically, Audio Event Detection is applied to index the content according to attack-specific acoustic concepts. Audio similarity search is utilized to identify similar video sequences recorded from different perspectives. Visual object detection and tracking are used to index the content according to relevant concepts. The heterogeneous results of the analytical modules are fused into a distributed index of visual and acoustic concepts to facilitate a rapid start of investigations, following leads and investigating witness reports.

Alexander Schindler, Martin Boyer, Andrew Lindley, David Schreiber, Thomas Philipp
A Semantic Knowledge Discovery Framework for Detecting Online Terrorist Networks

This paper presents a knowledge discovery framework, with the purpose of detecting terrorist presence in terms of potential suspects and networks on the open and Deep Web. The framework combines information extraction methods and tools and natural language processing techniques, together with semantic information derived from social network analysis, in order to automatically process online content coming from disparate sources and identify people and relationships that may be linked to terrorist activities. This framework has been developed within the context of the DANTE Horizon 2020 project, as part of a larger international effort to detect and analyze terrorist-related content from online sources and help international police organizations in their investigations against crime and terrorism.

Andrea Ciapetti, Giulia Ruggiero, Daniele Toti
A Reliability Object Layer for Deep Hashing-Based Visual Indexing

Nowadays, time-efficient search and retrieval of visually similar content has emerged as a great necessity, while at the same time it constitutes an outstanding research challenge. This is further reinforced by the fact that millions of images and videos are generated on a daily basis. In this context, deep hashing techniques, which aim at estimating a very low dimensional binary vector for characterizing each image, have been introduced for realizing realistically fast visual search tasks. In this paper, a novel approach to deep hashing is proposed, which explicitly takes into account information about the object types that are present in the image. To achieve this, a novel layer is introduced on top of current Neural Network (NN) architectures that aims to generate a reliability mask, based on image semantic segmentation information. Thorough experimental evaluation, using four datasets, shows that incorporating local-level information during the hash code learning phase significantly improves the similarity retrieval results, compared to state-of-the-art approaches.

Konstantinos Gkountakos, Theodoros Semertzidis, Georgios Th. Papadopoulos, Petros Daras
Spectral Tilt Estimation for Speech Intelligibility Enhancement Using RNN Based on All-Pole Model

Speech intelligibility enhancement is extremely meaningful for successful speech communication in noisy environments. Several methods based on the Lombard effect are used to increase intelligibility. In those methods, spectral tilt has been suggested to be a significant characteristic for producing Lombard speech that is more intelligible than normal speech. All-pole models computed by some methods have been used to capture the accurate spectral tilt of high-quality speech, but they are not appropriate for the spectral tilt estimation of telephone speech. In this paper, recurrent neural networks (RNNs) are used to estimate the tilt of telephone speech in German and English. RNN-based spectral tilt estimation shows robustness to changes in the all-pole model order and phonation type for narrowband and wideband speech. The mean squared error (MSE) of spectral tilt estimation using the RNN-based method improves by about 26.20% for narrowband speech and 19.49% for wideband speech compared to the DNN-based measure.

Rui Zhang, Ruimin Hu, Gang Li, Xiaochen Wang
Multi-channel Convolutional Neural Networks with Multi-level Feature Fusion for Environmental Sound Classification

Learning acoustic models directly from the raw waveform is an effective method for Environmental Sound Classification (ESC) where sound events often exhibit vast diversity in temporal scales. Convolutional neural networks (CNNs) based ESC methods have achieved the state-of-the-art results. However, their performance is affected significantly by the number of convolutional layers used and the choice of the kernel size in the first convolutional layer. In addition, most existing studies have ignored the ability of CNNs to learn hierarchical features from environmental sounds. Motivated by these findings, in this paper, parallel convolutional filters with different sizes in the first convolutional layer are designed to extract multi-time resolution features aiming at enhancing feature representation. Inspired by VGG networks, we build our deep CNNs by stacking 1-D convolutional layers using very small filters except for the first layer. Finally, we extend the model using multi-level feature aggregation technique to boost the classification performance. The experimental results on Urbansound 8k, ESC-50, and ESC-10 show that our proposed method outperforms the state-of-the-art end-to-end methods for environmental sound classification in terms of the classification accuracy.
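A minimal PyTorch sketch of the idea of parallel first-layer convolutions with different kernel sizes applied to the raw waveform is shown below; the kernel sizes, stride and channel counts are placeholder values, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MultiResolutionFrontEnd(nn.Module):
    """Illustrative first layer: parallel 1-D convolutions with different kernel
    sizes extract multi-time-resolution features from the raw waveform, which are
    then concatenated along the channel dimension."""
    def __init__(self, kernel_sizes=(8, 32, 128), channels=32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(1, channels, k, stride=4, padding=k // 2) for k in kernel_sizes
        ])

    def forward(self, waveform):                  # waveform: (batch, 1, samples)
        feats = [torch.relu(branch(waveform)) for branch in self.branches]
        n = min(f.shape[-1] for f in feats)       # align lengths before concatenation
        return torch.cat([f[..., :n] for f in feats], dim=1)
```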

Dading Chong, Yuexian Zou, Wenwu Wang
Audio-Based Automatic Generation of a Piano Reduction Score by Considering the Musical Structure

This study describes a method that automatically generates a piano reduction score from the audio recordings of popular music while considering the musical structure. The generated score comprises both right- and left-hand piano parts, which reflect the melodies, chords, and rhythms extracted from the original audio signals. Generating such a reduction score from an audio recording is challenging because automatic music transcription is still considered to be inefficient when the input contains sounds from various instruments. Reflecting the long-term correlation structure behind similar repetitive bars is also challenging; further, previous methods have independently generated each bar. Our approach addresses the aforementioned issues by integrating musical analysis, especially structural analysis, with music generation. Our method extracts rhythmic features as well as melodies and chords from the input audio recording and reflects them in the score. To consider the long-term correlation between bars, we use similarity matrices, created for several acoustical features, as constraints. We further conduct a multivariate regression analysis to determine the acoustical features that represent the most valuable constraints for generating a musical structure. We have generated piano scores using our method and have observed that we can produce scores that differently balance between the ability to achieve rhythmic characteristics and the ability to obtain musical structures.

Hirofumi Takamori, Takayuki Nakatsuka, Satoru Fukayama, Masataka Goto, Shigeo Morishima
Violin Timbre Navigator: Real-Time Visual Feedback of Violin Bowing Based on Audio Analysis and Machine Learning

Bowing is the main control mechanism in sound production during a violin performance. The balance among bowing parameters such as acceleration, force, velocity or bow-bridge distance continuously determines the characteristics of the sound. However, in traditional music pedagogy, approaches to teaching the mechanics of bowing are based on subjective and vague perception, rather than on an accurate understanding of the principles of bowing movement. In recent years, advances in technology have made it possible to measure bowing parameters in violin performances. However, sensing systems are generally very expensive and intrusive, and require very complex and time-consuming setups, which makes it impossible to bring them into a classroom environment. Here, we propose an algorithm that is able to estimate bowing parameters from audio analysis in real time, requiring just a microphone and a simple calibration process. Additionally, we present the Violin Palette, a prototype that uses the reported algorithm and presents bowing information in an intuitive way.

Alfonso Perez-Carrillo
The Representation of Speech in Deep Neural Networks

In this paper, we investigate the connection between how people understand speech and how speech is understood by a deep neural network. A naïve, general feed-forward deep neural network was trained for the task of vowel/consonant classification. Subsequently, the representations of the speech signal in the different hidden layers of the DNN were visualized. The visualizations allow us to study the distance between the representations of different types of input frames and observe the clustering structures formed by these representations. In the different visualizations, the input frames were labeled with different linguistic categories: sounds in the same phoneme class, sounds with the same manner of articulation, and sounds with the same place of articulation. We investigate whether the DNN clusters speech representations in a way that corresponds to these linguistic categories and observe evidence that the DNN does indeed appear to learn structures that humans use to understand speech without being explicitly trained to do so.

Odette Scharenborg, Nikki van der Gouw, Martha Larson, Elena Marchiori
Realtime Human Segmentation in Video

Human segmentation from a single image using deep learning models has achieved significant performance improvements. However, when a deep human segmentation model is directly applied to video, the performance is unsatisfactory: the segmentation results of consecutive frames are discontinuous, and the segmentation process is slow. To address these issues, we propose a new real-time video-based human segmentation framework, designed for a single person in videos, that produces smooth, accurate and fast human segmentation results. The proposed framework consists of a fully convolutional network and a tracking module based on a level set algorithm, where the fully convolutional network segments the person in the first frame of the video sequence, and the tracking module obtains the segmentation results of subsequent frames using the segmentation of the previous frame as the initialization. The fully convolutional network is trained on human image datasets. To evaluate the proposed framework, we have created and annotated a new single-person video dataset. The experimental results demonstrate accurate and smooth human segmentation at a much higher speed than using a deep human segmentation model alone.

Tairan Zhang, Congyan Lang, Junliang Xing
psDirector: An Automatic Director for Watching View Generation from Panoramic Soccer Video

Watching TV or Internet video is the most common way for people to follow soccer matches. However, it is not yet practical to extend this to amateur soccer, as professionally directing a match by hand is expensive. As an alternative, using multiple cameras to generate a panoramic video can faithfully record the match, but with a poor watching experience. In this work, we develop the psDirector system to address this dilemma. It takes the panoramic soccer video as input and outputs a corresponding watching-view counterpart, which continuously focuses on the attractive playing areas that people are interested in. The task is somewhat unique and we propose a novel pipeline to implement it. It first extracts several soccer-related semantics, i.e., the soccer field, attractive ROIs, the distribution of players and the attacking direction. Then, the semantics are utilized to produce the output video, where important match content, camera action as well as their consistency along the time axis are carefully considered to ensure the video quality. Experiments on school soccer videos show the soundness of the proposed pipeline. Moreover, psDirector generates video with a better watching experience than an existing commercial tool.

Chunyang Li, Caiyan Jia, Zhineng Chen, Xiaoyan Gu, Hongyun Bao
No-Reference Video Quality Assessment Based on Ensemble of Knowledge and Data-Driven Models

No-reference (NR) video quality assessment (VQA) aims to evaluate video distortion in line with human visual perception without referring to the corresponding pristine signal. Many methods try to design models using prior knowledge of people’s experience. It is challenging due to the underlying complexity of video content, and the relatively limited understanding of the intricate mechanisms of the human visual system. Recently, some learning-based NR-VQA methods were proposed and regarded as data driven methods. However, in many practical scenarios, the labeled data is quite limited which significantly restricts the learning ability. In this paper, we first propose a data-driven model, V-CNN. It adaptively fits spatial and temporal distortion of time-varying video content. By using a shallow neural network, the spatial part runs faster than traditional models. The temporal part is more consistent with human subjective perception by introducing temporal SSIM jitter and hysteresis pooling. We then exploit the complementarity of V-CNN and a knowledge-driven model, VIIDEO. Compared to state-of-the-art full reference, reduced reference and no reference VQA methods, the proposed ensemble model shows a better balance between performance and efficiency with limited training data.

Li Su, Pamela Cosman, Qihang Peng
Understanding Intonation Trajectories and Patterns of Vocal Notes

Unlike fixed-pitch instruments, which hold the same pitch over time, the voice requires careful regulation during each note in order to maintain a steady pitch. Previous studies have investigated singing performance with the single note as the unit of analysis, covering aspects such as intonation accuracy and pitch drift, while the pitch trajectory within notes has hardly been investigated. The aim of this paper is to study pitch variation within vocal notes and ascertain what factors influence the various parts of a note. We recorded five SATB groups (four participants per group) singing two pieces of music in three listening conditions, defined by whether or not the singers could hear the other participants. After extracting the fundamental frequency and analysing the notes by relative time and real-time duration, we observed a regular pattern among all the notes. Specifically: (1) there are transient parts at both the beginning and the end of a note, amounting to about 15–20% of the whole duration; (2) the shapes of the transient parts differ significantly according to the adjacent pitch, although all singers tend to have a descending transient at the end of a note.

Jiajie Dai, Simon Dixon
Temporal Lecture Video Fragmentation Using Word Embeddings

In this work the problem of temporally fragmenting lecture videos into meaningful parts is addressed. The visual content of a lecture video cannot be effectively used for this task because it is extremely homogeneous. We propose a new method for lecture video fragmentation that exploits only the automatically generated speech transcripts of a video. Contrary to previously proposed works that employ visual, audio and textual features and use time-consuming supervised methods requiring annotated training data, we present a method that analyses the transcripts' text with the help of word embeddings generated from pre-trained state-of-the-art neural networks. Furthermore, we address a major problem of video lecture fragmentation research, namely the lack of large-scale datasets for evaluation, by presenting a new artificially-generated dataset of synthetic video lecture transcripts that we make publicly available. Experimental comparisons document the merit of the proposed approach.
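A simple way to picture transcript-only fragmentation with word embeddings is to place boundaries where the embedding similarity between adjacent transcript windows drops, as in the sketch below; the hypothetical `embed` function, the window size and the boundary-selection rule are assumptions and not the authors' algorithm.

```python
import numpy as np

def fragment_transcript(sentences, embed, window=5, n_boundaries=10):
    """Represent each transcript sentence by an embedding, compare the mean vectors
    of the windows before and after every candidate point, and place boundaries at
    the least-similar points. `embed(sentence) -> 1-D vector` is a hypothetical
    sentence-embedding function, e.g. averaged pre-trained word vectors."""
    vecs = np.stack([embed(s) for s in sentences])
    sims = []
    for i in range(window, len(sentences) - window):
        a = vecs[i - window:i].mean(axis=0)
        b = vecs[i:i + window].mean(axis=0)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    order = np.argsort(sims)                   # lowest cosine similarity first
    return sorted(int(i) + window for i in order[:n_boundaries])
```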

Damianos Galanopoulos, Vasileios Mezaris
Using Coarse Label Constraint for Fine-Grained Visual Classification

Recognizing fine-grained categories (e.g., dog species) relies on part localization and fine-grained feature learning. However, these classification methods use fine labels and ignore the structural information between different classes. In contrast, we take this structural information into account and use it to improve fine-grained visual classification performance. In this paper, we propose a novel coarse label representation and a corresponding cost function. The coarse label representation idea comes from the category representation used in multi-label classification. This kind of coarse label representation can well express the structural information embedded in the class hierarchy, and the coarse labels are obtained simply from suffixes of the category names, or given in advance as in the CIFAR100 dataset. A new cost function is proposed to guide the fine label convergence under the constraint of coarse labels, so we can make full use of this kind of coarse label supervision to improve fine-grained visual classification. Our method can be generalized to any fine-tuning task; it does not increase the size of the original model and adds no overhead to the training time. We conduct comprehensive experiments and show that using the coarse label constraint improves performance on major fine-grained classification datasets.

Chaohao Lu, Yuexian Zou
Gated Recurrent Capsules for Visual Word Embeddings

The caption retrieval task can be defined as follows: given a set of images I and a set of describing sentences S, for each image i in I we ought to find the sentence in S that best describes i. The most commonly applied method to solve this problem is to build a multimodal space and to map each image and each sentence to that space, so that they can be compared easily. A non-conventional model called Word2VisualVec has been proposed recently: instead of mapping images and sentences to a multimodal space, they mapped sentences directly to a space of visual features. Advances in the computation of visual features let us infer that such an approach is promising. In this paper, we propose a new Recurrent Neural Network model following that unconventional approach based on Gated Recurrent Capsules (GRCs), designed as an extension of Gated Recurrent Units (GRUs). We show that GRCs outperform GRUs on the caption retrieval task. We also state that GRCs present a great potential for other applications.

Danny Francis, Benoit Huet, Bernard Merialdo
An Automatic System for Generating Artificial Fake Character Images

Due to the introduction of deep learning for text detection and recognition in natural scenes, and the growing need to detect fake images in crime applications, automatically generating fake character images has now received greater attention. This paper presents a new system named Fake Character GAN (FCGAN). It has the ability to generate fake, artificial scene characters that have shapes and colors similar to existing ones. The proposed method first extracts the shapes and colors of character images. Then, it constructs the FCGAN, which consists of a series of convolution, residual and transposed convolution blocks. The extracted features are fed to the FCGAN to generate fake characters and to verify the quality of the generated characters simultaneously. The proposed system chooses characters from the benchmark ICDAR 2015 dataset for training, and is further validated by conducting text detection and recognition experiments on input and generated fake images to show its effectiveness.

Yisheng Yue, Palaiahnakote Shivakumara, Yirui Wu, Liping Zhu, Tong Lu, Umapada Pal
Person Re-Identification Based on Pose-Aware Segmentation

Person re-identification (Re-ID) is a key technology for intelligent video analysis. However, it is still a challenging task due to complex backgrounds, varying person poses, etc. In this paper we address this issue by proposing a novel method based on person segmentation. Contrary to previous methods, we first segment the person region from the image. A pose-aware segmentation method (PA) is proposed by introducing the human pose into the segmentation scheme. Then deep learning features are extracted from the person region instead of the whole bounding box. Finally, the person Re-ID results are obtained by ranking Euclidean distances. Comprehensive experiments on two public person Re-ID datasets show the effectiveness of our method, and the comparison experiments demonstrate that our method can outperform state-of-the-art methods.

Wenfeng Zhang, Zhiqiang Wei, Lei Huang, Jie Nie, Lei Lv, Guanqun Wei
Neuropsychiatric Disorders Identification Using Convolutional Neural Network

Neuropsychiatric disorders have become a high risk among the elderly, and the patient group is tending to get younger. However, an efficient computer-aided system using computer vision techniques to detect neuropsychiatric disorders has not been developed yet. More specifically, there are two critical issues: (1) the postures of various neuropsychiatric disorders are similar, and (2) physiotherapists are scarce and examinations are expensive. In this study, we design an innovative framework which associates a novel two-dimensional feature map with a convolutional neural network to identify neuropsychiatric disorders. Firstly, we define seven types of postures to generate one-dimensional feature vectors (1D-FVs) which can efficiently describe the characteristics of neuropsychiatric disorders. To further consider the relationship between different features, we reshape the features from one dimension into two to form feature maps (2D-FMs) based on the periods of pace. Finally, we generate the identification model by associating the 2D-FMs with a convolutional neural network. To evaluate our work, we introduce a new dataset called the Simulated Neuropsychiatric Disorders Dataset (SNDD), which contains three kinds of neuropsychiatric disorders and one healthy condition in 128 videos. In experiments, we evaluate the performance of the 1D-FVs with classic classifiers and compare it with gait anomaly feature vectors. In addition, extensive experiments are conducted on the proposed framework, which associates the 2D-FMs with a convolutional neural network to identify the neuropsychiatric disorders.

Chih-Wei Lin, Qilu Ding
Semantic Map Annotation Through UAV Video Analysis Using Deep Learning Models in ROS

Enriching the map of the flight environment with semantic knowledge is a common need for several UAV applications. Safety legislations require no-fly zones near crowded areas that can be indicated by semantic annotations on a geometric map. This work proposes an automatic annotation of 3D maps with crowded areas, by projecting 2D annotations that are derived through visual analysis of UAV video frames. To this aim, a fully convolutional neural network is proposed, in order to comply with the computational restrictions of the application, that can effectively distinguish between crowded and non-crowded scenes based on a regularized multiple-loss training method, and provide semantic heatmaps that are projected on the 3D occupancy grid of Octomap. The projection is based on raycasting and leads to polygonal areas that are geo-localized on the map and could be exported in KML format. Initial qualitative evaluation using both synthetic and real world drone scenes, proves the applicability of the method.

Efstratios Kakaletsis, Maria Tzelepi, Pantelis I. Kaplanoglou, Charalampos Symeonidis, Nikos Nikolaidis, Anastasios Tefas, Ioannis Pitas
Temporal Action Localization Based on Temporal Evolution Model and Multiple Instance Learning

Temporal action localization in untrimmed long videos is an important yet challenging problem. The temporal ambiguity and the intra-class variations of temporal structure of actions make existing methods far from being satisfactory. In this paper, we propose a novel framework which firstly models each action clip based on its temporal evolution, and then adopts a deep multiple instance learning (MIL) network for jointly classifying action clips and refining their temporal boundaries. The proposed network utilizes a MIL scheme to make clip-level decisions based on temporal-instance-level decisions. Besides, a temporal smoothness constraint is introduced into the multi-task loss. We evaluate our framework on THUMOS Challenge 2014 benchmark and the experimental results show that it achieves considerable improvements as compared to the state-of-the-art methods. The performance gain is especially remarkable under precise localization with high tIoU thresholds, e.g. mAP@tIoU=0.5 is improved from 31.0% to 35.0%.
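One plausible form of such a temporal smoothness term, penalizing abrupt changes between the predictions of neighbouring temporal instances within a clip, is sketched below in PyTorch; the exact formulation used in the paper may differ.

```python
import torch

def temporal_smoothness_loss(scores):
    """Penalise large changes between neighbouring temporal-instance predictions.
    `scores` has shape (batch, time, classes); the squared-difference form here is
    an assumption about how such a constraint could be written."""
    return ((scores[:, 1:] - scores[:, :-1]) ** 2).mean()
```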

Minglei Yang, Yan Song, Xiangbo Shu, Jinhui Tang
Near-Duplicate Video Retrieval Through Toeplitz Kernel Partial Least Squares

The existence of huge volumes of near-duplicate videos creates a rising demand for effective near-duplicate video retrieval techniques in copyright violation detection and search result re-ranking. In this paper, Kernel Partial Least Squares (KPLS) is used to find strong information correlations in near-duplicate videos. Furthermore, to address the “curse of kernelization” when querying a large-scale video database, we propose a Toeplitz Kernel Partial Least Squares method. The Toeplitz matrix multiplication can be implemented with the Fast Fourier Transform (FFT) to accelerate the computation. Extensive experiments on the widely used CC_WEB_VIDEO dataset demonstrate that the proposed approach exhibits superior near-duplicate video retrieval (NDVR) performance over state-of-the-art methods, such as BCS, SE, SSBelt and CCA, achieving a mean average precision (MAP) score of 0.9665.
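The FFT acceleration rests on the standard fact that a Toeplitz matrix-vector product can be computed via a circulant embedding in O(n log n); the sketch below shows this generic building block, not the full TKPLS algorithm.

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Multiply a Toeplitz matrix T (first column c, first row r, with r[0] == c[0])
    by the vector x, by embedding T in a (2n-1)-dimensional circulant matrix whose
    product with a zero-padded x is computed with the FFT."""
    n = len(c)
    col = np.concatenate([c, r[:0:-1]])        # first column of the circulant embedding
    x_pad = np.concatenate([x, np.zeros(n - 1)])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:n].real                          # first n entries equal T @ x
```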

Jia-Li Tao, Jian-Ming Zhang, Liang-Jun Wang, Xiang-Jun Shen, Zheng-Jun Zha
Action Recognition Using Visual Attention with Reinforcement Learning

Human action recognition in videos is a challenging and significant task with a broad range of applications. The advantage of the visual attention mechanism is that it can effectively reduce noise interference by focusing on the relevant parts of the image and ignoring the irrelevant parts. We propose a deep visual attention model with reinforcement learning for this task. We use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) units as the learning agent. The agent interacts with the video and decides both which frame to look at next and where the most relevant region of the selected frame is located. The REINFORCE method is used to learn the agent's decision policy, and back-propagation is used to train the action classifier. The experimental results demonstrate that this glimpse window can focus on important clues. Our model achieves significant performance improvements on the action recognition datasets UCF101 and HMDB51.

Hongyang Li, Jun Chen, Ruimin Hu, Mei Yu, Huafeng Chen, Zengmin Xu
Soccer Video Event Detection Based on Deep Learning

Automatically identifying the most interesting content in a long video remains a challenging task. Event detection is an important aspect of soccer game research. In this paper, we propose a model that is able to detect events in long soccer games with a single pass through the video. Combined with replay detection, we generate story clips, which contain more complete temporal context, meeting audiences’ needs. We also introduce a soccer game dataset that contains 222 broadcast soccer videos, totaling 170 video hours. The dataset covers three annotation types: (1) shot annotations (type and boundary), (2) event annotations (with 11 event labels), and (3) story annotations (with 15 story labels). Finally, we report the performance of the proposed model for soccer events and story analysis.

Junqing Yu, Aiping Lei, Yangliu Hu
Spatio-Temporal Attention Model Based on Multi-view for Social Relation Understanding

Social relation understanding is an increasingly popular research area. Great progress has been achieved by exploiting sentiment or social relations from image data; however, it remains difficult to attain satisfactory performance for social relation analysis from video data. In this paper, we propose a novel Spatio-Temporal attention model based on Multi-View (STMV) for understanding social relations from video. First, in order to obtain a rich representation of social relation traits, we introduce different ConvNets to extract multi-view features including RGB, optical flow, and face. Second, we exploit the temporal dynamics of the multi-view features using Long Short-Term Memory (LSTM) for social relation understanding. Specifically, we propose multiple attention units in our attention module. In this manner, we can generate an appropriate feature representation focusing on multiple aspects of social relation traits in video, so that an effective mapping function from low-level video pixels to the high-level social relation space can be built. Third, we introduce a tensor fusion layer, which learns interactions among multi-view features. Extensive experiments show that our STMV model achieves state-of-the-art performance on the SRIV video dataset for social relation classification.

Jinna Lv, Bin Wu
Detail-Preserving Trajectory Summarization Based on Segmentation and Group-Based Filtering

In this paper, aiming at preserving more details of the original trajectory data, we propose a novel trajectory summarization approach based on trajectory segmentation. The proposed approach consists of five stages. First, the proposed relative-distance-ratio-based abnormality detection is performed to remove outliers. Second, the remaining trajectories are segmented into sub-trajectories using the minimum description length (MDL) principle. Third, the sub-trajectories are combined into groups by considering both spatial proximity, through the use of a searching window, and shape restrictions, and the sub-trajectories within the same group are resampled to have the same number of sample points. Fourth, a non-local filtering method based on wavelet transformation is applied to each group. Fifth, the filtered sub-trajectories derived from the same trajectory are linked together to present the summarization result. Experiments show that our algorithm obtains satisfactory results.

Ting Wu, Qing Xu, Yunhe Li, Yuejun Guo, Klaus Schoeffmann
Single-Stage Detector with Semantic Attention for Occluded Pedestrian Detection

In this paper, we propose a pedestrian detection method with semantic attention based on the single-stage detector architecture (i.e., RetinaNet) for occluded pedestrian detection, denoted as PDSA. PDSA contains a semantic segmentation component and a detector component. Specifically, the first component uses visible bounding boxes for semantic segmentation, aiming to obtain an attention map for pedestrians and the inter-class (non-pedestrian) occlusion. The second component utilizes the single-stage detector to locate the pedestrian from the features obtained previously. The single-stage detector adopts over-sampling of possible object locations, which is faster than two-stage detectors that train classifier to identify candidate object locations. In particular, we introduce the repulsion loss to deal with the intra-class occlusion. Extensive experiments conducted on the public CityPersons dataset demonstrate the effectiveness of PDSA for occluded pedestrian detection, which outperforms the state-of-the-art approaches.

Fang Wen, Zehang Lin, Zhenguo Yang, Wenyin Liu
Poses Guide Spatiotemporal Model for Vehicle Re-identification

In this paper, we tackle the vehicle re-identification (Re-ID) problem, which is important in urban surveillance. Using visual appearance information alone is limited in performance due to occlusions, illumination variations, etc. To the best of our knowledge, only a few recent methods consider spatiotemporal information for the vehicle Re-ID problem, and they neglect the influence of driving direction. In this paper, we observe that the spatiotemporal distribution of vehicle movements follows certain rules, and moreover that vehicles' poses in the camera view indicate driving directions that are closely related to the spatiotemporal cues. Inspired by these two observations, we propose a Poses Guide Spatiotemporal model (PGST) for assisting vehicle Re-ID. Firstly, a Gaussian distribution based spatiotemporal probability model is exploited to predict a vehicle's spatiotemporal movement. Then a CNN-based pose classifier is exploited to estimate the driving direction by evaluating the vehicle's pose. Finally, the PGST model is integrated into a framework which fuses the results of the visual appearance model and the spatiotemporal model. Due to the lack of vehicle datasets with spatiotemporal information and camera topology, experiments are conducted on a public vehicle Re-ID dataset which is the only one meeting these requirements. The proposed approach achieves competitive performance.
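A Gaussian spatiotemporal probability model of this kind can be pictured as scoring the observed transfer time between a pair of cameras under a normal distribution fitted to training transitions, as in the sketch below; the parameterisation is an assumption rather than the paper's exact model.

```python
import numpy as np

def spatiotemporal_probability(delta_t, mu, sigma):
    """Score a transfer time delta_t between two cameras under a Gaussian whose
    parameters mu and sigma would be estimated from observed vehicle transitions
    in training data for that camera pair."""
    return np.exp(-0.5 * ((delta_t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
```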

Xian Zhong, Meng Feng, Wenxin Huang, Zheng Wang, Shin’ichi Satoh
Alignment of Deep Features in 3D Models for Camera Pose Estimation

Using a set of semantically annotated RGB-D images with known camera poses, many existing 3D reconstruction algorithms can integrate these images into a single 3D model of the scene. The semantically annotated scene model facilitates the construction of a video surveillance system using a moving camera if we can efficiently compute the depth maps of the captured images and estimate the poses of the camera. The proposed model-based video surveillance consists of two phases, i.e. the modeling phase and the inspection phase. In the modeling phase, we carefully calibrate the parameters of the camera that captures the multi-view video for modeling the target 3D scene. However, in the inspection phase, the camera pose parameters and the depth maps of the captured RGB images are often unknown or noisy when we use a moving camera to inspect the completeness of the object. In this paper, the 3D model is first transformed into a colored point cloud, which is then indexed by clustering—with each cluster representing a surface fragment of the scene. The clustering results are then used to train a model-specific convolution neural network (CNN) that annotates each pixel of an input RGB image with a correct fragment class. The prestored camera parameters and depth information of fragment classes are then fused together to estimate the depth map and the camera pose of the current input RGB image. The experimental results show that the proposed approach outperforms the compared methods in terms of the accuracy of camera pose estimation.

Jui-Yuan Su, Shyi-Chyi Cheng, Chin-Chun Chang, Jun-Wei Hsieh
Regular and Small Target Detection

Although remarkable results have been achieved in the areas of object detection, the detection of small objects is still a challenging task now. The low resolution and noisy representation make small objects difficult to detect, and further recognition will be much harder. Aiming at the small objects that have regular positions, shapes, colors or other features, this paper proposes an approach of Regular and Small Target Detection based on Faster R-CNN (RSTD) for the detection and recognition of regular and small targets such as traffic signs. In this approach, a regular and small target feature extraction layer is designed to automatically extract the surrounding background and internal key information of the proposal objects, which benefits the detection and recognition. Extensive evaluations on Tsinghua-Tencent 100K and GTSDB datasets demonstrate the superiority of our approach in detecting traffic signs over well-established state-of-the-arts. The source code and model introduced in this paper are publicly available at: https://github.com/zhezheey/RSTD/ .

Wenzhe Wang, Bin Wu, Jinna Lv, Pilin Dai
From Classical to Generalized Zero-Shot Learning: A Simple Adaptation Process

Zero-shot learning (ZSL) is concerned with the recognition of previously unseen classes. It relies on additional semantic knowledge for which a mapping can be learned with training examples of seen classes. While classical ZSL considers the recognition performance on unseen classes only, generalized zero-shot learning (GZSL) aims at maximizing performance on both seen and unseen classes. In this paper, we propose a new process for training and evaluation in the GZSL setting; this process addresses the gap in performance between samples from unseen and seen classes by penalizing the latter, and enables the selection of hyper-parameters well suited to the GZSL task. It can be applied to any existing ZSL approach and leads to a significant performance boost: the experimental evaluation shows that GZSL performance, averaged over eight state-of-the-art methods, is improved from 28.5 to 42.2 on CUB and from 28.2 to 57.1 on AwA2.
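The core idea of penalizing seen classes at prediction time can be illustrated with a few lines of NumPy, in the spirit of calibrated stacking; the single scalar penalty below is an assumption and stands in for whatever hyper-parameter selection the paper actually performs.

```python
import numpy as np

def gzsl_predict(scores, seen_class_ids, penalty):
    """Subtract a calibration penalty from the scores of seen classes so that
    unseen classes get a fair chance in the joint label space.
    `scores` has shape (samples, classes); `penalty` would be tuned on validation data."""
    adjusted = scores.copy()
    adjusted[:, seen_class_ids] -= penalty
    return adjusted.argmax(axis=1)
```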

Yannick Le Cacheux, Hervé Le Borgne, Michel Crucianu

Industry Papers

Frontmatter
Bag of Deep Features for Instructor Activity Recognition in Lecture Room

This research aims to explore contextual visual information in the lecture room to assist an instructor in assessing the effectiveness of the delivered lecture. The objective is to enable a self-evaluation mechanism for the instructor to improve lecture productivity by understanding their activities. A teacher's effectiveness has a remarkable impact on students' performance, helping them succeed academically and professionally. Therefore, the process of lecture evaluation can significantly contribute to improving academic quality and governance. In this paper, we propose a vision-based framework to recognize the activities of the instructor for self-evaluation of the delivered lectures. The proposed approach uses motion templates of instructor activities and describes them through a Bag-of-Deep-Features (BoDF) representation. Deep spatio-temporal features extracted from the motion templates are used to compile a visual vocabulary. The visual vocabulary for instructor activity recognition is quantized to optimize the learning model. A Support Vector Machine classifier is used to generate the model and predict the instructor activities. We evaluated the proposed scheme on a self-captured lecture room dataset, IAVID-1. Eight instructor activities (pointing towards the student, pointing towards the board or screen, idle, interacting, sitting, walking, using a mobile phone and using a laptop) are recognized with 85.41% accuracy. As a result, the proposed framework enables instructor activity recognition without human intervention.
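A generic Bag-of-Deep-Features pipeline of the kind described above can be sketched with scikit-learn as below; the vocabulary size, histogram normalisation and SVM settings are assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_bodf_classifier(clip_features, labels, vocab_size=200):
    """Quantise per-clip sets of deep descriptors with a learned visual vocabulary
    and classify the resulting normalised histograms with an SVM.
    `clip_features` is a list of (n_descriptors, dim) arrays, one per clip."""
    vocab = KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(clip_features))
    hists = np.stack([
        np.bincount(vocab.predict(f), minlength=vocab_size) / len(f)
        for f in clip_features
    ])
    clf = SVC(kernel="rbf").fit(hists, labels)
    return vocab, clf
```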

Nudrat Nida, Muhammad Haroon Yousaf, Aun Irtaza, Sergio A. Velastin
A New Hybrid Architecture for Human Activity Recognition from RGB-D Videos

Activity Recognition from RGB-D videos is still an open problem due to the presence of large varieties of actions. In this work, we propose a new architecture by mixing a high level handcrafted strategy and machine learning techniques. We propose a novel two level fusion strategy to combine features from different cues to address the problem of large variety of actions. As similar actions are common in daily living activities, we also propose a mechanism for similar action discrimination. We validate our approach on four public datasets, CAD-60, CAD-120, MSRDailyActivity3D, and NTU-RGB+D improving the state-of-the-art results on them.

Srijan Das, Monique Thonnat, Kaustubh Sakhalkar, Michal Koperski, Francois Bremond, Gianpiero Francesca
Utilizing Deep Object Detector for Video Surveillance Indexing and Retrieval

Intelligent video surveillance is one of the most challenging tasks in computer vision due to high requirements for reliability, real-time processing and robustness on low resolution videos. In this paper we propose solutions to those challenges through a unified system for indexing and retrieval based on recent discoveries in deep learning. We show that a single stage object detector such as YOLOv2 can be used as a very efficient tool for event detection, key frame selection and scene recognition. The motivation behind our approach is that the feature maps computed by the deep detector encode not only the category of objects present in the image, but also their locations, eliminating automatically background information. We also provide a solution to the low video quality problem with the introduction of a light convolutional network for object description and retrieval. Preliminary experimental results on different video surveillance datasets demonstrate the effectiveness of the proposed system.

Tom Durand, Xiyan He, Ionel Pop, Lionel Robinault
Deep Recurrent Neural Network for Multi-target Filtering

This paper addresses the problem of fixed motion and measurement models for multi-target filtering using an adaptive learning framework. This is performed by defining target tuples with random finite set terminology and utilisation of recurrent neural networks with a long short-term memory architecture. A novel data association algorithm compatible with the predicted tracklet tuples is proposed, enabling the update of occluded targets, in addition to assigning birth, survival and death of targets. The algorithm is evaluated over a commonly used filtering simulation scenario, with highly promising results ( https://github.com/mehryaragha/MTF ).

Mehryar Emambakhsh, Alessandro Bay, Eduard Vazquez
Adversarial Training for Video Disentangled Representation

The strong demand for video analytics is largely due to the widespread application of CCTV. Perfectly encoding moving objects and scenes of different sizes and complexity in an unsupervised manner is still a challenge and seriously affects the quality of video prediction and subsequent analysis. In this paper, we introduce adversarial training to improve DrNet, which disentangles a video into stationary scene and moving object representations, while taking tiny objects and complex scenes into account. These representations can be used for subsequent industrial applications such as vehicle density estimation, video retrieval, etc. Our experiment on the LASIESTA database confirms the validity of this method in both reconstruction and prediction performance. In addition, we propose an experiment that zeroes out one of the codes and reconstructs the images by concatenating the zero and non-zero codes. This experiment separately evaluates the moving object and scene coding quality and shows that the adversarial training achieves significant visual reconstruction quality, despite complex scenes and tiny objects.

Renjie Xie, Yuancheng Wang, Tian Xie, Yuhao Zhang, Li Xu, Jian Lu, Qiao Wang

Demonstrations

Frontmatter
A Method for Enriching Video-Watching Experience with Applied Effects Based on Eye Movements

We propose a method to enrich the experience of watching videos by applying effects to video clips which are shared on the Web on the basis of eye movements. We implemented a prototype system as a Web browser extension and created several effects that are applied depending on the point of a viewer’s gaze. In addition, we conducted an experimental test, and clarified the usefulness of our effects, and investigated how adding the effects affected viewer experience.

Masayuki Tamura, Satoshi Nakamura
Fontender: Interactive Japanese Text Design with Dynamic Font Fusion Method for Comics

Comics consist of frames, drawn images, speech balloons, text, and so on. In this work, we focus on the difficulty of designing the text used for the narration and quotes of characters. In order to support creators in their text design, we propose a method to design text by a font fusion algorithm with arbitrary existing fonts. In this method, users can change the font type freely by indicating a point on the font map. We implement a prototype system and discuss its effectiveness.

Junki Saito, Satoshi Nakamura
Training Researchers with the MOVING Platform

The MOVING platform enables its users to improve their information literacy by training how to exploit data and text mining methods in their daily research tasks. In this paper, we show how it can support researchers in various tasks, and we introduce its main features, such as text and video retrieval and processing, advanced visualizations, and the technologies to assist the learning process.

Iacopo Vagliano, Angela Fessl, Franziska Günther, Thomas Köhler, Vasileios Mezaris, Ahmed Saleh, Ansgar Scherp, Ilija Šimić
Space Wars: An AugmentedVR Game

Over the past couple of years, Virtual and Augmented Reality have been at the forefront of the Mixed Reality development scene, whereas Augmented Virtuality has significantly lagged behind. Widespread adoption, however, requires efficient low-cost platforms and minimalistic interference design. In this work we present Space Wars, an end-to-end proof of concept for an elegant, rapidly deployable Augmented VR platform. Through the engaging experience of Space Wars, we aim to demonstrate how digital games, as forerunners of innovative technology, are perfectly suited as an application area to embrace the underlying low-cost technology, and thus pave the way for other adopters (such as healthcare, education, tourism and e-commerce) to follow suit.

Kyriaki Christaki, Konstantinos C. Apostolakis, Alexandros Doumanoglou, Nikolaos Zioulis, Dimitrios Zarpalas, Petros Daras
ECAT - Endoscopic Concept Annotation Tool

The trend to video documentation in minimally invasive surgery demands for effective and expressive semantic content understanding in order to automatically organize huge and rapidly growing endoscopic video archives. To provide such assistance, deep learning proved to be the means of choice, but requires large amounts of high quality training data labeled by domain experts to produce adequate results. We present a web-based annotation system that provides a very efficient workflow for medical domain experts to conveniently create such video training data with minimum effort.

Bernd Münzer, Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann
Automatic Classification and Linguistic Analysis of Extremist Online Material

The growth of the Internet in the last decade has created great opportunities for sharing content and opinions at a global scale. While this may look like an entirely positive development, it also facilitates the dissemination of discriminatory material, propaganda calling for violence, and the like. We present a system for the recognition, classification and inspection of this kind of material in terms of different characteristics, as well as for the identification of its authors. The system is illustrated using different sources, including Jihadist magazines and White Supremacist forum posts. We show experiments on the detection and classification of offensive content and provide a visualization and enrichment of extremist data.
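For illustration only, a minimal sketch of offensive-content classification with TF-IDF features and a linear classifier; the toy texts and labels are invented placeholders, and the authors' system additionally performs linguistic analysis and authorship identification:

```python
# Sketch only: a simple text-classification pipeline for flagging content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; real experiments use magazine and forum corpora.
texts = ["call for violence ...", "ordinary news report ...",
         "hateful slur ...", "weather update ..."]
labels = [1, 0, 1, 0]  # 1 = offensive/extremist, 0 = benign

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["propaganda calling for violence"]))
```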

Juan Soler-Company, Leo Wanner

Video Browser Showdown

Frontmatter
Autopiloting Feature Maps: The Deep Interactive Video Exploration (diveXplore) System at VBS2019

We present the most recent version of our Deep Interactive Video Exploration (diveXplore) system, which has been successfully used for the latest two Video Browser Showdown competitions (VBS2017 and VBS2018) as well as for the first Lifelog Search Challenge (LSC2018). diveXplore is based on a plethora of video content analysis and processing methods, such as simple color, texture, and motion analysis, self-organizing feature maps, and semantic concept extraction with different deep convolutional neural networks. The biggest strength of the system, however, is that it provides a variety of video search and rich interaction features. One of the novelties in the most recent version is a Feature Map Autopilot, which ensures time-efficient inspection of feature maps without gaps and unnecessary visits.
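As a rough illustration of what an autopilot over a two-dimensional feature map might do, the sketch below visits every cell of a grid exactly once in snake order, so there are no gaps and no repeated visits; the actual diveXplore scheduling is not specified here and may differ:

```python
# Sketch only: a boustrophedon (snake) traversal of a 2-D feature map grid.
def autopilot_order(rows, cols):
    order = []
    for r in range(rows):
        cells = [(r, c) for c in range(cols)]
        order.extend(cells if r % 2 == 0 else reversed(cells))
    return order

path = autopilot_order(4, 5)
assert len(path) == len(set(path)) == 20   # every cell once: no gaps, no revisits
print(path[:6])
```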

Klaus Schoeffmann, Bernd Münzer, Andreas Leibetseder, Jürgen Primus, Sabrina Kletz
VISIONE at VBS2019

This paper presents VISIONE, a tool for large-scale video search. The tool can be used for both known-item and ad-hoc video search tasks since it integrates several content-based analysis and retrieval modules, including a keyword search, a spatial object-based search, and a visual similarity search. Our implementation is based on state-of-the-art deep learning approaches for content analysis and leverages highly efficient indexing techniques to ensure scalability. Specifically, we encode all the visual and textual descriptors extracted from the videos into (surrogate) textual representations that are then efficiently indexed and searched using an off-the-shelf text search engine.
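A minimal sketch of the surrogate-text idea, under the assumption that feature magnitudes are mapped to repeated codeword tokens so that a text engine's term frequencies mirror feature weights; VISIONE's actual encoding may differ in detail:

```python
# Sketch only: turn a visual feature vector into a "surrogate text" document
# that an off-the-shelf text search engine can index.
import numpy as np

def surrogate_text(vector, scale=10):
    """Map each non-negative component to a codeword (e.g. 'f3') repeated
    proportionally to its magnitude."""
    terms = []
    for i, v in enumerate(vector):
        repeats = int(round(max(float(v), 0.0) * scale))
        terms.extend([f"f{i}"] * repeats)
    return " ".join(terms)

vec = np.clip(np.random.randn(8), 0, None)  # dummy non-negative descriptor
print(surrogate_text(vec))                   # e.g. "f0 f0 f3 f5 f5 ..."
```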

Giuseppe Amato, Paolo Bolettieri, Fabio Carrara, Franca Debole, Fabrizio Falchi, Claudio Gennaro, Lucia Vadicamo, Claudio Vairo
VIRET Tool Meets NasNet

The results of the last Video Browser Showdown in Bangkok 2018 show that multimodal search with interactive query reformulation represents a competitive search strategy for all the evaluated task categories. Therefore, we plan to improve the effectiveness of the involved retrieval models by making use of the most recent deep network architectures in the new version of our interactive video retrieval tool, VIRET. Specifically, we apply the NasNet deep convolutional neural network architecture for automatic annotation and similarity search in the set of selected frames from the provided video collection. In addition, we implement temporal sequence queries and subimage similarity search to provide higher query formulation flexibility for users.
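A minimal sketch of NASNet-based feature extraction followed by nearest-neighbour similarity search; NASNetMobile is used here only as a stand-in backbone, and the frames are random placeholders:

```python
# Sketch only: extract deep features for selected frames, then rank frames
# by cosine similarity to a query frame.
import numpy as np
import tensorflow as tf

backbone = tf.keras.applications.NASNetMobile(include_top=False, pooling="avg",
                                              weights="imagenet",
                                              input_shape=(224, 224, 3))
preprocess = tf.keras.applications.nasnet.preprocess_input

frames = np.random.randint(0, 255, size=(16, 224, 224, 3)).astype("float32")
features = backbone.predict(preprocess(frames), verbose=0)   # (16, feature_dim)

# Cosine similarity search: rank all frames against the first frame as query.
normed = features / np.linalg.norm(features, axis=1, keepdims=True)
ranking = np.argsort(-(normed @ normed[0]))
print(ranking[:5])
```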

Jakub Lokoč, Gregor Kovalčík, Tomáš Souček, Jaroslav Moravec, Jan Bodnár, Přemysl Čech
VERGE in VBS 2019

This paper presents VERGE, an interactive video retrieval engine that enables browsing and searching within video content. The system implements various retrieval modalities, such as visual or textual search, concept detection and clustering, as well as multimodal fusion and reranking capabilities. All results are displayed in a graphical user interface in an efficient and user-friendly manner.
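As one possible reading of multimodal fusion, the sketch below merges ranked lists from several modalities with reciprocal rank fusion; this is an assumption made for illustration, not necessarily VERGE's actual fusion or reranking scheme:

```python
# Sketch only: late fusion of ranked result lists via reciprocal rank fusion.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

visual = ["shot12", "shot7", "shot3"]     # hypothetical per-modality rankings
textual = ["shot7", "shot9", "shot12"]
concepts = ["shot3", "shot7", "shot1"]
print(reciprocal_rank_fusion([visual, textual, concepts]))  # shot7 ranks first
```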

Stelios Andreadis, Anastasia Moumtzidou, Damianos Galanopoulos, Foteini Markatopoulou, Konstantinos Apostolidis, Thanassis Mavropoulos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris, Ioannis Patras
VIREO @ Video Browser Showdown 2019

In this paper, the VIREO team's video retrieval tool is described in detail. As learned from the Video Browser Showdown (VBS) 2018, visualization of video frames is critical for improving browsing effectiveness. Based on this observation, a hierarchical structure representing the video frame clusters has been built automatically using k-means and self-organizing maps and is used for visualization. In addition, the relevance feedback module, which relies on real-time support-vector-machine classification, becomes infeasible with the large dataset provided for VBS 2019 and has been replaced by a browsing module with pre-calculated nearest neighbors. Preliminary user study results on the IACC.3 dataset show that these modules improve retrieval accuracy and efficiency in a real-time video search system.
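A minimal sketch of browsing over pre-calculated nearest neighbours instead of on-line classification; the descriptor dimensionality and neighbourhood size are assumptions:

```python
# Sketch only: compute each keyframe's nearest neighbours offline once,
# then browsing is a simple table lookup with no model training at query time.
import numpy as np
from sklearn.neighbors import NearestNeighbors

descriptors = np.random.rand(5000, 128)              # dummy keyframe features
index = NearestNeighbors(n_neighbors=11).fit(descriptors)
_, neighbours = index.kneighbors(descriptors)        # shape (5000, 11)
neighbour_table = neighbours[:, 1:]                  # drop the self-match

def browse(frame_id):
    """Return the pre-computed neighbours of a frame."""
    return neighbour_table[frame_id]

print(browse(42))
```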

Phuong Anh Nguyen, Chong-Wah Ngo, Danny Francis, Benoit Huet
Deep Learning-Based Concept Detection in vitrivr

This paper presents the most recent additions to the vitrivr retrieval stack, which will be put to the test in the context of the 2019 Video Browser Showdown (VBS). The vitrivr stack has been extended by approaches for detecting, localizing, or describing concepts and actions in video scenes using various convolutional neural networks. Leveraging those additions, we have added support for searching the video collection based on semantic sketches. Furthermore, vitrivr offers new types of labels for text-based retrieval. In the same vein, we have also improved upon vitrivr’s pre-existing capabilities for extracting text from video through scene text recognition. Moreover, the user interface has received a major overhaul so as to make it more accessible to novice users, especially for query formulation and result exploration.

Luca Rossetto, Mahnaz Amiri Parian, Ralph Gasser, Ivan Giangreco, Silvan Heller, Heiko Schuldt

MANPU 2019 Workshop Papers

Frontmatter
Structure Analysis on Common Plot in Four-Scene Comic Story Dataset

Comics are among the most attractive creative contents, combining both image and textual features. In particular, I focus on four-scene comics, which can represent stories with a simple and clear structure. One aim of this research is to promote collaboration between creators and artificial intelligence. To contribute to the field, I have proposed an original four-scene comics dataset with creative-process information and meta-data. Based on existing comics, I defined typical patterns of structure and content, and provided character settings and additional information to keep twenty common scenarios balanced across two types of structure for ten plots. The dataset contains 100 four-scene comics with layer information and several annotations by five artists, so that various expressions of common scenarios can be analyzed. In this paper, I show the procedure for creating the dataset, describe its features, and report the results of a computational experiment.

Miki Ueno
Multi-task Model for Comic Book Image Analysis

Comic book image analysis methods often propose multiple algorithms or models for multiple tasks such as panel and character detection, balloon segmentation, and text recognition. In this work, we aim to reduce the complexity of comic book image analysis by proposing a single model, called Comic MTL, that can learn multiple tasks. In addition to the detection and segmentation tasks, we integrate the relation analysis task for balloons and characters into the Comic MTL model. The experiments with our model are carried out on the eBDtheque dataset, which contains annotations for panels, balloons, characters, and the balloon-character relations. We show that the Comic MTL model can detect the association between balloons and their speakers (comic characters) and handle the other tasks, namely panel and character detection and balloon segmentation, with promising results.
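A minimal sketch of the multi-task idea, with one shared backbone and separate heads whose losses are summed; the toy heads, shapes and loss terms are illustrative assumptions, not the Comic MTL architecture:

```python
# Sketch only: shared backbone, multiple task heads, one combined loss.
import torch
import torch.nn as nn

class ComicMultiTask(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                      nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.detect_head = nn.Linear(feat_dim, 4)    # toy panel/character box
        self.segment_head = nn.Linear(feat_dim, 1)   # toy balloon-mask logit
        self.relation_head = nn.Linear(feat_dim, 1)  # balloon-character link

    def forward(self, x):
        f = self.backbone(x)
        return self.detect_head(f), self.segment_head(f), self.relation_head(f)

model = ComicMultiTask()
boxes, masks, relations = model(torch.rand(2, 3, 128, 128))
loss = (nn.functional.smooth_l1_loss(boxes, torch.rand(2, 4))
        + nn.functional.binary_cross_entropy_with_logits(masks, torch.rand(2, 1))
        + nn.functional.binary_cross_entropy_with_logits(relations, torch.rand(2, 1)))
loss.backward()
```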

Nhu-Van Nguyen, Christophe Rigaud, Jean-Christophe Burie
Estimating Comic Content from the Book Cover Information Using Fine-Tuned VGG Model for Comic Search

The purpose of this research is to realize retrieval of comics based on content information. The only available resources describing the content of existing comics are the comics themselves and their reviews. However, these sources have the drawbacks that the information needed for searching cannot be extracted from them sufficiently and that they contain a lot of irrelevant information. To address this problem, we propose to use the book covers of comics as a resource for grasping their content. In the proposed method, we estimate the time period and cultural background of a comic, as expressed by the clothes and belongings drawn on its cover, using an inference model fine-tuned from the VGG-16 model. We then associate comics with each other based on the obtained semantic vectors and tags. In our experiments, the accuracy of the model was 0.693 and the recall of the assigned tags against the ground-truth data was 0.918. Furthermore, we observed coherence among the comics related by the obtained information.
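A minimal sketch of fine-tuning VGG-16 on cover images for hypothetical tags such as time period; the number of classes, added layers, and the training-data names are assumptions:

```python
# Sketch only: freeze the VGG-16 convolutional base and train a small head.
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze convolutional features first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 hypothetical era tags
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(cover_images, era_labels, epochs=5)      # hypothetical training data
```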

Byeongseon Park, Mitsunori Matsushita
How Good Is Good Enough? Establishing Quality Thresholds for the Automatic Text Analysis of Retro-Digitized Comics

Stylometry in the form of simple statistical text analysis has proven to be a powerful tool for text classification, e.g. for authorship attribution. When analyzing retro-digitized comics, manga and graphic novels, researchers are confronted with the problem that automated text recognition (ATR) still produces results with comparatively high error rates, while the manual transcription of texts remains highly time-consuming. In this paper, we present an approach and measures that specify whether stylometry based on unsupervised ATR will produce reliable results for a given dataset of comic images.
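One concrete measure in this setting is the character error rate of ATR output against a manual transcription; the sketch below computes it from the Levenshtein distance (the thresholds for "good enough" are the subject of the paper and are not fixed here):

```python
# Sketch only: character error rate (CER) of recognized vs. reference text.
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def character_error_rate(recognized, reference):
    return levenshtein(recognized, reference) / max(len(reference), 1)

print(character_error_rate("Gotham Cyty", "Gotham City"))  # 1 substitution -> ~0.09
```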

Rita Hartel, Alexander Dunst
Comic Text Detection Using Neural Network Approach

Text is a crucial element in comic books; hence, text detection is a significant challenge in the endeavour to achieve comic processing. In this work, we study to what extent an off-the-shelf neural network approach for scene text detection can be used to perform comic text detection. Experiments on a public dataset show that such an approach performs as well as methods from the literature, which is promising for building more accurate comic text detectors in the future.
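For scoring such an off-the-shelf detector on comic pages, a common scheme is IoU-based matching of predicted and ground-truth boxes; the sketch below is a generic illustration with the detector treated as a black box, not the authors' evaluation protocol:

```python
# Sketch only: IoU matching of detected text boxes against ground truth.
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def precision_recall(predicted, ground_truth, threshold=0.5):
    matched = sum(any(iou(p, g) >= threshold for g in ground_truth) for p in predicted)
    return matched / max(len(predicted), 1), matched / max(len(ground_truth), 1)

print(precision_recall([(10, 10, 50, 30)], [(12, 11, 52, 31), (100, 100, 140, 120)]))
```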

Frédéric Rayar, Seiichi Uchida
CNN-Based Classification of Illustrator Style in Graphic Novels: Which Features Contribute Most?

Can classification of graphic novel illustrators be achieved by convolutional neural network (CNN) features evolved for classifying concepts on photographs? Assuming that basic features at lower network levels generically represent invariants of our environment, they should be reusable. However, features at what level of abstraction are characteristic of illustrator style? We tested transfer learning by classifying roughly 50,000 digitized pages from about 200 comic books of the Graphic Narrative Corpus (GNC, [6]) by illustrator. For comparison, we also classified Manga109 [18] by book. We tested the predictability of visual features by experimentally varying which of the mixed layers of Inception V3 [29] was used to train classifiers. Overall, the top-1 test-set classification accuracy in the artist attribution analysis increased from 92% for mixed-layer 0 to over 97% when adding mixed-layers higher in the hierarchy. Above mixed-layer 5, there were signs of overfitting, suggesting that texture-like mid-level vision features were sufficient. Experiments varying input material show that page layout and coloring scheme are important contributors. Thus, stylistic classification of comics artists is possible re-using pre-trained CNN features, given only a limited amount of additional training material. We propose that CNN features are general enough to provide the foundation of a visual stylometry, potentially useful for comparative art history.
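A minimal sketch of the transfer-learning setup: activations from one of Inception V3's "mixed" layers are pooled and fed to a simple classifier; the chosen layer, the pooling, the classifier, and the dummy data are illustrative assumptions:

```python
# Sketch only: extract mid-level Inception V3 features, then train a classifier.
import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
feature_model = tf.keras.Model(
    inputs=base.input,
    outputs=tf.keras.layers.GlobalAveragePooling2D()(base.get_layer("mixed5").output))

pages = np.random.rand(8, 299, 299, 3).astype("float32")   # dummy comic pages
artists = [0, 0, 1, 1, 2, 2, 3, 3]                          # dummy illustrator labels
features = feature_model.predict(
    tf.keras.applications.inception_v3.preprocess_input(pages * 255.0), verbose=0)

clf = LogisticRegression(max_iter=1000).fit(features, artists)
print(clf.score(features, artists))   # training accuracy on the toy data
```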

Jochen Laubrock, David Dubray
Backmatter
Metadata
Title
MultiMedia Modeling
Edited by
Ioannis Kompatsiaris
Dr. Benoit Huet
Vasileios Mezaris
Cathal Gurrin
Wen-Huang Cheng
Dr. Stefanos Vrochidis
Copyright Year
2019
Electronic ISBN
978-3-030-05716-9
Print ISBN
978-3-030-05715-2
DOI
https://doi.org/10.1007/978-3-030-05716-9
