About this book

This book presents revised selected papers from the 16th International Forum on Digital TV and Wireless Multimedia Communication, IFTC 2019, held in Shanghai, China, in September 2019.

The 34 full papers presented in this volume were carefully reviewed and selected from 120 submissions. They were organized in topical sections on image processing; machine learning; quality assessment; telecommunications; video surveillance; virtual reality.



Image Processing


Fast Traffic Sign Detection Using Color-Specific Quaternion Gabor Filters

A novel and fast traffic sign detection method is proposed based on color-specific quaternion Gabor filtering (CS-QGF). The method exploits the fact that traffic signs usually have distinctive colors and shapes. Accordingly, we apply a quaternion Gabor transformation to extract the color and shape features of traffic signs simultaneously. The statistical color distribution of traffic signs is analyzed to optimize the construction of the quaternion Gabor filters. The features extracted via CS-QGF are robust to color distortion, changes in image resolution, and changes in lighting and shading conditions, which helps the subsequent traffic sign detector reduce the search range of proposal regions. Experiments on the GTSDB and TT100K datasets demonstrate that the proposed method localizes traffic signs in images with high efficiency and outperforms state-of-the-art methods in both detection speed and final recognition accuracy.

Shiqi Yin, Yi Xu

Smoke Detection Based on Image Analysis Technology

Ecological and pollution problems must be faced and solved in the sustainable development of a country. With the continuous development of image analysis technology, using machines to judge the external environment automatically has become a good choice. Solving the problem of smoke extraction and exhaust monitoring requires an applicable database. Because few databases are available for smoke detection and those databases contain few types of pictures, we subdivide a smoke detection database and obtain a new database for smoke and smoke color detection. The main purpose is to preliminarily identify pollutants in smoke and to further develop smoke image detection technology. We discuss eight kinds of convolutional neural networks that can be used to classify smoke images. By testing different convolutional neural networks on this database, we analyze and compare the accuracy of several existing networks and also verify the reliability of the database. Finally, the possible development directions of smoke detection are summarized.

Huiqing Zhang, Jiaxu Chen, Shuo Li, Ke Gu, Li Wu

Non-local Recoloring Algorithm for Color Vision Deficiencies with Naturalness and Detail Preserving

People with Color Vision Deficiencies (CVD) may have difficulty in recognizing and communicating color information, especially in the multimedia era. In this paper, we propose a recoloring algorithm to enhance the visual perception of people with CVD. In the algorithm, color modification for color blindness is conducted in HSV color space under three constraints: detail, naturalness and authenticity. A new non-local recoloring method is used to preserve details. Subjective experiments were conducted among subjects with normal vision and color-blind subjects. Experimental results show that our algorithm is robust, preserves detail and maintains naturalness. (Source code is freely available to non-commercial users.)

Yunlu Wang, Duo Li, Menghan Hu, Liming Cai

Missing Elements Recovery Using Low-Rank Tensor Completion and Total Variation Minimization

Low-rank (LR) and total variation (TV) are two of the most popular regularizations for image processing problems and have sparked a tremendous amount of research, particularly on moving from scalar to vector, matrix or even higher-order functions. However, the discretization schemes commonly used for TV regularization often ignore differences in intrinsic properties, so they do not exploit local smoothness effectively, let alone address the problem of edge blurring. To address this issue, in this paper we treat the color image as a three-dimensional tensor and measure the smoothness of this tensor by the TV norm along its different dimensions. The third-order tensor is then recovered by Tucker decomposition. Specifically, we propose integrating Shannon total variation (STV) into low-rank tensor completion (LRTC). Moreover, due to the suboptimality of the nuclear norm, we propose a new nonconvex low-rank constraint for closer rank approximation, namely the truncated $\gamma$-norm. We solve the cost function using the alternating direction method of multipliers (ADMM). Experiments on color image inpainting tasks demonstrate that the proposed method enhances the details of the recovered images.

Jinglin Zhang, Mengjie Qin, Cong Bai, Jianwei Zheng

Hyperspectral Image Super-Resolution Using Multi-scale Feature Pyramid Network

Hyperspectral (HS) images are captured with rich spectral information, which has proved useful in many real-world applications, such as earth observation. Due to the limitations of HS cameras, it is difficult to obtain high-resolution (HR) HS images. Recent advances in deep learning (DL) for the single image super-resolution (SISR) task provide a powerful tool for restoring high-frequency details from a low-resolution (LR) input image. Inspired by this progress, in this paper we present a novel DL-based model for single HS image super-resolution in which a feature pyramid block is designed to extract multi-scale features of the input HS image. Our method does not need auxiliary inputs, which further extends its range of application scenarios. Experimental results show that our method outperforms state-of-the-art methods on both objective quality indices and subjective visual results.

He Sun, Zhiwei Zhong, Deming Zhai, Xianming Liu, Junjun Jiang

Machine Learning


Unsupervised Representation Learning Based on Generative Adversarial Networks

This paper introduces a novel model for learning disentangled representations based on Generative Adversarial Networks. The model is trained in an unsupervised manner, without identity information. Unlike InfoGAN, in which the disentangled representation is learnt indirectly through a variational lower bound on the mutual information, our method takes a direct approach by adding predicting networks and an encoder to the GAN and measuring the correlation among the encoder outputs. Experimental results on MNIST demonstrate that the proposed model is more generalizable and robust than InfoGAN. With experiments on CelebA-HQ, we show that our model can extract factorial features from complicated datasets and produce results comparable to supervised models.

Shi Xu, Jia Wang

Interactive Face Liveness Detection Based on OpenVINO and Near Infrared Camera

To improve the security of face recognition, this paper proposes an interactive face liveness detection method based on OpenVINO and a near infrared camera. Firstly, the face feature points are normalized and the faces are aligned in the environment of OpenVINO and the near infrared camera. Secondly, the Euclidean distance between the mouth feature vectors is calculated. When the distance is greater than a threshold, the system judges the expression to be a smile. Finally, the system sends random smile commands to the authenticated users to realize liveness detection. According to the results, the proposed method can effectively distinguish between real people and printed photos. The running time of the liveness detection system based on OpenVINO is 14–30 ms and the recognition accuracy reaches 0.977, indicating good generalization ability in practical applications.

Nana Zhang, Jun Huang, Hui Zhang
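
The smile check described in the abstract above (Euclidean distance between mouth feature vectors compared against a threshold) might be sketched as follows. This is only an illustration: comparing the current frame against a neutral reference vector, the feature layout, and the threshold value of 0.15 are all assumptions, not details taken from the paper.

```python
import math

def is_smile(mouth_vec, neutral_vec, threshold=0.15):
    """Return True when the Euclidean distance between the current mouth
    feature vector and a neutral reference exceeds the threshold.
    The threshold value is a hypothetical placeholder."""
    return math.dist(mouth_vec, neutral_vec) > threshold

# Hypothetical normalized mouth features (e.g. landmark coordinates).
neutral = [0.40, 0.60, 0.60, 0.60]
smiling = [0.30, 0.55, 0.70, 0.55]

print(is_smile(smiling, neutral))  # distance is about 0.158
print(is_smile(neutral, neutral))  # distance is 0.0
```

In an interactive system the threshold would presumably be tuned on real data, and the check would only run after the random smile command has been issued.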

Multi-scale Generative Adversarial Learning for Facial Attribute Transfer

Generative Adversarial Networks (GANs) have shown an impressive ability in facial attribute transfer. One crucial part of facial attribute transfer is retaining the identity. To achieve this, most existing approaches employ the L1 norm to maintain cycle consistency, which tends to cause blurry results due to the weakness of the L1 loss function. To address this problem, we introduce the Structural Similarity Index (SSIM) into our GAN training objective as the measurement between input images and reconstructed images. Furthermore, we incorporate a multi-scale feature fusion structure into the generator to facilitate feature learning and encourage long-term correlation. Qualitative and quantitative experiments show that our method achieves better visual quality and fidelity than the baseline on facial attribute transfer.

Yicheng Zhang, Li Song, Rong Xie, Wenjuan Zhang
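
To make the SSIM-based reconstruction term in the abstract above more concrete, here is a minimal sketch of a single-window (global) SSIM and a loss of the form 1 − SSIM. It uses the standard SSIM definition with the usual constants; the global window, the `1 - SSIM` form of the loss, and the function names are assumptions for illustration, and practical implementations typically use a sliding Gaussian window over 2D images.

```python
def ssim_global(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized, flattened images."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Standard stabilizing constants: C1 = (0.01 L)^2, C2 = (0.03 L)^2.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator

def ssim_loss(x, y):
    """Reconstruction loss that is 0 for identical images (assumed form)."""
    return 1.0 - ssim_global(x, y)

original = [10, 50, 90, 130, 170, 210]
reconstructed = [30, 50, 90, 130, 170, 190]  # lower contrast at the ends
print(ssim_loss(original, original))
print(ssim_loss(original, reconstructed))
```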

Convolutional-Block-Attention Dual Path Networks for Slide Transition Detection in Lecture Videos

Slide transition detection finds the frames where the slide content changes; these frames form a summary of the lecture video and save viewing time. 3D Convolutional Networks (3D ConvNets) are regarded as an efficient approach to learning spatio-temporal features in videos. However, a 3D ConvNet gives the same weight to all features in the image and cannot focus on key feature information. We solve this problem with an attention mechanism, which highlights the more effective features by suppressing invalid ones. Furthermore, a 3D ConvNet usually requires long training time and a lot of memory. The Dual Path Network (DPN) combines the network structures of ResNeXt and DenseNet and has the advantages of both. ResNeXt adds the input directly to the convolved output, which takes advantage of features extracted from the previous hierarchy. DenseNet concatenates the output of each layer to the input of each layer, which extracts new features from the previous hierarchy. Based on the two networks, DPN not only saves training time and memory, but also extracts more effective features and improves training results. Consequently, we present a novel ConvNet architecture based on Convolutional Block Attention and DPN for slide transition detection in lecture videos. Experimental results show that the proposed architecture achieves better results than other slide detection approaches.

Minhuang Guan, Kai Li, Ran Ma, Ping An

Attention-Based Top-Down Single-Task Action Recognition in Still Images

Human action recognition in still images via deep learning methods has recently been an active research topic in computer vision. Unlike traditional action recognition based on videos or image sequences, a single image contains no temporal information or motion features for action characterization. In this study, we use a top-down action recognition strategy to analyze each person instance in a scene separately, on the task of detecting persons playing with a cellphone. A YOLOv3 detector is applied to predict the human bounding boxes, and HRNet (High Resolution Network) is used to regress the attention map centered on the area of cellphone use, taking the region of the given human bounding box as input. Experimental results on a custom dataset show that HRNet can reliably map a person image to a heatmap in which the region of interest (ROI) is highlighted. The accuracy of the proposed framework exceeds that of all the evaluated naive classification models, i.e., DenseNet, inception_v3 and shufflenet_v2.

Jinhai Yang, Xiao Zhou, Hua Yang

Adaptive Person-Specific Appearance-Based Gaze Estimation

Non-invasive gaze estimation from only eye images captured by a camera is a challenging problem due to the variety of eye shapes, eye structures and image qualities. Recently, CNNs have been applied to regress eye images directly to gaze directions and have obtained good performance. However, generic approaches are susceptible to bias and variance that depend strongly on the individual. In this paper, we study the person-specific bias that arises when generic methods are applied to a new person, and we introduce a novel appearance-based deep neural network that integrates meta-learning to reduce this bias. Given only a few person-specific calibration images collected in a normal calibration process, our model adapts quickly to the test person and predicts more accurate gaze directions. Experiments on the public MPIIGaze and Eyediap datasets show that our approach achieves accuracy competitive with current state-of-the-art methods and is able to alleviate the person-specific bias problem.

Chuanyang Zheng, Jun Zhou, Jun Sun, Lihua Zhao

Preliminary Study on Visual Attention Maps of Experts and Nonexperts When Examining Pathological Microscopic Images

Pathological microscopic images are regarded as a gold standard for the diagnosis of disease, and eye tracking technology is considered a very effective tool for medical education. It is therefore of interest to use eye tracking to predict where pathologists or doctors, and persons with little or no experience, look in a pathological microscopic image. In the current work, we first establish a pathological microscopic image database with the eye movement data of experts and nonexperts (PMIDE), including a total of 425 pathological microscopic images. Statistical analysis is then conducted on PMIDE to analyze the difference in eye movement behavior between experts and nonexperts. The results show that although there is no significant difference in general, the experts focus on a broader scope than nonexperts. This inspires us to develop separate saliency models for experts and nonexperts. Furthermore, 10 existing saliency models are tested on PMIDE, and the performance of all of them is unacceptable, with AUC, CC, NSS and SAUC below 0.73, 0.47, 0.78 and 0.52, respectively. This study indicates that saliency models specific to pathological microscopic images urgently need to be developed using our database, PMIDE, or other related databases.

Wangyang Yu, Menghan Hu, Shuning Xu, Qingli Li

Few-Shot Learning for Crossing-Sentence Relation Classification

Most existing methods of relation classification depend heavily on large amounts of annotated data, which is a serious problem. Besides, in most situations they cannot leverage previously learned knowledge, which means they can only be trained from scratch to learn new tasks. Motivated by the human ability to learn effectively from few samples and to learn quickly by using previously acquired knowledge, we use both a meta network based on co-reference resolution and a prototypical network based on co-reference resolution to address few-shot relation classification for the crossing-sentence task. Both networks aim to learn a transferable deep distance metric that recognizes new relation categories given very few labelled samples. Instead of single sentences, the experiments focus on paragraphs containing multiple sentences. The results demonstrate that our approach performs well and achieves high precision.

Wen Wen, Yongbin Liu, Chunping Ouyang
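
As a rough illustration of the prototypical-network idea mentioned in the abstract above, the sketch below classifies a query embedding by its distance to class prototypes, where each prototype is the mean of a few support embeddings. The embeddings, relation labels, and function names are hypothetical; in practice the vectors would come from a trained encoder, and the co-reference resolution component described by the authors is not modeled here.

```python
import math

def build_prototypes(support):
    """Map each label to the mean of its support embeddings."""
    prototypes = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        prototypes[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return prototypes

def classify(query, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Hypothetical 2-shot support set with 2-dimensional embeddings.
support = {
    "founder_of": [[0.9, 0.1], [0.8, 0.2]],
    "born_in": [[0.1, 0.9], [0.2, 0.7]],
}
prototypes = build_prototypes(support)
print(classify([0.7, 0.3], prototypes))
print(classify([0.1, 0.8], prototypes))
```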

Image Classification of Submarine Volcanic Smog Map Based on Convolution Neural Network

To address the problem of smoke image classification in submarine volcanic scenes, this paper uses a deep convolutional neural network (DCNN) to classify seafloor images with smoke and seafloor images without smoke under a small-scale data set and limited computing power. Firstly, data augmentation is used to expand the data set through angle rotation, horizontal flipping, random cropping and the addition of Gaussian noise, and then the deep convolutional neural network is built for training. Finally, recognition and classification are carried out according to the image labels predicted by the classifier. The experimental results show that the classification accuracy of the proposed method is more than 91%.

Xiaoting Liu, Li Liu, Yuhui Chen
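
The four augmentation operations listed in the abstract above (rotation, horizontal flipping, random cropping, Gaussian noise) can be sketched on a small grayscale image stored as nested lists. This is a simplified illustration: rotation is restricted to 90 degrees, and the crop size and noise level are arbitrary assumptions rather than settings from the paper.

```python
import random

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def random_crop(img, height, width, rng):
    """Cut a height x width window at a random position."""
    top = rng.randint(0, len(img) - height)
    left = rng.randint(0, len(img[0]) - width)
    return [row[left:left + width] for row in img[top:top + height]]

def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the 0..255 range."""
    return [
        [min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
        for row in img
    ]

rng = random.Random(0)  # fixed seed so runs are repeatable
image = [[0, 64, 128, 255], [10, 74, 138, 245], [20, 84, 148, 235], [30, 94, 158, 225]]
augmented = [
    rotate90(image),
    hflip(image),
    random_crop(image, 3, 3, rng),
    add_gaussian_noise(image, 5.0, rng),
]
print(len(augmented), "augmented variants from one image")
```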

Multi-Scale Depthwise Separable Convolutional Neural Network for Hyperspectral Image Classification

Hyperspectral images (HSIs) have far more spectral bands than conventional RGB images. The abundant spectral information provides very useful clues for follow-up applications, such as classification and anomaly detection. Extracting discriminant features from HSIs is therefore very important. In this work, we propose a novel spatial-spectral feature extraction method for HSI classification based on a Multi-Scale Depthwise Separable Convolutional Neural Network (MDSCNN). This new model consists of a multi-scale atrous convolution module and two bottleneck residual units, which greatly increase the width and depth of the network. In addition, we use depthwise separable convolution instead of traditional 2D or 3D convolution to extract spatial and spectral features. Furthermore, considering that classification accuracy can benefit from multi-scale information, we apply atrous convolutions with different dilation rates in parallel to extract more discriminant features of HSIs for classification. Experiments on three standard datasets show that the proposed MDSCNN achieves state-of-the-art accuracy among all compared methods.

Jiliang Yan, Deming Zhai, Yi Niu, Xianming Liu, Junjun Jiang

Joint SPSL and CCWR for Chinese Short Text Entity Recognition and Linking

Entity Recognition and Linking (ERL) is a basic task of Natural Language Processing (NLP) and an extension of the Named Entity Recognition (NER) task. The purpose of ERL is to detect entities in a given Chinese short text and link each detected entity to the corresponding entity in a given knowledge base. The ERL task includes two subtasks: Entity Recognition (ER) and Entity Linking (EL). Due to the lack of rich context information in Chinese short text, the accuracy of ER is not high. In addition, the meaning of an entity differs across fields, so the entity cannot be linked accurately. These two problems pose a big challenge for the Chinese ERL task. To solve them, this paper proposes a neural network model that joins semi-point semi-label (SPSL) with combined character-based and word-based representation (CCWR) embeddings. The structure of this model enhances the representation of entity features, which improves the performance of ER, and enhances the representation of contextual semantic information, which improves the performance of EL. In summary, this model performs well in ERL. In the CCKS 2019 Chinese short text ERL task, the F1 value of this model reaches 0.7463.

Zhiqiang Chong, Zhiming Liu, Zhi Tang, Lingyun Luo, Yaping Wan

A Reading Assistant System for Blind People Based on Hand Gesture Recognition

A reading assistant system for blind people based on hand gesture recognition is proposed in this paper. The system consists of seven modules: camera input module, page adjustment module, page information retrieval module, hand pose estimation module, hand gesture recognition module, media controller and audio output device. In the page adjustment module, Hough line detection and local OCR (Optical Character Recognition) are used to rectify text orientation. In the hand gesture recognition module, we propose three practical methods: a geometry model, a heatmap model and a keypoint model. The geometry model recognizes different gestures by the geometrical characteristics of the hand. The heatmap model, which is based on an image classification algorithm, uses a CNN (Convolutional Neural Network) to classify various hand gestures. To simplify the networks in the heatmap model, we extract 21 keypoints from a hand heatmap and turn them into a dataset of point coordinates for training a classifier. All three methods achieve good gesture recognition results. By recognizing gestures, the designed system realizes the reading assistant function.

Qiang Lu, Guangtao Zhai, Xiongkuo Min, Yucheng Zhu

Intrusion Detection Based on Fusing Deep Neural Networks and Transfer Learning

Intrusion detection is a key research direction in network security. With the rapid growth of network data and the growing variety of intrusion methods, traditional detection methods can no longer meet the security requirements of the current network environment. In recent years, the rapid development of deep learning and its great success in image-related tasks have provided a new solution for network intrusion detection. By visualizing the network data, this paper proposes an intrusion detection method based on deep learning and transfer learning, which transforms the intrusion detection problem into an image recognition problem. Specifically, a stream data visualization method is used to present the network data in the form of a grayscale image, and then a deep learning method is introduced to detect network intrusions according to the texture features in the grayscale image. Finally, transfer learning is introduced to improve the iterative efficiency and adaptability of the model. The experimental results show that the proposed method is more efficient and robust than mainstream machine learning and deep learning methods, and has better generalization performance, which allows it to detect new intrusion methods more effectively.

Yingying Xu, Zhi Liu, Yanmiao Li, Yushuo Zheng, Haixia Hou, Mingcheng Gao, Yongsheng Song, Yang Xin
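
One simple way to present network data as a grayscale image, as the abstract above describes, is to treat each byte as one pixel intensity and wrap the byte stream into fixed-width rows. The sketch below assumes exactly that; the authors' actual stream data visualization method is not specified in the abstract, and the function name, row width, and zero padding are illustrative choices.

```python
import math

def bytes_to_grayscale(payload: bytes, width: int = 8):
    """Wrap a byte stream into rows of `width` pixels (values 0..255).
    The last row is zero-padded when the payload length is not a multiple
    of the width."""
    rows = math.ceil(len(payload) / width)
    padded = payload.ljust(rows * width, b"\x00")
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]

# Hypothetical payload bytes standing in for captured traffic.
payload = bytes([0, 255, 16, 32, 64])
for row in bytes_to_grayscale(payload, width=2):
    print(row)
```

The resulting matrix could then be resized and passed to an image classifier, which is where the texture features mentioned in the abstract would come into play.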

The Competition of Garbage Classification Visual Recognition

This paper introduces the details of the “Tianyi Cup” Intelligent Environmental Protection challenge. It covers the assumptions of the different topics, the selection and preparation of the garbage data sets, the segmentation and preprocessing of the garbage data sets, and the results of the competition on the final data set. It also puts forward some suggestions for improving the level of garbage classification.

Yiyi Lu, Hanlong Guo, Jiaming Ma, Zhengrong Ge

An Academic Achievement Prediction Model Enhanced by Stacking Network

This article focuses on the use of data mining and machine learning in AI education to achieve better prediction accuracy for students’ academic achievement. Many well-built gradient boosting machines already exist for prediction on small data sets, such as LightGBM and XGBoost. Building on these, we present and evaluate a new method for regression prediction. Our Stacking Network combines traditional ensemble models with the idea of deep neural networks. Compared with the original Stacking method, the Stacking Network can increase the number of layers without a fixed limit, which makes it considerably more effective than traditional Stacking. At the same time, compared with a deep neural network, the Stacking Network inherits the advantages of the boosting machines. Applying this approach, we achieve higher accuracy and better speed than a conventional deep neural network. We also achieved the highest rank on the Middle School Grade Dataset provided by Shanghai Telecom Corporation.

Shaofeng Zhang, Meng Liu, Jingtao Zhang

Quality Assessment


Blind Panoramic Image Quality Assessment Based on Project-Weighted Local Binary Pattern

The majority of existing objective panoramic image quality assessment (PIQA) algorithms are based on peak signal-to-noise ratio (PSNR) or structural similarity (SSIM). However, they are not highly consistent with human perception. In this paper, a new blind panoramic image quality assessment metric is proposed based on the project-weighted gradient local binary pattern histogram (PWGLBP), which explores the structure degradation on the sphere by incorporating the nonlinear transformation relationship between the projected plane and the sphere. Finally, support vector regression (SVR) is adopted to learn a quality predictor from the feature space to the quality score space. The experimental results demonstrate the superiority of our proposed metric compared with state-of-the-art objective PIQA methods.

Yumeng Xia, Yongfang Wang, Peng Ye

Blind 3D Image Quality Assessment Based on Multi-scale Feature Learning

3D image quality assessment (3D-IQA) plays an important role in 3D multimedia applications. In recent years, convolutional neural networks (CNNs) have been widely used in various image processing tasks and have achieved excellent performance. In this paper, we propose a blind 3D-IQA metric based on multi-scale feature learning using multi-column convolutional neural networks (3D-IQA-MCNN). To address the problem of limited 3D-IQA dataset size, we take patches from the left image and right image as input and use a full-reference (FR) IQA metric to approximate a reference ground truth for training the 3D-IQA-MCNN. Then we feed the patches from the left image and right image into the pre-trained 3D-IQA-MCNN and obtain two multi-scale quality feature vectors. Finally, by regressing the quality feature vectors onto the subjective mean opinion score (MOS), the visual quality of 3D images is predicted. Experimental results show that the proposed method achieves high consistency with human subjective assessment and outperforms several state-of-the-art 3D-IQA methods.

Yongfang Wang, Shuai Yuan, Yumeng Xia, Ping An

Research on Influence of Content Diversity on Full-Reference Image Quality Assessment

With the development of image quality assessment (IQA), full-reference (FR) IQA algorithms are becoming more and more mature. This paper analyzes five FR natural scene (NS) IQA algorithms and three FR screen content (SC) IQA algorithms in terms of prediction performance and design principle. Experiments show that (1) the perceptual-similarity-based IQA method proposed by Gu et al. performs better than the other algorithms, and (2) multi-scale technology plays an important role in improving the performance of the algorithms. Finally, the development direction of image quality evaluation is summarized and discussed.

Huiqing Zhang, Shuo Li, Zhifang Xia, Weiling Chen, Lijuan Tang

Screen Content Picture Quality Evaluation by Colorful Sparse Reference Information

With the rapid development of multimedia interactive applications, the volume of screen content (SC) images being processed is increasing day by day. Research on image quality assessment is the basis of many other applications. General image quality assessment (QA) research has focused on natural scene (NS) images, so quality assessment research for SC images has become urgent and is receiving more and more attention. Accurate quality assessment of SC images helps improve the user experience. On this basis, this paper proposes an improved method that uses very sparse reference information for accurate quality assessment of SC images. Specifically, the proposed method extracts macroscopic structure, microscopic structure and color information, measures the differences in these features between the original SC image and its distorted version, and finally calculates the overall quality score of the distorted SC image. The quality assessment model we built uses a dimension-reduced histogram and only needs to transmit very sparse reference information. Experiments show that the proposed method is clearly superior to state-of-the-art relevant quality metrics in the visual quality assessment of SC images.

Huiqing Zhang, Donghao Li, Shuo Li, Zhifang Xia, Lijuan Tang

PMIQD 2019: A Pathological Microscopic Image Quality Database with Nonexpert and Expert Scores

In medical diagnostic analysis, the pathological microscopic image is often regarded as a gold standard, and hence its study is of great necessity. High-quality pathological microscopic images enable doctors to arrive at a correct diagnosis. The pathological microscopic image is an important cornerstone for the modernization and computerization of medical procedures. The quality of pathological microscopic images may be degraded for a variety of reasons, which makes it difficult to acquire key information, so research on quality assessment of pathological microscopic images is necessary. In this paper, we perform a study on subjective quality assessment of pathological microscopic images and investigate whether existing objective quality measures can be applied to them. Concretely, we establish a new pathological microscopic image quality database (PMIQD), which includes 425 pathological microscopic images of different quality levels. The mean opinion scores rated by nonexperts and experts are then calculated. In addition, we investigate the prediction performance of existing popular image quality assessment (IQA) algorithms on PMIQD, including 8 no-reference (NR) methods. Experimental results demonstrate that the present objective models do not work well. IQA methods for pathological microscopic images need to be developed to predict the quality rated by nonexperts and experts.

Shuning Xu, Menghan Hu, Wangyang Yu, Jianlin Feng, Qingli Li

Telecommunications



A Generalized Cellular Automata Approach to Modelling Contagion and Monitoring for Emergent Events in Sensor Networks

To improve invulnerability and adaptability in sensor networks, we propose a cellular automata (CA) based propagation control mechanism (CACM) to inhibit and monitor emergent-event contagion. The cellular evolving rules of CACM are expressed as multi-dimension convolution operations and cell state transforms, which can be used to model the complex behavior of sensor nodes by separating the intrinsic and extrinsic states of each network cell. Furthermore, inspired by burning pain for Wireworld based monitoring model, network entropy theory is introduced into the layered states of CACM to construct a particle-based information communication process through efficient distribution of event-related messages on network routers, thereby achieving invulnerable and energy-efficient diffusion and monitoring. Experimental results show that CACM can outperform traditional propagation models in adaptive invulnerability and self-recovery scalability on sensor networks for propagation control of malicious events.

Ru Huang, Hongyuan Yang, Haochen Yang, Lei Ma

Video Surveillance


Large-Scale Video-Based Person Re-identification via Non-local Attention and Feature Erasing

Encoding the video tracks of a person into an aggregated representation is the key to video-based person re-identification (re-ID), where average pooling or RNN methods are typically used to aggregate frame-level features. However, it is still difficult to deal with the spatial misalignment caused by occlusion, posture changes and camera views. Inspired by the success of the non-local block in video analysis, we use a non-local attention block as a spatial-temporal attention mechanism to handle the spatial-temporal misalignment problem. Moreover, partial occlusion occurs widely in video sequences. We propose a local feature branch that tackles the partial occlusion problem by applying feature erasing to the frame-level feature map. Our network is therefore composed of two branches: a global branch that encodes the global feature via non-local attention and a local feature branch that captures the local feature. In evaluation, the global feature and local feature are concatenated to obtain a more discriminative feature. We conduct extensive experiments on two challenging datasets (MARS and iLIDS-VID). The experimental results demonstrate that our method is comparable with state-of-the-art methods on these datasets.

Zhao Yang, Zhigang Chang, Shibao Zheng

Design and Optimization of Crowd Behavior Analysis System Based on B/S Software Architecture

With the development of society and the economy, the importance of crowd behavior analysis is increasing. However, such a system often requires a large amount of computing resources, which personal computers in a traditional client/server architecture (C/S architecture) often cannot provide. Therefore, based on the existing local analysis system [5], we construct a crowd behavior analysis system based on a browser/server architecture (B/S architecture). We then optimize many aspects of this B/S system to improve its communication capability and stability under high load. Finally, the CGAN-based crowd counting module is accelerated. The generator of the CGAN (Conditional Generative Adversarial Network) is optimized through residual layer pruning, upsampling optimization and removal of the instance normalization layer, and is then deployed in TensorRT with INT8 quantization. After these optimizations, the inference speed on the NVIDIA platform reaches 541.6% of that of the original network with almost no loss of inference accuracy.

Yuanhang He, Jing Guo, Xiang Ji, Hua Yang
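
The INT8 quantization step named in the abstract above can be pictured with a small, generic sketch. The code below shows symmetric per-tensor quantization, in which floats are mapped to integers in [-127, 127] with a single scale factor. It does not use or imitate the TensorRT API or its calibration procedure; `quantize_int8` and `dequantize` are hypothetical helper names chosen for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127].

    Returns (quantized integers, scale). Hypothetical helper for
    illustration only; real toolchains pick the scale by calibration.
    """
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


if __name__ == "__main__":
    weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
    q, scale = quantize_int8(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

In this scheme the rounding error per value is at most half the scale, which is one reason accuracy can stay close to the floating-point model when value ranges are well behaved.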

Affective Video Content Analysis Based on Two Compact Audio-Visual Features

In this paper, we propose a new framework for affective video content analysis using two compact audio-visual features. In the proposed framework, eGeMAPS is first calculated as the global audio feature, and the key frames of optical flow images are then fed to a VGG19 network for transfer learning and visual feature extraction. Finally, logistic regression is employed for affective video content classification. In the experiments, we evaluate the audio and visual features on the dataset of the Affective Impact of Movies Task 2015 (AIMT15) and compare our results with those of the teams that participated in AIMT15. The comparison shows that the proposed framework achieves a classification result comparable to that of the first-place entry of AIMT15 with a total feature dimension of 344, which is only about one thousandth of the feature dimension used by the first-place entry.

Xiaona Guo, Wei Zhong, Long Ye, Li Fang, Qin Zhang

Virtual Reality


Image Aligning and Stitching Based on Multilayer Mesh Deformation

This paper aims to solve the problem of strong disparity in image stitching. Having studied current mesh deformation methods based on a single grid, we propose an image stitching framework based on multilayer mesh deformation that aligns regions in different layers. With the development of image depth perception and semantic segmentation technology, layering maps of images or photos can be obtained conveniently. We introduce a layered image representation and establish layer correspondences using depth or disparity information for large-parallax scenarios. Each layer is registered independently. To ensure the integrity of the layer synthesis result, we apply deformation with translation and scaling compensation between different layers before blending. Experiments demonstrate that our method makes good use of the prior information in layering maps to decouple the 2D transformations of different layers, achieving outstanding alignment in all layers and a natural-looking complete stitching result.

Mingfu Xie, Jun Zhou, Xiao Gu, Hua Yang

Point Cloud Classification via the Views Generated from Coded Streaming Data

Point clouds have been widely used in fields such as virtual reality and autonomous driving. As the basis of point cloud processing, point cloud classification has drawn much attention. This paper proposes a view-based framework for streaming point cloud classification. We obtain six views from the coded stream without fully decoding it and use them as the inputs of the neural network; a modified ResNet structure is then proposed to generate the final classification results. The experimental results show that our framework achieves comparable results while remaining usable when the input is streaming point cloud data.

Qianqian Li, Long Ye, Wei Zhong, Li Fang, Qin Zhang
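
To give a feel for what "six views" of a point cloud can look like, the sketch below renders orthographic depth images onto the six faces of a unit bounding cube. This is only an assumption about the kind of views involved: it works on already decoded points, whereas the abstract above describes obtaining the views from the coded stream without full decoding. The function name `six_depth_views` and the `res` parameter are hypothetical.

```python
def six_depth_views(points, res=8):
    """Project points in the unit cube onto its six faces.

    points: iterable of (x, y, z) tuples with coordinates in [0, 1].
    Returns a dict mapping a view name ("-x", "+x", ...) to a
    res x res grid holding the distance from that face to the nearest
    point (None where no point projects). Hypothetical helper.
    """
    names = ["-x", "+x", "-y", "+y", "-z", "+z"]
    views = {n: [[None] * res for _ in range(res)] for n in names}
    for p in points:
        for axis, letter in enumerate("xyz"):
            # The two remaining axes index the pixel grid of this face pair.
            u_axis, v_axis = [a for a in range(3) if a != axis]
            u = min(int(p[u_axis] * res), res - 1)
            v = min(int(p[v_axis] * res), res - 1)
            # "-x" looks from the face x = 0, "+x" from the face x = 1.
            candidates = (
                ("-" + letter, p[axis]),
                ("+" + letter, 1.0 - p[axis]),
            )
            for name, depth in candidates:
                current = views[name][u][v]
                # Keep only the point closest to the viewing face.
                if current is None or depth < current:
                    views[name][u][v] = depth
    return views
```

Each of the six grids can then be treated as a single-channel image and passed to a 2D image classifier.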

Geometry-Guided View Synthesis with Local Nonuniform Plane-Sweep Volume

In this paper, we develop a geometry-guided image generation technique for scene-independent novel view synthesis from a stereo image pair. We employ the successful plane-sweep strategy to approximate the 3D scene structure. Instead of adopting a general configuration, however, we use depth information to perform local nonuniform plane spacing. More specifically, we first explicitly estimate a depth map in the reference view and use it to guide the spacing of planes in the plane-sweep volume, which yields a geometry-guided approximation of the scene geometry. Next, we learn to predict a multiplane image (MPI) representation, which can then be used to efficiently synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. Our results on a large dataset of YouTube video frames indicate that our approach synthesizes higher-quality images while keeping the same number of depth planes.

Ao Li, Li Fang, Long Ye, Wei Zhong, Qin Zhang
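
One way to picture depth-guided plane spacing is to compare a fixed plane layout with one driven by an estimated depth map. In the sketch below, the baseline places planes uniformly in inverse depth, which is an assumption about what a "general configuration" means, and the guided variant places planes at evenly spaced quantiles of the estimated depths, which is an illustrative choice and not necessarily the authors' rule. Both function names are hypothetical.

```python
def uniform_planes(d_near, d_far, n):
    """Baseline (assumed): n >= 2 planes uniform in inverse depth."""
    inv_near, inv_far = 1.0 / d_near, 1.0 / d_far
    step = (inv_far - inv_near) / (n - 1)
    return [1.0 / (inv_near + step * i) for i in range(n)]


def depth_guided_planes(depth_values, n):
    """Illustrative nonuniform spacing: n >= 2 planes placed at evenly
    spaced quantiles of the estimated depths, so that planes cluster
    where the scene actually has surfaces."""
    s = sorted(depth_values)
    last = len(s) - 1
    return [s[round(i * last / (n - 1))] for i in range(n)]


if __name__ == "__main__":
    # Toy depth map: most pixels lie between 1.0 and 1.7, one far outlier.
    depths = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 10.0]
    print(uniform_planes(1.0, 10.0, 3))
    print(depth_guided_planes(depths, 3))
```

With the same plane budget, the guided layout spends its planes where the depth map says geometry exists, rather than spreading them over empty depth ranges.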

Single View 3D Reconstruction with Category Information Learning

3D reconstruction from a single image is a classical problem in computer vision. Because the information contained in a single image is not sufficient for 3D shape reconstruction, existing models cannot reconstruct 3D shapes very well. To tackle this problem, we propose a novel model that effectively utilizes the category information of objects to improve network performance on single-view 3D reconstruction. Our model consists of two parts: a rough shape generation network (RSGN) and a category comparison network (CCN). Through the comparison part, CCN, RSGN can learn the characteristics of objects in the same category. In the experiments, we verify the feasibility of our model on the ShapeNet dataset, and the results confirm the effectiveness of our framework.

Weihong Cao, Fei Hu, Long Ye, Qin Zhang

Three-Dimensional Reconstruction of Intravascular Ultrasound Images Based on Deep Learning

Coronary artery disease (CAD) is a common atherosclerotic disease and one of the leading diseases that endanger human health. Acute cardiovascular events are catastrophic; their main cause is atherosclerosis (AS) plaque rupture and secondary thrombosis. To measure important parameters such as the diameter, cross-sectional area, volume, and wall thickness of the vessel and the size of the AS plaque, it is first necessary to extract, in each intravascular ultrasound (IVUS) frame, the inner and outer membrane edges of the vessel wall and any plaque edges that may exist. IVUS-based three-dimensional intravascular reconstruction can accurately assess and characterize the tissue in various cardiovascular diseases, which helps in choosing the best treatment option. However, vascular bifurcations create great difficulties for the segmentation and reconstruction of the inner and outer membranes. To solve this problem, this paper uses a deep learning method that first classifies frames into bifurcated and normal vessels and then segments the inner and outer membranes of each class separately. Finally, the segmented vessels are reconstructed in three dimensions, which is of great significance for the auxiliary diagnosis and treatment of coronary heart disease.

Yankun Cao, Zhi Liu, Xiaoyan Xiao, Yushuo Zheng, Lizhen Cui, Yixian Du, Pengfei Zhang


Further Information