This book constitutes the thoroughly refereed proceedings of the 14th International Conference on Collaborative Computing: Networking, Applications, and Worksharing, CollaborateCom 2018, held in Shanghai, China, in December 2018. The 43 full and 19 short papers presented were carefully reviewed and selected from 106 submissions. The papers reflect the conference sessions as follows: vehicular networks; social networks, information processing, data detection and retrieval & mobility, parallel computing, knowledge graph, cloud and optimization & software testing and formal verification; collaborative computing, social networks, vehicular networks, networks and sensors, information processing and collaborative computing, mobility and software testing and formal verification, web services and image information processing, web services and remote sensing.

Meta-Path and Matrix Factorization Based Shilling Detection for Collaborate Filtering

Nowadays, collaborative filtering methods are widely applied on e-commerce platforms. However, due to the openness of these systems, a large number of spammers attack them to manipulate the recommendation results and earn huge profits. The shilling attack has become a major threat to collaborative filtering systems, so detecting shilling attacks effectively is a crucial task. Most existing detection methods, whether based on statistical features or unsupervised, rely on a priori knowledge about attack size. Besides, the majority of work focuses on the rating attack and ignores the relation attack. In this paper, motivated by the success of heterogeneous information networks and oriented towards the hybrid attack, we propose an approach, DMD, to detect shilling attacks based on meta-paths and matrix factorization. First, we concatenate the user-item bipartite network and the user-user relation network into a whole. Next, we design several meta-paths to guide random walks that produce node sequences, and we utilize the skip-gram model to generate user embeddings. Meanwhile, users’ latent factors are obtained by matrix factorization. Finally, we combine these embeddings and factors to jointly train the detector. Extensive experimental analysis on two public datasets demonstrates the superiority of the proposed method and shows its effectiveness against different attack strategies and various attack sizes.

Xin Zhang, Hong Xiang, Yuqi Song
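
The meta-path-guided random walk mentioned in the abstract above can be illustrated with a minimal sketch. This is not the DMD implementation: the function `metapath_walk`, the toy graph, and the single symmetric meta-path User-Item-User are all hypothetical and chosen only to show how a meta-path constrains which neighbor types a walk may visit.

```python
import random

def metapath_walk(neighbors, node_type, start, metapath, length, seed=0):
    """Random walk whose node types cycle through a symmetric meta-path.

    Assumes metapath[0] == metapath[-1] and that `start` has type metapath[0].
    """
    rng = random.Random(seed)
    period = len(metapath) - 1  # the last type coincides with the first
    walk = [start]
    while len(walk) < length:
        wanted = metapath[len(walk) % period]
        # Only neighbors of the type required by the meta-path are eligible.
        candidates = [n for n in neighbors[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk

# Toy heterogeneous network (hypothetical): two users, one item,
# with a user-user relation edge and two user-item rating edges.
node_type = {"u1": "U", "u2": "U", "i1": "I"}
neighbors = {"u1": ["i1", "u2"], "u2": ["i1", "u1"], "i1": ["u1", "u2"]}

walk = metapath_walk(neighbors, node_type, "u1", ["U", "I", "U"], length=5)
print(walk)
```

In a pipeline of the kind the abstract describes, many such walks would be collected and passed to a skip-gram model to learn user embeddings; that step is omitted here.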

Collaborative Thompson Sampling

Thompson sampling is one of the most effective strategies for balancing the exploration-exploitation trade-off. It has been applied in a variety of domains and has achieved remarkable success. Thompson sampling makes decisions in a noisy but stationary environment by accumulating uncertain information over time to improve prediction accuracy. In highly dynamic domains, however, the environment undergoes frequent and unpredictable changes, and decisions in such an environment should rely on current information. Therefore, standard Thompson sampling may perform poorly in these domains. Here we present a collaborative Thompson sampling algorithm that applies the exploration-exploitation strategy to highly dynamic settings. The algorithm takes collaborative effects into account by dynamically clustering users into groups; the feedback of all users in the same group helps to estimate the expected reward in the current context and find the optimal choice. Incorporating collaborative effects into Thompson sampling allows the algorithm to capture real-time changes in the environment and adjust its decision-making strategy accordingly. We compare our algorithm with standard Thompson sampling algorithms on two real-world datasets. Our algorithm shows accelerated convergence and improved prediction performance in collaborative environments. We also provide a regret analysis of our algorithm on a non-contextual model.

Zhenyu Zhu, Liusheng Huang, Hongli Xu
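
For orientation, the standard (non-collaborative) baseline that the abstract above compares against can be sketched as Beta-Bernoulli Thompson sampling. This is a textbook version under stated assumptions, not the authors' collaborative algorithm: rewards are assumed to be 0/1, each arm starts with a uniform Beta(1, 1) prior, and the name `thompson_sampling` and its parameters are illustrative.

```python
import random

def thompson_sampling(true_probs, rounds, seed=0):
    """Play a Bernoulli bandit with Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    n = len(true_probs)
    successes = [0] * n
    failures = [0] * n
    pulls = [0] * n
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        # Observe a simulated 0/1 reward and update that arm's posterior.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

print(thompson_sampling([0.1, 0.9], rounds=2000))
```

Because the posterior accumulates all past feedback, this baseline adapts slowly when the reward probabilities drift, which is the weakness in dynamic settings that the abstract points out.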

Collaborative Workflow Scheduling over MANET, a User Position Prediction-Based Approach

The explosive increase in mobile devices and advanced communication technologies has prompted the emergence of mobile computing. In this paradigm, mobile users’ idle resources can be shared as services with other users through device-to-device links. Some complex workflow-based mobile applications therefore no longer need to be offloaded to a remote cloud; instead, they can be solved locally with the help of other devices in a collaborative way. Nevertheless, various challenges, especially the reliability and quality-of-service of such a collaborative workflow scheduling problem, are yet to be properly tackled. Most studies and related scheduling strategies assume that mobile users are fully stable and constantly available. However, this is not realistic in most real-world scenarios, where mobile users are on the move most of the time. The mobility of mobile users affects the reliability of the corresponding shared resources and consequently the success rate of workflows. In this paper, we propose a reliability-aware mobile workflow scheduling approach based on prediction of mobile users’ positions. We model the scheduling problem as a multi-objective optimization problem and develop an evolutionary multi-objective optimization algorithm to solve it. Extensive case studies are performed on a real-world mobile user trajectory dataset and show that our proposed approach significantly outperforms traditional approaches in terms of workflow success rate.

Qinglan Peng, Qiang He, Yunni Xia, Chunrong Wu, Shu Wang

Worker Recommendation with High Acceptance Rates in Collaborative Crowdsourcing Systems

Mingchu Li, Xiaomei Sun, Xing Jin, Linlin Tian

Cost-Aware Targeted Viral Marketing with Time Constraints in Social Networks

Online social networks have become one of the most effective platforms for marketing, a practice known as viral marketing. The main challenge of viral marketing is to find a set of k users that maximizes the expected influence, which is known as the Influence Maximization (IM) problem. In this paper, we incorporate heterogeneous costs and benefits of users as well as time constraints, including the time delay and time deadline of influence diffusion, into the IM problem and propose the Cost-aware Targeted Viral Marketing with Time constraints (CTVMT) problem, which seeks the most cost-effective seed users who can influence the most relevant users within a time deadline. We study the problem under the IC-M and LT-M diffusion models, which extend the IC and LT models with time constraints. Since CTVMT is NP-hard under both models, we design a BCT-M algorithm that uses two new benefit sampling algorithms, designed for IC-M and LT-M respectively, to obtain a solution with an approximation ratio. To the best of our knowledge, this is the first algorithm that provides an approximation guarantee for our problem. Our empirical study over several real-world networks demonstrates the performance of our proposed solutions.

Ke Xu, Kai Han
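
As background for the diffusion models named in the abstract above, the plain Independent Cascade (IC) model can be sketched with a Monte Carlo estimate of the expected spread of a seed set. This sketch covers only the basic IC model; the time delays and deadline of IC-M, the user costs and benefits, and the BCT-M algorithm itself are not modeled. The function names and the toy graph are hypothetical.

```python
import random

def simulate_ic(graph, seeds, rng):
    """One Independent Cascade run; graph maps node -> list of (neighbor, prob)."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def expected_spread(graph, seeds, runs=1000, seed=0):
    """Average number of activated nodes over repeated simulations."""
    rng = random.Random(seed)
    return sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs)) / runs

# Toy directed network (hypothetical edge probabilities).
graph = {"a": [("b", 0.5), ("c", 0.5)], "b": [("d", 0.5)], "c": [("d", 0.5)]}
print(expected_spread(graph, ["a"]))
```

Influence maximization then asks which seed set of a given size (or budget) maximizes this expected spread.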

Exploring Influence Maximization in Location-Based Social Networks

In the last two decades, the problem of Influence Maximization (IM) in traditional online social networks has been extensively studied. The goal is to find a seed set that has maximum influence spread under a specific network transmission model. However, in real life, information can spread not only through online social networks, but also between neighbors who are close to each other in the physical world. The Location-Based Social Network (LBSN) is a new type of social network that is increasingly common. In an LBSN, users can not only make friends, but also share the events they participate in at different locations by checking in. In this paper, we study IM in LBSNs, considering the influence of both online and offline interactions. A two-layer network model and an information propagation model are proposed. We also formalize the IM problem in LBSNs and present an algorithm that achieves an approximation factor of $$(1 - 1/e - \epsilon)$$ in near-linear expected time. The experimental results show that the algorithm is efficient while offering strong theoretical guarantees.

Shasha Li, Kai Han, Jiahao Zhang

Measuring Bidirectional Subjective Strength of Online Social Relationship by Synthetizing the Interactive Language Features and Social Balance (Short Paper)

In online collaboration, a recent study reveals that, beyond the objective strength of a social relationship, the two participants can hold different subjective opinions on the relationship between them, and that these opinions can be investigated through the language they use when interacting. However, the two participants’ bidirectional opinions in collaboration are not only determined by their interaction within this relationship, but also influenced by adjacent third-party partners. In this work, we define the two participants’ opinions as the subjective strength of their relationship. To measure the bidirectional subjective strength of a social relationship, we propose a computational model that synthesizes features from the participants’ interactive language and the adjacent balance in the social network. Experimental results on real collaboration in the Enron email dataset verify the effectiveness of the proposed model.

Baixiang Xue, Bo Wang, Yanshu Yu, Ruifang He, Yuexian Hou, Dawei Song

Recommending More Suitable Music Based on Users’ Real Context

Music recommendation is a popular function in personalized services and smart applications, since it focuses on discovering users’ leisure preferences. Traditional music recommendation strategies capture users’ music preferences by analyzing their historical behaviors in order to make personalized recommendations. However, users’ current states, such as being busy at work or travelling for leisure, have an important influence on their enjoyment of music. Existing methods usually focus only on pushing users their favorite music, which may not be the most suitable for the current scenario. Users’ current states should be taken into account to make better music recommendations. To address this problem, this paper proposes a music recommendation method that considers both users’ current states and their historical behaviors. First, a feature selection process based on the ReliefF method is applied to discover the optimal features for the subsequent recommendation. Second, we construct different feature groups according to the feature weights and introduce the Naive Bayes model and the AdaBoost algorithm to train on these feature groups, which outputs a base classifier for each feature group. Finally, a majority voting strategy decides the optimal music type, and each user is recommended more suitable music based on their current context. Experiments on real datasets show the effectiveness of the proposed method.

Qing Yang, Le Zhan, Li Han, Jingwei Zhang, Zhongqin Bi
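
The final majority-voting step described in the abstract above can be sketched in a few lines. The base classifiers are not reproduced here; the sketch assumes each one has already output a music-type label, and the function name `majority_vote` and the example labels are hypothetical. How the authors break ties is not stated; this sketch returns the label that was seen first among those tied.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most base classifiers.

    Ties go to the label encountered first, which is an assumption of
    this sketch rather than a documented choice of the method.
    """
    return Counter(predictions).most_common(1)[0][0]

# One predicted music type per feature-group classifier (made-up labels).
votes = ["jazz", "rock", "jazz"]
print(majority_vote(votes))
```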

A Location Spoofing Detection Method for Social Networks (Short Paper)

It is well known that check-in data from location-based social networks (LBSNs) can be used to predict human movement. However, there are large discrepancies between check-in data and actual user mobility, because users can easily spoof their location in an LBSN. Location spoofing refers to intentionally reporting a false location, which harms both the credibility of location-based social networks and the reliability of spatial-temporal data. In this paper, a location spoofing detection method for social networks is proposed. First, a Latent Dirichlet Allocation (LDA) model is used to learn users’ topics by mining user-generated microblog content, and on this basis a similarity matrix associated with the venue is calculated. The venue visiting probability is then computed from the user’s historical check-in data using a Bayes model. Finally, the similarity value and the visiting probability are combined to quantify the probability of location spoofing. Experiments on a large-scale, real-world LBSN dataset collected from Weibo show that the proposed approach can effectively detect certain types of location spoofing.

Chaoping Ding, Ting Wu, Tong Qiao, Ning Zheng, Ming Xu, Yiming Wu, Wenjing Xia

The Three-Degree Calculation Model of Microblog Users’ Influence (Short Paper)

Highly influential social users can guide public opinion and influence how the public vents its emotions. Therefore, it is of great significance to identify high-impact users effectively. This paper starts from users’ text content, users’ emotions, and fans’ behaviors. It combines the amount of information in the content and its sentiment tendency with the fans’ forwarding, commenting, and liking actions. Based on the principle of three degrees of influence, a user influence calculation model is constructed. Finally, the experimental results show that the three-degree influence calculation model is more accurate and effective than other similar models.

Xueying Sun, Fu Xie

Identifying Local Clustering Structures of Evolving Social Networks Using Graph Spectra (Short Paper)

The clustering coefficient has been widely used for identifying the local structure of networks. In this paper, the weighted spectral distribution with 3-cycle (WSD3), which is similar (but not equal) to the clustering coefficient, is studied on evolving social networks. It is demonstrated that the ratio of the WSD3 to the network size (i.e., the number of nodes) discriminates the size-independent local structure of social networks more sensitively than the clustering coefficient does. Moreover, the difference in the WSD3’s performance on social networks and communication networks is investigated, and it is found that the difference is induced by the different symmetry features of the normalized Laplacian spectral densities of these networks.

Bo Jiao, Yiping Bao, Jin Wang
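
The clustering coefficient that the abstract above uses as its point of comparison has a standard definition, sketched here for an undirected, unweighted graph. The WSD3 itself is a spectral quantity and is not computed. The function names and the toy graph are illustrative, and nodes are assumed to be integers so that each neighbor pair can be counted once.

```python
def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to close a triangle
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))
```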

The Parallel and Precision Adaptive Method of Marine Lane Extraction Based on QuadTree

Extracting marine lanes from ocean spatial big data is a challenging problem. One of the challenges is that the quality of the trajectory data is quite low and varies greatly between areas. A parallel and precision-adaptive method of marine lane extraction based on QuadTree is proposed to meet this challenge. The method combines several techniques, including average sampling, interpolation, noise removal, trajectory segmentation, and trajectory clustering based on GeoHash encoding, through the MapReduce parallel computing framework. This preprocessing phase effectively simplifies the big data and improves the efficiency of data processing. Based on the QuadTree data structure, a parallel merge filtering algorithm is proposed and implemented using the Spark framework. The algorithm performs grid merging on the sparse grid regions and obtains a new grid result with cells of different sizes. A sliding local window filtering algorithm based on the QuadTree is proposed to obtain the marine lane grid set data. By applying the Delaunay triangulation method to the grid data, the multi-precision marine lane results are effectively extracted. The experimental results show that the proposed method can automatically extract multi-precision marine lanes from trajectory data near the coast at both high and low grid precision.

Zhuoran Li, Guiling Wang, Jinlong Meng, Yao Xu

GPU-accelerated Large-Scale Non-negative Matrix Factorization Using Spark

Non-negative matrix factorization (NMF) has been introduced as an efficient way to reduce the complexity of data compression and is valued for its ability to extract highly interpretable parts from data sets. It has been applied in various fields, such as recommendation, image analysis, and text clustering. However, as the size of the matrix increases, non-negative matrix factorization algorithms become very slow. To solve this problem, this paper proposes a GPU-based parallel algorithm for NMF on the Spark platform, which makes full use of the advantages of the in-memory computation mode and the GPU Single-Instruction Multiple-Data Streams mode. The new GPU-accelerated NMF on the Spark platform is evaluated on a 4-node heterogeneous Spark cluster on Google Compute Engine, with each node configured with an NVIDIA K80 GPU card. Experimental results indicate that it is competitive with existing solutions in terms of computation time on a variety of matrix orders. It achieves a high speed-up and can also deal effectively with the non-negative decomposition of higher-order matrices, which greatly improves computational efficiency.

Bing Tang, Linyao Kang, Yanmin Xia, Li Zhang

Adaptive Data Sharing Algorithm for Aerial Swarm Coordination in Heterogeneous Network Environments (Short Paper)

With the development of unmanned aerial vehicle (UAV) systems, multi-UAV cooperation has attracted considerable attention. In response to the communication constraints faced in UAV swarm coordination, both the lazy and the eager strategies were proposed to enable reliable swarm-wide information exchange and thereby further behavior coordination for UAV swarms. However, these two algorithms have only been evaluated in a fixed and homogeneous network scenario. Hence, how to choose the proper information exchange strategy for a UAV swarm in realistic dynamic and heterogeneous network environments remains an open and interesting problem. Therefore, in this paper, we first evaluate the convergence and payload cost of both strategies for robotic swarms in realistic network scenarios. Then we propose a novel online adaptive information exchange strategy that adopts single relay selection schemes to ensure low payload and fast convergence in various network environments. Numerical results reveal that our novel strategy performs well across different network scenarios in terms of convergence and payload cost, showing its robustness, adaptive capability, and potential for application in UAV swarms.

Yanqi Zhang, Bo Zhang, Xiaodong Yi

Reverse Collective Spatial Keyword Querying (Short Paper)

Recently, Collective Spatial Keyword Querying (CoSKQ), which returns a group of objects that collectively cover a set of given keywords and have the smallest cost, has received extensive attention in the spatial database community. However, no research so far has focused on the situation in which the result of a CoSKQ is taken as the input of a query, even though this kind of query has many applications in location-based services. In this paper, we introduce a new problem, Reverse Collective Spatial Keyword Querying (RCoSKQ), which returns a region in which the query objects are the qualified objects with the highest spatial and textual similarity. We propose an efficient method that uses an IR-tree to retrieve objects with text descriptions. To accelerate the query process, a pruning method that effectively reduces computation is proposed. Experiments over real and synthetic data sets demonstrate the efficiency of our approaches.

Yang Wu, Jian Xu, Liming Tu, Ming Luo, Zhi Chen, Ning Zheng

Important Member Discovery of Attribution Trace Based on Relevant Circle (Short Paper)

Cyberspace attacks have been a persistent problem since the Internet came into existence. Among the many defense measures, collecting information about network attackers and their organizations is a promising means of keeping cyberspace secure, since exposing attackers halts their further operations. To profile them, we combine the retrieved pieces of attack-related information to form a trace network. In this attributional trace network, distinguishing the importance of different pieces of trace information helps in mining more unknown information about the organizational community we care about. In this paper, we propose to adopt relevant circles to locate the more important vertices in the trace network. The algorithm first uses depth-first search to traverse all vertices in the trace network. It then discovers and refines the relevant circles derived from this network tree, and the rank score is calculated based on these relevant circles. Finally, we use the classical 911 covert network dataset to validate our approach.

Jian Xu, Xiaochun Yun, Yongzheng Zhang, Zhenyu Cheng

Safety Message Propagation Using Vehicle-Infrastructure Cooperation in Urban Vehicular Networks

The soaring number of vehicles in modern cities brings complicated urban transportation and severe safety risks. After a traffic accident occurs, quickly disseminating an alert to other vehicles is very important to avoid rear-end collisions and traffic jams. Existing studies mainly use vehicles travelling in the same direction as the collision vehicles to forward safety messages, which strictly limits performance improvements. In this paper, we propose a safety message propagation scheme using vehicle-infrastructure cooperation in urban vehicular networks, named SMP. On straight roads, the opposite-lane front vehicles help to relay data when no further collision-lane back vehicles exist, while at intersections, the deployed roadside units create new safety messages with updated dissemination parameters and distribute them in the upstream lanes. The collaboration of vehicles in two directions and roadside units enhances the performance of safety-related applications. Besides, three checking policies are designed to avoid transmission failures and hence save network resources. Simulation experiments show that SMP achieves a high reception ratio and a short propagation delay.

Xiaolan Tang, Zhi Geng, Wenlong Chen

Predicting Duration of Traffic Accidents Based on Ensemble Learning

Traffic congestion can be divided into recurrent congestion and accidental congestion, the latter usually being caused by traffic accidents. It is of great significance to predict the duration of traffic accidents accurately and to relay the results to drivers on the road in time. Most existing works use a single traditional machine learning model to predict accident duration, and their accuracy is not satisfactory. In this paper, we first construct and extract features from the accident records, including description and location, as well as external information such as weather. We then divide the duration into multiple periods, corresponding to multiple categories. To improve the prediction precision for rare categories, we convert the multi-class classification problem into binary classification problems, constructing multiple XGBoost binary classifiers that are constrained by the F1 (harmonic mean) evaluation index. Finally, to further improve the overall accuracy, the classification results are integrated using artificial neural networks. The experiment is conducted on real datasets from Xiamen and uses mean absolute percentage error (MAPE) and root-mean-square error (RMSE) as indicators. The experimental results show the effectiveness of the proposed method and its better performance in comparison with traditional models.

Lina Shan, Zikun Yang, Huan Zhang, Ruyi Shi, Li Kuang
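
The two error indicators named in the abstract above have standard definitions, sketched here in plain Python. The sample durations are made up for illustration and are not taken from the paper's Xiamen dataset; MAPE is expressed in percent and assumes no actual value is zero.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error, in the same unit as the inputs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical accident durations in minutes.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted), rmse(actual, predicted))
```

MAPE weights each error relative to the true duration, so short accidents are not drowned out by long ones, whereas RMSE penalizes large absolute errors more heavily.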

A Single-Hop Selection Strategy of VNFs Based on Traffic Classification in NFV

Network Function Virtualization (NFV) has become a hot technology since it provides flexible management of network functions and efficient sharing of network resources. Network resources in NFV require an appropriate management strategy, which often manifests as a difficult online decision-making task. Resource management in NFV can be thought of as a process of selecting or deploying virtualized network functions (VNFs). This paper proposes a single-hop VNF selection strategy to realize network resource management. To satisfy the quality requirements of different network services, this strategy is based on the results of traffic classification, which utilizes Multi-Grained Cascade Forest (gcForest) to distinguish user behaviors on the Internet. Following the order of the VNFs, a network is divided into several layers, in each of which every arriving packet needs to queue. The scheduler of each layer selects a layer that hosts the next VNF for the packets in the queue. Experiments show that the proposed traffic classification method increases precision by 7.7% and improves real-time performance. The VNF selection model reduces network congestion compared to traditional single-hop scheduling models. Moreover, the number of packets that fail to reach the target node in time drops by 30% to 50% with the proposed strategy compared to the same strategy without the traffic classification step.

Bo He, Jingyu Wang, Qi Qi, Haifeng Sun

A Stacking Approach to Objectionable-Related Domain Names Identification by Passive DNS Traffic (Short Paper)

Domain name classification is an important issue in the field of cyber security. Objectionable-related domain names are a category of domain names that host services such as gambling and pornography. They are classified and even forbidden in some areas, and some of them may compromise visitors’ privacy or defraud them of property. Timely and accurate identification of these domain names is significant for Internet content censorship and users’ security. In this work, we analyze the behavior of objectionable-related domain names in real-world DNS traffic and find that there are evident differences between objectionable-related domain names and non-objectionable ones. We propose a stacking approach to the identification of objectionable-related domain names, VisSensor, which automatically extracts name features and latent visiting patterns of domain names from DNS traffic and distinguishes the objectionable-related ones. We integrate convolutional neural networks with fully-connected neural networks to combine features of different dimensions and improve experimental results. The accuracy of VisSensor is 88.48% with a false positive rate of 9.11%. We also compared VisSensor with a public domain name tagging system, and VisSensor performed better than the tagging system at identifying objectionable-related domain names.

Chen Zhao, Yongzheng Zhang, Tianning Zang, Zhizhou Liang, Yipeng Wang

Grid Clustering and Routing Protocol in Heterogeneous Wireless Sensor Networks (Short Paper)

In wireless sensor networks, sensor nodes are usually powered by batteries, so energy consumption is strictly constrained and energy must be used efficiently. However, the typical many-to-one communication mode in a wireless sensor network leads to uneven energy consumption among sensor nodes across the network, which greatly shortens the lifecycle of the entire network. To address this problem, this paper optimizes the heterogeneous chessboard clustering model of sensor networks and proposes a grid clustering mechanism together with an effective node routing protocol, with the goal of prolonging the network lifecycle by balancing the energy consumption of nodes. Simulation experiments show that the grid clustering protocol greatly extends the lifecycle of wireless sensor networks and performs better than LEACH and LRS.

Zheng Zhang, Pei Hu

GeoBLR: Dynamic IP Geolocation Method Based on Bayesian Linear Regression

The geographical location of dynamic IP addresses is important for network security applications. Delay-based or topology-based measurement methods and association-analysis-based methods improve the median estimation accuracy, but still suffer from limited precision (about 799 m) and long response times (tens of seconds), and so cannot meet the high-precision, real-time requirements of location-aware applications, especially for locating dynamic IP addresses. In this paper, we propose a novel approach for dynamic IP geolocation based on Bayesian Linear Regression, namely GeoBLR, which exploits geolocation resources fundamentally different from existing ones. We exploit the location data that users are willing to share in location sharing services for accurate and real-time geolocation of dynamic IP addresses. Experimental results show that, compared to existing geolocation techniques, GeoBLR achieves (1) a median estimation error of 239 m and (2) a mean response time of 270 ms, which are valuable for accurate location-aware network security applications.

Fei Du, Xiuguo Bao, Yongzheng Zhang, Yu Wang

MUI-defender: CNN-Driven, Network Flow-Based Information Theft Detection for Mobile Users

Nowadays people save a lot of private information on mobile devices. This information can be stolen by adversaries through suspicious apps installed on smartphones, and protecting users’ privacy has become a great challenge. It is therefore important and necessary to develop a method for identifying whether apps on a smartphone are stealing users’ personal information. Through the analysis of apps’ network traffic data, we observe that general apps generate regular network flows in step with the users’ normal operations, whereas the network flows of information theft apps have no relationship with users’ operations. In this paper we propose a model, MUI-defender (Mobile Users’ Information defender), which analyzes the relationship between users’ operation patterns and network flows with a CNN (Convolutional Neural Network) and can efficiently detect information theft. Because of C&C (Command-and-Control) server invalidation [33], system version incompatibility [25], and similar issues, most of the collected information theft apps cannot run properly in practice. We therefore extract information theft code modules from some of these apps, and then recode and compile them into the ITM-capsule (Information Theft Modules capsule) for verification. Finally, we run the ITM-capsule and several normal apps and examine the network flows, which shows that our detection model achieves an accuracy higher than 94%. Therefore, MUI-defender is suitable for detecting the network flows of information theft.

Zhenyu Cheng, Xunxun Chen, Yongzheng Zhang, Shuhao Li, Jian Xu

Delayed Wake-Up Mechanism Under Suspend Mode of Smartphone

In this paper, the impact of the Suspend/Resume mechanism on smartphone power consumption is investigated. When the operating system (OS) is in suspend mode, many trivial (less urgent) network packets still wake up the system, making the OS switch frequently from suspend mode to resume mode and thus causing high power consumption. Based on this observation, a novel optimization mechanism is proposed to delay system wake-up and prolong the duration of suspend mode so as to reduce power consumption. This method reduces not only the power consumption of the WiFi components, but also the total power consumption of all components. To verify the effectiveness of the mechanism, we implemented it on the Huawei P8 platform and carried out relevant experiments. The results show that the proposed mechanism can decrease power consumption by more than 7.63%, which indicates its feasibility.

Bo Chen, Xi Li, Xuehai Zhou, Zongwei Zhu

Mobile Data Sharing with Multiple User Collaboration in Mobile Crowdsensing (Short Paper)

With the development of the Internet and smartphones, mobile data sharing has attracted the attention of many researchers. In this paper, we investigate the mobile data sharing problem in mobile crowdsensing. There are a large number of users, each of whom can act either as an acquirer or as a sharer of mobile data; the problem is how to choose users optimally so that they collaboratively share their idle mobile data with others. We consider two data sharing models, One-to-Many and Many-to-Many. For the One-to-Many model, we propose an OTM algorithm, based on the greedy algorithm, to share each user’s data. For the Many-to-Many model, we translate the problem into the stable marriage problem (SMP) and propose an MTM algorithm based on the SMP algorithm to solve it. Experimental results show that our methods are superior to the other approaches.

Changjia Yang, Peng Li, Tao Zhang, Yu Jin, Heng He, Lei Nie, Qin Liu
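
The abstract above reduces Many-to-Many sharing to the stable marriage problem. As a reference point, here is a minimal sketch of the classic Gale-Shapley algorithm for the one-to-one case. It is not the MTM algorithm itself; the sharer and requester names and their preference lists are hypothetical, and the sketch assumes equally sized sides with complete preference lists.

```python
def stable_matching(sharer_prefs, requester_prefs):
    """Gale-Shapley: sharers propose, requesters keep their best offer."""
    # rank[r][s] is the position of sharer s in requester r's list.
    rank = {r: {s: i for i, s in enumerate(prefs)}
            for r, prefs in requester_prefs.items()}
    next_choice = {s: 0 for s in sharer_prefs}  # next requester to propose to
    match = {}                                  # requester -> sharer
    free = list(sharer_prefs)
    while free:
        s = free.pop()
        r = sharer_prefs[s][next_choice[s]]
        next_choice[s] += 1
        if r not in match:
            match[r] = s
        elif rank[r][s] < rank[r][match[r]]:
            free.append(match[r])  # r trades up; the old partner is free again
            match[r] = s
        else:
            free.append(s)         # r rejects s; s will try the next choice
    return {s: r for r, s in match.items()}


sharer_prefs = {"u1": ["a", "b", "c"], "u2": ["a", "c", "b"], "u3": ["b", "a", "c"]}
requester_prefs = {"a": ["u2", "u1", "u3"], "b": ["u1", "u3", "u2"], "c": ["u1", "u2", "u3"]}
print(stable_matching(sharer_prefs, requester_prefs))
```

The proposing side always receives its best stable partner, so which side proposes is a design choice that matters when sharers and requesters have conflicting interests.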

Exploiting Sociality for Collaborative Message Dissemination in VANETs

The message dissemination problem has attracted great attention in vehicular ad hoc networks (VANETs). One important task is to find a set of relay nodes that maximizes the number of successfully delivered messages. In this paper, we investigate the message dissemination problem and propose a new method that selects optimal nodes as collaborative nodes to distribute messages. Firstly, we analyze real vehicle traces and identify their sociality by extracting contacts and applying a community detection approach. Secondly, we propose the community collaboration degree to measure the likelihood of collaborative message delivery across the whole community, and we use Markov chains to infer the future community collaboration degree. Thirdly, we design a community collaboration (CC) algorithm for selecting the optimal collaborative nodes. We compare our algorithm with other methods, and the simulation results show that it performs better.

Weiyi Huang, Peng Li, Tao Zhang, Yu Jin, Heng He, Lei Nie, Qin Liu
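
The abstract above uses Markov chains to infer a future community collaboration degree. A minimal sketch of that kind of inference follows, assuming the degree has been discretized into a few states. The state names and the observed history are hypothetical, and the paper may define its chain differently.

```python
from collections import Counter, defaultdict


def fit_transitions(states):
    """Estimate first-order transition probabilities from a state sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for current, row in counts.items()
    }


def predict_next(transitions, current):
    """Return the most likely next state, or None for an unseen state."""
    row = transitions.get(current)
    if not row:
        return None
    return max(row, key=row.get)


# Hypothetical history of a community's discretized collaboration degree.
history = ["low", "low", "high", "high", "high", "low", "high"]
transitions = fit_transitions(history)
print(transitions)
print(predict_next(transitions, "low"))
```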

An Efficient Traffic Prediction Model Using Deep Spatial-Temporal Network

In recent years, traffic prediction has become an important and challenging problem in smart urban traffic computing; governments can use it for road planning, detecting bottleneck roads prone to congestion, estimating pollution emissions, and so on. However, earlier data mining algorithms mainly address the problem with traditional mathematical or statistical theories and cannot model the spatial and temporal relationships simultaneously. To address these issues, we propose an end-to-end neural network named C-LSTM to predict traffic congestion at the next time interval. More specifically, C-LSTM combines a CNN and an LSTM to jointly capture the spatial-temporal dependencies on the road network. Inspired by the way a CNN handles images, the city-wide traffic maps are first converted into a series of static images, like video frames, and then fed into a deep learning architecture in which the CNN extracts the spatial characteristics and the LSTM extracts the temporal characteristics. In addition, we consider some external factors to further improve the prediction accuracy. Extensive experiments on real Beijing transportation datasets demonstrate the superiority of our method.

Jie Xu, Yong Zhang, Yongzheng Jia, Chunxiao Xing

Assessing Data Anomaly Detection Algorithms in Power Internet of Things

At present, data related to the Internet of Things is growing explosively, and its importance has increased greatly. Data collection and analysis are becoming more and more valuable. However, a large amount of abnormal data brings great trouble to research and can even lead people to wrong conclusions. Therefore, anomaly detection is particularly necessary and important. The purpose of this paper is to find an efficient and accurate outlier detection algorithm. Our work analyzes the advantages and disadvantages of the candidate algorithms theoretically. At the same time, the effects of data set size, number of proximity points, and data dimension on CPU time and precision are discussed, and the performance, advantages, and disadvantages of each algorithm in dealing with high-dimensional data are compared and analyzed. Finally, the algorithms are applied to actual anomaly data collected from the Internet of Things, and the results are analyzed. The results show that the LOF algorithm can find the abnormal data in the data set in a shorter time and with higher accuracy.

Zixiang Wang, Zhoubin Liu, Xiaolu Yuan, Yueshen Xu, Rui Li
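
Since the abstract above singles out LOF (local outlier factor), a compact pure-Python sketch of the score may help. It is a simplification, not the evaluated implementation: it takes exactly `k` neighbours per point (ties at the k-th distance are not expanded), assumes all points are distinct, and runs in quadratic time. The sample points are made up.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th neighbour
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for point, score in zip(points, lof_scores(points, k=2)):
    print(point, round(score, 2))
```

The number of proximity points discussed in the abstract corresponds to `k` here; larger values smooth the density estimate at the cost of more distance comparisons.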

PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation

Metadata extraction from scholarly PDF documents is fundamental to publishing, archiving, digital library construction, bibliometrics, and scientific competitiveness analysis and evaluation. However, different scholarly PDF documents have different layouts and document elements, which makes it impossible to compare different extraction approaches, since testers use different sources of test documents even when the documents come from the same journal or conference. Therefore, performance evaluation of the various extraction approaches on standard datasets can set up a fair and reproducible comparison. In this paper we present a dataset, namely PARDA (Pdf Analysis and Recognition DAtaset), for performance evaluation and analysis of scholarly documents, especially for metadata extraction of fields such as title, authors, affiliation, author-affiliation-email matching, year, date, etc. The dataset covers computer science, physics, life science, management, mathematics, and humanities from various publishers including ACM, IEEE, Springer, Elsevier, arXiv, etc. Each document has a distinct layout and appearance in terms of metadata formatting. We also construct the ground truth metadata as Dublin Core XML and BibTeX files associated with this dataset.

Tiantian Fan, Junming Liu, Yeliang Qiu, Congfeng Jiang, Jilin Zhang, Wei Zhang, Jian Wan

New Cross-Domain QoE Guarantee Method Based on Isomorphism Flow

This paper investigates the issue of Quality of Experience (QoE) for multimedia services over heterogeneous networks. A new concept of “Isomorphism Flow” (iFlow), inspired by abstract algebra and based on experimental research, is introduced for analyzing multimedia traffic. By using iFlow, multimedia traffic with similar QoE requirements for different users is aggregated. A QoE evaluation method is also proposed for the aggregated traffic. A new cross-domain QoE guarantee method based on the iFlow QoE is then proposed to adjust network resources from the perspective of user perception. The proposed scheme is validated through simulations, and the results show that it enhances QoE performance and outperforms the existing schemes.

Zaijian Wang, Chen Chen, Xinheng Wang, Lingyun Yang, Pingping Tang

Extracting Business Execution Processes of API Services for Mashup Creation

Mashup service creation has become a new research issue for service-oriented complex application systems. During mashup service creation, extracting the business execution processes among APIs plays an important role when a mashup service developer receives a set of recommended API services. However, there is no effective way to perform mashup recommendation with the support of extracted API business execution processes. In this paper, we propose a novel approach for the automated extraction of API business execution processes for mashup creation. Based on the proposed word-domain matrix model, API annotation in a mashup service is transformed into a bipartite graph problem that is solved by the maximum bipartite matching algorithm to semantically annotate the APIs involved. Then, a directed dependency network among APIs is constructed by analyzing path dependencies and evaluating the compound polarity. Finally, the API business execution processes in a mashup service can be extracted. The advantage of this work is that it generates business execution processes instead of a list of independent APIs, which can significantly facilitate mashup service creation for software developers. To validate the performance, we conduct extensive experiments on a large-scale real-world dataset crawled from ProgrammableWeb. The experimental results demonstrate the feasibility and effectiveness of our proposed approach.

Guobing Zou, Yang Xiang, Pengwei Wang, Shengye Pang, Honghao Gao, Sen Niu, Yanglan Gan
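
The annotation step in the abstract above relies on maximum bipartite matching. The sketch below shows one standard way to compute such a matching, the augmenting-path (Kuhn) algorithm, on an unweighted graph. The API and domain names are hypothetical, and how the word-domain matrix produces the candidate edges is not covered here.

```python
def max_bipartite_matching(candidates):
    """candidates maps each left node to the right nodes it may be matched with."""
    match_of_right = {}  # right node -> left node currently assigned to it

    def try_assign(left, visited):
        for right in candidates[left]:
            if right in visited:
                continue
            visited.add(right)
            # Take a free right node, or re-route its current owner elsewhere.
            if right not in match_of_right or try_assign(match_of_right[right], visited):
                match_of_right[right] = left
                return True
        return False

    for left in candidates:
        try_assign(left, set())
    return {left: right for right, left in match_of_right.items()}


# Hypothetical APIs and the annotation domains each could plausibly receive.
candidates = {
    "photo_api": ["location", "media"],
    "maps_api": ["location"],
    "video_api": ["media", "streaming"],
}
print(max_bipartite_matching(candidates))
```

In this example `photo_api` first takes `location` and is later re-routed to `media` so that `maps_api`, which has no alternative, can also be annotated.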

An Efficient Quantum Circuits Optimizing Scheme Compared with QISKit (Short Paper)

Recently, the development of quantum chips has made great progress: the number of qubits is increasing and the fidelity is getting higher. However, the qubits of these chips are not always fully connected, which sets additional barriers for implementing quantum algorithms and writing quantum programs. In this paper, we introduce a general circuit optimizing scheme that can efficiently adjust and optimize quantum circuits for an arbitrary given qubit layout by adding quantum gates, exchanging qubits, and merging single-qubit gates. Compared with the optimizing algorithm of IBM’s QISKit, our scheme consumes 74.7% as many quantum gates and takes only 12.9% of the execution time on average.

Xin Zhang, Hong Xiang, Tao Xiang

Feature-based Online Segmentation Algorithm for Streaming Time Series (Short Paper)

Over the last decade, a huge volume of streaming time series data has been continuously produced in diverse fields, including finance, signal processing, industry, astronomy and so on. Since time series data is high-dimensional, real-valued, and continuous, among other properties, it is of great importance to perform dimensionality reduction as a preliminary step. In this paper, we propose a novel online segmentation algorithm based on the importance of TPs that represents a time series as a set of continuous subsequences while maintaining the corresponding local temporal features of the raw data. To demonstrate the advantage of our proposed algorithm, we provide extensive experimental results on different kinds of time series datasets, validating our algorithm and comparing it with other baseline methods of online segmentation.

Peng Zhan, Yupeng Hu, Wei Luo, Yang Xu, Qi Zhang, Xueqing Li
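
The abstract above segments a stream at important TPs without expanding the abbreviation. Assuming that TPs are turning points, that is, local maxima and minima, the detection part can be sketched as a generator that handles one sample at a time. The importance ranking that the paper builds on top of this is not reproduced, and the sample stream is made up.

```python
def turning_points(stream):
    """Yield (index, value) for each strict local extremum, one sample at a time."""
    prev2 = prev1 = None
    for i, x in enumerate(stream):
        if prev2 is not None:
            is_peak = prev2 < prev1 > x
            is_valley = prev2 > prev1 < x
            if is_peak or is_valley:
                yield (i - 1, prev1)  # the previous sample was the extremum
        prev2, prev1 = prev1, x


stream = [1, 3, 2, 2, 5, 4]
print(list(turning_points(stream)))
```

Only the last two samples are kept in memory, which is what makes the detector usable online; cutting the stream at the yielded indices gives candidate segment boundaries.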

MalShoot: Shooting Malicious Domains Through Graph Embedding on Passive DNS Data

Malicious domains are key components of a variety of illicit online activities. We propose MalShoot, a graph embedding technique for detecting malicious domains using a passive DNS database. We base its design on the intuition that a group of domains that share similar resolution information would have the same property, namely malicious or benign. MalShoot represents every domain as a low-dimensional vector according to its DNS resolution information. It automatically maps domains that share similar resolution information to similar vectors and unrelated domains to distant vectors. Based on the vectorized representation of each domain, a machine-learning classifier is trained over a labeled dataset and then applied to detect other malicious domains. We evaluate MalShoot using real-world DNS traffic collected from three ISP networks in China over two months. The experimental results show that our approach can effectively detect malicious domains with a 96.08% true positive rate and a 0.1% false positive rate. Moreover, MalShoot scales well even on large datasets.

Chengwei Peng, Xiaochun Yun, Yongzheng Zhang, Shuhao Li

Learning from High-Degree Entities for Knowledge Graph Modeling

Knowledge base (KB) completion aims to infer missing facts based on existing ones in a KB. Many approaches first suppose that the constituents of a fact (e.g., head entity, tail entity, and relation) satisfy some formula and then minimize the loss of that formula to obtain the feature vectors of entities and relations. Due to the sparsity of KBs, some methods also take into consideration the indirect relations between entities. However, indirect relations further widen the gap in the number of training updates between high-degree entities (entities linked by many relations) and low-degree entities. This results in underfitting of low-degree entities. In this paper, we propose the path-based TransE with aggregation (PTransE-ag) to fine-tune the feature vector of an entity by comparing it with related entities that are linked by the same relations. In this way, low-degree entities can draw useful information from high-degree entities to directly adjust their representations. Conversely, the overfitting of high-degree entities can be relieved. Extensive experiments carried out on a real-world dataset show that our method represents entities more accurately and infers more effectively than previous methods.

Tienan Zhang, Fangfang Liu, Yan Shen, Honghao Gao, Jing Duan
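
For readers unfamiliar with the TransE family that PTransE-ag builds on, the sketch below shows the basic TransE idea: a fact (h, r, t) is plausible when h + r lies close to t, and training pushes true triples below corrupted ones by a margin. The vectors and the margin value are made up, and neither the path-based nor the aggregation part of the paper is reproduced.

```python
import math


def transe_score(head, relation, tail):
    """Distance ||h + r - t||; lower means the triple is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive_score, negative_score, margin=1.0):
    """Hinge loss that is zero once the true triple beats the corrupted one by the margin."""
    return max(0.0, margin + positive_score - negative_score)


head, relation = [1.0, 0.0], [0.0, 1.0]
true_tail, corrupted_tail = [1.0, 1.0], [3.0, 1.0]
pos = transe_score(head, relation, true_tail)
neg = transe_score(head, relation, corrupted_tail)
print(pos, neg, margin_loss(pos, neg))
```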

Target Gene Mining Algorithm Based on gSpan

In recent years, the focus of bioinformatics research has turned to biological data processing and information extraction. In this paper, a new mining algorithm is designed to mine target gene fragments efficiently from a huge amount of gene data and to study specific gene expression. The extracted gene data is first filtered to remove redundancy. Then a binary tree is constructed according to the Pearson correlation coefficient between gene data and processed by the gSpan frequent subgraph mining algorithm. Finally, the results are visualized as grayscale images, which helps us to find the target gene. Compared with existing target gene mining algorithms, such as the integrated decision feature gene selection algorithm, our approach offers higher accuracy and can process high-dimensional data. The proposed algorithm has a sufficient theoretical basis; it not only makes the results more efficient but also reduces the likelihood of erroneous results. Moreover, the dimension of the data it handles is much higher than that of the data sets used by existing algorithms, so the algorithm is more practical.

Liangfu Lu, Xiaoxu Ren, Lianyong Qi, Chenming Cui, Yichen Jiao
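
The tree construction in the abstract above is driven by the Pearson correlation coefficient between gene data. A minimal sketch of that coefficient follows; the expression profiles are made-up numbers, and the sketch assumes neither profile is constant, since the denominator would then be zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression profiles of three genes across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```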

Booter Blacklist Generation Based on Content Characteristics

Distributed Denial of Service (DDoS) attacks-as-a-service, known as Booter or Stresser, makes it convenient and cheap for ordinary people to launch DDoS attacks, which makes such attacks even more rampant. However, there has so far been little research on Booters, and little is known about their backend infrastructure, customers, business, etc. In this paper, we present a new method that focuses on the content (text) characteristics of Booter websites and selects more discriminative features between Booter and non-Booter websites to identify Booters on the Internet more effectively. The experimental results show that the classification accuracy in distinguishing Booter from non-Booter websites is 98.74%. In addition, our method is compared with several representative methods, and the results show that it outperforms the classical methods in 66% of the classification cases on three datasets: Booter websites, 20-Newsgroups and WebKB.

Wang Zhang, Xu Bai, Chanjuan Chen, Zhaolin Chen

A 2D Transform Based Distance Function for Time Series Classification

With the arrival of the Industry 4.0 era, time series classification (TSC) has attracted a lot of attention in the last decade. The high dimensionality, high feature correlation, and typically high levels of noise found in time series pose great challenges to TSC. Among TSC algorithms, the 1NN classifier has been shown to be effective and difficult to beat. The core of the 1NN classifier is the distance function, and the large majority of TSC research has concentrated on alternative distance functions. In this paper, a two-dimensional (2D) transform based distance (2DTbD) function is proposed. 2DTbD consists of three steps. Firstly, we convert time series to 2D space by rotating them around the coordinate origin. Then we calculate the distances in each dimension. Finally, we combine the distances in 2D space to obtain the final time series distance. Our distance function raises accuracy through the fusion of 2D information. Experimental results demonstrate that 2DTbD can improve classification accuracy.

Cun Ji, Xiunan Zou, Yupeng Hu, Shijun Liu
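
The abstract above notes that the distance function is the core of the 1NN classifier. The sketch below shows a 1NN classifier with a pluggable distance, which is where a function such as 2DTbD would be passed in. The abstract does not specify 2DTbD in enough detail to reproduce it, so plain Euclidean distance stands in as a placeholder, and the training series are made up.

```python
import math


def euclidean(a, b):
    """Placeholder distance; any function of two series can be used instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_1nn(train, query, distance=euclidean):
    """Return the label of the training series closest to the query."""
    _, label = min(train, key=lambda item: distance(item[0], query))
    return label


train = [([0.0, 0.0, 1.0, 0.0], "spike"), ([1.0, 1.0, 1.0, 1.0], "flat")]
print(classify_1nn(train, [0.0, 0.0, 0.9, 0.1]))
```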

Research on Access Control of Smart Home in NDN (Short Paper)

Named Data Networking (NDN) is one of the future Internet architectures and can support smart homes very well. A smart home holds a large amount of private data with a low security level, and access control is an effective security solution. However, the existing NDN access control mechanisms that can be applied to smart homes neither make reasonable use of the cache in NDN nor take into account the cancellation of users’ authorization. Therefore, we design an access control mechanism for smart homes in NDN. We mainly consider three processes: a user requesting permission, a user requesting data, and the cancellation of a user’s permission. By using the Cipher Block Chaining (CBC) symmetric encryption algorithm, identity-based encryption, and proxy re-encryption, the cache in NDN is effectively utilized, and a counting Bloom filter is used to filter ineffective Interest packets and complete the cancellation of the user’s privileges. Experimental results show that the designed access control mechanism effectively reduces the total time from the user requesting permission to decrypting the data, and that using the counting Bloom filter reduces the time overhead of the NDN routers during the cancellation of user privileges.

Rina Wu, Bo Cui, Ru Li
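
The mechanism in the abstract above depends on a counting Bloom filter, which differs from a plain Bloom filter in that entries can be removed again. A minimal sketch follows; the sizes, the SHA-256 based hashing, and the user-name keys are illustrative assumptions, and the abstract does not say which keys the routers actually store.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with counters instead of bits, so that items can be removed."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        # Removing an item that was never added could corrupt the filter,
        # so the membership check guards against the obvious cases.
        if item in self:
            for i in self._indexes(item):
                self.counters[i] -= 1

    def __contains__(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))


authorized = CountingBloomFilter()
authorized.add("alice")
authorized.add("bob")
authorized.remove("alice")  # e.g. a permission is cancelled
print("alice" in authorized, "bob" in authorized)
```

Membership tests can return false positives but never false negatives, so a filter like this can cheaply discard requests that are certainly invalid before any cryptographic work is done.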

The Realization of Face Recognition Algorithm Based on Compressed Sensing (Short Paper)

Since the sparse representation-based classifier (SRC) was proposed, it has achieved better performance than typical classification algorithms. Normally, the SRC algorithm adopts the $$l_1$$-norm minimization method to solve for the sparse vector, and its computational complexity increases correspondingly. In this paper, we put forward a compressed sensing reconstruction algorithm based on residuals. This algorithm utilizes the local sparsity within images as well as the non-local similarity among image blocks to boost the performance of the reconstruction algorithm while maintaining moderate computational complexity. It achieves a superior recognition rate in experiments on the Yale facial database.

Huimin Zhang, Yan Sun, Haiwei Sun, Xin Yuan

An On-line Monitoring Method for Monitoring Earth Grounding Resistance Based on a Hybrid Genetic Algorithm (Short Paper)

In this paper, a method for measuring the grounding resistance of a tower without disconnecting all the down conductors is proposed for the first time, addressing the shortcomings of manual measurement. This paper introduces the measurement model for single or multiple down conductors and uses a hybrid genetic algorithm to comprehensively calculate the grounding resistance of all towers in the closed loop, which greatly improves the measurement accuracy. Simulation analysis and actual measurement show that the method is simple and convenient, does not require disconnecting the grounding wire, and has high measurement accuracy. Compared with the clamp ammeter method, the accuracy is improved by 30%. The method is applied to an on-line monitoring system for tower grounding resistance, which can reduce the labor intensity of line maintenance personnel, greatly improve work efficiency, and provide a basis for discovering faults in time.

Guangxin Zhang, Minzhen Wang, Xinheng Wang, Liying Zhao, Jinyang Zhao

How Good is Query Optimizer in Spark?

In the big data community, Spark plays an important role and is used to process interactive queries. Spark employs a query optimizer, called Catalyst, to translate SQL queries into optimized query execution plans. Catalyst contains a number of optimization rules and supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigate the effectiveness of rule-based and cost-based optimization in Catalyst through a set of comparative experiments that vary the data volume and the number of nodes. We find that even with query optimizations applied, the execution time of most TPC-H queries is only slightly reduced. We also make some interesting observations on Catalyst, which can help the community better understand and improve the query optimizer in Spark.

Zujie Ren, Na Yun, Youhuizi Li, Jian Wan, Yuan Wang, Lihua Yu, Xinxin Fan

An Optimized Multi-Paxos Protocol with Centralized Failover Mechanism for Cloud Storage Applications

For a typical Multi-Paxos protocol running in a cloud storage application, the failover mechanism is complex to implement. When the leader of a replica group fails, a new leader must be elected by broadcasting prepare requests over the replica group. Moreover, repairing the new leader’s missing log entries also requires broadcasting prepare requests. This introduces too much network cost and increases the latency of restoring normal storage service. In view of this challenge, an optimization of the Multi-Paxos protocol with a centralized failover mechanism for cloud storage applications is proposed in this paper. In contrast to the typical Multi-Paxos protocol, the failover mechanism and the normal client request handling logic are split and handled by two separate clusters: a coordinator cluster is dedicated to handling failover issues as a central manager, while a data cluster is only in charge of data replication and storage for client commands. With the centralized failover mechanism in the new design, the coordinator cluster maintains real-time status information for each replica group. The replica with the largest apply index value is elected as the new leader by the coordinator cluster, and missing log entries can be repaired using the limited replica bitmap information that the coordinator cluster maintains. The two protocols are implemented, compared, and analyzed to demonstrate the feasibility of our proposal.

Wenmin Lin, Hao Jiang, Nailiang Zhao, Jilin Zhang
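
The two coordinator-side decisions named in the abstract above, electing the replica with the largest apply index and finding missing log entries from a bitmap, can be sketched as follows. The status record layout, the tie-break on replica id, and the bitmap encoding are assumptions made for illustration and are not taken from the paper.

```python
def elect_leader(replica_status):
    """Pick the live replica with the largest apply index.

    replica_status maps a replica id to {"alive": bool, "apply_index": int}.
    Ties are broken by replica id, which is an arbitrary but deterministic choice.
    """
    alive = {rid: s for rid, s in replica_status.items() if s["alive"]}
    if not alive:
        return None
    return max(alive, key=lambda rid: (alive[rid]["apply_index"], rid))


def missing_entries(bitmap):
    """Log positions whose bit is 0, i.e. entries the replica has not stored."""
    return [index for index, present in enumerate(bitmap) if not present]


status = {
    "r1": {"alive": False, "apply_index": 120},  # the failed leader
    "r2": {"alive": True, "apply_index": 117},
    "r3": {"alive": True, "apply_index": 119},
}
print(elect_leader(status), missing_entries([1, 1, 0, 1, 0]))
```

Because the coordinator already holds this status, both decisions are local lookups rather than broadcasts over the replica group, which is the saving the abstract describes.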

A Resource Usage Prediction-Based Energy-Aware Scheduling Algorithm for Instance-Intensive Cloud Workflows

Instance-intensive workflow applications are widely used in e-commerce, advanced manufacturing, etc. However, existing studies normally do not consider reducing energy consumption by exploiting the characteristics of instance-intensive workflow applications. This paper presents a resource usage Prediction-based Energy-Aware scheduling algorithm, named PEA. Technically, the method improves the energy efficiency of instance-intensive cloud workflows by predicting resource utilization and applying batch processing and load balancing strategies. The efficiency and effectiveness of the proposed algorithm are validated by extensive experiments.

Zhibin Wang, Yiping Wen, Yu Zhang, Jinjun Chen, Buqing Cao

Web Service Discovery Based on Information Gain Theory and BiLSTM with Attention Mechanism

Web service discovery is an important problem in service-oriented computing, given the increasing number of Web services. Clustering or classifying Web services according to their functionalities has proved to be an effective approach to Web service discovery. Recently, semantic-based Web service clustering has exploited topic models to extract latent topic features from Web service description documents to improve the accuracy of service clustering and discovery. However, most such methods do not consider deep, fine-grained information in the description document, such as the weight (importance) of each word or the word order, even though this information can be used to augment service clustering and discovery. To address this problem, we propose a Web service discovery approach based on information gain theory and BiLSTM with an attention mechanism. The method first obtains the effective words through information gain theory and then feeds them to an attention-based BiLSTM neural network for Web service clustering. Comparative experiments on the ProgrammableWeb dataset show that our proposed method achieves a significant improvement over baseline methods.

Xiangping Zhang, Jianxun Liu, Buqing Cao, Qiaoxiang Xiao, Yiping Wen
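
The first step in the abstract above selects effective words by information gain. A minimal sketch of that score for a single word follows: the entropy of the category labels minus the entropy that remains after splitting the documents by whether they contain the word. The tiny service descriptions and category names are made up, and how the paper thresholds or ranks the scores is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether a document contains word."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    remaining = sum(len(part) / n * entropy(part)
                    for part in (with_word, without_word) if part)
    return entropy(labels) - remaining


# Hypothetical service descriptions, each reduced to a set of tokens.
docs = [{"map", "route"}, {"map", "geocode"}, {"payment", "invoice"}, {"payment", "map"}]
labels = ["Mapping", "Mapping", "Finance", "Finance"]
print(information_gain(docs, labels, "payment"), information_gain(docs, labels, "map"))
```

Here "payment" separates the two categories perfectly and receives the maximum gain of one bit, while "map" appears in both categories and scores much lower.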

Neighborhood-Based Uncertain QoS Prediction of Web Services via Matrix Factorization

With the rapidly growing number of services on the Internet, QoS-based web service recommendation has become an urgent demand in service-oriented applications. Since a large number of QoS values are missing from users’ historical invocation records, accurately predicting these missing values has become a hot research issue. However, most existing research on service QoS prediction assumes that the transactional process of a service is stable and that its QoS does not change over time. In fact, the service invocation process is usually affected by many factors (e.g., geographical location, network environment), leading to service invocations with uncertain QoS. Therefore, QoS prediction based on traditional methods cannot adapt well to real-world application scenarios. To solve the issue, we combine collaborative filtering with matrix factorization theory and propose a novel approach for predicting uncertain service QoS in the dynamic Internet environment. Extensive experiments have been conducted on a real-world data set, and the results demonstrate the effectiveness and applicability of our approach for QoS prediction.

Guobing Zou, Shengye Pang, Pengwei Wang, Huaikou Miao, Sen Niu, Yanglan Gan, Bofeng Zhang

Runtime Resource Management for Microservices-Based Applications: A Congestion Game Approach (Short Paper)

The term “Microservice Architecture” has sprung up in recent years as a new style of software design that is gaining popularity as cloud computing prospers. In microservice-based applications, different microservices collaborate with one another via interface calls, but they may also compete for resources when growth in user demand renders the resources insufficient. This poses new challenges for allocating resources efficiently at runtime. To tackle the problem, we propose a novel approach based on congestion games. Firstly, we use a weighted directed acyclic graph to model the inter-relationships of the microservices that compose an application. Then we use an M/G/1 queue from queueing theory to describe the arrival process of access requests, and combine it with the graph to calculate the arrival rate of access requests to each microservice, which in turn is used to estimate response time in a newly designed microservice revenue function. Finally, we define the resource competition problem as a congestion game in which each microservice is a player aiming to maximize its revenue, and propose an algorithm that finds a Nash equilibrium in polynomial time. Experimental results show that our approach can effectively improve the overall performance of the system with limited resources and outperforms Binpack and Spread, two scheduling strategies used in Docker Swarm.

Ruici Luo, Wei Ye, Jinan Sun, Xueyang Liu, Shikun Zhang
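
To make the congestion game framing above concrete, the sketch below runs best-response dynamics on a simple singleton congestion game: each player occupies one resource, the delay of a resource grows with its load, and players keep switching while a strictly better option exists. This is a cost-minimizing textbook variant, not the revenue-maximizing algorithm of the paper, and the delay functions are made up.

```python
def best_response_dynamics(num_players, delays, max_rounds=1000):
    """Return a pure Nash equilibrium as a list: player index -> resource index.

    delays[r] is a function mapping the load on resource r to its delay.
    Each strict improvement lowers Rosenthal's potential, so the loop ends.
    """
    choice = [0] * num_players  # everyone starts on resource 0
    for _ in range(max_rounds):
        changed = False
        for player in range(num_players):
            load = [choice.count(r) for r in range(len(delays))]

            def cost_if_on(resource):
                # Moving adds this player to the target resource's load.
                extra = 0 if resource == choice[player] else 1
                return delays[resource](load[resource] + extra)

            best = min(range(len(delays)), key=cost_if_on)
            if cost_if_on(best) < cost_if_on(choice[player]):
                choice[player] = best
                changed = True
        if not changed:
            break  # nobody can improve unilaterally: equilibrium reached
    return choice


# Two hypothetical hosts: the second is twice as slow per unit of load.
delays = [lambda load: load, lambda load: 2 * load]
print(best_response_dynamics(4, delays))
```

At the equilibrium three players share the fast host and one uses the slow host; moving in either direction would raise the mover's own delay.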

CPN Model Based Standard Feature Verification Method for REST Service Architecture

The representational state transfer (REST) service architecture is widely used in large-scale and scalable distributed web systems. If a REST service architecture does not comply with its standard feature constraints, the result can be degraded performance or low scalability of the REST-based web system. Therefore, in order to enhance the quality of system design, it is necessary to verify whether the system design meets the standard feature constraints of the REST service architecture. In this paper, we propose a method for verifying the standard feature constraints of the REST service architecture based on the Colored Petri Nets (CPN) model. Firstly, five standard feature constraints of the REST service architecture are modeled using CPN. Then a verification method is proposed based on synchronized matching of the execution paths in the model state space. Lastly, we validate the usability and validity of the proposed verification method using a practical course management web system based on the REST service architecture. Experimental results show that our method can effectively confirm whether a web application system design based on the REST service architecture conforms to its standard feature constraints. In addition, it provides intuitive and feasible execution data when the standard feature constraints are not met, which facilitates locating and correcting defects in the subsequent design of application systems.

Jing Liu, Zhen-Tian Liu, Yu-Qiang Zhao

Crawled Data Analysis on Baidu API Website for Improving SaaS Platform (Short Paper)

SaaS (Software-as-a-Service) is a cloud computing model that is sometimes referred to as “on-demand software”. We investigate existing SaaS platforms before building a new distributed SaaS platform, and use service data mining and evaluation on those platforms to improve our own. For SaaS platforms that provide various APIs, we analyze their website data in this paper using our data mining method and related software. We wrote a crawler program to obtain data from these websites, which include Baidu API and ProgrammableWeb API. After ETL (Extract-Transform-Load), the obtained and processed data is ready to be analyzed. Statistical methods including non-linear regression and outlier detection are used to evaluate the websites’ performance and to give suggestions for improving the design and development of our API website. All figures and tables in this paper are generated with IBM SPSS statistical software. The work helps us improve our own API website by comprehensively analyzing other successful API websites.

Lei Yu, Shanshan Liang, Shiping Chen, Yaoyao Wen

A Hardware/Software Co-design Approach for Real-Time Binocular Stereo Vision Based on ZYNQ (Short Paper)

Based on the ZYNQ platform, this paper proposes a hardware/software co-design approach and implements a binocular stereo vision system with high real-time performance and good human-computer interaction, which can be used to support advanced driver assistance systems and improve driving safety. Taking into account the application characteristics of binocular stereo vision, the approach first modularizes the system’s functions to perform hardware/software partitioning, accelerating data processing on the FPGA and performing data control on the ARM cores. It then uses the ARM instruction set to configure the registers within the FPGA and designs the relevant interfaces for data exchange between hardware and software. Finally, it combines the implementation of the specific algorithms with the logical control to complete the binocular stereo vision system. The test results show that the frame rate at an image resolution of 640 × 480 can reach 121.43 frames per second when the FPGA frequency is 100 MHz, and the frame rate is also high for higher-resolution images. At the same time, the system achieves real-time display and human-computer interaction under the control of the graphical user interface.

Yukun Pan, Minghua Zhu, Jufeng Luo, Yunzhou Qiu

The Cuckoo Search and Integer Linear Programming Based Approach to Time-Aware Test Case Prioritization Considering Execution Environment

Regression testing plays an important role in the software development process. The more mature a software system becomes, the greater the proportion of its life cycle that regression testing takes up. For this reason, test case prioritization techniques have been proposed to detect more faults as early as possible and improve the effectiveness of regression testing. However, regression testing is often performed in a time-constrained execution environment. This paper introduces a new method of time-aware test case prioritization. First, it uses the cuckoo search algorithm to reorder the test suite. Then, an integer linear programming model is employed to select tests in light of the time budget. Finally, a novel fitness function is designed that focuses on code coverage from the perspective of method-call information. Experimental results show that our method improves the effectiveness of fault detection compared with traditional techniques, especially when time is constrained.

Yu Wong, Hongwei Zeng, Huaikou Miao, Honghao Gao, Xiaoxian Yang

Using Hybrid Model for Android Malicious Application Detection Based on Population (Short Paper)

In Android system security, the maliciousness of an application is closely related to the permissions it requests. In this paper, a population-based model is proposed for detecting malicious Android applications. It addresses current shortcomings: missed detections, long detection periods caused by feature redundancy, and unstable detection rates caused by unbalanced data of benign and malicious samples. Drawing on the idea of population in biology, each app is labeled during preprocessing, and adaptive feature vectors are automatically selected through feature engineering. Malicious application detection is then carried out by hybrid model voting. The experimental results show that feature engineering can remove a large amount of redundancy before classification and that the hybrid voting model can provide an adaptive detection service for different populations.

Zhijie Xiao, Tao Li, Yuqiao Wang
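
The final detection step in the abstract above is described as hybrid model voting. A minimal sketch of majority voting over several classifiers follows. The three rule-based "models" and the permission names they inspect are hypothetical stand-ins; the paper's actual models and feature vectors are not described in the abstract.

```python
from collections import Counter


def hybrid_predict(models, features):
    """Return the label that most models vote for (first seen wins a tie)."""
    votes = [model(features) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical models: each maps a set of requested permissions to a label.
def sms_model(permissions):
    return "malicious" if "SEND_SMS" in permissions else "benign"


def contacts_model(permissions):
    return "malicious" if {"READ_CONTACTS", "INTERNET"} <= permissions else "benign"


def size_model(permissions):
    return "malicious" if len(permissions) > 5 else "benign"


models = [sms_model, contacts_model, size_model]
app = {"SEND_SMS", "READ_CONTACTS", "INTERNET"}
print(hybrid_predict(models, app))
```

Using an odd number of models avoids ties in the two-class case; a different set of models per population would be one way to read the adaptive aspect mentioned in the abstract.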