Skip to main content

2018 | Buch

Algorithms and Architectures for Parallel Processing

18th International Conference, ICA3PP 2018, Guangzhou, China, November 15-17, 2018, Proceedings, Part I

insite
SUCHEN

Über dieses Buch

The four-volume set LNCS 11334-11337 constitutes the proceedings of the 18th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2018, held in Guangzhou, China, in November 2018.

The 141 full and 50 short papers presented were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on Distributed and Parallel Computing; High Performance Computing; Big Data and Information Processing; Internet of Things and Cloud Computing; and Security and Privacy in Computing.

Inhaltsverzeichnis

Frontmatter
Correction to: An Efficient Retrieval Method for Astronomical Catalog Time Series Data

The original version of the chapter starting on p. 284 was revised. The grant numbers of the Joint Research Fund in Astronomy were incorrect in the acknowledgement on p. 297. The original chapter was corrected.

Bingyao Li, Ce Yu, Xiaoteng Hu, Jian Xiao, Shanjiang Tang, Lianmeng Li, Bin Ma

Distributed and Parallel Computing

Frontmatter
Network-Aware Grouping in Distributed Stream Processing Systems

Distributed Stream Processing (DSP) systems have recently attracted much attention because of their ability to process huge volumes of real-time stream data with very low latency on clusters of commodity hardware. Existing workload grouping strategies in a DSP system can be classified into four categories (i.e. raw and blind, data skewness, cluster heterogeneity, and dynamic load-aware). However, these traditional stream grouping strategies do not consider network distance between two communicating operators. In fact, the traffic from different network channels makes a significant impact on performance. How to grouping tuples according to network distances to improve performance has been a critical problem.In this paper, we propose a network-aware grouping framework called Squirrel to improve the performance under different network distances. Identifying the network location of two communicating operators, Squirrel sets a weight and priority for each network channel. It introduces Weight Grouping to assign different numbers of tuples to each network channel according to channel’s weight and priority. In order to adapt to changes in network conditions, input load, resources and other factors, Squirrel uses Dynamic Weight Control to adjust network channel’s weight and priority online by analyzing runtime information. Experimental results prove Squirrel’s effectiveness and show that Squirrel can achieve 1.67x improvement in terms of throughput and reduce the latency by 47%.

Fei Chen, Song Wu, Hai Jin
vPlacer: A Co-scheduler for Optimizing the Performance of Parallel Jobs in Xen

Xen, a popular virtualization platform which enables multiple operating systems sharing one physical host, has been widely used in various fields nowadays. Currently, the existing schedulers of Xen are initially targeting at serial jobs, which achieves a remarkable utilization of computer hardware and impressive overall performance. However, the virtualized systems are expected to accommodate both parallel jobs and serial jobs in practice, and resource contention between virtual machines results in severe performance degradation of the parallel jobs. Moreover, the physical resource is vastly wasted during the communication process due to the ineffective scheduling of parallel jobs.This paper aims to optimize the performance of the parallel jobs in Xen using the co-scheduling mechanism. In this paper, we statistically analyze the process of scheduling parallel jobs in Xen, which points out that the credit scheduler is not capable of properly scheduling a parallel job. Moreover, we propose vPlacer, a conservative co-scheduler to improve the performance of the parallel job in Xen. Our co-scheduler is able to identify the parallel jobs and optimize the scheduling process to satisfy the particularity of the parallel job. The prototype of our vPlacer is implemented, and the experimental results show that the performance of the parallel job is significantly improved and the utilization of the hardware resource is optimized.

Peng Jiang, Ligang He, Shenyuan Ren, Zhiyan Chen, Rui Mao
Document Nearest Neighbors Query Based on Pairwise Similarity with MapReduce

With the continuous development of Web technology, many Internet issues evolve into Big Data problems, characterized by volume, variety, velocity and variability. Among them, how to organize plenty of web pages and retrieval information needed is a critical one. An important notion is document classification, in which nearest neighbors query is the key issue to be solved. Most parallel nearest neighbors query methods adopt Cartesian Product between training set and testing set resulting in poor time efficiency. In this paper, two methods are proposed on document nearest neighbor query based on pairwise similarity, i.e. brute-force and pre-filtering. brute-force is constituted by two phases (i.e. copying and filtering) and one map-reduce procedure is conducted. In order to obtain nearest neighbors for each document, each document pair is copied twice and all records generated are shuffled. However, time efficiency of shuffle is sensitive to the number of the intermediate results. For the purpose of intermediate results reduction, pre-filtering is proposed for nearest neighbor query based on pairwise similarity. Since only first top-k neighbors are output for each document, the size of records shuffled is kept in the same magnitude as input size in pre-filtering. Additionally, detailed theoretical analysis is provided. The performance of the algorithms is demonstrated by experiments on real world dataset.

Peipei Lv, Peng Yang, Yong-Qiang Dong, Liang Gu
Accurate Identification of Internet Video Traffic Using Byte Code Distribution Features

Video traffic, the most rapidly growing traffic type in Internet, is posing a serious challenge to Internet management. Different kinds of Internet video contents, including illegal and adult contents, make it necessary to manage different video traffic using different strategies. Unfortunately, there are few research work concerning Internet video traffic type identification. In this paper, we propose a new effective feature extraction method, namely byte code distribution (BCD), for Internet video traffic type identification. The BCD method first counts the times of each byte code value (0 to 255) from a video flow, and then computes the ratio between each count and the total byte count. Such that the 256 ratios are used as the features. Comparing with traditional packet-level features, the BCD features contain more video type information, and are able to make identification more accurately. To test the performance of our proposal, we collect a set of video traffic traces containing two typical video types, romance and action. We conduct a set of comparing experiments on the collected data. The results show that the BCD method can hit extremely high identification accuracies (higher than 99%), far higher than those of the traditional packet-level feature extracting methods. The empirical studies show that the BCD method is promising for Internet video traffic identification.

Yuxi Xie, Hanbo Deng, Lizhi Peng, Zhenxiang Chen
RISC: Risk Assessment of Instance Selection in Cloud Markets

Cloud markets provide instances as products in Infrastructure-as-a-Service (IaaS). Users usually underprovision instances while risking the possible failure of SLOs, or overprovision resources by suffering higher expenses. The underlying key nature of user behavior in purchasing instances can be essential for maximizing cloud market profits. However, for cloud service providers, there is little knowledge on assessing the risk of user choices on cloud instances. This paper proposes one of the first studies on the risk assessment in IaaS cloud markets. We first provide a modeling process to understand user and violations of SLOs, from server statistics. To understand the risk, we propose RISC, a mechanism to assess the risk of instance selection. RISC contains an analytic hierarchy process to evaluate the decisions, an optimization process to expose the risk frontier, and a feedback approach to fine-tuning the instance recommendation. We have evaluated our approach using simulations on real-world workloads and cloud market statistics. The results show that, compared to traditional approaches, our approach provides the best tradeoff between SLOs and costs, as it can maximize the overall profit up to 5X for the cloud service provider. All users achieve their SLOs goals while minimizing their average expenses by 34.6%.

Jingyun Gu, Zichen Xu, Cuiying Gao
Real-Time Data Stream Partitioning over a Sliding Window in Real-Time Spatial Big Data

In recent years, real-time spatial applications, like location-aware services and traffic monitoring, have become more and more important. Such applications result in dynamic environments where data, as well as queries, are continuously moving. As a result, there is a tremendous amount of real-time spatial data generated every day. The growth of the data volume seems to outspeed the advance of our computing infrastructure. For instance, in real-time spatial Big Data, users expect to receive the results of each query within a short time period without holding into account the load of the system. But with a huge amount of real-time spatial data generated, the system performance degrades rapidly, especially in overload situations. To solve this problem, we propose the use of data partitioning as an optimization technique. Traditional horizontal and vertical partitioning can increase the performance of the system and simplify data management. But they remain insufficient for real-time spatial Big data; they can’t deal with real-time and stream queries efficiently. Thus, in this paper, we propose a novel data partitioning approach over a sliding window in real-time spatial Big Data named VPA-RTSBD (Vertical Partitioning Approach for Real-Time Spatial Big data). This contribution is an implementation of the Matching algorithm for traditional vertical partitioning. We find, firstly, the optimal attributes sequence by the use of the Matching algorithm. Then, we propose a new cost model used for database partitioning, for keeping the data amount of each partition more balanced limit and for providing a parallel execution guarantee for the most frequent queries. VPA-RTSBD aims to obtain a real-time partitioning scheme and deals with stream data. It improves the performance of query execution by maximizing the degree of parallel execution. This affects QoS (Quality Of Service) improvement in real-time spatial Big Data especially with a huge volume of stream data. The performance of our contribution is evaluated via simulation experiments. The results show that the proposed algorithm is both efficient and scalable and that it outperforms comparable algorithms.

Sana Hamdi, Emna Bouazizi, Sami Faiz
A Priority and Fairness Mixed Compaction Scheduling Mechanism for LSM-tree Based KV-Stores

Key-value (KV) stores have become a backbone of large-scale applications in today’s data centers. Write-optimized data structures like the Log-Structured Merge-tree (LSM-tree) and their variants are widely used in KV storage systems. Conventional LSM-tree organizes KV items into multiple, successively larger components, and uses compaction to push KV items from one smaller component to another adjacent larger component until the KV items reach the largest component. Unfortunately, LSM-tree has severe file retention phenomenon. File retention phenomenon means that lots of SSTables locate in one component and then too many SSTables are involved in one compaction, which causes one compaction occupies long time and causes front-end writing pauses or even stops frequently. We propose a new compaction scheduling scheme called Slot, and implement it on LevelDB. The main idea of Slot is to combine score centric priority based compaction scheduling with time-slice centric fairness based compaction scheduling to alleviate the file retention and then decrease the write amplification of LSM-tree based key/value stores. Slot avoids too many files involved in one compaction and decreases the frequency of write pause or write stop. We conduct extensive evaluations and the experimental results demonstrate that Slot keeps the writing procedure more smoothly and outperforms LevelDB by 20–210% on write throughput without sacrificing the read latency.

Lidong Chen, Yinliang Yue, Haobo Wang, Jianhua Wu
PruX: Communication Pruning of Parallel BFS in the Graph 500 Benchmark

Parallel Breadth First Search (BFS) is a representative algorithm in Graph 500, the well-known benchmark for evaluating supercomputers for data-intensive applications. However, the specific storage model of Graph 500 brings severe challenge to efficient communication when computing parallel BFS in large-scale graphs. In this paper, we propose an effective method PruX for optimizing the communication of parallel BFS in two aspects. First, we adopt a scalable structure to record the access information of the vertices on each machine. Second, we prune unnecessary inter-machine communication for previously accessed vertices by checking the records. Evaluation results show that the performance of our method is at least six times higher than that of the original implementation of parallel BFS.

Menghan Jia, Yiming Zhang, Dongsheng Li, Songzhu Mei
Comparative Study of Distributed Deep Learning Tools on Supercomputers

With the growth of the scale of data set and neural networks, the training time is increasing rapidly. Distributed parallel training has been proposed to accelerate deep neural network training, and most efforts are made on top of GPU clusters. This paper focuses on the performance of distributed parallel training in CPU clusters of supercomputer systems. Using resources at the supercomputer system of “Tianhe-2”, we conduct extensive evaluation of the performance of popular deep learning tools, including Caffe, TensorFlow, and BigDL, and several deep neural network models are tested, including AutoEncoder, LeNet, AlexNet and ResNet. The experiment results show that Caffe performs the best in communication efficiency and scalability. BigDL is the fastest in computing speed benefiting from its optimization for CPU, but it suffers from long communication delay due to the dependency on MapReduce framework. The insights and conclusions from our evaluation provides significant reference for improving resource utility of supercomputer resources in distributed deep learning.

Xin Du, Di Kuang, Yan Ye, Xinxin Li, Mengqiang Chen, Yunfei Du, Weigang Wu
Noncooperative Optimization of Multi-user Request Strategy in Cloud Service Composition Reservation

With the maturity of virtualization technology and service-oriented architecture, single cloud services have been difficult to satisfy cloud users’ increasingly complex demand. Cloud service composition has become a hot topic. Nevertheless, few researches consider the problem of competition of service compositions among multiple users and interaction between the user and the cloud provider. Aiming at this problem, a service composition reservation model of a cloud provider, a cloud broker and multiple users is provided in this paper. A utility function related to revenue, payoff and performance of service compositions is designed and each user expects to maximize it. We consider this optimization problem from the perspective of game theory, and model it as a non-cooperative game. The existence of Nash equilibrium solution of the game is proved and an iterative proximate algorithm (IPA) is proposed to compute it. A series of simulation experiments are conducted to verify the theoretical analysis and the performance of IPA algorithm. The results show IPA algorithm quickly converge to a relatively stable state, and improve the utility of the user and the resource utilization of the cloud provider.

Zheng Xiao, Yang Guo, Gang Liu, Jiayi Du
Most Memory Efficient Distributed Super Points Detection on Core Networks

The super point, a host which communicates with lots of others, is a kind of special hosts gotten great focus. Mining super point at the edge of a network is the foundation of many network research fields. In this paper, we proposed the most memory efficient super points detection scheme. This scheme contains a super points reconstruction algorithm called short estimator and a super points filter algorithm called long estimator. Short estimator gives a super points candidate list using thousands of bytes memory and long estimator improves the accuracy of detection result using millions of bytes memory. Combining short estimator and long estimator, our scheme acquires the highest accuracy using the smallest memory than other algorithms. There is no data confliction and floating operation in our scheme. This ensures that our scheme is suitable for parallel running and we deploy our scheme on a common GPU to accelerate processing speed. Experiments on several real-world core network traffics show that our algorithm acquires the highest accuracy with only consuming littler than one-fifth memory of other algorithms.

Jie Xu, Wei Ding, Xiaoyan Hu
Parallel Implementation and Optimizations of Visibility Computing of 3D Scene on Tianhe-2 Supercomputer

Visibility computing is a basic problem in computer graphics, and is often the bottleneck in realistic rendering algorithms. Some of the most common include the determination of the objects visible from a viewpoint, virtual reality, real-time simulation and 3D interactive design. As one technique to accelerate the rendering speed, the research on visibility computing has gained great attention in recent years. Traditional visibility computing on single processor machine has been unable to meet more and more large-scale and complex scenes due to lack parallelism. However, it will face many challenges to design parallel algorithms on a cluster due to imbalance workload among compute nodes, the complicated mathematical model and different domain knowledge. In this paper, we propose an efficient and highly scalable framework for visibility computing on Tianhe-2 supercomputer. Firstly, a new technique called hemispheric visibility computing is designed, which can overcome the visibility missing of traditional perspective algorithm. Secondly, a distributed parallel algorithm for visibility computing is implemented, which is based on the master-worker architecture. Finally, we discuss the issue of granularity of visibility computing and some optimization strategies for improving overall performance. Experiments on Tianhe-2 supercomputer show that our distributed parallel visibility computing framework almost reaches linear speedup by using up to 7680 CPU cores.

Zhengwei Xu, Xiaodong Wang, Congpin Zhang, Changmao Wu
Efficient Algorithms of Parallel Skyline Join over Data Streams

The issue of finding skyline tuples over multiple relations, more commonly known as the skyline join problem, has been well studied in scenarios in which the data is static. Most recently, it has become a new trend that performing skyline queries on data streams, where tuples arrive or expire in a continuous approach. A few algorithms have been proposed for computing skylines on two data streams. However, those literatures did not consider the inherent parallelism, or employ serial algorithms to solve the skyline query problem, which cannot leverage the multi-core processors. Based on this motivation, in this paper, we address the problem of parallel computing for skyline join over multiple data streams. We developed a Novel Iterative framework based on the existing work and study the inherent parallelism of the Novel Iterative framework. Then we propose two parallel skyline join algorithms over sliding windows, NP-SWJ and IP-SWJ.To the best of our knowledge, this is the first paper that addresses parallel computing of skyline join over multiple data streams. Extensive experimental evaluations on real and synthetic data sets show that the algorithms proposed in this paper provide large gains over the state-of-the-art serial algorithm of skyline join over data streams.

Jinchao Zhang, JingZi Gu, Shuai Cheng, Bo Li, Weiping Wang, Dan Meng
Air Flow Based Failure Model for Data Centers

With the explosive growth of data, thousands upon thousands servers are contained in data centers. Hence, node failure is unavoidable and it generally brings effects on the performance of the whole data center. On the other hand, data centers with vast nodes will cause plenty of energy consumption. Many existing task scheduling techniques can effectively reduce the power consumption in data centers by considering heat recirculation. However, traditional techniques barely take the situation of node failure into account. This paper proposes an airflow-based failure model for data centers by leveraging heat recirculation. In this model, the spatial distribution and time distribution of failure nodes are considered. Furthermore, the Genetic algorithm (GA) and Simulated Annealing algorithm (SA) are implemented to evaluate the proposed failure model. Because the position of failures has a significant impact on the heat recirculation and the energy consumption of data centers, failure nodes with different positions are analyzed and evaluated. The experimental results demonstrate that the energy consumption of data centers can be significantly reduced by using the GA and SA algorithms for task scheduling based on proposed failure model.

Hao Feng, Yuhui Deng, Liang Yu
Adaptive Load Balancing on Multi-core IPsec Gateway

Cloud service providers usually offer IPsec VPN services to tenants by deploying the software IPsec gateway on the virtual machine. However, the current software IPsec gateway solutions cannot make full use of the allocated multi-core virtual machine resources and unable to meet the performance requirement of tenants. In order to optimize the IPsec gateway performance, the flow processing load must be properly allocated to multi-cores considering the multiple dimensions of load to improve the throughput of IPsec gateway. In this paper, we propose an optimizing scheme which separates the encryption and decryption computation from the packet forwarding process in the IPsec gateway, and implements fine-grained network flows scheduling in parallel processors. Furthermore, we present an adaptive load balancing algorithm based on quantifying the load of each processing core in real-time. Experimental results show that the performance of the IPsec gateway has significant improvement.

Wei Li, Shengjie Hu, Guanchao Sun, Yunchun Li
An Active Learning Based on Uncertainty and Density Method for Positive and Unlabeled Data

Active learning can select most informative unlabeled samples to manually annotate to enlarge the training set. Many active learning methods have been proposed so far, most of them work for these data that have all classes of tagged data. A few methods work for positive and unlabeled data and the computational complexity of existing methods is particularly high and they can’t work well for big data. In this paper, we proposed an active learning approach that works well when only small number positive data are available in big data. We utilize data preprocessing to remove most of the outliers, so the density calculation is simplified relative to KNN algorithm, and our proposed sample selection strategy Min-Uncertainty Density (MDD) can help select more uncertain and higher density unlabeled samples with less computation. A combined semi-supervised learning active learning technique (MDD-SSAL) automatically annotating some confident unlabeled samples in the each iteration is proposed to reduce the number of manually annotated samples. Experimental results indicate that our proposed method is competitive with other similar methods.

Jun Luo, Wenan Zhou, Yu Du
TAMM: A New Topology-Aware Mapping Method for Parallel Applications on the Tianhe-2A Supercomputer

With the increasing size of high performance computing systems, the expensive communication overhead between processors has become a key factor leading to the performance bottleneck. However, default process-to-processor mapping strategies do not take into account the topology of the interconnection network, and thus the distance spanned by communication messages may be particularly far. In order to enhance the communication locality, we propose a new topology-aware mapping method called TAMM. By generating an accurate description of the communication pattern and network topology, TAMM employs a two-step optimization strategy to obtain an efficient mapping solution for various parallel applications. This strategy first extracts an appropriate subset of all idle computing resources on the underlying system and then constructs an optimized one-to-one mapping with a refined iterative algorithm. Experimental results demonstrate that TAMM can effectively improve the communication performance on the Tianhe-2A supercomputer.

Xinhai Chen, Jie Liu, Shengguo Li, Peizhen Xie, Lihua Chi, Qinglin Wang
Adaptive Data Sampling Mechanism for Process Object

Process object is the abstraction of process. In process object, there are different type of entities and associations. The entities vary dependent on other entities. The performance and evolution of process object are affected by the association between entities. These changes could be reflected in the data collected from the process objects. These data from process object could be regard as big data stream. In the context of big data, how to find appropriate data for process object is a challenge. The data sampling should reflect the performance change of process object, and should be adaptive to the current underlying distribution of data in data stream. For finding appropriate data in big data stream to model process object, an adaptive data sampling mechanism is proposed in this paper. Experiments demonstrate the effectiveness of the proposed adaptive data sampling mechanism for process object.

Yongzheng Lin, Hong Liu, Zhenxiang Chen, Kun Zhang, Kun Ma
MedusaVM: Decentralizing Virtual Memory System for Multithreaded Applications on Many-core

Virtual memory system multiplexes the single physical memory for multiple running processes with two centralized resources, i.e., virtual memory space and page table hierarchy. However, for multithreaded applications running a single address space, current centralized VM system design encounters severe scalability bottlenecks and significantly impedes the application speedup increment on many-core systems. This paper proposes a novel VM system called MedusaVM to scale VM system to many cores. To this end, MedusaVM partitions the global virtual memory space and page table tree in a memory-efficient way, eliminating performance interference and lock contention between cores. Moreover, MedusaVM also provides a traditional shared memory interface for multithreaded applications.Our prototype system is implemented based on Linux kernel 4.4.0 and glibc 2.23. Experimental results evaluated on a 32-core machine demonstrate that MedusaVM scales much better than Linux kernel and uses 22 $$\times $$ × less memory compared with the state-of-art approach. For microbenchmarks experiments, MedusaVM achieves nearly linear performance speedup. In multithreaded applications Metis and Psearchy, MedusaVM also significantly outperforms Linux kernel by up to a factor of 2.5 $$\times $$ × .

Miao Cai, Shenming Liu, Weiyong Yang, Hao Huang
An Efficient Retrieval Method for Astronomical Catalog Time Series Data

Astronomical catalog time series data refer to the data collected at different time, which can provide a comprehensive understanding of the celestial objects’ attributes and expose various astronomical phenomena. Its retrieval is indispensable to astronomy research. However, the existing time series data retrieval methods involve lots of manual work and extremely time-consuming. The complexity will also be augmented by the exponentially growth of observation data. In this paper, we propose an automatic and efficient retrieval method for astronomical catalog time series data. With the goal of identifying the same celestial objects time series data automatically, a cross-match scheme is designed, which labeled a unique MatchID for each record matched with the datum catalog. To accelerate the matching process, an in-memory index structure based on Redis is specially designed, which enables matching speed 1.67 times faster than that of MySQL in massive amounts of data. Moreover, Catalog-Mongo—an improved database of MongoDB—is presented, in which a Data Blocking Algorithm is proposed to improve the data partitioning of MongoDB and accelerate query performance. The experimental results show that the query speed is about 2 times faster than MongoDB and 7.6 to 8.7 times than MySQL.

Bingyao Li, Ce Yu, Xiaoteng Hu, Jian Xiao, Shanjiang Tang, Lianmeng Li, Bin Ma
Maintaining Root via Custom Android Kernel Across Over-The-Air Upgrade

People can obtain the highest privileges and control devices by Android root. However, an Android phone has been rooted, it is difficult for the user to update the Android system. Aiming at these problems, this paper proposes a maintaining root via custom Android kernel across Over-The-Air (OTA) upgrade. By customizing the kernel in boot and recovery, the boot will be replaced with rooted boot after updating automatically, so that system not only can be updated successfully but also maintain root. Experiments show that there is no abnormal between rooted mobile with a customized kernel and normal mobile during a minor system update.

Huang Zucheng, Liu Lu, Li Yuanzhang, Zhang Yu, Zhang Qikun
Accelerating Pattern Matching with CPU-GPU Collaborative Computing

Pattern matching algorithms are used in several areas such as network security, bioinformatics and text mining. In order to support large data and pattern sets, these algorithms have to be adapted to take advantage of the computing power of emerging parallel architectures. In this paper, we present a parallel algorithm for pattern matching on CPU-GPU heterogeneous systems, which is based on the Parallel Failureless Aho-Corasick algorithm (PFAC) for GPU. We evaluate the performance of the proposed algorithm on a machine with 36 CPU cores and 1 GPU, using data and pattern sets of different size, and compare it with that of PFAC for GPU and the multithreaded version of PFAC for shared-memory machines. The results reveal that our proposal achieves higher performance than the other two approaches for data sets of considerable size, since it uses both CPU and GPU cores.

Victoria Sanz, Adrián Pousa, Marcelo Naiouf, Armando De Giusti
An Incremental Map Matching Algorithm Based on Weighted Shortest Path

GPS (global position system) trajectories collected by urban cars depict the citizens’ daily trips and reflect the traffic situation in city areas. The process called map matching is to match the GPS point sequence to the corresponding road which is the fundamental step for further travel pattern mining and transport situation analysis. The existing research based on the incremental map matching applies only to GPS trajectories of high-sampling-rate (0 to 30 s). However most actually collected GPS trajectories are with a low-sampling-rate (more than 2 min) for saving the collection and transmission costs. In this paper, we proposed an incremental map matching algorithm based on weighted shortest path, called WSI-matching. By matching single GPS point to its candidate road and filling the missing path between two GPS points, it improves the overall matching accuracy with a relatively low time complexity compared to the traditional global matching algorithm. Experiment results show that our WSI-matching algorithm present obvious advantages over traditional incremental algorithms and global algorithms in terms of both matching accuracy and running time, and it adapts to either high-sampling-rate trajectories or low-sampling-rate trajectories.

Jixiao Chen, Yongjian Yang, Liping Huang, Zhuo Zhu, Funing Yang
Asynchronous Parallel Dijkstra’s Algorithm on Intel Xeon Phi Processor
How to Accelerate Irregular Memory Access Algorithm

As the instruction-level parallelism (ILP) on CPU develops to a rather advanced level, the exploration that whether many-core architecture is applicable for graph algorithms is generating more interests in researchers. However, due to the irregular memory access and the low ratio of computation to memory access, the performance of graph algorithms on many-core architectures has never worked good enough.To obtain outstanding speedup on many-core architecture, first of all, we need to figure out three questions: (i) how to optimize the memory access, (ii) how to minimize the overhead of synchronization, (iii) how to exploit the parallelism in algorithm. Prior works hardly reach the goal if such questions are treated in separated way. Throughout this paper, we aim to settle these questions systematically, and try to provide a set of methods of optimizing graph algorithms on many-core architecture.This paper mainly discusses how to accelerate the Single Source Shortest Path (SSSP) problem on Intel Many Integrated Core (MIC) architecture, on which we propose an asynchronous parallel Dijkstra’s algorithm. It aims at maximizing parallelism and minimizing overhead of synchronization. Experimental result shows that the MIC architecture could efficiently solve the SSSP problem, and its performance could be sped up by 9.2x compared to the benchmark of DIMACS.

Weidong Zhang, Lei Zhang, Yifeng Chen
DA Placement: A Dual-Aware Data Placement in a Deduplicated and Erasure-Coded Storage System

Simultaneously incorporating deduplication as well as erasure coding is preferred for modern storage systems for the enhanced storage efficiency and economical data reliability. However, simple incorporation suffers from the “read imbalance problem”, in which parallel data accesses are curbed by throttled storage nodes. This problem is due to the uneven data placement in the system, which is unaware of the employment of both deduplication and erasure coding, each of whom alters the order of data if unattended. This paper proposes a systematic design and implementation of a Dual-Aware(DA) placement in a combined storage system to achieve both deduplication-awareness and erasure-coding-awareness at the same time. DA not only records the node number of each unique data to allow for quick references with ease, but also dynamically tracks used nodes for each writes request. In this way, deduplication awareness is formed to skip inconvenient placement locations. Besides, DA serializes the placement of parity blocks with a stripe and across stripes. Such realization of erasure coding awareness ensures the separation of data and parity, as well as maintains data sequentiality at bordering stripes. Additionally, DA manages to extend with further load-balancing through an innovative use of the deduplication level, which intuitively predicts future accesses of a piece of data. In short, DA manages to boost system performance with little memory or computation cost. Extensive experiments using both real-world traces and synthesized workloads, prove DA achieves a better read performance. For example, DA respectively leads an average latency margin of 30.86% and 29.63%, over the baseline rolling placement(BA) and random placement(RA) under CAFTL traces over a default cluster of 12 nodes with RS(8,4).

Mingzhu Deng, Ming Zhao, Fang Liu, Zhiguang Chen, Nong Xiao
Improving Restore Performance of Deduplication Systems by Leveraging the Chunk Sequence in Backup Stream

Traditional deduplication based backup systems normally employ containers to reduce the chunk fragmentation, thus improving the restore performance. However, the shared chunks belonging to a single backup grows with the increase of the number of backups. Those shared chunks are normally distributed across multiple containers. This feature increases chunk fragmentation and significantly degrades the restore performance. In order to improve the restore performance, some schemes are proposed to optimize the replacement strategy of the restore cache, such as the ones using LRU and OPT. However, LRU is inefficient and OPT consumes additional computational overhead. By analyzing the backup and restore process, we observe that the sequence of the chunks in the backup stream is consistent to that in the restore stream. Based on this observation, this paper proposes an off-line optimal replacement strategy—OFL for the restore cache. The OFL records the chunk sequence of backup process, and then uses this sequence to calculate the exact information of the required chunks in advance for the restore process. Finally, accurate prefetch will be employed by leveraging the above information to reduce the impact of chunk fragmentation. Real data sets are employed to evaluate the proposed OFL. The experimental results demonstrate that OFL improves the restore performance over 8% in contrast to the traditional LRU and OPT.

Ru Yang, Yuhui Deng, Cheng Hu, Lei Si
Blockchain-Based Secure and Reliable Distributed Deduplication Scheme

Due to the explosive growth of data volume on the Internet, deduplication techniques have been wildly used in cloud storage to save both disk space and network bandwidth. However, conventional deduplication schemes lead to problems with data reliability that can be attributed to the algorithm implementation where there is only one copy for each file stored in the cloud. Furthermore, the participation of trusted third party in most of the previous work has brought about the security challenge as single point of failure. In this paper, we propose a blockchain based deduplication scheme with high reliability and confidentiality in which the files are distributed to multiple servers and the information of files is recorded on the time-stamped blockchain whose central authorities are replaced to automatically decentralized smart contracts. Based on the proposed scheme, we present relevant protocols to achieve secure cloud storage derived from the consensus and incentive mechanism. Security analysis demonstrates that our deduplication scheme can achieve the proposed security goals while it has limited overhead proved by simulation experiments.

Jingyi Li, Jigang Wu, Long Chen, Jiaxing Li
Controlled Channel Attack Detection Based on Hardware Virtualization

Controlled-channel attack is a novel side-channel attack that uses page faults (#PF) to infer process-sensitive information of guest-VMs. Existing protection schemes focus on restricting malicious OS of virtual machine access to page number information. They need to copy memory page content frequently or manually mark and recompile sensitive programs, which takes a lot of time and labor overhead. This paper introduces a hardware-based detection method against it in a different way. The Hypervisor monitors the modification of the guest page table entry (PTE) and the Interrupt Descriptor Table (IDT) entries to find the trace of adversary’s operations. As there is a semantic gap between VMs and Hypervisor, we take advantage of VMI (Virtual Machine Introspection) to convert important data. To overcome the challenge of changeable page tables, we grasp the feature of the target attack and filter out required records. Experiments show that this method can effectively detect controlled-channel attacks. In general, the performance overhead of the operations related to context switching will increase but within an acceptable range.

Chenyi Qiang, Weijie Liu, Lina Wang, Rongwei Yu
CuAPSS: A Hybrid CUDA Solution for AllPairs Similarity Search

Given a set of high dimensional sparse vectors, a similarity function and a threshold, AllPairs Similarity Search finds out all pairs of vectors whose similarity values are higher than or equal to the threshold. AllPairs Similarity Search (APSS) has been studied in many different fields of computer science, including information retrieval, data mining, database and so on. It is a crucial part of lots of applications, such as near-duplicate document detection, collaborative filtering, query refinement and clustering. For cosine similarity, many serial algorithms have been proposed to solve the problem by decreasing the possible similarity candidates for each query object. However, the efficiency of those serial algorithms degrade badly as the threshold decreases. Other parallel implementations of APSS based on OpenMP or MapReduce also adopt the pruning policy and do not solve the problem thoroughly. In this context, we introduce CuAPSS, which solves the All Pairs cosine similarity search problem in CUDA environment on GPUs. Our method adopts a hybrid method to utilize both forward list and inverted list in APSS which compromises between the memory visiting and dot-product computing. The experimental results show that our method could solve the problem much faster than existing methods on several benchmark datasets with hundreds of millions of non-zero values, achieving the speedup of 1.5x–23x against the state-of-the-art parallel algorithm, while keep a relatively stable running time with different values of the threshold.

Yilin Feng, Jie Tang, Chongjun Wang, Junyuan Xie
A Parallel Branch and Bound Algorithm for the Probabilistic TSP

The paper presents parallelization of exact algorithm of resolution for the Probabilistic Traveling Salesman Problem (PTSP). This algorithm allows us, first, to verify the stability of well-solvable special cases and also to optimally solve useful instances of PTSP. It again allows to perform our version of Karp partitioning algorithm, where real problems are very large-sized. The implementation of the algorithm of Karp consists in subdividing the square plan, into sub-plans. So we transform the resolution of a large size problem to the resolution of many small size sub-problems which can be exactly solved. This application can be gridified and these different sub-problems would be processed in parallel by different nodes since they are totally independent. In each sub-plan the Branch and Bound algorithm is used. In this paper we propose two parallelizations of the Branch and Bound algorithm for the resolution of the PTSP. On the one hand, the parallelization of the branches used in the exploration of the tree, on the other hand the parallelization of the algorithm associated with the notion of partitioning introduced by Karp. We perform an experimental study conducted in a multi-core environment to evaluate the performance of the proposed approach.

Mohamed Abdellahi Amar, Walid Khaznaji, Monia Bellalouna
Accelerating Artificial Bee Colony Algorithm with Elite Neighborhood Learning

Artificial bee colony (ABC) algorithm has been shown good performance over many optimization problems. For some complex optimization problems, however, ABC often suffers from a slow convergence speed, because it is good at exploration but poor at exploitation. To achieve a better tradeoff between the exploration and exploitation capabilities, we introduce the breadth-first search (BFS) framework and depth-first search (DFS) framework into different phases of ABC respectively. The BFS framework is combined with the employed bee phase to focus on the exploration, while the DFS framework is integrated into the onlooker bee phase to concentrate on exploitation. After that, an elite neighborhood learning (ENL) strategy is proposed to enhance the information exchange between the employed bee phase and the onlooker bee phase, because in ABC the employed bees cannot well communicate with the onlooker bees which may also cause slow convergence speed. Extensive experiments are conducted on 22 well-known test functions, and six well-established ABC variants are included in the comparison. The results showed that our approach can effectively accelerate the convergence speed and significantly perform better on the majority of test functions.

Xinyu Zhou, Yunan Liu, Yong Ma, Mingwen Wang, Jianyi Wan
Distributed Parallel Simulation of Primary Sample Space Metropolis Light Transport

Monte-Carlo rendering algorithms are known for producing highly realistic images, but at a significant computational cost, because they rely on tracing up to trillions of light paths through a scene to simulate physically based light transport. For this reason, a large body of research exists on various techniques for accelerating these costly algorithms. As one of the Monte-Carlo rendering algorithms, PSSMLT (Primary Sample Space Metropolis Light Transport) is widely used nowadays for photorealistic rendering. Unfortunately, the computational cost of PSSMLT is still very high since the space of light paths in high-dimension and up to trillions of paths are typically required in such path space. Recent research on PSSMLT has proposed a variety of optimized methods for single node rendering, however, multi-node rendering for PSSMLT is rarely mentioned due in large part to the complicated mathematical model, complicated physical processes and the irregular memory access patterns, and the imbalanced workload of light-carrying paths.In this paper, we present a highly scalable distributed parallel simulation framework for PSSMLT. Firstly, based on light transport equation, we propose the notion of sub-image with certain property for multi-node rendering and theoretically prove that the whole set of sub-images can be combined to produce the final image; Then we further propose a sub-image based assignment partitioning algorithm for multi-node rendering since the traditional demand-driven assignment partitioning algorithm doesn’t work well. Secondly, we propose a physically based parallel simulation for the PSSMLT algorithm, which is revealed on a parallel computer system in master-worker paradigm. Finally, we discuss the issue of granularity of the assignment partitioning and some optimization strategies for improving overall performance, and then a static/dynamic hybrid scheduling strategy is described. Experiments show that framework has a nearly linear speedup along with the CPU core count up to 9,600 on the Tianhe-2 supercomputer, which suggests that the proposed framework has a high scalability and efficiency.

Changmao Wu, Changyou Zhang, Qiao Sun
Parallel Statistical and Machine Learning Methods for Estimation of Physical Load

Several statistical and machine learning methods are proposed to estimate the type and intensity of physical load and accumulated fatigue. They are based on the statistical analysis of accumulated and moving window data subsets with construction of a kurtosis-skewness diagram. This approach was applied to the data gathered by the wearable heart monitor for various types and levels of physical activities, and for people with various physical conditions. The different levels of physical activities, loads, and fitness can be distinguished from the kurtosis-skewness diagram, and their evolution can be monitored. Several metrics for estimation of the instant effect and accumulated effect (physical fatigue) of physical loads were proposed. The data and results presented allow to extend application of these methods for modeling and characterization of complex human activity patterns, for example, to estimate the actual and accumulated physical load and fatigue, model the potential dangerous development, and give cautions and advice in real time.

Sergii Stirenko, Peng Gang, Wei Zeng, Yuri Gordienko, Oleg Alienin, Oleksandr Rokovyi, Nikita Gordienko, Ivan Pavliuchenko, Anis Rojbi
Parallel Communication Mechanisms in Solving Integral Equations for Electromagnetic Scattering Based on the Method of Moments

In this paper, a parallel solution of impedance filling was studied to improve the efficiency for the method of moments (MOM) in solving the integral equation for electromagnetic scattering. Based on the formalization method, the correctness verification method of the parallel communication protocol was proposed to avoid the abnormal situation such as deadlock caused by unsynchronized MPI message delivery. Finally, a numerical example is given to verify the accuracy of the parallel algorithm in a multi-core cluster environment, and the parallel efficiency and computational capability are also tested. From the experiment data, it can be seen that the results of the parallel-based integral equation match well with the results of the Mie analytical solution. The parallel algorithm proposed in this paper is evaluated in terms of the acceleration ratio and parallel efficiency. The experimental results demonstrate the reliability and effectiveness of the method.

Lan Song, Dennis K. Peters, Weimin Huang, Zhiwei Liu, Lixia Lei, Tangliu Wen
POWER: A Parallel-Optimization-Based Framework Towards Edge Intelligent Image Recognition and a Case Study

To improve the intelligent image recognition abilities of edge devices, a parallel-optimization-based framework called POWER is introduced in this paper. With FPGA (Field-Programmable Gate Array) as its hardware module, POWER provides well extensibility and flexible customization capability for developing intelligent firmware suitable for different types of edge devices in various scenarios. Through an actual case study, we design and implement a firmware prototype following the specification of POWER and explore its performance improvement using parallel optimization. Our experimental results show that the firmware prototype we implement exhibits good performance and is applicable to substation inspection robots, which also validate the effectiveness of our POWER framework in designing edge intelligent firmware modules indirectly.

Yingyi Yang, Xiaoming Mai, Hao Wu, Ming Nie, Hui Wu
SP-TSRM: A Data Grouping Strategy in Distributed Storage System

With the development of smart devices and social media, massive unstructured data is uploaded to distributed storage systems. Since the characteristics of multi-users and high concurrency the unstructured data accesses have, it brings new challenges to traditional distributed storage systems designed for large files. We propose a grouping strategy to analyze relevant data in access according to disk access logs in the real distributed storage systems environment. When any data in the group is accessed, the whole group is prefetched from disk to the cache. Firstly, we conduct statistical analysis on the access logs and propose a preliminary classification method to classify files in spatiotemporal locality. Secondly, a strength-priority tree structure relation model (SP-TSRM) is proposed to mine file group efficiently. Finally, experiments show that the proposed model can improve the cache hit rate significantly, thereby improving the read efficiency of distributed storage systems.

Dongjie Zhu, Haiwen Du, Ning Cao, Xueming Qiao, Yanyan Liu
Abstract Parallel Array Types and Ghost Cell Update Implementation

Stencil patterns are widely used in scientific simulations and image processing. To parallelize these problems, we have to divide data into chunks that are processed by different processors. One challenge with this approach is the update of ghost cells, which are the neighbor values that calculated on remote processes. This paper focus on the update communication. We provide an abstract array types to describe distribution patterns, such as ghost cells, from global and intuitive view. Based on this description, a general copyto function is provided to perform communication automatically. Furthermore, our work makes it possible to design a distribution-independent algorithm. This results in better productivity on tuning performance.

Shuang Zhang, Bei Wang, Yifeng Chen

High Performance Computing

Frontmatter
Accelerating Low-End Edge Computing with Cross-Kernel Functionality Abstraction

This paper envisions a future in which high performance and energy-modest parallel computing on low-end edge devices were achieved through cross-device functionality abstraction to make them interactive to cloud machines. Rather, there has been little exploration of the overall optimization into kernel processing can deliver for increasingly popular but heavy burden on low-end edge devices. Our idea here is to extend the capability of functionality abstraction across edge clients and cloud servers to identify the computation-intensive code regions automatically and execute the instantiation on the server at runtime. This paper is an attempt to explore this vision, ponder on the principle, and take the first steps towards addressing some of the challenges with . As a kernel-level solution, enables edge devices to abstract not only application layer but also system layer functionalities, as if they were to instantiate the abstracted function inside the same kernel programming. Experimental results demonstrate that makes cross-kernel functionality abstraction efficient for low-end edge devices and benefits them significant performance optimization than the default scheme unless in a constraint of low transmission bandwidth.

Chao Wu, Yaoxue Zhang, Yuezhi Zhou, Qiushi Li
A High-Performance and High-Reliability RAIS5 Storage Architecture with Adaptive Stripe

In the era of big data, the traditional RAID storage system has been incapable of meeting the requirements of performance and reliability for the large amount of data storage and computing. In view of the situation, Solid State Disks (SSDs), which can provide better performance than Hard Disk Drives (HDDs), are widely used to construct storage arrays in enterprise environments. Today many studies on Redundant Array of Independent SSDs (RAIS) storage systems concentrate more on improving write performance, and show less attention on the reconstruction performance of RAIS storage systems. In this paper, we proposed RAIS5AS, a novel RAIS5 storage architecture with adaptive stripe for improving the performance and reliability of RAIS5. RAIS5AS distinguishes between logical stripe and physical stripe. Logical stripe is a traditional RAID stripe. Physical stripe consists of the blocks (in a logical stripe) which have been written data. When handling write requests, RAIS5AS uses physical stripe as the basic processing unit to choose which blocks are read to compute the new parity block. When recovering data, RAIS5AS skips the unused failed blocks. In addition, RAIS5AS simplifies the synchronization process of RAIS5 storage system. We have implemented the proposed scheme and carried out a series of experiments. RAIS5AS on average improves write performance and reconstruction performance of the basic RAIS5 by up to 7.92% and 95.65% respectively, and those of JOR by 9.16% and 14.57% respectively.

Linjun Mei, Dan Feng, Lingfang Zeng, Jianxi Chen, Jingning Liu
ADAM: An Adaptive Directory Accelerating Mechanism for NVM-Based File Systems

Byte-addressable non-volatile memory (NVM) offers fast, fine-grained random access to persistent storage, which revolutionizes the architecture design of file systems. Existing NVM-based file systems seldom optimize the directory-access performance despite that directory operations significantly impact application performance. These file systems still follow the traditional design of multi-level directory namespace which is inadequate for byte-addressable NVM and involves redundant access overhead.In this paper, we propose an adaptive directory accelerating mechanism (ADAM) for NVM-based file systems. ADAM analyzes different directory states including read/write frequency and size and then builds adaptive full-name directory namespace (AFDN) areas as well as an evolving strategy. Compared with multi-level directory namespace, AFDN areas offer fast read access and low write latency. Besides, the evolving strategy helps AFDN areas maintain a stable performance during system runtime. We implement ADAM on NOn-Volatile memory Accelerated (NOVA) log-structured file system and build an efficient hybrid index for DRAM/NVMM architecture. Experiments show that NOVA with ADAM achieves up to 43% latency reduction and 76% throughput improvement, compared with original NOVA. Moreover, ADAM generally outperforms other state-of-the-art NVM-based file systems in various tests.

Xin Cui, Linpeng Huang, Shengan Zheng
A Parallel Method for All-Pair SimRank Similarity Computation

How to measure SimRank similarity of all-pair vertices in a graph is a very important research topic which has a wide range of applications in many fields. However, computation of SimRank is costly in both time and space, making traditional computing methods failing to handle graph data of ever-growing size.This paper proposes a parallel multi-level solution for all-pair SimRank similarity computing on large graphs. We partition the objective graph first with the idea of modularity maximization and get a collapsed graph based on the blocks. Then we compute the similarities between verteices inside a block as well as the similarities between the blocks. In the end, we integrate these two types of similarities and calculate the approximate SimRank simlarities between all vertex pairs. The method is implemented on Spark platform and it makes an improvement on time efficiency while maintaining the effectiveness compared to SimRank.

Xuan Huang, Xingkun Gao, Jie Tang, Gangshan Wu
CLDM: A Cache Cleaning Algorithm for Host Aware SMR Drives

Host aware SMR (HA-SMR) drives can effectively increase the capacity of hard disk drives. However, the cache cleaning algorithms implemented in the HA-SMR drives need to be improved. Current cache cleaning algorithms do not consider the characteristics of applications and usually bring too much data migration. In this paper, we propose a new cache cleaning algorithm called CLDM, which takes the characteristics of applications into account. It uses the “zone heat” to reflect the access frequency in the disk cache of a zone, and the “zone data migration” to reflect the data migration of a zone when cache cleaning is performed on the zone. When CLDM is performed, it first computes the “zone heat” for each zone which is currently buffered in the disk cache, and then computes the “average zone heat” for all the buffered zones. After that, CLDM computes the “zone data migration” for each buffered zone, and sorts all the buffered zones in the ascending order of their “zone data migration”s. CLDM first cleans the zones which satisfy the condition “the zone heat of a zone is less than the average zone heat”. And then it cleans the zones with less “zone data migration”s. Experimental results show that CLDM can effectively reduce the amount of migrated data during both the cache cleaning process and garbage collection process, and improve the performance of HA-SMR drives.

Wenguo Liu, Lingfang Zeng, Dan Feng
HyGrid: A CPU-GPU Hybrid Convolution-Based Gridding Algorithm in Radio Astronomy

New-generation radio telescopes have been producing an unprecedented scale of data every day and requiring fast algorithms to speedup their data processing work flow urgently. The most data intensive computing phase during the entire work flow is gridding, which converts original data from irregular sampling space to regular grid space. Current methods are mainly focused on interferometers or have limitations on the resolutions due to the memory wall. Here we propose a CPU-GPU hybrid algorithm which accelerates the process of gridding. It employs multi-CPU to perform pre-ordering and GPU to speed up convolution-based gridding. Several optimization strategies are further proposed for reducing unnecessary memory access and maximizing the utilization of the heterogeneous architecture. Testing results demonstrate that the proposal is especially suitable for gridding large-scale data and can improve performance by up to 71.25 times compared to the traditional multi-thread CPU-based approach.

Qi Luo, Jian Xiao, Ce Yu, Chongke Bi, Yiming Ji, Jizhou Sun, Bo Zhang, Hao Wang
COUSTIC: Combinatorial Double Auction for Crowd Sensing Task Assignment in Device-to-Device Clouds

With the emerging technologies of Internet of Things (IOTs), the capabilities of mobile devices have increased tremendously. However, in the big data era, to complete tasks on one device is still challenging. As an emerging technology, crowdsourcing utilizing crowds of devices to facilitate large scale sensing tasks has gaining more and more research attention. Most of existing works either assume devices are willing to cooperate utilizing centralized mechanisms or design incentive algorithms using double auctions. There are two cases that may not practical to deal with, one is a lack of centralized controller for the former, the other is not suitable for the seller device’s resource constrained for the later. In this paper, we propose a truthful incentive mechanism with combinatorial double auction for crowd sensing task assignment in device-to-device (D2D) clouds, where a single mobile device with intensive sensing task can hire a group of idle neighboring devices. With this new mechanism, time critical sensing tasks can be handled in time with a distributed nature. We prove that the proposed mechanism is truthful, individual rational, budget balance and computational efficient.

Yutong Zhai, Liusheng Huang, Long Chen, Ning Xiao, Yangyang Geng
Backmatter
Metadaten
Titel
Algorithms and Architectures for Parallel Processing
herausgegeben von
Jaideep Vaidya
Jin Li
Copyright-Jahr
2018
Electronic ISBN
978-3-030-05051-1
Print ISBN
978-3-030-05050-4
DOI
https://doi.org/10.1007/978-3-030-05051-1