
2016 | Book

Network and Parallel Computing

13th IFIP WG 10.3 International Conference, NPC 2016, Xi'an, China, October 28-29, 2016, Proceedings

Edited by: Guang R. Gao, Depei Qian, Xinbo Gao, Barbara Chapman, Wenguang Chen

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 13th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2016, held in Xi'an, China, in October 2016.
The 17 full papers presented were carefully reviewed and selected from 99 submissions. They are organized in the following topical sections: memory (non-volatile, solid state drives, hybrid systems); resilience and reliability; scheduling and load-balancing; heterogeneous systems; data processing and big data; and algorithms and computational models.

Table of Contents

Frontmatter

Memory: Non-Volatile, Solid State Drives, Hybrid Systems

Frontmatter
VIOS: A Variation-Aware I/O Scheduler for Flash-Based Storage Systems
Abstract
NAND flash memory has gained widespread acceptance in storage systems because of its superior write/read performance, shock resistance and low power consumption. I/O scheduling for Solid State Drives (SSDs) has received much attention in recent years for its ability to exploit the rich parallelism within SSDs. However, most state-of-the-art I/O scheduling algorithms are oblivious to the increasingly significant inter-block variation introduced by advanced technology scaling. This paper proposes VIOS, a variation-aware I/O scheduler that exploits the speed variation among blocks to minimize the access conflict latency of I/O requests. VIOS schedules I/O requests into a hierarchical-batch structured queue to preferentially exploit channel-level parallelism, followed by chip-level parallelism. Moreover, conflicting write requests are allocated to faster blocks to reduce the access conflicts of waiting requests. Experimental results show that VIOS reduces write latency significantly compared to state-of-the-art I/O schedulers while attaining high read efficiency.
Jinhua Cui, Weiguo Wu, Shiqiang Nie, Jianhang Huang, Zhuang Hu, Nianjun Zou, Yinfeng Wang
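The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the request/block representation and the `schedule` function are illustrative assumptions, chip-level parallelism is omitted, and the "faster block" policy is reduced to picking the minimum-latency block from a dictionary.

```python
from collections import defaultdict, deque

def schedule(requests, block_speed):
    """Toy hierarchical-batch scheduler: each batch serves at most one
    request per channel (channel-level parallelism first); conflicting
    writes are steered to the fastest block (lowest latency)."""
    channels = defaultdict(deque)
    for r in requests:
        channels[r['channel']].append(r)
    batches = []
    while any(channels.values()):
        batch = []
        for q in channels.values():
            if q:
                r = q.popleft()
                if r['op'] == 'write':
                    # toy policy: resolve write conflicts toward fast blocks
                    r['block'] = min(block_speed, key=block_speed.get)
                batch.append(r)
        batches.append(batch)
    return batches

reqs = [{'channel': 0, 'op': 'write'},
        {'channel': 1, 'op': 'read'},
        {'channel': 0, 'op': 'read'}]
batches = schedule(reqs, {'fast': 2, 'slow': 5})
```

Here the two channel-0 requests conflict, so they land in separate batches, while the channel-1 read proceeds in parallel with the first one.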
Exploiting Cross-Layer Hotness Identification to Improve Flash Memory System Performance
Abstract
Flash memory has been widely deployed in modern storage systems. However, density improvements and technology scaling decrease its endurance and I/O performance, which motivates efforts to improve flash performance and reduce cell wearing. Wearing can be reduced by lowering the threshold voltages, but at the cost of slower reads. In this paper, access hotness characteristics are exploited to improve read performance and endurance. First, based on the reliability characteristics of flash memory, the relationship among flash cell wearing, read latency and bit error rate is introduced. Then, using the hotness information provided by buffer management, the threshold voltages of cells holding write-hot data are decreased for wearing reduction, while those of cells holding read-hot data are increased for read latency reduction. We demonstrate through simulation that the proposed technique achieves significant endurance and read performance improvements without sacrificing write throughput.
Jinhua Cui, Weiguo Wu, Shiqiang Nie, Jianhang Huang, Zhuang Hu, Nianjun Zou, Yinfeng Wang
Efficient Management for Hybrid Memory in Managed Language Runtime
Abstract
Hybrid memory, which leverages the benefits of traditional DRAM and emerging memory technologies, is a promising alternative for future main memory design. However, popular management policies based on memory-access recording and page migration may incur non-trivial overhead in execution time and hardware space. Nowadays, managed language applications are increasingly dominant on every kind of platform, and managed runtimes provide services for automatic memory management, so it is important to adapt them to the underlying hybrid memory.
This paper explores two opportunities inside managed runtimes, heap partition placement and object promotion, for allocating hot data in a fast memory space (fast-space) without any access recording or data migration overhead. For heap partition placement, we quantitatively analyze the LLC miss density and performance effect of each partition. Results show that LLC misses, especially store misses, mostly hit nursery partitions. Placing the nursery in fast-space, which on average holds 20 % of the total memory footprint of the tested benchmarks, causes only a 10 % performance difference compared to placing the entire footprint in fast-space. During object promotion, hot objects are directly allocated to fast-space. We develop a tool to analyze the LLC miss density of each method of a workload, since we found that LLC misses are mostly triggered by a small percentage of the total set of methods; the objects visited by the top-ranked methods are recognized as hot. Results show that hot objects do have higher access density, more than 3 times that of a random distribution for SPECjbb and pmd, and placing them in fast-space further reduces execution time by 6 % and 13 %, respectively.
Chenxi Wang, Ting Cao, John Zigman, Fang Lv, Yunquan Zhang, Xiaobing Feng

Resilience and Reliability

Frontmatter
Application-Based Coarse-Grained Incremental Checkpointing Based on Non-volatile Memory
Abstract
The Mean Time to Failure continues to decrease as computing systems scale. To maintain the reliability of computing systems, checkpoints have to be taken more frequently. Incremental checkpointing is a well-researched technique that makes frequent checkpointing possible. Fine-grained incremental checkpointing minimizes checkpoint size but suffers from significant monitoring overhead. We observe memory accesses at page granularity and find that the size of the contiguous memory regions visited by an application tends to be proportional to the size of the corresponding memory allocations. In this paper, we propose Application-Based Coarse-Grained Incremental Checkpointing (ACCK), which leverages a priori information about memory allocations to relax the memory monitoring granularity incrementally and appropriately. This provides better opportunities for balancing the tradeoff between monitoring and copying overhead. ACCK is also assisted by huge pages to alleviate TLB overhead. Our experiments show that ACCK presents a 2.56x performance improvement over the baseline mechanism.
Zhan Shi, Kai Lu, Xiaoping Wang, Wenzhe Zhang, Yiqi Wang
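The general idea of incremental checkpointing (copy only regions that changed since the last checkpoint) can be sketched as follows. This is a simplification, not ACCK itself: the allocation-aware, adaptive granularity described in the abstract is replaced by a fixed region size, and change detection uses content hashes instead of page-protection traps; the class and parameter names are illustrative.

```python
import hashlib

class IncrementalCheckpoint:
    """Toy incremental checkpointing: memory is split into fixed-size
    regions, and only regions whose content hash changed since the last
    checkpoint are copied out."""
    def __init__(self, granularity=4):
        self.granularity = granularity
        self.hashes = {}          # region offset -> last seen digest

    def regions(self, mem):
        g = self.granularity
        return {i: bytes(mem[i:i + g]) for i in range(0, len(mem), g)}

    def checkpoint(self, mem):
        dirty = {}
        for off, data in self.regions(mem).items():
            h = hashlib.sha1(data).digest()
            if self.hashes.get(off) != h:
                self.hashes[off] = h
                dirty[off] = data  # only changed regions are copied
        return dirty

ck = IncrementalCheckpoint(granularity=4)
mem = bytearray(b"aaaabbbb")
first = ck.checkpoint(mem)    # everything is dirty the first time
mem[0] = ord('z')
second = ck.checkpoint(mem)   # only the modified region is copied
```

A coarser granularity shrinks the monitoring state but copies more data per dirty region, which is exactly the tradeoff the paper tunes per allocation.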
DASM: A Dynamic Adaptive Forward Assembly Area Method to Accelerate Restore Speed for Deduplication-Based Backup Systems
Abstract
Data deduplication plays an important role in modern backup systems for its demonstrated ability to improve storage efficiency. However, in deduplication-based backup systems, the resulting fragmentation problem, which worsens as backup frequency grows, has drawn ever-increasing attention because it degrades restore speed. Various methods have been proposed to address this problem, but most of them gain restore speed at the expense of a reduced deduplication ratio, which is not efficient.
In this paper, we present a Dynamic Adaptive Forward Assembly Area method, called DASM, to accelerate restore speed for deduplication-based backup systems. DASM exploits the fragmentation information within the restored backup streams and dynamically trades off between chunk-level caching and container-level caching. DASM is a pure data restoration module that pursues optimal read performance without sacrificing the deduplication ratio. Meanwhile, DASM is a resource-independent and cache-efficient scheme that works well under different memory footprint restrictions. To demonstrate the effectiveness of DASM, we conduct several experiments under various backup workloads. The results show that DASM is sensitive to fragmentation granularity and can accurately adapt to changes in fragmentation size. Experiments also show that DASM improves the restore speed of the traditional LRU and ASM methods by up to 58.9 % and 57.1 %, respectively.
Chao Tan, Luyu Li, Chentao Wu, Jie Li

Scheduling and Load-Balancing

Frontmatter
A Statistics Based Prediction Method for Rendering Application
Abstract
As an interesting commercial application, rendering plays an important role in animation and movie production. Generally, a render farm is used to render massive numbers of frames concurrently, exploiting the independence among frames. How to schedule and manage various rendering jobs efficiently is a significant issue for a render farm, so predicting the rendering time of frames matters for scheduling: it offers a reference and basis for the scheduling method. In this paper, a statistics-based prediction method is proposed. First, appropriate parameters that affect rendering time are extracted and analyzed by parsing .blend files, which offer a general description of the synthetic scene. Then, sample data are gathered with the open-source software Blender, and the J48 classification algorithm is used to predict rendering time. The experimental results show that the proposed method achieves a prediction accuracy of about 60 % and 75.74 % on the training and test sets, respectively, which provides a reasonable basis for scheduling jobs efficiently and saving rendering cost.
Qian Li, Weiguo Wu, Long Xu, Jianhang Huang, Mingxia Feng
IBB: Improved K-Resource Aware Backfill Balanced Scheduling for HTCondor
Abstract
HTCondor, a batch system characterized by its matchmaking mechanism, schedules jobs in FCFS order, so its performance is not as good as expected. Backfilling is a technique to address this problem. Most backfilling algorithms are based only on CPU information and leave large room for improvement by considering other resource information. The K-resource-aware scheduling algorithm Backfill Balanced (BB) selects the backfill job that best balances the usage of all resources, achieving better performance than classical backfilling algorithms. However, BB does not recognize that small jobs' impact on resource utilization is negligible and that they mainly contribute to reducing the average response time. Here we propose the IBB algorithm, which utilizes the characteristics of small jobs to guide a better job selection. We implemented IBB on HTCondor to improve its performance. Experimental results show that IBB can provide up to 60 % gains on most performance metrics compared with BB.
Lan Liu, Zhongzhi Luan, Haozhan Wang, Depei Qian
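The selection criteria described above can be sketched in a few lines. This is a guess at the spirit of the algorithms, not the paper's code: the BB balance criterion is modeled as the variance of projected per-resource utilization, and IBB's small-job preference is reduced to a runtime cutoff; all names and the cutoff value are illustrative.

```python
import statistics

def bb_score(usage, demand, capacity):
    """BB-style criterion: projected utilization imbalance if the job is
    backfilled; lower variance across resources = better balanced."""
    util = [(u + d) / c for u, d, c in zip(usage, demand, capacity)]
    return statistics.pvariance(util)

def ibb_pick(jobs, usage, capacity, small_cutoff=60):
    """Toy IBB selection: prefer small jobs (they mainly improve average
    response time), then fall back to the BB balance criterion."""
    small = [j for j in jobs if j['runtime'] <= small_cutoff]
    pool = small or jobs
    return min(pool, key=lambda j: bb_score(usage, j['demand'], capacity))

jobs = [{'runtime': 30, 'demand': [1, 1]},
        {'runtime': 600, 'demand': [4, 0]}]
pick = ibb_pick(jobs, usage=[2, 2], capacity=[8, 8])
```

With equal current usage on both resources, the evenly-demanding job also happens to be the small one, so both criteria agree here.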
Multipath Load Balancing in SDN/OSPF Hybrid Network
Abstract
Software defined networking (SDN) is an emerging network architecture that has drawn the attention of academia and industry in recent years. Constrained by investment protection, risk control and other factors, the full deployment of SDN will not be finished in the short term, resulting in a coexistence of traditional IP networks and SDN known as hybrid SDN. In this paper, we formulate the SDN controller's load-balancing optimization problem as a mathematical model. We then propose a routing algorithm, Dijkstra-Repeat, for SDN nodes, which can offer disjoint multipath routing. To make it computationally feasible for large-scale networks, we develop a new fast fully polynomial time approximation scheme (FPTAS) based on Lazy Routing Update (LRU).
Xiangshan Sun, Zhiping Jia, Mengying Zhao, Zhiyong Zhang
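A common way to obtain edge-disjoint multipaths, and a plausible reading of the Dijkstra-Repeat name, is to run Dijkstra, remove the edges of the path found, and repeat until the destination is unreachable. The sketch below implements that generic idea; it is an assumption about the algorithm, not code from the paper.

```python
import heapq

def dijkstra(adj, src, dst):
    """Standard Dijkstra on a dict-of-dicts graph; returns one shortest
    path from src to dst, or None if dst is unreachable."""
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if dst not in dist:
        return None
    path, u = [dst], dst
    while u != src:
        u = prev[u]
        path.append(u)
    return path[::-1]

def dijkstra_repeat(adj, src, dst):
    """Edge-disjoint multipath: find a shortest path, delete its edges,
    repeat until no path remains."""
    adj = {u: dict(nbrs) for u, nbrs in adj.items()}  # work on a copy
    paths = []
    while True:
        p = dijkstra(adj, src, dst)
        if p is None:
            return paths
        paths.append(p)
        for a, b in zip(p, p[1:]):
            del adj[a][b]

adj = {'s': {'a': 1, 'b': 2}, 'a': {'t': 1}, 'b': {'t': 1}, 't': {}}
paths = dijkstra_repeat(adj, 's', 't')
```

The load balancer can then split traffic across the returned disjoint paths.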

Heterogeneous Systems

Frontmatter
A Study of Overflow Vulnerabilities on GPUs
Abstract
GPU-accelerated computing is rapidly growing in popularity in areas such as scientific computing, database systems, and cloud environments. However, there are few investigations into the security implications of concurrently running GPU applications. In this paper, we explore security vulnerabilities of CUDA along multiple dimensions. In particular, we first present a study of the GPU stack and reveal that stack overflow in CUDA can affect the execution of other threads by manipulating different memory spaces. Then, we show that the CUDA heap is organized in a way that allows threads from the same warp, different blocks or even different kernels to overwrite each other's content, which indicates a high risk of corrupting data or steering the execution flow by overwriting function pointers. Furthermore, we verify that integer overflow and function pointer overflow in structs can also be exploited on GPUs, while attacks against format strings and exception handlers seem infeasible due to design choices of the CUDA runtime and programming language features. Finally, we propose potential solutions for preventing the presented vulnerabilities in CUDA.
Bang Di, Jianhua Sun, Hao Chen
Streaming Applications on Heterogeneous Platforms
Abstract
Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. Currently, very few cases have been streamed to demonstrate the performance impact of streaming, and a systematic investigation of when and how to apply streaming over a large number of test cases is still missing. In this paper, we use a total of 56 benchmarks to build a statistical view of the data transfer overhead and give an in-depth analysis of the impacting factors. Among the heterogeneous codes, we identify two types of non-streamable codes and three types of streamable codes, for which a streaming approach is proposed. Our experimental results on a CPU-MIC platform show that, with multiple streams, we can improve application performance by up to 90 %. Our work can serve as a generic workflow for using multiple streams on heterogeneous platforms.
Zhaokui Li, Jianbin Fang, Tao Tang, Xuhao Chen, Canqun Yang

Data Processing and Big Data

Frontmatter
DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets
Abstract
Statistical analysis of aggregated records is widely used in domains such as market research, sociological investigation and network analysis. Stratified sampling (SS), which samples each distinct group of the population separately, is preferred in practice for its high effectiveness and accuracy. In this paper, we propose a scalable and efficient algorithm named DSS for SS on large datasets. DSS executes all sampling operations in parallel by calculating the exact subsample size for each partition according to the data distribution. We implement DSS on Spark, a big-data processing system, and show through large-scale experiments that it achieves lower data-transmission cost and higher efficiency than state-of-the-art methods while maintaining high sample representativeness.
Minne Li, Dongsheng Li, Siqi Shen, Zhaoning Zhang, Xicheng Lu
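The core of stratified sampling with per-partition sizes can be sketched in plain Python. This simplification allocates each stratum's subsample size proportionally and rounds it, whereas DSS computes exact sizes and runs on Spark; the function and parameter names are illustrative.

```python
import random

def stratified_sample(strata, n, seed=0):
    """Proportional stratified sampling: each stratum contributes about
    n * |stratum| / N items, drawn without replacement."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    sample = {}
    for key, items in strata.items():
        k = round(n * len(items) / total)     # per-stratum subsample size
        sample[key] = rng.sample(items, min(k, len(items)))
    return sample

strata = {'a': list(range(80)), 'b': list(range(20))}
s = stratified_sample(strata, 10)
```

Note that rounding can make the sizes sum to slightly more or less than n; computing exact sizes that always sum to n (as DSS does) requires a correction step omitted here.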
A Fast and Better Hybrid Recommender System Based on Spark
Abstract
With the rapid development of information technology, recommender systems have become critical components for tackling information overload. As an important branch, weighted hybrid recommender systems are widely used in electronic commerce sites, social networks and video websites such as Amazon, Facebook and Netflix. In practice, developers typically set a weight for each recommendation algorithm by repeating experiments until accuracy improves. Although this method can improve accuracy, it depends heavily on the experience of the developers, and the improvements are modest. Worse, the workload becomes heavy as the number of algorithms grows. To further improve the performance of recommender systems, we design an optimal hybrid recommender system on Spark. Experimental results show that the system improves accuracy, reduces execution time and handles large-scale datasets. Accordingly, the hybrid recommender system balances accuracy and execution time.
Jiali Wang, Hang Zhuang, Changlong Li, Hang Chen, Bo Xu, Zhuocheng He, Xuehai Zhou
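The weighted hybrid idea, and the replacement of hand-tuned weights with an automatic search, can be illustrated minimally. This two-recommender grid search is an assumption for illustration, not the paper's optimization method or its Spark implementation; all names are invented.

```python
def blend(scores_a, scores_b, w):
    """Weighted hybrid: combine two recommenders' item-score dicts."""
    return {i: w * scores_a[i] + (1 - w) * scores_b[i] for i in scores_a}

def grid_search_weight(scores_a, scores_b, truth, steps=11):
    """Pick the blend weight minimizing squared error on held-out
    ratings, instead of hand-tuning by repeated experiments."""
    best_w, best_err = 0.0, float('inf')
    for k in range(steps):
        w = k / (steps - 1)
        mixed = blend(scores_a, scores_b, w)
        err = sum((mixed[i] - truth[i]) ** 2 for i in truth)
        if err < best_err:
            best_w, best_err = w, err
    return best_w

# one item whose true rating sits halfway between the two predictions
a, b, truth = {1: 1.0}, {1: 0.0}, {1: 0.5}
w = grid_search_weight(a, b, truth)
```

With more algorithms the grid grows combinatorially, which is why the manual approach the abstract criticizes becomes a heavy workload.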
Discovering Trip Patterns from Incomplete Passenger Trajectories for Inter-zonal Bus Line Planning
Abstract
Collecting the trajectories occurring in a city and mining the patterns implied in them can support ITS (Intelligent Transportation System) applications and foster the development of smart cities. To improve the operation of inter-zonal buses in cities, we define a new trip pattern, the frequent bus passenger trip pattern for bus lines (FBPT4BL pattern for short). We utilize passenger trajectories from bus smart card data and propose a two-phase approach to mine FBPT4BL patterns and then recommend inter-zonal bus lines. We conduct extensive experiments on real data from the Beijing Public Transport Group. By comparing the experimental results with the actual operation of inter-zonal buses at the Beijing Public Transport Group, we verify the validity of our proposed method.
Zhaoyang Wang, Beihong Jin, Fusang Zhang, Ruiyang Yang, Qiang Ji
FCM: A Fine-Grained Crowdsourcing Model Based on Ontology in Crowd-Sensing
Abstract
Crowd sensing among users with smart mobile devices is a new trend in the development of the Internet. In order to recommend suitable service providers for crowd sensing requests, this paper presents a Fine-grained Crowdsourcing Model (FCM) based on ontology theory that helps users select appropriate service providers. First, the characteristic properties extracted from the service request are compared with those of the service provider based on ontology triples. Second, a recommendation index for each service provider is calculated through similarity analysis and cluster analysis. Finally, a service decision tree is proposed to predict and recommend appropriate candidate users to participate in the crowd sensing service. Experimental results show that this method provides more accurate recommendations than existing recommendation systems and, thanks to the clustering algorithm, consumes less time to find the service provider.
Jian An, Ruobiao Wu, Lele Xiang, Xiaolin Gui, Zhenlong Peng
QIM: Quantifying Hyperparameter Importance for Deep Learning
Abstract
Recently, Deep Learning (DL) has become extremely popular because it achieves breakthroughs in many areas such as image processing and face identification. The performance of DL models depends critically on hyperparameter settings. However, existing approaches that quantify the importance of these hyperparameters are time-consuming.
In this paper, we propose a fast approach, called QIM, to quantify the importance of DL hyperparameters. It leverages Plackett-Burman design to collect as little data as possible while still correctly quantifying hyperparameter importance. We conducted experiments with the popular deep learning framework Caffe on different datasets to evaluate QIM. The results show that QIM can rank the importance of DL hyperparameters correctly at very low cost.
Dan Jia, Rui Wang, Chengzhong Xu, Zhibin Yu
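Plackett-Burman screening, the statistical tool QIM builds on, can be shown concretely: an 8-run design covers up to 7 two-level factors (here, hyperparameters coded as low/high), and each factor's main effect is the mean response at the high level minus the mean at the low level. The sketch below is the standard textbook construction, not QIM itself; how QIM maps DL hyperparameters onto levels is not shown.

```python
def pb_design():
    """8-run Plackett-Burman design for up to 7 two-level factors,
    built from the standard cyclic generator plus an all-minus row."""
    gen = [1, 1, 1, -1, 1, -1, -1]
    rows = [gen[-i:] + gen[:-i] for i in range(7)]  # 7 cyclic shifts
    rows.append([-1] * 7)
    return rows

def main_effects(design, responses):
    """Effect of factor j = mean response at +1 minus mean at -1;
    a larger |effect| means a more important factor."""
    n = len(design)
    return [sum(r * row[j] for row, r in zip(design, responses)) / (n / 2)
            for j in range(len(design[0]))]

design = pb_design()
# synthetic responses that depend only on factor 0, with effect size 4
responses = [2.0 * row[0] for row in design]
effects = main_effects(design, responses)
```

Because the design's columns are orthogonal, 8 training runs suffice to rank up to 7 hyperparameters, which is the source of QIM's low cost.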

Algorithms and Computational Models

Frontmatter
Toward a Parallel Turing Machine Model
Abstract
In the field of parallel computing, the late leader Ken Kennedy raised a concern in the early 1990s: “Is Parallel Computing Dead?” Now, we have witnessed the tremendous momentum of the “second spring” of parallel computing in recent years. But what lessons should we learn from the history of parallel computing as we walk out of the bottom state of the field?
To this end, this paper examines the disappointing state of work on parallel Turing machine models over the past 50 years of parallel computing research. The lack of a solid yet intuitive parallel Turing machine model will continue to be a serious challenge. Our paper presents an attempt to address this challenge by proposing a parallel Turing machine model, the PTM model. We also discuss why we start our work from a parallel Turing machine model instead of other choices.
Peng Qu, Jin Yan, Guang R. Gao
On Determination of Balance Ratio for Some Tree Structures
Abstract
In this paper, we study the problem of finding the maximal number of red nodes in a kind of balanced binary search tree. We present a dynamic programming formula for computing r(n), the maximal number of red nodes of a red-black tree with n keys. The first dynamic programming algorithm uses \(O(n^2\log n)\) time and \(O(n\log n)\) space. The basic algorithm is then improved: the time complexity is finally reduced to O(n) and the space to only \(O(\log n)\).
Daxin Zhu, Tinran Wang, Xiaodong Wang
Backmatter
Metadata
Title
Network and Parallel Computing
Edited by
Guang R. Gao
Depei Qian
Xinbo Gao
Barbara Chapman
Wenguang Chen
Copyright year
2016
Electronic ISBN
978-3-319-47099-3
Print ISBN
978-3-319-47098-6
DOI
https://doi.org/10.1007/978-3-319-47099-3