Inference and Declaration of Independence in Task-Parallel Programs

Abstract

The inherent difficulty of thread-based shared-memory programming has recently motivated research in high-level, task-parallel programming models. Recent advances of Task-Parallel models add implicit synchronization, where the system automatically detects and satisfies data dependencies among spawned tasks. However, dynamic dependence analysis incurs significant runtime overheads, because the runtime must track task resources and use this information to schedule tasks while avoiding conflicts and races.

We present SCOOP, a compiler that effectively integrates static and dynamic analysis in code generation. SCOOP combines context-sensitive points-to, control-flow, escape, and effect analyses to remove redundant dependence checks at runtime. Our static analysis can work in combination with existing dynamic analyses and task-parallel runtimes that use annotations to specify tasks and their memory footprints. We use our static dependence analysis to detect non-conflicting tasks and an existing dynamic analysis to handle the remaining dependencies. We evaluate the resulting hybrid dependence analysis on a set of task-parallel programs.

Foivos S. Zakkak, Dimitrios Chasapis, Polyvios Pratikakis, Angelos Bilas, Dimitrios S. Nikolopoulos

BDDT: Block-Level Dynamic Dependence Analysis for Task-Based Parallelism

Abstract

We present BDDT, a task-parallel runtime system that dynamically discovers and resolves dependencies among parallel tasks. BDDT allows the programmer to specify detailed task footprints on any memory address range, multidimensional array tile or dynamic region. BDDT uses a block-based dependence analysis with arbitrary granularity. The analysis is applicable to existing C programs without having to restructure object or array allocation, and provides flexibility in array layouts and tile dimensions.

We evaluate BDDT using a representative set of benchmarks, and we compare it to SMPSs (the equivalent runtime system in StarSs) and OpenMP. BDDT performs comparable to or better than SMPSs and is able to cope with task granularity as much as one order of magnitude finer than SMPSs. Compared to OpenMP, BDDT performs up to 3.9× better for benchmarks that benefit from dynamic dependence analysis. BDDT provides additional data annotations to bypass dependence analysis. Using these annotations, BDDT outperforms OpenMP also in benchmarks where dependence analysis does not discover additional parallelism, thanks to a more efficient implementation of the runtime system.

George Tzenakis, Angelos Papatriantafyllou, Hans Vandierendonck, Polyvios Pratikakis, Dimitrios S. Nikolopoulos

A User-Level NUMA-Aware Scheduler for Optimizing Virtual Machine Performance

Abstract

Commodity servers deployed in the data centers are now typically using the Non-Uniform Memory Access (NUMA) architecture. The NUMA multicore servers provide scalable system performance and cost-effective property. However, virtual machines (VMs) running on NUMA systems will access remote memory and contend for shared on-chip resources, which will decrease the overall performance of VMs and reduce the efficiency, fairness, and QoS that a virtualized system is capable to provide. In this paper, we propose a “Best NUMA Node” based virtual machine scheduling algorithm and implement it in a user-level scheduler that can periodically adjust the placement of VMs running on NUMA systems. Experimental results show that our NUMA-aware virtual machine scheduling algorithm is able to improve VM performance by up to 23.4% compared with the default CFS (Completely Fair Scheduler) scheduler used in KVM. Moreover, the algorithm achieves more stable virtual machine performance.

Yuxia Cheng, Wenzhi Chen, Xiao Chen, Bin Xu, Shaoyu Zhang

Towards RTOS: A Preemptive Kernel Basing on Barrelfish

Abstract

Multi-core or even many-core systems are becoming common. New operating system architectures are needed to meet the ever increasing performance requirements of multi-/many-core based embedded systems. Barrelfish is a multi-/many-core oriented open-source operating system built by ETH Zurich and Microsoft Research in a multi-kernel style. Every kernel runs on a dedicated core with local interrupts disabled, which may result in a failure in real-time systems. To make it preemptive, new kernel state and private kernel stack are added. Capability is the central mechanism of Barrelfish, lots of analyses and modifications have been done to make the duration interrupts are disabled as short as possible when kernel objects in capability space are accessed. As a result, meaningful real-time performance have been obtained. Combined with the inherent scalability of multi-kernel design in Barrelfish, the work in this paper could be a useful reference for multi-/many-core based embedded systems.

Jingwei Yang, Xiang Long, Xukun Shen, Lei Wang, Shuaitao Feng, Siyao Zheng

Interrupt Modeling and Verification for Embedded Systems Based on Time Petri Nets

Abstract

The embedded systems are interrupt-driven systems, but the triggered methods of interrupts are with randomness and uncertainty. The behavior of interrupt can be quite difficult to fully understand, and many catastrophic system failures are caused by unexpected behaviors. Therefore, interrupt-driven systems need high quality tests, but there is lack of effective interrupt system detection method at present. In this paper, a modeling method of interrupt system is firstly proposed based on time Petri nets, which has ability of describing concurrency and time series. Then the time petri nets are transformed into timed automata for model checking. Consequentially, a symbolic encoding approach is investigated for formalized timed automata, through which the timed automata could be bounded model checked (BMC) with regard to invariant properties by using Satisfiability Modulo Theories (SMT) solving technique. Finally, Z3 is used in the experiments to evaluate the effectiveness of our approach.

Gang Hou, Kuanjiu Zhou, Junwang Chang, Rui Li, Mingchu Li

Pruning False Positives of Static Data-Race Detection via Thread Specialization

Abstract

Static data-race detection is a powerful tool by providing clues for dynamic approaches to only instrument certain memory accesses. However, static data-race analysis suffers from high false positive rate. A key reason is that static analysis overestimates the set of shared objects a thread can access. We propose thread specialization to distinguish threads statically. By fixing the number of threads as well as the ID assigned to each thread, a program can be transformed to a simplified version. Static data-race analysis on this simplified program can infer the range of addresses accessed by each thread more accurately. Our approach prunes false positives by an average of 89.2% and reduces dynamic instrumentation by an average of 63.4% in seven benchmarks.

Chen Chen, Kai Lu, Xiaoping Wang, Xu Zhou, Li Fang

G-Paradex: GPU-Based Parallel Indexing for Fast Data Deduplication

Abstract

Deduplication technology has been increasingly used to reduce the storage cost. In practice, the duplicate detection upon large on-disk index incurs unavoidable and significant overheads in write operations. Most existing deduplication methods perform single-pass processing, while pay little attention to develop highly parallel methods for the emerging parallel processors. In this paper, we present the design of G-Paradex, a novel deduplication framework that can significantly reduce the duplicate detecting time. Utilizing a prefix tree to organize the chunk fingerprints, G-Paradex is able to do fast deduplicating by using GPU to search the target tree in parallel. Leveraging the inherent chunk locality in writing data stream, we group consecutive chunks and extract the handprints into the prefix tree, aiming at shrinking the index size and reducing the on-disk accesses. Our experimental evaluation based on real-world datasets demonstrate that, compared with the traditional single-pass method, G-aparadex achieves a speedup of 2-4X for duplicate detecting.

Bin Lin, Xiangke Liao, Shanshan Li, Yufeng Wang, He Huang, Ling Wen

Research on SPH Parallel Acceleration Strategies for Multi-GPU Platform

Abstract

This paper proposes an acceleration strategy for SPH on single-node multi-GPU platform. First the acceleration strategy for SPH on single-GPU is studied in conjunction with the characteristics of architecture. Then the changing pattern of SPH’s computation time has been discussed. Based on the fact that the changing pattern is rather slow, using a simple dynamic load balancing algorithm an acceptable load balance is obtained on multi-GPU. Finally, an almost linear speedup is achieved on multi-GPU by further optimizing dynamic load balancing algorithm and communication strategy among multiple GPUs

Lei Hu, Xukun Shen, Xiang Long

Binarization-Based Human Detection for Compact FPGA Implementation

Abstract

The implementation of human detection in the embedded domain can be a challenging issue. In this paper, a real-time, low-power human detection method with high detection accuracy is implemented on a low-cost field-programmable gate array (FPGA) platform. For the histogram of oriented gradients feature and linear support vector machine classifier, the binarization process is employed instead of normalization, as the original algorithm is unsuitable for compact implementation. Furthermore, pipeline architecture is introduced to accelerate the processing rate. The initial experimental results demonstrate that the proposed implementation achieved 293 fps by using a low-end Xilinx Spartan-3e FPGA. The detection accuracy attained a miss rate of 1.97% and false positive rate of 1%. For further demonstration, a prototype is developed using an OV7670 camera device. With the speed of the camera device, 30 fps can be achieved, which satisfies most real-time applications. Considering the energy restriction of the battery-based system at a speed of 30 fps, the implementation can work with a power consumption of less than 353mW.

Shuai Xie, Yibin Li, Zhiping Jia, Lei Ju

HPACS: A High Privacy and Availability Cloud Storage Platform with Matrix Encryption

Abstract

As the continuous development of cloud computing and big data, data storage as a service in the cloud is becoming increasingly popular. More and more individuals and organizations begin to store their data in cloud rather than building their own data centers. Cloud storage holds the advantages of high reliability, simple management and cost-effective. However, the privacy and availability of the data stored in cloud is still a challenge. In this paper, we design and implement a High Privacy and Availability Cloud Storage (HPACS) platform built on Apache Hadoop to improve the data privacy and availability. A matrix encryption and decryption module is integrated in HDFS, through which the data can be encoded and reconstructed to/from different storage servers transparently. Experimental results show that HPACS can achieve high privacy and availability but with reasonable write/read performance and storage capacity overhead as compared with the original HDFS.

Yanzhang He, Xiaohong Jiang, Kejiang Ye, Ran Ma, Xiang Li

ECAM: An Efficient Cache Management Strategy for Address Mappings in Flash Translation Layer

Abstract

Solid State Drives (SSDs) have been widely adopted in both enterprise and embedded storage systems with the great improvement in NAND flash memory technology. With the growing size of NAND flash memory, how to keep the most active address mappings be cached in limited on-flash SRAM is crucial to a Flash Translation Layer (FTL) scheme, that plays an important role in managing NAND flash. In this paper, we propose an efficient cache management strategy, called ECAM, to enhance the capability of caching page-level address mappings in demand-based Flash Translation Layer. In ECAM, we optimize the structure of Cached Mapping Table (CMT) to record multiple address mappings with consecutive logical page numbers and physical page numbers in just one mapping entry, and propose another two tables, Cached Split Table (CST) and Cached Translation Table (CTT). CST can cache the split mapping entries caused by the partial updates in CMT and CTT is used to reduce the overhead of address translation for large number of sequential requests. By the cooperation of CMT, CST and CTT, ECAM implements an efficient two-tier selective caching strategy to jointly exploit the temporal and spatial localities of workloads. The simulation on various realistic workloads shows that ECAM can improve the cache hit ratio and reduce the number of expensive extra read/write operations between SRAM and flash efficiently.

Xuchao Xie, Qiong Li, Dengping Wei, Zhenlong Song, Liquan Xiao

A Performance Study of Software Prefetching for Tracing Garbage Collectors

Abstract

In automatic memory management programming language, the garbage collector is an important feature which reclaims memory that is no longer referenced by the program. The tracing collector performs a transitive closure across the object reference graph. And this process may result in too many cache misses, because the traverse exhibits poor locality. Previous studies have already proposed using software prefetching to improve memory performance of garbage collector. In this work, we studied the characteristic of prefetch instructions on x86 platforms in detail, and proposed a method to control the prefetch ratios by address thresholds. The experimental results show that a further improvement and optimization is obtained by tuning the prefetch ratio. We use these results to devise some guidelines for optimizing the memory performance as well as minimizing the collection pause times.

Hao Wu, Zhenzhou Ji, Suxia Zhu, Zhigang Chen

Adaptive Implementation Selection in the SkePU Skeleton Programming Library

Abstract

In earlier work, we have developed the SkePU skeleton programming library for modern multicore systems equipped with one or more programmable GPUs. The library internally provides four types of implementations (implementation variants) for each skeleton: serial C++, OpenMP, CUDA and OpenCL targeting either CPU or GPU execution respectively. Deciding which implementation would run faster for a given skeleton call depends upon the computation, problem size(s), system architecture and data locality.

In this paper, we present our work on automatic selection between these implementation variants by an offline machine learning method which generates a compact decision tree with low training overhead. The proposed selection mechanism is flexible yet high-level allowing a skeleton programmer to control different training choices at a higher abstraction level. We have evaluated our optimization strategy with 9 applications/kernels ported to our skeleton library and achieve on average more than 94% (90%) accuracy with just 0.53% (0.58%) training space exploration on two systems. Moreover, we discuss one application scenario where local optimization considering a single skeleton call can prove sub-optimal, and propose a heuristic for bulk implementation selection considering more than one skeleton call to address such application scenarios.

Usman Dastgeer, Lu Li, Christoph Kessler

Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

Abstract

This paper presents a technique to fully automatically generate efficient and readable code for parallel processors. We base our approach on skeleton-based compilation and ‘algorithmic species’, an algorithm classification of program code. We use a tool to automatically annotate C code with species information where possible. The annotated program code is subsequently fed into the skeleton-based source-to-source compiler ‘Bones’, which generates OpenMP, OpenCL or CUDA code and optimises host-accelerator transfers. This results in a unique approach, integrating a skeleton-based compiler for the first time into an automated flow. We demonstrate the benefits of our approach on the PolyBench suite by showing average speed-ups of 1.4x and 1.6x for GPU code compared to ppcg and Par4All, two state-of-the-art compilers.

Cedric Nugteren, Pieter Custers, Henk Corporaal

Optimizing Program Performance via Similarity, Using a Feature-Agnostic Approach

Abstract

This work proposes a new technique for performance evaluation to predict performance of parallel programs across diverse and complex systems. In this work the term system is comprehensive of the hardware organization, the development and execution environment.

The proposed technique considers the collection of completion times for some pairs (program, system) and constructs an empirical model that learns to predict performance of unknown pairs (program, system). This approach is feature-agnostic because it does not involve previous knowledge of program and/or system characteristics (features) to predict performance.

Experimental results conducted with a large number of serial and parallel benchmark suites, including SPEC CPU2006, SPEC OMP2012, and systems show that the proposed technique is equally applicable to be employed in several compelling performance evaluation studies, including characterization, comparison and tuning of hardware configurations, compilers, run-time environments or any combination thereof.

Rosario Cammarota, Laleh Aghababaie Beni, Alexandru Nicolau, Alexander V. Veidenbaum

Scalable NIC Architecture to Support Offloading of Large Scale MPI Barrier

Abstract

MPI collective communication overhead dominates the communication cost for large scale parallel computers, scalability and operation latency for collective communication is critical for next generation computers. This paper proposes a fast and scalable barrier communication offload approach which supports millions of compute cores. Following our approach, the barrier operation sequence is packed by host MPI driver into the barrier ”descriptor”, which is pushed to the NIC (Network-Interfaces). The NIC can complete the barrier automatically following its algorithm descriptor. Our approach leverages an enhanced dissemination algorithm which is suitable for current large scale networks. We show that our approach achieves both barrier performance and scalability, especially for large scale computer system. This paper also proposes an extendable and easy-to-implement NIC architecture supporting barrier offload communication and also other communication pattern.

Shaogang Wang, Weixia Xu, Dan Wu, Zhengbin Pang, Pingjing Lu

Hierarchical Clustering Routing Protocol Based on Optimal Load Balancing in Wireless Sensor Networks

Abstract

Nowadays, Wireless Sensor Networks (WSNs) are widely used in different aspects of human life such as agriculture, health care systems, traffic engineering and so on. By improving WSN’s design and applicability, still due to energy limitation in the sensors, the main concern is about the lifetime of sensors. Beside hardware aspect, establishing some efficient techniques for data sensing, processing and transition in the sensors can increase WSN’s lifetime. In this paper, a new optimized routing protocolcalled OLBHC (Optimal Load Balancing Hierarchical Clustering) is designed. In contrast to some traditional methods such as LEACH, OLBHC could decrease the energy consumption in the sensors by utilizing the equalization method. In OLBHC, a Flag matrix which stores cluster head nodes’ connection status is used and then an optimization algorithm is applied for selecting the best clusters in the network based on the energy consumption. The simulation results prove this proposed approach’s efficiency because by applying OLBHC the average lifetime of nodes is about 160% longer than LEACH and almost all of the nodes are dead in LEACH when the first node begins to die in OLBHC.

Tianshu Wang, Gongxuan Zhang, Xichen Yang, Ahmadreza Vajdi

An Efficient Parallel Mechanism for Highly-Debuggable Multicore Simulator

Abstract

Fast multicore simulators are extremely useful in evaluating design alternatives and enabling early software development. Among the state-of-the-art multicore simulators, Simics is a very popular used one both in academia and industry. It has a powerful debugging system, and also provides an accelerator to support multithreaded or distributed simulation. However, this kind of parallel mechanism mainly aims at speed up distributed systems. It is not suitable for the shared-memory multicore systems which are much more commonly used. In this paper, we propose a novel parallel mechanism to improve the simulation speed of shared-memory multicore systems. More importantly, our approach is compatible with other optimizations and exist debugging systems used in Simics. Experimental results show that our parallel approach achieves an average speedup of 9.6× (up to 12.2×) when running SPLASH-2 kernel on a 16-core host machine.

Xiaochun Ye, Dongrui Fan, Da Wang, Fenglong Song, Hao Zhang, Zhimin Tang

Data Access Type Aware Replacement Policy for Cache Clustering Organization of Chip Multiprocessors

Abstract

Chip multiprocessors (CMPs) are becoming the trend of mainstream computing platforms. The design of an efficient on-chip memory hierarchy is one of the key challenges in computer architecture. Tiled architecture and non-uniform cache architecture (NUCA) is commonly adopted in modern CMPs. Previous efforts on cache replacement policy usually assume an unified last-level cache or running multiprogrammed workloads. However, few researches focus on the replacement policy of cache clustering scheme running parallel workloads. Cache clustering scheme can improve the system performance on parallel performance, which is a tradeoff between shared cache organization and private cache organization which adopts cache replication. In cache clustering scheme, cache blocks in last-level cache can be subdivided into eight types.

In this work we propose Data access Type Aware Replacement Policy (DTARP) for cache clustering organization, DTARP classifies data blocks in last-level cache into different access types, and designs the insertion and the victim selection policies according to different data access types based on traditional LRU policy. The global shared data will be kept in last-level cache longer than before. Simulation results show that DTARP can improve the system performance of cluster scheme using LRU policy by 10.9% on average.

Chongmin Li, Dongsheng Wang, Haixia Wang, Guohong Li, Yibo Xue

Agent-Based Credibility Protection Model for Decentralized Network Computing Environment

Abstract

Decentralized networks have some unique characteristics, such as heterogeneity, autonomy, distribution and openness, which lead to serious security issues and low credibility. In this paper, the agent technology is utilized to construct a novel credibility protection model, which efficiently makes open, disordered, decentralized networks evolve gradually as an orderly, stable and reliable computing environment. Within this model, our research is focused on the credibility measurement mechanism, and a novel behavior-credibility evaluation algorithm is proposed. The evaluation value of behavior-credibility is calculated based on the current and historical performances of nodes as service providers and consumers to make an accurate prediction. Based on a decentralized network simulation environment, we conducted several rounds of simulation experiments to test the function, performance and validity of the model, mechanism and algorithm.

Xiaolong Xu, Qun Tu, Xinheng Wang

A Vectorized K-Means Algorithm for Intel Many Integrated Core Architecture

Abstract

The K-Means algorithms is one of the most popular and effective clustering algorithms for many practical applications. However, direct K-Means methods, taking objects as processing unit, is computationally expensive especially in Objects-Assignment phase on Single-Instruction Single-Data (SISD) processors, typically as CPUs. In this paper, we propose a vectorized K-Means algorithm for Intel Many Integrated Core (MIC) coprocessor, a newly released product from Intel for highly parallel workloads. This new algorithm is able to achieve fine-grained Single-Instruction Multiple-Data (SIMD) parallelism by taking each dimension of all objects as a long vector. This vectorized algorithm is suitable for any-dimensional objects, which is little taken into consideration in preceding works. We also parallelize the vectorized K-Means algorithm on MIC coprocessor to achieve coarse-grained thread-level parallelism. Finally, we implement and evaluate the vectorized method on the first generation of Intel MIC product. Measurements show that this algorithm based on MIC coprocessor gets desired speedup to sequential algorithm on CPU and demonstrate that MIC coprocessor owns highly parallel computational power as well as scalability.

Fuhui Wu, Qingbo Wu, Yusong Tan, Lifeng Wei, Lisong Shao, Long Gao

Towards Eliminating Memory Virtualization Overhead

Abstract

Despite of continuous efforts on reducing virtualization overhead, memory virtualization overhead remains significant for certain applications and often the last piece standing. Hardware vendors even dedicate hardware resources to help minimize this overhead. Each of the two typical approaches, shadow paging (SP) and hardware assisted paging (HAP), has its own advantages and disadvantages. Hardware-dependent HAP eliminates most VM exits caused by page faults, but suffers from higher penalties of TLB misses. On the other hand, the software-only approach, SP, enjoys shorter page walk latencies while paying for the VM exits to maintain the consistency between the shadow page table and the guest page table. We observe that, although HAP and SP each holds its ground for a set of applications in a 32-bit virtual machine (VM), SP almost always performs on a par with or better than HAP in a 64-bit system, which steadily gains its popularity. This paper examines the root cause of this inconsistency through a series of experiments and shows that the major overhead of shadow paging in a 32-bit system can be substantially reduced using a customized memory allocator. We conclude that memory virtualization overhead can be minimized with software-only approaches and therefore hardware-assisted paging might no longer be necessary.

Xiaolin Wang, Lingmei Weng, Zhenlin Wang, Yingwei Luo

An Improved FPGAs-Based Loop Pipeline Scheduling Algorithm for Reconfigurable Compiler

Abstract

Reconfigurable compilers have shown significant promise in the field of reconfigurable computing, and pipeline scheduling algorithms are typically concerned with improving iteration performance or saving the resources. However, the lack of loop pipeline scheduling algorithm for reconfigurable systems hampers the widespread adoption of fine-grained reconfigurable compilers. This paper presents an improved FPGAs-based loop pipeline scheduling algorithm and has realized it in ASCRA (Application-Specific Compiler for Reconfigurable Architecture) compilation framework. In FPGAs-based loop pipeline scheduling algorithm, the adequate consideration of hardware operation logic delay can save the resources of pipelining and ensure the performance of reconfigurable systems. Both of iterations with carried dependencies and without carried dependency have been considered. The preliminary experiment results show that it can economize more than 20% of the register resources by combining the adjacent pipeline stages without influencing the performance, and the algorithm is feasible for the other fine-grained reconfigurable compilers.

Zhenhua Guo, Yanxia Wu, Guoyin Zhang, Tianxiang Sui

ACF: Networks-on-Chip Deadlock Recovery with Accurate Detection and Elastic Credit

Abstract

The key idea of progressive deadlock recovery scheme is providing an escape path outside the deadlock cycle. State-of-the-art schemes, such as Virtual Channel Reallocation (VCR) and DISHA, set up escape paths by dedicated deadlock-handling channels. These channels use additional data paths, central buffers and bypass logic. This paper proposed a novel deadlock recovery scheme for NoC, using Accurate on-Cycle Forwarding (ACF) path inside deadlock cycle to drain blocked packets. ACF does not decouple deadlock-handling channels from normal routing channels, but enhanced credit flow control on each channels to allow accurate deadlock detection and recovery. ACF method is constructed by three tightly combined components: adaptive routing, run-time accurate deadlock detection, and deadlock removal. We implemented ACF algorithm by O(n) time complexity and by distributed modules cooperating with NoC. The rigorous valuation on multiple traffic patterns shows that our scheme achieves significant performance improvement. ACF detected 10-20 times less fake deadlock alarms than approximation and heuristic time-out approaches. In high packet injection rates interval (40%-75%) where network is more frequently troubled by deadlock, ACF provides 67% communication latency improvement comparing to dimension order routing, and 14%-45% to VCR and DISHA. Moreover, the power consumption and hardware overhead of ACF are light.

Nan Wu, Yuran Qiao, Mei Wen, Chunyuan Zhang

An Auction and League Championship Algorithm Based Resource Allocation Mechanism for Distributed Cloud

Abstract

In cloud computing, all kinds of idle resources can be pooled to establish a resource pool, and different kinds of resources combined as a service is provided to users through virtualization. Therefore, an effective mechanism is necessary for managing and allocating the resources. In this paper, we propose a double combinatorial auction based allocation mechanism based on the characteristics of cloud resources and inspired by the flexibility and effectiveness of microeconomic methods. The feedback evaluation based reputation system with attenuation coefficient of time and the hierarchy of users introduced is implemented to avoid malicious behavior. In order to make decisions scientifically, we propose a price decision mechanism based on a BP (back propagation) neural network, in which various factors are taken into account, so the bidding/asking prices can adapt to the changing supply-demand relation in the market. Since the winner determination is an NP hard problem, a league championship algorithm is introduced to achieve optimal allocation with the optimization goals being market surplus and total reputation. We also conduct empirical studies to demonstrate the feasibility and effectiveness of the proposed mechanism.

Jiajia Sun, Xingwei Wang, Keqin Li, Chuan Wu, Min Huang, Xueyi Wang

Accelerating Software Model Checking Based on Program Backbone

Abstract

Model checking technique has been gradually applied to verify the reliability of software systems. However, as to software with large scale and complex structure, the verification procedure suffers from the state space explosion, thus leading to a low efficiency or resource exhaustion. In this paper, we propose a method of accelerating software model checking based on program backbone to verify the properties in ANSI-C source codes. We prune the program with respect to the assertion property, and compress the circular paths by maximal strongly connected components to extract the program backbone. Subsequently, the Hoare theory is used to generate an invariant from the compressed circular nodes, which reduces the length of path encoding. The assertion property is finally translated into a quantifier-free formula and checked for satisfiability by the SMT solver Z3. The experiments show that this method substantially improves the efficiency of program verification.

Kuanjiu Zhou, Jiawei Yong, Xiaolong Wang, Longtao Ren, Gang Hou, Junwang Chang

A Cloud Computing System for Snore Signals Processing

Abstract

Recently, snore signals (SS) have been demonstrated carrying significant information about the obstruction site and degree in the upper airway of Obstructive Sleep Apnea-Hypopnea Syndrome (OSAHS) suffers. To make this acoustic based method more accurate and robust, big SS data processing and analysis are necessary. Cloud computing has the potential to enhance decision agility and productivity while enabling greater efficiencies and reducing costs. We look to cloud computing as the structure to support processing big SS data. In this paper, we focused on the aspects of a Cloud environment that processing big SS data using software services hosted in the Cloud. Finally, we set up a group of comparable experiments to evaluate the performance of our proposed system with different system scales.

Jian Guo, Kun Qian, Zhaomeng Zhu, Gongxuan Zhang, Huijie Xu

Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance

Abstract

With the rapid growth of the high performance computer system size and complexity, passive fault tolerance can no longer effectively provide reliability of the system because of the high overhead and poor scalability of these methods. Hybrid fault tolerant method which is the combination of passive and active fault tolerant approaches has the potential to be widely used in fault tolerance of exascale system. However, there are still many issues of this method need to be ironed out. This paper focuses on the issues of checkpointing of hybrid fault tolerant method. A common question surrounding checkpointing is the optimization of the checkpoint interval. This paper proposes two models to model the systems which adopt hybrid fault tolerance. By comparing their results with the simulation, this paper evaluates the effectiveness of these two models. Experimental result shows that the modified model can not only predict the total work time excellently, but also can predict the optimum checkpoint interval precisely.

Lei Zhu, Jianhua Gu, Yunlan Wang, Tianhai Zhao

Programming Real-Time Image Processing for Manycores in a High-Level Language

Abstract

Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for mastering proprietary programming languages that are low-level and hinder portability.

We propose the use of the concurrent programming language occam-pi as a high-level language for programming an emerging class of manycore architectures. We show how to map occam-pi programs to the manycore architecture Platform 2012 (P2012). We describe the techniques used to translate the salient features of the language to the native programming model of the P2012. We present the results from a case study on a representative algorithm in the domain of real-time image processing: a complex algorithm for corner detection called Features from Accelerated Segment Test (FAST). Our results show that the occam-pi program is much shorter, is easier to adapt and has a competitive performance when compared to versions programmed in the native programming model of P2012 and in OpenCL.

Essayas Gebrewahid, Zain-ul-Abdin, Bertil Svensson, Veronica Gaspes, Bruno Jego, Bruno Lavigueur, Mathieu Robart

Self-adaptive Retransmission for Network Coding with TCP

Abstract

Incorporating network coding with TCP is a natural way to enhance the robustness and effectiveness of data transmission in lossy channels, it can mask packet loss by mixing data across time and across flows. The key of this approach is a suitable retransmission scheme which can adjust according to the changed of the lossy channel condition. However, most retransmission schemes can’t compensate losses effectively. In this paper we propose a novel self-adaptive retransmission scheme combining prospection with compensation, which can dynamically adjust the number and time of coding packets’s retransmission according to the channel state change. Compensatory retransmission transmit exact number of packets the receiver needs for decoding all packets based on feedback, and prospective retransmission transmit extra packet before losses happened, and the redundancy factor R is adjusted based on the channel conditions. The scheme can work well on handling not only random losses but also bursty losses. Our scheme also keeps the end-to-end philosophy of TCP that the coding operations are only performed at the end hosts. Thus it is easier to be implemented in practical systems. Simulation results show that our scheme significantly outperforms the previous coding approach in reducing size of decoding matrix and decoding delay, and and produces better TCP-throughput than the standard TCP/NC, TCP-Reno.

Chunqing Wu, Hongyun Zhang, Wanrong Yu, Zhenqian Feng, Xiaofeng Hu

Springer Professional

Advanced Parallel Processing Technologies

10th International Symposium, APPT 2013, Stockholm, Sweden, August 27-28, 2013, Revised Selected Papers

About this book

Table of Contents

Frontmatter

Inference and Declaration of Independence in Task-Parallel Programs

BDDT: Block-Level Dynamic Dependence Analysis for Task-Based Parallelism

A User-Level NUMA-Aware Scheduler for Optimizing Virtual Machine Performance

Towards RTOS: A Preemptive Kernel Basing on Barrelfish

Interrupt Modeling and Verification for Embedded Systems Based on Time Petri Nets

Pruning False Positives of Static Data-Race Detection via Thread Specialization

G-Paradex: GPU-Based Parallel Indexing for Fast Data Deduplication

Research on SPH Parallel Acceleration Strategies for Multi-GPU Platform

Binarization-Based Human Detection for Compact FPGA Implementation

HPACS: A High Privacy and Availability Cloud Storage Platform with Matrix Encryption

ECAM: An Efficient Cache Management Strategy for Address Mappings in Flash Translation Layer

A Performance Study of Software Prefetching for Tracing Garbage Collectors

Adaptive Implementation Selection in the SkePU Skeleton Programming Library

Automatic Skeleton-Based Compilation through Integration with an Algorithm Classification

Optimizing Program Performance via Similarity, Using a Feature-Agnostic Approach

Scalable NIC Architecture to Support Offloading of Large Scale MPI Barrier

Hierarchical Clustering Routing Protocol Based on Optimal Load Balancing in Wireless Sensor Networks

An Efficient Parallel Mechanism for Highly-Debuggable Multicore Simulator

Data Access Type Aware Replacement Policy for Cache Clustering Organization of Chip Multiprocessors

Agent-Based Credibility Protection Model for Decentralized Network Computing Environment

A Vectorized K-Means Algorithm for Intel Many Integrated Core Architecture

Towards Eliminating Memory Virtualization Overhead

An Improved FPGAs-Based Loop Pipeline Scheduling Algorithm for Reconfigurable Compiler

ACF: Networks-on-Chip Deadlock Recovery with Accurate Detection and Elastic Credit

An Auction and League Championship Algorithm Based Resource Allocation Mechanism for Distributed Cloud

Accelerating Software Model Checking Based on Program Backbone

A Cloud Computing System for Snore Signals Processing

Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance

Programming Real-Time Image Processing for Manycores in a High-Level Language

Self-adaptive Retransmission for Network Coding with TCP

Backmatter

Premium Partner