
2011 | Book

Algorithms and Architectures for Parallel Processing

11th International Conference, ICA3PP, Melbourne, Australia, October 24-26, 2011, Proceedings, Part I

Edited by: Yang Xiang, Alfredo Cuzzocrea, Michael Hobbs, Wanlei Zhou

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 7016 and LNCS 7017, constitutes the refereed proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2011, held in Melbourne, Australia, in October 2011. The first volume presents 24 revised regular papers and 17 revised short papers together with the abstract of the keynote lecture, all carefully reviewed and selected from 85 initial submissions. The papers cover the many dimensions of parallel algorithms and architectures, encompassing fundamental theoretical approaches, practical experimental results, and commercial components and systems, and focus on two broad areas of parallel and distributed computing: architectures, algorithms and networks; and systems and applications.

Table of contents

Frontmatter

ICA3PP 2011 Keynote

Keynote: Assertion Based Parallel Debugging

Programming languages have advanced tremendously over the years, but program debuggers have hardly changed. Sequential debuggers do little more than allow a user to control the flow of a program and examine its state. Parallel ones support the same operations on multiple processes, which is adequate for a small number of processors, but becomes unwieldy and ineffective on very large machines. Typical scientific codes have enormous multi-dimensional data structures, and it is impractical to expect a user to view the data using traditional display techniques. In this seminar I will discuss the use of debug-time assertions, and show that these can be used to debug parallel programs. The techniques reduce the debugging complexity because they reason about the state of large arrays without requiring the user to know the expected value of every element. Assertions can be expensive to evaluate, but their performance can be improved by running them in parallel. I demonstrate the system with a case study finding errors in a parallel version of the Shallow Water Equations, and evaluate the performance of the tool on a 4,096-core Cray XE6.

David Abramson
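
The core idea of a debug-time assertion over a large array can be sketched in a few lines. The Python sketch below is purely illustrative and not Abramson's tool: the function name parallel_assert, the chunking scheme, and the thread pool are assumptions. It evaluates a predicate over chunks of a large array in parallel and reports only the violating indices, so the user never has to know the expected value of every element.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_assert(data, predicate, workers=4):
    """Evaluate a debug-time assertion over a large array in parallel.

    Returns the indices of elements violating the predicate, rather
    than asking the user to inspect every element by hand.
    """
    n = len(data)
    chunk = (n + workers - 1) // workers

    def check(start):
        # Each worker scans one contiguous chunk of the array.
        return [i for i in range(start, min(start + chunk, n))
                if not predicate(data[i])]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(check, range(0, n, chunk))
    return [i for part in results for i in part]

# Example: a "water height" field that must stay non-negative.
heights = [0.5] * 1000
heights[137] = -0.1          # injected bug
violations = parallel_assert(heights, lambda h: h >= 0.0)
print(violations)            # -> [137]
```

The assertion pinpoints the single bad element out of a thousand, which is the kind of condensed answer that makes large-array debugging tractable.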

ICA3PP 2011 Regular Papers

Secure and Energy-Efficient Data Aggregation with Malicious Aggregator Identification in Wireless Sensor Networks

Data aggregation in wireless sensor networks is employed to reduce the communication overhead and prolong the network lifetime. However, an adversary may compromise some sensor nodes and use them to forge false values as the aggregation result. Previous secure data aggregation schemes have tackled this problem from different angles. The goal of those algorithms is to ensure that the Base Station (BS) does not accept any forged aggregation results. But none of them have tried to detect the nodes that inject bogus aggregation results into the network. Moreover, most of them usually have a communication overhead that is (at best) logarithmic per node. In this paper, we propose a secure and energy-efficient data aggregation scheme that can detect the malicious nodes with a constant per-node communication overhead. In our solution, all aggregation results are signed with the private keys of the aggregators so that they cannot be altered by others. Nodes on each link additionally use their pairwise shared key for secure communications. Each node receives the aggregation results from its parent (sent by the parent of its parent) and its siblings (via its parent node), and verifies the aggregation result of the parent node. Theoretical analysis of energy consumption and communication overhead accords with our comparison-based simulation study over random data aggregation trees.

Hongjuan Li, Keqiu Li, Wenyu Qu, Ivan Stojmenovic
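
The sign-then-verify idea at the heart of such schemes can be sketched minimally. The following Python uses the stdlib hmac module as a stand-in for the aggregators' private-key signatures; the key handling and message format here are illustrative assumptions, not the paper's protocol.

```python
import hmac, hashlib

def sign(key: bytes, value: float) -> bytes:
    # MAC over an aggregation result; a stand-in for the aggregator's
    # private-key signature in a real scheme.
    return hmac.new(key, repr(value).encode(), hashlib.sha256).digest()

def aggregate(children_values):
    # The aggregation function itself; here a plain sum of readings.
    return sum(children_values)

# An aggregator sums its children's readings and signs the result.
agg_key = b"aggregator-secret"          # hypothetical verification key
result = aggregate([3.0, 4.5, 2.5])
tag = sign(agg_key, result)

# A sibling (receiving the result via the common parent) re-verifies it.
assert hmac.compare_digest(tag, sign(agg_key, result))

# A forged result from a compromised aggregator fails verification,
# which is what lets verifiers identify the malicious node.
forged = result + 10.0
assert not hmac.compare_digest(tag, sign(agg_key, forged))
```

Because every forwarded result carries a verifiable tag, a node that tampers with an aggregate is caught by the very next verifier, rather than only at the base station.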
Dynamic Data Race Detection for Correlated Variables

In parallel programs, concurrency bugs are often caused by unsynchronized accesses to shared memory locations, known as data races. In order to support programmers in writing correct parallel programs, it is therefore highly desirable to have tools on hand that automatically detect such data races. Today, most of these tools only consider unsynchronized read and write operations on a single memory location. Concurrency bugs that involve multiple accesses on a set of correlated variables may be completely missed. Tools may overwhelm programmers with data races on various memory locations, without noticing that the locations are correlated. In this paper, we propose a novel approach to data race detection that automatically infers sets of correlated variables and logical operations by analyzing data and control dependencies. For data race detection itself, we combine a modified version of the lockset algorithm with happens-before analysis, providing the first hybrid, dynamic race detector for correlated variables. We implemented our approach on top of Valgrind, a framework for dynamic binary instrumentation. Our evaluation confirmed that we can catch data races missed by existing detectors and provide additional information for correct bug fixing.

Ali Jannesari, Markus Westphal-Furuya, Walter F. Tichy
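
The lockset half of such a hybrid detector can be illustrated with the classic Eraser-style refinement. The Python sketch below is illustrative only (the trace format and names are assumptions) and omits the happens-before analysis and the variable-correlation inference that the paper adds on top.

```python
def lockset_check(accesses):
    """Eraser-style lockset refinement: for each shared location,
    intersect the sets of locks held at every access; an empty
    intersection flags a potential data race on that location.
    """
    candidates = {}          # location -> set of locks still protecting it
    races = set()
    for thread, location, held_locks in accesses:
        if location not in candidates:
            candidates[location] = set(held_locks)
        else:
            candidates[location] &= set(held_locks)
        if not candidates[location]:
            races.add(location)
    return races

# A hypothetical access trace: (thread, location, locks held).
trace = [
    ("T1", "x", {"L1"}),     # x is always guarded by L1 -> no report
    ("T2", "x", {"L1"}),
    ("T1", "y", {"L1"}),     # y is guarded inconsistently -> potential race
    ("T2", "y", {"L2"}),
]
print(lockset_check(trace))  # -> {'y'}
```

A detector for correlated variables would additionally merge the locksets of variables inferred to belong together, so that a consistent lock on one variable of the set does not mask a race on the set as a whole.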
Improving the Parallel Schnorr-Euchner LLL Algorithm

This paper introduces a number of modifications that allow for significant improvements of parallel LLL reduction. Experiments show that these modifications increase the speed-up by a factor of more than 1.35 for SVP challenge-type lattice bases when comparing the new algorithm with the state-of-the-art parallel LLL algorithm.

Werner Backes, Susanne Wetzel
Distributed Mining of Constrained Frequent Sets from Uncertain Data

With the advance in technology, sensor networks have been widely used in many application areas such as environmental surveillance. Sensors distributed in these networks serve as good sources for data. This calls for distributed data mining, which searches for implicit, previously unknown, and potentially useful patterns that might be embedded in the distributed data. Many existing distributed data mining algorithms do not allow users to express, via the use of constraints, the patterns to be mined according to their intention. Consequently, these unconstrained mining algorithms can yield numerous patterns that are not interesting to users. Moreover, due to inherent measurement inaccuracies and/or network latencies, the data are often riddled with uncertainty. These call for constrained mining and uncertain data mining. In this paper, we propose a tree-based system for mining frequent sets that satisfy user-defined constraints from a distributed environment, such as a wireless sensor network, of uncertain data. Experimental results show the effectiveness of our proposed system.

Alfredo Cuzzocrea, Carson K. Leung
Set-to-Set Disjoint-Paths Routing in Recursive Dual-Net

Recursive dual-net (RDN) is a newly proposed interconnection network for massively parallel computers. The RDN is based on recursive dual-construction of a symmetric base network. A k-level dual-construction for k > 0 creates a network containing $(2n_0)^{2^k}/2$ nodes with node degree d_0 + k, where n_0 and d_0 are the number of nodes and the node degree of the base network, respectively. The RDN is a symmetric graph and can contain a huge number of nodes with small node degree and short diameter. Set-to-set disjoint-paths routing is fundamental and has many applications for fault-tolerant and secure communications in a network. In this paper, we propose an efficient algorithm for set-to-set disjoint-paths routing in RDN. We show that, given two sets S and T of d_0 + k nodes each in RDN^k(B), d_0 + k disjoint paths, each connecting a node in S to a node in T, can be found in O(lg lg N * lg N) time, where N is the number of nodes in RDN^k(B). The length of the paths is at most 3(D_0/2 + 1)(lg N + 1)/(lg n_0 + 1), where D_0 and n_0 are the diameter and the number of nodes of the base network B, respectively.

Yamin Li, Shietung Peng, Wanming Chu
Redflag: A Framework for Analysis of Kernel-Level Concurrency

Although sophisticated runtime bug detection tools exist to root out several kinds of concurrency errors, they cannot easily be used at the kernel level. Our Redflag framework and system seeks to bring these essential techniques to the Linux kernel by addressing issues faced by other tools. First, other tools typically examine every potentially concurrent memory access, which is infeasible in the kernel because of the overhead it would introduce. Redflag minimizes overhead by using offline analysis together with an efficient in-line logging system and by supporting targeted configurable logging of specific kernel components and data structures. Targeted analysis reduces overhead and avoids presenting developers with error reports for components they are not responsible for. Second, other tools do not take into account some of the synchronization patterns found in the kernel, resulting in false positives. We explore two algorithms for detecting concurrency errors: one for race conditions and another for atomicity violations; we enhanced them to take into account some specifics of synchronization in the kernel. In particular, we introduce Lexical Object Availability (LOA) analysis to deal with multi-stage escape and other complex order-enforcing synchronization. We evaluate the effectiveness and performance of Redflag on two file systems and a video driver.

Justin Seyster, Prabakar Radhakrishnan, Samriti Katoch, Abhinav Duggal, Scott D. Stoller, Erez Zadok
Exploiting Parallelism in the H.264 Deblocking Filter by Operation Reordering

In the H.264 video compression standard, the deblocking filter contributes about one-third of all computation in the decoder. With multiprocessor architectures becoming the future trend of system design, computation time can be reduced if the deblocking filter apportions its operations well across multiple processing elements. In this paper, we use the 16-pixel-long boundary, the basic unit for deblocking in the H.264 standard, as the basis for analyzing and exploiting possible parallelism in deblocking filtering. Compared with existing approaches using a macroblock as the basic unit for analysis, a 16-pixel-long boundary, having a finer granularity, improves the chances of increasing the degree of parallelism. Moreover, a possible compromise to fully utilize limited hardware resources and hardware architectural requirements for deblocking are also proposed in this paper. Compared with the 2D wave-front method for deblocking both 1920*1080 and 1080*1920 pixel sized frames, the proposed design gains speedups of 1.57 and 2.15 times, respectively, given an unlimited number of processing elements. Using this approach, the execution time of the deblocking filter is proportional to the square root of the growth of the frame size (keeping the same width/height ratio), pushing the boundary of practical real-time deblocking of increasingly larger video sizes.

Tsung-Hsi Weng, Yi-Ting Wang, Chung-Ping Chung
Compiler Support for Concurrency Synchronization

How to write parallel programs is a critical issue for chip multiprocessors (CMPs). To overcome the communication and synchronization obstacles of CMPs, transactional memory (TM) has been proposed as an alternative concurrency control mechanism. Unfortunately, TM has led to seven performance pathologies: DuelingUpgrades, FutileStall, StarvingWriter, StarvingElder, SerializedCommit, RestartConvoy, and FriendlyFire. Such pathologies degrade performance during the interaction between workload and system. Although this performance issue can be solved by hardware, a software solution has remained elusive. This paper proposes a priority scheduling algorithm to remedy these performance pathologies. The proposed approach can not only solve this issue, but also achieve higher performance than hardware transactional memory (HTM) systems on some benchmarks.

Tzong-Yen Lin, Cheng-Yu Lee, Chia-Jung Chen, Rong-Guey Chang
Fault-Tolerant Routing Based on Approximate Directed Routable Probabilities for Hypercubes

Recently, parallel processing systems have been studied very actively, and many topologies have been proposed. The hypercube is one of the most popular topologies for interconnection networks. In this paper, we propose two new fault-tolerant routing algorithms for hypercubes based on approximate directed routable probabilities. The probability represents the ability to route a message to any node at a specific distance, taking into account the direction from which the message was received. Each node chooses one of its neighbor nodes to send a message to by comparing the approximate directed routable probabilities. We also conducted a computer experiment to verify the effectiveness of our algorithms.

Thuy Dinh Duong, Keiichi Kaneko
Finding a Hamiltonian Cycle in a Hierarchical Dual-Net with Base Network of p-Ary q-Cube

We first introduce a flexible interconnection network, called the hierarchical dual-net (HDN), with low node degree and short diameter for constructing a large-scale supercomputer. The HDN is constructed based on a symmetric product graph (base network). A k-level hierarchical dual-net, HDN(B, k, S), contains $(2N_0)^{2^k}/(2\prod_{i=1}^{k}s_i)$ nodes, where S = {s_i | 1 ≤ i ≤ k} is a set of integers with each s_i representing the number of nodes in a super-node at level i for 1 ≤ i ≤ k, and N_0 is the number of nodes in the base network B. The node degree of HDN(B, k, S) is d_0 + k, where d_0 is the node degree of the base network. The benefit of the HDN is that we can select suitable s_i to control the rate at which the number of nodes grows when constructing a supercomputer of the desired scale. Then we show that an HDN with a base network of p-ary q-cube is Hamiltonian and give an efficient algorithm for finding a Hamiltonian cycle in such hierarchical dual-nets.

Yamin Li, Shietung Peng, Wanming Chu
Adaptive Resource Remapping through Live Migration of Virtual Machines

In this paper we present ARRIVE-F, a novel open source framework which addresses the issue of heterogeneity in compute farms. Unlike the previous attempts, our framework is not based on linear frequency models and does not require source code modifications or off-line profiling. The heterogeneous compute farm is first divided into a number of virtualized homogeneous sub-clusters. The framework then carries out a lightweight ‘online’ profiling of the CPU, communication and memory subsystems of all the active jobs in the compute farm. From this, it constructs a performance model to predict the execution times of each job on all the distinct sub-clusters in the compute farm. Based upon the predicted execution times, the framework is able to relocate the compute jobs to the best suited hardware platforms such that the overall throughput of the compute farm is increased. We utilize the live migration feature of virtual machine monitors to migrate the job from one sub-cluster to another.

The prediction accuracy of our performance estimation model is over 80%. The implementation of ARRIVE-F is lightweight, with an overhead of 3%. Experiments on a synthetic workload of scientific benchmarks show that we are able to improve the throughput of a moderately heterogeneous compute farm by up to 25%, with a time saving of up to 33%.

Muhammad Atif, Peter Strazdins
LUTS: A Lightweight User-Level Transaction Scheduler

Software Transactional Memory (STM) systems have poor performance under high contention scenarios. Since many transactions compete for the same data, most of them are aborted, wasting processor runtime. Contention management policies are typically used to avoid that, but they are passive approaches as they wait for an abort to happen so they can take action. More proactive approaches have emerged, trying to predict when a transaction is likely to abort so its execution can be delayed. Such techniques are limited, as they do not replace the doomed transaction by another or, when they do, they rely on the operating system for that, having little or no control on which transaction should run. In this paper we propose LUTS, a Lightweight User-Level Transaction Scheduler, which is based on an execution context record mechanism. Unlike other techniques, LUTS provides the means for selecting another transaction to run in parallel, thus improving system throughput. Moreover, it avoids most of the issues caused by pseudo parallelism, as it only launches as many system-level threads as the number of available processor cores. We discuss LUTS design and present three conflict-avoidance heuristics built around LUTS scheduling capabilities. Experimental results, conducted with STMBench7 and STAMP benchmark suites, show LUTS efficiency when running high contention applications and how conflict-avoidance heuristics can improve STM performance even more. In fact, our transaction scheduling techniques are capable of improving program performance even in overloaded scenarios.

Daniel Nicácio, Alexandro Baldassin, Guido Araújo
Verification of Partitioning and Allocation Techniques on Teradata DBMS

Data fragmentation and allocation in distributed and parallel Database Management Systems (DBMS) have been extensively studied in the past. Previous work tackled these two problems separately even though they depend on each other. We recently developed a combined algorithm that handles the dependency between fragmentation and allocation, based on a novel genetic solution. The main issue with this solution and previous solutions is the lack of real-life verification of these models. This paper addresses this gap by verifying the effectiveness of our previous genetic solution on the Teradata DBMS. Teradata is a shared-nothing DBMS with proven scalability and robustness in real-life user environments as large as tens of petabytes of relational data. Experiments are conducted for the genetic solution and previous work using the SSB benchmark (TPC-H like) on a Teradata appliance running TD 13.10. Results show that the genetic solution is faster than previous work by 38%.

Ladjel Bellatreche, Soumia Benkrid, Ahmad Ghazal, Alain Crolotte, Alfredo Cuzzocrea
Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems

Because of the continuous trend towards higher core counts, parallelization is mandatory for many application domains beyond the traditional HPC sector. Current commodity servers comprise up to 48 processor cores in configurations with only four sockets. Those shared memory systems have distinct NUMA characteristics. The exact location of data within the memory system significantly affects both access latency and bandwidth. Therefore, NUMA aware memory allocation and scheduling are highly performance relevant issues. In this paper we use low-level microbenchmarks to compare two state-of-the-art quad-socket systems with x86_64 processors from AMD and Intel. We then investigate the performance of the application based OpenMP benchmark suite SPEC OMPM2001. Our analysis shows how these benchmarks scale on shared memory systems with up to 48 cores and how scalability correlates with the previously determined characteristics of the memory hierarchy. Furthermore, we demonstrate how the processor interconnects influence the benchmark results.

Daniel Molka, Robert Schöne, Daniel Hackenberg, Matthias S. Müller
Anonymous Communication over Invisible Mix Rings

Protecting the identity of participants may be advantageous or essential, and even critical, for many Internet applications. The mix rings architecture gives better performance than mix-nets while maintaining anonymity that is stronger than onion routing. This paper presents an enhancement of mix rings, a hybrid P2P system designed to provide anonymity under a strong adversarial model in terms of identification, anonymity, and resilience to collusion, along with low latency of data delivery and good link utilization. We use a double-layer overlay network composed of nodes that are interested in anonymity and trusted index nodes, and introduce several cluster escape and random extend mechanisms into mix rings. We present a description of the protocol, an analysis of defenses against common attacks, and an evaluation of the degree of anonymity using MATLAB simulations.

Ming Zheng, Haixin Duan, Jianping Wu
Game-Based Distributed Resource Allocation in Horizontal Dynamic Cloud Federation Platform

In this paper, we propose a game-theoretic solution to the problem of distributed resource allocation in the emerging horizontal dynamic cloud federation (HDCF) platform, which differs from the existing vertical supply chain federation (VSCF) models in terms of federation formation and dynamic pricing. We study two resource allocation games, a non-cooperative and a cooperative game, to analyze the interaction among cloud providers (CPs) in an HDCF environment. We use price-based resource allocation strategies and present both centralized and distributed algorithms to find optimal solutions which have low overhead and robust performance. Various simulations were carried out to validate and verify the effectiveness of the proposed resource allocation games. The simulation results demonstrate that a cost-effective resource allocation with robust performance is achieved in the cooperative scheme.

Mohammad Mehedi Hassan, Biao Song, Eui-Nam Huh
Stream Management within the CloudMiner

Nowadays, cloud computing has become a major trend that enterprises and research organizations are pursuing with increasing zest. A potentially important application area for clouds is data analytics. In our previous publication, we introduced a novel cloud infrastructure, the CloudMiner, which facilitates data mining on massive scientific data. By providing a cloud platform which hosts data mining cloud services following the Software as a Service (SaaS) paradigm, CloudMiner offers the capability for realizing cloud-based data mining tasks upon traditional distributed databases and other dataset types. However, little attention has been paid to the issue of data stream management on the cloud so far. We have noticed that some features of the cloud meet the requirements of data stream management very well. Consequently, we developed an innovative software framework, called the StreamMiner, which is introduced in this paper. It serves as an extension to the CloudMiner for facilitating, in particular, real-world data stream management and analysis using cloud services. In addition, we also introduce our tentative implementation of the framework. Finally, we present and discuss the first experimental performance results achieved with the first StreamMiner prototype.

Yuzhang Han, Peter Brezany, Andrzej Goscinski
Security Architecture for Virtual Machines

We propose a security architecture based on the virtual machine monitor to efficiently deal with attacks on virtual machines. We show that our model is capable of detecting suspicious processes running in the virtual machine, and can detect and prevent different types of attacks, including zero-day attacks, by monitoring the virtual machine traffic and the processes that are generating or receiving the traffic. The architecture also makes use of sharing information about suspicious behaviour among multiple intrusion detection systems deployed in different virtual machine monitors. We describe the implementation of the proposed architecture and present a detailed analysis of how it can be used to detect zero-day attacks.

Udaya Tupakula, Vijay Varadharajan, Abhishek Bichhawat
Fast and Accurate Similarity Searching of Biopolymer Sequences with GPU and CUDA

CUDA is an architecture introduced by NVIDIA Corporation, which allows software developers to take advantage of GPU resources in order to increase the computational power. This paper presents an approach to accelerate the similarity searching of DNA and protein molecules through parallel alignments of their sequences with the use of GPU and CUDA. In order to optimally align two biopolymer sequences, such as amino acid or nucleotide sequences, we employ the Smith-Waterman algorithm. We present the optimization steps leading to achieve a very good efficiency of our implementation on GPU and we compare results of efficiency tests with other known implementations. The results show that it is possible to search bioinformatics databases accurately within a reasonable time.

Robert Pawłowski, Bożena Małysiak-Mrozek, Stanisław Kozielski, Dariusz Mrozek
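
For reference, the Smith-Waterman recurrence that such GPU implementations parallelize (typically along the anti-diagonals of the dynamic-programming matrix) looks like this in plain Python. The scoring parameters here (match +2, mismatch -1, gap -1) are a common textbook choice, not necessarily the paper's.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score via dynamic programming.

    H[i][j] is the best score of a local alignment ending at a[i-1],
    b[j-1]; the 0 in the max allows an alignment to restart anywhere.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

# Classic example pair: best local alignment A-CACACTA / AGCACAC-A,
# 7 matches and 2 gaps, giving a score of 12.
print(smith_waterman("ACACACTA", "AGCACACA"))  # -> 12
```

Because each cell depends only on its left, upper, and upper-left neighbors, all cells on one anti-diagonal are independent, which is exactly the parallelism a GPU kernel exploits.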
Read Invisibility, Virtual World Consistency and Probabilistic Permissiveness are Compatible

The aim of a Software Transactional Memory (STM) is to discharge the programmers from the management of synchronization in multiprocess programs that access concurrent objects. To that end, an STM system provides the programmer with the concept of a transaction. The job of the programmer is to design each process the application is made up of as a sequence of transactions. A transaction is a piece of code that accesses concurrent objects, but contains no explicit synchronization statement. It is the job of the underlying STM system to provide the illusion that each transaction appears as being executed atomically. Of course, for efficiency, an STM system has to allow transactions to execute concurrently. Consequently, due to the underlying STM concurrency management, a transaction commits or aborts.

This paper studies the relation between two STM properties (read invisibility and permissiveness) and two consistency conditions for STM systems, namely, opacity and virtual world consistency. Both conditions ensure that any transaction (be it a committed or an aborted transaction) reads values from a consistent global state, a noteworthy property if one wants to prevent abnormal behavior from concurrent transactions that behave correctly when executed alone. A read operation issued by a transaction is invisible if it does not entail shared memory modifications. This is an important property that favors efficiency and privacy. An STM system is permissive (respectively probabilistically permissive) with respect to a consistency condition if it accepts (respectively accepts with positive probability) every history that satisfies the condition. This is a crucial property as a permissive STM system never aborts a transaction “for free”. The paper first shows that read invisibility, probabilistic permissiveness and opacity are incompatible, which means that there is no probabilistically permissive STM system that implements opacity while ensuring read invisibility. It then shows that read invisibility, probabilistic permissiveness and virtual world consistency are compatible. To that end the paper describes a new STM protocol called IR_VWC_P. This protocol presents additional noteworthy features: it uses only base read/write objects and locks which are used only at commit time; it satisfies the disjoint access parallelism property; and, in favorable circumstances, the cost of a read operation is O(1).

Tyler Crain, Damien Imbs, Michel Raynal
Parallel Implementations of Gusfield’s Cut Tree Algorithm

This paper presents parallel versions of Gusfield’s cut tree algorithm. Cut trees are a compact representation of the edge-connectivity between every pair of vertices of an undirected graph. Cut trees have many applications in combinatorial optimization and in the analysis of networks originated in many applied fields. However, surprisingly few works have been published on the practical performance of cut tree algorithms. This paper describes two parallel versions of Gusfield’s cut tree algorithm and presents extensive experimental results which show a significant speedup on most real and synthetic graphs in our dataset.

Jaime Cohen, Luiz A. Rodrigues, Fabiano Silva, Renato Carmo, André L. P. Guedes, Elias P. Duarte Jr.
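
Gusfield's sequential algorithm, which the paper parallelizes, needs only n-1 max-flow computations and no graph contraction. The compact Python sketch below (with a simple Edmonds-Karp max flow on a capacity matrix) is purely illustrative and is not the authors' parallel code.

```python
from collections import deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns (flow value, source-side cut set)."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            break
        # Augment along the path found.
        path_flow = float("inf")
        v = t
        while v != s:
            u = parent[v]
            path_flow = min(path_flow, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += path_flow
            flow[v][u] -= path_flow
            v = u
        total += path_flow
    # Nodes reached by the last (failed) BFS form the min-cut side of s.
    reachable = {v for v in range(n) if parent[v] != -1}
    return total, reachable

def gusfield(cap):
    """Gusfield's cut tree: n-1 max-flow calls build the tree as
    parent/weight arrays over the original graph (no contraction)."""
    n = len(cap)
    parent = [0] * n
    weight = [0] * n
    for i in range(1, n):
        f, side = max_flow_min_cut(cap, i, parent[i])
        weight[i] = f
        # Re-hang later nodes that fell on i's side of the cut.
        for j in range(i + 1, n):
            if j in side and parent[j] == parent[i]:
                parent[j] = i
    return parent, weight

# Path graph 0-1 (cap 3), 1-2 (cap 2), 2-3 (cap 4): its cut tree is
# the path itself with the same capacities.
cap = [[0, 3, 0, 0],
       [3, 0, 2, 0],
       [0, 2, 0, 4],
       [0, 0, 4, 0]]
parent, weight = gusfield(cap)
print(parent, weight)  # -> [0, 0, 1, 2] [0, 3, 2, 4]
```

In the resulting tree, the minimum cut between any two vertices equals the smallest edge weight on the tree path between them, which is what makes the cut tree a compact representation of all-pairs edge-connectivity.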
Efficient Parallel Implementations of Controlled Optimization of Traffic Phases

Finding optimal phase durations for a controlled intersection is a computationally intensive task requiring O(N^3) operations. In this paper we introduce a cost-optimal parallelization of a dynamic programming algorithm that reduces the complexity to O(N^2). Three implementations that span a wide range of parallel hardware are developed. The first is based on a shared-memory architecture, using the OpenMP programming model. The second implementation is based on message passing, targeting massively parallel machines including high-performance clusters and supercomputers. The third implementation is based on the data-parallel programming model mapped onto Graphics Processing Units (GPUs). Key optimizations include loop reversal, communication pruning, load balancing, and efficient thread-to-processor assignment. Experiments have been conducted on an 8-core server, IBM BlueGene/L supercomputer 2-node boards with 128 processors, and an Nvidia GeForce GTX470 GPU with 448 cores. Results indicate practical scalability on all platforms, with maximum speed-up reaching 76x for the GTX470.

Sameh Samra, Ahmed El-Mahdy, Walid Gomaa, Yasutaka Wada, Amin Shoukry
Scheduling Concurrent Workflows in HPC Cloud through Exploiting Schedule Gaps

Many large-scale scientific applications are usually constructed as workflows due to large amounts of interrelated computation and communication. Workflow scheduling has long been a research topic in parallel and distributed computing. However, most previous research focuses on single workflow scheduling. As cloud computing emerges, users can now have easy access to on-demand high performance computing resources, usually called HPC cloud. Since HPC cloud has to serve many users simultaneously, it is common that many workflows submitted from different users are running concurrently. Therefore, how to schedule concurrent workflows efficiently becomes an important issue in HPC cloud environments. Due to the dependency and communication costs between tasks in a workflow, there usually are gaps formed in the schedule of a workflow. In this paper, we propose a method which exploits such schedule gaps to efficiently schedule concurrent workflows in HPC cloud. The proposed scheduling method was evaluated with a series of simulation experiments and compared to the existing method in the literature. The results indicate that our method can deliver good performance and outperform the existing method significantly in terms of average makespan, up to 18% performance improvement.

He-Jhan Jiang, Kuo-Chan Huang, Hsi-Ya Chang, Di-Syuan Gu, Po-Jen Shih
Efficient Decoding of QC-LDPC Codes Using GPUs

In this work, we propose an efficient quasi-cyclic LDPC (QC-LDPC) decoder simulator which runs on graphics processing units (GPUs). We optimize the data structures of the messages used in the decoding process such that both the read and write processes can be performed in a highly parallel manner by the GPUs. We also propose a highly efficient algorithm to convert the data structure of the messages from one form to another with very little latency. Finally, with the use of a large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder is found to run around 100 times faster than a CPU-based simulator.

Yue Zhao, Xu Chen, Chiu-Wing Sham, Wai M. Tam, Francis C. M. Lau

ICA3PP 2011 Short Papers

A Combined Arithmetic Logic Unit and Memory Element for the Design of a Parallel Computer

The single memory-CPU communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. A simple low-latency single on-chip memory and processor cannot solve the problem, as the fundamental channel bottleneck will still be there due to the logical splitting of processor and memory. This paper shows that a paradigm shift is possible by combining Arithmetic Logic Unit and Random Access Memory (ARAM) elements at the bit level. This modest bit-level ARAM is used to perform word-level ALU instructions with minor modifications. This makes the ARAM cells capable of executing instructions in parallel. It is also asynchronous and hence reduces power consumption significantly. A CMOS implementation is presented that verifies the practicality of the proposed ARAM.

Mohammed Ziaur Rahman
Parallel Implementation of External Sort and Join Operations on a Multi-core Network-Optimized System on a Chip

In a commercial Relational Database Management System (RDBMS), sort and join are the most demanding operations, and it is quite beneficial to improve the performance of external sort and external join algorithms that handle large input data sizes. This paper proposes parallel implementations of multi-threaded external sort and external hash join algorithms to accelerate IBM DB2, one of the leading RDBMSs, using an IBM Power Edge of Network (IBM PowerEN™) Peripheral Component Interconnect Express (PCIe) card as an accelerator. The preliminary results show that the proposed parallel implementation of the algorithms on the PowerEN™ PCIe card can speed up DB2 sort and join performance by about a factor of two.

Elahe Khorasani, Brent D. Paulovicks, Vadim Sheinin, Hangu Yeo
STM with Transparent API Considered Harmful

One of the key selling points of Software Transactional Memory (STM) systems is that they simplify the development of concurrent programs, because programmers do not have to be concerned with which objects are accessed concurrently. Instead, they just have to say which operations are to be executed atomically. Yet, one of the consequences of making an STM completely transparent to the programmer is that it may incur large overheads.

In this paper, we describe a port to Java of the WormBench benchmark, and use it to explore the effects on performance of relaxing the transparency of an STM. To that end, we implemented, in a well-known STM framework (Deuce), a couple of annotations that allow programmers to specify that certain objects or fields of objects should not be transactified. Our results show that we get an improvement of up to 22-fold in the performance of the benchmark when we tell the STM framework which data is not transactional, and that the performance of the improved version is as good as or better than a fine-grained lock-based approach.

Fernando Miguel Carvalho, Joao Cachopo
A Global Snapshot Collection Algorithm with Concurrent Initiators with Non-FIFO Channel

Taking a global snapshot in the absence of a global clock is a challenging issue in distributed systems. The problem becomes more challenging when the communication channels are non-FIFO, due to the lack of FIFO guarantees in transmitting messages. Multiple initiators further complicate the situation. In this paper, we present a global snapshot collection algorithm with multiple initiators over non-FIFO communication channels. We show that the algorithm takes a unique, globally consistent snapshot with non-FIFO channels and terminates with O(mn^2) message complexity, where m is the number of concurrent initiators and n is the number of processes in the system.

Diganta Goswami, Soumyadip Majumder
An Approach for Code Compression in Run Time for Embedded Systems – A Preliminary Results

Several factors are considered in the development of embedded systems, among them physical size, weight, mobility, power consumption, memory, and safety, all combined with low cost and ease of use. There are several techniques to optimize execution time and power consumption in embedded systems. One such technique is code compression; the majority of existing proposals focus on decompression, assuming the code is compressed at compile time. This article proposes a new code compression and decompression method implemented in VHDL and prototyped on an FPGA, called MIC (Middle Instruction Compression). The proposed method was compared with the traditional Huffman method, also implemented in hardware. MIC showed better performance than Huffman for some MiBench programs, widely used in embedded systems, using 17% fewer FPGA logic elements, achieving a 6% higher clock frequency (in MHz), and obtaining 42% better code compression than the Huffman method, while allowing compression and decompression at runtime.
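The MIC scheme itself is not detailed in the abstract, but the Huffman baseline it is compared against is classical. As an illustration only, a minimal sketch of Huffman code-table construction (a software model of the hardware baseline, not the paper's implementation) might look like this:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table (symbol -> bit string) from a symbol sequence."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreak index, partial {symbol: code} table).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
        i += 1
    return heap[0][2]

table = huffman_code("aaaabbc")
# More frequent symbols get shorter codes.
print(table)
```

In a hardware setting the same table would be baked into a decoder circuit; the point of comparison in the paper is the logic, clock, and compression-ratio cost of doing so.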

Wanderson Roger Azevedo Dias, Edward David Moreno, Raimundo da Silva Barreto
Optimized Two Party Privacy Preserving Association Rule Mining Using Fully Homomorphic Encryption

In two-party privacy-preserving association rule mining, securely comparing two integers is considered the bottleneck to achieving maximum privacy. The recently proposed fully homomorphic encryption (FHE) scheme by van Dijk et al. can be applied to secure computation. Kaosar, Paulet and Yi have applied it to preserving privacy in two-party association rule mining, but its performance is not very practical due to its huge ciphertext and public key sizes and its complex carry circuit. In this paper we propose some optimizations for applying van Dijk et al.'s encryption system to securely compare two numbers. We also apply this optimized solution to preserving privacy in association rule mining (ARM) in a two-party setting, further enhancing the two-party secure association rule mining technique proposed by Kaosar et al. The performance analysis shows that the proposed solution achieves a significant improvement.

Md. Golam Kaosar, Russell Paulet, Xun Yi
SLA-Based Resource Provisioning for Heterogeneous Workloads in a Virtualized Cloud Datacenter

Efficient provisioning of resources is a challenging problem in cloud computing environments due to their dynamic nature and the need to support heterogeneous applications with different performance requirements. Currently, cloud datacenter providers either do not offer any performance guarantee or prefer static VM allocation over dynamic allocation, which leads to inefficient utilization of resources. Earlier solutions, concentrating on a single type of SLA (Service Level Agreement) or on the resource usage patterns of applications, are not suitable for cloud computing environments. In this paper, we tackle the resource allocation problem within a datacenter that runs different types of application workloads, particularly non-interactive and transactional applications. We propose an admission control and scheduling mechanism which not only maximizes resource utilization and profit, but also ensures the SLA requirements of users. In our experimental study, the proposed mechanism is shown to provide substantial improvement over static server consolidation and to reduce SLA violations.
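To make the admission-control idea concrete, here is a deliberately simplified sketch (not the paper's mechanism; the capacity model, function names, and numbers are all hypothetical): a request is admitted only if it fits in the remaining capacity and its SLA minimum can still be honored.

```python
# Toy admission-control sketch: admit a VM request only if the host has
# spare CPU capacity and the request itself can satisfy its SLA minimum.
# All names and figures are illustrative, not from the paper.

def admit(host_capacity, reserved, request_cpu, sla_min_cpu):
    """Return True if the request fits and its SLA minimum can be honored."""
    free = host_capacity - reserved
    return request_cpu <= free and sla_min_cpu <= request_cpu

reserved = 0.0
admitted = []
# Each request: (requested CPU share, SLA-guaranteed minimum CPU share).
for cpu, sla in [(0.4, 0.3), (0.5, 0.5), (0.3, 0.1)]:
    if admit(1.0, reserved, cpu, sla):
        reserved += cpu
        admitted.append((cpu, sla))
print(len(admitted), round(reserved, 2))
```

The third request is rejected because only 0.1 of the host remains; a real mechanism would additionally weigh profit and per-workload-type SLAs, as the abstract describes.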

Saurabh Kumar Garg, Srinivasa K. Gopalaiyengar, Rajkumar Buyya
ΣC: A Programming Model and Language for Embedded Manycores

We present ΣC, a programming model and language for high-performance embedded manycores. The programming model is based on process networks with non-determinism extensions and process behavior specifications. The language itself extends C with parallelism, composition, and process abstractions. It is intended to support architecture-independent, high-level parallel programming on embedded manycores, and allows for both low execution overhead and strong execution guarantees. ΣC is being developed as part of an industry-grade tool chain for a high-performance embedded manycore architecture.

Thierry Goubier, Renaud Sirdey, Stéphane Louise, Vincent David
Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters

Infrastructure-as-a-Service providers are offering their unused resources in the form of variable-priced virtual machines (VMs), known as "spot instances", at prices significantly lower than their standard fixed-priced resources. To lease spot instances, users specify a maximum price they are willing to pay per hour, and VMs run only when the current price is lower than the user's bid. This paper proposes a resource allocation policy that addresses the problem of running deadline-constrained compute-intensive jobs on a pool composed solely of spot instances, while exploiting variations in price and performance to run applications in a fast and economical way. Our policy relies on job runtime estimations to decide which VM types are best suited to each job and when jobs should run. Several estimation methods are evaluated and compared using trace-based simulations, which take as input real price variation traces obtained from Amazon Web Services as well as an application trace from the Parallel Workloads Archive. Results demonstrate the effectiveness of running computational jobs on spot instances, at a fraction (up to 60% lower) of the price that would normally be paid for fixed-priced resources.
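The core decision the abstract describes, picking a VM type from a runtime estimate so the deadline is met at minimum cost, can be sketched as follows. This is illustrative only, not the authors' policy; the VM names, prices, and speeds are invented:

```python
# Hypothetical deadline-aware spot-instance selection: among VM types whose
# estimated runtime meets the deadline, pick the one with the lowest
# estimated total cost. Prices and speeds below are made up.

def pick_vm(vm_types, job_length, deadline_hours):
    """vm_types: name -> (spot price per hour, relative speed).
    Runtime estimate = job_length / relative_speed (hours)."""
    feasible = []
    for name, (price, speed) in vm_types.items():
        runtime = job_length / speed
        if runtime <= deadline_hours:
            feasible.append((price * runtime, name))  # estimated total cost
    if not feasible:
        return None  # no spot type can meet the deadline
    return min(feasible)[1]

vms = {
    "small":  (0.03, 1.0),   # $/hour, relative speed
    "medium": (0.06, 2.2),
    "large":  (0.12, 4.0),
}
print(pick_vm(vms, job_length=8.0, deadline_hours=5.0))
```

A real policy would also react to price variation over time and to bid failures; this sketch only captures the type-selection step.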

William Voorsluys, Saurabh Kumar Garg, Rajkumar Buyya
A Principled Approach to Grid Middleware
Status Report on the Minimum Intrusion Grid

This paper provides an overview of MiG, a Grid middleware for advanced job execution, data storage and group collaboration in an integrated, yet lightweight solution using standard software.

In contrast to most other Grid middlewares, MiG is developed with a particular focus on usability and minimal system requirements, applying strict principles to keep the middleware free of legacy burdens and overly complicated design. We provide an overview of MiG and describe its features in view of the Grid vision and its relation to more recent cloud computing trends.

Jost Berthold, Jonas Bardino, Brian Vinter
Performance Analysis of Preemption-Aware Scheduling in Multi-cluster Grid Environments

In multi-cluster Grids, each cluster serves requests from external (Grid) users along with its own local users. A problem arises when there are not sufficient resources for local users (which have high priority) to be served urgently. This problem can be solved by preempting resources from Grid users and allocating them to the local users. However, resource preemption decreases resource utilization and increases Grid users' response time. The question is how to minimize the number of preemptions taking place in a resource-sharing environment. In this paper, we propose a preemption-aware scheduling policy based on queuing theory for a virtualized multi-cluster Grid in which the number of preemptions is minimized. Simulation results indicate that the proposed scheduling policy significantly decreases the number of virtual machine (VM) preemptions (by up to 22.5%).

Mohsen Amini Salehi, Bahman Javadi, Rajkumar Buyya
Performance Evaluation of Open Source Seismic Data Processing Packages

In many businesses, including hydrocarbon industries, reducing cost is a high priority. Although hydrocarbon industries appear able to afford the expensive computing infrastructure and software packages used to process seismic data in the search for hydrocarbon traps, it is always imperative to find ways to minimize cost. Seismic processing costs can be significantly reduced by using inexpensive, open source seismic data processing packages. However, hydrocarbon industries question the processing performance capability of open source packages, claiming that their seismic functions are less integrated and provide almost no technical guarantees. The objective of this paper is to demonstrate, through a comparative analysis, that open source seismic data processing packages are capable of executing the required seismic functions on an actual industrial workload. To achieve this objective we investigate whether or not open source seismic data processing packages can be executed on the same set of seismic data through data format conversions, and whether or not they can achieve reasonable performance and speedup when executing parallel seismic functions on an HPC cluster. Among the few open source packages available on the Internet, the subjects of our study are two popular packages: Seismic UNIX (SU) and Madagascar.

Izzatdin A. Aziz, Andrzej M. Goscinski, Michael M. Hobbs
Reputation-Based Resource Allocation in Market-Oriented Distributed Systems

The scale of parallel and distributed systems (PDSs), such as grids and clouds, and the diversity of the applications running on them make reliability a high-priority performance metric. This paper presents a reputation-based resource allocation strategy for PDSs with a market model. Resource reputation is determined by availability and reliable execution. The market model helps define a trust interaction between provider and consumer that leverages dependable computing. We also explicitly take data staging and its delay into account when making decisions. Results demonstrate that our approach significantly increases successful execution, while exploiting diversity in tasks and resources.

Masnida Hussin, Young Choon Lee, Albert Y. Zomaya
Cooperation-Based Trust Model and Its Application in Network Security Management

This paper presents a User Cooperation Trust Model (UCTM) which not only liberally rewards good behavior, but also decisively punishes threatening behavior. In this model, a malicious user is kicked out of the network system and automatically refused access to network resources when the user's reputation falls below a threshold.

Wu Liu, Hai-xin Duan, Ping Ren
Performance Evaluation of the Three-Dimensional Finite-Difference Time-Domain(FDTD) Method on Fermi Architecture GPUs

GPUs excel at solving many parallel problems and can hence dramatically increase computational performance. In electrodynamics and many other fields, the FDTD method is widely used due to its simplicity, accuracy, and practicality. In this paper, we apply the FDTD method on Fermi-architecture GPUs, the latest products from NVIDIA, to better understand Fermi's new features, such as double-precision support and the improved memory hierarchy. We then compare two strategies: using shared memory, the traditional optimization method on GPUs, and using the L1 cache. The paper provides insights into the disparity between these two strategies. We demonstrate that parallel computations using only the L1 cache can reach similar or even better performance than the traditional shared-memory optimization when the dataset is not too large or when the related data are reused infrequently.

Kaixi Hou, Ying Zhao, Jiumei Huang, Lingjie Zhang
The Probability Model of Peer-to-Peer Botnet Propagation

Active peer-to-peer worms are a great threat to network security since they can propagate in automated ways and flood the Internet within a very short duration. Modeling the propagation process can help us devise effective strategies against a worm's spread. This paper presents a study on modeling a worm's propagation probability in a P2P overlay network and proposes an optimized patch strategy for defenders. First, we present a probability matrix model to describe the propagation of P2P worms. Our model involves three indispensable aspects of propagation: infection state, vulnerability distribution, and patch strategy. Based on a fully connected graph, our comprehensive model is well suited to real-world cases like Code Red II. Finally, by inspecting the propagation procedure, we propose four basic tactics for the defense of P2P botnets. The rationale is exposed by our simulated experiments, and the results show these tactics are effective and have considerable worth when applied in real-world networks.
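The ingredients the abstract names (infection state, vulnerability/patch status, per-link infection probability on a fully connected overlay) can be illustrated with a toy Monte Carlo simulation. This is a generic sketch under stated assumptions, not the paper's probability matrix model; all parameters are hypothetical:

```python
# Toy worm-propagation simulation on a fully connected overlay.
# Each step, every infected node i infects each unpatched, uninfected
# node j independently with probability p_infect. Patched nodes are immune.
import random

def simulate(n, p_infect, patched, steps, seed=0):
    rng = random.Random(seed)
    infected = [False] * n
    infected[0] = True  # patient zero (assumed unpatched)
    for _ in range(steps):
        newly = []
        for i in range(n):
            if not infected[i]:
                continue
            for j in range(n):
                if i != j and not infected[j] and j not in patched:
                    if rng.random() < p_infect:
                        newly.append(j)
        for j in newly:
            infected[j] = True
    return sum(infected)

# Patching half the nodes bounds the outbreak to the unpatched half.
few_patched = simulate(50, 0.05, patched=set(), steps=10)
many_patched = simulate(50, 0.05, patched=set(range(25, 50)), steps=10)
print(few_patched, many_patched)
```

A matrix formulation of the same dynamics, as in the paper, replaces the sampling loop with per-node infection probabilities updated each step; the defensive point is identical: patched nodes cap the reachable outbreak size.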

Yini Wang, Sheng Wen, Wei Zhou, Wanlei Zhou, Yang Xiang
A Parallelism Extended Approach for the Enumeration of Orthogonal Arrays

Orthogonal arrays play an important role in design of experiments and software testing. The most challenging aspect of orthogonal arrays is the enumeration problem: for given parameter sets, we want to count exactly how many isomorphism classes exist. Enumeration of orthogonal arrays usually requires a huge computational effort. Several algorithms have been proposed for the enumeration of orthogonal arrays; however, no effective parallel approach exists for handling this problem.

In this paper, we present a step-by-step parallelism-extending approach for the enumeration of orthogonal arrays. The experiments show that the proposed approach achieves good relative speedup efficiency with up to 128 processes and good dynamic load balancing between computing processes.

Hien Phan, Ben Soh, Man Nguyen
Backmatter
Metadata
Title
Algorithms and Architectures for Parallel Processing
Edited by
Yang Xiang
Alfredo Cuzzocrea
Michael Hobbs
Wanlei Zhou
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-24650-0
Print ISBN
978-3-642-24649-4
DOI
https://doi.org/10.1007/978-3-642-24650-0
