
2007 | Book

Parallel Computing Technologies

9th International Conference, PaCT 2007, Pereslavl-Zalessky, Russia, September 3-7, 2007. Proceedings

Table of Contents

Frontmatter

Models and Languages

Looking for a Definition of Dynamic Distributed Systems

This paper is a position paper on the nature of dynamic systems. While there is an agreement on the definition of what a static distributed system is, there is no agreed definition of what a dynamic distributed system is. This paper is a first step in that direction. To that end, it emphasizes two orthogonal dimensions that are present in any dynamic distributed system, namely the varying and possibly very large number of entities that currently define the system, and the fact that each of these entities knows only a few other entities (its neighbors) and possibly will never be able to know the whole system it is a member of. To illustrate the kind of issues one has to cope with in dynamic systems, the paper considers, as a "canonical" problem, a simple data aggregation problem. It shows the type of dynamic systems in which that problem can be solved and the ones in which it cannot be solved. The aim of the paper is to give the reader an idea of the subtleties and difficulties encountered when one wants to understand the nature of dynamic distributed systems.

Roberto Baldoni, Marin Bertier, Michel Raynal, Sara Tucci-Piergiovanni
Adaptive Workflow Nets for Grid Computing

Existing grid applications commonly use workflows for the orchestration of grid services. Existing workflow models, however, suffer from a lack of adaptivity. In this paper we define Adaptive Grid Workflow nets (AGWF nets), appropriate for modeling grid workflows and allowing changes in the process structure as a response to triggering events/exceptions. Moreover, recursion is allowed, which makes the model especially appropriate for a number of grid applications. We show that soundness can be verified for AGWF nets.

Carmen Bratosin, Kees van Hee, Natalia Sidorova
A Stochastic Semantics for BioAmbients

We consider BioAmbients, a calculus for specifying biological entities and for simulating and analysing their behaviour. We extend BioAmbients to take quantitative information into account by defining a stochastic semantics, based on a stochastic simulation algorithm, to determine the actual rate of transitions.

Linda Brodo, Pierpaolo Degano, Corrado Priami
A Categorical Observation of Timed Testing Equivalence

Timed transition systems are a widely studied model for real-time systems. The intention of the paper is to show the applicability of the general categorical framework of open maps to the setting of testing equivalence on timed transition systems, in order to transfer general concepts of equivalences to the model. In particular, we define a category of timed transition systems, whose morphisms are to be thought of as simulations, and an accompanying (sub)category of observations, to which the corresponding notion of open maps is developed. We then use the open maps framework to obtain an abstract bisimilarity which is established to coincide with timed testing equivalence.

Natalya Gribovskaya, Irina Virbitskaite
From Unreliable Objects to Reliable Objects: The Case of Atomic Registers and Consensus

A concurrent object is an object that can be concurrently accessed by several processes. It has been shown by Maurice Herlihy that any concurrent object O defined by a sequential specification can be wait-free implemented from reliable atomic registers (shared variables) and consensus objects. Wait-free means that any invocation of an operation of the object O issued by a non-faulty process does terminate, whatever the behavior of the other processes (e.g., despite the fact that they are very slow or have even crashed).

So, an important issue consists in providing reliable atomic registers and reliable consensus objects despite the failures experienced by the base objects from which these atomic registers and consensus objects are built. This paper considers self-implementations, i.e., the case where a reliable atomic register (resp., consensus object) is built from unreliable atomic registers (resp., unreliable consensus objects). The paper addresses the object failure model where the base objects can suffer responsive or nonresponsive crash failures. When there are solutions the paper presents corresponding algorithms, and when there is no solution, it presents the corresponding impossibility result. The paper has a tutorial flavor whose aim is to make the reader familiar with important results when one has to build resilient concurrent objects. To that aim, the paper uses both algorithms from the literature and new algorithms.
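To make the notion of a wait-free consensus object concrete, here is a minimal sketch (not the paper's construction): a single-shot consensus object built directly on an atomic compare-and-swap over reliable shared memory, assuming proposed values are non-negative integers. Every process calls propose() with its own value and all of them return the single value whose compare-and-swap succeeded first; the call returns after a bounded number of steps regardless of how slow or crashed the other processes are.

    /* Illustrative sketch, not the paper's algorithm: single-shot consensus
     * from compare-and-swap on reliable shared memory. Proposed values are
     * assumed to be >= 0; EMPTY marks "not yet decided". */
    #include <stdatomic.h>

    #define EMPTY (-1)

    typedef struct { _Atomic int decided; } consensus_t;

    void consensus_init(consensus_t *c) { atomic_init(&c->decided, EMPTY); }

    int propose(consensus_t *c, int v)
    {
        int expected = EMPTY;
        /* only the first successful CAS installs its value */
        atomic_compare_exchange_strong(&c->decided, &expected, v);
        return atomic_load(&c->decided);   /* every caller returns the decided value */
    }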

Rachid Guerraoui, Michel Raynal
A Functional Programming System SFP: Sisal 3.1 Language Structures Decomposition

The paper describes equivalent transformations of structures of the Sisal 3.1 programming language (based on Sisal 90). These transformations are aimed at decomposing the complex language structures into simpler ones that can be directly expressed by the internal representation IR1 (based on the IF1 language). Currently, descriptions of similar transformations can be found only in a few works about Sisal 90, in the form of examples. A front-end compiler from Sisal 3.1 into IR1 performs these transformations, so they help to better understand its translation strategy. The paper also briefly describes Sisal 3.1 and IR1.

Victor N. Kasyanov, Alexander P. Stasenko
Towards a Computing Model for Open Distributed Systems

This paper proposes an implementation of the data structure called bag or multiset, used by descriptive programming languages (e.g. Gamma, Linda), over an open system. In this model, a succession of "chemical reactions" consumes the elements of the bag and produces new elements according to specific rules. This approach is particularly interesting as it suppresses all unneeded synchronization and reveals all the potential parallelism of a program. An efficient implementation of a bag thus provides an efficient implementation of the resulting program. This paper defines a new communication and synchronization model adapted from workqueues used in parallel computing. The proposed model makes it possible to benefit from the potential parallelism offered by this style of programming when only an approximate solution is needed.

Achour Mostefaoui
Enhancing Online Computer Games for Grids

Massively multiplayer online games (MMOG) require large amounts of computational resources for providing a responsive and scalable gameplay for thousands of concurrently participating players. In current MMOG, large data-centers are dedicated to a particular game title. Such static hosting requires a huge upfront investment and carries the risk of false estimation of user demand. The concept of grid computing allows resources to be used on demand in a dynamic way, and is therefore a promising approach for MMOG services to overcome the limitations of static game provisioning. In this paper, we discuss different parallelization mechanisms for massively multiplayer gaming and grid architecture concepts suitable for on-demand game services. The work presented here provides both a state-of-the-art analysis and a conceptual use case discussion: we outline the new European project edutain@grid, which targets scaling real-time interactive online applications and MMOG, including First Person Shooter (FPS) and Real-Time Strategy (RTS) games, in an on-demand manner using a distributed grid architecture. Finally, we describe our experimental online game Rokkatan and report experimental scalability results for this game on a multi-server grid architecture.

Jens Müller, Sergei Gorlatch

Applications

Optimized Parallel Approach for 3D Modelling of Forest Fire Behaviour

In this paper we present methods for parallelizing a 3D CFD forest fire modelling code on non-uniform memory access (NUMA) computers within the OpenMP environment. The mathematical model is presented first. Then, some peculiarities of this class of computers are considered, along with properties and limitations of the OpenMP model. Techniques for efficient parallelization are discussed, considering different types of data processing algorithms. Finally, performance results for the parallelized algorithm are presented and analyzed (for up to 16 processors).
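As an illustration of the kind of parallelization discussed here, the sketch below (not taken from the paper) shows a typical OpenMP loop-level parallelization of a 3D finite-difference update over a regular grid; the array names, the 7-point stencil and the coefficient c are placeholders for the actual fire-model operators.

    /* Illustrative sketch, not the authors' code: OpenMP parallelization of
     * one explicit update sweep over a regular nx*ny*nz grid. */
    #include <omp.h>

    #define IDX(i, j, k) ((size_t)(i) * ny * nz + (size_t)(j) * nz + (k))

    void step(const double *u, double *un, int nx, int ny, int nz, double c)
    {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                for (int k = 1; k < nz - 1; k++)
                    un[IDX(i,j,k)] = u[IDX(i,j,k)] + c *
                        (u[IDX(i+1,j,k)] + u[IDX(i-1,j,k)] +
                         u[IDX(i,j+1,k)] + u[IDX(i,j-1,k)] +
                         u[IDX(i,j,k+1)] + u[IDX(i,j,k-1)] - 6.0 * u[IDX(i,j,k)]);
    }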

Gilbert Accary, Oleg Bessonov, Dominique Fougère, Sofiane Meradji, Dominique Morvan
A High-Level Toolkit for Development of Distributed Scientific Applications

The paper presents the IARnet toolkit, a set of high-level tools and services simplifying the integration of software resources into a distributed computing environment and the development of distributed applications involving dynamic discovery and composition of resources. A case study of using IARnet for solving large-scale discrete optimization problems is discussed.

Alexander Afanasiev, Oleg Sukhoroslov, Mikhail Posypkin
Orthogonal Organized Finite State Machine Application to Sensor Acquired Information

The application of the Orthogonal Organized Finite State Machine (OOFSM) to the representation of data acquired by sensor networks is proposed. The OOFSM was proposed in earlier work; it is succinctly reviewed here. The approach and representation of the OOFSM for sensor-acquired data is formalized. The usefulness of this OOFSM application is illustrated by several case studies, specifically gradients, contouring and discrete trajectory path determination. In addition, this paper informally discusses the OOFSM as a cellular automaton.

Brian J. d’Auriol, John Kim, Sungyoung Lee, Young-Koo Lee
Parallel Broadband Finite Element Time Domain Algorithm Implemented to Dispersive Electromagnetic Problem

The numerical analysis of some broadband electromagnetic fields and frequency-dependent materials using a time domain method is the main subject of this paper. The spatial and time-dependent distribution of the electromagnetic field is approximated by the finite element method. The parallel form of the algorithm valid for some linear materials, and the formulation of the FE code for a dispersive electromagnetic problem, are presented and compared. The complex forms of these algorithms have an effect on the memory and computational costs of the distributed formulation. The properties of the algorithm are estimated using a high-performance cluster of workstations.

Boguslaw Butrylo
Strategies for Development of a Parallel Program for Protoplanetary Disc Simulation

Protoplanetary disc simulation must be done, first, with high precision and, second, with high speed. Some strategies to reach these goals are presented in the paper. They include: the reduction of the 3D protoplanetary disc model to quasi-3D, the use of the fundamental solution of the Poisson equation, simulation in the natural (cylindrical) coordinate system, and computational domain decomposition. The domain decomposition strategy is shown to meet the simulation goals best.

Sergei Kireev, Elvira Kuksheva, Aleksey Snytnikov, Nikolay Snytnikov, Vitaly Vshivkov
Generation of SMACA and Its Application in Web Services

A Web search engine uses forward indexing and inverted indexing as part of its functional design. This indexing mechanism helps retrieve data from the database based on user queries. In this paper, an efficient solution to the indexing problem is proposed with the introduction of Nonlinear Single Cycle Multiple Attractor Cellular Automata (SMACA). The work also shows the generation of SMACA using a specific rule sequence. The searching mechanism has linear time complexity.

Anirban Kundu, Ruma Dutta, Debajyoti Mukhopadhyay
Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications

The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean time between failures (MTBF). Therefore, hardware failures must be tolerated to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery are very useful techniques to implement fault-tolerant applications. Although extensive research has been carried out in this field, there are few available tools to help parallel programmers enhance their applications with fault tolerance support. This work presents an experience in endowing two large MPI scientific applications with fault tolerance: an air quality simulation model and a crack growth analysis. A fault-tolerant solution has been implemented by means of a checkpointing and recovery tool, the CPPC framework. Detailed experimental results are presented to show the practical usefulness and low overhead of this checkpointing approach.
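For readers unfamiliar with the underlying idea, the sketch below shows the generic application-level checkpoint/restart pattern that tools of this kind automate for MPI codes; it is illustrative only and does not use the CPPC API, and the file layout and the saved state (an array plus an iteration counter) are assumptions.

    /* Illustrative sketch, not CPPC: per-rank periodic checkpointing of the
     * minimal state needed to restart an iterative MPI application.
     * `rank` is the value returned by MPI_Comm_rank. */
    #include <stdio.h>

    static void checkpoint(int rank, int iter, const double *state, int n)
    {
        char name[64];
        snprintf(name, sizeof name, "ckpt_rank%d.dat", rank);
        FILE *f = fopen(name, "wb");
        if (!f) return;
        fwrite(&iter, sizeof iter, 1, f);
        fwrite(state, sizeof *state, n, f);
        fclose(f);
    }

    /* Returns the iteration to resume from (0 if no checkpoint exists). */
    static int restore(int rank, double *state, int n)
    {
        char name[64];
        snprintf(name, sizeof name, "ckpt_rank%d.dat", rank);
        FILE *f = fopen(name, "rb");
        if (!f) return 0;
        int iter = 0;
        fread(&iter, sizeof iter, 1, f);
        fread(state, sizeof *state, n, f);
        fclose(f);
        return iter;
    }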

G. Rodríguez, P. González, M. J. Martín, J. Touriño
Study of 3D Dynamics of Gravitating Systems Using Supercomputers: Methods and Applications

We describe a parallel numerical code for solving problems of stellar dynamics. The code is based on the numerical solution of the Poisson and Vlasov equations in cylindrical coordinates using the particle-in-cell method, and is designed for supercomputers with distributed memory. We consider different possible parallelization strategies according to the initial technical parameters of the numerical methods and the physical conditions of the model. We present results of numerical simulations for the following problems of stellar dynamics: investigation of the influence of a central potential on the vertical motions of a thin gravitating disk; stability of a uniform sphere with an anisotropic velocity distribution; and numerical approximation of equilibrium states of gravitating systems.

Nikolay Snytnikov, Vitaly Vshivkov, Valery Snytnikov
Transient Mechanical Wave Propagation in Semi-infinite Porous Media Using a Finite Element Approach with Domain Decomposition Technology

In this paper, the authors propose a numerical investigation in the time domain of the mechanical wave propagation due to an impulsional load on a semi-infinite soil. The ground is modelled as a porous saturated viscoelastic medium involving the complete Biot theory. An accurate and efficient Finite Element Method using a matrix-free technique is used. Two parallel algorithms are used: Geometrical Domain Decomposition (GDD) and Algebraic Decomposition (AD). Numerical results show that the GDD algorithm achieves the best execution time. Physical numerical results present the displacements of the fluid and solid particles over the surface and in depth.

Andrey Terekhov, Arnaud Mesgouez, Gaelle Lefeuve-Mesgouez
The Location of the Gene Regions Under Selective Pressure: Plato Algorithm Parallelization

The number of sequenced genes is increasing dramatically with the number of international genomic projects. The gene sequence information has proved helpful in predictions of protein structure and protein function and in mutations targeted at improving the biological and biotechnological properties of proteins. Processing the immense information stored in the databases demands high-throughput computational approaches. Here, we parallelized the algorithm for the analysis of nucleotide substitutions in gene sequences from different organisms previously implemented in the PLATO program. The results demonstrate that the parallelization provides linear speedup of the PLATO program.

Yuri Vyatkin, Konstantin Gunbin, Alexey Snytnikov, Dmitry Afonnikov

Techniques for Parallel Programming Support

Object Serialization and Remote Exception Pattern for Distributed C++/MPI Application

MPI is a commonly used standard in the development of scientific applications. It focuses on inter-language operability and is not very well suited to object-oriented programming. The paper proposes a general pattern enabling the design of distributed, object-oriented applications, and presents sample implementations and performance tests.

Karol Bańczyk, Tomasz Boiński, Henryk Krawczyk
Improving Job Scheduling Performance with Dynamic Replication Strategy in Data Grids

Dealing with large amounts of data in Data Grids makes the requirement for efficient data access more critical. In this paper, we propose a new approach to the replication problem by organizing the data into the categories it belongs to. This organization helps improve the placement strategy of data replication. We study our approach in combination with the scheduling issue and evaluate it through simulation. The results show that our strategy improves scheduling performance by 30%.

Nguyen Dang Nhan, Soon Wook Hwang, Sang Boem Lim
Address-Free All-to-All Routing in Sparse Torus

In this work we present a simple network design for all-to-all routing and study deflection routing on it. We present a time-scheduled routing algorithm where packets are routed address-free. We show that a total exchange relation, where every processor has a packet to route to every other processor, can be routed with routing cost of 1/2 + o(1) time units per packet.

The network consists of an n-sided, d-dimensional torus, where the n^(d-1) processor (or input/output) nodes are sparsely but regularly situated among n^d - n^(d-1) deflection routing nodes, having d input and d output links. The finite-state routing nodes change their states by a fixed, preprogrammed pattern.

Risto Honkanen, Ville Leppänen, Martti Penttonen
On the Parallel Technologies of Conjugate and Semi-conjugate Gradient Methods for Solving Very Large Sparse SLAEs

Parallel technologies for iteratively solving symmetric and nonsymmetric systems of linear algebraic equations (SLAEs) with very large sparse matrices by means of conjugate and semi-conjugate gradient methods are described. The computational performance for various matrix formats (diagonal, compressed sparse row/column) and different numbers of degrees of freedom of the SLAEs is analysed. The results of experimental measurements under OpenMP, MPI and hybrid systems are presented and discussed.
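As background for the matrix formats mentioned above, the sketch below (illustrative, not the authors' code) shows the kernel that dominates conjugate-gradient-type solvers: a sparse matrix-vector product in compressed sparse row (CSR) format parallelized with OpenMP. The array names are the conventional CSR fields, not the paper's data structures.

    /* Illustrative sketch: OpenMP-parallel y = A*x with A stored in CSR. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                sum += val[j] * x[col_idx[j]];
            y[i] = sum;
        }
    }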

Valery P. Ilin, Dasha V. Knysh
TRES-CORE: Content-Based Retrieval Based on the Balanced Tree in Peer to Peer Systems

Most existing Peer-to-Peer (P2P) systems support name-based retrieval and provide very limited support for full-text search of document contents. In this paper, we present a scheme (TRES-CORE) to support content-based retrieval. First, we propose a tree structure to organize data objects in vector format in the P2P system; the tree is height-balanced so that the time complexity of search can be decreased. Second, we give a simple strategy for the placement of the tree's nodes, which guarantees both load balancing and fault tolerance. Then an efficient query policy is given. Besides a theoretical analysis that proves the correctness of our scheme, a simulation-based study is carried out to evaluate its performance under various scenarios. The study shows that the content-based retrieval scheme (TRES-CORE) is more accurate and more efficient than some other schemes in P2P systems.

Hai Jin, Jie Xu
Efficient Race Verification for Debugging Programs with OpenMP Directives

Races must be detected when debugging parallel programs with OpenMP directives because they may cause unintended nondeterministic results. The previous race-detection tool does not verify the existence of races in programs with no internal nondeterminism, because it regards nested sibling threads as ordered threads and may ignore accesses involved in races in program models with synchronization such as critical sections. This paper suggests an efficient tool that verifies the existence of races with optimal performance by applying race detection engines for labeling and a detection protocol. The labeling scheme generates a unique identifier for each parallel thread created during a program execution, and the protocol detects at least one race if any exist. This tool verifies the existence of races over 250 times faster on average than the previous tool, even when the maximum parallelism increases with a fixed number of total accesses, using a set of synthetic programs without synchronization such as critical sections.
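For readers unfamiliar with the problem such tools address, here is a minimal example (not from the paper) of an OpenMP data race of the kind that must be detected: both threads perform an unsynchronized read-modify-write on the same variable, and the commented reduction clause is one way to remove the race.

    /* Illustrative sketch: a minimal OpenMP data race on `sum`. */
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        #pragma omp parallel for        /* racy; fix: add reduction(+:sum) */
        for (int i = 0; i < 1000000; i++)
            sum += 1.0 / (i + 1);
        printf("%f\n", sum);            /* result may vary from run to run */
        return 0;
    }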

Young-Joo Kim, Mun-Hye Kang, Ok-Kyoon Ha, Yong-Kee Jun
Adaptive Scheduling and Resource Assessment in GRID

The problems of scheduling computations in GRID and optimal usage of GRID resources from the client side are considered. The general cost functional for GRID scheduling is defined. The cost function is then used to define a scheduling policy based on the Simultaneous Perturbation Stochastic Optimization algorithm, which is used because of its fast convergence in multidimensional noisy systems. The proposed technique is being implemented for brokering in the GPE4GTK environment to compare it with other techniques.

Veniamin Krasnotcshekov, Alexander Vakhitov
Dynamic Load Balancing of Black-Box Applications with a Resource Selection Mechanism on Heterogeneous Resources of the Grid

In this paper we address the critical issues of efficient resource management and high-performance parallel distributed computing on the Grid by introducing a new hierarchical approach that combines user-level job scheduling with a dynamic load balancing technique that automatically adapts a black-box distributed or parallel application to the heterogeneous resources. The algorithm developed dynamically selects the resources best suited for a particular task or parallel process of the executed application, and optimizes the load balance based on the dynamically measured resource parameters and estimated requirements of the application. We describe the proposed algorithm for automated load balancing, paying attention to the influence of resource heterogeneity metrics, demonstrate the speedup achieved with this technique for different types of applications and resources, and propose a way to extend the approach to a wider class of applications.

Valeria V. Krzhizhanovskaya, Vladimir V. Korkhov
A Novel Algorithm of Optimal Matrix Partitioning for Parallel Dense Factorization on Heterogeneous Processors

In this paper, we present a novel algorithm of optimal matrix partitioning for parallel dense matrix factorization on heterogeneous processors based on their constant performance model. We prove the correctness of the algorithm and estimate its complexity. We demonstrate that this algorithm better suits extensions to more complicated, non-constant, performance models of heterogeneous processors than traditional algorithms.
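To illustrate the constant performance model the abstract refers to, the sketch below (not the authors' algorithm) shows the baseline idea such partitioners refine: distributing b equal matrix panels among p heterogeneous processors in proportion to their relative speeds s[i], with leftover panels handed to the processors whose per-panel speed ratio stays highest. The names and the rounding heuristic are assumptions.

    /* Illustrative sketch: proportional partitioning under a constant
     * performance model; share[i] panels are assigned to processor i. */
    void partition(int p, const double *s, int b, int *share)
    {
        double total = 0.0;
        for (int i = 0; i < p; i++) total += s[i];

        int assigned = 0;
        for (int i = 0; i < p; i++) {
            share[i] = (int)(b * s[i] / total);   /* round down */
            assigned += share[i];
        }
        /* distribute the remaining panels one by one */
        for (int r = assigned; r < b; r++) {
            int best = 0;
            for (int i = 1; i < p; i++)
                if (s[i] / (share[i] + 1) > s[best] / (share[best] + 1))
                    best = i;
            share[best]++;
        }
    }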

Alexey Lastovetsky, Ravi Reddy
Parallel Pseudorandom Number Generator for Large-Scale Monte Carlo Simulations

A parallel random number generator is presented for performing large-scale distributed Monte Carlo simulations. The generator's quality was verified using statistically rigorous tests; special problems with known solutions were also used for testing. A description of the program system MONC for large-scale distributed Monte Carlo simulations is also given.

Mikhail Marchenko
Dynamic Job Scheduling on the Grid Environment Using the Great Deluge Algorithm

The utilization of the computational Grid processor network has become a common way for researchers and scientists without access to local processor clusters to obtain the benefits of parallel processing for compute-intensive applications. This demand requires effective and efficient dynamic allocation of available resources. Although static scheduling and allocation techniques have proved effective, the dynamic nature of the Grid requires innovative techniques for reacting to change and maintaining stability for users. The dynamic scheduling process requires quite powerful optimization techniques, which can themselves lack the performance required in reaction time for achieving an effective schedule solution; often there is a trade-off between solution quality and the speed of achieving a solution. This paper presents an extension of a technique used in optimization and scheduling which can provide the means of achieving this balance and improves on similar approaches currently published.

Paul McMullan, Barry McCollum
Parallelism Granules Aggregation with the T-System

The T-system is a tool for parallel computing developed at PSI RAS. The most recent implementation is available on both Linux and Windows platforms. The paper is dedicated to one important aspect of the T-system: the ability to change the parallelism granule size at runtime. The technique is available primarily for recursive programs, but it is possible to extend it to non-recursive ones as well. In the latter case, we employ C++ template "traits" for program transformation. The technique is shown to dramatically reduce the overhead incurred by the runtime support library.

Alexander Moskovsky, Vladimir Roganov, Sergei Abramov
Toward a Distributed Implementation of OpenMP Using CAPE

Traditionally, checkpointing techniques have been used to secure the execution of sequential and parallel programs. This article shows that checkpointing techniques can also be used to automatically generate a parallel program from a sequential program, the resulting program being executable on any kind of distributed parallel system. The article also presents how this new technique has been included in the usual compilation chain to provide a distributed implementation of OpenMP. Finally, some performance measurements are discussed.

Éric Renault
Multicriteria Scheduling Strategies in Scalable Computing Systems

An approach to the generation and optimization of scheduling and resource allocation strategies in scalable computing systems is proposed. The approach allows the problem of multicriteria strategy synthesis to be decomposed over a set of parameterized program models, with the use of partial and vector quality criteria including, for instance, a cost function and load balancing factors.

Victor Toporkov
Latencies of Conflicting Writes on Contemporary Multicore Architectures

This paper provides a detailed investigation of latency penalties caused by repeated memory writes to nearby memory cells from different threads in parallel programs. When such writes map to the same corresponding cache lines in multiple processors, one can observe the so-called false sharing effect. This effect can unnecessarily hamper parallel code due to the line-granularity-based cache hierarchy, which is common on contemporary processor architectures. In this contribution, a benchmark allowing for quantitative estimates of the consequences of the false sharing effect is presented. Results show that multicore architectures with shared cache can reduce unwanted effects of false sharing.
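The following sketch (illustrative, not the paper's benchmark) shows the effect being measured: two threads incrementing adjacent counters that typically land on the same cache line versus counters padded onto separate lines; the 64-byte line size is an assumption.

    /* Illustrative sketch: false sharing vs. padded counters with OpenMP. */
    #include <omp.h>
    #include <stdio.h>

    #define ITER 100000000L
    #define LINE 64

    struct padded { volatile long v; char pad[LINE - sizeof(long)]; };

    static volatile long shared_ctr[2];   /* adjacent: usually one cache line */
    static struct padded padded_ctr[2];   /* 64 bytes apart: separate lines   */

    static double run(volatile long *a, volatile long *b)
    {
        double t = omp_get_wtime();
        #pragma omp parallel sections
        {
            #pragma omp section
            for (long i = 0; i < ITER; i++) (*a)++;
            #pragma omp section
            for (long i = 0; i < ITER; i++) (*b)++;
        }
        return omp_get_wtime() - t;
    }

    int main(void)
    {
        double t_false = run(&shared_ctr[0], &shared_ctr[1]);
        double t_pad   = run(&padded_ctr[0].v, &padded_ctr[1].v);
        printf("false sharing: %.2f s, padded: %.2f s\n", t_false, t_pad);
        return 0;
    }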

Josef Weidendorfer, Michael Ott, Tobias Klug, Carsten Trinitis
A Novel Self-Similar (S2) Traffic Filter to Enhance E-Business Success by Improving Internet Communication Channel Fault Tolerance

Internet traffic patterns can cause serious buffer overflow in electronic business (e-business) systems. This leads to widespread retransmission that prolongs the service roundtrip time (RTT). As a result, customers are unhappy and avoid returning to do more business. The previous Real-Time Traffic Pattern Detector (RTPD) was proposed to improve Internet channel fault tolerance. With RTPD support, time-critical applications can identify the traffic patterns and invoke the corresponding measures to neutralize their ill effects in a dynamic manner. The extant RTPD, however, cannot detect self-similar traffic. This inspired the development of the novel self-similarity (S2) filter proposed in this paper, which makes the RTPD capability more complete. The "RTPD + S2" package is the enhanced RTPD or ERTPD package. The test results indicate that the addition of the S2 mechanism can indeed contribute to improved e-business communication channels over the Internet.

Allan K. Y. Wong, Wilfred W. K. Lin, Tharam S. Dillon, Jackei H. K. Wong
Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD

We propose an approach to speed up the singular value decomposition (SVD) of very large rectangular matrices using the CSX600 floating-point coprocessor. The CSX600-based acceleration board we use offers 50 GFLOPS of sustained performance, which is many times greater than that provided by standard microprocessors. However, this performance can be achieved only when a vendor-supplied matrix-matrix multiplication routine is used and the matrix size is sufficiently large. In this paper, we optimize two of the major components of rectangular SVD, namely, QR decomposition of the input matrix and back-transformation of the left singular vectors by the matrix Q, so that large-size matrix multiplications can be used efficiently. In addition, we use the Integrable SVD algorithm to compute the SVD of an intermediate bidiagonal matrix. This helps to further speed up the computation and reduce the memory requirements. As a result, we achieved up to 3.5 times speedup over the Intel Math Kernel Library running on a 3.2 GHz Xeon processor when computing the SVD of a 100,000 × 4000 matrix.

Yusaku Yamamoto, Takeshi Fukaya, Takashi Uneyama, Masami Takata, Kinji Kimura, Masashi Iwasaki, Yoshimasa Nakamura
Parallel Dynamic SPT Update Algorithm in OSPF

Shortest-Path-Tree (SPT) computation, as the main load in the OSPF protocol, contributes to the slow convergence time of intra-domain routing. With the increasing interest in upcoming routers with multi-core processing boards, efficient parallel routing algorithms are required to exploit this advantage and speed up SPT computation in order to meet the needs of fast failure recovery applications such as VoIP. However, currently available parallel SPT algorithms are all based on the static method, which re-computes the entire tree for each link change. In this paper, we explore parallel algorithms for dynamic SPT update, a more efficient method which only updates the affected nodes by making use of the previous SPT. We first analyze the characteristics of the dynamic method to show how they affect parallel design; then we give our parallel dynamic SPT algorithm framework, which uses (1) a parallel distance-updating mode, to get a near-linear speedup (assuming perfect load balance), and (2) a group-removal scheme, to reduce communication cost. Further, to provide load balance, we give a task distribution algorithm called RR_DFS, which makes use of the topology information of the previous SPT. Complexity analysis and simulation results are also presented.

Yuanbo Zhu, Mingwei Xu, Qian Wu

Cellular Automata

Pedestrian and Crowd Dynamics Simulation: Testing SCA on Paradigmatic Cases of Emerging Coordination in Negative Interaction Conditions

The paper presents a set of theoretical experiments performed to evaluate the Situated Cellular Agent (SCA) approach within the pedestrian dynamics research context. SCA is a modeling and simulation approach based on Multi Agent Systems principles that derives from Cellular Automata. In particular, we focus on two emerging phenomena (freezing by heating and lane formation) that have been empirically observed and already modeled by analytical particle-based models and Cellular Automata-based models.

Stefania Bandini, Mizar Luca Federici, Sara Manzoni, Giuseppe Vizzari
Coarse-Grained Parallelization of Cellular-Automata Simulation Algorithms

Simulating spatial dynamics in physics by Cellular Automata (CA) requires very large computation power, and, hence, CA simulation algorithms have to be implemented on multiprocessors. The preconceived opinion that not much effort is required to obtain a highly efficient coarse-grained parallel CA algorithm is not always true. In fact, a great variety of CA modifications coming into practical use need appropriate, sometimes sophisticated, methods of parallel CA implementation. Proceeding from the above, a general approach to CA parallelization, based on domain decomposition correctness conditions, is formulated. Starting from the correctness conditions, particular parallelization methods are developed for the main classes of CA simulation models: synchronous CA with multi-cell updating rules, asynchronous probabilistic CA, and CA compositions. Examples and experimental results are given for each case.
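As a reference point for the domain decomposition discussed above, the sketch below (illustrative, not from the paper) shows the usual MPI skeleton of one synchronous step of a 1D two-state CA distributed over ranks, with ghost cells exchanged before each local update; the rule lookup table and buffer layout are assumptions.

    /* Illustrative sketch: one synchronous step of a 1D two-state CA.
     * Each rank owns cells[1..n]; cells[0] and cells[n+1] are ghost cells. */
    #include <mpi.h>

    void ca_step(unsigned char *cells, unsigned char *next, int n,
                 int rank, int size, const unsigned char rule[8])
    {
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        MPI_Sendrecv(&cells[1],   1, MPI_UNSIGNED_CHAR, left,  0,
                     &cells[n+1], 1, MPI_UNSIGNED_CHAR, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cells[n],   1, MPI_UNSIGNED_CHAR, right, 1,
                     &cells[0],   1, MPI_UNSIGNED_CHAR, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 1; i <= n; i++)   /* local update with the 3-cell rule */
            next[i] = rule[(cells[i-1] << 2) | (cells[i] << 1) | cells[i+1]];
        for (int i = 1; i <= n; i++)
            cells[i] = next[i];
    }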

Olga Bandman
Cellular Automata Models for Complex Matter

Complex matter comes in various forms, from granular matter, soft matter, and fluid-fluid or solid-fluid mixtures to compact heterogeneous materials. Cellular automata models provide a suitable and powerful tool for capturing the influence of the microscopic scale on the macroscopic behaviour of these complex systems. Rather than a survey, this paper attempts to bring out the main concepts underlying these models and to give an insight for future work.

Dominique Désérable, Pascal Dupont, Mustapha Hellou, Siham Kamali-Bernard
Hysteresis in Oscillatory Behaviour in CO Oxidation Reaction over Pd(110) Revealed by Asynchronous Cellular Automata Simulation

The dynamic behaviour of the CO oxidation reaction over Pd(110) has been studied by means of probabilistic asynchronous cellular automata (dynamic Monte Carlo). The influence of the internal parameters on the shapes of surface concentration waves obtained in simulations under limited surface diffusion intensity has been studied. Hysteresis in the oscillatory behaviour has been found under step-by-step variation of the oxygen partial pressure: two different oscillatory regimes can exist at one and the same reaction parameters. The parameters of the oscillations (amplitude, period and the shape of spatio-temporal patterns on the surface) depend on the kinetic prehistory of the system. The possibility of the appearance of cellular and turbulent patterns, as well as spiral, target and stripe oxygen waves, on the surface in the cases under study has been shown.

Vladimir Elokhin, Andrey Matveev, Vladimir Gorodetskii, Evgenii Latkin
CAOS: A Domain-Specific Language for the Parallel Simulation of Cellular Automata

We present the design and implementation of CAOS, a domain-specific high-level programming language for the parallel simulation of extended cellular automata. CAOS allows scientists to specify complex simulations with limited programming skills and effort. Yet the CAOS compiler generates efficiently executable code that automatically harnesses the potential of contemporary multi-core processors, shared memory multiprocessors, workstation clusters and supercomputers.

Clemens Grelck, Frank Penczek, Kai Trojahner
Parallel Hardware Architecture to Simulate Movable Creatures in the CA Model

The general question of our investigation is: how can the simulation of moving objects (or agents) in a cellular automaton (CA) be accelerated by hardware architectures? We exemplify our approach using the creatures' exploration problem: n creatures are moving around in an unknown environment in order to visit all cells in the shortest time. This problem is modeled as a CA because this model is massively parallel and therefore it can be perfectly supported by hardware (FPGA technology). We need a very fast simulation because we want to observe and evaluate the collaborative performance for different numbers of creatures, different behaviors of the creatures and for many different environments. As a main result from these simulations and evaluations we expect to find the best algorithms which can fulfill the task with the lowest work units (generations × creatures). In this contribution we have investigated the question of how the creatures' exploration problem can be accelerated in hardware with a minimum of hardware resources. We have designed and evaluated five different architectures that vary in the combination or separation of the logic for the environment, for the creatures and for the collision detection. A speedup in the range of thousands compared to software can be reached using an architecture which separates the environment from the creatures and makes use of the memory banks embedded in the FPGA.

Mathias Halbach, Rolf Hoffmann
Comparison of Evolving Uniform, Non-uniform Cellular Automaton, and Genetic Programming for Centroid Detection with Hardware Agents

Current industrial applications require fast and robust image processing in systems with low size and power dissipation. One of the main tasks in industrial vision is fast detection of centroids of objects. This paper compares three different approaches for finding geometric algorithms for centroid detection which are appropriate for a fine-grained parallel hardware architecture in an embedded vision chip. The algorithms shall comprise emergent capabilities and high problem-specific functionality without requiring large amounts of states or memory. For that problem, we consider uniform and non-uniform cellular automata (CA) as well as Genetic Programming. Due to the inherent complexity of the problem, an evolutionary approach is applied. The appropriateness of these approaches for centroid detection is discussed.

Marcus Komann, Andreas Mainka, Dietmar Fey
Associative Version of Italiano’s Decremental Algorithm for the Transitive Closure Problem

We propose a natural implementation of Italiano's algorithm for updating the transitive closure of directed graphs after deletion of an edge on a model of associative (content addressable) parallel systems with vertical processing (the STAR-machine). The associative version of Italiano's decremental algorithm is given as procedure DeleteArc, whose correctness is proved and time complexity is evaluated. We compare implementations of Italiano's decremental algorithm and its associative version and enumerate the main advantages of the associative version.

Anna Nepomniaschaya
Support for Fine-Grained Synchronization in Shared-Memory Multiprocessors

It has already been verified that hardware-supported fine-grain synchronization provides a significant performance improvement over coarse-grained synchronization mechanisms, such as barriers. Support for fine-grain synchronization on individual data items becomes notably important in order to efficiently exploit the thread-level parallelism available on multi-threading and multi-core processors. Fine-grained synchronization can be achieved using a full/empty tagged shared memory. We define the complete set of synchronizing memory instructions as well as the architecture of the full/empty tagged shared memory that provides support for these operations. We develop a snoopy cache coherency protocol for an SMP with the centralized full/empty tagged memory.
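To illustrate the full/empty semantics the abstract relies on, here is a software sketch (not the proposed hardware) emulating a full/empty-tagged word with C11 atomics: writeef waits until the word is empty, writes, and marks it full; readfe waits until it is full, reads, and marks it empty. The busy-wait and the intermediate "busy" state are simplifications of what hardware tags provide.

    /* Illustrative sketch: software emulation of one full/empty-tagged word. */
    #include <stdatomic.h>

    typedef struct {
        _Atomic int tag;   /* 0 = empty, 1 = full, -1 = being updated */
        int value;
    } fe_word_t;

    void writeef(fe_word_t *w, int v)      /* write when empty, leave full */
    {
        int expected;
        do { expected = 0; }
        while (!atomic_compare_exchange_weak(&w->tag, &expected, -1));
        w->value = v;
        atomic_store(&w->tag, 1);
    }

    int readfe(fe_word_t *w)               /* read when full, leave empty */
    {
        int expected;
        do { expected = 1; }
        while (!atomic_compare_exchange_weak(&w->tag, &expected, -1));
        int v = w->value;
        atomic_store(&w->tag, 0);
        return v;
    }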

Vladimir Vlassov, Oscar Sierra Merino, Csaba Andras Moritz, Konstantin Popov
Self-organised Criticality in a Model of the Rat Somatosensory Cortex

Large Hodgkin-Huxley (HH) neural networks were examined, and the structures discussed in this article simulate a part of the rat somatosensory cortex. We used a modular architecture of the network divided into layers and sub-regions. Because of the high degree of complexity, effective parallelisation of the algorithms was required. The results of parallel simulations are presented, and an occurrence of self-organised criticality (SOC) is demonstrated. Most notably, in large biological neural networks consisting of artificial HH neurons, the SOC was shown to manifest itself in the frequency of its appearance as a function of the size of spike potential avalanches generated within such nets. These two parameters followed the power law characteristic of other systems exhibiting SOC behaviour.

Grzegorz M. Wojcik, Wieslaw A. Kaminski, Piotr Matejanka
Control of Fuzzy Cellular Automata: The Case of Rule 90

This paper is dedicated to the study of fuzzy rule 90 in relation with control theory. The dynamics and global evolution of fuzzy rules have been recently investigated and some interesting results have been obtained in [10,15,16]. The long term evolution of all 256 one-dimensional fuzzy cellular automata (FCA) has been determined using an analytical approach. In this paper we are interested in the FCA state at a given time and ask whether it can coincide with a desired state by controlling only the initial condition. We investigate two initial states: a single control value u on a background of zeros, and one seed adjacent to the controlled site in a background of zeros.

Samira El Yacoubi, Angelo B. Mingarelli

Methods and Tools of Parallel Programming of Multicomputers

Intensive Atmospheric Vortices Modeling Using High Performance Cluster Systems

The goal of the paper is the development of a scalable parallel program calculating the numerical solution of the system of equations modelling the processes and origin conditions of intensive atmospheric vortices (IAV) in a 3D compressible atmosphere according to the theory of mesovortice turbulence by Nikolaevskiy. The original system of non-linear equations and its initial and boundary conditions are discussed. The structure of a parallel program for a high-performance cluster is developed, and the problems of optimizing the program in order to increase its scalability are studied. Finally, the results of numerical computations are discussed.

Arutyun I. Avetisyan, Varvara V. Babkova, Sergey S. Gaissaryan, Alexander Yu. Gubar
Dynamic Strategy of Placement of the Replicas in Data Grid

Grid computing is a type of parallel and distributed system that is designed to provide pervasive and reliable access to data and computational resources over wide area networks. Data Grids connect a collection of geographically distributed computers and storage resources located in different parts of the world to facilitate sharing of data and resources. These grids concentrate on reducing the execution time of applications that require a great number of processing cycles. In such an environment, these advantages are not possible without the use of replication, which is considered an important technique to reduce the cost of access to data in the grid. In this paper, we present our contribution to a cost model whose objective is to reduce the cost of access to replicated data. These costs depend on many factors such as bandwidth, data size, network latency and the number of read/write operations.

Ghalem Belalem, Farouk Bouhraoua
ISO: Comprehensive Techniques Toward Efficient GEN_BLOCK Redistribution with Multidimensional Arrays

Runtime data redistribution is usually required in parallel algorithms to enhance data locality, achieve dynamic load balancing and reduce remote data access on distributed memory multicomputers. In this paper, we present comprehensive techniques to implement GEN_BLOCK redistribution in parallelizing compilers, including Indexing schemes for communication set generation, a contention-free communication Scheduling algorithm and an Optimization technique for improving communication efficiency. Both theoretical analysis and experimental results show that the proposed techniques can efficiently perform GEN_BLOCK data redistribution during runtime.

Shih-Chang Chen, Ching-Hsien Hsu
A New Memory Slowdown Model for the Characterization of Computing Systems

Performance measurements were extensively conducted to characterize parallel computer systems by using modelling and experiments. By analyzing them, we corroborate that current models do not provide a precise memory characterization. After detailed observation of the results, we conclude that the performance slowdown is linear when using the main memory and exponential when using the virtual memory.

In this paper, we propose a characterization model composed of two regressions which represent the slowdown caused by memory usage. Experimental results confirm that the memory slowdown model improves the quality of computing system characterization, allowing simulations to be carried out and such results to be used to design real systems, minimizing project design costs.

Rodrigo Fernandes de Mello, Luciano José Senger, Kuan-Ching Li, Laurence Tianruo Yang
SCRF – A Hybrid Register File Architecture

In VLIW processor design, clustered architectures have become a popular solution for better hardware efficiency, but inter-cluster communication (ICC) causes execution-cycle overhead. In this paper, we propose a shared cluster register file (SCRF) architecture and a SCRF register allocation algorithm to reduce the ICC overhead. The SCRF architecture is a hybrid register file (RF) organization composed of a shared RF (SRF) and clustered RFs (CRFs). By putting the frequently used variables that need ICCs in the SRF, we can reduce the amount of data communication between clusters and thus reduce the ICC overhead. The SCRF register allocation algorithm exploits this architectural feature to perform optimization on ICC reduction and spill-code balancing; it is a heuristic based on graph coloring. To evaluate the performance of the proposed architecture and the SCRF register allocation algorithm, the frequently used two-cluster architectures with and without the SRF scheme are simulated on Trimaran. The simulation results show that the performance of the SCRF architecture is better than that of the clustered RF architecture for all test programs in all measured metrics.

Jer-Yu Hsu, Yan-Zu Wu, Xuan-Yi Lin, Yeh-Ching Chung
Model Based Performance Evaluation for MPI Programs

The paper considers a model of a parallel program which can be effectively interpreted on an instrumental computer, allowing for fairly exact prediction of the actual runtime of a parallel program on a specific parallel computing system. The model has been developed for parallel Java programs with explicit message exchange by means of the MPI library and is part of the ParJava environment. The model is derived by converting the program control tree, which, for a Java program, can be built by modifying the abstract syntax tree. Communication functions are modeled using the LogGP model, which allows the specific features of a distributed computational system to be taken into account.

Victor Ivannikov, Serguei Gaissaryan, Arutyun Avetisyan, Vartan Padaryan
Runtime System for Parallel Execution of Fragmented Subroutines

The architecture of a runtime system supporting parallel execution of fragmented library subroutines on multicomputers is proposed. The approach makes it possible to develop a library of parallel subroutines and to automatically provide their dynamic properties, such as dynamic load balancing. Usage of MPI for communication programming provides good portability of an application.

K. V. Kalgin, V. E. Malyshkin, S. P. Nechaev, G. A. Tschukin
Application of Simulation Approaches to Creation of Decision Support System for IT Service Management

The paper presents a simulation-based approach to the creation of a decision support system for IT Service Management. The presented approach includes monitoring of stochastic IT Service data and calculation of a measure of alignment to business goals, with its characteristics based on SLA/SLO. The approach combines the benefits of two kinds of models: analytical and simulation ones. The key idea of the paper is to demonstrate how modern methods of stochastic process analysis may enhance the trustworthiness and quality of decision making along business goals within IT Services.

Yuri G. Karpov, Rostislav I. Ivanovsky, Kirill A. Sotnikov
Using Analytical Models to Load Balancing in a Heterogeneous Network of Computers

An effective workload distribution plays a prime role in reducing the total execution time of a parallel application on heterogeneous environments, such as computational grids and heterogeneous clusters. Several methods have been proposed in the literature in the last decade. This paper presents two approaches to workload distribution based on analytical models developed for performance prediction of parallel applications, named PEMPIs VRP (Vector of Relative Performances). The workload is distributed based on relative performance ratios obtained from these models. In this work, we present two schemes, static and dynamic, in a research middleware for a heterogeneous network of computers. In the experimental tests we evaluated and compared them using two MPI applications. The results show that, using the VRP dynamic strategy, we can reduce the imbalance among the execution times of the processes, relative to the average time, from 25% to nearly 5%.

Jean M. Laine, Edson T. Midorikawa
Block-Based Allocation Algorithms for FLASH Memory in Embedded Systems

A flash memory has write-once and bulk-erase properties, so an intelligent allocation algorithm is essential to provide applications with efficient storage service. This paper first demonstrates that the online version of the FLASH allocation problem is difficult, since we can find an adversary that makes every online algorithm use as many blocks as a naive and inefficient algorithm. As a result we propose an offline allocation algorithm called Best Match (BestM) for allocating blocks in FLASH file systems. The experimental results indicate that BestM delivers better performance than the previously proposed First Re-arrival First Serve (FRFS) method.

Pangfeng Liu, Chung-Hao Chuang, Jan-Jan Wu
Variable Reassignment in the T++ Parallel Programming Language

The paper describes the OpenTS parallel programming system that provides the runtime environment for the T++ language. T++ is an extension of C++ that adds a set of keywords, allowing a smooth transition from sequential to parallel applications. In this context the support of repeated assignments to a variable is an important feature, and the paper focuses on the semantics and implementation of such variables in T++. Applications written in T++ can be run on computational clusters, SMPs and GRIDs, under either Linux or Windows.

Alexander Moskovsky, Vladimir Roganov, Sergei Abramov, Anton Kuznetsov
Parallel Construction of Moving Adaptive Meshes Based on Self-organization

A new highly parallelizable method of moving mesh construction based on Kohonen's Self-Organizing Maps (SOM) is proposed. This method belongs to a class of methods in which the mesh is an image, under an appropriate mapping, of a fixed mesh over a computational domain. Unlike the conventional methods of this class, the proposed method does not require solving complicated systems of nonlinear partial differential equations and is able to work with an arbitrary time-dependent mesh density function. The high efficiency of parallelization is due to the inherent parallelism of the underlying stochastic SOM algorithm. Sequential as well as detailed parallel algorithms for moving mesh construction are proposed.
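For readers unfamiliar with the SOM iteration underlying such methods, the sketch below (illustrative, not the authors' algorithm) shows one stochastic update step for a set of 2D mesh nodes: a point is sampled from the desired mesh density, the nearest node wins, and the winner and its neighbours are pulled toward the sample. The learning rate, the index-based neighbourhood and the node layout are assumptions.

    /* Illustrative sketch: one Kohonen SOM update step for mesh nodes. */
    #include <math.h>

    typedef struct { double x, y; } pt;

    void som_step(pt *node, int n_nodes, pt sample, double lr, double radius)
    {
        int w = 0;                              /* 1. find the winner node */
        double best = 1e300;
        for (int i = 0; i < n_nodes; i++) {
            double dx = node[i].x - sample.x, dy = node[i].y - sample.y;
            double d2 = dx * dx + dy * dy;
            if (d2 < best) { best = d2; w = i; }
        }
        for (int i = 0; i < n_nodes; i++) {     /* 2. move nodes toward it */
            double h = exp(-(double)(i - w) * (i - w) / (2.0 * radius * radius));
            node[i].x += lr * h * (sample.x - node[i].x);
            node[i].y += lr * h * (sample.y - node[i].y);
        }
    }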

Olga Nechaeva, Mikhail Bessmeltsev
Data Transfer in Advance on Cluster

Scientific applications are increasingly challenging computational platforms and software tools. In this scenario of growing performance demand, computer cluster users require mechanisms that can reduce data transfer delay. To address this question, we propose a data-transfer-in-advance mechanism that improves overall system performance by diminishing the wait of data-intensive jobs. We also designed and implemented the Integrated Scheduling System (ISS) to analyze and evaluate our proposal. This system automates the preparation, submission and tracking of job executions in a cluster. The mechanism is also combined with the idea of overlapping file I/O operations and computation, which results in a significant improvement of performance rates, as confirmed by experiments.

Nilton Cézar de Paula, Gisele da Silva Craveiro, Liria Matsumoto Sato
A Trust-Oriented Heuristic Scheduling Algorithm for Grid Computing

Security and reliability are major concerns in Grid computing systems. Trust mechanisms have been the focus of much research in recent years for providing a safe and reliable Grid computing environment. Based on the EigenTrust model, in this paper we extend the traditional job scheduling strategies and present a new algorithm named the Trust-Oriented Sufferage algorithm. Simulations are performed to evaluate the performance of the new algorithm.

Mingjun Sun, Guosun Zeng, Lulai Yuan, Wei Wang
3-Points Relationship Based Parallel Algorithm for Minimum Ultrametric Tree Construction

Constructing an evolutionary tree is an important topic in computational biology. An evolutionary tree can symbolize the relationships and histories of a set of species. Many models have been proposed to resolve these problems; however, most of them are NP-hard. The ultrametric tree is one of the most popular models; it is used by a well-accepted tree construction method, the Unweighted Pair Group Method with Arithmetic Mean, which is widely used by biologists to observe the relationships among species but is a heuristic algorithm. In this paper, we propose a 3-Points Relationship (3PR) based parallel algorithm to solve this problem. 3PR is a relationship between the distance matrix and the constructed evolutionary trees: for any triplet of species, the two species closer to each other in the distance matrix should be closer to each other in the evolutionary tree. We combine this property with a branch-and-bound strategy to reduce the computation time needed to obtain the optimal solution. Moreover, we put the lower-ranked paths determined by 3PR into a delay bound pool (DBP) to accelerate the algorithm execution; the DBP is a mechanism that stores lower-ranked paths and helps the algorithm find better bounding values quickly. The experimental results show that our proposed algorithm reduces the computation time compared with the algorithm without 3PR, and that 3PR also reduces the computation time as the number of computing nodes increases.

Kun-Ming Yu, Jiayi Zhou, Chun-Yuan Lin, Chuan Yi Tang
Load Balancing Approach Parallel Algorithm for Frequent Pattern Mining

Association rules mining from transaction-oriented databases is an important issue in data mining. Frequent patterns are crucial for association rule generation, time series analysis, classification, etc. Two categories of algorithms have been proposed: the candidate set generate-and-test approach (Apriori-like) and the pattern growth approach. Many methods have been proposed to solve the association rules mining problem based on the FP-tree instead of Apriori-like algorithms, since Apriori-like algorithms scan the database many times. However, even with the FP-tree data structure, the computation time is high when the database is large. Parallel and distributed computing is a good strategy in this circumstance. Some parallel algorithms have been proposed; however, most of them do not consider the load balancing issue. In this paper, we propose a parallel and distributed mining algorithm based on the FP-tree structure, the Load Balancing FP-Tree (LFP-tree). The algorithm divides the item set for mining by evaluating the tree's width and depth, and a simple and reliable formula for calculating the loading degree is proposed. The experimental results show that LFP-tree can reduce the computation time and has less idle time compared with the Parallel FP-Tree (PFP-tree). In addition, it has a better speed-up ratio than PFP-tree when the number of processors grows. The communication time can be reduced by keeping the heavily loaded items on their local computing node.

Kun-Ming Yu, Jiayi Zhou, Wei Chen Hsiao
Backmatter
Metadata
Title
Parallel Computing Technologies
Edited by
Victor Malyshkin
Copyright Year
2007
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-73940-1
Print ISBN
978-3-540-73939-5
DOI
https://doi.org/10.1007/978-3-540-73940-1
