DOI: 10.1145/3178487
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ACM2018 Proceeding
Publisher:
  • Association for Computing Machinery, New York, NY, United States
Conference:
PPoPP '18: 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vienna, Austria, February 24-28, 2018
ISBN:
978-1-4503-4982-6
Published:
10 February 2018
Abstract

It is our great pleasure to welcome you to PPoPP 2018, the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, held in Vienna, Austria during February 24-28, 2018, and co-located with the CGO 2018 and HPCA 2018 conferences. This year's symposium continues and reinforces the PPoPP tradition of publishing leading work on all aspects of parallel programming, including foundational and theoretical aspects, techniques, languages, compilers, runtime systems, tools, and practical experiences. Given the pervasiveness of parallel architectures in the general consumer market, PPoPP, with its interest in new parallel workloads, techniques, and productivity tools for parallel programming, is becoming more relevant than ever to the computer science community.

Interval-based memory reclamation

In this paper we present interval-based reclamation (IBR), a new approach to safe reclamation of disconnected memory blocks in nonblocking concurrent data structures. Safe reclamation is a difficult problem: a thread, before freeing a block, must ensure ...

Harnessing epoch-based reclamation for efficient range queries

Concurrent sets with range query operations are highly desirable in applications such as in-memory databases. However, few set implementations offer range queries. Known techniques for augmenting data structures with range queries (or operations that ...

research-article
A persistent lock-free queue for non-volatile memory

Non-volatile memory is expected to coexist with (or even displace) volatile DRAM for main memory in upcoming architectures. This has led to increasing interest in the problem of designing and specifying durable data structures that can recover from ...

research-article
Public Access
Superneurons: dynamic GPU memory management for training deep neural networks

Going deeper and wider in neural architectures improves their accuracy, while the limited GPU DRAM places an undesired restriction on the network design domain. Deep Learning (DL) practitioners either need to change to less desired network architectures,...

Juggler: a dependence-aware task-based execution framework for GPUs

Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, the existence of data dependences across thread blocks may ...

research-article
Open Access
HPVM: heterogeneous parallel virtual machine

We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our ...

research-article
Public Access
Hierarchical memory management for mutable state

It is well known that modern functional programming languages are naturally amenable to parallel programming. Achieving efficient parallelism using functional languages, however, remains difficult. Perhaps the most important reason for this is their ...

Bridging the gap between deep learning and sparse matrix format selection

This work presents a systematic exploration on the promise and special challenges of deep learning for sparse matrix format selection---a problem of determining the best storage format for a matrix to maximize the performance of Sparse Matrix Vector ...

Optimizing N-dimensional, winograd-based convolution for manycore CPUs

Recent work on Winograd-based convolution allows for a great reduction of computational complexity, but existing implementations are limited to 2D data and a single kernel size of 3 by 3. They can achieve only slightly better, and often worse ...

research-article
vSensor: leveraging fixed-workload snippets of programs for performance variance detection

Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer ...

Cache-tries: concurrent lock-free hash tries with constant-time operations

Concurrent non-blocking hash tries have good cache locality and horizontally scalable operations. However, operations on most existing concurrent hash tries run in O(log n) time.

In this paper, we show that the concurrent hash trie operations can run ...

Featherlight on-the-fly false-sharing detection

Shared-memory parallel programs routinely suffer from false sharing---a performance degradation caused by different threads accessing different variables that reside on the same CPU cacheline, where at least one of the variables is modified. State-of-the-art tools ...

Register optimizations for stencils on GPUs

The recent advent of compute-intensive GPU architecture has allowed application developers to explore high-order 3D stencils for better computational accuracy. A common optimization strategy for such stencils is to expose sufficient data reuse by means ...

research-article
Public Access
FlashR: parallelize and scale R for machine learning using SSDs

R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach for having an efficient algorithm in R is to implement it in C or FORTRAN and provide an R ...

research-article
DisCVar: discovering critical variables using algorithmic differentiation for transient faults

Aggressive technology scaling trends have made the hardware of high performance computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent data corruption (SDC), and represent a serious problem because they alter the HPC ...

research-article
Practical concurrent traversals in search trees

Operations of concurrent objects often employ optimistic concurrency-control schemes that consist of a traversal followed by a validation step. The validation checks if concurrent mutations interfered with the traversal to determine if the operation ...

research-article
Communication-avoiding parallel minimum cuts and connected components

We present novel scalable parallel algorithms for finding global minimum cuts and connected components, which are important and fundamental problems in graph processing. To take advantage of future massively parallel architectures, our algorithms are ...

research-article
Safe privatization in transactional memory

Transactional memory (TM) facilitates the development of concurrent applications by letting the programmer designate certain code blocks as atomic. Programmers using a TM often would like to access the same data both inside and outside transactions, ...

Making pull-based graph processing performant

Graph processing engines following either the push-based or pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth ...

An effective fusion and tile size model for optimizing image processing pipelines

Effective models for fusion of loop nests continue to remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with ...

research-article
Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation

Replicas of a vertex play an important role in existing distributed graph processing systems: they allow a single vertex to be processed in parallel by multiple machines, each of which can access remote neighbors locally without any remote access. However, replicas of ...

PAM: parallel augmented maps

Ordered (key-value) maps are an important and widely-used data type for large-scale data processing frameworks. Beyond simple search, insertion and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical ...

research-article
Efficient shuffle management with SCache for DAG computing frameworks

In large-scale data-parallel analytics, shuffle, or the cross-network read and aggregation of partitioned data between tasks with data dependencies, usually brings in large overhead. To reduce shuffle overhead, we present SCache, an open source plug-in ...

research-article
High-performance genomic analysis framework with in-memory computing

In this paper, we propose an in-memory computing framework (called GPF) that provides a set of genomic formats, APIs and a fast genomic engine for large-scale genomic data processing. Our GPF comprises two main components: (1) scalable genomic data ...

research-article
Griffin: uniting CPU and GPU in information retrieval systems for intra-query parallelism

Interactive information retrieval services, such as enterprise search and document search, must provide relevant results with consistent, low response times in the face of rapidly growing data sets and query loads. These growing demands have led ...

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the ...

VerifiedFT: a verified, high-performance precise dynamic race detector

Dynamic data race detectors are valuable tools for testing and validating concurrent software, but to achieve good performance they are typically implemented using sophisticated concurrent algorithms. Thus, they are ironically prone to the exact same ...

Efficient parallel determinacy race detection for two-dimensional dags

A program is said to have a determinacy race if logically parallel parts of a program access the same memory location and one of the accesses is a write. These races are generally bugs in the program since they lead to non-deterministic program behavior ...

poster
Public Access
Performance challenges in modular parallel programs

Over the past decade, many programming languages and systems for parallel computing have been developed, including Cilk, Fork/Join Java, Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction at ...

poster
Reducing the burden of parallel loop schedulers for many-core processors

This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the scheduling overhead of our techniques ...

Contributors
  • Vienna University of Technology
  • Swiss Federal Institute of Technology, Zurich

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions (23%)

Year        Submitted  Accepted  Rate
PPoPP '21   150        31        21%
PPoPP '20   121        28        23%
PPoPP '19   152        29        19%
PPoPP '17   132        29        22%
PPoPP '14   184        28        15%
PPoPP '07   65         22        34%
PPoPP '03   45         20        44%
PPoPP '99   79         17        22%
PPOPP '97   86         26        30%
Overall     1,014      230       23%