Performance evaluation of the Orca shared-object system

Published: 01 February 1998

Abstract

Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that Orca programs send fewer messages and less data than the TreadMarks and CRL programs and obtain better speedups.
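The write-update protocol with function shipping described above can be illustrated with a small sketch. This is a hypothetical Python model only, not the Orca/Panda implementation: the names TotallyOrderedBus, SharedObject, and incr are invented for illustration, and the in-process "broadcast" stands in for a real totally ordered group communication layer.

```python
# Hypothetical sketch of write-update with function shipping: a write
# operation (the function itself, plus its arguments) is broadcast in a
# single total order and re-executed at every replica, so all copies
# stay consistent. Reads run on the local copy with no communication.

class TotallyOrderedBus:
    """Stand-in for totally ordered group communication: every replica
    sees the same operations in the same sequence."""
    def __init__(self):
        self.replicas = []

    def broadcast(self, op, args):
        # A real system delivers via an ordering protocol (e.g. a
        # sequencer); here we simply apply to all replicas in turn.
        results = [r.apply(op, args) for r in self.replicas]
        return results[0]  # deterministic ops yield the same result everywhere

class SharedObject:
    def __init__(self, bus, state):
        self.state = dict(state)
        self.bus = bus
        bus.replicas.append(self)

    def apply(self, op, args):
        return op(self.state, *args)

    def write(self, op, *args):
        # Function shipping: send the operation, not the updated data.
        return self.bus.broadcast(op, args)

    def read(self, op, *args):
        # Reads use the local replica -- no messages.
        return op(self.state, *args)

def incr(state, key, amount):
    state[key] = state.get(key, 0) + amount
    return state[key]

bus = TotallyOrderedBus()
a = SharedObject(bus, {"x": 0})
b = SharedObject(bus, {"x": 0})   # second replica of the same object
a.write(incr, "x", 5)
assert a.read(lambda s, k: s[k], "x") == b.read(lambda s, k: s[k], "x") == 5
```

The point of shipping the function rather than the data is that one small broadcast updates every replica, which pays off when objects have a high read/write ratio.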

References

  1. AMZA, C., COX, A., DWARKADAS, S., AND ZWAENEPOEL, W. 1997. Software DSM protocols that adapt between single writer and multiple writer. In Proceedings of the Third International Symposium on High-Performance Computer Architecture. 261-271.
  2. BAL, H. AND TANENBAUM, A. 1988. Distributed programming with shared data. In Proceedings of the IEEE CS 1988 International Conference on Computer Languages. 82-91.
  3. BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RUHL, T., AND KAASHOEK, M. 1997a. Portability in the Orca shared object system. Tech. Rep., Vrije Universiteit.
  4. BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RUHL, T., AND VERSTOEP, K. 1997b. Performance of a high-level parallel language on a high-speed network. J. Parallel Distrib. Comput. 40, 1 (Jan.), 49-64.
  5. BAL, H., KAASHOEK, M., AND TANENBAUM, A. 1992. Orca: A language for parallel programming of distributed systems. IEEE Trans. Softw. Eng. 18, 3 (Mar.), 190-205.
  6. BEN HASSEN, S. AND BAL, H. 1996. Integrating task and data parallelism using shared objects. In Proceedings of the 10th ACM International Conference on Supercomputing. ACM, New York, 317-324.
  7. BENNETT, J., CARTER, J., AND ZWAENEPOEL, W. 1990. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 168-176.
  8. BERSHAD, B., ZEKAUSKAS, M., AND SAWDON, W. 1993. The Midway distributed shared memory system. In Proceedings of COMPCON. 528-537.
  9. BLUMRICH, M., DUBNICKI, C., FELTEN, E., LI, K., AND MESARINA, M. 1995. Virtual-memory mapped network interface. IEEE Micro 15, 1 (Feb.), 21-28.
  10. BODEN, N., COHEN, D., FELDERMAN, R., KULAWIK, A., SEITZ, C., SEIZOVIC, J., AND SU, W. 1995. Myrinet: A gigabit-per-second local area network. IEEE Micro 15, 1 (Feb.), 29-36.
  11. CARREIRA, J., SILVA, J., LANGENDOEN, K., AND BAL, H. 1997. Implementing tuple space with threads. In Proceedings of the International Conference on Parallel and Distributed Systems (Euro-PDS '97). 259-264.
  12. CASTRO, M., GUEDES, P., SEQUEIRA, M., AND COSTA, M. 1996. Efficient and flexible object sharing. In Proceedings of the 1996 International Conference on Parallel Processing. Vol. 1. 128-137.
  13. CHAIKEN, D., FIELDS, C., KURIHARA, K., AND AGARWAL, A. 1990. Directory-based cache coherence in large-scale multiprocessors. IEEE Comput. 23, 6 (June), 49-58.
  14. CHANDRA, S. AND LARUS, J. 1997. Optimizing communication in HPF programs on fine-grain distributed shared memory. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 100-111.
  15. COX, A., DWARKADAS, S., KELEHER, P., AND ZWAENEPOEL, W. 1994. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the Winter 1994 USENIX Conference. USENIX Assoc., Berkeley, Calif., 115-131.
  16. DIJKSTRA, E. W. 1975. Guarded commands. Commun. ACM 18, 8 (Aug.), 453-457.
  17. DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1996. An integrated compile-time/runtime software distributed shared memory system. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). 186-197.
  18. FLEISCH, B. AND POPEK, G. 1989. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating System Principles. ACM, New York, 211-223.
  19. GREENBERG, D., PARK, J., AND SCHWABE, E. 1996. The cost of complex communication on simple networks. J. Parallel Distrib. Comput. 35, 2 (June), 133-141.
  20. GRIMSHAW, A. 1993. Easy-to-use object-oriented parallel processing with Mentat. IEEE Comput. 26, 5 (May), 39-51.
  21. GRIMSHAW, A., WEISSMAN, J., AND STRAYER, W. 1996. Portable run-time support for dynamic object-oriented parallel processing. ACM Trans. Comput. Syst. 14, 2 (May), 139-170.
  22. HAINES, M. AND LANGENDOEN, K. 1997. Platform-independent runtime optimizations using OpenThreads. In Proceedings of the 11th International Parallel Processing Symposium. 460-466.
  23. JOHNSON, K., KAASHOEK, M., AND WALLACH, D. 1995. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles. ACM, New York, 213-228.
  24. KAASHOEK, M. 1992. Group communication in distributed computer systems. Ph.D. thesis, Vrije Universiteit, Amsterdam.
  25. KELEHER, P. AND TSENG, C.-W. 1997. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium. 490-499.
  26. KONTOTHANASSIS, L., HUNT, G., STETS, R., HARDAVELLAS, N., CIERNIAK, M., PARTHASARATHY, S., MEIRA, W., DWARKADAS, S., AND SCOTT, M. 1997. VM-based shared memory on low-latency, remote-memory-access networks. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 157-169.
  27. LANGENDOEN, K., ROMEIN, J., BHOEDJANG, R., AND BAL, H. 1996. Integrating polling, interrupts, and thread management. In Proceedings of Frontiers '96. 13-22.
  28. LEE, J. 1993. Concord: Re-thinking the division of labor in a distributed shared memory system. Tech. Rep. TR-93-12-05, Univ. of Washington, Seattle, Wash.
  29. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., WEBER, W.-D., GUPTA, A., HENNESSY, J., HOROWITZ, M., AND LAM, M. 1992. The Stanford Dash multiprocessor. IEEE Comput. 25, 3 (Mar.), 63-79.
  30. LI, K. AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321-359.
  31. LU, H., COX, A., DWARKADAS, S., RAJAMONY, R., AND ZWAENEPOEL, W. 1997a. Compiler and software distributed shared memory support for irregular applications. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 48-56.
  32. LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1995. Message passing versus distributed shared memory on networks of workstations. In Proceedings of Supercomputing '95.
  33. LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1997b. Quantifying the performance differences between PVM and TreadMarks. J. Parallel Distrib. Comput. 34, 2 (June), 65-78.
  34. LUCCO, S. 1987. A heuristic Linda kernel for hypercube multiprocessors. In Proceedings of the Conference on Hypercube Multiprocessors. 32-38.
  35. MINNICH, R. AND FARBER, D. 1990. Reducing host load, network load, and latency in a distributed shared memory. In Proceedings of the 10th International Conference on Distributed Computing Systems. 468-475.
  36. NIKHIL, R. 1994. Cid: A parallel, shared-memory C for distributed-memory machines. In Proceedings of the 7th Annual Workshop on Languages and Compilers for Parallel Computing.
  37. PAKIN, S., LAURIA, M., AND CHIEN, A. 1995. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95.
  38. PETERSON, L., HUTCHINSON, N., O'MALLEY, S., AND RAO, H. 1990. The x-kernel: A platform for accessing internet resources. IEEE Comput. 23, 5 (May), 23-33.
  39. RAGHAVACHARI, M. AND ROGERS, A. 1997. Ace: Linguistic mechanisms for customizable protocols. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 80-89.
  40. RUHL, T. AND BAL, H. 1995. Optimizing atomic functions using compile-time information. In Proceedings of the Working Conference on Massively Parallel Programming Models (MPPM-95). 68-75.
  41. RUHL, T., BAL, H., BHOEDJANG, R., LANGENDOEN, K., AND BENSON, G. 1996. Experience with a portability layer for implementing parallel programming systems. In Proceedings of the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96). 1477-1488.
  42. SCALES, D., GHARACHORLOO, K., AND THEKKATH, C. 1996. Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). ACM, New York, 174-185.
  43. SCALES, D. AND LAM, M. 1994. The design and evaluation of a shared object system for distributed memory machines. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation. USENIX Assoc., Berkeley, Calif., 101-114.
  44. SCHOINAS, I., FALSAFI, B., LEBECK, A., REINHARDT, S., LARUS, J., AND WOOD, D. 1994. Fine-grain access control for distributed shared memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI). ACM, New York, 297-306.
  45. SINGH, J., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford Parallel Applications for Shared Memory. ACM Comput. Arch. News 20, 1 (Mar.), 5-44.
  46. WEST, E. AND GRIMSHAW, A. 1995. Braid: Integrating task and data parallelism. In Proceedings of Frontiers '95. 211-219.
  47. WILSON, G. AND BAL, H. 1996. An empirical assessment of the usability of Orca using the Cowichan problems. IEEE Parallel Distrib. Tech. 4, 3 (Fall), 36-44.
  48. WILSON, G. AND LU, P., Eds. 1996. Parallel Programming Using C++. The MIT Press, Cambridge, Mass.

              Reviews

              Herbert G. Mayer

Orca is a portable, distributed shared memory (DSM) system with a C-like user interface and an underlying runtime system (RTS) named Panda. According to the paper, the design choices that characterize Orca are effective. These choices include a coherence protocol based on write updates with function shipping, a totally ordered group communication protocol, a strategy for object placement that replicates only shared data with a high read/write ratio, and a complete software implementation in user space.

The following four properties make Orca unique among successful DSMs: shared data structures are encapsulated in shared objects accessible only via user-defined operations; Orca is language-based; Orca integrates synchronization and data access; and the shared-memory model is sequentially consistent. Portability is ensured by encapsulating target-dependent interfaces at the Panda level. The result is a DSM system that makes near-optimal decisions about object placement and replication, starting with static compile-time information, refining the initial decisions dynamically at runtime, and never burdening the user with the need to provide hints or directives.

Compared with TreadMarks and CRL, two other effective DSM systems, the Orca execution on a 32-node Pentium Pro cluster shows amazing scalability. This was measured by the authors for Fast Ethernet and Myrinet networks by comparing the execution behavior of ten representative parallel applications.

After an introduction, section 2 provides a detailed overview of the Orca shared-object system, showing the implementation of shared objects and discussing the placement strategy. Sections 3 and 4 compare the performance of Orca, TreadMarks, and CRL by contrasting the speedup, overhead, number of broadcasts, and execution times of ten parallel applications. Sections 5 and 6 discuss related work and state conclusions. The bibliography is detailed and well chosen.
Since some of the parallel applications running on the Pentium Pro/Myrinet cluster scaled so well (IDA scaled almost linearly), I would have liked to see a discussion of how relevant these ten parallel tests are for comparing DSM implementations. It would also be helpful for the reader to understand how representative these parallel applications are for scaling general applications. Otherwise, the paper is a paradigm of honest research: for all design parameters, the authors discuss pros and cons, alerting readers to the disadvantages that accompany each choice. For example, the totally ordered group communication selected to implement the coherence protocol has several drawbacks. These are discussed and contrasted with feasible alternatives, and the design choice is justified by the performance data. Anyone wishing to work on a DSM system should read this paper. In fact, it is mandatory reading for anyone interested in the performance, implementation, and internal design of DSM systems. The paper is a joy to read and sets a good example for research, where the researcher is a good devil's advocate.
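The placement strategy the review describes (replicate only objects with a high read/write ratio, otherwise keep a single copy) can be sketched as a simple heuristic. This is a hypothetical illustration, not the Orca RTS interface: the function name, the threshold value, and the per-process access-count inputs are all assumptions, and the real system additionally seeds its decisions with compile-time information before refining them at runtime.

```python
# Hypothetical replication heuristic: objects read much more often than
# they are written are replicated everywhere (cheap local reads); objects
# written frequently get a single copy at the heaviest writer, so writes
# need no broadcast.
def placement(reads, writes, threshold=4.0):
    """reads/writes: dicts mapping process name -> access count.
    Returns ("replicate",) or ("single", owner)."""
    total_reads = sum(reads.values())
    total_writes = sum(writes.values())
    if total_writes == 0 or total_reads / total_writes >= threshold:
        return ("replicate",)
    owner = max(writes, key=writes.get)   # place at the heaviest writer
    return ("single", owner)

# Read-dominated object: replicate. Write-dominated object: single copy.
assert placement({"p0": 100, "p1": 90}, {"p0": 3}) == ("replicate",)
assert placement({"p0": 2}, {"p0": 1, "p1": 50}) == ("single", "p1")
```

A cutoff of this kind is what lets a write-update protocol stay cheap: objects that would make updates expensive are simply never replicated.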
