Performance evaluation of the Orca shared-object system

Published: 01 February 1998

Abstract

Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that Orca programs send fewer messages and less data than the TreadMarks and CRL programs and obtain better speedups.
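The write-update protocol with function shipping described above can be illustrated with a small sketch. This is a hypothetical Python model only, not the Orca/Panda implementation: the names TotallyOrderedBus, SharedObject, and incr are invented for illustration, and the in-process "broadcast" stands in for a real totally ordered group communication layer.

```python
# Hypothetical sketch of write-update with function shipping: a write
# operation (the function itself, plus its arguments) is broadcast in a
# single total order and re-executed at every replica, so all copies
# stay consistent. Reads run on the local copy with no communication.

class TotallyOrderedBus:
    """Stand-in for totally ordered group communication: every replica
    sees the same operations in the same sequence."""
    def __init__(self):
        self.replicas = []

    def broadcast(self, op, args):
        # A real system delivers via an ordering protocol (e.g. a
        # sequencer); here we simply apply to all replicas in turn.
        results = [r.apply(op, args) for r in self.replicas]
        return results[0]  # deterministic ops yield the same result everywhere

class SharedObject:
    def __init__(self, bus, state):
        self.state = dict(state)
        self.bus = bus
        bus.replicas.append(self)

    def apply(self, op, args):
        return op(self.state, *args)

    def write(self, op, *args):
        # Function shipping: send the operation, not the updated data.
        return self.bus.broadcast(op, args)

    def read(self, op, *args):
        # Reads use the local replica -- no messages.
        return op(self.state, *args)

def incr(state, key, amount):
    state[key] = state.get(key, 0) + amount
    return state[key]

bus = TotallyOrderedBus()
a = SharedObject(bus, {"x": 0})
b = SharedObject(bus, {"x": 0})   # second replica of the same object
a.write(incr, "x", 5)
assert a.read(lambda s, k: s[k], "x") == b.read(lambda s, k: s[k], "x") == 5
```

The point of shipping the function rather than the data is that one small broadcast updates every replica, which pays off when objects have a high read/write ratio.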

References

  1. AMZA, C., COX, A., DWARKADAS, S., AND ZWAENEPOEL, W. 1997. Software DSM protocols that adapt between single writer and multiple writer. In Proceedings of the Third International Symposium on High-Performance Computer Architecture. 261-271.
  2. BAL, H. AND TANENBAUM, A. 1988. Distributed programming with shared data. In Proceedings of the IEEE CS 1988 International Conference on Computer Languages. 82-91.
  3. BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RUHL, T., AND KAASHOEK, M. 1997a. Portability in the Orca shared object system. Tech. Rep., Vrije Universiteit.
  4. BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RUHL, T., AND VERSTOEP, K. 1997b. Performance of a high-level parallel language on a high-speed network. J. Parallel Distrib. Comput. 40, 1 (Jan.), 49-64.
  5. BAL, H., KAASHOEK, M., AND TANENBAUM, A. 1992. Orca: A language for parallel programming of distributed systems. IEEE Trans. Softw. Eng. 18, 3 (Mar.), 190-205.
  6. BEN HASSEN, S. AND BAL, H. 1996. Integrating task and data parallelism using shared objects. In Proceedings of the 10th ACM International Conference on Supercomputing. ACM, New York, 317-324.
  7. BENNETT, J., CARTER, J., AND ZWAENEPOEL, W. 1990. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 168-176.
  8. BERSHAD, B., ZEKAUSKAS, M., AND SAWDON, W. 1993. The Midway distributed shared memory system. In Proceedings of COMPCON. 528-537.
  9. BLUMRICH, M., DUBNICKI, C., FELTEN, E., LI, K., AND MESARINA, M. 1995. Virtual-memory mapped network interface. IEEE Micro 15, 1 (Feb.), 21-28.
  10. BODEN, N., COHEN, D., FELDERMAN, R., KULAWIK, A., SEITZ, C., SEIZOVIC, J., AND SU, W. 1995. Myrinet: A gigabit-per-second local area network. IEEE Micro 15, 1 (Feb.), 29-36.
  11. CARREIRA, J., SILVA, J., LANGENDOEN, K., AND BAL, H. 1997. Implementing tuple space with threads. In Proceedings of the International Conference on Parallel and Distributed Systems (Euro-PDS '97). 259-264.
  12. CASTRO, M., GUEDES, P., SEQUEIRA, M., AND COSTA, M. 1996. Efficient and flexible object sharing. In Proceedings of the 1996 International Conference on Parallel Processing. Vol. 1. 128-137.
  13. CHAIKEN, D., FIELDS, C., KURIHARA, K., AND AGARWAL, A. 1990. Directory-based cache coherence in large-scale multiprocessors. IEEE Comput. 23, 6 (June), 49-58.
  14. CHANDRA, S. AND LARUS, J. 1997. Optimizing communication in HPF programs on fine-grain distributed shared memory. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 100-111.
  15. COX, A., DWARKADAS, S., KELEHER, P., AND ZWAENEPOEL, W. 1994. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the Winter 1994 USENIX Conference. USENIX Assoc., Berkeley, Calif., 115-131.
  16. DIJKSTRA, E. W. 1975. Guarded commands. Commun. ACM 18, 8 (Aug.), 453-457.
  17. DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1996. An integrated compile-time/runtime software distributed shared memory system. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). 186-197.
  18. FLEISCH, B. AND POPEK, G. 1989. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating System Principles. ACM, New York, 211-223.
  19. GREENBERG, D., PARK, J., AND SCHWABE, E. 1996. The cost of complex communication on simple networks. J. Parallel Distrib. Comput. 35, 2 (June), 133-141.
  20. GRIMSHAW, A. 1993. Easy-to-use object-oriented parallel processing with Mentat. IEEE Comput. 26, 5 (May), 39-51.
  21. GRIMSHAW, A., WEISSMAN, J., AND STRAYER, W. 1996. Portable run-time support for dynamic object-oriented parallel processing. ACM Trans. Comput. Syst. 14, 2 (May), 139-170.
  22. HAINES, M. AND LANGENDOEN, K. 1997. Platform-independent runtime optimizations using OpenThreads. In Proceedings of the 11th International Parallel Processing Symposium. 460-466.
  23. JOHNSON, K., KAASHOEK, M., AND WALLACH, D. 1995. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles. ACM, New York, 213-228.
  24. KAASHOEK, M. 1992. Group communication in distributed computer systems. Ph.D. thesis, Vrije Universiteit, Amsterdam.
  25. KELEHER, P. AND TSENG, C.-W. 1997. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium. 490-499.
  26. KONTOTHANASSIS, L., HUNT, G., STETS, R., HARDAVELLAS, N., CIERNIAK, M., PARTHASARATHY, S., MEIRA, W., DWARKADAS, S., AND SCOTT, M. 1997. VM-based shared memory on low-latency, remote-memory-access networks. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 157-169.
  27. LANGENDOEN, K., ROMEIN, J., BHOEDJANG, R., AND BAL, H. 1996. Integrating polling, interrupts, and thread management. In Proceedings of Frontiers '96. 13-22.
  28. LEE, J. 1993. Concord: Re-thinking the division of labor in a distributed shared memory system. Tech. Rep. TR-93-12-05, Univ. of Washington, Seattle, Wash.
  29. LENOSKI, D., LAUDON, J., GHARACHORLOO, K., WEBER, W.-D., GUPTA, A., HENNESSY, J., HOROWITZ, M., AND LAM, M. 1992. The Stanford Dash multiprocessor. IEEE Comput. 25, 3 (Mar.), 63-79.
  30. LI, K. AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321-359.
  31. LU, H., COX, A., DWARKADAS, S., RAJAMONY, R., AND ZWAENEPOEL, W. 1997a. Compiler and software distributed shared memory support for irregular applications. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 48-56.
  32. LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1995. Message passing versus distributed shared memory on networks of workstations. In Proceedings of Supercomputing '95.
  33. LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1997b. Quantifying the performance differences between PVM and TreadMarks. J. Parallel Distrib. Comput. 34, 2 (June), 65-78.
  34. LUCCO, S. 1987. A heuristic Linda kernel for hypercube multiprocessors. In Proceedings of the Conference on Hypercube Multiprocessors. 32-38.
  35. MINNICH, R. AND FARBER, D. 1990. Reducing host load, network load, and latency in a distributed shared memory. In Proceedings of the 10th International Conference on Distributed Computing Systems. 468-475.
  36. NIKHIL, R. 1994. Cid: A parallel, shared-memory C for distributed-memory machines. In Proceedings of the 7th Annual Workshop on Languages and Compilers for Parallel Computing.
  37. PAKIN, S., LAURIA, M., AND CHIEN, A. 1995. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95.
  38. PETERSON, L., HUTCHINSON, N., O'MALLEY, S., AND RAO, H. 1990. The x-kernel: A platform for accessing internet resources. IEEE Comput. 23, 5 (May), 23-33.
  39. RAGHAVACHARI, M. AND ROGERS, A. 1997. Ace: Linguistic mechanisms for customizable protocols. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 80-89.
  40. RUHL, T. AND BAL, H. 1995. Optimizing atomic functions using compile-time information. In Proceedings of the Working Conference on Massively Parallel Programming Models (MPPM-95). 68-75.
  41. RUHL, T., BAL, H., BHOEDJANG, R., LANGENDOEN, K., AND BENSON, G. 1996. Experience with a portability layer for implementing parallel programming systems. In Proceedings of the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96). 1477-1488.
  42. SCALES, D., GHARACHORLOO, K., AND THEKKATH, C. 1996. Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). ACM, New York, 174-185.
  43. SCALES, D. AND LAM, M. 1994. The design and evaluation of a shared object system for distributed memory machines. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation. USENIX Assoc., Berkeley, Calif., 101-114.
  44. SCHOINAS, I., FALSAFI, B., LEBECK, A., REINHARDT, S., LARUS, J., AND WOOD, D. 1994. Fine-grain access control for distributed shared memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI). ACM, New York, 297-306.
  45. SINGH, J., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford Parallel Applications for Shared Memory. ACM Comput. Arch. News 20, 1 (Mar.), 5-44.
  46. WEST, E. AND GRIMSHAW, A. 1995. Braid: Integrating task and data parallelism. In Proceedings of Frontiers '95. 211-219.
  47. WILSON, G. AND BAL, H. 1996. An empirical assessment of the usability of Orca using the Cowichan problems. IEEE Parallel Distrib. Tech. 4, 3 (Fall), 36-44.
  48. WILSON, G. AND LU, P., Eds. 1996. Parallel Programming Using C++. The MIT Press, Cambridge, Mass.

              Reviews

              Herbert G. Mayer

Orca is a portable, distributed shared memory (DSM) system with a C-like user interface and an underlying runtime system (RTS) named Panda. According to the paper, the design choices that characterize Orca are effective. These choices include a coherence protocol based on write updates with function shipping, a totally ordered group communication protocol, a strategy for object placement that replicates only shared data with a high read/write ratio, and a complete software implementation in user space.

The following four properties make Orca unique among successful DSMs: shared data structures are encapsulated in shared objects accessible only via user-defined operations; Orca is language-based; Orca integrates synchronization and data access; and the shared-memory model is sequentially consistent. Portability is ensured by encapsulating target-dependent interfaces at the Panda level. The result is a DSM system that makes near-optimal decisions about object placement and replication, starting with static compile-time information, refining the initial decisions dynamically at runtime, and never burdening the user with the need to provide hints or directives.

Compared with TreadMarks and CRL, two other effective DSM systems, the Orca execution on a 32-node Pentium Pro cluster shows amazing scalability. This was measured by the authors for Fast Ethernet and Myrinet networks by comparing the execution behavior of ten representative parallel applications.

After an introduction, section 2 provides a detailed overview of the Orca shared-object system, showing the implementation of shared objects and discussing the placement strategy. Sections 3 and 4 compare the performance of Orca, TreadMarks, and CRL by contrasting the speedup, overhead, number of broadcasts, and execution times of ten parallel applications. Sections 5 and 6 discuss related work and state conclusions. The bibliography is detailed and well chosen.
Since some of the parallel applications running on the Pentium Pro/Myrinet cluster scaled so well (IDA scaled almost linearly), I would have liked to see a discussion of how relevant these ten parallel tests are for comparing DSM implementations. It would also be helpful for the reader to understand how representative these parallel applications are for scaling general applications. Otherwise, the paper is a paradigm of honest research: for all design parameters, the authors discuss pros and cons, alerting readers to the disadvantages that accompany each choice. For example, the totally ordered group communication selected to implement the coherence protocol has several drawbacks. These are discussed and contrasted with feasible alternatives, and the design choice is justified by the performance data. Anyone wishing to work on a DSM system should read this paper. In fact, it is mandatory reading for anyone interested in the performance, implementation, and internal design of DSM systems. The paper is a joy to read and sets a good example for research, where the researcher is a good devil's advocate.
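The placement strategy the review describes (replicate only objects with a high read/write ratio, otherwise keep a single copy) can be sketched as a simple heuristic. This is a hypothetical illustration, not the Orca RTS interface: the function name, the threshold value, and the per-process access-count inputs are all assumptions, and the real system additionally seeds its decisions with compile-time information before refining them at runtime.

```python
# Hypothetical replication heuristic: objects read much more often than
# they are written are replicated everywhere (cheap local reads); objects
# written frequently get a single copy at the heaviest writer, so writes
# need no broadcast.
def placement(reads, writes, threshold=4.0):
    """reads/writes: dicts mapping process name -> access count.
    Returns ("replicate",) or ("single", owner)."""
    total_reads = sum(reads.values())
    total_writes = sum(writes.values())
    if total_writes == 0 or total_reads / total_writes >= threshold:
        return ("replicate",)
    owner = max(writes, key=writes.get)   # place at the heaviest writer
    return ("single", owner)

# Read-dominated object: replicate. Write-dominated object: single copy.
assert placement({"p0": 100, "p1": 90}, {"p0": 3}) == ("replicate",)
assert placement({"p0": 2}, {"p0": 1, "p1": 50}) == ("single", "p1")
```

A cutoff of this kind is what lets a write-update protocol stay cheap: objects that would make updates expensive are simply never replicated.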
