Abstract
Orca is a portable, object-based distributed shared memory (DSM) system. This article studies and evaluates the design choices made in the Orca system and compares Orca with other DSMs. The article gives a quantitative analysis of Orca's coherence protocol (based on write-updates with function shipping), the totally ordered group communication protocol, the strategy for object placement, and the all-software, user-space architecture. Performance measurements for 10 parallel applications illustrate the trade-offs made in the design of Orca and show that essentially the right design decisions have been made. A write-update protocol with function shipping is effective for Orca, especially since it is used in combination with techniques that avoid replicating objects that have a low read/write ratio. The overhead of totally ordered group communication on application performance is low. The Orca system is able to make near-optimal decisions for object placement and replication. In addition, the article compares the performance of Orca with that of a page-based DSM (TreadMarks) and another object-based DSM (CRL). It also analyzes the communication overhead of the DSMs for several applications. All performance measurements are done on a 32-node Pentium Pro cluster with Myrinet and Fast Ethernet networks. The results show that Orca programs send fewer messages and less data than the TreadMarks and CRL programs and obtain better speedups.
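The coherence mechanism described above can be illustrated with a minimal single-process sketch. This is not Orca's implementation, only an illustration of the idea: in a write-update protocol with function shipping, a write operation (the function itself, plus its arguments) is broadcast in a total order to every replica of an object, each replica applies it, and reads are served locally without communication. All names here (`Sequencer`, `Replica`, `SharedObject`, `incr`) are hypothetical.

```python
# Sketch (under stated assumptions): write-update with function shipping
# on top of sequencer-based totally ordered broadcast, simulated in-process.

class Sequencer:
    """Assigns a global sequence number to every broadcast operation."""
    def __init__(self):
        self.next_seq = 0

    def order(self, op):
        seq = self.next_seq
        self.next_seq += 1
        return seq

class Replica:
    """One copy of a shared object; applies operations in sequence order."""
    def __init__(self, initial_state):
        self.state = dict(initial_state)
        self.applied = 0          # next sequence number to apply
        self.pending = {}         # seq -> (fn, args), ops received out of order

    def deliver(self, seq, fn, args):
        self.pending[seq] = (fn, args)
        # Apply every operation whose turn has come, in total order.
        while self.applied in self.pending:
            f, a = self.pending.pop(self.applied)
            f(self.state, *a)
            self.applied += 1

class SharedObject:
    """Write-update: ship the operation to all replicas; read locally."""
    def __init__(self, initial_state, n_replicas):
        self.seq = Sequencer()
        self.replicas = [Replica(initial_state) for _ in range(n_replicas)]

    def write(self, fn, *args):
        s = self.seq.order(fn)
        for r in self.replicas:   # "broadcast" the operation to every replica
            r.deliver(s, fn, args)

    def read(self, cpu, key):
        return self.replicas[cpu].state[key]   # local read, no messages

def incr(state, key, amount):     # an operation that gets shipped, not its result
    state[key] = state.get(key, 0) + amount

obj = SharedObject({"count": 0}, n_replicas=4)
obj.write(incr, "count", 5)
obj.write(incr, "count", 3)
assert all(obj.read(cpu, "count") == 8 for cpu in range(4))
```

The sketch shows why the strategy pays off only for objects with a high read/write ratio: every write costs a broadcast to all replicas, while every read is free, which is exactly the trade-off the article quantifies.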
- AMZA, C., COX, A., DWARKADAS, S., AND ZWAENEPOEL, W. 1997. Software DSM protocols that adapt between single writer and multiple writer. In Proceedings of the Third International Symposium on High-Performance Computer Architecture. 261-271.
- BAL, H. AND TANENBAUM, A. 1988. Distributed programming with shared data. In Proceedings of the IEEE CS 1988 International Conference on Computer Languages. 82-91.
- BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RÜHL, T., AND KAASHOEK, M. 1997a. Portability in the Orca shared object system. Tech. Rep., Vrije Universiteit, Amsterdam.
- BAL, H., BHOEDJANG, R., HOFMAN, R., JACOBS, C., LANGENDOEN, K., RÜHL, T., AND VERSTOEP, K. 1997b. Performance of a high-level parallel language on a high-speed network. J. Parallel Distrib. Comput. 40, 1 (Jan.), 49-64.
- BAL, H., KAASHOEK, M., AND TANENBAUM, A. 1992. Orca: A language for parallel programming of distributed systems. IEEE Trans. Softw. Eng. 18, 3 (Mar.), 190-205.
- BEN HASSEN, S. AND BAL, H. 1996. Integrating task and data parallelism using shared objects. In Proceedings of the 10th ACM International Conference on Supercomputing. ACM, New York, 317-324.
- BENNETT, J., CARTER, J., AND ZWAENEPOEL, W. 1990. Munin: Distributed shared memory based on type-specific memory coherence. In Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, New York, 168-176.
- BERSHAD, B., ZEKAUSKAS, M., AND SAWDON, W. 1993. The Midway distributed shared memory system. In Proceedings of COMPCON. 528-537.
- BLUMRICH, M., DUBNICKI, C., FELTEN, E., LI, K., AND MESARINA, M. 1995. Virtual-memory mapped network interface. IEEE Micro 15, 1 (Feb.), 21-28.
- BODEN, N., COHEN, D., FELDERMAN, R., KULAWIK, A., SEITZ, C., SEIZOVIC, J., AND SU, W. 1995. Myrinet: A gigabit-per-second local area network. IEEE Micro 15, 1 (Feb.), 29-36.
- CARREIRA, J., SILVA, J., LANGENDOEN, K., AND BAL, H. 1997. Implementing tuple space with threads. In Proceedings of the International Conference on Parallel and Distributed Systems (Euro-PDS'97). 259-264.
- CASTRO, M., GUEDES, P., SEQUEIRA, M., AND COSTA, M. 1996. Efficient and flexible object sharing. In Proceedings of the 1996 International Conference on Parallel Processing. Vol. 1. 128-137.
- CHAIKEN, D., FIELDS, C., KURIHARA, K., AND AGARWAL, A. 1990. Directory-based cache coherence in large-scale multiprocessors. IEEE Comput. 23, 6 (June), 49-58.
- CHANDRA, S. AND LARUS, J. 1997. Optimizing communication in HPF programs on fine-grain distributed shared memory. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 100-111.
- COX, A., DWARKADAS, S., KELEHER, P., AND ZWAENEPOEL, W. 1994. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the Winter 1994 USENIX Conference. USENIX Assoc., Berkeley, Calif., 115-131.
- DIJKSTRA, E. W. 1975. Guarded commands. Commun. ACM 18, 8 (Aug.), 453-457.
- DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1996. An integrated compile-time/runtime software distributed shared memory system. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). 186-197.
- FLEISCH, B. AND POPEK, G. 1989. Mirage: A coherent distributed shared memory design. In Proceedings of the 12th ACM Symposium on Operating System Principles. ACM, New York, 211-223.
- GREENBERG, D., PARK, J., AND SCHWABE, E. 1996. The cost of complex communication on simple networks. J. Parallel Distrib. Comput. 35, 2 (June), 133-141.
- GRIMSHAW, A. 1993. Easy-to-use object-oriented parallel processing with Mentat. IEEE Comput. 26, 5 (May), 39-51.
- GRIMSHAW, A., WEISSMAN, J., AND STRAYER, W. 1996. Portable run-time support for dynamic object-oriented parallel processing. ACM Trans. Comput. Syst. 14, 2 (May), 139-170.
- HAINES, M. AND LANGENDOEN, K. 1997. Platform-independent runtime optimizations using OpenThreads. In Proceedings of the 11th International Parallel Processing Symposium. 460-466.
- JOHNSON, K., KAASHOEK, M., AND WALLACH, D. 1995. CRL: High-performance all-software distributed shared memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles. ACM, New York, 213-228.
- KAASHOEK, M. 1992. Group communication in distributed computer systems. Ph.D. thesis, Vrije Universiteit, Amsterdam.
- KELEHER, P. AND TSENG, C.-W. 1997. Enhancing software DSM for compiler-parallelized applications. In Proceedings of the 11th International Parallel Processing Symposium. 490-499.
- KONTOTHANASSIS, L., HUNT, G., STETS, R., HARDAVELLAS, N., CIERNIAK, M., PARTHASARATHY, S., MEIRA, W., DWARKADAS, S., AND SCOTT, M. 1997. VM-based shared memory on low-latency, remote-memory-access networks. In Proceedings of the 24th Annual International Symposium on Computer Architecture. 157-169.
- LANGENDOEN, K., ROMEIN, J., BHOEDJANG, R., AND BAL, H. 1996. Integrating polling, interrupts, and thread management. In Proceedings of Frontiers '96. 13-22.
- LEE, J. 1993. Concord: Re-thinking the division of labor in a distributed shared memory system. Tech. Rep. TR-93-12-05, Univ. of Washington, Seattle, Wash.
- LENOSKI, D., LAUDON, J., GHARACHORLOO, K., WEBER, W.-D., GUPTA, A., HENNESSY, J., HOROWITZ, M., AND LAM, M. 1992. The Stanford Dash multiprocessor. IEEE Comput. 25, 3 (Mar.), 63-79.
- LI, K. AND HUDAK, P. 1989. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst. 7, 4 (Nov.), 321-359.
- LU, H., COX, A., DWARKADAS, S., RAJAMONY, R., AND ZWAENEPOEL, W. 1997a. Compiler and software distributed shared memory support for irregular applications. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97). ACM, New York, 48-56.
- LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1995. Message passing versus distributed shared memory on networks of workstations. In Proceedings of Supercomputing '95.
- LU, H., DWARKADAS, S., COX, A., AND ZWAENEPOEL, W. 1997b. Quantifying the performance differences between PVM and TreadMarks. J. Parallel Distrib. Comput. 34, 2 (June), 65-78.
- LUCCO, S. 1987. A heuristic Linda kernel for hypercube multiprocessors. In Proceedings of the Conference on Hypercube Multiprocessors. 32-38.
- MINNICH, R. AND FARBER, D. 1990. Reducing host load, network load, and latency in a distributed shared memory. In Proceedings of the 10th International Conference on Distributed Computing Systems. 468-475.
- NIKHIL, R. 1994. Cid: A parallel, shared-memory C for distributed-memory machines. In Proceedings of the 7th Annual Workshop on Languages and Compilers for Parallel Computing.
- PAKIN, S., LAURIA, M., AND CHIEN, A. 1995. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95.
- PETERSON, L., HUTCHINSON, N., O'MALLEY, S., AND RAO, H. 1990. The x-kernel: A platform for accessing internet resources. IEEE Comput. 23, 5 (May), 23-33.
- RAGHAVACHARI, M. AND ROGERS, A. 1997. Ace: Linguistic mechanisms for customizable protocols. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 80-89.
- RÜHL, T. AND BAL, H. 1995. Optimizing atomic functions using compile-time information. In Proceedings of the Working Conference on Massively Parallel Programming Models (MPPM-95). 68-75.
- RÜHL, T., BAL, H., BHOEDJANG, R., LANGENDOEN, K., AND BENSON, G. 1996. Experience with a portability layer for implementing parallel programming systems. In Proceedings of the 1996 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '96). 1477-1488.
- SCALES, D., GHARACHORLOO, K., AND THEKKATH, C. 1996. Shasta: A low overhead, software-only approach for supporting fine-grain shared memory. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII). ACM, New York, 174-185.
- SCALES, D. AND LAM, M. 1994. The design and evaluation of a shared object system for distributed memory machines. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation. USENIX Assoc., Berkeley, Calif., 101-114.
- SCHOINAS, I., FALSAFI, B., LEBECK, A., REINHARDT, S., LARUS, J., AND WOOD, D. 1994. Fine-grain access control for distributed shared memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI). ACM, New York, 297-306.
- SINGH, J., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford Parallel Applications for Shared Memory. ACM Comput. Arch. News 20, 1 (Mar.), 5-44.
- WEST, E. AND GRIMSHAW, A. 1995. Braid: Integrating task and data parallelism. In Proceedings of Frontiers '95. 211-219.
- WILSON, G. AND BAL, H. 1996. An empirical assessment of the usability of Orca using the Cowichan problems. IEEE Parallel Distrib. Tech. 4, 3 (Fall), 36-44.
- WILSON, G. AND LU, P., Eds. 1996. Parallel Programming Using C++. The MIT Press, Cambridge, Mass.
Index Terms
- Performance evaluation of the Orca shared-object system