Abstract
In high-energy physics experiments, large particle accelerators produce enormous quantities of data, measured in hundreds of terabytes or petabytes per year, which are deposited onto tertiary storage. The experiments are designed to study the collisions of fundamental particles, called "events", each of which is represented as a point in a multi-dimensional universe. In these environments, the best retrieval performance can be achieved only if the data is clustered on the tertiary storage by all searchable attributes of the events. Since the number of these attributes is high, the underlying data-management facility must be able to cope with extremely large volumes and very high dimensionalities of data at the same time. The proposed indexing technique is designed to facilitate both clustering and efficient retrieval of high-dimensional data on tertiary storage. The structure uses an original space-partitioning scheme, which has numerous advantages over other space-partitioning techniques. While the main objective of the design is to support high-energy physics experiments, the proposed solution is appropriate for many other scientific applications.
Index Terms
- The design of a retrieval technique for high-dimensional data on tertiary storage