The design of a retrieval technique for high-dimensional data on tertiary storage

Authors:
Ratko Orlandic, Illinois Institute of Technology, Chicago, IL
Jack Lukaszuk, Illinois Institute of Technology, Chicago, IL
Craig Swietlik, Argonne National Lab, Argonne, IL

ACM SIGMOD Record, Volume 31, Issue 2, June 2002, pp. 15–21
https://doi.org/10.1145/565117.565120

Published: 01 June 2002

Abstract

In high-energy physics experiments, large particle accelerators produce enormous quantities of data, measured in hundreds of terabytes or petabytes per year, which are deposited onto tertiary storage. The experiments are designed to study the collisions of fundamental particles, called "events", each of which is represented as a point in a multi-dimensional universe. In these environments, the best retrieval performance can be achieved only if the data is clustered on the tertiary storage by all searchable attributes of the events. Since the number of these attributes is high, the underlying data-management facility must be able to cope with extremely large volumes and very high dimensionalities of data at the same time. The proposed indexing technique is designed to facilitate both clustering and efficient retrieval of high-dimensional data on tertiary storage. The structure uses an original space-partitioning scheme, which has numerous advantages over other space-partitioning techniques. While the main objective of the design is to support high-energy physics experiments, the proposed solution is appropriate for many other scientific applications.

Index Terms

The design of a retrieval technique for high-dimensional data on tertiary storage
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Index terms have been assigned to the content through auto-classification.

Published in

ACM SIGMOD Record Volume 31, Issue 2
June 2002
112 pages
ISSN:0163-5808
DOI:10.1145/565117

Copyright © 2002 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 June 2002
Author Tags
access methods
data dimensionality
scientific databases
tertiary storage
Qualifiers
- article
Article Metrics
- Total Citations: 11
- Total Downloads: 312
- Downloads (Last 12 months): 3
- Downloads (Last 6 weeks): 0