Abstract
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
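The core idea can be illustrated with a minimal sketch (not the paper's implementation, which operates on external runs on disk): duplicates are discarded *during* each merge pass, so intermediate runs shrink, instead of sorting the entire file first and removing duplicates in a final sequential pass. The function names and in-memory lists here are illustrative assumptions.

```python
import heapq

def merge_runs_dedup(runs):
    """Merge already-sorted runs, dropping duplicates as they emerge.

    Eliminating duplicates inside the merge (rather than after a full
    sort) shrinks intermediate runs -- the source of the savings the
    cost analysis quantifies.
    """
    out = []
    last = object()  # sentinel that compares unequal to any record
    for rec in heapq.merge(*runs):
        if rec != last:
            out.append(rec)
            last = rec
    return out

def sort_then_scan(records):
    """Standard technique: full sort, then one sequential dedup pass."""
    out = []
    last = object()
    for rec in sorted(records):
        if rec != last:
            out.append(rec)
            last = rec
    return out
```

For example, `merge_runs_dedup([[1, 2, 2, 5], [2, 3, 5]])` yields `[1, 2, 3, 5]`; both functions produce the same result, but the merge-based variant never materializes the duplicated records in its output runs.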
Index Terms
- Duplicate record elimination in large data files