Abstract
The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.
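The core idea can be illustrated with a minimal sketch (not the paper's implementation, which operates on external runs on disk): duplicates are discarded *during* each merge pass, so intermediate runs shrink, instead of sorting the entire file first and removing duplicates in a final sequential pass. The function names and in-memory lists here are illustrative assumptions.

```python
import heapq

def merge_runs_dedup(runs):
    """Merge already-sorted runs, dropping duplicates as they emerge.

    Eliminating duplicates inside the merge (rather than after a full
    sort) shrinks intermediate runs -- the source of the savings the
    cost analysis quantifies.
    """
    out = []
    last = object()  # sentinel that compares unequal to any record
    for rec in heapq.merge(*runs):
        if rec != last:
            out.append(rec)
            last = rec
    return out

def sort_then_scan(records):
    """Standard technique: full sort, then one sequential dedup pass."""
    out = []
    last = object()
    for rec in sorted(records):
        if rec != last:
            out.append(rec)
            last = rec
    return out
```

For example, `merge_runs_dedup([[1, 2, 2, 5], [2, 3, 5]])` yields `[1, 2, 3, 5]`; both functions produce the same result, but the merge-based variant never materializes the duplicated records in its output runs.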
Index Terms
- Duplicate record elimination in large data files