research-article

Set cover algorithms for very large datasets

Authors:
Graham Cormode

AT&T Labs - Research, Florham Park, NJ, USA

AT&T Labs - Research, Florham Park, NJ, USA
View Profile

,
Howard Karloff

AT&T Labs - Research, Florham Park, NJ, USA

AT&T Labs - Research, Florham Park, NJ, USA
View Profile

,
Anthony Wirth

The University of Melbourne, Parkville, Australia

The University of Melbourne, Parkville, Australia
View Profile

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementOctober 2010Pages 479–488https://doi.org/10.1145/1871437.1871501

Published:26 October 2010Publication History

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

Pages 479–488

ABSTRACT

The problem of Set Cover - to find the smallest subcollection of sets that covers some universe - is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal.

However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. In order to scale Set Cover to large datasets, we provide a new algorithm which finds a solution that is provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.

References

B. Berger, J. Rompel, and P. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 49(3):454--477, 1994. Google ScholarDigital Library
T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254--60, 1999. Google ScholarDigital Library
A. Broder and M. Mitzenmacher. Survey: Network applications of bloom filters: A survey. Internet Mathematics, 1(4), 2003.Google Scholar
F. Chierichetti, R. Kumar, and A. Tomkins. Max-Cover in Map-Reduce. In Proceedings of the 19th International Conference on World Wide Web, pages 231--240. ACM, 2010. Google ScholarDigital Library
U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634--652, 1998. Google ScholarDigital Library
K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, page 18pp, January 2003.Google Scholar
B. Goethals. Frequent itemset mining dataset repository. http://fimi.cs.helsinki.fi/dat.Google Scholar
L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. In Proceedings of Very Large Databases Conference (VLDB), 2008. Google ScholarDigital Library
F. Gomes, C. Meneses, P. Pardalos, and G. Viana. Experimental analysis of approximation algorithms for the vertex cover and set covering problems. Computers & Operations Research, 33(12):3520--3534, 2006. Google ScholarDigital Library
T. Grossman and A. Wool. Computational experience with approximation algorithms for the set covering problem. European Journal of Operational Research, 101(1):81--92, 1997.Google ScholarCross Ref
D. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9(3):256--278, 1974. Google ScholarDigital Library
C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. Webdocs: a real-life huge transactional dataset. In Proceedings of the ICDM Workshop in Frequent Itemset Mining Implementations, 2004.Google Scholar
M. Mihail. Set cover with requirements and costs evolving over time. In Proceedings of the Second International Workshop on Approximation Algorithms for Combinatorial Optimization Problems: RANDOM-APPROX, pages 63--72, 1999. Google ScholarDigital Library
K. Munagala, S. Babu, R. Motwani, and J. Widom. The pipelined set cover problem. Technical Report 2003-65, Stanford InfoLab, October 2003.Google Scholar
B. Saha and L. Getoor. On Maximum Coverage in the streaming model & application to multi-topic blog-watch. In 2009 SIAM International Conference on Data Mining (SDM09), April 2009.Google ScholarCross Ref

Index Terms

Set cover algorithms for very large datasets
1. Information systems
  1. Information systems applications
    1. Data mining
2. Theory of computation
  1. Design and analysis of algorithms

Recommendations

Variable neighborhood search for the discounted {0-1} knapsack problem
Abstract
The discounted {0,1} knapsack problem (D{0-1}KP) is a relatively recent variant of the well-known knapsack problem. In the D{0-1}KP a set of items is partitioned into groups of three items and at most one item can be chosen from each ...
Highlights
- Application of the VNS methodology to solve the discounted 0-1 knapsack problem.
Read More
Greedy versus recursive greedy: Uncorrelated heuristics for the binary paint shop problem
Abstract
It is well-known that there are instances of the binary paint shop problem for which the recursive greedy heuristic is better than the greedy heuristic. In this note, we give an example of a family of instances where the greedy ...
Read More
Algorithms for very large scale set covering problems
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437
General Chair:
Jimmy Huang
York University, Canada
,
Program Chairs:
Nick Koudas
University of Toronto, Canada
,
Gareth Jones
Dublin City University, Ireland
,
Xindong Wu
University of Vermont, USA
,
Kevyn Collins-Thompson
Microsoft Research, USA
,
Aijun An
York University, Canada
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 October 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
disk friendly
greedy heuristic
set cover
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 62
  Total Citations
  View Citations
- 772
  Total Downloads
- Downloads (Last 12 months)56
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Set cover algorithms for very large datasets

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Variable neighborhood search for the discounted {0-1} knapsack problem

Greedy versus recursive greedy: Uncorrelated heuristics for the binary paint shop problem

Algorithms for very large scale set covering problems