ABSTRACT
The problem of Set Cover - to find the smallest subcollection of sets that covers some universe - is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal.
However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. In order to scale Set Cover to large datasets, we provide a new algorithm which finds a solution that is provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.
- B. Berger, J. Rompel, and P. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 49(3):454--477, 1994. Google ScholarDigital Library
- T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254--60, 1999. Google ScholarDigital Library
- A. Broder and M. Mitzenmacher. Survey: Network applications of bloom filters: A survey. Internet Mathematics, 1(4), 2003.Google Scholar
- F. Chierichetti, R. Kumar, and A. Tomkins. Max-Cover in Map-Reduce. In Proceedings of the 19th International Conference on World Wide Web, pages 231--240. ACM, 2010. Google ScholarDigital Library
- U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634--652, 1998. Google ScholarDigital Library
- K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, page 18pp, January 2003.Google Scholar
- B. Goethals. Frequent itemset mining dataset repository. http://fimi.cs.helsinki.fi/dat.Google Scholar
- L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. In Proceedings of Very Large Databases Conference (VLDB), 2008. Google ScholarDigital Library
- F. Gomes, C. Meneses, P. Pardalos, and G. Viana. Experimental analysis of approximation algorithms for the vertex cover and set covering problems. Computers & Operations Research, 33(12):3520--3534, 2006. Google ScholarDigital Library
- T. Grossman and A. Wool. Computational experience with approximation algorithms for the set covering problem. European Journal of Operational Research, 101(1):81--92, 1997.Google ScholarCross Ref
- D. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9(3):256--278, 1974. Google ScholarDigital Library
- C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. Webdocs: a real-life huge transactional dataset. In Proceedings of the ICDM Workshop in Frequent Itemset Mining Implementations, 2004.Google Scholar
- M. Mihail. Set cover with requirements and costs evolving over time. In Proceedings of the Second International Workshop on Approximation Algorithms for Combinatorial Optimization Problems: RANDOM-APPROX, pages 63--72, 1999. Google ScholarDigital Library
- K. Munagala, S. Babu, R. Motwani, and J. Widom. The pipelined set cover problem. Technical Report 2003-65, Stanford InfoLab, October 2003.Google Scholar
- B. Saha and L. Getoor. On Maximum Coverage in the streaming model & application to multi-topic blog-watch. In 2009 SIAM International Conference on Data Mining (SDM09), April 2009.Google ScholarCross Ref
Index Terms
- Set cover algorithms for very large datasets
Recommendations
Variable neighborhood search for the discounted {0-1} knapsack problem
AbstractThe discounted {0,1} knapsack problem (D{0-1}KP) is a relatively recent variant of the well-known knapsack problem. In the D{0-1}KP a set of items is partitioned into groups of three items and at most one item can be chosen from each ...
Highlights- Application of the VNS methodology to solve the discounted 0-1 knapsack problem.
Greedy versus recursive greedy: Uncorrelated heuristics for the binary paint shop problem
AbstractIt is well-known that there are instances of the binary paint shop problem for which the recursive greedy heuristic is better than the greedy heuristic. In this note, we give an example of a family of instances where the greedy ...
Comments