skip to main content
10.1145/1871437.1871501acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Set cover algorithms for very large datasets

Published:26 October 2010Publication History

ABSTRACT

The problem of Set Cover - to find the smallest subcollection of sets that covers some universe - is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining. Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal.

However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly in comparison to linear scans. In order to scale Set Cover to large datasets, we provide a new algorithm which finds a solution that is provably close to that of greedy, but which is much more efficient to implement using modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.

References

  1. B. Berger, J. Rompel, and P. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 49(3):454--477, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254--60, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Broder and M. Mitzenmacher. Survey: Network applications of bloom filters: A survey. Internet Mathematics, 1(4), 2003.Google ScholarGoogle Scholar
  4. F. Chierichetti, R. Kumar, and A. Tomkins. Max-Cover in Map-Reduce. In Proceedings of the 19th International Conference on World Wide Web, pages 231--240. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634--652, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, page 18pp, January 2003.Google ScholarGoogle Scholar
  7. B. Goethals. Frequent itemset mining dataset repository. http://fimi.cs.helsinki.fi/dat.Google ScholarGoogle Scholar
  8. L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. In Proceedings of Very Large Databases Conference (VLDB), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. F. Gomes, C. Meneses, P. Pardalos, and G. Viana. Experimental analysis of approximation algorithms for the vertex cover and set covering problems. Computers & Operations Research, 33(12):3520--3534, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. Grossman and A. Wool. Computational experience with approximation algorithms for the set covering problem. European Journal of Operational Research, 101(1):81--92, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  11. D. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9(3):256--278, 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. Webdocs: a real-life huge transactional dataset. In Proceedings of the ICDM Workshop in Frequent Itemset Mining Implementations, 2004.Google ScholarGoogle Scholar
  13. M. Mihail. Set cover with requirements and costs evolving over time. In Proceedings of the Second International Workshop on Approximation Algorithms for Combinatorial Optimization Problems: RANDOM-APPROX, pages 63--72, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. K. Munagala, S. Babu, R. Motwani, and J. Widom. The pipelined set cover problem. Technical Report 2003-65, Stanford InfoLab, October 2003.Google ScholarGoogle Scholar
  15. B. Saha and L. Getoor. On Maximum Coverage in the streaming model & application to multi-topic blog-watch. In 2009 SIAM International Conference on Data Mining (SDM09), April 2009.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Set cover algorithms for very large datasets

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
        October 2010
        2036 pages
        ISBN:9781450300995
        DOI:10.1145/1871437

        Copyright © 2010 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 26 October 2010

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate1,861of8,427submissions,22%

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader