research-article
Open Access

Ground Truth Inference for Weakly Supervised Entity Matching

Published: 30 May 2023

Abstract

Entity matching (EM) refers to the problem of identifying pairs of data records in one or more relational tables that refer to the same entity in the real world. Supervised machine learning (ML) models currently achieve state-of-the-art matching performance; however, they require a large number of labeled examples, which are often expensive or infeasible to obtain. This has inspired us to approach data labeling for EM using weak supervision. In particular, we use the labeling function abstraction popularized by Snorkel, where each labeling function (LF) is a user-provided program that can generate many noisy match/non-match labels quickly and cheaply. Given a set of user-written LFs, the quality of data labeling depends on a labeling model to accurately infer the ground-truth labels. In this work, we first propose a simple but powerful labeling model for general weak supervision tasks. Then, we tailor the labeling model specifically to the task of entity matching by considering the EM-specific transitivity property.
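To make the labeling function abstraction concrete, the following is a minimal sketch in plain Python. The record fields (`name`, `price`), the three labeling functions, and their thresholds are hypothetical and chosen only for illustration. The `majority_vote` aggregator is a simple baseline stand-in, not the labeling model proposed in this paper.

```python
from collections import Counter

# Label values are an assumption of this sketch.
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1


def lf_same_name(a, b):
    """Vote MATCH when the normalized names are identical; otherwise abstain."""
    same = a["name"].strip().lower() == b["name"].strip().lower()
    return MATCH if same else ABSTAIN


def lf_price_gap(a, b):
    """Vote NON_MATCH when prices differ by more than 50% of the larger price."""
    larger = max(a["price"], b["price"])
    if larger == 0:
        return ABSTAIN
    gap = abs(a["price"] - b["price"]) / larger
    return NON_MATCH if gap > 0.5 else ABSTAIN


def lf_token_overlap(a, b):
    """Vote based on the Jaccard similarity of name tokens (hypothetical thresholds)."""
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    if jaccard >= 0.6:
        return MATCH
    if jaccard <= 0.2:
        return NON_MATCH
    return ABSTAIN


def apply_lfs(pairs, lfs):
    """Build the label matrix: one row per record pair, one column per LF."""
    return [[lf(a, b) for lf in lfs] for a, b in pairs]


def majority_vote(votes, default=NON_MATCH):
    """Baseline aggregation: most common non-abstain vote; ties and empty rows fall back to default."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return default
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


r1 = {"name": "Apple iPhone 12", "price": 699.0}
r2 = {"name": "apple iphone 12 ", "price": 689.0}
r3 = {"name": "Samsung Galaxy S21", "price": 799.0}

LFS = [lf_same_name, lf_price_gap, lf_token_overlap]
label_matrix = apply_lfs([(r1, r2), (r1, r3)], LFS)
inferred = [majority_vote(row) for row in label_matrix]
```

A labeling model takes the place of `majority_vote` here: it consumes the same label matrix but tries to account for how accurate each LF is, rather than weighting all votes equally.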

The general form of our labeling model is simple while substantially outperforming the best existing method across ten general weak supervision datasets. To tailor the labeling model for EM, we formulate an approach to ensure that the final predictions of the labeling model satisfy the transitivity property required in EM, utilizing an exact solution where possible and an ML-based approximation in remaining cases. On two single-table and nine two-table real-world EM datasets, we show that our labeling model results in a 9% higher F1 score on average than the best existing method. We also show that a deep learning EM end model (DeepMatcher) trained on labels generated from our weak supervision approach is comparable to an end model trained using tens of thousands of ground-truth labels, demonstrating that our approach can significantly reduce the labeling efforts required in EM.
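The transitivity property says that if records a and b match and b and c match, then a and c must match as well. The sketch below, which assumes a single table and hypothetical record identifiers, detects triples that violate this property and repairs them with a naive transitive closure based on union-find. It only illustrates the constraint; it is not the exact solution or the ML-based approximation described in the abstract, and a naive closure can propagate a single false match across a whole cluster.

```python
from itertools import combinations


def transitivity_violations(records, matches):
    """Return triples in which exactly two of the three pairs are predicted matches."""
    matched = {frozenset(p) for p in matches}
    violations = []
    for a, b, c in combinations(records, 3):
        n_matched = sum(frozenset(p) in matched for p in ((a, b), (b, c), (a, c)))
        if n_matched == 2:
            violations.append((a, b, c))
    return violations


def transitive_closure(records, matches):
    """Merge matched records into clusters and return every within-cluster pair."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)

    return {
        frozenset(pair)
        for members in clusters.values()
        for pair in combinations(members, 2)
    }


records = ["r1", "r2", "r3", "r4"]
predicted = [("r1", "r2"), ("r2", "r3")]  # (r1, r3) is missing

before = transitivity_violations(records, predicted)
closed = transitive_closure(records, predicted)
after = transitivity_violations(records, closed)
```

The triple check enumerates all combinations of three records, so it is only practical for small inputs such as this one.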


Supplemental Material

PACMMOD-V1mod032.mp4

Presentation video for the paper "Ground Truth Inference for Weakly Supervised Entity Matching" at SIGMOD 2023

mp4

79.9 MB


    • Published in

      Proceedings of the ACM on Management of Data, Volume 1, Issue 1
      PACMMOD
      May 2023
      2807 pages
      EISSN:2836-6573
      DOI:10.1145/3603164

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States
