2018 | OriginalPaper | Chapter

CrowdCorrect: A Curation Pipeline for Social Data Cleansing and Curation

Authors : Amin Beheshti, Kushal Vaghani, Boualem Benatallah, Alireza Tabebordbar

Published in: Information Systems in the Big Data Era

Publisher: Springer International Publishing

Abstract

Process and data are equally important for business process management. Data-driven approaches in process analytics aim to value decisions that can be backed up with verifiable private and open data. Over the last few years, data-driven analysis of how knowledge workers and customers interact in social contexts, often with data obtained from social networking services such as Twitter and Facebook, has become a vital asset for organizations. For example, governments have started to extract knowledge and derive insights from rapidly growing open data to improve their services. A key challenge in analyzing social data is to understand the raw data generated by social actors and prepare it for analytic tasks. In this context, it is important to transform the raw data into contextualized data and knowledge. This task, known as data curation, involves identifying relevant data sources, extracting data and knowledge, and cleansing, maintaining, merging, enriching and linking that data and knowledge. In this paper we present CrowdCorrect, a data curation pipeline that enables analysts to cleanse and curate social data and prepare it for reliable business data analytics. The first step offers automatic feature extraction, correction and enrichment. Next, we design micro-tasks and use the knowledge of the crowd to identify and correct information items that could not be corrected in the first step. Finally, we offer a domain-model mediated method that uses the knowledge of domain experts to identify and correct items that could not be corrected in the previous steps. We adopt a typical scenario, analyzing Urban Social Issues from Twitter as they relate to the Government Budget, to highlight how CrowdCorrect significantly improves the quality of extracted knowledge compared to a classical curation pipeline that operates without the knowledge of the crowd and domain experts.
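The three-step escalation described in the abstract (automatic correction, then the crowd, then domain experts) can be illustrated with a minimal Python sketch. It is not the CrowdCorrect implementation: the lookup table, the function names (`auto_correct`, `crowd_correct`, `curate`), the majority-vote rule and the 0.6 agreement threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical lookup table standing in for the automatic correction step.
AUTO_CORRECTIONS: Dict[str, str] = {"hlth": "health", "govt": "government"}


def auto_correct(token: str) -> Optional[str]:
    """Step 1 (assumed): return an automatic correction, or None if unresolved."""
    return AUTO_CORRECTIONS.get(token.lower())


def crowd_correct(answers: List[str], min_agreement: float = 0.6) -> Optional[str]:
    """Step 2 (assumed): accept the crowd's majority answer if agreement is high enough."""
    if not answers:
        return None
    best, count = Counter(answers).most_common(1)[0]
    return best if count / len(answers) >= min_agreement else None


def curate(
    token: str,
    crowd_answers: Dict[str, List[str]],
    ask_expert: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each step in order; return the corrected token and the step that resolved it."""
    fixed = auto_correct(token)
    if fixed is not None:
        return fixed, "auto"
    fixed = crowd_correct(crowd_answers.get(token, []))
    if fixed is not None:
        return fixed, "crowd"
    # Step 3 (assumed): whatever remains unresolved goes to a domain expert.
    return ask_expert(token), "expert"
```

In this sketch an item reaches the crowd only when the automatic step returns nothing, and reaches an expert only when the crowd's answers do not agree strongly enough, mirroring the order of steps given in the abstract.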

https://github.com/unsw-cse-soc/CrowdCorrect.

http://www.budget.gov.au/.

https://noisy-text.github.io/norm-shared-task.html.

https://en.wikipedia.org/wiki/Malcolm_Turnbull.

https://azure.microsoft.com/en-au/try/cognitive-services/my-apis/.

https://help.twitter.com/en/using-twitter/twitter-polls.

https://en.wikipedia.org/wiki/PageRank.

https://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2465.

Title: CrowdCorrect: A Curation Pipeline for Social Data Cleansing and Curation
Authors: Amin Beheshti, Kushal Vaghani, Boualem Benatallah, Alireza Tabebordbar
Publisher: Springer International Publishing
Book: Information Systems in the Big Data Era
Print ISBN: 978-3-319-92900-2

Electronic ISBN: 978-3-319-92901-9

Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-92901-9_3
