2011 | OriginalPaper | Book chapter

3. Spam, Opinions, and Other Relationships: Towards a Comprehensive View of the Web Knowledge Discovery

Written by: Bettina Berendt

Published in: Advanced Topics in Information Retrieval

Publisher: Springer Berlin Heidelberg

Abstract

“Web mining” or “Web Knowledge Discovery” is the analysis of web resources with data-mining techniques such as classification, clustering, association-rule or graph-structure methods. Its applications pervade much of the software web users interact with on a daily basis: search engines’ indexing and ranking choices, recommender systems’ recommendations, targeted advertising, and many others. An understanding of this fast-moving field is therefore a key component of digital information literacy for everyone and a useful and fascinating extension of knowledge and skills for Information Retrieval researchers and practitioners. This chapter proposes an integrating model of learning cycles involving data, information and knowledge, explains how this model subsumes Information Retrieval and Knowledge Discovery and relates them to one another. We illustrate the usefulness of this model in an introduction to web content/text mining, using the model to structure the activities in this form of Knowledge Discovery. We focus on spam detection, opinion mining and relation mining. The chapter aims at complementing other books and articles that focus on the computational aspects of web mining, by emphasizing the often-neglected context in which these computational analyses take place: the full cycle of Knowledge Discovery, which ranges from application understanding via data understanding, data preparation, modeling and evaluation to deployment.

Two other application areas of web mining that have received a lot of attention recently are the mining of news and the mining of social media such as blogs; for overviews of their specifics, see for example the proceedings of the International Conference on Weblogs and Social Media at http://www.icwsm.org and Berendt (2010).

There are various concepts of “data vs. information vs. knowledge”. The notions we use are designed to be maximally consistent with the uses of these terms in the databases, Information Retrieval, and Knowledge Discovery literatures. For details, see Fig. 3.1.

The classical definition is “the nontrivial process of identifying valid, previously unknown, and potentially useful patterns” (Fayyad et al. 1996).

The association of induction/abduction with new knowledge goes back to Peirce, cf. the collection of relevant text passages at http://www.helsinki.fi/science/commens/terms/abduction.html.

Thanks to Ricardo Baeza-Yates for the ideas and discussions that led to this figure.

See http://www.ecmlpkdd2007.org/CD/tutorials/KDUbiq/kdubiq_print.pdf, retrieved on 2010-04-07.

Diagram adapted from http://www.crisp-dm.org/Process/index.htm.

While the exploration of data is often considered only the first step of data-mining modeling, it is also common to regard the whole of data mining (modeling) as exploratory data analysis. The reason is that, in contrast to confirmatory methods, one usually does not test a previously specified hypothesis, does not collect data solely for this purpose, and performs an open-ended number of statistical tests.

Other spammers want to convince gullible people to disclose their passwords (phishing). For reasons of space, we do not investigate this further here.

“New web spam techniques are introduced every 2–3 days.” (Liverani 2008).

See, e.g. AIRWeb 2009 at http://airweb.cse.lehigh.edu/2009.

All retrieved on 2010-04-10.

These are typical examples of humans having fed their knowledge into machine-readable data as described by the left-pointing arrows at the bottom of Fig. 3.3.

RDF triples (just like database content) do not need to be authored by technology-savvy users: Web forms are a convenient way to collect structured data from laypeople. Thus, for example, social networks generate and hold masses of personal data in table/RDF form and accessible over the Web. Examples are the FOAF export of Livejournal (http://www.livejournal.com/bots/) and exporter tools for Facebook (http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html), Twitter (http://sioc-project.org/node/262) or Flickr (http://apassant.net/home/2007/12/flickrdf).

These are typical examples of humans having fed their knowledge into machine-readable information as described by the left-pointing arrows in the middle of Fig. 3.3.

The decision whether to treat something as a concept (standing in a subclass relation to another concept) or as an instance (standing in an instance-of relation) is not always straightforward; it is handled differently by different extraction methods, and even treated differently by different logics and ontology formalisms. For reasons of space, we will therefore not investigate this differentiation.

Both retrieved on 2010-04-10.

These are typical examples of humans having fed their knowledge into machine-readable knowledge as described by the left-pointing arrows at the top of Fig. 3.3a, and into the form that can be used for automatic consistency checking in the sense of Fig. 3.3b.

See http://www.wikipedia.org.

See http://www.cyc.com.

See http://www.crisp-dm.org/Process/index.htm, retrieved on 2010-04-10.

See http://www.sociovision.de/loesungen/sinus-milieus.html, retrieved on 2010-04-10.

A cross-disciplinary initiative to understand the ways in which personal details are collected, stored, transmitted, checked, and used as means of influencing and managing people and populations; for an overview, see Lyon (2007).

We have deliberately not discussed any accuracies, F measure values, or other absolute numbers here, in order to concentrate on the big picture. However, the reader is encouraged to consult the original articles, investigate the reported quality values closely, and consider what, for example, a 20% misclassification rate or an unknown recall rate may mean in practice.
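To make the last point concrete, the following minimal sketch computes standard quality measures from a confusion matrix. All counts are hypothetical and chosen only for illustration (1,000 pages, 100 of them spam); they are not taken from any of the cited studies. The function name `classification_metrics` is likewise an assumption of this sketch. Both example classifiers have the same 20% misclassification rate, yet they differ sharply in recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return common quality measures for a binary classifier.

    tp/fp/fn/tn are the confusion-matrix counts, with "spam" as the
    positive class.
    """
    total = tp + fp + fn + tn
    error_rate = (fp + fn) / total
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "error_rate": error_rate,
        "accuracy": 1.0 - error_rate,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical classifier A: misses most spam (80 of 100 spam pages slip through).
cautious = classification_metrics(tp=20, fp=120, fn=80, tn=780)

# Hypothetical classifier B: catches most spam but flags many legitimate pages.
aggressive = classification_metrics(tp=90, fp=190, fn=10, tn=710)

for name, m in (("A", cautious), ("B", aggressive)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these invented counts, both classifiers misclassify 200 of 1,000 pages, but A finds only 20% of the spam while B finds 90% of it at the cost of many false alarms. If recall is not reported at all, the two cases cannot be told apart from the error rate alone.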

See http://movielens.umn.edu.

See http://wikiscanner.virgil.gr.

Attenberg J, Suel T (2008) Cleaning search results using term distance features. In: Proceedings of the International Workshop on Adversarial Information Retrieval on the Web. AIRWeb ’08. ACM Press, New York, NY, pp 21–24. http://doi.acm.org/10.1145/1451983.1451989, visited on December, 2010

Backstrom L, Dwork C, Kleinberg JM (2007) Wherefore art thou r3579x?: Anonymized social networks and hidden patterns and structural steganography. In: Williamson CL, Zurko ME, Patel-Schneider PF, Shenoy PJ (eds) Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 181–190

Baldi P, Frasconi P, Smyth P (2003) Modeling the Internet and the Web: Probabilistic Methods and Algorithms. Wiley, Chichester

Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the Web. In: Veloso MM (ed) Proceedings of the International Joint Conferences on Artificial Intelligence, pp 2670–2676

Barbaro M, Zeller T (2006) A face is exposed for AOL searcher No 4417749. http://www.nytimes.com/2006/08/09/technology/09aol.html, visited on December, 2010

Barth A, Datta A, Mitchell JC, Nissenbaum H (2006) Privacy and contextual integrity: Framework and applications. In: Proceedings of the IEEE Symposium on Security and Privacy. IEEE Computer Society, Los Alamitos, pp 184–198

Baxter D, Shepard B, Siegel N, Gottesman B, Schneider D (2005) Interactive natural language explanations of Cyc inferences. http://www.cyc.com/doc/white_papers/ExACt2005.pdf, visited on February, 2011

Berendt B (2007) Intelligent business intelligence and privacy: More knowledge through less data? In: Köppen, Müller R (eds) Business Intelligence: Methods and Applications. Verlag Dr. Kovač, Hamburg, pp 63–79

Berendt B (2008) You are a document too: Web mining and IR for next-generation information literacy. In: Macdonald C, Ounis I, Plachouras V, Ruthven I, White RW (eds) Proceedings of the European Conference on Information Retrieval. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, p 3

Berendt B (2010) Text mining for news and blogs analysis. In: Sammut C, Webb G (eds) Encyclopedia of Machine Learning. Springer, Berlin, pp 968–972

Berendt B, Krause B, Kolbe-Nusser S (2010) Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing and Management 46(1):1–10

Berry M, Linoff G (2002) Mining the Web: Transforming customer data. Wiley, Hoboken, NJ

Berry M, Linoff G (2004) Data Mining Techniques. Wiley, Hoboken, NJ

Bíró I, Szabó J, Benczúr AA (2008) Latent Dirichlet allocation in web spam filtering. In: Proceedings of the International Workshop on Adversarial Information Retrieval on the Web, pp 29–32

Bizer C, Heath T, Berners-Lee T (2009) Linked data - the story so far. International Journal of Semantic Web Information Systems 5(3):1–22

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3:993–1022

Buitelaar P, Cimiano P, Magnini B (2005) Ontology learning from text: An overview. In: Buitelaar P, Cimiano P, Magnini B (eds) Ontology Learning from Text: Methods, Evaluation and Applications/Frontiers in Artificial Intelligence and Applications, vol 7. IOS Press, pp 3–14

Cafarella MJ (2009) Extracting and querying a comprehensive Web database. http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_106.pdf, visited on February, 2011

Carlson A, Betteridge J, Wang RC, Hruschka ER Jr, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Davison BD, Suel T, Craswell N, Liu B (eds) Proceedings of the ACM Conference on Web Search and Data Mining. ACM Press, New York, NY, pp 101–110

Chakrabarti S (2003) Mining the Web. Morgan Kaufmann, San Francisco, CA

Cycorp (2001) Foundations of knowledge representation in Cyc: Microtheories. http://www.cyc.com/doc/tut/DnLoad/Microtheories.pdf, visited on December, 2010

Davenport T, Beck J (2001) The Attention Economy: Understanding the New Currency of Business. Harvard Business School Press, Cambridge, MA

Deerwester SC, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41:391–407

Domingo-Ferrer J (2007) A three-dimensional conceptual framework for database privacy. In: Secure Data Management. Lecture Notes in Computer Science, vol 4721. Springer, Berlin, pp 193–202

Drost I, Scheffer T (2005) Thwarting the nigritude ultramarine: Learning to identify link spam. In: Gama J, Camacho R, Brazdil P, Jorge A, Torgo L (eds) Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, pp 96–107

Etzioni O, Cafarella MJ, Downey D, Popescu AM, Shaked T, Soderland S, Weld DS, Yates A (2004) Methods for domain-independent information extraction from the Web: An experimental comparison. In: Proceedings of the National Conference on Artificial Intelligence, pp 391–398

Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Cambridge, MA, pp 1–34

Feldman R, Sanger J (2007) The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, Cambridge

Fellbaum C (1998) Wordnet: An Electronic Lexical Database. MIT Press, Cambridge, MA

Fortuna B, Grobelnik M, Mladenic D (2005) Visualization of text document corpus. Informatica (Slovenia) 29(4):497–504

Fortuna B, Mladenic D, Grobelnik M (2006) Semi-automatic construction of topic ontologies. In: Ackermann M (ed) Proceedings of the Semantics, Web and Mining Workshops at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery. Lecture Notes in Computer Science, vol 4289. Springer, Berlin, pp 121–131

Fortuna B, Galleguillos C, Cristianini N (2009) Detecting the bias in media with statistical learning methods. In: Text Mining: Classification, Clustering and Applications. Chapman & Hall/CRC Press, New York, NY, pp 27–50

Frankowski D, Cosley D, Sen S, Terveen LG, Riedl J (2006) You are what you say: Privacy risks of public mentions. In: Efthimiadis EN, Dumais ST, Hawking D, Järvelin K (eds) Proceedings of the ACM Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, pp 565–572

Gordon DF, des Jardins M (1995) Evaluation and selection of biases in machine learning. Machine Learning 20(1–2):5–22

Gürses FS (2010) Multilateral privacy requirements analysis in online social network services. PhD thesis, KU Leuven, Dept of Computer Science

Gürses FS, Berendt B (2010) The social Web and privacy: Practices, reciprocity and conflict detection in social networks. In: Ferrari E, Bonchi F (eds) Privacy-Aware Knowledge Discovery. Chapman & Hall/CRC Press, New York, NY, pp 395–432

Hand DJ, Smyth P, Mannila H (2001) Principles of Data Mining. MIT Press, Cambridge, MA

Hartig O (2009) Provenance information in the Web of data. http://www.dbis.informatik.hu-berlin.de/fileadmin/research/papers/conferences/2009-ldow-hartig.pdf, visited on December, 2010

Hayes P (2009) Blogic. http://www.slideshare.net/PatHayes/blogic-iswc-2009-invited-talk, visited on December, 2010

Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the Conference on Computational Linguistics, pp 539–545

Hu M, Liu B (2004) Mining opinion features in customer reviews. In: Proceedings of the National Conference on Artificial Intelligence, pp 755–760

Hu J, Zeng HJ, Li H, Niu C, Chen Z (2007) Demographic prediction based on user’s browsing behavior. In: Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 151–160

Katayama T, Utsuro T, Sato Y, Yoshinaka T, Kawada Y, Fukuhara T (2009) An empirical study on selective sampling in active learning for splog detection. In: Proceedings of the International Workshop on Adversarial Information Retrieval on the Web. ACM Press, New York, NY, pp 29–36

Kushmerick N, Weld DS, Doorenbos RB (1997) Wrapper induction for information extraction. In: Proceedings of the International Joint Conferences on Artificial Intelligence, pp 729–737

Lin WH, Xing EP, Hauptmann AG (2008) A joint topic and perspective model for ideological discourse. In: Daelemans W, Goethals B, Morik K (eds) Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery. Lecture Notes in Computer Science, vol 5212. Springer, Berlin, pp 17–32

Liu B (2007) Web Data Mining. Exploring Hyperlinks and Contents and Usage Data. Springer, Berlin

Liu H, Mihalcea R (2007) Of men and women and computers: Data-driven gender modeling for improved user interfaces. In: Proceedings of the International Conference on Weblogs and Social Media, pp 121–128

Liverani R (2008) Web spam techniques. http://malerisch.net/docs/web_spam_techniques/web_spam_techniques.html, visited on December, 2010

Lyon D (2007) Surveillance Studies: An Overview. Polity Press, Cambridge

Maedche A, Staab S (2001) Ontology learning for the semantic Web. IEEE Intelligent Systems 16(2):72–79

Matuszek C, Witbrock MJ, Kahlert RC, Cabral J, Schneider D, Shah P, Lenat DB (2005) Searching for common sense: Populating Cyc from the Web. In: Veloso MM, Kambhampati S (eds) Proceedings of the National Conference on Artificial Intelligence. AAAI/MIT Press, Cambridge, MA, pp 1430–1435

McGarry K (2005) A survey of interestingness measures for knowledge discovery. Knowledge Engineering Review 20(1):39–61

Mihalcea R, Liu H (2006) A corpus-based approach to finding happiness. http://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-027.pdf, visited on February, 2011

Mladenic D (1998) Turning Yahoo! to automatic web-page classifier. In: Proceedings of the European Conference on Artificial Intelligence, pp 473–474

Mobasher B (2007) Web usage mining. In: Liu B (ed) Web Data Mining: Exploring Hyperlinks and Contents and Usage Data. Springer, Berlin, pp 449–484. Chap 12

Nakasaki H, Kawaba M, Yamazaki S, Utsuro T, Fukuhara T (2009) Visualizing cross-lingual/cross-cultural differences in concerns in multilingual blogs. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/161/485, visited on December, 2010

Narayanan A, Shmatikov V (2008) Robust de-anonymization of large sparse datasets. In: Proceedings of the IEEE Symposium on Security and Privacy. IEEE Computer Society, Los Alamitos, pp 111–125

Nissenbaum H (2004) Privacy as contextual integrity. Washington Law Review 79(1):119–158

Nonaka I, Takeuchi H (1995) The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation. Oxford University Press, New York

Owad T (2006) Data mining 101: Funding subversives with amazon wishlists. http://www.applefritter.com/bannedbooks, visited on December, 2010

Pang B, Lee L (2008) Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1–2):1–135

Phillips D (2004) Privacy policy and PETs: The influence of policy regimes on the development and social implications of privacy enhancing technologies. New Media and Society 6(6):691–706

Piskorski J, Sydow M, Weiss D (2008) Exploring linguistic features for web spam detection: A preliminary study. In: Proceedings of the International Workshop on Adversarial Information Retrieval on the Web, pp 25–28

Popescu AM, Etzioni O (2005) Extracting product features and opinions from reviews. In: Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. The Association for Computational Linguistics, pp 339–346

Preibusch S (2006) Implementing privacy negotiations in e-commerce. In: Zhou X, Li J, Shen HT, Kitsuregawa M, Zhang Y (eds) Proceedings of the Asia-Pacific Web Conference. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, pp 604–615

Pyle D (1999) Data Preparation for Data Mining. Academic Press, San Diego, CA

Sarjant S, Legg C, Robinson M, Medelyan O (2009) All you can eat ontology-building: Feeding wikipedia to Cyc. Web Intelligence 341–348

Shearer C (2000) The CRISP-DM model: The new blueprint for data mining. Journal of Data Warehousing 5(4):13–22. http://www.crisp-dm.org, visited on December, 2010

Stumme G, Hotho A, Berendt B (2006) Semantic web mining: State of the art and future directions. Journal of Web Semantics 4(2):124–143

Sweeney L (2002) K-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5):557–570

Tian Y, Weiss GM, Ma Q (2007) A semi-supervised approach for web spam detection using combinatorial feature-fusion. In: Proceedings of the Graph Labelling Workshop and Web Spam Challenge at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery, pp 16–23. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.70.9384, visited on December, 2010

Urvoy T, Lavergne T, Filoche P (2006) Tracking web spam with hidden style similarity. In: Proceedings of the International Workshop on Adversarial Information Retrieval on the Web, pp 25–31

W3C (2000) HTML techniques for Web content accessibility guidelines. http://www.w3.org/TR/WCAG10-HTML-TECHS/, visited on December, 2010

Wardlow DL (1996) Theory, Practice and Research Issues in Marketing: Gays, Lesbians and Consumer Behavior. Haworth

Williams G, Anand S (2009) Predicting the polarity strength of adjectives using wordnet. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/download/214/541, visited on December, 2010

Wu B, Davison BD (2005) Identifying link farm spam pages. In: Ellis A, Hagino T (eds) Proceedings of the International Conference on the World Wide Web. ACM Press, New York, NY, pp 820–829

Zaïane OR (1998) From resource discovery to knowledge discovery on the internet. Tech Rep TR 1998–13, Simon Fraser University

Zittrain J (2008) The Future of the Internet—and How to Stop It. Caravan Books. http://futureoftheinternet.org/, visited on December, 2010

Title: Spam, Opinions, and Other Relationships: Towards a Comprehensive View of the Web Knowledge Discovery
Written by: Bettina Berendt
Publisher: Springer Berlin Heidelberg
Book: Advanced Topics in Information Retrieval
Print ISBN: 978-3-642-20945-1

Electronic ISBN: 978-3-642-20946-8

Copyright year: 2011
DOI: https://doi.org/10.1007/978-3-642-20946-8_3
