
GeoInformatica

Published: 1 October 2013

An algorithm for local geoparsing of microtext

Authors: Judith Gelernter, Shilpa Balaji

Published in: GeoInformatica | Issue 4/2013


Abstract

The location of the author of a social media message is not invariably the same as the location that the author writes about in the message. In applications that mine these messages for information, such as tracking news and political events or responding to disasters, it is the geographic content of the message rather than the location of the author that is important. To this end, we present a method to geo-parse the short, informal messages known as microtext. Our preliminary investigation has shown that many microtext messages contain place references that are abbreviated, misspelled, or highly localized. These references are missed by standard geo-parsers. Our geo-parser is built to find such references. It uses Natural Language Processing methods to identify references to streets and addresses, buildings and urban spaces, toponyms, and place acronyms and abbreviations. It combines heuristics, open-source Named Entity Recognition software, and machine learning techniques. Our primary data consisted of Twitter messages sent immediately following the February 2011 earthquake in Christchurch, New Zealand. On this sample of Twitter messages, the algorithm identified locations with an F statistic of 0.85 for streets, 0.86 for buildings, 0.96 for toponyms, and 0.88 for place abbreviations, with a combined average F of 0.90 for identifying places. The same data run through a standard geo-parser, Yahoo! Placemaker, yielded an F statistic of zero for streets and buildings (because Placemaker is designed to find neither streets nor buildings), and an F of 0.67 for toponyms.
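The F statistic quoted in the abstract is conventionally the harmonic mean of precision and recall. The following is a minimal sketch assuming the balanced (F1) form, since the abstract does not say which weighting was used. The function name `f_measure` and the counts in the example are hypothetical and are not taken from the paper.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F1 score: harmonic mean of precision and recall.

    Assumption: equal weighting of precision and recall.
    """
    if true_pos == 0:
        # No correct detections: F is zero, as for a parser that
        # is not designed to find a given place category at all.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
score = f_measure(true_pos=90, false_pos=10, false_neg=20)
missed_everything = f_measure(true_pos=0, false_pos=0, false_neg=12)
```

Under this definition, a parser that returns no correct street references scores zero regardless of how many streets the messages contain, which is consistent with the zero reported for Placemaker on streets and buildings.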


These statistics date to February 2012, from http://www.ebizmba.com/articles/social-networking-websites

These are found at the following web addresses as of February 7, 2012: Yahoo Placemaker at http://developer.yahoo.com/geo/placemaker/; Metacarta geoparser at http://www.metacarta.com/products-platform-queryparser.htm; Drupal at http://geoparser.andrewl.net/; and the Unlock system at http://unlock.edina.ac.uk/texts/introduction.

http://thenextweb.com/socialmedia/2010/04/14/twitter-announces-annotations-add-metadata-tweet-starting-quarter-2/

Our data consist of about 300,000 tweets, a sample of 1 in every 1,000 of the roughly 300,000,000 tweets sent in 1 h. The tweets were dated right after the earthquake. Takahashi, Abe, and Igata, “Can Twitter be an alternative of real-world sensors,” LNCS 6763, 2011, found that 0.6% of tweets had GPS coordinates.

Twitter users developed their own indexing practices of using a “#” symbol, called a hashtag, to label tweets of a topic.
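As an illustration of how such hashtag labels can be pulled out of a tweet, here is a minimal sketch using a regular expression. The function name `hashtags` and the sample tweet are hypothetical, and the pattern is a simplification, not Twitter's own tokenization rule.

```python
import re
from typing import List

# Simplified pattern: "#" followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(tweet: str) -> List[str]:
    """Return the hashtag labels in a tweet, lowercased, without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


# Hypothetical example tweet.
tags = hashtags("Power out near Cathedral Square #eqnz #Christchurch")
```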

We would like to add time as a proxy for distance, since at present we miss the radius around San Bruno in a tweet like “about an hr and a half from San Bruno.”

Illinois co-reference package: http://cogcomp.cs.illinois.edu/page/software_view/18; BART at http://www.bart-coref.org/

We use the dictionary that loads with every Linux operating system as a dictionary of the English language. We use a dictionary of abbreviations common to Twitter called the Twittonary, which we were granted permission to use in research. We refer also to some minor word lists, such as the buildings list from Wikipedia, and a list of saints’ names (to distinguish saints from streets) from http://www.catholic.org/saints/stindex.php
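One way a saints' names list might be used to separate "St" as "Saint" from "St" as "Street" is sketched below. This is an assumed heuristic written for illustration, not the authors' implementation: the function `classify_st`, the three-name stand-in list, and the sample message are all hypothetical.

```python
from typing import List

# Tiny stand-in for a saints' names list; a real list is far longer.
SAINTS = {"james", "martin", "patrick"}


def classify_st(tokens: List[str], i: int) -> str:
    """Classify the token at index i (assumed to be 'St' or 'St.').

    Returns 'saint' if the next token is a known saint's name,
    'street' if the previous token is capitalized, else 'unknown'.
    """
    nxt = tokens[i + 1].strip(".,!?").lower() if i + 1 < len(tokens) else ""
    prev = tokens[i - 1] if i > 0 else ""
    if nxt in SAINTS:
        return "saint"
    if prev[:1].isupper():
        return "street"
    return "unknown"


# Hypothetical message, split on whitespace.
tokens = "Crack in road on Colombo St near St James church".split()
first = classify_st(tokens, 5)   # the "St" after "Colombo"
second = classify_st(tokens, 7)  # the "St" before "James"
```

The saint check runs first so that a phrase such as "St Patrick St" resolves its first "St" as a saint and its last as a street.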

http://developer.gauner.org/jspellcorrect

http://en.wikipedia.org/wiki/list_of_building_types

U.S. airports are found in tweets, but they do not make good training data: U.S. airport abbreviations are forced into a 3-letter mold and are not supposed to repeat around the country, so many do not follow customary abbreviation rules. For example, LAX stands for the Los Angeles, California airport, and EWR represents the Newark, New Jersey airport. We therefore avoided this sort of abbreviation for training the classifier.

http://www.catholic.org/saints/stindex.php

The part-of-speech tagger for Twitter by Noah Smith et al. is at http://www.ark.cs.cmu.edu/TweetNLP/

“Consensus decision-making” in Wikipedia, Retrieved July 24, 2012, from http://en.wikipedia.org/wiki/Consensus_decision-making

Kilem Gwet (2002). Kappa statistic is not satisfactory for assessing the extent of agreement between raters. Retrieved July 15, 2012 from http://www.agreestat.com/research_papers/kappa_statistic_is_not_satisfactory.pdf; Julius Sim, Chris Wright (2005). The kappa statistic in reliability studies: Use, interpretation and sample size requirements. Phys Ther 85(3):257–68.

Official New Zealand gazetteer of place names, at http://www.linz.govt.nz/placenames/find-names/nz-gazetteer-official-names as of January 31, 2012.

Write to gelern@cs.cmu.edu for use of the geo-tagged 2011 earthquake tweets from Christchurch, New Zealand, or the geo-tagged 2011 fire tweets from Austin, Texas.

We reported results of testing the second version of the algorithm at the high performance computing conference XSEDE’12 in Chicago, Illinois, USA, in July 2012.

http://www.ark.cs.cmu.edu/TweetNLP/

These have been replaced in the next version of the algorithm, which will be presented at the XSEDE’12 conference in July 2012.

List of Saints’ Names: http://www.catholic.org/saints/stindex.php


Title: An algorithm for local geoparsing of microtext
Authors: Judith Gelernter
Shilpa Balaji
Publication date: 1 October 2013
Publisher: Springer US
Published in: GeoInformatica / Issue 4/2013
Print ISSN: 1384-6175
Electronic ISSN: 1573-7624
DOI: https://doi.org/10.1007/s10707-012-0173-8
