Published in: Annals of Data Science 1/2015

1 March 2015

Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources

Authors: Ekaterina Chernyak, Boris Mirkin

Abstract

We present a step-by-step approach to taxonomy construction. In the first step, the upper-layer frame of the taxonomy is built manually from educational materials. In the subsequent steps, the frame is refined at a chosen topic using the Wikipedia category tree and articles, both cleaned of noise. Our main tool is a naturally defined string-to-text relevance score based on annotated suffix trees. The relevance score is used in several tasks, including: (1) cleaning the Wikipedia category tree or page set of noise; (2) allocating Wikipedia categories to taxonomy topics; and (3) deciding whether an allocated category should be included as a child of the taxonomy topic. The resulting taxonomy fragment consists of three parts: the manually set upper-layer topic, the adopted part of the Wikipedia category tree, and Wikipedia articles as leaves. Every leaf is assigned a set of so-called descriptors, which are phrases explaining aspects of the leaf topic. The method is illustrated by its application to two domains in mathematics, both in Russian: (a) “Probability theory and mathematical statistics” and (b) “Numerical mathematics”.
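The abstract does not spell out how the string-to-text relevance score is computed, so the following is only a minimal sketch of one common way to score a string against an annotated suffix tree. It assumes an uncompressed character trie in place of a true suffix tree, and it assumes the score is the average conditional probability of characters along each matched path, averaged over all suffixes of the query string. The names `Node`, `build_ast`, and `relevance` are hypothetical and chosen for illustration.

```python
class Node:
    """Trie node annotated with how often its substring occurs."""

    def __init__(self):
        self.children = {}
        self.freq = 0


def build_ast(fragments):
    """Insert every suffix of every text fragment, counting frequencies."""
    root = Node()
    for frag in fragments:
        for i in range(len(frag)):
            node = root
            for ch in frag[i:]:
                node = node.children.setdefault(ch, Node())
                node.freq += 1
    # Assumption: the root frequency is the sum of its children's frequencies.
    root.freq = sum(child.freq for child in root.children.values())
    return root


def relevance(root, string):
    """Score a string against the tree (assumed formula, see lead-in).

    For each suffix of the string, follow the longest matching path and
    average the conditional probabilities freq(child) / freq(parent)
    along it; then average these values over all suffixes.
    """
    if not string or root.freq == 0:
        return 0.0
    total = 0.0
    for i in range(len(string)):
        node, probs = root, []
        for ch in string[i:]:
            child = node.children.get(ch)
            if child is None:
                break  # the match for this suffix ends here
            probs.append(child.freq / node.freq)
            node = child
        if probs:
            total += sum(probs) / len(probs)
    return total / len(string)


if __name__ == "__main__":
    # Toy "text" made of two fragments; real input would be Wikipedia text.
    tree = build_ast(["probability theory", "random variable"])
    for phrase in ["probability", "geometry"]:
        print(phrase, round(relevance(tree, phrase), 3))
```

Under these assumptions the score lies between 0 and 1, and a string with no characters in common with the text scores 0. Tasks such as noise cleaning or category allocation could then compare the score against a threshold, whose value would have to be chosen empirically.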

Metadata
Title: Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources
Authors: Ekaterina Chernyak, Boris Mirkin
Publication date: 1 March 2015
Publisher: Springer Berlin Heidelberg
Published in: Annals of Data Science, Issue 1/2015
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI: https://doi.org/10.1007/s40745-015-0032-1
