2018 | OriginalPaper | Book chapter

Web Data Extraction from Scientific Publishers’ Website Using Hidden Markov Model

Authors: Jing Huang, Ziyu Liu, Beibei Wang, Mingyue Duan, Bo Yang

Published in: Knowledge Science, Engineering and Management

Publisher: Springer International Publishing

Abstract

The volume of information published on web pages keeps growing. Numerous papers are published in more than three thousand journals, especially in the field of technology, and it is almost impossible for a user to look up journal information one item at a time: finding details such as a journal's introduction, impact factor, or ISSN across thousands of journals means clicking through many links. Solving this problem requires an automatic method that filters the relevant information out of the deep web. The method presented in this paper helps users quickly obtain the information they need, classified and extracted. The paper makes three contributions. First, a machine learning method, the Hidden Markov Model (HMM), is used to extract journal information from publishers' websites, which improves on the generalization ability of the heuristic method. Second, during data processing, a content extraction technique is applied to improve the performance of the HMM. Third, the extracted information is stored in a structured form and displayed. In the experiments, three algorithms are compared on accuracy, recall, and F-measure; the results show that the HMM with content extraction (C-HMM) performs best.
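To make the idea of HMM-based field extraction concrete, the following is a minimal sketch of Viterbi decoding, the standard way to recover the most likely hidden state sequence from an HMM. The abstract does not give the model's states, token features, or parameters, so everything here is an illustrative assumption: the state names (`other`, `issn`, `impact_factor`), the token classes (`WORD`, `ISSN_KEY`, `ISSN_NUM`, `IF_KEY`, `DECIMAL`), and all probabilities are hypothetical and hand-picked rather than learned.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for a list of observations.

    Works in log space; probabilities below `floor` are clamped so that
    unseen symbols do not produce log(0).
    """
    def lp(p):
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    first = observations[0]
    best = [{s: (lp(start_p[s]) + lp(emit_p[s].get(first, 0.0)), None)
             for s in states}]

    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor r that maximizes the path score into s.
            score, back = max((prev[r][0] + lp(trans_p[r][s]), r) for r in states)
            column[s] = (score + lp(emit_p[s].get(obs, 0.0)), back)
        best.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy model (all values are assumptions for illustration).
states = ["other", "issn", "impact_factor"]
start_p = {"other": 0.8, "issn": 0.1, "impact_factor": 0.1}
trans_p = {
    "other":         {"other": 0.6, "issn": 0.2, "impact_factor": 0.2},
    "issn":          {"other": 0.3, "issn": 0.6, "impact_factor": 0.1},
    "impact_factor": {"other": 0.3, "issn": 0.1, "impact_factor": 0.6},
}
emit_p = {
    "other":         {"WORD": 0.90, "DECIMAL": 0.04, "ISSN_KEY": 0.02,
                      "ISSN_NUM": 0.02, "IF_KEY": 0.02},
    "issn":          {"WORD": 0.05, "DECIMAL": 0.03, "ISSN_KEY": 0.45,
                      "ISSN_NUM": 0.45, "IF_KEY": 0.02},
    "impact_factor": {"WORD": 0.05, "DECIMAL": 0.45, "ISSN_KEY": 0.03,
                      "ISSN_NUM": 0.02, "IF_KEY": 0.45},
}

# Token classes for a page fragment such as:
#   "Journal  ISSN  1234-5678  Current  Impact Factor  2.5"
tokens = ["WORD", "ISSN_KEY", "ISSN_NUM", "WORD", "IF_KEY", "DECIMAL"]
decoded = viterbi(tokens, states, start_p, trans_p, emit_p)
print(list(zip(tokens, decoded)))
```

In a real system the transition and emission probabilities would be estimated from labeled pages, and the token sequence would come from the page text left after content extraction; the sketch only shows the decoding step that assigns a field label to each token.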



Title: Web Data Extraction from Scientific Publishers’ Website Using Hidden Markov Model
Authors: Jing Huang, Ziyu Liu, Beibei Wang, Mingyue Duan, Bo Yang
Publisher: Springer International Publishing
Book: Knowledge Science, Engineering and Management
Print ISBN: 978-3-319-99364-5
Electronic ISBN: 978-3-319-99365-2
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-99365-2_42
