Published in: Wireless Personal Communications 1/2015

01.09.2015

Effectual Web Content Mining using Noise Removal from Web Pages

Written by: P. Sivakumar


Abstract

Web mining is an emerging research area, driven by the rapid growth of websites. It is classified into Web Content Mining (WCM), Web Usage Mining and Web Structure Mining. WCM is the extraction of required information from web page content available on the World Wide Web (WWW). WCM is further divided into two categories: the first mines the content of documents directly, and the second mines content using a search engine. The mining method focuses on information extraction and integration. Web content may be text, images, audio or video. Web pages typically contain a large amount of information that is not part of their main content, such as banner advertisements, navigation bars and copyright notices. Such noise usually leads to poor results in web mining. This paper focuses on the problem of noise-free information retrieval from web pages, that is, automatically pre-processing web pages to detect and eliminate noise. It proposes an approach for eliminating noise from web pages in order to improve the accuracy and efficiency of web content mining. The main objective of removing noise from a web page is to improve search performance. It is essential to differentiate important information from noisy content that may misdirect users' interest. The approach removes the following kinds of noise in stages: (1) primary noise: navigation bars, panels and frames, page headers and footers, copyright and privacy notices, advertisements, and other uninteresting data such as audio, video and multiple links; (2) duplicate content; and (3) noise content identified according to block importance. These are removed by performing three operations. First, the Block Splitting operation removes primary noise and partitions only the useful text content into blocks.
Second, the simhash algorithm is used to remove duplicate blocks and obtain the distinct blocks. Then, for each block, three parameters are calculated: Keyword Redundancy (KR), Linkword Percentage (LP) and Titleword Relevancy (TR). Using these three parameters, a block importance (BI) value is calculated; this calculation is also referred to as the simhash algorithm. Finally, based on a threshold value, the important blocks are selected using a sketching algorithm, and keywords are extracted from those important blocks.
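
The duplicate-block removal step can be illustrated with a minimal sketch. It is not the paper's implementation: the tokenizer, the MD5-based token hash, the 64-bit fingerprint width, the Hamming-distance cutoff `max_distance` and the function names are all assumptions chosen for illustration, and the KR, LP, TR and BI calculations are left out because the abstract does not give their formulas.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Simhash-style fingerprint: every token votes on every bit position.

    Assumes bits is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    votes = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:  # a bit is set when more tokens voted 1 than 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def distinct_blocks(blocks, max_distance=3):
    """Keep the first block of each near-duplicate group.

    max_distance is a hypothetical cutoff, not a value from the paper.
    """
    kept = []  # (fingerprint, block) pairs retained so far
    for block in blocks:
        fp = simhash(block)
        if all(hamming(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, block))
    return [block for _, block in kept]


if __name__ == "__main__":
    blocks = [
        "Web mining extracts useful knowledge from web pages.",
        "web mining extracts useful knowledge from WEB pages",
        "Copyright notices and banner advertisements are typical page noise.",
    ]
    for block in distinct_blocks(blocks):
        print(block)
```

In this example the second block differs from the first only in case and punctuation, so both tokenize identically, receive the same fingerprint, and the second is dropped. The pairwise comparison is quadratic in the number of blocks, which is acceptable for the blocks of a single page but would need an index for large collections.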


Metadata
Title
Effectual Web Content Mining using Noise Removal from Web Pages
Written by
P. Sivakumar
Publication date
01.09.2015
Publisher
Springer US
Published in
Wireless Personal Communications / Issue 1/2015
Print ISSN: 0929-6212
Electronic ISSN: 1572-834X
DOI
https://doi.org/10.1007/s11277-015-2596-7
