Elsevier

Information Systems

Volume 56, March 2016, Pages 36-54
A tool for producing structured interoperable data from product features on the web

https://doi.org/10.1016/j.is.2015.09.002

Highlights

  • A tool producing structured data from product features on the web is introduced.

  • This is the first Protégé plug-in that extracts product features from web pages.

  • Extracting information from complex-data intensive web sites is partially handled.

  • The user creates a template manually using a domain-specific language.

  • The output is GoodRelations snippets containing product features in RDFa/Microdata.

Abstract

This paper introduces a tool that produces structured interoperable data from product features, i.e., attribute name–value pairs, on the web. The tool extracts the product features using a web site-specific template created by the user. The value of the extracted data is maximized by using GoodRelations, which is the standard vocabulary for modeling product types and their features. The final output of the tool is GoodRelations snippets, which contain product features encoded in RDFa or Microdata. These snippets can be embedded into existing static and dynamic web pages in a way accessible to major search engines like Google and Yahoo, mobile applications, and browser extensions. This increases the visibility of your products and services in the latest generation of search engines, recommender systems, and other novel applications.

Introduction

The web contains a huge number of online shops, which provide excellent resources for product information. Moreover, e-commerce data is growing rapidly [1]. Information in e-commerce includes technical specifications and descriptions of products. Presenting this information in a structured way would significantly improve the effectiveness of many applications [2].

The vast majority of web content consists of different kinds of textual documents, which are provided in a number of different formats and vary from plain text to semi-structured documents containing data records. This makes different methods of bringing structure and semantics to the web (including web information extraction) an active research field [3]. Although the web has a dynamic nature, Etzioni has argued that “information on the web is sufficiently structured to facilitate effective web mining” [4]. Since a large portion of the web content subject to web information extraction is created from data repositories, a web information extraction system rediscovers the structure that was encoded in a web page.

This paper introduces a tool that produces structured interoperable data from product features, i.e., attribute name–value pairs, on the web. It extends the previous work of the author [5] in two ways. First, it supports tree nodes that define text operations (e.g., concatenate, contains, fragment, lower, upper, replace, substring, and trim) on tree nodes. Second, it presents a user-based evaluation accomplished using 15 different “real world” scenarios.
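As an illustration of how such text operations might be chained over an extracted string, the following is a minimal sketch. The operation names are taken from the list above; the registry `TEXT_OPS`, the function `apply_ops`, and the argument conventions are hypothetical and are not the tool's template language. The "fragment" operation is left out because its semantics are not described in this excerpt.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical registry mapping operation names to functions.
# Names follow the paper's list; signatures and semantics are assumptions.
TEXT_OPS: Dict[str, Callable[..., object]] = {
    "concatenate": lambda value, other: value + other,
    "contains": lambda value, needle: needle in value,
    "lower": lambda value: value.lower(),
    "upper": lambda value: value.upper(),
    "replace": lambda value, old, new: value.replace(old, new),
    "substring": lambda value, start, end=None: value[start:end],
    "trim": lambda value: value.strip(),
}


def apply_ops(value: str, steps: Iterable[Tuple[str, tuple]]):
    """Apply (operation, args) steps in order to an extracted string."""
    for name, args in steps:
        value = TEXT_OPS[name](value, *args)
    return value


# Clean up a raw attribute value as it might appear in a product page.
raw = "  Screen Size: 15.6 INCH \n"
steps = [("trim", ()), ("replace", ("Screen Size: ", "")), ("lower", ())]
cleaned = apply_ops(raw, steps)
print(cleaned)
```

Keeping each operation as a small pure function makes it straightforward to compose several of them on one extracted value, which is the kind of post-processing the text operations appear to be intended for.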

Designed as a plug-in for the open source ontology editor Protégé [6], the proposed tool exploits the advantages of the ontology as a formal model for the domain knowledge.

Another promising feature of the tool is its support for building an ontology that is compatible with the GoodRelations vocabulary [7]. GoodRelations is the most powerful vocabulary for publishing all of the details of your products and services in a way friendly to search engines, mobile applications, and browser extensions. In [8], the GoodRelations product ontology is described as a “product atlas” covering specifications, marketing copy, catalog data, photos, videos, manuals, installation instructions, updates, responses to issues, prices and reviews of over 1 million products from various sources. It is updated twice a week, so companies can enter all their descriptions and prices, and the information will flow through the thousands of e-commerce systems within a few days. The goal is to have extremely deep information on millions of products, providing a resource that can be plugged into any e-commerce system without limitation.

If you have GoodRelations in your markup, Google, Bing, Yahoo, and Yandex either already improve, or plan to improve, the rendering of your page directly in the search results. Rich snippets, the few lines of text that appear under every search result, are designed to give users a sense of what is on the page and why it is relevant to their query. Fig. 1 shows the difference between a regular and a rich snippet. The first search result is a regular snippet and the second one is a rich snippet. The proposed tool supports marking up your content with RDFa or Microformats for creating rich snippets of the extracted products.
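To make the idea of such markup concrete, the following minimal sketch renders a product name and its features as an RDFa fragment. It is not the tool's actual output, which is not shown in this excerpt. `gr:SomeItems` and `gr:name` are GoodRelations terms; the `ex:` prefix, its example.com namespace, the function `to_rdfa`, and the way feature names are turned into property names are hypothetical.

```python
from html import escape

GR_NS = "http://purl.org/goodrelations/v1#"
EX_NS = "http://example.com/vocab#"  # hypothetical namespace for feature properties


def to_rdfa(product_name: str, features: dict) -> str:
    """Render a product and its name-value features as an RDFa 1.1 fragment."""
    lines = [
        f'<div prefix="gr: {GR_NS} ex: {EX_NS}" typeof="gr:SomeItems">',
        f'  <span property="gr:name">{escape(product_name)}</span>',
    ]
    for name, value in features.items():
        # Assumption: derive a property name by keeping only letters and digits.
        prop = "ex:" + "".join(ch for ch in name.lower() if ch.isalnum())
        lines.append(f'  <span property="{prop}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


snippet = to_rdfa("Acme Laptop", {"Screen Size": "15.6 inch", "Weight": "1.2 kg"})
print(snippet)
```

In practice the feature properties would come from the ontology built for the product domain rather than from an ad hoc namespace; the sketch only shows how name-value pairs map onto annotated HTML.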

In addition to producing rich snippets in the search results, rich markup indicates the relevance of your page for a particular query. You provide information to the search engines so that they can rank your page higher for queries to which your offer is a particularly relevant match. For many popular shop applications, including Drupal Commerce, Magento, Prestashop, Rakuten.de/Tradoria.de, and Joomla/Virtuemart, there exist free extension modules that make adding GoodRelations RDFa for semantic SEO as simple as a few mouse-clicks.

The tool supports “Open Standards” including RDFa, Microdata and JSON. The International Telecommunication Union (ITU-T) specifies that “Open Standards” facilitate interoperability and data exchange among different products or services and are intended for widespread adoption. In [9], the key benefits of interoperable data are listed as follows: enabling information sharing with trusted partners, enhancing system capabilities and longevity, lowering overall costs of information applications, improving the breadth and quality of information, increasing the speed and accuracy of decisions, improving transparency and speed of disclosure of information to valid constituents, and preserving data for future uses.

The article is organized as follows: In Section 2, we review background information and related work. Section 3 includes an overview of the system's architecture, features and settings, and a scenario-based quick-start guide. Section 4 presents a user-based evaluation accomplished using 15 different “real world” scenarios. Finally, Section 5 concludes the article with a brief discussion of possible future work.

Section snippets

Background knowledge and related work

Information extraction systems can be divided into the following three categories [10]:

  • Procedural wrapper: The approach is based on writing customized wrappers for accessing required data from a given set of information sources. In these systems the extraction rules are coded into the program.

  • Declarative wrapper: These systems consist of a general execution engine and declarative extraction rules developed for specific data sources.

  • Automatic wrapper: These systems use machine learning

Scenario-based system specification

The proposed tool gathers semi-structured product information from an HTML page, applies the extraction rules specified in the template file, and presents the extracted product data in an ontology that is compatible with the GoodRelations vocabulary. It has two main components: the wrapper and the ontology builder (Fig. 3).
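The extraction step can be pictured with a minimal sketch, assuming a well-formed page and a template reduced to three path expressions. The dictionary `TEMPLATE`, the function `extract_features`, and the sample page are hypothetical and do not reproduce the tool's template language. Python's standard `xml.etree.ElementTree` is used here only because the sample markup is well-formed; real pages rarely are and would need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with a specification table.
PAGE = """
<html><body>
  <table id="specs">
    <tr><th>Brand</th><td>Acme</td></tr>
    <tr><th>Weight</th><td> 1.2 kg </td></tr>
  </table>
</body></html>
"""

# Hypothetical template: where the rows are, and where name and value sit in a row.
TEMPLATE = {
    "rows": ".//table[@id='specs']/tr",
    "name": "th",
    "value": "td",
}


def extract_features(page: str, template: dict) -> dict:
    """Return attribute name-value pairs found by the template's paths."""
    root = ET.fromstring(page)
    features = {}
    for row in root.findall(template["rows"]):
        name = row.findtext(template["name"])
        value = row.findtext(template["value"])
        if name is not None and value is not None:
            features[name.strip()] = value.strip()
    return features


features = extract_features(PAGE, TEMPLATE)
print(features)
```

Separating the site-specific paths from the generic extraction loop mirrors the split between a template file and a wrapper: supporting a new shop would mean writing a new template, not new extraction code.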

The wrapper extracts the product data from the web page using the template file. The HTML page is first parsed into a DOM tree using HtmlUnit, which is a web driver that supports

Evaluation

We evaluate the usability of the tool using three questionnaires: Computer System Usability Questionnaire (CSUQ), System Usability Scale (SUS) and Microsoft's Product Reaction Card (MPRC). We primarily choose CSUQ and SUS for our experiments because these two approaches have a higher accuracy with an increasing sample size than the other questionnaires. These two formal approaches provided results to evaluate whether users can complete tasks. We supported these approaches with MPRC for

Conclusion and future work

This paper introduces a Protégé plug-in that collects product features from the web and transforms this information into GoodRelations snippets in RDFa or Microformats. The system attempts to solve an increasingly important problem: extracting useful information from the product descriptions provided by the sellers and structuring this information into a common and sharable format among business entities, software agents and search engines. It also presents a user-based evaluation accomplished

References (36)

  • T.R. Gruber, A translation approach to portable ontology specifications, Knowl. Acquis. (1993)
  • Q. Liu, H. Wang, H. Gao, Q. Lv, J. Fu, A recommendation method in e-commerce based on product taxonomy graph, in:...
  • W. Tang, Y. Hong, Y.-H. Feng, J.-M. Yao, Q.-M. Zhu, Simultaneous product attribute name and value extraction with...
  • F. Dominik, Semi-automatic web information extraction (Ph.D. thesis), Poznan University of Economics Department of...
  • O. Etzioni, The world wide web: quagmire or gold mine?, Commun. ACM (1996)
  • T. Özacar, IRIS: a Protégé plug-in to extract and serialize product attribute name-value pairs, in: Proceedings of the...
  • N. Noy, R. Fergerson, M. Musen, The knowledge model of protégé-2000: combining interoperability and flexibility, in:...
  • M. Hepp, Goodrelations: an ontology for describing products and services offers on the web, in: Proceedings of the 16th...
  • D. Siegel, Pull: The Power of the Semantic Web to Transform Your Business, 1st edition, Portfolio Hardcover,...
  • P. O'Dell, Silver Bullets: How Interoperable Data Will Revolutionize Information Sharing and Transparency, AuthorHouse,...
  • J. Han, Design of web semantic integration system (Ph.D. thesis), Tennessee State University. Electrical & Computer...
  • A. Firat, Information integration using contextual knowledge and ontology merging (Ph.D. thesis), Massachusetts...
  • The xpath 2.0 Standard, W3c Recommendation 〈http://www.network-theory.co.uk/w3c/xpath/〉,...
  • I. Muslea et al., A Hierarchical Approach to Wrapper Induction (1999)
  • C.H. Chang et al., A survey of web information extraction systems, IEEE Trans. Knowl. Data Eng. (2006)
  • C. Leondes, Neural networks, fuzzy theory and genetic algorithms (2005)
  • G. Huck, P. Fankhauser, K. Aberer, E.J. Neuhold, Jedi: extracting and synthesizing information from the web., in:...
  • H. Garcia-Molina et al., The TSIMMIS approach to mediation: data models and languages, J. Intell. Inf. Syst. (2004)