1 Introduction
- establish a good starting point for exploring IES and the proposed BigGrams system through a theoretical and practical description of both systems,
- briefly describe the novel WI algorithm with a use case and theoretical preliminaries,
- establish the impact on the WI algorithm results of (1) the input form (the seed set and the taxonomy of seeds), (2) the pre-processing of the domain's web pages, (3) the matching techniques, and (4) the level of HTML document representation,
- find the combination of the elements mentioned above that yields the best WI algorithm results,
- determine what requirements must be satisfied to use the proposed WI iteratively, i.e. in the boosting mode, where the output results are fed back to the system input.
2 State of the art and related work
3 Formal description of the information extraction system
3.1 Theoretical preliminaries
3.1.1 Practical preliminaries
3.2 The general framework of the information extraction system
4 BigGrams as the implementation of the information extraction system
4.1 The comparison of BigGrams and SEAL systems
4.2 Specification at a high level of abstraction
4.2.1 Specification details with examples
- fixes the structure of an HTML document (closes the HTML tags, closes attribute values with the " character, etc.),
- cleans an HTML document of unnecessary elements (headers, JavaScript, CSS, comments, footers, etc.),
- changes the level of granularity of the HTML tags.

- <h1 attribute1="" attribute2=""> - the HTML tags without attribute values,
- <h1> - the HTML tags without attributes and their values.
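The two granularity levels above can be sketched with a small regex-based normalizer. This is a simplified illustration, not the system's actual pre-processing code; the tag and attribute names are stand-ins:

```python
import re

def to_tag_level(html: str, keep_attr_names: bool = False) -> str:
    """Reduce HTML tags to a coarser granularity level.

    keep_attr_names=True  -> <h1 attribute1="" attribute2=""> (values cleared)
    keep_attr_names=False -> <h1>                             (attributes dropped)
    """
    def normalize(match: re.Match) -> str:
        closing, name, attrs = match.group(1), match.group(2), match.group(3)
        if closing or not keep_attr_names:
            return f"<{closing}{name}>"
        # keep attribute names, clear their values
        names = re.findall(r'([\w-]+)\s*=\s*"[^"]*"', attrs)
        cleared = "".join(f' {n}=""' for n in names)
        return f"<{name}{cleared}>"

    return re.sub(r"<(/?)(\w+)([^>]*)>", normalize, html)
```

For example, `to_tag_level('<h1 class="x">Hi</h1>')` yields `<h1>Hi</h1>`, while passing `keep_attr_names=True` yields `<h1 class="">Hi</h1>`.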
4.3 Implementation
4.3.1 Theoretical preliminaries
4.3.2 The wrapper induction algorithm and the use case
| | \(c_1\) | \(c_2\) | \(c_3\) | \(c_4\) | \(c_5\) |
|---|---|---|---|---|---|
| | <br/> | <li class="film title"><br/> | <ul><li class="film title"><br/> | <p class="film title"><br/> | </li><p class="film title"><br/> |
| \(o_1\) | 1 | 1 | 1 | 0 | 0 |
| \(o_2\) | 1 | 0 | 0 | 1 | 1 |
| \(o_3\) | 1 | 1 | 1 | 0 | 0 |
| \(o_4\) | 1 | 1 | 1 | 0 | 0 |
| \(o_5\) | 1 | 0 | 0 | 1 | 1 |
| \(o_6\) | 1 | 1 | 1 | 0 | 0 |
| \(o_7\) | 1 | 1 | 1 | 0 | 0 |
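The binary entries in the table above can be derived mechanically: a cell is 1 when the instance \(o_i\) occurs in the document immediately preceded by the context \(c_j\). A minimal sketch, with an invented toy document mimicking the film-title markup:

```python
import re

def context_matrix(document, seeds, contexts):
    """matrix[i][j] = 1 iff seed i occurs in `document` immediately
    preceded by context j (the 0/1 entries of the table above)."""
    return [[1 if re.search(re.escape(ctx) + re.escape(seed), document) else 0
             for ctx in contexts]
            for seed in seeds]

# Invented toy document; the seed strings are illustrative only
doc = ('<ul><li class="film title"><br/>heat'
       '</li><p class="film title"><br/>ronin')
contexts = ['<br/>',
            '<li class="film title"><br/>',
            '<ul><li class="film title"><br/>',
            '<p class="film title"><br/>',
            '</li><p class="film title"><br/>']
matrix = context_matrix(doc, ['heat', 'ronin'], contexts)
# 'heat' matches the first three contexts, 'ronin' the <p>-based ones,
# reproducing the two row patterns seen in the table
```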
5 Empirical evaluation of the solution
5.1 The description of the reference data set
5.1.1 Practical preliminaries
- a single movie/series/video game/theatre play. The templates print the information from the following n-tuple <polish title, english title, list of actor names, list of actor roles, music, photos>. In addition, in the test set the author detected templates that present a short and a full version of the above n-tuple, e.g. the list of actor names can present all values (the full version) or the k first values (the short version),
- a set of movies. The templates print the information from the following n-tuple <polish and english film titles>,
- a set of movies. The templates print the information from the following n-tuple <polish film titles, english film titles>. Furthermore, in the test set the author detected two different templates that represent the above-mentioned tuple,
- a single actor. The templates print the information from the following n-tuple <actor's name and surname, polish film titles, english film titles, names of film roles>,
- the user's favourite films. The templates print the information from the following n-tuple <prefix as the film production year and english film titles, polish film titles>. Furthermore, in the test set the author detected two different templates that represent the above-mentioned n-tuple.
- there is an available layout of HTML tags <tag1><tag2>[information to extract]<tag3>[suffix associated with the information to extract]<tag4>. Based on the HTML tags, we may create two patterns {<tag1><tag2>(.+?)<tag3>, <tag1><tag2>(.+?)<tag4>}. Using these patterns, we may extract the following information: [information to extract] and [information to extract]<tag3>[suffix associated with the information to extract]. Using a simple pre-processing step, we may filter out the unnecessary HTML tags from the extracted information. This way, the correct form of the instance, i.e. [information to extract] [suffix associated with the information to extract], is obtained. This often occurs, for instance, when displaying the cast: usually, additional information is added to the film role name, indicating whether an actor lent their own voice, e.g. barry "big bear" thorne and barry "big bear" thorne (voice),
- there is an available layout of HTML tags <tag1>(.+?)<tag2>, which covers, for example, the name of an actor such as matthew perry i. However, for the given page the WI may create another pattern that extracts similar semantic information, e.g. matthew perry,
- there is also an available layout of HTML tags <tag1>(.+?)<tag2>, which covers, for example, film names in the following form: production year | english film title (1979 | Apocalypse Now). However, this layout may also cover only the prefix, i.e. the production year (1979), when there is no English version of the film title.
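The first case above can be made concrete with two regular-expression patterns and a tag-stripping post-filter. In this sketch, `div`, `span`, and `br` are hypothetical stand-ins for the abstract tag1..tag4:

```python
import re

TAG = re.compile(r"<[^>]+>")  # post-filter: strips leftover HTML tags

# Hypothetical layout: <tag1><tag2>[info]<tag3>[suffix]<tag4>
html = '<div><span>barry "big bear" thorne</span> (voice)<br/>'

# Pattern 1 stops at tag3; pattern 2 runs up to tag4 and captures
# the suffix together with the intervening tag
p_narrow = re.compile(r"<div><span>(.+?)</span>")
p_wide = re.compile(r"<div><span>(.+?)<br/>")

narrow = p_narrow.search(html).group(1)
wide = TAG.sub("", p_wide.search(html).group(1))  # strip the stray </span>
```

Here `narrow` holds the bare role name and `wide` the name with its `(voice)` suffix, matching the two extraction outcomes described above.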
5.2 The indicators to evaluate the proposed solutions
- Precision
$$Prec = \frac{|V_{ref} \cap V_{rec}|}{|V_{rec}|}$$ (5)
- Recall
$$Rec = \frac{|V_{ref} \cap V_{rec}|}{|V_{ref}|}$$ (6)
- F-measure
$$F = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec}$$ (7)
- Macro-average precision
$$Prec_{mac\text{-}avg} = \frac{\sum_{k=1}^n Prec_{p_k}}{n} = \frac{1}{n} \sum_{k=1}^n \frac{\left|V_{ref_{p_k}} \cap V_{rec_{p_k}}\right|}{\left|V_{rec_{p_k}}\right|}$$ (8)
- Macro-average recall
$$Rec_{mac\text{-}avg} = \frac{\sum_{k=1}^n Rec_{p_k}}{n} = \frac{1}{n} \sum_{k=1}^n \frac{\left|V_{ref_{p_k}} \cap V_{rec_{p_k}}\right|}{\left|V_{ref_{p_k}}\right|}$$ (9)
- Macro-average F-measure
$$F_{mac\text{-}avg} = \frac{1}{n} \sum_{k=1}^n \frac{2 \cdot Prec_{p_k} \cdot Rec_{p_k}}{Prec_{p_k} + Rec_{p_k}}$$ (10)
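Equations (5)-(10) translate directly into set arithmetic. A minimal sketch over reference and recognized value sets:

```python
def precision(v_ref: set, v_rec: set) -> float:
    """Eq. (5): |V_ref ∩ V_rec| / |V_rec|."""
    return len(v_ref & v_rec) / len(v_rec) if v_rec else 0.0

def recall(v_ref: set, v_rec: set) -> float:
    """Eq. (6): |V_ref ∩ V_rec| / |V_ref|."""
    return len(v_ref & v_rec) / len(v_ref) if v_ref else 0.0

def f_measure(v_ref: set, v_rec: set) -> float:
    """Eq. (7): harmonic mean of precision and recall."""
    p, r = precision(v_ref, v_rec), recall(v_ref, v_rec)
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_avg(metric, pages) -> float:
    """Eqs. (8)-(10): average a per-page metric over pages p_1..p_n,
    where `pages` is a list of (V_ref_pk, V_rec_pk) pairs."""
    return sum(metric(ref, rec) for ref, rec in pages) / len(pages)
```

For instance, with `v_ref = {"a","b","c","d"}` and `v_rec = {"a","b","x"}`, precision is 2/3, recall is 1/2, and the F-measure is 4/7.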
5.3 The plan of the experiment
5.4 The realization of the experiment plan and the results
5.4.1 The batch mode
| Domain name | Experiment name | Tested system | \(Prec\) | \(Rec\) | F | \(Prec_{mac\text{-}avg} \pm s\) | \(Rec_{mac\text{-}avg} \pm s\) | \(F_{mac\text{-}avg} \pm s\) |
|---|---|---|---|---|---|---|---|---|
| filmweb.pl | \(Experiment_1\) | SEAL' | 0.2579 | 0.2320 | 0.2443 | 0.0893 ± 0.2489 | 0.2360 ± 0.3809 | 0.1035 ± 0.2557 |
| | \(Experiment_2\) | SEAL'' | 0.4174 | 0.2397 | 0.3045 | 0.1893 ± 0.3134 | 0.1904 ± 0.3391 | 0.1369 ± 0.2583 |
| | \(Experiment_3\) | BigGrams* | 0.8043 | 0.2258 | 0.3526 | 0.6092 ± 0.4286 | 0.3777 ± 0.3827 | 0.4218 ± 0.3907 |
| | \(Experiment_4\) | BigGrams* | 0.4417 | 0.2533 | 0.3219 | 0.3224 ± 0.2560 | 0.3549 ± 0.3458 | 0.2990 ± 0.2641 |
| | \(Experiment_5\) | BigGrams* | 0.7567 | 0.4029 | 0.5258 | 0.4990 ± 0.4238 | 0.3503 ± 0.3685 | 0.3848 ± 0.3743 |
| | \(Experiment_6\) | BigGrams* | 0.7114 | 0.3658 | 0.4832 | 0.3256 ± 0.2832 | 0.4495 ± 0.3953 | 0.3286 ± 0.3051 |
| | \(Experiment_7\) | BigGrams* | 0.6143 | 0.6234 | 0.6188 | 0.0944 ± 0.1437 | 0.3355 ± 0.3840 | 0.1374 ± 0.1922 |
| | \(Experiment_8\) | BigGrams* | 0.6999 | 0.5935 | 0.6424 | 0.1808 ± 0.1911 | 0.4335 ± 0.3709 | 0.2361 ± 0.2321 |
| | \(Experiment_9\) | BigGrams* | 0.4622 | 0.0131 | 0.0256 | 0.3408 ± 0.4479 | 0.2341 ± 0.4092 | 0.2447 ± 0.4063 |
| | \(Experiment_{10}\) | BigGrams* | 0.7255 | 0.2994 | 0.4239 | 0.1849 ± 0.2641 | 0.2737 ± 0.3899 | 0.2138 ± 0.3017 |
| | \(Experiment_{11}\) | BigGrams** | 0.9990 | 0.2432 | 0.3912 | 0.5748 ± 0.4954 | 0.4555 ± 0.4291 | 0.4959 ± 0.4467 |
| | \(Experiment_{12}\) | BigGrams** | 0.9722 | 0.3589 | 0.5243 | 0.8950 ± 0.2932 | 0.5987 ± 0.4038 | 0.6482 ± 0.3947 |
| | \(Experiment_{13}\) | BigGrams** | 0.9993 | 0.7016 | 0.8244 | 0.7946 ± 0.4045 | 0.6719 ± 0.3878 | 0.7155 ± 0.3862 |
| | \(Experiment_{14}\) | BigGrams** | 0.9948 | 0.9603 | 0.9773 | 0.9738 ± 0.1565 | 0.9362 ± 0.1733 | 0.9523 ± 0.1613 |
| | \(Experiment_{15}\) | BigGrams** | 0.9958 | 0.7864 | 0.8788 | 0.7894 ± 0.4026 | 0.6620 ± 0.3841 | 0.7072 ± 0.3823 |
| | \(Experiment_{16}\) | BigGrams** | 0.9962 | 0.8691 | 0.9283 | 0.9484 ± 0.1998 | 0.8807 ± 0.2357 | 0.9007 ± 0.2111 |
| ptaki.info | \(Experiment_1\) | SEAL' | 1 | 0.0490 | 0.0933 | 0.6700 ± 0.4714 | 0.6525 ± 0.4681 | 0.6583 ± 0.4672 |
| | \(Experiment_2\) | SEAL'' | 0.5 | 0.0559 | 0.1006 | 0.6609 ± 0.4737 | 0.6525 ± 0.4681 | 0.6532 ± 0.4699 |
| | \(Experiment_3\) | BigGrams* | 1 | 0.4965 | 0.6636 | 0.9950 ± 0.0707 | 0.8150 ± 0.2471 | 0.8750 ± 0.1718 |
| | \(Experiment_4\) | BigGrams* | 1 | 0.993 | 0.9965 | 1 ± 0 | 0.9975 ± 0.0354 | 0.9983 ± 0.0236 |
| | \(Experiment_5\) | BigGrams* | 1 | 0.5035 | 0.6698 | 1 ± 0 | 0.8175 ± 0.2413 | 0.8783 ± 0.1609 |
| | \(Experiment_6\) | BigGrams* | 1 | 0.5874 | 0.7401 | 1 ± 0 | 0.8500 ± 0.2297 | 0.9000 ± 0.1531 |
| | \(Experiment_7\) | BigGrams* | 1 | 0.5035 | 0.6698 | 1 ± 0 | 0.8175 ± 0.2413 | 0.8783 ± 0.1609 |
| | \(Experiment_8\) | BigGrams* | 1 | 0.585 | 0.7401 | 1 ± 0 | 0.8500 ± 0.2297 | 0.9000 ± 0.1531 |
| | \(Experiment_9\) | BigGrams* | 1 | 0.5874 | 0.6698 | 1 ± 0 | 0.8175 ± 0.2413 | 0.8783 ± 0.1609 |
| | \(Experiment_{10}\) | BigGrams* | 1 | 0.5874 | 0.7401 | 1 ± 0 | 0.8500 ± 0.2297 | 0.9000 ± 0.1531 |
| | \(Experiment_{11}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| | \(Experiment_{12}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| | \(Experiment_{13}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| | \(Experiment_{14}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| | \(Experiment_{15}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| | \(Experiment_{16}\) | BigGrams** | 1 | 1 | 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| Domain name | Experiment name | Tested system | \(Prec\) | \(Rec\) | F | \(Prec_{mac\text{-}avg} \pm s\) | \(Rec_{mac\text{-}avg} \pm s\) | \(F_{mac\text{-}avg} \pm s\) |
|---|---|---|---|---|---|---|---|---|
| agatameble.pl | \(Experiment_1\) | SEAL' | 0.8533 | 0.0506 | 0.0955 | 0.1542 ± 0.3610 | 0.1049 ± 0.2756 | 0.1171 ± 0.2791 |
| | \(Experiment_2\) | SEAL'' | 0.8533 | 0.0506 | 0.0955 | 0.1642 ± 0.3703 | 0.1105 ± 0.2791 | 0.1243 ± 0.3005 |
| | \(Experiment_3\) | BigGrams* | 0.8113 | 0.7241 | 0.7652 | 0.7608 ± 0.3236 | 0.7073 ± 0.3168 | 0.7266 ± 0.3159 |
| | \(Experiment_4\) | BigGrams* | 0.8113 | 0.7241 | 0.7652 | 0.7605 ± 0.3235 | 0.7073 ± 0.3168 | 0.7264 ± 0.3158 |
| | \(Experiment_5\) | BigGrams* | 0.6534 | 0.7779 | 0.7102 | 0.6125 ± 0.3074 | 0.7145 ± 0.3354 | 0.6497 ± 0.3148 |
| | \(Experiment_6\) | BigGrams* | 0.6534 | 0.7779 | 0.7102 | 0.6125 ± 0.3074 | 0.7145 ± 0.3354 | 0.6497 ± 0.3148 |
| | \(Experiment_7\) | BigGrams* | 0.8113 | 0.7241 | 0.7652 | 0.7608 ± 0.3236 | 0.7073 ± 0.3168 | 0.7266 ± 0.3159 |
| | \(Experiment_8\) | BigGrams* | 0.7773 | 0.7779 | 0.7776 | 0.7341 ± 0.3332 | 0.7145 ± 0.3354 | 0.7192 ± 0.3310 |
| | \(Experiment_9\) | BigGrams* | 0.8113 | 0.7241 | 0.7652 | 0.7608 ± 0.3236 | 0.7073 ± 0.3168 | 0.7266 ± 0.3159 |
| | \(Experiment_{10}\) | BigGrams* | 0.7773 | 0.7779 | 0.7776 | 0.7341 ± 0.3332 | 0.7145 ± 0.3354 | 0.7192 ± 0.3310 |
| | \(Experiment_{11}\) | BigGrams** | 0.7059 | 0.1802 | 0.2872 | 0.4137 ± 0.3039 | 0.1354 ± 0.2408 | 0.1606 ± 0.2319 |
| | \(Experiment_{12}\) | BigGrams** | 0.7037 | 0.1802 | 0.2870 | 0.4139 ± 0.3042 | 0.1363 ± 0.2417 | 0.1617 ± 0.2335 |
| | \(Experiment_{13}\) | BigGrams** | 0.9247 | 1 | 0.9609 | 0.9087 ± 0.1557 | 1 ± 0 | 0.9424 ± 0.1221 |
| | \(Experiment_{14}\) | BigGrams** | 0.9240 | 1 | 0.9605 | 0.9084 ± 0.1555 | 1 ± 0 | 0.9422 ± 0.1220 |
| | \(Experiment_{15}\) | BigGrams** | 0.9247 | 1 | 0.9609 | 0.9087 ± 0.1557 | 1 ± 0 | 0.9424 ± 0.1221 |
| | \(Experiment_{16}\) | BigGrams** | 0.9240 | 1 | 0.9605 | 0.9084 ± 0.1555 | 1 ± 0 | 0.9422 ± 0.1220 |
5.4.2 The boosting mode
The six middle columns give \(|V_{a}|\) for each attribute name a:

| Iteration number | Actor names | Film titles pl | Film titles en | Film titles pl/en | Role names | Another | \(\sum |V_{a}|\) | \(\frac{\sum |V_{a}|}{|V_{ref}|}\) (%) | F | \(F_{mac\text{-}avg}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 4 | 4 | 2 | 2 | 2 | 18 | 0.43 | 0.333 | 0.657 ± 0.393 |
| 2 | 10 | 15 | 15 | 5 | 5 | 5 | 55 | 1.3 | 0.884 | 0.87 ± 0.278 |
| 3 | 26 | 28 | 30 | 10 | 10 | 10 | 114 | 2.7 | 0.953 | 0.943 ± 0.175 |
| 4 | 26 | 60 | 50 | 15 | 15 | 15 | 181 | 4.3 | 0.964 | 0.95 ± 0.161 |
| 5 | 41 | 112 | 90 | 15 | 22 | 19 | 299 | 7.1 | 0.968 | 0.95 ± 0.161 |
| 6 | 71 | 216 | 170 | 29 | 43 | 19 | 548 | 13 | 0.974 | 0.952 ± 0.161 |
| 7 | 131 | 425 | 330 | 29 | 86 | 19 | 1020 | 24.4 | 0.977 | 0.953 ± 0.161 |
| \(\sum |V_{a}|\) | Iteration number | F | \(F_{mac\text{-}avg}\) |
|---|---|---|---|
| 18 | 1 | 0.33 | 0.657 ± 0.393 |
| | 2 | 0.853 | 0.95 ± 0.161 |
| | 3 | 0.853 | 0.95 ± 0.161 |
| 55 | 1 | 0.884 | 0.87 ± 0.278 |
| | 2 | 0.839 | 0.882 ± 0.162 |
| | 3 | 0.822 | 0.878 ± 0.164 |
| | 4 | 0.822 | 0.878 ± 0.164 |
| 548 | 1 | 0.974 | 0.952 ± 0.161 |
| | 2 | 0.896 | 0.914 ± 0.182 |
| | 3 | 0.853 | 0.917 ± 0.142 |
| | 4 | 0.853 | 0.92 ± 0.127 |
| | 5 | 0.853 | 0.92 ± 0.127 |
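The boosting mode evaluated above can be driven by a simple fixed-point loop: the WI's output becomes the next iteration's seed set until an iteration yields no new instances. A hypothetical driver (`wi_extract` and the toy data below stand in for a full WI extraction pass and are not part of the original system):

```python
def boosting_mode(wi_extract, seeds, max_iters=7):
    """Feed WI output back in as the next seed set (the boosting mode),
    stopping when an iteration yields no new instances."""
    known = set(seeds)
    for _ in range(max_iters):
        new = wi_extract(known) - known
        if not new:
            break
        known |= new
    return known

# Toy stand-in for a WI extraction pass (invented for illustration):
# each seed "leads to" the instances it co-occurs with
_links = {"a": {"b"}, "b": {"c"}}

def toy_extract(seeds):
    found = set()
    for s in seeds:
        found |= _links.get(s, set())
    return found

result = boosting_mode(toy_extract, {"a"})
```

As the conclusions note, this loop only converges to a clean instance set when the input seeds are well diversified; noisy instances fed back as seeds compound across iterations.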
6 Conclusion
- the empirical research shows that we can improve the WI output results and achieve high quality by using the described techniques,
- the empirical research shows that the quality of information extraction depends on (1) the form of the input data, (2) the pre-processing of the domain's web pages, (3) the matching techniques, and (4) the level of HTML document representation (the granularity of HTML tags),
- the worst results are obtained when the HTML tags contain attributes and their values; in this case, the algorithm creates very detailed patterns with a low degree of generalization,
- the best results are achieved when the proposed taxonomy approach is used as the input of the WI algorithm, the pre-processing technique clears the values of HTML attributes, the seeds are matched only between HTML tags, and the tag-level rather than the char-level representation of HTML documents is used. With this configuration, the WI creates generic patterns covering most of the expected instances,
- if we can ensure well-diversified input data, the WI may be used in the boosting mode,
- the weak assumption that patterns created from seeds belonging to semantic classes will extract new, semantically consistent instances is useful, but only partly right; adopting this assumption in the first iteration of the proposed algorithm produces good results,
- the BigGrams system is suitable for extracting relevant keywords from Internet domains.