1 Introduction
How can an item content data set be systematically extended with respect to the data quality dimension of completeness, with the aim of improving recommendation quality?
2 Foundation
2.1 General Background
2.2 Theoretical Background
2.3 Related Work and Research Gap
3 A Procedure for Extending an Item Content Data Set
3.1 Duplicate Detection in the Context of Recommender Systems
Similarity measure functions | Properties | Examples in the context of recommender systems |
---|---|---|
Levenshtein: The Levenshtein SMF is based on the minimum number of edit operations of single characters necessary to transform a string \({s}_{1}\) into a string \({s}_{2}\). | • Appropriate for misspellings/typographical errors • Inappropriate for truncated/shortened strings and divergent pre-/suffixes • Complexity: \(O\left(\left|{s}_{1}\right|\ast\left|{s}_{2}\right|\right)\) | Typographical error in the attribute “Restaurant Name”: “Fulffy’s” vs. “Fluffys”. |
Jaro: The Jaro SMF is based on the number of agreeing characters c contained in the strings \({s}_{1}\) and \({s}_{2}\) within half the length of the longer string, and the number of transpositions t in the set of common substrings. | • Appropriate for misspellings/typographical errors • Inappropriate for long divergent pre-/suffixes • Complexity: \(O\left(\left|{s}_{1}\right|+\left|{s}_{2}\right|\right)\) | Misspelling in the attribute “Restaurant Name”: “Fluffy’s Café” vs. “Flufy’s Café”. |
Jaro-Winkler: The Jaro-Winkler SMF extends the Jaro SMF, putting more emphasis on the beginning of the strings. | • Appropriate for misspellings/typographical errors and divergent suffixes • Inappropriate for long divergent prefixes • Complexity: \(O\left(\left|{s}_{1}\right|+\left|{s}_{2}\right|\right)\) | Divergent suffixes of the attribute “Restaurant Name”: “Fluffy’s New York” vs. “Fluffy’s Café & Pizzeria”. |
Haversine: This SMF is based on the haversine formula, which measures the great-circle distance between two locations on Earth. | • Appropriate for geographical coordinates given in latitude/longitude | “40.711, -73.966” vs. “40.710, -73.965”. |
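To make the two non-string-metric properties in the table concrete, the following is a minimal sketch (our own toy implementation, not the evaluated system) of the Levenshtein SMF via dynamic programming and of the haversine formula; the function names and the distance unit (kilometres) are our assumptions.

```python
import math

def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character edits (insertion, deletion,
    substitution) needed to turn s1 into s2; O(|s1| * |s2|) DP."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Typo example from the table: a small edit distance flags a likely duplicate.
print(levenshtein("Fulffy's", "Fluffys"))                  # -> 3
# Coordinate example from the table: the two points are ~140 m apart.
print(haversine_km(40.711, -73.966, 40.710, -73.965))
```

A duplicate-detection rule would then normalize such raw distances into similarity scores (e.g., by string length or a distance threshold) before comparing them against a classification threshold.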
3.2 Data Integration in the Context of Recommender Systems
Imputation methods | Properties | Examples in the context of recommender systems |
---|---|---|
Arithmetic Mean Imputation (AMI): Missing attribute values are replaced with the mean attribute value over all items for which this attribute is not missing. | • AMI is convenient to implement • AMI attenuates the standard deviation and variance | Each missing value of the attribute “Runtime” is replaced with the mean value of “Runtime” over all movies that do have a value for “Runtime”. |
Regression Imputation (RI): Missing values are replaced with predicted scores from regression equations. The regression equations are estimated by analyzing the extended data set. | • RI is more complex to implement • RI attenuates the standard deviation and variance (but less than AMI) | For two hotel attributes “Price” (\(P_{i}\)) and “Service” (\(S_{i}\)), there are only missing values for “Service”. A regression equation \(\widehat{S}_{i} = \widehat{\beta}_{0} + \widehat{\beta}_{1} P_{i}\) for the attribute “Service”, depending on the attribute “Price”, is estimated by analyzing the hotels with given values for “Service”. The missing values \(S_{i}\) of “Service” are replaced by \(\widehat{S}_{i}\). |
Hot Deck Imputation (HDI): Missing attribute values of an item are replaced with the corresponding values of the most similar item. | • HDI is convenient to implement • HDI attenuates the standard deviation and variance (but less than AMI) | The movie “The Dark Knight” is the most similar movie to “The Dark Knight Rises”, as both movies belong to the Batman trilogy of the director Christopher Nolan. The value of “The Dark Knight” for the attribute “Genres” is “Action”; thus, the missing “Genres” value of “The Dark Knight Rises” is imputed with the value “Action”. |
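The two simpler imputation methods from the table can be sketched as follows; this is a hedged toy example (the item records, attribute names, and similarity function are ours, not the evaluated data), intended only to show the mechanics of AMI and HDI.

```python
from statistics import mean

# Toy item records; None marks a missing attribute value (hypothetical data).
movies = [
    {"title": "A", "runtime": 120, "genre": "Action"},
    {"title": "B", "runtime": 90,  "genre": "Drama"},
    {"title": "C", "runtime": None, "genre": "Action"},
]

def arithmetic_mean_imputation(items, attr):
    """AMI: replace missing numeric values with the mean over observed ones."""
    fill = mean(it[attr] for it in items if it[attr] is not None)
    for it in items:
        if it[attr] is None:
            it[attr] = fill

def hot_deck_imputation(items, attr, similarity):
    """HDI: replace a missing value with the value of the most similar item
    that does have one; `similarity` is any pairwise score function."""
    for it in items:
        if it[attr] is None:
            donors = [d for d in items if d is not it and d[attr] is not None]
            it[attr] = max(donors, key=lambda d: similarity(it, d))[attr]

arithmetic_mean_imputation(movies, "runtime")
print(movies[2]["runtime"])  # mean of 120 and 90 -> 105
```

Note the table's caveat in action: after AMI every imputed item carries the same value, which compresses the attribute's variance; HDI preserves more of the original value distribution because donors differ per item.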
3.3 Subsequent Step: Recommendation Determination
4 Evaluating the Procedure in Real-world Scenarios
4.1 Selection and Description of the Real-world Scenarios
 | Portal R1 (DSR1) | Portal R2 (DSR2) |
---|---|---|
# of items (restaurants) | 8,909 | 18,507 |
# of users | 386,958 | 583,815 |
# of ratings | 855,357 | 2,396,643 |
# of key attributes | 6 | 6 |
# of further attributes (category attributes and business information attributes) | 143 | 251 |
# of possible attribute values | 1,247,260 | 4,589,736 |
# of missing values | 3,253 (0.26%) | 190,789 (4.16%) |
 | Portal M1 (DSM1) | Portal M2 (DSM2) |
---|---|---|
# of items (movies) | 28,973 | 12,842 |
# of users | 428,519 | 230,151 |
# of ratings | 528,777 | 409,935 |
# of key attributes | 1 | 1 |
# of further attributes | 13 | 103 |
# of possible attribute values | 376,649 | 1,322,726 |
# of missing values | 247,341 (65.67%) | 1,082,387 (81.83%) |
4.2 Evaluation of the Procedure
4.2.1 Evaluation of Step 1 – Duplicate Detection
Key attributes | Data type | Example key attribute values from both portals for a duplicate |
---|---|---|
Name | String | “9 Ten Restaurant” (in DSR1), “9 10 Restaurant” (in DSR2) |
Postal Code | Number | “10019-2132” (in DSR1), “10019” (in DSR2) |
Geolocation | Geographic coordinates (latitude and longitude) | “N 40.76591° / W -73.97979°” (in DSR1), “N 40.7659964050293° / W -73.9797178100586°” (in DSR2) |
Address | String | “910 Seventh Avenue” (in DSR1), “910 7th Av” (in DSR2) |
Phone | Number | “+1 917-639-3366” (in DSR1), “(917) 639 3666” (in DSR2) |
District | String | “Midtown” (in DSR1), “Midtown West” (in DSR2) |
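The table shows that the same real-world restaurant is recorded with differently formatted key attribute values in the two portals, so values must be normalized before comparison. The following is a hypothetical sketch of such a rule-based pair classification (the normalization rules, the agreement threshold, and the sample records are our assumptions, not the procedure's actual rules); a production classifier would additionally apply SMFs such as Jaro-Winkler to the name and the haversine formula to the geolocation.

```python
import re

def norm_phone(p: str) -> str:
    """Keep digits only and drop a leading country code (last 10 digits)."""
    return re.sub(r"\D", "", p)[-10:]

def norm_postal(p: str) -> str:
    """Reduce ZIP+4 forms like '10019-2132' to the 5-digit ZIP."""
    return re.sub(r"\D", "", p)[:5]

def classify_pair(r1: dict, r2: dict, threshold: int = 2) -> bool:
    """Hypothetical rule: flag a pair as a duplicate candidate if at least
    `threshold` normalized key attributes agree exactly."""
    matches = 0
    matches += norm_phone(r1["phone"]) == norm_phone(r2["phone"])
    matches += norm_postal(r1["postal"]) == norm_postal(r2["postal"])
    matches += r1["name"].lower() == r2["name"].lower()
    return matches >= threshold

# Toy records loosely modeled on the table (names harmonized for exact match).
a = {"name": "9 Ten Restaurant", "postal": "10019-2132", "phone": "+1 917-639-3366"}
b = {"name": "9 TEN Restaurant", "postal": "10019", "phone": "(917) 639 3666"}
print(classify_pair(a, b))  # True: name and postal code agree despite the phone typo
```

The phone numbers above deliberately differ by transposed digits, so exact matching on one attribute fails; classifying over several key attributes makes the decision robust to such single-attribute errors.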
4.2.2 Evaluation of Step 2 – Data Integration
4.2.3 Evaluation of Subsequent Step – Recommendation Determination
Topics | Challenges in the context of recommender systems | References to procedure step / task |
---|---|---|
Data / Content | • Decentralized data capture by many different users results in data quality problems requiring standardization • Heterogeneous data policies among portals lead to different characteristics of the data across data sets, also requiring standardization • Item content data is a central decisive factor for e-commerce business models and respective recommender systems | 1.1 Data Standardization and Preparation |
Key Attributes and Item Pair Classification | • Labeled training data is missing in the context of recommender systems for a supervised item pair classification • No natural unique IDs are available for items (e.g., restaurants) • Values of key attributes are entered in a decentralized way and depend on the users’ own interpretation, leading to highly diverse data values • Items with the same name referring to the same organization (e.g., “McDonald’s”) and items with similar names referring to different organizations (e.g., “Sushi You” vs. “Sushi Ko”) in the restaurant domain are potentially in close proximity in urban areas; however, they have to be distinguished as separate items | 1.2 Item Pair Classification |
Matching Attributes | • Heterogeneous data policies among portals lead to different names of the same attribute (e.g., “Bar” vs. “Pub”) • Portals potentially use different levels of granularity when describing the attributes (e.g., “Asian Cuisine” vs. “Japanese Cuisine”) | 2.1 Identification of Attributes |
Additional Attributes | • Attributes and their values (e.g., eight times more attributes after data set extension in the movie domain) directly affect the quality of the recommender system and the resulting recommendations | 2.2 Extension of Item Data |
Missing Values | • Many recommender system techniques cannot handle missing values (e.g., 75% missing attribute values had to be imputed in the movie domain) | 2.3 Handling Missing Values |
4.3 Effects on Recommendation Quality
Effects | Evaluation configurations | Relative RMSE improvement (Restaurants) | Relative RMSE improvement (Movies)
---|---|---|---
1 | Standard procedure configuration (as outlined in Section 4.2) | 13.2% | 24.6%
2 | Procedure with simplified rule-based duplicate detection | 9.8% | 23.9%
3 | Procedure without imputation; additional attributes with a low number of available attribute values (Set 1) | 0.1% | 1.7%
3 | Procedure without imputation; additional attributes with a high number of available attribute values (Set 2) | 12.6% | 17.4%
3 | Procedure without imputation; all additional attributes (Set 3) | 12.7% | 17.4%
3 | Procedure without imputation; all attributes of DS2 (Set 4) | 12.6% | 17.4%
4 | Standard procedure configuration (as outlined in Section 4.2) (Setting 1) | 13.2% | 24.6%
4 | Procedure without imputation (Setting 2) | 12.7% | 17.4%
4 | Procedure without imputation and further removed attribute values (Setting 3) | 6.5% | 13.7%
5 | Procedure for users with high rating numbers (Group 1) | 17.1% | 45.4%
5 | Procedure for users with moderate rating numbers (Group 2) | 16.3% | 42.7%
5 | Procedure for users with low rating numbers (Group 3) | 9.9% | 6.0%
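For reference, the evaluation metric behind these figures can be sketched as follows; this is our own minimal illustration of RMSE and relative improvement (the rating values are invented, not taken from the evaluated data sets).

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over parallel lists of rating predictions."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def relative_improvement(rmse_base, rmse_extended):
    """Relative RMSE reduction achieved by applying the procedure."""
    return (rmse_base - rmse_extended) / rmse_base

# Hypothetical predictions before and after the data set extension.
actual = [4, 3, 5, 2]
base_preds = [3.0, 3.5, 4.0, 3.0]       # recommender trained on basis data set
extended_preds = [3.8, 3.1, 4.7, 2.3]   # recommender trained on extended data set
improvement = relative_improvement(rmse(base_preds, actual),
                                   rmse(extended_preds, actual))
print(f"{improvement:.1%}")
```

Since RMSE measures error, a positive relative improvement means the extended data set lowered the prediction error; the percentages in the table are computed in exactly this direction (baseline minus extended, relative to the baseline).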
- Effect 1. Extending the basis data set (DSR1 and DSM1, respectively) by applying the proposed procedure improved recommendation quality considerably.
- Effect 2. The sophisticated duplicate detection proposed by our procedure yielded a high improvement in recommendation quality.
- Effect 3. The extension of the basis data set (DSR1 and DSM1, respectively) with further attributes (of DSR2 and DSM2, respectively) generally supported the increase in recommendation quality, with the extent of improvement depending on the attribute set used for the extension.
- Effect 4. More attribute values (i.e., fewer missing values) resulted in increased recommendation quality.
- Effect 5. Users with a high number of submitted ratings benefited more from the data set extension than users with a low number of submitted ratings.