Open Access 2022 | OriginalPaper | Chapter

Leveraging Customer Reviews for E-commerce Query Generation

Authors : Yen-Chieh Lien, Rongting Zhang, F. Maxwell Harper, Vanessa Murdock, Chia-Jung Lee

Published in: Advances in Information Retrieval

Publisher: Springer International Publishing

Abstract

Customer reviews are an effective source of information about what people deem important in products (e.g. “strong zipper” for tents). These crowd-created descriptors not only highlight key product attributes, but can also complement seller-provided product descriptions. Motivated by this, we propose to leverage customer reviews to generate queries pertinent to target products in an e-commerce setting. While there has been work on automatic query generation, it has often relied on proprietary user search data to generate query-document training pairs for learning supervised models. We take a different view and focus on leveraging reviews without training on search logs, making reproduction more viable by the public. Our method adopts an ensemble of the statistical properties of review terms and a zero-shot neural model trained on an adapted external corpus to synthesize queries. Compared to competitive baselines, we show that the queries generated by our method both align better with actual customer queries and can benefit retrieval effectiveness.

Y.-C. Lien—Work done while an intern at Amazon.

1 Introduction

Customer reviews contain diverse descriptions of how people perceive the properties, pros and cons of the products they have experienced. For example, properties such as “for underwater photos” or “for kayaking recording” were mentioned in reviews for action cameras, as were “compact” or “strong zipper” for tents. These descriptors not only paint a rich picture of what people deem important, but can also complement and uncover shopping considerations that may be absent from seller-provided product descriptions. Motivated by this, our work investigates ways to use reviews to generate queries that surface key properties of the target products.

Previous work on automatic query generation often relied on human labels or logs of queries and engaged documents (or items) [16‐20] to form relevance signals for training generative models. Despite the reported effectiveness, the cost of acquiring high-quality human labels is high, whereas access to search logs is often limited to site owners. Because we approach the problem using reviews, our method has the advantage of not requiring any private, proprietary user data, making reproduction more viable by the public in general. Meanwhile, generation based on reviews is favorable because the output may likewise follow human-readable language patterns, potentially facilitating people-facing experiences such as related search recommendation.

We propose a simple yet effective ensemble method for query generation. Our approach starts with building a candidate set of “query-worthy” terms from reviews. We first leverage syntactic and statistical signals to build a set of review terms that are most distinguishing for a given product. A second set of candidate terms is obtained through a zero-shot sequence-to-sequence model trained on adapted external relevance signals. Our ensemble method then applies a statistics-based scoring function to rank the combined set of all candidates, from which a query can be formulated for a desired query length.

Our evaluation examines two crucial aspects of query quality. To quantify how readable the queries are, we take the human-submitted queries from logs as ground truth to evaluate how close the generated queries are to them for each product. Moreover, we investigate whether the generated queries can benefit retrieval tasks, similar to prior studies [6, 7, 15]. We collect pairs of product descriptions and generated queries, both of which can be derived from public sources, to train a deep neural retrieval model. During inference, we take human-submitted queries on the corresponding product to benchmark the retrieval effectiveness. Compared with the competitive alternatives YAKE [1, 2] and Doc2Query [6], our approach shows significantly higher similarity with human-submitted queries and benefits retrieval performance across multiple product types.

2 Related Work

Related search recommendation (or query suggestion) helps people automatically discover related queries pertinent to their search journeys. With the advances in deep encoder-decoder models [9, 12], query generation [6, 16, 17, 19, 20] sits at the core of many recent recommendation algorithms. Sordoni et al. [17] proposed hierarchical RNNs [24] to generate next queries based on observed queries in a session. Doc2Query [6] adapted T5 [12] to generate queries according to input documents. Ahmad et al. [20] jointly optimized two companion ranking tasks, document ranking and query suggestion, by RNNs. Our approach differs in that we do not require in-domain logs of query-document relations for supervision.

Studies have also shown that generated queries can be used to enhance retrieval effectiveness [6, 7, 15]. Doc2Query [6] leveraged the generated queries to enrich and expand document representations. Liang et al. [7] proposed to synthesize query-document relations based on MSMARCO [8] and Wikipedia for training large-scale neural retrieval models. Ma et al. [15] explored a similar zero-shot learning method for a different task of synthetic question generation, while Puri et al. [21] improved QA performance by incorporating synthetic questions. Our work resembles the zero-shot setup but differs in how we adapt an external corpus specifically for e-commerce query generation.

Customer reviews have been adopted as a useful resource for summarization [22] and product question answering (PQA). Approaches to PQA [10, 11, 13, 14] often take reviews as input, conditioned on which answers are generated for user questions. Deng et al. [11] jointly learned answer generation and opinion mining tasks, and required both a reference answer and its opinion type during the training phase. While our work also depends on reviews as input, we focus on synthesizing the most relevant queries without requiring ground-truth labels.

3 Method

Our approach involves a candidate generation phase that identifies key terms from reviews, and a selection phase that employs an unsupervised scoring function to rank and aggregate the term candidates into queries.

3.1 Statistics-Based Approach

We started with a pilot study to characterize whether and how reviews could be useful for query generation. We found that a subset of terms in reviews resembles the terms of search queries, which are primarily composed of combinations of nouns, adjectives and participles that carry the critical semantics. For example, given a headphone, the actual queries that led to purchases may contain nouns such as “earbuds” or “headset” to denote product types, adjectives such as “wireless” or “comfortable” to reflect desired properties, and participles such as “running” or “sleeping” to emphasize use cases.

Inspired by this, we first leverage part-of-speech analysis to narrow reviews down to these three types of POS tags. From this set, we then rely on conventional tf-idf-style corpus statistics to mine distinguishing terms that are salient in a product type but not generic across the entire catalog. Specifically, an importance score \(I^D_t = \frac{p(t, R_D)}{p(t, R_G)}\) is used to estimate the salience of a term t in a product type D by contrasting its density in the review set \(R_D\) with its density in generic reviews \(R_G\), where \(p(t, R) = \frac{freq(t, R)}{\sum _{r \in R} |r|}\). Beyond unigrams, we also check whether the relative frequency of a bigram phrase containing the unigram, \(\frac{freq([t, t'], R_D)}{freq(t, R_D)}\), is above some threshold; in this case, the bigram replaces the unigram as the candidate. We apply \(I^D_t\) to each review sentence, and collect the top-scored terms or phrases as candidates.
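
The importance score and the bigram ratio above can be illustrated with a minimal Python sketch. It assumes reviews are already tokenized and POS-filtered; the function names, the toy data, and the choice to skip terms that never occur in \(R_G\) are illustrative assumptions, not details given in the paper.

```python
from collections import Counter

def term_density(reviews):
    """p(t, R) = freq(t, R) / sum of review lengths, for every term t in R."""
    counts = Counter(tok for review in reviews for tok in review)
    total = sum(len(review) for review in reviews)
    return {t: c / total for t, c in counts.items()}

def importance_scores(type_reviews, generic_reviews):
    """I_t^D = p(t, R_D) / p(t, R_G) for terms seen in both review sets."""
    p_d = term_density(type_reviews)
    p_g = term_density(generic_reviews)
    # Assumption: terms absent from R_G are skipped, since the source does
    # not say how a zero denominator is handled.
    return {t: p_d[t] / p_g[t] for t in p_d if t in p_g}

def bigram_ratio(reviews, t, t_next):
    """freq([t, t'], R_D) / freq(t, R_D): how often t is followed by t'."""
    unigram = sum(review.count(t) for review in reviews)
    bigram = sum(
        1
        for review in reviews
        for a, b in zip(review, review[1:])
        if (a, b) == (t, t_next)
    )
    return bigram / unigram if unigram else 0.0

# Toy, hand-made data (hypothetical): tokenized reviews for one product type
# and for the generic catalog.
tent_reviews = [["strong", "zipper", "great", "tent"], ["strong", "zipper", "good"]]
generic_reviews = [["great", "good", "product", "strong"], ["good", "great", "good", "price"]]

scores = importance_scores(tent_reviews, generic_reviews)
ratio = bigram_ratio(tent_reviews, "strong", "zipper")
```

On this toy data "strong" scores higher than "great" or "good" because it is denser in the tent reviews than in the generic ones, and the bigram ratio for "strong zipper" is 1.0, so under any threshold below 1 the bigram would replace the unigram.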

A straightforward way to form queries is to directly use the candidates as-is. We additionally consider an alternative which trains a seq2seq model using the candidates as weak supervision (i.e. encode review sentences to fit the candidates). By doing so, we anticipate the terms decoded during inference can generalize more broadly compared to a direct application. The two methods are referred to as Stats-base and Stats-s2s respectively.

3.2 Zero-Shot Generation Based on Adapted External Corpus

Recent findings [7, 15] suggest that zero-shot domain adaptation can deliver high effectiveness given the knowledge embedded in large-scale language models via pre-training tasks. Building on this, we propose to fine-tune T5 [12] on MSMARCO query-passage pairs to capture the notion of generic relevance, and to apply the trained model to e-commerce reviews to identify terms that are more likely to be adopted in queries.

This idea was explored by Nogueira et al. [6], whose Doc2Query approach focused on generating queries as document expansion for improving retrieval performance. Different from [6], our objective is to generate queries that are not only beneficial to retrieval but also similar to actual queries in terms of syntactic form. A direct application of Doc2Query on MSMARCO creates a gap in our case, since MSMARCO “queries” predominantly follow a natural-language question style, resulting in generated queries of similar form¹. To close this gap, we propose to apply POS-tag analysis to MSMARCO queries and retain only terms that match the selected POS tags (i.e. nouns, adjectives and participles). For example, an original query “what does physical medicine do” is first transformed into “physical medicine” as pre-processing. After this adaptation, we train the T5 seq2seq model and apply it in a zero-shot fashion to generate salient terms from input review sentences.
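
The query adaptation step can be sketched as a filter over POS-tagged tokens. The sketch below assumes Penn Treebank tags and uses hand-assigned tags for the example query; the paper names the three POS classes but not the tag set or the tagger, so both are assumptions here, and in practice the tags would come from a POS tagger.

```python
# Penn Treebank tag prefixes for nouns and adjectives, plus the two
# participle tags. Assumption: the source does not specify a tag set.
KEPT_PREFIXES = ("NN", "JJ")
KEPT_TAGS = {"VBG", "VBN"}

def keep_query_terms(tagged_tokens):
    """Keep only nouns, adjectives and participles from a tagged query."""
    kept = [
        token
        for token, tag in tagged_tokens
        if tag.startswith(KEPT_PREFIXES) or tag in KEPT_TAGS
    ]
    return " ".join(kept)

# Hand-tagged version of the example query from the text (tags are
# illustrative, not tagger output).
tagged_query = [
    ("what", "WP"),
    ("does", "VBZ"),
    ("physical", "JJ"),
    ("medicine", "NN"),
    ("do", "VB"),
]
adapted_query = keep_query_terms(tagged_query)
```

Here `adapted_query` is "physical medicine", matching the example in the text; the adapted queries would then replace the original MSMARCO queries as seq2seq training targets.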

3.3 Ensemble Approach to Query Generation

For a product p in the product type D, we apply both the statistical and zero-shot approaches to its reviews to construct candidates for generating queries, which we denote as \(C_p\). To select representative terms from the set, we devise a scoring function \(S_t = freq(t, C_p) \cdot \log \left( \frac{|\{p' \in D\}|}{|\{p' \mid p' \in D, t \in C_{p'}\}|}\right) \) to rank all candidates, where higher ranked terms are more distinguishing for a specific product, following the tf-idf intuition. Given a desired query length n, we formulate the pseudo queries for a product by selecting all \(\binom{k}{n}\) possible combinations of n terms from the top-k scored terms in \(C_p\)². A final post-processing step removes any words that are redundant after stemming from the queries and adds the product type if not already included.
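
A minimal Python sketch of the scoring function \(S_t\) and the combination step follows. The candidate lists, the alphabetical tie-breaking, and the function names are illustrative assumptions; the stemming-based deduplication and the appending of the product type are omitted.

```python
import math
from collections import Counter
from itertools import combinations

def score_terms(candidates_by_product, product):
    """S_t = freq(t, C_p) * log(|D| / |{p' in D : t in C_p'}|)."""
    n_products = len(candidates_by_product)
    # Number of products whose candidate set contains each term.
    product_freq = Counter()
    for candidates in candidates_by_product.values():
        product_freq.update(set(candidates))
    term_freq = Counter(candidates_by_product[product])
    return {
        t: f * math.log(n_products / product_freq[t])
        for t, f in term_freq.items()
    }

def pseudo_queries(scores, k, n):
    """All n-term combinations of the k highest-scoring terms."""
    # Assumption: ties are broken alphabetically; the source does not say.
    top_k = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return [" ".join(combo) for combo in combinations(top_k, n)]

# Hypothetical candidate lists C_p for three products of one product type.
candidates = {
    "p1": ["noise cancelling", "noise cancelling", "wireless", "comfortable"],
    "p2": ["wireless", "comfortable", "running"],
    "p3": ["comfortable", "sleeping"],
}
scores = score_terms(candidates, "p1")
queries = pseudo_queries(scores, k=2, n=1)
```

In this toy setup "comfortable" appears in every product's candidates and therefore scores zero, while "noise cancelling" is both frequent for p1 and unique to it, so it ranks first.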

4 Experiments

Our evaluation set is composed of products from three different product types, together with the actual queries³ that were submitted by people who purchased those products on Amazon.com. As shown in Table 1, we consider headphones, tents and conditioners to evaluate our method across diverse product types, for which people tend to behave and shop differently, with the differences reflected in search queries. The query vocabulary size for conditioners, for instance, is about three times that of tents, with headphones sitting between the two.

As our approach does not use the actual queries for supervision, we primarily consider competitive baselines that do not involve query logs. In particular, we compare to the unsupervised approach YAKE [1, 2], which reportedly outperforms a variety of seminal keyword extraction approaches, including RAKE [4], TextRank [3] and SingleRank [5]. In addition, we use the zero-shot Doc2Query model trained on the adapted corpus as a baseline to reflect the absence of e-commerce logs. For generation, we initialize separate Hugging Face T5-base [12] models with a conditional generation head and fine-tune them for Stats-s2s and Doc2Query, respectively. Training is conducted on review sentences segmented with NLTK. For retrieval, we fine-tune a Sentence-Transformer [23] ms-marco-TinyBERT⁴ model pre-trained on MSMARCO data, which was shown to be effective for semantic matching. Our experiments use a standard AdamW optimizer with learning rate 0.001 and \(\beta _1, \beta _2 = (0.9,0.999)\), and train with a batch size of 16 for 2 epochs (generation) and 4 epochs (retrieval).

Table 1.

Statistics of the three product types used in the experiments. For each product type, the dev and test splits each contain 500 disjoint products.

	Headphone		Tent		Conditioner
	Dev	Test	Dev	Test	Dev	Test
# of reviews	23,165	23,623	19,208	18,734	17,055	17,689
# of sentences	102,281	103,771	97,553	97,320	68,691	70,829

4.1 Intrinsic Similarity Evaluation

Constructing readable and human-like queries is desirable since such queries are practically useful for applications such as related search recommendation. A natural way to measure readability is to evaluate the similarity between the generated and customer-submitted queries, since the latter are written by humans. In practice, we treat customer-submitted queries that led to at least 5 purchases of the corresponding products as ground-truth queries, against which the generated queries are compared. We use conventional metrics for generative tasks, namely corpus BLEU and METEOR. The results in Table 2 show that our ensemble approach consistently achieves the highest similarity with human queries across product types, suggesting that the statistical and zero-shot methods could be mutually beneficial.

Table 2.

The similarity in BLEU and METEOR between generated queries and real queries. \(\star \) stands for p-value \(< 0.05\) in a t-test compared to the second best performing method in each column. The bottom row shows example queries generated by the ensemble.

	Headphone		Tent		Conditioner
	BLEU	METEOR	BLEU	METEOR	BLEU	METEOR
YAKE	0.1014	0.1371	0.2794	0.2002	0.3143	0.1998
Doc2Query	0.1589	0.1667	0.3684	0.2145	0.4404	0.264
Stats-base	0.1743	0.2001	0.3294	0.2201	0.4048	0.2723
Stats-s2s	0.1838	0.2004	0.321	0.2189	0.3931	0.2641
Ensemble	0.2106\(^{\star }\)	0.2024	0.394\(^{\star }\)	0.2334\(^{\star }\)	0.5047\(^{\star }\)	0.2956\(^{\star }\)
Examples	Noise cancelling headphone; Truck driver headphone; Hearing aids headphone		Lightweight tent; Alps backpacking tent; Air mattresses queen tent		Detangling conditioner; Shea moisture conditioner; Dry hair conditioner

4.2 Extrinsic Retrieval Evaluation

We further study how the generated queries can benefit e-commerce retrieval. Our evaluation methodology uses pairs of generated queries and product descriptions to train a retrieval model and validates its quality on actual queries. During training, we fine-tune a Sentence-Transformer on the top-3 generated queries of each product. For each query, we prepare its corresponding relevant product description, together with 49 negative product descriptions randomly sampled from the same product type. During inference, instead of generated queries, we use customer-submitted queries to fetch descriptions from the product corpus; an ideal retrieval model should rank the corresponding product description at the top. We also include BM25 as a common baseline. Table 3 shows that Doc2Query and the ensemble method are the most effective and are on par in aggregate, with some variance across product types. Stats-s2s slightly outperforms Stats-base overall, which may hint at a potential for better generalization.

Table 3.

The retrieval effectiveness for queries generated by baselines and our method.

	Headphone			Tent			Conditioner
	MRR	P@1	P@10	MRR	P@1	P@10	MRR	P@1	P@10
BM25	0.28	0.19	0.06	0.43	0.29	0.11	0.56	0.47	0.14
YAKE	0.23	0.11	0.07	0.46	0.34	0.11	0.54	0.43	0.14
Doc2Query	0.28	0.18	0.08	0.49	0.40	0.12	0.58	0.49	0.15
Stats-base	0.28	0.16	0.07	0.44	0.29	0.12	0.54	0.42	0.15
Stats-s2s	0.27	0.17	0.07	0.44	0.32	0.12	0.56	0.46	0.16
Ensemble	0.29	0.20	0.07	0.46	0.33	0.13	0.59	0.48	0.15
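
The MRR and P@k metrics reported in Table 3 can be computed as in the following sketch. It assumes that each query comes with a ranked list of product ids and a set of relevant ids; allowing more than one relevant product per query, as well as the function names and toy rankings, are assumptions made for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for pid in ranked_ids[:k] if pid in relevant_ids) / k

def evaluate(runs, k_values=(1, 10)):
    """Average MRR and P@k over (ranking, relevant set) pairs."""
    n = len(runs)
    report = {"MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n}
    for k in k_values:
        report[f"P@{k}"] = sum(precision_at_k(r, rel, k) for r, rel in runs) / n
    return report

# Two hypothetical queries with their rankings and relevant product ids.
runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x", "z"}),
]
report = evaluate(runs, k_values=(1, 2))
```

For these two toy queries the reciprocal ranks are 1/2 and 1, so MRR is 0.75, and both P@1 and P@2 average to 0.5.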

5 Conclusion

This paper connected salient review descriptors with zero-shot generative models for e-commerce query generation, without requiring human labels or search logs. The empirical results showed that the ensemble queries both better resemble customer-submitted queries and benefit the training of effective rankers. Beyond MSMARCO, we plan to incorporate other publicly available resources, such as community question-answering threads, to generalize the notion of relevance. It is also worth considering ways to combine weak labels with a few strong labels and examining the impact of different hyper-parameters in depth. A user study that characterizes the extent to which the generated queries reflect people’s purchase intent would further help qualitative understanding.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

¹ Original Doc2Query is unsuitable since question-style queries are rare in e-commerce.

² Our experiment sets \(k=3\) and \(n=1,2,3\) per its popularity in generic search queries.

³ Note that we use actual data only for the purpose of evaluation not training.

⁴ https://www.sbert.net/docs/pretrained-models/ce-msmarco.html.

1. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A.M., Nunes, C., Jatowt, A.: A text feature based automatic keyword extraction method for single documents. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 684–691. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_63

2. Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., Jatowt, A.: YAKE! Keyword extraction from single documents using multiple local features. Inf. Sci. 509, 257–289 (2020)

3. Mihalcea, R., Tarau, P.: TextRank: bringing order into texts. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004)

4. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents. In: Text Mining: Theory and Applications, vol. 1, pp. 1–20 (2010)

5. Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 855–860 (2008)

6. Nogueira, R., Yang, W., Lin, J.J., Cho, K.: Document expansion by query prediction. ArXiv (2019)

7. Liang, D., et al.: Embedding-based Zero-shot retrieval through query generation. ArXiv (2020)

8. Bajaj, P., et al.: MS MARCO: a human generated MAchine Reading COmprehension dataset. ArXiv (2016)

9. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880 (2020)

10. Liu, Y., Lee, K.-Y.: E-commerce query-based generation based on user review. ArXiv (2020)

11. Deng, Y., Zhang, W., Lam, W.: Opinion-aware answer generation for review-driven question answering in e-commerce. In: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, pp. 255–264 (2020)

12. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. (JMLR) 21, 1–67 (2020)

13. Chen, S., Li, C., Ji, F., Zhou, W., Chen, H.: Review-driven answer generation for product-related questions in e-commerce. In: Proceedings of the 12th ACM International Web Search and Data Mining Conference, pp. 411–419 (2019)

14. Gao, S., Ren, Z., Zhao, Y., Zhao, D., Yin, D., Yan, R.: Product-aware answer generation in e-commerce question-answering. In: Proceedings of the 12th ACM International Web Search and Data Mining Conference, pp. 429–437 (2019)

15. Ma, J., Korotkov, I., Yang, Y., Hall, K., McDonald, R.: Zero-shot neural passage retrieval via domain-targeted synthetic question generation. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 1075–1088 (2021)

16. Chen, R.-C., Lee, C.-J.: Incorporating behavioral hypotheses for query generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 3105–3110 (2020)

17. Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Simonsen, J.G., Nie, J.Y.: A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 553–562 (2015)

18. Jiang, J.-Y., Wang, W.: RIN: reformulation inference network for context-aware query suggestion. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 197–206 (2018)

19. Kim, K., Lee, K., Hwang, S.-W., Song, Y.-I., Lee, S.: Query generation for multimodal documents. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 659–668 (2021)

20. Ahmad, W.U., Chang, K.-W., Wang, H.: Context attentive document ranking and query suggestion. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 385–394 (2019)

21. Puri, R., Spring, R., Patwary, M., Shoeybi, M., Catanzaro, B.: Training question answering models from synthetic data. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 5811–5826 (2020)

22. Zhang, X., et al.: DSGPT: domain-specific generative pre-training of transformers for text generation in e-commerce title and review summarization. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2146–2150 (2021)

23. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 3982–3992 (2019)

24. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical report ICS 8504. Institute for Cognitive Science, University of California, San Diego, California, September 1985

Title: Leveraging Customer Reviews for E-commerce Query Generation
Authors: Yen-Chieh Lien
Rongting Zhang
F. Maxwell Harper
Vanessa Murdock
Chia-Jung Lee
Publisher: Springer International Publishing
Book: Advances in Information Retrieval
Print ISBN: 978-3-030-99738-0

Electronic ISBN: 978-3-030-99739-7

Copyright Year: 2022
DOI: https://doi.org/10.1007/978-3-030-99739-7_22