Published in: Empirical Software Engineering 2/2018

4 November 2017

Domain-specific cross-language relevant question retrieval

Authors: Bowen Xu, Zhenchang Xing, Xin Xia, David Lo, Shanping Li

Abstract

Chinese developers often cannot effectively search questions in English, because they may have difficulties in translating technical words from Chinese to English and formulating proper English queries. For the purpose of helping Chinese developers take advantage of the rich knowledge base of Stack Overflow and simplify the question retrieval process, we propose an automated cross-language relevant question retrieval (CLRQR) system to retrieve relevant English questions for a given Chinese question. CLRQR first extracts essential information (both Chinese and English) from the title and description of the input Chinese question, then performs domain-specific translation of the essential Chinese information into English, and finally formulates an English query for retrieving relevant questions in a repository of English questions from Stack Overflow. We propose three different retrieval algorithms (word-embedding, word-matching, and vector-space-model based methods) that exploit different document representations and similarity metrics for question retrieval. To evaluate the performance of our approach and investigate the effectiveness of different retrieval algorithms, we propose four baseline approaches based on the combination of different sources of query words, query formulation mechanisms and search engines. We randomly select 80 Java, 20 Python and 20 .NET questions in SegmentFault and V2EX (two Chinese Q&A websites for computer programming) as the query Chinese questions. We conduct a user study to evaluate the relevance of the retrieved English questions using CLRQR with different retrieval algorithms and the four baseline approaches. The experiment results show that CLRQR with word-embedding based retrieval achieves the best performance.
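The abstract names a vector-space-model based method as one of the three retrieval algorithms but does not give its details. As a rough illustration of the general idea only, the following minimal sketch ranks English questions against an already translated English query using TF-IDF weights and cosine similarity. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names, and the three-question repository are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely also remove stop words and apply stemming.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector; terms absent from the repository are dropped.
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_questions(query, questions, top_k=3):
    # Score every repository question against the query and return the
    # top_k (question, score) pairs, highest score first.
    docs_tokens = [tokenize(q) for q in questions]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(questions[i], score) for score, i in scored[:top_k]]


# Hypothetical mini-repository standing in for Stack Overflow question titles.
REPOSITORY = [
    "how to parse json in java",
    "java thread pool executor example",
    "python read csv file into list",
]

# Hypothetical English query, as it might look after translation.
results = rank_questions("parse json string java", REPOSITORY, top_k=2)
for question, score in results:
    print(f"{score:.3f}  {question}")
```

In this toy repository the question about parsing JSON in Java ranks first because it shares three weighted terms with the query, while the thread pool question matches only on "java". A word-embedding based variant would replace the sparse TF-IDF vectors with dense vectors so that related but non-identical words can still contribute to the score.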

Footnotes
3. FudanNLP, available at http://nlp.fudan.edu.cn
5. Youdao translation API, available at http://fanyi.youdao.com/openapi
9. The 120 query Chinese questions, available at https://goo.gl/zAbLVp
10. Stack Exchange Data Dump, available at https://archive.org/download/stackexchange
11. A Chinese question on V2EX, available at https://www.v2ex.com/t/47663
12. A Chinese question on SegmentFault, available at https://SegmentFault.com/q/1010000003408795
13. A Chinese question on V2EX, available at https://www.v2ex.com/t/137913

Metadata
Title: Domain-specific cross-language relevant question retrieval
Authors: Bowen Xu, Zhenchang Xing, Xin Xia, David Lo, Shanping Li
Publication date: 4 November 2017
Publisher: Springer US
Published in: Empirical Software Engineering, Issue 2/2018
Print ISSN: 1382-3256
Electronic ISSN: 1573-7616
DOI: https://doi.org/10.1007/s10664-017-9568-3