Published in: Programming and Computer Software 5/2018

01-09-2018

Detecting Near Duplicates in Software Documentation

Authors: D. V. Luciv, D. V. Koznov, G. A. Chernishev, A. N. Terekhov, K. Yu. Romanovsky, D. A. Grigoriev


Abstract

Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates many “near duplicate” fragments, i.e., chunks of text that were copied from a single source and later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further use. At the same time, they are hard to detect manually because of their fuzzy nature. In this paper, we give a formal definition of near duplicates and present an algorithm for detecting them in software documents. The algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents, and our algorithm then uses these exact duplicates to construct near ones. We evaluate the proposed algorithm on the documentation of 19 open source and commercial projects. The evaluation covers various documentation types: design and requirement specifications, programming guides and API documentation, and user manuals. Overall, it shows that all kinds of software documentation contain a significant number of both exact and near duplicates. We then report on a manual analysis of the near duplicates detected in the Linux Kernel Documentation. We present both quantitative and qualitative results of this analysis, demonstrate the algorithm's strengths and weaknesses, and discuss the benefits of duplicate management in software documents.
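The two-stage idea in the abstract (find exact duplicates first, then build near duplicates from them) can be illustrated with a toy sketch. This is not the paper's algorithm and does not use Clone Miner: the function names `exact_repeats` and `join_neighbours`, the fixed n-gram length `n`, and the `max_gap` threshold are all assumptions introduced here for illustration only.

```python
from collections import defaultdict

def exact_repeats(tokens, n):
    """Map each n-token sequence that occurs at least twice to its start offsets."""
    seen = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        seen[tuple(tokens[i:i + n])].append(i)
    return {gram: starts for gram, starts in seen.items() if len(starts) > 1}

def join_neighbours(first, second, n, max_gap):
    """Pair occurrences of two exact repeats when the second one follows the
    first with at most `max_gap` differing tokens in between (hypothetical rule).
    Returns (start, end) token spans, or [] if fewer than two pairs were found."""
    spans = []
    for a in first:
        for b in second:
            gap = b - (a + n)
            if 0 <= gap <= max_gap:
                spans.append((a, b + n))
    return spans if len(spans) > 1 else []

text = ("open the file in read mode then close it . "
        "open the log in read mode then close it .")
tokens = text.split()
repeats = exact_repeats(tokens, 2)

# Two exact repeats separated by one varying token form a near duplicate.
spans = join_neighbours(repeats[("open", "the")], repeats[("in", "read")], 2, 1)
for start, end in spans:
    print(" ".join(tokens[start:end]))
# prints "open the file in read"
# prints "open the log in read"
```

Here the exact repeats "open the" and "in read" each occur twice, and in both places they are separated by a single differing token ("file" versus "log"), so the joined spans are reported as one near-duplicate group with a variation point in the middle.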


Metadata
Title
Detecting Near Duplicates in Software Documentation
Authors
D. V. Luciv
D. V. Koznov
G. A. Chernishev
A. N. Terekhov
K. Yu. Romanovsky
D. A. Grigoriev
Publication date
01-09-2018
Publisher
Pleiades Publishing
Published in
Programming and Computer Software / Issue 5/2018
Print ISSN: 0361-7688
Electronic ISSN: 1608-3261
DOI
https://doi.org/10.1134/S0361768818050079
