Clustering Mobile Apps Based on Mined Textual Features

Authors:
A. A. Al-Subaihin, F. Sarro, S. Black, L. Capra, M. Harman, Y. Jia, Y. Zhang

CREST, Department of Computer Science, University College London, UK

ESEM '16: Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, September 2016, Article No.: 38, Pages 1–10. https://doi.org/10.1145/2961111.2962600

Published: 08 September 2016

ABSTRACT

Context: Categorising software systems according to their functionality yields many benefits to both users and developers. Goal: To uncover the latent clustering of mobile apps in app stores, we propose a novel technique that measures app similarity based on claimed behaviour. Method: Features are extracted using information retrieval augmented with ontological analysis and used as attributes to characterise apps. These attributes are then used to cluster the apps using agglomerative hierarchical clustering. We empirically evaluate our approach on 17,877 apps mined from the BlackBerry and Google app stores in 2014. Results: Our approach dramatically improves the existing categorisation quality for both the BlackBerry (from 0.02 to 0.41 on average) and Google (from 0.03 to 0.21 on average) stores. We also find a strong Spearman rank correlation (ρ = 0.96 for Google and ρ = 0.99 for BlackBerry) between the number of apps and the ideal granularity within each category, indicating that ideal granularity increases with category size, as expected. Conclusions: The current categorisation in the app stores studied does not exhibit good classification quality in terms of the claimed feature space. However, better quality can be achieved using a good feature extraction technique and a traditional clustering method.
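The Method step above (feature attributes fed into agglomerative hierarchical clustering, then scored for quality) can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the four toy apps and their feature terms are invented, the helper names (`vectorise`, `cosine_distance`, `agglomerative`, `silhouette`) are hypothetical, average linkage over cosine distance is just one possible configuration, and the mean silhouette width stands in for whichever quality score the authors actually report.

```python
import math
from collections import Counter
from itertools import combinations


def vectorise(features):
    """Turn a list of feature terms into a term-count vector."""
    return Counter(features)


def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, k):
    """Repeatedly merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


def silhouette(vectors, clusters):
    """Mean silhouette width; needs at least two clusters, singletons score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the app's own cluster
            a = sum(
                cosine_distance(vectors[i], vectors[j]) for j in cluster if j != i
            ) / (len(cluster) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(cosine_distance(vectors[i], vectors[j]) for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Invented toy data: app name -> feature terms claimed in its description.
apps = {
    "chat_a": ["send", "message", "group", "chat"],
    "chat_b": ["send", "message", "voice", "chat"],
    "run_a": ["track", "run", "distance", "gps"],
    "run_b": ["track", "run", "pace", "gps"],
}
names = list(apps)
vectors = [vectorise(apps[n]) for n in names]
clusters = agglomerative(vectors, k=2)
labelled = sorted(sorted(names[i] for i in c) for c in clusters)
score = silhouette(vectors, clusters)
print(labelled, round(score, 2))
```

Varying `k` and keeping the value with the highest score is one simple way to probe for a suitable granularity within a category, although the abstract does not say how the authors select it.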

Index Terms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
ESEM '16: Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
September 2016
457 pages
ISBN: 9781450344272
DOI: 10.1145/2961111
General Chair: Marcela Genero (University of Castilla-La Mancha, Spain)
Program Chairs: Andreas Jedlitschka (Fraunhofer IESE, Germany), Magne Jørgensen (Simula Research Laboratory, Norway), Giuseppe Scanniello (University of Basilicata, Italy), Sreedevi Sampath (University of Maryland Baltimore County, USA), Danilo Caivano (SER&Practices, Italy), Daniel Port (University of Hawaii, USA)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
ESEM '16 Paper Acceptance Rate: 27 of 122 submissions, 22%
Overall Acceptance Rate: 130 of 594 submissions, 22%