Published in: Soft Computing 6/2016

17-03-2015 | Methodologies and Application

Automatic constraints generation for semisupervised clustering: experiences with documents classification

Authors: Irene Diaz-Valenzuela, Vincenzo Loia, Maria J. Martin-Bautista, Sabrina Senatore, M. Amparo Vila

Abstract

In recent years, semi-supervised clustering has received considerable attention. It differs from traditional unsupervised approaches in using a small amount of supervision to “steer” the clustering. Unfortunately, in the real world this supervision is not always available: the data to process are often too large, so the cost (in time and human resources) of user-provided information is prohibitive. To address this issue, this work presents a method that generates the supervision automatically by analysing the structure of the data itself. The analysis is performed with a partitional clustering algorithm that discovers relationships between pairs of instances, which can then be used as semi-supervision in the clustering process. The methodology has been studied in the document clustering domain, an area where novel approaches for accurate document classification are in strong demand. Experimental results show the validity of this approach.
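
The abstract does not detail how the pairwise relationships are extracted, so the following is only an illustrative sketch and not the authors' algorithm. It assumes one plausible reading: run a partitional algorithm (here a plain k-means) from several initialisations, and treat pairs of instances that always share a cluster as must-link constraints and pairs that never do as cannot-link constraints. The function names `kmeans` and `generate_constraints`, the toy points, and the consensus rule are all assumptions made for this example.

```python
from itertools import combinations


def kmeans(points, centroids, iters=20):
    """Plain Lloyd's k-means on tuples; returns one cluster label per point."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


def generate_constraints(points, initialisations):
    """Hypothetical rule: derive constraints from agreement across k-means runs."""
    runs = [kmeans(points, list(init)) for init in initialisations]
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(points)), 2):
        together = [labels[i] == labels[j] for labels in runs]
        if all(together):
            must_link.append((i, j))      # same cluster in every run
        elif not any(together):
            cannot_link.append((i, j))    # different clusters in every run
        # Pairs on which the runs disagree yield no constraint.
    return must_link, cannot_link


# Toy data: two tight groups plus one ambiguous point in between (index 4).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (2.5, 2.6)]
inits = [
    [(0.0, 0.0), (5.0, 5.0)],
    [(0.1, 0.2), (5.2, 4.9)],
    [(2.5, 2.6), (5.0, 5.0)],
]
must_link, cannot_link = generate_constraints(points, inits)
print("must-link:", must_link)
print("cannot-link:", cannot_link)
```

In this toy run the ambiguous middle point receives no constraints because the runs disagree about it, which is the kind of conservative behaviour one would want from automatically generated supervision. The resulting pairs could then be passed to any constrained clustering algorithm in place of user-provided constraints.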

Metadata
Title
Automatic constraints generation for semisupervised clustering: experiences with documents classification
Authors
Irene Diaz-Valenzuela
Vincenzo Loia
Maria J. Martin-Bautista
Sabrina Senatore
M. Amparo Vila
Publication date
17-03-2015
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 6/2016
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-015-1643-3
