Published in: Network Modeling Analysis in Health Informatics and Bioinformatics 1/2016

01-12-2016 | Original Article

An evaluation of approaches for using unlabeled data with domain adaptation

Authors: Nic Herndon, Doina Caragea


Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, annotating a genome for splice sites using machine learning requires a large amount of labeled data, whereas for non-model organisms only a small amount of labeled data and a large amount of unlabeled data are available. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data: with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels. We analyze them for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the best of the other two approaches to integrating unlabeled data.
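
The difference between the three labeling schemes can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' algorithm: the Bernoulli naive Bayes classifier, the function names (`fit_nb`, `posterior`, `adapt`), the confidence `threshold`, and the choice to re-label all unlabeled instances on every iteration are assumptions made for the example. Weighting of source-domain data is omitted; labeled source and target examples would simply be stacked into `X_lab` and `y_lab`.

```python
import numpy as np

def fit_nb(X, R, alpha=1.0):
    """Fit Bernoulli naive Bayes from binary features X (n, d) and class
    weights R (n, k). Rows of R may be one-hot, fractional, or all zero."""
    class_mass = R.sum(axis=0)                                   # (k,)
    log_prior = np.log((class_mass + alpha)
                       / (class_mass.sum() + alpha * R.shape[1]))
    theta = (R.T @ X + alpha) / (class_mass[:, None] + 2 * alpha)  # (k, d)
    return log_prior, np.log(theta), np.log1p(-theta)

def posterior(model, X):
    """Return class posteriors P(y | x) for each row of X, shape (n, k)."""
    log_prior, log_t, log_not_t = model
    log_joint = log_prior + X @ log_t.T + (1 - X) @ log_not_t.T
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def one_hot(y, k):
    return np.eye(k)[y]

def adapt(X_lab, y_lab, X_unl, k=2, mode="both", threshold=0.9, max_iter=50):
    """Iteratively refit on labeled plus unlabeled data.

    mode="soft": every unlabeled instance keeps its posterior (EM-style).
    mode="hard": only confident instances are used, with their predicted
                 class as a hard label (self-training-style).
    mode="both": confident instances get hard labels, the rest soft labels.
    """
    if mode not in ("soft", "hard", "both"):
        raise ValueError(f"unknown mode: {mode}")
    R_lab = one_hot(y_lab, k)
    model = fit_nb(X_lab, R_lab)
    prev_hard = None
    for _ in range(max_iter):
        P = posterior(model, X_unl)
        hard = P.argmax(axis=1)
        confident = P.max(axis=1) >= threshold
        if mode == "soft":
            R_unl = P
        elif mode == "hard":
            # Unconfident rows become all-zero and contribute nothing.
            R_unl = one_hot(hard, k) * confident[:, None]
        else:
            R_unl = np.where(confident[:, None], one_hot(hard, k), P)
        model = fit_nb(np.vstack([X_lab, X_unl]), np.vstack([R_lab, R_unl]))
        # Assumed stopping rule: hard labels of all unlabeled instances
        # did not change since the previous iteration.
        if prev_hard is not None and np.array_equal(hard, prev_hard):
            break
        prev_hard = hard
    return model
```

Calling `adapt(X_lab, y_lab, X_unl, mode="soft")`, then with `mode="hard"` and `mode="both"`, on the same data gives three models whose predictions (`posterior(model, X).argmax(axis=1)`) can be compared directly. A conventional self-training loop would usually move confident instances into the labeled set permanently; re-labeling on each pass is a simplification chosen here to keep the three modes in one loop.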

Footnotes
1
When checking for convergence, we assigned hard labels to all instances in the target unlabeled data set.
 
Metadata
Title: An evaluation of approaches for using unlabeled data with domain adaptation
Authors: Nic Herndon, Doina Caragea
Publication date: 01-12-2016
Publisher: Springer Vienna
Published in: Network Modeling Analysis in Health Informatics and Bioinformatics, Issue 1/2016
Print ISSN: 2192-6662
Electronic ISSN: 2192-6670
DOI: https://doi.org/10.1007/s13721-016-0133-6
