Skip to main content
Top
Published in:
Cover of the book

2016 | OriginalPaper | Chapter

The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets

Authors : Giuseppe Lancia, Paolo Serafini

Published in: Bioinformatics and Biomedical Engineering

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Many biomedical experiments produce large data sets in the form of binary matrices, with features labeling the columns and individuals (samples) associated to the rows. An important case is when the rows are also labeled into two groups, namely the positive (or healthy) and the negative (or diseased) samples. The Logical Analysis of Data (LAD) is a procedure aimed at identifying relevant features and building boolean formulas (rules) which can be used to classify new samples as positive or negative. These rules are said to explain the data set. Each rule can be represented by a string over {0,1,-}, called a pattern. A data set can be explained by alternative sets of patterns, and many computational problems arise related to the choice of a particular set of patterns for a given instance. In this paper we study the computational complexity of these pattern problems and show that they are, in general, very hard. We give an integer programming formulation for the problem of determining if two sets of patterns are equivalent. We also prove computational complexity results which imply that there should be no simple ILP model for finding a minimal set of patterns explaining a given data set.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Alexe, G., Alexe, S., Bonates, T.O., Kogan, A.: Logical analysis of data - the vision of Peter L. Hammer. Ann. Math. Artif. Intell. 49, 265–312 (2007)MathSciNetCrossRefMATH Alexe, G., Alexe, S., Bonates, T.O., Kogan, A.: Logical analysis of data - the vision of Peter L. Hammer. Ann. Math. Artif. Intell. 49, 265–312 (2007)MathSciNetCrossRefMATH
2.
go back to reference Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., Tuli, M.A.: The EMBL nucleotide sequence database. Nucleic Acid Res. 28, 19–23 (2000)CrossRef Baker, W., van den Broek, A., Camon, E., Hingamp, P., Sterk, P., Stoesser, G., Tuli, M.A.: The EMBL nucleotide sequence database. Nucleic Acid Res. 28, 19–23 (2000)CrossRef
3.
go back to reference Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acid Res. 28, 235–242 (2000)CrossRef Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acid Res. 28, 235–242 (2000)CrossRef
4.
go back to reference Bertolazzi, P., Felici, G., Festa, P., Lancia, G.: Logic classification and feature selection for biomedical data. Comput. Math. Appl. 55, 889–899 (2008)MathSciNetCrossRefMATH Bertolazzi, P., Felici, G., Festa, P., Lancia, G.: Logic classification and feature selection for biomedical data. Comput. Math. Appl. 55, 889–899 (2008)MathSciNetCrossRefMATH
5.
go back to reference Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12, 292–306 (2000)CrossRef Boros, E., Hammer, P., Ibaraki, T., Kogan, A., Mayoraz, E., Muchnik, I.: An implementation of logical analysis of data. IEEE Trans. Knowl. Data Eng. 12, 292–306 (2000)CrossRef
6.
go back to reference Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., Fodor, S.P.A.: Accessing genetic information with high-density DNA arrays. Science 274, 610–614 (1996)CrossRef Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., Fodor, S.P.A.: Accessing genetic information with high-density DNA arrays. Science 274, 610–614 (1996)CrossRef
7.
go back to reference Chikalov, I., Lozin, V., Lozina, I., Moshkov, M., Son Nguyen, H., Skowron, A., Zielosko, B.:Logical analysis of data: theory, methodology and applications. In: Three Approaches to Data Analysis, pp. 147–192. Springer (2013) Chikalov, I., Lozin, V., Lozina, I., Moshkov, M., Son Nguyen, H., Skowron, A., Zielosko, B.:Logical analysis of data: theory, methodology and applications. In: Three Approaches to Data Analysis, pp. 147–192. Springer (2013)
8.
go back to reference Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1, 131–156 (1997)CrossRef Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1, 131–156 (1997)CrossRef
9.
go back to reference Felici, G., de Angelis, V., Mancinelli, G.: Feature selection for data mining. In: Felici, G., Triantaphyllou, E. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 227–252. Springer (2006) Felici, G., de Angelis, V., Mancinelli, G.: Feature selection for data mining. In: Felici, G., Triantaphyllou, E. (eds.) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, pp. 227–252. Springer (2006)
10.
go back to reference Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, San Francisco (1979)MATH Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, San Francisco (1979)MATH
11.
go back to reference Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)CrossRef Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)CrossRef
12.
go back to reference Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)CrossRefMATH Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002)CrossRefMATH
13.
go back to reference Hammer, P., Bonates, T.: Logical analysis of data: from combinatorial optimization to medical applications. RUTCOR Research Report, 10-05, Rutgers University, NJ (2005) Hammer, P., Bonates, T.: Logical analysis of data: from combinatorial optimization to medical applications. RUTCOR Research Report, 10-05, Rutgers University, NJ (2005)
14.
go back to reference Hebert, P.D.N., Cywinska, A., Ball, S.L., de Waard, J.R.: Biological identifications through DNA barcodes. Proc. R. Soc. Lond. B 270, 313–321 (2003)CrossRef Hebert, P.D.N., Cywinska, A., Ball, S.L., de Waard, J.R.: Biological identifications through DNA barcodes. Proc. R. Soc. Lond. B 270, 313–321 (2003)CrossRef
15.
go back to reference Hu, H., Li, J., Plank, A., Wang, H., Daggard, G.: A comparative study of classification methods for microarray data analysis. In: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, vol. 61, pp. 33–37, Sydney, Australia (2006) Hu, H., Li, J., Plank, A., Wang, H., Daggard, G.: A comparative study of classification methods for microarray data analysis. In: Proceedings of the fifth Australasian Conference on Data Mining and Analystics, vol. 61, pp. 33–37, Sydney, Australia (2006)
16.
go back to reference Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)CrossRef Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20, 2429–2437 (2004)CrossRef
17.
go back to reference Montgomery, D., Undem, B.L.: CombiMatrix’ customizable DNA microarrays-Tutoial: In situ computer-aided synthesis of custom oligo microarrays. Genetic Eng. News 22, 32–33 (2002). Drug Discovery Montgomery, D., Undem, B.L.: CombiMatrix’ customizable DNA microarrays-Tutoial: In situ computer-aided synthesis of custom oligo microarrays. Genetic Eng. News 22, 32–33 (2002). Drug Discovery
18.
go back to reference Serafini, P.: Classifying negative and positive points by optimal box clustering. Discrete Appl. Math. 165(270–282), 10 (2014)MathSciNetMATH Serafini, P.: Classifying negative and positive points by optimal box clustering. Discrete Appl. Math. 165(270–282), 10 (2014)MathSciNetMATH
Metadata
Title
The Complexity of Some Pattern Problems in the Logical Analysis of Large Genomic Data Sets
Authors
Giuseppe Lancia
Paolo Serafini
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-31744-1_1

Premium Partner