2016 | OriginalPaper | Book chapter

Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters

Authors: Mouhamadou Ba, Sébastien Ferré, Mireille Ducassé

Published in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIV

Publisher: Springer Berlin Heidelberg

Abstract

Heterogeneity of data and data formats in bioinformatics entails mismatches between the inputs and outputs of different services, making it difficult to compose them into workflows. To reduce those mismatches, bioinformatics platforms offer ad hoc converters, called shims. When shims are written by hand, they are time-consuming to develop and cannot anticipate all needs. When shims are automatically generated, they miss some transformations, for example data composition from multiple parts or parallel conversion of list elements.
This article proposes to systematically detect convertibility from output types to input types. Convertibility detection relies on a rule system based on abstract types, close to XML Schema. These types abstract the data while precisely accounting for their composite structure. Detection is accompanied by the automatic generation of converters between input and output XML data. We show the applicability of our approach by abstracting concrete bioinformatics types (e.g., complex biosequences) for a number of bioinformatics services (e.g., BLAST). We illustrate how our automatically generated converters help to resolve data mismatches when composing workflows. We conducted an experiment on bioinformatics services and datatypes, using an implementation of our approach, as well as a survey with domain experts. The detected convertibilities and the generated converters were validated as relevant from a biological point of view. Furthermore, the automatically produced graph of potentially compatible services exhibited higher connectivity than with the ad hoc approaches. Indeed, the experts discovered previously unknown possible connections.
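
To make the idea of rule-based convertibility detection more concrete, the following is a minimal sketch, not the authors' rule system. It assumes a toy type language (`Atom`, `ListOf`, `Record`) and a hypothetical function `find_converter`, none of which come from the chapter, and it works on plain Python values rather than XML. It only illustrates how detecting that one type is convertible to another can produce the converter at the same time, including element-wise conversion of lists and composition of a record from several parts.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Toy abstract types (assumed for illustration; not the chapter's type system).
@dataclass(frozen=True)
class Atom:
    name: str

@dataclass(frozen=True)
class ListOf:
    item: Any

@dataclass(frozen=True)
class Record:
    fields: tuple  # tuple of (label, type) pairs

Converter = Callable[[Any], Any]

def find_converter(src: Any, dst: Any) -> Optional[Converter]:
    """Return a converter from src-typed data to dst-typed data, or None."""
    # Rule 1: identical types are converted by the identity function.
    if src == dst:
        return lambda v: v
    # Rule 2: lists are converted element by element.
    if isinstance(src, ListOf) and isinstance(dst, ListOf):
        item_conv = find_converter(src.item, dst.item)
        if item_conv is None:
            return None
        return lambda vs: [item_conv(v) for v in vs]
    # Rule 3: a record is composed from the same-labelled parts of the source.
    if isinstance(src, Record) and isinstance(dst, Record):
        src_fields = dict(src.fields)
        parts = {}
        for label, dst_type in dst.fields:
            if label not in src_fields:
                return None
            conv = find_converter(src_fields[label], dst_type)
            if conv is None:
                return None
            parts[label] = conv
        return lambda rec: {lab: c(rec[lab]) for lab, c in parts.items()}
    # Rule 4: a record converts to dst if one of its fields does (projection).
    if isinstance(src, Record):
        for label, src_type in src.fields:
            conv = find_converter(src_type, dst)
            if conv is not None:
                return lambda rec, label=label, conv=conv: conv(rec[label])
    # Rule 5: a single value can be wrapped into a one-element list.
    if isinstance(dst, ListOf):
        conv = find_converter(src, dst.item)
        if conv is not None:
            return lambda v: [conv(v)]
    return None

# Hypothetical example: a service outputs annotated sequences,
# while the next service expects a plain list of DNA strings.
annotated = ListOf(Record((("id", Atom("string")), ("sequence", Atom("dna")))))
plain = ListOf(Atom("dna"))
to_plain = find_converter(annotated, plain)
result = to_plain([{"id": "s1", "sequence": "ACGT"}, {"id": "s2", "sequence": "GG"}])
print(result)
```

In this sketch a return value of `None` means that no rule applies, that is, no convertibility was detected between the two types. A real system would also need rules for optional and alternative content and would have to choose among several candidate converters, which this toy version does not attempt.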

Metadata
Title
Solving Data Mismatches in Bioinformatics Workflows by Generating Data Converters
Authors
Mouhamadou Ba
Sébastien Ferré
Mireille Ducassé
Copyright year
2016
Publisher
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/978-3-662-49214-7_3