Skip to main content
Top
Published in: World Wide Web 5/2018

30-11-2017

ACRES: efficient query answering on large compressed sequences

Authors: Bin Wang, Xiaochun Yang, Guoren Wang

Published in: World Wide Web | Issue 5/2018

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

With the advances in next generation sequencing, the amount of genomic sequence data being produced continues to grow at an exponential rate. It is estimated that the entire genome of each individual human, each containing about 3 billion letters, could be made available in the next a few years. An increasingly pressing issue in genomics and medicine is how to efficiently store and query these massive amounts of sequence data. Recently a lossless compression technique has been proposed to drastically reduce the storage space of genomic sequences, taking advantage of the fact that any two genomes from the same species are highly similar and therefore only their differences need to be encoded. In this paper we study how to efficiently answer queries on the compressed sequences without first decompressing them. We study three important types of queries, including retrieving a subsequence, finding subsequences matching a given pattern, and finding subsequences similar to a pattern. We propose an index structure, filtering techniques, and efficient algorithms for answering these queries. We further demonstrate the utility of these algorithms using a real dataset.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Footnotes
1
We use the terms “string” and “sequence” in a synonymous way. Note, however, that we clearly distinguish between the terms “substring” and “subsequence,” the latter being the much more general term.
 
2
We use (x,y] to express a PMR that overlapping with its left interval, similar, [x,y) represents a region overlapping with its right interval.
 
Literature
1.
go back to reference Aluru, S., Ko, P.: Encyclopedia of Algorithms, Chapter on “Lookup Tables, Suffix Trees and Suffix Arrays”. Springer (2008) Aluru, S., Ko, P.: Encyclopedia of Algorithms, Chapter on “Lookup Tables, Suffix Trees and Suffix Arrays”. Springer (2008)
2.
go back to reference Arasu, A., Ganti, V., Kaushik, R.: Exact set-similarity joins. In: VLDB, pp. 918–929 (2006) Arasu, A., Ganti, V., Kaushik, R.: Exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
4.
go back to reference Bayardo, R., Ma, Y., Srikant, R.: Scaling up all-pairs similarity search. In: WWW Conference (2007) Bayardo, R., Ma, Y., Srikant, R.: Scaling up all-pairs similarity search. In: WWW Conference (2007)
5.
go back to reference Brandon, M.C., Wallace, D.C.: Data structures and compression algorithms for genomic sequence data. Bioinformatics 25(14), 1731–1738 (2009)CrossRef Brandon, M.C., Wallace, D.C.: Data structures and compression algorithms for genomic sequence data. Bioinformatics 25(14), 1731–1738 (2009)CrossRef
6.
go back to reference Chaudhuri, S., Ganti, V.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006) Chaudhuri, S., Ganti, V.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006)
7.
go back to reference Christley, S., Lu, Y., Li, C., Xie, X.: Human genomes as email attachments. Bioinformatics 25(2), 274–275 (2009)CrossRef Christley, S., Lu, Y., Li, C., Xie, X.: Human genomes as email attachments. Bioinformatics 25(2), 274–275 (2009)CrossRef
8.
go back to reference Dublin, M.: So long, data depression. Genome Technology (2009) Dublin, M.: So long, data depression. Genome Technology (2009)
9.
go back to reference González, R., Navarro, G.: Compressed text indexes with fast locate. In: CPM, p. 4580. LNCS (2007) González, R., Navarro, G.: Compressed text indexes with fast locate. In: CPM, p. 4580. LNCS (2007)
10.
go back to reference Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001) Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
11.
go back to reference Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997) Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
12.
go back to reference Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Set similarity selection queries at interactive speeds. In: ICDE (2008) Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Set similarity selection queries at interactive speeds. In: ICDE (2008)
13.
go back to reference Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over ziv-lempel compressed text. In: CPM, p. 1848. LNCS (2000) Kärkkäinen, J., Navarro, G., Ukkonen, E.: Approximate string matching over ziv-lempel compressed text. In: CPM, p. 1848. LNCS (2000)
14.
go back to reference Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: CPM, pp. 203–210. LNCS 2676 (2003) Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. In: CPM, pp. 203–210. LNCS 2676 (2003)
15.
go back to reference Li, C., Wang, B., X. Yang.: Vgram: Improving performance of approximate queries on string collections using variablelength grams. In: VLDB, pp. 303–314 (2007) Li, C., Wang, B., X. Yang.: Vgram: Improving performance of approximate queries on string collections using variablelength grams. In: VLDB, pp. 303–314 (2007)
16.
go back to reference Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE (2008) Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: ICDE (2008)
17.
go back to reference Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv., 33(1) (2001) Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv., 33(1) (2001)
18.
go back to reference Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001) Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
19.
go back to reference Papapetrou, P., Athitsos, V., Kollios, G., Gunopulos, D.: Referencebased alignment in large sequence databases. In: VLDB (2009) Papapetrou, P., Athitsos, V., Kollios, G., Gunopulos, D.: Referencebased alignment in large sequence databases. In: VLDB (2009)
20.
go back to reference Sarawagi, S., Kirpai, A.: Efficient set joins on similarity predicatess. In: SIGMOD (2004) Sarawagi, S., Kirpai, A.: Efficient set joins on similarity predicatess. In: SIGMOD (2004)
21.
go back to reference Venkateswaran, J., Lachwani, D., Kahveci, T., Jermaine, C.: Reference-based indexing of sequence databases. In: VLDB, pp. 906–917 (2006) Venkateswaran, J., Lachwani, D., Kahveci, T., Jermaine, C.: Reference-based indexing of sequence databases. In: VLDB, pp. 906–917 (2006)
22.
go back to reference Wang, W, Xiao, C., Lin, X., Zhang, C.: Efficent approximate entity extraction with edit distance constraints. In: SIGMOD (2009) Wang, W, Xiao, C., Lin, X., Zhang, C.: Efficent approximate entity extraction with edit distance constraints. In: SIGMOD (2009)
23.
go back to reference Wang, B., Zhu, R., Yang, X., Wang, G.: Top-k representative documents query over geo-textual data stream. World Wide Web-internet Web Inf. Syst., 20(8) (2017) Wang, B., Zhu, R., Yang, X., Wang, G.: Top-k representative documents query over geo-textual data stream. World Wide Web-internet Web Inf. Syst., 20(8) (2017)
24.
go back to reference Welch, T.A.: A technique for high performance data compression. IEEE Comput. Mag., 17(6) (1984) Welch, T.A.: A technique for high performance data compression. IEEE Comput. Mag., 17(6) (1984)
25.
go back to reference Wheeler, D.A., Srinivasan, M., et al.: The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008)CrossRef Wheeler, D.A., Srinivasan, M., et al.: The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008)CrossRef
26.
go back to reference Wu, S., Manber, U.: Fast text searching allowing errors. Comm. of the ACM 35(10), 83–91 (1992)CrossRef Wu, S., Manber, U.: Fast text searching allowing errors. Comm. of the ACM 35(10), 83–91 (1992)CrossRef
27.
go back to reference Yang, X., Wang, B., Li, C.: Cost-based variable- length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD (2008) Yang, X., Wang, B., Li, C.: Cost-based variable- length-gram selection for string collections to support approximate queries efficiently. In: SIGMOD (2008)
28.
go back to reference Yang, X., Qiu, T., Wang, B., Zheng, B., Wang, Y., Li, C.: Negative factor: Improving regular-expression matching in strings. ACM Trans. Database Syst. 40(4), 1–46 (2016)MathSciNetCrossRef Yang, X., Qiu, T., Wang, B., Zheng, B., Wang, Y., Li, C.: Negative factor: Improving regular-expression matching in strings. ACM Trans. Database Syst. 40(4), 1–46 (2016)MathSciNetCrossRef
29.
go back to reference Yang, X., Wang, Y., Wang, B., Wang, W.: Local filtering: Improving the performance of approximate queries on string collections. In: SIGMOD, pp. 377–392 (2016) Yang, X., Wang, Y., Wang, B., Wang, W.: Local filtering: Improving the performance of approximate queries on string collections. In: SIGMOD, pp. 377–392 (2016)
30.
go back to reference Zhu, R., Wang, B., Yang, X., Zheng, B., Wang, G.: Sap: Improving continuous top-k queries over streaming data. IEEE Trans. Knowl. Data Eng. 29(6), 1310–1328 (2017)CrossRef Zhu, R., Wang, B., Yang, X., Zheng, B., Wang, G.: Sap: Improving continuous top-k queries over streaming data. IEEE Trans. Knowl. Data Eng. 29(6), 1310–1328 (2017)CrossRef
31.
32.
go back to reference Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24, 530–536 (1978)CrossRefMATH Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24, 530–536 (1978)CrossRefMATH
Metadata
Title
ACRES: efficient query answering on large compressed sequences
Authors
Bin Wang
Xiaochun Yang
Guoren Wang
Publication date
30-11-2017
Publisher
Springer US
Published in
World Wide Web / Issue 5/2018
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-017-0518-1

Other articles of this Issue 5/2018

World Wide Web 5/2018 Go to the issue

Premium Partner