World Wide Web

Published: 30-11-2017

ACRES: efficient query answering on large compressed sequences

Authors: Bin Wang, Xiaochun Yang, Guoren Wang

Published in: World Wide Web | Issue 5/2018

Abstract

With advances in next-generation sequencing, the amount of genomic sequence data being produced continues to grow at an exponential rate. It is estimated that the entire genome of each individual human, about 3 billion letters, could be made available in the next few years. An increasingly pressing issue in genomics and medicine is how to store and query these massive amounts of sequence data efficiently. Recently, a lossless compression technique has been proposed that drastically reduces the storage space of genomic sequences by exploiting the fact that any two genomes from the same species are highly similar, so only their differences need to be encoded. In this paper, we study how to efficiently answer queries on the compressed sequences without first decompressing them. We study three important types of queries: retrieving a subsequence, finding subsequences that match a given pattern, and finding subsequences similar to a pattern. We propose an index structure, filtering techniques, and efficient algorithms for answering these queries, and we demonstrate the utility of these algorithms on a real dataset.
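As a rough illustration of the idea in the abstract (not the paper's ACRES index or algorithms), the sketch below stores a sequence as its differences from a shared reference and answers a range-retrieval query by patching only the requested window. It assumes the two sequences differ by substitutions only, so positions line up; the class name `DeltaSequence` and the method `extract` are hypothetical names chosen for this example.

```python
from bisect import bisect_left


class DeltaSequence:
    """A target sequence stored as substitutions against a shared reference.

    Simplifying assumption for this sketch: reference and target have the
    same length, so no insertions or deletions need to be tracked.
    """

    def __init__(self, reference: str, target: str) -> None:
        if len(reference) != len(target):
            raise ValueError("this sketch only handles substitutions")
        self.reference = reference
        # Sorted positions where the target differs from the reference,
        # and the target letters found at those positions.
        self.positions = [
            i for i, (r, t) in enumerate(zip(reference, target)) if r != t
        ]
        self.letters = [target[i] for i in self.positions]

    def extract(self, start: int, end: int) -> str:
        """Return target[start:end] without rebuilding the whole target.

        Expects 0 <= start <= end <= len(reference).
        """
        window = list(self.reference[start:end])
        # Binary search locates the differences that fall inside the window.
        first = bisect_left(self.positions, start)
        last = bisect_left(self.positions, end)
        for k in range(first, last):
            window[self.positions[k] - start] = self.letters[k]
        return "".join(window)


reference = "ACGTACGTAC"
target = "ACGAACGTTC"
seq = DeltaSequence(reference, target)
print(seq.positions, seq.extract(2, 9))
```

Only the differences inside the requested window are touched, so the cost of a query depends on the window length and the number of differences it contains rather than on the length of the whole genome.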

We use the terms “string” and “sequence” in a synonymous way. Note, however, that we clearly distinguish between the terms “substring” and “subsequence,” the latter being the much more general term.
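The distinction in this note can be made concrete with a small sketch: a substring must occupy consecutive positions, while a subsequence only has to preserve order. The function names below are hypothetical and are not taken from the paper.

```python
def is_substring(s: str, text: str) -> bool:
    """True if s occurs in text as a contiguous block."""
    return s in text


def is_subsequence(s: str, text: str) -> bool:
    """True if the letters of s occur in text in order, gaps allowed."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched letter,
    # so later letters of s can only match later positions of text.
    return all(letter in remaining for letter in s)


text = "ACGTAC"
print(is_substring("GTA", text), is_subsequence("AGC", text))
```

Every substring is also a subsequence, but not the other way round, which is why the note calls "subsequence" the more general term.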

We use (x,y] to denote a PMR that overlaps with its left interval; similarly, [x,y) denotes a region that overlaps with its right interval.

See http://silver.ics.uci.edu/~dnazip/index.html.

See http://www.ncbi.nlm.nih.gov/IEB/ToolBox.

Title: ACRES: efficient query answering on large compressed sequences
Authors: Bin Wang, Xiaochun Yang, Guoren Wang
Publication date: 30-11-2017
Publisher: Springer US
Published in: World Wide Web / Issue 5/2018
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI: https://doi.org/10.1007/s11280-017-0518-1
