Swipe to navigate through the articles of this issue
This work is supported by the Danish National Research Foundation under the Sapere Aude program.
A basic question on two pieces of data is: “What is the similarity of the data?” In this extended abstract we give an overview of new developments in randomized algorithms and data structures that relate to this question. In particular we provide new state of the art methods in three particular settings, that all relate to the computation of intersection sizes:
We give a new space-efficient summary data structure for answering set intersection size queries. The new summaries are based on one-permutation min-wise hashing, and we provide a lower bound that nearly matches our new upper bound.
For sparse matrix multiplication, we give new tight bounds in the I/O model, settling the I/O complexity a natural parameterization of the problem—namely where the complexity depends on the input sparsity N, the output sparsity Z and the parameters of the I/O model. In the RAM model we give a new algorithm that exploits output sparsity and which beats previous known results for most of the parameter space.
We give a new I/O efficient algorithm to compute the similarity join between two sets: two elements are members of this join if they are close according to a specified metric. Our new algorithm is based on locality-sensitive hashing and strictly improves on previous work.
Please log in to get access to this content
To get access to this content you need the following product:
Amossen RR, Pagh R (2009) Faster join-projects and sparse matrix multiplications. In: Proceedings of the 12th international conference on database theory, ICDT ’09, St. Petersburg, Russia, 23–25 March, pp 121–126
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences. SEQUENCES ’97. IEEE Computer Society, Washington, DC, p 21
Jacob, R, Stöckel M (2015) Fast output-sensitive matrix multiplication. In: Algorithms, ESA 2015: 23rd annual European symposium, Proceedings. Springer, Heidelberg, pp 766–778
Jia-Wei H, Kung HT (1981) I/o complexity: the red–blue pebble game. In: Proceedings of the thirteenth annual ACM symposium on theory of computing, STOC ’81, ACM, pp 326–333
Knudsen MBT, Stöckel M (2015) Quicksort, largest bucket, and min-wise hashing with limited independence. In: Algorithms, ESA 2015, 23rd annual European symposium, Proceedings, pp 828–839
Li P, König AC (2011) Theory and applications of b-bit minwise hashing. Commun ACM 54(8):101–109 CrossRef
Pagh R, Pham N, Silvestri F, Stöckel M (2015) I/O-efficient similarity join. In: Algorithms, ESA 2015: 23rd annual European symposium, Proceedings. Springer, Heidelberg, pp 941–952
Pagh R, Stöckel M (2014) The input/output complexity of sparse matrix multiplication. In: Algorithms, ESA 2014: 22th annual European symposium, Proceedings. Springer, Heidelberg, pp 750–761
Pagh R, Stöckel M (2015) Association rule mining using maximum entropy. CoRR. arxiv:1501.02143
Pagh R, Stöckel M, Woodruff DP (2014) Is min-wise hashing optimal for summarizing set intersection? In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS’14. ACM, pp 109–120
Stöckel M (2015) Randomized primitives for big data processing. https://en.itu.dk/~/media/en/research/phd-programme/phd-defences/2015/morten-st%C3%B6ckel-phd-thesis-final-pdf.pdf?la=en
- Randomized Primitives for Big Data Processing
- Publication date
- Springer Berlin Heidelberg
Neuer Inhalt/© ITandMEDIA