Skip to main content
Erschienen in: The VLDB Journal 6/2015

01.12.2015 | Regular Paper

S\(^3\)-TM: scalable streaming short text matching

verfasst von: Fuat Basık, Buğra Gedik, Hakan Ferhatosmanoğlu, Mert Emin Kalender

Erschienen in: The VLDB Journal | Ausgabe 6/2015

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Micro-blogging services have become major venues for information creation, as well as channels of information dissemination. Accordingly, monitoring them for relevant information is a critical capability. This is typically achieved by registering content-based subscriptions with the micro-blogging service. Such subscriptions are long-running queries that are evaluated against the stream of posts. Given the popularity and scale of micro-blogging services like Twitter and Weibo, building a scalable infrastructure to evaluate these subscriptions is a challenge. To address this challenge, we present the S\(^3\)-TM system for streaming short text matching. S\(^3\)-TM is organized as a stream processing application, in the form of a data parallel flow graph designed to be run on a data center environment. It takes advantage of the structure of the publications (posts) and subscriptions to perform the matching in a scalable manner, without broadcasting publications or subscriptions to all of the matcher instances. The basic design of S\(^3\)-TM uses a scoped multicast for publications and scoped anycast for subscriptions. To further improve throughput, we introduce publication routing algorithms that aim at minimizing the scope of the multicasts. First set of algorithms we develop are based on partitioning the word co-occurrence frequency graph, with the aim of routing posts that include commonly co-occurring words to a small set of matchers. While effective, these algorithms fell short in balancing the load. To address this, we develop the SALB algorithm, which provides better load balance by modeling the load more accurately using the word-to-post bipartite graph. We also develop a subscription placement algorithm, called LASP, to group together similar subscriptions, in order to minimize the subscription matching cost. Furthermore, to achieve good scalability for increasing number of nodes, we introduce techniques to handle workload skew. Finally, we introduce load shedding techniques for handling unexpected load spikes with small impact on the accuracy. Our experimental results show that S\(^3\)-TM is scalable. Furthermore, the SALB algorithm provides more than \(2.5\times \) throughput compared to the baseline multicast and outperforms the graph partitioning-based approaches.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Aguilera, M.K., Strom, R.E., Sturman, D.C., Astley, M., Chandra, T.D.: Matching events in a content-based subscription system. In: ACM Symposium on Principles of Distributed Computing (PODC) (1999) Aguilera, M.K., Strom, R.E., Sturman, D.C., Astley, M., Chandra, T.D.: Matching events in a content-based subscription system. In: ACM Symposium on Principles of Distributed Computing (PODC) (1999)
2.
Zurück zum Zitat Barazzutti, R., Felber, P., Fetzer, C., Onica, E., Pineau, J.F., Pasin, M., Rivière, E., Weigert, S.: Streamhub: a massively parallel architecture for high-performance content-based publish/subscribe. In: ACM International Conference on Distributed Event-based Systems (DEBS), pp. 63–74 (2013) Barazzutti, R., Felber, P., Fetzer, C., Onica, E., Pineau, J.F., Pasin, M., Rivière, E., Weigert, S.: Streamhub: a massively parallel architecture for high-performance content-based publish/subscribe. In: ACM International Conference on Distributed Event-based Systems (DEBS), pp. 63–74 (2013)
3.
Zurück zum Zitat Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)MATH Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)MATH
4.
Zurück zum Zitat Carzaniga, A., Rosenblum, D.S., Wolf, A.L.: Design and evaluation of a wide-area event notification service. ACM Trans. Comput. Syst. 19(3), 332–383 (2001)CrossRef Carzaniga, A., Rosenblum, D.S., Wolf, A.L.: Design and evaluation of a wide-area event notification service. ACM Trans. Comput. Syst. 19(3), 332–383 (2001)CrossRef
5.
Zurück zum Zitat Castro, M., Druschel, P., Kermarrec, A.M., Rowstron, A.I.: Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE J. Sel. Areas Commun. 20(8), 1489–1499 (2006)CrossRef Castro, M., Druschel, P., Kermarrec, A.M., Rowstron, A.I.: Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE J. Sel. Areas Commun. 20(8), 1489–1499 (2006)CrossRef
6.
Zurück zum Zitat Choudhury, M.D., Lin, Y.R., Sundaram, H., Candan, K.S., Xie, L., Kelliher, A.: How does the data sampling strategy impact the discovery of information diffusion in social media? In: AAAI Conference on Weblogs and Social Media (ICWSM) (2010) Choudhury, M.D., Lin, Y.R., Sundaram, H., Candan, K.S., Xie, L., Kelliher, A.: How does the data sampling strategy impact the discovery of information diffusion in social media? In: AAAI Conference on Weblogs and Social Media (ICWSM) (2010)
7.
Zurück zum Zitat Eugster, P.T., Felber, P.A., Guerraoui, R., Kermarrec, A.M.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2), 114–131 (2003)CrossRef Eugster, P.T., Felber, P.A., Guerraoui, R., Kermarrec, A.M.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2), 114–131 (2003)CrossRef
8.
Zurück zum Zitat Fidler, E., Jacobsen, H.A., Li, G., Mankovski, S.: The padres distributed publish/subscribe system. In: International Conference on Feature Interactions in Telecommunications and Software Systems (FIW) (2005) Fidler, E., Jacobsen, H.A., Li, G., Mankovski, S.: The padres distributed publish/subscribe system. In: International Conference on Feature Interactions in Telecommunications and Software Systems (FIW) (2005)
9.
Zurück zum Zitat Gedik, B., Liu, L.: Quality-aware distributed data delivery for continuous query services. In: ACM International Conference on Management of Data (SIGMOD) (2006) Gedik, B., Liu, L.: Quality-aware distributed data delivery for continuous query services. In: ACM International Conference on Management of Data (SIGMOD) (2006)
10.
Zurück zum Zitat Kale, S., Hazan, E., Cao, F., Singh, J.P.: Analysis and algorithms for content-based event matching. In: International Workshop on Distributed Event-Based Systems (DEBS), pp. 363–369 (2005) Kale, S., Hazan, E., Cao, F., Singh, J.P.: Analysis and algorithms for content-based event matching. In: International Workshop on Distributed Event-Based Systems (DEBS), pp. 363–369 (2005)
11.
Zurück zum Zitat Karanasos, K., Katsifodimos, A., Manolescu, I.: Delta: Scalable data dissemination under capacity constraints. VLDB Endow. (PVLDB) 7(4), 217–228 (2013)CrossRef Karanasos, K., Katsifodimos, A., Manolescu, I.: Delta: Scalable data dissemination under capacity constraints. VLDB Endow. (PVLDB) 7(4), 217–228 (2013)CrossRef
12.
Zurück zum Zitat Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRef Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRef
13.
Zurück zum Zitat Li, M., Ye, F., Kim, M., Chen, H., Lei, H.: Bluedove: A scalable and elastic publish/subscribe service. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1254–1265 (2011) Li, M., Ye, F., Kim, M., Chen, H., Lei, H.: Bluedove: A scalable and elastic publish/subscribe service. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1254–1265 (2011)
14.
Zurück zum Zitat Liu, L., Pu, C., Tang, W.: Continual queries for internet scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. 11(4), 610–628 (1999)CrossRef Liu, L., Pu, C., Tang, W.: Continual queries for internet scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. 11(4), 610–628 (1999)CrossRef
16.
Zurück zum Zitat Papaemmanouil, O., Çetintemel, U.: SemCast: Semantic multicast for content-based stream dissemination. In: IEEE International Conference on Data Engineering (ICDE), pp. 37–42 (2004) Papaemmanouil, O., Çetintemel, U.: SemCast: Semantic multicast for content-based stream dissemination. In: IEEE International Conference on Data Engineering (ICDE), pp. 37–42 (2004)
17.
Zurück zum Zitat Pietzuch, P.R., Bacon, J.M.: Hermes: A distributed event-based middleware architecture. In: IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 611–618 (2002) Pietzuch, P.R., Bacon, J.M.: Hermes: A distributed event-based middleware architecture. In: IEEE International Conference on Distributed Computing Systems (ICDCS), pp. 611–618 (2002)
18.
Zurück zum Zitat Porter, M.F.: An algorithm for suffix stripping. Program: electronic library and information systems pp. 313–316 (1997) Porter, M.F.: An algorithm for suffix stripping. Program: electronic library and information systems pp. 313–316 (1997)
19.
Zurück zum Zitat Ramasubramanian, V., Peterson, R., Sirer, E.G.: Corona: a high performance publish-subscribe system for the world wide web. In: USENIX Conference on Networked Systems Design and Implementation (NSDI) (2006) Ramasubramanian, V., Peterson, R., Sirer, E.G.: Corona: a high performance publish-subscribe system for the world wide web. In: USENIX Conference on Networked Systems Design and Implementation (NSDI) (2006)
20.
Zurück zum Zitat Rose, I., Murty, R., Pietzuch, P., Ledlie, J., Roussopoulos, M., Welsh, M.: Cobra: Contentbased filtering and aggregation of blogs and rss feeds. In: USENIX Conference on Networked Systems Design and Implementation (NSDI), pp. 3–3 (2007) Rose, I., Murty, R., Pietzuch, P., Ledlie, J., Roussopoulos, M., Welsh, M.: Cobra: Contentbased filtering and aggregation of blogs and rss feeds. In: USENIX Conference on Networked Systems Design and Implementation (NSDI), pp. 3–3 (2007)
21.
Zurück zum Zitat Sadoghi, M., Jacobsen, H.A.: Be-tree: an index structure to efficiently match boolean expressions over high-dimensional discrete space. In: ACM International Conference on Management of Data (SIGMOD), pp. 637–648 (2011) Sadoghi, M., Jacobsen, H.A.: Be-tree: an index structure to efficiently match boolean expressions over high-dimensional discrete space. In: ACM International Conference on Management of Data (SIGMOD), pp. 637–648 (2011)
23.
24.
Zurück zum Zitat Shraer, A., Gurevich, M., Fontoura, M., Josifovski, V.: Top-k publish-subscribe for social annotation of news. VLDB Endow. (PVLDB) 6(6), 385–396 (2013)CrossRef Shraer, A., Gurevich, M., Fontoura, M., Josifovski, V.: Top-k publish-subscribe for social annotation of news. VLDB Endow. (PVLDB) 6(6), 385–396 (2013)CrossRef
25.
Zurück zum Zitat Tatbul, N., Çetintemel, U., Zdonik, S., Cherniack, M., Stonebraker, M.: Load shedding in a data stream manager. In: Very Large Databases Conference (VLDB), pp. 309–320 (2003) Tatbul, N., Çetintemel, U., Zdonik, S., Cherniack, M., Stonebraker, M.: Load shedding in a data stream manager. In: Very Large Databases Conference (VLDB), pp. 309–320 (2003)
26.
Zurück zum Zitat TIBCO Inc., Tib/rendezvous. White Paper (1999) TIBCO Inc., Tib/rendezvous. White Paper (1999)
28.
Zurück zum Zitat Yan, T., Garcia-Molina, H.: Index structures for selective dissemination of information under the boolean model. ACM Trans. Database Syst. 19(2), 332–364 (1994)CrossRef Yan, T., Garcia-Molina, H.: Index structures for selective dissemination of information under the boolean model. ACM Trans. Database Syst. 19(2), 332–364 (1994)CrossRef
Metadaten
Titel
S-TM: scalable streaming short text matching
verfasst von
Fuat Basık
Buğra Gedik
Hakan Ferhatosmanoğlu
Mert Emin Kalender
Publikationsdatum
01.12.2015
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 6/2015
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-015-0404-3

Weitere Artikel der Ausgabe 6/2015

The VLDB Journal 6/2015 Zur Ausgabe

Premium Partner