
2019 | Book

Real-Time Business Intelligence and Analytics

International Workshops, BIRTE 2015, Kohala Coast, HI, USA, August 31, 2015, BIRTE 2016, New Delhi, India, September 5, 2016, BIRTE 2017, Munich, Germany, August 28, 2017, Revised Selected Papers


About this book

This book constitutes the thoroughly refereed conference proceedings of the BIRTE workshops listed below, which were held in conjunction with VLDB, the International Conference on Very Large Data Bases:

9th International Workshop on Business Intelligence for the Real-Time Enterprise, BIRTE 2015, held in Kohala Coast, Hawaii, in August 2015
10th International Workshop on Enabling Real-Time Business Intelligence, BIRTE 2016, held in New Delhi, India, in September 2016
11th International Workshop on Real-Time Business Intelligence and Analytics, BIRTE 2017, held in Munich, Germany, in August 2017

The BIRTE workshop series provides a forum for the discussion and advancement of the science and engineering enabling real-time business intelligence and the novel applications that build on these foundational techniques.

The book includes five selected papers from BIRTE 2015; five selected papers from BIRTE 2016; and three selected papers from BIRTE 2017.

Table of Contents

Frontmatter

BIRTE 2015

Frontmatter
RILCA: Collecting and Analyzing User-Behavior Information in Instant Search Using Relational DBMS
Abstract
An instant-search engine computes answers immediately as a user types a query character by character. In this paper, we study how to systematically collect information about user behaviors as users interact with an instant-search engine, especially in a real-time environment. We present a solution, called RILCA, which uses front-end techniques to keep track of rich information about user activities. This information provides more insights than methods based on traditional Web servers such as Apache. We store the log records in a relational DBMS and leverage its existing powerful capabilities to analyze the log records efficiently. We study how to use a dashboard to monitor and analyze log records in real time. We conducted experiments on real data sets collected from two live systems to show the benefits and efficiency of these techniques.
Taewoo Kim, Chen Li
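To make the logging-in-a-DBMS idea concrete, here is a minimal sketch in Python with SQLite: per-keystroke activity events land in a relational table and are analyzed with plain SQL, dashboard-style. The schema, event names, and query are illustrative assumptions, not taken from the paper.

```python
# Sketch: store instant-search activity logs in a relational DBMS, analyze with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE activity_log (
        session_id TEXT,
        ts         REAL,   -- client-side timestamp (seconds)
        prefix     TEXT,   -- query prefix typed so far
        event      TEXT    -- 'keystroke', 'result_click', ...
    )""")

events = [
    ("s1", 0.00, "d",    "keystroke"),
    ("s1", 0.35, "db",   "keystroke"),
    ("s1", 0.80, "dbms", "keystroke"),
    ("s1", 2.10, "dbms", "result_click"),
]
conn.executemany("INSERT INTO activity_log VALUES (?,?,?,?)", events)

# One dashboard-style query: time from first keystroke to first click, per session.
row = conn.execute("""
    SELECT session_id,
           MIN(CASE WHEN event = 'result_click' THEN ts END) - MIN(ts) AS seconds_to_click
    FROM activity_log
    GROUP BY session_id
""").fetchone()
print(row)  # ('s1', 2.1)
```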
A Federated In-memory Database System for Life Sciences
Abstract
Cloud computing has become a synonym for the elastic provision of shared computing resources operated by a professional service provider. However, data needs to be transferred from local systems to the shared resources for processing, which might result in significant processing delays and the need to comply with special data privacy acts. Based on the concrete requirements of life sciences research, we share our experience in integrating existing decentralized computing resources to form a federated in-memory database system. Our approach combines advantages of cloud computing, such as efficient use of hardware resources and provisioning of managed software, whilst sensitive data are stored and processed on local hardware only.
Matthieu-P. Schapranow, Cindy Perscheid, Alf Wachsmann, Martin Siegert, Cornelius Bock, Friedrich Horschig, Franz Liedke, Janos Brauer, Hasso Plattner
An Integrated Architecture for Real-Time and Historical Analytics in Financial Services
Abstract
The integration of historical data has become one of the most pressing issues for the financial services industry: trading floors rely on real-time analytics of ticker data with a very strong emphasis on speed, not scale; yet a large number of critical tasks, including daily reporting and backtesting of models, put the emphasis on scale. As a result, implementers continuously face the challenge of meeting contradicting requirements and must either scale real-time analytics technology at considerable cost, or deploy separate stacks for different tasks and keep them synchronized, a solution that is no less costly.
In this paper, we propose Adaptive Data Virtualization (ADV) as an alternative approach to overcome this problem. ADV lets applications use different data management technologies without the need for database migrations or re-configuration of applications. We review the incumbent technology, compare it with the recent crop of MPP databases, and draw up a strategy that, using ADV, lets enterprises flexibly use the right tool for the right job. We conclude the paper by summarizing our initial experience working with customers in the field and outline an agenda for future research.
Lyublena Antova, Rhonda Baldwin, Zhongxian Gu, F. Michael Waas
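A minimal sketch of the virtualization idea, under the assumption that routing can be decided from a simple query property: applications talk to one endpoint while interchangeable engines run the work underneath. The engine classes and the routing rule are invented for illustration; they are not the authors' implementation.

```python
# Sketch: one application-facing endpoint dispatching to different engines.
from dataclasses import dataclass

@dataclass
class Query:
    sql: str
    scans_history: bool  # e.g., daily reports and backtests touch historical data

class RealTimeEngine:
    def run(self, sql: str) -> str:
        return f"[in-memory ticker store] {sql}"

class MppWarehouse:
    def run(self, sql: str) -> str:
        return f"[MPP warehouse] {sql}"

class VirtualizedEndpoint:
    """Single connection point for applications; engines can change underneath."""
    def __init__(self):
        self.fast = RealTimeEngine()
        self.big = MppWarehouse()

    def run(self, q: Query) -> str:
        engine = self.big if q.scans_history else self.fast
        return engine.run(q.sql)

endpoint = VirtualizedEndpoint()
print(endpoint.run(Query("SELECT last FROM ticks WHERE sym='XYZ'", False)))
print(endpoint.run(Query("SELECT avg(px) FROM ticks_2015", True)))
```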
Processing of Aggregate Continuous Queries in a Distributed Environment
Abstract
Data Stream Management Systems (DSMSs) performing online analytics rely on the efficient execution of large numbers of Aggregate Continuous Queries (ACQs). In this paper, we study the problem of generating high-quality execution plans for ACQs in DSMSs deployed on multi-node (multi-core and multi-processor) distributed environments. Towards this goal, we classify optimizers based on how they partition the workload among computing nodes and on their usage of the concept of Weavability, which is utilized by the state-of-the-art WeaveShare optimizer to selectively combine ACQs and produce low-cost execution plans for single-node environments. For each category, we propose an optimizer, which either adopts an existing strategy or develops a new one for assigning and grouping ACQs to computing nodes. We implement and experimentally compare all of our proposed optimizers in terms of (1) keeping the total cost of the ACQ execution plan low and (2) balancing the load among the computing nodes. Our extensive experimental evaluation shows that our newly developed Weave-Group to Nodes (\(WG_{TN}\)) and Weave-Group Inserted (\(WG_{I}\)) optimizers produce plans of significantly higher quality than the rest of the optimizers. \(WG_{TN}\) minimizes the total cost, making it more suitable from a client perspective, and \(WG_{I}\) achieves load balancing, making it more suitable from a system perspective.
Anatoli U. Shein, Panos K. Chrysanthis, Alexandros Labrinidis
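For intuition about the assignment side of the problem, here is a minimal sketch of greedy least-loaded placement of ACQs onto nodes (the classic LPT heuristic). It is explicitly not the paper's \(WG_{TN}\) or \(WG_{I}\) optimizer, which additionally exploit Weavability to share computation among combined ACQs.

```python
# Sketch: place each ACQ on the currently least-loaded node.
import heapq

def assign_least_loaded(acq_costs: list[float], num_nodes: int) -> list[list[int]]:
    """Return, per node, the indices of the ACQs assigned to it."""
    heap = [(0.0, n) for n in range(num_nodes)]  # (load, node)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_nodes)]
    # Visiting ACQs in descending cost order tightens the balance (LPT rule).
    for i in sorted(range(len(acq_costs)), key=lambda i: -acq_costs[i]):
        load, node = heapq.heappop(heap)
        plan[node].append(i)
        heapq.heappush(heap, (load + acq_costs[i], node))
    return plan

print(assign_least_loaded([5.0, 3.0, 3.0, 2.0, 1.0], 2))
# [[0, 3], [1, 2, 4]] -> node loads 7.0 and 7.0
```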
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
Abstract
Google’s Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems. While most distributed systems handle machine-level failures well, handling datacenter-level failures is less common. In our experience, handling datacenter-level failures is critical for running true high availability systems. Most of our systems (e.g. Photon, F1, Mesa) now support multi-homing as a fundamental design property. Multi-homed systems run live in multiple datacenters all the time, adaptively moving load between datacenters, with the ability to handle outages of any scale completely transparently.
This paper focuses primarily on stream processing systems, and describes our general approaches for building high availability multi-homed systems, discusses common challenges and solutions, and shares what we have learned in building and running these large-scale systems for over ten years.
Ashish Gupta, Jeff Shute

BIRTE 2016

Frontmatter
Past and Future Steps for Adaptive Storage Data Systems: From Shallow to Deep Adaptivity
Abstract
Data systems with adaptive storage can autonomously change their behavior by altering how data is stored and accessed. Such systems have been studied primarily for the case of adaptive indexing, to automatically create the right indexes at the right granularity. More recently, work on adaptive loading and adaptive data layouts brought even more flexibility. We survey this work and describe the need for even deeper adaptivity that goes beyond adjusting knobs in a single architecture and can instead adapt the fundamental architecture of a data system to drastically alter its behavior.
Stratos Idreos, Manos Athanassoulis, Niv Dayan, Demi Guo, Mike S. Kester, Lukas Maas, Kostas Zoumpatianos
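As a concrete instance of the adaptive indexing this survey covers, here is a minimal sketch of database cracking: each range query physically reorganizes the column around its bounds, so an index emerges as a side effect of the workload. Real crackers maintain a crack index over the resulting pieces; this simplified sketch omits that.

```python
# Sketch: adaptive indexing via database cracking.
def crack(column: list[int], low: int, high: int) -> list[int]:
    """Answer SELECT * WHERE low <= v < high, reorganizing the column in place."""
    lo_part  = [v for v in column if v < low]
    mid_part = [v for v in column if low <= v < high]
    hi_part  = [v for v in column if v >= high]
    column[:] = lo_part + mid_part + hi_part  # column now cracked into 3 pieces
    return mid_part

col = [13, 4, 55, 9, 18, 2, 40]
print(crack(col, 5, 20))  # [13, 9, 18]
print(col)                # [4, 2, 13, 9, 18, 55, 40] -- future queries scan less
```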
PolyRecs: Improving Page-View Rates Using Real-Time Data Analysis
Abstract
In this paper, we outline our effort to enhance the page-view rates of e-content that online customers read on a popular portal in Greece. The portal, athensvoice.gr, provides continuous coverage of news, politics, science, the arts, and opinion columns, and its customers generate approximately 6 million unique visits per month. The objectives of our effort were gains both in advertisement and in further e-content market penetration; the effort yielded the PolyRecs system, now in production for more than a year. In designing PolyRecs, we were primarily concerned with the use of pages in real time, and to this end we elected to utilize five key criteria to achieve the aforementioned goals. We selected criteria for which we could obtain pertinent statistics without compromising performance, allowing real-time exploitation of user page-views on the go. In addition, we aimed not only to perform effective on-the-fly calculations of what might interest browsing individuals at specific points in time, but also to produce accurate results capable of improving the user experience. The key factors exploited by PolyRecs entail features of both collaborative and content-based systems. Once operational, PolyRecs helped the news portal attain an average increase of 6.3% in overall page-views in its traffic. To ascertain the utility of PolyRecs, we provide a brief economic analysis in terms of measured performance indicators and identify the degree to which each key factor contributes. Last but not least, we built PolyRecs as a domain-agnostic hybrid-recommendation system because we wanted it to function successfully regardless of the underlying data and/or content infrastructure.
Mihalis Papakonstantinou, Alex Delis
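A minimal sketch of the hybrid-scoring idea: blend several per-criterion scores into one ranking. The criterion names and weights below are placeholders; the paper's five actual factors and their weighting are not reproduced here.

```python
# Sketch: weighted blend of per-criterion scores for a hybrid recommender.
def hybrid_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[c] * scores.get(c, 0.0) for c in weights)

weights = {"co_views": 0.3, "content_sim": 0.3, "recency": 0.2,
           "popularity": 0.1, "same_section": 0.1}

candidates = {
    "article_a": {"co_views": 0.9, "content_sim": 0.4, "recency": 0.8},
    "article_b": {"co_views": 0.2, "content_sim": 0.9, "recency": 0.1,
                  "popularity": 1.0},
}
ranked = sorted(candidates, key=lambda a: hybrid_score(candidates[a], weights),
                reverse=True)
print(ranked)  # ['article_a', 'article_b']
```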
Enabling Real Time Analytics over Raw XML Data
Abstract
The data generated by many applications is in a semi-structured format, such as XML. This data can be used for analytics only after shredding and storing it in a structured format, a process known as Extract-Transform-Load (ETL). However, the ETL process is often time-consuming, so crucial time-sensitive insights can be lost or become un-actionable. Hence, this paper poses the following question: how do we expose analytical insights in the raw XML data? We address this novel problem by discovering additional information from the raw semi-structured data repository, called complementary information (CI), for a given user query. Experiments with real as well as synthetic data show that the discovered CI is relevant in the context of the given user query, nontrivial, and has high precision. The recall is also found to be high for most queries. Crowd-sourced feedback on the discovered CI corroborates these findings, showing that our system is able to discover highly relevant and potentially useful CI in real-world XML data repositories. The concepts behind our technique are generic and can be used for other semi-structured data formats as well.
Manoj K. Agarwal, Krithi Ramamritham, Prashant Agarwal
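A minimal sketch of the underlying premise, namely running an analytical query directly over raw XML without an ETL step first. The tag names and aggregation are invented for illustration; the paper's CI-discovery technique itself is not shown.

```python
# Sketch: aggregate directly over raw XML, skipping the shred-and-load step.
import xml.etree.ElementTree as ET

raw = """<orders>
  <order region="EU" amount="120.0"/>
  <order region="EU" amount="80.0"/>
  <order region="US" amount="200.0"/>
</orders>"""

totals: dict[str, float] = {}
for order in ET.fromstring(raw).iter("order"):
    region = order.get("region")
    totals[region] = totals.get(region, 0.0) + float(order.get("amount"))
print(totals)  # {'EU': 200.0, 'US': 200.0}
```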
Multi-engine Analytics with IReS
Abstract
We present IReS, the Intelligent Resource Scheduler, which is able to abstractly describe, optimize, and execute any batch analytics workflow with respect to a multi-objective policy. Relying on cost and performance models of the required tasks over the available platforms, IReS allocates distinct workflow parts to the most advantageous execution and/or storage engine among the available ones and decides on the exact amount of resources provisioned. Moreover, IReS efficiently adapts to the current cluster/engine conditions and recovers from failures by effectively monitoring the workflow execution in real time. Our current prototype has been tested on a plethora of business-driven and synthetic workflows, demonstrating its potential to yield significant gains in cost and performance compared to statically scheduled, single-engine executions. IReS incurs only marginal overhead on workflow execution performance, managing to discover an approximate Pareto-optimal set of execution plans within a few seconds.
Katerina Doka, Ioannis Mytilinis, Nikolaos Papailiou, Victor Giannakouris, Dimitrios Tsoumakos, Nectarios Koziris
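A minimal sketch of the plan-selection step the abstract alludes to: from candidate execution plans scored on two objectives (say, monetary cost and runtime), keep only the Pareto-optimal ones. Plan names and numbers are made up for illustration.

```python
# Sketch: filter candidate plans down to the Pareto front on (cost, runtime).
def pareto_front(plans: dict[str, tuple[float, float]]) -> list[str]:
    """Keep plans not dominated on both objectives (lower cost AND lower runtime)."""
    front = []
    for name, (cost, time) in plans.items():
        dominated = any(c <= cost and t <= time and (c, t) != (cost, time)
                        for c, t in plans.values())
        if not dominated:
            front.append(name)
    return front

plans = {
    "spark_only":   (10.0, 120.0),
    "hive_only":    ( 6.0, 300.0),
    "mixed_engine": ( 7.0, 150.0),
    "overpriced":   (12.0, 200.0),  # dominated by spark_only
}
print(pareto_front(plans))  # ['spark_only', 'hive_only', 'mixed_engine']
```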
Ubiq: A Scalable and Fault-Tolerant Log Processing Infrastructure
Abstract
Most of today’s Internet applications generate vast amounts of data (typically, in the form of event logs) that need to be processed and analyzed for detailed reporting, enhancing the user experience, and increasing monetization. In this paper, we describe the architecture of Ubiq, a geographically distributed framework for processing continuously growing log files in real time with high scalability, high availability, and low latency. The Ubiq framework fully tolerates infrastructure degradation and datacenter-level outages without any manual intervention. It also guarantees exactly-once semantics for application pipelines to process logs as a collection of multiple events. Ubiq has been in production for Google’s advertising system for many years and has served as a critical log processing framework for several dozen pipelines. Our production deployment demonstrates linear scalability with machine resources, extremely high availability even with underlying infrastructure failures, and an end-to-end latency of under a minute.
Venkatesh Basker, Manish Bhatia, Vinny Ganeshan, Ashish Gupta, Shan He, Scott Holzer, Haifeng Jiang, Monica Chawathe Lenart, Navin Melville, Tianhao Qiu, Namit Sikka, Manpreet Singh, Alexander Smolyanov, Yuri Vasilevski, Shivakumar Venkataraman, Divyakant Agrawal
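One ingredient behind exactly-once semantics can be sketched as idempotent consumption: track already-processed event ids so that redelivery after a failure does not double-count. This only illustrates the semantics; Ubiq's actual distributed, datacenter-fault-tolerant mechanism is far more involved.

```python
# Sketch: idempotent log consumption as a building block for exactly-once processing.
def process_exactly_once(events, seen: set, apply) -> None:
    for event_id, payload in events:
        if event_id in seen:   # duplicate delivery after a retry
            continue
        apply(payload)
        seen.add(event_id)     # in production: committed atomically with apply

counter = {"clicks": 0}
seen: set[str] = set()

def add_clicks(n: int) -> None:
    counter["clicks"] += n

batch = [("e1", 1), ("e2", 1)]
process_exactly_once(batch, seen, add_clicks)
process_exactly_once(batch, seen, add_clicks)  # same batch redelivered after a failure
print(counter["clicks"])                       # 2, not 4
```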

BIRTE 2017

Frontmatter
Towards Interactive Data Exploration
Abstract
Enabling interactive visualization over new datasets at “human speed” is key to democratizing data science and maximizing human productivity. In this work, we first argue why existing analytics infrastructures do not support interactive data exploration and outline the challenges and opportunities of building a system specifically designed for interactive data exploration. Furthermore, we present the results of building IDEA, a new type of system for interactive data exploration that is specifically designed to integrate seamlessly with existing data management landscapes and allow users to explore their data instantly without expensive data preparation costs. Finally, we discuss other important considerations for interactive data exploration systems including benchmarking, natural language interfaces, as well as interactive machine learning.
Carsten Binnig, Fuat Basık, Benedetto Buratti, Ugur Cetintemel, Yeounoh Chung, Andrew Crotty, Cyrus Cousins, Dylan Ebert, Philipp Eichmann, Alex Galakatos, Benjamin Hättasch, Amir Ilkhechi, Tim Kraska, Zeyuan Shang, Isabella Tromba, Arif Usta, Prasetya Utama, Eli Upfal, Linnan Wang, Nathaniel Weir, Robert Zeleznik, Emanuel Zgraggen
DCS: A Policy Framework for the Detection of Correlated Data Streams
Abstract
There is an increasing demand for real-time analysis of large volumes of data streams that are produced at high velocity. The most recent data needs to be processed within a specified delay target in order for the analysis to lead to actionable results. To this end, in this paper, we present an effective solution for detecting the correlation of such data streams within a micro-batch of a fixed time interval. Our solution, coined DCS (Detection of Correlated Data Streams), combines (1) incremental sliding-window computation of aggregates, to avoid unnecessary re-computations, (2) intelligent scheduling of computation steps and operations, driven by a utility function within a micro-batch, and (3) an exploration policy that tunes the utility function. Specifically, we propose nine policies that explore correlated pairs of live data streams across consecutive micro-batches. Our experimental evaluation on a real-world dataset shows that some policies are more suitable for identifying high numbers of correlated pairs of live data streams already known from previous micro-batches, while others are more suitable for identifying previously unseen pairs of live data streams across consecutive micro-batches.
Rakan Alseghayer, Daniel Petrov, Panos K. Chrysanthis, Mohamed Sharaf, Alexandros Labrinidis
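A minimal sketch of the incremental sliding-window computation mentioned in the abstract: maintain running sums so that each window slide updates the Pearson correlation of a stream pair in constant time instead of rescanning the window. DCS's scheduling and exploration policies are not modeled here.

```python
# Sketch: O(1) sliding-window Pearson correlation via maintained running sums.
from collections import deque
from math import sqrt

class PairCorrelation:
    def __init__(self, window: int):
        self.w = window
        self.xs, self.ys = deque(), deque()
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def push(self, x: float, y: float) -> float:
        self.xs.append(x); self.ys.append(y)
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y
        if len(self.xs) > self.w:  # evict the oldest pair from the sums
            ox, oy = self.xs.popleft(), self.ys.popleft()
            self.sx -= ox; self.sy -= oy
            self.sxx -= ox * ox; self.syy -= oy * oy; self.sxy -= ox * oy
        n = len(self.xs)
        cov = n * self.sxy - self.sx * self.sy
        var = (n * self.sxx - self.sx ** 2) * (n * self.syy - self.sy ** 2)
        return cov / sqrt(var) if var > 0 else 0.0

pc = PairCorrelation(window=3)
for x, y in [(1, 2), (2, 4), (3, 6), (4, 7)]:
    r = pc.push(x, y)
print(round(r, 3))  # 0.982 -- correlation over the last 3 pairs
```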
Towards Dynamic Data Placement for Polystore Ingestion
Abstract
Integrating low-latency data streaming into data warehouse architectures has become an important enhancement to support modern data warehousing applications. In these architectures, heterogeneous workloads with data ingestion and analytical queries must be executed with strict performance guarantees. Furthermore, the data warehouse may consist of multiple different types of storage engines (a.k.a. polystores or multi-stores). A paramount problem is data placement; different workload scenarios call for different data placement designs. Moreover, workload conditions change frequently. In this paper, we provide evidence that a dynamic, workload-driven approach is needed for data placement in polystores with low-latency data ingestion support. We study the problem based on the characteristics of the TPC-DI benchmark in the context of an abbreviated polystore that consists of S-Store and Postgres.
Jiang Du, John Meehan, Nesime Tatbul, Stan Zdonik
Backmatter
Metadata

Title
Real-Time Business Intelligence and Analytics
Edited by
Malu Castellanos
Dr. Panos K. Chrysanthis
Konstantinos Pelechrinis
Copyright Year
2019
Electronic ISBN
978-3-030-24124-7
Print ISBN
978-3-030-24123-0
DOI
https://doi.org/10.1007/978-3-030-24124-7