
2012 | Book

Big Data Analytics

First International Conference, BDA 2012, New Delhi, India, December 24-26, 2012. Proceedings

Edited by: Srinath Srinivasa, Vasudha Bhatnagar

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this Book

This book constitutes the refereed proceedings of the First International Conference on Big Data Analytics, BDA 2012, held in New Delhi, India, in December 2012. The 5 regular papers and 5 short papers presented were carefully reviewed and selected from 42 submissions. The volume also contains two tutorial papers in the section "Perspectives on Big Data Analytics". The regular contributions are organized in topical sections on: data analytics applications; knowledge discovery through information extraction; and data models in analytics.

Table of Contents

Frontmatter

Perspectives on Big Data Analytics

Scalable Analytics – Algorithms and Systems
Abstract
The amount of data collected is increasing while the time window to leverage it is decreasing. To satisfy these twin requirements, both algorithms and systems have to keep pace. The goal of this tutorial is to provide an overview of the common problems, algorithms, and systems for handling large-scale analytics tasks.
Srinivasan H. Sengamedu
Big-Data – Theoretical, Engineering and Analytics Perspective
Abstract
The advent of social networks, the increasing speed of computer networks, and the increasing processing power (through multi-cores) have given enterprise and end users the ability to exploit big data. The focus of this tutorial is to explore some of the fundamental trends that led to the Big Data hype (and reality), as well as to explain the analytics, engineering, and theoretical trends in this space.
Vijay Srinivas Agneeswaran

Data Analytics Applications

A Comparison of Statistical Machine Learning Methods in Heartbeat Detection and Classification
Abstract
In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital, as some heartbeat irregularities are time consuming to detect. Therefore, analysis of electrocardiogram (ECG) signals is an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval- and amplitude-based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms, focusing especially on a type of arrhythmia known as ventricular ectopic beats (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and choice of classifier for a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, the recommendation of a detection methodology to employ in a practical setting, and the extension of the notion of a mixture of experts to a larger class of algorithms.
Tony Basil, Bollepalli S. Chandra, Choudur Lakshminarayan
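To make the feature layout concrete, here is a minimal sketch of the kind of pipeline the abstract describes: interval and amplitude features plus a few raw samples form the feature vector, and off-the-shelf classifiers are compared. The synthetic signal, the specific classifiers, and the exact feature layout are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def beat_features(signal, r_peak, prev_rr, next_rr, n_samples=8):
    # Interval and amplitude features plus a few raw samples around the R peak.
    window = signal[r_peak - n_samples // 2 : r_peak + n_samples // 2]
    return np.concatenate(([prev_rr, next_rr, signal[r_peak]], window))

signal = rng.normal(size=2000)                    # stand-in for an ECG record
peaks = np.arange(20, 1980, 10)                   # stand-in R-peak locations
X = np.array([beat_features(signal, p, 0.8, 0.8) for p in peaks])
y = rng.integers(0, 2, size=len(X))               # toy labels: 0 = normal, 1 = VEB

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for clf in (SVC(), KNeighborsClassifier()):
    print(type(clf).__name__, clf.fit(X_tr, y_tr).score(X_te, y_te))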
Enhanced Query-By-Object Approach for Information Requirement Elicitation in Large Databases
Abstract
Information Requirement Elicitation (IRE) recommends a framework for developing interactive interfaces that allow users to access database systems without prior knowledge of a query language. An approach called 'Query-by-Object' (QBO) has been proposed in the literature for IRE, exploiting simple calculator-like operations. However, the QBO approach assumes that the underlying database is simple and contains a few small tables. In this paper, we propose an enhanced QBO approach called Query-by-Topics (QBT) for designing calculator-like user interfaces over large databases. We use methodologies for clustering database entities and discovering topical structures to represent objects at a higher level of abstraction. The QBO approach is then enhanced to allow users to query by topics (QBT). We developed a prototype system based on QBT and conducted experimental studies to show the effectiveness of the proposed approach.
Ammar Yasir, Mittapally Kumara Swamy, Polepalli Krishna Reddy, Subhash Bhalla
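As a rough illustration of the QBT idea, the sketch below clusters tables into "topics" so a calculator-like interface can present them at a higher level of abstraction. The toy schema and the choice of TF-IDF plus k-means are assumptions, not the authors' method.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tables = {
    "customer": "customer name address contact",
    "orders": "order date customer total amount",
    "invoice": "invoice billing amount customer",
    "product": "product name category price",
    "stock": "product warehouse quantity",
}
# Cluster tables by the text describing them; each cluster is one "topic".
X = TfidfVectorizer().fit_transform(tables.values())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, topic in zip(tables, labels):
    print(topic, name)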
Cloud Computing and Big Data Analytics: What Is New from Databases Perspective?
Abstract
Many industries, such as telecom, health care, retail, pharmaceuticals, and financial services, generate large amounts of data. Gaining critical business insights by querying and analyzing such massive amounts of data is becoming the need of the hour. Traditional warehouses and the solutions built around them are unable to provide reasonable response times while handling expanding data volumes. One can either perform analytics on big volumes once in days, or perform transactions on small amounts of data in seconds. The new requirements demand real-time or near-real-time responses over huge amounts of data. In this paper we outline the challenges in analyzing big data, both for data at rest and for data in motion. For big data at rest we describe two kinds of systems: (1) NoSQL systems for interactive data-serving environments; and (2) systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop. The NoSQL systems are designed around a simpler key-value data model with built-in sharding, so they work seamlessly in a distributed cloud-based environment. In contrast, Hadoop-based systems can run long-running decision-support and analytical queries that consume, and possibly produce, bulk data. For processing data in motion, we present use cases and illustrative algorithms of data stream management systems (DSMS). We also illustrate applications that can use these two kinds of systems to quickly process massive amounts of data.
Rajeev Gupta, Himanshu Gupta, Mukesh Mohania
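The built-in sharding the survey attributes to NoSQL key-value stores can be pictured with a toy hash-partitioned store; real systems add consistent hashing and replication, which this sketch omits.

import hashlib

NODES = ["node-0", "node-1", "node-2"]
store = {n: {} for n in NODES}

def shard(key: str) -> str:
    # Hash the key to pick the node that owns it.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def put(key, value):
    store[shard(key)][key] = value

def get(key):
    return store[shard(key)].get(key)

put("user:42", {"name": "Asha"})
print(shard("user:42"), get("user:42"))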
A Model of Virtual Crop Labs as a Cloud Computing Application for Enhancing Practical Agricultural Education
Abstract
A model of crop-specific virtual labs is proposed to improve practical agricultural education, taking the agricultural education system in India as the setting. In agricultural education, theoretical concepts are imparted through classroom lectures and laboratory skills in dedicated laboratories. Practical agricultural education is imparted by exposing students to field problems through the Rural Agricultural Work Experience Program (RAWEP), experiential learning, and internships. In spite of these efforts, there is a feeling that the practical skills imparted to students are not up to the desired level. We therefore have to devise new ways and means to enhance the practical knowledge and skills of agricultural students, so that they understand real-time crop problems and can provide corrective steps at the field level. Recent developments in ICTs provide an opportunity to improve practical education by developing virtual crop labs. The virtual crop labs contain well-organized, indexed, and summarized digital data (text, photographs, and video). The digital data corresponds to farm situations reflecting the life cycles of several farms for different crops cultivated under diverse farming conditions. The practical knowledge of students could be improved if we systematically expose them to virtual crop labs along with course teaching. A cloud computing platform can be employed to store huge amounts of data and deliver it to students and other stakeholders online.
Polepalli Krishna Reddy, Basi Bhaskar Reddy, D. Rama Rao

Knowledge Discovery through Information Extraction

Exploiting Schema and Documentation for Summarizing Relational Databases
Abstract
Schema summarization approaches are used for carrying out schema matching and developing user interfaces. Generating a schema summary for a given database is a challenge that involves identifying semantically correlated elements in the database schema. Research efforts have proposed schema summarization approaches that exploit the database schema and the data stored in the database. In this paper, we propose an efficient schema summarization approach that exploits the database schema and the database documentation. We propose a notion of table similarity based on the referential relationships between tables and the similarity of the passages describing the corresponding tables in the database documentation. Using this notion of table similarity, we propose a clustering-based approach for schema summary generation. Experimental results on a benchmark database show the effectiveness of the proposed approach.
Ammar Yasir, Mittapally Kumara Swamy, Polepalli Krishna Reddy
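A minimal sketch of the paper's notion of table similarity might combine a referential-relationship score with the cosine similarity of the documentation passages, then cluster on the result. The equal weighting, the TF-IDF passage similarity, and the toy schema are assumptions, not the authors' exact formulation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tables = ["customer", "orders", "product"]
passages = [
    "Stores customer contact and billing details.",
    "Each order references a customer and its line items.",
    "Catalog of products with prices and categories.",
]
# 1 if a foreign key links the two tables, else 0 (toy referential matrix).
ref = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
doc = cosine_similarity(TfidfVectorizer().fit_transform(passages))
sim = 0.5 * ref + 0.5 * doc            # assumed equal weighting
dist = 1.0 - sim

labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(dist)
print(dict(zip(tables, labels)))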
Faceted Browsing over Social Media
Abstract
The popularity of social media as a medium for sharing information has made extracting information of interest a challenge. In this work we provide a system that can return posts published on social media covering various aspects of a concept being searched. We present a faceted model for navigating social media that provides a consistent, usable, and domain-agnostic method for extracting information from social media. We present a set of domain-independent facets and empirically demonstrate the feasibility of mapping social media content to the facets we chose. Next, we show how to map social media sites, living documents that change periodically, to topics that capture the semantics expressed in them. This mapping is used as a graph to compute the various facets of interest to us. We learn a profile of the content creator, enable content to be mapped to semantic concepts for easy navigation, and detect similarity among sites to either suggest similar pages or determine pages that express different views.
Ullas Nambiar, Tanveer Faruquie, Shamanth Kumar, Fred Morstatter, Huan Liu
Analog Textual Entailment and Spectral Clustering (ATESC) Based Summarization
Abstract
In the domain of single-document automatic extractive text summarization, a recent approach, Logical TextTiling (LTT), has been proposed [1]. In-depth analysis has revealed that LTT's performance is limited by unfair entailment calculation, weak segmentation, and the assignment of equal importance to each segment produced. Because of these drawbacks, the summaries produced in experiments on articles collected from the New York Times website have been of inferior quality. To overcome these limitations, the present paper proposes a novel technique called ATESC (Analog Textual Entailment and Spectral Clustering), which employs analog entailment values in the range [0,1], segmentation using normalized spectral clustering, and the assignment of relative importance to the produced segments based on the scores of their constituent sentences. Finally, a comparative study of the results of LTT and ATESC shows that ATESC produces better-quality summaries in most of the cases tested experimentally.
Anand Gupta, Manpreet Kathuria, Arjun Singh, Ashish Sachdeva, Shruti Bhati
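The segmentation step can be pictured as normalized spectral clustering over a matrix of analog entailment scores in [0,1]. In the sketch below, a random symmetric affinity matrix stands in for real sentence-pair entailment values.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n_sentences = 12
A = rng.uniform(0, 1, size=(n_sentences, n_sentences))
A = (A + A.T) / 2                      # entailment affinities, symmetrized
np.fill_diagonal(A, 1.0)

# Normalized spectral clustering groups sentences into segments.
segments = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(A)
print(segments)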
Economics of Gold Price Movement-Forecasting Analysis Using Macro-economic, Investor Fear and Investor Behavior Features
Abstract
Recent works have shown that besides fundamental factors like the interest rate, the inflation index, and foreign exchange rates, behavioral factors like consumer sentiment and global economic stability play an important role in driving gold prices at shorter time resolutions. In this work we comprehensively model the price movements of gold using three feature sets: macroeconomic factors (the CPI index and foreign exchange rates), investor-fear features (the US Economy Stress Index and the gold ETF Volatility Index), and investor-behavior features (the sentiment of Twitter feeds and the web search volume index from Google). Our results bring insights such as high correlation (up to 0.92 for CPI) between various features, which is a significant improvement over earlier works. Using Granger causality analysis, we validate that the movement in gold price is strongly affected in the short term by some features, consistently over a five-week lag. Finally, we implement forecasting techniques such as the expert model mining system (EMMS) and a binary SVM classifier to demonstrate forecasting performance using the different features.
Jatin Kumar, Tushar Rao, Saket Srivastava
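A minimal sketch of the Granger causality check described above, using statsmodels: test whether lags of a candidate feature improve the prediction of the gold series over a five-lag window. The random series stand in for the real CPI, ETF volatility, and sentiment features.

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
feature = rng.normal(size=200)                            # toy candidate feature
gold = np.roll(feature, 2) + 0.5 * rng.normal(size=200)   # lagged dependence

# statsmodels tests whether the second column Granger-causes the first.
data = np.column_stack([gold, feature])
results = grangercausalitytests(data, maxlag=5)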

Data Models in Analytics

An Efficient Method of Building an Ensemble of Classifiers in Streaming Data
Abstract
To efficiently refine a classifier over streaming data such as sensor data and web log data, we have to decide whether each streaming unlabeled datum is selected or not. Existing methods refine a classifier at regular time intervals: they refine the classifier even if its classification accuracy is high, and keep using it even if the accuracy is low. In this paper, our ensemble method selects, in an online process, the data that should be labeled. The selected data are used to build new classifiers of an ensemble. Our selection methodology uses training data that are applied to generate an ensemble of classifiers over streaming data. We compared the results of our ensemble approach with those of a conventional ensemble approach in which new classifiers are generated periodically. In experiments with ten benchmark data sets, including three real streaming data sets, our ensemble approach generated only 12.9% as many new classifiers as the chunk-based ensemble approach using partially labeled samples, and used an average of 10% labeled samples over the ten data sets. In all the experiments, our ensemble approach produced comparable classification accuracy. We show that our approach can efficiently maintain the performance of an ensemble over streaming data.
Joung Woo Ryu, Mehmed M. Kantardzic, Myung-Won Kim, A. Ra Khil
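The selective idea can be sketched as follows: a new ensemble member is trained only when the ensemble's agreement on an incoming chunk drops below a threshold, instead of on a fixed schedule. The agreement measure, the threshold, and the synthetic chunks are illustrative assumptions, not the authors' exact procedure.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
ensemble = []

def agreement(X):
    # Fraction of per-sample votes matching the majority vote.
    votes = np.array([m.predict(X) for m in ensemble])
    maj = (votes.mean(axis=0) > 0.5).astype(int)
    return (votes == maj).mean()

for chunk in range(20):
    X = rng.normal(size=(50, 5))
    y = (X[:, 0] > 0).astype(int)                # labels queried only if needed
    if not ensemble or agreement(X) < 0.9:       # refine only when unsure
        ensemble.append(DecisionTreeClassifier(max_depth=3).fit(X, y))
print("classifiers built:", len(ensemble), "of 20 chunks")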
I/O Efficient Algorithms for Block Hessenberg Reduction Using Panel Approach
Abstract
Reduction to Hessenberg form is a major performance bottleneck in the computation of the eigenvalues of a nonsymmetric matrix and takes \(O(N^3)\) flops. All the known blocked and unblocked direct Hessenberg reduction algorithms have an I/O complexity of \(O(N^3/B)\). To improve performance by incorporating matrix-matrix operations in the computation, the Hessenberg reduction is usually computed in two steps: the first reduces the matrix to a banded Hessenberg form, and the second further reduces it to Hessenberg form. We propose and analyse the first step of the reduction, i.e., the reduction of a nonsymmetric matrix to banded Hessenberg form of bandwidth t for varying values of N and M (the size of the internal memory), on the external memory model introduced by Aggarwal and Vitter, and show that the reduction can be performed in \(O(N^3/\min\{t,\sqrt{M}\}B)\) I/Os.
Sraban Kumar Mohanty, Gopalan Sajith
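For orientation, the bounds quoted in the abstract can be set side by side (this restates the abstract's own figures; it is not a derivation from the paper):

% Direct reduction versus the proposed first step, as quoted above.
\begin{align*}
  \text{direct Hessenberg reduction:} \quad &O\!\left(\frac{N^3}{B}\right) \text{ I/Os},\\
  \text{reduction to banded form (bandwidth } t\text{):} \quad
    &O\!\left(\frac{N^3}{\min\{t,\sqrt{M}\}\,B}\right) \text{ I/Os},
\end{align*}

so the panel approach saves a factor of \(\min\{t,\sqrt{M}\}\) whenever \(t > 1\).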
Luring Conditions and Their Proof of Necessity through Mathematical Modelling
Abstract
Luring is a social engineering technique used to capture individuals who have malicious intent of breaching the information security defenses of an organization. Certain conditions (Need, Environment, Masquerading Capability, and Unawareness) are necessary for its effective implementation. To the best of our knowledge, the necessity of these conditions has not yet been proved. The proof is essential, as it not only facilitates automation of the luring mechanism but also paves the way for a proof of the completeness of the conditions. The present paper attempts this by invoking three approaches, namely probability, entropy, and proof by contrapositive. The concept of cost effectiveness is also introduced: luring is acceptable if its cost works out to be less than the cost of data theft.
Anand Gupta, Prashant Khurana, Raveena Mathur
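The cost-effectiveness criterion is a simple comparison; a toy reading in code, with all figures invented for illustration only:

def luring_acceptable(cost_of_luring: float, cost_of_theft: float) -> bool:
    # Luring is acceptable when it costs less than the data theft it prevents.
    return cost_of_luring < cost_of_theft

print(luring_acceptable(cost_of_luring=5_000, cost_of_theft=200_000))  # True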
Efficient Recommendation for Smart TV Contents
Abstract
In this paper, we propose an efficient recommendation technique for smart TV content. Our method solves the scalability and sparsity problems from which conventional algorithms suffer in the smart TV environment, characterized by large numbers of users and content items. Our method clusters users into groups of similar preference patterns, extracts a set of users similar to the target user, and then applies user-based collaborative filtering. We experimented with our method on one month of data from real IPTV services. The experimental results showed a success rate of 93.6% and a precision of 77.4%, which are recognized as good performance for smart TV. We also investigate the integration of recommendation methods for more personalized and efficient recommendation; category match ratios for different integrations are compared as a measure of personalized recommendation.
Myung-Won Kim, Eun-Ju Kim, Won-Moon Song, Sung-Yeol Song, A. Ra Khil
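A minimal sketch of the two-stage method described above: cluster users by preference pattern, then run user-based collaborative filtering only inside the target user's cluster. The random ratings matrix and cosine similarity are stand-ins for the IPTV data and the authors' exact measures.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(100, 40)).astype(float)  # users x contents

# Stage 1: cluster users into groups of similar preference patterns.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(ratings)

# Stage 2: user-based CF restricted to the target user's cluster.
target = 0
peers = np.where(clusters == clusters[target])[0]
peers = peers[peers != target]
sims = cosine_similarity(ratings[[target]], ratings[peers])[0]

# Predict scores as a similarity-weighted average of the peers' ratings.
pred = sims @ ratings[peers] / (sims.sum() + 1e-9)
print("top-5 recommended content ids:", np.argsort(pred)[::-1][:5])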
Materialized View Selection Using Simulated Annealing
Abstract
A data warehouse is designed for the purpose of answering decision-making queries. These queries are usually long and exploratory in nature and have high response times when processed against a continuously expanding data warehouse, leading to delays in decision making. One way to reduce this response time is to use materialized views, which store pre-computed summarized information for answering decision queries. Not all views can be materialized, due to their exponential space overhead; further, selecting an optimal subset of views is an NP-complete problem. Several view selection algorithms exist in the literature, most of which are empirical or based on heuristics such as greedy and evolutionary approaches. It has been observed that most of these view selection approaches find it infeasible to select good-quality views for materialization for higher-dimensional data sets. In this paper, a randomized view selection algorithm based on simulated annealing, for selecting the Top-K views from amongst all possible sets of views in a multidimensional lattice, is presented. It is shown that the simulated annealing based view selection algorithm, in comparison to the better-known greedy view selection algorithm, is able to select better-quality views for higher-dimensional data sets.
T. V. Vijay Kumar, Santosh Kumar
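A minimal sketch of simulated-annealing view selection under a toy cost model: a neighbor swaps one selected view for an unselected one, and a geometric cooling schedule is assumed. The real algorithm evaluates views over a multidimensional lattice with a query-evaluation cost model; the random per-view costs here are stand-ins.

import math
import random

random.seed(0)
views = list(range(32))                               # views in the lattice
cost_of = {v: random.uniform(1, 100) for v in views}  # toy cost model
K = 5

def total_cost(selection):
    return sum(cost_of[v] for v in selection)

current = set(random.sample(views, K))
best, T = set(current), 100.0
while T > 0.1:
    # Neighbor: swap one selected view for one unselected view.
    out = random.choice(list(current))
    inn = random.choice([v for v in views if v not in current])
    candidate = (current - {out}) | {inn}
    delta = total_cost(candidate) - total_cost(current)
    # Accept improvements always; accept worse moves with probability e^(-delta/T).
    if delta < 0 or random.random() < math.exp(-delta / T):
        current = candidate
        if total_cost(current) < total_cost(best):
            best = set(current)
    T *= 0.95                                         # geometric cooling
print("selected views:", sorted(best), "cost:", round(total_cost(best), 1))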
Backmatter
Metadata
Title
Big Data Analytics
Edited by
Srinath Srinivasa
Vasudha Bhatnagar
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-35542-4
Print ISBN
978-3-642-35541-7
DOI
https://doi.org/10.1007/978-3-642-35542-4