
About this book

This book presents a language integrated query framework for big data. The continuous, rapid growth of data to volumes of terabytes (1,024 gigabytes) or petabytes (1,048,576 gigabytes) makes the need for a system that can manage and query information from large scale data sources increasingly urgent. Currently available frameworks and methodologies are limited in efficiency and in querying compatibility between data sources, because those sources store information in different structures. For this research, the authors designed and programmed a framework based on the fundamentals of language integrated query to query existing data sources without restructuring the data. A web portal for the framework was also built to enable users to query protein data from the Protein Data Bank (PDB), and it was implemented on Microsoft Azure, a cloud computing environment known for its reliability, vast computing resources and cost-effectiveness.

Table of Contents

Frontmatter

Chapter 1. Introduction

In this modern technological age, data is growing larger and faster than in previous decades. The existing methods used to process and analyse this overflowing amount of data are no longer sufficient. The term large scale data first surfaced in the article “Visually Exploring Gigabyte Datasets in Real Time” [1], published by the Association for Computing Machinery (ACM) in 1999. The article noted that having large scale data without a proper methodology to analyse it is a huge challenge and a sad occasion at the same time.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Chapter 2. Background

Reductionist molecular biology is a hypothesis-based approach used by scientists in the second half of the 20th century to determine and characterize molecules, cells and major structures of living systems. Biologists identified that, as a single community, they are required to continue using reductionist strategies to further their cause in elucidating the whole structure of components and every single one of their functions.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Chapter 3. Large Scale Data Analytics

The nature of protein data is complicated and constantly updated by researchers around the globe. To query from multiple data sources, a query framework written and built using Python with the concept of Language Integrated Query is proposed as the solution to overcome the limitations discussed in previous chapters. A cloud computing platform is used for this research to host the query framework to enable the framework to use the vast resources available to perform a query with minimal latency while avoiding computing resource deficiency. In this chapter, Language Integrated Query, cloud computing and algebraic operators are explained in detail.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Chapter 4. Query Framework

The Protein Data Bank (PDB) holds a vast amount of resources on protein 3D models, complex assemblies and nucleic acids that both students and researchers can use to learn the characteristics of biomedicine. A framework is therefore needed to retrieve information from its database effectively. The functions that enable users to query RCSB PDB are explained in this chapter.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Chapter 5. Results and Discussion

For this research, the query framework structure explained in Chap. 4 is implemented on Microsoft Azure. The query framework can be accessed as a web portal through any web browser, for example Internet Explorer, Microsoft Edge or Google Chrome. The web portal is built to be user friendly and easy to navigate when retrieving data from RCSB PDB. The results of the query web portal are shown in this chapter.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Chapter 6. Conclusion and Future Works

This research shows the difficulties currently faced in database querying. Recent methodologies such as semantic integration focus on data integration, data mapping and data translation. These approaches are workable for small to medium data sources. However, for querying databases that are huge and constantly updated by users around the world, they are neither suitable nor cost effective.
Chung Yik Cho, Rong Kun Jason Tan, John A. Leong, Amandeep S. Sidhu

Backmatter
