About this book

This volume constitutes the proceedings of the 7th International Conference on BIGDATA 2018, held as Part of SCF 2018 in Seattle, WA, USA in June 2018.
The 22 full papers together with 10 short papers published in this volume were carefully reviewed and selected from 97 submissions. They are organized in topical sections such as Data analysis, data as a service, services computing, data conversion, data storage, data centers, dataflow architectures, data compression, data exchange, data modeling, databases, and data management.

Table of Contents


Research Track: BigData Modeling


Time Series Similarity Search Based on Positive and Negative Query

Traditional time series similarity search based on relevance feedback combines the initial, positive, and negative relevant series directly to create a new query sequence for the next search. It cannot make full use of the negative relevant sequences, and in some cases excessive adjustment of the query sequence even leads to inaccurate results. In this paper, time series similarity search based on separate relevance feedback is proposed: each round includes a positive query and a negative query, and their results are combined to generate the results of that round. For a given data sequence, the positive query evaluates its similarity to the initial and positive relevant sequences, and the negative query evaluates its similarity to the negative relevant sequences. The final similar sequences should be not only close to the positive relevant series but also far from the negative relevant series. Experiments on UCR data sets showed that, compared with retrieval without feedback and the commonly used feedback algorithm, the proposed method can improve the accuracy of similarity search on some data sets.

Jimin Wang, Qi Liu, Pengcheng Zhang
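
The idea of separate positive and negative queries in the abstract above can be illustrated with a minimal sketch. The abstract does not give the distance measure or the rule for combining the two query results, so the Euclidean distance, the nearest-neighbor aggregation, and the `weight` parameter below are all assumptions made for illustration, not the authors' method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separate_feedback_score(candidate, positives, negatives, weight=1.0):
    """Score a candidate sequence; lower means a better match.

    positives: the initial query plus positively rated sequences.
    negatives: negatively rated sequences (may be empty).
    weight: hypothetical trade-off parameter, not taken from the paper.
    """
    # Positive query: distance to the closest positive sequence.
    pos = min(euclidean(candidate, p) for p in positives)
    # Negative query: distance to the closest negative sequence.
    neg = min(euclidean(candidate, n) for n in negatives) if negatives else 0.0
    # Reward closeness to positives and distance from negatives.
    return pos - weight * neg


def rank(candidates, positives, negatives, weight=1.0):
    """Return candidates ordered from best to worst score."""
    return sorted(
        candidates,
        key=lambda c: separate_feedback_score(c, positives, negatives, weight),
    )
```

With two candidates that are equally close to the positive sequences, the one farther from the negative sequences is ranked first, which is the behavior the abstract describes.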

Inter-Category Distribution Enhanced Feature Extraction for Efficient Text Classification

Text data is one of the dominant data types in Big Data driven services and applications. The performance of text classification largely depends on the quality of feature extraction over the text corpus. For supervised learning over text documents, the TF-IDF (Term Frequency-Inverse Document Frequency) weighting factor is one of the most frequently used features in text classification. In this paper, we address two known limitations of the TF-IDF based feature extraction method. First, the conventional TF-IDF weighting factor does not consider the synonymous relationships between feature terms. Second, for a big corpus with a large number of text documents and a large number of feature terms, the computational complexity of text classification increases with the dimensionality of the feature space. We address these problems by introducing an optimization technique based on the Inter-Category Distributions (ICD) of terms and the Inter-Category Distributions of documents. We call this new weighting factor TF-IDF-ICD, namely TF-IDF with Inter-Category Distributions. To further enhance the effectiveness of our TF-IDF-ICD method, we describe a TF-IDF-ICD threshold based Dimensionality Reduction (DR) optimization. We test the text classifier with a corpus of 10,000 articles. The evaluation results show that the proposed TF-IDF-ICD based text classification method outperforms the conventional TF-IDF based classification solution by 7.84% at only about 43.19% of the training time used by the conventional TF-IDF based text classification methods.

Yuming Wang, Jun Huang, Yun Liu, Lai Tu, Ling Liu
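
For readers unfamiliar with the baseline that the abstract above builds on, the following is a minimal sketch of conventional TF-IDF weighting. It uses one common variant (relative term frequency times the natural log of the inverse document frequency); the ICD extension is not specified in the abstract and is deliberately not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute conventional TF-IDF weights.

    docs: list of documents, each a list of tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights
```

A term that appears in every document gets weight zero, which shows why TF-IDF alone says nothing about how a term is distributed across categories.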

Reversible Data Perturbation Techniques for Multi-level Privacy-Preserving Data Publication

The amount of digital data generated in the Big Data age is increasing rapidly. Privacy-preserving data publishing techniques based on differential privacy through data perturbation provide a safe release of datasets such that sensitive information present in the dataset cannot be inferred from the published data. Existing privacy-preserving data publishing solutions have focused on publishing a single snapshot of the data with the assumption that all users of the data share the same level of privilege and access the data with a fixed privacy level. Thus, such schemes do not directly support data release in cases when data users have different levels of access on the published data. While a straightforward approach of releasing a separate snapshot of the data for each possible data access level can allow multi-level access, it can result in a higher storage cost requiring separate storage space for each instance of the published data. In this paper, we develop a set of reversible data perturbation techniques for large bipartite association graphs that use perturbation keys to control the sequential generation of multiple snapshots of the data to offer multi-level access based on privacy levels. The proposed schemes enable multi-level data privacy, allowing selective de-perturbation of the published data when suitable access credentials are provided. We evaluate the techniques through extensive experiments on a large real-world association graph dataset and our experiments show that the proposed techniques are efficient, scalable and effectively support multi-level data privacy on the published data.

Chao Li, Balaji Palanisamy, Prashant Krishnamurthy

Real-Time Analysis of Big Network Packet Streams by Learning the Likelihood of Trusted Sequences

Deep Packet Inspection (DPI) is a basic monitoring step for intrusion detection and prevention, in which sequences of packed packets are unpacked according to the layered network structure. DPI is performed against overwhelming network packet streams. By nature, network packet data is real-time streaming big data. DPI big data analysis, however, is extremely expensive, likely to generate false positives, and poorly adaptive to previously unknown attacks. This paper presents a novel machine learning approach to multithreaded analysis of network traffic streams. The contributions of this paper include (1) real-time packet data analysis, (2) learning the likelihood of trusted and untrusted packet sequences, and (3) improved adaptive detection of previously unknown intrusive attacks.

John Yoon, Michael DeBiase

Forecasting Traffic Flow: Short Term, Long Term, and When It Rains

Forecasting is the art of taking available information of the past and attempting to make the best educated guesses of the ever unforeseen future. From the historical data, patterns can be observed and forecasting models have been developed to capture such patterns. This work focuses on forecasting traffic flow in major urban areas and freeways in the state of Georgia using large amounts of data collected from traffic sensors. Much of the existing work on traffic flow forecasting focuses on the immediate short terms. In addition to that, this work studies the forecasting powers of various models, including seasonal ARIMA, exponential smoothing and neural networks, for relatively long terms. A second experiment that incorporates precipitation data into forecasting models to better predict traffic flow in rainy weather is also conducted. Dynamic regression models and neural networks are used in this experiment. In both experiments, neural networks outperformed the others overall.

Hao Peng, Santosh U. Bobade, Michael E. Cotterell, John A. Miller
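
Exponential smoothing, one of the model families named in the abstract above, can be sketched in its simplest form as follows. This is the textbook non-seasonal version with a single smoothing parameter; the models evaluated in the paper are presumably more elaborate, and nothing here is taken from the authors' implementation.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast for a series.

    series: observed values in time order (for example, traffic counts).
    alpha: smoothing parameter in (0, 1]; larger values weight recent
           observations more heavily.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")

    # Start the level at the first observation, then update it with
    # level = alpha * x + (1 - alpha) * level for each later value.
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level
```

With `alpha = 1` the forecast is simply the last observation; smaller values average over a longer history.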

Approximate Query Matching for Graph-Based Holistic Image Retrieval

Image retrieval has transitioned from retrieving images with single object descriptions to retrieving images by using complex natural language to describe desired image content. We present work on holistic image search to perform exact and approximate image retrieval that returns images from a database that most closely match the user’s description. Our approach can handle simple queries for single objects (e.g., cake) as well as more complex descriptions of multiple objects and prepositional relations between objects (e.g., girl eating cake with a fork on a plate) in graph notation. In addition, our approach can generalize to retrieve queries that are semantically similar in case specific results are not found. We use the scene graph, developed in the Visual Genome dataset, as a formalization of image content stored as a graph with nodes for objects and edges for relations describing objects in an image. We combine this with approximate search techniques for large-scale graphs and a semantic scoring algorithm developed by us to holistically retrieve images based on given search criteria. We also present a method to store scene graphs and metadata in graph databases using Neo4j.

Abhijit Suprem, Duen Horng Chau, Calton Pu

Research Track: BigData Analysis


PAGE: Answering Graph Pattern Queries via Knowledge Graph Embedding

Answering graph pattern queries has depended heavily on one technique, subgraph matching. However, this approach is ineffective when knowledge graphs include incorrect or incomplete information. In this paper, we present a method called PAGE that answers graph pattern queries via knowledge graph embedding (KGE) methods. PAGE computes the energy (or uncertainty) of candidate answers with the learned embeddings and chooses the lower-energy candidates as answers. Our method has two advantages: (1) PAGE is able to find latent answers that are hard to find via subgraph matching, and (2) it presents a robust metric that enables us to compute the plausibility of an answer. In evaluations with two popular knowledge graphs, Freebase and NELL, PAGE improved performance by up to 28% compared to baseline KGE methods.

Sanghyun Hong, Noseong Park, Tanmoy Chakraborty, Hyunjoong Kang, Soonhyun Kwon
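
The notion of choosing lower-energy candidates can be illustrated with a small sketch. The abstract above does not say which embedding model PAGE uses, so the translation-style energy below (distance between head + relation and tail, as in TransE-like models) is an assumption chosen only because it is a widely known example; the function names and toy embeddings are hypothetical.

```python
import math


def translation_energy(head, relation, tail):
    """Energy of a (head, relation, tail) triple under a translation-style
    model: the L2 norm of head + relation - tail. Lower is more plausible.
    """
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def lowest_energy_candidates(head, relation, candidates, top_k=1):
    """Rank candidate tail entities by energy and return the best names.

    candidates: {entity_name: embedding_vector} (hypothetical toy data).
    """
    ranked = sorted(
        candidates,
        key=lambda name: translation_energy(head, relation, candidates[name]),
    )
    return ranked[:top_k]
```

Because ranking depends on learned vectors rather than on the presence of an explicit edge, a candidate can score well even when the corresponding fact is missing from the graph, which is the advantage the abstract claims over subgraph matching.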

Distributed Big Data Ingestion at Scale for Extremely Large Community of Users

To make big data analytics available to mass online users, in the range of tens of millions, an architecture different from those on the market has been designed and implemented. It employs a distributed blob store, custom compression, and custom query algorithms, including filtering, joins, and group-by. The system has been in operation at eBay for years and is described in [1]. However, large-scale ingestion of data into a distributed blob store presents a unique challenge. This paper outlines an approach to solve the problem and uses an example of ingesting one trillion real-time impressions per day, or more than 11 million per second, to illustrate how the proposed approach works. As discussed in the paper, the approach manages to consume 1 trillion real-time impressions per day and is capable of making the data available to 100 million online users for analytics in just a few minutes. The incoming stream is partitioned first and then combined for ingestion. The ingestion is also divided into two stages, and data are available for query immediately after the first stage. Techniques are discussed for distributing the data volume among system components to bring the load on each component down to a reasonable level.

Venkat Tipparam, Belinda Liu, Yifei Chen, Zoe Lang, Gang Ye, Diana Li, Hong-Yen Nguyen, CP Lai, Steve Chan
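
The per-second figure quoted in the abstract above follows directly from the daily volume. The short sketch below checks the arithmetic; the partition count in the second helper is a made-up number used only to show how partitioning lowers the load on each component, not a figure from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day):
    """Average event rate per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


def per_partition(rate_per_second, partitions):
    """Average rate each partition sees if the stream is split evenly.

    partitions: hypothetical partition count, not taken from the paper.
    """
    return rate_per_second / partitions


rate = per_second(1_000_000_000_000)  # one trillion impressions per day
print(f"{rate:,.0f} impressions per second")  # 11,574,074 impressions per second
```

One trillion impressions per day therefore averages about 11.6 million per second, consistent with the figure of more than 11 million per second in the abstract.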

Convolutional Neural Network Ensemble Fine-Tuning for Extended Transfer Learning

Nowadays, image classification is a core task for many high-impact applications such as object recognition, self-driving cars, national security (border monitoring, assault detection), safety (fire detection, distracted driving), and geo-monitoring (cloud, rock, and crop-disease detection). Convolutional Neural Networks (CNNs) are effective for those applications. However, they need to be trained with a huge number of examples, which requires a correspondingly long training time. When the training set is not big enough, or when the model must be retrained several times, a common approach is to adopt a transfer learning procedure. Transfer learning procedures use networks already pretrained in another context and either extract features from them or retrain them with a small dataset related to the specific application (fine-tuning). We propose fine-tuning an ensemble of models combined from multiple pretrained CNNs (AlexNet, VGG19, and GoogleNet). We test our approach on three different benchmark datasets: Yahoo! Shopping Shoe Image Content, UC Merced Land Use Dataset, and Caltech-UCSD Birds-200-2011 Dataset. Each one represents a different application. Our suggested approach always improves accuracy over the state-of-the-art solutions and over the accuracy obtained by retuning a single CNN. In the best case, accuracy rose from 70.5% to 93.14%.

Oxana Korzh, Mikel Joaristi, Edoardo Serra

GLDA-FP: Gaussian LDA Model for Forward Prediction

In social networks, information propagation is affected by diverse factors. In this work, we study the formation of forwarding behavior, map it onto multidimensional driving mechanisms, and apply behavioral and structural features to forward prediction. First, by considering the effects of behavioral interest, user activity, and network influence, we propose three driving mechanisms: interest-driven, habit-driven, and structure-driven. Second, taking advantage of the Latent Dirichlet Allocation (LDA) model’s ability to deal with polysemy and synonymy, we improve the traditional text modeling method with a Gaussian distribution and apply it to modeling user interest, activity, and influence. In this way, the user topic distribution for each dimension can be obtained regardless of whether the word is discrete or continuous. Moreover, the model can be extended using a pre-discretizing method, which helps LDA detect topic evolution automatically. By introducing time information, we can dynamically monitor user activity and mine hidden behavioral habits. Finally, a novel model, Gaussian LDA, is proposed for forward prediction. The experimental results indicate that the model not only mines users’ latent interests but also improves forward prediction performance effectively.

Yunpeng Xiao, Liangyun Liu, Ming Xu, Haohan Wang, Yanbing Liu

Tracking Happiness of Different US Cities from Tweets

Research into the possibilities of Twitter data has grown greatly over the past few years. Studies have shown its potential in identifying and managing disasters, predicting flu trends, predicting the success of movies at the box office, and analyzing people’s emotions. In this study, tweets from nine different cities across America were collected and analyzed. East Carolina University’s Hadoop cluster was used to run our application, and Stanford CoreNLP was then used to determine the sentiment of each statement in the tweets. Although our research revealed little distinction among the nine individual cities in the percentage of positive, negative, and neutral statements, there were significant differences in the overall statements: 47.88% of all statements were neutral, only 14.95% were positive, and 37.16% were negative.

Bryan Pauken, Mudit Pradyumn, Nasseh Tabrizi

An Innovative Lambda-Architecture-Based Data Warehouse Maintenance Framework for Effective and Efficient Near-Real-Time OLAP over Big Data

In order to speed up query processing in the context of Data Warehouse Systems, auxiliary summaries, such as materialized views and calculated attributes, are built on top of the data warehouse relations. As changes are made to the data warehouse through maintenance transactions, summary data become stale unless they are refreshed, which is expensive. The challenge gets even worse when near-real-time environments are considered, even with respect to emerging Big Data features. In this paper, inspired by the well-known Lambda architecture, we introduce a novel approach for effectively and efficiently supporting data warehouse maintenance processes in the context of near-real-time OLAP scenarios, making use of so-called big summary data, and we assess it via an empirical study that stresses the complexity of such OLAP scenarios using the popular TPC-H benchmark.

Alfredo Cuzzocrea, Rim Moussa, Gianni Vercelli

Application Track: BigData Algorithms


The Application of Machine Learning Algorithm Applied to 3Hs Risk Assessment

Hypertension, Hyperglycemia and Hyperlipidemia (3Hs) are significant factors in cardiovascular disease. Using indicators that can be obtained conveniently and noninvasively, including obesity-related indicators (Body Mass Index (BMI), Waist Circumference (WC), Hip Circumference (HC), Waist-to-hip Ratio (WHR), and Waist-to-height Ratio (WHtR)) as well as disease history, family disease history, diet, etc., this article sets up two models to study the application of machine learning algorithms to 3Hs risk assessment. We build separate prediction models for different indicator combinations and for each gender, and test their performance. 10-fold cross-validation was used to verify the models. In model I (HCRI - Logistic Model), the logistic regression algorithm was used to train the RC of the Harvard cancer risk index. In model II (Logistic - Cart Model), taking advantage of the decision tree’s ability to deal with continuous variables, we set the output of CART as the input of the logistic regression. The results show that, in the HCRI - Logistic Model, the differences between males and females were not obvious, the accuracies were both only close to 70%, and the prediction of hyperglycemia was better than that of the other 2Hs. In the Logistic - Cart Model, the prediction for adult females using obesity-related indicators is superior to that for males. For hyperglycemia in particular, the accuracy of model II is as high as 89.85%, an increase of 19.28% over model I, with a specificity of 96.62% and a sensitivity of 84.56%. This provides an important reference for the evaluation of 3Hs to reduce the growth of related chronic diseases.

Guixia Kang, Bo Yang, Dongli Wei, Ling Li

Development of Big Data Multi-VM Platform for Rapid Prototyping of Distributed Deep Learning

The present study utilizes VirtualBox virtual environment technology to develop a personal big data multi-VM platform with a four-node Spark and Hadoop cluster, which provides an environment in which developers can easily design and implement Spark and Hadoop Map/Reduce programs. Before running their Big Data and deep learning applications on a physical multi-node Spark and Hadoop cluster, developers can carry out Map/Reduce programming on the proposed multi-VM platform, which is exactly the same as the physical one. To demonstrate its capability and applicability, this study uses a deep learning application as an example. The big data multi-VM platform provides rapid prototyping of distributed deep learning for AI developers by using the cutting-edge framework TensorFlowOnSpark (TFoS). To gain deeper insight, this study runs a deep-learning benchmark on different types of cluster systems, including the multi-node big data VM platform, a physical standalone system, and a physical small-cluster system. The results indicate that InputMode.SPARK can be 3.3 times faster than InputMode.TENSORFLOW on the big data VM platform and even 6.1 times faster on the physical server.

Chien-Heng Wu, Chiao-Ning Chuang, Wen-Yi Chang, Whey-Fone Tsai

Developing Cost-Effective Data Rescue Schemes to Tackle Disk Failures in Data Centers

Ensuring the reliability of large-scale storage systems remains a challenge, especially when there are millions of disk drives deployed. Post-failure disk rebuild takes much longer time nowadays due to the ever-increasing disk capacity, which also increases the risk of service unavailability and even data loss. In this paper, we present a proactive data protection (PDP) framework in the ZFS file system to rescue data from disks before actual failure onset. By reducing the risk of data loss and mitigating the prolonged disk rebuilds caused by disk failures, PDP is designed to enhance the overall storage reliability. We extensively evaluate the recovery performance of ZFS with diverse configurations, and further explore disk failure prediction techniques to develop a proactive data protection mechanism in ZFS. We further compare the performance of different data protection strategies, including post-failure disk recovery, proactive disk cloning, and proactive data recovery. We propose an analytic model that uses storage utilization and contextual system information to select the best data protection strategy to achieve cost-effective and enhanced storage reliability.

Zhi Qiao, Jacob Hochstetler, Shuwen Liang, Song Fu, Hsing-bung Chen, Bradley Settlemyer

On Scalability of Distributed Machine Learning with Big Data on Apache Spark

Performance of traditional machine learning systems does not scale up while working in the world of Big Data with training sets that can easily contain petabytes of data. Thus, new technologies and approaches are needed that can efficiently perform complex and time-consuming data analytics without having to rely on expensive super machines. This paper discusses how a distributed machine learning system can be created to efficiently perform Big Data machine learning using classification algorithms. Specifically, it is shown how the Machine Learning Library (MLlib) of Apache Spark on Databricks can be utilized with several instances residing on Elastic Compute Cloud (EC2) of Amazon Web Services (AWS). In addition to performing predictive analytics on different numbers of executors, both in-memory processing and on-table scans were used to utilize the computing efficiency and flexibility of Spark. The conducted experiments, which were run multiple times on several instances and executors, demonstrate how to parallelize executions as well as to perform in-memory processing in order to drastically improve a learning system’s performance. To highlight the advantages of the proposed system, two very large data sets and three different supervised classification algorithms were used in each experiment.

Ameen Abdel Hai, Babak Forouraghi

Development of a Big Data Platform for Analysis of Road Driving Environment Using Vehicle Sensing Data and Public Data

The driving environment on the road can change rapidly due to various events such as bad weather, traffic accidents, and congestion. If drivers fail to recognize these dangerous situations in advance, major traffic accidents can result. It is therefore very important to provide real-time driving environment information to drivers, which requires data collection devices. Current data collection devices are fixed collection systems that collect driving environment data at specific points or intervals. Such a fixed collection system is limited in time and space, and covering all roads nationwide would incur a huge installation cost. To overcome the limitations of the fixed data collection system, this study utilizes vehicle sensing data collected from individual vehicle sensors, treating vehicles as mobile collectors. Since the vehicle sensing data collected from individual vehicles nationwide constitute spatial big data, an analysis system for processing big data is needed. Since this analysis system should utilize various collected data, such as public data as well as vehicle sensing data, it should be developed as a platform with data scalability in mind. Therefore, this study developed a big data platform for collecting, storing, processing, analyzing and visualizing various kinds of big data such as vehicle sensing data and public data. The developed platform consists of H/W and S/W, and it is applied to providing real-time driving environment information and to analyzing and forecasting driving environment information, including road surface freezing, road rainfall/snowfall, incident situations, and traffic congestion.

Intaek Jung, Kyusoo Chong

Application Track: BigData Practices


Big Data Framework for Finding Patterns in Multi-market Trading Data

In the United States, multimarket trading is becoming very popular among investors, professionals and high-frequency traders. This research focuses on 13 exchanges and applies a data mining algorithm, an unsupervised machine learning technique, to discover the relationships between stock exchanges. In this work, we used an association rule (FP-growth) algorithm to find trading patterns across exchanges. Thirty days of NYSE Trade and Quote (TAQ) data were used for these experiments. We implemented a big data framework of Spark clusters on top of Hadoop to conduct the experiments. The rules and correlations found in this work seem promising and can be used by investors and traders to make decisions.

Daya Ram Budhathoki, Dipankar Dasgupta, Pankaj Jain
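
Association rules such as those mentioned in the abstract above are usually judged by support and confidence. The sketch below computes both by brute force over a handful of transactions; it does not implement FP-growth itself, and the exchange codes in the usage note are invented placeholders rather than data from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent) / support(antecedent).
    Returns 0.0 when the antecedent never occurs.
    """
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante
```

For example, if each transaction is the set of exchanges on which a stock traded within some window, `confidence(transactions, {"N"}, {"Q"})` estimates how often trading on the hypothetical exchange "N" coincides with trading on "Q".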

Who’s Next: Evaluating Attrition with Machine Learning Algorithms and Survival Analysis

Every business deals with employees who voluntarily resign, retire, or are let go. In other words, they have employee turnover. Employee turnover, also known as attrition, can be detrimental if highly valued employees decide to leave at an unexpected time. This paper aims to find the employees who are most at risk of attrition by, first, identifying those who will leave; second, identifying whether their department increases the probability of their leaving; and third, identifying the individual probability of an employee leaving at a given time. This paper found logistic regression to consistently perform well in attrition classification compared to other machine learning models. The Kaplan-Meier survival function is applied to identify the department with the highest risk. An attempt is also made to identify the individual risk of an employee leaving using the Cox proportional hazards model. Using these methods, we were able to achieve two of the three goals identified.

Jessica Frierson, Dong Si
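
The Kaplan-Meier survival function mentioned in the abstract above can be written in a few lines. This is a minimal sketch of the standard product-limit estimator, not the authors' code; the tenure data used to exercise it would be hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier (product-limit) survival estimate.

    durations: time each employee was followed (for example, tenure in years).
    observed: True if the employee left at that time (event),
              False if still employed when observation ended (censored).
    Returns a list of (time, survival_probability) pairs, one per distinct
    time at which at least one event occurred.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone followed at least until t is still at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve
```

Computing one curve per department and comparing them is one way to see which department loses employees fastest; censored employees still count in the at-risk totals up to the time they were last observed.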

Volkswagen’s Diesel Emission Scandal: Analysis of Facebook Engagement and Financial Outcomes

This paper investigates Volkswagen’s diesel scandal with a focus on the relationship between their Facebook engagement and financial performance during the period of 2012–16. We employ the big social data analytics approaches of visual and text analytics on Volkswagen’s Facebook data and financial reporting data. We specifically analyze the potential effects on the company in the diesel emission scandal years of 2014–2016. We find that the diesel emission scandal had the most impact in the short-term period immediately after its occurrence resulting in Facebook users reacting negatively against Volkswagen but also some users defending the company. In the long-term, it seems that the scandal has not impacted the company based on the analysis of both their financial data and their social media data.

Qi An, Morten Grimmig Christensen, Annith Ramachandran, Raghava Rao Mukkamala, Ravi Vatrapu

LeadsRobot: A Sales Leads Generation Robot Based on Big Data Analytics

Sales leads are an essential concern of salesmen and marketing staff, who may seek them by blindly searching a substantial amount of online information with search engines. Unfortunately, it is tricky to extract useful and valuable leads from such huge online data. To address this issue, we present LeadsRobot, a software-enabled lead generation robot. It can intelligently understand the lead requirements of salesmen and then automatically mine leads from web big data and recommend them. The robot architecture is built with service-based technologies and accomplishes automatic understanding, crawling, analysis, and recommendation. To achieve the task, we use automatic web crawling to obtain raw data from the web. Natural language processing is employed to extract leads from the raw data, and intelligent recommendation for salesmen is then performed via word2vec-based text analysis. Finally, we demonstrate the proposed robot in a real application case and evaluate the performance of the system to show its efficiency and effectiveness.

Jing Zeng, Jin Che, Chunxiao Xing, Liang-Jie Zhang

Research on the High and New Technology Development Potential of China City Clusters Based on China’s New OTC Market

This article builds an index of enterprise competitiveness based on the annual reports of the New Third Board of China, and then proposes an index of city competitiveness, which reflects the competitiveness of cities in the high and new technology field. We select critical values to obtain each city’s high and new technology development level, combine this with a heat map analysis of the status quo of high and new technology development in China’s cities, and forecast the future development trend of the Beijing-Tianjin-Hebei, Yangtze River Delta, Pearl River Delta, and Chengdu-Chongqing science and technology ecosystems. Finally, the development potential of science and technology in China’s urban clusters is compared and analyzed.

Liping Deng, Huan Chen, Liang-Jie Zhang, Xinnan Li

Short Paper Track: BigData Analysis


Analysis of Activity Population and Mobility Impacts of a New Shopping Mall Using Mobile Phone Bigdata

This paper is an exploratory study examining the impact of a new shopping mall, Hyundai, in Seongnam city, Korea. The focus is on the activity population and mobility around the major shopping malls in the city, including three pre-existing major shopping malls. For this purpose, we analyzed mobile phone records from 2015 and 2016, before and after the new shopping mall opened. The data represent total mobile users in Korea. The activity population at the Hyundai mall increased by 103%. The new shopping mall negatively impacted the nearest shopping mall, AK, decreasing its activity population by 3%. Internal and external mobility showed that visitors to the Hyundai mall increased from all regions, while the internal population visiting the SK mall showed mixed results. Local government and urban planners will find this case study of interest with regard to mitigating the impacts of a new shopping mall development in similar urban situations.

Kwang-Sub Lee, Jin Ki Eom, Jun Lee, Dae Seop Moon

Activity-Based Traveler Analyzer Using Mobile and Socioeconomic Bigdata: Case Study of Seoul in Korea

This paper introduces a pilot study on developing the Activity-BAsed Traveler Analyzer (ABATA) using big data. Mobile phone big data is used to estimate the total activity population in a case study area, Gangnam, Seoul. The pilot system estimates the activity population and the travel demand derived from the activities, taking into account individual schedules and activity categories (home, work, shopping, and leisure) with respect to land use types, based on various data inventories. Transportation planners will find this case study of interest with regard to the simulation results on socio-demographic factors and land use changes.

Jin Ki Eom, Kwang-Sub Lee, Jun Lee, Dae-Seop Moon

Design and Application of a Visual System for the Supply Chain of Thermal Coal Based on Big Data

Big data is now applied in many different fields. This paper introduces an application of big data in the coal supply chain of the power industry. We designed and implemented a visual system for the Supply Chain of Thermal Coal (SCTC). This system can analyze and predict the coal demand of power generation enterprises. In the system, power companies can easily find suitable coal suppliers by comparing the price of coal, transportation cost, supply cycle, industry status, enterprise credit, etc. They can thus reduce power generation and storage costs, match their power generation plans, and understand the regional situation. In addition, the system provides an enterprise portrait of each coal company from many aspects, such as credit, risk information, and service quality. We also used actual data to verify the system. It is hoped that the application of this study can provide a reference for peers and related industries.

Xinyue Zhang, Yanmin Han, Wei Ge, Daqiang Yan, Yiming Chen

Big Data Analytics on Twitter: A Systematic Review of Applications and Methods

As the amount of digital data grows at an exponential rate, the emphasis is on extracting insight from the data. Although new fields of research, including Twitter data analytics, have proven fruitful, literature reviews and classifications of this research are lacking. Therefore, after sorting through 1,025 research papers, we reviewed 29 papers on Twitter data analytics from 20 journals, published from 2011 to 2017, and classified them by year of publication, journal title, data mining method, and application. This paper is written with the intent of understanding the research trend in this field.

Mudit Pradyumn, Akshat Kapoor, Nasseh Tabrizi

A Survey of Big Data Use in Large and Medium Ecuadorian Companies

Big data has become a subject of great interest among a variety of organizations in both the scientific and business sectors. It is therefore important to know how much attention companies pay to big data. This paper presents a study conducted in Ecuador on big data initiatives among large and medium companies. The results indicate that companies do not have a clear understanding of how big data could benefit them. In addition, interest in big data initiatives is higher in the private sector than in the public sector, and companies prefer contracting big data services from third parties to hiring specialized personnel.

Rosa Quelal, Monica Villavicencio

Short Paper Track: BigData Modeling


K-mer Counting for Genomic Big Data

Counting the abundance of all k-mers (substrings of length k) in sequencing reads is an important step in many bioinformatics applications, including de novo assembly, error correction, and multiple sequence alignment. However, processing large genomic datasets (in the TB range) has become a bottleneck in these bioinformatics pipelines. At present, most k-mer counting tools run on a single node and cannot handle TB-scale data efficiently. In this paper, we propose a new distributed method for k-mer counting with high scalability. We tested our k-mer counting tool on the Mira supercomputer at Argonne National Laboratory. The experimental results show that it scales to 8192 cores with an efficiency of 43% when processing a 2 TB simulated genome dataset with 200 billion distinct k-mers (graph size), and that the whole-genome statistical analysis takes only 578 s.

Jianqiu Ge, Ning Guo, Jintao Meng, Bingqiang Wang, Pavan Balaji, Shengzhong Feng, Jiaxiu Zhou, Yanjie Wei
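
To make the counting task in the abstract above concrete, the following is a minimal single-node sketch in Python. It is not the distributed method the paper proposes; the function name `count_kmers` and the two example reads are assumptions made for illustration, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across an iterable of reads."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy input, not data from the paper.
reads = ["ACGTAC", "GTACG"]
kmer_counts = count_kmers(reads, 3)
for kmer, n in sorted(kmer_counts.items()):
    print(kmer, n)
```

A single hash table like this stops fitting in one machine's memory once the input reaches the TB range, which is the limitation of single-node tools that the abstract points to.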

Ensemble Learning Based Gender Recognition from Physiological Signals

Gender recognition based on facial images, body gestures, and speech has been widely studied. In this paper, we propose a gender recognition approach based on four types of physiological signals, namely electrocardiogram (ECG), electromyogram (EMG), respiratory (RSP), and galvanic skin response (GSR) signals. The core steps of the experiment are data collection, feature extraction, and feature selection & classification. We developed a wrapper method based on AdaBoost and sequential backward selection for feature selection and classification. Analyzing data from 234 participants, we obtained a recognition accuracy of 91.1% with a subset of 12 features from ECG/EMG/RSP/GSR, 82.3% with 11 features from ECG only, and 80.8% with 5 features from RSP only, indicating the effectiveness of the proposed method. The ECG, EMG, RSP, and GSR signals are collected from the human wrist, face, chest, and fingers, respectively, so the proposed method can be easily applied to wearable devices.

Huiling Zhang, Ning Guo, Guangyuan Liu, Junhao Hu, Jiaxiu Zhou, Shengzhong Feng, Yanjie Wei
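
The sequential backward selection step named in the abstract above can be sketched in Python as follows. This is a generic illustration, not the authors' implementation: in a wrapper method the `score` callback would train and validate a classifier (AdaBoost in the paper) on each candidate subset, whereas here a toy additive score stands in for it. The feature names and weights are hypothetical.

```python
def sequential_backward_selection(features, score, target_size):
    """Start from all features and greedily drop one at a time.

    At each step, remove the feature whose removal leaves the
    highest-scoring subset, until target_size features remain.
    """
    selected = list(features)
    while len(selected) > target_size:
        # Build every subset obtained by dropping exactly one feature.
        candidates = [[f for f in selected if f != drop] for drop in selected]
        selected = max(candidates, key=score)
    return selected


# Hypothetical per-feature usefulness; a real wrapper would instead
# return the validation accuracy of a classifier trained on the subset.
TOY_WEIGHTS = {"ecg_a": 0.30, "emg_b": 0.05, "rsp_c": 0.20, "gsr_d": 0.01}


def toy_score(subset):
    return sum(TOY_WEIGHTS[f] for f in subset)


best = sequential_backward_selection(TOY_WEIGHTS, toy_score, target_size=2)
print(best)
```

Because every step evaluates each remaining feature once, reducing n features to m takes on the order of n² score evaluations, which is why wrapper methods are usually paired with a classifier that is fast to train.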

Developing a Chinese Food Nutrient Data Analysis System for Precise Dietary Intake Management

A large amount of dietary data can be recorded in daily life thanks to the development of the Internet of Things (e.g., RFID-equipped food carriers and food vending machines). Monitoring and analyzing personal diets can provide valuable information for disease diagnosis, body weight control, and dietary habit management. Such big data analysis benefits patients, dieters, nutritionists, and individuals who are concerned about their health. While various techniques have been used for dietary monitoring in clinical trials and user studies, they are not ready for daily use. Existing solutions either require tedious manual recording or may impede normal daily activities. In this paper, we design a smart big data framework that uses RFID technology to analyze daily nutrition intake from the diet. The framework can record Chinese food dietary information efficiently and effectively. It is promising for helping individuals and dietitians set up personalized nutrient plans in the future.

Xiaowei Xu, Li Hou, Zhen Guo, Ju Wang, Jiao Li

A Real-Time Professional Content Recommendation System for Healthcare Providers’ Knowledge Acquisition

Lifelong learning has become an essential component of a professional's career path. With the wide use of mobile devices in working environments, professionals can acquire knowledge anytime and anywhere. For healthcare providers, timely acquisition of clinical knowledge can significantly improve patient treatment. In this paper, we present a real-time professional content recommendation system for healthcare providers' knowledge acquisition. The system includes five layers: a Data Layer (healthcare provider profile, learning behavior, social network, etc.), an Algorithm Layer (clinical/medical knowledge classification, medical expertise similarity measurement, professional-knowledge matching algorithms, etc.), a Service Layer (click/browsing behavior monitoring, healthcare provider feedback collection, etc.), an Application Layer (medical content recommendation, retrieval optimization, etc.), and a Management Layer (clinical scenario configuration, medical terminology management, etc.). The system has been applied in a knowledge service application targeting clinicians in a mobile health (mHealth) scenario.

Lu Qin, Xiaowei Xu, Jiao Li

Study on Big Data Visualization of Joint Operation Command and Control System

As more and more weapons and equipment are connected to the joint operational command and control system, the amount of information they generate is also increasing, which poses tremendous challenges to the decision-making ability of commanders and soldiers. Based on the concept of "user-centered design", this paper studies information visualization methods for the operational command and control system. The implementation of visualization is divided into three levels: user behavior, interaction architecture, and visual performance. Through user task investigation, the logic of the interaction architecture, and visual element coding, a joint air defense command and control system interface was designed. A user test of the interface shows that the new design scheme is better than the original one. In this study, task research, design development, and user testing are used to propose an information visualization method for command and control systems.

Gang Liu, Yi Su

