About this book

This book constitutes the workshop proceedings of the 22nd International Conference on Database Systems for Advanced Applications, DASFAA 2017, held in Suzhou, China, in March 2017. The 32 full papers and 5 short papers presented were carefully selected and reviewed from 43 submissions to the four following workshops: the 4th International Workshop on Big Data Management and Service, BDMS 2017; the Second International Workshop on Big Data Quality Management, BDQM 2017; the 4th International Workshop on Semantic Computing and Personalization, SeCoP 2017; and the First International Workshop on Data Management and Mining on MOOCs, DMMOOC 2017.

Table of contents

Automatically Classify Chinese Judgment Documents Utilizing Machine Learning Algorithms

In law, a judgment is a decision by a court that resolves a controversy and determines the rights and liabilities of parties in a legal action or proceeding. The China Judgments Online system was officially launched in 2013 for record keeping and notification; to date, it has recorded over 23 million electronic judgment documents. This huge number of judgment documents reflects the improvement of judicial justice and openness. Document categorization is becoming increasingly important for judgment indexing and further analysis. However, it is almost impossible to categorize the documents manually due to their large volume and rapid growth. In this paper, we propose an approach to automatically classify Chinese judgment documents using machine learning algorithms including Naive Bayes (NB), Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM). After word segmentation, a judgment document is represented in the vector space model (VSM) using TF-IDF. To improve performance, we construct a set of judicial stop words. In addition, because TF-IDF generates a high-dimensional feature vector, which leads to extremely high time complexity, we apply three dimensionality reduction methods. Extensive experiments on 6735 judgment documents demonstrate the effectiveness and high classification performance of the proposed method.

Miaomiao Lei, Jidong Ge, Zhongjin Li, Chuanyi Li, Yemao Zhou, Xiaoyu Zhou, Bin Luo
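
The TF-IDF vector space representation mentioned in the abstract above can be illustrated with a minimal sketch. It assumes documents are already segmented into tokens; the function name `tfidf_vectors`, the toy documents and the stop word set are hypothetical, and the weighting shown (relative term frequency times log inverse document frequency) is one common variant, not necessarily the one used by the authors.

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=frozenset()):
    """Return (vocabulary, vectors) for documents given as token lists."""
    # Remove stop words before computing any statistics.
    filtered = [[t for t in doc if t not in stop_words] for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in filtered for term in set(doc))
    vocab = sorted(df)
    n_docs = len(filtered)
    vectors = []
    for doc in filtered:
        tf = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(tf[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

# Hypothetical, already segmented toy documents.
docs = [
    ["court", "contract", "dispute", "the"],
    ["court", "theft", "sentence", "the"],
    ["court", "contract", "breach", "the"],
]
vocab, vectors = tfidf_vectors(docs, stop_words={"the"})
```

A term such as "court" that occurs in every document receives weight zero, which is why domain-specific stop words and dimensionality reduction are natural companions to this representation.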

A Partitioning Scheme for Big Dynamic Trees

In this paper, we propose a scheme for partitioning dynamically growing big trees, together with its implementation. The scheme relies on the history-pattern encoding scheme for dynamic multidimensional datasets, with which large-scale datasets can be handled efficiently. To partition dynamic trees efficiently, we improve the encoding scheme and adapt it to partitioning. In our partitioning scheme for a tree T, the path from the root node of T to the root node of a partitioned subtree is treated as an index for selecting the subtree. The path is split into a shared path and a local path within the subtree, and each path is encoded using history-pattern encoding. The partitioning scheme reduces both the storage cost and the retrieval cost. After describing the tree encoding scheme designed for partitioning, we address some problems that arise in the encoding and present countermeasures. Finally, an implemented prototype system is described and evaluated.

Atsushi Sudoh, Tatsuo Tsuji, Ken Higuchi

Optimization Factor Analysis of Large-Scale Join Queries on Different Platforms

Popular big data computing platforms, such as Spark, provide a new computing paradigm for traditional database operations, such as queries. Besides their ability to manage large-scale data, big data platforms have earned a reputation for simple programming interfaces and good scale-out performance. Traditional databases, however, have intrinsic optimization mechanisms for fundamental operators, which support efficient and flexible data processing. A comprehensive view of the data processing performance of these two kinds of platforms is therefore valuable. In this paper, we focus on the join operation, a primary and frequently used operator in both databases and big data analysis. We design and conduct extensive experiments to test the performance of the two classic platforms under unified datasets and hardware, revealing how factors such as computing schema and storage media influence performance. Based on the experimental analysis, we also offer advice on choosing a computing platform for different application scenarios.

Chao Yang, Qian Wang, Qing Yang, Huibing Zhang, Jingwei Zhang, Ya Zhou

Which Mapping Service Should We Select in China?

Mapping services are widely used in our daily lives. People can use them to find the nearest POI (Point of Interest), the shortest travel route from a source location to a destination, and even life services such as booking hotels and calling taxis. Consequently, more and more mapping service providers have emerged in China in recent years, such as Baidu Maps, Amap and Sogou Maps. However, no existing study addresses how users and developers should select a suitable service when facing so many different mapping services, which is the problem we focus on in this paper. We first design a questionnaire and analyze the results to show the current mapping service situation in China; then, we introduce and summarize the three most popular native mapping APIs in China, namely the Baidu Maps API, Amap API and Sogou Maps API, to give readers a brief guide for selecting a suitable mapping service.

Detian Zhang, Jia-ao Wang, Fei Chen

An Online Prediction Framework for Dynamic Service-Generated QoS Big Data

With the prevalence of service computing, cloud computing, and the Internet of Things (IoT), various service compositions based on Service-Oriented Architecture (SOA) are emerging on the Internet. Evaluating the performance attributes of these service compositions generates abundant dynamic Quality of Service (QoS) data, referred to as service-generated QoS big data. These data can be used to select optimal services for building high-quality SOA systems. However, a large share of service-generated QoS data is unknown, and predicting these data has become a challenge. In this paper, we present a framework for service-generated QoS big data prediction, named DSPMF. Under this framework, we formulate an optimization objective function and employ an online stochastic gradient descent algorithm to solve it. Extensive experiments are conducted to verify the effectiveness and efficiency of the proposed approach.

Jianlong Xu, Changsheng Zhu, Qi Xie
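
To make the idea of online stochastic gradient descent for QoS prediction concrete, here is a generic sketch of latent-factor matrix factorization that is updated one observation at a time. It is not the DSPMF objective, which the abstract does not spell out; the squared-error loss with L2 regularization, the function names, the hyperparameters and the toy stream of (user, service, QoS) samples are all assumptions made for illustration.

```python
import random

def make_factors(n, k, rng):
    """Create n latent vectors of length k with small positive entries."""
    return [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n)]

def sgd_step(U, S, u, s, q, lr=0.05, reg=0.01):
    """Update the factors for one observed (user u, service s, QoS q) sample.

    Returns the prediction error measured before the update.
    """
    pred = sum(a * b for a, b in zip(U[u], S[s]))
    err = q - pred
    for f in range(len(U[u])):
        uf, sf = U[u][f], S[s][f]
        # Gradient step on the regularized squared error.
        U[u][f] += lr * (err * sf - reg * uf)
        S[s][f] += lr * (err * uf - reg * sf)
    return err

rng = random.Random(0)
U = make_factors(3, 2, rng)  # 3 hypothetical users
S = make_factors(4, 2, rng)  # 4 hypothetical services
stream = [(0, 1, 0.8), (1, 2, 0.3), (2, 0, 0.5), (0, 3, 0.9)]

first_errors = [abs(sgd_step(U, S, u, s, q)) for u, s, q in stream]
for _ in range(200):  # replay the stream to mimic continuing arrivals
    for u, s, q in stream:
        sgd_step(U, S, u, s, q)
last_errors = [abs(sgd_step(U, S, u, s, q)) for u, s, q in stream]
```

Because each update touches only the two latent vectors involved in the sample, the model can absorb new QoS observations as they arrive without retraining from scratch.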

Discovering Interesting Co-location Patterns Interactively Using Ontologies

Co-location pattern mining, which discovers feature types that frequently appear together in a nearby geographic region, plays an important role in spatial data mining. Common frameworks for mining co-location patterns generate numerous redundant patterns, and several methods have been proposed to overcome this drawback. However, because most of these methods rely on statistical information, they do not guarantee that the extracted co-location patterns are interesting. It is therefore crucial to help the decision-maker choose interesting co-location patterns with an efficient interactive procedure. This paper proposes an interactive approach to discovering interesting co-location patterns. First, ontologies are used to improve the integration of user knowledge. Second, an interactive process is designed to collaborate with the user to find interesting co-location patterns efficiently. Finally, a filter is designed to further reduce the number of discovered co-location patterns in the result set. Experimental results on both synthetic and real data sets demonstrate the effectiveness of our approach.

Xuguang Bao, Lizhen Wang

LFLogging: A Latch-Free Logging Scheme for PCM-Based Big Data Management Systems

Big data introduces new challenges to database systems because of its big-volume and big-velocity properties. In particular, big velocity, i.e., data arriving very fast, requires database systems to process continuously arriving queries efficiently. However, traditional disk-based DBMSs incur a large overhead in maintaining database consistency, mainly due to their logging, locking, and latching mechanisms. In this paper, we aim to reduce the logging overhead of DBMSs by using new kinds of storage media such as PCM. Specifically, we propose a latch-free logging scheme named LFLogging, which uses PCM for both updating and transaction logging in disk-based DBMSs. Unlike traditional approaches, which involve latch contention and complex logging schemes such as WAL, LFLogging provides high performance by reducing latches and explicit logging. We conduct trace-driven experiments on the TPC-C benchmark to measure the performance of our proposal. The results show that LFLogging improves system throughput by up to 4-5x over existing approaches including WAL and PCMLogging.

Wenqiang Wang, Peiquan Jin, Shouhong Wan, Lihua Yue

RTMatch: Real-Time Location Prediction Based on Trajectory Pattern Matching

Due to the ubiquity of mobile devices, such as GPS devices and devices using location-based services, the volume of mobile trajectory data is growing. This creates opportunities for innovation in analyzing trajectories and extracting information. We propose RTMatch, a new method for predicting the next location of a moving object. The main idea is to store and query the trajectory frequency patterns (named T-patterns) of moving objects using a data structure that combines an RTPT (Real Time Pattern Tree) with an HT (Hash Table) containing the spatio-temporal information, and then to find the best-matched path on the tree, i.e., the T-pattern that best matches the trajectory to be predicted. RTMatch can provide real-time analysis during on-the-fly processing. Experiments on real data show that our method is more accurate and efficient than some existing methods.

Dong Zhenjiang, Deng Jia, Jiang Xiaohui, Wang Yongli

Online Formation of Large Tree-Structured Team

Software projects are often divided into different components, and groups of individuals are assigned to various parts of the project. Matching the modular components of a project with the right set of individuals is a fundamental challenge in both commercial and open source software projects. However, most existing studies on team formation have only considered the problem of creating flat teams, i.e., teams without communities and central authorities. In this paper, we study the problem of forming a hierarchically structured team. We use a tree structure to model both teams and task specifications and introduce the notion of a sub-team. Next, we define local density to minimize communication costs in sub-teams. Then, two algorithms are proposed to address this team formation problem in bottom-up and top-down manners. Furthermore, sub-teams are pre-computed and indexed to facilitate online formation of large teams. Experimental results on a large dataset suggest that the index-based algorithm achieves both good effectiveness and excellent efficiency.

Cheng Ding, Fan Xia, Gopakumar, Weining Qian, Aoying Zhou

Cell-Based DBSCAN Algorithm Using Minimum Bounding Rectangle Criteria

The density-based spatial clustering of applications with noise (DBSCAN) algorithm has been well studied in the database domain for clustering multi-dimensional data to extract arbitrarily shaped clusters. Recently, with the growing interest in big data and the increasing diversification of data, the typical size and volume of databases have increased and data have become increasingly high-dimensional. Therefore, a large number of speed-up techniques for DBSCAN, including exact and approximate approaches, have been proposed. The fastest DBSCAN algorithm is the cell-based algorithm, which divides the whole data set into small cells. In this paper, we propose a novel exact cell-based DBSCAN algorithm using minimum bounding rectangle (MBR) criteria. Connecting cells is the most time-consuming step of the cell-based algorithm, and the proposed algorithm processes this step at high speed by using MBR criteria. We implemented the proposed cell-based DBSCAN algorithm and show that it outperforms the conventional one in high dimensions.

Tatsuhiro Sakai, Keiichi Tamura, Hajime Kitakami
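
The abstract above does not define its MBR criteria, so the following sketch only illustrates one plausible way bounding rectangles can prune work when connecting cells: if the minimum distance between the MBRs of the points in two cells exceeds eps, no pair of points across the cells can be within eps, and the pairwise check can be skipped. The function names and the toy cells are hypothetical.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a non-empty list of points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def mbr_min_distance(a, b):
    """Smallest Euclidean distance between two axis-aligned rectangles."""
    (alo, ahi), (blo, bhi) = a, b
    squared = 0.0
    for d in range(len(alo)):
        # Gap along dimension d; zero when the intervals overlap.
        gap = max(blo[d] - ahi[d], alo[d] - bhi[d], 0.0)
        squared += gap * gap
    return math.sqrt(squared)

def may_connect(points_a, points_b, eps):
    """False means the two cells certainly contain no pair within eps."""
    return mbr_min_distance(mbr(points_a), mbr(points_b)) <= eps

cell_a = [(0.0, 0.0), (1.0, 1.0)]
cell_b = [(4.0, 5.0), (6.0, 6.0)]
```

The test is a lower bound only: when `may_connect` returns True, an exact algorithm still has to compare the points themselves.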

Time-Aware and Topic-Based Reviewer Assignment

Peer review has become the most widely used mechanism for judging the quality of papers submitted to academic conferences and journals. However, assigning papers to appropriate reviewers is a challenging task. Both the research directions of reviewers and the topics of submitted papers are often multifaceted. Moreover, reviewers’ research directions may change over time, and their more recently published papers better reflect their current research direction. Hence, in this paper we present a time-aware and topic-based reviewer assignment model. We first crawl the papers that reviewers have published over the years from the web, and then build a time-aware personal profile for each reviewer using a topic model to represent the reviewer’s expertise. The degree of relevance between a reviewer and a submitted paper is then calculated with a similarity measure. In addition, by considering statistical characteristics of the papers such as TF-IDF, the matching degree between reviewer and submitted paper is further improved. We also consider the quality of all past reviews when assessing reviewers’ present reviews. Extensive experiments on a real-world dataset demonstrate the effectiveness of the proposed method.

Hongwei Peng, Haojie Hu, Keqiang Wang, Xiaoling Wang

Adaptive Bayesian Network Structure Learning from Big Datasets

Since big data contain more comprehensive probability distributions and richer causal relationships than conventional small datasets, discovering Bayesian network (BN) structure from big datasets is becoming increasingly valuable for modeling and reasoning under uncertainty in many areas. Faced with big data, most current BN structure learning algorithms have limitations. First, learning BN structure from big datasets is an expensive process with a high computational cost, and it often ends in failure. Second, given an arbitrary dataset as input, it is very difficult to choose one algorithm from numerous candidates that consistently achieves good learning accuracy. To address these issues, we introduce a novel approach called Adaptive Bayesian network Learning (ABNL). ABNL begins with an adaptive sampling process that extracts a sufficiently large data partition from any big dataset for fast structure learning. Then, ABNL feeds the data partition to different learning algorithms to obtain a collection of BN structures. Lastly, ABNL adaptively chooses structures and merges them into a final network structure using an ensemble method. Experimental results on four big datasets show that ABNL achieves significantly better performance than whole-dataset learning and more accurate results than baseline algorithms.

Yan Tang, Qidong Zhang, Huaxin Liu, Wangsong Wang

A Novel Approach for Author Name Disambiguation Using Ranking Confidence

In digital libraries, ambiguous author names may occur because multiple authors share the same name or because the same person appears under different name variations. In recent years, name disambiguation has become a major challenge when integrating data from multiple sources in bibliographic digital libraries. Most previous works address this issue using many attributes, such as coauthors, titles of articles/publications, topics of articles, and years of publication. However, in most cases only the coauthor and title attributes are available. In this paper, we propose an approach that is based on Hierarchical Agglomerative Clustering (HAC) and uses only the coauthor and title attributes, yet disambiguates authors more effectively. The algorithm is divided into two stages. In the first stage, we employ a pair-wise grouping algorithm based on coauthors’ names to group records into clusters. Then, we merge two clusters if the similarity of their article titles reaches a threshold. Here, we use three similarity measures, namely Jaccard Similarity, Cosine Similarity and Euclidean Distance, to compare the titles of two clusters. To minimize the risk of relying on a single similarity metric, we introduce the concept of ranking confidence to measure the confidence of the different similarity measurements. The ranking confidence decides which similarity measure to use when merging clusters. In the experiments, we use PairPrecision, PairRecall and PairF1 scores to evaluate our method and compare it with other methods. Experimental results indicate that our method significantly outperforms the baseline methods HAC, K-means and SACluster when only the coauthor and title attributes are used.

Xueqin Lin, Jia Zhu, Yong Tang, Fen Yang, Bo Peng, Weiling Li
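
Two of the title similarity measures named in the abstract above, Jaccard similarity and cosine similarity, can be sketched on tokenized titles as follows. The tokenization by whitespace, the function names and the two example titles are assumptions; the ranking confidence mechanism itself is not reproduced here because the abstract does not define it.

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def cosine(tokens_a, tokens_b):
    """Cosine of the angle between the two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical titles from two candidate clusters.
title_1 = "efficient query processing on graphs".split()
title_2 = "query processing on large graphs".split()

jaccard_score = jaccard(title_1, title_2)  # 4 shared words out of 6 distinct
cosine_score = cosine(title_1, title_2)
```

The two measures can disagree on the same pair of titles, which is the kind of situation in which choosing among metrics per merge, rather than fixing one in advance, becomes relevant.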

Capture Missing Values with Inference on Knowledge Base

Data imputation is a basic step in data cleaning. Traditional data imputation approaches lack accuracy in the absence of knowledge. Involving a knowledge base in imputation can overcome this shortcoming. A challenge is that the missing value can hardly be found directly in knowledge bases (KBs). To make full use of knowledge bases for imputation, we present FOKES, an inference algorithm on knowledge bases. The inference not only makes full use of true facts in KBs, but also utilizes types to ensure the accuracy of the captured missing values. Extensive experiments show that our proposed algorithm can capture missing values efficiently and effectively.

Zhixin Qi, Hongzhi Wang, Fanshan Meng, Jianzhong Li, Hong Gao

Weakly-Supervised Named Entity Extraction Using Word Representations

Named entity extraction is a key subtask of Information Extraction (IE), and also an important component of many Natural Language Processing (NLP) and Information Retrieval (IR) tasks. This paper proposes a weakly-supervised named entity extraction method that learns word representations on a web-scale corpus. The highlights of our method include: (1) word representations can be trained on either web documents or query logs; (2) finding correct named entities is guided by a small set of seed entities, without any need for domain knowledge or human labor, allowing the acquisition of named entities in any domain. Extensive experiments have been conducted to verify the effectiveness and efficiency of our method in comparison with state-of-the-art approaches.

Kejun Deng, Dongsheng Wang, Junfei Liu

RDF Data Assessment Based on Metrics and Improved PageRank Algorithm

With the development of the Internet, large amounts of data have appeared online. However, these data cannot be used efficiently because they lack validity and credibility, so data trust assessment has become a hot research topic in the web field. Considering the close relationship between the credibility of data and its provenance, this paper proposes quantization rules built on an existing trust evaluation model. Moreover, because of the similarities between web pages and RDF data, an improved PageRank algorithm is put forward to filter out invalid datasets. Finally, a comprehensive experiment is carried out on the DataHub dataset. The results show that the proposed quantization rules and the improved PageRank algorithm greatly improve the ranking of the datasets and reduce the effect of invalid datasets on the ranking.

Kai Wei, Pingfang Tian, Jinguang Gu, Li Huang
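
For orientation, the sketch below shows the standard PageRank power iteration on a small link graph, which is the baseline that the abstract above says it improves upon. The paper's improvements and its quantization rules are not reproduced; the damping factor of 0.85, the iteration count, the dangling-node handling and the toy graph of datasets linking to one another are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }
        for v, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: an edge means one dataset links to another.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
```

In this toy graph the node that nothing links to ends up with the lowest score, which hints at how link structure can help demote poorly connected datasets.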

Efficient Web-Based Data Imputation with Graph Model

A challenge for data imputation is the lack of knowledge. In this paper, we attempt to address this challenge by drawing on extra knowledge from the web. To achieve high-performance web-based imputation, we use dependencies, i.e., FDs and CFDs, to impute as many values as possible automatically, and fill in the remaining missing values with minimal web access, whose cost is relatively large. To make sufficient use of dependencies, we model the dependency set on the data as a graph, and perform automatic imputation and keyword generation for web-based imputation based on this graph model. With the generated keywords, we design two algorithms to extract values for imputation from the search results. Extensive experimental results on real-world data collections show that the proposed approach imputes missing values efficiently and effectively compared to existing approaches.

Yiwen Tang, Hongzhi Wang, Shiwei Zhang, Huijun Zhang, Ruoxi Shi

A New Schema Design Method for Multi-tenant Database

Existing multi-tenant database systems either emphasize high performance and scalability at the expense of customization, or provide sufficient customization at the cost of performance and scalability. New efficient methods are needed to address these limitations. In this paper, we propose a customized database schema design framework that supports schema customization for different tenants without sacrificing performance and scalability. We propose a customized schema integration method to help tenants better design their customized schemas. To effectively integrate the customized schemas, we devise an interactive recommendation technique, a hierarchical agglomerative clustering algorithm and a multi-tenancy integration algorithm based on schema and instance information. We propose a graph partition method to reorganize the integrated tables and develop optimization techniques from both the space and the workload perspectives. In addition, our customized method can adapt to any schema and query workload. Further, our method can be easily applied to existing databases with minor revisions. Experimental results show that our method achieves better performance and higher scalability, with schema customization, than state-of-the-art methods.

Yaoqiang Xu, Jiacai Ni

SeCoP

Frontmatter

A Recommendation Platform

Most book sellers refrain from offering books that are of too little interest to potential customers. As a result, such sellers may lose profit, because product popularity can vary from place to place. To avoid these losses, this paper introduces Reader’s Choice, a system that recommends which books sellers should offer based on the interests of people in different locations. Generally, most residents in the same vicinity share similar interests. From search trends, Reader’s Choice learns and outputs the popularity of books in various regions: searches and purchases help it determine where books are frequently sought or bought. Accordingly, Reader’s Choice can suggest products in the regions where they were most often searched for and sold. In essence, Reader’s Choice analyzes trends in datasets to draw insights. It employs Hadoop for the storage and analysis of search results and transactions. A careful performance evaluation attests that Reader’s Choice has the best functionality and the second-best information retrieval metrics among competing book recommendation systems.

Sayar Kumar Dey, Günter Fahrnberger

Accelerating Convolutional Neural Networks Using Fine-Tuned Backpropagation Progress

In computer vision, many tasks have achieved state-of-the-art performance using convolutional neural networks (CNNs) [11], typically at the cost of massive computational complexity. A key problem is that training is slow, which can take a long time, especially when computational resources are limited. The focus of this paper is speeding up the training process by fine-tuning the backpropagation process. More specifically, we first train the CNN with standard backpropagation. Once the feature extraction layers have learned good features, we stop backpropagating through all layers, so that the loss is back-propagated only through the fully connected layers. This not only saves time but also focuses training on the classifier, achieving the same or better results compared with using standard backpropagation throughout. Comprehensive experiments on JD (https://www.jd.com/) datasets demonstrate a significant reduction in computational time at the cost of a negligible loss in accuracy.

Yulong Li, Zhenhong Chen, Yi Cai, Dongping Huang, Qing Li

A Personalized Learning Strategy Recommendation Approach for Programming Learning

Recommending learning strategies to different learners has become a significant problem in programming learning projects. This paper discusses a personalized learning strategy recommendation approach to aid programming learning. An improved design method for modeling learner strategies and an approach for recommending programming learning strategies are presented. A reward factor is adopted to help construct the learning strategy recommendation mechanism adaptively. Based on these models, a programming learning strategy recommendation system (ZZULI-PLS) is proposed to help learners learn programming according to their actual progress. Usability tests are conducted to validate the recommendation efficiency of the ZZULI-PLS system.

Peipei Gu, Junxia Ma, Wei Chen, Lujuan Deng, Lan Jiang

Wikipedia Based Short Text Classification Method

Short texts are usually terse and carry insufficient information, which makes text classification difficult. However, information from an existing knowledge base can be introduced to strengthen the performance of short text classification. Wikipedia [2, 13, 15] is now the largest human-edited knowledge base of high quality, and short text classification would benefit from making full use of its information. This paper presents a new short text representation method based on Wikipedia concepts [22]: it identifies the Wikipedia concepts mentioned in a short text and then incorporates the correlations of those concepts, along with the short text itself, into the feature vector representation.

Junze Li, Yi Cai, Zhiwei Cai, Hofung Leung, Kai Yang

An Efficient Boolean Expression Index by Compression

Boolean expressions (BEs) have been widely used to represent subscriptions and publications in Publish/Subscribe (Pub/Sub) applications. Such applications frequently exhibit high dimensionality caused by the diversity of attributes and values. Given this high dimensionality, efficiently indexing BEs (space optimization) and then matching events against the indexed BEs (matching efficiency) becomes a challenging task. Unfortunately, no existing Pub/Sub solution meets both objectives without compromising one of them. In this paper, we propose a novel approach, namely BE-Matrix, to address these issues. First, we model BE subscriptions as a binary matrix and then encode the binary matrix into lists of encoded numbers, i.e., encoding lists, for lower space cost. Next, by applying fast bitwise AND operations over the encoding lists without a costly decoding phase, we significantly speed up the matching process. Finally, we conduct extensive experiments to compare BE-Matrix with the state-of-the-art indexing approach BE-Tree [9, 10]. The results show that BE-Matrix outperforms BE-Tree, using 12.72× less space and running 11.38× faster.

Jin Tao, Chenxi Zhang, Weixiong Rao
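
The general idea of matching events against subscriptions with bitwise AND can be shown in a simplified form. This sketch is not the BE-Matrix encoding: it assumes each subscription is a plain conjunction of predicates, packs each row of the binary matrix into one Python integer, and treats an event as the set of predicates it satisfies. All names and the example predicates are hypothetical.

```python
def build_index(predicates):
    """Assign each predicate a bit position (a column of the binary matrix)."""
    return {p: i for i, p in enumerate(predicates)}

def encode(items, index):
    """Pack a set of predicates into an integer bitmask (one matrix row)."""
    mask = 0
    for item in items:
        mask |= 1 << index[item]
    return mask

def match(event_mask, subscription_masks):
    """Return subscriptions whose required bits are all set in the event."""
    return [
        name
        for name, mask in subscription_masks.items()
        if (mask & event_mask) == mask
    ]

index = build_index(["age>=18", "city=NY", "lang=en", "premium"])
subscriptions = {
    "s1": encode({"age>=18", "city=NY"}, index),
    "s2": encode({"lang=en", "premium"}, index),
    "s3": encode({"age>=18"}, index),
}
# Predicates satisfied by a hypothetical incoming event.
event = encode({"age>=18", "city=NY", "lang=en"}, index)
matched = match(event, subscriptions)
```

Each match test is a single AND and comparison on machine-friendly integers, which is why working directly on the encoded form avoids a decoding step.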

MOOCon: A Framework for Semi-supervised Concept Extraction from MOOC Content

Recent years have witnessed the rapid development of Massive Open Online Courses (MOOCs). MOOC platforms not only offer a one-stop learning setting, but also aggregate a large number of courses with various kinds of textual content, e.g., video subtitles, quizzes and forum content. MOOCs are also regarded as a large-scale ‘knowledge base’ that covers various domains. However, all the content generated by instructors and learners is unstructured. To structure the data for further knowledge management and mining, the first step could be concept extraction. In this paper, we utilize human knowledge through data labeling and propose a framework for concept extraction based on machine learning methods. The framework flexibly supports semi-supervised learning in order to reduce the human effort of labeling training data. In addition, course-agnostic features are designed for modeling cross-domain data. Experimental results demonstrate that only 10% labeled data can lead to acceptable performance, and that the semi-supervised learning method is comparable to the supervised version under the same framework. We find that textual content of various forms, i.e., subtitles, PPTs and questions, should be processed separately due to differences in form. Finally, we evaluate a new task: identifying needs for concept comprehension. Our framework works well at identification on forum content while learning a model from subtitles.

Zhuoxuan Jiang, Yan Zhang, Xiaoming Li

What Decides the Dropout in MOOCs?

Based on datasets from the MOOCs of Peking University running on the Coursera platform, we extract 19 major features of tune in after analyzing the log structure. To begin with, we focus on the characteristics of the start and dropout points of learners through statistics on their start and dropout times. Then we construct two models. First, several machine learning approaches are used to build a sliding window model for predicting dropout probabilities in a given course. Second, an SVM is used to build a model for predicting whether a student will get a score at the end of the course. For instructors and designers of MOOCs, dynamically tracking dropout records could help improve course quality and thereby reduce the dropout rate.

Xiaohang Lu, Shengqing Wang, Junjie Huang, Wenguang Chen, Zengwang Yan

Exploring N-gram Features in Clickstream Data for MOOC Learning Achievement Prediction

MOOCs have emerged as an online educational model in recent years. With the development of big data technology, MOOC platforms can mine a huge amount of learning behavior data. Using machine learning to mine learners’ past clickstream data and predict their future learning achievement has recently become a hot research topic. Previous methods consider only static counting-based features and ignore the correlative, temporal and fragmented nature of MOOC learning behavior, and are thus limited in interpretability and prediction accuracy. In this paper, we explore the effectiveness of N-gram features in clickstream data and model MOOC learning achievement prediction as a multiclass classification task that classifies learners into four achievement levels. Through extensive experiments on four real-world MOOC datasets, we empirically demonstrate that our methods significantly outperform the state-of-the-art methods.

Xiao Li, Ting Wang, Huaimin Wang
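
As an illustration of what N-gram features over a clickstream look like, the sketch below counts all contiguous event sequences up to a given length in one learner's ordered click log. The event names, the function name `ngram_features` and the choice of unigrams plus bigrams are assumptions; the paper's exact feature design is not given in the abstract.

```python
from collections import Counter

def ngram_features(events, n_max=2):
    """Count every contiguous run of 1..n_max events in the sequence."""
    features = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(events) - n + 1):
            features[tuple(events[i:i + n])] += 1
    return features

# Hypothetical ordered click events of a single learner.
clicks = ["play", "pause", "play", "seek", "play", "pause"]
features = ngram_features(clicks)
```

Unlike plain event counts, the bigram `("play", "pause")` records the order in which actions occur, so two learners with identical counts but different behaviour sequences receive different feature vectors.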

Predicting Student Examinee Rate in Massive Open Online Courses

Over the past few years, massive open online courses (a.k.a. MOOCs) have rapidly emerged and become popular as a new education paradigm. Despite the various features and benefits offered by MOOCs, unlike in traditional classroom-style education, students enrolled in MOOCs show a wide variety of motivations, and only a small percentage of them participate in the final examinations. To investigate the underlying reasons, we make two key contributions in this paper. First, we find that being an examinee is almost a necessary condition for a learner to earn a certificate, and hence investigating examinee rate prediction is of great importance. Second, after extensively investigating participants’ operation behaviours, we carefully select a set of features that closely reflect participants’ learning behaviours. We apply commonly used classifiers to three online courses, generously provided by the China University MOOC platform, to evaluate the effectiveness of the selected features. Based on our experiments, we find that no single classifier dominates the others in all cases, and that in many cases SVM performs best.

Wei Lu, Tongtong Wang, Min Jiao, Xiaoying Zhang, Shan Wang, Xiaoyong Du, Hong Chen

In a massive online course with hundreds of thousands of students, it is infeasible to provide an accurate and fast evaluation for each submission. Researchers have proposed peer grading algorithms for richly structured assignments. These algorithms can deliver fairly accurate evaluations by aggregating peer grading results, but they do not improve the effectiveness of allocating submissions. Allocating submissions to peers is an important step that precedes peer grading. In this paper, inspired by the Longest Processing Time (LPT) algorithm, which is often used in parallel systems, we propose a Modified Longest Processing Time (MLPT) algorithm that improves allocation efficiency. The dataset used in this paper consists of two parts: one is collected from our MOOC platform, and the other is manually generated as a simulation dataset. Experimental results on both types of datasets validate the effectiveness of MLPT.

Yong Han, Wenjun Wu, Yanjun Pu
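
For readers unfamiliar with the classic Longest Processing Time rule that the abstract above builds on, here is a minimal sketch. It implements standard LPT only, not the authors' MLPT modification, which the abstract does not specify. The function name `lpt_schedule` and the sample grading costs are assumptions made for illustration.

```python
import heapq


def lpt_schedule(durations, workers):
    """Classic LPT: take jobs in decreasing duration and give each one
    to the currently least-loaded worker."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(workers)}
    # Stable sort keeps the original order among jobs of equal duration
    for job, cost in sorted(enumerate(durations), key=lambda jc: -jc[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(job)
        heapq.heappush(heap, (load + cost, w))
    return assignment


# Hypothetical grading costs for six submissions, split between two graders
costs = [7, 5, 4, 3, 3, 2]
plan = lpt_schedule(costs, 2)
loads = {w: sum(costs[j] for j in jobs) for w, jobs in plan.items()}
print(plan, loads)
```

In a peer grading setting, the "jobs" would be submissions and the "workers" would be graders. A real allocation would also need constraints the sketch omits, such as several graders per submission and no self-grading.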

Predicting Honors Student Performance Using RBFNN and PCA Method

This paper proposes a predictive model that combines Principal Component Analysis (PCA) with a radial basis function neural network (RBFNN) to accurately predict the performance of honors students by analyzing their personalized characteristics. The model consists of two phases: PCA is first applied to reduce the dimensionality of the honors student dataset; the extracted principal features are then used as the input of an RBF neural network to build a three-layer RBF neural network predictive model. Compared with other neural network models, the PCA-RBF predictive model demonstrates faster convergence, higher predictive accuracy and stronger generalization ability. Moreover, the model enables honors program administrators to identify at-risk honors students at an early stage, and allows their academic advisors to provide appropriate advising in a timely manner.

Moke Xu, Yu Liang, Wenjun Wu

DKG: An Expanded Knowledge Base for Online Course

Recent years have witnessed a proliferation of large-scale online education platforms. However, the learning materials provided by online courses are still limited. In this paper, to expand the learning materials on MOOC platforms, we construct an expanded knowledge base named DKG. DKG combines prior knowledge from a concept map with extended textual fragments collected from web sources. To ensure DKG’s quality, we also propose a supervised method with four novel features to evaluate the quality of textual fragments. Finally, we conduct experiments on four online courses. The results show that our method can find good textual fragments efficiently and expand learning materials successfully.

Haimeng Duan, Yuanhao Zheng, Lei Shi, Changhong Jin, Hongwei Zeng, Jun Liu

Towards Economic Models for MOOC Pricing Strategy Design

MOOCs have brought unprecedented opportunities to make high-quality courses accessible to everybody. However, from a business point of view, MOOCs are often criticized for lacking sustainable business models, and academic research on marketing strategies for MOOCs is currently a blind spot. In this work, we try to formulate business models and pricing strategies in a structured and scientific way. Based on both theoretical research and analysis of real marketing data from a MOOC platform, we present insights into pricing strategies for existing MOOC markets. We focus on pricing strategies for verified certificates in B2C markets, and also give ideas for modeling course sub-licensing services in B2B markets.

Yongzheng Jia, Zhengyang Song, Xiaolan Bai, Wei Xu

Using Pull-Based Collaborative Development Model in Software Engineering Courses: A Case Study

The pull-based development model is an emerging way of contributing to distributed software projects within Open Source Software (OSS) communities. To train students’ development skills with this modern paradigm and evaluate its effects, we designed a pull-based development process for classroom settings. In addition, we built a support environment for the process and integrated it into TRUSTIE, a popular teaching platform. With this platform, we further conducted a case study to investigate how students benefit from this process and what challenges exist. In this experiment, 22 students worked in 5 groups to independently complete an in-classroom programming project. Quantitative and qualitative results show that the pull-based work model in the classroom has some characteristics that differ from those in the OSS context, and also provide constructive advice for future practice.

Yao Lu, Xinjun Mao, Gang Yin, Tao Wang, Yu Bai

A Method of Constructing the Mapping Knowledge Domains in Chinese Based on the MOOCs

While the number of MOOC users in China has been increasing dramatically, users still risk giving up halfway due to unfamiliarity with the course structure, prerequisite courses and so on. A mapping knowledge domain (MKD) plays an important role in resolving this problem: it provides a clear structural map of the course and helps users understand the appropriate learning path as well as the relationships between knowledge points. In this article, we propose a method for constructing a usable MKD in Chinese based on online courses. The online course data are obtained by web crawling from MOOC sites and then processed through data cleaning and data fusion, after which the MKD is extracted and evaluated. The method is applied to all the existing MOOC courses and thus shows practical significance.

Zhengzhou Zhu, Yang Li, Youming Zhang, Zhonghai Wu

Social Friendship-Aware Courses Arrangement on MOOCs

Massive open online courses (MOOCs) provide an opportunity for learners to access free courses offered by top universities in the world. However, in contrast to their large-scale enrollment, the completion rate of these courses is very low. One of the reasons students quit is that they cannot study the courses with their friends. To improve the completion rate, we address the importance of content interest and social friendship in arranging courses for learners in MOOCs. We first develop a greedy algorithm that solves the arrangement problem according to the friendship of learners and the content of courses. We then use a game-theoretic framework to improve the performance of the greedy algorithm. Finally, we verify the effectiveness and efficiency of the proposed solutions through extensive experiments on both real and synthetic datasets.

Yuan Liang

Quality-Aware Crowdsourcing Curriculum Recommendation in MOOCs

With ever larger numbers of students participating in Massive Open Online Courses (MOOCs), finding the top-k suitable courses in terms of course quality is becoming an increasingly challenging issue for students, because course quality is hard for computers to compare. Thanks to emerging crowdsourcing platforms, the crowd can be assigned to compare the objects, and the top-k objects can be inferred from the crowdsourced comparison results. In this paper, we focus on one such function, top-k, that finds the k highest-ranked objects. We then provide heuristic functions to recommend the top-k elements given evidence. We experimentally evaluate our functions to highlight their strengths and weaknesses.

Yunpeng Gao
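
To make the idea of inferring a top-k list from crowdsourced pairwise comparisons concrete, the sketch below ranks items by how many comparisons they won. This win-count rule is a simple stand-in chosen for illustration; it is not claimed to be one of the heuristic functions proposed in the paper. The course labels and the name `top_k_by_wins` are likewise assumptions.

```python
from collections import Counter


def top_k_by_wins(comparisons, k):
    """Rank items by number of pairwise wins; ties are broken by name."""
    wins = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0  # make sure items that never win are still ranked
    ranked = sorted(wins, key=lambda item: (-wins[item], item))
    return ranked[:k]


# Hypothetical crowd votes: each pair is (preferred course, other course)
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
best_two = top_k_by_wins(votes, 2)
print(best_two)
```

A win count ignores how reliable each worker is and how many times each pair was compared, which is the kind of evidence a quality-aware method would need to weigh.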

Crowdsourcing Based Teaching Assistant Arrangement for MOOC

With the development of new web technologies, the Massive Open Online Course (MOOC), which aims at unlimited participation and access, is emerging. In contrast to traditional education, learners can access filmed lectures and tests online anytime and anywhere. However, some problems with MOOCs remain. Currently, one major problem is the imbalance between teachers and learners online. In many courses, thousands of learners enroll in a class with a single instructor, which can lead to poor learning outcomes and very low completion rates. We therefore propose crowdsourcing-based teaching assistant assignment for MOOCs in order to optimize the allocation of manpower. We present effective algorithms for the selection and assignment of teaching assistants. With experiments on various datasets, we verify the effectiveness of our proposed methods.

Dezhi Sun, Bo Liu

Quantitative Analysis of Learning Data in a Programming Course

Online learning platforms, which have taken higher education by storm, provide an opportunity to track students’ learning behaviors. The vast majority of educational data mining research has been carried out on online learning platforms in Europe and America, but little of it uses data from large-scale programming courses. In this paper, we track students’ code submissions for assignments in a programming course and collect a total of 17,854 submissions with the help of Trustie, a well-known online education platform in China. We perform a preliminary exploratory inspection of the code quality of these submissions using SonarQube. The analysis reveals several interesting observations about programming courses. For example, the results show that logical training is more important than grammar training. Moreover, the analysis itself also gives instructors useful feedback on students’ learning outcomes so that they can improve their teaching in time.

Yu Bai, Liqian Chen, Gang Yin, Xinjun Mao, Ye Deng, Tao Wang, Yao Lu, Huaimin Wang

Backmatter
