

From Social Data Mining and Analysis to Prediction and Community Detection


About this book

This book presents the state of the art in various aspects of the analysis and mining of online social networks. Within that broader context, it focuses on important and emerging topics of social network analysis and mining, such as the latest research on sentiment trends and a variety of techniques for community detection and analysis. The book collects chapters that are expanded versions of the best papers presented at the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM’2015), held in Paris, France, in August 2015. All papers have been peer reviewed and checked carefully for overlap with the literature. The book will appeal to students and researchers in social network analysis/mining and machine learning.

Table of Contents

Frontmatter
An Offline–Online Visual Framework for Clustering Memes in Social Media
Abstract
The amount of data generated in Online Social Networks (OSNs) grows every day, and extracting and understanding trending topics and events from this vast amount of data is an important area of OSN research. This paper proposes a novel clustering framework to detect the spread of memes in OSNs in real time. The Offline–Online meme clustering framework exploits several similarity scores between different elements of Reddit submissions, combining them with the help of Wikipedia concepts as external knowledge, text semantic similarity, and a modified version of the Jaccard coefficient. Two combination strategies are supported: (1) automatically computing the similarity-score weighting factors for five elements of a submission, and (2) letting users engage in the clustering process through a visualization prototype to filter out outlier submissions, modify submission class labels, or assign different similarity-score weighting factors to the various elements of a submission. In the offline step, the framework performs a one-pass clustering of existing OSN data, computing and summarizing the statistics of each cluster using Wikipedia concepts. In the online step, it assigns new streaming data points to the appropriate clusters using a modified version of online k-means. The experimental results show that using Wikipedia as external knowledge together with text semantic similarity improves both the speed and the accuracy of meme clustering compared to the baselines. For the online clustering process, a damped window model is well suited to streaming environments: it requires low prediction and training costs while assigning more weight to recent data and popular topics.
Anh Dang, Abidalrahman Moh’d, Anatoliy Gruzd, Evangelos Milios, Rosane Minghim
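To make the online step concrete, here is a minimal Python sketch of online k-means with a damped (exponentially decaying) window. The decay factor, the centroid update rule, and the seed data are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    class DampedOnlineKMeans:
        """Minimal online k-means with a damped window, so recent points
        influence centroids more than old ones (illustrative sketch)."""

        def __init__(self, centroids, decay=0.95):
            self.centroids = np.asarray(centroids, dtype=float)
            self.weights = np.ones(len(self.centroids))  # per-cluster damped mass
            self.decay = decay

        def assign(self, x):
            """Assign a streaming point to the nearest centroid and update
            that centroid with a damped running mean."""
            x = np.asarray(x, dtype=float)
            self.weights *= self.decay   # age every cluster's accumulated mass
            k = np.argmin(np.linalg.norm(self.centroids - x, axis=1))
            self.weights[k] += 1.0
            # Weighted incremental mean: recent points dominate older ones.
            self.centroids[k] += (x - self.centroids[k]) / self.weights[k]
            return k

    # Example: two seed clusters from the offline pass, then streaming points.
    model = DampedOnlineKMeans(centroids=[[0.0, 0.0], [10.0, 10.0]])
    for point in [[0.5, 0.2], [9.8, 10.1], [0.1, -0.3]]:
        print(point, "->", model.assign(point))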
A System for Email Recipient Prediction
Abstract
The ability to accurately predict the recipients of an email while it is being composed is of great practical importance for two reasons. First, recipient prediction enables effective “auto-complete” of the recipient field, improving the user experience and reducing the overhead of manual typing. Second, it allows the system to alert the user when she has typed unlikely recipients. Such alerts can help avoid human errors such as forgetting relevant recipients or, even worse, disclosing personal or classified information. In this article, we present a system that effectively predicts email recipients given an email history. The system takes a variety of email-related features into consideration to achieve high accuracy. Extensive experimentation on diverse email corpora shows that the system adapts well to a variety of domains, such as business, personal, and political email.
Zvi Sofershtein, Sara Cohen
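The ranking idea behind such a system can be sketched with two simple, assumed features: send frequency and recency. The half-life, the scoring scheme, and the rank_recipients helper below are hypothetical illustrations, not the system's actual feature set.

    from collections import defaultdict

    def rank_recipients(history, prefix="", now=1000.0, half_life=100.0):
        """Score candidate recipients from an email history using two
        illustrative features: send frequency and recency (each past send
        contributes a count that decays exponentially with age).
        `history` is a list of (timestamp, [recipients]) pairs."""
        scores = defaultdict(float)
        for ts, recipients in history:
            for r in recipients:
                scores[r] += 0.5 ** ((now - ts) / half_life)
        candidates = [r for r in scores if r.startswith(prefix)]
        return sorted(candidates, key=lambda r: -scores[r])

    history = [
        (900.0, ["alice@example.com", "bob@example.com"]),
        (990.0, ["alice@example.com"]),
        (300.0, ["bob@example.com", "carol@example.com"]),
    ]
    print(rank_recipients(history, prefix="a"))   # auto-complete on "a"
    print(rank_recipients(history)[:2])           # top suggestions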
A Credibility Assessment Model for Online Social Network Content
Abstract
Online social networks such as Twitter are among the most important sources of information in the current era of information overload, restiveness, and uncertainty. It is therefore necessary to develop a model for verifying information from Twitter, which is a challenging task. We propose a new credibility assessment model for identifying implausible content on Twitter, with the goal of preventing the proliferation of false or malicious information. The proposed model consists of six integrated components operating in an algorithmic pipeline to assess the credibility of tweets. We enhanced our classifier by weighting the features extracted from tweets according to their relative importance. We then applied the model to two datasets created from 155,794 unique accounts. To evaluate performance, we trained two naïve Bayes models: M1 (without the relative-importance algorithm) and M2 (with it). The results were encouraging: M2 achieved accuracies of 82.25% and 85.47% on the two datasets.
Majed Alrubaian, Muhammad Al-Qurishi, Mabrook Al-Rakhami, Atif Alamri
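A minimal sketch of feature weighting in a naïve Bayes classifier follows, assuming the weights enter as multipliers on the per-feature log-likelihoods; the chapter's actual relative-importance algorithm, features, and data differ.

    import numpy as np

    class WeightedGaussianNB:
        """Gaussian naive Bayes where each feature's log-likelihood is
        scaled by a relative-importance weight:
        log p(c|x) ~ log p(c) + sum_j w_j * log N(x_j; mu_cj, var_cj)."""

        def fit(self, X, y, weights=None):
            self.classes = np.unique(y)
            d = X.shape[1]
            self.w = np.ones(d) if weights is None else np.asarray(weights, float)
            self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
            self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
            self.log_prior = np.log([np.mean(y == c) for c in self.classes])
            return self

        def predict(self, X):
            # Per-sample, per-class, per-feature Gaussian log-likelihoods.
            ll = -0.5 * (np.log(2 * np.pi * self.var)[None]
                         + (X[:, None, :] - self.mu[None]) ** 2 / self.var[None])
            scores = self.log_prior + (ll * self.w).sum(axis=2)
            return self.classes[np.argmax(scores, axis=1)]

    # Synthetic demo: feature 0 carries the signal, so up-weighting it helps.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 4))
    y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

    m1 = WeightedGaussianNB().fit(X[:200], y[:200])                        # M1
    m2 = WeightedGaussianNB().fit(X[:200], y[:200], weights=[3, 1, 1, 1])  # M2
    for name, m in [("M1", m1), ("M2", m2)]:
        print(name, "accuracy:", np.mean(m.predict(X[200:]) == y[200:]))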
Web Search Engine-Based Representation for Arabic Tweets Categorization
Abstract
In microblogging services such as Twitter, users post short text messages called tweets, which are limited in length. These tweets often express opinions about different topics and are presented to the user in chronological order. Because short texts do not provide sufficient contextual information, traditional text representation methods have several limitations when applied directly to short-text tasks. To tackle these issues, we propose to enrich the representation of Arabic tweets by exploiting the internal semantics of the original tweets together with external knowledge from the web, treated as a large open corpus, and by drawing on Rough Set Theory, a mathematical tool for dealing with vagueness and uncertainty.
To test our enrichment method, we build an Arabic tweet categorization system and evaluate its effectiveness, in terms of the F1-measure, using Naïve Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT) classifiers.
Mohammed Bekkali, Abdelmonaime Lachkar
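The evaluation setup can be illustrated with a small Python sketch that compares the three classifiers by macro F1 on TF-IDF vectors. The toy English corpus and the enrich stub are stand-ins for the Arabic data and the web-based enrichment, which the chapter defines.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def enrich(tweet, external_terms):
        """Stand-in for the enrichment step: append terms pulled from an
        external corpus (e.g., web search snippets) to the short tweet."""
        return tweet + " " + " ".join(external_terms.get(tweet, []))

    # Toy corpus; the real tweets are Arabic and the enrichment web-derived.
    tweets = ["match goal team", "election vote party", "goal striker win",
              "party candidate vote", "team win cup", "vote poll candidate"] * 10
    labels = ["sport", "politics", "sport", "politics", "sport", "politics"] * 10

    docs = [enrich(t, {}) for t in tweets]
    X_tr, X_te, y_tr, y_te = train_test_split(docs, labels, random_state=0)

    vec = TfidfVectorizer()
    Xtr, Xte = vec.fit_transform(X_tr), vec.transform(X_te)
    for name, clf in [("NB", MultinomialNB()), ("SVM", LinearSVC()),
                      ("DT", DecisionTreeClassifier())]:
        y_pred = clf.fit(Xtr, y_tr).predict(Xte)
        print(name, "F1:", f1_score(y_te, y_pred, average="macro"))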
Sentiment Trends and Classifying Stocks Using P-Trees
Abstract
For people who are new to the financial markets but interested in investing in stocks, it is important to know which ticker symbols to follow so that informed investment decisions can be made. In this work, we propose classifying stock ticker symbols from tweets vertically using P-Trees. The solution described in this paper analyzes 3000 financial and news symbols from the Twitter platform vertically and finds the ticker symbols that are most frequently discussed. It also provides the ability to scan the tweet texts associated with common ticker occurrences, so that context can be identified to help users make better-informed business decisions. The paper also discusses investor bias and the effect it has on the volatility of stocks in the market. We further show how sentiment analysis can be run on the pulled tweets and why we chose the Microsoft Azure Sentiment Analyzer over other sentiment analysis tools. Finally, we outline directions in which we plan to take this research and conclude with some closing remarks.
Arijit Chatterjee, William Perrizo
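A rough illustration of the vertical idea: one bit column per ticker symbol over the tweet collection, so frequency is a population count and co-occurrence a bitwise AND. Real P-Trees compress these columns into quadrant trees; plain Python integers stand in for them in this sketch.

    # Toy tweet collection; a real run would pull from the Twitter platform.
    tweets = [
        "bullish on $AAPL today",
        "$MSFT and $AAPL earnings out",
        "selling $MSFT",
        "$AAPL to the moon",
    ]

    def bit_column(symbol, tweets):
        """Bit i is set iff tweet i mentions the symbol (LSB = tweet 0)."""
        col = 0
        for i, t in enumerate(tweets):
            if symbol in t:
                col |= 1 << i
        return col

    aapl, msft = bit_column("$AAPL", tweets), bit_column("$MSFT", tweets)
    print("$AAPL count:", bin(aapl).count("1"))   # frequency = popcount
    print("$MSFT count:", bin(msft).count("1"))
    both = aapl & msft                            # vertical AND = co-occurrence
    print("co-mentions:", [tweets[i] for i in range(len(tweets)) if both >> i & 1])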
Mining Community Structure with Node Embeddings
Abstract
We develop node embeddings, a distributed representation of nodes, for large-scale social network applications. We compute embeddings for nodes based on their attributes and links, and show that node embeddings can effectively reflect community structure in networks and thus be useful for a wide range of community-related applications. We consider node embeddings in two community-related mining tasks. First, we propose a generic integration of node embeddings for pre-processing networks in community detection algorithms. Our strategy re-adjusts input networks by adding and trimming links based on embedding-derived node distances. We empirically show that the strategy can remove up to 32.16% of the links from the DBLP (computer science literature) citation network while improving the performance of different community detection algorithms under several evaluation metrics. Second, we show that these embeddings can support many community-based mining tasks in social networks, including analyses of community homogeneity and distance and the detection of community connectors (inter-community outliers, actors who connect communities), thanks to the convenient yet efficient structural comparisons that node embeddings enable. Our experimental results include many interesting insights about DBLP. For example, prior to 2013 the best way for research in Natural Language & Speech to gain “best-paper” recognition was to emphasize aspects related to Machine Learning & Pattern Recognition.
Thuy Vu, D. Stott Parker
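The link re-adjustment strategy might look like the following sketch, where an edge is kept or added according to the cosine similarity of its endpoints' embeddings. The thresholds and toy embeddings are assumptions for illustration.

    import numpy as np

    def readjust_edges(edges, emb, trim_thresh=0.6, add_thresh=0.95):
        """Sketch of the pre-processing strategy: drop existing links whose
        endpoints are dissimilar in embedding space, and add links between
        highly similar non-adjacent pairs. Thresholds are illustrative."""
        def cos(u, v):
            return emb[u] @ emb[v] / (np.linalg.norm(emb[u]) * np.linalg.norm(emb[v]))
        kept = {(u, v) for u, v in edges if cos(u, v) >= trim_thresh}
        n = len(emb)
        added = {(u, v) for u in range(n) for v in range(u + 1, n)
                 if (u, v) not in kept and cos(u, v) >= add_thresh}
        return kept | added

    # Four nodes in two embedding clusters; edge (1, 2) spans the clusters
    # and gets trimmed.
    emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    edges = {(0, 1), (1, 2), (2, 3)}
    print(sorted(readjust_edges(edges, emb)))   # [(0, 1), (2, 3)]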
A LexDFS-Based Approach on Finding Compact Communities
Abstract
This article presents an efficient hierarchical clustering algorithm based on a graph traversal algorithm called LexDFS. This traversal has the property of passing through the clustered parts of the graph in a small number of iterations, making those parts recognisable. The time complexity of our method is O(n log n). It is simple to implement, and a thorough study shows that it outputs clusterings that are closer to ground truth than those of its competitors. Experiments are also carried out to analyse the behaviour of the algorithm during execution on sample graphs. The article also introduces a quality function called compactness, which measures how efficient a cluster is for internal communication, and we prove that this quality function has interesting theoretical properties.
Jean Creusefond, Thomas Largillier, Sylvain Peyronnet
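For reference, here is a compact Python sketch of a LexDFS traversal, following the standard definition (each unvisited vertex carries a label of the visit indices of its visited neighbours, newest first, and the next vertex is one with the lexicographically largest label). The tie-breaking and the toy graph are illustrative, and the clustering built on top of the traversal is not shown.

    def lexdfs(adj, start=0):
        """LexDFS traversal sketch: favours regions whose vertices have
        many recently visited neighbours, so tightly clustered parts of
        the graph are swept through in consecutive steps."""
        labels = {v: [] for v in adj}
        order = []
        unvisited = set(adj)
        current = start
        for i in range(1, len(adj) + 1):
            order.append(current)
            unvisited.discard(current)
            for w in adj[current]:
                if w in unvisited:
                    labels[w].insert(0, i)   # prepend: newer visits rank higher
            if unvisited:
                current = max(unvisited, key=lambda v: labels[v])
        return order

    # Two triangles joined by the bridge edge (2, 3): LexDFS finishes the
    # first triangle {0, 1, 2} before crossing the bridge to {3, 4, 5}.
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(lexdfs(adj))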
Computational Data Sciences and the Regulation of Banking and Financial Services
Abstract
The development of computational data science techniques in natural language processing (NLP) and machine learning (ML) to analyze large and complex textual information opens new avenues for studying intricate policy processes at a scale unimaginable even a few years ago. We apply these scalable NLP and ML techniques to analyze the United States Government’s regulation of the banking and financial services sector. First, we employ NLP techniques to convert the text of financial regulation laws into feature vectors and infer representative “topics” across all the laws. Second, we apply ML algorithms to the feature vectors to predict various attributes of each law, focusing on the amount of authority delegated to regulators. Lastly, we compare the power of alternative models in predicting the discretion granted to regulators to oversee financial markets. These methods allow us to process large document collections efficiently and to represent the text of the laws in feature vectors that take into account words, phrases, syntax, and semantics. The vectors can be paired with predefined policy features, enabling us to build better predictive measures of financial sector regulation. The analysis offers policymakers and the business community alike a tool to automatically score the policy features of financial regulation laws and to measure their impact on market performance.
Sharyn O’Halloran, Marion Dumas, Sameer Maskey, Geraldine McAllister, David K. Park
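The two-step pipeline can be sketched as follows, using bag-of-words counts, LDA topics, and a logistic regression as stand-ins for the chapter's models. The toy corpus, topic count, and delegation labels are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression

    # Toy stand-ins for the text of financial regulation laws; the real
    # corpus, topic count, and "delegation" labels come from the chapter.
    laws = ["bank capital reserve requirement ratio",
            "securities disclosure enforcement agency discretion",
            "deposit insurance reserve bank ratio",
            "agency rulemaking enforcement discretion securities"] * 5
    delegation = [0, 1, 0, 1] * 5   # 1 = high delegated authority

    # Step 1: text -> bag-of-words feature vectors, then latent "topics".
    counts = CountVectorizer().fit_transform(laws)
    topics = LatentDirichletAllocation(n_components=2, random_state=0)
    X = topics.fit_transform(counts)   # per-law topic mixtures

    # Step 2: predict delegated authority from the topic features.
    clf = LogisticRegression().fit(X[:16], delegation[:16])
    print("held-out accuracy:", clf.score(X[16:], delegation[16:]))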
Frequent and Non-frequent Sequential Itemsets Detection
Abstract
Sequential frequent itemset detection is one of the core problems in data mining, with many applications in business, marketing, data stream analysis, and related areas. In this paper, we propose a new methodology, based on our previous work on detecting all repeated patterns in a sequence, that finds both frequent and non-frequent itemsets. By analyzing big datasets from the FIMI website of up to one million transactions, we were able to detect not only the most frequent sequential itemsets but also every sequential itemset that occurs at least twice in the dataset, and therefore to detect potentially important outliers, an analysis no other methodology can perform. For this purpose, we use the novel LERP-RSA (Longest Expected Repeated Pattern-Reduced Suffix Array) data structure and the innovative ARPaD algorithm, which detects all repeated patterns in a string. The methodology applies a preliminary statistical analysis of the transactions, which allows smaller LERP-RSA data structures to be constructed very efficiently for each transaction. Integrating and classifying all the LERP-RSAs lets the ARPaD algorithm run in parallel, accelerating the process and finding the itemsets very efficiently.
Konstantinos F. Xylogiannopoulos, Panagiotis Karampelas, Reda Alhajj
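A didactic stand-in for the approach: a plain suffix array whose adjacent longest common prefixes expose every substring occurring at least twice. This quadratic sketch only illustrates the idea; LERP-RSA and ARPaD are engineered to scale to datasets of millions of transactions.

    def repeated_patterns(s, min_occ=2):
        """Find every substring of s occurring at least `min_occ` times.
        Every repeated substring is a common prefix of two adjacent
        entries in the sorted suffix list, so scanning adjacent pairs
        enumerates all candidates."""
        suffixes = sorted(s[i:] for i in range(len(s)))
        patterns = {}
        for a, b in zip(suffixes, suffixes[1:]):
            lcp = 0   # longest common prefix of two adjacent suffixes
            while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
                lcp += 1
            for length in range(1, lcp + 1):
                p = a[:length]
                if p not in patterns:
                    # Occurrence count = suffixes having p as a prefix.
                    patterns[p] = sum(x.startswith(p) for x in suffixes)
        return {p: c for p, c in patterns.items() if c >= min_occ}

    print(repeated_patterns("abcabcab"))
    # e.g. {'a': 3, 'ab': 3, 'abcab': 2, 'b': 3, 'cab': 2, ...}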
Backmatter
Metadata
Title
From Social Data Mining and Analysis to Prediction and Community Detection
Editors
Mehmet Kaya
Özcan Erdoğan
Jon Rokne
Copyright Year
2017
Electronic ISBN
978-3-319-51367-6
Print ISBN
978-3-319-51366-9
DOI
https://doi.org/10.1007/978-3-319-51367-6
