1 Introduction
1.1 Terminology
-
Natural language processing (NLP)—a field of computer science, artificial intelligence and linguistics concerned with the interactions between computers and human (natural) languages. Specifically, it is the process of a computer extracting meaningful information from natural language input and/or producing natural language output.
-
News analytics—the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance and novelty.
-
Opinion mining—opinion mining (sentiment mining, opinion/sentiment extraction) is the area of research that attempts to build automatic systems that determine human opinion from text written in natural language.
-
Scraping—collecting online data from social media and other Web sites in the form of unstructured text; also known as site scraping, web harvesting and web data extraction.
-
Sentiment analysis—sentiment analysis refers to the application of natural language processing, computational linguistics and text analytics to identify and extract subjective information in source materials.
-
Text analytics—involves information retrieval (IR), lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization and predictive analytics.
1.2 Research challenges
-
Scraping—although social media data is accessible through APIs, due to the commercial value of the data, most of the major sources such as Facebook and Google are making it increasingly difficult for academics to obtain comprehensive access to their ‘raw’ data; very few social data sources provide affordable data offerings to academia and researchers. News services such as Thomson Reuters and Bloomberg typically charge a premium for access to their data. In contrast, Twitter has recently announced the Twitter Data Grants program, where researchers can apply to get access to Twitter’s public tweets and historical data in order to get insights from its massive set of data (Twitter has more than 500 million tweets a day).
-
Data cleansing—cleaning unstructured textual data (e.g., normalizing text), especially high-frequency streamed real-time data, still presents numerous problems and research challenges.
-
Holistic data sources—researchers are increasingly bringing together and combining novel data sources: social media data, real-time market & customer data and geospatial data for analysis.
-
Data protection—once you have created a ‘big data’ resource, the data needs to be secured, ownership and IP issues resolved (i.e., storing scraped data is against most of the publishers’ terms of service), and users provided with different levels of access; otherwise, users may attempt to ‘suck’ all the valuable data from the database.
-
Data analytics—sophisticated analysis of social media data for opinion mining (e.g., sentiment analysis) still raises a myriad of challenges due to foreign languages, foreign words, slang, spelling errors and the natural evolution of language.
-
Analytics dashboards—many social media platforms require users to write APIs to access feeds or program analytics models in a programming language, such as Java. While reasonable for computer scientists, these skills are typically beyond most (social science) researchers. Non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data, for example, configuring APIs, merging social media feeds, combining holistic sources and developing analytical models.
-
Data visualization—the visual representation of data, whereby information that has been abstracted is presented in some schematic form with the goal of communicating it clearly and effectively through graphical means. Given the magnitude of the data involved, visualization is becoming increasingly important.
1.3 Social media research and applications
1.4 Social media overview
-
Social media data—social media data types (e.g., social network media, wikis, blogs, RSS feeds and news, etc.) and formats (e.g., XML and JSON). This includes data sets and increasingly important real-time data feeds, such as financial data, customer transaction data, telecoms and spatial data.
-
Social media programmatic access—data services and tools for sourcing and scraping (textual) data from social networking media, wikis, RSS feeds, news, etc. These can be usefully subdivided into:
-
Data sources, services and tools—where data is accessed by tools which protect the raw data or provide simple analytics. Examples include: Google Trends, SocialMention, SocialPointer and SocialSeek, which provide a stream of information that aggregates various social media feeds.
-
Data feeds via APIs—where data sets and feeds are accessible via programmable HTTP-based APIs and return tagged data using XML or JSON, etc. Examples include Wikipedia, Twitter and Facebook.
-
Text cleaning and storage tools—tools for cleaning and storing textual data. Google Refine and DataWrangler are examples for data cleaning.
-
Text analysis tools—individual or libraries of tools for analyzing social media data once it has been scraped and cleaned. These are mainly natural language processing, analysis and classification tools, which are explained below.
-
Transformation tools—simple tools that can transform textual input data into tables, maps, charts (line, pie, scatter, bar, etc.), timeline or even motion (animation over timeline), such as Google Fusion Tables, Zoho Reports, Tableau Public or IBM’s Many Eyes.
-
Analysis tools—more advanced analytics tools for analyzing social data, identifying connections and building networks, such as Gephi (open source) or the Excel plug-in NodeXL.
-
Social media platforms—environments that provide comprehensive social media data and libraries of tools for analytics. Examples include: Thomson Reuters Machine Readable News, Radian 6 and Lexalytics.
-
Social network media platforms—platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources.
-
News platforms—platforms such as Thomson Reuters providing commercial news archives/feeds and associated analytics.
2 Social media methodology and critique
2.1 Methodology
2.1.1 Data
-
Social network media—access to comprehensive historic data sets and also real-time access to sources, possibly with a (15 min) time delay, as with Thomson Reuters and Bloomberg financial data.
-
News data—access to historic data and real-time news data sets, possibly through the concept of ‘educational data licenses’ (cf. software license).
-
Public data—access to scraped and archived important public data; available through RSS feeds, blogs or open government databases.
-
Programmable interfaces—researchers also need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.
2.1.2 Analytics
-
Analytics dashboards—non-programming interfaces are required for giving what might be termed as ‘deep’ access to ‘raw’ data.
-
Holistic data analysis—tools are required for combining (and conducting analytics across) multiple social media and other data sets.
-
Data visualization—researchers also require visualization tools whereby information that has been abstracted can be visualized in some schematic form with the goal of communicating information clearly and effectively through graphical means.
2.1.3 Facilities
-
Data storage—the volume of social media data, current and projected, is beyond most individual universities and hence needs to be addressed at a national science foundation level. Storage is required both for principal data sources (e.g., Twitter), but also for sources collected by individual projects and archived for future use by other researchers.
-
Computational facility—remotely accessible computational facilities are also required for: a) protecting access to the stored data; b) hosting the analytics and visualization tools; and c) providing computational resources such as grids and GPUs required for processing the data at the facility rather than transmitting it across a network.
2.2 Critique
2.2.1 Data
-
Siloed data—most data sources (e.g., Twitter) have inherently isolated information making it difficult to combine with other data sources.
-
Holistic data—in contrast, researchers are increasingly interested in accessing, storing and combining novel data sources: social media data, real-time financial market & customer data and geospatial data for analysis. This is currently extremely difficult to do even for Computer Science departments.
2.2.2 Analytics
2.2.3 Facilities
3 Social media data
3.1 Types of data
-
Historic data sets—previously accumulated and stored social/news, financial and economic data.
-
Real-time feeds—live data feeds from streamed social media, news services, financial exchanges, telecoms services, GPS devices and speech.
-
Raw data—unprocessed computer data straight from source that may contain errors or may be unanalyzed.
-
Cleaned data—data from which erroneous (dirty) entries, caused by disparities, keying mistakes, missing bits, outliers, etc., have been corrected or removed.
-
Value-added data—data that has been cleaned, analyzed, tagged and augmented with knowledge.
3.2 Text data formats
-
HTML—HyperText Markup Language (HTML) is the well-known markup language for web pages and other information that can be viewed in a web browser. HTML consists of HTML elements, which include tags enclosed in angle brackets (e.g., <div>), within the content of the web page.
-
XML—Extensible Markup Language (XML)—the markup language for structuring textual data using <tag>…</tag> to define elements.
-
JSON—JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange and is derived from JavaScript.
-
CSV—a comma-separated values (CSV) file contains the values in a table as a series of ASCII text lines organized such that each column value is separated by a comma from the next column’s value and each row starts a new line.
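To make the difference between these formats concrete, the sketch below reads the same kind of record from JSON and from CSV using only the Python standard library; the file names and the 'text' field are illustrative assumptions rather than a prescribed schema.

```python
import csv
import json

# Hypothetical files: the names and fields below are illustrative only.
with open("tweets.json", encoding="utf-8") as f:
    records = json.load(f)            # JSON parses directly into dicts/lists
    for rec in records:
        print(rec.get("text", ""))

with open("tweets.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):     # CSV rows become dicts keyed by the header line
        print(row["text"])
```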
4 Social media providers
-
Freely available databases—repositories that can be freely downloaded, e.g., Wikipedia (http://dumps.wikimedia.org) and the Enron e-mail data set available via http://www.cs.cmu.edu/~enron/.
-
Data access via tools—sources that provide controlled access to their social media data via dedicated tools, both to facilitate easy interrogation and also to stop users ‘sucking’ all the data from the repository. An example is Google Trends. These are further subdivided into:
-
Free sources—repositories that are freely accessible, but the tools protect or may limit access to the ‘raw’ data in the repository, such as the range of tools provided by Google.
-
Commercial sources—data resellers that charge for access to their social media data. Gnip and DataSift provide commercial access to Twitter data through a partnership, and Thomson Reuters to news data.
-
Data access via APIs—social media data repositories providing programmable HTTP-based access to the data via APIs (e.g., Twitter, Facebook and Wikipedia).
4.1 Open-source databases
4.2 Data access via tools
4.2.1 Freely accessible sources
4.2.2 Commercial sources
4.3 Data feed access via APIs
4.3.1 Wiki media
4.3.2 Social networking media
4.3.2.1 Twitter
-
Search API—Query Twitter for recent Tweets containing specific keywords. It is part of the Twitter REST API v1.1 (REST stands for Representational State Transfer, and the API attempts to comply with the design principles of the REST architectural style) and requires an authorized application (using OAuth, the open standard for authorization) before any results can be retrieved from the API.
-
Streaming API—A real-time stream of Tweets, filtered by user ID, keyword, geographic location or random sampling.
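As a minimal sketch of the Search API described above, the following Python fragment authenticates with OAuth (via the requests and requests_oauthlib libraries) and requests recent Tweets matching a keyword. The credentials are placeholders, and the v1.1 endpoint and parameters shown reflect the API as described here; Twitter's access terms and API versions change, so the details may differ in practice.

```python
import requests
from requests_oauthlib import OAuth1

# Placeholder credentials obtained by registering an application with Twitter.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# Search API (REST v1.1): recent Tweets containing a keyword.
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    auth=auth,
    params={"q": "finance", "count": 100, "lang": "en"},
)
resp.raise_for_status()
for tweet in resp.json().get("statuses", []):   # results are returned as JSON
    print(tweet["created_at"], tweet["text"])
```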
4.3.2.2 Facebook
4.3.3 RSS feeds
4.3.4 Blogs, news groups and chat services
4.3.5 News feeds
4.3.6 Geospatial feeds
-
Location and time sensitive—exchange of messages with relevance for one specific location at one specific point in time (e.g., Foursquare).
-
Location sensitive only—exchange of messages with relevance for one specific location, which are tagged to a certain place and read later by others (e.g., Yelp and Qype).
-
Time sensitive only—transfer of traditional social media applications to mobile devices to increase immediacy (e.g., posting Twitter messages or Facebook status updates).
-
Neither location nor time sensitive—transfer of traditional social media applications to mobile devices (e.g., watching a YouTube video or reading a Wikipedia entry).
5 Text cleaning, tagging and storing
-
Missing data—when a piece of information existed but was not included for whatever reason in the raw data supplied. Problems occur with: a) numeric data when ‘blank’ or a missing value is erroneously substituted by ‘zero’ which is then taken (for example) as the current price; and b) textual data when a missing word (like ‘not’) may change the whole meaning of a sentence.
-
Incorrect data—when a piece of information is incorrectly specified (such as decimal errors in numeric data or wrong word in textual data) or is incorrectly interpreted (such as a system assuming a currency value is in $ when in fact it is in £ or assuming text is in US English rather than UK English).
-
Inconsistent data—when a piece of information is inconsistently specified. For example, with numeric data, this might be using a mixture of formats for dates: 2012/10/14, 14/10/2012 or 10/14/2012. For textual data, it might be as simple as: using the same word in a mixture of cases, mixing English and French in a text message, or placing Latin quotes in an otherwise English text.
5.1 Cleansing data
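A minimal sketch of this kind of cleansing is given below: it normalizes the mixed date formats and mixed-case text mentioned above and flags genuinely missing values rather than silently substituting defaults. The function names and the set of date formats are illustrative assumptions.

```python
from datetime import datetime

def normalize_date(value):
    """Try the mixed date formats mentioned above and return ISO format."""
    for fmt in ("%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag as missing rather than substituting a spurious value

def normalize_text(value):
    """Lower-case and collapse whitespace; leave genuinely missing text as None."""
    return " ".join(value.split()).lower() if value else None

print(normalize_date("2012/10/14"), normalize_date("14/10/2012"))
print(normalize_text("  The firm did NOT miss   earnings "))
```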
5.2 Tagging unstructured data
5.3 Storing data
-
Flat file—a flat file is a two-dimensional database (somewhat like a spreadsheet) containing records that have no structured interrelationship and that can be searched sequentially.
-
Relational database—a database organized as a set of formally described tables to recognize relations between stored items of information, allowing more complex relationships among the data items. Examples are row-based SQL databases and column-based kdb+ used in finance.
-
noSQL databases—a class of database management system (DBMS) identified by its non-adherence to the widely used relational database management system (RDBMS) model. noSQL/newSQL databases are characterized as being non-relational, distributed, open-source and horizontally scalable.
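The sketch below illustrates two of these storage options with the Python standard library, writing the same cleaned records to a sequentially searchable flat file (CSV) and to a relational SQLite table; the file names and column layout are illustrative assumptions.

```python
import csv
import sqlite3

rows = [("2014-03-01T12:00:00Z", "twitter", "great quarter for ACME", 0.8)]

# Flat file: a sequentially searchable CSV.
with open("posts.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# Relational store: a formally described table in SQLite.
con = sqlite3.connect("posts.db")
con.execute("CREATE TABLE IF NOT EXISTS posts (ts TEXT, source TEXT, text TEXT, sentiment REAL)")
con.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", rows)
con.commit()
print(con.execute("SELECT COUNT(*) FROM posts WHERE sentiment > 0").fetchone())
```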
5.3.1 Apache (noSQL) databases and tools
5.3.1.1 Apache open-source software
-
Cassandra/Hive databases—Apache Cassandra is an open-source (noSQL) distributed DBMS providing a structured ‘key-value’ store. Key-value stores allow an application to store its data in a schema-less way. Related noSQL database products include: Apache Hive, Apache Pig and MongoDB, a scalable and high-performance open-source database designed to handle document-oriented storage. Since noSQL databases are ‘structure-less,’ it is necessary to have a companion SQL database to retain and map the structure of the corresponding data.
-
Hadoop platform—a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. An application is broken down into numerous small parts (also called fragments or blocks) that can be run on systems with thousands of nodes involving thousands of terabytes of storage (a minimal MapReduce-style sketch follows this list).
-
Mahout—provides implementations of distributed or otherwise scalable analytics (machine learning) algorithms running on the Hadoop platform. Mahout supports four classes of algorithms: a) clustering (e.g., K-Means, Fuzzy C-Means) that groups text into related groups; b) classification (e.g., Complementary Naive Bayes classifier) that uses supervised learning to classify text; c) frequent itemset mining takes a set of item groups and identifies which individual items usually appear together; and d) recommendation mining (e.g., user- and item-based recommenders) that takes users’ behavior and from that tries to find items users might like.
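Mahout and Hadoop themselves are driven from Java or the command line. Purely to illustrate the MapReduce model by which Hadoop breaks an application into small fragments, the sketch below gives a word-count mapper and reducer in the Hadoop Streaming style, where any executable that reads stdin and writes stdout can act as the map or reduce step; the single-file layout and command-line flag are illustrative only, as in practice the two steps would be separate scripts passed to Hadoop Streaming.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit (word, 1) for every word in the input fragment."""
    for line in lines:
        for word in line.lower().split():
            print(f"{word}\t1")

def reducer(lines):
    """Reduce step: assumes input sorted by key (as Hadoop guarantees); sums counts per word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Choose the step with a command-line flag, e.g. `python wordcount.py map < input.txt`.
    (mapper if sys.argv[1:] == ["map"] else reducer)(sys.stdin)
```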
6 Social media analytics techniques
6.1 Computational science techniques
-
Computational statistics—refers to computationally intensive statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation and principal components analysis.
-
Machine learning—a system capable of the autonomous acquisition and integration of knowledge learnt from experience, analytical observation, etc. (Murphy 2012). These sub-symbolic systems further subdivide into:
-
Supervised learning such as Regression Trees, Discriminant Function Analysis, Support Vector Machines.
-
Unsupervised learning such as Self-Organizing Maps (SOM), K-Means.
-
-
Complexity science—complex simulation models of difficult-to-predict systems derived from statistical physics, information theory and nonlinear dynamics. This is the realm of physicists and mathematicians.
-
Data mining—knowledge discovery that extracts hidden patterns from huge quantities of data, using sophisticated differential equations, heuristics, statistical discriminators (e.g., hidden Markov models), and artificial intelligence machine learning techniques (e.g., neural networks, genetic algorithms and support vector machines).
-
Simulation modeling—simulation-based analysis that tests hypotheses. Simulation is used to attempt to predict the dynamics of systems so that the validity of the underlying assumption can be tested.
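As a minimal illustration of simulation-based hypothesis testing, the sketch below uses a permutation (resampling) test to ask whether an observed difference in mean sentiment between two groups of posts could plausibly arise by chance; the sentiment scores are toy data assumed for the example.

```python
import random

# Hypothetical sentiment scores for two groups of posts (toy data).
group_a = [0.6, 0.8, 0.7, 0.9, 0.5]
group_b = [0.4, 0.3, 0.6, 0.2, 0.5]
observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Permutation test: simulate the null hypothesis that group labels are irrelevant.
pooled = group_a + group_b
exceed = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if diff >= observed:
        exceed += 1
print(f"estimated one-sided p-value: {exceed / trials:.3f}")
```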
6.1.1 Stream processing
6.2 Sentiment analysis
6.2.1 Sentiment classification
-
Sentiment context—to extract opinion, one needs to know the ‘context’ of the text, which can vary significantly from specialist review portals/feeds to general forums where opinions can cover a spectrum of topics (Westerski 2008).
-
Sentiment level—text analytics can be conducted at the document, sentence or attribute level.
-
Sentiment subjectivity—deciding whether a given text expresses an opinion or is factual (i.e., without expressing a positive/negative opinion).
-
Sentiment orientation/polarity—deciding whether an opinion in a text is positive, neutral or negative.
-
Sentiment strength—deciding the ‘strength’ of an opinion in a text: weak, mild or strong.
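A minimal lexicon-based sketch of orientation and strength scoring is shown below; the lexicon, scores and thresholds are purely illustrative assumptions, and the approach deliberately ignores context (e.g., negation), which is exactly why the considerations listed above matter.

```python
# A tiny, purely illustrative opinion lexicon (word -> polarity score).
LEXICON = {"good": 1, "great": 2, "excellent": 3, "bad": -1, "awful": -2, "terrible": -3}

def score(text):
    """Return (orientation, strength) for a piece of text."""
    total = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())
    orientation = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    strength = "strong" if abs(total) >= 3 else "mild" if abs(total) == 2 else "weak"
    return orientation, strength

print(score("The results were excellent, not bad at all!"))
```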
6.2.2 Supervised learning methods
-
Naïve Bayes (NB)—a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions (i.e., assuming features are independent of one another within each class).
-
Maximum entropy (ME)—based on the principle that the probability distribution which best represents the current state of knowledge is the one with the largest information-theoretic entropy.
-
Support vector machines (SVM)—supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.
-
Logistic regression (LR) model—a type of regression analysis used for predicting the outcome of a categorical criterion variable (a variable that can take on a limited number of categories) based on one or more predictor variables.
-
Latent semantic analysis—an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text (Kobayashi and Takeda 2000).
6.2.2.1 Naïve Bayes classifier (NBC)
-
Training step—using the training samples, the method estimates the parameters of a probability distribution, assuming features are conditionally independent given the class.
-
Analysis/testing step—For any unseen test sample, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test sample according to the largest posterior probability.
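A minimal sketch of these two steps, assuming the scikit-learn library and a toy labelled corpus, is given below; a real classifier would be trained on thousands of labelled samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled training samples (illustrative only).
train_texts = ["great results, very happy", "awful service, very disappointed",
               "excellent product", "terrible experience"]
train_labels = ["positive", "negative", "positive", "negative"]

# Training step: estimate per-class word probabilities from the training samples.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Analysis/testing step: classify an unseen sample by the largest posterior probability.
print(model.predict(["happy with the excellent results"]))
print(model.predict_proba(["happy with the excellent results"]))
```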
7 Social media analytics tools
7.1 Scientific programming tools
7.2 Business toolkits
7.3 Social media monitoring tools
7.4 Text analysis tools
7.5 Data visualization tools
7.6 Case study: SAS Sentiment Analysis and Social Media Analytics
8 Social media analytics platforms
-
News platforms—platforms such as Thomson Reuters providing news archives/feeds and associated analytics and targeting companies such as financial institutions seeking to monitor market sentiment in news.
-
Social network media platforms—platforms that provide data mining and analytics on Twitter, Facebook and a wide range of other social network media sources. Providers typically target companies seeking to monitor sentiment around their brands or products.
8.1 News platforms
-
Author sentiment—metrics for how positive, negative or neutral the tone of the item is, specific to each company in the article.
-
Relevance—how relevant or substantive the story is for a particular item.
-
Volume analysis—how much news is happening on a particular company.
-
Uniqueness—how new or repetitive the item is over various time periods.
-
Headline analysis—denotes special features such as broker actions, pricing commentary, interviews, exclusives and wrap-ups.
8.2 Social network media platforms
8.3 Case study: Thomson Reuters News Analytics
-
Item type—stage of the story: Alert, Article, Updates or Corrections.
-
Item genre—classification of the story, e.g., interview, exclusive and wrap-up.
-
Headline—alert or headline text.
-
Relevance—varies from 0 to 1.0.
-
Prevailing sentiment—can be 1, 0 or −1.
-
Positive, neutral, negative—more detailed sentiment indication.
-
Broker action—denotes broker actions: upgrade, downgrade, maintain, undefined or whether it is the broker itself.
-
Price/market commentary—used to flag items describing pricing/market commentary.
-
Topic codes—describe what the story is about, e.g., RCH = Research, RES = Results, RESF = Results Forecast, MRG = Mergers and Acquisitions.
-
Emotional indicators (sentiments)—emotions such as Gloom, Fear, Trust, Uncertainty, Innovation, Anger, Stress, Urgency, Optimism and Joy.
-
Buzz metrics—indicate how much something is being discussed in the news and social media; they include macroeconomic themes (e.g., Litigation, Mergers, Volatility, Financials sector, Airlines sector and Clean Technology sector).
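By way of illustration, the sketch below represents such an item as a simple Python data structure and filters a feed by relevance and sentiment. The field names follow the list above, but the class, values and thresholds are illustrative assumptions rather than the actual TRNA schema.

```python
from dataclasses import dataclass

@dataclass
class NewsItem:
    item_type: str        # Alert, Article, Update or Correction
    headline: str
    relevance: float      # 0 to 1.0
    sentiment: int        # 1, 0 or -1
    topic_codes: tuple    # e.g., ("RES",)

items = [
    NewsItem("Article", "ACME beats forecast", 0.9, 1, ("RES",)),
    NewsItem("Alert", "Sector update", 0.2, 0, ("RCH",)),
]

# Keep only relevant, positively toned stories.
selected = [i for i in items if i.relevance > 0.5 and i.sentiment > 0]
print([i.headline for i in selected])
```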
9 Experimental computational environment for social media
9.1 Data
-
Data scraping—the ability through easily programmable APIs to scrape any type of social media (social networking media, RSS feeds, blogs, wikis, news, etc.).
-
Data streaming—to access and combine real-time feeds and archived data for analytics.
-
Data storage—a major facility for storing principal data sources and for archiving data collected for specific projects.
-
Data protection/security—the stored data needs to be protected to stop users attempting to ‘suck’ all of it out of the facility. Access to certain data sets may need to be restricted and charges may be levied on access (cf. Wharton Research Data Services).
-
Programmable interfaces—researchers need access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected.
9.2 Analytics
-
Analytics dashboards—non-programming interfaces are required for giving what might be referred to as ‘deep’ access to ‘raw’ data.
-
Programmable analytics—programming interfaces are also required so users can deploy advanced data mining and computer simulation models using MATLAB, Java and Python.
-
Stream processing—facilities are required to support analytics on streamed real-time data feeds, such as Twitter feeds, news feeds and financial tick data.
-
High-performance computing—lastly, the environment needs to support non-programming interfaces to MapReduce/Hadoop, NoSQL databases and grids of processors.
-
Decentralized analytics—if researchers are to combine social media data with highly sensitive/valuable proprietary data held by governments, financial institutions, retailers and other commercial organizations, then the environment needs in the future to support decentralized analytics across distributed data sources and in a highly secure way.
9.3 System architecture
-
Connectivity engines—the connectivity modules communicate with the external data sources, including Twitter and Facebook’s APIs, financial blogs, various RSS and news feeds. The platform’s APIs are continually being expanded to incorporate other social media sources as required. Data is fed into SocialSTORM in real time, including a random sample of all public updates from Twitter, providing gigabytes of text-based data every day.
-
Messaging bus—the message bus serves as the internal communication layer which accepts the incoming data streams (messages) from the various connectivity engines, parses these (from either JSON or XML format) to an internal representation of data in the platform, distributes the information across all the interested modules and writes the various data to the appropriate tables of the main database.
-
Data warehouse—the database supports terabytes of text-based entries, which are accompanied by various types of metadata to expand the potential avenues of research. Entries are organized by source and accurately time-stamped with the time of publication, as well as being tagged with topics for easy retrieval by simulation models. The platform currently uses HBase, but in future might use Apache Cassandra or Hive.
-
Simulation manager—the simulation manager provides an external API for clients to interact with the data for research purposes, including a web-based GUI whereby users can select various filters to apply to the data sets before uploading a Java-coded simulation model to perform the desired analysis on the data. This facilitates all client access to the data warehouse and also allows users to upload their own data sets for aggregation with UCL’s social data for a particular simulation. There is also the option to switch between historical mode (which mines data existing at the time the simulation is started) and live mode (which ‘listens’ to incoming data streams and performs analysis in real time).
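The sketch below illustrates the pattern, not the actual SocialSTORM code: a minimal bus that parses incoming JSON messages into an internal representation and routes them to subscribed modules, much as the messaging bus above links the connectivity engines to the analytics modules; all names and the sample message are hypothetical.

```python
import json
from collections import defaultdict

class MessageBus:
    """Parses incoming JSON messages and routes them to subscribed modules."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, raw_json):
        message = json.loads(raw_json)           # internal representation is a dict
        for handler in self.subscribers[topic]:
            handler(message)

bus = MessageBus()
bus.subscribe("twitter", lambda m: print("model received:", m["text"]))
# A connectivity engine would call publish() for every incoming item.
bus.publish("twitter", '{"text": "markets rally on earnings", "ts": "2014-03-01T12:00:00Z"}')
```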
9.4 Platform components
-
Back-end services—this provides the core of the platform functionalities. It is a set of services that allow connections to data providers, propagation processing and aggregation of data feeds, execution and maintenance of models, as well as their management in a multiuser environment.
-
Front-end client APIs—this provides a set of programmatic and graphical interfaces that can be used to interact with the platform to implement and test analytical models. The programmatic access provides model templates to simplify access to some of the functionalities and defines the generic structure of every model in the platform. The graphical user interface allows visual management of analytical models. It enables the user to visualize data in various forms, provides data watch grid capabilities, provides a dynamic visualization of group behavior of data and allows users to observe information on events relevant to the user’s environment.
-
Connectivity engine—this functionality provides a means of communication with the outside world, with financial brokers, data providers and others. Each of the outside venues utilized by the platform has a dedicated connector object responsible for control of communication. This is possible because each of the outside institutions provides either a dedicated API or uses a standard communication protocol (e.g., the FIX protocol or a JSON/XML-based protocol). The platform provides a generalized interface to allow standardization of a variety of connectors.
-
Internal communication layer—the idea behind the use of the internal messaging system in the platform draws from the concept of event-driven programming. Analytical platforms utilize events as the main means of communication between their elements. The elements, in turn, are either producers or consumers of events. This approach significantly simplifies the architecture of such a system while making it scalable and flexible for further extensions.
-
Aggregation database—this provides a fast and robust DBMS functionality, for an entry-level aggregation of data, which is then filtered, enriched, restructured and stored in big data facilities. Aggregation facilities enable analytical platforms to store, extract and manipulate large amounts of data. The storage capabilities of the Aggregation element not only allow replay of historical data for modeling purposes, but also enable other, more sophisticated tasks related to functioning of the platform including model risk analysis, evaluation of performance of models and many more.
-
Client SDK—this is a complete set of APIs (Application Programming Interfaces) that enable development, implementation and testing of new analytical models with use of the developer’s favorite IDE (Integrated Development Environment). The SDK allows connection from the IDE to the server side of the platform to provide all the functionalities the user may need to develop and execute models.
-
Shared memory—this provides a buffer-type functionality that speeds up the delivery of temporal/historical data to models and the analytics-related elements of the platform (i.e., the statistical analysis library of methods), and, at the same time, reduces the memory usage requirement. The main idea is to have a central point in the memory (RAM) of the platform that manages and provides temporal/historical data from the current point in time back to a specified number of timestamps in history. Since the memory is shared, no model has to keep and manage history by itself. Moreover, since the memory is kept in RAM rather than in files or the DBMS, access to it is instant and bounded only by the performance of the hardware and the platform on which the buffers work.
-
Model templates—the platform supports two generic types of models: push and pull. The push type registers itself to listen to a specified set of data streams during initialization, and the execution of the model logic is triggered each time a new data item arrives at the platform. This type is dedicated to very quick, low-latency, high-frequency models, and the speed is achieved at the cost of small shared memory buffers. The pull model template executes and requests data on its own, based on a schedule. Instead of using the memory buffers, it has a direct connection to the big data facilities and hence can request as much historical data as necessary, at the expense of speed.
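A minimal sketch of the two template types is given below, with hypothetical class and method names: the push model exposes a handler triggered per incoming message, while the pull model runs on its own schedule against a historical store.

```python
from abc import ABC, abstractmethod

class PushModel(ABC):
    """Listens to data streams; model logic runs each time a new item is pushed to it."""
    @abstractmethod
    def on_data(self, message): ...

class PullModel(ABC):
    """Executes on its own schedule and pulls whatever history it needs from storage."""
    @abstractmethod
    def run(self, store): ...

class VolumeSpikeModel(PushModel):
    """Toy push model: counts incoming messages (low-latency, per-message logic)."""
    def __init__(self):
        self.count = 0

    def on_data(self, message):
        self.count += 1

model = VolumeSpikeModel()
for msg in [{"text": "a"}, {"text": "b"}]:   # stand-in for an incoming stream
    model.on_data(msg)
print(model.count)
```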