Open Access 2024 | OriginalPaper | Chapter

6. Toward Visually Analyzing Dynamic Social Messages and News Articles Containing Geo-Referenced Information

Authors: Johannes Knittel, Franziska Huth, Steffen Koch, Thomas Ertl

Published in: Volunteered Geographic Information

Publisher: Springer Nature Switzerland

Abstract

The number of social media posts and news articles that are being published every day is high. This makes them an attractive source of human-generated information for different domain experts such as journalists and business analysts but also emergency responders, particularly if posts contain references to geolocations. Visual analytics approaches can help to gain insights into such datasets and inform decision-makers. However, the high volume and the veracity of the data, as well as the velocity in the case of streaming data, pose challenges when supporting explorative analysis with interactive visualization. Based on four exemplary approaches, we outline recently proposed strategies to tackle these challenges. We describe how geo-aware filtering and anomaly detection methods can help to inform stakeholders based on geolocated tweets. We show that data-aware tag maps can provide analysts with an overview-first, details-on-demand visual summary of large amounts of text content over time. With space-filling curves, we can visualize the temporal evolution of geolocations in a two-dimensional plot without relying on animations that would impede comparative analyses. Additionally, we discuss the use of an efficient dynamic clustering algorithm for enabling large-scale visual analyses of streaming posts.

6.1 Introduction

Unstructured or semi-structured data such as news articles and social media posts contain a significant amount of human knowledge. Analyzing such vast amounts of data enables several stakeholders from different domains to obtain insights and inform their decision-making, for instance, business traders that need up-to-date information about new developments and first responders that benefit from timely witness reports on social media. In some cases, we can leverage the structured metadata associated with some of these documents such as the geolocation of posts or the timestamp of news articles, but it is generally challenging to gain insights into the actual content comprising text data.

The field of visual analytics (Thomas and Cook 2005; Keim et al. 2010) particularly aims at solving such complex problems of analyzing and exploring large amounts of data with often open and ill-defined goals. It combines the domain knowledge and intelligence of human experts with interactive visualizations that are sourced from advanced automated data analytics and machine learning models. If we want to harvest news reports and social media posts for timely insights using visual analytics, we need efficient algorithms to deal with the volume of the data, we need to extract information from unstructured data such as text, we need to integrate and combine this information with additional metadata such as geolocations into interactive visualizations, and we need to develop adaptive visualizations and streaming-aware algorithms that can deal with dynamic data sources. This chapter outlines recently proposed visual analytics approaches to tackle the said challenges.

6.2 Analyzing the Temporal Evolution of Text Data with PyramidTags

Making sense of large document corpora is a challenging endeavor, since it is inherently difficult for machines to grasp the meaning of natural language. Visual analytics approaches that combine methods of automatic information retrieval and data analytics with interactive visualizations help to tackle such challenges by incorporating human expertise and human interpretability. However, it remains challenging to provide a comprehensive overview of large amounts of text data such as news articles or social media posts due to the unstructured nature of the data, the variety of how people express similar things, and the inherent ambiguity of natural languages.

PyramidTags (Knittel et al. 2021c) proposes a novel tag layout for exploring large document collections such as tweets that aims at providing analysts with an overview of the content at hand and the temporal evolution of its themes without introducing hard clusters or topics. The approach utilizes an optimization process to place extracted relevant keywords and keyphrases from articles or posts onto a two-dimensional plot such that related tags ideally appear close to each other, while it is also possible to infer in which date ranges tags mostly appear in the dataset based on their position on the map (Fig. 6.1).

6.2.1 Processing and Objectives

PyramidTags first extracts the top k relevant keywords and keyphrases from the document collection using ELSKE (Knittel et al. 2021b), a fast keyphrase extraction library specialized in summarizing text collections. These tags serve as a summary of the content, and the way they are placed on the two-dimensional tag map should support analysts in making sense of the data through a date-aware, context-aware, and word-order-aware layout. In a second step, we process the dataset again to infer which tags are related based on how often and how closely together they appear in the same paragraph, whether there seems to be a preferred reading order of tag pairs (e.g., "John Doe" vs. "Doe John"), and in which date ranges tags and tag pairs mostly appear.
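The second processing step can be illustrated with a minimal sketch. It is not the PyramidTags implementation: the function name `pair_statistics` is hypothetical, tags are matched by naive lowercase substring search, and the proximity weighting and date statistics mentioned above are left out for brevity.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(paragraphs, tags):
    """Count how often tag pairs share a paragraph and in which order they occur."""
    cooccur = Counter()  # (a, b) with a < b -> number of paragraphs containing both
    ordered = Counter()  # (first, second) -> times `first` precedes `second`
    for text in paragraphs:
        lowered = text.lower()
        # Character offset of the first occurrence of every tag in this paragraph.
        positions = {t: lowered.find(t) for t in tags if t in lowered}
        for a, b in combinations(sorted(positions), 2):
            cooccur[(a, b)] += 1
            if positions[a] < positions[b]:
                ordered[(a, b)] += 1
            else:
                ordered[(b, a)] += 1
    return cooccur, ordered
```

A strongly asymmetric count in `ordered` for a pair would hint at a preferred reading order, whereas balanced counts suggest that none exists.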

The resulting data structure informs the subsequent layouting process, which optimizes an objective function using particle swarm optimization (Kennedy and Eberhart 1995). Minimizing this function corresponds to finding a balanced trade-off between several objectives: (1) related tags are placed close to each other, (2) the position of a tag conveys its associated date range, (3) the preferred reading order is preserved for important pairs (if applicable), and (4) tags do not overlap.
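To make the trade-off more tangible, the following sketch evaluates a toy cost function for a candidate layout. The penalty terms, the weights, and the name `layout_cost` are illustrative assumptions rather than the published objective; the reading-order term (3) is omitted, and the optimizer itself (e.g., particle swarm optimization) is not shown.

```python
import math


def layout_cost(positions, relatedness, target_x,
                weights=(1.0, 1.0, 5.0), min_dist=1.0):
    """Weighted sum of three layout penalties for one candidate layout (lower is better)."""
    w_rel, w_date, w_overlap = weights
    tags = list(positions)
    cost = 0.0
    for i, a in enumerate(tags):
        ax, ay = positions[a]
        # (2) Penalize horizontal deviation from the mid-point of the tag's date range.
        cost += w_date * (ax - target_x[a]) ** 2
        for b in tags[i + 1:]:
            bx, by = positions[b]
            dist = math.hypot(ax - bx, ay - by)
            # (1) Related tags should be close: distance weighted by relatedness.
            cost += w_rel * relatedness.get(frozenset((a, b)), 0.0) * dist
            # (4) Penalize pairs that come closer than a minimum distance.
            cost += w_overlap * max(0.0, min_dist - dist) ** 2
    return cost
```

An optimizer would repeatedly propose `positions` and keep the layouts with the lowest cost; the weights determine which objective dominates when they conflict.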

6.2.2 Triangular Layout

One of the defining aspects of PyramidTags is its triangular layout, which aims to convey the temporal evolution of the extracted tags. Each tag is associated with a specific date range in which it mainly appears in the data (we may have several tags with the same text in case they appear in distinct clusters of date ranges). The vertical position on the map corresponds to the duration of this range, and the horizontal position to the mid-point of the said date range. At the bottom of the visualization, we place a timeline that depicts the entire date range of the dataset. For instance, if a word mainly appears on a specific date in the data, it is placed at the bottom of the map, right above the corresponding date in the timeline. On the other hand, if words or phrases appear in most of the articles, they are placed at the center-top of the visualization. Analysts should be able to infer this date range by spanning a right triangle from the tag to the timeline at the bottom. With this layout, we can visualize associated time spans of data points without relying on animations, which helps analysts to hypothesize about relevant events since tags that are mentioned during similar date ranges are also placed in the same neighborhood.
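The mapping between date ranges and map positions can be sketched as follows. The normalized coordinates in [0, 1] and the function names are assumptions made for illustration; in the actual layout, the optimization only approximates these ideal positions because it also has to satisfy the other objectives.

```python
from datetime import date, timedelta


def tag_position(start, end, first_day, last_day):
    """Map a tag's date range to ideal normalized (x, y) coordinates."""
    total = (last_day - first_day).days
    # Horizontal position: mid-point of the date range.
    x = ((start - first_day).days + (end - first_day).days) / (2 * total)
    # Vertical position: duration of the date range.
    y = (end - start).days / total
    return x, y


def date_range(x, y, first_day, last_day):
    """Inverse mapping: the date range spanned on the timeline below a tag."""
    total = (last_day - first_day).days
    start = first_day + timedelta(days=round((x - y / 2) * total))
    end = first_day + timedelta(days=round((x + y / 2) * total))
    return start, end
```

In these units, the two legs of the triangle meet the timeline at x - y/2 and x + y/2. A tag covering the entire dataset therefore lands at (0.5, 1.0), the center-top of the map, and a tag confined to a single day lands at height 0 directly above that day.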

For instance, in Fig. 6.1, the tags diamond princess and cruise ship appear at the top of the map, indicating that the discussion about the Covid-19 cases on the said cruise ship was in the news during most of the depicted date range, whereas storm dennis and flooding are placed at the bottom-left around the first day of the 2-week date range.

6.2.3 Interactions and Document Retrieval

If users hover over certain tags, a lightly colored trapezoid visualizes the associated date range of the respective tag. While the optimization process tries to place related tags nearby, due to the inherent information loss of projecting data to two-dimensional spaces, not all tags that are close to each other are necessarily related, and there might be tags placed further away that are nevertheless related. The system therefore shades all other tags on the map depending on how related they are to the currently hovered tag (the more opaque, the less related). Users can also select one or several keywords by clicking on them. For instance, Fig. 6.2 shows an example in which the analyst has selected three tags (A). PyramidTags will then list the most related documents that contain the chosen selection of words or phrases, ranking the results based on the number of occurrences and the relative position of keywords to each other in the document (B). Users may also retrieve individual documents (C).

6.3 Leveraging Geodata to Scale the Visual Analysis of Posts

When dealing with large amounts of streaming text data, we can leverage geographical references to scale the analysis (e.g., the current position of the person who has just posted the respective tweet). Such geo-tags not only provide additional context to the textual content; we can also use them to cluster items and their content in a geospatial way, providing several important benefits. We can visualize content on top of a geographical map to help analysts focus on specific regions of interest, drastically reducing the amount of data analysts have to cope with. Grouping content by geographic region may also help with providing thematic aggregations, as people in the same region within a certain time span may have a higher chance of posting content with similar topics (e.g., a football match in a city). Another advantage is that we can compare metadata and extracted aggregated information of documents with a spatial-aware baseline (e.g., typical occurrences of tags within a region). This also helps to develop anomaly detection algorithms, for instance, to notify first responders in a very timely manner about evolving situations.

ScatterBlogs (Thom et al. 2012, 2015; Bosch et al. 2013) is a visual analytics approach that proposed to leverage the geographical annotations of tweets in these ways to scale the visual analysis of streaming posts. Case studies with domain experts from crisis management groups and critical infrastructure companies underlined the need for such systems so that analysts can obtain important additional information in real time regarding critical situations, despite the apparent learning curve and the need for specialized human labor to monitor this channel (Thom et al. 2015). However, they also showed that the velocity of newly published posts, even regarding specific events, and the dynamically evolving nature of the content itself (e.g., novel hashtags) still pose significant challenges for analysts and the development of such interactive monitoring systems (Fathi et al. 2020).

6.3.1 Geospatial Clustering of Terms

One of the core ideas in ScatterBlogs is to continuously extract terms that appear unusually frequently in certain geographic regions and visualize them on top of a map such that analysts can get an overview of interesting developments that take place at specific locations. In the beginning, every term (except for stop words) defines a cluster comprising all geo-tagged posts containing the respective term; newly received posts are added to these clusters based on the terms they contain. Once the distortion of any such cluster is too high (i.e., the geographic positions of related messages are too widely scattered), we split the cluster using the k-means algorithm (with \(k = 2\)). The system visualizes such dynamic term clusters by placing the respective tags (or representative dots) on a map based on the average geographic position of corresponding posts. The decision of which terms to display, and at which size, also incorporates how anomalous the usage of a term is. This anomaly score depends on the number of unique users and the geographic density of the corresponding posts; that is, a term is considered important if many different users post messages containing it at a specific location. Figure 6.3 depicts the main user interface of the system, visualizing anomalous terms in green on top of a map.
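The split rule can be sketched in a few lines. This is a simplified illustration, not the ScatterBlogs code: coordinates are treated as planar (longitude, latitude) pairs, distortion is taken to be the mean squared distance to the centroid, and the threshold `max_distortion` and the deterministic seeding are assumptions.

```python
def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def distortion(points):
    """Mean squared distance of the posts to their average position."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points) / len(points)


def two_means(points, iterations=20):
    """k-means with k = 2, seeded with two far-apart points."""
    c = centroid(points)
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = centroid(left), centroid(right)
        if (new_a, new_b) == (a, b):
            break  # assignments are stable
        a, b = new_a, new_b
    return left, right


def maybe_split(points, max_distortion):
    """Return the cluster unchanged, or its two halves if it is too scattered."""
    if len(points) < 2 or distortion(points) <= max_distortion:
        return [points]
    return [part for part in two_means(points) if part]
```

A term that is used in two distant cities would exceed the threshold and end up as two separate tags on the map, each placed at the centroid of its own posts.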

6.3.2 Keyword Lens and Topic Modeling

ScatterBlogs provides additional views to support explorative tasks. Analysts can move a lens across the map, which will highlight the most important keywords of posts that were sent in the corresponding geographic regions under the lens. When selecting specific term clusters, a histogram depicts the number of posts over time. Text-based and date/time-based filters can be applied to select a subset of tweets. For such a selection of posts, the system can provide a thematic overview of the content using LDA topic modeling (Blei et al. 2003).

6.3.3 Interactive Classifier

In addition to keyword-based filtering, ScatterBlogs also offers means to train and apply SVM-based classifiers interactively, which can be mapped to a color and icon to support the visual indication of classified posts. An initial training set can be labeled greedily based on keyword searches and geographic filters. The system then provides visual feedback on the classifier in its current state (e.g., visualization of affected posts on the map) so that analysts can refine it iteratively. It is also possible to combine several such trained classifiers with a visual graph structure (right side of Fig. 6.3) that helps to define Boolean chains.

6.4 Space-Filling Curves for Visualizing the Spatiotemporal Evolution of Data

In addition to the publishing date, a subset of posts and articles also contains a geographic reference (e.g., location of the tweeter). As outlined in the previous section, such geo-references enable analysts to filter data based on relevant regions, but evolving geographic patterns and anomalies can also hint at interesting developments and inform the decision-making process. The ScatterBlogs system focuses on the real-time analysis of streaming posts, and thus, the temporal aspect is mostly implicitly encoded by the dynamic nature of the visualizations. However, for certain analytical tasks, it might be important to analyze larger time ranges in retrospect. Section 6.2 introduced PyramidTags, which applies the triangular layout to visually encode date ranges without animations for exploring vast amounts of social media posts and news articles. However, it is challenging to visualize the temporal evolution of geolocated data without animations, since the two main dimensions are typically already reserved to visually encode the geographic location on a map. Animations, though, need to capture the attention of the analyst over a longer time span and impede comparative analyses. Franke et al. (2021) proposed the use of space-filling curves for visually encoding spatial data into just one dimension so that we can depict the evolution alongside the y-axis.

6.4.1 Neighborhood-Preserving 1D Projections

The main idea of the approach is to project geographic positions into one-dimensional positions using space-filling curves, such as the Hilbert and Morton curves, while still preserving local neighborhoods to a certain degree. We can then plot a representative scalar value of geo-referenced data points across time in a two-dimensional plot so that analysts can better assess the temporal evolution of geographic neighborhoods, as well as spot spreading patterns, geographic hotspots, similar patterns across different regions, and trendsetters. In a preprocessing step, the system clusters the data points hierarchically based on their spatial position (if the spatial hierarchy is not already given). This clustering allows us to aggregate larger datasets at different levels of spatial granularity and enables analysts to focus their analysis on specific geographic regions, which also aids in the interpretation of the resulting geo-projections.
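As an illustration of such projections, the sketch below computes Morton and Hilbert indices for cells of a regular grid using the standard textbook constructions. How the cited system discretizes space is not described here, so the plain longitude/latitude grid in `to_cell` and the parameter `bits` (grid resolution of 2^bits cells per axis) are assumptions.

```python
def morton_index(x, y, bits):
    """Z-order (Morton) index: interleave the bits of x and y."""
    d = 0
    for i in range(bits):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def hilbert_index(x, y, bits):
    """Position of grid cell (x, y) along a Hilbert curve of the given order."""
    n = 1 << bits
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so that the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def to_cell(lon, lat, bits):
    """Map a longitude/latitude pair to integer grid coordinates."""
    n = 1 << bits
    x = min(n - 1, int((lon + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return x, y
```

Sorting geographic entities by `hilbert_index(*to_cell(lon, lat, bits), bits)` yields a one-dimensional ordering in which consecutive grid cells are always spatial neighbors. The Morton curve is cheaper to compute but contains jumps between distant cells, which is one reason why comparing several curves can be useful.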

6.4.2 Main Interface

For each aggregation level in the spatial hierarchy, the timeline view (Fig. 6.4 underneath the map) visualizes the temporal evolution of each entity (e.g., aggregated cluster or single data point) alongside the y-axis. The bars correspond to geographic entities in the clustering and are ordered based on their calculated position in the respective space-filling curve. Analysts can select a specific entity to focus on (highlighted by a red border), which will trigger the system to re-order close entities in the detail view at the bottom. The map at the top serves as an aggregated overview of the different geospatial entities based on a specific point in time that users can specify with the slider at the top-right of the interface.

The system supports several methods for computing space-filling curves (top-left panel). Upon hovering over a specific method, the respective curve is plotted on top of the map, and the differences in the ordering of the elements to the currently selected curve are visualized. Several computed metrics help analysts better assess the quality of the projections.

6.5 Clustering Posts Dynamically to Analyze Posts in Real Time

Leveraging geo-annotations helps to scale the visual analysis of streaming data and enable geo-specific baselines as well as anomaly detection methods. However, while people still post textual geo-references (e.g., names of cities), the percentage of geolocated social media posts has steadily decreased in recent years. Thus, we need different strategies to enable the real-time analysis of social media posts, and we need to facilitate more context-rich analyses of the actual textual content.

To achieve this, Knittel et al. (2022) have proposed a visual analytics system that employs an efficient dynamic clustering algorithm, providing analysts with a continuous overview of what people talk about on Twitter. A dynamic visualization of frequently used phrases and a stream of representative posts help analysts to monitor topics they are interested in, and they can also dive deeper into such topics while increasing the resolution of the analysis. Figure 6.5 depicts the main interface of the approach.

6.5.1 Dynamic Clustering

The system stores each new post in a sliding window of configurable size (e.g., the last 20 minutes) and computes corresponding bag-of-words vector representations (Salton and Buckley 1988). For the dynamic clustering, the approach adapts the k-means clustering algorithm (Lloyd 1982) based on a more efficient implementation for sparse vectors (Knittel et al. 2021a). The idea is to regularly cluster the documents in the sliding window with different numbers of clusters while using the centroids from the previous clustering run (if available) as a starting point to obtain more coherent clusters between runs. The Davies-Bouldin Index (DBI) (Davies and Bouldin 1979) determines which of the different clusterings is deemed best. The algorithm then tries to map the final cluster centroids to the ones from the previous run to determine which clusters are actually new, are deprecated, or have just been updated. The system runs two independent clustering processes with different levels of granularity. The more coarse-grained clustering provides a topical overview of the tweets in the window; the more fine-grained clustering facilitates a stream of representative posts.
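The interplay of warm-started clustering and DBI-based model selection can be sketched as follows. This is a didactic simplification: it uses plain Lloyd iterations with Euclidean distances on dense lists instead of the sparse spherical k-means of the cited work, the seeding of additional centroids is an assumed heuristic, and the final mapping of new to old centroids is not shown. All function names are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign(points, centroids):
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def kmeans(points, centroids, iterations=25):
    """Lloyd's algorithm, started from the given centroids (warm start)."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            updated.append(mean(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: lower values indicate compact, well-separated clusters."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        scatter.append(sum(dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    return sum(
        max((scatter[i] + scatter[j]) / max(dist(centroids[i], centroids[j]), 1e-12)
            for j in range(k) if j != i)
        for i in range(k)
    ) / k


def seed_centroids(points, k, previous=None):
    """Reuse previous centroids; fill up with the points farthest from all seeds."""
    seeds = [list(c) for c in (previous or [])][:k]
    if not seeds:
        seeds.append(list(points[0]))
    while len(seeds) < k:
        seeds.append(list(max(points, key=lambda p: min(dist(p, s) for s in seeds))))
    return seeds


def best_clustering(points, candidate_ks, previous=None):
    """Cluster with several k (each >= 2) and keep the result with the lowest DBI."""
    best = None
    for k in candidate_ks:
        labels, centroids = kmeans(points, seed_centroids(points, k, previous))
        score = davies_bouldin(points, labels, centroids)
        if best is None or score < best[0]:
            best = (score, labels, centroids)
    return best
```

Passing the centroids of the previous run as `previous` keeps cluster indices stable across consecutive windows as long as the topics persist, which is what makes the subsequent matching of old and new clusters feasible.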

6.5.2 Topical Overview

The left side of the main user interface (Fig. 6.5) visualizes the topics as computed by the coarse-grained dynamic clustering process. Each row represents one topic, conveying its size with the length of the bar and the number of posts over time in a small line chart, as well as providing a short topic summary with the most important words that define that cluster. Once the clustering has been updated with new posts, the visualization changes dynamically to represent the new state, but in a staggered way so that the mental map is preserved. Each row gets updated step by step (e.g., the lebron topic in Fig. 6.5), visualizing the number of new posts in dark green, the number of removed posts in red, and the number of posts that were moved to a different topic in magenta (while also depicting this flow with curves to the left of the bars). New terms are highlighted in dark green. The speed of this dynamic rollout is adjustable.

6.5.3 Frequent Phrases and Stream of Representative Posts

Users can select one or several of the main topics on the left, which will update the right side of the interface with more detailed views for summarizing the content of these topics. The system continuously extracts unusually frequent words and phrases in the selection with ELSKE (Knittel et al. 2021b) and lists them on the right side, highlighting new entries in dark green. Below each such phrase, a small barcode-like visualization allows analysts to infer which of these text parts co-occur by mentally overlapping their respective barcodes. The system can also emphasize this overlap in orange if analysts select several such phrases.
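The idea behind the barcode overlap can be expressed compactly. The sketch assumes that each stripe of a barcode stands for one post in the current selection, which is an interpretation for illustration purposes; the function names and the naive substring matching are likewise hypothetical.

```python
def barcode(phrase, posts):
    """One stripe per post: 1 if the post contains the phrase, else 0."""
    return [1 if phrase in post.lower() else 0 for post in posts]


def overlap(*barcodes):
    """Stripes that are set in every one of the given barcodes."""
    return [int(all(bits)) for bits in zip(*barcodes)]
```

The stripes returned by `overlap` correspond to the posts in which all selected phrases co-occur, which is the portion that would be emphasized when analysts select several phrases.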

Below this view, a stream of posts appears that resembles a typical feed users would also see on Twitter. One of the challenges that the case study of Fathi et al. (2020) with emergency managers identified is the dynamic nature of quickly evolving situations, in which the sheer number of published posts related to a specific event or topic can easily overwhelm analysts. Hence, the feed below the frequent phrases is designed so that the appearing posts cover the thematic variety of the selected topics while keeping the number of new posts in a given time span low, ensuring that analysts can focus on a digestible set of posts.

For each cluster in the fine-grained dynamic clustering process, the system selects one post that should represent this fine-grained theme by calculating the distances between the vector representations of the posts and the respective cluster centroid. Due to their similarity with the respective centroids, these representative posts ideally cover a large proportion of tweets in the same cluster, but they should also provide a diverse summary as they were sourced from different clusters in the fine-grained clustering process. If such a representative post has not been added yet, it will be queued up, and a small badge in light blue appears that notifies the user about new posts. For each such post, a small bar chart depicts the number of similar posts, which can also be retrieved if analysts click on the bar.
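Selecting representatives amounts to a nearest-to-centroid search per cluster. The sketch below uses cosine similarity on dense lists; the choice of similarity measure and the function names are assumptions, since the text above only states that distances to the centroid are computed.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def representatives(vectors, labels, centroids):
    """For every cluster, return the index of the post most similar to its centroid."""
    best = {}
    for idx, (vec, label) in enumerate(zip(vectors, labels)):
        sim = cosine(vec, centroids[label])
        if label not in best or sim > best[label][0]:
            best[label] = (sim, idx)
    return {label: idx for label, (sim, idx) in best.items()}
```

Because exactly one post is returned per fine-grained cluster, the number of posts pushed into the feed is bounded by the number of clusters rather than by the volume of the stream.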

6.5.4 Diving into Topics

While the frequent phrases and representative posts already provide more context-rich summaries of the selected topics, the general thematic variety on social media platforms is typically high, so it is challenging to group all posts into just 10 to 20 overview clusters. To alleviate this, the proposed system allows users to gradually dive into topics. After selecting one or several such coarse topics on the left side, analysts can click on the fork button at the top of the interface. This will define a new filter layer, in which only posts that fit the selection will be processed and visualized (it is also possible to create keyword-based filters). As a result, the topical overview on the left and the fine-grained clustering process now only operate on this filtered stream of data, which increases the resolution of the topics and aggregations and, thus, the specificity of the analysis.
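Conceptually, forking stacks filters on top of each other. The class below is a hypothetical sketch of this idea, not the system's data model; the post fields `topic` and `text` used in the usage example are assumptions.

```python
class FilterLayer:
    """A filter that only accepts posts that also pass all of its parent layers."""

    def __init__(self, predicate, parent=None):
        self.predicate = predicate
        self.parent = parent

    def accepts(self, post):
        if self.parent is not None and not self.parent.accepts(post):
            return False
        return self.predicate(post)

    def fork(self, predicate):
        """Create a more specific layer on top of this one."""
        return FilterLayer(predicate, parent=self)


# Usage: start with everything, fork into a topic, then refine by keyword.
root = FilterLayer(lambda post: True)
sports = root.fork(lambda post: post["topic"] == "sports")
lebron = sports.fork(lambda post: "lebron" in post["text"].lower())
```

Only posts accepted by the innermost layer would be fed into the clustering processes, so the same number of overview clusters is spent on a much narrower slice of the stream.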

6.6 Conclusion

Geo-referenced social media posts and news articles are a rich source for harvesting information and knowledge, but the unstructured nature of the main content and the volume, veracity, and velocity of the data pose significant challenges for developing such visual analysis systems. This chapter outlined several recent approaches for tackling these challenges. The ScatterBlogs system scales the visual analysis of streaming geo-referenced posts by continuously extracting terms that exhibit spatiotemporal anomalies. Our proposed dynamic clustering algorithm enables the continuous monitoring of posts even if they are not geo-tagged. PyramidTags is a novel tag map layout for exploring large time-stamped text collections. We further outlined how we can utilize one-dimensional projection methods to visualize geo-referenced time series data such that we can still observe important spatial trends and patterns.

Incorporating geographic locations into the analysis has several benefits, since it allows us to better detect interesting events and helps to filter the content that has to be processed. However, due to the decreasing share of geolocated documents, we need to develop new strategies and techniques for leveraging geographic references in social media posts and news articles.

Acknowledgements

This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—VA4VGI, 314647693.

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. https://doi.org/10.1016/b978-0-12-411519-4.00006-9

Bosch H, Thom D, Heimerl F, Puttmann E, Koch S, Kruger R, Worner M, Ertl T (2013) ScatterBlogs2: Real-time monitoring of microblog messages through user-guided filtering. IEEE Trans Vis Comput Graph 19(12):2022–2031. https://doi.org/10.1109/TVCG.2013.186

Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell PAMI-1(2):224–227. https://doi.org/10.1109/TPAMI.1979.4766909

Fathi R, Thom D, Koch S, Ertl T, Fiedrich F (2020) VOST: A case study in voluntary digital participation for collaborative emergency management. Inf Process Manag 57(4):102174. https://doi.org/10.1016/j.ipm.2019.102174. https://www.sciencedirect.com/science/article/pii/S0306457319302316

Franke M, Martin H, Koch S, Kurzhals K (2021) Visual analysis of spatio-temporal phenomena with 1D projections. Comput Graph Forum 40(3):335–347. https://doi.org/10.1111/cgf.14311. https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14311

Keim D, Kohlhammer J, Ellis G, Mansmann F (2010) Mastering the information age: solving problems with visual analytics. Goslar: Eurographics Association. https://diglib.eg.org/handle/10.2312/14803

Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of the International Conference on Neural Networks, ICNN 1995, vol 4, pp 1942–1948. https://doi.org/10.1109/ICNN.1995.488968

Knittel J, Koch S, Ertl T (2021a) Efficient sparse spherical K-means for document clustering. In: Proceedings of the 21st ACM Symposium on Document Engineering, DocEng 2021. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3469096.3474937

Knittel J, Koch S, Ertl T (2021b) ELSKE: efficient large-scale keyphrase extraction. In: Proceedings of the 21st ACM Symposium on Document Engineering, DocEng 2021. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3469096.3474930. https://dl.acm.org/doi/10.1145/3469096.3474930

Knittel J, Koch S, Ertl T (2021c) PyramidTags: context-, time- and word order-aware tag maps to explore large document collections. IEEE Trans Vis Comput Graph 27(12):4455–4468. https://doi.org/10.1109/TVCG.2020.3010095

Knittel J, Koch S, Tang T, Chen W, Wu Y, Liu S, Ertl T (2022) Real-time visual analysis of high-volume social media posts. IEEE Trans Vis Comput Graph 28(1):879–889. https://doi.org/10.1109/TVCG.2021.3114800

Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137. https://doi.org/10.1109/TIT.1982.1056489

Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manag 24(5):513–523. https://doi.org/10.1016/0306-4573(88)90021-0

Thom D (2015) Visual analytics of social media for situation awareness. PhD thesis, University of Stuttgart. https://doi.org/10.18419/opus-3540

Thom D, Bosch H, Koch S, Worner M, Ertl T (2012) Spatiotemporal anomaly detection through visual analysis of geolocated Twitter messages. In: Proceedings of the 2012 IEEE Pacific Visualization Symposium, PacificVis 2012, pp 41–48. https://doi.org/10.1109/PacificVis.2012.6183572

Thom D, Kruger R, Ertl T, Bechstedt U, Platz A, Zisgen J, Volland B (2015) Can twitter really save your life? A case study of visual social media analytics for situation awareness. In: Proceedings of the 2015 IEEE Pacific Visualization Symposium, PacificVis 2015, pp 183–190. https://doi.org/10.1109/PACIFICVIS.2015.7156376

Thomas JJ, Cook KA (2005) Illuminating the path: The research and development agenda for visual analytics. Pacific Northwest National Laboratory (PNNL), Richland, WA

Print ISBN: 978-3-031-35373-4

Electronic ISBN: 978-3-031-35374-1

Copyright Year: 2024
DOI: https://doi.org/10.1007/978-3-031-35374-1_6
