
2012 | Book

Dark Web

Exploring and Data Mining the Dark Side of the Web

About this book

The University of Arizona Artificial Intelligence Lab (AI Lab) Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (Jihadist) phenomena via a computational, data-centric approach. We aim to collect "ALL" web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social networking sites, videos, virtual world, etc. We have developed various multilingual data mining, text mining, and web mining techniques to perform link analysis, content analysis, web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research. The approaches and methods developed in this project contribute to advancing the field of Intelligence and Security Informatics (ISI). Such advances will help related stakeholders to perform terrorism research and facilitate international security and peace.

This monograph aims to provide an overview of the Dark Web landscape, suggest a systematic, computational approach to understanding the problems, and illustrate with selected techniques, methods, and case studies developed by the University of Arizona AI Lab Dark Web team members. This work aims to provide an interdisciplinary and understandable monograph about Dark Web research along three dimensions: methodological issues in Dark Web research; database and computational techniques to support information collection and data mining; and legal, social, privacy, and data confidentiality challenges and approaches. It will bring useful knowledge to scientists, security professionals, counterterrorism experts, and policy makers. The monograph can also serve as a reference material or textbook in graduate level courses related to information security, information policy, information assurance, information systems, terrorism, and public policy.

Table of Contents

Frontmatter

Research Framework: Overview and Introduction

Frontmatter
Chapter 1. Dark Web Research Overview
Abstract
The AI Lab Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (jihadist) phenomena via a computational, data-centric approach. We aim to collect “all” web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social networking sites, videos, virtual world, etc. We have developed various multilingual data mining, text mining, and web mining techniques to perform link analysis, content analysis, web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research. The approaches and methods developed in this project contribute to advancing the field of Intelligence and Security Informatics (ISI). Such advances will help related stakeholders to perform terrorism research and facilitate international security and peace. It is our belief that we (the USA and allies) are facing the dire danger of losing “The War on Terror” in cyberspace (especially when many young people are being recruited, incited, infected, and radicalized on the web), and we would like to help in our small (computational) way. More details and updated information can be found at the project web site: http://ai.eller.arizona.edu/research/terror/.
Hsinchun Chen
Chapter 2. Intelligence and Security Informatics (ISI): Research Framework
Abstract
In this chapter, we review the computational research framework that is adopted by the Dark Web research. We first present the security research context, followed by description of a data mining framework for Intelligence and Security Informatics research.
The tragic events of September 11 and the following anthrax contamination of letters caused drastic effects on many aspects of society. Academics in the fields of natural sciences, computational science, information science, social sciences, engineering, medicine, and many others have been called upon to help enhance the government’s ability to fight terrorism and other crimes. Six critical mission areas have been identified where information technology can contribute, as suggested in the “National Strategy for Homeland Security” report, including: intelligence and warning, border and transportation security, domestic counterterrorism, protecting critical infrastructure, defending against catastrophic terrorism, and emergency preparedness and responses. Facing the critical missions of national security and various data and technical challenges, we believe there is a pressing need to develop the science of “Intelligence and Security Informatics” (ISI).
To address the data and technical challenges facing ISI, we present a research framework with a primary focus on KDD (Knowledge Discovery from Databases) technologies. The framework is discussed in the context of crime types and security implications. Selected data mining techniques, including information sharing and collaboration, association mining, classification and clustering, text mining, spatial and temporal mining, and criminal network analysis, are believed to be critical to criminal and intelligence analyses and investigations. In addition to the technical discussions, this chapter also discusses caveats for data mining and important civil liberties considerations.
Hsinchun Chen
Chapter 3. Terrorism Informatics
Abstract
In this chapter, we provide an overview of selected resources of relevance to “Terrorism Informatics,” a new discipline that aims to study the terrorism phenomena with a data-driven, quantitative, and computational approach. We first summarize several critical books that lay the foundation for studying terrorism in the new Internet era. We then review important terrorism research centers and resources that are of relevance to our Dark Web research. The Dark Web project has benefited significantly from their publications and ideas, and from interactions and collaboration with some of these terrorism study experts and scholars.
Hsinchun Chen

Dark Web Research: Computational Approach and Techniques

Frontmatter
Chapter 4. Forum Spidering
Abstract
The unprecedented growth of the Internet has propagated the escalation of the Dark Web, the problematic facet of the web associated with cyber crime, hate, and extremism. Despite the need for tools to collect and analyze Dark Web forums, the covert nature of this part of the Internet makes traditional web crawling techniques insufficient for capturing such content. In this chapter, we propose a novel crawling system designed to collect Dark Web forum content. The system uses a human-assisted accessibility approach to gain access to Dark Web forums. Several URL ordering features and techniques enable efficient extraction of forum postings. The system also includes an incremental crawler coupled with a recall improvement mechanism intended to facilitate enhanced retrieval and updating of collected content. Experiments conducted to evaluate the effectiveness of the human-assisted accessibility approach and the recall improvement–based incremental update procedure yielded favorable results. The human-assisted approach significantly improved access to Dark Web forums while the incremental crawler with recall improvement also outperformed standard periodic and incremental update approaches. Using the system, we were able to collect over 100 Dark Web forums from three regions.
Hsinchun Chen
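The Chapter 4 abstract mentions URL ordering features without listing them. As a rough illustration only, the sketch below orders a crawl frontier with a priority queue so that pages likely to hold postings are fetched first. The scoring rule, the URL tokens (`showthread`, `viewtopic`, `forumdisplay`, `viewforum`), and the function names are assumptions made for this example and are not taken from the chapter.

```python
import heapq

def score_url(url):
    """Hypothetical ordering heuristic: a lower score is fetched sooner."""
    if "showthread" in url or "viewtopic" in url:
        return 0  # assumed thread pages, which hold the postings
    if "forumdisplay" in url or "viewforum" in url:
        return 1  # assumed board index pages, which lead to threads
    return 2      # everything else (profiles, help pages, and so on)

def order_urls(urls):
    """Return the URLs in crawl order; ties keep their original order."""
    heap = [(score_url(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Made-up frontier for illustration.
frontier = [
    "http://forum.example/help.php",
    "http://forum.example/forumdisplay.php?f=2",
    "http://forum.example/showthread.php?t=10",
]
crawl_order = order_urls(frontier)
```

A real forum crawler would combine several such features and handle login and session state, which this sketch leaves out.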
Chapter 5. Link and Content Analysis
Abstract
While the Web has become a worldwide platform for communication, terrorists share their ideology and communicate with members on the “Dark Web” – the dark side of the Web used by terrorists. Currently, the problems of information overload and the difficulty of obtaining a comprehensive picture of terrorist activities hinder effective and efficient analysis of terrorist information on the Web. To improve understanding of terrorist activities, we have developed a novel methodology for collecting and analyzing Dark Web information. The methodology incorporates information collection, analysis, and visualization techniques, and exploits various Web information sources. We applied it to collecting and analyzing information of 39 jihad Web sites and developed visualization of their site contents, relationships, and activity levels. An expert evaluation showed that the methodology is very useful and promising, having a high potential to assist in investigation and understanding of terrorist activities by producing results that could potentially help guide both policy making and intelligence research.
Hsinchun Chen
Chapter 6. Dark Network Analysis
Abstract
Dark networks such as terrorist networks and narcotics-trafficking networks are hidden from our view yet could have a devastating impact on our society and economy. Understanding the topology of these dark networks can reveal greater insight into these clandestine organizations and help develop effective disruptive strategies. Based on analysis of four real-world “dark” networks, we found that these covert networks share many common topological properties with other types of networks. Their efficiency in communication and flow of information, commands, and goods can be tied to their small-world structures characterized by small average path length (l) and high clustering coefficient (C). In addition, we found that because of the small-world properties, dark networks are more vulnerable to attacks on the bridges that connect different communities than to attacks on the hubs. This may provide authorities with insight for intelligence and security purposes. An interesting finding about the three human dark networks is their substantially high clustering coefficients, which are not always present in other empirical networks.
Hsinchun Chen
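The two small-world measures named in the Chapter 6 abstract, average path length (l) and clustering coefficient (C), can be computed on any undirected graph. The sketch below uses the standard textbook definitions on a made-up toy network of two triangles joined by one bridge; the graph and function names are illustrative assumptions, not data or code from the book.

```python
from collections import deque

def clustering_coefficient(graph):
    """Average local clustering coefficient C of an undirected graph."""
    total = 0.0
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        nbrs = list(neighbors)
        # Count edges that exist among this node's neighbors.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in graph[nbrs[i]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)

def average_path_length(graph):
    """Mean shortest-path length l over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Hypothetical toy network: two triangles joined by a single bridge (c-d).
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
C = clustering_coefficient(toy)
l = average_path_length(toy)
```

In this toy graph, removing the single c-d bridge disconnects the two triangles, which loosely mirrors the abstract's point about the vulnerability of bridges between communities.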
Chapter 7. Interactional Coherence Analysis
Abstract
Despite the rapid growth of text-based computer-mediated communication (CMC), its limitations have rendered the media highly incoherent. The lack of coherence in CMC poses problems for content analysis of online discourse archives. Interactional coherence analysis (ICA) attempts to accurately identify and construct interaction networks of CMC messages. Although significant progress has been made, ICA research still has several limitations. Most previous ICA approaches used either system or linguistic features, but not both in conjunction, and also failed to address noise issues such as typos, misspellings, and idiosyncratic system usage behavior. Moreover, Web forums have seldom been studied for interactional coherence in spite of their prevalence. In this study, we propose the Hybrid Interactional Coherence (HIC) algorithm for identification of Web forum interaction. HIC utilizes both system features, such as header information and quotations, and linguistic features, such as direct address and lexical relation. Furthermore, several similarity-based methods, including a Lexical Match Algorithm (LMA) and a sliding window method, are utilized to account for interactional idiosyncrasies. Experiments were conducted on a large domestic extremist Web forum to compare the algorithm with traditional linkage and similarity-based methods. HIC significantly outperformed both comparison techniques in terms of precision, recall, and F-measure at both the forum and thread levels. The results demonstrate the effectiveness of HIC for identifying Web forum interaction.
Hsinchun Chen
Chapter 8. Dark Web Attribute System
Abstract
Terrorists and extremists are increasingly utilizing Internet technology to enhance their ability to influence the outside world. Due to the lack of multilingual and multimedia terrorist/extremist collections and advanced analytical methodologies, our empirical understanding of their Internet usage is still very limited. To address this research gap, we explore an integrated approach for identifying and collecting terrorist/extremist web contents. We also propose a Dark Web Attribute System (DWAS) to enable quantitative Dark Web content analysis from three perspectives: technical sophistication, content richness, and web interactivity. Using the proposed methodology, we identified and examined the Internet usage of major Middle Eastern terrorist/extremist groups. More than 200,000 multimedia web documents were collected from 86 Middle Eastern multilingual terrorist/extremist web sites. In our comparison of terrorist/extremist web sites to US government web sites, we found that terrorist/extremist groups exhibited levels of web knowledge similar to that of US government agencies. Moreover, terrorists/extremists had a strong emphasis on multimedia usage, and their web sites employed significantly more sophisticated multimedia technologies than government web sites. We also found that the terrorist/extremist groups are as effective as the US government agencies in terms of supporting communications and interaction using web technologies. Advanced Internet-based communication tools such as online forums and chat rooms are used much more frequently in terrorist/extremist web sites than government web sites. Based on our case study results, we believe that the DWAS is an effective tool to analyze the technical sophistication of terrorist/extremist groups’ Internet usage and could contribute to an evidence-based understanding of the applications of web technologies in the global terrorism phenomena.
Hsinchun Chen
Chapter 9. Authorship Analysis
Abstract
Following the tragic events of September 11, 2001, researchers have been called upon to assume a larger role in the preservation of public safety and national security. One of the major challenges facing the intelligence and security community is monitoring of online communication media that are commonly used by terrorist groups. In this chapter, we addressed the online anonymity problem by successfully applying authorship analysis to English and Arabic extremist group web forum messages. The performance impact of different feature categories and techniques was evaluated across both languages. In order to facilitate enhanced writing style identification, a comprehensive list of online authorship features was incorporated. Additionally, an Arabic language model was created by adopting specific features and techniques to deal with the challenging linguistic characteristics of Arabic, including an elongation filter and a root clustering algorithm. A series of experiments were conducted to evaluate the efficacy of our models with results indicating a high level of success. Finally, a comparison of the English and Arabic language models and messages was made to aid the research community’s understanding of the dynamics of these groups’ authorship tendencies.
Hsinchun Chen
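The Chapter 9 abstract names an elongation filter but does not describe it. Assuming that elongation here refers to the Arabic tatweel (kashida) character, U+0640, which writers insert to stretch a word for stylistic effect, a minimal sketch might strip the character and keep the count as a possible style feature. The function name and the decision to return the count are assumptions for illustration, not the chapter's design.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the elongation (kashida) character

def filter_elongation(word):
    """Return (word without tatweel, number of tatweel characters removed).

    Assumption: the removed count may be worth keeping as a stylistic
    feature, while the cleaned word is what later matching would use.
    """
    count = word.count(TATWEEL)
    return word.replace(TATWEEL, ""), count

# An elongated spelling of the word for "book", with two tatweel characters.
cleaned, stretched = filter_elongation("\u0643\u0640\u0640\u062a\u0627\u0628")
```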
Chapter 10. Sentiment Analysis
Abstract
The Internet is frequently used as a medium for exchange of information and opinions, as well as propaganda dissemination. In this study, the use of sentiment analysis methodologies is proposed for classification of Web forum opinions in multiple languages. The utility of stylistic and syntactic features is evaluated for sentiment classification of English and Arabic content. Specific feature extraction components are integrated to account for the linguistic characteristics of Arabic. The entropy weighted genetic algorithm (EWGA) is also developed, which is a hybridized genetic algorithm that incorporates the information gain heuristic for feature selection. EWGA is designed to improve performance and get a better assessment of the key features. The proposed features and techniques are evaluated on US and Middle Eastern Web forum postings. The experimental results using EWGA with SVM indicate high performance levels, with accuracy over 95% on the benchmark dataset and over 93% for both the US and Middle Eastern forums. Stylistic features significantly enhanced performance across all test beds while EWGA also outperformed other feature selection methods, indicating the utility of these features and techniques for document-level classification of sentiments.
Hsinchun Chen
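The Chapter 10 abstract says EWGA incorporates the information gain heuristic for feature selection. The sketch below shows only that heuristic, in its standard entropy-based form, on made-up data; it does not implement the genetic algorithm, and the feature and label values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: four documents with two binary features.
labels = ["pos", "pos", "neg", "neg"]
feature_a = [1, 1, 0, 0]  # separates the classes perfectly
feature_b = [1, 0, 1, 0]  # carries no information about the class
ig_a = information_gain(feature_a, labels)
ig_b = information_gain(feature_b, labels)
```

A feature selection method could rank features by such scores; how EWGA combines them with the genetic search is described in the chapter itself and is not reproduced here.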
Chapter 11. Affect Analysis
Abstract
Analysis of affective intensities in computer-mediated communication is important in order to allow a better understanding of online users’ emotions and preferences. Despite considerable research on textual affect classification, it is unclear which features and techniques are most effective. In this chapter, we compared several feature representations for affect analysis, including learned n-grams and various automatically and manually crafted affect lexicons. We also proposed the support vector regression correlation ensemble (SVRCE) method for enhanced classification of affect intensities. SVRCE uses an ensemble of classifiers, each trained using a feature subset tailored toward classifying a single affect class. The ensemble is combined with affect correlation information to enable better prediction of emotive intensities. Experiments were conducted on four test beds encompassing web forums, blogs, and online stories. The results revealed that learned n-grams were more effective than lexicon-based affect representations. The findings also indicated that SVRCE outperformed comparison techniques, including Pace regression, semantic orientation, and WordNet models. Ablation testing showed that the improved performance of SVRCE was attributable to its use of feature ensembles as well as affect correlation information. A case study was conducted to illustrate the utility of the features and techniques for affect analysis of large archives of online discourse of US and Middle Eastern extremists.
Hsinchun Chen
Chapter 12. CyberGate Visualization
Abstract
Computer-mediated communication (CMC) analysis systems are important for improving participant accountability and researcher analysis capabilities. However, existing CMC systems focus on structural features, with little support for analysis of text content in web discourse. In order to address this shortcoming, we propose a framework for CMC text analysis grounded in Systemic Functional Linguistic Theory. Our framework addresses several ambiguous CMC text mining issues, including the relevant tasks, features, information types, feature selection methods, and visualization techniques. Based on it, we have developed a system called CyberGate, which includes the Writeprint and Ink Blot techniques. These techniques incorporate complementary feature selection and visualization methods in order to allow a breadth of analysis and categorization capabilities. An application example is used to illustrate the ability of these techniques for CMC text analysis. Furthermore, experiments were conducted in comparison with a benchmark technique (Support Vector Machine) in order to assess the viability of CyberGate’s Writeprint and Ink Blot techniques for categorization of various forms of CMC text. The results indicated that the CyberGate techniques matched the Support Vector Machine performance in most cases while outperforming it for certain information types. Collectively, the results indicate that the system and its underlying design framework can dramatically improve text content analysis functions over those found in existing CMC systems.
Hsinchun Chen
Chapter 13. Dark Web Forum Portal
Abstract
In recent years, there have been numerous studies from a variety of perspectives analyzing the Internet presence of hate and extremist groups. Yet the web sites and forums of extremist and terrorist groups have long remained an underutilized resource for terrorism researchers due to their ephemeral nature and persistent access and analysis problems. The purpose of the Dark Web archive, therefore, is to provide a research infrastructure for use by social scientists, computer and information scientists, policy and security analysts, and others studying a wide range of social and organizational phenomena and computational problems. The Dark Web Forum Portal provides web-enabled access to critical international jihadist web forums. The focus of this chapter is on the significant extensions to previous work including: increasing the scope of our data collection; adding an incremental spidering component for regular data updates; enhancing the searching and browsing functions; enhancing multilingual machine translation for Arabic, French, German, and Russian; and adding advanced social network analysis. A case study on identifying active participants is described at the end.
Hsinchun Chen

Dark Web Research: Case Studies

Frontmatter
Chapter 14. Jihadi Video Analysis
Abstract
This chapter presents an exploratory study of jihadi extremist groups’ videos using content analysis and a multimedia coding tool to explore the types of videos, groups’ modus operandi, and production features that lend support to extremist groups. The videos convey messages powerful enough to mobilize members, sympathizers, and even new recruits to launch attacks that are captured (on video) and disseminated globally through the Internet. They communicate the effectiveness of the campaigns and have a much wider impact because the messages are media rich with nonverbal cues and have vivid images of events that can evoke not only a multitude of psychological and emotional responses but also violent reactions. The videos are important for jihadi extremist groups’ learning, training, and recruitment. In addition, the content collection and analysis of extremist groups’ videos can help policy makers, intelligence analysts, and researchers better understand the extremist groups’ terror campaigns and modus operandi, and help suggest counterintelligence strategies and tactics for troop training.
Hsinchun Chen
Chapter 15. Extremist YouTube Videos
Abstract
With the emergence of Web 2.0, sharing personal content, communicating ideas, and interacting with other online users in Web 2.0 communities have become daily routines for online users. User-generated data from Web 2.0 sites provide rich personal information, such as personal preferences and interests, and can be utilized to obtain insight about cyber communities and their social networks. Many studies have focused on leveraging user-generated information to analyze blogs and forums, but few studies have applied this approach to video-sharing web sites. In this chapter, we proposed a text-based framework for video content classification of online video-sharing web sites. Different types of user-generated data (e.g., titles, descriptions, and comments) were used as proxies for online videos, and three types of text features (lexical, syntactic, and content-specific features) were extracted. Three feature-based classification techniques (C4.5, Naïve Bayes, and SVM) were used to classify videos. To evaluate the proposed framework, user-generated data from candidate videos, which were identified by searching user-given keywords on YouTube, were first collected. Then, a subset of the collected data was randomly selected and manually tagged by users as our experiment data. The experimental results showed that the proposed approach was able to classify online videos based on users’ interests with accuracy rates up to 87.2%, and all three types of text features contributed to discriminating videos. SVM outperformed C4.5 and Naïve Bayes in our experiments. In addition, our case study further demonstrated that accurate video classification results are very useful for identifying implicit cyber communities on video-sharing web sites.
Hsinchun Chen
Chapter 16. Improvised Explosive Devices (IED) on Dark Web
Abstract
This chapter presents a cyber-archaeology approach to social movement research. The approach overcomes many of the issues of scale and complexity facing social research on the Internet, enabling broad and longitudinal study of the virtual communities supporting social movements. Cultural cyber-artifacts of significance to the social movement are collected and classified using automated techniques, enabling analysis across multiple related virtual communities. Approaches to the analysis of cyber-artifacts are guided by perspectives of social movement theory. A Dark Web case study on a broad group of related IED virtual communities is presented to demonstrate the efficacy of the framework and provide a detailed instantiation of the proposed approach for evaluation.
Hsinchun Chen
Chapter 17. Weapons of Mass Destruction (WMD) on Dark Web
Abstract
The tragic events of September 11 have caused drastic effects on many aspects of society. Academics in the fields of natural sciences, computational science, information science, social sciences, engineering, medicine, and many others have been called upon to help enhance the government’s ability to fight terrorism and other crimes. In the mission area of defending against catastrophic terrorism, weapons of mass destruction (WMD), especially nuclear weapons, have been considered one of the most dangerous threats to US homeland security and international peace and prosperity. We believe the science of Intelligence and Security Informatics (ISI) can help with nuclear forensics and attribution. ISI research can help advance the intelligence collection, analytical techniques, and instrumentation used in determining the origin, capability, intent, and transit route of nuclear materials by selected hostile countries and (terrorist) groups. We propose a research framework that aims to investigate the capability, accessibility, and intent of critical high-risk countries, institutions, researchers, and extremist or terrorist groups. We propose to develop a knowledge base of the Nuclear Web that will collect, analyze, and pinpoint significant actors in the high-risk international nuclear physics and weapons communities. We also identify potential extremist or terrorist groups from our Dark Web test bed who might pose WMD threats to the USA and the international community. Selected knowledge mapping and focused web crawling techniques and findings from a preliminary study are presented in this chapter.
Hsinchun Chen
Chapter 18. Bioterrorism Knowledge Mapping
Abstract
Biomedical research used for defense purposes may also be applied to biological weapons development. To mitigate risk, the US Government has attempted to monitor and regulate biomedical research labs, especially those that study bioterrorism agents/diseases. However, monitoring worldwide biomedical researchers and their work is still an issue. In this chapter, we developed an integrated approach to mapping worldwide bioterrorism research literature. Our objectives are to identify the researchers who have expertise in the bioterrorism agents/diseases research domain, the major institutions and countries where these researchers reside, and the emerging topics and trends in bioterrorism agents/diseases research. By utilizing knowledge mapping techniques, we analyzed the productivity status, collaboration status, and emerging topics in the bioterrorism domain. The analysis results provide insights into the research status of bioterrorism agents/diseases and thus allow a more comprehensive view of bioterrorism researchers and ongoing work.
Hsinchun Chen
Chapter 19. Women’s Forums on the Dark Web
Abstract
With the recent advent of Web 2.0, more and more women participate in and exchange opinions through community-based social media on the Internet. Questions concerning gender differences in the context of online communication have been raised. In this study, we develop a feature-based text classification framework to examine the online gender differences between female and male posters on web forums by analyzing writing styles and topics of interests. We examine the performance of different feature sets in an experiment involving political opinions. The results of our experimental study on this Islamic women’s political forum show that the feature sets containing both content-free and content-specific features perform significantly better than those consisting of only content-free features. In addition, feature subset selection can improve the classification results significantly. Female and male participants were found to have significantly different topics of interest in our study.
Hsinchun Chen
Chapter 20. US Domestic Extremist Groups
Abstract
US domestic extremist groups have increased in number and are intensively utilizing the Internet as an effective tool to share resources and members with limited regard for geographic, legal, or other obstacles. Researchers find that monitoring extremist and hate groups’ web sites and analyzing their usage and content have become time consuming and challenging. In response, this chapter describes the development of automated or semiautomated methodologies for capturing, classifying, and organizing domestic extremist web site data and using them for analysis. We found that by analyzing the hyperlink structures and content of domestic extremist web sites and constructing social network maps, their interorganizational structure and cluster affinities could be identified. Such analysis results could help experts in terrorism, law enforcement, intelligence, and policy making domains better understand the domestic extremist phenomena and eventually boost our national security.
Hsinchun Chen
Chapter 21. International Falun Gong Movement on the Web
Abstract
Framing a collective identity is an essential process in a social movement. The identity defines the orientation of public actions to take and establishes an informal interaction network for circulating important information and material resources. While domestic social movements emphasize the coherence of identity in alliance, global or cyber activism is now flexible in its collective identity given the rise of the Internet. A campaign may include diverse social movement organizations (SMOs) with different social agendas. This flexible identity framing encourages personal involvement in direct action. On the other hand, it may damage solidarity within SMOs and make campaigns difficult to control. To assess the sustainability of an SMO, it is important to understand its collective identity and the social codes embedded within its associated cyber societies and cyber-artifacts. In this chapter, we took a cyber-archaeology approach and used the international Falun Gong (FLG) movement as a case study to investigate this identity-framing issue. We employed Social Network Analysis and Writeprint to analyze FLG’s cyber-artifacts from the perspectives of links, web content, and forum content. In the link analysis, FLG’s web sites linked closely to Chinese democracy and human rights SMOs, reflecting FLG’s historical conflicts with the Chinese government after the official ban in 1999. In the web content analysis, we used Writeprint to analyze the writings of Hongzhi Li and his editors, and found that Hongzhi Li’s writings center around the ideological teaching of Falun Dafa while the editors post specific programs to realize Li’s teaching. In the forum content analysis, FLG comprehensively organizes several different concepts on a continuum: from FLG ideology to life philosophy and mysterious phenomena, and from mysterious phenomena to anti-Chinese Communist Party and persecution by conceptualizing the Chinese government as evil. 
By deploying those cyber-artifacts, FLG seamlessly connects different ideologies and establishes its identity as a Qi-Gong, religious, and activist group.
Hsinchun Chen
Chapter 22. Botnets and Cyber Criminals
Abstract
In the last several years, the nature of computer hacking has completely changed. Cyber crime has risen to unprecedented sophistication with the evolution of botnet technology, and an underground community of cyber criminals has arisen, capable of inflicting serious socioeconomic and infrastructural damage in the information age. This chapter serves as an introduction to the world of modern cyber crime and discusses information systems to investigate it. We investigated the command and control (C&C) signatures of major botnet herders using data collected from the Shadowserver Foundation, a nonprofit research group for botnet research. We also performed exploratory population modeling of the bots and cluster analysis of selected cyber criminals.
Hsinchun Chen
Backmatter
Metadata
Title
Dark Web
Written by
Hsinchun Chen
Copyright year
2012
Publisher
Springer New York
Electronic ISBN
978-1-4614-1557-2
Print ISBN
978-1-4614-1556-5
DOI
https://doi.org/10.1007/978-1-4614-1557-2