About this book

The book proposes a systematic approach to big data collection, documentation and development of analytic procedures that foster collaboration on a large scale. This approach, designated as “data factoring,” emphasizes the need to think of each individual dataset developed by an individual project as part of a broader data ecosystem, easily accessible and exploitable by parties not directly involved with data collection and documentation. Furthermore, data factoring uses and encourages pre-analytic operations that add value to big data sets, especially recombining and repurposing.
The book proposes a research-development agenda that can undergird an ideal data factory approach. Several programmatic chapters discuss specialized issues involved in data factoring (documentation, meta-data specification, building flexible, yet comprehensive data ontologies, usability issues involved in collaborative tools, etc.). The book also presents case studies for data factoring and processing that can lead to building better scientific collaboration and data sharing strategies and tools.
Finally, the book presents the teaching utility of data factoring and the ethical and privacy concerns related to it.
Chapter 9 of this book is available open access under a CC BY 4.0 license at

Table of Contents


Chapter 1. Introduction

Human interactions facilitated by social media, collaborative platforms, and the blogosphere generate an unprecedented volume of electronic trace data every day. These traces of human behavior online are a unique source for understanding contemporary life behaviors, beliefs, interactions, and knowledge flows. The social connections we make online, which reveal multiple types of human connection, are also recorded on a scale and to a level of granularity previously unimaginable, except possibly by science fiction writers. To many in the data analytics world, these traces are a gold mine. New sub-domains of inquiry have emerged as a consequence of this revolution: computational social science, big data, data science, open innovation data analytics, network science, and undoubtedly new ones yet to appear in the near future. Massive datasets, each counting millions of data records and behaviors, are now available to the academic, governmental, or industry research and teaching communities. They promise faster access to real-time social behavior and better understanding of how people behave and interact. Such “social” data include complete records of Wikipedia edits, interactions on social coding platforms like GitHub, and expressions of affiliation and engagement on social media (Twitter, Facebook, YouTube, etc.).
Nicolas Jullien, Sorin Adam Matei, Sean P. Goggins

Theoretical Principles and Approaches to Data Factories


Chapter 2. Accessibility and Flexibility: Two Organizing Principles for Big Data Collaboration

The chapter argues that accessibility and flexibility are the two principles and practices that can bring big data projects the closest to a data factory ideal. The chapter elaborates on the necessity of these two principles, offering a reasoned explanation for their value in context. Using two big data social scientific research projects as a springboard for conversation, the chapter highlights both the advantages and the practical limits within which accessibility and flexibility operate. The authors avoid both utopian and dystopian tropes about big data approaches. In addition, they offer a critical feminist discussion of big data collaboration. Of particular interest is also the manner in which specific characteristics of big data projects, especially volume and velocity, may affect multidisciplinary collaborations.
Libby Hemphill, Susan T. Jackson

Chapter 3. The Open Community Data Exchange: Advancing Data Sharing and Discovery in Open Online Community Science

While online behavior creates an enormous amount of digital data that can be the basis for social science research, to date, the science has been conducted piecemeal, one Internet address at a time, often without social or scholarly impact beyond the site’s own stakeholders. Scientists lack the tools, methods, and practices to combine, compare, contrast, and communicate about online behavior across Internet addresses or over time. In response, we are building the infrastructure for computational social scientists, social scientists, and citizens to make corresponding advances in our understanding of online human interactions. In this chapter, we present our effort to (1) specify the Open Community Data Exchange (OCDX) metadata standard to describe datasets, (2) introduce concepts from the data curation lifecycle to social computing research, and (3) describe candidate infrastructure for creating, editing, viewing, sharing, and analyzing manifests.
Sean P. Goggins, A. J. Million, Georg J. P. Link, Matt Germonprez, Kristen Schuster

Theoretical Principles and Ideas for Designing and Deploying Data Factory Approaches


Chapter 4. Levels of Trace Data for Social and Behavioural Science Research

The explosion of data available from online systems such as social media is creating a wealth of trace data, that is, data that record evidence of human activity. The volume of data available offers great potential to advance social and behavioural science research. However, the data are of a very different kind than more conventional social and behavioural science data, posing challenges to use. This paper adopts a data framework from Earth observation science and applies it to trace data to identify possible issues in analysing trace data. Application of the framework also reveals issues for sharing and reusing data.
Kevin Crowston

Chapter 5. The Ten Adoption Drivers of Open Source Software That Enables e-Research in Data Factories for Open Innovations

This chapter describes ten drivers of the adoption of open source software that enables e-research in data factories for open innovations. More specifically, the chapter discusses the emerging phenomena of big data and e-research, along with their various defining characteristics. Then the chapter makes a case for the importance of understanding the adoption of open source software for processing and harnessing big data. In other words, big data that remain in raw form will keep their insights hidden unless appropriate software is adopted. Open source software applications, along with the larger concept of cyberinfrastructure, play a critical role in our ability to realize the full potential of big data. The chapter also includes critical questions community stakeholders should keep in mind when promoting the diffusion and dissemination of good software applications that will support data factories for open innovations.
Kerk F. Kee

Chapter 6. Aligning Online Social Collaboration Data Around Social Order: Theoretical Considerations and Measures

Online media have revolutionized human interaction. Groups of people can rapidly converge, work on projects with little explicit coordination, and produce content that has immediate impact. Intellectual, analytic, or symbolic collaboration in academia, business, or even government is now almost inconceivable without online support. Work on text and narratives has been entirely transformed by Internet-based sites. Of these, wiki sites are the most successful and well known. Web-based wiki sites have greatly augmented if not supplanted traditional collaborative frameworks (Benkler, The wealth of networks: How social production transforms markets and freedom. New Haven: Yale University Press, 2006) for generating reference knowledge, documentation, and even education.
A growing literature looks at the processes that drive the most successful collaborative communities via wiki sites. One key issue that has emerged in the last few decades is the nature of the social order that informs them. Yet, researchers struggle with a fragmentary approach to aligning and comparing the social processes that generate social order. Furthermore, social order is itself under dispute, with some claiming that online collaborative communities represent a completely new form of organization (Benkler, The wealth of networks: How social production transforms markets and freedom. New Haven: Yale University Press, 2006) and others arguing that the traditional concept of hierarchical order itself is obsolete and needs to be abandoned (Brafman and Beckstrom, The starfish and the spider: The unstoppable power of leaderless organizations. New York: Penguin, 2006). In this chapter, we propose a theoretical strategy, complete with some measurements, that may make the problem of social order more tractable and make wikis and, more generally, online collaborative spaces and datasets easier to compare. We strive to go beyond surface-level characteristics, mining key qualities that explain the success of wikis and other types of online communities. Our approach and the analytic framework we have developed can be used as a way to think about and experiment with not only the concept of order in the abstract but also new tools, such as our Visible Effort. In what follows, we will detail our approach, complete with implementation recommendations for the measures proposed to capture social order.
Sorin Adam Matei, Brian C. Britt

Approaches in Action Through Case Studies of Data Based Research, Best Practice Scenarios, or Educational Briefs


Chapter 7. Lessons Learned from a Decade of FLOSS Data Collection

In 2004 a collaborative research team based at Syracuse University and Elon University began collecting and sharing data in order to understand how free/libre open source software (FLOSS) is made. Embodying some of the same FLOSS ethos, this team created a public-facing repository for their own data and analyses and encouraged other researchers to use it and contribute to it. This chapter tells the story of how the FLOSSmole project began, where the data comes from and what we have learned from it, and how the project has grown and changed over the years. In addition to capturing snapshots of the current state of the FLOSS landscape, FLOSSmole also serves as a mirror to the larger FLOSS ecosystem, since changes in FLOSSmole’s mission and goals over the years necessarily reflect some of the cultural and technological changes taking place in FLOSS itself. As such, FLOSSmole will continue to face many challenges in the future, including the continual need to provide broader access and more sophisticated and relevant data and analyses and to do all this in a way that is sustainable and community driven.
Kevin Crowston, Megan Squire

Chapter 8. Teaching Students How (Not) to Lie, Manipulate, and Mislead with Information Visualization

The authors explore the intellectual and pedagogical implications of big data visualizations. Representing data visually implies simplifying and essentializing information. However, the selective nature of information visualization can lend itself to lies, manipulations, and misleading information. To avoid these pitfalls, data analysts should focus on and embrace specific principles and practices that aim to represent complete, contextualized, comparable, and scalable information in a way that reveals rather than isolates the viewer and the problem at hand from the problem space it reflects.
Athir Mahmud, Mél Hogan, Andrea Zeffiro, Libby Hemphill

Open Access

Chapter 9. Democratizing Data Science: The Community Data Science Workshops and Classes

Nearly every published discussion of data science education begins with a reflection on an acute shortage in labor markets of professional data scientists with the skills necessary to extract business value from burgeoning datasets created by online communities like Facebook, Twitter, and LinkedIn. This model of data science—professional data scientists mining online communities for the benefit of their employers—is only one possible vision for the future of the field. What if everybody learned the basic tools of data science? What if the users of online communities—instead of being ignored completely or relegated to the passive roles of data producers to be shaped and nudged—collected and analyzed data about themselves? What if, instead, they used data to understand themselves and communicate with each other? What if data science was treated not as a highly specialized set of skills but as a basic literacy in an increasingly data-driven world?
Benjamin Mako Hill, Dharma Dailey, Richard T. Guy, Ben Lewis, Mika Matsuzaki, Jonathan T. Morgan

