
Digital Investigation

Volume 11, Issue 4, December 2014, Pages 295-313

Distributed filesystem forensics: XtreemFS as a case study

https://doi.org/10.1016/j.diin.2014.08.002

Abstract

Distributed filesystems provide a cost-effective means of storing high-volume, velocity and variety information in cloud computing, big data and other contemporary systems. These technologies have the potential to be exploited for illegal purposes, which highlights the need for digital forensic investigations. However, there have been few papers published in the area of distributed filesystem forensics. In this paper, we aim to address this gap in knowledge. Using our previously published cloud forensic framework as the underlying basis, we conduct an in-depth forensic experiment on XtreemFS, a Contrail EU-funded project, as a case study for distributed filesystem forensics. We discuss the technical and process issues regarding collection of evidential data from distributed filesystems, particularly when used in cloud computing environments. A number of digital forensic artefacts are also discussed. We then propose a process for the collection of evidential data from distributed filesystems.

Introduction

In recent years, the amount of data captured, stored and disseminated in electronic-only form has increased exponentially (see Beath et al., 2012) and, unsurprisingly, big data has consistently been ranked as one of the top ten technology trends (see Casonato et al., 2013, Chua, 2013), including by the United States National Intelligence Council (2012). A Gartner report, for example, forecast that “big data will generate [US]$232 billion in revenue cumulatively from 2011 to 2016” (Casonato et al., 2013, p. 4). A widely accepted definition of big data is Gartner's, which describes it as “high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” (Beyer and Laney, 2012, p. 2).

There are, however, large technology overheads and significant costs associated with the processing, storage and dissemination of big data. Businesses and governments will continue to be under pressure to deliver more with less, especially in today's economic landscape. Business and government users have recognised the advantages of cloud computing for processing and storing big data. For example, a report by the Info-communications Development Authority of Singapore (2012, p. 6) pointed out that “[e]arly adopters of Big Data on the cloud would be users deploying Hadoop clusters on the highly scalable and elastic environments provided by Infrastructure-as-a-Service (IaaS) providers such as Amazon Web Services and Rackspace, for test and development, and analysis of existing datasets”. This is unsurprising as cloud computing offers users the capacity, scalability, resilience, efficiency and availability required to work with high-volume, velocity and variety information.

Cloud computing (like other networked cyber-infrastructure) is subject to criminal exploitation (Choo, 2010, Chonka and Abawajy, 2012, Patel et al., 2013). In a digital investigation, one would need to gather evidence of an incident or crime that has involved electronic devices (e.g. computer systems and their associated networks) – a process known as digital forensics. The latter is increasingly being used in the courts in Australia and overseas. Many conventional forensic tools have focused upon having physical access to the media that stores the data of potential interest. However, in a cloud computing environment it is often not possible or feasible to access the physical media that stores the user's data (Martini and Choo, 2012). Distributed filesystems can support cloud computing environments by providing data fragmentation and distribution, potentially across the globe and within numerous datacentres. This presents significant technical and jurisdictional challenges in the identification and seizure of evidential data by law enforcement and national security agencies in criminal investigations (Hooper et al., 2013) as well as by businesses in civil litigation matters.

A number of researchers and practitioners have emphasised the need for cloud-computing-specific digital forensics guidelines (Birk and Wegener, 2011, National Institute of Standards and Technology, 2011, Zatyko and Bay, 2012), and we believe this need extends to the underlying infrastructure that supports cloud computing. While a number of published papers have provided a sound grounding for the research required in cloud forensics by highlighting the issues for digital forensic researchers and practitioners (Birk and Wegener, 2011, Martini and Choo, 2012), there are relatively few technical papers discussing the forensic collection of evidential data from cloud servers or underlying supporting infrastructure such as distributed filesystems.

Much of the existing literature has focused on the Software as a Service (SaaS) component of cloud computing (e.g. Dropbox, Skydrive and Google Drive) (Marty, 2011, Chung et al., 2012, Dykstra and Sherman, 2013, Hale, 2013, Martini and Choo, 2013, Federici, 2014, Quick et al., 2014) rather than the Infrastructure as a Service (IaaS) implementations that support these services. Researchers such as Dykstra and Riehl (2013) and Hay et al. (2011) identified the various legal and technical challenges in conducting forensic investigations of cloud IaaS, but there has been no in-depth forensic investigation of a distributed filesystem used in, or analogous to those used in, public or private cloud installations. One reason this may not have been thoroughly explored is the difficulty of accessing (for research purposes) a public cloud computing environment of significant scale that makes use of a distributed storage environment.

In recent years, a number of researchers have examined distributed filesystems and the inherent issues these complex systems raise for forensic investigations. Cho et al. (2012) conducted a preliminary study of Hadoop's distributed filesystem (HDFS). Hegarty et al. (2011) discuss a distributed signature detection technique for detecting the file signatures of illicit files in distributed filesystems. They note that existing signature detection techniques are unlikely to perform well in a distributed filesystem environment storing a significant quantity of data. Almulla et al. (2013, p. 3) discuss a range of cloud forensic issues, including the underlying role of ‘distributed computing’ and, in turn, distributed filesystems. The authors note the significant impact distributed filesystems have on forensics in terms of the requirement for a practitioner to ‘rebuild files from a range of filesystems’.

In this paper, we use XtreemFS as a case study to provide a better understanding of both the technical and process issues regarding collection of evidential data from distributed filesystems that are commonly used in cloud computing environments. XtreemFS, currently funded by the Contrail EU project (Contrail, 2013), is an open-source, general-purpose, fault-tolerant distributed and replicated filesystem that can be deployed in cloud and grid infrastructures to support big data initiatives (XtreemFS, 2013). To provide fault-tolerant file replication, stored file data is generally split and replicated across multiple storage servers. In a cloud deployment, the data is also likely to be extensively distributed at the physical level within datacentres.
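
The forensic significance of this striping and replication is that no single storage server necessarily holds a complete copy of a file. The following Python sketch is our own illustration, not XtreemFS code: the object size, server names and round-robin placement policy are assumptions chosen only to show how one logical file might be fragmented across several object storage devices (OSDs), and therefore why imaging a single server would recover only part of the file's content.

# Minimal sketch (assumptions, not XtreemFS internals): striping a file into
# fixed-size objects and replicating each object across two sets of OSDs.
OBJECT_SIZE = 128 * 1024                              # assumed object/stripe size in bytes
REPLICA_SETS = [["osd-1", "osd-2", "osd-3"],          # hypothetical OSD names
                ["osd-4", "osd-5", "osd-6"]]

def stripe_file(data: bytes) -> list:
    """Split a file into objects and record which OSDs would hold each object."""
    placements = []
    for offset in range(0, len(data), OBJECT_SIZE):
        obj_no = offset // OBJECT_SIZE
        placements.append({
            "object": obj_no,
            "bytes": len(data[offset:offset + OBJECT_SIZE]),
            # each replica set stores the object on one of its OSDs (round-robin)
            "stored_on": [osds[obj_no % len(osds)] for osds in REPLICA_SETS],
        })
    return placements

if __name__ == "__main__":
    # A 300 KiB file becomes three objects, each held by two different OSDs;
    # no single OSD holds the whole file, and none holds the file metadata.
    for placement in stripe_file(b"\x00" * 300 * 1024):
        print(placement)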

We chose to focus on a single distributed filesystem as this allows us to conduct an in-depth analysis of the client and, particularly, the servers to fully understand the potential evidential data that can be collected as part of a forensic investigation. XtreemFS (like most underlying infrastructure) does not receive substantial attention in mainstream technical media. However, it has received significant attention in the academic community, with many researchers choosing to analyse it or implement it as the underlying infrastructure in larger projects. In the literature, XtreemFS is most commonly implemented in cloud computing or grid computing (commonly understood to be one of the predecessors of cloud). For example, Kielmann et al. (2010) describe the role of XtreemFS in supporting XtreemOS and its suitability for integration with IaaS services. Pierre and Stratan (2012) integrate XtreemFS into their proposed ‘ConPaaS’ system, which, as the name suggests, is a Platform as a Service cloud environment. Enke et al. (2012) also implement XtreemFS (including a number of its advanced features) for the purpose of managing cloud data replication in their work analysing distributed big datasets in Astronomy and Astrophysics. Krüger et al. (2014) note that XtreemFS has also been used to provide distributed data management in the MoSGrid science gateway (Molecular Simulation Grid), an EU-funded project. Kleineweber et al. (2014) selected XtreemFS as the underlying filesystem into which they integrated (as an extension) their reservation scheduler for object-based filesystems to handle storage QoS in cloud environments. In addition to implementing or extending XtreemFS, other researchers such as Dukaric and Juric (2013) and Petcu et al. (2013) have also noted its use when discussing filesystems generally in the cloud environment. This body of research demonstrates the contemporary applicability of XtreemFS in the cloud environment and, as such, makes it an appropriate choice as a case study for forensic investigation in this paper.

The digital forensics framework used in this paper is based on our previously published work (Martini and Choo, 2012), which we have validated using ownCloud (Martini and Choo, 2013, Quick et al., 2014). The framework is based upon the stages outlined by McKemmish (1999) and the National Institute of Standards and Technology (Kent et al., 2006) but differs in a number of significant ways. The iterative nature of this framework is integral to a successful investigation in a complex client/server environment such as that presented by XtreemFS. The client can be used to identify the existence of cloud services and to collect any data stored by the client. Hence, forensic analysis of the client is generally carried out before analysis of the server environment. The following four stages outline the high-level process and order that a forensic practitioner should follow when conducting forensic investigations in the cloud computing environment (a brief illustrative sketch of this iterative process follows the list).

  • 1.

    Evidence Source Identification and Preservation: This phase is concerned with identifying sources of evidence in a digital forensics investigation. During the first iteration, sources of evidence will generally be identified via a physical device (e.g. a desktop computer, laptop or mobile device) in the possession of the suspect. However, in the case of a distributed filesystem used in cloud computing, the filesystem client may only exist on the cloud server nodes. This does not prevent the client from being the first point of identification, and it may lead to other components of the filesystem. During the second iteration, this phase is concerned with identifying other components of the environment or cloud which may be relevant to the case, possible evidence stored by the filesystem custodian (e.g. the system administrator), and processes for preservation of this potential evidence. Preservation is integral to the integrity of forensic investigations and, as such, proper preservation techniques must be maintained regardless of the evidence source.

  • 2.

    Collection: This phase is concerned with the actual capture of the data. There are various methods of evidential data collection suited to the various cloud computing platforms and deployment models. While IaaS may result in the collection of virtual disks and memory, and SaaS may result in an export from the relevant cloud software, the collection of distributed filesystems supporting cloud computing installations may be considerably more involved. Another consideration for distributed filesystems is the likelihood of remote hosting. If the filesystem is hosted outside the jurisdiction of the investigating law enforcement agency (LEA), the agency should use the appropriate legal instrument to lawfully gain remote access to the filesystem.

  • 3.

    Examination and Analysis: This phase is concerned with the examination and analysis of the forensic data collected. Examination and analysis are key components of a forensic investigation dealing with distributed filesystems: examination will be integral to gaining a complete understanding of the operating components of the filesystem, and analysis will be integral to reconstruction.

  • 4.

    Reporting and Presentation: This phase is concerned with the legal presentation of the evidence collected. It remains very similar to the corresponding stages in the frameworks of McKemmish and NIST (Martini and Choo, 2012). In general, the report should include information on all processes followed, the tools and applications used, and any limitations, to prevent false conclusions from being reached (see US NIJ, 2004).
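
As a concrete illustration of the iterative, client-first structure of these four stages, the sketch below models the framework as a simple loop in Python. Only the phase names are taken from the framework above; the data structures, helper functions and the two-iteration example (client device first, then the server environment) are our own assumptions and not part of the published framework.

# Minimal sketch of the iterative four-stage framework (phase names from the
# framework above; everything else is an illustrative assumption).
from dataclasses import dataclass, field

PHASES = (
    "Evidence Source Identification and Preservation",
    "Collection",
    "Examination and Analysis",
    "Reporting and Presentation",
)

@dataclass
class Iteration:
    """One pass through phases 1-3, e.g. client-side first, then server-side."""
    scope: str                                    # e.g. "client device"
    sources: list = field(default_factory=list)   # evidence sources identified
    findings: list = field(default_factory=list)  # results of examination/analysis

def run_investigation(initial_scope, identify, collect, analyse):
    """Iterate phases 1-3 until analysis reveals no further scope; phase 4
    (reporting) would then cover all iterations."""
    iterations, scope = [], initial_scope
    while scope is not None:
        it = Iteration(scope)
        it.sources = identify(scope)            # phase 1: identify and preserve
        evidence = collect(it.sources)          # phase 2: collect
        it.findings, scope = analyse(evidence)  # phase 3: analysis may expose a new scope
        iterations.append(it)
    return iterations

if __name__ == "__main__":
    # Hypothetical two-iteration run: the client points the practitioner at the
    # server environment (e.g. the DIR, MRC and OSD components discussed later).
    def identify(scope):
        return ["suspect laptop"] if scope == "client device" else ["DIR", "MRC", "OSDs"]

    def collect(sources):
        return {s: "forensic image of " + s for s in sources}

    def analyse(evidence):
        if "suspect laptop" in evidence:
            return ["distributed filesystem client configuration found"], "server environment"
        return ["file content and metadata reconstructed"], None

    for number, it in enumerate(run_investigation("client device", identify, collect, analyse), 1):
        print(number, it.scope, it.sources, it.findings)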

We regard the contributions of this paper to be three-fold:

  • 1.

    Provide technical insights on forensic analysis of the XtreemFS underlying infrastructure and IaaS instances;

  • 2.

    Propose processes for the collection of electronic evidence from XtreemFS (see Fig. 1 in Summary section) and distributed filesystems used in the cloud computing environment based on the technical findings from the previous contribution (see Fig. 2 in Data storage section); and finally

  • 3.

    Validate our published cloud forensic framework (Martini and Choo, 2012).

In the next section, we provide an overview of the XtreemFS architecture and discuss the role of the various components. The findings are discussed in the context of these components. We follow the first three stages of the cloud forensic framework (see Cloud forensics framework section) to outline the recommended process for collection of forensic artefacts and potential evidence. The Collecting evidence from a distributed filesystem – a process section presents our proposed high level process for collection of electronic evidence from distributed filesystems used in cloud environments. The last section concludes this paper.


XtreemFS architecture overview

XtreemFS is a virtual network-provisioned filesystem that delivers backend storage services for a cloud service provider, providing key features such as replication and striping. It is one example of a number of products available with similar feature sets (other examples include GlusterFS (Gluster, 2014), BeeGFS (Fraunhofer, 2014) and Ceph (Ceph, 2014)). It is important to make the distinction between backend and frontend storage systems in the cloud computing environment as both

Findings

This section will discuss the three main architectural components of the XtreemFS system and the client in the context of the relevant phases of the cloud forensics framework with a view to understanding the filesystem and data of forensic interest available from the XtreemFS system, ultimately producing a list of high-level artefacts that should be investigated in all distributed filesystem forensic investigations. As XtreemFS is an advanced and complex environment, it is beyond the scope of a

Collecting evidence from a distributed filesystem – a process

This research demonstrates that a process must be followed to ensure the collection of data and metadata to the furthest possible extent from a distributed filesystem environment. If a practitioner followed existing practice and attempted to acquire a bitstream image of the storage devices (in this case the OSDs), it is clear that a large amount of metadata (available at the MRC) would be missed. Metadata stored by the DIR may also be integral as part of evidence collection or environment

Conclusion

With the increasing digitalisation of data and use of services such as cloud computing to process, store and disseminate big data, there will be more opportunities for exploitation of large datasets (e.g. in corporate or state-sponsored espionage) and consequently, the continued development of the digital forensic discipline is more important than ever. An effective investigative process is one that follows well-researched and documented processes, which allow digital forensic practitioners to

Acknowledgements

The first author is supported by both the University of South Australia and the Defence Systems Innovation Centre. The views and opinions expressed in this article are those of the authors alone and not the organisations with whom the authors are or have been associated/supported.

References (48)

  • S. Almulla et al. Cloud forensics: a research perspective.

  • BabuDB. Usage example Java – babudb – BabuDB usage in Java – an embedded non-relational database for Java and C++ – Google Project Hosting.

  • C. Beath et al. Finding value in the information explosion. MIT Sloan Manag Rev (2012).

  • M.A. Beyer et al. The importance of ‘big data’: a definition (2012).

  • D. Birk et al. Technical issues of forensic investigations in cloud computing environments.

  • A. Butler et al. IT standards and guides do not adequately prepare IT practitioners to appear as expert witnesses: an Australian perspective. Secur J (2013).

  • R. Casonato et al. Top 10 technology trends impacting information infrastructure, 2013 (2013).

  • Ceph. Home – Ceph (2014).

  • C. Cho et al. Cyber forensic for hadoop based cloud system. Int J Secur Its Appl (2012).

  • A. Chonka et al. Detecting and mitigating hx-dos attacks against cloud web services.

  • K.-K.R. Choo. Cloud computing: challenges and future directions. Trends Issues Crime Crim Justice (2010).

  • F. Chua. Digital Darwinism: thriving in the face of technology change.

  • Contrail. Technology – contrail-project.

  • J. Dykstra et al. Forensic collection of electronic evidence from infrastructure-as-a-service cloud computing. Richmond J Law Technol (2013).