1 Introduction
The relevance of data as an organizational asset with intrinsic value is widely accepted, and organizations produce and store more data every year [25]. However, data only create value once they are used for operational excellence, product innovation, or improved business models, or are monetized in the data economy. Therefore, data must be transformed, enriched, and contextualized to create actionable information [18].
To realize the promise of data and analytics for competitive advantage, organizations steadily increase their investments in technology and people. Yet, improvements in data culture, data value creation, and innovation capability remain limited [3]. Enterprises continue to struggle in areas such as data acquisition, data enablement, and data compliance [15, 16, 21, 26].
(Meta)data management and data governance are essential means to address these challenges and to improve data usage and thus firm performance [7, 24]. To implement and support these activities, data catalogs (DCs) play an important role as they (a) empower users to work with data; (b) make data-related issues visible; (c) reduce data preparation time; and (d) promote compliant data handling and usage [1, 8, 28]. For a holistic metadata management approach, DCs need to be integrated into the existing enterprise data ecosystem [14]. This includes integration with upstream data sources, downstream analytics applications, and further data curation tools that form part of a metadata management landscape.
However, implementing DCs as part of a holistic metadata management landscape is currently challenging [19]. First, practitioners are faced with a vast array of commercial and open-source DC offerings that focus on different application areas with different goals. For example, not all DC applications support enterprise-wide metadata management [29]. As the spectrum of DC applications and their capabilities remains undefined, it is difficult for practitioners to select the right tools to build such a tool landscape [8]. Second, the successful technical integration of DCs depends on several factors, including automatic data source integration, DC federation, and data access provisioning [14]. Yet, a lack of clarity about these characteristics hampers the usage of DCs as fundamental components of metadata management landscapes.
To address the challenge of DC integration, the paper first develops a typology of DC applications in the enterprise context. Typologies reduce a plethora of entities to a smaller number of classes with key attributes. The typology is based on an extensive survey of DC offerings and is enriched by an analysis of the scientific state of the art. Second, the paper discusses relevant issues for integrating DCs into metadata management landscapes by analyzing in greater detail 51 DC offerings that foster enterprise-wide metadata management. The examined characteristics include (a) deployment types; (b) DC connectors and integrations; (c) DC federation; (d) means to provide data access; and (e) further potential modules of metadata management landscapes. The remainder of the paper is structured as follows: Sect. 2 introduces DCs and typologies. Sect. 3 describes the research methodology. In Sect. 4, the authors present the developed typology of DC applications, followed by an analysis of the current state of practice in DC integration in Sect. 5. Sect. 6 summarizes the main findings and presents further research directions.
3 Research Methodology
To develop a typology of DC applications, this paper adapts the methodology of Nickerson et al. [22] in three iterations. The first iteration consisted of an empirical analysis of DC offerings in the enterprise context (empirical to conceptual). DC offerings were identified based on a web search, a scan of analyst reports, and the authors’ experience in the DC context. Where a vendor had multiple offerings, only the most comprehensive one was included as a study subject. Information about each offering was gathered from the vendor website, the provided documentation, and tutorial videos. The second iteration consisted of an analysis of the state of the art in scientific DC literature (conceptual to empirical). The following search string was used to identify articles in the databases IEEE Xplore, AIS eLibrary, ACM Digital Library, and SpringerLink:
(“data catalog” OR “data catalogue” OR “metadata management solution” OR “metadata management tool”) AND (enterprise OR business)
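The Boolean structure of this search string can be illustrated with a minimal sketch that screens a title or abstract locally. This is not the matching logic of any of the databases named above; the function names, the whole-phrase matching, and the tolerance for a plural "s" are assumptions made for illustration.

```python
import re

# First group of the search string: at least one phrase must occur.
CATALOG_TERMS = [
    "data catalog",
    "data catalogue",
    "metadata management solution",
    "metadata management tool",
]
# Second group of the search string: at least one term must occur.
CONTEXT_TERMS = ["enterprise", "business"]


def _contains(text: str, term: str) -> bool:
    """Case-insensitive phrase match; an optional plural 's' is an assumption."""
    pattern = r"\b" + re.escape(term) + r"s?\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_search_string(text: str) -> bool:
    """True if the text satisfies (group 1) AND (group 2)."""
    has_catalog_term = any(_contains(text, t) for t in CATALOG_TERMS)
    has_context_term = any(_contains(text, t) for t in CONTEXT_TERMS)
    return has_catalog_term and has_context_term
```

A record passes only if it contains a term from each group, so a text that mentions a data catalogue without any enterprise or business context is excluded.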
In the third iteration, conceptual and empirical findings were juxtaposed and the remaining knowledge gaps were filled (empirical to conceptual). In the end, 73 DC offerings and 27 research papers were identified and analyzed. To answer the identified questions in the area of DC integration, 51 offerings fostering enterprise-wide metadata management were investigated in greater detail. To exclude outliers, the survey results report only those observations that occurred more than once in the surveyed population.
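The outlier rule, under which only observations occurring more than once are reported, can be sketched as a simple frequency filter. The function name and the input layout (a mapping from offering to its observed characteristics) are hypothetical, and any data passed in would be placeholders rather than survey results.

```python
from collections import Counter


def recurring_observations(per_offering):
    """Count in how many offerings each observation occurs and keep
    only those seen in more than one offering.

    per_offering: dict mapping an offering name to an iterable of
    observed characteristics (hypothetical input layout).
    """
    counts = Counter()
    for observed in per_offering.values():
        # set() ensures an observation is counted once per offering.
        counts.update(set(observed))
    return {obs: n for obs, n in counts.items() if n > 1}
```

For example, if two placeholder offerings both list "SaaS" as a deployment type and only one lists "desktop", the filter keeps "SaaS" with a count of 2 and drops "desktop".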
4 A Typology of Data Catalog Applications
Based on the research methodology described above, seven classes of DC applications were identified. These classes can be structured according to the following dimensions: (a) organizational area; (b) integration; (c) metadata management scope; (d) data management level; and (e) provider-consumer relationship, as depicted in Table 1. The following section first describes the dimensions and their characteristics, and then discusses the identified classes of DC applications in greater detail.
Table 1
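Typology of DC applications
DC application class | Organizational area | Integration | Metadata management scope | Data management level | Provider-consumer relationship |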
Enterprise Data Catalog | Intra-organizational | Stand-alone | Holistic | Metadata | Many-to-many |
Context-specific Data Catalog | Intra-organizational | Stand-alone | Specific | Metadata | Many-to-many |
Enterprise Data Management Platform | Intra-organizational | Module | Holistic | Data and Metadata | Many-to-many |
Enterprise Data Marketplace | Intra-organizational | Module | Holistic | Metadata | Many-to-many |
Data Spaces Data Catalog | Inter-organizational | Stand-alone | Holistic | Metadata | Many-to-many |
Data Portal | Inter-organizational | Module | Holistic | Data and Metadata | One-to-many |
Ecosystem Data Marketplace | Inter-organizational | Module | Specific | Both options possible | Many-to-many |
Organizational area describes the subcontext of DC application. DCs can be applied for data curation within (intra-organizational) or across (inter-organizational) enterprises. In an intra-organizational setting, actors are usually represented by business users, whereas in an inter-organizational setting actors consist of organizations or principals acting on their behalf. Further, inter-organizational settings usually require stricter data protection regimes.
Integration refers to the delivery of DC functionality to the respective environment. DCs can be implemented either as a stand-alone solution or as a module of a wider solution offering. For example, a DC can be seen as a modular part of a data marketplace [10]. The scope dimension describes the extent to which metadata management and data governance are supported by a DC application. Specific refers to the support of a specific environment (e.g., a cloud platform) or data application (e.g., business intelligence) by providing specifically fitted capabilities. Holistic refers to support for metadata management across all types, sources, and potential data applications in the organizational area. DC applications can be further divided into those that primarily curate metadata and those that can also manage or deliver the actual data. This is characterized by the data management level dimension. Lastly, provider-consumer relationship refers to the number of entities interacting with each other using the DC application as a platform. Most applications foster many-to-many relationships of providers and consumers, while data portals are typically deployed by a single data provider to address the data needs of multiple consumers.
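The five dimensions and the seven classes of Table 1 can also be written down as a small lookup structure, which makes it easy to shortlist classes by their characteristics. The sketch below is only an illustrative encoding: the values are copied from Table 1, while all type, field, and function names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Area(Enum):
    INTRA = "Intra-organizational"
    INTER = "Inter-organizational"


class Integration(Enum):
    STAND_ALONE = "Stand-alone"
    MODULE = "Module"


class Scope(Enum):
    HOLISTIC = "Holistic"
    SPECIFIC = "Specific"


class Level(Enum):
    METADATA = "Metadata"
    DATA_AND_METADATA = "Data and Metadata"
    EITHER = "Both options possible"


class Relationship(Enum):
    MANY_TO_MANY = "Many-to-many"
    ONE_TO_MANY = "One-to-many"


@dataclass(frozen=True)
class DCClass:
    name: str
    area: Area
    integration: Integration
    scope: Scope
    level: Level
    relationship: Relationship


# Rows transcribed from Table 1.
TYPOLOGY = [
    DCClass("Enterprise Data Catalog", Area.INTRA, Integration.STAND_ALONE,
            Scope.HOLISTIC, Level.METADATA, Relationship.MANY_TO_MANY),
    DCClass("Context-specific Data Catalog", Area.INTRA, Integration.STAND_ALONE,
            Scope.SPECIFIC, Level.METADATA, Relationship.MANY_TO_MANY),
    DCClass("Enterprise Data Management Platform", Area.INTRA, Integration.MODULE,
            Scope.HOLISTIC, Level.DATA_AND_METADATA, Relationship.MANY_TO_MANY),
    DCClass("Enterprise Data Marketplace", Area.INTRA, Integration.MODULE,
            Scope.HOLISTIC, Level.METADATA, Relationship.MANY_TO_MANY),
    DCClass("Data Spaces Data Catalog", Area.INTER, Integration.STAND_ALONE,
            Scope.HOLISTIC, Level.METADATA, Relationship.MANY_TO_MANY),
    DCClass("Data Portal", Area.INTER, Integration.MODULE,
            Scope.HOLISTIC, Level.DATA_AND_METADATA, Relationship.ONE_TO_MANY),
    DCClass("Ecosystem Data Marketplace", Area.INTER, Integration.MODULE,
            Scope.SPECIFIC, Level.EITHER, Relationship.MANY_TO_MANY),
]


def classes_where(**criteria):
    """Return the names of classes whose attributes equal all given criteria."""
    return [c.name for c in TYPOLOGY
            if all(getattr(c, key) == value for key, value in criteria.items())]
```

For instance, `classes_where(relationship=Relationship.ONE_TO_MANY)` returns only the Data Portal class, which mirrors the observation that data portals are the one class typically operated by a single provider.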
The first DC application class identified in the intra-organizational context is the Enterprise Data Catalog. Enterprise Data Catalogs provide data cataloguing capabilities for all data-related roles in an organization and across departments or business units, enabling enterprise-wide data curation [19]. To this end, many data providers register the metadata of data assets from diverse systems, which can be leveraged by data consumers for different data applications. Enterprise Data Catalogs can be deployed as stand-alone solutions without the need to integrate with further data management tools.
In the Context-specific Data Catalog class, DCs serve only a specific environment or a specific data application. Examples of DCs that primarily serve a specific environment include AWS Glue and Cloudera Navigator. Both are limited to automated metadata ingestion from their respective cloud platform resources and focus on processes such as orchestration and ETL. The survey also reveals DCs that provide data discovery capabilities only for a specific use case, such as data analytics (e.g., Tableau Catalog) or data privacy (e.g., Immuta Data Security Platform). While all of these offerings allow actors to leverage DC capabilities in a familiar environment, federation and interoperability are needed to avoid duplication of effort and the creation of data silos.
The class of Enterprise Data Marketplaces was identified during the literature review phase. Researchers see the main function of Enterprise Data Marketplaces in providing brokerage features for data or data services [10, 14]. To provide these capabilities, the same researchers design Enterprise Data Marketplaces on top of Enterprise Data Catalogs. However, this needs to be reconciled with the findings of Labadie et al. [19], who see brokerage functions such as data access requests as part of Enterprise Data Catalogs. Based on the analysis of real-world DC offerings, the authors of this paper argue that Enterprise Data Marketplaces are modular solutions that include an Enterprise Data Catalog module and an additional brokerage or marketplace component, which allows for the description and purchase of data products. Conversely, Enterprise Data Catalogs may support similar functionalities in a single module. Yet, this view is not represented in the overview of examined DC applications, as no explicit commercial or open-source Enterprise Data Marketplace offering could be identified. However, an Enterprise Data Marketplace may be provided by implementing the Enterprise Data Catalog and brokerage modules of Data Management Platform offerings.
Enterprise Data Management Platforms (EDMPs) support the management, storage, and distribution of data assets in the enterprise [4]. They are not tied to a specific context or use case. While the specific composition of EDMPs is up to each implementation, they typically consist of several modules, including Enterprise Data Catalogs or Enterprise Data Marketplaces, to provide a listing of available data [14]. Data quality, data integration, or data privacy can also be modular components of EDMPs. EDMPs are deployed as an overarching layer that is agnostic to underlying databases, data lakes, or data warehouses. While they do access actual data for processes such as data integration, data quality, or data privacy assessments, they do not persist or replicate these data.
In the inter-organizational sphere, Data Space Data Catalogs enable the metadata-based inventory and discovery of data products to be shared between organizations in data spaces. In addition to the functional semantic description of available data sources, they allow for the definition and assessment of accessibility information and usage conditions for other organizations [6]. They do not hold the data itself, as organizations want to preserve their data sovereignty and therefore refrain from transferring data to central platforms before the actual exchange. In general, Data Space Data Catalogs are agnostic to the environments and use cases fostered by the data exchange.
Data Portals are leveraged to enable the reuse of data for societal or economic benefit. They are set up by the data-providing entity and allow multiple stakeholders, including natural persons and enterprises, to discover and access data. Data can be directly accessed or downloaded from the Data Portal. While most Data Portals are based on CKAN, different modules are implemented as an added benefit (e.g., data visualization) [23].
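Because most Data Portals are based on CKAN, discovery and access can often be scripted against CKAN's Action API. The sketch below builds a `package_search` request URL and extracts download links from a response body. The portal base URL is a placeholder and the helper names are hypothetical. No network call is made, and individual portals may restrict or extend the API.

```python
import json
from urllib.parse import urlencode


def package_search_url(portal_base_url, query, rows=10):
    """Build a CKAN Action API (v3) dataset search URL."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_base_url.rstrip('/')}/api/3/action/package_search?{params}"


def download_links(response_text):
    """Map each dataset title to the URLs of its resources.

    Expects the JSON envelope returned by package_search:
    {"success": ..., "result": {"count": ..., "results": [...]}}.
    """
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("portal reported an unsuccessful request")
    links = {}
    for dataset in payload["result"]["results"]:
        links[dataset["title"]] = [r["url"] for r in dataset.get("resources", [])]
    return links
```

With the placeholder base URL `https://data.example.org`, `package_search_url("https://data.example.org", "air quality", rows=5)` yields a URL ending in `/api/3/action/package_search?q=air+quality&rows=5`; the response to such a request can then be passed to `download_links`.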
Lastly, Ecosystem Data Marketplaces match organizational data sellers and buyers and manage data exchanges and transactions [2]. In this sense, they can act as a trustee and manage access to data according to rules defined by the data seller. Metadata management, and thus the use of DC components, is seen as an essential module of Ecosystem Data Marketplaces. However, to fulfill the goal of data monetization, additional modules such as billing and invoicing are required. A comprehensive overview of Ecosystem Data Marketplaces is provided by Azcoitia et al. [2].
6 Conclusion
Despite ongoing high investments in data technologies and human resources, the promise of data-driven enterprises has yet to be realized. As a step toward improving data value creation and ultimately supporting data-driven businesses, DCs are being integrated with other metadata management tools into metadata management landscapes that support holistic metadata management and data governance across the enterprise. However, implementing DCs as part of such a metadata management landscape is challenging due to the variety of DC application classes and a general lack of understanding of DC integration.
To mitigate these challenges, this paper first develops a typology of DC applications in the enterprise context. Seven classes of DC applications were identified and structured along five dimensions. The typology helps practitioners to focus on the right DC application classes when building enterprise metadata management landscapes. It further supports future research by resolving the conceptual ambiguity around different classes of DC applications and their relationships. Additionally, important concerns for creating comprehensive metadata management landscapes were analyzed through a survey of 51 Enterprise Data Catalog and EDMP applications.
The study reveals several open challenges for research and practice. First, Enterprise Data Marketplaces seem to be a promising DC application class, as they enable the description and provisioning of data in a more consumer-centric way. However, further conceptual and practical research is needed to clarify and demonstrate their capabilities and added value. Second, the current state of data access provisioning is unsatisfactory, as it requires substantial manual effort on the data provider side. The automatic provisioning of data access based on pre-defined access conditions across all enterprise data sources therefore seems to be a promising research direction. Finally, more attention should be directed towards the development and implementation of methods and standards for the federation of metadata management tools as organizations move towards greater decentralization in data management.