In this part, we introduce DaLiF, our proposed data lifecycle framework for GBDE. We thoroughly investigated existing data lifecycles to learn from prior work, specifically to identify their commonalities as well as the distinct characteristics each model introduces, so that both could be taken into consideration. We also explain how we discovered and defined the phases of DaLiF.
We adopted a comprehensive, five-step approach to propose DaLiF. In step 1, we exclude data lifecycles on the basis of logical rules. In step 2, we discover and enlist the phases of the 76 identified data lifecycles; the selection criteria for these 76 lifecycles were described in the preceding section. After a detailed analysis based on further logical rules, we arrived at fourteen distinct phases. In step 3, we group the fourteen phases into mandatory and optional ones. In step 4, we apply our analysis to validate this categorization and conclude with six mandatory phases. In step 5, we map the phases to DM-BOK functions. The sub-sections below describe the outcome of this approach:
Description of DaLiF phases and their critical functions
In this part, we present the description of the fourteen phases of DaLiF along with their functions. A function denotes an essential data-related activity to be performed by an entity within a phase. We also incorporate the principles of DM-BOK into the respective phases to align our work with that data management standard.
Planning phase The planning phase covers activities to be performed over the medium and long term during the data lifecycle [6, 53]. It consists of formulating a project (e.g., a research or business project) to achieve the PAs' desired goals. Through this phase, it is possible to establish the overall objective of data management; which policies and procedures will be required to treat the data (e.g., procedures to collect or generate government data); which data types, sources, and methods are needed to analyze the government data; how and where government data is to be stored; when such data will be archived or destroyed; and how it will be safely accessed by authorized users [46, 48, 49, 55, 149, 158]. The planning phase can help PAs, including public sector scientific researchers, to save time, foster effective governance, and meet their data management planning needs in GBDE [6, 53]. The output of this phase is a holistic data management plan [48].
Planning phase key functions The planning phase of DaLiF includes the following key functions:
- Plan for all required resources, including finance and personnel, metadata contents and formats, data storage, data security, and the expected outcomes of each phase of the big data lifecycle [15, 53, 55].
- Identify the required individuals, describe the skills each of them needs to acquire, define roles, and assign roles and responsibilities to these individuals and to other public sector stakeholders [53].
- Define a data management plan, a living document that covers numerous public data aspects like data lifetime and the approaches for data quality, data security, and data archiving [6, 15, 53]; a sketch of such a plan follows this list.
- Provide a detailed description of the data that will be compiled, by whom, and how the data will be managed, made accessible, shared, and reused throughout the lifecycle [6, 15, 53, 62].
- Develop an appropriate plan to select modernized and extensible tools for the data phases, including public data collection [15, 58].
- Plan to prioritize public data that have a higher likelihood of being used and published on the web [93].
- Fully engage people expert in the handling of data, records, and content in planning [70, 71].
- Plan activities that apply quality management techniques to measure, assess, improve, and ensure the fitness of data for use [70].
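To make the data management plan named in the list above concrete, the following minimal sketch models a DMP as a structured record. The field names and example values are our own illustrative assumptions, not elements prescribed by DaLiF or DM-BOK; keeping the plan as data rather than free text makes it easy to version and to check later phases against it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataManagementPlan:
    """A minimal, illustrative data management plan ('living document')."""
    project: str
    data_lifetime_years: int          # planned retention before archiving/destruction
    quality_approach: str             # e.g., which quality dimensions are checked
    security_controls: List[str] = field(default_factory=list)
    archive_policy: str = "archive when no longer in active use"
    roles: Dict[str, str] = field(default_factory=dict)  # role -> responsible person

plan = DataManagementPlan(
    project="Census analytics (hypothetical)",
    data_lifetime_years=10,
    quality_approach="accuracy, completeness, and timeliness checks",
    security_controls=["encryption at rest", "role-based access control"],
    roles={"data steward": "J. Doe"},
)
print(plan)  # the plan would be revised as the project evolves
```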
Collection phase The collection phase consists of a set of activities through which data is gathered from different internal and external sources and in different formats, i.e., structured, unstructured, and semi-structured forms [17, 48, 53, 55, 159, 160]. Big data would be worthless if it could not be collected into consistent information [48]. Big data is created or collected from various data sources like social networks, the Internet of Things, surveys, censuses, voting, historical maps, seismology motion sensor outputs, biological records, satellite observations, and commerce statistics [55, 61, 84, 161]. The collection phase marks the moment when new data or metadata is created in the system [49, 57, 159]. In the common Extract, Transform, Load (ETL) procedure, "Extract" is close to "collect", and this procedure plays a vital role in data collection [162]. In the public sector, during the data collection phase, PAs should consider the once-only principle: collect data from citizens and businesses once and reuse it instead of recollecting it [58].
Collection phase key functions The data collection phase of DaLiF includes the following essential functions:
- Collect raw data from any source in order to handle the big data 'Variety' challenge, and ensure endpoint input validation to avoid data security issues [23, 42-44, 57, 58, 91, 132]; a validation sketch follows this list.
- Implement a strategic plan to select modernized, extensible tools for public data collection platforms [58].
- Introduce a data protection awareness program while collecting data [23, 70, 132].
- Collect metadata about the information, based on metadata standards, to ensure interoperability across an organization and to support future courses of action [53, 57, 70].
- Adhere to the once-only principle in public sector organizations, collecting data only once from citizens and the business community [59].
- Manage the ranges of valid and trusted data sources for data collection [42-44, 54].
- Manage massive amounts of data in any format to handle the big data 'Volume' challenge, and search for and discover new sources for data collection [42-44, 54].
- Dedicate specific resources to manage the big data 'Velocity' challenge, which refers to the rate at which data streams are generated and the consequent ability to process them well [42-44].
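As one possible realization of the endpoint input validation mentioned in the first function above, the sketch below checks each incoming record against a simple schema before it enters the collection pipeline. The field names and rules are illustrative assumptions, not part of DaLiF.

```python
from datetime import datetime

def _is_iso_date(value):
    """True if the value parses as an ISO-8601 timestamp."""
    try:
        datetime.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False

# Illustrative schema: field name -> validator function
SCHEMA = {
    "citizen_id": lambda v: isinstance(v, str) and v.isalnum(),
    "submitted_at": _is_iso_date,
    "payload": lambda v: isinstance(v, dict),
}

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is accepted."""
    errors = [f"missing field: {f}" for f in SCHEMA if f not in record]
    errors += [f"invalid value for: {f}" for f, ok in SCHEMA.items()
               if f in record and not ok(record[f])]
    return errors

record = {"citizen_id": "AB123", "submitted_at": "2021-05-01T10:00:00", "payload": {}}
print(validate_record(record))  # [] -> record passes endpoint validation
```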
Preparation The preparation phase covers data integration, filtering, and enrichment [91, 127, 135]. Integration consolidates data silos into a single place with a coherent and homogeneous structure [26, 46, 155]. Filtering focuses on purifying data by filtering out noisy and erroneous data [17, 61]. Data enrichment refers to the process that appends to, or otherwise enhances, collected raw data with relevant context obtained from additional sources [10, 63, 143].
Integration enables users to pose their queries easily and obtain responses from a single data source [28, 155]. It can be considered a database subarea that provides uniform access to various data sources [163]. Data Lakes (DLs), conceptualized as big data repositories, store raw data and provide functionality for on-demand data integration with the help of metadata descriptions as and when required [162]; in DLs, therefore, data integration does not take place immediately after data collection. In the common Extract, Transform, Load (ETL) procedure, "Transform" is close to "integrate", and this procedure plays a critical role in data integration. Integration is based on a set of rules and policies [17]. Data integration is, in a sense, an interim step towards achieving a single source of data [17, 28, 164]. We noticed in the literature that data integration carries a considerable cost, owing to integration across multiple domains, various formats, different vocabularies, metadata of varying quality, and political boundaries [28].
Filtering also allows data to be classified into different formats, like structured and unstructured [61]. The filtered data is further processed in the succeeding phases. After due process, policymakers use filtered data to make better decisions within a limited time and with fewer resources [58, 152]. The software development team implements data filters to extract the required data from the vast amount of collected data [62]. The output of data filtering is a set of categorized, purified, anonymized, less noisy, and less error-prone datasets [17, 42, 61, 121].
In enrichment, the normalized, enriched, and simplified data is used in data analysis and mining to generate new information [130, 135]. Enrichment activities are performed on the integrated mass of data to limit the selection of data according to certain criteria [28, 61, 62, 90]. The outcome of data enrichment is a set of datasets that are refined and mature compared to the original raw data, and that can be utilized either for further analysis or for archiving for future inquiries over historical data [42].
Preparation phase key functions The preparation phase of DaLiF includes the following key functions:
- Create a homogeneous set of data by consolidating the data gathered from numerous data sources [6, 60].
- Implement a plan to select modernized and extensible tools for public data integration platforms [17, 58] and tools for data filtering [58, 62].
- Ensure scalability for big data that is high in volume, veracity, and velocity and comes from a diversity of sources [155].
- Process big data with additional measures to achieve a public administration's short-term integration goals [164].
- Carry out activities like forming relations among variables of different data sources, adapting units, translating, and building a single database holding all the acquired data, so that government data is traceable and easier to access for future use [46].
- Consider data privacy protection constraints to avoid revealing private information, like citizens' personal data and government classified information, in the integrated data [46].
- Identify noise and errors in the collected data and process this public data to remove such issues [61].
- Involve reliable and authorized human resources in this phase to avoid the leakage of sensitive government data [23, 121].
- Define the filtering criteria to be used by PAs and researchers to filter public data, including research data, according to their needs [61, 62].
- Verify the reliability of the GBD sources, as well as of one's own data, to manage any data inconsistencies [40, 42].
- Carry out certain fundamental data transformations to optimize the volume of data flowing from the data collection to the quality phases [42].
- Prepare data from additional internal and/or external sources to be merged with the existing public data for data valuation [10, 70, 143].
- Extend the existing information by completing missing or incomplete data [61, 63, 130].
- Carefully process data to eliminate unnecessary, misleading, unreliable, and duplicate information [42, 130].
- Establish an effective data integration architecture that controls replication and the flow of data to ensure data quality and consistency, especially for reference and master data [70].
- Describe source-to-target mappings and data transformation designs for extract-transform-load (ETL) programs and other technology for continuous data cleansing and integration [70]; a minimal mapping sketch follows this list.
- Implement methods for integrating data from multiple sources, with suitable metadata, to ensure meaningful integration of the data [70].
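The source-to-target mappings described above can be expressed directly in code. The sketch below is a minimal, assumed example of such a mapping driving a tiny extract-transform step; the column names and transforms are invented for illustration, not drawn from DaLiF.

```python
import csv
import io

# Illustrative source-to-target mapping: source column -> (target column, transform)
MAPPING = {
    "Citizen Name": ("name", str.strip),
    "DOB": ("date_of_birth", lambda v: v.replace("/", "-")),
    "Municipality": ("municipality", lambda v: v.strip().title()),
}

def transform(row: dict) -> dict:
    """Apply the source-to-target mapping to one extracted row."""
    return {
        target: fn(row[source])
        for source, (target, fn) in MAPPING.items()
        if row.get(source)                      # filter: drop empty/missing values
    }

source = io.StringIO("Citizen Name,DOB,Municipality\nJane Doe ,1980/01/02,brussels\n")
for row in csv.DictReader(source):              # extract
    print(transform(row))                        # transform; 'load' would follow
# {'name': 'Jane Doe', 'date_of_birth': '1980-01-02', 'municipality': 'Brussels'}
```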
Analysis The analysis phase is the most common phase across all data lifecycle models. It enables an organization to handle the ample information that can affect the business [61]. This phase is responsible for all data analysis and data analytics performed to extract knowledge and discover new insights [34, 42, 165]. It acts like a human brain, i.e., it processes the information so that the next appropriate actions can be taken [143, 155]. In this phase, different big data analytical tools are utilized to analyze data. Data analysis tools help policymakers analyze large and complex data to understand what is happening (descriptive analytics), to understand why something is happening (causality), to run what-if scenarios, and to make forecasts, which they then use in their decision making [143, 155]. Modern technologies, state-of-the-art data infrastructure, and highly skilled people are critical for extracting relevant insights from big data in the analysis phase. Examples of modern technologies include machine learning, deep learning, artificial intelligence, and natural language processing, while relevant people's skills include data analytics, data mining, computing, statistics, etc. [152]. The analysis phase also includes the analysis of unstructured data [91]. The output of the data analysis phase includes knowledge, the discovery of new insights, new data, interpretations, and/or new datasets [42, 54, 61, 165].
Analysis phase key functions The data analysis phase of DaLiF includes the following key functions:
- Select the data sources, including the identification of descriptions, data source locations, file types, and data provenance [58, 91].
- Perform analysis of the data to extract knowledge and discover new insights, which decision-makers then use intelligently to generate value for public organizations [17, 42-44, 54, 61].
- Consider innovative data analysis strategies, like schema-on-read, to manage public data through modern tools [91].
- Select appropriate data analysis tools and techniques, like data mining algorithms, cluster analysis, correlation analysis, statistical analysis, and regression analysis, to analyse public organizational data [58, 61]; a clustering sketch follows this list.
- Set up a group of data scientists (actors) with sound expertise in data analytics to analyze various types of public data, particularly unstructured data, and describe a set of actions to be completed by the group [61, 62, 91, 152].
- Discover business processes that can be enhanced through big data technology, analyze the existing issues in each business process, and re-engineer each business process using big data technology [62].
- Define the required types of big data analysis, which include descriptive, predictive, prescriptive, and diagnostic [34, 62, 165].
- Prepare and publish the outputs of the analysis phase in machine-readable formats [53, 55].
- Extract value from big data through its extensive use, and offer a natural interface to the data users [35, 42-44, 54].
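As a small, self-contained illustration of one technique named above (cluster analysis), the following sketch groups synthetic service-usage records with k-means using scikit-learn. The data and parameters are invented; a real analysis would draw on the prepared, quality-checked datasets of the preceding phases.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

# Synthetic records: [service requests per month, average processing days]
usage = np.array([
    [120, 2.0], [130, 2.2], [115, 1.8],   # high-volume, fast offices
    [20, 9.5], [25, 10.1], [18, 8.9],     # low-volume, slow offices
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(usage)
print(labels)                   # e.g., [0 0 0 1 1 1] -> two usage profiles
print(kmeans.cluster_centers_)  # descriptive summary of each profile
```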
Visualization The visualization phase deals with the presentation and visualization of the outcomes, as well as the explanation of the meaning of the discovered information [62, 91]. The visualization phase has the highest value for the data consumer in the information value chain, and it also boosts the interaction between data analytics and the organizations [91, 155]. We also noted the following categorization of data visualizations: exploratory, explicatory, and explanatory. The first category emphasizes better understanding of the data, particularly of huge amounts of repurposed data, because the volume of such datasets calls for new methods; examples of exploratory visualization include browsing, boundary conditions, and outlier detection. The second category focuses on analytical results; examples of explicatory visualization include confirmation, the interpretation of analytical results, and the near real-time presentation of analytics. The last category is about 'telling the story', i.e., presenting results in a simple way that a layperson can easily ingest; examples of explanatory visualization include business intelligence, reports, and summarization [91]. The results of this phase can be offered in various forms, like dashboards, oral presentations, user interactions, alerts, and reports [62].
Visualization phase key functions The visualization phase of DaLiF includes the following key functions:
- Visualize public data so that less tech-savvy decision-makers can understand and use the results for effective decision making [64, 91].
- Implement a plan to select modernized and extensible tools for data visualization, like pipes in the 'R' programming language and geoms (Cleveland dot plots, box plots, and jittered graphs) [58, 64]; a plotting sketch follows this list.
- Encrypt the resulting information and knowledge, and adopt an access control strategy to avoid privacy threats [23, 132].
- Adopt appropriate mechanisms for reporting and analysing the data, including online and web-based reporting, BI scorecards, ad-hoc querying, OLAP, and portals [70].
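The list above names box plots among the suggested geoms. As a language-neutral counterpart to the R tooling mentioned there, this minimal matplotlib sketch renders a box plot for invented processing-time data; the services and figures are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Invented processing-time samples (days) for two public services
permits = [2, 3, 2, 4, 3, 8, 2, 3]
licenses = [5, 6, 7, 5, 12, 6, 5, 7]

fig, ax = plt.subplots()
ax.boxplot([permits, licenses])
ax.set_xticklabels(["Permits", "Licenses"])
ax.set_ylabel("Processing time (days)")
ax.set_title("Service processing times (illustrative)")
fig.savefig("processing_times.png")  # a static export a dashboard or report could embed
```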
Storage The objective of the storage phase is to store data securely throughout the lifecycle. Data storage is an essential process of big data analytics in real-world applications [65, 166]. We noticed in the literature that the storage phase is considered in all data lifecycle models. There is a demand for stable and, usually, web-accessible storage [90, 144, 147]. Data Lakes ingest raw data in its original format from various data sources, fulfil their role as storage repositories, and allow users to query and explore them to extract knowledge [167]. In the standard Extract, Transform, Load (ETL) procedure, "Load" is close to "store", and this procedure plays a critical role in data storage [162]. The activities of other data lifecycle phases, like data access, publishing, data sharing, data use, and re(use), can be executed only once the data is stored somewhere [85]. However, big data storage is also a complex, costly, and challenging data lifecycle phase. In the public sector, government entities usually set up base registries to store GBD of particular importance, i.e., master data. A base registry is a reliable and authentic source of information about people, health, vehicles, crime, and businesses [58]. Base registries help PAs eliminate data silos and maximize data re-use across public sector entities easily and inexpensively [15, 58]. In the storage phase, different modern tools and technologies are required to store big data, like NoSQL, NewSQL, big data query platforms, the Hadoop Distributed File System (HDFS), and cloud storage technologies [50, 95, 96, 168, 169]. Moreover, several NoSQL technologies, like HBase, MongoDB, Cassandra, CouchDB, DynamoDB, Riak, Redis, and Neo4J, store data streams into a NoSQL database in real time [127].
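To illustrate how a collected record might land in one of the NoSQL stores named above, here is a minimal sketch using MongoDB's pymongo driver. The connection string, database, collection, and field names are placeholders, not part of DaLiF; a production deployment would use a secured, replicated cluster.

```python
from pymongo import MongoClient  # assumes the pymongo driver is installed

# Placeholder connection string; real deployments use authenticated clusters
client = MongoClient("mongodb://localhost:27017")
collection = client["gbde"]["sensor_readings"]  # hypothetical database/collection

reading = {
    "sensor_id": "seismo-042",            # illustrative field names
    "timestamp": "2021-05-01T10:00:00Z",
    "value": 0.0032,
}
result = collection.insert_one(reading)   # store the raw record
print(result.inserted_id)

# Later phases (access, analysis) can query the stored stream
for doc in collection.find({"sensor_id": "seismo-042"}).limit(5):
    print(doc)
```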
Storage phase key functions The storage phase of DaLiF includes the following essential functions:
- Identify the public data to be stored, and specify the data repository or data center where the shared data will be stored [15, 58].
- Develop and implement an appropriate short- and long-term storage plan to store data in GBDE [53].
- Where relevant, ask permission from citizens and businesses to store data that is their property [66].
- Store data in an appropriate location (an in-house data center or a private cloud environment) in a secure, scalable, accessible, and reliable manner [65, 90, 147].
- Comply with industry standards, (a) to store GBD with improved data structures, appropriate cloud data security, and fault tolerance, and (b) to improve the capacity and speed of data storage systems [65, 75, 142, 166].
- Implement a plan for the selection of modernized and extensible tools for data storage, with a balanced approach to data availability and scalability [58, 64, 95].
- Establish base registries to store public data at the national and cross-border levels [58].
- Work continuously on data storage with improved data structures and fault tolerance [65, 85].
- Adopt approaches based on encryption techniques to ensure privacy protection in the data storage phase [95, 96].
- Implement a document and content management system that offers storage of electronic documents and electronic images of paper documents, versioning, security, metadata management, content indexing, and retrieval capabilities [70, 71].
Access The data access phase focuses on the ways of communication between the data provider and the data consumer in the big data ecosystem [60]. In this phase, we decide and document which user [60, 147] or re-user [90, 147] accesses which data, and through what mechanisms [58]. Public sector organizations offer multiple channels for data access [94].
Access phase key functions The access phase of DaLiF includes the following essential functions:
- Ensure users' and re-users' access to public data on a day-to-day basis, as per an agreed and signed agreement [60, 90, 147, 149].
- Define data access controls and data authentication methods [58, 90, 117]; a minimal access check is sketched after this list.
- Establish data access models, like the cloud, intranet, and virtual desktop models, which respectively help determine hosts' identity and authority, clarify operation authority, and identify and authenticate remote users while ensuring secure communication [117].
- Ensure that data which is openly accessible to all users does not, by any means, contain classified private information, to avoid personal data privacy threats [109].
- Ensure that limitations on access are conveyed and respected [17, 90, 147].
- Use government data exchange platforms, like the Belgian platform 'MAGDA', to further facilitate data access and the exchange of data among public bodies [58].
- Store mission-critical data that analytical tasks need to access frequently in a way that offers fast retrieval and updates, while less urgently accessed data can be stored in a database, on disk, or in data files [120].
- Implement dynamic and scalable access control, like authenticator-based data integrity verification techniques [23, 132].
- Enable effective and efficient access to, and use of, data and information in unstructured formats [70].
- PAs should allow access to documents/records in accordance with the related policies, standards, and legal requirements [70].
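A data access control of the kind listed above can be as simple as a role-to-dataset permission table. The sketch below is a deliberately minimal role-based check; the roles and dataset names are hypothetical, and a real system would back this with authentication and audit logging.

```python
# Minimal role-based access check; roles, datasets, and rules are illustrative.
PERMISSIONS = {
    "citizen":      {"open_data"},
    "researcher":   {"open_data", "anonymized_health"},
    "data_steward": {"open_data", "anonymized_health", "base_registry"},
}

def can_access(role: str, dataset: str) -> bool:
    """Return True if the role is permitted to read the dataset."""
    return dataset in PERMISSIONS.get(role, set())

assert can_access("researcher", "anonymized_health")
assert not can_access("citizen", "base_registry")  # access limitation enforced
```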
Use, re(use) and feedback In this phase, we combine two key concepts: 'use & re(use)' and 'feedback'. The 'use & re(use)' concept is about the use and re(use) of data by the data consumers [118, 142, 161, 170] and focuses on the discovery of new and valuable information from existing public datasets by different stakeholders [58]. In the case of the second concept, feedback, data users exploit the open government data and provide their feedback [49, 98]; such feedback takes the form of user reactions, comments, and suggestions that usually identify improvements and corrections in the published data or metadata [49, 52, 98]. Moreover, re(use) is a process, not a single action; it includes different activities like acquiring datasets from various public or private data sources to compare them with recently collected data, returning to one's own data for later comparisons, surveying available datasets as background research for a new project, or steering the reanalysis of one or more datasets to address new research questions [171]. Examples of data consumers include citizens and individuals [67, 118], businesses, researchers, and employees of other government agencies [168]. In the use, re(use), and feedback phase, the PA is not the main actor but a client, since the PA can still use and re(use) the public data. App developers create new and valuable information by pulling non-classified government datasets together and mashing them up with other private data to build high-value apps [28, 142]. Governments are also working to open up data without personal attributes so that businesses and the community can use and re(use) such data for innovation, accomplish their day-to-day tasks, and gain commercial benefits from it [58, 170]. A variety of open datasets are used for several objectives by various users. Data publishers usually ensure that their data, especially private data, is accessible only for designated data use and re(use) [90, 147]. Data users and re-users have different motivations, like community welfare, business growth, and earning money [11, 28, 172]. The European Commission advised the European Member States to formulate a holistic big data strategy, including publishing open data and promoting the use and re(use) of such data; moreover, the Commission offers specific proposals to help them achieve better data use and re(use) within a Member State and across borders [66, 142]. Other government entities may use and re(use) GBD as a tool to improve and optimize the internal processes of the public administration and to make evidence-based decisions that improve their public services [58, 66]. This phase's output is a set of manipulated data values [48, 147]. Data feedback is a way to obtain consensus among stakeholders, including the community. Data providers examine the user feedback about data and publish modified data again after incorporating the data users' feedback [98, 173]. In this way, PAs can gather a vast amount of all stakeholders' viewpoints, as evidence-based information, on public data [58, 120].
Use, re(use), and feedback phase key functions The use, re(use), and feedback phase of DaLiF includes the following key functions:
- The data provider may provide data to the data consumers to use, re(use), and offer feedback on, along with an appropriate mechanism that enables individuals to manage and control their digital record of information [67, 142, 171].
- Ask permission from citizens and businesses, i.e., the owners of private data, to use and re(use) data that is their property, consistent with the objectives of the information collection [66, 118].
- Reach out to all stakeholders so that everyone has an equal chance to provide feedback [49, 66, 153].
- Allocate enough time to the stakeholders, and actively listen to them, so they can provide their feedback [58, 98].
- The data provider implements a data usage policy and the relevant national and international regulations on data use and re(use), and creates awareness of the said policy among data consumers to avoid the misuse of individuals' data [58, 66, 67].
- Adopt consistent and uniform approach(es) and shared (interoperability) platforms to support the safe, transparent, and controlled use and re(use) of data across public organizations; these approaches and platforms also help to discover what data is available and facilitate its use and re(use), preventing duplication of effort across public organizations [58, 142].
- Interact with the stakeholders in a more civilized and less bureaucratic manner to get sufficient and fruitful feedback from them [52, 66, 153].
- Implement base registries, single authoritative sources of data, to enable data use and re(use) and to decrease the need for citizens and businesses to give the same information to public organizations again and again [58, 118, 152].
- Implement the plan for the selection of modern tools and technologies, including API-based technologies, to promote data use and re(use) with data harmonization and consistency [58, 66, 161]; a client-side sketch follows this list.
- Develop IT systems, connectivity infrastructures, and platforms to move towards a country that functions as a unit and to increase the use and re(use) of GBD for decision making [58, 66, 119].
- Ensure the use of technological solutions and social media so that data providers, like PAs, can create informal and efficient channels of communication with data users, including citizens [49, 52, 98, 120].
- Establish collaboration with the citizens so that they can express their interest and offer feedback about the data published by the government [98, 133].
- Facilitate easy and inexpensive reuse of data across the organisations, preventing, wherever possible, redundant and inconsistent data [70].
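Several functions above mention API-based technologies for use, re(use), and feedback. The sketch below shows what a re-user's client code might look like against a hypothetical open-data portal; the URLs, endpoints, and payloads are entirely invented and do not correspond to any real API.

```python
import requests  # assumes the 'requests' package is installed

# Hypothetical open-data endpoint; real portals expose similar JSON APIs
URL = "https://data.example.gov/api/datasets/air-quality/records"

response = requests.get(URL, params={"limit": 10}, timeout=10)
response.raise_for_status()
records = response.json()    # machine-readable records ready for re(use)

# A re-user might mash these up with their own data, then send feedback
feedback = {"dataset": "air-quality", "comment": "Unit missing on field pm25"}
requests.post("https://data.example.gov/api/feedback", json=feedback, timeout=10)
```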
Share/publish In this phase, we combine the publishing and sharing concepts of traditional peer-reviewed publication with the distribution of data and information through (government) web portals, social media, data catalogs, eGovernment information systems, and other venues [55, 61, 128]. Data and its resources are collected, prepared, and analysed for sharing and publishing to benefit the stakeholders; examples of such stakeholders include governments, businesses, citizens, researchers, scientific partners, and federal agencies [9, 58, 61]. The data provider shares data with the above-mentioned stakeholders according to defined ethical and legal specifications [58, 67, 128]. In the government sector, organizations hold data related to tax revenue, health, education, economics, transport, etc., and share it with the rest of the government entities. Data sharing helps achieve greater efficiency in the use and re(use) of data by the government [35, 142, 152], and it is a key to transparency and economic growth [174]. The fundamental idea of linked data is to use the World Wide Web's global architecture to share structured data worldwide [26]. In this phase, the data publishing concept emphasizes what data can and should be made public, and how data needs to be published with appropriate security measures and integrity [58, 92]. PAs determine which data is to be issued to other government departments and which information is to be disseminated openly to the public [58]. However, PAs do not publish various datasets due to certain data traits, like data containing personal or sensitive information [175]. PAs intend to publish government data for all in order to promote transparency, accountability, and value creation, i.e., better governance, and to enhance citizens' quality of life [67, 79, 175]. The data publishing phase is highly essential for the open government domain. This phase's output is published non-classified data [92, 93].
Share/publish phase key functions The data share/publish phase of DaLiF includes the following key functions:
- Implement a plan for the selection of modern tools and technologies, including API-based technologies, to promote data sharing/publishing with stakeholders safely and effectively [58, 67, 128].
- Identify the non-classified public data to be shared or published [58, 92].
- Sign data sharing agreements between governments and other stakeholders that state the legitimate basis and logic behind why public data is being shared [58, 66, 115].
- Ensure that appropriate measures are taken to enable individuals to control with whom data is shared and how much the owner is willing to share [67, 97, 114, 142].
- Data providers should focus on maintaining a balance between data availability and data redundancy when publishing data in various formats [79, 93].
- Consider data sharing granularity and data transmission, in addition to data authorization, while sharing private data: sharing granularity refers to conformity with the sharing policy, and data transmission indicates the isolation of sensitive information from the original data, which ensures the data cannot be related back to the data owners [118].
- Follow the open data publishing guidelines and principles mentioned in [176] and [177] to publish open data [93, 175].
- PAs should keep a balance in allocating power among the different groups of stakeholders (government bodies, NGOs, regulators, and data brokers versus data subjects, entrepreneurs, archivists, and data collectors) in driving the design, framing, and implementation of data sharing policies and practices [174].
- Implement web standards in data formats, like HTML, XML, RDF, and CSV, and web protocols, like HTTP, FTP, and SOAP, to publish data on the web [70, 92, 175]; a small export sketch follows this list.
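To illustrate publishing in the web-standard formats named in the last function, this sketch exports the same non-classified records as both CSV and JSON; the records and file names are invented for illustration.

```python
import csv
import json

# Illustrative non-classified records cleared for open publication
records = [
    {"station": "A-01", "pm25": 12.4, "measured_at": "2021-05-01"},
    {"station": "A-02", "pm25": 9.8,  "measured_at": "2021-05-01"},
]

# CSV export, one of the web-standard formats named above
with open("air_quality.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# JSON export for API-based consumers
with open("air_quality.json", "w") as f:
    json.dump(records, f, indent=2)
```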
Archiving phase Archiving is a process that anchors a chunk of data within a system through cataloging, indexing, or a related action [49]. Archiving is for obsolete data, kept for the record in case access is needed, at a low storage cost, whereas data storage is for active information, available for day-to-day activities, at a high storage cost [61]. Data archiving is one of the prime phases of a big data lifecycle [10]. It is pertinent to mention that effective data lifecycle management includes the intelligence not merely to archive data but to archive it based on specific parameters or business rules; examples of such parameters include the data's age or the last date of its use [51]. In a cloud computing environment, archiving is a technique to shift less frequently used data to another place in the cloud for an extended period [88, 142]. Data archiving can also help storage administrators develop a tiered and automated storage strategy to archive static data in a warehouse; through this strategy, data warehousing specialists can improve overall data warehouse performance [51]. Some researchers describe the data lifecycle by data access frequency [142]: as time goes on, the data access rate gradually declines, and ultimately such data reaches an archived state. Additionally, this phase requires three main operations, namely encryption techniques, long-distance storage, and a data retrieval mechanism; these operations permit the least used data to be shifted to separate storage devices for long-term storage, so the archiving and storage devices are kept separate [61, 88]. Some countries have special national archival legislation to archive government records/data for reference and future use by PAs.
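The parameter-driven archiving described above (e.g., archiving by data age or the last date of use) can be captured as a small business rule. The threshold and record layout in the following sketch are illustrative assumptions, not values prescribed by DaLiF.

```python
from datetime import date, timedelta

# Illustrative business rule: archive records not accessed for 2+ years
ARCHIVE_AFTER = timedelta(days=730)

def select_for_archiving(records, today=None):
    """Return records whose last access date exceeds the archiving threshold."""
    today = today or date.today()
    return [r for r in records if today - r["last_accessed"] > ARCHIVE_AFTER]

active = [
    {"id": 1, "last_accessed": date(2018, 3, 1)},   # stale -> archive
    {"id": 2, "last_accessed": date(2021, 2, 1)},   # recent -> keep in storage
]
for record in select_for_archiving(active, today=date(2021, 6, 1)):
    print(f"move record {record['id']} to low-cost archive tier")
```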
Archiving phase key functions The archiving phase of DaLiF includes the following essential functions:
- Data, including personal data, should be archived with strict security measures to avoid leakage of such data [58, 142].
- Implement a formally agreed plan to archive data so as to ensure data availability and data re-use [55].
- Use appropriate archival standards, like the General International Standard Archival Description (ISAD-G), for various purposes, including hierarchical data description [90, 147].
- Implement a plan to select modernized and extensible data archiving tools [49, 66].
- Adopt an appropriate archiving method that keeps archived data accessible to data scientists for data analytics as and when required [51].
- Forecast data resources, data infrastructure, and data management needs so as to deliver continuity and archive data for as long as required [66].
- Use appropriate anonymization techniques, like generalization and suppression, to protect data privacy during this phase [95, 96].
End of life phase In this phase, duplicated data, data that is no longer required, and useless data is removed from the system [50, 58, 88, 111]. Data must be considered in terms of the end of its usefulness, or end of life [58, 116]. In the cloud environment, the storage location of data is often moved to maximize resource usage; as data is moved, the data at the original location is also destroyed [117, 142]. This phase appears in the literature under titles such as delete, terminate, destroy, and dispose. Data-driven public administrations always make decisions regarding the end of life of data based on their data strategy [58]. This phase's output is a set of destroyed data values [48, 88].
End of life phase key functions The end of life phase of DaLiF includes the following key functions:
- Useless or inactive data, and data that has reached the end of its lifespan, may be destroyed as per the applicable rules/regulations [58, 88, 116].
- Implement a plan to adopt appropriate methods for the data end of life [49, 66].
- Data centers, including government data centers, should offer suitable data end-of-life functions, like disk replication and demagnetization, to their clients to avoid leakage of sensitive public data [88].
- Ensure that unnecessary data is permanently removed and cannot be restored from the storage medium, to avoid inadvertently disclosing sensitive information [118]; a deletion sketch follows this list.
- Ensure that data in the cloud is removed through appropriate means, according to the owner's wishes, to guarantee that the information cannot be disclosed or recovered [119, 142].
- Ensure the wiping of unwanted data on partitions and hard disks [118, 142].
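As one sketch of the permanent-removal function above, the code below overwrites a file before unlinking it. This is only an illustration: on SSDs, copy-on-write filesystems, and cloud storage, overwriting does not guarantee unrecoverability, which is why the text also mentions device-level measures like demagnetization.

```python
import os

def overwrite_and_delete(path: str, passes: int = 3) -> None:
    """Overwrite a file's contents with random bytes, then unlink it.

    Caveat: effective only on media where in-place overwrite is honored;
    SSDs and cloud volumes need device-level erasure or encryption with
    key destruction instead.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))
            f.flush()
            os.fsync(f.fileno())
    os.remove(path)

# overwrite_and_delete("expired_dataset.csv")  # hypothetical expired file
```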
Data quality phase The quality phase focuses on maintaining data quality during the whole data lifecycle, i.e., through the data collection, data integration, data analysis, data publishing, and data sharing phases [42, 62]. A primary data quality management principle is to manage data as a core organisational asset [70]. Data quality is one of the prime issues related to the value of data for the business [54, 178]. DM-BOK highlights the following dimensions of data quality: accuracy, completeness, consistency, currency, precision, privacy, reasonableness, referential integrity, timeliness, uniqueness, and validity [36, 70, 71]. Data-driven public administrations can offer better services and policies by improving data quality [43, 66, 152, 179]. When quality requirements are well defined, implementing controls to measure whether the data quality satisfies them is more feasible; examples of such quality requirements include margins of error and the requisite level of precision [17, 60]. The quality of big data is essential for its consumption: data should be precise, timely, and in accordance with actuality [66, 180]. Base registries (public sector master data) are needed for valuable and highly reusable data [44, 58]. The United Nations has also described a set of actions for computer scientists to ensure quality during data input and in the output results, to limit the risks in various factors like complexity, speed, accuracy, validity, and clarity [62].
Quality phase key functions The data quality phase of DaLiF includes the following essential functions:
- Certify that public data, information, and metadata are of high quality by engaging data quality and metadata experts [60, 66, 70].
- Establish quality criteria and quality processes that consider generation, storage, and processing [62, 66].
- Implement data quality management policies, international standards, procedures, and guidelines to cross-check the data quality level, discard data of low quality, improve the data quality, etc. Such implementation ensures the high quality, consistency, and integrity of public data and helps handle the 'Veracity' challenge [17, 42-44, 54, 66, 152].
- Monitor the data quality flows and, in case of failures, proceed as per the data quality management policies [42-44, 54].
- Apply conformance checks against data quality business rules, like attribute domain constraints, format constraints, and standardisation constraints, at each phase of the data lifecycle to avoid low-quality data, like missing attribute values and schema or data format differences [43, 181]; a conformance-check sketch follows this list.
- Create and promote data quality awareness within the organisation [70].
- Make data quality attributes, like accuracy, integrity, completeness, and timeliness, explicit to help policymakers determine whether the data is reliable enough for the decision-making process [13, 180, 181].
- As business process owners, PAs should agree to and abide by the data quality SLAs [70, 73].
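The conformance checks named above (attribute domain, format, and standardisation constraints) can be encoded as executable rules. In the following sketch, the attributes, rules, and sample rows are invented for illustration; a real deployment would derive them from the organisation's data quality policies.

```python
import re

# Illustrative data quality business rules
RULES = {
    "age":      lambda v: isinstance(v, int) and 0 <= v <= 120,    # domain constraint
    "postcode": lambda v: bool(re.fullmatch(r"\d{4,5}", str(v))),  # format constraint
    "country":  lambda v: v in {"BE", "NL", "DE"},                 # standardisation constraint
}

def conformance_report(rows):
    """Count rule violations per attribute, counting missing values as violations."""
    report = {attr: 0 for attr in RULES}
    for row in rows:
        for attr, rule in RULES.items():
            if attr not in row or not rule(row[attr]):
                report[attr] += 1
    return report

rows = [
    {"age": 34, "postcode": "1000", "country": "BE"},
    {"age": 150, "postcode": "12AB", "country": "FR"},   # violates all three rules
    {"postcode": "2000", "country": "NL"},               # missing attribute value
]
print(conformance_report(rows))  # {'age': 2, 'postcode': 1, 'country': 1}
```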
Protection phase This phase focuses on data protection in terms of data integrity, security, access control, and privacy [17, 61, 182]. The phase spans the whole data lifecycle, i.e., from planning and collection to archiving and destruction, to maintain data security and privacy protection against any accidental or malicious compromise of the GBDE [58, 69, 97, 144, 183]. Due to the quantity, variety, and sensitivity of big data and its management through heterogeneous technological solutions, data security and privacy protection become crucial, and a holistic methodological approach based on data protection standards and common practices is required to deal with these issues [69, 118]. Adequate data security and privacy protection management establishes governance mechanisms that are easy enough to abide by on a daily operational basis [70]. Classified data must be secured and protected from unauthorized access through various data masking techniques [17, 51]. A balance must be struck between privacy and the risk of malicious data exploitation [23, 58, 61, 183]. The processing of personal data in Europe is subject to the General Data Protection Regulation (GDPR) and the Data Protection Act 2018; such legislation ensures the privacy of citizens and the secrecy of data and information given by businesses [58, 66, 118]. This phase's output is secured and protected data [61, 144].
Protection phase key functions The data protection phase of DaLiF includes the following essential functions:
- Government organizations should process data in a way that certifies the protection of personal data against unauthorized or unlawful data handling [58, 66, 96, 118].
- Implement privacy standards (introduced by ITU, CSA, ISO, etc.), a privacy policy, and techniques and security solutions to protect data, including personal data, against data threats; such solutions and methods are based on various security patterns like encryption, authentication, anonymization, and role-based access control [23, 69, 70, 118, 119, 132]. An encryption sketch follows this list.
- Use unique identifiers to manage users' digital identities, their relationship to a real-world identity, and their access to systems, data, and information; this is essential for data protection [58, 118].
- Data and information must be protected as prescribed by both regional (e.g., EU) and national (e.g., Italian) legal codes and data protection policies, with suitable levels of data protection, security, confidentiality, privacy, integrity, and availability [183].
- PAs should also allocate sufficient funding, create awareness among the people, impart the requisite training to staff, and engage technical experts to protect GBD [96, 118].
- Minimize the risk of privacy violations during data collection/generation by appropriate means, like restricting access or falsifying data [95, 96].
- Ensure privacy protection in the cloud environment through the strict separation of sensitive data from non-sensitive data [118].
- PAs should adopt security and data protection processes to identify and protect citizen and business data; for example, privacy-by-default and privacy-by-design should be adopted [58, 66].
- Ensure a double-encryption data system, using appropriate encryption algorithms like AES and RSA, to avoid data mining-based security attacks [23, 132].
- PAs should make arrangements to identify and apply the security requirements applicable to the receipt, processing, physical storage, and output of data and classified messages [70].
- Execute effective data security policies and procedures to ensure that the right people can use and update data in the right way [70, 73].
- Collaborate with stakeholders (e.g., IT security administrators, data stewards, internal and external audit teams, and legal experts) to define data security requirements and the data protection policy [70].
- Adopt data protection tactics at the data consumer, system, and data provider levels to ensure protection from unauthorized entities, systems, and untrusted data providers, respectively; examples of such tactics include personal data stores, software/hardware-based virtualization, and data encryption [182].
- PAs should promote the concept of decentralization and private-by-design IoT through blockchain technology in IoT-based information management systems to ensure data security and privacy protection [184].
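The protection functions above repeatedly call for encryption. As a minimal illustration (not the AES-plus-RSA double-encryption scheme cited, which would layer an asymmetric key exchange on top), this sketch uses the symmetric Fernet recipe, an AES-based construction from Python's 'cryptography' package; the record content is invented.

```python
from cryptography.fernet import Fernet  # assumes the 'cryptography' package is installed

# Key generation would normally happen in a key management service, not inline
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"citizen_id": "AB123", "diagnosis": "classified"}'  # illustrative sensitive record
token = cipher.encrypt(record)          # ciphertext safe to store or transmit
print(token[:16], b"...")

assert cipher.decrypt(token) == record  # only key holders can recover the data
```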
Governance phase The data governance phase refers to a plan to guarantee that high-quality, protected data exists and is exploited throughout the complete data lifecycle [58, 185]. It determines the policies and procedures that safeguard the proactive and effective management of data assets [70]. Data governance interacts with and influences each of the surrounding phases and guides how their activities are performed [70-72]. Data governance helps PAs manage data in public sector organizations, as it also implies the allocation of decision-making rights and the associated functions in such management [68, 186]. The data governance phase helps PAs protect public sector organizations' data assets so as to assure generally understandable, accurate, complete, reliable, protected, anonymous, and discoverable government big data; it also assists in systemizing these organizations by linking business processes with data in GBDE [185, 187]. The data governance phase includes consistent management and helps public administrations set data rules/policies, provide insights, wisdom, and judgment, and promote accountability [61, 152]. Estimating the quality of data is recognized as crucial for data governance, which is one of the central pillars of data-driven government. Through excellent data governance, public administrations can guarantee that their data are precise, reliable, comprehensive, available, and secure [58, 66, 186].
Governance phase key functions The data governance phase of DaLiF includes the following essential functions:
- Utilize standards, guidelines, tools, policies, laws, procedures, roles, and responsibilities for public data governance to ensure data utility for the data consumers [58, 68].
- Establish a formal system of accountability for effective data governance [58, 152, 187].
- Apply machine learning and AI algorithms to improve data governance [66, 187].
- Create a collaborative environment among the stakeholders, including users, so that the public administration receives proposals from them on improving the data lifecycle, particularly in case of agility in the work scope [62, 186].
- Focus on data quality, data security, and privacy protection to tackle data governance-related issues in the cloud computing environment for better visibility, data quality, and protection control [186, 187].
- Promote the use of machine learning and AI to reframe data governance so that it addresses the related business requirements in a way that motivates data producers and consumers to work together [68, 185].
- Constitute a governance board or committee in the organization to oversee and drive data governance across the public services [58, 66, 186].
- PAs must take an organizational perspective to ensure the quality, security protection, and effective use of government data [70, 72].
As an outcome of step 5, "results", of the research review protocol, we have presented our comprehensive research results and the proposed DaLiF in the preceding sub-sections of this segment.