Background
Methods
Corpus
- PubMed: “big data”[TIAB] OR (big[TIAB] AND “health data”[TIAB]) OR “large data”[TI];
- PMC: “big data”[TI] OR “big data”[AB] OR (big[TI] AND “health data”[TI]) OR (big[AB] AND “health data”[AB]) OR “large data”[TI].
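As an illustration only, the two search strings above could be submitted programmatically through the NCBI E-utilities `esearch` endpoint. The sketch below builds the request URLs without sending them. The helper name `build_esearch_url` and the `retmax` default are assumptions introduced here, not part of the original search procedure, and the typographic quotes are replaced with straight quotes.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search strings copied from the corpus description, with straight quotes.
PUBMED_QUERY = (
    '"big data"[TIAB] OR (big[TIAB] AND "health data"[TIAB]) '
    'OR "large data"[TI]'
)
PMC_QUERY = (
    '"big data"[TI] OR "big data"[AB] '
    'OR (big[TI] AND "health data"[TI]) '
    'OR (big[AB] AND "health data"[AB]) OR "large data"[TI]'
)


def build_esearch_url(db, term, retmax=100):
    """Return an esearch URL for `term` in database `db` (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


pubmed_url = build_esearch_url("pubmed", PUBMED_QUERY)
pmc_url = build_esearch_url("pmc", PMC_QUERY)
print(pubmed_url)
print(pmc_url)
```

Keeping the query strings as named constants makes it easy to check that the PubMed and PMC searches stay equivalent apart from their field tags.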
Topic modelling concepts
Topic modelling implementation
Big data definitions
Topic analysis
Results
Corpus
Topic modelling and model selection
Big data definitions
Group | Theme name | Theme description | Definition sources |
---|---|---|---|
I | Volume, size, voluminous, cardinality | Large quantities of data in number of bytes; size of available data (e.g. all records instead of a sample); beyond conventional storage techniques; number of records at a particular instance | |
I | Velocity, continuity | Flow rate at which data is created, stored, analysed, and visualised; increased through invention of new data streams such as social media; beyond conventional means of processing, needing new techniques such as streaming; growth of data over time | |
I | Variety, complexity | Many different types of data; not bound to a traditional data format; format changes over time; heterogeneous and unstructured data | |
I | Veracity | Trustworthiness of data; reliability of data quality and gathering environment | |
I | Value | Worth/relevancy of data (e.g. economic, individual/privacy, societal, humanity value) | |
I | Variability | Consistency of data over time; influences which systematically change data measures over time | |
II | Information | Where signals are turned into data (e.g. book digitalisation, or gathering from personal device measurements) | [14] |
II | Technology | Tools, systems, and software (e.g. scalable processing and transmission systems such as Hadoop) | |
II | Methods | Procedures and their application (e.g. clustering, natural language processing, machine learning, neural networks, visualisation) | |
II | Impact | Ethical, business, societal | [14] |
III | Beyond conventional | Data whose size calls for methods beyond the tried-and-true; necessity of scalable systems for storage, processing, manipulation, analysis, visualisation | |
IV | Application | About the application domain treated in the papers | – |
- The definition by Microsoft [40] came from a 2013 blog post and was therefore possibly outdated;
- Shneiderman et al. [41] do not specifically mention big data, as their publication dates from 2008, when the term was not yet in use;
- The definition by Manyika et al. [43] was only described in the executive summary;
- Mayer-Schönberger et al. [42] propose an abstract definition that was considered too difficult to convert into interpretable themes for topic analysis.
Topic analysis
Theme assignment grouped by observer
Topic | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
1 | Imp, value | Value | App, imp, value | Vera, value | imp, app, vera | Imp, value | |
2 | Vera, app | Imp, app | Info, app | Vera, velo | App | Tech, variety, vera | |
3 | Imp, app | App | App | ||||
4 | Met | Met | Vol, met | Met | Tech, met | Tech, velo | Met |
5 | Vol, velo, beyond | Tech | Vol, tech, beyond | Beyond, vol, velo | Tech, complex, beyond | Vol | Vol |
6 | Tech | Tech | Tech, velo | Tech, beyond | Tech, beyond | Tech | Tech, variety, vera |
7 | Met | Met | Vera, met | Met | Tech, met, info, app | Met | Met |
8 | App | App | Info, app | App, info | App | App | Variety, app |
9 | App | Imp | Imp | Imp | Value, imp, app | ||
10 | App | Met, tech | Variety, info, met | App, met | App | App, variety, info | Vol, beyond |
11 | App | App | App | App, Imp | App | App | Imp, value |
12 | Tech, vol, velo | Vol | Vol, velo | Vol, velo, beyond | Tech, vol, velo | Vol, velo | Met, vol |
13 | Variability, vera | Met | Met | Met | App, info | Met | Met |
14 | Info | Info | Tech, app | App, info | Imp | Info | Value, imp, app |
15 | Imp | App | Imp | App | Info, app | App, imp | Value, vera |
16 | App | Met | App | Info, app | Info, app | App | Beyond, vol |
17 | Value | Info | Tech, beyond | Info | Continuity, variability | Tech | Value, tech |
18 | App | Met | Info | App, info | Met, app, tech, info | App | Vol, vera |
19 | Value | App | Met, app | Info | Continuity, app | Variety | Tech, imp |
20 | Met | Met | Met | Met | Met, info | Met | Met |
21 | App | App | App | App, imp | Info, app | App | Variety, app, vera |
22 | Info, velo | Info | Info, app | Info, vera | Velo, continuity, app | App, info | Info |
23 | Info, app | App | Info, app | Info | Info | App, info | Beyond, vol, vera, info |
24 | Value | App | Info, app | Info, app | Continuity, info, imp | App | Vol, variety |
25 | Met | Met | Info | Info, met, tech | Vol, velo | Velo | |
Total | 33 | 22 | 39 | 40 | 53 | 35 | 49 |
Topic | Themes | Overall | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Volume | Velocity | Variety | Veracity | Value | Variability | Information | Technology | Methods | Impact | Beyond con. | Application | ||
1 | 2 | \(\underline{\mathbf{5}}\) | 4 | 2 | Value, Impact | ||||||||
2 | 1 | 1 | 3 | 1 | 1 | 1 | 4 | Application | |||||
3 | 1 | 3 | – | ||||||||||
4 | 1 | 1 | 2 | \(\underline{\mathbf{6}}\) | Methods | ||||||||
5 | \(\underline{\mathbf{5}}\) | 2 | 1 | 3 | 4 | Volume, Beyond conventional | |||||||
6 | 1 | \(\underline{\mathbf{7}}\) | 2 | Technology | |||||||||
7 | 1 | 1 | 1 | \(\underline{\mathbf{7}}\) | 1 | Methods | |||||||
8 | 1 | 2 | \(\underline{\mathbf{7}}\) | Application | |||||||||
9 | 1 | 4 | 2 | Impact | |||||||||
10 | 1 | 2 | 2 | 1 | 3 | 1 | 4 | Application | |||||
11 | 1 | 2 | \(\underline{\mathbf{6}}\) | Application | |||||||||
12 | \(\underline{\mathbf{6}}\) | \(\underline{\mathbf{5}}\) | 1 | 2 | 1 | 1 | Volume, Velocity | ||||||
13 | 1 | 1 | 1 | \(\underline{\mathbf{5}}\) | 1 | Methods | |||||||
14 | 1 | 4 | 1 | 1 | 2 | Information | |||||||
15 | 1 | 1 | 1 | 3 | 4 | Application | |||||||
16 | 1 | 2 | 1 | 1 | \(\underline{\mathbf{5}}\) | Application | |||||||
17 | 1 | 1 | 1 | 2 | 3 | 1 | – | ||||||
18 | 1 | 1 | 3 | 1 | 2 | 4 | Application | ||||||
19 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | – | ||||
20 | 1 | \(\underline{\mathbf{7}}\) | Methods | ||||||||||
21 | 1 | 1 | 1 | 1 | \(\underline{\mathbf{7}}\) | Application | |||||||
22 | 2 | 1 | \(\underline{\mathbf{6}}\) | 3 | Information | ||||||||
23 | 1 | 1 | \(\underline{\mathbf{6}}\) | 1 | 4 | Application, Information | |||||||
24 | 1 | 1 | 1 | 1 | 3 | 1 | 4 | Application | |||||
25 | 1 | 2 | 2 | 1 | 3 | – | |||||||
Total | 17 | 17 | 8 | 12 | 14 | 2 | 39 | 24 | 36 | 19 | 11 | 66 |
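The per-topic counts in this table can be derived from the observers' theme assignments by tallying how often each theme was chosen. The following is a minimal sketch of that tally, not the authors' procedure. The abbreviation map follows the theme names in the definitions table (for example, "continuity" counted under Velocity and "complex" under Variety), and the threshold of four observers for the Overall column is an assumption inferred from the rows of the table rather than a rule stated in the text.

```python
from collections import Counter

# Abbreviations used by the observers, mapped to theme names.
ALIASES = {
    "vol": "Volume", "velo": "Velocity", "continuity": "Velocity",
    "variety": "Variety", "complex": "Variety", "vera": "Veracity",
    "value": "Value", "variability": "Variability",
    "info": "Information", "tech": "Technology", "met": "Methods",
    "imp": "Impact", "beyond": "Beyond conventional", "app": "Application",
}


def count_themes(assignments):
    """Tally themes over one topic's per-observer assignment strings."""
    counts = Counter()
    for cell in assignments:
        for label in cell.split(","):
            label = label.strip().lower()
            if label:
                counts[ALIASES[label]] += 1
    return counts


def overall_themes(counts, threshold=4):
    """Themes chosen by at least `threshold` observers (assumed rule)."""
    return sorted(theme for theme, n in counts.items() if n >= threshold)


# Observer assignments for topic 1, copied from the assignment table.
topic_1 = ["Imp, value", "Value", "App, imp, value",
           "Vera, value", "imp, app, vera", "Imp, value"]
counts = count_themes(topic_1)
print(dict(counts))
print(overall_themes(counts))
```

For topic 1 this tally gives Value 5, Impact 4, Veracity 2, and Application 2, which agrees with the first row of the table.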
Topics | ||||
---|---|---|---|---|
1 | 2 | 3 | 4 | 5 |
Health | Patient | Article | Algorithm | Challenged |
Research | Clinic | Review | Cluster | Analyte |
Healthcare | Hospital | Discuss | Learn | Tool |
Policies | Electron | Field | Method | Amount |
Health_care | Care | Recent | Feature | Technologic |
Privacies | Outcome | Issue | Efficiencies | Computability |
Nation | Medicaid | Aspect | Approximate | Analysing |
Ethic | Record | Focus | Tree | Require |
Protect | Ehr | Emerge | Represent | Advance |
Govern | Clinical_research | Future | Fast | Varieties |
Inform | Health_record | Highlight | Matrix | Solution |
Secure | Clinician | Current | Accuracies | Growth |
Challenged | Treatment | Context | Problem | Large_amount |
Share | Improve | Overview | Distance | Massive |
Concern | Assess | Paper | Hierarchical | Generate |
Access | Healthcare | Paradigm | Computability | Dataset |
Communities | Qualities | Confer | Faster | Vast |
Fund | Potential | Natural | Calculate | Process |
Health_informatics | Patient_care | Technologic | Graph | Handle |
Health_system | Routine | Literature | Outperform | Infrastructural |
6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|
System | Model | Age | Change | Network |
Process | Predict | Risk | Nurse | Molecular |
Device | Infer | Influenza | Innovated | Structural |
Framework | Statistic | Indicating | Science | Biomarker |
Cloud | Regress | Exposure | Social | Complex |
Architectural | Simulate | Cohort | Question | Heterogeneities |
Hadoop | Predictor | Rate | Historian | Integral |
Applicability | Bayesian | Symptom | Influence | Systems_biology |
Service | Fit | Month | Practical | Mechanical |
Manage | Good | Yearbook | Insight | Omic |
Platform | Optimal | Variable | Cultural | Approach |
Design | Prior | Life | Turn | Character |
Mapreducable | Base | Death | Product | Dynameomics |
Computability | Variable | Diabetes | Food | Function |
Base | Machine_learning | Adjust | Societies | Biologic |
Support | High_dimensional | Geographic | Understand | Transit |
Implement | Tradition | Condition | Drive | Rdge |
Task | Rank | Factor | Evolution | Topological |
Deploy | Parameter | Demographic | Scientific | Protein |
Cloud_computing | Feature | Incidence | Principle | Organ |
11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|
Disease | Dataset | Effect | Search | Biomedical |
Prevent | Time | Group | Social_media | Informatic |
Epidemiologic | Sample | Measurable | Language | Science |
Vaccination | Large_scale | Testable | Google | Medicinal |
Progress | Computability | Estimate | Word | Medicaid |
Immune | Speed | Analysing | Public | Educate |
Leverage | Performance | Studied | Relate | Research |
Popular | Increased | Statistic | Psychological | Learn |
Initial | Approach | Bias | Trend | Personalized_medicine |
Develop | Thousand | Large | Emoticon | Era |
Heart | Step | Random | Twitter | Ontological |
Administration | Rate | Valuable | Message | Disciplinary |
Intervention | Implement | Power | Online | Translate |
Generate | Full | Method | Relationship | Student |
Blood | Memorial | Sample_size | Social | Scientist |
Advance | Scale | Marker | Visit | Train |
Public_health | Hundred | Find | Content | Impact |
Reported | Block | Large_set | Caseness | Workshop |
Consensus | Applicability | Import | Posit | Discoveries |
Earlier | Multiple | Error | Investigacin | Knowledge |
16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|
Genet | Web | Sequence | Mine | Classifiable |
Gene | Resource | Genome | Knowledge | Set |
Associating | Code | Bioinformatic | Extract | Object |
Phenotype | File | Proteome | Inform | Large_set |
Pathway | Laboratories | High_throughput | Chemical | Class |
Disease | Public | DNA | Specialised | Noise |
Genotype | Compress | Transcriptome | Plant | General |
Factor | Semantic | Protein | Biologic | Pair |
Enrich | Software | Composite | Concept | Performance |
Trait | Retrievable | Ngs | Develop | Abilities |
Genome_wide | Access | Metagenome | Toxic | Neural_network |
Metabolic | Share | Virus | Construct | Similar |
Genome | Format | Analysing | Note | Train |
Mutated | Inform | Host | Curate | Dimension |
Number | Interface | Biologic | Rich | Machine |
Identifi | Source | Assemble | Gap | Categorical |
Polymorphism | Platform | Cell | Preservation | Appliance |
Individual | Metadata | Microbiome | Ecological | Formula |
Regular | Storage | Align | Diverse | Encounter |
Unification | Exchange | Human | Abstract | Coefficient |
21 | 22 | 23 | 24 | 25 |
---|---|---|---|---|
Drug | Visual | Image | Cancer | Low |
Target | Activated | Brain | Studied | Reduce |
Cell | Human | Disorder | Tumor | Time |
Event | Behavior | Signal | Valid | Base |
Screen | Mobile | Subject | Research | Reduction |
Response | Environment | Resolution | Registries | Digital |
Experiment | Interact | Neuroimaging | Therapeutic | Node |
Detected | Exploration | Function | Database | Energies |
Analyse | User | Neuron | Injuries | Deep |
Adversary | Collect | Segment | Oncologist | Small |
Multiple | Sensor | Psychiatric | Clinical_trials | Cost |
Compound | Tool | Connectome | Claim | Size |
Profile | Wearable | Neuroscience | Therapies | Numerator |
Miss | Quantifiable | Mode | Efficacies | Operability |
Type | Track | Mri | Diagnostic | Combina |
Potential | Movement | Scan | Heterogeneities | Peak |
Combina | Physical | Quantitation | Set | Spectral |
Meta | Display | Analysing | Specific | Structural |
Complete | Smartphone | Microscopic | Ongoing | Locate |
Point | Interest | Multi | Consortium | Qualities |
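Word lists such as the ones above are typically obtained by ranking each topic's terms by their weight in the fitted model and keeping the highest-ranked entries. The sketch below shows that ranking step in isolation. The function name `top_terms` and the numeric weights are invented for illustration and are not values from the fitted model.

```python
def top_terms(term_weights, n=20):
    """Return the n highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical weights for a few terms of topic 6 (illustrative values only).
topic_6_weights = {
    "cloud": 0.012,
    "device": 0.019,
    "framework": 0.015,
    "process": 0.024,
    "system": 0.031,
}

print(top_terms(topic_6_weights, n=3))
```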
Discussion
Identification of themes in big data definitions
Corpus gathering
Automatic identification of topics
The R package topicmodels had 22,576 downloads in 2015.³ Moreover, the paper by Blei et al. [17] describing the underlying model has been cited over 16,000 times.⁴ We therefore chose the LDA implementation of TM for three reasons: its appropriateness for our data, its relative ease of use (i.e., ready-to-use implementations in R), and its extensive use in the literature by our peers.