Introduction
Background
Theory in information systems
Theory type | Distinguishing attributes |
---|---|
I. Analysis | Says what is. The theory does not extend beyond analysis and description. No causal relationships among phenomena are specified and no predictions are made |
II. Explanation | Says what is, how, why, when and where. The theory provides explanations but does not aim to predict with any precision. There are no testable propositions |
III. Prediction | Says what is and what will be. The theory provides predictions and has testable propositions but does not have well-developed justificatory causal explanations |
IV. Explanation and prediction (EP) | Says what is, how, why, when, where and what will be. Provides predictions and has both testable propositions and causal explanations |
V. Design and action | Says how to do something. The theory gives explicit prescriptions (e.g. methods, techniques, principles of form and function) for constructing an artifact |
Aspect | Grand theory | Middle-range theory |
---|---|---|
Boundary | Unbounded | Bounded by subject matter |
Constitution | Axioms containing constructs and theoretical concepts | Propositions containing observables |
Level of falsifiability | Low | High |
Differentiated by | Philosophy | Specialization |
Legitimacy | It is primarily a means of establishing legitimacy | Legitimacy is evidenced by scope, precision and investigative tools |
Formation and growth | Fully formed from the mind of the theorist, and may grow as a result of discussion | Formed from a mass of basic observations, and grows by knowledge and experience of its scientists and researchers |
Data and generalization | Does not require data; generalization is based on the paradox of induction | Requires data, but is abstract enough to provide generalization |
Inception and systemic interactions | Starts from the outside with a total system and imposes on derived theories | Starts from the inside and possibly builds a unified system across domains |
Data science in a nutshell
On the epistemology of data science
Claim | Counterargument |
---|---|
Big data (as a key force behind data science) can capture the full resolution of a given domain or phenomenon | No matter how exhaustive the data is, it is still a representation and a sample, and is bound by space and time in a continuously changing world |
There is no need for a priori knowledge in the form of theory, models or hypotheses | Systems that capture and generate data are designed for very specific purposes, and analytical algorithms are designed through scientific reasoning (drawing on established theories) |
Data can speak for themselves free from human bias | Patterns extracted from data require human interpretation and theorization to avoid “ecological fallacies” when taking action based on random correlations. Making sense of data is always framed through our knowledge and experiences |
Meaning transcends context or domain-specific knowledge | Domain-specific expertise and a wealth of knowledge are needed to assess and articulate problems and to interpret results in order to avoid reductionism |
- It is suited to make sense of massive interconnected datasets, overcoming problems of small samples and scarce data
- Interdisciplinary research is fostered, since it is much less limited by a priori theoretical boundaries
- Abductive reasoning is encouraged
- Holistic models and theories about complex systems, rather than elements of them, are possible.
Objectives of data science
The state of data science in IS
Research method
Results: data science contributions to IS research
Scope of DS | MISQ | JBD | Total |
---|---|---|---|
Methodology | 4 | 5 | 9 |
Research method | 1 | 5 | 6 |
Object of study | 2 | 14 | 16 |
Method and object | 6 | 7 | 13 |
Total | 13 | 31 | 44 |
Data science as a methodology (M)
Title of M-type studies | Contribution | Goal |
---|---|---|
Cerchiello, P., & Giudici, P. (2016). Big data analysis for financial risk management. Journal of Big Data, 3(1), 18 | Theory | Prediction |
Hurtado, J. L., Agarwal, A., & Zhu, X. (2016). Topic discovery and future trend forecasting for texts. Journal of Big Data, 3(1), 7 | Theory | Prediction |
Padmaja, B., Prasad, V. V. R., & Sunitha, K. V. N. (2016). TreeNet analysis of human stress behavior using socio-mobile data. Journal of Big Data, 3(1), 24 | Theory | Explanation and prediction |
van Altena, A. J., Moerland, P. D., Zwinderman, A. H., & Olabarriaga, S. D. (2016). Understanding big data themes from scientific biomedical literature through topic modeling. Journal of Big Data, 3(1), 23 | Theory | Analysis |
Wu, H., Wu, H., Zhu, M., Chen, W., & Chen, W. (2017). A new method of large-scale short-term forecasting of agricultural commodity prices: Illustrated by the case of agricultural markets in Beijing. Journal of Big Data, 4(1), 1 | Theory and artifact | Prediction and design |
Geva, H., Oestreicher-Singer, G., & Saar-Tsechansky, M. (2019). Using Retweets When Shaping Our Online Persona: Topic Modeling Approach. MIS Quarterly, 43(2) | Theory | Explanation |
Gong, J., Abhishek, V., & Li, B. (2018). Examining the Impact of Keyword Ambiguity on Search Advertising Performance: A Topic Model Approach. MIS Quarterly, 42(3), 805–829 | Theory and artifact | Explanation and design |
Yahav, I., Shmueli, G., & Mani, D. (2016). A tree-based approach for addressing self-selection in impact studies with big data. MIS Quarterly, 40(4), 819–848 | Theory and artifact | Design |
Shi, Z., Lee, G. M., & Whinston, A. B. (2016). Toward a Better Measure of Business Proximity: Topic Modeling for Industry Intelligence. MIS Quarterly, 40(4), 1035–1056 | Theory and artifact | Design |
Data science as a research method (RM)
Title of RM-type studies | Contribution | Goal |
---|---|---|
Agarwal, A., Baechle, C., Behara, R. S., & Rao, V. (2016). Multi-method approach to wellness predictive modeling. Journal of Big Data, 3(1), 15 | Theory and artifact | Prediction and design |
Asri, H., Mousannif, H., & Al Moatassime, H. (2019). Reality mining and predictive analytics for building smart applications. Journal of Big Data, 6(1), 66 | Theory and artifact | Prediction and design |
Goswami, K., Park, Y., & Song, C. (2017). Impact of reviewer social interaction on online consumer review fraud detection. Journal of Big Data, 4(1), 15 | Theory | Explanation and prediction |
Mavragani, A., & Ochoa, G. (2018). Infoveillance of infectious diseases in USA: STDs, tuberculosis, and hepatitis. Journal of Big Data, 5(1), 30 | Theory | Prediction |
Sohangir, S., Wang, D., Pomeranets, A., & Khoshgoftaar, T. M. (2018). Big Data: Deep Learning for financial sentiment analysis. Journal of Big Data, 5(1), 3 | Theory | Prediction |
Geva, T., Oestreicher-Singer, G., Efron, N., & Shimshoni, Y. (2017). Using forum and search data for sales prediction of high-involvement projects. MIS Quarterly, 41(1), 65–82 | Theory | Prediction |
Data science as an object of study (O)
Title of O-type studies | Contribution | Goal |
---|---|---|
Chandak, M. B. (2016). Role of big-data in classification and novel class detection in data streams. Journal of Big Data, 3(1), 5 | Artifact | Design |
Chopade, P., & Zhan, J. (2015). Structural and functional analytics for community detection in large-scale complex networks. Journal of Big Data, 2(1), 11 | Artifact | Design |
Fang, X., & Zhan, J. (2015). Sentiment analysis using product review data. Journal of Big Data, 2(1), 5 | Artifact | Design |
Hasanin, T., Khoshgoftaar, T. M., Leevy, J. L., & Seliya, N. (2019). Examining characteristics of predictive models with imbalanced big data. Journal of Big Data, 6(1), 69 | Theory and artifact | Prediction and design |
Kaur, A., & Datta, A. (2015). A novel algorithm for fast and scalable subspace clustering of high-dimensional data. Journal of Big Data, 2(1), 17 | Artifact | Design |
Khalilian, M., Mustapha, N., & Sulaiman, N. (2016). Data stream clustering by divide and conquer approach based on vector model. Journal of Big Data, 3(1), 1 | Artifact | Design |
Nagwani, N. K. (2015). Summarizing large text collection using topic modeling and clustering based on MapReduce framework. Journal of Big Data, 2(1), 6 | Artifact | Design |
O’Donovan, P., Leahy, K., Bruton, K., & O’Sullivan, D. T. J. (2015). An industrial big data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities. Journal of Big Data, 2(1), 25 | Theory and artifact | Prediction and design |
Pirouz, M., & Zhan, J. (2016). Optimized relativity search: Node reduction in personalized page rank estimation for large graphs. Journal of Big Data, 3(1), 12 | Artifact | Design |
Prusa, J. D., & Khoshgoftaar, T. M. (2017). Improving deep neural network design with new text data representations. Journal of Big Data, 4(1), 7 | Artifact | Design |
Sharma, S., & Toshniwal, D. (2017). Scalable two-phase co-occurring sensitive pattern hiding using MapReduce. Journal of Big Data, 4(1), 4 | Artifact | Design |
Yang, Y., Zhang, K., Wang, J., & Nguyen, Q. V. (2015). Cabinet Tree: An orthogonal enclosure approach to visualizing and exploring big data. Journal of Big Data, 2(1), 15 | Artifact | Design |
Young-Min, K. (2019). Feature visualization in comic artist classification using deep neural networks. Journal of Big Data, 6(1), 56 | Artifact | Design |
Zhang, H., Raitoharju, J., Kiranyaz, S., & Gabbouj, M. (2016). Limited random walk algorithm for big graph data clustering. Journal of Big Data, 3(1), 26 | Artifact | Design |
Brynjolfsson, E., Geva, T., & Reichman, S. (2016). Crowd-Squared: Amplifying the Predictive Power of Search Trend Data. MIS Quarterly, 40(4), 941–962 | Artifact | Design |
Martens, D., Provost, F., Clark, J., & Junqué de Fortuny, E. (2016). Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, 40(4), 869–888 | Theory and artifact | Prediction and design |
Data science as a research method and object of study (RMO)
Title of RMO-type studies | Contribution | Goal |
---|---|---|
Baechle, C., Agarwal, A., & Zhu, X. (2017). Big data driven co-occurring evidence discovery in chronic obstructive pulmonary disease patients. Journal of Big Data, 4(1), 9 | Theory and artifact | Explanation, prediction and design |
Etani, N. (2015). Database application model and its service for drug discovery in Model-driven architecture. Journal of Big Data, 2(1), 16 | Theory and artifact | Prediction, design and action |
Hayes, M. A., & Capretz, M. A. (2015). Contextual anomaly detection framework for big sensor data. Journal of Big Data, 2(1), 2 | Artifact | Design |
Kumar, S., & Toshniwal, D. (2016). A novel framework to analyze road accident time series data. Journal of Big Data, 3(1), 8 | Theory and artifact | Prediction and design |
Mavragani, A., & Tsagarakis, K. P. (2019). Predicting referendum results in the Big Data Era. Journal of Big Data, 6(1), 3 | Theory and artifact | Prediction and design |
Subroto, A., & Apriyana, A. (2019). Cyber risk prediction through social media big data analytics and statistical machine learning. Journal of Big Data, 6(1), 50 | Theory and artifact | Explanation, prediction and design |
Yang, J., & Yecies, B. (2016). Mining Chinese social media UGC: A big-data framework for analyzing Douban movie reviews. Journal of Big Data, 3(1), 3 | Artifact | Analysis and design |
Liebman, E., Saar-Tsechansky, M., & Stone, P. (Forthcoming). The Right Music at the Right Time: Adaptive Personalized Playlists Based on Sequence Modeling. MIS Quarterly, 43(3), 765–786 | Theory and artifact | Prediction, design and action |
Son, J., Brennan, P.F., & Zhou, S. (Forthcoming). A Data Analytics Framework for Smart Asthma Management Based on Remote Health Information Systems with Bluetooth-Enabled Personal Inhalers. MIS Quarterly, 44 | Artifact | Design and action |
Mo, J., Sarkar, S., & Menon, S. (2018). Know When to Run: Recommendations in Crowdsourcing Contests. MIS Quarterly, 42(3), 919–944 | Theory and artifact | Prediction and design |
Abbasi, A., Zhou, Y., Deng, S., & Zhang, P. (2018). Text Analytics to Support Sense-Making in Social Media: A Language-Action Perspective. MIS Quarterly, 42(2), 427–464 | Theory and artifact | Design |
Lin, Y. K., Chen, H., Brown, R. A., Li, S. H., & Yang, H. J. (2017). Healthcare Predictive Analytics for Risk Profiling in Chronic Care: A Bayesian Multitask Learning Approach. MIS Quarterly, 41(2), 473–496 | Theory and artifact | Prediction and design |
Zhang, K., Bhattacharyya, S., & Ram, S. (2016). Large-Scale Network Analysis for Online Social Brand Advertising. MIS Quarterly, 40(4), 849–868 | Artifact | Design |
Discussion: data science as a research methodology
Research design
- A addresses the question of what makes an online review helpful.
- B addresses the question of what the existing body of knowledge on service innovation and service design entails, and what future areas of integration and progress may be possible.
- C addresses the question of whether societies can experience mood states, as reflected in public tweets, that affect their collective decision making.
Data collection
- A downloaded a publicly available dataset of Amazon product reviews that had been curated earlier for research purposes. They applied sampling in two steps: (a) by product category to represent a specified unit of analysis, and (b) by excluding reviews with fewer than two helpfulness ratings in order to increase analytical reliability.
- B collected their dataset through publication database searches and aggregation. They documented their search steps thoroughly and followed established guidelines and benchmarks in their field.
- C collected two subsets of data that represent the study’s different constructs: tweets to extract moods, and Dow Jones Industrial Average (DJIA) closing values for stock market indicators. They sampled the collected tweets using regular expressions to extract only tweets that express explicit emotional states.
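The two sampling rules described above (A's minimum number of helpfulness ratings and C's regular-expression filter) can be sketched in a few lines. This is only an illustration under stated assumptions: the field name `helpfulness_ratings`, the sample records and, in particular, the mood pattern are hypothetical, since the expressions actually used by C are not given here.

```python
import re

# Hypothetical pattern: the expressions actually used by study C are not
# listed in this text, so these phrases are illustrative assumptions only.
MOOD_PATTERN = re.compile(
    r"\b(i feel|i am feeling|i'm feeling|makes me)\b", re.IGNORECASE
)


def keep_review(review, min_ratings=2):
    """Study A, step (b): keep reviews with at least `min_ratings` ratings."""
    return review["helpfulness_ratings"] >= min_ratings


def keep_tweet(text):
    """Study C: keep tweets that contain an explicit mood expression."""
    return MOOD_PATTERN.search(text) is not None


# Made-up records, used only to exercise the two filters.
reviews = [
    {"id": "r1", "helpfulness_ratings": 0},
    {"id": "r2", "helpfulness_ratings": 2},
    {"id": "r3", "helpfulness_ratings": 7},
]
tweets = [
    "I feel great about today",
    "DJIA closed higher",
    "This weather makes me sad",
]

sampled_reviews = [r["id"] for r in reviews if keep_review(r)]
sampled_tweets = [t for t in tweets if keep_tweet(t)]
```

Keeping each rule in its own small predicate makes the sampling decisions explicit and easy to document, in line with the thorough documentation of search steps noted for B.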
Preprocessing
Analytics
- A employs topic modeling for feature (variable) selection: each review is assigned a probability (weight) for each of the top identified topics. These weights, along with other variables from the literature, are used in random forests to classify and predict the helpfulness of a review.
- B uses topic modeling to extract, label and classify different topics in research texts. To move from the state of the art towards future trajectories, they computed a linear trend of topics over time, as well as the network structure between topics.
- C developed the hypothesis that including a public mood measure in existing stock market prediction models will enhance their accuracy. The new model was developed using a self-organizing fuzzy neural network.
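As an illustration of the linear trend of topics over time mentioned for B, the following minimal sketch fits an ordinary least-squares slope to a topic's yearly share. The function name `linear_trend` and the yearly shares are assumptions made for this example; they are not B's data, and B's exact trend computation may differ.

```python
def linear_trend(years, shares):
    """Ordinary least-squares slope of a topic's share over time."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(shares) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, shares))
    variance = sum((x - mean_x) ** 2 for x in years)
    return covariance / variance


# Made-up yearly topic shares (fraction of documents dominated by the topic).
years = [2015, 2016, 2017, 2018]
rising_topic = [0.10, 0.12, 0.14, 0.16]
falling_topic = [0.20, 0.18, 0.15, 0.11]

rising_slope = linear_trend(years, rising_topic)
falling_slope = linear_trend(years, falling_topic)
```

A positive slope suggests a topic that is gaining attention and a negative slope one that is fading, which is the kind of signal that can inform a discussion of future trajectories.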
Interpretation and theorizing
“It requires an understanding of the theory’s originality, how it modifies the rules of discourse in the field, what hidden assumptions underlie the theory, what new concepts are being introduced and how they impact the discourse, what laws will be affected or constructed as a result, and the range and scope the theory is expected to cover.”
- A provided two types of relational propositions: the key constructs that are predictive of a review’s helpfulness, and the direction of correlation (positive or negative) between the independent and dependent variables. The latter proposition, deemed necessary for explanatory interpretation, was developed through a second iteration of analytics and interpretation.
- B also provided two propositions: definitional propositions represented by the extracted topics, and relational propositions represented by the identified topic network including topic nodes and edges. Theorization was also extended to identify mechanisms for future research that would utilize the domains’ trajectories.
- C developed the proposition in the form of a testable hypothesis including the constructs matching the independent variables identified through semantic analysis.
Example: exploring data-driven innovation through text analytics
Research design
Sampling and data collection
Case | Data sources | Documentation |
---|---|---|
Colour-in city | 2 interviews; 3 team members | 111 min; 20 pages |
Colour-in city | 3 reports | 46 pages |
Colour-in city | 6 blog entries | 39 pages |
Colour-in city | 12 design tools | 62 pages |
Colour-in city | Documentation of 3 visual artifacts | 9 pages |
Text preprocessing
- Tokenize: Transforms the text of a document into a sequence of 1–n tokens.
- Transform cases: Transforms all tokens to lower case.
- Filter stopwords: Removes tokens that match a built-in list of English stopwords (e.g. and, an).
- Stem: Transforms all words to their stem/origin (e.g. define, defined and defining all become defin).
- Filter using a wordlist: Removes tokens that match a provided list (represented by the “Read Excel” operator). This step was introduced after certain tokens were found to skew the resulting topics. Accordingly, a list of those tokens was compiled and used. The list is case-specific and contained tokens not captured by the built-in list of stopwords (e.g. yeah, etc., http, www, com), as well as entity names that were skewing the word distribution (e.g. OrganiCity, team names).
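The five steps above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the tool chain used in the study: the stopword set is a tiny stand-in for a built-in list, `toy_stem` is a crude suffix stripper rather than a real stemming algorithm, and the case-specific wordlist only repeats the examples given in the text.

```python
import re

# Illustrative stand-ins: neither set is the list actually used in the study.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}
CASE_WORDLIST = {"yeah", "etc", "http", "www", "com", "organicity"}

# Suffixes tried in order by the toy stemmer (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, wordlist=CASE_WORDLIST):
    tokens = re.findall(r"[A-Za-z]+", text)              # tokenize
    tokens = [t.lower() for t in tokens]                 # transform cases
    tokens = [t for t in tokens if t not in STOPWORDS]   # filter stopwords
    tokens = [toy_stem(t) for t in tokens]               # stem
    # The wordlist is stemmed too, so that it matches the stemmed tokens.
    stemmed_wordlist = {toy_stem(w) for w in wordlist}
    return [t for t in tokens if t not in stemmed_wordlist]  # filter wordlist


sample = "Defining and defined: define the OrganiCity team, yeah http www"
result = preprocess(sample)
```

Applying the case-specific wordlist last mirrors the order of the steps listed above; because it runs after stemming, the wordlist entries are stemmed with the same function before matching.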
Text analytics
T_ID | Expert panel | Innovation team | Researchers’ interpretation |
---|---|---|---|
T0 | Problem formulation | Design ingredients (process) | Design process |
T1 | Front end of innovation | Defining solution in uncertainty | Fuzzy front end (FFE) |
T2 | Need finding | Describing the user/audience | User understanding |
T3 | Questionnaire | Reporting (interim survey) | Reporting jargon |
T4 | Technical implementation | Data infrastructure | Data infrastructure |
T5 | Service end-user interaction | Chatbot function | User interaction |
T6 | Stakeholder analysis | Impact on each stakeholder | Stakeholders |
T7 | Experimenting | Approaches to innovation | Methods/approaches |
T8 | Communication of innovation | Usefulness of chatbot | Presenting innovation |
T9 | Data analysis | Insights using analytics | Data analytics |
Theoretical interpretation
- The what: through acknowledging a distinct phase of critical examination of the innovation that has been previously overlooked or downplayed.
- The how: through stressing the iterative rather than linear nature of the process.
- The why: through calling for a new logic underlying the process that accommodates (a) the nature of data-driven innovations, and (b) the blurring boundaries between innovation teams and their users.