Statistical science in the world of big data☆
Introduction
From January to June 2015, a series of twelve workshops took place across Canada, as part of a six-month thematic program at the Fields Institute for Research in the Mathematical Sciences, organized by the Canadian Statistical Sciences Institute. Most of the activity was in Toronto at the Fields Institute (Fields, 2015), which also provided the main funding. Allied workshops were held at the Pacific Institute for the Mathematical Sciences in Vancouver, the Centre de Recherches Mathématiques in Montreal, and the Atlantic Association for Research in the Mathematical Sciences in Halifax. I chaired the program organizing committee and the international advisory committee, but the success of the effort is really due to the hard work and varied contributions of the committee members.
One positive outcome of this program was that I spent considerable time discussing with my colleagues, and thinking about, the place of statistical science in what was then the booming area of big data. This essay summarizes some of my thoughts on this, with the advantage of hindsight, and informed by the many changes that have taken place at remarkable speed since then.
Some highlights of the thematic program are described in Franke et al. (2016).
Big data
When we started planning the proposal submission for the thematic program in the summer of 2013, everyone was talking about big data, and calling it “Big Data”. In fact we spent some time worrying that it might be risky to include this in the title of the thematic program, in case the phrase might be out of date before the program got started in 2015. As it turned out, it was not, although it had already decreased a few levels on the “Gartner hype cycle” (Gartner, 2014). The 2015 version of
Data science
By the time our program ended in mid-2015, data science was coming to replace big data as a short-hand for the world of lots of data. This has now become much more current, and in my view represents a multi-disciplinary field that includes aspects of applied mathematics, computer science, statistics, and subject-matter applications. Although it has been argued that statistical science is data science (Yu, 2014) and that departments of statistics should rename themselves (a few indeed have
Machine learning, deep learning and artificial intelligence
Machine learning is a distinct sub-field of computer science with a clear research agenda and a relatively long history; many statisticians will have been introduced to topics in machine learning via Hastie et al. (2009) or Bishop (2006). Deep learning, as explained for example in talks by Brendan Frey and by Machine Learning Workshop (2015), refers to both computational and modeling aspects of neural networks with a very complex architecture. Deep learning seems to be the breakthrough that has
Conclusion
We designed the thematic program as a blend of foundational themes and applications-oriented themes. In the former we chose to emphasize machine learning, high-dimensional inference, optimization and visualization; for the latter social policy, health policy, environmental science and networks. Of course as usual the distinctions between the applications and the foundations were blurry, and many other application areas were touched on in various presentations.
What emerged from our experience
Acknowledgments
I would like to thank the Fields Institute for Research in the Mathematical Sciences for support for the thematic program described here, and my colleagues on the organizing committee: Yoshua Bengio, Hugh Chipman, Sallie Keller, Lisa Lix, Richard Lockhart and Ruslan Salakhutdinov. Helpful conversations with Raymond Ng, Mary Thompson, Don Fraser, Sofia Olhede and Bin Yu have also framed my thinking on many of the issues around big data and data science.
References (39)
Big data and public policies: opportunities and challenges. Statist. Probab. Lett. (2018)
On the role of latent variable models in the era of big data. Statist. Probab. Lett. (2018)
Big data sampling and spatial analysis: “which of the two ladles, of fig-wood or gold, is appropriate to the soup and the pot?” Statist. Probab. Lett. (2018)
Statistics for big data: a perspective. Statist. Probab. Lett. (2018)
Principles for statistical inference on big spatio-temporal data from climate models. Statist. Probab. Lett. (2018)
On the role of statistics in the era of big data: a computer science perspective. Statist. Probab. Lett. (2018)
Statistical challenges of big brain network data. Statist. Probab. Lett. (2018)
Journeys in Big Data statistics. Statist. Probab. Lett. (2018)
When small data beats big data. Statist. Probab. Lett. (2018)
Statistical issues in radiosonde observation of atmospheric temperature and humidity profiles. Statist. Probab. Lett. (2018)
Statistical modeling of spatial big data: an approach from a functional analysis perspective. Statist. Probab. Lett.
Statistics within business in the era of big data. Statist. Probab. Lett.
The role of statistics in data-centric engineering. Statist. Probab. Lett.
Conducting highly principled data science: A statistician’s job and joy. Statist. Probab. Lett.
Statistical methods and challenges in connectome genetics. Statist. Probab. Lett.
The role of statistics in the era of big data: electronic health records for healthcare research. Statist. Probab. Lett.
How do statisticians analyse big data – our story. Statist. Probab. Lett.
A practical guide to big data. Statist. Probab. Lett.
On dimension reduction models for functional data. Statist. Probab. Lett.
☆ Supported by the Natural Sciences and Engineering Research Council of Canada.