Statistical science in the world of big data

https://doi.org/10.1016/j.spl.2018.02.049

Abstract

This essay considers the role of the statistical sciences in the world of big data, data science, machine learning, and artificial intelligence, with a decidedly Canadian slant.

Introduction

From January to June 2015, a series of twelve workshops took place across Canada, as part of a six-month thematic program at the Fields Institute for Research in the Mathematical Sciences, organized by the Canadian Statistical Sciences Institute. Most of the activity was in Toronto at the Fields Institute (Fields, 2015), which also provided the main funding. Allied workshops were held at the Pacific Institute for the Mathematical Sciences in Vancouver, the Centre de Recherches Mathématiques in Montreal, and the Atlantic Association for Research in the Mathematical Sciences in Halifax. I chaired the program organizing committee and the international advisory committee, but the success of the effort is really due to the hard work and varied contributions of the committee members.

One positive outcome of this program was that I spent considerable time discussing with my colleagues, and thinking about, the place of statistical science in what was then the booming area of big data. This essay summarizes some of my thoughts on this, with the advantage of hindsight, and informed by the many changes that have taken place at remarkable speed since then.

Some highlights of the thematic program are described in Franke et al. (2016).


Big data

When we started planning the proposal submission for the thematic program in the summer of 2013, everyone was talking about big data, and calling it “Big Data”. In fact, we spent some time worrying that it might be risky to include this in the title of the thematic program, in case the phrase might be out of date before the program got started in 2015. As it turned out, it was not, although it had already decreased a few levels on the “Gartner hype cycle” (Gartner, 2014). The 2015 version of

Data science

By the time our program ended in mid-2015, data science was coming to replace big data as a shorthand for the world of lots of data. This has now become much more current, and in my view represents a multi-disciplinary field that includes aspects of applied mathematics, computer science, statistics, and subject-matter applications. Although it has been argued that statistical science is data science (Yu, 2014) and that departments of statistics should rename themselves (a few indeed have

Machine learning, deep learning and artificial intelligence

Machine learning is a distinct sub-field of computer science with a clear research agenda and a relatively long history; many statisticians will have been introduced to topics in machine learning via Hastie et al. (2009) or Bishop (2006). Deep learning, as explained for example in talks by Brendan Frey and by Machine Learning Workshop (2015), refers to both computational and modeling aspects of neural networks with a very complex architecture. Deep learning seems to be the breakthrough that has

Conclusion

We designed the thematic program as a blend of foundational themes and applications-oriented themes. In the former, we chose to emphasize machine learning, high-dimensional inference, optimization and visualization; for the latter, social policy, health policy, environmental science and networks. Of course, as usual, the distinctions between the applications and the foundations were blurry, and many other application areas were touched on in various presentations.

What emerged from our experience

Acknowledgments

I would like to thank the Fields Institute for Research in the Mathematical Sciences for support for the thematic program described here, and my colleagues on the organizing committee: Yoshua Bengio, Hugh Chipman, Sallie Keller, Lisa Lix, Richard Lockhart and Ruslan Salakhutdinov. Helpful conversations with Raymond Ng, Mary Thompson, Don Fraser, Sofia Olhede and Bin Yu have also framed my thinking on many of the issues around big data and data science.




Supported by the Natural Sciences and Engineering Research Council of Canada.
