2022 | OriginalPaper | Chapter

3. Measuring Firm-Level Digitalization

An Explorative Investigation via Website Data

Author: Daniel Wittenstein

Published in: Managing Digital Transformation

Publisher: Springer Fachmedien Wiesbaden

Abstract

The COVID-19 pandemic strikingly demonstrates the advantages of digitalization. To contain the spread of the virus, many countries have taken severe actions and imposed restrictions that affected large parts of the analogue world. This had severe implications for the global economy, especially for businesses that relied on personal customer contact.

Footnotes
1
COVID-19 (Coronavirus disease 2019) has resulted in an ongoing global pandemic with far-reaching consequences for the global economy. First estimates indicate that its economic fallout could cause the worst recession since the Great Depression (IMF 2020; The Economist 2020d).
 
2
I acknowledge that the term digitalization is subject to controversy and often used interchangeably for digitization and digital transformation. In this dissertation, digitization is defined as the process of creating a digital version of formerly analogue things, such as processes, documents etc. Hence, it is part of firm digitalization which refers to the conversion of business functions, interactions, and business models into more digital ones. Digital transformation broadly describes the path of moving to digital business.
 
3
According to survey data (Statista 2019b), 63% of firms with fewer than 10 employees have a website, as do 87% of firms with 11–49 employees, 93% of firms with 50–249 employees, and 97% of firms that employ more than 250 people.
 
4
The resilience and dynamic development of so-called ‘tech’ firms, which incorporate digital technologies and applications as crucial parts of their business, during the COVID-19 crisis is remarkable. While many established firms struggle with collapsing markets, changes in market and customer requirements seem to provide further growth opportunities for tech firms. As of August 2020, the ‘Big Five’ tech firms (Alphabet, Amazon, Facebook, Apple, and Microsoft) alone have a combined market capitalization of more than $9 trillion, with Apple reaching a record high of $2 trillion (Klebnikov 2020; Nicas 2020). Within a decade, many of these ‘digital’ firms evolved from small niche players into the dominant forces that revolutionize industries and value creation processes (Statista 2020b, 2020c).
 
5
Countries included are Germany, Austria, Switzerland, Sweden, Finland, Italy, the United Kingdom, and France.
 
6
The sample of Swiss firms mainly covers companies in the German speaking region of Switzerland.
 
7
We include firm observations from Panel years 2012 to 2018. See, e.g., Rammer and Peters (2013) for technical details about the Mannheim Innovation Panel.
 
8
The MIP is the official German Innovation Survey (GIS) conducted by the Center for European Economic Research (ZEW), the Fraunhofer Institute for Systems and Innovation Research (ISI), and the Institute for Applied Social Sciences (infas). It is commissioned by the German Federal Ministry of Education and Research (BMBF) and gathers information on the innovation activities of German companies. The survey covers 7,000 to 8,500 responding firms each year. As the survey is based on a stratified random sample, its results can be extrapolated to the total firm population. The MIP represents the German contribution to the Community Innovation Survey (CIS) of the Statistical Office of the European Commission (EUROSTAT), which serves as the basis for the European Innovation Statistic (ZEW 2019).
 
9
Orbis is an online databank that covers more than 375 million companies worldwide. It provides detailed information on firm activities, performance, industry affiliation and history as well as contact information such as URLs (Bureau van Dijk 2020).
 
10
See Table C.3.1 in the appendix for the detailed comparison of age, size and industry distribution between MIP firms with websites and MIP firms for which no URL information is available.
 
11
A comparison of the distribution of firms which are successfully matched with the Orbis database and have website information with firms that are matched and have no website information shows that there is still a significant difference in terms of number of employees (+28.9%). In this context, however, matched firms with website information are 2.8 years older than firms with no website information.
 
12
See Appendix C.1.2 for a screenshot of the ARGUS user interface with the adjustable parameter settings.
 
13
Note on terminology: each website consists of a number of webpages. In the context of this paper, the highest-level webpage is referred to as main page. This main page is usually the first webpage that is downloaded by the scraping tool (URL). All lower-level webpages are referred to as subpages.
 
14
To test the parameter setting, we run the web scraping tool on 100 randomly selected URLs from our total sample of 48,236 websites, with a predefined webpage limit of 2,500. The scraping results show that 15% of the websites report errors and were not downloaded, which is in line with previous findings (Kinne & Axenbeck 2019). For the 85 remaining websites, the mean number of webpages per website is 322.4. This average is highly skewed due to a relatively large number of websites with more than 1,000 webpages. A median of 24 suggests that a webpage limit of 25 should capture more than 50% of all firm websites completely. Figure C.2.1 in the appendix illustrates the distribution of webpages per website. A manual check of text language among the 85 downloaded test websites reveals that 77 websites are in German, with the remaining 8 in English.
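The reasoning behind choosing the page limit from the median rather than the mean can be sketched in a few lines of Python. The page counts below are invented for illustration and are not the 85 test websites; `share_fully_captured` is a hypothetical helper name.

```python
from statistics import mean, median


def share_fully_captured(page_counts, limit):
    """Fraction of websites whose page count does not exceed the limit."""
    return sum(1 for n in page_counts if n <= limit) / len(page_counts)


# Hypothetical page counts: most sites are small, a few are very large.
page_counts = [3, 8, 12, 20, 24, 30, 45, 1200, 2500]

print(mean(page_counts))    # pulled far upward by the two large sites
print(median(page_counts))  # robust to the long right tail
print(share_fully_captured(page_counts, limit=25))
```

With a long right tail, the mean says little about the typical website, whereas a limit set just above the median captures at least half of the websites completely.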
 
15
For the detailed comparison of the final sample URLs and rejected/erroneous URLs in terms of size, age and industry distribution see Table C.3.2 in the appendix.
 
16
As shown in Figures C.2.2-C.2.5 in the appendix, both parameters differ substantially between industry classes and employee size classes. However, we do not find evidence for a statistically significant relationship between firm age and number of webpages or words per webpage.
 
17
See Appendix C.1.1 for the survey questions on digitalization included in the MIP 2016.
 
18
For more information about the concept of Industry 4.0 see, e.g., BMWi (2017a).
 
19
Missing URL information reduces the training sample by 28.2%, download errors by another 12.8%, and an additional 11.9% of the initial URLs are redirected and, thus, excluded from further investigations. We do not find a statistically significant difference between the excluded and remaining observations in terms of the average digital readiness score.
 
20
See Table C.3.3 in the appendix for the average item scores on which the index is based and Figure C.2.6 for the distribution of index scores.
 
21
See Appendix C.1.3 for a description of the included terms and expressions for the respective keyword terms.
 
22
A manual screening of 15 websites of firms with low digital readiness scores and 15 websites of firms with high digital readiness scores supports this claim and shows that more highly digitalized firms mention the determined keywords more often.
 
23
To ease the matching procedure and prevent differences in capitalization from affecting the results, all website texts are transformed to lowercase.
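A minimal sketch of case-insensitive keyword counting, assuming whole-word matching via regular expressions. The function name, the sample text, and the keyword list are hypothetical; the actual matching rules used in the chapter are not described here.

```python
import re


def count_keywords(text, keywords):
    """Count whole-word occurrences of each keyword, ignoring case."""
    lowered = text.lower()  # normalize capitalization before matching
    counts = {}
    for kw in keywords:
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        counts[kw] = len(re.findall(pattern, lowered))
    return counts


sample = "Our Cloud platform supports Big Data. The cloud is secure."
print(count_keywords(sample, ["cloud", "big data", "iot"]))
# {'cloud': 2, 'big data': 1, 'iot': 0}
```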
 
24
See Table C.3.6 in the appendix for the summary statistics of all variables used in the keyword-based prediction models.
 
25
The F1 score is the harmonic mean of precision and recall of the respective model. Recall describes the fraction of the total amount of relevant instances that are actually retrieved and is calculated by dividing the true positives by the sum of the true positives and false negatives. Precision describes the fraction of relevant instances among the retrieved instances and is calculated as true positive observations divided by the sum of true positive and false positive observations (Powers 2011).
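As a worked illustration, the sketch below computes the three measures from confusion-matrix counts using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean. The function name and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 score by maximizing precision or recall alone.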
 
26
The main difference between the AICc and the EBIC lies in their theoretical considerations. Whereas the AICc tries to select the model that most adequately describes an unknown, high-dimensional reality, the EBIC tries to find the true model among the presented parameters. This implies that EBIC model selection is consistent if the true model is among the candidate models. The AICc, in contrast, is loss efficient in the sense that it selects the model that minimizes the squared average prediction error (Zhang et al. 2010). Accordingly, the EBIC tends to produce more parsimonious models than the AICc, which imposes a smaller penalty on the parameter estimates. For large sample sizes, the AICc (= N*log(RSS/N) + 2*df*N/(N-df)) and the EBIC (= N*log(RSS/N) + df*log(N) + 2*xi*df*log(p)) are equal to the AIC (Akaike 1974) and BIC (Schwarz 1978) values (Ahrens et al. 2020). We can confirm this observation for all our regularization models.
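The two formulas in this footnote can be written out directly. The sketch below assumes that N is the number of observations, RSS the residual sum of squares, df the effective degrees of freedom, p the number of candidate predictors, and xi the EBIC tuning parameter; these readings of the symbols, the function names, and the numeric inputs are assumptions for illustration. The AIC and BIC forms used for comparison are the standard Gaussian versions, N*log(RSS/N) + 2*df and N*log(RSS/N) + df*log(N).

```python
import math


def aicc(rss, n, df):
    """Corrected AIC as given in the footnote."""
    return n * math.log(rss / n) + 2 * df * n / (n - df)


def ebic(rss, n, df, p, xi):
    """Extended BIC as given in the footnote."""
    return n * math.log(rss / n) + df * math.log(n) + 2 * xi * df * math.log(p)


def aic(rss, n, df):
    return n * math.log(rss / n) + 2 * df


def bic(rss, n, df):
    return n * math.log(rss / n) + df * math.log(n)


# Hypothetical values: the AICc correction factor N/(N-df) approaches 1 as N grows.
print(aicc(rss=500.0, n=1_000_000, df=5) - aic(rss=500.0, n=1_000_000, df=5))
# With xi = 0 the extra EBIC term vanishes and the BIC is recovered exactly.
print(ebic(rss=500.0, n=1000, df=10, p=50, xi=0.0) - bic(rss=500.0, n=1000, df=10))
```

For the same fit, the EBIC penalty grows with log(N) and log(p), which is why it tends to favor smaller models than the AICc.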
 
27
Regularization methods, such as lasso, elastic net and ridge regressions, are frequently applied in exploratory studies to reduce model complexity and decrease variability of the parameter estimates. See, e.g., Hastie et al. (2009) for the technical specifications of regularization models.
 
28
The lasso minimizes the residual sum of squares subject to a constraint on the absolute size of coefficient estimates (Tibshirani 1996). Like ridge regression, lasso decreases the variability of parameter estimates in comparison to least squares and, thus, can outperform OLS in terms of prediction performance (Ahrens et al. 2020).
 
29
In our models, we do not find this to be an issue, as the results and parameter selection remain stable across multiple cross-validation folds.
 
30
The elastic net applies a mix of lasso-type and ridge-type penalization by introducing an additional penalty parameter alpha between zero and one. For alpha = 1, elastic net would become the lasso regression and for alpha = 0, it equals a ridge regression model (Ahrens et al. 2020).
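The role of alpha can be illustrated with the penalty term alone. The sketch below uses one common parameterization, lambda * (alpha * sum|b| + (1 - alpha) / 2 * sum b^2); the exact scaling in Ahrens et al. (2020) may differ, and the function name and coefficient values are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: a convex mix of the lasso (L1) and ridge (squared L2) terms."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1 - alpha) * 0.5 * l2_squared)


beta = [1.0, -2.0, 0.0]  # hypothetical coefficient vector

print(elastic_net_penalty(beta, lam=0.5, alpha=1.0))  # pure lasso penalty
print(elastic_net_penalty(beta, lam=0.5, alpha=0.0))  # pure ridge penalty
print(elastic_net_penalty(beta, lam=0.5, alpha=0.5))  # equal mix of both
```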
 
31
We use the user-written rforest (Zou & Schonlau 2017) command in Stata. In particular, we apply a random forest with 1,000 trees, a minimum of one sample per leaf and two random predictor variables investigated for each partial-feature tree. See Figures C.2.7 and C.2.8 in the appendix for the convergence tables that support this decision.
 
32
This discrepancy is illustrated in Figure C.2.9 in the appendix.
 
33
A further reduction of variables does reduce the overfitting but yields worse out-of-sample prediction performance.
 
34
We cannot directly distinguish between B2C and B2B companies and rely on the NACE industry classifications. Some industries, such as manufacturing (C), usually include a high share of B2B firms. A robustness check indicates a slight improvement when predictions are limited to firms in this sector, which may support our assumption.
 
35
Choudhury et al. (2020) provide a good overview on how ML can be used to identify patterns. See Hastie et al. (2009) for the technical foundations of ML.
 
36
We use the term semi-supervised as our models are partially based on unsupervised algorithms (attention-based topic modeling) and supervised methods (prediction and accuracy measurement relying on survey data information).
 
37
See Appendix C.1.5 for a description of the applied models and their shortcomings.
 
38
HTML stands for Hypertext Markup Language and is the standard markup language for website documents.
 
39
These stop words are provided by the Python-based Natural Language Toolkit. For further information see https://www.nltk.org/.
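A minimal sketch of stop word removal. To stay self-contained, it uses a tiny hand-picked stop word set as a stand-in; with NLTK installed and its stopwords corpus downloaded, a list such as `nltk.corpus.stopwords.words("german")` could be passed in instead. The tokenizer and function name are assumptions and not the chapter's actual preprocessing.

```python
import re

# Hypothetical stand-in for a full stop word list such as the one shipped with NLTK.
STOP_WORDS = {"the", "and", "of", "for", "a", "is"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return [t for t in tokens if t not in stop_words]


print(remove_stop_words("The future of manufacturing is digital"))
# ['future', 'manufacturing', 'digital']
```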
 
40
Rectified Adam (RAdam) is a modified version of the optimization algorithm Adam. It addresses the large variance issues of adaptive learning frameworks and stabilizes the convergence within deep learning models.
 
41
Epoch indicates the number of completed cycles over the full training data.
 
42
For a description of the extracted topics, see Appendix C.1.4.
 
43
Figure C.2.12 presents the corresponding error convergence graphs for the 14 topics over validation data.
 
44
See Table C.3.16 for the hyper-parameter configuration of the attention-based topic model based on RAdam.
 
45
A hidden layer is a layer that lies between the input and output of the deep learning algorithm and performs non-linear transformations of the inputs that enter the network (DeepAI 2020).
 
46
See Table C.3.17 in the appendix for the configuration of the hidden layers and the activation function.
 
47
Figures C.2.13 and C.2.14 in the appendix provide a scatter plot of observed and predicted digital readiness scores for in-sample and out-of-sample data, respectively. The graphs provide further support for our claim as they show a similar clustering of observations in the middle section of the index.
 
48
In our keyword-based regressions, industry effects and website word count information significantly improve our prediction performance. However, including these effects impedes potential inferences based on the predicted values. This is why we argue that the semi-supervised model is superior to keyword-based prediction.
 
49
See Table C.3.19 in the appendix for the structure of the neural network-based classification model and the activation functions.
 
50
Support also comes from an out-of-sample RMSE comparison. It shows that our predicted digital readiness scores within the non-digital and moderately digital groups yield a substantially lower RMSE of 5.282, compared to 13.046 for the highly digital group only.
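For reference, the RMSE is the square root of the mean squared difference between observed and predicted scores. The sketch below shows the calculation; the scores are invented for illustration and are unrelated to the values reported in this footnote.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical observed and predicted digital readiness scores.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(round(rmse(observed, predicted), 3))
```

Computing the measure separately for each group of firms, as in the comparison above, shows where the predictions deviate most from the survey-based scores.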
 
51
For more information on hidden champions see, e.g., Simon (2012) and Chapter 1 of this dissertation. Chapter 2 provides a detailed evaluation of their digital preparedness.
 
52
Bellstam et al. (2019) use web scraping techniques to gather data from analyst reports and are able to develop a sound measure of firm innovation activities.
 
Metadata
Title
Measuring Firm-Level Digitalization
Author
Daniel Wittenstein
Copyright Year
2022
DOI
https://doi.org/10.1007/978-3-658-36695-7_3