3.1 Sources
As there are many websites with job offers on the Internet, estimating their total number in all of Poland with reasonable accuracy would be an unrealistic task. Websites with job offers are also heterogeneous and can be divided into several types. In particular, these include:
- Country-specific specialised websites.
- Locally specialised websites, most often encompassing a city, a community (NUTS-5 region according to the European Union Nomenclature of Territorial Units for Statistics), or a voivodeship (NUTS-2 region).
- Websites specialising in branch-specific job offers, such as financial occupations or information technology (IT) jobs.
- Websites of Local Labour Offices (LLOs) operating in the NUTS-4 region, and official Public Information Bulletins (PIBs) containing most of the job offers in the public sector.
- Employer websites.
- Internet forums and social media groups (for example, Facebook groups).
- Websites that aggregate job offers from other portals.
Information obtained from the artefakt.pl website indicates that 97.3% of internet users in Poland use the Google search engine. We used Google Trends to find the most popular internet websites with job offers. Google Trends is an index of the volume of Google queries by geographic location and category (Choi and Varian 2012). This technique was suggested by Askitas and Zimmermann (2015) for the social sciences. Most often, after entering a search for local job offers, Google search shows links to countrywide websites. This is connected to the use of search engine positioning techniques by website administrators. Countrywide websites often contain paid advertisements of job offers.
Job offers on local websites often contain only a short description; detailed job information (e.g., employee requirements, skills, detailed qualifications) appears less often than on national websites. Employers using such sites are also less familiar with the professional terminology of the labour market and related education, so their offers may contain grammatical errors and may be less structured. They more often contain job offers for people without higher education. Local websites are more popular in smaller and medium-sized cities than in major cities. In addition to job offers, they often contain various local advertisements, and some sites also assume the role of local information portals that additionally enable job posting.
A significant share of branch websites allow job offers to be posted only after completing a registration process. These websites often allow free advertising, while fees are charged for placing promoted job offers. To a large extent, they also contain branch-specific articles, guides, and news.
Queries regarding the pages of Local Labour Offices (LLOs) and Public Information Bulletins (PIBs) constituted a relatively large part of all employee searches. However, they are less than half as popular a source of information about job offers as websites with national coverage. LLO and PIB websites are more popular in smaller and medium-sized areas than in large cities.
Job offers posted in Facebook groups are a separate category of information sources for employment seekers. These groups can be public or private. Access to public groups is transparent, and interaction with people posting job advertisements is permitted after joining the group. Private groups have limited access to job offers, as a person must be admitted to the group to view them. Groups on Facebook enable much greater interaction between job posters and job seekers. They usually have a regional specification, such as work in Kraków, or a specialisation, such as jobs for computer graphic designers.
Internet job seekers less frequently query the websites of a potential employer or specialised websites, especially in smaller cities. Specialty websites, local websites, and social networking sites occasionally also include jobs whose descriptions leave it unclear whether they relate to any formal contract. Websites that aggregate job offers from other websites contain the most job offers; however, they do not monitor which offers are obsolete and which are still valid. The information on these websites is organised in various ways, since they draw on several differently structured websites. Moreover, collecting data automatically from job aggregators is difficult because such websites usually block web crawlers. Two possible solutions to this limitation are to limit the number of data requests or to use proxy servers for web scraping.
We classified websites according to the frequency of search queries and the average numbers of job offers they contained. We also used data from a media tracking website (Wirtualne Media) on the most popular websites according to registered users, coverage, and page views. Finally, we chose twenty-five additional websites (see Table 6 in Appendix 2). These included mostly national websites, but also a few sites that aggregate job offers and websites that contain local subsites (e.g., the OLX portal). These websites cover both national and local job offers, bigger cities and small towns, and ensure a sufficient quantity of job offers. We excluded other websites for reasons including insufficient coverage of job offers containing the necessary detailed information (such as required skills), or information that was not readily extractable (e.g., Facebook groups).
3.2 Data collection procedure
Collection of online data from multiple sources is extremely impractical to do manually, especially cyclically. To aggregate job offers automatically, we developed web scraping applications (data collection tools). The biggest advantage of web scraping is that, once the application is ready, it can be used multiple times without much additional interaction. However, the application needs to be designed for each website separately, taking into consideration the website's features, such as its structure, type (dynamic or static), the technologies used, and its limitations. Moreover, even a small change in a website may cause a critical error in the scraping tool, so monitoring of the applications is also an important part of the data collection process.
Since the collected data were mostly in Polish and English, we limited the analysis to these languages. At the initial stage of website analysis, a problem with the encoding of non-Latin characters (including Polish letters) on Windows operating systems was identified. The issue was resolved by deciding that all information obtained from the various portals would be saved using Unicode Transformation Format (UTF-8) encoding, the most common type of text encoding, thus ensuring consistency of data formatting.
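A minimal sketch of this encoding unification, in Python rather than the Java used in the production system; the fallback encodings (cp1250, ISO-8859-2) are assumptions chosen because they are common for Polish text on Windows, not details from the original implementation:

```python
def to_utf8(raw_bytes, candidate_encodings=("utf-8", "cp1250", "iso-8859-2")):
    """Decode scraped bytes, trying UTF-8 first, then code pages commonly
    used for Polish text; the returned str can be saved uniformly as UTF-8."""
    for enc in candidate_encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text, replacing undecodable bytes.
    return raw_bytes.decode("utf-8", errors="replace")
```

Once decoded, every offer is written out with a single encoding, so downstream parsing never has to guess.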
Before the programming platform was created, a thorough analysis of websites with job offers was carried out in order to recognise the structures and features of such websites. Some websites load page content dynamically during scrolling, while others require the user to be signed into a website-specific account to view job offers; specific software was required to address these situations. All of these features have a significant impact on the application design process and its sustained operation in the future. We noted that websites are characterised by:
- Unique user interface (UI) design.
- The necessity of authorisation (user account creation and login).
- Different paging mechanisms (e.g., selected websites provide a range of subpages, while other websites read subpages dynamically).
- Various naming conventions for regions, although every website used the NUTS-2 classification.
- Various interaction models (e.g., selection of information after completing a form, selection of data based on shared selection lists, and dynamic loading of data using buttons and text fields).
- Limited website availability at times.
- Data inconsistency (e.g., non-existent links, expired job offers, non-existent web pages), or incorrectly listed voivodeships of presented job offers.
- Restrictions on website traffic from the same IP address; if the page is downloaded or refreshed too often, the user is redirected to an authorisation page with a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) mechanism before the page will reload.
- Encrypted HTTPS (Hypertext Transfer Protocol Secure) traffic required for selected portals.
- Encoding of links and job titles in the URL (Uniform Resource Locator) standard, forcing a conversion to the UTF-8 standard.
- Technical issues (e.g., HTML (HyperText Markup Language) syntax errors, skipped tags, incorrect tag parameters, CSS file and/or JavaScript errors).
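As an illustration of the URL-encoding point above, percent-encoded Polish characters in links and job titles can be restored to UTF-8 text with the standard library; this is a Python sketch (the actual system was in Java) and the URL shown is hypothetical:

```python
from urllib.parse import unquote

def decode_offer_url(url):
    """Convert percent-encoded characters (e.g. Polish letters in job
    titles) back to UTF-8 text before storage."""
    return unquote(url, encoding="utf-8")
```

For example, `%C4%99` in a link is the UTF-8 percent-encoding of the letter "ę".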
Based on the observed limitations, specific requirements, and available technologies, we decided to develop an automated web crawler that resembles human activity. To meet the requirements for the processing of website data using different technologies while taking into account the limitations of each site, and to ensure interaction with each of the services, after the initial testing period, the target system architecture was determined as follows:
- The application shall be implemented using a high-level programming language such as Python, Java, or R; Java was selected due to the ease of migration between Windows, Linux, and MacOS systems.
- The application shall use selected frameworks providing access to the Web API (Application Programming Interface); for this purpose, Chrome Devkit was selected.
- The framework shall provide access to the DevKit data structure, ensuring two-way communication; Selenium WebDriver was selected for this purpose.
- Behaviour scenarios shall be developed for each website to ensure appropriate interaction with the selected website and to save data in a uniform format.
The developed system works automatically, based on behavioural patterns defined for each of the processed portals. Each pattern contains the following information:
- The home page of the portal;
- The voivodeship naming scheme;
- The interaction with the website needed to select job offers from the indicated region;
- A scheme for detecting links to full offers on individual subpages; we used regular expression matching for this purpose;
- A scheme for downloading the full content of the job offer for each link collected previously.
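The link-detection scheme can be sketched as follows (in Python rather than the Java implementation; the `/oferta/` path is a hypothetical example of the portal-specific pattern each behavioural profile would supply):

```python
import re

# Hypothetical per-portal pattern: hrefs pointing at full job-offer pages.
OFFER_LINK = re.compile(r'href="(/oferta/[^"]+)"')

def extract_offer_links(subpage_html):
    """Collect links to full job offers from one subpage's HTML."""
    return OFFER_LINK.findall(subpage_html)
```

Each portal's behaviour pattern would carry its own compiled expression, so the same collection loop works across differently structured sites.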
The data collection algorithm is performed on a weekly basis. This high frequency of data collection allows us to obtain job offers with a short life cycle (job offers that expire soon after publishing), and to obtain data from portals that for various reasons are not always available or are difficult to download from. The following describes the steps of the algorithm for downloading data from one portal (see also Fig. 2 in Appendix 1).
The first step of data collection is to ensure that the website responds (step 1 in Fig. 2). For various reasons the website could be unavailable and its pages would not load; in this case, a one-day delay is applied (step 2) and the website is checked again the next day (step 3). If the website is still unavailable, manual revision of the website is required (step 4); further data collection is not possible, so the cycle ends. There are multiple reasons why a website may not respond: it could have been closed by its owners, or our IP address may have been blocked due to a large number of requests per unit of time (usually per minute or hour).
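Steps 1–4 can be sketched as the following control logic (a Python illustration, not the production Java code; the `probe` callable stands in for an actual HTTP availability check):

```python
import time

def ensure_portal_available(probe, retry_delay_s=86_400, max_checks=2):
    """Probe the portal; on failure wait (one day in production) and probe
    again; if it is still down, flag it for manual revision (return False)."""
    for attempt in range(max_checks):
        if probe():                      # e.g. an HTTP request returning True/False
            return True
        if attempt < max_checks - 1:
            time.sleep(retry_delay_s)    # step 2: one-day delay before recheck
    return False                         # step 4: manual revision; cycle ends
```

If this returns `False`, the weekly cycle for that portal ends and the tool is inspected by hand.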
If the website responds, the data collection process begins. In steps 5.1–5.3 we collect all links to job offers for each region. The steps of automated link collection are as follows:
1. Load the main portal page.
2. Load the page interaction pattern, then download the information needed to interactively select the voivodeship.
3. Determine the number of website subpages on the basis of web page content analysis, using the developed HTML parser and processing algorithm.
When all links have been collected, we need to ensure that the process has finished successfully (step 6). Even small changes in website structure may cause an error. If an error occurred, the application needs to be checked and fixed (step 7). This step requires user interaction. Usually, such changes are small and do not require a major redesign of the data collection tool; however, some changes may be significant (e.g., changing from a static to a dynamic website), so such changes may require a major redesign of application architecture. After the application is fixed (step 8), steps 5 and 6 are repeated.
In the next step (step 9), for each link collected in step 5, the algorithm downloads full information about job offers. We do not scrape multiple job offers from a single data source simultaneously, as this would overload the website and may result in IP blocking. Multiple applications may download data from many sources in parallel, but as with multiple job offers, the applications must not poll the websites too often, or else they risk having access blocked from the requestor IP address, or having queries to the site treated as a DDoS (distributed denial of service) attack. Information about daily visits to a website can be obtained with tools such as Alexa and Similar Web in order to estimate the number of requests permitted, so as not to overload a website's hardware. The solution to this problem is to use delays between repeated website visits. A mechanism to delay sending requests to websites has been introduced into the application, with the ability to select a separate delay time for each website. Implementation of the applications in Java and the use of additional libraries available for various hardware platforms allow the system to run on Windows and Linux-based systems.
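The per-website delay mechanism can be sketched as follows (a Python illustration of the idea, not the Java implementation; the delay values shown are examples, not those used in production):

```python
import time

class PerSiteDelay:
    """Enforce a site-specific minimum interval between consecutive requests."""

    def __init__(self, delays, default=1.0):
        self.delays = delays              # e.g. {"praca.example.pl": 5.0} (hypothetical)
        self.default = default            # fallback interval in seconds
        self._last = {}                   # site -> time of the previous request

    def wait(self, site):
        """Sleep just long enough that requests to `site` are spaced out."""
        min_gap = self.delays.get(site, self.default)
        prev = self._last.get(site)
        if prev is not None:
            remaining = min_gap - (time.monotonic() - prev)
            if remaining > 0:
                time.sleep(remaining)
        self._last[site] = time.monotonic()
```

Calling `wait(site)` before every request keeps the request rate below each portal's tolerated threshold.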
The process of data collection for each link is as follows:
1. Check whether the job offer was already downloaded in a previous cycle (step 10). Since the cycle repeats every week, some links that have already been downloaded may be retrieved along with the new links. As it is pointless to download them again, such job offers are ignored and data collection continues with the next link.
2. If a job offer was not previously downloaded, the application downloads it (step 11).
3. The application verifies whether the full data of the job offer was collected successfully (step 12). If so, it moves to the next job offer; otherwise, the job offer needs to be downloaded again.
4. In the case of a download failure, the application checks whether this is the first attempt to redownload the job offer's data (step 13); a maximum of two redownload attempts is allowed. In some cases an error occurs and the downloading process stops, for example because the job offer is unavailable (access closed by its owners) or the link is broken or invalid; in such cases, the job offer will remain unavailable even after repeated attempts. In step 14, the application tries to download the data once again after a small delay (up to 2 min). If the error persists, the link is dropped (step 15), and the application continues with the next link.
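Steps 11–15 of the per-link loop can be sketched as the following retry logic (a Python illustration; `fetch` stands in for the actual HTTP download and the two-attempt limit mirrors the description above):

```python
import time

def download_offer(fetch, link, max_attempts=2, retry_delay_s=120):
    """Download one offer, retrying once after a short delay (step 14);
    a link that still fails is dropped and None is returned (step 15)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(link)                # step 11: e.g. an HTTP GET
        except OSError:
            if attempt == max_attempts:
                return None                   # step 15: drop this link
            time.sleep(retry_delay_s)         # step 14: wait, then retry
```

The caller simply skips links for which `None` is returned and moves on to the next one.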
The number of download iterations performed by the algorithm equals the total number of links (\(i=1,\dots ,N\), where \(N\) is the total number of links). In steps 9 and 16, the application ensures that all information about a job offer has been collected.
The last step (step 17) is the data export. At this point, we have collected the full information from web pages with job offers. Storing it was important for several reasons. Information published on websites often ceases to be available after some time, which would prevent access to historical data. Services change the structure of their portals with varying frequency, making it impossible to develop a single interaction and parsing pattern for all sites; by using per-portal interaction patterns in the application, we can easily make changes as well as preserve the behavioural patterns used for historical data. Portals also differ in how they can be accessed, and technical issues and many other situations may prevent access to content. By using an iterative data processing algorithm, the system retrieves as much data as possible at each iteration, and repeats iterations until a complete data set is obtained.
To prepare gathered information for analysis, we applied a parsing process to the data. The data stored in the standard HTML format were converted into plain text. Parsing removes such information as font size, colour, and other formatting tags unrelated to the intended job data. After this step, we could proceed to the analysis.
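The HTML-to-plain-text parsing step can be illustrated with Python's standard-library parser (the production system used Java; this simplified sketch also drops script and style content, which carries no job information):

```python
from html.parser import HTMLParser

class OfferTextExtractor(HTMLParser):
    """Strip tags (and script/style content) from an offer's HTML,
    keeping only the visible text."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip = 0           # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self._parts.append(data.strip())

def html_to_text(html):
    parser = OfferTextExtractor()
    parser.feed(html)
    return " ".join(parser._parts)
```

The result is plain text with formatting tags (font size, colour, etc.) removed, ready for the analysis step.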
3.3 Cleaning and extraction of relevant information
After the data has been collected and stored in HTML format (each file representing one job offer), it needs to be processed. Raw data cannot be analysed, so we extract the following information:
-
The name of the online job board that contains the job offer.
-
Original link to the job offer.
-
Title of the job offer.
-
Location (NUTS region).
-
Date of publication.
-
Type of employment contract.
-
Position level.
-
Offered remuneration.
-
Job type (full- or part-time).
-
Additional job benefits.
-
Full content of the job offer.
During the text extraction, it was found that some Polish letters may be incorrectly encoded. Due to situations such as this, it is crucial to control the encoding process. While analysing remuneration in job offers, it is important to convert any hourly wages to monthly wages, as monthly wages are mostly used in Poland. The currency type also must be checked for consistency. Websites may use various formats of publication and expiration dates. To aggregate job offers by date, a common format must be decided upon for conversion (such as YYYY-MM-DD). Since job offers were collected in Polish and English, the data needed to be written in one common format. Examples include voivodeship names and decimal numbers, which in Polish are written with a ‘,’ separator and in English with a ‘.’ separator. Other information can be extracted from the job offer text using dedicated natural language processing techniques.
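The normalisation rules above (decimal separators, hourly-to-monthly wages, date formats) can be sketched as follows; the 168-hour monthly norm and the list of accepted date formats are illustrative assumptions, not values taken from the original system:

```python
from datetime import datetime

HOURS_PER_MONTH = 168            # assumed full-time monthly norm (illustrative)

def parse_decimal(value):
    """Accept both Polish ('25,50') and English ('25.50') decimal notation."""
    return float(value.replace(" ", "").replace(",", "."))

def hourly_to_monthly(rate):
    """Convert an hourly wage to a monthly wage, as used in Poland."""
    return rate * HOURS_PER_MONTH

def normalise_date(value, formats=("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y")):
    """Convert the portals' various date formats to a common YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")
```

With every offer's wage and dates in one format, offers from different portals can be aggregated directly.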
The main purpose of collecting and processing data obtained from job offers was to identify occupations, qualifications and skills based on the contents of job offers provided by employers, on the basis of publicly available website information. We used official classifications for this purpose. Direct comparison of benchmark data included in dictionaries of classifications with actual job offers did not produce the expected results. The level of recognition of certain phrases from classifications in text written by employers was very low. This inaccuracy of exact matching of the content entered by employers to the expected dictionary resulted from different conditions:
- Differing sentence forms;
- Differing word forms;
- Meaningless words ('stop words', which prevent further lexical analysis);
- The use of abbreviations;
- The use of synonyms;
- Incomplete matching;
- Typographical errors;
- Polish-language words written using only English characters;
- Excessive whitespace characters (spaces, tabs, newline characters, etc.);
- HTML entities in the text specifying special characters (e.g., &gt;, &lt;, and &amp; to mean >, <, and &, respectively).
The removal of excess whitespaces was performed using an appropriate regular expression, which for the entire text string finds any sequence of consecutive whitespaces and replaces the sequence with a single space. In addition, punctuation marks and other non-alphanumeric characters were removed. Special HTML tags remained unchanged.
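The whitespace and punctuation cleanup can be expressed as the following regular-expression sketch (Python rather than the Java implementation; the sample string in the test is illustrative):

```python
import re

_NON_ALNUM = re.compile(r"[^\w\s]")   # punctuation and other symbols
_WHITESPACE = re.compile(r"\s+")      # any run of consecutive whitespace

def clean_text(text):
    """Replace punctuation with spaces, then collapse every whitespace
    sequence to a single space, as described above."""
    text = _NON_ALNUM.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()
```

Note that `\w` in Python matches Unicode letters by default, so Polish characters survive the cleanup.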
Of these listed issues, the second (differing word forms) decreased matching accuracy the most. The Polish language contains many word variants: nouns decline through seven cases, each of which may take a different form, and the forms of the verb "to be" differ across persons. Such variants are not exceptional; they are common in Polish.
In order to improve the level of recognition of job offer texts and the share of classified job offers, we decided to preprocess the actual job titles and content of job offers into a form that allowed much more accurate matching of the content to the dictionary template. We used a morphological library (Morfologik-stemming-1.9.0) to unify the word forms to be analysed. The library has extensive capabilities for analysing Polish words and contains 4,800,432 words with different variations. Some words were excluded from the processing mechanism due to sentence semantics.
The application uses a mechanism for converting words from any form to the basic form (lemmatisation); that is, verbs to their infinitive forms and nouns to their nominative singular forms. We lemmatised the text of the dictionaries and of the job offers (including titles).
At this stage, our most important task was to find specific educational traits within the text of job offers. Because job offers can use various words to describe the same trait (occupation, qualification, skill, etc.), the algorithm must address the situation in which a trait is mentioned in a sentence with different words than those in the dictionary. To solve this problem, we first prepared a list of occupations, qualifications, and skills using the European Skills, Competences, Qualifications, and Occupations (ESCO) classification (European Commission 2020) and the International Standard Classification of Education (ISCED-F 2013) across fields of education and qualifications (UNESCO 2013). The advantage of ESCO as the classification of skills is that it supplies a dictionary of over 13,000 transversal and job-related skills, significantly more than alternative classifications (Pater et al. 2019). It also uses the International Labour Organisation ISCO classification of occupations, expanding the 4-digit codes of occupation groups into 6-digit occupation codes, which supplies almost 3,000 occupations. ISCED-F 2013 is the most commonly used qualification classification. Alternative classifications such as ESCO provide qualification titles in excessive detail, containing the exact source of the qualification, such as "Bachelor degree in Primary School Education. Department of Primary Education. Faculty of Education of Florina. University of Western Macedonia". Companies expect future workers to have finished a specific faculty, in this case 'education', but usually do not specify an exact school or university a person must hold a degree from.
These official classifications formed only the basic dictionaries, because companies can (and often do) use many synonyms, casual terms, and abbreviations to describe the occupations, qualifications, and skills they seek; thus, we built dictionaries of such synonyms. We supplemented the ESCO dictionary of transversal skills with synonyms from online dictionaries. The basic job-related skills dictionary was supplemented with synonyms also provided by ESCO. The dictionary of educational fields and qualifications was left unchanged due to its specificity. We supplemented the ESCO dictionary of occupations with synonyms provided by Statistics Poland, and with synonyms of vocational occupations provided by the Educational Research Institute in Warsaw. The dictionary was further supplemented with the most common job titles in the Central Job Offers Database submitted by Local Labour Offices. This database contains job offers to which a Local Labour Office clerk assigned an ESCO occupation code; individual assignment of a code by a qualified LLO clerk to some extent ensured that the coding was correct, and to further increase this probability, we used the most common of these job titles.
In the dictionary of occupations, we encoded as '000000' the job offers that did not indicate an occupation, but instead a specific type of work, such as seasonal, casual, or occasional work. We encoded as '999999' any job that is not perceived as employment by official statistics; these included internships, contracts for specific work, and contracts of mandate.
For all classifications, we used numeric codes. We sorted the dictionaries from most specific phrases to most general; for example, 'sales manager' occurred before 'manager'. The matching algorithm tried to match each phrase from a dictionary with the job offer text, in order from most specific to most general. In the case of managerial occupations, all specific types of managers were first matched against the job offer text; only if that failed was the word 'manager' alone checked. The exceptions were codes '999999', which was searched first, and '000000', which was searched last. This ensured that a job not considered as employment would not be counted as valid but would be noted in the database, and that some short-term jobs, mostly without high requirements, would also be counted, but as a separate category. Sometimes two or more identical phrases were assigned different codes. In such cases, the trait was counted as 1/n for each code, where n is the number of codes sharing the phrase, so the fractions always summed to 1, meaning one occurrence.
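The most-specific-first matching with fractional counting can be sketched as follows (a Python illustration; the phrases and numeric codes in the test are hypothetical, not actual ESCO codes):

```python
def match_trait(text, dictionary):
    """dictionary: ordered list of (phrase, [codes]) pairs, with '999999'
    entries first, then phrases from most specific to most general, and
    '000000' last. Returns {code: fraction} for the first phrase found in
    the text; a phrase shared by n codes contributes 1/n to each code,
    so the fractions sum to one occurrence."""
    for phrase, codes in dictionary:
        if phrase in text:
            return {code: 1 / len(codes) for code in codes}
    return {}
```

Because 'sales manager' precedes 'manager' in the dictionary, a specific match always wins over a general one.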
After finishing the dictionaries, we analysed the matching results, checking them individually in a random sample of 1,000 job offers. During this procedure we created a dictionary of exceptions, containing mostly single words with ambiguous meanings; whenever possible, these words were substituted with phrases that clarified the context. We also found and corrected mistakes in the morphological library itself. As a result, for the purposes of the study we collected 40 million job offers during 2017–2019, from which we extracted the educational profiles required by employers from potential job seekers. To calculate the degree of mismatch, we also needed corresponding information on labour supply, which is described in the next section.
3.4 Labour supply survey
The main challenge in designing a method of continuous mismatch monitoring in the labour market, especially at a detailed level, was to prepare a tool that allows quick aggregation of data from large representative samples of people of working age. Moreover, the chosen technique should be compatible with the other, equally important source of information: online job advertisements. The choice of the Internet as a source of job advertisements (labour demand), as well as the means of conducting labour supply research among people of working age, results from the fact that in 2016, 93.7% of enterprises had internet access (including 93.2% with broadband), as did 80.4% of households (of which 75.7% used broadband). Considering the range of mobile internet services, a significant portion of Polish society is within the reach of the Internet and uses it.
In order to calculate labour market mismatch, we needed the characteristics of labour supply (job seekers) to compare with the demand for labour (observed through job offers). The labour supply data was obtained from a CAWI (Computer-Assisted Web Interview) survey of Poles aged 18–65. The study was conducted on a nationwide random-quota sample of N = 16,119, where quotas for gender, age, and size of the place of residence were consistent with those in the Polish population (Table 1). The questionnaire included a division of respondents into people currently working and not working. Even though some respondents were not looking for a job at the time of the survey, they could still be considered part of the population we aimed to study; that is, people potentially interested in finding a job. The survey was conducted during an unprecedented boom in both the Polish economy and its job vacancy market, with many employers having unfilled positions. The market was thus full of employment possibilities, an ideal situation for this study. The questionnaire consisted of four main sections: demographics, work situation, qualifications, and competencies.
Table 1
The structure of the sample for the population of Poles aged 18–65

Age group | Share
18–24 | 18%
25–34 | 22%
35–44 | 20%
45–55 | 24%
56–65 | 16%

Place of residence | Share
Village | 36%
City < 20 thous. | 14%
City 20–99 thous. | 20%
City 100–500 thous. | 18%
City > 500 thous. | 12%
Since we used online job offers to measure vacancies, the survey of potential job seekers included only internet users. The CAWI survey was conducted during September and October 2017.