1 Introduction
1.1 Resume parsing
1.2 Named entity recognition using deep learning
1.3 Named entity recognition for semi-structured data
- We demonstrated the use of a modified semi-supervised technique for parsing institute and degree names. Instead of following the traditional semi-supervised approach, we introduced a correction module that rectifies the model's predictions. These corrected predictions are added back to the original seed set, thereby increasing its size. On retraining, this procedure improves accuracy, precision, and recall over the previously trained model.
- We achieved high performance in recognizing degrees and institutes in resumes without requiring large annotated datasets.
2 Methodology
2.1 Preprocessing
- Conversion of PDF resumes to JSON using PDF2JSON [39]
2.2 Corpus
Tag | Meaning
---|---
U-INST | Single-word institute name
U-DEG | Single-word degree name
B-INST | Start of an institute name
I-INST | Continuation of the institute name
B-DEG | Start of a degree name
I-DEG | Continuation of the degree name
O | Others
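The tag scheme above follows the BILOU convention. As an illustration, a small helper (the function and span format below are assumptions, not the paper's code) can convert a token list plus entity spans into these tags:

```python
def bilou_tags(tokens, spans):
    """Assign BILOU-style tags (U-/B-/I- plus O) to a token list.

    spans: list of (start, end, label) token-index ranges, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"      # single-word entity
        else:
            tags[start] = f"B-{label}"      # entity start
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"      # continuation
    return tags

tokens = ["Studied", "In", "Manipal", "Institute", "Of", "Technology"]
print(bilou_tags(tokens, [(2, 6, "INST")]))
# ['O', 'O', 'B-INST', 'I-INST', 'I-INST', 'I-INST']
```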
2.3 Data cleaning
- Resolving unbalanced parentheses
- Normalizing irregular spacing between words
- Replacing &amp; with &
- Removing unwanted characters (including non-ASCII characters)
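The cleaning steps above can be sketched in Python; the specific regexes and the parenthesis heuristic below are illustrative assumptions, not the paper's exact rules:

```python
import re

def clean_text(text):
    """Illustrative data-cleaning steps (assumed rules, not the paper's code)."""
    text = text.replace("&amp;", "&")                # undo HTML-escaped ampersands
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    text = re.sub(r"\s+", " ", text).strip()         # collapse irregular spacing
    # Resolve unbalanced parentheses by dropping unmatched closers (simple heuristic).
    out, depth = [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                continue
            depth -= 1
        out.append(ch)
    return "".join(out)

print(clean_text("B.Tech  (CSE&amp;IT))  – Manipal"))
# → 'B.Tech (CSE&IT) Manipal'
```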
2.4 Model
2.4.1 Classification model
- Word Embedding Layer: Words used in resumes, such as institute or degree names, may not be present in a pre-built word embedding. Therefore, we created a new set of word embeddings trained on our corpus with Keras's help. We produced two word-embedding layers, one for the classification entities and another for their respective POS tags, and concatenated these two to form the base model. We added a Dropout layer with a probability of 0.1 to prevent over-fitting, as this value experimentally gave us the best result. We used 10% of the dataset as the development set.
- CNN Layer: A 1-D convolutional neural network layer (since text is linear data) for extracting character-level features is concatenated to the base above.
- Bi-LSTM Layer: We appended a Bi-LSTM layer comprising 100 hidden neurons to the model.
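The three layers above can be sketched with the Keras functional API. Vocabulary sizes, embedding dimensions, and the convolution width below are placeholders (the paper does not report them), and as a simplification the 1-D convolution here runs over the token embeddings rather than separate character-level features:

```python
from tensorflow.keras import layers, models

# Placeholder sizes -- assumptions, not the paper's hyperparameters.
VOCAB, POS_VOCAB, N_TAGS, MAX_LEN = 20000, 50, 7, 100

words = layers.Input(shape=(MAX_LEN,), name="words")
pos = layers.Input(shape=(MAX_LEN,), name="pos_tags")

# Two embedding layers -- one for word tokens, one for POS tags --
# concatenated to form the base model, with dropout p=0.1 as in the paper.
w_emb = layers.Embedding(VOCAB, 100)(words)
p_emb = layers.Embedding(POS_VOCAB, 25)(pos)
base = layers.Dropout(0.1)(layers.Concatenate()([w_emb, p_emb]))

# 1-D convolution over the sequence for local features.
conv = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(base)

# Bi-LSTM with 100 hidden neurons, then a per-token tag classifier.
lstm = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(conv)
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(lstm)

model = models.Model(inputs=[words, pos], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The output is one softmax distribution over the seven tags per token position.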
2.4.2 Correction module (Processing predictions made on the unlabeled dataset)
Case 1
Document | Studied | In | Manipal | Institute | Of | Technology
---|---|---|---|---|---|---
Predicted tags | O | O | B-INST | I-INST | I-INST | O
Corrected tags | O | O | B-INST | I-INST | I-INST | I-INST
Case 2
Document | Studied | In | Manipal | Institute | Of | Technology
---|---|---|---|---|---|---
Predicted tags | O | B-INST | I-INST | I-INST | I-INST | O
Corrected tags | O | O | B-INST | I-INST | I-INST | I-INST
Case 3
Document | Studied | In | Manipal | Institute | Of | Technology
---|---|---|---|---|---|---
Predicted tags | O | O | B-INST | I-INST | O | I-INST
Corrected tags | O | O | B-INST | I-INST | I-INST | I-INST
Case 1
Document | Bachelor | Of | Technology | from | 2015 | 2019
---|---|---|---|---|---|---
Predicted tags | B-DEG | I-DEG | I-DEG | I-DEG | O | O
Corrected tags | B-DEG | I-DEG | I-DEG | O | O | O
Case 2
Document | BTech | in | IT | from | 2015 | 2019
---|---|---|---|---|---|---
Predicted tags | B-DEG | I-DEG | O | O | O | O
Corrected tags | U-DEG | O | O | O | O | O
Case 3
Document | Studied | In | Manipal | Institute | Of | Technology
---|---|---|---|---|---|---
Predicted tags | O | O | B-INST | I-INST | B-DEG | I-DEG
Corrected tags | O | O | B-INST | I-INST | I-INST | I-INST
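The cases above suggest simple rule-based corrections. The sketch below is hypothetical (the paper's exact rules are not reproduced here) and covers two of them: filling an O gap inside an entity span, and demoting an isolated B- tag to U- for a single-word entity:

```python
def correct_tags(tags):
    """Illustrative corrections to predicted BILOU tags (assumed rules).

    Rule 1: an O sandwiched between tags of the same entity type is
            absorbed into that entity (e.g. 'Of' inside an institute name).
    Rule 2: a B- tag not followed by a matching I- becomes a U- tag,
            i.e. a single-word entity (e.g. 'BTech' -> U-DEG).
    """
    tags = list(tags)
    # Rule 1: fill single-token O gaps inside an entity span.
    for i in range(1, len(tags) - 1):
        left, right = tags[i - 1], tags[i + 1]
        if (tags[i] == "O" and left[:2] in ("B-", "I-")
                and right[:2] == "I-" and left[2:] == right[2:]):
            tags[i] = "I-" + left[2:]
    # Rule 2: demote isolated B- tags to U-.
    for i, t in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if t[:2] == "B-" and nxt != "I-" + t[2:]:
            tags[i] = "U-" + t[2:]
    return tags

print(correct_tags(["O", "O", "B-INST", "I-INST", "O", "I-INST"]))
# ['O', 'O', 'B-INST', 'I-INST', 'I-INST', 'I-INST']
```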
2.4.3 Corpus expansion and retraining
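The expansion-and-retraining loop described in the introduction (predict on unlabeled data, rectify with the correction module, grow the seed set, retrain) can be sketched as follows. The `model` interface, batch size, and function names are placeholders, not the paper's implementation:

```python
def self_train(model, seed_set, unlabeled, correct, iterations=12):
    """Semi-supervised loop with a correction module (illustrative sketch).

    model:   object with .fit(labeled_data) and .predict(batch) (assumed interface)
    correct: rule-based correction applied to predicted tag sequences
    """
    labeled = list(seed_set)
    for _ in range(iterations):
        model.fit(labeled)                        # (re)train on the current corpus
        batch, unlabeled = unlabeled[:100], unlabeled[100:]
        predictions = model.predict(batch)
        # Rectify predictions before adding them back to the seed set,
        # so the training corpus grows with each iteration.
        corrected = [correct(tags) for tags in predictions]
        labeled.extend(zip(batch, corrected))
        if not unlabeled:
            break
    return model, labeled
```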
3 Experimental results and discussion
Iteration | Precision (%) | Recall (%) | F1 score |
---|---|---|---
0 (Seed) | 39.19 | 52.25 | 44.79 |
1 | 51.16 | 59.46 | 55.00 |
4 | 51.20 | 57.66 | 54.24 |
8 | 56.39 | 67.57 | 61.48 |
12 | 64.52 | 72.07 | 68.09 |
Iteration | Precision (%) | Recall (%) | F1 score |
---|---|---|---
0 (Seed) | 65.12 | 73.68 | 69.14 |
1 | 70.23 | 80.70 | 75.10 |
4 | 81.03 | 82.46 | 81.74 |
8 | 75.20 | 82.46 | 78.66 |
12 | 78.26 | 78.95 | 78.60 |
Iteration | Accuracy (%) | Precision (%) | Recall (%) | F1 score |
---|---|---|---|---
0 (Seed) | 82.08 | 49.06 | 57.78 | 53.06 |
1 | 86.09 | 58.70 | 64.44 | 61.44 |
4 | 83.42 | 49.45 | 60.44 | 54.40 |
8 | 87.87 | 62.71 | 65.78 | 64.21 |
12 | 87.99 | 69.13 | 70.67 | 69.89 |
Iteration | Accuracy (%) | Precision (%) | Recall (%) | F1 score |
---|---|---|---|---
0 (Seed) | 85.71 | 51.26 | 63.11 | 56.57 |
1 | 89.14 | 60.77 | 70.22 | 65.15 |
4 | 90.15 | 65.56 | 70.22 | 67.81 |
8 | 90.15 | 65.50 | 75.11 | 69.98 |
12 | 92.06 | 71.13 | 75.56 | 73.28 |
Accuracy (%) | Precision (%) | Recall (%) | F1 score |
---|---|---|---
89 | 74 | 77 | 75 |