Introduction
Importance of privacy preservation
Application areas of PPiRL
Our contributions
Organization of the paper
Literature review
Record linkage
Privacy-preserving record linkage (PPRL)
Incremental record linkage
Background knowledge and problem formulation
PPiRL, an end-to-end framework
Data pre-processing
Privacy preservation
Phonetic encoding
K-anonymization method
Blocking
Feature set for blocking
Clustering
Impact of incremental approach in PPRL
Evaluation
Linkage evaluation
Privacy evaluation
Experimental results
Data pre-processing
Feature selection
Features in the raw dataset | Selected identifiable features |
---|---|
Invoice No | Patient Name |
Invoice Date | Gender |
Patient Name | Age |
Gender | Contact Number |
AGE | Address |
Contact Number | |
Address | |
Test Name | |
Delivery Date | |
Department | |
Sample | |
Test Attribute | |
Result | |
Unit | |
Reference Value | |
Normalization of age values
Age | Actual Details | Scaled Age |
---|---|---|
7 Y 4 M | 7 years 4 months | 7 years |
11 Y 6 M | 11 years 6 months | 12 years |
3 D | 3 days | 1 year |
9 M | 9 months | 1 year |
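The scaling rule is not stated explicitly, but the four rows above are consistent with rounding to the nearest whole year and reporting anything under one year as 1 year. A minimal sketch under that assumption follows; the function name `scale_age`, the `Y`/`M`/`D` token format, and the 365-day year are illustrative choices, not taken from the paper.

```python
import math
import re
from fractions import Fraction

# Assumed conversion factors (exact fractions avoid float rounding at x.5).
_UNIT_IN_YEARS = {"Y": Fraction(1), "M": Fraction(1, 12), "D": Fraction(1, 365)}


def scale_age(raw: str) -> int:
    """Convert strings such as '11 Y 6 M' to whole years (hypothetical helper)."""
    parts = re.findall(r"(\d+)\s*([YMD])", raw.upper())
    if not parts:
        raise ValueError(f"unrecognized age string: {raw!r}")
    years = sum(int(number) * _UNIT_IN_YEARS[unit] for number, unit in parts)
    # Round half up, then enforce the assumed minimum of 1 year.
    return max(1, math.floor(years + Fraction(1, 2)))
```

Applied to the table rows, `scale_age("7 Y 4 M")` gives 7, `scale_age("11 Y 6 M")` gives 12, and both `scale_age("3 D")` and `scale_age("9 M")` give 1.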
Address standardization
Extracted Address | Mapped Geocode |
---|---|
Anowara, Chittagong, Chittagong | 201504 |
Saturia, Manikganj, Dhaka | 305670 |
Anowara, Chittagong, Chittagong | 201504 |
Patenga, Chittagong, Chittagong | 201565 |
Fakirhat, Bagerhat, Khulna | 400134 |
Rampal, Bagerhat, Khulna | 400173 |
Birampur, Dinajpur, Rangpur | 552710 |
Barlekha, Maulvi Bazar, Rangpur | 605814 |
Adamdighi, Bogra, Rajshahi | 501006 |
Companiganj, Sylhet, Sylhet | 609127 |
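The mapping above can be read as a lookup from a normalized (upazila, district, division) triple to a geocode. The sketch below assumes exactly that and nothing more; the dictionary holds only a few rows copied from the table, and `to_geocode` is a hypothetical helper, not the paper's implementation.

```python
from typing import Optional

# A few rows copied from the table above; a real lookup would be far larger.
GEOCODES = {
    ("anowara", "chittagong", "chittagong"): "201504",
    ("saturia", "manikganj", "dhaka"): "305670",
    ("patenga", "chittagong", "chittagong"): "201565",
    ("fakirhat", "bagerhat", "khulna"): "400134",
}


def to_geocode(address: str) -> Optional[str]:
    """Normalize case and spacing, then look the address up (None if unknown)."""
    key = tuple(part.strip().lower() for part in address.split(","))
    return GEOCODES.get(key)
```

Because the key is normalized first, spelling variants that differ only in case or spacing map to the same geocode, which is why the two "Anowara" rows share the value 201504.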
Experimental setup
- nameGist, the phonetic encoding algorithm, groups similar-sounding names together and thereby also masks the ‘Name’ feature.
- K-anonymization, the privacy-preserving algorithm, generalizes the ‘Contact’ and ‘Address’ features.
- NAIVE, the incremental baseline algorithm, compares each inserted record with the existing clusters, then either adds it to an existing cluster or creates a new cluster for it.
- Correlation clustering applies a correlation penalty to select the best clustering.
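The NAIVE baseline described in the setup can be sketched as follows. The similarity function (fraction of shared fields that agree), the averaging over cluster members, the threshold of 0.75, and all names are assumptions made for illustration; the paper does not specify them here.

```python
from typing import Dict, List

Record = Dict[str, str]


def field_agreement(a: Record, b: Record) -> float:
    """Fraction of shared fields with identical (encoded) values; assumed measure."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def naive_insert(clusters: List[List[Record]], record: Record,
                 threshold: float = 0.75) -> int:
    """Add `record` to the most similar cluster or open a new one.

    Returns the index of the cluster that received the record.
    """
    best_idx, best_score = -1, 0.0
    for idx, members in enumerate(clusters):
        # Average similarity between the new record and the cluster members.
        score = sum(field_agreement(record, m) for m in members) / len(members)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx >= 0 and best_score >= threshold:
        clusters[best_idx].append(record)
        return best_idx
    clusters.append([record])
    return len(clusters) - 1
```

Each insertion touches every existing cluster once, so the cost of an update grows with the number of clusters rather than requiring the whole dataset to be re-linked.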
Linkage evaluation
External validation measure results
Noise | Correlation clustering F-measure | Correlation clustering time (s) | Naïve clustering F-measure | Naïve clustering time (s) |
---|---|---|---|---|
0% | 94.21% | 4.06 | 94.10% | 2.54 |
5% | 91.30% | 6.41 | 92.20% | 3.85 |
10% | 89.40% | 7.12 | 89.80% | 5.2 |
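For readers who want to reproduce an external validation score of this kind, one common choice is the pairwise F-measure, which treats every pair of records placed in the same cluster as a predicted match. Whether the paper uses this exact variant is an assumption; the function names below are illustrative.

```python
from itertools import combinations
from typing import Sequence, Set, Tuple


def same_cluster_pairs(labels: Sequence[int]) -> Set[Tuple[int, int]]:
    """All index pairs (i, j), i < j, whose records share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_f_measure(predicted: Sequence[int], truth: Sequence[int]) -> float:
    """Harmonic mean of pairwise precision and recall (0.0 if undefined)."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    true_positives = len(pred_pairs & true_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(true_pairs)
    return 2 * precision * recall / (precision + recall)
```

For example, with ground truth `[0, 0, 0, 1, 1]` and prediction `[0, 0, 1, 1, 1]`, two of the four predicted pairs are correct and two of the four true pairs are found, so precision, recall, and F-measure are all 0.5.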
Internal validation measure results
Clustering | Evaluation measure | Penalty |
---|---|---|
Naïve Clustering | Correlation Penalty | 116.92 |
Naïve Clustering | DB Index Penalty | 71.96 |
Correlation clustering | Correlation Penalty | 115.02 |
Correlation clustering | DB Index Penalty | 52.12 |
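The formula behind the correlation penalty is not given in this section. The sketch below assumes the standard correlation clustering objective: a pair in the same cluster is charged its dissimilarity, and a pair split across clusters is charged its similarity. The similarity matrix `sim` (values in [0, 1]) and the function name are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def correlation_penalty(labels: Sequence[int],
                        sim: Sequence[Sequence[float]]) -> float:
    """Sum of disagreements between a clustering and pairwise similarities."""
    penalty = 0.0
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            penalty += 1.0 - sim[i][j]   # dissimilar records kept together
        else:
            penalty += sim[i][j]         # similar records kept apart
    return penalty
```

Under this definition a lower penalty indicates a clustering that agrees better with the pairwise similarities, which matches the direction of the comparison in the table.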
Privacy evaluation
Frequency analysis
Dictionary attack
Information gain
Attribute | Entropy (bits/character) |
---|---|
Name | 3.2 |
Gender + birth range | 2.25 |
Address | 3.32 |
Concatenated value | 4.25 |
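Entropy in bits per character can be estimated from the character frequencies of an attribute's values. The sketch below uses plain Shannon entropy over single characters; that the paper computes its figures this way is an assumption, and the function name is illustrative.

```python
import math
from collections import Counter
from typing import Iterable


def entropy_bits_per_char(values: Iterable[str]) -> float:
    """Shannon entropy of the character distribution across all values."""
    counts = Counter("".join(values))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A column whose values use four characters equally often, such as `["abcd"]`, yields 2.0 bits per character, while a column made of a single repeated character yields 0.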
Adversary model | Information gain |
---|---|
Honest-but-curious (HBC) | 19.91% |
Comparison of PPiRL with batch-PPRL
Update | PPiRL F-measure | PPiRL time (s) | Batch F-measure | Batch time (s) |
---|---|---|---|---|
Initial | 94.21% | 2.54 | 95.70% | 10.33 |
Increment-I | 91.30% | 3.85 | 93.20% | 14.5 |
Increment-II | 89.40% | 5.2 | 91.40% | 16.2 |
Comparison of PPiRL with incremental record linkage
Feature | IRL | PPiRL |
---|---|---|
Privacy Preservation Technique | None | Phonetic encoding & Generalization |
Information gain by other party | Full | 19.91% |
Linkage quality | 95% | 91% |