Introduction
Background research
Subject | Number of resources |
---|---|
Big data in general | 152 |
Business/corporate | 37 |
RDBMS | 40 |
Legal, social, ethical | 23 |
Metadata | 56 |
Productivity tools | 42 |
Volume | 23 |
Velocity | 14 |
Variety | 13 |
Heterogeneous data (in Big Data) | 14 |
Addressing the challenge
Data origination—confidentiality
Confidential data | Regular expression |
---|---|
Mastercard | (?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12} |
Visa | 4[0-9]{12}(?:[0-9]{3})? |
American Express | 3[47][0-9]{13} |
Diners Club | 3(?:0[0-5]|[68][0-9])[0-9]{11} |
Gulf Countries Civil ID | \d{1}(?!00)\d{2}(?!00)\d{2}(?!00)\d{2}(?!0000)\d{4} |
Greek Civil ID | [Α-Ω]{1,2}[0-9]{6} |
International Passport | [A-Z0-9<]{9}[0-9]{1}[A-Z]{3}[0-9]{7}[A-Z]{1}[0-9]{7}[A-Z0-9<]{14}[0-9]{2} |
IBAN | [a-zA-Z]{2}[0-9]{2}[a-zA-Z0-9]{4}[0-9]{7}([a-zA-Z0-9]?){0,16} |
eMail | (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]) |
MAC Address | ([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2}) |
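These patterns can be compiled into a first-pass scanner; a minimal Python sketch (the issuer labels are the names conventionally associated with these well-known card patterns, and the sample usage below is illustrative; booster metrics are applied to the raw matches afterwards):

```python
import re

# Patterns from the table above (first-pass match only).
PATTERNS = {
    "mastercard": r"(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}",
    "visa": r"4[0-9]{12}(?:[0-9]{3})?",
    "amex": r"3[47][0-9]{13}",
    "mac": r"([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})",
}

def scan(text):
    """Return (label, match) pairs for every pattern occurrence."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            hits.append((label, m.group(0)))
    return hits
```

For example, `scan("card 4111111111111111 from 00:1a:2b:3c:4d:5e")` reports both a Visa-format number and a MAC address, each of which then goes through the booster stage before any sanitisation decision.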
Classification | Search Condition | Booster Metrics |
---|---|---|
Card | List of RegEx expressions to get the initial data match | Linguistic boundary characters (e.g. space, comma, quotation marks); Luhn algorithm for check-digit verification; institutional bank identification numbers (BINs) |
Lists | List of RegEx expressions to get the initial data match | Monitor terms proximity. The distance of the occurrence with words like password, account, card, credit, id etc. is calculated |
Absolute XML | – | List of specific XML tags, e.g. <CivilId>ID123456</CivilId> |
Relative XML | – | List of XML tags containing terms, e.g. <*Passport*>, where * indicates any number of any character |
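The Luhn check-digit verification used as a booster metric is straightforward to implement; a minimal sketch:

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right,
    subtract 9 from any result greater than 9, and require the
    total to be divisible by 10."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A RegEx hit that also passes the Luhn check is far more likely to be a real card number than a random digit run, which is exactly the false-positive reduction the booster stage targets.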
Data format—delimiter determination
- The set utilised multiple delimiters for segmentation. Although a file was, for instance, delimited with a comma, one of its fields contained multiple values delimited with semicolons; as a result, both delimiters exhibited high degrees of conformity.
- Records extended across multiple lines while being enclosed in double quotes.
- Many of the fields contained long, document-sized text whose length varied widely, from a couple of lines to hundreds of lines.
The following metrics were computed from each file:
- The number of lines read from the file.
- The number of lines having the same number of columns.
- Min, Max and Mean of the Standard Deviation of the delimiter's absolute position across all lines read, per position.
- Min, Max and Mean of the Coefficient of Variation of the delimiter's absolute position across all lines read, per position.
- Min, Max and Mean of the Standard Deviation of the delimiter's relative position (distance from the previous delimiter) across all lines read, per position.
- Min, Max and Mean of the Coefficient of Variation of the delimiter's relative position (distance from the previous delimiter) across all lines read, per position.
- The number of identified delimiters in the file.
- Whether the number of columns is consistent across all lines read (Boolean metric).
- The average Standard Deviation for the absolute position of the delimiter.
- The average Coefficient of Variation for the absolute position of the delimiter.
- The average Standard Deviation for the relative position of the delimiter.
- The average Coefficient of Variation for the relative position of the delimiter.
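As an illustration of the positional metrics, the per-position Standard Deviation and Coefficient of Variation can be derived directly from the delimiter offsets in a set of already-read lines; a minimal sketch (function and variable names are illustrative):

```python
from statistics import mean, pstdev

def delimiter_position_metrics(lines, delimiter):
    """For each delimiter position index (1st, 2nd, ...), compute the
    standard deviation and coefficient of variation of its absolute
    offset across all lines, then summarise with (min, max, mean)."""
    # Collect the absolute offset of the k-th delimiter in each line.
    by_position = {}
    for line in lines:
        offsets = [i for i, ch in enumerate(line) if ch == delimiter]
        for k, off in enumerate(offsets):
            by_position.setdefault(k, []).append(off)
    stds, cvs = [], []
    for offs in by_position.values():
        s, m = pstdev(offs), mean(offs)
        stds.append(s)
        cvs.append(s / m if m else 0.0)
    def _summary(xs):
        return (min(xs), max(xs), mean(xs)) if xs else (0, 0, 0)
    return {"std": _summary(stds), "cv": _summary(cvs)}
```

A true delimiter tends to produce low variation in column positions across lines, so these summaries are discriminative features for the neural network described later.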
Experimental methodology
- The main objective for Data Origination—Confidentiality was to confirm that the high number of false positives generated by conventional techniques such as RegEx can be minimized, thus making the identification accurate enough for further actions, e.g. masking. For the Proof of Concept (PoC), datasets from application logs, audit logs and network captures were used. This set was selected because such files have a high probability of being shared by the organization with vendors and other external entities.
- Regarding Data Format—Delimiter Determination, the objective was to confirm, given the volume of data and variables involved, the viability of a solution that identifies a file's delimiter and thereby enables identification of the data. In this second experiment, statistical features derived from an analysis of the big data sets are used as input to a neural network to enable the identification.
Experiment | Dataset | Origin | Number of Files | Disk Size (GB) |
---|---|---|---|---|
Confidentiality | Mobile Banking logs | Proprietary | 4 | 0.50 |
Confidentiality | Loan Origination System Logs | Proprietary | 3 | 1.10 |
Confidentiality | Network Trace | Proprietary | 43 | 3.98 |
Delimiter Identification | Banking Set (ODS) | Proprietary | 8,605 | 15.60 |
Delimiter Identification | National Climatic Data Center (NCDC) | Public | 14,030 | 9.40 |
Delimiter Identification | Center for Disease Control and Prevention (CDC) | Public | 920 | 12.80 |
Data origination—confidentiality
Occurrence | Classification | Metrics definition | Value/add-on contribution to confidence level |
---|---|---|---|
Credit Cards | Card | RegEx Identified | 40% |
 | | Linguistic boundary | 20% |
 | | No linguistic boundary | 10% |
 | | Luhn algorithm | 40% |
 | | Exists in institutional BINs | 5% |
 | | Sanitization method | Masking (first six and last three chars) |
 | | Confidence Level | 60% |
PII | Lists | RegEx Identified | 40% |
 | | Linguistic boundary | 20% |
 | | No linguistic boundary | 10% |
 | | Proximity | 10% |
 | | Sanitization method | Hash |
 | | Confidence Level | 50% |
PII | Absolute XML | RegEx Identified, e.g. (<CIVIL_ID>) | 100% |
 | | Sanitization method | Truncate |
 | | Confidence Level | 50% |
PII | Relative XML | RegEx Identified, e.g. (<*ID*>) | 50% |
 | | Sanitization method | Truncate |
 | | Confidence Level | 50% |
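The additive confidence scheme for the Card classification can be sketched as follows. The weights are those listed in the table; the masking helper reflects one reading of "first six and last three chars" (keep those visible, mask the middle), which is an assumption:

```python
def card_confidence(regex_hit, boundary, luhn_ok, in_bins):
    """Sum the table's add-on contributions for a Card occurrence."""
    score = 0
    if regex_hit:
        score += 40
    score += 20 if boundary else 10   # linguistic boundary present or absent
    if luhn_ok:
        score += 40
    if in_bins:
        score += 5                    # known institutional BIN
    return score

def mask_card(number):
    """Keep the first six and last three characters, mask the rest
    (one possible interpretation of the table's sanitization note)."""
    return number[:6] + "*" * max(0, len(number) - 9) + number[-3:]
```

An occurrence is sanitised only once its accumulated score reaches the classification's confidence level (60% for Cards), which is how the boosters suppress false positives.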
Data format—delimiter determination
Number of lines | NCDC # of files | NCDC % of data set | CDC # of files | CDC % of data set | ODS # of files | ODS % of data set |
---|---|---|---|---|---|---|
0–100 | 2,742 | 95 | 84 | 15 | 3,737 | 63 |
101–500 | 129 | 4 | 102 | 18 | 330 | 6 |
501–10,000 | 18 | 1 | 246 | 44 | 1,592 | 27 |
10,001–100,000 | | | 68 | 12 | 185 | 3 |
100,001–10,000,000 | | | 59 | 11 | 74 | 1 |
Delimiter | Weight | Delimiter | Weight |
---|---|---|---|
Comma | 1 | Tilde | 3 |
Semicolon | 2 | Tilde Pipe Tilde | 4 |
Tab | 2 | Tilde Pipe Pipe Tilde | 5 |
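When several candidate delimiters conform equally well, the weights above favour the more specific multi-character separators; a minimal sketch (the weight map mirrors the table, with "Tilde Pipe Tilde" written as the literal sequence):

```python
# Weights from the table: a higher weight marks a more specific
# separator, preferred when several candidates conform equally well.
DELIMITER_WEIGHTS = {
    ",": 1,      # Comma
    ";": 2,      # Semicolon
    "\t": 2,     # Tab
    "~": 3,      # Tilde
    "~|~": 4,    # Tilde Pipe Tilde
    "~||~": 5,   # Tilde Pipe Pipe Tilde
}

def pick_delimiter(candidates):
    """Among conforming candidate delimiters, pick the heaviest one."""
    return max(candidates, key=lambda d: DELIMITER_WEIGHTS.get(d, 0))
```

This resolves, for example, the case from the challenges above where both a comma and a rarer compound separator show high conformity in the same file.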
Metric | Formula |
---|---|
Mean (μ) | \(\mu = \frac{{\sum x_{i} }}{n}\) |
Standard Deviation (σ) | \(\sigma = \sqrt {\frac{{\sum \left( {x_{i} - \mu } \right)^{2} }}{n}}\) |
Coefficient of Variation (Cv) | \(C_{v} = \frac{\sigma }{\mu }\) |
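These three metrics map directly onto Python's `statistics` module (`pstdev` is the population form, matching the 1/n in the formulas above):

```python
from statistics import mean, pstdev

def coefficient_of_variation(xs):
    """Cv = sigma / mu, using the population standard deviation."""
    return pstdev(xs) / mean(xs)
```

For instance, the sample [2, 4, 4, 4, 5, 5, 7, 9] has μ = 5 and σ = 2, giving Cv = 0.4.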
Classification | Standard RegEx | “Boosted” RegEx confidence level ≤ 50% | “Boosted” RegEx confidence level > 50% |
---|---|---|---|
Cards | 18,422 | 7,337 (39.83%) | 11,085 (60.17%) |
Lists | 5,394,547 | 1,565,160 (29.01%) | 3,829,387 (70.99%) |
Total | 5,412,969 | 1,572,497 | 3,840,472 |
Classification | “Boosted” Absolute and Relative XML confidence level ≤ 50% | “Boosted” Absolute and Relative XML confidence level > 50% |
---|---|---|
Absolute XML | | 22,741 |
Relative XML | 325,530 | 105,758 |
File Encoding | ODS | CDC | NCDC |
---|---|---|---|
ISO-8859-1 | 5,625 | 126 | 13,664 |
ISO-8859-8 | 1 | | |
Windows-1252 | 35 | | 253 |
KOI8-R | 1 | | |
MACCYRILLIC | 7 | | |
UTF-16LE | 205 | | |
UTF-32LE | 1 | | |
UTF-8 | 43 | 443 | 102 |
Total | 5,918 | 559 | 14,019 |
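Encoding detection of this kind can be approximated with a try-decode pass over the encodings observed above; a minimal stdlib sketch (a dedicated detector library would be more robust, since permissive single-byte encodings such as ISO-8859-1 accept any byte sequence and must therefore come last in the candidate order):

```python
# Candidate order matters: strict encodings first, permissive ones last.
CANDIDATES = ("utf-8", "utf-16-le", "iso8859-8", "iso8859-1")

def sniff_encoding(raw: bytes, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes the bytes,
    or None if none of them does."""
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None
```

Because pure-ASCII content is valid UTF-8, ASCII files are reported as UTF-8 here; a statistical detector would be needed to distinguish, say, Windows-1252 from ISO-8859-1.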
Results
Data origination—confidentiality
Data format—delimiter determination
- The upper part of the figure (cross-tabulation) shows the number of files that fall into each category. The highlighted data element (in purple) indicates that 7,687 files exhibit a difference less than or equal to 1% when comparing the ANN result with the actual delimiter. The ANN model under test was built on the ODS dataset with 1% of the file content read, and tested against the NCDC dataset with 3% of the file content read.
- The heat-map shows the percentage of files that fall into each category: the counts from the upper cross-tabulation are depicted as percentages (count/file number). In the same example, 76% of the files (in purple) exhibit a difference less than or equal to 1% when comparing the ANN result with the actual delimiter.
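This kind of bucketing can be reproduced by binning the relative difference between the ANN output and the actual value per file; a minimal sketch (the bucket edges beyond the 1% bucket are illustrative assumptions):

```python
def bucket_differences(predicted, actual, edges=(0.01, 0.05, 0.10, 0.50)):
    """Count files per relative-difference bucket; the first bucket
    holds files where |predicted - actual| / actual <= 1%, and the
    final bucket collects everything beyond the last edge."""
    counts = [0] * (len(edges) + 1)
    for p, a in zip(predicted, actual):
        diff = abs(p - a) / a if a else float("inf")
        for i, edge in enumerate(edges):
            if diff <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts
```

Dividing each count by the total number of files yields the percentages shown in the heat-map view.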
Discussion
- How can the system be extended, without human interaction, to cover new RegEx patterns?
- How can binary or HEX content be treated?
- Is there a way to achieve more contextual accuracy?
- Is there a way to retrofit previously identified structural information into the system?
- Would such a framework be applicable in a business implementation towards DLP?