Abstract
We introduce a data set of privacy policies containing more than 18,300 sentence snippets, labeled in accordance with five core privacy policy requirements of the General Data Protection Regulation (GDPR). We hope that this data set will enable practitioners to analyze documents and detect whether they comply with the GDPR. To evaluate the data set, we apply a number of NLP and other classification algorithms and achieve \(F_1\) scores between 0.52 and 0.71 across the five requirements. We apply our trained models to over 1,200 real privacy policies that we crawled from companies’ websites and find that over 76% do not contain all of the requirements and thus potentially do not fully comply with the GDPR.