1 Introduction
- We characterize the bias, variance, and error of sample proportion estimates when the sample contains noisy classifications.
- We propose a more accurate method for constructing confidence intervals around class proportions estimated from noisy samples.
- We conduct experiments using our method across classifier thresholds, sample sizes, and class proportions.
2 Background
2.1 Quantifying class proportions
2.2 Quantification in practice
3 Preliminaries
3.1 Binary classification
3.2 Classification error
Metric | Notation | Defn. (Probability) | Defn. (Estimate) |
---|---|---|---|
True positive rate | \(\alpha \) | \(P(\hat{Y}=1 \mid Y=1)\) | \(TP/(TP+FN)\) |
False positive rate | \(\beta \) | \(P(\hat{Y}=1 \mid Y=0)\) | \(FP/(FP+TN)\) |
True negative rate | \(1-\beta \) | \(P(\hat{Y}=0 \mid Y=0)\) | \(TN/(TN+FP)\) |
False negative rate | \(1-\alpha \) | \(P(\hat{Y}=0 \mid Y=1)\) | \(FN/(TP+FN)\) |
Positive predictive value | | \(P(Y=1 \mid \hat{Y}=1)\) | \(TP/(TP+FP)\) |
Negative predictive value | | \(P(Y=0 \mid \hat{Y}=0)\) | \(TN/(TN+FN)\) |
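To make these definitions concrete, the following minimal sketch computes each rate and predictive value directly from raw confusion-matrix counts (the function name and example counts are illustrative, not from the paper):

```python
def classification_rates(tp, fp, tn, fn):
    """Rates and predictive values from raw confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),   # true positive rate, alpha
        "fpr": fp / (fp + tn),   # false positive rate, beta
        "tnr": tn / (tn + fp),   # true negative rate, 1 - beta
        "fnr": fn / (tp + fn),   # false negative rate, 1 - alpha
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }

# Example with hypothetical test-set counts:
print(classification_rates(tp=420, fp=35, tn=510, fn=60))
```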
3.3 Class proportions
4 Estimator properties
4.1 Bias
4.2 Variance
4.3 Error
5 Confidence intervals
5.1 Bootstrapping: Review
5.2 Error-adjusted bootstrapping
5.2.1 Correctness of algorithm
5.2.2 Predictive value estimates
6 Experiments
6.1 Datasets and classification details
- Flu Vaccination: A set of 10,000 tweets from 2013-2016 labeled according to whether the tweet indicates that someone has received an influenza vaccination (i.e., a seasonal flu shot) (Huang et al. 2017). The dataset spans four years, and approximately 31% of tweets are labeled positive. 15% of tweets were reserved for testing. The aggregation task is to calculate the percentage of tweets that indicate vaccination each month (a sketch of this per-group aggregation follows the dataset list).
- Flu Infection: A set of 1,017 tweets from 2009 labeled as indicating flu infection (Lamb et al. 2013). The original dataset included 5,000 tweets, but most are no longer available for download. The aggregation task is to calculate the percentage of tweets indicating flu infection for each week of available data. Again, 15% of tweets were reserved for testing.
- IMDB: A set of 50,000 movie reviews labeled with positive or negative sentiment (Maas et al. 2011). The dataset is balanced, with equal numbers of positive and negative reviews, and contains reviews for 2,780 movies (an average of 18 reviews per movie). The IMDB data come split 50/50 into training and testing. The aggregation task is to calculate the percentage of reviews that are positive for each movie.
- Yelp: A set of 6,752,287 reviews of 192,632 businesses on Yelp. The full dataset is publicly available.1 Reviews with more than 3 stars were labeled “positive,” and reviews with \(\le 3\) stars were labeled “negative.” Because this dataset is so large, we created a smaller dataset using a random sample of 10% of the businesses; of these reviews, 15% were reserved for testing (584,841 reviews in the training set and 103,208 in the test set). Approximately 77% of the reviews were positive.
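Across all four datasets the aggregation task has the same shape: estimate the proportion of positively classified items within each group (month, week, movie, or business). A minimal pandas sketch of this naive per-group estimate, using an illustrative data frame and column names that are assumptions rather than the released data:

```python
import pandas as pd

# Illustrative data: one row per item, with the group it belongs to
# (month, week, movie, or business) and the classifier's binary prediction.
reviews = pd.DataFrame({
    "group": ["2015-01", "2015-01", "2015-02", "2015-02", "2015-02"],
    "predicted_positive": [1, 0, 1, 1, 0],
})

# Naive estimate of the class proportion in each group: the mean of the
# noisy predicted labels. This is the estimator whose bias, variance, and
# error are characterized in Section 4.
naive_proportions = reviews.groupby("group")["predicted_positive"].mean()
print(naive_proportions)
```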
Classifiers for all datasets were implemented with scikit-learn (Pedregosa et al. 2011). Grid search using fivefold cross-validation on the training data was used to tune the \(\ell _2\) regularization parameter. For all classifiers, unigrams were used to build feature sets. While more extensive feature engineering or feature selection techniques might result in higher-performing classifiers, we constructed the experiments this way to create a simple and equitable comparison across the datasets. ROC curves are shown in Figure S1. We note that classification performance is extremely high for the Yelp dataset (the area under the ROC curve is nearly 1), while error rates are higher for the Twitter datasets.
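A minimal sketch of this training setup, assuming an \(\ell _2\)-regularized logistic regression (the estimator choice, grid values, and variable names are illustrative assumptions; the text specifies only scikit-learn, unigram features, and fivefold grid search over the regularization parameter):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Unigram bag-of-words features feeding an L2-regularized linear classifier.
pipeline = Pipeline([
    ("unigrams", CountVectorizer(ngram_range=(1, 1))),
    ("clf", LogisticRegression(penalty="l2", solver="liblinear")),
])

# Fivefold grid search on the training data over the regularization strength
# (C is the inverse of the regularization weight).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)

# train_texts / train_labels stand in for one dataset's training split:
# search.fit(train_texts, train_labels)
# test_scores = search.predict_proba(test_texts)[:, 1]
```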