In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction

Original Paper, published in the Journal of Quantitative Criminology

Abstract

Objectives

We study interpretable recidivism prediction using machine learning (ML) models and analyze performance in terms of prediction ability, sparsity, and fairness. Unlike previous works, this study trains interpretable models that output probabilities rather than binary predictions, and uses quantitative fairness definitions to assess the models. This study also examines whether models can generalize across geographic locations.

Methods

We generated black-box and interpretable ML models on two different criminal recidivism datasets from Florida and Kentucky. We compared predictive performance and fairness of these models against two methods that are currently used in the justice system to predict pretrial recidivism: the Arnold PSA and COMPAS. We evaluated predictive performance of all models on predicting six different types of crime over two time spans.

Results

Several interpretable ML models can predict recidivism as well as black-box ML models and are more accurate than COMPAS or the Arnold PSA. These models are potentially useful in practice. Similar to the Arnold PSA, some of these interpretable models can be written down as a simple table. Others can be displayed using a set of visualizations. Our geographic analysis indicates that ML models should be trained separately for separate locations and updated over time. We also present a fairness analysis for the interpretable models.

Conclusions

Interpretable ML models can perform just as well as non-interpretable methods and currently-used risk assessment scales, in terms of both prediction accuracy and fairness. ML models might be more accurate when trained separately for distinct locations and kept up-to-date.


Data Availability Statement

The Broward County, FL dataset generated and analyzed during the current study is available from the corresponding author on request. The Kentucky dataset is not publicly available but can be accessed through a special data request to the Kentucky Department of Shared Services, Research and Statistics.

Notes

  1. Kentucky created and implemented its own tool in 2006 but transitioned to the Arnold PSA in 2013.

  2. For decreasing (respectively increasing) stumps, if the coefficient for the largest (respectively smallest) stump is negative, the function \(f\) will still be monotonic because the negative value will be subtracted from all values of the remaining stumps.

  3. We note that a real-valued score S between 0 and 1 is well-calibrated if \(P(Y = 1 | S = s) = s\). Well-calibration says that the predicted probability of recidivism should be the same as the true probability of recidivism (Verma and Rubin 2018). Although well-calibration is the definition of calibration that is standard in the statistics community, we consider monotonic-calibration here because any score that is monotonically-calibrated can be transformed to be well-calibrated.
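As an illustration of the transformation mentioned in this note, a monotonically-calibrated score can be remapped to an (approximately) well-calibrated one by isotonic regression of the outcome on the score. The sketch below uses synthetic data and is not taken from the paper.

```python
# Sketch (synthetic data, not the paper's code): converting a monotonically-
# calibrated score into an approximately well-calibrated one via isotonic regression.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
true_prob = rng.uniform(0.05, 0.6, size=5000)   # true recidivism probabilities
y = rng.binomial(1, true_prob)                  # observed outcomes
score = true_prob ** 2                          # monotone in true_prob, but miscalibrated

iso = IsotonicRegression(out_of_bounds="clip")  # monotone map from score to P(Y=1 | score)
calibrated = iso.fit_transform(score, y)
# 'calibrated' is approximately well-calibrated (P(Y = 1 | S = s) is close to s)
# while preserving the ranking induced by the original score.
```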

References

  • Agarwal A, Beygelzimer A, Dudík M, Langford J, Wallach H (2018) A reductions approach to fair classification. In: Proceedings of the 35th international conference on machine learning. https://proceedings.mlr.press/v80/agarwal18a.html

  • Agarwal A, Dudík M, Wu ZS (2019) Fair regression: quantitative definitions and reduction-based algorithms. In: Proceedings of the 36th international conference on machine learning. https://proceedings.mlr.press/v97/agarwal19d.html

  • Alfred B (2006) The crime drop in America: an explanation of some recent crime trends. J Scand Stud Criminol Crime Prev 7:17–35


  • American Law Institute (2017) Model penal code. https://www.ali.org/projects/show/sentencing/

  • Angelino E, Larus-Stone N, Alabi D, Seltzer M, Rudin C (2018) Certifiably optimal rule lists for categorical data. J Mach Learn Res 19:1–79


  • Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias. Technical report, ProPublica

  • Barabas C, Dinakar K, Doyle C (2019) The problems with risk assessment tools. The New York Times. https://www.nytimes.com/2019/07/17/opinion/pretrial-ai.html

  • Barocas S, Selbst AD (2016) Big data’s disparate impact. Calif Law Rev 104:671–732


  • Berk R (2017) An impact assessment of machine learning risk forecasts on parole board decisions and recidivism. Exp Criminol 13:193–216


  • Berk RA, He Y, Sorenson SB (2005) Developing a practical forecasting screener for domestic violence incidents. Eval Rev 29(4):358–383


  • Berk R, Heidari H, Jabbari S, Joseph M, Kearns M, Morgenstern J, Neel S, Roth A (2017a) A convex framework for fair regression. arXiv:1706.02409

  • Berk R, Heidari H, Jabbari S, Kearns M, Roth A (2017b) Fairness in criminal justice risk assessments: the state of the art. Sociol Methods Res

  • Bindler A, Hjalmarsson R (2018) How punishment severity affects jury verdicts: evidence from two natural experiments. Am Econ J 10

  • Binns R (2018) Fairness in machine learning: lessons from political philosophy. J Mach Learn Res 81:1–11


  • Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. CRC Press, New York


  • Brennan T, Dieterich W, Ehret B (2009) Evaluating the predictive validity of the COMPAS risk and needs assessment system. Crim Justice Behav 36(1):21–40


  • Bureau of Justice Assistance (2020) History of risk assessment. Bureau of Justice Assistance. https://psrac.bja.ojp.gov/basics/history

  • Burgess EW (1928) Factors determining success or failure on parole

  • Bushway SD, Piehl AM (2007) The inextricable link between age and criminal history in sentencing. Crime Delinq 53(1):156–183


  • Cadigan TP, Lowenkamp CT (2011) Implementing risk assessment in the federal pretrial services system. Federal Probation 75(2)

  • Carollo J, Hines M, Hedlund J (2007) Expanded validation of a decision aid for pretrial conditional release. Technical report, Central Connecticut State University

  • Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794

  • Chouldechova A (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big Data 5(2):153–163


  • Cook P, Laub J (2002) After the epidemic recent trends in youth violence in the United States. Crime Justice 29:1–37


  • Corbett-Davies S, Goel S (2018) The measure and mismeasure of fairness: a critical review of fair machine learning. arXiv:180800023v2

  • Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A (2017) Algorithmic decision making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 797–806

  • CPAT of Pretrial Services (2015) The Colorado Pretrial Assessment Tool (CPAT): administration, scoring, and reporting manual. https://university.pretrial.org/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=47e978bb-3945-9591-7a4f-77755959c5f5

  • Dawes RM, Faust D, Meehl PE (1989) Clinical versus actuarial judgment. Science 243(4899):1668–1674


  • Defronzo J (1984) Climate and crime: tests of an FBI assumption. Environ Behav 16

  • Desmarais S, Garrett B, Rudin C (2019) Risk assessment tools are not a failed ’minority report’. Law360. https://www.law360.com/access-to-justice/articles/1180373/risk-assessment-tools-are-not-a-failed-minority-report-

  • Dieterich W, Mendoza C, Brennan T (2016) COMPAS risk scales: demonstrating accuracy equity and predictive parity: performance of the COMPAS risk scales in Broward county. Technical report, Northpointe, Inc

  • Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness. In: Proceedings of the 3rd innovations in theoretical computer science conference, ITCS ’12, pp 214–226, New York. ACM

  • Electronic Privacy Information Center (2016) Algorithms in the criminal justice system. Electronic Privacy Information Center. https://epic.org/algorithmic-transparency/crim-justice/

  • Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, Lin C-J (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9:1871–1874


  • Flores AW, Lowenkamp CT, Bechtel K (2016) False positives, false negatives, and false analyses: a rejoinder to “Machine bias: there’s software used across the country to predict future criminals”. Federal Probation 80(2)

  • Frase RS, Roberts J, Hester R, Mitchell KL (2015) Robina institute of criminal law and criminal justice, criminal history enhancements sourcebook. https://robinainstitute.umn.edu/publications/criminal-history-enhancements-sourcebook

  • Freeman K (2016) Algorithmic injustice: how the Wisconsin Supreme Court failed to protect due process rights in State v. Loomis. N C J Law Technol 18. http://ncjolt.org/wp-content/uploads/2016/12/Freeman_Final.pdf

  • Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139


  • Friedman JH (2002) Stochastic gradient boosting. Comput Stat Data Anal 38(4):367–378


  • Garrett B, Stevenson M (2020) Open risk assessments. Behav Sci Law. https://sites.law.duke.edu/justsciencelab/2019/09/15/comment-on-pattern-by-brandon-l-garrett-megan-t-stevenson/

  • Gelb A, Velazquez T (2018) The changing state of recidivism: fewer people going back to prison. The Pew Charitable Trusts

  • Goel S, Rao JM, Shroff R (2016) Precinct or prejudice? understanding racial disparities in New York city’s stop-and-frisk policy. Inst Math Stat 10(1):365–394


  • Grove WM, Meehl PE (1996) Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychol Public Policy Law 2(2):293


  • Hanson R, Thornton D (2003) Notes on the development of static-2002. Department of the Solicitor General of Canada, Ottawa

  • Hardt M, Price E, Srebro N (2016) Equality of opportunity in supervised learning. In: Advances in neural information processing systems, pp 3315–3323

  • Harris GT, Rice ME (2008) Encyclopedia of Psychology and Law, chapter Violence Risk Appraisal Guide (VRAG), p 848. SAGE Publications, Inc.

  • Hart H (1924) Predicting parole success. J Crim Law Criminol 14

  • Hoffman PB, Adelberg S (1980) The salient factor score: a nontechnical overview. Federal Probation 44:44


  • Howard P, Francis B, Soothill K, Humphreys L (2009) OGRS 3: the revised offender group reconviction scale. Technical report, Ministry of Justice

  • James N (2018) Risk and needs assessment in the federal prison system. Technical report, Congressional Research Service

  • Kehl D, Guo P, Kessler S (2017) Algorithms in the criminal justice system: assessing the use of risk assessments in sentencing. https://cyber.harvard.edu/publications/2017/07/Algorithms

  • Kim J, Bushway S, Tsao H (2016) Identifying classes of explanation for crime drop: period and cohort effects for New York state. J Quant Criminol 32:357–375


  • Kleiman M, Ostrom BJ, Cheesman FL (2007) Using risk assessment to inform sentencing decisions for nonviolent offenders in Virginia. Crime Delinq 53(1):106–132


  • Kleinberg J, Mullainathan S, Raghavan M (2017) Inherent trade-offs in the fair determination of risk scores. In: Proceedings of the 8th conference on innovations in theoretical computer science

  • Lakkaraju H, Rudin C (2017) Learning cost-effective and interpretable treatment regimes. In: Singh A, Zhu J (eds) Proceedings of the 20th international conference on artificial intelligence and statistics, vol 54 of proceedings of machine learning research, pp 166–175, Fort Lauderdale. PMLR. http://proceedings.mlr.press/v54/lakkaraju17a.html

  • Larson J, Mattu S, Kirchner L, Angwin J (2016) How we analyzed the COMPAS recidivism algorithm. Technical report, ProPublica. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

  • Latessa E, Smith P, Lemke R, Makarios M, Lowenkamp C (2009) Creation and validation of the ohio risk assessment system. Technical report, University of Cincinnati School of Criminal Justice Center for Criminal Justice Research

  • Lazarsfeld PF (1974) An evaluation of the pretrial services agency of the Vera institute of justice. Vera Institute, New York


  • Lou Y, Caruana R, Gehrke J, Hooker G (2013) Accurate intelligible models with pairwise interactions. In: 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 623–631. https://doi.org/10.1145/2487575.2487579

  • Ludwig J, Mullainathan S (2021) Fragile algorithms and fallible decision-makers: lessons from the justice system. J Econ Perspect 35(4):71–96


  • Matthews B, Minton J (2017) Rethinking one of criminology’s ‘brute facts’: the age-crime curve and the crime drop in Scotland. Eur J Criminol 15(3):296–320


  • MHS Assessments (2017) Level of service/case management inventory: an offender management system. MHS Public Safety. https://issuu.com/mhs-assessments/docs/ls-cmi.lsi-r.brochure_insequence

  • Milgram A (2014) Why smart statistics are the key to fighting crime

  • Mishra A (2014) Climate and crime. Global J Sci Front Res 14

  • Nafekh M, Motiuk LL (2002) The statistical information on recidivism, revised 1 (SIR-R1) scale: a psychometric examination. Correctional Service of Canada. Research Branch

  • Neuilly M-A, Zgoba KM, Tita GE, Lee SS (2011) Predicting recidivism in homicide offenders using classification tree analysis. Homicide Stud 15(2):154–176


  • Northpointe (2013) Practitioner’s Guide to COMPAS Core. http://www.northpointeinc.com/downloads/compas/Practitioners-Guide-COMPAS-Core-_031915.pdf

  • Northpointe Inc. (2009) Measurement & treatment implications of COMPAS core scales. Technical report, Northpointe Inc

  • O’Neil C (2016) Weapons of math destruction. Crown Books, New York


  • Orbis (2014) Service planning instrument: an innovative assessment and case planning tool. https://orbispartners.com/wp-content/uploads/2014/07/SPIn-Brochure.pdf

  • Palocsay W, PingWang S, Brookshire RG (2000) Predicting criminal recidivism using neural networks. Socio-Econ Plan Sci 34:271–284


  • Pleiss G, Raghavan M, Wu F, Kleinberg J, Weinberger K (2017) On fairness and calibration. In: Advances in neural information processing systems, pp 5680–5689

  • Pretrial Justice Institute (2020) Updated position on pretrial risk assessment tools. Pretrial Justice Institute. https://university.pretrial.org/viewdocument/updated-statement-on-pretrial-risk

  • Public Safety Assessment (2019) Risk factors and formulas. Laura and John Arnold Foundation. https://www.psapretrial.org/about/

  • Ranson M (2014) Crime, weather, and climate change. J Environ Econ Manag 67

  • Richard B (2019) Accuracy and fairness for juvenile justice risk assessments. J Empir Leg Stud 16:174–194


  • Roberts J, von Hirsch A (2010) Previous convictions at sentencing - theoretical and applied perspectives. Bloomsbury Publishing, London


  • Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1:206–215


  • Rudin C, Wang C, Coker B (2020) The age of secrecy and unfairness in recidivism prediction. Harvard Data Sci Rev 2(1). https://hdsr.mitpress.mit.edu/pub/7z10o269

  • Sherman LW (2007) The power few: experimental criminology and the reduction of harm. J Exp Criminol 3(4):299–321


  • Singh A, Mohapatra S (2021) Development of risk assessment framework for first time offenders using ensemble learning. IEEE Access 9:135024–135033


  • Skeem J, Lin Z, Jung J, Goel S (2020) The limits of human predictions of recidivism. Sci Adv 6

  • Smith B (2016) Auditing deep neural networks to understand recidivism predictions. PhD thesis, Haverford College

  • Soares E, Angelov PP (2019) Fair-by-design explainable models for prediction of recidivism. arXiv:abs/1910.02043

  • Starr SB (2015) The risk assessment era: an overdue debate. Federal Sentencing Reporter 27:205–206


  • Stevenson M (2018) Assessing risk assessment in action. Minnesota Law Review. http://www.minnesotalawreview.org/wp-content/uploads/2019/01/13Stevenson_MLR.pdf

  • Stevenson MT, Slobogin C (2018) Algorithmic risk assessments and the double-edged sword of youth. Washington Univ Law Rev 96(18–36)

  • The Leadership Conference on Civil and Human Rights (2018) The use of pretrial “risk assessment” instrument: a shared statement of civil rights concerns. http://civilrightsdocs.info/pdf/criminal-justice/Pretrial-Risk-Assessment-Full.pdf

  • Tollenaar N, van der Heijden P (2013) Which method predicts recidivism best? A comparison of statistical, machine learning and data mining predictive models. J R Stat Soc A Stat Soc 176(2):565–584


  • Turner S, Hess J, Jannetta J (2009) Development of the California Static Risk Assessment Instrument (CSRA). CEBC Working Papers

  • United States Census Bureau (2015) Hispanic or latino origin by race 2011–2015 American community survey 5-year estimates. https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_15_5YR_B03002&prodType=table

  • United States Census Bureau (2019) QuickFacts: Kentucky; United States. https://www.census.gov/quickfacts/fact/table/KY,US/PST04521

  • Ustun B, Rudin C (2015) Supersparse linear integer models for optimized medical scoring systems. Mach Learn 1–43

  • Ustun B, Rudin C (2017) Optimized risk scores. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining

  • Ustun B, Rudin C (2019) Learning optimized risk scores. J Mach Learn Res 20(150):1–75


  • Vapnik V, Chervonenkis A (1964) A note on one class of perceptrons. Autom Remote Control 25

  • Verma S, Rubin J (2018) Fairness definitions explained. In: ACM/IEEE international workshop on software fairness, pp 1–7. ACM

  • Virginia Department of Criminal Justice Services (2018) Virginia pretrial risk assessment instrument - (vprai). https://www.dcjs.virginia.gov/sites/dcjs.virginia.gov/files/publications/corrections/virginia-pretrial-risk-assessment-instrument-vprai_0.pdf

  • Wexler R (2017) When a computer program keeps you in jail: how computers are harming criminal justice. New York Times, p 27. Section A

  • Wolfgang ME (1987) Delinquency in a birth cohort. University of Chicago Press, Chicago


  • Zemel R, Wu Y, Swersky K, Pitassi T, Dwork C (2013) Learning fair representations. In: International conference on machine learning, pp 325–333

  • Zeng J, Ustun B, Rudin C (2017) Interpretable classification models for recidivism prediction. J R Stat Soc A Stat Soc 180(3):689–722


  • Zweig J (2010) Extraordinary conditions of release under the bail reform act. Harvard J Legis 47:555–585



Acknowledgements

We thank the Broward County Sheriff’s office and the Kentucky Department of Shared Services, Research and Statistics for their assistance and provision of data. We would also like to thank Daniel Sturtevant from the Kentucky Department of Shared Services, Research and Statistics for providing significant insight into the Kentucky data set, and Berk Ustun for his advice on training RiskSLIM. Finally, we thank Brandon Garrett from Duke, Stuart Buck and Kristin Bechtel from Arnold Ventures, and Kathy Schiflett, Christy May, and Tara Blair from Kentucky Pretrial Services for their thoughtful comments on the article.

Funding

This study was partially supported by Arnold Ventures, the Department of Computer Science at Duke University, the Department of Electrical and Computer Engineering at Duke University, and the Lord Foundation of North Carolina. This report represents the findings of the authors and does not represent the views of any of the funding agencies.

Author information

Corresponding author

Correspondence to Bin Han.

Ethics declarations

Conflict of interest

No additional institutional conflicts.

Code Availability

Our code is here: https://github.com/BeanHam/interpretable-machine-learning.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Nested Cross Validation Procedure

We applied fivefold nested cross validation to tune hyperparameters. We split the entire data set into five equally-sized folds for the outer cross validation step. One fold was used as the holdout test set and the other four folds were used as the training set (the "outer training set"). The inner loop uses only the outer training set (\(\frac{4}{5}\)ths of the data). On this outer training set, we conducted fivefold cross validation and grid-searched over hyperparameter values. After this step, each hyperparameter setting had five validation results. We selected the hyperparameter setting with the highest average validation performance, trained the model with this best setting on the entire outer training set, and tested it on the holdout test set.

We repeated the process above until each one of the original five folds was used as the holdout test set. Ultimately, we had five holdout test results, with which we were able to calculate the average and standard deviation of the performance.
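As an illustration, the following sketch (in Python with scikit-learn) mirrors the fivefold nested cross validation described above; the estimator, hyperparameter grid, and data are placeholders rather than the configurations used in the paper.

```python
# Sketch of the fivefold nested cross validation described above.
# The estimator, hyperparameter grid, and data (X, y: numpy arrays) are
# placeholders, not the paper's actual configuration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def nested_cv_auc(X, y, estimator, param_grid, seed=0):
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    test_aucs = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner loop: fivefold grid search restricted to the outer training set.
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        search = GridSearchCV(estimator, param_grid, scoring="roc_auc", cv=inner)
        search.fit(X[train_idx], y[train_idx])
        # GridSearchCV refits the best setting on the whole outer training set;
        # evaluate that model on the held-out outer fold.
        probs = search.predict_proba(X[test_idx])[:, 1]
        test_aucs.append(roc_auc_score(y[test_idx], probs))
    return np.mean(test_aucs), np.std(test_aucs)

# Example usage with a placeholder model and grid:
# mean_auc, sd_auc = nested_cv_auc(X, y, LogisticRegression(max_iter=1000),
#                                  {"C": [0.01, 0.1, 1, 10]})
```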

We applied a variant of the nested cross validation procedure described above to perform the analysis discussed in the "Recidivism Prediction Models Do Not Generalize Well Across Regions" section, where we trained models on one region and tested them on the other. For instance, when we trained models on Broward and tested them on Kentucky, the Kentucky data was treated as the holdout test set. We split the Broward data into five folds, used four of them for cross validation, and constructed the final model using the best parameters. We then tested the final model on the entire Kentucky data set, as well as on the holdout fold from Broward. We rotated the folds and repeated this process five times.

Broward Data Processing

The Broward County data set consists of publicly available criminal history, court data and COMPAS scores from Broward County, Florida. The criminal history and demographic information were computed from raw data released by ProPublica (Angwin et al. 2016). The probational history was computed from public criminal records released by the Broward Clerk’s Office.

The screening date is the date on which the COMPAS score was calculated. Features and labels were computed for an individual with respect to a particular screening date. For individuals with multiple screening dates, we computed features for each screening date, such that the set of events used to calculate features for earlier screening dates is included in the set of events for later screening dates. Occasionally, an individual has multiple COMPAS scores calculated on the same date; since there appears to be no information distinguishing these scores other than their identification number, we take the score with the larger identification number. The recidivism labels were computed for timescales of 6 months and 2 years. Some individuals were sentenced to prison as a result of their offense(s); we used only observations for which we have 6 months/2 years of data subsequent to the individual's release date.
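Two of these steps, keeping the COMPAS score with the larger identification number when several scores share a screening date and restricting to observations with sufficient follow-up, could be expressed as in the following sketch; the column names are hypothetical stand-ins for the raw fields, not the actual ProPublica schema.

```python
# Sketch of two Broward preprocessing steps described above, using pandas.
# Column names (person_id, screening_date, compas_id, release_date, last_data_date)
# are hypothetical stand-ins for the raw fields, not the actual ProPublica schema.
import pandas as pd

def dedupe_and_filter(compas: pd.DataFrame, followup_days: int = 730) -> pd.DataFrame:
    # When several COMPAS scores share a person and screening date, keep the row
    # with the larger identification number.
    compas = (compas.sort_values("compas_id")
                    .groupby(["person_id", "screening_date"], as_index=False)
                    .last())
    # Keep only observations with at least `followup_days` (6 months or 2 years)
    # of data after the individual's release date.
    followup = (compas["last_data_date"] - compas["release_date"]).dt.days
    return compas[followup >= followup_days]
```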

Below, we describe details of the feature and label generation process. The constructed features are presented in Table 4 at the end of this section.

  • Degree “(0)” charges seem to be very minor offenses, so we exclude these charges. We infer whether a charge is a felony, misdemeanor, or traffic charge based on the charge degree.

  • Some of our features rely on classifying the type of each offense (e.g., whether or not it is a violent offense). We infer this from the statute number, most of which correspond to statute numbers from the Florida state crime code.

  • The raw ProPublica data includes arrest data as well as charge data. Because the arrest data does not include the statute, which is necessary for us to determine offense type, we use the charge data to compute features that require the offense type. We use both charge and arrest data to predict recidivism.

  • For each person on each COMPAS screening date, we identify the offense—which we call the current offense—that most likely triggered the COMPAS screening. The current offense date is the date of the most recent charge that occurred on or before the COMPAS screening date. Any charge that occurred on the current offense date is part of the current offense. In some cases, there is no prior charge that occurred near the COMPAS screening date, suggesting charges may be missing from the data set. For this reason we consider charges that occurred within 30 days of the screening date for computing the current offense. If there are no charges in this range, we say the current offense is missing. We exclude observations with missing current offenses. We used some of the COMPAS subscale items as features for our ML models. All such components of the COMPAS subscales that we compute are based on data that occurred prior to (not including) the current offense date.

  • The events/documents data includes a number of events (e.g., “File Affidavit Of Defense” or “File Order Dismissing Appeal”) related to each case, and thus to each person. To determine how many prior offenses occurred while on probation, or whether the current offense occurred while on probation, we define a list of event descriptions indicating that an individual was taken on or off probation. Unfortunately, there appear to be missing events, as individuals often have consecutive “On” or consecutive “Off” events (e.g., two “On” events in a row, without an “Off” in between). To handle these cases, as well as cases where the first event is an “Off” event or the last event is an “On” event, we define two thresholds, \(t_{on}\) and \(t_{off}\). If an offense occurred within \(t_{on}\) days after an “On” event or \(t_{off}\) days before an “Off” event, we count the offense as occurring while on probation. We set \(t_{on}\) to 365 and \(t_{off}\) to 30 (see the code sketch after this list). On the other hand, the “number of times on probation” feature is just the count of “On” events and the “number of times the probation was revoked” feature is just the count of “File order of Revocation of Probation” event descriptions (i.e., we do not infer missing probation events for these two features).

  • Current age is defined as the age in years, rounded down to the nearest integer, on the COMPAS screening date.

  • A juvenile charge is defined as an offense that occurred prior to the defendant’s 18th birthday.

  • Labels and features were computed using charge data.

  • The final data set contains 1954 records and 41 features.
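The probation-window rule described above can be sketched as follows; the event and offense dates are simplified stand-ins for the Broward events/documents data.

```python
# Sketch of the probation-window rule described in the probation bullet above.
# The event and offense dates are simplified stand-ins for the Broward data.
from datetime import date, timedelta

T_ON = timedelta(days=365)   # window after an "On probation" event
T_OFF = timedelta(days=30)   # window before an "Off probation" event

def on_probation(offense_date: date, on_events: list, off_events: list) -> bool:
    """Count an offense as occurring while on probation if it falls within
    T_ON days after any "On" event or T_OFF days before any "Off" event."""
    after_on = any(on <= offense_date <= on + T_ON for on in on_events)
    before_off = any(off - T_OFF <= offense_date <= off for off in off_events)
    return after_on or before_off

# Example: an offense 200 days after being placed on probation counts.
print(on_probation(date(2015, 7, 20), [date(2015, 1, 1)], []))  # True
```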

Table 4 Features from Broward data set

Kentucky Data Processing

The Kentucky pretrial and criminal court data was provided by the Department of Shared Services, Research and Statistics in Kentucky. The Pretrial Services Information Management System (PRIM) data contains records regarding defendants, interviews, PRIM cases, bonds, etc., connected with the pretrial services interviews conducted between July 1, 2009 and June 30, 2018. Cases were restricted to those with misdemeanor, felony, and other-level charges. The data from another system, CourtNet, provided further information about cases, charges, sentences, dispositions, etc. for CourtNet cases matched in the PRIM system. The Kentucky data can be accessed through a special data request to the Kentucky Department of Shared Services, Research and Statistics. Please refer to Table 5 for all the raw datasets we processed, together with their sizes and the general information provided.

CourtNet and PRIM data were processed separately and then combined together. We describe the details below. The constructed features are presented in Table 6 at the end of this section.

  • For the CourtNet data, we filtered out cases with a filing date prior to Jan. 1st, 1996, which the Kentucky Department of Shared Services, Research and Statistics (which provided the data) indicated were less reliable records. To determine the type of crime involved in each charge (e.g., drug, property, or traffic-related crime), we used the Kentucky Uniform Crime Reporting Code (UOR Code), as well as keyword detection in the UOR description.

  • From the PRIM system data, we extracted the probation, failure to appear, case pending, and violent charge information at the PRIM case level, as well as the Arnold PSA risk scores computed at the time of each pretrial services interview. Since Kentucky did not use the Arnold PSA until July 1st, 2013, we filtered out records before this date. We omitted records without risk scores since we want to compare the performance of the PSA with that of other models; only 33 records are missing PSA scores, so the missing records are unlikely to affect the results. Additionally, some cases in the PRIM system have “indictment” as the arrest type, along with an “original” arrest case ID, indicating that those cases were not new arrests. We matched these cases with the records that correspond to the original arrests to avoid overcounting the number of prior arrests. We then inner-joined the data from the two systems using person-id and prim-case-id.

  • For each individual, we used the date that is 2 years before the latest charge date in the Kentucky data as the cutoff date. The data before the cutoff are used as criminal history information to compute features. The data after the cutoff are used to compute labels and check recidivism. In the data before the cutoff, the latest charge is treated as the current charge (i.e., the charge that would trigger a risk assessment) for each individual. We compute features and construct labels using only convicted charges; however, the current charge can be either convicted or non-convicted. This ensures that our analysis includes all individuals who would receive a risk assessment, even if they were later found innocent of the current charge that triggered it. It also ensures that criminal history features use only convicted charges, so that our risk assessments are not influenced by charges for crimes that the person may not have committed.

  • In order to compute the labels, we must ensure that there are at least 2 years of data following an individual’s current charge date. For individuals who are sentenced to prison due to their current charge, we consider their release date instead of the current charge date. We omitted individuals for whom there were less than 2 years of data between their current charge date or release date, and the last date recorded in the data set.

  • To obtain the age at the current charge, we first calculated the date of birth (DOB) for each individual, using the CourtNet case filing date and the age at the CourtNet case filing date. Then we calculated “age at current charge” using the DOB and the charge date (the charge date sometimes differs from the case filing date). Note that the age records contain many errors; for instance, some ages are recorded as over 150, which is certainly wrong but cannot be corrected. To ensure the quality of our data, we limited the final current age feature to be inclusively between 18 and 70, which is also consistent with the range used in the Broward analysis. If the person was not sentenced to prison, we define current age as the age at the current charge date. If the person was sentenced to prison, we compute current age by adding the sentence time to the age at the current charge date. Note that this differs from the way risk scores are computed in practice, since risk scores are usually computed prior to the sentencing decision. This helps to handle distributional shift between the individuals with no prison sentence (for whom a 2-year evaluation can be handled directly) and the full population (some of whom may have been sentenced to prison and cannot commit a crime during their sentence).

  • We computed features using the data before the current charge date. The CourtNet data is organized by CourtNet cases, and each CourtNet case has charge-level data. The PRIM data is organized by PRIM cases. Each CourtNet case can connect to multiple PRIM cases; this occurs because a new PRIM case is logged when an update occurs in the defendant’s CourtNet case (for example, if the defendant fails to appear in court). Therefore, to compute the criminal history information, we first grouped at the PRIM case level to summarize the charge information. Next, we grouped at the CourtNet case level to summarize the PRIM case level information. Last, we grouped at the individual level to summarize the criminal histories.

  • On computing the ADE feature: the ADE feature is the number of times the individual was assigned to alcohol and drug education (ADE) classes. Note that, by Kentucky state law, any individual convicted of a DUI is assigned to ADE classes. The feature does not indicate whether the individual successfully completed the classes.

  • We compute labels using the 2 years of data after the current charge date/release date. We constructed the general recidivism labels by checking whether a convicted charge occurred within 2 years or 6 months of the current charge date (or release date). Then, using the charge types of the convicted charges, we generated the other recidivism prediction labels, such as drug- or property-related recidivism (a sketch of this label construction appears after this list). The final data set contains 250,778 records and 40 features.

    Note: there are degrees of experimenter freedom in some of these data processing choices; exploring all the possible choices here is left for future studies.
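The cutoff and label construction described in this list can be sketched as follows; the column names are hypothetical stand-ins for the processed PRIM/CourtNet fields, and release-date handling is omitted for brevity.

```python
# Sketch of the Kentucky cutoff and label construction described in this list.
# Column names (person_id, charge_date, convicted) are hypothetical stand-ins for
# the processed PRIM/CourtNet fields; release-date handling is omitted for brevity.
import pandas as pd

def make_labels(charges: pd.DataFrame, window_days: int = 730) -> pd.DataFrame:
    """charges: one row per charge. Returns one recidivism label per person."""
    # Cutoff: 2 years before the latest charge date in the data set.
    cutoff = charges["charge_date"].max() - pd.Timedelta(days=730)
    rows = []
    for person_id, person in charges.groupby("person_id"):
        history = person[person["charge_date"] <= cutoff]
        if history.empty:
            continue                                   # no pre-cutoff charge to anchor on
        current_date = history["charge_date"].max()    # date of the "current charge"
        window_end = current_date + pd.Timedelta(days=window_days)
        # Label: any *convicted* charge within the window after the current charge date.
        recid = person[(person["charge_date"] > current_date) &
                       (person["charge_date"] <= window_end) &
                       (person["convicted"])]
        rows.append({"person_id": person_id, "recidivism": not recid.empty})
    return pd.DataFrame(rows)
```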

The Arnold PSA features that were included in the Kentucky data set (e.g., prior convictions, prior felony convictions etc.) were computed by pretrial officers who had access to criminal history data from both inside and outside of Kentucky. However, the Kentucky data set we received contained criminal history information from within Kentucky only. Thus, the Arnold PSA features for Kentucky (which are included in our models as well) use both in-state and out-of-state information, but the remaining features (which we compute directly from the Kentucky criminal history data) are limited to in-state criminal history.

Additionally, we were informed by the Kentucky Pretrial Services team that the data set's sentencing information may not be reliable due to unmeasured confounding, including shock probation and early releases that would allow a prisoner to be released much earlier than the end date of the sentence. Because the sentence actually served could be anywhere from zero days to the full length, we conducted a sensitivity analysis by excluding the sentence information in the data processing, which is equivalent to assuming that no prison sentence was served. For that analysis, the current age of each individual was calculated as the age at the current charge, and the prediction labels were generated from new charges within 6 months (or 2 years) of the current charge. The sensitivity analysis yielded predictive results that were almost exactly the same as the results in the main text, in which the sentence information was used to determine age and prediction interval.

Table 5 The table lists raw datasets obtained from the Kentucky Department of Shared Services, Research and Statistics, the number of records within each data frame, and general descriptions of the data
Table 6 Features from Kentucky data set

Why We Compare Only Against COMPAS and the PSA

The variables included in risk assessments are often categorized into static and dynamic factors. Static factors are defined as factors that cannot be reduced over time (e.g. criminal history, gender, and age-at-first-arrest). Dynamic factors are defined as variables that can change over time to decrease the risk of recidivism; they allow insight into whether a high-risk individual can lower their risk through rehabilitation, and sometimes improve prediction accuracy. Examples of dynamic factors include current age, treatment for substance abuse, and mental health status (Kehl et al. 2017). Dynamic factors are often included in risk-and-needs-assessments (RNAs), which in addition to identifying risk of recidivism, recommend interventions to practitioners (e.g., treatment programs, social services, diversion of individuals from jail).

With the exception of current age, our features all fall under the “static” classification. This renders us unable to compare against risk assessment tools that use dynamic factors, even those whose formulas are public. The risk assessments that we examined are listed in Table 7. Since we have only criminal history and age variables, the only model we could compute from our data was the Arnold PSA.

However, as we demonstrated in the main body of the paper, the fact that we do not possess dynamic factors is not necessarily harmful to the predictive performance of our models. The goal behind including dynamic factors in models is to improve prediction accuracy as well as be able to recommend interventions that reduce the probability of recidivism. While an admirable goal, the inclusion of dynamic factors does not come at zero cost and may not actually produce performance gains for recidivism prediction. In the Baseline Machine Learning Methods and “Recidivism Prediction Models Do Not Generalize Well Across Regions” sections, we show that standard machine learning techniques (using only the static factors) and interpretable ML models (using only static factors) are able to outperform a criminal justice model that utilizes both static and dynamic factors (COMPAS). Furthermore, the inclusion of additional, unnecessary factors increases the risk of data entry errors, or exposes models to additional feature bias (Corbett-Davies and Goel 2018). As Rudin et al. (2020) reveals, data entry errors appear to be common in COMPAS score calculations and could lead to scores that are either too high or too low.

Although the COMPAS suite is a proprietary (and thus black-box) risk-and-needs assessment, we were still able to compare against its risk assessments thanks to Florida’s strong open-records laws. Created by Northpointe (a subsidiary of Equivant), COMPAS is a recidivism prediction suite used in criminal justice systems throughout the United States. It comprises three scores: Risk of General Recidivism, Risk of Violent Recidivism, and Risk of Failure to Appear. In this work, we examine the two risk scores relating to violent recidivism and general recidivism. Each risk score is an integer from one to ten (Brennan et al. 2009).

Because COMPAS is a proprietary instrument, the precise forms of its models are not publicly available. However, it is known that the COMPAS scores are computed from a subset of 137 input variables that include vocational/educational status, substance abuse, and probational history, in addition to the standard criminal history variables (Brennan et al. 2009). As such, we cannot directly compute these risk scores, and instead use the COMPAS scores released by ProPublica in the Broward County recidivism data set. We do not compare against COMPAS on the Kentucky data set, as that data set does not include COMPAS scores.

The PSA was created by Arnold Ventures and is a publicly available risk assessment tool. Similar to the COMPAS suite, it comprises three risk scores: Failure to Appear, New Criminal Activity, and New Violent Criminal Activity. Again, we compare against the latter two scores. Both are additive integer models that take nine factors as input, relating to age, current charge, and criminal history. The New Criminal Activity model outputs a score from 1 to 6, while the New Violent Criminal Activity model outputs a binary score (Public Safety Assessment 2019). The PSA is an interpretable model.
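To illustrate the additive integer structure described above, the following sketch shows a generic point-score model; the factor names and point values are placeholders for illustration only and are not the published PSA weights.

```python
# Generic additive integer risk score, illustrating the structure described above.
# Factor names and point values are placeholders for illustration only; they are
# NOT the published PSA weights.
EXAMPLE_POINTS = {
    "age_at_arrest_under_23": 2,
    "pending_charge_at_offense": 1,
    "prior_misdemeanor_conviction": 1,
    "prior_felony_conviction": 1,
    "prior_violent_conviction": 2,
    "prior_failure_to_appear": 1,
    "prior_incarceration": 1,
}

def additive_score(factors, points=EXAMPLE_POINTS):
    """Sum the points for the factors that apply. In an instrument like the PSA,
    the raw total is then mapped to the reported scale via a conversion table."""
    return sum(pts for name, pts in points.items() if factors.get(name, False))

# Example: a defendant under 23 with one prior felony conviction scores 3 points.
print(additive_score({"age_at_arrest_under_23": True, "prior_felony_conviction": True}))
```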

Table 7 Variable comparison for currently-utilized actuarial risk assessments

Hyperparameters

Baseline Models, CART, EBM

We applied nested cross validation to tune the hyperparameters. Please refer to Table 8 for parameter details.

Table 8 Hyperparameters for \(\ell _1\) and \(\ell _2\) penalized logistic regression, linear SVM, CART, random forest, XGBoost, and EBM. RiskSLIM and additive stumps are discussed separately

Additive Stumps

Stumps were created for each feature as detailed in the “Preprocessing Features into Binary Stumps” section. An additive model was created from the stumps using \(\ell _1\)-penalized logistic regression, with no more than 15 original features involved in the additive models, although multiple stumps corresponding to each feature could be used. We chose to limit the model to 15 original features because at most 15 plots are then needed to visualize the full model, which is a reasonable number of visualizations for users to digest.

We started with the smallest regularization parameter on the \(\ell _1\) penalty that yields at most 15 original features in the model. This served as our lower bound for nested cross validation. From there, we performed nested cross validation over a grid of regularization parameters, all of which were greater than or equal to this minimum value. Please refer to Table 9 for more details.
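A minimal sketch of this construction, assuming numeric features and a user-supplied set of thresholds, is shown below; the thresholds, data, and grid of regularization values are placeholders rather than the paper's settings.

```python
# Sketch of the additive-stumps construction: expand each numeric feature into
# binary threshold indicators ("stumps"), fit l1-penalized logistic regression
# over the stumps, and shrink the penalty only as far as a 15-original-feature
# budget allows. Thresholds, data (stumps_df, y), and the grid of C values are
# placeholders, not the paper's settings.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def make_stumps(X: pd.DataFrame, thresholds: dict) -> pd.DataFrame:
    """Expand each feature into indicator columns 1[x >= t] for its thresholds."""
    stumps = {}
    for col, cuts in thresholds.items():
        for t in cuts:
            stumps[f"{col}>={t}"] = (X[col] >= t).astype(int)
    return pd.DataFrame(stumps, index=X.index)

def n_original_features(model, stump_cols):
    """Number of original features with at least one nonzero stump coefficient."""
    used = {c.split(">=")[0] for c, w in zip(stump_cols, model.coef_.ravel()) if w != 0}
    return len(used)

# Decrease C (i.e., increase the l1 penalty) until at most 15 original features remain:
# for C in [1.0, 0.5, 0.1, 0.05, 0.01]:
#     model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(stumps_df, y)
#     if n_original_features(model, stumps_df.columns) <= 15:
#         break
```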

Table 9 Hyperparameters for additive stumps

RiskSLIM

RiskSLIM is challenging to train because it uses the CPLEX optimization software, which can be difficult to install and requires a license. Moreover, since RiskSLIM solves a very difficult mixed-integer nonlinear optimization problem, it can be slow to prove optimality, which makes nested cross validation difficult because nested cross validation requires solving the optimization problem many times. A previous study (Smith 2016) noted similar problems with algorithms that use CPLEX (that study trained SLIM (Ustun and Rudin 2015), whose training process is similar to RiskSLIM’s in that both require CPLEX). Here we provide details of how we trained RiskSLIM to help others use the algorithm more efficiently.

  • We ran \(\ell _1\)-penalized logistic regression on the stumps training data with a relatively large regularization parameter to obtain a small subset of features (that is, we used \(\ell _1\)-penalized logistic regression for feature selection). Then we trained RiskSLIM using nested cross validation with this small subset of features. The maximum run-time, maximum offset, and penalty value were set to 1000 seconds, 100, and \(10^{-6}\), respectively. The coefficient range was set to \([-5, 5]\), which would give us small coefficients that are easy to add/subtract.

  • If the model converged to optimality (optimality gap less than 5%) within 1000 seconds, we then ran \(\ell _1\)-penalized logistic regression again with a smaller regularization parameter to obtain a slightly larger subset of features to work with. We then trained RiskSLIM with nested cross validation again on this larger subset of features. If RiskSLIM again achieved an optimality gap of less than 5% within 1000 seconds and had better validation performance, we repeated this procedure.

  • Once RiskSLIM either failed to converge to a 5% optimality gap within 1000 seconds or the validation performance did not improve by adding more stumps, we stopped and used the previously obtained RiskSLIM model as the final model.

  • This procedure generally stopped with between 12 and 20 stumps from \(\ell _1\)-penalized logistic regression. Beyond this number of stumps, we did not observe improvements in performance in validation.
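The following sketch outlines this iterative procedure; `train_riskslim` is a hypothetical stand-in for the RiskSLIM/CPLEX training call, and the stumps matrix, labels, and regularization grid are placeholders.

```python
# Sketch of the iterative procedure described above: use l1-penalized logistic
# regression on the stumps for feature selection, then train RiskSLIM on the
# selected subset, enlarging the subset while optimality and validation
# performance keep improving. `train_riskslim` is a hypothetical stand-in;
# the actual RiskSLIM/CPLEX configuration is not shown here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_riskslim(X, y, max_runtime, coef_range):
    """Hypothetical wrapper around the RiskSLIM/CPLEX solver; returns
    (model, optimality_gap, validation_auc). Not implemented in this sketch."""
    raise NotImplementedError

def select_stumps(stumps, y, C):
    """Column indices with nonzero l1-penalized logistic regression coefficients."""
    lr = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(stumps, y)
    return np.flatnonzero(lr.coef_.ravel())

best_model, best_auc = None, -np.inf
for C in [0.01, 0.02, 0.05, 0.1]:          # increasing C: weaker penalty, more stumps
    cols = select_stumps(stumps, y, C)      # stumps, y: placeholder numpy arrays
    model, gap, auc = train_riskslim(stumps[:, cols], y,
                                     max_runtime=1000, coef_range=(-5, 5))
    if gap > 0.05 or auc <= best_auc:       # no convergence or no validation gain
        break
    best_model, best_auc = model, auc
```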

See Figs. 8, 9, 10.

Fig. 8 Probabilities of 2-year and 6-month violent recidivism, given the age at current charge

Fig. 9 Base rates of all twelve types of recidivism on Kentucky data, conditioned (separately) on race and gender

Fig. 10 Calibration of the Arnold NVCA Raw, EBM and RiskSLIM for 2-year violent recidivism on Kentucky

See Tables 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26.

Table 10 Additive Stumps on two-year general recidivism
Table 11 Race and gender distributions for Kentucky
Table 12 Arnold Public Safety Assessment (PSA): New Criminal Activity (NCA)
Table 13 Arnold Public Safety Assessment (PSA): New Violent Criminal Activity (NVCA)
Table 14 Broward baseline models
Table 15 Kentucky baseline models
Table 16 AUCs of interpretable models on Broward data
Table 17 AUCs of interpretable models on Kentucky data
Table 18 Training baseline models and interpretable models on the Kentucky data set using fivefold nested cross validation and testing the best-performing model on the Broward data set
Table 19 Training baseline models and interpretable models on the Broward County data set using fivefold nested cross validation and testing the resulting best-performing model on a held out portion of the Broward data set
Table 20 Training baseline and interpretable models on the Broward County data set using fivefold nested cross validation and testing the resulting best-performing model on the Kentucky data set
Table 21 Training baseline models and interpretable models on the Kentucky data set using fivefold nested cross validation and testing the resulting best-performing model on a held out portion of the Kentucky data set
Table 22 AUCs of the Arnold NVCA Raw, EBM and RiskSLIM on Kentucky for two-year violent recidivism, conditioned on sensitive attributes. AUC ranges are also given for each sensitive attribute class
Table 23 Two year prediction problems—Kentucky
Table 24 Six month prediction problems—Kentucky
Table 25 Two year prediction problems—Broward
Table 26 Six Month Prediction Problems—Broward


Cite this article

Wang, C., Han, B., Patel, B. et al. In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction. J Quant Criminol 39, 519–581 (2023). https://doi.org/10.1007/s10940-022-09545-w
