Medical diagnostics is a field where technology and the law visibly intersect. The medical standard of care determines contractual and tort liability for medical malpractice. However, this standard is itself shaped by state-of-the-art technology. If doctors fail to use novel, ML-driven methods that are required by the applicable standard of care, liability potentially looms large. Conversely, if they apply models that produce erroneous predictions, they are equally threatened by liability. Importantly, both questions are intimately connected to explainability, as we argue below.
3.1.2 Legal liability
As the previous section has shown, predictions made by medical AI models are not, of course, fully accurate in every case. Hence, we shall ask what factors determine whether such a potentially erroneous model may be used by a medical doctor without incurring liability. Furthermore, with ML technology approaching, and in some cases even surpassing, human capacities in medical diagnostics, the question arises whether the failure to use such models may constitute medical malpractice. The relevant legal provisions under contract and tort law differ from country to country. Our legal analysis refers primarily to German and US law; nevertheless, general normative guidelines can be formulated.
Adoption For the sake of simplicity, we assume that all formal requirements for the use of the ML model in medical contexts are met. In April 2018, for example, the US Food and Drug Administration (FDA) approved IDx-DR, an ML tool for the detection of diabetes-related eye disorders (US Food and Drug Administration
2018). While liability for the adoption of new medical technology is an obvious concern for medical malpractice law (Katzenmeier
2006; Greenberg
2009), the issue has, to our knowledge, not been discussed with an explicit focus on the explainability of ML models. The related but different question of whether the avoidance of legal liability compels the adoption of such a model has rarely been discussed in the literature (Froomkin 2018; Greenberg 2009) and, to our knowledge, not at all by the courts. The answer to both questions crucially depends on whether it is considered negligent, under contractual and tort liability, (not) to use ML models during the treatment process. This, in turn, is determined by the appropriate medical standard of care.
Generally speaking, healthcare providers, such as hospitals or doctors’ practices, cannot be required to always purchase and use the very best products on the market. For example, when a new, more precise version of an X-ray machine becomes available, it would be ruinous for healthcare providers to be compelled to buy such new equipment immediately in every case. Therefore, they must be allowed to rely on their existing methods and products as long as these practices guarantee a satisfactory level of diagnostic accuracy, i.e., as long as they fall within the “state of the art” (Hart
2000). However, as new and empirically better products become available, the minimum threshold of acceptable accuracy moves upward. Otherwise, medical progress could not find its way into negligence norms. Hence, the content of the medical standard, whose fulfilment excludes negligence, is informed not only by experience and professional acceptance, but also (and in an increasingly dominant way) by empirical evidence (Hart
2000; Froomkin
2018).
Importantly, therefore, the acceptable level of accuracy could change with the introduction of new, more precise, ML-driven models. Three criteria, we argue, should be met for this to be the case. First, the use of the model must not, in itself, lead to medical malpractice liability. This criterion, therefore, addresses our first question concerning liability for the positive use of ML technology in medicine. For reliance on the model to be justified, there must be a significant difference between the performance of the model and that of human-only decision making in its absence. This difference must be shown consistently in a number of independent studies and be validated in real-world clinical settings—which is often lacking at the moment (Topol
2019). The superiority of the model cannot be measured only in terms of its accuracy (i.e., the ratio of correct predictions to all predictions); rather, other performance metrics, such as sensitivity (a measure of the rate of false negatives) or specificity (a measure of the rate of false positives), also need to be considered. Depending on the specific area, a low false positive or false negative rate may be just as desirable as, or even more important than, superior accuracy (Froomkin 2018; Topol 2019; Caruana 2015). For example, false negatives in tumor detection mean that the cancer can grow untreated—quite likely the worst medical outcome. The deep learning model for detecting Alzheimer’s disease, for example, had both higher specificity and higher sensitivity (and hence higher accuracy) than human radiologists (Ding 2018). In addition, for the use of the novel method to be legitimate in a given instance, its superiority must be plausible for the concrete case at hand. Under German law, for example, new medical methods meet the standard of care if the marginal advantages vis-à-vis conventional methods outweigh the disadvantages for an individual patient (Katzenmeier 2006); the same holds true for US law (Greenberg 2009).
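To make the performance metrics just mentioned concrete, the following minimal sketch computes accuracy, sensitivity, and specificity from the four cells of a binary confusion matrix; all counts are purely illustrative and do not refer to any of the studies cited above.

# Illustrative only: standard definitions of the metrics discussed above,
# computed from hypothetical confusion-matrix counts (not real study data).
def diagnostic_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of correct predictions
    sensitivity = tp / (tp + fn)                 # true positive rate; high value = few false negatives
    specificity = tn / (tn + fp)                 # true negative rate; high value = few false positives
    return accuracy, sensitivity, specificity

# Two hypothetical tumor-detection models with almost identical accuracy but
# very different false negative behavior:
print(diagnostic_metrics(tp=90, fp=15, tn=885, fn=10))  # accuracy 0.975, sensitivity 0.90
print(diagnostic_metrics(tp=70, fp=0, tn=900, fn=30))   # accuracy 0.970, sensitivity 0.70

As the second hypothetical model shows, near-identical accuracy can mask a much higher rate of missed tumors, which is why sensitivity and specificity must enter the legal assessment alongside accuracy.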
This has important implications for the choice between an explainable and a non-explainable model: the non-explainable model may be implemented only if the marginal benefits of its use (improved accuracy) outweigh the marginal costs. This depends on whether the lack of explainability entails significant risks for patients, such as risks of undetected false negative treatment decisions. As Ribeiro et al. (
2016) rightly argue, explainability is crucial for medical professionals to assess whether or not a prediction is based on plausible factors. The use of an explainable model facilitates the detection of false positive and false negative classifications, because it provides medical doctors with reasons for the predictions, which can be critically discussed (Lapuschkin et al.
2019; Lipton
2018) (see, in more detail, the next section). However, this does not imply that explainable models should always be chosen over non-explainable models. Clearly, if an explainable model performs as well as a non-explainable one, the former must be chosen [for examples, see Rudin (
2019) and Rudin and Ustun (
2018)]. But, if explainability reduces accuracy—which need not necessarily be the case, see Rudin (
2019) and below, Sect.
4—the choice of an explainable model will lead to some incorrect decisions that would have been made correctly under a non-explainable model with superior accuracy. Therefore, in these cases, doctors must diligently weigh the respective marginal costs and benefits of the models. In some situations, it may be possible, given general medical knowledge, to detect false predictions even without having access to the factors the model uses. In this case, the standard of care simply dictates that the model with significantly superior accuracy should be chosen. If, however, the detection of false predictions, particularly of false negatives, requires or is significantly facilitated by an explanation of the algorithmic model, the standard of care will necessitate the choice of the explainable model. Arguably, this will often be the case: it seems difficult to evaluate the model’s predictions in the field without access to the underlying factors used, precisely because it will often be impossible to say whether a divergence from traditional medical wisdom is due to a failure of the model or to its superior diagnostic qualities.
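The following stylized calculation, with entirely assumed numbers (prevalence, sensitivities, and the hypothetical rate at which doctors catch model errors), is meant only to illustrate the kind of marginal cost-benefit weighing described above, not to suggest actual figures.

# Purely hypothetical illustration of the trade-off discussed above: a black box
# model with higher sensitivity vs. an explainable model whose explanations help
# doctors catch a larger share of its false negatives. All parameters are assumptions.
def expected_missed_cases(n_patients, prevalence, sensitivity, doctor_catch_rate):
    false_negatives = n_patients * prevalence * (1 - sensitivity)
    return false_negatives * (1 - doctor_catch_rate)

# Black box model: higher sensitivity, but errors are hard to detect without explanations.
print(expected_missed_cases(10_000, 0.05, sensitivity=0.92, doctor_catch_rate=0.10))  # 36.0 missed
# Explainable model: somewhat lower sensitivity, but more of its errors are caught.
print(expected_missed_cases(10_000, 0.05, sensitivity=0.88, doctor_catch_rate=0.50))  # 30.0 missed

Under these assumed values, the nominally less accurate but explainable model produces fewer missed cases overall; with different assumptions the comparison can, of course, tip the other way, which is precisely why the weighing must be performed for the concrete clinical setting.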
Hence, general contract and tort law significantly constrains the use of non-explainable ML models—arguably, in more important ways than data protection law. Only if the balancing condition (between the respective costs and benefits of accuracy and explainability) is met should the use of the model be deemed generally legitimate (but not yet obligatory).
Second, for the standard of care to be adjusted upward, and hence the use of a model to become obligatory, it must be possible to integrate the ML model smoothly into the medical workflow. High accuracy does not translate directly into clinical utility (Topol
2019). Hence, a clinical suitability criterion is necessary since ML models pose particular challenges in terms of interpretation and integration into medical routines, as the Watson case showed. Again, such smooth functioning in the field generally includes the explainability of the model to an extent that decision makers can adopt a critical stance toward the model’s recommendations (see, in detail, below, Sect.
3.1.2, Use of the model). Note that this criterion is independent of the one just discussed: it does not involve a trade-off with accuracy. Rather, explainability per se is a general precondition for the duty (but not for the legitimacy) to use ML models: while it may be legitimate to use a black box model (our first criterion), there is, as a general principle, no duty to use it. A critical, reasoned stance toward a black box model’s advice is difficult to achieve, and the model will be difficult to integrate into the medical workflow. As a general rule, therefore, explainability is a necessary condition for a duty to use the model, but not for the legitimacy of its use. The clinical suitability criterion is particularly important in medical contexts where the consequences of false positive or false negative outcomes may be particularly undesirable (Caruana
2015; Zech et al.
2018; Rudin and Ustun
2018). Therefore, doctors must be in a position to check the reasons for a specific outcome. Novel techniques for local explainability of even highly complex models may provide such features (Ribeiro et al.
2016; Lapuschkin et al.
2019).
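As a hedged illustration of such local explainability techniques, the sketch below uses the LIME library (Ribeiro et al. 2016) on a public breast cancer dataset as a stand-in for a real diagnostic model; the dataset, classifier, and feature names are merely placeholders for a provider’s own validated system.

# Illustrative sketch: surfacing the factors behind a single prediction with LIME,
# so that a doctor can check whether the reasons given are medically plausible.
# The public dataset and random forest are placeholders, not a clinical model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)

# Local explanation for one patient: which features drove the prediction, and how strongly?
explanation = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=5)
for feature_rule, weight in explanation.as_list():
    print(f"{feature_rule}: {weight:+.3f}")

The point of such an output is not that doctors read code, but that a clinician-facing interface can translate these weighted factors into reasons that can be critically discussed, as the standard of care requires.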
Exceptionally, however, the use of black box models with supra-human accuracy may one day become obligatory if their field performance on some dimension (e.g., sensitivity) is exceptionally high (e.g., > 0.95) and hence there is a reduced need for arguing with the model within its high-performance domain (e.g., the avoidance of false negatives). While such extremely powerful, non-explainable models still seem quite a long way off in the field (Topol
2019), there may one day be a duty, restricted to their high-performance domain, to use them if suitable routines for cases of disagreement with the model can be established. For example, if a doctor disagrees with a close-to-perfect black box model, the case may be internally reviewed by a larger panel of (human) specialists. Again, if these routines can be integrated into the medical workflow, the second criterion is fulfilled.
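A minimal sketch of the kind of disagreement routine envisaged here might look as follows; the threshold, case attributes, and routing labels are illustrative assumptions rather than an established clinical or legal protocol.

# Hypothetical triage logic for disagreements with a near-perfect black box model:
# below the (assumed) threshold, the doctor simply decides and documents reasons;
# above it, disagreement triggers internal review by a specialist panel.
from dataclasses import dataclass

@dataclass
class Case:
    model_positive: bool
    doctor_positive: bool
    validated_sensitivity: float  # field performance in the model's validated domain

def route(case: Case, duty_threshold: float = 0.95) -> str:
    if case.validated_sensitivity < duty_threshold:
        return "advisory only: doctor decides, documenting professional reasons"
    if case.model_positive == case.doctor_positive:
        return "agreement: proceed with the shared recommendation"
    return "disagreement with near-perfect model: escalate to specialist panel review"

print(route(Case(model_positive=True, doctor_positive=False, validated_sensitivity=0.97)))

Whether such a routine suffices ultimately depends, as noted above, on whether it can be integrated into the ordinary medical workflow.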
Third, and finally, the cost of the model must be justified with respect to the total revenue of the healthcare provider for the latter to be obliged to adopt it (Froomkin 2018). Theoretically, licensing costs could be prohibitive for smaller practices. In that case, however, they will have to refer the patient to a practice equipped with the state-of-the-art ML tool.
While these criteria partly rely on empirical questions (particularly the first one), courts are in a position to exercise independent judgment with respect to their normative aspects (Greenberg
2009). Even clinical practice guidelines indicate, but do not conclusively decide, a (lack of) negligence in specific cases (Laufs
1990).
In sum, to avoid negligence, medical doctors need not resort to the most accurate product available, including ML models; rather, they must use state-of-the-art products that reach an acceptable level of accuracy. However, this level of accuracy should be adjusted upwards if ML models are shown to be consistently superior to human decision making, if they can be reasonably integrated into the medical workflow, and if they are cost-justified for the individual healthcare provider. The choice of the concrete model, in turn, depends on the trade-off between explainability and accuracy, which varies between different models.
Use of the model Importantly, even when the use of some ML model is justified or even obligatory, there must be room for reasoned disagreement with the model. Concrete guidelines for the legal consequences of the use of the model are largely lacking in the literature. However, we may draw on scholarship regarding evidence-based medicine to tackle this problem. This strand of medicine uses statistical methods (for example, randomized controlled trials) to develop appropriate treatment methods and to displace routines based on tradition and intuition where these are not upheld by empirical evidence (Timmermans and Mauck
2005; Rosoff
2001). The use of ML models pursues a similar aim. While it typically does not, at this stage, include randomized controlled trials (Topol
2019), it is also based on empirical data and seeks to improve on intuitive treatment methods.
Of course, even models superior to human judgment on average will generate some false negative and false positive recommendations. Hence, the use of the model should always be only one part of a more comprehensive assessment, which includes and draws on medical experience (Froomkin 2018). Doctors, or other professional agents, must not be reduced to mere executors of ML judgments. If there is sufficient, professionally grounded reason to believe the model is wrong in a particular case, its decision must be overridden. In this case, such a departure from the model must not trigger liability—irrespective of whether the model was in fact wrong or right in retrospect. This is because negligence law does not sanction damaging outcomes, as strict liability does, but attaches liability only to actions failing the standard of care. Hence, even if the doctor’s more comprehensive assessment eventually turns out to be wrong and the model prediction was right, the doctor is shielded from medical malpractice claims as long as the reasons for departing from the model were justified on the basis of professional knowledge and practice. Conversely,
not departing from a wrong model prediction would breach the standard of care if, and only if, the reasons for departure were sufficiently obvious to a professional (Droste
2018). An example may be an outlier case, which, most likely, did not form part of the training data of the ML model, cf. Rudin (
2019). However, as long as such convincing reasons for model correction cannot be advanced, the model’s advice may be heeded without incurring liability, even if it was wrong in retrospect (Thomas
2017). This is a key insight of the scholarship on evidence-based medicine (Wagner
2018). The reason for this rule is that, if the model is indeed provably superior to human professional judgment, following the model will on average produce less harm than a departure from the model (Wagner
2018). Potentially, a patient could, in these cases, direct a product liability claim against the provider of the ML model (Droste
2018).
Particularly in ML contexts, human oversight, and the possibility of disagreeing with the model on medical grounds, seem of the utmost importance: ML often makes mistakes humans would not make (and vice versa). Hence, the possibility of reasoned departure from even a supra-human model creates a machine-human team, which likely works better than either machine or human alone (Thomas
2017; Froomkin
2018). Importantly, the obligation to override the model when there are professional reasons to do so ensures that blindly following the model, and withholding individual judgment, is not a liability-minimizing strategy for doctors.