1 Introduction

Breast cancer is one of the most widespread cancers among women worldwide. According to the statistics published in 2020 by the Global Cancer Observatory (GCO), which is affiliated with the World Health Organization (WHO) [80], breast cancer has the highest incidence rate among the top 10 cancer types in females, at 47.8 cases per 100,000 persons, and comes second in mortality rate after lung cancer, at 13.6 deaths per 100,000 persons; the mortality rate is therefore roughly 28% of the incidence rate (13.6/47.8). Figure 1 shows the worldwide distribution of new cases and deaths for the top 10 cancers among females in 2020 [80]. Breast cancer may begin in the milk ducts, in which case it is called Invasive Ductal Carcinoma (IDC), or in the milk-producing glands, in which case it is called Invasive Lobular Carcinoma (ILC) [18]. Many factors are considered risk factors for breast cancer, such as family history, ageing, gene changes, race, exposure to chest radiation, and obesity [13].

Fig. 1 Distribution of cases and deaths for the top 10 most common cancers in 2020 among females for Incidence and Mortality [80]

Early detection of breast cancer increases the survival rate of this disease. Regular screening is therefore considered one of the most important tools for detecting this type of cancer early. Mammography is considered one of the most effective screening modalities for detecting breast cancer at early stages [54, 83]; it can reveal different abnormalities in the breast even before any symptoms appear. With the significant development of machine learning and image processing techniques, several studies have proposed methods for breast cancer detection and classification in an attempt to create more effective Computer-Aided Detection and Diagnosis (CAD) systems for breast cancer.

CAD systems can be categorized into two types: Computer-Aided Detection (CADe) systems and Computer-Aided Diagnosis (CADx) systems. CADe mainly localizes and detects the masses or abnormalities that appear in medical images and leaves the interpretation of these abnormalities to the radiologist. CADx, on the other hand, classifies the masses and supports the radiologist's decision making about the identified abnormalities [14].

This review aims to cover the significant and well-known approaches introduced in the field of breast mass detection and classification using conventional machine learning and deep learning. Furthermore, the paper traces the evolution of the models that were introduced over the past ten years. It also presents the current challenges and discusses the models proposed in the literature and their limitations.

This survey highlights the current screening modalities, mammogram projections, and the different public mammography datasets. It also presents a quantitative dataset-based comparison between the deep learning-based models on the most widely used public datasets. Furthermore, it highlights the advantages and limitations of the conventional machine learning-based and deep learning-based CAD systems.

The paper aims to answer the following questions:

  • RQ1: What is the pipeline for the breast cancer CAD system, and what are the phases of developing such a system?

  • RQ2: What are the breast screening modalities, and the public mammographic datasets?

  • RQ3: What are the recent techniques that are used in developing CAD systems?

  • RQ4: What are the evaluation measures that are currently used for breast cancer CAD systems assessment?

  • RQ5: What are the limitations, challenges and future work in breast cancer detection and classification?

The paper is organized as follows. Section 2 describes the survey methodology. Section 3 gives an overview of the screening modalities and the publicly available mammography datasets. Section 4 presents the breast cancer CAD systems (conventional and deep learning-based), followed by section 5, which provides a dataset-based quantitative comparison between the deep learning-based CAD systems. Section 6 presents the evaluation metrics for detection and classification tasks in CAD systems, and finally, section 7 provides a discussion and conclusion. The organization of the survey is shown in Fig. 2.

Fig. 2 Organization of the survey

2 Survey methodology

In this survey, the authors searched for articles through PubMed, Springer, Science Direct, Google Scholar, and the Institute of Electrical and Electronics Engineers (IEEE). The included articles were published in English. The survey covers most of the published articles on mammography mass detection and classification from 2009 to early 2021; some additional articles are referenced for background context. Figure 3 shows the flow of information for the identification of the studies via databases, the screening, and the studies included in the literature, according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).

Fig. 3 Flow diagram according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)

The following exclusion criteria were applied:

C1: Papers that are not in the form of a full research paper.

C2: Papers that are not written in English.

C3: Papers that do not report results.

C4: Papers that do not use mammographic datasets.

C5: Papers that do not include mass detection, mass segmentation or mass classification.

3 Breast screening modalities

Different modalities are used for breast screening, such as mammogram (MG), Magnetic Resonance Imaging (MRI), Digital Breast Tomosynthesis (DBT), and ultrasound [43].

A mammogram (MG) is a non-invasive screening technique that uses X-rays to image the breast tissue. It can reveal masses and calcifications. Moreover, it is considered the most effective and sensitive screening modality, as it can help reduce the mortality rate through early detection of breast cancer even before any symptom appears.

Magnetic Resonance Imaging (MRI) relies on strong magnets and radio waves to produce detailed pictures of the inside of the breast. This modality is considered helpful for women at high risk of breast cancer.

Ultrasound uses sound waves to generate images of the internal structure of the breast. It is used for women who are at high risk of breast cancer and cannot undergo MRI, and for pregnant women who should not be exposed to the X-rays used in MG. Ultrasound is also commonly used to screen women who have dense breast tissue.

Digital Breast Tomosynthesis (DBT) is a more recent technology that the U.S. Food and Drug Administration (FDA) approved in 2011. DBT produces a more advanced form of mammogram using a low dose of X-rays. The resulting 3D mammographic images can reveal masses and calcifications in more detail, which can be very useful for radiologists, especially when diagnosing dense breasts [9].

3.1 Mammography projections

A mammogram is considered the most effective and sensitive screening modality. MRI and ultrasound are used as a supplement to the mammogram, especially in cases with highly dense breast tissue; however, this does not mean that they can replace mammography [26].

Multiple mammogram views are used to provide more information before detection/diagnosis. The two main views are cranio-caudal (CC) and mediolateral oblique (MLO), as shown in Fig. 4. A CC view mammogram is taken horizontally from an upper projection at a C-arm angle of 0°; the breast is compressed between two paddles to reveal the glandular tissue and the surrounding fatty tissue, and a correctly positioned CC view also shows the outermost edge of the chest muscle. An MLO view mammogram is captured at a C-arm angle of 45° from the side; the breast is compressed diagonally between the paddles, which allows a larger part of the breast tissue to be imaged compared to other views. In addition, the MLO projection allows the pectoral muscles to appear in the mammographic image [11, 75].

Fig. 4 (a) Cranio-Caudal (CC) projection, and (b) Medio-Lateral Oblique (MLO) projection from INBreast Dataset

The two main abnormalities that can be revealed in mammograms are breast masses and calcifications. Breast masses may be cancerous or non-cancerous; cancerous tumours appear in mammograms with irregular edges and spikes extending from the mass. Non-cancerous masses, on the other hand, often appear with round or oval shapes and well-defined edges [15].

Breast calcifications can be categorized into macrocalcifications and microcalcifications [59]. Macrocalcifications appear as large white dots spread randomly over the breast in the mammogram and are considered non-cancerous. Microcalcifications appear as small calcium spots that look like white specks in the mammogram, and they often appear in clusters. Microcalcification is usually considered a primary indication of early breast cancer or a sign of existing precancerous cells. Figure 5 illustrates a mass and a calcification in an image from the INBreast dataset.

Fig. 5 Illustration for mass and calcification from INBreast Dataset [25]

4 Mammographic datasets

Different datasets are publicly available. These datasets differ in size, resolution, image format, image type (Full-Field Digital Mammography (FFDM), Film Mammography (FM), or Screen-Film Mammography (SFM)), and the types of abnormalities included. Table 1 compares the publicly available datasets, namely the digital database for screening mammography (DDSM), INBreast, Mini-MIAS, the curated breast imaging subset of DDSM (CBIS-DDSM), BCDR, and OPTIMAM.

Table 1 Breast cancer mammographic datasets (DM: Digital Mammogram, FFDM: Full-Field Digital Mammography, SFM: Screen-Film Mammography)

4.1 The digital database for screening mammography (DDSM)

DDSM is composed of 2620 scanned film mammography studies, divided into 43 volumes. Each case has four mammographic images, as each breast was captured in two projections, the Mediolateral Oblique (MLO) and Cranio-Caudal (CC) views. The dataset also includes the ground truth and the types of the suspected regions as pixel-level annotations. Each case has a file that contains the date of the study, the age of the patient, the breast density score according to the American College of Radiology Breast Imaging Reporting and Data System (ACR BI-RADS), and the size and scanning resolution of each image. The images are in Joint Photographic Experts Group (JPEG) format with different sizes and resolutions [39].

4.2 Curated breast imaging subset of DDSM (CBIS-DDSM)

This dataset is an enhanced version of the DDSM. It includes decompressed images with updated mass segmentation and bounding boxes for the region of interest (ROI). The data were selected and curated by trained mammographers, and the images are in Digital Imaging and Communication in Medicine (DICOM) format. The dataset is 163.6 GB in size and comprises 6775 studies and 10,239 images, consisting of mammograms and their corresponding mask images. CSV files attached to the dataset provide the pathological information for the patients. There are four CSV files: the mass training set, the mass testing set, the calcification training set and the calcification testing set. The mass training set has images for 1318 tumours, while the mass testing set has images for 378 tumours. The calcification training set has images for 1622 calcifications, and the calcification testing set has images for 326 calcifications [45].
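The split sizes quoted above can be turned into held-out fractions with a few lines of arithmetic. The following is a minimal sketch that only uses the counts stated in the text; the dictionary layout and the function name `held_out_fraction` are illustrative assumptions and not part of the dataset distribution.

```python
# Lesion counts per split, copied from the dataset description in the text.
# The dictionary layout is an illustrative assumption.
CBIS_DDSM_SPLITS = {
    "mass": {"train": 1318, "test": 378},
    "calcification": {"train": 1622, "test": 326},
}


def held_out_fraction(split):
    """Return the share of lesions reserved for testing in one split."""
    total = split["train"] + split["test"]
    return split["test"] / total


for lesion_type, split in CBIS_DDSM_SPLITS.items():
    total = split["train"] + split["test"]
    share = held_out_fraction(split)
    print(f"{lesion_type}: {total} lesions, {share:.1%} held out for testing")
```

Under these counts, a little over a fifth of the mass lesions and about a sixth of the calcifications are reserved for testing.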

4.3 INBreast

INBreast contains 115 cases with a total of 410 images. Of the 115 cases, 90 were diagnosed with cancer in both breasts. The dataset covers four types of findings: masses, calcifications, asymmetries, and distortions. It contains images in the CC and MLO views, saved in DICOM format. The dataset also provides the Breast Imaging-Reporting and Data System (BI-RADS) score for breast density [56].

4.4 Mini-MIAS

The dataset includes 322 digitized films together with the ground truth markings for any existing abnormality. The categories included in the dataset are calcifications, masses, architectural distortion, asymmetry, and normal. The images were reduced in size to 1024 × 1024 pixels. They are publicly available in the Pilot European Image Processing Archive (PEIPA), which belongs to the University of Essex [77].

4.5 BCDR

The BCDR is divided into two mammographic repositories: (1) the Film Mammography-based Repository (BCDR-FM) and (2) the Full Field Digital Mammography-based Repository (BCDR-DM). The BCDR repositories provide normal and abnormal breast cancer cases with their mammography lesion outlines and related clinical data. The BCDR-FM includes 1010 cases (998 females and 12 males), with 1125 studies and 3703 mammographic images in the MLO and CC views, and 1044 identified lesions. The BCDR-DM is still under construction; it currently contains 724 cases (723 females and 1 male) and 1042 studies, and it provides 3612 MLO and CC mammography images and 452 identified lesions [50].

4.6 OPTIMAM

OPTIMAM is composed of more than 2.5 million images that were collected from three UK breast screening centres for 173,319 cases; all of the cases are women. The dataset is divided into 154,832 cases with normal breasts, 6909 cases with benign findings, 9690 cases with identified lesions and 1888 cases with interval cancers. It provides unprocessed and processed medical images, together with region of interest annotations and clinical data relating to the identified cancers and the interval cancers [38].
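The case counts above also show how imbalanced screening data are. The sketch below checks that the four categories add up to the stated total and computes the share of each category; the category labels used as dictionary keys are shorthand chosen for this example.

```python
# Case counts copied from the text; the key names are illustrative shorthand.
OPTIMAM_CASES = {
    "normal": 154_832,
    "benign": 6_909,
    "identified lesions": 9_690,
    "interval cancer": 1_888,
}

total_cases = sum(OPTIMAM_CASES.values())
shares = {name: count / total_cases for name, count in OPTIMAM_CASES.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total_cases} cases")
```

With almost nine in ten cases being normal, a classifier trained on such data without re-weighting or re-sampling can reach high accuracy while missing most abnormal cases, which is one reason sensitivity and AUC are usually reported alongside accuracy.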

5 Breast cancer CAD systems

Over the past decades, machine learning has contributed significantly to creating more reliable CAD systems for breast cancer diagnosis that can help radiologists in reading and interpreting mammograms. Many studies have introduced models for breast cancer diagnosis and prognosis from mammograms, and many of these methods showed very promising performance; however, they have not been tested on a unified large database. Breast cancer CAD systems are composed of phases that differ based on the task of the CAD system. As shown in Fig. 6, these phases can be divided into pre-processing [31, 76], mass detection, mass segmentation [8, 33, 81], feature extraction [27, 28, 44, 49] and mass classification. The figure also presents the most commonly used techniques in each of these phases.

Fig. 6 Most of the used techniques in the different phases of breast cancer CAD systems

Lusted was the first to discuss the analysis of radiographic abnormalities using computers [52], in 1955, as shown in Fig. 7. In the 1960s and 1970s, researchers started to work toward automated methods for the classification and detection of abnormalities in medical images, including breast images. In 1987, a team from the University of Chicago introduced an automated system that can aid the radiologist in detecting microcalcifications in mammograms by providing the radiologist with the analysis output of the image [17].

Fig. 7 Timeline for the evolution of breast cancer CAD systems

As shown in Fig. 7, research efforts toward CAD systems increased from the 1990s onwards. In 1998, the U.S. Food and Drug Administration (FDA) approved the first CADe system; from 2000 to 2004, researchers then evaluated CADe [30, 37] to assess the effectiveness of its clinical use and its impact on the cancer detection rate in mammography. From the beginning of 2009 to 2017, different conventional machine learning-based CAD systems were introduced to enhance the detection and classification of abnormalities in mammograms. With the appearance of deep learning networks, researchers started in the middle of 2017 to adopt deep learning models and the transfer learning concept to develop more accurate mammographic CAD systems. Systems proposed in 2018 that adopted deep learning detection models showed very promising results for abnormality detection. From 2018 until now, researchers have started to create end-to-end models for mammographic CAD systems.

In this survey, the existing CAD systems are categorized into conventional CAD systems and deep learning-based CAD systems. Figure 8 illustrates the pipelines of the conventional machine learning-based CADe/CADx and the deep learning-based CADe/CADx. The conventional machine learning pipeline starts with image pre-processing, then mass segmentation, followed by feature extraction and selection, and finally classification. The deep learning-based CAD pipeline goes through the same phases, except that feature extraction and classification are performed as a single phase, because deep learning models extract the features automatically during training. In CADe systems, the process stops at the mass segmentation/detection phase.

Fig. 8 Illustration for (a) Conventional machine learning based CADe/CADx system pipeline, (b) Deep learning-based CADe/CADx system pipeline
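The two pipelines and the point at which a CADe system stops can be summarised schematically. The following is a minimal sketch, not an implementation of any surveyed system: the stage names paraphrase the description above, and the function `stages_for` is a hypothetical helper introduced only for illustration.

```python
# Stage names paraphrase the pipeline description; they are labels only.
CONVENTIONAL_STAGES = [
    "pre-processing",
    "mass segmentation",
    "feature extraction and selection",
    "classification",
]

DEEP_LEARNING_STAGES = [
    "pre-processing",
    "mass segmentation/detection",
    "joint feature learning and classification",
]


def stages_for(pipeline, task):
    """Return the stages executed for a task.

    'CADx' runs the whole pipeline; 'CADe' stops after the
    segmentation/detection stage.
    """
    if task == "CADx":
        return list(pipeline)
    if task == "CADe":
        for index, name in enumerate(pipeline):
            if "segmentation" in name:
                # Keep everything up to and including segmentation.
                return list(pipeline[: index + 1])
        raise ValueError("pipeline has no segmentation stage")
    raise ValueError(f"unknown task: {task!r}")


print(stages_for(CONVENTIONAL_STAGES, "CADe"))
print(stages_for(DEEP_LEARNING_STAGES, "CADx"))
```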

5.1 Conventional CAD systems

Several studies have attempted to develop CAD systems that can act as a second opinion or helper for radiologists. These attempts started with traditional computer vision techniques based on conventional machine learning and image processing. This section describes some of these studies in detail.

Rejani and Thamarai Selvi (2009) [65] presented an algorithm for tumour detection in mammograms. Their work addressed two problems: how to extract the features that characterize the tumours, and how to detect the masses, especially those that have low contrast with their background. For mammogram enhancement, they applied a Gaussian filter, a top-hat operation to eliminate the background, and the Discrete Wavelet Transform (DWT). The mass region was segmented using thresholding, morphological features were then extracted from the segmented regions, and a Support Vector Machine (SVM) was used for classification. Their approach achieved a sensitivity of 88.75%; however, it needs to be tested on larger datasets, as the method was applied to only 75 mammograms from the mini-MIAS dataset.
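Threshold-based segmentation of the kind mentioned above can be sketched in a few lines. The study does not state how the threshold was chosen, so the example below assumes Otsu's method, a common automatic choice, purely for illustration; the function names and the toy image are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises between-class variance.

    Otsu's method is an assumed choice here, not the surveyed method.
    Pixels <= threshold are background, pixels > threshold foreground.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(levels):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already background
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def segment(image, threshold):
    """Return a binary mask with 1 where the pixel exceeds the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


# Toy 3 x 4 "mammogram" patch: a bright blob on a dark background.
toy_image = [
    [10, 12, 11, 200],
    [9, 210, 205, 198],
    [11, 10, 12, 13],
]
flat = [value for row in toy_image for value in row]
mask = segment(toy_image, otsu_threshold(flat))
print(mask)
```

In a real system the input would first be enhanced (for example with the Gaussian and top-hat filtering described above), since a global threshold is sensitive to background tissue of similar brightness.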

Ke et al. (2010) [42] introduced a system that detects masses based on texture features. They used bilateral comparison to detect the masses and locate the region of interest (ROI). They used the fractal dimension and the two-dimensional entropy to extract texture features from the ROI. The ROIs were classified as mass or normal using an SVM. They ran their experiment on 106 mammograms, and the results showed that their automated diagnosis method achieved a sensitivity of 85.11%.

Dong et al. (2015) [24] proposed an automated system to detect and classify breast masses in mammographic images. They extracted the position of the masses and the ROI using the chain codes provided with the DDSM dataset; the intensity values were then mapped linearly to new values in the grey-level range from 0 to 255. They applied the Rough Set (RS) method to further enhance the ROIs. To segment the masses from the ROIs, they used an improved Vector Field Convolution Snake (VFCS), which proved robust to interference from blurry tissue. Multiple features were extracted from the segmented masses and the background of the ROIs. For classification, they applied two classifiers: the first was an SVM optimized with particle swarm optimization (PSO) and a genetic algorithm (GA), while the second was a random forest (RF). They ran their experiment on the DDSM and MIAS datasets. The results showed that the first method outperformed the second, with an accuracy of 97.73% on the DDSM; however, their work needs to be evaluated on a larger sample, obtained through augmentation or by using a larger dataset.
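The linear grey-level mapping mentioned above corresponds to a min-max rescaling. The sketch below shows one straightforward way to do it, assuming the minimum of the ROI is mapped to 0 and the maximum to 255; the exact mapping used in the study may differ, and the function name is hypothetical.

```python
def rescale_to_uint8(image):
    """Linearly map grey levels so that the minimum becomes 0 and the maximum 255."""
    flat = [value for row in image for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        # A constant image carries no contrast; map every pixel to 0.
        return [[0 for _ in row] for row in image]
    return [
        [round((value - low) * 255 / (high - low)) for value in row]
        for row in image
    ]


# Toy ROI with raw intensities between 100 and 300.
roi = [[100, 150], [250, 300]]
print(rescale_to_uint8(roi))
```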

Rouhi et al. (2015) [68] also proposed two different methods for mass segmentation. The ROIs were cropped based on the chain codes of the DDSM dataset. Histogram equalization and median filtering were applied to reduce the noise. For segmentation, they implemented two different techniques, namely a region growing-based method and a cellular neural network-based method. They applied a Genetic Algorithm (GA) with different chromosome structures and fitness functions for feature selection. The masses were classified as benign or malignant using different classifiers, namely Multi-Layer Perceptron (MLP), Random Forest (RF), Naïve Bayes (NB), Support Vector Machine (SVM) and K-Nearest Neighbour (KNN). They ran their experiment on the DDSM and MIAS datasets. Their method showed a high sensitivity of 96.87% for classification when the second segmentation technique was used; however, the results varied between DDSM and MIAS, as shown in Table 2.
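Region growing, the first of the two segmentation techniques mentioned above, can be illustrated with a basic variant. The sketch below grows a 4-connected region from a seed pixel and accepts neighbours whose intensity lies within a fixed tolerance of the seed value; the tolerance criterion, the seed and the toy image are assumptions made for this example and do not reproduce the surveyed method.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a 4-connected region from `seed`.

    A neighbour joins the region when its intensity differs from the
    seed intensity by at most `tolerance` (an assumed, simple criterion).
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


toy_roi = [
    [10, 10, 10, 10],
    [10, 200, 210, 10],
    [10, 205, 10, 10],
    [10, 10, 10, 220],
]
grown = region_grow(toy_roi, seed=(1, 1), tolerance=20)
print(grown)
```

The bright pixel in the bottom-right corner is within the tolerance but is not connected to the seed, so it is left out; this connectivity constraint is what distinguishes region growing from plain thresholding.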

Table 2 Pros and limitations of some breast cancer conventional machine learning based CAD systems

Mughal et al. (2017) [58] used texture and colour features in a system that detects and classifies masses in mammograms. They applied Contrast Limited Adaptive Histogram Equalization (CLAHE) to enhance the contrast of the mammogram, and used the mean filter together with the wavelet transform to reduce the noise. They introduced a segmentation method composed of two phases. In the first phase, they extracted the normal breast region by highlighting the pectoral muscle in order to remove it. To highlight the pectoral muscle, the greyscale image was transformed to RGB and then to hue saturation value (HSV), and each RGB value was represented with a value in the range from 0 to 1. In the second phase, they extracted the abnormal breast boundary region by creating a texture image using a function based on an entropy filter. Moreover, they used a mathematical morphology function to extract and refine the ROI. They applied mathematical expressions to extract the intensity, texture, and morphological features. Different classifiers, namely SVM, decision tree, KNN, and bagging tree, were used for classification. The SVM with a quadratic kernel showed the best results, achieving a sensitivity, specificity, and accuracy of 98.40%, 97.00%, and 96.9%, respectively, for DDSM and 98.00%, 97.00%, and 97.5% for MIAS.
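Sensitivity, specificity and accuracy, which are reported throughout this section, all derive from the four entries of a binary confusion matrix. The sketch below uses the standard definitions; the counts in the usage example are hypothetical and are not taken from any of the surveyed studies.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: malignant cases classified correctly/incorrectly.
    tn/fp: benign cases classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, for illustration only.
sens, spec, acc = confusion_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, accuracy={acc:.2f}")
```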

Punitha et al. (2018) [61] presented an automated detection method for masses in mammograms. They used a Gaussian filter to smooth the grey-level variations and reduce the noise in the image. An enhanced version of the region growing method, combined with the dragonfly optimization technique, was used for segmentation. Forty-five features were extracted from the ROIs; the Grey Level Co-occurrence Matrix (GLCM) and the Grey Level Run Length Matrix (GLRLM) were used for texture analysis and to extract other features. A feed-forward network was used for classification, trained using back propagation with the Levenberg-Marquardt algorithm. In the experiment, they used 146 malignant cases and 154 benign cases from DDSM, divided into 200 images for training and 100 images for testing. Using the dragonfly algorithm with region growing improved their segmentation results and, accordingly, the classification, as the approach achieved a sensitivity of 98.1% and a specificity of 97.8%.
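A grey-level co-occurrence matrix counts how often a pixel with grey level i has a neighbour with grey level j at a fixed offset, and texture features are then computed from the normalised counts. The sketch below builds a non-symmetric GLCM for a horizontal offset and derives two common features, contrast and angular second moment (often called energy); the choice of offset, features and toy image is illustrative and does not reproduce the forty-five features of the study.

```python
def glcm(image, levels, offset=(0, 1)):
    """Count co-occurrences of grey levels (i, j) at the given (row, col) offset."""
    matrix = [[0] * levels for _ in range(levels)]
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + d_row, c + d_col
            if 0 <= nr < rows and 0 <= nc < cols:
                matrix[image[r][c]][image[nr][nc]] += 1
    return matrix


def glcm_features(matrix):
    """Return (contrast, angular second moment) of a GLCM normalised to probabilities."""
    total = sum(sum(row) for row in matrix)
    contrast = 0.0
    asm = 0.0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            p = count / total
            contrast += (i - j) ** 2 * p
            asm += p * p
    return contrast, asm


# Toy 4 x 4 patch quantised to four grey levels.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
cooc = glcm(patch, levels=4)
contrast, asm = glcm_features(cooc)
print(cooc, contrast, asm)
```

In practice the GLCM is usually computed for several offsets and angles and the features are averaged, which makes the descriptor less sensitive to the orientation of the tissue.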

In the same year, Suhail et al. [78] proposed an approach to classify the microcalcifications in mammographic images as benign or malignant. Their approach uses a two-stage scalable Linear Discriminant Analysis (LDA) algorithm for feature extraction and dimensionality reduction, in which the binary classification data is encoded into a one-dimensional representation of the microcalcification data. The classification was performed using five classifiers: K-NN, SVM, decision tree (DT), Bayesian network, and alternating decision tree (ADTree). To evaluate the performance of their approach, they compared their technique (scalable LDA) with the PCA/LDA technique. The results showed that the scalable LDA outperformed PCA/LDA. The classification accuracies for SVM, Bayesian network, K-NN, DT and ADTree were 96%, 97.5%, 97.2%, 97.5%, and 98.5%, respectively. This work can be extended to classify masses in addition to microcalcifications.

The Extreme Learning Machine (ELM) [23], a type of feedforward network with a single hidden layer, was used in some studies [55, 57]. Mohanty et al. (2020) [55] proposed a CAD system for mass classification that showed high accuracy with a reduced number of features. Their system classifies mammograms as normal or abnormal, and further classifies a mass as benign or malignant. They used the DDSM, MIAS and BCDR datasets to validate their approach. In their approach, chaotic maps and the concept of weights are fused into the salp swarm algorithm to select the optimal feature set and to tune the parameters of the kernel ELM (KELM) algorithm. Their approach is divided into four steps: first, they generated the ROI using the ground truth locations; then they extracted the Tsallis entropy, energy-Shannon entropy, and Rényi entropy from the ROI through the discrete wavelet transform (DWT); for feature reduction, they applied principal component analysis (PCA) [1]; finally, they used a modified ELM-based learning approach for classification. Their technique achieved an accuracy of 99.62% for MIAS and 99.92% for DDSM for normal-abnormal classification. For benign-malignant classification, it showed an accuracy of 99.28% for MIAS, 99.63% for DDSM, and 99.60% for BCDR. Although their model can classify mammograms in real time, the reliance on manually cropped ROIs is considered a weak point in such an automated CAD system.
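The three entropy measures named above have simple closed forms once a probability distribution is available. The sketch below uses the standard definitions of Shannon, Rényi and Tsallis entropy; the way the distribution is obtained here, by normalising squared coefficient values in `to_distribution`, is an assumption made for illustration and may differ from the feature definitions used in the study.

```python
import math


def to_distribution(coefficients):
    """Normalise squared coefficients (e.g. wavelet sub-band values) to sum to 1.

    This energy-based normalisation is an assumption for the example.
    """
    energies = [c * c for c in coefficients]
    total = sum(energies)
    return [e / total for e in energies]


def shannon_entropy(probs):
    """H = -sum(p * ln p)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def renyi_entropy(probs, alpha):
    """H_alpha = ln(sum(p ** alpha)) / (1 - alpha), for alpha != 1."""
    if alpha == 1:
        raise ValueError("alpha must differ from 1")
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def tsallis_entropy(probs, q):
    """S_q = (1 - sum(p ** q)) / (q - 1), for q != 1."""
    if q == 1:
        raise ValueError("q must differ from 1")
    return (1 - sum(p ** q for p in probs if p > 0)) / (q - 1)


# Four coefficients of equal magnitude give a uniform distribution.
probs = to_distribution([1.0, -1.0, 1.0, -1.0])
print(shannon_entropy(probs), renyi_entropy(probs, 2), tsallis_entropy(probs, 2))
```

For a uniform distribution over four outcomes, the Shannon and Rényi entropies both equal ln 4, while the Tsallis entropy with q = 2 equals 0.75; both generalised entropies approach the Shannon entropy as their parameter approaches 1.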

Muduli et al. (2020) [57] merged the ELM with the Moth Flame Optimization (MFO) algorithm, a meta-heuristic algorithm, to tune the ELM network parameters (i.e., the weights and the biases of the hidden nodes) and thereby resolve the ill-conditioning problem in the hidden layer of the network. They also fused PCA and LDA for feature reduction, which reduced the computational time. The approach achieved an accuracy of 99.94% for MIAS and 99.68% for DDSM; however, the work needs to be evaluated on a larger sample of data.

Table 2 summarizes some of the breast cancer detection methods that are based on conventional machine learning models, illustrating the advantages and limitations of these studies, the task of each proposed model, the results and the datasets used.

5.2 The deep learning-based CAD system

Recently, many promising deep learning models used in computer vision have significantly improved the performance of CAD systems, especially Convolutional Neural Networks (CNNs), the transfer learning approach, and deep learning-based object detection models. Several CAD algorithms based on deep learning models have been proposed.

Dhungel et al. (2017) [22] introduced a CAD tool for mass detection, segmentation and classification in mammographic images with minimal user intervention. For mass detection, they used a random forest and a cascade of deep learning models, followed by hypothesis refinement. To segment the masses, they extracted a partial image around each detected mass and refined it using active contour models. For classification, they used a deep learning model that was pre-trained on hand-crafted feature values. The work was tested on the INBreast dataset. The results showed that the system detected almost 90% of the masses at 1 false positive per image, while the segmentation reached a Dice index of 0.85 and the classification reached a sensitivity of 0.98.
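The Dice index used above to score segmentation measures the overlap between a predicted mask and a reference mask as twice the size of their intersection divided by the sum of their sizes. The sketch below implements this standard definition for binary masks stored as nested lists; the masks in the usage example are hypothetical.

```python
def dice_index(predicted, reference):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal shape."""
    intersection = 0
    size_predicted = 0
    size_reference = 0
    for row_p, row_r in zip(predicted, reference):
        for p, r in zip(row_p, row_r):
            if p and r:
                intersection += 1
            if p:
                size_predicted += 1
            if r:
                size_reference += 1
    if size_predicted + size_reference == 0:
        # Two empty masks agree perfectly by convention.
        return 1.0
    return 2 * intersection / (size_predicted + size_reference)


# Hypothetical masks: three pixels each, two of them shared.
predicted_mask = [[1, 1, 0], [0, 1, 0]]
reference_mask = [[1, 0, 0], [0, 1, 1]]
print(dice_index(predicted_mask, reference_mask))
```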

In the same year, Geras et al. [34] developed a Deep Convolutional Network (DCN) that can handle multiple views of screening mammography, as the network takes the CC and MLO views of each breast of a patient. Furthermore, the model works on large high-resolution images with a size of 2600 × 2000 pixels; it learned to predict the assessment of a radiologist, classifying each image according to the Breast Imaging-Reporting and Data System (BI-RADS) [47] as “incomplete”, “normal” or “benign”. In their work, they investigated the impact of the dataset size and the image resolution on the screening performance. The results showed that performance increased with the size of the training set, and that the model achieved its best performance at the original resolution. In a reader study performed on a random subset of the private dataset used in their experiments, the model achieved a macro-averaged AUC (macAUC) of 0.688, while a committee of radiologists achieved a macAUC of 0.704.
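The multi-class score reported above is assumed here to be a macro-averaged AUC, that is, the unweighted mean of the one-vs-rest AUC of each class. The sketch below computes it directly from the pairwise-ranking definition of AUC; the class scores and labels are hypothetical and the function names are introduced only for this example.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auc(class_scores, true_classes):
    """Average the one-vs-rest AUC over all classes with equal weight."""
    aucs = []
    for cls, scores in class_scores.items():
        labels = [1 if t == cls else 0 for t in true_classes]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Hypothetical predictions for four exams and three BI-RADS-style classes.
true_classes = ["incomplete", "normal", "benign", "normal"]
class_scores = {
    "incomplete": [0.7, 0.2, 0.1, 0.3],
    "normal": [0.2, 0.6, 0.3, 0.25],
    "benign": [0.1, 0.2, 0.6, 0.45],
}
print(macro_auc(class_scores, true_classes))
```

Because every class contributes equally to the average, the macro-averaged AUC is not dominated by the most frequent class, which matters in screening data where most exams are normal.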

Al-antari et al. (2018) [5] proposed a deep belief network-based CAD system. To detect the initial suspicious regions, they used an adaptive thresholding method, which achieved an accuracy of 86%. They adopted two ways of extracting the ROIs. In the first, multiple mass regions of interest were extracted by randomly selecting four non-overlapping ROIs of size 32 × 32 pixels around the centre of each mass. The second technique extracts the whole mass region of interest, either as a rectangular box placed around the mass or as an irregular shape extracted manually. Morphological and statistical features were extracted from these ROIs for classification, and different classifiers were applied, namely Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Neural Network (NN), and Deep Belief Network (DBN). The DBN outperformed the other classifiers, with an accuracy of 92.86% using the first ROI extraction technique and 90.48% using the second.
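Extracting several small non-overlapping patches around a mass centre can be sketched as follows. The study selects the four ROIs randomly; the example below is a simplification that places them deterministically on a 2 × 2 grid meeting at the centre, and it assumes that the mass lies at least one patch width away from the image border. The function names are hypothetical.

```python
def roi_boxes_around(center_row, center_col, size=32):
    """Return four (top, left, bottom, right) boxes on a 2 x 2 grid meeting at the centre.

    Bottom and right are exclusive. The deterministic grid is a
    simplification of the random placement described in the text.
    """
    boxes = []
    for row_shift in (-size, 0):
        for col_shift in (-size, 0):
            top = center_row + row_shift
            left = center_col + col_shift
            boxes.append((top, left, top + size, left + size))
    return boxes


def crop(image, box):
    """Cut one patch out of an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Synthetic 200 x 200 image whose pixel value encodes its position.
synthetic = [[r * 200 + c for c in range(200)] for r in range(200)]
boxes = roi_boxes_around(100, 100)
patches = [crop(synthetic, box) for box in boxes]
print(boxes)
```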

Shen et al. (2020) [71] proposed a framework based on adversarial learning to detect masses in mammograms, in an attempt to automate and thereby facilitate the annotation of masses. The framework consists of two networks: the first is a Fully Convolutional Network (FCN) that predicts the spatial density, and the second is a domain discriminator that performs domain transfer, using adversarial learning to align the features of the sparsely annotated target domain with those of the richly annotated source domain. The FCN takes the source and target domains as input and generates a pixel-wise heatmap for each, in which every pixel indicates whether the corresponding input pixel belongs to a mass lesion. The heatmap of the target domain is then fed into the domain discriminator, which is used to reduce the difference between the heatmap distributions of the source and target domains. Compared with state-of-the-art approaches, their approach achieved an AUC score of 0.9083 on a private dataset and 0.8522 on INBreast.

5.2.1 Transfer learning

Transfer learning has recently become one of the key techniques for enhancing the performance of learning models. It can be defined as transferring the knowledge acquired for one task to solve a related task. Transfer learning is now widely used in CAD systems to address the problem of insufficient data, and it also reduces the computational cost and the time needed to train the models [82]. It is a well-known methodology in deep learning, where pre-trained models are adapted to other tasks such as computer vision tasks. This shortens the time needed compared with developing a neural network from scratch, and it alleviates the difficulty of obtaining vast amounts of labelled data, considering the time and effort that labelling requires [94].

Several recent studies adopted the transfer learning approach in developing their CAD systems [2, 40, 46]. Ragab et al. (2019) [62] introduced a CAD system that classifies mammogram masses as benign or malignant. They applied two segmentation techniques: the first manually cropped the ROI using a circular contour provided with the dataset, while the second used thresholding and a region-based method to crop the ROI automatically. The features were extracted using a deep CNN based on the AlexNet architecture, and these features were then passed through the last fully connected layer of the CNN to an SVM classifier. Based on their results, the second segmentation technique outperformed the first. The best results the model achieved on DDSM were an accuracy of 80.5%, an AUC of 88%, and a sensitivity of 77.4%. Moreover, the segmentation accuracy increased to 73.6% when using samples from the CBIS-DDSM dataset, and the classification accuracy improved to 87.2% with an AUC of 94%.

Ansar et al. (2020) [12] presented a MobileNet-based model that classifies the masses in mammograms as malignant or benign, with performance competitive with state-of-the-art architectures at a lower computational cost. The approach first detects the masses by classifying the mammograms as cancerous or non-cancerous using a CNN; the cancerous ones are then fed into a pre-trained MobileNet-based model for classification. They compared their model with VGG-16, AlexNet, VGG-19, GoogLeNet and ResNet-50. Their model showed competitive performance, with an accuracy of 86.8% for DDSM and 74.5% for CBIS-DDSM.

5.2.2 Deep learning-based object detection (single-shot and two-shot detectors)

Deep learning has replaced hand-crafted features by automatically learning the image features most relevant to a specific task. Object detection is one of the areas in which deep learning has shown very promising performance. Deep learning-based object detection techniques can be classified into two types: one-stage detectors, which are based on regression or classification, and two-stage detectors, which are based on region proposals [89]. Anchor boxes are the key concept behind both types, and they are one of the main factors that affect how well the detector finds the objects within the image [90].

A one-stage detector takes a single pass (one shot) over the image to detect multiple objects. In contrast, approaches based on a region proposal network (RPN) work in two stages: the first generates candidate region proposals, and the second detects the object for each candidate. One-stage detectors are much faster than two-stage detectors because detection and classification are done simultaneously over the whole image in a single pass; however, RPN-based approaches have shown more accurate results [87].
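To make the anchor box concept mentioned above concrete, the following Python sketch generates a dense set of anchors over a feature-map grid. It is a generic illustration rather than the anchor scheme of any specific detector in this survey; the function name `generate_anchors` and the choice of `stride`, `scales` and `aspect_ratios` are assumptions.

```python
import math


def generate_anchors(feature_size, stride, scales, aspect_ratios):
    """Return anchors as (x_min, y_min, x_max, y_max) in image pixels.

    One anchor is created per (grid cell, scale, aspect ratio). The aspect
    ratio is width / height, and every anchor of a given scale has the same
    area (scale ** 2).
    """
    rows, cols = feature_size
    anchors = []
    for r in range(rows):
        for c in range(cols):
            # Center of the grid cell, mapped back to image coordinates.
            cx = (c + 0.5) * stride
            cy = (r + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feature_size=(2, 3), stride=16,
                           scales=[32, 64], aspect_ratios=[0.5, 1.0, 2.0])
```

A detector then predicts, for each anchor, an objectness or class score and offsets that refine the anchor into a final bounding box, which is why the choice of scales and ratios matters for masses of very different sizes.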

Ribli et al. (2018) [67] adopted a two-stage (two-shot) detector, Faster R-CNN [66], to build a system that detects, localizes and classifies abnormalities in mammograms. Because of the low quality of the digitized film-screen mammograms in DDSM, they mapped the pixel values to optical density and then rescaled them to the 0–255 range. In their experiments, they observed that higher-resolution images gave good results. They used the DDSM dataset together with a private dataset for training, and the INbreast dataset for testing. The final layer of their model classifies the masses as benign or malignant, and the model generates a bounding box for each detected mass along with a confidence score that indicates the class to which the mass belongs. Their model achieved an AUC of 0.95 for classification and detected 90% of the malignant masses in the INbreast dataset at 0.3 false positives per image. A limitation of this work is that it was tested only on INbreast, owing to the lack of publicly available pixel-annotated datasets, so the model should be tested on larger datasets before the results can be generalized.

In [6, 10], Al-antari et al. used You Only Look Once (YOLO) [64] to detect masses in mammograms. In [6] (2018), they proposed a fully automated breast cancer CAD system based on deep learning in all three phases: mass detection, segmentation and classification. They used YOLO to detect and localize the masses, a Full Resolution Convolutional Network (FRCN) to segment the detected masses, and a pre-trained CNN based on the AlexNet architecture to classify the segmented masses as benign or malignant. The system achieved a mass detection accuracy of 98.96%, a segmentation accuracy of 92.97%, and a classification accuracy of 95.64%. In [7] (2020), they extended the model introduced in [6] with improvements to the classification and segmentation phases. With these improvements, YOLO achieved a detection accuracy of 97.27%, and the breast lesion segmentation accuracy was 92.97%. A CNN, ResNet-50, and InceptionResNet-V2 were used for classification and achieved average overall accuracies of 88.74%, 92.56%, and 95.32%, respectively.

Cao et al. (2021) [16] proposed a novel model for detecting breast masses in mammograms, together with a new data augmentation technique to counter the overfitting caused by the small dataset. Their augmentation technique is based on local elastic deformation; it enhanced the performance of their model, although it is slower to compute than traditional augmentation techniques. In their approach, they first segment the breast to remove most of the background using Gaussian filtering and the Otsu thresholding method. For mass detection, they used FSAF [93], an enhanced version of RetinaNet. The model produced an average of 0.495 false positives per image on INBreast and 0.599 false positives per image on DDSM.
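The Otsu thresholding step used above for background removal can be sketched in a few lines of Python. This is the standard Otsu criterion (maximizing the between-class variance of the intensity histogram) written for a flat list of 8-bit pixel values; it is not the code of [16], the Gaussian smoothing step is omitted, and the function name `otsu_threshold` is an assumption.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes the between-class variance.

    Pixels with value <= t are treated as background and pixels with
    value > t as foreground (here, breast tissue).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no background pixels yet
        if w_fg == 0:
            break     # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy "image": dark background pixels and brighter tissue pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]  # True marks the foreground
```

On a real mammogram the same function would be applied to the flattened, smoothed image, and the resulting mask used to crop the breast region.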

5.2.3 End to end models

End-to-end (E2E) learning replaces a pipeline of several modules in a complex learning system with a single model (a deep neural network). E2E training enhances the performance of the model because it allows a single optimization criterion, instead of optimizing each module separately under different criteria as in a pipelined architecture [35]. Several recent studies have built their models on the E2E training approach and reported promising results.

Shen et al. (2017) introduced in [70] a CNN-based end-to-end model to detect and classify masses within the whole mammographic image. In 2019 [70], they improved on the work introduced in [70] by classifying local image patches with a model pre-trained on a labelled dataset that provides ROI data. They initialized the weights of the whole-image classifier with the weights of the pre-trained patch classifier. They used two pre-trained CNN models, ResNet50 and VGG16, to build four classification models. They trained the patch and whole-image classifiers on CBIS-DDSM and then, using transfer learning, transferred the whole-image classifier for testing on the INbreast dataset. The image patches were classified into five classes: background, benign mass, malignant mass, benign calcification, and malignant calcification. On CBIS-DDSM, the best single model achieved a per-image AUC of 0.88, and averaging the four models raised the AUC to 0.91. On INbreast, the best single model achieved a per-image AUC of 0.95, and averaging the four models raised the AUC to 0.98. In this work, the images were downsized because of GPU limitations, which led to the loss of some ROI information; retaining this information might change the performance of the approach.

Agnes et al. (2020) [4] presented a Multiscale CNN which is based on an end-to-end training strategy. The main task of their model is to classify the mammographic images into normal and malignant. Their model is mainly divided into two parts; context feature extraction and mammogram classification. The model implements a multi-level convolutional network that can extract both high- and low-level contextual features from the image. The model achieved an accuracy of 96.47% with an AUC of 0.99 for the mini-MIAS dataset.

Table 3 provides a summary of the pros, limitations, tasks and results of some of the deep learning-based breast cancer CAD systems.

Table 3 Pros and limitations of some breast cancer deep learning-based CAD systems

6 Dataset based quantitative comparison

Tables 4 and 5 provide a quantitative comparison, on a per-dataset basis, between selected techniques from the literature. The selected techniques used DDSM, CBIS-DDSM and INBreast in their work, as these are the datasets most commonly used by state-of-the-art models.

Table 4 Quantitative comparison for results of mass detection and classification techniques that used DDSM / CBIS-DDSM datasets
Table 5 Quantitative comparison for results of mass detection and classification techniques that used INBreast dataset

Tables 4 and 5 show that there has been improvement in both mass detection and classification. Still, mass detection needs more work to increase the number of true positives detected, and hence the sensitivity, as the best sensitivity reached was 90%. In addition, the specificity, which reflects the true negatives, still needs extensive effort to improve.

7 Evaluation metrics for breast cancer CAD systems

This section presents the metrics most commonly used to evaluate the performance of breast cancer CAD systems. Various performance measures are used to evaluate and analyze CAD systems at the detection and classification stages [91].

Intersection over Union (IoU) is one of the most used methods to evaluate the performance of the detection. IoU represents the amount of overlap between the ground truth and the predicted bounding box. It can be calculated as shown in Eq. (1).

$$ \mathrm{IoU}=\frac{\mathrm{area}\left(A\cap B\right)}{\mathrm{area}\left(A\cup B\right)},\quad \text{where } A \text{ is the predicted bounding box and } B \text{ is the ground truth box} $$
(1)
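As a worked example of Eq. (1), the following Python sketch computes the IoU of two axis-aligned boxes. The box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
print(iou(predicted, ground_truth))  # 25 / 175, roughly 0.143
```

In detection benchmarks, a predicted box is usually counted as a true positive when its IoU with a ground truth box exceeds a chosen threshold; the threshold value is a design choice and varies between studies.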

Also, sensitivity, specificity and accuracy are used for evaluating both abnormality detection and classification. Furthermore, the confusion matrix must be taken into consideration as it represents the number of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). Table 6 provides an illustration for the structure of the confusion matrix where:

  • True Positive (TP): the number of masses present in the mammogram that the system correctly detected or classified as positive.

  • False Positive (FP): the number of negative cases that are incorrectly detected or classified as positive.

  • True Negative (TN): the number of negative cases that are correctly detected or classified as negative.

  • False Negative (FN): the number of masses that exist in the mammogram but were not detected or were not classified correctly.

Table 6 Illustration for the basic framework of the confusion matrix

Sensitivity, also called Recall or True Positive Rate (TPR), is the probability that an actual positive (a mass present in the mammogram) is correctly identified as positive; it is calculated as shown in Eq. (2). Specificity, also called True Negative Rate (TNR), is the probability that an actual negative (no abnormality in the mammogram) is correctly identified as negative, as shown in Eq. (3).

$$ \mathrm{Sensitivity}=\frac{TP}{TP+ FN} $$
(2)
$$ \mathrm{Specificity}=\frac{TN}{TN+ FP} $$
(3)
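Eqs. (2) and (3) can be checked on a small example. The Python sketch below counts the confusion-matrix entries from binary labels and then computes sensitivity and specificity; the label encoding (1 for positive, 0 for negative), the toy data, and the function names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, 1 = positive, 0 = negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Eq. (2): fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Eq. (3): fraction of actual negatives that are correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Toy example: 4 positive cases and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn))  # 3 / 4 = 0.75
print(specificity(tn, fp))  # 4 / 6, roughly 0.667
```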

Accuracy (Acc) is the ratio of the number of correct predictions to the total number of predictions; it describes the performance of the system over all classes, as shown in Eq. (4). In addition to these measures, the Receiver Operating Characteristics (ROC) curve is one of the important performance measures for CAD systems, as it represents the trade-off between the TPR and the False Positive Rate (FPR) at different classification thresholds. The ROC curve is plotted with the TPR on the y-axis against the FPR on the x-axis. The area under the ROC curve (AUC) indicates the extent to which the system can distinguish between the positive and negative classes. Figure 9 shows an illustrative example with three classifiers, where each of the curves A, B and C is a ROC curve and the AUC is the area under that curve. The higher the AUC, the better the model; for example, in Fig. 9 curve A has the highest AUC, which means that classifier A performs better than B and C. The AUC ranges from 0 to 1: it is 0 when every prediction of the model is wrong, 0.5 when the model does no better than random guessing, and 1 when the model correctly distinguishes all negatives from positives.

$$ \mathrm{Acc}=\frac{TP+ TN}{TP+ TN+ FP+ FN} $$
(4)
Fig. 9 Receiver operating characteristics (ROC)
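The accuracy of Eq. (4) and the ROC/AUC construction described above can be reproduced with a short Python sketch. The threshold sweep and the trapezoidal integration shown here are one common way to obtain the curve and its area; the function names and the toy scores are assumptions for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (4): fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    A case is predicted positive when its score is >= the threshold.
    Both classes must be present in y_true.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative")
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))  # 8 / 9, roughly 0.889
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly by the scores, which is exactly the fraction that the AUC reports.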

The F1-Score is another measure used to evaluate a model’s performance in binary classification, such as classifying masses as benign or malignant. It depends on the precision and recall and is calculated as shown in Eq. (5):

$$ \mathrm{F}1-\mathrm{Score}=2\times \frac{precision\times recall}{precision+ recall} $$
(5)
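A small Python sketch of Eq. (5) follows. Precision is not defined explicitly in this section; the sketch uses the standard definition TP / (TP + FP), and recall is the sensitivity of Eq. (2). The function names and the toy counts are assumptions for illustration.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Same quantity as the sensitivity in Eq. (2): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Eq. (5): harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts: 3 true positives, 2 false positives, 1 false negative.
print(f1_score(tp=3, fp=2, fn=1))  # precision 0.6, recall 0.75, F1 roughly 0.667
```

Because it ignores true negatives, the F1-Score is often reported alongside accuracy when the benign and malignant classes are imbalanced.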

Mean Average Precision (mAP) is another metric, used mostly to evaluate object detection models. It is obtained by calculating the Average Precision (AP) for each class and then averaging the AP over all classes, as in Eq. (6), where Q is the number of classes and AP(q) is the average precision for a given class q.

$$ mAP=\frac{\sum_{q=1}^Q AP(q)}{Q} $$
(6)
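The following Python sketch illustrates Eq. (6). The per-class AP is computed here with a simple non-interpolated definition (the mean of the precision values at the ranks where a true positive occurs); detection benchmarks often use interpolated variants, so this choice, the function names, and the toy rankings are assumptions for illustration.

```python
def average_precision(is_true_positive, n_ground_truth):
    """Non-interpolated AP for one class.

    is_true_positive lists the detections of that class sorted by descending
    confidence; an entry is True when the detection matches a ground truth
    object (for example, by an IoU criterion).
    """
    hits = 0
    precision_sum = 0.0
    for rank, hit in enumerate(is_true_positive, start=1):
        if hit:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / n_ground_truth if n_ground_truth else 0.0


def mean_average_precision(ap_per_class):
    """Eq. (6): average of AP(q) over the Q classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ap = {
    "benign": average_precision([True, False, True], n_ground_truth=2),
    "malignant": average_precision([False, True], n_ground_truth=1),
}
print(mean_average_precision(ap))  # (0.8333 + 0.5) / 2, roughly 0.667
```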

8 Discussion and conclusion

In summary, this survey highlights the current deep learning and conventional machine learning techniques for mammographic CAD systems, along with the datasets and concepts involved. The studies that adopted conventional machine learning techniques and algorithms showed good performance with high accuracy rates; however, these techniques do not perform well with large datasets, and almost all of them depend on expert-crafted features. Over the last few years, conventional ML techniques have evolved, especially with the appearance of deep learning techniques.

Recently, various researchers have started to use deep learning models in an attempt to create more reliable CAD systems with lower false-positive rates. Although the review showed that deep learning techniques deliver very promising performance and contribute significantly to the development of CAD systems, these techniques still have limitations, especially the lack of datasets, which complicates their clinical applicability.

From Tables 2, 3, 4 and 5, the current challenges in CAD system development can be summarized as follows: (1) The insufficient amount of data used in the experiments is one of the major obstacles, especially for medical images; increasing the number of mammographic images is difficult because mammograms annotated at the pixel level and image level are very hard to find or acquire. (2) Mass localization and detection is still a challenging task because the features of masses in dense breasts resemble those of normal tissue, and these tissues also mask the cancerous cells [32]. Moreover, mass sizes vary widely [53], which makes the task even more challenging, especially for small masses. (3) Reducing the false-positive rate and increasing the specificity and sensitivity need more work. (4) Selecting the optimal parameters for DL models is another challenge that needs more investigation in order to build more robust CAD systems.

Because the publicly available datasets contain an insufficient number of mammographic images, data augmentation techniques are needed to create synthetic mammographic images, especially with the appearance of the Generative Adversarial Network (GAN) [36]. Although some researchers have started to work in this direction [69, 84], more investigation is still needed to generate mammographic images at large scale in an attempt to solve the class imbalance problem in the available mammography datasets. Moreover, GANs can generate more realistic images than those produced by traditional augmentation techniques such as rotation, flipping, cropping, translation, noise injection and colour transformation [74], which may improve performance and increase the ability of the models to detect and classify lesions correctly.
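For reference, two of the traditional augmentation operations listed above (flipping and rotation) can be sketched in plain Python on an image stored as a list of rows. When the image carries a mass annotation, the bounding box must be transformed in the same way. The function names and the box convention `(x_min, y_min, x_max, y_max)` in pixel-edge coordinates are assumptions for illustration.

```python
def flip_horizontal(image):
    """Mirror the image left to right (image is a list of pixel rows)."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def flip_box_horizontal(box, image_width):
    """Apply the same horizontal flip to a (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


image = [[1, 2, 3],
         [4, 5, 6]]
print(flip_horizontal(image))      # [[3, 2, 1], [6, 5, 4]]
print(rotate_90_clockwise(image))  # [[4, 1], [5, 2], [6, 3]]
print(flip_box_horizontal((0, 0, 1, 2), image_width=3))  # (2, 0, 3, 2)
```

Such geometric transformations only rearrange existing pixels, which is why they add less variation than synthetic images generated by a GAN.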

Also, more investigation is needed to develop new data augmentation techniques that preserve the mass features while adding variation at the morphological level. Furthermore, other strategies can be used to overcome the problem of insufficient data, such as using pre-trained models, where the pre-trained weights are transferred to initialize the network and the parameters are then fine-tuned through training [72, 73].

Deep learning-based object detection models such as YOLO and Faster R-CNN are among the recent customizable techniques that have achieved better detection accuracy and improved mass detection and localization within the mammographic image. However, the detection of small masses still needs more investigation, especially for masses that lie very close to each other. Training these models with enough data containing more images with small masses may improve their performance on small masses, and fine-tuning the bounding boxes can also help in overcoming this problem.

The position of the body and the imaging angle vary across mammographic masses, so recognizing texture at different angles is important when performing texture analysis of the masses [19]. Although many studies in the literature presented models that use morphological features such as texture, color, and so on, some recent studies such as [88] pointed out the absence of neighbourhood-invariant components in CNNs, which therefore cannot adequately respond to image transformations or to changes caused by different imaging viewpoints when classifying mammographic masses. They proposed a novel approach that fuses rotation-invariant features, texture features and deep learning to classify masses in mammograms. This problem can be considered a new challenge that needs more investigation, as the idea could be extended to exploit rotation-invariant features in mass detection by using Rotation Invariant Fisher Discriminative Convolutional Neural Networks (RIFD-CNN) for object detection [20].

The review showed that recent studies have started to use more than one mammographic view, such as the MLO and CC views, in classification; using more than one view proved more effective for mammogram classification than using single-view images [48, 86]. Accordingly, utilizing multi-view mammographic images in mass detection needs more investigation and research effort, as it can improve the sensitivity and specificity of mass detection and classification by preserving more information and features from both views. The studies also showed that full-resolution images can give better accuracy [84], so systems that retain the full resolution of the mammographic image are needed to minimize the information loss caused by downsizing, which degrades image quality.

Based on the aforementioned discussion, breast cancer CAD system development still needs more research effort to solve the current challenges, especially for DL models, which suffer from the lack of annotated data; building deep learning models that can learn from a small amount of data is therefore one of the open challenges.

This survey reviewed the literature of the past ten years on state-of-the-art methodologies for breast cancer CAD systems, specifically for mass detection and classification. This work aims to help in building CAD systems that can be applied clinically to assist in breast cancer diagnosis. The review evaluates some of the studies presented in the literature by presenting their pros and limitations. The survey gives an overview of the main phases of a CAD system and the techniques used in these phases. Moreover, it lists the breast screening modalities and the publicly available mammographic datasets, and it provides a dataset-based quantitative comparison between the most recent techniques. In addition, the evaluation metrics used for CAD assessment are presented. Finally, the survey presents the current challenges that need more investigation to improve the efficiency and performance of these systems.