Automated measurement of fetal head circumference using 2D ultrasound images

Thomas L. A. van den Heuvel; Dagmar de Bruijn; Chris L. de Korte; Bram van Ginneken

doi:10.1371/journal.pone.0200412

Abstract

In this paper we present a computer-aided detection (CAD) system for automated measurement of the fetal head circumference (HC) in 2D ultrasound images for all trimesters of the pregnancy. The HC can be used to estimate the gestational age and monitor growth of the fetus. Automated HC assessment could be valuable in developing countries, where there is a severe shortage of trained sonographers. The CAD system consists of two steps: First, Haar-like features were computed from the ultrasound images to train a random forest classifier to locate the fetal skull. Secondly, the HC was extracted using Hough transform, dynamic programming and an ellipse fit. The CAD system was trained on 999 images and validated on an independent test set of 335 images from all trimesters. The test set was manually annotated by an experienced sonographer and a medical researcher. The reference gestational age (GA) was estimated using the crown-rump length measurement (CRL). The mean difference between the reference GA and the GA estimated by the experienced sonographer was 0.8 ± 2.6, −0.0 ± 4.6 and 1.9 ± 11.0 days for the first, second and third trimester, respectively. The mean difference between the reference GA and the GA estimated by the medical researcher was 1.6 ± 2.7, 2.0 ± 4.8 and 3.9 ± 13.7 days. The mean difference between the reference GA and the GA estimated by the CAD system was 0.6 ± 4.3, 0.4 ± 4.7 and 2.5 ± 12.4 days. The results show that the CAD system performs comparably to an experienced sonographer. The presented system shows similar or superior results compared to systems published in the literature. This is the first automated system for HC assessment evaluated on a large test set which contained data of all trimesters of the pregnancy.

Citation: van den Heuvel TLA, de Bruijn D, de Korte CL, Ginneken Bv (2018) Automated measurement of fetal head circumference using 2D ultrasound images. PLoS ONE 13(8): e0200412. https://doi.org/10.1371/journal.pone.0200412

Editor: Constantino Carlos Reyes-Aldasoro, City University London, UNITED KINGDOM

Received: January 16, 2018; Accepted: June 26, 2018; Published: August 23, 2018

Copyright: © 2018 van den Heuvel et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The ultrasound data has been made available as part of a medical image analysis challenge at https://hc18.grand-challenge.org/ and on Zenodo, DOI 10.5281/zenodo.1322001. All other relevant data are included within the paper and its Supporting Information files.

Funding: This research was partially funded by the Life Sciences & Health for Development Fund (LSH14ET04), https://english.rvo.nl/subsidies-programmes/lsh4d. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Ultrasound imaging is widely used for screening and monitoring of pregnant women, since it is a low-cost, real-time and non-invasive imaging method. However, acquisition of ultrasound images is operator-dependent and the images are characterized by attenuation and speckle and may contain artifacts such as shadows and reverberations, making their interpretation complex. During the ultrasound screening examination, biometric measurements of the fetus such as the crown-rump length (CRL) and the head circumference (HC) are often computed to determine the gestational age (GA) and to monitor growth of the fetus. The CRL is the most accurate measurement for estimating the GA of the fetus between 8 weeks and 4 days (commonly noted as: 8⁺⁴ weeks) and 12⁺⁶ weeks. After 13 weeks, the HC is the most accurate measurement for determining the GA, because it is no longer possible to measure the CRL accurately. The guidelines state that HC should be measured in a transverse section of the head with a central midline echo, interrupted in the anterior third by the cavity of the septum pellucidum with the anterior and posterior horns of the lateral ventricles in view [1]. The biometric measurements are obtained manually, which leads to inter- and intra-observer variability. An accurate automated system could reduce measuring time and variability, because it does not suffer from intra-observer variability. Worldwide, 99% of all maternal deaths occur in developing countries. Skilled care before, during and after childbirth can save the lives of women and newborn babies [2]. Unfortunately, there is still a severe shortage of well-trained sonographers in low resource settings. This keeps ultrasound screening out of reach for most pregnant women in these countries [3]. An automated system could assist inexperienced human observers in obtaining an accurate measurement.

In this work, we focus on measuring the HC because this measurement can be used to determine the GA and monitor growth of the fetus. In addition, the fetal head is more easily detectable compared to the fetal abdomen.

Systems for automatic HC measurement have been presented using randomized Hough transform [4, 5], Haar-like features [6–9], multilevel thresholding [10], circular shortest paths [11], boundary fragment models [12], semi-supervised patch based graphs [13], active contouring [14, 15], intensity based features [16] and texton based features [17]. Although these methods show promising results, they were evaluated on a relatively small amount of data (10 to 175 test images). Furthermore, none of these papers used images of fetuses from all trimesters of pregnancy. We present a system that was developed using 999 ultrasound images and evaluated on a large independent test set of 335 ultrasound images from all trimesters. The presented quantification system was designed to be as fast and robust as possible and the results were compared to the methods presented in the literature. A complete overview of the comparison between our method and previous publications is presented in Section Comparison to literature.

Materials and methods

Data

A total of 1334 two-dimensional (2D) ultrasound images of the HC were collected from the database of the Department of Obstetrics of the Radboud University Medical Center, Nijmegen, the Netherlands. The ultrasound images were acquired from 551 pregnant women who received a routine ultrasound screening exam between May 2014 and May 2015. Only fetuses that did not exhibit any growth abnormalities were included in this study. Images were acquired by experienced sonographers using either the Voluson E8 or the Voluson 730 ultrasound device (General Electric, Austria). The local ethics committee (CMO Arnhem-Nijmegen) approved the collection and use of this data for this study. Due to the retrospective data collection, informed consent was waived. All data was anonymized according to the tenets of the Declaration of Helsinki.

The size of each 2D ultrasound image was 800 by 540 pixels with a pixel size ranging from 0.052 to 0.326 mm. This large variation in pixel size is a result of adjustments in the ultrasound settings by the sonographer (depth settings and amount of zoom are routinely varied during the examination) to account for the different sizes of the fetuses. Fig 1 shows example ultrasound images from each trimester. The distribution of the GA in this study is shown in Fig 2. Most data were acquired after 12 and 20 weeks of pregnancy, since these are standard time points of routine ultrasound screening for pregnant women in the Netherlands. During each exam, the sonographer manually annotated the HC. This was done by drawing an ellipse that best fits the circumference of the head. Fig 2 also shows the comparison between the distribution of the HC and the growth curve of Verburg et al. [1]. The reference GA was determined with a CRL measurement between 20 mm (8⁺⁴ weeks) and 68 mm (12⁺⁶ weeks). All the HCs that fell outside the 3-97 percent confidence interval of the curve of Verburg et al. [1] were individually checked to ensure no mistakes were made during data collection.

Fig 1. Example ultrasound images.

From top to bottom: without annotation and with annotation in red. From left to right: first trimester with an HC of 65.1 mm (pixel size of 0.06 mm), second trimester with an HC of 167.9 mm (pixel size of 0.12 mm) and third trimester with an HC of 278.4 mm (pixel size of 0.24 mm). Note that the skull is not yet visible as a bright structure in the first trimester.

https://doi.org/10.1371/journal.pone.0200412.g001

Fig 2. Distribution of HC and GA for the study data.

The x-axis represents the GA that was estimated using the CRL. The y-axis represents the HC measured by the experienced sonographer.

https://doi.org/10.1371/journal.pone.0200412.g002

The data was randomly divided into a training set and a test set of 75 percent and 25 percent, respectively. The GAs were proportionally balanced between the data sets as shown in Table 1. All images that were made during one echographic examination were assigned to either the training or the test set. An independent data set of HC annotations of the images in the test set was created by TLAvdH, a medical researcher who has a technical background in ultrasound imaging and received training by an experienced sonographer in measuring the HC.
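
The examination-level split described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function name, the `image_exam_ids` mapping and the seed are hypothetical, and the sketch does not balance the GAs between the two sets.

```python
import random
from collections import defaultdict


def split_by_exam(image_exam_ids, test_fraction=0.25, seed=0):
    """Split images so that every examination lands entirely in one set.

    image_exam_ids: dict mapping image name -> examination identifier
    (a hypothetical data layout, not taken from the paper).
    """
    by_exam = defaultdict(list)
    for image, exam in image_exam_ids.items():
        by_exam[exam].append(image)

    exams = sorted(by_exam)
    random.Random(seed).shuffle(exams)

    n_total = len(image_exam_ids)
    train, test = [], []
    for exam in exams:
        # Fill the test set until it holds roughly the requested share,
        # then send all remaining examinations to the training set.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(by_exam[exam])
    return train, test
```

Splitting on the examination rather than on the image keeps near-duplicate views of the same fetus from appearing on both sides of the split.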

Table 1. Number of images in the training and the test set.

https://doi.org/10.1371/journal.pone.0200412.t001

Quantification system

In this study, three variations of the quantification system, indicated as system A, B, or C, were optimized and evaluated to investigate the influence of the changing appearance of the fetal head during pregnancy on the performance of the system. An overview of the three systems is shown in Fig 3. All three systems contain the same two steps: First, Haar-like features were computed from the ultrasound images to train a random forest classifier (RFC) to locate the fetal skull. Next, the HC was extracted using Hough transform, dynamic programming and an ellipse fit. Both steps are described in detail in the following subsections. System A uses one pipeline that was optimized on training data from all trimesters. It can be seen in Fig 1 that the fetal skull is not clearly visible in the first trimester. To deal with this different appearance, system B uses two pipelines to measure the HC: one pipeline was optimized on training data from the first trimester and the other pipeline was optimized on training data from the second and third trimesters. System C uses three pipelines, which were optimized on training data from the first, second and third trimester separately. In a low-resource setting the trimester of the fetus is commonly unknown. For systems with multiple pipelines, a selection method was used to automatically select the best fitted ellipse. This allows the system to automatically measure the HC without requiring the trimester to be known in advance.

Fig 3. Overview of the three evaluated quantification systems A, B, and C.

System A was optimized on training data from all trimesters. System B has two pipelines: pipeline 1 was optimized on training data from trimester one and pipeline 2 was optimized on training data from trimester two and three. System C uses three pipelines: pipeline 1, 2 and 3 were optimized on training data from trimester one, two and three, respectively. All pipelines of a quantification system are computed when the HC is measured in a test ultrasound image.

https://doi.org/10.1371/journal.pone.0200412.g003

Pixel classifier.

The first step of the three quantification systems consists of a pixel classifier that emphasizes the fetal skull and reduces artifacts in the ultrasound image by computing, for each pixel in the image, the likelihood that it is part of the fetal skull. This makes the detection of the fetal skull in the second step more robust.

Feature extraction: Haar-like features [18] were used to be able to discriminate between background pixels and pixels that belong to the fetal skull. Viola and Jones [19] have shown that using an integral image enables the rapid computation of these features. Fig 4 shows the twelve different Haar-like features that were used for the pixel classification. The Haar-like features in rotated direction have a larger kernel width and height compared to the upright direction, but they capture the same relationship between the neighboring pixels. The Haar-like features were computed in different kernel sizes. To make these kernels invariant to the pixel size of the ultrasound image, all features were computed in millimeters. The kernel size in pixels of each Haar-like feature was chosen to match the millimeter scale as closely as possible. As a consequence, the kernel size of the Haar-like features increases when the pixel size of an ultrasound image decreases. A larger kernel size will result in a higher kernel response. To make the response of the feature independent of its kernel size, the Haar-like features were normalized. Normalization was performed by dividing the positive and negative coefficients of the kernel by their respective areas.
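
To illustrate the idea, the sketch below computes one upright edge feature from an integral image, converts a kernel size in millimeters to pixels, and normalizes each box by its area. It is an assumption-laden illustration, not the implementation used in this study: all function names are hypothetical, only one of the twelve features is shown, and border handling is omitted.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column at the top/left."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1][x0:x1] using four table lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def kernel_size_px(size_mm, pixel_size_mm):
    """Convert a kernel size in millimeters to whole pixels (at least 1)."""
    return max(1, round(size_mm / pixel_size_mm))


def edge_feature(ii, y, x, half_px):
    """Area-normalized two-box edge feature: mean(right box) - mean(left box).

    Each box is half_px wide and 2 * half_px tall; the boxes meet at column x.
    The caller must keep the boxes inside the image.
    """
    y0, y1 = y - half_px, y + half_px
    area = (y1 - y0) * half_px
    left = box_sum(ii, y0, x - half_px, y1, x)
    right = box_sum(ii, y0, x, y1, x + half_px)
    return right / area - left / area
```

Because each box sum is divided by its own area, an ideal step edge gives the same response at every kernel size, which is the purpose of the normalization.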

Fig 4. Overview of the twelve Haar-like features utilized in the quantification system.

From top to bottom: 1. Edge features in horizontal and vertical direction (kernel size of two by two pixels). 2. Line features in horizontal and vertical direction (kernel size of three by three pixels). 3. Center-surround features (kernel size of three by three pixels). 4. Rectangle features (kernel size of two by two pixels). The left side of each row represents the features in upright direction. The right side of each row represents the features in rotated direction. The height and width of the features in rotated direction are larger compared to the upright direction, but they capture the same relationship between the neighboring pixels.

https://doi.org/10.1371/journal.pone.0200412.g004

Classification: An OpenCV implementation of the RFC [20] was used for pixel classification. Positive samples were obtained from pixels annotated by the sonographers as the HC. The same number of negative samples were obtained from pixels randomly taken from the background with a minimal distance d_min from the annotation. When negative samples were obtained too close to the annotation they resemble positive samples, since the manually drawn ellipse will never fit the outer edge of the skull perfectly. This problem was solved by increasing d_min, which was optimized within the training set. Data augmentation was applied by flipping the ultrasound image horizontally, which resembles an acquisition with a flipped ultrasound transducer. The pixel classifier produces a likelihood map with a per pixel estimate of being part of the fetal skull. This likelihood map was visualized with a color map ranging from green to red, where a high likelihood was shown in red.
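
The sampling and augmentation steps can be sketched as below. This is a hedged illustration: the function names and the rejection-sampling strategy are assumptions, d_min is expressed in pixels here for simplicity, and the loop will not terminate if d_min leaves no valid background pixels.

```python
import math
import random


def sample_training_pixels(skull_pixels, image_shape, d_min_px, seed=0):
    """Return (positives, negatives) with equal counts.

    skull_pixels: list of (row, col) pixels on the annotated ellipse.
    Negatives are drawn uniformly from the image and rejected when they lie
    closer than d_min_px to any annotated pixel.
    """
    rng = random.Random(seed)
    height, width = image_shape
    negatives = []
    while len(negatives) < len(skull_pixels):
        candidate = (rng.randrange(height), rng.randrange(width))
        nearest = min(math.dist(candidate, p) for p in skull_pixels)
        if nearest >= d_min_px:
            negatives.append(candidate)
    return list(skull_pixels), negatives


def flip_horizontally(image, skull_pixels):
    """Mirror an image (list of rows) and its annotation left to right."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    mirrored = [(r, width - 1 - c) for r, c in skull_pixels]
    return flipped, mirrored
```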

Detect fetal skull.

The likelihood map of the pixel classifier was used to detect the fetal skull in three steps. First, a Hough transform was applied to detect the center of the fetal skull. Secondly, dynamic programming was used to detect the outside of the fetal skull. Finally, an ellipse was fitted on the result of the dynamic programming algorithm to measure the HC.

Hough transform: An ITK implementation of the Hough transform algorithm [21] was used to detect the center of the fetal skull from the likelihood map of the pixel classifier. Every classification pipeline has a GA ranging from the minimum GA, GA_min, to the maximum GA, GA_max. The minimum radius, r_min, of each classification pipeline was set to half of the biparietal diameter (BPD) of the GA_min on the P3 curve of Verburg et al. [1]. The maximum radius, r_max, of each classification pipeline was computed using Eq (1) in which the HC and BPD are taken from the GA_max of the P97 curve of Verburg et al. [1]. The Hough transform was not used to measure the HC because the fitted circle will not give a good estimation of the elliptical shape of the fetal skull. Instead, the detected center was used for initialization of the dynamic programming algorithm (as explained in the next step), which is computationally more efficient than fitting an ellipse using Hough transform.

(1)

Dynamic programming: Dynamic programming was used to extract the pixels belonging to the outside of the fetal skull [22]. Dynamic programming was used because it can be computed very efficiently compared to other methods like active contouring. Fig 5 shows a schematic example of the dynamic programming algorithm. Dynamic programming was used in a polar transform of the pixel classifier likelihood map to find the shortest path from the left to the right side of Fig 5B. The polar transform uses a preset number of angles, N_angles, around the center point that was detected with the Hough transform algorithm. The sampling distance, S_dis, in radial direction was increased to make the algorithm less sensitive to noise and spurious responses in the likelihood map and to decrease computation time. When S_dis becomes too large, the resolution of the polar transform decreases and eventually the dynamic programming algorithm will fail to detect the fetal skull. An optimal value for S_dis was determined on the training set. To make the dynamic programming algorithm less sensitive to small circular structures in the likelihood map, a radial offset of 5 mm and 10 mm was taken for the second and third trimester, respectively.

According to the annotation protocol for HC measurements, the HC must be detected at the outside edge of the fetal skull [1]. Although the RFC was trained with annotations that describe the outside of the fetal skull, the Haar-like features were not able to distinguish between inside and outside of the fetal skull. Therefore, the RFC detected all pixels belonging to the fetal skull instead of only those that belong to the outside of the fetal skull. For this reason, the dynamic programming algorithm detected the midline of the skull. To solve this problem, a second dynamic programming algorithm was computed in the polar transform of the ultrasound image. This algorithm uses the same center and number of angles, N_angles, as the first dynamic programming algorithm, but without any downsampling in radial direction to maintain detailed information about the edge of the skull. To detect the outside of the fetal skull, the derivative of the ultrasound image in radial direction was computed. Pilot experiments showed that the fetal skull is only a few millimeters thick. To restrict the second dynamic programming algorithm to the area that is likely to contain the fetal skull, it was only computed on the area within a distance of 2 mm from the first dynamic programming result. It is not advisable to directly apply dynamic programming to the derivative of the ultrasound image in radial direction because this would be overly sensitive to noise in the ultrasound image. The result of the second dynamic programming algorithm, computed on the derivative of the ultrasound image, was taken as the final result for the ellipse fit in the next step.
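
The core of such a shortest-path search can be sketched as follows, assuming the polar transform has already been turned into a cost array indexed as `cost[radius][angle]` (for example, one minus the likelihood). The function name, the `max_step` smoothness constraint and the cost definition are assumptions for illustration; the sketch also does not force the path to close on itself, which a contour around the skull would additionally require.

```python
def min_cost_path(cost, max_step=1):
    """Cheapest left-to-right path through cost[radius][angle].

    Returns one radius index per angle column. Radii in consecutive columns
    differ by at most max_step.
    """
    n_r, n_a = len(cost), len(cost[0])
    # total[r][a]: cost of the cheapest path ending at radius r in column a.
    total = [[cost[r][0]] + [0.0] * (n_a - 1) for r in range(n_r)]
    back = [[0] * n_a for _ in range(n_r)]
    for a in range(1, n_a):
        for r in range(n_r):
            lo, hi = max(0, r - max_step), min(n_r, r + max_step + 1)
            prev = min(range(lo, hi), key=lambda p: total[p][a - 1])
            total[r][a] = total[prev][a - 1] + cost[r][a]
            back[r][a] = prev
    # Trace back from the cheapest end point in the last column.
    r = min(range(n_r), key=lambda p: total[p][n_a - 1])
    path = [r]
    for a in range(n_a - 1, 0, -1):
        r = back[r][a]
        path.append(r)
    return path[::-1]
```

Each column is processed once and each cell inspects only a small window of the previous column, so the run time grows linearly with the number of angles and radii, which is consistent with the efficiency argument made in the text.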

Fig 5.

A: Perfect pixel classifier likelihood map where only the fetal skull has a high probability (depicted in white) and the background a low probability (depicted in gray). The pixels outside of the FOV are depicted in black. The center detected by the Hough transform is depicted in purple and the radial offset is depicted in green. This schematic example uses eight angles (N_angles) for the polar transform (depicted in blue). The sampling distance (S_dis) is depicted in red. B: The output of the polar transform. The dynamic programming algorithm is used to extract the shortest path from left to right.

https://doi.org/10.1371/journal.pone.0200412.g005

Ellipse fitting: A direct least square fitting of ellipses [23] was used to determine the HC from the extracted pixels of the dynamic programming algorithm. Only the pixels detected by the dynamic programming algorithm within the highest fifth percentile of the likelihood map of the pixel classifier were used to fit the ellipse, because these pixels have a high likelihood for being part of the fetal skull. The fitted ellipse was required to have a circumference of at least 38.6 mm. This is the smallest reported HC on the curve of Verburg et al. [1] and will therefore prevent the quantification system from detecting small circular structures or noise in the image.
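
Once an ellipse has been fitted, its circumference has to be converted to millimeters and checked against the minimum HC. The text does not state which circumference formula was used; the sketch below assumes Ramanujan's approximation, and the function names are hypothetical.

```python
import math

MIN_HC_MM = 38.6  # smallest HC on the reference curve, as stated in the text


def ellipse_circumference_mm(semi_axis_a_px, semi_axis_b_px, pixel_size_mm):
    """Circumference of an ellipse via Ramanujan's approximation (assumed)."""
    a = semi_axis_a_px * pixel_size_mm
    b = semi_axis_b_px * pixel_size_mm
    return math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))


def is_plausible_head(hc_mm):
    """Reject ellipses smaller than the smallest reported HC."""
    return hc_mm >= MIN_HC_MM
```

For a circle the approximation reduces to 2πr exactly, which gives a quick sanity check.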

Select best result.

All pipelines of a quantification system were computed when the HC was measured in a test ultrasound image. In a low-resource setting the trimester of the fetus is commonly unknown, so quantification systems B and C will produce two and three fitted ellipses, respectively. To allow the system to fully automatically measure the HC, the ellipse with the highest median value of the first dynamic programming algorithm on the pixel classifier likelihood map was selected as the final result.
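
The selection rule amounts to a single comparison of medians. A minimal sketch, in which the tuple layout of `results` is a hypothetical choice made for illustration:

```python
from statistics import median


def select_best_pipeline(results):
    """Pick the pipeline whose path has the highest median likelihood.

    results: list of (pipeline_name, ellipse, likelihoods_along_path), where
    likelihoods_along_path holds the likelihood-map values sampled along the
    first dynamic programming result of that pipeline.
    """
    return max(results, key=lambda result: median(result[2]))
```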

Experiments

Four experiments were performed to evaluate the performance of the three quantification systems and compare them to the manual annotations of the experienced sonographer (observer 1) and the medical researcher (observer 2). First, the parameters of the pipelines were optimized for each system. Secondly, the HC measured by observer 1 was used as a reference to compare the HC measured by the three systems and the HC measured by observer 2. Thirdly, the measured HCs were used to estimate the GAs which were compared to the GAs that were estimated using the CRL (measured in the first trimester of the pregnancy). Finally, we checked for indications of overfitting.

System parameter optimization

All parameters in the three quantification systems were optimized within the training set using a three-fold cross-validation. Optimization of five parameters was performed to improve the system performance (the parameter settings can be found in Table 2). First, the number of trees in the RFC was increased until the performance of the classifier was stable. Increasing the number of trees increases the computation time, so the lowest number of trees which showed a stable performance was used during optimization of the other parameters. Secondly, the scales of the Haar-like features were optimized. Starting with the optimum single scale, additional scales were only included when they improved the result. Thirdly, both the minimal distance, d_min, and the sampling distance, S_dis, were increased until the performance no longer improved. Finally, the number of angles, N_angles, used for the polar transform was decreased as long as the performance of the system did not decrease, to reduce computation time.

Table 2. Parameter sets for optimizing systems A, B, and C.

https://doi.org/10.1371/journal.pone.0200412.t002

HC comparison

The HC annotations of observer 1 were used as a reference to evaluate the performance of quantification systems A, B and C, as well as that of observer 2, using the difference (DF), the absolute difference (ADF), the Hausdorff distance (HD) [24] and the Dice similarity coefficient (DSC) [25].

DF was defined as:

$$\mathrm{DF} = HC_S - HC_R \tag{2}$$

where HC_R is the HC measured by observer 1 and HC_S is the HC measured by observer 2 or quantification system A, B or C.

ADF was defined as:

$$\mathrm{ADF} = \lvert HC_S - HC_R \rvert \tag{3}$$

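HD was defined as:

$$\mathrm{HD} = \max\bigl(h(S, R),\, h(R, S)\bigr) \tag{4}$$

where R = {r₁, …, r_q} are the pixels from observer 1 and S = {s₁, …, s_p} are the pixels from observer 2 or quantification system A, B or C, given:

$$h(S, R) = \max_{s \in S} \min_{r \in R} \lVert s - r \rVert \tag{5}$$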

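DSC was defined as:

$$\mathrm{DSC} = \frac{2\,\lvert Area_R \cap Area_S \rvert}{\lvert Area_R \rvert + \lvert Area_S \rvert} \tag{6}$$

where Area_R is the area of the annotation of observer 1 and Area_S is the area of the annotation of observer 2 or the quantification system A, B or C.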
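
The four measures can be computed as in the sketch below, which follows their usual definitions. Two points are assumptions rather than statements from the text: the signed difference is taken as the evaluated measurement minus the reference, and the annotated areas are represented as sets of (row, col) pixels. All function names are hypothetical.

```python
import math


def df(hc_s, hc_r):
    """Signed difference; the sign convention (S minus R) is assumed."""
    return hc_s - hc_r


def adf(hc_s, hc_r):
    """Absolute difference between the two HC measurements."""
    return abs(hc_s - hc_r)


def directed_hausdorff(a_pts, b_pts):
    """Largest distance from a point in a_pts to its nearest point in b_pts."""
    return max(min(math.dist(a, b) for b in b_pts) for a in a_pts)


def hausdorff(r_pts, s_pts):
    """Symmetric Hausdorff distance between two contours (lists of points)."""
    return max(directed_hausdorff(r_pts, s_pts),
               directed_hausdorff(s_pts, r_pts))


def dice(mask_r, mask_s):
    """Dice coefficient of two regions given as sets of (row, col) pixels."""
    total = len(mask_r) + len(mask_s)
    return 2 * len(mask_r & mask_s) / total if total else 1.0
```

The brute-force Hausdorff computation is quadratic in the number of contour pixels, which is acceptable for a single ellipse outline but would be slow for dense masks.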

Statistical analysis was performed to determine whether the difference was significant (p < 0.05). When the tested data was normally distributed according to the Shapiro-Wilk test, a paired t-test was performed using SPSS (version 20.0). Otherwise, a Wilcoxon signed-rank test was performed. Although not all data were normally distributed, the tables in the Results Section show the mean and standard deviation, because this makes a comparison with values provided in previous literature possible.

GA comparison

The GA from the HC of the quantification systems and the observers was estimated using the P50 curve from Verburg et al. [1]. The reference GA was determined with a CRL measurement between 20 mm (8⁺⁴ weeks) and 68 mm (12⁺⁶ weeks). The differences between the estimated GA and the reference GA were computed for evaluation of the results. The same statistical tests as explained in the previous Section were used to determine whether the difference in GA was significant.

Overfitting

The best performing quantification system was evaluated on the training data to investigate whether overfitting of the system parameters had occurred.