Published in Vol 5, No 1 (2022): Jan-Dec

This is a member publication of University College London (Jisc)

Preprints (earlier versions) of this paper are available.
The Use of Machine Learning to Reduce Overtreatment of the Axilla in Breast Cancer: Retrospective Cohort Study


Original Paper

1Victor Horsley Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, United Kingdom

2School of Business, University of Salford, Salford, United Kingdom

3Division of Surgery and Interventional Science, University College London, London, United Kingdom

4Nuffield Department of Surgical Sciences, University of Oxford, Oxford, United Kingdom

Corresponding Author:

Felix Jozsa, BMUS, MBBS

Victor Horsley Department of Neurosurgery

National Hospital for Neurology and Neurosurgery

Queen Square

London, WC1N 3BG

United Kingdom

Phone: 44 020 3456 7890


Background: Patients with early breast cancer undergoing primary surgery, who have low axillary nodal burden, can safely forego axillary node clearance (ANC). However, routine use of axillary ultrasound (AUS) leads to 43% of patients in this group having ANC unnecessarily, following a positive AUS. The intersection of machine learning with medicine can provide innovative ways to understand specific risks within large patient data sets, but this has not yet been trialed in the arena of axillary node management in breast cancer.

Objective: The objective of this study was to assess if machine learning techniques could be used to improve preoperative identification of patients with low and high axillary metastatic burden.

Methods: A single-center retrospective analysis was performed on patients with breast cancer who had a preoperative AUS, and the specificity and sensitivity of AUS were calculated. Standard statistical methods and machine learning methods, including artificial neural network, naive Bayes, support vector machine, and random forest, were applied to the data to see if they could improve the accuracy of preoperative AUS to better discern high and low axillary burden.

Results: The study included 459 patients; 142 (31%) had a positive AUS; among this group, 88 (62%) had 2 or fewer macrometastatic nodes at ANC. Logistic regression outperformed AUS (specificity 0.950 vs 0.809). Of all the methods, the artificial neural network had the highest accuracy (0.919). Interestingly, AUS had the highest sensitivity of all methods (0.777), underlining its utility in this setting.

Conclusions: We demonstrated that machine learning improves identification of the important subgroup of patients with no palpable axillary disease, positive ultrasound, and more than 2 metastatically involved nodes. A negative ultrasound in patients with no palpable lymphadenopathy is highly indicative of low axillary burden, and it is unclear whether sentinel node biopsy adds value in this situation. Further studies with larger patient numbers focusing on specific breast cancer subgroups are required to refine these techniques in this setting.

JMIR Perioper Med 2022;5(1):e34600



The contemporary management of the axilla in breast cancer aims to reduce unnecessary intervention while providing optimal oncological safety. Historically, given the well-recognized importance of axillary node status on breast cancer prognosis [1], any patient with axillary disease underwent a complete axillary node clearance (ANC). Several key trials have since reduced the indications for ANC, including evidence that isolated tumor cells [2] and micrometastases [3] were clinically insignificant as well as results of the ACOSOG Z11 trial [4], which demonstrated that in patients with T1-2 breast cancer who had no clinically palpable axillary nodes, with 2 or fewer positive macrometastatically involved axillary nodes at sentinel node biopsy (SNB), no further axillary treatment was necessary. More patients are consequently able to forego ANC, a large surgical procedure with significant morbidity [5], without inferior oncological survival outcomes. The accurate identification of this group of patients is therefore crucially important to ensure they do not receive unnecessary surgical treatment of the axilla.

Axillary ultrasound (AUS) is used nearly ubiquitously in UK breast oncology centers to assess the axilla preoperatively in breast cancer. Typically, a suspicious node viewed on AUS may be biopsied and can be clipped to aid intraoperative identification [6]. When patients are ‘fast-tracked’ to ANC on the basis of a positive AUS, up to 43% of these may have 2 or fewer involved nodes [7] and are thus overtreated. Since AUS was not used in the ACOSOG Z11 trial, this discrepancy remains, and the bypassing of SNB prevents identification of patients who could have safely avoided ANC.

Artificial neural networks are a form of supervised machine learning based on the simplest computational model of a neuron, the ‘perceptron.’ Connections between nodes in consecutive layers of a network are weighted probabilistically. The first layer receives the variables describing an item in a data set that has been prelabelled (eg, as ‘dog’ or ‘cat’), and the network attempts to assign the correct label to the item. This process is repeated on the training set of data, with the model updating the weights of its connections between iterations to minimize the error of its categorization. Once optimized, it can be deployed on the test set to verify its accuracy.
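As a minimal sketch of the training loop described above, the following trains a single sigmoid neuron (a perceptron-style unit) on a small prelabelled toy data set and then checks it on held-out items. It uses only the Python standard library. The data, learning rate, and number of epochs are illustrative assumptions and are not taken from the study.

```python
import math
import random


def sigmoid(z):
    """Squash a weighted sum into a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, labels, epochs=500, lr=0.5, seed=0):
    """Repeatedly pass over the training set, nudging weights to reduce error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in range(len(samples[0]))]
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the cross-entropy loss at the neuron input
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Assign label 1 when the neuron output is at least 0.5, otherwise 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical, linearly separable toy data (two features per item).
train_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.9, 1.0], [1.0, 0.8], [0.7, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.1, 0.1], [0.9, 0.9]]
test_y = [0, 1]

weights, bias = train_neuron(train_x, train_y)
test_predictions = [predict(weights, bias, x) for x in test_x]
test_accuracy = sum(p == y for p, y in zip(test_predictions, test_y)) / len(test_y)
```

A multilayer network repeats the same idea across several layers, with the error propagated backward through each layer's weights.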

The aim of this study was to undertake a retrospective pilot study to deploy machine learning methods (ie, artificial neural networks) and traditional statistical models (ie, logistic regression) to aid identification of patients with no clinically palpable nodes and a positive preoperative AUS who have low axillary nodal burden. The rationale for this is that better identification of this subgroup of patients can reduce the number of patients who undergo unnecessary ANC on the basis of a preoperative positive AUS, which turns out to be clinically insignificant.

Ethics Approval

The study was registered as a clinical audit with the ethics committee of Guy's Hospital, London, United Kingdom and was approved in February 2019 (institutional reference number 7608).

Data Collection

The first part of this study was to analyze retrospectively the use of preoperative AUS in patients with breast cancer at our tertiary care center. Women with confirmed breast cancer treated at Guy’s Hospital, London, United Kingdom, who had an AUS preoperatively between 2012 and 2014 were retrospectively identified from a departmental database. The results of the AUS and the patients’ sex; age; date of birth; primary tumor size, grade, and type; as well as receptor phenotype were recorded alongside the results of any axillary surgical intervention and breast surgery. Lymph nodes were evaluated with ultrasound using the following criteria for reporting an abnormal node: diffuse or focal cortical enlargement, loss of lymph node fatty hilum, and enlarged nodal size [8]. All data were fully anonymized.

The second part of this study was to use machine learning and statistical methods to try to improve identification of patients with high or low axillary burden. High burden was defined as more than 2 macrometastatic axillary nodes. Low burden was defined as 0, 1, or 2 macrometastatic nodes, or isolated tumor cells or micrometastases only.

Both types of models were given the following patient characteristics to predict nodal burden: patient age, estrogen receptor and HER2 status, tumor grade, presence of associated ductal carcinoma in-situ, tumor type (eg, invasive ductal carcinoma and invasive lobular carcinoma), tumor size, presence of lymphovascular invasion, and the result of a preoperative AUS.

Machine Learning Methods

After collection and deidentification of data, the data set was preprocessed using pandas [9], matplotlib [10], and scikit-learn [11], which are open-source data analysis and manipulation tools built in the Python programming language. A total of 70% of the data was randomly selected to form the training set, on which predictive models were developed, with the other 30% designated as the test set. The resultant nodal burden of each patient was labelled as 1 or 0 to indicate low and high nodal burden, respectively, and this feature was designated as the label to be predicted by the model. Categorical variables were one-hot encoded, and numerical variables were scaled to between 0 and 1 using the MinMaxScaler function. TensorFlow [12] and Keras were used to design the artificial neural network (ANN). A dense, feed-forward ANN with 3 layers of 11, 6, and 1 neurons, respectively, was constructed with backpropagation optimized using Adam [13]. Support vector machine, random forest, and naive Bayes classifier methods were also used for comparison with the ANN.
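The preprocessing steps above can be sketched in plain Python. This is an illustration only: it uses the standard library instead of the pandas and scikit-learn calls used in the study, and the records, category names, and random seed are hypothetical.

```python
import random


def burden_label(macrometastatic_nodes):
    """Return 1 for low burden (2 or fewer macrometastatic nodes), else 0."""
    return 1 if macrometastatic_nodes <= 2 else 0


def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1."""
    return [1 if value == c else 0 for c in categories]


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and cut them into a training and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: (tumor size in mm, tumor type, macrometastatic nodes).
TUMOR_TYPES = ["IDC", "ILC", "other"]
records = [
    (12.0, "IDC", 0), (25.0, "ILC", 1), (40.0, "IDC", 4), (8.0, "other", 0),
    (60.0, "IDC", 6), (18.0, "IDC", 2), (33.0, "ILC", 3), (22.0, "IDC", 0),
    (15.0, "other", 1), (50.0, "IDC", 2),
]

sizes = min_max_scale([size for size, _, _ in records])
rows = [
    ([scaled] + one_hot(tumor_type, TUMOR_TYPES), burden_label(nodes))
    for scaled, (_, tumor_type, nodes) in zip(sizes, records)
]
train_rows, test_rows = train_test_split(rows)
```

For brevity the sketch scales tumor size over all records. The text does not say whether the scaling range was taken from the full data set or from the training set alone; using the training set alone is the more common choice because it keeps test-set information out of model development.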

Statistical Methods

Logistic regression is a well-known and widely used technique for predicting binary variables and carrying out discriminant analysis when the predictor variables are not all normally distributed [14]. It was used for classification here by choosing the predicted group as the group with the larger predicted probability of membership.
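The classification rule described here, choosing the group with the larger predicted probability, can be illustrated as follows. The coefficients, intercept, and feature values are hypothetical placeholders and not the fitted values from the study.

```python
import math


def predicted_probability(coefficients, intercept, features):
    """Logistic model: probability of the class coded 1 (low burden here)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))


def classify(coefficients, intercept, features):
    """Pick whichever group has the larger predicted probability of membership."""
    p_low = predicted_probability(coefficients, intercept, features)
    p_high = 1.0 - p_low
    return "low burden" if p_low > p_high else "high burden"


# Hypothetical model with two binary predictors
# (eg, abnormal AUS and lymphovascular invasion).
example_coefficients = [-2.0, -1.5]
example_intercept = 1.0

no_risk_factors = classify(example_coefficients, example_intercept, [0.0, 0.0])
both_risk_factors = classify(example_coefficients, example_intercept, [1.0, 1.0])
```

With two groups, this rule is equivalent to thresholding the predicted probability at 0.5.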

Logistic regression is a standard methodology, and the only nontrivial problem was estimation of the sensitivity and specificity. These would have been overestimated if computed in-sample from fitted data. We therefore used a computationally feasible method for out-of-sample estimation—k-fold cross-validation; this is a better use of data compared to estimating sensitivity on a hold-out sample.

The model was fitted k times, leaving out each ‘fold’ in turn, and predictions were then made for that fold using the fit to the other folds only. Folds were produced by shuffling high and low burden cases separately and then dividing the sample so that the percentage of high-burden cases was as equal between the folds as possible. We used 5 folds, which is usually taken as sufficient, and moving to 10 folds made very little difference.
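One way to build folds as described above is to shuffle the high- and low-burden cases separately and deal each group out across the folds in turn. The sketch below is an assumption about how this could be done, not the authors' code, and it omits the model fitting itself. The function names and seed are hypothetical; the class sizes (67 high burden, 392 low burden) are those reported in Table 1.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds with near-equal shares of each class."""
    rng = random.Random(seed)
    high = [i for i, y in enumerate(labels) if y == 0]
    low = [i for i, y in enumerate(labels) if y == 1]
    rng.shuffle(high)
    rng.shuffle(low)
    folds = [[] for _ in range(k)]
    for position, index in enumerate(high):
        folds[position % k].append(index)
    # Continue dealing where the high-burden cases stopped so fold sizes stay even.
    offset = len(high)
    for position, index in enumerate(low):
        folds[(position + offset) % k].append(index)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train indices, held-out indices), leaving out each fold in turn."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Labels coded as in the study: 0 = high burden, 1 = low burden.
labels = [0] * 67 + [1] * 392
folds = stratified_folds(labels)
high_counts = [sum(1 for i in fold if labels[i] == 0) for fold in folds]
fold_sizes = [len(fold) for fold in folds]
```

Each patient is held out exactly once, so sensitivity and specificity can be computed from predictions made by models that never saw that patient during fitting.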

The method is not Bayesian but can be made so using a ‘vague prior.’ Laplace’s method of integration was used to obtain a Bayesian solution, and when this was done, the probability that a patient had low or high burden shifted slightly toward 1/2, by about 0.02, so the Bayesian methodology gave a slightly less certain prediction. However, the classification was unchanged, so the Bayesian refinement was not used.

A total of 459 patients with breast cancer who had undergone a preoperative AUS before SNB or primary surgery with ANC were included. Patient characteristics are detailed in Table 1. All patients were women, with a mean age of 57.1 (SD 13.9) years. Mean tumor size was 28.3 (SD 24.05) mm. Of the tumors, 319 (69.5%) were invasive ductal carcinoma, and 69 (15%) were invasive lobular carcinoma.

Table 1. Patient characteristics. All patients were female.

| Characteristics | All patients (N=459) | Low burden (≤2 nodes; n=392) | High burden (>2 nodes; n=67) |
| --- | --- | --- | --- |
| Age (years), mean (range, SD) | 57.11 (28-88, 13.85) | 57.48 (29-88, 13.80) | 54.97 (28-86, 14.05) |
| Tumor size (mm), mean (range, SD) | 28.29 (1.1-180, 24.05) | 25.48 (1.1-180, 20.4) | 44.99 (3-180, 35.1) |
| Tumor histology, n (%) | | | |
| Invasive ductal carcinoma | 319 (69.5) | 260 (66.3) | 55 (82.1) |
| Invasive lobular carcinoma | 69 (15) | 56 (14.3) | 8 (11.9) |
| Other invasive types | 41 (8.9) | 39 (10) | 2 (3) |
| Isolated in situ disease | 30 (6.5) | 30 (7.7) | 0 (0) |
| Tumor grade, n (%) | | | |
| 1 | 48 (10.5) | 45 (11.5) | 3 (4.5) |
| 2 | 204 (44.4) | 177 (45.2) | 27 (40.3) |
| 3 | 176 (38.3) | 139 (35.5) | 37 (55.2) |
| Not specified | 2 (0.4) | 2 (0.5) | 0 (0) |
| Invasive tumor with associated DCIS^a, n/N (%)^b | | | |
| High grade | 194/269 (72.1) | 7/222 (3.2) | 0/47 (0) |
| Intermediate grade | 68/269 (25.3) | 58/222 (26.1) | 10/47 (21.3) |
| Low grade | 7/269 (2.6) | 157/222 (70.7) | 37/47 (78.7) |
| Receptor phenotype, n (%) | | | |
| Luminal A | 332 (72.3) | 283 (72.2) | 48 (71.6) |
| Luminal B | 30 (6.5) | 23 (5.9) | 7 (10.5) |
| Triple negative | 65 (14.2) | 58 (14.8) | 6 (9) |
| HER2 | 13 (2.8) | 8 (2.2) | 5 (7.5) |
| Not specified | 19 (4.1) | 20 (5.1) | 1 (1.5) |
| Primary surgery, n (%) | | | |
| WLE^c | 257 (56) | 210 (53.6) | 25 (37.3) |
| Mastectomy | 193 (42) | 151 (38.5) | 41 (61.2) |
| Lymphovascular invasion present, n (%) | 114 (24.8) | 73 (18.6) | 41 (61.2) |

^a DCIS: ductal carcinoma in situ.

^b The total number of patients in this category was 269/459 (58.6%); the total number of patients with low burden (≤2 nodes) was 222 (56.6%); and the total number of patients with high burden (>2 nodes) was 47 (70.2%). All the other percentages under this category are calculated based on these denominators.

^c WLE: wide local excision.

Accuracy of Preoperative AUS

The preoperative AUS was positive in 142 (31%), negative in 285 (62.09%), and inconclusive in 32 (6.97%) patients. Among patients with a positive ultrasound, 54 (38.03%) had more than 2 positive axillary nodes at ANC, and 88 (62%) had 2 or fewer nodes. Among patients with a negative ultrasound, 304 (95.9%) had 2 or fewer positive nodes at SNB (Table 2). In the subgroup of patients with a negative AUS and a tumor size of 20 mm or less, the number of patients with more than 2 positive nodes at SNB was 5 (2.78%). The sensitivity and specificity of ultrasound overall from these data were 0.777 (95% CI 0.736-0.818) and 0.809 (95% CI 0.715-0.902), respectively. The accuracy was 0.820 (95% CI 0.778-0.862).

Table 2. Axillary nodal burden of patients with positive and negative ultrasound.

| Nodal burden | Ultrasound negative (N=317), n (%) | Ultrasound positive (N=142), n (%) |
| --- | --- | --- |
| Two or fewer nodes | 304 (95.9) | 88 (62) |
| More than 2 nodes | 13 (4.1) | 54 (38) |
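As a worked check on the Table 2 counts, the sketch below computes the two class-wise proportions with a normal-approximation (Wald) 95% CI. It takes low burden as the positive class, matching the 1/0 labelling used for the models; if high burden is instead treated as the positive class, the names sensitivity and specificity swap. The choice of the Wald interval is an assumption, and the results come out close to, but not identical to, the figures reported in the text. One possible reason is that the ultrasound-negative column (N=317) appears to combine the 285 negative and 32 inconclusive scans.

```python
import math


def proportion_with_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Counts taken from Table 2.
low_burden_us_negative = 304   # low burden, ultrasound negative
low_burden_us_positive = 88    # low burden, ultrasound positive
high_burden_us_negative = 13   # high burden, ultrasound negative
high_burden_us_positive = 54   # high burden, ultrasound positive

# Positive class = low burden: a negative ultrasound is the "correct" call.
sensitivity = proportion_with_ci(
    low_burden_us_negative, low_burden_us_negative + low_burden_us_positive
)
# Negative class = high burden: a positive ultrasound is the "correct" call.
specificity = proportion_with_ci(
    high_burden_us_positive, high_burden_us_positive + high_burden_us_negative
)
```

Here the first proportion is 304/392 (about 0.776) and the second is 54/67 (about 0.806). The wider interval around the second reflects the much smaller number of high-burden patients.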

Application of Machine Learning and Statistical Models

All machine learning and statistical models applied to these data delivered improved specificity when compared to preoperative AUS (Table 3).

The best performing model was logistic regression, with a specificity of 0.950. This was achieved by sacrificing sensitivity, which was 0.462. If logistic regression had been used on this patient cohort, 66/459 (14.3%) patients who had a positive AUS and low axillary burden would have been identified as such and avoided unnecessary ANC; 20/459 (4.3%) patients would have been wrongly classified as having low burden, but these would then have undergone SNB as per current practice and likely been identified as having high burden at that point. The most important covariates identified by logistic regression were abnormal AUS, lymphovascular invasion, tumor size, as well as invasive ductal and invasive lobular carcinoma tumor types.

The ANN, support vector machine, naive Bayes, and random forest classifiers all outperformed preoperative ultrasound’s specificity, but none were able to improve on its sensitivity (Table 3). The ANN was stopped early after 163 epochs of training (Figure 1), reaching a specificity of 0.9355 and a sensitivity of 0.7273. As such, the ANN had the highest accuracy (0.919) of all models, including logistic regression. When performing on the test set, the ANN correctly identified 21 of the 24 patients with a positive ultrasound and low burden.

Table 3. Comparison of preoperative ultrasound with logistic regression and machine learning models.

| Method | Specificity | Sensitivity | Accuracy |
| --- | --- | --- | --- |
| Preoperative axillary ultrasound | 0.809 | 0.777 | 0.820 |
| Logistic regression | 0.950 | 0.462 | 0.880 |
| Naive Bayes | 0.947 | 0.476 | 0.874 |
| Artificial neural network | 0.936 | 0.727 | 0.919 |
| Support vector machine | 0.934 | 0.615 | 0.904 |
| Random forest | 0.911 | 0.455 | 0.874 |
Figure 1. Training of the artificial neural network over 163 epochs.

Principal Findings

Our results demonstrate that logistic regression and machine learning methods can be used effectively to reduce the number of patients undergoing ANC unnecessarily. As current practice leads to 43% of patients with early breast cancer, nonpalpable axillary nodes, and a positive ultrasound receiving such overtreatment, this is a valuable addition to the preoperative workup of breast cancer patients, and there are significant implications on clinical practice.

In this data set, logistic regression performed best. Its high specificity came at the cost of poor sensitivity. However, this trade-off is favorable in the case of axillary staging because patients deemed as low risk will undergo SNB. Thus, the potential group of patients wrongly classified as having low burden by logistic regression will be identified and not left without treatment. For this reason, despite the ANN’s accuracy outperforming the other models, logistic regression is the best model for the problem presented by the data. Indeed, a recent systematic review of clinical prediction models found no performance benefit of machine learning methods over logistic regression [15], which fits the setting here of predicting disease from a data set of relatively low dimensions and size.

This study confirms that machine learning can be successfully deployed in the preoperative assessment of patients with breast cancer, despite not being able to outperform logistic regression’s optimization of specificity for this task. The ANN developed the greatest overall accuracy, meaning it would have been the most useful tool if SNB following negative imaging was not standard of care. Larger and higher dimensional data sets will likely provide an arena in which machine learning can excel, particularly when considering its potential to combine image analysis techniques using convolutional neural networks and standard data in the form used in this study [16].

The fact that none of the models could improve on the sensitivity of AUS underlines the value of this imaging modality for helping rule out axillary disease in the clinically node-negative breast cancer population. Evidence from a meta-analysis of 5139 patients showed that ultrasound’s negative predictive value was 0.951 (95% CI 0.941-0.960) in this setting [17]. Despite this, patients with a negative ultrasound still undergo an SNB, and this may be considered surgical overtreatment in the same sense that ANC is used unnecessarily in the ultrasound-positive group. This issue is currently being addressed in the SOUND randomized controlled trial [18]. Machine learning and statistical methods could be adapted for use on large data sets to help identify the approximately 4% of patients with no clinically palpable disease and a negative ultrasound but with more than 2 macrometastatically involved axillary nodes. This could lead to future selective use of SNB in this patient subgroup, analogous to the selective use of ANC, which is now common practice among patients with nodal burden identified on SNB.

There are several limitations to this study. They stem principally from the fact that this study is a proof of concept demonstrating the application of machine learning techniques in a breast surgery cohort, applied to a specific clinical and radiological problem within the general breast cancer patient population but not able to further delineate important risk differences between subgroups in this population. For example, it has not included several patient factors and data points that may prove important to refining models before implementation in a real-world scenario; parameters that the authors would like to include in further models include menopausal status and lymph node biopsy pathology results. A further limitation of this study’s applicability to clinical practice was that it did not consider patients undergoing primary systemic therapy, the indications for which have increased [19]. In this patient group, the use of ultrasound is less important as staging magnetic resonance imaging is often used alongside SNB to assess response to treatment. Another key limitation of this study was that our data set was relatively small; deployment of the same models on much larger sets of patient data would be necessary to further validate our results. Furthermore, with larger training sets, model performance may improve. This could allow for future large studies on specific breast cancer patient subgroups, for example, invasive lobular carcinoma. A further interesting future consideration will be to include particular aspects of ultrasound data, for example, cortex to hilum ratios, when computing predictive models, or to combine data predictive methods with computer vision techniques looking directly at the ultrasound images obtained from each patient.


AUS’s poor specificity renders it ineffective at reliably identifying patients with a clinically negative axilla and significant nodal burden (ie, more than 2 macrometastatic nodes), despite it being attractive as a noninvasive and widely available tool. The addition of logistic regression and machine learning methods can provide valuable predictions based on patient characteristics and the AUS result, which can greatly reduce the surgical overtreatment of the axilla and significantly improve the accuracy of identification of high nodal burden among patients with no clinically palpable disease. This two-part improvement in preoperative axillary staging is highly desirable and has the potential to spare many patients unnecessary axillary surgery; however, given the heterogeneous nature of the patient population in this study, further refinement of the models with international multicenter trials is warranted to confirm the results.

Conflicts of Interest

None declared.

  1. Fisher B, Anderson S, Bryant J, Margolese RG, Deutsch M, Fisher ER, et al. Twenty-year follow-up of a randomized trial comparing total mastectomy, lumpectomy, and lumpectomy plus irradiation for the treatment of invasive breast cancer. N Engl J Med 2002 Oct 17;347(16):1233-1241.
  2. Imoto S, Ochiai A, Okumura C, Wada N, Hasebe T. Impact of isolated tumor cells in sentinel lymph nodes detected by immunohistochemical staining. Eur J Surg Oncol 2006 Dec;32(10):1175-1179.
  3. Galimberti V, Cole BF, Zurrida S, Viale G, Luini A, Veronesi P, International Breast Cancer Study Group Trial 23-01 investigators. Axillary dissection versus no axillary dissection in patients with sentinel-node micrometastases (IBCSG 23-01): a phase 3 randomised controlled trial. Lancet Oncol 2013 Apr;14(4):297-305.
  4. Giuliano AE, Hunt KK, Ballman KV, Beitsch PD, Whitworth PW, Blumencranz PW, et al. Axillary dissection vs no axillary dissection in women with invasive breast cancer and sentinel node metastasis: a randomized clinical trial. JAMA 2011 Feb 09;305(6):569-575.
  5. Veronesi U, Paganelli G, Viale G, Luini A, Zurrida S, Galimberti V, et al. A randomized comparison of sentinel-node biopsy with routine axillary dissection in breast cancer. N Engl J Med 2003 Aug 07;349(6):546-553.
  6. Caudle AS, Yang WT, Krishnamurthy S, Mittendorf EA, Black DM, Gilcrease MZ, et al. Improved axillary evaluation following neoadjuvant therapy for patients with node-positive breast cancer using selective evaluation of clipped nodes: implementation of targeted axillary dissection. JCO 2016 Apr 01;34(10):1072-1078.
  7. Ahmed M, Jozsa F, Baker R, Rubio IT, Benson J, Douek M. Meta-analysis of tumour burden in pre-operative axillary ultrasound positive and negative breast cancer patients. Breast Cancer Res Treat 2017 Nov 28;166(2):329-336.
  8. Shetty M, Carpenter W. Sonographic evaluation of isolated abnormal axillary lymph nodes identified on mammograms. J Ultrasound Med 2004 Jan;23(1):63-71.
  9. Pandas. URL: [accessed 2020-05-22]
  10. Matplotlib. URL: [accessed 2020-05-22]
  11. Scikit-learn. URL: [accessed 2020-05-22]
  12. TensorFlow. URL: [accessed 2020-05-22]
  13. Kingma DP, Ba J. Adam: a method for stochastic optimization. ArXiv. Preprint posted online Dec 22, 2014.
  14. Harris JK. Primer on binary logistic regression. Fam Med Community Health 2021 Dec 23;9(Suppl 1):e001290.
  15. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019 Jun;110:12-22.
  16. Bica I, Alaa AM, Lambert C, van der Schaar M. From real-world patient data to individualized treatment effects using machine learning: current and future methods to address underlying challenges. Clin Pharmacol Ther 2021 Jan 28;109(1):87-100.
  17. Jozsa F, Ahmed M, Baker R, Douek M. Is sentinel node biopsy necessary in the radiologically negative axilla in breast cancer? Breast Cancer Res Treat 2019 Aug 31;177(1):1-4.
  18. Gentilini O, Veronesi U. Abandoning sentinel lymph node biopsy in early breast cancer? A new trial in progress at the European Institute of Oncology of Milan (SOUND: Sentinel node vs Observation after axillary UltraSouND). Breast 2012 Oct;21(5):678-681.
  19. Amoroso V, Generali D, Buchholz T, Cristofanilli M, Pedersini R, Curigliano G, et al. International expert consensus on primary systemic therapy in the management of early breast cancer: highlights of the fifth symposium on primary systemic therapy in the management of operable breast cancer, Cremona, Italy (2013). J Natl Cancer Inst Monogr 2015 May 10;2015(51):90-96.

ANC: axillary node clearance
ANN: artificial neural network
AUS: axillary ultrasound
SNB: sentinel node biopsy

Edited by J Pearson; submitted 31.10.21; peer-reviewed by J McGuinness, P Ray, M Sherman; comments to author 28.04.22; revised version received 18.09.22; accepted 06.10.22; published 15.11.22


©Felix Jozsa, Rose Baker, Peter Kelly, Muneer Ahmed, Michael Douek. Originally published in JMIR Perioperative Medicine, 15.11.2022.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Perioperative Medicine, is properly cited. The complete bibliographic information, a link to the original publication, as well as this copyright and license information must be included.