The automated acquisition of intraoperative patient temperature data via temperature probes can produce artifacts related to probe positioning, which may limit the utility of these data for observational research.
We sought to compare the performance of two de novo algorithms for filtering such artifacts.
In this observational retrospective study, the intraoperative temperature data of adults who received general anesthesia for noncardiac surgery were extracted from the Multicenter Perioperative Outcomes Group registry. Two algorithms were developed and then compared to the reference standard—anesthesiologists’ manual artifact detection process. Algorithm 1 (a slope-based algorithm) was based on the linear curve fit of 3 adjacent temperature data points. Algorithm 2 (an interval-based algorithm) assessed for time gaps between contiguous temperature recordings. Sensitivity and specificity values for artifact detection were calculated for each algorithm, as were mean temperatures and areas under the curve for hypothermia (temperatures below 36 °C) for each patient, after artifact removal via each methodology.
A total of 27,683 temperature readings from 200 anesthetic records were analyzed. The overall agreement among the anesthesiologists was 92.1%. Both algorithms had high specificity but moderate sensitivity (specificity: 99.02% for algorithm 1 vs 99.54% for algorithm 2; sensitivity: 49.13% for algorithm 1 vs 37.72% for algorithm 2; F-score: 0.65 for algorithm 1 vs 0.55 for algorithm 2). The areas under the curve for time × hypothermic temperature and the mean temperatures recorded for each case after artifact removal were similar between the algorithms and the anesthesiologists.
The tested algorithms provide an automated way to filter intraoperative temperature artifacts that closely approximates manual sorting by anesthesiologists. Our study demonstrates the efficacy of highly generalizable artifact reduction algorithms that can be readily used in observational studies relying on automated intraoperative data acquisition.
Body temperature is a critical vital sign, and its measurement during surgery is an integral part of standard American Society of Anesthesiologists monitoring [
This study was approved by the institutional review board (approval number: HIC 1206010438).
This was a multicenter, observational, retrospective study of data that were collected by the Multicenter Perioperative Outcomes Group (MPOG) consortium after institutional review board approval. The MPOG registry contains the anesthetic data of over 14 million procedures from over 48 medical centers. This consortium has rigorously collected and standardized information regarding anesthetic and surgical encounters with patient-level data [
The study plan, including the sample size assessment, was published prior to data extraction and analysis [
Anesthetic records of patients aged over 18 years who underwent general anesthesia with an endotracheal tube for noncardiac surgery were included in this study. The exclusion criteria comprised cases with an American Society of Anesthesiologists Physical Status of 5 or 6, cases with temperature probes placed at sites other than the nasopharynx or the oropharynx, cases in which an endotracheal tube was not used for general anesthesia, and cases with fewer than 3 temperature readings in the anesthetic records. Temperature recordings were extracted from anesthesia charts, and only intraoperative readings were used for artifact detection.
After the cohort was selected by using the inclusion and exclusion criteria, a convenience sample of 200 noncardiac surgical cases from an anonymized institution within the MPOG consortium was chosen.
The primary study end point was the sensitivity and specificity of the two algorithms for detecting artifacts in automated intraoperative temperature recordings, with manual artifact detection by three anesthesiologists serving as the reference standard. The other study end points included measures of agreement (by case) between the two algorithms and between the algorithms and the experts’ adjudications for mean temperatures and areas under the curve (AUCs). AUCs for temperature readings below 36 °C were used for this analysis. The AUC for time multiplied by temperature readings below 36 °C was calculated for each patient after excluding artifacts, as adjudicated by the algorithms or the experts. The use of AUCs for temperature readings under 36 °C served as an index that combined the duration and severity of patient hypothermia [
Algorithms 1 and 2 are depicted in
The algorithms used for the reduction of artifacts in intraoperative temperature recordings.
Algorithm 2, the interval-based algorithm, assessed for time gaps of more than 5 minutes between contiguous temperature recordings. If fewer than 5 temperature recordings followed the time gap, they were marked as artifacts. If more than 5 recordings followed the measurement gap, the slope between the last valid temperature recording and the next temperature recording was calculated; if this slope was less than 0.35 °C per minute, the temperature points were retained. Otherwise, they were marked as artifacts.
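To illustrate the interval-based rule, the following is a minimal Python sketch; it is not the authors' implementation. The function and parameter names are ours, and the handling of edge cases (eg, exactly 5 recordings after a gap, or how many readings to discard when the slope criterion fails) is an assumption rather than part of the published algorithm.

```python
from typing import List, Tuple

def interval_based_artifacts(readings: List[Tuple[float, float]],
                             gap_min: float = 5.0,
                             min_points_after_gap: int = 5,
                             max_slope: float = 0.35) -> List[bool]:
    """Flag probable artifacts with the interval-based rule (algorithm 2).

    `readings` is a chronologically sorted list of (time_in_minutes, temperature_c).
    Returns one boolean per reading; True marks a suspected artifact.
    """
    n = len(readings)
    artifact = [False] * n
    for i in range(1, n):
        t_prev, _ = readings[i - 1]
        t_cur, temp_cur = readings[i]
        if t_cur - t_prev <= gap_min:
            continue  # no measurement gap before this reading
        if n - i < min_points_after_gap:
            # Too few recordings follow the gap: mark the remainder as artifacts.
            for j in range(i, n):
                artifact[j] = True
            break
        # Enough recordings follow the gap: compare the slope from the last
        # valid (unflagged) reading to the first reading after the gap.
        last_valid = max(j for j in range(i) if not artifact[j])
        t_valid, temp_valid = readings[last_valid]
        slope = abs(temp_cur - temp_valid) / (t_cur - t_valid)
        if slope >= max_slope:
            artifact[i] = True  # scope of rejection here is an assumption
    return artifact
```

For example, `interval_based_artifacts([(0, 36.1), (2, 36.0), (10, 30.2), (12, 35.9), (14, 35.9), (16, 35.9), (18, 35.9), (20, 35.9)])` flags only the 30.2 °C reading that follows the 8-minute measurement gap, because the implied slope (about 0.7 °C per minute) exceeds the 0.35 °C per minute threshold.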
Three board-certified anesthesiologists independently identified artifacts in temperature readings of intraoperative cases; each anesthesiologist was blinded to the other anesthesiologists’ results and the algorithms’ calculations. In the event of discordance, the majority rule (ie, agreement among at least 2 of the 3 anesthesiologists) was followed. We used an innovative approach to present time-temperature readings to the experts, for which we developed software on the JavaFX (Oracle Corporation) and Java 11 JDK (Oracle Corporation) platforms. The program first extracted patient temperature data to a flat file. Each record incorporated a unique patient identifier, temperature, and time stamp. The data were then written to an HTML file, using a FreeMarker Java template. The file used the JavaScript Google Visualization application programming interface to display intraoperative temperatures for each case in a scatterplot, which displayed temperatures on the vertical axis and time on the horizontal axis (
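The adjudication display itself was built on JavaFX with a FreeMarker-generated HTML page and the Google Visualization scatterplot. Purely as a language-neutral illustration of the same display convention (temperature on the vertical axis, time on the horizontal axis, flagged points in red), and not the authors' implementation, a minimal matplotlib sketch with hypothetical function and argument names is shown below.

```python
import matplotlib.pyplot as plt

def plot_case(readings, artifact_flags, case_id):
    """Scatter plot for one case: temperature (y-axis) versus time (x-axis).

    Readings flagged as artifacts are drawn in red, mirroring the display
    convention used for expert adjudication.
    """
    times = [t for t, _ in readings]
    temps = [temp for _, temp in readings]
    colors = ["red" if flagged else "steelblue" for flagged in artifact_flags]
    plt.scatter(times, temps, c=colors, s=15)
    plt.axhline(36.0, linestyle="--", linewidth=0.8)  # hypothermia threshold
    plt.xlabel("Time (minutes)")
    plt.ylabel("Temperature (°C)")
    plt.title(f"Case {case_id}")
    plt.show()
```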
Statistical analyses were performed by using SAS version 9.4 (SAS Institute Inc). Descriptive statistics were performed on all extracted temperature readings, including readings deemed artifactual by each algorithm and the expert-adjudicated values.
The results of the manual artifact identification by the experts (majority rule) and the two algorithms were also compared by using Bland-Altman plots for both mean temperatures and AUCs for hypothermic temperature readings. For these AUCs, each segment area was estimated as the average height between successive time points multiplied by the corresponding interval width, and the segment areas for temperatures under 36 °C were summed to obtain the total area for each surgical case.
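To make this per-case calculation concrete, the following is a minimal sketch. The function name is ours, and we assume the "height" of each segment is the deficit below the 36 °C threshold (ie, degree-minutes below 36 °C), which is consistent with the 60 minutes×°C unit of reference discussed later in the article; the handling of readings that cross the threshold is also an assumption.

```python
from typing import List, Tuple

def hypothermia_auc(readings: List[Tuple[float, float]],
                    threshold_c: float = 36.0) -> float:
    """Trapezoidal estimate of the hypothermia burden for one case (minutes x degrees C).

    `readings` is a chronologically sorted list of (time_in_minutes, temperature_c)
    with artifacts already removed.
    """
    auc = 0.0
    for (t0, temp0), (t1, temp1) in zip(readings, readings[1:]):
        deficit0 = max(threshold_c - temp0, 0.0)  # height below the threshold
        deficit1 = max(threshold_c - temp1, 0.0)
        # Average segment height multiplied by the interval width.
        auc += 0.5 * (deficit0 + deficit1) * (t1 - t0)
    return auc
```

For example, readings of (0 min, 36.2 °C), (15 min, 35.6 °C), (30 min, 35.8 °C), and (45 min, 36.1 °C) yield approximately 9 minutes×°C under these assumptions.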
Although we conducted an observational descriptive analysis without inferential aims, we performed a power analysis to establish the extent to which the data set would define bias and limits of agreement. After a literature review, we were not able to find similar studies that could be used to guide the sample size estimation. Based on our pilot data, the mean sample difference in AUCs for temperature readings below 36 °C between the experts and each algorithm was 0.2 (SD 1.02) minutes×°C. Using the methodology developed by Lu et al [
A total of 27,683 temperature readings from 200 anesthetic records were analyzed by the algorithms and the anesthesiologists. The median temperature reading count per case was 103 (IQR 51-185.5). A histogram depicting the distribution of the raw temperature readings is presented in
Among the 27,683 temperature readings, a total of 411 temperature points were identified as artifacts by the slope-based algorithm, and 236 points were identified as artifacts by the interval-based algorithm. Notably, these rejections were not limited to a few cases. Of the 200 cases, 81 (40.5%) had at least one rejection by the slope-based algorithm, and 89 cases (44.5%) had at least one rejection by the interval-based algorithm. In comparison, 88 cases (44%) were adjudicated to have artifacts by the anesthesiologists. The mean number of rejections for each of the 200 cases was 2.1 for the slope-based algorithm and 1.2 for the interval-based algorithm.
As expected, both algorithms had a high specificity for artifact detection (slope-based algorithm: 99.02%; interval-based algorithm: 99.54%), while the slope-based algorithm appeared to be better than the interval-based algorithm in terms of sensitivity (49.13% vs 37.72%). The F-score was 0.65 for the slope-based algorithm and 0.55 for the interval-based algorithm.
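For reference, these summary metrics follow directly from the per-reading confusion counts. The sketch below assumes, as is conventional, that the reported F-score is the F1-score, ie, the harmonic mean of precision and sensitivity; the function name is ours.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int):
    """Sensitivity, specificity, and F1-score from confusion-matrix counts,
    where a 'positive' is a reading adjudicated as an artifact."""
    sensitivity = tp / (tp + fn)   # artifacts correctly flagged
    specificity = tn / (tn + fp)   # valid readings correctly retained
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1
```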
Comparisons between the AUCs for hypothermic temperature readings from raw data and those from anesthesiologists showed no appreciable differences in the patient-averaged summaries (
Previously, an AUC of 60 minutes×°C was used as a standard unit of reference; multiples of 60 minutes×°C were shown to be associated with adverse patient outcomes [
Interestingly, both the bias between the experts and the slope-based algorithm (19.78 minutes×°C) and the bias between the experts and the interval-based algorithm (−15.53 minutes×°C) were smaller in magnitude than 60 minutes×°C, suggesting that after the raw data were evaluated by the experts or by either algorithm, the resulting measures of hypothermia were similar and fell within accepted thresholds of clinical relevance.
In order to better characterize the agreement, we assessed the performance of the algorithms in evaluating a clinically meaningful measure. Large AUCs for hypothermic temperature readings (time under 36 °C × temperature value of under 36 °C) have been shown to be associated with poor postoperative outcomes, including increased lengths of hospital stay and the need for a blood transfusion [
Bland-Altman plots for the interrater agreement analysis of areas under the curve for hypothermia; 95% limits of agreement are shown with light blue lines, bias is shown as a dotted black line, and the agreement bias of 2 methods is shown as a solid red line. Each dot represents a surgical case.
Scatter plots showing the distribution of AUCs for hypothermia (time under 36 °C × hypothermic temperature value) for the cases after artifact removal by the algorithms versus the anesthesiologists (experts). Each dot indicates a case. Values on the red line indicate cases that have temperature readings with similar AUCs after artifact removal by experts and by algorithm 1 (left) and algorithm 2 (right). Values to the right of the red line indicate fewer hypothermic temperatures marked as artifacts by the algorithm (compared to those marked by experts), leading to larger AUCs calculated by the experts compared to those calculated by the algorithms. AUC: area under the curve.
Scatter plots showing the distribution of AUCs for hypothermia (time under 36 °C × hypothermic temperature value) for the cases after artifact removal by the algorithms versus the raw values. Each dot indicates a case. Values on the red line indicate cases that have temperature readings with similar AUCs before (raw values) and after artifact removal by algorithm 1 (left) and algorithm 2 (right). Values to the right of the red line indicate cases in which hypothermic temperatures were marked as artifacts by the algorithm (relative to the raw values), leading to larger AUCs calculated from the raw data compared to those calculated by the algorithms. AUC: area under the curve.
Mean temperature readings for each patient record were calculated after artifact removal via the methods we described. The mean temperature reading profiles, in which the raw data were compared to anesthesiologists’ majority rule–based results, showed no appreciable differences (
Bland-Altman plots for the interrater agreement analysis of mean temperatures; 95% limits of agreement are shown with light blue lines, bias is shown as a dotted black line, and the agreement bias of 2 methods is shown as a solid red line. Each dot represents a surgical case.
In order to describe clusters, we considered a cluster to be 3 or more consecutive temperature readings that were adjudicated as artifacts. We compared the distributions of the number of clusters per case among the three methods (manual artifact detection by anesthesiologists, the use of the slope-based algorithm [algorithm 1], and the use of the interval-based algorithm [algorithm 2]), as depicted in
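A minimal sketch of this cluster definition follows; the function and variable names are ours.

```python
from typing import List, Tuple

def artifact_clusters(flags: List[bool], min_run: int = 3) -> Tuple[int, List[int]]:
    """Count runs of at least `min_run` consecutive artifact-flagged readings.

    `flags` is the per-reading boolean output of an artifact detector for one case.
    Returns the number of clusters and the size of each cluster.
    """
    sizes: List[int] = []
    run = 0
    for flagged in flags + [False]:  # trailing sentinel closes a final run
        if flagged:
            run += 1
        else:
            if run >= min_run:
                sizes.append(run)
            run = 0
    return len(sizes), sizes
```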
This study has important findings. First, the overall rate of intraoperative temperature artifacts in the sample, which was obtained via automated electronic health record data capture, was low (point estimate 0.01, 95% CI 0.009-0.011). To the best of our knowledge, our study is the first of its kind to address the validity of raw intraoperative temperature recordings. Thus, mean temperature values derived from raw data closely approximate those derived by experts and may be directly used for research purposes. Second, the slope-based algorithm can filter intraoperative temperature artifacts, closely approximating manual sorting by anesthesiologists. The artifact reduction algorithm can thus be used by studies that evaluate the effect of intraoperative hypothermia on patient outcomes. This algorithm can also serve as a powerful tool for gauging the quality of temperature data capture by a particular medical center via comparisons to other medical centers. In addition, our methodology can be used to validate similar algorithms aimed at discerning artifacts associated with other vitals, such as intraoperative blood pressure.
Our intraoperative temperature recordings are similar to those in other studies evaluating intraoperative temperatures [
Our study has some limitations. First, due to the lack of a true gold standard, manual artifact sorting by anesthesiologists was considered a reasonable method for assessing the performance of artifact detection. An alternate methodology for measuring the artifacts could have been correlating esophageal temperatures with temperature measurements that were simultaneously captured from other sites, such as the bladder. However, very few patients receive more than 1 temperature measurement modality. Moreover, bladder temperatures lag behind esophageal temperatures, which would make identifying a true artifact difficult [
In summary, it is widely recognized that intraoperative temperature monitoring is key to postoperative patient outcomes. Our study provides highly generalizable artifact reduction algorithms that can be used as standard open-access tools to filter out artifacts in large database studies. They can also be used as tools for assessing the quality of intraoperative temperature recordings at various centers. Further investigations should assess our slope-based algorithm’s performance for other intraoperative databases and populations.
Intraoperative temperatures for a surgical case displayed as a function of time. Red dots indicate the temperature points that were adjudicated by an anesthesiologist as artifactual.
Histogram showing the distribution of raw temperature data in the study cohort.
Bland-Altman plots for the interrater agreement analysis of areas under the curve (AUCs) for hypothermia; 95% limits of agreement are shown with light blue lines, bias is shown as a dotted black line, and the agreement bias of 2 methods is shown as a solid red line. Each dot represents a surgical case.
Bland-Altman plots for the interrater agreement analysis of mean temperature; 95% limits of agreement are shown with light blue lines, bias is shown as a dotted black line, and the agreement bias of 2 methods is shown as a solid red line. Each dot represents a surgical case.
Distribution of the number of temperature clusters (3 or more consecutive artifactual temperature readings) per case, as adjudicated by experts, algorithm 1, and algorithm 2.
Distribution of the size of the clusters per case for the three methods (experts, algorithm 1, and algorithm 2).
Table comparing the jackknife analysis with the full-data analysis to assess potential outlier effects.
AUC: area under the curve
MPOG: Multicenter Perioperative Outcomes Group
The authors gratefully acknowledge the valuable contributions to the protocol and final manuscript review by the Multicenter Perioperative Outcomes Group collaborators, including Mark Neuman, MD, MSc (Penn Medicine [Anesthesiology]); Shital Vachhani, MD (MD Anderson Cancer Center, Department of Anesthesia and Perioperative Medicine); Robert Edward Freundlich, MD, MS, MSCI (Vanderbilt University Medical Center [Anesthesiology and Biomedical Informatics]); and Wilton A van Klei, MD, PhD (University Medical Center Utrecht [Anesthesiology]).
This work was supported in part by grant R01AG059607 from the National Institute on Aging and Clinical and Translational Science Awards Grant UL1 RR024139 from the National Center for Advancing Translational Sciences. The content is solely the responsibility of the authors and does not necessarily represent the policies or views of the National Institutes of Health, the National Institute on Aging, the National Center for Advancing Translational Sciences, or the US government.
Support for the collection of underlying electronic health record data was provided in part by Blue Cross and Blue Shield of Michigan (BCBSM) and Blue Care Network as part of the BCBSM Value Partnerships program for contributing hospitals in the State of Michigan. Although BCBSM and the Multicenter Perioperative Outcomes Group work collaboratively, the opinions, beliefs, and viewpoints expressed by the authors do not necessarily reflect the opinions, beliefs, and viewpoints of BCBSM or any of its employees.
MRM received National Institutes of Health grant K01-HL141701 during the study period.
AB, DY, FD, and RBS conceived and designed the study, performed the data analysis, interpreted the data, and prepared the manuscript. RD, NLP, KS, MRM, and SK conceived and designed the study, interpreted the data, and prepared the manuscript. GM performed the data analysis and interpreted the data.
RBS holds an equity position in Johnson & Johnson.