Trial Design and Program Selection
From July 1, 2015, to June 30, 2016, we compared a variety of measures in internal medicine residency programs in the United States that were randomly assigned to be regulated by standard 2011 ACGME duty-hour policies or by more flexible policies that did not specify any limits on shift length or mandatory time off between shifts (Table 1). The protocol (available with the full text of this article at NEJM.org) was approved by the institutional review board at the University of Pennsylvania or by the review board at each trial center. We measured education outcomes using time–motion observations of intern activity at six programs, responses by trainees and faculty on surveys administered by both iCOMPARE investigators and ACGME staff members, and scores on the American College of Physicians Internal Medicine In-Training Examination.
We selected programs to meet sample-size requirements for the primary hypothesis that the 30-day patient mortality under a flexible policy would not be inferior to that under the standard policy by more than 1 percentage point. (Mortality outcomes will be assessed with the use of 2015 and 2016 Medicare data, when available.) We included only programs that had at least one affiliated hospital in both the upper half of resident-to-bed ratios and the upper three quartiles of patient volume for diagnoses used to measure mortality. We excluded New York programs because state legislation prevented the necessary duty-hour waivers. Of the 179 eligible programs, 63 volunteered and were randomly assigned to be governed by flexible policies (32 programs) or standard policies (31 programs). Directors were informed about the assignment of their program at the time of randomization.
The decision to be included in the iCOMPARE trial was made by the leadership at each program, not by individual trainees or faculty. Prospective interns could choose which programs they applied to; almost all the programs had undergone randomization by November 2014 and could inform applicants of their duty-hour assignment during recruitment. Each intern who participated in the time–motion observations provided written informed consent. Trainees who responded to surveys that were administered by iCOMPARE investigators were assumed to have provided consent at the time of survey completion.
We conducted time–motion observations of interns to address our hypothesis that trainees in flexible programs would spend more time on direct patient care and education than their colleagues in standard programs. From March through May 2016, we conducted time–motion observations at three flexible and three standard programs (including both community-based and university-based programs) located in the mid-Atlantic region. Preliminary data9,10 suggested that the mean (±SD) percentage of time spent in direct patient care in standard programs would be 13±4%. We prespecified that the observation of 30 intern shifts on general internal medicine inpatient rotations at each program (180 shifts in total) would provide a power of at least 80% to detect an absolute difference of 3 percentage points in the percentage of time spent in direct patient care. Observers were scheduled to cover the entire shift. The types and lengths of the shifts that were observed reflected each site's distribution of shift lengths and overnight schedules. A total of 80 interns (44 in flexible programs and 36 in standard programs) who provided consent were observed for 2173 hours over 194 shifts (1072 hours in 96 shifts in flexible programs and 1101 hours in 98 shifts in standard programs). The mean length of observed shifts was 11.2 hours in both groups; the median length was 10.1 hours in flexible programs and 11.8 hours in standard programs, and the maximum length was 27.8 and 14.5 hours, respectively. In the flexible programs, 4.2% of the shifts lasted more than 24 hours.
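The prespecified power for the time–motion comparison can be approximated with a standard two-sample calculation. The sketch below is a naive, unclustered approximation (standard-library Python, our illustration rather than the trial's actual calculation): with 90 shifts per group, an SD of 4 percentage points, and a true difference of 3 percentage points, a two-sided z-test comfortably exceeds 80% power; clustering of shifts within programs would reduce the effective power toward the stated 80% floor.

```python
# Naive (unclustered) power approximation for detecting a 3-percentage-point
# difference in time spent in direct patient care, assuming SD = 4 points
# and 90 observed shifts per group. Illustrative only; the trial's design
# also had to allow for clustering of shifts within programs.
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sample_power(delta: float, sd: float, n_per_group: int) -> float:
    """Approximate power of a two-sided two-sample z-test at alpha = 0.05."""
    se = sd * math.sqrt(2.0 / n_per_group)       # SE of the mean difference
    z_alpha = 1.959964                            # critical value, two-sided 0.05
    return normal_cdf(abs(delta) / se - z_alpha)


power = two_sample_power(delta=3.0, sd=4.0, n_per_group=90)
print(f"approximate power: {power:.3f}")
```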
The observations, which were restricted to shifts beginning on weekdays, were performed by 23 trained observers. During the training of the observers, the median kappa coefficient among pairs of observers was 0.67, and the median agreement was 90%. During the trial, 10% of shifts were simultaneously observed by two observers (median kappa, 0.74; median agreement, 89%). Activity was recorded in milliseconds with the use of custom-built tablet-based software, with choices of direct patient care, education, indirect patient care, handoffs, rounds, or miscellaneous; more than one category could be selected if different types of activities occurred simultaneously.
We used the score (percent of questions answered correctly) on the American College of Physicians In-Training Examination to address our hypothesis that the medical knowledge acquired by interns in flexible programs would not be inferior to the knowledge of those in standard programs. The American College of Physicians shared deidentified 2015 and 2016 scores for trainees who had provided consent for research. (A total of 88% of all test takers provided such consent in 2016. Not all programs require trainees to take the examination.) In 2016, a total of 1687 trainees (852 of 1228 in flexible programs [69%] and 835 of 1300 in standard programs [64%]) took the examination as second-year residents. A total of 1766 trainees were included in 2015 baseline data (882 in flexible programs and 884 in standard programs).
We used three surveys to address our hypothesis that the trainees’ satisfaction with their educational experience would be superior in flexible programs. In accordance with the data-use policy of the ACGME, the iCOMPARE research team specified the analyses to be conducted on ACGME data and provided statistical code, and ACGME researchers completed the analyses and provided summary results. ACGME researchers performed and provided analyses of responses to their 2015 and 2016 annual resident surveys. The primary outcome for this hypothesis about trainee satisfaction with their educational experience was a single statement from the ACGME resident survey: “Major assignments provide an appropriate balance between education and other clinical demands.” Potential responses were “never,” “rarely,” “sometimes,” “often,” and “very often.” Additional ACGME measures were questions in seven content areas. (Details are provided in the Supplementary Appendix, available at NEJM.org.)
For each content area, ACGME dichotomized the 5-level response to each component question into “compliant” or “noncompliant” and pooled the dichotomized responses to provide a content-level dichotomized response. For example, for the question listed above, the options “sometimes,” “rarely,” and “never” would be considered noncompliant. The response for a content area was noncompliant if the respondent provided a noncompliant response to any of the questions in the content area. The mean rates of response among flexible and standard programs were 91% and 90%, respectively, in 2015 and 91% each in 2016.
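The dichotomize-then-pool rule can be stated compactly in code. This is our reconstruction of the rule described above, not ACGME's actual implementation; the mapping shown follows the example question, and the compliant direction could differ for other items.

```python
# Reconstruction of the ACGME dichotomization rule described in the text:
# each 5-level response is dichotomized, and a content area is noncompliant
# if ANY of its component questions received a noncompliant response.
# The response-to-category mapping follows the example question in the text.
COMPLIANT_RESPONSES = {"often", "very often"}


def question_compliant(response: str) -> bool:
    """Dichotomize a single 5-level response."""
    return response in COMPLIANT_RESPONSES


def content_area_compliant(responses: list[str]) -> bool:
    """A content area is compliant only if every component question is."""
    return all(question_compliant(r) for r in responses)


# One "sometimes" makes the whole content area noncompliant.
print(content_area_compliant(["often", "very often", "sometimes"]))  # False
print(content_area_compliant(["often", "very often"]))               # True
```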
Investigators administered an end-of-year survey to all trainees in May 2015 in 55 programs (which served as a baseline survey before the start of the trial) and in May 2016 in 62 programs (end-of-trial survey). The instrument was developed for the FIRST (Flexibility in Duty Hour Requirements for Surgical Trainees) trial5 in surgery and adapted for internal medicine (Table S1 in the Supplementary Appendix). The 5-level response for each question was dichotomized into a “positive” or “negative” response to parallel the results of the ACGME resident survey. The survey ended with the Maslach Burnout Inventory–Human Services Survey, a 22-item scale assessing emotional exhaustion, depersonalization, and perception of personal accomplishment.15 The survey response rates among trainees were 58% in flexible programs and 57% in standard programs in 2015 and 46% and 44%, respectively, in 2016. (In 2016, overall response rates were 45% for all trainees and 49% for interns.)
Investigators administered end-of-shift surveys to all trainees (60 programs) every 2 weeks from September 2015 through April 2016. The surveys reflected trainees’ perceptions of their experience with education, ownership, work intensity, and continuity; 72% of interns in flexible programs and 67% of those in standard programs (64% and 61% of all trainees in flexible and standard programs, respectively) participated in at least one survey cycle.
To address our hypothesis that faculty in flexible programs would report greater satisfaction with clinical teaching experiences and perceptions of patient safety, teamwork, and supervision, we used ACGME faculty surveys and an iCOMPARE survey of program directors. Using the same process that has been described for ACGME resident surveys, the ACGME provided analyses of responses to their 2015 and 2016 annual faculty surveys. The primary outcome for this hypothesis about faculty satisfaction was a single statement from the ACGME faculty survey: “Residents’ clinical workload exceeds their capacity to do the work.” Response options were “never,” “rarely,” “sometimes,” “often,” and “very often.” Additional measures were questions in six content areas, with responses in each component of the content area dichotomized and pooled, as described for the ACGME resident survey. (Details are provided in the Supplementary Appendix.)
Mean rates of survey response among flexible and standard programs were 90% each in 2015 and 91% each in 2016. In addition, iCOMPARE investigators surveyed program directors in May 2015, when 63 programs were surveyed (response rate, 88% in flexible programs and 100% in standard programs, although a data-acquisition error limited secondary analyses to 19 flexible and 18 standard programs), and in May 2016 (63 programs, with a survey response rate of 100% for flexible programs and 97% for standard programs). The survey instrument mirrored the one used in a previous survey of program directors16 (Table S2 in the Supplementary Appendix). The 5-level response for each question was dichotomized into “positive” or “negative” to provide parallel reporting to the results of the ACGME faculty survey.
The sample-size calculation of 58 programs (29 per group) was based on the primary hypothesis that the 30-day patient mortality under a flexible policy would not be inferior to that under a standard policy by more than 1 percentage point. The achieved sample size of 63 programs provides a power of more than 80% in the comparison of each of the four prespecified education hypotheses. The hypothesis regarding medical knowledge is a noninferiority hypothesis with a noninferiority margin of 2 percentage points that was chosen by consensus of the investigators. We hypothesized that interns in flexible programs would spend more time in direct patient care and education and would be more satisfied with their education and that faculty in flexible programs would be more satisfied with their teaching experiences, patient safety, teamwork, and supervision than their peers in standard programs.
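The medical-knowledge noninferiority criterion amounts to checking the lower bound of a confidence interval against the 2-percentage-point margin. The sketch below illustrates that decision rule with hypothetical summary statistics (the means, SDs, and a simple unclustered standard error are our assumptions for illustration; the trial's analysis accounted for clustering within programs).

```python
# Illustration of the noninferiority decision rule for the in-training
# examination: flexible programs are noninferior if the lower bound of the
# two-sided 95% CI for (flexible mean - standard mean) exceeds -2 points.
# All summary statistics below are hypothetical; the SE ignores clustering.
import math


def noninferior(mean_flex, mean_std, sd_flex, sd_std, n_flex, n_std,
                margin=-2.0, z=1.959964):
    """Return (noninferior?, lower bound of the 95% CI for the difference)."""
    diff = mean_flex - mean_std
    se = math.sqrt(sd_flex ** 2 / n_flex + sd_std ** 2 / n_std)
    lower = diff - z * se
    return lower > margin, lower


# Hypothetical scores: flexible slightly lower, but well inside the margin.
ok, lower = noninferior(68.0, 68.5, 10.0, 10.0, 852, 835)
print(ok, round(lower, 2))
```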
We used a mixed-effects linear-regression model with a random intercept for each program cluster to determine the between-group difference in outcomes obtained from the time–motion observations; for each activity, we analyzed the mean percentage of the observed shift time that was spent in the activity over all shifts observed for the intern. We used mixed-effects linear or logistic regression with a random intercept for each program cluster to determine the between-group difference in ordinal outcomes obtained from the ACGME surveys of trainees and faculty and the binary outcomes obtained from the end-of-year trainee surveys. We used logistic regression with generalized estimating equations and robust variance estimation to account for the correlations between responses from respondents at the same program to determine the between-group difference in binary outcomes obtained from the ACGME trainee and faculty surveys. Exact logistic regression was used to determine the difference in outcomes obtained from the end-of-year survey of program directors. We used mixed-effects linear regression with a random intercept for each program cluster to determine the between-group difference in outcomes obtained from the end-of-shift surveys; for each question, we analyzed the mean of all ratings provided by the trainee over the survey cycles in which the trainee participated. When program-level data were available for the baseline year, we completed secondary analyses in which the respondent's trial-year outcome was adjusted for the program-level mean outcome in the baseline year.
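The per-intern aggregation that feeds the time–motion model can be sketched directly. This is a plausible reconstruction of the step described above (not the trial's actual code, and the data structure is our assumption): for each activity, an intern's outcome is the mean, over that intern's observed shifts, of the percentage of shift time spent in the activity.

```python
# Reconstruction of the per-intern aggregation step for the time-motion
# analysis: average, over each intern's observed shifts, the percentage of
# shift time spent in each activity. Input records are hypothetical.
from collections import defaultdict


def intern_activity_means(shifts):
    """shifts: iterable of dicts with 'intern', 'activity', 'pct_of_shift'.

    Returns {(intern, activity): mean percentage over that intern's shifts}.
    """
    percentages = defaultdict(list)
    for s in shifts:
        percentages[(s["intern"], s["activity"])].append(s["pct_of_shift"])
    return {key: sum(vals) / len(vals) for key, vals in percentages.items()}


observed = [
    {"intern": "A", "activity": "direct care", "pct_of_shift": 12.0},
    {"intern": "A", "activity": "direct care", "pct_of_shift": 14.0},
    {"intern": "B", "activity": "direct care", "pct_of_shift": 10.0},
]
print(intern_activity_means(observed)[("A", "direct care")])  # 13.0
```

These intern-level means would then enter the mixed-effects model as the outcome, with a random intercept for the intern's program.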
Here, we report marginal duty-hour group effects; observed effects and variance components from the mixed-effects regression models are provided in the Supplementary Appendix. Each marginal effect is similar to an observed mean or percentage but is derived from the regression model and accounts for correlations between respondents at the same program, averaging across random effects caused by variation in respondent outcomes within programs. Each linear mixed-effects model provides a measure and test of the clustering effect of programs on the outcome (program variance and P value) and a measure of the variability of the respondent’s response (error variance). Each logistic mixed-effects model provides a measure and test of the clustering effect of programs on the outcome (random program variance and P value).
Data were analyzed according to the intention-to-treat principle. We assumed that missing responses were missing completely at random and analyzed all available responses. We report P values for the primary outcome measures. For the secondary outcome measures, we report 95% confidence intervals without P values, given the multiplicity of comparisons. Analyses were conducted with the use of SAS and Stata software.