Data Sources

This project relied primarily on a dataset of EMS incidents between December 2016 and June 2020 in Charlottesville City and Albemarle County. These data consisted of numerous variables regularly recorded by EMS providers when responding to an incident. The variables available can be broken down into three rough categories:

  • Operations
    • Incident date and time
    • Unit and crew dispatched
    • Scene location and type
    • Response time to incident
    • Hospital patient was delivered to
  • Medical
    • Patient symptoms (based on provider's impression and/or patient's reported complaints)
    • Vital signs (temperature, pulse, etc.)
    • Medications provided
    • Results of medical assessments
  • Patient Characteristics
    • Age
    • Race
    • Sex
    • Medical history

In addition to this EMS dataset, we also used data from the American Community Survey to develop an understanding of the baseline demographic patterns in the region and mortality data from the Virginia Department of Health in an attempt to link any observed changes in EMS delivery to overall trends in COVID-19-related mortality.


Data Quality

Because many of the observations in the dataset were recorded under high intensity scenarios (increasing the probability of errors), we knew that a thorough assessment of data quality would be necessary. While some errors (missing rows of data, logical inconsistencies between variables, etc.) were easy to spot, the complex real-world nature of the data presented additional challenges as well.

For instance, in an ideal scenario, each entry in the dataset would correspond to a single patient, with a single response unit recording information for that patient. In actuality, multiple units often respond to the same incident, with each unit keeping records. For these cases, the dataset ends up with multiple rows that reference the same patient and we risk over-counting the true number of patients serviced, with the potential to skew the actual distribution of the patient-level variables we hope to explore.

Further complicating things is the fact that in some cases, multiple units may respond to a single incident, but may actually service different patients (this is often true for traffic-related incidents, which frequently have multiple patients involved).

Avoiding both the possibility of removing legitimate entries from the dataset while also identifying duplicated patients proved to be a challenge. Ultimately, we needed to reduce the raw dataset to two clean datasets: one where each observation is a single patient, and one where each observation is a single unit (either may be appropriate depending on the research question being explored).

By cross-referencing demographic characteristics, medical numbers (where they existed), and operations variables across rows, we were ultimately able to identify many thousands of duplicate patients in the dataset and remove them. One key takeaway from this process is that the development of systems to better link EMS records when multiple units are involved would be valuable tool for more extensive analysis of EMS data in general.


Response Time Modeling

Of key interest to the Charlottesville and Albemarle EMS teams was the potential that services were being distributed unequally across demographic and socioeconomic subpopulations. By their very nature, the incidents that EMS providers encounter are highly time-sensitive. With this in mind, we attempted to assess whether response times varied across demographic groups after accounting for the effects of other operational factors. Even without intent on the part of service providers, longstanding systemic racism could still reveal itself in response times: for instance, decades-old practices of redlining and targeted infrastructure development may serve to increase accessibility to certain areas within the response region, and these areas may be highly correlated with race, age, and socioeconomic factors. Additionally, we were especially interested to determine how any observed differences may have changed since the advent of the COVID-19 pandemic.


Model Specification

We used an iterative process to specify our final model. We began by explicitly specifying the covariates of interest and the covariates we needed to control for:

  • Covariates of interest:
    • Age
    • Gender
    • Race
    • Symptoms recorded (by provider and/or patient)
    • Whether the COVID-19 pandemic had begun yet
  • Covariates to control for:
    • Type of vehicle responding
    • Travel distance to location
    • Time of Day
    • Possible Spatial Effects

Unfortunately, systematic data on location at dispatch was not available, limiting our ability to control for an important variable: distance between dispatch and incident locations. To account for some of this variability, we instead used the neighborhood (within Charlottesville) or census tract (outside Charlottesville) as a proxy for distance. However, this is only a rough approximation, and should be considered when interpreting the model results.

We began by using a linear model. Because the distribution of response times was heavily right skewed, we log transformed the data prior to fitting. Our model was as follows:

\[ \begin{align*} log(\mbox{Response Time})_i\vert \alpha, \beta_1,\dots\beta_9,\sigma \overset{iid}\sim N(&\alpha + \beta_1 (\mbox{COVID era})_i + \boldsymbol{\beta_2}^T\boldsymbol{(\mbox{Demographics})_i} +\\ &\boldsymbol{\beta_3}^T\boldsymbol{(\mbox{Symptoms})_i} + \boldsymbol{\beta_4}^T\boldsymbol{(\mbox{Vehicle})_i} + \\ &\boldsymbol{\beta_5}^T\boldsymbol{(\mbox{Time of Day})_i} +(\mbox{COVID era})_i* \\ (&\boldsymbol{\beta_6}^T\boldsymbol{(\mbox{Demographics})_i} + \boldsymbol{\beta_7}^T\boldsymbol{(\mbox{Symptoms})_i} + \\ &\boldsymbol{\beta_8}^T\boldsymbol{(\mbox{Vehicle})_i} + \boldsymbol{\beta_9}^T\boldsymbol{(\mbox{Time of Day})_i}),\; \sigma^2) \end{align*} \] \[ \beta_1,\dots,\beta_9 \overset{iid}\sim N(0, 1.1)\\ \alpha \sim N(0, 4.6) \\ \sigma \sim Exp(2.2) \]

Where COVID era is an indicator variable with value 1 for incidents after March 15th, 2020 and 0 otherwise. Note that each bolded coefficient actually represents a set of coefficients that accounts for the multiple categorical levels within each variable type.

We included an interaction term for "COVID Era" so we could assess whether the effects on response times of any of the other factors included in the model have changed with the advent of the COVID-19 pandemic.

We chose these uninformative priors due to our lack of any strong prior beliefs about the coefficients. On a technical note, these priors are in the column space of the Q matrix of a QR decomposition of the design matrix, which decreases convergence time of the MCMC algorithm.

This model performed reasonably well. However, there were confounding spatial effects, and it failed to control for travel distance to location.

In an attempt to control for these issues, we fit the following multilevel model:

\[ log(\mbox{Response Time})\vert\alpha,\boldsymbol{\beta},\boldsymbol{b},\sigma \sim N(\alpha + \boldsymbol{X\beta} + \boldsymbol{Zb}, \sigma^2\boldsymbol{I}) \] \[ \alpha \sim N(0,4.6)\\ \boldsymbol{\beta} \sim N(0, 1.15\boldsymbol{I})\\ \boldsymbol{b}\vert\boldsymbol{\Sigma} \sim N(0, \boldsymbol{\Sigma})\\ \sigma \sim Exp(0,2.2)\\ \boldsymbol{\Sigma} \sim \mbox{Decomposition of Covariance}(1, 1, 1, 1) \]

Here \(\boldsymbol{X}\) and \(\boldsymbol{\beta}\) are identical to the earlier specification and have been written as such for simplicity. \(\boldsymbol{Z}\) is a matrix encoding deviations in intercepts and covariates across neighborhoods. The prior for \(\boldsymbol\Sigma\) is complicated to write out, and is specified using the rstanarm function decov(1,1,1,1). More can be found on this specification here.

By including information about the neighborhood that each incident occurred in, we hoped to both account for any spatial autocorrelation as well as our inability to more directly measure vehicle travel distance.

Both of these models were fit using Bayesian estimation with the R package rstanarm. Due to its superior performance on both spatial autocorrelation issues and information criterion, we used the multilevel model for our interpretations (though results were similar for both models). Model assessment and results are discussed in the findings section of the website.


COVID-19 Indicator

Medical knowledge of COVID-19 has been rapidly evolving throughout the course of the pandemic. Initially, healthcare providers focused on three primary symptoms: cough, fever, and shortness of breath, but over time other symptoms, like diarrhea and loss of taste or smell have also become associated with the virus. Furthermore, a large number of people may exhibit few to no symptoms at all, which is one of the primary challenges to containing the virus. Thus, while few of the provider impressions in the EMS dataset directly refer to COVID-19, it remains possible that far more cases exist in the dataset than may be explicitly identified. Because of the numerous symptoms that may be associated with COVID-19, though, we needed to produce a more concise indicator that would give a sense of whether COVID-19 symptoms were present for a given patient before we could explore potential trends.

After multiple conversations with medical professionals, we decided to focus on identifying the following symptoms as we developed an indicator for COVID-19:

  • Cough
  • Fever/Chills
  • Shortness of breath
  • Myalgia
  • Fatigue
  • Headache
  • Diarrhea
  • Nausea/Vomiting
  • Hypoxemia (with or without response to oxygen)
  • Cyanosis
  • Stroke/cardiac arrest (in young people)

Because information about the presence of these symptoms could be encoded in various places in the dataset (for instance, in EMS provider impressions, patient complaints, vital sign results, etc.), we consolidated the results from multiple variables as we searched for these symptoms. However, there were a variety of ways many of these symptoms could be recorded. For instance, shortness of breath could be recorded as "difficulty breathing", "shortness of breath", "sob", etc. Below we present a summary of the final classification system we used to develop our indicator of COVID-like symptoms. At this point, we had a record of the frequency that these symptoms appeared in the dataset, and could use our combined indicator to assess trends in these symptoms in Charlottesville and Albemarle.

Table 1: Summary of Symptoms Used For COVID-19 Indicator
Symptom Complaint/Impression Keywords Assessments Medications Pre-COVID Count During COVID Count
Cough cough -- albuterol (proventil)*; albuterol/ipratropium (duoneb)*; ipratropium (atrovent)* 264 31
Fever/Chills fever; chill; rigor; shiver temperature measurment greater than or equal to 100.4 degrees F -- 980 89
Shortness of Breath breath; sob -- racemic epinephrine; oxygen; albuterol (proventil)*; albuterol/ipratropium (duoneb)*; ipratropium (atrovent)* 8212 490
Myalgia myalgia; muscle ache; muscle soreness; muscle pain -- -- 8 2
Fatigue fatigue -- -- 83 7
Headache headache; neuro - headache - migraine; neuro - headache -- -- 1492 57
Diarrhea diarrhea; gi/gu - diarrhea -- -- 470 23
Nausea/Vomiting nausea, vomit; gi/gu - nausea (without vomiting); gi/gu - nausea (with vomiting) -- ondansetron (zofran) 4458 202
Hypoxemia -- first or last pulse oximetry measurement less than or equal to 94% oxygen saturation, last oximetry measurement less than or equal to initial oximetry measurement oxygen 13565 778
Cyanosis -- "cyanotic" appears in skin assessment findings -- 73 5
Stroke (in young people) neuro - stroke/cva Patient age less than or equal to 64 -- 129 5
Cardiac Arrest (in young people) cv - cardiac arrest; cv - cardiac arrest/obvious death; cv - chest pain - myocardial infarction (non-stemi) Patient age less than or equal to 64 -- 443 29
* Because these medications are more generally used (and not specifically targeted toward a potential COVID-19 symptom), we include them only as a catch-all: they do not contribute to the total symptom count if cough or shortness of breath are already listed in the impression/complaint columns.