Below we will introduce Air Quality Index (AQI), and demonstrate how to use tidyindex package to compute it based on existing datasets.
Everything about AQI
The Air Quality Index (AQI) is an index created to measure air quality, with high values indicating greater pollution levels. The index categorized into six levels, each corresponding to a specific numerical range:
- Good (0 - 50),
- Moderate (50 - 100),
- Unhealthy for Sensitive Groups (101 - 150),
- Unhealthy (151 - 200),
- Very Unhealthy (201 - 300), and
- Hazardous (300 and higher).
The AQI is calculated individually for each of the six major pollutants: Ozone (O3), PM2.5, PM10, Carbon Monoxide (CO), Sulfur Dioxide (SO2), Nitrogen Dioxide (NO2). The pollution concentrations are first truncated, and then the AQI value for each pollutant is calculated based on the following equation:
where:
- is the AQI for pollutant , 
- is the truncated concentration of pollutant , 
- and are the high and low concentration breakpoints of , which are provided in Table 6 of the Technical Assistance Document for the Reporting of Daily Air Quality. 
- and are the AQI values corresponding to and , respectively. 
Each individual AQI value is rounded to the nearest integer, and the highest value among the six pollutant is taken as the final AQI.
As an example, with an 8-hour measurement of CO value of 4.67, this value is first truncate to 4.6. For CO concentrations ranging from 4.5 to 9.4, the corresponding AQI range is 51 to 100. Using the formula above, the AQI is calculated as follows:
This gives an AQI value of 75 for CO after rounding.
Construct AQI with tidyindex
In this section, we will demonstrate how to compute the AQI using the tidyindex package. The aqi data, available in the tidyindex package, contains daily PM2.5 values recorded from January to March 2024 at three monitoring sites in Travis county, Texas, USA.
aqi
#> # A tibble: 272 × 9
#>    pollutant  code date       value   aqi  long   lat site_code site_name       
#>    <chr>     <dbl> <date>     <dbl> <int> <dbl> <dbl>     <dbl> <chr>           
#>  1 PM2.5     88101 2024-01-01  10      53 -97.8  30.4        14 Austin North Hi…
#>  2 PM2.5     88101 2024-01-02   9.1    51 -97.8  30.4        14 Austin North Hi…
#>  3 PM2.5     88101 2024-01-03   7.1    39 -97.8  30.4        14 Austin North Hi…
#>  4 PM2.5     88101 2024-01-04  10.3    53 -97.8  30.4        14 Austin North Hi…
#>  5 PM2.5     88101 2024-01-05   3.7    21 -97.8  30.4        14 Austin North Hi…
#>  6 PM2.5     88101 2024-01-06   4.5    25 -97.8  30.4        14 Austin North Hi…
#>  7 PM2.5     88101 2024-01-07   6.2    34 -97.8  30.4        14 Austin North Hi…
#>  8 PM2.5     88101 2024-01-08   7.5    42 -97.8  30.4        14 Austin North Hi…
#>  9 PM2.5     88101 2024-01-09   2.7    15 -97.8  30.4        14 Austin North Hi…
#> 10 PM2.5     88101 2024-01-10   7.5    42 -97.8  30.4        14 Austin North Hi…
#> # ℹ 262 more rowsAugment low/high breakpoints
To compute the AQI, as outlined in Section 1, we must first determine the high and low breakpoints for each sample measurement, as well as the corresponding AQI breakpoints. The helper functions below are designed to assist with retrieving this breakpoint information and augmenting them into the aqi data.
lookup_helper <- function(sample, subset){
  return(subset %>% filter(sample >= low & sample <= high))
}
breakpoint_lookup <- function(dataset){
  # takes the dataset of measurements as input
  # returns a tibble object with corresponding breakpoints and group info
  id <- dataset$code[1]
  concentration <- dataset$value
  if(id == "44201"){
    concentration <- trunc(concentration * 10^3)/10^3
  }
  else if (id == "88101" | id == "42101"){
    concentration <- trunc(concentration * 10)/10
  }
  else{
    concentration <- trunc(concentration)
  }
  subset <- pollutant_ref_tbl %>% filter(code == id)
  results <- map_dfr(concentration, ~ lookup_helper(.x, subset)) %>%
    bind_cols(aqi |> select(-pollutant, -code))
  return(results)
}The lookup_helper() function finds the high and low breakpoints for any given samples. The breakpoint_lookup() function first truncates each measurement based on the pollutant, then calls lookup_helper() to augment the high/low breakpoints and group information.
(aqi_aug <- breakpoint_lookup(aqi))
#> # A tibble: 272 × 12
#>    pollutant  code group            low  high date       value   aqi  long   lat
#>    <chr>     <dbl> <chr>          <dbl> <dbl> <date>     <dbl> <int> <dbl> <dbl>
#>  1 PM2.5     88101 Unhealthy        9.1  35.4 2024-01-01  10      53 -97.8  30.4
#>  2 PM2.5     88101 Unhealthy        9.1  35.4 2024-01-02   9.1    51 -97.8  30.4
#>  3 PM2.5     88101 Very Unhealthy   0     9   2024-01-03   7.1    39 -97.8  30.4
#>  4 PM2.5     88101 Unhealthy        9.1  35.4 2024-01-04  10.3    53 -97.8  30.4
#>  5 PM2.5     88101 Very Unhealthy   0     9   2024-01-05   3.7    21 -97.8  30.4
#>  6 PM2.5     88101 Very Unhealthy   0     9   2024-01-06   4.5    25 -97.8  30.4
#>  7 PM2.5     88101 Very Unhealthy   0     9   2024-01-07   6.2    34 -97.8  30.4
#>  8 PM2.5     88101 Very Unhealthy   0     9   2024-01-08   7.5    42 -97.8  30.4
#>  9 PM2.5     88101 Very Unhealthy   0     9   2024-01-09   2.7    15 -97.8  30.4
#> 10 PM2.5     88101 Very Unhealthy   0     9   2024-01-10   7.5    42 -97.8  30.4
#> # ℹ 262 more rows
#> # ℹ 2 more variables: site_code <dbl>, site_name <chr>Note that to efficiently use the tidyindex pipeline, it is recommended to put all needed data columns into one single dataframe or tibble object.
The construction of AQI can be divided into two steps:
- a minmax rescaling step:
- a variable transformation step (affine transformation): .
Rescaling
The minmax rescaling is performed by the rescaling module and the rescale_minmax() step requires min and max argument, which correspond to teh low and high breakpoints we’ve looked up. The rescaled values are stored in the minmax column.
pipeline <- init(aqi_aug) |> 
  rescaling(minmax = rescale_minmax(value, min = low, max = high))
pipeline$data %>% dplyr::select(pollutant, date, low, high, value, minmax)
#> # A tibble: 272 × 6
#>    pollutant date         low  high value minmax
#>    <chr>     <date>     <dbl> <dbl> <dbl>  <dbl>
#>  1 PM2.5     2024-01-01   9.1  35.4  10   0.0342
#>  2 PM2.5     2024-01-02   9.1  35.4   9.1 0     
#>  3 PM2.5     2024-01-03   0     9     7.1 0.789 
#>  4 PM2.5     2024-01-04   9.1  35.4  10.3 0.0456
#>  5 PM2.5     2024-01-05   0     9     3.7 0.411 
#>  6 PM2.5     2024-01-06   0     9     4.5 0.5   
#>  7 PM2.5     2024-01-07   0     9     6.2 0.689 
#>  8 PM2.5     2024-01-08   0     9     7.5 0.833 
#>  9 PM2.5     2024-01-09   0     9     2.7 0.3   
#> 10 PM2.5     2024-01-10   0     9     7.5 0.833 
#> # ℹ 262 more rowsVariable transformation
While an affine transformation step is not initially included in the tidyindex package, you can always easily create one:
trans_affine <- function(var, a = NULL, b = NULL){
  fn <- function(x, a = NULL, b = NULL) a*x + b
  new_trans("trans_affine", var = enquo(var), 
            fn = fn, a = enquo(a), b = enquo(b))
}The trans_affine function has three inputs:
- 
var, the variable to be transformed;
- 
a, the multiplicative coefficient, and
- 
b, the additive constant
The fn function specifies the transformation as a*x + b. The new_trans() constructor is then called to register the name along with other inputs, including the transformed variable, the transformation function fn, and associated parameters a and b. This creates a “recipe” for the transformation that can be evaluated standalone and used in the variable_trans() module.
trans_affine(minmax, a = high - low, b = low)
#> [1] "trans_affine"
#> attr(,"var")
#> <quosure>
#> expr: ^minmax
#> env:  global
#> attr(,"fn")
#> function (x, a = NULL, b = NULL) 
#> a * x + b
#> <bytecode: 0x5608982c73a0>
#> <environment: 0x5608982c7ad8>
#> attr(,"args")
#> attr(,"args")$a
#> <quosure>
#> expr: ^high - low
#> env:  global
#> 
#> attr(,"args")$b
#> <quosure>
#> expr: ^low
#> env:  global
#> 
#> attr(,"class")
#> [1] "var_trans"Note: the enquo() function is required for additional parametersa and b in the new_trans() constructor. This creates a defused expression that describes how to evaluate the values but not actually execute the evaluation. For more details, see rlang::enquo()
This concludes our pipeline for computing the AQI:
pipeline <- pipeline |> 
  variable_trans(AQI = trans_affine(minmax, a = high - low, b = low))
pipeline$data$AQI <- round(pipeline$data$AQI)
pipeline$data |> dplyr::select(pollutant, date, low, high, value, minmax, AQI)
#> # A tibble: 272 × 7
#>    pollutant date         low  high value minmax   AQI
#>    <chr>     <date>     <dbl> <dbl> <dbl>  <dbl> <dbl>
#>  1 PM2.5     2024-01-01   9.1  35.4  10   0.0342    10
#>  2 PM2.5     2024-01-02   9.1  35.4   9.1 0          9
#>  3 PM2.5     2024-01-03   0     9     7.1 0.789      7
#>  4 PM2.5     2024-01-04   9.1  35.4  10.3 0.0456    10
#>  5 PM2.5     2024-01-05   0     9     3.7 0.411      4
#>  6 PM2.5     2024-01-06   0     9     4.5 0.5        4
#>  7 PM2.5     2024-01-07   0     9     6.2 0.689      6
#>  8 PM2.5     2024-01-08   0     9     7.5 0.833      8
#>  9 PM2.5     2024-01-09   0     9     2.7 0.3        3
#> 10 PM2.5     2024-01-10   0     9     7.5 0.833      8
#> # ℹ 262 more rowsNow let’s check our results. Below we show a line graph with AQI values from all three monitor sites, computed using our pipeline.
library(ggplot2)
pipeline$data$date <- as.Date(pipeline$data$date)
ggplot(pipeline$data, aes(x = date, y = AQI, color = site_name)) +
  geom_line() + 
  labs(title = "AQI Values Over Time by Site",
       x = "Date",
       y = "AQI",
       color = "Monitor Sites") +
  theme_minimal() +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  theme(legend.position = "bottom")