2.3 Grouped Frequency

When counting every unit of measure separately is too granular, we can create a grouped frequency distribution. To do this, we need to decide on the number of groups (\(n_{class}\)) and the width of each group (class width or interval, \(i\)) into which we will combine the original scores.

\[\begin{equation*} n_{class} = \frac{range(X)}{i} \end{equation*}\]

Some rules/guidelines:

  1. If \(\frac{range(X)}{i}\) is not a whole number, round it up and use the rounded-up number as \(n_{class}\)
  2. \(i\) should be odd
  3. The lower limit (stated, or apparent, limit) of a new class (aka interval) should be divisible by \(i\)

(TODO: reasoning for #2, #3)
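
As a worked illustration of rule 1, take hypothetical values (chosen for the arithmetic, not taken from the cookie data): a range of 25 and a class width of \(i = 3\). The quotient is not a whole number, so it is rounded up:

\[\begin{equation*} n_{class} = \left\lceil \frac{range(X)}{i} \right\rceil = \left\lceil \frac{25}{3} \right\rceil = \lceil 8.33 \rceil = 9 \end{equation*}\]

With \(i = 5\) instead, \(25 / 5 = 5\) exactly, so no rounding is needed and \(n_{class} = 5\).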

If you start from the number of classes instead, the same formula applies, rearranged for \(i\):

\[\begin{equation} i = \frac{range(X)}{n_{class}} \tag{2.1} \end{equation}\]
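
For instance, with a hypothetical range of 25 split into four classes, (2.1) gives a width that is not a whole number. Rounding up, as the code in this section does with `ceiling()`, is one reasonable choice here, and in this case the result also happens to be odd, in line with guideline 2:

\[\begin{equation*} i = \left\lceil \frac{25}{4} \right\rceil = \lceil 6.25 \rceil = 7 \end{equation*}\]

Four classes of width 7 span 28 units, which is enough to cover the range of 25.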

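To compute grouped frequency, first build the class limits (here \(k\) indexes the classes, from the smallest up, to avoid confusion with the width \(i\)):

  1. Start with the limits of the smallest class: \((class_{\ 1,\ LL}, class_{\ 1,\ UL}) = (min(x_{LL}),\ min(x_{LL}) + i)\)
  2. Add \(i\) for each subsequent class until \(max(x_{UL})\) is reached: \((class_{\ k,\ LL}, class_{\ k,\ UL}) = (class_{\ k-1,\ UL},\ class_{\ k-1,\ UL} + i)\)
  3. Stop at the class whose upper limit is \(max(x_{UL})\): \((class_{\ k,\ LL}, class_{\ k,\ UL}) = (class_{\ k-1,\ UL},\ max(x_{UL}))\)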
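
The steps above can be sketched compactly in base R. This is only an illustration under stated assumptions: the data in `x_demo` are invented, `unit_demo` assumes scores are measured in whole units, and `cut()` with `table()` is swapped in as one convenient way to count scores per class, not necessarily the approach used elsewhere in this chapter.

```r
# Hypothetical chip counts, invented for illustration (not the chapter's data)
x_demo <- c(9, 10, 12, 13, 14, 17, 18, 21, 22, 25, 27, 30, 33)
unit_demo <- 1   # assumed unit of measurement
n_classes <- 5

# Class width: range between real limits divided by the number of classes, rounded up
i_width <- ceiling((max(x_demo) - min(x_demo) + unit_demo) / n_classes)

# Break points at real limits: start at the lowest lower limit (step 1),
# then add i_width once per class (steps 2 and 3)
breaks <- min(x_demo) - unit_demo / 2 + i_width * 0:n_classes

# cut() assigns each score to a class; table() counts the scores in each class
classes <- cut(x_demo, breaks = breaks)
f <- as.vector(table(classes))

freq_demo <- data.frame(
  class = levels(classes),
  f     = f,
  cf    = cumsum(f),           # cumulative frequency, lowest class first
  p     = f / sum(f),          # proportion
  cp    = cumsum(f) / sum(f)   # cumulative proportion
)
print(freq_demo)
```

Because the break points sit on half units, no whole-number score can fall exactly on a boundary, so the choice between open and closed interval ends in `cut()` does not affect the counts here.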

    Example with 5 classes

Using nClasses as the starting point and applying (2.1):

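# tibble(), %>% and the data-manipulation verbs used below
library(dplyr)

# x (chips per cookie) and chocochip_unit (unit of measurement)
# are assumed to be defined in an earlier section

# number of classes
nClasses <- 5

# interval width, i: inclusive range (+ 1 counts both end scores), rounded up
iWidth <- ceiling((max(x) - min(x) + 1) / nClasses)
# real lower and upper limits of each class
class_LL <- seq(from = min(x) - chocochip_unit/2, to = max(x) - chocochip_unit/2, by = iWidth)
class_UL <- class_LL + iWidth
class_name <- paste0(class_LL, " - ", class_UL)
#  "8.5 - 13.5", # (min lower limit of x, .... )
# "13.5 - 18.5", 
# "18.5 - 23.5", 
# "23.5 - 28.5",
# "28.5 - 33.5"  # (..., max upper limit of x)

# Pretty class name using the stated limits (without the real bounds);
# converted to a factor below for plotting and grouping in R
chips_group <- paste0(ceiling(class_LL), " - ", floor(class_UL))

df_freq_grouped <- tibble(x) %>% 
  rowwise() %>% 
  mutate(
    # assumes correct (class_LL, class_UL)
    which_group_index = which(x > class_LL & x < class_UL), 
    `$(class_{LL} - class_{UL})$` = class_name[which_group_index],
    `Chip Group (class)` = factor(chips_group[which_group_index], levels = chips_group) 
  ) %>%
  group_by(`$(class_{LL} - class_{UL})$`, `Chip Group (class)`) %>%
  tally(name = "f") %>% 
  # tally() keeps one grouping level; drop it so cf, p and cp span all classes
  ungroup() %>% 
  # accumulate from the lowest class up (the factor keeps the numeric order)
  arrange(`Chip Group (class)`) %>% 
  mutate(
    cf = cumsum(f),
    p = round(f / sum(f), 3),
    cp = round(cf / sum(f), 3)
  ) %>% 
  arrange(desc(`Chip Group (class)`)) # display the highest class first, so values increase from bottom to top

| \((class_{LL} - class_{UL})\) | Chip Group (class) | f | cf | p | cp |
|---|---|---|---|---|---|
| 28.5 - 33.5 | 29 - 33 | 4 | 24 | 0.167 | 1.000 |
| 23.5 - 28.5 | 24 - 28 | 5 | 20 | 0.208 | 0.833 |
| 18.5 - 23.5 | 19 - 23 | 5 | 15 | 0.208 | 0.625 |
| 13.5 - 18.5 | 14 - 18 | 5 | 10 | 0.208 | 0.417 |
| 8.5 - 13.5 | 9 - 13 | 5 | 5 | 0.208 | 0.208 |
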
library(ggplot2)
ungrouped_freq_plot <- ggplot(df_freq) + 
  geom_bar(aes(x, f), stat = "identity") +
  ylim(0, 5) +
  xlab("# of Chocolate Chips per Cookie") +
  ylab("Frequency")

grouped_freq_plot <- ggplot(df_freq_grouped) +
  geom_bar(aes(`Chip Group (class)`, f), stat = "identity") +
  ylim(0, 5) +
  xlab("# of Chocolate Chips per Cookie") +
  ylab("Frequency")

gridExtra::grid.arrange(ungrouped_freq_plot, grouped_freq_plot, widths=c(0.55, 0.45), ncol=2)