2.2 Frequency

Table 2.3 shows the number of chocolate chips in a Chips Ahoy cookie measured by 33 graduate students. Each person (Id) measured his/her own cookie twice (Time_1 and Time_2). Prior to collecting these data, the students discussed what is considered a single chocolate chip. This is actually more difficult than you might think, because chocolate chips in a cookie come in different sizes and shapes (an individual chip, or chips melted and merged together into a giant glob).

Table 2.3: Number of Chocolate Chips Measured by Fall 2019 Class

 Id   Time_1   Time_2
  1       19       18
  2       17       16
  3        9       11
  ⋮        ⋮        ⋮
 31       20       24
 32       21       22
 33       29       30

Let’s create a frequency table for the measurements at time 1 (Time_1) using this dataset.

Here are the steps to create a frequency table:

Step 1
Find the highest value and the lowest value of our measured values. We often refer to a measurement as a score and use a letter, such as \(x\), to denote the variable that represents these scores as a vector of values.

In our chocolate chip example, the highest value is 32 and the lowest is 9 for measurements at time 1. We will only look at the Time_1 values for now.
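
In R this is a one-liner (a minimal sketch, assuming the data live in a tibble df_chips with a Time_1 column, as in the code used in the later steps):

# Step 1: Find the highest and the lowest measured value
max(df_chips$Time_1) # 32, the highest Time_1 score
min(df_chips$Time_1) # 9, the lowest Time_1 score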


Step 2
Count down from the highest value to the lowest value by measurement unit intervals. Put this sequence in column \(x\) of our frequency table in decreasing order.

measurement unit: the precision width of your measurement, in most cases \(1\) since we measure counts in whole numbers. This may seem trivial, because what else could it be other than \(1\)? Well, that is because we (humans) count in whole numbers, and we are conditioned to think this way. In reality, we made a preliminary decision about what constitutes 1 chocolate chip versus 2 chocolate chips. In this scenario, it is not possible to measure 1.5 chocolate chips. Note that we are not saying it is impossible to count 1.5 chips – we would just have to define what that is. The 1-vs-2-chips concept is something we invented for our measurement of the construct of chocolate chip. We as a society “agreed” on a systematic, procedural way to count chocolate chips.

If we had agreed on measuring the magnitude of chocolate chips at \(0.5\) intervals, then we would count down by \(0.5\) instead.

# Step 2: Generate the sequence of all possible scores, by measurement unit
library(tidyverse) # for tibble, dplyr, and %>% used throughout

chocochip_unit <- 1 # measurement unit
x <- sort(seq(min(df_chips$Time_1), max(df_chips$Time_1), by = chocochip_unit),
          decreasing = TRUE)
df_freq <- tibble(x = x)
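
A quick look at the result (values shown assuming the Time_1 data above):

df_freq$x[1:5] # 32 31 30 29 28, counting down by chocochip_unit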

Step 3
For each row value of column \(x\), count the number of times that value appeared in our measurements. Put these counts in column \(f\) (for frequency) of our table.

Table 2.4: Steps 3 - 4

  x    f   cf
 32    1   33
 31    0   32
 30    1   32
 29    1   31
 28    1   30
 27    1   29
 26    2   28
 25    4   26
 24    0   22
 23    1   22
 22    3   21
 21    3   18
 20    4   15
 19    3   11
 18    1    8
 17    1    7
 16    2    6
 15    1    4
 14    0    3
 13    2    3
 12    0    1
 11    0    1
 10    0    1
  9    1    1

Note: the frequency is 0 when a value of x was not measured.

Note that some frequency counts are 0, for example \(x = 31\). This is because none of the 33 observers measured an instance of 31 chocolate chips in their cookies. This does not mean the Chips Ahoy company does not make cookies with 31 chocolate chips. It just means that we failed to observe such an instance due to sampling error. If we had measured all the cookies produced by the company (i.e., the population of all cookies) and repeated the measurement experiment, we would know whether there truly wasn’t a single cookie with 31 chips. In our sample of 33 cookies, we don’t know whether \(f = 0\) at \(x = 31\) is due to chance. In the population of all cookies, we would know for a fact that \(f = 0\) at \(x = 31\), because we would have the measurement for the entire population.

Also note that saying “This cookie has 1 ‘chocolate chip’” is equivalent to saying “This cookie has somewhere between 0.5 and 1.5 ‘chocolate chips’”. The former is a very strong statement that cannot possibly be true, as we genuinely can’t know what exactly 1 chocolate chip is. The latter is a statement that communicates the uncertainty due to measurement error in our procedural way of counting chocolate chips.

This “procedural way to count chocolate chips” is also called an operationalization of chocolate chip. More generically, “number of chocolate chips” is an example of a construct. A construct is a broad concept or topic of study interest. Another example of a construct, outside the cookie-factory context, is “intelligence”. Just as “number of chocolate chips” can be a difficult thing to define and measure, “intelligence” is difficult to define and measure. Our measurement error comes from various sources, including how we define these constructs and how we operationalize them.

# Step 3: Calculate frequency over min to max of x, for each measurement unit
# Count the number of times each Time_1 value appeared, and put the counts in column f
df_freq <- as_tibble(
  as.data.frame.table(
    # factor levels in increasing order, so rows come out ascending in x
    # (this row ordering is what makes cumsum() work in Step 4)
    table(x = factor(df_chips$Time_1, levels = rev(df_freq$x))),
    responseName = "f")
)
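
As a quick sanity check (a sketch, using the same objects as above), the frequencies should account for every one of the 33 measurements:

sum(df_freq$f) == nrow(df_chips) # TRUE: f sums to all 33 measurements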

Step 4
Calculate the cumulative frequency and save it to our table as \(cf\). To do this, we sum up the frequencies \(f\) over the values of \(x\) in ascending order (lowest value to highest value). Note that in the R code below (using library(dplyr)), the rows come out of Step 3 already in ascending order of \(x\), so cumsum(f) accumulates from the lowest score upward; we then arrange the rows in descending order of \(x\) purely for display, so that the table reads from the highest score at the top down to the lowest at the bottom, as in Table 2.4.

# Step 4: Calculate cumulative frequency
df_freq <- df_freq %>% 
  mutate(cf = cumsum(f)) %>% # rows are ascending in x, so cf accumulates from the lowest score
  arrange(desc(x)) # display with the highest score at the top, as in Table 2.4

It looks like we are simply taking a running sum of the frequencies, but it is very important to understand the meaning of cumulative frequency, especially in terms of limits as per Limits & Rounding.

Cumulative frequency by definition is the frequency of scores falling at or below the upper limit of a score.

You were probably exposed to cumulative frequency through the simpler definition of “the frequency by which the observed values X are less than or equal to Xr” (Wikipedia - Cumulative Frequency Analysis), and/or you somehow equated it with a discrete version of the similar concept from probability distributions, the Cumulative Distribution Function (CDF). What you previously knew about cumulative frequency is valid, but you should augment that prior knowledge with the concepts of measurement and uncertainty. Remember, a score of \(x_{i}\) doesn’t actually mean “Observation \(i\) scored exactly \(x_{i}\).” It is more appropriate to say “Observation \(i\) scored somewhere between \((x_{i} - \frac{1}{2}\) of \(x_{\bullet, \text{unit of measurement}})\) and \((x_{i} + \frac{1}{2}\) of \(x_{\bullet, \text{unit of measurement}})\).”
Hence, the algorithm for computing cumulative frequency is as follows:

  1. For \(min(x)\), \(cf = f\)
  2. For all \(i\) where \(x_{i} > min(x)\), \(cf =\) number of observations with score \(< x_{i, UL}\)
  3. For \(max(x)\), \(cf = N\), where \(N = \sum{f}\), aka the total count of all frequencies
\((x_{LL} - x_{UL})\)    x   f   cf
(31.5 - 32.5)           32   1   33
(30.5 - 31.5)           31   0   32
(29.5 - 30.5)           30   1   32
(28.5 - 29.5)           29   1   31
(27.5 - 28.5)           28   1   30
(26.5 - 27.5)           27   1   29
(25.5 - 26.5)           26   2   28
(24.5 - 25.5)           25   4   26
(23.5 - 24.5)           24   0   22
(22.5 - 23.5)           23   1   22
(21.5 - 22.5)           22   3   21
(20.5 - 21.5)           21   3   18
(19.5 - 20.5)           20   4   15
(18.5 - 19.5)           19   3   11
(17.5 - 18.5)           18   1    8
(16.5 - 17.5)           17   1    7
(15.5 - 16.5)           16   2    6
(14.5 - 15.5)           15   1    4
(13.5 - 14.5)           14   0    3
(12.5 - 13.5)           13   2    3
(11.5 - 12.5)           12   0    1
(10.5 - 11.5)           11   0    1
(9.5 - 10.5)            10   0    1
(8.5 - 9.5)              9   1    1
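
To make this concrete, here is a minimal sketch (reusing df_chips, chocochip_unit, and the ascending grid of scores from Step 2) that computes \(cf\) directly from the upper limits and confirms it matches the running sum of frequencies:

# cf from the definition: count observations falling below each score's upper limit
x_asc <- seq(min(df_chips$Time_1), max(df_chips$Time_1), by = chocochip_unit)
cf_def <- sapply(x_asc + chocochip_unit / 2, # upper limit of each score
                 function(ul) sum(df_chips$Time_1 < ul))
# Same numbers as the cumulative sum of frequencies in ascending order of x
all(cf_def == cumsum(table(factor(df_chips$Time_1, levels = x_asc)))) # TRUE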

By adding the concept of limits, we can further elaborate on concepts like range, going beyond what you have probably learned as “maximum value - minimum value + 1”. What range really is:

\[\begin{align*} range(X) & = x_{max} - x_{min} + 1 \\ & = x_{max} - x_{min} + 0.5 + 0.5 \\ & = x_{max} - x_{min} + \frac{1}{2} x_{\bullet, \text{unit of measurement}} + \frac{1}{2} x_{\bullet, \text{unit of measurement}} \\ & = x_{max} + \frac{1}{2} x_{\bullet, \text{unit of measurement}} - x_{min} + \frac{1}{2} x_{\bullet, \text{unit of measurement}}\\ & = (x_{max} + \frac{1}{2} x_{\bullet, \text{unit of measurement}}) - (x_{min} - \frac{1}{2} x_{\bullet, \text{unit of measurement}}) \\ & = max(x_{UL}) - min(x_{LL}) \end{align*}\]

Hence, range is actually defined by the lower limit of \(min(x)\) and the upper limit of \(max(x)\).

In most cases, where we compute frequency over whole-number counts, the class width is equal to the measurement unit, which is just 1. That is why the simple rule you have learned, “max - min + 1”, works: it is a simplification of this concept that holds because we count in whole-number units.
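
In R, this works out as follows (a sketch, reusing df_chips and chocochip_unit from the earlier steps):

# Range via real limits: upper limit of max(x) minus lower limit of min(x)
x_UL_max <- max(df_chips$Time_1) + chocochip_unit / 2 # 32.5
x_LL_min <- min(df_chips$Time_1) - chocochip_unit / 2 # 8.5
x_UL_max - x_LL_min # 24, the same as max - min + 1 when the unit is 1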


Step 5
Calculate the proportion (relative frequency) of each \(x\) by dividing the frequency, \(f\), by the sum of all frequencies (or equivalently, the maximum of the cumulative frequency). Similarly, calculate the cumulative proportion by dividing the cumulative frequency, \(cf\), by the sum of all frequencies.
Note that the cumulative proportion, \(cp\), is not cumsum(p) (remember, our table rows are now arranged in descending order of \(x\)); it is defined from the cumulative frequency.

# Step 5: Calculate relative frequency (proportion)
df_freq <- 
  df_freq %>% 
  mutate(p = f / max(cf),
         cp = cf / max(cf))
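
A couple of quick checks on the result (a sketch, using the df_freq built above):

sum(df_freq$p)  # 1 (up to floating point): the proportions sum to one
max(df_freq$cp) # 1: the cumulative proportion reaches one at max(x)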




  x    f   cf        p       cp
 32    1   33   0.0303   1.0000
 31    0   32   0.0000   0.9697
 30    1   32   0.0303   0.9697
 29    1   31   0.0303   0.9394
 28    1   30   0.0303   0.9091
 27    1   29   0.0303   0.8788
 26    2   28   0.0606   0.8485
 25    4   26   0.1212   0.7879
 24    0   22   0.0000   0.6667
 23    1   22   0.0303   0.6667
 22    3   21   0.0909   0.6364
 21    3   18   0.0909   0.5455
 20    4   15   0.1212   0.4545
 19    3   11   0.0909   0.3333
 18    1    8   0.0303   0.2424
 17    1    7   0.0303   0.2121
 16    2    6   0.0606   0.1818
 15    1    4   0.0303   0.1212
 14    0    3   0.0000   0.0909
 13    2    3   0.0606   0.0909
 12    0    1   0.0000   0.0303
 11    0    1   0.0000   0.0303
 10    0    1   0.0000   0.0303
  9    1    1   0.0303   0.0303