- 1. Motivation
- 2. Checking One Group (and intro to Hypothesis Testing)
- 3. Comparing Two Groups (Two-sample tests)
- 4. Comparing Multiple Groups (> 2)
- 4.1. Pre-requisite for ANOVA tests: do all groups have equal variances?
- 4.2. Pre-requisite for ANOVA tests: are all errors independent and normally distributed?
- 4.3. One-Way ANOVA: Equal Variances over data, Normal Distribution of errors
- 4.4. If cannot run ANOVA (assumptions failed)
- 4.5. Concepts behind the ANOVA model

- 5. Displaying Data

This is a conceptual overview of the ideas covered in NUS’s ST2137: Computer Aided Data Analysis. It is meant to give me a standard starting point when I next encounter a non-trivial data set that requires analysis.

We have some data, and we want to test something about a particular property of the data sample. Usually, this is the **mean** value of something in the sample.

These data samples are continuously distributed and one-dimensional. Think:

| Height (m) |
|---|
| 1.7 |
| 1.65 |
| 1.76 |
| 2.01 |
| 1.55 |
| … |

Usually, we want to test one of three things about the **mean** value of the data:

- Is the mean equal to some value? `(mu = mu0)`
- Is the mean greater than some value? `(mu > mu0)`
- Is the mean less than some value? `(mu < mu0)`

Before deciding on a test, we need to have information about the underlying **distribution** of the values.

If the population that this sample is extracted from follows a **normal** distribution, we can use:

- The **One-Sample T-Test**

If we don’t know or assume anything about the underlying distribution, we can use:

- The **Sign Test**
- The **Wilcoxon Signed-Rank Test**
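A minimal sketch of all three one-sample tests, assuming `scipy` is available (the height values and `mu0` are made up for illustration; scipy has no dedicated sign test, so it is done via a binomial test on the count of values above `mu0`):

```python
# Sketch: one-sample tests against a hypothesised mean mu0.
# The heights and mu0 below are made-up illustrative values.
from scipy import stats

heights = [1.70, 1.65, 1.76, 2.01, 1.55]
mu0 = 1.72  # hypothesised population mean

# One-Sample T-Test: assumes the population is normally distributed.
t_stat, p_t = stats.ttest_1samp(heights, popmean=mu0)

# Sign Test: no distributional assumption; count values above mu0 and
# compare against a fair coin with a binomial test.
n_above = sum(h > mu0 for h in heights)
p_sign = stats.binomtest(n_above, n=len(heights), p=0.5).pvalue

# Wilcoxon Signed-Rank Test: no normality assumption, but assumes the
# distribution of (h - mu0) is symmetric.
w_stat, p_w = stats.wilcoxon([h - mu0 for h in heights])
```

Each call returns a p-value: small values are evidence against `mu = mu0`.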

Instead of a single sample, we have two samples that may or may not be dependent on each other.

Think:

| Control group: Lifetime change in Height without Drug(TM) | Treatment Group: Lifetime change in Height with Drug(TM) |
|---|---|
| 1.7 | 1.66 |
| 1.65 | 1.67 |
| 1.76 | 1.77 |
| 2.01 | 1.99 |
| 1.55 | 1.99 |
| … | … |

Here, the control group and the treatment group are independent of each other (as long as the experiment is controlled correctly).

We want to check if **the mean values of both groups are equal**.

If the data in both groups is normally distributed, use the **Two Sample T-Test**:

- If the **variances are equal** (use the F-test to determine this!): this is the standard situation, which uses the **pooled variance** as the common variance estimate for the test.
- If the **variances are different** (again, use the F-test to determine this!): use the **Welch Two Sample T-Test** (the “Welch estimate” of the degrees of freedom is used).

If we don’t assume anything about the underlying distributions, use the **Wilcoxon Rank-Sum Test** (also called the Mann-Whitney U-test).
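A sketch of the two-sample decision path, assuming `scipy` (the group values are made up; scipy has no single F-test-for-variances call, so the variance ratio is compared against an F distribution by hand):

```python
# Sketch: two independent samples (made-up illustrative values).
import statistics
from scipy import stats

control = [1.70, 1.65, 1.76, 2.01, 1.55]
treatment = [1.66, 1.67, 1.77, 1.99, 1.99]

# F-test for equal variances: ratio of sample variances against an
# F distribution (two-sided p-value computed by hand).
f_ratio = statistics.variance(control) / statistics.variance(treatment)
df1, df2 = len(control) - 1, len(treatment) - 1
p_f = 2 * min(stats.f.sf(f_ratio, df1, df2), stats.f.cdf(f_ratio, df1, df2))

# Pooled-variance Two Sample T-Test (variances assumed equal).
t_pooled, p_pooled = stats.ttest_ind(control, treatment, equal_var=True)

# Welch Two Sample T-Test (variances NOT assumed equal).
t_welch, p_welch = stats.ttest_ind(control, treatment, equal_var=False)

# Wilcoxon Rank-Sum / Mann-Whitney U test (no normality assumption).
u_stat, p_u = stats.mannwhitneyu(control, treatment, alternative="two-sided")
```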

| Height before treatment | Height after treatment |
|---|---|
| 1.7 | 1.66 |
| 1.65 | 1.67 |
| 1.76 | 1.77 |
| 2.01 | 1.99 |
| 1.55 | 1.99 |
| … | … |

Here, we have a list of **pairs** of samples: each row represents a single person before and after a treatment, so the two values are related and dependent on each other. We can similarly test whether the mean difference between the pairs is zero.

If the paired differences are normally distributed, use the **Paired T-Test**.

Otherwise, use (on the paired differences):

- The **Sign Test**
- The **Wilcoxon Signed-Rank Test**
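A sketch of the paired tests, assuming `scipy` (the before/after values are made up; the sign test is again done via a binomial test on the count of positive differences):

```python
# Sketch: paired before/after measurements (made-up values).
from scipy import stats

before = [1.70, 1.65, 1.76, 2.01, 1.55]
after = [1.66, 1.67, 1.77, 1.99, 1.99]
diffs = [a - b for a, b in zip(after, before)]

# Paired T-Test: assumes the differences are normally distributed.
t_stat, p_paired = stats.ttest_rel(after, before)

# Sign Test on the differences: count the positive ones.
n_pos = sum(d > 0 for d in diffs)
p_sign = stats.binomtest(n_pos, n=len(diffs), p=0.5).pvalue

# Wilcoxon Signed-Rank Test on the differences.
w_stat, p_w = stats.wilcoxon(diffs)
```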

We want to check if the means of **many** groups are **all equal to each other** (alternative hypothesis: at least one group has a different mean from some other group). We could do this by running many pair-wise T-tests, but this increases the **family-wise error rate** as we do more and more tests. This means we have a higher and higher chance of at least one false positive as we run more separate tests.

Can use either of:

- **Bartlett’s Test**
- **Levene’s Test**
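Both are one-liners in `scipy` (assumed here; the three groups are made-up values):

```python
# Sketch: checking equal variances across three made-up groups.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [4.8, 5.5, 5.1, 4.7, 5.4]
group_c = [5.6, 5.2, 5.8, 5.3, 5.5]

# Bartlett's test: more powerful, but sensitive to non-normality.
b_stat, p_bartlett = stats.bartlett(group_a, group_b, group_c)

# Levene's test: more robust when the data is not normal.
l_stat, p_levene = stats.levene(group_a, group_b, group_c)
```

In both cases, a small p-value is evidence *against* equal variances.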

The concept of a **residual**: the **difference** between the **observed value** of the variable and the **predicted value**. Here, the **observed value** would be the data itself, and the **predicted value** would be the value fitted by the model (for ANOVA, the group mean).

To check this condition:

- A normality test on the residuals (**Kolmogorov-Smirnov Test**), or
- Residual plots (**QQ plot on the residuals** OR **plot the residuals against the groups**)
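A sketch of the numerical check, assuming `scipy` (groups are made-up values; note that feeding parameters estimated from the data into the K-S test is an approximation, strictly the territory of Lilliefors’ variant):

```python
# Sketch: residuals = observation minus its group mean, then a
# normality check on the pooled residuals.
import statistics
from scipy import stats

groups = {
    "a": [5.1, 4.9, 5.3, 5.0, 5.2],
    "b": [4.8, 5.5, 5.1, 4.7, 5.4],
    "c": [5.6, 5.2, 5.8, 5.3, 5.5],
}

residuals = []
for values in groups.values():
    mean = statistics.mean(values)
    residuals.extend(v - mean for v in values)

# K-S test against a normal with mean 0 and the residuals' own sd.
sd = statistics.stdev(residuals)
ks_stat, p_ks = stats.kstest(residuals, "norm", args=(0, sd))
```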

Requirements

- Data in each group is normally distributed
- Population variance in each group is equal
- Independence of observations (depends on how the study was run)

Note: the ANOVA test is robust against non-normal data (with skew / kurtosis), but at the cost of an increased error rate.
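Once the assumptions check out, the test itself is a single call, assuming `scipy` (made-up groups):

```python
# Sketch: one-way ANOVA across three made-up groups.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [4.8, 5.5, 5.1, 4.7, 5.4]
group_c = [5.6, 5.2, 5.8, 5.3, 5.5]

f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
# Small p-value => at least one group mean differs from another.
```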

If the ANOVA test **finds differences between groups**, it cannot tell which groups had the difference. A **post-hoc** test is required to find this out.

There is a formula to calculate the **Least Significant Difference value** between two groups i and j. We then check if the **absolute difference in means between groups i and j exceeds the LSD value**. If so, those two groups are different.
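A sketch of that check, assuming the standard formula `LSD = t(1 - alpha/2, N - k) * sqrt(MSE * (1/n_i + 1/n_j))` and `scipy` for the t quantile (made-up groups):

```python
# Sketch: Fisher's Least Significant Difference for groups i and j.
import math
import statistics
from scipy import stats

groups = [
    [5.1, 4.9, 5.3, 5.0, 5.2],
    [4.8, 5.5, 5.1, 4.7, 5.4],
    [5.6, 5.2, 5.8, 5.3, 5.5],
]
k = len(groups)
N = sum(len(g) for g in groups)
means = [statistics.mean(g) for g in groups]

# MSE: within-group sum of squares over its degrees of freedom (N - k).
sse = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
mse = sse / (N - k)

alpha = 0.05
i, j = 0, 2
t_crit = stats.t.ppf(1 - alpha / 2, N - k)
lsd = t_crit * math.sqrt(mse * (1 / len(groups[i]) + 1 / len(groups[j])))

# Groups i and j differ if |mean_i - mean_j| exceeds the LSD value.
differ = abs(means[i] - means[j]) > lsd
```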

Other tests of the same nature

- Duncan’s multiple-range test
- Student-Newman-Keul’s multiple-range test
- Scheffe’s multiple-comparison procedure

Similarly to the LSD, we want to check which groups are different from each other (given that the ANOVA has shown that at least one pair is different). **Contrasts** allow us to test specific comparisons between combinations of group means.

A **contrast** is a **linear combination of the group means** such that **all the coefficients add up to 0**. This is potentially able to check questions like **is (mean of X) different from (mean of Y AND Z together)?** Examples:

- `0*mu_x - 1*mu_y + 1*mu_z` (check if Y and Z are different)
- `2*mu_x - 1*mu_y - 1*mu_z` (check if X is different from Y AND Z together)
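A sketch of testing the second contrast, assuming made-up data and the standard result `SE(L) = sqrt(MSE * sum(c_i^2 / n_i))` with `N - k` degrees of freedom for the t statistic:

```python
# Sketch: testing the contrast L = 2*mu_x - 1*mu_y - 1*mu_z.
import math
import statistics
from scipy import stats

groups = {
    "x": [5.1, 4.9, 5.3, 5.0, 5.2],
    "y": [4.8, 5.5, 5.1, 4.7, 5.4],
    "z": [5.6, 5.2, 5.8, 5.3, 5.5],
}
coeffs = [2, -1, -1]
assert sum(coeffs) == 0  # must hold for a valid contrast

samples = list(groups.values())
k = len(samples)
N = sum(len(g) for g in samples)
means = [statistics.mean(g) for g in samples]

# MSE from the ANOVA fit, as before.
sse = sum(sum((x - m) ** 2 for x in g) for g, m in zip(samples, means))
mse = sse / (N - k)

# Estimated contrast, its standard error, and a two-sided t-test.
L = sum(c * m for c, m in zip(coeffs, means))
se = math.sqrt(mse * sum(c * c / len(g) for c, g in zip(coeffs, samples)))
t_stat = L / se
p_value = 2 * stats.t.sf(abs(t_stat), N - k)
```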

Use a **non-parametric** one-way test:

- The **Kruskal-Wallis Test**
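Again a one-liner, assuming `scipy` (made-up groups):

```python
# Sketch: Kruskal-Wallis test across three made-up groups
# (rank-based, so no normality assumption).
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
group_b = [4.8, 5.5, 5.1, 4.7, 5.4]
group_c = [5.6, 5.2, 5.8, 5.3, 5.5]

h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)
```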

(TODO): Sum of squares, MSE, DF…