This is a conceptual overview of the ideas covered in NUS’s ST2137: Computer Aided Data Analysis. It is meant to give me a standard starting point when I next encounter a non-trivial data set that requires analysis.
We have some data, and we want to test something about a particular property of the data sample. Usually, this is the mean value of something in the sample.
These data samples are continuously distributed and have a single dimension. Think:
Usually, we want to test one of three things about the mean value of the data.
- Is (mu = mu0)? (two-sided)
- Is (mu > mu0)? (one-sided)
- Is (mu < mu0)? (one-sided)
Before deciding on a test, we need to have information about the underlying distribution of the values.
If the population that this sample is extracted from follows a normal distribution, we can use a one-sample t-test.
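As a sketch (in Python with scipy, which the original notes do not specify), the three questions above map directly onto the `alternative` argument of `ttest_1samp`; the sample here is made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.2, scale=1.0, size=30)  # hypothetical normal sample
mu0 = 5.0

# Two-sided: H1 is mu != mu0
t_two, p_two = stats.ttest_1samp(x, mu0, alternative="two-sided")
# One-sided: H1 is mu > mu0
t_gt, p_gt = stats.ttest_1samp(x, mu0, alternative="greater")
# One-sided: H1 is mu < mu0
t_lt, p_lt = stats.ttest_1samp(x, mu0, alternative="less")
```

Note that the two one-sided p-values always sum to 1, and the two-sided p-value is twice the smaller of them.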
We don’t know or assume anything about the underlying distribution. We can then use a distribution-free (non-parametric) test, such as the Wilcoxon signed-rank test.
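A minimal sketch of the non-parametric route, assuming the Wilcoxon signed-rank test (applied to the differences from mu0) is the chosen test; the data is made-up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5.5, scale=1.0, size=30)  # in practice, distribution unknown
mu0 = 5.0

# Wilcoxon signed-rank: tests whether the differences x - mu0 are
# symmetric about zero -- a distribution-free analogue of the t-test.
w_stat, p_val = stats.wilcoxon(x - mu0, alternative="two-sided")
```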
Instead of a single sample, we have two samples that may or may not be dependent on each other.
| Control group: lifetime change in height without Drug(TM) | Treatment group: lifetime change in height with Drug(TM) |
| --- | --- |
Here, the control group and the treatment group are independent of each other (as long as the experiment is controlled correctly).
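A sketch of the two independent-sample comparison (scipy and the made-up group data are my assumptions, not from the notes); Welch's variant (`equal_var=False`) additionally avoids assuming the two groups share a variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=0.0, scale=2.0, size=40)    # without Drug(TM)
treatment = rng.normal(loc=1.5, scale=2.0, size=35)  # with Drug(TM)

# Two-sample t-test for independent groups; H0: both means are equal.
t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=False)
```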
We want to check if the mean values of both groups are equal.
| Height before treatment | Height after treatment |
| --- | --- |
Here, we have a list of pairs of samples: each row represents a single person before and after a treatment, so the two values are related and dependent on each other. We can similarly test the mean values between the pairs.
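A sketch of the paired case (hypothetical data): the paired t-test works on the per-person differences, which is exactly a one-sample t-test on those differences against zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(loc=170.0, scale=8.0, size=25)        # height before
after = before + rng.normal(loc=1.0, scale=0.5, size=25)  # same people, after

# Paired t-test: accounts for the dependence between the two columns.
t_stat, p_val = stats.ttest_rel(after, before)

# Equivalent formulation: one-sample t-test on the differences vs 0.
t_diff, p_diff = stats.ttest_1samp(after - before, 0.0)
```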
We want to check if the means of many groups are all equal to each other (alternative hypothesis: at least one group has a mean different from some other group). We could do this by running many pair-wise t-tests, but this inflates the family-wise error rate as we do more and more tests: the more separate tests we run, the higher the chance that at least one of them gives a false positive.
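The inflation is easy to see numerically: for m independent tests each at level alpha, the probability of at least one false positive is 1 - (1 - alpha)^m. A quick worked calculation:

```python
# Family-wise error rate for m independent tests at level alpha:
#   P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 3, 10, 45):  # 45 = number of pairwise tests among 10 groups
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> FWER = {fwer:.3f}")
```

Even at 10 tests the family-wise error rate is already around 40%, far above the nominal 5%.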
Can use either of
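One of the standard options here is one-way ANOVA; a minimal sketch with made-up groups (scipy is my assumption):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Three hypothetical groups; H0: all group means are equal.
g1 = rng.normal(10.0, 2.0, size=20)
g2 = rng.normal(10.0, 2.0, size=20)
g3 = rng.normal(12.0, 2.0, size=20)

# One-way ANOVA: a single F-test across all groups at once,
# avoiding the family-wise error problem of many pairwise t-tests.
f_stat, p_val = stats.f_oneway(g1, g2, g3)
```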
The concept of a residual: the difference between the observed value of the variable and the predicted value. Here, the observed values are the data themselves, and the predicted values are the group means; the residuals should follow a normal distribution.
To check this condition:
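A sketch of one way to check normality of the residuals (the Shapiro-Wilk test and the skew/kurtosis summaries are my choices; the data is made-up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
groups = [rng.normal(m, 1.5, size=15) for m in (5.0, 6.0, 7.0)]

# Residuals in a one-way layout: each observation minus its group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk tests H0: the residuals come from a normal distribution.
w_stat, p_val = stats.shapiro(residuals)
# Skewness and excess kurtosis near 0 also support normality.
skew = stats.skew(residuals)
kurt = stats.kurtosis(residuals)
```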
Note: the ANOVA test is robust against non-normal data (with skew / kurtosis), but at the cost of an increased error rate.
If the ANOVA test finds differences between groups, it cannot tell *which* groups differ. A post-hoc test is required to find this out.
There is a formula to calculate the Least Significant Difference value between two groups i and j. We then check if the absolute difference in means between groups i and j exceeds the LSD value. If so, those two groups are different.
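A sketch of the LSD procedure, assuming the usual form LSD = t(alpha/2, N-k) * sqrt(MSE * (1/n_i + 1/n_j)) with the pooled within-group MSE (the data is made-up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
groups = [rng.normal(m, 2.0, size=12) for m in (10.0, 10.5, 14.0)]
k = len(groups)
N = sum(len(g) for g in groups)

# Pooled within-group mean square error (MSE), with N - k degrees of freedom.
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
mse = sse / (N - k)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=N - k)

# For each pair (i, j): declare the groups different when the absolute
# difference in sample means exceeds the LSD value.
for i in range(k):
    for j in range(i + 1, k):
        n_i, n_j = len(groups[i]), len(groups[j])
        lsd = t_crit * np.sqrt(mse * (1 / n_i + 1 / n_j))
        diff = abs(groups[i].mean() - groups[j].mean())
        print(f"groups {i},{j}: |diff|={diff:.2f}, LSD={lsd:.2f}, "
              f"different={diff > lsd}")
```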
Other tests of the same nature
Similarly to the LSD, we want to check which groups are different from each other (given that the ANOVA has shown that at least one pair is different from each other). Contrasts allow us to test specific, pre-planned comparisons between combinations of group means.
A contrast is a linear combination of the group means such that all the coefficients add up to 0. This can check questions like “Is (mean of X) different from (mean of Y and Z together)?” Examples:
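A sketch of testing one such contrast by hand: c = [1, -0.5, -0.5] compares group X against the average of Y and Z. The t-statistic formula (estimate over its standard error, using the pooled MSE) is the standard construction, but the data and the exact setup here are my assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(12.0, 2.0, size=15)
y = rng.normal(10.0, 2.0, size=15)
z = rng.normal(10.0, 2.0, size=15)
groups = [x, y, z]

# Contrast: is mean(X) different from the average of mean(Y) and mean(Z)?
c = np.array([1.0, -0.5, -0.5])
assert abs(c.sum()) < 1e-12  # coefficients of a contrast sum to 0

means = np.array([g.mean() for g in groups])
ns = np.array([len(g) for g in groups])
N, k = ns.sum(), len(groups)

# Pooled within-group MSE, df = N - k (as in the ANOVA).
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

estimate = c @ means                       # value of the contrast
se = np.sqrt(mse * np.sum(c ** 2 / ns))    # its standard error
t_stat = estimate / se
p_val = 2 * stats.t.sf(abs(t_stat), df=N - k)
```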
Use a non-parametric one-way test
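The usual non-parametric analogue of one-way ANOVA is the Kruskal-Wallis test (my assumption as to which test the notes intend); it works on ranks, so no normality assumption is needed. A sketch with made-up skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
# Skewed (non-normal) groups, where ANOVA's normality assumption fails.
g1 = rng.exponential(scale=1.0, size=20)
g2 = rng.exponential(scale=1.0, size=20)
g3 = rng.exponential(scale=2.5, size=20)

# Kruskal-Wallis: rank-based test of H0 that all groups share a median.
h_stat, p_val = stats.kruskal(g1, g2, g3)
```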
(TODO): Sum of squares, MSE, DF…