Computer Aided Data Analysis: A Conceptual Overview

1. Motivation

This is a conceptual overview of the ideas covered in NUS’s ST2137: Computer Aided Data Analysis. It is meant to give me a standard starting point when I next encounter a non-trivial data set that requires analysis.

2. Checking One Group (and intro to Hypothesis Testing)

We have some data, and we want to test something about a particular property of the data sample. Usually, this is the mean value of something in the sample.

2.1. One-sample tests

These data samples are continuously distributed and have a single dimension. Think:

Height (m)

Usually, we want to test one of three things about the mean value of the data: whether it differs from, is less than, or is greater than some hypothesised value.

Before deciding on a test, we need to have information about the underlying distribution of the values.

2.1.1. Values are normally distributed

If the population that this sample is drawn from follows a normal distribution, we can use a one-sample t-test.
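
A minimal sketch of the one-sample t statistic, assuming the test in question is the usual one-sample t-test. The heights and the hypothesised mean `mu0` are illustrative values, and `one_sample_t` is a name made up for this example:

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)            # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (sd / math.sqrt(n))   # distance from mu0 in standard errors
    return t, n - 1

heights = [1.70, 1.65, 1.76, 2.01, 1.55]     # illustrative heights in metres
t_stat, df = one_sample_t(heights, mu0=1.70)  # mu0 is a hypothetical value
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The p-value then comes from the t distribution with `df` degrees of freedom, either from a table or from a statistics package.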

2.1.2. Values are not normally distributed / unknown distribution

We don’t know or assume anything about the underlying distribution. We can then use a non-parametric test, such as the sign test or the Wilcoxon signed-rank test.
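
One distribution-free option is the sign test, which tests the median rather than the mean. A minimal sketch, assuming a two-sided test; the data, the hypothesised median `m0`, and the function name are made up for illustration:

```python
from math import comb

def sign_test(sample, m0):
    """Exact two-sided sign test of H0: population median == m0.

    Returns (number of positive signs, number of usable values, p-value).
    """
    diffs = [x - m0 for x in sample if x != m0]   # values equal to m0 carry no sign
    n = len(diffs)
    n_pos = sum(d > 0 for d in diffs)
    k = min(n_pos, n - n_pos)
    # Under H0 the number of positive signs is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, n, min(1.0, 2 * tail)

heights = [1.70, 1.65, 1.76, 2.01, 1.55]          # illustrative heights in metres
n_pos, n, p_value = sign_test(heights, m0=1.60)    # m0 is a hypothetical value
print(n_pos, n, p_value)
```

The sign test only uses the direction of each deviation, so it needs no distributional assumption but has less power than a t-test when the data really are normal.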

3. Comparing Two Groups (Two-sample tests)

Instead of a single sample, we have two samples that may or may not be dependent on each other.

3.1. Groups are independent of each other


Control group: Lifetime change in Height without Drug(TM) | Treatment Group: Lifetime change in Height with Drug(TM)
1.7  | 1.66
1.65 | 1.67
1.76 | 1.77
2.01 | 1.99
1.55 | 1.99

Here, the control group and the treatment group are independent of each other (as long as the experiment is controlled correctly).

We want to check if the mean values of both groups are equal.

3.1.1. Values are normally distributed
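
A common choice here is the two-sample t-test. The sketch below uses Welch's variant, which does not assume equal variances; that choice is an assumption of this example (the pooled-variance version is the other common form). The data are the two columns from the table above, and `welch_t` is a made-up name:

```python
import math
import statistics

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for H0: mean(a) == mean(b)."""
    va = statistics.variance(a) / len(a)     # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)     # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

control = [1.70, 1.65, 1.76, 2.01, 1.55]
treatment = [1.66, 1.67, 1.77, 1.99, 1.99]
t_stat, df = welch_t(control, treatment)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```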


3.1.2. Values are not normally distributed / unknown distribution / ordered categories
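
A rank-based test is the usual fallback here. The sketch below is an exact Wilcoxon rank-sum (Mann-Whitney) test that enumerates every possible split of the pooled ranks, which is only practical for small samples. Choosing this particular test is an assumption, and the function names are made up:

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                 # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Exact two-sided rank-sum test. Returns (rank sum of a, p-value)."""
    ranks = midranks(list(a) + list(b))
    n_a, n = len(a), len(a) + len(b)
    observed = sum(ranks[:n_a])
    expected = n_a * (n + 1) / 2              # mean rank sum of group a under H0
    # Under H0 every choice of n_a ranks for group a is equally likely.
    sums = [sum(c) for c in combinations(ranks, n_a)]
    extreme = sum(abs(s - expected) >= abs(observed - expected) - 1e-12 for s in sums)
    return observed, extreme / len(sums)

control = [1.70, 1.65, 1.76, 2.01, 1.55]
treatment = [1.66, 1.67, 1.77, 1.99, 1.99]
rank_sum, p_value = rank_sum_test(control, treatment)
print(rank_sum, p_value)
```

Because it only uses ranks, the same approach works for ordered categories as long as they can be sorted.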


3.2. Groups are dependent on each other

Height before treatment | Height after treatment
1.7  | 1.66
1.65 | 1.67
1.76 | 1.77
2.01 | 1.99
1.55 | 1.99

Here, we have a list of pairs of samples: each row represents a single person before and after a treatment, so the two values are related and dependent on each other. We can similarly test whether the mean of the within-pair differences is zero.

3.2.1. Values are normally distributed
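
For paired data the usual approach is a paired t-test, which is a one-sample t-test on the within-pair differences. A minimal sketch using the before/after table above; `paired_t` is a made-up name:

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for H0: mean difference == 0."""
    diffs = [a - b for b, a in zip(before, after)]   # after minus before, per person
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

before = [1.70, 1.65, 1.76, 2.01, 1.55]
after = [1.66, 1.67, 1.77, 1.99, 1.99]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

Working on the differences removes the person-to-person variation, which is why pairing matters when choosing the test.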


3.2.2. Values are not normally distributed / unknown distribution
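
A common non-parametric counterpart for paired data is the Wilcoxon signed-rank test. The sketch below computes an exact two-sided p-value by enumerating all sign patterns, so it only suits small samples. Picking this test is an assumption, and the function name is made up:

```python
from itertools import product

def signed_rank_test(before, after):
    """Exact two-sided Wilcoxon signed-rank test. Returns (W+, p-value)."""
    # Rounding guards against floating-point noise making tied differences look distinct.
    diffs = [round(a - b, 10) for b, a in zip(before, after)]
    diffs = [d for d in diffs if d != 0]              # zero differences are dropped
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(v):
        # Midrank: tied absolute differences share the average of their positions.
        positions = [i + 1 for i, x in enumerate(abs_sorted) if x == v]
        return sum(positions) / len(positions)

    ranks = [rank(abs(d)) for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    # Under H0 each difference is equally likely to be positive or negative.
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= abs(w_plus - expected) - 1e-12:
            count += 1
    return w_plus, count / 2 ** len(ranks)

before = [1.70, 1.65, 1.76, 2.01, 1.55]
after = [1.66, 1.67, 1.77, 1.99, 1.99]
w_plus, p_value = signed_rank_test(before, after)
print(w_plus, p_value)
```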


4. Comparing Multiple Groups (> 2)

We want to check if the means of many groups are all equal to each other (alternative hypothesis: at least one group has a mean that differs from some other group's). We could do this by running many pairwise t-tests, but the family-wise error rate increases as we do more and more tests. In other words, the chance of getting at least one false positive keeps growing as we run more separate tests.
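
To see how quickly the family-wise error rate grows, here is a small sketch. It assumes the pairwise tests are independent, which is only an approximation since they share data; the function name is made up:

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive)."""
    n_tests = comb(n_groups, 2)               # one test per pair of groups
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (2, 3, 5, 10):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups: {n_tests} tests, family-wise error rate ~ {fwer:.3f}")
```

With 5 groups there are already 10 pairwise tests, and the chance of at least one false positive at a per-test level of 0.05 is about 0.40.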

4.1. Pre-requisite for ANOVA tests: do all groups have equal variances?

Common checks for this include Levene's test and Bartlett's test.
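
As an illustration, here is a sketch of the Levene statistic in its median-centred (Brown-Forsythe) form. Choosing this particular test and centring is an assumption, and the three groups are made-up data:

```python
import statistics

def levene_statistic(groups, center=statistics.median):
    """Return (W, df1, df2) for H0: all groups have equal variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group's centre.
    z = [[abs(x - center(g)) for x in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    w = (n_total - k) / (k - 1) * between / within
    return w, k - 1, n_total - k

groups = [                                    # hypothetical measurements
    [1.70, 1.65, 1.76, 1.80, 1.55],
    [1.66, 1.67, 1.77, 1.99, 1.99],
    [1.50, 1.72, 1.68, 1.81, 1.90],
]
w, df1, df2 = levene_statistic(groups)
print(f"W = {w:.3f} on ({df1}, {df2}) degrees of freedom")
```

W is compared against an F distribution with (df1, df2) degrees of freedom; a large W suggests the variances differ.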

4.2. Pre-requisite for ANOVA tests: are all errors independent and normally distributed?

The concept of a residual: the difference between the observed value of the variable and the value predicted by the model. In one-way ANOVA, the observed value is the data point itself and the predicted value is the mean of the group it belongs to. It is these residuals that should be independent and normally distributed.

To check this condition:
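
One informal check is to compute the residuals and compare their sorted values against normal quantiles, which is the idea behind a Q-Q plot. A minimal sketch, taking the predicted value of each observation to be its group mean (the usual one-way ANOVA model); the data and function names are made up:

```python
import statistics

def anova_residuals(groups):
    """Residual = observation minus the mean of its own group."""
    return [x - statistics.mean(g) for g in groups for x in g]

def normal_qq_points(residuals):
    """Pair each sorted residual with the normal quantile expected at its rank."""
    n = len(residuals)
    fitted = statistics.NormalDist(0, statistics.stdev(residuals))
    # (i + 0.5) / n is one common plotting position; other conventions exist.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))

groups = [                                    # hypothetical measurements
    [1.70, 1.65, 1.76, 1.80, 1.55],
    [1.66, 1.67, 1.77, 1.99, 1.99],
    [1.50, 1.72, 1.68, 1.81, 1.90],
]
for expected, observed in normal_qq_points(anova_residuals(groups)):
    print(f"{expected:+.3f}  {observed:+.3f}")
```

If the residuals are roughly normal, the two columns should track each other closely; large systematic gaps at the ends point to skew or heavy tails.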

4.3. One-Way ANOVA: Equal Variances over data, Normal Distribution of errors
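
A minimal sketch of the one-way ANOVA computation, showing how the sums of squares, mean squares, and degrees of freedom combine into the F statistic. The data and the function name are made up for illustration:

```python
import statistics

def one_way_anova(groups):
    """Return the F statistic, its degrees of freedom, and the MSE."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (error) sum of squares: spread of observations around their group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    mse = ss_within / (n_total - k)
    return {"F": ms_between / mse, "df_between": k - 1,
            "df_within": n_total - k, "MSE": mse}

groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]    # hypothetical measurements
result = one_way_anova(groups)
print(result)
```

F is compared against an F distribution with (df_between, df_within) degrees of freedom; a large F means the group means vary more than the within-group noise would explain.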


Note: the ANOVA test is fairly robust against non-normal data (with skew / kurtosis), but the error rate increases.

4.3.1. After running ANOVA: Least Significant Difference Test (Post-Hoc)

If the ANOVA test finds a difference between groups, it cannot tell which groups differ. A post-hoc test is required to find this out.

There is a formula to calculate the Least Significant Difference value between two groups i and j. We then check if the absolute difference in means between groups i and j exceeds the LSD value. If so, those two groups are different.
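
The usual form of that formula is LSD = t_crit * sqrt(MSE * (1/n_i + 1/n_j)), where MSE is the error mean square from the ANOVA and t_crit is the two-sided critical t value on the error degrees of freedom. A minimal sketch, with the critical value passed in from a table; the numbers and function names are made up:

```python
import math

def least_significant_difference(mse, n_i, n_j, t_crit):
    """LSD between groups i and j, given the ANOVA's MSE and a critical t value."""
    return t_crit * math.sqrt(mse * (1 / n_i + 1 / n_j))

def lsd_differs(mean_i, mean_j, mse, n_i, n_j, t_crit):
    """True if the two group means differ by more than the LSD."""
    return abs(mean_i - mean_j) > least_significant_difference(mse, n_i, n_j, t_crit)

# Hypothetical ANOVA output: MSE = 1.0 on 6 error degrees of freedom, 3 observations per group.
T_CRIT = 2.447                                # two-sided 5% critical t value for 6 df, from a table
lsd = least_significant_difference(mse=1.0, n_i=3, n_j=3, t_crit=T_CRIT)
print(f"LSD = {lsd:.3f}")
print(lsd_differs(2.0, 7.0, 1.0, 3, 3, T_CRIT))
print(lsd_differs(2.0, 3.0, 1.0, 3, 3, T_CRIT))
```

Note that LSD applies no correction for running several comparisons, so it is the least conservative of the common post-hoc tests.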

Other tests of the same nature

4.3.2. After running ANOVA: Contrast Analysis

Similarly to the LSD, we want to check which groups are different from each other (given that the ANOVA has shown that at least one pair is different from each other). Contrasts allow us to test specific comparisons among the group means.

A contrast is a linear combination of the group means such that all the coefficients add up to 0. This makes it possible to check questions like "Is (mean of X) different from (mean of Y AND Z together)?". For example, the coefficients (1, -1/2, -1/2) compare the mean of X against the average of the means of Y and Z.
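
A minimal sketch of testing one contrast, assuming the usual t statistic built from the ANOVA's MSE. The group means, sizes, and function name are made up for illustration:

```python
import math

def contrast_t(means, sizes, coeffs, mse):
    """t statistic for H0: sum(c_i * mu_i) == 0, using the ANOVA's MSE."""
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to 0")
    estimate = sum(c * m for c, m in zip(coeffs, means))
    std_err = math.sqrt(mse * sum(c ** 2 / n for c, n in zip(coeffs, sizes)))
    return estimate / std_err

# Hypothetical groups X, Y, Z with 3 observations each and MSE = 1.0.
means = [7.0, 2.0, 3.0]
sizes = [3, 3, 3]
x_vs_yz = [1.0, -0.5, -0.5]                   # X against the average of Y and Z
t_stat = contrast_t(means, sizes, x_vs_yz, mse=1.0)
print(f"t = {t_stat:.3f}")
```

The statistic is compared against a t distribution on the ANOVA's error degrees of freedom.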

4.4. If ANOVA cannot be run (assumptions failed)

Use a non-parametric one-way test
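
The standard example of such a test is the Kruskal-Wallis test, which runs a one-way analysis on ranks instead of raw values. A minimal sketch of its H statistic, without the correction for ties; picking this test is an assumption, and the data and function name are made up:

```python
def kruskal_wallis_h(groups):
    """Return (H statistic, degrees of freedom) for H0: all groups share one distribution."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    def midrank(v):
        # Tied values share the average of the 1-based positions they occupy.
        first = pooled.index(v) + 1
        last = first + pooled.count(v) - 1
        return (first + last) / 2

    rank_sums = [sum(midrank(x) for x in g) for g in groups]
    h = (12 / (n * (n + 1))
         * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    return h, len(groups) - 1

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]    # hypothetical measurements
h, df = kruskal_wallis_h(groups)
print(f"H = {h:.2f} on {df} degrees of freedom")
```

H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom; for very small samples that approximation is rough.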

4.5. Concepts behind the ANOVA model

(TODO): Sum of squares, MSE, DF…

5. Displaying Data
