
by Dr PHUA Kai Lit

School of Medicine and Health Sciences

Monash University Malaysia

Bandar Sunway, Malaysia

Companion book: "Statistics Made Simple for Healthcare and Social Science Professionals and Students"

by Wong Kam Cheong and Phua Kai Lit (Serdang: UPM Press 2006)

**BASIC CRITERIA**:

(1) At what level are your data measured?

Nominal e.g. ethnicity, gender, disease is present or absent

Ordinal e.g. low, medium or high level of anxiety

Interval e.g. quantitative data that can be summarised by a mean

(2) How many samples are there? Are they paired/matched?

One sample

Two independent samples (unpaired/unmatched)

Two samples (paired/matched)

More than two samples (not matched/matched)

(3) Should you use a "parametric test" or a "non-parametric test"?

Parametric tests make assumptions about the underlying populations from which the samples are drawn e.g. that the population is normally distributed (also called a Gaussian distribution) or that the populations have approximately equal variances. Non-parametric tests do not make these assumptions.

(1), (2) and (3) are used to choose an appropriate statistical test


**GUIDE TO CHOOSING A STAT TEST (BASED ON WHAT'S AVAILABLE ON "SOS" SOFTWARE)**

Nominal data, single sample --> chi-square goodness of fit test (to compare an observed sample distribution with a hypothesised distribution)

Nominal data, 2 unpaired samples --> chi-square test of independence/association; 2 sample z-test of proportions

Nominal data, 2 paired samples --> McNemar's test is the usual choice (no test is listed here for SOS)

Nominal data, more than 2 samples --> chi-square test of independence/association

Ordinal data: use the tests for nominal data

Interval data, single sample --> 1 sample t-test of means; 1 sample z-test of means (to compare a sample mean with a hypothetical population mean)

Interval data, 2 unpaired samples --> 2 sample t-test for difference in means; Wilcoxon rank sum test; 2 sample z-test of means

Interval data, 2 paired samples --> 2 sample t-test for paired differences; Wilcoxon signed ranks test

Interval data, more than 2 samples --> Analysis of Variance or ANOVA (not available on SOS)
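The guide above can be read as a simple lookup table. The sketch below is only an illustration of that idea: the function name `suggest_tests` and the short labels for the sample designs are hypothetical, and the table simply transcribes the tests listed in this guide.

```python
# Lookup table transcribed from the guide above.
# The keys ("one", "two_unpaired", ...) are hypothetical labels chosen for this sketch.
GUIDE = {
    ("nominal", "one"): ["chi-square goodness of fit test"],
    ("nominal", "two_unpaired"): [
        "chi-square test of independence",
        "2 sample z-test of proportions",
    ],
    ("nominal", "more_than_two"): ["chi-square test of independence"],
    ("interval", "one"): ["1 sample t-test of means", "1 sample z-test of means"],
    ("interval", "two_unpaired"): [
        "2 sample t-test for difference in means",
        "Wilcoxon rank sum test",
        "2 sample z-test of means",
    ],
    ("interval", "two_paired"): [
        "2 sample t-test for paired differences",
        "Wilcoxon signed ranks test",
    ],
    ("interval", "more_than_two"): ["Analysis of Variance (ANOVA)"],
}


def suggest_tests(level, design):
    """Return the tests the guide lists for a measurement level and sample design."""
    if level == "ordinal":
        # The guide says to use the tests for nominal data with ordinal data.
        level = "nominal"
    # Combinations not covered by the table return an empty list.
    return GUIDE.get((level, design), [])


print(suggest_tests("interval", "two_paired"))
print(suggest_tests("ordinal", "one"))
```

A combination that is not in the table returns an empty list rather than a guess.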

**MAKE SURE ASSUMPTIONS OF TESTS ARE NOT VIOLATED!**

Assumptions of chi-square test:

Nominal data (ordinal data is also OK)

Preferably n is between 25 and 250 (if n>250, better to use z-test of proportions)

"Expected value" of each cell is at least 5

Assumptions of t-test:

Interval data

n < 30 for each sample (if n is large, use z-test of means)

Normal distribution (if distribution is not normal, use Wilcoxon test)
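As a rough illustration of the sample-size rule above, the sketch below computes a test statistic for two unpaired samples. It assumes the usual textbook formulas, which these notes do not spell out: a pooled-variance t statistic (equal population variances) when both samples have n < 30, and a z statistic otherwise. The function name `two_sample_statistic` is hypothetical.

```python
import math
import statistics


def two_sample_statistic(sample1, sample2):
    """Return (kind, statistic, degrees_of_freedom) for two unpaired samples."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

    if n1 < 30 and n2 < 30:
        # Small samples: pooled-variance t statistic (assumes equal variances).
        df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
        se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
        return "t", (mean1 - mean2) / se, df

    # Larger samples: z statistic using each sample's own variance.
    se = math.sqrt(var1 / n1 + var2 / n2)
    return "z", (mean1 - mean2) / se, None


kind, value, df = two_sample_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(kind, round(value, 3), df)
```

The statistic still has to be converted into a p-value (from t or z tables, or software) before it can be interpreted.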

**NULL HYPOTHESIS (H_{0}) AND RESEARCH/ALTERNATIVE HYPOTHESIS (H_{1})**

Chi-square test of independence

Null Hypothesis is "No association between X and Y. Any association seen is due to chance"

Research Hypothesis is "Statistically significant association between X and Y"
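To make the chi-square test of independence concrete, here is a minimal sketch that computes the expected value of each cell under the Null Hypothesis of no association, the chi-square statistic, and a check of the sample-size and expected-value assumptions listed earlier. The function names and the 2 x 2 table are made up for illustration.

```python
def expected_counts(table):
    """Expected cell counts under the Null Hypothesis of no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def check_assumptions(table):
    """Check the sample-size and expected-value guidelines from these notes."""
    n = sum(sum(row) for row in table)
    expected = expected_counts(table)
    return {
        "n_between_25_and_250": 25 <= n <= 250,
        "all_expected_at_least_5": all(e >= 5 for row in expected for e in row),
    }


# Hypothetical 2 x 2 table: rows = exposed / not exposed, columns = disease / no disease.
observed = [[20, 30], [30, 20]]
print(expected_counts(observed))
print(chi_square_statistic(observed))
print(check_assumptions(observed))
```

The statistic is then compared with the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom to obtain a p-value.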

t-test of means & z-test of means

Null Hypothesis is "No difference between the two population means. Any difference seen is due to chance"

Research Hypothesis is "Statistically significant difference between the two population means"

(STRICTLY, the Null Hypothesis should be "No difference between the means of the respective populations from which the two samples are drawn. Any difference seen is due to chance"

STRICTLY, the Research Hypothesis should be "Statistically significant difference between the respective population means from which the two samples are drawn")

**INTERPRETING OUTPUT OF STATS TESTS**

Look at the **p-value**:

If p < 0.05, the result is statistically significant (if the Null Hypothesis were true, the prob of getting a result at least as extreme as yours by chance would be less than 5%). Therefore, reject the Null Hypothesis and accept the research hypothesis

If p < 0.01, the result is highly significant (if the Null Hypothesis were true, the prob of getting a result at least as extreme as yours by chance would be less than 1%). Therefore, reject the Null Hypothesis and accept the research hypothesis

**Thus, the p-value is the probability of obtaining, purely by chance, a relationship (e.g., between variables) or a difference (e.g., between means) at least as large as the one observed in the sample, if no such relationship or difference actually exists in the population from which the sample was drawn.**
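The decision rule can be sketched in a few lines. This example assumes a two-sided z-test, so the p-value comes from the standard normal distribution; the function names are hypothetical and the z values are made up.

```python
from statistics import NormalDist


def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def interpret(p):
    """Apply the 0.05 and 0.01 cut-offs used in these notes."""
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "statistically significant"
    return "not statistically significant"


for z in (1.0, 2.2, 3.0):
    p = two_sided_p_from_z(z)
    print(f"z = {z}: p = {p:.4f} -> {interpret(p)}")
```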

**Confidence Intervals**

Confidence Intervals are usually set at 95% or 99%.

A 95% Confidence Interval means (loosely speaking) that the prob is 0.95 that the true value falls in that interval. For example, a sample mean can be used to estimate an actual population mean:

The 95% Confidence Interval of the population mean is the "sample mean plus or minus X"

Loosely speaking, the prob is 0.95 that the true value of the population mean falls within the 95% Confidence Interval
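As an illustration of "sample mean plus or minus X", the sketch below takes X to be the normal critical value (about 1.96 for 95%) times the standard error of the mean. This is the large-sample formula and is an assumption here; for small samples a t critical value would normally replace it. The function name and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Large-sample Confidence Interval for a population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%, 2.58 for 99%
    margin = z * sample_sd / math.sqrt(n)      # this is the "X" in "mean plus or minus X"
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: mean 50, standard deviation 10, n = 100.
low, high = mean_confidence_interval(50, 10, 100)
print(f"95% CI: {low:.2f} to {high:.2f}")

low99, high99 = mean_confidence_interval(50, 10, 100, level=0.99)
print(f"99% CI: {low99:.2f} to {high99:.2f}")
```

The 99% interval is wider than the 95% interval because more confidence requires a larger margin.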

THIS SECTION IS FOR ADVANCED (SEMESTER 5) STUDENTS ONLY

**USING CONFIDENCE INTERVALS FOR HYPOTHESIS TESTING**

Example: Testing if two means are significantly different

We can look at the p-value from the t-test or we can use 95% confidence intervals

Null Hypothesis: population mean1 - population mean2 = zero

(Same as "no difference between the 2 population means")

Research Hypothesis: population mean1 - population mean2 is not equal to zero

(Same as "Significant difference between the 2 population means")

* Accept H_{1} if 95% Confidence Interval DOES NOT contain "zero"

Why? Because if the 95% Confidence Interval DOES NOT contain "zero", it is very likely that the true difference in means is not equal to zero. Therefore, we accept the research hypothesis H_{1}
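The rule above can be sketched as follows, assuming two large independent samples so that a normal critical value can be used for the Confidence Interval of the difference in means. The function names and the summary statistics are made up for illustration.

```python
import math
from statistics import NormalDist


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Large-sample Confidence Interval for (population mean1 - population mean2)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se


def accept_h1(ci):
    """Accept the research hypothesis only if the interval does not contain zero."""
    low, high = ci
    return not (low <= 0 <= high)


# Hypothetical summary statistics for two groups of 100 people each.
small_gap = difference_ci(52, 10, 100, 50, 10, 100)
large_gap = difference_ci(55, 10, 100, 50, 10, 100)
print(small_gap, accept_h1(small_gap))
print(large_gap, accept_h1(large_gap))
```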

**CONFIDENCE INTERVAL FOR RELATIVE RISK (PROSPECTIVE STUDIES)**

Relative Risk:

If RR = 1, no association between exposure to risk factor and subsequent development of disease

If RR >> 1, strong positive association

If RR << 1, strong negative association

e.g. Suppose the calculated RR is 2.14 for exposure to Risk Factor X and Disease Y

This means loosely that "Those exposed to Risk Factor X are 2.14 times as likely to develop Disease Y in the future as those who are not exposed"

Next step: We need to test if this RR is statistically significant

First, calculate the 95% Confidence Interval for the RR

Null Hypothesis: RR = 1 (same as "the risk of developing the Disease is the same for the exposed as compared to the unexposed")

Research Hypothesis: RR is not equal to 1 (same as "the risk of developing the Disease is NOT the same for the exposed as compared to the unexposed")

* Accept the Research Hypothesis H_{1} if the 95% Confidence Interval for the Relative Risk DOES NOT contain "1"

Why? Because if the 95% Confidence Interval DOES NOT contain "1", it is very likely that the true Relative Risk is not equal to 1. Therefore, we accept the research hypothesis H_{1}
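One common way to obtain this interval is to work on the logarithmic scale. The sketch below assumes that method (these notes do not say which method the online calculator uses), and the cohort counts are invented so that the RR comes out at about 2.14.

```python
import math
from statistics import NormalDist


def relative_risk_ci(exposed_cases, exposed_total,
                     unexposed_cases, unexposed_total, level=0.95):
    """Relative Risk with a log-scale Confidence Interval."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR).
    se_log_rr = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = math.exp(math.log(rr) - z * se_log_rr)
    high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, low, high


# Hypothetical cohort: 30 of 100 exposed and 14 of 100 unexposed develop Disease Y.
rr, low, high = relative_risk_ci(30, 100, 14, 100)
print(f"RR = {rr:.2f}, 95% CI {low:.2f} to {high:.2f}")
print("Accept H1" if not (low <= 1 <= high) else "Do not reject the Null Hypothesis")
```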

**CONFIDENCE INTERVAL FOR ODDS RATIO (AS USED IN RETROSPECTIVE, CASE-CONTROL STUDIES)**

Odds Ratio:

If OR = 1, no association between having the disease now and previous exposure to the risk factor

If OR >> 1, strong positive association

If OR << 1, strong negative association

e.g. Suppose the calculated OR for lung cancer and smoking is 2.14

This means "The odds of having smoked in the past are 2.14 times as high for lung cancer victims as for those who are not lung cancer victims"

Next step: We need to test if this OR is statistically significant

First, calculate the 95% Confidence Interval for the OR

Null Hypothesis: OR = 1 (same as "no difference in odds of being smokers when lung cancer victims are compared to non-victims")

Research Hypothesis: OR is not equal to 1 (same as "Significant difference in odds of being smokers when lung cancer victims are compared to non-victims")

* Accept the Research Hypothesis H_{1} if the 95% Confidence Interval for the Odds Ratio DOES NOT contain "1"

Why? Because if the 95% Confidence Interval DOES NOT contain "1", it is very likely that the true Odds Ratio is not equal to 1. Therefore, we accept the research hypothesis H_{1}
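A matching sketch for the Odds Ratio is given below. It assumes the common log-scale method for the Confidence Interval (these notes do not say which method the online calculator uses), and the case-control counts are invented so that the OR comes out at about 2.14.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(cases_exposed, cases_unexposed,
                  controls_exposed, controls_unexposed, level=0.95):
    """Odds Ratio with a log-scale Confidence Interval."""
    odds_ratio = (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed
    )
    # Standard error of ln(OR): square root of the sum of reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / cases_exposed + 1 / cases_unexposed
        + 1 / controls_exposed + 1 / controls_unexposed
    )
    z = NormalDist().inv_cdf(0.5 + level / 2)
    low = math.exp(math.log(odds_ratio) - z * se_log_or)
    high = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, low, high


# Hypothetical study: 60 smokers among 100 lung cancer cases,
# 70 smokers among 170 controls.
odds_ratio, low, high = odds_ratio_ci(60, 40, 70, 100)
print(f"OR = {odds_ratio:.2f}, 95% CI {low:.2f} to {high:.2f}")
print("Accept H1" if not (low <= 1 <= high) else "Do not reject the Null Hypothesis")
```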

**Confidence Interval**: The interval within which something is likely to be found

A 95% Confidence Interval for the population mean indicates (loosely speaking) that there is a 95% probability that the population mean actually lies within that particular Confidence Interval. Strictly speaking, it means that if you take 100 samples and calculate a 95% Confidence Interval from each one, about 95 of these intervals will contain the true population mean.
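The repeated-sampling reading of a 95% Confidence Interval can be checked with a small simulation. The sketch below draws many samples from a made-up normal population, builds a large-sample 95% interval from each, and counts how often the interval contains the true mean. All names and numbers are hypothetical.

```python
import math
import random
import statistics


def coverage(true_mean=100.0, true_sd=15.0, n=50, trials=1000, seed=1):
    """Fraction of simulated 95% Confidence Intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = statistics.mean(sample)
        margin = 1.96 * statistics.stdev(sample) / math.sqrt(n)
        if mean - margin <= true_mean <= mean + margin:
            hits += 1
    return hits / trials


print(f"Share of intervals containing the true mean: {coverage():.3f}")
```

The share should come out close to 0.95, although it will vary a little from run to run if the seed is changed.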

**Odds Ratio**: This is commonly used in public health research.

