STATISTICS WITHOUT TEARS

by Dr PHUA Kai Lit
School of Medicine and Health Sciences
Monash University Malaysia
Bandar Sunway, Selangor, Malaysia


INTRODUCTION TO STATISTICS

POPULATION: People you wish to study. If you want to study lung cancer among Malaysians, then the study population will be all Malaysians. If you want to study lung cancer among Malaysian women, then the population will be all Malaysian women.

SAMPLE: Small group of people selected from the population you wish to study (subset of the population). Findings from the sample are used to draw conclusions about the entire population.

SIMPLE RANDOM SAMPLING: Each member of the population has an EQUAL CHANCE of being selected into the sample.

REPRESENTATIVE SAMPLE: Composition of the sample resembles the composition of the population e.g. if 60% of all Malaysians are Malays, then in a sample of 1,000 Malaysians, there should be about 60% Malays or 600 Malays

* A random sample is likely to be a representative sample
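
As a rough illustration of the point above, the following sketch draws a simple random sample from a made-up population and checks its composition. The population size, the labels and the seed are assumptions chosen only for this example; the 60% figure is the one used in the text.

```python
import random

# Hypothetical population of 100,000 people, 60% of them labelled "Malay"
# (made-up numbers, for illustration only).
population = ["Malay"] * 60_000 + ["Other"] * 40_000

rng = random.Random(42)                  # fixed seed so the run is repeatable
sample = rng.sample(population, 1_000)   # every member has an equal chance

malay_share = sample.count("Malay") / len(sample)
print(f"Share of Malays in the sample: {malay_share:.1%}")
```

The share in the sample will usually be close to 60%, but not exactly 60%. This is why the text says a random sample is LIKELY to be representative.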

WAYS TO MEASURE CLUSTERING OF DATA ("MEASURES OF CENTRAL TENDENCY")

1. Mean (average)
2. Median e.g. median of 1,3,5,7,9 is 5. The median of 1,3,5,7,9,11 is (5 + 7)/2 = 6
3. Mode e.g. mode of 1,3,3,5,7 is 3. The mode is the most commonly occurring number.
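
The three measures can be checked with Python's built-in statistics module. This is only a small sketch using the numbers from the examples above; the variable names are chosen for this illustration.

```python
import statistics

# Mean (average) of 1,3,5,7,9
mean_value = statistics.mean([1, 3, 5, 7, 9])

# Median of an odd number of values: the middle value
median_odd = statistics.median([1, 3, 5, 7, 9])

# Median of an even number of values: average of the two middle values
median_even = statistics.median([1, 3, 5, 7, 9, 11])

# Mode: the most commonly occurring value
mode_value = statistics.mode([1, 3, 3, 5, 7])

print(mean_value, median_odd, median_even, mode_value)
```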

WAYS TO MEASURE SPREAD OF DATA ("MEASURES OF DISPERSION")

1. Range - difference between highest and lowest values
2. Standard Deviation - the higher the Standard Deviation, the more spread out the data.
3. Variance - this is simply the standard deviation squared
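
A short sketch of the three measures of spread, using the same numbers as the median example. One assumption to note: statistics.stdev and statistics.variance use the SAMPLE formulas (dividing by n - 1); the text above does not say whether the sample or the population formula is meant.

```python
import statistics

data = [1, 3, 5, 7, 9]

data_range = max(data) - min(data)      # highest value minus lowest value
sd = statistics.stdev(data)             # sample standard deviation
variance = statistics.variance(data)    # sample variance

print("Range:", data_range)
print(f"Standard deviation: {sd:.3f}")
print("Variance:", variance)
print(f"Standard deviation squared: {sd ** 2:.3f}")  # matches the variance
```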

THE STANDARD DEVIATION IS A VERY IMPORTANT MEASURE - Under a (Standardised) Normal Distribution Curve,
68.3% of the data are found within 1 standard deviation of the mean (i.e. between -1 and +1)
95.4% of the data are found within 2 standard deviations of the mean (i.e. between -2 and +2)
99.7% of the data are found within 3 standard deviations of the mean (i.e. between -3 and +3)
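
These percentages can be reproduced from the normal distribution itself. The sketch below relies on the standard identity that the share of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)); the function name share_within is made up for this illustration.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```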

LEVEL OF MEASUREMENT OF DATA

1. Nominal data: qualitative, categorical data. Example: ethnicity, gender, religion.

2. Ordinal data: Rank-ordered data. Data are grouped from low to high. But we cannot say how much lower or how much higher. Example: "low anxiety", "moderate anxiety" and "high anxiety".

3. Interval data: quantitative data. There is equal spacing between numbers e.g. the difference between 10 kg and 11 kg is the same as the difference between 35 kg and 36 kg. Example of interval data: temperature measured using the Celsius scale. (Height and weight also have equal spacing, but because they have a true zero they are, strictly speaking, ratio level data.)

4. Ratio level data: Similar to Interval Data but in addition, it has an absolute zero or meaningful zero e.g. income, temperature measured using the Kelvin scale.



The Chi-Square Test (Chi-Squared Test)

A commonly-used test for analysing NOMINAL DATA.

When we use the test, we must make sure the assumptions of the test are not violated!

Place the data in a "contingency table"
Write out the "Null Hypothesis" (also called H-nought and written as H0) and the "Research Hypothesis" (also called H-one and written as H1)
Run the chi-square test
Decide which of the 2 hypotheses you should take as your conclusion

Chi-square test: Null Hypothesis and Research Hypothesis

H0 No association between X and Y
Any association seen is due to chance

H1 There is an association between X and Y

If p-value is less than 0.05, accept the research hypothesis
If calculated chi-square exceeds the critical value, accept the research hypothesis

ASSUMPTIONS OF CHI-SQUARE TEST

1. Nominal data
2. 25 ≤ n ≤ 250
3. Random sample
4. Expected value of each cell is at least 5 (if not, you should combine some of the categories)
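
The steps of the chi-square test can be sketched by hand for a 2 x 2 contingency table. The counts below are made up for illustration only. The sketch uses the plain chi-square formula (no continuity correction), and the critical value 3.841 is the usual table value for 1 degree of freedom at the 5% level.

```python
# Hypothetical 2 x 2 contingency table (made-up counts):
#                 disease   no disease
# exposed            30         20
# not exposed        15         35
observed = [[30, 20], [15, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if H0 (no association) were true
expected = [[r * c / n for c in col_totals] for r in row_totals]
smallest_expected = min(min(row) for row in expected)

# Chi-square = sum over all cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

CRITICAL_VALUE = 3.841  # chi-square table, 1 degree of freedom, 5% level

print("n =", n, "| smallest expected count =", smallest_expected)
print(f"Calculated chi-square = {chi_square:.2f}")
print("Accept H1" if chi_square > CRITICAL_VALUE else "Keep H0")
```

With these made-up counts, n is between 25 and 250 and every expected count is at least 5, so assumptions 2 and 4 hold for the example.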



The t-test

A commonly-used test for comparing two MEANS derived from INTERVAL DATA

When we use the test, we must make sure the assumptions of the test are not violated!

Write out the "Null Hypothesis" (H0) and the "Research Hypothesis" (H1)
Run the t-test
Decide which of the 2 hypotheses should be your conclusion

t-test: Null Hypothesis and Research Hypothesis

H0 No difference between the two population means
Any difference seen is due to chance

H1 Statistically significant difference between the two population means

If p-value is less than 0.05, accept the research hypothesis
If the absolute value of the calculated t exceeds the critical value, accept the research hypothesis

ASSUMPTIONS OF T-TEST FOR TWO INDEPENDENT SAMPLES

1. Random samples
2. Interval data
3. Normal distribution in both population groups
4. Preferably n < 30 (for each sample).
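
A minimal sketch of the t-test for two independent samples, worked by hand. The two groups of measurements are made up for illustration. The sketch uses the pooled-variance form of the test, which additionally assumes that both groups have similar variances (an assumption of this example, not one listed above). The critical value 2.306 is the usual table value for 8 degrees of freedom, two-tailed, at the 5% level.

```python
import math
import statistics

# Hypothetical measurements for two independent groups (made-up data)
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.5, 4.0, 4.8, 4.6]

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance: weighted average of the two sample variances
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df

# t = difference in sample means / standard error of that difference
standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t = (mean_a - mean_b) / standard_error

CRITICAL_VALUE = 2.306  # t table, 8 degrees of freedom, two-tailed, 5% level

print(f"Calculated t = {t:.2f} with {df} degrees of freedom")
print("Accept H1" if abs(t) > CRITICAL_VALUE else "Keep H0")
```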

CHOOSING A TEST

1. What is the level of measurement? Nominal, ordinal or interval?
2. How many samples? One, two or more?
3. If two samples, are they independent or paired/matched?
4. Choose the test. Make sure the assumptions of the test are not violated

IMPORTANT: What is STATISTICALLY SIGNIFICANT may not be CLINICALLY SIGNIFICANT. "STATISTICALLY SIGNIFICANT" simply means that, if the Null Hypothesis were true, the probability of seeing a result at least as extreme as the one observed, by chance alone, is low (p < 0.05). We can therefore conclude that the result is unlikely to be due to chance alone. It does not tell us whether the effect is large enough to matter clinically.



The Odds Ratio and the 95% Confidence Interval

95% Confidence Interval

A 95% Confidence Interval for X simply means (roughly) that we are 95% sure that the true value of X lies somewhere within that interval. (The technically correct interpretation is that if we take 100 samples and calculate a 95% Confidence Interval from each of these samples, about 95 of these intervals will contain the true value of X)
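
One way to see what the "95%" refers to is to repeat the sampling many times and count how often the interval calculated from each sample contains the true value. The sketch below does this for a mean. The true mean, the standard deviation, the sample size, the number of repeats and the seed are all assumptions made for this example, and the interval uses the large-sample formula: mean plus or minus 1.96 standard errors.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0
SAMPLE_SIZE, REPEATS = 100, 1000

hits = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / SAMPLE_SIZE ** 0.5
    lower = mean - 1.96 * standard_error
    upper = mean + 1.96 * standard_error
    if lower <= TRUE_MEAN <= upper:   # does this interval contain the truth?
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The proportion should come out close to 95%. Each individual interval either contains the true mean or it does not; the 95% describes how often the method succeeds over many samples.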

Odds Ratio

If the estimated Odds Ratio for smoking and lung cancer is 2.14, this means that "The odds that smokers will get lung cancer are 2.14 times the odds that non-smokers will get lung cancer"
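
A short sketch of how an Odds Ratio is calculated from a 2 x 2 table. The counts are made up for illustration and were chosen so that the result comes out near 2.14; they are not real data on smoking and lung cancer.

```python
# Hypothetical counts (made-up numbers, for illustration only)
smokers_with_cancer = 30
smokers_without_cancer = 70
nonsmokers_with_cancer = 10
nonsmokers_without_cancer = 50

# Odds = number with the disease / number without the disease
odds_smokers = smokers_with_cancer / smokers_without_cancer
odds_nonsmokers = nonsmokers_with_cancer / nonsmokers_without_cancer

odds_ratio = odds_smokers / odds_nonsmokers
print(f"Odds Ratio = {odds_ratio:.2f}")
```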

Determining if a Calculated Odds Ratio is Statistically Significant

1. Derive the estimated Odds Ratio
2. Write down the Null Hypothesis and the Research Hypothesis
3. Look at the 95% Confidence Interval for the True Odds Ratio
4. Decide which of the two hypotheses to accept

Odds Ratio: Null Hypothesis and Research Hypothesis

H0 Odds Ratio = 1 (i.e. no association between risk factor and disease)

H1 Odds Ratio is not equal to 1 (i.e. statistically significant association between risk factor and disease)

Decision Rule: Accept Research Hypothesis if 95% Confidence Interval does not contain 1

EXAMPLE

Estimated Odds Ratio is 2.2
95% Confidence Interval for True Odds Ratio is 1.1 to 3.3
Null Hypothesis: OR = 1 (i.e. no association between risk factor & disease)
Research Hypothesis: OR is not equal to 1 (i.e. statistically significant association between risk factor & disease)
Which of the two hypotheses should you accept?

Answer: Since the 95% Confidence Interval does not contain the value 1, we are 95% sure that the True Odds Ratio is not equal to 1. Therefore, we accept the Research Hypothesis "Odds Ratio is not equal to 1" and conclude that there is a statistically significant association between the risk factor and the disease
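
The decision rule can be sketched in code. The text does not say how the 95% Confidence Interval is obtained; the sketch below assumes one common approach (the logarithm method, sometimes called Woolf's method), where the interval is exp(ln(OR) plus or minus 1.96 x SE) and SE is the square root of 1/a + 1/b + 1/c + 1/d. The counts are made up for illustration.

```python
import math

# Hypothetical 2 x 2 table (made-up counts):
#                       disease   no disease
# risk factor present    a = 60     b = 140
# risk factor absent     c = 20     d = 100
a, b, c, d = 60, 140, 20, 100

odds_ratio = (a / b) / (c / d)

# Standard error of ln(Odds Ratio), logarithm (Woolf) method
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

# Decision rule: accept H1 if the interval does not contain 1
excludes_one = not (lower <= 1 <= upper)

print(f"Estimated Odds Ratio = {odds_ratio:.2f}")
print(f"95% Confidence Interval = {lower:.2f} to {upper:.2f}")
print("Accept H1" if excludes_one else "Keep H0")
```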



Using the 95% Confidence Interval for Testing Differences in Means

Null Hypothesis H0 population mean1 - population mean2 = zero (i.e. no difference between the two means)

Research Hypothesis H1 population mean1 - population mean2 is not equal to zero (i.e. statistically significant difference between the two means)

1. Estimate the difference in means
2. Look at the 95% Confidence Interval for the Actual difference in means

Decision Rule: Accept the Research Hypothesis if the 95% Confidence Interval for the actual difference in means does not contain zero, i.e. we are 95% sure that the actual difference in means is not equal to zero.

EXAMPLE
Estimated difference in means is 1.2
The 95% Confidence Interval for the actual difference in means is 0.5 to 1.9
Which of the two hypotheses do you accept?

Answer: Since the 95% Confidence Interval for the actual difference in means does not contain zero, we are 95% sure that the actual difference in means is not equal to zero. Therefore, we accept the Research Hypothesis. We conclude that there is a statistically significant difference between the two means
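
A minimal sketch of how such an interval might be calculated and checked against zero. The two groups of measurements are made up for illustration. The sketch assumes the pooled-variance formula (both groups have similar variances) and uses 2.306, the usual t table value for 8 degrees of freedom, two-tailed, at the 5% level.

```python
import math
import statistics

# Hypothetical measurements for two independent groups (made-up data)
group_1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group_2 = [4.1, 4.5, 4.0, 4.8, 4.6]

n_1, n_2 = len(group_1), len(group_2)
difference = statistics.mean(group_1) - statistics.mean(group_2)

# Pooled variance and standard error of the difference in means
pooled_var = (
    (n_1 - 1) * statistics.variance(group_1)
    + (n_2 - 1) * statistics.variance(group_2)
) / (n_1 + n_2 - 2)
standard_error = math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))

T_CRITICAL = 2.306  # t table, 8 degrees of freedom, two-tailed, 5% level
lower = difference - T_CRITICAL * standard_error
upper = difference + T_CRITICAL * standard_error

# Decision rule: accept H1 if the interval does not contain zero
excludes_zero = not (lower <= 0 <= upper)

print(f"Estimated difference in means = {difference:.2f}")
print(f"95% Confidence Interval = {lower:.2f} to {upper:.2f}")
print("Accept H1" if excludes_zero else "Keep H0")
```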