
by Dr PHUA Kai Lit

School of Medicine and Health Sciences

Monash University Malaysia

Bandar Sunway, Selangor, Malaysia


**POPULATION**: People you wish to study. If you want to study lung cancer among Malaysians, then the study population will be all Malaysians. If you want to study lung cancer among Malaysian women, then the population will be all Malaysian women.

**SAMPLE**: Small group of people selected from the population you wish to study (subset of the population). Findings from the sample are used to draw conclusions about the entire population.

**SIMPLE RANDOM SAMPLING**: Each member of the population has an EQUAL CHANCE of being selected into the sample.

**REPRESENTATIVE SAMPLE**: Composition of the sample resembles the composition of the population e.g. if 60% of all Malaysians are Malays, then in a sample of 1,000 Malaysians, there should be about 60% Malays or 600 Malays

* A random sample is likely to be a representative sample
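The link between random sampling and representativeness can be illustrated with a small simulation. This is only a sketch: the population size of 100,000, the two group labels and the random seed are assumptions made for illustration, with 60% of the population labelled "Malay" to match the example above.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 60% "Malay", 40% "Other"
population = ["Malay"] * 60_000 + ["Other"] * 40_000

# Simple random sampling: every member has an equal chance of selection
sample = random.sample(population, 1000)

malay_share = sample.count("Malay") / len(sample)
print(f"Share of Malays in the sample: {malay_share:.1%}")
```

The share in the sample will usually be close to 60%, but it will rarely be exactly 60%.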

**WAYS TO MEASURE CLUSTERING OF DATA ("MEASURES OF CENTRAL TENDENCY")**

1. **Mean (average)**

2. **Median** e.g. median of 1,3,5,7,9 is 5. The median of 1,3,5,7,9,11 is (5 + 7)/2 = 6

3. **Mode** e.g. mode of 1,3,3,5,7 is 3. The mode is the most commonly occurring number.
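The three measures above can be checked with Python's standard `statistics` module. The sketch below reuses the numbers from the examples; the variable names are chosen only for illustration.

```python
import statistics

odd_values = [1, 3, 5, 7, 9]
even_values = [1, 3, 5, 7, 9, 11]
mode_values = [1, 3, 3, 5, 7]

mean_value = statistics.mean(odd_values)      # sum of values divided by count
median_odd = statistics.median(odd_values)    # middle value of an odd-sized list
median_even = statistics.median(even_values)  # average of the two middle values
mode_value = statistics.mode(mode_values)     # most commonly occurring value

print(mean_value, median_odd, median_even, mode_value)
```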

**WAYS TO MEASURE SPREAD OF DATA ("MEASURES OF DISPERSION")**

1. **Range** - difference between highest and lowest values

2. **Standard Deviation** - the higher the Standard Deviation, the more spread out the data.

3. **Variance** - this is simply the standard deviation squared
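A minimal sketch of the three measures of dispersion, using a small made-up data set. It assumes the "population" versions of the formulas (dividing by n); the sample versions, `statistics.stdev` and `statistics.variance`, divide by n - 1 and give slightly larger values.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

data_range = max(data) - min(data)     # highest value minus lowest value
std_dev = statistics.pstdev(data)      # population standard deviation
variance = statistics.pvariance(data)  # population variance

print(data_range, std_dev, variance)
print(abs(std_dev ** 2 - variance) < 1e-9)  # variance is the SD squared
```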

**THE STANDARD DEVIATION IS A VERY IMPORTANT MEASURE** - Under a (Standardised) Normal Distribution Curve,

68.3% of the data lie within 1 standard deviation of the mean (between -1 and +1)

95.4% of the data lie within 2 standard deviations of the mean (between -2 and +2)

99.7% of the data lie within 3 standard deviations of the mean (between -3 and +3)
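These percentages can be reproduced from the standard normal distribution (mean 0, standard deviation 1). The sketch below uses `NormalDist` from Python's standard library; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of data lying between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The three proportions come out at roughly 0.6827, 0.9545 and 0.9973.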

**LEVEL OF MEASUREMENT OF DATA**

1. **Nominal data**: qualitative, categorical data. Example: ethnicity, gender, religion.

2. **Ordinal data**: Rank-ordered data. Data are grouped from low to high. But we cannot say how much lower or how much higher. Example: "low anxiety", "moderate anxiety" and "high anxiety".

3. **Interval data**: quantitative data. There is equal spacing between numbers e.g. the difference between 10 kg and 11 kg is the same as the difference between 35 kg and 36 kg. Example of interval data: temperature measured using the Celsius scale. (Height and weight also have equal spacing, but because they have an absolute zero they are, strictly speaking, ratio level data.)

4. **Ratio level data**: Similar to Interval Data but in addition, it has an **absolute zero** or meaningful zero e.g. income, temperature measured using the Kelvin scale.

The chi-square test is a commonly-used test for analysing NOMINAL DATA.

When we use the test, we must make sure the assumptions of the test are not violated!

Place the data in a "contingency table"

Write out the **"Null Hypothesis" (also called H-nought and written as H_{0})** and the **"Research Hypothesis" (H_{1})**

Run the chi-square test

Decide which of the 2 hypotheses you should take as your conclusion

**Chi-square test: Null Hypothesis and Research Hypothesis**

H_{0} No association between X and Y

Any association seen is due to chance

H_{1} There is an association between X and Y

**If p-value is less than 0.05, accept the research hypothesis**

**If calculated chi-square exceeds the critical value, accept the research hypothesis**

**ASSUMPTIONS OF CHI-SQUARE TEST**

1. Nominal data

2. 25 ≤ n ≤ 250

3. Random sample

4. Expected value of each cell is at least 5 (if not, you should combine some of the categories)
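The chi-square procedure can be sketched by hand for a 2 x 2 contingency table. Everything in the example is assumed for illustration: the counts are made up, no continuity correction is applied, and 3.841 is the usual table value for 1 degree of freedom at the 0.05 level.

```python
# Hypothetical 2 x 2 contingency table (rows: exposed / not exposed,
# columns: disease / no disease)
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = row total x column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square = sum over all cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom
CRITICAL_VALUE_DF1 = 3.841  # table value for df = 1, significance level 0.05

min_expected = min(min(row) for row in expected)   # assumption 4: at least 5
reject_null = chi_square > CRITICAL_VALUE_DF1

print(f"n = {n}, smallest expected count = {min_expected}")
print(f"chi-square = {chi_square:.3f}, df = {df}, reject H0: {reject_null}")
```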

The t-test is a commonly-used test for comparing two MEANS derived from INTERVAL DATA.

When we use the test, we must make sure the assumptions of the test are not violated!

Write out the **"Null Hypothesis" (H_{0})** and the **"Research Hypothesis" (H_{1})**

Run the t-test

Decide which of the 2 hypotheses should be your conclusion

**t-test: Null Hypothesis and Research Hypothesis**

H_{0} No difference between the two population means

Any difference seen is due to chance

H_{1} Statistically significant difference between the two population means

**If p-value is less than 0.05, accept the research hypothesis**

**If calculated t exceeds the critical value, accept the research hypothesis**

**ASSUMPTIONS OF T-TEST FOR TWO INDEPENDENT SAMPLES**

1. Random samples

2. Interval data

3. Normal distribution in both population groups

4. Preferably n < 30 (for each sample).
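A minimal sketch of the t-test for two independent samples, worked by hand. The two groups are made-up data, the calculation assumes equal variances in the two populations (the "pooled" form of the test), and 2.306 is the usual two-tailed table value for 8 degrees of freedom at the 0.05 level.

```python
from statistics import mean, variance

# Hypothetical measurements from two independent groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.5, 4.8, 5.0, 4.0]

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * variance(group_a)
              + (n_b - 1) * variance(group_b)) / df

# Standard error of the difference between the two sample means
std_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5

t_value = (mean(group_a) - mean(group_b)) / std_error

T_CRITICAL_DF8 = 2.306  # two-tailed table value for df = 8, level 0.05
reject_null = abs(t_value) > T_CRITICAL_DF8

print(f"t = {t_value:.3f}, df = {df}, reject H0: {reject_null}")
```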

**CHOOSING A TEST**

1. What is the level of measurement? Nominal, ordinal or interval?

2. How many samples? One, two or more?

3. If two samples, are they independent or paired/matched?

4. Choose the test. **Make sure the assumptions of the test are not violated**

**IMPORTANT: What is STATISTICALLY SIGNIFICANT may not be CLINICALLY SIGNIFICANT. "STATISTICALLY SIGNIFICANT" simply means that, if the null hypothesis were true, the probability of seeing a result like this by chance alone would be low (p < 0.05). We can therefore conclude that it is unlikely to have happened by chance.**

**95% Confidence Interval**

A 95% Confidence Interval for X simply means (roughly) that we are 95% sure that the true value of X lies somewhere within that interval. (The technically correct interpretation is that if we take 100 samples and calculate a 95% Confidence Interval for X from each of these samples, about 95 of those intervals will contain the true value of X)
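The repeated-sampling idea behind a 95% Confidence Interval can be illustrated with a simulation. This is a sketch under stated assumptions: X is taken to be a population mean, the population is normal with a known standard deviation, and the true mean, sample size, number of samples and random seed are all made up for illustration.

```python
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the illustration is repeatable

TRUE_MEAN = 50.0   # hypothetical true population mean
SIGMA = 10.0       # hypothetical (known) population standard deviation
SAMPLE_SIZE = 40
N_SAMPLES = 2000

z = NormalDist().inv_cdf(0.975)          # about 1.96 for a 95% interval
margin = z * SIGMA / SAMPLE_SIZE ** 0.5  # half-width of each interval

covered = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(SAMPLE_SIZE)]
    sample_mean = mean(sample)
    # Does this sample's interval contain the true mean?
    if sample_mean - margin <= TRUE_MEAN <= sample_mean + margin:
        covered += 1

coverage = covered / N_SAMPLES
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The proportion should come out close to 95%, although it will vary a little from one run of samples to another.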

**Odds Ratio**

If the estimated Odds Ratio for smoking and lung cancer is 2.14, this means that "The odds that smokers will get lung cancer are 2.14 times the odds that non-smokers will get lung cancer"
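An odds ratio can be worked out from a 2 x 2 table of counts. The counts below are made up for illustration and are not the data behind the 2.14 figure quoted above.

```python
# Hypothetical counts
smokers_cancer = 30        # smokers with lung cancer
smokers_no_cancer = 70     # smokers without lung cancer
nonsmokers_cancer = 15     # non-smokers with lung cancer
nonsmokers_no_cancer = 85  # non-smokers without lung cancer

# Odds = number with the disease / number without the disease
odds_smokers = smokers_cancer / smokers_no_cancer
odds_nonsmokers = nonsmokers_cancer / nonsmokers_no_cancer

odds_ratio = odds_smokers / odds_nonsmokers

print(f"Odds ratio = {odds_ratio:.2f}")
```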

**Determining if a Calculated Odds Ratio is Statistically Significant**

1. Derive the estimated Odds Ratio

2. Write down the **Null Hypothesis** and the **Research Hypothesis**

3. Look at the 95% Confidence Interval for the True Odds Ratio

4. Decide which of the two hypotheses to accept

**Odds Ratio: Null Hypothesis and Research Hypothesis**

H_{0} Odds Ratio = 1 (i.e. no association between risk factor and disease)

H_{1} Odds Ratio is not equal to 1 (i.e. statistically significant association between risk factor and disease)

**Decision Rule: Accept Research Hypothesis if 95% Confidence Interval does not contain 1**

**EXAMPLE**

Estimated Odds Ratio is 2.2

95% Confidence Interval for True Odds Ratio is 1.1 to 3.3

Null Hypothesis: OR = 1 (i.e. no association between risk factor & disease)

Research Hypothesis: OR is not equal to 1 (i.e. statistically significant association between risk factor & disease)

Which of the two hypotheses should you accept?

Answer: **Since the 95% Confidence Interval does not contain the value 1**, we are 95% sure that the True Odds Ratio is not equal to 1. Therefore, we **accept the Research Hypothesis** "Odds Ratio is not equal to 1" and conclude that there is a statistically significant association between the risk factor and the disease.
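The decision rule can be sketched in code. The notes do not say how the 95% Confidence Interval is obtained, so the sketch assumes one common approximation (a normal approximation on the log odds ratio) and made-up counts; the last line applies the same rule to the interval of 1.1 to 3.3 from the example.

```python
import math

# Hypothetical 2 x 2 counts
a, b = 40, 60  # exposed: with disease, without disease
c, d = 20, 80  # not exposed: with disease, without disease

odds_ratio = (a * d) / (b * c)

# Assumed method: approximate standard error of log(odds ratio)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
Z_95 = 1.96  # normal value for a 95% interval

ci_lower = math.exp(math.log(odds_ratio) - Z_95 * se_log_or)
ci_upper = math.exp(math.log(odds_ratio) + Z_95 * se_log_or)

def excludes_one(lower, upper):
    """Decision rule: accept the Research Hypothesis if 1 is outside the interval."""
    return not (lower <= 1 <= upper)

print(f"OR = {odds_ratio:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
print("Accept Research Hypothesis:", excludes_one(ci_lower, ci_upper))
print("Example above (1.1 to 3.3):", excludes_one(1.1, 3.3))
```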

**Determining if a Difference Between Two Means is Statistically Significant**

Null Hypothesis H_{0} population mean1 - population mean2 = zero (i.e. no difference between the two means)

Research Hypothesis H_{1} population mean1 - population mean2 is not equal to zero (i.e. statistically significant difference between the two means)

1. Estimate the difference in means

2. Look at the 95% Confidence Interval for the Actual difference in means

**Decision Rule: Accept the Research Hypothesis if the 95% Confidence Interval for the actual difference in means does not contain zero i.e. we are 95% sure that actual difference in means is not equal to zero.**

**EXAMPLE**

Estimated difference in means is 1.2

The 95% Confidence Interval for the actual difference in means is 0.5 to 1.9

Which of the two hypotheses do you accept?

Answer: **Since the 95% Confidence Interval for the actual difference in means does not contain zero**, we are 95% sure that the actual difference in means is not equal to zero. Therefore, we **accept the Research Hypothesis**. We conclude that there is a statistically significant difference between the two means.
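The same decision rule can be sketched for a difference in means. The two groups are made-up data, the interval assumes equal variances in the two populations (a pooled standard error), and 2.306 is the usual two-tailed t table value for 8 degrees of freedom; the last line applies the rule to the interval of 0.5 to 1.9 from the example.

```python
from statistics import mean, variance

# Hypothetical measurements from two independent groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.5, 4.8, 5.0, 4.0]

n_a, n_b = len(group_a), len(group_b)
difference = mean(group_a) - mean(group_b)  # estimated difference in means

# Pooled standard error of the difference (assumes equal variances)
pooled_var = ((n_a - 1) * variance(group_a)
              + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
std_error = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5

T_CRITICAL_DF8 = 2.306  # two-tailed table value for df = 8, level 0.05
ci_lower = difference - T_CRITICAL_DF8 * std_error
ci_upper = difference + T_CRITICAL_DF8 * std_error

def excludes_zero(lower, upper):
    """Decision rule: accept the Research Hypothesis if zero is outside the interval."""
    return not (lower <= 0 <= upper)

print(f"Difference = {difference:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
print("Accept Research Hypothesis:", excludes_zero(ci_lower, ci_upper))
print("Example above (0.5 to 1.9):", excludes_zero(0.5, 1.9))
```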