biostatist pptx - غير معروف

Association between two categorical variables

Dr. Ahmed Samir Al-Naaimi
MBChB, MSc epid, PhD
Assistant Professor / Department of Community Medicine
Baghdad College of Medicine

Learning objectives

• The student will analyze the relationship between two categorical variables with two or more categories.
• Apply the Chi-square test for hypothesis testing.
• Understand the concept of observed and expected frequencies.
• Explore the link between Chi-square test and multiplication rule used for joint probability under the assumption of independence between 2 classification criteria.

Learning objectives

• Relate the magnitude of difference between observed and expected frequency to resulting test statistic and the conclusion of statistical significance.
• Evaluate the link between Z test and Chi-square test in the special condition of 2 x 2 contingency table.
• List the conditions for a valid Chi-square test.

Introduction

The table shows the distribution of individuals according to 3 categories of Socioeconomic Index Level (SEIL).
• SEIL
• N
• %
• Low
• 50
• 25
• Average
• 110
• 55
• High
• 40
• 20
• Total
• 200
• 100

Introduction
In the same sample the location of residence was also classified into 3 sectors: south, center and north.
• N
• %
• South
• 44
• 22
• Center
• 96
• 48
• North
• 60
• 30
• Total
• 200
• 100

Introduction

When we examine the relationship between two categorical variables, tabulated one against other.
This is a two way table or cross-tabulation.
• Location
• SEIL
• South
• Center
• North
• Total
• Low
• 33
• 7
• 10
• 50
• Average
• 9
• 81
• 20
• 110
• High
• 2
• 8
• 30
• 40
• Total
• 44
• 96
• 60
• 200

Interpretation of a two by two table
There is an association between two categorical variables, if the distribution of a variable varies according to the value of the other.
The question we are interested in “Does the Socio-economic Index level (SEIL) varies by place of residence?
To answer this question we need to assess a cross-tabulation and calculate relative frequencies (percentages).

Interpretation of a two by two table

To answer the question of interest, what should we consider the relative frequencies of column or row totals?
• SEIL
• South
• n %
• Center
• n %
• North
• n %
• Low
• 33 75
• 7 7.3
• 10 16.7
• Average
• 9 20.5
• 81 84.4
• 20 33.3
• High
• 2 4.5
• 8 8.3
• 30 50
• Total
• 44 100.0
• 96 100.0
• 60 100.0
Place of residence

Interpretation of a two by two table
If the distribution of SEIL is the same in each place of residence, the percentage of columns would be the same for each place of residence. It appears that the percentage of low SEIL differ between sites of residency, but the data are subject to sampling errors, so we need to assess whether these differences in the proportions of the sample reflect differences in populations.

To do this, we need a hypothesis test.

Expected frequencies
If the null hypothesis is true, there is no association between SEIL and area of residence, the percentages for each level of SEIL in each area, should be the same as the column of percentages in the total column. Or one can state the hypothesis as “the 2 methods of classification for people: SEIL and place of residence are independent”

• SEIL

• South
• n %
• Center
• n %
• North
• n %
• Total
• n %
• Low
• 33 75
• 7 7.3
• 10 16.7
• 50 25
• Regular
• 9 20.5
• 81 84.4
• 20 33.3
• 110 55
• High
• 2 4.5
• 8 8.3
• 30 50
• 40 20
• Total
• 44 100.0
• 96 100.0
• 60 100.0
• 200 100.0
Place of residence
Interpretation of a two ways table
Also, we should expect than 25% of people in the South have low SEIL. so the frequency (count) of people in South sector of residence with low SEIL is 0.25 x 44 = 11.

Expected frequencies
If there are no differences in the distribution of SEIL by places of residence, we should expect that the relative frequency of people with low SEIL is the same in each place of residence.
Note that the expected frequencies do not have to be integers.
Using the totals of columns and rows, we can calculate the expected frequency (count) in each cell.
E = (row total x column total) / grand total

Expected frequencies

Under the null hypothesis of independence for 2 events, the joint probability is equal to the product of the probability of each event.
P (Low SEIL) = 50/200
P (South) = 44/200
P (Low SEIL and South) =
50/200 x 44/200
The frequency expected in (Low SEIL and South) is equal to the P (Low SEIL and South) multiplied by total sample size of 200.
Expected frequency (E) = 50/200 x 44/200 x 200
E = (row total x column total) / grand total
• Location
• SEIL
• South
• Center
• North
• Total
• Low
• 33
• 50
• Average
• 110
• High
• 40
• Total
• 44
• 96
• 60
• 200

Chi-square test
Expected frequencies are those that we should expect if the null hypothesis were true.
To test the null hypothesis, we must compare the expected frequencies with observed frequencies, using the following formula.

Chi-Square test

From the formula we can see that:
• If there is a large or significant difference between the observed and expected values, the calculated (test statistic) 2 will be large, while if there is a small (or statistically insignificant) difference between the observed and expected values, the resulting 2 will be small also.

Chi-Square test

• If the calculated (test statistic) 2 is large, then the sample data provides enough evidence to reject the null hypothesis (Ho) because the observed values are not what we expect under the null hypothesis.
• If the calculated (test statistic) 2 is small in magnitude, then the sample data agrees with (accepts) the null hypothesis (Ho), which states that the observed values are similar to or not significantly different from those expected under the null hypothesis of independence.

Chi-Square distribution

• The values of test statistic in Chi-square distribution is between zero and + ∞. No negative values are present since they are squared values.
• The Chi-square distribution has one tail only (positively skewed distribution).
• The higher the df the
• more flattened is the
• curve.
• Hypothesis testing is
• always one tailed

Chi-Square test

• The X2 distribution is obtained from the sum of the squares of many standard Normal variables. The number of independent variables commonly used in this sum is the “degrees of freedom”, df = (r-1) x (c-1), where r is the count of rows in the table and c is the count of columns.
• The tabulated X2 for 2x2 table with df=1 and alpha error = 0.05 is equal to (Z1-alpha/2)2 = (1.96)2 = 3.84.

Chi-Square test

• SEIL
• South
• O E
• Center
• O E
• North
• O E
• Total
• n
• Low
• 33 11
• 7 24
• 10 15
• 50
• Regular
• 9 24.2
• 81 52.8
• 20 33
• 110
• High
• 2 8.8
• 8 19.2
• 30 12
• 40
• Total
• 44 44
• 96 96
• 60 60
• 200
Place of residence
Expected frequency = row total x column total /grand total.
Example: the expected frequency in the first cell of the table (the left upper) = (50 x 44) / 200 = 11, while the observed frequency is 33.

Chi-Square test
• SEIL
• Place of residence
• Observed
• Expected
• O - E
• (O-E)2
• (O-E)2/E
• Low
• South
• 33
• 11
• 22
• 484
• 44
• Low
• Center
• 9
• 24
• - 15
• 225
• 9.38
• Low
• North
• 2
• 15
• - 13
• 169
• 11.27
• Regular
• South
• 7
• 24.2
• -17.2
• 295.8
• 12.2
• Regular
• Center
• 81
• 52.8
• 28.2
• 795.2
• 15.1
• Regular
• North
• 8
• 33
• - 25
• 625
• 18.9
• High
• South
• 10
• 8.8
• 1.2
• 1.44
• 0.2
• High
• Center
• 20
• 19.2
• 0.8
• 0.64
• 0.03
• High
• North
• 30
• 12
• 18
• 324
• 27
• Total
• 138.1

Steps for hypothesis testing
• 1. State the statistical hypothesis
• Ho: There is no association between SEIL and residence location
• HA: There is an association
• 2. Fill in the observed frequencies for contingency table.
• 3. Calculate expected frequencies.
• 4. Calculate the test statistic (Chi-square)
• 5. Calculate the degrees of freedom (df) = (r-1) x (c-1)
• = (3-1) x (3-1) = 2 x 2 = 4
• 6. Get the tabulated 2 (decision rule) for the specified df.

Steps for hypothesis testing

• 6. The tabulated X2 (decision rule) for df=4 is 9.5
• 7. Compare the test statistic (calculated X2) and decision rule. Since 138.1 is > 9.5, then reject the Ho in favor of HA.
• 8. Conclusion: there is a statistically significant association between SEIL and residence location.

Chi-Square test in 2 x 2 tables

When both variables are binary (dichotomous), the cross-tabulation table becomes a 2 x 2.
The 2 test can be applied in the same way as for a larger number of categories table.
This special condition for 2 is very common in medical literature. It will give the same result as that of Z test used for the difference between 2 proportions studied earlier in the biostatistics module. Remember that the decision rule for 2 at df=1 is 3.841 which is the square value of Z at alpha 0.05 = 1.96.

Example (2 x 2 table)
There was a study of the bacteriological efficacy of clarithromycin Vs penicillin, in acute pharyngo-tonsillitis in children by Streptococcus Beta Haemolytic Group A. The results are shown below
• Drug
• Cure
• Not cure
• Total
• Clarithromycin
• 91
• 9
• 100
• Penicillin
• 82
• 18
• 100
• Total
• 173
• 27
• 200

Example (2 x 2) table

Statistical hypothesis
• Ho: There is no association between type of treatment and cure. While in case of Z test we would say “There is no difference in bacteriological efficacy (response rate) between the two treatments, against Streptococcus Beta Hemolytic Group A.
• HA: There is an association between type of treatment and patient’s response to treatment.
• Drug
• Cure
• O E
• Not cure
• O E
• Total
• Clarithromycin
• 91 86.5
• 9 13.5
• 100
• Penicillin
• 82 86.5
• 18 13.5
• 100
• Total
• 173
• 27
• 200

• Drug
• Effect
• Observed
• Expected
• O - E
• (O-E)2
• (O-E)2/E
• Clarithromycin
• Cure
• 91
• 86.5
• 4.5
• 20.25
• 0.234
• Clarithromycin
• Not cure
• 9
• 13.5
• - 4.5
• 20.25
• 1.5
• Penicillin
• Cure
• 82
• 86.5
• - 4.5
• 20.25
• 0.234
• Penicillin
• Not cure
• 18
• 13.5
• 4.5
• 20.25
• 1.5
• Total
• 3.47
df = (r-1) x (c-1) = (2-1) x (2-1) = 1 x 1 = 1
Calculate expected frequencies
Calculate the test statistic (2) for each cell in the table and its sum = 3.47
Get the decision rule 2 at df=1 which is 3.841

Example (2 x 2) table

Compare the test statistic (3.47) and decision rule (3.841), since the test statistic is larger, we accept the Ho.
• Conclusion: There is no statistically significant association between the type of treatment and the patients response to treatment
Try to solve this example by Z test and compare the results obtained by both methods.

Example (2 x 2) table

A quick formula for 2 x 2 tables
2 can be calculated without the need for expected frequencies in the special case of 2 x 2 table. Use the observed frequencies in a table and marginal totals. If we labeled the cells and marginal totals as follow:
• Exposure
• Result
• Yes
• Result
• No
• Total
• Yes
• a
• b
• a + b
• No
• c
• d
• c + d
• Total
• a + c
• b + d
• N
2=[(ad – bc)2 x N ]/[(a+b) (c+d) (a+c) (b+d)]

Validity of Chi-Square tests
Chi square tests are based on the assumption that the test statistic follows approximately the 2 distribution.
This is reasonable for large samples but for the small one we should use the following guidelines:
a) For 2 x 2 tables
• If the total sample size is> 40, then 2 can be used.
• If n is between 20 and 40, and the smallest expected value is > 5, 2 can be used. Otherwise, use the Fisher exact significance test.
b) For r x c tables: The 2 test is valid if not more than 20% of expected values is less than 5 and none is less than 1.