Medical statistics ppt - Biostatistics

Medical statistics

Lecture One
Introduction to medical statistics

Statistics

Is the discipline concerned with the treatment (handling) of numerical data derived from groups of individuals.
Statistical methods are the methods especially adapted to the elucidation of data affected by a multiplicity of causes.

We are living in the information age (information revolution). For example, about 0.5 million new articles are published annually in the medical field alone.

Thus we need to know how to obtain, analyse, and interpret this information (which is called data). Data are available in the form of numbers (values).

Biostatistics:
Is that field of statistics in which the data being analysed are derived from the biological sciences and medicine.

There are two main types of statistics:
Descriptive statistics: the type concerned only with the collection, organisation, presentation and summarisation of data.
Inferential (analytical) statistics: the type in which the objective is to reach a decision about a large group of data by examining only a small part of it.

Data (datum): The raw material of statistics is called data. It is obtained either as a measurement or as a process of counting.

Value: It is the numerical representative of the measurement of the variable

Sources of data:

The need for statistical activities is motivated by the need to answer a question. Answering it requires a search for suitable data to serve as the raw material for the investigation. Such data are usually available from one or more of the following sources:

Routine records, such as hospital medical records
Surveys, if the data needed to answer a question are not available from routine records
Experiments
External sources, in the form of published reports, data banks, or the research literature

Variable: Any characteristic that can take different values in different occasions, places, persons, and time, e.g. height, weight, age, etc...

Variables are one of two types

Quantitative variable (numerical): a variable that can be measured in units, such as height, weight, age, etc.
Qualitative variable (categorical): a variable that cannot be measured in units. It can only be summarised as a number or percentage in each category, e.g. sex, ethnic group, colour of the eye, race, education, occupation, type of disease.

Quantitative variables are of two main types:

Discrete quantitative variable: characterised by gaps or interruptions in the values. These gaps or interruptions indicate the absence of values between the values the variable can take, e.g. daily admissions of patients to hospital, parity, number of abortions, etc.

Continuous quantitative (random) variable: it does not possess the gaps or interruptions characteristic of a discrete variable. It has fractions of units, and the variable can assume any value within a specified interval, e.g. height, weight, etc. In fact, most biological data are of the continuous quantitative type.

Measurements and measurement scales:

There is another classification of variables according to measurement scales. Measurement means the assignment of numbers to objects or events according to a set of rules. The scales are:

Nominal scale (male-female, well-sick, under 65 years-65 and above, child-adult, married-unmarried)
Ordinal scale (high-intermediate-low; non-smoker, light, moderate, heavy smoker; social class I, II, III, IV and V)
Interval scale (age as 20-, 30-, 40-, 50-)
Ratio scale (equality of ratios as well as equality of intervals can be determined)

Population: It is the largest collection of entities in which we have an interest at a particular time, sharing at least one characteristic in common.

Sample: The sample may be defined as a part (subset) of the population, chosen so as to be as representative as possible of the population (random or non-random). The method applied to collect a sample is called sampling.

Lectures Two & Three

Summarisation and presentation of data

Data organisation (Ordered array):

It is the enlistment or the arrangement of the data according to their magnitude from the smallest to the largest or vice versa. The benefits of ordered array are:

Determine the smallest value (Xs) and the largest value (Xl)
Determine the range
Easy to present the data by table
To find the value of the median

Data presentation:

Data presentation is either by:
1- Numerical (numbers)
2- Tables, as:
a- Master table
b- Simple frequency distribution table
c- Class interval frequency distribution table
3- Graphs (pictorial presentation of data)

When we have the data composed of small sample size (n=20) it is easy to present them by numerical (numbers) "simple data", while if the data is more than 20 values or observations it is better to present them by tables

Master table:
It contains the information regarding all variables included in the study (a spreadsheet on the computer, e.g. in Excel). From the master table, the information regarding one or two variables is taken and presented in a simple frequency table or another type of table.

Simple frequency distribution table:
It is the arrangement of data according to their magnitude and the frequency of occurrence of each magnitude.

A complete table is composed of several columns: the values of the variable (X), the frequency of occurrence (F), the cumulative frequency (Cum.F), the relative frequency (R.F. = F/n), the cumulative relative frequency (C.R.F.), the relative frequency percentage (R.F.% = F/n × 100), and the cumulative relative frequency percentage (C.R.F.%), as in the following example:

Table (1): The parity distribution of mothers attending ANC clinic in the Al-Muntezeh PHCC for the year 2010

| Parity | Frequency | Cum.F | R.F. | C.R.F. | R.F. % | C.R.F. % |
|---|---|---|---|---|---|---|
| Primigravida (0) | 25 | 25 | 0.25 | 0.25 | 25 % | 25 % |
| 1 | 14 | 39 | 0.14 | 0.39 | 14 % | 39 % |
| 2 | 16 | 55 | 0.16 | 0.55 | 16 % | 55 % |
| 3 | 18 | 73 | 0.18 | 0.73 | 18 % | 73 % |
| 4 & more | 27 | 100 | 0.27 | 1.00 | 27 % | 100 % |
| Total | 100 | -- | 1.00 | -- | 100 % | -- |
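The columns of Table (1) can be derived from the frequencies alone. The following is a minimal sketch in Python, assuming only the parity frequencies from the table; the variable names are illustrative and not part of the lecture.

```python
from itertools import accumulate

# Parity categories and frequencies copied from Table (1).
parity = ["0", "1", "2", "3", "4 & more"]
freq = [25, 14, 16, 18, 27]

n = sum(freq)                               # total number of mothers
cum_freq = list(accumulate(freq))           # Cum.F: running total of F
rel_freq = [f / n for f in freq]            # R.F. = F / n
cum_rel_freq = [c / n for c in cum_freq]    # C.R.F. = Cum.F / n

# Print one row per parity value: F, Cum.F, R.F., C.R.F., R.F.%, C.R.F.%
for p, f, cf, rf, crf in zip(parity, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{p:>8}  {f:>3}  {cf:>3}  {rf:.2f}  {crf:.2f}  {rf:.0%}  {crf:.0%}")
```

The last cumulative frequency must equal n, and the last cumulative relative frequency must equal 1.00, which is a quick check that no observation was lost.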

The characteristics of tables:

A table should be simple, easy to understand and self-explanatory.
Each table should have a number.
Each table should have a title written at the top of it. This title should answer the following questions: what, where, and who.
Each table should have a clear heading for the columns.
Each table should contain a total at the end of each column.
We should avoid the use of abbreviations and codes; if we have to use them, we should explain them at the bottom of the table.
If we use any number from a reference or book, we should cite it at the bottom of the table.

Class-interval frequency distribution table:

Data of the continuous quantitative type are presented here as intervals. The steps to present the data in a class-interval table are as follows:

Count the number of observations (n).
Determine the smallest value (Xs) and the largest value (Xl).
Decide whether to present them in a simple or in a class-interval table.
For a class-interval table, determine the number of class intervals (K) according to Sturges' formula:

$K = 1 + 3.322 \log_{10} n$

Then determine the width of the class interval: $W = \frac{R}{K} = \frac{X_l - X_s}{K}$
Then determine the class intervals.
Then present the frequency of observations in each class interval by tallying.
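As an illustration of Sturges' formula and the width of the class interval, here is a small Python sketch. It assumes n = 70 observations with a smallest value of 8.8 and a largest value of 15.1, the figures of the haemoglobin example used in these lectures; the variable names are illustrative.

```python
import math

n = 70                          # number of observations
smallest, largest = 8.8, 15.1   # Xs and Xl of the haemoglobin example

k = 1 + 3.322 * math.log10(n)   # Sturges' formula: K = 1 + 3.322 log10(n)
data_range = largest - smallest # R = Xl - Xs
width = data_range / k          # W = R / K

print(f"K = {k:.2f}, R = {data_range:.1f}, W = {width:.2f}")
```

Here K comes out at about 7.1 and W at about 0.9, so in practice a convenient rounded width of 1 g/dL is a reasonable choice; with a width of 1, covering the values from 8.8 to 15.1 takes eight class intervals.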

The additional characteristics of class interval tables:

The number of class intervals (K) should not be less than 5 (in order not to lose the details) and not more than 20.
The preferable number of class intervals is 6-12, or the number given by Sturges' formula.
Constant width of class interval
No gaps in between class intervals
No overlapping between class intervals (each observation is presented once only)

Example:

The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010

10.2  13.7  10.4  14.9  11.5  12.0  11.0  13.3  12.9  12.1
 9.4  13.2  10.8  11.7  10.6  10.5  13.7  11.8  14.1  10.3
13.6  12.1  12.9  11.4  12.7  10.6  11.4  11.9   9.3  13.5
14.6  11.2  11.7  10.9  10.4  12.0  12.9  11.1   8.8  10.2
11.6  12.5  13.4  12.1  10.9  11.3  14.7  10.8  13.3  11.9
11.4  12.5  13.0  11.6  13.1   9.7  11.2  15.1  10.7  12.9
13.4  12.3  11.0  14.6  11.1  13.5  10.9  13.1  11.8  12.2

Table 3: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010

| Haemoglobin (g/dL) | Tallying | Freq. | Cum.F | R.F. | C.R.F. | R.F.% | C.R.F.% |
|---|---|---|---|---|---|---|---|
| 8- | I | 1 | 1 | 0.014 | 0.014 | 1.4 % | 1.4 % |
| 9- | III | 3 | 4 | 0.043 | 0.057 | 4.3 % | 5.7 % |
| 10- | IIIII IIIII IIII | 14 | 18 | 0.200 | 0.257 | 20.0 % | 25.7 % |
| 11- | IIIII IIIII IIIII IIII | 19 | 37 | 0.271 | 0.528 | 27.1 % | 52.8 % |
| 12- | IIIII IIIII IIII | 14 | 51 | 0.200 | 0.728 | 20.0 % | 72.8 % |
| 13- | IIIII IIIII III | 13 | 64 | 0.186 | 0.914 | 18.6 % | 91.4 % |
| 14- | IIIII | 5 | 69 | 0.071 | 0.985 | 7.1 % | 98.5 % |
| 15-15.9 | I | 1 | 70 | 0.014 | 1.00 | 1.4 % | 100 % |
| Total | | 70 | -- | 1.00 | -- | 100 % | -- |

The Graphical Representation of Data

Types of graphs:

Bar chart: It is a graphic representation used to present data of the qualitative type. It is composed of a number of bars separated from each other. The width of the bars is not important, but it is preferable that they all have the same width (so as to give a true impression). The length of each bar is important: it is drawn proportional to the frequency or percentage.

Table 4: The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010

| Method of delivery | No. of births | Percentage |
|---|---|---|
| Normal vaginal delivery | 478 | 79.7 % |
| Forceps delivery | 65 | 10.8 % |
| Caesarean section | 57 | 9.5 % |
| Total | 600 | 100 % |

Figure 1: The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010

Figure 2: The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010

Pie chart:

It is a graphic representation used to present data of the qualitative type in the shape of a circle.
The size (angle) of the slice for each category is determined by the equation f/n × 360°.
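To make the slice-size rule concrete, the short Python sketch below applies f/n × 360° to the delivery counts of Table 4. The dictionary and variable names are illustrative only.

```python
# Number of births by method of delivery, copied from Table 4.
counts = {
    "Normal vaginal delivery": 478,
    "Forceps delivery": 65,
    "Caesarean section": 57,
}

n = sum(counts.values())  # total number of births (600)

# Angle of each slice in degrees: f / n * 360
angles = {method: f / n * 360 for method, f in counts.items()}

for method, angle in angles.items():
    print(f"{method}: {angle:.1f} degrees")
```

The angles must add up to 360°, which is a simple check on the arithmetic.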

Figure 3: The method of delivery of 600 babies born in Bintul Huda Teaching Hospital for the year 2010

Histogram:

It is a graphic representation used to present continuous quantitative data arranged in class intervals.
It is composed of a number of bars adherent to each other.

The width of each bar is very important and is equal to the width of the class interval, and the length of the bar is proportional to the frequency of the class interval or its percentage.
So the area of the histogram is very important: the total area represents 1 unit (100%), which is equal to the total probability.

Figure 4: The haemoglobin level in g/ dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010

Line graph (frequency polygon):

It is a graphic representation used to present discrete quantitative data. It can also be derived from a histogram (which is used to present continuous quantitative data arranged in class intervals) by taking the mid point at the top of each bar and joining these points by straight lines.

The line graph should not be left open. It should be closed by taking the mid point of the class interval before the first class interval (it has a frequency of zero) and the mid point of the class interval after the last class interval (it also has a frequency of zero).

So the line graph will join the X-axis at these two ends. The area below the line and above the X-axis is equal to the area of the histogram, i.e. one unit (100%), equal to the total probability.

A line graph is also used when we want to present two groups in one graph for the purpose of comparison, which is not possible with a histogram (as a bar of group 1 would cover a bar of group 2).

Figure 5: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010

Spot map (spot chart, map chart): It is a graphic representation used to present data on a map.
Scatter diagram: It is a graphic representation used in correlation and regression to show the relationship between two quantitative variables.

Cumulative relative frequency percentage curve: It is a special type of line graph in which the X-axis is the variable and the Y-axis is the C.R.F.%. It is used to calculate the value of the median precisely. The curve has what is called a sigmoid shape (sigmoid curve).

The characteristics of graphs:

Graphs should be simple, easy to understand and self-explanatory.
Each graph should have a number.
Each graph should have a title written at the bottom of the graph. This title should answer the following questions: what, where, when, and who.
We should avoid the use of abbreviations and codes; if we have to use them, we should explain them inside the graph.

Stem-and-Leaf display
Another graphical method of representing data

DIFFEREN Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     5.00       -7 .  00000
     1.00       -6 .  0
     4.00       -5 .  0000
     4.00       -4 .  0000
     1.00       -3 .  0
     4.00       -2 .  0000
     8.00       -1 .  00000000
      .00       -0 .
     6.00        0 .  000000
    20.00        1 .  00000000000000000000
    15.00        2 .  000000000000000
     6.00        3 .  000000
     4.00        4 .  0000
     4.00        5 .  0000
     2.00        6 .  00
     4.00        7 .  0000
     2.00        8 .  00
     2.00        9 .  00
     1.00       10 .  0
     7.00 Extremes    (>=12.0)

 Stem width:  1.00
 Each leaf:   1 case(s)

Lecture Four

Measurement of central location

Data summarisation:

Data summarisation is either by:
Measurements of central tendency (average measurements, measurements of location, and measurements of position)
Measurements of variability (dispersion, distribution measurements)
Skewness
Kurtosis

Measures of central tendency

Descriptive measure: a single number used as a means to summarise data.
Statistic: a descriptive measure computed from the data of a sample.
Parameter: a descriptive measure computed from the data of a population.

Mean, mode, and median

Mean
It is a measure calculated by adding all the values in a population or a sample and dividing by the number of values that are added. If the values are x1, x2, …, xn and the number of values is n, then the mean is $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x}{n}$

Properties of the mean
Uniqueness
Simplicity
Since each and every value in a set of data enters into the computation of the mean, it is affected by each value. Therefore, extreme values have an influence on the mean.

Mode

Is that value which occurs most frequently.

Median

Is the value that divides the set into two equal parts after sorting them into an ascending or descending pattern

If the number of values is odd, the median will be the middle value when all values have been arranged in order of magnitude.

When the number of values is even, there is no single middle value. Instead, there are two middle values. In this case, the median is taken to be the average of these two middle values, when all values have been arranged in order of magnitude.

Properties of median

Uniqueness
Simplicity
It is not as drastically affected by extreme values as is the mean

Mode

Is that value which occurs most frequently in a set of observations. A set of values may have more than one mode (e.g. bimodal, trimodal).

Lecture Five

Measurement of variability

Measures of dispersion

Dispersion: is the variety that a set of observations exhibits.
Range: is the difference between the largest and smallest value in a set of observations.
Variance: is a measure of dispersion relative to the scatter of values about their mean: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

Standard deviation: is the square root of the variance.
Coefficient of variation: is a measure that expresses the standard deviation (s) as a percentage of the mean: $CV\% = \frac{s}{\bar{x}} \times 100$
Q/ Why is it used?

Measures of variability (Dispersion):

The degree to which numerical (quantitative) data tend to spread about an average value is called the variation or dispersion of the data. Variation is in the nature of data, i.e. the data never all come as a single value.

There are a lot of measures of variation (dispersion) available, but the most commonly used are:

Range:

is the difference between the smallest and the largest value in a set of values.
Range (R) = Largest value (Xl) - Smallest value (Xs)

The range is of limited use in statistics as a measure of variability because it takes into consideration only two values and neglects the others.

E.g. if we have 10 values, the range will consider only 2 values and neglect the other 8; with 100 values, it will neglect the other 98; and with 1000 values, it will neglect the other 998.

These two values, considered by the range, are the two extreme ones (the smallest and the largest), which are of little interest in biostatistics because they do not describe the variation well.

The uses of range

It gives an idea about the extent of data distribution (the scale or range over which the data extend or spread).
It is used in determining the width of the class interval in a class interval table (W = R/K).

Variance:

The variance is defined as the average of the squared deviation of observations away from their mean in a set of observations. Or: The scatter of values about their mean

E.g.: Suppose we have five persons with their haemoglobin level (g/dl) measurements (8, 9, 10, 11, 12).

The variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

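| Haemoglobin level (g/dl) | Deviation d = (x - mean) | d² = (x - mean)² |
|---|---|---|
| 8 | 8 - 10 = -2 | 4 |
| 9 | 9 - 10 = -1 | 1 |
| 10 | 10 - 10 = 0 | 0 |
| 11 | 11 - 10 = +1 | 1 |
| 12 | 12 - 10 = +2 | 4 |

Variance: $s^2 = \frac{\sum d^2}{n - 1} = \frac{10}{5 - 1} = \frac{10}{4} = 2.5$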

Standard deviation:
The SD is defined as the square root of the variance. Loosely, it can be thought of as the average deviation of the observations away from their mean in a set of observations.

It is widely used in biostatistics as a measure of variability. If the value of the SD is high, the data possess a large variation, and vice versa.

Coefficient of variation (CV%)

It is the standard deviation expressed as a percentage of the mean.

It is used in statistics in the following conditions:
To compare the variability of two groups for the same variable measured in different units (e.g. birth weight is measured in kilograms in Iraq and in pounds in the UK). We cannot compare the variability of the two groups by SD, but we can compare it by CV%.

To compare the variability of two groups for the same variable measured in the same unit when they have the same SD value but different means.

e.g.: The plasma volume of 8 healthy adult males: 2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, and 3.12 liters
∑x = 2.75 + 2.86 + 3.37 + 2.76 + 2.62 + 3.49 + 3.05 + 3.12 = 24.02
Mean = ∑x/n = 24.02/8 = 3.002 liters

Rearranging the measurements in increasing order:
1st 2.62, 2nd 2.75, 3rd 2.76, 4th 2.86, 5th 3.05, 6th 3.12, 7th 3.37, 8th 3.49 liters
Median position = (n+1)/2 = (8+1)/2 = 4.5 (between the 4th and 5th values)
Median = the average of the 4th and 5th values = (2.86 + 3.05)/2 = 2.955 liters. This value divides the data into two equal parts.

Mode: No value occurs more often than the others, so there is no mode here.
Range = Xl - Xs = 3.49 - 2.62 = 0.87 liter
SD = √Variance = √0.097 = ±0.311 liter
CV% = SD/mean × 100 = 0.311/3.002 × 100 = 10.36%
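The plasma volume example can be checked with Python's standard `statistics` module. This is only a sketch for verification, not part of the lecture; `statistics.variance` and `statistics.stdev` use the n - 1 denominator, matching the formula given above.

```python
import statistics

# Plasma volume (liters) of the 8 healthy adult males in the example.
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]

mean = statistics.mean(plasma)
median = statistics.median(plasma)         # average of the 4th and 5th ordered values
modes = statistics.multimode(plasma)       # returns every value when none repeats
value_range = max(plasma) - min(plasma)    # Xl - Xs
variance = statistics.variance(plasma)     # sample variance, divides by n - 1
sd = statistics.stdev(plasma)              # square root of the sample variance
cv_percent = sd / mean * 100               # coefficient of variation

has_single_mode = len(modes) == 1          # False here: no value occurs more than once

print(f"mean={mean:.3f} median={median:.3f} range={value_range:.2f}")
print(f"variance={variance:.3f} SD={sd:.3f} CV%={cv_percent:.2f}")
```

Because every value occurs exactly once, `multimode` returns all eight values, which corresponds to saying that the data have no mode.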

Table: The parity distribution of mothers attending ANC clinic in the Al-Muntezeh PHCC for the year 2010

| Parity (x) | Frequency (f) | Cum. f | xf | r.f. | c.r.f. | r.f.% | c.r.f.% | x²f |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0.03 | 0.03 | 3% | 3% | 0 |
| 1 | 15 | 18 | 15 | 0.15 | 0.18 | 15% | 18% | 15 |
| 2 | 24 | 42 | 48 | 0.24 | 0.42 | 24% | 42% | 96 |
| 3 | 27 | 69 | 81 | 0.27 | 0.69 | 27% | 69% | 243 |
| 4 | 15 | 84 | 60 | 0.15 | 0.84 | 15% | 84% | 240 |
| 5 | 10 | 94 | 50 | 0.10 | 0.94 | 10% | 94% | 250 |
| 6 | 6 | 100 | 36 | 0.06 | 1.00 | 6% | 100% | 216 |
| Total | n = 100 | -- | ∑xf = 290 | 1.00 | -- | 100% | -- | ∑x²f = 1060 |

For the calculations:
∑x = ∑xf = [(0×3) + (1×15) + (2×24) + (3×27) + (4×15) + (5×10) + (6×6)] = 290
Mean: $\bar{x} = \frac{\sum xf}{n} = \frac{290}{100} = 2.9$
Mode = 3 (it has the highest frequency, i.e. 27)
Median position: $\frac{n + 1}{2} = \frac{100 + 1}{2} = \frac{101}{2} = 50.5$ (between the 50th and 51st values)
From the column of cumulative frequency, the median = 3
Or: median = 50th percentile (half of 100% = 50%), so from the column of c.r.f.%, the median = 3

Range = Xl - Xs = 6 - 0 = 6 parity
∑d² = ∑x²f - (∑xf)²/n = 1060 - 290²/100 = 1060 - 841 = 219
Variance: $s^2 = \frac{\sum d^2}{n - 1} = \frac{219}{100 - 1} = \frac{219}{99} = 2.21$
SD = √Variance = √2.21 = ±1.49 parity
CV% = SD/mean × 100 = 1.49/2.9 × 100 = 51.38%
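The same calculations can be reproduced directly from the frequency table. The Python sketch below assumes only the parity values and frequencies of the table; the helper `kth_value` is a hypothetical name introduced here to read the k-th ordered observation off the cumulative frequencies.

```python
import math

# Parity values (x) and their frequencies (f) from the table.
parity = [0, 1, 2, 3, 4, 5, 6]
freq = [3, 15, 24, 27, 15, 10, 6]

n = sum(freq)
sum_xf = sum(x * f for x, f in zip(parity, freq))        # sum of xf
sum_x2f = sum(x * x * f for x, f in zip(parity, freq))   # sum of x^2 f

mean = sum_xf / n
mode = parity[freq.index(max(freq))]    # value with the highest frequency


def kth_value(k):
    """Return the k-th observation (1-based) in the ordered data."""
    cumulative = 0
    for x, f in zip(parity, freq):
        cumulative += f
        if k <= cumulative:
            return x
    raise ValueError("k is larger than the number of observations")


# n is even, so the median is the average of the (n/2)th and (n/2 + 1)th values.
median = (kth_value(n // 2) + kth_value(n // 2 + 1)) / 2

sum_d2 = sum_x2f - sum_xf ** 2 / n      # sum of squared deviations
variance = sum_d2 / (n - 1)
sd = math.sqrt(variance)
cv_percent = sd / mean * 100

print(mean, mode, median, round(variance, 2), round(sd, 2), round(cv_percent, 2))
```

Small differences in the last decimal of CV% compared with a hand calculation can arise because the hand calculation rounds the SD before dividing by the mean.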

Table: The haemoglobin level in g/dL for 70 pregnant women in Bintul Huda Teaching Hospital for the year 2010

| Haemoglobin (g/dL) | Freq. (f) | Mid point (MP) | MP × f | Cum. f | r.f. | c.r.f. | r.f.% | c.r.f.% | MP² × f |
|---|---|---|---|---|---|---|---|---|---|
| 8- | 1 | 8.5 | 8.5 | 1 | 0.014 | 0.014 | 1.4% | 1.4% | 72.25 |
| 9- | 3 | 9.5 | 28.5 | 4 | 0.043 | 0.057 | 4.3% | 5.7% | 270.75 |
| 10- | 14 | 10.5 | 147.0 | 18 | 0.200 | 0.257 | 20.0% | 25.7% | 1543.5 |
| 11- | 19 | 11.5 | 218.5 | 37 | 0.271 | 0.528 | 27.1% | 52.8% | 2512.75 |
| 12- | 14 | 12.5 | 175.0 | 51 | 0.200 | 0.728 | 20.0% | 72.8% | 2187.5 |
| 13- | 13 | 13.5 | 175.5 | 64 | 0.186 | 0.914 | 18.6% | 91.4% | 2369.25 |
| 14- | 5 | 14.5 | 72.5 | 69 | 0.071 | 0.985 | 7.1% | 98.5% | 1051.25 |
| 15-15.9 | 1 | 15.5 | 15.5 | 70 | 0.014 | 1.00 | 1.4% | 100% | 240.25 |
| Total | n = 70 | -- | ∑MPf = 841 (∑x) | -- | 1.00 | -- | 100% | -- | ∑MP²f = 10247.5 (∑x²) |

For the calculations:
∑x = ∑MPf = [(8.5×1) + (9.5×3) + (10.5×14) + (11.5×19) + (12.5×14) + (13.5×13) + (14.5×5) + (15.5×1)] = 841
Mean: $\bar{x} = \frac{\sum MPf}{n} = \frac{841}{70} = 12.01$ g/dl

Mode = 11.5 g/dl (the mid point of the C.I. 11-11.9, which has the highest frequency, i.e. 19)
Median position: $\frac{n}{2} = \frac{70}{2}$ = 35th
From the column of cum. f, the median lies in the C.I. 11-11.9
Median: $L + \frac{r}{f} \times W$, where:
L = lower limit of the C.I. containing the median = 11
r = remaining number until reaching the position of the median = (n/2) - the previous cumulative frequency = 70/2 - 18 = 17
f = frequency of the C.I. containing the median = 19
W = width of the C.I. = 1
Median: $L + \frac{r}{f} \times W = 11 + \frac{17}{19} \times 1 = 11.89$ g/dl
Figure: C.R.F.% curve for calculating the exact value of the median in continuous quantitative data arranged in class intervals.

Range = Xl - Xs = 15.9 - 8 = 7.9 g/dL (using the C.I.)
Range = Xl - Xs = 15.5 - 8.5 = 7.0 g/dL (using the MP)
Range = Xl - Xs = 15.1 - 8.8 = 6.3 g/dL (using the original smallest and largest data values)
∑d² = ∑MP²f - (∑MPf)²/n = 10247.5 - 841²/70 = 10247.5 - 10104.01 = 143.49
Variance: $s^2 = \frac{\sum d^2}{n - 1} = \frac{143.49}{70 - 1} = \frac{143.49}{69} = 2.08$
SD = √Variance = √2.08 = ±1.44 g/dL
CV% = SD/mean × 100 = 1.44/12.01 × 100 = 11.99%
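For data arranged in class intervals, the mean, the interpolated median and the variance can be computed from the lower limits and frequencies. The Python sketch below assumes a constant class width of 1 g/dL and the frequencies of the haemoglobin table; the variable names are illustrative.

```python
import math

# Lower limits of the class intervals and their frequencies (haemoglobin table).
lower = [8, 9, 10, 11, 12, 13, 14, 15]
freq = [1, 3, 14, 19, 14, 13, 5, 1]
width = 1                                   # constant class width W

n = sum(freq)
mid = [l + width / 2 for l in lower]        # mid points (MP)
sum_mf = sum(m * f for m, f in zip(mid, freq))        # sum of MP x f
sum_m2f = sum(m * m * f for m, f in zip(mid, freq))   # sum of MP^2 x f

mean = sum_mf / n

# Median = L + (r / f) * W, where r = n/2 - cumulative frequency before the class.
half = n / 2
cumulative = 0
median = None
for l, f in zip(lower, freq):
    if cumulative + f >= half:
        median = l + (half - cumulative) / f * width
        break
    cumulative += f

variance = (sum_m2f - sum_mf ** 2 / n) / (n - 1)
sd = math.sqrt(variance)
cv_percent = sd / mean * 100

print(f"mean={mean:.2f} median={median:.2f} variance={variance:.2f} SD={sd:.2f}")
```

Note that these are approximations based on the mid points; statistics computed from the 70 original values would differ slightly.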

Lecture Six
Introduction to Sampling


Data collection:
It is difficult to study the whole population of interest to reach a conclusion regarding a certain parameter (variable) and the effect of different factors on that parameter. Doing so through a census of the population needs time, money, effort, and manpower.

Moreover, a census determines only the demographic characteristics of the population. No medical information can be gathered from a census.

So a sample that is as representative as possible of the population is taken by sampling. When this is done properly, we can generalise its findings to the population.

Reasons for sampling:

A sample can be studied easily (studying the population needs time, manpower, and effort).
It is less expensive than studying the entire population.
Sample results are usually more accurate than results obtained from the population.

If samples are properly selected, probability methods can be used to estimate the error in the resulting statistics.
To reduce heterogeneity, so that a sample with specific characteristics can be studied, i.e. not the whole population.

Sampling:
There are two main types of sampling:
Probability (random) sampling: the best method, as it allows us to infer from the sample drawn to the population
Non-probability (non-random) sampling

Random sampling:

In this type of sampling, each person in the population has the same chance (probability) of being included in the sample as the others. So there is no bias that favours the inclusion of any person in the sample.

This method allows us to select a sample that is as representative as possible of the population, making it possible to generalise the findings in the sample to the population. There are different methods of random sampling:

Simple random sampling:

The individuals are coded by letters or numbers (to make it more random than names). Next, the required number of individuals is selected, and each one has the same chance of being chosen in the sample.

This selection can be achieved by labeling a card for each individual in the population, shuffling them well, and then selecting the appropriate required number of cards.

A more convenient method is the random digit table. First of all, look at the total number of the population and see how many digits it comprises (e.g. five digits, 00000), so as to select numbers within this digit range.

An arbitrary point will be chosen from random digit table and then we go in the list to read numbers moving down or across rows as preferred until the required number of different individuals have been selected.

If a selected random digital number is larger than the population total or if it is zero, it will be ignored. If a chosen number is repeated it will be ignored
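On a computer, a pseudo-random number generator can play the role of the shuffled cards or the random digit table. The following Python sketch is only an illustration: the population size of 500, the sample size of 20 and the seed are assumptions made here, not values from the lecture.

```python
import random

population_size = 500   # hypothetical: individuals are coded 1..500
sample_size = 20        # hypothetical number of individuals required

# A seeded generator makes the selection reproducible (the seed is arbitrary).
rng = random.Random(2010)

# random.sample draws without replacement, so zero, out-of-range and
# repeated numbers never appear and need not be discarded by hand.
sample = rng.sample(range(1, population_size + 1), sample_size)

print(sorted(sample))
```

Each coded individual has the same chance of being selected, which is the defining property of simple random sampling described above.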

Systematic random sampling:

Sometimes we are dealing with an infinite population (a population composed of an endless number, such as patients attending the outpatient clinic). In that case it is convenient to carry out sampling in a systematic way (at a regular interval).

The interval is determined according to the total number of the population and the number of the sample required:
Interval (nth) = Total number of population / Number of sample required

Example:

The total number of patients attending the outpatient clinic in Al-Hussein Teaching Hospital is about 500 daily. We want to select a sample of 100 patients, so the interval is 500/100 = 5, i.e. every 5th patient.

The starting point is chosen at random from the first 5 numbers by simple random sampling. Suppose it was taken as 4, so the sample will comprise individuals with numbers 4, 9, 14, 19, 24, 29, etc.
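The outpatient example can be written as a few lines of Python. This is a sketch under the example's own figures (500 patients, a sample of 100, starting point 4); in practice the starting point would be drawn at random, for instance with `random.randint(1, interval)`.

```python
population_size = 500   # patients attending the clinic in a day
sample_size = 100       # required sample size

interval = population_size // sample_size   # every 5th patient
start = 4                                   # starting point used in the example

# Patient numbers selected: start, start + interval, start + 2*interval, ...
sample = list(range(start, population_size + 1, interval))

print(interval, sample[:6], len(sample))
```

The selected numbers begin 4, 9, 14, 19, 24, 29 and the list contains exactly 100 patients.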

Stratified random sampling:

This type of sampling is used when we have a population composed of quite different strata or distinct subgroups, for example a population composed of males and females.

The selection of a sample that does not take into account these distinct subgroups may yield a sample that is totally composed of males or of females, or that has different percentages of males and females from those of the population.

So we use stratified random sampling, in which we divide the population according to these subgroups and then select the required number by simple random sampling from each subgroup.

By this method we select a sample in which the percentages of males and females are the same as those in the population.
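A proportionate stratified sample can be sketched in Python as follows. The population of 600 females and 400 males, the sample size of 50 and the seed are hypothetical values chosen for illustration; they do not come from the lecture.

```python
import random

rng = random.Random(0)  # arbitrary seed, for reproducibility only

# Hypothetical population: 600 females and 400 males, each with an id.
population = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]
sample_size = 50

# Divide the population into strata according to sex.
strata = {}
for person in population:
    strata.setdefault(person[0], []).append(person)

# From each stratum, draw a simple random sample whose size is
# proportional to the stratum's share of the population.
sample = []
for sex, members in strata.items():
    k = round(sample_size * len(members) / len(population))
    sample.extend(rng.sample(members, k))

females = sum(1 for sex, _ in sample if sex == "F")
print(len(sample), females, len(sample) - females)
```

With these figures the sample contains 30 females and 20 males, the same 60% to 40% split as the population. When the proportional sizes are not whole numbers, rounding may make the total differ slightly from the required sample size.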

Clustered random sampling:

When we have a large population extended over a large geographical area, it is better to take a multi-stage random sample.

By this method we select in stages. In each stage, the selection is done by simple random sampling.

E.g. to take a sample for studying the vaccination coverage in children across Iraq, where 25% of the population is to be selected as the sample, we select stage by stage (governorate, town, district, sections, streets, and finally the houses from the selected streets). In each stage we select by simple random sampling.

Lecture Seven

The normal Distribution and its Characteristics

Probability

Probability: is a numerical measure of the likelihood that an event will occur
An experiment: is any process that generates well-defined outcomes
Sample space (S): is the set of all possible outcomes of an experiment
An event (A): is an outcome or set of outcomes that are of interest to the experiment. An event (A) is a subset of the sample space (S)
The probability of an event A {P(A)}: is a measure of the likelihood that the event A will occur

Example: Tossing a coin
Experiment: toss a coin and observe the up face
S = {H, T}, where H = head and T = tail

Example: Tossing a coin twice
Experiment: flip a coin twice and observe the sequence (keeping track of order) of up faces
S = {HH, HT, TH, TT}
A = {tossing at least one head} = {HH, HT, TH}

Example: Tossing a die
Experiment: toss a six-sided die and observe the up face
S = {1, 2, 3, 4, 5, 6}
A = {roll an even number} = {2, 4, 6}

Methods of assigning probability

Classical probability:
Each outcome is equally likely.
It is applicable to games of chance.
In these cases, if there are N outcomes in S, then the probability of any one outcome is 1/N.
If A is any event and nA is the number of outcomes in A, then: $P(A) = \frac{n_A}{N}$

Example: Tossing a die: S = {1, 2, 3, 4, 5, 6}
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
A = {roll an even number} = {2, 4, 6}
P(A) = 3/6 = 0.5
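The classical rule, the number of outcomes in A divided by the number of outcomes in S, can be sketched in Python with exact fractions. The function name `classical_probability` is hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

# Sample space for one toss of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event A: roll an even number.
event_even = {x for x in sample_space if x % 2 == 0}


def classical_probability(event, space):
    """P(A) = nA / N, assuming all outcomes in the space are equally likely."""
    return Fraction(len(event & space), len(space))


p_one = classical_probability({1}, sample_space)          # a single outcome
p_even = classical_probability(event_even, sample_space)  # the event A

print(p_one, p_even, float(p_even))
```

Using `Fraction` keeps the results exact (1/6 and 1/2) rather than rounded decimals.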

Empirical probability is simply the relative frequency with which some event is observed to happen (or fail): the number of times the event occurred divided by the number of trials: $P(A) = \frac{n_A}{N}$
Where: N = total number of trials, nA = number of outcomes producing A

Relative frequency example

| Children No. | Frequency | Relative frequency |
|---|---|---|
| 0 | 40 | 40/215 = 0.19 |
| 1 | 80 | 80/215 = 0.37 |
| 2 | 50 | 50/215 = 0.23 |
| 3 | 30 | 30/215 = 0.14 |
| 4 | 10 | 10/215 = 0.05 |
| 5 | 5 | 5/215 = 0.02 |
| Sum | 215 | 215/215 = 1.00 |

Basic concepts of probability:

Probability values are always assigned on a scale from 0 to 1
A probability near 0 indicates an event is unlikely to occur
A probability near 1 indicates an event is almost certain to occur
A probability near 0.5 indicates an event is just as likely as it is unlikely
The sum of the probabilities of all outcomes must be 1

Definitions

Mutually exclusive events: the occurrence of one event precludes the occurrence of the other event
Independent events: the occurrence of one event does not affect the occurrence or non-occurrence of the other event
Complementary events: all elementary events that are not in the event A are in its complementary event A'. Since P(sample space) = 1, P(A') = 1 - P(A)

Probability Distribution

Defined: It is the distribution of all possible outcomes of a particular event. Examples of probability distributions are:
the binomial distribution (only 2 outcomes are possible on each attempt, and the attempts are statistically independent, e.g. a coin flip)
the normal distribution
other underlying distributions (such as the Poisson, t, F, chi-square, etc.) that are used to make statistical inferences

The normal probability distribution

The normal curve is bell-shaped and has a single peak at the exact centre of the distribution.
The arithmetic mean, median, and mode of the distribution are equal and located at the peak.
The normal probability distribution is symmetrical about its mean (half of the observations are above the mean and half are below).
It is determined by 2 quantities: the mean and the SD.
The random variable has an infinite theoretical range (the tails do not touch the X-axis).
The total area under the curve is = 1

Figure

68% of the area under the curve is within ±1 SD of the mean
95% of the area under the curve is within ±1.96 SD of the mean
99% of the area under the curve is within ±2.58 SD of the mean
Why is the normal distribution important? A/ Because many types of data that are of interest have a normal distribution
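The three areas quoted for 1, 1.96 and 2.58 SD can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper `area_within` is a hypothetical name used only in this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def area_within(k):
    """Area under the normal curve between -k SD and +k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 1.96, 2.58):
    print(f"within {k} SD: {area_within(k):.4f}")
```

The computed areas are approximately 0.68, 0.95 and 0.99, in agreement with the percentages stated above.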

Central Limit theorem

The sampling distribution of means becomes normal as N increases, regardless of the shape of the original distribution.
The binomial distribution becomes approximately normal as N increases.
N.B: The normal distribution is a continuous distribution, while the binomial distribution is a discrete quantitative one.

Standard normal distribution (curve)

A normal distribution with a mean of zero and an SD of 1 is called the standard normal distribution.
Any normal distribution can be converted to the standard normal distribution using the Z-statistic (value).
Z-value (SND): is the distance between the selected value, designated X, and the population mean (M), divided by the population SD (σ): $Z = \frac{X - M}{\sigma}$
The standard normal distribution curve is a bell-shaped curve centred around zero with SD = 1

Z- score

Z-score is often called the standardized value or Standard Normal Deviate (SND). It denotes the number of SDs a data value X is distant from the mean, and in which direction.
A data value less than the mean will have a z-score less than zero;
A data value greater than the mean will have a z-score greater than zero; and
A data value equal to the mean will have a z-score of zero.

Normal curve table

The normal curve table gives the precise percentage of scores (values) between the mean (z-score of zero) and any other z-score. It can be used to determine:
the proportion of scores above or below a particular z-score
the proportion of scores between the mean and a particular z-score
the proportion of scores between two z-scores

By converting raw scores to z-scores, the table can be used in the same way for raw scores.
It can also be used in the opposite way: to determine the z-score for a particular proportion of scores under the normal curve.
* The table lists positive z-scores
* It can work for negatives too
* Why? Because the curve is symmetrical

Steps for figuring percentage above or below a z-score:

Convert the raw score to a z-score, if necessary
Draw a normal curve:
- indicate where the z-score falls
- shade the area you are trying to find
Find the exact percentage with the normal curve table

Steps for figuring a z-score or raw score from a percentage:
1. Draw a normal curve, shading an approximate area for the percentage concerned
2. Find the exact z-score using the normal curve table
3. Convert the z-score to a raw score, if desired
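
The reverse lookup, from a percentage back to a z-score and then to a raw score, can be sketched in Python as follows. The inverse of the table is assumed here to be `statistics.NormalDist.inv_cdf`; the function names and the figures M = 2000, SD = 200 and "top 5%" are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def z_for_upper_tail(proportion):
    """z-score that leaves `proportion` of the area above it."""
    return NormalDist(0, 1).inv_cdf(1 - proportion)

def raw_score(z, mean, sd):
    """Convert a z-score back to the original scale: X = M + z * SD."""
    return mean + z * sd

z = z_for_upper_tail(0.05)     # z-score cutting off the top 5% of scores
x = raw_score(z, 2000, 200)    # hypothetical M = 2000, SD = 200
print(round(z, 3), round(x, 1))
```

The z-score for the top 5% is about 1.645, so the corresponding raw score lies about 1.645 SD above the mean.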

Example:
For X = 2200, M = 2000, σ = 200: Z = (2200 − 2000)/200 = 1
For X = 1700, M = 2000, σ = 200: Z = (1700 − 2000)/200 = −1.5
A z-value of 1 indicates that the value of 2200 is 1 SD above the mean of 2000, while a z-value of −1.5 indicates that the value of 1700 is 1.5 SD below the mean of 2000.
Example: For M = 500, σ = 365, determine the position of 722 in SD units

Z = (X − M)/σ = (722 − 500)/365 = 0.61

We can also determine how much of the area under the normal curve is found between any point on the curve and the mean. Once you have a z-score, you can use the table to find the area: for a z-score of 0.61 (from table A) the area is 0.2291 ≈ 0.23, i.e. 22.9% or about 23%

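Q/ How much of the population lies between 500 and 722?
A/ 0.23 (the area between the mean and z = 0.61)
Q/ How much of the population lies to the right of 722?
A/ 0.5 − 0.23 = 0.27
Q/ How much of the population lies to the left of 722?
A/ 0.5 + 0.23 = 0.73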
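
The areas for the value 722 (with M = 500 and SD = 365) can be reproduced without a printed table. The sketch below assumes `statistics.NormalDist` in place of table A and rounds z to two decimals, as a printed table would; the helper name `z_score` is illustrative.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of SDs that x lies from the mean (negative if below it)."""
    return (x - mean) / sd

std = NormalDist(0, 1)
z = round(z_score(722, 500, 365), 2)   # rounded to 2 decimals, as in a printed table
between = std.cdf(z) - 0.5             # area between the mean (500) and 722
above = 1 - std.cdf(z)                 # area to the right of 722
below = std.cdf(z)                     # area to the left of 722
print(z, round(between, 4), round(above, 4), round(below, 4))
```

The three areas correspond to roughly 23%, 27% and 73% of the population.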

Example: The daily water usage per person in an area is normally distributed with a mean of 20 gallons and an SD of 5 gallons
Q1/ About 68% of the daily water usage per person in this area lies between what two values?
A/ About 68% of the daily water usage will lie between 15 and 25 gallons
Q2/ What is the probability that a person from this area, selected at random, will use less than 20 gallons per day?
A/ P (X < 20) = 0.5

Q3/ What percent uses between 20 and 24 gallons?
The z-value associated with X = 24: z = (24 − 20)/5 = 0.8
From the table, the area beyond z = 0.8 (the upper tail) is 0.2119. Thus, P (20 < X < 24) = 0.5 − 0.2119 = 0.2881 = 28.81%

Q/ What percent of the population uses between 18 and 26 gallons?
A/ The z-value associated with X = 18: z = (18 − 20)/5 = −0.4, and for X = 26: z = (26 − 20)/5 = 1.2
Thus P (18 < X < 26) = P (−0.4 < Z < 1.2) = 0.6554 − 0.1151 = 0.5403 (where 0.6554 is the area below z = 0.4 and 0.1151 is the area above z = 1.2)
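
The water usage answers can be verified with a few lines of Python. This is a sketch that assumes `statistics.NormalDist` with the stated mean of 20 gallons and SD of 5 gallons; the variable names are illustrative.

```python
from statistics import NormalDist

usage = NormalDist(mu=20, sigma=5)   # daily water usage per person, in gallons

p_below_20 = usage.cdf(20)                    # P(X < 20)
p_20_to_24 = usage.cdf(24) - usage.cdf(20)    # P(20 < X < 24)
p_18_to_26 = usage.cdf(26) - usage.cdf(18)    # P(18 < X < 26)
print(round(p_below_20, 4), round(p_20_to_24, 4), round(p_18_to_26, 4))
```

Working directly on the raw scale gives the same areas as converting to z-scores first, because the conversion is built into the distribution object.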

Example: Height of young women:
The distribution of heights of women aged 20-29 years is approximately normal with mean = 64 inches and SD = 2.7 inches
Q/ Approximately 68% of women have a height between ……………. and ………….
Q/ ~ 2.5% of women are shorter than ……..
Q/ Approximately what proportion of women are taller than 72.1 inches?

Lecture Eight

The Confidence Interval and Limits

The Confidence Interval approach is based upon the normal curve distribution

The characteristics of normal distribution could be applied to the distribution of the sample means:

Alternatively:

A confidence interval is a range of values within which the population parameter is expected to occur

If we have a sample mean and we want to estimate the population mean, we need to construct confidence limits around the sample mean

Confidence Interval ( CI)

x̄ − Zα/2 × σ/√n ≤ µ ≤ x̄ + Zα/2 × σ/√n

Where Z= the critical value in a normal probability distribution for computing the upper and lower estimates

搆Ι(搀Ņ搆ȠăxǯЂ୚Ѓ攰‚늘ѓ攰„늘…‡€їǿ́萊̿쎀οRectangle 3ӷМᘉრ௃戨Пyྟྨ5഍㤍┵挠湯楦敤据⁥楬業獴ഽ††ⴠ⸱㘹††Ⱐ††‫⸱㘹 ꄀᘏ㘀Ā਀฀܀㘀ꨀਏ㘀ĀༀЀ绰눀਄ࣰȀdက茀଀䋰缀耀Ѐ㉁଀䄁㼀܀뼀က＀ࠀ耀ዃ뼀Ȁ伀戀樀攀挀琀㐀ကࣰᔊଇ﬈་ᄀ೰섀Ћ䄀ༀЀ绰눀਄ࣰ̀dက茀଀䋰缀耀Ѐ㉁଀䈁㼀܀뼀က＀ࠀ耀ዃ뼀Ȁ伀戀樀攀挀琀㜀ကࣰ鰊鈁ﬂ་ᄀ೰섀Ћ䈀ༀЀ胰눀਄ࣰЀdက茀଀䓰缀耀Ѐ㍁଀䌁㼀܀뼀က＀ࠀ耀ᓃ뼀Ȁ伀戀樀攀挀琀㄀　ကࣰ븀

95% confidence limits = x̄ − 1.96 × σ/√n , x̄ + 1.96 × σ/√n

Assumptions:
The population SD is known
The population is normally distributed
If the population is not normal, use a large sample (n > 30)

Factors affecting interval width:

Sample size (n)
Variability in the population, usually estimated by the SD
Desired level of confidence, usually 95% or 99%

For Qualitative data:

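For 95%: CI = P ± 1.96 × SEP
So: P − Zα/2 × SEP ≤ π ≤ P + Zα/2 × SEP
where P is the sample proportion, π is the population proportion, and SEP is the standard error of the proportion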
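
Both kinds of confidence limits can be sketched in Python. Two assumptions are made here that the slides do not spell out: the standard error of a mean is taken as σ/√n (population SD known), and the standard error of a proportion (SEP) is taken in its usual form √(P(1 − P)/n). The function names and the sample figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence=0.95):
    """Two-sided critical value Z(alpha/2), about 1.96 for 95% confidence."""
    alpha = 1 - confidence
    return NormalDist(0, 1).inv_cdf(1 - alpha / 2)

def ci_mean(xbar, sigma, n, confidence=0.95):
    """Confidence limits for a mean when the population SD (sigma) is known."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return xbar - margin, xbar + margin

def ci_proportion(p, n, confidence=0.95):
    """Confidence limits for a proportion, assuming SEP = sqrt(p(1-p)/n)."""
    margin = z_critical(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical figures, for illustration only
print(ci_mean(xbar=120, sigma=15, n=100))
print(ci_proportion(p=0.30, n=200))
```

Raising the confidence level or the SD widens the interval, while a larger sample narrows it, which matches the factors listed under interval width.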

Lecture Nine - Twelve

Tests of Significance

What is a test of significance?
A/ It is a formal procedure for comparing observed data with a hypothesis whose truth we want to assess. The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree

Stating Hypothesis

Null hypothesis (Ho)
A statistical test begins by supposing that the effect we are looking for is not present. This assumption is called the null hypothesis
Then we try to find evidence against this claim (hypothesis)
Typically, Ho is a statement of “no difference” or “no effect”
We also want to assess the strength of the evidence against the null hypothesis

Alternative Hypothesis (Ha)
It is the statement about the population parameter that we hope or suspect is true (i.e. what we are trying to prove or the effect we are hoping to see)
Ha is a statement of difference or relationship
It can be one-tailed (< or >) (e.g. Ha: µ1 > µ2) or two-tailed (≠) (e.g. Ha: µ1 ≠ µ2)

Types of statistical tests:

Parametric tests: assume that the variables of interest are measured on an interval or ratio scale, usually as continuous quantitative variables. They also assume that the variables are normally distributed
Non-parametric tests: assume that the variables are measured on a nominal or ordinal scale

Steps of hypothesis testing:

1. State the null hypothesis
2. State the alternative hypothesis
3. State the level of significance
4. Choose the correct test statistic
5. Compute the test statistic
6. Determine the critical value of the statistic (needed to reject the Ho) from a table of sampling distribution values
7. Compare the computed value to the critical value
8. Reject or do not reject the Ho

Significance level:

Usually, it is represented as α
It is the probability value below which we start to consider differences significant
Typical levels used are 0.1, 0.05, 0.01 and 0.001
The usual alpha level considered in medicine is 0.05

The Z test

One sample Z-test
That of one sample mean:
Steps for testing one sample mean (with σ known), irrespective of sample size

1. State the Ho (Ho: µ = µ0, where µ0 is the hypothesized population mean)
2. State the H1 (H1: µ ≠ µ0)
3. State the level of significance (for example, 0.05)
4. Calculate the test statistic:

Z = (x̄ − µ0) / (σ/√n)

5. Find the critical value (two-tailed)
 a. Z = 1.96 for α = 0.05
 b. Z = 2.58 for α = 0.01

6. Decision: Reject Ho if |test statistic| > critical value, i.e. P value < the significance level

7. State your conclusion:
If Ho is rejected, there is statistically significant evidence that the population mean differs from the hypothesized value
If Ho is not rejected, there is no statistically significant evidence that the population mean differs from the hypothesized value
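
The steps of the one-sample Z-test can be sketched in Python. The sketch assumes the usual statistic Z = (x̄ − µ0)/(σ/√n) with σ known and a two-tailed alternative; the function name `one_sample_z_test`, the symbol `mu0` for the hypothesized mean, and the data are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample Z-test with known population SD."""
    std = NormalDist(0, 1)
    z = (xbar - mu0) / (sigma / sqrt(n))      # test statistic
    critical = std.inv_cdf(1 - alpha / 2)     # e.g. about 1.96 for alpha = 0.05
    p_value = 2 * (1 - std.cdf(abs(z)))       # two-tailed P value
    reject = abs(z) > critical                # decision rule
    return z, critical, p_value, reject

# Hypothetical data: sample mean 103, hypothesized mean 100, sigma 15, n = 100
z, critical, p_value, reject = one_sample_z_test(xbar=103, mu0=100, sigma=15, n=100)
print(round(z, 2), round(critical, 2), round(p_value, 4), reject)
```

Comparing |Z| with the critical value and comparing the P value with α are two views of the same decision, so they always agree.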

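Z-test for sample proportion: Z = (P − π) / √(π(1 − π)/n), where P is the sample proportion and π is the hypothesized population proportion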

Z-test for differences between 2 means: Z = (x̄1 − x̄2) / SE(x̄1 − x̄2) = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

Testing the difference between 2 sample proportions: Z = (P1 − P2) / Sp1-p2
Where Sp1-p2 = √(P(1 − P)(1/n1 + 1/n2)) and P (pooled) = (n1P1 + n2P2) / (n1 + n2)
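
As an illustration of the test for two sample proportions, the sketch below assumes the commonly used pooled form: the pooled proportion is the total number of events over the total sample size, and the standard error is √(P(1 − P)(1/n1 + 1/n2)). The function name and the counts are hypothetical.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for the difference between two sample proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                          # pooled proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))    # SE of (p1 - p2)
    return (p1 - p2) / se

# Hypothetical counts: 45 of 100 in group 1, 30 of 100 in group 2
z = two_proportion_z(45, 100, 30, 100)
print(round(z, 2))
```

The resulting Z is then compared with 1.96 (α = 0.05) or 2.58 (α = 0.01) for a two-tailed test.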

T-test

One Sample T-test
With a small sample size, when σ is not known, the sample standard deviation S is used to estimate σ and the Z-statistic is replaced by the t-statistic.
t = (x̄ − µ) / (S/√n)
When x̄ is the mean of a random sample of size n from a normal distribution with mean µ, then t has a Student t-distribution with n − 1 degrees of freedom (df)

The df is the number of scores in a sample that are free to vary
The df is a function of the sample size and determines how spread out the distribution is (compared to the normal distribution)

The T-distribution

For example, using the normal curve, 1.96 is the cut-off for a two-tailed test at the 0.05 level of significance
On a t-distribution with 3 df (a sample size of 4), the cut-off is 3.18 for a two-tailed test at the 0.05 level of significance
If your estimate is based on a larger sample of 7 (6 df), the cut-off is 2.45, a critical value closer to that for the normal curve

The t-distribution is bell-shaped and symmetrical, and is used for testing with small sample sizes (n < 30)
The distribution of the values of t is not normal, but its use and shape are somewhat analogous to those of the standard normal distribution of z
t spreads out more and more as the sample size gets smaller
The critical value of t is determined by its df, equal to n − 1

Finding tcrit using t-table

The t-table is very similar to the standard normal table
The bigger the sample size (or df), the closer the t-distribution is to a normal distribution

T-test for two sample means

t = |x̄1 − x̄2| / SE(x̄1 − x̄2)
SE(x̄1 − x̄2) = Spooled × √(1/n1 + 1/n2)
Spooled = √{ [S1²(n1 − 1) + S2²(n2 − 1)] / (n1 + n2 − 2) }
N.B: df for 2 sample means in the t-test = n1 + n2 − 2
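
The two-sample t statistic can be sketched in Python as follows. The sketch assumes the pooled-SD form, in which the pooled variance is the df-weighted average of the two sample variances; the function name `pooled_t` and the summary figures are hypothetical. The critical value is not computed here and would still be read from a t-table at the returned df.

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and df for two independent sample means, using a pooled SD."""
    pooled_var = (sd1**2 * (n1 - 1) + sd2**2 * (n2 - 1)) / (n1 + n2 - 2)
    se = sqrt(pooled_var) * sqrt(1 / n1 + 1 / n2)   # SE of the difference in means
    t = abs(mean1 - mean2) / se
    df = n1 + n2 - 2
    return t, df

# Hypothetical summaries: group 1 (mean 10, SD 2, n 10), group 2 (mean 8, SD 3, n 12)
t, df = pooled_t(10, 2, 10, 8, 3, 12)
print(round(t, 3), df)
```

The computed t is compared with the tabulated critical t for the returned df at the chosen significance level.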

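χ² = ∑ (O − E)² / E, where O is the observed frequency and E is the expected frequency
E = (row total × column total) / grand total
df = (r − 1) (c − 1)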
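
A chi-square statistic for a contingency table can be sketched in Python. The sketch assumes the usual form, summing (O − E)²/E over all cells with each expected count taken as row total × column total / grand total; the function name and the 2 × 2 table are hypothetical.

```python
def chi_square(table):
    """Chi-square statistic and df for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 table of observed counts
chi2, df = chi_square([[20, 30], [30, 20]])
print(chi2, df)
```

In this table every expected count is 25, so each cell contributes 1 to the statistic.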

Lectures Thirteen - Fifteen
The concept of community diagnosis as an application of statistics in measuring population health

In clinical practice and in clinical surveys, we describe patients as individuals with some labeling (diagnosis)

The aim is to classify these individuals into those who are labeled as (diseased) so they receive treatment and those who are considered non-diseased at the time of encounter (it is very exceptional in clinical practice to label an attendant as non-diseased)

Usually no attempt is made in clinical practice to relate sick individuals to the “mother” population to which they belong

In epidemiology, the concern is different. Here, we are interested in both the sick and the non-sick persons. We describe the sick, the characteristics, and the events in relation to the total population to which these attributes are related

Thus, the amount of sickness or disease or the frequency of a characteristic or an event must be quantified in relation to the background “reference” population. The frequency is expressed in terms of rates, proportions, and ratios

Rates

They are used to express the frequency of an event (sickness, disease, birth, death…etc) per unit of size of related population. Time period and place are specified

All rates have:
1. Numerator: cases or events
2. Denominator: population at risk
3. Time limit or reference period
4. A standard multiplication factor, usually a power of 10 (e.g. 100, 1000, 100 000)

Population at risk:

These are the individuals who are at risk of getting ill and thus of contributing to the cases (they become ill or diseased, die, or give birth to live babies), which form the numerator
Generally, the numerator is part of the denominator

           No. of persons with a characteristic or a state, or No. of events,
           during a specified period of time and in a specified place
A rate = ---------------------------------------------------------------------- x K
           population at risk during the same period and at the same place
(Note: K is a multiplication factor)
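
The general rate formula translates directly into a one-line function. This is a minimal Python sketch; the function name `rate`, the default K = 1000 and the figures in the example are hypothetical.

```python
def rate(events, population_at_risk, k=1000):
    """Events per k persons at risk during the same period and in the same place."""
    return events / population_at_risk * k

# Hypothetical: 30 events in a population at risk of 15 000, expressed per 1000
print(round(rate(30, 15000), 2))
# The same events expressed per 100 000
print(round(rate(30, 15000, k=100000), 2))
```

Changing K changes only the scale on which the rate is reported, not the underlying frequency.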

Proportions

express the part (persons affected, number of cases or deaths of a given disease) in relation to the total persons, cases, or deaths due to all diseases
The numerator is part of the denominator, but there is no multiplication factor
The value of a proportion is usually less than unity (less than one)
It equals one only if all individuals at risk become diseased

Ratios

express the number of persons with a characteristic relative to the number of persons without the characteristics. The numerator is not part of the denominator. Ratio is not a common epidemiological parameter.

Example:

In a village, there were 6000 persons. During the year 2001, a total of 240 live births took place, of which 115 were female births. Use these data to measure frequencies of births as events in this population:
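
One way to work through the village data is sketched below in Python. The choice of measures is an assumption made for illustration: a crude birth rate per 1000 population (taking the 6000 persons as the denominator), the proportion of births that were female, and the male to female ratio of births.

```python
population = 6000
live_births = 240
female_births = 115
male_births = live_births - female_births   # 125 male births

crude_birth_rate = live_births / population * 1000   # rate: births per 1000 population
proportion_female = female_births / live_births      # proportion: numerator is part of denominator
sex_ratio = male_births / female_births              # ratio: numerator is not part of denominator

print(round(crude_birth_rate, 1), round(proportion_female, 3), round(sex_ratio, 3))
```

The three results illustrate the distinction between a rate, a proportion and a ratio using the same set of births.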

Definition of basic rates

For convenience, commonly used epidemiological rates can be grouped into three groups:

1. Rates related to fertility

These are useful indicators of the health and demographic characteristics of a population. The rates include:

a. Crude birth rate (CBR):

b. General fertility rate (GFR):
In this rate the numerator is the same as that for the crude birth rate, but the denominator is the total number of women of reproductive age (15 – 49).
        No. of live births in a year in a specific place
GFR = ---------------------------------------------------- x 1000
        No. of women aged (15 – 49) years

c. Marital specific fertility rate:

d. Total fertility rate (TFR):

2. Rates related to morbidity:

Incidence rate:
Incidence of a disease is the number of new cases or episodes of disease which occur during a specified period of time in a specific population or place. The incidence rate (IR) is the number of new cases or episodes (spells) of disease per unit of size of population.
       Number of new cases of a disease in a year in a given population
IR = ------------------------------------------------------------------- X 1000
       Total population at risk in the same year
Or
       Number of new spells of disease in a year in a given population
IR = ------------------------------------------------------------------- X 1000
       Total population at risk in the same year

Which of these two rates do you expect to have higher value if both are calculated for the same population during a given year?

Incidence rate is more useful in the following situations:

To study diseases of short duration
To study the aetiology of disease
To evaluate preventive measures
To determine the risk of acquiring a disease
To assess transmission of an infectious agent

Prevalence rate:

Prevalence refers to the total number of cases of a disease or condition existing in a given population at a point in time (point prevalence) or during a period of time (period prevalence). According to time specification, the prevalence rate (number of existing cases per unit of size of population) is of two types:

Point prevalence rate (PnPR):

It is the commonly used prevalence rate and measures the probability of disease existence at a point in time in a given population:
         Number of existing cases (new & old) in a given population at a point in time
PnPR = ------------------------------------------------------------------------------- X 1000
         Total population in the same place and at the same point in time

Period prevalence rate (PrPR):

Less commonly used. The numerator includes all cases (new & old) existing in the population during a given period of time.
         Number of existing cases during a period of time
PrPR = ------------------------------------------------------------ X 1000
         Total population at risk in the same place and time period

Notes:

The term “point in time” may be as long as it takes to get information on existing cases in cross-sectional studies.
Period prevalence during a year is equal to all cases existing at the beginning of the year plus all new cases which occur during the year, regardless of their fate (death, recovery, or infirmity).
Prevalence rates are useful for:
1. Diseases of long duration
2. Administrative purposes

Relationship of incidence and prevalence

Prevalence of disease may vary from place to place or from time to time because of variation in incidence and/or duration of the disease
In an epidemiologically stable situation, i.e. with constant incidence and duration of disease, the following relationship may be stated:

Prevalence = Incidence X Duration
Duration of a disease is a function of its fatality and its tendency to recover. The higher the case fatality of a disease, the shorter its average duration. Similarly, the quicker the recovery from the disease, the shorter the duration

Actually point prevalence of any disease is a function of its incidence rate and the rate at which cases die or completely recover
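
The stable-situation relationship between prevalence, incidence and duration can be illustrated with a short Python sketch. The function names and the figures (2 new cases per 1000 per year, average duration 5 years) are hypothetical, and the relationship holds only under the constant incidence and duration assumed in the text.

```python
def steady_state_prevalence(incidence_rate, mean_duration):
    """Prevalence = Incidence x Duration, for a stable situation only."""
    return incidence_rate * mean_duration

def implied_duration(prevalence, incidence_rate):
    """Average duration implied by a given prevalence and incidence."""
    return prevalence / incidence_rate

# Hypothetical: 2 new cases per 1000 per year, average duration 5 years
prevalence = steady_state_prevalence(2, 5)       # cases per 1000 population
print(prevalence, implied_duration(prevalence, 2))
```

A disease with the same incidence but a shorter duration, through either rapid recovery or high case fatality, would show a proportionally lower prevalence.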

3. Rates related to mortality:

These rates measure the impact of disease on the population in terms of death, thus they reflect in general the severity of disease and the quality of health care services

The commonly used mortality rates are:

1. Perinatal mortality rate: It relates total stillbirths and deaths in the first week of life to total births.
2. Stillbirth rate: It relates stillbirths to total births.
3. Neonatal mortality rate (early and late)
4. Post neonatal mortality rate
5. Infant mortality rate (including neonatal and post neonatal rates)
6. Crude death rate: It relates all deaths due to all causes to mid-year population.

4. Other useful rates:

A number of other useful rates are commonly used in the field of health care epidemiology. Examples are utilization rates, coverage rates, performance rates, bed occupancy rates, adequacy rates, proficiency rates…. etc.

Exercise:

A household survey was carried out to generate relevant data on certain aspects of the health status of a given population. The following results were reported:
Total population: 50 000
Annual live births: 1800
Annual total deaths: 300
Annual infant deaths: 80
Number of new cases of pneumonia: 64
Number of deaths due to pneumonia: 2
Number of infants who received BCG: 1650
Q/ Use the data above to calculate appropriate rates to describe the health status of the population