OVERVIEW OF DIFFERENT STATISTICAL TESTS
USED IN EPIDEMIOLOGICAL STUDIES AND THEIR
APPLICATIONS
GUIDED BY-
Dr. Shraddha Mishra, Associate Professor
Dr. Shatkratu Dwivedi, Assistant Professor
PRESENTED BY-
Dr. Shefali Jain (P.G. 1st year)
DATA TYPES
Variable (characteristic)
• Quantitative (numerical)
   Continuous: takes any value, e.g. height
   Discrete: integers, e.g. no. of children
• Qualitative (categorical)
   Ordinal: obvious order, e.g. Likert scale
   Binary/dichotomous: two classes, e.g. disease (yes/no)
   Nominal: no meaningful order, e.g. gender
QUANTITATIVE DATA
 Quantitative data are data that can be measured
numerically and may be continuous or discrete.
• Continuous data lie on a continuum and so can take
any value between two limits.
• Discrete data do not lie on a continuum and can
only take certain values, usually counts (integers).
 Quantitative data can be further classified as being on
an ‘interval scale’ or on a ‘ratio scale’.
 Categorical data
Categorical data are data where individuals fall into
a number of separate categories or classes.
 Dichotomous data
This is where there are only two classes and all
individuals fall into one or other of the classes.
These data are also known as binary data.
 Categorizing continuous data
It is possible to reclassify continuous data into
groups, perhaps for ease of reporting.
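Categorizing a continuous variable can be sketched as below. The cut-offs are the standard WHO adult BMI categories; the BMI values themselves are hypothetical.

```python
# Reclassifying a continuous variable (BMI) into categories for reporting.
# Cut-offs follow the WHO adult classification; the data are hypothetical.
def bmi_group(bmi):
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

bmis = [17.2, 22.4, 27.8, 31.5]
print([bmi_group(b) for b in bmis])
```

Note that categorizing discards information, so it is usually done only at the reporting stage, not for the main analysis.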
Dependent variable
• Scale (continuous)
   Normally distributed: parametric test
   Skewed data: nonparametric test
• Categorical
   Ordinal: nonparametric test
   Nominal: chi-squared test
STATISTIC
 A statistic is any quantity that is calculated from a
set of data.
 For example mean blood pressure calculated in a group
of subjects is a statistic.
 Another example is the proportion of people who are
overweight in a sample.
 There are many different statistics that can be calculated
from data and the choice of which to use is driven partly
by the type of data and partly by the purpose of the
study.
 Statistics is a powerful tool for analyzing
data
1. Descriptive Statistics - provide an overview of the
attributes of a data set. These include measurements
of central tendency (frequency histograms, mean,
median, & mode) and dispersion (range, variance &
standard deviation)
2. Inferential Statistics - provide measures of how well
your data support your hypothesis and if your data are
generalizable beyond what was tested (significance
tests)
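The descriptive measures listed above can be computed with Python's standard-library `statistics` module; the blood-pressure readings here are hypothetical.

```python
# Descriptive statistics for a small (hypothetical) sample of
# systolic blood pressures, using only the Python standard library.
import statistics as st

bp = [118, 122, 130, 124, 118, 135, 128, 121]

mean = st.mean(bp)          # central tendency
median = st.median(bp)
mode = st.mode(bp)
variance = st.variance(bp)  # sample variance (n - 1 denominator)
sd = st.stdev(bp)           # sample standard deviation
rng = max(bp) - min(bp)     # range (dispersion)

print(mean, median, mode, round(sd, 2), rng)
```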
A VARIABLE
 A variable is a quantity that is measured or
observed in an individual and which varies from
person to person.
 For example, blood pressure is a variable because
blood pressure varies from person to person.
 Blood group is another example, as it also varies
from person to person.
 Gender is also a variable: people can be classified
as either male or female.
DECIDING ON APPROPRIATE STATISTICAL
METHODS FOR RESEARCH
What is the main research
question?
Which variables (types of
measurement) will help answer
the research question?
Which is the dependent
(outcome) variable and what
type of variable is it?
CONTD…..
Which are the independent
(explanatory) variables, how many
are there and what data types are
they?
Are relationships or differences
between means of interest?
Are there repeated measurements of
the same variable for each subject?
NULL HYPOTHESIS AND ALTERNATE HYPOTHESIS
 The null hypothesis is the baseline hypothesis
which is usually of the form ‘there is no difference’ or
‘there is no association’.
 The corresponding alternative hypothesis is ‘there is
a difference’ or ‘there is an association’.
 Examples
• Does a new treatment reduce blood pressure more than
an existing treatment?
 The null hypothesis is that mean blood pressure is
the same in the two treatment groups
 The alternative hypothesis is that mean blood
pressure is different in the two treatment groups
• Is there an association between blood pressure and
risk of cardiovascular disease?
 The null hypothesis is that there is no
association between blood pressure and risk of
cardiovascular disease
 The alternative hypothesis is that blood
pressure is associated with a change in risk of
cardiovascular disease
STEPS IN DOING A SIGNIFICANCE TEST
1. Specify the hypothesis of interest as a null and
alternative hypothesis
2. Decide what statistical test is appropriate
3. Use the test to calculate the P value
4. Weigh the evidence from the P value in favour of
the null or alternative hypothesis
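The four steps can be sketched for a two-group blood-pressure comparison. The readings are hypothetical and the t statistic is computed by hand with the standard library; in practice a routine such as scipy.stats.ttest_ind does this and returns the exact P value.

```python
# Steps of a significance test, illustrated with a pooled two-sample
# t-test on hypothetical blood-pressure readings.
import math
import statistics as st

# Step 1: H0: mean BP is the same in both groups; H1: it differs.
new_drug = [128, 124, 131, 119, 125, 127]
old_drug = [139, 135, 142, 133, 137, 140]

# Step 2: two independent, roughly normal groups -> two-sample t-test.
n1, n2 = len(new_drug), len(old_drug)
m1, m2 = st.mean(new_drug), st.mean(old_drug)
sp2 = ((n1 - 1) * st.variance(new_drug) +
       (n2 - 1) * st.variance(old_drug)) / (n1 + n2 - 2)  # pooled variance
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Steps 3-4: compare |t| with the 5% critical value for 10 df (2.228);
# |t| above it means P < 0.05, so we reject H0.
print(round(t, 2), abs(t) > 2.228)
```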
ERRORS IN SIGNIFICANCE TESTING
• Type 1 error: this is getting a significant result in a
sample when the null hypothesis is in fact true in the
underlying population (‘false significant’ result).
 We usually set a limit of 0.05 (5%) for the probability of a
type 1 error, which is equivalent to a 0.05 cut-off for
statistical significance.
• Type 2 error: this is getting a non-significant result in
a sample when the null hypothesis is in fact false in the
underlying population (‘false non-significant’ result).
 It is widely accepted that the probability of a type 2 error
should be no more than 0.20 (20%).
STANDARD ERROR OF THE MEAN
 Suppose we selected many samples, then the
sample means would follow a distribution known as
the sampling distribution of the mean.
 We could calculate the mean of these sample
means, and the standard deviation. The standard
deviation of the sample means is known as the
standard error of the mean and provides an
estimate of the precision of the sample mean.
 Standard error of the sample mean:
SE(mean) = SD/√n
where SD is the standard deviation of the data and
n is the sample size
 Note as n increases, SE decreases and so
precision is greater for larger samples
 95% confidence interval for a mean from a large
sample: mean − 1.96 × SE(mean) to mean + 1.96 × SE(mean)
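The SE and large-sample 95% CI formulas above translate directly into code; the data here are hypothetical.

```python
# Standard error and large-sample 95% CI for a sample mean
# (hypothetical measurements, stdlib only).
import math
import statistics as st

data = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
n = len(data)
mean = st.mean(data)
se = st.stdev(data) / math.sqrt(n)          # SE = SD / sqrt(n)
ci = (mean - 1.96 * se, mean + 1.96 * se)   # large-sample 95% CI
print(round(mean, 2), round(se, 3), [round(x, 2) for x in ci])
```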
DEALING WITH UNCERTAINTY
 Sample estimates are subject to sampling variation, so
statistical methods based on probability theory are
used to quantify this uncertainty:
• If we are estimating some quantity from our data, for
example, the proportion of patients who have a
particular attribute, then we can quantify the
imprecision in the estimate using a confidence
interval
• If we are testing a hypothesis, for example,
comparing blood pressure in two groups, then we
can do a statistical significance test which helps
us to weigh the evidence that the sample difference
we have observed is in fact a real difference
CHOICE OF PERCENTAGE FOR CONFIDENCE
INTERVALS (CI)
 95% is the most commonly used percentage for CIs
and the multiplier is 1.96 for large samples
 90% CI has a probability of 90% of containing the
true value and uses the multiplier 1.64 rather than
1.96
 99% CI has a probability of 99% of containing the
true value and uses the multiplier 2.58
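The multipliers quoted above are simply two-sided quantiles of the standard normal distribution, which the standard-library `statistics.NormalDist` class reproduces.

```python
# CI multipliers as standard normal quantiles (stdlib only).
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sd 1
for level in (0.90, 0.95, 0.99):
    multiplier = z.inv_cdf((1 + level) / 2)  # two-sided quantile
    print(f"{level:.0%} CI multiplier: {multiplier:.2f}")
```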
WHAT IS A P VALUE?
 P value is the probability, given that the null hypothesis
is true, of obtaining data as extreme or more extreme
than that observed
• 0.05 or 5% is commonly used as a cut-off, such that if the
observed P is less than this (P<0.05) we consider that
there is good evidence that the null hypothesis is not
true. This is directly related to the type 1 error rate.
• If 0.05 is the cut-off then P<0.05 is commonly described
as statistically significant and P≥0.05 is described
as not statistically significant
WHAT IS PROBABILITY ?
• The proportion of times an event happens in the
long run which can be estimated from a proportion
calculated in a sample
 For example, the proportion of stillbirths out of total
births in England and Wales in 2006 was
3602/673,203 = 0.0054. Since this was a census and
therefore a large sample, we can use this as an
estimate of the probability that a baby born in
England and Wales will be stillborn.
THREE BASIC RULES OF PROBABILITIES
1. A probability must lie between 0 and 1 inclusive
2. If two events are mutually exclusive so that they
cannot both happen, the probability of either
happening is the sum of the individual probabilities
3. If two events are independent then the probability
of both occurring is the product of the individual
probabilities
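Rules 2 and 3 can be illustrated with a coin toss (a standard example, not from the slides above):

```python
# Rule 2 (mutually exclusive) and rule 3 (independent), with a fair coin.
p_heads = 0.5
p_tails = 0.5

# Heads and tails are mutually exclusive: P(heads or tails) is the sum.
p_either = p_heads + p_tails        # = 1.0

# Two tosses are independent: P(heads on both) is the product.
p_both_heads = p_heads * p_heads    # = 0.25

print(p_either, p_both_heads)
```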
INTERPRETATION OF THE RULES
1. If a probability is 0 then the event never happens. If it is
1, the event always happens.
2. If two events are mutually exclusive then only one can
happen.
 For example death and survival are mutually exclusive – a
patient cannot both survive and die at the same time.
3. If two events are independent then the fact that one has
happened does not affect the chance of the other event
happening.
 For example the probability that a pregnant woman gives
birth to twins (event 1) and the probability of a white
Christmas (event 2). These two events are unconnected
since the probability of giving birth to twins is not related to
the weather at Christmas.
WHAT IS A STATISTICAL TEST?
 A statistical test provides a mechanism for making
quantitative decisions about a process or
processes. The intent is to determine whether there
is enough evidence to "reject" a conjecture or
hypothesis about the process
NORMAL DISTRIBUTION
 A very common continuous probability distribution.
All normal distributions are symmetric, with a
bell-shaped curve and a single peak.
 68% of the observations fall within 1 standard
deviation of the mean
 95% of the observations fall within 2 standard
deviations of the mean
 99.7% of the observations fall within 3 standard
deviations of the mean: for a normal distribution,
almost all values lie within 3 standard deviations of
the mean
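The 68-95-99.7 rule can be verified from the standard normal CDF using the standard-library `statistics.NormalDist`:

```python
# Checking the 68-95-99.7 rule: P(-k < Z < k) for k = 1, 2, 3.
from statistics import NormalDist

z = NormalDist()  # standard normal
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {within:.1%}")
```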
NORMAL DISTRIBUTION CURVE
PARAMETRIC OR NONPARAMETRIC
Parametric tests are used when
 the data are measured on an interval or ratio
scale and come from a normal distribution
 population variances are equal.
Nonparametric tests are used when
 the data are nominal or ordinal
 the assumptions of parametric tests are not met.
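The decision rules above can be mirrored in a toy helper function; the labels and the function itself are illustrative, not a standard API.

```python
# A toy decision helper mirroring the rules above (illustrative only).
def choose_test_family(scale, normal=None, equal_variances=None):
    """Return 'parametric' or 'nonparametric' for a dependent variable."""
    if scale in ("nominal", "ordinal"):
        return "nonparametric"
    if scale in ("interval", "ratio") and normal and equal_variances:
        return "parametric"
    return "nonparametric"

print(choose_test_family("ratio", normal=True, equal_variances=True))
print(choose_test_family("ordinal"))
```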
Type of test          Use
Correlational         These tests look for an association
                      between variables
Pearson correlation   Tests the strength of the association
                      between two continuous variables
Spearman correlation  Tests the strength of the association
                      between two ordinal variables (does not
                      rely on the assumption of normally
                      distributed data)
Chi-square            Tests the strength of the association
                      between two categorical variables
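The Pearson correlation coefficient can be computed by hand as below (this is the quantity scipy.stats.pearsonr reports); the height/weight values are hypothetical.

```python
# Hand-rolled Pearson correlation on a hypothetical height/weight sample.
import math

height = [150, 160, 165, 170, 180]   # cm
weight = [52, 60, 63, 70, 80]        # kg

n = len(height)
mx, my = sum(height) / n, sum(weight) / n
cov = sum((x - mx) * (y - my) for x, y in zip(height, weight))
sx = math.sqrt(sum((x - mx) ** 2 for x in height))
sy = math.sqrt(sum((y - my) ** 2 for y in weight))
r = cov / (sx * sy)          # Pearson r, between -1 and +1
print(round(r, 3))
```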
Type of test          Use
Paired T-test         Tests the difference between two related
                      samples
Independent T-test    Tests the difference between two
                      independent groups
ANOVA                 Tests the difference between group means
                      after any other variance in the outcome
                      variable is accounted for
Simple regression     Tests how change in the predictor variable
                      predicts the level of change in the
                      outcome variable
Multiple regression   Tests how change in the combination of two
                      or more predictor variables predicts the
                      level of change in the outcome variable

Regression: assesses whether change in one variable predicts change in
another variable
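The simple regression row can be illustrated by fitting the least-squares slope and intercept by hand; the dose/response values are hypothetical.

```python
# Simple linear regression (ordinary least squares) by hand,
# on hypothetical dose/response data.
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.1, 8.0, 9.9]

n = len(dose)
mx, my = sum(dose) / n, sum(response) / n
# slope = covariance(x, y) / variance(x)
slope = (sum((x - mx) * (y - my) for x, y in zip(dose, response))
         / sum((x - mx) ** 2 for x in dose))
intercept = my - slope * mx   # line passes through (mean x, mean y)
print(round(slope, 2), round(intercept, 2))
```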
 Non-parametric tests: used when the data do not meet
the assumptions required for parametric tests

Type of test          Use
Wilcoxon rank-sum     Tests the difference between two
test                  independent groups; takes into account
                      magnitude and direction of difference
Wilcoxon signed-rank  Tests the difference between two related
test                  samples; takes into account the magnitude
                      and direction of difference
Sign test             Tests if two related samples are
                      different; ignores the magnitude of
                      change, only takes into account direction
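The sign test is simple enough to compute exactly with the standard library: count the direction of each paired difference and get a two-sided P value from the binomial distribution. The paired before/after readings are hypothetical.

```python
# A minimal sign test (direction of change only), with an exact
# two-sided binomial P value; hypothetical paired readings.
from math import comb

before = [140, 132, 128, 145, 150, 138, 141, 133]
after = [135, 130, 129, 138, 142, 136, 137, 130]

diffs = [b - a for b, a in zip(before, after) if b != a]  # drop ties
n = len(diffs)
pos = sum(d > 0 for d in diffs)   # number of positive differences
k = min(pos, n - pos)
# Under H0 each sign is + or - with probability 0.5; two-sided exact P
# is the probability of a split at least this extreme.
p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
print(pos, n, round(min(p, 1.0), 4))
```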
FLOWCHART FOR HYPOTHESIS TESTS