Data Management & Analysis In Research

A practical approach

Faculty of Family Medicine Workshop

April 28, 2023

Dr Samuel Blay Nguah FWACP FGCPS


Workshop outline

Data Management

Database software
Database design & validation
Data verification
- Double
- Single
Data warehousing
Data migration
Data cleaning

Data Analysis

Planning your analysis
- Text, dummy tables and figures
Software
Data cleaning and missing data
Descriptive statistics
- Continuous variables
- Categorical variables
Inferential statistics
- Hypothesis testing
- P-values
- Confidence interval
Graphical presentations

Our study for the day

Aim
Determine if the New Drug has a better BP lowering effect after 2 weeks of administration compared to the Control Drug

Study type
Randomized Controlled Trial

Variables to be collected
- Age in years
- Sex
- Initial BP
- Final BP

Our study for the day

Study questions
- How much does the New drug lower the BP?
- How much does the Control drug lower the BP?
- Which of the two drugs lowers the BP better?
- Is there a difference in the BP lowering effect of the two drugs?
- Is the BP lowering effect related to the age of the patient?
- Is the BP lowering effect related to the sex of the patient?


Data Management

Data management software


Database design, validation and verification

Validation
- Limits
- Valid ranges
- Allowable values
- Some software is better at this than others

Cleaning
- Regular review of filled questionnaires
- Weekly checking of entered data for correctness

Verification
Single entry
- 10% verification
- Whole database verification
Double entry
- Create an identical database
- Enter the data twice
- Picks up data entry errors
- Compare the data from both databases
- Identify discrepancies
- Correct errors as necessary
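The validation and double-entry ideas above can be sketched in a few lines of Python. This is an illustration only, not the workshop's own tooling: the rule set `VALID_RULES`, the helpers `validate_record` and `compare_entries`, the numeric limits, and the second-pass value 90.1 are all assumptions made for the example.

```python
# Hypothetical validation rules; the limits are illustrative, not from the workshop.
VALID_RULES = {
    "age": lambda v: 18 <= v <= 120,     # valid range
    "sex": lambda v: v in {"F", "M"},    # allowable values
    "bp1": lambda v: 40 <= v <= 200,     # valid range
}


def validate_record(record):
    """Return the names of fields whose values break a validation rule."""
    return [name for name, ok in VALID_RULES.items()
            if record.get(name) is not None and not ok(record[name])]


def compare_entries(first, second):
    """Double entry: list (id, field, value1, value2) wherever the two passes disagree."""
    discrepancies = []
    for study_id, rec1 in first.items():
        rec2 = second.get(study_id, {})
        for field, value1 in rec1.items():
            if rec2.get(field) != value1:
                discrepancies.append((study_id, field, value1, rec2.get(field)))
    return discrepancies


# Two hypothetical data-entry passes over the same questionnaires.
entry_a = {"C1": {"age": 63, "sex": "F", "bp1": 97.4},
           "C7": {"age": 61, "sex": "F", "bp1": 290.1}}
entry_b = {"C1": {"age": 63, "sex": "F", "bp1": 97.4},
           "C7": {"age": 61, "sex": "F", "bp1": 90.1}}

invalid_fields = validate_record(entry_a["C7"])
mismatches = compare_entries(entry_a, entry_b)
print(invalid_fields)
print(mismatches)
```

Each discrepancy still has to be resolved against the paper questionnaire; the comparison only shows where the two passes differ.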

Data Warehousing, migration & cleaning

Warehousing
- Back up the data regularly (3 copies)
- Back up with versions and dates
- Keep in the appropriate format: Microsoft Excel, text files, PDF, TIFF

Migration
- Moving data around
- Data should be in a stable state
- Not all software requires this

Cleaning
- Involves picking out erroneous data
- Picks up data collection errors and data entry errors
- Strategy depends on the variable type: continuous, discrete, categorical, etc.

Data Analysis

Data Types


Variable types

Independent (predictor) variable
- Potentially influences, affects or predicts another variable
- E.g.: In a study of how age influences income, age is the independent variable

Dependent (predicted) variable
- Potentially predicted, influenced or affected by another variable
- E.g.: In a study of how age influences income, income is the dependent variable

Data analysis

Software
- R: Analysis only
- Microsoft Excel: Entry and analysis
- Stata: Analysis only
- SPSS: Entry and analysis

Know your data

Variables
- id = Study ID
- treat = Treatment given
- age = Age of participant
- sex = Sex of patient
- bp1 = Initial mean arterial BP
- bp2 = Final mean arterial BP

id	treat	age	sex	bp1	bp2
C1	0	63	F	97.4	93.1
C2	0	NA	F	97.2	92.4
C6	0	62	F	103.4	99.7
C7	0	61	F	290.1	88.4
C9	0	73	F	96.4	91.1
C10	0	57	F	98.6	90.5
C13	0	61	F	97.4	93.8
C14	0	999	F	97.4	92.6
A18	0	51	F	92.2	86.2
A20	0	65	F	96.9	90.4
A21	NA	65	F	102.6	91.5
C3	0	54	M	98.8	94.6
C4	0	69	M	98.4	92.3
C5	0	75	M	89.8	89.3
C8	0	59	M	93.7	90.4

Summarizing data

      id                treat             age             sex           
 Length:50          Min.   :0.0000   Min.   : 45.00   Length:50         
 Class :character   1st Qu.:0.0000   1st Qu.: 57.75   Class :character  
 Mode  :character   Median :1.0000   Median : 63.00   Mode  :character  
                    Mean   :0.5714   Mean   : 81.00                     
                    3rd Qu.:1.0000   3rd Qu.: 65.00                     
                    Max.   :1.0000   Max.   :999.00                     
                    NA's   :1        NA's   :2                          
      bp1             bp2       
 Min.   : 87.5   Min.   :78.00  
 1st Qu.: 96.0   1st Qu.:85.20  
 Median : 98.0   Median :88.40  
 Mean   :102.4   Mean   :88.61  
 3rd Qu.:101.2   3rd Qu.:92.30  
 Max.   :290.1   Max.   :99.70  
 NA's   :1       NA's   :1
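The summary above already hints at problems: a maximum age of 999 and a maximum initial BP of 290.1. As a rough sketch of why such values matter, the Python snippet below uses the ages from the 15 rows displayed earlier and compares the mean with and without the implausible value. The cut-off of 120 years is an assumption made for illustration.

```python
from statistics import mean, median

# Ages from the 15 rows displayed earlier; None marks the missing value (NA).
ages = [63, None, 62, 61, 73, 57, 61, 999, 51, 65, 65, 54, 69, 75, 59]

present = [a for a in ages if a is not None]
n_missing = len(ages) - len(present)

# Assumption for illustration: ages above 120 years are treated as implausible.
plausible = [a for a in present if a <= 120]

print("missing:", n_missing)
print("maximum:", max(present))
print("mean with 999:", round(mean(present), 1))
print("mean without 999:", round(mean(plausible), 1))
print("median with 999:", median(present))
```

A single miscoded value drags the mean far above any realistic age while the median barely moves, which is why the minimum, maximum and mean are worth checking before any analysis.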

Pre-processing of data

Data cleaning
- Involves picking out erroneous data
- Picks up data collection errors and data entry errors
- Strategy depends on whether the variable is continuous or discrete

Steps (personal)
Check study id
- Any duplication in whole data?
- Any duplication in study id?
- Any missing study id?
- Sort them out if possible
General overview of data
- Single categorical variables
- Single continuous variables
- Combination of variables - Categorical
- Combination of variables - Continuous
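The study id checks listed above can be sketched as follows. The rows are hypothetical and were made up to contain one exact duplicate, one reused id and one missing id; they are not taken from the workshop data.

```python
from collections import Counter

# Hypothetical rows of (study id, age, sex), invented for this illustration.
rows = [
    ("C1", 63, "F"),
    ("C2", 58, "F"),
    ("C2", 58, "F"),   # exact copy of the previous row
    (None, 61, "M"),   # missing study id
    ("C6", 62, "F"),
    ("C6", 70, "M"),   # same id, different data
]

# Any duplication in the whole data?
duplicate_rows = [row for row, n in Counter(rows).items() if n > 1]

# Any duplication in study id?
id_counts = Counter(row[0] for row in rows if row[0] is not None)
duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

# Any missing study id? (positions are zero-based)
missing_id_rows = [pos for pos, row in enumerate(rows) if row[0] is None]

print(duplicate_rows, duplicate_ids, missing_id_rows)
```

An exact duplicate can usually be dropped, whereas a reused id with different data needs checking against the source documents.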

Pre-processing of data

Missing data
Know the pattern
Know how to deal with them
- Dropping observations
- Imputation: commonest observation, median/mean, or MICE (multiple imputation by chained equations)
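The simpler options above (dropping observations, or imputing the median or the commonest observation) can be sketched in Python as below. The ages are the first seven from the table shown earlier; the `treat` values and the helper names `drop_missing` and `impute` are assumptions for illustration. MICE is not shown because it needs a dedicated statistical package.

```python
from statistics import median, mode


def drop_missing(values):
    """Dropping observations: keep only the non-missing values."""
    return [v for v in values if v is not None]


def impute(values, fill):
    """Replace every missing value (None) with the supplied fill value."""
    return [fill if v is None else v for v in values]


ages = [63, None, 62, 61, 73, 57, 61]    # continuous: impute the median
treat = [0, 0, 0, 0, 0, 0, None]         # categorical: impute the commonest value

ages_median_filled = impute(ages, median(drop_missing(ages)))
treat_mode_filled = impute(treat, mode(drop_missing(treat)))

print(ages_median_filled)
print(treat_mode_filled)
```

Which option is defensible depends on the missing pattern, so the pattern should be examined before any of these is applied.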

Missing pattern


Pre-processing of data

Generating new variables
Convert data to appropriate types
- sex to categorical variable (factor)
- treat to categorical variable
Fill missing data with appropriate data
Correct abnormal values in BPs
Generate variable(s)
- bp_diff = Difference in BP
- age_group: Elderly (>60 years), Middle age (<=60 years)
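A minimal Python sketch of these steps, using three rows from the table shown earlier. Two things are assumptions rather than statements from the slides: the coding 0 = Old Drug and 1 = New Drug, and the direction of `bp_diff` (initial minus final BP). The helper `add_derived` is a hypothetical name.

```python
# Three rows from the table shown earlier.
rows = [
    {"id": "C1", "treat": 0, "age": 63, "sex": "F", "bp1": 97.4, "bp2": 93.1},
    {"id": "C10", "treat": 0, "age": 57, "sex": "F", "bp1": 98.6, "bp2": 90.5},
    {"id": "C5", "treat": 0, "age": 75, "sex": "M", "bp1": 89.8, "bp2": 89.3},
]

# Assumed coding: 0 = Old Drug, 1 = New Drug; F = Female, M = Male.
TREAT_LABELS = {0: "Old Drug", 1: "New Drug"}
SEX_LABELS = {"F": "Female", "M": "Male"}


def add_derived(row):
    """Return a copy of the row with labelled categories and the new variables."""
    out = dict(row)
    out["treat"] = TREAT_LABELS[row["treat"]]
    out["sex"] = SEX_LABELS[row["sex"]]
    # Assumed direction: initial BP minus final BP, so a positive value is a fall in BP.
    out["bp_diff"] = round(row["bp1"] - row["bp2"], 1)
    out["age_group"] = "Elderly" if row["age"] > 60 else "Middle age"
    return out


cleaned = [add_derived(r) for r in rows]
for r in cleaned:
    print(r["id"], r["treat"], r["sex"], r["bp_diff"], r["age_group"])
```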

Data summary after cleaning

      id                 treat         age            sex          bp1        
 Length:50          Old Drug:22   Min.   :45.00   Female:26   Min.   : 87.50  
 Class :character   New Drug:28   1st Qu.:57.25   Male  :24   1st Qu.: 95.62  
 Mode  :character                 Median :63.00               Median : 97.70  
                                  Mean   :61.48               Mean   : 98.30  
                                  3rd Qu.:65.00               3rd Qu.: 99.40  
                                  Max.   :75.00               Max.   :111.70  
      bp2           bp_diff            age_group 
 Min.   :78.00   Min.   : 0.500   Middle age:17  
 1st Qu.:85.22   1st Qu.: 4.800   Elderly   :33  
 Median :88.15   Median : 8.250                  
 Mean   :88.60   Mean   : 9.704                  
 3rd Qu.:92.10   3rd Qu.:13.700                  
 Max.   :99.70   Max.   :26.300

Descriptive statistics

Categorical variable
Frequency tables (univariate)
Contingency tables (bivariate)
- Row percentage
- Column percentage
Graphical representations
- Bar chart
- Pie chart
- Others
Odds & Odds Ratio
Risk & Risk Ratio

Continuous variable
Measures of central tendency
- Mean: arithmetic, geometric, harmonic
- Median
- Mode
Measures of dispersion
- Standard deviation
- Variance
- Interquartile range
- Range

Categorical variables

Univariate analysis

**Table 1**: Univariate categorical table
Characteristic	N = 50
Treatment Type, n (%)
Old Drug	22 (44.0)
New Drug	28 (56.0)
Sex, n (%)
Female	26 (52.0)
Male	24 (48.0)
Age Grouping, n (%)
Middle age	17 (34.0)
Elderly	33 (66.0)

Bivariate analysis

**Table 2**: Bivariate categorical table
	Treatment Type		Total
	Old Drug	New Drug	Total
Sex
Female	11 (42%)	15 (58%)	26 (100%)
Male	11 (46%)	13 (54%)	24 (100%)
Total	22 (44%)	28 (56%)	50 (100%)
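Table 2 reports row percentages (each sex adds up to 100%), whereas Table 3 reports column percentages (each treatment group adds up to 100%). The sketch below computes both from the Table 2 counts; the function names `row_percentages` and `column_percentages` are hypothetical.

```python
# Counts taken from Table 2 (sex by treatment type).
counts = {
    "Female": {"Old Drug": 11, "New Drug": 15},
    "Male": {"Old Drug": 11, "New Drug": 13},
}


def row_percentages(table):
    """Each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Each cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
            for row, cells in table.items()}


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
print(round(row_pct["Female"]["Old Drug"]), round(col_pct["Female"]["Old Drug"], 1))
```

The same 11 women on the Old Drug are 42% of all women but 50.0% of the Old Drug group, so a table should always state which percentage it shows.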

Categorical variables

Bivariate tables

**Table 3**: Bivariate categorical table
		Treatment given	
Characteristic	Overall, N = 50¹	Old Drug, N = 22	New Drug, N = 28
Sex, n (%)
Female	26 (52.0)	11 (50.0)	15 (53.6)
Male	24 (48.0)	11 (50.0)	13 (46.4)
Age Grouping, n (%)
Middle age	17 (34.0)	7 (31.8)	10 (35.7)
Elderly	33 (66.0)	15 (68.2)	18 (64.3)
¹ n (%)

Categorical variables plots

Pie Chart

Bar Chart - Univariate

Bar Chart - Bivariate

Measures of association

Risk

Risk (probability, likelihood)
- Probability of outcome in a specified period
- If 200 children in a boarding school ate rice and 28 had diarrhea then

$R_{e} = \frac{28}{200} = 0.14 = 14\%$

Odds

$\text{Odds} = \frac{\text{Risk of getting the disease}}{\text{Risk of not getting the disease}}$

$\text{Odds}_{e} = \frac{0.14}{1 - 0.14} = \frac{0.14}{0.86} \approx 0.16$
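The arithmetic of the boarding school example can be checked with a short Python sketch. The variable names are chosen for illustration.

```python
cases = 28       # children who had diarrhea
exposed = 200    # children who ate rice

risk = cases / exposed        # 28 / 200
odds = risk / (1 - risk)      # same as cases / (exposed - cases)

print(f"risk = {risk:.2f}, odds = {odds:.2f}")
```

When the outcome is uncommon, as here, the odds (about 0.16) are close to the risk (0.14); the two drift apart as the outcome becomes more frequent.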

Measures of association

Risk Ratio (RR)

$RR = \frac{\text{Incidence in the exposed}}{\text{Incidence in the non-exposed}} = \frac{R_{e}}{R_{u}}$

Used for estimation of causal relationship
Can be calculated in cohort studies
RR further from 1 => Stronger association
- RR = 1 => No evidence of association
- RR $\neq$ 1 => Exposure is harmful (RR > 1) or protective (RR < 1)

Measures of association

Odds Ratio (OR)

$OR = \frac{\text{Odds in the exposed}}{\text{Odds in the non-exposed}} = \frac{\text{Odds}_{e}}{\text{Odds}_{u}}$

Used for estimation of causal relationship
Can be calculated in case-control studies
OR further from 1 => Stronger association
- OR = 1 => No evidence of association
- OR $\neq$ 1 => Exposure is harmful (OR > 1) or protective (OR < 1)
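As a hedged illustration of both ratios, the sketch below reuses the exposed group from the risk example (28 of 200 children who ate rice) and adds a hypothetical unexposed group (10 of 200 children who did not). The unexposed figures and the function names are assumptions invented for this example.

```python
def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Risk in the exposed divided by risk in the non-exposed."""
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


def odds_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Odds in the exposed divided by odds in the non-exposed."""
    odds_exp = cases_exp / (n_exp - cases_exp)
    odds_unexp = cases_unexp / (n_unexp - cases_unexp)
    return odds_exp / odds_unexp


# Exposed group from the slides; the unexposed group is hypothetical.
rr = risk_ratio(28, 200, 10, 200)
odds_r = odds_ratio(28, 200, 10, 200)

print(f"RR = {rr:.2f}, OR = {odds_r:.2f}")
```

With these made-up numbers both ratios are above 1, which would point to the exposure being harmful, and the OR is somewhat further from 1 than the RR.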

Descriptive statistics

Measures of central tendency
- Mean
- Median
- Mode

Measures of dispersion
- Range: minimum to maximum
- Interquartile range: p25 to p75
- Quartiles: minimum, p25, p75, maximum
- Standard deviation
- Variance
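These measures can be computed with Python's standard `statistics` module. The sketch below uses the final BP (bp2) of the 15 rows displayed earlier, so the results describe only that subset and will not match the full-sample tables. The "inclusive" quantile method is one of several conventions and is an arbitrary choice here.

```python
import statistics as st

# Final BP (bp2) from the 15 rows displayed earlier.
bp2 = [93.1, 92.4, 99.7, 88.4, 91.1, 90.5, 93.8, 92.6,
       86.2, 90.4, 91.5, 94.6, 92.3, 89.3, 90.4]

# Measures of central tendency
centre = {
    "mean": st.mean(bp2),
    "median": st.median(bp2),
    "mode": st.mode(bp2),
}

# Measures of dispersion
q1, q2, q3 = st.quantiles(bp2, n=4, method="inclusive")
spread = {
    "range": (min(bp2), max(bp2)),
    "iqr": (q1, q3),
    "sd": st.stdev(bp2),          # sample standard deviation
    "variance": st.variance(bp2), # sample variance
}

print(centre)
print(spread)
```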

Descriptive Statistics



**Table 4**: Univariate descriptive statistics
Characteristic	N = 50
Initial blood pressure (mmHg)
Median (IQR)	97.7 (95.6, 99.4)
Mean (SD)	98.3 (5.2)
Range	87.5, 111.7
Blood Pressure after treatment
Median (IQR)	88.2 (85.2, 92.1)
Mean (SD)	88.6 (4.6)
Range	78.0, 99.7

**Table 5**: Bivariate descriptive statistics
Characteristic	Old Drug, N = 22	New Drug, N = 28
Initial blood pressure (mmHg)
Median (IQR)	97.4 (95.5, 98.8)	98.3 (95.9, 101.7)
Mean (SD)	97.1 (3.6)	99.2 (6.0)
Range	89.8, 103.4	87.5, 111.7
Blood Pressure after treatment
Median (IQR)	91.9 (90.4, 93.6)	85.4 (84.4, 87.2)
Mean (SD)	92.2 (3.3)	85.8 (3.3)
Range	86.2, 99.7	78.0, 94.2

Descriptive Statistics

Histogram

Boxplot


Descriptive Statistics - plots

Scatter plot

Boxplot


Inferential statistics

Sample vs. Population

Statistic vs. Parameter

Sample variation

Estimates

Point estimates
Interval estimates

Confidence interval

A 95% confidence interval is a range computed from the sample such that, if the study were repeated many times, about 95% of the intervals constructed this way would contain the true population parameter.

95% Confidence interval

p-value

Well known in research
Very often misinterpreted and overemphasized

It is the probability of obtaining a statistic at least as extreme as the one observed in the sample, assuming the null hypothesis is true. It measures how compatible the data are with the null hypothesis.

The nearer the p-value is to 1, the more compatible the data are with the null hypothesis; the nearer it is to 0, the stronger the evidence against it.

Hypothesis testing

State (Null) hypothesis (H0)
Decide on significance level (α), usually 0.05
Determine sample size
Collect data (Evidence)
Apply appropriate statistical test
Compute the probability value (p-value)
Compare the p-value with α
1. If p < α then: Reject the H0 (Guilty)
2. If p >= α then: Fail to reject the H0 (Not guilty)

Generally:

Lower p-value: More confident of rejecting the H0
Note that failure to reject H0 does not mean H0 is true. It means we do not have enough evidence to reject it.
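The decision rule in the last two steps can be written as a small Python function. The name `decide` and the returned wording are chosen for illustration.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p is below alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p-value must lie between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"


print(decide(0.0005))   # well below alpha
print(decide(0.5))      # well above alpha
print(decide(0.05))     # exactly alpha: not below it, so H0 is not rejected
```

The function returns "fail to reject H0" rather than "accept H0" on purpose, since a large p-value only means the evidence is insufficient.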

Statistical Tests


Answering our questions from our study

How much does the New drug lower the BP?
How much does the Control drug lower the BP?
Which of the two drugs lowers the BP better?

Characteristic	Old Drug, N = 22¹	95% CI²	New Drug, N = 28¹	95% CI²
Change in Blood Pressure	4.9 (2.3)	3.9, 5.9	13.4 (5.7)	11, 16
¹ Mean (SD)
² CI = Confidence Interval
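The intervals in this table can be approximately reproduced from the reported means, SDs and group sizes. The sketch assumes a t-based interval, mean ± t × SD/√n, which the slides do not state explicitly. The critical values 2.080 (21 degrees of freedom) and 2.052 (27 degrees of freedom) come from standard t-tables, and `mean_ci` is a hypothetical helper name.

```python
from math import sqrt


def mean_ci(mean, sd, n, t_crit):
    """Confidence interval for a mean: mean +/- t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / sqrt(n)
    return mean - margin, mean + margin


# Two-sided 95% critical values from standard t-tables (df = n - 1).
old_ci = mean_ci(4.9, 2.3, 22, t_crit=2.080)    # Old Drug, df = 21
new_ci = mean_ci(13.4, 5.7, 28, t_crit=2.052)   # New Drug, df = 27

print([round(x, 1) for x in old_ci])
print([round(x, 1) for x in new_ci])
```

The two intervals do not overlap, which already suggests that the New drug lowers BP more than the Old drug; the formal test follows.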

Is there a difference in the BP lowering effect of the two drugs?

H0: No difference in BP lowering effect between the two drugs

Characteristic	Old Drug, N = 22¹	New Drug, N = 28¹	p-value²
Change in Blood Pressure	4.9 (2.3)	13.4 (5.7)	<0.001
¹ Mean (SD)
² Welch Two Sample t-test

Conclusion: Based on the p-value, we reject the H0 and conclude that there is a statistically significant difference between the BP lowering effects of the New drug and the Control drug.
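For readers who want to see where such a result comes from, the Welch test statistic and its approximate degrees of freedom can be computed from the summary figures in the table. This is a sketch from rounded summary statistics, not a re-analysis of the raw data, and `welch_t` is a hypothetical helper name. The p-value itself needs the t distribution, which is not in Python's standard library, so it is not computed here.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1    # squared standard error, group 1
    v2 = sd2 ** 2 / n2    # squared standard error, group 2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# New Drug versus Old Drug, using the reported mean (SD) and N.
t_stat, df = welch_t(13.4, 5.7, 28, 4.9, 2.3, 22)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A t statistic of about 7.2 on about 37 degrees of freedom is far beyond the two-sided 5% critical value of roughly 2.0, in line with the reported p-value of <0.001.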


H0: BP lowering effect is not related to age

Characteristic	Beta	95% CI¹	p-value
Age in years	0.08	-0.19, 0.36	0.5
¹ CI = Confidence Interval

Conclusion: The BP lowering effect is not significantly related to the age of the patient (p = 0.5; the 95% CI for the slope includes 0).

**Table 6**: Overall data outlook
Characteristic	Overall, N = 50¹	Old Drug, N = 22¹	New Drug, N = 28¹	p-value²
Age in years	61.5 (6.5)	62.1 (6.6)	61.0 (6.5)	0.6
Sex				0.8
Female	26 (52.0%)	11 (50.0%)	15 (53.6%)
Male	24 (48.0%)	11 (50.0%)	13 (46.4%)
Initial blood pressure (mmHg)	98.3 (5.2)	97.1 (3.6)	99.2 (6.0)	0.13
Blood Pressure after treatment	88.6 (4.6)	92.2 (3.3)	85.8 (3.3)	<0.001
Change in Blood Pressure	9.7 (6.2)	4.9 (2.3)	13.4 (5.7)	<0.001
Age Grouping				0.8
Middle age	17 (34.0%)	7 (31.8%)	10 (35.7%)
Elderly	33 (66.0%)	15 (68.2%)	18 (64.3%)
¹ Mean (SD); n (%)
² Welch Two Sample t-test; Pearson's Chi-squared test

Summary

Data Management
- Database design
- Data entry
- Data validation
- Data verification
- Data cleaning
- Data warehousing
- Data transfer

Data Analysis
- Statistical software
- Data cleaning
- Data analysis: continuous variables, categorical variables, descriptive vs. inferential

Presentation of results
- Tables
- Figures

Thank you!!!
