Proportion Inference

David Gerbing

library("lessR")

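The analysis of proportions is of two primary types: the analysis of the proportion of a designated value of a categorical variable, for a single sample or compared across samples (the test of homogeneity), and the analysis of the frequencies across all the values of one categorical variable (goodness-of-fit) or of two categorical variables (independence).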

Built on standard base R functions, the lessR function Prop_test(), abbreviated prop(), provides either type of analysis of proportions. To use, enter either the original data from which the sample proportions are computed, or directly enter the already computed frequencies that define the sample proportions.

When analyzing the original data, an entered value for the parameter success for the categorical variable of interest, indicated by parameter variable, triggers the test of homogeneity. If the proportions are entered directly, indicate the number of successes and the total number of trials with the n_succ and n_tot parameters, each as a single value for a single sample or as vectors of multiple values for multiple samples. Without a value for success or n_succ the analysis is of goodness-of-fit or independence.

Single Proportion

Consider a single proportion for a value of a variable of interest, analyzed from a single sample. What is the proportion of occurrences of a designated value of the variable? The tradition is to call that value a success. All other values are failures. Success or failure in this context does not necessarily mean good or bad, desired or undesired, but simply that a designated value occurred or it did not occur.

The example below is the same example given in the documentation for the base R binom.test(). Prop_test() gives the same result because it relies upon that base R function for this analysis.

For a given categorical variable of interest, in this case a type of plant, consider two values, either “giant” or “dwarf”. From a sample of 925 plants, the specified value of “giant” occurred 682 times, and did not occur 243 times. The null hypothesis, set with the p0 parameter, is that the specified value occurs for 3/4 of the population.

Prop_test(n_succ=682, n_fail=243, p0=.75)
## 
## >>> Exact binomial test of a proportion <<< 
## 
## ------ Description ------
## 
## Number of successes: 682 
## Number of failures: 243 
## Number of trials: 925 
## Sample proportion: 0.737 
## 
## ------ Inference ------
## 
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
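
Because the text states that Prop_test() relies on the base R binom.test() for this analysis, the same numbers can be cross-checked directly in base R. The following is a minimal sketch of that check, not the internal code of Prop_test(); the object names (res, p_hat, p_value, ci) are chosen here only for illustration.

```r
# Cross-check of the summary-count example with base R only.
n_succ <- 682   # occurrences of the designated value, "giant"
n_fail <- 243   # all other occurrences

# binom.test() accepts a length-2 vector of successes and failures
res <- binom.test(c(n_succ, n_fail), p = 0.75)

p_hat   <- unname(res$estimate)     # sample proportion
p_value <- res$p.value              # exact two-sided p-value
ci      <- as.numeric(res$conf.int) # 95% confidence interval

print(round(c(p_hat = p_hat, p_value = p_value,
              lower = ci[1], upper = ci[2]), 3))
```

The rounded values should agree with the Prop_test() output shown above.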

To illustrate with data, read the Employee data included as part of lessR.

d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

For the variable Gender in the default d data frame, in this example parameter success defines a success for the value of Gender of “F”. Analyze the proportion of successes, that is, those reporting a Gender of “F”. The default null hypothesis is a population value of 0.5.

The parameter names are included here, though they are not necessary in this example because the parameters are listed in the order in which they are defined in the Prop_test() function.

Prop_test(variable=Gender, success="F")
## 
## >>> Exact binomial test of a proportion <<< 
## 
## Variable: Gender 
## success: F 
## 
## ------ Description ------
## 
## Number of missing values: 0 
## Number of successes: 19 
## Number of failures: 18 
## Number of trials: 37 
## Sample proportion: 0.514 
## 
## ------ Inference ------
## 
## Hypothesis test for null of 0.5, p-value: 1.000
## 95% Confidence interval: 0.344 to 0.681

The null hypothesis is not rejected, with a \(p\)-value of 1. The sample result of \(p=0.514\) is considered close to the default hypothesized value of \(0.5\) for the proportion of "F" values for Gender.

In this next example, change the null hypothesis with the parameter p0 to 0.6. Use the abbreviation prop().

prop(Gender, "F", p0=0.6)
## 
## >>> Exact binomial test of a proportion <<< 
## 
## Variable: Gender 
## success: F 
## 
## ------ Description ------
## 
## Number of missing values: 0 
## Number of successes: 19 
## Number of failures: 18 
## Number of trials: 37 
## Sample proportion: 0.514 
## 
## ------ Inference ------
## 
## Hypothesis test for null of 0.6, p-value: 0.315
## 95% Confidence interval: 0.344 to 0.681

The null hypothesis of \(p_0=0.6\) is also not rejected as the \(p\)-value is well above \(\alpha=0.05\).
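
In both Gender analyses the confidence interval is identical, 0.344 to 0.681, while the \(p\)-value changes with the null value. A short base R sketch, assuming only the counts reported above (19 successes in 37 trials), illustrates why: the interval reported by binom.test() depends on the data and the confidence level, not on the hypothesized proportion. The object names are illustrative.

```r
# Same data, two different null hypotheses, using base R binom.test()
n_succ <- 19   # number of "F" values for Gender
n_tot  <- 37   # number of non-missing values

test_50 <- binom.test(n_succ, n_tot, p = 0.5)
test_60 <- binom.test(n_succ, n_tot, p = 0.6)

# The p-value depends on the null value ...
print(round(c(p0_0.5 = test_50$p.value, p0_0.6 = test_60$p.value), 3))

# ... but the confidence interval does not
print(round(as.numeric(test_50$conf.int), 3))
print(round(as.numeric(test_60$conf.int), 3))
```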

Multiple Proportions

The next example is the same as in the documentation for the base R prop.test(). Prop_test() gives the same result because it relies upon that base R function for the comparison of proportions across different groups. To indicate multiple proportions, specified across groups, provide multiple values for the n_succ and n_tot parameters.

The null hypothesis is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different.

smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
## 
## >>> 4-sample test for equality of proportions without continuity correction  <<< 
## 
## 
## >>> Description
## 
##                   1       2       3       4
## -----------  ------  ------  ------  ------
## n_               83      90     129      70
## n_total          86      93     136      82
## proportion    0.965   0.968   0.949   0.854
## 
## >>> Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006
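
As a cross-check, the same test can be run with the base R prop.test() on which the text says Prop_test() relies. The sketch below also shows that the test of equal proportions is the Pearson chi-square test applied to the 2 x 4 table of successes and failures. The object names are illustrative, and this is not the internal code of Prop_test().

```r
# Four-sample test of equal proportions with base R
smokers  <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)

res <- prop.test(smokers, patients)
print(round(unname(res$estimate), 3))   # sample proportions
print(round(c(chi_sq = unname(res$statistic),
              df = unname(res$parameter),
              p_value = res$p.value), 3))

# Equivalent formulation: chi-square test on successes vs. failures
counts   <- rbind(smoke = smokers, nosmoke = patients - smokers)
tab_test <- chisq.test(counts)
print(round(unname(tab_test$statistic), 3))
```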

The groups can also be labeled in the output by providing a named vector for the successes.

smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
## 
## >>> 4-sample test for equality of proportions without continuity correction  <<< 
## 
## 
## >>> Description
## 
##               Group1   Group2   Group3   Group4
## -----------  -------  -------  -------  -------
## n_                83       90      129       70
## n_total           86       93      136       82
## proportion     0.965    0.968    0.949    0.854
## 
## >>> Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

Next, duplicate these results from data. First create the data frame d according to the counts of smokers and non-smokers in each group.

sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)

Examine the first six rows and last six rows of the data frame d. The variable of interest is sm, with values “smoke” and “nosmoke”.

head(d)
##      sm grp
## 1 smoke   A
## 2 smoke   A
## 3 smoke   A
## 4 smoke   A
## 5 smoke   A
## 6 smoke   A
tail(d)
##          sm grp
## 392 nosmoke   D
## 393 nosmoke   D
## 394 nosmoke   D
## 395 nosmoke   D
## 396 nosmoke   D
## 397 nosmoke   D
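
The first and last rows show only a single group each. To confirm that a constructed data frame reproduces the intended counts, a cross-tabulation is more informative. The following base R sketch rebuilds an equivalent data frame under the hypothetical name d_check, so that it does not overwrite d, and tabulates it with table().

```r
# Rebuild the data from the counts, then verify with a cross-tabulation.
smokers  <- c(A = 83, B = 90, C = 129, D = 70)
patients <- c(A = 86, B = 93, C = 136, D = 82)

# For each group: its smokers followed by its non-smokers
sm <- unlist(lapply(names(smokers), function(g) {
  rep(c("smoke", "nosmoke"),
      times = c(smokers[[g]], patients[[g]] - smokers[[g]]))
}))
grp <- rep(names(patients), times = patients)
d_check <- data.frame(sm, grp)

counts <- table(d_check$sm, d_check$grp)
print(counts)                                   # frequencies by group
print(round(prop.table(counts, margin = 2), 3)) # proportions within group
```

The "smoke" row of the table should match the number of smokers in each group, and the column sums should match the number of patients.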

To indicate a comparison across groups, retain the format for a single proportion, providing a categorical variable of interest. Define a success by the value “smoke”. What is added for this analysis is to indicate the comparison across the four groups with a grouping variable that contains a label that identifies the corresponding group. Specify the grouping variable with the by parameter. The grouping variable in this example is grp, whose values are the first four uppercase letters of the alphabet.

The relevant parameters variable, success, and by are listed in their given order in this example, so the parameter names are not necessary. They are listed here for completeness.

Prop_test(variable=sm, success="smoke", by=grp)
## 
## >>> 4-sample test for equality of proportions without continuity correction  <<< 
## 
## Variable: sm 
## success: smoke 
## by: grp 
## 
## >>> Description
## 
##                   A       B       C       D
## -----------  ------  ------  ------  ------
## n_smoke          83      90     129      70
## n_total          86      93     136      82
## proportion    0.965   0.968   0.949   0.854
## 
## >>> Inference
## 
## Chi-square statistic: 12.600 
## Degrees of freedom: 3 
## Hypothesis test of equal population proportions: p-value = 0.006

The analysis, of course, provides the same results as providing the proportions directly.

Goodness-of-Fit

For the goodness-of-fit test to a uniform distribution, provide the frequencies for five cells for the parameter n_tot. The default null hypothesis is that the proportions of the different categories of a categorical variable are equal.

x = c(5,6,4,6,15)
Prop_test(n_tot=x)
## 
## >>> Chi-squared test for given probabilities  <<< 
## 
## 
## >>> Description
## 
##                  1        2        3        4       5
## ---------  -------  -------  -------  -------  ------
## observed         5        6        4        6      15
## expected     7.200    7.200    7.200    7.200   7.200
## residual    -0.820   -0.447   -1.193   -0.447   2.907
## stdn res    -0.917   -0.500   -1.333   -0.500   3.250
## 
## >>> Inference
## 
## Chi-square statistic: 10.944 
## Degrees of freedom: 4 
## Hypothesis test of equal population proportions: p-value = 0.027
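
The rows of the description table follow from the standard chi-square goodness-of-fit formulas. The sketch below computes them by hand in base R for the same five frequencies, assuming equal expected proportions as in the default null hypothesis. The variable names are illustrative and this is not the internal code of Prop_test().

```r
# Goodness-of-fit to a uniform distribution, computed step by step
observed <- c(5, 6, 4, 6, 15)
k <- length(observed)          # number of categories
n <- sum(observed)             # total sample size

expected <- rep(n / k, k)      # equal expected frequencies under the null
residual <- (observed - expected) / sqrt(expected)                # Pearson residuals
std_res  <- (observed - expected) / sqrt(expected * (1 - 1 / k))  # standardized residuals

chi_sq  <- sum(residual^2)     # chi-square statistic
df      <- k - 1
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(rbind(observed, expected, residual, std_res), 3))
print(round(c(chi_sq = chi_sq, df = df, p_value = p_value), 3))
```

The large positive standardized residual for the fifth category indicates that this cell contributes most to the departure from equal proportions.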

Make the n_tot parameter a named vector to label the output accordingly.

x = c(5,6,4,6,15)
names(x) = c("ACCT", "ADMN", "FINC","MKTG","SALE")
Prop_test(n_tot=x)
## 
## >>> Chi-squared test for given probabilities  <<< 
## 
## 
## >>> Description
## 
##               ACCT     ADMN     FINC     MKTG    SALE
## ---------  -------  -------  -------  -------  ------
## observed         5        6        4        6      15
## expected     7.200    7.200    7.200    7.200   7.200
## residual    -0.820   -0.447   -1.193   -0.447   2.907
## stdn res    -0.917   -0.500   -1.333   -0.500   3.250
## 
## >>> Inference
## 
## Chi-square statistic: 10.944 
## Degrees of freedom: 4 
## Hypothesis test of equal population proportions: p-value = 0.027

Next, the same analysis but from the data.

d <- Read("Employee", quiet=TRUE)
Prop_test(Dept)
## 
## >>> Chi-squared test for given probabilities  <<< 
## 
## Variable: Dept 
## 
## >>> Description
## 
##               ACCT     ADMN     FINC     MKTG    SALE
## ---------  -------  -------  -------  -------  ------
## observed         5        6        4        6      15
## expected     7.200    7.200    7.200    7.200   7.200
## residual    -0.820   -0.447   -1.193   -0.447   2.907
## stdn res    -0.917   -0.500   -1.333   -0.500   3.250
## 
## >>> Inference
## 
## Chi-square statistic: 10.944 
## Degrees of freedom: 4 
## Hypothesis test of equal population proportions: p-value = 0.027

Independence

To do the \(\chi^2\) test of independence, specify two categorical variables. The first variable listed in this example is the value of the parameter variable, so it does not need the parameter name. The second variable listed must include the parameter name by.

Prop_test(Dept, by=Gender)
## 
## >>> Pearson's Chi-squared test  <<< 
## 
## Variable: Dept 
## by: Gender 
## 
## >>> Description
## 
##       Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
##    F      3    4    1    5    5  18
##    M      2    2    3    1   10  18
##    Sum    5    6    4    6   15  36
## 
## Cramer's V: 0.415 
## 
## >>> Inference
## 
## Chi-square statistic: 6.200 
## Degrees of freedom: 4 
## Hypothesis test of independence: p-value = 0.185
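
The reported statistics can be reproduced from the cross-tabulation alone. The sketch below enters the table of counts shown above into the base R chisq.test() and computes Cramer's V from its usual definition, the square root of the chi-square statistic divided by the product of the sample size and one less than the smaller table dimension. Whether Prop_test() calls chisq.test() internally is an assumption here, and the object names are illustrative.

```r
# Test of independence from the Gender by Dept table of counts
counts <- matrix(c(3, 4, 1, 5,  5,
                   2, 2, 3, 1, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("F", "M"),
                                 Dept = c("ACCT", "ADMN", "FINC", "MKTG", "SALE")))

# Several expected counts are below 5, so chisq.test() warns that the
# chi-square approximation may be inaccurate; suppressed for this sketch
res <- suppressWarnings(chisq.test(counts))

chi_sq <- unname(res$statistic)
n <- sum(counts)
cramers_v <- sqrt(chi_sq / (n * (min(dim(counts)) - 1)))

print(round(c(chi_sq = chi_sq, df = unname(res$parameter),
              p_value = res$p.value, cramers_v = cramers_v), 3))
```

With expected counts this small, the chi-square approximation to the \(p\)-value is rough, so the result is best read as approximate.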