Data Formatting and Encoding

Basic required format

The data.frame provided to the data argument must be arranged in a particular way (i.e. “long” or “tidy” format). Each row should be an alternative from a choice observation. The choice observations do not have to be symmetric (i.e. each choice observation could have a different number of alternatives). The data must include columns for each of the following arguments in the logitr() function:

  1. choiceName: A dummy variable that identifies which alternative was chosen (1 = chosen, 0 = not chosen).
  2. obsIDName: A sequence of numbers that identifies each unique choice occasion. For example, if the first three choice occasions had 2 alternatives each, then the first 6 rows of the obsID variable would be 1, 1, 2, 2, 3, 3.
  3. parNames: The names of the variables that will be used as model covariates. For WTP space models, do not include the price variable in parNames - this is provided separately with the priceName argument.

The data sets included in this package all follow this format (e.g. the yogurt data set).
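
As a minimal sketch of this layout, the toy data frame below uses made-up values and hypothetical column names (obsID, choice, price, brand); it is not one of the package data sets. It has three choice observations with 2, 3, and 2 alternatives, and the checks confirm that each observation has exactly one chosen row.

```r
# Hypothetical toy data in long format: one row per alternative
choices <- data.frame(
  obsID  = c(1, 1, 2, 2, 2, 3, 3),   # repeats once per alternative
  choice = c(0, 1, 1, 0, 0, 0, 1),   # 1 = chosen, 0 = not chosen
  price  = c(10, 15, 20, 10, 15, 15, 20),
  brand  = c("dannon", "yoplait", "hiland", "dannon",
             "weight", "yoplait", "hiland")
)

# Each choice observation should contain exactly one chosen alternative
chosen_per_obs <- tapply(choices$choice, choices$obsID, sum)
stopifnot(all(chosen_per_obs == 1))

# Number of alternatives per observation (these need not be equal)
alts_per_obs <- table(choices$obsID)
print(alts_per_obs)
```

With a data frame like this, the column names "choice" and "obsID" would be the values passed to choiceName and obsIDName.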

Continuous and discrete variables

Numeric variables are by default estimated with a single “slope” coefficient.

Example: Consider a data frame that contains a price variable with the following three levels: c(10, 15, 20). Adding price to the parNames argument in the main logitr() function would result in a single price coefficient for the “slope” of the change in price.

Categorical variables (i.e. “character” or “factor” type variables) are by default estimated with a coefficient for all but the first “level”, which serves as the “baseline” or "0" level. Categorical variables are automatically “dummy” coded: each non-baseline level gets its own indicator variable, equal to 1 when a row has that level and 0 otherwise.

Example: Consider a data frame that contains a brand variable with the following four levels: c("dannon", "hiland", "weight", "yoplait"). Adding brand to the parNames argument in the main logitr() function would result in three covariates: brand_hiland, brand_weight, and brand_yoplait, with brand_dannon serving as the “dummied out” baseline level.
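
The difference between the two encodings can be illustrated with base R's model.matrix(). This is only a sketch of the general idea, not a description of how logitr() builds its covariates, and base R names the columns brandhiland rather than brand_hiland. The data values are made up.

```r
# Made-up data: one numeric and one categorical variable
df <- data.frame(
  price = c(10, 15, 20, 10),
  brand = c("dannon", "hiland", "weight", "yoplait")
)

# Build a design matrix, then drop the intercept column
X <- model.matrix(~ price + brand, data = df)
X <- X[, colnames(X) != "(Intercept)"]

# price stays a single numeric column (one "slope" coefficient);
# brand expands to one 0/1 column per level except the first ("dannon")
print(colnames(X))
print(X)
```

The row for "dannon" has zeros in all three brand columns, which is what makes it the baseline level.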

Creating dummy coded variables

To model a continuous variable as a discrete variable with a coefficient for all but the first level, there are two options:

  1. Convert the variable to a "character" or "factor" type.
  2. Create dummy coded variables using the dummyCode() function.

The second approach of using the dummyCode() function allows the modeler to specify the baseline level. It can also be used to create dummy-coded variables from a categorical variable.

Details for each approach are provided below.

1) Convert variable types

The simplest way to model a continuous variable as a discrete variable is to convert the column in the data frame to a "character" or "factor" type prior to estimating the model. For example, consider the following model:

model_default <- logitr(
  data       = cars_us,
  choiceName = 'choice',
  obsIDName  = 'obsnum',
  parNames   = c(
    'price', 'hev', 'phev10', 'phev20', 'phev40', 'bev75', 'bev100',
    'bev150', 'american', 'japanese', 'chinese', 'skorean',
    'phevFastcharge', 'bevFastcharge','opCost', 'accelTime')
)
#> Running Model...
#> Done!

In this model, since the price variable is a "double" variable type, it is by default modeled as a continuous variable with a single “slope” coefficient:

typeof(cars_us$price)
#> [1] "double"
summary(model_default)
#> =================================================
#> MODEL SUMMARY: 
#>                           
#> Model Space:    Preference
#> Model Run:          1 of 1
#> Iterations:             20
#> Elapsed Time:  0h:0m:0.47s
#> Weights Used?:       FALSE
#> 
#> Model Coefficients: 
#>                 Estimate StdError    tStat   pVal signif
#> price          -0.073882 0.002049 -36.0612 0.0000    ***
#> hev             0.059741 0.073667   0.8110 0.4174       
#> phev10          0.086141 0.078725   1.0942 0.2739       
#> phev20          0.121737 0.079619   1.5290 0.1263       
#> phev40          0.190580 0.079013   2.4120 0.0159      *
#> bev75          -1.185508 0.087262 -13.5856 0.0000    ***
#> bev100         -0.960710 0.086753 -11.0740 0.0000    ***
#> bev150         -0.707314 0.084204  -8.4000 0.0000    ***
#> american        0.173212 0.058839   2.9438 0.0033     **
#> japanese       -0.027662 0.058507  -0.4728 0.6364       
#> chinese        -0.758692 0.062305 -12.1771 0.0000    ***
#> skorean        -0.445575 0.060899  -7.3166 0.0000    ***
#> phevFastcharge  0.212802 0.059931   3.5508 0.0004    ***
#> bevFastcharge   0.215705 0.066998   3.2196 0.0013     **
#> opCost         -0.120876 0.004429 -27.2948 0.0000    ***
#> accelTime      -0.125380 0.011587 -10.8207 0.0000    ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Model Fit Values: 
#>                                      
#> Log.Likelihood.         -4616.9517861
#> Null.Log.Likelihood.    -6328.0067827
#> AIC.                     9265.9036000
#> BIC.                     9372.4427000
#> McFadden.R2.                0.2703940
#> Adj..McFadden.R2            0.2678655
#> Number.of.Observations.  5760.0000000

To model price as a categorical variable, simply change it to a "character" or "factor" type:

cars_us$price <- as.character(cars_us$price)
typeof(cars_us$price)
#> [1] "character"

Now re-estimate the model:

model_character_price <- logitr(
  data       = cars_us,
  choiceName = 'choice',
  obsIDName  = 'obsnum',
  parNames   = c(
    'price', 'hev', 'phev10', 'phev20', 'phev40', 'bev75', 'bev100',
    'bev150', 'american', 'japanese', 'chinese', 'skorean',
    'phevFastcharge', 'bevFastcharge','opCost', 'accelTime')
)
#> Running Model...
#> Done!

Now price is modeled as a categorical variable with a coefficient for all but the first level (here, the price of 15 is the baseline):

typeof(cars_us$price)
#> [1] "character"
summary(model_character_price)
#> =================================================
#> MODEL SUMMARY: 
#>                          
#> Model Space:   Preference
#> Model Run:         1 of 1
#> Iterations:            22
#> Elapsed Time:  0h:0m:0.6s
#> Weights Used?:      FALSE
#> 
#> Model Coefficients: 
#>                 Estimate StdError    tStat   pVal signif
#> hev             0.062222 0.073722   0.8440 0.3987       
#> phev10          0.088042 0.078788   1.1175 0.2638       
#> phev20          0.121140 0.079657   1.5208 0.1284       
#> phev40          0.192805 0.079087   2.4379 0.0148      *
#> bev75          -1.189301 0.087337 -13.6175 0.0000    ***
#> bev100         -0.962399 0.086843 -11.0820 0.0000    ***
#> bev150         -0.711546 0.084280  -8.4426 0.0000    ***
#> american        0.174072 0.058863   2.9573 0.0031     **
#> japanese       -0.024672 0.058548  -0.4214 0.6735       
#> chinese        -0.758227 0.062362 -12.1585 0.0000    ***
#> skorean        -0.445432 0.060921  -7.3116 0.0000    ***
#> phevFastcharge  0.211813 0.059977   3.5316 0.0004    ***
#> bevFastcharge   0.217633 0.067059   3.2454 0.0012     **
#> opCost         -0.121119 0.004435 -27.3095 0.0000    ***
#> accelTime      -0.125424 0.011592 -10.8196 0.0000    ***
#> price_18       -0.191639 0.054849  -3.4939 0.0005    ***
#> price_23       -0.657617 0.057072 -11.5227 0.0000    ***
#> price_32       -1.317372 0.060378 -21.8188 0.0000    ***
#> price_50       -2.546410 0.077090 -33.0317 0.0000    ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Model Fit Values: 
#>                                      
#> Log.Likelihood.         -4614.4422294
#> Null.Log.Likelihood.    -6328.0067827
#> AIC.                     9266.8845000
#> BIC.                     9393.3996000
#> McFadden.R2.                0.2707906
#> Adj..McFadden.R2            0.2677880
#> Number.of.Observations.  5760.0000000
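
Before converting a numeric column, it may be worth checking which level sorts first, since that level becomes the baseline. The sketch below uses made-up prices and shows only what base R does when it builds factor levels; how logitr() orders the levels of a character column is not stated here, so treat this as a check to run rather than a guarantee.

```r
# Made-up prices with different numbers of digits
price <- c(5, 10, 100, 10, 5)

# Levels of a factor built from numbers follow numeric order
lv_numeric <- levels(factor(price))

# Levels of a factor built from character strings follow text order,
# so the first level may not be the smallest price
lv_character <- levels(factor(as.character(price)))

print(lv_numeric)
print(lv_character)
```

In the cars_us example all prices have two digits, so text order and numeric order agree.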

2) Create dummy-coded variables

For the second option, you can use the dummyCode() function to create new dummy-coded variables for all the levels of a continuous variable and then use those variables in the model:

cars_us_dummy <- dummyCode(df = cars_us, vars = "price")
names(cars_us_dummy)
#>  [1] "id"             "obsnum"         "choice"         "hev"           
#>  [5] "phev10"         "phev20"         "phev40"         "bev75"         
#>  [9] "bev100"         "bev150"         "phevFastcharge" "bevFastcharge" 
#> [13] "opCost"         "accelTime"      "american"       "japanese"      
#> [17] "chinese"        "skorean"        "weights"        "price"         
#> [21] "price_15"       "price_18"       "price_23"       "price_32"      
#> [25] "price_50"
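
To make the structure of these new columns concrete, the base R sketch below builds the same kind of indicator columns by hand for a small made-up price vector. It is an illustration of dummy coding in general, not the implementation of dummyCode().

```r
# Made-up data with the same five price levels
df <- data.frame(price = c(15, 18, 23, 32, 50, 15, 50))

# Add one 0/1 indicator column per unique price level
for (lvl in sort(unique(df$price))) {
  df[[paste0("price_", lvl)]] <- as.integer(df$price == lvl)
}

print(names(df))

# Every row has exactly one indicator equal to 1
stopifnot(all(rowSums(df[, -1]) == 1))
```

Because the indicators in each row sum to 1, one of them has to be left out of the model; the one left out is the baseline.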

The new cars_us_dummy data frame now contains variables for each level of the price column. This approach allows the modeler to specify the baseline level. In this example, I’ll use the price_50 level as the baseline:

model_dummy_price <- logitr(
  data       = cars_us_dummy,
  choiceName = 'choice',
  obsIDName  = 'obsnum',
  parNames   = c(
    "price_15", "price_18", "price_23", "price_32",
    'hev', 'phev10', 'phev20', 'phev40', 'bev75', 'bev100',
    'bev150', 'american', 'japanese', 'chinese', 'skorean',
    'phevFastcharge', 'bevFastcharge','opCost', 'accelTime')
)
#> Running Model...
#> Done!

Now price is modeled with a coefficient for every level except price_50, which serves as the specified baseline:

summary(model_dummy_price)
#> =================================================
#> MODEL SUMMARY: 
#>                          
#> Model Space:   Preference
#> Model Run:         1 of 1
#> Iterations:            23
#> Elapsed Time:  0h:0m:0.7s
#> Weights Used?:      FALSE
#> 
#> Model Coefficients: 
#>                 Estimate StdError    tStat   pVal signif
#> price_15        2.546463 0.077091  33.0321 0.0000    ***
#> price_18        2.354811 0.076783  30.6686 0.0000    ***
#> price_23        1.888845 0.075124  25.1431 0.0000    ***
#> price_32        1.229120 0.074531  16.4915 0.0000    ***
#> hev             0.062127 0.073722   0.8427 0.3994       
#> phev10          0.087935 0.078787   1.1161 0.2644       
#> phev20          0.121048 0.079657   1.5196 0.1287       
#> phev40          0.192727 0.079086   2.4369 0.0148      *
#> bev75          -1.189114 0.087334 -13.6157 0.0000    ***
#> bev100         -0.962333 0.086842 -11.0814 0.0000    ***
#> bev150         -0.711515 0.084279  -8.4424 0.0000    ***
#> american        0.173844 0.058862   2.9534 0.0032     **
#> japanese       -0.024653 0.058548  -0.4211 0.6737       
#> chinese        -0.758252 0.062361 -12.1590 0.0000    ***
#> skorean        -0.445564 0.060921  -7.3138 0.0000    ***
#> phevFastcharge  0.211886 0.059977   3.5328 0.0004    ***
#> bevFastcharge   0.217423 0.067058   3.2423 0.0012     **
#> opCost         -0.121112 0.004435 -27.3084 0.0000    ***
#> accelTime      -0.125420 0.011592 -10.8193 0.0000    ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Model Fit Values: 
#>                                      
#> Log.Likelihood.         -4614.4422281
#> Null.Log.Likelihood.    -6328.0067827
#> AIC.                     9266.8845000
#> BIC.                     9393.3996000
#> McFadden.R2.                0.2707906
#> Adj..McFadden.R2            0.2677880
#> Number.of.Observations.  5760.0000000
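
The two categorical price models describe the same fit: their log-likelihoods agree to about six decimal places, and the price coefficients differ only by a constant shift that moves the zero point from the price of 15 to the price of 50. The sketch below checks this using the estimates printed in the two summaries above; the variable names are only for illustration.

```r
# Price estimates from model_character_price (baseline: price of 15)
b15 <- c(price_15 = 0, price_18 = -0.191639, price_23 = -0.657617,
         price_32 = -1.317372, price_50 = -2.546410)

# Re-express them relative to price_50 by subtracting its coefficient
b50 <- b15 - b15["price_50"]
print(round(b50, 4))

# Price estimates from model_dummy_price (baseline: price_50)
reported <- c(price_15 = 2.546463, price_18 = 2.354811,
              price_23 = 1.888845, price_32 = 1.229120)

# Differences are small and on the order of optimizer tolerance
max_diff <- max(abs(b50[names(reported)] - reported))
print(max_diff)
```

The non-price coefficients are likewise nearly identical across the two summaries, so the choice of baseline changes how the price effects are reported, not the model itself.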