Research Proposal

  1. Our team will explore the following research question: Are wine ratings influenced by price, grape variety, region, and vintage year?
  2. The data source is the Tidy Tuesday dataset Wine Enthusiast Reviews. There are 129,971 rows and 13 columns. After removing duplicate reviews (same wine and same reviewer), the dataset is reduced to 108,290 rows. We are interested in filtering the dataset to the following sample frame, resulting in 26,244 observations.
  3. The unit of observation is a unique product review of a bottle of wine.

Sush Feedback

  1. The research question seems too broad. Can you identify a primary predictor and build the study around it? You may eventually add other covariates to build a better model, but the study should revolve around the primary predictor of interest.

Price (Y) ~ Points (X)

  • grape variety could be more defensible (more valuable than country/region?)
  • vintage as a rule of thumb for predicting price; useful as a control variable

  2. Do you think it is a good idea to use an ordinal variable for the response? You could use the usual OLS regression, but your study would then have its limitations. Within this context, can you suggest a different response variable that is metric?
  3. Consider having a prior hypothesis about the effect of X on Y.

Additional Covariates:

  • Vintage
  • Grape Variety
  • Region

Scope Decisions:

  • Global model: country + vintage + grape variety (eliminated due to issues with the IID assumption)
  • Country-specific model: region + vintage + grape variety

Our Theory:

  • type of grape likely covaries with region due to temperature and soil requirements
  • some variation could be explained by variety and by region
  • vintage may explain some variation, but it could be limited

Main Variable Comparison

## [1] 0.5888724

What Models Do We Want To Build

Model #1: Our Primary Relationship

  • log(Price) ~ Points
  • We select the log-level model so that the coefficient on points reads as the approximate proportional change in price associated with a one-point change in rating; other covariates are added in the later models

Model #2:

  • log(Price) ~ Points + variety

Model #3:

  • log(Price) ~ Points + variety + region_1

Model #4:

  • log(Price) ~ Points + variety + region_1 + vintage
    • Vintage as metric: we define a single linear relationship between price and vintage
    • Vintage as ordinal: a distinct intercept per vintage in the points-price relationship
      • ordinal can be subset of metric
    • Expected relationship: the older the vintage (smaller year), the higher the price

What Is Needed Before Building Models

Begin Model Building

Comparing Level-Level and Log-Level Models

## 
## Call:
## lm(formula = price ~ points, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.66 -13.68  -4.06   8.30 436.22 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -323.5353     9.0723  -35.66   <2e-16 ***
## points         4.1204     0.1033   39.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.09 on 4349 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.268,  Adjusted R-squared:  0.2679 
## F-statistic:  1592 on 1 and 4349 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = log(price) ~ points, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.59562 -0.34285  0.00696  0.32167  2.34733 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.419263   0.184777  -29.33   <2e-16 ***
## points       0.101048   0.002103   48.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4907 on 4349 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.3468, Adjusted R-squared:  0.3466 
## F-statistic:  2309 on 1 and 4349 DF,  p-value: < 2.2e-16

## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -5.4192630  0.1859504 -29.144 < 2.2e-16 ***
## points       0.1010480  0.0021095  47.902 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
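
In a log-level model, a coefficient b on points means that a one-point increase in rating multiplies the predicted price by exp(b). The following is a minimal Python sketch of that conversion, using the points estimate and standard error from the coefficient test above. The function name `pct_change_per_unit` and the 1.96 normal-approximation multiplier are illustrative assumptions, not part of the original analysis.

```python
import math

def pct_change_per_unit(beta):
    """Percent change in the outcome for a one-unit increase in a predictor
    when the outcome is modelled on the log scale."""
    return (math.exp(beta) - 1.0) * 100.0

# Estimate and standard error for `points` in log(price) ~ points,
# copied from the "t test of coefficients" output.
beta_points = 0.1010480
se_points = 0.0021095

point_estimate = pct_change_per_unit(beta_points)

# Approximate 95% interval built on the log scale (normal approximation,
# assumed here for illustration), then converted to percent.
lower = pct_change_per_unit(beta_points - 1.96 * se_points)
upper = pct_change_per_unit(beta_points + 1.96 * se_points)

print(f"{point_estimate:.1f}% per point ({lower:.1f}% to {upper:.1f}%)")
```

Reading 0.101 directly as "10.1% per point" slightly understates the effect, because the shortcut 100 × b is only accurate for small b.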

Does some regional detail add insight?

Yes: adding region_3 raises R-squared from 0.347 to 0.423, so the model explains about 7.6 percentage points more of the variation in log(price).
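
As a rough check on whether that gain is more than noise, the R-squared values of the two nested models can be turned into a partial F statistic. The sketch below is plain Python with values copied from the model summaries in this section; the helper `partial_f` is a hypothetical name, and the classical F formula assumes homoskedastic errors, so it should be read as a back-of-the-envelope check rather than a replacement for a robust test.

```python
def partial_f(r2_reduced, r2_full, n_added, df_resid_full):
    """F statistic for adding `n_added` terms to a nested OLS model,
    computed from the R-squared of the reduced and full models."""
    numerator = (r2_full - r2_reduced) / n_added
    denominator = (1.0 - r2_full) / df_resid_full
    return numerator / denominator

# Values copied from the summaries of log(price) ~ points and
# log(price) ~ points + region_3 (both report 3 rows dropped for missingness).
r2_points_only = 0.3468
r2_with_region = 0.4229
added_terms = 5      # five region_3 dummy coefficients
df_resid = 4344      # residual degrees of freedom of the larger model

gain = r2_with_region - r2_points_only
f_stat = partial_f(r2_points_only, r2_with_region, added_terms, df_resid)
print(f"R-squared gain: {gain:.4f}, partial F: {f_stat:.1f}")
```

The statistic is compared against an F distribution with 5 and 4344 degrees of freedom; a large value indicates that the region dummies jointly improve the fit.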

## 
## Call:
## lm(formula = log(price) ~ points + region_3, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.45066 -0.32785 -0.01909  0.30370  2.17533 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.596534   0.178086 -25.811  < 2e-16 ***
## points                    0.086849   0.002069  41.967  < 2e-16 ***
## region_3Central Coast     0.382666   0.024844  15.403  < 2e-16 ***
## region_3Central Valley    0.144859   0.089649   1.616 0.106201    
## region_3Napa-Sonoma       0.542000   0.023583  22.983  < 2e-16 ***
## region_3North Coast       0.199413   0.069302   2.877 0.004028 ** 
## region_3Sierra Foothills  0.229197   0.059701   3.839 0.000125 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4615 on 4344 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.4229, Adjusted R-squared:  0.4221 
## F-statistic: 530.5 on 6 and 4344 DF,  p-value: < 2.2e-16

##              GVIF Df GVIF^(1/(2*Df))
## points   1.094792  1        1.046323
## region_3 1.094792  5        1.009098
## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)              -4.596534   0.178867 -25.6981 < 2.2e-16 ***
## points                    0.086849   0.002087  41.6143 < 2.2e-16 ***
## region_3Central Coast     0.382666   0.026140  14.6391 < 2.2e-16 ***
## region_3Central Valley    0.144859   0.088901   1.6294  0.103291    
## region_3Napa-Sonoma       0.542000   0.025455  21.2926 < 2.2e-16 ***
## region_3North Coast       0.199413   0.075002   2.6587  0.007872 ** 
## region_3Sierra Foothills  0.229197   0.044782   5.1181  3.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = log(price) ~ points + region_3 + red, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34560 -0.29578 -0.01739  0.27734  2.06623 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.57669    0.16669 -27.456  < 2e-16 ***
## points                    0.08420    0.00194  43.402  < 2e-16 ***
## region_3Central Coast     0.35543    0.02328  15.268  < 2e-16 ***
## region_3Central Valley    0.09729    0.08393   1.159  0.24646    
## region_3Napa-Sonoma       0.49799    0.02215  22.488  < 2e-16 ***
## region_3North Coast       0.20923    0.06487   3.226  0.00127 ** 
## region_3Sierra Foothills  0.12899    0.05603   2.302  0.02137 *  
## red                       0.35582    0.01434  24.806  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.432 on 4343 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.4945, Adjusted R-squared:  0.4937 
## F-statistic: 606.9 on 7 and 4343 DF,  p-value: < 2.2e-16

##              GVIF Df GVIF^(1/(2*Df))
## points   1.098121  1        1.047913
## region_3 1.106438  5        1.010166
## red      1.016899  1        1.008414
## 
## t test of coefficients:
## 
##                            Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)              -4.5766895  0.1687189 -27.1261 < 2.2e-16 ***
## points                    0.0841994  0.0019702  42.7373 < 2.2e-16 ***
## region_3Central Coast     0.3554292  0.0249319  14.2560 < 2.2e-16 ***
## region_3Central Valley    0.0972920  0.0849832   1.1448  0.252339    
## region_3Napa-Sonoma       0.4979879  0.0244435  20.3730 < 2.2e-16 ***
## region_3North Coast       0.2092338  0.0805271   2.5983  0.009400 ** 
## region_3Sierra Foothills  0.1289864  0.0446461   2.8891  0.003883 ** 
## red                       0.3558231  0.0135053  26.3469 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What about wine color?

## 
## Call:
## lm(formula = log(price) ~ points + region_3 + color, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34600 -0.29514 -0.01661  0.27754  2.06541 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.206922   0.167631 -25.096  < 2e-16 ***
## points                    0.084030   0.001943  43.242  < 2e-16 ***
## region_3Central Coast     0.356321   0.023285  15.303  < 2e-16 ***
## region_3Central Valley    0.096864   0.083924   1.154   0.2485    
## region_3Napa-Sonoma       0.499134   0.022156  22.528  < 2e-16 ***
## region_3North Coast       0.208725   0.064861   3.218   0.0013 ** 
## region_3Sierra Foothills  0.133142   0.056092   2.374   0.0177 *  
## colorrose                -0.441221   0.060609  -7.280 3.95e-13 ***
## colorwhite               -0.352315   0.014545 -24.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4319 on 4342 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.4948, Adjusted R-squared:  0.4938 
## F-statistic: 531.5 on 8 and 4342 DF,  p-value: < 2.2e-16

## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)              -4.206922   0.171074 -24.5912 < 2.2e-16 ***
## points                    0.084029   0.001976  42.5244 < 2.2e-16 ***
## region_3Central Coast     0.356321   0.024943  14.2853 < 2.2e-16 ***
## region_3Central Valley    0.096864   0.084906   1.1408  0.254003    
## region_3Napa-Sonoma       0.499134   0.024451  20.4136 < 2.2e-16 ***
## region_3North Coast       0.208725   0.080484   2.5934  0.009536 ** 
## region_3Sierra Foothills  0.133142   0.044781   2.9732  0.002964 ** 
## colorrose                -0.441221   0.042883 -10.2891 < 2.2e-16 ***
## colorwhite               -0.352315   0.013718 -25.6835 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              GVIF Df GVIF^(1/(2*Df))
## points   1.102126  1        1.049822
## region_3 1.110284  5        1.010516
## color    1.023524  2        1.005830

What about vintage?

## 
## Call:
## lm(formula = log(price) ~ points + region_3 + color + age, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34109 -0.29453 -0.01546  0.27667  2.06968 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.191962   0.168050 -24.945  < 2e-16 ***
## points                    0.084154   0.001946  43.251  < 2e-16 ***
## region_3Central Coast     0.355000   0.023308  15.231  < 2e-16 ***
## region_3Central Valley    0.103770   0.084101   1.234  0.21732    
## region_3Napa-Sonoma       0.498577   0.022159  22.500  < 2e-16 ***
## region_3North Coast       0.211874   0.064906   3.264  0.00111 ** 
## region_3Sierra Foothills  0.140109   0.056367   2.486  0.01297 *  
## colorrose                -0.447024   0.060784  -7.354 2.28e-13 ***
## colorwhite               -0.354533   0.014653 -24.195  < 2e-16 ***
## age                      -0.002646   0.002124  -1.246  0.21289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4319 on 4341 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.4949, Adjusted R-squared:  0.4939 
## F-statistic: 472.7 on 9 and 4341 DF,  p-value: < 2.2e-16

## 
## t test of coefficients:
## 
##                            Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)              -4.1919617  0.1721607 -24.3491 < 2.2e-16 ***
## points                    0.0841537  0.0019754  42.5998 < 2.2e-16 ***
## region_3Central Coast     0.3550001  0.0249706  14.2167 < 2.2e-16 ***
## region_3Central Valley    0.1037699  0.0850409   1.2202  0.222442    
## region_3Napa-Sonoma       0.4985769  0.0244911  20.3575 < 2.2e-16 ***
## region_3North Coast       0.2118736  0.0807259   2.6246  0.008705 ** 
## region_3Sierra Foothills  0.1401094  0.0452691   3.0950  0.001980 ** 
## colorrose                -0.4470236  0.0431728 -10.3543 < 2.2e-16 ***
## colorwhite               -0.3545326  0.0138818 -25.5394 < 2.2e-16 ***
## age                      -0.0026461  0.0022510  -1.1756  0.239839    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##              GVIF Df GVIF^(1/(2*Df))
## points   1.105027  1        1.051203
## region_3 1.136712  5        1.012896
## color    1.043625  2        1.010732
## age      1.047616  1        1.023531
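
The last column of the GVIF tables, GVIF^(1/(2*Df)), rescales each term so that multi-coefficient factors such as region_3 can be compared with single-coefficient terms such as points. The sketch below recomputes that column in Python from the GVIF and Df values printed above; the function name `adjusted_gvif` is an illustrative assumption.

```python
def adjusted_gvif(gvif, df):
    """GVIF^(1/(2*Df)): puts terms with different degrees of freedom
    on a comparable scale."""
    return gvif ** (1.0 / (2.0 * df))

# (GVIF, Df) pairs copied from the output for the model that includes age.
terms = {
    "points": (1.105027, 1),
    "region_3": (1.136712, 5),
    "color": (1.043625, 2),
    "age": (1.047616, 1),
}

adjusted = {name: adjusted_gvif(g, df) for name, (g, df) in terms.items()}
for name, value in adjusted.items():
    print(f"{name:10s} {value:.6f}")
```

All adjusted values sit close to 1, which suggests little collinearity among these predictors.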

==========================================================================================
                          Dependent variable: log(price)
                          ----------------------------------------------------------------
                          (1)             (2)             (3)             (4)
------------------------------------------------------------------------------------------
points                    0.101***        0.087***        0.084***        0.084***
                          (0.002)         (0.002)         (0.002)         (0.002)

region_3Central Coast                     0.383***        0.356***        0.355***
                                          (0.025)         (0.023)         (0.023)

region_3Central Valley                    0.145           0.097           0.104
                                          (0.090)         (0.084)         (0.084)

region_3Napa-Sonoma                       0.542***        0.499***        0.499***
                                          (0.024)         (0.022)         (0.022)

region_3North Coast                       0.199***        0.209***        0.212***
                                          (0.069)         (0.065)         (0.065)

region_3Sierra Foothills                  0.229***        0.133**         0.140**
                                          (0.060)         (0.056)         (0.056)

colorrose                                                 -0.441***       -0.447***
                                                          (0.061)         (0.061)

colorwhite                                                -0.352***       -0.355***
                                                          (0.015)         (0.015)

age                                                                       -0.003
                                                                          (0.002)

Constant                  -5.419***       -4.597***       -4.207***       -4.192***
                          (0.185)         (0.178)         (0.168)         (0.168)

------------------------------------------------------------------------------------------
Observations              4,351           4,351           4,351           4,351
R2                        0.347           0.423           0.495           0.495
Adjusted R2               0.347           0.422           0.494           0.494
Residual Std. Error       0.491           0.462           0.432           0.432
  (df)                    4349            4344            4342            4341
F Statistic               2,308.693***    530.524***      531.474***      472.654***
  (df)                    1; 4349         6; 4344         8; 4342         9; 4341
==========================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Build Regression Models on Test Dataset