Main Variable Comparison
## [1] 0.5888724
X will be the price of a bottle of wine (metric), grape type (categorical), region (categorical), and vintage (categorical).
X main concept - Price
X sub concepts - Grape variety, region, year of vintage
Y will be the wine rating on an ordinal scale, defined as the number of points Wine Enthusiast awarded the wine on a scale of 80-100 (reviews for wines that score 1-79 are not available in the raw data set).
Reviews: The sampling frame is composed exclusively of anonymous reviewers. Because we have limited information about the reviewers, we assume that all anonymous reviewers are unique, have equal access to all wines in our model, and are drawn from the same distribution.
Universe
Countries: We will restrict the sampling frame to wines produced in the US. Assuming country and region are collinear, this restriction allows us to use region as an X concept in our regression models.
Grape Varieties: We will be restricting our model to the top 3 grape varieties.
Region: We will be restricting our model to the top 3 regions in the US.
Vintage: We will include 5-6 vintage year categories.
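The restrictions above could be applied with a filter along these lines. This is a minimal sketch on a made-up data frame: the column names (`country`, `variety`, `points`, `price`) and the helper `top_n_levels` are assumptions for illustration, not names taken from our actual cleaning code.

```r
# Toy stand-in for the raw review data; all values are invented.
wine <- data.frame(
  country = c("US", "US", "France", "US", "US", "US", "Italy", "US", "US", "US"),
  variety = c("Pinot Noir", "Chardonnay", "Pinot Noir", "Cabernet Sauvignon",
              "Pinot Noir", "Zinfandel", "Sangiovese", "Chardonnay",
              "Cabernet Sauvignon", "Pinot Noir"),
  points  = c(90, 88, 92, 91, 87, 85, 89, 86, 93, 88),
  price   = c(45, 20, 60, 75, 30, 18, 25, NA, 80, 35),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most frequent values of a character vector.
top_n_levels <- function(x, n) {
  names(sort(table(x), decreasing = TRUE))[seq_len(n)]
}

# Step 1: restrict the sampling frame to US wines.
us <- wine[wine$country == "US", ]

# Step 2: keep only the top 3 grape varieties within the US subset.
keep_varieties <- top_n_levels(us$variety, 3)
us_top <- us[us$variety %in% keep_varieties, ]

table(us_top$variety)
```

The same helper would apply to a region column for the top-3-regions restriction. Rows with a missing price are kept here and would be dropped later by `lm()`.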
Points ~ Price
Options:
Price (Y) ~ Points (X)
- grape variety could be more defensible (more valuable than country/region?)
- vintage as a rule of thumb for predicting price ~ useful as a control variable

Additional Covariates:
* Vintage
* Grape Variety
* Region

Scope Decisions:
* Global Model: country + vintage + grape variety <- eliminate due to issues w/ IID
* Country-specific: region + vintage + grape variety

Our Theory:
* type of grape likely covaries with region due to temperature and soil requirements
* some variation could be explained by variety and by region
* vintage may explain some variation, but it could be limited
##
## Call:
## lm(formula = price ~ points, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.66 -13.68 -4.06 8.30 436.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -323.5353 9.0723 -35.66 <2e-16 ***
## points 4.1204 0.1033 39.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.09 on 4349 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.268, Adjusted R-squared: 0.2679
## F-statistic: 1592 on 1 and 4349 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(price) ~ points, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.59562 -0.34285 0.00696 0.32167 2.34733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.419263 0.184777 -29.33 <2e-16 ***
## points 0.101048 0.002103 48.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4907 on 4349 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.3468, Adjusted R-squared: 0.3466
## F-statistic: 2309 on 1 and 4349 DF, p-value: < 2.2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.4192630 0.1859504 -29.144 < 2.2e-16 ***
## points 0.1010480 0.0021095 47.902 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
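Because the outcome is log(price), the slope on points reads as a proportional change in price rather than a dollar amount. The short calculation below converts the reported coefficient (0.101048) into a percentage; the variable names are ours, and the 5-point comparison is only an illustration.

```r
# Slope on points from the log(price) ~ points fit reported above.
b_points <- 0.101048

# In a log-linear model, a one-unit increase in x multiplies y by exp(b).
pct_per_point <- (exp(b_points) - 1) * 100
pct_per_point   # roughly 10.6 percent higher price per additional point

# Same logic across a 5-point gap, e.g. an 85-point vs a 90-point wine.
pct_per_5_points <- (exp(5 * b_points) - 1) * 100
pct_per_5_points
```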
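Yes, it explains roughly 7 to 8 more percentage points of the variation in the model (R-squared is 0.347 for log(price) ~ points and 0.423 once region_3 is added).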
##
## Call:
## lm(formula = log(price) ~ points + region_3, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.45066 -0.32785 -0.01909 0.30370 2.17533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.596534 0.178086 -25.811 < 2e-16 ***
## points 0.086849 0.002069 41.967 < 2e-16 ***
## region_3Central Coast 0.382666 0.024844 15.403 < 2e-16 ***
## region_3Central Valley 0.144859 0.089649 1.616 0.106201
## region_3Napa-Sonoma 0.542000 0.023583 22.983 < 2e-16 ***
## region_3North Coast 0.199413 0.069302 2.877 0.004028 **
## region_3Sierra Foothills 0.229197 0.059701 3.839 0.000125 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4615 on 4344 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4229, Adjusted R-squared: 0.4221
## F-statistic: 530.5 on 6 and 4344 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## points 1.094792 1 1.046323
## region_3 1.094792 5 1.009098
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.596534 0.178867 -25.6981 < 2.2e-16 ***
## points 0.086849 0.002087 41.6143 < 2.2e-16 ***
## region_3Central Coast 0.382666 0.026140 14.6391 < 2.2e-16 ***
## region_3Central Valley 0.144859 0.088901 1.6294 0.103291
## region_3Napa-Sonoma 0.542000 0.025455 21.2926 < 2.2e-16 ***
## region_3North Coast 0.199413 0.075002 2.6587 0.007872 **
## region_3Sierra Foothills 0.229197 0.044782 5.1181 3.22e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
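The "t test of coefficients" blocks have the format printed by `coeftest()` from the lmtest package, and their standard errors differ slightly from the `summary()` output, which suggests a heteroskedasticity-robust covariance. The report does not say which estimator was used, so the sketch below is an assumption: it computes the simplest variant (HC0) by hand in base R on simulated data, only to show where such standard errors come from.

```r
set.seed(1)
n <- 500
x <- runif(n, 80, 100)
# Noise spread grows with x, so the errors are heteroskedastic by design.
y <- -5 + 0.1 * x + rnorm(n, sd = 0.05 * (x - 79))
fit <- lm(y ~ x)

X <- model.matrix(fit)            # n x p design matrix
e <- residuals(fit)               # OLS residuals

bread <- solve(crossprod(X))      # (X'X)^-1
meat  <- crossprod(X * e)         # sum over i of e_i^2 * x_i x_i'
vcov_hc0 <- bread %*% meat %*% bread

robust_se    <- sqrt(diag(vcov_hc0))
classical_se <- sqrt(diag(vcov(fit)))

# Estimates are unchanged; only the standard errors and t values move.
cbind(estimate = coef(fit), classical_se, robust_se,
      robust_t = coef(fit) / robust_se)
```

The coefficient estimates match between the two blocks of output for every model above for the same reason: a robust covariance changes the uncertainty attached to the estimates, not the estimates themselves.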
##
## Call:
## lm(formula = log(price) ~ points + region_3 + red, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34560 -0.29578 -0.01739 0.27734 2.06623
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.57669 0.16669 -27.456 < 2e-16 ***
## points 0.08420 0.00194 43.402 < 2e-16 ***
## region_3Central Coast 0.35543 0.02328 15.268 < 2e-16 ***
## region_3Central Valley 0.09729 0.08393 1.159 0.24646
## region_3Napa-Sonoma 0.49799 0.02215 22.488 < 2e-16 ***
## region_3North Coast 0.20923 0.06487 3.226 0.00127 **
## region_3Sierra Foothills 0.12899 0.05603 2.302 0.02137 *
## red 0.35582 0.01434 24.806 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.432 on 4343 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4945, Adjusted R-squared: 0.4937
## F-statistic: 606.9 on 7 and 4343 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## points 1.098121 1 1.047913
## region_3 1.106438 5 1.010166
## red 1.016899 1 1.008414
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.5766895 0.1687189 -27.1261 < 2.2e-16 ***
## points 0.0841994 0.0019702 42.7373 < 2.2e-16 ***
## region_3Central Coast 0.3554292 0.0249319 14.2560 < 2.2e-16 ***
## region_3Central Valley 0.0972920 0.0849832 1.1448 0.252339
## region_3Napa-Sonoma 0.4979879 0.0244435 20.3730 < 2.2e-16 ***
## region_3North Coast 0.2092338 0.0805271 2.5983 0.009400 **
## region_3Sierra Foothills 0.1289864 0.0446461 2.8891 0.003883 **
## red 0.3558231 0.0135053 26.3469 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = log(price) ~ points + region_3 + color, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34600 -0.29514 -0.01661 0.27754 2.06541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.206922 0.167631 -25.096 < 2e-16 ***
## points 0.084030 0.001943 43.242 < 2e-16 ***
## region_3Central Coast 0.356321 0.023285 15.303 < 2e-16 ***
## region_3Central Valley 0.096864 0.083924 1.154 0.2485
## region_3Napa-Sonoma 0.499134 0.022156 22.528 < 2e-16 ***
## region_3North Coast 0.208725 0.064861 3.218 0.0013 **
## region_3Sierra Foothills 0.133142 0.056092 2.374 0.0177 *
## colorrose -0.441221 0.060609 -7.280 3.95e-13 ***
## colorwhite -0.352315 0.014545 -24.222 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4319 on 4342 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4948, Adjusted R-squared: 0.4938
## F-statistic: 531.5 on 8 and 4342 DF, p-value: < 2.2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.206922 0.171074 -24.5912 < 2.2e-16 ***
## points 0.084029 0.001976 42.5244 < 2.2e-16 ***
## region_3Central Coast 0.356321 0.024943 14.2853 < 2.2e-16 ***
## region_3Central Valley 0.096864 0.084906 1.1408 0.254003
## region_3Napa-Sonoma 0.499134 0.024451 20.4136 < 2.2e-16 ***
## region_3North Coast 0.208725 0.080484 2.5934 0.009536 **
## region_3Sierra Foothills 0.133142 0.044781 2.9732 0.002964 **
## colorrose -0.441221 0.042883 -10.2891 < 2.2e-16 ***
## colorwhite -0.352315 0.013718 -25.6835 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## GVIF Df GVIF^(1/(2*Df))
## points 1.102126 1 1.049822
## region_3 1.110284 5 1.010516
## color 1.023524 2 1.005830
##
## Call:
## lm(formula = log(price) ~ points + region_3 + color + age, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.34109 -0.29453 -0.01546 0.27667 2.06968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.191962 0.168050 -24.945 < 2e-16 ***
## points 0.084154 0.001946 43.251 < 2e-16 ***
## region_3Central Coast 0.355000 0.023308 15.231 < 2e-16 ***
## region_3Central Valley 0.103770 0.084101 1.234 0.21732
## region_3Napa-Sonoma 0.498577 0.022159 22.500 < 2e-16 ***
## region_3North Coast 0.211874 0.064906 3.264 0.00111 **
## region_3Sierra Foothills 0.140109 0.056367 2.486 0.01297 *
## colorrose -0.447024 0.060784 -7.354 2.28e-13 ***
## colorwhite -0.354533 0.014653 -24.195 < 2e-16 ***
## age -0.002646 0.002124 -1.246 0.21289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4319 on 4341 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4949, Adjusted R-squared: 0.4939
## F-statistic: 472.7 on 9 and 4341 DF, p-value: < 2.2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1919617 0.1721607 -24.3491 < 2.2e-16 ***
## points 0.0841537 0.0019754 42.5998 < 2.2e-16 ***
## region_3Central Coast 0.3550001 0.0249706 14.2167 < 2.2e-16 ***
## region_3Central Valley 0.1037699 0.0850409 1.2202 0.222442
## region_3Napa-Sonoma 0.4985769 0.0244911 20.3575 < 2.2e-16 ***
## region_3North Coast 0.2118736 0.0807259 2.6246 0.008705 **
## region_3Sierra Foothills 0.1401094 0.0452691 3.0950 0.001980 **
## colorrose -0.4470236 0.0431728 -10.3543 < 2.2e-16 ***
## colorwhite -0.3545326 0.0138818 -25.5394 < 2.2e-16 ***
## age -0.0026461 0.0022510 -1.1756 0.239839
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## GVIF Df GVIF^(1/(2*Df))
## points 1.105027 1 1.051203
## region_3 1.136712 5 1.012896
## color 1.043625 2 1.010732
## age 1.047616 1 1.023531
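The GVIF tables above stay close to 1 for every term, which indicates little collinearity among the covariates. For readers who want to see what the statistic measures, here is a base R sketch of the generalized VIF on simulated data. The function name `gvif` and the toy variables are assumptions for illustration; the tables in this report were presumably produced by a packaged routine rather than this code.

```r
# Generalized VIF per model term, from the correlation matrix of the
# coefficient estimates (intercept excluded).
gvif <- function(mod) {
  X <- model.matrix(mod)
  asgn <- attr(X, "assign")            # which term each column belongs to
  keep <- asgn != 0                    # drop the intercept column
  R <- cov2cor(vcov(mod)[keep, keep, drop = FALSE])
  asgn <- asgn[keep]
  term_labels <- attr(terms(mod), "term.labels")
  out <- t(sapply(seq_along(term_labels), function(k) {
    idx <- which(asgn == k)
    g <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det(R)
    c(g, length(idx), g^(1 / (2 * length(idx))))
  }))
  dimnames(out) <- list(term_labels, c("GVIF", "Df", "GVIF^(1/(2*Df))"))
  out
}

# Simulated data: a numeric score and a 3-level region factor that are
# mildly related to each other.
set.seed(2)
n <- 300
region <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
points <- 88 + 2 * (region == "B") + rnorm(n, sd = 3)
log_price <- -4 + 0.09 * points + 0.3 * (region == "C") + rnorm(n, sd = 0.4)

fit <- lm(log_price ~ points + region)
gvif(fit)
```

With exactly two terms the two GVIF values coincide, which matches the points + region_3 table above (1.094792 for both). The last column rescales GVIF by its degrees of freedom so that a 5-df factor such as region_3 is comparable with a 1-df term.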
============================================================================================================================================
                          Dependent variable:
                          ------------------------------------------------------------------------------------------------------------------
                          log(price)
                          (1)                          (2)                          (3)                          (4)
--------------------------------------------------------------------------------------------------------------------------------------------
points                    0.101***                     0.087***                     0.084***                     0.084***
                          (0.002)                      (0.002)                      (0.002)                      (0.002)
region_3Central Coast                                  0.383***                     0.356***                     0.355***
                                                       (0.025)                      (0.023)                      (0.023)
region_3Central Valley                                 0.145                        0.097                        0.104
                                                       (0.090)                      (0.084)                      (0.084)
region_3Napa-Sonoma                                    0.542***                     0.499***                     0.499***
                                                       (0.024)                      (0.022)                      (0.022)
region_3North Coast                                    0.199***                     0.209***                     0.212***
                                                       (0.069)                      (0.065)                      (0.065)
region_3Sierra Foothills                               0.229***                     0.133**                      0.140**
                                                       (0.060)                      (0.056)                      (0.056)
colorrose                                                                           -0.441***                    -0.447***
                                                                                    (0.061)                      (0.061)
colorwhite                                                                          -0.352***                    -0.355***
                                                                                    (0.015)                      (0.015)
age                                                                                                              -0.003
                                                                                                                 (0.002)
Constant                  -5.419***                    -4.597***                    -4.207***                    -4.192***
                          (0.185)                      (0.178)                      (0.168)                      (0.168)
--------------------------------------------------------------------------------------------------------------------------------------------
Observations              4,351                        4,351                        4,351                        4,351
R2                        0.347                        0.423                        0.495                        0.495
Adjusted R2               0.347                        0.422                        0.494                        0.494
Residual Std. Error       0.491 (df = 4349)            0.462 (df = 4344)            0.432 (df = 4342)            0.432 (df = 4341)
F Statistic               2,308.693*** (df = 1; 4349)  530.524*** (df = 6; 4344)    531.474*** (df = 8; 4342)    472.654*** (df = 9; 4341)
============================================================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
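Whether each added block of covariates improves the fit can be checked with a nested-model F statistic, which can be computed directly from the R-squared values and residual degrees of freedom reported above. The helper `nested_f` is a hypothetical name, and because the inputs are rounded to four decimals the results are approximate; a comparison of the fitted model objects would be exact.

```r
# F statistic for adding q coefficients to a nested linear model, computed
# from the R-squared of the smaller and larger fits.
nested_f <- function(r2_small, r2_big, q, df_resid_big) {
  f <- ((r2_big - r2_small) / q) / ((1 - r2_big) / df_resid_big)
  c(F = f, p_value = pf(f, q, df_resid_big, lower.tail = FALSE))
}

# Model (2) -> (3): color adds 2 coefficients; model (3) has 4342 residual df.
color_test <- nested_f(0.4229, 0.4948, q = 2, df_resid_big = 4342)

# Model (3) -> (4): age adds 1 coefficient; model (4) has 4341 residual df.
age_test <- nested_f(0.4948, 0.4949, q = 1, df_resid_big = 4341)

rbind(color = color_test, age = age_test)
```

The pattern agrees with the coefficient tables: color adds substantial explanatory power, while age does not reach conventional significance in model (4).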
[Figure: GGally pairs plot (density, box plot, scatter, and histogram panels). 3 rows with missing values were removed from each panel.]