Introduction
I came across some data on the quality of red wine along with some of its chemical properties and thought that it would be quite fun to look at. I decided to answer the burning question of our time: is more alcoholic red wine better? So I’m going to do a handful of linear regressions on the data to try and find an answer whilst discussing some statistical issues along the way.
I would also like to apologise for the state of some of the graphs and tables that I’ve included. They always seem to look much better before I stick them into the main body of the blog.
Data
The data I’m using can be found here. It was originally discussed by Cortez et al. The data has probably been put to better and more interesting use elsewhere, but I wanted to see what I could produce myself.
The data contains 1,599 entries for different kinds of Portuguese vinho verde wine. Twelve variables were recorded, including alcohol content and the quality of the wine as perceived by three sensory assessors in a blind taste test. The other variables include chemical properties of the wine such as pH level, acidity levels, residual sugar content, and sulphate content. Sadly, the data does not include factors such as grape type, wine brand, or the price of a bottle. Knowing these would be very useful when buying Christmas presents!
Hypothesis and Model
The hypothesis I’m aiming to test here is that wines with a higher alcohol content are more likely to be of a higher quality. I’m not a wine expert at all, so I’m coming at this regression with no sense of what I might find.
I’ll begin with a simple univariate regression of quality against alcohol content. Then I shall include some variables that I think might reduce omitted variable bias. Finally, I’m going to throw all the variables I have available into the regression to see what happens. I’m aware that this approach is not necessarily the most appropriate, but I might as well put these variables to some good use.
Regression 1: Alcohol
The first regression has alcohol content as the only independent variable. As we can see from the graph below, it would seem that the quality of the red wine does indeed increase with alcohol content.
Below is a table showing the output of the regression. From this we can see that a one percentage point increase in alcohol content is associated with an increase in the quality score of about 0.36 points. The null hypothesis (that quality is not related to alcohol content) can easily be rejected: the p-value is effectively zero and the 95% confidence interval (0.328 to 0.394) excludes zero.
| | Coefficient | Standard Error | t-stat | p-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 1.875 | 0.175 | 10.732 | 0.000 | 1.532 | 2.218 |
| Alcohol | 0.361 | 0.017 | 21.639 | 0.000 | 0.328 | 0.394 |

| Regression Statistics | |
|---|---|
| Multiple R | 0.476 |
| R² | 0.227 |
| Adjusted R² | 0.226 |
| Standard Error | 0.710 |
| Observations | 1599 |
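As a quick sanity check on how the columns of the table relate to each other, the t-stat and confidence interval for alcohol can be roughly rebuilt from the coefficient and its standard error. This is only a sketch: the critical value of 1.96 is my assumption (a normal approximation, reasonable with 1,599 observations), and the inputs are the rounded figures from the table, so the results will only approximately match.

```python
# Rounded values copied from the Alcohol row of the regression output.
coef = 0.361
std_err = 0.017

# Assumption: with ~1,600 observations the t distribution is close to
# normal, so 1.96 is used as an approximate 95% critical value.
critical_value = 1.96

t_stat = coef / std_err
lower = coef - critical_value * std_err
upper = coef + critical_value * std_err

print(f"t-stat is roughly {t_stat:.1f}")
print(f"95% interval is roughly ({lower:.3f}, {upper:.3f})")
```

The t-stat comes out a little below the reported 21.639 because the coefficient and standard error have been rounded to three decimal places.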
A final thing to note here is the R² value. This is the fraction of the variance of the wine quality explained, or predicted, by alcohol content. It’s a measure of how well the model fits the data. In this case alcohol explains, or predicts, about 23% of the variance in quality. You can’t predict the quality of a wine using alcohol content alone, but it does seem to be an important factor.
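For anyone curious about what sits behind these numbers, here is a minimal sketch of a one-variable least squares fit and its R², written in plain Python. The alcohol and quality values are made up for illustration and are not taken from the wine data, and `fit_simple_ols` is just a name I have chosen.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (unexplained variation) / (total variation)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up alcohol (% vol) and quality scores, NOT from the wine data set.
alcohol = [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5]
quality = [5, 5, 5, 6, 5, 6, 7, 6]

intercept, slope, r_squared = fit_simple_ols(alcohol, quality)
print(f"quality = {intercept:.2f} + {slope:.2f} * alcohol")
print(f"R^2 = {r_squared:.2f}")
```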
Omitted Variable Bias
I’m only interested in how alcohol affects the quality of red wine, but that doesn’t mean that I can ignore the other variables. My results could be vulnerable to the dreaded omitted variable bias. This bias occurs when a variable has been omitted from the regression that correlates with an included variable and determines the dependent variable. In other words, alcohol might correlate with another variable I’ve excluded which in turn partly explains the quality of the wine.
Why is this a problem? Well, one of the assumptions behind OLS (ordinary least squares) regressions is that the conditional mean of the error term, given the independent variable(s), is zero. If the omitted variable is a determinant of the quality of red wine, then it will be a part of the error term. If the omitted variable is also correlated with alcohol content, then the error term will be correlated with the independent variable and its conditional mean will be nonzero.
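The size of the bias can be illustrated with a toy example. In the sketch below, the data and the "true" coefficients are entirely hypothetical: y depends on both x1 and x2, and x2 is correlated with x1. When x2 is left out, the slope on x1 picks up an extra term equal to the effect of x2 multiplied by the slope from regressing x2 on x1.

```python
def slope(x, y):
    """OLS slope from a regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: x2 plays the role of an omitted variable
# that is positively correlated with the included variable x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

# Assumed "true" model, with no noise: y = 1.0 + 0.3 * x1 + 0.5 * x2
beta1, beta2 = 0.3, 0.5
y = [1.0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

short_slope = slope(x1, y)  # regression of y on x1 only, omitting x2
delta = slope(x1, x2)       # how strongly x2 moves with x1
bias = beta2 * delta

print(f"true effect of x1:         {beta1:.3f}")
print(f"slope with x2 omitted:     {short_slope:.3f}")
print(f"true effect plus the bias: {beta1 + bias:.3f}")
```

Because x2 has a positive effect and is positively correlated with x1 in this toy set-up, the short regression overstates the effect of x1. If either sign were flipped, the bias would go the other way.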
Below is a correlation matrix between the 12 variables. I’ve colour-coded it to highlight the strongest correlations. We can see that there are several that seem to correlate quite strongly with both alcohol and quality. It seems reasonable to think that volatile acidity, measuring the concentration of acetic acid, can affect the quality since high levels of acetic acid can make the wine vinegary. Sulphur dioxide, density, and the concentration of chlorides might also affect the quality of the wine.
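A correlation matrix like this is just the Pearson correlation computed for every pair of columns. A small sketch of the idea, using three columns of made-up values rather than the real data:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up values for three of the variables, NOT from the wine data set.
columns = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "volatile_acidity": [0.70, 0.66, 0.58, 0.50, 0.40, 0.36],
    "quality": [5, 5, 6, 6, 7, 6],
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Print one row per variable, with signed correlations to two decimals.
for a in names:
    print(a.ljust(18), "  ".join(f"{matrix[a][b]:+.2f}" for b in names))
```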
Regression 2: Alcohol, Volatile Acidity, Sulphur Dioxide, Density, and Chlorides
To take the possibility of omitted variable bias into account, I’ve included these variables in another regression. By doing so, I hope to remove much of the bias that might exist in the first regression.
The results are a little difficult to display graphically, especially when limited to Excel. I get the impression that this is much easier to do in R or a commercial statistical package. Nonetheless, I’ve included a graph below showing the predicted quality of wine against its alcohol content.
The relationship here is much as it was in the first regression. But if we look more closely at the results we can see that the coefficient on alcohol has fallen from 0.36 to 0.32. It would seem that the variables omitted from the first regression created a result that was biased upwards. Again, we can reject the null hypothesis that alcohol content and quality are unrelated.
| | Coefficients | Standard Error | t-stat | p-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -18.149 | 10.318 | -1.759 | 0.079 | -38.387 | 2.089 |
| Alcohol | 0.317 | 0.019 | 16.795 | 0.000 | 0.280 | 0.355 |
| Volatile Acidity | -1.350 | 0.095 | -14.175 | 0.000 | -1.537 | -1.164 |
| Total Sulphur Dioxide | -0.002 | 0.001 | -3.727 | 0.000 | -0.003 | -0.001 |
| Density | 21.385 | 10.253 | 2.086 | 0.037 | 1.274 | 41.496 |
| Chlorides | -0.416 | 0.364 | -1.141 | 0.254 | -1.130 | 0.299 |

| Regression Statistics | |
|---|---|
| Multiple R | 0.570 |
| R² | 0.325 |
| Adjusted R² | 0.323 |
| Standard Error | 0.664 |
| Observations | 1599 |
The R² has also increased, from 0.23 to 0.33, which would suggest that the new model “explains” more of the variance in wine quality than the first. But is it so wise to rely on R²? Sadly not. It is an interesting quirk of R² that it never decreases when a new variable is added, and in practice almost always increases, even if the included variable has no relation to the dependent variable. The adjusted R² compensates for this effect to some extent by penalising the number of variables in the model. In this case they aren’t really all that different.
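The adjustment itself is a one-line formula: adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of independent variables. A short sketch applying it to the rounded R² values reported for the first two regressions (the function name is my own):

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded R^2 values and predictor counts from the first two regressions.
adj_first = adjusted_r_squared(0.227, 1599, 1)
adj_second = adjusted_r_squared(0.325, 1599, 5)

print(f"Regression 1: {adj_first:.4f}")
print(f"Regression 2: {adj_second:.4f}")
```

The results should land very close to the reported adjusted R² values, with any small gap coming from the rounding of the inputs. With 1,599 observations and only a handful of variables the penalty is tiny, which is why the two measures barely differ here.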
Reading the Tea Leaves: Residual Plots
One way to tell if a regression model is not quite right is to have a look for any strange results in the residual plots of the regression. A residual plot shows the residuals (the differences between the actual and predicted values of the dependent variable) plotted against an independent variable.
By doing this I can see if there is any nonlinearity which would give rise to model specification bias. I can also see if the residuals are homoskedastic or heteroskedastic.
I grouped the residual plots into three groups. Density, volatile acidity, and alcohol content seemed mostly fine to me. Sulphur dioxide exhibited clear heteroskedasticity, as shown below. That means that the variance of the residuals was dependent on the value of sulphur dioxide. Here we can see that the variance fell as sulphur dioxide increased. Chloride seemed to have some large outliers, which might violate one of the key assumptions underpinning OLS regressions.
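For a rough numerical counterpart to eyeballing a residual plot, one option is to compare the spread of the residuals for low and high values of a variable. The sketch below does this on made-up data in which the scatter shrinks as x grows. It is an informal check of my own devising, not a formal test for heteroskedasticity.

```python
from statistics import pvariance


def residuals(x, y):
    """Residuals (actual minus fitted) from a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Made-up data: the scatter around the line shrinks as x increases.
x = [10, 20, 30, 40, 50, 60, 70, 80]
y = [4.0, 7.0, 4.5, 6.5, 5.2, 5.8, 5.4, 5.6]

resid = residuals(x, y)

# Sort by x, split in half, and compare the residual variance in each half.
pairs = sorted(zip(x, resid))
half = len(pairs) // 2
low_spread = pvariance([r for _, r in pairs[:half]])
high_spread = pvariance([r for _, r in pairs[half:]])

print(f"residual variance, low x:  {low_spread:.3f}")
print(f"residual variance, high x: {high_spread:.3f}")
```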
The fact that the residuals for sulphur dioxide are heteroskedastic is not necessarily a problem for this regression. Some textbooks include homoskedasticity as an assumption behind the OLS model, but it is not needed for the coefficient estimates to be unbiased. What the Gauss-Markov theorem shows is that with homoskedasticity the OLS estimators are efficient: their variance is lower than that of any other linear unbiased estimator. Without homoskedasticity the usual standard errors are no longer reliable, but provided that heteroskedasticity-robust standard errors are used this is not a particular problem. Unfortunately I can’t seem to confirm that the standard errors provided by Excel’s Analysis ToolPak are robust. This might be a problem for my inference, but there’s not a lot I can do about it at the moment.
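To give a feel for the difference between the two kinds of standard error, here is a sketch for the one-variable case that computes both the conventional standard error of the slope and the simplest robust version (often labelled HC0, after White). The data are made up, and this says nothing about what Excel actually reports; it only shows that the two formulas can give different answers on the same data.

```python
from math import sqrt


def slope_standard_errors(x, y):
    """Return (slope, conventional SE, HC0 robust SE) for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (b - my) for d, b in zip(dx, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Conventional: assumes every residual has the same variance.
    conventional = sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0: weights each squared residual by its own squared x deviation.
    robust_hc0 = sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, conventional, robust_hc0


# Made-up data with more scatter at low x than at high x.
x = [10, 20, 30, 40, 50, 60, 70, 80]
y = [4.0, 7.0, 4.5, 6.5, 5.2, 5.8, 5.4, 5.6]

b1, se_conventional, se_robust = slope_standard_errors(x, y)
print(f"slope:           {b1:.4f}")
print(f"conventional SE: {se_conventional:.4f}")
print(f"HC0 robust SE:   {se_robust:.4f}")
```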
The outliers for chloride do not seem to be sufficiently large to create any problems. The relevant OLS assumption is that large outliers are unlikely, which is usually stated as the variables having finite kurtosis. In this case, the kurtosis is roughly 42, which is large but finite.
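Kurtosis measures how heavy the tails of a distribution are, so a few extreme values push it up sharply. The sketch below uses the plain moment definition (under which a normal distribution scores 3) on made-up values. Spreadsheet functions may instead report excess kurtosis with a small-sample correction, so their numbers are not directly comparable with this one.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


# Made-up chloride-like readings, NOT from the wine data set.
evenly_spread = [0.07, 0.08, 0.09, 0.10, 0.11]
with_outlier = [0.07, 0.08, 0.08, 0.09, 0.07, 0.08, 0.09, 0.08, 0.07, 0.60]

kurt_even = kurtosis(evenly_spread)
kurt_outlier = kurtosis(with_outlier)

print(f"evenly spread values: {kurt_even:.2f}")
print(f"one large outlier:    {kurt_outlier:.2f}")
```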
Regression 3: All Variables
After wading through some of those issues and concluding that there weren’t any glaring problems, I decided to throw all the variables into the regression to see what would happen. By doing this, I should reduce omitted variable bias as far as the data allows and get the best estimate available for the effect alcohol has on quality.
The graph below shows the relationship between alcohol and quality. The coefficient has fallen further to ~0.29. The null hypothesis that alcohol has no effect on quality can be rejected. Furthermore, the adjusted R² value has increased to about 0.36.
| | Coefficients | Standard Error | t-stat | p-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 6.180 | 13.437 | 0.460 | 0.646 | -20.176 | 32.535 |
| Volatile Acidity | -1.078 | 0.121 | -8.911 | 0.000 | -1.315 | -0.841 |
| Citric Acid | -0.135 | 0.139 | -0.975 | 0.330 | -0.407 | 0.137 |
| Residual Sugar | 0.010 | 0.014 | 0.746 | 0.456 | -0.016 | 0.037 |
| Chlorides | -1.968 | 0.408 | -4.828 | 0.000 | -2.768 | -1.169 |
| Free Sulphur Dioxide | 0.005 | 0.002 | 2.128 | 0.034 | 0.000 | 0.009 |
| Total Sulphur Dioxide | -0.003 | 0.001 | -4.835 | 0.000 | -0.005 | -0.002 |
| Density | -1.517 | 13.389 | -0.113 | 0.910 | -27.779 | 24.745 |
| pH | -0.546 | 0.133 | -4.099 | 0.000 | -0.808 | -0.285 |
| Sulphates | 0.900 | 0.113 | 7.961 | 0.000 | 0.678 | 1.121 |
| Alcohol | 0.290 | 0.022 | 13.047 | 0.000 | 0.246 | 0.334 |

| Regression Statistics | |
|---|---|
| Multiple R | 0.600 |
| R² | 0.360 |
| Adjusted R² | 0.356 |
| Standard Error | 0.648 |
| Observations | 1599 |
I won’t go into detail about the residual plots here as none seem to display clear nonlinearity. I have already discussed issues of heteroskedasticity and outliers so I don’t want to repeat myself too much.
Conclusion
In conclusion, it would seem that more alcoholic red wines are more likely to be rated as higher quality. I reduced omitted variable bias by controlling for more variables. I discussed heteroskedasticity and some of the assumptions behind OLS regressions, and why I think my regressions are still valid.
It’s been fun to do a linear regression and get to grips with some of the problems and difficulties associated with them. I’m going to be moving on to learning some R now and trying to perform some linear and logistic regressions using that. It might be a little while before I post anything new.