Return of the Stats Project
It’s been almost exactly a year since I started this blog and my statistics project. Life got in the way and I had to put it on the back burner (whoops). But I’m back and ready to kick some multivariate ass!
Last time I had identified my data source, the ESS, and had some ideas about how to proceed, if only I could get the data into Excel. A very helpful person commented on my post (blog?) saying that I could directly download the ESS datasets as an Excel spreadsheet. At the time I don’t think I was able to download the ESS Round 8 as a CSV file (I think you can now) so I settled for the ESS Round 7.
For my analysis I decided to use the work of Daniel Oesch, a political sociologist, to try and investigate the effects of class on voting behaviour in the UK (the 2010 general election to be precise). I was already familiar with Oesch’s unique class schema, but coding the ESS data into something workable seemed like quite the daunting task. Fortunately Oesch had some material on his website which allowed me to easily convert the ESS data (in ISCO format) into 8 different classes. Thanks!
With my goal in mind, I set about cleaning the dataset into something usable. I was only interested in a handful of variables: gender, age, education, public sector employment, class (defined by Oesch’s schema), and voting record. Using these variables I could keep my work as closely comparable to Oesch’s as possible and avoid making too many mistakes. My hope was that I could easily compare my results to his.
Many of the ESS responses had no data available for the variables I was going to investigate. By the time I had finished cleaning, my sample size had shrunk from 2265 to 735. The biggest culprit here was probably voter turnout. One third of my data was unusable as I was only interested in people who had actually voted. I knew this was a problem at the time, but only realised how big it was when I reached the end of my regression. I’ve got more to say on this but I will save it for another, more exciting, blog post! Wowee.
I began by looking at the probability of voting Conservative. I needed to use a logistic regression since my dependent variable (regressand) was a binary variable. Using a linear regression would have produced some pretty strange results, such as predicted probabilities below 0 or above 1. My regressors were also all binary variables, so I was mindful not to fall into the dummy variable trap: each set of categories needs one category left out as the reference. Finding a way to do a logistic regression in Excel without paying a fortune was a little difficult, but I overcame this minor hurdle and was finally ready to press the big red button.
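For anyone who hasn’t met the dummy variable trap before, here is a rough Python sketch of the idea (I did the real thing in Excel, so this is only an illustration). The category labels and the `dummy_code` helper are made up for the example and are not the actual ESS or Oesch codes.

```python
# Hypothetical example: turn one categorical variable into 0/1 columns,
# leaving out the reference category to avoid the dummy variable trap.

def dummy_code(values, reference):
    """Return one 0/1 column per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Made-up education labels for four respondents.
education = ["compulsory", "vocational", "tertiary", "vocational"]
columns = dummy_code(education, reference="compulsory")

for name, column in columns.items():
    print(name, column)
```

A respondent in the reference category simply has 0 in every column, which is why the other coefficients are read relative to that category.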
When I initially chose to use ESS data I noted that they had sampling weights. At the time I thought this was great. I have since learned that there is a lot of disagreement and confusion about when it is appropriate to use weights, and that they can make results less reliable. On top of this, I was unsure if I was using the weights in my regression correctly. So I performed an unweighted regression and a weighted regression to see if the results were much different. I’ll come back to this next time.
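To get a feel for how much weights can move things, here is a small Python sketch. It leans on a simplification: with a single binary regressor, the odds ratio from a logistic regression is just the cross-product ratio of the 2x2 table, so no fitting routine is needed. My actual model had many regressors, and the respondents and weights below are invented, so treat this purely as an illustration.

```python
# Hypothetical rows: (public_sector, voted_conservative, sampling_weight).
rows = [
    (1, 1, 0.5),
    (1, 0, 1.5),
    (1, 0, 1.0),
    (0, 1, 1.0),
    (0, 1, 2.0),
    (0, 0, 1.0),
]

def odds_ratio_2x2(rows, weighted):
    """Cross-product odds ratio, counting each row as its weight or as 1."""
    cells = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for x, y, w in rows:
        cells[(x, y)] += w if weighted else 1.0
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])

unweighted = odds_ratio_2x2(rows, weighted=False)
weighted = odds_ratio_2x2(rows, weighted=True)
print(unweighted, weighted)
```

Same respondents, different answers, which is exactly why I wanted to see both versions side by side.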
With the results in front of me, I needed to know if any of the coefficients were statistically significant. A handful of Wald tests later and … oh dear. In the unweighted analysis, only the coefficient on public sector employment was significant. In the weighted analysis, this remained significant, and the coefficients on vocational secondary education and managers were weakly significant. These results were disappointing, and I decided not to take the analysis of this data any further, as my sample sadly seemed too small to produce significant results.
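For reference, a Wald test on a single coefficient boils down to dividing the estimate by its standard error and comparing the result to a standard normal distribution. Here is a minimal Python sketch; the coefficient and standard error are invented numbers, not values from my regression.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z statistic and its two-sided p-value."""
    z = coef / std_err
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical coefficient and standard error.
z, p = wald_test(0.5, 0.25)
print(z, p)
```

The smaller the sample, the larger the standard errors tend to be, which shrinks z and makes significance harder to reach.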
Full results below:
|Small business owners||1.48||1.92|
|Compulsory or incomplete schooling (reference)|
|Post-secondary, but no tertiary degree||0.68||0.73|
These figures are odds ratios: the odds of voting for the Conservative party in the UK 2010 general election, as opposed to voting for another party, relative to the reference categories. *** Significant at the 0.001 level; ** at the 0.01 level; * at the 0.05 level. Data source: European Social Survey Round 7.
That table looked a lot nicer in Word!
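In case the link between regression coefficients and odds ratios is unfamiliar: the odds ratio is the exponential of the logistic coefficient, and a 95% confidence interval comes from exponentiating the ends of the interval for the coefficient. Here is a short Python sketch. It uses the 1.48 reported for small business owners in the table, but the standard error of 0.3 is a made-up number for illustration only.

```python
import math

def odds_ratio(coef, std_err, z_crit=1.96):
    """Return the odds ratio and an approximate 95% confidence interval."""
    point = math.exp(coef)
    lower = math.exp(coef - z_crit * std_err)
    upper = math.exp(coef + z_crit * std_err)
    return point, lower, upper

# log(1.48) is the coefficient implied by an odds ratio of 1.48;
# the standard error here is hypothetical.
point, lower, upper = odds_ratio(math.log(1.48), 0.3)
print(point, lower, upper)
```

If the interval contains 1, the odds ratio is not significantly different from the reference category at the 5% level.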
These two charts show the standardised coefficients with their 95% confidence intervals. I’ve shaded the significant results in green. I’m aware some of the labels are hard to read and the images are blurry.
I also had a little look at trying to represent the results graphically on the advice of a friend. That seemed to be more trouble than it was worth in this case and at this time. I might come back to that at a later date.
A Sad Conclusion
So my first regression didn’t go quite as planned. I had some problems with my sample size, I’m still not sure if I was using the sampling weights correctly, and I’m not as familiar with logistic regressions as I would like. But I’m not going to give up. I plan on writing another post about some of the specific issues I had, and possible solutions to them. Then I’m going to try and do some nice easy linear regressions. We can all breathe a sigh of relief.