Stats Project: Class Voting Post-Mortem

Introduction

In my last post I wrote up my first ever logistic regression and noted that it didn’t go as well as I would have liked. I’ve learned a lot about the practice of performing regressions in Excel so the process was still useful. I had some specific issues with my sample size and sampling weights, which I want to discuss a little bit more now.

Sample Size

I performed some Wald tests on the estimated coefficients of the regression (see my last post for the results). One of the determinants of the Wald statistic is the sample size. Put simply, the Wald statistic is based on the estimated coefficient divided by its standard error, and the standard error shrinks as the sample grows. So if there is a real effect, a larger sample gives a larger Wald statistic and a smaller associated p-value, meaning that the null hypothesis (in this case that the coefficient is equal to 0) will be rejected more often. If I had a larger sample, more of my results would probably be significant, provided the effects are really there.
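To make the link between sample size and the Wald statistic a bit more concrete, here's a rough Python sketch (not what I actually did in Excel). It assumes the z form of the Wald statistic, coefficient divided by standard error, and it assumes the standard error shrinks roughly like 1/sqrt(n) while the coefficient stays put. The coefficient and standard error are made up for illustration; only the 735 is my real sample size.

```python
import math

def wald_z(coef, se):
    """Wald z statistic for the null hypothesis that the coefficient is 0."""
    return coef / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def se_at_new_n(se, n_old, n_new):
    """Assumption: the standard error scales like 1/sqrt(n)."""
    return se * math.sqrt(n_old / n_new)

# Hypothetical coefficient and standard error, for illustration only.
coef, se, n = 0.40, 0.25, 735
for n_new in (735, 1470, 2940):
    z = wald_z(coef, se_at_new_n(se, n, n_new))
    print(n_new, round(z, 2), round(two_sided_p(z), 3))
```

Each doubling of the sample multiplies z by about 1.41 rather than 2, which is one way of seeing why bigger samples help less and less.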

Was my sample size too small? How much larger would my sample need to be to produce the statistical significance I wanted? The first question is trickier than the second. I began by reading some academic articles about sample sizes and found that there was a bit of a debate around the topic. For logistic regressions there is a rule of thumb that there need to be at least 10 events per predictor variable. Vittinghoff and McCulloch have argued that such a rule is too conservative, but others have argued the opposite.

In my case that meant that I needed 10 Conservative voters for every variable I had thrown into my regression (150 Conservative voters in total). My sample had a total of 238 such voters so it would seem that, despite my previous conclusion, my sample was sufficiently large to generate solid results. In fact, my sample size would be large enough to look at Labour voters as well (but I wouldn’t be able to do an analysis of Lib Dem voters).
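The arithmetic behind that check is simple enough to sketch in Python. The 15 predictors is the count implied by my 150 figure, and the 10 is just the rule of thumb, so treat both as assumptions rather than hard requirements.

```python
def min_events_needed(n_predictors, rule=10):
    """Events required under an 'events per variable' rule of thumb."""
    return rule * n_predictors

def events_per_variable(n_events, n_predictors):
    """How many events the sample actually has per predictor."""
    return n_events / n_predictors

n_predictors = 15          # implied by the 150 figure above
conservative_voters = 238  # the 'events' in my sample

print(min_events_needed(n_predictors))
print(round(events_per_variable(conservative_voters, n_predictors), 1))
```

Strictly speaking the "events" should be the rarer of the two outcomes. With 735 respondents and 238 Conservative voters, that is still the Conservative voters.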

But surely I could get better results if my sample size was much bigger? After all, there are some much larger surveys out there (which are admittedly much more difficult to get hold of). It turns out that increasing sample size has diminishing returns and that a sample as large as the one I had was about as good as it got. My results might have been a little bland, but that wasn’t necessarily a result of the size of my ESS sample.

Sample Weighting

Turning to sample weighting, the academic literature here was a bit more intimidating. Solon, Haider and Wooldridge show just how tricky the decision to weight samples actually is. Sometimes it is best to use weights in regressions and other times it makes everything worse. It seems academics have as much trouble and confusion around the topic as I have myself. This was reassuring if also deeply troubling.

My decision to include both weighted and unweighted regressions seems to have been the correct one. If in doubt, shrug and throw both of them into the mix. There are also tests that help decide whether it’s appropriate to weight or not, but these seem a little advanced for me at the moment. This is definitely something I want to explore in more detail when I come to doing some linear regressions and have a clearer understanding of what I’m actually doing.

Graphs, Charts, and Excel

The last thing I want to discuss in this post is how to present regression results visually. I was discussing this project with a friend and they were very keen to get me to present my findings in something other than a table. I spent a lot of time in my second and third years at university staring at tables of regression coefficients and I was reluctant to try and present my findings in any other format.

I gave in and threw in some box plots (after making them a little nicer to look at) that Excel had vomited up for me. Presenting the data in the form of a graph didn’t make much sense to me since all my variables were binary. If anyone knows of any other way to present my results then let me know!

I think that linear regressions are much easier to present visually than what I’ve just worked on. During my research I came across a lot of cool ways to present data and I’d love to get around to doing some of that soon. It looks like learning R might be a good idea, particularly for partial dependence plots, as purely relying on Excel might become a little limiting.

Final Thoughts

Performing a grisly post-mortem on my logistic regression was very useful and I think I might come back to this at some point in the future when I’ve got some more statistical experience under my belt. In hindsight I bit off a lot more than I could chew. With that in mind, I’m going to do some very basic stuff next: hypothesis tests and linear regressions. I’m not sure what the topic will be yet, but I’ve got a few ideas!

Stats Project: Class Voting Analysis

Return of the Stats Project

It’s been almost exactly a year since I started this blog and my statistics project. Life got in the way and I had to put this on the backburner (whoops). But I’m back and ready to kick some multivariate ass!

Last time I had identified my data source, the ESS, and had some ideas about how to proceed, if only I could get the data into Excel. A very helpful person commented on my post (blog?) saying that I could directly download the ESS datasets as an Excel spreadsheet. At the time I don’t think I was able to download the ESS Round 8 as a CSV file (I think you can now) so I settled for the ESS Round 7.

Class Voting

For my analysis I decided to use the work of Daniel Oesch, a political sociologist, to try and investigate the effects of class on voting behaviour in the UK (the 2010 general election to be precise). I was already familiar with Oesch’s unique class schema, but coding the ESS data into something workable seemed like quite the daunting task. Fortunately Oesch had some material on his website which allowed me to easily convert the ESS data (in ISCO format) into 8 different classes. Thanks!

With my goal in mind, I set about cleaning the dataset into something usable. I was only interested in a handful of variables: gender, age, education, public sector employment, class (defined by Oesch’s schema), and voting record. Using these variables I could keep my work as closely comparable to Oesch’s as possible and avoid making too many mistakes. My hope was that I could easily compare my results to his.

Many of the ESS responses had no data available for the variables I was going to investigate. Eventually my sample size had been reduced from 2265 to 735. The biggest culprit here was probably voter turnout. One third of my data was unusable as I was only interested in people who had actually voted. I knew this was a problem at the time, but only realised how big it was when I reached the end of my regression. I’ve got more to say on this but I will save it for another, more exciting, blog post! Wowee.
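For what it’s worth, the filtering I did by hand in Excel boils down to something like this Python sketch. The field names are invented for illustration and are not the real ESS variable names.

```python
def keep_complete_voters(rows, needed):
    """Keep respondents who voted and have a value for every needed variable."""
    return [
        r for r in rows
        if r.get("voted") is True and all(r.get(k) is not None for k in needed)
    ]

# Hypothetical rows; the field names are made up, not real ESS variables.
rows = [
    {"voted": True, "age": 42, "sector": "Public"},
    {"voted": False, "age": 30, "sector": "Private"},   # dropped: did not vote
    {"voted": True, "age": None, "sector": "Private"},  # dropped: missing age
]
clean = keep_complete_voters(rows, ["age", "sector"])
print(len(rows), len(clean))
```

Dropping every row with any gap is the bluntest option going, and it is exactly how a sample of 2265 can shrink to 735.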

I began by looking at the probability of voting Conservative. I needed to use a logistic regression since my dependent variable (regressand) was a binary variable. Using a linear regression would have produced some pretty strange results, such as predicted probabilities below 0 or above 1. My regressors were also all binary variables, so I was mindful not to fall into the dummy variable trap: each set of categories needs one category left out as the reference. Finding a way to do a logistic regression in Excel without paying a fortune was a little difficult, but I overcame this minor hurdle and was finally ready to press the big red button.
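Here’s a small Python sketch of how the dummy variable trap is avoided: a variable with k categories gets k-1 dummy columns, with the reference category left out. The respondents are hypothetical, and the function name is my own invention.

```python
def dummy_code(values, categories, reference):
    """Return one 0/1 column per category, except the reference category."""
    kept = [c for c in categories if c != reference]
    return kept, [[1 if v == c else 0 for c in kept] for v in values]

# Hypothetical respondents, using the sector variable from my analysis.
sector = ["Private", "Public", "Public", "Private"]
columns, rows = dummy_code(sector, ["Private", "Public"], reference="Private")
print(columns)  # ['Public']
print(rows)     # [[0], [1], [1], [0]]
```

If I kept a column for every category, the columns would add up to 1 in every row, which duplicates the intercept and leaves the regression with no unique solution.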

When I initially chose to use ESS data I noted that they had sampling weights. At the time I thought this was great. I have since learned that there is a lot of disagreement and confusion about when it is appropriate to use weights, and that they can make results less reliable. On top of this, I was unsure if I was using the weights in my regression correctly. So I performed an unweighted regression and a weighted regression to see if the results were much different. I’ll come back to this next time.
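To show what weighting does to an estimate, here’s a stripped-down Python sketch. It leans on the fact that a logistic regression with one binary regressor gives an odds ratio equal to the cross-product ratio of the 2x2 table, so it is only a stand-in for my full regression. The data and weights are made up.

```python
def weighted_odds_ratio(x, y, w=None):
    """Odds ratio of y=1 for x=1 versus x=0, with optional sampling weights.

    With a single binary regressor this equals exp(coefficient) from a
    logistic regression, so it stands in for the full model here.
    """
    if w is None:
        w = [1.0] * len(x)  # unweighted: every respondent counts once
    # Weighted totals for the 2x2 table, indexed as cells[x][y].
    cells = [[0.0, 0.0], [0.0, 0.0]]
    for xi, yi, wi in zip(x, y, w):
        cells[xi][yi] += wi
    odds_x1 = cells[1][1] / cells[1][0]
    odds_x0 = cells[0][1] / cells[0][0]
    return odds_x1 / odds_x0

# Hypothetical data: x = public sector (1/0), y = voted Conservative (1/0).
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 0, 0, 0, 1, 1, 0, 0]
w = [1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 0.5]  # made-up sampling weights

print(weighted_odds_ratio(x, y))
print(weighted_odds_ratio(x, y, w))
```

This only covers the point estimate. Weights also change the standard errors, which this sketch ignores completely.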

With the results in front of me, I needed to know if the coefficients were significant to any degree. A handful of Wald tests later and … oh dear. In the unweighted analysis, only the coefficient on public sector employment was significant. In the weighted analysis, this remained significant, and the coefficients on vocational secondary education and managers were weakly significant (at the 0.05 level). These results were disappointing and I decided not to do any more analysis on my data as my sample sadly seemed too small to obtain any significant results.

Full results below:

GB 2010:

Conservative Party

                                          Unweighted  Weighted
Class
  Socio-cultural specialists              1.32        1.60
  Service workers                         0.91        1.03
  Technical specialists                   0.82        1.14
  Production workers                      0.98        1.36
  Managers                                1.73        2.05*
  Clerks (reference)
  Traditional bourgeoisie                 1.71        2.04
  Small business owners                   1.48        1.92
Education
  Compulsory or incomplete schooling (reference)
  Vocational secondary                    1.49        1.73*
  General secondary                       0.91        0.98
  Post-secondary, but no tertiary degree  0.68        0.73
  Tertiary degree                         1.17        1.13
Gender
  Male                                    0.91        1.00
Age
  20-35                                   0.66        0.72
  35-50 (reference)
  51-65                                   1.36        1.14
Sector
  Private (reference)
  Public                                  0.48**      0.54**
N                                         735         735

These figures are odds ratios: the odds of voting for the Conservative party in the UK 2010 general election rather than not voting for them, relative to the reference categories. A value below 1 means lower odds than the reference category and a value above 1 means higher odds. *** Significant at the 0.001 level; ** at the 0.01 level; * at the 0.05 level. Data source: European Social Survey Round 7.
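If it helps to see where an odds ratio comes from, here’s a short Python sketch that turns a logit coefficient into an odds ratio with an approximate 95% confidence interval. The 0.48 is the unweighted public sector figure from the table, but the standard error is hypothetical because I haven’t reported the real one here.

```python
import math

def odds_ratio_ci(coef, se, z=1.96):
    """Odds ratio and approximate 95% confidence interval from a logit coefficient."""
    return (math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se))

# An odds ratio of 0.48 corresponds to a logit coefficient of ln(0.48).
coef = math.log(0.48)
se = 0.25  # hypothetical standard error, for illustration only

or_, low, high = odds_ratio_ci(coef, se)
print(round(or_, 2), round(low, 2), round(high, 2))
```

An odds ratio of 0.48 means the odds of voting Conservative for public sector workers were roughly half those of private sector workers, holding the other variables constant. If the interval doesn’t include 1, the result is significant at about the 0.05 level.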

That table looked a lot nicer in Word!

Conservative Weighted

Conservative Unweighted

These two charts show the standardised coefficients with their 95% confidence intervals. I’ve shaded the significant results in green. I’m aware some of the labels are hard to read and the images are blurry.

I also had a little look at trying to represent the results graphically on the advice of a friend. That seemed to be more trouble than it was worth in this case and at this time. I might come back to that at a later date.

A Sad Conclusion

So my first regression didn’t go quite as planned. I had some problems with my sample size, I’m still not sure if I was using the sampling weights correctly, and I’m not as familiar with logistic regressions as I would like. But I’m not going to give up. I plan on writing another post about some of the specific issues I had, and possible solutions to them. Then I’m going to try and do some nice easy linear regressions. We can all breathe a sigh of relief.

Stats Project: Introduction

A Short Introduction

Welcome to my stats project! I’ve wanted to do some hands-on statistical analysis for years but I’ve never gotten round to it – until now. In this post (blog? article?), I want to outline what I’m planning to do, and in future posts I’ll discuss every painful stage of my little stats project. Hopefully this will be interesting.

So what is my project? Well, I’m going to play around with some data from the European Social Survey. I’m hoping to import the data from the ESS into Excel and R, and then perform some multivariate regressions. I’m not sure exactly what data I’m going to be looking at, but I suspect I’ll end up looking at the determinants of voting behaviour. I know the academic literature on political sociology quite well so I think I know what I’m doing.

The European Social Survey

I think a good place to start with this project is to have a little look at what the ESS actually is. I know that it’s used quite a lot in academic articles and that it’s a survey that covers lots of European countries. Other than that I’m really not sure. So let’s have a look!

The ESS has been around since 2001. It collects data every two years through face-to-face interviews. It’s been conducted in 35 European countries (unless I can’t count), but only 15 countries have participated in each round of the ESS. Not to worry. I only plan on having a look at data from the UK which has participated in all 8 rounds of the ESS. The questionnaire of the ESS is made up of two parts – a core module and a rotating module. It’s an absolute beast of a survey.

But how reliable is this survey? Can I trust any conclusions I draw from its data? Fortunately, the ESS seems to have thought quite a lot about this. The main issues I can understand (I’m not used to this) are sampling, measurement error, and non-response bias. I’m going to ignore methodological concerns for the time being and assume that the ESS has designed the perfect survey. I’ll probably come back to it in a future post. Bet you can’t wait for that one!

The Way Forward

Data from the ESS is publicly available but can only be downloaded in SAS, SPSS or Stata formats. This is my first obstacle. I know that you can take this data and squish it into Excel and R but I don’t have any idea how. I’ll have to use some googlefu to work that one out. I’ll return to this project in another blog thing when I’ve done that.