When I did a post-mortem on my class voting project I outlined three problems that I wanted to return to: sample size, sample weighting, and graphs. There was a fourth problem which I had at the back of my mind: missing values. Furthermore, I was determined to return to the project and perform my analysis using R instead of Excel. After a long break over the Christmas period I am back to delve into this statistical chaos!
The “Evil” of Missing Values
Back when I was tidying my data ready for use in Excel I faced a small problem. Some of the respondents to the European Social Survey hadn’t answered certain questions. For instance, many people refused to say which political party they voted for. I didn’t know this at the time, but this is called ‘item nonresponse’. I “dealt” with this problem by excluding all incomplete responses from my data. In practice this meant that I was removing 160 entries from my dataset. Since my dataset was already quite small this resulted in me losing almost 20% of my data. I didn’t think much of this at the time and resolved to come back to the matter later.
Well here I am. I learned that the method I used to brush my incomplete data under the carpet is called ‘listwise deletion’. King, Honaker, Joseph, and Scheve (1998) called this process “evil”. I had apparently committed a cardinal sin. I wasn’t alone: a lot has been written about social scientists mishandling missing values in their data. So what’s the big deal?
First, you lose data, which means your standard errors are going to be larger. This is a big problem that I’ve already written quite a lot about. But more importantly, you can end up with a biased analysis if your data isn’t missing completely at random (MCAR). Data is MCAR when the probability of a variable Y being missing is unrelated both to the value of Y itself and to the value of X (any other variable in the dataset). If the data isn’t MCAR then the remaining complete cases are no longer a random sample, which violates the assumptions underpinning the regression analysis.
I read quite a lot about missing values and there’s still a lot I have to learn. But I concluded that my class voting data was probably MNAR (missing not at random). Data is MNAR when the “missingness” of a variable Y is related to the value of Y itself. In my case, I considered it quite likely that the probability of refusing to answer questions about voting was related to that person’s voting record. The Shy Tory factor (link) is a well-known problem in UK opinion polling and it’s possible that a similar effect was in play here.
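There’s no simple way to prove which mechanism you’re dealing with, but a sensible first step is just looking at the pattern of missingness. The mice package has a handy function for this; a minimal sketch, assuming the dataset is loaded as a data frame called ess (a placeholder name):

```r
library(mice)

# Tabulate the missing-data patterns: each row is a pattern,
# 1 = observed, 0 = missing, with counts in the margins
md.pattern(ess)
```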
Handling data that is MNAR is very tricky and requires some pretty advanced techniques which I don’t have a great understanding of at the moment. However, I found some evidence that a technique called multiple imputation produces less biased regression coefficients than listwise deletion even when the data is MNAR. Multiple imputation essentially generates, or imputes, multiple possible values for the missing data based on all the variables in the dataset. Analysis is then performed on each imputed dataset (in my case, a logistic regression) before the results are “pooled” into a single set of estimates.
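The whole impute–analyse–pool workflow only takes a few lines with the mice package. This is a sketch rather than my actual script: ess, voted_con, and the predictor names are placeholders for my real variables.

```r
library(mice)

# Generate 5 imputed (complete) versions of the dataset
imp <- mice(ess, m = 5, seed = 123)

# Fit the logistic regression on each imputed dataset...
fits <- with(imp, glm(voted_con ~ class + education + age,
                      family = binomial))

# ...then pool the results across datasets using Rubin's rules
pooled <- pool(fits)
summary(pooled)
```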
Using R for Data Analysis
I’ve spent the last few weeks getting to grips with R and decided to try to recreate my class voting analysis with it. But this time I would use multiple imputation rather than listwise deletion. I’ve included my spaghetti code as a PDF here and in the appendix (I can’t use code snippets in WordPress without plugins, which require a hefty subscription). This is my first piece of R code and it’s a total mess, but it got the job done.
I imported data from CSV files into R where I used dplyr to manipulate and “tidy” this data into something more usable. I then used the mice package to perform my multiple imputation and the glm() function for the logistic regressions. One of the advantages of using R instead of Excel is that you (and I) can see exactly what I’ve done to my data. This makes it easier to recreate for other analyses or spot mistakes. I’ve really enjoyed using R and can’t wait to use it for other regressions.
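The tidying step looks roughly like this; the file name and column names here are made up for illustration, since the real ESS variable names are rather more cryptic.

```r
library(dplyr)

raw <- read.csv("ess_round7.csv")

ess <- raw %>%
  filter(country == "GB") %>%                                   # UK respondents only
  mutate(voted_con = as.integer(party_voted == "Conservative")) %>%
  select(voted_con, class, education, age)
```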
The Wonders of ggplot2
The graphs I used in my original class voting analysis were atrocious. But after some helpful feedback I resolved to do better. I like to think the visual presentation of data in my more recent blogs is much better. Even so, I’ve become very aware of some of the shortcomings of Excel. For this analysis I decided to see what I could do with ggplot2 in R. You can see the code I’ve used in the attached pdf in the appendix.
I have since fallen in love with ggplot2. It is easy to use and makes creating great-looking graphs remarkably simple. I’m sure that there’s so much more that I can do with it, but I think I’ve created a pretty nice graph to display my results.
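For the curious, the graph is essentially a dot-and-whisker plot of the odds ratios. A minimal sketch, assuming a data frame called results with columns term, odds_ratio, ci_low, and ci_high (placeholder names):

```r
library(ggplot2)

ggplot(results, aes(x = odds_ratio, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = ci_low, xmax = ci_high), height = 0.2) +
  geom_vline(xintercept = 1, linetype = "dashed") +  # OR = 1 means no effect
  scale_x_log10() +                                  # ORs are asymmetric on a linear scale
  labs(x = "Odds ratio (log scale)", y = NULL)
```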
The results of my logistic regression are displayed below in a table and a graph:
| Variable | Odds ratio | p-value |
| --- | --- | --- |
| Compulsory or incomplete (reference) | – | – |
| Small Business Owners | 1.400 | 0.290 |
I’ve used odds ratios since they’re easier to interpret than coefficients. From the data we can see that vocational secondary education is significant at the 0.05 level whereas public sector employment is significant at the 0.001 level. How does this compare to the Excel analysis?
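Getting odds ratios out of R is straightforward: they’re just the exponentiated logistic regression coefficients. For a single (non-imputed) glm() fit it’s a one-liner; the variable names here are placeholders.

```r
fit <- glm(voted_con ~ class + education + age,
           family = binomial, data = ess)

# Odds ratios with 95% confidence intervals
exp(cbind(OR = coef(fit), confint(fit)))
```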
In my Excel analysis I used listwise deletion, which reduced my sample size and could have biased my results. The odds ratios in my new analysis are not too dissimilar to the ones calculated in my older analysis, but the results are (slightly) more significant.
From these results, we can see that people with a vocational secondary education were ~1.8 times more likely to have voted for the Conservatives than the reference group (about 41% of whom voted for the party). This is quite an unusual result and it’s difficult to say why this might be the case.
Public sector workers were roughly half as likely to vote for the Conservatives. This might not be surprising since the main issue in the 2010 election was the fallout of the financial crisis and the deficit reduction plans of the different parties.
Although the other results are not significant I think some are quite interesting. We can see a clear age / generational voting pattern. Younger voters were less likely to have voted for the Conservatives than older voters. This is a pattern which seems to persist 8 years later.
Small business owners, managers, and the traditional bourgeoisie were more likely to have voted Conservative whereas production workers and service workers were less likely to vote for the Conservatives. This matches previous work done by Oesch (2008).
Socio-cultural specialists were more likely to have voted for the Conservatives. This is an unexpected result which could indicate a realignment of class voting. But I suspect that this is a consequence of the particularly poor performance of the Labour party rather than a permanent upheaval.
It’s unfortunate that most of these results have to be taken with a pinch of salt – the standard errors are simply too large to be able to draw any reliable conclusions.
As I was going over this analysis again, some more problems surfaced which I’ll discuss a little here. I hope to be able to confront these problems in the coming weeks.
Firstly, I have only performed an unweighted analysis of my data. This is because it is slightly more difficult to do a logistic regression using survey weights in R. I wanted to have a better understanding of the problem before I tried to use the “survey” package (link).
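When I do get round to it, my understanding is that the weighted model would look something like this with the survey package. The weight variable name (pspwght, the ESS post-stratification weight) is from memory and may be wrong.

```r
library(survey)

# Declare the survey design, weighting each respondent by pspwght
des <- svydesign(ids = ~1, weights = ~pspwght, data = ess)

# svyglm() with quasibinomial() is the usual way to run a weighted
# logistic regression (it avoids spurious non-integer-count warnings)
wfit <- svyglm(voted_con ~ class + education + age,
               design = des, family = quasibinomial())
summary(wfit)
```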
Secondly, I haven’t included an R² value in my results. This is because an R² value is calculated for every single imputed dataset, but the mice package cannot pool R² values that have been calculated using the glm() function. This is a problem that I can’t find a solution for at present, but I’ll definitely keep looking.
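One workaround I’m considering is calculating a pseudo-R² (e.g. McFadden’s) on each imputed dataset separately and reporting the range, rather than a single pooled value. A sketch, assuming imp is the mids object produced by mice() and the variable names are placeholders:

```r
library(mice)

r2 <- sapply(seq_len(imp$m), function(i) {
  d <- complete(imp, i)                      # extract the i-th imputed dataset
  fit <- glm(voted_con ~ class + education + age,
             family = binomial, data = d)
  1 - fit$deviance / fit$null.deviance       # McFadden's pseudo-R-squared
})
range(r2)
```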
Thirdly, whilst I think my graphs have dramatically improved with ggplot2, tables in WordPress don’t look very good. There are plugins that produce much nicer tables, but they require much more expensive subscriptions to WordPress. Aside from creating tables as images, I’m not too sure how I can present data more effectively.
Finally, even though I am using multiple imputation for missing values, I am still facing the problem of a small sample size. The European Social Survey might be too small for this ongoing project. The way forward here is quite clear. I’m currently applying for access to the British Household Panel Survey (BHPS) through the UK Data Service. The BHPS is a much larger survey than the ESS so I might be able to produce more precise results with that.
There have been a lot of headaches along the way, but my first regression in R is finally here. Logistic regressions were a pain in Excel and they’re still a pain in R, but I’m slowly eliminating problems and I think further analyses will be much easier to do in R. I’ve also learned a lot about missing values and the practicalities of using multiple imputation in R.
For now, I’m going to go back to doing some linear regressions in R and discuss the results in much greater detail than I have for my class voting project. I’m sure there’ll be more pitfalls and problems along the way!
Appendix: Data Reference and R Code
The data I used for this analysis can be found at the European Social Survey:
ESS Round 7: European Social Survey Round 7 Data (2014). Data file edition 2.2. NSD – Norwegian Centre for Research Data, Norway – Data Archive and distributor of ESS data for ESS ERIC.
I’ve also used Daniel Oesch’s social class scripts to code this data.
I’ve uploaded my (terrible) R code here. Hopefully that link works. I’ve edited out the enormous console output that the code generates (thanks to multiple imputation) for readability.