In my last post I wrote up my first ever logistic regression and noted that it didn’t go as well as I would have liked. I’ve learned a lot about the practice of performing regressions in Excel so the process was still useful. I had some specific issues with my sample size and sampling weights, which I want to discuss a little bit more now.
I performed some Wald tests on the estimated coefficients of the regression (see last blog for the results). One of the determinants of the Wald statistic is the sample size. Put simply, as the sample size increases so too does the size of the Wald statistic. A larger Wald statistic has a smaller associated p-value meaning that the null hypothesis (in this case that the coefficient is equal to 0) will be rejected more often. If I had a larger sample more of my results would be significant.
Was my sample size too small? How much larger would my sample need to be to produce the statistical significance I wanted? The first question is trickier than the second. I began by reading some academic articles about sample sizes and found that there was a bit of a debate around the topic. For logistic regressions there is a rule of thumb that there need to be 10 events per predictor variable. Vittinghoff and McCulloch have argued that such a rule is too conservative, but others have argued the opposite.
In my case that meant that I needed 10 Conservative voters for every variable I had thrown into my regression (150 Conservative voters in total). My sample had a total of 238 such voters so it would seem that, despite my previous conclusion, my sample was sufficiently large to generate solid results. In fact, my sample size would be large enough to look at Labour voters as well (but I wouldn’t be able to do an analysis of Lib Dem voters).
But surely I could get better results if my sample size was much bigger? After all, there are some much larger surveys out there (which are admittedly much more difficult to get hold of). It turns out that increasing sample size has diminishing returns and that a sample as large as the one I had was about as good as it got. My results might have been a little bland, but that wasn’t necessarily a result of the size of the ESS.
Turning to sample weighting, the academic literature here was a bit more intimidating. Solon, Haider and Wooldridge show just how tricky the decision to weight samples actually is. Sometimes it is best to use weights in regressions and other times it makes everything worse. It seems academics have as much trouble and confusion around the topic as I have myself. This was reassuring if also deeply troubling.
My decision to include both weighted and unweighted regressions seems to have been the correct one. If in doubt, shrug and throw both of them into the mix. There are also tests that help decide whether it’s appropriate to weight or not, but this seems a little advanced for me at the moment. This discussion is definitely something I definitely want to explore in more detail when I come to doing some linear regressions and have a clearer understanding of what I’m actually doing.
Graphs, Charts, and Excel
The last thing I want to discuss in this post is how to present regression results visually. I was discussing this project with a friend and they were very keen to get me to present my findings in something other than a table. I spent a lot of time in my second and third years at university staring at tables of regression coefficients and I was reluctant to try and present my findings in any other format.
I gave in and threw in some box plots (after making them a little nicer to look at) that Excel had vomited up for me. Presenting the data in the form of a graph didn’t make much sense to me since all my variables were binary. If anyone knows of any other way to present my results then let me know!
I think that linear regressions are much easier to present visually than what I’ve just worked on. During my research I came across a lot of cool ways to present data and I’d love to get around to doing some of that soon. It looks like learning R might be a good idea, particularly for partial dependence plots, as purely relying on Excel might become a little limiting.
Performing a grisly post-mortem on my logistic regression was very useful and I think I might come back to this at some point in the future when I’ve got some more statistical experience under my belt. In hindsight I bit off a lot more than I could chew. With that in mind, I’m going to do some very basic stuff next: hypothesis tests and linear regressions. I’m not sure what the topic will be yet, but I’ve got a few ideas!