User-friendly statistics
The perspective here attempts to make
statistical methods simpler but no less powerful
Michael Wood
(MichaelWoodSLG@gmail.com)
How sure are we
(link to come)? Suppose
you are told that your life expectancy is another 40 years, or that women
have better emotional intelligence than men. But how accurate is the 40
years, and how certain should we be that the assertion about women is
correct? I explain two statistical approaches to questions like these
(testing hypotheses and estimating confidence levels) from first principles,
explaining both how to get the answers, and why the answers are sensible.
Both approaches can be carried out by "resampling"
using this spreadsheet which includes a brief
explanation of both approaches (and can also be used to estimate
probabilities).
Making predictions (link to come). Suppose you have some data on some
houses - their selling price, their location, how big they are, etc - and you
want to devise a formula for predicting the price from the other variables,
and understanding the impact of variables like location on the price. One
method for doing this is known as regression: I explain how it works in this horrible video and the rationale behind it in this video.
Probability: a minimalist approach
(link to come) using two core ideas only: the
"equally likely possibilities" principle, and a spreadsheet to simulate the overall effect of
lots of events following this principle.
Simple Ideas home page (A more general perspective on
the idea of making knowledge simpler and more efficient)
The basic idea of statistics
is to take a lot of data and analyze it to show patterns that may not be clear
from a superficial glance. It enables us to learn about the dangers of smoking
and of living near a busy road, to assess the effectiveness of innovations in
various domains, and to discover that in 2011 there were 7.8 road deaths per
100,000 of population in Belgium compared to the UK average of 3.1 (why?), and
so on. Life would be much poorer without the improvements brought about by
statistics in so many domains.
However ... statistics
is notorious for being a difficult subject to master. This reputation is
deserved: the concepts involved are inevitably subtle, but the difficulties are
magnified massively by unhelpful jargon, dense mathematical techniques covering
very restricted situations, and the promotion of the wrong concepts (e.g. null
hypothesis significance tests). Statistics needs simplifying by focusing on
user-friendly approaches which clarify the rationale behind concepts, by ignoring
superfluous concepts and techniques and by taking care to avoid unhelpful
jargon. I think this is possible without sacrificing its power and usefulness; in fact a fresh, simpler
perspective is likely to make the process of learning and using statistics more
efficient, effective and useful.
I have been trying to make statistics more user-friendly for
a long time. I used to teach an introductory course
for MBA students and those on similar courses, which I tried to make as simple
as possible, using more intuitive methods whenever I could. I took these
ideas a bit further in my
book. The philosophy behind this book is summed up in the first chapter.
What I want to do here is to
take this philosophy further still. The results are in the links in the box on
the right. Each of these should make
reasonable sense on its own. Chapter 3
of my book may be helpful as background on how to summarize statistical data by
graphs, averages, correlations, etc. (The data sets mentioned in this chapter
are on the web at drink20.xls, drink.xls, iofm.xls, shares.xls.)
But what is statistics? From my perspective the core concepts
are averages, probabilities and randomization. It is impossible to predict with
certainty whether a particular person who smokes will develop lung cancer or
exactly how many more years they will live, but statistics can estimate the
probability of developing lung cancer, and an average life expectancy for
people in a similar position. It usually does this by taking a sample of data,
and it is obviously important that this sample represents the whole population
as accurately as possible. In practice the easiest way of doing this is often
to use a random sample – this is chosen so that every member of the population
is equally likely to be selected, so there shouldn’t be any consistent bias.
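As a rough illustration of this idea, the sketch below draws a simple random sample from a made-up population and compares the sample average with the population average. The population values, the sample size of 200 and the function name `sample_mean` are all assumptions invented for the example; they do not come from any real data set.

```python
import random
import statistics

def sample_mean(population, sample_size, rng):
    """Mean of a simple random sample drawn without replacement."""
    # rng.sample gives every member the same chance of being selected.
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 invented "remaining years of life" values.
population = [rng.gauss(40, 10) for _ in range(10_000)]

population_mean = statistics.mean(population)
estimate = sample_mean(population, 200, rng)

print(f"population mean: {population_mean:.1f}")
print(f"estimate from a random sample of 200: {estimate:.1f}")
```

Changing the seed gives a different sample and a slightly different estimate, which is exactly the uncertainty that confidence levels try to describe.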
Randomization is also important in experiments or trials such
as drug trials. Suppose you wanted to compare a drug treatment for a disease
with a placebo, and you decided to ask patients whether they wanted the drug or
the placebo. The difficulty would be that the two groups would be different in
ways that would almost certainly affect the result (e.g. those who were more
seriously ill might opt for the drug). The solution is to allocate patients to
the drug or placebo group at random.
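A minimal sketch of random allocation is given below, assuming a list of 20 patients with invented labels and two groups of equal size. Real trials often use more elaborate schemes; the function name `allocate_at_random` is hypothetical and is only meant to show the basic idea that chance, not the patient, decides the group.

```python
import random

def allocate_at_random(patients, rng):
    """Shuffle the patients and split them into two equal-sized groups."""
    shuffled = list(patients)   # copy, so the original list is left untouched
    rng.shuffle(shuffled)       # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical patient labels, invented for illustration.
patients = [f"patient_{i}" for i in range(1, 21)]

drug_group, placebo_group = allocate_at_random(patients, rng)
print("drug:   ", drug_group)
print("placebo:", placebo_group)
```

Because the split depends only on the shuffle, the more seriously ill patients have no systematic tendency to end up in either group.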
There is a more detailed
discussion of the nature of statistics, and its relation to fuzzy logic and
the idea of chaos in Chapter
4 of my book, Making sense of statistics:
a non-mathematical approach, and some
thoughts on designing and interpreting practical investigations using
statistical methods in Chapter 10.
The twin sister of statistics is probability theory. As well
as being essential to statistics itself, it also enables you to go from simple
assumptions (like each of the 59 numbered balls in the UK Lotto draw is equally
likely to be drawn) to conclusions about how likely particular results are
(e.g. the probability of winning the jackpot can be calculated as about 1 in 45
million) or about averages in a large population (e.g. the average number of
winners of a particular prize).
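The jackpot figure can be checked with a short calculation. The sketch below assumes that six balls are drawn from the 59 and that the order of drawing does not matter; the number six is not stated in the text above and is an assumption here. Under the "equally likely" assumption, each possible set of six balls has the same chance of coming up.

```python
import math

BALLS = 59   # numbered balls in the draw (from the text)
PICKED = 6   # assumption: six balls are drawn and order does not matter

# Number of different sets of six balls that could be drawn.
jackpot_combinations = math.comb(BALLS, PICKED)

print(f"1 in {jackpot_combinations:,}")  # prints "1 in 45,057,474"
```

This agrees with the "about 1 in 45 million" quoted above.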
The box below lists some
articles and other resources on the theme of simplifying statistics and making
it more useful.