User-friendly
statistics
The perspective here attempts to make
statistical methods simpler but no less powerful
Michael Wood
(MichaelWoodSLG@gmail.com)
How sure are we? Suppose you are told that your
life expectancy is another 40 years, or that women have better emotional
intelligence than men. But how accurate is the 40 years, and how certain
should we be that the assertion about women is correct? This article explains
two statistical approaches to questions like these - testing hypotheses and
estimating confidence levels - from first principles, explaining both how to
get the answers, and why the answers are sensible. Both approaches can be
carried out by "resampling" or
"bootstrapping" using this spreadsheet. The article also includes a very
brief discussion of probability in terms of stories about equally likely
possible worlds. The same spreadsheet can be used to explore how this
idea works in practice using Monte Carlo simulation.
Making predictions. Suppose you wanted to devise a formula
for predicting the price of a house from its size, location, etc. One method for
doing this is known as regression, although I think "straight line
prediction formula" would be clearer. This horrible video explains how it works, and this video explains the rationale behind it.
There's a similar explanation using a different example in Chapter 9 of my book,
and a spreadsheet
here to explain how regression works (this is an improved version of the
one referred to in my book).
Simple knowledge home page:
A more general perspective on the idea of
making knowledge simpler and more efficient.
The basic idea of
statistics is to take a lot of data and analyze it to show patterns that may
not be clear from a superficial glance. It enables us to learn about the
dangers of smoking and of living near a busy road, to assess the effectiveness
of innovations in various domains, and to discover that in 2011 there were 7.8
road deaths per 100,000 of population in Belgium compared to the UK average of
3.1 (why?), and so on. Life would be much poorer without the improvements
brought about by statistics in so many domains.
However ... statistics
is notorious for being a difficult subject to master. This reputation is deserved:
the concepts involved are inevitably subtle, but the difficulties are magnified
massively by unhelpful jargon, dense mathematical techniques covering very
restricted situations, and the promotion of the wrong concepts (e.g. null
hypothesis significance tests). Statistics needs simplifying by focusing on
user-friendly approaches which clarify the rationale behind concepts, by
ignoring superfluous concepts and techniques and by taking care to avoid
unhelpful jargon. I think this is possible without sacrificing its power and usefulness; in fact a fresh, simpler,
perspective is likely to make the process of learning and using statistics more
efficient, effective and useful.
I have been trying to make statistics more user-friendly for
a long time. I used to teach an introductory course
for students on an MBA and similar courses, which I tried to make as simple
as possible, using more intuitive methods whenever I could. I took these ideas
a bit further in my
book. The philosophy behind this book is summed up in the first chapter.
What I want to do here is to
take this philosophy a bit further still. The results are in the links in the
box on the right. Each of these should
make reasonable sense on its own. Chapter 3
of my book may be helpful as background on how to summarize statistical data by
graphs, averages, correlations, etc. (The data sets mentioned in this chapter
are on the web at drink20.xls, drink.xls, iofm.xls, shares.xls.)
But what is statistics? From my perspective the core concepts
are averages, probabilities and randomization. It is impossible to predict with
certainty whether a particular person who smokes will develop lung cancer or
exactly how many more years they will live, but statistics can estimate the
probability of developing lung cancer, and an average life expectancy for
people in a similar position. It usually does this by taking a sample of data,
and it is obviously important that this sample represents the whole population
as accurately as possible. In practice the easiest way of doing this is often
to use a random sample – one chosen so that every member of the population
is equally likely to be selected, so there shouldn’t be any consistent bias.
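To make the idea of a random sample concrete, here is a minimal Python sketch. The population is made-up data (1,000 invented ages), and the function name `simple_random_sample` is chosen purely for illustration; it is one possible way to draw such a sample, not a method taken from the book or the spreadsheets.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Pick n members so that every member is equally likely to be chosen."""
    rng = random.Random(seed)
    # random.sample draws without replacement, giving each member the same chance.
    return rng.sample(population, n)


# Hypothetical population: 1,000 invented ages (illustrative data only).
data_rng = random.Random(1)
population = [data_rng.randint(18, 90) for _ in range(1000)]

sample = simple_random_sample(population, 50, seed=2)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

The sample mean will usually be close to the population mean, but not identical to it; how close it is likely to be is the question that confidence intervals address.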
Randomization is also important in experiments or trials such
as drug trials. Suppose you wanted to compare a drug treatment for a disease
with a placebo, and you decided to ask patients whether they wanted the drug or
the placebo. The difficulty would be that the two groups would be different in
ways that would almost certainly affect the result (e.g. those who were more
seriously ill might opt for the drug). The solution is to allocate patients to
the drug or placebo group at random.
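Random allocation can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the patient labels are invented, the two groups are made equal in size, and the function name `allocate_at_random` is hypothetical. Real trials often use more elaborate schemes.

```python
import random


def allocate_at_random(patients, seed=None):
    """Shuffle the patients and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(patients)   # copy, so the original list is left unchanged
    rng.shuffle(shuffled)       # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical list of 20 patients (labels invented for illustration).
patients = [f"patient_{i}" for i in range(1, 21)]

drug_group, placebo_group = allocate_at_random(patients, seed=0)

print("drug:   ", drug_group)
print("placebo:", placebo_group)
```

Because chance alone decides who goes where, neither the patients' preferences nor how ill they are can systematically tilt one group relative to the other.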
There is a more detailed
discussion of the nature of statistics, and its relation to fuzzy logic and
the idea of chaos in Chapter
4 of my book, Making sense of statistics:
a non-mathematical approach, and some
thoughts on designing and interpreting practical investigations using
statistical methods in Chapter 10.
The twin sister of statistics is probability theory. As well
as being essential to statistics itself, it also enables you to go from simple
assumptions (like each of the 59 numbered balls in the UK Lotto draw is equally
likely to be drawn) to conclusions about how likely particular results are
(e.g. the probability of winning the jackpot can be calculated as about 1 in 45
million) or about averages in a large population (e.g. the average number of
winners of a particular prize).
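The jackpot figure can be checked with a short calculation. The sketch below assumes that six balls are drawn from the 59, that the order of drawing does not matter, and that the jackpot requires matching all six; these details are assumptions added here, not stated above.

```python
from math import comb

BALLS = 59   # numbered balls in the draw (from the text)
DRAWN = 6    # assumption: six balls are drawn and all six must be matched

# Number of equally likely sets of six balls, ignoring the order of drawing.
combinations = comb(BALLS, DRAWN)
jackpot_probability = 1 / combinations

print(combinations)  # 45057474
print(f"about 1 in {combinations / 1e6:.1f} million")
```

Each possible set of six balls is equally likely, so the chance that one ticket matches the set drawn is 1 divided by the number of possible sets.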
The boxes below, and above
right, list some articles and other resources on the theme of simplifying
statistics and making it more useful.