Do not smooth time series, you hockey puck!
The advice which forms the title
of this post would be how Don Rickles, if he were a statistician, would
explain how not to conduct time series analysis. Judging by the
methods I regularly see applied to data of this sort, Don’s rebuke is
sorely needed.
The advice is particularly relevant now because there is a new
hockey stick controversy brewing. Mann and others have published a new
study melding together lots of data and they claim to have again shown
that the here and now is hotter than the then and there. Go to climateaudit.org
and read all about it. I can’t do a better job than Steve, so I won’t
try. What I can do is to show you what not to do. I’m going to shout
it, too, because I want to be sure you hear.
Mann includes at this site
a large number of temperature proxy data series. Here is one of them
called wy026.ppd (I just grabbed one out of the bunch). Here is the
picture of this data:
The various black lines are the actual data! The red line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data.
Mann used a “low pass filter” different from the running mean to
produce his fictional data, but a smoother is a smoother and what I’m
about to say changes not one whit depending on what smoother you use.
Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses!
If the data is measured with error, you might attempt to model it
(which means smooth it) in an attempt to estimate the measurement
error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.
If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals—signals that look real to other analytical methods. No matter what you will be too certain of your final results!
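
Here is a minimal sketch of that claim, under assumptions of my own choosing: two series of pure, independent noise (so there is no relationship to find), a 10-point trailing running mean standing in for "a smoother", and hypothetical helper names (`running_mean`, `pearson`, `mean_abs_corr`). It compares the typical size of the correlation between the two series before and after smoothing. It is not Mann's filter or his data.

```python
import random

def running_mean(xs, window):
    """Trailing running mean; returns len(xs) - window + 1 points."""
    total = sum(xs[:window])
    out = [total / window]
    for i in range(window, len(xs)):
        total += xs[i] - xs[i - window]
        out.append(total / window)
    return out

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mean_abs_corr(n_points, window, n_trials, rng):
    """Average |correlation| of independent noise pairs, raw and smoothed."""
    raw_total = 0.0
    smooth_total = 0.0
    for _ in range(n_trials):
        # Two series that have nothing to do with each other (assumption).
        a = [rng.gauss(0, 1) for _ in range(n_points)]
        b = [rng.gauss(0, 1) for _ in range(n_points)]
        raw_total += abs(pearson(a, b))
        smooth_total += abs(
            pearson(running_mean(a, window), running_mean(b, window))
        )
    return raw_total / n_trials, smooth_total / n_trials

rng = random.Random(0)
raw_corr, smooth_corr = mean_abs_corr(n_points=100, window=10, n_trials=500, rng=rng)
print(f"mean |r|, raw series:      {raw_corr:.3f}")
print(f"mean |r|, smoothed series: {smooth_corr:.3f}")
```

The smoothed pairs should show noticeably larger correlations on average, even though by construction neither series knows the other exists.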
Mann et al. first dramatically smoothed their series, then analyzed
them separately. Regardless of whether their thesis is true—whether
there really is a dramatic increase in temperature lately—it is
guaranteed that they are now too certain of their conclusion.
There. Sorry for shouting, but I just had to get this off my chest.
Now for some specifics, in no particular order.
- A probability model should be used for only one thing: to quantify the uncertainty of data not yet seen. I go on and on and on about this because this simple fact, for reasons God only knows, is difficult to remember.
- The corollary to this truth is that the data in a time series analysis is the data.
This tautology is there to make you think. The data is the data! The
data is not some model of it. The real, actual data is the real, actual
data. There is no secret, hidden “underlying process” that you can
tease out with some statistical method, and which will show you the
“genuine data”. We already know the data and there it is. We do not
smooth it to tell us what it “really is” because we already know what
it “really is.”
- Thus, there are only two reasons (excepting measurement error) to ever model time series data:
- To associate the time series with external factors. This is the
standard paradigm for 99% of all statistical analysis. Take several
variables and try to quantify their correlation, etc., but only with a
mind to do the next step.
- To predict future data. We do not need to predict the data we already have. Let me repeat that for ease of memorization: Notice that we do not
need to predict the data we already have. We can only predict what we
do not know, which is future data. Thus, we do not need to predict the
tree ring proxy data because we already know it.
- The tree ring data is not temperature (say that out loud). This is
why it is called a proxy. Is it a perfect proxy? Was that last question
a rhetorical one? Was that one, too? Because it is a proxy, the
uncertainty of its ability to predict temperature must be taken into
account in the final results. Did Mann do this? And just what is a
rhetorical question?
- There are hundreds of time series analysis methods, most with the
purpose of trying to understand the uncertainty of the process so that
future data can be predicted, and the uncertainty of those predictions
can be quantified (this is a huge area of study in, for example,
financial markets, for good reason). This is a legitimate use of
smoothing and modeling.
- We certainly should model the relationship of the proxy and
temperature, taking into account the changing nature of the proxy through
time, the differing physical processes that will cause the proxy to
change regardless of temperature or how temperature exacerbates or
quashes them, and on and on. But we should not stop, as everybody has
stopped, with saying something about the parameters of the probability
models used to quantify these relationships. Doing so makes us, once
again, far too certain of the final results. We do not care how the proxy predicts the mean temperature; we do care how the proxy predicts temperature.
- We do not need a statistical test to say whether a particular time
series has increased since some time point. Why? If you do not know, go
back and read these points from the beginning. It’s because all we have
to do is look at the data: if it has increased, we are
allowed to say “It increased.” If it did not increase or it decreased,
then we are not allowed to say “It increased.” It really is as simple
as that.
- You will now say to me “OK Mr Smarty Pants. What if we had several
different time series from different locations? How can we tell if
there is a general increase across all of them? We certainly need
statistics and p-values and Monte Carlo routines to tell us that they
increased or that the ‘null hypothesis’ of no increase is true.” First,
nobody has called me “Mr Smarty Pants” for a long time, so you’d better
watch your language. Second, weren’t you paying attention? If you want
to say that 52 out of 413 time series increased since some time point,
then just go and look at the time series and count! If 52 out
of 413 time series increased then you can say “52 out of 413 time
series increased.” If more or fewer than 52 out of 413 time series
increased, then you cannot say that “52 out of 413 time
series increased.” Well, you can say it, but you would be lying. There
is absolutely no need whatsoever to chatter about null hypotheses etc.
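
For the doubters, counting can be sketched in a few lines. The three toy series below are invented for illustration, and "increased" is taken here to mean "the last value is higher than the value at the chosen starting point", which is my assumption; pick whatever definition you like, as long as you state it.

```python
def increased_since(series, start_index):
    """True if the last observed value is higher than the value at start_index."""
    return series[-1] > series[start_index]

# Hypothetical toy series, invented for illustration only.
series_by_site = {
    "site_a": [1.0, 1.2, 0.9, 1.4],
    "site_b": [2.0, 1.8, 1.9, 1.7],
    "site_c": [0.5, 0.5, 0.6, 0.5],
}

# Look at each series and count. No null hypothesis required.
n_up = sum(increased_since(s, 0) for s in series_by_site.values())
print(f"{n_up} out of {len(series_by_site)} time series increased.")
```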
If the points—it really is just one point—I am making seem tedious
to you, then I will have succeeded. The only fair way to talk about
past, known data in statistics is just by looking at it. It is true
that looking at massive data sets is difficult and still somewhat of an
art. But looking is looking and it’s utterly evenhanded. If you want to
say how your data was related to other data, then again, all you have
to do is look.
The only reason to create a statistical model is to predict data you
have not seen. In the case of the proxy/temperature data, we have the
proxies but we do not have temperature, so we can certainly use a
probability model to quantify our uncertainty in the unseen
temperatures. But we can only create these models when we have
simultaneous measures of the proxies and temperature. After
these models are created, we then go back to where we do not have
temperature and we can predict it (remembering to predict not its mean but the actual values;
you also have to take into account how the temperature/proxy
relationship might have been different in the past, and how the other
conditions extant would have modified this relationship, and on and on).
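
The difference between predicting the mean and predicting the actual value can be shown with a rough sketch. Everything in it is an assumption made for illustration: the proxy and temperature numbers are simulated, the relationship is a straight line with constant noise, and the textbook regression formulas stand in for whatever probability model you would really use. The names `fit_line` and `standard_errors` are hypothetical helpers.

```python
import random

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma = (resid_ss / (n - 2)) ** 0.5  # residual standard deviation
    return intercept, slope, sigma, xbar, sxx, n

def standard_errors(x0, sigma, xbar, sxx, n):
    """Standard error of the fitted mean at x0, and of a new observation at x0."""
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    se_mean = sigma * leverage ** 0.5        # uncertainty in the mean only
    se_new = sigma * (1 + leverage) ** 0.5   # uncertainty in an actual value
    return se_mean, se_new

# Simulated calibration period where both proxy and temperature are "observed".
rng = random.Random(1)
proxy = [rng.uniform(0, 2) for _ in range(50)]
temperature = [10 + 1.5 * p + rng.gauss(0, 1) for p in proxy]

intercept, slope, sigma, xbar, sxx, n = fit_line(proxy, temperature)
x0 = 1.0  # a proxy value from a period with no temperature record
point = intercept + slope * x0
se_mean, se_new = standard_errors(x0, sigma, xbar, sxx, n)

# Rough +/- 2 standard error bands, not exact intervals.
print(f"mean temperature:   {point:.2f} +/- {2 * se_mean:.2f}")
print(f"actual temperature: {point:.2f} +/- {2 * se_new:.2f}")
```

The band for the actual temperature is always the wider of the two, and it does not shrink to nothing as the calibration sample grows. Report the narrow one and you have told your readers you know more than you do.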
What you cannot, or should not, do is to first model/smooth the
proxy data to produce fictional data and then try to model the
fictional data and temperature. This trick will always—simply
always—make you too certain of yourself and will lead you astray.
Notice how the red fictional data looks a hell of a lot more
structured than the real data and you’ll get the idea.
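
A hedged sketch of how the over-certainty shows up, again on made-up data: generate pairs of independent noise series, optionally run a 10-point trailing running mean over both, regress one on the other, and count how often the ordinary regression test announces a "significant" slope (here, |t| > 2). The cutoff, the window, the series length, and the helper names (`running_mean`, `slope_t_stat`, `false_alarm_rate`) are all my assumptions.

```python
import random

def running_mean(xs, window):
    """Trailing running mean; returns len(xs) - window + 1 points."""
    total = sum(xs[:window])
    out = [total / window]
    for i in range(window, len(xs)):
        total += xs[i] - xs[i - window]
        out.append(total / window)
    return out

def slope_t_stat(x, y):
    """t statistic for the slope in an ordinary least squares fit of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = (resid_ss / (n - 2) / sxx) ** 0.5
    return slope / se_slope

def false_alarm_rate(smooth, n_points=100, window=10, n_trials=400, seed=2):
    """Fraction of unrelated noise pairs declared 'significant' at |t| > 2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x = [rng.gauss(0, 1) for _ in range(n_points)]
        y = [rng.gauss(0, 1) for _ in range(n_points)]
        if smooth:
            # Replace the real data with the fictional data.
            x, y = running_mean(x, window), running_mean(y, window)
        if abs(slope_t_stat(x, y)) > 2.0:
            hits += 1
    return hits / n_trials

raw_rate = false_alarm_rate(smooth=False)
smooth_rate = false_alarm_rate(smooth=True)
print(f"'significant' slopes, raw noise:      {raw_rate:.0%}")
print(f"'significant' slopes, smoothed noise: {smooth_rate:.0%}")
```

With the raw noise the test should cry wolf at about its advertised rate; with the smoothed noise it should cry wolf far more often, though there is still nothing there.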
Next step is to start playing with the proxy data itself and see
what there is to see. As soon as I am granted my wish to have each day filled
with 48 hours, I’ll be able to do it.
Thanks to Gabe Thornhill of Thornhill Securities for reminding me to write about this.