## Friday, June 22, 2007

### If you talk about power laws, read this paper:

A. Clauset, C. R. Shalizi and M. E. J. Newman, "Power-law distributions in empirical data", arXiv:0706.1062.

Let me just repeat three key points that Shalizi summarizes on his blog:
Lots of distributions give you straight-ish lines on a log-log plot. True, a Gaussian or a Poisson won't, but lots of other things will. Don't even begin to talk to me about log-log plots which you claim are "piecewise linear".
And:
Abusing linear regression makes the baby Gauss cry. Fitting a line to your log-log plot by least squares is a bad idea. It generally doesn't even give you a probability distribution, and even if your data do follow a power-law distribution, it gives you a bad estimate of the parameters. You cannot use the error estimates your regression software gives you, because those formulas incorporate assumptions which directly contradict the idea that you are seeing samples from a power law. And no, you cannot claim that because the line "explains" a lot of the variance that you must have a power law, because you can get a very high R^2 from other distributions (that test has no "power"). And this is without getting into the errors caused by trying to fit a line to binned histograms.

It's true that fitting lines on log-log graphs is what Pareto did back in the day when he started this whole power-law business, but "the day" was the 1890s. There's a time and a place for being old school; this isn't it.
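To make the alternative to least squares concrete, here is a minimal sketch of the maximum-likelihood estimate of the exponent for a continuous power law, which is the estimator Clauset et al. recommend: alpha_hat = 1 + n / sum(ln(x_i / x_min)). The function names, the synthetic data, and the assumption that x_min is known in advance are mine, for illustration only; the paper also discusses how to estimate x_min from the data.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= xmin,
    using inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha_mle(data, xmin):
    """Maximum-likelihood estimate of alpha for the values >= xmin,
    together with its asymptotic standard error (alpha_hat - 1) / sqrt(n)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha_hat = 1.0 + n / log_sum
    return alpha_hat, (alpha_hat - 1.0) / math.sqrt(n)


# Synthetic data with a known exponent (illustrative values, not real data).
rng = random.Random(42)
data = sample_power_law(10_000, alpha=2.5, xmin=1.0, rng=rng)

alpha_hat, stderr = fit_alpha_mle(data, xmin=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {stderr:.3f}")
```

Unlike a regression slope, this estimate comes from an actual probability model, so the standard error it reports means what it says, provided the data really are power-law distributed above x_min.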
Use a goodness-of-fit test to check goodness of fit. In particular, if you're looking at the goodness of fit of a distribution, use a statistic meant for distributions, not one for regression curves. This means forgetting about R^2, the fraction of variance accounted for by the curve, and using the Kolmogorov-Smirnov statistic, the maximum discrepancy between the empirical distribution and the theoretical one. If you've got the right theoretical distribution, the KS statistic will converge to zero as you get more data (that's the Glivenko-Cantelli theorem). The one hitch in this case is that you can't use the usual tables/formulas for significance levels, because you're estimating the parameters of the power law from the data. This is why God, in Her wisdom and mercy, gave us the bootstrap.

If the chance of getting data which fits the estimated distribution as badly as your data fits your power law is, oh, one in a thousand or less, you had better have some other, very compelling reason to think that you're looking at a power law.
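The recipe in this last point can be sketched in a few lines: fit the power law, measure the KS distance, then generate synthetic data sets from the fitted model, refit each one, and count how often the synthetic fit is at least as bad as the real one. The code below is a rough illustration under my own assumptions, not the authors' implementation: x_min is taken as known, the helper names are made up, and the "non-power-law" test data (a lognormal folded so that every value is at least x_min) is just a convenient example.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Inverse-transform samples from p(x) ~ x^(-alpha), x >= xmin."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha_mle(data, xmin):
    """Maximum-likelihood exponent, assuming every value is >= xmin."""
    return 1.0 + len(data) / sum(math.log(x / xmin) for x in data)


def ks_distance(data, alpha, xmin):
    """Largest gap between the empirical CDF and the power-law CDF
    F(x) = 1 - (x / xmin)^(1 - alpha)."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d


def bootstrap_p_value(data, xmin, n_boot, rng):
    """Fraction of synthetic power-law data sets whose own fit is at least
    as bad (in KS distance) as the fit to the observed data."""
    alpha_hat = fit_alpha_mle(data, xmin)
    d_obs = ks_distance(data, alpha_hat, xmin)
    worse = 0
    for _ in range(n_boot):
        synth = sample_power_law(len(data), alpha_hat, xmin, rng)
        # Refit each synthetic sample, since the real fit was also estimated.
        d_synth = ks_distance(synth, fit_alpha_mle(synth, xmin), xmin)
        if d_synth >= d_obs:
            worse += 1
    return worse / n_boot


rng = random.Random(0)

# Case 1: data that really are power-law distributed.
power_data = sample_power_law(1000, alpha=2.5, xmin=1.0, rng=rng)
p_power = bootstrap_p_value(power_data, xmin=1.0, n_boot=200, rng=rng)

# Case 2: heavy-ish tailed data that are not a power law (folded lognormal).
lognormal_data = [math.exp(abs(rng.gauss(0.0, 1.0))) for _ in range(1000)]
p_lognormal = bootstrap_p_value(lognormal_data, xmin=1.0, n_boot=200, rng=rng)

print(f"p-value, true power law:   {p_power:.3f}")
print(f"p-value, folded lognormal: {p_lognormal:.3f}")
```

With only 200 bootstrap replicates the p-value cannot resolve anything as small as one in a thousand, so a serious analysis would use many more; the small number here just keeps the example quick to run.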
The good news is that this paper, despite having been submitted for publication too soon to cite Clauset et al., largely follows the advice above. It tries to convey to sedimentary geologists (hopefully others will look at it as well) that straightish-looking lines on log-log plots with a large R^2 are not enough evidence for power-law behavior.

Related previous posts:

- The fractal nature of Einstein's and Darwin's letter writing
- My talk on bed thicknesses and power laws
- On cumulative probability curves
- Power laws and log-log plots II.
- Power laws and log-log plots I.