## Significant Digit (Benford's) Law in Publication Citations

I expect that any decent-sized sample from a convex process will contain the most numbers with a leading significant digit of 1, fewer with a leading digit of 2, and so on down to the fewest with a leading digit of 9, since

$f^{-1}(2) - f^{-1}(1) > f^{-1}(3) - f^{-1}(2) > \dots > f^{-1}(9) - f^{-1}(8)$

for a convex, increasing function $f(t)$ and uniformly distributed $t$. To see this in action, I plotted histograms of the leading significant digits of publication citation counts. I think it’s reasonable that the more citations a paper has, the more likely it is to be cited again, which meets the convexity criterion. For a roughly uniform sampling of $t$, we should collect the citations of papers by senior researchers (although I make one exception out of curiosity).
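The inequality can be checked numerically for one concrete case. The sketch below assumes $f(t) = e^t$, so that $f^{-1}(y) = \ln y$; that choice is mine, for illustration only, and the function name `interval_widths` is hypothetical. It computes the width of the $t$-interval that maps onto each leading digit within a single decade, and the share of uniformly distributed $t$ that lands there.

```python
import math

def interval_widths(f_inv=math.log):
    """Width of the t-interval mapped onto [d, d+1) for d = 1..9.

    f_inv is the inverse of the assumed convex, increasing function f.
    The default corresponds to f(t) = exp(t), an illustrative assumption.
    """
    return {d: f_inv(d + 1) - f_inv(d) for d in range(1, 10)}

widths = interval_widths()
total = sum(widths.values())  # whole decade: f_inv(10) - f_inv(1)

for d, w in widths.items():
    # With t uniform, the share of samples with leading digit d is w / total.
    print(d, round(w / total, 3))
```

For this particular $f$ the shares work out to exactly $\log_{10}(1 + 1/d)$, because the widths are differences of logarithms. Other convex, increasing functions give decreasing shares too, but not necessarily these values.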

To get the data, I used the Publish or Perish application, a Windows interface to Google Scholar, and downloaded six CSV files, one per researcher. Here’s the J code I used to plot the histograms:
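For readers who don’t use J, the digit-counting step can also be sketched in Python. This is an independent sketch, not a translation of the J code. The column name `Cites` is my assumption about the CSV export, and all function names are hypothetical.

```python
import csv
from collections import Counter

def leading_digit(n):
    """Return the first significant digit of an integer, or None for zero."""
    n = abs(int(n))
    if n == 0:
        return None
    while n >= 10:
        n //= 10
    return n

def digit_histogram(counts):
    """Count leading digits 1..9 over an iterable of citation counts."""
    hist = Counter(d for d in map(leading_digit, counts) if d is not None)
    return [hist.get(d, 0) for d in range(1, 10)]

def read_citations(path, column="Cites"):
    """Read one integer column from a CSV file (the column name is assumed)."""
    values = []
    with open(path, newline="", encoding="utf-8-sig") as fh:
        for row in csv.DictReader(fh):
            cell = (row.get(column) or "").strip()
            if cell.isdigit():  # skip blank and non-numeric cells
                values.append(int(cell))
    return values
```

With a file per researcher, `digit_histogram(read_citations("researcher.csv"))` would return nine counts, one per leading digit (the file name here is a placeholder). Papers with zero citations have no leading significant digit and are skipped.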

And a version made using R’s ggplot:

Code on GitHub

So the results match intuition, but the next question is why (except for the less senior researcher) the distributions so closely match the logarithmic distribution $\log_{10}(1 + 1/d)$ for leading digit $d$. Here’s one answer, by Hill (1995).
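For reference, the logarithmic distribution assigns probability $\log_{10}(1 + 1/d)$ to leading digit $d$, which is about 0.301 for $d = 1$ and about 0.046 for $d = 9$. Here is a small sketch for tabulating these values next to an observed histogram. The function names are hypothetical, and the side-by-side comparison is my own addition, not part of the original analysis.

```python
import math

def benford_probs():
    """Probability of each leading digit d = 1..9 under the log distribution."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]

def compare(observed_counts):
    """Pair observed relative frequencies with the log-distribution values.

    observed_counts is a list of nine counts, for leading digits 1..9.
    Returns a list of (digit, observed_share, expected_share) tuples.
    """
    if len(observed_counts) != 9:
        raise ValueError("expected nine counts, one per leading digit")
    total = sum(observed_counts)
    if total == 0:
        raise ValueError("no observations")
    return [
        (d, c / total, p)
        for d, (c, p) in enumerate(zip(observed_counts, benford_probs()), start=1)
    ]
```

The nine probabilities sum to 1 because the sum telescopes: $\sum_{d=1}^{9} \log_{10}\frac{d+1}{d} = \log_{10} 10 = 1$.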

References