The Mysterious Benford's Law

physixfan 2010-10-31 21:25

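Take the populations of the world's 237 countries. What fraction of those numbers would you expect to begin with 1, and what fraction with 9? If you answered 1/9 for both, congratulations, you are a normal person. But that is not what happens: numbers beginning with 1 make up a surprising 27% of the total, while numbers beginning with 9 make up only 5%. The figure in the original post (taken from Wikipedia) shows the share of each leading digit among the national population figures. Why is the gap so large? This is the mysterious Benford's law at work.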

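Benford's law (本福特定律, also written 本福德法则) states that in a set of numbers drawn from real life, numbers whose first digit is 1 occur with a probability of about 30%, nearly three times the naive expectation of 1/9. More generally, the larger a digit string is, the less likely a number is to begin with it. The precise mathematical statement is: in base $b$, the probability that a number begins with the digit string $n$ is $\log_b(n + 1) - \log_b(n)$.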

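In base 10, the probability that the first digit is $d$ is $P(d) = \log_{10}(d + 1) - \log_{10}(d)$, which gives approximately:

d:     1      2      3      4      5      6      7      8      9
P(d):  30.1%  17.6%  12.5%  9.7%   7.9%   6.7%   5.8%   5.1%   4.6%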

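The law is said to have been discovered when Benford, leafing through a book of logarithm tables, noticed that the first few pages were dark and worn from use, while the later pages were progressively cleaner. He wondered whether numbers beginning with 1 simply occur more often than others, tallied some data, and found that they do. It is hard to tell now whether the logarithm-table story is true. It is a bit like the story that Newton discovered universal gravitation after an apple hit him on the head: as long as the resulting law is useful, the story does not matter much.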

First, the range of data to which Benford's law applies

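This is a truly remarkable law with an unusually wide range of application: almost any everyday statistical data that is not governed by human-imposed rules satisfies it. National populations, national land areas, account books, physical and chemical constants, the answers at the back of mathematics and physics textbooks, radioactive half-lives and so on all turn out to follow Benford's law. It is worth mentioning that scientists have also found that the three important distributions of statistical physics, the Boltzmann-Gibbs, Bose-Einstein, and Fermi-Dirac distributions, approximately satisfy Benford's law as well! (Source: Li Miao's blog)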

Second, the law does have its limits after all

First, the data must span a wide enough range: it has to cover several orders of magnitude for this result to emerge.

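Second, data governed by human-imposed rules does not satisfy the law. Mobile phone numbers, ID card numbers, and invoice numbers, for example, clearly do not follow this logarithmic distribution. In other words, Benford's law shows itself precisely where no constraints apply; the more artificial constraints there are on how the data is generated, the less well the data fits.

Third, the data must not have been manipulated. Numbers altered at will generally stop satisfying Benford's law. In the famous Enron accounting fraud, for example, the company's books did not satisfy Benford's law, so this mysterious law can even be used to help detect financial fraud.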
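As a rough illustration of how such a check might look, the sketch below compares observed first-digit counts with the Benford expectation using Pearson's chi-square statistic. The function names `first_digit` and `benford_chi_square`, the two sample data sets, and the choice of test are assumptions made for illustration; the post does not say how the Enron books were actually examined.

```python
import math
from collections import Counter

def first_digit(value):
    """Leading non-zero decimal digit, read from scientific notation."""
    return int(f"{abs(value):.15e}"[0])

def benford_chi_square(values):
    """Pearson chi-square statistic of the first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts[d] - expected) ** 2 / expected
    return stat

# Illustrative data only: powers of 2 span many orders of magnitude,
# while the three-digit numbers 100..999 have perfectly uniform first digits.
natural = [2 ** k for k in range(200)]
invented = list(range(100, 1000))

# 15.507 is the 5% critical value of chi-square with 8 degrees of freedom.
CRITICAL_5_PERCENT = 15.507
print(benford_chi_square(natural) < CRITICAL_5_PERCENT)   # fits Benford
print(benford_chi_square(invented) < CRITICAL_5_PERCENT)  # does not fit
```

A statistic above the critical value only says that the digits deviate from Benford's law; it is a reason to look closer, not proof of fraud.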

So how should we understand this mysterious law? Why does naturally generated data follow such a peculiar law rather than a uniform distribution?

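The root of Benford's law lies in exponential growth. If a variable grows exponentially with time, its leading digit changes over time as plotted in the original post's figure (horizontal axis: time; vertical axis: the variable): the variable spends the longest stretch of time with leading digit 1, and progressively shorter stretches with each larger digit.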

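Clearly, at a randomly chosen moment the value is more likely to begin with 1 than with 9. That is the case for a single variable; if you observe a large number of such quantities at some moment, more of them will begin with 1 than with 9. Exponential growth is very common in nature: whenever a variable's growth rate is proportional to its size, the result is exponential growth. For example, the pace of technological progress is roughly proportional to the existing body of technology, so technology grows exponentially; the rate of population growth is proportional to the existing population, so population growth without resource constraints is also exponential. Exponential growth is an extremely common pattern of change in nature, and this pattern leads directly to Benford's law.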
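The argument can be checked numerically with a minimal sketch. It assumes a quantity that starts at 1 and grows by a fixed 5% per step (both values are arbitrary choices, and `leading_digit_fractions` is a name introduced here), then counts how often each leading digit is seen. The computation is done on the logarithm of the value so that long runs do not overflow.

```python
import math
from collections import Counter

def leading_digit_fractions(growth_rate=0.05, steps=10_000):
    """Share of time steps at which an exponentially growing value has each leading digit."""
    log_step = math.log10(1 + growth_rate)
    counts = Counter()
    for t in range(steps):
        # log10 of the value at step t; its fractional part fixes the leading digit.
        fractional = (t * log_step) % 1.0
        mantissa = 10 ** fractional          # lies in [1, 10)
        counts[min(int(mantissa), 9)] += 1   # min() guards against rounding up to 10
    return {d: counts[d] / steps for d in range(1, 10)}

fractions = leading_digit_fractions()
for d, share in fractions.items():
    print(f"{d}: observed {share:.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```

Changing `growth_rate` should leave the shares close to the Benford values, as long as the run covers many orders of magnitude.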

Another intuitive explanation (from Wikipedia) goes like this

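Consider counting in order from 1: 1, 2, 3, ..., 9. If we stop here, every leading digit appears equally often. But the two-digit numbers 10 to 19 that follow put 1 far ahead of the other digits again. And before the next batch of numbers starting with 9 can appear, we must first pass through the batches starting with 2, 3, 4, ..., 8. If such a count stops at some arbitrary point, numbers starting with 1 will generally be more frequent than numbers starting with 9.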

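Take all the house numbers in a city as an example. Some streets end at a number in the 100s, some in the 500s, and some in the 900s. A street that ends in the 500s necessarily includes 1, 10 to 19, and 100 to 199, all of which begin with 1, but it contains no three-digit numbers beginning with 9; its only numbers beginning with 9 are 9 and 90 to 99. So numbers beginning with 1 clearly outnumber those beginning with 9. Combine all the streets in the city, and the result ends up following Benford's law.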
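The house-number argument can be tried out with a small simulation. The sketch below assumes, purely for illustration, a city of 2000 streets whose last house number is drawn uniformly between 1 and 999; `street_number_digit_counts` is a name introduced here. Under this simple assumption the counts are strongly biased towards small leading digits, although they are not expected to match the Benford frequencies exactly.

```python
import random
from collections import Counter

def street_number_digit_counts(num_streets=2000, max_length=999, seed=0):
    """Count leading digits of all house numbers in a simulated city."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_streets):
        last_number = rng.randint(1, max_length)  # where this street ends
        for house in range(1, last_number + 1):
            counts[int(str(house)[0])] += 1
    return counts

counts = street_number_digit_counts()
total = sum(counts.values())
for d in range(1, 10):
    print(f"{d}: {counts[d] / total:.3f}")
```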

The above is only an intuitive picture. For the underlying principle, see the proof in Hill, T. P. "A Statistical Derivation of the Significant-Digit Law." Stat. Sci. 10, 354-363, 1995.

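It is also worth mentioning that Benford's law is scale invariant: if we switch to a different system of units, the law still holds. In fact, this can itself serve as an explanation of why naturally occurring statistical data satisfies the law: if we convert data measured in metres to another unit, say feet or yards, the distribution of leading digits should not change. The only distribution with this scale invariance is a logarithmic one, namely Benford's law, the subject of this article.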
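Scale invariance is easy to try out. The sketch below assumes synthetic lengths that are evenly spread on a logarithmic scale over six orders of magnitude (an illustrative choice, not real measurements), converts them from metres to feet, and compares the leading-digit shares before and after. The name `leading_digit_shares` is introduced here for the example.

```python
import math
import random
from collections import Counter

def leading_digit_shares(values):
    """Share of positive values starting with each digit 1..9."""
    counts = Counter(int(f"{v:.15e}"[0]) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

rng = random.Random(0)
# Synthetic lengths in metres, uniform on a log scale from 1 to 10**6.
metres = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
feet = [m * 3.28084 for m in metres]  # 1 metre is about 3.28084 feet

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
shares_m = leading_digit_shares(metres)
shares_ft = leading_digit_shares(feet)
for d in range(1, 10):
    print(f"{d}: metres {shares_m[d - 1]:.3f}  feet {shares_ft[d - 1]:.3f}  "
          f"Benford {benford[d - 1]:.3f}")
```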

Benford's Law

Benford's Law (which was first mentioned in 1881 by the astronomer Simon Newcomb) states that if we randomly select a number from a table of physical constants or statistical data, the probability that the first digit will be a "1" is about 0.301, rather than 1/9 ≈ 0.111 as we might expect if all nine possible leading digits were equally likely. In general, the "law" says that the probability of the first digit being a "d" is

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \ldots, 9.$$

This implies that a number in a table of physical constants is more likely to begin with a smaller digit than a larger digit. It was published by Newcomb in a paper entitled "Note on the Frequency of Use of the Different Digits in Natural Numbers", which appeared in The American Journal of Mathematics (1881) 4, 39-40. It was re-discovered by Benford in 1938, and he published an article called "The Law of Anomalous Numbers" in Proc. Amer. Phil. Soc 78, pp 551-72.

To illustrate this interesting fact, try tabulating the first digits of the physical constants listed in Table 2.3 of Abramowitz and Stegun's "Handbook of Mathematical Functions". The result is the bar chart shown below, which gives the distribution of the leading digits of the 44 constants in the table, along with the theoretical expected distribution based on Benford's Law:

Aside from the conspicuous deficiency of 3's, this is a reasonably good match for just 44 data points.

Although there have been many lengthy and erudite "explanations" of Benford's Law, it seems to me it can be explained with a single picture:

1---------------2---------3-------4-----5----6---7--8--9

Clearly the underlying premise of Benford's Law is that the subject population of quantities, expressed in the base 10 and more or less arbitrary units, will be fairly evenly distributed on a logarithmic scale. This is confirmed by the fact that the exponents on these constants are fairly uniformly distributed (at least over several orders of magnitude). As a result, the probability of the leading digit being "d" clearly approaches

$$\frac{\log(d+1) - \log(d)}{\log(10) - \log(1)} = \log_{10}\left(1 + \frac{1}{d}\right).$$

Of course, we could have chosen units for our physical constants such that the leading digits were all 9's (for example), but evidently we have a natural tendency to choose units so that our numbers are evenly distributed by order of magnitude, rather than absolute value. This may be related to our basic impressions of hearing and sight (and earthquakes), since our sense impressions of loudness and brightness are logarithmic.

Naturally we can apply Benford's Law to numbers expressed in any base, not just the base 10. In general the probability of the leading digit d (in the range 1 to B-1) for the base B is

$$P_B(d) = \frac{\ln(d+1) - \ln(d)}{\ln(B)}.$$

Notice that for binary numbers, i.e., numbers expressed in the base 2, the probability of the leading digit being 1 is 1.000, as it must be, since the leading non-zero digit of a binary number is necessarily 1. The distributions of probabilities of the digits 1 to B-1 for each base B from 2 to 10 are shown below.
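The distributions for the bases 2 to 10 can be tabulated with a few lines of code. This is a minimal sketch of the base-B formula; the function name `leading_digit_probability` is an assumption made for the example.

```python
import math

def leading_digit_probability(d: int, base: int) -> float:
    """Probability that the leading digit is d for numbers written in the given base."""
    if not 1 <= d < base:
        raise ValueError("d must satisfy 1 <= d <= base - 1")
    return (math.log(d + 1) - math.log(d)) / math.log(base)

# One row per base B, listing the probabilities of the digits 1 .. B-1.
table = {
    B: [leading_digit_probability(d, B) for d in range(1, B)]
    for B in range(2, 11)
}
for B, row in table.items():
    print(f"B={B:2d}: " + "  ".join(f"{p:.3f}" for p in row))
```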

We can also easily verify that the sum of all the probabilities for digits 1 through B-1 equals 1.0000, as it must, since the leading digit must be one of these. This implies

$$\sum_{d=1}^{B-1} \left[ \ln(d+1) - \ln(d) \right] = \ln(B).$$

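To verify this, recall the fundamental law of logarithms, ln(ab) = ln(a) + ln(b). With this we can re-write this sum of logarithms as the logarithm of a product:

$$\sum_{d=1}^{B-1} \ln\left(\frac{d+1}{d}\right) = \ln\left(\frac{2}{1} \cdot \frac{3}{2} \cdot \frac{4}{3} \cdots \frac{B}{B-1}\right) = \ln(B),$$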

which confirms the result.

By the same kind of analysis we can determine the probability that the second digit will have a certain value. It's only necessary to consider a single order of magnitude, since the pattern is repeated on each order. For example, in the base 10, the probability of the second digit being "3" is equal to the sum of the probabilities of the first two digits being "1.3", "2.3", "3.3", ... or "9.3" for numbers in the range from 1 to 10. This is indicated by the shaded regions in the logarithmic scale shown below.

The fraction of this region covered by the range from 1.3 to 1.4 is

$$\frac{\ln(1.4) - \ln(1.3)}{\ln(10)} = \log_{10}\left(\frac{14}{13}\right) \approx 0.0322.$$

The fraction covered by the other regions (2.3 to 2.4, 3.3 to 3.4, and so on) can be found similarly, and we can add them together to give the total probability that the first digit following the first non-zero digit will be a 3:

$$\sum_{k=1}^{9} \log_{10}\left(\frac{10k + 4}{10k + 3}\right) \approx 0.1043.$$

In general, the probability of the 2nd digit being d in a base-B number (taken from a logarithmic population) is

$$\frac{1}{\ln(B)} \sum_{k=1}^{B-1} \ln\left(\frac{kB + d + 1}{kB + d}\right).$$

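Extending this analysis to the case of the nth digit following the first non-zero digit, we arrive at the general formula

$$P_n\{d\} = \frac{1}{\ln(B)} \sum_{k=B^{n-1}}^{B^n - 1} \ln\left(\frac{kB + d + 1}{kB + d}\right).$$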

This applies to the case of the leading non-zero digit, with the understanding that with n = 0 the summation reduces to just the single term k = 0. This formula shows that the non-uniformity in the distribution of digits becomes much less as we consider less significant digits. For example, in the base 10 we have P_0{1} = 0.301029995..., P_1{1} = 0.113890103..., and P_2{1} = 0.101375977... Thus the probability of a "1" quickly approaches 1/10 as we proceed to less significant digits.
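These values can be reproduced with a short sketch. It assumes the summation runs over every possible block of digits k that can precede the digit d (the single term k = 0 for the leading digit, and k from B^(n-1) to B^n - 1 otherwise); the function name `digit_probability` is introduced here for illustration.

```python
import math

def digit_probability(n: int, d: int, base: int = 10) -> float:
    """Probability that the n-th digit after the leading non-zero digit equals d.

    n = 0 refers to the leading non-zero digit itself, so d must then be non-zero.
    """
    if n == 0:
        if d == 0:
            raise ValueError("the leading non-zero digit cannot be 0")
        prefixes = range(0, 1)                      # single term k = 0
    else:
        prefixes = range(base ** (n - 1), base ** n)  # all n-digit prefixes
    total = sum(
        math.log((k * base + d + 1) / (k * base + d)) for k in prefixes
    )
    return total / math.log(base)

for n in range(3):
    print(f"P_{n}{{1}} = {digit_probability(n, 1):.9f}")
```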
