Information Security and Benford’s Law

Mahyar A. Amouzegar is the provost and senior vice-president and Freeport McMoRan Distinguished Professor of Logistics at the University of New Orleans.

Khosrow Moshirvaziri received M.S. and D. Engr. degrees from Stanford University and a Ph.D. in EE from the University of California, Los Angeles.

INTRODUCTION

Falsified numbers in tax returns, invoice payment records, expense account claims, and many other settings often display patterns that aren’t present in legitimate records. In fact, large lists of numbers behave in a way that may seem counterintuitive. One would expect all digits to occur with equal frequency; after all, why would one digit be favored over another? Yet it has been shown that in many datasets, whether naturally occurring or human generated, the first digits of the numbers (e.g., in legitimate records) often follow a distribution similar to the table below.

Table 1. Expected frequency of each first significant digit under Benford’s law

  First digit:   1       2       3       4      5      6      7      8      9
  Frequency:     30.1%   17.6%   12.5%   9.7%   7.9%   6.7%   5.8%   5.1%   4.6%

That is, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. This distribution is known as Benford’s law, also called the “First Digit Law”. It is interesting to note that this result applies to a surprisingly large number of datasets, including street addresses, certain stock market data, Internal Revenue Service files, electricity bills, house prices, population numbers, death rates, lengths of rivers, physical and mathematical constants such as radioactive decay rates, and processes described by power laws [2, 6].

Mathematically, the probability that d is the leading significant digit is given by

  $$P(d) = \log_{10}\left(1 + \frac{1}{d}\right),$$

  where d = 1, 2, 3, …, 9.

Datasets following this distribution are said to be Benford. If a dataset is Benford, then by the above equation there is approximately a 30% chance that the first significant digit of any datum in that dataset is 1, about an 18% chance that the first significant digit is 2, and so on, decreasing to only about 4.6% for the first significant digit being 9. When altered fraudulently, Benford datasets depart from this pattern, a fact that is used in fraud detection [3]. These predicted probabilities, especially for the first significant digit, have been shown to hold for both theoretical distributions and some real datasets [5].
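The percentages quoted above can be checked directly from the logarithmic formula P(d) = log10(1 + 1/d). The following is a minimal sketch; the function names are illustrative and do not come from the article or from any library.

```python
import math


def benford_first_digit(d: int) -> float:
    """Probability that the first significant digit equals d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


def benford_first_digit_table() -> dict:
    """Map each first digit 1..9 to its Benford probability."""
    return {d: benford_first_digit(d) for d in range(1, 10)}


# Print the expected share of each leading digit as a percentage.
for digit, probability in benford_first_digit_table().items():
    print(f"{digit}: {probability:.1%}")
```

The nine probabilities sum to one, starting at 30.1% for the digit 1 and falling to 4.6% for the digit 9.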

When should we expect a dataset to be Benford? There is as yet no clear answer. However, it can be proven mathematically that taking values from any distribution and repeatedly multiplying or dividing them by random numbers, or raising them to a random integer power, produces a distribution that converges to Benford.
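The convergence under repeated multiplication can be illustrated with a small simulation. This is only a sketch under stated assumptions: every series starts at the same value 5 (so the starting first-digit distribution is as far from Benford as possible), the random factors are drawn uniformly from [0.5, 1.5], and the number of series, the number of steps, and the seed are arbitrary choices. The running product is tracked as a base-10 logarithm, whose fractional part determines the first digit.

```python
import math
import random


def leading_digit_from_log10(log_value: float) -> int:
    """First significant digit of 10**log_value, read off the fractional part."""
    fraction = log_value - math.floor(log_value)
    # min() guards the rare rounding case where the fraction evaluates to 1.0.
    return min(int(10 ** fraction), 9)


def simulate_repeated_products(n_values=10_000, n_steps=40, seed=2018):
    """Return simulated first-digit frequencies for digits 1..9."""
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n_values):
        log_value = math.log10(5.0)  # assumption: every series starts at 5
        for _ in range(n_steps):
            # Multiplying by a random factor adds its logarithm.
            log_value += math.log10(rng.uniform(0.5, 1.5))
        counts[leading_digit_from_log10(log_value)] += 1
    return [counts[d] / n_values for d in range(1, 10)]


observed = simulate_repeated_products()
expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
for d, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{d}: simulated {obs:.3f}  Benford {exp:.3f}")
```

Working in logarithms avoids overflow or underflow of the running product and makes the mechanism visible: the sum of many random logarithms spreads out until its fractional part is close to uniform, which is equivalent to the first digits being Benford.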

Researchers have laid out some properties that tend to lead to a dataset being Benford, though satisfying these conditions will not guarantee it:

  • Numbers coming from mathematical combinations of other numbers (e.g., products of numbers such as price times quantity);
  • Transaction-level data (as opposed to aggregated data);
  • Large datasets that span multiple orders of magnitude in values;
  • Data for which the mean is greater than the median and skewness coefficient is positive (long right tail); and
  • Scale invariance.

Datasets that are less likely to be Benford are those composed of assigned or sequential numbers (e.g., telephone numbers), data that are influenced by psychological factors (e.g., prices set at $29.99), data with a large number of firm-specific numbers (accounts set up to record refunds of a fixed price), or data with a built-in minimum or maximum. Data that are presented as percentages rather than raw values are also less likely to be Benford. Data that have a fixed number of digits for each entry are often not Benford.

Although the mathematical proof is beyond the scope of this article, the law is not difficult to understand intuitively. Consider a stock portfolio with a current market value of $1,000,000. For the first significant digit to turn from “1” to “2”, the portfolio has to double in size; that is, its value needs to grow by 100 percent. For the first digit to then change from “2” to “3”, the portfolio needs to grow by only 50 percent, and for it to change from “3” to “4”, by only about 33 percent. It is therefore clear that in many distributions of financial data, which measure the size of anything from a purchase order to stock market returns, the first digit one is much further from two than eight is from nine. Thus, for these distributions, smaller values of the first significant digit are much more likely than larger values [3].
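The portfolio argument can be made quantitative under one added assumption that is not in the text above: the portfolio grows at a constant exponential rate r > 0. The worked derivation below shows that the share of time the value spends with leading digit d is then exactly the Benford probability.

```latex
% Assumption (not in the article): constant exponential growth at rate r > 0.
\begin{align*}
V(t) &= V_0\, e^{rt} \\
% Time spent with leading digit d, i.e. growing from d \cdot 10^k to (d+1) \cdot 10^k:
t_d &= \frac{1}{r} \ln\frac{(d+1)\cdot 10^k}{d \cdot 10^k}
     = \frac{1}{r} \ln\left(1 + \frac{1}{d}\right) \\
% Time needed to cross one full order of magnitude, from 10^k to 10^{k+1}:
T &= \frac{1}{r} \ln 10 \\
% Fraction of time spent with leading digit d:
\frac{t_d}{T} &= \frac{\ln\left(1 + 1/d\right)}{\ln 10}
     = \log_{10}\left(1 + \frac{1}{d}\right)
\end{align*}
```

For d = 1 the required growth is 1/d = 100 percent and the time share is about 30 percent; for d = 9 the required growth is only about 11 percent and the time share is about 4.6 percent.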

Table 2 presents the first-, second-, third-, and fourth-digit proportions of Benford’s law, generated from the formula above and its extension to later digit positions (note how, for later positions, the digit frequencies converge to a more uniform distribution).
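The proportions for later digit positions can be recomputed with the standard generalization of the first-digit formula: the probability that the digit in position n equals d is the sum of log10(1 + 1/(10k + d)) over every possible block k of the preceding n - 1 digits. The sketch below assumes this generalization is what underlies the table; the function name is illustrative.

```python
import math


def benford_digit_probability(position: int, digit: int) -> float:
    """Probability that the digit at `position` equals `digit`.

    position = 1 is the first significant digit (digits 1..9);
    later positions allow digits 0..9.
    """
    if position < 1:
        raise ValueError("position must be 1 or greater")
    if position == 1:
        if not 1 <= digit <= 9:
            raise ValueError("the first significant digit must be 1..9")
        return math.log10(1 + 1 / digit)
    if not 0 <= digit <= 9:
        raise ValueError("digit must be 0..9")
    # Sum over every block k formed by the preceding (position - 1) digits.
    start = 10 ** (position - 2)
    stop = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(start, stop))


# Print one row per digit, one column per digit position 1..4.
print("digit  1st     2nd     3rd     4th")
for digit in range(10):
    cells = []
    for position in range(1, 5):
        if position == 1 and digit == 0:
            cells.append("  -   ")  # zero cannot be a first significant digit
        else:
            cells.append(f"{benford_digit_probability(position, digit):.4f}")
    print(f"{digit}      " + "  ".join(cells))
```

By the fourth position the probabilities differ from 0.1 by less than a tenth of a percentage point, which is the flattening the text describes.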

ORIGIN OF BENFORD’S LAW

In 1881, Simon Newcomb, an astronomer and mathematician, discovered the statistical principle that has become known as Benford’s law. He observed that logarithm books, used at that time to carry out calculations, were considerably more worn in the beginning pages, which dealt with low digits, and progressively less worn on the pages dealing with higher digits. This led him to formulate the principle that, in any list of numbers taken from an arbitrary set of data, more numbers will tend to begin with “1” than with any other digit. The obvious conclusion was that more numbers exist that begin with the numeral one than with larger numerals.

Newcomb provided no theoretical explanation for the phenomenon he described, and his article went virtually unnoticed. Then, more than 50 years later, Frank Benford, a physicist, also noticed that the first few pages of his logarithm books were more worn than the last few. He came to the same conclusion Newcomb had arrived at years earlier: people more often looked up numbers that began with low digits than with high ones. He also posited that there were more numbers that began with the lower digits. Unlike Newcomb, however, he tested his hypothesis by collecting and analyzing data. Benford collected more than 20,000 observations from such diverse datasets as areas of rivers, atomic weights of elements, and numbers appearing in Reader’s Digest articles [1]. He found that the numbers consistently fell into a pattern, with low digits occurring more frequently in the first position than high digits. The mathematical tenet defining the frequency of digits became known as Benford’s law.

However, it wasn’t until 1995 that T. P. Hill, a mathematician, provided a proof of Benford’s law and demonstrated how it applied to stock market data, census statistics, and certain accounting data. He noted that Benford’s distribution, like the normal distribution, is an empirically observable phenomenon. Hill’s proof relies on the fact that the numbers in sets that conform to the Benford distribution are second-generation distributions, that is, combinations of other distributions. If distributions are selected at random and random samples are taken from each of these distributions, then the significant-digit frequencies of the combined samples will converge to Benford’s distribution, even though the individual distributions may not closely follow the law. The key is in the combining of numbers from different sources. In other words, combining unrelated numbers gives a distribution of distributions, a law of true randomness that is universal [5].

APPLICATION OF BENFORD’S LAW TO DETECTING ANOMALIES

Public and private sectors are highly dependent on information systems to carry out their missions and business functions, and these dependencies have made business and government supply and information processes highly vulnerable to cyber-attacks. The problem is not just denial of service or malicious firmware, which are of course of concern, but also subtle corruption of data that might affect the operation of these enterprises. The sheer size and complexity of many supply chain systems, for example, place demands on knowledge of the identity of parts, stock levels, part locations, and many other data that exceed human capacity. In these systems, the absence of reliable data might force many key functions to halt or, at a minimum, decrease trust in the accuracy of information. This is true in the private sector and also in defense, where, for example, modern aircraft have become complex systems that arguably require multifaceted support systems relying heavily on technical data and automated diagnostic equipment.

Data Error and Impact – Of course, data errors in any operation are inevitable. Errors occur routinely from everyday mistakes. For the most part, these day-to-day errors do not have significant negative operational impacts, as most systems have processes that have evolved to handle them, and the randomness of routine errors makes it unlikely that any one error will cascade into a major operational problem. But significant impacts are possible, as experience has shown. A skilled, determined, and knowledgeable adversary could potentially wreak far more damage by deliberately corrupting data in ways that are unlikely to be detected as anomalous, while targeting the attack to have a significant negative impact on operations. An adversary (whether internal or external, domestic or foreign) might choose this kind of targeted attack by corruption over data destruction or denial of access to data in order to maintain a longer foothold in the systems or to mask attribution. Of course, regardless of whether data is corrupted by attack or by random error, operational systems should be sufficiently resilient and robust to data corruption to continue providing adequate support. However, even the most resilient operation can be susceptible to an astute, highly targeted attack, and therefore enterprises with complex operations should leverage modern tools that can detect and identify seriously anomalous data.

Election Case – An interesting use of Benford’s law came after the controversial 2009 Iranian presidential election, in which Mahmoud Ahmadinejad, running against three challengers, won with an overwhelming majority despite pre-election expectations. The Ministry of the Interior (MOI) published a table of the number of votes received by each candidate in each of the 366 voting areas. The MOI’s data vary from about 10^4 to 10^6, spanning roughly two orders of magnitude, which suggested the possibility of the dataset being Benford. Boudewijn Roukema used the available data and Benford’s law to show, among other issues, that one of the losing candidates had a significant excess of vote counts starting with the digit 7. He concluded, “[the] most consistent way to explain these results would appear to be the hypothesis of artificial interference in the official results” [7].
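One common way to quantify an excess of a particular first digit is a chi-square goodness-of-fit test of the observed first-digit counts against Benford’s expected counts. The sketch below illustrates that generic test and is not a reproduction of Roukema’s procedure. The two datasets are synthetic and generated only for illustration (they are not election data); the function names are hypothetical, and 15.507 is the 5% critical value of the chi-square distribution with 8 degrees of freedom.

```python
import math
import random

# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (nine possible first digits minus one).
CHI2_CRITICAL_5PCT_DF8 = 15.507


def first_digit(value) -> int:
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(value):.16e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic of the first-digit counts against Benford's law."""
    counts = [0] * 10
    n = 0
    for value in values:
        if value == 0:
            continue  # zero has no first significant digit
        counts[first_digit(value)] += 1
        n += 1
    if n == 0:
        raise ValueError("need at least one nonzero value")
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


# Synthetic illustration only: 366 values between 10^4 and 10^6.
rng = random.Random(2009)
log_uniform = [10 ** rng.uniform(4, 6) for _ in range(366)]  # Benford-like
flat = [rng.uniform(10_000, 99_999) for _ in range(366)]     # digits roughly equal

for label, data in (("log-uniform", log_uniform), ("flat", flat)):
    stat = benford_chi_square(data)
    verdict = "reject" if stat > CHI2_CRITICAL_5PCT_DF8 else "do not reject"
    print(f"{label}: chi-square = {stat:.1f} -> {verdict} Benford at the 5% level")
```

A large statistic only says that the digits depart from Benford’s law; as with any such screen, it flags data for closer scrutiny rather than proving manipulation.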

SOCIAL NETWORKS DATASET

Data collected from various social networks showed another interesting application of Benford’s law [4]. Data was collected from the following social networks, listed with the number of users in each dataset:

  • Pinterest: 40 million users
  • Twitter: 78,000 users
  • LiveJournal: 45,000 users
  • Google Plus: 20,000 users
  • Facebook: 18,000 users

For each user, the number of friends or followers was counted, and the distribution of the first digits of these counts was then determined. Every dataset except one (Pinterest) followed Benford’s law. Given the discussion above, the fact that most of these datasets are Benford shouldn’t be a surprise, since the data were generated naturally, through organic growth. Of course, a dataset that is not Benford does not necessarily indicate fraud, but it may mean that additional scrutiny is warranted.

The explanation for Pinterest turned out to be that its users are required to follow five or more “interests” as part of the registration process; this creates at least five initial follows for each user, which affects the entire distribution of first significant digits. The study also considered the network formed by each user’s friends, known as an egocentric network. The correlation between the first-digit distribution in each user’s egocentric network and Benford’s law was measured, and for the majority of users this correlation was greater than 0.9, meaning that they conformed to Benford’s law. In the case of Twitter, only 170 users out of 21,000 had a correlation lower than 0.5. Further investigation showed that some of these accounts were spam and that most of them were part of a network of Russian bots that behaved in a similar way. The purpose of these accounts was not clear, but they were certainly suspicious.
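The correlation measure described above can be sketched as a Pearson correlation between the observed first-digit frequencies in one user’s egocentric network and the nine Benford probabilities. This is an assumption about the computation, not the study’s actual code; the function names and the list of friend counts are hypothetical.

```python
import math

# Benford probabilities for first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit_frequencies(counts):
    """Share of each first digit 1..9 among positive integer counts."""
    tallies = [0] * 9
    total = 0
    for count in counts:
        if count <= 0:
            continue  # zero has no first significant digit
        tallies[int(str(count)[0]) - 1] += 1
        total += 1
    if total == 0:
        raise ValueError("need at least one positive count")
    return [t / total for t in tallies]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def benford_correlation(counts):
    """Correlation between observed first-digit shares and Benford's law."""
    return pearson(first_digit_frequencies(counts), BENFORD)


# Hypothetical friend counts of the accounts in one user's egocentric network.
friend_counts = [12, 150, 1043, 23, 310, 18, 95, 1, 204, 17, 130, 2750, 41, 11, 160]
print(f"correlation with Benford: {benford_correlation(friend_counts):.2f}")
```

Accounts whose correlation falls below a chosen threshold, such as the 0.5 mentioned above, would then be candidates for manual review.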

CONCLUDING REMARKS

In recent years, we have witnessed a dramatic rise in hacker activity and destructive attacks, originating both inside and outside the country, on various social, financial, political, and government networks. Fraudulent and suspicious activities are quite difficult to detect, as hackers equip themselves with sophisticated tools for carrying out their malicious attacks. The framework based on Benford’s law has proven to have important implications for social network forensics. Models built upon the framework introduced herein have been very promising when implemented on the Twitter and Facebook networks [4]. This makes Benford’s law one of the effective tools available in the war against fraud and suspicious activity on social networks. Applying Benford’s law to social media provides a new tool for analyzing user behavior, understanding when and why natural deviations may occur, and ultimately detecting when anomalies occur.

REFERENCES

  1. Benford, Frank, “The Law of Anomalous Numbers,” Proceedings of the American Philosophical Society, Vol. 78, No. 4, 1938.
  2. Berger, Arno, and Theodore P. Hill, An Introduction to Benford’s Law, Princeton, New Jersey: Princeton University Press, 2015.
  3. Durtschi, Cindy, William Hillison and Carl Pacini, “The Effective Use of Benford’s Law to Assist in Detecting Fraud in Accounting Data,” Journal of Forensic Accounting, Vol. 5, 2004.
  4. Golbeck, Jennifer, “Benford’s Law Applies to Online Social Networks,” PLoS ONE, Vol. 10, No. 8, 2015. DOI: 10.1371/journal.pone.0135169
  5. Hill, Theodore, P., “A Statistical Derivation of the Significant-Digit Law,” Statistical Science, Vol. 10, No. 4, 1995.
  6. Nigrini, Mark J., Benford’s Law: Applications for Forensic Accounting, Auditing, and Fraud Detection, Hoboken, New Jersey: John Wiley & Sons, 2012.
  7. Roukema, Boudewijn, “A First-Digit Anomaly in the 2009 Iranian Presidential Election,” Journal of Applied Statistics, Vol. 41, No. 1, 2014.

 

ABOUT THE AUTHORS

Mahyar A. Amouzegar is the provost and senior vice-president and Freeport McMoRan Distinguished Professor of Logistics at the University of New Orleans. He is also a senior analyst at the RAND Corporation. He is the founding editor of the Journal of Applied Mathematics and Decision Sciences and is on the editorial boards of Advances in Operations Research, International Journal of Applied Decision Sciences, and International Journal of Strategic Decision Sciences. He is a fellow at IMA (UK) and ICA (Canada), a Senior Member of IEEE, and a member of Tau Beta Pi, the engineering honor society, and of the honor society of Phi Kappa Phi.

Khosrow Moshirvaziri received M.S. and D. Engr. degrees from Stanford University and a Ph.D. in EE from the University of California, Los Angeles. He served as a Staff Scientist with IBM Scientific Centers and R&D Divisions. Currently, he is a Professor of Information Systems (IS) at California State University, Long Beach, and director of the MS program in IS. Khosrow also serves as the director of Information Systems at Western DSI. He has over 25 years of experience in developing optimization algorithms and software, and delivering coursework on system simulation, optimization, and transportation networks. He is an Editor for Advances in Operations Research, Journal of Industrial Systems Engineering, and Advances in Decision Sciences. His research interests lie at the interface of computer science and operations research.

Updated: August 23, 2018 — 3:16 pm


Decision Sciences Institute © 2018