Open source analysis of SEC data

Many years ago (when I had long hair and glasses), I wrote a silly little hack – the Made-up-ness Quotient calculator – to run a simple statistical test on the numbers in companies’ SEC filings. The idea is very simple: when humans make up numbers, the distribution of digits in those numbers is very different from the distribution you get from real data. In particular, human-generated numbers don’t follow Benford’s Law, whereas real financial data generally does. People thought it was cool, but I never took it public because (a) if it had a bug and people made investment decisions based on that, I would feel bad and maybe get sued, and (b) the obvious thing to do is write a generic framework to perform these analyses and share the results.
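For the curious, here is a rough sketch of what a leading-digit test along these lines might look like. It is not the Made-up-ness Quotient calculator itself; the function names (`leading_digit`, `benford_chi_square`) and the choice of a chi-square statistic are my own assumptions for illustration, and it assumes the inputs are finite numbers.

```python
import math
from collections import Counter

# Benford's Law: the leading digit d appears with probability log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CHI2_CRITICAL_5PCT = 15.507


def leading_digit(x):
    """Return the first non-zero digit of a finite number, or None for zero."""
    # Scientific notation puts the leading digit first, e.g. '2.22...e+07'.
    d = int(f"{abs(x):.15e}"[0])
    return d if d != 0 else None


def benford_chi_square(values):
    """Chi-square statistic comparing leading digits against Benford's Law."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


if __name__ == "__main__":
    # Powers of 2 are a classic sequence whose leading digits follow Benford.
    natural = [2 ** k for k in range(1, 1001)]
    # Hypothetical "made-up" figures that all start with the digit 5.
    suspicious = [500 + i for i in range(100)]
    for name, data in (("natural", natural), ("suspicious", suspicious)):
        stat = benford_chi_square(data)
        verdict = "looks odd" if stat > CHI2_CRITICAL_5PCT else "looks plausible"
        print(f"{name}: chi-square = {stat:.2f} ({verdict})")
```

A statistic above the critical value only says the digits are unlikely to have come from a Benford distribution; it says nothing about why, and with a small pool of numbers the test has very little power either way.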

Last night I opened my copy of Wired and discovered that Jesper Andersen and Toby Segaran have done exactly that with It’s still in the early stages, but it looks promising!

And with that, I am free to make my Made-up-ness Quotient calculator public again:

First person to implement a Benford’s Law test on this system wins a Friends help friends use Linux shirt.

4 thoughts on “Open source analysis of SEC data”

  1. Wow, that is a really interesting idea. Of course, the trillion-dollar question is: does it work? I’m skeptical due to the nature of SEC numbers versus micro-numbers. Most are rounded to the million, so a 22.2 million entry in an SEC filing would not look odd, even if it is based on general ledger numbers that were made up and might be detectable. Fortunately, it is a semi-provable hypothesis: just put in SEC filings from companies known to have cooked their books (ones the SEC has taken action against) and see how they compare to the norm. That is, of course, assuming that the norm isn’t made-up numbers…

    I’m curious enough to run the question by a few econ/business professors I know… Maybe they have someone looking for a thesis/independent study project… :)

  2. caveats:
    you need a large pool of numbers; variance is small relative to the mean
    rounded-off numbers will produce false positives
