Data mining 8-K filings for trading opportunities

Two economists are walking down the street. One says: “Hey, there’s a dollar bill on the floor.” The other says: “Impossible. If it were real, someone would have picked it up by now.”

What if I told you that you could make money in the stock market by writing a computer script to read SEC filings and then investing based on those that have historically indicated good things to come? You could be the person who picks up that dollar bill.

There are no guarantees in the financial markets, let alone in life (well, except death and taxes – both things that libertarian Peter Thiel doth protest too much). But it may be possible.

In addition to filing annual reports on Form 10-K and quarterly reports on Form 10-Q, public companies in the US must report certain material corporate events on a more current basis. Form 8-K is the SEC’s designated format for such reports, and the SEC graciously makes the information available via its EDGAR database.

(EDGAR stands for the SEC’s Electronic Data Gathering, Analysis, and Retrieval system.)

These reports contain information that is material to evaluating a company’s financial health and prospects, so I wondered if we could analyze them en masse to find certain indicators of prosperity – bullish or bearish indicators, if you will.

This post explains the technical methods I use, along with some information on my process in general, and it is quite comprehensive. If there is anything that you find confusing, please send me an email and I will clarify: [email protected]. Thanks for reading.

Table of Contents

  1. Downloading the 8-Ks from the SEC
  2. Processing the 8-Ks
  3. Processing the price action on the underlying stocks
  4. Analyzing the data set for opportunities
  5. Doing a sanity check on the opportunities
  6. Building a production trading system to exploit the opportunity

Chapter One:  Downloading the 8-Ks from the SEC

The SEC publishes an index list of all the forms it has in its database. You can find it at https://www.sec.gov/Archives/edgar/full-index/#{year}/QTR#{quarter}/form.idx (where #{year} and #{quarter} are Ruby-style placeholders for the year and the quarter number).

Lines are in a fixed-width format with the columns Form Type, Company Name, CIK, Date Filed, File Name (we only care about the form type, the filing date, and the file name).


Therefore, what we need to do is:

  1. Download the list for each year and quarter
  2. Scan for lines that match ^8-K (the ^ is POSIX regular-expression syntax for “start of line”) and discard the rest
  3. For each line, grab the filing date and the path to the filing and stick them in a database.
  4. Download the actual file

Here’s some sample code to do just that, using Ruby.  This sort of problem is naturally parallelized, so I suggest you make that improvement.  Note: expect over 50GB of data per year!
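A minimal sketch along those lines (sequential for clarity; the index URL and fixed-width line layout follow EDGAR’s full-index conventions, and this is an illustration rather than the original code):

```ruby
require "net/http"
require "uri"

# Quarterly index of every filing, per EDGAR's full-index convention.
def index_url(year, quarter)
  "https://www.sec.gov/Archives/edgar/full-index/#{year}/QTR#{quarter}/form.idx"
end

# Each form.idx line is fixed-width: Form Type, Company Name, CIK,
# Date Filed, File Name.  Keep only lines that start with "8-K".
def parse_index_line(line)
  return nil unless line.start_with?("8-K")
  date = line[/\d{4}-\d{2}-\d{2}/]   # Date Filed column
  path = line.split.last             # File Name column (path within /Archives)
  date && path ? [date, path] : nil
end

# Download one quarter's index and fetch every 8-K it lists.
# Note: the SEC now expects a descriptive User-Agent header on requests.
def download_8ks(year, quarter, dest = "filings")
  Dir.mkdir(dest) unless Dir.exist?(dest)
  body = Net::HTTP.get(URI(index_url(year, quarter)))
  body.each_line.filter_map { |l| parse_index_line(l) }.each do |date, path|
    out = File.join(dest, File.basename(path))
    next if File.exist?(out)   # resume-friendly: skip files we already have
    File.write(out, Net::HTTP.get(URI("https://www.sec.gov/Archives/#{path}")))
  end
end
```

To parallelize, run one worker per (year, quarter) pair; the quarters are independent of each other.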

Chapter Two:  Processing the 8-Ks

Okay! Now we have downloaded the 8-Ks from our chosen time period. We should have hundreds of gigabytes of data at this point – in fact, after deleting outliers (files greater than 50MB in size), I had over 280GB of data. I used a DigitalOcean droplet with their new “detachable volumes” feature to work off of – I just provisioned a 500GB volume for $50/month (billed in hourly increments).

Each 8-K file is in .txt format, but inside the 8-K is some XBRL (eXtensible Business Reporting Language) – think poorly formed XML. What I was interested in was the text inside the filings, not the metadata or images. I also wanted to remove “stop words”, so I used the list from Python’s NLTK (Natural Language Toolkit).



And here’s a Ruby code sample that uses parallelization for a speedup:
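A minimal sketch of the per-file work (shown sequentially for clarity – in practice each file is handed to a worker process, e.g. via the `parallel` gem – and the stop-word list here is a tiny illustrative subset of NLTK’s):

```ruby
# Tiny illustrative subset of NLTK's English stop-word list.
STOP_WORDS = %w[the a an and of to in is are for on that].freeze

# Strip markup crudely, lowercase, tokenize, and drop stop words.
def clean_filing(raw)
  text = raw.gsub(/<[^>]+>/, " ")   # crude tag removal
  text.downcase.scan(/[a-z][a-z'-]*/).reject { |w| STOP_WORDS.include?(w) }
end

# Sliding-window n-grams over the cleaned token stream.
def ngrams(words, n)
  words.each_cons(n).map { |gram| gram.join(" ") }
end
```

For one filing, `ngrams(clean_filing(raw), 3)` yields its 3-grams; the pipeline stores the 2-, 3-, and 4-gram lists keyed by stock and filing date.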

Chapter Three:  Processing the price action on the underlying stocks

Now that we have a dataset (database) in the format (stock, date, 2grams, 3grams, 4grams), we will want to add a little price action to elucidate how these ngrams do or do not predict price movement. There is more to back-testing equity prices than meets the eye. A naïve approach consists of something simple: grab the ticker symbol, use Yahoo! Finance to get the price on the date of the filing and again one year later, take the difference, and divide by the original price – there you have performance. Unfortunately, the real world is a bit more complicated than that. Why? To name a few reasons:

  • Companies issue dividends
  • Companies issue stock splits so the price needs to be adjusted by the split factor
  • Companies go bankrupt and new companies take over their ticker symbol
  • Companies change ticker symbols
  • Companies get acquired

Not to mention that many stocks are thinly traded, so you need to budget in a “slippage” factor and also assess whether there is enough normal trading volume to absorb your order.

Therefore our backtesting engine needs to be a bit more sophisticated. Fortunately, you can now acquire this data from vendors at a reasonable price; I recommend going through Quandl. Expect to budget at least 500GB of space to manage <10 years of this data in a fast SQL database. You will need split- and dividend-adjusted price history as well as a corporate-actions database.

At present I am not going to share my historical price-checking code, because it is useless unless you buy the data. But if you use a free platform like Quantopian, its integrated backtesting and research environments should handle most of these concepts for you (verify before relying on them).

Chapter Four:  Analyzing the data set for opportunities

Now we have a vector: (stock, filing date, n2grams[list], n3grams[list], n4grams[list], performance). We now need to go through every record, take each ngrams entry, unroll the list, and append each ngram in the list to a file.
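The unrolling step just described can be sketched as follows (the record layout is assumed to match the vector above, so the 3-gram list sits at index 3):

```ruby
# records: [[stock, filing_date, two_grams, three_grams, four_grams, perf], ...]
# Append every n-gram at the given index to a flat file, one per line.
def unroll_ngrams(records, ngram_index, out_path)
  File.open(out_path, "a") do |f|
    records.each do |rec|
      rec[ngram_index].each { |gram| f.puts(gram) }
    end
  end
end
```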

Now we should have three files: list_of_all_2_grams, list_of_all_3_grams, list_of_all_4_grams. For the sake of brevity, I am going to continue by describing just the list_of_all_3_grams. If each file has on average 2,000 words, then there are going to be ~1,998 3-grams per file (a file of w words yields w − n + 1 n-grams; call it 2,000). And if we have 300,000 files, that’s 600,000,000 rows in our n-gram file… kind of a lot.

Our first order of business is to remove ngrams that occur too frequently, as well as ngrams that occur too rarely. Why? Ngrams that occur too frequently don’t carry any useful information: for example, almost every 8-K filing is going to contain the ngram “securities exchange commission” (remember, we removed “the” because it’s a stop word). And ngrams that occur too rarely are useless because we can’t trade on the information.

So:  we sort the file, run it through uniq -c to get a count of each ngram, then filter out ngrams that are too rare or too frequent:

sort list_of_all_3_grams | uniq -c | awk '($1 >= 1000) && ($1 <= 10000) {print $0}' | sed -e 's/^ *[0-9]* *//' > interesting_n_3_grams

Now we have the file “interesting_n_3_grams”.  It contains a list of ngrams that appear between 1,000 and 10,000 times.  Now, we want to:

  1. Load this list into memory (into a set)
  2. Take our big vector list (stock, performance, n3grams) and run through each item.  See if any of the ngrams are in the set, and if they are, append the performance to a hash of arrays, keyed by ngram.
  3. This will result in us having a hash in the format:  {ngram => [perf_0, perf_1, perf_2, …, perf_n]}

Here’s a code sample:
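A minimal sketch of steps 1–3 plus the mean-performance ranking (an illustration rather than the original code; the record layout (stock, performance, n3grams) follows the list above):

```ruby
require "set"

# Build {ngram => [perf_0, perf_1, ...]} for the "interesting" ngrams only.
# records: [[stock, performance, three_grams], ...]
def performance_by_ngram(records, interesting_path)
  interesting = File.readlines(interesting_path, chomp: true).to_set
  perf_by_gram = Hash.new { |h, k| h[k] = [] }
  records.each do |_stock, perf, grams|
    grams.each do |gram|
      perf_by_gram[gram] << perf if interesting.include?(gram)
    end
  end
  perf_by_gram
end

# Rank ngrams by mean performance, best first (outlier handling omitted).
def ranked_by_mean(perf_by_gram)
  perf_by_gram.sort_by { |_gram, perfs| -(perfs.sum / perfs.size.to_f) }
end
```

Taking the first 100 entries of `ranked_by_mean(...)` gives the candidate bullish indicators.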

Now we want to sort our hash, ngram → performances, by mean performance (perhaps excluding outliers). Then we can take the top 100 keys and see what they are. In the case of 3-grams between 2010 and 2016, I found these to be the most bullish indicators:

3-gram
corporation /s/ james
financial corporation /s/
caution readers place
number description 99
business acquired applicable
balance sheet total
publicly release result
cost interest bearing
stockholders' equity total
noninterest expense increased
2013 (globe newswire)
interest margin net
information (in thousands,
gain sale loans
noninterest income increased
earnings (loss) $
2012 net interest
reported) may 21,
loan loss reserve
last year decrease
employee compensation benefits
low interest rates
yield earning assets
income noninterest income
total shares outstanding
service charges fees
margin net interest
3 months ended
fluctuations interest rates,
last year due
financial corporation (exact
statements reflect occurrence
percentage total loans
financial officer ex-99
attached exhibit 99
million period 2012
reflect occurrence anticipated
occurrence anticipated unanticipated
loan losses $
made forward-looking statements
anticipated unanticipated events
million period year
surrender value life
changes economic conditions
loans receivable, net
holding company headquartered
statements financial condition
description 99 press
$ 225 $
compared period 2012
(unaudited) (unaudited) (unaudited)
mortgage servicing rights
allowance loan lease
months ended ended
value life insurance
completed fourth quarter
$ 223 $
consolidated statements financial
$ 250 $
leverage capital ratio
compared $17 million
loan lease losses
hold advisory vote
$ 208 $
equity total assets
vote compensation named
partially offset impact
$ 182 $
circumstances date statements
$ 230 $
$ 211 $
gain (loss) sale
million net interest
30, 2012 company
2013, compensation committee
$ 199 $
quarter 2012 decrease
plan signatures pursuant
employer id no)
interest income loans
nonperforming assets total
date july 30,
officer ex-99 2
corporate governance management
transactions applicable (d)
net earnings (loss)
24013e-4(c)) section 5
$ 168 $
governance management item
shares outstanding march
million compared first
date made company
$ 77 $
2013 press release
interest bearing liabilities
8-k filed may
interest rate interest
31, 2013 june
applicable (c) shell
advisory vote frequency

Chapter Five:  Doing a sanity check on the opportunities

Now we have a list of 8-K filing keywords that were linked to strongly performing stocks. Let’s remove any of the keywords that look weird or for which we have no serious rationale or understanding of the underlying dynamic. We can easily see how “business acquired applicable” could be an interesting ngram, whereas “date july 30,” is more likely indicative of a data error or fluke.

I find these 3grams to be interesting:

business acquired applicable
balance sheet total
publicly release result
cost interest bearing
noninterest expense increased
gain sale loans
noninterest income increased
employee compensation benefits
total shares outstanding
service charges fees
statements reflect occurrence
reflect occurrence anticipated
occurrence anticipated unanticipated
million period year
mortgage servicing rights
leverage capital ratio
hold advisory vote
equity total assets
partially offset impact
vote compensation named
circumstances date statements
plan signatures pursuant
interest income loans
nonperforming assets total
governance management item
million compared first
interest bearing liabilities
advisory vote frequency

I chalk up the others to randomness.

Using the first, “business acquired applicable”, as an example (but do this for each keyword): search our database for all the 8-K filings that mentioned “business acquired applicable” and pull the tickers and the dates. Let’s do a check by hand… did these stocks really perform?

Make a table with the columns:
3-gram, # of distinct companies, mean performance, links to forms, …

Link to table — Do your due diligence!
As a sanity check, follow the above document and read the 8-K filings to figure out what happened. What business conditions resulted in so many companies repeatedly using this phrase across 1,000+ filings? Let’s open the CSV linked above and pick a row to analyze; we’ll choose “employee compensation benefits”. Here are some values from the third column of the spreadsheet:

“CAFI 6496.978181818182”, “CAFI 5975.29090909091”, “CAFI 474.0656000000001”, “CAFI 362.92640000000006”, “MEDS 24.301454545454543”, “CAFI 17.18181818181818”, “CAFI 16.391304347826086”, “GLBC 14.68944099378882”, “CPYT 11.857142857142858”, “CAFI 10.53846153846154”, “CAFI 10.53846153846154”, “CAFI 6.5”, “BRLI 4.225694444444445”, “BFCF 3.129032258064516”, “BFCF 2.7837837837837833”, “ABR 2.1228070175438596”, “RGR 1.7005108556832693”, “OCN 1.6656848306332845”, “SVLF 1.6630434782608698”, “NSM 1.6608851674641152”, “RGR 1.606205250596659”, “GCAP 1.4837962962962963”, “MRGE 1.4615384615384615”, “ENR 1.3744493392070485”, “GCAP 1.3538461538461541”, “BBX 1.341059602649007”, “BBX 1.2940074906367043”, “NPTN 1.2323529411764704”, “PTX 1.2162790697674417”, “XXIA 1.2056451612903223”, “ERI 1.2032520325203253”, “HPQ 1.129803586678053”, “MKTX 1.1156944892039962”, “XPO 1.1146278870829769”, “GCAP 1.1136842105263156”, “SWC 1.0891608391608392”, “CNS 1.087583719732097”, “EVR 1.0559599636032755”, “BBX 1.0524642289348172”, “VRTU 1.044059795436664”, “SFNS 1.0416666666666667”, “VRTU 1.0402819738167173”, “BBX 1.002469135802469”, “RGR 0.9904397705544934”, “MKTX 0.9493192133131617”, “SFNS 0.9285714285714288”, “IBKR 0.9192307692307692”, “NSM 0.9116666666666667”, “VRTU 0.9054505005561734”, “EVR 0.8994575045207959”, “HAL 0.8985258827562564”, “RGR 0.8957208142916494”, “NSM 0.8772504091653028”, “ROYE 0.8716666666666667”, “SMCG 0.8707167994072975”, “TRAD 0.846590909090909”, “SMCG 0.846153846153846”, “MRGE 0.8388278388278387”, “VRTU 0.8383838383838382”, “TAX 0.8375142531356897”, “VRTU 0.8252194732641662”, “PSX 0.8214285714285716”, “MKTX 0.8169170840289373”, “CI 0.7932527923410077”, “VRTU 0.7920741121976326”, “MNK 0.7909090909090909”, “AMTD 0.7782448765107726”, “VRGY 0.775147928994083”, “RGR 0.7719186785260483”, “DFS 0.7606534090909091”, “VRTU 0.7605552342394449”, “MKTX 0.7601317957166391”, “WLL 0.7587249244297883”, “ACMP 0.7550697909437682”, “SSNC 0.7543413807708599”, “CYLU 0.7500000000000001”, “AB 
0.742326909350464”, “HPQ 0.7421052631578945”, “AMTD 0.7411838790931988”, “SSNC 0.7256365232660228”, “SPB 0.7226705796038151”, “POL 0.7177375068643602”, “FALC 0.7155789473684211”, “ESC 0.7150442477876107”, “DFS 0.7139141742522757”, “WLL 0.71152720889409”, “VRGY 0.7103762827822122”, “IBKR 0.7102272727272726”, “TAX 0.7091559644751134”, “AOS 0.6946417043253712”, “MKTX 0.6815405046480745”, “TAX 0.6755395683453236”, “A 0.6744347826086957”, “TNC 0.6685296646603611”, “ENV 0.6664293342826629”, “RSE 0.6645279560036663”, “DFS 0.6577970297029703”, “NCR 0.6548314606741573”, “MEAS 0.6530456852791877”, “ABR 0.6507030772366007”, “ESC 0.643792888334373”, “KE 0.6426592797783933”, “TUC 0.6419213973799125”, “COWN 0.6398467432950193”, “WLL 0.6397338403041826”, “IBKR 0.6368794326241134”, “MKTX 0.6291891891891892”, “XPO 0.6271551724137931”

Now, CAFI is Campaign Auto Finance. What happened was that the stock fell so low (<$0.10 per share) that it was possible to make outsized returns by investing in it. Although you might dismiss stocks that get that low, the same thing happened to Plug Power (PLUG), the fuel cell system company, and investors who bought at the lows could have enjoyed returns of up to 8000%. However, for the sake of discussion, we will review another high-performing ticker’s filings to see what happened. How about RGR – Sturm, Ruger & Company, the firearms manufacturer.


Here are the RGR filings that contained the n3gram “employee compensation benefits”:

Let’s take a look at the first one. The 8-K was a press release regarding their financial results, and “employee compensation benefits” was a row in the liabilities table of those results.

Let’s take a look at another one: same thing.
Another one: same thing yet again.
Let’s jump to the last in the line… same thing.

Now, companies are not obligated to post their quarterly results in an 8-K filing – that’s what the 10-Q is for. And of course all public companies have to compensate their employees, but not all use the language “employee compensation and benefits”, and not all of them call out employee compensation in their financial report as a substantial expense. So we have new questions to answer…

Chapter Six: Building a production trading system to exploit the opportunity

Frankly, we do not have enough information yet to justify an investment or exploitation.  However, since this is a didactic post, I am including this section.

Now that we’ve found an exciting investment opportunity, maybe we could write a direct letter to all of the hedge funds in the USA and pitch our concept. If we find a manager who’s excited by the idea, the next steps include (ask your lawyer; this is neither investment advice nor legal advice): forming a limited partnership, creating an operating agreement, getting someone to run compliance and auditing, building a system to instantly download 8-K filings as soon as they’re published (hint: you can run a HEAD request on the SEC’s index file to check for diffs and therefore updates), and making the trades.
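The HEAD-polling hint can be sketched as a loop that fingerprints the remote index via its standard HTTP headers and re-downloads only when the fingerprint changes (the polling interval is illustrative, and a production system would also handle network errors and the SEC’s rate limits):

```ruby
require "net/http"
require "uri"

# Fingerprint the remote index via a HEAD request (no body download).
def index_fingerprint(url)
  uri = URI(url)
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.path)
  end
  res["Last-Modified"] || res["ETag"]
end

# Poll until the fingerprint changes, then hand off (e.g. re-download
# the index, diff it against the last copy, and fetch any new 8-Ks).
def watch_index(url, interval: 60)
  last = index_fingerprint(url)
  loop do
    sleep interval
    current = index_fingerprint(url)
    next if current == last
    last = current
    yield
  end
end
```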


At the time of writing this post I am neither actively trading nor investing based on n-grams in 8-K filings. However, I hope that it has inspired you to do similar work that exceeds the scope of what I have shared. I will share that I am actively and automatically investing (for clients) based on other filings that the SEC publishes. There is gold in the streets… Wall Street.


Go forth and prosper!

Author: zack

Zack Burt started programming at age 9 after a passion for memorizing Pi led him to wanting to compute it programmatically. His passion for the stock market began at age 10 while participating in a school competition to create a mock portfolio. His love for markets, math and programming led him to University of Chicago, where his studies emphasized psychology and pure (abstract) mathematics. He passed the Series 65 examination in November 2015.
