If you don’t know about the Panama Papers by now, you should. Here’s the scoop: over a year ago, an anonymous source leaked 2.6 terabytes of data to the German newspaper Süddeutsche Zeitung. The data consisted of encrypted internal documents from Mossack Fonseca, a law firm in Panama that sells offshore companies located all over the world.
As a quick aside, I would like to break down what these documents outline and what offshore banking and the buying of shell companies mean. Mossack Fonseca appears to have been selling shell companies to wealthy individuals and companies worldwide to act as tax-free money troves. Essentially, any person or company that holds profits in the bank legally gets taxed on those profits by the respective government. The longer you hold those profits, the more you may have to pay in taxes. But investing that money in areas such as R&D, acquiring smaller companies, or expanding a business are all tax-free ways for a business to use profits. With an offshore shell company, anyone can buy into a fake business in a way that looks like investment. The fake company, brokered by a law firm, can then “pay back” the investment at a later time. Until then, the money can be held offshore with no taxes to be paid. You can also hide how much money you are actually making by holding it offshore before claiming it as income.
According to Wired, the leak turned out to be 11.5 million files of “… emails, contracts, transcriptions and scanned documents. In total, the leak contains: 4.8 million emails, three million database entries, two million PDFs, one million images and 320,000 text documents. The dataset is bigger than any from Wikileaks, or the Edward Snowden disclosures.”
This incredible amount of data does not necessarily detail illegal behavior, but it does shine a light on the potential criminal activity, barely-legal activity, or shady dealings of wealthy people and companies all over the world. As The New York Times explains, it’s “not illegal in many cases to have offshore bank accounts. But they are used in some instances by wealthy individuals and criminals to hide money and business transactions, and to avoid paying taxes.” Among those named in the papers are more than 70 current or former world leaders, in addition to people like friends of Vladimir Putin, relatives of leaders in China, Britain, and Pakistan, the President of Ukraine, and Iceland’s (former) Prime Minister Sigmundur David Gunnlaugsson, who is resigning as of Tuesday, April 5th, 2016. He is just the first casualty of the Panama Papers, and many more may be on the way.
But what I want to talk about is not how the Panama Papers highlight the global scale and severity of the world’s tax-evasion problems. What is really interesting is how the information was pulled out of 11.5 million documents and 2.6 terabytes of data. This has been hailed as the biggest data leak in journalistic history, so just how exactly did Süddeutsche Zeitung find anything useful in a sea of documentation? Big data analytics is the answer. To a journalist pounding the streets and knocking on doors, that is an unthinkably large amount of information, but to Nuix, a data analytics system from Australia, 2.6 terabytes is not that far out of the norm. It took just about two weeks to turn everything (emails, PDFs, Word documents and PowerPoint presentations) into a database that could be queried.
Once you have a database it just takes some deep querying to draw correlations and spot patterns. According to TechWorld: “The system churns through the files at high speed, extracting text as well as the metadata that indicates who created each file, when the file was created and notes subsequent modifications. Sometimes the location is available. The language doesn’t matter to Nuix, which can process characters and words in any language using natural language processing (NLP). Documents in a closed state such as PDFs are identified and fed into an optical character recognition (OCR) system for text extraction, something that accounts for most of the work according to Barron. Critically, Nuix de-duplicates the data – the same file in a different place – which in the case of the Panama Papers quickly removed about a third of the data.”
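To make the de-duplication step concrete, here is a minimal Python sketch of one common way to do it: hash each file's contents and keep only the first copy of each hash. This is an illustration under my own assumptions, not a description of how Nuix works internally; the function name, file paths, and sample contents are all hypothetical.

```python
import hashlib

def deduplicate(files):
    """Keep the first path seen for each distinct file content.

    `files` is an iterable of (path, content_bytes) pairs.
    Returns (unique_paths, duplicates), where duplicates is a list of
    (duplicate_path, original_path) pairs.
    """
    seen = {}        # content hash -> first path with that content
    duplicates = []
    for path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
    return list(seen.values()), duplicates

# Hypothetical sample: the same contract stored in two places, plus a memo.
sample = [
    ("inbox/contract.pdf", b"Agreement between A and B"),
    ("archive/contract_copy.pdf", b"Agreement between A and B"),
    ("inbox/memo.txt", b"Call the bank on Monday"),
]
unique_paths, duplicate_pairs = deduplicate(sample)
```

Hashing catches only byte-for-byte copies. Two scans of the same page would hash differently, so a real system would likely need fuzzier matching on top of this.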
So instead of thumbing through 11.5 million sheets of paper looking for a dignitary’s name, with big data analytics you can start to cross-reference and use simple search terms. In addition to indexing the data and sorting it by fields such as credit card numbers or company names, the software itself can seek out repeating terms, like a name that occurs many times in multiple places, and flag each new occurrence as one it has already seen elsewhere. This can help human analysts and journalists quickly shake out the chaff and leave the juicy information sitting right where they can see it. “The data from the Panama Papers shows that Mossack Fonseca worked with more than 14,000 banks, law firms, company incorporators and other middlemen to set up companies, foundations and trusts for customers,” the ICIJ says.
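The indexing and cross-referencing idea can be sketched in a few lines of Python with an inverted index, a map from each word to the documents that contain it. Again, this is a toy illustration and not the software the journalists used: the document ids, texts, and function names are made up, and the tokenizer handles only plain ASCII words, which is far cruder than the any-language processing described above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase ASCII word tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def recurring_terms(index, min_docs=2):
    """Flag tokens that show up in at least `min_docs` different documents."""
    return {t: sorted(ids) for t, ids in index.items() if len(ids) >= min_docs}

# Hypothetical documents standing in for extracted text.
documents = {
    "email_001": "Please set up Acme Holdings for Mr Smith",
    "contract_017": "Acme Holdings Ltd director: J Smith",
    "memo_042": "Lunch menu on Friday",
}
index = build_index(documents)
hits = search(index, "acme holdings")
flagged = recurring_terms(index)
```

Here `hits` contains the email and the contract, and `flagged` picks out the company name and the surname as terms worth a closer look because they recur across separate files.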
Long term, big data analytics isn’t just useful for companies that want to dissect customers’ habits to sell more to them. It also makes impenetrable amounts of information penetrable, and in cases like the Panama Papers, it can unlock hidden patterns and surface correlations that are dangerous for secrets. Eventually, there won’t be any place to hide.