Wednesday, October 2, 2013

A Non-Nerd Guide to the NSA Surveillance Program (or How I Learned To Stop Worrying and Love Data Mining)

This summer the world was rocked when Edward Snowden dropped the bombshell that the US Gov't had, and was using, the ability to intercept and eavesdrop on electronic communication that flows through American telecom companies. It sounded like something out of George Orwell's 1984, and the idea of an all-seeing, all-knowing Gov't entity that could literally spy on American lives left many people creeped out. But on the flip side there was also acknowledgement that terrorists love to use e-mail, internet chat, and cell phones to plan attacks, and that this program, with proper oversight, could be a useful tool in preventing another 9/11-type attack.


So with these conflicting views, how should we feel about this program? Well, the NSA surveillance program happens to coincide with one of my specialties, data analysis, so let me walk you through it so you can at least feel confident making up your own mind. If nothing else, you'll be armed with knowledge to enhance your party conversation skills or argue with your curmudgeonly uncle over Thanksgiving dinner.

Okay hotshot, what makes you an expert? Technically I'm not a statistician (no Ph.D.) but I play one on TV and for my employer (hint: it's a health insurance company with a big blue cross and a big blue shield). But before you can use statistics you have to understand data, which is the key to understanding what the NSA is doing.

So what the hell is the NSA doing? During the Iraq War in 2005, when the US military was being inundated with insurgent attacks, NSA director Keith B. Alexander started the Real Time Regional Gateway electronic surveillance program, which collected and analyzed the electronic communication of Iraqi insurgents to eavesdrop on their activities. To do this they collected and stored ALL electronic communication from our telecoms and social network providers, since much of Iraq's communication at the time had to flow through American and European owned channels. Since nothing excites intelligence agencies more than listening in on people, the program eventually grew into something called PRISM, a clandestine mass electronic surveillance and data mining program known to have been operated by the NSA since 2007.

Now the program has two parts. The first is the ability to actually eavesdrop and listen to phone conversations or social media chats, which the NSA can do to American citizens with known foreign contacts already under suspicion for up to a week without a warrant. After that they need a warrant from something called a FISA court (named for the Foreign Intelligence Surveillance Act). So technically the Gov't needs permission, but what's up for debate is how faithfully these FISA courts and intelligence agencies adhere to the actual letter of the law. However, since nearly a billion new electronic records are created every day, there is no way for any individual agency to sift through each and every new record. So they need the help of technology.

So the second part is PRISM, which uses data mining to cull through several trillion records collectively and a statistical technique called predictive modeling to find the proverbial needle in the haystack: a potential terrorist. This part is the key to understanding how the NSA targets or selects people for electronic surveillance, and it's what I'm going to help explain. In all likelihood the average American is not being eavesdropped on because, let's face it, most of us live boring lives. So that occasional porn, Facebook stalking, or online googling of Selena Gomez that makes men feel sorta dirty afterward is in all likelihood not going to have the NSA checking in on us. But we should at least be diligent and demand from our Gov't that its use isn't falling outside legal boundaries.

Did you say trillions of records... with a T? Yes. We are now in the era of what's called 'Big Data'. Every time you use your cell phone, browse the internet, or send e-mail, you leave a data point. Over time a heavy internet or phone user will create thousands and thousands of data points. All of which your friendly, reputable telecom provider, social media site, etc. accumulates and, thanks to the Protect America Act of 2007, can provide to the NSA without a warrant. Well, technically they need a warrant through a top secret court called FISA, and thankfully the Bush administration's faithful adherence to the Constitution... well, who are we kidding, let's just assume they can touch everything.

Where did they get this idea? Predictive modeling has been around for at least 20 years and is used for analyzing all sorts of human behavior. In my line of work it's used as an epidemiological tool to find segments of our insurance members who are at risk of getting ill and running up huge medical bills. The goal of the model is for the company to intervene with high-risk members and try to get them healthier before they end up in the hospital. It's most frequently used by corporations in marketing to predict which people would be most likely to respond to an advertising or sales campaign. Whenever you use a credit card, use a shopper's reward card, like something on Facebook, or tweet about it, you give marketers valuable information to analyze your behavior and preferences and predict your consumer decisions. For example, Target found that women who bought 13 household items in combination were likely to be 1 to 3 months pregnant, and thus the best candidates to send baby catalogs to and steer away from Babies R Us or other maternity stores.

So, borrowing this idea, the NSA created PRISM to mine through the electronic records, find people who may fit the profile of a threat, and pass them along to electronic surveillance for a little closer look. Certain red flags, like the Arab gentleman discussing purchasing a one-way airline ticket, or the Neo-Nazi who looks to be increasingly going off the rails, or an MTV reality star announcing on Twitter they're releasing a music album, will probably land in the NSA eavesdroppers' inbox.

So how does it work? Any predictive model is built on data with the key assumption that the past will help predict the future. You develop a model by performing data mining to find two things: the significant predictor variables of an event, which are X, and the probability of the event occurring, which is Y. You build your model from past values of X on the assumption that they can predict Y in the future.
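To make the X and Y idea concrete, here is a toy sketch in Python. It is not how the NSA or anyone else builds a real model; the records, the predictor names, and the helper functions `fit_rates` and `predict` are all made up for illustration. It simply learns from past records how often Y happened for each value of X, then uses that rate as the predicted probability for a new record.

```python
from collections import defaultdict

def fit_rates(history):
    """Learn P(Y=1 | X) from past (x, y) records as a simple observed rate."""
    counts = defaultdict(lambda: [0, 0])  # x -> [times Y happened, times x was seen]
    for x, y in history:
        counts[x][0] += y
        counts[x][1] += 1
    return {x: hits / total for x, (hits, total) in counts.items()}

def predict(model, x, default=0.0):
    """Estimated probability of Y for a new record with predictor value x."""
    return model.get(x, default)

# Hypothetical past data: (predictor X, outcome Y), where Y is 1 if the event happened.
history = [
    ("bought_item", 1), ("bought_item", 1), ("bought_item", 0), ("bought_item", 1),
    ("did_not_buy", 0), ("did_not_buy", 0), ("did_not_buy", 1), ("did_not_buy", 0),
]

model = fit_rates(history)
print(predict(model, "bought_item"))  # 0.75
print(predict(model, "did_not_buy"))  # 0.25
```

Real models juggle many X variables at once and use fancier math, but the bet is the same: whatever X meant for Y in the past, it will mean again tomorrow.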

So how do we know it works? The performance of any predictive model can be measured and evaluated. To be considered a legit, working predictive model, it has to meet, at a bare minimum, the criteria listed below. With that in mind, here are FIVE things the NSA must demonstrate to justify snooping on our internet data. Ideally, if our Congresspeople were actually smart and diligent in their role of oversight (Michele Bachmann is retiring so there is hope), they would demand the NSA prove it meets the following criteria before allowing this program to proceed:

1. Accuracy Rate = Simply put, it's the number of people correctly predicted Y divided by the total number we originally predicted Y. Marketers sometimes only need an accuracy rate of 5% to be successful. Most direct mail marketing only needs to be accurate 2% of the time. Telemarketing only 5 to 10%. Say Toyota buys 30-second chunks of TV advertising on History Channel to reach the 18-34 male demo who would most likely buy a pickup truck. Do you need all 3 million people watching Pawn Stars to buy a truck? Nope, just a small fraction. If Toyota only needs 20,000 new truck owners to get a positive return on investment from advertising to 3 million viewers, then an accuracy rate of about 0.67% is worth it.


But when you're talking about predicting criminal behavior and placing someone under suspicion, that accuracy rate should be much higher. So, NSA, what is your accuracy rate?
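If you want to check the arithmetic yourself, here is the accuracy rate definition from above applied to the Toyota numbers (20,000 buyers out of 3 million viewers). The function name `accuracy_rate` is just my label for it, not an official term from any library.

```python
def accuracy_rate(correct_predictions, total_predictions):
    """People correctly predicted Y divided by everyone we predicted Y."""
    if total_predictions == 0:
        raise ValueError("no predictions were made")
    return correct_predictions / total_predictions

viewers_reached = 3_000_000   # everyone the ad treated as a likely truck buyer
new_truck_owners = 20_000     # the ones who actually bought

rate = accuracy_rate(new_truck_owners, viewers_reached)
print(f"{rate:.2%}")  # 0.67%
```

Less than one percent is a great day for a marketer. For something that puts people under suspicion, it would be a terrible one.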

2. Validity = This answers how well my predictive model actually measures the behavior I'm trying to predict. Or in other words, how well does my model catch terrorist activity? One way to check is to rule out spurious correlation. For instance, did you know an increase in ice cream sales strongly correlates with an increase in shark attacks? So should we interpret this as a need to limit sales of ice cream? Nope, because what they both have in common is that they occur in summertime. And unless you factored that into your model, it would be worthless. So how does the NSA know PRISM is valid?
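Here is a small sketch of the ice cream and shark attack trap. The monthly figures are invented so the effect is easy to see, and `pearson` is a plain hand-rolled correlation function rather than anything from a statistics package. Across the whole year the two series look tightly linked, but once you compare summer months only with summer months (and winter with winter), the link disappears.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures: (season, ice cream sales, shark attacks)
months = [
    ("winter", 10, 1), ("winter", 12, 1), ("winter", 10, 2), ("winter", 12, 2),
    ("summer", 50, 8), ("summer", 52, 8), ("summer", 50, 9), ("summer", 52, 9),
]

def correlation_for(rows):
    """Correlation between ice cream sales and shark attacks for the given rows."""
    return pearson([ice for _, ice, _ in rows], [sharks for _, _, sharks in rows])

overall = correlation_for(months)
summer_only = correlation_for([m for m in months if m[0] == "summer"])
winter_only = correlation_for([m for m in months if m[0] == "winter"])

print(f"all months:  {overall:.2f}")      # close to 1
print(f"summer only: {summer_only:.2f}")  # no relationship left
print(f"winter only: {winter_only:.2f}")  # no relationship left
```

The season is doing all the work. A model that never looks at the season would happily conclude that ice cream summons sharks.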

3. Reliability = This answers whether my model provides the same results and accuracy when measured over different periods of time and with different data. We do this to rule out something called overfitting. This is when the model becomes biased toward a small set of predictors, so a set of X variables works well for one dataset but can't be replicated on any others. This often happens when trying to measure something very rare, such as someone being a terrorist or a drug dealer.

So for instance, if you wanted to predict the Boston Marathon bombings happening again, significant predictors would include being a young Chechen male who buys a backpack and a pressure cooker right before a marathon. But those characteristics were unique to the Boston bombings and not much use, because in the real world terrorists can come from anywhere with a variety of ideas for attacks. Invariably the results of the model would be flawed because you would snag a lot of Chechen males who bought backpacks and pressure cookers... because they were probably going camping.
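Overfitting is easier to see than to describe, so here is a deliberately dumb sketch. The "model" just memorizes the exact shopping profile of past positives and flags anyone who matches it. Every record, the purchase labels, and the helpers `fit_profile` and `evaluate` are hypothetical and exist only for this illustration. It looks perfect on the data it was built from and falls apart on new data.

```python
def fit_profile(records):
    """'Train' by memorizing the exact profiles of past positives (an overfit model)."""
    return {profile for profile, is_threat in records if is_threat}

def evaluate(flag_profiles, records):
    """Return (people flagged, flagged correctly, accuracy rate) for a dataset."""
    flagged = [is_threat for profile, is_threat in records if profile in flag_profiles]
    correct = sum(flagged)
    rate = correct / len(flagged) if flagged else 0.0
    return len(flagged), correct, rate

# Hypothetical past data: (purchases, was this person actually a threat?)
past = [
    (("backpack", "pressure_cooker"), True),
    (("backpack",), False),
    (("pressure_cooker",), False),
    (("tent",), False),
]

# Hypothetical new data: three campers match the old profile, the real threat does not.
new = [
    (("backpack", "pressure_cooker"), False),
    (("backpack", "pressure_cooker"), False),
    (("backpack", "pressure_cooker"), False),
    (("fertilizer", "van_rental"), True),
    (("backpack",), False),
]

model = fit_profile(past)
print(evaluate(model, past))  # (1, 1, 1.0)
print(evaluate(model, new))   # (3, 0, 0.0)
```

Same model, different data, completely different accuracy. That gap is what a reliability check is supposed to catch before the model is turned loose on real people.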

4. False Positives = How do you handle people who are false positives, in other words people predicted to be positive for Y who really weren't? Every day you are a false positive and don't even realize it. Every time you throw away a piece of junk mail, hang up on a telemarketer, or skip an internet ad, you become a false positive. Some predictive algorithm at some marketing firm predicted you were the right person to advertise to, and you declined or ignored it. In normal cases there are no negative consequences for the average individual because the cost is borne by the company doing the advertising. So when Target sends you a baby catalog but you are not pregnant nor planning on it, you simply throw it in the trash. The only loss incurred is to Target for however much postage it spent to send the catalogs.

But what if someone is predicted to be a terrorist but they're really not? What happens to them and what protections are they afforded? Does innocent until proven guilty still apply? A long time ago the FBI kept a secret list of people who checked out flagged books at the library. Things like how to construct a bomb or Adolf Hitler's Mein Kampf would get you on the list. Legend has it that back in the 1980s a man named Tom Clancy was visited by FBI agents after data showed he had checked out a large number of books revolving around nuclear submarine technology and naval submarine warfare.

The result of the investigation showed Clancy was not a saboteur for the Soviets but instead was a writer collecting information for a submarine thriller he was writing. That story would be titled 'The Hunt for Red October' and make Clancy a bestselling author. The main problem is that people search the internet for a variety of things, sometimes malicious but a lot of times just for knowledge or information. How can the NSA tell whether a college student named Muhammed is researching biological warfare because he's a terrorist or because he's writing a research paper for class? How can the NSA tell whether a post on Facebook saying 'I just wanna blow up the World' is a signal to launch an attack or just a bad day at work?
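To get a feel for why false positives matter so much when the thing you're hunting is rare, here is a back-of-the-envelope sketch. Every number in it is an assumption I picked for illustration (the population, the count of real threats, and the model's hit and false alarm rates); none of them come from the NSA or anyone else.

```python
# All figures below are hypothetical, chosen only to show the arithmetic.
population = 300_000_000   # people whose records get scored
real_threats = 3_000       # actual bad guys hiding in that population
hit_rate_pct = 99          # percent of real threats the model flags
false_alarm_pct = 1        # percent of innocent people the model flags anyway

innocents = population - real_threats
true_positives = real_threats * hit_rate_pct // 100
false_positives = innocents * false_alarm_pct // 100

flagged = true_positives + false_positives
share_real = true_positives / flagged

print(f"flagged: {flagged:,}")                       # flagged: 3,002,940
print(f"actually threats: {true_positives:,}")       # actually threats: 2,970
print(f"share of flags that are real: {share_real:.2%}")  # 0.10%
```

Under these made-up assumptions, a model that sounds 99% accurate still hands you a list where roughly 999 out of every 1,000 names are Tom Clancys.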

5. Peer Review = Having another set of eyes on your model and getting feedback for improvement. If someone were to ask my opinion (that would be an ego boost) I would note a very small, miniature, teensy, but possibly very humongous, large flaw in the NSA's PRISM technology. Predictive modeling assumes the human behavior being studied is normal behavior that will often be repeated in the future. Which is why it works for everything from marketing, to customer service, to online dating, etc. But the problem is that terrorism, or any criminality for that matter, isn't normal human behavior.

Instead it can be thought of more in terms of a virus or bacteria that constantly mutates and changes in response to the body's immune system or antibiotics. Similarly, terrorists will change and adapt their tactics to evade notice, meaning yesterday's terrorist attack most likely won't be repeated tomorrow because law enforcement is now looking for it. Thus predictive modeling as the NSA uses it now may not be effective. So instead the NSA may well be better served by borrowing models from biostatistics, which predict how and possibly when mutations in viruses and bacteria will occur. This would better answer what a terrorist's next step would be, assuming they won't repeat the past.

Now for a few more questions
Is This Legal? Funny you should ask, because a lot of people are asking that as well. The answer will most likely come from the Supreme Court in a few years, when it decides whether the program violates the US Constitution's Fourth Amendment, which prohibits any unreasonable search or seizure without probable cause and a warrant. Meanwhile some in Congress are not waiting for the issue to make its way through the courts, as Sen. Rand Paul (R-Kentucky; the slightly less paranoid, bat shit crazy version of his father Ron) has proposed a bill to limit the NSA's surveillance program.

Edward Snowden had a hot, stripper girlfriend. Does that mean I should be a data analyst to land a stripper? Snowden is what statisticians call an anomaly or outlier, an observation so far outside the rest of the distribution that it skews the results. Because if you were to ask 100 male data analysts if their girlfriend was a stripper, probably 100 would reply 'Girls? You mean real ones? Like actually talk to them?' Past research shows strippers are attracted to large, shiny objects and large amounts of money.


Edward Snowden was going to get asylum in Ecuador. How would you rate his choice? Ecuador has it all for the Int'l fugitive on the run. Jungles, mountains, beaches, Inca ruins, the Galapagos islands. I give him props for good taste in travel.