Hunting for suspicious domains with Python and SKLearn

The trouble you can get into after a Bender…

 
 

I was attending a security conference about a year ago and I stumbled into a training session on machine learning. What struck me about this session was that it was targeted at normal people, not programmers. The gist: showing how easy it is to use Azure|AWS|Python|Excel to apply machine learning (read: statistical analysis) against datasets to weed out obvious statistical anomalies. No real programming knowledge or Nvidia GPUs required, just a spreadsheet and an Azure|AWS account. A few more clicks and you could transform your simple "algo" into a REST HTTP endpoint. With a credit card and a few dollars a month, you could integrate it with your existing tool-set.

To most nerds, this may not seem like anything special, especially if you've been using cloud services for any length of time. But seeing how far these cloud services have come in integrating their "ML" tools into simpler things, such as spreadsheets, was something different. For the first time it makes these pseudo 'black boxes' accessible to more and more people. People who have great ideas, but lack the resources to generate statistical context around those ideas. To me, it makes the world a little more… transparent… statistically speaking.

I am a trader and I make 3000+ trades a year. Not a buy-'em-and-sell-'em, up-or-down day trader, but an options trader. I live and die by the bell curve and the central limit theorem. This means I inherently base 84% of my decisions in life on probabilities and number of occurrences; the other ~16% are based on decisions involving beer. Of those, 3% of the choices… turn out to be 50/50, 'cause, well, it's beer. Over time, with the right number of occurrences, those probabilities always play out. Makes no difference if I'm making money or losing money: year after year, well over 85% of my trades end up earning a penny or more.

When you're constantly putting your resources at risk, you're always looking at the probabilities against time. How many of these trades do I need to put on to have an 80, 90 or 99% success rate? How do I limit the number of failing trades? How do I protect myself from statistical fallout (eg: a market that goes straight up or straight down)? How do I make lots of little trades so the probabilities play out, but none of them have the potential of blowing out my account? If every trade starts out at about the same odds as a coin flip (50/50, up or down), how do I get to 60/40, or 70/30?

You learn very quickly that nothing's perfect. No algo or black box is going to catch or predict everything with 100% accuracy. Even if your success rate is well over 84%, there are still a few outliers that always have the potential of blowing out your account (eg: you can wear your seat belt every day of every year, but you can't predict a drunk driver meeting you head on). By trading, you learn to accept these risks, but not play for them. You learn to play for the predictability of things you can grind out day after day, managing what you are able to and letting the probabilities normalize the rest. You may (probably will) miss stuff, but you'll gain enough of an edge (by predictably finding more stuff) that the outliers have less chance to take you down. If someone is selling you something different: run, don't walk. If it were true, they'd be printing money and NOT TELLING YOU. I'd take a higher number of predictable outcomes over a lower, unpredictable number of "hit it out of the park every time" wins any day of the week.

I automate, therefore I am.


Hunting for suspicious data is really a game of statistics. In order to free up your human resources you need to look at the problem as "weeding out the obvious". Instead of asking "how do I find the needle in a haystack?", ask "how do I predictably get rid of all the hay?". The answer to this question is simple: light the haystack on fire. The probability exists that the needle(s) will withstand the flames. You will still need humans to find stuff, but there will be less to go through and you'll have a better chance at finding it (not that you WILL, but your odds OF finding it improve).

AI is never going to replace our jobs, it's just going to invent new ones. This game is and always will be human-to-human driven, and we'll need more humans to help teach the algos what the other humans are doing to attack the humans we're trying to protect. However, we need things like statistical probability both to help us automate "the data patterns that represent human activity we already know" and to help TEACH the new humans how to detect the other humans that are trying to attack the humans we're trying to protect (haha, I couldn't help myself…).

AI isn't about creating the newest and greatest black box, it's about disseminating lessons learned (or "patterns", as we call them in programmer speak) to new humans. That way the new humans can better protect us when we're old and grey from the other new humans that are trying to steal our bank accounts, because we're old and grey. In real life we've historically shared these patterns through open source, video, books, etc. But what about statistical patterns?

It USED to be hard.

Traditionally, it seems you had to read a LOT of scientific papers (and hold the according PhD to digest them), coupled with lots and lots of visuals, to wrap your head around this stuff. Then you were stuck trying to re-implement what they'd discovered on your own. Up until a few years ago the libraries either didn't exist, were hard to use or required lots of CPU to crunch the math. If you were lucky, those science nerds built a prototype model and feature extractors, but it was usually cobbled together in a way that made it non-reusable. By the time you read their paper (10 years later), the attack vectors had changed a bit, and because their model was fragile, it needed to be re-written anyway. With recent advancements in things like SKLearn, AWS and Azure, those learning curves are finally being crushed.

To me, AI is quickly becoming a way of teaching humans about the formalized statistical patterns we find in the world. Not to replace their work, but to help them weed out the things the rest of us have already thought about, found and accounted for in the math. What's really missing is the normalization of these patterns into reusable, domain-specific libraries that others can learn from and build on. These days there's such a land-grab for mysterious black boxes that we assume there's something magical about them, AND because they're not very transparent, we assign a disproportionate amount of value to them. When it's probably, at best, a bunch of twenty-somethings who've got their PhD in something nerdy, but haven't spent a ton of time hunting for bad guys. Operators have a hard time with statistics and math geeks have a hard time tracking human behavior. To succeed, the two need to meet somewhere in the middle (HI!) or else you're destined to stay behind the curve 84% of the time.

Black Boxes


So we're left with "Oh, you're not willing to share your algo? It MUST be AMAZING!" When you really look under the hood, it's a probability box, little better than a coin flip. If you treated every suspicious domain as a coin flip, in a normally distributed sample, over time you'd have a 50/50 chance at being right. Filter out the top 1000 domains from Alexa and you're probably at 70/30. Weed out domains that have more than 3 dots in them: 75/25. 3 or more hyphens might get you to 80/20, and if the domain is longer than 15 characters, it's probably not worth your time. Wrap that up with a whitelist that double-checks what you have left (eg: url shorteners), and WHAT'S LEFT is probably 99% likely to be suspicious… and here's what's important: for damn near ZERO dollars.
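In code, that whole "algo" fits in a handful of lines. Here's a minimal Python sketch of the filter chain, assuming you've already loaded the Alexa top 1000 and a shortener whitelist from somewhere; the function and variable names are mine, the thresholds come straight from the paragraph above:

    # A rough sketch of the filter chain described above. The helper
    # names and input format are illustrative assumptions; the
    # thresholds (top 1000, 3 dots, 3 hyphens, 15 chars) are from
    # the post itself.
    def weed_out(domains, alexa_top_1000, shorteners):
        """Yield only the domains still worth a human's time."""
        for d in (d.lower().rstrip('.') for d in domains):
            if d in alexa_top_1000:    # known good: ~70/30 after this
                continue
            if d.count('.') > 3:       # more than 3 dots: ~75/25
                continue
            if d.count('-') >= 3:      # 3 or more hyphens: ~80/20
                continue
            if len(d) > 15:            # probably not worth your time
                continue
            if d in shorteners:        # whitelist double-check (eg: url shorteners)
                continue
            yield d                    # what's left: treat as suspicious

    # hypothetical usage:
    # suspects = list(weed_out(feed, alexa_top_1000, {'bit.ly', 't.co'}))

No GPUs, no PhD, no monthly bill.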

Of course, with this model you're probably missing stuff. The last mile of a problem is always the costliest. However, for about ZERO dollars you have a high probability of weeding out the obvious things that aren't worth your time. Now you can spend your time and money on the parts of that last mile that are important to your business, the parts that cost (or save) you the most dollars, and ignore all the hay.

These models are simple and, more importantly, open. They're not going to get you to the 99% mark, but they're a quick and easy way to start learning and weed out the noise. By being transparent, they carry the added value of statistical lessons learned from real operators all around the world. Lessons that maybe you have yet to learn.
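And when you outgrow the hand-tuned thresholds, those same cheap signals drop straight into SKLearn. A minimal sketch, assuming you've collected a labeled set of benign and suspicious domains (the two-row training set below is purely a placeholder, as are the example domains):

    # A minimal SKLearn take on the same idea. The training data here
    # is a placeholder; in practice you'd feed it thousands of labeled
    # domains from your own feeds.
    from sklearn.ensemble import RandomForestClassifier

    def features(domain):
        # the same cheap signals as the hand-rolled filter above
        return [
            len(domain),                                     # length
            domain.count('.'),                               # dots
            domain.count('-'),                               # hyphens
            sum(c.isdigit() for c in domain) / len(domain),  # digit ratio
        ]

    # hypothetical labels: 0 = benign, 1 = suspicious
    train = ['google.com', 'paypal-secure-login-update.example']
    labels = [0, 1]

    clf = RandomForestClassifier(n_estimators=100, random_state=1)
    clf.fit([features(d) for d in train], labels)

    # anything scoring over your threshold goes to a human
    print(clf.predict_proba([features('my-bank-login-verify.example')]))

The point isn't the classifier, it's that the features and the thresholds are sitting out in the open where the next operator can read them, argue with them and improve them.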


 

Did you learn something new?