In an earlier post, I detailed a pattern for detecting suspicious connections with Python and the machine learning toolkit SKLearn. I used that post both to demonstrate the RandomForest classifier and to show how to construct a custom deployment pipeline for machine-learning-based modules. What's more powerful than the classifier itself is the ability to create pull requests, push new code, and have it tested and installable in minutes, not hours, and certainly not days or weeks.
In this post, we're going to explore the same problem as a neural network problem, specifically using Keras and TensorFlow. If you're not familiar with these frameworks, take a look at some existing examples where we use these deep learning techniques to classify phishing URLs. I'm not arguing which framework is better; a lot of that depends on the amount and type of data you have and on your use case. There are many ways to attack this problem, and here I'm trying to un-magic a few of them, so you and your team can have intelligent conversations about machine learning as it applies to the problem you're trying to solve.
Scale
The biggest difference between our previous example and our neural network example is this: the RandomForest example will be a bit smaller and more efficient (less memory, disk, and CPU) but more rigid. The neural network models will be larger, but more resilient to changes in future data patterns. This means that while our original example will perform better at scale, it does so at the expense of resilience in the long run. The power of neural networks is in their ability to keep reinforcing their learning and, to a degree, "future proof" themselves.
This should make sense: the more specific something is, the faster it performs, but the more edge cases it misses. The more complex something is, the slower it tends to perform, but at scale it can make up for this with its ability to 'think' through nuance. How you scale something like this depends a lot on who's paying you to do what and where your greatest returns are. I'm here to help you decide which fits best, how to deploy it, and, more importantly, how to iterate on it.
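To make that tradeoff concrete, here's a minimal sketch of the two approaches side by side. The feature matrix and labels below are random placeholders, and the layer sizes are illustrative guesses, not the exact models from the repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from tensorflow import keras

# Random placeholders standing in for the real feature matrix and labels.
X = np.random.rand(1000, 8).astype("float32")   # 8 numeric features (made up)
y = np.random.randint(0, 2, 1000)               # 0 = benign, 1 = suspicious

# Small, fast, rigid: a forest of decision trees.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X, y)

# Larger and slower to train, but easier to keep retraining as data shifts.
nn = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
nn.fit(X, y, epochs=5, batch_size=128, verbose=0)
```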
Trade Offs
One of the biggest tradeoffs you'll notice in the code is that when I extract features in the TensorFlow example, I had to remove the "ASN" feature. If you think through the features of "an IP based connection", you'd think ASN makes a really big difference, and it probably does. The issue is that there are a lot of them. This means the model has to consume more memory, CPU, and disk to accommodate them in a way the RandomForest model did not.
In my initial tests, this blew the model up from roughly 4MB to roughly 1GB. That's a massive difference for a feature that may or may not matter. Weighing performance against accuracy, the tradeoff suggested that in this initial version the ASN "feature" be dropped… for now. If users had to download a 1GB model just to test this black magic, it raises their barrier to entry and creates more friction, and for what? A 5% boost in accuracy? More tests are needed to see if it really makes that much of a difference.
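To see why a high-cardinality field like ASN inflates things, consider what happens when every distinct value becomes its own input column (my assumption about the mechanism here). This is a rough, hypothetical sketch using pandas one-hot encoding; the values and column names are made up.

```python
import pandas as pd

# A handful of made-up rows; real traffic spans tens of thousands of distinct ASNs.
df = pd.DataFrame({
    "asn":  ["AS13335", "AS15169", "AS16509", "AS13335"],
    "city": ["Ashburn", "Mountain View", "Dublin", "Ashburn"],
})

# One-hot encoding gives every distinct value its own input column, so the
# first layer's weight matrix (and the saved model) grows with ASN count.
encoded = pd.get_dummies(df, columns=["asn", "city"])
print(encoded.shape)   # column count scales with the number of distinct values
```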
For now though, it didn't seem worth it. However, as we iterate forward and users learn to build their own custom models (that's the point anyway), those are things you can easily tweak locally. 1GB is nothing these days; it merely creates a bit of friction when you're learning something new. These examples are just that: examples to build from. Your features (and training data) should be a bit different from what you see here; that's what will make your team special and give you an edge.
How to Train your Dragon
Coming up with "suspicious" training data was relatively easy. Take ours, take others', or stand up your own honeypot and generate your own. It almost doesn't matter; it's somewhat plentiful these days. Generating "whitelist" training data takes a bit more work. Ideally, "good" traffic is defined by your own local network, with a mix of your external business partners' and customers' traffic. That's all stuff I don't have access to at the moment, so we'll use the next best thing: Cisco's Umbrella list.
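If you want to play along, a crude sketch of turning that list into "probably benign" rows might look like the following. It assumes you've downloaded the Umbrella top-1m.csv locally; the field names and label are illustrative, not the exact schema used in the repo.

```python
import csv
import socket

# Assumes top-1m.csv (rank,domain per row) has already been downloaded locally.
benign_rows = []
with open("top-1m.csv") as fh:
    for rank, domain in csv.reader(fh):
        if int(rank) > 1000:          # a small slice is plenty for an example
            break
        try:
            ip = socket.gethostbyname(domain)
        except OSError:
            continue
        # Downstream we'd enrich this with GeoIP data (city, country, time)
        # before handing it to the feature extractor.
        benign_rows.append({"indicator": ip, "label": "whitelist"})
```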
With this, we're simply trying to build a feature set that describes to our model where a typical non-malicious connection might come from (as well as when). In this VERY CRUDE example, connections from cities that tend to host very popular websites PROBABLY AREN'T malicious. This is important, because PROBABLY doesn't mean 100%; it means that if you're trying to teach a new hire what traffic to prioritize, you're PROBABLY filtering out traffic to/from Netflix.
That also means the networks Netflix tends to operate out of. We're assuming that, at scale, Netflix has done a decent job of vetting the providers they do business with, at least better than we could. In our endless list of priorities, if we only have a finite number of resource cycles, we'll push those "types of connections" to the bottom of the pile. What we're more interested in are connections that exhibit almost the opposite features of a Netflix connection.
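For the "where and when" features themselves, here's a hedged sketch using a local MaxMind GeoLite2 database. That's one of several ways to geolocate an IP, and not necessarily what the repo does; the function and field names are illustrative.

```python
from datetime import datetime, timezone
import geoip2.database

# Assumes a local MaxMind GeoLite2-City database sits next to this script.
reader = geoip2.database.Reader("GeoLite2-City.mmdb")

def extract_features(ip, seen_at=None):
    seen_at = seen_at or datetime.now(timezone.utc)
    resp = reader.city(ip)
    return {
        "city": resp.city.name or "",
        "country": resp.country.iso_code or "",
        "latitude": resp.location.latitude,
        "longitude": resp.location.longitude,
        "hour_of_day": seen_at.hour,   # "when" matters as well as "where"
    }
```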
From there, we try not to overfit the data. This means we want the algorithm to be a little more fuzzy than specific. We want it to cast a wider net and absorb some of those outliers, things that seem odd but are probably benign. Why? False negatives are harder to account for than false positives. False negatives are invisible, so we won't know what we're missing. Whereas with false positives, we can easily whitelist the obvious ones (e.g. address space that belongs to Netflix), leaving us with the extreme odd-balls that are probably worth investigating.
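One way (though certainly not the only way) to keep a Keras model on the fuzzy side is to lean on dropout and early stopping, so it generalizes rather than memorizing the training set. The layer sizes and patience values below are illustrative guesses:

```python
from tensorflow import keras

n_features = 32   # placeholder; match whatever your feature extractor emits

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),         # randomly drop units during training
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop training once validation loss stops improving instead of grinding on.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.2, epochs=50,
#           callbacks=[early_stop])
```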
Maybe a bad actor is using a shady provider in an otherwise 'ethical' city in the US. The connection has all the features of a 'non-malicious' connection except for one; a fuzzier set of parameters would pick that up and flag it, while a more rigid set of parameters might ignore it. Ideally you have multiple sets of networks (mechanical and/or biological) feeding these results into each other to help flesh that out.
The key to all of this is embracing the beauty of imperfection. Once you accept a certain level of marginal error, you begin to understand the concept of scale. All those false negatives you were abstracting away from your view in the name of perfection start slowly bubbling to the top. Lather, rinse, repeat.