The first and second posts in this series demonstrated how to think through and leverage simple, feature-driven prediction patterns. These extremely simple, yet powerful patterns come with one caveat: they require us to extract and document specific features of a URL, domain, or netflow connection in order to predict the result. This works well for simple things, but it also forces us to adapt and change those features as our attackers adapt. They see the features we're looking for, they adapt, we lose insight, we adapt... the arms race goes on.
We can hide our magic algo from everyone, or make the algo itself adaptive, so that it learns and adapts alongside our attackers. Remember, we're not targeting 100% perfection here, just enough edge to help us scale past our attackers. In this context, anything that gets us from a 50/50 coin toss to somewhere above 85% probability is generally a win. If we can do that by modeling a neural network around something as simple as openphish data, using an Amazon Deep Learning AMI, in ~30 minutes or less, even better.
Trade-Offs
The trick here is understanding the trade-offs of using a highly specific model (like we previously did with SKLearn) vs a more generic deep learning model that feels a bit more like reasoning when it makes its predictions. As the cost of predictions has decreased, it becomes cheaper for us to use neural network based models to make up the difference. The caveat being, the more generic, reasoning-based models will catch some of the new unknowns, at the expense of missing some of the obvious knowns. For instance, it may not catch something like g0ogle.com/phish.html, but it may catch google.com/wp/admin/phish.html.
The SKLearn models avoid this because we're manually seeding them with specific features. The deep learning models, however, generate what they observe as the features, based on your sample data-set. The deep learning models will most likely be correct MORE of the time, but they may miss some of the obvious stuff along the way. Some right-brained engineers may see this as a bad thing, but with a few filters on the input and output (eg: using whitelists and blacklists as extra filters to infer when the model is mis-classifying something), you go from ~85% to probably 95%. If trading has taught me anything, it's that you can't plan for much outside two standard deviations of risk. Also, I'm not sure any of our analysts can process a million urls a second with an 85-95% success rate; the ROI here is mind-blowing.
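For what it's worth, the filter idea is simple enough to sketch in a few lines. This is a hypothetical wrapper, not actual production code: the model object, the domain lists, and the threshold are all placeholders, but it shows how a whitelist/blacklist can override the neural network when a cheaper signal already knows the answer.

```python
# Hypothetical sketch: wrap the model's score with whitelist/blacklist checks.
from urllib.parse import urlparse

WHITELIST = {"google.com", "microsoft.com"}   # domains we always trust (example values)
BLACKLIST = {"g0ogle.com"}                    # domains we never trust (example values)

def classify(url, model, threshold=0.5):
    domain = urlparse(url).netloc.lower()
    if domain in WHITELIST:
        return "benign"           # the model never gets a vote on known-good domains
    if domain in BLACKLIST:
        return "suspicious"       # ...or on known-bad ones
    # everything else falls through to the learned model,
    # assumed here to return a probability between 0 and 1
    score = model.predict(url)
    return "suspicious" if score >= threshold else "benign"
```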
Learning is Hard
I spent well over a year trying to wrap my head around things like SKLearn and TensorFlow. Some folks get this stuff right off the bat; for the rest of us, we're a little slower. I spent my first few iterations running through the examples on Amazon's Machine Learning service. It helped me go from data in Excel to a full-blown REST API I could make predictions against. The caveat was the types of learning algos you could use. For simple binary classifications [at the time] it was easy, but if you wanted to use something like RandomForest (more advanced classification), things became a bit difficult. This ultimately led me to SKLearn, and then finally TensorFlow.
For those just getting started, my advice is simple:
Get a YouTube RED Account.
Subscribe to the Artificial Intelligence Channel.
Binge.
Build lots of very small examples using different kinds of data.
Binge more.
If you're like me, you won't really understand most of it, at least not for the first few months. However, over time, you'll absorb the different concepts and future videos will start making sense. This works especially well if you have small children and only have 30min or so a night to "do some passive homework". Those 30min chunks add up over time, and within a year or so you'll be able to walk the walk, or at least fake it pretty well (hi!). Even better, you might build something someone else finds useful, which is one of the most gratifying things in life.
When I first started googling around for things like "phishing" and "machine learning", I found a bunch of [mostly stale] research papers and projects demonstrating how to do feature extraction with SKLearn. It wasn't that things weren't happening in the TensorFlow space, but unless you knew what to search for, TensorFlow had (has?) been mostly about deep learning when it comes to image recognition. There hasn't been a lot of research published [until recently] on its application to things like phishing data.
Prototyping
After a few months of the YouTubes, I came across a pretty interesting Medium post. This post detailed taking web logs, parsing out the GETs and building models to detect which requests appeared to have been some type of injection attack. Not only that, these wonderful people posted snippets of their TensorFlow-based code too! The code was pretty trivial to adapt, since it was already looking "for odd URLs", and within an hour or two I had a very simple model that used deep learning to predict whether a url was suspicious or not.
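If you're curious what "a very simple model" looks like, here's a rough sketch of a character-level URL classifier in tf.keras. The layer sizes, vocabulary, and encoding are my own placeholders, not the code from that Medium post, but the shape of the idea is the same: feed the raw characters of a URL in, get a probability of "suspicious" out.

```python
# Speculative sketch of a character-level URL classifier -- not the original code.
import numpy as np
import tensorflow as tf

MAX_LEN = 128     # pad/truncate URLs to a fixed length
VOCAB_SIZE = 128  # raw ASCII codes as the "vocabulary"

def encode(url):
    """Turn a URL string into a fixed-length sequence of character codes."""
    codes = [min(ord(c), VOCAB_SIZE - 1) for c in url[:MAX_LEN]]
    return codes + [0] * (MAX_LEN - len(codes))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.Conv1D(64, 5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability the URL is suspicious
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# urls/labels come from your own labeled samples (eg: openphish + a whitelist):
# model.fit(np.array([encode(u) for u in urls]), np.array(labels), epochs=5)
```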
To test this, I used very old (months, not years) openphish data to prime the model, along with the url whitelist from our previous SKLearn models. Once built, I started testing with fresh openphish data, and the initial results appeared to check out. This was NOT a very scientific test, other than the regression testing the framework does itself, but it helped get us to the next step. It proved that, with a few lines of python, we could build a neural network that did some basic reasoning based on the semi-supervised data (eg: we classified the samples as good or bad, it did the rest).
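Assembling that seed data is about as simple as it sounds. Below is a hypothetical loader, assuming one URL per line in each file; the file names are made up, but the labeling scheme (openphish == bad, whitelist == good) is the one described above.

```python
# Hypothetical loader: openphish entries labeled 1 (suspicious), whitelist entries 0 (benign).
def load_training_data(phish_path="openphish.txt", whitelist_path="whitelist.txt"):
    urls, labels = [], []
    for path, label in [(phish_path, 1), (whitelist_path, 0)]:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    urls.append(line)
                    labels.append(label)
    return urls, labels
```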
More importantly, it proved that with a few lines of python, we could now begin processing millions of URLs per second with a significantly high success rate. Assume our attackers change their tactics every day, week, hour, whatever. The cost of building new models is ~30min, during which the learning algo picks up the new tactics automatically and redistributes that knowledge throughout the network. Your model takes hours, days? Spin up an Amazon Deep Learning AMI with 8 or 16 GPUs; it's just a simple credit card transaction problem now.
Learning in Layers
There's an obvious "looking at history" problem here. We're building a model around previously tagged data (eg: "supervised"), meaning that within the 15% we're missing, there could be (reads: "IS") potentially very dangerous stuff. URLs the phishers have crafted using their own deep learning techniques, knowing full well they'll get past your AI. I've seen presentations that prove this; the AI does a better job at crafting phishing urls, with higher success rates, than most humans do. This is where we start thinking of the larger AI frameworks as layers.
Neural networks themselves are just layers of neurons, distilling down information and presenting the results to other layers. No different than how we model information flows between humans. We have humans whose job it is to take a set of predetermined features and apply them to data, filtering out the obvious matches while passing on the non-obvious ones. We have other bundles of neurons (eg: other humans) who eyeball data and use their specialized predictive powers to investigate, make judgements, further distill information and pass it along.
The judgement pieces never really go away; this arms race is still humans vs humans. The cost of the initial predictions just goes down and becomes more automated. You could almost describe both our SKLearn and TensorFlow frameworks as becoming layers in our larger network. Our machine-based neural network acts as a pre-filter for our more specific SKLearn-based framework. This ultimately feeds into our biological network, which makes judgement calls based on our current operating environment.
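In code, that layering is little more than a chain of if-statements. The sketch below is hypothetical (both model objects and the thresholds are placeholders), but it captures the flow: the cheap, generic neural network takes the first pass, the feature-driven SKLearn model takes a second look at the middle band, and anything still ambiguous lands in a queue for a human.

```python
# Hypothetical triage chain: deep model -> SKLearn model -> human analyst.
def triage(url, deep_model, sklearn_model, low=0.2, high=0.9):
    deep_score = deep_model.predict(url)   # assumed to return a probability in [0, 1]
    if deep_score < low:
        return "drop"                      # confidently benign, discard
    if deep_score > high:
        return "block"                     # confidently malicious, act on it
    # the murky middle goes to the more specific, feature-driven model
    if sklearn_model.predict(url) == 1:
        return "block"
    return "human_review"                  # the judgement layer: hand it to an analyst
```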
These judgements then feed back into our initial layers, in the form of data, docs and other models. From that, we begin to understand the next filter that needs to be built: the unsupervised deep learning layer. That layer will help us spot the new unknowns as they evolve, and feed those signals into our other layers for better prediction and judgement. The ultimate goal is reducing the cost of judgement, enabling it to scale sideways, and therefore increasing its overall value.
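One possible shape for that unsupervised layer (purely speculative on my part, nothing from the frameworks described above) is an autoencoder trained only on traffic we consider normal, where a high reconstruction error flags URLs that don't look like anything we've seen before.

```python
# Speculative sketch of an unsupervised "new unknowns" layer using an autoencoder.
import numpy as np
import tensorflow as tf

MAX_LEN = 128  # same fixed-length character encoding as the classifier sketch above

def encode(url):
    codes = [min(ord(c), 127) for c in url[:MAX_LEN]]
    return codes + [0] * (MAX_LEN - len(codes))

autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),  # compressed representation
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(MAX_LEN),                # reconstruct the input sequence
])
autoencoder.compile(optimizer="adam", loss="mse")

# train only on URLs we believe are normal, eg: the whitelist
# X = np.array([encode(u) for u in benign_urls], dtype="float32")
# autoencoder.fit(X, X, epochs=10)

def anomaly_score(url):
    """Higher reconstruction error == less like anything the model has seen before."""
    x = np.array([encode(url)], dtype="float32")
    return float(np.mean((autoencoder.predict(x, verbose=0) - x) ** 2))
```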