This is part of a quasi-multi-part series where we use different styles of machine learning (SKLearn and TensorFlow) to hunt for suspicious indicators. In the most recent part of this series we scratched the surface of what you can accomplish with 'deep learning' frameworks like TensorFlow. The edge here is having the frameworks generate the classifiers for you, rather than you trying to pick apart all the features every time the bad guys evolve.
Our URLs example was easy, in that URLs typically provide enough context (eg: length and re-use [or not] of characters) that determining what's good vs what's bad isn't all that hard. Essentially, the longer the URL these days (shorteners excluded), the better chance we have at classifying it [as suspicious]. Our model basically takes a look at which characters are used more heavily in suspicious URLs vs legit. Their presence [or lack thereof] usually gives us enough context to make a scalable judgement call. 80% of the time, we're probably going to be correct in our prediction.
What about the other 20% of the time? We might easily pre-filter for obvious domains to reduce false positives, but what about something like: hxxps://go0gle.com/about-us ? It's possible the "zero" gives it away, since many of the URLs in our data-set don't contain numbers in the middle. Outside of that, from a human perspective it's somewhat obvious, but only because we have the domain context pre-drilled into our brains. Our URL model doesn't… yet.
Keep it Simple
This problem can be, and should be, broken up into pieces. You should start where you have the MOST context (eg: a long URL; if it's obvious, it'll get filtered out) and work your way down to the least context. Think of your different models as a pipeline: if something fails the URL test and your domain is NOT in a whitelist somewhere, then send it down the line into your suspicious domains model. If you try to solve both of these problems using a single model you'll more than likely miss the majority of outliers.
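The pipeline idea can be sketched in a few lines of Python. This is a minimal illustration, not the code from the series: `classify`, the two stand-in scoring functions, the 0.5 threshold and the tiny whitelist are all hypothetical, and I'm reading "fails the URL test" as "the URL model did not flag it".

```python
from typing import Callable, Set
from urllib.parse import urlparse

# A scorer takes a string and returns a suspicion score between 0 and 1.
Scorer = Callable[[str], float]


def classify(url: str, url_model: Scorer, domain_model: Scorer,
             whitelist: Set[str], threshold: float = 0.5) -> str:
    """Walk a URL through the stages, most context first."""
    # Stage 1: the full URL carries the most context.
    if url_model(url) >= threshold:
        return "suspicious-url"
    # Stage 2: known-good domains drop out before the domain model sees them.
    domain = (urlparse(url).hostname or "").lower()
    if domain in whitelist:
        return "whitelisted"
    # Stage 3: least context, so the domain model gets the leftovers.
    if domain_model(domain) >= threshold:
        return "suspicious-domain"
    return "unknown"


# Hypothetical stand-ins for trained models, only here to make the sketch run.
def toy_url_model(url: str) -> float:
    return 0.9 if len(url) > 60 else 0.1


def toy_domain_model(domain: str) -> float:
    return 0.8 if any(ch.isdigit() for ch in domain) else 0.2


WHITELIST = {"google.com", "paypal.com"}

print(classify("https://go0gle.com/about-us", toy_url_model, toy_domain_model, WHITELIST))
```

With these toy stand-ins the go0gle URL slips past the URL stage (it's short), misses the whitelist and gets caught by the domain stage, which is the hand-off the pipeline is meant to provide.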
However, unlike URLs, domains require a bit more nuance. With a URL, we could suggest that "if you see lots of hyphens, or numbers, or long length" something is odd with that URL. It's using character sequences to try and trick the end-user. When you train your model, you'll notice the sequence "paypal" is used a lot in mostly suspect domains and very little in legit domains. What you're almost training the model to do by default is flag anything with 'paypal' in it, even if that means 'paypal.com'.
The easy way around this is obvious: just pass the result through an Alexa-style whitelist? If that's the case, then why use a machine learning model at all? Just flag everything that has some Levenshtein distance greater than 5-10 and pass it through a whitelist. Done and done? Well, if you know that, so do the phishers, and you'll just constantly be chasing them. If you look at this from a deeper learning perspective, we want to automate that process. It makes your model rebuild work cheaper, which gives you the edge. Let the machines do the day to day learning, so we can focus on the more important stuff.
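To see why the model drifts this way, it helps to count how often a token shows up on each side of the training data. A rough sketch, using made-up placeholder domains (the `.example` names and the `share_containing` helper are hypothetical, not taken from the real data-set):

```python
def share_containing(token: str, domains: list) -> float:
    """Fraction of the given domains that contain the token."""
    if not domains:
        return 0.0
    return sum(token in d.lower() for d in domains) / len(domains)


# Toy, hand-written samples purely for illustration.
suspect = [
    "paypal-login.example",
    "secure-paypal.example",
    "account-update.example",
    "paypal-verify.example",
]
legit = ["paypal.com", "example.com", "example.org", "example.net"]

print(share_containing("paypal", suspect))  # 0.75
print(share_containing("paypal", legit))    # 0.25
```

When a token is that lopsided in the training data, the cheapest thing for the model to learn is "paypal means bad", and the one legit domain that carries the token pays for it.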
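For reference, the hand-rolled approach looks roughly like this. It's a sketch of a plain Levenshtein (edit distance) calculation plus a lookup against a whitelist; `nearest_whitelisted` and the two-entry whitelist are hypothetical, and where you put the cut-off on the resulting distance is exactly the hand-tuned guess we're trying to get away from.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, kept to two rows of the table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def nearest_whitelisted(domain: str, whitelist: list) -> tuple:
    """Return the closest whitelisted domain and its edit distance."""
    return min(((w, levenshtein(domain, w)) for w in whitelist),
               key=lambda pair: pair[1])


WHITELIST = ["google.com", "paypal.com"]  # stand-in for an Alexa-style list

print(nearest_whitelisted("go0gle.com", WHITELIST))  # ('google.com', 1)
```

It works, but every number you bolt on top of that distance is something a phisher can probe and step around.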
Learning the Learn
If you dig a bit into the initial post and repo where we used SKLearn to predict domains, that's kind of what we did. Features were extracted manually, the model was tested, tweaked, rebuilt. New features were built into the model as the phishers adapted their tactics, and the numbers WERE COMPLETELY MADE UP TO FIT OUR ASSUMPTIONS. Why a distance of 5-10? Who knows, it was the best we had at the time, but completely made up. Next step: use a real mathematical approach, with baked-in regression testing, to develop the model. Make the process repeatable so it can be tested, adapted and more statistically accurate.
The first big problem we ran into was that, while there's a plethora of places to get access to relatively clean phishing URLs (openphish, phishtank, apwg, etc.), there isn't [an obvious one] when it comes to domains. It's easy to get a corpus of highly popular legit domains (Alexa, Umbrella), but highly popular phishing domains you kind of have to pre-filter on your own. Malwaredomains[.com] is a helpful starter set, but I've also noticed over the years that some odd stuff gets in there too. Domains that aren't really all that bad, but not that great either.
When you're trying to train your model [for the first, second or third time] you'll probably need to make some sacrifices with respect to bias. If you aim for perfection, you'll be sadly disappointed. Best to try and glob together 30,000 or so of the best 'worst' types of domains you can find, get your pipeline going and refine from there as you find outliers. I simply pulled as many as I could of the odd looking "paypal" or "apple" domains that contained "-" or other odd characters in the domain part of the URL, combined them with some of the more "tagged as phishing" entries from the malwaredomains[.com] list and ran with it.
This totally biases my initial model towards things with hyphens, paypal and apple in them, but that's OK. Our goal here is to learn how the model is built; over time we'll find the weak spots and improve. Invest in what's not going to change, remember? Besides, with very little investment we're able to turn a 50/50 coin toss into, at worst, a 70/30 problem (if not higher). For the purposes of our learning experience, I've cataloged the data-set we're using here. Pull requests are always welcome and we hope this helps provide a benchmark when developing and testing your models. Also, if you have the capacity to donate to and/or pay for any of the services that helped us attain this data, they're saving you time and money. DO IT.
Back to the original problem: the word "paypal" is heavily used in phishing domains, less heavily used in legit ones. Unlike the URLs problem, where TensorFlow is able to develop a pretty good model efficiently (eg: fewer regression passes at the data, doesn't have to spend too much time 'learning the learn'), domains require a bit more effort. What this means is, we have to make multiple, higher resolution passes at the data to weight things appropriately. Bluntly: more cores, more passes, more time and more money.
It also means you have to spend some time reading the TensorFlow, Keras and LSTM doc, to the point where you kind of understand what the different values mean, because you're going to have to tweak them… and wait a few hours to see what the result was. At one point through this, I found a value that wasn't easily googleable [with relevant examples], but was immensely important to the end result in classifying both g00gle.com and paypal.com. The doc ABOUT the variable is there, but the doc explaining [in layman's terms] how it affects the outcome wasn't as easy to digest. Most responses were "just fiddle with the number, it's magic". Which means, "we read it in the paper, we sort of understand that it adjusts the output.. but it's hard to explain here, or we didn't understand it ourselves".
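A rough idea of what that first, deliberately biased pull could look like. This is a guess at the shape of the filter, not the script actually used: the `BRANDS` tuple, the `ODD_CHARS` pattern (hyphens and digits standing in for "odd characters") and the `looks_like_seed` name are all assumptions.

```python
import re

BRANDS = ("paypal", "apple")       # brand strings to hunt for (assumption)
ODD_CHARS = re.compile(r"[-0-9]")  # hyphens or digits count as "odd" here (assumption)


def looks_like_seed(domain: str) -> bool:
    """True if the domain mentions a brand AND contains an odd character."""
    d = domain.lower()
    return any(brand in d for brand in BRANDS) and bool(ODD_CHARS.search(d))


candidates = ["paypal-secure-login.com", "paypal.com", "apple-id1.net", "example.org"]
seeds = [d for d in candidates if looks_like_seed(d)]
print(seeds)  # ['paypal-secure-login.com', 'apple-id1.net']
```

Note that a substring match like this would also happily grab something like pineapple-farm.com, which is the kind of bias you accept on a first pass and clean up as the outliers show themselves.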
RTFM- If it exists
Herein lies the rub: most "deep learning" frameworks are oriented around images. While there's a lot of work that's been done in the "text" space, there aren't a ton of great, implemented examples. There are a lot of research papers, but many of them are hard to read, try to be unbiased and leave out a lot of implementation details. I kind of want my approach to be somewhat biased (eg: have more false positives than false negatives) because I can whitelist the obvious ones. This is not scientific research, this is real life. People actually lose massive amounts of wealth if I miss something. Then again, it can cost me a lot of resource investment to build these models too; there's a balance.
What I found when generating domain based models is that you need to tweak almost all the numbers a bit. If you want a quick and dirty "70%" model, you can probably build that against a 30k/30k domain split (whitelist/blacklist) in less than 10 minutes. Here you use larger batch sizes (lower resolution), have fewer "neurons" and have a little less resolution in how each of your domains "maps out" in its characteristics (Character Embeddings, another dark art I won't go into in this post).
It's like teaching a 12 year old to spot a phishing domain. They're cheap, they take a coin flip and make it 60/40 or even 70/30, but they're moody and unreliable in the long run. They need to be re-trained as your attackers change their tactics and will randomly fail for no reason. Put enough of them in a room together and they don't really scale sideways; they may even flip that 70/30 into a 30/70.
Teaching a 30 year old to spot phishing domains is a bit more complex. They require a bit more intensity, data and resources, but if you spend enough time helping them learn where the outliers are, over time they'll more easily adapt with the attacks. They're able to consume more complexity in their training, and higher resolution data-sets help them reason through outliers that may have been missed in their regression training. They scale a bit more sideways too: with more mature models, you're able to chain them together, transforming an 84% or 93% probability into near 99%.
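One back-of-the-napkin way to see where "near 99%" can come from: if you flag something whenever either model flags it, and you assume the two models' misses are independent (a big assumption that real models rarely satisfy), the combined catch rate is one minus the product of the miss rates. The `chained_catch_rate` helper is hypothetical and only illustrates the arithmetic.

```python
def chained_catch_rate(rates: list) -> float:
    """Chance that at least one model catches it, assuming independent misses."""
    miss = 1.0
    for rate in rates:
        miss *= (1.0 - rate)  # probability that every model so far missed
    return 1.0 - miss


combined = chained_catch_rate([0.84, 0.93])
print(round(combined, 4))  # 0.9888
```

Two 70/30 models chained the same way only get you to 91%, and that's before you account for them failing on the same outliers, which is why the more mature models are the ones worth chaining.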
It certainly requires significant investment in the initial term, but with phishing, it's the outliers that cost you the most. The nice thing about the cost of prediction frameworks is that they're getting cheaper by the day. Let's say you spend 40 hours building your first set of models. At $100 / hour, that's an initial investment of $4,000 (assuming your laptop is already a sunk cost). Let's also assume each time someone falls for a phishing attack, it costs you $1,000 (normal cleanup costs, person-hours, etc). If that very simple model protects you from 100 phishing attacks, that's $100,000 in avoided costs and a return on investment of 2,400%.
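The arithmetic, spelled out, using the usual (gain minus cost) over cost definition of ROI. The `roi_percent` helper is just for illustration; the inputs are the numbers from the paragraph above.

```python
def roi_percent(investment: float, savings: float) -> float:
    """Return on investment as a percentage: (savings - investment) / investment."""
    return (savings - investment) / investment * 100


investment = 40 * 100       # 40 hours at $100 / hour
savings = 100 * 1_000       # 100 prevented incidents at $1,000 each

print(roi_percent(investment, savings))  # 2400.0
```

Plug in your own cleanup costs and hours; the investment side barely moves while the savings side scales with every attack you stop.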
Your cleanup [sunk?] costs are probably different; what's important is the value proposition. If you're a medium to large sized business, that investment almost stays constant while the ROI keeps climbing. Not only that, but the lessons learned (eg: Machine Learning, TensorFlow, etc) then cascade into other areas of your operation, reducing their overall costs and further improving your ROI. You're not using this to replace people, you're using this to make them smarter and freeing them up to focus on the really hard problems. If you're able to re-focus 2 of your FTEs at $100,000 each on a harder problem, that's an additional savings of $200,000. Add that to the $100,000 in avoided cleanup and, against the same $4,000 investment, your ROI is now somewhere in the ballpark of 7,400%.
There are very few things in life that exhibit similar behavior to the magic of compounding interest.. this is one of them. Then again, once you've proven the concept.. you can further compound your ROI by using our pre-built models too.