I'm a network junkie, not a traditional 'security analyst' by trade. In my first few years as a security "professional" [if you could call it that], I was hacking kernel drivers for a DAG card in an effort to apply Snort to a 1Gb fiber line. Applying an IDS to traffic is one thing; keeping up with the pattern matching is an entirely different science in and of itself. It's easier these days, but still expensive. As time went by, I learned about 'SEMs' and this magical idea of "correlation" (I was an ArcSight junkie at the time). We needed to correlate across our IDSs and firewalls; it was either build something or buy something, and I was told to buy something. More time passed and a thought occurred to me: what if we correlated these bad actors across larger, more federated organizations?
We tested the idea with Prelude-IDS (an open-source SEM, easier to test with than licensing ArcSight; this was ~2009?). We used their "correlator", a Python-based event loop that mashed large Python dicts together and triggered on the results. We prototyped this to trigger events when multiple SSH brute-forcers hit multiple sites (large /16's spread throughout the Internet). We submitted those events to a central repo, where other IDSs could pull that feed. The system worked well, but the software was cumbersome to set up, install, and configure, not to mention adapting it for the different variations of Snort rules used at the different sites to detect this data. Some would detect a scan after an attacker hit 30 destination IPs in 60 seconds; others, just 2. It worked, but we were just making it up as we went along. I think they call that "prototyping".
The next few years were geared more around gathering other forms of intelligence. The system hummed along in the background, knowing full well we needed something more mathematically elegant. Early warning systems are nice, but to really get ahead of the attackers you need to model their behavior. You need to know where they are likely to come from in the future, not where they already hit you in the past. If the missiles are flying, you can try to get out of the way, but it's probably better to know where, and more importantly WHEN, they might come from in the future. Build your defenses on that knowledge and position accordingly.
That's what we do with our typical network defenses: we throw up 1950's-style radar systems (Snort, Bro, etc.) with a really wide net. We then spend our time tuning them to ignore the birds. I'm sure in this day and age, many advanced systems understand where and when their attackers are likely coming from, but a good chunk of the internet simply does not. Especially those with the largest chunks of v4 and v6 space in their allocation.
Most sites do understand that if they're on the east coast of the US and they get some odd SMTP/IMAP connections from Nigeria in the middle of the night, those connections may be 'suspicious'. However, the automations used to detect that are rigid, requiring humans to manually intervene and make judgment calls based on broad country codes [or large net-blocks] they "know are bad". These tactics ignore the precise features that suggest "connections from this city around this time of day are suspicious". You may have a visiting professor traveling home for the week who, during their daytime, needs access to certain services. At other times of the day, it may be someone more nefarious trying to access your systems on their behalf. How do you adapt your network to protect against that? In realtime?
Again, there are folks doing this well, and I'm also sure if you asked them how much they spent on doing it well, they'd tell you: $$$$$$$$, on technology, and probably on people who understand it. Where does that leave the rest of us?
SKLearn is still kind of hard. Actually, math, statistics, and just the idea of trusting their predictability are still hard. You have to have faith in the models and their developers for them to be effective. They have to be measurable in order to be predictable, and more importantly, they have to be tuned to your environment. They need to understand features like:
The time of day most of your users are, for lack of a better term, active. This could include multiple segments throughout the day when people are active. After coffee? Before bed? Some of us like a mid-afternoon nap; can you account for that?
The timezone you operate out of, which helps capture the broader patterns of not just you, but others in your region.
General location, not specific. What LAT/LONG do your typical traffic patterns originate from, based on time of day?
Are there patterns in the features above that cluster around certain countries (again, based on time of day and LAT/LONG)? Different cultures have different types of normal activity. Human behavior is usually specific to a culture, which includes how well-defined certain types of legal statutes may be in a region. Is the activity you're observing legal at that [general] LAT/LONG? Is it a grey area?
Is there a sub-culture in a specific region that has more loosely defined cyber ethics, but likes to "follow the sun" of their targets? Do they know you like your mid-afternoon naps based on your social media profile? Do the people protecting you know that?
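As a rough sketch of what a few of those features might look like in code (the function names, the timezone handling, and the whole-degree rounding are my own illustration, not any particular library's API), assuming connection records that carry an hour and a coarse lat/long:

```python
import math

def time_features(hour_utc, tz_offset):
    """Encode hour-of-day cyclically so hour 23 and hour 0 land close together.

    hour_utc: integer hour of the connection (0-23, UTC)
    tz_offset: the site's offset from UTC in hours (e.g. -5 for US East)
    """
    local_hour = (hour_utc + tz_offset) % 24
    angle = 2 * math.pi * local_hour / 24
    return [math.sin(angle), math.cos(angle)]

def connection_features(hour_utc, tz_offset, lat, lon):
    """A minimal feature vector: cyclical local time plus coarse location.

    Lat/long are rounded to whole degrees -- general location, not specific.
    """
    return time_features(hour_utc, tz_offset) + [round(lat), round(lon)]
```

The sin/cos trick keeps hour 23 and hour 0 adjacent, which matters when a model needs "middle of the night" to be one neighborhood rather than two far ends of a number line.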
I've been toying with this idea for a few years now, and while building SKLearn models against things like URLs and FQDNs is relatively easy, IP addresses are a bit more nuanced. URLs and FQDNs have specific features you can pull from the indicator itself. How long is the indicator? How many hyphens does it have? How many subdomains, etc.?
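For the URL/FQDN case, those features really can be pulled straight from the indicator. A minimal sketch (the "registered domain is the last two labels" assumption is a rough heuristic; real code would consult a public-suffix list):

```python
def fqdn_features(fqdn):
    """Lexical features pulled straight from the indicator itself."""
    labels = fqdn.rstrip(".").split(".")
    return {
        "length": len(fqdn),
        "hyphens": fqdn.count("-"),
        # labels to the left of a (roughly) two-label registered domain
        "subdomains": max(len(labels) - 2, 0),
        "digits": sum(c.isdigit() for c in fqdn),
    }
```

A dict like this per indicator is exactly the kind of "yes/no/maybe" row you can hand to an SKLearn classifier via its feature-matrix inputs.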
An IP is more like a street address; it is what it is. The same is true for ASNs and BGP prefixes. The address itself isn't bad; in fact, it may be a legit pizza shop in the front, but dealing drugs out the back. How do you develop a model for detecting that at scale? We know in real life this sort of modeling happens intuitively all the time; it's how LEO catches the peeps dealing drugs out the back. In cyberspace though, we've had a harder time [transparently] learning from that.
Modeling For Dummies
In SKLearn (and other modeling frameworks), it's relatively easy to build models around feature sets that are more "yes/no/maybe". Does your observable have most of these single-dimensional features? Cluster up enough of those, perform some statistical magic, and you get a statistical 'yes'. 84% of the time, that's good enough. Feature clustering for "things that just are" requires a bit more nuance.
The IP is from China
The IP is Lat/Long: 39/116
The connection was during hour 23 of 24 UTC
The connection was known to be tcp/22
You can't point to a single feature, or set of features, and suggest they are indicative of a suspicious connection. However, over a large enough data-set you can start to cluster the patterns of connections. Then again, within that grouping you may have lots of visiting professors from China; how do you tell the difference? The model has to be both developed and trained for your context to work effectively.
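A toy sketch of that clustering idea, standing in for something like SKLearn's DBSCAN (the encoding weights and the `eps` threshold are illustrative guesses you would tune against your own traffic, not recommended values):

```python
import math

def encode(conn):
    """Turn one connection record into a numeric vector.

    conn is a dict with keys: hour (0-23 UTC), lat, lon, port --
    the "things that just are" features from the list above.
    Hour is encoded cyclically; lat/lon are scaled down so no single
    feature dominates the distance.
    """
    angle = 2 * math.pi * conn["hour"] / 24
    return (
        math.sin(angle),
        math.cos(angle),
        conn["lat"] / 90.0,
        conn["lon"] / 180.0,
        1.0 if conn["port"] == 22 else 0.0,  # crude: flag the port of interest
    )

def naive_cluster(conns, eps=0.3):
    """Group connections whose vectors sit within eps of each other.

    A toy stand-in for a real density-based clusterer -- just enough to
    show why repeated SSH probes at the same hour from the same region
    fall into one bucket.
    """
    vecs = [encode(c) for c in conns]
    labels = [-1] * len(conns)
    next_label = 0
    for i, v in enumerate(vecs):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, len(vecs)):
            if labels[j] == -1 and math.dist(v, vecs[j]) <= eps:
                labels[j] = next_label
        next_label += 1
    return labels
```

Two tcp/22 connections at hour 23 from nearby lat/longs land in the same cluster; a daytime tcp/443 connection from the other side of the planet does not. The visiting-professor problem is exactly why the cluster labels are a starting point for a human, not a verdict.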
Learning the hard way
I'm currently reading through Daniel Suarez's latest, Change Agent. I won't spoil the book, but in the beginning they're trying to track a general type of suspicious activity. Instead of trying to get too granular and track individuals, they broaden their scope to look at more generalized patterns of activity, then point local law enforcement towards the activity and let them do what they do best: hunt. This turns out to be a more efficient use of resources.
A light bulb finally went off for me (I'm a slow learner). We've invested so much time in hunting for specific bad actors. Or worse, we've been looking at the broader activity: "just block XYZ country code and be done with it". In a world that's becoming ever more interconnected, not everyone in every country is out to get others in a different country. There are a lot of bad actors in the US and China alike.
I can predict my kids... sometimes.
Here's the catch though: they're all humans. Humans like routine and predictability, even the bad actors, most of them at least. As much as we like to think things are random, more often than not, at some level there is a certain predictability to the broader patterns. While that predictability may not be a significant edge on its own, contrasting it against the same model flush with "known good" generalized patterns can provide enough of an edge to more predictably thwart the obvious stuff.
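One crude way to picture "contrasting against known good": count how often a (country, local hour) pair shows up in your own normal traffic, then score new connections by how unfamiliar the pair is. Everything here (the class name, the scoring) is an illustrative sketch, not a real library's API:

```python
from collections import Counter

class BaselineModel:
    """Score connections against a baseline of known-good activity.

    Train on (country, local_hour) pairs drawn from your own normal
    traffic; score new connections by how familiar the pair is
    relative to your busiest bucket.
    """

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def fit(self, good_conns):
        """good_conns: iterable of (country_code, local_hour) pairs."""
        for country, hour in good_conns:
            self.counts[(country, hour)] += 1
            self.total += 1

    def suspicion(self, country, hour):
        """0.0 = as routine as it gets, 1.0 = never seen in the baseline."""
        if self.total == 0:
            return 1.0  # no baseline yet: everything is unfamiliar
        peak = max(self.counts.values())
        return 1.0 - self.counts[(country, hour)] / peak
```

It's deliberately dumb, but it captures the shape of the argument: the model learns your routine, and the bad actors' routines stand out against it.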
This isn't to suggest low-level pattern matching is dead. Many black-box vendors would love you to believe that. To the contrary, I'm a huge believer that good "police work" is the ONLY thing that will protect you. However, it's the combination of the two that efficiently provides more effective results. It's not one or the other; it's both.
Are these models perfect, or even mature? Not a chance. Have others already figured this out? Of course; we're not inventing anything new here. These are a baseline meant to help bring some of the power of SKLearn and statistics to people who may just need a few small hints. It's my way of codifying some simple, yet effective lessons learned over the years into a simple library others might build on.
The less noise your hunters have to weed through, the more focused they become. The more focused they are, the more likely they'll find that needle. Oftentimes, as is the case with most breaches, enough positive edge is all it takes.