In part one of this series, I gave a high-level overview of how easy it is to tease out certain patterns from semi-structured data, all without the need for heavier "NLTK"-style frameworks. My thesis: stop thinking about formats and start thinking about the data and its context. Formats will change over time; the meaning within the data will not. IPs will be IPs, URLs will be URLs, and context will still surround them, however they're described.
However, there are some subtle nuances to this. If you remember the example from the previous post, one of the fields looked like an ASN description instead of an indicator description. Remember, the purpose of this exercise is to trade 100% accuracy [which comes with very expensive overhead] for ~90% accuracy and a significant reduction in overhead. At the very least, maybe we come up with a tool that auto-generates the configuration required to get you from 90% to 100% accuracy with a feed. The real goal is positive edge through scale.
In the initial prototype, we kept it simple. We looked at a handful of key identifiers (or "tokens" in NLTK speak) to detect whether a feed was XML, JSON or just a plain file. We then iterated through a handful of lines to detect whether it was a plain feed or some kind of delimited feed (CSV, pipe, semicolon, etc). This pattern helped us shed a few lines of YAML and gave us fairly accurate results for a simple CSV feed. Ironically, most feeds come as simple CSV, so from an ROI perspective, these few tweaks are probably one of the largest returns we're going to get.
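A minimal sketch of what that detection step might look like. The function name and thresholds here are illustrative, not from the actual codebase: peek at the first few non-blank lines, and if a candidate delimiter shows up a consistent, non-zero number of times per line, call it a delimited feed.

```python
import json

def detect_format(raw):
    """Guess whether a feed is JSON, XML, delimited, or plain text."""
    stripped = raw.lstrip()

    # a few cheap "token" checks first
    if stripped.startswith(('{', '[')):
        try:
            json.loads(raw)
            return 'json'
        except ValueError:
            pass
    if stripped.startswith('<'):
        return 'xml'

    # peek at a handful of lines and count candidate delimiters
    lines = [l for l in raw.splitlines() if l.strip()][:10]
    for delim in (',', '|', ';', '\t'):
        counts = [l.count(delim) for l in lines]
        # a real delimiter appears the same non-zero number of
        # times on every line we peeked at
        if counts and min(counts) > 0 and len(set(counts)) == 1:
            return 'csv' if delim == ',' else 'delimited'

    return 'plain'
```

The "consistent count per line" trick is what saves us from treating a stray comma inside a description as a delimiter.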
We could very easily tweak the config a bit to deal with that one line with the ASN description in it, but where's the fun in that? Each new iPhone isn't special because they "shaved off a tenth of a picometre"; it's the culmination of many iterations that sets these phones apart from the competition. Over time, those seemingly 'tiny' improvements drive the device closer and closer to perfection and further from the competition. The same is true for software: each new release shouldn't be a game changer; that's not how iteration works. If each release was substantially new, it's probably because you're trying to sell features, not solve actual problems. The game-changer features should span a few iterations of the tool, where the sum of all the tiniest problems solved [not features designed] adds up to … well, fm.
What's the difference between an ASN Description and an indicator description?
They both contain letters, spaces and, at times, punctuation. They may include unicode characters, and sometimes they may just be blank, or 'NA'. The easiest engineering way around this is probably simple: hard-code the fields in the config, or, since we know all the ASNs that exist in the universe, hard-code that list and check against it. There are some obvious nuances to this- while ASNs don't change that often, they do change. That means we'd need to build complexity into our code that refreshes the list on a semi-regular basis. Our goal here is to reduce complexity with a more generalized pattern, not simply shift it.
Think about what your eyes do when you look at that feed. How do they differentiate between the obvious description in this feed and the ASN? I'll give you a hint: it's the same thing as when you [visually] identify that the feed is CSV. Your eyes peek at the table and group one of the columns as the description. The same indicator description tends to repeat over multiple lines, while the ASN description will almost always be different.
In this example, if we simply peek at the data we can gather a list of the top "tokens" in the file. By removing a few obvious ones- separator characters (commas, pipes, newlines, quotes) and anything less than 3-5 characters long- we get a list of the common words or phrases used in the feed. Things like "phishing" and "caught in my darknet" float to the top. Things like timestamps, ASN descriptions and the indicators themselves sink to the bottom. This sorted dictionary can be passed to the parsers as a sort of "hint": "hey- here's a list of things that might help you apply context to your decision making process".
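Building that hints table can be done in a few lines. This is a sketch under assumed thresholds (minimum token length of 4, top 10 tokens); the names are illustrative:

```python
import re
from collections import Counter

# separator characters we strip before tokenizing
SEPARATORS = re.compile(r'[,|;"\'\t]')

def build_hints(raw, min_len=4, top_n=10):
    """Collect the most frequent tokens in a feed as parsing 'hints'."""
    tokens = []
    for line in raw.splitlines():
        # replace separators with spaces, then split on whitespace
        for tok in SEPARATORS.sub(' ', line).split():
            if len(tok) >= min_len:
                tokens.append(tok.lower())
    # repeated descriptions ("phishing") float to the top; one-off
    # values (timestamps, ASN descriptions, indicators) sink
    return [t for t, _ in Counter(tokens).most_common(top_n)]
```

Because the hints are sorted by frequency, the head of the list is exactly the "description-like" vocabulary the parsers should lean on.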
If the parser comes across something it can't quite rule one way or the other, it consults the hints table. If it's in the hints table, there's a good probability it will make the right choice. It's in the hints table, but we didn't see an actual ASN pass by? It's probably an indicator description. It's not in the hints table and we DID see an ASN? It's probably an ASN description. It's not in the hints table, but it's not an IPv4|v6 address either? It's probably an indicator description.
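That decision chain can be sketched as a small classifier. This is a toy version under stated assumptions: `hints` is the frequency list from the previous step, `saw_asn` is a flag the parser sets once it encounters an ASN column, and the simple IPv4 check stands in for a fuller indicator test:

```python
import re

# crude IPv4 shape check; a real parser would also handle IPv6
IPV4 = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')

def classify_field(value, hints, saw_asn):
    """Guess whether an ambiguous field is an indicator, an indicator
    description, or an ASN description."""
    if IPV4.match(value):
        return 'indicator'
    if value.lower() in hints and not saw_asn:
        # repeated token, no ASN in sight: indicator description
        return 'description'
    if value.lower() not in hints and saw_asn:
        # one-off token following an ASN: probably the ASN description
        return 'asn_description'
    # not in hints, not an address: default to indicator description
    return 'description'
```

Each branch is a probability bet, not a proof- which is exactly the trade we signed up for.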
Will there be feeds with a larger set of descriptions? Will there be feeds with a constant set of ASN descriptions? Of course, but again, we're solving for complexity here using probability, not for perfection. Solving for complexity results in scale, and scale gives us edge (ie: the ability to process more feeds with less overhead). We're making the bet that, over time, we'll be able to build upon these general patterns in a way that solves for these nuances too. We're betting that, because we can consume more data, that positive edge will make up for any accuracy discrepancies. Identifying and breaking out the pattern is the first step, and the most important. From there, you're simply iterating through edge cases- which then become part of the pattern. Lather, rinse- repeat.
Confidence and Probabilities
Recall in our last post, I described some simple ways to determine confidence automatically based on the indicator type and how many tags we applied. URLs are probably higher confidence by default, especially if we've given them 2+ tags. IP addresses are probably lower unless they have 3+ tags associated with them. Even then they're almost never a certainty, if only because you can hide a LOT of users (websites, domains, urls) behind a single IP address.
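The heuristic reads naturally as a small lookup. The numeric scores below are illustrative placeholders (a 0-10 scale is assumed), not values from the earlier post:

```python
def confidence(itype, tags):
    """Heuristic confidence score from indicator type and tag count."""
    if itype == 'url':
        # URLs start higher; 2+ tags bumps them further
        return 8 if len(tags) >= 2 else 6
    if itype in ('ipv4', 'ipv6'):
        # many sites can hide behind one IP, so cap IPs lower
        # even with 3+ tags
        return 7 if len(tags) >= 3 else 5
    return 6
```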
We now have easy-to-use libraries that leverage the power of SKLearn and can be applied to URLs, FQDNs and IP addresses. With that, we can apply probabilities to the indicator and thus to our decision making process. If we process a URL that matches one of the prediction libraries, we can bump up the confidence of its tag, or even re-tag it as something else. We can apply similar logic in detecting potential false positives. For instance, if a URL is found but the library detects it as non-suspicious, maybe we bump its confidence down? Toss it into a separate feed for triage? Or even feed it back into the machine learning process to improve our models.
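Wiring a prediction library into the pipeline might look like the sketch below. The `predict` callback is a hypothetical stand-in for whichever library you use- assume it returns a probability between 0 and 1 that the indicator is malicious; the thresholds and the 0-10 confidence scale are illustrative:

```python
def adjust_confidence(indicator, base_conf, predict):
    """Nudge an indicator's confidence using a model's probability."""
    p = predict(indicator)
    if p >= 0.8:
        # model agrees it's suspicious: bump confidence up
        return min(base_conf + 1, 10)
    if p <= 0.2:
        # likely false positive: drop it (or route to a triage feed)
        return max(base_conf - 2, 0)
    # model is on the fence: leave the heuristic score alone
    return base_conf
```

In practice you'd also log the low-probability hits somewhere, since they're exactly the samples worth feeding back into model training.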
As you start browsing through the code, you'll notice something. It seems we're pushing a lot of the YAML complexity into the code, which over time makes for a lot of IF/ELSE statements- almost as if we've traded configuration complexity for code complexity, and I'm not sure which is worse. That's OK; we're prototyping here. As we figure out the general patterns, those IF/ELSEs will be re-factored and/or pushed down into their own separate libraries. The more we can abstract out of the code we have to look at day to day, the easier it is for our brains to pick apart the new problems that need to be solved.
There is one other nuance- while I have been careful not to introduce too much SKLearn and NLTK early on, we're very quickly starting to overlap much of their functionality. What I've learned over the years is that it's better to prototype the problem out with as much of your own code as possible first. Do "the job yourself" first, then figure out what other libraries exist that help you solve what you need. In a lot of cases your code will be good enough and the problem will be solved.
Solve Real Problems
Over time, if that problem is important enough (worth your time), you'll slowly start refactoring pieces of it with more common abstractions built by smarter people. You'll intuitively understand why you need things like SKLearn and NLTK, and you'll integrate them as you hit the problems they're trying to solve. Going too deep too fast comes with its own set of issues, and you'll spend years of your life down paths that really weren't worth your time. Speaking from personal experience... If you're lucky, you may even find yourself in a position to help improve those frameworks with your unique take on the problem space.
The real problem we're trying to solve here is context. We're lifting a bunch of "tokens", usually more than 3 characters long, surrounding them with context and applying a probability value to them. All this with the express purpose of taking the high value indicators and applying them to our defenses in real-time. Not trivial, but not hard either. I'm not an SKLearn or NLTK expert- but I do know what it feels like to accidentally block netflix.com at the border.