F your formats, just show me the data.

F@#$! your formats, just gimme the data.

That was the title of a talk I gave in 2011 at a private conference. I had been giving a lot of talks about standards in that era (IDMEF, IODEF, XML, etc). If we just all used these common formats, sharing data would be … easy! How wrong, very wrong, I was. I never thought that 7 years later this problem would still be so prevalent, and that we'd still be beating our heads against the wall trying to solve it.

That's not to suggest [traditional] standards don't have their place. These days, they (the non-TLS kind) are typically developed by non-operators who do not have a firm grip on reality (or the problem space). Thus, they never really gain any traction outside of their immediate bubbles. What's the most prevalent standard still in use today? CSV, or some form of 'delimited pattern'. Why? It's still by far the easiest for most humans to present, fetch and process. I can read it as a human and parse it just as fast as a machine. I only need to learn how to do line and comma splits [in python].

Everyone then ends up with a [python|perl] script to parse every different feed they want. We did this in the early days of CIF, we had a perl script for each feed. It was … obnoxious at best. Then I had an idea, what if we rolled these common things into a common tool? What if that tool could pull and parse MOST types of feeds we'd run across? What if it just needed a configuration file to help HINT to the tool how to map the various feeds? From that, SMRT was born.

Repeating History

Over the years, I've watched other projects repeat the same mistakes we made back in the early days: just write another 20 lines of code for each feed we want to process. Good luck with all that overhead. Actually, I'm pretty sure the various SEMs do too, for each log type they want to process. Same wheel, 6 sides, and everyone continuously re-inventing it. Eventually the overhead crushes you (costs, time, customers..). Developers love creating complexity, not reducing it. It helps them own something, or stay relevant, or something. Not quite sure. We all know technical debt is a soul killer, but instead of striving to reduce it, we end up with more of it. I digress..

"CIF-SMRT" was finally broken out into a standalone tool. This tool enables users to fetch and normalize a threat intel feed [csv|tsv|xml|json|taxii|..] using nothing more than a YAML file. This YAML hints to SMRT how it should map the various indicators to a simpler standard, csirtg-indicator. From there we can translate to the last mile: Table, Bro, CIF, CSIRTG, ElasticSearch, SYSLOG, etc. This is neat. YAML is simple and you don't really need to be a programmer to learn it. In 90% of the use cases, to pull a feed, you take one of our examples and a few minutes later you're done!

The Good

There are a lot of benefits to this model: very little programming, no more code to maintain, and you get to leverage our lessons learned. Things like:


  • HTTP HEAD check, to make sure we're not abusing the bandwidth of those providing us with their data. If they didn't update the feed, why should we download it again?

  • Memory performance, processing the feed as a pipeline, not as a giant chunk of memory.

  • Archival, keeping track of what indicators we've already processed. Be smart about your pipeline.

  • Outputs, most of the outputs you'd want to push a feed into have already been written and most importantly battle tested in live production (large ISP like) environments. Like Bro, Snort, etc.

  • Unicode. Ugh. Unicode. Ever have a script get tripped up on a latin character set? URLs are THE WORST.

  • Plugins for non-HTTP transports. Easy to plug in non-trivial, non-HTTP style parsers.
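To make the HTTP HEAD idea from the list above concrete, here is a minimal sketch in Python. It is not how csirtg-smrt does it; the function names and the choice of the `ETag` and `Last-Modified` headers are assumptions made for illustration. The idea: ask the server for headers only, compare them to what we saw last time, and only download when something changed.

```python
import urllib.request


def remote_fingerprint(url, timeout=10):
    """Send an HTTP HEAD request and return (ETag, Last-Modified).

    Either value may be None if the server does not send that header.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("ETag"), resp.headers.get("Last-Modified")


def needs_download(cached, current):
    """Decide whether to re-fetch a feed.

    `cached` is the fingerprint stored from the previous run (or None),
    `current` is the fingerprint from a fresh HEAD request.
    """
    if cached is None or current == (None, None):
        # Nothing stored yet, or the server gives us nothing to compare
        # against: play it safe and download.
        return True
    return cached != current
```

A caller would store the fingerprint next to the cached copy of the feed, then call `needs_download(stored, remote_fingerprint(url))` before each fetch.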

The Ugly

As I've used csirtg-smrt, there are still some things that bother me about it. While it's super memory efficient, the code (and the pipeline it represents) is too hard to read, let alone maintain. It also turns out that, as part of streamlining that pipeline (eg: to use less memory, since earlier versions of CIF and cif-smrt were memory hogs), I wrote the pipeline backwards. Don't get me wrong, it works just fine for what it is. However, if you wanna take your threat intel consumption to the next level, you need to load and peek at the content first, before you determine how you're going to process it. We didn't do that. We assumed you knew what you wanted to do with a feed before you processed it. That's not wrong, but it restricts what kind of magic you can do to help reduce complexity.

Lastly, there's too much YAML dependency. I like that YAML is simple, and I LOVE the fact that, in order to pull a feed, you just tweak one of the examples. However, in 2018 THIS STUFF SHOULD BE MAGIC. There should be less code to maintain and less YAML to define what you're trying to do. There are only so many ways to really skin a cat, why should threat feeds be any different?

For instance, most (93.7%) of HTTP-based feeds are one of:

  • Some form of delimited (csv, tsv, pipe, semi-colon, etc..)

  • Some bastardized form of XML

  • JSON

Why do I need to configure for that? Does the first line of the feed start with a "<"? Then it's XML. Does it start with a "[{" or "{" and end with the matching closing bracket? Then it's JSON. Neither? Well then, it must be a text feed. Is the number of commas the same in line 1 as it is in lines 2, 3 and 4? Then it's delimited. Now you must be thinking to yourself: "well self, that's obvious!". YES. YES IT IS! Next you're left with the problem of mapping that 'record' to an actual normalized indicator surrounded with metadata (timestamps, tags, descriptions, confidence, provider).
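Those questions translate almost directly into code. The sketch below is only an illustration of the checks described above, not the csirtg-smrt implementation; `sniff_format` and `looks_delimited` are made-up names, and skipping `#` comment lines is an assumption about what feeds tend to look like.

```python
def sniff_format(text):
    """Guess 'xml', 'json' or 'delimited' by peeking at the raw feed text."""
    body = text.strip()
    closers = {"[": "]", "{": "}"}
    if body.startswith("<"):
        return "xml"
    # JSON: starts with "[" or "{" and ends with the matching closer.
    if body[:1] in closers and body.endswith(closers[body[:1]]):
        return "json"
    # Anything else is treated as a plain text (delimited) feed.
    return "delimited"


def looks_delimited(text, delimiter=",", sample=4):
    """True if the first few data lines all contain the same, non-zero
    number of delimiters."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")  # assumption: '#' = comment
    ][:sample]
    counts = {line.count(delimiter) for line in lines}
    return len(lines) > 1 and len(counts) == 1 and counts.pop() > 0
```

Checking only the first and last characters is deliberately crude. A stricter version could hand the body to a real JSON or XML parser and fall back to "delimited" when parsing fails.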


I'd like to say I had this (relatively obvious) epiphany one day, but it was really the result of an odd DARPA project I was involved with a few years ago. For whatever reason we were trying to tackle the problem of parsing output from `ps aux`. Have you ever looked at the output from `ps aux`? Turns out it's a really interesting problem because THE RECORDS SOMETIMES HAVE LINE BREAKS EMBEDDED IN THEM. Let that sink in for a minute. You type in `ps aux` and on some systems, there's a line break in the record itself. Not just that, but only SOMETIMES (eg: not all the time).

How do you parse for that? Well, the answer I gave at the time (sadly) was "just write a tool that parses it". Pretty soon you have 50 tools that each parse something similar, but just a smudge different. Here's the joke: when you look at that output, your eyeballs already do this. They look at the entire table and pick out a few things. They spot and mentally group things like line breaks, commas, spaces. If your eyes can do this, why can't we program for that? Peek at the whole table, what's the most common token? Commas and newlines? Our brains intuitively understand how to break those up, why doesn't our code? This isn't new, but it's rarely implemented well.
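Here is a rough sketch of that "peek at the whole table" idea. The candidate list and the scoring rule (the fraction of lines that agree on how many times a token appears) are assumptions chosen for illustration; `guess_delimiter` is a hypothetical name.

```python
from collections import Counter

# Assumption: these are the separators worth considering, in priority order.
CANDIDATES = [",", "\t", "|", ";", " "]


def guess_delimiter(text):
    """Return the candidate token whose per-line count is most consistent
    across the whole table, or None if nothing stands out."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    best, best_score = None, 0.0
    for delim in CANDIDATES:
        # How many lines contain this token 0 times, 1 time, 2 times, ...
        per_line = Counter(line.count(delim) for line in lines)
        count, freq = per_line.most_common(1)[0]
        if count == 0:
            continue  # most lines don't contain this token at all
        score = freq / len(lines)  # share of lines that agree on the count
        if score > best_score:
            best, best_score = delim, score
    return best
```

Python's standard library ships something in the same spirit, `csv.Sniffer`, which is worth trying before rolling your own. The point here is just that the heuristic fits in a dozen lines.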

Reduce Complexity.

Let's start with the easy ones: timestamps. We can EASILY loop through a record and determine what looks like a timestamp. Then, by comparing them, we can further determine which is "first_at", "last_at" and "reported_at". As for the indicator itself, there's a function for that and we get `itype` for free. Now, without much effort, we've got an indicator with some timestamps. However, we're usually left with a few oddball, plain text things to sort through: tags, descriptions and "asn descriptions".
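A minimal sketch of the timestamp step, under some loudly stated assumptions: the list of formats is just a sample, and the ordering rule (earliest is `first_at`, the next is `last_at`, the latest of three or more is `reported_at`, a lone timestamp is `last_at`) is my guess at a sensible default, not something defined by csirtg-indicator.

```python
from datetime import datetime

# Assumption: a small sample of formats; real feeds need a longer list.
FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%d/%b/%Y:%H:%M:%S",
]


def parse_timestamp(value):
    """Return a datetime if the field looks like a timestamp, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


def assign_timestamps(record):
    """Map the timestamps found in a record (a list of strings) onto
    first_at / last_at / reported_at by comparing them."""
    found = sorted(ts for ts in map(parse_timestamp, record) if ts is not None)
    if not found:
        return {}
    if len(found) == 1:
        return {"last_at": found[0]}  # assumption: one timestamp = last seen
    result = {"first_at": found[0], "last_at": found[-1]}
    if len(found) >= 3:
        # Assumption: the report is written after the last sighting.
        result["last_at"] = found[-2]
        result["reported_at"] = found[-1]
    return result
```

For example, `assign_timestamps(["1.2.3.4", "2018-03-02 10:00:00", "ssh", "2018-03-01 09:00:00"])` picks out the two dates and labels the earlier one `first_at` and the later one `last_at`.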

In a lot of cases, we can check to see if we've parsed out an ASN, which should hint whether we even need to figure out the ASN description. Is the itype an 'ipv4'? If no, then we can ignore this. If yes, then we move on to teasing out description and tags. What do we know about tags? Most feeds don't have them, and if they do, that field probably doesn't contain a lot of spaces, whereas descriptions probably will. Tags will typically only be 1, 2 or maybe 3 words and use something to separate them other than a space. ASN descriptions may contain a space or two, and if they contain a space, they're also likely to contain a "-" or ".", whereas a normal description will contain >= 2 spaces and likely no odd symbols. If we really wanted to make sure ASN descriptions don't trip us up, well.. there are feeds of those we can use too. Unlike free form descriptions, ASN descriptions are rigid and finite, thus making them easy to filter out if we have to.
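Written down as code, those rules of thumb might look like the sketch below. The thresholds come straight from the paragraph above; the function name, the separator set for tags, and the "unknown" fallback are assumptions. It is meant to run only on the leftover text fields, after the indicator and timestamps have been pulled out.

```python
import re


def classify_text(value):
    """Guess whether a leftover text field is tags, an ASN description or
    a free-form description. Returns 'unknown' when the rules don't decide."""
    value = value.strip()
    spaces = value.count(" ")
    has_asn_symbol = "-" in value or "." in value

    if spaces == 0:
        # Tags: 1 to 3 words, separated by something other than a space.
        words = [w for w in re.split(r"[,|;/]", value) if w]
        return "tags" if 1 <= len(words) <= 3 else "unknown"
    if spaces <= 2 and has_asn_symbol:
        # ASN descriptions: a space or two, plus a "-" or ".".
        return "asn_desc"
    if spaces >= 2:
        # Plain descriptions: several spaces, no tell-tale symbols needed.
        return "description"
    return "unknown"
```

For instance, `classify_text("ssh,scanner")` comes back as tags, while `classify_text("CLOUDFLARENET - Cloudflare")` comes back as an ASN description. Anything that lands in "unknown" is a good candidate for a YAML hint.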

Integers are a bit tougher. Are you looking at an internal primary key ('id')? A port number? A protocol number? An aggregated count? Port numbers are easyish: we have a list of common ports we'd likely see in most feeds (21, 22, 23, 80, 443, 3389, etc). Does this INT fall in that category? If yes, is it the only INT you have that does? Does something in the tags or description hint at the port (eg: 'scanner', 'ssh', 'brute-force')? If yes, run with it. If no? Raise an error. We can always mix in some YAML if we really need to.
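As a sketch, the port logic above could be as small as this. The port list is the one from the paragraph; the hint words beyond 'scanner', 'ssh' and 'brute-force' are additions of mine, and `guess_port` is a hypothetical name.

```python
COMMON_PORTS = {21, 22, 23, 80, 443, 3389}

# 'scanner', 'ssh' and 'brute-force' come from the text above; the rest
# are assumed extras for illustration.
PORT_HINTS = {"scanner", "ssh", "brute-force", "telnet", "ftp", "rdp", "http"}


def guess_port(ints, text_fields):
    """Pick the port out of a record's integer fields.

    Returns None when no integer looks like a common port, and raises
    ValueError when the record is ambiguous (time for a YAML hint).
    """
    candidates = [i for i in ints if i in COMMON_PORTS]
    if not candidates:
        return None
    hinted = any(
        hint in field.lower() for field in text_fields for hint in PORT_HINTS
    )
    if len(candidates) == 1 and hinted:
        return candidates[0]
    raise ValueError("cannot infer port from record; add a YAML hint")
```

So `guess_port([1042, 22], ["ssh brute-force"])` runs with 22, while a record with both 22 and 80 in it raises and waits for a human.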

Is the INT > 65535? Likely an internal ID, less likely a count. How do we check this? Take the distribution of this column across all the records in our feed: do the results cluster around our value, or are they all unique and linear? IDs go up, counts tend to be distributed normally. Similar logic can be applied to ports. You'd likely see a semi-normal distribution of ports if the column was related to IP ports. As for snagging the 'provider', that can be easily regex'd out of the remote url we used to pull the feed in the first place.
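Two small sketches for this paragraph. The first treats a column as an ID when every value is unique and the values only ever go up, which is a deliberately blunt stand-in for a proper look at the distribution. The second pulls the provider out of the feed URL with a regex. Both function names are hypothetical, and the URL in the example is a placeholder.

```python
import re


def looks_like_id(values):
    """True if an integer column is 'unique and linear': no repeats and
    strictly increasing from record to record. Counts and ports tend to
    repeat and cluster instead."""
    if len(values) < 2:
        return False
    all_unique = len(set(values)) == len(values)
    increasing = all(b > a for a, b in zip(values, values[1:]))
    return all_unique and increasing


def provider_from_url(url):
    """Extract the hostname from the URL the feed was pulled from."""
    match = re.match(r"^[a-zA-Z][a-zA-Z0-9+.-]*://([^/:?#]+)", url)
    return match.group(1).lower() if match else None
```

For a feed pulled from `https://feeds.example.org/blocklist.csv`, the provider would come out as `feeds.example.org`. A sturdier ID check would also tolerate gaps, resets and feeds sorted newest-first.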

Confidence (with a simple scale of high, almost-high, low, none) is even easier. It's in a feed? Start out as "almost-high". If it's highly specific (url, email address, hash) and has more than 2 tags, it can probably be bumped to a "high". If it's an ipv4|6 address, has 2+ tags, a port, a timestamp and a tag that includes 'scanner' or 'ssh' or 'telnet' etc.. you can probably bump that to a 'high'. If it's a less specific fqdn, it's probably a "low". Not perfect and can be tweaked with the YAML, but you get the idea.
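The confidence rules read like a small decision table, so here is one way to sketch them. The rule order and thresholds follow the paragraph above; the itype names and the function signature are assumptions for illustration, not the csirtg-indicator API.

```python
# Assumption: these itype names stand in for "highly specific" indicators.
SPECIFIC_ITYPES = {"url", "email", "md5", "sha1", "sha256"}
SERVICE_TAGS = {"scanner", "ssh", "telnet"}


def guess_confidence(itype, tags=(), port=None, has_timestamp=False):
    """Return 'high', 'almost-high' or 'low' for an indicator from a feed."""
    tags = set(tags)
    # Highly specific indicator with more than 2 tags.
    if itype in SPECIFIC_ITYPES and len(tags) > 2:
        return "high"
    # IP address with 2+ tags, a port, a timestamp and a service-style tag.
    if (
        itype in {"ipv4", "ipv6"}
        and len(tags) >= 2
        and port is not None
        and has_timestamp
        and tags & SERVICE_TAGS
    ):
        return "high"
    # Less specific indicators start lower.
    if itype == "fqdn":
        return "low"
    # It made it into a feed, so it starts out as almost-high.
    return "almost-high"
```

An ssh scanner on port 22 with a timestamp and two tags gets bumped to "high"; a bare fqdn stays at "low". Anything that needs a different answer is what the YAML override is for.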


Without ANY machine learning or NLTK magic, you have a very basic and generalized pattern (or "algo" in hipster speak) that can parse and normalize most types of feeds. We can always mix in YAML if we need to tweak the feed (eg: ensure that the confidence is always high, or the description is always picked up correctly). The point is, for most things, we shouldn't have to. Less noise, fewer things to maintain, and more importantly, silently brainwashing ^H^H^H^H retraining users how to think about feeds the way we've come to know and understand them as operators.

I started prototyping this week, the pattern is very simple. However, you can start to see where the intersection between this and some simple machine learning could begin to feel a bit like, FM.
