Threat Intel- The Last Mile.

We had an interesting thread pop up on the cif-users list this week-

i know this is a vague question.. but what is a “best practice” for pulling the cif data and throwing it into bro or somewhere else for a small business?
— Darrell Miller

This isn't the first time I've seen this type of question. In fact, we have a whole section of doc "dedicated" to trying to address it. I say trying, because.. well, if it were perfect, there'd have been no reason for this post.

Integrations are tough for any ecosystem

There are always a lot of moving parts. Each IDS, firewall, and magic button has its own dialect for how it'd like to pull in the data, what it cares about and, more importantly, what it DOES with it. When we first started, we focused on dedicated humans whose only job was more or less "running the IDS cluster" or "running the firewall" or or or.. So integration wasn't that big of a problem [for us]: your product either worked with a few select products, or users were forced to spend their time duct-taping it together.

In the early days, our user base was made up of people whose main role was getting the IDS cluster to work, which meant figuring out how to get feeds into it. I know, because I was one of those; that's where CIF and CSIRTG came from, the desire to get all sorts of feeds into Snort and monitor a 1Gb line (~2006). The problem we did have [as a niche community in EDU] was the breadth of technology being used to process the feeds: everything from IPTables, to Snort, to Bro, PaloAlto, Cisco routers, customized DNS sinkholes, Suricata, javascript plugins, python plugins, perl plugins and, best of all, good ole CSV (who doesn't love a good spreadsheet for sharing IOCs amongst friends).

What we learned...

Care about the last mile and develop in such a way that it's not just an afterthought- it's a plugin. When someone asks "hey, how do I get my data into $NEW_SEXY!?", the answer is really no more than "start with the CSV output" and then "probably ~20 lines of python if you want to get more granular". This methodology pushes you into doing a few things really well. First, your focus is on enabling people to USE your product in ways you may not have foreseen (e.g.: listening to the market direction). Secondly, it forces you to remove a lot of complexity from the CORE: the more modular the last mile is, the easier it is to write for, tweak, augment, adapt. The less complex the CORE is, well, same thing. Things start moving faster. Lastly, it enforces that the most important part of your project is the end product, which is the only thing your users really care about. They want the data, in a feed that makes sense to them, with as little effort as possible.
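To make "~20 lines of python" concrete, here's a minimal sketch of that advice: read the CSV output, translate each row. This is NOT CIF's actual plugin API, and the 'indicator' column name is an assumption about the CSV header- it's just the shape of the idea.

# A minimal sketch of "start with the CSV output".
# Assumes a CIF-style CSV feed with an 'indicator' column header.
import csv
import sys

def translate(row):
    # Swap this one function out for $NEW_SEXY's dialect; the rest stays put.
    return row['indicator']

with open(sys.argv[1]) as fh:
    for row in csv.DictReader(fh):
        print(translate(row))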

What astonishes me...

Is how many projects don't appear to understand this. They develop a tool, or a framework around a core concept, and delivery of that data is still an afterthought. Their core is about moving highly complex data around, which works ok- but you can almost tell by the number of "BLAH-TO-BRO" repos they have that integration wasn't thought about until some sales engineer pushed them to create it. It's awkward, disjointed, doesn't work great with the core, and may have been written by some interns (or worse- sales engineers) who don't really understand (read: "love") the core. Do we have these kinds of repos? Of course.. everyone does. From experience though- these are the repos that bother me the most. They're the ones that scream out "we couldn't figure this problem out natively, which means our CORE and plugins aren't really thinking about that problem, so this repo is a work-around". In some instances, you need these to help work through a problem, and that's OK. But for a lot of projects, they just feel… odd.

That said- most of our users get all they need from:

$ cif --tags phishing|botnet|malware|exploit|scanner --itype .. --format bro|csv|snort|json|.... > feed.txt

Most of those "just work" because all the logic that prevents those plugins from "inserting netflix.com into a blocklist" is built into the core. Adding a new feed output that only has to think about data translation is trivial: ~20 lines of python. The math is simple: the less complex your last mile needs to be, the more plugins you can write. The more plugins you write, the more users you'll attract. The more users you attract, the more problems you solve, because, as we all know.. users bring problems to be solved. Solved problems get pushed back into the core, making the end result more valuable, thus attracting more plugins! err, users!
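For the curious, here's a back-of-the-napkin sketch of what one of those outputs could look like when the core has already done the filtering: pure data translation, nothing else. The indicator fields, the plugin shape and the SID numbering are assumptions for illustration, not CIF's actual interface.

# Hypothetical last-mile output plugin: indicators in, Snort rules out.
# The core has already whitelisted and scored everything it hands us.
class SnortOutput:
    SID_BASE = 1000000  # made-up starting SID for generated rules

    def __init__(self, indicators):
        self.indicators = indicators

    def __iter__(self):
        rule = ('alert ip {} any -> $HOME_NET any '
                '(msg:"CIF {}"; sid:{}; rev:1;)')
        for n, i in enumerate(self.indicators):
            yield rule.format(i['indicator'],
                              ','.join(i.get('tags', [])),
                              self.SID_BASE + n)

That's it- no scoring, no whitelisting, no network calls. Everything hard lives in the core.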

The problem, however, appears to have shifted over the last decade

We used to have a [very small] user base made up of operators whose ONLY job was operating the IDS. However, the technology and market space has matured such that it's becoming the mainstream responsibility of network admins, or "people who need an IDS, but it's not their full-time job". That's not to suggest this group hasn't always been there; ironically, this is how I got into the space, a network "admin" with a pet Snort sensor. It took me 3-4 years to go from that to "we will pay you to JUST do Snort for a very large network". These network admins can certainly just go buy something- but there's a segment of them who won't. They're seeing the value of active monitoring vs the passive spending of resources on outsourced technology. Nobody will care about your network except you, no matter how much you pay them.


As the last mile has been more or less solved

Meaning there is a clear methodology for getting threat intel into the network. What's lacking, especially in small business, is a good understanding of how to apply and tune these technologies. What confidence level should I use? How often should I update the feeds? Which feeds should I use? How many days back should the feeds cover? IS THERE JUST A STUPID @$%@^! OPTION I CAN USE THAT WILL GIVE ME THE OPTIMAL FEED WITH THE MINIMAL PAIN!? And more importantly, HOW WILL I KNOW IT'S WORKING!?

There are a TON of resources out there, everything from documentation, to conferences, to trainings built around addressing these problems. I know- I've lived my life running around attending them, taking them, teaching them. They're both expensive and a lifestyle. A lifestyle most small business combat operators don't have time for. Ten years ago, this made sense: the space was still immature and we were still figuring things out. Today- well, I spend a lot of time learning from others on the YouTube.

No excuse

These days, there is no excuse, and for the record, CIF is just as culpable for this as any other tool. It does a great job of producing data; it does a TERRIBLE job of helping you know what to do with it. It assumes you have endless hours in the day to read the doc, read the code and pray you don't block something by accident. "Here's a loaded gun, kids, the firing range is over there.. go figure it out, just don't point that thing at anyone." That's partly what you get for free, but it's not how you build goodwill in the long run.

We have a lot of new ideas coming in CIFv4: streaming data (ZeroMQ, WebSockets), graphs, machine learning and statistics APIs, to name a few. A subtle yet more important feature-set will be making the last mile more intelligent. How do I translate the `cif --i-am-a-small-business-using-bro` flag into something useful that just does the right thing? Oh, you want the malware DNS feed at a confidence of 7, you want it every 15 minutes and you want to place it here. You're setting up a DNS sinkhole? You probably want the malware/botnet/phishing DNS feeds with a confidence of 8, in bind zone format, every hour. You're running a Snort sensor? You probably want every feed we have, starting at a confidence of 7.
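None of this exists yet, but the sketch is almost embarrassingly simple: a lookup table that translates a use-case into concrete feed parameters. The profile names and defaults below are made up, pulled straight from the examples in the paragraph above- nothing here is a real CIF flag yet.

# Hypothetical "profile" table- none of these flags exist in CIF today;
# the defaults simply mirror the examples above.
PROFILES = {
    'small-business-bro': dict(tags=['malware'], itype='fqdn',
                               confidence=7, format='bro',
                               update_every='15m'),
    'dns-sinkhole': dict(tags=['malware', 'botnet', 'phishing'],
                         itype='fqdn', confidence=8, format='bind',
                         update_every='1h'),
    'snort-sensor': dict(tags=[], confidence=7, format='snort',
                         update_every='1h'),  # empty tags == every feed
}

def resolve(profile):
    # Translate a human-friendly flag into the knobs users shouldn't
    # have to think about.
    return PROFILES[profile]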

What good is threat intel if you have to spend time thinking about it?
