Prototyping CIFv4 - Neural Networks out of the Box

We're ~120 hours into our little adventure of prototyping a next generation threat intel platform. In that timeframe, 4 releases have been tagged and pushed to Docker. Recently you may have seen a little detour into what may seem like the wild west of artificial intelligence. The core driver of CIF has always been to push the limits of traditional intelligence architecture and scale it. In years past, this meant taking indicators from point A to point B as quickly and efficiently as possible. URLs to an IDS, IPs to a firewall, all without blocking Netflix or Facebook in the process.

Traditional threat platforms (you know, the ones that feel like Facebook) make it difficult to both import and export your intel. They're the kind that make you wonder what they're doing with YOUR data. Or they're the ones you deploy locally, but that act as Yet Another Console because the developers don't quite understand what it means to scale, or what it means to move large gobs of data in and out of the platform every 5 minutes (or in real-time).

The majority of these platforms have made waves in our industry, but they require a few things:

  • Analysts working day in and day out on the platform (eg: expensive eyeballs, we're shifting work- not automating it)

  • Indicators to work with (theirs or yours, doesn't matter)

  • Access to lots of [sometimes very expensive] data

Maybe some of these platforms have a magic "machine learning threat score" eight ball, but it's usually a black box and limited by their view of the world, not yours. You want to retrain their algo? Good luck. Maybe some organizations need that, but it's my belief those of us at the edge are smart enough to push the envelope further. If the model isn't local to you, it might help you in the short term, but there are long term consequences to ignorance.

Your own black box

As our various deep learning (eg: TensorFlow) models have been published, they've been tested internally with CIF against real-world data. The ultimate goal has been to apply neural network driven (eg: mathematical) probabilities to data as it enters CIF. Zero friction: the data is crunched by the gatherers before it hits the database. This logic has also been placed in FM (formerly 'smrt'), so even if you're not working directly with CIF, you can plow through a feed and get the match, for free.

Let that sink in for a moment. Instead of relying on human driven 'confidence' values, which are subjective in nature, your data takes on a different context. With probabilities, you can benchmark data 'you are extremely confident in' and, based on statistics, weed out the ~16% of that data that may cause you problems. Better said, you can take LOWER confidence data (eg: MORE data) and pick out the high probability indicators with confidence.
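To make that concrete, here's a minimal sketch of sorting a feed by model probability instead of human confidence. Everything in it is an assumption for illustration: the `Indicator` class, the 0-10 confidence scale, the hard-coded probabilities (in practice they'd come from whatever model you've trained) and the 0.8 cutoff are all made up, and none of it is CIF's actual API.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    value: str
    confidence: int     # human-assigned, subjective (hypothetical 0-10 scale)
    probability: float  # model-assigned, 0.0-1.0 (hard-coded here for illustration)


def split_by_probability(indicators, threshold=0.8):
    """Split a feed into (keep, drop) using the model probability only.

    The threshold is an arbitrary example value; you'd tune it against
    data you have already benchmarked.
    """
    keep = [i for i in indicators if i.probability >= threshold]
    drop = [i for i in indicators if i.probability < threshold]
    return keep, drop


# Hypothetical feed: two 'high confidence' and two 'low confidence' entries.
feed = [
    Indicator("http://example.com/login.php?id=1", confidence=9, probability=0.97),
    Indicator("http://example.org/index.html", confidence=9, probability=0.12),
    Indicator("http://example.net/wp-admin/x.exe", confidence=4, probability=0.91),
    Indicator("http://example.com/about", confidence=4, probability=0.05),
]

keep, drop = split_by_probability(feed)

for i in keep:
    print(f"keep  conf={i.confidence}  p={i.probability:.2f}  {i.value}")
for i in drop:
    print(f"drop  conf={i.confidence}  p={i.probability:.2f}  {i.value}")
```

Notice that one 'high confidence' entry gets dropped and one 'low confidence' entry gets kept. That's the whole point: the human label and the probability are answering different questions.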

The law of large numbers stipulates that the more occurrences of something, the more stable and PREDICTABLE the long term results will be. The more highly confident and PREDICTABLE data you are able to use, the better off you'll be in the long run. You'll be hunting like a quant, not like an analyst, and as we all know, the quants make all the money.
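If you want to see the law of large numbers for yourself, here's a small simulation using only the Python standard library. It assumes, purely for illustration, that each indicator is independently 'bad' with a fixed rate of 16% (borrowing the figure above), and the `observed_rate` helper is a made-up name. Real feeds are messier than coin flips, so treat it as a sketch of the idea, not a model of your data.

```python
import random


def observed_rate(n, true_rate=0.16, seed=0):
    """Draw n simulated indicators and return the fraction flagged as bad.

    Each draw is an independent coin flip with probability true_rate.
    The seed keeps the run repeatable.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < true_rate)
    return hits / n


# The more indicators you look at, the closer the observed rate
# tends to sit to the underlying rate.
for n in (10, 100, 1_000, 10_000, 100_000):
    rate = observed_rate(n)
    print(f"n={n:>7}  observed={rate:.4f}  error={abs(rate - 0.16):.4f}")
```

With a handful of samples the observed rate can land almost anywhere. With a hundred thousand it should hug the true rate, which is what makes the long term results predictable.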

Paradigm Shifts

There's a paradigm shift happening in the industry. Hunters are becoming overwhelmed by the number of indicators they have to try and keep up with. That, combined with the amount of data "available for sale", is daunting. Every threat intel company has a magic feed that, if you go without it, could SINK YOUR COMPANY! What's worse? They're probably not wrong. Each of them does have their unique view of the Internet, but it's really really hard to justify paying for all-the-feeds, vs something like cybersecurity insurance.

How do you even measure if you're moving the needle? How many of those feeds have non-overlapping data, but massively overlapping patterns? The cost of observing and training things like neural networks on those patterns is dropping... like a rock. The cost of these feeds is not, and the market is starting to recognize that.

Spend $5,000 training a hunter how to use Keras and TensorFlow and you can probably save $120,000 (per year thereafter) on a feed. Those feeds may have great data, but do you care about the data, or the generalized patterns? The problem now is, how do you get your IDS to use a TensorFlow model, instead of a classical feed? Do you see where this is going, and why we didn't spend a lot of time on a UX for CIF? It's a platform with an API and SDKs, the real value is in its core.

At least for now you can start messing with deep learning models in CIFv4, out of the box. That's why CIF is designed the way it is, to push the envelope and help teach along the way. It's meant to solve some of the problems you might not realize you have yet.

