Prototyping CIFv4 - Developing a Threat Intel View using Graphs

For the longest time, I never really understood graphs. Not "chart" graphs, but networkx-style graphs. Time series stores, like "normal" SQL databases, make sense, but graphs tend to abstract away a bit of the time element: if you ignore time, what do you see? In fact, it's actually pretty hard to apply that kind of temporal context to a graph and scale it well. Since the queries need to traverse the nodes, the more nodes you add, the harder it is to scale. If you have "lots of time nodes", the graph gets very large very quickly, but if you remove time, you remove context.

You can build a graph around time-based edges and nodes, or even place a graph within a series of "time"-based contexts. While this may make sense from a forensics timeline view (eg: historical), you do lose a bit of the efficiency in the query mechanisms that make a graph powerful. For "building a threat feed" a graph almost makes sense: you're building a feed "around a time frame", meaning you're already applying the context to the query, so the timestamps almost don't matter. For historical queries, however, those timestamps apply context to the results. They can be stored in a graph, but that makes it a little more awkward to search.

In most cases with threat intel, you're building a snapshot, either for a feed or for a historical search. Ideally what you want is an efficient storage mechanism for this, but with the ability to turn that snapshot (eg: context, temporal or otherwise) into a graph, or "picture", of what you're trying to view. There's a complication though: while most smaller instances of threat intel can definitely be stored in something simple like networkx, most security operators still don't understand much more than sqlite. It's not that they can't, they just don't have the time. So where's the balance?
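To make the "snapshot" idea a bit more concrete, here's a minimal sketch of it: the data lives in plain sqlite, the time frame is applied in the query, and only the rows that survive get turned into a graph. The `indicators` table, its columns, the sample rows, and the `snapshot_graph` helper are all made up for illustration. This is not CIF's actual schema or code.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (indicator, itype, provider, reported_at).
ROWS = [
    ("198.51.100.7", "ipv4", "feed-a", "2024-01-01T00:00:00"),
    ("evil.example", "fqdn", "feed-a", "2024-01-02T00:00:00"),
    ("198.51.100.7", "ipv4", "feed-b", "2024-01-03T00:00:00"),
    ("old.example", "fqdn", "feed-b", "2023-06-01T00:00:00"),
]


def snapshot_graph(conn, since, until):
    """Build an undirected adjacency map from rows reported in [since, until).

    The time frame is applied by the SQL query, so the resulting graph
    carries no timestamps at all.
    """
    graph = defaultdict(set)
    query = (
        "SELECT indicator, provider FROM indicators "
        "WHERE reported_at >= ? AND reported_at < ?"
    )
    for indicator, provider in conn.execute(query, (since, until)):
        # Link each indicator to the provider that reported it, both ways.
        graph[indicator].add(provider)
        graph[provider].add(indicator)
    return dict(graph)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE indicators "
    "(indicator TEXT, itype TEXT, provider TEXT, reported_at TEXT)"
)
conn.executemany("INSERT INTO indicators VALUES (?, ?, ?, ?)", ROWS)

# ISO 8601 timestamps in the same format sort correctly as plain strings.
graph = snapshot_graph(conn, "2024-01-01T00:00:00", "2024-02-01T00:00:00")
```

The storage stays boring and familiar, and the graph only exists for the window you asked about. Anything outside that window (like `old.example` here) never makes it into the picture.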


CIFv4 attempts to strike this balance by empowering one of its simple output formats to build a GEXF-style graph at the edge. Using this plugin, you can control the context (eg: timeframe, itype, etc) via the normal query flags and simply write the resulting output to a GEXF file that most graph tools can easily read. Moving forward, you'll likely see more REST-based support for graph-type queries to CIF. You'll also see other graph-like storage adapters (most likely titan, networkx, etc) in the source code, somewhat undocumented, as CIF continues to strike this balance.
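As a rough idea of what "query results in, GEXF file out" can look like, here's a small sketch using networkx. It is not the CIF plugin itself: the `results` list, its field names, and the `results_to_graph` helper are assumptions for illustration, and real CIF output will look different. The only library call that matters here is networkx's `write_gexf`.

```python
import networkx as nx

# Hypothetical query results; the field names are illustrative only.
results = [
    {"indicator": "198.51.100.7", "itype": "ipv4", "provider": "feed-a"},
    {"indicator": "evil.example", "itype": "fqdn", "provider": "feed-a"},
    {"indicator": "198.51.100.7", "itype": "ipv4", "provider": "feed-b"},
]


def results_to_graph(rows):
    """Link each indicator to the provider that reported it."""
    graph = nx.Graph()
    for row in rows:
        # Adding an existing node again just updates its attributes.
        graph.add_node(row["indicator"], kind="indicator", itype=row["itype"])
        graph.add_node(row["provider"], kind="provider")
        graph.add_edge(row["indicator"], row["provider"])
    return graph


graph = results_to_graph(results)

# Write the snapshot to a GEXF file in the current directory.
nx.write_gexf(graph, "snapshot.gexf")
```

The resulting `snapshot.gexf` can be opened in a graph tool such as Gephi, or read back with `nx.read_gexf`.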

It may be that things like networkx and titan significantly improve the performance of larger datasets and queries in CIF, but I also want to make sure users understand what's going on under the hood. Magic is one thing, usability is another. My feeling is that, combined with the new machine learning networks introduced in v4, these new graph-based approaches will help operators build a more complete view of their local threat landscapes.

What do you think?

Did you learn something new?