CIF started out as a means for moving threat indicators. In ~2006 we were publishing botnet C&Cs and phishing URLs to a web page. We didn't have fancy terms like "IOC" or "Threat Intel"; we were simply scraping HTML tables with some cruddy Perl and deploying that list to our Snort sensors. It was a very crude pattern, but we weren't interested in much more than the intended network effect. The more eyes on the collection we had, the better our chances at raising the bar. We didn't have to always outrun the bear, just everyone else who wasn't doing this. If we were detecting and cleaning up faster than everyone else, maybe they'd leave us alone.
Over time we started combining those 'indicators' with other datasets, and a few years later CIF evolved from that mess. What astonishes me to this day is how much of the industry focuses on the IOC part of this model, and not the larger delivery pattern. A stream of haphazardly described [massively XML-ized] IOCs doesn't scale very well when you're trying to detect at 10 or 100Gbps; aggregated feeds and models do. Are you aggregating your datasets based on time? What time frame is relevant to you? Are you measuring indicator decay, or blocking things from two years ago? Are you applying a whitelist? How are you building that whitelist? Is your dataset targeted towards your application (eg: applying a phishing bent to a URLs feed, or an IPv4 bent to a null-route feed)?
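Those questions boil down to a few lines of logic. Here's a minimal sketch of that kind of time-windowed aggregation; the function name and the dict shape are mine for illustration, not CIF's actual schema:

```python
from datetime import datetime, timedelta

def aggregate(iocs, window_days=45, whitelist=frozenset()):
    """Keep indicators seen within the window, dropping whitelisted
    entries and duplicates.

    `iocs` is a list of dicts like {"indicator": "192.0.2.1",
    "lasttime": datetime(...)} -- a made-up shape for this sketch.
    """
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    seen = set()
    feed = []
    for ioc in iocs:
        i = ioc["indicator"]
        if ioc["lasttime"] < cutoff:      # decayed -- too old to act on
            continue
        if i in whitelist or i in seen:   # whitelisted or a duplicate
            continue
        seen.add(i)
        feed.append(i)
    return feed
```

Change `window_days` and you've answered the "what time frame is relevant to you?" question; everything older than the cutoff simply decays out of the feed.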
In the early days, we had another side project that involved the trading of not just C&C addresses, but also Snort signatures. Snort signatures in and of themselves are a kind of data model. They represent not just specific data [although they can], but an attack pattern compressed into a rules-based language: 'identify whether someone is sending us too many SYNs within a certain threshold; if so, fire an alert'.
We originally tried trading these non-indicator-based signatures in a wiki, then an SVN repo. The problem was, these models tended to be too specific. It was non-trivial to apply them directly to a new network without tweaking them. For example, if a smaller site contributed a rule tuned for a /24 (eg: 10 SSH SYNs in 30s or less), a larger site with a /16 would have to [manually] tweak the rule in order to get positive value from it.
We tried this process for a bit, but it didn't really remove the human element from the equation. While the project was useful, it didn't gain much traction. Users would paste interesting patterns they found to the wiki, but over time it was easier to just use what came out of the box and tune it yourself. We tended to just trade sig ideas over the mailing list, because that was easier to engage with than the wiki. There were numerous attempts to standardize around this process; it just didn't really catch on.
Then there was the common-language problem: was there a way to normalize the pattern? For instance, if we knew the pattern was "too many SYNs in a short time-frame", could we automatically translate that to Bro? Suricata was easy, but what about Palo Alto? Or Cisco? IPTables? And so on..
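To make the idea concrete, here's a toy sketch of what that normalization could look like: one pattern dict, rendered into two rule languages. The schema is hypothetical (no such standard existed), and the renderers lean on features that really do exist in each target, Suricata's `threshold` keyword and iptables' `hashlimit` match:

```python
# A toy normalized model of "too many SYNs in a short window".
# The field names here are invented for this sketch.
PATTERN = {"proto": "tcp", "flags": "S", "count": 10, "seconds": 30}

def to_suricata(p, sid=1000001):
    # Suricata (and Snort) express rate limits with the threshold keyword.
    return (
        f'alert {p["proto"]} any any -> $HOME_NET 22 '
        f'(msg:"SSH SYN flood"; flags:{p["flags"]}; '
        f'threshold:type threshold, track by_src, '
        f'count {p["count"]}, seconds {p["seconds"]}; sid:{sid};)'
    )

def to_iptables(p):
    # iptables approximates the same idea with the hashlimit match,
    # so convert the count/window into a per-minute rate.
    per_min = p["count"] * 60 // p["seconds"]
    return (
        f'iptables -A INPUT -p {p["proto"]} --syn '
        f'-m hashlimit --hashlimit-above {per_min}/min '
        f'--hashlimit-mode srcip --hashlimit-name ssh_syn -j DROP'
    )
```

The /24-vs-/16 tuning problem from earlier becomes a one-field change to `count` rather than a hand-edit of every downstream rule.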
Instead, we built feeds of indicators that met the detection side of these patterns. For instance, if you applied these versions of the sigs and fed the results into our system, we could at least distribute the actionable take-away (eg: the attackers). You'd get a list of aggregated IP addresses matching the criteria, which, using our indicator library, could be easily adapted to the output of your choice. Very quickly you go from IDS output to IDS / firewall input across multiple sites with very little effort.
This was a total work-around to a much larger problem, but it worked. It did, however, require the heavier parts of CIF and its servers to be in the middle. More recently we've started prototyping a way to build that aggregation right into the indicator library itself. This makes the aggregation and whitelisting magic accessible to users outside of the CIF ecosystem, meaning our lessons learned from years of CIF can be applied to just about any stream of IOCs, CIF or not. It abstracts away the server requirement, so that should you need an aggregated feed of something, you're only a function call away.
We've also included some of the most basic whitelisting best practices so google.com doesn't show up in your feed by default. Additionally, when working with both v4 and v6 addresses, we include some great patricia-trie modules that aggregate whitelists over larger net-blocks. Want to whitelist a /16? No problem. It happens automagically, and as efficiently as we can [in Python].
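The netblock trick is easy to demonstrate without the trie. Here's a stand-in sketch using only the stdlib `ipaddress` module: a linear scan over the whitelisted blocks, fine for a handful of entries, where a patricia trie wins once the whitelist gets large. The class name is mine, not the actual module's API:

```python
import ipaddress

class NetWhitelist:
    """Linear-scan stand-in for a patricia-trie whitelist lookup.

    Whitelisting a /16 (or a v6 /32) covers every address inside it,
    which is the "automagic" aggregation described above.
    """

    def __init__(self, blocks):
        self._nets = [ipaddress.ip_network(b) for b in blocks]

    def __contains__(self, addr):
        ip = ipaddress.ip_address(addr)
        # ip_network.__contains__ already returns False on a v4/v6
        # version mismatch, so v4 and v6 blocks can live side by side.
        return any(ip in net for net in self._nets)

wl = NetWhitelist(["10.0.0.0/16", "2001:db8::/32"])
```

Checking `"10.0.5.9" in wl` then covers the whole /16 with one entry, no per-address bookkeeping.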
With the aggregation and whitelisting magic so tightly integrated with the indicator modules themselves, no extra glue is required to get them to interoperate. Take a stream of data, convert it to indicators, and pass that list through the feed logic. The outcome is a normalized dataset that can be translated into just about any kind of rules language. Your rules language isn't supported? Most new rule types can be accommodated in a few lines of Python.
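"A few lines of Python" isn't an exaggeration. Once the feed is a normalized list of addresses, each output format is just a formatting function over it; these renderers are hypothetical examples, not part of any shipped library:

```python
# Each rule type is a small formatter over the same normalized feed.
def to_iptables(feed):
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in feed]

def to_nullroute(feed):
    # The "ipv4 bent to a null-route feed" case: blackhole routes.
    return [f"ip route add blackhole {ip}/32" for ip in feed]

feed = ["192.0.2.1", "198.51.100.7"]
```

Supporting a new firewall or router is a matter of adding one more function like these.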
This doesn't solve the larger problem though. If you've ever looked at a large IPv6 feed, you quickly realize that while this approach works OK, there might be more efficient ways to solve this feed-aggregation and sharing problem. With any luck, your feeds will reach 500MB, 600MB, even 1GB, and you start wanting to move that data not just daily, but hourly and in as close to real-time as you can make it. You start thinking about things like compression, diffs and other algorithms that help you distribute the data more efficiently.
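The diff idea is simple enough to sketch: ship only what changed between two snapshots, compressed, instead of re-sending the whole feed. Function names and the delta format here are made up for illustration:

```python
import gzip
import json

def feed_delta(old, new):
    """Compute what changed between two feed snapshots."""
    old, new = set(old), set(new)
    return {"add": sorted(new - old), "del": sorted(old - new)}

def pack(delta):
    # A gzipped JSON delta is tiny; a near-real-time cadence becomes
    # practical where re-shipping a 1GB snapshot wouldn't be.
    return gzip.compress(json.dumps(delta).encode())

def apply_delta(feed, delta):
    """Replay a delta on the consumer side to rebuild the current feed."""
    return (set(feed) - set(delta["del"])) | set(delta["add"])
```

A consumer that falls behind can replay the deltas in order, or just fetch a fresh snapshot and resume.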
Pretty soon, you find yourself back staring at this "snort signatures" pattern problem: a small, elegant mathematical formula representing something your sensors should be detecting. All it's missing is a little normalization and a bit of an ever-evolving data model behind it, representing the current state of the Internet. Ten years from now we'll likely be trading these.. if you're not already.