The Reason Kafka and Redis Do Not Make Great Firehoses

In a previous post, I talked about the things you must consider when developing a real-time streaming platform. In a follow-up, I demonstrated some of those lessons learned as implemented in CIFv4. The decision to start with websockets as the outward-facing streaming interface, instead of a lower-level framework, was then and still is quite obvious. Most people don't understand what they can do with streaming data, let alone which architecture to use at the bottom of the stack.

While this quasi-centralized approach definitely makes things a bit easier for newcomers to adopt, it comes at the expense of scale. At some level, your growing horizontal infrastructure needs to cross-coordinate with each of the different endpoints producing these streams. At some point there's a crossroads where all the information has to get meshed together and spewed back out as a single unified stream. Most developers simply push this complexity to some sort of choke point (eg: Kafka, Redis, etc) because it solves the problem (fairly cheaply) and simplifies the endpoint complexity for them.

In the early stages of development, it's probably good practice to take what you have [easy] access to and prototype. That's how the firehose was built (Rails, ActionCable, Redis, WebSockets) because it was easy, but in doing so you can already see where the risk lies. If the Redis box fails, the entire firehose fails. This in and of itself is not a big deal: just set up a bigger Redis box, make it redundant and call it a day, right? Well, sort of..


In real life, even a redundant, fully replicated Redis setup hosted on AWS will probably get you 98% uptime in the prototyping stages, and for most folks that's probably just fine. What happens, however, when you actually start moving LOTS of data? What happens when that data starts crossing datacenters and you want that firehose to be both unified and accessible across ALL your front ends?

Today, similar setups solve the problem by creating highly available, highly dedicated, bandwidth-intensive channels that span entire data centers. These solutions are highly capable, advanced and, more importantly, very costly. This also restricts entry to those who have the means to plug directly into the datacenter stream and make use of the data. Additionally, most of these solutions are pretty much "all or nothing", meaning there's no sense of trust or sensitivity restriction baked into the architecture. You either have access, or you don't.
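To make the alternative to "all or nothing" a bit more concrete, here's a minimal sketch of what a sensitivity restriction might look like if it lived in the node rather than in the pipe. Everything in it is an assumption made up for illustration: the three `Sensitivity` levels, the `Indicator` type and the `shareable` helper are hypothetical names, not part of CIF or any existing protocol.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sharing levels, from least to most restricted."""
    PUBLIC = 0
    COMMUNITY = 1
    TRUSTED = 2


@dataclass(frozen=True)
class Indicator:
    value: str
    sensitivity: Sensitivity


def shareable(indicators, peer_trust):
    """Return only the indicators a peer at `peer_trust` is cleared to see."""
    return [i for i in indicators if i.sensitivity <= peer_trust]


# Illustrative data only (documentation address ranges and example domain).
feed = [
    Indicator("198.51.100.7", Sensitivity.PUBLIC),
    Indicator("evil.example.com", Sensitivity.COMMUNITY),
    Indicator("203.0.113.9", Sensitivity.TRUSTED),
]

new_peer = shareable(feed, Sensitivity.PUBLIC)      # someone we just met
old_friend = shareable(feed, Sensitivity.TRUSTED)   # someone we trust fully
print(len(new_peer), len(old_friend))
```

The point of the sketch is only that the decision about what to send is made per peer, at the sender, instead of being a property of who managed to plug into the stream.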

What if you wanted to explore a virtualized version of that, contained within your own network resources? Are there other "patterns" from everyday life we can borrow to help us design this? In the real world we share different types of information (human to human) every day. We do so at various levels of sensitivity and, for the most part, it not only scales, it's incredibly resilient.

  1. How do I meet others in my community to share with?

  2. What kinds of information do I share (Sharing Restrictions)?

  3. What happens if the people I share with, move?

  4. What happens if the power goes out, what do we do then?

  5. How do I know when someone joins my community?

  6. How do I know when someone leaves my community?

  7. How do I build trust within my community, while keeping actual entry frictionless?

If you think through a lot of these problems, you can probably think of a quick answer for each, purely based on the life experiences in your own neighborhood. We meet others in our community by walking the dog every day, running into new folks, introducing ourselves and talking about the community itself. Over time you build trust in these nodes, learn about other nodes on the network, maybe attend a community watch meeting and, little by little, share more information with each of the nodes.

Over time, you get more involved in your community politics, maybe even run in a local election. Eventually you become a super-node in your community, a quasi-watering hole where a lot of the information tends to flow through you and into others. There are others in this position as well, so if you were to leave, the system would be less informed, but only for a short period of time. Shortly thereafter, the nodes automatically elect someone else and the system continues forward.
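The neighborhood story above (noticing who joins, noticing who leaves, and picking a new super-node when one disappears) can be sketched in a few lines. This is a toy, single-process model under some loud assumptions: the `Community` class, the heartbeat timeout and the "best-connected node wins" election rule are all hypothetical choices for illustration, not how any particular system does it.

```python
class Community:
    """Tracks which nodes are alive and which one acts as the super-node."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout   # seconds of silence before a node counts as gone
        self.last_seen = {}      # node name -> time of its last heartbeat
        self.links = {}          # node name -> number of peers it talks to

    def heartbeat(self, node, now, links=0):
        """Record a heartbeat from `node`; return True if the node is new."""
        joined = node not in self.last_seen
        self.last_seen[node] = now
        self.links[node] = links
        return joined

    def expire(self, now):
        """Drop nodes that have been silent longer than the timeout."""
        gone = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for n in gone:
            del self.last_seen[n]
            del self.links[n]
        return gone

    def super_node(self):
        """Pick the best-connected live node; ties break on name, so nodes
        holding the same view pick the same winner (assumed rule)."""
        if not self.last_seen:
            return None
        return max(self.last_seen, key=lambda n: (self.links[n], n))


c = Community(timeout=3.0)
c.heartbeat("alice", now=0.0, links=5)
c.heartbeat("bob", now=0.0, links=3)
c.heartbeat("carol", now=0.0, links=2)
first = c.super_node()

# alice goes quiet; bob and carol keep checking in.
c.heartbeat("bob", now=4.0, links=3)
c.heartbeat("carol", now=4.0, links=2)
left = c.expire(now=4.0)
second = c.super_node()
print(first, left, second)
```

In a real network every node would keep its own copy of this table and the views would disagree for short periods, which is a big part of why the real thing is hard.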

Each of those "In Real Life" patterns can be mapped to most of the various technologies we use to share information in cyber security. Where things get interesting is in how we decentralize the risk of "when stuff breaks". The reason most practitioners leverage these more centralized approaches (aside from the fact that it gets the job done, for now) is that implementing some of these distributed life patterns in code, at scale, is REALLY REALLY HARD. I literally just spent the last 3 days working on what turned out to be ~10 lines of code. Doing something simple that scales is hard; doing something complex that doesn't is easy.

Is it easy to map certain aspects of these problems in languages such as Python, Go and Ruby? Sure. The problem then becomes that ALL your tools will more than likely need to speak those languages in order to interoperate with the protocols you've developed. Is it easy to just write language bindings for the protocol you've developed? Sure, but then you have to make sure that every time you change "the engine", all those things get updated automatically.

Deploying Threat Intel at Scale

I've learned a lot over the past 10 years of building threat intel deployment architectures. The endpoints of these delivery mechanisms are devices that not only monitor many /8's and /16's, but do so at multi-gigabit speeds. When you start pushing a lot of the complex logic back towards the edge, you end up driving a lot of your costs to the floor. In doing so, you train your endpoints to become more resilient, and you can ultimately deliver 10x or 100x the data.

By pushing your endpoints to be smarter, you start thinking about the network itself as the firehose. This shifts your resources away from relying on massive amounts of centralized technology to solve the problem and towards teaching the nodes to solve those problems for you where those decisions belong: at the edge. By pushing this complexity to code rather than infrastructure, you will incur more initial costs, but with the benefit of long-term sustainability and scale.

Why is this important? Think about how BGP and IPv4 "won" vs all the other more centralized networks. They won because the edges (eg: ASNs) were able to make their own routing choices, in real time, as they saw fit. They were able to scale because each network was able to peer with other networks on its own as it deemed appropriate and yet, because of this mesh, we're all able to stream Netflix to our iOS devices from anywhere on the planet.

As far as security information sharing is concerned, what are we missing?

  1. Code that auto generates high level language bindings when you update the engine?

  2. Protocols that auto generate code when the protocols change?

  3. The ability to beacon on a local area network and auto-connect when you find your peers?

  4. The ability to do this transparently?

  5. The ability to gossip with our peers in an effort to find new ones?
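The last item on that list, gossiping with peers to find new ones, is easy to simulate even if it's hard to ship. Here's a minimal in-memory sketch, assuming each node starts out knowing at most one neighbor and that "gossip" just means two nodes merging their peer lists. The `gossip_round` and `discover` functions are hypothetical names, and there are no sockets or LAN beacons here; it only models who knows whom.

```python
import random


def gossip_round(views, rng):
    """One round: each node picks one peer it knows and the two merge lists."""
    for node in sorted(views):
        known = sorted(views[node] - {node})
        if not known:
            continue  # this node hasn't met anyone yet
        peer = rng.choice(known)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def discover(seed_links, max_rounds=1000, seed=0):
    """Gossip until every node knows every other node.

    `seed_links` maps each node to the peers it starts out knowing; every
    peer mentioned must also appear as a key. Returns (rounds_used, views),
    with rounds_used set to None if the views never converged.
    """
    rng = random.Random(seed)
    views = {n: set(peers) | {n} for n, peers in seed_links.items()}
    everyone = set(views)
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(v == everyone for v in views.values()):
            return rounds, views
    return None, views


# A chain of introductions: a knows b, b knows c, c knows d, d knows nobody.
rounds, views = discover({"a": ["b"], "b": ["c"], "c": ["d"], "d": []})
print(rounds, sorted(views["d"]))
```

Nobody hands any node the full membership list; it emerges from a handful of local conversations, which is the same shape as meeting the neighbors one dog walk at a time.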

The Future is Here

If you want scale, you have to put some work into it. You also have to achieve some goals to help make scale... scale. Meaning, you need to invest in your tools as much as you do the people implementing them. You need things like automatic code generation, protocols that reduce centralized complexity, and authorization, encryption and, more importantly, CONFIGURATION that happen almost transparently. I shouldn’t need to configure everything every time I add new nodes to the network; the nodes should figure this stuff out automatically. More importantly, the architecture shouldn’t break if something bad happens; it should route around the problem and continue on. More automation, more resiliency and less cost mean more solved problems and more winning.
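"Route around the problem" is worth a tiny illustration. The sketch below assumes a made-up four-node mesh and uses a plain breadth-first search to find a path that avoids failed nodes. The node names and the `find_route` helper are hypothetical, and a real mesh would make this decision hop by hop rather than from one global map.

```python
from collections import deque


def find_route(mesh, src, dst, down=frozenset()):
    """Breadth-first search for a shortest path that avoids failed nodes."""
    if src in down or dst in down:
        return None
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in mesh.get(path[-1], ()):
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no surviving path


# Hypothetical mesh: two edges, each peered with two hubs.
mesh = {
    "edge-a": ["hub-1", "hub-2"],
    "hub-1": ["edge-a", "edge-b"],
    "hub-2": ["edge-a", "edge-b"],
    "edge-b": ["hub-1", "hub-2"],
}

normal = find_route(mesh, "edge-a", "edge-b")
rerouted = find_route(mesh, "edge-a", "edge-b", down={"hub-1"})
isolated = find_route(mesh, "edge-a", "edge-b", down={"hub-1", "hub-2"})
print(normal, rerouted, isolated)
```

With a single Redis box in the middle there is only one hub to lose; with even two peers per edge, losing one is a detour instead of an outage.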

Do you need an architecture that moves 100Gbps of data across 20 different data-centers around the world in real-time today? Probably not. Will you 2-3 years from now? Yup.

Did you learn something new?