Realtime data is hard... well, was.
There are two problems with realtime data: producing it, and consuming it... It sounds easy, but when you start scaling, you either end up with a flood of data you can't consume, or, without significant investment, data you can't move. We're not even getting into the mechanics of long-term storage; this is purely the streaming of messages, at high volume, in highly distributed environments (e.g. not across racks in a DC, but across the Internet). Either way, this problem can very quickly bring your infrastructure to a grinding halt, or you start dropping stuff.. or both.
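The consume side of that flood fits in a few lines: a bounded queue stands in for your pipe, and a consumer that hasn't kept up means messages hit the floor. This is a toy illustration of the failure mode, not any particular broker:

```python
import queue

def produce(q, messages):
    """Push messages at a bounded queue; count what gets dropped."""
    dropped = 0
    for msg in messages:
        try:
            q.put_nowait(msg)
        except queue.Full:
            dropped += 1  # the pipe is full and nobody is draining it
    return dropped

# a pipe that holds 10 messages, fed 100 before the consumer wakes up
pipe = queue.Queue(maxsize=10)
dropped = produce(pipe, range(100))
print(dropped)  # 90 of the 100 messages never made it
```

Every real system makes some version of this trade: buffer more (and risk running out of memory), block the producer (and stall upstream), or drop (and lose data). Scaling just changes which one hurts first.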
Do you use a more centralized queue such as AMQP? Redis? Or something more distributed, such as ZeroMQ? Maybe you just serve your customers web-sockets backed by Postgres using Action Cable? Do you write your own client, or leave it up to the consumer? There's really no 'wrong' answer (well, there is in my opinion, but we'll put a pin in that for now), but choosing a less optimal path usually costs you both ways: clients unable to use your technology, and your technology pricing you out of many potential markets.
ZeroMQ is hard to get into, but it moves fast. AMQP is easier-ish, but it starts falling over rather quickly. Redis is becoming quite popular. Each of these technologies trends toward a specific kind of audience, meaning if you pick one over the other [without writing your own abstracted client], you're probably missing potential customers who aren't well versed in the lower level frameworks. So maybe web-sockets? There are issues with those as well. So what do you do?
Specializing too quickly will yield you almost zero customers. Even if you're "right" in the long run, if no one can [or wants to] use your streaming framework, you're wrong. You need to let your customer base show you what types of streaming they're comfortable with. If you bridge your hard-core messaging exchange with a lighter, easier to use exchange such as web-sockets (HTTP), and price it accordingly, you give them a few breadcrumbs to follow while they wrap their heads around how the exchange works.
Simplicity is also important, because no matter how much client code you write, you will almost always have users who want to engage with the raw flow themselves (e.g. write their own client). If you take too many liberties trying to help them by handing them a complicated client (or SDK), or a complicated protocol (even if it's faster, better, sexier), they will become frustrated and move on to something else. Over time, your customers will adapt and drive you toward the more specialized frameworks, but only after you've proved the concept and that there's a market for it. They will intuitively show you where to invest next, no sales team required. You'll still need the simplicity of the on-boarding process, since not all consumers learn at the same speed, but that becomes your lowest tier product: a simple way for consumers to engage and get involved with your platform, then graduate to the harder (faster?) stuff as they gain the necessary experience. They are better equipped to teach themselves than you are to spend time teaching them.
The other issue with traditional firehoses: they're $$$$$$$$$$$. There are very few technologies out there willing to deliver students, researchers and small businesses a firehose they can reasonably build on. This isn't an artifact of artificially high prices; up until recently, that's how much it cost to deliver these products. Customer support is hard, software development is hard, sales engineers are expensive, and because of those combined resource costs, customers want to spend some time working with your team before they take the plunge and cut the check.
I'm sure plenty of shops give you the option of a "30-day trial", but where's the $25 or $50/mo option so I can spend a few months with your product and see how it fits? What if my new business model needs a few years to find itself (hi! I'm one of those!)? What if my research project spans years (hi! I'm this guy too!)? Today your options are: get cozy with some operational folks (check!), spend time building up trust (check, check!), find yourself a sugar daddy (or momma) willing to foot the bill (been there, not as fun as it sounds), or blow your savings and shoot for the moon (heh, THANK YOU AWS!!). All of these problems are hard, but they don't have to be.
Keeping it Simple.
Fire up the csirtg-firehose tool:
$ pip install csirtgsdk
$ export CSIRTG_TOKEN=1234
$ csirtg-firehose
and in less than 5 minutes you're streaming all the public data from within CSIRTG (which, at the time of this writing, is comprised mostly of various types of scanning activity from honeypots, as well as odd-ball spam/phishing URLs, email addresses, email attachment hashes, etc.).
There is also an example feed and correlation tool to help get you started, maybe even generate an idea or two. The correlation tool watches all the scanners coming across all the feeds in real-time and simply produces a correlated indicator when it finds an indicator created by 3 different users within a 24-hour period. Crazy simple, yet it produces a highly suspect list of suspicious actors that can be confidently acted on in your security infrastructure.
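The core of that rule is small enough to sketch. This is not the actual tool's code, just an illustration of the logic (3 distinct users, 24-hour window); the event format and names here are assumptions:

```python
from collections import defaultdict

WINDOW = 24 * 3600  # seconds
THRESHOLD = 3       # distinct users required

# indicator -> {user: last time that user reported it}
sightings = defaultdict(dict)

def observe(indicator, user, ts):
    """Record a sighting; return True when it correlates."""
    seen = sightings[indicator]
    seen[user] = ts
    # keep only reports inside the 24-hour window
    recent = {u: t for u, t in seen.items() if ts - t <= WINDOW}
    sightings[indicator] = recent
    return len(recent) >= THRESHOLD

# (indicator, user, unix timestamp) -- invented sample events
events = [
    ("1.2.3.4", "alice", 0),
    ("1.2.3.4", "bob",   3600),
    ("5.6.7.8", "alice", 4000),
    ("1.2.3.4", "carol", 7200),  # third distinct user within 24h
]
hits = [ind for ind, user, ts in events if observe(ind, user, ts)]
print(hits)  # ['1.2.3.4']
```

Point it at the stream instead of a list and you have the whole tool: state in memory, one emit per correlated indicator, nothing to deploy but a process.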
Spend time on the Rabbit Hole! Not the infrastructure..
Want something more in-depth? Have a research project and need access to real-time data? Take the feed and start doing some analysis of those highly suspect addresses; produce another feed of /24s or ASNs. Take the suspicious URLs across all the feeds, resolve them, and start enumerating the various name-servers around the world everyone seems to use when they want to attack you. You could do this by just pulling and mangling feeds every day, hour, or 15 minutes... but you can detect and react a lot quicker when streaming the data in real-time. Suddenly, the broader community of spam-traps and honey-nets becomes your SEM, and you can start to mitigate attackers before they reach you. Push your hits back into a feed you own, and start a massive feedback loop with the rest of the community. Suddenly your SEMs are connected in real-time.
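Rolling suspect addresses up into /24s, for instance, is a few lines with the standard library (the addresses below are documentation-range examples, not real hits):

```python
import ipaddress
from collections import Counter

suspects = ["203.0.113.4", "203.0.113.77", "203.0.113.200", "198.51.100.9"]

# collapse each address to its containing /24
nets = Counter(
    ipaddress.ip_network(f"{ip}/24", strict=False) for ip in suspects
)

# networks with more than one suspect become a feed of their own
feed = [str(net) for net, count in nets.items() if count > 1]
print(feed)  # ['203.0.113.0/24']
```

The same shape works for ASNs: swap the /24 collapse for a prefix-to-ASN lookup and count again.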
Future Firehose(s)... p2p?
Oh, back to that 'pin' about being right, or wrong... I'm a hardcore ZeroMQ junkie; I love working with the pyzmq, CZMQ and Zyre (p2p) frameworks. However, these lower level messaging frameworks, while incredibly scalable and mature, are hard for most programmers to really wrap their heads around [quickly]. A lot of the time, that's the difference between adoption and not. To that end, I've started building some Python bindings for the ZeroMQ Zyre framework, called pyzyre, to help bring the concepts around p2p realtime-streaming to Python users: simple abstractions that make the lower level frameworks more accessible, recognizing that certain patterns exist in some of the higher level languages that may not exist in things like C. Patterns that drive adoption and make those languages attractive over writing things in C.
Peer to peer and its discovery protocols are not new; in fact, there are many examples of frameworks that try to tackle the peer to peer aspects of a firehose, in both highly distributed models as well as semi-centralized ones. The major difference seems to be: either you pick a more centralized 'firehose' (e.g. redis + web-sockets), where developers have organized around the pattern to develop bindings in all the possible languages (e.g. making adoption easier), OR a highly specialized framework ("written in Go, but here's a binary"). Occasionally you'll get something in the middle (e.g. Bitcoin), but one that's highly specialized toward a specific task (moving harder wealth).
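The discovery half of the p2p story is simpler than it sounds. Zyre, for example, finds peers by broadcasting UDP beacons on the local network. Here's a loopback-only sketch of that idea in plain Python; the JSON payload and port are inventions of this example, not the Zyre wire format (which is binary, on port 5670):

```python
import json
import socket
import uuid

BEACON_PORT = 9999  # arbitrary for this sketch; real Zyre beacons use 5670

def make_beacon(node_id, mailbox_port):
    # announce "I exist; talk to me on this port"
    return json.dumps({"id": node_id, "port": mailbox_port}).encode()

def parse_beacon(data):
    msg = json.loads(data.decode())
    return msg["id"], msg["port"]

# a listening peer
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", BEACON_PORT))

# a second peer announces itself (loopback standing in for LAN broadcast)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
node_id = uuid.uuid4().hex
tx.sendto(make_beacon(node_id, 5670), ("127.0.0.1", BEACON_PORT))

# the listener learns who's out there and where their mailbox is
peer_id, peer_port = parse_beacon(rx.recv(1024))
rx.close()
tx.close()
```

Real Zyre layers group membership, mailboxes, and messaging on top of this discovery, which is the part libraries like pyzyre try to make accessible from higher level languages.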
I'm not completely sold on either of these solutions, which is where the ZeroMQ framework (CZMQ and Zyre specifically) comes into play. CZMQ provides an abstract layer on top of libzmq, which, because it's written in C (think: sockets on steroids), gives us the ability to have a very thin abstract API that other (object oriented) languages can more easily tie into and build on. This means we can start doing things like code generation at the binding level. In plain English: because the lower level peer-to-peer (Zyre) engine is written in C, anytime we make a change to the core, the bindings for every single higher level language are automatically re-generated (Ruby, Java, Python, Qt, etc.). Merge that with the ZeroMQ C4 RFC (e.g. we just hit merge on most pull requests), and now your community is testing your ideas much more quickly. No waiting for a language binding to be manually updated before you can test a change in the engine itself.
Instead of organizing the concept around a higher level language, we're building out p2p in a more common language, C.. but enabling changes to bubble up into the higher level bindings immediately and with zero effort. Why is this important? Why not "just use Go and the binary"? Well, if you're trying to build a massively [global] distributed firehose, one that your platform both benefits from and contributes to... adoption across all the programming languages is required.. and we can't pay a team of programmers to keep all the different bindings updated every time someone makes a pull request.
the future is distributed. and life comes at you fast. more to come...