Prototyping CIFv4: Part 1.

How the Sausage Is Made

There are a few philosophies around how 'the next version' of anything gets started. Some people like to build things in private "until it's presentable", some like to wander off in some direction as an excuse to learn a new technology, others like to write up a proposal for a problem (which may or may not exist) and use it as a sales pitch to fund their idea. There are also those who actually build and solve something, but leave it completely undocumented, un-written about and undiscoverable by the people who need it most: the rest of us. Because... ohhh, shiny!

CIF, and more specifically I, have been guilty of many (if not all) of these at some stage. CIFv-1 and v0 were actual solutions to problems we had; while we ran around the world talking about them, we really didn't write about them, at least not that well. Others did, but we failed to, and looking back, I think that inhibited us. CIFv1 and 2 were project proposals for problems we sort of had, but we made two mistakes: we cornered ourselves into a 3-year feature set, and less than 18 months later the scope had changed. Problem: we were on the hook for specific features we had designed years earlier.

Additionally, we (I?) were immersed in a standards and technology game at the time that LITERALLY did not move the needle in terms of what we had set out to accomplish. Standard-A and Standard-B (whatever the term 'standard' really means these days) do not matter in this space; FAST, EASY, JSON and CSV do. Operators who move lots and lots of data do not care about standards, political organizations or protocol buffers. They care about "how many records can I throw into Bro before it falls over?". Not that those things aren't important; they're just not when you're exploring a new problem space and prototyping solutions.

Doc Doc and MOAR DOC!

Eventually, with CIFv2 and CIFv3, we started writing a book (well, Gabe started writing a book; I piled on). Not a real book, but one of those GitHub wiki "this is a book but not a real book" books. Something simple to get our feet wet. It helped us document what we did, in hopes it'd expose the larger problem set (or what WE thought it was) to newcomers. The problem was that it came AFTER everything "was stable", meaning if someone read the book and highlighted something that needed to be fixed in the core architecture, it was really hard to do. Who wants to break something when you've started tagging it as "RC1"?

Now you're caught in this odd space between the stable latest version and what you think is going to be the basis for the next version. The catch? You wrote this platform to solve a problem, and you need to actually ship it. Software developers are terrible at drawing the line between the next version and scope creep. Every time we think we have a handle on a problem, 10 new problems surface, and of course those need to be solved before we cut the next release! Do we burn down the repo and start over each time? Or do we slowly migrate towards the next-new-version, carrying with us the baggage, tests, and legacy BS from the old version? I don't know that there's a wrong answer to any of these questions, but each comes with its own baggage. I tend to burn everything down each time, but that's more my personality than anything else.

Prototyping in Your Spare Time

I decided to start designing CIFv4 these past few weeks. I actually created the repo for this in Sept of 2016 and updated the README with some goals in June of 2017. CIFv3 was built similarly: by solving problems, you start getting a sense of the next version and write those ideas down somewhere. You create a repo and a README with some high-level goals. Over time you update those goals, until eventually you get excited (and find some free time) and start building something. In the case of v3, it was Python, plus new and improved ZeroMQ and Elasticsearch support. Over time the repo grows, and out of the blue one day a bunch of people start trying to install it (no doc, just a setup.py file, and they think: "Oh, this will be fun!"). That's when it hits you: you've not written any doc.

Now you're in a feedback loop: you write doc to answer end-user questions, which generates more users, which generates the need for more doc. You slowly stop solving technical problems because, well, others found a solution they want to leverage, and very quickly your platform starts to become mature. That comes with a real cost: you're no longer allowed to make structural changes that could break things. Users are relying on this "Alpha.17" to be stable "in production", so your policy for the platform becomes "bug-fixes only" and you start tagging it as "Beta" (again, whatever THAT means). Herein lies the rub: a lot of that feedback is architecturally GREAT FEEDBACK, so what do you do?

My friend Jeff once suggested "you write the doc first", which I never really understood until recently. How do I write doc for a problem I don't even really understand? Why spend all that time on doc if I don't have an audience? All valid questions, especially if your tool or platform isn't well established. I'd like to tease that out into a few simple ideas:

  • Write the doc as you prototype

  • Put restraints around your prototype (only solve and doc the problems immediately in front of you)

  • Write about your doc and your prototype; open up the feedback loop to non-developers who may understand the general pattern, but not the deep technical details

  • Start each new project as a fresh prototype; it's easy to abstract and back-port those ideas into a more stable code base later. Give yourself a little wiggle room to explore the problem

  • Use time constraints: try 2-6 week sprints, see how far you get, and make sure the doc is updated at the end

I had a vision in the initial stages for the problems I wanted to solve, but it became clearer as I started putting together the basic parts. This isn't meant to be an exhaustive list, but it's a rough sketch of what I think needs to be done and in what order. The order here is important: at some point I'm going to find myself down a rabbit hole and have to draw the line for CIFv5. What problems are worth solving now, and what might be able to wait (or become a plugin) because the market for that problem just isn't there yet?

Performance and Weight are Always at the Top of the List

You make a racing car faster by removing weight, not by adding power.
— Pieter Hintjens

This is one of the reasons we switched from Perl to Python, and one of the reasons we're considering C for v5 (or some mix thereof). Clean and concise APIs are another area we're trying to improve. There's nothing worse than trying to read a PDF full of hundreds of (machine-generated?) APIs. It shouldn't take "a professional" to integrate with your platform. Don't get me wrong, there should be a reference guide for that kinda thing, but it shouldn't be "the default".

Real-time streaming has always been a passion of mine. I'm happiest when I'm moving data, which is why ZeroMQ has always been at the core of CIF (albeit a little awkwardly at times). While parts of this exist in current versions of CIF, with v4 we're exposing more of them as public interfaces, meaning both ZeroMQ PUB sockets (complete with TLS) and native HTTP WebSockets. Both technologies are becoming so much more approachable that implementing them at scale gets easier by the day.
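
As a rough illustration, here's a minimal pyzmq sketch of the kind of PUB-socket firehose I'm describing. The endpoint, topic name, and indicator shape here are assumptions for the sake of example, not CIFv4's actual interface:

```python
import json
import time

import zmq  # pip install pyzmq

context = zmq.Context()

# Publisher side: every new indicator gets pushed out on a PUB socket.
# 'tcp://*:5570' and the 'indicators' topic are hypothetical choices.
publisher = context.socket(zmq.PUB)
publisher.bind('tcp://*:5570')

# Subscriber side: connect, filter on the topic, drink from the firehose.
subscriber = context.socket(zmq.SUB)
subscriber.connect('tcp://localhost:5570')
subscriber.setsockopt(zmq.SUBSCRIBE, b'indicators')

time.sleep(0.5)  # PUB/SUB is fire-and-forget; give the subscription a beat to land

indicator = {'indicator': 'example.com', 'itype': 'fqdn', 'tags': ['phishing']}
publisher.send_multipart([b'indicators', json.dumps(indicator).encode()])

topic, payload = subscriber.recv_multipart()
print(topic, json.loads(payload))
```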

ZeroMQ is a little harder because you have to understand the ZeroMQ ecosystem, but with that comes efficiency, speed and power. With WebSockets, just about any developer can hit the ground running, and there are plenty of examples in every language, which makes it approachable. The result? A firehose directly from your CIF instance. What does that enable you to do? Build a giant intelligence-driven IDS. Why is this important? You'll see when you start peering your CIF instances with others in ways other than the traditional "push/pull" models. Real-time, streaming intelligence with your peers is the difference between BGP peering and dialing up to a bulletin board every hour to get your latest content.
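
On the WebSocket side, the barrier to entry really is that low. Here's a hypothetical consumer using the Python websockets library; the /firehose endpoint and the message shape are assumptions for illustration, not the real v4 API:

```python
import asyncio
import json

import websockets  # pip install websockets

async def consume(uri):
    async with websockets.connect(uri) as ws:
        # Each message is assumed to be one JSON-encoded indicator.
        async for message in ws:
            indicator = json.loads(message)
            print(indicator.get('indicator'), indicator.get('tags'))

# 'ws://localhost:5000/firehose' is a made-up endpoint for illustration.
asyncio.run(consume('ws://localhost:5000/firehose'))
```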

Hunting for Probabilities

Additionally, I was able to prototype some simple SKLearn-based hunters for both domains and URLs. This wasn't so much about generating new data as it was about teasing out an actual probability model around these indicators. Traditionally, CIF "confidence" levels have always been more or less "made up" (eg: what I think of a data provider and what they think about their data). That's not to say confidence isn't valuable; it helps us estimate something "better than a coin-flip" and, to be honest, has served us pretty well over the past 10 years. It was the best we had at the time, but now we have something better. I love distilling "gut feel" down into a more reproducible, statistically relevant process.
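
To give a flavor of the approach (a toy sketch, not the actual hunter code): character n-grams plus a simple classifier are enough to turn a raw domain into a probability via scikit-learn's predict_proba. The training data below is obviously made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labels; a real hunter would train on actual labeled feed data.
domains = ['google.com', 'github.com', 'paypa1-verify-acct.biz', 'secure-l0gin-update.info']
labels = [0, 0, 1, 1]  # 0 = benign, 1 = suspicious

model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(domains, labels)

# A reproducible probability instead of a made-up confidence score.
p = model.predict_proba(['secure-bank-l0gin.xyz'])[0][1]
print(f'p(suspicious) = {p:.2f}')
```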

That said, the more code I write, the more I wonder how much these probabilities will guide us towards "whitelisting" data rather than "finding it to be suspicious". This will probably require a bit of research, but it wouldn't surprise me if the net return of these hunters ends up being used to weed out potential "things we would have blocked" vs "trying to find things TO block". If you have an intelligence platform such as this, where the majority of what you're inserting IS suspicious in nature, are you better served by just accepting that and weeding out the potential odd balls that would cause you pain, or by increasing the confidence in the things that you inject? We'll see.
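
In other words, the same score could drive both ends of the pipeline. A hypothetical sketch, with thresholds pulled out of thin air:

```python
# Hypothetical cut-offs; tuning these is exactly the open research question.
WHITELIST_BELOW = 0.10  # weed out the odd balls that would cause you pain
BLOCK_ABOVE = 0.90      # boost confidence in the things you inject

def triage(probability):
    if probability < WHITELIST_BELOW:
        return 'whitelist'
    if probability > BLOCK_ABOVE:
        return 'block'
    return 'review'

print(triage(0.04))  # -> 'whitelist'
print(triage(0.95))  # -> 'block'
```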

So far-

I've spent about a year thinking about v4 and about 12 hours writing it (most of which has been re-factoring older code and wondering how drunk I was when I wrote it). If you look at the repo today, most of it looks and feels like v3, but with most of the complexity removed (eg: lots of refactoring for performance and readability). Last night, I was able to get "pings" flowing back and forth between the client and the storage thread, which is a good sign. There's an odd issue with a new version of the Flask components I'm testing out that deals with both auto-documenting and handling API tokens. It doesn't look too crazy, but the folks upstream have a LOT of outstanding pull requests in the queue. Projects like that don't leave me with a warm fuzzy feeling, which means I either need to be patient, fork the code and support it, or both.
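
For the curious, that ping/pong handshake is conceptually something like the inproc REQ/REP sketch below; the actual v4 transport and framing may differ:

```python
import threading

import zmq  # pip install pyzmq

context = zmq.Context()
ready = threading.Event()

def storage_thread():
    # Hypothetical storage side: answer pings on an inproc REP socket.
    sock = context.socket(zmq.REP)
    sock.bind('inproc://storage')
    ready.set()
    if sock.recv_string() == 'ping':
        sock.send_string('pong')

threading.Thread(target=storage_thread, daemon=True).start()
ready.wait()  # make sure the REP socket is bound before we connect

client = context.socket(zmq.REQ)
client.connect('inproc://storage')
client.send_string('ping')
print(client.recv_string())  # -> 'pong'
```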

Maybe by the time I write part 2, we'll have some data flowing back and forth and can demonstrate some of the logic I've described here. Maybe there will even be some doc and a deployment-kit available, since a lot of that was solved in v3.

Did you learn something new?