F your formats, just show me the data.

...without ANY machine learning or NLTK magic, you have a very basic and generalized pattern (or "algo" in hipster speak) that can parse and normalize most types of feeds.
I've spent about a year thinking about v4 and about 12 hours writing it (most of which has been refactoring older code and wondering how drunk I was when I wrote it). If you look at the repo today, most of it looks and feels like v3 but with most of the complexity removed (e.g., lots of refactoring for performance and readability). Last night, I was able to get "pings" flowing back and forth between the client and the storage thread, which is a good sign...
What good is threat intel, if you have to spend time thinking about it?
If you treated every suspicious domain as a coin flip, in a normally distributed sample, over time you'd have a 50/50 chance at being right. If you filter out the top 1000 domains from Alexa, you're probably at 70/30; if you weed out domains that have more than 3 dots in them, 75/25; 3 or more hyphens might get you to 80/20; and if the domain is greater than 15 chars, it's probably not worth your time....
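Here is a rough sketch of those filters in Python, mostly to show how little code they take. The function name `matched_filters`, the `popular` parameter and the label strings are made up for illustration. The popular-domain set is something you'd load yourself (e.g. from a top-1000 list), and the sketch only reports which filters a domain trips. What you do with a match, and what odds you attach to it, is left to you.

```python
def matched_filters(domain, popular=frozenset()):
    """Return the names of the simple filters that a domain trips.

    `popular` is a caller-supplied set of well-known domains
    (hypothetical here; e.g. loaded from a top-1000 list).
    """
    # Normalize: trim whitespace, lowercase, drop a trailing root dot.
    name = domain.strip().lower().rstrip(".")

    matched = []
    if name in popular:
        matched.append("popular")
    if name.count(".") > 3:       # more than 3 dots
        matched.append("more_than_3_dots")
    if name.count("-") >= 3:      # 3 or more hyphens
        matched.append("3_or_more_hyphens")
    if len(name) > 15:            # greater than 15 chars
        matched.append("longer_than_15_chars")
    return matched
```

Calling `matched_filters("my-secure-login-page.net")` flags the hyphen and length filters, while a short, popular domain only comes back as `popular`.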
If you run an open-source project, you have no time to spend on testing deployments, so you AUTOMATE ALL THE THINGS, from testing to install, across as many platforms as you possibly can. Because if you give folks documentation, they will not read it, but if you give them an easybutton, they'll BASH THE HELL OUT OF IT. What you quickly figure out is how many different ways they'll then want to bend, tweak and scale out your application. This leads to more questions, more answers, more time (did I mention you're not really making any money from this? It's all goodwill... you learn a lot, but you also lose a lot of time with your family... depending on your situation, maybe good, maybe bad).
For anyone that's ever tried, there's no 'one way' to parse email. It's one of those long-standing protocols that was developed during a different period of time, is extremely resilient, can carry just about anything, works across different encodings and systems, and will do just about anything you want it to. The very thing that makes it so versatile is the very thing that makes it extremely difficult to parse well. Transporting email is easy; the RFC defines most of the headers and other implementation details pretty well. It's what's IN the messages that's important (and hard)....
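To make the "headers are easy, contents are hard" point concrete, here is a minimal sketch using Python's standard `email` package. The function name `summarize_parts` is made up for illustration, and this is not how any particular project does it. It reads one well-defined header and then only lists the leaf parts of the message. Everything hard (decoding, nested messages, pulling indicators out of the bodies) starts where this stops.

```python
from email import message_from_string, policy


def summarize_parts(raw):
    """Return (subject, [(content_type, filename), ...]) for a raw message."""
    msg = message_from_string(raw, policy=policy.default)

    # The easy part: headers are well defined and simple to read.
    subject = str(msg.get("Subject", ""))

    # The hard part starts here: a message can carry any number of
    # nested parts, each with its own type and encoding.
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers only hold other parts
        parts.append((part.get_content_type(), part.get_filename()))
    return subject, parts
```

For a two-part message with a plain-text body and a CSV attachment, this returns the subject plus one `(content_type, filename)` pair per part, with `None` as the filename for parts that don't declare one.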
I've been writing threat intel parsers for over a decade. In fact, I've made a pretty decent living at it, and I've pretty much seen it all: illegal XML, JSON with XML in it, CSV with tabs in it, pipe-delimited 'CSV' with tabs in it and plain text that looks like... it was written by my 3 y/o son....