Hunting Through Spam with ElasticSearch

You can glean a lot of really neat data from spam: drop boxes, compromised host IP addresses, pieces of malware and, if you’re lucky, maybe even some magic pills! The hardest problem has always been capturing, archiving and searching through the results. If you've tried any of this with "normal SQL", you know what I'm talking about. Storing text is hard, unicode (or lack thereof) is hard, transporting unicode is awkward, and searching it is hard. Or is it?

Following the evolution of technologies such as ElasticSearch, Kibana and Logstash over the last few years has been extremely interesting to me. Not in the traditional "hey, here's yet ANOTHER really sexy fix-all-your-problems database" sense, but in HOW they took Yet Another Database (‘YAD’) and proposed interesting architecture patterns for the "logging event data" problem they accidentally set out to solve. I say ‘accidentally’ in a good way; most elegant solutions in life come from accidental discovery. You start out solving one problem, but the market quickly informs you of another.

For instance, the neat thing about ElasticSearch is not JUST the ability to quickly scan, index and store text based data. While that's extremely useful, you can argue that every other database CAN do that [sort of well] too. For me, the most powerful architecture patterns they managed to embrace and actually improve are:

  • Creating “database partitions” on the fly via simple index creation (eg: email-spam-2018-01) through a REST based API.

  • The ability to archive or remove these indices using simple curation tools, enabling the backend to shift and migrate things “using magic”.

  • The ability to grow the cluster by simply bringing new nodes online, using “magic”.
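To make the first two patterns a bit more concrete, here is a minimal sketch of the naming side of things: deriving a monthly index name from a date, and picking out which indices a curation job could archive or remove. The function names and the "keep N months" policy are my own assumptions for illustration; the actual create and delete calls against the REST API are left out.

```python
from datetime import date


def index_name(prefix: str, day: date) -> str:
    """Return a monthly index name such as 'email-spam-2018-01'."""
    return f"{prefix}-{day:%Y-%m}"


def expired_indices(names, prefix, today, keep_months):
    """Return index names that fall outside the retention window.

    The current month plus the previous `keep_months` months are kept
    (hypothetical policy, adjust to taste).
    """
    # Count months since year 0 so the comparison is plain integer math.
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in names:
        if not name.startswith(prefix + "-"):
            continue  # belongs to some other data set
        try:
            year, month = (int(p) for p in name[len(prefix) + 1:].split("-"))
        except ValueError:
            continue  # not a name this scheme produced
        if year * 12 + (month - 1) < cutoff:
            expired.append(name)
    # Zero-padded names sort chronologically, so this is oldest first.
    return sorted(expired)
```

Because the partition lives in the index name, "archive everything older than two months" becomes a string comparison rather than a giant DELETE query.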

These patterns are relatively new [sort of], and many arguments can be made that implementing them comes at the expense of perfect "ACID" compliance. While those arguments are all true and should be taken into consideration for any system design, I'm not going to argue for or against them here. There’s plenty of hate on both sides, and while nobody is technically wrong, it does depend on the problem you’re trying to solve. They are simply things to think about when you’re trying to achieve performance over perfection.

Life is Event Driven

It continues to astonish me how many people treat ElasticSearch as a non-time-driven document store instead of as an event-driven index of their data (eg: the ‘Logstash’ way). This simple paradigm shift is what I think sets ElasticSearch apart from other technologies when leveraged correctly. The irony is that once you learn this "event driven" architecture pattern, it translates cleanly back to other data stores, albeit with slightly more overhead. Anyone who’s scaled lots of ES nodes over time eventually figures out that when you start splitting things up by some partition of “time”, you do end up with more nodes, but you also get a more resilient and less error-prone architecture. For whatever reason, you end up “hitting the wall” less often, and more often than not you can gracefully maneuver around it when you do.
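As a rough illustration of the "event driven" idea, the sketch below routes each event to a monthly index based on its own timestamp and builds the newline-delimited JSON body that the bulk endpoint accepts. The `timestamp` field (epoch seconds), the `email-spam` prefix and the function name are assumptions of mine, and older ElasticSearch releases may also expect a document type in the action line.

```python
import json
from datetime import datetime, timezone


def bulk_lines(events, prefix="email-spam"):
    """Build an NDJSON bulk body, one monthly index per event."""
    lines = []
    for event in events:
        # Assumes each event carries a 'timestamp' in epoch seconds.
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        # The action line says where the document that follows should go.
        action = {"index": {"_index": f"{prefix}-{ts:%Y-%m}"}}
        doc = {**event, "@timestamp": ts.isoformat()}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline.
    return "\n".join(lines) + "\n"
```

The point is that the event's time, not the time you happened to insert it, decides which partition it lands in, so replaying an old mailbox still fills the right months.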

As with anything else in life, there’s always a balance to strike between speed and perfection. Well-seasoned architects understand when and where to use each. With something like spam, we're generally more concerned with:

  • Stashing as much data as possible.

  • Making it quickly accessible while keeping future scale in mind.

  • Low cost visualization so we can show off what we've found.

  • Lather, Rinse, Repeat.

https://gist.github.com/wesyoung/ee2f8fdd585fcbb9296e80d429e1643a

Should we get to a point where we begin building actual KNOWLEDGE from that intelligence, THEN we should think about how to store that KNOWLEDGE in a more static (eg: less performant) way. At that point, customers are PROBABLY willing to pay for that knowledge, and the problem of "cost for less volatile storage" will be solved for us. Prototype first, gain insights second, and worry about long-term viability once you have a business model and people willing to pay. As you can see, none of this is really all that hard. All it takes is a few simple procmail rules and a tiny Python script.
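For a feel of how tiny that script can be, here is a hedged sketch (not the gist linked above) that turns one raw message, as procmail might pipe it in, into a flat document ready for indexing. It uses only the standard library; the field names and the fallback to "now" for a missing or broken Date header are my own assumptions.

```python
from datetime import datetime, timezone
from email import message_from_string, policy
from email.utils import parsedate_to_datetime


def email_to_doc(raw: str) -> dict:
    """Flatten one raw email into a dict suitable for indexing."""
    msg = message_from_string(raw, policy=policy.default)
    try:
        sent = parsedate_to_datetime(str(msg["Date"]))
    except (TypeError, ValueError):
        # Spam often has a missing or mangled Date; fall back to now (assumption).
        sent = datetime.now(timezone.utc)
    # Prefer the plain-text part, fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "@timestamp": sent.isoformat(),
        "from": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "received": [str(h) for h in msg.get_all("Received", [])],
        "body": body.get_content() if body is not None else "",
    }
```

In a procmail setup you would read the message from standard input, pass it through something like this, and hand the result to whatever does the indexing.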



