I've been writing threat intel parsers for over a decade. In fact, I've made a pretty decent living at it, and I've pretty much seen it all: illegal XML, JSON with XML in it, CSV with tabs in it, pipe-delimited 'CSV' with tabs in it, and plain text that looks like... it was written by my 3-year-old son.

I used to think we just needed a simple, formalized standard to abstract away the need for all this parsing... if there was just one format everyone would use, we wouldn't need all these parsers.

Boy, was I wrong.

Not just a little wrong, but A LOT wrong. Not only had this been tried over and over and over, but each time nobody seemed to learn from the previous effort. First it was IDMEF, then it was IODEF, then STIX... then STIX 2.0, and I'm almost 99% certain some kid is out there right now, pitching some new protocol to a group of "really important people", describing how "no guys, really... this time it's different".

Here's the problem: people genuinely WANT to share data. However:

They don't want to read your 'perfect and fits every possible use case'... 100+ page 'standard'
They don't want to try and understand whatever code you've managed to cobble together (in a language they're not used to programming in)
They DO NOT HAVE TIME TO CONFORM TO YOUR PERFECTION
If you put the bulk of the data sharing effort on them, they will almost certainly NOT participate
If there's a risk of vendor or regulatory capture, most people will simply go off and do stuff on their own
IF THE PROCESS OF CHANGING THE STANDARD REQUIRES A MEETING RATHER THAN A PULL REQUEST, the market will generally work around that
Should your standard actually start being adopted, as it's tested in the wild you'll find obvious things you never thought about: edge cases that affect real people sharing real data. If your standard can't quickly adapt, the market will just route around it
If you ship with CSV, everyone will use it
If you ship with JSON, lots of that 'everyone' segment will probably, eventually use it
If you ship with XML, a few people will use it
If you provide a client with your standard, most people will use it. Some will try to re-invent your code by parsing the data and NOT READING ANYTHING YOU'VE WRITTEN (and then ask you why their client doesn't work, without testing your client side by side)

Ok, so what?

Over the years, I've learned that if you accept the philosophy "F#@^ your formats, just gimme your data" and build your tools using that methodology, you'll share a lot more data. Don't get me wrong, you can still build standards; the way in which you build them just changes. Instead of arguing with non-operators over "what to call an address" and "what enumerated values it should include", you test your protocols in the wild and submit pull requests to adapt them. You'll learn quickly what works, what doesn't, and what frankly was just a stupid idea. Protocol development is a slightly different topic for a later date, but it leads into the question: "how do I get the data as the protocols are evolving?"

Make your tools SMRT'er

up and running with smrt..

Example SMRT Config

# cif-smrt configuration file to pull feeds from csirtg.io
# For more information see https://csirtg.io
#
# If no token is given, the feed by default is a "limited feed"
# provided by https://csirtg.io. The limits of the "limited feed"
# are:
#
# 1. Only results from the last hour are returned
# 2. A maximum of 25 results are returned per feed
#
# To remove the limits, sign up for an API key at https://csirtg.io

parser: csv
token: 'CSIRTG_TOKEN'  # ENV['CSIRTG_TOKEN'] <get one at https://csirtg.io >
limit: 250
defaults:
  provider: csirtg.io
  altid_tlp: white
  altid: https://csirtg.io/search?q={indicator}
  tlp: white
  confidence: 9
  values:
    - null
    - indicator
    - itype
    - portlist
    - null
    - null
    - protocol
    - application
    - null
    - null
    - lasttime
    - description
    - null

feeds:
  # A feed of IP addresses blocked by a firewall (e.g. port scanners)
  port-scanners:
    remote: https://csirtg.io/api/users/csirtgadgets/feeds/port-scanners.csv
    defaults:
      tags:
        - scanner

  # A feed of URLs seen in the message body of UCE email. Do not alert or block
  # on these urls without additional post-processing.
  uce-urls:
    remote: https://csirtg.io/api/users/csirtgadgets/feeds/uce-urls.csv
    defaults:
      tags:
        - uce
        - uce-url

  # A feed of email addresses seen in UCE email. Do not alert or block on these
  # email addresses without additional post-processing.
  uce-email-address:
    remote: https://csirtg.io/api/users/csirtgadgets/feeds/uce-email-addresses.csv
    defaults:
      tags:
        - uce
        - uce-email-address

  # A feed of IP addresses seen delivering UCE email. This could be a machine that
  # is compromised or a user account has been compromised and used to send UCE.
  uce-ip:
    remote: https://csirtg.io/api/users/csirtgadgets/feeds/uce-ip.csv
    defaults:
      tags:
        - uce
        - uce-ip
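
To make the `values` list in the config above a bit more concrete, here is a minimal Python sketch of positional column mapping: each CSV column is named by its position, `null` entries are skipped, and the top-level and per-feed `defaults` are merged into every record. This is not SMRT's actual code; `parse_feed`, the trimmed `DEFAULTS`, and the sample row are all made up for illustration.

```python
import csv
import io

# Hypothetical, trimmed-down copies of the config above (not SMRT internals).
DEFAULTS = {"provider": "csirtg.io", "tlp": "white", "confidence": 9}
VALUES = [None, "indicator", "itype", "portlist", None, None,
          "protocol", "application", None, None, "lasttime", "description", None]


def parse_feed(text, values, defaults, feed_defaults=None):
    """Yield one dict per CSV row, naming columns by their position."""
    # Per-feed defaults (e.g. tags) are layered over the top-level defaults.
    base = {**defaults, **(feed_defaults or {})}
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(base)
        for name, cell in zip(values, row):
            # A None (YAML null) name means "ignore this column".
            if name is not None and cell != "":
                record[name] = cell
        yield record


# A made-up row shaped like the 13-column layout above; "x" marks ignored columns.
SAMPLE = (
    "# made-up row for illustration\n"
    "1,192.0.2.10,ipv4,22,x,x,tcp,ssh,x,x,2018-01-06T08:14:04Z,port scan,x\n"
)

records = list(parse_feed(SAMPLE, VALUES, DEFAULTS, {"tags": ["scanner"]}))
print(records[0])
```

The point of the sketch is that the YAML carries all the knowledge about the feed's layout; the parser itself stays generic.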

Awesome, you wrote a Swiss army knife that STILL requires a heck of a lot of YAML to make it work...

True... but if you accept the idea that SMRT is really a platform, and dig a little into the way it parses traditional CSV (among other delimited file formats), what happens when we teach the platform how to detect certain elements in a feed? You can guess where this is headed... If your tools can translate things on the fly to the protocol YOU want, you no longer need to learn Spanish to visit Spain, you just need your phone.
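
As a rough idea of what "detecting certain elements in a feed" could look like, here is a small Python sketch that guesses what kind of indicator a cell holds without being told the column layout. It is only an assumption about the approach, not how SMRT does it: `guess_itype`, `find_indicator`, the two regexes, and the type names it returns are all hypothetical and deliberately loose.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Deliberately loose patterns; real-world feeds need more care than this.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
FQDN_RE = re.compile(r"^(?=.{1,253}$)([a-z0-9-]{1,63}\.)+[a-z]{2,63}$", re.I)


def guess_itype(cell):
    """Return a guessed indicator type for one cell, or None if unrecognized."""
    cell = cell.strip()
    try:
        ip = ipaddress.ip_address(cell)
        return "ipv4" if ip.version == 4 else "ipv6"
    except ValueError:
        pass  # not an IP address, keep trying
    try:
        if urlparse(cell).scheme in ("http", "https"):
            return "url"
    except ValueError:
        pass  # malformed URL-like text, keep trying
    if EMAIL_RE.match(cell):
        return "email"
    if FQDN_RE.match(cell):
        return "fqdn"
    return None


def find_indicator(row):
    """Return the first cell in a row that looks like an indicator."""
    for cell in row:
        itype = guess_itype(cell)
        if itype:
            return {"indicator": cell.strip(), "itype": itype}
    return None


# Made-up rows with different layouts; no column mapping is supplied.
print(find_indicator(["1", "192.0.2.10", "22", "tcp"]))
print(find_indicator(["2018-01-06", "spam@example.com", "uce"]))
```

With something along these lines, the indicator column no longer has to be spelled out by hand for every feed, which is the direction the YAML above is meant to shrink toward.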

The goal isn't making the tool understand the complexity of all the different ways we can write YAML; the goal... is to get rid of the YAML. Did I also mention that SMRT doesn't JUST push parsed indicators to CIF?