Ever wondered "how many SSH scanners are there in the world?"
It's one thing to think of "statistics" in the general sense. For instance,
"100 unique IPs scanned my darknet today".
This doesn't really tell me anything useful, other than (assuming DHCP churn is nil in a given 24 hour period) there's a bit of noise on the line. 100 by itself isn't a really useful number, and it's probably not even statistically relevant. Was it a holiday? Was part of the Internet down today? Was it the same device behind a series of NATs?
So the next logical step is to start abstracting that out a bit and ask the question "well, how many distinct /24s scanned me over a longer period?", thinking maybe that'll add some context. In some cases, where you're trying to identify and maybe block some of the /24s that may have shady or poor security postures, it does provide a bit more context. But really you're still just measuring your little corner of the Internet; you're not really abstracting out the bigger picture.
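For what it's worth, rolling unique IPs up into distinct /24s is only a few lines. This is a rough sketch using Python's standard ipaddress module; the function name and the sample addresses (documentation ranges, not real scanners) are made up for illustration.

```python
import ipaddress

def distinct_slash24s(ips):
    """Collapse IPv4 source addresses into the set of /24 networks they fall in."""
    nets = set()
    for ip in ips:
        # strict=False accepts a host address and returns its enclosing /24
        nets.add(ipaddress.ip_network(f"{ip}/24", strict=False))
    return nets

# Hypothetical source addresses pulled from a darknet log
seen = ["192.0.2.10", "192.0.2.77", "198.51.100.5", "203.0.113.9", "203.0.113.200"]
nets = distinct_slash24s(seen)
print(len(seen), "unique IPs ->", len(nets), "distinct /24s")
```

It's a coarser count, but it's still a count of what hit your sensor, not of what's out there.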
As it turns out, we can borrow statistical measures that have already been proven and are pretty solid in other disciplines. Here we tried applying the "Mark and Recapture" method, which states:
Mark and recapture is a method commonly used in ecology to estimate an animal population's size. A portion of the population is captured, marked, and released. Later, another portion is captured and the number of marked individuals within the sample is counted. Since the number of marked individuals within the second sample should be proportional to the number of marked individuals in the whole population, an estimate of the total population size can be obtained by dividing the number of marked individuals by the proportion of marked individuals in the second sample. The method is most useful when it is not practical to count all the individuals in the population...
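To make that arithmetic concrete, here's a minimal sketch of the estimate described above (usually called the Lincoln-Petersen estimator). How you define a "capture" is an assumption here: the sketch treats the IPs seen in one observation window as the marked set and the IPs seen in a later window as the second sample. The function name and the addresses are invented for the example.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate total population as M * C / m.

    M = individuals marked in the first sample
    C = individuals caught in the second sample
    m = individuals in the second sample that were already marked
    """
    marked = set(first_sample)
    caught = set(second_sample)
    recaptured = marked & caught
    if not recaptured:
        raise ValueError("no recaptures; the estimate is undefined")
    return len(marked) * len(caught) / len(recaptured)

# Made-up data: 100 scanner IPs in window one, 80 in window two, 10 seen in both
window1 = {f"10.0.0.{i}" for i in range(100)}
window2 = {f"10.0.0.{i}" for i in range(90, 170)}
estimate = lincoln_petersen(window1, window2)
print(estimate)  # 100 * 80 / 10 = 800.0
```

Note the sensor only ever saw 170 distinct addresses, yet the overlap suggests a population closer to 800. That's the whole trick.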
If you ask any network engineer, they'll probably suggest that "there's no way to find the population of SSH scanners in the world without deploying a billion SSH honeypots". Which is partially true. Given the breadth of the Internet, DHCP churn, node churn, latency, vulnerable victim populations, etc., there's probably about as good a way of getting to 99% accuracy as there is with ... the common cold.
The number isn't 5 and it's not 5 billion.
It's somewhere in between. The trick is narrowing that range down to where you're 68-84% confident. Is it between 25,000 and 250,000? Maybe. But to figure that out, you probably only need a handful of nodes on each continent, which you can do pretty quickly with an AWS account. Then your only concern is 'nodes not attacking AWS IP space...', which is a far simpler problem once you get the process down.
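One way to put a rough range around a mark-and-recapture estimate is sketched below. To be clear, this swaps in the Chapman variant of the estimator and a normal approximation for the interval; that's a common textbook choice, not necessarily what we ran, and the counts are invented. With z=1.0 the interval is roughly a 68% one, and it gets shaky when the number of recaptures is small.

```python
import math

def chapman_interval(M, C, m, z=1.0):
    """Chapman estimate of population size with an approximate interval.

    M = marked in first sample, C = caught in second sample,
    m = recaptured (in both). z=1.0 is roughly 68% under a normal approximation.
    """
    n_hat = (M + 1) * (C + 1) / (m + 1) - 1
    var = (M + 1) * (C + 1) * (M - m) * (C - m) / ((m + 1) ** 2 * (m + 2))
    half_width = z * math.sqrt(var)
    # The population can't be smaller than the distinct individuals actually seen
    floor = M + C - m
    return n_hat, max(n_hat - half_width, floor), n_hat + half_width

# Hypothetical counts: 100 IPs in window one, 80 in window two, 10 in both
n_hat, low, high = chapman_interval(M=100, C=80, m=10)
print(f"estimate ~{n_hat:.0f}, rough 68% range {low:.0f} to {high:.0f}")
```

More nodes and longer windows mean more recaptures, and more recaptures is what tightens the range.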
What's the point?
Well, as with finance, you can't articulate risk if you can't measure it, and measuring it is the first step in understanding the real risk of something. When you figure out that the real number of SSH scanners at any given time on the Internet is less than 1,000,000, the problem of solving for SSH scanners becomes real and the question then becomes: well, where do most of them come from?