One consideration when applying metrics is that, if they’re used to measure performance, staff are incentivised to ‘optimise’ the metrics themselves, and this can lead to some perverse outcomes. Let’s consider some common SOC metrics, and how they can unintentionally degrade a SOC’s ability to detect threats.
Metric 1. Number of tickets processed
When a suspicious pattern in logs triggers an alert rule, it typically produces a ticket for analysts to triage. The analyst assigned to the ticket then has to assess the alert and decide whether it is likely to be:
- a real attack requiring escalation into an investigation/incident
or
- a false positive due to a quirk of the alerting logic
In the vast majority of SOCs I’ve observed, alert logic generates a lot of false positives. I’ve seen ticket-focussed SOCs where as many as 99% of tickets were being triaged as false positives. At that rate, closing a ticket as a false positive is almost always the expected outcome, and it is far quicker than escalating. This means that an analyst being measured on ‘number of tickets processed’ is incentivised to quickly find a reason to close each ticket as a false positive, rather than to escalate or investigate it.
Metric 2. Time taken to close a ticket
Similar to the above, but the analyst is now also incentivised to click ‘false positive’ as quickly as possible.
Metric 3. Number of detection rules
This is a subtly dangerous metric, as its benefits seem self-evident: it seems logical to presume that more rules to ‘detect bad things’ will result in more chances to ‘detect bad things’.
Unfortunately this is rarely the case.
Such a metric almost always leads to the perverse outcome of ‘alert inflation’: analysts are incentivised to write as many rules as possible simply to make the metric go up. This leads to more false positives, as well as rules that are ineffective. At its worst, I’ve seen individual rules written for individual Indicators of Compromise (IOCs), such as a single IP address.
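The following minimal Python sketch illustrates why rule count says little about coverage. The `Event` shape and the IP addresses (taken from documentation ranges) are hypothetical, and real SIEMs express rules in their own languages; the point is only that one rule per IOC and a single watchlist rule detect exactly the same events, while scoring three and one respectively on a ‘number of rules’ metric.

```python
from dataclasses import dataclass


# Hypothetical log event shape; real SIEM schemas will differ.
@dataclass
class Event:
    src_ip: str
    dst_ip: str


def make_per_ioc_rules(iocs):
    # Anti-pattern: one "rule" per IOC, so every new IP adds one to the rule count.
    # The ip=ip default binds each IP at creation time (avoids late binding).
    return [lambda e, ip=ip: e.dst_ip == ip for ip in iocs]


def make_watchlist_rule(iocs):
    # Alternative: a single rule that checks each event against a set of IOCs.
    watchlist = frozenset(iocs)
    return lambda e: e.dst_ip in watchlist


iocs = ["203.0.113.10", "198.51.100.7", "192.0.2.44"]
per_ioc_rules = make_per_ioc_rules(iocs)
watchlist_rule = make_watchlist_rule(iocs)

events = [Event("10.0.0.5", "198.51.100.7"), Event("10.0.0.6", "192.0.2.200")]

# Both approaches fire on exactly the same events.
for e in events:
    per_ioc_hit = any(rule(e) for rule in per_ioc_rules)
    assert per_ioc_hit == watchlist_rule(e)

print(len(per_ioc_rules), "rules vs 1 rule, same detections")
```

Growing the watchlist leaves the second approach at one rule while the first keeps climbing, which is why a rule-count metric rewards the first.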
Metric 4. Volume of logs collected vs value of logs collected
Effective detection needs good logs, but whilst logs are very useful for incident investigation, logs on their own won’t help with detection. I’ve seen too many SOCs ingesting ever-increasing volumes of logs where those logs often either have limited detection value, or the SOC isn’t actually using them for detection (there are no relevant alerts or threat hunts that require them).
I visited a SOC where one of their largest log feeds by volume had never been set up correctly, so they only had the first 30 characters of each entry. This had never been noticed, which shows that they were not carrying out any meaningful alerting on that feed.
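One way a problem like this could be caught is a simple data-quality check on each feed. The sketch below is a hypothetical heuristic, not a standard tool: it assumes you can sample raw entries from a feed, and flags the feed if a large share of entries sit at exactly the same maximum length, which is what a hard truncation limit tends to produce. The sample lines and the 0.5 threshold are illustrative assumptions.

```python
from collections import Counter


def looks_truncated(entries, threshold=0.5):
    """Hypothetical heuristic: flag a feed if many entries share the maximum length.

    A hard length cap (e.g. a misconfigured field size) piles entries up at
    exactly one length; naturally variable log lines rarely do.
    """
    lengths = [len(e) for e in entries]
    if not lengths:
        return False
    max_len = max(lengths)
    at_max = Counter(lengths)[max_len]
    return at_max / len(lengths) >= threshold


# Illustrative sample lines (hypothetical), plus the same lines cut to 30 characters.
sample_raw = [
    "2024-05-01T10:00:01Z sshd[412]: Accepted publickey for alice from 192.0.2.10 port 51122",
    "2024-05-01T10:00:07Z sshd[415]: Failed password for invalid user admin from 198.51.100.23 port 40022",
    "2024-05-01T10:01:12Z sudo: alice : TTY=pts/0 ; PWD=/home/alice ; USER=root ; COMMAND=/bin/ls",
]
sample_truncated = [line[:30] for line in sample_raw]

print("raw feed flagged:", looks_truncated(sample_raw))
print("truncated feed flagged:", looks_truncated(sample_truncated))
```

A real check would use a much larger sample, but even this crude test would flag a feed where every entry is exactly 30 characters long.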
Worse still, collecting increasing volumes of low-value logs generally means the existing logs can be retained for less time, as additional logs incur additional cost or take up disk space.
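The retention trade-off is simple arithmetic under a fixed storage budget. The figures below are hypothetical and the model ignores compression, tiered storage and per-source retention policies, but it shows how adding a large, low-value feed shrinks retention for every log, including the valuable ones.

```python
def retention_days(storage_gb: float, daily_ingest_gb: float) -> float:
    """Days of logs that fit in a fixed storage budget (ignores compression/tiering)."""
    return storage_gb / daily_ingest_gb


STORAGE_GB = 30_000           # hypothetical fixed storage budget
high_value_gb_per_day = 200   # hypothetical existing, detection-relevant logs
low_value_gb_per_day = 300    # hypothetical new feed with little detection value

before = retention_days(STORAGE_GB, high_value_gb_per_day)
after = retention_days(STORAGE_GB, high_value_gb_per_day + low_value_gb_per_day)

# Prints: Retention before: 150 days, after: 60 days
print(f"Retention before: {before:.0f} days, after: {after:.0f} days")
```

In this example the new feed more than doubles daily volume, so the useful logs can only be kept for 60 days instead of 150, unless the budget grows to match.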