SIEM tools and Confusion matrix

Piyush Panchariya
Feb 18, 2022


What is SIEM?

Security information and event management (SIEM) is a field of computer security in which software products and services combine security information management (SIM) and security event management (SEM). SIEM products provide real-time analysis of security alerts generated by applications and network hardware.

How does SIEM work?

SIEM software first gathers the security log data generated by a variety of sources, such as host systems and security devices like firewalls and antivirus software. The second step is to process these logs and convert them into a standard format.

The next step is to analyze the data to identify and categorize incidents and events. If a security issue is found, an alert is generated. The tool can also produce reports on security incidents and events.
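The normalization step described above can be sketched in a few lines of Python. This is only an illustration: the two log formats, the sample lines, and the field names of the common record (`timestamp`, `source`, `host`, `event`, `detail`) are assumptions made up for this example, not the format of any particular SIEM product.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw log lines from two different sources (invented for illustration).
FIREWALL_LINE = "2022-02-18T10:15:32Z DENY src=203.0.113.7 dst=10.0.0.5 port=22"
ANTIVIRUS_LINE = "18/02/2022 10:16:01|host42|Trojan.Generic|quarantined"

FIREWALL_RE = re.compile(
    r"(?P<ts>\S+) (?P<action>\w+) src=(?P<src>\S+) dst=(?P<dst>\S+) port=(?P<port>\d+)"
)


def normalize_firewall(line):
    """Convert the assumed firewall format into the common record."""
    m = FIREWALL_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized firewall line: {line!r}")
    ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "timestamp": ts.isoformat(),
        "source": "firewall",
        "host": m["dst"],
        "event": m["action"].lower(),
        "detail": f"src={m['src']} port={m['port']}",
    }


def normalize_antivirus(line):
    """Convert the assumed antivirus format into the same common record."""
    ts_raw, host, threat, action = line.split("|")
    # The sample timestamp carries no zone; UTC is assumed here.
    ts = datetime.strptime(ts_raw, "%d/%m/%Y %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "timestamp": ts.isoformat(),
        "source": "antivirus",
        "host": host,
        "event": action.lower(),
        "detail": f"threat={threat}",
    }


normalized = [normalize_firewall(FIREWALL_LINE), normalize_antivirus(ANTIVIRUS_LINE)]
for record in normalized:
    print(record)
```

Once every source is mapped onto the same fields, the later analysis and alerting steps can work on one record shape instead of one parser per device.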

According to research by AlienVault, most businesses are concerned about cloud security threats; 55% are concerned about phishing and 45% about ransomware.

SIEM works by combining two technologies:

1) Security information management (SIM), which collects data from log files for analysis and reports on security threats and events.

2) Security event management (SEM), which conducts real-time system monitoring, notifies network admins about important issues and establishes correlations between security events.

The security information and event management process can be broken down as follows:

1. Data collection — All sources of network security information, e.g., servers, operating systems, firewalls, antivirus software and intrusion prevention systems, are configured to feed event data into a SIEM tool. Most modern SIEM tools use agents to collect event logs from enterprise systems; the logs are then processed, filtered and sent to the SIEM. Some SIEMs allow agentless data collection. For example, Splunk offers agentless data collection in Windows using WMI.

2. Policies — A profile is created by the SIEM administrator, which defines the behavior of enterprise systems, both under normal conditions and during pre-defined security incidents. SIEMs provide default rules, alerts, reports, and dashboards that can be tuned and customized to fit specific security needs.

3. Data consolidation and correlation — SIEM solutions consolidate, parse and analyze log files. Events are then categorized based on the raw data, and correlation rules are applied that combine individual data events into meaningful security issues.

4. Notifications — If an event or set of events triggers a SIEM rule, the system notifies security personnel.
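To make the correlation and notification steps more concrete, here is a minimal sketch of a single correlation rule. The rule itself (three failed logins from the same source within 60 seconds), the event fields (`time`, `type`, `src`) and the function name are all hypothetical and chosen only for illustration; real SIEM products ship their own rule languages.

```python
from collections import defaultdict, deque


def correlate_failed_logins(events, threshold=3, window_seconds=60):
    """Return one alert per source that reaches `threshold` failed logins
    within `window_seconds`. Event times are plain seconds for simplicity."""
    recent = defaultdict(deque)  # source -> times of recent failed logins
    alerted = set()              # sources that already triggered the rule
    alerts = []
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["type"] != "login_failed":
            continue
        times = recent[ev["src"]]
        times.append(ev["time"])
        # Drop failures that fell out of the sliding window.
        while times and ev["time"] - times[0] > window_seconds:
            times.popleft()
        if len(times) >= threshold and ev["src"] not in alerted:
            alerted.add(ev["src"])
            alerts.append({
                "rule": "repeated_failed_logins",
                "src": ev["src"],
                "count": len(times),
                "time": ev["time"],
            })
    return alerts


# Hypothetical, already-normalized events.
events = [
    {"time": 0, "type": "login_failed", "src": "203.0.113.7"},
    {"time": 10, "type": "login_failed", "src": "203.0.113.7"},
    {"time": 15, "type": "login_ok", "src": "198.51.100.2"},
    {"time": 20, "type": "login_failed", "src": "203.0.113.7"},
    {"time": 300, "type": "login_failed", "src": "198.51.100.2"},
]

alerts = correlate_failed_logins(events)
for alert in alerts:
    print("ALERT:", alert)
```

Each failed login on its own is unremarkable; the rule turns three of them from one source into a single security issue, which is what would then be sent to security personnel.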

Security information and event management tools:

Splunk

Splunk Enterprise Security provides real-time threat monitoring, rapid investigations using visual correlations and investigative analysis to trace the dynamic activities associated with advanced security threats.

The Splunk SIEM is available as locally installed software or as a cloud service. It supports threat intelligence feed integration from third-party apps.

IBM QRadar

IBM QRadar collects log data from sources in an enterprise’s information system, including network devices, operating systems, applications and user activities.

The QRadar SIEM analyzes log data in real time, enabling users to quickly identify and stop attacks. QRadar can also collect log events and network flow data from cloud-based applications. This SIEM also supports threat intelligence feeds.

Confusion matrix terminology:

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

Let’s start with an example confusion matrix for a binary classifier.

What can we learn from this matrix?

  • There are two possible predicted classes: “yes” and “no”. If we were predicting whether a cyber attack is happening, for example, “yes” would mean an attack is happening, and “no” would mean no attack is happening.
  • The classifier made a total of 165 predictions (e.g., 165 logs tested for the presence of malicious activity).
  • Out of those 165 logs, the classifier predicted “yes” 110 times, and “no” 55 times.
  • In reality, a cyber attack was actually happening in 105 of the cases, and there was no attack in the other 60.

Let’s now define the most basic terms, which are whole numbers (not rates):

  • true positives (TP): These are cases in which we predicted yes (it is a cyber attack), and there actually is an attack.
  • true negatives (TN): We predicted no, and there is no attack.
  • false positives (FP): We predicted yes, but there is no attack.
  • false negatives (FN): We predicted no, but there actually is an attack.
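As a rough sketch, the four counts can be tallied from a list of actual labels and a list of predictions. The function name and the way the labels are laid out below are assumptions for illustration; the lists are simply arranged to reproduce the 165-log example used in this article.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for a binary classifier.
    True means "attack", False means "no attack"."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # predicted attack, and there was one
        elif not a and not p:
            tn += 1      # predicted no attack, and there was none
        elif not a and p:
            fp += 1      # predicted attack, but there was none
        else:
            fn += 1      # predicted no attack, but there was one
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# 105 logs with a real attack followed by 60 logs without one.
actual = [True] * 105 + [False] * 60
# Predictions arranged to match the example: 100 hits, 5 misses,
# 10 false alarms, 50 correct "no" predictions.
predicted = [True] * 100 + [False] * 5 + [True] * 10 + [False] * 50

counts = confusion_counts(actual, predicted)
print(counts)
```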

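I’ve added these terms to the confusion matrix, and also added the row and column totals:

  n = 165        Predicted: NO    Predicted: YES    Total
  Actual: NO     TN = 50          FP = 10           60
  Actual: YES    FN = 5           TP = 100          105
  Total          55               110               165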

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

  • Accuracy: Overall, how often is the classifier correct?
      • (TP+TN)/total = (100+50)/165 = 0.91
  • Misclassification Rate: Overall, how often is it wrong?
      • (FP+FN)/total = (10+5)/165 = 0.09
      • equivalent to 1 minus Accuracy
      • also known as “Error Rate”
  • True Positive Rate: When it’s actually yes, how often does it predict yes?
      • TP/actual yes = 100/105 = 0.95
      • also known as “Sensitivity” or “Recall”
  • False Positive Rate: When it’s actually no, how often does it predict yes?
      • FP/actual no = 10/60 = 0.17
  • True Negative Rate: When it’s actually no, how often does it predict no?
      • TN/actual no = 50/60 = 0.83
      • equivalent to 1 minus False Positive Rate
      • also known as “Specificity”
  • Precision: When it predicts yes, how often is it correct?
      • TP/predicted yes = 100/110 = 0.91
  • Prevalence: How often does the yes condition actually occur in our sample?
      • actual yes/total = 105/165 = 0.64
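The rates listed above follow directly from the four counts, so they are easy to check in code. The sketch below assumes a hypothetical helper named `confusion_rates`; the formulas are the ones given in the list, applied to the example values TP = 100, TN = 50, FP = 10 and FN = 5.

```python
def confusion_rates(tp, tn, fp, fn):
    """Compute the common rates of a binary confusion matrix."""
    total = tp + tn + fp + fn
    actual_yes = tp + fn
    actual_no = tn + fp
    predicted_yes = tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": tp / actual_yes,    # sensitivity / recall
        "false_positive_rate": fp / actual_no,
        "true_negative_rate": tn / actual_no,     # specificity
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }


rates = confusion_rates(tp=100, tn=50, fp=10, fn=5)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the printed values match the figures in the list.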

In this example, the most dangerous prediction is a false negative: a cyber attack is happening, but the model predicts that there is no malicious activity. Just imagine how much damage an attacker could do to the system in that time, collecting user data and using it for unethical purposes. Even if there is only a one percent chance that the model produces a false negative, that poses a high risk to the security of the data and the privacy of users.
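The share of real attacks that the model misses is usually called the false negative rate, FN/actual yes, which is equal to 1 minus the True Positive Rate. The short sketch below computes it for the example and shows what a one percent miss rate would mean at scale. The function names and the figure of 10,000 attacks are hypothetical and used only for illustration.

```python
def false_negative_rate(tp, fn):
    """Share of real attacks that the classifier fails to flag."""
    return fn / (tp + fn)


def expected_missed_attacks(num_attacks, fnr):
    """Average number of attacks that go undetected at a given miss rate."""
    return num_attacks * fnr


# Miss rate in the article's example: 5 missed out of 105 real attacks.
fnr = false_negative_rate(tp=100, fn=5)
print(f"false negative rate: {fnr:.3f}")

# Hypothetical volume: 10,000 attacks at a 1% miss rate.
missed = expected_missed_attacks(10_000, 0.01)
print(f"expected missed attacks: {missed:.0f}")
```

Even a miss rate that sounds small leaves a meaningful number of attacks undetected once the volume is large.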
