Automated Threat Hunting Within Linux Logs Using DBSCAN

Using Machine Learning for Automated Threat Hunting

When it comes to automated threat hunting at scale using artificial intelligence, neural networks tend to be the better choice of model for identifying anomalies in your data. Neural networks are designed to handle large amounts of data and have proven performant when the right architecture is chosen for the problem. For organizations that have hundreds or thousands of users or systems, it’s a no-brainer to use neural networks to perform anomaly detection. However, there are cases where your threat hunt is narrowly focused on one or two users or systems that don’t produce a lot of data. You may even have threat hunts where the timeframe of your hunt (e.g. over the past month or couple of weeks) limits the amount of data you can actually gather. In QFunction’s own experimentation with this, it turns out that neural networks may not be the best way to find anomalies in that kind of data. You can spend hours tinkering with a neural network’s hyperparameters only to find out that it simply doesn’t work well for your data. Instead, you may be better off using more traditional machine learning algorithms for this task. We’ll look at an example of one of these machine learning algorithms: DBSCAN.

Understanding DBSCAN

DBSCAN is an acronym for “Density-Based Spatial Clustering of Applications with Noise”. It’s a clustering algorithm that groups similar data points together (which, from a cybersecurity point of view, can be seen as regularly occurring activity). It performs this clustering by taking your data points and seeing how many of them fall in the same neighborhood as your other data points. Consequently, data points that don’t fall into any cluster are your anomalies, which is what threat hunters find more interesting. Without going too much into the math and technical details, DBSCAN works by supplying it three things: your data, the minimum number of data points that should be considered a cluster, and a value named epsilon which defines the maximum distance between two data points for one to be considered in the neighborhood of the other.

For example, if you supply DBSCAN with a value of 5 for the minimum data points and an epsilon of 0.8, clusters would be established around data points that have 5 or more other data points within the neighborhood of those original data points, where the neighborhood is defined as anything within 0.8 units of the original data points. There’s a better way of stating this mathematically, but we’re trying to avoid mathematics here, so bear with the explanation.
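To make this concrete, here is a small sketch using scikit-learn’s DBSCAN implementation on a handful of made-up 2-D points. The data and parameter values below are purely illustrative, not the log data from this post:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one point far away from everything else.
points = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # group A
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],   # group B
    [9.0, 0.0],                           # isolated point
])

# min_samples is the minimum number of points for a cluster; eps is epsilon.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# Points in a cluster get a label of 0, 1, 2, ...; noise points get -1.
# The isolated point ends up labeled -1 -- that's our "anomaly".
print(labels)
```

Everything that comes back labeled -1 is what a threat hunter would zero in on.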

From a time complexity standpoint, DBSCAN runs in O(n log n) time on average and O(n^2) in the worst case. In programming terms, the average case is a bit worse than a single for loop over the data, while the worst case behaves like a for loop nested within another for loop. This means that DBSCAN will take longer with larger datasets, which is why, if you’re dealing with a large number of data points, it may be better to choose another algorithm or architecture. For smaller datasets, like the one we’re dealing with in the upcoming example, it works just fine.

How to Use DBSCAN for Automated Threat Hunting

Now that we have a rough understanding of DBSCAN, let’s use it to threat hunt within a Linux log file. The Linux log file has around 2000 lines in it, which does not constitute a large dataset. This means that it should work well with the DBSCAN algorithm. We’ll follow the steps below to perform the hunt:

  1. Preprocess the Linux logs to keep only the main parts of the message (pretty much everything after the \[\d+\]: pattern, in regex speak)

  2. Vectorize your data (which means to turn your data into numbers, preferably numbers between 0 and 1)

  3. Feed the vectorized data to DBSCAN, setting the minimum data points and the epsilon (which in this case will be 5 and 0.8, respectively)

  4. Print the values that don’t fall into clusters, which are the anomalies

Optionally, you can also visualize the data to see the clusters. Doing so requires principal component analysis (PCA), which reduces the dimensionality of the data down to 3 dimensions so that it can be plotted. PCA itself is beyond the scope of this article, but explanations for it can be found on data science sites; we will use it here only to visualize the clusters.

The first step can be seen in the following code:

Preprocessing the Linux logs

This code reads and preprocesses the Linux logs so that only the main part of the message is kept.
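For readers following along without the screenshots, a minimal sketch of this preprocessing step might look like the following. The filename `Linux_2k.log` and the classic syslog layout (e.g. `Jun 14 15:16:01 combo sshd(pam_unix)[19939]: message...`) are assumptions here, not taken verbatim from the post:

```python
import re

def preprocess(path):
    """Read a Linux log file and keep only the message after the [pid]: token."""
    messages = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Keep only the part of the message after the "[pid]:" characters.
            match = re.search(r"\[\d+\]:\s*(.*)", line)
            if match:
                messages.append(match.group(1).strip())
            else:
                # Lines without a PID keep their full content.
                messages.append(line.strip())
    return messages

# messages = preprocess("Linux_2k.log")  # assumed filename
```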

The second step looks like the following:

Vectorizing the preprocessed Linux logs

Here we are assigning a number between 0 and 1 to each of the 256 possible characters that can appear in the Linux logs. We then loop through each line of the preprocessed Linux logs and look up each character of each line in that vocabulary, effectively vectorizing the line into a NumPy array. Determining the best way to vectorize lines is highly experimental, and there are numerous ways of doing this. However, this approach works for this scenario, so we’ll move forward with it.
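A sketch of one way this vectorization could be implemented is shown below. The fixed vector length `MAX_LEN` is an assumed parameter (DBSCAN needs equal-length vectors), not something taken from the post:

```python
import numpy as np

MAX_LEN = 100  # assumed fixed vector length; lines are padded/truncated to this

# Map each of the 256 possible byte values to a number between 0 and 1.
vocab = {i: i / 255.0 for i in range(256)}

def vectorize(lines, max_len=MAX_LEN):
    """Turn each log line into a fixed-length vector of values in [0, 1]."""
    vectors = np.zeros((len(lines), max_len), dtype=np.float32)
    for i, line in enumerate(lines):
        # Iterating over a bytes object yields integers 0-255.
        encoded = line.encode("utf-8", errors="ignore")[:max_len]
        for j, byte in enumerate(encoded):
            vectors[i, j] = vocab[byte]
    return vectors
```

Identical lines map to identical vectors, which is what lets DBSCAN cluster the regularly occurring log messages together.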

The third step shows the DBSCAN code as well as the optional principal component analysis step discussed above:

Executing the DBSCAN algorithm

Choosing the values for the minimum data points and epsilon parameters is highly experimental, and is effectively a matter of trial and error. The values of 5 and 0.8 worked here, but we very well could have used 4 and 0.9. The beauty of data science is that it lets you experiment with different values to best approach your problem, so feel free to play around with these numbers. The visualization of the clusters can be seen below:
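A minimal sketch of this step might look like the following, using scikit-learn’s DBSCAN and PCA. The `vectors` array here is random stand-in data for illustration, not the actual vectorized logs:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; swap for an interactive backend locally
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Stand-in for the vectorized log lines produced in the previous step.
rng = np.random.default_rng(0)
vectors = rng.random((200, 100)).astype(np.float32)

# min_samples=5 and eps=0.8, the values discussed above.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(vectors)

# Optional: reduce to 3 dimensions with PCA so the clusters can be visualized.
reduced = PCA(n_components=3).fit_transform(vectors)
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2], c=labels)
plt.savefig("clusters.png")
plt.close(fig)
```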

Visualization of the clusters in 3 dimensions

Finally, we need to see all the lines that did not fall into clusters:

Printing the lines that didn’t fall within clusters
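A sketch of this final step, assuming `messages` holds the preprocessed log lines and `labels` holds the cluster labels from DBSCAN (stand-in data is used here for illustration):

```python
# In DBSCAN's output, a label of -1 means the point fell into no cluster.
messages = ["session opened for user news", "session opened for user news",
            "ROOT LOGIN ON tty2"]          # stand-in preprocessed lines
labels = [0, 0, -1]                        # stand-in DBSCAN labels

anomalies = [msg for msg, label in zip(messages, labels) if label == -1]
for msg in anomalies:
    print(msg)
```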

This code shows all of the lines within the Linux logs that did not fall into clusters, which can be considered the anomalies of the dataset. In total, about 130 of the 2000 lines did not fall into clusters, which is a much more manageable number to threat hunt through. After taking a look at the anomalies, you can see some startup logs as well as some failed authentication attempts which could be worth exploring. However, there are three lines that are of particular interest from a cybersecurity standpoint:

root[2421]: ROOT LOGIN ON tty2
ANONYMOUS FTP LOGIN FROM 84.102.20.2, (anonymous)
ANONYMOUS FTP LOGIN FROM 84.102.20.2, (anonymous)

Best security practices state that you should never log in to a Linux system directly as the root user. They also call for disabling anonymous FTP access on any file sharing server. These lines will most likely need to be investigated further to confirm that this server has not been compromised.

Conclusion

As shown above, the threat hunt on this server has yielded actionable findings that can be shown to cybersecurity teams or system administration teams looking to improve their security. We were also able to successfully utilize the DBSCAN machine learning algorithm in order to find anomalies that translated to threats in the data. The Linux log file used in this example can be found on Kaggle here. The code shown within this post can be found on the QFunction GitHub here.

If you’re interested in automated threat hunting in your environment, check out how QFunction performs AI-based threat hunting! And if you’re interested in seeing a threat hunt that utilized a neural network, check out how AI was used to threat hunt Zeek network logs!
