Author: Huy Mai
Major: Computer Science
During the fall semester of 2020, I conducted research for the Data Science and Artificial Intelligence Lab under Dr. Justin Zhan. I explored network intrusion detection, an area that lies within the fields of machine learning and computer security. The number of attacks in which unauthorized traffic enters a network is increasing. Recent advances aim to mitigate this problem, including the development of network intrusion detection systems. The primary function of a network intrusion detection system (NIDS) is to detect malicious network traffic and raise an alert in the event of an attack. My project focused on anomaly-based network intrusion detection, which identifies unauthorized network activity based on its deviation from normal network activity. This type of NIDS can use supervised, semi-supervised, or unsupervised learning approaches to perform the intrusion detection. Considering the pitfalls of an NIDS that relies on supervised learning, including the strenuous task of labelling network records in order to train the model, I focused my attention on developing a robust unsupervised learning algorithm that would be integrated into a network intrusion detection model. In the next few paragraphs, I highlight two components of my proposed unsupervised intrusion detection algorithm: sub-space clustering (SSC) and evidence accumulation (EA).
The goal of EA is to combine results from multiple clusterings into one matrix that better reflects the natural groupings. I specifically use Evidence Accumulation for Ranking Outliers (EA4RO), which was designed to highlight anomalous network flows in different sub-spaces. EA4RO constructs a dissimilarity vector that accumulates the Euclidean distance between the outliers found in each sub-space and the centroid of the cluster containing the inliers within that sub-space. The dissimilarity vector is then sorted from highest to lowest dissimilarity, and the anomaly detection threshold is set at the point where the slope of the sorted dissimilarity values changes sharply. Two anomaly-score algorithms are used to compute the outliers in each sub-space: (1) Isolation Forest, which detects anomalies as the records with short average path lengths across the isolation trees of an ensemble, and (2) Histogram-Based Outlier Score (HBOS), which identifies global anomalies from histograms, where a histogram is constructed for each feature in the dataset. Both algorithms were chosen for their linear time complexity.
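The accumulation and thresholding steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the random choice of two-feature sub-spaces, the use of scikit-learn's `IsolationForest` as the only per-sub-space detector, and the reading of the "slope break" as the largest drop between consecutive sorted scores are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def accumulate_dissimilarity(X, n_subspaces=20, subspace_dim=2, seed=0):
    """Sum, per record, its Euclidean distance to the inlier centroid in
    every sub-space where an Isolation Forest flags it as an outlier.

    Hypothetical helper: the sub-spaces are random feature subsets, which
    is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    dissimilarity = np.zeros(n_samples)
    for _ in range(n_subspaces):
        cols = rng.choice(n_features, size=subspace_dim, replace=False)
        sub = X[:, cols]
        forest = IsolationForest(
            n_estimators=50, random_state=int(rng.integers(1 << 31))
        )
        labels = forest.fit_predict(sub)  # +1 = inlier, -1 = outlier
        outliers = labels == -1
        if outliers.all() or not outliers.any():
            continue  # nothing to compare in this sub-space
        centroid = sub[~outliers].mean(axis=0)
        dissimilarity[outliers] += np.linalg.norm(
            sub[outliers] - centroid, axis=1
        )
    return dissimilarity


def slope_break_threshold(dissimilarity):
    """Sort scores from highest to lowest and return the score that sits
    just above the largest drop between consecutive values.

    Assumption: "largest drop" is one simple reading of the slope break.
    """
    ranked = np.sort(dissimilarity)[::-1]
    drops = ranked[:-1] - ranked[1:]
    return ranked[int(np.argmax(drops))]


# Toy data (synthetic, not NSL-KDD or UNSW-NB15): 200 normal records
# plus 5 records placed far away from them.
rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(200, 6))
attacks = rng.normal(10.0, 0.5, size=(5, 6))
X = np.vstack([normal, attacks])

scores = accumulate_dissimilarity(X)
threshold = slope_break_threshold(scores)
flagged = scores >= threshold  # records ranked at or above the break
```

Because the threshold depends entirely on where the sorted curve breaks, a different break rule (for example, one based on a smoothed slope) could flag a very different set of records on real traffic.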
Two datasets that are frequently used in the network intrusion detection literature were used to evaluate the algorithm. The first is NSL-KDD, which lacks redundant records and is therefore more difficult for classifiers to learn. The other is UNSW-NB15, which was created by researchers in Australia to include more modern examples than older sets such as KDD and NSL-KDD.
For the overall data processing, I implemented a Python program that uses libraries such as scikit-learn and PyOD for their built-in Isolation Forest and HBOS implementations. I also applied one-hot encoding to each dataset (both contain a few categorical features) and selected features by relevance using the F-test. The results proved to be sporadic and therefore problematic. Under the conditions of this project, the scheme does not appear to perform well with either anomaly-score algorithm. This could be attributed to a number of factors. One factor could be the calculation of the threshold, specifically the implementation of the slope break. Another could be the feature selection method: next time, I could experiment with selecting a larger number of features, although this would increase the computation time of the network intrusion detection algorithm.
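As an illustration of the preprocessing described above, the following sketch one-hot encodes a categorical column and keeps the top-scoring features under an ANOVA F-test using scikit-learn. The helper name `encode_and_select`, the toy columns, and the value of `k` are hypothetical. Note that `f_classif` needs class labels, so this sketch assumes labels are available at the feature selection stage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import OneHotEncoder


def encode_and_select(X_num, X_cat, y, k=10):
    """One-hot encode the categorical columns, append them to the numeric
    columns, and keep the k features with the highest F-test scores.

    Hypothetical helper for illustration only.
    """
    encoder = OneHotEncoder(handle_unknown="ignore")
    X_onehot = encoder.fit_transform(X_cat).toarray()  # dense 0/1 columns
    X_all = np.hstack([X_num, X_onehot])
    selector = SelectKBest(score_func=f_classif, k=min(k, X_all.shape[1]))
    X_selected = selector.fit_transform(X_all, y)
    return X_selected, selector.get_support(indices=True)


# Toy records (synthetic): one informative numeric feature, one noise
# feature, and one categorical "protocol" column.
n = 120
rng = np.random.default_rng(0)
y = (np.arange(n) >= n // 2).astype(int)
informative = y * 5.0 + rng.normal(0.0, 1.0, size=n)
noise = rng.normal(0.0, 1.0, size=n)
X_num = np.column_stack([informative, noise])
X_cat = np.array(["tcp", "udp", "icmp"])[np.arange(n) % 3].reshape(-1, 1)

X_selected, kept = encode_and_select(X_num, X_cat, y, k=2)
```

Raising `k` keeps more of the encoded columns, at the cost of more work for every sub-space detector that runs afterwards.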
I am grateful to Dr. Zhan for allowing me to go through the whole process and to see what’s expected at each step in order to come up with a strong, publishable manuscript, potentially in a scientific journal or at a conference. I look forward to diving deeper into the algorithm that I have proposed and making the proper adjustments to it (e.g., considering a semi-supervised learning algorithm where the model is trained on a partially labelled set, choosing more appropriate evaluation metrics, etc.) in order to not only use this project as my honors thesis but also produce a state-of-the-art algorithm within the fields of machine learning and computer security.