STA2018 is a new dataset generated by transforming the network traces of the UNB ISCX Intrusion Detection Evaluation DataSet 2012 [1] into a format suitable for Machine Learning (ML) and Data Mining (DM) tasks. The generation process extracted 193 basic features from the traffic trace files; these were then extended to 550 features (549 independent variables plus one dependent class variable) by employing part of Onut’s feature classification schema [2].
The STA2018 dataset contains the profiled sessions (connections) of the network traffic of five simulation days. Data records are grouped by day, so that each data file aggregates all of the connections within one simulation day. Overall, the transformation process had five main stages:
1. Basic-feature extraction: every PCAP file in the UNB ISCX 2012 dataset was processed using the Bro software to extract 193 features for every ICMP, TCP and UDP connection. These features consist of information that can be extracted from frame and packet headers, such as the source and destination IP addresses and ports, the connection duration and the transport protocol.
2. Validation and connection labelling: the capture of every ICMP, TCP and UDP packet in every PCAP file was validated for accuracy. Every processed connection was then matched to its corresponding flow in the XML file of the UNB ISCX 2012 dataset and labelled with the class provided there {Attack, Normal}.
3. Basic-feature extension: every connection was processed to derive two sets of features, time-based and connection-based. Deriving these features depended on the chronological order of the original connections. Part of Onut’s feature classification schema [2] was used in this phase.
4. Balancing: synthetic Attack records (connections) were generated with the SMOTE algorithm [3] to balance the number of Normal and Attack connections in the dataset. All synthetic records are identifiable by the ‘synthetic’ variable.
5. Clean-up: useless features were removed, and source and destination zone features were then added to reduce the large address space.
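The balancing stage above relies on SMOTE [3], which creates a synthetic minority record by interpolating between an existing minority record and one of its nearest minority-class neighbours. The following is a minimal sketch of that idea only, not the code used to build STA2018. It assumes purely numeric feature vectors, picks the base record at random rather than looping over every minority record as the original algorithm does, and the function name `smote_sketch` and the toy `attack` rows are hypothetical.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic records interpolated between minority records.

    Simplified illustration of the SMOTE idea; assumes numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority records")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base record (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new record at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy Attack records with two numeric features each.
attack = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.1]]
new_rows = smote_sketch(attack, n_synthetic=6, k=2)
```

Because every synthetic record lies on a line segment between two real minority records, it always stays inside the range spanned by the original Attack records.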
The large feature space, combined with the availability of a balanced version of the data, gives the research community a dataset well suited to testing algorithms and analysing the effect of various parameters on the quality of the generated models.
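Since all synthetic records are identifiable by the ‘synthetic’ variable, the original (unbalanced) connections can be separated from the SMOTE-generated ones before an experiment. The sketch below illustrates one way to do this with the Python standard library. The encoding of the flag (`"1"` for synthetic), the `class` and `duration` column names and the inline sample rows are assumptions made for illustration, so check them against the actual data files.

```python
import csv
import io


def split_by_synthetic(rows, flag_column="synthetic", synthetic_value="1"):
    """Split records into (original, synthetic) lists using the flag column."""
    original, synthetic = [], []
    for row in rows:
        if row[flag_column] == synthetic_value:
            synthetic.append(row)
        else:
            original.append(row)
    return original, synthetic


# Hypothetical stand-in for one day's data file.
sample = io.StringIO(
    "duration,class,synthetic\n"
    "0.5,Normal,0\n"
    "1.2,Attack,0\n"
    "1.1,Attack,1\n"
    "0.7,Normal,0\n"
)
rows = list(csv.DictReader(sample))
original, synthetic = split_by_synthetic(rows)
print(len(original), len(synthetic))
```

Training on the full set and evaluating on the original records only is one possible way to study the effect of balancing on model quality.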
[1] Ali Shiravi, Hadi Shiravi, Mahbod Tavallaee, and Ali A. Ghorbani. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31 (3): 357–374, 2012.
[2] Iosif-Viorel Onut and Ali A. Ghorbani. A feature classification scheme for network intrusion detection. International Journal of Network Security, 5 (1): 1–15, 2007.
[3] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16: 321–357, 2002.
- Intrusion Detection
- labelled data
- network security
- Full data
- network traffic