Code associated with the paper: DEFENDIFY: Defense Amplified with Transfer Learning for Obfuscated Malware Framework
Before running this project, consider the data flow of the solution so that you can decide which component(s) to execute. The data flow shown below includes the following components and their functionality:
- Entropy tester : This classifier detects whether or not a sample is obfuscated
- Image Transformer : This module converts a sample into an image
- Not Encoded Classifier : This classifier detects whether a non-encoded sample is Goodware or Malware
- Shikata Ga Nai / XOR Classifier : This classifier detects whether a Shikata Ga Nai / XOR encoded sample is Goodware or Malware
You may decide to test each one of the modules that compose the flow or go directly to the deployed web application version.
The Python code to identify whether a binary file has been obfuscated (XOR/Shikata Ga Nai) can be found in the notebook entropy_tester/entropy_tester.ipynb. In the cell named Create and write entropy data, the entropies of every file in the folders indicated by the user (in this case, the folders containing the samples from the previous section) are extracted and saved in a file named entropies.csv, which can be found in the entropy_tester folder. Then, the cell named Read entropy data reads the CSV file. The remaining cells in the notebook test the performance of several machine learning algorithms when identifying obfuscated binary files based on their entropy.
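To make the entropy extraction step more concrete, the following is a minimal sketch, not the notebook's actual code. It assumes the feature is the Shannon entropy of the byte distribution of each file (the notebook may compute it differently), and the function names and CSV column names are hypothetical.

```python
import csv
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def write_entropies(folders, out_csv="entropies.csv"):
    """Write one CSV row per file (folder, file name, entropy); return the row count.

    The column names are an assumption made for this sketch.
    """
    rows = 0
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["folder", "file", "entropy"])
        for folder in folders:
            for path in sorted(Path(folder).iterdir()):
                if path.is_file():
                    entropy = shannon_entropy(path.read_bytes())
                    writer.writerow([folder, path.name, entropy])
                    rows += 1
    return rows
```

Encoded or packed binaries tend to have a flatter byte distribution, and therefore a higher entropy, than plain executables, which is why a single value per file can already be a useful feature for a classifier. Keep the output CSV outside the scanned folders so that it is not picked up as a sample.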
The script binary2image.py transforms executable files into greyscale images, as described in the paper. The script can be used with python binary2image.py input_folder output_folder, where input_folder contains the binary files and output_folder will contain the resulting images.
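The idea behind the conversion can be illustrated with a minimal sketch, which is not the code of binary2image.py. It assumes that every byte becomes one greyscale pixel (0 to 255), that rows have a fixed width of 256 pixels, and that the last row is zero-padded; the actual script may choose the width and the image format differently. To stay within the standard library, the sketch writes binary PGM files, and all function names are hypothetical.

```python
import sys
from pathlib import Path


def bytes_to_rows(data: bytes, width: int) -> list[bytes]:
    """Split data into rows of `width` pixels, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    padded = data + bytes(-len(data) % width)  # pad up to a multiple of width
    return [padded[i:i + width] for i in range(0, len(padded), width)]


def write_pgm(data: bytes, out_path: Path, width: int = 256) -> tuple[int, int]:
    """Write data as a binary PGM (P5) greyscale image; return (width, height)."""
    if not data:
        raise ValueError("cannot build an image from an empty file")
    rows = bytes_to_rows(data, width)
    header = f"P5\n{width} {len(rows)}\n255\n".encode("ascii")
    out_path.write_bytes(header + b"".join(rows))
    return width, len(rows)


def convert_folder(input_folder: str, output_folder: str, width: int = 256) -> int:
    """Convert every non-empty file in input_folder; return how many were written."""
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file() and path.stat().st_size > 0:
            write_pgm(path.read_bytes(), out_dir / (path.name + ".pgm"), width)
            written += 1
    return written


if __name__ == "__main__" and len(sys.argv) == 3:
    print(convert_folder(sys.argv[1], sys.argv[2]), "images written")
```

Saved under a hypothetical name such as sketch_binary2image.py, it would be called in the same way as the original script: python sketch_binary2image.py input_folder output_folder.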
If you want to test the conversion of a goodware binary into images yourself, you may use the Python script binary2image.py, as shown in the image below, where it is applied to a folder that contains a goodware file (Zotero) in different versions: not obfuscated, XOR obfuscated and Shikata Ga Nai obfuscated.
Figures: Zotero in three versions (not obfuscated, XOR obfuscated and Shikata Ga Nai obfuscated); executing the binary2image script to convert the Zotero binaries to images; images obtained for Zotero not encoded, encoded with bloxor, and encoded with Shikata Ga Nai.
We may do a similar test with a malware sample, as shown below:
Figures: malware in three versions (not obfuscated, XOR obfuscated and Shikata Ga Nai obfuscated); executing the binary2image script to convert the malware binaries to images; images obtained for the malware not encoded, encoded with bloxor, and encoded with Shikata Ga Nai.
The notebook cnn_tester.ipynb reads the images from the previous step and tests the performance of four different CNN architectures when identifying malware: ResNet18, ResNet34, EfficientNetB3, and EfficientNetV2. The notebook can be executed completely just by setting the value of the variable path to the folder containing the images to be processed.
The samples (goodware and malware) used to train the CNN models were obtained from different sources:
The goodware samples used for this project were obtained from two different sources.
The first source was PortableApps. We manually downloaded all the software this service offers and moved all the files to a folder named "Goodware".
In parallel, for the second source, we installed a 32-bit version of Windows 10 and copied all files from the System32 folder to the same "Goodware" folder.
Then the preprocessing phase started. In the preprocessing, we used the file command from an Ubuntu installation to remove all files whose description did not include the strings PE32 executable and for MS Windows. This process can be seen in the Python script dataset_creation_scripts/cleanType.py. This left us with 457 sample files obtained from PortableApps and 15171 sample files obtained from the Windows installation, for a total of 15628 goodware samples.
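The type filter described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of cleanType.py: it assumes the file command is available on the PATH, the function names are hypothetical, and the default is a dry run that only reports what would be removed.

```python
import subprocess
from pathlib import Path

# Both strings must appear in the output of `file` for a sample to be kept.
REQUIRED_STRINGS = ("PE32 executable", "for MS Windows")


def is_wanted_type(description: str) -> bool:
    """Return True if a `file` description contains every required string."""
    return all(text in description for text in REQUIRED_STRINGS)


def describe(path: Path) -> str:
    """Return the description that `file --brief` prints for path."""
    result = subprocess.run(
        ["file", "--brief", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def clean_folder(folder: str, dry_run: bool = True) -> list[Path]:
    """List (and, unless dry_run, delete) the files that fail the type check."""
    rejected = [
        path for path in sorted(Path(folder).iterdir())
        if path.is_file() and not is_wanted_type(describe(path))
    ]
    if not dry_run:
        for path in rejected:
            path.unlink()
    return rejected
```

Note that a plain substring check on PE32 executable also rejects 64-bit binaries, because file describes those as PE32+ executable.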
To obtain the malware samples, we contacted the VirusShare staff and requested access to their malware repository. Once access was granted, we downloaded torrent #144 (VirusShare_00144.zip, 87.44 GB), which contained PE files for Microsoft Windows, and repeated the cleaning by type using the Python script dataset_creation_scripts/cleanType.py. We then randomly chose a number of samples (15821) similar to the number of goodware binaries in order to compose a balanced dataset, as can be seen in the Python script dataset_creation_scripts/cleanAmount.py.
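Randomly reducing a folder to a fixed number of samples can be sketched as shown below. This is a hedged illustration rather than the code of cleanAmount.py: the function names, the fixed seed, and the dry-run default are assumptions made for the example.

```python
import random
from pathlib import Path


def choose_subset(names, amount: int, seed: int = 0) -> list[str]:
    """Return `amount` names drawn at random without replacement, sorted."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return sorted(rng.sample(sorted(names), amount))


def keep_random_files(folder: str, amount: int, seed: int = 0,
                      dry_run: bool = True) -> list[Path]:
    """Return the files that are not selected; delete them unless dry_run."""
    files = {path.name: path for path in Path(folder).iterdir() if path.is_file()}
    kept = set(choose_subset(files, amount, seed))
    dropped = [files[name] for name in sorted(files) if name not in kept]
    if not dry_run:
        for path in dropped:
            path.unlink()
    return dropped
```

random.sample raises ValueError when the requested amount exceeds the number of available files, which guards against silently building a smaller dataset than intended.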
Note that, as the CNN models have been trained mainly with samples obfuscated with the Shikata Ga Nai or XOR encoders, the models will be able to detect these two obfuscation techniques.
As some of the downloaded samples may not be obfuscated, an obfuscation process was applied to all the samples. This process was carried out using the Metasploit Framework and msfvenom, and is contained in the file encode_all.py, which automates the following two command-line instructions:
msfvenom -p generic/custom PAYLOADFILE=sample.exe -a x86 --platform windows -e x86/shikata_ga_nai -o virusencoded.exe
msfvenom -p generic/custom PAYLOADFILE=sample.exe -a x86 --platform windows -e x86/bloxor -o virusencoded.exe
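A batch automation of these two commands could look like the sketch below. It is not the contents of encode_all.py: the function names, the per-encoder output subfolders, and the dry-run default are assumptions made for illustration, while the msfvenom arguments are exactly the ones listed above.

```python
import subprocess
from pathlib import Path

ENCODERS = ("x86/shikata_ga_nai", "x86/bloxor")


def build_command(sample: Path, encoder: str, output: Path) -> list[str]:
    """Build the msfvenom argument list for one sample and one encoder."""
    return [
        "msfvenom", "-p", "generic/custom", f"PAYLOADFILE={sample}",
        "-a", "x86", "--platform", "windows",
        "-e", encoder, "-o", str(output),
    ]


def encode_folder(input_folder: str, output_folder: str,
                  dry_run: bool = True) -> list[list[str]]:
    """Encode every file with every encoder; return the commands involved.

    Outputs go to output_folder/<encoder name>/<sample name> (hypothetical layout).
    With dry_run=True the commands are only built, not executed.
    """
    commands = []
    for sample in sorted(Path(input_folder).iterdir()):
        if not sample.is_file():
            continue
        for encoder in ENCODERS:
            out_dir = Path(output_folder) / encoder.split("/")[-1]
            command = build_command(sample, encoder, out_dir / sample.name)
            commands.append(command)
            if not dry_run:
                out_dir.mkdir(parents=True, exist_ok=True)
                subprocess.run(command, check=True)  # requires msfvenom on PATH
    return commands
```

Passing the arguments as a list instead of a shell string avoids quoting problems with unusual sample file names.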
An example of the obfuscation instructions for two samples (one goodware and one malware) can be seen next:
Goodware (zotero.exe)
Figures: Shikata Ga Nai obfuscation of Zotero using msfvenom; Bloxor obfuscation of Zotero using msfvenom.
Malware (VirusShare_83c460e93694aad6cf15370bd50a2684.exe)
Figures: Bloxor obfuscation of Malware using msfvenom; Shikata Ga Nai obfuscation of Malware using msfvenom.
You may contact any of the coauthors of the paper:
Juan Murcia Nieto at [email protected]
Rodrigo Castillo Camargo at [email protected]
Nicolás Rojas at [email protected]
Daniel Díaz-Lopez at [email protected]
Santiago Alferez at [email protected]
Angel Luis Perales Gómez at [email protected]
Pantaleone Nespoli at [email protected]
Felix Gomez Marmol at [email protected]
Umit Karabiyik at [email protected]