Obfuscated malware detection using deep learning models and transfer learning

nirogu/ObfuscatedMalwareDetection

Obfuscated Malware Detection

Code associated with the paper: DEFENDIFY: Defense Amplified with Transfer Learning for Obfuscated Malware Framework

QUICK START GUIDE

Before running this project, consider the data flow of the solution so that you can decide which component(s) you want to execute. The data flow shown below contains the following components:

  1. Entropy Tester : This classifier detects whether or not a sample is obfuscated
  2. Image Transformer : This module converts a sample to an image
  3. Not Encoded Classifier : This classifier detects whether a non-encoded sample is Goodware or Malware
  4. Shikata Ga Nai / XOR Classifier : This classifier detects whether a Shikata Ga Nai / XOR encoded sample is Goodware or Malware

You may decide to test each one of the modules that compose the flow or go directly to the deployed web application version.

Flow
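The routing implied by the four components can be sketched as follows. This is only an illustration of the data flow, not code from the repository: `classify_sample` and all of its parameters are hypothetical names, and the four components are passed in as plain callables.

```python
def classify_sample(path, is_obfuscated, to_image, plain_classifier, encoded_classifier):
    """Route one sample through the flow (all callables are hypothetical stand-ins).

    is_obfuscated      -> Entropy Tester (returns True or False for a file path)
    to_image           -> Image Transformer (returns an image for a file path)
    plain_classifier   -> Not Encoded Classifier (returns a label for an image)
    encoded_classifier -> Shikata Ga Nai / XOR Classifier (returns a label for an image)
    """
    image = to_image(path)
    # The entropy-based decision selects which CNN classifier sees the image.
    classifier = encoded_classifier if is_obfuscated(path) else plain_classifier
    return classifier(image)
```

Passing the components as arguments keeps the sketch independent of how each module is actually implemented or deployed.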

1. Detecting obfuscation (Entropy Tester)

The Python code that identifies whether a binary file has been obfuscated (XOR/Shikata Ga Nai) can be found in the notebook entropy_tester/entropy_tester.ipynb. In the cell named Create and write entropy data, the entropy of every file in the folders indicated by the user (in this case, the folders containing the samples from the previous section) is computed and saved in a file named entropies.csv, which can be found in the entropy_tester folder. Then, the cell named Read entropy data reads the CSV file. The remaining cells in the notebook test the performance of several machine learning algorithms when identifying obfuscated binary files based on their entropy.
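The notebook code is not reproduced here. As a rough sketch of the entropy extraction step, assuming the entropy in question is the Shannon entropy of the byte values of each file, something like the following could be used. The function names and the CSV column names are assumptions made for illustration.

```python
import csv
import math
from collections import Counter
from pathlib import Path


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (between 0.0 and 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def write_entropies(folders, csv_path="entropies.csv"):
    """Write one (file, entropy) row for every regular file in the given folders."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "entropy"])  # column names are an assumption
        for folder in folders:
            for path in sorted(Path(folder).iterdir()):
                if path.is_file():
                    writer.writerow([str(path), shannon_entropy(path.read_bytes())])
```

Encoded payloads tend to look more random than plain executables, which is why a single entropy value per file can already be a useful feature for the classifiers tested in the notebook.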

2. Obtaining the images (Images Transformer)

The script binary2image.py transforms executable files into greyscale images, as described in the paper. The script can be used with python binary2image.py input_folder output_folder, where input_folder contains the binary files and output_folder will contain the resulting images.
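The exact conversion used by binary2image.py is described in the paper and is not repeated here. A minimal sketch of the general idea (one byte per greyscale pixel) is shown below. It assumes a fixed image width of 256 pixels and zero padding of the last row, and it writes PGM files so that only the standard library is needed; the width, the padding, the output format, and the function names are all assumptions.

```python
import math
from pathlib import Path


def bytes_to_rows(data: bytes, width: int = 256):
    """Split data into rows of `width` pixels, zero-padding the last row."""
    height = max(1, math.ceil(len(data) / width))
    padded = data.ljust(width * height, b"\x00")
    return [padded[i * width:(i + 1) * width] for i in range(height)]


def write_pgm(data: bytes, out_path, width: int = 256):
    """Write data as a binary PGM (P5) greyscale image, one byte per pixel."""
    rows = bytes_to_rows(data, width)
    header = f"P5\n{width} {len(rows)}\n255\n".encode("ascii")
    Path(out_path).write_bytes(header + b"".join(rows))


def convert_folder(input_folder, output_folder, width: int = 256):
    """Convert every regular file in input_folder into a .pgm in output_folder."""
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file():
            write_pgm(path.read_bytes(), out_dir / (path.name + ".pgm"), width)
```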

If you want to test the conversion from a goodware binary to images yourself, you may use the Python script binary2image.py, as shown in the image below, where it is applied to a folder that contains a goodware file (Zotero) in three versions: not obfuscated, XOR obfuscated, and Shikata Ga Nai obfuscated.

Zotero in three versions: Not obfuscated, XOR obfuscated and Shikata Ga Nai obfuscated

Executing binary2image script to convert Zotero binaries to images

Image obtained for Zotero not encoded

Image obtained for Zotero encoded with bloxor

Image obtained for Zotero encoded with Shikata Ga Nai

We may do a similar test with a malware sample, as shown below:

Malware in three versions: Not obfuscated, XOR obfuscated and Shikata Ga Nai obfuscated

Executing binary2image script to convert malware binaries to images

Image obtained for malware not encoded

Image obtained for malware encoded with bloxor

Image obtained for malware encoded with Shikata Ga Nai

3. Detecting malware (Not Encoded Classifier and Shikata Ga Nai / XOR Classifier)

The notebook cnn_tester.ipynb reads the images from the previous step and tests the performance of four different CNN architectures when identifying malware: ResNet18, ResNet34, EfficientNetB3, and EfficientNetV2. The whole notebook can be executed just by setting the variable path to the folder containing the images to be processed.

Frequently Asked Questions

1. How are the models trained?

The samples (goodware and malware) used to train the CNN models were obtained from different sources:

Goodware

The goodware samples used for this project were obtained from two different sources.

The first source was the PortableApps service. We manually downloaded all the software it offers and moved all the files to a folder named "Goodware".

In parallel, for the second source, we installed a 32-bit version of Windows 10 and copied all files from the System32 folder to the same "Goodware" folder.

Then the preprocessing phase started. In this phase, we used the file command from an Ubuntu installation to remove all files whose description did not include the strings PE32 executable and for MS Windows. This process can be seen in the Python script dataset_creation_scripts/cleanType.py. This left us with 457 sample files obtained from PortableApps and 15171 sample files obtained from the Windows installation, for a total of 15628 goodware samples.
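The type filter can be sketched as follows. This is not the content of cleanType.py; it only illustrates the check described above, assuming the file command is available on the PATH. The function names are hypothetical, and the sketch lists the files to discard instead of deleting them.

```python
import subprocess
from pathlib import Path

# Both strings must appear in the output of the `file` command.
REQUIRED_STRINGS = ("PE32 executable", "for MS Windows")


def is_windows_pe32(description: str) -> bool:
    """Return True if a `file` description matches both required strings."""
    return all(text in description for text in REQUIRED_STRINGS)


def describe(path) -> str:
    """Return the output of `file --brief` for the given path."""
    result = subprocess.run(
        ["file", "--brief", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def files_to_discard(folder, describe_fn=describe):
    """List the regular files in folder that are not 32-bit Windows PE files."""
    return [
        path for path in sorted(Path(folder).iterdir())
        if path.is_file() and not is_windows_pe32(describe_fn(path))
    ]
```

Note that a 64-bit binary is reported by file as PE32+ executable, so it does not contain the string PE32 executable and is filtered out by this check.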

Malware

To obtain the malware samples, we contacted the VirusShare staff and requested access to their malware repository. Once access was granted, we downloaded torrent #144 (VirusShare_00144.zip, 87.44 GB), which contained PE files for Microsoft Windows, and repeated the cleaning-by-type process using the Python script dataset_creation_scripts/cleanType.py. We then randomly chose 15821 malware binaries, an amount similar to the number of goodware binaries, in order to compose a balanced dataset, as can be seen in the Python script dataset_creation_scripts/cleanAmount.py.
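The random selection step can be sketched as below. This is an illustration of random downsampling in general, not the content of cleanAmount.py; the function name and the optional seed parameter are assumptions.

```python
import random


def choose_subset(paths, target_count, seed=None):
    """Randomly keep `target_count` items from `paths` without repetition."""
    ordered = sorted(paths)  # sort first so that a fixed seed gives a fixed result
    if target_count >= len(ordered):
        return ordered
    return random.Random(seed).sample(ordered, target_count)
```

For example, `choose_subset(all_malware_paths, 15821)` would return 15821 distinct paths, where `all_malware_paths` is a hypothetical list of the files left after cleaning by type.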

2. Is the solution able to detect malware with any kind of obfuscation?

No. As the CNN models have been trained mainly with samples obfuscated with the Shikata Ga Nai or XOR encoders, the models are able to detect malware obfuscated with these two techniques.

3. How is the obfuscation process done?

As some of the downloaded samples may not be obfuscated, an obfuscation process was applied to all the samples. This process was done with msfvenom, from the Metasploit framework, and is contained in the file encode_all.py. That file automates the following two command line instructions:

msfvenom -p generic/custom PAYLOADFILE=sample.exe -a x86 --platform windows -e x86/shikata_ga_nai -o virusencoded.exe

msfvenom -p generic/custom PAYLOADFILE=sample.exe -a x86 --platform windows -e x86/bloxor -o virusencoded.exe
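One way to automate the two commands above over a folder of samples is sketched below. This is not the content of encode_all.py: the function names and the output layout (one subfolder per encoder) are assumptions, and msfvenom must be installed for encode_folder to do anything useful.

```python
import subprocess
from pathlib import Path

# The two encoders used in the commands above.
ENCODERS = ("x86/shikata_ga_nai", "x86/bloxor")


def build_command(sample, encoder, output):
    """Build the msfvenom argument list for one sample and one encoder."""
    return [
        "msfvenom", "-p", "generic/custom", f"PAYLOADFILE={sample}",
        "-a", "x86", "--platform", "windows",
        "-e", encoder, "-o", str(output),
    ]


def encode_folder(input_folder, output_folder, run=subprocess.run):
    """Encode every regular file in input_folder with each encoder."""
    for encoder in ENCODERS:
        # Hypothetical layout: one output subfolder per encoder.
        target_dir = Path(output_folder) / encoder.split("/")[-1]
        target_dir.mkdir(parents=True, exist_ok=True)
        for sample in sorted(Path(input_folder).iterdir()):
            if sample.is_file():
                run(build_command(sample, encoder, target_dir / sample.name), check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual sample names.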

An example of the obfuscation instructions for two samples (one goodware and one malware) is shown next:

Goodware (zotero.exe)

Shikata Ga Nai obfuscation of Zotero using msfvenom

Bloxor obfuscation of Zotero using msfvenom

Malware (VirusShare_83c460e93694aad6cf15370bd50a2684.exe)

Bloxor obfuscation of Malware using msfvenom

Shikata Ga Nai obfuscation of Malware using msfvenom

Who may I contact if I have any additional questions?

You may contact any of the coauthors of the paper:

Juan Murcia Nieto at [email protected]

Rodrigo Castillo Camargo at [email protected]

Nicolás Rojas at [email protected]

Daniel Díaz-Lopez at [email protected]

Santiago Alferez at [email protected]

Angel Luis Perales Gómez at [email protected]

Pantaleone Nespoli at [email protected]

Felix Gomez Marmolb at [email protected]

Umit Karabiyik at [email protected]