Malware Training Sets: A machine learning dataset for everyone

One of the most challenging tasks in Machine Learning is to define a great training (and possibly dynamic) dataset. Assuming a well-known learning algorithm and a periodic supervised learning process, what you need is a classified dataset to best train your machine. Thousands of training datasets are available out there, from “flowers” to “dices” passing through “genetics”, but I was not able to find a great classified dataset for malware analyses. So I decided to build one myself and to share the dataset with the scientific community (and everybody interested in it), in order to give everyone a starting point for Machine Learning for Malware Analysis. The first challenge I faced was to define the features and how to extract them. Basically I had two choices:
  1. Extracting features directly from samples. This is the easiest solution since the extracted features would be directly related to the sample, such as (but not limited to): file “sections”, “entropy”, “syscalls” and decompiled-assembly n-grams.
  2. Extracting features from sample analyses. This is the hardest solution since it would include both static analysis, such as (but not limited to): file sections, entropy, “syscalls/APIs”, and dynamic analysis, such as (but not limited to): “contacted IPs”, “DNS queries”, “executed processes”, “AV signatures”, etc. Moreover I needed a complex dynamic-analysis system including multiple sandboxes and static analysers.
I decided to follow the hardest path, extracting features from both the static analysis and the dynamic analysis of sample detonations, in order to collect as many features as I could, leaving the data scientist free to decide which features to use and which to drop in his data-mining process. The analyses were performed by detonating the samples in several sandboxes (free and commercial ones), which produced a first stage of ontologically homogeneous blocks called “Analyses Results” (AR). ARs are far too verbose, and they do not perform well with any text algorithm I know of.
After more reading on the topic I came across the Malware Instruction Set for Behaviour Analysis (MIST), described by Philipp Trinius et al. (document available here). MIST is basically an optimised representation of analysis results for effective and efficient behaviour analysis using data-mining and machine-learning techniques. It can be obtained automatically during the analysis of malware with a behaviour-monitoring tool, or by converting existing behaviour reports. The representation is not restricted to a particular monitoring tool and thus can also be used as a meta language to unify behaviour reports from different sources. The following image shows the MIST encoding structure.

A simple example taken directly from the aforementioned paper is shown in the following image, where “load.dll” has been detected. The ‘load dll’ system call is executed several times by every program during process initialisation and at run time, since under Windows dynamic-link libraries (DLLs) are used to implement the Windows subsystem and offer an interface to the operating system. The image shows how load.dll is encoded into the MIST meta language.
I decided to use the same concept of “meta language” but with self-descriptive logic (without encoding the category operation, since it would not affect the analyses), and with every piece of information organised into a well-formed JSON file rather than a line-based text file, so that it can be used in external environments with zero effort. The produced dataset looks like the following:

DataSet Snippet (click to enlarge)
Each JSON property could be used as a feature for your Machine Learning algorithm of choice, but the most significant ones are those under the “properties” section. Each property, meaning each field placed under the “properties” section of the produced JSON file, is optional and is structured as follows:
category_action_with_description |  “sanitized” involved subjects with spaces
So for example:
“sig_copies_self”: “e5ed769a e5ed769a 98e83379”
It means the category is sig (which stands for signature) and the action is “copies itself”. e5ed769a e5ed769a 98e83379 are three sanitized evidences of where the sample copies itself (see the Sanitization Procedures).
 “sig_antimalware_metascan”: “”

It means the category is sig (signature) and the action is “antimalware_metascan”. The evidences are empty, meaning no signature was found by Metascan (in this case).

“sig_antivirus_virustotal”: “ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b”

It means the VirusTotal signature found 5 evidences (ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b).

A fundamental property is the “label” property, which classifies the malware family. I decided to name this field “label” rather than “malware_name”, “malware_family” or “classification” in order to keep compatibility with many existing machine-learning implementations, which rely on a field named “label” to work properly (it seems to be a de facto standard for many engine implementations).
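To make the structure concrete, here is a minimal sketch of how such a record could be parsed. The record layout below is an assumption reconstructed from the examples above: only “label”, the “properties” section and the three example properties come from the post.

```python
import json

# Hypothetical record following the described structure; the surrounding
# layout is an assumption, only the shown fields come from the post.
record_json = """
{
  "label": "APT1",
  "properties": {
    "sig_copies_self": "e5ed769a e5ed769a 98e83379",
    "sig_antimalware_metascan": "",
    "sig_antivirus_virustotal": "ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b"
  }
}
"""

record = json.loads(record_json)

# Split each "category_action" property into its sanitized evidence
# tokens; an empty string means the signature fired with no evidence.
features = {name: value.split() if value else []
            for name, value in record["properties"].items()}

print(record["label"])                            # APT1
print(len(features["sig_antivirus_virustotal"]))  # 5
```

Each key of `features` can then be fed to a learner as a feature name, with the token list (or its length) as the value.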

Sanitization Procedures

The aim of the project is to provide a useful, classified dataset to researchers who want to dig deeper into malware analysis using Machine Learning techniques. It is essential to speed up text mining, and for that reason I decided to use some well-known sanitization techniques to “hash” the evidences, leaving the meaning unchanged but drastically improving speed from an algorithmic point of view. The following picture shows the sanitization procedures:
Sanitization Procedures (click to enlarge)

From a developer’s perspective the cited (and shown) procedures are not well written; for example they are not protected, and “.replace” may not be safe for specific inputs. For that reason I will not release the code. But please keep in mind that the result of my project is not the “sanitization code” but its outcome: the classified malware-analysis dataset. So I focused my attention on feature extraction, sample collection, aggregation, conversion and, of course, analysis, not on developing production code.
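Since the original sanitization code is not released, here is a minimal sketch of the general idea: hash each raw evidence string to a short fixed-length token. The choice of SHA-256 truncated to 8 hex characters is my assumption for this sketch, not the author’s implementation.

```python
import hashlib

def sanitize(evidence: str, length: int = 8) -> str:
    """Replace a raw evidence string (path, IP, registry key, ...) with a
    short fixed-length hash token: "same evidence" vs "different evidence"
    is preserved while text mining gets much cheaper.
    Assumption: SHA-256 truncated to 8 hex chars, not the original code."""
    return hashlib.sha256(evidence.encode("utf-8")).hexdigest()[:length]

# Identical inputs map to identical tokens, so frequency counts survive.
a = sanitize(r"C:\Windows\System32\evil.exe")
b = sanitize(r"C:\Windows\System32\evil.exe")
c = sanitize("8.8.8.8")
print(a == b, a != c, len(a))  # True True 8
```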

Training DataSets Generation: The Simplified Process

The whole process used to obtain the training datasets is described in the following flowchart. The detonation of a classified malware sample in multiple sandboxes produces multiple static and dynamic analyses, merged into an analyses-results artefact (AR). The AR is then translated into a MIST-like meta language in order to be software agnostic and to give freedom to data scientists.

Data Samples

As of today (please refer to the blog post date) the collected classified dataset is composed of the following samples:

  • APT1: 292 samples
  • Crypto: 2024 samples
  • Locker: 434 samples
  • Zeus: 2014 samples
If you own classified malware samples and want to share them with me in order to contribute to the Machine Learning Training Datasets, you are welcome: just drop me an email!
I will definitely process the samples and build new datasets to share with everybody.

Where can I download the training datasets? HERE

Available Features and Frequency

The following list enumerates the available features for each sample. The features, as mentioned, are optional, meaning you might not have the same features for every sample. If the sample you are analysing does not have a specific feature you want, consider it as None (or undefined), since that feature was not available for that sample. So if you are writing your own machine-learning algorithm you should include a “purification procedure” which ignores None features during training and/or querying.
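A minimal sketch of such a “purification procedure”, assuming per-sample feature dictionaries and a simple map-missing-to-zero policy; the feature names below appear in the list that follows, but the policy itself is illustrative, not prescribed by the dataset.

```python
def purify(samples, feature_names):
    """Build fixed-width numeric rows from sparse per-sample feature
    dicts. A feature missing from a sample is treated as None and then
    mapped to 0.0 here; a learner could instead skip such entries."""
    rows = []
    for sample in samples:
        row = [sample.get(name) for name in feature_names]  # None if absent
        rows.append([0.0 if v is None else float(v) for v in row])
    return rows

samples = [
    {"file_access": 12, "reg_write": 3},   # has no "net_dns" feature
    {"file_access": 7, "net_dns": 1},      # has no "reg_write" feature
]
X = purify(samples, ["file_access", "reg_write", "net_dns"])
print(X)  # [[12.0, 3.0, 0.0], [7.0, 0.0, 1.0]]
```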

List of currently available features with occurrence counts:

   ‘file_access’: 138759,
   ‘sig_infostealer_ftp’: 13114,
   ‘sig_modifies_hostfile’: 5,
   ‘sig_removes_zoneid_ads’: 16,
   ‘sig_disables_uac’: 33,
   ‘sig_static_versioninfo_anomaly’: 0,
   ‘sig_stealth_webhistory’: 417,
   ‘reg_write’: 11942,
   ‘sig_network_cnc_http’: 132,
   ‘api_resolv’: 954690,
   ‘sig_stealth_network’: 71,
   ‘sig_antivm_generic_bios’: 6,
   ‘sig_polymorphic’: 705,
   ‘sig_antivm_generic_disk’: 7,
   ‘sig_antivm_vpc_keys’: 0,
   ‘sig_antivm_xen_keys’: 5,
   ‘sig_creates_largekey’: 16,
   ‘sig_exec_crash’: 6,
   ‘sig_antisandbox_sboxie_libs’: 144,
   ‘sig_mimics_icon’: 2,
   ‘sig_stealth_hidden_extension’: 9,
   ‘sig_modify_proxy’: 384,
   ‘sig_office_security’: 20,
   ‘sig_bypass_firewall’: 29,
   ‘sig_encrypted_ioc’: 476,
   ‘sig_dropper’: 671,
   ‘reg_delete’: 2545,
   ‘sig_critical_process’: 3,
   ‘service_start’: 312,
   ‘net_dns’: 486,
   ‘sig_ransomware_files’: 5,
   ‘sig_virus’: 781,
   ‘file_write’: 20218,
   ‘sig_antisandbox_suspend’: 2,
   ‘sig_sniffer_winpcap’: 16,
   ‘sig_antisandbox_cuckoocrash’: 11,
   ‘file_delete’: 5405,
   ‘sig_antivm_vmware_devices’: 1,
   ‘sig_ransomware_recyclebin’: 0,
   ‘sig_infostealer_keylog’: 44,
   ‘sig_clamav’: 1350,
   ‘sig_packer_vmprotect’: 1,
   ‘sig_antisandbox_productid’: 18,
   ‘sig_persistence_service’: 5,
   ‘sig_antivm_generic_diskreg’: 162,
   ‘sig_recon_checkip’: 4,
   ‘sig_ransomware_extensions’: 4,
   ‘sig_network_bind’: 190,
   ‘sig_antivirus_virustotal’: 175975,
   ‘sig_recon_beacon’: 23,
   ‘sig_deletes_shadow_copies’: 24,
   ‘sig_browser_security’: 216,
   ‘sig_modifies_desktop_wallpaper’: 83,
   ‘sig_network_torgateway’: 1,
   ‘sig_ransomware_file_modifications’: 23,
   ‘sig_antivm_vbox_files’: 7,
   ‘sig_static_pe_anomaly’: 2194,
   ‘sig_copies_self’: 591,
   ‘sig_antianalysis_detectfile’: 51,
   ‘sig_antidbg_devices’: 6,
   ‘file_drop’: 6627,
   ‘sig_driver_load’: 72,
   ‘sig_antimalware_metascan’: 1045,
   ‘sig_modifies_certs’: 46,
   ‘sig_antivm_vpc_files’: 0,
   ‘sig_stealth_file’: 1566,
   ‘sig_mimics_agent’: 131,
   ‘sig_disables_windows_defender’: 3,
   ‘sig_ransomware_message’: 10,
   ‘sig_network_http’: 216,
   ‘sig_injection_runpe’: 474,
   ‘sig_antidbg_windows’: 455,
   ‘sig_antisandbox_sleep’: 271,
   ‘sig_stealth_hiddenreg’: 13,
   ‘sig_disables_browser_warn’: 20,
   ‘sig_antivm_vmware_files’: 6,
   ‘sig_infostealer_mail’: 617,
   ‘sig_ipc_namedpipe’: 13,
   ‘sig_persistence_autorun’: 2355,
   ‘sig_stealth_hide_notifications’: 19,
   ‘service_create’: 62,
   ‘sig_reads_self’: 14460,
   ‘mutex_access’: 15017,
   ‘sig_antiav_detectreg’: 4,
   ‘sig_antivm_vbox_libs’: 0,
   ‘sig_antisandbox_sunbelt_libs’: 2,
   ‘sig_antiav_detectfile’: 2,
   ‘reg_access’: 774910,
   ‘sig_stealth_timeout’: 1024,
   ‘sig_antivm_vbox_keys’: 0,
   ‘sig_persistence_ads’: 3,
   ‘sig_mimics_filetime’: 3459,
   ‘sig_banker_zeus_url’: 1,
   ‘sig_origin_langid’: 71,
   ‘sig_antiemu_wine_reg’: 1,
   ‘sig_process_needed’: 137,
   ‘sig_antisandbox_restart’: 24,
   ‘sig_recon_programs’: 5318,
   ‘str’: 1443775,
   ‘sig_antisandbox_unhook’: 1364,
   ‘sig_antiav_servicestop’: 78,
   ‘sig_injection_createremotethread’: 311,
   ‘pe_imports’: 301256,
   ‘sig_process_interest’: 295,
   ‘sig_bootkit’: 25,
   ‘reg_read’: 458477,
   ‘sig_stealth_window’: 1267,
   ‘sig_downloader_cabby’: 50,
   ‘sig_multiple_useragents’: 101,
   ‘pe_sec_character’: 22180,
   ‘sig_disables_windowsupdate’: 0,
   ‘sig_antivm_generic_system’: 6,
   ‘cmd_exec’: 2842,
   ‘net_con’: 406,
   ‘sig_bcdedit_command’: 14,
   ‘pe_sec_entropy’: 22180,
   ‘pe_sec_name’: 22180,
   ‘sig_creates_nullvalue’: 1,
   ‘sig_packer_entropy’: 3603,
   ‘sig_packer_upx’: 1210,
   ‘sig_disables_system_restore’: 6,
   ‘sig_ransomware_radamant’: 0,
   ‘sig_infostealer_browser’: 7,
   ‘sig_injection_rwx’: 3613,
   ‘sig_deletes_self’: 600,
    ‘file_read’: 50632,
   ‘sig_fraudguard_threat_intel_api’: 226,
   ‘sig_deepfreeze_mutex’: 1,
   ‘sig_modify_uac_prompt’: 1,
   ‘sig_api_spamming’: 251,
   ‘sig_modify_security_center_warnings’: 18,
   ‘sig_antivm_generic_disk_setupapi’: 25,
   ‘sig_pony_behavior’: 159,
   ‘sig_banker_zeus_mutex’: 442,
   ‘net_http’: 223,
   ‘sig_dridex_behavior’: 0,
   ‘sig_internet_dropper’: 3,
   ‘sig_cryptAM’: 0,
   ‘sig_recon_fingerprint’: 305,
   ‘sig_antivm_vmware_keys’: 0,
   ‘sig_infostealer_bitcoin’: 207,
   ‘sig_antiemu_wine_func’: 0,
   ‘sig_rat_spynet’: 3,
   ‘sig_origin_resource_langid’: 2255

Cite The DataSet

If you find these results useful, please cite them:

@misc{ MR,
author = "Marco Ramilli",
title = "Malware Training Sets: a machine learning dataset for everyone",
year = "2016",
url = "http://marcoramilli.blogspot.it/2016/12/malware-training-sets-machine-learning.html",
note = "[Online; December 2016]"
}


Again, if you want to contribute and you own classified samples, please drop them to me and I will extend the dataset.
Enjoy your new research!

Dirty COW Notes

I am not used to writing about vulnerabilities, because there are too many vulnerabilities out there and writing about just one of them would not contribute to the security community at all. So why am I writing about Dirty COW? I am writing about it because, in my personal opinion, it is huge. When I say “huge” I don’t really mean it will be used to exploit the “entire world”; I mean it highlights two main issues:
  • Even patched code can easily hide the same vulnerability, just in a different way. How much patched code is not really “patched”?
  • A new pragmatic approach to identifying vulnerabilities: looking into patched code and checking the patch implementation.
But let’s start from the beginning by taking a closer look at the exploit code.
Click to enlarge: Taken From Here

Like many other kernel vulnerabilities it relies on concurrency; the exploit code fires up two separate threads that access the same resource at the same time. Taking a closer look at the main function you will see that the mmap syscall is used.
calling mmap function
From documentation:

creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping.

mmap does not create a memory copy but rather a new mapping of that (file descriptor’s) memory area. It means the process will read data directly from the original file rather than from a copy of it. While most of the parameters are obvious, the MAP_PRIVATE flag is the “core” of the vulnerability. It enables “copy on write” (hence the name COW), which copies the original data to a new memory area upon write access to that data. Since mmap has just mapped a read-only area and the process wants to write data to it, mmap (MAP_PRIVATE) creates a copy of that data on write, and the modified data is not propagated to the original memory area.
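The copy-on-write semantics of MAP_PRIVATE can be observed harmlessly from user space. The following sketch (POSIX only, and emphatically not the exploit) shows that a write to a private mapping never reaches the file:

```python
import mmap
import os
import tempfile

# MAP_PRIVATE maps the file copy-on-write: a write lands on a private
# page, never in the file itself (POSIX semantics; sketch only).
fd, path = tempfile.mkstemp()
os.write(fd, b"read-only original content")

size = os.path.getsize(path)
mm = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
               prot=mmap.PROT_READ | mmap.PROT_WRITE)

mm[:9] = b"OVERWRITE"            # first write triggers copy-on-write
in_memory = bytes(mm[:9])
with open(path, "rb") as f:
    on_disk = f.read(9)

print(in_memory)                 # b'OVERWRITE' (the private copy)
print(on_disk)                   # b'read-only' (the file is untouched)

mm.close()
os.close(fd)
os.remove(path)
```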
Now the exploit runs two threads which exploit a race condition to get “write access” to the original memory area. The first thread runs the madvise (memory advise) function call several times; madvise is used to improve process performance by tagging a memory area according to its usage: for example the memory can be tagged as NORMAL, SEQUENTIAL, FREE or WILLNEED, and so on. In the exploit, the mmap’ed memory is continuously tagged as DONTNEED, which basically means the memory is not going to be used in the near future, so the kernel can free its space and reload the content only when needed.

First Thread implementing madvise
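The discard-and-reload behaviour described above can also be sketched harmlessly. On Linux, MADV_DONTNEED on a dirty file-backed private mapping drops the private copy, and the next read is repopulated from the file; this reset of the private copy is exactly the window the exploit races against. The sketch assumes Linux and Python 3.8+ (for `mmap.madvise`).

```python
import mmap
import os
import tempfile

# Linux-only sketch: MADV_DONTNEED discards the dirty private
# copy-on-write page; the next read comes from the file again.
fd, path = tempfile.mkstemp()
os.write(fd, b"AAAA")

mm = mmap.mmap(fd, 4, flags=mmap.MAP_PRIVATE,
               prot=mmap.PROT_READ | mmap.PROT_WRITE)

mm[:4] = b"BBBB"                  # dirty the private copy-on-write page
dirty = bytes(mm[:4])

mm.madvise(mmap.MADV_DONTNEED)    # advise: "I don't need this range"
reloaded = bytes(mm[:4])          # repopulated from the original file

print(dirty)                      # b'BBBB'
print(reloaded)                   # b'AAAA'

mm.close()
os.close(fd)
os.remove(path)
```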

Meanwhile, another thread is writing to its own memory space (by abusing the pseudo-file notation /proc/self/mem), directly to the mmap’ed area pointing to the opened file. Since we invoked the mmap function with the MAP_PRIVATE flag, we are not writing to the original memory but to a copy of it (copy on write).
Second Thread implementing write on pseudo self/mem
The race condition between those two threads tricks the copy-on-write into writing to the original memory area, since the copied area can be tagged as DONTNEED while the write procedure is not yet finished. And voilà, you are writing to a read-only file!
OK, now we have figured out how the trick works, but what is most interesting is the story behind it.
On the issue tracker, Linus Torvalds (maximum respect) wrote:

This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago in commit 4ceb5db9757a (“Fix get_user_pages() race for write access”) but that was then undone due to problems on s390 by commit f33ea7f404e5 (“fix get_user_pages bug”). In the meantime, the s390 situation has long been fixed, and we can now fix it by checking the pte_dirty() bit properly (and do it better). The s390 dirty bit was implemented in abf09bed3cce (“s390/mm: implement software dirty bits”) which made it into v3.9. Earlier kernels will have to look at the page state itself. Also, the VM has become more scalable, and what used a purely theoretical race back then has become easier to trigger.

s390 is ancient IBM technology… I am not even sure it still exists in the real world (at least compared to recent systems). Probably the Linux community forgot about that removal, otherwise the fix would have been kept in recent memory managers.

Anyhow, the bug has now “been fixed” by introducing a new internal flag called FOLL_COW (really!?) which basically says “yes, I already did the copy on write”.
Basically the process can write even to unwritable PTEs, but only after it has gone through a COW cycle and they are dirty. The diff patch follows:

Dirty Cow Patch3 on October 2016

The Dirty COW vulnerability sparked in my mind a new vulnerability-hunting process. On one hand, laboratories with extremely sophisticated, tuned and personalised fuzzers represent the “industrial” way (corporate and/or governmental) of finding new vulnerabilities; on the other hand there is the more romantic and crafty way of professionals and/or security researchers, based on handwork and smart choices. But another smart approach (industrial or romantic) could be to investigate the patched code itself.
Patched code is, by definition, where a bug or issue was located. The most difficult part of finding vulnerabilities (as opposed to exploiting them) is figuring out where they are among thousands of lines of code. So finding a vulnerability in patched code could be much quicker, even if with high “hypothetical” complexity, since a patch is involved. But as this case testifies… that is not always true!


Fighting Ransomware Threats

I wrote a little bit about the general Ransomware landscape and general infection methods here.
Today, after some more months working in the field and after having met many more Ransomware samples than I expected, I’d like to write a little bit about how to “fight” them.
Before starting the review of some of the best-known strategies to fight Ransomware, let me explain why nowadays Ransomware is not as “fair” as it was a few months ago. Back at the beginning of 2016, Ransomware writers would assure you of getting your data back once you paid the ransom; today’s Ransomware writers don’t assure it (there are several examples of paid ransoms with unrecovered files, just a few of them: here, here and here). This situation has been made possible by users who paid the ransoms during the past months. Those users raised the reputation of the Ransomware ecosystem by increasing trust in the entire supply chain.
For example we experienced many infected users saying:

“Ok, I took a ransomware and my backup sucks. Let’s pay the ransom, it only asks for few bucks. I’ll pay more attention next time!”

This user behaviour increased the Ransomware reputation to the point that today nobody doubts that paying the ransom will get their files back. This “good reputation” made it possible for less skilled attackers, and/or attackers who just wanted to make quick money, to implement “half a Ransomware” (one without a decryption module). This made the whole community of Ransomware writers (which happens to be a professional community) very angry. That community is now divided into two main parties: the one that wants to preserve the Ransomware reputation by giving files back once the ransom has been paid (usually Ransomware-as-a-service writers), and the one that exploits that reputation by writing quick-and-dirty Ransomware (available on the black market as a service as well) which won’t give files back once the ransom has been paid (usually single-hosted Ransomware).
Ok, nice story, but how do we fight them?
Today there are two main known strategies so far:
  1. Try to block the ransomware infection before it “fires up”.
  2. Try to detect it before it can create a real “damage”.
I won’t write about prevention in this post, only about mitigation. So I assume the Ransomware has already landed on the victim’s machine.

Methods to try to block a Ransomware infection before it “fires up”.
Three main methods to try to block a Ransomware infection, assuming the Malware has already landed on the victim’s PC, are implemented so far:

1. Signature Based (AV) Approach. 

Like common viruses, known Ransomware have signatures. If a signature (which could be static or dynamic) matches the sample file, the sample itself is blocked and thrown away.
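As a toy illustration of the idea (not any real AV engine), a hash-based signature check can be sketched in a few lines; the signature database here is entirely hypothetical:

```python
import hashlib

def is_known_malicious(sample_bytes: bytes, signature_db: set) -> bool:
    """Return True when the sample's SHA-256 matches a known signature.

    Real engines also use fuzzy, static and dynamic signatures; an exact
    hash breaks as soon as the sample is repacked, which is exactly why
    this approach alone fails against polymorphic Ransomware.
    """
    return hashlib.sha256(sample_bytes).hexdigest() in signature_db
```

A single-byte change in the sample defeats the exact-hash match, which is the weakness discussed below.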

This is the romantic approach: it only works for known Ransomware and is of little use against today’s threats.

2. Policy Based Approach.

Executable files are prevented from running from certain folders (for example from the eMail folder or from temporary folders).
It can be a first and important way to “decelerate” the infection rate. In fact, many infections happen through “avid clickers” who open untrusted email and/or click on untrusted links. Forcing them to move the downloaded file, or to copy the malicious attachment to another destination, often distracts the “avid clickers” enough that they do not get infected.
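A minimal sketch of such a policy check, assuming a hypothetical list of blocked directories (e.g. mail attachment and temp folders):

```python
import os

def execution_allowed(exe_path: str, blocked_dirs) -> bool:
    """Deny execution when the binary lives under any blocked directory."""
    path = os.path.normpath(os.path.abspath(exe_path))
    for blocked in blocked_dirs:
        blocked = os.path.normpath(os.path.abspath(blocked))
        # Block the directory itself and anything nested inside it.
        if path == blocked or path.startswith(blocked + os.sep):
            return False
    return True
```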

3. CallBack Based approach.

Every recent Ransomware needs to communicate with external servers to get an encryption key, or to report the infection to the attacker and later retrieve the decryption key. A primitive approach is to detect the callback and block it, preventing that initial communication.

This approach is hard in real life since the communication methods can be very clever and innovative. Indeed, the communication to command and control could be (just for example) end-to-end encrypted, and/or the contacted addresses could be legitimate but compromised domains.
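The primitive version of this idea amounts to a domain blocklist applied to outgoing DNS queries; the blocked domains below are, of course, made up:

```python
def is_blocked_callback(domain: str, blocklist) -> bool:
    """Match a queried domain (or any of its subdomains) against a C2 blocklist."""
    domain = domain.lower().rstrip(".")  # normalise case and trailing dot
    return any(domain == bad or domain.endswith("." + bad) for bad in blocklist)
```

As the paragraph above notes, this fails against compromised legitimate domains, which is precisely why blocklisting alone is primitive.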

Methods to try to detect it before it can create a real “damage”
Some of the main methods implemented by commercial products try to block the Ransomware infection once it has fired up. The most widely implemented strategies follow.
1. Flag processes that read and write too many files too quickly. 

This method is used by MalwareBytes AntiRansomware, which is based on Nathan Scott’s CryptoMonitor. It counts how often untrusted processes have modified “a certain number of personal files, under a certain time.” A similar method is implemented by Adam Kramer’s handle_monitor tool, based on the frequency with which processes create handles through which they can access files.

Implementing this method alone could produce tons of false positives (and the white/black listing that follows). Imagine the DropBox or GoogleDrive process during a sync phase: how many files does it modify/delete/create, and how quickly? Or CAD software that constantly saves tons of partially rendered pieces of files before assembling them? It’s clear that this strategy alone is not going to work.
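A sliding-window counter per process is enough to sketch the core of this strategy; the threshold and window values are arbitrary and would need per-environment tuning (plus the white/black listing mentioned above):

```python
from collections import defaultdict, deque

class FileActivityMonitor:
    """Flag a process that modifies more than `threshold` files within `window` seconds."""

    def __init__(self, threshold: int = 100, window: float = 10.0):
        self.threshold = threshold
        self.window = window
        self._events = defaultdict(deque)  # pid -> timestamps of file modifications

    def record(self, pid: int, timestamp: float) -> bool:
        """Record one file modification; return True if the process looks suspicious."""
        q = self._events[pid]
        q.append(timestamp)
        while q and timestamp - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q) > self.threshold
```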
    
2. Flag processes that change files’ entropy values.

Encrypted files tend to have a more uniform distribution of byte values than other files. Their contents are more uniform. Our tool could compare the file’s entropy before and after the change. If a modified file has higher entropy, it might have gotten encrypted, indicating that the responsible process might be ransomware. 

Implementing this method alone, you might run into issues with legitimately encrypted files and/or compressed files, which tend to have a rather flat and uniform distribution of byte values.
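The entropy comparison itself is straightforward; a sketch using Shannon entropy over byte values (the thresholds are illustrative guesses, and, as noted, compressed or legitimately encrypted files will trip it too):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, up to 8.0 for random data."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def looks_encrypted(before: bytes, after: bytes,
                    min_jump: float = 2.0, min_after: float = 7.0) -> bool:
    """Flag a rewrite whose entropy jumped sharply and ended up near-random."""
    e_after = shannon_entropy(after)
    return e_after >= min_after and e_after - shannon_entropy(before) >= min_jump
```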
3. Flag processes that change the entropy of selected “untouchable” files.

Specific canary files are dynamically injected into hidden or visible folders and monitored. If a process tries to modify one of them, the process is considered malicious.

Implementing this method solo could generate false positives since an unsuspecting user could open the canary file at any time.
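A bare-bones canary watcher, assuming the decoy files have already been planted, can rely on size/mtime fingerprints (a real product would hook the filesystem instead of polling):

```python
import os

class CanaryWatcher:
    """Fingerprint planted canary files; any change (or deletion) marks tampering."""

    def __init__(self, canary_paths):
        self._baseline = {p: self._fingerprint(p) for p in canary_paths}

    @staticmethod
    def _fingerprint(path):
        try:
            st = os.stat(path)
            return (st.st_size, st.st_mtime_ns)
        except FileNotFoundError:
            return None  # a deleted canary counts as tampering too

    def tampered(self):
        """Return the canary files whose fingerprint no longer matches the baseline."""
        return [p for p, fp in self._baseline.items() if self._fingerprint(p) != fp]
```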
4. SinkHoling folders.

A nested tree of recursive folders is created to trap the Ransomware process, which will loop through it, consuming a lot of resources without encrypting any real user file.
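On POSIX systems the trap can be sketched with a symlink that points back to the root of a decoy tree (standing in for an NTFS junction), so a naive recursive traversal never terminates; the path names are invented:

```python
import os

def build_sinkhole(root: str) -> str:
    """Create a decoy tree whose deepest folder links back to the root,
    so a traversal that follows links loops forever over fake folders."""
    deep = os.path.join(root, "Documents", "Backup", "Archive")
    os.makedirs(deep, exist_ok=True)
    loop = os.path.join(deep, "More")
    if not os.path.lexists(loop):
        os.symlink(root, loop)  # the cycle: .../Archive/More -> root
    return loop
```

A well-behaved tool (e.g. `os.walk` with `followlinks=False`) is unaffected, while a process blindly recursing into every directory spins inside the decoy tree.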

Once the process (the Ransomware process) is identified by one or more of the techniques above, the system can kill it or suspend it (putting it on hold), asking the user what to do.
Conclusions
Ransomware infections are one of the most widespread threats on today’s Internet (the foreground Internet). They have evolved over the years (a super great paper on Ransomware evolution can be found here), with the latest evolution dubbed Ransomworm (coined at the beginning of 2016), which includes self-propagation skills such as, for example, infecting publicly accessible folders and/or running vulnerability scans on network machines for known exploitable vulnerabilities to be used for propagation. The following image shows the activity of Bitcoin addresses used in a Ransomware campaign. As you can observe, the average time frame of a Bitcoin address used in a Ransomware fraud is between 0 and 5 days, which makes it super hard to catch the owner by cross-correlation over multiple transactions.

Figure 1: The duration of activity for Bitcoin addresses. Approximately 50% of Bitcoin addresses have zero to five days of active life (from here).

Nowadays there are plenty of quick fixes that promise to solve the issue, but no real solution has been released to the public (at least as far as I know) so far. At this point I won’t give you the usual suggestion to keep your OS up to date and to download the latest AntiVirus engine, because it really does not matter at all. Apply the policies, inform your users about this threat and stay tuned: the answer to such a threat will come, and something will happen in the Anti Malware market soon 🙂.

Continue reading Fighting Ransomware Threats

Posted in SBN

From ROP to LOP bypassing Control Flow Enforcement

Once upon a time breaking the Stack (here) was a matter of indexes and executable memory areas (here). Then came DEP protection (here), which made a particular area non-executable. This is the fantastic story of ROP (Return Oriented Programming), on which I worked for a long time writing exploits and “resurrectors” (software engines able to convert old exploits into brand new ROP-enabled exploits); please take a look: here, here, here, here, here and here. Now it’s time for a new kind of stack protection, named Control-Flow Enforcement, designed by Intel. CFE aims to prevent control-flow hijacking by using a “canary” stack… oops, that was the old way to call it… right, let me repeat the sentence… by using a “shadow stack”, which compares return addresses, and an “Indirect Branch Tracking” mechanism, which tracks every valid indirect call/jmp in the target program.
Well, I made a joke mentioning the ancient canary words, which might remind you how useless it was to add a canary control Byte (or 4 bits, actually) to enforce the entire stack, but this time it is structurally different. We are not facing a canary stack that can be adjusted by the user through store instructions such as MOV, PUSH, POP or XSAVE, but a user/kernel memory space exclusively used by control-flow instructions such as CALL and RET (near and far), etc.

When shadow stacks are enabled, the CALL instruction pushes the return address on both the data and shadow stack. The RET instruction pops the return address from both stacks and compares them. If the return addresses from the two stacks do not match, the processor signals a control protection exception (#CP). Note that the shadow stack only holds the return addresses and not parameters passed to the call instruction. To provide this protection the page table protections are extended to support an additional attribute for pages to mark them as “Shadow Stack” pages.  (Figure1 from here)
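The shadow-stack semantics quoted above can be modelled in a few lines (a toy simulation, not how the hardware is actually programmed):

```python
class ControlProtectionFault(Exception):
    """Models the #CP exception signalled on a return-address mismatch."""

class ShadowStackCPU:
    """Toy CPU: CALL pushes the return address on both stacks,
    RET pops both and compares them."""

    def __init__(self):
        self.data_stack = []    # ordinary stack, writable by the program (and attacker)
        self.shadow_stack = []  # reachable only through control-transfer instructions

    def call(self, return_address: int):
        self.data_stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self) -> int:
        address = self.data_stack.pop()
        if address != self.shadow_stack.pop():
            raise ControlProtectionFault("#CP: return address mismatch")
        return address
```

Overwriting a return address on the data stack (the classic stack smash) now diverges from the shadow copy and faults on the next RET.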

Just to make things a little harder (but it’s going to be very useful to introduce a way to bypass the shadow stack), let me introduce a more comprehensive stack-defence framework, defined by Abadi et al. and called the Control-Flow Integrity framework. In the following I borrow the classification described by Bingchen Lan et al. in their paper (available here), reporting four kinds of Control Flow Integrity (CFI) policies:

  • CFI-call. The target address of an indirect call has to point to the beginning of a function. For instance, indirect call is constrained to the limited addresses, which are specified through statically scanning the binary for function entries.
  • CFI-jump. The target address of an indirect jump should be either the beginning of another function or inside the function where this jump instruction lies. For instance, Branch Regulation prevents jumps across function boundaries to stop attackers from modifying the addresses of indirect jumps.
  • CFI-ret. In coarse-grained CFI, the target address of a ret instruction should point to the location right after any call site. Shadow stack further enhances this constraint, i.e. the ret instruction accurately corresponds to the location after the legitimate call site in its caller.
  • CFI-heuristics. Apart from enforcing specific policies on indirect branches as CFI-call, CFI-jump and CFI-ret do, some CFI solutions tend to detect attacks by validating the number of consecutive sequences of small gadgets.

During the past few years many attack mechanisms have bypassed the CFI policies; let me sum them up in the following table.

Figure 2: Comparison of attack strategies; a green “check” means the technique can bypass the defence policy, a red “x” means it cannot.

Let’s assume we are able to implement the CFI-ret and CFI-jump (or CFI-heuristics) techniques in a single system. We might apparently guarantee Control Flow Integrity! Well, that was “kind of true” until Bingchen Lan, Yan Li, Hao Sun, Chao Su, Yao Liu and Qingkai Zeng introduced, in a well done paper (here), a technique called LOP (Loop Oriented Programming). The main idea is to choose entire functions as gadgets instead of using short code fragments or unaligned instructions. In this way the call instruction targets the beginning of a function, bypassing the CFI-call policy. Moreover, CFI-heuristics expects the execution flow of a victim application to consist of multiple short code fragments, as ROP and JOP do. Since no short code is involved in LOP, and it is possible to select long gadgets with many instructions, LOP can also bypass CFI-heuristics. The process of chaining gadgets exactly follows the normal caller-callee (call-ret pairing) paradigm: the loop gadget acts as a proxy (dispatcher), invoking different functional gadgets repeatedly, which eventually return to the original caller, bypassing the CFI-ret policy. Meanwhile, there is only one jump instruction used by LOP. This jump instruction originally implements the loop functionality and is untouched by LOP; hence CFI-jump is also ineffective against LOP. The following picture shows the difference between CROP and LOP.

Figure 3: CROP vs LOP (from here)

It’s now interesting to define what a loop gadget looks like. So, let’s define a loop gadget as a complete working function having three key elements:

  1. A loop statement
  2. An indirect call instruction within the loop
  3. An index instruction within the loop statement.
The following example is taken from initterm() in msvcrt.dll, a Microsoft Windows dynamic library.
Figure 4: Example of LOP gadget

With the LOP gadget the attacker first sets up a start address and an end address, then hijacks the control flow to the loop gadget. The loop gadget makes the index pointer point to the start address of the dispatch “table”, takes the next gadget address and uses an indirect call to invoke the addressed functional gadget. Just after the call, execution returns to the instruction located right after the indirect call in the loop, through a legal ret instruction. The gadget then advances the pointing index to address the next gadget, and finally compares the index value against the “end address”.
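The dispatch mechanics can be mimicked with function objects standing in for function-entry addresses (a conceptual model of the paper’s technique, not actual exploit code; the functional gadgets are hypothetical):

```python
def loop_gadget(dispatch_table, start: int, end: int, state: dict) -> dict:
    """Model of a LOP loop gadget: an index walks the attacker-prepared
    dispatch table while an indirect call invokes each functional gadget.
    Every call targets a function entry and every ret pairs with its call,
    which is why the CFI policies described above are satisfied."""
    index = start                              # index set to the start address
    while index < end:                         # comparison against the end address
        functional_gadget = dispatch_table[index]
        functional_gadget(state)               # indirect call; returns via a legal ret
        index += 1                             # the index instruction inside the loop
    return state

# Hypothetical functional gadgets (whole functions) chained by the attacker.
def gadget_load(state):  state["reg"] = 0x40
def gadget_add(state):   state["reg"] += 2
def gadget_store(state): state["out"] = state["reg"]
```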

Figure 5: Comparison of attack strategies; a green “check” means the technique can bypass the defence policy, a red “x” means it cannot.

We can now add an additional row to the attack comparison table, as shown in Figure 5, introducing LOP as the ultimate way to bypass Control Flow Integrity techniques. Happy hunting!

Continue reading From ROP to LOP bypassing Control Flow Enforcement

Posted in SBN