20 May Malware Training Sets: Follow-Up
In 2016 I was working hard to find a way to classify Malware families through artificial intelligence (machine learning). One of the first difficulties I met was finding a classified testing set on which to run new algorithms and test specific features. So I came up with this blog post and this GitHub repository, where I proposed a new testing set based on a modified version of Malware Instruction Set for Behavior-Based Analysis, also referred to as MIST. Since then I have received hundreds of emails from students, researchers and practitioners all around the world asking how to follow up on that research and how to contribute to expanding the training set.
I am so glad that many international researchers have used my classified Malware dataset as a building block for great analyses and for improving the state of the art in Malware research. Some of their works are listed here, but many other papers and articles have been released (just ask Google).
- Big data: deep learning for detecting malware
- AI and Machine Learning for Cyber Security Wiki
- Toward Collaborative Defense Across Organizations
- Modelling Malware-driven Honeypots
- Trust, Privacy and Security in Digital Business: 14th International Conference, TrustBUS
- Design and Implementation of Malware Detection Scheme
- Machine Learning For Cybersecurity
Today I finally had the chance to follow it up by adding a scripting section that helps to (i) generate the modified version of MIST files (the format used in the training sets) and (ii) convert the obtained results to ARFF (Attribute-Relation File Format), the format defined by the University of Waikato. The first script, mist_json.py, is a reporting module that can be integrated into a running CuckooSandbox environment. It takes the Cuckoo report and converts it into a modified version of a MIST file. To use it, drop mist_json.py into your running instance of CuckooSandbox V1 (/) and add the corresponding configuration section to reporting.conf. Alternatively, you can force its execution without configuration by editing the source code directly. The result is one MIST file for each sample analysed by Cuckoo. The MIST file contains the generated features as described in the original post. With the second script, fromMongoToARFF.py, you can convert your JSON objects into ARFF, ready to be imported into WEKA for testing your favorite algorithms.
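To give an idea of what a JSON-to-ARFF conversion involves, here is a minimal sketch in Python. It is not the code of fromMongoToARFF.py: the function name to_arff, the feature names (api_calls, file_writes) and the class attribute family are hypothetical, and the sketch assumes that every document is a flat dictionary of numeric counters plus one class label.

```python
def to_arff(relation, rows, class_attr="family"):
    """Render a list of flat dicts as ARFF text (illustrative sketch only).

    Assumptions: every value except the class label is numeric, attribute
    names contain no spaces, and a counter missing from a row means 0.
    """
    # Collect every feature name seen in any row, excluding the class label.
    feature_names = sorted({k for row in rows for k in row if k != class_attr})
    # The class label becomes a nominal attribute listing all observed values.
    class_values = ",".join(sorted({row[class_attr] for row in rows}))

    lines = ["@RELATION %s" % relation, ""]
    for name in feature_names:
        lines.append("@ATTRIBUTE %s NUMERIC" % name)
    lines.append("@ATTRIBUTE %s {%s}" % (class_attr, class_values))
    lines += ["", "@DATA"]

    # One comma-separated data line per sample, class label last.
    for row in rows:
        values = [str(row.get(name, 0)) for name in feature_names]
        values.append(row[class_attr])
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Hypothetical documents, shaped like flat feature counters with a family label.
samples = [
    {"api_calls": 12, "file_writes": 3, "family": "zeus"},
    {"api_calls": 7, "family": "apt1"},
]
print(to_arff("malware", samples))
```

Putting the class label last is a convenience: WEKA lets you pick any attribute as the class, but the last one is the usual default.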
Now, if you wish, you can generate training sets by yourself and test new algorithms directly in WEKA. The creation process follows these steps:
- Upload the samples into a running CuckooSandbox patched with mist_json.py, which produces a MIST.json file for each submitted sample
- Use a simple script to import your desired MIST.json files into a MongoDB. For example (in bash, enable recursive globbing first with shopt -s globstar):
for i in **/*.json; do mongoimport --db test --collection test --file "$i"; done
- Use fromMongoToARFF.py to generate the ARFF file
- Import the generated ARFF into Weka
- Start your experimental sessions
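If you prefer to drive the import step from Python rather than from a shell loop, the sketch below shows one possible way to do it. The helper name build_import_commands is hypothetical; the mongoimport flags are the same ones used in the shell example (--db, --collection, --file), and the database and collection names default to test as in that example. The function only builds the commands, so you can inspect them before running anything.

```python
from pathlib import Path


def build_import_commands(root, db="test", collection="test"):
    """Return one mongoimport argument list per *.json file found under root.

    Hypothetical helper: it searches root recursively, like **/*.json in a
    shell with recursive globbing, and does not execute anything itself.
    """
    commands = []
    for path in sorted(Path(root).rglob("*.json")):
        commands.append([
            "mongoimport",
            "--db", db,
            "--collection", collection,
            "--file", str(path),
        ])
    return commands


# Show what would be executed for the current directory.
for command in build_import_commands("."):
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run(command, check=True) once you are happy with what it is going to import.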
If you want to share your new classified MIST files with the community, please feel free to make pull requests directly on GitHub. Everybody who uses this set will appreciate it.