A final-year project on “Malware Detection In P.E. Files” was submitted by Aseem Jan (from Vidyalankar Institute Of Technology) to extrudesign.com.
Abstract
Since its invention, the computer has taken over the world, with an impact comparable to, or even greater than, that of the industrial revolution. Every business sector has adopted it and has profited greatly as a result: running ads, marketing products and companies online, and maintaining statistical records are all done by computer. The computer has also found a place in the social life of nearly every individual and has brought the world closer together. The data held on computers can be highly sensitive or private to an individual or a company, and it can be corrupted, changed or leaked. All of this can happen through malware. Malware can corrupt the data on a computer, leak sensitive data to its operator or make it public, and even crash the computer.
Key Words: Malware, Static files, Machine learning
1. Introduction
This project presents a particular method for the detection of malware. The increase in malware that exploits systems, servers and networks has become a serious threat. Since I started learning about these technologies, I have always noticed security issues when using any kind of application or website, or when working in a real-time environment. Most other areas of technology are growing rapidly and improving worldwide, but cyber security is a field that I think still faces issues and challenges in protecting cyberspace from attack. When I studied how malware approaches the security of websites, I found it very easy to crack their authentication and their old, classical signature and encryption techniques. To protect systems from malware attacks, it is therefore very important to know and understand malware detection and resolution techniques. Machine Learning is almost a revolution in the world of technology. The old “if-else” style of programming is being replaced by Machine Learning algorithms. Its power to learn from the existing environment and make predictions has always excited me, and I have always been curious about applying this prediction concept of ML in the world of security.
2. Problem Statement
Malicious software, or malware, is software programmed to harm or create problems for the user or the system (Anon., n.d.). It is a growing threat to computer systems and networks, including those owned by giant organizations and big companies, and malware analysis and detection have therefore become a key issue today. A plethora of malicious software, such as computer viruses, has been created in recent years. Despite continuous evolution in the field of cyber security and significant improvements in security mechanisms, malware is still among the biggest threats to cyberspace. Faced with this problem, many researchers and vendors are working together to find faster alternative methods of malware detection and analysis that can safeguard today’s internet from attack. Various anti-malware companies, e.g. Kaspersky, Norton and McAfee, develop software to tackle malware. Yet even with all this software available, malware attacks still happen in large numbers, and people are still facing losses (C, 2018).
The above graph is the result of research conducted by Comparitech (Andra Zaharia) on the effects of cyberattacks around the globe. The research uses statistics to analyse the effect of cyberattacks on the economy.
2.1 Some Recent Major Malware Attacks
Lucy: A File Encryption Android Malware for Ransomware Operations (Anon., 2020). Lucy is a malware attack on smartphones that is based on file-encryption capabilities. It is the work of a cybercriminal gang in Russia, also named Lucy, which became famous by launching a malware botnet attack named Black Rose Lucy Service. The exploit used by this gang relies on the fact that the Android accessibility service can mimic a user’s on-screen clicks. Once the malware is installed on a smartphone through the internet, it is able to grant itself all admin privileges, which in turn lets the attacker attach files to the mobile device. The attackers then pose as the FBI and send a note to the targeted Android device saying that the user has been found guilty of storing pornographic content and must pay a penalty of $500. To make the ransom demand look legitimate and threatening, the attackers send a note to the user’s browser saying that they have the user’s photograph and details and that, if the penalty is not paid, the user may be apprehended.
(Anon., 2020) At a security summit held in Russia by Kaspersky, Kaspersky researchers talked about the PhantomLance malware, which hackers used to upload malicious applications to the Play Store. Users assume that applications on the Play Store are backed by Google’s security and must therefore be secure, so they download these applications and get hacked. These attacks are mostly targeted at Asian countries such as India, Bangladesh, Sri Lanka and China.
3. Objective
The purpose of this thesis is to analyse the important previous research conducted on cybersecurity and to understand how each researcher tackled a particular problem. I am going to analyse the different concepts used to build anti-malware and pick out the important features that actually had an impact on tackling malware. This research is entirely Machine Learning based, so my main goal will be to use these features as parameters for my model. My purpose is also to understand how Machine Learning concepts deal with cybersecurity issues, and whether Machine Learning is powerful enough to deal with all kinds of malware.
3.1 Motivation & Purpose
The increase in malware that exploits systems, servers and networks has become a serious threat. Since I started learning about these technologies, I have always noticed security issues when using any kind of application or website, or when working in a real-time environment. Most other areas of technology are growing rapidly and improving worldwide, but cyber security is a field that I think still faces issues and challenges in protecting cyberspace from attack. When I studied how malware approaches the security of websites, I found it very easy to crack their authentication and their old, classical signature and encryption techniques. To protect systems from malware attacks, it is therefore very important to know and understand malware detection and resolution techniques.
Machine Learning is almost a revolution in the world of technology. The old “if-else” style of programming is being replaced by Machine Learning algorithms. Its power to learn from the existing environment and make predictions has always excited me, and I have always been curious about applying this prediction concept of ML in the world of security.
Static Malware Analysis:
Static malware analysis is analysis in which one reviews and inspects source code and binaries to find suspicious patterns in the code. In static analysis, executable or binary files are analysed without being executed. These executable files have different attributes, such as sections and memory. The static features of the executable files can be extracted using pefile, a Python library for Portable Executable (PE) files.
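To make the idea of reading features without running the file concrete, here is a minimal sketch that parses a few PE header fields directly from the raw bytes. It uses only the Python standard library and is not the library used in this project; the function names, the choice of features and the per-section entropy measure are assumptions made for illustration.

```python
import math
import struct
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def extract_static_features(raw: bytes) -> dict:
    """Read basic header fields from a PE image; the file is never executed."""
    if len(raw) < 0x40 or raw[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", raw, 0x3C)
    if raw[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", raw, pe_offset + 4)
    # The section table follows the optional header; each entry is 40 bytes.
    table = pe_offset + 4 + 20 + opt_size
    sections = []
    for i in range(n_sections):
        (name, virtual_size, _vaddr, raw_size, raw_ptr,
         _p_reloc, _p_line, _n_reloc, _n_line, flags) = struct.unpack_from(
            "<8sIIIIIIHHI", raw, table + 40 * i)
        body = raw[raw_ptr:raw_ptr + raw_size]
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "virtual_size": virtual_size,
            "raw_size": raw_size,
            "flags": flags,
            "entropy": shannon_entropy(body),
        })
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "characteristics": characteristics,
        "sections": sections,
    }
```

For a file on disk, the bytes can be read with open(path, "rb").read() and passed to extract_static_features. The resulting dictionary would still need to be flattened into a fixed-length numeric vector before it can be fed to a classifier.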
Dynamic Malware Analysis:
The Cuckoo Sandbox environment is used for dynamic analysis of malware; it traces the behaviour of a sample during run-time execution. In other words, the software is analysed while it is running. The basic motive for using the sandbox is to isolate the actual system from the testing environment and extract the desired information from the malware execution.
Data Extraction and Analysis:
This will be done with the help of samples provided on VirusShare.com. For dynamic analysis, I have used the Cuckoo Sandbox environment, an automated malware analysis system that records the API calls made during execution together with summarized code and API response codes. The data will be analysed on the basis of feature and dependent-variable selection.
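As an illustration of how recorded API calls can become model inputs, the sketch below counts the API names in a sandbox report and turns them into a fixed-order vector. It assumes the report has already been loaded into a Python dictionary (for example with json.load) and that calls sit under behavior, processes, calls, api, as in Cuckoo 2.x JSON reports. That layout can differ between versions, and the function names and the vocabulary are hypothetical.

```python
from collections import Counter
from typing import List


def api_call_counts(report: dict) -> Counter:
    """Count how often each API name appears across all traced processes."""
    counts: Counter = Counter()
    # Assumed layout: report["behavior"]["processes"][i]["calls"][j]["api"]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            api = call.get("api")
            if api:
                counts[api] += 1
    return counts


def to_feature_vector(counts: Counter, vocabulary: List[str]) -> List[int]:
    """Order the counts by a fixed vocabulary so every sample has the same shape."""
    return [counts.get(name, 0) for name in vocabulary]
```

Keeping the vocabulary fixed across all samples matters more than its exact contents: the classifier needs column k to mean the same API call for every file, malicious or benign.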
4. Literature Survey
This chapter discusses in detail the research papers studied for the development of this thesis. It analyses the different methods and concepts used for detecting malware on a computer system, along with their advantages and disadvantages, and discusses the current systems used for malware detection. After this analysis, it discusses the takeaways from previous research that were used while developing the current system.
(Wu, et al., 2011) proposed a concept to detect malware on a system using BOS and MBF features. BOS stands for Behaviour Operation Set. A BOS consists of basic operations that are performed in the system and change the status of the system, such as creating or deleting a file. (Wu, et al., 2011) defined four BOS types:
1) File Action (FA)
2) Process Action (PA)
3) Network Action (NA)
4) Registry Action (RA)
Features from these BOS that are found in malicious software are termed MBF (Malicious Behaviour Features). Every MBF is defined by a triplet made up of an ID, a malicious degree and a Boolean expression. For example, one such triple defines an MBF with ID Thread_Injection whose malicious degree is high. The meaning of its Boolean expression ∃F∈SFC ∧ P ∉ MP ∧ P Load Image F is: if the malware released a file F during its execution, and meanwhile there existed a process P that automatically loaded the file F, and the process P was neither the malware itself nor derived from it, then the malware performed the thread injection behaviour.
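The thread injection feature above can be sketched as a small data structure plus a predicate. This is only an illustration of the triplet idea and not the implementation of (Wu, et al., 2011); the Trace class, its field names and the event format are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Trace:
    """Hypothetical summary of one sample's observed behaviour."""
    released_files: Set[str]             # SFC: files released by the sample
    malware_processes: Set[str]          # MP: the sample and processes derived from it
    image_loads: List[Tuple[str, str]]   # (process P, file F) "Load Image" events


@dataclass
class MBF:
    """A malicious behaviour feature: ID, malicious degree, Boolean expression."""
    feature_id: str
    malicious_degree: str
    expression: Callable[[Trace], bool]


def thread_injection(trace: Trace) -> bool:
    # True if some released file F was loaded by a process P outside MP.
    return any(
        f in trace.released_files and p not in trace.malware_processes
        for p, f in trace.image_loads
    )


THREAD_INJECTION = MBF("Thread_Injection", "high", thread_injection)


def matched_features(trace: Trace, features: List[MBF]) -> List[str]:
    """Return the IDs of all features whose expression holds for the trace."""
    return [m.feature_id for m in features if m.expression(trace)]
```

Keeping the Boolean expression as a plain function means new features can be added to the list without changing the matching loop.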
Set of MBF features used by (Wu, et al., 2011) to detect malware.
The algorithm used works as follows. Sets ba and be are behaviour operation sets constructed from malware datasets. The value of each feature is calculated and stored in ‘dtr’. A function, SatisfiedMalFeature, checks whether the value is true. If it is, the judgment gist is added by the function AppendProof. The results obtained were promising, with an accuracy of about 97%.
(Yeo, et al., 2018) gathered data from Stratosphere IPS and built a dataset for their system. On the basis of the research done by (Stiborek, et al., 2014) and the stratosphereips organization, (Yeo, et al., 2018) downloaded data for nine different malware from stratosphereips.
(Loi & Olmsted, 2017) studied various heuristic-based methodologies for detecting malware statistically. They found that the results obtained by most heuristic-based research had a high rate of false positives, and they proposed their own system to check for backdoor malware. They proposed that the easiest way to check whether a backdoor is running on a system is to look for established internet connections and the destination domain of each connection, and then to query the API of an online DNS blacklist detection website to check whether the destination domain is legitimate.
In 2015, Pirscoveanu et al. used the Cuckoo sandbox to capture the behaviour of malware files. Their classification system was developed using the random forest algorithm and achieved 98% accuracy in malware classification. A malware detection technique using API call features was proposed in the same year by Ki et al. This technique executes the malware samples in a dynamic environment and collects their API calls. It is based on the observation that malware samples execute almost the same kind of API call sequence to perform their malicious actions, and it produced a 98% recall.
In 2018, (Stiborek et al., 2018) used the sandbox approach to capture the behaviour of malware by executing malware samples in a sandbox environment. Each malware sample was represented by pairs of resource names and resource types, from which features were extracted. Different algorithms, such as random forest, linear SVM and Multi-Layer Perceptron, were then applied to the extracted features, giving approximately 95% accuracy in malware detection.
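The backdoor check described by (Loi & Olmsted, 2017) can be sketched as a filter over the current connections. This is a simplified illustration and not their system: the Connection record is hypothetical, and is_blacklisted stands in for whatever online DNS blacklist API is queried, which is not specified here.

```python
from typing import Callable, Iterable, List, NamedTuple


class Connection(NamedTuple):
    """Hypothetical record for one network connection on the host."""
    process: str
    remote_host: str
    state: str


def suspicious_connections(
    connections: Iterable[Connection],
    is_blacklisted: Callable[[str], bool],
) -> List[Connection]:
    """Keep established connections whose destination domain is blacklisted."""
    return [
        c for c in connections
        if c.state == "ESTABLISHED" and is_blacklisted(c.remote_host.lower())
    ]
```

Passing the lookup in as a function keeps the check independent of any one blacklist provider, and lets it be tried offline with a local set of known-bad domains, e.g. suspicious_connections(conns, bad_domains.__contains__).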
REFERENCES
- Anon., 2020. Lucy: A File Encryption Android Malware for Ransomware Operations. [Online]
Available at: https://go.newsfusion.com//security/item/1638195
- Anon., 2020. Microsoft Warns of Malware Hidden in Pirated Film Files. [Online]
Available at: https://go.newsfusion.com//security/item/1638307
- Anon., 2020. Schneier on Security. [Online]
Available at: https://go.newsfusion.com//security/item/1641641
- Anon., n.d. CYBER EDU. [Online]
Available at: https://www.forcepoint.com/cyberedu/malware [Accessed 2020].
- Anon., n.d. malware (malicious software). [Online]
Available at: https://searchsecurity.techtarget.com/definition/malware
- C, E., 2018. Are Antivirus Programs Effective. [Online]
Available at: https://www.safetydetectives.com/blog/areantivirus-programs-effective/ [Accessed 2020].
- G., G., Stiborek, M. & Zunino, J., 2014. An empirical comparison of botnet detection methods. Computers & Security. IEEE.
- Loi, H. & Olmsted, A., 2017. Low-cost Detection of Backdoor Malware. IEEE.
- Muhamad, I. M. & Rahardjo, B., 2019. Malware Detection Using Honeypot and Machine Learning. IEEE.
CONCLUSION
There is a rise in demand for intelligent methods that recognize new malware cases because the current methods are tedious and error-prone. This study explored various machine learning classifiers and neural network models, which are artificial intelligence methods that can be used for detecting malware. We proposed an ensemble learning-based framework with neural networks used as first-stage classifiers and explored 15 machine learning models as final-stage classifiers. Five different machine learning algorithms were used for comparison as baseline models. We performed our experiments on a dataset containing Windows Portable Executable (PE) malware and benign files. The results obtained indicate that the ensemble of fully connected dense ANN and 1-D CNN models with ExtraTrees as a final-stage classifier achieved the best accuracy value for the classification process, outperforming other methods. Most of the known malware recognition methods concentrate on feature engineering techniques to improve detection accuracy; the advantage of our deep learning-based approach is the end-to-end learning process, which achieves high malware recognition performance without the need for manual feature engineering. Thus, ensemble learning techniques can be adopted as intelligent techniques for malware detection and classification. However, the proposed framework is limited to supervised learning, which requires both benign and malicious samples to be identified and labelled by experts. In a real-world setting, some malicious code may not be identified, and thus the neural network cannot be trained to recognize it. This raises the need for developing unsupervised ensemble learning frameworks for malware recognition. Future work will study explainable artificial intelligence (XAI) techniques to interpret the results of deep learning models for malware recognition and provide valuable insights for researchers in malware analysis. We also plan to conduct additional experiments with larger datasets to validate the proposed framework.
Credit: This project “Malware Detection In P.E. Files” was completed by Prof. Santosh Tamboli, Akash Patade, Aseem Jan and Vishwaraj Singh from the Department of Information Technology, Vidyalankar Institute of Technology, Wadala, Mumbai, India.