Abstract: We are living in one of the most defining periods of development in human history.
The future of computing is broad and exciting. Computing has moved from large mainframes to PCs to the cloud over the Internet. As the Internet has evolved, network security has become an important issue: the Internet brings people together but also exposes them to serious potential threats. Intrusion detection is considered an important technique for securing networks behind firewalls. There is no doubt that Machine Learning (ML) and Artificial Intelligence (AI) have rapidly gained popularity in the past couple of years because of properties such as adaptability, scalability, and the potential to adjust rapidly to new and unknown challenges.
Cyber security is a fast-growing field that demands a great deal of attention because of remarkable progress in social networks, cloud and web technologies, online banking, mobile environments, smart grids, etc. Diverse ML methods have been successfully deployed to address wide-ranging problems in computer security. ML, currently one of the most prominent trends in the tech industry, is a powerful tool for making predictions and calculated suggestions, generally based on very large amounts of data. Cyber security is positioned to leverage ML to improve malware detection, triage events, recognize breaches, and alert organizations to security issues. ML can be used to identify advanced targeting and threats such as organization profiling, infrastructure vulnerabilities, and potentially interdependent vulnerabilities and exploits. ML can significantly change the cyber security landscape.
This paper describes the role of ML in detecting and highlighting advanced malware for cyber defense analysts. Various ML algorithms are discussed and compared. The paper also gives an overview of different applications of ML in cyber security, such as phishing detection, spam filtering, and network intrusion detection.
Keywords: Machine learning algorithms, Artificial Intelligence, Cyber Security

I. INTRODUCTION
At a time when manual work is increasingly being computerized, the meaning of "manual" is being reshaped.
ML algorithms can enable computers to play games, assist in surgery, and become smarter and more personal. Artificial Intelligence (AI) affects the lives of ordinary people from moment to moment, helping to tackle the complexities of transportation (Google’s AI-powered predictions, ridesharing apps like Uber and Lyft, AI autopilot on commercial flights), email (spam filters, smart email categorization), grading and assessment (plagiarism checkers, robo-readers), banking and personal finance (mobile check deposits, fraud prevention, credit decisions), social networking (Facebook, Pinterest, Instagram), online shopping (search, recommendations, fraud protection), and mobile use (voice-to-text, smart personal assistants), and helping to stay ahead of cyber security threats. One of the main features of these transformations is how computing techniques and tools have been democratized.
In the past few years, data scientists have assembled evolutionary data-crunching machines by seamlessly executing advanced techniques [1]. The results are impressive. ML is a data analytics technique through which computers learn to do what comes naturally to humans and animals, i.e., learn from experience. ML algorithms use computational methods to learn information directly from data without depending on a predetermined equation. The algorithms adaptively improve their performance as the number of inputs available for learning increases. There is no doubt that AI/ML has rapidly gained popularity in the past couple of years.
With Big Data a major trend in the tech industry at present, machine learning is a powerful way to make predictions and calculated suggestions based on large amounts of data. Specific attacks such as malware and ransomware continue to pose a major challenge for most commercial, government, and academic organizations. Training and providing cyber security expertise is a daunting challenge for the global security community. New techniques such as ML offer a unique opportunity to close the cyber skills gap by reducing the number of cyber security personnel needed to research, analyze, and share malware detection information [1].
Some of the most familiar examples of ML are Netflix’s algorithms and Amazon’s algorithms that recommend books based on the books you have bought or searched for before. The use of ML in cyber security is inevitable. It provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

II. EVOLUTION OF MACHINE LEARNING (ML)
ML was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. Researchers interested in artificial intelligence wanted to see whether computers could learn from data. The iterative aspect of ML is important because models learn from computations to produce reliable, repeatable decisions and results [1].
While many ML algorithms have been around for a long time, the ability to automatically apply complex mathematical calculations to big data is a recent development.
Types of Machine Learning Algorithms
There are three types of ML algorithms, as depicted in Fig. 1.
Figure 1: Types of Machine Learning Algorithms
1. Supervised Learning
This type of algorithm works with an outcome (dependent) variable that is to be predicted from a set of input variables. Using these variables, it generates a function that maps inputs to their desired outputs. Training continues until the model reaches a desired level of accuracy on the training data. Some examples of supervised learning are Decision Tree, Regression, and Logistic Regression.
2. Unsupervised Learning
In this type of algorithm, there is no target or outcome variable to estimate. It is used to cluster a population into different groups, for example to segment customers into groups for specific interventions.
3. Reinforcement Learning
In this type of algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment in which it trains itself continually using trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate decisions.
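As a minimal sketch of the supervised case (item 1 above), the following trains a perceptron until it reaches a desired accuracy on the training data. The perceptron, the function name `train_perceptron`, and the toy data are illustrative assumptions; the paper does not prescribe a particular model here.

```python
def train_perceptron(samples, labels, target_accuracy=1.0, max_epochs=100, lr=1.0):
    """Learn weights w and bias b so that (w.x + b > 0) reproduces the 0/1 labels."""
    w = [0.0] * len(samples[0])
    b = 0.0

    def predict(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

    def accuracy():
        return sum(predict(x) == y for x, y in zip(samples, labels)) / len(samples)

    for _ in range(max_epochs):
        # Stop as soon as the desired accuracy on the training data is reached.
        if accuracy() >= target_accuracy:
            break
        for x, y in zip(samples, labels):
            error = y - predict(x)  # 0 when correct, +1 or -1 when wrong
            for i, xi in enumerate(x):
                w[i] += lr * error * xi
            b += lr * error
    return w, b, accuracy()


# Toy, linearly separable data (hypothetical): the label is 1 only when both inputs are 1.
toy_samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_labels = [0, 0, 0, 1]
weights, bias, train_acc = train_perceptron(toy_samples, toy_labels)
```

The loop mirrors the description above: the mapping from inputs to outputs is adjusted repeatedly, and training stops once the accuracy target is met or the epoch budget runs out.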
Some Machine Learning Algorithms
Machine learning algorithms can be broadly grouped into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is useful in cases where a label is available for a certain training set but needs to be predicted for other objects. Unsupervised learning is useful when the challenge is to find implicit relationships in a given unlabeled dataset. Reinforcement learning lies between these two categories.
· Linear Regression: A relationship is established between the dependent and independent variables by fitting them to a line. This line is known as the regression line and is represented by a linear equation.
· Logistic Regression: Logistic regression is generally used to estimate discrete values from a set of independent variables. It predicts the probability of an event and is also called logit regression.
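The two regression models above can be sketched in a few lines. This is an illustrative example for a single independent variable; the function names `fit_line` and `logistic_probability` and the sample values are assumptions, not taken from the paper.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of the regression line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def logistic_probability(x, weight, bias):
    """Logistic (logit) model: map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Hypothetical weight and bias; a probability above 0.5 would be read as "event occurs".
p_event = logistic_probability(5, weight=2.0, bias=-4.0)
```

Linear regression returns a continuous value on the fitted line, whereas logistic regression passes the same kind of linear score through the logistic function so that the result can be interpreted as the probability of a discrete outcome.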
Methods often used to improve logistic regression models include adding interaction terms, eliminating features, applying regularization techniques, and using a non-linear model.
· Decision Tree: This is one of the most popular ML algorithms in use today. It is a supervised learning algorithm used for classification problems, and it works well with both categorical and continuous dependent variables.
· Support Vector Machine (SVM): This is a classification method in which the raw data are plotted as points in an n-dimensional space, where n is the number of features. The value of each feature is tied to a particular coordinate, which makes the data easy to classify. Lines that split the data are called classifiers and can be drawn on the plot to separate the classes.
· Naïve Bayes: This algorithm assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naïve Bayes classifier treats these features as independent when calculating the probability of a particular outcome, even if they are related to each other. A Naïve Bayes model is easy to build and very useful for massive datasets [2]. Despite its simplicity, it is known to sometimes outperform highly sophisticated classification methods.
· K-Nearest Neighbors (KNN): This algorithm can be applied to both classification and regression problems, although it is most widely used for classification. It is a simple algorithm that stores all available cases and classifies any new case by a majority vote of its k nearest neighbors.
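A minimal sketch of the KNN vote described above, assuming Euclidean distance and numeric feature tuples. The function name `knn_classify` and the toy cases are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """training: list of (features, label) pairs kept in memory as-is.

    Returns the majority label among the k stored cases closest to the query.
    """
    nearest = sorted(training, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature cases with labels 0 and 1.
stored_cases = [
    ((1, 2), 0), ((2, 1), 0), ((3, 2), 0), ((2, 3), 0),
    ((7, 8), 1), ((8, 7), 1), ((9, 8), 1), ((8, 9), 1),
]
predicted = knn_classify(stored_cases, (2, 2), k=3)
```

Every query computes the distance to every stored case, which is where the cost of the method comes from. Because the distance mixes all features, a feature with a much larger numeric range would dominate the vote unless the features are scaled first.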
KNN is easy to understand by analogy with real life. Some things to consider before selecting KNN:
  · KNN is computationally expensive.
  · Variables should be normalized; otherwise, higher-range variables can bias the algorithm.
  · Data still needs to be pre-processed.
· K-Means: K-Means is an unsupervised algorithm for solving clustering problems. A data set is partitioned into a particular number of clusters in such a way that the data points within a cluster are homogeneous, and heterogeneous with respect to the data in other clusters [2]. K-Means forms clusters as follows:
  · The algorithm picks k points, one for each cluster. These points are known as centroids.
  · Each data point forms a cluster with its closest centroid.
  · New centroids are then created based on the existing cluster members.
  · With these new centroids, the closest centroid is determined again for each data point. This process is repeated until the centroids do not change.
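The K-Means steps above can be sketched as follows. Seeding the centroids with the first k data points is an assumption made to keep the example deterministic (the text only says that k points are picked), and the function name `k_means` and the sample points are hypothetical.

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster numeric tuples into k groups; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    for _ in range(max_iterations):
        # Each data point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroids are the mean of the current cluster members.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(axis) / len(cluster) for axis in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid in place
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


sample_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(sample_points, k=2)
```

On this toy input the two seeds start in the same corner, and the assignment and update steps pull one centroid toward the second group within a few iterations.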
· Random Forest: A collection of decision trees is called a random forest. To classify a new object based on its attributes, each tree produces a classification, and the tree is said to vote for that class. The forest chooses the classification that has the most votes. Each tree is planted and grown as follows: if the number of cases in the training set is N, then a sample of N cases is taken at random (with replacement). This sample is the training set for growing the tree. Each tree is grown to the largest extent possible, with no pruning.
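The sampling and voting steps of a random forest can be sketched as below. To keep the example short, a one-feature threshold rule (a "stump") stands in for a fully grown, unpruned tree; this substitution, the random feature choice, and all names and data are assumptions for illustration only.

```python
import random
import statistics
from collections import Counter


def bootstrap_sample(cases, rng):
    """Draw N cases at random, with replacement, from a training set of N cases."""
    return [rng.choice(cases) for _ in range(len(cases))]


def grow_stump(sample, rng):
    """Stand-in for a tree: threshold on one randomly chosen feature (labels are 0 or 1)."""
    labels = {label for _, label in sample}
    if len(labels) == 1:  # the sample contains a single class: always predict it
        only = labels.pop()
        return lambda x: only
    feature = rng.randrange(len(sample[0][0]))
    mean0 = statistics.mean(f[feature] for f, y in sample if y == 0)
    mean1 = statistics.mean(f[feature] for f, y in sample if y == 1)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda x: high_class if x[feature] > threshold else 1 - high_class


def grow_forest(cases, n_trees=25, seed=0):
    """Grow each tree on its own bootstrap sample of the training cases."""
    rng = random.Random(seed)
    return [grow_stump(bootstrap_sample(cases, rng), rng) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Each tree votes for a class; the forest returns the class with the most votes."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training cases with labels 0 and 1.
training_cases = [
    ((1, 2), 0), ((2, 1), 0), ((3, 2), 0), ((2, 3), 0),
    ((7, 8), 1), ((8, 7), 1), ((9, 8), 1), ((8, 9), 1),
]
forest = grow_forest(training_cases)
```

Because every tree sees a different bootstrap sample, individual trees can disagree; the majority vote is what makes the combined prediction stable.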
· Dimensionality Reduction Algorithms: Vast amounts of data are stored and analyzed by corporations today. Dimensionality reduction algorithms address the challenge of identifying significant patterns and variables in such data.
· Gradient Boosting and AdaBoost: These boosting algorithms are used when massive loads of data have to be handled in order to make predictions with high accuracy. Boosting is an ensemble learning technique that combines the predictive power of several base estimators to improve robustness. In short, it combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often perform well in data science competitions such as Kaggle and AV Hackathon, and they are among the most preferred ML algorithms today.
ML has played a critical role across several technologies and practices that have been developed to reduce the opportunity for, and limit the damage of, cyber-attacks [3]. One class of cyber security threats consists of vulnerabilities caused by factors outside the end user’s control, such as security flaws in applications and protocols. The traditional remedies include using firewalls and antivirus software, distributing patches that fix newly discovered problems, and amending protocols. While the defense against such threats is still an ongoing battle, software engineers have been effective in countering most threats and reducing the risk to an acceptable level in most cases [4]. The other class, which has received less attention, includes the problems caused by the actions of unaware users. For example, an attacker may convince inexperienced users to install a fake antivirus, which in reality corrupts their computers.

III. MACHINE LEARNING IN CYBER SECURITY
ML is a powerful tool that can be employed in many areas of cyber security.
· Phishing detection: Phishing is a deception technique that uses a combination of social engineering and technology to gather sensitive and personal information, such as passwords and credit card details, by masquerading as a trustworthy person or business in an electronic communication. ML is applied to predict whether a given URL or domain is a phishing website. It can accurately identify a wide variety of phishing pages, including those that present users only with an image to elude content analysis and those that deliver dynamic content to evade web crawlers. The role of ML is to detect a phishing site and alert the affected users. It also alerts the brand that the phishing site was trying to mimic, so that the brand can take proper precautions to protect itself. The features used for phishing domain detection with ML techniques are grouped as follows:
  · URL-based features
  · Domain-based features
  · Page-based features
  · Content-based features
LR, SVM, RF, Decision Tree, and K-Means are ML algorithms applied in phishing detection.
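To make the first feature group concrete, the sketch below derives a few lexical features from a URL. The specific features (length, dots and hyphens in the host, an "@" symbol, an IP address as host, HTTPS) are common illustrative choices and are assumptions here; the paper names only the category. The function name `url_features` and the example URLs are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url):
    """Return a dict of simple URL-based features for one URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }


example = url_features("http://192.168.0.1/login")
```

A feature dictionary like this would be turned into a numeric vector and passed to one of the classifiers listed above; the features on their own do not decide whether a URL is phishing.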
· Network Intrusion Detection: Many intrusion detection systems (IDSs) are based on machine learning techniques because of their adaptability to new and unknown attacks. There are three main types of cyber analytics in support of IDSs: misuse-based (sometimes also called signature-based), anomaly-based, and hybrid.
Misuse-based techniques are designed to detect known attacks by using signatures of those attacks. They are effective for detecting known types of attacks without generating a large number of false alarms. They require frequent manual updates of the database with rules and signatures, and they cannot recognize novel (zero-day) attacks [5]. In misuse detection, new data are passed through a pre-defined model, which classifies them as either misuse or normal. To figure out what has been stolen, an analyst might investigate file access logs or network activity, looking for access to sensitive files or for large amounts of data flowing out of the network [7]. Malware analysis of the disk may be needed to track down known malware samples using signatures developed by other human analysts. Alternatively, the running system might be examined for unusual processes or other strange behavior as part of the incident response.
Anomaly-based techniques model normal network and system behavior and identify anomalies as deviations from that behavior. They have the advantage of being able to detect zero-day attacks. Another advantage is that the profiles of normal activity are customized for every system, application, or network, which makes it difficult for attackers to know which activities they can carry out undetected. Additionally, the data on which anomaly-based techniques alert (novel attacks) can be used to define signatures for misuse detectors. The main disadvantage of anomaly-based techniques is the potential for high false alarm rates (FARs), because previously unseen (yet legitimate) system behaviors may be categorized as anomalies.
Hybrid techniques combine misuse and anomaly detection. They are used to raise detection rates of known intrusions and to decrease the false positive (FP) rate for unknown attacks. Most of the methods in use are in fact a mix of both techniques. Therefore, in the descriptions of ML methods, the anomaly detection and hybrid methods are described together [6].
An ML approach usually consists of two phases: training and testing.
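A minimal sketch of the anomaly-based idea with separate training and testing phases. It models "normal" as the mean and standard deviation of a single numeric measurement and flags large deviations; the measurement (for example, requests per minute), the threshold of 3 standard deviations, and the function names are all assumptions for illustration.

```python
import statistics


def fit_baseline(normal_values):
    """Training phase: summarize normal behavior by its mean and standard deviation."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomaly(value, baseline, threshold=3.0):
    """Testing phase: flag values that deviate strongly from the learned baseline."""
    mean, std = baseline
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical measurements observed during normal operation.
normal_traffic = [98, 102, 100, 97, 103, 101, 99, 100]
baseline = fit_baseline(normal_traffic)

typical = is_anomaly(101, baseline)          # close to the baseline
attack_like = is_anomaly(250, baseline)      # far outside the baseline
unseen_but_benign = is_anomaly(108, baseline)  # also flagged, although it may be legitimate
```

The last case illustrates the false alarm problem noted above: a value that was simply never seen during training is flagged in the same way as a real attack.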
Often, the following steps are performed:
• Identify class attributes (features) and classes from training data.
• Identify a subset of the attributes necessary for classification (i.e., dimensionality reduction).
• Learn the model using training data.
• Use the trained model to classify the unknown data.
· Validation and Authorization with keystroke dynamics: Keystroke dynamics is a class of behavioral biometrics that captures the typing style of a user. The majority of computer systems employ a login ID and password as the primary means of access security. In stand-alone situations this level of security may be adequate, but when computers are connected to the Internet, the vulnerability to a security breach increases. To reduce vulnerability to attack, biometric solutions have been employed. The Probabilistic Neural Network (PNN) is a highly suitable candidate ML algorithm for keystroke dynamics authentication. At present, there are two major types of biometrics: those based on physiological properties and those based on behavioral characteristics. Physiological biometrics incorporate a measurement of some physiological feature, such as fingerprints, retinal blood vessel patterns, or iris patterns, into an automated authentication scheme. Behavioral biometrics, on the other hand, extract and integrate information about human behavior, such as variations in speech patterns, gait, signature, and the way we type, into the authentication scheme.
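As an illustration of how typing style might be turned into numbers, the sketch below computes two timing features that are commonly used for keystroke dynamics: dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). These features, the simple distance check used in place of the PNN mentioned above, the tolerance value, and all names and timings are assumptions, not the paper's method.

```python
def timing_features(events):
    """events: list of (key, press_time_ms, release_time_ms) in typing order.

    Returns the dwell times followed by the flight times.
    """
    dwell = [release - press for _, press, release in events]
    flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]
    return dwell + flight


def matches_profile(sample, profile, tolerance_ms=30.0):
    """Accept the sample if its mean absolute deviation from the enrolled profile is small."""
    if not sample or len(sample) != len(profile):
        return False
    deviation = sum(abs(s - p) for s, p in zip(sample, profile)) / len(sample)
    return deviation <= tolerance_ms


# Hypothetical timings (milliseconds) for typing the first three keys of a password.
attempt = timing_features([("p", 0, 90), ("a", 150, 230), ("s", 300, 370)])
enrolled_profile = [95, 85, 65, 55, 75]
accepted = matches_profile(attempt, enrolled_profile)
```

In a fuller system the feature vector would be fed to a trained classifier, so that a correct password typed with an unfamiliar rhythm can still be rejected.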
· Artificial intelligence and robotics: For ML systems to have a real impact in these important domains, they must be able to communicate with highly skilled human experts, both to draw on their judgment and expertise and to offer useful information or patterns from the data. ML methods and humans have skills that complement each other: ML methods are good at computation on data at the lowest level of granularity, whereas people are better at abstracting knowledge from their experience and passing that knowledge on. Neural networks are used for character recognition.
· Encryption and decryption techniques: We can represent power consumption as floats, and we can represent results as confidence in a key bit. In any case, I do not agree that this is strictly cryptography, as it is not really a meaningful way to connect ML to a cryptosystem. It is rarely used as a way to analyze data; it is used mainly for information extraction with ML.
· Examine security properties of protocols: Security and reliability of network protocol implementations are essential for communication services. Most approaches for verifying security and reliability, such as formal validation and black-box testing, are limited to checking the specification or the conformance of the implementation. However, a protocol implementation may contain engineering details that are not included in the system specification but may result in security flaws. Black-box implementations are deployed with Linear Regression algorithms.
· End user techniques: Spammers exploit social networks to carry out phishing attacks, disseminate malware, and promote affiliate websites.
It is no wonder, then, that ML is making inroads everywhere. Performing tasks without the need to program them explicitly is what makes ML so powerful. Diverse techniques are designed to filter spam, including blacklists/whitelists, Bayesian classification algorithms, keyword matching, header information processing, investigation of spam-sending factors, and investigation of received mails. Spam detection is organized into several steps: mapping and assembling, pre-filtering, and classification. In mapping and assembling, a standard model is defined for each object, characterized by its structure. For instance, in our proposed framework we have used two models: a message model and a profile model. In pre-filtering, the incoming object is checked by comparing it with a blacklist. Spam detection in social networks using Decision Tree, SVM, Random Forest, and Naïve Bayes approaches is highly effective, and a combination of spam prevention filters gives higher accuracy. Spammers typically post numerous messages by creating fake profiles. They also attempt to hack other users' profiles. Thus, in this research work the SVM is trained in such a way that it classifies the testing data considering both the profile model and the message model.

VI. FINDINGS AND RESULTS
ML in security is a fast-growing trend.
Analysts at ABI Research estimate that ML in cyber security will boost spending on big data, artificial intelligence (AI), and analytics to $96 billion by 2021. Most major security companies have shifted from a traditional “signature-based” system for detecting malware to an ML system that tries to interpret actions and events and learns from a variety of sources which data or information is safe. ML is used to design security systems, evaluate protocol implementations, and provide human interaction with the machine. Random forest based classifiers are the best classifiers, with a classification accuracy of 97.47% for the given dataset of phishing sites. SVM techniques perform best with a 95.5% detection rate. Analysis of the SVM in spam detection demonstrates a precision of 70% to 82% for the given datasets. Table 1 consolidates the areas of cyber security and the ML algorithms applied to them.
Table 1: Areas of Cyber Security and ML Algorithms

VII. CONCLUSION
The goal of this paper was to better understand how machine learning is applied in the cyber security domain. There exist some robust anti-phishing algorithms and network intrusion detection systems. Data mining has been widely recognized as an important means of mining useful information from large volumes of data that are noisy, fuzzy, and random. Machine learning algorithms can improve the efficiency of IDSs. Machine learning can be successfully used for developing authentication systems, evaluating protocol implementations, assessing the security of human interaction proofs, smart meter data profiling, etc. There are many opportunities in information security to apply machine learning to address various challenges in such a complex domain. Spam detection, virus detection, and surveillance camera robbery detection are only some examples.