
Posted on June 13, 2024

A Static Malicious Javascript Detection Using SVM

WANG Wei-Hong, LV Yin-Jun, CHEN Hui-Bing,

FANG Zhao-Lin

Zhejiang University of Technology

HangZhou, China

Abstract-Malicious scripts, such as malicious JavaScript, are one of the primary threats to network security. JavaScript is not only a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications, but is also used to carry out attacks that steal users' credentials and lure users into providing sensitive information to unauthorized parties. We propose a static malicious JavaScript detection technique based on SVM (Support Vector Machine). Our approach combines static detection with machine learning: it analyzes and extracts malicious script features and uses SVM to classify scripts. The technique is characterized by a high detection rate, a low false positive rate, and the ability to detect unknown attacks. In experiments on the prepared data set, it achieved excellent detection performance.


Keywords-SVM; static detection; malicious script detection

Published by Atlantis Press, Paris, France.

I. INTRODUCTION

With the rapid development of network information technology, information security issues are gaining more and more attention. Malicious scripts are one of the primary security threats to computer networks. By constructing a special web page that contains Trojans, viruses, worms, or aggressive programs, a malicious script propagates to the user's computer when the user accesses the page.

Based on the execution state of the malicious script, current detection methods can be divided into static analysis and dynamic analysis.

Without executing the script, the static analysis method uses the static characteristics and the structure of the script to identify malicious scripts. For example, [1] counts malicious signatures, weights the different statistical methods with the judgment matrix method, and finally uses the weighted geometric mean method to obtain the result. This method not only requires some obvious features but is also weak at finding unknown attacks.

The dynamic analysis method runs malicious scripts in a controlled environment and detects them by observing their execution states and processes. The approaches in [2][3] monitor system ports, network connections, the registry, and system configuration files to detect abnormal procedures. This method has to run the malicious code, which increases the risk to the system, and its efficiency is also a problem.

A malicious script is special code hidden in a scripting-language file, such as a js file. Thanks to the standardized script format and grammar, we can obtain enough static characteristic information from the file to distinguish malicious scripts from benign scripts [4]. This article uses machine learning techniques to analyze the features of scripts and proposes a static detection method based on SVM.

II. MALICIOUS SCRIPT FEATURE EXTRACTION

JavaScript [5] is a lightweight, object-based and event-driven scripting language. Embedded in HTML, JavaScript makes it possible to develop interactive web pages that give web users real-time, dynamic interaction [6]. However, JavaScript is also an attractive choice for attackers to implement their assaults and distribute them over the Internet, such as cross-site scripting attacks, SQL injection attacks and passive download attacks.

According to a survey of 90 sites in the China Education and Research Network (CERNET) in 2008, nearly one-third of the sites were attacked, and 39% of the attacks were caused by malicious JavaScript [6]. Its characteristics make JavaScript an easy carrier of malicious programs. JavaScript has two such characteristics: first, as a description language stored in a file, JavaScript can be executed directly by the browser; second, without protection, JavaScript written in HTML can be seen and copied directly by anyone.

These characteristics have made JavaScript one of attackers' favorite tools. To solve this problem, a sandboxing mechanism is provided to prevent malicious JavaScript from compromising the security of the client's environment [8]; it allows the code to perform only a restricted set of operations. However, the sandboxing mechanism not only introduces efficiency problems but also constrains the execution of JavaScript on the client. In this paper, we turn to machine learning classification techniques to solve this problem.

To achieve this goal, features are first analyzed and extracted. According to [9], we can extract 17 malicious JavaScript features, and 10 more features are added based on our analysis of the data. Some of the 27 features are explained as follows.

In most benign scripts, the number of certain special functions is limited, while malicious scripts contain a relatively large number of them, such as the eval function, the escape function, and DOM-modifying functions. Exploits usually call several DOM functions in order to instantiate vulnerable components and/or create elements in the page for the purpose of loading external scripts and exploit pages. The escape function could be called to encode malicious code. Abnormal use of special keywords, tags, and strings is also considered.
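As a rough illustration of the function-count features described above, the following Python sketch counts calls to eval, escape/unescape, a few DOM-modifying functions, and setTimeout with regular expressions. The pattern set and the names `PATTERNS` and `count_special_functions` are assumptions made for this example; the paper does not list the exact function names it counts, and its prototype is written in C rather than Python.

```python
import re

# Hypothetical pattern set, chosen for illustration only; the paper does not
# specify which function names are counted for each feature.
PATTERNS = {
    "eval": re.compile(r"\beval\s*\("),
    "escape_unescape": re.compile(r"\b(?:un)?escape\s*\("),
    "dom_modification": re.compile(
        r"\.(?:appendChild|createElement|insertBefore|replaceChild)\s*\("
        r"|document\.write(?:ln)?\s*\("
    ),
    "set_timeout": re.compile(r"\bsetTimeout\s*\("),
}


def count_special_functions(script: str) -> dict:
    """Return how many times each pattern group occurs in the script text."""
    return {name: len(rx.findall(script)) for name, rx in PATTERNS.items()}


# A tiny made-up sample script (not taken from the paper's dataset).
sample = (
    'var s = unescape("%61%6c%65%72%74");\n'
    "eval(s);\n"
    'var f = document.createElement("iframe");\n'
    "document.body.appendChild(f);\n"
)
counts = count_special_functions(sample)
print(counts)
```

Purely textual matching like this never executes the script, which is the point of a static method, but it can be fooled by obfuscated call sites.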

Unfortunately, obfuscation techniques, which were intended to protect source code, are used by attackers to circumvent this feature extraction. To reduce the impact of obfuscation, we also perform a certain degree of strength analysis [10]. Features such as the script's whitespace percentage, the maximum entropy of the strings, and the entropy of the script are measured. Table I lists the features.
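The obfuscation-related measurements mentioned above (whitespace percentage, script entropy, string entropy and string lengths) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: entropy is taken to be the Shannon entropy of the character distribution, string literals are found with a simplified regular expression, and the function names and feature keys are hypothetical. The threshold of 40 characters for long strings follows Table I.

```python
import math
import re
from collections import Counter

# Simplified string-literal matcher (assumption): single- or double-quoted
# literals only, without template literals or regex literals.
DOUBLE_QUOTED = r'"((?:[^"\\\n]|\\.)*)"'
SINGLE_QUOTED = r"'((?:[^'\\\n]|\\.)*)'"
STRING_LITERAL = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def whitespace_percentage(script: str) -> float:
    """Fraction of characters in the script that are whitespace."""
    if not script:
        return 0.0
    return sum(ch.isspace() for ch in script) / len(script)


def string_features(script: str, long_threshold: int = 40) -> dict:
    """Compute a few of the string- and entropy-based features."""
    strings = [a or b for a, b in STRING_LITERAL.findall(script)]
    lengths = [len(s) for s in strings]
    return {
        "script_length": len(script),
        "whitespace_percentage": whitespace_percentage(script),
        "script_entropy": shannon_entropy(script),
        "num_strings": len(strings),
        "avg_string_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_string_length": max(lengths, default=0),
        "max_string_entropy": max((shannon_entropy(s) for s in strings), default=0.0),
        "num_long_strings": sum(n > long_threshold for n in lengths),
        "num_iframe_strings": sum("iframe" in s.lower() for s in strings),
    }


features = string_features('var a = "iframe"; var b = \'xy\';')
print(features)
```

Heavily encoded payloads tend to produce long strings with unusual entropy and scripts with little whitespace, which is why such measurements remain informative even when function names are hidden.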

TABLE I. FEATURES OF DATASET

the number of DOM modification functions
the script's whitespace percentage
the average length of the strings used in the script
the average script line length
the number of strings containing "iframe"
the number of suspicious tag strings
the length of the script in characters
the number of unescape and escape
the number of eval() function
the number of the setTimeout() functions
the ratio between keywords and words
the number of built-in functions used for deobfuscation
the entropy of the strings declared in the script
the entropy of the script as a whole
the number of long strings (>40)
the maximum entropy of all the script's strings
the probability of the script to contain shellcode
the maximum length of the script's strings
the number of string direct assignments
the number of string modification functions
the number of event attachments
the number of suspicious strings
the number of classid
the number of parseInt and fromCharCode
the ratio between n and line
the number of chars in hex
the number of CreateObject, ActiveXObject

III. MALICIOUS SCRIPT DETECTION BASED ON SVM

SVM, a machine learning technique, helps summarize the knowledge needed to identify known malicious JavaScript and carries out a similarity search to find unknown malicious JavaScript, with a high detection rate and a low false alarm rate [11].

A. SVM

SVM (Support Vector Machine), which originated in the statistical learning theory of Vapnik et al. in 1995, focuses on pattern classification problems [12]. It is a statistical learning algorithm that classifies samples using a subset of the training samples called support vectors. In simple terms, SVM creates a feature space from the attributes of the training dataset and searches for a decision boundary, the optimal hyperplane, that separates the feature space with the maximum margin, as shown in Fig. 1.

There are two types of SVM: the linear SVM, which separates the data points with a linear boundary, and the non-linear SVM, which separates them with a nonlinear boundary.

In the linearly separable case, it is easy to find a plane in the feature space that separates the two types of samples. The optimal plane is the one with the maximum geometric margin. For training samples $x_i$ with labels $y_i \in \{-1, +1\}$, this is expressed as:

$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \ge 1, \quad i = 1, \dots, n$$

This is a convex quadratic programming problem. To solve it, the Lagrange function is introduced to turn it into its dual problem. Slack variables $\xi_i$ and a penalty term are introduced to deal with linearly inseparable problems caused by noise. The objective function then becomes:

$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$

Linear SVM performs well on datasets that can easily be separated by a hyperplane into two parts. But some datasets are complex and difficult to classify with a linear kernel; non-linear SVM classifiers can be used for such datasets.

In the non-linear case, the data is mapped into a high-dimensional space, where an optimal separating hyperplane is sought. With an appropriate mapping function, most non-linear problems can be transformed into linear problems in the high-dimensional space. However, the high-dimensional mapping also brings the curse of dimensionality, and computing the separating hyperplane directly in the feature space becomes prohibitively expensive. The kernel trick addresses this problem: the inner product in the feature space is computed with a kernel function $K$ that satisfies Mercer's condition, so the dual problem becomes:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad C \ge \alpha_i \ge 0, \quad i = 1, \dots, n$$

Common kernel functions are the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. The Gaussian kernel is a universal kernel function; with appropriately selected parameters, it can achieve a high accuracy. The Gaussian kernel is:

$$K(x_i, x_j) = \exp\left(-\gamma \cdot \|x_i - x_j\|^2\right)$$
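To make the Gaussian kernel concrete, here is a small Python sketch that evaluates it for a few points and builds the Gram matrix of pairwise kernel values. It assumes the common parameterization exp(-gamma * squared Euclidean distance); the sample points, the gamma value, and the helper names `rbf_kernel` and `gram_matrix` are illustrative assumptions, not values or code from the paper.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, gamma):
    """Matrix of pairwise kernel values, as used in the dual SVM problem."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Made-up two-dimensional points and an arbitrary gamma.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(samples, gamma=0.5)
for row in K:
    print([round(value, 4) for value in row])
```

Each entry depends only on the distance between two samples, so the high-dimensional mapping never has to be computed explicitly. A larger gamma makes the kernel value fall off faster with distance.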


Figure 1. Optimal hyperplane

B. The malicious script analysis framework based on SVM

As mentioned before, script analysis can be divided into static analysis and dynamic analysis. Here, we propose an SVM-based static analysis method, combined with machine learning classification techniques, to distinguish malicious scripts from benign scripts. Its script training flowchart and script test flowchart are shown in Fig. 2.

a) Dataset preparation: collect enough malicious JavaScript and benign JavaScript from websites.

b) Data cleaning: clean the sample data, for example by removing comments and excess carriage returns and line feeds, which increases the processing speed and accuracy.

c) Feature extraction: extract 27 features based on the analysis above.

d) Pre-treatment: normalize the data, scaling it to [0,1]. This reduces the training error that arises when feature values are too large or too small. It also improves efficiency.

e) Parameter tuning: WEKA is the platform used to train models. Using the GridSearch algorithm for the binary-classification SVM together with ten-fold cross-validation, it selects the best SVM model parameters.

f) Model training: train the best SVM model with the optimal parameters.

g) Data prediction: use the best model to predict the classification of the test set.

IV. EXPERIMENTAL ANALYSIS AND IMPLEMENTATION

A. Experimental Analysis

The experimental data consists of 1000 malicious JavaScript samples collected from VX Heavens [13] and 1000 benign ones from reputable sites. The dataset is divided into three parts: one third is used as the training set and two thirds as the test set.

Following the previous analysis, we extract the 27 features from the dataset, scale the extracted features, and convert them into the WEKA file format.

The results show that SVM obtains more than 90% on both accuracy and recall, and the accuracy on the training set even rises to 93.8%. SVM shows better accuracy even with fewer training samples.

In this paper, we choose the RBF kernel to get the best classification model. Two parameters are tuned: the penalty factor C and the kernel function parameter γ.

C weighs "finding the hyperplane with the largest margin" against "ensuring the minimum deviation of the data points". A large C easily causes overlearning and reduces generalization performance. A small C results in underlearning, in which all samples are classified into the dominant class. γ stands for the kernel radius and directly affects the classification performance of SVM. With too large a value, the model ends up with no generalization ability, while with too small a value, its ability to classify new samples is close to zero even if it has a high accuracy on the training set [14].

The optimization algorithm GridSearch in WEKA is used in this paper to search for the optimal parameters, with the accuracy rate as the criterion, 1 as the step of C, and a base unit as the step of γ. When C = 27 and γ = 4, the training-set accuracy reaches 96.59%, which gives the best model parameters for training.

As the comparison shows, SVM gains higher accuracy on both the training set and the test set than ADTree and NaiveBayes. NaiveBayes does not even reach 90%, while SVM has an accuracy of 94.38% on the test set. It is clear that SVM is better at handling binary classification.

These experimental results show that the proposed static detection method based on SVM performs well in both accuracy and detection efficiency.

TABLE II. THE WEKA FILE OF NORMALIZED EIGENVECTORS

Class       TP Rate   FP Rate   Precision   Recall   F-Measure
malicious   0.912     0.038     0.958       0.912    0.934
benign      0.962     0.088     0.919       0.962    0.940
average     0.937     0.064     0.938       0.937    0.937

ROC Area: 0.937
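As a quick consistency check on the reported detection results, the F-Measure values can be recomputed from the precision and recall columns. The sketch below assumes the F-Measure is the balanced harmonic mean of precision and recall; the function name `f_measure` is chosen for this example.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the two classes.
reported = {
    "malicious": (0.958, 0.912),
    "benign": (0.919, 0.962),
}
recomputed = {
    label: round(f_measure(p, r), 3) for label, (p, r) in reported.items()
}
print(recomputed)  # {'malicious': 0.934, 'benign': 0.94}
```

Both values agree with the F-Measure column to three decimal places.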

Figure 2. The flowchart of malicious JavaScript

TABLE III. PARAMETER OPTIMIZATION WITH GRIDSEARCH

Optimal parameter   Accuracy of training set   Accuracy of test set
C=27, γ=4           96.59%                     94.38%
C=30, γ=1           95.48%                     95.46%
C=8, γ=5            96.48%                     93.38%

Search range: 1~128
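The parameter search can be pictured as an exhaustive loop over candidate (C, γ) pairs, each scored by k-fold cross-validation. The Python sketch below shows only that control flow. It is not WEKA's GridSearch implementation, and `evaluate` is a caller-supplied hypothetical function that would train an SVM on the training indices and return its accuracy on the held-out fold. A stand-in scorer is used so the sketch runs without an SVM library; its optimum is arbitrary and unrelated to the paper's data.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(evaluate, c_values, gamma_values, n_samples, k=10):
    """Return (best_score, best_C, best_gamma) by mean k-fold accuracy.

    evaluate(C, gamma, train_idx, test_idx) is hypothetical: it should train
    a classifier on train_idx and return its accuracy on test_idx.
    """
    folds = k_fold_indices(n_samples, k)
    best = None
    for C, gamma in itertools.product(c_values, gamma_values):
        scores = []
        for i, test_idx in enumerate(folds):
            # All folds except fold i form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(evaluate(C, gamma, train_idx, test_idx))
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, C, gamma)
    return best


# Stand-in scorer with an arbitrary peak at C=8, gamma=2 (illustration only).
def toy_evaluate(C, gamma, train_idx, test_idx):
    return 1.0 / (1.0 + (C - 8) ** 2 + (gamma - 2) ** 2)


best_score, best_C, best_gamma = grid_search(
    toy_evaluate, c_values=range(1, 17), gamma_values=[1, 2, 3, 4], n_samples=100
)
print(best_C, best_gamma)
```

Scoring every candidate on held-out folds rather than on the training data is what guards the selected parameters against the overlearning that a large C can cause.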


TABLE IV. THE COMPARISON OF THE ACCURACY OF TRAINING SET AND TEST SET OF SVM, ADTREE AND NAIVEBAYES ALGORITHM

Learning algorithm   Accuracy of training set   Accuracy of test set
ADTree               94.94%                     91.68%
NaiveBayes           86.36%                     84.31%
SVM                  96.59%                     94.38%

B. System implementation

This section introduces the implementation of a prototype system for the detection of malicious JavaScript. The system can directly examine a JavaScript script, or process a URL to examine the JavaScript that the page contains.

The feature extraction and SVM detection modules are developed in C, while the script extraction module is developed in PHP, as shown in Figure 3.

1) Script extraction module:

This module serves as the user interface and provides the script detection service. Users can either upload a JavaScript script or submit a URL. The module analyzes the page, extracts the JavaScript, and then passes it to the feature extraction module for further analysis.

2) Feature extraction module:

First, this module accepts the JavaScript from the previous module and cleans the data, removing extra blank lines, comments and so on. Then it extracts the 27 features mentioned previously. Finally, the data is scaled to [0,1] to improve computational efficiency and converted into the standard form expected by the detection module.

3) SVM detection module:

The SVM model used here is trained with the optimal parameters. It accepts the standard data from the feature extraction module. After detection by SVM, the results are delivered to the script extraction module for display.

Figure 3. The implementation of prototype system

V. CONCLUSIONS

This paper proposed an SVM-based malicious JavaScript detection method which, based on a thorough analysis of the scripting language, extracts the static information of the script and improves the detection efficiency and the safety of the system, without parsing or compiling the script. SVM has a good reputation in practical applications of machine learning and helps detect unknown attacks. The experimental results show that this method has a high accuracy and a low false alarm rate, and can detect unknown attacks.

ACKNOWLEDGMENT

This research was supported by grants R1090569 and LY12F02039 from the Natural Science Foundation of Zhejiang Province.

REFERENCES

[1] Hao Zhang, Ran Tao, Zhiyong Li, Hua Du. The Detection Methods of Malicious Script. Ordnance Academic Journal, 2008.

[2] Ming Zhu, Qian Xu, Chunming Liu. The Analysis And Detection Of Trojan. Computer Engineering and Applications, 2003.

[3] Oystein Hallaraker, Giovanni Vigna. Detecting Malicious JavaScript Code in , 2005.

[4] Min Dai, Ya-Lou Huang, Wei Wang. Trojan detection Model Based On Static File Information. Computer Engineering, 2006, 3 (6): 176-179.

[5] D. Flanagan. JavaScript: The Definitive Guide, 4th er 2001.

[6] Yinhe Zhang, Wenxin Liang, Xinlei Li. Self-study manual of JavaScript. Tsinghua University Press, 2008-10.

[7] Bin Liang, Jianjun Huang, Fang Liu, Dawei Wang, Daxiang Dong, Zhaohui Liang. Malicious Web Pages Detection Based on Abnormal Visibility Recognition. IEEE, 2009.

[8] V. Anupam and A. J. Mayer. "Secure Web Scripting". IEEE Internet Computing, 1998, 2 (6): 46-55.

[9] Likarish P., Jung E., Jo I. Obfuscated malicious JavaScript detection using classification techniques. IEEE: 47-54.

[10] Byung-Ik Kim, Chae-Tae Im, Hyun-Chul Jung. Suspicious Malicious Web Site Detection with Strength Analysis of a JavaScript ational Journal of Advanced Science, 2011.

[11] Xiaokang Zhang. Malicious code detection technology based on data mining and machine learning research [D]. Master's degree thesis of USTC, 2010.

[12] Vapnik V. N. The nature of statistical learning theory [M]. Springer Verlag, 2000.

[13] VX Heavens. Http:/// [EB / OL]. 2006-09-28.

[14] Xiaofei Yan, Hongwei Ge, Sheng Yan. RBF kernel SVM and Its Application. Computer Engineering and Design, 2006.

