
A Static Malicious Javascript Detection Using SVM

WANG Wei-Hong, LV Yin-Jun, CHEN Hui-Bing, FANG Zhao-Lin

Zhejiang University of Technology

Hangzhou, China

Abstract-Malicious scripts, such as malicious JavaScript, are one of the primary threats to network security. JavaScript is not only a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications, but is also used to carry out attacks that steal users' credentials and lure users into providing sensitive information to unauthorized parties. We propose a static malicious JavaScript detection technique based on SVM (Support Vector Machine). Our approach combines static detection with machine learning: it analyzes and extracts malicious script features and uses the SVM to classify scripts. The technique is characterized by a high detection rate, a low false positive rate, and the ability to detect unknown attacks. In experiments on the prepared data set, it reached an accuracy of 94.38% on the test set.


Keywords-SVM; static detection; malicious script detection

Published by Atlantis Press, Paris, France. © the authors

I. INTRODUCTION

With the rapid development of network information technology, information security issues are gaining more and more attention. Malicious scripts are one of the primary security threats to computer networks. By constructing a special web page that contains Trojans, viruses, worms, or other aggressive programs, an attacker lets malicious scripts propagate to the user's computer when the user accesses the page.

Based on whether the script is executed, current detection methods for malicious scripts can be divided into static analysis and dynamic analysis.

Without executing the script, static analysis uses the static characteristics and the structure of the script to identify malicious scripts. For example, [1] counts malicious signatures, weights the different statistical methods with the judgment matrix method, and finally uses the weighted geometric mean to obtain the result. This method not only requires some obvious features, but is also weak at finding unknown attacks.

Dynamic analysis runs scripts in a controlled environment and detects malicious ones by observing their execution states and processes. In [2][3], system ports, network connections, the registry, and system configuration files are monitored to detect abnormal procedures. This method has to run malicious code, which increases the risk to the system, and its efficiency is also a problem.

A malicious script is special code hidden in a scripting language file, such as a js file. Thanks to its standardized script format and grammar, we can obtain enough static characteristic information from the file to distinguish malicious scripts from benign ones [4]. This article uses machine learning techniques to analyze the features of scripts and proposes a static detection method based on SVM.

II. MALICIOUS SCRIPT FEATURE EXTRACTION

JavaScript [5] is a lightweight, object-based, and event-driven scripting language. Embedded in HTML, JavaScript can be used to develop interactive Web pages that give users real-time, dynamic interaction [6]. However, JavaScript is also an attractive choice for attackers to implement their assaults and distribute them over the Internet, for example through cross-site scripting attacks, SQL injection attacks, and passive download attacks.

According to a survey of 90 sites in the China Education and Research Network (CERNET) in 2008, nearly one-third of the sites had been attacked, and 39% of the attacks were caused by malicious JavaScript [6]. Its characteristics make JavaScript an easy carrier for malicious programs. JavaScript has two such characteristics: first, JavaScript, a description language stored as a file, can be executed directly by the browser; second, without protection, JavaScript written in the HTML can be seen and copied by anyone directly.

These characteristics have made JavaScript one of attackers' favorite tools. To address this problem, a sandboxing mechanism is provided to prevent malicious JavaScript from compromising the security of the client's environment [8]; it allows the code to perform only a restricted set of operations. However, the sandboxing mechanism not only brings efficiency problems, but also constrains the execution of JavaScript on the client. In this paper, we turn to machine learning classification techniques to solve this problem.

To achieve this goal, features are first analyzed and extracted. Following [9], we extract 17 malicious JavaScript features, and 10 more features are added based on our analysis of the data. Some of the 27 features are explained as follows.

In most benign scripts, the number of calls to certain special functions is limited, while malicious scripts contain a relatively large number of them, such as the eval function, the escape function, and DOM-modifying functions. Exploits usually call several DOM functions in order to instantiate vulnerable components and/or create elements in the page for the purpose of loading external scripts and exploit [...]. The escape function could be called to code malicious [...]. Abnormal use of special keywords, tags, and strings is also considered.
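As a rough illustration of the function-count features discussed here, the sketch below counts a few call names in a script with regular expressions. The helper name `count_call_features` and the concrete patterns (including the choice of DOM-modifying functions) are assumptions made for this example; the paper does not describe how its extractor tokenizes scripts.

```python
import re

# Hypothetical pattern set: the feature names follow the text, but the
# regular expressions and the list of DOM functions are assumptions.
CALL_PATTERNS = {
    "eval": re.compile(r"\beval\s*\("),
    "escape_unescape": re.compile(r"\b(?:un)?escape\s*\("),
    "set_timeout": re.compile(r"\bsetTimeout\s*\("),
    "dom_modification": re.compile(
        r"\b(?:document\.write(?:ln)?|createElement|appendChild|insertBefore)\s*\("
    ),
}


def count_call_features(script: str) -> dict:
    """Return the number of matches of each call pattern in the script text."""
    return {name: len(pattern.findall(script))
            for name, pattern in CALL_PATTERNS.items()}


sample = (
    'eval(unescape("%61")); '
    'document.write("<p>x</p>"); '
    'var e = document.createElement("iframe");'
)
features = count_call_features(sample)
print(features)
```

Plain pattern matching of this kind also counts names that appear inside comments or string literals, so a real extractor would likely need to clean the script first.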

Unfortunately, obfuscation techniques, which were intended to protect source code, are used by attackers to circumvent this feature extraction. To reduce the impact of obfuscation, we also perform a degree of strength analysis [10]: features such as the script's whitespace percentage, the maximum entropy of its strings, and the entropy of the script as a whole are measured. Table I shows the result:
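The obfuscation-related measurements named in this paragraph can be sketched as follows. This is a minimal illustration, assuming character-level Shannon entropy and a simplified string-literal matcher; the function names are hypothetical and the paper does not state its exact definitions.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def whitespace_percentage(script: str) -> float:
    """Share of whitespace characters in the script, as a percentage."""
    if not script:
        return 0.0
    return 100.0 * sum(ch.isspace() for ch in script) / len(script)


# Simplified matcher for quoted string literals (assumption): it ignores
# template literals, regex literals, and quotes inside comments.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"|\'((?:[^\'\\]|\\.)*)\'')


def max_string_entropy(script: str) -> float:
    """Largest entropy among the string literals found in the script."""
    literals = [double or single
                for double, single in STRING_LITERAL.findall(script)]
    return max((shannon_entropy(s) for s in literals), default=0.0)
```

For example, `shannon_entropy("abcd")` is 2 bits per character, while a string of one repeated character has zero entropy; heavily encoded payloads tend to sit at the high end of this scale.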

TABLE I. FEATURES OF DATASET

No.  Feature
1    the number of eval() functions
2    the number of setTimeout() functions
3    the ratio between keywords and words
4    the number of built-in functions used for deobfuscation
5    the entropy of the strings declared in the script
6    the entropy of the script as a whole
7    the number of long strings (>40)
8    the maximum entropy of all the script's strings
9    the probability of the script to contain shellcode
10   the maximum length of the script's strings
11   the number of string direct assignments
12   the number of string modification functions
13   the number of event attachments
14   the number of suspicious strings
15   the number of DOM modification functions
16   the script's whitespace percentage
17   the average length of the strings used in the script
18   the average script line length
19   the number of strings containing "iframe"
20   the number of suspicious tag strings
21   the length of the script in characters
22   the number of unescape and escape
23   the number of classid
24   the number of parseInt and fromCharCode
25   the ratio between n and line
26   the number of chars in hex
27   the number of CreateObject, ActiveXObject

III. MALICIOUS SCRIPT DETECTION BASED ON SVM

The machine learning technique SVM can summarize the knowledge needed to identify known malicious JavaScript and carry out a similarity search to find unknown malicious JavaScript, with a high detection rate and a low false alarm rate [11].

A. SVM

SVM (Support Vector Machine), which originated in the statistical learning theory of Vapnik et al. in 1995, focuses on pattern classification problems [12]. It is a statistical learning algorithm that classifies samples using a subset of the training samples called support vectors. In simple terms, SVM creates a feature space from the attributes in the training dataset and searches for a decision boundary, or optimal hyperplane, that separates the feature space with the maximum margin, as shown in Fig. 1.

There are two types of SVM: the linear SVM, which separates the data points with a linear boundary, and the non-linear SVM, which separates the data points with a nonlinear boundary.

In the case of linearly separable problems, it is easy to find a plane in the feature space that separates the two types of samples. The optimal plane is the one with the maximum geometric margin, as the following formula shows:

$$\min_{w,b}\ \frac{1}{2}\|w\|^{2} \quad \text{s.t.}\quad y_i\,(w^{T}x_i+b) \ge 1,\quad i=1,\dots,n$$

This is a convex quadratic programming problem. To solve it, the Lagrange function is first introduced to turn it into its dual problem. Slack variables and a penalty term are introduced to deal with linearly inseparable problems caused by noise. The objective function then becomes:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\,(w^{T}x_i+b) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n$$

Linear SVM performs well on datasets that can be easily separated by a hyperplane into two parts. But sometimes datasets are complex and difficult to classify with a linear kernel. Non-linear SVM classifiers can be used for such complex datasets.

In the non-linear case, the data is mapped into a high-dimensional space, where an optimal separating hyperplane can be found. With an appropriate mapping function, most non-linear problems can be transformed into linear problems in the high-dimensional space. However, the high-dimensional mapping also brings the curse of dimensionality, and computing the separating hyperplane directly in the feature space becomes prohibitively expensive. The inner product in the feature space can instead be computed with a kernel function that satisfies Mercer's condition, which is the trick that solves this problem:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j\,k(x_i,x_j) \quad \text{s.t.}\quad \alpha_i \ge 0,\ i=1,\dots,n,\quad \sum_{i=1}^{n}\alpha_i y_i = 0$$

Common kernel functions include the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. The Gaussian kernel is a universal kernel function: with appropriately selected parameters, it can achieve a high accuracy. The Gaussian kernel is:

$$k(x_i,x_j) = \exp\left(-\gamma\,\|x_i-x_j\|^{2}\right),\quad \gamma > 0$$
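To make the Gaussian kernel concrete, the sketch below evaluates it for two feature vectors and uses it in the standard kernel SVM decision value f(x) = Σ α_i y_i k(x_i, x) + b. The decision function is the textbook form rather than something stated in the paper, and the function names and toy numbers are assumptions for illustration.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x: Sequence[float],
                   support_vectors: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   alphas: Sequence[float],
                   b: float,
                   gamma: float) -> float:
    """Textbook kernel SVM decision value; the sign gives the predicted class."""
    return sum(alpha * y * rbf_kernel(sv, x, gamma)
               for sv, y, alpha in zip(support_vectors, labels, alphas)) + b


# Toy one-dimensional model (made-up values): one support vector per class.
toy_svs = [[0.0], [1.0]]
toy_labels = [-1, 1]
toy_alphas = [1.0, 1.0]
print(decision_value([0.9], toy_svs, toy_labels, toy_alphas, b=0.0, gamma=1.0))
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, faster for larger γ.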

Figure 1. Optimal hyperplane

B. The malicious script analysis framework based on SVM

As mentioned before, script analysis can be divided into static analysis and dynamic analysis. Here, we propose an SVM-based static analysis method, combined with machine learning classification techniques, to distinguish malicious scripts from benign scripts. Its training and testing flowcharts are shown in Fig. 2.

a) Dataset preparation: collect enough malicious JavaScript and benign JavaScript from websites.

b) Data cleaning: clean the sample data, for example by removing comments and excess carriage returns and line feeds, which increases the processing speed and accuracy.

c) Feature extraction: extract the 27 features based on the analysis above.

d) Pre-treatment: normalize the data by scaling it to [0,1]. This reduces the training error when feature values are too large or too small, and it also improves efficiency.

e) Parameter tuning: WEKA is the platform used to train models. With the GridSearch algorithm for the binary classification SVM and ten-fold cross-validation, it selects the best SVM model parameters.

f) Model training: train the best SVM model with the optimal parameters.

g) Data prediction: use the best model to predict the classification of the test set.

Figure 2. The flowchart of malicious JavaScript

IV. EXPERIMENTAL ANALYSIS AND IMPLEMENTATION

A. Experimental Analysis

The experimental data consists of 1000 malicious JavaScript samples collected from VX Heavens [13] and 1000 benign ones from reputable sites. The dataset is divided into three parts: one third is used as the training set and two thirds as the test set.

Following the previous analysis, we extract the 27 features from the dataset, scale the extracted features, and convert them into the WEKA file format.

Table II shows that SVM obtains more than 90% on both accuracy and recall, and the accuracy on the training set even reaches 93.8%. SVM achieves good accuracy even with fewer training samples.

In this paper, we choose the RBF kernel to get the best classification model. Two parameters are adjusted: the penalty factor C and the kernel function parameter γ.

C weighs "finding the largest-margin hyperplane" against "keeping the deviation of the data points minimal". A large C easily causes overfitting and reduces generalization performance. A small C results in underfitting, in which all samples are classified into the majority class. γ controls the radius of the kernel and directly affects the classification performance of the SVM. With too large a value, the model can reach high accuracy on the training set while its ability to classify new samples is close to zero; with too small a value, the model underfits and generalizes poorly as well [14].

The optimization algorithm GridSearch in WEKA is used in this paper to search for the optimal parameters. With accuracy as the criterion, 1 as the step of C, and the step of γ expressed as a power of a base unit, we obtain the experimental results in Table III. When C = 27 and γ = 4, the training set accuracy reaches 96.59%, which gives the best optimized model parameters for training.

As shown in Table IV, SVM achieves higher accuracy on both the training set and the test set than ADTree and NaiveBayes. NaiveBayes does not even reach 90%, while SVM has an accuracy of 94.38% on the test set. It is clear that SVM is better at handling binary classification.

These experimental results show that the proposed SVM-based static detection method is excellent in both accuracy and detection efficiency.

TABLE II. THE WEKA FILE OF NORMALIZED EIGENVECTORS

Class      TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
malicious  0.912    0.038    0.958      0.912   0.934      0.937
benign     0.962    0.088    0.919      0.962   0.940      0.937
average    0.937    0.064    0.938      0.937   0.937      0.937

TABLE III. PARAMETER OPTIMIZATION WITH GRIDSEARCH

C range  γ range       Optimal parameters  Accuracy of training set  Accuracy of test set
1~128    2^-10 ~ 2^6   C=27, γ=4           96.59%                    94.38%
1~128    3^-10 ~ 3^6   C=30, γ=1           95.48%                    95.46%
1~128    5^-10 ~ 5^6   C=8, γ=5            96.48%                    93.38%
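The scaling step d) and the parameter search of step e) can be sketched in a few lines. This is not WEKA's GridSearch: it is a plain exhaustive search, and the `evaluate` callback is a hypothetical stand-in for training an RBF-kernel SVM and returning its cross-validated accuracy. The grid assumes that the first γ range in Table III reads 2^-10 to 2^6 and that C runs from 1 to 128 in steps of 1.

```python
from itertools import product
from typing import Callable, Iterable, List, Sequence, Tuple


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def grid_search(c_values: Iterable[float],
                gamma_values: Iterable[float],
                evaluate: Callable[[float, float], float]
                ) -> Tuple[float, float, float]:
    """Return (C, gamma, score) for the pair with the highest score."""
    best = None
    for c, gamma in product(c_values, list(gamma_values)):
        score = evaluate(c, gamma)
        if best is None or score > best[2]:
            best = (c, gamma, score)
    if best is None:
        raise ValueError("the parameter grid is empty")
    return best


# Grid under the assumptions stated above.
c_grid = range(1, 129)
gamma_grid = [2.0 ** k for k in range(-10, 7)]

scaled = min_max_scale([[1, 10], [3, 10], [2, 30]])
print(scaled)
```

Scaling must be computed on the training set and the same minimum and maximum reused for the test set; otherwise information from the test data leaks into the model.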

TABLE IV. THE COMPARISON OF THE ACCURACY OF TRAINING SET AND TEST SET OF SVM, ADTREE, AND NAIVEBAYES ALGORITHM

Learning algorithm  Accuracy of training set  Accuracy of test set
ADTree              94.94%                    91.68%
NaiveBayes          86.36%                    84.31%
SVM                 96.59%                    94.38%

B. System implementation

The implementation of a prototype system for the detection of malicious JavaScript is introduced in this section. The system can directly examine a JavaScript script, or take a URL and examine the JavaScript that the page contains.

The feature extraction and SVM detection modules are developed in C, while script extraction is implemented in PHP, as shown in Figure 3.

1) Script extraction module:

This module serves as the user interface and provides the script detection service. Users can either upload a JavaScript script or submit a URL. The module analyzes the page, extracts the JavaScript, and passes it to the feature extraction module for further analysis.

2) Feature extraction module:

First, this module accepts the JavaScript from the previous module and cleans the data by removing extra blank lines, comments, and so on. Then it extracts the 27 features previously mentioned. Finally, the data is scaled to [0,1] to improve computational efficiency and converted into the standard input format of the detection module.

3) SVM detection module:

The SVM model used here is trained with the optimal parameters. It accepts standard data from the feature extraction module. The detection results are then delivered to the script extraction module for display.

Figure 3. The implementation of prototype system

V. CONCLUSIONS

This paper proposed an SVM-based malicious JavaScript detection method which, based on a thorough analysis of the scripting language, extracts the static information of the script and improves the detection efficiency and the safety of the system, without parsing and compiling the script. SVM has a good reputation in practical applications of machine learning and helps detect unknown attacks. The experimental results show that this method has high accuracy and a low false alarm rate, and can detect unknown attacks.

ACKNOWLEDGMENT

This research was supported by grants R1090569 and LY12F02039 from the Natural Science Foundation of Zhejiang Province.

REFERENCES

[1] Hao Zhang, Ran Tao, Zhiyong Li, Hua Du. The Detection Methods of Malicious Script. Ordnance Academic Journal, 2008.

[2] Ming Zhu, Qian Xu, Chunming Liu. The Analysis and Detection of Trojan. Computer Engineering and Applications, 2003.

[3] Oystein Hallaraker, Giovanni Vigna. Detecting Malicious JavaScript Code in [...], 2005.

[4] Min Dai, Ya-Lou Huang, Wei Wang. Trojan Detection Model Based on Static File Information. Computer Engineering, 2006, 3(6): 176-179.

[5] D. Flanagan. JavaScript: The Definitive Guide, 4th [...]er 2001.

[6] Yinhe Zhang, Wenxin Liang, Xinlei Li. Self-study Manual of JavaScript. Tsinghua University Press, 2008-10.

[7] Bin Liang, Jianjun Huang, Fang Liu, Dawei Wang, Daxiang Dong, Zhaohui Liang. Malicious Web Pages Detection Based on Abnormal Visibility Recognition. IEEE, 2009.

[8] V. Anupam and A. J. Mayer. "Secure Web Scripting". IEEE Internet Computing, 1998, 2(6): 46-55.

[9] Likarish P., Jung E., Jo I. Obfuscated malicious JavaScript detection using classification techniques. IEEE: 47-54.

[10] Byung-Ik Kim, Chae-Tae Im, Hyun-Chul Jung. Suspicious Malicious Web Site Detection with Strength Analysis of a JavaScript [...] International Journal of Advanced Science, 2011.

[11] Xiaokang Zhang. Malicious code detection technology based on data mining and machine learning research [D]. Master's degree thesis, USTC, 2010.

[12] Vapnik V. N. The Nature of Statistical Learning Theory [M]. Springer Verlag, 2000.

[13] VX Heavens. Http:/// [EB/OL]. 2006-09-28.

[14] Xiaofei Yan, Hongwei Ge, Sheng Yan. RBF Kernel SVM and Its Application. Computer Engineering and Design, 2006.
