
import os
import tarfile
import tensorflow as tf
import numpy as np
import re
import string
from random import randint

Extracting the data file

# Extract the dataset (filepath is assumed to point to the downloaded aclImdb_v1.tar.gz)
if not os.path.exists(r"C:\Users\st\Desktop\123\aclImdb"):
    tfile = tarfile.open(filepath,"r:gz")
    print('extracting...')
    tfile.extractall(r"C:\Users\st\Desktop\123")
    print('extraction completed')
else:
    print("aclImdb already exists!")
aclImdb already exists!

Reading the data

#Remove unwanted markup from the text, such as HTML tags like <br />
def remove_tags(text):
    re_tag = re.compile(r'<[^>]*>')
    return re_tag.sub('',text)
#Read the review files
def read_files(filetype):
    path = r"C:\Users\st\Desktop\123\aclImdb/"
    file_list=[]
    #Collect the paths of the positive-review files into the file_list list
    positive_path = path+filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list +=[positive_path+f]
    pos_files_num = len(file_list)
    
    #Collect the paths of the negative-review files into the file_list list
    negative_path = path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list += [negative_path+f]
    neg_files_num = len(file_list)-pos_files_num
    
    print('read',filetype,'files:',len(file_list))
    print(pos_files_num,'pos files in ',filetype,'files')
    print(neg_files_num,'neg files in ',filetype,'files')
    #Build one-hot encoded labels: a positive review is [1,0], a negative review is [0,1]
    all_labels =([[1,0]]*pos_files_num+[[0,1]]*neg_files_num)
    #Read all the review texts
    all_texts=[]
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            #The texts contain HTML tags such as <br />; pass each one through
            #remove_tags, which strips them with a regular expression
            all_texts +=[remove_tags(" ".join(file_input.readlines()))]
    return all_labels,all_texts
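
A quick sanity check of the remove_tags helper on a made-up string:

print(remove_tags("A great movie!<br /><br />Loved every minute."))
# A great movie!Loved every minute.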
        

Loading the dataset

#Get the labels and texts for training and testing
train_labels,train_texts=read_files("train")
test_labels,test_texts=read_files("test")
read train files: 25000
12500 pos files in  train files
12500 neg files in  train files
read test files: 25000
12500 pos files in  test files
12500 neg files in  test files

Inspecting the dataset

#Inspect the data and labels
print("Training data")
print("Positive review:")
print(train_texts[0])
print(train_labels[0])
print("Negative review:")
print(train_texts[12500])
print(train_labels[12500])
print("Test data")
print("Positive review:")
print(test_texts[0])
print(test_labels[0])
print("Negative review:")
print(test_texts[12500])
print(test_labels[12500])
Training data
Positive review:
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
[1, 0]
Negative review:
Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
[0, 1]
Test data
Positive review:
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
[1, 0]
Negative review:
Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.
[0, 1]

Data preprocessing

Building the vocabulary with a Tokenizer

from tensorflow import keras
token = keras.preprocessing.text.Tokenizer(num_words=4000)
token.fit_on_texts(train_texts)
token.document_count
25000

Exploring the vocabulary dictionary

Map each word (string) to its rank/index in the vocabulary:
#print(token.word_index)
Map each word (string) to the number of training documents it appears in:
#token.word_docs
View the raw frequency count of every word seen during fitting:
#print(token.word_counts)
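
For instance, the most frequent words in the corpus can be listed by sorting word_counts (a small sketch using the Tokenizer fitted above):

#Show the ten most frequent words seen during fitting
top10 = sorted(token.word_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)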

Converting text to lists of numbers

train_sequences = token.texts_to_sequences(train_texts)
test_sequences = token.texts_to_sequences(test_texts)
print(train_texts[0])
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
print(train_sequences[0])
[309, 6, 3, 1069, 209, 9, 2161, 30, 1, 169, 55, 14, 46, 82, 41, 392, 110, 138, 14, 58, 150, 8, 1, 482, 69, 5, 261, 12, 2002, 6, 73, 2425, 5, 632, 71, 6, 1, 5, 2003, 1, 1534, 34, 67, 64, 205, 140, 65, 1230, 1, 4, 1, 223, 901, 29, 3022, 69, 4, 1, 10, 693, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1467, 3712, 800, 5, 3513, 177, 1, 392, 10, 1237, 30, 309, 3, 353, 344, 2974, 143, 130, 5, 28, 4, 126, 1467, 2373, 5, 309, 10, 532, 12, 108, 1468, 4, 58, 555, 101, 12, 309, 6, 227, 48, 3, 2232, 12, 9, 215]
x_train = keras.preprocessing.sequence.pad_sequences(train_sequences,padding='post',truncating='post',maxlen=400)
x_test = keras.preprocessing.sequence.pad_sequences(test_sequences,padding='post',truncating='post',maxlen=400)
x_train.shape
(25000, 400)
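
What padding='post' and truncating='post' do can be seen on a toy example (the sequences below are made up):

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(keras.preprocessing.sequence.pad_sequences(toy, padding='post', truncating='post', maxlen=5))
# [[1 2 3 0 0]
#  [4 5 6 7 8]]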

The padded sequences

print(x_train[0])
[ 309    6    3 1069  209    9 2161   30    1  169   55   14   46   82   41
  392  110  138   14   58  150    8    1  482   69    5  261   12 2002    6
   73 2425    5  632   71    6    1    5 2003    1 1534   34   67   64  205
  140   65 1230    1    4    1  223  901   29 3022   69    4    1   10  693
    2   65 1534   51   10  216    1  387    8   60    3 1467 3712  800    5
 3513  177    1  392   10 1237   30  309    3  353  344 2974  143  130    5
   28    4  126 1467 2373    5  309   10  532   12  108 1468    4   58  555
  101   12  309    6  227   48    3 2232   12    9  215    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0]

Building the model

Defining the model

model = keras.models.Sequential()
model.add(keras.layers.Embedding(output_dim=32,input_dim=4000,input_length=400))
model.add(keras.layers.Flatten())
#GlobalAveragePooling1D would also flatten the embedding output
#model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(units=256,activation='relu'))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(units=2,activation='softmax'))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3277056   
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
=================================================================
Total params: 3,405,570
Trainable params: 3,405,570
Non-trainable params: 0
_________________________________________________________________
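
The parameter counts can be verified by hand: the Embedding layer holds 4000 × 32 = 128,000 weights, the first Dense layer 12,800 × 256 + 256 = 3,277,056, and the output layer 256 × 2 + 2 = 514. Had GlobalAveragePooling1D been used instead of Flatten, the first Dense layer would see a 32-dimensional input and need only 32 × 256 + 256 = 8,448 weights:

#Checking the summary's parameter counts
print(4000*32)            # embedding: 128000
print(400*32*256 + 256)   # dense_1 after Flatten: 3277056
print(256*2 + 2)          # dense_2: 514
print(32*256 + 256)       # dense_1 if GlobalAveragePooling1D were used: 8448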

Model compilation and training

y_train = np.array(train_labels)
y_test = np.array(test_labels)
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(x_train,y_train,validation_split=0.2,epochs=10,batch_size=128,verbose=1)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 23s 1ms/step - loss: 6.0026 - acc: 0.6246 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 2/10
20000/20000 [==============================] - 21s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 3/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 4/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 5/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 6/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 7/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 8/10
20000/20000 [==============================] - 24s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 9/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 10/10
20000/20000 [==============================] - 20s 981us/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
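
The validation accuracy pinned at 0 is a red flag: validation_split takes the last 20% of the arrays before any shuffling, and since read_files returns all positive reviews first and all negative ones last, the validation set here is entirely negative. This also explains the chance-level 0.5 test accuracy reported further below. A minimal fix is to shuffle the training arrays before calling fit (a sketch reusing the variables defined above):

#Shuffle so the tail slice used for validation mixes both classes
indices = np.random.permutation(len(x_train))
history = model.fit(x_train[indices], y_train[indices],
                    validation_split=0.2, epochs=10,
                    batch_size=128, verbose=1)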

Visualizing the training process

import matplotlib.pyplot as plt

acc = history.history["acc"]
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1,len(acc)+ 1)
plt.plot(epochs,loss,'r',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear figure
plt.plot(epochs,acc,'r',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Evaluating model accuracy

y_test = np.array(test_labels)
test_loss,test_acc = model.evaluate(x_test,y_test,verbose = 1)

print("Test accuracy:",test_acc)
25000/25000 [==============================] - 8s 309us/step
Test accuracy: 0.5
predictions = model.predict(x_test)
predictions[0]
array([ 1.,  0.], dtype=float32)
sentiment_dict = {0:"pos",1:"neg"}

def display_test_sentiment(i):
    print(test_texts[i])
    print("label value:",sentiment_dict[np.argmax(y_test[i])],
         "predict value",sentiment_dict[np.argmax(predictions[i])])
display_test_sentiment(0) 
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
label value: pos predict value pos
review_text = "The Empire Strikes Back is the best film in the original Star Wars"
input_seq = token.texts_to_sequences([review_text])
pad_input_seq = keras.preprocessing.sequence.pad_sequences(input_seq,padding="post",truncating="post",maxlen = 400)
pred = model.predict(pad_input_seq)
print("predict value:",sentiment_dict[np.argmax(pred)])

predict value: pos
sentiment_dict = {0:"pos",1:"neg"}
def displays_text_sentiment(text):
    print(text)
    input_seq = token.texts_to_sequences([text])
    pad_input_seq = keras.preprocessing.sequence.pad_sequences(input_seq,padding="post",truncating="post",maxlen=400)
    pred = model.predict(pad_input_seq)
    print("predict value:",sentiment_dict[np.argmax(pred)])
displays_text_sentiment(review_text)
The Empire Strikes Back is the best film in the original Star Wars
predict value: pos

Building an LSTM-based model

# Define the model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(output_dim=32,input_dim=4000,input_length=400))

#With an RNN there is no need to flatten the embedding output
# model.add(keras.layers.Flatten())
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units = 8)))

model.add(keras.layers.Dense(units=32,activation="relu"))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(units=2,activation="softmax"))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 16)                2624      
_________________________________________________________________
dense_3 (Dense)              (None, 32)                544       
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 66        
=================================================================
Total params: 131,234
Trainable params: 131,234
Non-trainable params: 0
_________________________________________________________________
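
These counts also check out: an LSTM has four gates, each with input, recurrent, and bias weights, so one direction needs 4 × (32 × 8 + 8 × 8 + 8) = 1,312 parameters, and the bidirectional wrapper doubles that to 2,624; its concatenated output is 8 + 8 = 16 wide, giving 16 × 32 + 32 = 544 weights for the next Dense layer:

print(2 * 4 * (32*8 + 8*8 + 8))  # bidirectional LSTM: 2624
print(16*32 + 32)                # dense_3: 544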
#For a multi-class model with one-hot encoded labels, use categorical_crossentropy
#For a multi-class model with integer (non-one-hot) labels, use sparse_categorical_crossentropy
#For binary labels, use binary_crossentropy
model.compile(optimizer='adam',
              loss='categorical_crossentropy',metrics = ["accuracy"])
history = model.fit(x_train,np.array(train_labels),
                    validation_split=0.2,
                    epochs=6,
                    batch_size=128,verbose=1)
Train on 20000 samples, validate on 5000 samples
Epoch 1/6
20000/20000 [==============================] - 87s 4ms/step - loss: 0.6620 - acc: 0.6239 - val_loss: 0.9128 - val_acc: 0.0000e+00
Epoch 2/6
20000/20000 [==============================] - 88s 4ms/step - loss: 0.6478 - acc: 0.6458 - val_loss: 0.9550 - val_acc: 0.0000e+00
Epoch 3/6
20000/20000 [==============================] - 88s 4ms/step - loss: 0.6518 - acc: 0.6250 - val_loss: 0.9173 - val_acc: 6.0000e-04
Epoch 4/6
20000/20000 [==============================] - 94s 5ms/step - loss: 0.5934 - acc: 0.6811 - val_loss: 1.1345 - val_acc: 0.0042
Epoch 5/6
20000/20000 [==============================] - 93s 5ms/step - loss: 0.5605 - acc: 0.7072 - val_loss: 0.8810 - val_acc: 0.5942
Epoch 6/6
20000/20000 [==============================] - 90s 4ms/step - loss: 0.4216 - acc: 0.8247 - val_loss: 0.3850 - val_acc: 0.8742
import matplotlib.pyplot as plt

acc = history.history["acc"]
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1,len(acc)+ 1)
plt.plot(epochs,loss,'r',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear figure
plt.plot(epochs,acc,'r',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
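
The retrained LSTM model can be scored on the held-out test set with the same evaluation call used earlier (a sketch; outputs omitted):

test_loss,test_acc = model.evaluate(x_test,y_test,verbose=1)
print("Test accuracy:",test_acc)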

