
import os
import tarfile
import tensorflow as tf
import numpy as np
import re
import string
from random import randint

Extracting the data file

# Extract the dataset (filepath is assumed to point to the downloaded aclImdb_v1.tar.gz)
if not os.path.exists(r"C:\Users\st\Desktop\123\aclImdb"):
    tfile = tarfile.open(filepath,"r:gz")
    print('extracting...')
    tfile.extractall(r"C:\Users\st\Desktop\123")
    print('extraction completed')
else:
    print("aclImdb already exists!")
aclImdb already exists!

Reading the data

#Remove unwanted markup from the text, such as HTML tags like <br />
def remove_tags(text):
    re_tag = re.compile(r'<[^>]*>')
    return re_tag.sub('',text)
#Read the review files
def read_files(filetype):
    path = r"C:\Users\st\Desktop\123\aclImdb/"
    file_list=[]
    #Collect the paths of the positive-review files into the file_list list
    positive_path = path+filetype+"/pos/"
    for f in os.listdir(positive_path):
        file_list +=[positive_path+f]
    pos_files_num = len(file_list)
    
    #Collect the paths of the negative-review files into the file_list list
    negative_path = path + filetype+"/neg/"
    for f in os.listdir(negative_path):
        file_list += [negative_path+f]
    neg_files_num = len(file_list)-pos_files_num
    
    print('read',filetype,'files:',len(file_list))
    print(pos_files_num,'pos files in ',filetype,'files')
    print(neg_files_num,'neg files in ',filetype,'files')
    #Build one-hot encoded labels: a positive review is [1,0], a negative review is [0,1]
    all_labels =([[1,0]]*pos_files_num+[[0,1]]*neg_files_num)
    #Read all the review texts
    all_texts=[]
    for fi in file_list:
        with open(fi,encoding='utf8') as file_input:
            #The texts contain HTML tags such as <br />; pass each one through
            #remove_tags, which strips them with a regular expression
            all_texts +=[remove_tags(" ".join(file_input.readlines()))]
    return all_labels,all_texts
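
A quick sanity check of the remove_tags helper on a made-up string:

print(remove_tags("A great movie!<br /><br />Loved every minute."))
# A great movie!Loved every minute.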
        

Loading the dataset

#Get the labels and texts for training and testing
train_labels,train_texts=read_files("train")
test_labels,test_texts=read_files("test")
read train files: 25000
12500 pos files in  train files
12500 neg files in  train files
read test files: 25000
12500 pos files in  test files
12500 neg files in  test files

Inspecting the dataset

#Inspect the data and labels
print("Training data")
print("Positive review:")
print(train_texts[0])
print(train_labels[0])
print("Negative review:")
print(train_texts[12500])
print(train_labels[12500])
print("Test data")
print("Positive review:")
print(test_texts[0])
print(test_labels[0])
print("Negative review:")
print(test_texts[12500])
print(test_labels[12500])
Training data
Positive review:
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
[1, 0]
Negative review:
Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly.
[0, 1]
Test data
Positive review:
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
[1, 0]
Negative review:
Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.
[0, 1]

Data preprocessing

Building the vocabulary with a Tokenizer

from tensorflow import keras
token = keras.preprocessing.text.Tokenizer(num_words=4000)
token.fit_on_texts(train_texts)
token.document_count
25000

Exploring the vocabulary dictionary

Map each word (string) to its rank/index in the vocabulary:
#print(token.word_index)
Map each word (string) to the number of training documents it appears in:
#token.word_docs
View the raw frequency count of every word seen during fitting:
#print(token.word_counts)
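
For instance, the most frequent words in the corpus can be listed by sorting word_counts (a small sketch using the Tokenizer fitted above):

#Show the ten most frequent words seen during fitting
top10 = sorted(token.word_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top10)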

Converting text to lists of numbers

train_sequences = token.texts_to_sequences(train_texts)
test_sequences = token.texts_to_sequences(test_texts)
print(train_texts[0])
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
print(train_sequences[0])
[309, 6, 3, 1069, 209, 9, 2161, 30, 1, 169, 55, 14, 46, 82, 41, 392, 110, 138, 14, 58, 150, 8, 1, 482, 69, 5, 261, 12, 2002, 6, 73, 2425, 5, 632, 71, 6, 1, 5, 2003, 1, 1534, 34, 67, 64, 205, 140, 65, 1230, 1, 4, 1, 223, 901, 29, 3022, 69, 4, 1, 10, 693, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1467, 3712, 800, 5, 3513, 177, 1, 392, 10, 1237, 30, 309, 3, 353, 344, 2974, 143, 130, 5, 28, 4, 126, 1467, 2373, 5, 309, 10, 532, 12, 108, 1468, 4, 58, 555, 101, 12, 309, 6, 227, 48, 3, 2232, 12, 9, 215]
x_train = keras.preprocessing.sequence.pad_sequences(train_sequences,padding='post',truncating='post',maxlen=400)
x_test = keras.preprocessing.sequence.pad_sequences(test_sequences,padding='post',truncating='post',maxlen=400)
x_train.shape
(25000, 400)
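
What padding='post' and truncating='post' do can be seen on a toy example (the sequences below are made up):

toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(keras.preprocessing.sequence.pad_sequences(toy, padding='post', truncating='post', maxlen=5))
# [[1 2 3 0 0]
#  [4 5 6 7 8]]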

The padded sequences

print(x_train[0])
[ 309    6    3 1069  209    9 2161   30    1  169   55   14   46   82   41
  392  110  138   14   58  150    8    1  482   69    5  261   12 2002    6
   73 2425    5  632   71    6    1    5 2003    1 1534   34   67   64  205
  140   65 1230    1    4    1  223  901   29 3022   69    4    1   10  693
    2   65 1534   51   10  216    1  387    8   60    3 1467 3712  800    5
 3513  177    1  392   10 1237   30  309    3  353  344 2974  143  130    5
   28    4  126 1467 2373    5  309   10  532   12  108 1468    4   58  555
  101   12  309    6  227   48    3 2232   12    9  215    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0]

Building the model

Defining the model

model = keras.models.Sequential()
model.add(keras.layers.Embedding(output_dim=32,input_dim=4000,input_length=400))
model.add(keras.layers.Flatten())
#GlobalAveragePooling1D would also flatten the embedding output
#model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(units=256,activation='relu'))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(units=2,activation='softmax'))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               3277056   
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 514       
=================================================================
Total params: 3,405,570
Trainable params: 3,405,570
Non-trainable params: 0
_________________________________________________________________
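
The parameter counts can be verified by hand: the Embedding layer holds 4000 × 32 = 128,000 weights, the first Dense layer 12,800 × 256 + 256 = 3,277,056, and the output layer 256 × 2 + 2 = 514. Had GlobalAveragePooling1D been used instead of Flatten, the first Dense layer would see a 32-dimensional input and need only 32 × 256 + 256 = 8,448 weights:

#Checking the summary's parameter counts
print(4000*32)            # embedding: 128000
print(400*32*256 + 256)   # dense_1 after Flatten: 3277056
print(256*2 + 2)          # dense_2: 514
print(32*256 + 256)       # dense_1 if GlobalAveragePooling1D were used: 8448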

Model compilation and training

y_train = np.array(train_labels)
y_test = np.array(test_labels)
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(x_train,y_train,validation_split=0.2,epochs=10,batch_size=128,verbose=1)
Train on 20000 samples, validate on 5000 samples
Epoch 1/10
20000/20000 [==============================] - 23s 1ms/step - loss: 6.0026 - acc: 0.6246 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 2/10
20000/20000 [==============================] - 21s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 3/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 4/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 5/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 6/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 7/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 8/10
20000/20000 [==============================] - 24s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 9/10
20000/20000 [==============================] - 22s 1ms/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
Epoch 10/10
20000/20000 [==============================] - 20s 981us/step - loss: 6.0443 - acc: 0.6250 - val_loss: 16.1181 - val_acc: 0.0000e+00
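
The validation accuracy pinned at 0 is a red flag: validation_split takes the last 20% of the arrays before any shuffling, and since read_files returns all positive reviews first and all negative ones last, the validation set here is entirely negative. This also explains the chance-level 0.5 test accuracy reported further below. A minimal fix is to shuffle the training arrays before calling fit (a sketch reusing the variables defined above):

#Shuffle so the tail slice used for validation mixes both classes
indices = np.random.permutation(len(x_train))
history = model.fit(x_train[indices], y_train[indices],
                    validation_split=0.2, epochs=10,
                    batch_size=128, verbose=1)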

Visualizing the training process

import matplotlib.pyplot as plt

acc = history.history["acc"]
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1,len(acc)+ 1)
plt.plot(epochs,loss,'r',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear figure
plt.plot(epochs,acc,'r',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Evaluating model accuracy

y_test = np.array(test_labels)
test_loss,test_acc = model.evaluate(x_test,y_test,verbose = 1)

print("Test accuracy:",test_acc)
25000/25000 [==============================] - 8s 309us/step
Test accuracy: 0.5
predictions = model.predict(x_test)
predictions[0]
array([ 1.,  0.], dtype=float32)
sentiment_dict = {0:"pos",1:"neg"}

def display_test_sentiment(i):
    print(test_texts[i])
    print("label value:",sentiment_dict[np.argmax(y_test[i])],
         "predict value",sentiment_dict[np.argmax(predictions[i])])
display_test_sentiment(0) 
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.
label value: pos predict value pos
review_text = "The Empire Strikes Back is the best film in the original Star Wars"
input_seq = token.texts_to_sequences([review_text])
pad_input_seq = keras.preprocessing.sequence.pad_sequences(input_seq,padding="post",truncating="post",maxlen = 400)
pred = model.predict(pad_input_seq)
print("predict value:",sentiment_dict[np.argmax(pred)])

predict value: pos
sentiment_dict = {0:"pos",1:"neg"}
def displays_text_sentiment(text):
    print(text)
    input_seq = token.texts_to_sequences([text])
    pad_input_seq = keras.preprocessing.sequence.pad_sequences(input_seq,padding="post",truncating="post",maxlen=400)
    pred = model.predict(pad_input_seq)
    print("predict value:",sentiment_dict[np.argmax(pred)])
displays_text_sentiment(review_text)
The Empire Strikes Back is the best film in the original Star Wars
predict value: pos

Building an LSTM-based model

# Define the model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(output_dim=32,input_dim=4000,input_length=400))

#With an RNN there is no need to flatten the embedding output
# model.add(keras.layers.Flatten())
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units = 8)))

model.add(keras.layers.Dense(units=32,activation="relu"))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(units=2,activation="softmax"))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 400, 32)           128000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 16)                2624      
_________________________________________________________________
dense_3 (Dense)              (None, 32)                544       
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 66        
=================================================================
Total params: 131,234
Trainable params: 131,234
Non-trainable params: 0
_________________________________________________________________
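
These counts also check out: an LSTM has four gates, each with input, recurrent, and bias weights, so one direction needs 4 × (32 × 8 + 8 × 8 + 8) = 1,312 parameters, and the bidirectional wrapper doubles that to 2,624; its concatenated output is 8 + 8 = 16 wide, giving 16 × 32 + 32 = 544 weights for the next Dense layer:

print(2 * 4 * (32*8 + 8*8 + 8))  # bidirectional LSTM: 2624
print(16*32 + 32)                # dense_3: 544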
#For a multi-class model with one-hot encoded labels, use categorical_crossentropy
#For a multi-class model with integer (non-one-hot) labels, use sparse_categorical_crossentropy
#For binary labels, use binary_crossentropy
model.compile(optimizer='adam',
              loss='categorical_crossentropy',metrics = ["accuracy"])
history = model.fit(x_train,np.array(train_labels),
                    validation_split=0.2,
                    epochs=6,
                    batch_size=128,verbose=1)
Train on 20000 samples, validate on 5000 samples
Epoch 1/6
20000/20000 [==============================] - 87s 4ms/step - loss: 0.6620 - acc: 0.6239 - val_loss: 0.9128 - val_acc: 0.0000e+00
Epoch 2/6
20000/20000 [==============================] - 88s 4ms/step - loss: 0.6478 - acc: 0.6458 - val_loss: 0.9550 - val_acc: 0.0000e+00
Epoch 3/6
20000/20000 [==============================] - 88s 4ms/step - loss: 0.6518 - acc: 0.6250 - val_loss: 0.9173 - val_acc: 6.0000e-04
Epoch 4/6
20000/20000 [==============================] - 94s 5ms/step - loss: 0.5934 - acc: 0.6811 - val_loss: 1.1345 - val_acc: 0.0042
Epoch 5/6
20000/20000 [==============================] - 93s 5ms/step - loss: 0.5605 - acc: 0.7072 - val_loss: 0.8810 - val_acc: 0.5942
Epoch 6/6
20000/20000 [==============================] - 90s 4ms/step - loss: 0.4216 - acc: 0.8247 - val_loss: 0.3850 - val_acc: 0.8742
import matplotlib.pyplot as plt

acc = history.history["acc"]
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1,len(acc)+ 1)
plt.plot(epochs,loss,'r',label='Training loss')
plt.plot(epochs,val_loss,'b',label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
plt.clf()  # clear figure
plt.plot(epochs,acc,'r',label='Training acc')
plt.plot(epochs,val_acc,'b',label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
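
The retrained LSTM model can be scored on the held-out test set with the same evaluation call used earlier (a sketch; outputs omitted):

test_loss,test_acc = model.evaluate(x_test,y_test,verbose=1)
print("Test accuracy:",test_acc)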

