Scrapy Framework in Practice
Scraping target: the basic information for every League of Legends hero (name, background story, skill names and descriptions), plus downloading every hero's skin images and saving them locally.
First, go to the LOL official site home page and navigate to the page that lists all heroes.
Here was my initial approach:
Pull the desired data straight out of the page's HTML source, which is the most basic way to scrape.
Looking at the URL of a single hero's page, the pattern is easy to spot: every hero detail page has the same URL, differing only in the value of the id parameter.
So the plan was to grab each hero's id from the hero list page and build the detail-page URL from it.
That was the theory. In practice I could never get the data: the li tags I extracted always contained the placeholder text "loading…" (正在加载中).
It turned out that the hero data is loaded via ajax requests, so the traditional approach cannot work here.
So I switched strategies:
Fetch the js file that stores the hero information directly, read each hero's id from it, and concatenate the id into a URL to get the hero detail page address.
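The URL-building step can be sketched outside of Scrapy with plain json. The inline sample below only mimics the structure of hero_list.js; the field names (hero, heroId) are taken from the spider code in this article and the sample hero entries are made up for illustration:

```python
import json

# Trimmed sample mimicking hero_list.js (the real file lives on the
# gtimg CDN; "hero" and "heroId" are the keys the spider relies on).
hero_list_js = '{"hero": [{"heroId": "1", "name": "黑暗之女"}, {"heroId": "2", "name": "狂战士"}]}'

datas = json.loads(hero_list_js)
# Concatenate each hero id into its detail-page js URL
detail_urls = [
    "https://game.gtimg.cn/images/lol/act/img/js/hero/" + hero["heroId"] + ".js"
    for hero in datas["hero"]
]
print(detail_urls)
```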
The hero detail page likewise loads its data via ajax.
The fetched js file contains the data we want.
Hero information and skin image URLs can be read from it directly.
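Extracting those fields can be sketched the same way. The inline sample below only imitates one hero's detail js; the keys (hero, name, title, shortBio, skins, mainImg, spells) are assumptions based on what the spider code in this article reads, and the values are placeholders:

```python
import json

# Trimmed sample mimicking a per-hero detail js file (e.g. 1.js);
# keys are assumed from the spider's usage, values are illustrative.
hero_js = '''{
  "hero": {"name": "安妮", "title": "黑暗之女", "shortBio": "..."},
  "skins": [{"mainImg": "https://game.gtimg.cn/images/lol/act/img/skin/big1000.jpg"},
            {"mainImg": ""}],
  "spells": [{"name": "裂变火花", "description": "..."}]
}'''

datas = json.loads(hero_js)
# Some skin entries have an empty mainImg, so filter those out
skin_urls = [s["mainImg"] for s in datas["skins"] if s["mainImg"]]
print(datas["hero"]["title"], skin_urls)
```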
Spider code:
lolheros_info.py
# -*- coding: utf-8 -*-
import scrapy
import json

from lolheros.items import LolherosItem


class LolherosInfoSpider(scrapy.Spider):
    name = 'lolheros_info'
    allowed_domains = ['lol.qq.com', 'game.gtimg.cn']
    start_urls = ['https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js']

    def parse_heroinfo(self, response):
        # Each hero's detail js file is plain JSON
        datas = json.loads(response.body)
        hero_info = datas['hero']
        hero_nickname = hero_info['name']
        hero_realname = hero_info['title']
        hero_background = hero_info['shortBio']
        hero_skins = datas['skins']
        hero_skin_urls = []
        for hero_skin in hero_skins:
            hero_skin_url = hero_skin['mainImg']
            hero_skin_urls.append(hero_skin_url)
        hero_skills = datas['spells']
        hero_skills_str = ""
        for hero_skill in hero_skills:
            hero_skills_str += "(" + str(hero_skill['name']) + ":" + str(hero_skill['description']).replace('<br>', '') + ")"
        hero_info_list = [hero_nickname, hero_realname, hero_background, hero_skills_str]
        item = LolherosItem(hero_info_list=hero_info_list,
                            hero_skin_urls=hero_skin_urls)
        yield item

    def parse(self, response):
        # hero_list.js is plain JSON: {"hero": [{"heroId": ..., ...}, ...]}
        datas = json.loads(response.body)
        heros_list = datas['hero']
        for hero_info in heros_list:
            hero_id = hero_info['heroId']
            # str() guards against heroId arriving as a number
            heroinfo_url = "https://game.gtimg.cn/images/lol/act/img/js/hero/" + str(hero_id) + ".js"
            request = scrapy.Request(heroinfo_url, callback=self.parse_heroinfo, dont_filter=True)
            yield request
Data-processing code:
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request

import xlwt


class LolherosPipeline(object):
    def __init__(self):
        self.current_row = 1
        self.savepath = "LOL英雄信息.xls"  # "LOL hero info.xls"
        self.book = xlwt.Workbook(encoding="utf-8", style_compression=0)
        self.sheet = self.book.add_sheet('LOL英雄信息', cell_overwrite_ok=True)

    def open_spider(self, spider):
        print("Crawl started")
        self.image_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
        if not os.path.exists(self.image_path):
            os.mkdir(self.image_path)

    def process_item(self, item, spider):
        hero_skin_urls = item['hero_skin_urls']
        hero_info_list = item['hero_info_list']
        print(hero_skin_urls)
        # Save the hero data to Excel
        col = ("昵称", "名字", "背景故事", "技能介绍")  # nickname, name, background, skills
        for i in range(0, 4):
            self.sheet.write(0, i, col[i])
        for i in range(0, 4):
            self.sheet.write(self.current_row, i, hero_info_list[i])
        self.current_row += 1
        self.book.save(self.savepath)
        # Download the hero's skins
        hero_name = hero_info_list[0]
        # Create a folder named after the hero
        image_category = os.path.join(self.image_path, hero_name)
        if not os.path.exists(image_category):
            os.mkdir(image_category)
        for hero_skin_url in hero_skin_urls:
            if hero_skin_url != '':
                image_name = hero_skin_url.split('/')[-1]
                request.urlretrieve(hero_skin_url, os.path.join(image_category, image_name))
        return item

    def close_spider(self, spider):
        print("Crawl finished")
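For the pipeline to run at all, it has to be registered in the project's settings.py. The article does not show that file; assuming the project is named lolheros (as the import paths above suggest), the relevant fragment would be:

```python
# settings.py (excerpt) -- enable the pipeline; 300 is an ordinary
# middle priority, lower numbers run first.
ITEM_PIPELINES = {
    'lolheros.pipelines.LolherosPipeline': 300,
}
```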
Results:
Basic information for all heroes (saved to Excel)
All heroes' skin images