
Scrapy Framework in Practice

Crawl targets: basic information for every League of Legends hero (name, background story, skill names and descriptions), plus downloading every hero's skins and saving them locally.

First, open the LOL official site's homepage and navigate to the page that lists all heroes.

Here was my initial approach:

Scrape the data straight from the page's HTML source, which is the most basic way to crawl.

Looking at the URL of a single hero's page, the pattern is easy to spot: every hero's detail page has the same URL, differing only in the value of its id parameter.

So the plan was to grab each hero's id from the hero list page and splice it into the detail-page URL, as sketched below.
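
The idea in code (BASE_URL here is a placeholder, not the site's real detail-page address; only the id parameter would vary between heroes):

# Sketch of the original plan: splice each hero's id into one fixed URL.
BASE_URL = "https://lol.qq.com/hero-detail?id={}"  # hypothetical address

def detail_url(hero_id):
    # Only the id query parameter changes from hero to hero
    return BASE_URL.format(hero_id)

print(detail_url(1))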

The idea sounded great, but in practice I could never get the data I wanted: the values inside the li tags always came back as "正在加载中" ("loading").

It finally dawned on me that the hero data is fetched via Ajax requests, so the traditional source-scraping approach was never going to work.

So I switched approaches:

Fetch the JS file that stores the hero list directly, read each hero's id out of it, then build the detail-page address by concatenating the id into a URL.

The hero detail page likewise loads its data via Ajax,

and the JS file it fetches contains exactly the data we want:

the hero information and skin image URLs can be read out directly.
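
Before wiring this into Scrapy, the endpoints are easy to sanity-check by hand. A minimal sketch using the requests library (assuming, as the spider below does, that the list sits in hero_list.js and each detail file is named <heroId>.js):

# Quick sanity check of the hero-list endpoint, outside Scrapy
import json
import requests  # third-party: pip install requests

LIST_URL = "https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js"

data = json.loads(requests.get(LIST_URL).text)  # the .js file is plain JSON
for hero in data["hero"][:3]:                   # peek at the first few heroes
    print(hero["heroId"], hero["name"], hero["title"])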

Crawler code:
lolheros_info.py

# -*- coding: utf-8 -*-
import scrapy
import json
from lolheros.items import LolherosItem

class LolherosInfoSpider(scrapy.Spider):
    name = 'lolheros_info'
    allowed_domains = ['lol.qq.com', 'game.gtimg.cn']
    start_urls = ['https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js']

    def parse_heroinfo(self, response):
        # Each hero's .js file is plain JSON with "hero", "skins" and "spells" keys
        datas = json.loads(response.body)
        hero_info = datas['hero']
        hero_nickname = hero_info['name']
        hero_realname = hero_info['title']
        hero_background = hero_info['shortBio']
        # Collect the main image URL of every skin
        hero_skins = datas['skins']
        hero_skin_urls = []
        for hero_skin in hero_skins:
            hero_skin_url = hero_skin['mainImg']
            hero_skin_urls.append(hero_skin_url)
        # Flatten the skill list into one "(name:description)" string
        hero_skills = datas['spells']
        hero_skills_str = ""
        for hero_skill in hero_skills:
            hero_skills_str += "(" + str(hero_skill['name']) + ":" + str(hero_skill['description']).replace('<br>', '') + ")"

        hero_info_list = [hero_nickname, hero_realname, hero_background, hero_skills_str]
        item = LolherosItem(hero_info_list=hero_info_list,
                            hero_skin_urls=hero_skin_urls)
        yield item

    def parse(self, response):
        # hero_list.js is plain JSON; the "hero" key holds one dict per hero
        datas = json.loads(response.body)
        heros_list = datas['hero']
        for hero_info in heros_list:
            hero_id = hero_info['heroId']
            # Every hero has its own detail file named after its id
            heroinfo_url = "https://game.gtimg.cn/images/lol/act/img/js/hero/" + str(hero_id) + ".js"
            request = scrapy.Request(heroinfo_url, callback=self.parse_heroinfo, dont_filter=True)
            yield request
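
The spider yields a LolherosItem with two fields. The project's items.py isn't shown above; inferred from the fields the spider actually uses, a minimal sketch would be:

# items.py -- minimal definition inferred from the spider's usage
import scrapy

class LolherosItem(scrapy.Item):
    hero_info_list = scrapy.Field()   # [nickname, real name, background, skills string]
    hero_skin_urls = scrapy.Field()   # list of skin image URLs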

Data-processing code:
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import xlwt
from urllib import request
import os

class LolherosPipeline(object):
    def __init__(self):
        self.current_row = 1                      # next free row in the sheet
        self.savepath = "LOL英雄信息.xls"         # output workbook ("LOL hero info")
        self.book = xlwt.Workbook(encoding="utf-8", style_compression=0)
        self.sheet = self.book.add_sheet('LOL英雄信息', cell_overwrite_ok=True)

    def open_spider(self, spider):
        print("Crawl started")
        # Put downloaded skins in an images/ folder next to the project package
        self.image_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
        if not os.path.exists(self.image_path):
            os.mkdir(self.image_path)

    def process_item(self, item, spider):
        hero_skin_urls = item['hero_skin_urls']
        hero_info_list = item['hero_info_list']
        print(hero_skin_urls)
        # Save the hero data to Excel
        # (header strings: nickname, real name, background story, skill descriptions)
        col = ("昵称", "名字", "背景故事", "技能介绍")
        for i in range(0, 4):
            self.sheet.write(0, i, col[i])
        for i in range(0, 4):
            self.sheet.write(self.current_row, i, hero_info_list[i])
        self.current_row += 1
        # Saving after every item is wasteful, but keeps data if the crawl dies mid-run
        self.book.save(self.savepath)
        # Download the hero's skins
        hero_name = hero_info_list[0]
        # Create a folder named after the hero
        image_category = os.path.join(self.image_path, hero_name)
        if not os.path.exists(image_category):
            os.mkdir(image_category)
        for hero_skin_url in hero_skin_urls:
            if hero_skin_url != '':  # some skins have an empty mainImg
                image_name = hero_skin_url.split('/')[-1]
                request.urlretrieve(hero_skin_url, os.path.join(image_category, image_name))
        return item

    def close_spider(self, spider):
        print("Crawl finished")

Crawl results:

Basic information for every hero (saved to an Excel file)

Skin images for every hero (saved per hero under the images/ folder)
