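I recently started a crawler project to prepare data for my graduation project. After a few days of learning about web scraping, I wrote a Scrapy crawler for 电影天堂 (Movie Heaven). It mainly collects each movie's download link, image, director, and similar information, and saves it to a local MySQL database. The specific fields are: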

The code is as follows:
demo_scrapy.py:

# Spider body
import scrapy
import json
from movie.items import MovieItem
import re

from scrapy.utils.project import get_project_settings


settings = get_project_settings()
class DmozSpider(scrapy.spiders.Spider):
    name = "demo"
    allowed_domains = ['www.dytt8']
    start_urls = ['https://www.dytt8/html/gndy/dyzz/index.html']
    i = 0
    def parse(self, response):
        info_url_xpath='//td/b/a/@href'
        next_url_xpath='//div[@]/td/a[last()-1]/@href'
        # URLs of the movie detail pages
        info_urls=response.xpath(info_url_xpath).extract()
        # Relative URL of the next index page
        next_urls=response.xpath(next_url_xpath).extract()
        # Request every detail page listed on this index page
        for info_path in info_urls:
            info_url='https://www.dytt8'+info_path
            yield scrapy.Request(url=info_url,callback=self.def_info)
        # Follow the next index page only if a pagination link was found
        if next_urls:
            next_url='https://www.dytt8/html/gndy/dyzz/'+next_urls[0]
            yield scrapy.Request(next_url,callback=self.parse)
    # Extract the movie title and download URL:
    def def_info(self,response):
        #print(response.text)
        i_item = MovieItem()
        data=response.body.decode("gb2312","ignore")
        #title_xpath='//title/text()'
        #title=response.xpath(title_xpath).extract_first()
        down_url_xpath='//tbody/tr/td/a/text()'

        imageurl_xpath='//img[@alt=""]/@src'
        imageurl=response.xpath(imageurl_xpath).extract_first()
        down_url=response.xpath(down_url_xpath).extract_first()
        pat1='类  别 (.*?)<br />'
        pat2='年  代 (.*?)<br />'
        pat3='IMDb评分 (.*?)/10'
        pat4='导  演 (.*?)<br />'
        pat5='简  介
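The patterns at the end of `def_info` are presumably applied to the decoded page text with `re`. The following is a minimal standalone sketch of that step, not the original project's code. The field names (`category`, `year`, `imdb_score`, `director`), the helper `extract_fields`, and the sample page text are all hypothetical. The labels and terminators are copied from `pat1` to `pat4`.

```python
import re

# (field name, label on the page, text that ends the value).
# Labels and terminators mirror pat1..pat4 in the spider; the field names
# are hypothetical and chosen only for this illustration.
FIELDS = [
    ("category", "类  别", "<br />"),
    ("year", "年  代", "<br />"),
    ("imdb_score", "IMDb评分", "/10"),
    ("director", "导  演", "<br />"),
]


def extract_fields(data):
    """Return a dict of field values found in the decoded page text.

    A field is None when its label does not occur, so a page with a
    missing entry does not raise an exception.
    """
    result = {}
    for name, label, terminator in FIELDS:
        # Same shape as the spider's patterns: label, a space, a lazy group.
        match = re.search(label + " (.*?)" + terminator, data)
        result[name] = match.group(1).strip() if match else None
    return result


# Hypothetical page fragment, built from the same labels so that it
# matches the patterns exactly. Values are invented sample data.
SAMPLE_VALUES = {
    "category": "Drama",
    "year": "2019",
    "imdb_score": "7.5",
    "director": "Jane Doe",
}
sample_page = "".join(
    label + " " + SAMPLE_VALUES[name] + terminator
    for name, label, terminator in FIELDS
)

# Assumption: the response body is gb2312 bytes, as in the spider's
# response.body.decode("gb2312", "ignore") call.
raw_body = sample_page.encode("gb2312")
data = raw_body.decode("gb2312", "ignore")

fields = extract_fields(data)
print(fields)
```

Returning `None` for a missing label keeps one incomplete detail page from stopping the crawl. In the spider, the resulting values would then be copied into the `MovieItem` instance.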
