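Analyzing the Douban Movie Page
The previous sections explained how to find information in an XML document: how to retrieve elements, their content, and their attributes, and how to filter by tag attributes. This section shows how to use XPath to extract data from Douban.
Downloading the First Page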
# -*- coding: utf-8 -*-
import scrapy
from lxml import etree
from douban.items import DoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['movie.douban.com']
    # URL where the crawl starts
    start_urls = ['http://movie.douban.com/top250']

    def parse(self, response):
        # print(response.text)
        html = etree.HTML(response.text)
        # First use XPath to select every <li> under the grid_view <ol>
        li_list = html.xpath("//ol[@class='grid_view']/li")
        for li in li_list:
            item = DoubanItem()
            # Fields printed below, in order: em, title, img, comment
            print(li.xpath(".//em/text()")[0])
            print(li.xpath(".//span[@class='title']/text()")[0])
            print(li.xpath(".//img/@src")[0])
            # The last <span> inside div.star holds the rating count
            print(li.xpath(".//div[@class='star']/span/text()")[-1])
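The selection logic in parse can be tried offline, without running the spider. The following is a minimal sketch under several assumptions: the SAMPLE markup is a hypothetical, heavily simplified stand-in for the real page, extract_movies is an illustrative helper name, and the standard library's xml.etree.ElementTree is swapped in for lxml so that nothing needs to be installed. ElementTree supports only a subset of XPath, so text() and @src are replaced here by findtext() and get("src"); with lxml the expressions from the spider work as written.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the Top 250 page markup.
SAMPLE = """
<html><body>
<ol class="grid_view">
  <li>
    <em>1</em>
    <img src="https://example.com/poster1.jpg" />
    <span class="title">Movie A</span>
    <span class="title">Alt title A</span>
    <div class="star"><span>9.7</span><span>100人评价</span></div>
  </li>
  <li>
    <em>2</em>
    <img src="https://example.com/poster2.jpg" />
    <span class="title">Movie B</span>
    <div class="star"><span>9.6</span><span>200人评价</span></div>
  </li>
</ol>
</body></html>
"""


def extract_movies(markup):
    """Return one dict per <li> under the grid_view <ol>."""
    root = ET.fromstring(markup.strip())
    movies = []
    for li in root.findall(".//ol[@class='grid_view']/li"):
        # All <span> children of div.star; the last one is the rating count.
        star_spans = li.findall(".//div[@class='star']/span")
        movies.append({
            "rank": li.findtext(".//em"),
            # findtext returns the text of the first matching element.
            "title": li.findtext(".//span[@class='title']"),
            "img": li.find(".//img").get("src"),
            "comment": star_spans[-1].text,
        })
    return movies


movies = extract_movies(SAMPLE)
for movie in movies:
    print(movie)
```

Returning dictionaries instead of printing makes the result easy to inspect or to copy into an item later. Note that ElementTree requires well-formed XML, which real HTML pages often are not; that is one reason the spider uses lxml's etree.HTML.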
The data for the first five movies is as follows:
1
肖申克的救赎
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg
1551310人评价
2
霸王别姬
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561716440.jpg
1146654人评价
3
这个杀手不太冷
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p511118051.jpg
1399607人评价
4
阿甘正传
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2559011361.jpg
1209414人评价
5
美丽人生
https://img3.doubanio.com/view/photo/s_ratio_poster/public/p510861873.jpg
708487人评价
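The last field of each record is a string such as 1551310人评价 (the number of people who rated the movie) rather than a number. If a numeric value is wanted, one possible approach is sketched below. The helper name parse_comment_count is hypothetical, and the sketch assumes the count always appears as a single run of digits in the string.

```python
import re


def parse_comment_count(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


count = parse_comment_count("1551310人评价")
print(count)
```

Returning None for a string with no digits avoids raising an exception on an unexpected value; the caller can then decide whether to skip the record.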