Creating the Jianshu crawler project
C:\Users\Administrator\Desktop>scrapy startproject jianshu
New Scrapy project 'jianshu', using template directory 'd:\anaconda3\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Administrator\Desktop\jianshu

You can start your first spider with:
    cd jianshu
    scrapy genspider example example.com
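After `startproject` finishes, Scrapy lays out its standard project skeleton (shown for the default template; the exact file list can vary slightly between Scrapy versions):

```
jianshu/
    scrapy.cfg            # deploy/run configuration
    jianshu/
        __init__.py
        items.py          # item (field) definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spiders are generated into this package
```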
Creating the crawl spider
The spiders created earlier all used the basic template. This crawler needs to download Jianshu articles and match their URLs with regular expressions, so the crawl template is the better choice for generating the spider.
C:\Users\Administrator\Desktop>cd jianshu

C:\Users\Administrator\Desktop\jianshu>scrapy genspider -t crawl jianshu_spider jianshu.com
Created spider 'jianshu_spider' using template 'crawl' in module:
  jianshu.spiders.jianshu_spider
Configuring the Jianshu crawl rules
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JianshuSpiderSpider(CrawlSpider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    # Rules tell the crawler which links to follow; they support regular
    # expressions. Article URLs look like:
    #   https://www.jianshu.com/p/df7cad4eb8d8
    #   https://www.jianshu.com/p/07b0456cbadb?*****
    # so a pattern of the form https://www.jianshu.com/p/.* covers them.
    rules = (
        Rule(LinkExtractor(allow=r'https://www.jianshu.com/p/[0-9a-z]{12}.*'),
             callback='parse_item', follow=True),
    )

    # Fields to declare later in items.py:
    # name = title = url = collection = scrapy.Field()

    def parse_item(self, response):
        print(response.text)
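`parse_item` above only prints the raw HTML. The next step would be extracting fields such as the article title; inside Scrapy the idiomatic way is `response.xpath('//title/text()')`, but the idea can be sketched with a self-contained stdlib helper (`extract_title` is hypothetical, not part of the generated project):

```python
import re
from typing import Optional


def extract_title(html: str) -> Optional[str]:
    """Pull the contents of the <title> tag out of raw HTML.

    A regex stand-in for response.xpath('//title/text()');
    returns None when no <title> tag is present.
    """
    m = re.search(r'<title[^>]*>(.*?)</title>', html,
                  re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else None


html = '<html><head><title>Demo article</title></head><body></body></html>'
print(extract_title(html))  # -> Demo article
```

In the real spider this logic would live in `parse_item`, with the extracted values filled into an item defined in items.py.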