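Scrapy on its own only fetches the HTML that the server returns. It does not execute JavaScript, so it can only scrape static pages. To scrape a dynamic site, pair it with Selenium to render the JavaScript and get at the dynamically loaded data.
How can a URL be requested through Selenium rather than through Scrapy's Downloader? The approach here is a custom downloader middleware whose process_request drives the browser and returns the rendered page.
Related configuration:
1. Install selenium in the Scrapy environment: pip install selenium
2. Make sure a headless browser is available in the Python environment. The original setup names PhantomJS; the code below uses Firefox in headless mode.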
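Before wiring anything together, it can help to confirm that both packages are importable from the interpreter that will run the crawl. The snippet below is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level package names only; dotted names would need their parent installed.
    missing = missing_packages(["scrapy", "selenium"])
    print("missing:", missing or "none")
```

An empty result only shows that the Python packages are installed; the browser and its driver binary still have to be present on the machine.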
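Most of the Selenium work happens in the downloader middleware, as shown in the figure below:
The code is as follows.
middlewares.py:
Note: this is a custom downloader middleware that fetches pages with Selenium.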
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import time

from scrapy import signals
from scrapy.http import HtmlResponse, Response
from selenium import webdriver
from selenium.webdriver import FirefoxOptions


class TaobaospiderSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TaobaospiderDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


# ---- Custom downloader middleware that replaces the default one above ----

class SeleniumTaobaoDownloaderMiddleware(object):
    # Creating the driver in the middleware's __init__ suits a project with a single spider.
    # If the project has several spider files, create the driver in each spider file instead.
    # def __init__(self):
    #     # Create as few driver objects as possible in Scrapy:
    #     # 1. create the driver in __init__;
    #     # 2. or create it in open_spider;
    #     # 3. never create it inside process_request().
    #     option = FirefoxOptions()
    #     # The original sets option.headless = True; newer Selenium releases
    #     # dropped that attribute, and the argument form works in both.
    #     option.add_argument('-headless')
    #     self.driver = webdriver.Firefox(options=option)

    # The spider argument is the TaobaoSpider instance.
    def process_request(self, request, spider):
        if spider.name == "taobao":
            spider.driver.get(request.url)
            # Taobao loads its listing data as the page scrolls.
            # Not every JavaScript-driven page needs this step.
            for x in range(1, 11, 2):
                height = float(x) / 10
                js = "document.documentElement.scrollTop = document.documentElement.scrollHeight * %f" % height
                spider.driver.execute_script(js)
                time.sleep(0.2)

            origin_code = spider.driver.page_source
            # Wrap the rendered source in a Response object and return it,
            # so that Scrapy skips its own download for this request.
            res = HtmlResponse(url=request.url, encoding='utf8', body=origin_code, request=request)
            # res = Response(url=request.url, body=bytes(origin_code), request=request)
            return res
        if spider.name == 'bole':
            request.cookies = {}
            # Fixed: the method is setdefault, not setDefault.
            request.headers.setdefault('User-Agent', '')
        return None

    def process_response(self, request, response, spider):
        print(response.url, response.status)
        return response
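The scrolling loop in process_request is easier to reason about when the JavaScript it sends is printed out. The sketch below rebuilds the same strings without a browser; the helper name scroll_scripts and its parameters are made up for this illustration.

```python
def scroll_scripts(start=1, stop=11, step=2):
    """Build the JavaScript statements that scroll to x/10 of the page height."""
    template = (
        "document.documentElement.scrollTop = "
        "document.documentElement.scrollHeight * %f"
    )
    # range(1, 11, 2) yields 1, 3, 5, 7, 9, i.e. 10% to 90% of the page height.
    return [template % (x / 10) for x in range(start, stop, step)]


if __name__ == "__main__":
    for js in scroll_scripts():
        print(js)
```

With the default arguments the last step stops at 90% of the page height, so a page that only loads more items at the very bottom may need one extra step.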
taobao.py:
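# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from selenium.webdriver import FirefoxOptions


class TaobaoSpider(scrapy.Spider):
    """
    Scrapy on its own can only scrape static pages. To scrape a dynamic site,
    pair it with Selenium to render the JavaScript and get the dynamically loaded data.
    How to request a URL through Selenium instead of through the Downloader.
    """
    name = 'taobao'
    allowed_domains = ['taobao.com']
    # The query string of this URL is truncated in the source.
    start_urls = ['https://s.taobao.com/search']

    # The source is cut off between start_urls and the docstring of the parsing
    # method. The driver setup and the parse() signature below are reconstructed
    # from the commented-out __init__ in middlewares.py and from spider.driver.
    def __init__(self, *args, **kwargs):
        super(TaobaoSpider, self).__init__(*args, **kwargs)
        option = FirefoxOptions()
        option.add_argument('-headless')
        self.driver = webdriver.Firefox(options=option)

    def parse(self, response):
        """
        Extract the product title and price from the listing page.
        :param response:
        :return:
        """
        info_divs = response.xpath('//div[@class="info-cont"]')
        print(len(info_divs))
        for div in info_divs:
            title = div.xpath('.//a[@class="product-title"]/@title').extract_first('')
            price = div.xpath('.//span[contains(@class, "g_price")]/strong/text()').extract_first('')
            print(title, price)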
settings.py is shown in the figure below:
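Since the settings figure is not reproduced in the text, here is a minimal sketch of the one entry the custom middleware needs in order to be used. The package name taobaospider is an assumption made for this example, not taken from the source; 543 is simply the priority that Scrapy's project template suggests.

```python
# settings.py (fragment)
# Assumption: the project package is called "taobaospider"; replace it with
# the real package name of the project.
DOWNLOADER_MIDDLEWARES = {
    "taobaospider.middlewares.SeleniumTaobaoDownloaderMiddleware": 543,
}
```

Scrapy only calls downloader middlewares that are listed in DOWNLOADER_MIDDLEWARES, so without an entry like this process_request in the custom class never runs.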
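The code mentions two possible places to initialize the driver:
1. If the project has only one spider file, the driver can be initialized in the custom middleware in middlewares.py (see the commented-out __init__ above) or in the spider file itself (as the spider code above does).
Note: with a single spider file, the custom process_request does not need to check which spider a request came from.
2. If the project has two or more spiders (see the project structure in the figure below), initialize the driver in each spider's own file (as in the code above), and have process_request check which spider each request belongs to.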
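The per-spider check in the second case can be sketched without Scrapy or Selenium installed. Everything below is a stand-in written for this illustration: FakeDriver, FakeSpider, SELENIUM_SPIDERS and this simplified process_request are hypothetical, and the function returns a plain string where the real middleware returns an HtmlResponse.

```python
class FakeDriver:
    """Stand-in for a selenium WebDriver: records URLs instead of opening a browser."""

    def __init__(self, html):
        self.page_source = html
        self.visited = []

    def get(self, url):
        self.visited.append(url)


class FakeSpider:
    """Stand-in for a scrapy.Spider that may own a driver."""

    def __init__(self, name, driver=None):
        self.name = name
        self.driver = driver


# Names of the spiders whose pages need browser rendering (assumed for the sketch).
SELENIUM_SPIDERS = {"taobao"}


def process_request(url, spider):
    """Return rendered HTML for Selenium-backed spiders, otherwise None."""
    if spider.name in SELENIUM_SPIDERS:
        spider.driver.get(url)
        return spider.driver.page_source
    # None tells Scrapy to carry on and let its own Downloader fetch the URL.
    return None


if __name__ == "__main__":
    taobao = FakeSpider("taobao", FakeDriver("<html>rendered</html>"))
    bole = FakeSpider("bole")
    print(process_request("https://s.taobao.com/search", taobao))
    print(process_request("https://example.com/", bole))
```

Keeping the names in one set means a new Selenium-backed spider only needs its name added there, instead of another if branch in the middleware.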