Python: Scrapy, Splash, Lua, and button clicks

2024-06-20 · 27 · luoye11

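I am new to all of the tools involved here. My goal is to extract all URLs from a large number of pages that are connected by a "Weiter"/"next" button, and to do this for several start URLs. I decided to try it with Scrapy. The page is generated dynamically, so I learned that I need an additional tool and installed Splash for that. The installation works, and I set it up according to the tutorials. I then managed to get the first page by sending "Return" into the search input field; in a browser, that gives me the results I need. My problem is that I am trying to click the "next" button on the generated page and do not know exactly how. As I have read on several pages, this is not always easy. I tried the suggested solutions without success. I think I am not far off and would appreciate some help. Thank you.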

My settings.py:

BOT_NAME = 'gr' 
SPIDER_MODULES = ['gr.spiders'] 
NEWSPIDER_MODULE = 'gr.spiders' 
ROBOTSTXT_OBEY = True 
DOWNLOADER_MIDDLEWARES = { 
    'scrapy_splash.SplashCookiesMiddleware': 723, 
    'scrapy_splash.SplashMiddleware': 725, 
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810, 
} 
SPLASH_URL = 'http://localhost:8050' 
SPIDER_MIDDLEWARES = { 
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100, 
} 
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' 
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' 

My spider:

# -*- coding: utf-8 -*- 
import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy_splash import SplashRequest 
import json 
# import base64 
 
class GrSpider(scrapy.Spider): 
    name = 'gr_' 
    allowed_domains = ['lawsearch.gr.ch'] 
    start_urls = ['http://www.lawsearch.gr.ch/le/'] 
 
    def start_requests(self): 
 
        script = """ 
        function main(splash) 
            assert(splash:go(splash.args.url)) 
            splash:set_viewport_full() 
            splash:wait(0.3) 
            splash:send_keys("<Return>") 
            splash:wait(0.3) 
            return splash:html() 
        end 
        """ 
 
        for url in self.start_urls: 
 
             yield SplashRequest(url=url, 
                    callback=self.parse, 
                    endpoint='execute', 
                    args={'lua_source': script}) 
 
    def parse(self, response): 
 
        script3 = """ 
            function main(splash) 
            splash:autoload{url="https://code.jquery.com/jquery-3.2.1.min.js"} 
            assert(splash:go(splash.args.url)) 
            splash:set_viewport_full() 
 
--            splash:wait(2.8) 
--            local element = splash:select('.result-pager-next-active .simplebutton') 
--            element:mouse_click() 
 
--            local bounds = element:bounds() 
--            assert(element:mouse_click{x=bounds.width, y=bounds.height}) 
 
--            next attempt
-- https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/ 
 
--  https://stackoverflow.com/questions/35720323/scrapyjs-splash-click-controller-button 
--  assert(splash:runjs("$('#date-controller > a:first-child').click()")) 
 
--  https://github.com/scrapy-plugins/scrapy-splash/issues/27 
--            assert(splash:runjs("$('#result-pager-next-active .simplebutton').click()")) 
 
 
 
-- https://developer.mozilla.org/de/docs/Web/API/Document/querySelectorAll 
-- TO LOOK AT
-- https://stackoverflow.com/questions/38043672/splash-lua-script-to-do-multiple-clicks-and-visits 
 
-- elementList = baseElement.querySelectorAll(selectors) 
-- var domRect = element.getBoundingClientRect(); 
-- var rect = obj.getBoundingClientRect(); 
-- https://stackoverflow.com/questions/34001917/queryselectorall-with-multiple-conditions 
 
            local get_dimensions = splash:jsfunc([[ 
               function () { 
                 var doc1 = document.querySelectorAll("result-pager-next-active.simplebutton")[0]; 
                 var el = doc1.documentElement; 
                 var rect = el.getClientRects()[0]; 
                 return {'x': rect.left, 'y': rect.top} 
               } 
            ]]) 
--            splash:set_viewport_full() 
            splash:wait(0.1) 
            local dimensions = get_dimensions() 
            splash:mouse_click(dimensions.x, dimensions.y) 
 
 
--            splash:runjs("document.querySelectorAll('result-pager-next-active ,simplebutton')[1].click()") 
--            assert(splash:runjs("$('.result-pager-next-active .simplebutton')[1].click()")) 
--            assert(splash:runjs("$('.simplebutton')[12].click()")) 
 
            splash:wait(1.6) 
            return splash:html() 
        end 
        """ 
 
        for teil in response.xpath('//div/div/div/div/a'): 
            yield { 
                'link': teil.xpath('./@href').extract() 
            } 
 
        next_page = response.xpath('//div[@class="v-label v-widget simplebutton v-label-simplebutton v-label-undef-w"]').extract_first() 
 
#        print response.body 
 
        print('----------------------')
#        print  response.xpath('//div[@class="v-slot v-slot-simplebutton"]/div[contains(text(), "Weiter")]').extract_first() 
#        print  response.xpath('//div[@class="v-slot v-slot-simplebutton"]/div[contains(text(), "Weiter")]').extract() 
 
#       class="v-slot v-slot-simplebutton" 
#        nextPage = HtmlXPathSelector(response).select("//div[@class='paginationControl']/a[contains(text(),'Link Text Next')]/@href") 
 
#        neue_seite=response.url 
 
#        print response.url 
 
        if next_page is not None:  
#            yield SplashRequest(url=neue_seite, 
 
            yield SplashRequest(response.url, 
                    callback=self.parse, 
                    dont_filter=True, 
                    endpoint='execute', 
                    args={'lua_source': script3}) 

Please refer to the following approach:

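You do not always have to use Splash. If the next button is a link, you can simply read its href attribute and send a request for that URL back to the parse function.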