
Scrapy: XPath retrieves data in the shell but not in the item

December 25, 2025 · 42 · sharpest

I'm building a simple web scraper with Scrapy to get football team results from the BBC website. The relevant HTML on the page (http://www.bbc.com/sport/football/teams/bolton-wanderers/results) looks like this:

<tr class="report" id="match-row-EFBO755964"> 
  <td class="statistics show" title="Show latest match stats"> 
    <button>Show</button>  
  </td>  
  <td class="match-competition"> Championship  </td>   
  <td class="match-details teams">  
    <p>  
      <span class="team-home teams"> <a href="/sport/football/teams/huddersfield-town">Huddersfield</a> </span>    
      <span class="score"> <abbr title="Score"> 2-1 </abbr> </span>    
      <span class="team-away teams"> <a href="/sport/football/teams/bolton-wanderers">Bolton</a> </span>    
    </p>  
  </td>  
  <td class="match-date"> Sun 28 Dec </td>    
  <td class="time">  Full time  </td>    
  <td class="status">   <a class="report" href="/sport/football/30566395">Report</a> 
  </td>  
</tr> 


When I try the extraction in the Scrapy shell, I get the following output:

$ scrapy shell http://www.bbc.com/sport/football/teams/bolton-wanderers/results 
 
>>> response.selector.xpath('//tr[@class="report"]/td[@class="match-date"]/text()').extract() 
[u' Sun 28 Dec ', u' Fri 26 Dec ', u' Fri 19 Dec ', u' Sat 13 Dec ',...] 


However, when I use the same XPath in a spider, I can't get these dates.
Here is the item:

import scrapy

class resultsItem(scrapy.Item):
    date = scrapy.Field()
    homeTeam = scrapy.Field()
    score = scrapy.Field()
    awayTeam = scrapy.Field()


Here is the spider:

class resultsSpider(scrapy.Spider):
    name = "results"
    allowed_domains = ["bbc.com"]
    start_urls = ["http://www.bbc.com/sport/football/teams/bolton-wanderers/results"]

    def parse(self, response):
        for sel in response.xpath('//tr[@class="report"]'):
            game = resultsItem()
            game['homeTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-home teams"]/a/text()').extract()
            game['score'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="score"]/abbr/text()').extract()
            game['awayTeam'] = sel.xpath('td[@class="match-details teams"]/p/span[@class="team-away teams"]/a/text()').extract()
            game['date'] = response.xpath('td[@class="match-date"]/text()').extract()

            yield game


And finally, the JSON output:

[{"date": [], "awayTeam": ["Bolton"], "homeTeam": ["Huddersfield"], "score": [" 2-1 "]}, 
{"date": [], "awayTeam": ["Blackburn"], "homeTeam": ["Bolton"], "score": [" 2-1 "]},... 


Why can't I get the dates, even though the same XPath works in the shell?

Suggested solution:

Shouldn't it be

game['date'] = sel.xpath('td[@class="match-date"]/text()').extract() 


instead of

game['date'] = response.xpath('td[@class="match-date"]/text()').extract() 


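since you are already inside this loop:

for sel in response.xpath('//tr[@class="report"]'):

Inside the loop, sel is a single tr row, so the relative path td[@class="match-date"] is resolved against that row, just like the other three fields. Called on response, the same relative path is resolved against the top of the document, where no td is a direct child, so it returns an empty list. The shell command worked because it used the absolute path //tr[@class="report"]/td[@class="match-date"]/text(), which is not the same expression as the one in the spider.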
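The difference between the two calls can be reproduced without a network request. What follows is a minimal sketch, not the Scrapy API: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, on the assumption that both resolve a relative path against the node it is called on. The SAMPLE markup is a trimmed, hypothetical version of the table in the question, and the names DATE_CELL, dates_from_root and dates_from_rows are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample shaped like the results table in the question.
SAMPLE = """
<html><body><table>
  <tr class="report">
    <td class="match-competition">Championship</td>
    <td class="match-date"> Sun 28 Dec </td>
  </tr>
  <tr class="report">
    <td class="match-competition">Championship</td>
    <td class="match-date"> Fri 26 Dec </td>
  </tr>
</table></body></html>
""".strip()

# Relative path: matches only td elements that are direct children
# of whatever node the query is run on.
DATE_CELL = "td[@class='match-date']"


def dates_from_root(root):
    """Mimic the buggy line: query the document root on every pass of the loop."""
    results = []
    for _row in root.findall(".//tr[@class='report']"):
        results.append([td.text.strip() for td in root.findall(DATE_CELL)])
    return results


def dates_from_rows(root):
    """Mimic the fix: query the current row, so the relative path can match."""
    results = []
    for row in root.findall(".//tr[@class='report']"):
        results.append([td.text.strip() for td in row.findall(DATE_CELL)])
    return results


root = ET.fromstring(SAMPLE)
print(dates_from_root(root))  # [[], []]
print(dates_from_rows(root))  # [['Sun 28 Dec'], ['Fri 26 Dec']]
```

The first function yields one empty list per row, which matches the "date": [] entries in the JSON output above. The second yields one date per row, which is what switching from response.xpath to sel.xpath is meant to achieve in the spider.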