Author GHdisf45a (The_rabbit)
Board Python
Title [Question] Page apparently not updating; crawler keeps writing the same post
Time Thu Dec 15 12:39:55 2022
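Hello everyone,
I've recently been learning how to write web crawlers, so I picked the PTT web boards as a practice target.
However, after about 10 posts the crawler keeps fetching the title and link of the same post over and over.
https://imgur.com/a/Bnqo2B1
My guess is that the page is not loading new data.
But shouldn't an infinite-scroll dynamic page load more content as soon as you scroll down?
I can also see in the ChromeDriver window that the Selenium scroll is actually being executed, so what could be causing this?
My code is below: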
import urllib.request as req
import requests
import selenium
import schedule
import time
import json
from time import sleep
import openpyxl
import random
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
import bs4

pttWeb = openpyxl.load_workbook('pttweb.xlsx')
ws = pttWeb.active
i = 1
scroll_time = int(input("scroll_Times"))
options = Options()
options.chrome_executable_path = r"C:\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(options=options)
sleep(3)
driver.get('https://www.pttweb.cc/hot/all/today')
sleep(5)
prev_ele = None
for now_time in range(1, scroll_time+1):
    sleep(2)
    eles = driver.find_elements(by=By.CLASS_NAME, value='e7-right.ml-2')
    # If the list contains the last element from the previous round, crawl
    # from that element through the current last element
    try:
        # print(eles)
        # print(prev_ele)
        eles = eles[eles.index(prev_ele):]
    except:
        pass
    for ele in eles:
        try:
            titleInfo = ele.find_element(by=By.CLASS_NAME, value="e7-article-default")
            title = titleInfo.text
            href = titleInfo.get_attribute('href')
            ws.cell(i, 1, i)
            ws.cell(i, 2, title)
            ws.cell(i, 3, href)
            sleep(3)
            inner = req.Request(href, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
            })
            with req.urlopen(inner) as innerResponse:
                articleData = innerResponse.read().decode("utf-8")
            articleRoot = bs4.BeautifulSoup(articleData, "html.parser")
            main_content = articleRoot.find("div", itemprop="articleBody")
            boardInfo = articleRoot.find("span", class_="e7-board-name-standalone")
            authorInfo = articleRoot.find("span", itemprop="name")
            timeInfo = articleRoot.find("time", itemprop="datePublished")
            countInfo = articleRoot.find_all("span", class_="e7-head-content")
            board = boardInfo.text
            author = authorInfo.text
            Time = timeInfo.text
            count = countInfo[4].text
            allContent = main_content.text
            pre_text = allContent.split('--')[0]
            ws.cell(i, 4, board)
            ws.cell(i, 5, author)
            ws.cell(i, 6, Time)
            ws.cell(i, 7, count)
            ws.cell(i, 8, pre_text)
            pttWeb.save('pttweb.xlsx')
            sleep(random.uniform(5, 20))
            i = i+1
        except:
            pass
    prev_ele = eles[-1]
    print(f"now scroll {now_time}/{scroll_time}")
    js = "window.scrollTo(0, document.body.scrollHeight);"
    driver.execute_script(js)
    sleep(40)
driver.quit()
_____________________
Thanks in advance, everyone.
--
※ Origin: 批踢踢实业坊 (ptt.cc), From: 49.158.79.67 (Taiwan)
※ Article URL: https://webptt.com/cn.aspx?n=bbs/Python/M.1671079197.A.34F.html
1F:→ lycantrope: I'd suggest getting rid of the try-except: pass first; posting the code on pastebin would also make it easier to read 12/15 13:09
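A minimal sketch of what replacing `try-except: pass` could look like: catch only the exception types you expect and log each failure, so skipped posts become visible instead of silently disappearing. The names `process_all`, `expected`, and the logger name `ptt_scraper` are hypothetical and not part of the original code. In the real scraper, `expected` would probably need to include `selenium.common.exceptions.NoSuchElementException`; the default below is stdlib-only so the snippet runs on its own.

```python
import logging

logger = logging.getLogger("ptt_scraper")  # hypothetical logger name


def process_all(items, handler, expected=(AttributeError, LookupError)):
    """Run handler on every item; record failures instead of hiding them.

    Only the exception types listed in `expected` are caught. Anything
    else propagates, so real bugs still stop the program.
    """
    results = []
    failures = []
    for index, item in enumerate(items):
        try:
            results.append(handler(item))
        except expected as exc:
            # Log which item failed and why, then move on to the next one.
            logger.warning("item %d skipped: %r", index, exc)
            failures.append((index, exc))
    return results, failures
```

Checking `failures` after each scroll round would show right away whether most posts are being skipped because an element or attribute was not found.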
4F:→ surimodo: Blind guess: you're targeting the wrong class. Titles don't only use e7-article-default 12/16 01:28
5F:→ surimodo: There are also e7-article-viewed and e7-article-most-recently-viewed 12/16 01:29
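If the title link really can carry any of those three classes, one option is to build a single CSS selector that matches all of them. This is only a sketch: the helper name `any_class_selector` is hypothetical, the third class name is reassembled from the comment that wrapped across two lines, and whether these are the only classes the site uses has not been verified.

```python
def any_class_selector(class_names):
    """Build a CSS selector that matches an element having any of the classes."""
    return ", ".join("." + name for name in class_names)


# Class names taken from the comments above (assumed to be complete).
TITLE_CLASSES = [
    "e7-article-default",
    "e7-article-viewed",
    "e7-article-most-recently-viewed",
]

selector = any_class_selector(TITLE_CLASSES)
```

With Selenium, the selector could then be passed as `ele.find_element(By.CSS_SELECTOR, selector)` in place of the single `By.CLASS_NAME` lookup.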
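7F:→ surimodo: Also, don't just pass in your try/except 12/16 01:31
8F:→ surimodo: It's almost certainly raising a class-not-found error, so why pass 12/16 01:32
9F:→ surimodo: If you're not going to debug, you might as well delete the try/except entirely 12/16 01:33
10F:→ surimodo: Writing it and then just passing is a pointless extra step 12/16 01:33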
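On the repeated-write symptom itself, one thing visible in the posted code is that the slice `eles[eles.index(prev_ele):]` starts at `prev_ele` itself, so the last post of the previous round is processed again on every round. If no new posts have loaded, that post is the only one left to process. A hedged sketch of one way to avoid duplicates regardless of the cause is to remember a key (for example the `href`) for everything already written. The helper name `new_items` and its parameters are hypothetical, not from the original code.

```python
def new_items(items, seen_keys, key=lambda item: item):
    """Return only the items whose key has not been seen before.

    `seen_keys` is a set owned by the caller; it is updated in place so
    it can be reused across scroll rounds.
    """
    fresh = []
    for item in items:
        k = key(item)
        if k is None or k in seen_keys:
            continue  # missing key or already written in an earlier round
        seen_keys.add(k)
        fresh.append(item)
    return fresh
```

An empty result after a scroll would also be a direct sign that the page did not load anything new in that round.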