Last edited by Jacky on 2019-5-24 17:36
I'm new to web scraping. Since the first page and the following pages use different URL rules, I handle it like this:

```python
import requests
from requests.exceptions import RequestException
from bs4 import BeautifulSoup


def page_start(page_start_num):
    # Page 1 uses base_url (defined elsewhere) directly;
    # page N inserts "_N" before the ".html" suffix.
    if page_start_num == 1:
        url = base_url
    else:
        url = base_url[0:-5] + '_' + str(page_start_num) + '.html'
    headers = {
        'Referer': url,
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
        'cookie': 'UM_distinctid=16adf1c99906c-0650f771e023bc-353166-1fa400-16adf1c999334f; CNZZDATA1255357127=703661195-1558519406-%7C1558519406',
    }
    print('Current page URL: %s' % url)
    return url, headers


def get_page(url, headers):
    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
    except RequestException:
        print('Request for page failed')
    return None


def parse_page(html):
    soup = BeautifulSoup(html, 'lxml')
    # print(soup.prettify())
    print(soup.title.string)
    title = soup.title.string
    # A string passed as attrs is treated as a CSS class by bs4
    img_info = soup.find_all(name='img', attrs='content_img')
    for i in range(len(img_info)):
        pic = img_info[i].attrs['src']  # index into the result list
        yield {
            'title': title,
            'url': pic,
            'num': i + 1,
        }
```
How many pages get fetched is controlled by `page_start_num`.
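In other words, only the suffix before `.html` changes between pages. A self-contained sketch of just that rule (with a made-up `base_url`, since I haven't posted the real one) looks like:

```python
# Hypothetical base URL, only to illustrate the paging rule used above.
base_url = 'https://www.example.com/photo/12345.html'

def page_url(page_start_num):
    # Page 1 is base_url unchanged; page N becomes ..._N.html
    if page_start_num == 1:
        return base_url
    return base_url[0:-5] + '_' + str(page_start_num) + '.html'

print(page_url(1))  # https://www.example.com/photo/12345.html
print(page_url(3))  # https://www.example.com/photo/12345_3.html
```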
The code is ugly, but it runs. Sometimes, though, I hit problems on the site itself: some pages don't exist, and the site's way of handling a nonexistent page is to redirect it to a dedicated navigation page.
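As far as I can tell, requests follows that redirect automatically, so `response.url` should end up holding the navigation page's address. Comparing it with the URL I actually requested would presumably reveal the redirect (pure sketch with made-up URLs, not tested against the real site):

```python
# Sketch: after requests follows redirects, response.url is the final URL,
# so a mismatch with the requested URL suggests we landed on the nav page.
def looks_redirected(requested_url, final_url):
    return requested_url != final_url

print(looks_redirected('https://www.example.com/photo/12345_9.html',
                       'https://www.example.com/nav.html'))   # True
print(looks_redirected('https://www.example.com/photo/12345_2.html',
                       'https://www.example.com/photo/12345_2.html'))  # False
```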
When that happens, the run fails with this error:

```
  File "C:\Python\Python37\lib\site-packages\bs4\__init__.py", line 245, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
```
The crawl then terminates. How should I handle this exception? How can I make the code skip that URL and continue with the remaining ones?
Thanks!
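My current guess is that `get_page()` returns `None` whenever the request fails or the status isn't 200, and `BeautifulSoup(None, 'lxml')` then blows up with exactly that TypeError. Would a guard like the following (a minimal sketch with a fake fetcher, not my real code) be the right way to skip the bad page and keep going?

```python
def crawl(pages, fetcher):
    """Try each page; skip the ones whose HTML is None instead of crashing."""
    results = []
    for n in pages:
        html = fetcher(n)
        if html is None:               # fetch failed -> skip, keep going
            print('skipping page %d' % n)
            continue
        results.append((n, html))      # parse_page(html) would go here
    return results

# Simulated fetcher: page 2 "does not exist" and returns None.
fake = {1: '<html>one</html>', 3: '<html>three</html>'}
print(crawl([1, 2, 3], fake.get))      # page 2 is skipped
```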