Some strange behavior when printing output

anyone (original poster), posted 2016-6-19 18:26:17

Last edited by anyone on 2016-6-19 18:28

I wrote a practice script that scrapes the "今日热点" (Today's Hot Topics) box in the lower-right corner of http://www.xiaohuayoumo.com:

import requests
from lxml import html

url = 'http://www.xiaohuayoumo.com/'
dict_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36'}

# get list_links
page = requests.get(url, headers=dict_headers)
tree = html.fromstring(page.content)
list_links = tree.xpath("//div[@class='front-top-box front-top-box-2 front-top-box-right']//a/@href")

# get the title and content for each link in list_links
for i in list_links:
    i = 'http://www.xiaohuayoumo.com' + i
    page = requests.get(i, headers=dict_headers)
    tree = html.fromstring(page.content)
    title = ''.join(tree.xpath("//h1[@class='page-title']/text()")).strip()
    content = ''.join(tree.xpath("//div[@property='content:encoded']//text()[normalize-space(.)]"))
    # # 1st: not all of the articles show up in Sublime's output <<<<<<<<<<<<<<
    # print '\n\n'+title.encode('utf-8')
    # print i
    # print content.encode('utf-8')
    # # 2nd: encoding error in the Windows terminal output <<<<<<<<<<<<<<<<<<<<
    # print '\n\n'+title.encode('cp936')
    # print i
    # print content.encode('cp936')


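The two commented-out sections are two different ways of printing the results.

The first prints directly to Sublime's output panel, but some of the output is silently dropped: there should be 10 items, yet only 4 appear. When I check the titles of the missing articles, len() does report a length for them, but they never show up on screen, and there is no error message at all.

The second is my attempt to print to the Windows terminal, but it fails with the error below, which I have never run into before: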

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 691:
illegal multibyte sequence
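For reference, the traceback above can be reproduced without any scraping. The character u'\xa0' is a non-breaking space (what `&nbsp;` becomes in parsed HTML), and the error message itself says the 'gbk' codec cannot encode it; in Python, cp936 is an alias of gbk, which is why the message names 'gbk' even though the code asked for 'cp936'. The sketch below is one possible workaround, not the only one; `to_console_bytes` and `sample` are hypothetical names introduced for illustration.

```python
# -*- coding: utf-8 -*-

def to_console_bytes(text, encoding='gbk'):
    """Encode text for a console, tolerating characters the codec lacks."""
    # u'\xa0' is a non-breaking space; swap it for an ordinary space first.
    cleaned = text.replace(u'\xa0', u' ')
    # 'replace' turns any other unencodable character into '?' instead of raising.
    return cleaned.encode(encoding, 'replace')


sample = u'joke\xa0text'  # made-up stand-in for scraped content

try:
    sample.encode('gbk')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False  # same kind of failure as in the traceback

safe = to_console_bytes(sample)
```

In the original loop this would correspond to printing `to_console_bytes(content)` instead of `content.encode('cp936')`. Replacing the non-breaking space explicitly keeps the text readable, while the 'replace' error handler is only a fallback for anything else the console codec cannot represent.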


crossin, what is causing these two problems, and how can I fix them?


anyone (reply #1), posted 2016-6-19 22:31:29

A follow-up: in the end I stopped printing to the screen and wrote the results to a file instead:

# inside the for loop; the raw string keeps '\t' from being read as a tab
with open(r'desktop\testing.txt', 'a') as f:
    f.write('\n\n' + title.encode('utf-8'))
    f.write(i)
    f.write(content.encode('utf-8'))


The resulting testing.txt file has no problems at all. Why would that be?
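A plausible reason the file comes out complete is that writing UTF-8 bytes to a file never passes through a console codec, and UTF-8 can represent every character, including u'\xa0'. The sketch below shows the same idea with `io.open` and an explicit encoding, which accepts unicode text directly. The helper name `append_article`, the temporary path, and the `/example` link are all assumptions made for the demonstration, not part of the original script.

```python
import io
import os
import tempfile


def append_article(path, title, link, content):
    # io.open with an explicit encoding takes unicode text directly,
    # so no .encode() calls are needed and no console codec is involved.
    with io.open(path, 'a', encoding='utf-8') as f:
        f.write(u'\n\n' + title + u'\n')
        f.write(link + u'\n')
        f.write(content + u'\n')


# Demonstration with made-up data written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'testing.txt')
append_article(demo_path,
               u'\u6807\u9898',                          # a sample title
               u'http://www.xiaohuayoumo.com/example',  # hypothetical link
               u'a\xa0b')                               # contains the problem character

with io.open(demo_path, encoding='utf-8') as f:
    written = f.read()
```

Reading the file back returns the text unchanged, non-breaking space included, which matches the observation that the file output is complete while the console output is not.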


anyone (reply #2), posted 2016-6-20 16:02:50

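crossin先生 wrote on 2016-6-20 14:02:
I don't know why some of the output goes missing. Print the intermediate data and debug it to see what is actually wrong with the content, for example whether the second request failed to fetch anything ...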

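I believe the content was fetched, because:

1. I used len() to check the title length on every loop iteration, and the titles that did not show up still have a length.
2. When I switched to write() and wrote the fetched data to a text file, everything was there.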
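The two checks above could also be done on screen in a form that does not depend on the console's encoding. The sketch below is only an illustration: `describe` is a hypothetical helper, and it relies on the standard `unicode_escape` codec to turn every non-ASCII character into a visible escape sequence, so hidden characters such as u'\xa0' become easy to spot.

```python
def describe(text):
    # unicode_escape yields an ASCII-only form that any console can display;
    # non-ASCII characters appear as \xNN or \uNNNN escapes.
    escaped = text.encode('unicode_escape').decode('ascii')
    return '%d chars: %s' % (len(text), escaped)


# Made-up sample containing a non-breaking space and a Chinese character.
report = describe(u'ab\xa0\u4e2d')
```

Calling `describe(title)` inside the loop and printing the result would show, for each article, both the length and the exact characters that were fetched.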

So I suspect it is the decoding issue you mentioned. I will start looking from that angle.

If possible, could you tell me how to specify a custom encoding for decoding with requests? Is it set when calling get(), or is it applied to get().content?
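For what it is worth, `requests.get()` itself takes no encoding argument. The two usual options are to set the `encoding` attribute on the response before reading `.text`, or to decode the raw bytes in `.content` by hand. The sketch below shows both; `fetch_text` and `decode_content` are hypothetical helper names, and 'utf-8' is an assumption about the site's real encoding that would need checking.

```python
def fetch_text(url, encoding='utf-8', headers=None):
    # Imported here so decode_content below can be used without requests installed.
    import requests
    resp = requests.get(url, headers=headers)
    # Option 1: override the encoding requests guessed, then read .text (unicode).
    resp.encoding = encoding
    return resp.text


def decode_content(raw_bytes, encoding='utf-8'):
    # Option 2: decode the bytes from resp.content yourself;
    # 'replace' substitutes U+FFFD for undecodable bytes instead of raising.
    return raw_bytes.decode(encoding, 'replace')


# Offline demonstration of option 2 with made-up bytes.
decoded = decode_content(u'\u7b11\u8bdd'.encode('utf-8'))
```

Note that the original script passes `page.content` (bytes) to `html.fromstring`, so lxml does the decoding there; passing already-decoded text instead is one way to take control of the encoding explicitly.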
