Scraping a WeChat official account's daily article updates (a rough first attempt to get the discussion started)

Original poster

Posted on 2013-9-13 13:08:39

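I don't know how "传送门" (chuansongme.com) scrapes WeChat articles, and I can't find a WeChat API for it.
I tried scraping the articles from chuansongme.com instead, but the main article list is returned by an AJAX call. Requesting the AJAX URL directly gives a 404 page, so the server is probably checking the Referer header.
No solution so far...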

First reply

Posted on 2013-9-13 14:21:49

I forged the Referer and User-Agent headers with urllib2 and finally got it to work (Python 2):

import urllib2, HTMLParser

article_list = []

class MyParser(HTMLParser.HTMLParser):
    """Collects the href of every <a> tag into article_list."""
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    article_list.append(value)

def fetch_data(uri):
    # Forge the Referer and User-Agent; the server seems to check the Referer.
    request = urllib2.Request(uri)
    request.add_header('Referer', 'http://chuansongme.com/account/crossincode')
    request.add_header('Content-Type', 'application/x-www-form-urlencoded')
    request.add_header('User-Agent', 'fake-client')
    response = urllib2.urlopen(request)
    return response

# Fetch the AJAX article list for the account "crossincode".
list_str = fetch_data('http://chuansongme.com/more/account-crossincode/recent?lastindex=0').read()
print(list_str)

# Pull the article links out of the returned HTML.
my = MyParser()
my.feed(list_str.decode('utf-8'))

# Download the first article and save the raw bytes to disk.
article = fetch_data(article_list[0]).read()
print(article)
with open('weixin.html', 'wb') as f:
    f.write(article)
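
For anyone on Python 3, where urllib2 and the HTMLParser module no longer exist under those names, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions of mine, not something tested against the live site: the names LinkParser, build_request, extract_links, fetch_text and main are made up for illustration, the hrefs in the list may be relative (so they are resolved with urljoin), and the Content-Type header is left out on the guess that it does not matter for a GET request. Add it back if the server rejects the request.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

BASE_URL = "http://chuansongme.com"  # site used in the post above
LIST_URL = BASE_URL + "/more/account-crossincode/recent?lastindex=0"
REFERER = BASE_URL + "/account/crossincode"


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: links may be relative, so make them absolute.
                self.links.append(urljoin(self.base_url, value))


def build_request(url):
    """Build a request carrying the forged Referer and User-Agent headers."""
    return Request(url, headers={"Referer": REFERER, "User-Agent": "fake-client"})


def extract_links(html, base_url=BASE_URL):
    """Return every link found in an HTML string, in document order."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def fetch_text(url, timeout=10):
    """Download a page with the forged headers and decode it as UTF-8."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


def main():
    """Fetch the article list, then save the first article to weixin.html."""
    links = extract_links(fetch_text(LIST_URL))
    if links:
        with open("weixin.html", "w", encoding="utf-8") as f:
            f.write(fetch_text(links[0]))
```

Calling main() does the same thing as the script above: it fetches the list, takes the first link, and writes that article to weixin.html. Keeping build_request and extract_links separate from the network call makes it easy to check the headers and the link parsing without touching the site.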
