I'm writing a simple web crawler and have run into a problem. Could anyone help me solve it? Thanks very much.
Here is my code:
#coding=utf-8
import urllib
import re
import time

def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    reg = r'''"url":"(.*?\.jpg)","width"'''
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    return imglist

html = getHtml("http://image.baidu.com/channel/star")
print getImg(html)
urllist=getImg(html)
x=0
for imgurl in urllist:
    urllib.urlretrieve(imgurl,'%s.jpg' % x)
    time.sleep(10)
    x+=1
It then fails with this error:
Traceback (most recent call last):
  File "D:\BaiduYunDownload\Python\web crawler.py", line 24, in <module>
    urllib.urlretrieve(imgurl,'%s.jpg' % x)
  File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
    return opener.retrieve(url, filename, reporthook, data)
  File "C:\Python27\lib\urllib.py", line 245, in retrieve
    fp = self.open(url, data)
  File "C:\Python27\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 326, in open_http
    if not host: raise IOError, ('http error', 'no host given')
IOError: [Errno http error] no host given
---------------------------------------------------Separator----------------------------------------------------------
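From what I can see, the URLs captured by the regex look like this when printed:
'http:\\/\\/img0.bdstatic.com\\/img\\/image\\/2016ss1.jpg'
The page source escapes each "/" with a backslash (displayed as "\\" in the printed list), so before requesting the URL I need to convert it back to:
'http://img0.bdstatic.com/img/image/2016ss1.jpg'
Is there a function or module that can do this?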
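One possible approach, as a minimal sketch: it assumes the only escaping in the captured text is the JSON-style "\/" shown above. The helper names unescape_slashes and unescape_json_string are made up for illustration. The first uses a plain string replacement. The second lets the standard json module decode the fragment, which would also handle other JSON escapes but fails if the fragment contains an unescaped double quote.

```python
import json

def unescape_slashes(raw_url):
    # Replace every escaped slash "\/" with a plain "/".
    return raw_url.replace('\\/', '/')

def unescape_json_string(raw_url):
    # Wrap the fragment in double quotes so it forms a JSON string
    # literal, then let the json module decode all of its escapes.
    return json.loads('"%s"' % raw_url)

# The raw string below has the same contents as the URL printed above.
raw = r'http:\/\/img0.bdstatic.com\/img\/image\/2016ss1.jpg'
print(unescape_slashes(raw))
print(unescape_json_string(raw))
```

In the download loop, either helper could be applied to imgurl before it is passed to urllib.urlretrieve, so that the host part is written with plain slashes.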