Thank you, crossin. I haven't been online for the past couple of days; I worked out my earlier question myself after reading the later lessons.

Now I have a new problem. I'm working on the exercise that scrapes Douban's recommended movies. The code is finished, and the listing has ten pages in total. Strangely, the loop runs fine for the first eight pages, but on the ninth page it reports a problem with a list's length. I looked at the place where it fails and the length seemed correct, so I can't tell where the problem is. Could crossin or anyone more experienced take a look?

    # https://movie.douban.com/top250?start=0&filter=
    # In the URL above, start advances by 25 for each page
    import urllib.request
    import re
    import time

    # Scraping function
    def zhua(num):
        # Scrape the movies on one page (this part was later turned into a function)
        url = 'https://movie.douban.com/top250?start=%d&filter=' % num
        web = urllib.request.urlopen(url).read().decode('UTF-8')
        content = str(web)
        # Movie titles
        titles = re.findall(r'<span class="title">\w+', content)
        titles = [i[20:] for i in titles]  # strip the '<span class="title">' prefix from each match
        # Directors
        daoyan = re.findall(r'导演:\s[^&]+', content)
        # Lead actors
        actors = re.findall(r'主[^<]+', content)
        # Release year
        playtime = re.findall(r'\s{29}[0-9]+', content)
        playtime = [i[29:] for i in playtime]
        # Country and genre
        candj = re.findall(r' / .* / .*', content)
        chandi = []
        juqing = []
        for i in range(0,25):
            chandi.append(candj[i].split(' / ')[1])
            juqing.append(candj[i].split(' / ')[2])
        # One-line review quote
        yp = re.findall(r'<span class="inq">[^<]+', content)
        yp = [i[18:] for i in yp]
        outdata = []
        for i in range(0,25):
            outdata.append('电影名 '+titles[i]+'\n'\
                +daoyan[i]+'\n'+actors[i]+'\n'\
                +'上映日期 '+playtime[i]+'\n'\
                +'产地/语言 '+chandi[i]+'\n'\
                +juqing[i]+'\n'\
                +'影评 '+yp[i]+'\n\n')
        # Append this page's records to the output file
        with open('out.txt','a',encoding='utf-8') as out:
            for i in outdata:
                out.write(i)

    # Next, loop to run the function ten times, once per page
    for i in range(0,10):
        print(i)
        num = i*25
        zhua(num)
The error I get is:

    Traceback (most recent call last):
      File "C:\Users\T\Desktop\913抓取豆瓣电影.py", line 51, in <module>
        zhua(num)
      File "C:\Users\T\Desktop\913抓取豆瓣电影.py", line 40, in zhua
        +'影评 '+yp[i]+'\n\n')
    IndexError: list index out of range
    >>>
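One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: the traceback points at `yp[i]`, so it may be worth comparing how many matches each `re.findall` returns on the failing page. If some entry has no `<span class="inq">` quote, `yp` would end up shorter than 25 while the other lists stay full. The snippet below uses a made-up, heavily simplified HTML sample (not the real Douban markup), and the helper names `page_wide_counts` and `parse_per_item` are hypothetical.

```python
import re

# Hypothetical, simplified stand-in for one results page (NOT real Douban HTML).
# The second entry deliberately has no <span class="inq"> quote.
SAMPLE_HTML = '''
<div class="item"><span class="title">MovieA</span><span class="inq">Quote A</span></div>
<div class="item"><span class="title">MovieB</span></div>
<div class="item"><span class="title">MovieC</span><span class="inq">Quote C</span></div>
'''

def page_wide_counts(html):
    """Count matches the way the script does: one findall per field over the whole page."""
    titles = re.findall(r'<span class="title">\w+', html)
    quotes = re.findall(r'<span class="inq">[^<]+', html)
    return len(titles), len(quotes)

def parse_per_item(html):
    """Split the page into entries first, then search inside each entry.

    A missing field becomes an empty string, so every entry keeps its own
    title and quote together and no index can run past the end of a list.
    """
    rows = []
    for block in re.findall(r'<div class="item">.*?</div>', html, flags=re.S):
        title = re.search(r'<span class="title">([^<]+)', block)
        quote = re.search(r'<span class="inq">([^<]+)', block)
        rows.append((title.group(1) if title else '',
                     quote.group(1) if quote else ''))
    return rows

# Different counts mean the page-wide lists are no longer aligned entry by entry.
print(page_wide_counts(SAMPLE_HTML))
for row in parse_per_item(SAMPLE_HTML):
    print(row)
```

On a real page the entries contain nested tags, so the simple `.*?</div>` split here would probably not be enough; an HTML parser may be a more robust way to isolate each entry.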
Thanks.