Crossin的编程教室

Title: Help! Error when counting word frequencies in a txt file

Author: ouyezhe886    Time: 2018-7-25 17:45
# -*- coding: utf-8 -*-

import sys,re
import importlib

importlib.reload(sys)

text = open('C:/Users/Administrator/Desktop/wordstata.txt','rb').read()
wfile=open('result.txt','w')
txet = text.decode('utf-8')

r = re.compile('[\x80-\xff]+')
m = r.findall(text)
dict={}
z1 = re.compile('[\x80-\xff]{2}')
z2 = re.compile('[\x80-\xff]{4}')
z3 = re.compile('[\x80-\xff]{6}')
z4 = re.compile('[\x80-\xff]{8}')
for i in m:
    x = i.encode('gb18030')
    i = z1.findall(x)
    for j in i:
        if(j in dict):
            dict[j]+=1

dict=sorted(dict.items(), key=lambda d:d[1])
for a,b in dict:
    if b>0:
        wfile.write(a+','+str(b)+'\n')


Traceback (most recent call last):
  File "C:/Users/Administrator/Desktop/wordstatabytxt.py", line 10, in <module>
    txet = text.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


Author: crossin先生    Time: 2018-7-26 18:44
open('C:/Users/Administrator/Desktop/wordstata.txt','r', encoding='utf8')

Don't open the file in binary ('b') mode; pass the encoding directly to open(). If utf8 doesn't work, change it to gbk.
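
A minimal sketch of this advice, not a definitive fix. The helper names (read_text, count_chinese_chars, write_counts) and the ENCODINGS list are made up for illustration. Two assumptions go beyond the reply: 'utf-16' is added as a last resort, because a leading 0xff byte is consistent with a UTF-16 byte-order mark (FF FE), though the actual encoding of the poster's file is unknown; and "word frequency" is taken to mean counting single Chinese characters in the range U+4E00 to U+9FFF, which is what the two-byte regex in the question appears to aim for. Note also that the loop in the question only increments keys that already exist in the dictionary, so nothing is ever counted; Counter avoids that.

```python
import re
from collections import Counter

# Encodings tried in order (an assumption, adjust for your own files):
# 'utf-8-sig' reads UTF-8 with or without a BOM, 'gbk' is the fallback
# suggested in the reply, 'utf-16' covers files that start with a BOM
# such as FF FE.
ENCODINGS = ('utf-8-sig', 'gbk', 'utf-16')


def read_text(path, encodings=ENCODINGS):
    """Return the file's text using the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            # Text mode with an explicit encoding, as the reply recommends.
            with open(path, 'r', encoding=enc) as f:
                return f.read()
        except UnicodeError:
            # Wrong guess: fall through to the next candidate encoding.
            continue
    raise ValueError('none of %r could decode %s' % (encodings, path))


def count_chinese_chars(text):
    """Count each character in the CJK Unified Ideographs range U+4E00-U+9FFF."""
    return Counter(re.findall('[\u4e00-\u9fff]', text))


def write_counts(counts, out_path):
    """Write 'character,count' lines, least frequent first as in the question."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for char, n in sorted(counts.items(), key=lambda item: item[1]):
            out.write('%s,%d\n' % (char, n))
```

For the file in the question, this could be called as write_counts(count_chinese_chars(read_text('C:/Users/Administrator/Desktop/wordstata.txt')), 'result.txt'). The order of ENCODINGS matters: a wrong encoding can sometimes decode without raising an error and silently produce garbage, so put the most likely encoding first and check the output.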




Welcome to Crossin的编程教室 (https://bbs.crossincode.com/)