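The txt file is encoded as UTF-8. After segmentation, the strings in the list are Unicode, and the step that filters out the ignored words then raises an error.
How should I convert the Unicode strings in the list?
Sample: [u'\ufeff', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'\u5927\u4f1a', u'\u7684']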
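The u'\ufeff' at the head of the sample is a UTF-8 byte order mark that survived the read. One way to avoid it, sketched here on the assumption that the file really is UTF-8 with a BOM, is to let `io.open` decode the file with the `utf-8-sig` codec, so that `s` is already a unicode string with the BOM removed. The temporary file and its contents are stand-ins for report.txt, not the original data.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (stand-in for report.txt).
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with io.open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + u'大会的主题'.encode('utf-8'))

# 'utf-8-sig' decodes the bytes to text and drops a leading BOM if present.
with io.open(path, encoding='utf-8-sig') as f:
    s = f.read()

print(repr(s[0]))  # first character of the text, no longer u'\ufeff'
```

`io.open` behaves the same way on Python 2 and Python 3, so the read step would not need to change if the script is ported later.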
import jieba
with open(r'C:\Users\wu\Desktop\report.txt') as f:
    s = f.read()
print s
word_list = list(jieba.cut(s))
print 'Total tokens:', len(word_list)
print 'Sample:', word_list[:20]
# 2. Count word frequencies
from collections import Counter
words_count = Counter(word_list)
most_words = words_count.most_common(128)
print(most_words)
# Remove punctuation, particles, prepositions, etc.
# This step is a manual intervention: we hand-pick some words to ignore
most_words = [words for words in most_words if words[0] not in ' ,、。“”()!;的和是在要为以把了对中到有上不等更二从大\n']
print(most_words)
Total tokens: 15358
Sample: [u'\ufeff', u' ', u' ', u' ', u' ', u' ', u' ', u' ', u'\u5927\u4f1a', u'\u7684', u'\u4e3b\u9898', u'\u662f', u'\uff1a', u'\u4e0d\u5fd8', u'\u521d\u5fc3', u'\uff0c', u'\u7262\u8bb0', u'\u4f7f\u547d', u'\uff0c', u'\u9ad8\u4e3e']
Traceback (most recent call last):
  File "C:/Users/wu/Desktop/python_practice/cross_in/report_19.py", line 20, in <module>
    most_words = [words for words in most_words if words[0] not in ' ,、。“”()!;的和是在要为以把了对中到有上不等更二从大\n']
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal not in range(128)