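Counting English Word Occurrences in a Plain-Text File with Python: A Summary of Methods [Tested]
This article walks through several ways to count how often each English word appears in a plain-text file using Python. The details are as follows:
Version 1: inefficient (scans the text one character at a time)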
#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # convert to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
    # handle the last word (the text may not end with whitespace)
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
    for k, v in words_dict.items():
        print(k, v)
Output:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
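The counting branch in Version 1 appears twice: once inside the loop and once for the last word. The following is a minimal sketch of one way to avoid that duplication. The function name count_words_by_char and the trailing-space sentinel are illustrative choices, not part of the original script. One behavioural difference is assumed here: the sketch ends a word at any non-alphanumeric character, whereas Version 1 ends a word only at whitespace (so "don't" becomes "dont" in Version 1 but "don" and "t" in this sketch).

```python
def count_words_by_char(text):
    """Count lowercase alphanumeric words by scanning text one character at a time."""
    counts = {}
    current = []
    # The appended space is a sentinel, so the final word is flushed by the
    # same branch as every other word (hypothetical simplification).
    for ch in text + " ":
        if ch.isalnum():
            current.append(ch)
        elif current:
            word = "".join(current).lower()
            # dict.get supplies 0 for a word that has not been seen yet
            counts[word] = counts.get(word, 0) + 1
            current = []
    return counts


print(count_words_by_char("We are busy, we are"))
```

Passing the result of f.read() to this function would give the same kind of dictionary that Version 1 builds.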
Version 2:
Drawback: the whole file is read into memory at once, so it performs poorly on large files.
#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()
    word_reg = re.compile(r'\w+')
    # word_reg = re.compile(r'\w+\b')
    word_list = word_reg.findall(data)
    word_list = [word.lower() for word in word_list]  # convert to lowercase
    word_set = set(word_list)  # avoid counting the same word more than once
    # words_dict = {}
    # for word in word_set:
    #     words_dict[word] = word_list.count(word)
    # concise form
    words_dict = {word: word_list.count(word) for word in word_set}
    for k, v in words_dict.items():
        print(k, v)
Output:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
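Besides reading the whole file at once, Version 2 calls word_list.count() once per distinct word, so the list is rescanned many times. The sketch below shows one possible way to address both points: it walks the input line by line and counts in a single pass. The names WORD_RE, iter_words and count_words are hypothetical, and io.StringIO is used only as a stand-in for an open file.

```python
import io
import re

WORD_RE = re.compile(r"\w+")


def iter_words(lines):
    """Yield lowercase words from an iterable of lines, one line at a time."""
    for line in lines:
        for match in WORD_RE.finditer(line):
            yield match.group().lower()


def count_words(lines):
    """Count words in one pass instead of calling list.count per distinct word."""
    counts = {}
    for word in iter_words(lines):
        counts[word] = counts.get(word, 0) + 1
    return counts


# io.StringIO stands in for a file object here; with a real file, the object
# returned by open('test.txt', encoding='utf-8') could be passed instead.
sample = io.StringIO("We are busy all day,\nwe grew up.\n")
print(count_words(sample))
```

Because a file object yields one line per iteration, only the current line and the dictionary of counts need to be held in memory.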
Version 3: read the file line by line
#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        # line_words = word_reg.findall(line)
        # simpler than the regular expression above
        line_words = line.split()
        word_list.extend(line_words)
    word_set = set(word_list)  # avoid counting the same word more than once
    words_dict = {word: word_list.count(word) for word in word_set}
    for k, v in words_dict.items():
        print(k, v)
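Output (line.split() separates on whitespace only and nothing is lowercased, so punctuation stays attached and "We" and "we" are counted separately):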
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
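The output above contains tokens such as "day," and "We" because str.split() does no cleaning. A minimal sketch of one way to normalize each token before counting is shown below. The helper name normalize is hypothetical, and string.punctuation covers ASCII punctuation only, which is an assumption that suits this English sample.

```python
import string


def normalize(token):
    """Strip leading/trailing ASCII punctuation and lowercase a whitespace-split token."""
    return token.strip(string.punctuation).lower()


line = "We are busy all day, like swarms of flies without souls,"
# Drop tokens that become empty after stripping (for example a lone "...")
words = [w for w in (normalize(t) for t in line.split()) if w]
print(words)
```

Only punctuation at the edges of a token is removed, so an apostrophe inside a word such as "don't" is kept.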
Version 4: counting with Counter
#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    for line in f:
        line_words = line.split()
        word_list.extend(line_words)
    words_dict = dict(collections.Counter(word_list))  # count with Counter
    for k, v in words_dict.items():
        print(k, v)
Output:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
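Version 4 still collects every token in a list before counting and prints the words in first-seen order. As a rough sketch, Counter.update() can be fed one line at a time, and Counter.most_common() returns the words sorted by frequency. Here the article's sample text is embedded as a string (wrapped across several lines for readability), the names SAMPLE and count_lines are hypothetical, and the regular expression with lowercasing is borrowed from Version 2 rather than from Version 4's split().

```python
import collections
import io
import re

# The article's sample text, wrapped over several lines for the demo
SAMPLE = (
    "We are busy all day, like swarms of flies without souls, noisy, restless,\n"
    "unable to hear the voices of the soul. As time goes by, childhood away,\n"
    "we grew up, years away a lot of memories, once have also eroded the bottom\n"
    "of the childish innocence, we regardless of the shackles of mind, indulge\n"
    "in the world buckish, focus on the beneficial principle, we have lost\n"
    "themselves.\n"
)


def count_lines(lines):
    """Update a Counter line by line so no full token list is kept."""
    counter = collections.Counter()
    for line in lines:
        counter.update(re.findall(r"\w+", line.lower()))
    return counter


counter = count_lines(io.StringIO(SAMPLE))
# most_common(3) returns the three most frequent words with their counts
for word, n in counter.most_common(3):
    print(word, n)
```

For this sample the three most frequent words are "the" (7), "of" (6) and "we" (4), which agrees with the counts printed by Versions 1 and 2.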
Note: the test file test.txt used here contains the following text:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
PS: here are two related counting tools for reference:
Online word count tool:
http://tools.jb51.net/code/zishutongji
Online character count and editing tool:
http://tools.jb51.net/code/char_tongji
We hope this article is helpful for your Python programming.