How to create n-grams using Python features and libraries

Updated: 2023-09-29 09:35:25 Author: 迹忆客

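In computational linguistics, n-grams are important for language processing and for contextual and semantic analysis. An n-gram is a contiguous sequence of n adjacent words taken from a string of tokens. This article discusses how to create n-grams using Python's built-in features and libraries.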

Creating n-grams from text in Python with a for loop

We can write an ngrams function that takes a text and a value n and returns a list containing the n-grams.

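To build this function, we split the text into a list of words (splitInput) and create an empty list (output) to hold the n-grams. A for loop then walks over every valid start index in splitInput.

At each index, we append the slice of num consecutive words (tokens) to the output list.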

def ngrams(input, num):
    # Split the text into tokens on single spaces
    splitInput = input.split(' ')
    output = []
    # Slide a window of num tokens over the list; the last valid
    # start index is len(splitInput) - num
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i:i + num])
    return output


text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))

Output:

[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
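
As an alternative to the index-based loop, the same sliding window can be sketched with zip. This is only an illustration: the name ngrams_zip is hypothetical and not part of the original article, and unlike the function above it takes an already-split token list and returns tuples rather than lists.

```python
def ngrams_zip(tokens, n):
    """Return the n-grams of a token list as tuples (hypothetical helper)."""
    # tokens[i:] is the token list shifted left by i positions. Zipping the
    # n shifted copies pairs each token with its n - 1 successors, and zip
    # stops at the shortest copy, so no partial n-gram is produced.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "Welcome to the abode".split()
print(ngrams_zip(tokens, 2))
# [('Welcome', 'to'), ('to', 'the'), ('the', 'abode')]
```

When the token list is shorter than n, the result is simply an empty list, which matches the behaviour of the for-loop version.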

Creating n-grams in Python with NLTK

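NLTK (the Natural Language Toolkit) provides easy-to-use interfaces to resources that matter for text processing, such as tokenizers. To install nltk, we can use the following pip command.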

pip install nltk

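To show a potential problem, let's call the word_tokenize() method before writing more detailed code. It produces a tokenized copy of the text using NLTK's recommended word tokenizer.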

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

Output:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
    tokens = nltk.word_tokenize(text)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/english.pickle

Searched in:
- 'C:\\Users\\akinl/nltk_data'
- 'C:\\Python310\\nltk_data'
- 'C:\\Python310\\share\\nltk_data'
- 'C:\\Python310\\lib\\nltk_data'
- 'C:\\Users\\akinl\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- ''
**********************************************************************

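The error above occurs because some NLTK methods depend on data that has not been downloaded yet, which is typical the first time you use the library. We therefore use the NLTK Downloader to fetch two data modules, punkt and averaged_perceptron_tagger. If your NLTK version names a different resource in this error message, download the resource it names instead.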

Methods such as word_tokenize() rely on this data. To resolve the problem, create a Python file and run the following code.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Alternatively, run the following commands from the command line:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger
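
To avoid re-running the download every time a script starts, one option is to check for the data first. The sketch below assumes that nltk.data.find() raises LookupError when a resource is missing, as the traceback above suggests. The helper name ensure_resource is hypothetical and not part of NLTK.

```python
import nltk


def ensure_resource(resource_path, package_name):
    """Download package_name only if resource_path is not installed yet.

    Hypothetical helper: resource_path is the location NLTK searches for
    (for example "tokenizers/punkt"), and package_name is the identifier
    passed to the downloader (for example "punkt").
    """
    try:
        # find() raises LookupError when the resource is not on disk
        nltk.data.find(resource_path)
    except LookupError:
        # quiet=True suppresses the downloader's progress output
        nltk.download(package_name, quiet=True)
```

Calling ensure_resource("tokenizers/punkt", "punkt") before word_tokenize() would then trigger a download only on the first run.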

Example code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print(list(textBigGrams), list(textTriGrams))

Output:

[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]

Example code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print("The Bigrams of the Text are")
print(*map(' '.join, textBigGrams), sep=', ')
print("The Trigrams of the Text are")
print(*map(' '.join, textTriGrams), sep=', ')

Output:

The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come
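
Bigrams and trigrams are the cases n = 2 and n = 3. For other values of n, NLTK also provides a general ngrams() helper in nltk.util. The sketch below is an illustration rather than part of the original article: it splits the text on whitespace instead of calling word_tokenize(), so it does not need the punkt data, and the variable name fourGrams is made up for this example.

```python
from nltk.util import ngrams

text = "well the money has finally come"
# Plain whitespace split (an assumption for this sketch), so no punkt data is needed
tokens = text.split()

# ngrams() returns a lazy iterator of tuples, so wrap it in list() to print it
fourGrams = list(ngrams(tokens, 4))
print(fourGrams)
# [('well', 'the', 'money', 'has'), ('the', 'money', 'has', 'finally'), ('money', 'has', 'finally', 'come')]
```

Because the result is an iterator, it can be consumed only once. Convert it to a list first if the n-grams are needed more than once.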
