Converting Text File Encodings in Python: An Implementation Walkthrough

Updated: 2019-08-27 17:06:33 Author: danvy617

This article walks through converting text file encodings in Python. The sample code is explained in detail and should be a useful reference for study or work.

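Recently, while preparing a weekly report, I needed to extract data from CSV files, build tables from it, and then generate charts.

When reading a CSV file I almost always open it with with open(filename, encoding ='UTF-8') as f:. In practice, though, some CSV files are not UTF-8 encoded, so the program fails at run time. Each time I had to change the file's encoding to UTF-8 by hand and run the program again, so I decided to detect and convert the encoding inside the program itself.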

The basic idea: first check whether the file is UTF-8 encoded; if it is not, convert it to UTF-8, and only then process it.
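The first half of that idea, checking whether a file is already UTF-8, can be sketched with the standard library alone. This is only an illustration and not the approach used in the rest of the article, which relies on chardet; the function names is_valid_utf8 and file_is_utf8 are hypothetical. Note that a pure-ASCII file also passes this check, since ASCII is a subset of UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 in strict mode."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


def file_is_utf8(path: str) -> bool:
    """Read the whole file as raw bytes and test whether they are valid UTF-8."""
    with open(path, "rb") as f:
        return is_valid_utf8(f.read())


if __name__ == "__main__":
    print(is_valid_utf8("中文".encode("utf-8")))  # True
    print(is_valid_utf8("中文".encode("gbk")))    # False
```

A strict decode only answers "is this UTF-8 or not"; it cannot say which other encoding the file uses, which is why a detector is still needed for the conversion step.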

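Python's chardet library can report a file's encoding:

The detect function takes a single argument, a non-Unicode string (a bytes object in Python 3), and returns a dictionary such as {'encoding': 'utf-8', 'confidence': 0.99}, containing the detected encoding and the confidence of that guess.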

import chardet

def get_encode_info(file):
    # chardet.detect() expects bytes, so open the file in binary mode
    with open(file, 'rb') as f:
        return chardet.detect(f.read())['encoding']

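This performs acceptably on small files but becomes slow as the file grows: my local CSV file is close to 200 KB, and the delay is already clearly noticeable. For such cases chardet provides the UniversalDetector class: create a UniversalDetector object, then call its feed method repeatedly with each block of text. Once the detector reaches its minimum confidence threshold, it sets detector.done to True.

Once you have exhausted the source text, call detector.close(), which performs some final calculations in case the detector did not reach its minimum confidence threshold earlier. The result, available as detector.result, is a dictionary containing the detected encoding and the confidence (the same as what chardet.detect returns).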

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        # Feed the file line by line and stop once the detector is confident
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        # close() finalizes the result if the threshold was never reached
        detector.close()
        return detector.result['encoding']

While converting the encoding I ran into this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 178365: character maps to <undefined>

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    file_decode = file_content.decode(original_encode)  # <-- this line raises the error
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)
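The same kind of failure can be reproduced in isolation. The article does not say which encoding was detected for the failing file, so the sketch below assumes cp1252 purely for illustration: it is a charmap-based codec in which byte 0x90 has no mapping. The helper name describe_decode_failure is hypothetical.

```python
def describe_decode_failure(data: bytes, encoding: str):
    """Return None if data decodes cleanly, else (offset, byte value, reason)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        return exc.start, data[exc.start], exc.reason
    return None


if __name__ == "__main__":
    # Assumption: cp1252 stands in for whatever encoding was actually detected.
    failure = describe_decode_failure(b"abc\x90", "cp1252")
    if failure is not None:
        offset, byte, reason = failure
        print(f"byte {byte:#x} at position {offset}: {reason}")
```

Inspecting the offset this way shows whether a single stray byte is to blame or whether the detected encoding is simply wrong for the whole file.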

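This happens because some of the bytes could not be decoded. The fix is to pass a second argument, errors. The official documentation says: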

bytearray.decode(encoding="utf-8", errors="strict")

Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

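In other words, the byte array is decoded into a string using the given encoding (UTF-8 by default), and errors selects the error handling scheme. The default, 'strict', raises a UnicodeError on bytes that cannot be decoded; switching to 'ignore' or 'replace' avoids the exception, at the cost of dropping or substituting those bytes.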

So the line file_decode = file_content.decode(original_encode) only needs to be changed to file_decode = file_content.decode(original_encode,'ignore').
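To make the difference between the three handlers concrete, here is a small sketch on a made-up byte string: valid UTF-8 followed by one invalid byte. The helper decode_utf8 and the sample data are illustrative only.

```python
# Valid UTF-8 for "中文" followed by 0xff, a byte that never occurs in UTF-8.
data = "中文".encode("utf-8") + b"\xff"


def decode_utf8(raw: bytes, errors: str) -> str:
    """Decode raw as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=errors)


ignored = decode_utf8(data, "ignore")    # the bad byte is silently dropped
replaced = decode_utf8(data, "replace")  # the bad byte becomes U+FFFD

try:
    decode_utf8(data, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

'ignore' keeps the program running but loses data without any trace, while 'replace' leaves a visible U+FFFD marker where a byte was lost, which can make damaged rows easier to spot later.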

Complete code:

from chardet.universaldetector import UniversalDetector

def get_encode_info(file):
    with open(file, 'rb') as f:
        detector = UniversalDetector()
        # Feed the file line by line and stop once the detector is confident
        for line in f:
            detector.feed(line)
            if detector.done:
                break
        detector.close()
        return detector.result['encoding']

def read_file(file):
    with open(file, 'rb') as f:
        return f.read()

def write_file(content, file):
    with open(file, 'wb') as f:
        f.write(content)

def convert_encode2utf8(file, original_encode, des_encode):
    file_content = read_file(file)
    # 'ignore' drops bytes that cannot be decoded instead of raising
    file_decode = file_content.decode(original_encode, 'ignore')
    file_encode = file_decode.encode(des_encode)
    write_file(file_encode, file)  # overwrites the original file in place

if __name__ == "__main__":
    filename = r'C:\Users\danvy\Desktop\Automation\testdata\test.csv'
    encode_info = get_encode_info(filename)
    # The detected encoding is None when chardet cannot make a guess
    if encode_info and encode_info != 'utf-8':
        convert_encode2utf8(filename, encode_info, 'utf-8')
    encode_info = get_encode_info(filename)
    print(encode_info)
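Because the script above overwrites the source file and drops undecodable bytes, a wrong detection can destroy data. A more cautious variant, sketched here as one possible alternative rather than the method used above, writes the converted text to a separate file and leaves the original untouched. The function name convert_to_utf8_copy and its parameters are hypothetical; the source encoding is passed in by the caller (for example, the value returned by get_encode_info).

```python
def convert_to_utf8_copy(src: str, dst: str, source_encoding: str,
                         errors: str = "strict") -> str:
    """Decode src with source_encoding and write it to dst as UTF-8.

    Files are opened in binary mode so that line endings are preserved
    exactly. With errors="strict" a wrong source_encoding raises
    UnicodeDecodeError before dst is created.
    """
    with open(src, "rb") as f:
        text = f.read().decode(source_encoding, errors)
    with open(dst, "wb") as f:
        f.write(text.encode("utf-8"))
    return dst


if __name__ == "__main__":
    import os
    import tempfile

    # Demonstration on a temporary GBK-encoded file (made-up sample data).
    folder = tempfile.mkdtemp()
    src = os.path.join(folder, "report_gbk.csv")
    dst = os.path.join(folder, "report_utf8.csv")
    with open(src, "wb") as f:
        f.write("名称,数量\r\n苹果,3\r\n".encode("gbk"))

    convert_to_utf8_copy(src, dst, "gbk")
    with open(dst, encoding="utf-8", newline="") as f:
        print(f.read())
```

Keeping the default errors="strict" means a misdetected encoding fails loudly; passing "replace" or "ignore" is then an explicit decision by the caller.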

Reference: https://chardet.readthedocs.io/en/latest/usage.html

