Example Code for Implementing MapReduce in Python

Updated: 2024-05-06 09:38:37  Author: changwanyi

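MapReduce is a distributed computing model for large-scale data processing. It was originally designed and implemented by Google engineers, and Google has published the MapReduce paper. This article walks through example code that implements the MapReduce idea in Python, for readers who want a working reference.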

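1. MapReduce

The name can be broken down into two parts: Map and Reduce.

Map phase: the input data set is split into small pieces that are processed by multiple Map tasks. Each Map task maps its input to a series of (key, value) pairs, producing intermediate results.
Reduce phase: the intermediate results are regrouped and sorted so that all intermediate results with the same key are passed to the same Reduce task. Each Reduce task merges and computes over the values for its key and produces the final output.

For example, count how many times each word occurs in a long string.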

from collections import defaultdict
def mapper(word):
    return word, 1
 
def reducer(key_value_pair):
    key, values = key_value_pair
    return key, sum(values)
def map_reduce_function(input_list, mapper, reducer):
    '''
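    - input_list: list of words
    - mapper: mapping function that turns each element of the input list into a (key, value) pair
    - reducer: aggregation function that collapses each (key, list of values) group into one (key, value) pair
    - return: the aggregated result (a lazy map object)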
    '''
    map_results = map(mapper, input_list)
    shuffler = defaultdict(list)
    for key, value in map_results:
        shuffler[key].append(value)
    return map(reducer, shuffler.items())
 
if __name__ == "__main__":
    words = "python best language".split(" ")
    result = list(map_reduce_function(words, mapper, reducer))
    print(result)

The output is:

[('python', 1), ('best', 1), ('language', 1)]

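However, this does not yet show what makes MapReduce useful, namely processing the pieces in parallel; it only demonstrates how MapReduce works.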
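To make the three steps visible, here is a minimal sketch that runs the same word count on an input with repeated words and keeps the intermediate results. The helper names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative and do not appear in the original code.

```python
from collections import defaultdict

def map_phase(words):
    # Map: emit one (word, 1) pair per input word
    return [(word, 1) for word in words]

def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    # Reduce: collapse each group of values into a single count
    return {key: sum(values) for key, values in groups.items()}

words = "python is easy and python is fun".split()
pairs = map_phase(words)
groups = shuffle_phase(pairs)
counts = reduce_phase(groups)
print(groups)   # 'python' and 'is' each map to [1, 1]
print(counts)   # 'python' and 'is' each map to 2
```

With repeated words, the grouping step collects several values under one key, which is what gives the reducer something to sum.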

2. Implementing MapReduce with multiple threads

from collections import defaultdict
import threading
 
class MapReduceThread(threading.Thread):
    def __init__(self, input_list, mapper, shuffler):
        super(MapReduceThread, self).__init__()
        self.input_list = input_list
        self.mapper = mapper
        self.shuffler = shuffler
 
    def run(self):
        map_results = map(self.mapper, self.input_list)
        for key, value in map_results:
            self.shuffler[key].append(value)
 
def reducer(key_value_pair):
    key, values = key_value_pair
    return key, sum(values)
def mapper(word):
    return word, 1
def map_reduce_function(input_list, num_threads):
    shuffler = defaultdict(list)
    threads = []
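    # Ceiling division: creates at most num_threads chunks and never a step of 0
    chunk_size = max(1, (len(input_list) + num_threads - 1) // num_threads)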
    
    for i in range(0, len(input_list), chunk_size):
        chunk = input_list[i:i+chunk_size]
        thread = MapReduceThread(chunk, mapper, shuffler)
        thread.start()
        threads.append(thread)
    
    for thread in threads:
        thread.join()
 
    return map(reducer, shuffler.items())
 
if __name__ == "__main__":
    words = "python is the best language for programming and python is easy to learn".split(" ")
    result = list(map_reduce_function(words, num_threads=4))
    for i in result:
        print(i)

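The idea here is exactly the same: the word list is split into chunks according to num_threads, each chunk is handed to a separate thread, and the results are reduced at the end. However, because of Python's GIL (global interpreter lock), which lets only one thread execute Python bytecode at any moment, threads in CPython run concurrently but not in parallel. Using threads for MapReduce on CPU-bound work like this therefore brings no speedup.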
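One design alternative to the threaded version above, sketched here on the assumption that the standard `concurrent.futures` module is acceptable: each thread builds its own `Counter` and the partial counts are merged afterwards, so the threads never write to a shared dictionary. The function names `count_chunk` and `threaded_word_count` are illustrative. Because of the GIL, this is not expected to be faster than a plain loop for CPU-bound counting; it only shows the structure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    # Map: each worker counts the words in its own chunk
    return Counter(chunk)

def threaded_word_count(words, num_threads=4):
    if not words:
        return Counter()
    # Ceiling division so that at most num_threads chunks are created
    chunk_size = (len(words) + num_threads - 1) // num_threads
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        partial_counts = list(executor.map(count_chunk, chunks))
    # Reduce: merge the per-chunk counters
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

words = "python is the best language for programming and python is easy to learn".split()
result = threaded_word_count(words, num_threads=4)
print(result["python"], result["is"])
```

Returning a private result from each worker keeps the map step free of shared mutable state, and the same shape carries over directly to a process pool.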

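3. Implementing MapReduce with multiple processes

Because of the GIL, Python threads cannot run truly in parallel. There are two ways around this: one is to use another language such as C, which we do not consider here; the other is to use multiple processes so that the work is spread across several CPU cores.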

from collections import defaultdict
import multiprocessing
 
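def mapper(chunk):
    # Map: count the words in one chunk (a list of words)
    word_count = defaultdict(int)
    for word in chunk:
        word_count[word] += 1
    return word_count
 
def reducer(word_counts):
    # Reduce: merge the per-chunk counts into one total
    result = defaultdict(int)
    for word_count in word_counts:
        for word, count in word_count.items():
            result[word] += count
    return result
 
def chunks(lst, n):
    # Yield successive slices of lst with at most n items each
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
 
def map_reduce_function(text, num_processes):
    # Split into words first so that no word is cut in half at a chunk boundary
    words = text.split()
    chunk_size = max(1, (len(words) + num_processes - 1) // num_processes)
    chunks_list = list(chunks(words, chunk_size))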
 
    with multiprocessing.Pool(processes=num_processes) as pool:
        word_counts = pool.map(mapper, chunks_list)
 
    result = reducer(word_counts)
    return result
 
if __name__ == "__main__":
    text = "python is the best language for programming and python is easy to learn"
    num_processes = 4
    result = map_reduce_function(text, num_processes)
    for i in result:
        print(i, result[i])

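Here MapReduce is implemented with multiple processes, which gives true parallelism: the data is still split into chunks, and the chunks are processed in parallel, which is where MapReduce's efficiency comes from. In this example the difference is hard to see because the data set is tiny. In practice, multiprocessing is not a good fit when the data set is too small: it may bring no benefit at all, and the extra overhead can even make performance worse.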
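One way to act on this is to fall back to a plain in-process count when the input is small. The sketch below assumes a cutoff named `MIN_WORDS_FOR_PARALLEL`; both the name and the value are hypothetical, and a sensible value would have to be measured on the target machine.

```python
from collections import Counter
import multiprocessing

# Hypothetical cutoff: below this many words, skip the process pool entirely
MIN_WORDS_FOR_PARALLEL = 100_000

def count_chunk(chunk):
    # Map: count the words in one chunk (a list of words)
    return Counter(chunk)

def merge_counts(partial_counts):
    # Reduce: merge the per-chunk counters into one
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

def count_words(text, num_processes=4, min_parallel=MIN_WORDS_FOR_PARALLEL):
    words = text.split()
    if len(words) < min_parallel:
        # Small input: counting in the current process avoids
        # process start-up and pickling costs
        return count_chunk(words)
    chunk_size = (len(words) + num_processes - 1) // num_processes
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    with multiprocessing.Pool(processes=num_processes) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    return merge_counts(partial_counts)

if __name__ == "__main__":
    text = "python is the best language for programming and python is easy to learn"
    print(count_words(text))
```

For the 13-word sample text the serial branch is taken, so no worker processes are started at all.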

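4. Searching for data in a 100 GB file

The MapReduce idea still applies here, but there are two problems.

Slow reading

Solution:

Read the file in chunks, but do not make the chunks too small. Each chunk is serialized when it is sent to a worker process and deserialized again inside that process, so many tiny chunks mean a lot of time spent on repeated serialization and deserialization. The chunks should not be too large either, because then there are too few tasks to hand out and the CPU's cores may not be fully used.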
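The cost of many small chunks can be estimated with a rough measurement. The sketch below only times a `pickle` round trip on made-up sample data for a few chunk sizes; it does not measure the actual transfer between processes, and the helper names are illustrative.

```python
import pickle
import time

def split_text(text, chunk_size):
    # Cut the text into consecutive pieces of at most chunk_size characters
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def pickle_round_trip_seconds(chunks):
    # Serialize and deserialize every chunk, roughly what happens when
    # chunks are sent to worker processes, and return the elapsed time
    start = time.perf_counter()
    for chunk in chunks:
        pickle.loads(pickle.dumps(chunk))
    return time.perf_counter() - start

text = "some line of sample text\n" * 200_000   # about 5 million characters of made-up data
for chunk_size in (64, 64 * 1024, 1024 * 1024):
    chunks = split_text(text, chunk_size)
    seconds = pickle_round_trip_seconds(chunks)
    print(f"chunk_size={chunk_size:>8}  chunks={len(chunks):>6}  pickle time={seconds:.4f}s")
```

The absolute numbers depend on the machine; the point of the comparison is how the per-chunk overhead adds up as the chunk count grows.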

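Very high memory consumption

Solution:

Use generators and iterators so that data is fetched on demand. For example, if the file is divided into 8 chunks, the generator reads one chunk, yields it, then reads the next, and so on, so the whole file never has to be held in memory at once.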

from datetime import datetime
import multiprocessing
def chunked_file_reader(file_path:str, chunk_size:int):
    """
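    Generator function: read the file contents in chunks
    - file_path: path to the file
    - chunk_size: chunk size in characters (the caller passes about 1 MB by default)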
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            yield chunk
 
def search_in_chunk(chunk:str, keyword:str):
    """Search for a keyword in a file chunk
    - chunk: the file chunk
    - keyword: the keyword to search for
    """
    lines = chunk.split('\n')
    for line in lines:
        if keyword in line:
            print("Found:", line)
 
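def search_in_file(file_path:str, keyword:str, chunk_size=1024*1024):
    """Search for a keyword in a file
    file_path: path to the file
    keyword: the keyword to search for
    chunk_size: chunk size in characters, about 1 MB by default
    """
    with multiprocessing.Pool() as pool:
        for chunk in chunked_file_reader(file_path, chunk_size):
            pool.apply_async(search_in_chunk, args=(chunk, keyword))
        # Wait for all queued tasks to finish; leaving the with block
        # calls terminate(), which would discard unfinished tasks
        pool.close()
        pool.join()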
        
if __name__ == "__main__":
    start = datetime.now()
    file_path = "file.txt"
    keyword = "张三"
    search_in_file(file_path, keyword)
    end = datetime.now()
    print(f"Search finished, elapsed time {end - start}")

In the end, the program ran for about two and a half minutes.
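One limitation of fixed-size reads is that a chunk boundary can fall in the middle of a line, so a keyword that straddles two chunks would be missed. A possible refinement, sketched here with the illustrative names `read_line_aligned_chunks` and `find_matching_lines`, is to extend each chunk to the end of its last line. The sample data is made up, and a file containing one extremely long line would still produce one very large chunk.

```python
import io

def read_line_aligned_chunks(file, chunk_size):
    """Yield chunks of roughly chunk_size characters that end at a line break."""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        if not chunk.endswith("\n"):
            # Finish the line that was cut by the fixed-size read
            chunk += file.readline()
        yield chunk

def find_matching_lines(chunk, keyword):
    # Return the matching lines instead of printing them
    return [line for line in chunk.splitlines() if keyword in line]

# An in-memory file stands in for a real one opened with open()
sample = io.StringIO("alpha\nbeta keyword here\ngamma\n")
matches = []
for chunk in read_line_aligned_chunks(sample, chunk_size=10):
    matches.extend(find_matching_lines(chunk, "keyword"))
print(matches)   # ['beta keyword here']
```

Here the first 10-character read stops after "beta", and the extra `readline()` call completes the line so the keyword is still found.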
