30 Nov 2010
MapReduce初探
MapReduce最近很热,是Google提出的一个软件架构,用来大规模的分布式计算。主要思想就是Map和Reduce:
- Map是将一组键值对转换为一组新的键值对列表
- Reduce则是根据相同的键,把其值通过一定的函数合并
本来这些概念也通俗易懂(难的地方,比如分布式、可靠性我还没有去深究),但是我也想实现一个简单的MapReduce却遇到了问题。此次实现简单的MapReduce,主要想用python的多线程、列表以及Queue等基本的一些结构和数据类型来实现,具体实现步骤:
- 通过parse,将原始数据转换成键值对
- 使用map将1)步生成的键值对转换成新的键值对
- 将map生成的map进行merge操作,首先对键进行排序,然后将相同键的键值对的值合并一个list
- 对list进行reduce
这次想通过python的多线程来模拟并行,通过Queue和自定义的SynchronizeDict来缓存中间数据。但是在实现代码的时候,我发现map worker和reduce worker并不能同时执行,因为reduce worker需要等待,只有当所有的map worker执行完毕,merge完毕,reduce worker才开始自己的工作,否则必然出现不完整的情况。代码实现见:https://github.com/yangjuven/MapReduce。
重新读了下《MapReduce: Simplied Data Processing on Large Clusters》的3.1 Execution Overview,有段这样写到:
When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together.
也询问了Google's MapReduce in 98 Lines of Python的作者John Arley Burns。
Juven:In you mapreduce example, the map workers and reduce workers can’t be processed in parallel. Because reduce workers must wait for the complete of map workers. How is the true MapReduce?John Arley Burns:Yes you are correct, the map and reduce steps cannot be in parallel. Instead, all maps run in parallel, then when all are finished, all reduce steps run, depending on algorithm in order, This is inherent to the map reduce algorithm. As you noticed, this limits parallelization. This is perhaps one reason Google has largely abandoned map reduce in favor of Bigtable-based processing, letting the database function as the point of control in the algorithms.
Resouces & References:
- MapReduce: Simplied Data Processing on Large Clusters
- Google's MapReduce in 98 Lines of Python