A Set of MapReduce Tuning Experiments Based on Meteorological Operations

Cite this article

Yang Runzhi, Shen Wenhai, Xiao Weiqing, Hu Kaixi, Yang Xin, Wang Ying, Tian Wei. A Set of MapReduce Tuning Experiments Based on Meteorological Operations[J]. Journal of Applied Meteorological Science, 2014, 25(5): 618-628.

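Yang Runzhi¹, Shen Wenhai¹, Xiao Weiqing¹, Hu Kaixi¹, Yang Xin¹, Wang Ying¹, Tian Wei²

1. National Meteorological Information Center, Beijing 100081;
2. Nanjing University of Information Science & Technology, Nanjing 210044

Received 2013-10-08; revised manuscript received 2014-10-08.

Corresponding author: Yang Runzhi, email: yangrz@cma.gov.cn.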

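Abstract: Cloud computing uses distributed computing to deliver the capacity and efficiency of parallel computation, overcoming the limited computing power of a single server. Climate standard values derived from long-sequence historical data are important for real-time operations, near-real-time operations, and scientific research in meteorology. Because long-sequence historical data are voluminous and the computation logic is relatively complex, reorganization on a traditional single-node platform takes a very long time. This paper builds a cluster-mode cloud computing platform on the Hadoop distributed computing framework, uses long-sequence historical data as the source data, and implements part of the reorganization algorithms with the MapReduce computation model to improve timeliness. In addition, because the data source consists of a large number of small files, the storage format and the file size of the source data are changed. The SequenceFile approach and the text-file merging approach are compared on the same scenario, with text merging tested in two cases, 10 files per merged file and 100 files per merged file, which improves timeliness further.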

Key words: MapReduce; cloud computing; Hadoop; historical data reorganization
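The MapReduce computation summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Hadoop: it simulates the map, shuffle, and reduce phases in plain Python to compute a multi-year monthly mean per station. The record layout (`station,year,month,value`), the station identifiers, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict


def map_record(line):
    """Map step: parse one observation and emit ((station, month), (value, 1)).

    The comma-separated layout is a hypothetical format, not the real source data.
    """
    station, _year, month, value = line.strip().split(",")
    return (station, int(month)), (float(value), 1)


def reduce_values(pairs):
    """Reduce step: add up the (value, count) pairs of one key and return the mean."""
    total = sum(value for value, _ in pairs)
    count = sum(n for _, n in pairs)
    return total / count


def run_job(lines):
    """Simulate the shuffle phase: group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    return {key: reduce_values(values) for key, values in groups.items()}


# Illustrative records only; station IDs and temperatures are made up.
records = [
    "54511,1981,1,-3.0",
    "54511,1982,1,-5.0",
    "54511,1981,7,26.0",
    "58362,1981,1,4.0",
]
normals = run_job(records)
```

Here `normals` maps each `(station, month)` key to the mean of its values across years. On a real cluster the grouping in `run_job` is performed by the framework between the map and reduce phases rather than by user code.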

Abstract: Cloud computing uses distributed computing to deliver the capacity and efficiency of parallel computation, overcoming the limited computing power of a standalone server. It is a new application model for decentralized computing that can serve the largest number of users reliably and in a customized way with minimal resources, and it is also an important route for combining cloud computing theory with other theories and techniques in research and practical application. Cloud computing is now widely applied in many industries and fields, and its flexibility, ease of use, and stability are increasingly recognized. In the meteorological sector, the development of cloud-based platforms for scientific computing is still very limited, although some attempts have been made as cloud computing has matured. In meteorological operations, large-scale scientific computing and other general computing tasks run on high-performance server clusters. Because HPC resources and node counts are limited, scientific computing still relies on traditional standalone or clustered modes. Exploring how conventional general-purpose computing can be moved to a cloud computing platform is therefore meaningful for the meteorological department. The National Meteorological Information Center stores a valuable 60-year sequence of historical data used for real-time operations, near-real-time operations, and research. Processing these historical data is time-consuming, so new methods are tried. A cluster-mode cloud computing platform is built on Hadoop, and a variety of statistical methods are implemented with the MapReduce computation model.
The storage format of the source data is changed to SequenceFile, which consists of serialized <Key, Value> pairs; multiple Format-A files are merged into one large SequenceFile to test the change in computational efficiency. In addition, many small files are merged into larger files. The configuration of the Hadoop cluster environment is modified experimentally, and different numbers of task nodes are used to record the resulting computational efficiency.

Key words: MapReduce; cloud computing; Hadoop; meteorological data processing
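The two small-file strategies named in the abstract, packing files as <Key, Value> pairs and merging text files in fixed-size groups, can be sketched as follows. This is a plain-Python analogy under stated assumptions, not the Hadoop SequenceFile API and not the authors' code: the file names, the file contents, and the helper names `to_key_value_records` and `merge_in_batches` are hypothetical.

```python
def to_key_value_records(files):
    """Pack small files as (key, value) pairs: file name -> raw bytes.

    This only mimics the logical <Key, Value> layout; it does not write
    a real SequenceFile.
    """
    return [(name, text.encode("utf-8")) for name, text in files]


def merge_in_batches(files, batch_size):
    """Concatenate every `batch_size` small text files into one larger text."""
    merged = []
    for start in range(0, len(files), batch_size):
        batch = files[start:start + batch_size]
        merged.append("\n".join(text for _, text in batch))
    return merged


# 250 made-up small files standing in for per-station source files.
files = [(f"station_{i:04d}.txt", f"record-{i}") for i in range(250)]

packed = to_key_value_records(files)
merged_by_10 = merge_in_batches(files, 10)
merged_by_100 = merge_in_batches(files, 100)
```

With 250 inputs, merging by 10 leaves 25 larger inputs and merging by 100 leaves 3. Since Hadoop typically schedules at least one map task per input file, fewer and larger inputs generally mean less per-task overhead, which is the effect the merging tests are meant to measure.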


Fig. 1. Task scheduling model on the cloud computing platform

Fig. 2. MapReduce computation process

Fig. 3. Calculation flow of the traditional reorganization software

Fig. 4. Data flow of the MapReduce computation model

Fig. 5. Reorganization calculation flow on the cloud computing platform

Fig. 6. Test of time versus number of compute nodes

Fig. 7. Test of time versus system parameters of the cloud computing platform

Fig. 8. Test of time versus maximum number of parallel tasks on the cloud computing platform: (a) uniform ordinate, (b) exponential ordinate

Fig. 9. Test of time for uploading files to HDFS

