一次 Spark SQL 性能提升10倍的经历
共 9106字,需浏览 19分钟
·
2021-01-05 12:53
1. 遇到了啥问题
2. 原因排查
该 stage 内一共有 336 个task
3. 如何解决
为什么 scheduler delay 会这么大
降低 average delay time:目前来看似乎唯一的方法是砸钱加资源
降低 task 数:粗略来看,简单的降低 task 数的话,应该是能减少 total delay time 的,但是如果task 数降低了,意味着每个 task 需要处理的数据量就多了,那其他的时间应该是会增加一些的,比如说
Task Deserialization Time, Result Serialization Time, GC Time, Duration
等。减少 task 数究竟能不能提高整体运行速度,似乎乍一看还真不好确定。
Duration: 扩大到 10 倍,14ms
Scheduler Delay: 这个指标不用估计
Task Deserialization Time: 扩大到 10 倍,6ms
GC Time: 扩大到 10 倍,最多1ms
Result Serialization Time: 扩大到 10 倍,最多1ms
Getting Result Time: 扩大到 10 倍,最多1ms
Peak Execution Memory: 扩大到 10 倍,最多 1b
Input Size / Records: 扩大到 10 倍,918.8 KB * 2 / 2
Shuffle Write Size / Records 0: 扩大到 10 倍,294.0 B * 2/ 20
@mac007:~/Desktop/dyes/git-mercury/mercury-computing$hadoop fs -ls -h hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ | cat -n | tail
317 -rw-r--r-- 3 taotao hfmkt 1.3 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00310-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
318 -rw-r--r-- 3 taotao hfmkt 1.4 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00311-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
319 -rw-r--r-- 3 taotao hfmkt 2.9 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00312-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
320 -rw-r--r-- 3 taotao hfmkt 1.2 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00313-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
321 -rw-r--r-- 3 taotao hfmkt 1.9 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00314-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
322 -rw-r--r-- 3 taotao hfmkt 1.7 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00315-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
323 -rw-r--r-- 3 taotao hfmkt 899.4 K 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00316-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
324 -rw-r--r-- 3 taotao hfmkt 2.3 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00317-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
325 -rw-r--r-- 3 taotao hfmkt 1.0 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00318-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
326 -rw-r--r-- 3 taotao hfmkt 460.9 K 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00319-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
4. 效果对比
20161117 这天
taotao@mac007:~/Desktop/dyes/git-mercury/mercury-computing$hadoop fs -ls -h hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_S* | cat -n | tail -n 5
342 -rw-r--r-- 3 taotao hfmkt 1.7 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00315-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
343 -rw-r--r-- 3 taotao hfmkt 899.4 K 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00316-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
344 -rw-r--r-- 3 taotao hfmkt 2.3 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00317-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
345 -rw-r--r-- 3 taotao hfmkt 1.0 M 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00318-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
346 -rw-r--r-- 3 taotao hfmkt 460.9 K 2016-11-17 16:37 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161117/20161117_Transaction_SZ/part-r-00319-38d7cc53-60d2-40b3-a945-0cb5832f30de.gz.parquet
20161212 这天
taotao@mac007:~/Desktop/dyes/git-mercury/mercury-computing$hadoop fs -ls -h hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_S* | cat -n | tail -n 5
34 -rw-r--r-- 3 taotao hfmkt 19.2 M 2016-12-12 15:49 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_SZ/part-r-00013-686bbce5-a7a1-4b5d-b25c-14cd9ddae283.gz.parquet
35 -rw-r--r-- 3 taotao hfmkt 10.7 M 2016-12-12 15:49 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_SZ/part-r-00014-686bbce5-a7a1-4b5d-b25c-14cd9ddae283.gz.parquet
36 -rw-r--r-- 3 taotao hfmkt 26.0 M 2016-12-12 15:49 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_SZ/part-r-00015-686bbce5-a7a1-4b5d-b25c-14cd9ddae283.gz.parquet
37 -rw-r--r-- 3 taotao hfmkt 20.1 M 2016-12-12 15:49 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_SZ/part-r-00016-686bbce5-a7a1-4b5d-b25c-14cd9ddae283.gz.parquet
38 -rw-r--r-- 3 taotao hfmkt 8.7 M 2016-12-12 15:49 hdfs://hadoop-archive-cluster/hfmkt/level2/datayes/parquet/20161212/20161212_Transaction_SZ/part-r-00017-686bbce5-a7a1-4b5d-b25c-14cd9ddae283.gz.parquet
20161117 这天:99s
20161212 这天:16s
20161117 这天
20161212 这天
5. 总结
超全Spark性能优化总结
一次SparkSQL性能分析与优化之旅及相关工具小结
基于 Spark 技术快速构建数仓项目