Spark on YARN Deployment Walkthrough

Environment Preparation

1. Server role assignment

ip            hostname   role
10.8.26.197   server1    primary name node (NameNode)
10.8.26.196   server2    secondary name node (SecondaryNameNode)
10.8.26.196   server2    data node (DataNode)

2. Software stack

  • jdk1.8.0_102
  • scala-2.11.0
  • hadoop-2.7.0
  • spark-2.0.2-bin-hadoop2.7: the matching Scala version must be 2.11.x

3. HOSTS configuration

Add the following to the "/etc/hosts" file on every server:

10.8.26.197 server1
10.8.26.196 server2

4. Passwordless SSH login

See: RedHat 7 / CentOS 7 passwordless SSH login
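
A minimal sketch of what that article sets up, assuming the root account is used throughout (as in the rest of this walkthrough): generate a key pair once on server1, then push the public key to every node.

# ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# for host in server1 server2 server3; do ssh-copy-id root@$host; done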


Hadoop YARN Distributed Cluster Configuration

Note: apply steps 1-8 identically on every node (see the sync sketch after step 8).

1. Environment variables

# vim /etc/profile
# hadoop env set
export HADOOP_HOME=/usr/local/hadoop-2.7.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
# jdk env set
export JAVA_HOME=/usr/local/jdk1.8.0_102
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# scala env set
export SCALA_HOME=/usr/local/scala-2.11.0
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH

Apply the variables immediately:

# source /etc/profile

2. Create the required directories

# mkdir -p /data/hadoop/{pids,storage}
# mkdir -p /data/hadoop/storage/{hdfs,tmp}
# mkdir -p /data/hadoop/storage/hdfs/{name,data}

3. Configure core-site.xml

File: $HADOOP_HOME/etc/hadoop/core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://server1:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/data/hadoop/storage/tmp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.native.lib</name>
    <value>true</value>
  </property>
</configuration>
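
A quick sanity check that this file is being picked up (assuming the environment variables from step 1 are in effect): hdfs getconf should echo the configured URI back.

# hdfs getconf -confKey fs.defaultFS
hdfs://server1:9000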

4. Configure hdfs-site.xml

File: $HADOOP_HOME/etc/hadoop/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>server2:9000</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/storage/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/storage/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
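
A note on dfs.replication: with the value 3 and three DataNodes, every block is stored on all three nodes. Replication can also be adjusted per path after the fact; a hedged example on a hypothetical file:

# hdfs dfs -setrep -w 2 /user/root/text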

5. Configure mapred-site.xml

File: $HADOOP_HOME/etc/hadoop/mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>server1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>server1:19888</value>
  </property>
</configuration>
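
The two jobhistory addresses only matter if the JobHistory server is actually running, and the start scripts used in step 9 do not launch it. A sketch of starting it manually on server1 with the stock Hadoop 2.7 daemon script:

# sbin/mr-jobhistory-daemon.sh start historyserver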

6. Configure yarn-site.xml

File: $HADOOP_HOME/etc/hadoop/yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>server1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>server1:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>server1:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>server1:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>server1:80</value>
  </property>
</configuration>
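
Two things worth noting: yarn.resourcemanager.webapp.address is bound to port 80 rather than the default 8088, which only works here because the daemons run as root; and once the cluster is up (step 9), the registered NodeManagers can be listed to confirm this file took effect:

# yarn node -list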

7. Configure hadoop-env.sh, mapred-env.sh, and yarn-env.sh

Files:

  • $HADOOP_HOME/etc/hadoop/hadoop-env.sh
  • $HADOOP_HOME/etc/hadoop/mapred-env.sh
  • $HADOOP_HOME/etc/hadoop/yarn-env.sh

Add the following at the top of each of the three files:

export JAVA_HOME=/usr/local/jdk1.8.0_102
export CLASS_PATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/usr/local/hadoop-2.7.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

8. Data node configuration

# vim $HADOOP_HOME/etc/hadoop/slaves
server1
server2
server3
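
Since steps 1-8 must be identical everywhere, it is easiest to edit the files once on server1 and sync them out; a minimal sketch, assuming identical install paths on every host:

# for host in server2 server3; do rsync -az $HADOOP_HOME/etc/hadoop/ $host:$HADOOP_HOME/etc/hadoop/; done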

9. Basic Hadoop test

Working directory on the master: $HADOOP_HOME

# cd $HADOOP_HOME

When starting the cluster for the first time, run the following [on the primary NameNode]:

# hdfs namenode -format
# sbin/start-dfs.sh
# sbin/start-yarn.sh

Check that the processes started correctly.

Primary NameNode - server1:

# jps
11842 Jps
11363 ResourceManager
10981 NameNode
11113 DataNode
11471 NodeManager

Secondary NameNode - server2:

# jps
7172 SecondaryNameNode
7252 NodeManager
7428 Jps
7063 DataNode

Data node - server3:

# jps
6523 NodeManager
6699 Jps
6412 DataNode

HDFS and MapReduce test

# hdfs dfs -mkdir -p /user/root
# hdfs dfs -put ~/text /user/root
# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount /user/root /user/out
16/12/30 14:01:51 INFO client.RMProxy: Connecting to ResourceManager at server1/10.8.26.197:8032
16/12/30 14:01:55 INFO input.FileInputFormat: Total input paths to process : 1
16/12/30 14:01:55 INFO mapreduce.JobSubmitter: number of splits:1
16/12/30 14:01:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1483076482233_0001
16/12/30 14:01:58 INFO impl.YarnClientImpl: Submitted application application_1483076482233_0001
16/12/30 14:01:58 INFO mapreduce.Job: The url to track the job: http://server1:80/proxy/application_1483076482233_0001/
16/12/30 14:01:58 INFO mapreduce.Job: Running job: job_1483076482233_0001
16/12/30 14:02:23 INFO mapreduce.Job: Job job_1483076482233_0001 running in uber mode : false
16/12/30 14:02:24 INFO mapreduce.Job: map 0% reduce 0%
16/12/30 14:02:36 INFO mapreduce.Job: map 100% reduce 0%
16/12/30 14:02:44 INFO mapreduce.Job: map 100% reduce 100%
16/12/30 14:02:45 INFO mapreduce.Job: Job job_1483076482233_0001 completed successfully
16/12/30 14:02:46 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=242
FILE: Number of bytes written=230317
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=493
HDFS: Number of bytes written=172
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=7899
Total time spent by all reduces in occupied slots (ms)=6754
Total time spent by all map tasks (ms)=7899
Total time spent by all reduce tasks (ms)=6754
Total vcore-seconds taken by all map tasks=7899
Total vcore-seconds taken by all reduce tasks=6754
Total megabyte-seconds taken by all map tasks=8088576
Total megabyte-seconds taken by all reduce tasks=6916096
Map-Reduce Framework
Map input records=8
Map output records=56
Map output bytes=596
Map output materialized bytes=242
Input split bytes=99
Combine input records=56
Combine output records=16
Reduce input groups=16
Reduce shuffle bytes=242
Reduce input records=16
Reduce output records=16
Spilled Records=32
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=231
CPU time spent (ms)=1720
Physical memory (bytes) snapshot=293462016
Virtual memory (bytes) snapshot=4158427136
Total committed heap usage (bytes)=139976704
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=394
File Output Format Counters
Bytes Written=172

After the job completes, inspect the output:

# hdfs dfs -ls /user/out
Found 2 items
-rw-r--r-- 3 root supergroup 0 2016-12-30 14:02 /user/out/_SUCCESS
-rw-r--r-- 3 root supergroup 172 2016-12-30 14:02 /user/out/part-r-00000
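
To see the actual word counts rather than just the output files:

# hdfs dfs -cat /user/out/part-r-00000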

The job can also be viewed in the UI (http://server1/cluster/apps):

[Hadoop UI screenshot]

Viewing HDFS information

# hdfs dfsadmin -report
# hdfs fsck / -files -blocks

UI: http://server1:50070/

[HDFS UI screenshot]

Routine cluster maintenance

The all-in-one scripts below are deprecated in Hadoop 2.x in favor of the separate start-dfs.sh / start-yarn.sh used above, but they still work:
# sbin/start-all.sh
# sbin/stop-all.sh

Monitoring page URL:

http://10.8.26.197:80

Spark Distributed Cluster Configuration

Note: apply the same configuration on every node.

1. Spark configuration

Spark environment variables

# vim /etc/profile
# spark env set
export SPARK_HOME=/usr/local/spark-2.0.2-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH
# source /etc/profile

Configure spark-env.sh

# cd $SPARK_HOME/conf
# mv spark-env.sh.template spark-env.sh
# vim spark-env.sh
## add the following
export JAVA_HOME=/usr/local/jdk1.8.0_102
export SCALA_HOME=/usr/local/scala-2.11.0
export HADOOP_HOME=/usr/local/hadoop-2.7.0
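
For the YARN-based test at the end of this post, Spark also needs to find the Hadoop client configuration. The /etc/profile settings above already export HADOOP_CONF_DIR, but keeping a copy in spark-env.sh makes the Spark install self-contained; assuming the paths used throughout:

export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.0/etc/hadoop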

Configure the list of worker hostnames

# cd $SPARK_HOME/conf
# vim slaves
server1
server2
server3

Other configuration

# cd $SPARK_HOME/conf
# mv log4j.properties.template log4j.properties

Run on the Master node:

# cd $SPARK_HOME && sbin/start-all.sh

Check that the processes started

A "Master" process should appear on the master node, and a "Worker" on each slave node.

Master node:

[root@server1 spark-2.0.2-bin-hadoop2.7]# jps
11363 ResourceManager
10981 NameNode
13176 Master
13256 Worker
11113 DataNode
13435 Jps
11471 NodeManager

Slave node:

[root@server2 conf]# jps
7172 SecondaryNameNode
7252 NodeManager
7063 DataNode
8988 Worker
9133 Jps

Tests

Monitoring page URL:

http://10.8.26.197:8080/ or http://server1:8080/

[Spark monitoring page screenshot]

Switch to the $SPARK_HOME directory (the commands below use relative ./bin paths).

1. Local mode

# ./bin/run-example org.apache.spark.examples.SparkPi local

2. Standalone cluster mode

# ./bin/run-example org.apache.spark.examples.SparkPi spark://10.8.26.197:7077
# ./bin/run-example org.apache.spark.examples.SparkLR spark://10.8.26.197:7077
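
Note that in Spark 2.x, run-example no longer reads the master from a positional argument the way the Spark 1.x-style invocations above do (SparkPi now interprets its first argument as the number of slices). With the 2.0.2 build used here, select the master through the MASTER environment variable instead:

# MASTER=local[2] ./bin/run-example SparkPi 10
# MASTER=spark://10.8.26.197:7077 ./bin/run-example SparkPi 100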

3. Cluster mode with HDFS

Working directory: $SPARK_HOME
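
The scala> prompt below comes from spark-shell. A hedged setup: first copy Spark's own README.md into HDFS so the path in the example exists, then start the shell against the standalone master.

# hdfs dfs -put README.md /user/root
# ./bin/spark-shell --master spark://server1:7077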

scala> val file=sc.textFile("hdfs://server1:9000/user/root/README.md")
file: org.apache.spark.rdd.RDD[String] = hdfs://server1:9000/user/root/README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:26
scala> count.collect()
res3: Array[(String, Int)] = Array((package,1), (this,1), (Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version),1), (Because,1), (Python,2), (cluster.,1), (its,1), ([run,1), (general,2), (have,1), (pre-built,1), (YARN,,1), (locally,2), (changed,1), (locally.,1), (sc.parallelize(1,1), (only,1), (Configuration,1), (This,2), (basic,1), (first,1), (learning,,1), ([Eclipse](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse),1), (documentation,3), (graph,1), (Hive,2), (several,1), (["Specifying,1), ("yarn",1), (page](http://spark.apache.org/documentation.html),1), ([params]`.,1), ([project,2), (prefer,1), (SparkPi,2), (<http://spark.apache.org/>,1), (engine,1), (version,1), (file,1), (documentation...
scala> :quit

4. YARN mode

# ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
examples/jars/spark-examples*.jar \
10

Execution result:

$HADOOP_HOME/logs/userlogs/application_*/container*_***/stdout

http://10.8.26.197/logs/userlogs/application_1483076482233_0009/container_1483076482233_0009_01_000001/stdout
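
The same output can also be pulled through the YARN CLI, assuming log aggregation (yarn.log-aggregation-enable) is switched on; otherwise read the container stdout file shown above directly on the node. Substitute the real application id:

# yarn logs -applicationId application_1483076482233_0009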

[YARN monitoring page screenshot]
