Getting Started with Big Data (1): Installing Hadoop

Environment: Ubuntu 16, JDK 8, Hadoop 3.1.2

I won't cover installing Ubuntu here. For the JDK, I installed OpenJDK directly with apt:


# Search for the available JDK packages
$ apt search openjdk

# Install JDK 8
$ apt install openjdk-8-jdk

# Once it is installed, check the version
$ java -version

Hadoop's environment configuration needs the Java installation path later on, so we first have to find out where Java was installed.


# Use which to find the java executable
$ which java
/usr/bin/java

# Use ls -l to see where that path links to
$ ls -l /usr/bin/java
/usr/bin/java -> /etc/alternatives/java

# Run ls -l again on the link target
$ ls -l /etc/alternatives/java
/etc/alternatives/java -> /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

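So the real path of java is /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java. Dropping the trailing /bin/java gives the Java installation directory, /usr/lib/jvm/java-8-openjdk-amd64/jre.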
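The two ls -l calls above can be collapsed into one step. The following is a minimal sketch, assuming GNU coreutils (the default on Ubuntu), where readlink -f follows the whole symlink chain. The helper name java_home_from_bin is made up for this example.

```shell
# Strip the trailing /bin/java from a resolved java path
# to get the Java installation directory.
java_home_from_bin() {
  printf '%s\n' "${1%/bin/java}"
}

# Only try to resolve the path if a java executable is actually installed.
if command -v java >/dev/null 2>&1; then
  # readlink -f resolves every symlink in the chain in a single call.
  java_bin="$(readlink -f "$(command -v java)")"
  java_home_from_bin "$java_bin"
fi
```

On the machine described here this should print the same directory that the manual steps arrived at.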

Step 2: download Hadoop from https://hadoop.apache.org/releases.html

I chose version 3.1.2, binary download. Download it with wget, then unpack it:


$ wget http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz

$ tar -xzvf hadoop-3.1.2.tar.gz
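Because the tarball comes from a mirror, it can be worth comparing its SHA-512 digest with the one published alongside the release before relying on it. The sketch below assumes you paste the published digest into EXPECTED_SHA512 yourself. That variable and the verify_sha512 helper are hypothetical names for this example, not part of Hadoop.

```shell
# Succeed only if the SHA-512 digest of file $1 equals the hex string $2.
verify_sha512() {
  file="$1"
  # Normalise the expected digest to lowercase, as sha512sum prints lowercase hex.
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  actual="$(sha512sum "$file" | awk '{ print $1 }')"
  [ "$actual" = "$expected" ]
}

tarball="hadoop-3.1.2.tar.gz"

# EXPECTED_SHA512 is an assumption: set it to the digest shown on the download page.
if [ -f "$tarball" ] && [ -n "${EXPECTED_SHA512:-}" ]; then
  if verify_sha512 "$tarball" "$EXPECTED_SHA512"; then
    echo "checksum OK"
  else
    echo "checksum mismatch, download the file again" >&2
  fi
fi
```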

You should now see a hadoop-3.1.2 directory in the current directory. Change into it; the main subdirectories are:

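bin: the command-line tools (hadoop, hdfs, yarn, mapred)
etc: configuration files
sbin: scripts for starting and stopping the daemons in a distributed setup
share/hadoop: all the bundled JARs, which you reference when writing code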

Edit ~/.bash_profile and append the following to set the environment variables (adjust the path to wherever you unpacked Hadoop):


HADOOP_HOME=/root/software/hadoop-3.1.2

export HADOOP_HOME



PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export PATH

Save the file, then run the following command so the environment variables take effect:


$ source ~/.bash_profile

Change into the Hadoop installation directory, edit etc/hadoop/hadoop-env.sh as follows, and save it:


  # set to the root of your Java installation

  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
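With PATH and JAVA_HOME both configured, a quick sanity check can catch typos before running any job. This is only a sketch: path_contains is a helper invented for this example, and the fallback HADOOP_HOME value simply repeats the path used in ~/.bash_profile above.

```shell
# Succeed if directory $2 is an entry in the colon-separated list $1.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Falls back to the article's path if HADOOP_HOME is not set in this shell.
HADOOP_HOME="${HADOOP_HOME:-/root/software/hadoop-3.1.2}"

for dir in "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin"; do
  if path_contains "$PATH" "$dir"; then
    echo "on PATH: $dir"
  else
    echo "missing from PATH: $dir"
  fi
done

# hadoop version prints the release information when the launcher is found.
if command -v hadoop >/dev/null 2>&1; then
  hadoop version
fi
```

If either directory is reported as missing, re-check ~/.bash_profile and run source on it again in the current shell.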

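At this point the standalone Hadoop environment is essentially installed. The hadoop-3.1.2/share/hadoop/mapreduce directory contains an example JAR, hadoop-mapreduce-examples-3.1.2.jar. Change into that directory and run it with: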


hadoop jar hadoop-mapreduce-examples-3.1.2.jar

If you see the following message, Hadoop is installed correctly:

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

The JAR bundles many example programs. One of them is wordcount, which counts words, so let's try it:


hadoop jar hadoop-mapreduce-examples-3.1.2.jar wordcount /root/data/input/data.txt /root/data/output/test1

Output like the following means the job succeeded:

2019-08-08 11:10:47,100 INFO mapreduce.Job:  map 0% reduce 0%
2019-08-08 11:10:52,173 INFO mapreduce.Job:  map 100% reduce 0%
2019-08-08 11:10:58,210 INFO mapreduce.Job:  map 100% reduce 100%
2019-08-08 11:10:58,218 INFO mapreduce.Job: Job job_1565165510892_0005 completed successfully
2019-08-08 11:10:58,337 INFO mapreduce.Job: Counters: 53

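Notice the two paths after wordcount: /root/data/input/data.txt and /root/data/output/test1. The first is the input file and the second is the output directory. data.txt contains the following text; create it yourself with any editor: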


I love Chongqing

I love China

Chongqing is a province city of China
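The input file can also be created from the shell. The sketch below writes it with a here-document and checks that the output directory does not exist yet, because the job refuses to start if it does. The /tmp/wordcount-demo location and the DATA_ROOT variable are assumptions for this example; the article itself uses /root/data.

```shell
# Hypothetical scratch location; set DATA_ROOT=/root/data to match the article.
data_root="${DATA_ROOT:-/tmp/wordcount-demo}"
input_file="$data_root/input/data.txt"
output_dir="$data_root/output/test1"

# Create the input directory and write the three sample lines.
mkdir -p "$data_root/input"
cat > "$input_file" <<'EOF'
I love Chongqing
I love China
Chongqing is a province city of China
EOF

# The output directory must not exist before the job runs.
if [ -e "$output_dir" ]; then
  echo "output directory already exists: $output_dir" >&2
else
  echo "ready: hadoop jar hadoop-mapreduce-examples-3.1.2.jar wordcount $input_file $output_dir"
fi
```

When re-running the job, either remove the old output directory first or pass a new name such as test2.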

Because this is a local environment with no HDFS distributed file system, the job runs against local files.

Finally, on the command line you can see that two files were generated in the test1 directory: _SUCCESS and part-r-00000. Running cat part-r-00000 shows the word counts, sorted by word:


China   2

Chongqing       2

I       2

a       1

city    1

is      1

love    2

of      1

province        1
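To cross-check the result without Hadoop, the same counts can be reproduced with standard Unix tools. This is a rough equivalent only, not how the wordcount example is implemented: it splits on whitespace, sorts in byte order, and prints word and count separated by a tab. The function name count_words is made up for this example.

```shell
# Read text on stdin and print "word<TAB>count" lines sorted by word.
count_words() {
  tr -s '[:space:]' '\n' |   # put every whitespace-separated word on its own line
    grep -v '^$' |           # drop an empty first line caused by leading whitespace
    LC_ALL=C sort |          # byte-order sort, so uppercase words come first
    uniq -c |                # collapse duplicates and prefix each word with its count
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Feed in the same three lines as data.txt.
printf '%s\n' \
  'I love Chongqing' \
  'I love China' \
  'Chongqing is a province city of China' | count_words
```

For this input the nine lines printed should match the contents of part-r-00000 shown above.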