Getting a Hive proof of concept up and running is not very intuitive, and it required me to go through a mishmash of documents to find all of the pieces. I hope these instructions help somebody else get their proof of concept instance up and running sooner!

Baseline

My goal was to have a locally running Hive server instance that I could use to query various types of data, to get a better idea of what is possible with Hive and how maintainable it would be in a future production architecture. This tutorial does not take hardening or security into account, nor does it address scalability.

I do all of my development on a MacBook Pro with 16GB of RAM and a 2.2 GHz Intel Core i7, running OS X Yosemite. My proof of concept systems generally run on Docker using boot2docker, but in this case it was easier for me to build directly within a VirtualBox instance. I am using Ubuntu Server 14.04.2 as the base image for the VirtualBox virtual machine.

I set the networking for the VirtualBox instance to Bridged so that my main MacBook can access it.
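
If you prefer to script that step, bridged networking can also be switched on from the host with VBoxManage. This is just a convenience sketch; the VM name hive-poc and the host adapter en0 are assumptions, so substitute your own values:

# Hypothetical VM name and host adapter; adjust both for your machine
VBoxManage modifyvm "hive-poc" --nic1 bridged --bridgeadapter1 en0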

Set Up Hadoop

1. Create a new VirtualBox image. I granted mine 1GB of RAM and 8GB of disk space and installed it with the latest Ubuntu Server ISO to get started.

2. Once installation is complete, log in as your admin user.

3. Set up the Guest Additions CD (Devices -> Insert Guest Additions CD). This will make copy-pasting from the host into the VM easier.

4. Make sure SSHD is working properly by executing "ssh localhost" and accepting the server certificate.

5. Install the packages we need: sudo apt-get install -y ssh openjdk-7-jre openjdk-7-jdk wget vim

6. Work out of /tmp so we have a clean baseline: cd /tmp

7. Download Hadoop: wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

8. tar xzf hadoop-2.7.0.tar.gz

9. sudo mv hadoop-2.7.0 /usr/local/hadoop

10. export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

11. export HADOOP_PREFIX=/usr/local/hadoop

12. export PATH=/usr/local/hadoop/bin:$PATH (see the note after this list on making these exports permanent)

13. Modify /usr/local/hadoop/etc/hadoop/core-site.xml to add the following inside of the <configuration> tag:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

14. Modify /usr/local/hadoop/etc/hadoop/hdfs-site.xml to add the following inside of the <configuration> tag:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

15. Set up passwordless SSH so the Hadoop start scripts can log into localhost without prompting:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
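
The environment variables exported above disappear when you log out. Purely as a convenience (not something the Hadoop docs require), you can append them to ~/.bashrc so they survive reboots of the VM; HIVE_HOME is included here because we set it later in this tutorial:

# Optional: persist the environment variables across logins
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HADOOP_PREFIX=/usr/local/hadoop
export HIVE_HOME=/usr/local/hive
export PATH=/usr/local/hadoop/bin:$PATH
EOF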

Prepare HDFS

1. cd /usr/local/hadoop

2. bin/hdfs namenode -format

3. Modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh by adding this line at the end:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

4. sbin/start-dfs.sh (be sure to say yes when the SSH fingerprint verification comes up; see the quick check after this list to confirm the daemons started)

5. bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/justin (replace justin with your own username)

6. bin/hdfs dfs -put etc/hadoop input
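
Before moving on, it is worth verifying that the HDFS daemons actually started. This quick check is my own addition, not part of the official setup; jps ships with the JDK and lists running Java processes:

jps
# Expect to see NameNode, DataNode and SecondaryNameNode (PIDs will differ)
bin/hdfs dfs -ls /user
# Should list the home directory created in step 5
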
Prepare Hive

1. wget http://mirrors.advancedhosters.com/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

2. tar xzf apache-hive-1.1.0-bin.tar.gz

3. sudo mv apache-hive-1.1.0-bin /usr/local/hive

4. export HIVE_HOME=/usr/local/hive

5. rm /usr/local/hive/lib/hive-jdbc-1.1.0-standalone.jar

6. rm /usr/local/hadoop/share/hadoop/yarn/lib/jline-0.9.94.jar (the old jline that ships with Hadoop conflicts with the newer jline bundled with Hive 1.1 and breaks the Hive shell)

7. cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh

8. Add the following lines to the end of /usr/local/hive/conf/hive-env.sh

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HIVE_HOME=/usr/local/hive

9. Start the HiveServer2 instance:

/usr/local/hive/bin/hive --service hiveserver2
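
HiveServer2 listens on port 10000 by default, and startup can take a few seconds. From a second terminal on the VM you can confirm it is accepting connections; this is just a sanity step of mine, not part of the Hive docs:

# Confirm HiveServer2 is listening on its default port (10000)
netstat -nl | grep 10000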

Simple Python Client

1. Back on your host computer, install the Python requirements:

sudo easy_install pip
pip install pyhs2

2. Create a simple script (test.py), replacing the host with the IP address of your VirtualBox VM (you can get it by running ifconfig on the VirtualBox instance):

import pyhs2

# Connect to HiveServer2 and list the databases it knows about
with pyhs2.connect(host='192.168.20.82',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hdfs',
                   password='hdfs',
                   database='default') as conn:
    with conn.cursor() as cur:
        print cur.getDatabases()

3. Run the script to test it:

python test.py

You should see output that looks like this:

[['default', '']]
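
As a final cross-check you can also query HiveServer2 from inside the VM with beeline, the JDBC command-line client that ships with Hive. This is optional, and it assumes the same hdfs/hdfs credentials used in the Python script:

# Run a quick query through beeline from inside the VM
/usr/local/hive/bin/beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -e "show databases;"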

References:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation#AdminManualInstallation-InstallingfromaTarball

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver