Getting a Hive proof of concept up and running is not very intuitive, and I had to dig through a mishmash of documents to find all of the pieces. I hope these instructions help somebody else get their proof-of-concept instance running sooner!
My goal was to have a locally running Hive Server instance that I could use to query various types of data, to get a better idea of what is possible with Hive and how maintainable it would be in a future production architecture. This tutorial does not cover hardening, security, or scaling.
I do all of my development on a MacBook Pro with 16GB of RAM and a 2.2 GHz Intel Core i7, running OS X Yosemite. My proof of concept systems generally run on Docker using boot2docker, but in this case it was easier for me to build it directly within a VirtualBox instance. I am using Ubuntu Server 14.04.2 as my base image for the VirtualBox virtual machine.
I set the networking for the VirtualBox instance to Bridged so that my MacBook can access it.
1. Create a new VirtualBox image. I granted mine 1GB of RAM and 8GB of disk space and installed it with the latest Ubuntu Server ISO to get started.
2. Once installation is complete, log in as your admin user.
3. Set up the Guest Additions CD (Devices -> Insert Guest Additions CD). This will make copy-pasting from the host into the VM easier.
4. Make sure SSHD is working properly by executing “ssh localhost” and accepting the host key.
5. sudo apt-get install -y ssh openjdk-7-jre openjdk-7-jdk wget vim
6. Work out of /tmp so we have a clean baseline: cd /tmp
7. Download Hadoop: wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
8. tar xzf hadoop-2.7.0.tar.gz
9. sudo mv hadoop-2.7.0 /usr/local/hadoop
10. export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
11. export HADOOP_PREFIX=/usr/local/hadoop
12. export PATH=/usr/local/hadoop/bin:$PATH
13. Modify /usr/local/hadoop/etc/hadoop/core-site.xml to add the following inside of the <configuration> tag:
14. Modify /usr/local/hadoop/etc/hadoop/hdfs-site.xml to add the following inside of the <configuration> tag:
15. Set up passwordless SSH:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
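The core-site.xml and hdfs-site.xml edits in the steps above are easier to repeat if they are scripted. The sketch below adds a <property> element to the XML of a Hadoop configuration file. The property names and values it lists (fs.defaultFS set to hdfs://localhost:9000 and dfs.replication set to 1) are assumptions taken from the stock Hadoop single-node setup guide, not settings confirmed by this post, and add_property is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ASSUMED values from the stock Hadoop single-node guide, not from this post.
ASSUMED_SETTINGS = {
    "core-site.xml": ("fs.defaultFS", "hdfs://localhost:9000"),
    "hdfs-site.xml": ("dfs.replication", "1"),
}


def add_property(config_xml, name, value):
    """Return config_xml with name/value set as a <property> in <configuration>."""
    root = ET.fromstring(config_xml)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # The property already exists: overwrite its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No matching property was found: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root).decode("utf-8")
```

Reading each file, passing its contents through add_property, and writing the result back would apply the change. Note that ElementTree drops XML comments when parsing this way, so the license header in the stock files would not survive the rewrite.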
1. cd /usr/local/hadoop
2. bin/hdfs namenode -format
3. Modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh by adding a line at the end:
4. sbin/start-dfs.sh (be sure to say yes when SSH fingerprint verification comes up)
5. bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/justin (replace justin with your own username)
6. bin/hdfs dfs -put etc/hadoop input
1. wget http://mirrors.advancedhosters.com/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz
2. tar xzf apache-hive-1.1.0-bin.tar.gz
3. sudo mv apache-hive-1.1.0-bin /usr/local/hive
4. export HIVE_HOME=/usr/local/hive
5. rm /usr/local/hive/lib/hive-jdbc-1.1.0-standalone.jar
6. rm /usr/local/hadoop/share/hadoop/yarn/lib/jline-0.9.94.jar
7. cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh
8. Add the following lines to the end of /usr/local/hive/conf/hive-env.sh
9. Start the Hive Server2 Instance:
/usr/local/hive/bin/hive --service hiveserver2
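HiveServer2 can take a little while to start accepting connections after the command returns control to the JVM. As a rough readiness check, the sketch below polls a TCP port until something is listening. It assumes HiveServer2 is on its default Thrift port, 10000, with no hive-site.xml override; wait_for_port is a hypothetical helper name, and the IP address in the comment is a placeholder.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.time() + timeout
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=interval)
            sock.close()
            return True
        except socket.error:
            # Connection refused or timed out: retry until the deadline passes.
            if time.time() >= deadline:
                return False
            time.sleep(interval)


# Example call (placeholder address, assumed default port):
# wait_for_port("192.168.1.50", 10000)
```

A successful TCP connection only shows that the port is open, not that Hive can answer queries, so treat it as a first check before running a real client.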
Simple Python Client
1. Back on your host computer, install the Python requirements:
sudo easy_install pip
pip install pyhs2
2. Create a simple script (test.py) – replace the host with the IP address of your VirtualBox (you can get it by running ifconfig on the VirtualBox instance):
database='default') as conn:
    with conn.cursor() as cur:
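A fuller version of test.py might look like the sketch below. It assumes pyhs2's connect() accepts host, port, authMechanism, user, password, and database keyword arguments, that HiveServer2 is on its default port 10000 with the default (non-Kerberos) authentication, and that the user and password values are placeholders. HIVE_HOST is a hypothetical environment variable introduced here so that the IP address does not have to be edited into the file.

```python
import os


def connection_params(host, port=10000, user="hive", password="unused"):
    """Build keyword arguments for pyhs2.connect(); user/password are placeholders."""
    return {
        "host": host,
        "port": port,
        "authMechanism": "PLAIN",
        "user": user,
        "password": password,
        "database": "default",
    }


def show_databases(host):
    """Connect to HiveServer2 on host and print the databases it reports."""
    import pyhs2  # imported here so the file still loads where pyhs2 is missing

    with pyhs2.connect(**connection_params(host)) as conn:
        with conn.cursor() as cur:
            print(cur.getDatabases())


# HIVE_HOST is a hypothetical variable for this sketch: set it to the VM's IP.
hive_host = os.environ.get("HIVE_HOST")
if hive_host:
    show_databases(hive_host)
```

With this layout the script does nothing unless HIVE_HOST is set, so it could be invoked as HIVE_HOST=<VM IP address> python test.py.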
3. Run the script to test it:
You should see some output that looks like this: