DeMar.is

How-To: VirtualBox Hive Server with Python Client

06 Monday Apr 2015

Posted by Justin DeMaris in Engineering

Tags: engineering, hadoop, hive

Getting a Hive proof of concept up and running is not very intuitive, and I had to work through a mishmash of documents to find all of the pieces. I hope these instructions help somebody else get their proof-of-concept instance running sooner!

Baseline

My goal was to have a locally running Hive Server instance that I could use to query various types of data, to get a better idea of what was possible with Hive and how maintainable it would be in a future production architecture. This tutorial does not take into account any degree of hardening or security, nor does it address scalability.

I do all of my development on a MacBook Pro with 16GB of RAM and a 2.2 GHz Intel Core i7, running OS X Yosemite. My proof-of-concept systems generally run on Docker using boot2docker, but in this case it was easier for me to build it directly within a VirtualBox instance. I am using Ubuntu Server 14.04.2 as my base image for the VirtualBox virtual machine.

I set the networking for the VirtualBox instance to Bridged so that my main MacBook can access it.

Setup Hadoop

1. Create a new VirtualBox image. I granted mine 1GB of RAM and 8GB of disk space and installed it with the latest Ubuntu Server ISO to get started.

2. Once installation is complete, log in as your admin user.

3. Set up the Guest Additions CD (Devices -> Insert Guest Additions CD). This will make copy-pasting from the host into the VM easier.

4. sudo apt-get install -y ssh openjdk-7-jre openjdk-7-jdk wget vim

5. Make sure SSHD is working properly by executing "ssh localhost" and accepting the server's host key.

6. Work out of /tmp so we have a clean baseline: cd /tmp

7. Download Hadoop: wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

8. tar xzf hadoop-2.7.0.tar.gz

9. sudo mv hadoop-2.7.0 /usr/local/hadoop

10. export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

11. export HADOOP_PREFIX=/usr/local/hadoop

12. export PATH=/usr/local/hadoop/bin:$PATH

13. Modify /usr/local/hadoop/etc/hadoop/core-site.xml to add the following inside of the <configuration> tag:

<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>

14. Modify /usr/local/hadoop/etc/hadoop/hdfs-site.xml to add the following inside of the <configuration> tag:

<property><name>dfs.replication</name><value>1</value></property>

15. Set up passwordless SSH:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
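The exports above only last for the current shell session, so they disappear when you log out of the VM. A minimal way to persist them, assuming a bash login shell, is to append them to ~/.bashrc:

```shell
# Persist the Hadoop environment for future sessions (bash assumed)
cat >> ~/.bashrc <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HADOOP_PREFIX=/usr/local/hadoop
export PATH=/usr/local/hadoop/bin:$PATH
EOF
```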

Prepare HDFS

1. cd /usr/local/hadoop

2. bin/hdfs namenode -format

3. Modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh by adding a line at the end:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

4. sbin/start-dfs.sh (be sure to say yes when SSH fingerprint verification comes up)

5. bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/justin (replace justin with your own username)

6. bin/hdfs dfs -put etc/hadoop input
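Before moving on to Hive, it is worth sanity-checking that the daemons actually came up. A quick check from the same directory (jps ships with the JDK):

```shell
# The Java process list should include NameNode, DataNode and SecondaryNameNode
jps
# Confirm the upload landed in your HDFS home directory
bin/hdfs dfs -ls input
```

If jps is missing any of the daemons, check the logs under /usr/local/hadoop/logs before continuing.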

Prepare Hive

1. wget http://mirrors.advancedhosters.com/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

2. tar xzf apache-hive-1.1.0-bin.tar.gz

3. sudo mv apache-hive-1.1.0-bin /usr/local/hive

4. export HIVE_HOME=/usr/local/hive

5. rm /usr/local/hive/lib/hive-jdbc-1.1.0-standalone.jar

6. rm /usr/local/hadoop/share/hadoop/yarn/lib/jline-0.9.94.jar

7. cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh

8. Add the following lines to the end of /usr/local/hive/conf/hive-env.sh

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HIVE_HOME=/usr/local/hive

9. Start the Hive Server2 Instance:

/usr/local/hive/bin/hive --service hiveserver2
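To verify that HiveServer2 is actually listening before writing any client code, you can connect from inside the VM with the Beeline client that ships in the Hive tarball (the hdfs/hdfs credentials are just the placeholder ones this tutorial uses):

```shell
# Connect to the local HiveServer2 over JDBC and list databases
/usr/local/hive/bin/beeline -u jdbc:hive2://localhost:10000 -n hdfs -p hdfs -e 'show databases;'
```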

Simple Python Client

1. Back on your host computer, install the python requirements:

sudo easy_install pip
pip install pyhs2

2. Create a simple script (test.py) – replace the host with the IP address of your VirtualBox (you can get it by running ifconfig on the VirtualBox instance):

import pyhs2

with pyhs2.connect(host='192.168.20.82',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hdfs',
                   password='hdfs',
                   database='default') as conn:
    with conn.cursor() as cur:
        print cur.getDatabases()

3. Run the script to test it:

python test.py

You should see output that looks like this:

[['default', '']]
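Once the connection works, running real queries follows the same pattern. A sketch using pyhs2's cursor calls (the demo table name is hypothetical; this is Python 2, matching the print statement above):

```python
import pyhs2

# Same connection settings as test.py; replace the host with your VM's IP
with pyhs2.connect(host='192.168.20.82',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hdfs',
                   password='hdfs',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Create a throwaway table, then read it back
        cur.execute("CREATE TABLE IF NOT EXISTS demo (id INT, name STRING)")
        cur.execute("SELECT * FROM demo")
        for row in cur.fetch():
            print row
```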

References:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation#AdminManualInstallation-InstallingfromaTarball

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver
