How-To: VirtualBox Hive Server with Python Client

Getting up and running with a Hive proof of concept is not very intuitive and required me to go through a mishmash of documents to find all of the pieces. I hope these instructions help somebody else get their proof of concept instance up and running sooner!

Baseline

My goal was to have a local running Hive Server instance that I could use to query various types of data to get a better idea of what was possible with Hive and how maintainable it would be in a future production architecture. This tutorial does not take into account any degree of hardening or security, nor major scalability.

I do all of my development on a MacBook Pro with 16GB of RAM and a 2.2 GHz Intel Core i7, running OS X Yosemite. My proof of concept systems generally run on Docker using boot2docker, but in this case it was easier for me to build it directly within a VirtualBox instance. I am using Ubuntu Server 14.04.2 as my base image for the VirtualBox Virtual Machine.

I set the networking for the VirtualBox instance to Bridged so that my MacBook can access it.

Setup Hadoop

1. Create a new VirtualBox image. I granted mine 1GB of RAM and 8GB of disk space and installed it with the latest Ubuntu Server ISO to get started.

2. Once installation is complete, log in as your admin user.

3. Set up the Guest Additions CD (Devices -> Insert Guest Additions CD). This will make copy-pasting from the host into the VM easier.

4. Make sure SSHD is working properly by executing “ssh localhost” and accepting the server certificate.

5. sudo apt-get install -y ssh openjdk-7-jre openjdk-7-jdk wget vim

6. Work out of /tmp so we have a clean baseline: cd /tmp

7. Download Hadoop: wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

8. tar xzf hadoop-2.7.0.tar.gz

9. sudo mv hadoop-2.7.0 /usr/local/hadoop

10. export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

11. export HADOOP_PREFIX=/usr/local/hadoop

12. export PATH=/usr/local/hadoop/bin:$PATH

13. Modify /usr/local/hadoop/etc/hadoop/core-site.xml to add the following inside of the <configuration> tag:

<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>

14. Modify /usr/local/hadoop/etc/hadoop/hdfs-site.xml to add the following inside of the <configuration> tag:

<property><name>dfs.replication</name><value>1</value></property>

15. Set up passwordless SSH

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
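
If you would rather script the two configuration edits (core-site.xml and hdfs-site.xml) than make them by hand in vim, here is a minimal Python 3 sketch. The helper name set_property is my own, not part of Hadoop, and it works on XML strings only. Reading and writing the real files is left to you, and note that ElementTree drops the license comment that ships at the top of those files.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml, name, value):
    """Return config_xml with a <property> for name/value inside <configuration>.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    # Reuse an existing <property> with this name, otherwise create one.
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            break
    else:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
    value_el = prop.find("value")
    if value_el is None:
        value_el = ET.SubElement(prop, "value")
    value_el.text = value
    return ET.tostring(root, encoding="unicode")


# The same two settings used in this how-to, applied to empty configurations.
core_site = set_property("<configuration></configuration>",
                         "fs.defaultFS", "hdfs://localhost:9000")
hdfs_site = set_property("<configuration></configuration>",
                         "dfs.replication", "1")
print(core_site)
print(hdfs_site)
```

Calling set_property a second time with the same name updates the existing entry instead of adding a duplicate, so the script is safe to re-run.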

Prepare HDFS

1. cd /usr/local/hadoop

2. bin/hdfs namenode -format

3. Modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh by adding a line at the end:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

4. sbin/start-dfs.sh (be sure to say yes when SSH fingerprint verification comes up)

5. bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/justin (replace justin with your own username)

6. bin/hdfs dfs -put etc/hadoop input

Prepare Hive

1. wget http://mirrors.advancedhosters.com/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

2. tar xzf apache-hive-1.1.0-bin.tar.gz

3. sudo mv apache-hive-1.1.0-bin /usr/local/hive

4. export HIVE_HOME=/usr/local/hive

5. rm /usr/local/hive/lib/hive-jdbc-1.1.0-standalone.jar

6. rm /usr/local/hadoop/share/hadoop/yarn/lib/jline-0.9.94.jar

7. cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh

8. Add the following lines to the end of /usr/local/hive/conf/hive-env.sh

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HIVE_HOME=/usr/local/hive

9. Start the HiveServer2 instance:

/usr/local/hive/bin/hive --service hiveserver2
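
Before moving on to the client, it can save some head-scratching to confirm that the host can actually reach HiveServer2 over the bridged network. This is a rough Python 3 sketch using only the standard library. The function name port_is_open is my own, and it only checks that something accepts TCP connections on the port (10000, the same port the client below uses), not that it is really Hive.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example (replace the address with the IP of your VirtualBox instance):
# print(port_is_open("192.168.20.82", 10000))
```

If this returns False, check that the VM is using Bridged networking and that the hiveserver2 process is still running.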

Simple Python Client

1. Back on your host computer, install the python requirements:

sudo easy_install pip
pip install pyhs2

2. Create a simple script (test.py) – replace the host with the IP address of your VirtualBox (you can get it by running ifconfig on the VirtualBox instance):

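import pyhs2

# Connect to HiveServer2 on the VM (replace host with your VM's IP address)
with pyhs2.connect(host='192.168.20.82',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hdfs',
                   password='hdfs',
                   database='default') as conn:
    with conn.cursor() as cur:
        # List the databases visible to this user
        print(cur.getDatabases())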

3. Run the script to test it:

python test.py

You should see some output that looks like this:

[['default', '']]

References:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation#AdminManualInstallation-InstallingfromaTarball

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

How To: Apiary / GitHub README.md Integration

Are you using Apiary for documenting your API and GitHub for storing your code base? Apiary has a really cool feature that will keep your markdown stored inside of your git repository and update the Apiary version every time you push a change to GitHub. To set this up, go to the Settings section of your Apiary account by clicking this icon on the top bar:

Apiary Settings Icon

Then scroll to the bottom of the page and you will see a “Link Your GitHub Repository” section:

Link GitHub

Check the box to grant access to the private repositories (if you’re connecting to a private one) and then click Connect to GitHub. Once you have finished the authorization process you will be back on the same page in Apiary, but it will look like this:

GitHub Repo Selector

Click on the repository you want to connect it to and click Go. Apiary will add a file to the master branch of the repository named apiary.apib. If you update the Apiary documentation on their website and save it, it will do a git commit to keep the repository up to date. Even better, if you commit changes to the Apiary markdown inside of your repository and push it to GitHub, the latest documentation will be reflected in Apiary! This makes it very convenient to keep your Apiary in lock step with your code and you can even apply branching and tagging to it now.

Now for the pièce de résistance: integration with your README.md!

As you probably know, GitHub supports a special file in the root of the repo called README.md. This is a Markdown file that almost everybody uses to display the documentation for the repository. GitHub renders it below the root file listing on the repository homepage. Since the markdown dialects for Apiary and GitHub are reasonably similar, it would be really awesome if your Apiary documentation showed up here!

Luckily for us, git supports symlinks. Check out your repository and get rid of your existing README.md file (if any). I actually moved mine to INSTALL.md since it was more appropriate for installation than primary documentation. Now run the following from your command line (Linux and Mac Only – Sorry Windows folks, I have no clue if Windows has some version of symlinks yet):

ln -s apiary.apib README.md
git add README.md
git commit -m "Symlinking README.md to apiary.apib thanks to demar.is"
git push origin master
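
If you want to do the linking step from a script (say, as part of a repo bootstrap), here is a small Python 3 sketch of the equivalent of the ln -s line above. The function name link_readme is my own invention, and the demo runs against a throwaway temp directory instead of a real repository. You would still need the git add / commit / push steps afterwards.

```python
import os
import tempfile


def link_readme(repo_dir, source="apiary.apib", link_name="README.md"):
    """Create repo_dir/README.md as a relative symlink to apiary.apib."""
    link_path = os.path.join(repo_dir, link_name)
    if os.path.lexists(link_path):
        # Get rid of an existing README.md (move it elsewhere first if you need it).
        os.remove(link_path)
    # Relative target, same as running `ln -s apiary.apib README.md` in the repo root.
    os.symlink(source, link_path)
    return link_path


# Demo against a temporary directory standing in for the checked-out repository.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "apiary.apib"), "w") as handle:
    handle.write("FORMAT: 1A\n")
link = link_readme(demo_dir)
print(os.readlink(link))
```

Keeping the link target relative matters: an absolute path would point at your own checkout and break for everybody else who clones the repository.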

Voila! Other than the FORMAT: 1A line at the top, this looks great!

GitHub Apiary Link Complete

Union of Ages

Even though I live in New York City, I seldom get to play tourist here myself. This weekend, I had the lovely opportunity to take a guided walking tour of Union Square operated by Big Onion Walking Tours. The tour was free this time since it was sponsored by the Union Square Partnership. Our guide was excellent and I fully intend to go back for some of the paid tours later in the year. This specific tour was particularly intriguing to me because I work in the area and I walk through Union Square every day.

The tour started off at the Abe Lincoln statue. Apparently this statue pissed off a lot of people when it was erected in 1869, just four years after his assassination. At that point, Lincoln was like a god to the American people and this statue portrayed him too much like a normal man.

President Abraham Lincoln Statue

Union Square was an apropos location to erect a statue to Lincoln since New York City had a mixed history with him. When the south seceded from the Union, New York City was very close to seceding as well since the south was the massive agricultural complex that drove the economy and New York City has always been very tied to finance. Luckily they did not secede, but the worst riots in the history of the United States took place around Union Square when the draft was started for the Civil War.

Moving west from the statue, we got to see some of the oldest buildings in Union Square. The guide passed around a picture from the early 1900s and you can see that the exact same buildings with only minor modifications are still standing.

Old Buildings (1800s)

Obviously the above picture is a shot I took today, not the black and white one from a century ago. It is, however, incredible to see some piece of New York City so lasting. The second building from the left in this picture is the Decker Building, which I took a closer shot of as well:

Decker Building

This building is important for two reasons. First off, it is architecturally significant as it was one of the first buildings designed on the concept that form follows function. It was originally built as a piano store and designed to be elegant but creative to reflect the elegance of the pianos that they were trying to peddle. Secondly, it is the building where Andy Warhol had his factory and where Valerie Solanas attempted to murder him by shooting him three times after lying in wait in his factory.

Proceeding further south, we got to see a beautiful statue of a woman providing for two emaciated children who remind me very strongly of Ignorance and Want from A Christmas Carol.

Anti-Alcohol Water Fountain

We could not get close enough to see the details, but apparently the statue is also a water fountain. It was erected as part of the precursor to prohibition and was one of the first public drinking fountains. It used to keep tin cups alongside the fountain so that anyone passing by thirsty could drink water instead of being tempted by the more easily available beer at a nearby pub.

In front of this statue, embedded into the ground is a copper map of Union Square from when it was first assembled.

Embedded Map of Old Union Square

It is still a bit covered in snow, even though the weather was finally pleasant enough to encourage melting, but the most intriguing bit is still visible: the stretch of streetcar roadway called Dead Man’s Curve. Back in the days before television (or even the movies), people were always in search of other forms of entertainment. As much as I can relate to the lack of TV, I’ve never seen something quite so macabre myself. This curve was so sharp and had such low visibility that someone getting struck by the streetcar careening around the corner was almost a daily occurrence. People would come to watch from nearby windows and place bets on each streetcar as to whether someone would die this time.

Offsetting this morbid story, there is a statue in the southwest corner of the square that I have passed many times and never noticed before: the statue of Mahatma Gandhi.

Mahatma Gandhi Statue

He blends in so well with the trees and shrubs and has such an understated figure that he is the least noted piece of art in the entire square.

Continuing counter-clockwise around the park, you can see another presidential statue. This time, George Washington astride a horse.

President George Washington Statue

Pay close attention to his outstretched hand, as it is very important to another location-based art piece in another part of the park.

The next part of the tour is what really grabbed my attention. The centerpiece of Union Square is this flagpole:

Independence Flagstaff

It is officially named the Independence Flagstaff, but it was originally supposed to be named the Charles F. Murphy Flagstaff. Does anybody know who Charles F. Murphy was? I have been slowly working my way through The Power Broker: Robert Moses and the Fall of New York (an incredible book that well deserves many posts once I have completed it). Robert Moses was one of, if not the, most powerful men in the history of New York City and he came to power through the political machine of Tammany Hall. Charlie Murphy was the most powerful leader of Tammany Hall and second most well known after Boss Tweed. It turns out that the entirety of Union Square is littered with Tammany Hall history! Before they really lost most of their power, the old Tammany Hall clubhouse was located in the southeast corner of Union Square. The very last Tammany-related building is still standing and is now the New York Film Academy.

Former Tammany Hall Clubhouse

After stopping by the statue of the Marquis De Lafayette, we ended the tour at the southeastern corner of Union Square to answer one of the questions that was burning in my mind from earlier that day. What in the world is up with that giant needle on the building down there?

Time Art Needle

It turns out that this entire building wall is a single piece of art meant to demonstrate how impossible it is to really capture time. The ripples are ripples in time, the needle is a reference to a metronome. And at the very top of those ripples, reaching out through time to us is the hand of George Washington.

Time Art Needle and Hand

It is a larger scale model of the same hand on the George Washington statue facing down from the south side of the park and shows us that it is the victors that get to reach forward in time and write history.

I don’t want to spoil all of the other details of the tour, so I will leave you here. I highly recommend reaching out to Big Onion Walking Tours and taking this same tour as I have only gently grazed on the layers of history they peeled back for me.

Modulus Sorting – Alternative to ORDER BY RAND()

If you have been doing LAMP (or anything MySQL-related) for a while, you already know that the simplest way of getting a random set of rows from a table, ORDER BY RAND(), is a really Bad Idea™. The guts of SQL require that this solution generate a random number for each row in your result set and then sort by it. Random numbers are generally expensive, and one has to be generated for every row, which becomes quite the bottleneck as your result set grows.

I have seen other alternatives to this:

I am testing out another approach that I’m calling “modulus” sorting. The ID approach is decent, but as tables get large, you end up doing a large data transfer to get all of those IDs. My solution may not be truly random, but it appears to be doing nicely so far.

First query:
$query = "SELECT COUNT(*) AS c FROM `table`";

Second query:
$query = "SELECT * FROM `table` ORDER BY `id` % " . mt_rand(0, $count / 2);

Essentially, we pull the number of records in the table and assume that the IDs are fairly evenly distributed and then sort them by the modulus of their ID and a random value. The modulus operator is quite a bit faster than RAND() and the results I have been getting so far are random enough for my purposes. I’ll drop in updates as we see how it goes!
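
To make the idea easier to play with outside of PHP and MySQL, here is a rough Python 3 sketch of the same two-query approach against an in-memory SQLite database. The table name items and the function name are made up for the demo. I also made two changes that are my own assumptions rather than part of the queries above: the random modulus starts at 2 (a modulus of 0 or 1 would not shuffle anything), and the id is added as a tiebreaker so the order within each remainder group is predictable.

```python
import random
import sqlite3


def modulus_order_query(row_count, rng=random):
    """Build the 'second query' with a random modulus between 2 and row_count / 2."""
    # Assumption: never pick 0 or 1, since neither produces a useful ordering.
    modulus = rng.randint(2, max(2, row_count // 2))
    # The modulus is an integer we generated ourselves, so formatting it in is safe.
    query = "SELECT id FROM items ORDER BY id % {0}, id".format(modulus)
    return query, modulus


# Demo data: a hypothetical table with ids 1..20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)",
                 [(i,) for i in range(1, 21)])

# First query: count the rows. Second query: sort by id modulo a random value.
row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
query, modulus = modulus_order_query(row_count)
ids = [row[0] for row in conn.execute(query)]
print(modulus, ids)
```

Run it a few times and you will see why this is "random enough" and not truly random: ids that share a remainder always travel together, so the set of possible orderings is much smaller than a real shuffle.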

Tie Guy, Reporting for Duty

“I’ve wanted to ask for a while… why do you wear a tie?”

Not an unusual question for me. My office dress code is perfectly fine with jeans and a T-Shirt; in fact, that’s the norm. Plaid flannel, beige shorts, “pumped up kicks”, all good. As far as I know, I’m the only one who shows up in a shirt and tie on a daily basis. Honestly, the fact that I can dress differently and not be an outsider is an ironic testament to the open-mindedness of the office culture.

But why do I do it?

The simple answer: I like it.

The longer answer… well that’s more complicated. Personally, the process of getting ready in the morning and “suiting up” (sans blazer during the warmer months) gets me in the mood for work every day. It is a ritual sacrifice to the gods of business in hopes they will bless me with fewer bug reports. But when I get those bug reports, you can bet I’m ready to take them on.

To me, the tie is a reminder that I am part of a larger whole. It is a reminder in looking professional that I should also act professional. When I’m having a bad day and drudging through complicated problems, it’s a reminder that whenever it sucks for me, it sucks even worse for somebody else who is using my creations. Every moment I haven’t fixed the problem, they are less productive. It’s a reminder that as much as it would be nice to pass the problem on to somebody and forget about it, it is my responsibility, as part of the bigger group, to make sure it gets solved and not endlessly passed around. The people who need it fixed are funding our paychecks and they can take their money elsewhere. It is a reminder that when I screw up and it could affect other members of the greater team, they have a right to know as soon as possible so they can have the best shot at solving whatever problems it may cause them.

If a problem becomes frustrating and I want to just throw up my hands, it is a reminder that getting angry doesn’t help the company. Holding grudges doesn’t solve the problems for the customer. It’s a reminder to stop taking it personally and just look for the path that leads to an answer, no matter how much patience or time that answer is going to take. As a professional, it is my responsibility to have answers, get answers or direct people to answers. We’re all in it together.

You don’t need a tie to be professional, and a lot of people are much more comfortable not wearing one. Mandatory dress codes enforcing ties or suits don’t lead to more professional employees.

I just like ties.

The tie is my talisman.

Flow Revisited

During a big buy-out announcement recently within my company, one of the speakers mentioned the book Drive: The Surprising Truth About What Motivates Us. Now that I spend a lot of time as an NYC straphanger, I get a lot of reading in during my daily commute, and I decided that, coming from influential and successful people, it was a safe bet that it was a good read.

Let me just say up front: wow. It is not a smooth read, and at times the writing style is a bit repetitive, but the content of the book is amazing, and the majority of it is references to psychological studies around the world. If you have any desire to understand what motivates you or how you as a manager could motivate your employees, this is a must read. I’ve discovered things about myself that I did not effectively realize before, and beyond that discovered that I am not alone in them.

The core concept of the book is that human beings are far more complicated than can be modeled with a simple reward-driven behavior (“carrot and stick”). We have an innate desire to solve problems as well. When it comes to salary, as long as we are receiving fair pay or are not having money problems, more money is a very poor motivator. In fact, if you dangle the carrot of a reward (especially a monetary one) as a reason to do something, you can severely damage long term motivation and the desire to excel on that project. Essentially you can turn a fun project into drudgery by associating money with it.

They actually brought up Open Source web development as an example of how great things can come out of volunteer work. I take a bit of issue with that since most successful open source projects have some of the core pieces in place and are only supported because big money-spending companies back them and pay people within their own companies to make contributions. However, the idea is still reasonably sound.

Bringing this back to my previous post on the “good vs bad” sides of achieving “flow“, the book noted that it becomes much more difficult to reach flow when you are working on a project for a monetary reward. This helps to explain why when you are writing code for fun and solving problems on your own time, you can end up glancing bleary-eyed out of the window and realize it’s 5AM. If you are doing the same project for work because somebody is pushing a deadline on you, you can be checking the clock every five minutes until it’s 6PM and you can leave without feeling too guilty.

In independent analysis of the results of work in both situations, the people building for fun very consistently create higher quality code.

I say score ++ for flow.

Facebook Harassing Goo.gl Users?

So I work in social media for a large advertising-centric company, and one of the things that I do is test integration with Facebook Pages. I run a simple test Facebook Page to try out different things on, and I recently started having issues with including any goo.gl URL in a post to the page:

If I have a simple link to XKCD, shortened with Goo.GL: http://goo.gl/b4UzX and shortened with Bitly: http://bit.ly/1pB1Uk, I get the following results when I try to post them onto the wall of the page:

Original XKCD link: posts and embeds just fine
Bitly link: posts and embeds just fine
Goo.gl link: always asks to solve a CAPTCHA

I’ve tried it with a bunch of other sites and some other URL shorteners and have had no issues. Is anyone else having this problem? Any other Google-related services that are being tampered with? It’s not blocking it, it just feels like harassment!

Programmer “Flow”: Good or Bad?

If you’ve done any mind-intensive task, even taking a very challenging but doable exam, you probably understand the concept of “flow”. Flow (aka getting “in the zone”) happens when you are hyper-focused on a particular problem or task, and you are able to provide continuous output on the problem, pretty much non-stop.

Personally, coffee helps me get in the flow a lot. If I have some audio distractions around me, then Trance music can help too (preferably very repetitive and with no words). I know some people who compare the feeling of this “flow” to what happens when you’re on Adderall. The rest of the world and the seconds that it ticks with just disappear, and as a programmer, your fingers never stop typing.

It’s an amazing feeling, and I have within the past month read two different authoritative reviews on how it really affects your productivity. According to Robert C Martin in The Clean Coder, flow is patently bad. He compares it to being hypnotized (I’m not arguing there), but he says code produced in the flow is code that is unaware of the bigger picture, and not flexible. A lot of code produced in this way has to be revisited and re-architected at a later date because it will be single-minded and not jibe with the rest of the program.

Just today, I read one of the old Joel on Software posts by Joel Spolsky entitled “Where do These People Get Their (Unoriginal) Ideas?” where he says that achieving flow is one of the basic requirements of programming in general, as focusing on the problem very closely helps you solve it quickly and more thoroughly.

What are your experiences with this?

Personally, I fall more on Joel’s side, but with some reservations. You need to make sure you have your environment set up correctly and you know proper programming standards. My most solid code has been written in the flow while following TDD. I was able to easily refactor it for later additions and it was solid, usable and well documented (I write all of my PHP code with PHPDoc / JavaDoc style comments). Putting all of these together and still focusing hard on the problem gives me excellent progress while still generating maintainable code.

Caribbean Cruise 2011

So for a few years now, my girlfriend and I have gone on a cruise in the Caribbean Sea in December. This year, we brought her parents along and hit San Juan, St. Thomas, St. Lucia, St. Kitts and St. Martin. I just bought a new Sony Alpha A-500 and took 1,600 pictures throughout the cruise! I was brand new to the camera and just getting used to the lens and exposure options, but here are a few of my favorite pics!

Speaking to NY-PHP Bootup to Startup

I will be speaking to the Bootup to Startup with NY PHP group on Tuesday, February 28th at IBM. The topic for discussion is “Talk Virtualization” and this is the abstract:

Virtualization has changed the platform playing field for web development considerably. With the advent of platforms like Amazon’s EC2, not only can our PHP apps scale outward with a few clicks of the mouse, but our apps can also be platform-aware and provide all of the right metrics to know when it is the right time to scale. This presentation is going to cover the core models and issues that you will encounter as a PHP programmer in a virtualized world. Focusing on EC2, but brushing on private virtual infrastructures with VMWare, Xen and KVM, we will talk about how the virtual environment affects your production code execution and your development environment as well as what new considerations you should be taking into account for every app you write. We will finish by going over future features you might want to consider that this infrastructure approach makes possible.

Material will be a combination of slide show and live demos with the technology.

Event information hasn’t been posted yet, but here’s a link to the Meetup Event: http://www.meetup.com/new-york-php/events/32189482/