DeMar.is

Category Archives: Engineering

How-To: VirtualBox Hive Server with Python Client

06 Monday Apr 2015

Posted by Justin DeMaris in Engineering

Tags

engineering, hadoop, hive

Getting up and running with a Hive proof of concept is not very intuitive, and I had to go through a mishmash of documents to find all of the pieces. I hope these instructions help somebody else get their proof of concept instance up and running sooner!

Baseline

My goal was to have a local running Hive Server instance that I could use to query various types of data to get a better idea of what was possible with Hive and how maintainable it would be in a future production architecture. This tutorial does not take into account any degree of hardening or security, nor major scalability.

I do all of my development on a MacBook Pro with 16GB of RAM and a 2.2 GHz Intel Core i7, running OS X Yosemite. My proof of concept systems generally run on Docker using boot2docker, but in this case it was easier for me to build it directly within a VirtualBox instance. I am using Ubuntu Server 14.04.2 as my base image for the VirtualBox virtual machine.

I set the networking for the VirtualBox instance to Bridged so that my MacBook can access it.

Setup Hadoop

1. Create a new VirtualBox image. I granted mine 1GB of RAM and 8GB of disk space and installed it with the latest Ubuntu Server ISO to get started.

2. Once installation is complete, log in as your admin user.

3. Setup the Guest Additions CD (Devices -> Insert Guest Additions CD). This will make copy-pasting from the host into the VM easier.

4. sudo apt-get install -y ssh openjdk-7-jre openjdk-7-jdk wget vim

5. Make sure SSHD is working properly by executing “ssh localhost” and accepting the server's host key.

6. Work out of /tmp so we have a clean baseline: cd /tmp

7. Download Hadoop: wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

8. tar xzf hadoop-2.7.0.tar.gz

9. sudo mv hadoop-2.7.0 /usr/local/hadoop

10. export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

11. export HADOOP_PREFIX=/usr/local/hadoop

12. export PATH=/usr/local/hadoop/bin:$PATH

13. Modify /usr/local/hadoop/etc/hadoop/core-site.xml to add the following inside of the <configuration> tag:

<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>

14. Modify /usr/local/hadoop/etc/hadoop/hdfs-site.xml to add the following inside of the <configuration> tag:

<property><name>dfs.replication</name><value>1</value></property>

15. Set up passwordless SSH

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
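
If you would rather script the two configuration edits above (core-site.xml and hdfs-site.xml) than make them by hand, something like the following Python sketch could do it. It is only an illustration: the add_property helper is a made-up name, it works on XML text rather than on the real files, and parsing with ElementTree drops the license comment and XML declaration that ship in the stock Hadoop files.

```python
import xml.etree.ElementTree as ET


def add_property(xml_text, name, value):
    """Return xml_text with a <property> for name/value inside <configuration>."""
    root = ET.fromstring(xml_text)
    if root.tag != "configuration":
        raise ValueError("expected a <configuration> root element")
    # Update the property in place if it is already present.
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Otherwise append a new <property> element.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical starting point: an empty configuration document.
empty_config = "<configuration></configuration>"
core_site = add_property(empty_config, "fs.defaultFS", "hdfs://localhost:9000")
hdfs_site = add_property(empty_config, "dfs.replication", "1")
print(core_site)
print(hdfs_site)
```

Running the helper twice with the same name updates the existing entry instead of adding a duplicate, which makes it safe to re-run while experimenting.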

Prepare HDFS

1. cd /usr/local/hadoop

2. bin/hdfs namenode -format

3. Modify /usr/local/hadoop/etc/hadoop/hadoop-env.sh by adding a line at the end:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre

4. sbin/start-dfs.sh (be sure to say yes when SSH fingerprint verification comes up)

5. bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/justin (replace justin with your own username)

6. bin/hdfs dfs -put etc/hadoop input

Prepare Hive

1. wget http://mirrors.advancedhosters.com/apache/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

2. tar xzf apache-hive-1.1.0-bin.tar.gz

3. sudo mv apache-hive-1.1.0-bin /usr/local/hive

4. export HIVE_HOME=/usr/local/hive

5. rm /usr/local/hive/lib/hive-jdbc-1.1.0-standalone.jar

6. rm /usr/local/hadoop/share/hadoop/yarn/lib/jline-0.9.94.jar

7. cp /usr/local/hive/conf/hive-env.sh.template /usr/local/hive/conf/hive-env.sh

8. Add the following lines to the end of /usr/local/hive/conf/hive-env.sh

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre
export HIVE_HOME=/usr/local/hive

9. Start the HiveServer2 instance:

/usr/local/hive/bin/hive --service hiveserver2

Simple Python Client

1. Back on your host computer, install the python requirements:

sudo easy_install pip
pip install pyhs2

2. Create a simple script (test.py) – replace the host with the IP address of your VirtualBox (you can get it by running ifconfig on the VirtualBox instance):

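import pyhs2

# Connect to HiveServer2 on the VM; replace the host with your VM's IP address.
with pyhs2.connect(host='192.168.20.82',
                   port=10000,
                   authMechanism='PLAIN',
                   user='hdfs',
                   password='hdfs',
                   database='default') as conn:
    with conn.cursor() as cur:
        # List the databases the server knows about.
        print(cur.getDatabases())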

3. Run the script to test it:

python test.py

You should see some output that looks like this:

[['default', '']]
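
If the script hangs or fails to connect instead, a quick way to narrow things down is to check whether port 10000 on the VM is reachable from the host at all. The snippet below is a small sketch for that, written for Python 3 using only the standard library; port_is_open is a hypothetical helper name, not part of pyhs2.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: treat all of these as "closed".
        return False
    conn.close()
    return True


# Example call (192.168.20.82 is the sample address used in test.py):
#     port_is_open("192.168.20.82", 10000)
```

A False result points at networking (bridged adapter, wrong IP) or at HiveServer2 not running, rather than at the Python client.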

References:

https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation#AdminManualInstallation-InstallingfromaTarball

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2

https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PythonClientDriver

How To: Apiary / GitHub README.md Integration

19 Thursday Mar 2015

Posted by Justin DeMaris in Engineering

Tags

engineering

Are you using Apiary for documenting your API and GitHub for storing your code base? Apiary has a really cool feature that will keep your markdown stored inside of your git repository and update the Apiary version every time you push a change to GitHub. To set this up, go to the Settings section of your Apiary account by clicking this icon on the top bar:

Apiary Settings Icon

Then scroll to the bottom of the page and you will see a “Link Your GitHub Repository” section:

Link GitHub

Check the box to grant access to the private repositories (if you’re connecting to a private one) and then click Connect to GitHub. Once you have finished the authorization process you will be back on the same page in Apiary, but it will look like this:

GitHub Repo Selector

Click on the repository you want to connect it to and click Go. Apiary will add a file to the master branch of the repository named apiary.apib. If you update the Apiary documentation on their website and save it, it will do a git commit to keep the repository up to date. Even better, if you commit changes to the Apiary markdown inside of your repository and push it to GitHub, the latest documentation will be reflected in Apiary! This makes it very convenient to keep your Apiary in lock step with your code and you can even apply branching and tagging to it now.

Now for the pièce de résistance: integration with your README.md!

As you probably know, GitHub supports a special file in the root of the repo called README.md. This is a Markdown file that almost everybody uses to display the documentation for the repository. GitHub renders it below the root file listing on the repository homepage. Since the Markdown dialects for Apiary and GitHub are reasonably similar, it would be really awesome if your Apiary documentation showed up here!

Luckily for us, git supports symlinks. Check out your repository and get rid of your existing README.md file (if any). I actually moved mine to INSTALL.md since it was more appropriate for installation than primary documentation. Now run the following from your command line (Linux and Mac Only – Sorry Windows folks, I have no clue if Windows has some version of symlinks yet):

ln -s apiary.apib README.md
git add README.md
git commit -m "Symlinking README.md to apiary.apib thanks to demar.is"
git push origin master

Voila! Other than the FORMAT: 1A line at the top, this looks great!

GitHub Apiary Link Complete

Modulus Sorting – Alternative to ORDER BY RAND()

04 Wednesday Jul 2012

Posted by Justin DeMaris in Engineering

If you have been doing anything LAMP (or MySQL) related for a while, you already know that the simplest way of getting a random set of rows from a table, ORDER BY RAND(), is a really Bad Idea™. The guts of SQL require that this solution generate a random number for each of the rows in your result set and then sort by it. Random numbers are generally expensive, and this requires one for each row, which becomes quite the bottleneck as your result set grows.

I have seen other alternatives to this:

  • Pull all of the IDs from the table and then select X number of random ones in PHP and run another query to pull those random ones – http://www.webtrenches.com/post.cfm/avoid-rand-in-mysql
  • Use a slightly more complex set of SQL variables – http://www.electrictoolbox.com/msyql-alternative-order-by-rand/

I am testing out another approach that I’m calling “modulus” sorting. The ID approach is decent, but as tables get large, you end up doing a large data transfer to get all of those IDs. My solution may not be truly random, but it appears to be doing nicely so far.

First query:
$query = "SELECT COUNT(*) AS c FROM `table`";

Second query (where $count is the value of c from the first query):
$query = "SELECT * FROM `table` ORDER BY `id` % " . mt_rand(2, max(2, (int) ($count / 2)));

Essentially, we pull the number of records in the table and assume that the IDs are fairly evenly distributed and then sort them by the modulus of their ID and a random value. The modulus operator is quite a bit faster than RAND() and the results I have been getting so far are random enough for my purposes. I’ll drop in updates as we see how it goes!
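
To get a feel for what this ordering does, here is a rough Python simulation of sorting by id modulo a random divisor. It is a sketch under assumptions, not the MySQL behaviour itself: modulus_order and pick_modulus are made-up names, ties are broken by id (MySQL makes no promise about tie order), and the divisor is kept at 2 or above because a divisor of 1 gives every row the same key and a divisor of 0 yields NULL in MySQL.

```python
import random


def modulus_order(ids, modulus):
    """Order ids the way ORDER BY id % modulus would, breaking ties by id."""
    return sorted(ids, key=lambda row_id: (row_id % modulus, row_id))


def pick_modulus(count, rng=random):
    """Pick a divisor between 2 and count / 2 (never 0 or 1)."""
    return rng.randint(2, max(2, count // 2))


ids = list(range(1, 11))          # pretend the table holds ids 1..10
print(modulus_order(ids, 3))      # ids grouped by remainder 0, then 1, then 2
print(modulus_order(ids, pick_modulus(len(ids))))
```

The grouping by remainder is the reason the result is shuffled-looking rather than truly random: rows that share a remainder always stay together, and only the choice of divisor changes between runs.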

Flow Revisited

26 Tuesday Jun 2012

Posted by Justin DeMaris in Engineering

During a big buy-out announcement recently within my company, one of the speakers mentioned the book Drive: The Surprising Truth About What Motivates Us. Now that I spend a lot of time as an NYC straphanger, I get a lot of reading in during my daily commute, and I decided that since the recommendation came from influential and successful people, it was a safe bet that it was a good read.

Let me just say up front: wow. It is not a smooth read, and at times the writing style is a bit repetitive, but the content of the book is amazing, and the majority of it is references to psychological studies around the world. If you have any desire to understand what motivates you or how you as a manager could motivate your employees, this is a must read. I’ve discovered things about myself that I did not effectively realize before, and beyond that discovered that I am not alone in them.

The core concept of the book is that human beings are far more complicated than can be modeled with simple reward-driven behavior (“carrot and stick”). We have an innate desire to solve problems as well. When it comes to salary, as long as we are receiving fair pay and are not having money problems, more money is a very poor motivator. In fact, if you dangle the carrot of an (especially monetary) reward as a reason to do something, you can severely damage long term motivation and the desire to excel on that project. Essentially you can turn a fun project into drudgery by associating money with it.

They actually brought up Open Source web development as an example of how great things can come out of volunteer work. I take a bit of issue with that since most successful open source projects have some of the core pieces in place and are only supported because big money-spending companies back them and pay people within their own companies to make contributions. However, the idea is still reasonably sound.

Bringing this back to my previous post on the “good vs bad” sides of achieving “flow”, the book noted that it becomes much more difficult to reach flow when you are working on a project for a monetary reward. This helps to explain why, when you are writing code for fun and solving problems on your own time, you can end up glancing bleary-eyed out of the window and realizing it’s 5AM. If you are doing the same project for work because somebody is pushing a deadline on you, you can be checking the clock every five minutes until it’s 6PM and you can leave without feeling too guilty.

In independent analysis of the results of work in both situations, the people building for fun very consistently create higher quality code.

I say score ++ for flow.

Facebook Harassing Goo.gl Users?

05 Thursday Jan 2012

Posted by Justin DeMaris in Engineering

So I work in social media for a large advertising-centric company, and one of the things that I do is test integration with Facebook Pages. I run a simple test Facebook Page to try out different things on, and I recently started having issues with including any goo.gl URL in a post to the page:

If I have a simple link to XKCD, shortened with Goo.GL: http://goo.gl/b4UzX and shortened with Bitly: http://bit.ly/1pB1Uk, I get the following results when I try to post them onto the wall of the page:

Original XKCD link: posts and embeds just fine
Bitly link: posts and embeds just fine
Goo.gl link: always asks to solve a CAPTCHA

I’ve tried it with a bunch of other sites and some other URL shorteners and have had no issues. Is anyone else having this problem? Any other Google-related services that are being tampered with? It’s not blocking it, it just feels like harassment!

Programmer “Flow”: Good or Bad?

04 Wednesday Jan 2012

Posted by Justin DeMaris in Engineering

If you’ve done any mind intensive task, even taking a very challenging but doable exam, you probably understand the concept of “flow”. Flow (aka getting “in the zone”) happens when you are hyper-focused on a particular problem or task, and you are able to provide contiguous output on the problem, pretty much non-stop.

Personally, coffee helps me get in the flow a lot. If I have some audio distractions around me, then Trance music can help too (preferably very repetitive and with no words). I know some people who compare the feeling of this “flow” to what happens when you’re on Adderall. The rest of the world and the seconds that it ticks with just disappear, and as a programmer, your fingers never stop typing.

It’s an amazing feeling, and I have within the past month read two different authoritative reviews on how it really affects your productivity. According to Robert C. Martin in The Clean Coder, flow is patently bad. He compares it to being hypnotized (I’m not arguing there), but he says code produced in the flow is code that is unaware of the bigger picture, and not flexible. A lot of code produced in this way has to be revisited and re-architected at a later date because it will be single-minded and not jibe with the rest of the program.

Just today, I read one of the old Joel on Software posts by Joel Spolsky entitled “Where do These People Get Their (Unoriginal) Ideas?” where he says that achieving flow is one of the basic requirements of programming in general, as focusing on the problem very closely helps you solve it more quickly and more thoroughly.

What are your experiences with this?

Personally, I fall more on Joel’s side, but with some reservations. You need to make sure you have your environment set up correctly and you know proper programming standards. My most solid code has been written in the flow while following TDD. I was able to easily refactor it for later additions and it was solid, usable and well documented (I write all of my PHP code with PHPDoc / JavaDoc style comments). Putting all of these together and still focusing hard on the problem gives me excellent progress while still generating maintainable code.

Speaking to NY-PHP Bootup to Startup

29 Tuesday Nov 2011

Posted by Justin DeMaris in Engineering

I will be speaking to the Bootup to Startup with NY PHP group on Tuesday, February 28th at IBM. The topic for discussion is “Talk Virtualization” and this is the abstract:

Virtualization has changed the platform playing field for web development considerably. With the advent of platforms like Amazon’s EC2, not only can our PHP apps scale outward with a few clicks of the mouse, but our apps can also be platform-aware and provide all of the right metrics to know when is the right time to scale. This presentation is going to cover the core models and issues that you will encounter as a PHP programmer in a virtualized world. Focusing on EC2, but brushing on private virtual infrastructures with VMWare, Xen and KVM, we will talk about how the virtual environment affects your production code execution and your development environment as well as what new considerations you should be taking into account for every app you write. We will finish by going over future features you might want to consider that this infrastructure approach makes possible.

Material will be a combination of slide show and live demos with the technology.

Event information hasn’t been posted yet, but here’s a link to the Meetup Event: http://www.meetup.com/new-york-php/events/32189482/

Playing with Assembly

18 Thursday Mar 2010

Posted by Justin DeMaris in Engineering

So I was reading Slashdot yesterday when I came across a link to this article which is basically a crash course in writing super small programs using assembly language (nasm in Linux). I was completely in awe of their acquired 45 byte executable and took it upon myself to learn the basics of assembly.

My first attempt was to write a hello world (the article writes a 45 byte executable that just returns the number 42). That worked super well and was as easy as I thought, so I decided to play around with a couple more system calls and write a program that sends hello world to a file in /tmp/. There were a few errors that I came across, so I got some help from the friendly guys in #asm on freenode IRC and figured out the last part. Here is the code, and after it is an explanation of the two parts that screwed me up.

; tiny.asm
BITS 32
		org		0x08048000

	ehdr:                                       ; Elf32_Ehdr
		db			0x7F, "ELF", 1, 1, 1, 0         ;   e_ident
		times		8 db      0
		dw			2                               ;   e_type
		dw			3                               ;   e_machine
		dd			1                               ;   e_version
		dd			_start                          ;   e_entry
		dd			phdr - $$                       ;   e_phoff
		dd			0                               ;   e_shoff
		dd			0                               ;   e_flags
		dw			ehdrsize                        ;   e_ehsize
		dw			phdrsize                        ;   e_phentsize
		dw			1                               ;   e_phnum
		dw			0                               ;   e_shentsize
		dw			0                               ;   e_shnum
		dw			0                               ;   e_shstrndx
  
	ehdrsize		equ		$ - ehdr
  
	phdr:                                       ; Elf32_Phdr
		dd			1                               ;   p_type
		dd			0                               ;   p_offset
		dd			$$                              ;   p_vaddr
		dd			$$                              ;   p_paddr
		dd			filesize                        ;   p_filesz
		dd			filesize                        ;   p_memsz
		dd			5                               ;   p_flags
		dd			0x1000                          ;   p_align
  
	phdrsize		equ	$ - phdr

_data:
		msg		db		"Hello, World!", 0xa
		len		equ   $ - msg
		file		db		"/tmp/test.out", 0x0

_start:
		; create the file /tmp/test.out
		mov		eax, 5
		mov		ebx, file
		mov		ecx, 66
		int		0x80

		; save the file descriptor
		mov		ebx, eax

		; set read/write permissions on the file
		mov		eax, 94
		mov		ecx, 0x1ff
		int		0x80

		; write the string to the file
		mov		eax, 4
;		mov		ebx, 1			; uncomment this to print to stdout instead
		mov		ecx, msg
		mov		edx, len
		int		0x80

		; close the file
		mov		eax, 6
		int		0x80

		; exit
		mov		eax, 1
		mov		ebx, 0
		int		0x80

filesize			equ	$ - $$

You can build and execute this yourself straight from Linux command line if you have nasm installed by saving it as tiny.asm and executing “nasm -f bin -o a tiny.asm && chmod +x a”. I have the ELF headers embedded directly into the assembly and we use ZERO external libraries so there is no linking needed. In fact, since pretty much everything executed is system calls, you can then run strace on the resultant executable (named “a”) and it will give you a pretty nice dump of exactly what it does.

The first thing that kind of tripped me up was that when I created the file, it was created by default with no permission flags set at all. To fix this, I added an fchmod call right after I opened the file. The hex value I pass as the second parameter (ecx) is just the hex value for 777 permissions (0x1ff is 0777 in octal). That set the permissions right, but originally when I went to write to the file it still wouldn’t write. Thanks to the IRC help I figured out that I was opening the file with just O_CREAT (64) when what I really wanted was O_CREAT | O_RDWR so that I could write to it as well, so I changed the flags parameter of the call to open to 66 instead of 64 et voila, it works!
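
For anyone who wants to poke at the same sequence of system calls without writing assembly, here is a rough Python equivalent of the open, fchmod, write, close steps described above. It is a sketch, not a translation of tiny.asm: write_hello is a made-up name, the demo writes into a fresh temporary directory rather than /tmp/test.out, and passing mode 0 to os.open is my assumption for mimicking the "no permission flags" starting point.

```python
import os
import stat
import tempfile

O_FLAGS = os.O_CREAT | os.O_RDWR   # 64 | 2 == 66 on x86 Linux
MODE_777 = 0o777                   # same value as 0x1ff


def write_hello(path):
    """Create path, open up its permissions, and write the greeting."""
    # Mode 0 means the new file starts with no permission bits at all;
    # the descriptor is still writable because it was opened with O_RDWR.
    fd = os.open(path, O_FLAGS, 0)
    try:
        os.fchmod(fd, MODE_777)    # counterpart of syscall 94 (fchmod)
        os.write(fd, b"Hello, World!\n")
    finally:
        os.close(fd)


# Demo in a throwaway directory so nothing in /tmp is overwritten.
demo_path = os.path.join(tempfile.mkdtemp(), "test.out")
write_hello(demo_path)
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

Dropping os.O_RDWR from O_FLAGS reproduces the second problem from the post: the file gets created, but the write fails because the descriptor was opened read-only.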

Hope you enjoy and maybe it’ll help somebody else out who is trying to write to a file with assembly in Linux.

Parallelization in PHP

07 Wednesday Jan 2009

Posted by Justin DeMaris in Engineering

This is a simple example you can re-use for splitting up processing of data across processes for faster execution. Put all of the data into the $set and fill in function process with what you want to do on the data, and let ‘er loose! I’m personally using it for telnet scripts because the amount of time spent waiting for a single telnet session is horrible and I can run many sessions at once while I wait for the responses.

/**
 * Splits the given set into $count subsets that are of approximately equal size
 */
function array_split($set, $count) {
   $subset_size = ceil(count($set) / $count);
   return array_chunk($set, $subset_size);
}

/**
 * Forks into $process_count separate processes and executes the function
 * named in $job in each process to split up handling of the data in
 * $set across the processes.
 */
function fork_exec($set, $job, $process_count) {
   $subsets = array_split($set, $process_count);
   $children = array();

   // launch all of the children and store process list
   foreach ( $subsets as $a_set ) {
      $pid = pcntl_fork();

      if ( $pid == -1 ) die("Error forking");
      else if ( $pid == 0 ) { call_user_func($job, $a_set); exit(0); }
      else $children[] = $pid;
   }

   // wait for each process to end
   while ( count($children) > 0 ) {
      $pid = array_shift($children);
      pcntl_waitpid($pid, $status);
   }
}

// example set to work on
$set = array('a','b','c','d','e','f','g','h','i','j');

// Process the job with 3 processes and time it
$time = microtime(true);
fork_exec($set, 'process', 3);
$diff = microtime(true) - $time;
echo $diff . ' seconds for full run'."\n";

// This is the job to run on the set. Make sure it is multi-process safe!
function process($set) {
   foreach ( $set as $item ) {
      echo "Process [" . posix_getpid() . "] executing '" . $item . "'\n";
      sleep(1);
   }
}
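
One detail of the splitting step is worth seeing in isolation: because the chunk size is rounded up, the number of chunks (and so the number of forked children) can come out lower than the count you asked for. The Python sketch below mirrors the ceil-then-chunk logic of array_split for illustration only; the empty-list guard is my addition and is not in the PHP version.

```python
import math


def array_split(items, count):
    """Split items into chunks of ceil(len(items) / count) elements each."""
    if not items:
        return []
    size = math.ceil(len(items) / count)
    return [items[i:i + size] for i in range(0, len(items), size)]


letters = list("abcdefghij")
print([len(chunk) for chunk in array_split(letters, 3)])   # chunk sizes for 3 workers
print(len(array_split(list(range(9)), 4)))                 # number of chunks for 4 workers
```

With ten items and three workers the chunks are sized 4, 4 and 2; with nine items and four workers the chunk size rounds up to 3, so only three chunks (and three child processes) are produced.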

Freestyle Nerds

01 Tuesday Jul 2008

Posted by Justin DeMaris in Engineering

<djahandarie> we ain’t here to do e-c-e
<djahandarie> we’re here to do c-s-e on the w-e-b
<djahandarie> listen to me spit these rhymes
<djahandarie> while i program lines
<djahandarie> and commit web accessibility crimes
<djahandarie> word, son
<http402> You talk like your big on these I-Net kicks,
<http402> But your shit flows slower than a two-eighty-six.
<http402> I’m tracking down hosts and nmap scans,
<http402> While Code Igniter’s got you wringing your hands.
<http402> Cut the crap rap,
<http402> Or I’ll run ettercap,
<http402> Grab your AIM chat,
<http402> N’ send a PC bitch-slap!
<http402> peace
<djahandarie> you’re talkin bout down hosts and nmap scans
<djahandarie> while i got other plans
<djahandarie> you’re at your new job, but you can’t even do it right
<djahandarie> you just create a plight with your http rewrites
<djahandarie> i’ve been on the web since the age of three
<djahandarie> you just got on directly off the bus from mississippi
<djahandarie> respect yo’ elders, bitch
<http402> You’ve been webbin’ since three, but still ain’t grown up,
<http402> Gotta update your config and send the brain a SIGHUP.
<http402> You say you’re that old? No wonder you’re slow!
<http402> You’re knocking at the door while I run this show!
<http402> Elders my ass, you’re shit’s still in school,
<http402> Hunt and pecking at the keyboard like a spaghetti-damned fool,
<http402> Rim-riffing your hard drive like a tool,
<http402> Face it. I rule.
<djahandarie> i erase my harddrives with magnets (bitch)
<djahandarie> all you can do is troll on the fagnets
<djahandarie> and son, my brain’s wrapped in a nohup
<djahandarie> it wont be hurt by the words you throwup
<djahandarie> dont mind me while i emerge my ownage
<djahandarie> while you’re still over there apt-getting your porridge
<djahandarie> you say i’m still in school
<djahandarie> but the fact is that i know the rule
<djahandarie> cuz you need to go back to grade three
<djahandarie> and you better plea, that they take sucky graduates from c-s-e
<http402> Time to bend over and apply a patch,
<http402> Your brain’s throwing static like a CD with a scratch.
<http402> Your connection got nuked and you’ve met your match.
<http402> You run a single process like a VAX with a batch.
<http402> I’d pass the torch to a real winner
<http402> But it’d just scorch a while-loop spinner
<http402> Caught in a loop that you cant escape,
<http402> I run clock cycles around your words and flows,
<http402> Cuz your rhyme is like a PS fan: it’ blows,
<http402> Your water-cooled lyrics leak and it shows,
<http402> Take your ass back to alt.paid.for.windows.
<djahandarie> Good god, I can’t even respond to that. 😛
<djahandarie> You win haha
* http402 takes a bow
