Using Hadoop And PHP

Getting Started

First things first. If you haven’t used Hadoop before, you’ll need to download a Hadoop release and make sure you have Java and PHP installed. To download Hadoop, head over to:

http://hadoop.apache.org/common/releases.html

Click on “download a release” and choose a mirror. I suggest choosing the most recent stable release. Once you’ve downloaded Hadoop, unpack the archive.

I like to create a symlink to the hadoop-<release> directory to make things easier to manage.
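A minimal sketch of that symlink step. The directory name hadoop-0.20.2 is an assumption (it matches the streaming jar mentioned later in this post), so substitute whatever your release unpacked to. The link name hadoop is the one the later paths in this post rely on.

```shell
# Assumption: the archive unpacked to ./hadoop-0.20.2; use your own release directory here.
release_dir="hadoop-0.20.2"

# -s makes a symbolic link, -f replaces an existing link,
# -n stops ln from descending into an existing link that points at a directory.
ln -sfn "$release_dir" hadoop

# Show where the link points.
readlink hadoop
```

When you move to a newer release later, re-running the ln line with the new directory name is all that has to change.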

Now you should have everything you need to start creating a Hadoop PHP job.

Creating The Job

For this example I’m going to create a simple Map/Reduce job for Hadoop. Let’s start by understanding what we want to happen.

We want to read from an input system – this is our mapper
We want to do something with what we’ve mapped – this is our reducer

At the root of your development directory, let’s create another directory called script. This is where we’ll store our PHP mapper and reducer files.

Now let’s begin creating our mapper script in PHP. Go ahead and create a PHP file called mapper.php under the script directory.

Now let’s look at the basic structure of a PHP mapper.

#!/usr/bin/php
<?php
//this can be anything from reading input from files, to retrieving database content, soap calls, etc.
//for this example I'm going to create a simple php associative array.
$a = array(
    'first_name' => 'Hello',
    'last_name' => 'World'
);

//it's important to note that anything you send to STDOUT becomes the mapper's output, which is what the reducer reads.
//it's also important to end every line you send to STDOUT with a PHP_EOL, this will save you a lot of pain.
echo serialize($a), PHP_EOL;
?>

This example is extremely simple: it creates an associative array and serializes it. Now on to the reducer. Create a PHP file in the script directory called reducer.php.

Now let’s take a look at the layout of a reducer.

#!/usr/bin/php
<?php
//keep <?php directly under the #! line: anything outside the PHP tags is echoed to STDOUT.
//Remember when I said anything sent to STDOUT in our mapper would go to the reducer?
//Well, now we read from STDIN to get the result of our mapper.
//iterate over all lines of output from our mapper
while (($line = fgets(STDIN)) !== false) {
    //remove leading and trailing whitespace, just in case :)
    $line = trim($line);
    //now recreate the array we serialized in our mapper
    $a = unserialize($line);
    //Now, we do whatever we need to with the data.  Write it out again so another process can pick it up,
    //send it to the database, soap call, whatever.  In this example, just change it a little and
    //write it back out.
    $a['middle_name'] = 'Jason';
    //do not forget the PHP_EOL
    echo serialize($a), PHP_EOL;
}//end while
?>

So now we have a very simple mapper and reducer ready to go.
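Hadoop streaming launches the mapper and reducer as ordinary commands, so it is worth making both scripts executable before going further. The sketch below assumes the layout used in this post (script/mapper.php and script/reducer.php); the optional dry run at the end is my own addition and simply pipes the mapper into the reducer without Hadoop.

```shell
# Assumes the layout from this post: ./script/mapper.php and ./script/reducer.php.
mkdir -p script

# touch only creates a file when it is missing; existing scripts keep their content.
touch script/mapper.php script/reducer.php

# Both scripts start with a #! line, so they need the executable bit to be launched directly.
chmod +x script/mapper.php script/reducer.php

# Optional dry run without Hadoop: feed the mapper's output straight to the reducer.
# Skipped when no php binary is on the PATH.
if command -v php >/dev/null 2>&1; then
    php script/mapper.php | php script/reducer.php
fi
```

If the dry run prints a serialized array that includes middle_name, the two scripts agree on the line format and any later failure is more likely to be on the Hadoop side.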

Execution

Now let’s run it and see what happens. But first, a little prep work: we need to set up the input directory that the job will read from.
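The prep work can be sketched as follows. The names input and conf are taken from the description in this section; the exact commands are an assumption about how to get there.

```shell
# Assumed names, from the post's description: an "input" directory
# holding an empty file called "conf".
mkdir -p input
touch input/conf

# List the directory to confirm the file is there.
ls -l input
```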

Ok, that was difficult. We have an input directory and we’ve created an empty conf file. The empty conf file is just something that the mapper will use to get started. For now, don’t worry about it. Now let’s run this bad boy. Make sure you have your JAVA_HOME set; Java is usually installed under the /usr directory. If it isn’t set, export it in your shell before running the job.
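One way to do that, assuming Java really does live under /usr as suggested above. If your java binary is somewhere else, point JAVA_HOME at that installation's directory instead.

```shell
# Assumption: Java is installed under /usr (so the binary is /usr/bin/java).
# Adjust the path if your installation lives elsewhere.
export JAVA_HOME=/usr

# Print the value so it is easy to spot a typo.
echo "JAVA_HOME=$JAVA_HOME"
```

The export only lasts for the current shell session, so it has to be repeated (or added to your shell profile) in a new terminal.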

Here’s what the streaming command does. The first part runs the hadoop launcher script. The jar argument tells Hadoop to run a jar, in this case hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar. Next we pass the mapper and reducer arguments to the job and specify the input and output directories.

If we wanted to, we could pass configuration information, files, and so on to the mapper by placing them in the input directory. The mapper would read them with the same line-by-line STDIN loop we used in the reducer. For this example, we’ll just pass nothing. The output directory will contain the output of our reducer: if everything works out correctly, the PHP serialized form of our modified $a array. If all goes well, the job should run to completion without reporting any errors.

If you get errors complaining about the output directory (Hadoop refuses to write into one that already exists), just remove the output directory and try again.
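The run described above, together with the clean-up advice, can be sketched as a small shell function. The flags -mapper, -reducer, -input and -output are Hadoop streaming's own. Everything else is an assumption: the function name run_job and the HADOOP_BIN override are made up for illustration, hadoop is the symlink to the release directory, and the relative paths assume a fresh download running in local (standalone) mode from the development directory.

```shell
# Hypothetical helper: run_job and HADOOP_BIN are illustrative names, not part of Hadoop.
run_job() {
    # Hadoop will not write into an existing output directory, so clear it first.
    # Careful: this deletes the results of any previous run.
    rm -rf output

    # Assumed layout: ./hadoop -> release directory, plus ./script, ./input and ./output.
    "${HADOOP_BIN:-hadoop/bin/hadoop}" jar \
        hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -mapper script/mapper.php \
        -reducer script/reducer.php \
        -input input \
        -output output
}
```

After defining it, typing run_job starts the job. Setting HADOOP_BIN=echo first prints the arguments instead of launching Hadoop, which is a cheap way to check the paths.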

Result

Once the job has finished with no errors, we can check out the result in the output directory.
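A small sketch for looking at the result. It assumes the job wrote its results to ./output as part-* files, which is Hadoop's usual naming; show_result is a made-up helper name.

```shell
# Hypothetical helper: prints whatever the reducer wrote.
show_result() {
    # Assumption: results are in ./output, in files named part-*.
    if ls output/part-* >/dev/null 2>&1; then
        cat output/part-*
    else
        echo "no part-* files in ./output yet" >&2
    fi
}

show_result
```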

There we go, a serialized form of our modified PHP array $a. That’s all there is to it. Now, go forth and Hadoop.

Published by GodLikeMouse on December 11, 2010
