
Comparison of decompression methods in Hadoop

Written on: May 7, 2009

I have recently been doing some programming with Hadoop, a framework for processing massive amounts of data across many servers. Hadoop reads input data from (large) files and performs MapReduce data reductions.

One way of reading input is to decompress gzipped files. Java has built-in support for reading a gzipped stream using GZIPInputStream. However, Hadoop ships with its own implementation that uses native libraries for efficiency.
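As a minimal sketch of the built-in route, the class below compresses a short text in memory and reads it back line by line through GZIPInputStream. The class name GzipReadSketch and its helper methods are made up for illustration, and the in-memory byte array merely stands in for a real .gz file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipReadSketch {
    /** Compresses text into an in-memory gzip byte array (a stand-in for a .gz file). */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    /** Reads all lines from a gzip-compressed stream using the JDK's GZIPInputStream. */
    public static List<String> readGzipLines(InputStream compressed) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("line one\nline two\n");
        // prints [line one, line two]
        System.out.println(readGzipLines(new ByteArrayInputStream(data)));
    }
}
```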

I was curious about how much faster the native Hadoop variant was than the vanilla Java version, so I wrote a small test. The result turned out to be a surprise:

May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Time of Hadoop  decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop  decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%

For small GZ files (KB), the native version was about 26% slower, and for moderately sized files (MB) the difference was negligible (about 0.2%).
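The two percentages are simply the total Hadoop time divided by the total vanilla time. A small sketch that redoes the arithmetic from the totals in the log output (the class name RatioSketch is made up for illustration):

```java
import java.util.Locale;

public class RatioSketch {
    /** Run time of a candidate relative to a baseline, in percent. */
    public static double percentOf(long candidateMs, long baselineMs) {
        return 100.0 * candidateMs / baselineMs;
    }

    public static void main(String[] args) {
        // Total times in milliseconds, taken from the log output above.
        // prints small: 125.67%
        System.out.println(String.format(Locale.ROOT, "small: %.2f%%", percentOf(1684, 1340)));
        // prints large: 99.80%
        System.out.println(String.format(Locale.ROOT, "large: %.2f%%", percentOf(10074, 10094)));
    }
}
```

A value above 100% means the Hadoop decompressor took longer than the vanilla one.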

I ran the test on a 1.67 GHz dual-core laptop with 2 GB RAM and Ubuntu 9.04 64-bit. The ‘java.library.path’ property points to the ‘hadoop/lib/native/Linux-amd64-64’ directory, which is needed so that the JVM can pick up the correct library.
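A quick way to check that setting, sketched under the assumption that the native library is loaded under the name "hadoop" (so the file would be libhadoop.so on Linux). The class NativeLibPathSketch is made up for illustration and only looks for the file; it does not load anything.

```java
import java.io.File;

public class NativeLibPathSketch {
    /** Returns true if any directory in libraryPath contains the platform file name for libName. */
    public static boolean isOnLibraryPath(String libraryPath, String libName) {
        // Maps e.g. "hadoop" to "libhadoop.so" on Linux.
        String fileName = System.mapLibraryName(libName);
        for (String dir : libraryPath.split(File.pathSeparator)) {
            if (!dir.isEmpty() && new File(dir, fileName).isFile()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String path = System.getProperty("java.library.path", "");
        System.out.println("java.library.path = " + path);
        // "hadoop" is an assumed library name, see the note above.
        System.out.println("hadoop native library found: " + isOnLibraryPath(path, "hadoop"));
    }
}
```

The property is normally set when starting the JVM, for example with -Djava.library.path=hadoop/lib/native/Linux-amd64-64.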

If you want to replicate the test, I have enclosed the test class below:

package com.ribomation.weatherstats.computeTemperatures;

import org.junit.Test;
import static org.hamcrest.CoreMatchers.*;
import static org.junit.Assert.assertThat;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.time.StopWatch;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.List;

/**
 * Compares the built-in Java GZip uncompress with Hadoop's ditto.
 * <p/>
 * User: jens
 * Date: May 6, 2009  9:13:32 PM
 */
public class UncompressingComparativeTest {
    private String  sampleSmallFile = "/sample-file-KB.gz";
    private String  sampleLargeFile = "/sample-file-MB.gz";
    private int     numRunsSmall    = 1000;
    private int     numRunsLarge    = 10;

    @Test
    public void test_run() throws IOException {
        UncompressStreamCreator hadoop  = new HadoopUncompressStreamCreator();
        UncompressStreamCreator vanilla = new VanillaUncompressStreamCreator();

        long hadoopSmall = runTest(hadoop, sampleSmallFile, numRunsSmall, "small");
        long hadoopLarge = runTest(hadoop, sampleLargeFile, numRunsLarge, "large");

        long vanillaSmall = runTest(vanilla, sampleSmallFile, numRunsSmall, "small");
        long vanillaLarge = runTest(vanilla, sampleLargeFile, numRunsLarge, "large");

        System.out.printf("Hadoop vs. Vanilla [small]: %.2f%% %n", 100.0 * hadoopSmall / vanillaSmall);
        System.out.printf("Hadoop vs. Vanilla [large]: %.2f%% %n", 100.0 * hadoopLarge / vanillaLarge);
    }

    private long runTest(UncompressStreamCreator c, String file, int numRuns, String type) throws IOException {
        StopWatch   sw  = new StopWatch();
        int         cnt = 0;

        sw.start();
        for (int k=0; k<numRuns; ++k) {
            List<String> lines = runUncompress(c, file);
            assertThat(lines, notNullValue());
            cnt++;
        }
        sw.stop();

        assertThat(cnt, is(numRuns));
        System.out.printf("Time of %s decompressor running '%s' job = %s (%.3f ms/file)%n", c.name(), type, sw, sw.getTime() / (double)numRuns);

        return sw.getTime();
    }

    private List<String> runUncompress(UncompressStreamCreator c, String file) throws IOException {
        InputStream                 is = this.getClass().getResourceAsStream(file);
        assertThat(is, notNullValue());

        List<String> lines;
        try {
            lines = IOUtils.readLines( c.wrap(is) );
        } finally {
            c.dispose();
        }

        return lines;
    }

    static interface UncompressStreamCreator {
        InputStream     wrap(InputStream is) throws IOException;
        void            dispose();
        String          name();
    }

    static class HadoopUncompressStreamCreator implements UncompressStreamCreator {
        private CompressionCodec        codec;
        private Decompressor            decompressor;
        private CompressionInputStream  in;

        public HadoopUncompressStreamCreator() {
            CompressionCodecFactory     f = new CompressionCodecFactory(new Configuration());
            codec = f.getCodec(new Path("foo.gz"));
        }

        public InputStream wrap(InputStream is) throws IOException {
            decompressor = CodecPool.getDecompressor(codec);
            in           = codec.createInputStream(is, decompressor);
            return in;
        }

        public void dispose() {
            IOUtils.closeQuietly(in);
            CodecPool.returnDecompressor(decompressor);
        }

        public String name() {
            return "Hadoop ";
        }
    }

    static class VanillaUncompressStreamCreator implements UncompressStreamCreator {
        private InputStream in;

        public InputStream wrap(InputStream is) throws IOException {
            in = new GZIPInputStream(is);
            return in;
        }

        public void dispose() {
            IOUtils.closeQuietly(in);
        }

        public String name() {
            return "Vanilla";
        }
    }
}
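The test above times the Hadoop variant first, so class loading and JIT compilation may account for part of its 'small' result; that is a possibility, not something the numbers above prove. A hedged sketch of a timing loop with an untimed warm-up phase, using System.nanoTime instead of StopWatch (the class TimingSketch and the warmupRuns parameter are additions that are not part of the original test):

```java
import java.util.concurrent.Callable;

public class TimingSketch {
    /**
     * Runs the task warmupRuns times without timing it, then timedRuns times,
     * and returns the elapsed milliseconds of the timed part only.
     */
    public static long timeMillis(Callable<?> task, int warmupRuns, int timedRuns) throws Exception {
        for (int k = 0; k < warmupRuns; ++k) {
            task.call();
        }
        long start = System.nanoTime();
        for (int k = 0; k < timedRuns; ++k) {
            task.call();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload; in the test above this would be one runUncompress(...) call.
        Callable<Integer> work = () -> {
            int sum = 0;
            for (int i = 0; i < 1_000_000; ++i) {
                sum += i % 7;
            }
            return sum;
        };
        long ms = timeMillis(work, 100, 1000);
        System.out.println("timed runs took " + ms + " ms");
    }
}
```

Running each variant through such a loop, or alternating the order of the variants, would show whether the ordering matters.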

Libraries used (excl. Hadoop): JUnit, Hamcrest, Apache Commons IO and Apache Commons Lang.

Comments

  1. Davide S says:

    If I execute gzcat as an external process and read the InputStream coming from stdout, I get much better results:
    Gzcat vs. Vanilla [small]: 65.27%
    Gzcat vs. Vanilla [large]: 65.12%

    The external gzip process is running in its own thread, so with 2 or more cores you get a nice performance increase.

  2. Davide S says:

    Here are my results:

    Hadoop vs. Vanilla [small]: 94.59%
    Hadoop vs. Vanilla [large]: 91.24%

    My ‘small’ file is 22M. ‘large’ file is 991M.

  3. Davide S says:

    Interesting!
    My software has to read many gzipped files of about 1GB, every day, so I’ve started looking at Hadoop Native Libraries. Now I’ll try your benchmark first ;)
