I have recently done some programming with Hadoop, a framework for massive data processing across many servers. Hadoop reads input data from (large) files and performs MapReduce data reductions.
One way of reading input is to decompress gzipped files. Java has built-in support for reading gzipped streams via GZIPInputStream. However, Hadoop ships with its own implementation that uses native libraries for efficiency.
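For readers who have not used the JDK class, here is a minimal sketch of reading gzipped data with GZIPInputStream. It is not part of the benchmark: the class name GzipRoundTrip and its helper methods are made up for illustration, and the data is compressed in memory so that no sample file is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class GzipRoundTrip {

    /** Compresses the text in memory, so the example needs no file on disk. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the GZIP trailer has been written.
        return buf.toByteArray();
    }

    /** Decompresses with the JDK's built-in GZIPInputStream. */
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "line one\nline two\n";
        String restored = gunzip(gzip(original));
        System.out.println("Round trip intact: " + original.equals(restored));
    }
}
```

Any InputStream can be wrapped the same way, which is what the vanilla variant in the test class further down does with a classpath resource.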
I was curious about how much faster the native Hadoop variant was compared with the Vanilla version, so I wrote a small test. The result turned out to be a surprise:
May 6, 2009 10:56:47 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
INFO: Loaded the native-hadoop library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.zlib.ZlibFactory <clinit>
INFO: Successfully loaded & initialized native-zlib library
May 6, 2009 10:56:47 PM org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
Time of Hadoop decompressor running 'small' job = 0:00:01.684 (1.684 ms/file)
Time of Hadoop decompressor running 'large' job = 0:00:10.074 (1007.400 ms/file)
Time of Vanilla decompressor running 'small' job = 0:00:01.340 (1.340 ms/file)
Time of Vanilla decompressor running 'large' job = 0:00:10.094 (1009.400 ms/file)
Hadoop vs. Vanilla [small]: 125.67%
Hadoop vs. Vanilla [large]: 99.80%
For small GZ files (KB), the native version took about 26% longer than the vanilla one, and for moderately sized files (MB) the difference was negligible.
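The two percentages are simply the Hadoop time divided by the vanilla time. As a quick check of the arithmetic, here is a small sketch that recomputes them from the millisecond totals in the output above; the class name RatioCheck and the method percentOf are made up for this illustration.

```java
// Hypothetical helper class, for illustration only.
final class RatioCheck {

    /** Returns the first time as a percentage of the second. */
    static double percentOf(long timeMs, long baselineMs) {
        return 100.0 * timeMs / baselineMs;
    }

    public static void main(String[] args) {
        // Totals taken from the output above: 1.684 s vs 1.340 s and 10.074 s vs 10.094 s.
        System.out.printf("small: %.2f%%%n", percentOf(1684, 1340));
        System.out.printf("large: %.2f%%%n", percentOf(10074, 10094));
    }
}
```

A value above 100% means the Hadoop decompressor needed more time than the vanilla one, and a value below 100% means it needed less.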
I’m running the test on a 1.67 GHz dual-core laptop with 2 GB RAM and Ubuntu 9.04 64-bit. The ‘java.library.path’ system property points to the ‘hadoop/lib/native/Linux-amd64-64’ directory, which is needed so that the JVM can pick up the correct native library.
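If the native library is not picked up, a first thing to check is what the JVM actually sees in that property. The following is a small sketch using only the JDK; the class name LibraryPathCheck is made up for this illustration, and the exact way the property gets set (typically the standard -Djava.library.path=... JVM option) depends on how you launch the test.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class LibraryPathCheck {

    /** Splits a java.library.path value into its non-empty directory entries. */
    static List<String> entries(String libraryPath) {
        List<String> result = new ArrayList<>();
        if (libraryPath == null) {
            return result;
        }
        for (String entry : libraryPath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints every directory the JVM will search for native libraries
        // and whether that directory exists on this machine.
        for (String dir : entries(System.getProperty("java.library.path"))) {
            System.out.printf("%s (exists: %b)%n", dir, new File(dir).isDirectory());
        }
    }
}
```

If the Hadoop native directory is missing from the printed list, the "Loaded the native-hadoop library" line shown in the output above should not be expected to appear.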
If you want to replicate the test, I have enclosed the test class below:
package com.ribomation.weatherstats.computeTemperatures;

import org.junit.Test;
import static org.hamcrest.CoreMatchers.*;
import static org.junit.Assert.assertThat;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.time.StopWatch;
import org.apache.hadoop.io.compress.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import java.io.InputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.List;

/**
 * Compares the built-in Java GZip uncompress with Hadoop's ditto.
 * <p/>
 * User: jens
 * Date: May 6, 2009 9:13:32 PM
 */
public class UncompressingComparativeTest {
    private String sampleSmallFile = "/sample-file-KB.gz";
    private String sampleLargeFile = "/sample-file-MB.gz";
    private int numRunsSmall = 1000;
    private int numRunsLarge = 10;

    @Test
    public void test_run() throws IOException {
        UncompressStreamCreator hadoop = new HadoopUncompressStreamCreator();
        UncompressStreamCreator vanilla = new VanillaUncompressStreamCreator();

        long hadoopSmall = runTest(hadoop, sampleSmallFile, numRunsSmall, "small");
        long hadoopLarge = runTest(hadoop, sampleLargeFile, numRunsLarge, "large");
        long vanillaSmall = runTest(vanilla, sampleSmallFile, numRunsSmall, "small");
        long vanillaLarge = runTest(vanilla, sampleLargeFile, numRunsLarge, "large");

        System.out.printf("Hadoop vs. Vanilla [small]: %.2f%% %n", 100.0 * hadoopSmall / vanillaSmall);
        System.out.printf("Hadoop vs. Vanilla [large]: %.2f%% %n", 100.0 * hadoopLarge / vanillaLarge);
    }

    private long runTest(UncompressStreamCreator c, String file, int numRuns, String type) throws IOException {
        StopWatch sw = new StopWatch();
        int cnt = 0;
        sw.start();
        for (int k = 0; k < numRuns; ++k) {
            List<String> lines = runUncompress(c, file);
            assertThat(lines, notNullValue());
            cnt++;
        }
        sw.stop();
        assertThat(cnt, is(numRuns));
        System.out.printf("Time of %s decompressor running '%s' job = %s (%.3f ms/file)%n",
                c.name(), type, sw, sw.getTime() / (double) numRuns);
        return sw.getTime();
    }

    private List<String> runUncompress(UncompressStreamCreator c, String file) throws IOException {
        InputStream is = this.getClass().getResourceAsStream(file);
        assertThat(is, notNullValue());
        List<String> lines;
        try {
            lines = IOUtils.readLines(c.wrap(is));
        } finally {
            c.dispose();
        }
        return lines;
    }

    static interface UncompressStreamCreator {
        InputStream wrap(InputStream is) throws IOException;
        void dispose();
        String name();
    }

    static class HadoopUncompressStreamCreator implements UncompressStreamCreator {
        private CompressionCodec codec;
        private Decompressor decompressor;
        private CompressionInputStream in;

        public HadoopUncompressStreamCreator() {
            CompressionCodecFactory f = new CompressionCodecFactory(new Configuration());
            codec = f.getCodec(new Path("foo.gz"));
        }

        public InputStream wrap(InputStream is) throws IOException {
            decompressor = CodecPool.getDecompressor(codec);
            in = codec.createInputStream(is, decompressor);
            return in;
        }

        public void dispose() {
            IOUtils.closeQuietly(in);
            CodecPool.returnDecompressor(decompressor);
        }

        public String name() {
            return "Hadoop ";
        }
    }

    static class VanillaUncompressStreamCreator implements UncompressStreamCreator {
        private InputStream in;

        public InputStream wrap(InputStream is) throws IOException {
            in = new GZIPInputStream(is);
            return in;
        }

        public void dispose() {
            IOUtils.closeQuietly(in);
        }

        public String name() {
            return "Vanilla";
        }
    }
}
Libraries used (excl. Hadoop), as seen in the imports above: JUnit 4, Hamcrest, Apache Commons IO and Apache Commons Lang.


If I execute gzcat as an external process and read the InputStream coming from stdout, I get much better results:
Gzcat vs. Vanilla [small]: 65.27%
Gzcat vs. Vanilla [large]: 65.12%
The external gzip process is running in its own thread, so with 2 or more cores you get a nice performance increase.
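For anyone who wants to try this approach, here is a minimal sketch of reading decompressed data from an external process. It is not the commenter's code. It assumes a gzip binary on the PATH and calls it as `gzip -dc`, which I take to behave like the gzcat mentioned above; the class name ExternalGunzip and its methods are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
final class ExternalGunzip {

    /** Runs "gzip -dc <file>" (assumed to be on the PATH) and returns its stdout. */
    static byte[] gunzipWithProcess(String gzFile) {
        try {
            Process p = new ProcessBuilder("gzip", "-dc", gzFile)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            // The child process decompresses while this loop consumes its output.
            try (InputStream in = p.getInputStream()) {
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    plain.write(chunk, 0, n);
                }
            }
            int exit = p.waitFor();
            if (exit != 0) {
                throw new IOException("gzip exited with status " + exit);
            }
            return plain.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Writes a small gzipped temp file so the sketch can be run without sample data. */
    static String writeSample(String text) {
        try {
            Path file = Files.createTempFile("sample", ".gz");
            file.toFile().deleteOnExit();
            try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String file = writeSample("line one\nline two\n");
        System.out.print(new String(gunzipWithProcess(file), StandardCharsets.UTF_8));
    }
}
```

In a benchmark you would hand the process's InputStream to the line reader directly instead of buffering everything, so that decompression and parsing can overlap.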
Interesting, that’s a significant improvement.
Here are my results:
Hadoop vs. Vanilla [small]: 94.59%
Hadoop vs. Vanilla [large]: 91.24%
My ‘small’ file is 22M. ‘large’ file is 991M.
Interesting!
My software has to read many gzipped files of about 1 GB every day, so I’ve started looking at the Hadoop native libraries. Now I’ll try your benchmark first.