Big Data, Small Budget

Processing big data requires a lot of CPU power and storage space. At a minimum, you need a cluster of powerful servers to process the data in a distributed fashion, and setting one up is typically very expensive. However, there are several relatively low-cost options. Now, we're not talking petabytes at the major-league scale of Facebook or Twitter, but data big enough, in the tens to hundreds of terabytes, that a traditional RDBMS can't handle it.

We will focus on building a 10-node cluster for Hadoop; the process can easily be adapted to other distributed computing platforms, such as Cassandra, Storm, or Spark. Here are three options:

  1. AWS Elastic Map/Reduce
  2. DIY using AWS EC2 instances
  3. DIY using PCs built with off-the-shelf components

AWS Elastic Map/Reduce

Amazon's Elastic Map/Reduce provides the easiest way to set up a Hadoop cluster, so if your budget allows, this is a great option. Setting up an Elastic M/R cluster is simple: just a few clicks from the AWS console, and you'll be up and running. Depending on your requirements, you may keep the cluster running all the time, or you can fire up a cluster and terminate it when you're done with your data processing tasks.
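
If you'd rather script it than click through the console, the same cluster can be launched through the EMR API. Below is a minimal sketch using the boto3 Python SDK (my choice, not something this article prescribes); the release label, instance type, and key pair name are placeholder assumptions to adjust to your own requirements.

```python
# Minimal sketch: launch a 10-node EMR cluster with boto3.
# Release label, instance type, and key pair are placeholders, not values from this article.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="budget-hadoop-cluster",
    ReleaseLabel="emr-5.36.0",                # assumed release label; use whatever is current
    Applications=[{"Name": "Hive"}, {"Name": "Pig"}, {"Name": "HBase"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",    # placeholder; the article prices m2.4xlarge nodes
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up; False terminates it once all steps finish
        "Ec2KeyName": "my-key-pair",          # hypothetical key pair name
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])
```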

The cost of the cluster is determined by the hardware configuration (type of server instances) and the software configuration (which Hadoop distribution: Amazon or MapR). Currently, Elastic Map/Reduce supports these applications: Hive, Pig, and HBase.

Here are the costs of some commonly used EC2 instance types; hs1.8xlarge, the most expensive instance so far, is included only for reference. The full EC2 instance pricing list is available here.

For a 10-node cluster, the cost can be calculated using the AWS cost calculator.

  • Pros: Easy to set up; can be launched and terminated on demand
  • Cons: Expensive; limited choice of OS, Hadoop distro, and applications

For a 10-node m2.4xlarge cluster, the monthly cost is $16,435. You'll also need to budget around 1/5 of that to cover network I/O and storage costs.
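
As a sanity check, a figure like this is just node count times hourly rate times hours in the month. The short sketch below is my own arithmetic (assuming roughly 720 billable hours per month); it works backwards from the quoted total to the implied effective hourly rate per node and adds the 1/5 overhead mentioned above.

```python
# Back-of-the-envelope check on the quoted EMR figure (assumes ~720 billable hours/month).
NODES = 10
HOURS_PER_MONTH = 24 * 30                     # ~720
MONTHLY_COST = 16_435                         # 10 x m2.4xlarge on Elastic Map/Reduce, as quoted above

rate_per_node = MONTHLY_COST / (NODES * HOURS_PER_MONTH)
print(f"Implied rate: ${rate_per_node:.2f}/hour per node")       # ~$2.28/hour
print(f"With ~1/5 overhead: ${MONTHLY_COST * 1.2:,.0f}/month")    # ~$19,700
```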

To process a petabyte's worth of data, you'll need a 21-node hs1.8xlarge cluster, which costs close to $88,000/month.

DIY Using AWS EC2 Instances

AWS Elastic Map/Reduce adds management and software charges on top of the instance costs. If you want to avoid those costs, you can build your cluster from individual EC2 instances and install and configure the cluster software yourself.

  • Pros: Cheaper than Elastic Map/Reduce; install any OS, Hadoop distro, and applications
  • Cons: Need to manually install and manage the cluster

For a 10-node m2.4xlarge cluster, the monthly cost is $11,800. You'll also need to budget around 1/5 of that to cover network I/O and storage costs, plus labor and software licensing costs.
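
Dividing the two quoted monthly figures back into hourly rates shows what the managed Elastic M/R service itself costs on top of the raw instances. Again, this is just arithmetic on the numbers above, assuming roughly 720 hours per month.

```python
# Effective per-node hourly rates implied by the two quoted monthly figures (~720 hrs/month assumed).
HOURS_PER_MONTH = 24 * 30
emr_rate = 16_435 / (10 * HOURS_PER_MONTH)   # ~$2.28/hour per node on Elastic Map/Reduce
ec2_rate = 11_800 / (10 * HOURS_PER_MONTH)   # ~$1.64/hour per node on plain EC2
print(f"Managed-service premium: ~${emr_rate - ec2_rate:.2f}/hour per node")   # ~$0.64/hour
```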

If you don't plan to set up a permanent cluster that's available 24/7, you can use spot instances to reduce costs. For permanent clusters, reserved instances can lower your costs; you can find more details here.
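
As a rough illustration of the spot route, here is a minimal boto3 sketch; the SDK choice, AMI ID, max price, and key pair name are all my placeholders rather than recommendations from this article.

```python
# Minimal sketch: request 10 spot instances for a DIY Hadoop cluster.
# The AMI, max price, instance type, and key pair below are placeholders only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.50",                      # max price per instance-hour you are willing to pay
    InstanceCount=10,
    Type="persistent",                     # re-request capacity if the instances are reclaimed
    LaunchSpecification={
        "ImageId": "ami-12345678",         # hypothetical AMI with your OS / Hadoop distro of choice
        "InstanceType": "m2.4xlarge",
        "KeyName": "my-key-pair",          # hypothetical key pair
    },
)

for req in response["SpotInstanceRequests"]:
    print("Spot request:", req["SpotInstanceRequestId"])
```

Keep in mind that spot instances can be reclaimed at any time, so they are better suited to worker nodes and restartable jobs than to the master node.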

To process a petabyte's worth of data, you'll need a 21-node hs1.8xlarge cluster, which will cost close to $70,000/month.
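
The 21-node count and the monthly figure follow from the hs1.8xlarge's per-node specs: roughly 24 x 2 TB of local disk (about 48 TB) and an on-demand rate of around $4.60 per instance-hour. Those two per-node numbers are my assumptions based on AWS's published specs of the time, not figures stated above; the arithmetic is sketched below.

```python
import math

# Rough sizing math behind the petabyte-scale figure.
# Per-node storage (24 x 2 TB) and the ~$4.60/hour on-demand rate are assumptions
# based on hs1.8xlarge's published specs, not numbers given in this article.
STORAGE_PER_NODE_TB = 24 * 2       # ~48 TB of local disk per hs1.8xlarge
HOURLY_RATE = 4.60                 # approximate on-demand price per instance-hour
HOURS_PER_MONTH = 24 * 30

nodes = math.ceil(1000 / STORAGE_PER_NODE_TB)       # 21 nodes for ~1 PB of raw capacity
monthly = nodes * HOURLY_RATE * HOURS_PER_MONTH     # ~$69,600/month

print(nodes, f"${monthly:,.0f}")
```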

DIY Using Servers Built with Off-The-Shelf Components

Your lowest-cost option is a complete DIY build. Here's the current pricing for components and the total cost of a mid-to-high-end single server (more or less equivalent to an EC2 m2.4xlarge, with a faster CPU and much more disk space, including fast SSDs, but less memory):

Sample configurations from low to high:

Cost of a 10-node cluster of off-the-shelf servers:

  • Pros: Cheapest; install any OS, Hadoop distro, and applications; full control of the physical hardware
  • Cons: Must manually install and manage both the cluster and the physical hardware

So a 10-node cluster similar to m2.4xlarge costs around $15,000 (with a much larger 45 TB capacity), and a lower-end setup equivalent to a 10-node m2.2xlarge cluster costs around $10,000 (with a much larger 22 TB capacity).

To process a petabyte's worth of data, you'd need a 63-node cluster of the high-end servers, and it would cost close to $200,000.
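
For comparison on a per-node basis, here is the same arithmetic applied to the DIY figures above; no new data, just division of the quoted totals.

```python
# Per-node cost and capacity implied by the DIY figures quoted above.
configs = {
    "mid-range (~m2.4xlarge)": (15_000, 45, 10),      # total $, total TB, node count
    "lower-end (~m2.2xlarge)": (10_000, 22, 10),
    "high-end petabyte cluster": (200_000, 1000, 63),
}
for name, (cost, tb, n) in configs.items():
    print(f"{name}: ~${cost / n:,.0f} and ~{tb / n:.1f} TB per node")
```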

Choosing a Hadoop Distribution

You can choose a Hadoop distribution from one of the following competing offerings. They are all pretty good and easy to set up on a cluster, so any one of them should more or less satisfy your big data needs:

Cloudera CDH4

Applications included: DataFu, Flume, Hadoop, HBase, HCatalog, Hive, Hue, Mahout, Oozie, Parquet, Pig, Sentry, Sqoop, Whirr, ZooKeeper, Impala

Hortonworks HDP

Applications included: Core Hadoop (HDFS, MapReduce, Tez, YARN), Data Services (Accumulo, Flume, HBase, HCatalog, Hive, Mahout, Pig, Sqoop, Storm), Operational Services (Ambari, Falcon, Knox Gateway, Oozie, ZooKeeper)

MapR

Applications included: HBase, Pig, Hive, Mahout, Cascading, Sqoop, Flume, and more

Resource Links

Other distributed computing environments worth checking out: