Big Data, Small Budget

By Mobomo, February 13, 2014

Processing big data requires a lot of CPU power and storage space. At a minimum, you need a cluster of powerful servers to process data in a distributed fashion. Setting one up is typically very expensive, but there are several relatively low-cost options. We're not talking about petabyte-scale big data in the major leagues, like Facebook or Twitter, but about data big enough, at tens to hundreds of terabytes, that a traditional RDBMS isn't enough.

We will be focusing on building a 10-node cluster for Hadoop. This process can also easily be adapted for other distributed computing platforms, such as Cassandra, Storm, or Spark.

AWS Elastic Map/Reduce
DIY using AWS EC2 instances
DIY using PCs built with off-the-shelf components

AWS Elastic Map/Reduce

Amazon’s Elastic Map/Reduce provides the easiest way to set up a Hadoop cluster, so if your budget allows, this is a great option. Setting up an Elastic Map/Reduce cluster is simple: a few clicks in the AWS console and you’ll be up and running. Depending on your requirements, you can keep the cluster running all of the time, or you can fire up a cluster and terminate it when you’re done with your data processing tasks.

The cost of the cluster is determined by the hardware configuration (type of server instances) and the software configuration (which Hadoop distribution: Amazon or MapR). Currently, Elastic Map/Reduce provides these applications: Hive, Pig, and HBase.

Here are the costs of some commonly used EC2 instance types. The hs1.8xlarge, the most expensive instance so far, is not commonly used and is listed only for reference. The full EC2 instance pricing list is available here.

For a 10-node cluster, the cost can be calculated using the AWS cost calculator.

Pros: Easy to set up, can be launched and terminated on demand
Cons: Expensive, limited choices of OS, Hadoop distro, and applications

For a 10-node m2.4xlarge cluster, the monthly cost is $16,435. You’ll also need to budget around 1/5 of that to cover network I/O and storage costs.

To process a petabyte worth of data, you’ll need a 21-node hs1.8xlarge cluster, which costs close to $88,000/month.
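The budgeting arithmetic above can be sketched in a few lines of Python. This is only an illustration: the dollar amounts are the estimates quoted in this article, and the function names `with_overhead` and `per_node` are made up for the example.

```python
# Monthly Elastic Map/Reduce figures quoted in this article (USD).
# Treat them as rough 2014 estimates, not current AWS pricing.
EMR_10_NODE_M2_4XLARGE = 16_435
EMR_21_NODE_HS1_8XLARGE = 88_000


def with_overhead(base_monthly, overhead_fraction=0.2):
    """Add the roughly 1/5 allowance for network I/O and storage."""
    return round(base_monthly * (1 + overhead_fraction), 2)


def per_node(monthly_cost, nodes):
    """Average monthly cost of a single node in the cluster."""
    return round(monthly_cost / nodes, 2)


if __name__ == "__main__":
    print("10-node budget incl. overhead:", with_overhead(EMR_10_NODE_M2_4XLARGE))
    print("m2.4xlarge per node:", per_node(EMR_10_NODE_M2_4XLARGE, 10))
    print("hs1.8xlarge per node:", per_node(EMR_21_NODE_HS1_8XLARGE, 21))
```

With the 1/5 allowance, the 10-node cluster comes to roughly $19,700 a month in total.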

DIY Using AWS EC2 Instances

AWS Elastic Map/Reduce adds management and software charges on top of the instance costs. If you want to avoid those costs, you can build your cluster from individual EC2 instances and install and configure the cluster software yourself.

Pros: Cheaper, install any OS, Hadoop distro, and applications
Cons: Need to manually install and manage the cluster

For a 10-node m2.4xlarge cluster, the monthly cost is $11,800. You’ll also need to budget around 1/5 of that to cover network I/O and storage costs, plus labor and software licensing costs.

If you don’t plan to set up a permanent cluster that’s available 24/7, you can use spot instances to reduce cost. For permanent clusters, reserved instances can lower your costs; you can find more details here.

To process a petabyte worth of data, you’ll need a 21-node hs1.8xlarge cluster, and it will cost close to $70,000/month.
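To see how much the Elastic Map/Reduce layer adds, you can compare the two sets of monthly figures directly. The sketch below uses only the numbers quoted in this article, and `diy_savings` is a hypothetical helper name. It leaves out the labor and licensing costs mentioned above, which would narrow the gap.

```python
def diy_savings(emr_monthly, ec2_monthly):
    """Return (dollars saved per month, percent saved) by running plain EC2."""
    saved = emr_monthly - ec2_monthly
    percent = round(100 * saved / emr_monthly, 1)
    return saved, percent


if __name__ == "__main__":
    # 10-node m2.4xlarge: $16,435 on Elastic Map/Reduce vs. $11,800 DIY.
    print(diy_savings(16_435, 11_800))
    # 21-node hs1.8xlarge: about $88,000 vs. about $70,000.
    print(diy_savings(88_000, 70_000))
```

On these figures, DIY on EC2 saves about $4,600 a month (roughly 28%) for the 10-node cluster and about $18,000 a month (roughly 20%) for the petabyte-scale cluster.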

DIY Using Servers Built with Off-The-Shelf Components

Your lowest-cost option is a complete DIY build. Here’s the current pricing for components and the total cost of a mid-to-high-end single server (more or less equivalent to an EC2 m2.4xlarge, with a faster CPU and much more disk space, including fast SSD, but less memory):

Sample configurations from low to high:

Cost of a 10-node cluster of off-the-shelf servers:

Pros: Cheapest, install any OS and Hadoop distro, runs on physical hardware
Cons: Must manually install and manage the cluster and physical hardware

So, a 10-node cluster similar to m2.4xlarge costs around $15,000 (with a much larger 45TB capacity), and a lower-end setup equivalent to a 10-node m2.2xlarge cluster costs around $10,000 (with a much larger 22TB capacity).

To process a petabyte worth of data, you’d need a 63-node cluster with the high end servers, and it will cost close to $200,000.
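Because the hardware is a one-time purchase while EC2 is billed monthly, a rough break-even point is easy to estimate. The sketch below is a simplification using only the figures in this article: it ignores power, cooling, rack space, bandwidth, and labor, and the hardware is not identical to the EC2 instances. `breakeven_months` is a hypothetical helper name.

```python
def breakeven_months(hardware_cost, monthly_rental):
    """Months of rental spend that equal a one-time hardware purchase."""
    return round(hardware_cost / monthly_rental, 2)


if __name__ == "__main__":
    # 10 off-the-shelf servers (~$15,000) vs. 10 m2.4xlarge on EC2 ($11,800/month).
    print(breakeven_months(15_000, 11_800))
    # 63 high-end servers (~$200,000) vs. 21 hs1.8xlarge on EC2 (~$70,000/month).
    print(breakeven_months(200_000, 70_000))
```

On these numbers the hardware cost equals about 1.3 months of EC2 spend for the 10-node cluster and just under 3 months for the petabyte-scale cluster, before the operating costs noted above are counted.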

Choosing a Hadoop Distribution

You can choose a Hadoop distribution from one of the following competing offerings. They are all pretty good and easy to set up on a cluster, so any one of them should more or less satisfy your big data needs:

Cloudera CDH4

Applications included: DataFu, Flume, Hadoop, HBase, HCatalog, Hive, Hue, Mahout, Oozie, Parquet, Pig, Sentry, Sqoop, Whirr, ZooKeeper, Impala

Hortonworks HDP

Applications included: Core Hadoop (HDFS, MapReduce, Tez, YARN), Data Services (Accumulo, Flume, HBase, HCatalog, Hive, Mahout, Pig, Sqoop, Storm), Operational Services (Ambari, Falcon, Knox Gateway, Oozie, ZooKeeper)

MapR

Applications included: HBase, Pig, Hive, Mahout, Cascading, Sqoop, Flume and more

Resource Links

Other distributed computing environments worth checking out:
