Learning HADOOP Cluster

1. What is a Hadoop cluster?

A. Hadoop is a framework that stores and processes large data sets distributed across the machines of a cluster. It scales from a single server to thousands of servers.

A quick and dirty picture of Hadoop:
[Figure: Hadoop distributed architecture]

2. How does it process files?

A. Hadoop breaks a file (e.g. 1 GB) into blocks (e.g. 64 MB each), spreads the blocks across the cluster, and replicates each block on several nodes (three copies by default, not one on every node). This way the data stays available if a node goes down due to hardware failure or anything else, as long as at least one copy of each block survives.
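To make the block arithmetic concrete, here is a small Python sketch of the idea. It is not Hadoop code: the function names, the node names and the round-robin placement are all assumptions made for illustration, and real HDFS uses a smarter, rack-aware placement policy.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes.

    Simplified round-robin placement (an assumption for this sketch),
    not the placement policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a copy on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


# A 1 GB file with 64 MB blocks, as in the example above.
blocks = split_into_blocks(1024 * MB, 64 * MB)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names
placement = place_replicas(len(blocks), nodes, replication=3)

print(len(blocks))   # 16 blocks
print(placement[0])  # ['node1', 'node2', 'node3']

# Even with two of the four nodes down, every block is still readable.
print(len(readable_blocks(placement, {"node1", "node2"})))  # 16
```

With three copies of each block on four nodes, losing any two nodes still leaves at least one copy of every block, which is the point of replication.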

3. What is MapReduce in Hadoop?

A. Map and Reduce are the two phases of a job that runs on a Hadoop cluster.

Map: After a big data file is loaded, each node processes its own part of the data and turns every record into key-value pairs.

Reduce: The pairs are then grouped by key, and the values for each key are combined into the final result. (There's a lot more it can do, but this is just an overview.)
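The classic way to picture the two phases is a word count. The sketch below imitates them in plain Python on a single machine. It does not use Hadoop's API; the function names and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all values by key. On a real cluster the framework does
    this step between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))


# Hypothetical input; on a cluster each line could live on a different node.
lines = ["hadoop stores big data", "hadoop processes big data"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(counts["hadoop"])  # 2
print(counts["stores"])  # 1
```

On a real cluster the map calls run in parallel on the nodes that hold the blocks, and only the grouped pairs travel over the network to the reducers.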

For more info on MapReduce, click here.

4. What sizes of data does Hadoop handle?

A. Terabytes and petabytes of data (very large data sets are processed and distributed across the nodes).
For example, Hadoop has processed 3 petabytes of compressed data, i.e. 7 PB of uncompressed data, which is huge.

5. What is Hive?

A. Hive is software that sits on top of a Hadoop cluster and retrieves data using HiveQL (Hive Query Language), which is similar to SQL. It reads from the data files, so SELECT queries, joins and subqueries can be used to retrieve information from these large files.

6. What else?

I am still learning Hadoop and will post more details as and when I get to know more about it.

Click here for Hadoop Documentation.

Happy Reading..
**********************


Posted by on September 12, 2012 in Database Administration

 
