Before we start,
let's make sure we know what we're talking about here. A few concepts need to
be clarified:
- OLTP vs OLAP
- Big Data
- Machine Learning
OLTP and OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide source data to data warehouses, whereas OLAP systems help us analyze it. Translated into the database world, we have:
OLTP - OnLine Transactional Processing, normally handled by relational databases.
OLAP - OnLine Analytical Processing (business analytics and advanced data processing), which typically calls for a Big Data technology like Hive. Business Intelligence (OLAP) refers to the generation of reports, which may or may not involve sophisticated tools like Cognos or Business Objects.
Big Data
Big Data tends to refer to extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially those relating to human behavior and interactions.
Machine Learning
Machine Learning, or as I would call it, BI 2.0, does exactly what its name suggests: it enables the machine to learn from the data. In any forward-looking activity, i.e. whenever the term "predictive" kicks in, you can expect Machine Learning to be there. At the same time, it can also be used for Business Intelligence. Examples: predict next week's orders, identify a fraudulent insurance claim, power a chatbot that provides L1 support to customers, and so on. Machine Learning has wider applications: it can be leveraged to power businesses as well, and since business generates a lot of data, it makes easy ground for machine learning. The idea is always the same:
How does Machine Learning work? Simply get a bunch of data, proceed to the "training" phase that analyzes it, and draw the conclusions (the model), which can later be used to make predictions on data that hasn't been seen during training.
DATA -> TRAINING -> MODEL -> PREDICTIONS
Keep in mind that the model is a "living" thing: as new data is continuously brought in, the model is continuously improved.
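To make the flow concrete, here is a minimal sketch of the DATA -> TRAINING -> MODEL -> PREDICTIONS pipeline in Python. It assumes scikit-learn is available, and the order numbers and the "week number" feature are invented purely for illustration.

```python
# A minimal, hypothetical sketch of DATA -> TRAINING -> MODEL -> PREDICTIONS
# (assumes scikit-learn; the order data below is invented for illustration).
from sklearn.linear_model import LinearRegression

# DATA: weekly order counts for the last few weeks (feature: week number)
weeks = [[1], [2], [3], [4], [5], [6]]
orders = [120, 135, 150, 160, 178, 190]

# TRAINING: fit a model on the historical data
model = LinearRegression()
model.fit(weeks, orders)

# MODEL -> PREDICTIONS: predict orders for a week the model hasn't seen
print(model.predict([[7]]))  # roughly 205 orders expected next week
```

As new weeks of data arrive, you would simply retrain on the extended history, which is exactly the "living model" idea above.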
What is Hadoop?
Relational databases, or RDBMS (Relational Database Management Systems), are not optimized for the needs of Big Data and Machine Learning. We need a file system that permits us to scale horizontally and allows us to perform business analytics. Enter Hadoop. Hadoop is based on distributed computing principles, meaning lots of cheap hardware prepared for horizontal scaling, unlike monolithic computing, where you'd rely on a single supercomputer. To get the naming right, remember that Hadoop clusters are composed of Nodes that run in server farms.
Simply
put, Hadoop = HDFS + MapReduce (+ YARN).
Let's now demystify this… We have 3 components:
- HDFS, the Hadoop Distributed File System, for storage
- MapReduce, the programming model for distributed processing, traditionally written in Java
- YARN, in charge of cluster resource management and job scheduling
HDFS: lots of cheap hardware where the data for distributed computing is stored. The Google File System was created to solve the distributed storage problem, and Apache developed the open-source counterpart called HDFS. It's not optimized for low latency, but it has an "insane" throughput.
In HDFS we have a bunch of cheap servers with hard drives. Each of these is known as a Node. There is one master node, called the Name Node, containing the metadata for all the other nodes, known as the Data Nodes. The Name Node knows where the stuff is, but the Data Nodes contain the actual data.
HDFS works as block storage: large files are split into many same-sized (128 MB) blocks and stored on different Data Nodes, so each node holds a partition of a split of the data. The Name Node has the "table of contents" where the block locations are recorded; for example, Block 7 lives on Data Node 4.
How is high availability (HA) handled? Through the replication factor, meaning that every block is stored on more than one Data Node, and the Name Node keeps track of all the copies. A replication strategy ensures that the replicas are placed in an optimal way, to make the best use of bandwidth.
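As a purely illustrative sketch (not real HDFS code), the following Python snippet mimics how a large file gets chopped into 128 MB blocks and how each block could be assigned to several Data Nodes according to a replication factor. The node names and the round-robin placement are made up; real HDFS placement is rack- and bandwidth-aware.

```python
# Toy illustration of HDFS-style block splitting and replication.
# This is NOT the HDFS API; it only mimics the bookkeeping the Name Node does.
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB blocks
REPLICATION_FACTOR = 3                   # each block stored on 3 Data Nodes
DATA_NODES = ["DN1", "DN2", "DN3", "DN4", "DN5"]

def place_blocks(file_size_bytes):
    """Return a 'table of contents': block id -> list of Data Nodes."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    table_of_contents = {}
    for block_id in range(num_blocks):
        # Naive round-robin placement; real HDFS is rack- and bandwidth-aware.
        replicas = [DATA_NODES[(block_id + r) % len(DATA_NODES)]
                    for r in range(REPLICATION_FACTOR)]
        table_of_contents[block_id] = replicas
    return table_of_contents

# A 1 GB file ends up as 8 blocks, each replicated on 3 Data Nodes.
print(place_blocks(1 * 1024 ** 3))
```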
Since I'm currently preparing for the Google Cloud Architect exam, I've been investigating how Hadoop as a managed service is handled on GCP (Google Cloud Platform). Dataproc is Google's managed Hadoop, which lets you stop worrying about replication or Name Nodes. Google Cloud Storage is used on GCP instead of HDFS, because if you followed the Hadoop model with a Name Node, the instance (VM) holding the Name Node would spend an insane amount of resources. Google Cloud Storage optimizes this without a Name Node (not going into details here).
YARN (Yet Another Resource Negotiator): a resource allocator that coordinates the cluster and provides fault tolerance for jobs. YARN uses two components:
- the Resource Manager, running on a master node
- the Node Manager, running on all the other nodes. The Node Manager runs the actual work inside containers, isolated from everything else on the node.
MapReduce: an abstraction that allows any programmer to express data processing in the form of Map and Reduce jobs, enabling distributed computing. The role of MapReduce is to handle huge amounts of data by taking advantage of parallelism. Every job is done in two kinds of functions:
- Map operations: express the work to be done on each piece of input. They run in parallel on many of the machines in the cluster.
- Reduce operations: gather all the results of the Map phase and create a final output, which is stored back on the Data Nodes. Their execution also happens in parallel.
All of this (the map and reduce operations) is traditionally written in Java, and business analysts don't do Java. This is why an SQL interface, provided by Hive (or by BigQuery on GCP), is so important and popular.
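To make the Map and Reduce roles tangible, here is a small word-count sketch in Python. It only simulates the model in a single process; on a real cluster the same two functions would run on many nodes in parallel (for example via Hadoop Streaming, which lets you avoid Java). The input lines are made up.

```python
# A self-contained word-count sketch that mimics the MapReduce model locally.
# On a real cluster the map and reduce functions would run on many nodes in
# parallel; here we just chain them in one process for illustration.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for a single word."""
    return (word, sum(counts))

documents = [
    "Hadoop stores data in HDFS",
    "MapReduce processes data stored in HDFS",
]

# Shuffle step: group all mapped pairs by key (the framework does this for us).
grouped = defaultdict(list)
for line in documents:
    for word, count in map_phase(line):
        grouped[word].append(count)

# Reduce step: one call per key, producing the final word counts.
print(dict(reduce_phase(w, c) for w, c in grouped.items()))
```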
How does Hadoop work?
- The user defines the Map and Reduce tasks using the MapReduce API.
- A job is triggered, and MapReduce communicates it to YARN.
- YARN decides on the resource allocation model, and the tasks read from and write to HDFS.
Hadoop Ecosystem
Hive (along with Spark), which is roughly the GCP equivalent of Google BigQuery, provides an SQL interface to Hadoop. BigQuery uses a columnar format called Capacitor. Hive is great for high-latency applications (BigQuery doesn't have latency as high as Hive's; it can even be used for almost real-time applications).
Hive runs on top of Hadoop and stores its data in HDFS. HiveQL is an SQL-like language, familiar to analysts and engineers. SQL is optimized for relational DBs, and HiveQL for Hive. Hive will TRANSLATE the queries written in HiveQL into MapReduce jobs. A Hive user sees data as if it were stored in tables.
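As an illustration of that table-oriented view, here is a sketch that runs a HiveQL-style query through PySpark. The 'orders' table and its columns are invented, and a temp view stands in for a real Hive table in HDFS; treat it as a sketch, not the canonical Hive workflow.

```python
# A hypothetical HiveQL-style query run through PySpark; in a real Hive setup
# the 'orders' table would live in HDFS, here we fake it with a temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-sketch").getOrCreate()

# Stand-in for a Hive table stored in HDFS (columns are invented).
orders = spark.createDataFrame(
    [(1, "2018-03-01", 120.0), (2, "2018-03-02", 75.5), (1, "2018-03-05", 42.0)],
    ["customer_id", "order_date", "amount"],
)
orders.createOrReplaceTempView("orders")

# The analyst just writes SQL; the engine turns it into distributed jobs
# (classic Hive compiles this kind of query down to MapReduce).
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
""")
top_customers.show()
```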
Compared to a relational DB, Hive is meant to be used for LARGE datasets (gigabytes to petabytes) and for read operations that analyze historical behavior, with parallel computations enabled by MapReduce (need more space - add more servers, in accordance with the horizontal scaling philosophy, whereas a relational DB runs against one really powerful server). Remember that Hive is designed for high-latency use, mostly for read operations, while a relational DB was designed for low latency: quick SQL queries, with both read and write operations.
Hive uses so-called bucketing for segmentation. Partitioning is designed for unequal data segments, while bucketing is designed to distribute data evenly. Since HDFS blocks are 128 MB each, this concept fits bucketing perfectly.
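A quick sketch of what bucketing looks like in practice, expressed here through PySpark's DataFrame writer rather than a raw HiveQL CREATE TABLE statement; the table name, column names, and the bucket count of 32 are arbitrary examples.

```python
# Hypothetical example of writing a bucketed table with PySpark.
# Table and column names, and the bucket count of 32, are arbitrary choices.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "alice", 120.0), (2, "bob", 75.5), (3, "alice", 42.0)],
    ["order_id", "customer", "amount"],
)

# Rows are hashed on 'customer' into 32 evenly sized buckets (files),
# so joins and aggregations on that column can avoid a full shuffle.
(orders.write
       .bucketBy(32, "customer")
       .sortBy("customer")
       .mode("overwrite")
       .saveAsTable("orders_bucketed"))
```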
HBase, which maps directly to Google Bigtable, provides a database management system on top of Hadoop. It integrates with the application much like a traditional database. HBase and Bigtable are columnar (wide-column) databases, and they are designed for low-latency use.
Pig - a data manipulation language. It transforms unstructured data into a structured format, which you can then query using Hive. Its role is covered by Google Dataflow on GCP.
Spark - a distributed computing engine used along with Hadoop. Spark provides an interactive shell to quickly process datasets, and it completely abstracts away the MapReduce complexity in data transformations. You can use Spark if you want to use Scala or Python to work with HDFS and YARN. Spark has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc. Its role is covered by Google Dataflow on GCP.
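To see what "abstracting away MapReduce" means, here is the same word count as the earlier MapReduce sketch, but written with PySpark's RDD API; it assumes pyspark is installed, and the input lines are the same made-up examples.

```python
# The same word count as the MapReduce sketch above, but in a few lines of
# PySpark; Spark hides the map/shuffle/reduce plumbing behind the RDD API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Hadoop stores data in HDFS",
    "MapReduce processes data stored in HDFS",
])

counts = (lines.flatMap(lambda line: line.lower().split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
```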
Kafka - stream processing for unbounded datasets. Kafka takes streaming data from sources and distributes it to sinks. Google Cloud uses Cloud Pub/Sub instead of Kafka.
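For a feel of the source/sink idea, here is a tiny sketch using the kafka-python client; the broker address and topic name are placeholders, and it assumes a Kafka broker is already running locally.

```python
# Minimal producer/consumer sketch with the kafka-python client.
# Broker address and topic name are placeholders; a broker must be running.
from kafka import KafkaProducer, KafkaConsumer

# Source side: push streaming events into a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "alice", "page": "/home"}')
producer.flush()

# Sink side: another process reads the unbounded stream from the same topic.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```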
Oozie - a workflow scheduling tool on Hadoop.
I hope this helps you understand the basics of the Hadoop ecosystem and Big Data. Stay tuned for more posts on how GCP handles Big Data.