Before we start, let's make sure we know what we're talking about here. A few concepts need to be clarified:
- OLTP vs OLAP
- Big Data
- Machine Learning
OLTP and OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it. Taking this into the Data Bases world, we have:
OLTP - On Line Transactional Processing, normally handled by Relational Data Bases
OLAP - On Line Analytical Processing (Business Analytics and advanced data processing), requires a Big Data technology, like Hive. Business Intelligence (OLAP) refers to the generation of reports which may or may not involve sophisticated tools like Cognos or Business Objects.
Big Data tends to refer to the extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
Machine Learning, or as I would call it BI 2.0, does exactly as its name suggests. Enabling the Machine to learn from the data. Any forward looking activity, i.e. whenever the term “Predictive” kicks in, you can expect Machine Leaning to be there. At the same time, it can also be used for Business intelligence. Example: predict orders next week, identify a fraudulent insurance claim, power a chat-bots to provide L1 support to customers and so on. Machine learning has wider applications. It can be leveraged to power businesses as well. And since business generates lot of data, it makes a easy ground for machine learning. The idea is always the same:
How does the Machine Learning work? Simply get a bunch of Data, proceed to the "Training" analyzing the data, and get the conclusions (Model) in order to later be able to make Predictions, applicable to the data that hasn't been used for training.
DATA -> TRAINING -> MODEL -> PREDICTIONS
Have in mind that Model is a "living" thing, as the new data is continuously being brought in, the Model is being continuously improved.
What is Hadoop?
Relational Data Bases or RDBMS (Relational Data Base Management System) are not optimized for the needs of Big Data and Machine Learning. We need a File System, that permits us to scale horizontally, and allows us to perform BA. Enter - Hadoop. Hadoop is based on the Distributed Computing principles, meaning - lots of cheap hardware, and prepared for horizontal scaling, unlike Monolithic Computing, where you'd rely on a single Super Computer. To get the naming right, remember that Hadoop Clusters are composed of Nodes, that run in Server Farms.
Simply put, Hadoop = HDFS + MapReduce (+ YARN). Let's now demystify this… We have 3 components:
- HDFS, or the Hadoop File System
- MapReduce, for data representation, using Java
- YARN, in charge of Replication and Clustering
HDFS: Lots of cheap hardware where the Distributed Computing is "stored". Google File System was created to solve the Distributed Storage problem, and Apache developed the Open Source version of this called HDFS. It's not optimized for low latency, but it has an "insane" throughput.
In HDFS we will have a bunch of cheap servers, with the Hard Drives. Each of these will be known as a Node, and there will be one Master Node, called the Name Node, containing the Metadata for all the other Nodes, known as the Data Nodes. Name Node knows where the stuff is, but the Data Nodes contain the Data.
HDFS works as a Block Storage, because the large files are separated in many same sized (128MB) Blocks, and stored in different Data Nodes. Each node contains a partition of a split of data. Name node has the "Table of Contents" where different Block locations are documented. For example, Block 7 is in DN4, as in the diagram below.
How is HA handled? Using the replication factor, meaning that every Block is stored in more then one Data Note. Name node needs to keep track of these. A Replication Strategy handles that the Replicas are stored in an optimal way, to optimize Bandwidth.
Since I'm currently preparing for the Google Cloud Architect exam, I've been investigating how Hadoop as a Managed Service is handled on the GCP (Google Cloud Platform). Dataproc is Google's managed Hadoop, which let's you not worry about the Replication or Name Nodes. Google Cloud Storage is used on the GCP instead of the HDFS, because if you followed the model with the Name Node, such as Hadoop, the Instance (VM) with the Name Node would be spending an insane amounts of resources. In Google Cloud Storage this is optimized without the Name Node (not going into details here).
YARN (Yet Another Resource Negotiator): a Resource allocator that lets us do Replication and Fault Tolerance. YARN coordinates the cluster, using two components:
- Resource Manager, on a master node
- Node Manager, running on all other nodes. This is actually a container, isolated from everything else on the Node.
MapReduce: Abstraction that allows any programmer to present the data in the form of Map and Reduce jobs, and enable the Distributed Computing. The role of MapReduce is to handle a huge amounts of data. It takes advantage of parallelism. Every step is done in two functions.
- Map operations: Express what the body needs to accomplish. Runs in parallel on many of the machines in the cluster.
- Reduce operations: Distribute all the results of Map operation, and create a final output, storing it into all the Data nodes, and their execution will happen in parallel.
This is all (map and reduce operations) written in JAVA, and Business Analysts don’t do JAVA. This is why an SQL interface, provided by Hive (or by BigQuery, in GCP) is so important and popular.
How does Hadoop work?
- User defines the Map and Reduce tasks using the MapReduce API
- A Job is Triggered, MapReduce communicates it to YARN
- YARN decides the Resource allocation model, and communicates it to HDFS.
Hive (along with Spark), which is basically the same as Google BigQuery on GCP, provides an SQL interface to Hadoop. BigQuery uses a columnar format called Capacitor. Hive is great for High Latency applications (BigQuery doesn't have as high latency as Hive, it can even be used for almost real-time applications).
Hive runs on top of Hadoop, and stores it's data in HDFS. HiveQL is an SQL type language, familiar to analysts and engineers. SQL is optimized for Relational DBs, and HiveQL for Hive. Hive will TRANSLATE the queries written in SQL in HiveQL into MapReduce. A Hive user sees data as if it were stored in Tables.
Comparing to Relational DB, Hive is meant to be used for LARGE datasets (Giga or petabytes), Read operations to analyze the historical behavior, with Parallel computations (need more space - add more servers, in accordance with Horizontal Scaling philosophy) enabled by MapReduce (relational DB runs against one really powerful server), and remember that Hive is designed for the High Latency use, mostly for Read operations. Relational DB was designed for Low Latency, quick SQL consults, Read and Write operations.
Hive uses a so called Bucketing segmentation. Partitioning is designed for a non equal data segments. Bucketing is designed to evenly distribute data. Since in HDFS the Blocks are 128MB each, which is why this concepts fits the Bucketing perfectly.
Hbase, which maps directly to Google BigTable provides a management system on top of Hadoop. It integrates with the Application much like a traditional database. Hbase and Bigtable are columnar data bases, and they are designed for the low latency use.
Pig - A data manipulation language. It transforms the unstructured data into a structured format. You can query this structured data using Hive. Included in Google DataFlow.
Spark - A distributed computing engine used along with Hadoop. Spark acts as an interactive Shell to quickly process Datasheets. It completely abstracts away the MapReduce complexity in data transformation. You can use Spark if you want to use Scala or Python to operate HDFS and YARN. Spark has a bunch of built in Data Libraries used for Machine Learning, stream processing, graph processing etc. Included in Google DataFlow.
Kafka - Stream processing for unbounded datasets. Kafka takes streaming data from sources and distributes to sinks. Google Cloud used a Google Pub/Sub instead of Kafka.
Oozie - a workflow scheduling tool on Hadoop.
I hope this helps understand the basics of Hadoop ecosystem and Big Data. Stay tuned for more posts on how GCP is handling Big Data.