How I passed the Google Certified Professional Cloud Architect Exam

After a few months of heavy preparation, I managed to pass the exam. I got the electronic certificate, and supposedly I'll get a Cloud Architect Hoodie! Yeah, I'm gonna wear it :)

The exam is every bit as difficult as advertised. I did A LOT of hands-on work in the Google Cloud Platform (the $300 credit that Google gives you to play around comes in quite handy). Without it I don't think it's possible to pass: a bunch of questions have commands to choose from, and there is a heavy focus on App Development and Linux commands. If you want to know how I prepared, check out my previous posts:

  1. Why I decided to become a Certified Cloud Architect, why Google Cloud, and how I want to prepare
  2. Introduction to Big Data and Hadoop
  3. Google Cloud - Compute Options (IaaS, PaaS, CaaS/KaaS)
  4. Google Cloud - Storage and Big Data Options
  5. Google Cloud - Networking and Security Options

Stay tuned, my Cloud is about to get much more DevOps-y in 2018!

Public Cloud Networking and Security: VPCs, Interconnection to Cloud, Load Balancing

I'm so happy to finally be here, at the Networking part of the Public Cloud!!! I know, there are more important parts of Cloud than Networks, but SDN is my true love, and we should give it all the attention it deserves.

IMPORTANT: In this post I will be heavily focusing on Google Cloud Platform. The concepts described here apply to ANY Public Cloud. Yes, specifics may vary, and in my opinion GCP is a bit superior to AWS and Azure at this moment, but if you understand how this one works - you'll easily get all the others.

Virtual Private Cloud (VPC)

A VPC (Virtual Private Cloud) provides global, scalable and flexible networking. This is an actual Software Defined Network provided by Google. A project can have up to 5 VPCs. A VPC can be global; it contains subnets and uses a private IP space. Subnets are regional. The network that a VPC gives you is:

  • Private
  • Secure
  • Managed
  • Scalable
  • Can contain up to 7000 VMs

Once you create the VPC, you have a cross-region RFC 1918 IP space network that runs on Google's private network underneath. It uses the global internal DNS, load balancing, firewalls and routes, and you can scale rapidly with global L7 Load Balancers. A subnet can only exist within a single region; you can't extend a subnet over your entire VPC.

VPC Networks can be provisioned in:
  • Auto Mode, where a subnet is automatically created in every region. Firewall rules and routes are preconfigured.
  • Custom Mode, where we have to configure the subnets manually.
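
If you go for Custom Mode, it helps to sanity-check your subnet plan before you create anything. Here is a minimal sketch of that check, using only Python's standard library. The subnet names, regions and CIDR ranges are made up for illustration, and the check only covers what is described above (private RFC 1918 space, one region per subnet, no overlapping ranges).

```python
import ipaddress

# The three RFC 1918 private blocks.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def validate_subnets(subnets):
    """subnets: dict of hypothetical subnet name -> (region, CIDR).

    Each subnet carries exactly one region, because subnets are regional.
    Returns a list of problems; an empty list means the plan looks sane.
    """
    problems = []
    parsed = {}
    for name, (region, cidr) in subnets.items():
        net = ipaddress.ip_network(cidr)
        if not any(net.subnet_of(block) for block in RFC1918):
            problems.append(f"{name}: {cidr} is not inside RFC 1918 space")
        parsed[name] = net
    names = sorted(parsed)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if parsed[a].overlaps(parsed[b]):
                problems.append(f"{a} overlaps {b}")
    return problems

# Hypothetical plan: one subnet per region, inside the same VPC.
plan = {
    "web-eu": ("europe-west1", "10.132.0.0/20"),
    "web-us": ("us-central1", "10.128.0.0/20"),
}
print(validate_subnets(plan))
```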

IP Routing and Firewalling

Routes are defined for the networks to which they apply. You can restrict a route to the instances that carry a certain "instance tag" (if you don't specify a tag, the route applies to all the instances in the network).

When you use the Routes to/from the Internet, you have 2 options:

A project can contain various VPCs (Google allows you to create up to 5 VPCs per project). VPCs also support multi-tenancy. All the resources in GCP belong to some VPC. Routing and forwarding must be configured to allow traffic within the VPC and with the outside world. You also need to configure the Firewall Rules.

VPCs are GLOBAL, meaning the resources can be anywhere around the world. Even so, instances from different regions CANNOT BE IN THE SAME SUBNET. An instance needs to be in the same region as a reserved static IP address. The zone in the region doesn't matter.

Firewall Rules can be based on the Source IP (ingress) or the Destination IP (egress). There are DEFAULT "allow egress" and "deny ingress" rules, which are pre-configured for you with the lowest priority (65535). This means that if you configure new FW rules with a lower number (a higher priority), these will be taken into account instead of the default ones. GCP Firewall rules are STATEFUL. You can also use TAGs and Service Accounts to configure the Firewall rules, and this is probably THE BIGGEST advantage of the Cloud Firewall, because you can do Micro Segmentation in a native way. Once you create a Firewall Rule, a TAG is created, so the next time you create an instance and apply that rule, it will not be created again; the TAG is just attached to your instance.
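
To make the priority logic concrete, here is a small toy model in Python. It is only a sketch of the "lowest number wins, defaults sit at 65535" idea described above, not how GCP implements it. The Rule class and the decide function are my own names, and statefulness, tags and tie-breaking between equal priorities are not modeled.

```python
from dataclasses import dataclass
import ipaddress

@dataclass
class Rule:
    priority: int    # lower number = higher priority
    direction: str   # "ingress" or "egress"
    action: str      # "allow" or "deny"
    cidr: str        # source range for ingress, destination range for egress

# The two pre-configured defaults, at the lowest priority.
DEFAULTS = [
    Rule(65535, "egress", "allow", "0.0.0.0/0"),
    Rule(65535, "ingress", "deny", "0.0.0.0/0"),
]

def decide(rules, direction, peer_ip):
    """Return the action of the matching rule with the lowest priority number."""
    ip = ipaddress.ip_address(peer_ip)
    candidates = [r for r in rules + DEFAULTS
                  if r.direction == direction
                  and ip in ipaddress.ip_network(r.cidr)]
    return min(candidates, key=lambda r: r.priority).action

# Hypothetical custom rule: allow ingress from one office range.
my_rules = [Rule(1000, "ingress", "allow", "203.0.113.0/24")]
print(decide(my_rules, "ingress", "203.0.113.7"))
print(decide(my_rules, "ingress", "198.51.100.1"))
```

The first call matches the custom rule (priority 1000), so it beats the default deny. The second call only matches the default, so the traffic is denied.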

There are 2 types of IP addresses in VPC:
- External, in the Public IP space
- Internal, in the Private IP space

VPCs can communicate with each other using the Public IP space (External networks visible on the Internet). An External IP can be ephemeral (released when the instance is stopped or deleted, so it can change) or static. VMs don't know what their external IP is. IMPORTANT: If you RESERVE an External IP in order to configure it as STATIC, and don't use it for an Instance or a LB, you will be charged for it! Once you assign it, it's free.

When you work with Containers, the containers need to focus on the Application or Service. They don't need to do their own routing, which simplifies traffic management.

Can I use a single RFC 1918 space across several GCP Projects?

Yes, using a Shared VPC - Networks can be shared across Regions, Projects etc. If you have different Departments that need to work on the same Network resources, you'd create two separate projects for them, give the access only to the project they work on, and use a single Shared VPC for the Network resources they all need to access.

Google Infrastructure

Google's network infrastructure has three distinct elements:
  • Core data centers (central circle), used for Computation and Backend storage.
  • Edge Points of Presence (PoPs), where Google's network connects to the rest of the internet via peering. Google is present on over 90 internet exchanges and at over 100 interconnection facilities around the world.
  • Edge caching and services nodes (Google Global Cache, or GGC). These edge nodes represent the tier of Google's infrastructure closest to the users. Network operators and internet service providers deploy Google-supplied servers inside their own networks.

CDN (Content Delivery Network) is also worth mentioning. It's enabled by Edge Cache Sites (Edge PoPs, or the light green circle above), the places where the online content can be delivered closer to the users for faster response times. It works with Load Balancing, and the Content is CACHED in 80+ Edge Cache Sites around the globe. Unlike most CDNs, your site gets a single IP address that works everywhere, combining global performance with easy management, with no regional DNS required. For more information check out the official Google docs.

Connecting your environment to GCP (Cloud Interconnect)

While this may change in the future, a VPN hosted on GCP does not allow for client connections. However, connecting a VPC to an on-premises VPN (not hosted on GCP) is not an issue.

There are 3 ways you can connect your Data Center to GCP:
  • Cloud VPN/IPsec VPN, as in a standard Site to Site IPsec VPN session (supports IKEv1 and v2). It supports 1.5-3 Gbps per tunnel, and you can set up several tunnels to increase performance. You can also use this option to connect different VPCs to each other, or your VPC to another Public Cloud. Cloud Router is not required for Cloud VPN, but it does make things a lot easier, by introducing dynamic routing (BGP) between your DC and GCP. When using static routes, any new subnet on the peer end must be added to the tunnel options on the Cloud VPN gateway.
  • Dedicated Interconnect, used if you don't want to go via the Internet, and you can meet Google in one of the Dedicated Interconnect points of presence. You would be using a Google Edge Location (you can connect into it directly, or via a Carrier), with a Google Peering Edge (PE) device to which your physical Router (CE) connects [you need to be in a supported location - Madrid is included]. This is not cheap, currently around $1700 per 10 Gbps link, 80 Gbps max!
  • Direct Peering/Carrier Peering, which Google does not charge for, but there is also no SLA. Peering is a private connection directly into Google Cloud. It's available in more locations than Dedicated Interconnect, and it can be done directly with Google (Direct Peering) if you can meet Google's direct peering requirements (it requires you to have a connection in a colocation facility, either directly or through a carrier-provided wave service), or via a Carrier (Carrier Peering).

And, as always, Google provides a Choice Chart if you're not sure which option is for you:

How do I transfer my data from my Data Center to GCP?

When transferring your content into the cloud, you would use the "gsutil" command line tool, and keep in mind:
  • Parallel uploads (-o, plus you need to set the parameters) break up larger files into pieces for faster uploads.
  • Multi-threaded uploads (-m) are for large numbers of smaller files. If you have a bunch of small files, you should group them together and compress them.
  • You can add multiple Cloud VPNs to reduce the transfer time.
  • gsutil will by default occupy the entire bandwidth. There are tools to optimize this. When a transfer fails, gsutil will retry by default.
  • For ongoing automated transfers, use a cron job.
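
As an illustration of the two upload styles, here is a small Python helper that only builds the gsutil command line (it doesn't run anything). Treat it as a sketch: the function name, the bucket name and the 150M threshold are my own picks, and the -o value shown is the parallel composite upload setting as I remember it from the gsutil docs, so double-check it against your gsutil version.

```python
def gsutil_cp_command(source, bucket, many_small_files=False,
                      composite_threshold=None, recursive=False):
    """Build an argv list for a gsutil upload (hypothetical helper)."""
    cmd = ["gsutil"]
    if composite_threshold:
        # Parallel upload: files above the threshold are split into pieces.
        cmd += ["-o", f"GSUtil:parallel_composite_upload_threshold={composite_threshold}"]
    if many_small_files:
        # Multi-threaded upload for large numbers of small files.
        cmd.append("-m")
    cmd.append("cp")
    if recursive:
        cmd.append("-r")
    cmd += [source, f"gs://{bucket}/"]
    return cmd

# Hypothetical examples: one big file, and a directory of small files.
print(" ".join(gsutil_cp_command("backup.tar", "my-bucket", composite_threshold="150M")))
print(" ".join(gsutil_cp_command("logs/", "my-bucket", many_small_files=True, recursive=True)))
```

You could pass the returned list straight to subprocess.run, or print it and put the resulting line into the cron job mentioned above.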

Google Transfer Appliance is a new thing, probably not in the exam. It allows you to copy all your data onto an appliance and ship it to Google, and they will load it into the Cloud for you.

Load Balancing in GCP

Load Balancing is one of the most important parts of Google Cloud, because it enables the Elasticity that is much needed in the cloud, by providing Auto Scaling for the Managed Instance Groups.

Keep in mind that the Load Balancing services for GCE and GKE work in different ways, but basically they achieve the same thing - Auto Scaling. Here is how this works:
  • In GCE there is a managed group of instances generated from the same template (Managed Instance Group). By enabling a Load Balancing service, you get a Global URL for your Instance Group, which includes the Health Check service launched from the Balancer to the Instances. This health check is the base trigger of the Auto Scaling.
  • In GKE you'd have a Kubernetes Cluster, and the entire Elastic operation of your containers is one of the signature functionalities of the Kubernetes Cluster, so you don't have to worry about configuring any of this manually.

Let's get deeper into the types of the Load Balancing (LB) service in GCP. Always keep the ISO-OSI model in mind: if you can provide the LB service at a higher layer, go for it! This means that if you can do HTTPS Balancing, go for that rather than SSL. If you can't go HTTPS, go for SSL. If your traffic is not encrypted, sure, go for TCP. Only if NONE of this works for you should you settle for the simple Network LB Service.

IMPORTANT: Whenever you are using one of the encrypted LB Services (HTTPS, SSL/TLS), the Encryption terminates on the Load Balancer, and the Load Balancer then establishes a separate encrypted tunnel to each of the Active Instances.

There are 2 types of Load Balancing on GCP:
  1. EXTERNAL Load Balancing, for access from the OUTSIDE (Internet)
    1. GLOBAL Load Balancing:
      • HTTP/HTTPS Load Balancing
      • SSL Proxy Load Balancing
      • TCP Proxy Load Balancing
    2. REGIONAL Load Balancing:
      • Network Load Balancer (notice that the Network Load Balancer is NOT Global, only available in a single region)
  2. INTERNAL Load Balancing, for inter-tier access (example - web servers accessing Data Bases)
Google Cloud Platform (GCP) - How do I choose among the Storage and Big Data options?

Storage options are extremely important when using GCP, both performance and price wise. I will take a bit of a non-standard approach for this post. I will first cover the potential use cases, explain the Hadoop/Standard DB you would use in each case, and then the GCP option for the same use case. Once that part is done, I will go a bit deeper into each of the GCP Storage and Big Data technologies. This post will therefore have 2 parts, and an "added value" annex:
  1. Which option fits to my use case?
  2. Technical details on GCP Storage and Big Data technologies
  3. Added Value: Object Versioning and Lifecycle Management

1. Which option fits to my use case?

Before we get into the use cases, let's make sure we understand the layers of abstraction of Storage. Block Storage is the typical raw storage used by applications: data stored in cylinders, UNSTRUCTURED DATA WITH NO ABSTRACTION. When you can refer to data using a physical address, you're using Block Storage. You would normally need some abstraction to use the storage, because it would be rather difficult to reference your data by blocks. File Storage is a possible abstraction, and it means you are referring to data using a logical address. In order to do this, we need some kind of layer on top of our blocks, an intelligence that makes sure that the blocks underneath are properly organized and stored on the disks, so that we don't get corrupt data.

Let's now focus on the use cases, and a single question - what kind of data do you need to store?

If you're using Mobile, then you will be using slightly different data structures:

Let's now get a bit deeper into each of the Use Cases, and see what Google Cloud can offer.
  1. If you need Block Storage for your compute VMs/instances, you would obviously be using Google's IaaS option called Compute Engine (GCE), and you would create the Disks using:
    • Persistent disks (Standard or SSD)
    • Local SSD
  2. If you need to store unstructured data, or "Blobs", as Azure calls it, such as Video, Images and similar Multimedia Files - what you need is Cloud Storage.
  3. If you need your BI guys to access your Big Data using an SQL-like interface, you'll use BigQuery, a Hive-like Google product. This applies to cases 3 (SQL interface required), and 7 (OLAP/Data Warehouse).
  4. To store NoSQL Documents like HTML/XML, that have a characteristic pattern, you should use DataStore.
  5. For columnar NoSQL data that requires fast scanning, use BigTable (GCP equivalent of HBase).
  6. For Transactional Processing, or OLTP, you should use Cloud SQL (if you prefer open source) or Cloud Spanner (if you need less latency, and horizontal scaling).
  7. Same as 3.
  8. Cloud Storage for Firebase is great for Security when you are doing Mobile.
  9. Firebase Realtime DB is great for fast random access with the mobile SDK. This is a NoSQL database, and it remains available even when you're offline.
2. Technical details on GCP Storage and Big Data technologies

Storage - Google Cloud Storage

Google Cloud Storage is organized in BUCKETS, which are identified by globally unique NAMES, more or less like DNS names. Buckets are STANDALONE, not tied to any Compute or other resources.

TIP: If you want to use Cloud Storage with a web site, keep in mind that you need a Domain Verification (adding a meta-tag, uploading a special HTML file or directly via the Search Console).

There are 4 Bucket Storage Classes. You need to be really careful to choose the optimal Class for your Use Case, because the classes designed for infrequent access are the ones where you'll be charged per access. You CAN CHANGE a Bucket's Storage Class. The files stored in the Bucket are called OBJECTS. An Object can have a Class which is the same as or "lower" than the Bucket's, and if you change the Bucket's storage class, the existing Objects will retain their storage class. The Bucket Storage Classes are:
  • Multi-regional, for frequent access from anywhere around the world. It's used for "Hot Objects", such as Web Content, it has a 99.95% availability, and it's Geo-redundant. It's pretty expensive, $0.026/GB/month.
  • Regional, for frequent access from one region, with 99.9% availability, appropriate for storing data used by Compute Engine instances. The Regional class offers better performance for data-intensive computations than Multi-regional.
  • Nearline - access once a month at most, with 99% availability, costing $0.01/GB/month with a 30 day minimum duration, but it's got ACCESS CHARGES. It can be used for data Backup, DR or similar.
  • Coldline - access once a year at most, with the same throughput and latency, for $0.007/GB/month with a 90 day minimum duration. You would be able to retrieve your backup super fast, but you would get a bit higher bill. At least your business wouldn't suffer.
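
To get a feeling for how the minimum duration affects the bill, here is a quick back-of-the-envelope calculator. It uses only the per-GB prices quoted above (the Regional price isn't listed in this post, so it's left out). I'm assuming a 30-day month and that data deleted early is billed as if it had stayed for the minimum duration. Access and retrieval charges are NOT included.

```python
# USD per GB per month, as quoted in the list above.
PRICE_PER_GB_MONTH = {"multi_regional": 0.026, "nearline": 0.010, "coldline": 0.007}
# Minimum storage duration in days (classes not listed have none).
MIN_DAYS = {"nearline": 30, "coldline": 90}

def storage_cost(storage_class, gb, days_stored):
    """Rough at-rest storage cost in USD, assuming a 30-day month."""
    billed_days = max(days_stored, MIN_DAYS.get(storage_class, 0))
    return round(gb * PRICE_PER_GB_MONTH[storage_class] * billed_days / 30, 2)

# 1 TB kept for only 10 days: Coldline still bills the full 90 days.
print(storage_cost("multi_regional", 1000, 10))
print(storage_cost("coldline", 1000, 10))
```

So for short-lived data the "cheap" class can end up costing more than the "expensive" one, which is exactly why the choice of class matters.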

We can get data IN and OUT of Cloud Storage using:
  • XML and JSON APIs
  • Command Line (gsutil - a command line tool for storage manipulation)
  • GCP Console (web)
  • Client SDK

You can use TRANSFER SERVICE to get your data INTO Cloud Storage (not out!) from AWS S3, http/https sources, etc. Basically you would use:
  • gsutil when copying files for the first time from on premise.
  • Transfer Service when transferring from AWS etc.

Cloud Storage is not like Hadoop in the architecture sense, mostly because an HDFS architecture requires a Name Node, which you need to access A LOT, and this would increase your bill. You can read more about Hadoop and its Ecosystem in my previous post, here.

When should I use it?

When you want to store UNSTRUCTURED data.

Storage - Cloud SQL and Google Spanner

These are both relational databases, for highly structured data. Cloud Spanner offers ACID++, meaning it's perfect for OLTP. It would, however, be too slow, with too many checks, for Analytics/BI (OLAP), because OLTP needs strict write consistency and OLAP does not. Cloud Spanner is Google proprietary, and it offers horizontal scaling for bigger data sets.

*ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties of database transactions intended to guarantee validity even in the event of errors, power failures, etc.
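
If the "A" in ACID sounds abstract, here is a tiny demo of atomicity. I'm using Python's built-in sqlite3 as a stand-in for any relational database; this is NOT the Cloud SQL or Cloud Spanner API, and the table and account names are made up. The point is only that a failed transfer leaves no half-finished update behind.

```python
import sqlite3

def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = open_bank()
transfer(conn, "alice", "bob", 30)
try:
    transfer(conn, "alice", "bob", 500)   # fails halfway through
except ValueError:
    pass
print(balances(conn))   # the failed transfer left no trace
```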

When should I use it?

OLTP (Transactional) Applications.

Storage - BigTable (HBase equivalent)

BigTable is used for FAST scanning of SEQUENTIAL key values with LOW latency (unlike Datastore, which would be used for non-sequential data). Bigtable is a columnar database, good for sparse data (meaning - missing fields in the table), because similar data is stored next to each other. ACID properties apply only on the ROW level.

What is a columnar Data Base? Unlike an RDBMS, it is not normalised, and it is perfect for Sparse data (tables with a bunch of missing values): the Columns are converted into rows in the Columnar data store, and the Null value columns are simply not converted. Easy. Columnar DBs are also great for data structures with Dynamic Attributes, because we can add new columns without changing the schema.
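
Here is a toy illustration of that "columns become rows, nulls are skipped" idea in plain Python. It is only a sketch of the storage layout, not the Bigtable client API, and the row keys and fields are invented.

```python
def to_cells(row_key, record):
    """Flatten one record into (row_key, column, value) cells, skipping nulls."""
    return [(row_key, col, val) for col, val in record.items() if val is not None]

# Two sparse records; the second one has an attribute the first one lacks.
users = {
    "user#001": {"name": "Ana", "phone": None, "city": "Madrid"},
    "user#002": {"name": "Bo", "phone": "555-0100", "city": None, "twitter": "@bo"},
}

cells = []
for key, record in users.items():
    cells.extend(to_cells(key, record))

for cell in cells:
    print(cell)
```

Note that the missing phone and city take no space at all, and the extra "twitter" attribute needed no schema change.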

Bigtable is sensitive to hot spotting.

When should I use it?

Low Latency, SEQUENTIAL data.

Storage - Cloud Datastore (has similarities to MongoDB)

This is a much simpler data store than BigTable, similar to MongoDB and CouchDB. It's a key-value structure for structured data, designed to store documents, and it should not be used for OLTP or OLAP but instead for fast lookup on keys (needle in the haystack type of situation, lookup for non-sequential keys). Datastore is similar to an RDBMS in that they both use indices for fast lookups. The difference is that DataStore query execution time depends on the size of the returned result, not on the size of the dataset, so a query takes the same time whether you're querying a dataset of 10 rows or 10,000 rows.

IMPORTANT: Don’t use DataStore for Write intensive data, because the indices are fast to read, but slow to write.

When should I use it?

Low Latency, NON-SEQUENTIAL data (mostly Documents that need to be searched really quickly, like XML or HTML, which have characteristic patterns that Datastore INDEXES). It's perfect for SCALING of HIERARCHICAL documents with Key/Value data. Don't use DataStore if you're doing OLTP (Cloud Spanner is a better choice) or OLAP/Warehousing (BigQuery is a better choice). Don't use it for unstructured data (Cloud Storage is better here). It's good for Multi Tenancy (think of HTML, and how the schema can be used to separate data).

Big Data - Dataproc

Dataproc is a GCP managed Hadoop + Spark. Every machine in the Cluster includes Hadoop, Hive, Spark and Pig. You need at least 1 master and 2 workers, and the other workers can be Preemptible VMs. Dataproc uses Google Cloud Storage instead of HDFS, simply because the Hadoop Name Node would consume a lot of GCE resources.

When should I use it?

Dataproc allows you to move your existing Hadoop to the Cloud seamlessly.

Big Data - Dataflow

Dataflow is in charge of the transformation of data, similar to Apache Spark in the Hadoop ecosystem. Dataflow is based on Apache Beam, and it models the flow (PIPELINE) of data and transforms it as needed. A Transform takes one or more PCollections as input, and produces an output PCollection.

Apache Beam uses the I/O Source and Sink terminology, to represent the original data, and the data after the transformation.
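
To show the Source, Transform, Sink idea without installing anything, here is a plain-Python sketch. To be clear, this is NOT the Apache Beam API: plain lists stand in for the collections, and the function names and the sensor data are all made up for the example.

```python
def run_pipeline(source, transforms, sink):
    """Read from the source, apply each transform in order, write to the sink."""
    collection = list(source)                # stand-in for the input collection
    for transform in transforms:
        collection = transform(collection)   # each transform yields a new collection
    sink.extend(collection)
    return sink

def parse_csv(lines):
    """Turn 'sensor,celsius' text lines into dictionaries."""
    return [dict(zip(("sensor", "celsius"), line.split(","))) for line in lines]

def to_fahrenheit(rows):
    """Deliver the same readings in a different format and unit."""
    return [{"sensor": r["sensor"], "fahrenheit": float(r["celsius"]) * 9 / 5 + 32}
            for r in rows]

readings = ["a,0", "b,100"]   # hypothetical source data
print(run_pipeline(readings, [parse_csv, to_fahrenheit], []))
```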

When should I use it?

Whenever you have one data format on the Source, and you need to deliver it in a different format, as a Backend you would use something like Apache Spark or Dataflow.

Big Data - BigQuery

BigQuery is not designed for low latency use, but it is VERY fast compared to Hive. It's not as fast as Bigtable and Datastore, which are actually preferred for low latency. BigQuery is great for OLAP, but it cannot be used for transactional processing (OLTP).

When should I use it?

If you need a Data Warehouse, if your application is OLAP/BI, or if you require an SQL interface on top of Big Data.

Big Data - Pub/Sub

Pub/Sub (Publisher/Subscriber) is a messaging transport system. It can be defined as messaging Middleware. The subscribers subscribe to the TOPIC that the publisher publishes to. Once a subscriber has received a message, it sends an ACK to the "Subscription", and the message is deleted from the source. This message stream is called the QUEUE. Message = Data + Attributes (key value pairs). There are two types of subscribers:
  • PULL Subscriber, where the App makes HTTPS requests to the Pub/Sub service to fetch the messages
  • PUSH Subscriber, a Web Hook endpoint able to accept POST requests over HTTPS
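
Here is a toy in-memory version of the publish, receive and ACK cycle, just to show the mechanics. It is my own sketch and has nothing to do with the real Pub/Sub client library: the Topic and Subscription classes and their methods are invented for the example.

```python
from collections import deque

class Topic:
    def __init__(self):
        self.subscriptions = []

    def publish(self, data, **attributes):
        # Message = Data + Attributes; every subscription gets its own copy.
        for sub in self.subscriptions:
            sub.pending.append({"data": data, "attributes": dict(attributes)})

class Subscription:
    def __init__(self, topic):
        self.pending = deque()
        topic.subscriptions.append(self)

    def receive(self):
        """Return the oldest un-ACKed message (again and again until it is ACKed)."""
        return self.pending[0] if self.pending else None

    def ack(self):
        """Acknowledge the oldest message, which deletes it from this subscription."""
        self.pending.popleft()

orders = Topic()
billing = Subscription(orders)
shipping = Subscription(orders)
orders.publish("order-1", kind="order")

print(billing.receive())    # delivered
billing.ack()               # deleted for billing only
print(billing.receive())    # nothing left for billing
print(shipping.receive())   # shipping still has its copy
```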

When should I use it?

Perfect for applications such as Order Processing, Event Notifications, Logging to multiple systems, or maybe Streaming data from various Sensors (typical for IoT).

Big Data - Datalab

Datalab is an environment where you can execute notebooks. It's basically Jupyter (IPython) notebooks for running code. Notebooks are better than text files for Code, because they include Code, Documentation (markdown) and Results. Notebooks are stored in Google Cloud Storage.

When should I use it?

When you want to use Notebooks for your code.

Need some help choosing?

If it's still not clear which is the best option for you, Google also made a complete Decision Tree, exactly like in the case of "Compute".

3. Added Value: Object Versioning and Lifecycle Management

Object Versioning

By default in Google Cloud Storage, if you delete or overwrite a file in a Bucket, the older file is gone, and you can't get it back. When you ENABLE Object Versioning on a Bucket (it can only be enabled per bucket), the previous versions are ARCHIVED, and can be RETRIEVED later.

When versioning is enabled, you can perform different actions, for example - use an older file and override the LIVE version, or similar.

Object Lifecycle Management

To keep the archived versions from creating chaos at some point in time, it's advisable to implement some kind of Lifecycle Management. The previous versions of the file maintain their own ACL permissions, which may be different from those of the LIVE one.

Object Lifecycle Management works like a TTL for your objects. You create RULES, and each rule combines CONDITIONS with an ACTION. This can get quite granular, because you have:
  • Conditions, the criteria that must be met before the action is taken. These are: Object age, Date of Creation, whether it's currently LIVE, matching a Storage Class, and Number of Newer Versions.
  • Rules, which tie one or more conditions to an action.
  • Actions: you can DELETE the object or set another Storage Class.

This way you can get pretty imaginative, and for example delete all objects older than 1 year, or perhaps, if a Rule is triggered and its conditions are met, change the Class of the Object from, for example, Regional to Nearline etc.
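
Those two example rules could look roughly like this as a lifecycle configuration. This is a sketch: the field names follow the Cloud Storage lifecycle JSON format as I remember it, and the 30-day age for the class change is a number I picked, so check both against the official docs before using it.

```python
import json

def lifecycle_config():
    """Two example rules: delete after 1 year, move Regional objects to Nearline."""
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": 365},   # age is in days
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                # 30 days is an assumed value, chosen only for the example.
                "condition": {"age": 30, "matchesStorageClass": ["REGIONAL"]},
            },
        ]
    }

print(json.dumps(lifecycle_config(), indent=2))
```

You would save the printed JSON into a file and apply it to a bucket with something like "gsutil lifecycle set lifecycle.json gs://your-bucket" (file and bucket names are placeholders).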
