---
title: WK2 Data Lakes
updated: 2021-09-19 15:15:03Z
created: 2021-09-19 10:18:20Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---

# Introduction to Data Lakes

![4bd6a4b0fe00fd4735c0efd1b01892a1.png](../_resources/4bd6a4b0fe00fd4735c0efd1b01892a1.png)

**Data sources**: the originating system or systems that produce all of your data

**Data sinks**: reliable ways of retrieving and storing that data. The first line of defense in an enterprise data environment is the **data lake**, which has to cope with a variety of formats, high volume, and high velocity.

**Data pipelines**: do the transformations and processing

**Orchestration layer**: coordinates work between the many different components at a regular or event-driven cadence (*Apache Airflow*)

It is important to first understand what you want to do, and only then pick the solution that best meets your needs.

# Data Storage and ETL options on GCP

![26d5943bae46c8a9bc9a4210eb56ed45.png](../_resources/26d5943bae46c8a9bc9a4210eb56ed45.png)

- Cloud SQL and Cloud Spanner for **relational data**
- Cloud Firestore and Cloud Bigtable for **NoSQL data**

The path the data takes depends on:
- where the data is coming from
- its volume
- where it has to go
- how much processing is needed before it can land in the sink

The method you use to **load the data** into the cloud depends on how much transformation the raw data needs. Cases:

- **EL** (Extract and Load): the data can be readily ingested as-is, e.g. Avro files. Also consider federated queries, which leave the data where it already is.
- **ELT** (Extract, Load, Transform): the data is not in the right form to load into the sink, but the volume is not big, e.g. use SQL to do the transformation: select from the source table and insert into the destination table.
- **ETL** (Extract, Transform, Load): the transformation is essential, or it significantly reduces the volume before the data is imported into the cloud.
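To make the EL and ELT cases concrete, here is a minimal sketch (assuming the `google-cloud-bigquery` Python client and default credentials): an Avro file is loaded as-is from Cloud Storage, then reshaped with SQL inside BigQuery. The bucket, dataset, table, and column names are all hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Extract and Load: ingest an Avro file from Cloud Storage as-is.
load_job = client.load_table_from_uri(
    "gs://my-bucket/raw/events.avro",  # hypothetical source object
    "my_dataset.raw_events",           # hypothetical destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO
    ),
)
load_job.result()  # block until the load job finishes

# The T of ELT: transform with SQL after loading, selecting from the
# source table and inserting into a (pre-existing) destination table.
client.query(
    """
    INSERT INTO my_dataset.clean_events (user_id, event_ts)
    SELECT user_id, TIMESTAMP_MILLIS(event_ms)
    FROM my_dataset.raw_events
    WHERE user_id IS NOT NULL
    """
).result()
```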
# Building a Data Lake using Cloud Storage

Google Cloud Storage:
- strongly persistent
- can be shared globally
- encrypted
- controlled and private if needed
- moderate latency and high throughput
- relatively inexpensive
- an object store: it stores binary objects regardless of what data those objects contain
- to some extent file-system compatible (you can copy objects in and out as if they were files)

Use cases:
- archive data
- save the state of an application before shutting down an instance

![cbffc74d82c2acc542253f2a050fa649.png](../_resources/cbffc74d82c2acc542253f2a050fa649.png)

The two main entities in Cloud Storage are **buckets** and **objects**.

- Buckets are containers which hold objects
    - identified in a single, globally unique namespace (no one else can use that name until the bucket is deleted and the name is released)
    - associated with a particular region or with multiple regions
    - for a single-region bucket, objects are replicated across zones within that one region (low latency)
    - multiple requesters can retrieve the objects at the same time from different replicas (high throughput)
- Objects exist inside those buckets and not apart from them
    - when an object is stored, Cloud Storage replicates it, monitors the replicas, and if one is lost or corrupted replaces it automatically with a fresh copy (high durability)
    - stored with metadata, which is used for access control, compression, encryption, and lifecycle management of those objects and buckets

![d9fd10fbec213e8ae877a457a76bfe82.png](../_resources/d9fd10fbec213e8ae877a457a76bfe82.png)

When creating a bucket you decide:
1. The location of the bucket. The location is set when the bucket is created and can never be changed.
2. Whether the bucket is single-region, dual-region, or multi-region. If you select a single region, the data is replicated to multiple zones within that region.
3. How often you will access or change the data, which determines the **[storage class](https://cloud.google.com/storage/docs/storage-classes)**: e.g. archival storage for backups or disaster recovery.

Cloud Storage uses the bucket name and the object name to simulate a file system.

![d74445c2bc5cf347c94f572c4787c354.png](../_resources/d74445c2bc5cf347c94f572c4787c354.png)

In the example: the bucket name is `declass`, the object name is `de/modules/O2/script.sh`, and the forward slashes are just characters in the name.

A best practice is to avoid sensitive information in bucket names, because bucket names live in a global namespace.

![4e05321688bdc9bd53da7809ff1e4961.png](../_resources/4e05321688bdc9bd53da7809ff1e4961.png)

# Securing Cloud Storage

![5b5bd4d599ad5e479af873c9f714ee7b.png](../_resources/5b5bd4d599ad5e479af873c9f714ee7b.png)

1. **IAM** is set at the bucket level.
    - provides project roles and bucket roles:
        - bucket reader
        - bucket writer
        - bucket owner
    - The ability to create and delete buckets and to set IAM policy is a **project-level role**. The ability to create or change access control lists is an **IAM bucket role**. **Custom roles** are also available.
2. Access control lists (**ACLs**) can be applied at the bucket level or to individual objects, so they provide more fine-grained access control. ACLs are currently enabled by default.

All data in Google Cloud is **encrypted at rest and in transit**, and there is no way to turn the encryption off.

![ec27c52f379456fd1ccdcd586b69c7de.png](../_resources/ec27c52f379456fd1ccdcd586b69c7de.png)

Which data encryption option you use generally depends on your business, legal, and regulatory requirements. There are two levels of encryption: the data is encrypted using a data encryption key, and that data encryption key is itself encrypted using a key encryption key, or **KEK**. The KEKs are automatically rotated on a schedule, with the current KEK stored in Cloud KMS, the Key Management Service.

![1798e9a44636860cd20db3b66ede4949.png](../_resources/1798e9a44636860cd20db3b66ede4949.png)

The fourth encryption option is client-side encryption: you encrypt the data before it is uploaded, and you decrypt it yourself before it is used. Cloud Storage still performs GMEK, CMEK, or CSEK encryption on the object.

**Data locking** is different from encryption. Where encryption prevents somebody from understanding the data, locking prevents them from modifying it.
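As a minimal sketch of the bucket/object model and of the CMEK option (assuming the `google-cloud-storage` Python client, default credentials, and an already-existing bucket; the bucket, object, and KMS key names are hypothetical):

```python
from google.cloud import storage

client = storage.Client()  # assumes default project and credentials

# The bucket name plus the object name simulate a file system; the
# slashes in the object name are just characters, not real directories.
bucket = client.bucket("declass")  # assumes this bucket already exists
blob = bucket.blob("de/modules/O2/script.sh")
blob.upload_from_filename("script.sh")  # default: Google-managed keys (GMEK)

# CMEK: encrypt this object with a key you manage in Cloud KMS.
kms_key = "projects/my-proj/locations/us/keyRings/my-ring/cryptoKeys/my-key"
cmek_blob = bucket.blob("de/modules/O2/secrets.cfg", kms_key_name=kms_key)
cmek_blob.upload_from_filename("secrets.cfg")
```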
# Storing All Sorts of Data Types

Cloud Storage is not the right choice for transactional workloads or for analytics on structured data.

![0c0746feb4d1a4cb7802f32da7da5a43.png](../_resources/0c0746feb4d1a4cb7802f32da7da5a43.png)

![ca233fcb9a340843cc102e395c3b0a38.png](../_resources/ca233fcb9a340843cc102e395c3b0a38.png)

Online Transaction Processing, or **OLTP**, versus Online Analytical Processing, or **OLAP**.

![396364a8c39247681880d65a1f9b9a8b.png](../_resources/396364a8c39247681880d65a1f9b9a8b.png)

# Storing Relational Data in the Cloud

**Cloud SQL**:
- managed service for third-party RDBMSs (MySQL, SQL Server, PostgreSQL)
- cost-effective
- the default choice for OLTP workloads
- fully managed

![da9fb26de9169d4567f32995e6380343.png](../_resources/da9fb26de9169d4567f32995e6380343.png)

**Cloud Spanner**: consider it when
- you need a globally distributed database, e.g. updates coming from applications running in different geographic regions
- the database is too big to fit in a single Cloud SQL instance

**Cloud Bigtable**: consider it when you need very high-throughput inserts (more than a million rows per second) or very low latency, on the order of milliseconds.

**Difference between fully managed and serverless:**
Fully managed means the service runs on hardware that you can control; Dataproc is fully managed. A serverless product is just an API that you call; BigQuery and Cloud Storage are serverless.

![1863886204ff15acbf0544007ff3c91a.png](../_resources/1863886204ff15acbf0544007ff3c91a.png)
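The serverless half of that contrast is easy to see in code: with BigQuery there is no cluster to create or size, querying is just an API call. A minimal sketch against a public dataset (assuming the `google-cloud-bigquery` Python client and default credentials):

```python
from google.cloud import bigquery

client = bigquery.Client()  # no cluster to provision: just call the API

# BigQuery allocates whatever resources the query needs behind the
# scenes; the caller never sees or manages servers.
rows = client.query(
    """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
    """
).result()

for row in rows:
    print(row.name, row.total)
```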