title: WK 1 Modernizing Data Lakes and Data Warehouses with GCP
created: 2021-09-18 17:26:31Z
updated: 2021-09-18 21:24:35Z

Explore the role of a data engineer

5908a4d876bce77f6cc35e42ca230be3.png

73822f556381b2d599822c18a4e09d83.png

The point of a data pipeline is to enable data-driven decisions. A data lake brings together data from across the enterprise into a single location. The purpose of a data lake is to make data accessible for analytics.

0532ef04db4483bf2be14245b2c02b4b.png Cloud Storage is blob storage, so you may need to think about the granularity of what you store.

A Cloud Storage bucket is a good option for staging all of your raw data in one place before building transformation pipelines into your data warehouse.

36ea2fef8a287ca026801e2d45632c20.png Because of Google's many data center locations and high network availability, storing data in a GCS bucket is durable and performant. As a data engineer, you will often use a Cloud Storage bucket as part of your data lake to store many different types of raw data files: CSV, JSON, Avro, Parquet, etc. You can then load them into, or query them directly from, BigQuery, which is a data warehouse. Other Google Cloud Platform products and services can easily query and integrate with your bucket once you've set it up and loaded it with data.
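
To make that staging pattern concrete, here is a minimal sketch (not from the course) that loads a CSV file from a Cloud Storage bucket into a BigQuery table with the Python client library; the project, bucket, and table names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# Hypothetical destination table: project.dataset.table
table_id = "my-project.staging.raw_events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema from the file
)

# Load the raw file straight out of the data lake (the GCS bucket).
load_job = client.load_table_from_uri(
    "gs://my-raw-data-bucket/events/2021-09-18.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```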

Common challenges encountered by data engineers:

  • Access to data. A typical problem: data is scattered across different locations (e.g. departments with their own systems), so you need to know how to combine that data. 0745b249204a6a1ab9b5aaafaa18d53b.png
  • Data accuracy and quality. Cleaning, formatting, and getting the data ready for insights requires building extract-transform-load (ETL) pipelines. ETL pipelines are usually necessary to ensure data accuracy and quality. The cleaned and transformed data is typically stored not in a data lake but in a data warehouse. Like a data lake, a data warehouse is a consolidated place, but here the data is all easily joinable and queryable. Unlike a data lake, where the data is kept in a raw format, the data in a data warehouse is stored in a way that makes it very efficient to query.
  • Availability of computational resources. The compute needed by any specific ETL job is not constant over time. This means that when traffic is low, you waste money because you have machines sitting idle, and when traffic is high, those machines are so busy that your jobs take far too long.
  • Query performance. Once your data is in your data warehouse, you need to optimize the queries your users are running to make the most efficient use of your compute resources.

BigQuery

BigQuery is Google Cloud's petabyte-scale serverless data warehouse. Datasets are collections of tables that can be divided along business lines or a given analytical domain. Each dataset is tied to a GCP project. 67abe953280b3a8006e5a19d4e094040.png

Cloud Identity and Access Management (Cloud IAM) is used to grant permission to perform specific actions in BigQuery. This replaces the SQL GRANT and REVOKE statements that are used to manage access permissions in traditional SQL databases. 362653ef1df287f348947de381643f48.png

BigQuery allocates storage and query resources dynamically based on your usage patterns. Storage resources are allocated as you consume them and deallocated as you remove data or drop tables. Query resources are allocated according to query type and complexity. Each query uses some number of slots: units of computation that comprise a certain amount of CPU and RAM.
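
To illustrate datasets and on-demand query resources, here is a small sketch using the BigQuery Python client. It assumes a hypothetical project ID and dataset name and uses the public `bigquery-public-data.usa_names.usa_1910_current` table; the dry run shows how much data a query would scan before any slots are spent.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# A dataset is a container for tables, tied to a project.
client.create_dataset("my-project.analytics", exists_ok=True)

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# A dry run reports the bytes the query would process without
# actually running it (and therefore without consuming slots).
dry_run = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)
print(f"Query would process {dry_run.total_bytes_processed} bytes")

# Run it for real; BigQuery allocates slots for the job automatically.
for row in client.query(sql).result():
    print(row.name, row.total)
```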

Data Lakes and Data Warehouses

f5d936aaf30e3634df936c86cd99003e.png

Considerations when choosing a data warehouse:

  • The data warehouse is going to serve as a sink. Will it be fed by a batch pipeline or by a streaming pipeline? Does it need to be up-to-the-minute correct, or is it enough to load data into it once a day or once a week?
  • Will the data warehouse scale to meet my needs?
  • How is the data organized? Is it cataloged? Is it access controlled? Will you be able to share access to the data with all your stakeholders? Who will pay for the querying?
  • Is the warehouse designed for performance? Carefully consider concurrent query performance, and whether that performance comes out of the box or whether you end up having to create indexes and tune the data warehouse.
  • What level of maintenance is required by your engineering team?

7b7de3102acdf026a8f7d05aba8e7217.png BigQuery provides mechanisms for automated data transfer and powers business applications using technologies like SQL that teams already know and use.

Another option

That is, treat BigQuery as just a query engine and allow it to query the data in the data lake in place. For example, you can use BigQuery to directly query data in Cloud SQL, that is, managed relational databases like PostgreSQL, MySQL, and SQL Server. You can also use BigQuery to directly query files in Cloud Storage, as long as those files are in formats like CSV or Parquet. 920bc1ebbbe823a40736ed122e695036.png The real power comes when you can leave your data in place and still join it against other data in the data warehouse.
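
A rough sketch of that join-in-place idea with the BigQuery Python client: it defines a temporary external table over CSV files sitting in Cloud Storage and joins it against a table already in the warehouse. The bucket, table, and column names are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe the files in the data lake as an external (federated) table.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-raw-data-bucket/orders/*.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"orders_raw": external_config}
)

# Join data left in place in Cloud Storage against a warehouse table.
sql = """
    SELECT c.customer_name, COUNT(*) AS order_count
    FROM orders_raw AS o
    JOIN `my-project.analytics.customers` AS c
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
"""

for row in client.query(sql, job_config=job_config).result():
    print(row.customer_name, row.order_count)
```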

Transactional Databases vs Data Warehouses

  • Cloud SQL backs the transactional database systems that support your company's applications; it is optimized to be a database for transactions. 703c231dfc0b7336315963917d74b038.png
  • Data warehouses support your analytic workloads and are optimized for reads.

94a60b4329d9d14e300dd2f806af1521.png

The data lake is designed for durability and high availability.

How do you provide access to the data warehouse while keeping to data governance best practices?

The three most common clients are:

  1. Machine Learning engineers
    • How long does it take for a transaction to make it from raw data all the way into the data warehouse, so that it is available at prediction time?
    • How difficult would it be to add more columns or more rows of data to certain datasets?
    • They need their datasets to be easily discoverable, documented, and available to experiment on quickly.
  2. Data or business analysts
    • They rely on good, clean data so that they can query it for insights and build dashboards.
    • They need datasets with clearly defined schema definitions, the ability to quickly review rows, and the performance to scale to many concurrent dashboard users. 8bc3cae2d26662dcf3d103e60e365a3f.png
  3. Other data engineers
    • Will your data warehouse and pipelines always be available when they need them?

Google Cloud Stackdriver

  • monitoring tool
  • track resource use
  • create audit logs to see who has run what
  • trace usage of sensitive datasets
  • create alerts and send notifications

Manage data access and governance

This covers the overall governance of how data is to be used, and not used, by your users, including privacy and security. It means clearly communicating a data governance model for who should, and who should not, be able to access the data.

How will our end users discover the different datasets that we have available for analysis?

  1. One solution for data governance is Cloud Data Catalog. Data Catalog makes all the metadata about your datasets available for users to search. c7b0d8e8356eaef330da4a5ba9040df3.png
  2. The Data Loss Prevention (DLP) API helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction of sensitive data elements, like credit card numbers and names (see the sketch after this list).
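
As a sketch of how the DLP API can be called from Python (the project ID and sample text below are hypothetical), this inspects a string for names and credit card numbers.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

# Hypothetical project ID and sample text for illustration.
parent = "projects/my-project"
item = {"value": "Customer Jane Doe paid with card 4111 1111 1111 1111."}

inspect_config = {
    "info_types": [{"name": "PERSON_NAME"}, {"name": "CREDIT_CARD_NUMBER"}],
    "include_quote": True,  # return the matched text in each finding
}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)

for finding in response.result.findings:
    print(finding.info_type.name, finding.quote, finding.likelihood)
```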

Productionize the data process

The goal is an end-to-end, scalable data processing system. The data engineering team is responsible for the health of the plumbing, that is, the pipelines, and for ensuring that the data is available and up to date for analytics and ML workloads.

Common questions: 43d7aa7e1b2fb5703bc8874d6295cf1c.png

One common workflow orchestration tool used by enterprises is Apache Airflow, and Google Cloud has a fully managed version of Apache Airflow called Cloud Composer. Cloud Composer helps your data engineering team orchestrate the pieces of the data engineering puzzle discussed so far, and even more that we haven't come across yet. The power of this tool comes from the fact that GCP big data products and services have API endpoints that you can call. A Cloud Composer job can then run every night or every hour and kick off your entire pipeline, from raw data to the data lake and into the data warehouse, for you.
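
As a rough sketch of what such an orchestrated step looks like, here is a minimal Airflow DAG (the kind of code you would hand to Cloud Composer) that loads each day's raw CSV from a Cloud Storage bucket into a BigQuery table every night; the bucket, dataset, and table names are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="raw_to_warehouse",
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",  # kick off the pipeline every night
    catchup=False,
) as dag:
    # Hypothetical bucket and table; {{ ds }} is the run's date.
    load_raw_csv = GCSToBigQueryOperator(
        task_id="load_raw_csv",
        bucket="my-raw-data-bucket",
        source_objects=["events/{{ ds }}.csv"],
        destination_project_dataset_table="my-project.analytics.events",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )
```

A real pipeline would chain further tasks (transformations, quality checks, and so on) after this load step.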