Wk2 Big Data and Machine Learning Fundamentals

9948507e5860dd40b34a9f10e6b370c2.png

Message-oriented architectures with Pub/Sub

Distributed Messages

  • Streaming data from various devices
    • issues: bad data, delayed data, no data
  • Distributing event notifications (ex: new user sign up)
    • other services can subscribe to the new messages that we publish
  • Scalable to handle volumes
    • needs to handle an arbitrarily high volume of data so that no incoming messages are lost
  • Reliable (no duplicates)
    • We need all the messages and also a way to remove any duplicates if found

Pub/Sub is a distributed messaging service that can receive messages from a variety of different streams: upstream data systems like gaming events, IoT devices, application streams, and more. Pub/Sub will scale to meet that demand.

  • Ensures at-least-once delivery of messages and passes them to subscribing applications
  • No provisioning, auto-everything
  • Open APIs
  • Global by default
  • End-to-end encryption

51c2fc2521e7435aba198a5c23584999.png Upstream data starts on the left and comes in from devices all around the globe. It is ingested into Cloud Pub/Sub as the first point of contact with our system. Cloud Pub/Sub reads, stores, and then publishes the messages out to any subscribers of that particular topic. Cloud Dataflow, as a subscriber to this Pub/Sub topic, will ingest and transform those messages in an elastic streaming pipeline. If you're doing analytics, one common data sink is Google BigQuery.

Architecture of Pub/Sub (like Kafka)

A central piece of Pub/Sub is the topic. There can be zero, one, or many publishers, and zero, one, or many subscribers relating to any given Pub/Sub topic, and they are completely decoupled from each other.
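A minimal sketch of a publisher and a subscriber using the google-cloud-pubsub Python client; the project, topic, and subscription names are placeholders, not from the course:

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"            # placeholder
topic_id = "game-events"             # placeholder
subscription_id = "game-events-sub"  # placeholder

# Publisher: send a message to the topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"new user sign up", user_id="42")
print("Published message id:", future.result())

# Subscriber: receive messages from a subscription on that topic
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    print("Received:", message.data, message.attributes)
    message.ack()  # acknowledge so Pub/Sub does not redeliver

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds
except TimeoutError:
    streaming_pull.cancel()
```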

Designing streaming pipelines with Apache Beam

Questions to consider for both the design of the actual pipeline in code and the implementation and serving of that pipeline at scale in production:

  • Is the code compatible with both batch and streaming data? Yes
  • Does the pipeline code SDK support the transformations I need to do? Likely
  • Does it have the ability to handle late data coming into the pipeline? (see the windowing sketch after this list)
  • Are there any existing templates or solutions that we can leverage to quickly get us started? Choose from templates
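A minimal sketch of how late-arriving data can be handled in the Beam Python SDK, using fixed windows with an allowed-lateness setting; the events, window size, and lateness values are illustrative assumptions, not from the course:

```python
import time

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as p:
    (p
     | "CreateEvents" >> beam.Create([("user1", 1), ("user2", 1), ("user1", 1)])
     # Attach event-time timestamps so windowing has something to work with
     | "AddTimestamps" >> beam.Map(
           lambda kv: window.TimestampedValue(kv, time.time()))
     # 60-second windows; data arriving up to 10 minutes late is still accepted
     | "Window" >> beam.WindowInto(
           window.FixedWindows(60),
           trigger=AfterWatermark(),
           accumulation_mode=AccumulationMode.DISCARDING,
           allowed_lateness=600)
     | "CountPerUser" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```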

What is Apache Beam?

Apache Beam is a portable data processing programming model.

  • It's extensible (write and share new SDKs, IO connectors and transformation libraries) and open source
  • Can be run in a highly distributed fashion
  • It's unified: use a single programming model for both batch and streaming use cases.
  • Portable: execute pipelines on multiple execution environments. No vendor lock-in.
  • You can browse and write your own connectors
  • Build transformation libraries too if needed.
  • Apache Beam pipelines are written in Java, Python or Go.
  • The SDK provides a host of libraries for transformations and existing data connectors to sources and sinks. 5151c1c2b55de604999b2b6f38822b08.png
  • Apache Beam creates a model representation of your code, which is portable across many runners. Runners pass your model off to an execution environment, which could be one of many different possible engines; Cloud Dataflow is one of the popular choices for running Apache Beam.

Example of a pipeline a6d9be269ce6bc83c9f5d824ef667013.png Transformations can be done in parallel, which is how you get that truly elastic pipeline. You can get input from many different and even multiple sources concurrently, then you can write output to many different sinks, and the pipeline code remains the same.
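A minimal sketch of such a pipeline in the Beam Python SDK, reading from Pub/Sub and writing to BigQuery as in the architecture above; the subscription, table spec, and schema are placeholder assumptions, not from the course:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

with beam.Pipeline(options=options) as p:
    (p
     # Source: a Pub/Sub subscription (placeholder path)
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/game-events-sub")
     # Transform: parse each message; this step stays the same even if the
     # source is swapped for ReadFromText or another connector
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     # Sink: a BigQuery table (placeholder table spec and schema)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:analytics.events",
           schema="user:STRING,event:STRING,ts:TIMESTAMP",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```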

Implementing streaming pipelines on Dataflow

1238eb4c4d458a3679c68dff091d8e0c.png

5ea764eaeeb247f7b43501c4d6e11653.png

Many Hadoop workloads can be done easily and more maintainably with Dataflow. Plus, Dataflow is serverless and designed to be NoOps.

What do we mean by serverless? It means that Google will manage all the infrastructure tasks for you, like resource provisioning and performance tuning, as well as ensuring that your pipeline is reliable. 2fa4ce2967893077a569d372aed8a3ff.png
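A minimal sketch of submitting a Beam pipeline to the Dataflow service, where the only change is the pipeline options; the project, region, bucket, and file paths are placeholders, not from the course:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All values below are placeholders; once the job is submitted,
# Dataflow provisions, autoscales, and tunes the workers itself.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    job_name="wordcount-example",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Count" >> beam.combiners.Count.PerElement()
     | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/wordcount"))
```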

Source for Google Dataflow templates:

Recap: 87df2df0f990da8682065eeb7f8381a9.png QPS: Queries Per Second

Visualizing insights with Data Studio

The first thing you need to do is tell Data Studio the data source. A Data Studio report can have any number of data sources, and the data source picker shows all the data sources that you have access to.

Other people who can view the report can potentially see all the data in that data source if you share that data source with them. Warning: anyone who can edit the report can also use all the fields from any added data sources to create new charts with them.

Creating charts with Data Studio

  • Dimension chips are green. Dimensions are things like categories or buckets of information. Dimension values could be things like names, descriptions, or other characteristics of a category.
  • Metric chips are blue. Metrics measure dimension values. Metrics represent measurements or aggregations such as a sum (x plus y), a count (how many of x), or even a ratio (x over y). A calculated field can also be a dimension.

Data Studio uses Google Drive for sharing and storing files.

Share and collaborate on your dashboards with your team. A Google login is required to edit a report; no login is needed for viewing. Keep in mind that when you share a report that is connected to an underlying data source like a BigQuery dataset, Data Studio does not automatically grant viewers permissions on that data source if they don't already have them; this is for data security reasons. After you share your report, users can interact with filters and sorting, and you can collect feedback on the usage of your report through Data Studio's native integration with Google Analytics.

Machine Learning on Unstructured Datasets

Comparing approaches to ML

  • Use pre-built AI: Dialogflow or AutoML (10-100 images per label)
    • provided as services:
      • Cloud Translation API
      • Cloud Natural Language API
      • Cloud Speech-to-Text API
      • Cloud Video Intelligence API (recognizing content in motion and action videos)
      • Cloud Vision API (recognizing content in still images; see the sketch after this list)
      • Dialogflow Enterprise Edition (to build chatbots)
  • Add Custom Models: only when you have a lot of data, like 100,000 plus to millions of examples.
  • Create new Models: TensorFlow, Cloud AI, Cloud TPU
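As an example of calling one of these pre-built services, a minimal sketch of label detection with the Cloud Vision API Python client; the image path is a placeholder, not from the course:

```python
from google.cloud import vision

# Create a client (uses application-default credentials)
client = vision.ImageAnnotatorClient()

# Read a local image file; the path is a placeholder
with open("photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Ask the pre-built model for labels describing the image content
response = client.label_detection(image=image)
for label in response.label_annotations:
    print(f"{label.description}: {label.score:.2f}")
```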

Dialogflow is a platform for building natural and rich conversational experiences. It achieves a conversational user experience by handling the natural language understanding for you. It has built-in entity recognition, which enables your agent to identify entities and label them by type, such as person, organization, location, event, product, and media. It also offers sentiment analysis to give an understanding of the overall sentiment expressed in a block of text, and content classification, which allows you to classify documents into over 700 predefined categories like common greetings and conversational styles. It has multi-language support so you can analyze text in multiple languages. Dialogflow works by putting all of these ML capabilities together, which you can then optimize for your own training data and use case. Dialogflow then creates unique algorithms for each specific conversational agent, which continuously learn and are trained and retrained as more and more users engage with your agent. 95d94272c1cec38e84f0dd6d9a5c9e48.png
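A minimal sketch of sending a user's text to a Dialogflow agent with the google-cloud-dialogflow Python client; the project and session IDs are placeholders, not from the course:

```python
from google.cloud import dialogflow

project_id = "my-project"   # placeholder
session_id = "session-123"  # placeholder; one session per conversation

session_client = dialogflow.SessionsClient()
session = session_client.session_path(project_id, session_id)

# Wrap the user's utterance as a text query
text_input = dialogflow.TextInput(text="I want to book a flight", language_code="en")
query_input = dialogflow.QueryInput(text=text_input)

# The agent matches an intent and returns a fulfillment response
response = session_client.detect_intent(
    request={"session": session, "query_input": query_input})
print("Matched intent:", response.query_result.intent.display_name)
print("Agent reply:", response.query_result.fulfillment_text)
```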

Dialogflow benefits for users:

  • Build faster:
    • start training with only a few examples
    • 40+ pre-built agents
  • Engage more efficiently
    • built-in natural language understanding
    • multiple options to connect with backend systems
    • Training and analytics
  • Maximize reach
    • Build once, deploy everywhere
    • 20+ languages supported
    • 14 single-click platform integrations and 7 SDKs

Customizing pre-built models with AutoML

c95aa196527b6a81b5ad0e3e6eb4d20d.png

Precision is the number of photos correctly classified as a particular label divided by the total number of photos classified with that label. Recall is the number of photos correctly classified as a particular label divided by the total number of photos that actually have that label. e9bdd75ea38db93bd06929d6d2371de2.png
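A small illustrative calculation of both metrics from raw counts (the numbers are made up, not from the course):

```python
# Hypothetical counts for the label "cat"
true_positives = 40   # photos of cats classified as "cat"
false_positives = 10  # photos without cats classified as "cat"
false_negatives = 20  # photos of cats the model missed

precision = true_positives / (true_positives + false_positives)  # 40 / 50 = 0.80
recall = true_positives / (true_positives + false_negatives)     # 40 / 60 ≈ 0.67

print(f"precision={precision:.2f}, recall={recall:.2f}")
```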

317aa12b279ad4a9b0e289d1c654c4a3.png

f7e03cc71bfb39115e9ccb74f9306f2e.png