---
title: Wk2 Big Data and Machine Learning Fundamentals
updated: 2021-09-12 20:50:42Z
created: 2021-09-11 16:35:51Z
latitude: 52.09370000
longitude: 6.72510000
altitude: 0.0000
---

![9948507e5860dd40b34a9f10e6b370c2.png](../_resources/9948507e5860dd40b34a9f10e6b370c2.png)

# Message-oriented architectures with Pub/Sub

## Distributed Messages

- Streaming data from various devices
    - issues: bad data, delayed data, no data
- Distributing event notifications (ex: new user sign-up)
    - other services can subscribe to the new messages that we publish
- Scalable to handle volumes
    - needs to handle an arbitrarily high amount of data so we don't lose any incoming messages
- Reliable (no duplicates)
    - we need all the messages, and also a way to remove any duplicates if found

**Pub/Sub is a distributed messaging service** that can receive messages from a variety of different streams and **upstream data systems**, such as gaming events, IoT devices, application streams, and more. Pub/Sub will scale to meet that demand.

- Ensures at-least-once delivery and passes messages to subscribing applications
- No provisioning, auto-everything
- Open APIs
- Global by default
- End-to-end encryption

![51c2fc2521e7435aba198a5c23584999.png](../_resources/51c2fc2521e7435aba198a5c23584999.png)

Upstream data starts on the left and comes in from devices all around the globe. It is then **ingested** into Cloud Pub/Sub as the first point of contact with our system. Cloud Pub/Sub reads, stores, and then publishes out to any subscribers of that particular topic. Cloud Dataflow, as a subscriber to this Pub/Sub topic, will ingest and transform those messages in an elastic streaming pipeline. If you're doing analytics, one common data sink is Google BigQuery.

## Architecture of Pub/Sub (like Kafka)

The central piece of Pub/Sub is the **topic**. There can be zero, one, or many publishers, and zero, one, or many subscribers relating to any given Pub/Sub topic, and they are completely decoupled from each other. A minimal publish/subscribe sketch follows below.
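To make the topic/publisher/subscriber decoupling concrete, here is a minimal sketch using the Python client library (`google-cloud-pubsub`). The project, topic, and subscription names are hypothetical placeholders, and the topic and subscription are assumed to already exist:

```python
from google.cloud import pubsub_v1

# Hypothetical resource names for illustration.
project_id = "my-project"
topic_id = "signups"
subscription_id = "signups-sub"

# Publisher side: knows only the topic, not who is listening.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)
future = publisher.publish(topic_path, b"new user signed up", user_id="42")
print(f"Published message {future.result()}")  # result() blocks until the server acks

# Subscriber side: knows only its subscription to the topic, not the publisher.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(project_id, subscription_id)
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for received in response.received_messages:
    print(received.message.data)

# Delivery is at-least-once, so unacknowledged messages are redelivered;
# acknowledge them here, and deduplicate downstream if needed.
subscriber.acknowledge(request={
    "subscription": sub_path,
    "ack_ids": [r.ack_id for r in response.received_messages],
})
```

Because neither side references the other directly, publishers and subscribers can be added or removed independently.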
## Designing streaming pipelines with Apache Beam

Two concerns here: the design of the actual pipeline in code, and the actual implementation and serving of that pipeline at scale in production. Key design questions:

- Is the code compatible with both batch and streaming data? **YES**
- Does the pipeline code SDK support the transformations I need to do? **Likely**
- Does it have the ability to handle late data coming into the pipeline?
- Are there any existing templates or solutions that we can leverage to quickly get us started? **Choose from templates**

## What is Apache Beam?

Apache Beam is a **portable data processing programming model**.

- It's extensible (write and share new SDKs, IO connectors, and transformation libraries) and open source
- Can be run in a **highly distributed** fashion
- It's **unified**: use a single programming model for both batch and streaming use cases
- **Portable**: execute pipelines on multiple execution environments. No vendor lock-in
- You can browse and write your own connectors
- Build transformation libraries too, if needed
- Apache Beam pipelines are written in Java, Python, or Go
- The SDK provides a host of libraries for transformations and existing data connectors to sources and sinks

![5151c1c2b55de604999b2b6f38822b08.png](../_resources/5151c1c2b55de604999b2b6f38822b08.png)

Apache Beam creates a model representation of your code, which is portable across many runners. Runners pass off your model to an execution environment, which could be one of many different engines. Cloud Dataflow is one of the popular choices for running Apache Beam as an engine.

Example of a pipeline:

![a6d9be269ce6bc83c9f5d824ef667013.png](../_resources/a6d9be269ce6bc83c9f5d824ef667013.png)

Transformations can be done in parallel, which is how you get a truly elastic pipeline. You can get input from many different and even multiple sources concurrently, then write output to many different sinks, and the pipeline code remains the same. A minimal code sketch follows below.
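Here is a minimal sketch of such a pipeline in the Beam Python SDK, reading from Pub/Sub and writing windowed counts to BigQuery. The topic and table names are hypothetical placeholders; swapping the runner option is what moves the same code from a local run to Cloud Dataflow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Hypothetical resource names for illustration.
TOPIC = "projects/my-project/topics/signups"
TABLE = "my-project:analytics.signup_counts"

# streaming=True because Pub/Sub is an unbounded source; add
# runner="DataflowRunner" (plus project/region/temp_location) to run on Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        | "Decode" >> beam.Map(lambda data: data.decode("utf-8"))
        | "FixedWindows" >> beam.WindowInto(window.FixedWindows(60))  # 1-minute windows
        | "PairWithOne" >> beam.Map(lambda event: (event, 1))
        | "CountPerEvent" >> beam.CombinePerKey(sum)  # transforms run in parallel
        | "ToRow" >> beam.Map(lambda kv: {"event": kv[0], "count": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            TABLE, schema="event:STRING,count:INTEGER"
        )
    )
```

Only the I/O steps name a source or sink; the transformation steps in the middle would look the same in a batch pipeline reading from, say, files.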
## Implementing streaming pipelines on Dataflow

![1238eb4c4d458a3679c68dff091d8e0c.png](../_resources/1238eb4c4d458a3679c68dff091d8e0c.png)

![5ea764eaeeb247f7b43501c4d6e11653.png](../_resources/5ea764eaeeb247f7b43501c4d6e11653.png)

Many Hadoop workloads can be done easily and more maintainably with Dataflow. Plus, Dataflow is serverless and designed to be NoOps. What do we mean by **serverless**? It means that Google will manage all the infrastructure tasks for you, like resource provisioning and performance tuning, as well as ensuring that your pipeline is reliable.

![2fa4ce2967893077a569d372aed8a3ff.png](../_resources/2fa4ce2967893077a569d372aed8a3ff.png)

[Source for Google Dataflow templates](https://github.com/GoogleCloudPlatform/Dataflowtemplates)

Recap:

![87df2df0f990da8682065eeb7f8381a9.png](../_resources/87df2df0f990da8682065eeb7f8381a9.png)

QPS: Queries Per Second

# Visualizing insights with Data Studio

The first thing you need to do is tell Data Studio the data source. A Data Studio report can have any number of data sources. The data source picker shows all the data sources that you have access to. Note that other people who can view the report can potentially see all the data in a data source if you share that data source with them.

Warning: anyone who can edit the report can also use all the fields from any added data sources to create new charts with them.

# Creating charts with Data Studio

- Dimension chips are green. **Dimensions** are things like categories or buckets of information. Dimension values could be names, descriptions, or other characteristics of a category.
- **Metric** chips are blue. Metrics measure dimension values. Metrics represent measurements or aggregations such as a sum (x + y), a count (how many of x), or even a ratio (x / y).

A calculated field can also be a dimension.

Data Studio uses Google Drive for sharing and storing files. Share and collaborate on your dashboards with your team. A Google login is required to edit a report; no login is needed for viewing.

Keep in mind that when a shared report is connected to an underlying data source like a BigQuery dataset, Data Studio does not automatically grant viewers permissions on that data source if they don't already have them; this is for data security reasons.

After you share your report, users can interact with filters and sorting, and you can collect feedback on the usage of your report through Data Studio's native integration with Google Analytics.

# Machine Learning on Unstructured Datasets

Comparing approaches to ML:

- **Use pre-built AI**: Dialogflow or AutoML (10-100 images per label), provided as services:
    - Cloud Translation API
    - Cloud Natural Language API
    - Cloud Speech-to-Text
    - Cloud Video Intelligence API (recognizing content in motion and action videos)
    - Cloud Vision API (recognizing content in still images)
    - Dialogflow Enterprise Edition (to build chatbots)
- **Add Custom Models**: only when you have a lot of data, like 100,000-plus to millions of examples
- **Create new Models**: TensorFlow, Cloud AI, Cloud TPU

**Dialogflow** is a platform for building natural and rich conversational experiences. It achieves a conversational user experience by handling the natural language understanding for you. It has built-in **entity recognition**, which enables your agent to identify entities and label them by type, such as person, organization, location, event, product, and media. It also offers **sentiment analysis** to give an understanding of the overall sentiment expressed in a block of text, and **content classification**, which allows you to classify documents into over 700 predefined categories like common greetings and conversational styles. It has **multi-language support** so you can analyze text in multiple languages.

Dialogflow works by putting all of these ML capabilities together, which you can then optimize for your own training data and use case. Dialogflow then creates unique algorithms for each specific conversational agent, which continuously learns and is trained and retrained as more and more users engage with your agent.

![95d94272c1cec38e84f0dd6d9a5c9e48.png](../_resources/95d94272c1cec38e84f0dd6d9a5c9e48.png)

### Dialogflow benefits for users:

- Build faster:
    - start training with only a few examples
    - 40+ pre-built agents
- Engage more efficiently:
    - built-in natural language understanding
    - multiple options to connect with backend systems
    - training and analytics
- Maximize reach:
    - build once, deploy everywhere
    - 20+ languages supported
    - 14 single-click platform integrations and 7 SDKs

## Customizing pre-built models with AutoML

![c95aa196527b6a81b5ad0e3e6eb4d20d.png](../_resources/c95aa196527b6a81b5ad0e3e6eb4d20d.png)

**Precision** is the number of photos correctly classified as a particular label divided by the total number of photos classified with that label. **Recall** is the number of photos correctly classified as a particular label divided by the total number of photos that actually carry that label. (A small worked example follows the figures below.)

![e9bdd75ea38db93bd06929d6d2371de2.png](../_resources/e9bdd75ea38db93bd06929d6d2371de2.png)

![317aa12b279ad4a9b0e289d1c654c4a3.png](../_resources/317aa12b279ad4a9b0e289d1c654c4a3.png)

![f7e03cc71bfb39115e9ccb74f9306f2e.png](../_resources/f7e03cc71bfb39115e9ccb74f9306f2e.png)
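To make the two definitions concrete, a small sketch in Python; the photo counts are made up purely for illustration:

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical numbers: the model labels 10 photos as "dog", of which 8 really
# are dogs (TP=8, FP=2), and it misses 4 actual dog photos (FN=4).
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision = {p:.2f}")  # 8 / 10       -> 0.80
print(f"recall    = {r:.2f}")  # 8 / (8 + 4)  -> 0.67
```

Raising a model's score threshold generally increases precision at the cost of recall, which is the trade-off AutoML's evaluation page lets you explore.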