title | updated | created |
---|---|---|
Data Management at Scale | 2022-05-21 12:13:15Z | 2022-05-21 12:13:02Z |
Data Management at Scale
Chapter 1
Data harmonization: bringing data into a particular context.
Data Management: DAMA
Data Monetization: Data Monetization refers to the process of using data to obtain quantifiable economic benefit. Internal or indirect methods include using data to make measurable business performance improvements and inform decisions. External or direct methods include data sharing to gain beneficial terms or conditions from business partners, information bartering, selling data outright (via a data broker or independently), or offering information products and services (for example, including information as a value-added component of an existing offering).
Data Proliferation: the same data gets distributed across many applications and databases.
Data intensiveness: the read-versus-write ratio is changing significantly. Optimize for read: duplication of data and/or applications optimized for read.
DevOps and smaller applications (microservices, k8s, domain-driven design, serverless computing) result in increased complexity and increased demand to better control data.
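A minimal sketch of "optimize for read by duplicating data", assuming a small normalized write side and a denormalized copy built from it. The customer and order records and the function name `build_read_model` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Write side: normalized records, the source of truth (hypothetical data).
customers: Dict[int, str] = {1: "Alice", 2: "Bob"}
orders: List[dict] = [
    {"order_id": 10, "customer_id": 1, "total": 25},
    {"order_id": 11, "customer_id": 2, "total": 40},
    {"order_id": 12, "customer_id": 1, "total": 15},
]


def build_read_model(customers: Dict[int, str], orders: List[dict]) -> Dict[str, dict]:
    """Duplicate the data into a shape where a read needs no join."""
    view: Dict[str, dict] = defaultdict(lambda: {"order_ids": [], "total_spent": 0})
    for order in orders:
        name = customers[order["customer_id"]]
        view[name]["order_ids"].append(order["order_id"])
        view[name]["total_spent"] += order["total"]
    return dict(view)


# Read side: a duplicated copy that must be rebuilt when the write side changes.
read_model = build_read_model(customers, orders)
print(read_model["Alice"])  # {'order_ids': [10, 12], 'total_spent': 40}
```

The trade-off is the one the note hints at: reads become a single lookup, but the same data now lives in two places and has to be kept in sync.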
DataOps: focus on data interoperability, the capture of immutable events, reproducibility, and loose coupling.
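To make "capture of immutable events" and "reproducible" concrete, here is a small sketch assuming an in-memory append-only log. `AccountEvent`, `EventLog`, `balance`, and the event kinds are hypothetical names, not taken from the book.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AccountEvent:
    """An immutable fact: once recorded, it is never changed."""
    account_id: str
    kind: str    # "deposited" or "withdrawn" (hypothetical event kinds)
    amount: int  # amount in cents


class EventLog:
    """Append-only log: events can be added and read, not updated or deleted."""

    def __init__(self) -> None:
        self._events: List[AccountEvent] = []

    def append(self, event: AccountEvent) -> None:
        self._events.append(event)

    def events(self) -> Tuple[AccountEvent, ...]:
        # Return a tuple so callers cannot change the log through the result.
        return tuple(self._events)


def balance(log: EventLog, account_id: str) -> int:
    """Rebuild the current state by replaying the events in order."""
    total = 0
    for event in log.events():
        if event.account_id != account_id:
            continue
        total += event.amount if event.kind == "deposited" else -event.amount
    return total


log = EventLog()
log.append(AccountEvent("acc-1", "deposited", 1000))
log.append(AccountEvent("acc-1", "withdrawn", 300))
print(balance(log, "acc-1"))  # 700
```

Because state is derived by replaying the log, any consumer can reproduce the same result independently, which also keeps producers and consumers loosely coupled.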
Bringing the data to the application versus not moving the data becomes less important in the cloud. This matters for SaaS and Machine Learning as a Service (MLaaS), but it may fragment the data.
Insights about where data originated and how data is distributed are crucial. Stronger internal governance is required. The trend toward stronger control runs contrary to the methodologies for fast software development, which involve less documentation and fewer internal controls.
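A sketch of what "where data originated and how it is distributed" could look like as a queryable structure, assuming lineage is recorded as a simple acyclic mapping from each dataset to the datasets it was derived from. All dataset names and the helpers `origins` and `consumers` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical lineage: each dataset maps to the datasets it was derived from.
derived_from: Dict[str, List[str]] = {
    "crm.customers": [],
    "billing.invoices": [],
    "dwh.customer_revenue": ["crm.customers", "billing.invoices"],
    "report.top_customers": ["dwh.customer_revenue"],
}


def origins(dataset: str) -> Set[str]:
    """Return the original sources (datasets with no upstream) of a dataset."""
    sources = derived_from.get(dataset, [])
    if not sources:
        return {dataset}
    result: Set[str] = set()
    for source in sources:
        result |= origins(source)
    return result


def consumers(dataset: str) -> Set[str]:
    """Return every dataset that directly or indirectly depends on `dataset`."""
    found: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for name, sources in derived_from.items():
            if current in sources and name not in found:
                found.add(name)
                stack.append(name)
    return found


print(sorted(origins("report.top_customers")))  # ['billing.invoices', 'crm.customers']
print(sorted(consumers("crm.customers")))       # ['dwh.customer_revenue', 'report.top_customers']
```

`origins` answers "where did this come from?", while `consumers` answers "where has this been distributed to?", the two questions the note calls crucial.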
For advanced analytics, such as machine learning, leaving context out can be a big problem because if the data is meaningless, it is impossible to correctly predict the future.
The “big ball of mud” describes a system architecture that is monolithic, difficult to understand, hard to maintain, and tightly coupled because of its many dependencies. Data warehouses, with their layers, views, countless tables, relationships, scripts, ETL jobs, and scheduling flows, often result in a chaotic web of dependencies. The lack of agility often becomes a concern => risk of workarounds being developed => technical debt.
Data lakes (multiple formats: structured, semi-structured, and unstructured), just like data warehouses, are considered centralized (monolithic) data repositories, but they differ from warehouses because they store data before it has been transformed, cleansed, and structured.
Data warehouses are usually engineered with RDBMSs, while data lakes are commonly engineered with distributed databases or NoSQL systems, or in the public cloud. Dumping in raw application structures (exact copies) is fast and allows data analysts and scientists quick access. However, the complexity with raw data is that use cases always require reworking the data. Data quality problems have to be sorted out, aggregations are required, and enrichments with other data are needed to bring the data into context. This introduces a lot of repeatable work and is another reason why data lakes are typically combined with data warehouses. Data warehouses, in this combination, act like high-quality repositories of cleansed and harmonized data, while data lakes act like (ad hoc) analytical environments, holding a large variety of raw data to facilitate analytics. Data lake implementations have a failure rate of more than 60%. They typically fail, in part, because of their immense complexity, difficult maintenance, and shared dependencies. Other reasons include management resistance, internal politics, lack of expertise, and security and governance challenges.
Scaled Architecture: a reference and domain-based architecture with a set of blueprints, designs, principles, models, and best practices that simplifies and integrates data management across the entire organization in a distributed fashion.
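The rework that every use case on raw data has to repeat (sorting out quality problems, enriching, aggregating) can be sketched as three small steps. This is an illustration under assumptions: the raw rows, the `country_names` lookup, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Raw copy as it might land in a data lake (hypothetical rows).
raw_orders: List[dict] = [
    {"customer": " alice ", "amount": "10.50", "country": "nl"},
    {"customer": "BOB", "amount": "n/a", "country": "de"},  # unusable amount
    {"customer": "Alice", "amount": "4.50", "country": "NL"},
]

# Reference data used for enrichment (hypothetical lookup).
country_names: Dict[str, str] = {"NL": "Netherlands", "DE": "Germany"}


def cleanse(rows: List[dict]) -> List[dict]:
    """Fix data quality: normalize names and codes, drop rows with bad amounts."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return cleaned


def enrich(rows: List[dict]) -> List[dict]:
    """Add context from other data: the full country name."""
    return [{**row, "country_name": country_names.get(row["country"], "unknown")}
            for row in rows]


def aggregate(rows: List[dict]) -> Dict[Tuple[str, str], float]:
    """Sum amounts per customer and country."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for row in rows:
        totals[(row["customer"], row["country_name"])] += row["amount"]
    return dict(totals)


harmonized = aggregate(enrich(cleanse(raw_orders)))
print(harmonized)  # {('Alice', 'Netherlands'): 15.0}
```

If every analyst writes these steps separately against the lake, the work is duplicated, which is the argument in the note for keeping a cleansed, harmonized copy in a warehouse.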
domain-agnostic: the topic is explained without taking examples of any specific domain.