---
title: Data Management at Scale
updated: 2022-05-21 12:13:15Z
created: 2022-05-21 12:13:02Z
---

Data Management at Scale

Chapter 1

Datafication: the transformation of social action into online quantified data, which allows for real-time tracking and predictive analysis.

Data harmonization: bringing (large) amounts of data together into a particular, consistent context.

Data Management: DAMA (the Data Management Association, whose DAMA-DMBOK provides the reference definition).

Data Monetization: the process of using data to obtain quantifiable economic benefit. Internal or indirect methods include using data to make measurable business performance improvements and inform decisions. External or direct methods include data sharing to gain beneficial terms or conditions from business partners, information bartering, selling data outright (via a data broker or independently), or offering information products and services (for example, including information as a value-added component of an existing offering).

Data Proliferation: the same data gets distributed across many applications and databases.

Data intensiveness: the read-versus-write ratio is changing significantly. To optimize for reads, data is duplicated and/or applications are optimized for read access.
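
A minimal sketch of this read optimization, assuming a simple key-value setup; OrderStore and its field names are illustrative, not from the book:

```python
# Sketch: optimize for reads by duplicating data at write time into a
# denormalized read model. Class and field names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class OrderStore:
    orders: dict = field(default_factory=dict)              # write-optimized primary data
    totals_by_customer: dict = field(default_factory=dict)  # duplicated, read-optimized view

    def write(self, order_id: str, customer: str, amount: float) -> None:
        self.orders[order_id] = {"customer": customer, "amount": amount}
        # the duplication happens here, so reads below stay cheap
        self.totals_by_customer[customer] = self.totals_by_customer.get(customer, 0.0) + amount

    def read_total(self, customer: str) -> float:
        # no aggregation at read time: the answer was precomputed on write
        return self.totals_by_customer.get(customer, 0.0)

store = OrderStore()
store.write("o-1", "acme", 100.0)
store.write("o-2", "acme", 50.0)
print(store.read_total("acme"))  # 150.0
```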

DevOps and smaller applications (microservices, k8s, domain-driven design, serverless computing) result in increased complexity and an increased demand to better control data.

DataOps: focuses on data interoperability, the capture of immutable events, reproducibility, and loose coupling.
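
A small sketch of capturing immutable events in an append-only log, in the spirit of the DataOps note above; the file name events.jsonl and the event types are made up for illustration:

```python
# Sketch: capture immutable events in an append-only log and rebuild state by
# replaying it. The log file name and event types are hypothetical.
import json
import time
from typing import Iterator

EVENT_LOG = "events.jsonl"  # append-only log; records are never updated or deleted

def append_event(event_type: str, payload: dict) -> None:
    record = {"type": event_type, "ts": time.time(), "payload": payload}
    with open(EVENT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def replay() -> Iterator[dict]:
    # reproducibility: any consumer can derive its own state from the same log
    with open(EVENT_LOG, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

append_event("CustomerRegistered", {"id": "c-1", "name": "Acme"})
append_event("OrderPlaced", {"customer": "c-1", "amount": 100.0})

orders_per_customer: dict = {}
for event in replay():
    if event["type"] == "CustomerRegistered":
        orders_per_customer[event["payload"]["id"]] = 0
    elif event["type"] == "OrderPlaced":
        orders_per_customer[event["payload"]["customer"]] += 1
print(orders_per_customer)  # {'c-1': 1}
```

Because events are only ever appended, any consumer can replay the log and reproduce the same derived state, which keeps producers and consumers loosely coupled.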

"Bring the data to the application" versus "don't move the data" becomes a less important distinction in the cloud. This matters for SaaS and Machine Learning as a Service (MLaaS), but it may fragment the data.

Insights about where data originated and how it is distributed are crucial, so stronger internal governance is required. This trend toward stronger control runs contrary to methodologies for fast software development, which involve less documentation and fewer internal controls.

For advanced analytics, such as machine learning, leaving context out can be a big problem because if the data is meaningless, it is impossible to correctly predict the future.

IntegrationDatabase: a database that acts as the data store for multiple applications, and thus integrates data across these applications (in contrast to an ApplicationDatabase).
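
A hedged sketch of the coupling an IntegrationDatabase creates, using an in-memory SQLite database as the shared store; the two "applications" and the customers table are hypothetical:

```python
# Sketch: an integration database couples applications through a shared schema.
# Both "applications" below read and write the same table, so a schema change
# for one silently affects the other.
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the shared database
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, segment TEXT)")

def billing_app_register(customer_id: str, name: str) -> None:
    # application A writes directly into the shared table
    conn.execute("INSERT INTO customers (id, name, segment) VALUES (?, ?, ?)",
                 (customer_id, name, "unknown"))

def marketing_app_segments() -> list:
    # application B reads the very same table: integration happens in the database
    return conn.execute("SELECT name, segment FROM customers").fetchall()

billing_app_register("c-1", "Acme")
print(marketing_app_segments())  # [('Acme', 'unknown')]
# With an ApplicationDatabase, each application would own its data and exchange
# it through an interface instead of a shared schema.
```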

The “big ball of mud” describes a system architecture that is monolithic, difficult to understand, hard to maintain, and tightly coupled because of its many dependencies. Data warehouses, with their layers, views, countless tables, relationships, scripts, ETL jobs, and scheduling flows, often result in a chaotic web of dependencies. The lack of agility often becomes a concern => risk of work-arounds being developed => technical debt.

Data lakes (holding multiple formats: structured, semi-structured, and unstructured), just like data warehouses, are considered centralized (monolithic) data repositories, but they differ from warehouses because they store data before it has been transformed, cleansed, and structured.

Data warehouses are usually engineered with RDBMSs, while data lakes are commonly engineered with distributed databases, NoSQL systems, or public cloud services. Dumping in raw application structures (exact copies) is fast and gives data analysts and scientists quick access. However, the complexity with raw data is that use cases always require reworking it: data quality problems have to be sorted out, aggregations are required, and enrichments with other data are needed to bring the data into context. This introduces a lot of repeatable work and is another reason why data lakes are typically combined with data warehouses. In this combination, data warehouses act as high-quality repositories of cleansed and harmonized data, while data lakes act as (ad hoc) analytical environments holding a large variety of raw data to facilitate analytics.

Data lake implementations have a failure rate of more than 60%. They typically fail, in part, because of their immense complexity, difficult maintenance, and shared dependencies. Other reasons include management resistance, internal politics, lack of expertise, and security and governance challenges.
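
A rough sketch, using pandas, of the repeatable rework described above: raw, exact-copy data from the lake is cleansed, harmonized, and aggregated before it can serve analytics (the warehouse role). Column names and values are invented for the example:

```python
# Sketch of the lake -> warehouse pattern: raw application extracts land in the
# lake untouched; the warehouse holds the cleansed, harmonized version.
import pandas as pd

# raw application extract dumped into the data lake (exact copy, no cleanup)
raw = pd.DataFrame({
    "cust_id": ["C1", "C1", "C2", None, "C3"],
    "amount":  ["100", "100", "250", "30", "not_a_number"],
    "country": ["nl", "nl", "NL", "BE", "be"],
})

# repeatable rework every consumer would otherwise redo on the raw data
clean = (
    raw.dropna(subset=["cust_id"])   # sort out data quality problems
       .drop_duplicates()            # remove exact copies
       .assign(
           amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
           country=lambda d: d["country"].str.upper(),  # harmonize country codes
       )
       .dropna(subset=["amount"])
)

# aggregation that brings the data into context for analytics
per_country = clean.groupby("country", as_index=False)["amount"].sum()
print(per_country)  # this cleansed, harmonized view is what the warehouse layer serves
```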

Scaled Architecture: a reference and domain-based architecture with a set of blueprints, designs, principles, models, and best practices that simplifies and integrates data management across the entire organization in a distributed fashion.

Domain-agnostic: the topic is explained without using examples from any specific domain.

Chapter 2 Introducing the Scaled Architecture: Organizing Data at Scale

How can you distribute data efficiently while retaining agility, security, and control?