
Data Governance + Data Quality

Introduction

Have you ever run into this: the numbers on a report do not match the real business, the same user has different information in two systems, or an analysis cannot be trusted because of dirty data? Data governance addresses these problems systematically. In the era of "data-driven decisions", data quality determines decision quality: Garbage In, Garbage Out.

You will learn:

  • 6 quality dimensions: completeness, accuracy, consistency...
  • Governance framework: organization + process + technology
  • Data lineage: tracking data from source to consumption
  • Metadata management: "data describing data"
  • Layered architecture: ODS → DWD → DWS → ADS
  • Practice: how to roll governance out

| Chapter | Content |
|---|---|
| 1 | 6 quality dimensions |
| 2 | Governance framework |
| 3 | Data lineage |
| 4 | Metadata management |
| 5 | Layered architecture |
| 6 | Tools + practice |

0. The big picture: why governance?

Governance is a management discipline, not a technology. It answers three questions: Who owns the data? What are the standards? How do we keep data trustworthy over time?

Imagine a company with 100 tables, each team managing its own, with no naming convention, no data dictionary, and no quality checks. The result: for the same "MAU" metric, marketing reports 500k and product reports 300k, because their definitions differ.
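
A minimal sketch of how two teams can get different MAU figures from the same event log. The events and both definitions (any activity vs. at least one purchase) are hypothetical and chosen only to illustrate the definition gap:

```python
from datetime import date

# Hypothetical event log: (user_id, event_type, event_date)
EVENTS = [
    ("u1", "login", date(2026, 1, 3)),
    ("u1", "purchase", date(2026, 1, 5)),
    ("u2", "login", date(2026, 1, 9)),
    ("u3", "login", date(2026, 1, 20)),
    ("u3", "purchase", date(2026, 1, 21)),
    ("u4", "login", date(2025, 12, 30)),  # outside the month being measured
]


def mau(events, year, month, qualifying_events):
    """Count distinct users with at least one qualifying event in the month."""
    return len({
        user
        for user, kind, day in events
        if day.year == year and day.month == month and kind in qualifying_events
    })


# Assumed "marketing" definition: any login or purchase counts as active.
marketing_mau = mau(EVENTS, 2026, 1, {"login", "purchase"})
# Assumed "product" definition: only users who purchased count as active.
product_mau = mau(EVENTS, 2026, 1, {"purchase"})

print(marketing_mau, product_mau)
```

Both numbers are "correct" under their own definition, which is why the metric definition itself has to be governed.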

The 4 pillars of governance

  1. Organization: Data Owner and Data Steward roles, each with clear responsibilities
  2. Process: standard processes for data onboarding, change, and deprecation
  3. Technology: deploy data quality monitoring, metadata management, and lineage
  4. Culture: the whole company agrees that "data is an asset", not "data is a byproduct"

1. 6 quality dimensions

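Example (completeness: does the data contain missing values?)

Problem data:

| User ID | Name | Email | Phone |
|---|---|---|---|
| 001 | 张三 | zhang@mail.com | 138xxxx1234 |
| 002 | 李四 | | |
| 003 | | wang@mail.com | 139xxxx5678 |

After governance:

| User ID | Name | Email | Phone |
|---|---|---|---|
| 001 | 张三 | zhang@mail.com | 138xxxx1234 |
| 002 | 李四 | li@mail.com | 137xxxx9012 |
| 003 | 王五 | wang@mail.com | 139xxxx5678 |

Quality score: 72%
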
| Dimension | Definition | Detection | Common issues |
|---|---|---|---|
| Completeness | Is any data missing? | Null-rate check | Required field left empty |
| Accuracy | Is the data correct? | Rule checks, sampling | Negative amount, invalid date |
| Consistency | Do multiple sources agree? | Cross-system comparison | Name differs between CRM and order system |
| Timeliness | Is the data updated in time? | Update-time check | Stock lag, price not synced |
| Uniqueness | Are there duplicates? | Deduplication check | User registered twice |
| Validity | Is the format correct? | Regex / range check | Malformed email, negative age |
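
The detection methods in the table can be sketched as small checks in plain Python. The records, field names, thresholds, and the email pattern are all assumptions made for illustration; a real pipeline would more likely use a dedicated tool:

```python
import re
from datetime import datetime, timedelta

# Hypothetical user records; every field name here is an assumption.
NOW = datetime(2026, 1, 10, 12, 0)
ROWS = [
    {"id": "001", "email": "zhang@mail.com", "age": 30, "balance": 120.0,
     "updated_at": NOW - timedelta(hours=1)},
    {"id": "002", "email": None, "age": 25, "balance": -5.0,
     "updated_at": NOW - timedelta(days=3)},
    {"id": "003", "email": "not-an-email", "age": -4, "balance": 40.0,
     "updated_at": NOW - timedelta(hours=2)},
    {"id": "003", "email": "wang@mail.com", "age": 41, "balance": 40.0,
     "updated_at": NOW - timedelta(hours=2)},
]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose


def completeness(rows, field):
    """Completeness: share of rows where the field is present (1 - null rate)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def accuracy_violations(rows):
    """Accuracy: rule check, a balance must not be negative."""
    return [r["id"] for r in rows if r["balance"] < 0]


def inconsistent(system_a, system_b):
    """Consistency: shared keys whose values differ between two systems."""
    return sorted(k for k in system_a.keys() & system_b.keys()
                  if system_a[k] != system_b[k])


def stale(rows, max_age):
    """Timeliness: rows not updated within max_age."""
    return [r["id"] for r in rows if NOW - r["updated_at"] > max_age]


def duplicates(rows, key):
    """Uniqueness: key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return sorted(dupes)


def invalid(rows):
    """Validity: malformed email or out-of-range age."""
    return [
        r["id"] for r in rows
        if (r["email"] is not None and not EMAIL_RE.match(r["email"]))
        or not 0 <= r["age"] <= 150
    ]


# Hypothetical names for the same customers in two systems.
CRM_NAMES = {"001": "Zhang San", "002": "Li Si"}
ORDER_NAMES = {"001": "Zhang San", "002": "Li S."}
```

Each function returns either a ratio or the offending ids, so the results can feed a quality report directly.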

1-10-100 rule

  • $1: validate at entry to prevent dirty data
  • $10: clean dirty data in the data warehouse
  • $100: the cost of a wrong decision made on dirty data

The earlier a problem is found and fixed, the cheaper it is.
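
As a rough worked example of the rule, the snippet below treats the three figures as relative per-record costs, which is an assumption; the record counts are hypothetical:

```python
# Relative cost per bad record at each stage, taken from the 1-10-100 rule.
COST_PER_RECORD = {"entry": 1, "warehouse": 10, "decision": 100}


def total_cost(bad_records_by_stage):
    """Sum the cost of bad records according to the stage where each was caught."""
    return sum(COST_PER_RECORD[stage] * count
               for stage, count in bad_records_by_stage.items())


# Hypothetical: the same 1,000 bad records, caught mostly early vs. mostly late.
caught_early = total_cost({"entry": 900, "warehouse": 90, "decision": 10})
caught_late = total_cost({"entry": 100, "warehouse": 300, "decision": 600})

print(caught_early, caught_late)
```

Under these assumed numbers, shifting detection toward the entry point cuts the total cost by more than an order of magnitude.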


2. Governance framework

Governance is not a one-time project but an ongoing process that spans the whole data lifecycle.

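Data governance framework, five stages:

  1. Define standards
  2. Ingest
  3. Store and manage
  4. Use and consume
  5. Archive and destroy

Define standards: establish data standards, naming conventions, and the data dictionary.

  • Data dictionary: defines each field's meaning, type, and value range
  • Naming convention: unified field naming (snake_case, camelCase, prefix conventions)
  • Classification: levels by sensitivity (public, internal, confidential, top secret)
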
| Stage | Output | Role |
|---|---|---|
| Define standards | Data dictionary, naming convention, classification | Data architect |
| Ingest | Ingestion spec, validation rules, lineage records | Data engineer |
| Storage | Layered model, permission matrix, lifecycle policy | DBA / platform |
| Consume | Data catalog, masking rules, quality report | Data analyst / business |
| Archive | Archive strategy, deletion records, audit log | Security + compliance |
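
The "define standards" outputs can be enforced mechanically. Below is a minimal sketch of a naming-convention check and a sensitivity classification. The layer prefixes and the masking policy are assumptions for illustration, not rules stated by the framework:

```python
import re
from enum import IntEnum


class Sensitivity(IntEnum):
    """The four classification levels, ordered from lowest to highest."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    TOP_SECRET = 4


SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
# Assumed prefix convention: table names start with their warehouse layer.
ALLOWED_PREFIXES = ("ods_", "dwd_", "dws_", "ads_")


def naming_violations(table, columns):
    """Return human-readable problems for one table and its column names."""
    problems = []
    if not table.startswith(ALLOWED_PREFIXES):
        problems.append(f"table '{table}' lacks a layer prefix")
    for name in [table, *columns]:
        if not SNAKE_CASE.match(name):
            problems.append(f"'{name}' is not snake_case")
    return problems


def needs_masking(level):
    """Assumed policy: anything above INTERNAL is masked before consumption."""
    return level > Sensitivity.INTERNAL


print(naming_violations("dwd_order_detail", ["order_id", "user_id"]))
print(naming_violations("OrderDetail", ["userId"]))
```

A check like this could run in code review or CI so that violations are caught when a table is onboarded, not after it is in use.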

DAMA-DMBOK domains

| Domain | Core content | Key output |
|---|---|---|
| Data architecture | Define models, flows, storage | Architecture diagram, ER diagram |
| Data standards | Unified naming, encoding, metric definitions | Data dictionary, metric library |
| Data quality | Quality rules, monitoring, fix process | Quality report, SLA dashboard |
| Data security | Classification, access control, masking/encryption | Security policy, audit log |
| Master Data Management | A unified "golden record" for each core entity | Master data center |
| Data lifecycle | Create → archive → destroy | Retention policy, archive rules |

Maturity model

  • Level 1 - Initial: nothing is unified, each team works its own way
  • Level 2 - Repeatable: basic documentation exists, execution is inconsistent
  • Level 3 - Defined: unified process + tooling, followed by most teams
  • Level 4 - Managed: quantified metrics + automated monitoring
  • Level 5 - Optimized: continuous improvement, governance built into daily development

3. Data lineage: from where to where

Data lineage records the full flow of data from its source to its final consumer. Think of it as the "family tree" of the data.

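Example lineage graph (nodes by layer):

  • Sources: MySQL user table, MySQL order table, click log
  • ODS layer: ODS users, ODS orders, ODS clicks
  • DWD layer: DWD user detail, DWD order detail, DWD click detail
  • DWS layer: DWS user profile, DWS GMV summary
  • ADS layer: ADS business report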

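3 core use cases:

| Scenario | Question | How lineage helps |
|---|---|---|
| Impact analysis | If I change a field in the user table, which reports are affected? | Trace lineage downstream to find every dependency |
| Root cause | Today's GMV report looks abnormal. Which step is at fault? | Trace lineage upstream, one link at a time |
| Compliance audit | Which systems does a user's phone number pass through? Is it masked everywhere? | Track the sensitive field through its full flow |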
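
Impact analysis and root-cause tracing are both graph traversals, one downstream and one upstream. The sketch below uses the node names from the example diagram. The exact edges are an assumption, since the source only lists nodes per layer:

```python
from collections import deque

# Downstream edges between the example nodes (assumed for illustration).
DOWNSTREAM = {
    "mysql.users": ["ods.users"],
    "mysql.orders": ["ods.orders"],
    "click_log": ["ods.clicks"],
    "ods.users": ["dwd.user_detail"],
    "ods.orders": ["dwd.order_detail"],
    "ods.clicks": ["dwd.click_detail"],
    "dwd.user_detail": ["dws.user_profile"],
    "dwd.click_detail": ["dws.user_profile"],
    "dwd.order_detail": ["dws.gmv_summary"],
    "dws.user_profile": ["ads.business_report"],
    "dws.gmv_summary": ["ads.business_report"],
}


def invert(graph):
    """Flip edge direction so the same walk can be used for upstream tracing."""
    upstream = {}
    for src, targets in graph.items():
        for dst in targets:
            upstream.setdefault(dst, []).append(src)
    return upstream


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Impact analysis: what is affected if the MySQL user table changes?
impact = reachable(DOWNSTREAM, "mysql.users")
# Root cause: which upstream nodes could explain a bad GMV summary?
root_cause_candidates = reachable(invert(DOWNSTREAM), "dws.gmv_summary")

print(sorted(impact))
print(sorted(root_cause_candidates))
```

For a compliance audit the same downstream walk applies, run at field level instead of table level.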

Lineage collection

  • Active: parse SQL and ETL configs to automatically extract table-level and field-level lineage
  • Passive: hook into the query engine's (Hive, Spark) execution plans to record lineage in real time

Mainstream tools: Apache Atlas, DataHub, OpenLineage, all of which support automated lineage collection.
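
To make the "active" approach concrete, here is a deliberately naive sketch that extracts table-level lineage from a single INSERT ... SELECT statement using regular expressions. This is not how the tools above work internally; production collectors use a real SQL parser, and this version ignores subqueries, CTEs, and comments. The SQL statement and table names are hypothetical:

```python
import re

# Matches the write target of INSERT INTO / INSERT OVERWRITE [TABLE] <name>.
TARGET_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Matches table names that follow FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (target_table, sorted source tables), or None if nothing is written."""
    target = TARGET_RE.search(sql)
    if target is None:
        return None
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return target.group(1), sources


SQL = """
INSERT OVERWRITE TABLE dws.gmv_summary
SELECT o.order_date, SUM(o.amount) AS gmv
FROM dwd.order_detail o
JOIN dwd.user_detail u ON o.user_id = u.user_id
GROUP BY o.order_date
"""

print(table_lineage(SQL))
```

Each (target, sources) pair becomes a set of edges in the lineage graph; field-level lineage additionally requires resolving which source columns feed each output column.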


4. Metadata management: "data describing data"

Metadata is data about data. If data is the content of a book, metadata is its table of contents, author, and ISBN.

| Type | Description | Examples |
|---|---|---|
| Technical | Physical storage information | Table name, field type, partition, location |
| Business | Business meaning | Vietnamese field name, business definition, calculation method |
| Operational | Runtime status | ETL execution time, data volume, update frequency |

Data dictionary

A good dictionary must include:

  • Field name: English + Vietnamese
  • Type: VARCHAR(50), INT, DATETIME
  • Business definition: what does the field represent? How is it calculated?
  • Value range: which values are valid? Is null allowed?
  • Owner: who maintains it? Who do you ask when something is wrong?

Without a dictionary, a new hire needs a week to understand a table. With one, 10 minutes.
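
A dictionary entry with the five attributes above can be modeled as a small record type, which also makes the value-range rule executable. The class, the field names, and the `order_status` example are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One field in the data dictionary (attribute names are illustrative)."""
    name: str            # English field name
    local_name: str      # Vietnamese field name
    data_type: str       # e.g. VARCHAR(50), INT, DATETIME
    definition: str      # what the field represents and how it is calculated
    nullable: bool       # is null allowed?
    owner: str           # who maintains the field
    allowed_values: tuple = ()  # empty tuple means "no enumerated range"

    def validate(self, value):
        """Check a value against the nullability and value-range rules."""
        if value is None:
            return self.nullable
        return not self.allowed_values or value in self.allowed_values


# Hypothetical entry for an order status column.
ORDER_STATUS = DictionaryEntry(
    name="order_status",
    local_name="trạng thái đơn hàng",
    data_type="VARCHAR(50)",
    definition="Current lifecycle state of the order",
    nullable=False,
    owner="order-platform team",
    allowed_values=("created", "paid", "shipped", "cancelled"),
)

print(ORDER_STATUS.validate("paid"), ORDER_STATUS.validate("unknown"))
```

Keeping the dictionary in a structured form like this (even exported from a spreadsheet) lets the same entries drive both documentation and validity checks.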


5. Layered architecture: ODS → DWD → DWS → ADS

A data warehouse does not cram everything into one place; it is layered by processing level.

| Layer | Full name | Role | Characteristics |
|---|---|---|---|
| ODS | Operational Data Store | Mirror of the business DB | Raw, unprocessed |
| DWD | Data Warehouse Detail | Clean, standardize, dedup | Clean detail records |
| DWS | Data Warehouse Summary | Aggregate by theme (day/week/month) | Pre-computed metrics |
| ADS | Application Data Store | Serves a specific report/API | Directly usable |

Why layer?

  • Reuse: DWD cleans the data once and every layer above shares it, avoiding repeated cleaning
  • Decoupling: a schema change in the business DB only affects ODS and does not touch reports
  • Performance: DWS is pre-aggregated, so report queries read it directly
  • Traceability: each layer is retained, so you can debug layer by layer
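
The four layers can be sketched end to end with in-memory lists standing in for tables. The order records and function names are hypothetical, and a real warehouse would express each step as SQL or dbt models rather than Python:

```python
from collections import defaultdict

# ODS: raw mirror of the business table, kept as-is (duplicates and gaps included).
ods_orders = [
    {"order_id": 1, "day": "2026-01-01", "amount": "100.0"},
    {"order_id": 1, "day": "2026-01-01", "amount": "100.0"},  # duplicate row
    {"order_id": 2, "day": "2026-01-01", "amount": "50.5"},
    {"order_id": 3, "day": "2026-01-02", "amount": None},     # missing amount
    {"order_id": 4, "day": "2026-01-02", "amount": "20.0"},
]


def build_dwd(ods_rows):
    """DWD: drop incomplete rows, cast types, dedup on order_id."""
    seen, clean = set(), []
    for row in ods_rows:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({**row, "amount": float(row["amount"])})
    return clean


def build_dws(dwd_rows):
    """DWS: pre-aggregate GMV per day."""
    gmv = defaultdict(float)
    for row in dwd_rows:
        gmv[row["day"]] += row["amount"]
    return dict(gmv)


def build_ads(dws_gmv):
    """ADS: shape the summary for one specific report."""
    return [{"day": day, "gmv": round(total, 2)}
            for day, total in sorted(dws_gmv.items())]


dwd_orders = build_dwd(ods_orders)
report = build_ads(build_dws(dwd_orders))
print(report)
```

Because each layer's output is kept, an unexpected GMV value can be checked against DWS, then DWD, then ODS to find where it was introduced.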

6. Tools + practice

| Tool | Position | Capability | Use case |
|---|---|---|---|
| Great Expectations | Data quality | Declarative validation, automatic quality reports | Python pipelines |
| dbt | Data transformation | SQL models, built-in tests + docs | Data warehouse modeling |
| DataHub | Metadata | Data catalog, lineage, discovery | Enterprise |
| Apache Atlas | Metadata | Lineage for the Hadoop ecosystem | Big data |
| OpenMetadata | Metadata | Open data catalog, multi-source | Small and mid-sized teams |
| Amundsen | Discovery | Search-based data discovery | Data democratization |

Path from zero

  1. Start with a data dictionary: record what each table and field means (even in Excel)
  2. Add quality checks: critical pipelines get null and range checks
  3. Unify metrics: agree on one calculation method for "DAU", "MAU", and "GMV"
  4. Introduce tools: when the manual cost gets too high, adopt DataHub / dbt
  5. Build process: data changes must be reviewed, and quality issues get an SLA + alerts

Summary

Data governance is the engineering work that takes data from "usable" to "good, trustworthy, and traceable". It is not a one-time effort but an ongoing operation.

  1. 6 dimensions: completeness, accuracy, consistency, timeliness, uniqueness, validity
  2. 4 pillars: organization, process, technology, culture
  3. Data lineage: supports impact analysis + root cause analysis
  4. Metadata: the data dictionary is the most basic and most important output
  5. Layered architecture: ODS → DWD → DWS → ADS
  6. Progressive rollout: start from the dictionary, then gradually add tools + process

2026 for Vietnamese developers

  • Modern data stack 2026: dbt + Snowflake/BigQuery + Airflow + Datadog
  • Open-source alternative: dbt-core + DuckDB + Dagster + DataHub
  • Vietnam context: banks use Informatica and Talend; modern startups use dbt + Snowflake; smaller teams use Metabase + Airflow
  • AI + governance: LLMs auto-document the data dictionary and power semantic search over the catalog (DataHub has an AI assistant)
  • Privacy: GDPR and Vietnam's Data Protection Decree call for tools that support PII detection (Amazon Macie, BigID)
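
To show what PII detection means at its simplest, here is a regex-based sketch that flags and masks emails and phone numbers in free text. It is not how Amazon Macie or BigID work; the patterns are assumptions and far looser than what a production tool would use:

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Assumed phone shape: 10 or 11 consecutive digits; real formats vary by country.
    "phone": re.compile(r"\b\d{10,11}\b"),
}


def detect_pii(text):
    """Return the sorted kinds of PII found in the text."""
    return sorted(kind for kind, pattern in PII_PATTERNS.items()
                  if pattern.search(text))


def mask(text):
    """Replace every detected PII value with a placeholder such as <email>."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{kind}>", text)
    return text


SAMPLE = "Contact zhang@mail.com or 13812341234"
print(detect_pii(SAMPLE))
print(mask(SAMPLE))
```

Running a scan like this over sampled column values is one way to seed the sensitivity classification before adopting a dedicated tool.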

References