Data governance as foundation for AI projects
11 Jun 2026
You want to implement AI in your datacenter. Excellent.
Now answer: Do you have quality data?
If the answer is "I think we do", stop here.
Because the brutal truth is: AI is only as good as your data.
Bad data = bad models = bad decisions = real loss.
Data governance isn't an IT task. It's corporate strategy.
The Real Problem: Dirty Data
Example 1: Customer Age
Customer A: birth year recorded as 1908 (118 years old)
Customer B: birth year recorded as 2050 (in the future)
Customer C: no birth date
Customer D: birth date recorded as "31/02/2023" (doesn't exist)
ML model trained on this:
"Hmm, data says people were born in 1908, in 2050, or on dates that don't exist"
Output: Completely broken model
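Records like these can be caught before they ever reach a model. The following is a minimal sketch, assuming birth dates arrive as `DD/MM/YYYY` strings; the function name `validate_birth_date` and the `MAX_PLAUSIBLE_AGE` cutoff are illustrative assumptions, not part of any particular tool.

```python
from datetime import date, datetime
from typing import Optional

MAX_PLAUSIBLE_AGE = 110  # assumed cutoff; tune it to your own customer base


def validate_birth_date(raw: Optional[str], today: date) -> str:
    """Return "ok" or a short reason why the birth date is rejected."""
    if not raw:
        return "missing"
    try:
        # strptime rejects impossible calendar dates such as 31/02/2023
        born = datetime.strptime(raw, "%d/%m/%Y").date()
    except ValueError:
        return "invalid date"
    if born > today:
        return "in the future"
    # Subtract one year if this year's birthday has not happened yet
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    if age > MAX_PLAUSIBLE_AGE:
        return "implausible age"
    return "ok"
```

Passing `today` in explicitly keeps the check reproducible. Rejected records can be routed to a review queue instead of being silently dropped.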
Example 2: Server Location
Server registered as:
- "New York"
- "NY"
- "Nueva York" (Spanish)
- "Unit 1, Building B, NY" (too specific)
- "Default" (not filled)
- "" (empty)
Consolidation: 6 different entries, but it's 1 server
Model tries to correlate by location?
Results: inconsistent
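One way to consolidate the variants above is an alias table applied at ingestion time. This is a sketch under stated assumptions: `LOCATION_ALIASES` and `PLACEHOLDERS` are hypothetical lookup tables that you would normally maintain as reference data, and the comma-splitting fallback is only one possible heuristic.

```python
from typing import Optional

# Hypothetical alias table; in practice this would live in reference data.
LOCATION_ALIASES = {
    "new york": "New York",
    "ny": "New York",
    "nueva york": "New York",
}
# Values that mean "nobody filled this in"
PLACEHOLDERS = {"", "default"}


def normalize_location(raw: Optional[str]) -> Optional[str]:
    """Map a free-text location to a canonical name, or None if unknown."""
    if raw is None:
        return None
    text = raw.strip().lower()
    if text in PLACEHOLDERS:
        return None
    if text in LOCATION_ALIASES:
        return LOCATION_ALIASES[text]
    # Fall back to the last comma-separated part, e.g. "Unit 1, Building B, NY"
    last_part = text.split(",")[-1].strip()
    return LOCATION_ALIASES.get(last_part)
```

Returning `None` for unknown values, rather than guessing, keeps unfilled fields visible in completeness metrics.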
Example 3: Memory Configuration
10 servers, memory configuration:
- "8GB"
- "8gb"
- "8 GB"
- "8"
- "8 (gigabytes)"
- "8000MB"
- "0008"
System calculating total memory:
Treats "8" as 8 bytes? 8MB? Parse failure.
Result: capacity estimates completely wrong
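A small parser can bring all seven spellings to one unit. The sketch below rests on two assumptions that you should confirm against your own inventory: a bare number such as "8" means gigabytes, and megabytes are decimal (8000MB equals 8GB). The names `parse_memory_gb` and `UNITS_PER_GB` are illustrative.

```python
import re
from typing import Optional

# How many of each unit make one gigabyte (assumption: decimal units)
UNITS_PER_GB = {"gb": 1, "gigabytes": 1, "mb": 1000}

# A number, then an optional unit that may be wrapped in parentheses
PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(?\s*([a-z]*)\s*\)?\s*$", re.IGNORECASE)


def parse_memory_gb(raw: str, default_unit: str = "gb") -> Optional[float]:
    """Parse a memory string into gigabytes, or return None if unrecognized."""
    match = PATTERN.match(raw)
    if match is None:
        return None
    number, unit = match.groups()
    unit = unit.lower() or default_unit  # assumption: no unit means gigabytes
    if unit not in UNITS_PER_GB:
        return None
    return float(number) / UNITS_PER_GB[unit]
```

Under those assumptions, every variant in the list parses to 8.0, and anything the parser does not recognize comes back as `None` so it can be flagged rather than miscounted.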
Pillars of Data Governance
1. Data Quality
Metrics:
| Dimension | Description | Target |
|---|---|---|
| Completeness | % of fields filled | > 95% |
| Accuracy | % of values that match reality | > 98% |
| Consistency | % of values that match across tables | > 99% |
| Timeliness | Lag between reality and the data | < 24h |
| Uniqueness | % of records with no duplicate | > 99% |
Implementation:
# Data quality score: equal-weighted mean of four dimensions, each in the range 0-1
def calculate_data_quality_score(completeness, accuracy, consistency, days_since_update, max_days):
    # completeness: non-null fields / total fields
    # accuracy: share of values that match a trusted reference
    # consistency: share of values that agree across tables
    freshness = max(0.0, 1 - days_since_update / max_days)  # stale data bottoms out at 0
    total_score = 0.25 * (completeness + accuracy + consistency + freshness)
    return total_score  # 0-1, target > 0.95
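Two of the dimensions in the table, completeness and uniqueness, can be measured directly from the records. This is a minimal sketch, assuming rows are plain dictionaries and that `None` and the empty string both count as "not filled"; the function names are illustrative.

```python
from typing import Any, Dict, List


def completeness(rows: List[Dict[str, Any]], fields: List[str]) -> float:
    """Share of (row, field) cells that hold a value."""
    total = len(rows) * len(fields)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for name in fields if row.get(name) not in (None, "")
    )
    return filled / total


def uniqueness(rows: List[Dict[str, Any]], key: str) -> float:
    """Share of rows that carry a distinct value in the key column."""
    if not rows:
        return 0.0
    return len({row.get(key) for row in rows}) / len(rows)
```

Accuracy and consistency need a reference dataset or a second table to compare against, so they are harder to reduce to a few lines.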
2. Data Cataloging and Metadata
You have 10,000 data tables in your datacenter. Which is which?
Without a catalog, finding the right one is like finding a single leaf in a forest.
What to catalog:
Table: sales_transactions
├── Owner: revenue-team
├── Location: warehouse/analytics/sales
├── Last updated: 2025-04-05
├── Row count: 50M
├── Columns:
│   ├── transaction_id (PII)
│   ├── customer_id (PII)
│   ├── amount (sensitive)
│   ├── timestamp
│   └── status
├── Lineage:
│   ├── Source: order_system (daily, 8pm UTC)
│   ├── Transform: aggregation, deduplication
│   └── Consumers: revenue_reporting, ml_models/churn_prediction
├── Data quality score: 0.96
├── Refresh SLA: daily, < 2h lag
└── Retention: 7 years (regulatory requirement)
Tools: Collibra, Apache Atlas, Alation
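If you want to prototype a catalog before buying a tool, the entry above maps naturally onto a small data structure. The sketch below is an assumption about how such an entry could be modeled in code; it is not the schema of Collibra, Apache Atlas, or Alation, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    tags: List[str] = field(default_factory=list)  # e.g. "PII", "sensitive"


@dataclass
class CatalogEntry:
    table: str
    owner: str
    location: str
    columns: List[Column]
    sources: List[str]
    consumers: List[str]
    quality_score: float
    retention_years: int

    def columns_tagged(self, tag: str) -> List[str]:
        """Names of the columns that carry the given tag."""
        return [column.name for column in self.columns if tag in column.tags]


entry = CatalogEntry(
    table="sales_transactions",
    owner="revenue-team",
    location="warehouse/analytics/sales",
    columns=[
        Column("transaction_id", ["PII"]),
        Column("customer_id", ["PII"]),
        Column("amount", ["sensitive"]),
        Column("timestamp"),
        Column("status"),
    ],
    sources=["order_system"],
    consumers=["revenue_reporting", "ml_models/churn_prediction"],
    quality_score=0.96,
    retention_years=7,
)
```

With tags stored per column, a question like "which columns in this table are PII?" becomes a one-line lookup: `entry.columns_tagged("PII")`.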
3. Privacy by Design
Before collecting data, ask: "Do I need this? Can I legally collect it?"
GDPR (Europe)
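If you offer goods or services to people in the EU, or monitor their behavior there, GDPR applies to you, wherever your company is based.
Basic rules:
- Lawful basis: you need the user's consent or another lawful basis, such as a documented "legitimate interest"
- Right to be forgotten: if they ask you to delete their data, you delete it, unless the law requires you to keep it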
- Transparency: explain what you use data for
- Security: protect data from theft
- Data minimization: collect only what's necessary
LGPD (Brazil)
Similar to GDPR, but it covers personal data collected or processed in Brazil.
Fines: up to 2% of the company's revenue in Brazil, capped at R$50 million per violation
Practical implementation:
Before using data in AI:
☐ Do I have consent?
☐ Is data PII (Personally Identifiable Information)?
☐ If yes, is it encrypted?
☐ Can I delete data if requested?
☐ Is there access logging?
☐ Is consent documented and archived?
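The checklist above can be turned into an automated gate in a training pipeline. This is a sketch, assuming someone has already recorded the answers per dataset; `PrivacyStatus` and `privacy_blockers` are hypothetical names, and passing the gate is not a substitute for legal review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrivacyStatus:
    has_consent: bool
    consent_documented: bool
    contains_pii: bool
    pii_encrypted: bool
    can_delete_on_request: bool
    access_logged: bool


def privacy_blockers(status: PrivacyStatus) -> List[str]:
    """Return the unmet checklist items; an empty list means no blockers."""
    blockers = []
    if not status.has_consent:
        blockers.append("no consent")
    if not status.consent_documented:
        blockers.append("consent not documented")
    # Encryption is only required when the dataset actually holds PII
    if status.contains_pii and not status.pii_encrypted:
        blockers.append("PII not encrypted")
    if not status.can_delete_on_request:
        blockers.append("deletion not supported")
    if not status.access_logged:
        blockers.append("no access logging")
    return blockers
```

Returning the list of reasons, rather than a bare yes or no, tells the dataset owner exactly what to fix.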
4. Master Data Management (MDM)
Single source of truth. When data conflicts, which version is correct?
Example:
System A: Customer John Smith, born 15/03/1980
System B: Customer John Smith, born 15/03/1981
Which is correct? Without MDM, nobody knows.
With MDM:
1. Designate System A as the "master"
2. System B syncs with A
3. If the two differ, flag the record for manual review
4. Result: one trustworthy version of each record
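The steps above can be sketched as a small reconciliation function. It assumes each record is a flat dictionary of strings and makes one design choice explicit: the master value wins in the synced record, while every conflicting field is still reported for review. The function name `reconcile` is illustrative.

```python
from typing import Dict, List, Tuple


def reconcile(
    master: Dict[str, str], replica: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Sync a replica record with the master and list the conflicting fields."""
    # Fields present in both systems whose values disagree
    flagged = [
        name for name in master if name in replica and replica[name] != master[name]
    ]
    # Master values overwrite the replica; replica-only fields are kept
    synced = {**replica, **master}
    return synced, flagged


system_a = {"name": "John Smith", "born": "15/03/1980"}
system_b = {"name": "John Smith", "born": "15/03/1981", "phone": "555-0100"}
synced, flagged = reconcile(system_a, system_b)
```

Here `flagged` contains only `"born"`, and `synced` carries the 1980 date from System A along with the phone number that only System B knew.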
5. Data Retention and Archiving
Keeping data forever is:
- Expensive (storage)
- Risky (more data = bigger attack surface)
- Legally problematic (GDPR and LGPD require you to delete data you no longer need)
Retention policy:
Sales transactions:
├── Hot data: last 90 days (fast storage/memory)
├── Warm data: 91 days to 2 years (standard storage)
├── Cold data: 2-7 years (archive, slow access)
└── Deletion: after 7 years (regulatory requirement)
Access logs:
├── Hot: last 30 days
├── Archive: 31-90 days
└── Delete: after 90 days
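A retention policy like this is straightforward to encode so that a scheduled job can apply it. The sketch below covers the sales transaction tiers only and assumes a year is 365 days; the function name and tier labels are illustrative.

```python
from datetime import date

DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def sales_retention_tier(record_date: date, today: date) -> str:
    """Return the storage tier for a sales record, or "delete" once it expires."""
    age_days = (today - record_date).days
    if age_days <= 90:
        return "hot"
    if age_days <= 2 * DAYS_PER_YEAR:
        return "warm"
    if age_days <= 7 * DAYS_PER_YEAR:
        return "cold"
    return "delete"
```

A nightly job could call this for each partition and move or purge data accordingly. Logging every deletion helps demonstrate compliance later.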
6. Data Sharing and Access Governance
Not everyone accesses all data.
RBAC (Role-Based Access Control):
Sales Analyst:
├── Access: sales_transactions (last 2 years)
├── Restrictions: can't see employee salaries
├── Audit: all accesses logged
└── Revocation: when they leave the company
DBA:
├── Access: everything (needs it for maintenance)
├── Restrictions: accesses logged, supervised
└── Revocation: immediate on termination
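The same rules can be expressed as a minimal access check with an audit trail. This sketch makes several simplifying assumptions: `ROLE_GRANTS` is a hypothetical in-memory table, the DBA's "everything" is spelled out as an explicit list, and row-level limits such as "last 2 years" are not modeled.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical role-to-table grants
ROLE_GRANTS: Dict[str, Set[str]] = {
    "sales_analyst": {"sales_transactions"},
    "dba": {"sales_transactions", "employee_salary"},
}


class AccessController:
    def __init__(self) -> None:
        self.user_roles: Dict[str, str] = {}
        self.audit_log: List[Tuple[str, str, bool]] = []

    def grant(self, user: str, role: str) -> None:
        self.user_roles[user] = role

    def revoke(self, user: str) -> None:
        """Remove all access, for example when the user leaves the company."""
        self.user_roles.pop(user, None)

    def can_read(self, user: str, table: str) -> bool:
        role = self.user_roles.get(user)
        allowed = role is not None and table in ROLE_GRANTS.get(role, set())
        # Every attempt is logged, whether it was allowed or denied
        self.audit_log.append((user, table, allowed))
        return allowed
```

Logging denied attempts as well as granted ones matters: repeated denials are often the first sign of a misconfigured role or a probing account.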
Roadmap: 12 Months
Months 1-2: Assessment
- Audit: what data do you have?
- What's the quality?
- Who uses it? When?
- Which datasets are "dirty"?
Output: Current state document
Months 3-4: Governance Framework
- Define quality policy
- Establish MDM
- Create data catalog
- Document lineage (data origin)
Output: Framework approved by Legal/Compliance/CIO
Months 5-6: Technical Implementation
- Deploy tools (Collibra/Atlas)
- Data warehouse integration
- Automate quality checks
- Test with pilot dataset
Months 7-9: Controlled Rollout
- Validate with business teams
- Identify critical vs non-critical data
- Implement access controls
- User training
Months 10-12: Scale + Automation
- Expand to new data sources
- ML for automatic quality anomaly detection
- Automatic data retention
- Monthly compliance audits
Cost-Benefit
Investment (Year 1)
| Item | Cost |
|---|---|
| Tool (Collibra) | $20,000 |
| Infrastructure (storage, compute) | $16,000 |
| Resources (Data Gov Officer + team) | $100,000 |
| Training + consulting | $10,000 |
| Total | $146,000 |
Benefit (Year 1)
| Item | Value |
|---|---|
| Reduced data errors | $30,000 |
| Compliance penalties avoided | $50,000+ |
| Efficiency (less cleanup time) | $24,000 |
| AI that works better | $60,000 |
| Total | $164,000+ |
ROI: ~12% in Year 1
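The ROI figure follows directly from the two tables. The short calculation below reproduces it, treating the "$50,000+" line as exactly $50,000, which makes the result a lower bound.

```python
costs = {
    "tool": 20_000,
    "infrastructure": 16_000,
    "resources": 100_000,
    "training_consulting": 10_000,
}
benefits = {
    "reduced_data_errors": 30_000,
    "penalties_avoided": 50_000,  # the table says "$50,000+", so this is a floor
    "efficiency": 24_000,
    "better_ai": 60_000,
}

total_cost = sum(costs.values())        # 146,000
total_benefit = sum(benefits.values())  # 164,000

# ROI = (benefit - cost) / cost
roi = (total_benefit - total_cost) / total_cost
print(f"ROI: {roi:.1%}")  # prints "ROI: 12.3%"
```

Swapping in your own estimates shows how sensitive the result is: the staffing line alone accounts for more than two thirds of the cost.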
Governance + AI: Why It Matters
Without governance:
Dirty data → Model trained on garbage → Garbage output → Wrong decision → Loss
With governance:
Clean data → Robust model → Reliable output → Right decision → Real value
When you implement AI (predictive observability, cognitive RPA, etc.), it only works well with good data.
Data governance is a prerequisite, not an option.
Conclusion
Nobody gets famous for "having good data governance".
But plenty of AI projects fail because "the data was dirty".
Do you want to be known as:
a) "That person who deployed revolutionary AI" (that broke because the data was dirty)
b) "That person who built a solid data foundation" (that lets AI scale)
Choose (b). Your future self thanks you.
Start now. Data doesn't clean itself.