16. Privacy and GDPR in AI
Master GDPR, CCPA, and privacy-regulation compliance for AI with practical frameworks, DPIAs, real fine cases, and 50-point audits
🎯 What You Will Learn
By the end of this lesson, you will be able to:
- Navigate global regulations: GDPR (EU/UK), CCPA (California), LGPD (Brazil), and sector-specific rules
- Classify personal data: Understand which data is personal, pseudonymized, or truly anonymous
- Implement GDPR compliance: Legal bases, minimization, retention, user rights
- Run DPIAs: Data Protection Impact Assessments, step by step
- Handle data breaches: Notification and remediation protocols
- Go beyond compliance: Ethics and best practices
Estimated time: 27 minutes | Level: Intermediate | Prerequisites: Basic knowledge of AI/ML and data collection
📚 The Global Regulatory Landscape
GDPR (EU/UK): The Gold Standard
WHAT IT IS:
- General Data Protection Regulation (EU Regulation 2016/679)
- In force since May 2018
- Applies to: Any company processing data of EU/UK residents, regardless of where the company operates
SCOPE:
APPLIES IF:
✓ Your company is based in the EU/UK
✓ You offer goods/services to EU/UK residents
✓ You monitor the behavior of EU/UK residents (e.g., analytics, AI)
EVEN IF:
✓ Your servers are outside the EU
✓ Your company is American or Latin American
✓ You have no physical office in the EU
EXAMPLE:
- A Mexican startup trains an ML model on data from Spanish users
- → Yes, GDPR applies
FINES:
TIER 1 (less severe violations):
- Up to €10M or 2% of global annual revenue (whichever is higher)
TIER 2 (severe violations):
- Up to €20M or 4% of global annual revenue (whichever is higher)
REAL CASES:
- Amazon: €746M (2021) - Ad processing without valid consent
- WhatsApp: €225M (2021) - Lack of transparency
- Google: €50M (2019) - Lack of valid consent for ad personalization (CNIL)
- British Airways: £20M (~€22M, 2020) - Data breach
CCPA (California): Consumer Rights
WHAT IT IS:
- California Consumer Privacy Act
- In force since January 2020
- Amended by the CPRA (California Privacy Rights Act), effective 2023
SCOPE:
APPLIES IF (any 1 of 3):
1. Annual revenue >$25M
2. You buy/sell/share data of >100,000 consumers/households (raised from 50,000 by the CPRA)
3. >50% of revenue comes from selling consumer data
AND you process data of California residents
KEY DIFFERENCES vs GDPR:
┌───────────────────┬──────────────────┬──────────────────┐
│ Aspect            │ GDPR (EU)        │ CCPA (CA)        │
├───────────────────┼──────────────────┼──────────────────┤
│ Opt-in required   │ YES (mostly)     │ NO (opt-out)     │
│ Right to delete   │ YES              │ YES              │
│ Right to access   │ YES              │ YES              │
│ Data portability  │ YES              │ Limited          │
│ Non-discrimination│ Implicit         │ Explicit         │
│ Private right     │ NO               │ YES (if breach)  │
│ Fines             │ Up to 4% revenue │ $7,500/violation │
└───────────────────┴──────────────────┴──────────────────┘
LGPD (Brazil): Latin America's Leader
WHAT IT IS:
- Lei Geral de Proteção de Dados
- In force since September 2020
- Strongly inspired by GDPR
SCOPE:
APPLIES IF:
✓ You process data in Brazil
✓ You offer goods/services to residents of Brazil
✓ The data was collected in Brazil
EVEN IF:
✓ The company is not based in Brazil
✓ Processing happens outside Brazil
SIMILARITIES WITH GDPR:
- Similar legal bases (consent, legitimate interest, etc.)
- User rights (access, rectification, portability, erasure)
- DPIA requirement for high-risk processing
- Fines: Up to R$50M or 2% of revenue in Brazil (similar to GDPR Tier 1)
Country-Specific Regulations
SPAIN: LOPDGDD
- Ley Orgánica de Protección de Datos y Garantía de Derechos Digitales
- Supplements GDPR with additional requirements:
* Right to digital disconnection (at work)
* Enhanced protections for minors
* Expanded DPO requirement
MEXICO: LFPDPPP
- Ley Federal de Protección de Datos Personales en Posesión de los Particulares
- Scope: Private companies (not government)
- Explicit consent required
- Supervised by INAI (the national data protection authority)
- Fines: Up to $8M MXN (~$400K USD)
COLOMBIA: Ley 1581
- Personal data protection law
- Prior-authorization requirement (opt-in)
- Database registration with the Superintendencia
- Rigorous consent requirements
🔍 What Counts as "Personal Data"?
GDPR Definition
GDPR ARTICLE 4(1):
"Personal data means any information relating to an identified
or identifiable natural person ('data subject')."
IDENTIFIABLE:
A person can be identified, directly or indirectly, by:
- Name
- Identification number
- Location data
- Online identifier (IP address, cookie ID, device ID)
- One or more factors specific to physical, physiological,
genetic, mental, economic, cultural, or social identity
Clear Examples
OBVIOUSLY PERSONAL DATA:
✓ Name: "Juan Pérez"
✓ Email: "juan.perez@example.com"
✓ Phone: +52 55 1234 5678
✓ Address: "Calle Reforma 123, CDMX"
✓ National ID: CURP, DNI, Social Security Number
✓ Photo with a visible face
✓ Medical records
✓ Financial account numbers
ALSO PERSONAL DATA (less obvious):
✓ IP address (identifiable with ISP cooperation)
✓ Cookie ID (if it can be linked to a person)
✓ Device ID (IMEI, Android ID, IDFA)
✓ User handle + behavioral patterns (identifiable)
✓ Email hash (if reversible or linkable)
✓ GPS coordinates (residence/workplace patterns)
✓ Behavioral data (browsing patterns that uniquely identify)
NOT PERSONAL DATA (if truly anonymous):
✗ Aggregated statistics: "45% of users clicked"
✗ Data that cannot be linked back: Random UUID with no mapping
✗ Anonymized properly (irreversibly, no re-identification possible)
The Problem with "Anonymous" Data
CASE STUDY: Netflix Prize Dataset
BACKGROUND (2006):
- Netflix released "anonymous" dataset: 100M movie ratings
- Removed names, replaced with random IDs
- Goal: Improve recommendation algorithm
WHAT WENT WRONG:
- Researchers at UT Austin (Narayanan & Shmatikov)
- Cross-referenced the dataset with public IMDb reviews
- Re-identified Netflix users by matching rating patterns
- Could infer:
* Political views (from documentary ratings)
* Sexual orientation (from genre preferences)
* Religion (from religious film ratings)
LESSON:
- "Anonymous" ≠ Unidentifiable
- 87% of US population identifiable by {ZIP code + birthdate + gender}
- Behavioral patterns are unique fingerprints
LEGAL PERSPECTIVE:
GDPR Recital 26:
"To determine whether a natural person is identifiable,
account should be taken of all the means reasonably likely
to be used...either by the controller or by another person
to identify the natural person."
PRACTICAL TEST:
1. Can YOU re-identify with reasonable effort? → Personal data
2. Can SOMEONE ELSE with access to other data? → Personal data
3. Mathematically impossible to re-identify? → Anonymous
EXAMPLE - ML Training Data:
❌ User ID + purchase history + demographics
→ Personal data (easily identifiable)
❌ Hashed email + behavioral events
→ Personal data (hash can be reversed or linked)
✓ Aggregated conversion rate by age group (N>50 per group)
→ Anonymous (cannot drill down to an individual)
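To make the practical test operational, here is a minimal sketch (Python with pandas; the function name and column names are hypothetical, and the minimum group size mirrors the "N>50 per group" example above) of releasing only aggregates whose groups are too large to single anyone out:

```python
import pandas as pd

MIN_GROUP_SIZE = 50  # mirrors the "N>50 per group" rule of thumb above

def safe_aggregate(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
    """Release an aggregate only for groups large enough that no individual
    can be singled out -- a simple k-anonymity-style gate."""
    grouped = df.groupby(group_col)[value_col].agg(["mean", "count"])
    # Suppress small cohorts entirely instead of publishing them
    return grouped[grouped["count"] >= MIN_GROUP_SIZE].drop(columns="count")

# Usage: conversion rate by age group, with small cohorts suppressed
# stats = safe_aggregate(users, "age_group", "converted")
```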
⚖️ GDPR Compliance for AI and Machine Learning
Legal Bases for Processing
GDPR ARTICLE 6: The 6 Legal Bases
1. CONSENT (Article 6(1)(a))
"Data subject has given consent"
REQUIREMENTS:
- Freely given (no coercion)
- Specific (for this purpose)
- Informed (know what happens)
- Unambiguous (clear affirmative action)
- Withdrawable (easy opt-out)
ML USE CASE:
❌ Hard to obtain consent for training data at scale
✓ Can obtain consent from users for personalization
2. CONTRACT (Article 6(1)(b))
"Processing necessary for contract with data subject"
EXAMPLE:
✓ Process customer order data to fulfill purchase
✓ Use purchase history to recommend products (if part of service)
❌ Training ML on non-customer data
3. LEGAL OBLIGATION (Article 6(1)(c))
"Processing necessary to comply with legal obligation"
EXAMPLE:
✓ Fraud detection (AML requirements)
✓ Tax records retention
❌ Most ML use cases don't qualify
4. VITAL INTERESTS (Article 6(1)(d))
"Processing necessary to protect life"
EXAMPLE:
✓ Medical emergency AI
❌ Rare for business contexts
5. PUBLIC TASK (Article 6(1)(e))
"Processing necessary for public interest or official authority"
EXAMPLE:
✓ Government agencies
❌ Private companies usually don't qualify
6. LEGITIMATE INTEREST (Article 6(1)(f)) ⭐ MOST USED FOR ML
"Processing necessary for legitimate interests pursued by
controller, except where overridden by data subject's interests"
BALANCING TEST (3-part):
A) Purpose Test:
- Is this a legitimate interest?
- Legal, specific, clearly articulated?
B) Necessity Test:
- Is processing necessary to achieve purpose?
- No less intrusive alternative?
C) Balancing Test:
- Controller's interest vs data subject's interests/rights
- Reasonable expectations of data subject?
- Vulnerable groups?
- Can data subject easily object?
ML - LEGITIMATE INTEREST EXAMPLES:
✓ Fraud detection ML
✓ Product recommendation engines
✓ Customer service chatbots
✓ Predictive maintenance
❌ Intrusive profiling for discrimination
❌ Surveillance of individuals
❌ Unexpected uses
SAFEGUARDS REQUIRED:
- Transparency (privacy policy explains)
- Data minimization (only necessary fields)
- Right to object (users can opt-out)
- DPIA if high-risk
Legitimate Interest Assessment (LIA) for ML
TEMPLATE FOR MACHINE LEARNING:
## LEGITIMATE INTEREST ASSESSMENT - Churn Prediction Model
### PURPOSE TEST
**What is our legitimate interest?**
Predicting customer churn to enable proactive retention efforts,
improving customer satisfaction and reducing involuntary
customer loss.
**Is it legitimate?**
✓ Legal: Predictive analytics is lawful
✓ Specific: Limited to churn risk scoring
✓ Clearly articulated: Documented in privacy policy
### NECESSITY TEST
**Is processing necessary?**
Yes. Alternatives considered:
- Manual analysis: Not scalable (100K+ customers)
- Survey-based: Response rate <5%, not predictive
- Reactive only: Customers already churned, too late
**Data minimized?**
✓ Collect: Usage patterns, support tickets, billing history
✗ Don't collect: Personal communications, social media
✓ Retention: 24 months (churn patterns observable)
### BALANCING TEST
**Our interest:**
- Reduce churn (high - customer satisfaction + revenue)
- Proactive support (medium-high - better service)
- Resource optimization (medium)
**Data subject's interests/rights:**
- Expectation: Using service data for service improvement
- Risk: Low (no sensitive data, no automated adverse decisions)
- Vulnerable groups: None identified
- Easy opt-out: Yes (can request exclusion from model)
**Outcome:**
Our interest in customer satisfaction and reducing involuntary
churn outweighs minimal privacy impact of analyzing service
usage patterns.
**SAFEGUARDS:**
1. Transparency: Privacy policy explains predictive models
2. No automated decisions: Predictions trigger HUMAN outreach
3. Right to object: Users can opt-out of profiling
4. Retention: 24-month max, then delete
5. Security: Encrypted storage, role-based access
6. Regular audits: Quarterly fairness/bias review
Data Minimization in ML
GDPR ARTICLE 5(1)(c):
"Personal data shall be adequate, relevant and limited
to what is necessary (data minimisation)"
ML TRAINING - BAD PRACTICE:
❌ Collect everything "just in case":
- All user profile fields (even irrelevant)
- Historical data (10 years back when 2 years sufficient)
- Sensitive attributes (race, religion) as features
- Third-party enrichment data not needed for model
WHY BAD:
- Unnecessary risk exposure
- GDPR violation (excess collection)
- Bias introduction (sensitive attributes)
- Higher DPIA risk rating
ML TRAINING - GOOD PRACTICE:
✓ Collect only features that improve model:
- Feature selection: Remove low-importance features
- Temporal scope: Only data from relevant timeframe
- No sensitive attributes: Race, religion, health (unless justified)
- Aggregate where possible: "Number of purchases" not "each purchase detail"
✓ Retention aligned with purpose:
- Training: Keep until model trained + validated
- Inference: Don't store if not needed for improvement
- Delete training data: After model deployed (if not needed for retraining)
✓ Example - Churn Model:
FEATURES USED:
- Last login date (proxy for engagement)
- Support ticket count (service issues)
- Payment history (billing problems)
- Feature usage (product fit)
FEATURES EXCLUDED:
- Name, email (not predictive)
- Full communication logs (minimization)
- Demographics (not necessary, potential bias)
- Social media data (disproportionate)
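A minimal sketch of how feature selection can enforce minimization in practice (Python with scikit-learn; the model choice, `top_k=12`, and the dropped columns are illustrative assumptions, not prescriptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_minimal_features(X: pd.DataFrame, y, top_k: int = 12) -> list:
    """Fit once on all candidate features, then keep only the top_k by
    importance -- everything else is excess collection under Art 5(1)(c)."""
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.nlargest(top_k).index.tolist()

# Identifiers and sensitive attributes are dropped BEFORE selection,
# so they can never be "rescued" by high importance:
# X = X.drop(columns=["name", "email", "gender"], errors="ignore")
```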
Retention Limits for ML Systems
GDPR ARTICLE 5(1)(e):
"Kept in a form which permits identification of data subjects
for no longer than necessary (storage limitation)"
RETENTION POLICY TEMPLATE FOR ML:
## DATA RETENTION POLICY - ML Systems
### TRAINING DATA
**Purpose:** Train churn prediction model
**Retention:** 24 months from collection
**Rationale:**
- Churn patterns observable over 18-24 months
- Model retraining: Quarterly (requires historical data)
- No value in data >24 months (user behavior changes)
**After 24 months:**
OPTION A (Pseudonymize):
- Remove direct identifiers (name, email, user ID)
- Keep: Feature values for analysis
- Use: Aggregate statistical analysis only
OPTION B (Delete):
- Purge all individual records
- Retain: Model weights (no personal data)
**Exception - Active Users:**
If user still active customer:
- Retain until account closure + 90 days
- Then apply standard deletion
### MODEL WEIGHTS
**Purpose:** Inference (churn predictions)
**Retention:** Until model retired/replaced
**Rationale:** Model weights don't contain identifiable personal data
**Legal basis:** Not personal data (aggregate statistical patterns)
### PREDICTIONS/SCORES
**Purpose:** Customer retention actions
**Retention:** 90 days from generation
**Rationale:**
- Retention efforts occur within 30 days
- 90-day window for measurement
- Older predictions not actionable
**After 90 days:**
- Delete individual predictions
- Aggregate: "Month X: 8% predicted churn rate"
### AUDIT LOGS
**Purpose:** Compliance, security, debugging
**Retention:** 2 years
**Rationale:** Regulatory requirement, incident investigation
### REVIEW SCHEDULE
- Annual: Review necessity of each retention period
- Triggered: If purpose changes, regulations update
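Retention periods only work if something enforces them. A minimal sketch of a scheduled cleanup job (Python with sqlite3 as a stand-in datastore; the table names, `created_at` column, and periods follow the policy above but are otherwise hypothetical):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = {                                  # from the policy above
    "training_data": timedelta(days=730),      # 24 months
    "predictions":   timedelta(days=90),       # 90 days
}

def enforce_retention(conn: sqlite3.Connection) -> None:
    """Run daily (e.g., via cron): delete every record older than its
    table's retention period so storage limitation is automatic."""
    now = datetime.now(timezone.utc)
    for table, period in RETENTION.items():
        cutoff = (now - period).isoformat()
        conn.execute(f"DELETE FROM {table} WHERE created_at < ?", (cutoff,))
    conn.commit()
```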
User Rights Under GDPR
THE 8 MAIN RIGHTS IN AN ML CONTEXT:
1. RIGHT TO BE INFORMED (Articles 13-14)
- Privacy policy must explain:
* ML models used (purpose, logic)
* Data sources
* Automated decision-making (if any)
* Retention periods
ML-SPECIFIC DISCLOSURE:
"We use machine learning to predict customer churn.
This involves analyzing your usage patterns, support
history, and billing data. Predictions trigger our team
to reach out with personalized support. You can opt out
of this profiling at any time."
2. RIGHT OF ACCESS (Article 15)
- User can request: "What data do you have on me?"
- Must provide:
* Copy of data used for ML
* Information about automated decision-making
* Explanation of logic involved
EXAMPLE RESPONSE:
"Your data in our churn model includes:
- Last login: 2025-01-15
- Support tickets: 3 in last 90 days
- Payment method: Credit card
- Churn risk score: 0.23 (low risk)
- Logic: Model considers engagement, support, billing"
3. RIGHT TO RECTIFICATION (Article 16)
- User can request correction of inaccurate data
- Must update in training data AND retrain model if material
ML CHALLENGE:
- Training data already used → Can't "unlearn"
- Solution: Update for future retraining, note correction
4. RIGHT TO ERASURE / "Right to be Forgotten" (Article 17)
- User can request deletion if:
* No longer necessary for purpose
* Withdraws consent (if consent was basis)
* Objects and no overriding legitimate grounds
ML IMPLEMENTATION:
1. Remove from future training datasets
2. If feasible: Retrain model without user's data
3. If not feasible: Document why (disproportionate effort)
4. Next scheduled retraining: Exclude user
EXCEPTION:
- If retention required by law
- If anonymized (no longer personal data)
5. RIGHT TO RESTRICT PROCESSING (Article 18)
- User can request pause while disputing accuracy/legality
- ML: Exclude from predictions, mark for exclusion in retraining
6. RIGHT TO DATA PORTABILITY (Article 20)
- User can request data in machine-readable format
- Can transmit to another controller
ML CONTEXT:
- Provide: Training data about user (CSV, JSON)
- Don't provide: Model weights (not user's data)
7. RIGHT TO OBJECT (Article 21) ⭐ CRITICAL FOR ML
- User can object to processing based on legitimate interest
- Controller must stop UNLESS compelling legitimate grounds
ML IMPLEMENTATION:
"Opt-out of Churn Prediction" (see the sketch after this list of rights):
- Add user to exclusion list
- Remove from future training datasets
- Stop generating predictions for user
- Confirm within 7 days
8. RIGHTS RELATED TO AUTOMATED DECISION-MAKING (Article 22)
- User can object to purely automated decisions with legal/
significant effect (no human review)
- Right to human review, explanation, contest decision
ML IMPLICATION:
✓ Churn prediction → Human decides outreach (OK)
✓ Fraud detection → Flags for human review (OK)
❌ Loan denial → Purely automated decision (NOT OK without safeguards)
❌ Hiring rejection → Automated screening (NOT OK without review)
SAFEGUARDS REQUIRED:
- Right to human intervention
- Right to express point of view
- Right to contest decision
- Explanation of logic
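A minimal sketch of honoring objection and erasure requests in an ML pipeline (Python with sqlite3 as a stand-in; table, column, and helper names like `load_exclusion_list` are hypothetical). The key idea is a persistent exclusion list that both training and inference jobs consult:

```python
import sqlite3
from datetime import datetime, timezone

def handle_objection(conn: sqlite3.Connection, customer_id: str) -> None:
    """Article 21 objection (and steps 1-2 of erasure): stop predictions
    now, and exclude the customer from every future training run."""
    now = datetime.now(timezone.utc).isoformat()
    # 1. Persistent exclusion list, consulted by training and inference
    conn.execute(
        "INSERT OR IGNORE INTO profiling_exclusions (customer_id, requested_at) "
        "VALUES (?, ?)",
        (customer_id, now),
    )
    # 2. Delete existing predictions immediately
    conn.execute("DELETE FROM predictions WHERE customer_id = ?", (customer_id,))
    conn.commit()

# Training pipelines then filter against the list before every run:
# df = df[~df["customer_id"].isin(load_exclusion_list(conn))]
```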
RESPONDING TO DATA SUBJECT REQUESTS (ML Context):
TIMELINE:
- 1 month from receipt (free of charge)
- Extendable to 3 months if complex (must notify user why)
VERIFICATION:
- Must verify identity (prevent fraudulent requests)
- Cannot ask for excessive info
ML-SPECIFIC CHALLENGES:
CHALLENGE 1: "Unlearning" from trained models
- Current state: Most ML models can't selectively forget
- Workaround:
* Remove from training data
* Next retraining cycle: Exclude user
* Document when model will be retrained
CHALLENGE 2: Explaining ML decisions
- Black-box models (deep neural nets) are hard to explain
- Solutions:
* Use interpretable models where possible
* SHAP values, LIME for explanation (see the sketch below)
* Document general logic (not the exact algorithm)
CHALLENGE 3: Data portability for ML features
- Engineered features may not be useful elsewhere
- Provide:
* Raw data collected
* Derived features (with explanation)
* Format: CSV, JSON
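For Challenge 2, a minimal sketch of extracting per-user "top contributing factors" with SHAP (Python; assumes the `shap` package and a fitted scikit-learn-style model, and the function and parameter names are illustrative):

```python
import pandas as pd
import shap  # model-agnostic Shapley-value explanations

def top_factors(model, background: pd.DataFrame, row: pd.DataFrame, k: int = 3) -> list:
    """Return the k features contributing most to one customer's score --
    usable in Article 15 access responses and Article 22 explanations."""
    explainer = shap.Explainer(model.predict, background)
    contributions = explainer(row).values[0]   # one row in -> one vector out
    ranked = pd.Series(contributions, index=row.columns).abs().nlargest(k)
    return ranked.index.tolist()
```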
🛡️ Data Protection Impact Assessment (DPIA) for ML
When Is a DPIA Mandatory?
GDPR ARTICLE 35:
DPIA REQUIRED IF:
✓ Systematic and extensive profiling with significant effects
✓ Large-scale processing of special categories data
(health, race, religion, political, biometric, genetic)
✓ Systematic monitoring of publicly accessible area at large scale
✓ New technologies with high risk
ML - ALMOST ALWAYS TRIGGERS DPIA:
✓ Profiling >10,000 users
✓ Automated decision-making affecting legal rights
✓ Using sensitive data (health, biometric)
✓ Innovative ML techniques (untested risk)
✓ Large-scale data processing (Big Data + ML)
DPIA Template for Machine Learning
STEP 1: DESCRIBE THE PROCESSING
## DPIA: Customer Churn Prediction Model
### 1.1 Processing Description
**What:** Train ML model to predict customer churn probability
**Data collected:**
INPUT DATA (Training):
- Customer ID (pseudonym)
- Account creation date
- Last login timestamp
- Feature usage (counts by feature)
- Support ticket count, categories
- Billing history (amounts, payment method, late payments)
- Subscription tier
- Geographic region (country level)
OUTPUT DATA (Predictions):
- Customer ID
- Churn probability score (0-1)
- Top 3 contributing factors
- Prediction timestamp
**NOT collected:**
- Payment card details (PCI scope, not needed)
- Communications content (privacy, not predictive)
- Social media activity (disproportionate)
- Personal identifiers beyond customer ID
### 1.2 Purpose
- Predict which customers are likely to churn in the next 90 days
- Enable proactive retention outreach (human-led)
- Improve customer satisfaction (address issues early)
### 1.3 Legal Basis
- Legitimate interest (GDPR Art 6(1)(f))
- LIA completed (see Appendix A)
- Balancing test: Our interest (retention, satisfaction) > User privacy impact (low-risk profiling)
### 1.4 Scale
- Training dataset: 200,000 customers (24 months history)
- Inference: 50,000 active customers (monthly predictions)
- Retention: Training data 24 months, predictions 90 days
- Geographic: EU (80%), US (15%), LATAM (5%)
### 1.5 Automated Decision-Making
- NO purely automated decisions with legal effect
- Predictions trigger HUMAN review and outreach
- Customers can decline retention offers
STEP 2: NECESSITY & PROPORTIONALITY
### 2.1 Is Processing Necessary?
**Yes** - Alternatives considered:
ALTERNATIVE 1: Manual churn prediction
- Feasibility: Not scalable (50K customers)
- Accuracy: Subjective, inconsistent
- Result: Insufficient
ALTERNATIVE 2: Survey-based prediction
- Feasibility: <5% response rate
- Accuracy: Self-reported, not predictive
- Result: Insufficient
ALTERNATIVE 3: React after churn
- Feasibility: Yes
- Accuracy: N/A
- Result: Customer already lost, lower re-engagement success
CONCLUSION: ML is necessary for proactive, scalable churn prevention
### 2.2 Data Minimization Applied?
✓ Only 12 features used (from 100+ available)
- Selected via feature importance analysis
- Removed low-value features
✓ No sensitive categories
- Excluded: Demographics (age, gender) - not necessary
- Excluded: Location (beyond country) - privacy
- Excluded: Communication content - disproportionate
✓ Temporal limitation
- 24-month window (not entire customer history)
- Churn patterns not observable beyond 24 months
✓ Aggregation where possible
- "Support ticket count" not "full ticket text"
- "Login frequency" not "every login timestamp"
### 2.3 Proportionality Assessment
**Risk to customers:** LOW
- No automated adverse decisions (human-in-loop)
- No sensitive data processed
- Predictions benefit customer (proactive support)
- Easy opt-out available
**Benefit to organization:** MEDIUM-HIGH
- Reduce involuntary churn (-15% projected)
- Improve customer satisfaction (+12% projected)
- Revenue protection ($2.5M annual value)
**Conclusion:** Benefits significantly outweigh minimal risks
STEP 3: RISKS TO DATA SUBJECTS
### 3.1 Identified Risks
┌──────────────────────────┬─────────────┬───────────┬──────────┐
│ Risk │ Likelihood │ Severity │ Overall │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Unauthorized access to │ LOW │ MEDIUM │ LOW │
│ predictions (breach) │ (encrypted) │ │ │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Model bias against │ MEDIUM │ MEDIUM │ MEDIUM │
│ certain customer groups │ │ │ │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ False positive (predict │ MEDIUM │ LOW │ LOW │
│ churn for stable customer│ (expected) │ (human │ │
│ leading to unnecessary │ │ review) │ │
│ outreach) │ │ │ │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Profiling violates user │ LOW │ LOW │ LOW │
│ expectations │ (disclosed) │ │ │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Training data retention │ LOW │ LOW │ LOW │
│ beyond necessary period │ (auto- │ │ │
│ │ delete) │ │ │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Model drift causes poor │ MEDIUM │ MEDIUM │ MEDIUM │
│ predictions over time │ │ │ │
└──────────────────────────┴─────────────┴───────────┴──────────┘
### 3.2 Risk Detail - Model Bias
**Scenario:**
Model learns that customers in certain regions churn more
frequently. May lead to:
- Disproportionate retention efforts by region
- Self-fulfilling prophecy (less investment → more churn)
- Perceived discrimination
**Likelihood:** MEDIUM (common ML pitfall)
**Severity:** MEDIUM (business impact, potential discrimination)
**Current Mitigation:** None
**Overall Risk:** MEDIUM-HIGH ⚠️
### 3.3 Risk Detail - Model Drift
**Scenario:**
Customer behavior changes (market conditions, competitors, product updates)
but model not retrained. Predictions become inaccurate.
**Impact:**
- Wasted retention efforts (false positives)
- Missed at-risk customers (false negatives)
- Customer frustration (irrelevant outreach)
**Likelihood:** MEDIUM (expected over time)
**Severity:** MEDIUM (business cost, customer experience)
**Current Mitigation:** Quarterly retraining scheduled
**Overall Risk:** LOW-MEDIUM ✓
STEP 4: MEASURES TO ADDRESS RISKS
### 4.1 Technical Safeguards
✓ ENCRYPTION
- Training data: AES-256 at rest
- Predictions: Encrypted database
- Transmission: TLS 1.3
✓ ACCESS CONTROLS
- Training data: 3 ML engineers only
- Predictions: Customer success team (12 people)
- MFA required for all
- Activity logging (audit trail)
✓ DATA MINIMIZATION (Automated)
- Feature selection: Top 12 by importance
- Auto-delete: Training data >24 months
- Auto-delete: Predictions >90 days
✓ MODEL MONITORING
- Weekly: Accuracy metrics
- Monthly: Fairness metrics (by region, tier)
- Quarterly: Model drift detection
- Alerts: If accuracy <75% or bias detected
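The drift detection in the monitoring plan above can start as simply as a Population Stability Index (PSI) per feature. A minimal sketch (Python with NumPy; the ~0.2 alert threshold is a common rule of thumb, not a regulatory number):

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """Compare a feature's training distribution (expected) against live
    traffic (actual); PSI above ~0.2 is commonly treated as significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)   # avoid log(0) for empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# if population_stability_index(train_col, live_col) > 0.2: alert the ML team
```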
### 4.2 Organizational Safeguards
✓ GOVERNANCE
- ML Ethics Board: Reviews model quarterly
- Data Protection Officer: Approves DPIA
- Legal review: Completed (see sign-off)
✓ TRAINING
- ML team: Bias & fairness training (annual)
- Customer success: How to use predictions ethically
- All: GDPR awareness training
✓ POLICIES
- ML Model Development Policy
- Data Retention Policy (enforced)
- Incident Response Plan (tested)
✓ VENDOR MANAGEMENT
- DPA with cloud provider (AWS)
- EU data residency (Ireland region)
- GDPR compliance certification verified
### 4.3 Specific Mitigation - Model Bias
**ENHANCED FAIRNESS MEASURES:**
1. BIAS AUDIT (Pre-Deployment)
- Stratify test set by: Region, subscription tier, tenure
- Measure: Accuracy, precision, recall per group
- Threshold: No group <5% worse than best group (see the sketch after this list)
- If fail: Retrain with balanced sampling
2. ONGOING MONITORING (Post-Deployment)
- Monthly: Churn prediction rate by region
- Alert: If >20% difference between regions
- Investigation: Root cause (true difference vs model bias)
- Action: Retrain with fairness constraints if bias confirmed
3. HUMAN REVIEW
- Customer success reviews ALL high-risk predictions
- Can override if prediction seems unfair/unusual
- Feedback loop: Overrides inform model improvement
4. TRANSPARENCY
- Privacy policy discloses profiling
- Customers can request their churn score + explanation
- Opt-out: Easy process, confirmed within 7 days
**REDUCES RISK:**
- Likelihood: MEDIUM → LOW (bias audit catches issues)
- Severity: MEDIUM (unchanged - still potential business impact)
- Overall: MEDIUM-HIGH → LOW-MEDIUM ✓
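A minimal sketch of the pre-deployment bias audit described in step 1 above (Python with pandas and scikit-learn; assumes a test DataFrame with `y_true`/`y_pred` columns, and the column names are hypothetical while the 5% gap comes from the plan):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

def fairness_audit(df: pd.DataFrame, group_col: str = "region",
                   max_gap: float = 0.05) -> pd.DataFrame:
    """Stratify the test set and flag any group whose accuracy trails the
    best-performing group by more than max_gap (the 5% threshold above)."""
    rows = []
    for group, part in df.groupby(group_col):
        rows.append({
            "group": group,
            "n": len(part),
            "accuracy": accuracy_score(part["y_true"], part["y_pred"]),
            "recall": recall_score(part["y_true"], part["y_pred"]),
        })
    report = pd.DataFrame(rows)
    report["fails"] = report["accuracy"] < report["accuracy"].max() - max_gap
    return report  # any True in "fails" -> retrain with balanced sampling
```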
### 4.4 User Rights Facilitation
✓ RIGHT TO BE INFORMED
- Privacy policy updated (plain language)
- Explains: ML model, purpose, data used, retention
- Accessible: Website, account settings, email on request
✓ RIGHT TO ACCESS
- Users can request their data via support ticket
- Response: 30 days (GDPR compliant)
- Includes: Features used, churn score, explanation
✓ RIGHT TO OBJECT / OPT-OUT
- Self-service: Account settings toggle
- Email: privacy@company.com
- Action: Remove from future predictions within 7 days
- Confirmation: Email sent to user
✓ RIGHT TO ERASURE
- Request: Via support ticket
- Action:
1. Remove from training dataset (immediate)
2. Delete predictions (immediate)
3. Exclude from future retraining
- Timeline: 7 days
- Confirmation: Email with deletion confirmation
STEP 5: SIGN-OFF & REVIEW
### 5.1 Residual Risk Assessment
AFTER MITIGATIONS:
┌──────────────────────────┬─────────────┬───────────┬──────────┐
│ Risk │ Likelihood │ Severity │ Residual │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Unauthorized access │ VERY LOW │ MEDIUM │ LOW │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Model bias │ LOW │ MEDIUM │ LOW │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ Model drift │ LOW │ MEDIUM │ LOW │
├──────────────────────────┼─────────────┼───────────┼──────────┤
│ All others │ LOW/VERY LOW│ LOW │ LOW │
└──────────────────────────┴─────────────┴───────────┴──────────┘
### 5.2 Outcome
☑ Processing can proceed
☐ Processing should not proceed
☐ Further consultation required (with DPA)
**Rationale:**
All residual risks are LOW or acceptable. Safeguards are robust and
regularly monitored. Legitimate interest clearly justified. Benefits
(customer satisfaction, retention) significantly outweigh minimal risks.
Comprehensive user rights protections in place.
### 5.3 Approvals
Data Protection Officer: ________________ Date: __________
Legal Counsel: __________________________ Date: __________
ML Ethics Board Chair: __________________ Date: __________
Project Owner (VP Customer Success): ____ Date: __________
### 5.4 Review Schedule
- **Next review:** January 2026 (annual)
- **Triggered review if:**
* Significant change in processing (new data sources, purposes, scale)
* Data breach involving ML system
* Regulatory guidance change (EDPB, ICO, etc.)
* User complaints >5/month about profiling
* Model accuracy <75% or bias detected
* New ML technique deployed
### 5.5 Version History
- v1.0 (2025-01-15): Initial DPIA
- v1.1 (2025-04-01): Added fairness monitoring (quarterly review)
- [Future versions logged here]
🚨 Handling Data Breaches in ML Systems
What Counts as a "Breach" in an ML Context?
GDPR ARTICLE 4(12):
"A breach of security leading to accidental or unlawful
destruction, loss, alteration, unauthorised disclosure of,
or access to, personal data"
ML-SPECIFIC BREACH EXAMPLES:
✓ Unauthorized access to training dataset (customer data)
✓ Model weights leaked (may reveal training data via model inversion)
✓ Prediction API exploited (unauthorized churn scores accessed)
✓ Training data accidentally published (S3 bucket misconfigured)
✓ Model memorizes training data (GPT reveals personal info in outputs)
✓ Insider downloads customer features for competitor
ML-Specific Breach: Model Inversion Attack
WHAT IS IT:
ATTACK:
Adversary queries ML model repeatedly with crafted inputs
to reverse-engineer training data.
EXAMPLE:
- Face recognition model trained on celebrity photos
- Attacker queries model with slight variations
- Reconstructs original training images (faces)
- Privacy breach: Training data (photos) revealed
REAL CASE:
Fredrikson et al. (2015): Reconstructed faces from face
recognition model with 95% accuracy.
IS THIS A GDPR BREACH?
YES, if:
✓ Training data contains personal data (photos with faces)
✓ Reconstruction reveals identifiable information
✓ No adequate safeguards (differential privacy, etc.)
NOTIFICATION REQUIRED:
- To DPA: Within 72 hours (high risk to individuals)
- To users: If high risk (identity theft, privacy violation)
Incident Response Protocol (ML Context)
PHASE 1: DETECTION & CONTAINMENT (0-24 hours)
### DETECTION - ML-Specific Indicators
TRAINING DATA BREACH:
- Unusual database queries (large exports)
- Access from unauthorized IP
- Download of entire training dataset
- Alert: Data exfiltration >1GB
MODEL WEIGHTS BREACH:
- Unauthorized model file download
- S3 bucket access logs show external IP
- Model weights published on GitHub (alert via search)
PREDICTION API ABUSE:
- Spike in API calls (100x normal)
- Systematic queries (iterating through customer IDs)
- Unusual query patterns (model inversion attempt)
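A minimal sketch of detecting prediction-API abuse like the patterns above (pure Python; the window length, default baseline, and spike factor are illustrative assumptions):

```python
from collections import deque
from time import time

class RateAnomalyDetector:
    """Flag clients whose request rate jumps far above their baseline --
    a possible sign of scraping or model-inversion probing."""

    def __init__(self, window_s: int = 3600, spike_factor: float = 100.0):
        self.window_s = window_s            # sliding window (1 hour)
        self.spike_factor = spike_factor    # the "100x normal" from above
        self.calls = {}                     # client_id -> deque of timestamps
        self.baseline = {}                  # client_id -> normal calls/window

    def record(self, client_id: str) -> bool:
        """Record one API call; return True if the client should be flagged."""
        now = time()
        q = self.calls.setdefault(client_id, deque())
        q.append(now)
        while q and q[0] < now - self.window_s:
            q.popleft()                     # drop calls outside the window
        normal = self.baseline.get(client_id, 10.0)  # assumed default baseline
        return len(q) > normal * self.spike_factor
```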
### IMMEDIATE CONTAINMENT
ACTIONS (within 4 hours):
1. ISOLATE AFFECTED SYSTEMS
- Disable compromised API keys
- Revoke database access for compromised accounts
- Firewall rules: Block malicious IPs
2. PRESERVE EVIDENCE
- Snapshot: Logs, database state, model files
- Do NOT delete attacker artifacts (evidence)
- Chain of custody: Document who accessed what
3. ASSEMBLE RESPONSE TEAM
- ML Engineer (understand model/data)
- Security Lead
- Legal Counsel
- DPO (Data Protection Officer)
- Communications/PR
- Executive sponsor
### INITIAL ASSESSMENT (within 12 hours)
DETERMINE:
- **What data affected?**
* Training data: 200,000 customer records
* Categories: Usage patterns, billing history, support tickets
* Sensitivity: Medium (no payment cards, health data)
* Identifiable: Yes (customer IDs, potentially linkable)
- **How many users?**
* Estimate: 200,000 (entire training set)
* Breakdown: 80% EU, 15% US, 5% LATAM
- **Cause?**
* S3 bucket misconfigured (public read access)
* Duration: 7 days before detection
* Attacker: Unknown (no evidence of download, but possible)
- **Is attack ongoing?**
* NO - Bucket now private
- **Severity?**
* HIGH (large scale, identifiable data, EU residents)
NOTIFICATION OBLIGATIONS TRIGGERED:
✓ DPA notification: YES (within 72 hours)
✓ User notification: Likely YES (high risk - 200K users, identifiable)
PHASE 2: NOTIFICATION (24-72 hours)
### TO DATA PROTECTION AUTHORITY (Lead DPA: Ireland - DPC)
**NOTIFICATION TEMPLATE:**
---
**GDPR ARTICLE 33 NOTIFICATION**
**Date/Time of Breach:** 2025-01-10, 14:30 UTC
**Date/Time Discovered:** 2025-01-17, 09:15 UTC
**Notifying:** 2025-01-19, 16:00 UTC (within 72 hours ✓)
**1. NATURE OF BREACH**
**Categories of data:**
- Customer usage patterns (feature usage counts, login frequency)
- Support history (ticket counts, categories - no full text)
- Billing history (amounts, payment method type - no card details)
- Metadata (customer ID, account creation date, region)
**Approximate numbers:**
- Data subjects affected: 200,000 customers
- Records: 200,000 (one per customer)
**Cause:**
S3 bucket storing ML training dataset was misconfigured with
public read access due to infrastructure-as-code error during
deployment on 2025-01-10.
**2. CONTACT DETAILS**
Data Protection Officer:
Name: [DPO Name]
Email: dpo@company.com
Phone: +353 1 XXX XXXX
**3. LIKELY CONSEQUENCES**
**Risk Assessment:**
- Identity theft: Low (no names, emails in dataset - only IDs)
- Financial fraud: Very Low (no payment card data)
- Discrimination: Low (no sensitive categories)
- Reputational harm: Medium (usage patterns may be embarrassing)
**Mitigating factors:**
- Customer IDs are pseudonyms (not directly identifiable)
- No evidence of actual unauthorized access (logs reviewed)
- Data encrypted at rest (bucket was public but files encrypted)
- Short exposure window (7 days)
**Overall risk:** Medium (precautionary notification)
**4. MEASURES TAKEN**
**Containment:**
- S3 bucket made private (2025-01-17, 09:30 UTC)
- All access logs reviewed (no suspicious downloads detected)
- Revoked all API keys as precaution
- Forensic analysis ongoing
**Mitigation:**
- Will notify affected users (see separate communication plan)
- Offering 12 months credit monitoring (precautionary)
- Implementing additional safeguards:
* Mandatory encryption for all S3 buckets (policy)
* Infrastructure-as-code review (prevent recurrence)
* Quarterly access audits
**5. CROSS-BORDER**
EU countries affected: All (customers across EU)
Lead DPA: Ireland (DPC) - our EU headquarters
Other DPAs: Will be notified via DPC cooperation mechanism
---
**ATTACHMENTS:**
- Forensic report (preliminary)
- Access logs
- Remediation plan
- User notification draft
**SUBMITTED VIA:** DPC online portal
### TO DATA SUBJECTS (Users)
**DECISION:**
✓ Notification REQUIRED (precautionary, medium risk)
**METHOD:**
- Email to all 200,000 affected users
- In-app notification
- Website announcement (public transparency)
**EMAIL TEMPLATE:**
---
**Subject:** Important Security Notice: Your Account Data
Dear [Customer Name],
We are writing to inform you of a security issue that may have affected
your account data.
**WHAT HAPPENED**
On January 10, 2025, a configuration error made a database containing
customer usage data temporarily accessible. We discovered this on
January 17 and immediately secured the data. The exposure lasted 7 days.
**WHAT DATA WAS INVOLVED**
The database contained:
- Your account usage patterns (which features you use, how often)
- Support history (number of tickets, general categories)
- Billing history (subscription amounts, payment method type)
The database did NOT contain:
- Your name, email, or contact information
- Payment card details
- Communication content (emails, tickets, chats)
- Passwords or access credentials
**WHAT WE'RE DOING**
- We've secured the database and are investigating
- We've found no evidence the data was actually accessed
- We're implementing additional security measures
- We've notified data protection authorities
**WHAT YOU CAN DO**
While we have no evidence of misuse, as a precaution:
- Monitor your account for unusual activity
- We're offering 12 months of free credit monitoring (details below)
- Contact us if you notice anything suspicious
**WE'RE SORRY**
We deeply apologize for this incident. Protecting your privacy is our
top priority, and we're taking steps to prevent this from happening again.
**QUESTIONS?**
Contact our Data Protection Officer:
Email: dpo@company.com
Phone: +353 1 XXX XXXX
For more information: [Link to FAQ page]
Sincerely,
[CEO Name]
---
**TIMELINE:**
- Drafted: 2025-01-18
- Legal review: 2025-01-19
- DPO approval: 2025-01-19
- Send: 2025-01-20 (simultaneous with DPA notification)
PHASE 3: REMEDIATION (Days 3-30)
### ROOT CAUSE ANALYSIS
**FINDINGS:**
1. **Technical Cause:**
- Infrastructure-as-code (Terraform) template had default
S3 bucket policy: "PublicRead"
- Should have been: "Private"
- No code review caught this (template copy-pasted from internet)
- No automated policy check (public buckets were allowed)
2. **Process Cause:**
- Lack of security review for infrastructure changes
- No automated compliance checks (should block public buckets)
- Insufficient testing (functional tests didn't check access controls)
3. **Detection Delay:**
- No monitoring for public S3 buckets
- Detected only by a manual audit 7 days later
- Should have been: Real-time alert
### REMEDIATION ACTIONS
**IMMEDIATE (Week 1):**
✓ Fix misconfiguration (done)
✓ Audit all S3 buckets (done - 3 others had public read, now fixed)
✓ Revoke all API keys (done - reissued to legitimate users)
✓ Notify DPA and users (done)
**SHORT-TERM (Weeks 2-4):**
✓ Implement AWS S3 Block Public Access (account-wide)
- Prevents any bucket from being public (see the sketch after this list)
- Deployed: 2025-01-20
✓ Automated compliance checks (AWS Config Rules)
- Alert if any bucket becomes public
- Alert if encryption disabled
- Deployed: 2025-01-22
✓ Infrastructure-as-code review process
- Mandatory security review for all Terraform changes
- Automated scans: Checkov, tfsec
- Deployed: 2025-01-25
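A minimal sketch of the account-wide lockdown (Python with boto3, which exposes this as `put_public_access_block` on the S3 Control API; credentials and the account ID are assumed to come from the environment):

```python
import boto3

def lock_down_s3(account_id: str) -> None:
    """Enable S3 Block Public Access for the whole account: no bucket,
    existing or future, can be made public regardless of its own policy."""
    s3control = boto3.client("s3control")
    s3control.put_public_access_block(
        AccountId=account_id,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,        # reject new public ACLs
            "IgnorePublicAcls": True,       # neutralize existing public ACLs
            "BlockPublicPolicy": True,      # reject public bucket policies
            "RestrictPublicBuckets": True,  # cut off existing public access
        },
    )
```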
**LONG-TERM (Months 2-3):**
✓ Encrypt all S3 buckets (at bucket level + object level)
- Even if bucket public, files unreadable
- Target: 2025-03-01
✓ Data minimization review
- Do we need to store training data long-term?
- Pseudonymization: Replace customer IDs with random UUIDs (see the sketch after this list)
- Target: 2025-03-15
✓ Quarterly access audits
- Review all S3 bucket policies
- Review all database access logs
- First audit: 2025-04-01
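A minimal sketch of the planned UUID pseudonymization (pure Python; the record structure and field name are hypothetical). Note that under Recital 26 the result is still personal data while the mapping exists; destroying the mapping is what moves it toward anonymization:

```python
import uuid

def pseudonymize(records: list, id_field: str = "customer_id"):
    """Replace direct identifiers with random UUIDs. Keep the mapping in a
    separate, access-controlled store; destroy it to approach anonymity."""
    mapping = {}
    out = []
    for rec in records:
        original = rec[id_field]
        # Same original ID always maps to the same pseudonym
        pseudo = mapping.setdefault(original, str(uuid.uuid4()))
        out.append({**rec, id_field: pseudo})
    return out, mapping
```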
PHASE 4: POST-INCIDENT REVIEW (Month 2)
### LESSONS LEARNED SESSION
**DATE:** 2025-02-15
**PARTICIPANTS:** ML team, Security, Legal, DPO, Execs
**TIMELINE REVIEW:**
- Jan 10: Misconfiguration deployed
- Jan 17: Discovered (7-day delay ❌)
- Jan 17: Contained (same day ✓)
- Jan 19: DPA notified (within 72hr ✓)
- Jan 20: Users notified (within 24hr of DPA ✓)
**WHAT WENT WELL:**
✓ Fast containment once discovered
✓ Timely notifications (GDPR compliant)
✓ Transparent communication with users
✓ No evidence of actual data theft (luck + encryption)
**WHAT WENT POORLY:**
❌ 7-day detection delay (should be real-time)
❌ Configuration error in first place (process failure)
❌ No automated guardrails (should prevent public buckets)
❌ Code review didn't catch (security not prioritized)
**WHAT WOULD WE DO DIFFERENTLY:**
1. Real-time alerts for security misconfigurations
2. Automated enforcement (block public buckets)
3. Security as priority in code reviews
4. More frequent access audits (monthly not quarterly)
**ACTION ITEMS:**
| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| Deploy S3 Block Public Access | Cloud Eng | ✅ Done | Complete |
| Implement AWS Config Rules | Security | ✅ Done | Complete |
| Security training for ML team | CISO | Feb 28 | In progress |
| Quarterly access audits → Monthly | DPO | Ongoing | Scheduled |
| Encrypt all training datasets | ML Eng | Mar 15 | In progress |
| Pseudonymize customer IDs | ML Eng | Mar 15 | In progress |
### REGULATORY FOLLOW-UP
**DPC (Irish DPA) RESPONSE:**
- Acknowledged notification (2025-01-20)
- Requested additional information:
* Forensic report (provided 2025-01-25)
* Remediation plan (provided 2025-01-25)
* Evidence of encryption (provided 2025-01-25)
- Outcome (2025-03-01):
* No formal investigation opened ✓
* No fine imposed ✓
* Guidance: Improve preventive controls
* Recommendation: Annual security audits
**KEY FACTORS IN FAVORABLE OUTCOME:**
✓ Proactive notification (within 72hr)
✓ Transparent communication
✓ No evidence of harm
✓ Swift remediation
✓ Comprehensive preventive measures
**ESTIMATED COST OF BREACH:**
- Forensic investigation: €20,000
- Credit monitoring (200K users): €150,000
- Legal counsel: €30,000
- Remediation (tech + time): €50,000
- Reputational impact: Unquantified
- **TOTAL:** ~€250,000
**FINE IMPOSED:** €0 (could have been up to €20M under Tier 2)
**LESSON:** Investing in prevention costs far less than a breach
🎯 Key Takeaways
Regulations vary: GDPR (EU/UK), CCPA (California), LGPD (Brazil). Know which apply to you. ML almost always triggers GDPR's extraterritorial scope.
Personal data is broad: Not just name/email. It includes user IDs and behavioral patterns. ML models can reveal training data. True anonymity is hard to achieve.
Legal bases for ML: "Legitimate interest" is the most common. It requires an LIA (balancing test). Consent is hard to obtain for training data at scale. Automated decisions trigger Article 22 safeguards.
A DPIA is almost always needed: Large-scale profiling, automated decisions, or sensitive data = DPIA required. Five-step template: Describe, Justify, Assess Risks, Mitigate, Sign-off.
ML-specific risks: Model inversion, memorization, bias, drift. Safeguards: encryption, access controls, fairness audits, human-in-the-loop, monitoring.
Breaches in ML: Training-data leaks, exposed model weights, API abuse. Notify the DPA within 72 hours, and users if the risk is high. Prevention costs less than remediation.
🔜 Next Steps
In the next lesson we will explore Bias in Sentiment Analysis and NLP: how to detect bias in AI models, run fairness audits, and mitigate demographic, language, and cultural bias in ML systems.
Updated: October 2025 | Reading time: 27 minutes | Level: Intermediate