Stop Hoarding Personal Data in Entity Framework
“We might need it someday.”
That sentence has cost teams more compliance headaches than any technical decision I’ve encountered. It’s the battle cry of lazy schema design, the excuse that turns every user table into a dumping ground for speculative data collection. Development teams hoard every conceivable piece of personal information during initial implementation—birth dates, phone numbers, employment history, marital status—creating sprawling user tables that seem prudent at the time but are actually architectural time bombs.
Three years later, when GDPR deletion requests arrive or ISO 27701 audits roll around, these same teams discover they’re storing data that serves no business purpose whatsoever. And by then, the cost isn’t just regulatory fines. It’s the architectural debt that makes compliant deletion technically impossible.
The Regulatory Reality
ISO 27701 and GDPR both demand something straightforward: collect only what you need, document why you need it, and delete it when asked. None of this is revolutionary. It’s basic data hygiene that somehow became optional in the “move fast and break things” era.
The problem is that “straightforward” becomes “impossible” when your schema was designed by someone who confused data collection with data strategy. Control 7.2.2 requires identifying the specific purpose before collection—not retroactively inventing justifications when auditors show up. Control 7.2.8 limits collection to what’s adequate and necessary for that documented purpose. And Control 7.3.1 requires the ability to actually fulfill deletion requests, which is awkward when your schema makes deletion a referential integrity nightmare.
When auditors examine your database schemas, they ask uncomfortable questions: Why do you store this field? What business process requires it? Can you delete it when requested?
If your answer involves the phrase “the original developer thought,” you’ve already failed. If your answer is “we’ve always collected that,” you’ve failed harder.
The Monolithic User Entity Problem
Here’s the pattern I see repeatedly:
public class User
{
    public int Id { get; set; }
    public string Email { get; set; } = default!;
    public string PasswordHash { get; set; } = default!;

    // "Comprehensive" personal information
    public string FirstName { get; set; } = default!;
    public string LastName { get; set; } = default!;
    public DateTime DateOfBirth { get; set; }
    public string PhoneNumber { get; set; } = default!;
    public string StreetAddress { get; set; } = default!;
    public string City { get; set; } = default!;

    // Employment data "just in case"
    public string EmployerName { get; set; } = default!;
    public string JobTitle { get; set; } = default!;
    public decimal AnnualIncome { get; set; }

    // Demographics "for future analytics"
    public string MaritalStatus { get; set; } = default!;
    public int NumberOfChildren { get; set; }

    public DateTime CreatedAt { get; set; }
    public ICollection<Order> Orders { get; set; } = [];
}
Why does an e-commerce system need marital status? Nobody knows anymore. The field exists because someone thought demographic analysis might be valuable someday. That person left the company two years ago. The analytics feature was never built. But the data collection persists, a monument to speculative thinking that nobody had the courage to question.
The real problem emerges when a customer requests deletion. The User entity has foreign key relationships with Orders. Delete the user? Breaks referential integrity. Keep it? Violates the deletion request. Anonymize it? You’re still retaining fields like AnnualIncome and NumberOfChildren in backups.
You’ve created an architectural deadlock before writing a single DELETE statement. Congratulations—your database schema is now a compliance liability that will cost more to fix than it cost to build.
Purpose-Driven Data Separation
The fix isn’t complex, which makes it all the more frustrating that teams don’t implement it from the start. Separate operational data from personal data:
public class UserAccount
{
    public int Id { get; set; }
    public string Email { get; set; } = default!;
    public string PasswordHash { get; set; } = default!;
    public DateTime CreatedAt { get; set; }
    public DateTime? DeletedAt { get; set; }

    public UserProfile? Profile { get; set; }
    public ICollection<Order> Orders { get; set; } = [];
}

public class UserProfile
{
    public int Id { get; set; }
    public int UserAccountId { get; set; }
    public string? FirstName { get; set; }
    public string? LastName { get; set; }
    public DateTime? DateOfBirth { get; set; }
    public string? ShippingAddress { get; set; }
    public DateTime ConsentGrantedAt { get; set; }
    public string ConsentPurpose { get; set; } = default!;

    public UserAccount UserAccount { get; set; } = default!;
}
Notice what changed: UserAccount contains only what’s necessary for system operation. UserProfile contains optional personal data—every field nullable, every field requiring explicit consent. No employment history. No marital status. No speculative demographics.
The Entity Framework configuration enforces this separation:
public void Configure(EntityTypeBuilder<UserAccount> builder)
{
    builder.HasKey(u => u.Id);
    builder.Property(u => u.Email).IsRequired();
    builder.Property(u => u.PasswordHash).IsRequired();

    builder.HasQueryFilter(u => u.DeletedAt == null);

    builder.HasOne(u => u.Profile)
        .WithOne(p => p.UserAccount)
        .HasForeignKey<UserProfile>(p => p.UserAccountId)
        .OnDelete(DeleteBehavior.Cascade);

    builder.HasMany(u => u.Orders)
        .WithOne(o => o.User)
        .HasForeignKey(o => o.UserId)
        .OnDelete(DeleteBehavior.Restrict);
}
The .HasQueryFilter(u => u.DeletedAt == null) is critical. It automatically excludes soft-deleted accounts from queries, preventing accidental exposure while preserving referential integrity for order history.
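In practice, this means everyday queries never see soft-deleted rows, while the few code paths that legitimately need them (deletion processing, compliance exports) have to opt out explicitly:

// Normal queries: EF Core appends the DeletedAt == null predicate for us.
var activeAccounts = await _context.UserAccounts.ToListAsync();

// Deliberate opt-out, reserved for deletion processing and compliance tooling.
var allAccounts = await _context.UserAccounts
    .IgnoreQueryFilters()
    .ToListAsync();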
Compliant Deletion That Actually Works
With proper separation, deletion requests become tractable:
public async Task ProcessDeletionRequest(int userAccountId)
{
    var account = await _context.UserAccounts
        .Include(u => u.Profile)
        .IgnoreQueryFilters()
        .FirstOrDefaultAsync(u => u.Id == userAccountId);

    if (account is null) return;

    if (account.Profile is not null)
        _context.UserProfiles.Remove(account.Profile);

    account.DeletedAt = DateTime.UtcNow;
    account.Email = $"deleted-{account.Id}@example.invalid";
    account.PasswordHash = string.Empty;

    await _context.SaveChangesAsync();
}
The UserProfile gets hard-deleted with all personal information. The UserAccount gets soft-deleted, maintaining foreign key relationships with orders. Authentication credentials get cleared. Subsequent queries automatically exclude the account due to the query filter.
The customer effectively ceases to exist from an operational perspective while your database maintains consistency for historical records. No architectural gymnastics. No complex anonymization logic. No explaining to auditors why you still have someone’s annual income stored.
Validation Through Integration Tests
Compliance isn’t a one-time configuration. It requires continuous validation. Write integration tests that verify your API endpoints don’t leak unnecessary data:
[Fact]
// The Trait lets the CI workflow below select these tests via --filter.
[Trait("Category", "DataMinimization")]
public async Task GetUser_ReturnsOnlyOperationalFields()
{
    var client = _factory.CreateClient();

    var response = await client.GetAsync("/api/users/me");
    var json = await response.Content.ReadAsStringAsync();
    var userData = JsonDocument.Parse(json);

    Assert.True(userData.RootElement.TryGetProperty("id", out _));
    Assert.True(userData.RootElement.TryGetProperty("email", out _));
    Assert.False(userData.RootElement.TryGetProperty("dateOfBirth", out _));
    Assert.False(userData.RootElement.TryGetProperty("passwordHash", out _));
}

[Fact]
[Trait("Category", "DataMinimization")]
public async Task DeletedUser_IsExcludedFromQueries()
{
    var client = _factory.CreateClient();

    await client.DeleteAsync("/api/users/me");
    var response = await client.GetAsync("/api/users/me");

    Assert.Equal(HttpStatusCode.NotFound, response.StatusCode);
}
Run these in CI. When someone inadvertently exposes additional personal data through a new endpoint, the tests fail before the code reaches production. Compliance violations don’t ship.
You can take this further with a GitHub Actions workflow that runs these tests on every pull request affecting your data layer:
name: Data Minimization Compliance

on:
  pull_request:
    paths:
      - 'src/**/*.cs'

jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '9.0.x'
      - run: dotnet test --filter "Category=DataMinimization"
The tests become guardrails. Someone adds a new property to the API response? The test fails. Someone forgets to exclude a sensitive field from serialization? The test fails. The compliance requirement becomes an engineering constraint that CI enforces automatically.
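A complementary guardrail is to serialize explicit response models instead of entities, so a new property on UserAccount can never reach the wire by accident. A minimal sketch (the UserResponse record and mapping extension are illustrative, not part of the code above):

// Only the fields listed here can ever appear in the API response.
public record UserResponse(int Id, string Email, DateTime CreatedAt);

public static class UserAccountMappings
{
    public static UserResponse ToResponse(this UserAccount account) =>
        new(account.Id, account.Email, account.CreatedAt);
}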
Document Your Purposes
Code alone doesn’t satisfy audit requirements. You need documented business purposes for each field:
public class UserProfile
{
    /// <summary>
    /// Purpose: Personalized communication in order confirmations.
    /// Legal basis: Consent granted during registration.
    /// Retention: Until account deletion or consent withdrawal.
    /// </summary>
    public string? FirstName { get; set; }

    /// <summary>
    /// Purpose: Order fulfillment and delivery.
    /// Legal basis: Contract performance (GDPR Art. 6(1)(b)).
    /// Retention: 90 days after order completion.
    /// </summary>
    public string? ShippingAddress { get; set; }
}
When auditors review your code, they can trace each field to its documented purpose. If you can’t articulate why a field exists, remove it. That’s the entire point.
This documentation serves a dual purpose. First, it satisfies the audit requirement for documented purposes. Second, it forces developers to think before adding fields. When you have to write a justification in the XML docs, you’re far less likely to add MaritalStatus just because someone mentioned demographics in a planning meeting.
Collect Data When It Becomes Necessary
I’ve reviewed applications where shipping addresses are required during registration—before a customer has placed any orders. This is compliance theater: collecting data you can’t justify because the registration form felt incomplete without it. It violates the principle that data collection must be necessary at the time of collection, and it tells me the team never actually thought about why they were collecting what they were collecting.
The timing matters. Collecting a shipping address during registration for a user who might never place an order means you’re storing personal data without a current legitimate purpose. When that user requests deletion three months later without having purchased anything, you’ve been storing their address for no reason.
Collect data in context. If the shipping feature doesn’t exist yet, don’t collect addresses speculatively. When checkout happens, prompt for the address. When a feature launches, update the flow to collect newly necessary information with proper consent. Progressive data collection aligns with how users actually interact with your application—they provide information when it becomes relevant, not upfront in a registration form that asks for everything.
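As a rough sketch of what that looks like with the entities above (the method name and consent wording here are illustrative), the address is captured at checkout and the consent moment is recorded alongside it:

public async Task SaveShippingAddressAsync(int userAccountId, string shippingAddress)
{
    // The profile may not exist yet: nothing personal was collected at registration.
    var profile = await _context.UserProfiles
        .FirstOrDefaultAsync(p => p.UserAccountId == userAccountId);

    if (profile is null)
    {
        profile = new UserProfile { UserAccountId = userAccountId };
        _context.UserProfiles.Add(profile);
    }

    // Record the data and the consent that justifies it at the same moment.
    profile.ShippingAddress = shippingAddress;
    profile.ConsentGrantedAt = DateTime.UtcNow;
    profile.ConsentPurpose = "Order fulfillment and delivery";

    await _context.SaveChangesAsync();
}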
The Backup Problem Nobody Talks About
Even with proper separation and deletion logic, there’s a compliance trap that catches most teams: database backups.
When you soft-delete a user account and hard-delete their profile, the data is gone from your live database. But what about last night’s backup? Last week’s? The monthly snapshot from six months ago? That profile data still exists, sitting in backup storage, violating the deletion request.
Your retention policy for backups needs to align with your deletion obligations. Some options:
- Encryption with user-specific keys: If you encrypt personal data with a key derived from the user’s account, deleting that key makes the backup data unreadable (a sketch follows this list).
- Backup rotation aligned with retention: If your stated retention period is 90 days, your backup rotation should match.
- Selective restore procedures: Document that restored backups will have deletion requests re-applied before going live.
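The first option, often called crypto-shredding, can be sketched roughly as follows. The UserEncryptionKey table and FieldCrypto helper are hypothetical names, and production code needs real key management around this:

using System.Security.Cryptography;
using System.Text;

// Hypothetical per-user key table. Hard-deleting a row here renders every
// ciphertext encrypted under that key unreadable, including copies in backups.
public class UserEncryptionKey
{
    public int UserAccountId { get; set; }
    public byte[] Key { get; set; } = default!;   // 256-bit random key per user
}

public static class FieldCrypto
{
    // AES-GCM; the output layout is nonce (12 bytes) + tag (16 bytes) + ciphertext.
    public static byte[] Encrypt(byte[] key, string plaintext)
    {
        var nonce = RandomNumberGenerator.GetBytes(12);
        var plainBytes = Encoding.UTF8.GetBytes(plaintext);
        var cipher = new byte[plainBytes.Length];
        var tag = new byte[16];

        using var aes = new AesGcm(key, tagSizeInBytes: 16);
        aes.Encrypt(nonce, plainBytes, cipher, tag);

        return [.. nonce, .. tag, .. cipher];
    }
}

Deleting the user’s row in UserEncryptionKey at the same time as their profile completes the shred: the ciphertext in last month’s backup still exists, but nobody can read it.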
None of these are simple. But ignoring the backup problem doesn’t make it go away—it just means you’re lying to customers when you confirm their data has been deleted. And that’s exactly what auditors will call it: a lie backed by technical negligence.
The Real Cost of Data Hoarding
Every unnecessary field in your database represents breach exposure, regulatory risk, and development friction. It’s technical debt that accrues interest in the form of compliance emergencies. Teams waste hours crafting complex anonymization queries for data that shouldn’t exist. During audits, they scramble to justify fields they’ve forgotten the purpose of—inventing post-hoc rationalizations for decisions made years ago by people who didn’t consider the consequences.
The breach exposure angle deserves emphasis. When your database contains only operational essentials and purpose-justified personal data, a breach is bad but bounded. When your database contains speculative demographics, employment history, and family information, a breach becomes catastrophic. The attacker gets everything. The notification requirements expand. The regulatory scrutiny intensifies. The headlines get worse.
Data you don’t collect can’t be breached. That’s the simplest security control in existence, and it’s also a compliance requirement.
The separated architecture I’ve shown costs nothing additional to implement initially. It saves thousands in compliance remediation later. More importantly, it makes the honest answer to audit questions actually honest: “We collect what we need, we documented why, and we can delete it when asked.”
Making This Part of Your Process
Data minimization works when it’s embedded in how you build software, not bolted on during audit preparation. A few practices that help:
Schema reviews: Treat entity model changes like code reviews. When someone adds a property, the reviewer asks: What’s the documented purpose? Is it nullable? When is it collected? How is it deleted?
Architecture decision records: Document why you chose to collect specific data. When someone asks in two years why DateOfBirth exists, the ADR explains it’s for age verification on restricted products—not because someone thought demographics might be interesting.
Deletion dry runs: Periodically test your deletion logic against production-like data. Does it complete without errors? Does the query filter exclude deleted accounts? Can you still query order history for a deleted user’s past purchases? (A test sketch follows this list.)
Periodic field audits: Once a quarter, export your entity models and review each property. Is it still used? Does the original purpose still apply? Has the feature it supported been deprecated? Fields that no longer serve a purpose should be removed, not retained indefinitely.
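A dry-run test might look roughly like this, assuming a seeded account with at least one order; the SeedUserWithOrderAsync helper and _deletionService field are illustrative:

[Fact]
[Trait("Category", "DataMinimization")]
public async Task DeletionRequest_RemovesProfile_KeepsOrderHistory()
{
    // Illustrative seeding helper: creates an account, a profile, and one order.
    var userId = await SeedUserWithOrderAsync();

    await _deletionService.ProcessDeletionRequest(userId);

    // The global query filter hides the soft-deleted account from normal queries...
    Assert.Null(await _context.UserAccounts.FirstOrDefaultAsync(u => u.Id == userId));

    // ...the personal profile is hard-deleted...
    Assert.Null(await _context.UserProfiles.FirstOrDefaultAsync(p => p.UserAccountId == userId));

    // ...and order history survives for bookkeeping.
    Assert.NotEmpty(await _context.Orders.Where(o => o.UserId == userId).ToListAsync());
}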
The Bottom Line
Stop hoarding personal data “just in case.” Define purposes. Collect minimally. Delete ruthlessly. Your deletion logic will be straightforward, your audit responses will be honest, and your customers’ privacy will be respected.
The monolithic User entity pattern isn’t just non-compliant—it’s a symptom of teams that never asked “should we?” before asking “can we?” It’s expensive, risky, and harder to maintain than the separated alternative. Purpose-driven data architecture with UserAccount and UserProfile entities, nullable personal data fields, query filters for soft deletes, and integration tests for API boundaries isn’t regulatory overhead. It’s how data management should have worked all along.
That’s not a constraint that makes development harder. It’s the bare minimum of responsible engineering that somehow became optional. Fix your schemas before the auditors do it for you.
