A confidential data lake on GCP, audit ready

The starting point

Most data platforms grow before they are governed. The first hard question is usually who can read what, and that question gets harder once a dozen teams have copied tables sideways.

GCP gives you tight primitives to fix this early: customer-managed keys in Cloud KMS, column- and row-level policies in BigQuery, VPC Service Controls around the perimeter, and Workload Identity Federation for non-human access.

Pipeline

Sources land in Cloud Storage. A DLP scan runs on landing and tags objects with sensitivity. Dataflow normalises and writes into BigQuery raw, then Dataform shapes curated tables. Every step runs inside a VPC Service Controls perimeter.
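The tagging step on landing can be sketched as a mapping from DLP findings to a sensitivity label. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the label ladder, the mapping table, and the `classify` and `object_metadata` helpers are hypothetical. Only the infoType names are built-in Cloud DLP detectors.

```python
# Hypothetical sensitivity ladder, ordered lowest to highest.
LEVELS = ["public", "internal", "confidential", "restricted"]

# Assumed mapping from Cloud DLP infoType names to a label.
INFO_TYPE_LEVEL = {
    "EMAIL_ADDRESS": "confidential",
    "PHONE_NUMBER": "confidential",
    "PERSON_NAME": "confidential",
    "CREDIT_CARD_NUMBER": "restricted",
    "US_SOCIAL_SECURITY_NUMBER": "restricted",
}


def classify(info_types, default="internal"):
    """Return the highest sensitivity label implied by the findings."""
    labels = [INFO_TYPE_LEVEL.get(t, default) for t in info_types]
    labels.append(default)  # an object with no findings still gets a label
    return max(labels, key=LEVELS.index)


def object_metadata(info_types):
    """Custom metadata to attach to the landed Cloud Storage object."""
    return {"sensitivity": classify(info_types)}
```

Taking the maximum over all findings means one restricted finding is enough to mark the whole object, which errs on the side of over-classification.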

Encryption at rest uses CMEK in every service. Keys in a shared key ring rotate quarterly, and the highest sensitivity classes use HSM-backed keys.
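The rotation policy can be written down as the request body for creating a Cloud KMS key. The sketch below assumes "quarterly" means 90 days and uses a hypothetical `restricted` label to stand for the highest sensitivity class. The field names follow the Cloud KMS CryptoKey resource, but the helper itself is illustrative.

```python
from datetime import datetime, timedelta, timezone

ROTATION_DAYS = 90  # assumption: "quarterly" taken as 90 days


def crypto_key_body(sensitivity, now=None):
    """Build a CryptoKey body with automatic rotation configured."""
    now = now or datetime.now(timezone.utc)
    period = timedelta(days=ROTATION_DAYS)
    # Hypothetical rule: only the top sensitivity class gets an HSM key.
    level = "HSM" if sensitivity == "restricted" else "SOFTWARE"
    return {
        "purpose": "ENCRYPT_DECRYPT",
        "rotationPeriod": f"{int(period.total_seconds())}s",
        "nextRotationTime": (now + period).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "versionTemplate": {
            "algorithm": "GOOGLE_SYMMETRIC_ENCRYPTION",
            "protectionLevel": level,
        },
    }
```

Rotation creates a new primary key version; older versions stay available for decryption until they are disabled or destroyed.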

Access model

Access is granted to groups, never individuals. Column policies attach to taxonomy tags, so a single decision (mark column as PII) propagates to every consumer.
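To see why tagging propagates, it helps to model the check. The sketch below is a simplified, in-memory illustration: the taxonomy, group addresses, schema and `readable_columns` helper are all hypothetical. In BigQuery itself the decision is enforced by the service, typically by granting a fine-grained reader role on the policy tag.

```python
# Hypothetical taxonomy: policy tag -> groups allowed to read it.
TAG_READERS = {
    "pii": {"group:privacy-analysts@example.com"},
    "financial": {"group:finance@example.com"},
}

# Hypothetical table schema: column -> policy tag (None means untagged).
COLUMN_TAGS = {
    "order_id": None,
    "email": "pii",
    "card_last4": "financial",
}


def readable_columns(member_groups, column_tags=COLUMN_TAGS,
                     tag_readers=TAG_READERS):
    """List the columns a caller can read, given their group memberships."""
    groups = set(member_groups)
    return [
        col
        for col, tag in column_tags.items()
        # Untagged columns are open; tagged ones need a matching group.
        if tag is None or groups & tag_readers.get(tag, set())
    ]
```

Marking another column as PII is a one-line change to the schema mapping; no consumer grant has to be touched.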

Workload Identity Federation for GitHub Actions and external services
Access Approval required for support engineer reads
IAM Conditions to scope role bindings to projects and time windows
Audit logs streamed to BigQuery and pinned to Security Command Center (SCC)
Break-glass procedure logged and reviewed weekly
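A time-windowed role binding, as in the IAM Conditions item above, can be sketched as follows. The helper name, group address and role are illustrative assumptions. The binding shape and the `request.time < timestamp(...)` expression follow IAM's conditional binding format. The project-scoping half is left out here because resource name formats vary by service.

```python
from datetime import datetime, timedelta, timezone


def conditional_binding(role, group, hours, now=None):
    """Build an IAM policy binding that expires after `hours` hours."""
    now = now or datetime.now(timezone.utc)
    expiry = (now + timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "role": role,
        "members": [f"group:{group}"],  # groups, never individuals
        "condition": {
            "title": f"expires-{expiry}",
            "description": "Temporary access, granted for a fixed window",
            "expression": f'request.time < timestamp("{expiry}")',
        },
    }
```

A binding like this stops granting access once the timestamp passes, although the expired entry stays in the policy until someone removes it.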

Cost and operations

BigQuery slots are reserved for predictable workloads. Ad hoc and exploratory queries run on the on-demand pool with a per-user cap. Storage cost is controlled by table partitioning and by lifecycle rules on the raw zones.
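One way to picture the per-user cap is a client-side budget check that feeds the remaining allowance into the query job's `maximumBytesBilled` setting. The cap value and the helper below are hypothetical, and how bytes used so far are tracked is left out. BigQuery also offers custom quotas for query usage per user per day, which enforce a similar limit on the service side.

```python
TIB = 1024 ** 4
DAILY_CAP_BYTES = 1 * TIB  # hypothetical per-user daily cap


def query_config(sql, used_today_bytes, cap=DAILY_CAP_BYTES):
    """Build a query job configuration limited to the user's remaining budget."""
    remaining = cap - used_today_bytes
    if remaining <= 0:
        raise RuntimeError("daily on-demand byte budget exhausted")
    return {
        "query": {
            "query": sql,
            "useLegacySql": False,
            # The REST API takes this int64 value as a string; a query
            # that would bill more than this fails without running.
            "maximumBytesBilled": str(remaining),
        }
    }
```

Setting the limit per job means an oversized query fails up front instead of being discovered on the invoice.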

Takeaway

Governance is cheap when you set it up before the data lands. It is a programme of work once the data is already everywhere.
