Legacy Data Warehouse Migration to Databricks: Proven 40% Cost Reduction & Zero Data Loss

There is a particular kind of organizational pain that only people who have lived inside a large financial institution truly understand. It is the pain of running a data warehouse that was designed for a different era — one where batch jobs ran overnight, analysts waited until morning for their reports, and “real-time” meant something refreshed every four hours. The warehouse worked, after a fashion, and that was precisely the problem. It worked just well enough to survive every budget review, accumulate years of technical debt, and become the quiet bottleneck behind almost every data-driven decision the business wanted to make.

This post walks through a real legacy data warehouse migration engagement for a mid-sized BFSI (Banking, Financial Services, and Insurance) client — where we moved their entire data estate onto Databricks without losing a single row, reconciled every critical dataset end-to-end, and then tuned the new environment until query costs dropped to a fraction of what they had been paying. The names and specific numbers have been anonymized, but this legacy data warehouse migration case study covers the technical decisions, the mistakes we nearly made, and the outcomes in full.

Table of Contents

Why This Legacy Data Warehouse Migration Was Inevitable

The client was running a proprietary on-premises MPP warehouse that had served them reasonably well through the 2010s. By the early 2020s, though, the cracks were showing everywhere. Storage licensing alone was consuming a seven-figure annual budget. Query concurrency limits meant that risk analysts and business intelligence teams were constantly queuing behind each other. Schema changes required painful weekend maintenance windows. And the platform simply had no native story for machine learning or streaming data — two things the business needed badly as it moved into algorithmic underwriting and real-time fraud detection.

A cost-benefit analysis commissioned internally arrived at a familiar conclusion: the cost to stay on the legacy platform, when you factored in licensing, hardware refresh cycles, and the opportunity cost of features they could not build, was higher than the cost of migrating. The question was not whether to proceed with legacy data warehouse migration but how to do it without breaking the regulatory reporting pipelines that auditors and compliance teams depended on every single quarter.

Choosing Databricks as the Target Platform

For this legacy data warehouse migration, the client evaluated three platforms seriously. The final decision to go with Databricks rested on a handful of factors that kept coming up in every working session.

First, Delta Lake gave them ACID transactions on top of object storage — something that mattered enormously for a regulated entity where data integrity is not optional. Second, the Unity Catalog provided a single governance layer across all their data assets, which simplified the compliance story considerably. Third, Databricks’ native support for both SQL analytics workloads and Python/Scala ML pipelines meant the platform could serve the BI team and the data science team from the same infrastructure, eliminating the data-copy sprawl that had plagued them before.

There was also a commercial consideration: the pay-as-you-go model on cloud infrastructure meant the client could right-size their spending in a way that was structurally impossible with their legacy per-core licensing model. That last point would prove important later.

Mapping the Migration: What We Were Actually Moving

Before writing a single line of migration code, the team spent three weeks doing something unglamorous but absolutely necessary: cataloguing exactly what existed in the source system. This meant documenting every schema, every table, every view, every stored procedure, and — critically — every undocumented dependency that existed only in someone’s head or in a hand-crafted ETL script that had not been touched since 2017.

What we found was not unusual for a mature financial institution. There were roughly 800 tables across a dozen subject area schemas. About 200 of those tables had active downstream consumers. Another 300 were referenced in legacy ETL jobs but had not been queried by a human in over two years. The remaining 300 were in a grey zone — referenced in code, possibly still needed, impossible to confirm without asking people who had since left the organization.

The migration strategy that emerged from this audit had three tiers. Tier 1 covered the 200 actively used tables — these would be migrated first, with full reconciliation gates before any cutover. Tier 2 covered the ambiguous 300 — migrated in parallel but not cut over until usage could be confirmed. Tier 3 covered the clearly stale 300 — archived to cold storage rather than migrated at all, which turned out to be its own cost saving.

The Technical Approach to Legacy Data Warehouse Migration

The actual data migration to Databricks was executed in parallel streams rather than a sequential lift-and-shift. This decision added some coordination overhead but dramatically reduced the total migration timeline and allowed the team to run the source and target systems concurrently during validation.

For the bulk historical load, we used Apache Spark’s native JDBC connectors to pull data from the legacy warehouse in parallel partitioned reads, writing directly to Delta tables in the client’s cloud object storage. Partition sizes were tuned carefully — too large and you hit memory pressure on the executors; too small and you drown in task scheduling overhead. For tables over 500 GB, we settled on 256 MB partition targets, which gave clean parallelism without the executor thrashing we had seen in early test runs.

For the ongoing delta loads (the data that kept changing while migration was in flight), we implemented a change-data-capture layer using the source system’s transaction log export capability, feeding into a streaming pipeline on Databricks that applied changes to the Delta tables using merge operations. This was the part of the migration that required the most careful design, because a CDC pipeline that falls behind or drops events is how you end up with silent data corruption — exactly the kind of thing that does not show up until a regulator asks an awkward question.

Zero Data Loss in Legacy Data Warehouse Migration: What That Actually Required

Every legacy data warehouse migration project claims zero data loss. What that phrase actually means in practice varies enormously. In this case, it meant something very specific: every row that existed in the source system at cutover time had to exist in the target system, with identical values in every column, and the business had to be able to prove this to their auditors if asked.

The reconciliation framework we built had four layers. The first was row-count reconciliation — a necessary baseline but nowhere near sufficient on its own. The second was checksum reconciliation at the table level, comparing MD5 hashes of concatenated row values across stratified samples. The third was business-logic reconciliation, where we took twenty of the client’s most important regulatory reports and ran them against both the legacy system and the new Databricks environment, comparing outputs line by line. The fourth was a statistical distribution check that flagged any column where the mean, standard deviation, min, or max had shifted by more than a configurable threshold — a useful catch for subtle truncation or type-casting bugs that hash checks can miss.

Running all four layers added time to the project. It also caught eleven data quality issues that would have gone undetected with row-count checks alone — including two cases where the legacy ETL had been silently truncating strings longer than 255 characters for years, a bug the client did not know existed until we found it during migration reconciliation.

Delta Table Tuning: Where the Performance Gains Came From

Getting data into Delta tables is one thing. Getting those tables to perform well for mixed analytical workloads is another, and this is where a lot of migrations quietly underdeliver. The data lands in Delta format, the initial benchmarks look acceptable, and the team declares victory without doing the tuning work that turns “acceptable” into “dramatically better.”

We spent four weeks on Delta table optimization after the initial migration, and the gains from that work exceeded anything we had seen from hardware or infrastructure changes on the legacy system.

The most impactful change was applying OPTIMIZE with Z-ORDER on the high-traffic tables, colocating data by the columns most commonly used in filter predicates. For the client’s transaction history tables — queried almost always by account ID and transaction date — Z-ordering on those two columns reduced average query execution time by around 60% compared to the unoptimized Delta tables. The reason is straightforward: Z-ORDER clusters physically adjacent rows by the specified columns, which means Databricks’ file skipping can eliminate far more files before any data is read from storage.

The second major change was partitioning strategy. The initial migration had used date-based partitioning on all large tables, which is a reasonable default but not always optimal. For tables where most queries filtered primarily by account segment rather than date, we restructured partitioning to match the actual query patterns. This sounds obvious in retrospect — partition on the column you filter by most — but in practice, query pattern analysis before choosing a partitioning scheme is a step that gets skipped more often than it should.

The third area was statistics and auto-optimize settings. Enabling delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact on tables with frequent small-file writes significantly reduced the file proliferation problem that had started accumulating within weeks of go-live. Small files are a well-known performance killer on object storage — they do not show up dramatically in query times at first, but left unmanaged, they compound quickly.

Right-Sizing Clusters: The Other Half of Cost Reduction

Delta tuning cut query execution time. Cluster right-sizing cut the cost of that execution time. Together, they were responsible for the sharp reduction in query costs the client experienced after go-live.

The initial cluster configuration had been provisioned conservatively — larger than needed — because the team was cautious about performance during the initial live period. That caution was reasonable, but six weeks after cutover, the usage data made it clear that the clusters were substantially over-provisioned for most workloads. Spot instance usage was also lower than it should have been, because there had been anxiety during migration about interruption risks.

The right-sizing work involved analyzing cluster utilization metrics across a full four-week window, segmenting by workload type. Interactive BI queries ran well on smaller driver nodes than originally provisioned. Batch ETL jobs could use spot instances with autoscaling for 80% of their runtime without any material risk of interruption. The data science workloads needed GPU nodes for model training but ran them for much shorter periods than the always-on configuration had implied.

After right-sizing, the client’s monthly Databricks compute spend dropped by approximately 40% compared to the initial post-migration configuration — and that configuration was already cheaper than the legacy warehouse licensing. The combination of lower licensing costs, right-sized compute, and the Tier 3 archive decision (eliminating cold data from the active warehouse entirely) produced total infrastructure cost savings well above what the original business case had projected.

What the BFSI Context Added to Legacy Data Warehouse Migration Complexity

Financial services migrations carry regulatory weight that other industries do not. Data lineage is not a nice-to-have — it is something auditors ask for. Encryption at rest and in transit is mandatory. Role-based access controls need to map to existing compliance frameworks. Audit logs need to be tamper-evident and retained for specific periods.

Databricks’ Unity Catalog handled much of this well out of the box, but configuring it correctly for a regulated environment required deliberate effort. Column-level access controls for PII fields, row-level security policies for data segmented by business unit, and integration with the client’s existing identity provider via SSO all had to be built and validated before any production data moved into the new environment.

We also worked closely with the client’s compliance team to document the data lineage graph for their fifteen most critical regulatory reports — showing, end to end, which source tables each report drew from, which transformations were applied, and how the resulting output mapped to the regulatory filing format. That documentation, produced as a side effect of the migration rather than as a separate project, turned out to be among the most valued deliverables from the entire engagement.

Lessons Worth Carrying Forward

A few things stand out from this engagement that apply broadly to any organization planning a legacy data warehouse migration to Databricks or a similar modern lakehouse platform.

The audit phase is not optional. Three weeks of cataloguing before any migration code is written feels like delay. It is not — it is the work that prevents you from migrating things that do not need to be migrated and missing dependencies that will break things downstream. The Tier 3 archive decision alone saved the client months of unnecessary migration effort.

Reconciliation needs to go deeper than row counts. Row counts catch missing tables. They do not catch silent type coercions, string truncations, or floating-point rounding differences introduced by platform-specific behavior. Invest in business-logic reconciliation early, even if it means slowing down the migration timeline.

Do not tune on day one, but do not wait too long either. The first weeks after cutover should be spent stabilizing, not optimizing. But optimization work deferred past six to eight weeks tends to calcify — teams move on to new projects, query patterns shift, and the baseline gets accepted as the new normal. The performance and cost gains from Delta tuning are real and substantial; schedule the work deliberately.

Right-sizing is an ongoing process, not a one-time event. The cluster configuration that is right at go-live will not be right six months later. Establish a cadence for reviewing utilization data and adjusting configurations, especially as new workloads are onboarded to the platform.

The Outcome in Plain Terms

The client is now running their full analytical workload on Databricks. Their regulatory reporting pipelines are running faster, with better lineage documentation, on a platform that can scale without the weekend maintenance windows that used to disrupt their operations. The data science team has access to the same governed data that the BI team uses, without the staging environment gymnastics that had characterized the legacy setup.

Query costs are materially lower than they were on the legacy warehouse. The total infrastructure bill — compute, storage, licensing — is down significantly from what it was before migration. And the migration happened without data loss, with full reconciliation documented for compliance purposes, and without a single failed regulatory report in the post-cutover period.

That is the outcome a well-executed legacy data warehouse migration to Databricks can deliver for a financial institution willing to invest the time in doing it properly. The technology is capable. The process discipline is what determines whether that capability translates into results the business can rely on.

How MinervaDB Approaches Legacy Warehouse Migrations

At MinervaDB, our data engineering teams have deep experience in enterprise data warehouse migration, Databricks platform engineering, and Delta Lake optimization for regulated industries including BFSI, healthcare, and retail. We bring structured methodology to migration engagements — from the initial data estate audit through reconciliation, performance tuning, and post-go-live optimization — because we have seen what happens when those phases are treated as optional.

If your organization is evaluating a move away from a legacy warehouse, or if you are already mid-migration and running into the performance or cost issues that often surface after the initial cutover, we would welcome a conversation. You can also explore our related thinking on Databricks performance tuning and Delta Lake best practices on this site.

The right migration is not the fastest migration. It is the one your compliance team can sign off on, your analysts can rely on, and your finance team does not regret eighteen months later.

The Data Transformation Company

Data Architecture, Engineering and Operations for SQL, NoSQL, NewSQL, Cloud Native Data Platforms, Analytics and AI

Legacy Data Warehouse Migration: BFSI Client Moves to Databricks with Zero Data Loss