The Streaming-First Shift: What Uber's IngestionNext Means for Data Engineers
Uber's new data platform cuts latency from hours to minutes while reducing compute by 25%. Here's why the industry's bet on streaming-first architecture matters for your career.
I've watched batch processing dominate data engineering for fifteen years. Schedule a Spark job, wait hours for results, repeat. It worked at scale—we proved that at Stripe and everywhere else—but it always felt like we were designing around a fundamental constraint rather than solving the actual problem.
Uber just made a bet that constraint is obsolete. Their new IngestionNext platform, announced last week, replaces scheduled batch ingestion with continuous streaming across their entire data lake. The results: data freshness improved from hours to minutes, and compute usage dropped roughly 25%. More importantly, it signals where the industry is headed—and what skills will matter in the next phase of data infrastructure.
The Architecture Shift
The old Uber ingestion stack ran on Apache Spark with scheduled batch jobs processing data at fixed intervals. Functional, proven, resource-intensive. The new system flips the model: events flow continuously through Apache Kafka, get processed by Flink jobs, and land in Apache Hudi tables with full transactional guarantees.
According to the Uber engineering team, this isn't just about swapping one framework for another. It's treating data freshness as "a first-class dimension of data quality." When your analysts are waiting hours for yesterday's data, you're not building for 2025.
The technical stack:
- Apache Kafka for continuous event transport
- Apache Flink for stateful stream processing
- Apache Hudi for transactional tables on the data lake
What makes this compelling isn't the technology choices—these are all mature open source projects. It's the systemic commitment to streaming as the default, not the exception.
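To make the shape of the pipeline concrete, here's a minimal sketch in pure Python. It is not Uber's implementation and the names (`run_ingestion`, `commit_interval`) are invented for illustration; a deque stands in for Kafka's ordered log, a per-record transform stands in for a Flink job, and a list of commit records stands in for Hudi's commit timeline.

```python
from collections import deque

def run_ingestion(events, commit_interval=3):
    """Toy streaming-ingestion loop: consume -> transform -> commit.

    A deque plays the role of Kafka (an ordered event log), the
    per-record transform plays the role of a Flink job, and the
    (table, commits) pair plays the role of a Hudi table with a
    commit timeline recording consumed offsets.
    """
    queue = deque(events)           # "Kafka": ordered event log
    table, commits = [], []         # "Hudi": data plus commit timeline
    buffer, offset = [], 0

    while queue:
        record = queue.popleft()    # continuous consumption, not a batch scan
        buffer.append(record.upper())  # stand-in per-record transformation
        offset += 1
        # Commit when the buffer fills, or flush on end of input.
        if len(buffer) == commit_interval or not queue:
            table.extend(buffer)    # atomic append of the buffered records
            commits.append({"offset": offset, "rows": len(buffer)})
            buffer = []

    return table, commits
```

The point of the sketch is the commit metadata: each commit records the offset it covers, which is what makes the checkpoint-alignment fix discussed later possible at all.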
The Engineering Challenges They Actually Solved
Moving from batch to streaming at petabyte scale introduces problems you won't find in the documentation. Uber's team documented three that matter:
Small File Hell
Continuous streaming writes create thousands of tiny Parquet files, which murder query performance and explode metadata overhead. The naive solution—merge files record by record—requires decompressing, decoding from columnar to row format, merging, re-encoding, and recompressing. Computationally brutal.
Uber's solution: row-group-level merging that operates directly on Parquet's columnar structure, delivering 10x faster compaction. They simplified further by enforcing schema consistency, avoiding the complexity of padding and masking approaches attempted in Apache Hudi PR #13365.
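The row-group-level merge itself lives deep in Parquet internals, but the planning half of compaction is easy to illustrate. Below is a hedged sketch of a greedy compaction planner; `plan_compaction` and its parameters are hypothetical names, and this shows only the generic "bin small files toward a target size" idea, not Uber's actual compactor.

```python
def plan_compaction(files, target_mb=128):
    """Greedy compaction planner for a streaming table.

    `files` is a list of (name, size_mb) pairs. Files smaller than
    `target_mb` are packed into groups whose combined size stays at
    or under the target, so each group can be rewritten as one
    well-sized output file. Files already at or above the target
    are left alone.
    """
    small = sorted((f for f in files if f[1] < target_mb), key=lambda f: f[1])
    groups, current, size = [], [], 0
    for name, mb in small:
        # Close the current group once adding this file would overflow it.
        if current and size + mb > target_mb:
            groups.append(current)
            current, size = [], 0
        current.append(name)
        size += mb
    if current:
        groups.append(current)
    return groups
```

Row-group-level merging changes the cost of executing each group: instead of decoding every record, the compactor can stitch together whole Parquet row groups, which is where the claimed 10x speedup comes from.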
Partition Skew
When downstream systems hiccup—garbage collection pauses, network blips—Kafka consumption across Flink subtasks becomes unbalanced. Skewed partitions compress poorly and query slowly. They addressed this through operational tuning (aligning parallelism with partitions), connector-level fairness (round-robin polling, per-partition quotas), and better observability (per-partition lag metrics, skew-aware autoscaling).
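The connector-level fairness idea is simple enough to sketch. The following is an illustrative pure-Python model, not Flink's Kafka connector: `fair_poll`, `quota`, and `budget` are invented names, and deques stand in for partition backlogs.

```python
from collections import deque

def fair_poll(partitions, quota=2, budget=6):
    """Round-robin poller with a per-partition quota per visit.

    `partitions` maps partition id -> deque of pending records.
    Partitions are visited in a fixed rotation, taking at most
    `quota` records from each, so one backlogged partition cannot
    starve the others. Stops after `budget` records or once a full
    rotation finds every partition empty.
    """
    from itertools import cycle
    out = []
    order = cycle(sorted(partitions))
    empty_streak = 0
    while len(out) < budget and empty_streak < len(partitions):
        pid = next(order)
        taken = 0
        while partitions[pid] and taken < quota and len(out) < budget:
            out.append((pid, partitions[pid].popleft()))
            taken += 1
        empty_streak = empty_streak + 1 if taken == 0 else 0
    return out
```

Without the quota, a poller draining partitions to exhaustion would emit a long skewed run from the backlogged partition before touching the others; the quota keeps consumption interleaved even when lag is uneven.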
Checkpoint Synchronization
Flink checkpoints track consumed Kafka offsets. Hudi commits track writes to storage. If these become misaligned during failures, you get data skipped or duplicated. Their fix: embed Flink checkpoint IDs directly in Hudi commit metadata, enabling deterministic recovery during rollbacks.
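The recovery rule that this alignment enables can be sketched in a few lines. This is an illustrative model, not Hudi's rollback code; `recover` and the dict shapes are assumptions made for the example.

```python
def recover(commits, last_checkpoint_id):
    """Deterministic recovery using checkpoint ids embedded in commits.

    Each commit carries the Flink checkpoint id that produced it.
    On restart, any commit whose checkpoint id is newer than the
    last completed checkpoint was never acknowledged by Flink, so
    it is rolled back; Kafka replay from the checkpointed offsets
    then rewrites that data exactly once instead of duplicating it.
    """
    kept = [c for c in commits if c["checkpoint_id"] <= last_checkpoint_id]
    rolled_back = [c for c in commits if c["checkpoint_id"] > last_checkpoint_id]
    return kept, rolled_back
```

The key property is determinism: given the same commit timeline and the same last completed checkpoint, recovery always draws the cut in the same place, regardless of how the failure interleaved with in-flight writes.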
These aren't exotic problems. They're what happens when you actually run streaming systems in production at scale.
Why This Matters Beyond Uber
Every major data platform vendor is moving this direction. Confluent's 2025 Data Streaming Report shows real-time data infrastructure becoming table stakes, not competitive advantage. Apache Flink adoption continues accelerating—it's become the de facto standard for stateful stream processing at scale.
The economics are compelling. Suqiang Song, co-author of the Uber blog post, noted the shift "enabled a fully end-to-end real-time data stack, from ingestion to transformation to analytics." When streaming is more efficient than batch and delivers better latency, the decision becomes obvious.
But here's what Uber's engineers admit: they've only solved ingestion. As they note in the blog post, "freshness still stalls downstream in transformation of the raw data and analytics." The next frontier is extending streaming all the way through transformation and analytics pipelines. The industry is nowhere near done with this transition.
What Data Engineers Should Learn
If you're building data systems or considering where to invest learning time, the signals are clear:
Kafka and Flink are becoming foundational. Not nice-to-have, not experimental—foundational. The same way you needed to understand relational databases in 2010, you need to understand stream processing in 2025.
Batch isn't dead, but streaming-first is the new default. You'll still need batch for large-scale transformations and scheduled workloads, but the assumption that data arrives in batches is fading. Design for streams, batch when necessary.
Table formats like Hudi and Iceberg matter. The data lake isn't just blob storage anymore. Transactional guarantees, time travel, and schema evolution are expected features. Understanding how these formats work and when to use them differentiates mid-level from senior engineers.
Operational complexity is the real challenge. As Kai Waehner, Global Field CTO, noted on LinkedIn, "This move is all about treating data freshness as a key dimension of data quality." But maintaining thousands of streaming jobs, handling regional failover, managing checkpoints—that's where production systems live or die.
The Bigger Pattern
I've seen enough architectural shifts to recognize the pattern. First, a few large companies with extreme scale requirements build something custom. Then the open source ecosystem matures around proven patterns. Finally, it becomes the default assumption for new systems.
We're in the middle phase with streaming-first architecture. Uber, Netflix, LinkedIn, and other hyperscalers have proven it works. The tooling (Kafka, Flink, Hudi) is mature and production-ready. The missing piece is broader industry adoption and the training of a generation of engineers who think in streams, not batches.
Uber's IngestionNext isn't revolutionary—it's using well-established open source tools in a thoughtful way. What's significant is the commitment: thousands of datasets, petabyte scale, streaming as the primary architecture. When companies of this scale make platform bets, they're not experimenting. They're showing you the future.
If you're early in your data engineering career, the message is clear: learn to think in streams. If you're mid-career and still building primarily batch systems, start planning your transition. The industry is moving, and companies hiring for data roles increasingly expect streaming expertise as baseline knowledge.
The batch-first era served us well. But when streaming delivers better latency and lower costs, the question isn't whether to adopt it—it's how quickly you can make the shift.