A trading system generates about 2.5TB of new logs daily. You need to implement real-time (latency <5s) exact deduplication (keep only the latest record for the same order ID), and support fast queries by order ID, user ID, time range, and multiple dimensions. Provide the optimal architecture design, including storage selection, partitioning strategy, index design, data lifecycle management, and query optimization paths.

Category: technical

Difficulty: hard

Tags:

Answer Tips

- Why MySQL/PostgreSQL are not suitable directly
- Comparison of ClickHouse, Doris, Iceberg+Trino, Hudi, Kafka Streams + RocksDB, etc.
- How to implement low-latency deduplication (upsert semantics)
- Partitioning strategy: time-based vs order-id hash vs composite
- Use cases for inverted index, full-text index, Bloom filter, Z-order curve, etc.
- Hot/Warm/Cold data tiering and storage medium selection
- Query engine selection and SQL vs NoSQL trade-offs
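
Of the index techniques listed above, the Z-order curve is probably the least self-explanatory, so here is a minimal sketch of the idea: the bits of two column values are interleaved into one sort key, so rows that are close in both dimensions tend to land in the same data files. The function name and the 16-bit width are illustrative assumptions; real engines apply this to normalized column values and have their own implementations.

```python
from typing import List, Tuple


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) code.

    Hypothetical helper for illustration only, not any engine's API.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


def z_sorted(points: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort 2-D points by their Z-order code (the order files would be clustered in)."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Sorting by this key before writing files lets min/max statistics on either column prune files, which is why it suits queries that filter on several dimensions at once.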

Sample Answer

Ingest logs through Kafka and deduplicate them in Flink, using exactly-once checkpointing and a RocksDB state backend keyed by order ID so that only the latest record per order is emitted. Write the result to an Iceberg table partitioned by dt and user_id mod 256, using Merge-on-Read for upserts and Z-order clustering to speed up multi-dimensional queries. Cold data is tiered to object storage and queried through Trino on Iceberg. The most recent month of hot data is also served from ClickHouse, with materialized views for high-frequency queries.
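
The deduplication and partitioning steps in this answer can be sketched in plain Python. This is only an in-memory illustration under stated assumptions, not Flink code: the `OrderEvent` fields, the `LatestPerOrder` class, and the rule that "latest" means the highest event timestamp (ties dropped) are all hypothetical choices the answer does not specify. In the real pipeline the dictionary would be Flink keyed state held in RocksDB.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

NUM_BUCKETS = 256  # from the answer: user_id mod 256


@dataclass(frozen=True)
class OrderEvent:
    # Hypothetical record shape; field names are assumptions.
    order_id: str
    user_id: int
    event_ts_ms: int  # event time, epoch milliseconds
    payload: str


class LatestPerOrder:
    """In-memory stand-in for keyed state: one entry per order_id."""

    def __init__(self) -> None:
        self._state: Dict[str, OrderEvent] = {}

    def process(self, event: OrderEvent) -> Optional[OrderEvent]:
        """Return the event if it should be upserted downstream, else None."""
        current = self._state.get(event.order_id)
        # Assumption: only a strictly newer event wins, so stale or
        # replayed records are dropped.
        if current is not None and event.event_ts_ms <= current.event_ts_ms:
            return None
        self._state[event.order_id] = event
        return event


def partition_key(event: OrderEvent) -> Tuple[str, int]:
    """Compute the (dt, user_id mod 256) partition the answer describes."""
    dt = datetime.fromtimestamp(event.event_ts_ms / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d"), event.user_id % NUM_BUCKETS


if __name__ == "__main__":
    dedup = LatestPerOrder()
    events = [
        OrderEvent("o1", 1000, 1_700_000_000_000, "created"),
        OrderEvent("o1", 1000, 1_700_000_001_000, "filled"),
        OrderEvent("o1", 1000, 1_700_000_000_000, "created"),  # late replay
    ]
    for e in events:
        out = dedup.process(e)
        if out is not None:
            print(partition_key(out), out.order_id, out.payload)
```

Because every emitted record supersedes the previous one for the same order ID, the sink must apply it as an upsert. Keeping the state keyed by order ID also means state grows with the number of live orders, so a production job would need some expiry policy for old keys.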
