1

I'm using Apache Iceberg which performs queries on Parquet files. The data saves a lot of types of events and their respective numeric values. I'm hesitating whether it's more efficient to have one column for event type or have a column for each event.

Single event type column: enter image description here

VS Dedicated column for each event type: enter image description here

I'm wondering though if I want to run a group by query for col1, col2 by some event_type in theory if there was a dedicated column for that event_type then the db would have to scan less data as opposed to having scan more data when all event types are stored in a single column.

What is the best practice when designing schemas for such data in columnar data storage?

hitchhiker
  • 113
  • 3

1 Answers1

1

I'm hesitating whether it's more efficient to have one column for event type or have a column for each event.

A column for each event type (since they appear to be different kinds of attributes) would probably make the most sense here. A single column for all event types smells like the Entity-Attribute-Value (EAV) Anti-Pattern.

The EAV anti-pattern is generally inefficient for querying, can conflict data types, and makes aggregative queries more difficult to implement. It also hurts maintainability and data integrity.

Aside from that, different database systems offer different features for indexing and compression of columnar data. Mixing different / unrelated data into the same column will hurt the maximum potential one can gain from such features.

Further reading on the EAV anti-pattern:

  1. Is there a name for this database schema of key values?
  2. EAV - in an ecommerce case, is it really an anti-pattern?
  3. Database Modelization Anti-Patterns
J.D.
  • 40,776
  • 12
  • 62
  • 141