0

For reporting purposes, we have a DWH (data warehouse) doing ETLs (extract-transform-load) to retrieve data from selected tables in a production OLTP (on-line transaction processing) database.

The ETL extracts data with incremental manner, so it only fetches the changed part of data. Tentatively, we believe this should not affect the size of data as a result.

It is a simple mapping, so for the selected tables, DWH has the same columns as OLTP. The DWH is SQL Server and the OLTP database is MySQL. Of course, the MySQL data types need to translate to corresponding types in SQL Server context, and we followed the standard in Microsoft SSMA (SQL Server Migration Assistant).

We noticed that the data became times bigger in SQL Server than in MySQL. For example, in an e-commerce Magento application:

  • The sales_order table contains 7'100'000 rows with size 5.5GB.
  • However, in the data warehouse, the same table sized 20GB with the same number of rows.
    Please see a partial table definition below.

We checked the SQL Server database, it has SQL_Latin1_General_CP1_CI_AS collation and Simple recovery model. And the MySQL OLTP has default collation latin1_swedish_ci.

Our Questions:

  • In our setting, why SQL Server got times bigger than MySQL for the same data? Please kindly point out if we missed something and the DWH can get smaller instead.
  • The straight mapping between OLTP and DWH was simple to implement, and it worked so far, so good. However, we knew that there are many columns fetched but never used in reporting. So, we wonder if there are better designs or best practices in data warehousing.

We highly appreciate any hints and suggestions.

Details of the sample partial table definitions:

  1. MySQL OLTP, also refer to the model in Magento open-source repository:
-- msab_magento.sales_order definition

CREATE TABLE sales_order ( entity_id int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Entity ID', state varchar(32) DEFAULT NULL COMMENT 'State', status varchar(32) DEFAULT NULL COMMENT 'Status', coupon_code varchar(255) DEFAULT NULL COMMENT 'Coupon Code', protect_code varchar(255) DEFAULT NULL COMMENT 'Protect Code', shipping_description varchar(255) DEFAULT NULL COMMENT 'Shipping Description', is_virtual smallint(5) unsigned DEFAULT NULL COMMENT 'Is Virtual', store_id smallint(5) unsigned DEFAULT NULL COMMENT 'Store ID', customer_id int(10) unsigned DEFAULT NULL COMMENT 'Customer ID', base_discount_amount decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Amount', base_discount_canceled decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Canceled', base_discount_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Invoiced', base_discount_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Discount Refunded', base_grand_total decimal(20,4) DEFAULT NULL COMMENT 'Base Grand Total', base_shipping_amount decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Amount', base_shipping_canceled decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Canceled', base_shipping_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Invoiced', base_shipping_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Refunded', base_shipping_tax_amount decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Tax Amount', base_shipping_tax_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Shipping Tax Refunded', base_subtotal decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal', base_subtotal_canceled decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Canceled', base_subtotal_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Invoiced', base_subtotal_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Subtotal Refunded', base_tax_amount decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Amount', base_tax_canceled decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Canceled', base_tax_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Invoiced', base_tax_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Tax Refunded', base_to_global_rate decimal(20,4) DEFAULT NULL COMMENT 'Base To Global Rate', base_to_order_rate decimal(20,4) DEFAULT NULL COMMENT 'Base To Order Rate', base_total_canceled decimal(20,4) DEFAULT NULL COMMENT 'Base Total Canceled', base_total_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Base Total Invoiced', base_total_invoiced_cost decimal(20,4) DEFAULT NULL COMMENT 'Base Total Invoiced Cost', base_total_offline_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Total Offline Refunded', base_total_online_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Total Online Refunded', base_total_paid decimal(20,4) DEFAULT NULL COMMENT 'Base Total Paid', base_total_qty_ordered decimal(12,4) DEFAULT NULL COMMENT 'Base Total Qty Ordered', base_total_refunded decimal(20,4) DEFAULT NULL COMMENT 'Base Total Refunded', discount_amount decimal(20,4) DEFAULT NULL COMMENT 'Discount Amount', discount_canceled decimal(20,4) DEFAULT NULL COMMENT 'Discount Canceled', discount_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Discount Invoiced', discount_refunded decimal(20,4) DEFAULT NULL COMMENT 'Discount Refunded', grand_total decimal(20,4) DEFAULT NULL COMMENT 'Grand Total', shipping_amount decimal(20,4) DEFAULT NULL COMMENT 'Shipping Amount', shipping_canceled decimal(20,4) DEFAULT NULL COMMENT 'Shipping Canceled', shipping_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Shipping Invoiced', shipping_refunded decimal(20,4) DEFAULT NULL COMMENT 'Shipping Refunded', shipping_tax_amount decimal(20,4) DEFAULT NULL COMMENT 'Shipping Tax Amount', shipping_tax_refunded decimal(20,4) DEFAULT NULL COMMENT 'Shipping Tax Refunded', store_to_base_rate decimal(12,4) DEFAULT NULL COMMENT 'Store To Base Rate', store_to_order_rate decimal(12,4) DEFAULT NULL COMMENT 'Store To Order Rate', subtotal decimal(20,4) DEFAULT NULL COMMENT 'Subtotal', subtotal_canceled decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Canceled', subtotal_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Invoiced', subtotal_refunded decimal(20,4) DEFAULT NULL COMMENT 'Subtotal Refunded', tax_amount decimal(20,4) DEFAULT NULL COMMENT 'Tax Amount', tax_canceled decimal(20,4) DEFAULT NULL COMMENT 'Tax Canceled', tax_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Tax Invoiced', tax_refunded decimal(20,4) DEFAULT NULL COMMENT 'Tax Refunded', total_canceled decimal(20,4) DEFAULT NULL COMMENT 'Total Canceled', total_invoiced decimal(20,4) DEFAULT NULL COMMENT 'Total Invoiced', total_offline_refunded decimal(20,4) DEFAULT NULL COMMENT 'Total Offline Refunded', total_online_refunded decimal(20,4) DEFAULT NULL COMMENT 'Total Online Refunded', total_paid decimal(20,4) DEFAULT NULL COMMENT 'Total Paid', total_qty_ordered decimal(12,4) DEFAULT NULL COMMENT 'Total Qty Ordered', total_refunded decimal(20,4) DEFAULT NULL COMMENT 'Total Refunded', can_ship_partially smallint(5) unsigned DEFAULT NULL COMMENT 'Can Ship Partially', can_ship_partially_item smallint(5) unsigned DEFAULT NULL COMMENT 'Can Ship Partially Item', customer_is_guest smallint(5) unsigned DEFAULT NULL COMMENT 'Customer Is Guest', customer_note_notify smallint(5) unsigned DEFAULT NULL COMMENT 'Customer Note Notify', billing_address_id int(11) DEFAULT NULL COMMENT 'Billing Address ID', customer_group_id int(11) DEFAULT NULL, ... reward_points_balance_refund int(11) DEFAULT NULL COMMENT 'Reward Points Balance Refund', PRIMARY KEY (entity_id), UNIQUE KEY SALES_ORDER_INCREMENT_ID_STORE_ID (increment_id,store_id), KEY SALES_ORDER_STATUS (status), KEY SALES_ORDER_STATE (state), KEY SALES_ORDER_STORE_ID (store_id), KEY SALES_ORDER_CREATED_AT (created_at), KEY SALES_ORDER_CUSTOMER_ID (customer_id), KEY SALES_ORDER_EXT_ORDER_ID (ext_order_id), KEY SALES_ORDER_QUOTE_ID (quote_id), KEY SALES_ORDER_UPDATED_AT (updated_at), KEY SALES_ORDER_SEND_EMAIL (send_email), KEY SALES_ORDER_EMAIL_SENT (email_sent), CONSTRAINT SALES_ORDER_CUSTOMER_ID_CUSTOMER_ENTITY_ENTITY_ID FOREIGN KEY (customer_id) REFERENCES customer_entity (entity_id) ON DELETE SET NULL, CONSTRAINT SALES_ORDER_STORE_ID_STORE_STORE_ID FOREIGN KEY (store_id) REFERENCES store (store_id) ON DELETE SET NULL ) ENGINE=InnoDB AUTO_INCREMENT=71xxxxx DEFAULT CHARSET=utf8 COMMENT='Sales Flat Order';

  1. SQL Server DWH, generated by Microsoft SSMA for MySQL:
/****** Object:  Table [msab_magento].[sales_order]    Script Date: 10/11/2023 3:17:43 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON GO

CREATE TABLE [msab_magento].[sales_order]( [entity_id] [bigint] IDENTITY(2956088,1) NOT NULL, [state] nvarchar NULL, [status] nvarchar NULL, [coupon_code] nvarchar NULL, [protect_code] nvarchar NULL, [shipping_description] nvarchar NULL, [is_virtual] [int] NULL, [store_id] [int] NULL, [customer_id] [bigint] NULL, [discount_amount] [decimal](20, 4) NULL, [discount_canceled] [decimal](20, 4) NULL, [discount_invoiced] [decimal](20, 4) NULL, [discount_refunded] [decimal](20, 4) NULL, [grand_total] [decimal](20, 4) NULL, [shipping_amount] [decimal](20, 4) NULL, [shipping_canceled] [decimal](20, 4) NULL, [shipping_invoiced] [decimal](20, 4) NULL, [shipping_refunded] [decimal](20, 4) NULL, [shipping_tax_amount] [decimal](20, 4) NULL, [shipping_tax_refunded] [decimal](20, 4) NULL, [store_to_base_rate] [decimal](12, 4) NULL, [store_to_order_rate] [decimal](12, 4) NULL, [subtotal] [decimal](20, 4) NULL, [subtotal_canceled] [decimal](20, 4) NULL, [subtotal_invoiced] [decimal](20, 4) NULL, [subtotal_refunded] [decimal](20, 4) NULL, [tax_amount] [decimal](20, 4) NULL, [tax_canceled] [decimal](20, 4) NULL, [tax_invoiced] [decimal](20, 4) NULL, [tax_refunded] [decimal](20, 4) NULL, [total_canceled] [decimal](20, 4) NULL, [total_invoiced] [decimal](20, 4) NULL, [total_offline_refunded] [decimal](20, 4) NULL, [total_online_refunded] [decimal](20, 4) NULL, [total_paid] [decimal](20, 4) NULL, [total_qty_ordered] [decimal](12, 4) NULL, [total_refunded] [decimal](20, 4) NULL, [can_ship_partially] [int] NULL, [can_ship_partially_item] [int] NULL, [customer_is_guest] [int] NULL, [customer_note_notify] [int] NULL, [billing_address_id] [int] NULL, [customer_group_id] [int] NULL, [edit_increment] [int] NULL, ... [shipping_incl_tax] [decimal](20, 4) NULL, CONSTRAINT [PK_sales_order_entity_id] PRIMARY KEY CLUSTERED ( [entity_id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY], CONSTRAINT [sales_order$SALES_ORDER_INCREMENT_ID_STORE_ID] UNIQUE NONCLUSTERED ( [increment_id] ASC, [store_id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [state] GO

ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [status] GO

...

ALTER TABLE [msab_magento].[sales_order] ADD DEFAULT (NULL) FOR [shipping_incl_tax] GO

EXEC sys.sp_addextendedproperty @name=N'MS_SSMA_SOURCE', @value=N'msab_magento.sales_order' , @level0type=N'SCHEMA',@level0name=N'msab_magento', @level1type=N'TABLE',@level1name=N'sales_order' GO

James
  • 149
  • 1
  • 6

1 Answers1

1

The sales_order table contains 7'100'000 rows with size 5.5GB. However, in the data warehouse, the same table sized 20GB

SQL Server has several different table compression options. The one most commonly used for large data warehouse tables is Columnstore, which can produce 10x compression on tables with millions of rows.

But both ROW and PAGE compression will change the storage format for all your DECIMAL columns from fixed-width to variable-width. In an uncompressed table DECIMAL(20,4) is a 13-byte fixed-width column.

David Browne - Microsoft
  • 49,000
  • 3
  • 53
  • 102