2

I have a procedure that loops through a query that returns a set of dates, recreates a set of views using the date in the WHERE clause, and the calls a series of other procedures that use these views. So, if the loop contains three dates, it runs the procedures with each date in turn coded into the WHERE clause of the views. The tables that the views select are partitioned on the date in the WHERE clause, the views exist in order to eliminate all the unnecessary partitions.

On the first iteration around the loop, everything is fine. The views get created, the procedures get called, and the data is processed.

On the second and subsequent iterations, the procedures behave as if the views still had the date from the first iteration. After the loop finishes, the views have the last of the dates in their WHERE clauses, but the procedures had clearly just been processing the first date over and over.

I have created a trivial test that just creates a view, selects the count from it, creates it again with a different date and selects the count from that, and I get the two correct counts, so redefining and using a view within a procedure clearly is possible.

Any ideas why it doesn't work in a loop?

Here is a demonstration script, the first run should process 2 records and the second should process 4, but both runs process 2:

create schema dv;

create schema rpt;

drop table dv.batch_data;
create table dv.batch_data
(
  std_date date
, dv_load_dt timestamp with time zone
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

insert into dv.batch_data values( '2016-01-02'::date, current_timestamp );
insert into dv.batch_data values( '2016-01-04'::date, current_timestamp );

drop table dv.table_1;
create table dv.table_1
(
  std_date date
, col_1 text
, col_2 text
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

insert into dv.table_1 values( '2016-01-01'::date, 'First',   'Record' );
insert into dv.table_1 values( '2016-01-02'::date, 'Second',  'Record' );
insert into dv.table_1 values( '2016-01-02'::date, 'Third',   'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Fourth',  'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Fifth',   'Record' );
insert into dv.table_1 values( '2016-01-03'::date, 'Sixth',   'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Seventh', 'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Eighth',  'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Ninth',   'Record' );
insert into dv.table_1 values( '2016-01-04'::date, 'Tenth',   'Record' );

create table rpt.stage_table_a
(
  std_date date
, col text
)
WITH (APPENDONLY=false, OIDS=FALSE)
DISTRIBUTED BY (std_date)
;

CREATE OR REPLACE FUNCTION rpt.custom_proc_rpt_loop()
  RETURNS integer AS
$BODY$
DECLARE
  v_count       integer;         

  rec           RECORD;
  v_days        integer;
  v_query       varchar(10000);
  v_return_msg  varchar(256);

BEGIN
  v_count := 0;
  v_days := 0;

  BEGIN

  FOR rec IN
    SELECT DISTINCT std_date
    FROM dv.batch_data
    ORDER BY std_date
  LOOP
    v_days := v_days + 1;

    v_query := 'CREATE OR REPLACE VIEW rpt.view_batch_table_1 AS ' ||
               'SELECT std_date ' ||
                    ', col_1 ' ||
                    ', col_2 ' ||
               'FROM dv.table_1 ' || 
               'WHERE std_date = ''' || rec.std_date || '''';

    RAISE NOTICE '%',v_query;
    EXECUTE v_query;

    v_count := v_count + 1;

    PERFORM rpt.update_stage_table_a();

    ANALYZE rpt.stage_table_a;

  END LOOP;

  END;

  v_return_msg := 'custom_proc_rpt_loop completed. '
    || CAST(v_days AS VARCHAR(64)) || ' loops performed. ';

  RAISE NOTICE '%',v_return_msg;

  RETURN 1;

END;
$BODY$
  LANGUAGE plpgsql VOLATILE;


CREATE OR REPLACE FUNCTION rpt.update_stage_table_a()
  RETURNS integer AS
$BODY$
DECLARE
  v_count             integer;         
  v_return_msg  varchar(256);

  BEGIN

  TRUNCATE rpt.stage_table_a;

  INSERT INTO rpt.stage_table_a
  (
    std_date
  , col
  )
  SELECT DISTINCT 
    std_date
  , COALESCE(col_1,'') || CASE WHEN length(col_1) > 1 AND length(col_2) > 1 THEN ' ' ELSE '' || COALESCE(col_1,'') END
  FROM rpt.view_batch_table_1
  ;

  GET DIAGNOSTICS v_count = row_count;

  v_return_msg := 'stage_table_a completed. '
    || CAST(v_count AS VARCHAR(64)) || ' records processed. ';

  RAISE NOTICE '%',v_return_msg;

  RETURN 1;

END;
$BODY$
  LANGUAGE plpgsql VOLATILE;

select rpt.custom_proc_rpt_loop();

select count(*) from rpt.view_batch_table_1;

In the real world, the inner procedures are generated by a data warehouse automation tool, so I don't want to change them. I can do whatever I want in the calling procedure that contains the loop.

Jack Douglas
  • 40,517
  • 16
  • 106
  • 178
PhilHibbs
  • 539
  • 1
  • 7
  • 22

1 Answers1

2

Not using Greenplum, but guessing from its PostgreSQL heritage it's probably due to the way PL/pgSQL treats SQL statements: like prepared statements. Those are planned and cached on their first use. That means, among other things, that the view is resolved to its underlying tables.

The same plan is reused in subsequent iterations, which typically saves time for re-planning. But it also foils your attempt to recreate views.

There are various ways to fix this. One quick solution is to convert the fooled SQL statements to dynamic SQL as well, to force re-planning every time. So:

...

EXECUTE '
   SELECT * 
    FROM rpt.update_stage_table_a
    ( p_sequence        
    , p_job_name        
    , p_task_name       
    , p_job_id          
    , p_task_id          
    , p_return_msg      
    , p_status
    )'
INTO v_parameters
   , v_return_msg
   , v_status
;

...

Etc.

Related:

Erwin Brandstetter
  • 185,527
  • 28
  • 463
  • 633