Best way to model a database structure to store multidimensional variables?

Question

I am working with some NetCDF data which has multidimensional variables.

As far as I'm aware, the fields are not mandatory.
Each variable will have the same fields/attributes.
Each variable represents a measurement of the atmosphere, in time & space. Level represents a pressure level (like a layer of an onion -- a thick layer).

For example, a variable called SO2 (Sulphur Dioxide) may be 4-D: Time, Level (pressure), latitude, longitude.

I have about 14 variables that I need to store.

Data retrieval tendencies

Now, I'm going to need to query this data and ask questions like:

Give me all the rows for variable SO2 that fall between latitude X and longitude Y at time T and level L.
What is the average temperature for time T at level L?
etc...

Current considerations

I could let each row in my DB be an observation and have the columns: type, lat, long, time, level, value.

Or, perhaps create a table for each kind of variable and have the columns: lat, long, time, level, value.

I'm not sure which will be more appropriate, or whether there is a better option, so I'd appreciate some advice.

score 3 · Answer 1 · edited Apr 13 '17 at 12:42

So you have some basic data.

SELECT timestamp::timestamp, level, lat, long
FROM ( VALUES
  (now(), 7, 1, 9)
)
  AS t(timestamp, level, lat, long);
         timestamp          | level | lat | long 
----------------------------+-------+-----+------
 2017-02-08 15:38:54.903155 |     7 |   1 |    9

What you want to do is create a point with lat, and long...

SELECT timestamp::timestamp, level, ST_AsText(point::geography)
FROM ( VALUES
  (now(), 7, 1, 9)
)
  AS t(timestamp, level, lat, long)
CROSS JOIN LATERAL ST_SetSRID( ST_MakePoint(long,lat), 4326 )
  AS point;
         timestamp          | level | st_astext  
----------------------------+-------+------------
 2017-02-08 15:38:00.892956 |     7 | POINT(9 1)
(1 row)

Now, we just store that into a table..

CREATE TEMP TABLE foo
AS
SELECT timestamp::timestamp, level, point::geography
FROM ( VALUES
  (now(), 7, 1, 9)
)
  AS t(timestamp, level, lat, long)
CROSS JOIN LATERAL ST_SetSRID( ST_MakePoint(long,lat), 4326 )
  AS point;

Give me all the rows for variable SO2 that fall between latitude X and longitude Y at time T and level L.

This is not how GIS works, because between changes on a sphere. So instead, just use distance. So instead what we do

SELECT *
FROM foo
WHERE ST_DWithin(
  (ST_MakePoint(long,lat),4326)::geography,
  point,
  1000 -- distance in meters
)
FROM foo;

What is the average temperature for time T at level L?

For this we would have had to store the temperature, assuming we did -- which we didn't. It'd look like this.

SELECT avg(temp)
FROM foo
WHERE level = l
  AND time BETWEEN start AND finish;

As a side note ST_DWithin will use an index.. (as will level and timestamp if you create the indexes). However, if you're new to GIS you create GIS indexes with GIST and not btree.

CREATE INDEX ON foo USING gist ( point );

For a similar question see this answer.

UPDATE

This file is a complex proprietary raster,

gdalinfo file.nc 
Driver: netCDF/Network Common Data Format
Files: file.nc
Size is 512, 512
Coordinate System is `'
Metadata:
  NC_GLOBAL#Conventions=CF-1.6
  NC_GLOBAL#history=2017-02-08 12:30:09 GMT by grib_to_netcdf-2.0.2: grib_to_netcdf /data/data01/scratch/_mars-atls17-95e2cf679cd58ee9b4db4dd119a05a8d-mEMhIr.grib -o /data/data01/scratch/_grib2netcdf-atls00-95e2cf679cd58ee9b4db4dd119a05a8d-lrYLIt.nc -utime
Subdatasets:
  SUBDATASET_1_NAME=NETCDF:"file.nc":aermr01
  SUBDATASET_1_DESC=[45x2x451x900] aermr01 (16-bit integer)
  SUBDATASET_2_NAME=NETCDF:"file.nc":aermr02
  SUBDATASET_2_DESC=[45x2x451x900] aermr02 (16-bit integer)
  SUBDATASET_3_NAME=NETCDF:"file.nc":aermr03
  SUBDATASET_3_DESC=[45x2x451x900] aermr03 (16-bit integer)
  SUBDATASET_4_NAME=NETCDF:"file.nc":aermr04
  SUBDATASET_4_DESC=[45x2x451x900] aermr04 (16-bit integer)
  SUBDATASET_5_NAME=NETCDF:"file.nc":aermr05
  SUBDATASET_5_DESC=[45x2x451x900] aermr05 (16-bit integer)
  SUBDATASET_6_NAME=NETCDF:"file.nc":aermr06
  SUBDATASET_6_DESC=[45x2x451x900] aermr06 (16-bit integer)
  SUBDATASET_7_NAME=NETCDF:"file.nc":aermr11
  SUBDATASET_7_DESC=[45x2x451x900] aermr11 (16-bit integer)
  SUBDATASET_8_NAME=NETCDF:"file.nc":so2
  SUBDATASET_8_DESC=[45x2x451x900] so2 (16-bit integer)
  SUBDATASET_9_NAME=NETCDF:"file.nc":z
  SUBDATASET_9_DESC=[45x2x451x900] geopotential (16-bit integer)
  SUBDATASET_10_NAME=NETCDF:"file.nc":t
  SUBDATASET_10_DESC=[45x2x451x900] air_temperature (16-bit integer)
  SUBDATASET_11_NAME=NETCDF:"file.nc":q
  SUBDATASET_11_DESC=[45x2x451x900] specific_humidity (16-bit integer)
  SUBDATASET_12_NAME=NETCDF:"file.nc":r
  SUBDATASET_12_DESC=[45x2x451x900] relative_humidity (16-bit integer)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)

You can import it by generating PAM files.

gdal_translate -sds file.nc file

Them remove the .aux.xml files with rm *.aux.xml; And, run this to import them all into the database

raster2pgsql file_* mySchema.mytable | psql -d database

This gets some kind of raster into the database. Now you have to start learning the PostGIS GDAL raster functions to see what and how you can work with it.. Using ST_MetaData we can see you have a 12 rasters with 90 bands (with ST_BandMetaData we can see they're 16bit unsigned).

SELECT rid, (metadata).* FROM foo.foo1 CROSS JOIN LATERAL st_metadata(rast) AS metadata;
 rid |     upperleftx     | upperlefty | width | height |      scalex       | scaley | skewx | skewy | srid | numbands 
-----+--------------------+------------+-------+--------+-------------------+--------+-------+-------+------+----------
   1 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   2 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   3 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   4 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   5 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   6 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   7 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   8 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
   9 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
  10 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
  11 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90
  12 | -0.200000003394614 |       90.2 |   900 |    451 | 0.400000006789228 |   -0.4 |     0 |     0 |    0 |       90

score 2 · Answer 2 · answered Feb 08 '17 at 21:11

Since you are asking about Postgtresql, you should install postgis. It will give you the geography type and all nice operators which will simplify your location-related conditions.

Then you will be able to define a table for each variable, with column types geography, datetime, whatever is appropriate for level and value, and maybe a serial integer to reference the record when you need it.

Sir Swears-a-lot · Answer 3 · 2017-07-10T07:53:43.747

1

As your data is already multi dimensional i would create a schema that follows along those lines using a datawarehouse star schema with a central fact table.

In your example the fact table would have the lat, long, pressure, timestamp, observed value and attributes.

If you wanted to extend your model to use an olap cube you could add a time dimension and others as needed.

edited Jul 10 '17 at 07:53

answered Feb 15 '17 at 17:50

Sir Swears-a-lot

3,253
3
30
48

score 1 · Answer 4 · answered Aug 06 '17 at 16:59

I'm not sure that you want a database. Like other tools, databases have strengths and weaknesses. The NetCDF format is designed to be more compact and lightweight than a database, to be portable across computer systems that may represent numbers differently, and to allow very fast random access to subsets of hierarchical data sets.

There are some very powerful command-line tools for accessing and manipulating NetCDF files. Have a look at NCO ([NetCDF Operators][1]) and CDO ([Climate Data Operators][2]). They support all of the operations you want to do (slicing the data in various ways, smoothing and averaging, etc.), and many more.

From what I have seen, a very common layout for a NetCDF file is to imagine the file as containing a set of layers, where each layer represents measurements of a single variable such as temperature on the same geographic grid. The layers are often indexed by a time or date variable. I believe that additional variables like your SO2 variable are just stored as additional layers, but I'm not sure how that works internally (multidimensional arrays or just bigger 3D arrays with additional indexing variables). From what I've read, the HDF4 and HDF5 formats allow even more hierarchy in the data. In any case, I would aim to create a structure that matches what people in your field use.

Best way to model a database structure to store multidimensional variables?

4 Answers4

UPDATE