1

I am currently working on a sort-of a meta-modeler to build a free web service so that people can input data and run several models on that data.

The task I am currently struggling is: user needs to enter data column by column, which would consist of a number n of ID's , a number m of attributes and a number k of classes, with the conditions that n, m > 0 and k >= 0. Data is heterogeneous, so indexes can be both numeric or text, and the same goes for attributes and classes. I'm supposing there will be no null in the data for simplicity.

I'm currently thinking on:

  1. Creating a table with more than enough columns (all with null values), so that I can work using only the non-null columns (which will be got from user input). However this would limit the size of the datasets people could input.

  2. Resorting to create an specialized data structure on a programming language, do all the work there and finally, create a table dinamically to store the result data there.

  3. Using a database specialized for this kind of data (maybe a document-based DB).

  4. Create a data structure on the RDBMS itself (I'm using PostgreSQL), let's say a variable size array, so that I can create the table directly from the user input, using only 3 variable arrays (one for indexes, one for attributes and one for classes). However, I keep in mind that attributes and indexes could be of different types, so the array would have to support heterogeneous data type and I don't know if this is possible on a RDBMS or SQL.

I've been looking for information on information but got no result until now. Any guidances to a package, language library, extension or paper, thesis, technical report with relevant information would be appreciated. Also, personal experiences with doing something similar could be useful.


Edit:

Some example records:

ID NAME   CODE  HEIGHT WEIGHT       CLASS1     CLASS2
1  HARRY  100    170    70         SURVIVOR    VICTIM 
2  ALBUS  101    185    82      
4  REMUS  103    177    60

This is an example dataset where the intention is predicting if a person will survive or die based on its height and weight. The output should be lines consisting of SURVIVOR or VICTIM lines.
The ID's would be ID, NAME and CODE columns; HEIGHT and WEIGHT are the attributes and CLASS1 and CLASS2 are the classes.

The same structure should support something like this:

ID   CODE  INCOME  AGE  NUMBER_OF_HOUSES  CHILDREN   C1    C2   C3
1    101   9000    21       2                NO      HIGH  LOW  MED
2    110   5700    30       1                YES                
7    111   4000    37       0                TES

In this example, the idea is to preddict the future level of income of a person, based on it's actual age, income, number of houses and if they have children currently. The result is either HIGH, MED or LOW.

The data in this case would be: ID and CODE represent the ID's, AGE, NUMBER_OF_HOUSES, INCOME and CHILDREN represent the attributes, and C1, C2, C3 represent the classes.

The data structure should support both inputs, even though they have different number of columns for each category. And also, each one of the columns have different data type. The number of columns per category shoudn't be fixed.

1 Answers1

1

Given your "irregular" data, you might want to look at Magneto (see here and here). Originally designed for the clothing industry which can have many different products with many different attributes. It might just be a fit for your needs. There is a community edition so you can look at the code and (esp.) the database table structures and adapt to your own environment.

Be warned though! It is not a panacea - there are problems that come with the added flexibility. See my answer here which explains some of the pitfalls (be sure to follow the links) of this approach (known as EAV).

Vérace
  • 30,923
  • 9
  • 73
  • 85