I have a lot of tables in Postgres, each with a lot of columns. I need to write a program that identifies pairs of columns that could be a potential primary key / foreign key (PK-FK) pair.
The way I'm achieving this right now is by iterating through every combination of columns and comparing them based on their distinct values. The pseudocode is something like this:
for (int i = 0; i < cols.size(); i++)
{
    // start j at i + 1 so each unordered pair is compared exactly once
    // (this replaces the isNotAlreadyVisited bookkeeping)
    for (int j = i + 1; j < cols.size(); j++)
    {
        QueryDistinctDataAndCompareCols(i, j);
    }
}
In the QueryDistinctDataAndCompareCols() method, I fetch the distinct values of both columns as sets, count the mismatches between them, and from that derive a score that indicates whether they are eligible to be a PK-FK pair.
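A simplified sketch of that method is below (JDBC; I pass explicit table/column names here instead of the indices from the pseudocode, and the containment-based score is just illustrative of the kind of scoring I do):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: identifiers are plain strings (no quoting),
// and the containment score stands in for my real scoring logic.
class PkFkScorer {

    // Scores how well (pkTable.pkCol, fkTable.fkCol) look like a PK-FK pair:
    // 1.0 means every distinct FK value also appears among the PK values.
    double queryDistinctDataAndCompareCols(Connection conn,
                                           String pkTable, String pkCol,
                                           String fkTable, String fkCol) throws SQLException {
        Set<String> pkValues = distinctValues(conn, pkTable, pkCol);
        Set<String> fkValues = distinctValues(conn, fkTable, fkCol);

        // mismatch count: candidate FK values with no match in the candidate PK column
        long mismatches = fkValues.stream().filter(v -> !pkValues.contains(v)).count();
        return fkValues.isEmpty() ? 0.0 : 1.0 - (double) mismatches / fkValues.size();
    }

    // One round trip to the database per column -- this is the expensive part.
    private Set<String> distinctValues(Connection conn, String table, String column) throws SQLException {
        Set<String> values = new HashSet<>();
        String sql = "SELECT DISTINCT " + column + " FROM " + table;
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }
}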
The problem with this approach is the sheer number of iterations: it takes far too long when there are many columns and tables, and I also have to query the database on every iteration to get the distinct values of the columns.
Hence, I'm looking for a faster, more scalable solution, either in the app tier or in the db tier.
In some cases there can be more than 5000 columns and 1000 tables, with more than 5000 distinct values per column, so the pairwise approach means roughly 5000 × 4999 / 2 ≈ 12.5 million comparisons, each of which hits the database.