Aligning text columns of different size and content

Question

In a past posting, I asked about commands in Bash to align text columns against one another by row. It has become clear to me that the desired task (i.e., aligning text columns of different size and content by row) is much more complex than initially anticipated and that the proposed answer, while acceptable for the past posting, is insufficient on most empirical data sets. Thus, I would like to query the community on the following pseudocode. Specifically, I would like to know if and in what way the following pseudocode could be optimized.

Assume a file with n columns of strings. Some strings might be missing, others might be duplicated. The longest column may not be the first one listed in the file, but shall be the reference column. The order of the rows of this reference column must be maintained.

> cat file  # where n=3; first row contains column headers
CL1 CL2 CL3
foo foo bar
bar baz qux
baz qux
qux foo
    bar

Pseudocode attempt 1 (totally inadequate):

Reorder columns so that they are ordered by size (i.e., longest column is first in matrix)
Rownames = strings of first column (i.e., of longest column)
For rownames
  For (colname among columns 2:end)
    if (string in current cell == rowname) {keep string in location}
    if (string in current cell != rowname) {
      if (string in current cell == rowname of next row) {add row to bottom of table; move each string of current column one row down}
      if (string in current cell != rowname of next row) {add row to bottom of table; move each string of all other columns one row down}
    }

Order columns by size:

> cat file_columns_ordered_by_size
CL2 CL1 CL3
foo foo bar
baz bar qux
qux baz 
foo qux 
bar
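
The ordering step above could be sketched in Python roughly as follows. This is only an illustration under assumptions not stated in the question: every cell starts directly under its column header, header names are unique, and "size" means the number of non-empty cells. The names `read_columns` and `order_by_size` are made up for this sketch.

```python
def read_columns(lines):
    """Split fixed-width lines into {header: [non-empty cells]}.

    Assumption: each cell starts at the same offset as its header.
    """
    header = lines[0].rstrip("\n")
    names = header.split()
    # Find the start offset of every header; a column runs until the next header.
    starts = []
    pos = 0
    for name in names:
        pos = header.index(name, pos)
        starts.append(pos)
        pos += len(name)
    ends = starts[1:] + [None]
    columns = {name: [] for name in names}
    for line in lines[1:]:
        line = line.rstrip("\n")
        for name, start, end in zip(names, starts, ends):
            cell = line[start:end].strip()
            if cell:  # skip missing strings
                columns[name].append(cell)
    return columns


def order_by_size(columns):
    """Return column names, longest column first; ties keep file order."""
    return sorted(columns, key=lambda name: len(columns[name]), reverse=True)


sample = [
    "CL1 CL2 CL3",
    "foo foo bar",
    "bar baz qux",
    "baz qux",
    "qux foo",
    "    bar",
]
columns = read_columns(sample)
print(order_by_size(columns))
```

On the sample file, CL2 has five strings, CL1 four and CL3 two, so CL2 becomes the reference column.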

Sought output:

> my_code_here file_columns_ordered_by_size
CL2 CL1 CL3
foo foo 
    bar bar
baz baz    
qux qux qux
foo
bar

score 0 · Answer 1 · answered Nov 12 '16 at 13:24

Ok. It took closer to an hour than to 10 minutes. Your requirements were also not completely specified (which is normal, but don't expect the result to be 100% complete). So here is a piece of code for you:

# Map each distinct string to an integer ID; ID 0 is reserved for empty cells.
tokens = {'':0}
tokenIndex = 0
tokenList = ['']
def addToken(token):
    global tokenIndex
    # Treat a cell that consists only of spaces as empty.
    if token.strip(" ") == '': token = ''
    if token in tokens: return tokens[token]
    tokenList.append(token)
    tokenIndex += 1
    tokens[token] = tokenIndex
    return tokenIndex
headers = []
widths = []
columnKeys = []
usage = []
rows = []
first = True
for line in open ("data"):
    if first:
        first = False
        pos = 0
        for token in line[:-1].split(" "):
            columnKeys.append([])
            headers.append(token)
            widths.append(pos)
            pos += len(token) + 1
            usage.append(0)
        widths.append(pos)
        continue
    column = []
    for i in range(1, len(widths)):
        token = addToken(line[widths[i-1]:widths[i]-1])
        if token != 0: usage[i-1] += 1
        column.append(token)
        columnKeys[i-1].append(token)
    rows.append(column)

# Order the column indices by usage (number of non-empty cells), most used first.
# The sort is numeric and stable, so columns with equal usage keep their file order.
sortedKeys = sorted(range(len(usage)), key=lambda i: usage[i], reverse=True)

line = headers[sortedKeys[0]]
for i in range(1, len(sortedKeys)):
    line += " " + headers[sortedKeys[i]]
print (line)

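# One output row per input row, keyed on the lead (most used) column.
for row in rows:
    token = row[sortedKeys[0]]
    mainToken = line = tokenList[token]
    for i in range(1, len(sortedKeys)):
        line += " "
        col = columnKeys[sortedKeys[i]]
        # Repeat the lead string if it occurs anywhere in this column, else pad with blanks.
        if token in col: line += mainToken
        else: line += " "*len(mainToken)
    print (line)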

The output is

CL2 CL1 CL3
foo foo    
baz baz    
qux qux qux
foo foo    
bar bar bar

which is hopefully a starting point for you to complete the work.
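
One possible way to get from there to the sought output is to merge the columns one at a time with `difflib.SequenceMatcher` from the standard library. This is a different technique from the code above and only a sketch: `align` is a made-up name, the columns are assumed to be already read into lists and ordered longest first, and when several matches are equally good the result depends on how `SequenceMatcher` breaks ties.

```python
from difflib import SequenceMatcher


def align(columns):
    """Align lists of strings row by row; columns[0] is the reference column."""
    rows = [[cell] for cell in columns[0]]
    for n, column in enumerate(columns[1:], start=1):
        # Each row is identified by its first non-empty cell.
        keys = [next(c for c in row if c) for row in rows]
        matcher = SequenceMatcher(None, keys, column, autojunk=False)
        merged = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Matching strings share a row.
                merged += [row + [cell]
                           for row, cell in zip(rows[i1:i2], column[j1:j2])]
            else:
                # Existing rows without a match get an empty cell ...
                merged += [row + [""] for row in rows[i1:i2]]
                # ... and unmatched new strings get rows of their own.
                merged += [[""] * n + [cell] for cell in column[j1:j2]]
        rows = merged
    return rows


names = ["CL2", "CL1", "CL3"]
cl2 = ["foo", "baz", "qux", "foo", "bar"]
cl1 = ["foo", "bar", "baz", "qux"]
cl3 = ["bar", "qux"]
rows = align([cl2, cl1, cl3])
for row in [names] + rows:
    print(" ".join(cell.ljust(3) for cell in row).rstrip())
```

The order of the reference column is kept, and rows are only ever inserted, never reordered. Whether the tie-breaking matches what you want on real data would need checking.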
