20

For all of these queries:

SELECT label FROM personal.storage_disks ORDER BY label ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "C" ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "POSIX" ASC;
SELECT label FROM personal.storage_disks ORDER BY label COLLATE "default" ASC;

The output is always: DISK 1, DISK 10, DISK 2, DISK 3, [...]
But, I want and expect: DISK 1, DISK 2, DISK 3, [...] DISK 10

I'm out of collations to try now according to SELECT * FROM pg_collation;... unless I'm supposed to use one of the many really weird ones with cryptic names. (I even tried a bunch of those with the same result.)

Please note that I've read the existing seemingly related SE questions as well as many articles on SORT BY, but they didn't help and didn't clear up anything for me.

I'm using PostgreSQL 12.4

Solomon Rutzky
  • 70,048
  • 8
  • 160
  • 306
user15080516
  • 745
  • 3
  • 8
  • 12

3 Answers3

35

Sorting strings naturally puts "15" before "2" because the first digit in the "15" is a "1", which sorts before "2". Sorting a "2" stored in a string type before the "15" can be done in a few ways. The most efficient way is to have the collation itself handle this internally. This option is not well known and not even available in most places, but any system that implements ICU (International Components for Unicode)** has the potential for allowing this type of sorting (as long as it allows for customizing the sort options), which is often referred to as "natural" sorting.

Handling natural sorting is often done programmatically by chopping the string into its pure alpha and numerical pieces and then sorting on those individually. This is a necessary evil in many cases, but fortunately, PostgreSQL (at least as of version 10), does allow for this internally. You would need to create a custom collation (this one even comes straight from their documentation):

CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');

and use that in the ORDER BY. Here is a working example of it:

https://dbfiddle.uk/?rdbms=postgres_11&fiddle=58763b51a8ccb2360cf387d8c2b91d51

NOTE

Due to collations not being generally well-understood, and the ability to customize them being more recent (for databases, at least) and even more esoteric, I would recommend doing the following when implementing this solution:

  1. Prefix the name of the new collation with "custom_" to increase awareness that this is indeed a custom collation that might have non-obvious behavior and might not exist on other systems (hence might need to be added to a system or application setup process)
  2. Place a comment just after each query that uses this custom collation, pointing out that it is a custom collation, and include a link to the official documentation:
    https://www.postgresql.org/docs/12/collation.html#id-1.6.10.4.5.7.5
    You could even mention that the -kn-true part enables "numeric" sorting.

BONUS

To more fully demonstrate how the "numeric" collation option works, I added some data to the previous example to show:

  • multiple / separate groups of numbers within a string are handled separately
  • different non-digit characters are handled as expected
  • leading 0's do not affect the result

The additional data is:

DISK 2A
DISK 2B
DISK 2B 33
DISK 2B 4
FILE 62
FILE 7
DIRECTORY 1000000
DIRECTORY 57
DIRECTORY 9999
DIRECTORY 57000
DIRECTORY 057
DIRECTORY 0057
DIRECTORY 52

Here is the updated example:

https://dbfiddle.uk/?rdbms=postgres_11&fiddle=20416b0dd731b2cc28b6fdee8ef70ec7


** To be fair, ICU / Unicode isn't "required" for doing this type of sorting, given that any collation or system can implement the same algorithm. However, it is built into ICU, and more and more systems are integrating ICU.

Solomon Rutzky
  • 70,048
  • 8
  • 160
  • 306
4

If you always have exactly one number in that label, I would convert it to a proper integer value by removing everything else and then sort by that:

select label 
from personal.storage_disks
order by nullif(regexp_replace(label, '[^0-9]', '', 'g'),'')::int

This obviously won't work if you have labels like Disk 1 of 10

2

DISK 1, DISK 10, DISK 2, DISK 3

This is not what you want, but it is the correct order when sorting text. Numbers are letters too, and they sort alphabetically. With quotes meaning 'this is a text literal', '10' < '2' indeed.

I want and expect: DISK 1, DISK 2, DISK 3, [...] DISK 10

This is pretty common. I think Windows Explorer does it.

If you want to order them alphabetically, then a quick solution is to do it like ISO dates: 'Disk 01' indeed comes before 'Disk 10' because alphabetically '0' < '1'...

If you don't want leading zeros in your labels, then you need to make a function that will add them on the fly. You could do that with plpgsql, or kludge a regexp:

select regexp_replace(
    regexp_replace(column1, '(\d+)', '000000\1')
    , '0*(\d{4})', '\1' ) r 
FROM (VALUES ('Drive 1'),('Drive 10'),('Drive 2'),('Drive 100000') )v 
ORDER BY r;
      r
--------------
 Drive 0001
 Drive 0002
 Drive 0010
 Drive 100000

This adds a fixed number of leading zeros to all numbers, then trims them back so the total number of digits is 4. You can add more digits if you want. The last line shows that it no longer works when the original number has too many digits, but hopefully you won't have more than a gorillion drives to sort. The regexp could need some touching up.

If this was python one could transform text strings into tuples or arrays like ['Drive',10] where the integers are actual ints, which would then sort properly. But you can't mix data types in a postgres array, hence the kludge.

If you want to use that, then put the regexp_replace() above as your ORDER BY argument, no need to put it in the select list, that was just for the sake of example, and to display its result.

bobflux
  • 1,776
  • 1
  • 9
  • 7