
I have java code trimming a UTF-8 string to the size of my Oracle (11.2.0.4.0) column which ends up throwing an error because java and Oracle see the string as different byte lengths. I've verified my NLS_CHARACTERSET parameter in Oracle is 'UTF8'.

I wrote a test that illustrates my issue below, using the Unicode chipmunk emoji (🐿️):

public void test() throws UnsupportedEncodingException, SQLException {
    String squirrel = "\uD83D\uDC3F\uFE0F"; // U+1F43F CHIPMUNK + U+FE0F VARIATION SELECTOR-16
    int squirrelByteLength = squirrel.getBytes("UTF-8").length; // this is 7
    Connection connection = dataSource.getConnection();

    connection.prepareStatement("drop table temp").execute();

    connection.prepareStatement("create table temp (foo varchar2(" + String.valueOf(squirrelByteLength) + "))").execute();

    PreparedStatement statement = connection.prepareStatement("insert into temp (foo) values (?)");
    statement.setString(1, squirrel);
    statement.executeUpdate();
}

This fails on the last line of the test with the following message:

ORA-12899: value too large for column
"MYSCHEMA"."TEMP"."FOO" (actual: 9, maximum: 7)

The setting of NLS_LENGTH_SEMANTICS is BYTE. Unfortunately, I cannot change this as it is a legacy system. I'm not interested in increasing the column size, just reliably being able to predict the Oracle size of a string.

agradl

2 Answers


What follows is my speculation.

Java Strings are represented internally in the UTF-16 encoding. When you call getBytes("UTF-8"), Java converts between the two encodings, and on an up-to-date Java platform that conversion yields the 7 bytes you observed.
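To make the distinction concrete, here is a minimal sketch (using the same test string as the question) that counts UTF-16 code units, Unicode code points, and UTF-8 bytes:

```java
import java.nio.charset.StandardCharsets;

public class SquirrelEncoding {
    public static void main(String[] args) {
        // U+1F43F (a surrogate pair in UTF-16) followed by U+FE0F VARIATION SELECTOR-16
        String squirrel = "\uD83D\uDC3F\uFE0F";
        System.out.println(squirrel.length());                                // 3 UTF-16 code units
        System.out.println(squirrel.codePointCount(0, squirrel.length()));    // 2 code points
        System.out.println(squirrel.getBytes(StandardCharsets.UTF_8).length); // 7 UTF-8 bytes
    }
}
```

The 7-byte figure is 4 bytes for the supplementary chipmunk code point plus 3 bytes for the variation selector.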

When you attempt to store a Java String in the database, Oracle also performs conversion between the Java native UTF-16 and the database character set as determined by NLS_CHARACTERSET.

The chipmunk character was approved as part of the Unicode standard in 2014 (according to the page you linked), while the latest release of Oracle 11g rel.2 was published in 2013.

One might assume that Oracle uses a different or outdated character conversion algorithm, so the byte representation of 🐿️ on the server (9 bytes) is different from what getBytes() returns on the client (7 bytes).

I guess that to resolve this issue you could upgrade your Oracle server or switch the database to a character set with full supplementary-character support, such as AL32UTF8.

mustaccio

The issue is with Oracle's handling of supplementary Unicode characters when NLS_CHARACTERSET is UTF8.

From the documentation (emphasis added):

The UTF8 character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms.

Supplementary characters inserted into a UTF8 database do not corrupt the data in the database. A supplementary character is treated as two separate, user-defined characters that occupy 6 bytes. Oracle recommends that you switch to AL32UTF8 for full support of supplementary characters in the database character set.

Additionally, the last code point in the squirrel string is a variation selector, which is optional; I verified this with a Unicode character inspector.
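Putting the quoted rule together with the error message: Oracle's legacy UTF8 character set charges 6 bytes for the supplementary chipmunk code point (two 3-byte surrogate encodings, CESU-8 style) plus 3 bytes for U+FE0F, for a total of 9 — exactly the "actual: 9" in ORA-12899. If you cannot change the character set, that length can be predicted client-side. The following is a sketch under that assumption; `oracleUtf8Length` is a hypothetical helper, not an Oracle or JDBC API:

```java
import java.nio.charset.StandardCharsets;

public class OracleUtf8Length {
    // Predicts the byte length Oracle's legacy UTF8 (CESU-8-like) character set
    // assigns to a Java string: supplementary code points cost 6 bytes (each
    // surrogate encoded as 3 bytes), BMP code points cost their normal UTF-8
    // length of 1-3 bytes.
    static int oracleUtf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80)         bytes += 1;
            else if (cp < 0x800)   bytes += 2;
            else if (cp < 0x10000) bytes += 3;
            else                   bytes += 6; // supplementary: two surrogates, 3 bytes each
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String squirrel = "\uD83D\uDC3F\uFE0F";
        System.out.println(oracleUtf8Length(squirrel));                       // 9 (Oracle UTF8)
        System.out.println(squirrel.getBytes(StandardCharsets.UTF_8).length); // 7 (standard UTF-8)
    }
}
```

Trimming against this predicted length, rather than against `getBytes("UTF-8").length`, should keep the client and a BYTE-semantics UTF8 database in agreement.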

After changing the database's NLS_CHARACTERSET parameter to AL32UTF8 the test passed.

agradl