
Imaginary example, simplified to better explain the question. Let's say I have a form with the following fields:

  • user email
  • who did you vote for in 2012
  • who did you vote for in 2016

When submitted, I fill the following PostgreSQL (v11) DB tables:

"UserList" table

  • Serial (auto-increment INT)
  • Email (text)

"UserData" table

  • Unique_Random (INT)
  • Vote_2012 (text)
  • Vote_2016 (text)

The entries are (as far as I can tell) unlinkable across tables because there is no relation between the Serial and the Unique_Random ints.
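
For reference, here is a minimal sketch of the schema (column and index names are illustrative, not the actual ones):

CREATE TABLE "UserList" (
    serial_id serial PRIMARY KEY,  -- auto-increment INT
    email     text
);

CREATE TABLE "UserData" (
    unique_random integer,  -- deliberately unrelated to serial_id
    vote_2012     text,
    vote_2016     text
);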

Threat model: an attacker gains full control of the PostgreSQL DB server (both hardware and software).

If I am correct, UserData entries are not inserted at any specific position when created, so it should not be possible to tell which one was added last.

Is there anything else (logs, data position on disk, memory, ...) that can reveal which entries were created or updated at the same time (and thus are linked)?

If so, what can be done to prevent this ?

Jane1386

2 Answers


Rows inserted at the same time, particularly in an insert-only use case such as yours, will very likely reside next to each other on disk and, as a result, will have tuple IDs (ctids) indicating that. They will remain so until you CLUSTER the table using your unique_random¹ column as the index key. You will need to run CLUSTER periodically to shuffle newly inserted rows.
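
For example (a sketch; the index name is my assumption):

CREATE INDEX userdata_random_idx ON "UserData" (unique_random);
CLUSTER "UserData" USING userdata_random_idx;  -- rewrites the table in index order, destroying insert-order locality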

Another bit of information that potentially correlates inserted rows is their transaction ID (XID), which remains unchanged after the original commit unless the rows are later updated by separate transactions. I don't believe CLUSTER changes row XIDs.
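
You can inspect this hidden system column yourself; rows created in the same transaction share the same xmin:

SELECT xmin, * FROM "UserData";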

Basically the same information (the transaction ID, along with the row data) will also be stored in the write-ahead log (WAL). WAL segments will get overwritten at some point, depending on your database's checkpoint frequency.
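
How long WAL segments stick around is governed by the checkpoint and WAL settings, for example:

SHOW checkpoint_timeout;  -- how often checkpoints happen by default
SHOW max_wal_size;        -- how much WAL may accumulate between checkpoints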

TL;DR: You probably shouldn't rely on Postgres internals for reliable deidentification of your data.


¹ Although: what's truly random cannot have guaranteed uniqueness. You probably meant some sort of hash.

mustaccio

As mustaccio explained, CLUSTER is a good way to remove physical traces of insert order. But rows do not necessarily remain in physical order afterwards: other commands like VACUUM and various write operations are free to move tuples around as Postgres sees fit. The physical order of rows is unreliable.
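
To see the physical position an attacker would look at, you can inspect the system column ctid (a sketch):

SELECT ctid, * FROM "UserData" ORDER BY ctid;  -- ctid = (block, tuple); neighbors were typically written together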

That aside, there is a simpler, more reliable way to identify rows inserted in the same transaction: the transaction ID in the system column xmin.

Your evil attacker can simply join on xmin:

SELECT *
FROM   "UserList" ul
JOIN   "UserData" ud ON ud.xmin = ul.xmin;  -- rows written in the same transaction match

What's worse (or better, depending on who's asking): since Postgres 9.5 you can also keep track of commit timestamps with the track_commit_timestamp setting and thereby identify rows committed at certain times:
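
With that setting enabled, something like this reveals commit times (a sketch):

SELECT pg_xact_commit_timestamp(xmin) AS committed_at, *
FROM   "UserData";  -- requires track_commit_timestamp = on; NULL for transactions committed before it was enabled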

Even when inserted in separate transactions, transaction IDs are assigned sequentially and may still leak information due to their proximity ...

Possible solution

To cover xmin tracks you might UPDATE periodically (like weekly?):

UPDATE "UserList" SET email = email;
UPDATE "UserData" SET vote_2012 = vote_2012;

This writes new row versions for the whole table, at full cost, without changing user columns. But it sets a new xmin, thus covering all tracks. You might follow up with CLUSTER (which also removes the table and index bloat left by rewriting all rows) and VACUUM ANALYZE to make it complete, as sketched below.
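
A sketch of the follow-up maintenance pass (the index name is my assumption from above):

CLUSTER "UserData" USING userdata_random_idx;  -- shuffles physical order, removes bloat; takes an exclusive lock
VACUUM ANALYZE "UserList";
VACUUM ANALYZE "UserData";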

If your tables are big, consider dropping all indexes before you do that and adding them back afterwards: cheaper overall.

Or just write new tables and drop the old ones, ordering rows randomly in the process (if you don't have many dependencies that would make this a pain). It is faster than the above and would serve the purpose perfectly. See the sketch below.
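
A sketch for one table (mind indexes, constraints, views, and privileges, which CREATE TABLE ... AS does not copy):

BEGIN;
CREATE TABLE "UserData_new" AS
SELECT * FROM "UserData" ORDER BY random();  -- new xmin for all rows, random physical order
DROP TABLE "UserData";
ALTER TABLE "UserData_new" RENAME TO "UserData";
COMMIT;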

Erwin Brandstetter