
We have a relatively big production PostgreSQL database (~20 GB), hosted on Heroku.

I would like to copy a small subset of the table data to my local database so I can run some tests on it without having to work against production.

I do not want to generate sample data myself, but rather use the data which already exists in the production environment.

~100 rows from each table in the database would be sufficient. Is there an easy way to accomplish this?

jottr

5 Answers


I haven't used this tool myself, but Jailer promises to do just that:

Database Subsetting and Browsing Tool. Exports consistent, referentially intact row-sets from relational databases (JDBC)

http://sourceforge.net/projects/jailer/


Another option is something that I've only recently come across (granted, I've yet to use it, but plan to soon): rdbms-subsetter.

It is a little bit simpler and lighter than Jailer, with a few nice features/perks:

  • CLI so easy to wire up to existing tooling
  • Open source
  • Will follow foreign keys to fetch a coherent subset of data
    • If you don't have well-defined foreign keys, the relationships can be provided via a JSON config file. In my case, I plan to generate this config from schema metadata stored elsewhere (thanks, Rails :\)
  • You can target a specific row and get all related records (e.g. a specific customer is having issues, so you can pull down everything needed to make their account's data accessible locally)
  • It can either take a constant number of records per table or scale logarithmically, taking more data from bigger tables without going overboard (see the sketch below)
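As a rough sketch of what an invocation might look like (again, I haven't run this yet — both connection strings and the fraction are placeholders, and as far as I can tell from the README the destination schema must already exist):

# Sketch only: connection strings are placeholders, and the destination
# schema must already be created before running this.
$ rdbms-subsetter postgresql://user:pass@prod-host/prod_db \
      postgresql://localhost/my_test_db \
      0.001

A small fraction of a large table still yields a usable handful of rows; the logarithmic mode mentioned above (check rdbms-subsetter --help for the exact flag) evens that out across differently sized tables.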

It's worth mentioning another option for Heroku databases specifically, which I've used a lot (as I used to work there).

Heroku is actually pretty fast at bringing up a new snapshotted DB: it pulls the write-ahead logs first to prepare, then connects to the primary DB to catch up, and finally stops following it. You can create these "forks" across apps to avoid affecting production too much:

$ heroku addons:create heroku-postgresql[:specific-plan] \
    --fork your-production-app::DATABASE \
    --app some-other-app-to-own-forked-database \
    [--as PRODUCTION_SNAPSHOT]

Then, you can boot up your app locally, pointed at that snapshot. I use this frequently to do a dry run of data or schema migrations, or to debug customer issues.

In that command above, if you have a DB accessible with DATABASE_URL on your-production-app, you'll end up with a DB accessible with PRODUCTION_SNAPSHOT_URL (if you specified --as) on a different app named some-other-app-to-own-forked-database.
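If you want the data in a local Postgres database rather than pointing your app at the remote fork, heroku pg:pull can copy the snapshot down. A sketch, with placeholder names:

# production_snapshot_copy is a placeholder; pg:pull creates the
# local database, so it must not exist yet.
$ heroku pg:pull PRODUCTION_SNAPSHOT production_snapshot_copy \
    --app some-other-app-to-own-forked-database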

Bo Jeanes

This question is quite old and already has some good answers provided.

However, in recent years, several new products have been released to help you subset your database. Without delving too much into the details, these products allow you to make the configuration process more declarative, so that newly connected tables and columns are handled automatically. Additionally, you can embed them into your CI/CD pipeline.

I might be biased, but we have an awesome tool in this category called Synthesized TDK: https://docs.synthesized.io/tdk/latest/. There is a free community version that supports only Open Source databases, so it should work for you!

An example configuration that takes 10% of your production database and masks data (config.yaml):

default_config:
  mode: MASKING
  target_ratio: 0.1

table_truncation_mode: TRUNCATE
schema_creation_mode: CREATE_IF_NOT_EXISTS

It handles foreign keys by traversing a topologically sorted graph of tables. It also masks columns so that no sensitive data leaves production. You can customise masking at the column level; see the Masking Docs.

To run the tool, simply execute the corresponding command:

java -jar tdk.jar <connection options> -c config.yaml

If you have any questions, please join our community: Slack Community


Jailer can help in this situation. I am working on the same task as yours (trying to get ~100 records of each table), and here are the steps I took:

  • Find the root entity (the one associated with almost all tables) and get a subset of the records related to one root row (e.g. if the root is people, look up all the records related to people.id=1, and so on).
  • After finishing step 1, start again with another table (one you would like to have 100 records of) and get its subset from the result above.

tulh

To get an arbitrary set of rows, you can use LIMIT like this:

SELECT * FROM my_table LIMIT 100

This is the easiest approach and will just pick the first 100 rows that PostgreSQL comes across. They might be the 100 last inserted, or perhaps the 100 first. If you need something truly random, look at this Stack Overflow answer.
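If approximate randomness is enough, here are two standard PostgreSQL options (a sketch; my_table is the same placeholder as above — ORDER BY random() gives a uniform sample but scans the whole table, while TABLESAMPLE, available since PostgreSQL 9.5, is much cheaper but only approximate):

-- Shuffle the whole table, then keep 100 rows: simple, but slow on large tables
SELECT * FROM my_table ORDER BY random() LIMIT 100;

-- Read roughly 1% of the table's pages, then keep up to 100 rows (PostgreSQL 9.5+)
SELECT * FROM my_table TABLESAMPLE SYSTEM (1) LIMIT 100;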