Is direct load or insert a better strategy when running parallel jobs in PostgreSQL?

by AKWF   Last Updated March 14, 2019 20:06 PM

In our application we run a background processor, Sidekiq, which lets us run hundreds of idempotent jobs against the database in parallel.

We have one job which can insert up to 3,000 rows into a very large table (60 columns, 7 indexes, 1.5 billion rows). Currently we have this set up as a direct load (COPY).

CSV data is built up in an array (never written to a file) and streamed to STDIN like so:

  ActiveRecord::Base.connection_pool.with_connection do |conn|
    rc = conn.raw_connection
    rc.copy_data "COPY apps FROM STDIN WITH CSV HEADER" do
      rows.each do |line|
        rc.put_copy_data line
      end
    end
  end

To generate the data for this direct load, we call nextval on the table's sequence once for each row. With 200 of these jobs running simultaneously, each inserting up to 3,000 rows, that is up to 600,000 sequence calls, which seems like a lot.
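One way to cut down those round trips, sketched below under assumptions (the sequence name `apps_id_seq` and the surrounding job code are hypothetical, not from our actual app), is to fetch a whole block of sequence values in a single query and zip them onto the rows before streaming the CSV:

```ruby
require "csv"

# Hypothetical sketch: build one SQL statement that returns `count`
# sequence values at once, instead of calling nextval per row.
# "apps_id_seq" is an assumed sequence name.
def bulk_nextval_sql(count, sequence: "apps_id_seq")
  "SELECT nextval('#{sequence}') FROM generate_series(1, #{count.to_i})"
end

# Prepend each pre-fetched id to its row and render a CSV line
# suitable for put_copy_data.
def rows_with_ids(ids, rows)
  ids.zip(rows).map { |id, row| CSV.generate_line([id, *row]) }
end
```

In the job this might look like (connection handling elided, and assuming the `pg` gem's `exec`/`column_values` API):

  ids   = rc.exec(bulk_nextval_sql(rows.size)).column_values(0)
  lines = rows_with_ids(ids, rows)
  rc.copy_data("COPY apps FROM STDIN WITH CSV") do
    lines.each { |l| rc.put_copy_data(l) }
  end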

Is this strategy a bad idea? All other things being equal, would we see better throughput by doing individual INSERTs instead?
