
Produce expression2load.csv ordered by PK #30

Open
pcm32 opened this issue Mar 3, 2020 · 1 comment

pcm32 commented Mar 3, 2020

Currently, for many datasets, a large part of the load time goes to primary key creation, index generation and clustering. Maurizio suggests that if we load the data already sorted by the primary key, the time spent on those steps will drop considerably. Current timings for one experiment:

CREATE TABLE
Creating CSV for loading
real    4m21.571s
user    4m12.718s
sys     0m8.317s
Copy CSV to DB
real    2m10.332s
user    0m8.894s
sys     0m2.451s
SET
ALTER TABLE
RESET
SET
CLUSTER
ANALYZE
RESET
ALTER TABLE
ALTER TABLE
Post-processing done
INFO:  partition constraint for table "scxa_analytics_e_hcad_6" is implied by existing constraints
ALTER TABLE
Partition table loaded for experiment E-HCAD-6 succesfully.
real    66m44.613s
user    4m21.633s
sys     0m11.332s

So the overall time is about 66 min, of which creating the CSV takes about 4 min and the COPY operation about 2 min. That leaves roughly an hour for the post-processing steps, so for this example it is the indexing and clustering operations that take the longest.

@alfonsomunozpomer, is this something you could help us with? Would you expect the sorting to be faster in JavaScript, or should we just apply good old Unix sort once the file is generated?


pcm32 commented Apr 18, 2020

To add to this, psql \copy accepts stdin as well as files, which could be useful, provided the machine has enough memory, I guess.

Some PK times:

E-GEOD-139324    77 min
E-HCAD-10        36 min
E-HCAD-6         25 min

The easiest thing to try quickly would be to add the sort after the JavaScript code. If Jon sorts the matrix beforehand, then we would only need to sort the .mtx file before the JavaScript step.
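
For reference, a minimal sketch of what "sort after the file is generated" could look like in Python. Everything specific here is an assumption for illustration: the function name `sort_csv_by_key`, the column names `gene_id` and `cell_id`, and the presence of a header row are hypothetical, since the actual layout of expression2load.csv and the table's primary key are not described in this issue. It also sorts in memory, so it only fits files that the machine can hold in RAM; Unix sort would handle larger files with an on-disk merge.

```python
import csv
import io


def sort_csv_by_key(src, dst, key_columns, has_header=True):
    """Copy CSV rows from src to dst, sorted by the given key columns.

    key_columns holds column names when has_header is True, otherwise
    zero-based column positions. Returns the number of data rows written.
    The whole input is held in memory while sorting.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")

    if has_header:
        header = next(reader, None)
        if header is None:  # empty input, nothing to sort
            return 0
        writer.writerow(header)
        indexes = [header.index(name) for name in key_columns]
    else:
        indexes = list(key_columns)

    # Plain string comparison (by code point), column by column.
    rows = sorted(reader, key=lambda row: tuple(row[i] for i in indexes))
    writer.writerows(rows)
    return len(rows)


# Hypothetical sample; the real column names are not given in this issue.
sample = (
    "gene_id,cell_id,expression\n"
    "g2,c1,0.5\n"
    "g1,c2,1.5\n"
    "g1,c1,2.0\n"
)
out = io.StringIO()
count = sort_csv_by_key(io.StringIO(sample), out, ["gene_id", "cell_id"])
print(out.getvalue(), end="")
```

Because the function writes to any file-like object, `dst` could be `sys.stdout`, so the sorted rows could be piped into psql rather than written to a second file. One caveat: the comparison above is by code point, which may not match the ordering of the database's collation for text keys, so the benefit would need to be measured on a real experiment.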
