Document usage with Rails streaming API #23
Comments
Are you suggesting adding an example of how to use this gem with ActionController::Live streaming to the README?
Oh, I think it should be documented, but I'm very undecided about the what and how. I'm not sure whether it should go directly in the README, as Rails is just one of the useful integrations. So I guess one of the other two variants might be more suitable. But in general I think it would be useful to advertise the possibilities, as this is probably the only Ruby XLSX library that provides streaming functionality.
@alexanderadam How about an …
@sandstrom sounds perfect 👍
I'm very busy right now, but I'm happy to accept a PR with an example. Either a file in the repo or a link to a gist would work for me, but including it in the repo probably makes it easier to keep up to date in the future.
Oh, just thought about it: an example could also live in the wiki, which is even easier to update than an example in the repo.
I would love to help. I have already opened a pull request for the README.md file.
@zmd94 Note that the example code I gave you does not use the Rails streaming API, which is what is being discussed here. The streaming API allows streaming a large XLSX file without ever writing to disk.
I was able to get streaming working using the following (in a controller method):

```ruby
def export
  sql = MyModel.where(something: 'test').where.not(something: 'else').to_sql

  headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
  headers['Content-disposition'] = "attachment; filename=\"#{filename}.xlsx\""
  headers['X-Accel-Buffering'] = 'no'
  headers['Cache-Control'] = 'no-cache'
  headers['Last-Modified'] = Time.zone.now.ctime.to_s
  headers.delete('Content-Length')

  # Return enumerator to stream response
  self.response_body = Enumerator.new do |enumerator|
    # Create the workbook
    xlsx = Xlsxtream::Workbook.new(enumerator, font: {
      # name: 'Times New Roman',
      # size: 10, # size in pt
      # family: 'Roman' # Swiss, Modern, Script, Decorative
    })

    # Write the worksheet
    header_written = false
    xlsx.write_worksheet(name: 'My Excel Sheet', auto_format: true) do |sheet|
      raw_connection = ActiveRecord::Base.connection.raw_connection
      raw_connection.send_query(sql)
      raw_connection.set_single_row_mode

      raw_connection.get_result.stream_each do |record|
        unless header_written
          sheet.add_row record.keys
          header_written = true
        end

        sheet.add_row record.values
      end

      raw_connection.get_result
    end

    xlsx.close
  end
end
```

Disclaimer: This is using a Postgres / pg specific raw connection API.
@felixbuenemann Does the above example @arcreative provided work, or would you prefer another approach?
Depending on how performance-critical your code is, you could also just use find_each or pluck_each (a gem) to stream the rows.
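For illustration, a minimal sketch of the find_each variant (not from this thread; it assumes a `User` model with `first_name`/`last_name`/`username` columns and a `sheet` block variable from `xlsx.write_worksheet`, as in the controller example above):

```ruby
# Sketch only: streams rows through the model layer in batches of 1000 records
# (find_each's default), so memory stays bounded. Slower than the raw-connection
# approach, but database agnostic.
User.where(org: org).find_each do |user|
  sheet << [user.first_name, user.last_name, user.username]
end
```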
@felixbuenemann Below is the sample code that I am planning to use for 300k–400k records. Please let me know if you have additional suggestions.
At 300–400K rows I'd suggest skipping the model overhead and going straight to arrays, using something like @arcreative proposed. You could also use the ActiveRecord in_batches API:

```ruby
User.where(org: org).in_batches do |batch|
  batch.pluck(:first_name, :last_name, :username).each do |row|
    sheet.add_row row
  end
end
```

Btw. your example code has a typo in the MIME type header (formates).
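For reference, the correct MIME type value, as already used in the controller example further up:

```ruby
headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
```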
I've done some benchmarking on a 440K row table and the solution from @arcreative is a lot faster than in_batches/pluck, at about 700 ms vs 5.6 s. The naive find_each took 6.8 s; I would've expected pluck to be much faster than going through the model. The performance of in_batches isn't so great since it first fetches all ids for the batch in one query and then feeds them via an IN query to anything called on the relation. This probably makes it easier to implement, but sucks for performance. At least it uses keyset pagination on the primary key instead of the infamous OFFSET, which gets slower the further you progress into the result, since the DB needs to fetch all rows up to the offset.
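A rough illustration of the query pattern described above, written out in plain ActiveRecord (this is a sketch, not the actual Rails implementation; the batch size of 1000 and column names are assumptions carried over from the earlier examples):

```ruby
# Roughly what `relation.in_batches { |batch| batch.pluck(...) }` boils down to:
# two queries per batch, which explains the benchmark gap.
last_id = 0
loop do
  # 1st query per batch: fetch the next primary keys (keyset pagination on id).
  ids = User.where(org: org).where("id > ?", last_id).order(:id).limit(1000).pluck(:id)
  break if ids.empty?
  last_id = ids.last

  # 2nd query per batch: the block's pluck runs against an IN (...) relation.
  User.where(id: ids).pluck(:first_name, :last_name, :username).each do |row|
    sheet << row
  end
end
```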
My streaming example should be about as fast as it can possibly get, since it's a single query without model or "array of records" involvement. It also has the advantage of not buffering more than a single row in memory at any given time. Not sure what the streaming API looks like for other database drivers.

Historically speaking, I got tired of iteratively optimizing our data export solution and just went straight to the ideal state. The only faster way to "get data" would be to have Postgres export gzipped CSV and stream that right to the client, which obviously doesn't work if you need an XLSX file. It might be faster to dump to a tempfile and use some other tool to convert to XLSX, but that would obviously be circumventing this gem entirely.
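For what it's worth, a rough, untested sketch of the analogous approach with the mysql2 gem's streaming mode (hypothetical: it assumes the mysql2 adapter, so that raw_connection is a Mysql2::Client, and reuses the `sql` string and `sheet` block variable from the controller example above; the stream:, cache_rows:, and as: options are from the mysql2 gem's documented query API):

```ruby
client = ActiveRecord::Base.connection.raw_connection # Mysql2::Client with the mysql2 adapter

# stream: true fetches rows lazily, cache_rows: false avoids keeping them in memory,
# as: :array yields plain arrays instead of hashes.
result = client.query(sql, stream: true, cache_rows: false, as: :array)
result.each do |row|
  sheet << row
end
# Note: with streaming enabled, the result must be fully iterated before the
# connection can run another query.
```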
@arcreative Your code is totally fine, I would probably do the same. The mysql2 gem also allows streaming a query: https://github.com/brianmario/mysql2#streaming I haven't tested it, since I try to avoid MySQL if I have the choice ;-) I only mentioned the alternatives for when the query result is small, so it doesn't really matter.

One thing your example seems to be lacking is error handling: it doesn't make sure the query result is fully consumed, using an ensure block or something like that, so the next query would fail (you need to call the last get_result before running another send_query on the same connection); see the sketch after this comment. For CSV import/export, PostgreSQL's COPY is also very fast.

For the record, here's a database-agnostic, non-streaming example using keyset pagination, which is comparably fast to the streaming version but uglier, and it assumes username is a unique column, since the result excludes the primary key:

```ruby
users = User.where(org: org).reorder(:username).select(:first_name, :last_name, :username).limit(10_000)
page = User.arel_table[:username]
conn = User.connection
username = ''

loop do
  rows = conn.select_rows(users.where(page.gt(username)).to_sql)
  break if rows.empty?
  username = rows.last.last
  rows.each { |row| sheet << row }
end
```

So if you can use streaming for query results then use it by all means, but if not, the above example might help with writing a fast query. If you don't know what keyset pagination is, I suggest reading https://use-the-index-luke.com/no-offset.

Btw. the Rails implementation of in_batches is so inefficient because it needs to be generic and cater for the case where a query on the batch relation excludes the primary key column, which it uses for keyset pagination.
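A rough sketch of the error handling suggested above (assumptions: the same pg raw connection, `sql` string, and `sheet` block variable as in the earlier controller example): drain any pending results in an ensure block so the connection stays usable even if writing the worksheet raises.

```ruby
raw_connection = ActiveRecord::Base.connection.raw_connection
raw_connection.send_query(sql)
raw_connection.set_single_row_mode

begin
  raw_connection.get_result.stream_each do |record|
    sheet << record.values
  end
ensure
  # Consume any remaining results so the next send_query on this
  # connection does not fail.
  while raw_connection.get_result; end
end
```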
Oh, you guys rock! This just saved us a TON of time! @arcreative, that snippet was gold! Please let me know if there is any way we can contribute! As far as I'm concerned, we owe you for the countless hours you saved us 😂
Hi @felixbuenemann,
first of all: xlsxtream is awesome! Thank you for that gem. I just found out that since version 2 you made it possible to stream with the Rails streaming API, and thought it might be good if this were documented as well.
Thank you and have a wonderful day!