Document usage with Rails streaming API #23

Open · alexanderadam opened this issue Mar 8, 2018 · 17 comments

Hi @felixbuenemann,

first of all: xlsxtream is awesome! Thank you for that gem.

I just found out that since version 2 it's possible to stream with the Rails streaming API, and I thought it might be good if this were documented as well.

Thank you and have a wonderful day!

@felixbuenemann (Owner)

Are you suggesting adding an example of how to use this gem with ActionController::Live streaming to the README?
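
For context, here is a rough idea of what such a README example could look like. This is only a sketch, not taken from the gem's documentation: it assumes the workbook only needs its output to respond to <<, as the Enumerator-based examples further down suggest, and bridges response.stream (which exposes #write) with a small hypothetical adapter.

class XlsxExportsController < ApplicationController
  include ActionController::Live

  # Hypothetical adapter: the workbook pushes chunks via <<,
  # while the Live response stream exposes #write.
  StreamSink = Struct.new(:stream) do
    def <<(chunk)
      stream.write(chunk)
      self
    end
  end

  def show
    response.headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    response.headers['Content-Disposition'] = 'attachment; filename="export.xlsx"'

    xlsx = Xlsxtream::Workbook.new(StreamSink.new(response.stream))
    xlsx.write_worksheet(name: 'Export') do |sheet|
      sheet << %w(id name)
      sheet.add_row [1, 'Alice']
    end
    xlsx.close
  ensure
    response.stream.close
  end
end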

alexanderadam (Author) commented Mar 9, 2018

Oh, I think it should be documented, but I'm undecided about the what and how.
Some gems document this in the README, some have a dedicated doc/ directory with more documentation, and some have an examples/ directory.

I'm not sure whether it should go directly into the README, as Rails is just one of the useful integrations. So I guess one of the other two variants might be more suitable.

But in general I think it would be useful to advertise the possibilities, as this is probably the only Ruby XLSX library that provides streaming functionality.

@sandstrom (Contributor)

@alexanderadam How about an examples/ directory plus a link from the bottom of the README.md?

@alexanderadam (Author)

@sandstrom sounds perfect 👍

@felixbuenemann (Owner)

I'm very busy right now, but I'm happy to accept a PR with an example.

Either a file in the repo or a link to a gist would work for me, but including it in the repo probably makes it easier to keep up to date in the future.

@felixbuenemann (Owner)

Oh, I just thought about it: an example could also live in the wiki, which is even easier to update than an example in the repo.

zmd94 commented Sep 4, 2018

I would love to help. I've already opened a pull request for the README.md file.

@felixbuenemann (Owner)

@zmd94 Note that the example code I gave you does not use the Rails streaming API, which is what's discussed here. The streaming API allows you to stream a large XLSX file without ever writing to disk.

arcreative commented Aug 11, 2020

I was able to get streaming working using the following (in a controller method):

    def export
      sql = MyModel.where(something: 'test').where.not(something: 'else').to_sql
      
      headers['Content-Type'] = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
      headers['Content-disposition'] = "attachment; filename=\"#{filename}.xlsx\""
      headers['X-Accel-Buffering'] = 'no'
      headers['Cache-Control'] = 'no-cache'
      headers['Last-Modified'] = Time.zone.now.ctime.to_s
      headers.delete('Content-Length')

      # Return enumerator to stream response
      self.response_body = Enumerator.new do |enumerator|

        # Create the workbook
        xlsx = Xlsxtream::Workbook.new(enumerator, font: {
          # name: 'Times New Roman',
          # size: 10, # size in pt
          # family: 'Roman' # Swiss, Modern, Script, Decorative
        })

        # Write the worksheet
        header_written = false
        xlsx.write_worksheet(name: 'My Excel Sheet', auto_format: true) do |sheet|

          raw_connection = ActiveRecord::Base.connection.raw_connection
          raw_connection.send_query(sql)
          raw_connection.set_single_row_mode
          raw_connection.get_result.stream_each do |record|
            unless header_written
              sheet.add_row record.keys
              header_written = true
            end

            sheet.add_row record.values
          end
          raw_connection.get_result
        end
        xlsx.close
      end
    end

Disclaimer: this uses Postgres/pg streaming syntax, so it will probably differ if you're using MySQL or another database.

@mohdasim8018

@felixbuenemann Does the above example @arcreative provided work, or would you prefer another approach?

@felixbuenemann (Owner)

> @felixbuenemann Does the above example @arcreative provided work, or would you prefer another approach?

Depending on how performance-critical your code is, you could also just use find_each or pluck_each (gem) to stream the rows.

mohdasim8018 commented Oct 11, 2020

@felixbuenemann Below is the sample code that I am planning to use for 300k-400k records.

 def get_users(org)
    headers["Content-Type"] = "application/vnd.openxmlformates-officedocument.spreadsheetml.sheet"
    headers['X-Accel-Buffering'] = 'no'
    headers["Cache-Control"] ||= "no-cache"
    headers.delete("Content-Length")

    self.response_body = Enumerator.new do |enumerator|
      # Create the workbook
      xlsx = Xlsxtream::Workbook.new(enumerator)

      xlsx.write_worksheet(name: 'Journey', auto_format: true) do |sheet|
        # Generate header columns
        sheet.add_row %w(firstname lastname username group_name journey_assignment_date journey_status journey_name)
        # Generate rows
        User.where("org=?", org).select('first_name, last_name, username').find_each do |user|
          sheet.add_row [user.first_name, user.last_name, user.username]
        end
        end
      end

      xlsx.close
    end
  end

Please let me know if you have additional suggestions.

felixbuenemann (Owner) commented Oct 12, 2020

At 300-400K rows I'd suggest skipping the model overhead and going straight to arrays, using something like what arcreative proposed.

You could also use the AR in_batches method combined with pluck to load rows in batches of 1000 straight into arrays:

User.where(org: org).in_batches do |batch|
  batch.pluck(:first_name, :last_name, :username).each do |row|
    sheet.add_row row
  end
end

Btw. your example code has a typo in the MIME type header (formates).

@felixbuenemann (Owner)

I've done some benchmarking on a 440K-row table and the solution from @arcreative is a lot faster than in_batches/pluck, at about 700 ms vs 5.6 s. The naive find_each took 6.8 s; I would've expected pluck to be much faster than going through the model.

The performance of in_batches isn't so great, since it first fetches all ids for the batch in one query and then feeds them via an IN query to anything called on the relation. This probably makes it easier to implement, but it's bad for performance. At least it does keyset pagination on the PK instead of using the infamous OFFSET, which gets slower the further you progress into the result, since the DB needs to fetch all rows up to the offset.

arcreative commented Oct 12, 2020

My streaming example should be about as fast as it can possibly get, since it's a single query without model or "array of records" involvement. It also has the advantage of not buffering more than a single row in memory at any given time. I'm not sure what the streaming API looks like for the mysql2 gem, but if you're using Postgres I would definitely go with my snippet.

Historically speaking, I got tired of iteratively optimizing our data export solution and just went straight to the ideal state. The only faster way to "get data" is to have Postgres export gzipped CSV and stream that right to the client, which obviously doesn't work if you need an XLSX file. It might be faster to dump to a tempfile and use some other tool to convert it to XLSX, but that would obviously circumvent this gem entirely.

@felixbuenemann (Owner)

@arcreative Your code is totally fine, I would probably do the same.

The mysql2 gem also allows streaming a query: https://github.com/brianmario/mysql2#streaming

I haven't tested it, since I try to avoid MySQL if I have the choice ;-)
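
If anyone wants to try it, a rough, untested sketch based on that mysql2 streaming documentation could look like the following; the connection settings are made up and the surrounding write_worksheet block (providing sheet) is assumed, as in the examples above.

require 'mysql2'

# Made-up connection settings. stream: true yields rows as they arrive from
# the server; cache_rows: false avoids keeping already-yielded rows in memory.
client = Mysql2::Client.new(host: 'localhost', username: 'app', database: 'app_db')
result = client.query('SELECT first_name, last_name, username FROM users',
                      stream: true, cache_rows: false)

result.each do |row|
  sheet.add_row row.values # sheet comes from a write_worksheet block, as above
end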

I only mentioned alternatives for when the query result is small, so it doesn't really matter.

One thing your example seems to be lacking is error handling: it doesn't make sure the query result is fully consumed (using an ensure block or something like that), so a subsequent query would fail (you need to call the last get_result before running another send_query on the same connection).
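
A sketch of how that could be handled, adapted from the snippet above; sheet and sql are assumed to be in scope and this is not meant as a drop-in replacement:

raw_connection = ActiveRecord::Base.connection.raw_connection
begin
  raw_connection.send_query(sql)
  raw_connection.set_single_row_mode
  raw_connection.get_result.stream_each do |record|
    sheet.add_row record.values
  end
ensure
  # Drain any remaining results so the next send_query on this connection works.
  while raw_connection.get_result; end
end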

For CSV import/export, PostgreSQL COPY is awesome. I've used it to import gigabytes of CSV data in Ruby at max speed.
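
For reference, a minimal export sketch using the pg gem's copy_data; the table, columns and connection settings are made up:

require 'pg'

conn = PG.connect(dbname: 'app_db') # made-up connection settings

File.open('users.csv', 'w') do |file|
  conn.copy_data('COPY (SELECT first_name, last_name, username FROM users) TO STDOUT WITH (FORMAT csv, HEADER)') do
    # get_copy_data returns nil once the server has sent everything.
    while (chunk = conn.get_copy_data)
      file.write(chunk)
    end
  end
end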


For the record, here's a database-agnostic, non-streaming example using keyset pagination. It's comparably fast to the streaming version, but uglier, and it assumes username is a unique column, since the result excludes the primary key:

users = User.where(org: org).reorder(:username).select(:first_name, :last_name, :username).limit(10_000)
page = User.arel_table[:username]
conn = User.connection
username = ''
loop do
  rows = conn.select_rows(users.where(page.gt(username)).to_sql)
  break if rows.empty?
  username = rows.last.last
  rows.each { |row| sheet << row }
end

So if you can use streaming for query results, then by all means use it, but if not, the above example might help with writing a fast query. If you don't know what keyset pagination is, I suggest reading https://use-the-index-luke.com/no-offset.

Btw. the Rails implementation of in_batches is so inefficient because it needs to be generic and cater for the case where a query on the batch relation excludes the primary key column, which it uses for keyset pagination.

@chaffeqa

Oh you guys rock! This just saved us a TON of time! @arcreative that snippet was gold!

Please let me know if there is any way we can contribute! As far as I'm concerned, we owe you for countless hours saved 😂
