-
Notifications
You must be signed in to change notification settings - Fork 602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added fs2.text.linesWithEndings #2418
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code seems a positive contribution as is, so I am approving.
However, I would like to ask you to consider two possible optimisations:
Reuse existing streams.
At present, your new method is going to allocate a new string, with its new byte array, for each element in the output. However, suppose that your input contains the following strings (one per line):
Once upon a midnight dreary,\r while I pondered, weak and
weary,\r Over many a quaint and curious volume of forgotten lore\r
Here, the output should look as follows:
Once upon a midnight dreary,
while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore
If you notice, the first and last strings of output are each contained in the strings from input. One possible optimisation, thus, would be to find out which output-lines are entirely contained within one input line, and use the .substring
method to get them out, without the extra process of the StringBuilder
.
NOTE I also hoped that substring
would be reusing the underlying array, but that may not be the case... https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/lang/String.java#l1945
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had a task, where I needed to preserve different linebreaks. However, I believe there can be such a task.
Concerning the PR, I think that EOF
is unnecessary. I understand that the intention was to get rid of nasty Option
in (String, Option[LineEnding])
, but from my POV the introduction of EOF does more harm than good.
I thought a bit about adding line breaking symbols into the StringBuilder, but that would make it harder to implement lines
in terms of linesWithEnding
(we would have to look at the last and the second to the last symbols and then cut them off with substring).
So, I’d prefer this signature: def linesWithEndings[F[_]]: Pipe[F, String, (String,Option[LineEnding])]
to def linesWithEndings[F[_]]: Pipe[F, String, (String, LineEnding)]
, and the implementation of lines
that you’ve proposed
first: Boolean | ||
): Pull[F, (String, LineEnding), Unit] = | ||
stream.pull.uncons.flatMap { | ||
case None => if (first) Pull.done else Pull.output1(stringBuilder.result() -> LineEnding.EOF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the EOF remains, I think we should replace Pull.done
with Pull.output1("", LineEnding.EOF)
. Otherwise there would be no EOF at the end of the stream, which breaks "EOF at the end of the stream" consistency.
// case object CR extends LineEnding("\r") looking at the existing code this won't be produced so leaving it out for now | ||
case object LF extends LineEnding("\n") | ||
case object CRLF extends LineEnding("\r\n") | ||
case object EOF extends LineEnding("") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think that
EOF
is technically not aLineEnding
, but I can't come up with the appropriate name for the group(LF
,CRLF
,EOF
). - lines is not only about files, but about Stream[F, String]. Thus socket, pure streams are also consumable by
.lines
operator.
@diesalbla fwiw I spent about 2 hours in morbid curiousity trying to eek more performance including trying what you suggested in various incarnations and couldn't eek out any more performance. Any case where substring's were used were signficantly less performant. Honestly really surprised by this. I also wanted to see if I could get the performance we wanted using scanChunks but no success. Someone already worked really hard to get it to perform. I suspect the StringBuilder allocated on the call stack is very cheap along with the arrays that StringBuilder.toString generates. They show to be significantly cheaper than String.substring in the profiler I was using. @nikiforo we have the use case for this in http4s where in roundtrip unit tests it is needed to properly reify the input data with the expected and actual data. Agreed on the EOF I have similar mis-givings. Does EOS (end of string | end of stream) lessen it ? Suggested from the comment that lines is about streams not files. I toss and turn on the Option[LineEnding] versus having EOF as well. Chose EOF because it was less allocations and assuming all else is equal (which it isn't). |
If we went the option route, we'd probably need to cache instances of The Java socket API uses end-of-input and end-of-output. I think either EOF or EOS would be fine here. |
😃
Consider this line. |
So oddly enough I spent the morning hacking through ServerSentEvent unit tests in http4s. ServerSentEvent defines end event as an empty line terminated with a \n. The existing protocol decoder was using fs2.text.lines and in the absence of line endings assumed \n which created corner cases when combined with this related behaviour @nikiforo great point the little response I have for you there is the existing behaviour I mentioned of I have no strong opinion on EOS or Option[LineEnding] I can make both work. Being new here I am not sure how we come to consensus. There are more "exotic" ways to encode this (just throwing spaghetti against the wall),
Noting with this encoding you lose the obvious no two Text instances in a row but it does have legs. Another alternate is have the text be optional
You could arguably pass the @nikiforo test with Anyway I am having a bit of fun just to see if any bright ideas pop out. There is a practical need here. One needs the actual line endings if you are decoding protocols where \n \r\n \r and End are not considered equivalent Anyone else find it fascinating something as simple as this can get so intricate ? |
Worth noting that if we wanted to replace fs2.text.lines with
The performance hit using LinesBenchmark on my mac mini 2021 was down 3K ops/s from 96K ops/s to 93K ops/s. Which seems sensible as there are a few more allocations and one more map.
The unit test is a robotic copy of the test for lines.
So one open question I have is to replace fs2.text.lines or not.
While more LineEnding's could be expressed the current code doesn't support them so the effort was on expressing simply what the existing system does.