RFC: Guillemet Strings #53171
mnemnion started this conversation in RFC: features for discussion
-
This is a fun proposal, but without a PR I don't think anyone is going to respond much, and even then I am uncertain what triage would say about it. Bringing it to triage's attention first, then offering to do the work of making a PR to JuliaSyntax, would probably be the necessary path forward to get attention for this.
-
Julia's syntax is missing one article which I consider essential. If one works with strings in a certain way, the lack is glaring, but otherwise it's easy to miss.
I'm referring to strings with what I call the "enclosure property", which is a guarantee that any valid UTF-8 string can be made into a program string without any alteration to the body of the string itself. That is, it may be enclosed. In Julia's current syntax, such a string must be encoded, and the difference matters.
Examples include Here Documents, found in many languages, and Lua's long strings. I want to recommend against the HEREDOC approach for Julia: it complicates parsing and syntax highlighting, and the basic convention has accumulated a bunch of cruft.
Lua long strings aren't compatible syntax for Julia (`[[]]` is a `Vector{Vector{Any}}`), but they point in the right direction: a simple and regular wrapper for a string literal, one which may be applied programmatically by scanning the string for anything which looks like the closing token (`r"\]=*\]"`) and making sure that the enclosure has more equals signs than the longest such match.
I want to consider the simplest extension of Julia's syntax, so as to reject it. This would be "extra-long" multiline strings, so a string can start with e.g. `"""""` and then it must end with the same.
The problem there is that the existing `"""` strings don't have the enclosure property, for several reasons: not only must an internal `"""` be escaped, they support escaping, they interpolate, and they have special handling for indentation. All of this is convenient for many use cases, and I have no criticism to offer for the strings Julia does have. But all of these qualities work against the enclosure property.
That said, the "extra long string" with the `@raw_str` macro would have the enclosure property, so it's worth considering, but there's another wrinkle, which is that it would be a breaking change. `""""" string!"""` is a perfectly legal string, and under extra-long syntax it would become an unterminated string and break parsing. I don't even think it's that far-fetched to run into in the wild, since one of the reasons to pick a triple-quoted string is to avoid escaping double quotes.
Before I submit my proposal, I want to offer some perspective on why lacking this is an issue. It would be a convenience for embedding Python, certainly, where triple-quoted strings are normal for documentation. Code to programmatically embed Python in Julia source code would be substantially complicated by this lack. What's worse is that it's impossible to preserve a very useful invariant, namely: the hash of the string body is the hash of the string. A language armed with strings with the enclosure property can embed any syntax, of any length, in a hash-identical way. It happens this is something I want to do!
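To make the encode/enclose distinction concrete, here is a small sketch using only what Julia has today. `repr` stands in for "encoding" a string as an ordinary literal, and the `SHA` standard library supplies the hash. The sample text and the helper name `digest` are purely illustrative.

```julia
using SHA  # standard library

# The text we want to embed, exactly as it would sit in a file on disk.
# raw"..." is used only to spell the sample out; \" inside it means a plain ".
body = raw"price = \"$5\""

# Encoding: what must actually appear in source for an ordinary Julia literal.
# `repr` escapes the double quotes and the dollar sign.
encoded = repr(body)
literal_text = String(chop(encoded; head = 1, tail = 1))  # drop the outer quotes

digest(s) = bytes2hex(sha256(s))

# The program text no longer matches the original, so neither does its hash ...
@assert literal_text != body
@assert digest(literal_text) != digest(body)

# ... and the original only comes back by applying Julia's own literal rules.
@assert Meta.parse(encoded) == body
```

With an enclosed string, the text between the delimiters would be the body itself, so the hash could be taken directly over that span of the source file.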
My proposal is to add guillemets to Julia, as a special sort of string which has the enclosure property. I greatly respect Julia's embrace of Unicode; I know that it annoys some people, but the decision is firmly embedded in the language at this point. Guillemets are (as the link indicates) widely used to denote strings, and there are plenty of people who have keyboards where a `«` is easier to type than a `{` or a `~` or a `|`. It's available out-of-the-box on US keyboards in macOS and Linux, as well as most other Windows keyboards such as the US International and Canadian. Julia editors and the REPL could provide them as `\lg` for `«` and `\rg` for `»`.
The syntax: a string starting with `«` and ending with `»`. These nest, so `«this is «one» string»`; I think anything else would be surprising. A string with any number of `«««` terminates with the same number of `»»»`, and again, for consistency, the terminals of such a string also nest, but smaller runs of guillemets are ignored for this purpose: they do not have to balance. A leading newline is not part of the string, and if a string begins with `r"«[ ]+«"` or ends with `r"»[ ]+»"`, the first of those spaces (ASCII space only) is not part of the string. No interpolation, no indentation behavior, and no escaping, including escaping of guillemets.
That's it, the complete syntax fits in a paragraph. The slightly complex handling of leading spaces means that literally any string can be enclosed in guillemets, including edge cases like `" «this» "`, which is encoded like `« «this» »`. I would suggest that the contents must be valid UTF-8; the parser may already take care of this, but it's challenging to check this assumption with the tools I have at hand.
The current parser rejects guillemets, so this isn't a breaking change in the SemVer sense. I'm not clear on when it's considered ok to introduce syntax which is a parsing error in earlier versions, but I do know that the `public` keyword coming up in 1.11 is a case of that. Guillemet strings can't be macroed into older versions of Julia, but those interested in supporting older versions can simply continue to use the current panoply of strings, which are capable of encoding (but not enclosing!) any string, valid or otherwise.
I worry that some who read this proposal will see it as adding another fancy way to do something Julia already does, and want to emphasize that this is not the case. Enclosing a string byte-for-byte is different from encoding it, and hash comparison (useful for transclusion) is where the difference stands out. Figuring out where the string body begins and ends is not quite regex-simple, but it's pushdown-automaton simple, whereas under the status quo hash comparison requires, as a prerequisite, a complete implementation of one of Julia's string encoding rules, both for embedding and for comparison. The procedure to wrap a guillemet string is very simple: look for the longest run of `»»»`, check the beginning and end for space+ guillemets, and wrap. If the wrapper always includes a leading newline, a leading newline will be a part of the string body.
There are simpler approaches to strings with the enclosure property, ones where the terminals don't nest, but this proposal is eloquent. The strings look like strings. To me, this matters.
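The wrapping procedure described above can be sketched in today's Julia, treating the guillemet literal as plain text since the parser does not accept it yet. The names `longest_run`, `enclose`, and `unenclose` are hypothetical, and the sketch makes two assumptions that the proposal leaves open: the delimiter is one guillemet longer than any run of either guillemet in the body (it does not try to exploit nesting), and the one-space rule is read as "a body that begins with optional spaces followed by `«`, or ends with `»` followed by optional spaces, gets one padding space". Under this reading a body that already starts with a space before `«` picks up one additional space inside the delimiters; if the intended rule differs, only the padding lines change.

```julia
# Longest match of `re` in `s`, counted in characters (0 if there is none).
function longest_run(s::AbstractString, re::Regex)
    longest = 0
    for m in eachmatch(re, s)
        longest = max(longest, length(m.match))
    end
    return longest
end

# Wrap `body` in guillemets; the body itself is never altered, only padded
# with at most one character at each end.
function enclose(body::AbstractString)
    # Assumption: one more guillemet than any run found inside the body.
    n = 1 + max(longest_run(body, r"«+"), longest_run(body, r"»+"))
    lead = ""
    if startswith(body, "\n")
        lead = "\n"                  # the first newline is dropped on reading
    elseif occursin(r"^ *«", body)
        lead = " "                   # keeps the body's « apart from the opener
    end
    trail = occursin(r"» *\z", body) ? " " : ""  # same idea at the closing end
    return "«"^n * lead * body * trail * "»"^n
end

# Inverse of `enclose` for a single, complete literal (not a general parser).
function unenclose(literal::AbstractString)
    m = match(r"^«+", literal)
    m === nothing && throw(ArgumentError("literal must start with «"))
    n = length(m.match)
    endswith(literal, "»"^n) || throw(ArgumentError("unbalanced delimiters"))
    inner = chop(literal; head = n, tail = n)
    if startswith(inner, "\n") || occursin(r"^ +«", inner)
        inner = chop(inner; head = 1, tail = 0)   # drop the padding character
    end
    if occursin(r"» +\z", inner)
        inner = chop(inner; head = 0, tail = 1)   # drop the padding space
    end
    return String(inner)
end
```

For example, `enclose("plain")` gives `«plain»`, and `unenclose(enclose(s)) == s` is the round-trip property a wrapper of this kind needs to satisfy for every `s`.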
Strings with the enclosure property give certain guarantees which are otherwise impossible to provide. For another example, if one searches for a literal fragment of string in Julia source code where it's embedded using enclosed strings, one finds it if it's present. This is actually impossible with encoded strings, because there are a few ways to escape a character, so one needs complete knowledge of the convention followed by the encoder (be it person or machine). Strings with the enclosure property would allow application of patches with standard tools, they support diffs, and this is not a complete list of the advantages.
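The search claim can be checked with a short sketch. Here `repr` again plays the role of one possible encoder, the guillemet form is built as ordinary text because Julia cannot parse it today, and the embedded Python snippet and the variable names are illustrative only.

```julia
# A fragment someone might search for, spelled with raw"..." for readability;
# its actual content is:  print("total: $x")
fragment = raw"print(\"total: $x\")"
body = "def report(x):\n    " * fragment * "\n"

# Encoded as an ordinary Julia literal: quotes, `$`, and newlines get escaped.
encoded_source = "code = " * repr(body)

# Enclosed as proposed: the body appears verbatim between the guillemets.
enclosed_source = "code = «" * body * "»"

@assert !occursin(fragment, encoded_source)   # a plain text search misses it
@assert occursin(fragment, enclosed_source)   # a plain text search finds it
```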
On a cultural level, I feel that those who use guillemets natively to quote text would enjoy having that option available. Not a technical consideration, true, but neither is the often-encountered resistance to Unicode in source code, and I wanted to suggest that this choice would bring more joy to the world than it would annoyance. As always, the annoyed tend to be louder than the pleased, but that sort of bad behavior shouldn't constitute a veto. I also like the fact that this is the exact meaning of guillemets: their raison d'être is to denote strings.
I happen to think my proposal is the best way to add strings with the enclosure property to Julia, or I would have proposed something else. But there are many ways to do it. As long as this discussion leads to core realizing that this sort of string is important, and enclosed strings are added to Julia, I will consider this RFC a success.