micromark-abbr extension #181

richardTowers · 2024-09-05T08:38:46Z

Hello! First of all thank you for micromark - lovely code, really elegant approach to a really inelegant problem space 😅

I've been working on an extension to support the abbr syntax found in some markdown flavours (PHP Markdown Extra, kramdown). My code is here - https://github.com/richardTowers/micromark-abbr

Context

A bit of extra context so I don't do an XY problem.

In my day job, I work on the GOV.UK Publishing team. We have our own extension of markdown called Govspeak, which has numerous syntax extensions.

Govspeak is written in Ruby (based on Kramdown), which means it's not easy to run it in the browser. The implementation is also quite fragile and buggy, relying heavily on pre-processing the markdown with regex replacements.

I'm interested in re-implementing the Govspeak syntax in a more considered way in JavaScript, so that we can have a less buggy and fragile implementation that can be run in web browsers.

micromark / remark is my current preferred option, but I haven't dismissed using markdown-it or another framework yet. I also haven't dismissed giving up completely, but hopefully we won't get there 😅

Questions

I've got a minimally working example of an abbr extension now, which is exciting! Mostly I followed the tutorial in the README and heavily cribbed off the micromark-gfm-footnotes extension.

There are two areas where I had to do things which I wasn't very happy about...

Thing 1: Starting characters for abbr calls

The abbr syntax is horrible, in that there actually is no syntax for abbr calls. If you have:

The HTML specification is maintained by the W3C.

*[HTML]: Hyper Text Markup Language
*[W3C]:  World Wide Web Consortium

Then HTML and W3C should be parsed as abbrCall. If those labels aren't defined, then they're just text.

The difficulty here is that you don't know what characters might begin an abbrCall expression until you've parsed the document.

I worked around this by starting on all uppercase ASCII characters, but I think this is overly restrictive - a fully compliant implementation would work for lowercase letters, and also unicode labels. I did look at having the parser modify itself, so the text parser doesn't get defined until we've parsed the document and know what the labels are, but I couldn't make it work.

Do you have any suggestions on a nicer way to support that?

Thing 2: Hoisting events

As discussed in https://github.com/orgs/micromark/discussions/78 , definition is special, in that the HTML compiler reorders events so that definitions come first. This means that link references will always come after their definitions in the event list, which makes it possible for them to refer back to the data in the definition.

We need similar functionality for abbr definitions, but the syntax isn't quite the same as the syntax for link reference definitions:

<!-- A link reference definition -->
[micromark]: github.com/micromark/micromark "The micromark markdown parser / compiler"

<!-- An abbr definition -->
*[HTML]: Hyper Text Markup Language

The built-in definition construct won't parse abbr definitions, even if we consume the asterisk first, because it expects a URL-like thing after the colon and we want a run of text with whitespace in it. The quoted title bit is optional, so we can just ignore that.
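To make the difference concrete, here is a minimal sketch of recognising an abbr definition line with a regular expression, outside of micromark. The pattern and the name parseAbbrDefinition are assumptions for illustration only: a real micromark construct would be a character-by-character state machine, and Kramdown's exact rules (escapes, indentation) may differ.

```javascript
// Hypothetical line-level matcher for `*[label]: expansion`.
// The label is anything up to the first `]`; the expansion is free text,
// which is where this differs from a link reference definition.
const abbrDefinition = /^\*\[([^\]\r\n]+)\]:[ \t]*(.*)$/

function parseAbbrDefinition(line) {
  const match = abbrDefinition.exec(line)
  if (!match) return undefined
  return {label: match[1], title: match[2].trim()}
}
```

A link reference definition such as the one above does not start with an asterisk, so this sketch returns undefined for it.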

The way I've worked around this is by using definition and definitionLabelString states, even though technically these are a slightly different kind of thing. This means they get hoisted up to the start of the events list by the compiler, and I can use the data in abbr calls.

How ugly is this work around in your view? Is there already a better way to do it?

I guess I could not provide an HTML compiler in the micromark extension, and do the transformation on the AST in a later step in the mdast / remark / rehype chain.

Alternatively, would you consider some change to micromark to allow extension-defined events to be hoisted to the start of the events list?

End note

Just to be completely open - I'm also happy to hear answers of the form "we don't think it's sensible to build a micromark extension for abbr", or "even if abbr is welcome, some of those other things in your govspeak parser look like they would be too horrible to implement". I can always look at markdown-it and other parsers.

I've really enjoyed playing around writing this extension - it's the first time I've done something with parsers and compilers, and it's been fun.

Thanks!

Answered by wooorm

wooorm · 2024-09-05T10:35:58Z

wooorm
Sep 5, 2024
Maintainer

Hi Richard!

How interesting, Govspeak.

run in web browsers

What is the reason JS is needed for this? Could not a /govspeak endpoint that turns Govspeak into HTML be created?

I haven't dismissed using markdown-it or another framework yet.

Have you seen https://github.com/micromark/micromark#markdown-it? (deep link).

giving up completely

We do often recommend folks not do syntax extensions. Have you seen https://github.com/micromark/micromark#extending-markdown (deep link)?

Would it somehow be possible to switch to, say, a govspeak 2, which uses directives (a singular extension syntax for different extensions)?

and also unicode labels

What do you mean by “unicode labels”? Do you mean that any punctuation, whitespace, symbol, letter, number, anything, can be used as a label?

Do you have any suggestions on a nicer way to support that?

You might be able to ignore this: parse the definitions only. Then, when you have an AST, look for them.
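The "definitions first, calls later" idea can be sketched without micromark at all. This is a minimal sketch, assuming the definitions have already been collected into a Map in a first pass. The function names and the segment shape are hypothetical (not part of micromark or mdast), and the word-boundary rule is an assumption about the grammar rather than documented Kramdown behaviour. Matching is case-sensitive.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(value) {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical helper: split a text run into plain-text and abbreviation
// segments, given labels gathered from `*[label]: expansion` definitions.
function splitOnAbbreviations(text, definitions) {
  if (definitions.size === 0) return [{type: 'text', value: text}]

  // Longest labels first, so a longer label wins over a shorter prefix of it.
  const labels = [...definitions.keys()].sort((a, b) => b.length - a.length)
  // Assumption: a label only counts when it is not glued to a letter or digit.
  const pattern = new RegExp(
    '(?<![\\p{L}\\p{N}])(?:' +
      labels.map(escapeRegExp).join('|') +
      ')(?![\\p{L}\\p{N}])',
    'gu'
  )

  const segments = []
  let index = 0
  for (const match of text.matchAll(pattern)) {
    if (match.index > index) {
      segments.push({type: 'text', value: text.slice(index, match.index)})
    }
    segments.push({
      type: 'abbr',
      value: match[0],
      title: definitions.get(match[0])
    })
    index = match.index + match[0].length
  }
  if (index < text.length) {
    segments.push({type: 'text', value: text.slice(index)})
  }
  return segments
}
```

In an AST-based pipeline, something like this would run over each text node once the whole document has been parsed, so the set of starting characters never has to be known by the tokenizer.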

GFM email autolinks (asd [email protected] qwe) have a similar problem, but are limited to letters (like your ascii characters).
In Rust (which I made later, sort of like a 2nd iteration on the algorithm, and which is in the process of being ported back) I already look for @ afterwards and then go through the events.
See https://github.com/micromark/micromark-extension-gfm-autolink-literal/blob/14548d89b3cf6ba678d2b29f978d7b4cd4a08206/dev/lib/syntax.js#L77-L80.
You could do something like that, with a “resolver”. Whether * turns into emphasis/strong/etc, or [ and ]… into images/links, is also done with resolvers.

See micromark/packages/micromark-core-commonmark/dev/lib/attention.js, line 25 in 4bcb4cc: resolveAll: resolveAllAttention

What to go with, depends. Perhaps on the grammar too. Which of these works?

# W3C?

W3C? w3c?

`W3C`? *W3C*? **W3C**?

[W3C?](W3C? "W3C?") ![W3C?](W3C? "W3C?") <https://W3C?>

```W3C?
W3C?
```

*[HTML]: Do other abbreviations work? W3C?
*[W3C]: World Wide Web Consortium (and how about recursion? W3C?)

How ugly is this work around in your view? Is there already a better way to do it?

Pretty ugly! Likely to break at some point. I’d really recommend unique names for your things.
But, for a better way? 🤔

I guess I could not provide an HTML compiler in the micromark extension, and do the transformation on the AST in a later step in the mdast / remark / rehype chain.

Indeed, that could work. Still, for me, if there’s a micromark-extension-abbr, I would like it to work without the ASTs.

Alternatively, would you consider some change to micromark to allow extension-defined events to be hoisted to the start of the events list?

Right. If that’s needed, that must happen.

This reordering is not needed for footnote definitions though: https://github.com/micromark/micromark-extension-gfm-footnote/blob/0a62fad40470f2447707020c52d38d1494199ee1/dev/lib/html.js#L145 🤔

I do wonder though.
Perhaps there is also alternative HTML that can be generated.
Right now, you are using title attributes, right?
They have some problems: not guaranteed to show especially on mobile; no support for rich content.
There’s also a lot of reuse: if W3C is used 10 times, the title is repeated 10 times.
A few weeks ago I made rehype-twoslash, which generates “popovers” for TypeScript errors. With some CSS, JS, and a popover attribute.
Perhaps you can also generate <span data-id="abbr-w3c">W3C</span> for every “call”, and then turn the definition, wherever it appears, into the related corresponding markup. You don’t need the definition “higher” with that.
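A rough sketch of that idea, in plain string-building JavaScript. Only the data-id="abbr-w3c" shape comes from the suggestion above. The function names, the id derivation, and the div used for the definition are assumptions for illustration, not an existing micromark or rehype API.

```javascript
// Escape text for use in HTML content or attribute values.
function escapeHtml(value) {
  const map = {'&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;'}
  return value.replace(/[&<>"]/g, (character) => map[character])
}

// Hypothetical id scheme: lowercasing mirrors `abbr-w3c`.
// Note that it would merge labels that differ only by case.
function abbrId(label) {
  return 'abbr-' + label.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, '-')
}

// Every call only carries a reference, so it can be emitted before
// the definition has been seen.
function renderCall(label) {
  return '<span data-id="' + abbrId(label) + '">' + escapeHtml(label) + '</span>'
}

// The definition is rendered once, wherever it appears (assumed markup).
function renderDefinition(label, expansion) {
  return '<div id="' + abbrId(label) + '">' + escapeHtml(expansion) + '</div>'
}
```

Because a call never embeds the expansion, the compiler does not need definitions hoisted ahead of calls. CSS or JS can connect the two through the shared id afterwards.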

we don't think it's sensible to build a micromark extension for abbr

No, whether it’s recommended is one thing. And there might be really weird syntaxes one comes up with for which that is the answer. But abbreviations, stuff like that, should be possible.

even if abbr is welcome, some of those other things in your govspeak parser look like they would be too horrible to implement

Ah, right. Ehh. Well, then I need to review them 😅

Steps seems weird.
Buttons seems like you could do something with a fork of micromark-extension-directive.
Heading, I recommend suggesting a space after #, it doesn’t work without it in CommonMark.

I've really enjoyed playing around writing this extension - it's the first time I've done something with parsers and compilers, and it's been fun.

Glad to hear! ASTs are already tough for most people. Integrating into compilers even more so. Cool to hear that you enjoy it!

9 replies

richardTowers Sep 6, 2024
Author

I had a bit of a go trying to use resolveAll, but I haven't been able to find an appropriate place for it. Seems if I put it on the abbrDefinition tokenizer I don't get the right events (because we're only tokenizing contentInitial), but it only gets run on text if the tokenize call succeeds, and there's nothing I want to tokenize in text, other than the abbrCalls which I'm trying to use resolveAll for. Does that make any sense? I'm probably a bit confused.

I did a draft PR to show what I've tried: richardTowers/remark-abbr#1

wooorm Sep 9, 2024
Maintainer

So, interesting that no case normalization is performed for abbreviations in Kramdown. That does happen when matching URL references and definitions.
Numbers (at least ASCII digits) behave the same as letters.
Punctuation (at least ASCII - and +) can at least be used inside a word; your examples also add (, ), and &.
Your examples also add spaces! *[H(PHYR) Regulations]. That’s surprising?
Non-ASCII such as α and Bengali is fine.

Does that make any sense? I'm probably a bit confused.

No, that makes sense. That’s indeed a currently imposed limitation.
There’s no way around that now. An extension would somehow have to add its resolver to the list in micromark/packages/micromark/dev/lib/create-tokenizer.js (line 64 in 4bcb4cc: const resolveAllConstructs = []), but that isn’t supported yet.

It is possible to set resolvers from an extension to be called when a “span” is done (e.g., * and *, then you need to “resolve” the emphasis/links inside them). But that doesn’t support the “whole”:

/**
 * @import {Resolver} from 'micromark-util-types'
 */

import {micromark} from 'micromark'

const result = micromark('FBI and *CIA*.', {
  extensions: [{insideSpan: {null: [{resolveAll: resolveAbbreviationCalls}]}}]
})

console.log(result)


/** @type {Resolver} */
function resolveAbbreviationCalls(events, context) {
  for (const [kind, token] of events) {
    if (kind === 'exit') continue
    console.log([token.type, context.sliceSerialize(token)])
  }

  return events
}

richardTowers Sep 9, 2024
Author

Great, thanks for clarifying!

I'm currently planning on shooting for a remark-abbr extension instead. It will add a syntax-only extension to micromark which only covers abbreviation definitions, and an mdast extension to transform the abbreviation calls. I'm not planning on publishing the micromark extension separately to npm, because it's not very useful on its own. As part of the unified chain though, it should work just as well as any other extension (I think?). Work in progress on this branch (repo name will need to change at some point).

The zmarkdown folks already have a remark-abbr, but it hasn't been updated to use micromark yet. If I can get my version working to an acceptable standard, I'll see if they'll accept a PR.

richardTowers Sep 9, 2024
Author

If a future version of micromark adds some functionality that would make an html compiler for abbreviations doable without ugly workarounds, I'd be happy to try to help get the extension written.

wooorm Sep 9, 2024
Maintainer

Right, cool! Yes. I think it’s good to start with something that works. Make something. See if people want it. Most folks will use ASTs. We can see about plain micromark later! Do please let me know how it goes!
