Hi everyone!
I need some help with string manipulation using sed and regex. I spent a whole day on trial and error and on searching the web for a solution, but it’s way beyond my abilities. Maybe there are some sed/regex gurus here who are willing to lend me a hand!
From everything I gathered around the web, this seems to be a rather complicated regex and sed substitution. Here we go!
What am I trying to achieve?
I have a lot of markdown guides that I want to host on a self-hosted, Forgejo-based git instance. However, classic markdown links are not the same as the ones GitHub/Forgejo uses…
Convert the following string:
[Some text](#Header%20Linking%20MARKDOWN.md)
Into
[Some text](#header-linking-markdown.md)
As you can see, these are the requirements:
- Pattern: [ ]( ) - only edit what’s between the parentheses
- Replace space (%20) with -
- Everything as lowercase
- Links are sometimes in nested parentheses
  - e.g. (look here [ ]( ))
- Do not change a line that begins with https (external links)
While everything is probably a bit complex as a whole the trickiest part is probably the nested parentheses :/
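The requirements above can be sketched in a few lines if a change of tool is acceptable. This is a minimal Python sketch, not a sed solution; the names `LINK` and `convert_links` are made up for illustration. It assumes link text contains no `]` and link targets contain no `)`. Because the pattern is anchored on `](`, a link wrapped in outer parentheses is still found, and bare parentheses that are not part of a link are left alone.

```python
import re

# Inline markdown link: [text](target).
# Assumption: no "]" inside the text and no ")" inside the target.
LINK = re.compile(r"(\[[^\]]*\]\()([^)]*)(\))")


def convert_links(line: str) -> str:
    """Lowercase internal link targets and turn %20 into '-'."""

    def fix(match: re.Match) -> str:
        target = match.group(2)
        # External links are returned unchanged.
        if target.lower().startswith(("http://", "https://")):
            return match.group(0)
        return match.group(1) + target.replace("%20", "-").lower() + match.group(3)

    return LINK.sub(fix, line)


print(convert_links("[Some text](#Header%20Linking%20MARKDOWN.md)"))
# prints: [Some text](#header-linking-markdown.md)
```

Since the check for `http` happens per link rather than per line, an external link and an internal link can share a line.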
What I tried
The furthest I got was the following:
sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase
sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -
These sed/regex substitutions are what I pieced together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, the second one changes every %20 occurrence in the file (on every line that doesn’t contain https), not just those inside links.
The closest solution I found on Stack Overflow looks similar, but I wasn’t able to fit it to my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.
I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative is very much appreciated :) Actually, any help is appreciated :D!
Thanks in advance.
This is very close
Example file:
[Some text](#Header%20Linking%20MARKDOWN.md)
(#Should%20stay%20as%20is.md)
Text surrounding [a link](readme.md#Other%20Page). Cool
Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)
But it doesn’t work if you have an http link and a markdown link on the same line, and it doesn’t work with
[escaped \] square brackets](#and-escaped-\)-parenthesis)
in the link. But it was fun!
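For the escaped-bracket case, one possible pattern treats a backslash plus the following character as a single unit, so `\]` and `\)` no longer end the text or the target early. This is a hedged Python sketch rather than a sed expression; `ESCAPED_LINK` and `find_links` are hypothetical names.

```python
import re

# Text:   any run of "backslash + any char" or chars other than "]" and "\".
# Target: any run of "backslash + any char" or chars other than ")" and "\".
ESCAPED_LINK = re.compile(r"\[((?:\\.|[^\]\\])*)\]\(((?:\\.|[^)\\])*)\)")


def find_links(line: str) -> list[tuple[str, str]]:
    """Return (text, target) pairs for every inline link on the line."""
    return [(m.group(1), m.group(2)) for m in ESCAPED_LINK.finditer(line)]


sample = r"[escaped \] square brackets](#and-escaped-\)-parenthesis)"
print(find_links(sample))
```

The same alternation can be dropped into a substitution pattern in place of the simpler `[^]]*` and `[^)]*` classes.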
Hello :) Sorry for the very late response!
Indeed, your regex comes very close for a one-liner, I’m pretty impressed! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments…). There are 2 things missing from your beautiful and complex regex:
FROM
---------------
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
TO
---------------
[Link with numbers](readme.md#1-3-this-is-another-test)

FROM
---------------
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
TO
---------------
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
Sorry for the trouble, I wasn’t aware of all the GitHub-Flavored Markdown syntax :/. With another user I ended up with a very cool script that works perfectly, but if you want to modify your regex and try to solve the issue in pure regex, feel free :) I’m very curious what it could look like (god, regex is so obscure, and at the same time it has some beauty in it!)
#!/bin/bash
files="/home/USER/projects/test.md"
mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
while IFS= read -r line; do
    #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
    dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
    sed -i "s/$line/${dashlink}/" "$files"
    #Puts everything to lowercase after a hashtag
    lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
    sed -i "s/$dashlink/${lowercaselink}/" "$files"
    #Removes spaces (%20) from markdown links after a hashtag
    spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
    sed -i "s/$lowercaselink/${spacelink}/" "$files"
done <<<"$mdlinks2"
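The three rules from the script above (dot between digits to dash, lowercase, %20 to dash), applied only to the part after the `#`, could also be sketched in Python. This is an illustration under assumptions, not a drop-in replacement for the script: `slug_fragment` and `convert_fragments` are made-up names, the link pattern ignores escaped brackets, and a lookaround is used so that headings with any number of levels (1.2.3) are covered.

```python
import re

# Assumption: no "]" inside the link text and no ")" inside the target.
LINK = re.compile(r"(\[[^\]]*\]\()([^)]*)(\))")


def slug_fragment(fragment: str) -> str:
    """Apply the three rewrite rules to the text after the '#'."""
    # 1.3 -> 1-3: only a dot that sits between two digits is replaced,
    # so a trailing ".md" survives.
    fragment = re.sub(r"(?<=\d)\.(?=\d)", "-", fragment)
    return fragment.replace("%20", "-").lower()


def convert_fragments(line: str) -> str:
    """Rewrite only the fragment of internal links; leave the file part alone."""

    def fix(match: re.Match) -> str:
        target = match.group(2)
        external = target.lower().startswith(("http://", "https://"))
        if external or "#" not in target:
            return match.group(0)
        path, _, fragment = target.partition("#")
        return f"{match.group(1)}{path}#{slug_fragment(fragment)}{match.group(3)}"

    return LINK.sub(fix, line)


print(convert_fragments("[Link with numbers](readme.md#1.3%20this%20is%20another%20test)"))
# prints: [Link with numbers](readme.md#1-3-this-is-another-test)
```

Splitting the target at the first `#` is what keeps `Another%20file%20to%20readme.md` untouched while the heading part is rewritten.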
Annotated, it works like this:
# use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much. we loop until it can't substitute any more
# label for looping
:loop;
# skip the following substitute command if the line contains an http link in markdown format
/\[[^]]*\](http/!
# capture each part of the link, and join it together with -
s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;
# if the substitution made a change, loop again, otherwise break
t loop;
# convert all insides to the link lowercase if the line doesn't contain an http link
/\[[^]]*\](http/!
# this is outside the loop rather than in the s command above because if the link doesn't contain %20 at all then it won't convert to lowercase
s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
Why do you assume there’s only one link per line?
Also, you perform substitutions on the whole URL instead of only on the fragment component.
They did not want external (http) links to be modified, as that would break them. For example,
[Example](https://example.com/#Some%20Link)
would turn into
[Example](https://example.com/#some-link)
I compromised by assuming it would be unlikely enough to have an external http link AND an internal link on the same line. You could probably still do it; my first thought was
[^h][^t][^t][^p]
but that would cause issues for #ttp and #A, so I just gave up. Instead, I think you’d want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.
That requirement I missed. I just assumed the filename would be replaced the same way too, lol. Not too hard to fix though :)
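The split, check, and join idea can be sketched without sed. The Python below is only an illustration of that approach; `LINK_SPLIT`, `is_external`, and `convert_mixed_line` are hypothetical names. It assumes no escaped brackets, and it rewrites the whole target the way the original one-liner did rather than only the fragment.

```python
import re

# The capturing group makes re.split keep the links in the result list:
# even indexes hold plain text, odd indexes hold one link each.
LINK_SPLIT = re.compile(r"(\[[^\]]*\]\([^)]*\))")


def is_external(link: str) -> bool:
    """True if the link target starts with http:// or https://."""
    target = link[link.index("](") + 2:-1]
    return target.lower().startswith(("http://", "https://"))


def convert_mixed_line(line: str) -> str:
    pieces = LINK_SPLIT.split(line)          # break the line apart, one link per piece
    for i in range(1, len(pieces), 2):       # visit only the link pieces
        if not is_external(pieces[i]):       # same external/internal check, per link
            text, target = pieces[i].split("](", 1)
            pieces[i] = text + "](" + target.replace("%20", "-").lower()
    return "".join(pieces)                   # join everything back together


line = "[a](#An%20A.md) and [NOT](https://example.com/URL%20Should.html) CHANGE%20"
print(convert_mixed_line(line))
# prints: [a](#an-a.md) and [NOT](https://example.com/URL%20Should.html) CHANGE%20
```

Because each link is judged on its own, there is no need for a `[^h][^t][^t][^p]` style workaround, and text outside links such as `CHANGE%20` is never touched.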