Wednesday, April 3, 2013

Reaching the Limits of Regular Expressions Matching Code


To embed ICE Code Editor code samples in web pages, I realize that I am going to need to invent a mini browser-based templating “language”. Actually, it is not so much a language as a means to avoid embedding <script> tags inside of the ICE code sample.

I got my initial proof of concept working with:
<script type="text/ice-code">
[[body]][[/body]]
[[script src="http://gamingJS.com/Three.js"]][[/script]]
[[script src="http://gamingJS.com/Tween.js"]][[/script]]
[[script src="http://gamingJS.com/ChromeFixes.js"]][[/script]]
[[script]]
  // Actual three.js code here
[[/script]]
</script>
I then coded the ICE Code Editor to work through the contents of "text/ice-code" script tags to replace the double-square brackets with the equivalent angle brackets. It was simple regular expression replacement, which is what gives this approach its appeal. Unfortunately, it is ugly as can be. I do not anticipate doing this a ton, but still...
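That replacement can be sketched roughly as follows. This is a minimal sketch, not the actual ICE code: the bracketsToTags name is hypothetical, and it works on a plain string instead of the contents of a real script tag.

```javascript
// Hypothetical helper (not part of ICE): swap the double-square-bracket
// placeholders for the angle brackets that they stand in for.
function bracketsToTags(source) {
  return source.
    replace(/\[\[/g, "<").
    replace(/\]\]/g, ">");
}

var sample = '[[script src="http://gamingJS.com/Three.js"]][[/script]]';
console.log(bracketsToTags(sample));
// <script src="http://gamingJS.com/Three.js"></script>
```

One more strike against this approach: any JavaScript inside the sample that happens to contain [[ or ]] (a nested array literal, say) would get mangled by the same replacement.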

What I would love to have for ICE code (aside from being able to embed <script> tags) would be something like:
<script type="text/ice-code">
-body
-script src="http://gamingJS.com/Three.js"
-script src="http://gamingJS.com/Tween.js"
-script src="http://gamingJS.com/ChromeFixes.js"
...
</script>
The only difficulty is the ellipsis in there, which is supposed to contain the bulk of the Three.js (or whatever) code to be embedded in the page.

I think what I will try is something like this:
-script{
  // Actual three.js code here
-}script
I am not in the mood for introducing parsers and ASTs to this problem. I still hope to find an excuse to use them in more depth someday. For today, I think this ought to help.

The regular expression that I come up with is:
  sourcecode = script.innerText.
    replace(/^-(\w+)(.*)\{([\s\S]+)-\}.*$/gm, "\n<$1$2>$3</$1>").
    replace(/^-(\w+)(.*)$/gm, "<$1$2></$1>");
Man, I love regular expressions.
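To try the two replacements outside of the browser, something like the following should do. It is only a sketch: the iceToHtml name is made up for illustration, and the input is a plain string standing in for script.innerText.

```javascript
// Hypothetical wrapper around the two replace() calls from above.
function iceToHtml(text) {
  return text.
    replace(/^-(\w+)(.*)\{([\s\S]+)-\}.*$/gm, "\n<$1$2>$3</$1>").
    replace(/^-(\w+)(.*)$/gm, "<$1$2></$1>");
}

// A plain string standing in for the contents of the text/ice-code script tag.
var input = [
  '-body',
  '-script src="http://gamingJS.com/Three.js"',
  '-script{',
  '  // Actual three.js code here',
  '-}script'
].join("\n");

console.log(iceToHtml(input));
// <body></body>
// <script src="http://gamingJS.com/Three.js"></script>
//
// <script>
//   // Actual three.js code here
// </script>
```

The blank line before the last script tag comes from the "\n" at the start of the first replacement string.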

The second replace() is a little easier to follow. It processes the single-line entries, like the <script> tags with src attributes in my example, that stand for both the opening and closing tags. Single-line matching is accomplished by virtue of the ^ and $ inside the regular expression, which match the beginning and end of a line. This would not work without the m modifier, which tells the regular expression to treat the input as multiple lines rather than as a single string.
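A quick way to see what the m modifier buys is to run the same pattern with and without it. This is a small sketch with a made-up two-line input and a simplified pattern:

```javascript
// Two lines of made-up input.
var text = "-body\n-script";

// Without m, ^ and $ only match at the very start and end of the string,
// so nothing matches and the text comes back unchanged.
var withoutM = text.replace(/^-(\w+)$/g, "<$1>");

// With m, ^ and $ match at the start and end of every line.
var withM = text.replace(/^-(\w+)$/gm, "<$1>");

console.log(withoutM === text); // true
console.log(withM);
// <body>
// <script>
```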

As for the rest of the regular expression, the -(\w+) matches a string that starts with a dash and continues with word characters: letters, digits, or underscores (e.g. -script). The (.*) matches zero or more characters of any type, except newlines. This should slurp up the src attributes in the -script lines as well as the nothing in the -body line.

Both the alphanumerics after the dash (script, without the dash itself) and the everything-else matcher are grouped in parentheses so that the groups can be referenced in the replacement string: $1 for the first group, $2 for the second group, etc.
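To see exactly what lands in each group, the same pattern can be run through exec() instead of replace(). A small sketch, using lines from the example above:

```javascript
var pattern = /^-(\w+)(.*)$/;

var scriptLine = pattern.exec('-script src="http://gamingJS.com/Three.js"');
console.log(scriptLine[1]); // script
console.log(scriptLine[2]); //  src="http://gamingJS.com/Three.js"

var bodyLine = pattern.exec('-body');
console.log(bodyLine[1]);        // body
console.log(bodyLine[2] === ""); // true
```

The second group keeps its leading space, which is why the replacement string can get away with <$1$2> rather than <$1 $2>.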

The most obvious difference between the first and second replace() calls is the explicit match on curly braces. That is fairly superficial in the regular expression. The thing that took me a bit to sort out was how to match everything between the -script{ and the -}. Originally, I had meant to use the period matcher to match everything. This will not work in this case because this content contains newlines. Since the JavaScript regular expressions that I am working with have no match-everything-including-newlines matcher, I was forced to cheat with [\s\S]. The square brackets are a character class: any character will match if it matches anything in the square brackets. In this case, I match anything that is whitespace (\s) or anything that is not whitespace (\S). In other words, I match everything, including newlines.
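The difference between the period and [\s\S] is easy to check on a small multi-line string. A minimal sketch with made-up input:

```javascript
// A made-up block whose contents span multiple lines.
var block = "{\n  code\n}";

// The period refuses to cross the newline right after the opening brace.
var dotMatches = /\{(.+)\}/.test(block);

// [\s\S] matches whitespace or non-whitespace, so newlines are fair game.
var everything = /\{([\s\S]+)\}/.exec(block);

console.log(dotMatches);                    // false
console.log(JSON.stringify(everything[1])); // "\n  code\n"
```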

I have crafted the regular expression so that the closing line can optionally omit the tag name. In fact, it does not matter whether the text after the closing curly brace matches the tag before the opening curly brace. The following three code samples are all equivalent:
// Close tag matches opening tag
-script{
  // Actual three.js code here
-}script

// No close
-script{
  // Actual three.js code here
-}

// Close does not match
-script{
  // Actual three.js code here
-}asdf
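A quick check of that claim, as a sketch (the convertBlock name is hypothetical, and only the first replace() is needed for it):

```javascript
// Hypothetical helper wrapping just the curly brace replacement.
function convertBlock(text) {
  return text.replace(/^-(\w+)(.*)\{([\s\S]+)-\}.*$/gm, "\n<$1$2>$3</$1>");
}

var opening = "-script{\n  // Actual three.js code here\n";

var matchingClose = convertBlock(opening + "-}script");
var noClose       = convertBlock(opening + "-}");
var wrongClose    = convertBlock(opening + "-}asdf");

// Whatever follows -} is swallowed by the trailing .* and then ignored.
console.log(matchingClose === noClose && noClose === wrongClose); // true
```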
It is also worth noting that if a text/ice-code entry contains multiple curly brace blocks, then things will almost certainly fail: the [\s\S]+ is greedy, so it will run from the first opening curly brace all the way to the last -}. This will not be a big deal for me, but I may need an AST should I want widespread adoption.
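The failure is easy enough to reproduce. The sketch below feeds two made-up blocks through the curly brace replacement. It also tries a lazy [\s\S]+? variant, which is not what I am using, just one possible tweak that seems to cope with this particular input.

```javascript
// Two made-up curly brace blocks in a single entry.
var twoBlocks = [
  "-script{",
  "  var a = 1;",
  "-}",
  "-script{",
  "  var b = 2;",
  "-}"
].join("\n");

// The original, greedy pattern: [\s\S]+ swallows everything up to the last -}.
var greedy = twoBlocks.replace(/^-(\w+)(.*)\{([\s\S]+)-\}.*$/gm, "\n<$1$2>$3</$1>");

// Hypothetical lazy variant: [\s\S]+? stops at the first -} that it finds.
var lazy = twoBlocks.replace(/^-(\w+)(.*)\{([\s\S]+?)-\}.*$/gm, "\n<$1$2>$3</$1>");

// Count how many <script> tags each version produced.
function countScripts(html) {
  return html.split("<script>").length - 1;
}

console.log(countScripts(greedy)); // 1
console.log(countScripts(lazy));   // 2
```

With the greedy version, the first -} and the second -script{ end up stranded inside a single script tag. The lazy version would still trip over any -} that shows up inside the embedded code itself, so it is no substitute for a real parser.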

Anyhow, my stripped down code “document” now attaches the ICE Code Editor:



This seems a good stopping point for the night. I have the nagging suspicion that I ought not use regular expressions here. Then again, maybe I call this a proof of concept and move on to other unexplored features of ICE embedded in pages.

Day #711
