Monday, April 20, 2009

WYSIWYG Strings

You may have noticed that many of my recent posts have been about programming language syntax. That's because I'm playing with a parser right now.

I've been thinking about WYSIWYG strings and heredoc strings.

There are at least a dozen different ways of denoting a string, each with its own limitations. The common quote-delimited strings with escape sequences become unreadable when there are too many escape characters, heredocs can't be written on a single line, and strings with exotic delimiters (like D's backquoted strings) are vulnerable to delimiter collision.

I like heredocs because their flexibility makes it possible to avoid delimiter collisions in just about every possible scenario (except random infinite strings). I also like multiple quoting, which looks like a compact variation of heredocs (except that we can't make a string with one of every possible character).

<<HEREDOC
some text
HEREDOC

qq^some text^
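
The collision-avoidance argument can be sketched in a few lines of Python. The helper below is hypothetical (the name pick_token and the default seed character are assumptions, not part of any language): it grows a token by repeating a seed until the token no longer occurs in the text. It always terminates for finite text, which is why only an infinite random string defeats the approach.

```python
def pick_token(text: str, seed: str = "#") -> str:
    """Return the shortest repetition of `seed` that does not occur in `text`.

    Hypothetical helper: illustrates how a flexible delimiter avoids
    collisions, not how any particular language chooses its tokens.
    """
    if not seed:
        raise ValueError("seed must be a non-empty string")
    token = seed
    # A token longer than the text cannot be a substring of it,
    # so this loop ends for any finite text.
    while token in text:
        token += seed
    return token
```

A fixed single-character delimiter has no such escape hatch: once the text contains that character, the only options are escaping or switching notation.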

I also like Ruby's "<<-" heredoc, which allows the end token to be indented, as well as YAML, which lets the body of text itself be indented.

  <<-HEREDOC
some text
  HEREDOC

  - text: |
      some text

How can we mix all of these together? Something like this (note the indentation)?

//Multiline:
//standard idiom
  @@
  some text
  @

//handling edge cases
  @HEREDOC
  email@email.com
  HEREDOC

  @@@
  email@email.com
  @@

  @#
  email@email.com
  #

//Single line:
//standard idiom
  @@:some text@

//handling edge cases
  @HEREDOC:email@email.comHEREDOC
  @@@:email@email.com@@
  @#:email@email.com#

So basically: an "@" followed by a heredoc token, followed by either a colon or a line feed, depending on whether you want a single-line or a multiline string. Indentation characters up to the level of the closing token are discarded from the actual string's byte array.
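
To make the rule concrete, here is a rough Python sketch of how a parser might read one such literal. Several things are assumptions of the sketch rather than part of the proposal: the function name parse_at_string, the rule that the token itself contains no colon or line feed, and the rule that in the multiline form the closing token must sit alone on its line (after optional indentation).

```python
def _strip_indent(line: str, width: int) -> str:
    """Drop up to `width` leading space/tab characters from `line`."""
    i = 0
    while i < width and i < len(line) and line[i] in " \t":
        i += 1
    return line[i:]


def parse_at_string(src: str) -> str:
    """Parse a single '@' literal that spans all of `src` and return its value."""
    if not src.startswith("@"):
        raise ValueError("literal must start with '@'")
    rest = src[1:]

    # The token runs up to the first ':' or line feed, whichever comes first.
    cut = min((i for i in (rest.find(":"), rest.find("\n")) if i != -1), default=-1)
    if cut <= 0:
        raise ValueError("expected a token followed by ':' or a line feed")
    token, separator, tail = rest[:cut], rest[cut], rest[cut + 1:]

    if separator == ":":
        # Single line: the body ends at the first occurrence of the token.
        end = tail.find(token)
        if end == -1:
            raise ValueError("unterminated single-line literal")
        return tail[:end]

    # Multiline (assumption): the closing token is alone on its line.
    lines = tail.split("\n")
    for n, line in enumerate(lines):
        stripped = line.lstrip(" \t")
        if stripped == token:
            indent = len(line) - len(stripped)
            # Discard indentation up to the closing token's level.
            return "\n".join(_strip_indent(body, indent) for body in lines[:n])
    raise ValueError("unterminated multiline literal")
```

With this sketch, both "@@:some text@" and the three-line form "@@", "  some text", "  @" evaluate to the same string, some text. Note that indentation is counted in characters here, which is exactly where tabs and spaces start to matter.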

Do those examples look readable? Would they play well with common indentation patterns in actual code? Would you be able to guess what the syntax is if you didn't know it?

Another random idea

How about we interpolate this heredoc notation into a regular string via an escape sequence?

"here's an example: \@@:"""some python doc comment about the backslash ("\")"""@"

"here's another: \@@@
"""this is about the at sign ("@")"""
@@"

I think making it available only as an escape sequence, rather than as a stand-alone token, would make the syntax more discoverable. The downside is, of course, that we need at least 7 characters to make a stand-alone heredoc string (compared to 4 with a stand-alone token).

"\@@:hello world@"

That in itself is not really a problem, just a minor annoyance. The bigger problem with my mix-of-everything idea is the whole thing about tabs and spaces. Unless we force an arbitrary rule for indentation (like YAML does), there's no good way of making indented heredocs work nicely when different people press tab on different text editors.

Also, what should happen when the text is indented less than the closing token? And how do we describe trailing whitespace at the end of a heredoc string?
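
The tab problem is easy to reproduce. The Python sketch below compares two plausible readings of "discard indentation up to the level of the closing token": counting characters, and counting visual columns with an assumed tab width. Both function names and the tab_width parameter are hypothetical, made up for the illustration.

```python
def strip_by_chars(line: str, closing_indent: str) -> str:
    """Drop as many leading whitespace characters as the closing token has."""
    i = 0
    while i < len(closing_indent) and i < len(line) and line[i] in " \t":
        i += 1
    return line[i:]


def strip_by_columns(line: str, closing_indent: str, tab_width: int = 4) -> str:
    """Drop leading whitespace up to the closing token's visual column."""
    target = len(closing_indent.expandtabs(tab_width))
    col = 0
    i = 0
    while i < len(line) and col < target and line[i] in " \t":
        if line[i] == "\t":
            # A tab advances to the next multiple of tab_width.
            col += tab_width - (col % tab_width)
        else:
            col += 1
        i += 1
    return line[i:]
```

Take a closing token indented with one tab and a body line indented with eight spaces. Counting characters leaves seven spaces in the string, counting columns with a tab width of 4 leaves four, and with a tab width of 8 leaves none. Three editors, three different strings. Both functions simply stop at the first non-whitespace character, which is one possible answer for text indented less than the closing token, but not an obviously right one.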

Maybe we should forget the YAML-like text indentation.

What if whitespace characters were allowed in the token?

Then this is valid

  var bla = @
---
my heredoc
---

And so is this

  var bla = @
  """
another heredoc here
  """

Here the heredoc tokens are "\n---" and '\n\u0020\u0020"""' respectively and there are no indentation problems. Cosmetically, in the second example, the non-whitespace tokens align with the rest of the code, since they are merely indented. The only caveat is that the actual heredoc string is outdented back to zero - perhaps that's better; after all, indenting it manually is what would break it.
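
A parser for this variant is short, because it never has to interpret indentation. The Python sketch below rests on a few assumptions that the examples leave open: the literal is an "@" at the end of a line, the token is the entire next line (leading whitespace included), the string ends at the next line that is exactly equal to the token line, and the line feeds around the body are not part of the string. The function name is made up.

```python
def parse_whitespace_token_string(src: str) -> str:
    """Parse an '@' literal whose token is the whole line after the '@'.

    `src` starts at the '@' and runs through the closing token line.
    """
    lines = src.split("\n")
    if len(lines) < 2 or lines[0] != "@":
        raise ValueError("expected '@' followed by a line feed and a token line")
    token_line = lines[1]  # e.g. '---' or '  """', whitespace is significant
    for n in range(2, len(lines)):
        # Only an exact match closes the string, so the same characters
        # at a different indentation are ordinary body text.
        if lines[n] == token_line:
            return "\n".join(lines[2:n])
    raise ValueError("unterminated literal")
```

Because the body is returned byte for byte, a line of three quotes at column zero inside a string delimited by an indented triple quote is just text, and no tab-width setting can change the result.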

I like how the second style looks. Good indentation practices are directly in line with what a parser would expect as correct syntax, and the position of the non-whitespace part of the tokens gives enough information about the string's indentation level relative to the rest of the code. Meanwhile, the body of the string remains intact (which is good, since it can often be a whitespace-sensitive copy-and-paste from somewhere else - e.g. a Python script). Another benefit of keeping the string intact is that it can be diffed in source control systems.

Agree? No? Maybe?
