scsh-users

Regexp notation

To: shivers@ai.mit.edu
Subject: Regexp notation
From: Jim Blandy <jimb@cyclic.com>
Date: Wed, 8 Jan 1997 14:54:09 -0500
Cc: scsh@martigny.ai.mit.edu

>    (* <regex> ...)            0 or more matches

I'm not sure how to interpret the ellipsis here.  Is (* A B)
equivalent to (* (& A B)), or (* (* A) (* B)), or something else?
It seems to me that * should be a unary operator.


>    (LET* ((<name> <regex> ...) ...)
>      <regex> ...)
>    <name>                     Named regex use

Does this have a big advantage over Scheme's normal quasiquote
mechanism?  Why should I write

        '(let* ((digit (in (- "09"))))
          (& (+ digit) ":" digit digit))

when I could write

        (let ((digit '(in (- "09"))))
          `(& (+ ,digit) ":" ,digit ,digit))

The former is superficially simpler, but actually more complicated,
because it makes the meaning of a regexp depend on an environment.
This is a mess.  I see your 2nd note, about the "primitive regexps",
as a sign that we're already encountering hygiene problems: "Are
bindings in regexps statically or dynamically scoped?"  Ugh.

The latter expression is ordinary Scheme code, whose semantics are
already explained and tested.
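
As a rough sketch of what "ordinary Scheme code" buys us: a regexp
that is just a quoted list can be abstracted over with a plain
procedure.  The names clock-time and decimal-digit below are made up
for illustration; they are not part of the proposed notation.

```scheme
;; Hypothetical helper: build a clock-time regexp from whatever
;; regexp the caller supplies for a single digit.
(define (clock-time digit)
  `(& (+ ,digit) ":" ,digit ,digit))

;; The digit regexp from the example above, bound by ordinary DEFINE.
(define decimal-digit '(in (- "09")))

(clock-time decimal-digit)
;; => (& (+ (in (- "09"))) ":" (in (- "09")) (in (- "09")))
```

The result is a closed regexp with no names left in it, so the
question of how names inside a regexp are scoped never comes up.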

The only advantage I can see to giving the regexp notation its own
LET* is that some repetitions are made explicit, and the back end
could perhaps do some optimizations guided by that information.  But
the information is not hard to derive, and I don't know of any extant
regexp back ends that could take advantage of it anyway.


Suppose we strike LET*.  Then, since a symbol is not a valid regexp,
it would be unambiguous to make concatenation (what you call
"Sequence", I think?) implicit.  That is, we could write ("a" (* "b")
"c") for "ab*c".  Concatenation is the most common operator, and I
think this would also make it easier to guess that (* A B) is (* (A
B)) (if that is indeed what you meant...).
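
To make the implicit-concatenation reading concrete, here is a minimal
sketch of a translator from the list notation to a traditional regexp
string.  It is only an illustration under my own assumptions: it
handles string literals, * and implicit sequences and nothing else,
it assumes the strings contain no regexp metacharacters (no quoting
is done), and the names sre->string and group are invented here.

```scheme
;; Parenthesize S unless it is a single character, so that a
;; following "*" applies to the whole thing.
(define (group s)
  (if (= (string-length s) 1)
      s
      (string-append "(" s ")")))

(define (sre->string re)
  (cond ((string? re) re)
        ((and (pair? re) (eq? (car re) '*))
         ;; (* A B ...) is read as (* (A B ...)): the operands
         ;; form an implicit sequence, which is then starred.
         (string-append (group (sre->string (cdr re))) "*"))
        ((list? re)
         ;; A list that does not start with an operator is a
         ;; sequence: translate the parts and concatenate them.
         (apply string-append (map sre->string re)))
        (else (error "not a regexp" re))))

(sre->string '("a" (* "b") "c"))   ; => "ab*c"
(sre->string '(* "a" "b"))         ; => "(ab)*"
```

The dispatch is unambiguous only because a sequence can never begin
with a symbol, which is the point of striking LET* first.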



This reminds me a lot of VMS's answer to GNU Emacs, TPU.  It had a
Pascal-like extension language with a datatype for regular expressions
("patterns").
- The search function took a pattern value as its argument.
- TPU had operators that constructed bigger patterns given smaller ones.
- Because they were a real datatype, you could write your own functions
  on patterns.
- Pattern values were immutable, so each node could have an (internal)
  pointer to a compiled representation of the regexp of which it was
  the root, generated on demand.  (I don't know if they actually did
  that, but it would have been easy.)
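
The last point can be sketched in Scheme with DELAY and FORCE.  This
is not how TPU did it (as I said, I don't know); it only shows that an
immutable node can carry a compiled form that is built on first use.
All the names here are invented, and compile-pattern is a stand-in
that merely counts its calls instead of invoking a real back end.

```scheme
;; Stand-in for a real regexp compiler; counts how often it runs.
(define compile-count 0)
(define (compile-pattern sexp)
  (set! compile-count (+ compile-count 1))
  (list 'compiled sexp))

;; A pattern node pairs its list form with a promise for the
;; compiled form of the regexp rooted at that node.
(define (make-pattern sexp)
  (cons sexp (delay (compile-pattern sexp))))
(define (pattern-sexp p) (car p))
(define (pattern-compiled p) (force (cdr p)))

;; An operator that constructs a bigger pattern from smaller ones.
(define (pattern-seq . ps)
  (make-pattern (cons '& (map pattern-sexp ps))))

(define ab (pattern-seq (make-pattern "a") (make-pattern "b")))
(pattern-compiled ab)   ; compiles on first demand
(pattern-compiled ab)   ; reuses the cached result
compile-count           ; => 1
```

Since nodes are never mutated, the cached value can never go stale,
and the sub-patterns "a" and "b" are never compiled unless someone
searches for them on their own.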


>Named (not numbered) submatches?

TPU had an operator called @ that did this.  I don't remember it
exactly, but I think that PATTERN @ VARIABLE matched exactly what
PATTERN matched, except that VARIABLE was set to the text PATTERN
matched.  This pattern, of course, could be used as part of a larger
pattern.  Don't ask me what happened if VARIABLE went out of scope.
