From: Jim Blandy <jimb@cyclic.com>
To: shivers@ai.mit.edu
Date: Thu, 9 Jan 1997 16:59:46 -0500
It still seems to me that LET* crosses an important border.
Everything else in the notation is simply concerned with operations on
sets of strings, which seems right. But LET* is concerned with the
form of your descriptions. If your notation is meant to address that
area too, I would really like to have a way to define parameterized
regexps:
;; A list of words, and a list of numbers, separated by a semi:
(let* ((separated-list
;; A non-empty, comma-separated list of something:
(lambda (pat) (& pat (* "," pat))))
(word (+ (in (- "azAZ"))))
(number (+ (in (- "09")))))
(& (separated-list word) ";" (separated-list number)))
The above is all regexp notation, mind you. There's no Scheme in it.
I guess what I can't come up with is some motivating example for LET*
that is not also a motivating example for a much larger, clean, and
powerful system for working with regexps. In short, why stop there?
That is a very cool example. I considered having regexp macros, but I punted
because it just started to get hairy. On the other hand, it is very easy to
provide a means of binding regexps to names and referencing those names. The
semantics is simple: you just expand the references out. But this simple bit
of machinery provides you some nice functionality that people seem to want.
It would be easy to add simple-minded macros with no real computational
power, at the level you used in your SEPARATED-LIST example. Would it
be worth it to upgrade this to the sort of power provided by Scheme's
high-level macros, with pattern matching and ... distribution?
Date: Thu, 9 Jan 1997 15:43:39 +0600
From: Tim Pierce <twpierce@mail.bsd.uchicago.edu>
To: shivers@ai.mit.edu
Subject: Re: Regexp notation
>I considered this, as well. But I rejected it because I don't think you
>do want the ability to operate on regular languages. For example, regular
>languages are closed under intersection and set-complement. But when was
>the last time you thought to yourself "I'd like to match all the strings
>that *aren't* matched by regexp R," or "I'd like to match all the strings
>that match pattern R1 *and* pattern R2"?
A lot of the time, actually. I can't think of any examples off
the top of my head, but I can try to come up with some if you're
skeptical. At times I have wanted this badly enough in a system
that didn't provide it that I was willing to spawn a grep -v, or a
grep | grep | grep pipeline, just in order to achieve the desired
results.
OK, I'm convinced it's handy. But it's also tricky. I do not think it is
trivial to do complement or intersection on RE's using the standard
notation. Systems that work by translating to classical string notation can no
longer do a simple translation. The only way that I can think of to do this is
to (1) compute the NDFA for the regexp(s), (2) do the complement or
intersection op on the NDFA, (3) run the state-minimisation algorithm on the
result (because intersection make a big NDFA), (4) run the NDFA -> regexp
algorithm. This is a mess! Very slow, too. Could you preserve submatches info
across all these operations? Anybody volunteering to write the code?
I note that if you tied in to one of the C-based matching engines below the
standard interface, you might be able to skip the NDFA -> regexp string step.
I would be interested to hear from anyone who has experience implementing one
of the C-based matching engines on the feasibility of doing these things.
If we did this, then we could have (NOT re), (AND re ...), and (DIFF re1 re2).
-Olin
|