scsh-users
[Top] [All Lists]

Re: Regexp notation

To: scsh@martigny.ai.mit.edu
Subject: Re: Regexp notation
From: shivers@ai.mit.edu (Olin Shivers)
Date: 09 Jan 1997 18:49:14 -0500
Organization: Artificial Intelligence Lab, MIT
Reply-to: shivers@ai.mit.edu
    From: Jim Blandy <jimb@cyclic.com>
    To: shivers@ai.mit.edu
    Date: Thu, 9 Jan 1997 16:59:46 -0500

    It still seems to me that LET* crosses an important border.
    Everything else in the notation is simply concerned with operations on
    sets of strings, which seems right.  But LET* is concerned with the
    form of your descriptions.  If your notation is meant to address that
    area too, I would really like to have a way to define parameterized
    regexps:

    ;; A list of words, and a list of numbers, separated by a semi:
    (let* ((separated-list
             ;; A non-empty, comma-separated list of something:
             (lambda (pat) (& pat (* "," pat))))
           (word (+ (in (- "azAZ"))))
           (number (+ (in (- "09")))))
      (& (separated-list word) ";" (separated-list number)))

    The above is all regexp notation, mind you.  There's no Scheme in it.

    I guess what I can't come up with is some motivating example for LET*
    that is not also a motivating example for a much larger, clean, and
    powerful system for working with regexps.  In short, why stop there?

That is a very cool example. I considered having regexp macros, but I punted
because it just started to get hairy. On the other hand, it is very easy to
provide a means of binding regexps to names and referencing those names.  The
semantics is simple: you just expand the references out. But this simple bit
of machinery provides you some nice functionality that people seem to want.

It would be easy to add simple-minded macros with no real computational
power, at the level you used in your SEPARATED-LIST example. Would it
be worth it to upgrade this to the sort of power provided by Scheme's
high-level macros, with pattern matching and ... distribution?

    Date: Thu, 9 Jan 1997 15:43:39 +0600
    From: Tim Pierce <twpierce@mail.bsd.uchicago.edu>
    To: shivers@ai.mit.edu
    Subject: Re: Regexp notation

    >I considered this, as well. But I rejected it because I don't think you
    >do want the ability to operate on regular languages. For example, regular
    >languages are closed under intersection and set-complement. But when was
    >the last time you thought to yourself "I'd like to match all the strings
    >that *aren't* matched by regexp R," or "I'd like to match all the strings
    >that match pattern R1 *and* pattern R2"?

    A lot of the time, actually.  I can't think of any examples off
    the top of my head, but I can try to come up with some if you're
    skeptical.  At times I have wanted this badly enough in a system
    that didn't provide it that I was willing to spawn a grep -v, or a
    grep | grep | grep pipeline, just in order to achieve the desired
    results.

OK, I'm convinced it's handy. But it's also tricky. I do not think it is
trivial to do complement or intersection on RE's using the standard
notation. Systems that work by translating to classical string notation can no
longer do a simple translation. The only way that I can think of to do this is
to (1) compute the NDFA for the regexp(s), (2) do the complement or
intersection op on the NDFA, (3) run the state-minimisation algorithm on the
result (because intersection make a big NDFA), (4) run the NDFA -> regexp
algorithm. This is a mess! Very slow, too. Could you preserve submatches info
across all these operations? Anybody volunteering to write the code?

I note that if you tied in to one of the C-based matching engines below the
standard interface, you might be able to skip the NDFA -> regexp string step.
I would be interested to hear from anyone who has experience implementing one
of the C-based matching engines on the feasibility of doing these things.

If we did this, then we could have (NOT re), (AND re ...), and (DIFF re1 re2).
        -Olin

<Prev in Thread] Current Thread [Next in Thread>