scsh-users
[Top] [All Lists]

Re: Regexp notation

To: scsh@martigny.ai.mit.edu
Subject: Re: Regexp notation
From: jenglish@crl.com (Joe English)
Date: 10 Jan 1997 12:01:45 -0800
Organization: Tagheads
[ Cross-posted to comp.theory since someone recently asked about 
  processing regular expressions there.  Followups set back to c.l.scheme.
]

On comp.lang.scheme, comp.lang.scheme.scsh,
Olin Shivers <shivers@ai.mit.edu> wrote:

[ regarding r.e. complement, intersection, and difference ]

>OK, I'm convinced it's handy. But it's also tricky. I do not think it is
>trivial to do complement or intersection on RE's using the standard
>notation. Systems that work by translating to classical string notation can no
>longer do a simple translation. The only way that I can think of to do this is
>to (1) compute the NDFA for the regexp(s), (2) do the complement or
>intersection op on the NDFA, (3) run the state-minimisation algorithm on the
>result (because intersection make a big NDFA), (4) run the NDFA -> regexp
>algorithm. This is a mess! Very slow, too. Could you preserve submatches info
>across all these operations? Anybody volunteering to write the code?


There is another technique for regex matching (based on algebraic
transformation of regular expressions instead of finite automata)
which can easily handle intersection, complement and difference.
Check out 'regex.tar.gz' in the comp.compilers archive [1] for C code 
and documentation.

The technique is well-suited for implementation in a 
functional language, doesn't need to backtrack (like
some NFA-based algorithms do), and when combined with
lazy evaluation and memoization is as efficient if not
more so than most of the automata-based algorithms 
I've seen.  (I'm not sure how or if it could be
extended to handle submatches though.)


Hoping nobody will be upset if I post Haskell code 
to a Scheme newsgroup :-), regular expressions are 
defined like: 

        data Regex =
              Empty             -- empty set, or failure
            | Epsilon           -- null string, or success
            | Symbol Char       -- matches a single character
            | Or Regex Regex    -- alternation, matches either expression
            | Seq Regex Regex   -- sequence, matches A followed by B
            | Rep Regex         -- closure, matches zero or more times

        -- other constructors may be added, e.g., Plus for one or more,
        -- Opt for zero or one, And for intersection, &c.


The core of the algorithm is the delta or "derivative" function;
this takes a regular expression 'e' and a symbol 'c' and returns 
a new expression which matches the language { s : cs in L(e) },
i.e., all strings which are a suffix of strings in L(e) beginning 
with c. It works by simple casewise analysis:


        delta :: Regex -> Char -> Regex
        delta  Empty _          = Empty
        delta  Epsilon _        = Empty
        delta (Symbol c) c'     
                | c == c'       = Epsilon
                | otherwise     = Empty
        delta (Or x y) c        = Or (delta x c) (delta y c)
        delta (Seq x y) c
                | nullable x    = Or  (Seq (delta x c) y) (delta y c) 
                | otherwise     =      Seq (delta x c) y
        delta (Rep x) c         = Seq (delta x c) (Rep x)


The auxilliary function 'nullable' tests if a
regular expression matches the empty string Epsilon:


        nullable :: Regex -> Bool
        nullable Epsilon        = True
        nullable Empty          = False
        nullable (Symbol _)     = False
        nullable (Seq x y)      = nullable x && nullable y
        nullable (Or x y)       = nullable x || nullable y
        nullable (Rep x)        = True


To test if a string s matches an expression e, compute
(delta ... (delta (delta e s_0) s_1)  ... s_n):

        deltastar :: Regex -> [Char] -> Regex
        deltastar = foldl delta

and check if the final expression accepts the empty string:

        matches :: Regex -> [Char] -> Bool
        matches e = nullable . (deltastar e)


That's the basic idea, though there's a bit more to it 
in practice. In particular, with this code the intermediate 
regular expressions grow very rapidly so it's necessary to 
simplify them at each step (e.g., replacing (Seq Epsilon x) 
and (Or Empty x) with just (x), (Seq Empty x) with (Empty), etc.);
see [1] for details and further optimization techniques.


With this approach dealing with extended regular expressions 
is straightforward:

        data Regex = 
            ... as above ...
            | And Regex Regex   -- intersection
            | Not Regex         -- complement
            | Diff Regex Regex  -- difference

        ...
        delta (And x y) c       = And (delta x c) (delta y c)
        delta (Not x) c         = Not (delta x c)
        delta (Diff x y) c      = Diff (delta x c) (delta y c)
        ...
        nullable (And x y)      = nullable x && nullable y
        nullable (Not x)        = not (nullable x)
        nullable (Diff x y)     = nullable x && not (nullable y)


Unfortunately, modifying this algorithm to return submatches 
looks rather difficult, and might not even be possible for 
ambiguous r.e.s.


--Joe English

  jenglish@crl.com

[1] <URL:ftp://iecc.com/pub/file/regex.tar.gz>

<Prev in Thread] Current Thread [Next in Thread>