[ Cross-posted to comp.theory since someone recently asked about
processing regular expressions there. Followups set back to c.l.scheme.
]
On comp.lang.scheme, comp.lang.scheme.scsh,
Olin Shivers <shivers@ai.mit.edu> wrote:
[ regarding r.e. complement, intersection, and difference ]
>OK, I'm convinced it's handy. But it's also tricky. I do not think it is
>trivial to do complement or intersection on RE's using the standard
>notation. Systems that work by translating to classical string notation can no
>longer do a simple translation. The only way that I can think of to do this is
>to (1) compute the NDFA for the regexp(s), (2) do the complement or
>intersection op on the NDFA, (3) run the state-minimisation algorithm on the
>result (because intersection make a big NDFA), (4) run the NDFA -> regexp
>algorithm. This is a mess! Very slow, too. Could you preserve submatches info
>across all these operations? Anybody volunteering to write the code?
There is another technique for regex matching (based on algebraic
transformation of regular expressions instead of finite automata)
which can easily handle intersection, complement and difference.
Check out 'regex.tar.gz' in the comp.compilers archive [1] for C code
and documentation.
The technique is well-suited for implementation in a
functional language, doesn't need to backtrack (like
some NFA-based algorithms do), and when combined with
lazy evaluation and memoization is as efficient if not
more so than most of the automata-based algorithms
I've seen. (I'm not sure how or if it could be
extended to handle submatches though.)
Hoping nobody will be upset if I post Haskell code
to a Scheme newsgroup :-), regular expressions are
defined like:
data Regex =
Empty -- empty set, or failure
| Epsilon -- null string, or success
| Symbol Char -- matches a single character
| Or Regex Regex -- alternation, matches either expression
| Seq Regex Regex -- sequence, matches A followed by B
| Rep Regex -- closure, matches zero or more times
-- other constructors may be added, e.g., Plus for one or more,
-- Opt for zero or one, And for intersection, &c.
The core of the algorithm is the delta or "derivative" function;
this takes a regular expression 'e' and a symbol 'c' and returns
a new expression which matches the language { s : cs in L(e) },
i.e., all strings which are a suffix of strings in L(e) beginning
with c. It works by simple casewise analysis:
delta :: Regex -> Char -> Regex
delta Empty _ = Empty
delta Epsilon _ = Empty
delta (Symbol c) c'
| c == c' = Epsilon
| otherwise = Empty
delta (Or x y) c = Or (delta x c) (delta y c)
delta (Seq x y) c
| nullable x = Or (Seq (delta x c) y) (delta y c)
| otherwise = Seq (delta x c) y
delta (Rep x) c = Seq (delta x c) (Rep x)
The auxilliary function 'nullable' tests if a
regular expression matches the empty string Epsilon:
nullable :: Regex -> Bool
nullable Epsilon = True
nullable Empty = False
nullable (Symbol _) = False
nullable (Seq x y) = nullable x && nullable y
nullable (Or x y) = nullable x || nullable y
nullable (Rep x) = True
To test if a string s matches an expression e, compute
(delta ... (delta (delta e s_0) s_1) ... s_n):
deltastar :: Regex -> [Char] -> Regex
deltastar = foldl delta
and check if the final expression accepts the empty string:
matches :: Regex -> [Char] -> Bool
matches e = nullable . (deltastar e)
That's the basic idea, though there's a bit more to it
in practice. In particular, with this code the intermediate
regular expressions grow very rapidly so it's necessary to
simplify them at each step (e.g., replacing (Seq Epsilon x)
and (Or Empty x) with just (x), (Seq Empty x) with (Empty), etc.);
see [1] for details and further optimization techniques.
With this approach dealing with extended regular expressions
is straightforward:
data Regex =
... as above ...
| And Regex Regex -- intersection
| Not Regex -- complement
| Diff Regex Regex -- difference
...
delta (And x y) c = And (delta x c) (delta y c)
delta (Not x) c = Not (delta x c)
delta (Diff x y) c = Diff (delta x c) (delta y c)
...
nullable (And x y) = nullable x && nullable y
nullable (Not x) = not (nullable x)
nullable (Diff x y) = nullable x && not (nullable y)
Unfortunately, modifying this algorithm to return submatches
looks rather difficult, and might not even be possible for
ambiguous r.e.s.
--Joe English
jenglish@crl.com
[1] <URL:ftp://iecc.com/pub/file/regex.tar.gz>
|