Alex Shinn <foof@synthcode.com> writes:
[...]
> However, this style of regexp building is reminiscent of the way
> Emacs builds dynamic fontification regexps. When constructing
> regexps from smaller regexps, an sexp offers no advantage over a
> string.
Well, the regexps used in Emacs for fontification do provide, in my
opinion, a very good argument for sexps, at least if you temporarily
go back to the days of Emacs 20.
Emacs provides regexp-opt which is used to create optimised regexps
matching a set of keywords. For example, if you give it the keywords
"if", "import" and "in" it outputs something like
"i\\(mport\\|[fn]\\)". This is typically used to create regular
expressions for syntax highlighting. Now, often you want to build a
big regular expression which matches such a set of keywords and, say,
the identifier which follows, because you want to highlight that
identifier too.
The problem is that you do not know how many groups (or submatches, in
scsh terminology) were created by regexp-opt, and therefore you do not
know the number of the group containing the identifier which follows
the keyword.
For that reason, Emacs provides regexp-opt-depth which counts the
number of groups in a regexp. It can be used to solve the above
problem, in a rather ugly way I think: you use regexp-opt to create
the regular expression matching the set of keywords, then you count
the number of groups it contains with regexp-opt-depth, and you use
that count to compute the group number of the identifier which
follows.
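To make the bookkeeping concrete, here is a rough sketch in Python
rather than Emacs Lisp. The keyword regexp is written by hand as a
stand-in for what regexp-opt might return, and the `groups` attribute
of a compiled pattern plays the role of regexp-opt-depth. The variable
names and the identifier pattern are my own assumptions.

```python
import re

# Hand-written stand-in for the output of regexp-opt on "if", "import", "in".
# Python syntax: groups are written ( ... ) instead of \\( ... \\).
keywords = r"i(mport|[fn])"

# Analogue of regexp-opt-depth: the number of capturing groups.
depth = re.compile(keywords).groups

# Wrap the keywords in one group, then capture the identifier that follows.
pattern = re.compile(r"\b(" + keywords + r")\s+([A-Za-z_]\w*)")

# Group 1 is the wrapper, groups 2 .. depth + 1 belong to the keyword
# regexp, so the identifier ends up in group depth + 2.
ident_group = depth + 2

m = pattern.search("import os")
keyword_text = m.group(1)
ident_text = m.group(ident_group)
```

The group number of the identifier depends on the internals of a
regexp you did not write, which is exactly the ugly part.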
The scsh solution to that same problem is a lot better, I think: by
default, when you build big regexps out of small ones, submatches are
removed. And if you really want to keep them, you can, of course.
What I said applies to Emacs 20 because Emacs 21 (IIRC) introduced
so-called shy groups (which also exist in Perl and certainly
elsewhere). Shy groups group parts of the regexp but do not count as
submatches. Regexp-opt now creates only shy groups, which solves the
problem, but in that particular case only.
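The same idea can be sketched in Python, where shy groups are spelled
`(?:...)`. Again the keyword regexp is a hand-written stand-in for
regexp-opt output, and `other` is a made-up fragment that shows why
the fix does not generalise.

```python
import re

# Keyword regexp using only a shy (non-capturing) group.
keywords_shy = r"i(?:mport|[fn])"
shy_depth = re.compile(keywords_shy).groups

# The identifier is now always group 2, whatever the keyword regexp
# contains, as long as it contains shy groups only.
pattern = re.compile(r"\b(" + keywords_shy + r")\s+([A-Za-z_]\w*)")
m = pattern.search("in xs")
keyword_text = m.group(1)
ident_text = m.group(2)

# A fragment from another source may still contain ordinary groups,
# and then the numbering shifts again.
other = r"(un)?signed"
other_depth = re.compile(other).groups
```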
However, the general problem remains: when you want to combine several
small regexps to create a big one, and you do not know how many groups
are contained in the small ones, you have a problem. And this problem
is easily solved with SREs because the regexps are represented as an
ADT which is a lot easier to manipulate programmatically than a
string, as all compiler authors know.
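As an illustration only, here is a toy regexp ADT in Python. It is not
scsh's actual representation: the node types (`Lit`, `Raw`, `Seq`,
`Alt`, `Submatch`) and the helpers `strip_submatches` and `render` are
names I made up for this sketch. The point is that removing submatches
is a short tree walk.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Lit:            # literal text, escaped when rendered
    text: str

@dataclass(frozen=True)
class Raw:            # trusted regexp fragment, assumed to contain no groups
    source: str

@dataclass(frozen=True)
class Seq:            # concatenation
    parts: Tuple["Node", ...]

@dataclass(frozen=True)
class Alt:            # alternation
    choices: Tuple["Node", ...]

@dataclass(frozen=True)
class Submatch:       # numbered (capturing) group
    body: "Node"

Node = Union[Lit, Raw, Seq, Alt, Submatch]

def strip_submatches(node: Node) -> Node:
    """Return the same regexp with every Submatch replaced by its body."""
    if isinstance(node, Submatch):
        return strip_submatches(node.body)
    if isinstance(node, Seq):
        return Seq(tuple(strip_submatches(p) for p in node.parts))
    if isinstance(node, Alt):
        return Alt(tuple(strip_submatches(c) for c in node.choices))
    return node

def render(node: Node) -> str:
    """Render to Python regexp syntax; only Submatch produces a capturing group."""
    if isinstance(node, Lit):
        return re.escape(node.text)
    if isinstance(node, Raw):
        return node.source
    if isinstance(node, Seq):
        return "".join(render(p) for p in node.parts)
    if isinstance(node, Alt):
        return "(?:" + "|".join(render(c) for c in node.choices) + ")"
    return "(" + render(node.body) + ")"

# A keyword regexp that happens to contain a submatch we do not care about.
keywords = Seq((Lit("i"), Submatch(Alt((Lit("mport"), Lit("f"), Lit("n"))))))

# Combine: strip the inner submatches, then add exactly the two we want.
combined = Seq((
    Submatch(strip_submatches(keywords)),
    Lit(" "),
    Submatch(Raw(r"[A-Za-z_]\w*")),
))
compiled = re.compile(render(combined))
m = compiled.fullmatch("import os")
```

After stripping, the keyword is always submatch 1 and the identifier
always submatch 2, however many groups the small regexp had.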
Now, to be fair, all this is not due to the SRE notation itself, but
to the fact that scsh internally represents regexps as an ADT, and
lets the user access that ADT. One could also write a parser for
string-based regexps (like scsh's posix-string->regexp) and do all
these manipulations on the ADT, like removing groups or turning them
into shy groups, before sending the regexp to the underlying engine,
maybe as a string.
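A much cruder stand-in for that idea, sketched in Python: a
single-pass scanner rather than a real parser, which turns plain
capturing groups into shy groups directly on the string. The function
name `make_groups_shy` is my own, and the sketch assumes Python regexp
syntax; it skips escapes and character classes, and leaves any `(?...)`
construct (including named groups) untouched.

```python
def make_groups_shy(pattern: str) -> str:
    """Rewrite every plain "(" group as "(?:" in a Python-syntax regexp."""
    out = []
    i = 0
    in_class = False                      # inside a [...] character class?
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy escaped pair verbatim
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A "]" directly after "[" or "[^" is a literal class member.
            j = i + 1
            if j < len(pattern) and pattern[j] == "^":
                j += 1
            if j < len(pattern) and pattern[j] == "]":
                out.append(pattern[i:j + 1])
                i = j + 1
                continue
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            out.append("(?:")             # capturing group becomes shy
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)

shy_keywords = make_groups_shy(r"i(mport|[fn])")
```

A real implementation would build the ADT first, as described above;
the scanner only shows that the transformation is mechanical.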
Michel.