scsh-users
[Top] [All Lists]

New regexp substitution functions

To: scsh@martigny.ai.mit.edu
Subject: New regexp substitution functions
From: Olin Shivers <shivers@lambda.ai.mit.edu>
Date: 19 Apr 1997 00:21:07 -0400
Organization: Artificial Intelligence Lab, MIT
Another Friday evening, another 105 lines of code.

The two procedures documented below will be in the next release of scsh, 
for all you string hackers and Perl fans out there. I append the text from
the manual, some examples, and the source (which is about twice as long
and ugly as it should be, because I'm such a performance freak).

I'd like to see how you convert "9/29/61" dates to "September 29, 1961"
strings in tcl or Perl...

Please take a look at these procedures. You have a *very* small window to
influence the design before the release. Now's the time to flame me.
    -Olin

-------------------------------------------------------------------------------
(regexp-substitute port match . items) -> string or undefined

This procedure can be used to perform string substitutions based on regular
expression matches.  The results of the substitution can be either output to a
port or returned as a string.

The MATCH argument is a regular expression match structure that controls the
substitution.  If PORT is an output port, the ITEMS are written out to the
port:
    - If an item is a string, it is copied directly to the port.
    - If an item is an integer, the corresponding submatch from MATCH
      is written to the port.
    - If an item is 'PRE, the prefix of the matched string (the
      text preceding the match) is written to the port.
    - If an item is 'POST, the suffix of the matched string is written.

If PORT is #F, nothing is written, and a string is constructed
and returned instead.

-------------------------------------------------------------------------------
(regexp-substitute/global port regexp string . items) -> string or undefined

This procedure is similar to regexp-substitute, but can be used to perform
repeated match/substitute operations over a string.  It has the following
differences with regexp-substitute:

  - It takes a regular expression and string to be matched as
    parameters, instead of a completed match structure.

  - If the regular expression doesn't match the string, this
    procedure is the identity transform--it returns or outputs the
    string.

  - If an item is 'POST, the procedure recurses on the suffix string
    (the text from STRING following the match). Including a 'POST in the
    list of items is how one gets multiple match/substitution operations.

  - If an item is a procedure, it is applied to the match structure for
    a given match. The procedure returns a string to be used in the result.

Some examples:

;;; Replace occurrences of "Cotton" with "Jin".
(regexp-substitute/global #f "Cotton" s
                          'pre "Jin" 'post)

;;; mm/dd/yy -> dd/mm/yy date conversion.
(regexp-substitute/global #f "([0-9]+)/([0-9]+)/([0-9]+)" ; mm/dd/yy
                          s ; Source string
                          'pre 2 "/" 1 "/" 3 'post)

;;; "9/29/61" -> "Sep 29, 1961" date conversion.
(regexp-substitute/global #f "([0-9]+)/([0-9]+)/([0-9]+)" ; mm/dd/yy
                          s ; Source string

      'pre 
      ;; Sleazy converter -- ignores "year 2000" issue, and blows up if
      ;; month is out of range.
      (lambda (m)
        (let ((mon (vector-ref '#("Jan" "Feb" "Mar" "Apr" "May" "Jun"
                                  "Jul" "Aug" "Sep" "Oct" "Nov" "Dec")
                               (- (string->number (match:substring m 1)) 1)))
              (day (match:substring m 2))
              (year (match:substring m 3)))
          (string-append mon " " day ", 19" year)))
      'post)

;;; Remove potentially offensive substrings from string S.
(regexp-substitute/global #f "Windows|tcl|Intel" s
                          'pre 'post)


-------------------------------------------------------------------------------
(define (regexp-substitute port match . items)
  (let* ((str (regexp-match:string match))
         (sv (regexp-match:start match))
         (ev (regexp-match:end match))
         (range (lambda (item)                  ; Return start & end of
                  (cond ((integer? item)        ; ITEM's range in STR.
                         (values (vector-ref sv item)
                                 (vector-ref ev item)))
                        ((eq? 'pre item) (values 0 (vector-ref sv 0)))
                        ((eq? 'post item) (values (vector-ref ev 0)
                                                  (string-length str)))
                        (else (error "Illegal substitution item."
                                     item
                                     regexp-substitute))))))
    (if port

        ;; Output port case.
        (for-each (lambda (item)
                    (if (string? item) (write-string item port)
                        (receive (si ei) (range item)
                          (write-string str port si ei))))
                  items)

        ;; Here's the string case. Make two passes -- one to
        ;; compute the length of the target string, one to fill it in.
        (let* ((len (reduce (lambda (i item)
                              (+ i (if (string? item) (string-length item)
                                       (receive (si ei) (range item) (- ei 
si)))))
                            0 items))
               (ans (make-string len)))

          (reduce (lambda (index item)
                    (cond ((string? item)
                           (string-replace! ans index item)
                           (+ index (string-length item)))
                          (else (receive (si ei) (range item)
                                  (substring-replace! ans index str si ei)
                                  (+ index (- ei si))))))
                  0 items)
          ans))))



(define (regexp-substitute/global port re str . items)
  (let ((range (lambda (start sv ev item)       ; Return start & end of
                 (cond ((integer? item)         ; ITEM's range in STR.
                        (values (vector-ref sv item)
                                (vector-ref ev item)))
                       ((eq? 'pre item) (values start (vector-ref sv 0)))
                       (else (error "Illegal substitution item."
                                    item
                                    regexp-substitute/global)))))
        (num-posts (reduce (lambda (count item)
                             (+ count (if (eq? item 'post) 1 0)))
                           0 items)))
    (if (and port (< num-posts 2))

        ;; Output port case, with zero or one POST items.
        (let recur ((start 0))
          (let ((match (string-match re str start)))
            (if match
                (let* ((sv (regexp-match:start match))
                       (ev (regexp-match:end match)))
                  (for-each (lambda (item)
                              (cond ((string? item) (write-string item port))
                                    ((procedure? item) (write-string (item 
match) port))
                                    ((eq? 'post item) (recur (vector-ref ev 0)))
                                    (else (receive (si ei)
                                                   (range start sv ev item)
                                            (write-string str port si ei)))))
                            items))

                (write-string str port start)))) ; No match.

        (let* ((pieces (let recur ((start 0))
                         (let ((match (string-match re str start))
                               (cached-post #f))
                           (if match
                               (let* ((sv (regexp-match:start match))
                                      (ev (regexp-match:end match)))
                                 (reduce (lambda (pieces item)
                                           (cond ((string? item)
                                                  (cons item pieces))

                                                 ((procedure? item)
                                                  (cons (item match) pieces))

                                                 ((eq? 'post item)
                                                  (if (not cached-post)
                                                      (set! cached-post
                                                            (recur (vector-ref 
ev 0))))
                                                  (append cached-post pieces))

                                                 (else (receive (si ei)
                                                           (range start sv ev 
item)
                                                         (cons (substring str 
si ei)
                                                               pieces)))))
                                         '() items))

                               ;; No match. Return str[start,end].
                               (list (if (zero? start) str 
                                         (substring str start (string-length 
str))))))))
               (pieces (reverse pieces)))
          (if port (for-each (lambda (p) (write-string p port)) pieces)
              (apply string-append pieces))))))

<Prev in Thread] Current Thread [Next in Thread>