scsh-users

field & record parsing

To: brent@jade.ssd.csd.harris.com, <schwartz@galapagos.cse.psu.edu>, shriram@cs.rice.edu (Shriram Krishnamurthi)
Subject: field & record parsing
From: Olin Shivers <shivers@clark.lcs.mit.edu>
Date: Tue, 6 Dec 94 03:35:28 -0500
Cc: scsh@martigny.ai.mit.edu
Reply-to: shivers@mintaka.lcs.mit.edu
Guys-

I've redesigned the field/record parsers. I append the new design, with
an extra bonus: the doc for the awk macro. I have also just posted these
designs to the Scheme group on netnews.

I think I've now got everything you all wanted. Let me know how you
like the new design. I have not implemented it yet; just got finished working
out the design and writing it up.

Thank you for the feedback on the last version.
    -Olin
-------------------------------------------------------------------------------
The scsh field and record parsers.
So featureful you'll puke! (tm)
Copyright (c) 1994 by Olin Shivers.

Since all of the readers discussed below require the ability to peek
ahead one char in the input stream, they cannot be applied to raw 
integer file descriptors, only Scheme input ports. This is because
Unix doesn't support peeking ahead into input streams.

(read-delimited char-set [port]) -> string or eof
    Read until we encounter one of the chars in CHAR-SET or eof.
    The terminating character is not included in the string returned,
    nor is it removed from the input stream; the next input operation will 
    encounter it. If we get a string back, then (eof-object? (peek-char port))
    tells if the string was terminated by a delimiter or eof.

    CHAR-SET may be a charset, a string, a character, or a character
    predicate; it is coerced to a charset.

    This operation is likely to be implemented very efficiently. In
    the Scheme 48 implementation, the Unix port case is implemented directly
    in C, and is much faster than the equivalent operation performed
    in Scheme with PEEK-CHAR and READ-CHAR.
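
    For reference, the slow PEEK-CHAR/READ-CHAR version mentioned above
    might look like the sketch below. This is not the scsh implementation:
    READ-DELIMITED-SKETCH is a made-up name, it takes a character predicate
    instead of coercing a charset, the port is required, and returning the
    eof object when nothing at all could be read is an assumption.

```scheme
;; Sketch only. DELIM? is a character predicate standing in for the
;; charset coercion described above.
(define (read-delimited-sketch delim? port)
  (let loop ((chars '()))
    (let ((c (peek-char port)))
      (cond ((eof-object? c)
             ;; Assumption: eof with nothing read yields the eof object.
             (if (null? chars) c (list->string (reverse chars))))
            ((delim? c)
             ;; Leave the delimiter in the stream for the next read.
             (list->string (reverse chars)))
            (else
             (read-char port)
             (loop (cons c chars)))))))

(define p (open-input-string "foo:bar"))
(read-delimited-sketch (lambda (c) (char=? c #\:)) p)   ; => "foo"
(peek-char p)                                            ; => #\:
```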

(read-delimited! char-set buf [port start end]) -> nchars or eof or #f
    Variant of READ-DELIMITED.
    #f means buffer filled up without encountering delimiter or eof.
    If an integer is returned, then (eof-object? (peek-char port))
    tells if the string was terminated by a delimiter or eof.


(record-reader [delims elide-delims? delim-action])
    Returns a procedure that reads records from a port. A record is
    a sequence of characters terminated by one of the characters
    in DELIMS or eof. If ELIDE-DELIMS? is true, then a contiguous
    sequence of delimiter chars is taken as a single record delimiter.
    If ELIDE-DELIMS? is false, then a delimiter char coming immediately after
    a delimiter char produces an empty string record. The reader consumes
    the delimiting char(s) before returning from a read.

    DELIMS defaults to the set {newline}. It may be a charset,
    string, character, or character predicate, and is coerced to a charset.
    ELIDE-DELIMS? defaults to #f.

    DELIM-ACTION controls what is done with the record's terminating delimiter.
        'trim   Delimiters are trimmed.
        'split  Reader returns delimiter string as a second value.
                If record is terminated by EOF, then the eof object is 
                returned as this second value.
        'concat The record and its delimiter are returned as a single string.
    DELIM-ACTION defaults to 'trim.

    The reader procedure returned takes one optional argument, the port
    from which to read, which defaults to the current input port. It returns
    a string or eof.
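
    To make the ELIDE-DELIMS? distinction concrete, here is a rough model
    of the reader in portable Scheme. It is a sketch under assumptions, not
    the design itself: RECORD-READER-SKETCH is a hypothetical name, DELIMS
    is given as a character predicate, only the 'trim action is modelled,
    and the port argument is required rather than optional.

```scheme
;; Sketch only: models 'trim behaviour with and without delimiter eliding.
(define (record-reader-sketch delim? elide?)
  (lambda (port)
    (let loop ((chars '()))
      (let ((c (read-char port)))
        (cond ((eof-object? c)
               ;; eof terminates a pending record; otherwise report eof.
               (if (null? chars) c (list->string (reverse chars))))
              ((delim? c)
               ;; When eliding, swallow the rest of the delimiter run.
               (if elide?
                   (let skip ()
                     (let ((next (peek-char port)))
                       (if (and (not (eof-object? next)) (delim? next))
                           (begin (read-char port) (skip))))))
               (list->string (reverse chars)))
              (else (loop (cons c chars))))))))

(define (newline? c) (char=? c #\newline))

;; Without eliding, the doubled newline yields an empty record.
(define p1 (open-input-string "a\n\nb"))
(define read-line-record (record-reader-sketch newline? #f))
(read-line-record p1)   ; => "a"
(read-line-record p1)   ; => ""
(read-line-record p1)   ; => "b"

;; With eliding, the run of newlines counts as one delimiter.
(define p2 (open-input-string "a\n\nb"))
(define read-para-record (record-reader-sketch newline? #t))
(read-para-record p2)   ; => "a"
(read-para-record p2)   ; => "b"
```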

(field-parser [delim grammar num-fields handle-delimiters])
    FIELD-PARSER returns a procedure that can be used to parse
    a string representing a sequence of fields separated by delimiters.
    FIELD-PARSER implements a range of options, so that you can
    precisely specify what kind of parsing you want. However, these options
    default to reasonable values for general use.

    Defaults:
        DELIM               = "[ \t\n]+|$"
        GRAMMAR             = 'SLOPPY-SUFFIX
        NUM-FIELDS          = #f
        HANDLE-DELIMITERS   = 'TRIM
    ...which means: break the string at white-space, discarding the
     white-space, and parse as many fields as possible.

    DELIM is a regular expression used to match field delimiters.
    It defaults to the regular expression "[ \t\n]+|$" which matches
    either a string of whitespace or the end-of-string.    

    HANDLE-DELIMITERS determines what to do with delimiters.
    - 'trim (default)   Delimiters are thrown away after parsing.
    - 'concat           Delimiters are appended to the field preceding them.
    - 'split            Delimiters are returned as separate elements in
                        the field vector.

    GRAMMAR determines the grammar used to parse the sequence of
    delimiters and elements from the string. It defaults to the
    symbol SLOPPY-SUFFIX. There are three possible grammars:

    - SUFFIX
    Delimiters are interpreted as element *terminators*. If vertical-bar is
    the delimiter, then the string "" is the empty record #(), and
    "foo|" produces a one-field record #("foo").

    The syntax of suffix-delimited records is:
        <record> ::= ""                                 ; Empty record
                   | <element> <delim> <record>

    It is an error if a non-empty record does not end with a delimiter.
    To make the last delimiter optional, make sure the delimiter regexp
    matches the end-of-string (regex "$").

    - INFIX
    Delimiters are interpreted as element *separators*. If comma is the
    delimiter, then the string "foo," produces a two-field record #("foo" "").

    The syntax of infix-delimited records is:
        <record> ::= ""                 ; Force to be empty record.
                   | <real-infix-record>

        <real-infix-record> ::= <element> <delim> <real-infix-record>
                              | <element>

    Note that separator semantics doesn't really allow for empty records --
    the straightforward grammar (i.e., <real-infix-record>) parses an empty
    string as a singleton vector whose one field is the empty string (i.e.,
    #("")), not as the empty record #(). This is unfortunate, since it means
    that infix string parsing doesn't make STRING-APPEND and VECTOR-APPEND
    isomorphic:
        ((field-parser delim 'infix) (string-append x y))
    doesn't always equal
        (vector-append ((field-parser delim 'infix) x)
                       ((field-parser delim 'infix) y))
    Terminator semantics *does* preserve this isomorphism.

    However, separator semantics is frequently what other systems
    use, so to parse their strings, we need to use it. For example,
    Unix $PATH lists have separator semantics. The list "/bin:"
    is broken up into ("/bin" ""), not ("/bin"). Comma-separated
    lists should also be parsed this way.

    - SLOPPY-SUFFIX
    The same as the SUFFIX case, except that the parser will skip an initial
    delimiter string if the string begins with one instead of parsing an
    initial empty field. This can be used, for example, to field-split a
    sequence of English text at white-space boundaries, where the string may
    begin or end with white-space, by using regex "[ \t]+|$".

    To see the difference between infix and suffix grammars, consider
    the following table:

    Record      : terminates    :|$ terminates  : separates
    -----------------------------------------------------------
    ""          #()             #()             #()
    ":"         #("")           #("")           #("" "")
    "foo:"      #("foo")        #("foo")        #("foo" "")
    ":foo"      ERROR           #("" "foo")     #("" "foo")
    "foo:bar"   ERROR           #("foo" "bar")  #("foo" "bar")

    The GRAMMAR argument requires you to decide what you want,
    but at least you can be precise about what you are parsing.
    Take fifteen seconds and think it out. Say what you mean; 
    mean what you say.
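
    The three columns of the table can be reproduced with a small model in
    portable Scheme. This is only a sketch: the delimiter is a single
    character rather than a regexp, and SPLIT-AT-CHAR, PARSE-INFIX,
    PARSE-SUFFIX and PARSE-SUFFIX/EOS are hypothetical names, not part of
    the design.

```scheme
;; Split STR at every DELIM, separator style. Always returns at least
;; one element: "" => (""), "foo:" => ("foo" "").
(define (split-at-char str delim)
  (let loop ((i (- (string-length str) 1))
             (end (string-length str))
             (fields '()))
    (cond ((< i 0) (cons (substring str 0 end) fields))
          ((char=? (string-ref str i) delim)
           (loop (- i 1) i (cons (substring str (+ i 1) end) fields)))
          (else (loop (- i 1) end fields)))))

;; ": separates" column: the empty string is forced to the empty record.
(define (parse-infix str delim)
  (if (string=? str "")
      (vector)
      (list->vector (split-at-char str delim))))

;; ": terminates" column: text after the last delimiter must be empty.
(define (parse-suffix str delim)
  (let ((pieces (reverse (split-at-char str delim))))
    (if (string=? (car pieces) "")
        (list->vector (reverse (cdr pieces)))
        (error "record does not end with a delimiter:" str))))

;; ":|$ terminates" column: the final delimiter is optional.
(define (parse-suffix/eos str delim)
  (let ((pieces (reverse (split-at-char str delim))))
    (list->vector
     (reverse (if (string=? (car pieces) "") (cdr pieces) pieces)))))

(parse-suffix     "foo:" #\:)   ; => #("foo")
(parse-suffix/eos ":foo" #\:)   ; => #("" "foo")
(parse-infix      "foo:" #\:)   ; => #("foo" "")
(parse-infix      ""     #\:)   ; => #()
```

    The same model shows the broken isomorphism for infix: with x = "a:"
    and y = "b", parsing "a:b" gives #("a" "b"), while the two separate
    parses give #("a" "") and #("b").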

    The field parser produced is a procedure that can be employed as
    follows:
        (parse string) -> string-vector
    To allow simple composition of record readers and field parsers
    (see RECORD-READER, above), the STRING argument is permitted to
    be the end-of-file object; in this case, the field parser returns
    the empty vector.

    The NUM-FIELDS argument used to create the parser specifies how many
    fields to parse.  If #f (the default), the procedure parses them all. If
    a positive integer N, exactly that many fields are parsed; it is an error
    if fewer than N fields are parsed, or if text remains in the string after
    parsing N fields. If NUM-FIELDS is a negative integer or zero, then
    |N| fields are parsed, and the remainder of the string is returned
    in the last element of the field vector; it is an error if fewer than
    |N| fields can be parsed.
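
    As an illustration of the negative NUM-FIELDS case, the sketch below
    parses |N| fields and hands back the rest of the string untouched as
    the final element. It is a guess at the intended behaviour, not the
    design: SPLIT-FIRST-N is a hypothetical name, the delimiter is a single
    character instead of a regexp, and each of the |N| fields is assumed to
    be followed by a delimiter.

```scheme
;; Sketch only: parse N delimiter-terminated fields, then return the
;; unparsed remainder as the last vector element.
(define (split-first-n str delim n)
  (let ((len (string-length str)))
    (let loop ((start 0) (i 0) (n n) (fields '()))
      (cond ((= n 0)
             (list->vector
              (reverse (cons (substring str start len) fields))))
            ((= i len)
             (error "fewer fields than requested:" str))
            ((char=? (string-ref str i) delim)
             (loop (+ i 1) (+ i 1) (- n 1)
                   (cons (substring str start i) fields)))
            (else (loop start (+ i 1) n fields))))))

;; Roughly what a NUM-FIELDS of -2 would ask for:
(split-first-n "a:b:c:d" #\: 2)   ; => #("a" "b" "c:d")
```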

(field-reader [fdelim grammar num-fields rdelims r-elide? r-delim-action])
    Returns a procedure that reads records with field structure from a port.
    When the reader is applied to an input port (which defaults to the current
    input port), it returns two values: the record and a field vector.
    The record is a string; the field vector is a vector of strings and
    is the record split up into strings at field boundaries.
    When called at eof, returns [eof-object #()].

    For example, if port p is open on /etc/passwd, then
        ((field-reader ":" 'infix 7) p)
    returns two values:
        "wandy:3xuncWdpOKhR.:112:22:Wandy Saetan:/users/wandy:/bin/csh"
        #("wandy" "3xuncWdpOKhR." "112" "22" "Wandy Saetan" "/users/wandy"
                  "/bin/csh")

    FDELIM is a regular expression used to delimit the fields in the record;
    it defaults to a regexp that matches white-space and the end-of-string:
        "[ \t\n]+|$"

    GRAMMAR is used to determine how to parse the string into fields.
    See the FIELD-PARSER procedure for more detail. It defaults to
    SLOPPY-SUFFIX.

    Record boundaries are found as specified for the RECORD-READER
    procedure.
    - RDELIMS is the set of characters that are taken to terminate 
      record boundaries. It is either a charset, string, char, or char 
      predicate. It is coerced to a charset and defaults to the set {newline}.

    - R-ELIDE? determines whether or not a contiguous sequence of record
      delimiting characters is considered a single delimiter. It defaults to
      #f.

    - R-DELIM-ACTION is either the symbol TRIM (discard the record delimiter)
      or CONCAT (return the record and its delimiter as a single string).
      It defaults to TRIM. By using CONCAT, you can distinguish the
      case of a final line being ended by the record delimiter or
      being ended by EOF.

      In either case, the field reader consumes the record-delimiting char(s)
      before returning.

      FIELD-READER does not allow the full set of possibilities permitted
      by RECORD-READER and FIELD-PARSER; it only covers a convenient set
      of the main uses.
          (field-reader FDELIM GRAMMAR NUM-FIELDS RDELIMS R-ELIDE? R-DELIM-ACTION)
      is exactly equivalent to
          (lambda maybe-port
            (let ((rec (apply (record-reader RDELIMS R-ELIDE? R-DELIM-ACTION)
                              maybe-port)))
              (values rec ((field-parser FDELIM GRAMMAR NUM-FIELDS) rec))))
      If you wish some other kind of functionality, it is trivial to compose 
      your own record reader and field parser using the full set of arguments 
      allowed for these two procedures.
      
    Examples:
    (field-reader ":" 'infix 7)         ; /etc/passwd reader
        ; dalbertz:mx3Uaqq0:107:22:David Albertz:/users/dalbertz:/bin/csh
    
    (field-reader ":" 'infix)           ; $PATH reader
        ; /usr/shivers/bin:/usr/local/bin:/usr/ucb:/usr/bin:/bin

    (field-reader "[ \t]+" 'infix 8)    ; ls -l output reader
    (field-reader "[ \t]+" 'infix -7)   ; ls -l output reader
        ; -rw-r--r--  1 shivers    22880 Sep 24 12:45 scsh.scm

    (field-reader "\\." 'infix)         ; Internet hostname reader
        ; stat.sinica.edu.tw
    
    (field-reader "\\." 'infix 4)       ; Internet IP address reader
        ; 18.24.0.241

-------------------------------------------------------------------------------
Awk-in-scheme is given by a loop macro called AWK-LOOP. It looks like
this:
    (awk-loop <next-record> <record&field-vars> <state-var-decls>
      <body> ...)

    AWK-LOOP's <body> is made up of a series of clauses, each one representing
    a kind of condition/action pair. AWK-LOOP repeatedly reads a record,
    and then executes each clause whose condition is satisfied by the record.

    In more detail:

    <next-record> is an expression that is evaluated each time through the loop
    to produce a record to process. This expression can return multiple
    values; these values are bound to the variables given in the 
    <record&field-vars> list of variables. The first value returned is
    assumed to be the record; when it is the end-of-file object, the
    loop terminates (producing what value? we'll get to that later).

    For example, let's suppose we want to read items from /etc/passwd,
    and we use the FIELD-READER utility to define a record parser for
    /etc/passwd entries:
        (define passwd-reader (field-reader ":" 'infix 7))
    This binds PASSWD-READER to a procedure that reads in a line of text when
    it is called, and splits the text at colons. It returns two values: 
    the entire line read, and a seven-element vector of the split-out fields.

    So if the <next-record> form in an AWK-LOOP is (PASSWD-READER), then
    <record&field-vars> must be a list of two variables, e.g. 
        (RECORD FIELD-VEC)
    since PASSWD-READER returns two values.

    Note that AWK-LOOP allows us to use *any* record reader we want in the
    loop, returning whatever number of values we like.

    The awk-loop allows the programmer to have loop variables. These are
    declared and initialised in a ((var init-exp) (var init-exp) ...)  list
    rather like the LET macro. Whenever a clause in the loop body executes, it
    evaluates to as many values as there are state variables, thus updating
    them. When the loop terminates on eof, it returns the values of these
    variables as multiple values.

    There are several kinds of loop clause. When evaluating the body of the
    loop, awk-loop evaluates *all* the clauses sequentially. Unlike COND,
    it does not stop after the first clause is satisfied; it checks them all.

    (<test> <body1> <body2> ...)
        If <test> is true, execute the body forms. The last body form
        is the value of the clause. The test and body forms can see the
        record and state variables.

        The <test> form can be one of:
          integer:      The test is true for that iteration of the loop.
                        The first iteration is #1.
    
          string:       The string is a regular expression. The test is
                        true if the regexp matches the record.
    
          expression:   If not an integer or a string, the test form is
                        a Scheme expression that is evaluated.

    (range <start-test> <stop-test>  <body1> ...)
        This clause becomes activated when <start-test> is true; it stays
        active on all further iterations until <stop-test> is true.

        So, to print out the first ten lines of a file, we use the clause:
                (range 1 10 (display record))
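
        One way to picture the state a RANGE clause carries between
        iterations is a closure with an "active" flag. The sketch below is
        only a model: MAKE-RANGE-TEST is a hypothetical name, and it
        assumes awk-like behaviour in which the clause still fires on the
        iteration where <stop-test> becomes true (which is what the
        ten-line example above needs).

```scheme
;; Sketch only: START? and STOP? are predicates on whatever the loop
;; tests (here, the iteration number). Returns a stateful test.
(define (make-range-test start? stop?)
  (let ((active? #f))
    (lambda (x)
      (cond (active?
             ;; Fire on the stopping iteration too, then deactivate.
             (if (stop? x) (set! active? #f))
             #t)
            ((start? x)
             ;; Stay active only if we have not already hit the stop test.
             (if (not (stop? x)) (set! active? #t))
             #t)
            (else #f)))))

(define first-ten?
  (make-range-test (lambda (n) (= n 1)) (lambda (n) (= n 10))))

;; Count the iterations out of 1..12 on which the clause would fire.
(let loop ((n 1) (fired 0))
  (if (> n 12)
      fired
      (loop (+ n 1) (if (first-ten? n) (+ fired 1) fired))))   ; => 10
```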

    (else <body1> <body2> ...)
        If no other clause has executed since the top of the loop, or
        since the last ELSE clause, this clause executes.

    (<test> => <exp>)
        If <test> returns a true value, apply <exp> to that value.
        If <test> is a regular-expression string, then <exp> is applied
        to the match data structure returned by the regexp match routine.


Examples:

    (define $ vector-ref)       ; Saves typing.

    ;; Print out the name and home-directory of everyone in /etc/passwd:
    (let ((reader (field-reader ":" 'infix 7)))
      (call-with-input-file "/etc/passwd"
        (lambda (port)
          (awk-loop (reader port) (record fields) ()
            (#t (format #t "~a's home directory is ~a~%"
                        ($ fields 0)
                        ($ fields 5)))))))

    ;; Print out the user-name and home-directory of everyone whose
    ;; name begins with "S"
    (let ((reader (field-reader ":" 'infix 7))) ; READER parses lines at :'s.
      (call-with-input-file "/etc/passwd"
        (lambda (port)
          (awk-loop (reader port) (record fields) ()
            ("^S" (format #t "~a's home directory is ~a~%"
                          ($ fields 0)
                          ($ fields 5)))))))
    
    ;; Read a series of integers from stdin. This expression evaluates
    ;; to the number of positive numbers that were read. Note our "record-reader"
    ;; is the standard Scheme READ procedure.
    (awk-loop (read) (i)   ((npos 0))
      ((> i 0) (+ npos 1)))
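
    Since AWK-LOOP is not implemented yet, it may help to see what this
    last example could plausibly expand into by hand. The loop below is a
    sketch in plain Scheme, not the macro's actual expansion; COUNT-POSITIVE
    is a made-up name, and it reads from an explicit port so it can be
    tried on a string.

```scheme
;; Hand-written equivalent of the loop above: one state variable,
;; one clause, terminate on eof and return the state.
(define (count-positive port)
  (let loop ((npos 0))
    (let ((i (read port)))
      (if (eof-object? i)
          npos                                    ; loop value on eof
          (loop (if (> i 0) (+ npos 1) npos))))))  ; clause updates NPOS

(count-positive (open-input-string "3 -1 4 0 5"))   ; => 3
```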


------
Note: How about a loop-internal macro, NEXT, allowing update-by-name, as in

(awk-loop (read) (i) ((num-pos 0) (num-even 0))
  ((> i 0)   (next (num-pos  (+ num-pos  1))))
  ((even? i) (next (num-even (+ num-even 1)))))

    (next (num-pos (+ num-pos 1))) 
macro-expands into
    (let ((num-pos (+ num-pos 1)))
      (values num-pos num-even))
