Awk, record I/O, and field parsing

Unix programs frequently process streams of records, where each record is delimited by a newline, and records are broken into fields with other delimiters (for example, the colon character in /etc/passwd). Scsh has procedures that allow the programmer to easily do this kind of processing. Scsh's field parsers can also be used to parse other kinds of delimited strings, such as colon-separated $PATH lists. These routines can be used with scsh's awk loop construct to conveniently perform pattern-directed computation over streams of records.

8.1  Record I/O and field parsing

The procedures in this section are used to read records from I/O streams and parse them into fields. A record is defined as text terminated by some delimiter (usually a newline). A record can be split into fields by using regular expressions in one of several ways: to match fields, to separate fields, or to terminate fields. The field parsers can be applied to arbitrary strings (one common use is splitting environment variables such as $PATH at colons into their component elements).

The general delimited-input procedures described in chapter 7 are also useful for reading simple records, such as single lines, paragraphs of text, or strings terminated by specific characters.

8.1.1  Reading records

(record-reader [delims elide-delims? handle-delim])     --->     procedure         (procedure) 
Returns a procedure that reads records from a port. The procedure is invoked as follows:
(reader [port]) ---> string or eof
A record is a sequence of characters terminated by one of the characters in delims or eof. If elide-delims? is true, then a contiguous sequence of delimiter chars is taken as a single record delimiter. If elide-delims? is false, then a delimiter char coming immediately after a delimiter char produces an empty-string record. The reader consumes the delimiting char(s) before returning from a read.

The delims set defaults to the set {newline}. It may be a charset, string, character, or character predicate, and is coerced to a charset. The elide-delims? flag defaults to #f.

The handle-delim argument controls what is done with the record's terminating delimiter.

'trim Delimiters are trimmed. (The default)
'split Reader returns the delimiter string as a second value. If the record is terminated by EOF, then the eof object is returned as this second value.
'concat The record and its delimiter are returned as a single string.

The reader procedure returned takes one optional argument, the port from which to read, which defaults to the current input port. It returns a string or eof.
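
As an illustration of the elide-delims? flag, the following sketch reads colon-delimited records from a string port. It assumes that scsh's make-string-input-port is available for building the test port; read-all-records is a hypothetical helper written only for this example, not part of scsh.

```scheme
;; Collect every record READER produces from PORT into a list.
;; (read-all-records is a hypothetical helper, not an scsh procedure.)
(define (read-all-records reader port)
  (let loop ((records '()))
    (let ((rec (reader port)))
      (if (eof-object? rec)
          (reverse records)
          (loop (cons rec records))))))

;; elide-delims? left at its default (#f): two adjacent delimiters
;; produce an empty-string record between them.
(read-all-records (record-reader ":")
                  (make-string-input-port "a::b:c"))
;; ==> ("a" "" "b" "c")

;; elide-delims? true: a run of delimiters counts as one delimiter.
(read-all-records (record-reader ":" #t)
                  (make-string-input-port "a::b:c"))
;; ==> ("a" "b" "c")
```

The final record "c" is terminated by eof rather than by a delimiter, which the definition of a record above allows.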

8.1.2  Parsing fields

(field-splitter [field num-fields])     --->     procedure         (procedure) 
(infix-splitter [delim num-fields handle-delim])     --->     procedure         (procedure) 
(suffix-splitter [delim num-fields handle-delim])     --->     procedure         (procedure) 
(sloppy-suffix-splitter [delim num-fields handle-delim])     --->     procedure         (procedure) 
These functions return a parser function that can be used as follows:
(parser string [start]) ---> string-list

The returned parsers split strings into fields defined by regular expressions. You can parse by specifying a pattern that separates fields, a pattern that terminates fields, or a pattern that matches fields:

Procedure               Pattern
field-splitter          matches fields
infix-splitter          separates fields
suffix-splitter         terminates fields
sloppy-suffix-splitter  terminates fields

These parser generators are controlled by a range of options, so that you can precisely specify what kind of parsing you want. However, these options default to reasonable values for general use.

Defaults:

delim         (rx (| (+ white) eos))   (suffix delimiter: white space or eos)
              (rx (+ white))           (infix delimiter: white space)
field         (rx (+ (~ white)))       (non-white-space)
num-fields    #f                       (as many fields as possible)
handle-delim  'trim                    (discard delimiter chars)

...which means: break the string at white space, discarding the white space, and parse as many fields as possible.

The delim parameter is a regular expression matching the text that occurs between fields. See chapter 6 for information on regular expressions, and the rx form used to specify them. In the separator case, it defaults to a pattern matching white space; in the terminator case, it defaults to white space or end-of-string.

The field parameter is a regular expression used to match fields. It defaults to non-white-space.

The delim patterns may also be given as a string, character, or char-set, which are coerced to regular expressions. So the following expressions are all equivalent, each producing a function that splits strings apart at colons:

(infix-splitter (rx ":"))
(infix-splitter ":")
(infix-splitter #\:)
(infix-splitter (char-set #\:))

The handle-delim argument determines what to do with delimiters. It is one of the following symbols:

'trim Delimiters are thrown away after parsing. (default)
'concat Delimiters are appended to the field preceding them.
'split Delimiters are returned as separate elements in the field list.
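
A short sketch of the three handle-delim settings applied to the same colon-separated string. Since handle-delim is the third optional argument, num-fields is passed explicitly as #f here; the results shown are what the descriptions above imply.

```scheme
;; 'trim (the default): delimiters are discarded.
((infix-splitter ":" #f 'trim) "a:b:c")
;; ==> ("a" "b" "c")

;; 'concat: each delimiter is appended to the field preceding it.
((infix-splitter ":" #f 'concat) "a:b:c")
;; ==> ("a:" "b:" "c")

;; 'split: delimiters appear as separate elements of the result.
((infix-splitter ":" #f 'split) "a:b:c")
;; ==> ("a" ":" "b" ":" "c")
```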

The num-fields argument used to create the parser specifies how many fields to parse. If #f (the default), the procedure parses them all. If a positive integer n, exactly that many fields are parsed; it is an error if there are more or fewer than n fields in the record. If num-fields is a negative integer or zero, then |n| fields are parsed, and the remainder of the string is returned in the last element of the field list; it is an error if fewer than |n| fields can be parsed.
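
The following sketch shows the effect of num-fields on a seven-field password line. The variable name line is chosen for this example only, and the result given for the negative case is the one implied by the description above: |n| parsed fields followed by the unparsed remainder.

```scheme
(define line "wandy:3xuncWdpKhR.:73:22:Wandy Saetan:/usr/wandy:/bin/csh")

;; Exactly seven fields are required; the line has seven, so this succeeds.
((infix-splitter ":" 7) line)
;; ==> ("wandy" "3xuncWdpKhR." "73" "22" "Wandy Saetan" "/usr/wandy" "/bin/csh")

;; Parse two fields, then return the rest of the string as the last element.
((infix-splitter ":" -2) line)
;; ==> ("wandy" "3xuncWdpKhR." "73:22:Wandy Saetan:/usr/wandy:/bin/csh")
```

By the same rule, ((infix-splitter ":" 3) line) would be an error, because the line contains more than three fields.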

The field parser produced is a procedure that can be employed as follows:

(parse string [start]) ===> string-list
The optional start argument (default 0) specifies where in the string to begin the parse. It is an error if start > (string-length string).

The parsers returned by the four parser generators implement different kinds of field parsing:

field-splitter
The regular expression specifies the actual field.

suffix-splitter
Delimiters are interpreted as element terminators. If vertical-bar is the delimiter, then the string "" is the empty record (), "foo|" produces a one-field record ("foo"), and "foo" is an error.

The syntax of suffix-delimited records is:

<record> ::= ""                          (Empty record)
           | <element> <delim> <record>

It is an error if a non-empty record does not end with a delimiter. To make the last delimiter optional, make sure the delimiter regexp also matches the end of the string (the SRE eos).

infix-splitter
Delimiters are interpreted as element separators. If comma is the delimiter, then the string "foo," produces a two-field record ("foo" "").

The syntax of infix-delimited records is:

<record> ::= ""                          (Forced to be empty record)
           | <real-infix-record>
<real-infix-record> ::= <element> <delim> <real-infix-record>
                      | <element>

Note that separator semantics doesn't really allow for empty records -- the straightforward grammar (i.e., <real-infix-record>) parses an empty string as a singleton list whose one field is the empty string, (""), not as the empty record (). This is unfortunate, since it means that infix string parsing doesn't make string-append and append isomorphic. For example,

((infix-splitter ":") (string-append x ":" y))
doesn't always equal
    
(append ((infix-splitter ":") x)
        ((infix-splitter ":") y))
It fails when x or y is the empty string. Terminator semantics does preserve a similar isomorphism.

However, separator semantics is frequently what other Unix software uses, so to parse their strings, we need to use it. For example, Unix $PATH lists have separator semantics. The path list "/bin:" is broken up into ("/bin" ""), not ("/bin"). Comma-separated lists should also be parsed this way.
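
The separator behaviour described above can be sketched as follows. The name split-path is hypothetical, introduced only for this example; the expected results are taken from the text and from the infix column of Figure 6.

```scheme
;; split-path is a hypothetical name for an infix (separator) parser.
(define split-path (infix-splitter ":"))

(split-path "/bin:")    ; ==> ("/bin" "")   trailing separator: empty last field
(split-path "")         ; ==> ()            forced empty record

;; string-append and append are not isomorphic under separator semantics:
(let ((x "") (y "foo"))
  (list (split-path (string-append x ":" y))
        (append (split-path x) (split-path y))))
;; ==> (("" "foo") ("foo"))

;; Splitting the real search path; the result depends on the environment.
(split-path (getenv "PATH"))
```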

sloppy-suffix-splitter
The same as the suffix case, except that the parser will skip an initial delimiter string if the string begins with one instead of parsing an initial empty field. This can be used, for example, to field-split a sequence of English text at white-space boundaries, where the string may begin or end with white space, by using regex
(rx (| (+ white) eos))
(But you would be better off using field-splitter in this case.)

Figure 6 shows how the different parser grammars split apart the same strings.


Record      : suffix    :|$ suffix      : infix         non-: field
""          ()          ()              ()              ()
":"         ("")        ("")            ("" "")         ()
"foo:"      ("foo")     ("foo")         ("foo" "")      ("foo")
":foo"      error       ("" "foo")      ("" "foo")      ("foo")
"foo:bar"   error       ("foo" "bar")   ("foo" "bar")   ("foo" "bar")
Figure 6:  Using different grammars to split records into fields.


Having to choose between the different grammars requires you to decide what you want, but at least you can be precise about what you are parsing. Take fifteen seconds and think it out. Say what you mean; mean what you say.
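
As a sketch of how the four columns of Figure 6 might be constructed, the definitions below build one parser per grammar and apply each to the record "foo:". The four parser names are hypothetical; the expected result is the "foo:" row of the figure.

```scheme
;; One parser per column of Figure 6 (names are hypothetical).
(define suffix-parse       (suffix-splitter ":"))                ; : suffix
(define loose-suffix-parse (suffix-splitter (rx (| ":" eos))))   ; :|$ suffix
(define infix-parse        (infix-splitter ":"))                 ; : infix
(define field-parse        (field-splitter (rx (+ (~ ":")))))    ; non-: field

;; Apply every parser to the same record.
(map (lambda (parse) (parse "foo:"))
     (list suffix-parse loose-suffix-parse infix-parse field-parse))
;; ==> (("foo") ("foo") ("foo" "") ("foo"))
```

Only the infix grammar reports a trailing empty field here, which is the distinction to keep in mind when choosing between separator and terminator semantics.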

(join-strings string-list [delimiter grammar])     --->     string         (procedure) 
This procedure is a simple unparser -- it pastes strings together using the delimiter string.

The grammar argument is one of the symbols infix (the default) or suffix; it determines whether the delimiter string is used as a separator or as a terminator.

The delimiter is the string used to delimit elements; it defaults to a single space " ".

Example:


(join-strings '("foo" "bar" "baz") ":")
        ==>  "foo:bar:baz"
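
A few further calls, sketched from the argument descriptions above, show the default delimiter, the suffix grammar, and a round trip through an infix parser.

```scheme
;; Default delimiter is a single space; default grammar is infix.
(join-strings '("foo" "bar" "baz"))
;; ==> "foo bar baz"

;; With the suffix grammar the delimiter terminates every element.
(join-strings '("foo" "bar" "baz") ":" 'suffix)
;; ==> "foo:bar:baz:"

;; Joining and then splitting with the same separator gives back the list.
((infix-splitter ":") (join-strings '("usr" "local" "bin") ":"))
;; ==> ("usr" "local" "bin")
```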

8.1.3  Field readers

(field-reader [field-parser rec-reader])     --->     procedure         (procedure) 
This utility returns a procedure that reads records with field structure from a port. The reader's interface is designed to make it useful in the awk loop macro (section 8.2). The reader is used as follows:
(reader [port]) ===> [raw-record parsed-record] or [eof ()]

When the reader is applied to an input port (default: the current input port), it reads a record using rec-reader. If this record isn't the eof object, it is parsed with field-parser. These two values -- the record, and its parsed representation -- are returned as multiple values from the reader.

When called at eof, the reader returns [eof-object ()].

Although the record reader typically returns a string, and the field-parser typically takes a string argument, this is not required. The record reader can produce, and the field-parser consume, values of any type. However, the empty list returned as the parsed value on eof is hardwired into the field reader.

For example, if port p is open on /etc/passwd, then

((field-reader (infix-splitter ":" 7)) p)
returns two values:

"dalbertz:mx3Uaqq0:107:22:David Albertz:/users/dalbertz:/bin/csh"
("dalbertz" "mx3Uaqq0" "107" "22" "David Albertz" "/users/dalbertz"
            "/bin/csh")
The field-parser defaults to the value of (field-splitter), a parser that picks out sequences of non-white-space strings.

The rec-reader defaults to read-line.
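
The two-value interface can be consumed with receive, as in the sketch below, which collects the login name from every record until the reader signals eof. The procedure login-names and the sample port are hypothetical, and the example assumes make-string-input-port is available for building test input.

```scheme
(define read-passwd (field-reader (infix-splitter ":" 7)))

;; Return the first field of every record read from PORT.
;; (login-names is a hypothetical helper for this example.)
(define (login-names port)
  (let loop ((names '()))
    (receive (record fields) (read-passwd port)
      (if (eof-object? record)          ; at eof the reader returns [eof ()]
          (reverse names)
          (loop (cons (car fields) names))))))

;; Two sample lines, joined with explicit newline characters.
(define sample-port
  (make-string-input-port
   (string-append
    "dalbertz:mx3Uaqq0:107:22:David Albertz:/users/dalbertz:/bin/csh"
    (string #\newline)
    "wandy:3xuncWdpKhR.:73:22:Wandy Saetan:/usr/wandy:/bin/csh"
    (string #\newline))))

(login-names sample-port)
;; ==> ("dalbertz" "wandy")
```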

Figure 7 shows field-reader being used to read different kinds of Unix records.



;;; /etc/passwd reader
(field-reader (infix-splitter ":" 7))
    ; wandy:3xuncWdpKhR.:73:22:Wandy Saetan:/usr/wandy:/bin/csh

;;; Two ls -l output readers
(field-reader (infix-splitter (rx (+ white)) 8))
(field-reader (infix-splitter (rx (+ white)) -7))
    ; -rw-r--r--  1 shivers    22880 Sep 24 12:45 scsh.scm

;;; Internet hostname reader
(field-reader (field-splitter (rx (+ (~ ".")))))
    ; stat.sinica.edu.tw

;;; Internet IP address reader
(field-reader (field-splitter (rx (+ (~ "."))) 4))
    ; 18.24.0.241

;;; Line of integers
(let ((parser (field-splitter (rx (? ("+-")) (+ digit)))))
  (field-reader (lambda (s) (map string->number (parser s)))))
    ; 18 24 0 241

;;; Same as above.
(let ((reader (field-reader (field-splitter (rx (? ("+-")) 
                                                (+ digit))))))
  (lambda maybe-port
    (receive (record fields) (apply reader maybe-port)
      (values record (map string->number fields)))))
    ; Yale beat harvard 26 to 7.
Figure 7:  Some examples of field-reader


8.1.4  Forward-progress guarantees and empty-string matches

A loop that pulls text off a string by repeatedly matching a regexp against that string can conceivably get stuck in an infinite loop if the regexp matches the empty string. For example, the SREs bos, eos, (* any), and (| "foo" (* (~ "f"))) can all match the empty string.

The routines in this package that iterate through strings with regular expressions are careful to handle this empty-string case. If a regexp matches the empty string, the next search starts, not from the end of the match (which in the empty string case is also the beginning -- that's the problem), but from the next character over. This is the correct behaviour. Regexps match the longest possible string at a given location, so if the regexp matched the empty string at location i, then it is guaranteed it could not have matched a longer pattern starting with character i. So we can safely begin our search for the next match at char i + 1.

With this provision, every iteration through the loop makes some forward progress, and the loop is guaranteed to terminate.
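
The forward-progress rule can be sketched in plain Scheme. This is not the code scsh uses; all-matches and its match argument are hypothetical. Here match stands for any procedure that takes a string and a start index and returns a pair (match-start . match-end), or #f if there is no match.

```scheme
;; Collect every match of MATCH in S, guaranteeing termination.
;; (all-matches and MATCH are hypothetical, for illustration only.)
(define (all-matches match s)
  (let loop ((start 0) (acc '()))
    (if (> start (string-length s))
        (reverse acc)
        (let ((m (match s start)))
          (if (not m)
              (reverse acc)
              (let ((m-start (car m))
                    (m-end   (cdr m)))
                ;; After an empty match, resume one character further on;
                ;; otherwise resume at the end of the match.
                (loop (if (= m-start m-end) (+ m-end 1) m-end)
                      (cons m acc))))))))

;; A matcher that always matches the empty string at the start index.
(all-matches (lambda (s start) (cons start start)) "abc")
;; ==> ((0 . 0) (1 . 1) (2 . 2) (3 . 3))
```

Without the extra step after an empty match, the loop above would call the matcher at index 0 forever.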

This has the effect you want with field parsing. For example, if you split a string with the empty pattern, you will explode the string into its individual characters:

((suffix-splitter (rx)) "foo") ===> ("" "f" "o" "o")
However, even though this boundary case is handled correctly, we don't recommend using it. Say what you mean -- just use a field splitter:
((field-splitter (rx any)) "foo") ===> ("f" "o" "o")
Or, more efficiently,
((lambda (s) (map string (string->list s))) "foo")

8.1.5  Reader limitations

Since all of the readers in this package require the ability to peek ahead one char in the input stream, they cannot be applied to raw integer file descriptors, only Scheme input ports. This is because Unix doesn't support peeking ahead into input streams.

8.2  Awk

Scsh provides a loop macro and a set of field parsers that can be used to perform text processing very similar to the Awk programming language. The basic functionality of Awk is factored in scsh into its component parts. The control structure is provided by the awk loop macro; the text I/O and parsers are provided by the field-reader subroutine library (section 8.1). This factoring allows the programmer to compose the basic loop structure with any parser or input mechanism at all. If the parsers provided by the field-reader package are insufficient, the programmer can write a custom parser in Scheme and use it with equal ease in the awk framework.

Awk-in-scheme is given by a loop macro called awk. It looks like this:


(awk <next-record> <record&field-vars>
     [<counter>] <state-var-decls>
  <clause1> ...)

The body of the loop is a series of clauses, each one representing a kind of condition/action pair. The loop repeatedly reads a record, and then executes each clause whose condition is satisfied by the record.

Here's an example that reads lines from port p and prints the line number and line of every line containing the string "Church-Rosser":


(awk (read-line p) (ln) lineno ()
  ("Church-Rosser" (format #t "~d: ~s~%" lineno ln)))
This example has just one clause in the loop body, the one that tests for matches against the regular expression "Church-Rosser".

The <next-record> form is an expression that is evaluated each time through the loop to produce a record to process. This expression can return multiple values; these values are bound to the variables given in the <record&field-vars> list of variables. The first value returned is assumed to be the record; when it is the end-of-file object, the loop terminates.

For example, suppose we want to read items from /etc/passwd, and we use the field-reader procedure to define a record parser for /etc/passwd entries. The definition

(define read-passwd (field-reader (infix-splitter ":" 7)))
binds read-passwd to a procedure that reads in a line of text when it is called, and splits the text at colons. It returns two values: the entire line read, and a seven-element list of the split-out fields. (See section 8.1 for more on field-reader and infix-splitter.)

So if the <next-record> form in an awk expression is (read-passwd), then <record&field-vars> must be a list of two variables, e.g.,

(record field-vec)
since read-passwd returns two values.

Note that awk allows us to use any record reader we want in the loop, returning whatever number of values we like. These values don't have to be strings or string lists. The only requirement is that the record reader return the eof object as its first value when the loop should terminate.

The awk loop allows the programmer to have loop variables. These are declared and initialised by the <state-var-decls> form, a

((var init-exp) (var init-exp) ...)
list rather like the let form. Whenever a clause in the loop body executes, it evaluates to as many values as there are state variables, updating them.

The optional <counter> variable is an iteration counter. It is bound to 0 when the loop starts. The counter is incremented each time a non-eof record is read.

There are several kinds of loop clause. When evaluating the body of the loop, awk evaluates all the clauses sequentially. Unlike cond, it does not stop after the first clause is satisfied; it checks them all.
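
A minimal sketch of state variables and of the check-every-clause rule. The procedure count-lines is hypothetical, and the example assumes make-string-input-port for its test input. For a blank line both clauses fire, each returning new values for the two state variables.

```scheme
;; Return two values: the total number of lines on PORT and the
;; number of those lines that are blank.
;; (count-lines is a hypothetical helper for this example.)
(define (count-lines port)
  (awk (read-line port) (line) ((total 0) (blank 0))
    (#t                    (values (+ total 1) blank))          ; every line
    ((: bos (* white) eos) (values total       (+ blank 1)))))  ; blank lines

;; Three lines, the middle one empty.
(receive (total blank)
    (count-lines (make-string-input-port
                  (string-append "a" (string #\newline)
                                 (string #\newline)
                                 "b" (string #\newline))))
  (list total blank))
;; ==> (3 1)
```

With a cond-style loop, the blank-line clause would never run, because the #t clause is always satisfied first.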

8.2.1  Examples

Here are some examples of awk being used to process various types of input stream.


(define $ list-ref)     ; Saves typing.

;;; Print out the name and home-directory of everyone in /etc/passwd:
(let ((read-passwd (field-reader (infix-splitter ":" 7))))
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk (read-passwd port) (record fields) ()
        (#t (format #t "~a's home directory is ~a~%"
                    ($ fields 0)
                    ($ fields 5)))))))


;;; Print out the user-name and home-directory of everyone whose
;;; name begins with "S"
(let ((read-passwd (field-reader (infix-splitter ":" 7))))
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk (read-passwd port) (record fields) ()
        ((: bos "S") 
         (format #t "~a's home directory is ~a~%"
                    ($ fields 0)
                    ($ fields 5)))))))


;;; Read a series of integers from stdin. This expression evaluates
;;; to the number of positive numbers that were read. Note our
;;; "record-reader" is the standard Scheme READ procedure.
(awk (read) (i)   ((npos 0))
  ((> i 0) (+ npos 1)))


;;; Filter -- pass only lines containing my name.
(awk (read-line) (line) ()
  ("Olin" (display line) (newline)))


;;; Count the number of non-comment lines of code in my Scheme source.
(awk (read-line) (line) ((nlines 0))
  ((: bos (* white) ";")  nlines)         ; A comment line.
  (else                   (+ nlines 1)))  ; Not a comment line.


;;; Read numbers, counting the evens and odds.
(awk (read) (val) ((evens 0) (odds 0))
  ((> val 0) (display "pos ")  (values evens odds)) ; Tell me about
  ((< val 0) (display "neg ")  (values evens odds)) ; sign, too.
  (else      (display "zero ") (values evens odds)) 

  ((even? val) (values (+ evens 1) odds))
  (else        (values evens       (+ odds 1))))


;;; Determine the max length of all the lines in the file.
(awk (read-line) (line) ((max-len 0))
  (#t (max max-len (string-length line))))


;;; (This could also be done with PORT-FOLD:)
(port-fold (current-input-port) read-line
           (lambda (line maxlen) (max (string-length line) maxlen))
           0)


;;; Print every line longer than 80 chars.
;;; Prefix each line with its line #.
(awk (read-line) (line) lineno ()
  ((> (string-length line) 80)
   (format #t "~d: ~s~%" lineno line)))


;;; Strip blank lines from input.
(awk (read-line) (line) ()
  ((~ white)   (display line) (newline)))


;;; Sort the entries in /etc/passwd by login name.
(for-each (lambda (entry) (display (cdr entry)) (newline))          ; Out
          (sort (lambda (x y) (string<? (car x) (car y)))           ; Sort
                (let ((read (field-reader (infix-splitter ":" 7)))) ; In
                  (awk (read) (line fields) ((ans '()))
                    (#t (cons (cons ($ fields 0) line) ans))))))


;;; Prefix line numbers to the input stream.
(awk (read-line) (line) lineno ()
  (#t (format #t "~d:\t~a~%" lineno line)))

8.3  Backwards compatibility

Previous scsh releases provided an awk form with a different syntax, designed around regular expressions written in Posix notation as strings, rather than SREs.

This form is still available in a separate module for old code. It'll be documented in the next release of this manual. Dig around in the sources for it.