Guys-
I've redesigned the field/record parsers. I append the new design, with
an extra bonus: the doc for the awk macro. I have also just posted these
designs to the Scheme group on netnews.
I think I've now got everything you all wanted. Let me know how you
like the new design. I have not implemented it yet; just got finished working
out the design and writing it up.
Thank you for the feedback on the last version.
-Olin
-------------------------------------------------------------------------------
The scsh field and record parsers.
So featureful you'll puke! (tm)
Copyright (c) 1994 by Olin Shivers.
Since all of the readers discussed below require the ability to peek
ahead one char in the input stream, they cannot be applied to raw
integer file descriptors, only Scheme input ports. This is because
Unix doesn't support peeking ahead into input streams.
(read-delimited char-set [port]) -> string or eof
Read until we encounter one of the chars in CHAR-SET or eof.
The terminating character is not included in the string returned,
nor is it removed from the input stream; the next input operation will
encounter it. If we get a string back, then (eof-object? (peek-char port))
tells if the string was terminated by a delimiter or eof.
CHAR-SET may be a charset, a string, a character, or a character
predicate; it is coerced to a charset.
This operation is likely to be implemented very efficiently. In
the Scheme 48 implementation, the Unix port case is implemented directly
in C, and is much faster than the equivalent operation performed
in Scheme with PEEK-CHAR and READ-CHAR.
(read-delimited! char-set buf [port start end]) -> nchars or eof or #f
Variant of READ-DELIMITED.
#f means buffer filled up without encountering delimiter or eof.
If an integer is returned, then (eof-object? (peek-char port))
tells if the string was terminated by a delimiter or eof.
(record-reader [delims elide-delims? delim-action])
Returns a procedure that reads records from a port. A record is
a sequence of characters terminated by one of the characters
in DELIMS or eof. If ELIDE-DELIMS? is true, then a contiguous
sequence of delimiter chars is taken as a single record delimiter.
If ELIDE-DELIMS? is false, then a delimiter char coming immediately after
a delimiter char produces an empty string record. The reader consumes
the delimiting char(s) before returning from a read.
DELIMS defaults to the set {newline}. It may be a charset,
string, character, or character predicate, and is coerced to a charset.
ELIDE-DELIMS? defaults to #f.
DELIM-ACTION controls what is done with the record's terminating delimiter.
  'trim    Delimiters are trimmed.
  'split   Reader returns the delimiter string as a second value.
           If the record is terminated by EOF, then the eof object is
           returned as this second value.
  'concat  The record and its delimiter are returned as a single string.
DELIM-ACTION defaults to 'trim.
The reader procedure returned takes one optional argument, the port
from which to read, which defaults to the current input port. It returns
a string or eof.
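To make the interaction of ELIDE-DELIMS? and DELIM-ACTION concrete, here is a small Python model of the reader described above. It is a sketch under assumptions, not the scsh code: `record_reader` and `read_all` are hypothetical Python names, `None` stands in for the eof object, DELIMS is assumed to be a plain string of characters, and under 'split the two values are returned as a tuple.

```python
import io

def record_reader(delims="\n", elide_delims=False, delim_action="trim"):
    def read_record(port):
        chars = []
        ch = port.read(1)
        while ch != "" and ch not in delims:
            chars.append(ch)
            ch = port.read(1)
        if ch == "" and not chars:
            return None                      # called at eof
        record = "".join(chars)
        delim = ch if ch != "" else None     # None: record was ended by eof
        if delim is not None and elide_delims:
            while True:                      # swallow the rest of the delimiter run
                pos = port.tell()
                nxt = port.read(1)
                if nxt == "" or nxt not in delims:
                    port.seek(pos)           # not a delimiter: leave it unread
                    break
                delim += nxt
        if delim_action == "trim":
            return record
        if delim_action == "concat":
            return record + (delim or "")
        if delim_action == "split":
            return record, delim
        raise ValueError("unknown delim-action: " + repr(delim_action))
    return read_record

def read_all(reader, text):
    """Apply READER to an in-memory port until it reports eof."""
    port = io.StringIO(text)
    out = []
    while True:
        rec = reader(port)
        if rec is None:
            return out
        out.append(rec)

plain = read_all(record_reader(), "a\n\nb")
elided = read_all(record_reader("\n", True), "a\n\nb")
concat = read_all(record_reader("\n", False, "concat"), "a\nb")
split = read_all(record_reader("\n", False, "split"), "a\nb")
```

With these inputs, the doubled newline yields an empty record only when eliding is off, and 'concat and 'split both let the caller see that the last record was ended by eof rather than by a delimiter.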
(field-parser [delim grammar num-fields handle-delimiters])
FIELD-PARSER returns a procedure that can be used to parse
a string representing a sequence of fields separated by delimiters.
FIELD-PARSER implements a range of options, so that you can
precisely specify what kind of parsing you want. However, these options
default to reasonable values for general use.
Defaults:
DELIM = "[ \t\n]+|$"
GRAMMAR = 'SLOPPY-SUFFIX
NUM-FIELDS = #f
HANDLE-DELIMITERS = 'TRIM
...which means: break the string at white-space, discarding the
white-space, and parse as many fields as possible.
DELIM is a regular expression used to match field delimiters.
It defaults to the regular expression "[ \t\n]+|$" which matches
either a string of whitespace or the end-of-string.
HANDLE-DELIMITERS determines what to do with delimiters.
- 'trim (default) Delimiters are thrown away after parsing.
- 'concat Delimiters are appended to the field preceding them.
- 'split Delimiters are returned as separate elements in
the field vector.
GRAMMAR determines the grammar used to parse the sequence of
delimiters and elements from the string. It defaults to the
symbol SLOPPY-SUFFIX. There are three possible grammars:
- SUFFIX
Delimiters are interpreted as element *terminators*. If vertical-bar is
the delimiter, then the string "" is the empty record #(), and
"foo|" produces a one-field record #("foo").
The syntax of suffix-delimited records is:
<record> ::= "" ; Empty record
| <element> <delim> <record>
It is an error if a non-empty record does not end with a delimiter.
To make the last delimiter optional, make sure the delimiter regexp
matches the end-of-string (regex "$").
- INFIX
Delimiters are interpreted as element *separators*. If comma is the
delimiter, then the string "foo," produces a two-field record #("foo" "").
The syntax of infix-delimited records is:
<record> ::= "" ; Force to be empty record.
| <real-infix-record>
<real-infix-record> ::= <element> <delim> <real-infix-record>
| <element>
Note that separator semantics doesn't really allow for empty records --
the straightforward grammar (i.e., <real-infix-record>) parses an empty
string as a singleton list whose one field is the empty string (i.e.,
#("")), not as the empty record #(). This is unfortunate, since it means
that infix string parsing doesn't make STRING-APPEND and VECTOR-APPEND
isomorphic:
((field-parser delim 'infix) (string-append x y))
doesn't always equal
(vector-append ((field-parser delim 'infix) x)
((field-parser delim 'infix) y))
Terminator semantics *does* preserve this isomorphism.
However, separator semantics is frequently what other systems
use, so to parse their strings, we need to use it. For example,
Unix $PATH lists have separator semantics. The list "/bin:"
is broken up into ("/bin" ""), not ("/bin"). Comma-separated
lists should also be parsed this way.
- SLOPPY-SUFFIX
The same as the SUFFIX case, except that the parser will skip an initial
delimiter string if the string begins with one instead of parsing an
initial empty field. This can be used, for example, to field-split a
sequence of English text at white-space boundaries, where the string may
begin or end with white-space, by using regex "[ \t]+|$".
To see the difference between infix and suffix grammars, consider
the following table:
Record      : terminates   :|$ terminates   : separates
---------------------------------------------------------
""          #()            #()              #()
":"         #("")          #("")            #("" "")
"foo:"      #("foo")       #("foo")         #("foo" "")
":foo"      ERROR          #("" "foo")      #("" "foo")
"foo:bar"   ERROR          #("foo" "bar")   #("foo" "bar")
The GRAMMAR argument requires you to decide what you want,
but at least you can be precise about what you are parsing.
Take fifteen seconds and think it out. Say what you mean;
mean what you say.
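The three grammars can be checked against the table with a short executable model. This is a Python sketch under assumptions, not the scsh parser: `field_parser` and `outcome` are hypothetical Python names, Python's `re` syntax stands in for the regexps, lists stand in for field vectors, a ValueError stands in for "it is an error", and only the 'trim delimiter handling is modelled (NUM-FIELDS is left out).

```python
import re

def field_parser(delim=r"[ \t\n]+|$", grammar="sloppy-suffix"):
    rx = re.compile(delim)

    def parse(s):
        if s == "":
            return []                        # every grammar: "" is the empty record
        fields, pos, n = [], 0, len(s)
        if grammar == "infix":
            while True:
                m = rx.search(s, pos)
                if m is None or m.end() == m.start():
                    fields.append(s[pos:])   # last element has no separator after it
                    return fields
                fields.append(s[pos:m.start()])
                pos = m.end()
        if grammar == "sloppy-suffix":
            m = rx.match(s)
            if m and m.end() > 0:
                pos = m.end()                # skip a leading delimiter
        elif grammar != "suffix":
            raise ValueError("unknown grammar: " + repr(grammar))
        while pos < n:
            m = rx.search(s, pos)
            if m is None:
                raise ValueError("record does not end with a delimiter")
            if m.end() == pos:
                raise ValueError("delimiter matched without consuming input")
            fields.append(s[pos:m.start()])
            pos = m.end()
        return fields

    return parse

def outcome(parse, s):
    """Run a parser, mapping a parse error to the table's ERROR entry."""
    try:
        return parse(s)
    except ValueError:
        return "ERROR"

suffix = field_parser(":", "suffix")         # ":" terminates
suffix_opt = field_parser(":|$", "suffix")   # ":" or end-of-string terminates
infix = field_parser(":", "infix")           # ":" separates
table = {r: (outcome(suffix, r), outcome(suffix_opt, r), outcome(infix, r))
         for r in ["", ":", "foo:", ":foo", "foo:bar"]}

# Leading white-space: sloppy-suffix skips it, suffix yields an empty first field.
sloppy_words = field_parser(r"[ \t]+|$", "sloppy-suffix")("  foo bar ")
strict_words = field_parser(r"[ \t]+|$", "suffix")("  foo bar ")
```

Each entry of `table` holds the three columns of the table above for one input string.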
The field parser produced is a procedure that can be employed as
follows:
(parse string) -> string-vector
To allow simple composition of record readers and field parsers
(see RECORD-READER, above), the STRING argument is permitted to
be the end-of-file object; in this case, the field parser returns
the empty vector.
The NUM-FIELDS argument used to create the parser specifies how many
fields to parse. If #f (the default), the procedure parses them all. If
a positive integer N, exactly that many fields are parsed; it is an error
if fewer than N fields are parsed, or if text remains in the string after
parsing N fields. If NUM-FIELDS is a negative integer or zero, then
|N| fields are parsed, and the remainder of the string is returned
in the last element of the field vector; it is an error if fewer than
|N| fields can be parsed.
(field-reader [fdelim grammar num-fields rdelims r-elide? r-delim-action])
Returns a procedure that reads records with field structure from a port.
When the reader is applied to an input port (which defaults to the current
input port), it returns two values: the record and a field vector.
The record is a string; the field vector is a vector of strings and
is the record split up into strings at field boundaries.
When called at eof, returns [eof-object #()].
For example, if port p is open on /etc/passwd, then
((field-reader ":" 'infix 7) p)
returns two values:
"wandy:3xuncWdpOKhR.:112:22:Wandy Saetan:/users/wandy:/bin/csh"
#("wandy" "3xuncWdpOKhR." "112" "22" "Wandy Saetan" "/users/wandy"
"/bin/csh")
FDELIM is a regular expression used to delimit the fields in the record;
it defaults to a regexp that matches white-space and the end-of-string:
"[ \t\n]+|$"
GRAMMAR is used to determine how to parse the string into fields.
See the FIELD-PARSER procedure for more detail. It defaults to
SLOPPY-SUFFIX.
Record boundaries are found as specified for the RECORD-READER
procedure.
- RDELIMS is the set of characters that are taken to terminate
record boundaries. It is either a charset, string, char, or char
predicate. It is coerced to a charset and defaults to the set {newline}.
- R-ELIDE? determines whether or not a contiguous sequence of record
delimiting characters is considered a single delimiter. It defaults to
#f.
- R-DELIM-ACTION is either the symbol TRIM (discard the record delimiter)
or CONCAT (return the record and its delimiter as a single string).
It defaults to TRIM. By using CONCAT, you can distinguish the
case of a final line being ended by the record delimiter or
being ended by EOF.
In either case, the field reader consumes the record-delimiting char(s)
before returning.
FIELD-READER does not allow the full set of possibilities permitted
by RECORD-READER and FIELD-PARSER; it only covers a convenient set
of the main uses.
(field-reader FDELIM GRAMMAR NUM-FIELDS RDELIMS R-ELIDE? R-DELIM-ACTION)
is exactly equivalent to
(lambda maybe-port
  (let ((rec (apply (record-reader RDELIMS R-ELIDE? R-DELIM-ACTION)
                    maybe-port)))
    (values rec ((field-parser FDELIM GRAMMAR NUM-FIELDS) rec))))
If you wish some other kind of functionality, it is trivial to compose
your own record reader and field parser using the full set of arguments
allowed for these two procedures.
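The composition described above, reading a record and then splitting it, can be illustrated with a deliberately narrow Python model. It is a sketch under assumptions, not the scsh procedure: the Python `field_reader` here is a hypothetical stand-in that covers only the infix grammar, newline-terminated records with the delimiter trimmed, a delimiter regexp that never matches the empty string, and a positive or absent field count. `None` stands in for the eof object and lists for field vectors.

```python
import io
import re

def field_reader(fdelim, num_fields=None):
    rx = re.compile(fdelim)

    def read(port):
        line = port.readline()
        if line == "":
            return None, []                      # models [eof-object #()]
        rec = line[:-1] if line.endswith("\n") else line
        fields = rx.split(rec) if rec else []    # infix: "" is the empty record
        if num_fields is not None and len(fields) != num_fields:
            raise ValueError("expected %d fields, got %d"
                             % (num_fields, len(fields)))
        return rec, fields

    return read

passwd = io.StringIO(
    "wandy:3xuncWdpOKhR.:112:22:Wandy Saetan:/users/wandy:/bin/csh\n")
read_passwd = field_reader(":", 7)
record, fields = read_passwd(passwd)
at_end = read_passwd(passwd)                     # second call hits eof

_, path_dirs = field_reader(":")(
    io.StringIO("/usr/shivers/bin:/usr/local/bin:/usr/ucb:/usr/bin:/bin"))
_, ip_parts = field_reader(r"\.", 4)(io.StringIO("18.24.0.241"))
```

The sample inputs are the /etc/passwd, $PATH and IP address lines used elsewhere in this note.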
Examples:
(field-reader ":" 'infix 7) ; /etc/passwd reader
; dalbertz:mx3Uaqq0:107:22:David Albertz:/users/dalbertz:/bin/csh
(field-reader ":" 'infix) ; $PATH reader
; /usr/shivers/bin:/usr/local/bin:/usr/ucb:/usr/bin:/bin
(field-reader "[ \t]+" 'infix 8) ; ls -l output reader
(field-reader "[ \t]+" 'infix -7) ; ls -l output reader
; -rw-r--r-- 1 shivers 22880 Sep 24 12:45 scsh.scm
(field-reader "\\." 'infix) ; Internet hostname reader
; stat.sinica.edu.tw
(field-reader "\\." 'infix 4) ; Internet IP address reader
; 18.24.0.241
-------------------------------------------------------------------------------
Awk-in-scheme is given by a loop macro called AWK-LOOP. It looks like
this:
(awk-loop <next-record> <record&field-vars> <state-var-decls>
<body> ...)
AWK-LOOP's <body> is made up of a series of clauses, each one representing
a kind of condition/action pair. AWK-LOOP repeatedly reads a record,
and then executes each clause whose condition is satisfied by the record.
In more detail:
<next-record> is an expression that is evaluated each time through the loop
to produce a record to process. This expression can return multiple
values; these values are bound to the variables given in the
<record&field-vars> list of variables. The first value returned is
assumed to be the record; when it is the end-of-file object, the
loop terminates (producing what value? we'll get to that later).
For example, let's suppose we want to read items from /etc/passwd,
and we use the FIELD-READER utility to define a record parser for
/etc/passwd entries:
(define passwd-reader (field-reader ":" 'infix 7))
This binds PASSWD-READER to a procedure that reads in a line of text when
it is called, and splits the text at colons. It returns two values:
the entire line read, and a seven-element vector of the split-out fields.
So if the <next-record> form in an AWK-LOOP is (PASSWD-READER), then
<record&field-vars> must be a list of two variables, e.g.
(RECORD FIELD-VEC)
since PASSWD-READER returns two values.
Note that AWK-LOOP allows us to use *any* record reader we want in the
loop, returning whatever number of values we like.
The awk-loop allows the programmer to have loop variables. These are
declared and initialised in a ((var init-exp) (var init-exp) ...) list
rather like the LET macro. Whenever a clause in the loop body executes, it
evaluates to as many values as there are state variables, thus updating
them. When the loop terminates on eof, it returns the values of these
variables as multiple values.
There are several kinds of loop clause. When evaluating the body of the
loop, awk-loop evaluates *all* the clauses sequentially. Unlike COND,
it does not stop after the first clause is satisfied; it checks them all.
(<test> <body1> <body2> ...)
If <test> is true, execute the body forms. The last body form
is the value of the clause. The test and body forms can see the
record and state variables.
The <test> form can be one of:
integer: The test is true for that iteration of the loop.
The first iteration is #1.
string: The string is a regular expression. The test is
true if the regexp matches the record.
expression: If not an integer or a string, the test form is
a Scheme expression that is evaluated.
(range <start-test> <stop-test> <body1> ...)
This clause becomes activated when <start-test> is true; it stays
active on all further iterations until <stop-test> is true.
So, to print out the first ten lines of a file, we use the clause:
(range 1 10 (display record))
(else <body1> <body2> ...)
If no other clause has executed since the top of the loop, or
since the last ELSE clause, this clause executes.
(<test> => <exp>)
If <test> returns a true value, apply <exp> to that value.
If <test> is a regular-expression string, then <exp> is applied
to the match data structure returned by the regexp match routine.
Examples:
(define $ vector-ref) ; Saves typing.
;; Print out the name and home-directory of everyone in /etc/passwd:
(let ((reader (field-reader ":" 'infix 7)))
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk-loop (reader port) (record fields) ()
        (#t (format #t "~a's home directory is ~a~%"
                    ($ fields 0)
                    ($ fields 5)))))))
;; Print out the user-name and home-directory of everyone whose
;; name begins with "S"
(let ((reader (field-reader ":" 'infix 7))) ; READER parses lines at :'s.
  (call-with-input-file "/etc/passwd"
    (lambda (port)
      (awk-loop (reader port) (record fields) ()
        ("^S" (format #t "~a's home directory is ~a~%"
                      ($ fields 0)
                      ($ fields 5)))))))
;; Read a series of integers from stdin. This expression evaluates
;; to the number of positive numbers that were read. Note our "record-reader"
;; is the standard Scheme READ procedure.
(awk-loop (read) (i) ((npos 0))
((> i 0) (+ npos 1)))
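The control flow of the loop (read a record, try every clause, thread the state variables through, return them at eof) can be modelled in a few lines of Python. This is a sketch of the semantics described above under stated assumptions, not the macro itself: `awk_loop` is a hypothetical function, clauses are (test, body) pairs, the state variables are a tuple that each body returns a new copy of, `None` stands in for the eof object, and the RANGE, ELSE and => clause forms are left out.

```python
import re

def awk_loop(next_record, clauses, state):
    iteration = 0
    while True:
        record = next_record()
        if record is None:               # eof: the loop's value is the state
            return state
        iteration += 1
        for test, body in clauses:       # every clause is tried, unlike COND
            if isinstance(test, int):
                fired = (iteration == test)
            elif isinstance(test, str):
                fired = re.search(test, record) is not None
            else:
                fired = test(record, state)
            if fired:
                state = body(record, state)

# Count the positive numbers in a stream, as in the (read) example.
numbers = iter([3, -1, 4, 0, 5])
(npos,) = awk_loop(lambda: next(numbers, None),
                   [(lambda i, st: i > 0, lambda i, st: (st[0] + 1,))],
                   (0,))

# Two state variables: the first record seen, and how many records start with "S".
lines = iter(["Sam:x", "root:z", "Smith:y"])
first_line, n_s = awk_loop(lambda: next(lines, None),
                           [(1, lambda rec, st: (rec, st[1])),
                            ("^S", lambda rec, st: (st[0], st[1] + 1))],
                           (None, 0))
```

In the second call the record "Sam:x" fires both clauses on the first iteration, which is the difference from COND noted earlier.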
------
Note: How about a loop-internal macro, NEXT, allowing update-by-name, as in
(awk-loop (read) (i) ((num-pos 0) (num-even 0))
((> i 0) (next (num-pos (+ num-pos 1))))
((even? i) (next (num-even (+ num-even 1)))))
(next (num-pos (+ num-pos 1)))
macro-expands into
(let ((num-pos (+ num-pos 1)))
(values num-pos num-even))