scsh-users

Elusive pipe I/O bug in Scsh?

To: <scsh@zurich.ai.mit.edu>
Subject: Elusive pipe I/O bug in Scsh?
From: Stefan Jankowski <janky@informatik.uni-freiburg.de>
Date: Wed, 12 Feb 2003 14:16:52 +0100 (MET)
A while back Joe Dane (jdane@hawaii.edu) reported a problem with
subprocesses; now I have something even more elusive concerning
subprocesses and I/O. Short description: my program *sometimes*
experiences mysterious errors when reading the stdout of
subprocesses, which may manifest themselves as the loss of a
few bytes, or as "Error: randomness after form after dot" (the
object written together with the error message being the list I
am trying to read, with '---' printed in one or more places).
Alternatively, the program may end up hanging in a tight loop
instead...

A more detailed description: my program is supposed to watch the
users' processes on a multi-user terminal server system (Sun Ray
terminals on a Solaris 8 4-processor server) and automagically
lart[1] CPU hogs or other grossly misbehaving processes. To this
end, it invokes a small helper program written in C that reads
the process information available through the /proc filesystem
and prints it out as a series of nested association lists
("nested" meaning that some values are alists themselves), which
I have Scsh (READ) one by one from the stdout of the
subprocess. The program
then performs a few simple statistical calculations on the
resource usage of each process (the results of which it stores
in a hashtable) and takes corrective action if certain
configurable threshold values are exceeded. The code that
actually performs actions such as re-nicing or terminating
("terminator") as well as the code that logs the actions and
sends out notifications ("logger") may optionally run in
subprocesses of their own, so they can run in security contexts
different from those of the scanner.
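
To make the read loop described above concrete, here is a
minimal sketch. It is written in Python rather than Scsh, and it
is not the actual program: read_forms, forms_from_command and the
stand-in child process are hypothetical names invented for the
illustration, and the "reader" only understands parentheses and
double-quoted strings. It shows a parent collecting one complete
top-level form at a time from a child's stdout, and how a form
that is cut short (for instance by lost bytes) surfaces as a read
error instead of a value.

```python
import io
import subprocess
import sys


def read_forms(stream):
    """Yield each complete top-level parenthesized form from a text stream.

    Only parentheses and double-quoted strings are understood; this is
    far simpler than a real Scheme reader.
    """
    buf = []           # characters of the form being collected
    depth = 0          # current nesting depth of parentheses
    in_string = False  # inside a double-quoted string?
    escaped = False    # previous character was a backslash inside a string?
    while True:
        ch = stream.read(1)
        if ch == "":
            # End of input: complain if we were in the middle of a form.
            if depth != 0:
                raise EOFError("input ended inside a form: " + "".join(buf))
            return
        if depth == 0 and ch != "(":
            continue  # skip whitespace (or anything else) between forms
        buf.append(ch)
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                yield "".join(buf)
                buf = []


def forms_from_command(argv):
    """Run argv as a subprocess and return the forms it writes to stdout."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE, text=True) as proc:
        return list(read_forms(proc.stdout))


if __name__ == "__main__":
    # Stand-in for the C helper: a child process that prints two alists.
    child = [sys.executable, "-c",
             "print('((pid . 101) (usage (cpu . 0.5) (mem . 12)))');"
             "print('((pid . 102) (usage (cpu . 0.1) (mem . 3)))')"]
    for form in forms_from_command(child):
        print(form)

    # A form cut short is reported as an error rather than returned.
    try:
        list(read_forms(io.StringIO("((pid . 103) (usage")))
    except EOFError as err:
        print("read error:", err)
```

The point of the sketch is only that the parent depends on every
byte of every form arriving intact and in order; it says nothing
about where the bytes get lost in the real program.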

In the course of all this action, more subprocesses may be
spawned e.g. to determine the current system load by parsing the
output of uptime(1) or in order to send notification mails. It is
these extra subprocesses that appear to be the cause of the
problem; at least, the system appears to become less stable the
more subprocesses I create, e.g. with the logger and the
terminator running as processes the system is more likely to
crash than with these tasks running as threads. And I currently
have a situation where the program will crash /consistently/
while sending out a particular notification mail, /not in the
code that sends the mail but in the code that reads the process
info/ (!)

Ah, well, in fairness to Scsh I must admit that my code does yet
another potentially distressing thing: the processing of the data
read from the helper process is not done through simple
functional composition but in a /stream/ (implementation lifted
from SICP, 2nd ed. 1996, ch. 3.5), so that each read triggers a
whole cascade of (DELAY) and (FORCE) calls. It seemed to be a
nifty idea at one time, but it turned out not to be... (I plan
to re-write
the program to use functional composition and see if it gets any
better.)
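
For readers unfamiliar with the stream idiom, the following is a
rough sketch of it. Again this is Python rather than Scheme and
not the actual code: Promise, cons_stream, stream_from_reader and
the other names are hypothetical stand-ins for the SICP
constructs. It shows how asking for the next element of a mapped
stream forces a chain of delayed computations that ends in one
more read from the source, and how memoization keeps an element
from being read twice.

```python
class Promise:
    """A memoized delayed computation, in the spirit of (DELAY expr)."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._done = False
        self._value = None

    def force(self):
        # Run the thunk at most once and remember its result.
        if not self._done:
            self._value = self._thunk()
            self._done = True
            self._thunk = None
        return self._value


EMPTY = None  # the empty stream


def cons_stream(head, tail_thunk):
    """Pair an already computed head with a delayed tail."""
    return (head, Promise(tail_thunk))


def stream_car(s):
    return s[0]


def stream_cdr(s):
    return s[1].force()


def stream_from_reader(read_next):
    """Build a stream from read_next(), which returns None at end of input."""
    item = read_next()
    if item is None:
        return EMPTY
    return cons_stream(item, lambda: stream_from_reader(read_next))


def stream_map(f, s):
    if s is EMPTY:
        return EMPTY
    return cons_stream(f(stream_car(s)),
                       lambda: stream_map(f, stream_cdr(s)))


def stream_take(s, n):
    """Return up to n leading elements as a list, forcing no more than needed."""
    out = []
    while n > 0 and s is not EMPTY:
        out.append(stream_car(s))
        n -= 1
        if n > 0:
            s = stream_cdr(s)
    return out


if __name__ == "__main__":
    reads = []                   # log of what the source handed out
    source = iter([1, 2, 3, 4])  # stand-in for forms read from the helper

    def read_next():
        value = next(source, None)
        reads.append(value)
        return value

    samples = stream_from_reader(read_next)
    scaled = stream_map(lambda x: x * 10, stream_map(lambda x: x + 1, samples))
    print(stream_take(scaled, 2), "after reads", reads)
```

In this arrangement the reads happen deep inside whatever code
happens to force the stream, which is one reason a failure can
show up far from the place that triggered it.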

Even so, the behaviour of Scsh seems like a bug to me, and a
beastly elusive one at that. I would try to write a simple
demonstration program that triggers the errors described above,
if only I had the faintest glimmer of a clue as to what is going
on here... Any hints, clues, or other helpful comments will be
highly appreciated.

Regards,
Stefan Jankowski

-------- Notes:

[1] lart, v.: Denominal verb derived from LART: Luser Attitude
    Readjustment Tool, see The Jargon File for more info...


-- 
Stefan Jankowski               <janky@informatik.uni-freiburg.de>
DV-Systemtechniker                            Tel:  0761-203-8189
Institut für Informatik                       Fax:  0761-203-8109
Universität Freiburg i.Br.


