Replacing programs by parallel versions
Of course, total camouflage of parallelism is both hard to achieve and useless. Users can get much more if they change the way they work, and the change can be quite small and easy. I hope my users will generally use my batch spooling system, which is simple to understand and use and which makes the whole power of our cluster easily accessible.
Special Cluster Management Tools
Distributed filesystem
The simplest way to connect all the IDE disks (16 at the moment) into one common filesystem tree is to use NFS automounts. This is the current state of the art here. Later we might add disk-usage balancing daemons that move some data from disk to disk and change links accordingly, to ensure that no single disk runs out of space while there is plenty of room on another.
Ultimately we would like to make our system safe against disk failure using a RAID-like technique. This is not easy to do well because of the limited network bandwidth separating the disks, but we have an idea which we will try to realize, unless somebody else does it before us. A description of the idea follows:
Fault tolerance could be added by keeping two copies of every file; this
needs twice the space.
Fault tolerance could also be added using N+1 (up to 15+1 for me)
parity. This could be slow (?) to do over the network.
Maybe the two methods could be combined, using N+1 parity for most of
the data ("stable part") and mirror copies (or just new copies) for
the rest ("volatile part"). The new versions of files created during
one day in the volatile part could then be moved to the stable part
overnight. The stable part would then be used read-only during the
day, gathering any changes in the volatile part, and so on.
This scheme could ensure that when one disk fails, only one day of
volatile data could be lost (if only one copy is maintained) or only
quite recent volatile data could be lost (if we always make a copy on
another disk, but the process creating the file does not wait for the
second copy to be finished).
All this could be relatively simple to do. The "stable part" would require
just XORing N partitions into the (N+1)th one, and the same
operation for eventual recovery after a disk failure. Then we need some
mechanism for diverting file access from the stable part to the
volatile part for files modified earlier during the day. This could be
accomplished using a map of the filesystem consisting of directories full
of symlinks to the real files, pointing either to the stable or the volatile
part - thus we could "move" a file out of the stable part without
modifying the stable part (and thus without a parity update). This could
be done by a modified amd automounter (amd simulates an NFS mount
and returns symlinks to real files, which are (NFS) mounted and
unmounted in another, hidden place as needed).
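
The XOR step can be illustrated with a minimal sketch. It assumes each
"partition" is just an equally sized block of bytes held in memory; the
function names are hypothetical, and a real implementation would stream
the data from the disks over the network instead.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def make_parity(data_blocks):
    """Compute the (N+1)th parity block from N data blocks."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing data block from the N-1 survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

Recovery is the same operation as parity creation: XORing the parity
block with all surviving blocks cancels them out and leaves the lost one.
This only works while at most one block is missing.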
The only hard part is to know that a file (in the stable part) is going
to be modified. We have to either create an exact volatile copy (if it
is being opened "rw"), or maybe just an empty file (if it is being
opened "w" and any old data would be deleted anyway). This could
require some kernel modifications, maybe just spying on open(), seek() and a
few other system calls (the call will block until we do what is needed
and return a proper symlink from our amd-like redirector). We could also
do all this inside an amd-like program (no kernel modification) - if we do NOT
return a symlink but implement the rest of the NFS chores (which amd does not
implement because the rest of the file operations are diverted by the symlink).
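
The diversion itself (not the hard part of detecting the open) might look
roughly like the following sketch. It assumes the map entry is an ordinary
symlink and that the volatile part is a single flat directory; the function
name divert_for_write and its parameters are hypothetical, and nothing here
hooks into amd or the kernel.

```python
import os
import shutil


def divert_for_write(link_path, volatile_dir, truncate):
    """Repoint a map symlink from the stable part to a volatile copy.

    truncate=True models an open for "w": the old data would be discarded
    anyway, so an empty volatile file is enough.
    truncate=False models an open for "rw": the writer needs an exact copy.
    The stable file itself is never touched, so parity stays valid.
    """
    current_target = os.path.realpath(link_path)
    if os.path.dirname(current_target) == os.path.realpath(volatile_dir):
        return current_target  # already diverted earlier in the day

    # Simplification: a flat volatile directory keyed by file name only.
    volatile_path = os.path.join(volatile_dir, os.path.basename(link_path))
    if truncate:
        open(volatile_path, "wb").close()
    else:
        shutil.copy2(current_target, volatile_path)

    # Swap the symlink atomically: build the new link, then rename over the old.
    tmp_link = link_path + ".tmp"
    os.symlink(volatile_path, tmp_link)
    os.replace(tmp_link, link_path)
    return volatile_path
```

The caller would block the original open() until this returns, then let it
proceed through the updated symlink.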
If I start work on this, will anybody be interested in the resulting system? Or will anybody want to collaborate?
Anyway, N+1 parity would be quite easy for read-only partitions - surely I will try this.
We could also do quite a good job simply by replacing a few UNIX commands with special scripts - for example, 'mkdir' could look for the filesystem with the most free space, possibly create the real new directory there and put just a symlink in the original place. Maybe this simple trick could do nearly the whole job by itself! Of course, this is just a heuristic which might work well or not depending on user habits, but I think it is a good one - as long as your most disk-hungry users are using the good old 'mkdir' command.
All these symlink tricks would probably require something like 'alias ls ls -L'.
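
A minimal sketch of such a 'mkdir' replacement follows. The function name
balanced_mkdir, the candidates list of mount points, and the injectable
free-space function are all assumptions made for illustration; a real
script would also need to handle name clashes and nested paths.

```python
import os
import shutil


def free_bytes(path):
    """Free space, in bytes, on the filesystem holding path."""
    return shutil.disk_usage(path).free


def balanced_mkdir(path, candidates, free=free_bytes):
    """Create the directory on the candidate with most free space.

    A symlink is left at path so users still see the directory where they
    asked for it. Returns the location of the real directory.
    """
    best = max(candidates, key=free)
    name = os.path.basename(os.path.normpath(path))
    real_dir = os.path.abspath(os.path.join(best, name))
    os.mkdir(real_dir)
    if real_dir != os.path.abspath(path):
        os.symlink(real_dir, path)
    return real_dir
```

For example, balanced_mkdir("results", ["/disk1", "/disk2"]) would create
the directory on whichever of the two (hypothetical) mount points has more
room and leave a symlink named "results" in the current directory.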
Harddisk partitions and booting
All harddisks have the same partitions:
Disk /dev/hda: 255 heads, 63 sectors, 527 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot   Begin    Start      End   Blocks   Id  System
/dev/hda1   *        1        1       47   377496   83  Linux native
/dev/hda2           48       48       63   128520   82  Linux swap
/dev/hda3           64       64      295  1863540   83  Linux native
/dev/hda4          296      296      527  1863540   83  Linux native

The first partition contains a nearly normal Linux distribution. The whole /dev/hda1 partition can also be stored as one big file in the filesystem made on /dev/hda3 or /dev/hda4. This way you can back up the system partition, make all the most dangerous experiments with new software, kernel upgrades, beta driver tests etc., and if the result does not satisfy you (maybe because your /dev/hda1 filesystem cannot be repaired :), you can simply copy back one of the previous versions.
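
Storing a partition as one big file amounts to a raw byte-for-byte copy. A
minimal sketch, assuming the source partition is unmounted or mounted
read-only while it is copied; the function name copy_image and the image
path in the usage note are hypothetical.

```python
import shutil


def copy_image(source, destination, chunk_size=1024 * 1024):
    """Copy every byte of source to destination in fixed-size chunks."""
    with open(source, "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

A backup would then be something like copy_image("/dev/hda1",
"/mnt/hda3/hda1.img"), run as root, with the image path chosen to taste.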
To always have something safe to boot from, there is another system (root disk) on /dev/hda4 - this will eventually be replaced by the bare minimum needed to copy to/from /dev/hda1 over the network, but for now I am quite happy with a copy of an older stable version of the /dev/hda1 filesystem (transferred using (cd from; tar cf - .)|(cd to; tar xvfp -), unlike the cp /dev/hda1 filename used to make the backup copy).
All this is completed by a double-LILO setup, which lets me play even with LILO (on hda1) without worry. Look:
/etc/lilo.conf in filesystem on /dev/hda4:
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
# chain to lilo in hda1
other=/dev/hda1
    label=chain-hda1
    table=/dev/hda
image=/boot/zImage
    label=1
    root=/dev/hda1
    read-only
image=/boot/zImage
    label=4
    root=/dev/hda4
    read-only
image=/boot/zImage
    label=c1
    root=/dev/hdc1
    read-only
image=/boot/zImage
    label=c4
    root=/dev/hdc4
    read-only
/etc/lilo.conf in filesystem on /dev/hda1:
boot=/dev/hda1
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
image=/boot/zImage
    label=a1
    root=/dev/hda1
    read-only
image=/boot/zImage
    label=a4
    root=/dev/hda4
    read-only
image=/boot/zImage
    label=c1
    root=/dev/hdc1
    read-only
image=/boot/zImage
    label=c4
    root=/dev/hdc4
    read-only
/sbin/lilo is always run while the partition (/dev/hda1 or /dev/hda4) is mounted
as the root partition, so the kernel used during boot resides in the same partition as
the corresponding lilo.conf. The two main boot alternatives are