(eight penguins)

Software for Linux Cluster MAGI



Replacing programs with parallel versions

Our objective was to make programs look quite normal from the user's perspective (like on any other UNIX), only quicker. While this is possible with many programs, the approach taken varies a lot depending on the program type; the alternative widely used on many other cluster sites is quite low on my list, and of course there is also one thing we could do to adapt all programs at once. It is interesting to note that some programs can act as 'gateways' to parallelism for other programs - make, xargs, shells.
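
For illustration, a few examples of how such 'gateway' programs can fan work out in parallel (the node names, script names and file patterns here are made up):

# let make build up to 16 targets at once
make -j 16 all
# let GNU xargs run up to 16 copies of a (hypothetical) processing script
find /data -name '*.wav' -print | xargs -n 1 -P 16 ./process-one.sh
# a plain shell loop can also fan jobs out to the nodes
for node in magi1 magi2 magi3 magi4; do rsh $node do-my-part & done; wait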

Of course, total camouflage of parallelism is both hard to achieve and useless. Users can get much more if they change the way they work, but the change can be quite small and easy. I hope my users will generally use my batch spooling system, which is quite simple to understand and use and which makes the whole power of our cluster easily accessible.

Special Cluster Management Tools

We have written yet another tcl/tk cluster monitor (tcl/tk is a great toy for things like this). Here is a small screenshot of the pre-alpha version.

Distributed filesystem

The total disk capacity of our cluster is currently 68.8 GB (not including two 'emergency' replacement disks). It could be doubled to 137.6 GB by also using slave disks on all IDE interfaces, and probably even more, because by the time of such an upgrade it will be feasible to use IDE disks bigger than 4.3 GB. (There is, of course, another upgrade path using SCSI and a RAID array, but the IDE solution is much cheaper.)

The simplest way to connect all the IDE disks (16 at the moment) into one common filesystem tree is to use NFS automounts, and this is how things are set up at the moment. Later we might add disk-use balancing daemons which move some data from disk to disk and change the links accordingly, to ensure that no single disk runs out of space while there is plenty of room on another.
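
As a rough sketch, an amd map gluing the node disks into one tree could look something like this (the host names and export paths are invented; the real map would be generated from the current disk configuration):

# /etc/amd.magi - sketch of an amd map, one entry per exported disk partition
/defaults       type:=nfs;opts:=rw,intr,soft
magi1-hda3      rhost:=magi1;rfs:=/disk/hda3
magi1-hda4      rhost:=magi1;rfs:=/disk/hda4
magi2-hda3      rhost:=magi2;rfs:=/disk/hda3
magi2-hda4      rhost:=magi2;rfs:=/disk/hda4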

Ultimately we would like to make our system safe against disk failure using a RAID-like technique. This is not easy to do well because of the limited network bandwidth separating the disks, but we have an idea which we will try to implement, unless somebody else does it first. A description of the idea follows:

Fault tolerance could be added by keeping two copies of every file; this needs twice the space.

Fault tolerance could also be added using N+1 parity (up to 15+1 for me), although computing the parity over the network could be slow. With 15+1 parity we would lose only one disk's worth of capacity (4.3 GB out of 68.8 GB), whereas mirroring would halve the usable space to 34.4 GB.

Maybe the two methods could be combined, using N+1 parity for most of the data (the "stable part") and mirror copies (or just single new copies) for the rest (the "volatile part"). New versions of files created during the day in the volatile part would then be moved to the stable part overnight. The stable part would be used read-only during the day, with any changes gathered in the volatile part, and so on.
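
A minimal sketch of the overnight step, assuming two hypothetical helper scripts (volatile2stable would fold the day's new files into the stable part, rebuild-parity would recompute the parity partition afterwards):

# crontab entry on the node coordinating the nightly consolidation:
# at 3:00 move yesterday's volatile files into the stable part,
# then recompute the N+1 parity over the stable partitions
0 3 * * * /usr/local/sbin/volatile2stable && /usr/local/sbin/rebuild-parity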

This scheme could ensure that when one disk fails, only one day's worth of volatile data is lost (if only one copy is maintained), or only quite recent volatile data (if we always make a copy on another disk, but the process creating the file does not wait for the second copy to finish).

All this could be relatively simple to do. The "stable part" would require just XORing N partitions into the (N+1)th one, and the same operation for eventual recovery after a disk failure. Then we need some mechanism for diverting file access from the stable part to the volatile part for files modified earlier during the day. This could be accomplished using a map of the filesystem consisting of directories full of symlinks to the real files, pointing either into the stable or the volatile part - thus we could "move" a file out of the stable part without modifying the stable part (and thus without a parity update). This could be done by a modified amd automounter (amd simulates an NFS mount and returns symlinks to the real files, which are NFS-mounted and unmounted in another hidden place as needed).
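
In shell terms, "moving" a file out of the stable part could look roughly like this (the map layout and all paths are hypothetical):

# the user sees /map/home/user/file, a symlink into the stable part;
# before the first write of the day we copy the file to the volatile part
# and repoint the symlink - the stable part itself is never written,
# so no parity update is needed
cp /stable/magi3-hda3/home/user/file /volatile/magi5-hda4/home/user/file
ln -sf /volatile/magi5-hda4/home/user/file /map/home/user/file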

The only hard part is knowing that a file (in the stable part) is going to be modified. We have to either create an exact volatile copy (if it is being opened "rw"), or maybe just an empty file (if it is being opened "w" and any old data would be deleted anyway). This could require some kernel modifications, maybe just spying on open(), seek() and a few other system calls (the call would block until we have done what is needed and can return a proper symlink from our amd-like redirector). We could also do all this inside the amd-like program (no kernel modification) - if we do NOT return a symlink but implement the rest of the NFS chores (which amd does not implement, because the rest of the file operations is diverted by the symlink).

Is anybody already working on (at least some part of) this?

If I start working on this, will anybody be interested in the resulting system? Would anybody like to collaborate?

Anyway, N+1 parity would be quite easy for read-only partitions - I will surely try this.
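
A sketch of what this could mean in practice, assuming a small helper (call it xorcat) which XORs its input streams byte by byte - no such standard tool exists, it would have to be written. The example also pretends all partitions are local; in reality each sits on a different node, so the data would flow over the network, which is exactly the bandwidth worry:

# build the parity partition from three read-only data partitions
xorcat /dev/hda3 /dev/hdc3 /dev/hde3 > /dev/hdg3
# after losing /dev/hdc3, rebuild it from the survivors plus the parity
xorcat /dev/hda3 /dev/hde3 /dev/hdg3 > /dev/hdc3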

We could also do quite a good job by simply replacing a few UNIX commands with special scripts - for example, 'mkdir' could look for the filesystem with the most free space, create the real new directory there, and put just a symlink in the original place. Maybe this simple trick could do nearly the whole job by itself! Of course, this is just a heuristic which might work well or not depending on user habits, but I think it is a good one - as long as your most disk-hungry users are using the good old 'mkdir' command.
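
A minimal sketch of such a replacement script, assuming the data filesystems are all mounted (or automounted) under /disks and ignoring error handling and name clashes:

#!/bin/sh
# pick the data filesystem with the most free space...
best=`df -P /disks/* | sed 1d | sort -n -k 4 | tail -1 | awk '{print $6}'`
# ...create the real directory there and leave only a symlink where asked
for dir in "$@"
do
        /bin/mkdir -p "$best/$dir" && ln -s "$best/$dir" "$dir"
done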

All these symlink tricks would probably require something like 'alias ls ls -L'.
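
For example (the same thing in csh and sh flavours):

# csh / tcsh users:
alias ls 'ls -L'
# sh / bash users:
alias ls='ls -L'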

Harddisk partitions and booting

Currently we have two harddisk slots in each node (one on the primary IDE interface and one on the secondary) and 4.3 GB harddisks in special shelves (all switched as master). Harddisks can be plugged in just about anywhere, as long as each machine has one on the primary IDE interface to boot from. The main part of the filesystem hierarchy is not affected by disk position; it is accessed through a symbolic link map (rebuilt on every configuration change) and then NFS (automounted using amd).

All harddisks have the same partitions:

Disk /dev/hda: 255 heads, 63 sectors, 527 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot   Begin    Start      End   Blocks   Id  System
/dev/hda1   *        1        1       47   377496   83  Linux native
/dev/hda2           48       48       63   128520   82  Linux swap
/dev/hda3           64       64      295  1863540   83  Linux native
/dev/hda4          296      296      527  1863540   83  Linux native

The first partition contains a nearly normal Linux distribution. The whole /dev/hda1 partition can also be stored as one big file in a filesystem made on /dev/hda3 or /dev/hda4. This way you can back up the system partition, perform the most dangerous experiments with new software, kernel upgrades, beta driver tests etc., and if the result does not satisfy you (maybe because your /dev/hda1 filesystem cannot be repaired :), you can simply copy back one of the previous versions.
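
For instance (the image file names and the /disk/hda3 mount point are invented; this is done from the system booted on /dev/hda4, with /dev/hda1 unmounted):

# save the current system partition into one big file on a data partition
cp /dev/hda1 /disk/hda3/images/hda1-before-experiment
# ...and if the experiment goes wrong, put the old contents back
cp /disk/hda3/images/hda1-before-experiment /dev/hda1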

To always have something reliable to boot from, there is another system (root disk) on /dev/hda4. This will eventually be replaced by the bare minimum needed to copy /dev/hda1 to/from the network, but for now I am quite happy with a copy of an older stable version of the /dev/hda1 filesystem (transferred using (cd from; tar cf - .)|(cd to; tar xvfp -), unlike the cp /dev/hda1 filename used to make the backup copies).

All this is completed by a double-LILO setup, which enables me to play even with LILO itself (on hda1) without worry. Look:

/etc/lilo.conf in filesystem on /dev/hda4:

boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
# chain to lilo in hda1
other=/dev/hda1
        label=chain-hda1
        table=/dev/hda
image=/boot/zImage
        label=1
        root=/dev/hda1
        read-only
image=/boot/zImage
        label=4
        root=/dev/hda4
        read-only
image=/boot/zImage
        label=c1
        root=/dev/hdc1
        read-only
image=/boot/zImage
        label=c4
        root=/dev/hdc4
        read-only

/etc/lilo.conf in filesystem on /dev/hda1:

boot=/dev/hda1
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
image=/boot/zImage
        label=a1
        root=/dev/hda1
        read-only
image=/boot/zImage
        label=a4
        root=/dev/hda4
        read-only
image=/boot/zImage
        label=c1
        root=/dev/hdc1
        read-only
image=/boot/zImage
        label=c4
        root=/dev/hdc4
        read-only

/sbin/lilo is always run while the corresponding partition (/dev/hda1 or /dev/hda4) is mounted as the root partition, so the kernel used during boot resides in the same partition as the corresponding lilo.conf. The two main boot alternatives are the normal system on /dev/hda1 and the fallback system on /dev/hda4; just in case, there are also entries for booting with the root filesystem on the second disk (/dev/hdc1 or /dev/hdc4).