(osum tucnaku)


Quick Hack Summary:

SGE is a nice opensource batch spooling system.

BProc is a nice way to install system on head node only and use slave nodes as additional processors.

It is not (yet) common to use these two nice things together but it is possible. We do it, you can do it. It is simple:

OK, now the whole truth: what is really going on here, what are the pros and cons, why nearly nobody does this.

What Happens Inside:

When you start using your cluster, ps -auxf on head node gives something like this:

\_ sge_shepherd-6889 -bg
|   \_ /bin/sh sge-bproc-starter-method job_scripts/6889
|       \_ bpsh 3 job_scripts/6889
|           \_ [6889]
|               \_ [te]
|                   \_ [sh]
|                       \_ [HERest]
\_ sge_shepherd-6888 -bg
|   \_ /bin/sh sge-bproc-starter-method job_scripts/6888
|       \_ bpsh 8 job_scripts/6888
|           \_ [6888]
|               \_ [Ser]
|                   \_ [sh]
|                       \_ [HVite]
For the SGE part, there is just one sge_execd. SGE thinks that your cluster is just one machine (you install just one execution host, your master host). There are however multiple 'queues' (I would rather call them execution slots) and every process executing in the queue gets its own sge_shepherd. Jobs to be executed are stored in job_scripts directory as text files (6889, 6888 etc.). Usually sge_shepherd executes these scripts (via /bin/sh) but we defined starter method and therefore we have additional level of sge-bproc-starter-method. This starter method uses bpsh for remote execution and moves the real work where it belongs - on slave nodes. It happened to be slaves number 3 and 8 and these numbers also serve as queue names, hence simple bpsh $SGE_QUEUE.

For the BProc part - all the [xxx] things run on slave nodes. In fact they are shown as kernel threads (hence the brackets) because every remote process has local kernel thread responsible for signal delivery. First level of remote processes ([6889], [6888]) was moved to slave nodes by bpsh using special process migration technique. Additional levels come into being the usual UNIX way via fork(2) and exec(2) - the job scripts are simply executing on slaves and involve heavy-duty C-programs HERest and HVite doing the real work. The BProc mimics any remote fork(2) and exec(2) and builds corresponding hierarchy of kernel threads on the local master node. Thus you can see the remote processes in the output of local ps -auxf, enjoy them in the unbelievable output of local top and you can even kill them via local kill.

What Are The Problems:

If you know SGE or BProc you probably see some related problems with the above setup. Both SGE and BProc were forced to do something unusual and they can strike back. But we can live with that as long as we know the limits and are lucky enough to have work which fits.

On the SGE side, you get wrong information from built-in load sensors - they are in sge_execd which is supposed to execute where the real work happens - and our real work (HERest and HVite) happens elsewhere. But you can have quite good setup with load sensors disabled (or you can use your own load sensors).

On the BProc side, it is unusual to have all executables, libraries and configuration files available on the BProc slave nodes. Usually BProc clusters run heavy MPI programs and bpsh just migrates their compiled executables doing the heavy work - exec(2) on slave nodes is rare. Usual slave node has no executables, just the minimal set of libraries and nearly no configuration files. Newer BProc versions have additional tricky mechanism which helps to execute first level of scripts (as long as you use absolute pathnames for executables which are then delivered from the master node). In general most scripts unaware of the special situation will fail when run on slave nodes with the default setup. To be able to run naive scripts, you will want to reproduce most of the master filesystem on slave nodes. This can be done using NFS and symlinks.

How To Install And Setup SGE:

How To Install And Setup BProc:

Dynamic Configuration of Available Nodes:

This is untested, please let me know if you try it.

It should be easy to do dynamic configuration of SGE when nodes become available/unavailable. With my approach where node looks like a queue on the master node one just have to call

  qmod -e $N

to enable the queue when node N becomas available, so this command should probably go to the end of the BProc's node_up script (where N=$1), and

  qmod -d $N

to disable the queue when node becomes unavailable. I am not sure there is anything like node_down script in BProc (I thought there is but I do not see it in my cluster just now); if it is, it shoud start with "qmod -d $N". We could also test node's sanity in SGE's prolog and epilog scripts (run before and after the job) and call "qmod -d $N" there when needed. (Epilog script could even re-schedule the job when node died while running the job, if the job is re-runnable.)

Another simple approach is to run script doing "bpstat" and then "qmod -d ..." every 30 seconds or so (on the master).

If all the jobs are written as re-runnable (can be aborted at any moment and run again on a different node, this usually means that the job does not change any of its input files), it should be easy to create a node-fault-tolerant system.

Limits Of This Hack:

No support for MPI/PVM programs. This hack is for bag of jobs to be executed in any order (all at once, or one after another), or with dependencies (hold_jid). Invent your own hack for MPI on SGE/BProc. I did not think about it at all. Maybe it is quite easy - there is MPI on SGE, MPI on BProc and immense number of options how to glue it together.

It is tested on small cluster only. There might be some issues on big clusters. I guess SGE would scale quite high, the only minor problem is cloning of queues - you would like to use command line tools instead of qmon. (Sure this is possible - anybody knows how?)

I think NFS scalability is rather questionable. I would be more than happy to replace NFS with something else, best of all with something doing persistent caching on local disks (where they are available). But nothing else works for me (yet). Here is a list of failed attempts:

UPDATE 22nd Oct 2004: CacheFS for Linux NFS! Great!

- 2.6.9-rc4-mm1 patch that will enable NFS (even NFS4) to do persistent file caching on the local harddisk

- older message explaining what is going on

- about ways to get this to the mainline kernel

- list archives and subscription page

There is no support for control terminal stuff on slave nodes and probably will not be anywhere soon (see BProc maillist archives for attempts this way). This stuff is one of the few little known dark corners of UNIX - it is related to kernel calls like setsid(2) and to delivery of signals to groups of processes. Most people do not understand it nor do they want to. In userland this omission translates to non-working Ctrl-C and spurious error messages from some interactive programs. Some of programs can be really upset and give up.

Better Ways Of SGE+BProc Integration:

Sure they exist. SGE works on machines with many processors, so the related logic is already somewhere inside (I know no details, please comment if you want). BProc has its own simple batch spooling (bjs) so there is a place to look for additional ideas.

I think BProc systems are important enough and SGE team is nice and responsive enough for this integration to happen if we can demonstrate interest and propose sensible ways of improvements.

This page was created and this WWW server is maintained by Vaclav Hanzl