Friday, March 5, 2010

Aggregated status file and selective reservations

There have been a couple of important improvements in the working of taskfs. First, a working status file has been added to the top directory of taskfs. Do not confuse the session-level status file with this one: the session-level status file reports the status of only that session, whereas this file reports the status of all available resources. The file is readable and reads are non-destructive, which means it can be read multiple times. It returns the number of compute nodes available for each OS and architecture; its content is a list of tuples, each containing three values: OS, architecture, and node count. The file is aggregated, meaning the node counts include all descendants. It gives an idea of the amount and type of resources available to you.
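
As a quick illustration, here is a minimal sketch of parsing that file into (OS, architecture, node count) tuples. It assumes taskfs is already mounted at ./mpoint, as in the demo below, and the helper name is just illustrative:

# A minimal sketch: parse the aggregated status file into
# (OS, architecture, node count) tuples.  Assumes taskfs is
# mounted at ./mpoint, as in the demo below.
def read_resources(path="./mpoint/remote/status"):
    resources = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:
                osname, arch, count = fields
                resources.append((osname, arch, int(count)))
    return resources

print(read_resources())    # e.g. [('Linux', '386', 4)]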

The next enhancement is selective reservation, which uses the information available from the status file. Resource reservations can now specify what type of resources are needed for the jobs, and reservations will be made only on resources that have the requested OS and architecture. If no OS or architecture is given, or if a wildcard is given instead, then any available resource will be used. The new format of the res command, as used in the demo below, is: res <nodecount> <OS> <arch>.
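
A minimal sketch of issuing such a reservation from Python, assuming the mount point and the res message format used in the demo below (the reserve() name is illustrative, and the exact wildcard syntax is not shown here):

# A minimal sketch: open a session via the clone file and ask for
# n nodes of a given OS and architecture.  Mount point, helper
# name and lack of error handling are illustrative assumptions.
def reserve(n, osname, arch, mpoint="./mpoint"):
    ctl = open(mpoint + "/remote/clone", "r+", 0)
    session = ctl.read().strip()            # session id, e.g. "0"
    ctl.write("res %d %s %s" % (n, osname, arch))
    return ctl, session

ctl, session = reserve(5, "Linux", "386")
# Keep ctl open while jobs run; closing it releases the reservation.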

Now, let's try a demo. Python is used below to provide interactive job management with taskfs.

$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, os
>>> os.system ("9pfuse localhost:5555 ./mpoint")
0
>>> os.system ("tree ./mpoint")
./mpoint
`-- remote
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status

5 directories, 2 files
0
>>> topStatusFile = open ( "./mpoint/remote/status", "r", 0)
>>> print topStatusFile.read()
Linux 386 4

>>> topStatusFile.close()

Up to this point, we have mounted taskfs and read the top-level status file. It shows that there are 4 nodes, each running Linux on the 386 architecture. Now, let's make a reservation and run a demo job.

>>> ctlFile = open ( "./mpoint/remote/clone", "r+", 0)
>>> print ctlFile.read()
0
>>> os.system ("tree ./mpoint")
./mpoint
`-- remote
|-- 0
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status

6 directories, 12 files
0
>>> sFile = open ("./mpoint/remote/0/status", "r", 0)
>>> print sFile.read()
cmd/0 2 Unreserved /home/pravin/projects/inferno/pravin//inferno-rtasks/ ''

>>> sFile.close()
>>> ctlFile.write ("res 5 Linux 386")
>>> sFile = open ("./mpoint/remote/0/status", "r", 0)
>>> print sFile.read()
cmd/1 2 Closed /home/pravin/inferno/layer3//hg/ ''
cmd/2 2 Closed /home/pravin/inferno/layer3//hg/ ''
cmd/3 2 Closed /home/pravin/inferno/layer3//hg/ ''
cmd/1 2 Closed /home/pravin/inferno/pravin2//hg/ ''
cmd/2 2 Closed /home/pravin/inferno/pravin2//hg/ ''

>>> sFile.close()
>>> ctlFile.write ("exec hostname")
>>> ioFile = open ("./mpoint/remote/0/stdio", "r", 0)
>>> print ioFile.read()
inferno-test
inferno-test
inferno-test
inferno-test
inferno-test

>>> ioFile.close()
>>> sFile = open ("./mpoint/remote/0/status", "r", 0)
>>> print sFile.read()
cmd/1 2 Done /home/pravin/inferno/layer3//hg/ hostname
cmd/2 2 Done /home/pravin/inferno/layer3//hg/ hostname
cmd/3 2 Done /home/pravin/inferno/layer3//hg/ hostname
cmd/1 2 Done /home/pravin/inferno/pravin2//hg/ hostname
cmd/2 2 Done /home/pravin/inferno/pravin2//hg/ hostname

>>> sFile.close()
>>> topStatusFile = open ("./mpoint/remote/status", "r", 0)
>>> print topStatusFile.read()
Linux 386 4


The following demonstrates how resource reclamation works once the clone file is closed (a small scripted version of the same clean-up pattern is sketched after the transcript).

>>> os.system ("tree -d ./mpoint")
./mpoint
`-- remote
|-- 0
| |-- 0
| | `-- 0
| | |-- 0
| | |-- 1
| | `-- 2
| `-- 1
| |-- 0
| `-- 1
|-- arch [error opening dir]
|-- env [error opening dir]
|-- fs [error opening dir]
`-- ns [error opening dir]

14 directories
0
>>> ctlFile.close ()
>>> os.system ("tree -d ./mpoint")
./mpoint
`-- remote
|-- 0
|-- arch [error opening dir]
|-- env [error opening dir]
|-- fs [error opening dir]
`-- ns [error opening dir]

6 directories
0
>>>
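
Since closing the clone/ctl file is what triggers reclamation, a script can rely on try/finally to guarantee the clean-up. A minimal sketch, reusing the hypothetical reserve() helper from earlier:

# A minimal sketch: make sure the clone/ctl file is always closed,
# so the reserved nodes are reclaimed even if job submission fails.
ctl, session = reserve(5, "Linux", "386")
try:
    ctl.write("exec hostname")
    out = open("./mpoint/remote/%s/stdio" % session, "r", 0)
    print(out.read())
    out.close()
finally:
    ctl.close()    # closing the clone/ctl file reclaims the nodes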


Unfortunately, I do not have a testbed with different OSes and architectures running, so selective reservation is of little practical use in this setup. But it is quite handy when one is dealing with a diverse mix of compute nodes.

From code revision ae33f01fea onwards, selective reservation works.
