Friday, February 19, 2010

Hierarchical job distribution support in Taskfs

This is a quick post to report the completion of hierarchical task distribution in taskfs. Until now, taskfs worked by distributing tasks to the devcmd2 of remote nodes, and those remote nodes executed the jobs; this is a flat hierarchy of one level. In the hierarchical system, the client taskfs mounts remote taskfs instances, which in turn mount the next level of taskfs, and so on...

To implement this, local task execution support was added to taskfs by merging it with devcmd2. Now every taskfs is capable of behaving as a leaf node, an internal node, or the root node. By using these taskfs instances as nodes and mount-links as edges, one can create a directed graph.
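
For example, an intermediate node might be assembled with the same kind of commands shown in the many-to-many post below; the addresses, ports, and mount points here are illustrative only.

mount -A tcp!leaf1-host!6666 /remote/leaf1
mount -A tcp!leaf2-host!6666 /remote/leaf2
styxlisten -A tcp!*!5555 export /task

Each taskfs mounted under /remote becomes a child of this node, and exporting /task makes this node available as a child to the level above it.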

The task given to the root taskfs will be distributed equally among its immediate children. These children will in turn distribute their share of the load among their own immediate children. Input is similarly propagated from root to leaf nodes through the tree, and output is aggregated at each internal node before being passed to the parent.
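
The post does not spell out the rounding rule for uneven splits, so the small C sketch below is an assumption; it does reproduce the 4-task, 2-child split visible in the transcript below (two outputs from each child).

#include <stdio.h>

/* Guess at the even split: every child gets n/k tasks and the first
 * n%k children get one extra.  Illustrative only. */
static int
share(int n, int k, int child)
{
    return n / k + (child < n % k ? 1 : 0);
}

int
main(void)
{
    int i;

    for (i = 0; i < 2; i++)
        printf("child %d gets %d task(s)\n", i, share(4, 2, i));
    return 0;
}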

This directed graph is created automatically based on the availability of mounted remote taskfs filesystems. A taskfs that does not have any other taskfs mounted assumes it is a leaf node. One can also force local execution of a task by sending the res 0 command, or by directly sending the exec cmd without making a reservation. Taskfs will execute a job locally if there was no prior reservation command, or if the prior reservation requested zero remote resources.
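
For example, writing these two control messages to a held clone/ctl descriptor (see the clone file semantics post below) forces pwd to run locally instead of remotely:

res 0
exec pwd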

Revision 103b7aa87c is the first revision providing this support. One can use the same methodology described in the previous blog post to execute jobs. The only visible difference is that when resources are reserved, not all of them may appear in the session directory; only the immediate children of this taskfs will appear there.


Now, let's look at a live example of this new taskfs and see what the tree looks like. The following is a similar execution to the one described in the previous post, so the steps are not repeated here. This tree was created for a request of 4 tasks.

$ 9pfuse localhost:5555 mpoint
$ tree mpoint
mpoint
`-- remote
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

6 directories, 1 file
$ ./cloneUserPause mpoint/remote/clone 4 "pwd"
[[DEBUG]] command is [pwd] and input file [(null)]
[[DEBUG]] clone returned [0]
[[DEBUG]] path to ctl file is [mpoint/remote//0/ctl]
[[DEBUG]] Writing command [res 4]
[[DEBUG]] Writing command [exec pwd]
[[DEBUG]] input file path [mpoint/remote//0/stdio]
[[DEBUG]] opening [mpoint/remote//0/stdio] for reading output
/home/pravin/inferno/layer3/hg
/home/pravin/inferno/layer3/hg
/home/pravin/inferno/pravin2/hg
/home/pravin/inferno/pravin2/hg
$ tree mpoint
mpoint
`-- remote
|-- 0
| |-- 0
| | |-- 0
| | | |-- 0
| | | | |-- args
| | | | |-- ctl
| | | | |-- env
| | | | |-- ns
| | | | |-- status
| | | | |-- stderr
| | | | |-- stdin
| | | | |-- stdio
| | | | |-- stdout
| | | | `-- wait
| | | |-- 1
| | | | |-- args
| | | | |-- ctl
| | | | |-- env
| | | | |-- ns
| | | | |-- status
| | | | |-- stderr
| | | | |-- stdin
| | | | |-- stdio
| | | | |-- stdout
| | | | `-- wait
| | | |-- args
| | | |-- ctl
| | | |-- env
| | | |-- ns
| | | |-- status
| | | |-- stderr
| | | |-- stdin
| | | |-- stdio
| | | |-- stdout
| | | `-- wait
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- 1
| | |-- 0
| | | |-- args
| | | |-- ctl
| | | |-- env
| | | |-- ns
| | | |-- status
| | | |-- stderr
| | | |-- stdin
| | | |-- stdio
| | | |-- stdout
| | | `-- wait
| | |-- 1
| | | |-- args
| | | |-- ctl
| | | |-- env
| | | |-- ns
| | | |-- status
| | | |-- stderr
| | | |-- stdin
| | | |-- stdio
| | | |-- stdout
| | | `-- wait
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

14 directories, 81 files


The root taskfs had two children, one at /home/pravin/inferno/pravin/inferno-rtasks and the other at /home/pravin/inferno/pravin2/hg. The first child had its own child running at /home/pravin/inferno/layer3/hg. As you can see, the actual execution was done only by the leaf taskfs; the internal taskfs instances only delegated the work and aggregated the output. The tree above helps visualize this delegation of work.


Note: the tree shown above is available only while the user is still holding the root clone file open. As soon as this file is closed, the entire tree representing the remote resources is released. The following is the tree after the ./cloneUserPause program terminates.

$ tree mpoint
mpoint
`-- remote
|-- 0
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

7 directories, 11 files
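
To experiment with this interactively without a helper program, a shell that can hold a descriptor open works too. Here is a hedged bash sketch; it assumes the mount point above and that the session name can be read with read.

$ exec 3<> mpoint/remote/clone   # open clone read-write, keep fd 3 open
$ read -u 3 session              # read the session name; fd 3 now acts as ctl
$ echo res 4 >&3                 # reserve 4 remote resources
$ echo exec pwd >&3              # run the command
$ cat mpoint/remote/$session/stdio
$ exec 3>&-                      # closing fd 3 releases the whole tree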

Tuesday, February 16, 2010

Supporting clone file semantics from Taskfs

This is another blog post describing something that is already implemented but not documented properly, so this post is dedicated to explaining the taskfs support for clone file semantics. The revision referred to in this post, 6feb9d0431, contains many bugfixes and optimizations, but those are not the main concern here. This revision is also tagged as v0.0.1 and is hopefully stable.

Let's start with what clone file semantics means. When you read from the clone file, it returns the name of the resource allocated for the current session, and it converts the clone file into the ctl file of that session's resource. After this read, you are supposed to treat the clone file channel/descriptor as the ctl file channel/descriptor. Whenever this descriptor is closed, taskfs assumes that the session is no longer needed by the user: it frees the remote resources and also releases the taskfs session for future session requests. The main reason for this behavior is that taskfs can easily reclaim resources and do garbage collection once the clone file descriptor is closed.

This semantic of releasing resources when the clone and ctl files are closed is a little unnatural for POSIX and command-prompt users. It also means that you cannot use input/output redirections or cat on these files from a shell, because the shell automatically closes the files once the command completes. For example, consider the following commands.

$ cd mpoint/remote
$ cat clone
0
$ cat clone
0
$

Here, we requested a new taskfs session two times, and it returned the same session both times. This may look erroneous to POSIX users, but you should realize that cat clone reads and shows the resource reserved by taskfs and then closes the clone file. This close triggers the release of the taskfs session resource (in this case resource 0). The released resource is then allocated to the next user who requests it by opening and reading from clone.

Now let's see one more example, this time writing to the ctl file of a taskfs session resource.

$ cat clone
0
$ cd 0
$ echo res 4 > ctl

This command writes res 4 to the ctl file and closes the file afterwards. As a result, taskfs allocates 4 remote resources and then releases them upon receiving the close call. This is not the desired behavior for command-line users. The take-home message here is: don't use input redirections or cat on the clone and ctl files.


The proper way (i.e., the clone filesystem way) to use this filesystem is:

  1. Open the clone file.

  2. Read the resource name from the clone file.

  3. Write res n into the same clone file descriptor, where n is the number of resources needed.

  4. Write exec cmd into the same clone file descriptor, where cmd is the command to be executed.

  5. Open the stdio file in write mode, write the input data into this file descriptor, and then close the descriptor once all input is written.

  6. Open the stdio file in read mode, read the data from this file descriptor, and then close the descriptor once all data is read. This is the standard output of your command.

  7. Open the stderr file in read mode, read the data from this file descriptor, and then close the descriptor once all data is read. This is the standard error of your command.

  8. Close the clone file descriptor. This will automatically release all the resources allocated for this taskfs session, including the remote resources reserved by res n.


Here is the small C program cloneUser.c that I wrote to do all of the above steps.
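
The full listing is not reproduced here, but a minimal sketch of the same protocol looks roughly like the following. This is not the actual cloneUser.c: names, buffer sizes, and the path handling are assumptions, step 5 (writing input) and step 7 (reading stderr) are skipped, and most error handling is omitted.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    char dir[256], path[512], buf[8192], session[64], *p;
    int clonefd, iofd, n;

    if (argc != 4) {
        fprintf(stderr, "usage: %s clonefile nres cmd\n", argv[0]);
        return 1;
    }

    /* Steps 1-2: open clone and read the allocated session name. */
    if ((clonefd = open(argv[1], O_RDWR)) < 0)
        return 1;
    if ((n = read(clonefd, session, sizeof session - 1)) <= 0)
        return 1;
    session[n] = '\0';

    /* Steps 3-4: the same descriptor now acts as the session's ctl
     * file, so write the reservation and the command into it. */
    snprintf(buf, sizeof buf, "res %s", argv[2]);
    write(clonefd, buf, strlen(buf));
    snprintf(buf, sizeof buf, "exec %s", argv[3]);
    write(clonefd, buf, strlen(buf));

    /* Step 6: derive .../<session>/stdio from the clone path and copy
     * the aggregated standard output to our own stdout. */
    strncpy(dir, argv[1], sizeof dir - 1);
    dir[sizeof dir - 1] = '\0';
    if ((p = strrchr(dir, '/')) != NULL)
        *p = '\0';
    snprintf(path, sizeof path, "%s/%s/stdio", dir, session);
    if ((iofd = open(path, O_RDONLY)) >= 0) {
        while ((n = read(iofd, buf, sizeof buf)) > 0)
            write(1, buf, n);
        close(iofd);
    }

    /* Step 8: closing the clone/ctl descriptor releases the session
     * and all remote resources reserved by res n. */
    close(clonefd);
    return 0;
}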

Now, let's see an example run showing how it can be used.

$ 9pfuse localhost:5555 mpoint
$
$ tree mpoint
mpoint
`-- remote
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

6 directories, 1 file
$
$ ./cloneUser mpoint/remote/clone 4 "hostname"
[[DEBUG]] command is [hostname] and input file [(null)]
[[DEBUG]] clone returned [0]
[[DEBUG]] path to ctl file is [mpoint/remote//0/ctl]
[[DEBUG]] Writing command [res 4]
[[DEBUG]] Writing command [exec hostname]
[[DEBUG]] input file path [mpoint/remote//0/stdio]
[[DEBUG]] opening [mpoint/remote//0/stdio] for reading output
BlackPearl
inferno-test
inferno-test
BlackPearl
$
$ tree mpoint
mpoint
`-- remote
|-- 0
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

7 directories, 11 files
$
$ ./cloneUser mpoint/remote/clone 4 "wc -l" cloneUser.c
[[DEBUG]] command is [wc -l] and input file [cloneUser.c]
[[DEBUG]] clone returned [0]
[[DEBUG]] path to ctl file is [mpoint/remote//0/ctl]
[[DEBUG]] Writing command [res 4]
[[DEBUG]] Writing command [exec wc -l]
[[DEBUG]] input file path [mpoint/remote//0/stdio]
[[DEBUG]] opening [mpoint/remote//0/stdio] for reading output
192
192
192
192
$
$ tree mpoint
mpoint
`-- remote
|-- 0
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

7 directories, 11 files
$


In the above test run, cloneUser.c is the program mentioned above. In the first execution of the cloneUser program, 4 remote nodes are used to execute the hostname command. You can observe that after cloneUser terminates, all remote resources are released, which is why they are not visible in the directory structure. The second invocation demonstrates the use of an input file with cloneUser: it sends the cloneUser.c file itself as input to the wc -l program, which returns the number of lines in the code. All four remote resources report the correct line count, showing the expected behavior. It can also be observed that the second execution of cloneUser reuses the same taskfs session 0, which was released after the first execution completed. This test run is therefore a good example of how resource reclamation works in taskfs.

Many to many support for Taskfs

A post on this functionality had been pending for some time; I am finally posting it today. Hopefully it will create the base for understanding the next piece of functionality, on which work has already started.

Let's start with what many-to-many means in this context. In one-to-many taskfs, once a task is divided, the sub-tasks are not aware of the presence of the other sub-tasks, so they cannot communicate with each other. Many-to-many taskfs provides a way for sub-tasks to communicate with each other. This is achieved by binding all remote resources into the directory of the current taskfs session. Each sub-task runs in one of these pseudo subdirectories (which actually are compute resources on remote nodes). In this setup, every sub-task can assume that the other subdirectories are the other sub-tasks, and it can communicate directly with these neighboring subdirectories without asking the parent for the actual locations of these resources.
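
As a purely hypothetical illustration (this post does not show the exact namespace a sub-task sees), a sub-task working inside its own directory could feed data to its sibling straight through the bound tree:

$ echo hello > ../1/stdin    # from inside sub-task 0's directory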

Now, let's see how exactly it is implemented. The trick used here is that whenever a taskfs session allocates a new remote resource, it binds the new resource directory as a subdirectory of the current session directory. This also simplifies read and write aggregation: a read or write on a file in the taskfs session directory is simply passed to the corresponding file in every subdirectory. Taskfs does not need to remember where exactly the remote resources are, because this information is encoded automatically by the bindings between the subdirectories and the remote resource directories.
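
Conceptually, the write side of this aggregation is just a fan-out loop over the bound subdirectories. The real taskfs is not written in C; the sketch below only illustrates the idea, with subfd[] standing for descriptors of the corresponding file in each bound subdirectory.

#include <unistd.h>

/* Forward one write on a session file to the same file in every
 * bound resource subdirectory.  Illustration only. */
long
session_write(int nsub, int subfd[], char *buf, long len)
{
    int i;

    for (i = 0; i < nsub; i++)
        if (write(subfd[i], buf, len) != len)
            return -1;    /* report the failure to the client */
    return len;
}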

Now, let's see an example. The following test run assumes that there are remote deployments of Inferno exporting devcmd2, and that the client Inferno instance has mounted those exported remote taskfs under /remote/. This client Inferno is also assumed to export its taskfs for others to use. The following commands were executed on the client Inferno host.

styxlisten -A tcp!*!5555 export /task
mount -A tcp!9.3.61.180!6666 /remote/host1
mount -A tcp!9.3.61.180!6667 /remote/host2
mount -A tcp!127.0.0.1!6666 /remote/host3


Now, the following was executed in a Linux shell after mounting the taskfs exported by the Inferno client above. $ represents the Linux shell prompt.

$ 9pfuse localhost:5555 mpoint
$ tree mpoint
mpoint
`-- remote
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]
$ cd mpoint/remote
$ cat clone
0
$ cd 0
$ tree ../
../
|-- 0
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

6 directories, 11 files

$ echo res 4 > ctl
$ tree ../
../
|-- 0
| |-- 0
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- 1
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- 2
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- 3
| | |-- args
| | |-- ctl
| | |-- env
| | |-- ns
| | |-- status
| | |-- stderr
| | |-- stdin
| | |-- stdio
| | |-- stdout
| | `-- wait
| |-- args
| |-- ctl
| |-- env
| |-- ns
| |-- status
| |-- stderr
| |-- stdin
| |-- stdio
| |-- stdout
| `-- wait
|-- arch [error opening dir]
|-- clone
|-- env [error opening dir]
|-- fs [error opening dir]
|-- ns [error opening dir]
`-- status [error opening dir]

10 directories, 51 files
$ cd 3
$ pwd
/home/pravin/projects/inferno/mpoint/remote/0/3
$ ls -l
total 0
--w--w--w- 1 pravin pravin 0 2010-01-14 17:12 args
-rw-rw---- 1 pravin pravin 0 2010-01-14 17:12 ctl
--w--w--w- 1 pravin pravin 0 2010-01-14 17:12 env
-rw-rw---- 1 pravin pravin 0 2010-01-14 17:12 ns
-r--r--r-- 1 pravin pravin 0 2010-01-14 17:12 status
-r--r--r-- 1 pravin pravin 0 2010-01-14 17:12 stderr
--w--w--w- 1 pravin pravin 0 2010-01-14 17:12 stdin
-rw-rw---- 1 pravin pravin 0 2010-01-14 17:12 stdio
-r--r--r-- 1 pravin pravin 0 2010-01-14 17:12 stdout
-r--r--r-- 1 pravin pravin 0 2010-01-14 17:12 wait
$ cat status
cmd/1 1 Closed /home/pravin/projects/inferno/ericvh//ericvh-brasil/ ''
$ cd ..
$ pwd
/home/pravin/projects/inferno/mpoint/remote/0
$ cat status
cmd/0 1 Closed /home/pravin/projects/inferno/ericvh//ericvh-brasil/ ''
cmd/0 1 Closed /home/pravin/inferno/host2//ericvh-brasil/ ''
cmd/0 1 Closed /home/pravin/inferno/ericvh//ericvh-brasil/ ''
cmd/1 1 Closed /home/pravin/projects/inferno/ericvh//ericvh-brasil/ ''
$ echo exec hostname > ctl
$ cat stdio
BlackPearl
inferno-test
inferno-test
BlackPearl
$ cat stderr
$ cat status
cmd/0 5 Done /home/pravin/projects/inferno/ericvh//ericvh-brasil/ hostname
cmd/0 5 Done /home/pravin/inferno/host2//ericvh-brasil/ hostname
cmd/0 5 Done /home/pravin/inferno/ericvh//ericvh-brasil/ hostname
cmd/1 5 Done /home/pravin/projects/inferno/ericvh//ericvh-brasil/ hostname
$ cd ../../..
$ pwd
/home/pravin/projects/inferno
$ fusermount -u mpoint
$ ls mpoint
$


Now, let's see what exactly happened above.

  1. The taskfs was mounted with 9pfuse.

  2. The directory structure of the initial taskfs was shown with the tree command.

  3. Changed the directory inside taskfs with cd mpoint/remote.

  4. A taskfs session was created with the cat clone command.

  5. Changed the directory into the taskfs session, in this case with cd 0.

  6. The directory structure of taskfs after session creation was examined. You can see that the session directory has no subdirectories, which indicates that no remote resources are allocated.

  7. Remote resources were allocated with the echo res 4 > ctl command.

  8. The directory structure of taskfs after resource allocation was examined. You can see that there are four subdirectories, one for each remote resource.

  9. Changed the directory into one of the subdirectories representing a remote resource and operated on it independently. Checked the status of that resource using cat status.

  10. Checked the status of the taskfs session by running cat status in the session directory. You can see that it shows the status of all remote resources.

  11. Executed a command on the taskfs session with echo exec hostname > ctl.

  12. Checked the output, stderr, and status after execution. You can see that there are two different hostnames in those 4 outputs.

  13. Came out of the mounted directory, and unmounted it.



This functionality of taskfs is available in revision d125d391c1.