The following is a quasi-verbatim transcript of an email sent by BenClifford to BruceBecker on 21 April 2009, reporting the first experience of using Swift with the gLite WMS. It is meant for reference only and has been slightly modified from the original.

So far I've built a really shitty prototype for executing on the WMS. Swift did not easily support that because of its historical site model, which does not look like WMS (see the notes at the end of this message). The prototype loses a lot of Swift features, also detailed in those notes. Some are easy to restore and are missing only because the prototype was put together quickly. Some are harder to restore. I have run a few thousand jobs of a 3000-job test workflow (3000 executions of /bin/touch), but have run into issues with virt.sagrid.

One, detailed in ticket #49, is that I cannot submit jobs at all (as of my last attempt, around the time ticket #49 was opened). This prevents Swift from being used at all. Another issue is the high latency of job execution. This does not prevent Swift workflows from running, but it does slow things down significantly. I have previously run more complex but smaller Swift workflows (from the Swift test suite) through this prototype successfully.

The use of gLite, whilst presenting problems for Swift because it's not something we've done before, is a good second choice after OSG-style submission, as we have other users who are interested in submitting to gLite. So a benefit is that this work goes towards both SAGrid support for Swift and EGEE support for Swift.

From a Swift code perspective, I'd like to tidy the prototype up so that it is safe to give to users, and over time integrate some of its changes more closely with the main body of Swift. Feedback from users is useful there (both in steering how I develop the code, and in indicating how much of my time should go towards this project rather than the myriad other Swift-related projects I have). That's an issue of my personal time, but if people remain interested, then it will happen rapidly.

From a production perspective, I would like to run this on real SAGrid sites with a real SAGrid application, perhaps in conjunction with something on the UJ cluster submitted directly. The aim is to get some real science runs out of this, showing hopefully that (UJ PBS + SAGrid) is better than UJ PBS alone, that using multiple SAGrid sites is not much harder than doing local single-CPU runs, and that using Swift is a really easy way to do all of this. What is stopping this is:

  1. lack of a real, working SAGrid WMS;
  2. lack of an application that someone here wants to play with and that is in a suitable state for experimentation.

From the perspective of the Swift group, it's nice to have papers and the like which mention Swift in a favourable light. We have several papers that are write-ups of a particular application someone has made and that discuss how it was implemented in Swift. But even a one-paragraph mention that Swift was used, in a paper talking primarily about other aspects of an application, is pleasing, as it gives us another reference for our website and funding proposals.

It's not clear to me what, if any, publications you intend to make specifically about SAGrid. Swift may be able to fit into those in some other way.

In the traditional Swift model, a site has a shared file system, accessible from the worker nodes as an ordinary filesystem and accessible from where you are submitting your Swift jobs via something like GridFTP. Swift assumes that it can access that shared file system to manage a site-local cache, and that it can make its own site selections.

In the WMS prototype, I now make an input sandbox and an output sandbox and have the WMS move those in and out for me. Swift makes no attempt to manage a site-local data cache, and if this prototype model is maintained, I don't see an easy way for Swift to maintain such a cache.
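
As an illustration of the sandbox model described above, the following Python sketch writes a gLite-style JDL job description in which every staged file is listed explicitly in `InputSandbox` or `OutputSandbox`. It is only a sketch under stated assumptions: the helper name `make_jdl`, the wrapper script and the file names are hypothetical, and this is not the job description that the prototype actually generates.

```python
def make_jdl(executable, arguments, input_files, output_files):
    """Return a JDL job description that stages files via WMS sandboxes.

    Hypothetical helper for illustration only; not taken from the prototype.
    """
    def jdl_list(names):
        # JDL writes lists of strings as {"a", "b"}
        return "{" + ", ".join('"%s"' % n for n in names) + "}"

    # Standard output and error come back through the output sandbox too.
    outputs = list(output_files) + ["stdout.txt", "stderr.txt"]
    lines = [
        "[",
        '  Executable = "%s";' % executable,
        '  Arguments = "%s";' % arguments,
        '  StdOutput = "stdout.txt";',
        '  StdError = "stderr.txt";',
    ]
    if input_files:
        # Files the WMS copies to the worker node before the job starts.
        lines.append("  InputSandbox = %s;" % jdl_list(input_files))
    # Files the WMS makes available for retrieval after the job ends.
    lines.append("  OutputSandbox = %s;" % jdl_list(outputs))
    lines.append("]")
    return "\n".join(lines)


# Hypothetical job: a wrapper script plus one input file, one result file.
print(make_jdl("wrapper.sh", "input.dat result.dat",
               ["wrapper.sh", "input.dat"], ["result.dat"]))
```

Because everything the job needs travels in the sandboxes, nothing persists on the site between jobs, which is why a site-local cache has no obvious place in this model.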

The prototype also loses other useful Swift features:

  1. replication - when a job has been in the queue too long, Swift submits a replica of that job, and whenever one begins running, the others are cancelled. This can probably be fairly straightforwardly implemented in this prototype model.
  2. multiple sites and execution mechanisms - Swift can deal with multiple sites. My implementation of this prototype gutted Swift's ability to deal with multiple sites, but that is straightforward to fix. With the present prototype, multisite runs within one WMS will occur: one WMS == one Swift site in the model implemented by the prototype. With a restored ability to submit to multiple Swift sites and use multiple execution systems, one could submit to a WMS and to a local cluster PBS in the same run, with Swift adjusting the load it sends to those two systems as it sees fit; or to WMS and OSG sites and a local cluster.
  3. coasters and clusters - Swift has two ways of condensing multiple Swift-level jobs into "execution system" jobs. Neither works in the prototype at the moment. Clustering can probably be made to work pretty easily; coasters not so easily. Clustering picks jobs ahead of time and makes a combined job for them all, so the overhead (e.g. 5 minutes in this case) is incurred only once per clustered job. Coasters launch dynamic workers on each node to receive jobs. However, this interacts badly with the idea of making input and output sandboxes. If there is a site-shared filesystem, coasters could still be used on a site, but the execution model would need to be different. Coasters are extremely useful in applications with many short jobs (seconds to minutes) or where queue times are relatively long. Other names/implementations for the same idea are pilot jobs (PanDA) and glide-ins (Condor). Coasters are not production quality in mainline Swift at the moment, but they are very compelling for certain classes of application.
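
The replication behaviour in item 1 above can be sketched roughly as follows. This is an illustrative Python sketch, not Swift's actual code: the function name, the dictionary layout and the `queue_limit` parameter are assumptions, as is the choice to measure waiting time from the most recently submitted copy.

```python
def replication_actions(copies, now, queue_limit):
    """Decide what to do for one logical job that has one or more copies.

    copies: list of dicts with keys 'id', 'state' ('queued' or 'running')
            and 'submitted_at' (seconds). Hypothetical layout.
    Returns a list of (action, copy_id) tuples.
    """
    if not copies:
        # Nothing submitted yet: submit the first copy.
        return [("submit_replica", None)]

    running = [c for c in copies if c["state"] == "running"]
    if running:
        winner = running[0]
        # One copy has begun running: cancel all the others.
        return [("cancel", c["id"]) for c in copies if c is not winner]

    # All copies are still queued. Assumption: the clock restarts with
    # each new replica, so replicas are not submitted in a burst.
    newest = max(c["submitted_at"] for c in copies)
    if now - newest > queue_limit:
        return [("submit_replica", None)]
    return []


queued = [{"id": "a", "state": "queued", "submitted_at": 0}]
print(replication_actions(queued, now=700, queue_limit=600))
```

A scheduler loop would call such a function periodically for each outstanding job and carry out the returned actions against the execution system.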
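
To make the clustering argument in item 3 concrete, the Python sketch below adds up the per-submission overhead for the 3000-job test workflow, using the 5-minute figure quoted above. The function name and the cluster sizes are illustrative assumptions, and the result is cumulative overhead summed over all submissions, not wall-clock time, since submissions can overlap.

```python
import math


def total_overhead_minutes(n_jobs, overhead_per_submission_min, cluster_size):
    """Cumulative overhead when jobs are grouped into clusters.

    The overhead is paid once per clustered submission rather than once
    per Swift-level job. Illustrative helper, not part of Swift.
    """
    submissions = math.ceil(n_jobs / cluster_size)
    return submissions * overhead_per_submission_min


# 3000 jobs, 5 minutes of overhead per submission, varying cluster size.
for cluster_size in (1, 10, 100):
    print(cluster_size, total_overhead_minutes(3000, 5, cluster_size))
# prints:
# 1 15000
# 10 1500
# 100 150
```

For very short jobs such as /bin/touch, this overhead dominates the run, which is why clustering (or coasters) matters so much for this kind of workload.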

-- BruceBecker - 27 Apr 2009

Topic revision: r1 - 27 Apr 2009 - 16:35:58 - BruceBecker
 