SAGrid stress tests May 2010

A series of stress tests was undertaken to evaluate the state of the core services and the functionality of the site services at the various sites.

Core services stress test

The core services are a single point of failure of the infrastructure in South Africa. For the moment, we do not have proper failover capability, although the potential for this exists. The stress test of core services aims to test the stability and functionality of the WMS and BDII under load, by submitting a large number of jobs at once.

Site tests

Parametric jobs

The site tests are done at a few different levels :

  1. small parametric job test (5), with no output sandbox
  2. small parametric job test (5) with output sandbox to test gsiftp functionality
  3. large parametric job (site-size), short duration, to test whether the resources are properly used
  4. very large parametric job, short duration, long proxy lifetime, to see whether the WMS properly schedules jobs
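
The levels above differ mainly in the value of Parameters and in whether an output sandbox is requested. A minimal sketch of a helper that prints a JDL for a given level, modelled on the cpuload.jdl listed at the end of this page. The function name make_parametric_jdl and its arguments are hypothetical, not part of the original test files.

```shell
#!/bin/bash
# make_parametric_jdl (hypothetical helper): print a parametric JDL to stdout.
#   $1 = number of parameters (sub-jobs), e.g. 5 for the small tests
#   $2 = "yes" to request an output sandbox, anything else to omit it
make_parametric_jdl() {
    local nparams=$1
    local with_output=${2:-no}
    echo '['
    echo '   Type = "Job";'
    echo '   JobType = "Parametric";'
    echo '   Executable = "cpuload.sh";'
    echo "   Parameters = ${nparams};"
    echo '   ParameterStep = 1;'
    echo '   ParameterStart = 0;'
    echo '   InputSandbox = {"cpuload.sh"};'
    if [ "$with_output" = "yes" ]; then
        # Only the sandbox tests ask for stdout/stderr to be returned.
        echo '   StdOutput = "std.out";'
        echo '   StdError = "std.err";'
        echo '   OutputSandbox = {"std.out","std.err"};'
    fi
    echo ']'
}
```

For example, `make_parametric_jdl 5 yes > small-sandbox.jdl` would cover level 2, and `make_parametric_jdl 80 > site-size.jdl` a site-size test for a site publishing 80 CPUs.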

Job Collections

Following the test plan of EGEE SA1 (e.g. https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WmsTestsICE4), we test in the following way :
  1. 1000 collections of 100 jobs each
  2. one collection every 60 seconds
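
The submission schedule above can be sketched as a simple loop. This is only an illustration, assuming the collection's JDL files sit in one directory and are submitted with glite-wms-job-submit and its --collection option. The function name submit_collections and the SUBMIT_CMD override are hypothetical.

```shell
#!/bin/bash
# submit_collections (hypothetical helper): submit COUNT collections taken
# from JDL_DIR, pausing INTERVAL seconds between submissions.
#   $1 = number of collections, $2 = interval in seconds, $3 = JDL directory
# Set SUBMIT_CMD=echo for a dry run that only prints the arguments.
submit_collections() {
    local count=$1
    local interval=$2
    local jdl_dir=$3
    local submit=${SUBMIT_CMD:-glite-wms-job-submit}
    local i=1
    while [ "$i" -le "$count" ]; do
        # -a: automatic proxy delegation
        # --collection: submit all JDL files in the directory as one collection
        $submit -a --collection "$jdl_dir" \
            || echo "collection $i: submission failed" >&2
        # No need to wait after the last submission.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    return 0
}
```

With a directory of 100 JDL files, `submit_collections 1000 60 ./collection` would correspond to the plan above.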

Results : ZA-UFS

A set of 1000000 parameters was submitted to ZA-UFS at once. The WMS experienced a significant load average of around 1.8. About 30% of jobs were aborted due to miscommunication with the CE.
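
This page does not record how the load figure was obtained. As a sketch only, assuming a Linux WMS host where /proc/loadavg is readable, the 1-minute load average could be logged during a test as follows. The helper names load_1min and log_load are hypothetical.

```shell
#!/bin/bash
# load_1min (hypothetical helper): print the 1-minute load average.
# Reads /proc/loadavg by default; another file can be passed for testing.
load_1min() {
    cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# log_load (hypothetical helper): print "<epoch seconds> <load>" lines.
#   $1 = number of samples, $2 = seconds between samples,
#   $3 = optional file to read instead of /proc/loadavg
log_load() {
    local n=0
    while [ "$n" -lt "$1" ]; do
        echo "$(date +%s) $(load_1min "$3")"
        n=$((n + 1))
        if [ "$n" -lt "$1" ]; then
            sleep "$2"
        fi
    done
}
```

Running `log_load 60 10 > wms-load.log` on the WMS while the jobs are being submitted would give ten minutes of samples to compare between tests.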

Results : ZA-UJ

Small parametric jobs

Jobs ran fine, but after lcg-infosites showed that they had finished, the WMS-LB still reported them as Running:
 
valor del bdii: devslngrd001.uct.ac.za:2170
#CPU   Free   Total Jobs   Running   Waiting   ComputingElement
----------------------------------------------------------
 112    112            0         0    444444   wonko.bi.up.ac.za:8443/cream-pbs-gilda
  16     16            0         0         0   srvslngrd011.uct.ac.za:8443/cream-pbs-gilda
  80     80            0         0         0   glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
  32     29            3         3         0   cream-ce.core.wits.ac.za:8443/cream-pbs-gilda
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:

Status info for the Job : https://srvslngrd010.uct.ac.za:9000/dfnWm1976HdXiyQPe1J8nw
Current Status:     Running
Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
- Nodes information for:
    Status info for the Job : https://srvslngrd010.uct.ac.za:9000/1keYkI-U9TFHCmBKgQIKxg
    Current Status:     Running
    Status Reason:      unavailable
    Destination:        glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
    Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
    Status info for the Job : https://srvslngrd010.uct.ac.za:9000/L7gvocVoUvHhffUo5y8jBA
    Current Status:     Running
    Status Reason:      unavailable
    Destination:        glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
    Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
    Status info for the Job : https://srvslngrd010.uct.ac.za:9000/PcScNaf_je8PT_0YfjmyFQ
    Current Status:     Running
    Status Reason:      unavailable
    Destination:        glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
    Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
    Status info for the Job : https://srvslngrd010.uct.ac.za:9000/ZUhr71ddISB1kkeemcfhBA
    Current Status:     Running
    Status Reason:      unavailable
    Destination:        glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
    Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
    Status info for the Job : https://srvslngrd010.uct.ac.za:9000/q4Ev2QflkfVP2BYVV_oB0A
    Current Status:     Running
    Status Reason:      unavailable
    Destination:        glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
    Submitted:          Fri May 14 10:33:02 2010 SAST
==========================================================================
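
To compare the LB view with lcg-infosites without reading the whole listing, the status output above can be tallied per state. A minimal sketch, assuming the "Current Status:" lines look like the ones shown here. The function name tally_status is hypothetical.

```shell
#!/bin/bash
# tally_status (hypothetical helper): read glite-wms-job-status output on
# stdin and print "<count> <status>" for each distinct "Current Status"
# value, most frequent first. The parent job's own status line is counted
# along with those of its nodes.
tally_status() {
    awk -F': *' '
        /Current Status:/ {
            sub(/[ \t]+$/, "", $2)   # drop trailing whitespace
            count[$2]++
        }
        END { for (s in count) print count[s], s }
    ' | sort -rn
}
```

Typical use would be `glite-wms-job-status <job URL> | tally_status`. For the listing above this would report 6 Running (the parent plus five nodes).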

Small site-size parametric jobs

Once a number of jobs equal to the number of available CPUs had been submitted, the information system suddenly lost the GRIS:
-bash-3.2$ lcg-infosites --vo gilda ce
valor del bdii: devslngrd001.uct.ac.za:2170
#CPU   Free   Total Jobs   Running   Waiting   ComputingElement
----------------------------------------------------------
 112    112            0         0    444444   wonko.bi.up.ac.za:8443/cream-pbs-gilda
  16     16            0         0         0   srvslngrd011.uct.ac.za:8443/cream-pbs-gilda
  80     70            0         0    444444   glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda
  32     29            3         3         0   cream-ce.core.wits.ac.za:8443/cream-pbs-gilda
However, the Free column (70 of 80 CPUs at glite-ce.grid.uj.ac.za) shows that only 10 jobs are running, probably because of a limit on the gilda queue. The LB gives different information:
-bash-3.2$ glite-wms-job-status https://srvslngrd010.uct.ac.za:9000/HTBqXxBSC0W0APcRRvYMdA|grep Running | wc -l 
21
-bash-3.2$ glite-wms-job-status https://srvslngrd010.uct.ac.za:9000/HTBqXxBSC0W0APcRRvYMdA|grep Running | wc -l 
31
Not too sure what's happening, but it seems like the GIIS at UJ is having trouble communicating with the BDII.
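
The comparison made by eye above (CPUs in use versus published job counts) can be sketched as a small filter over the lcg-infosites output. It assumes the six-column layout shown on this page, and it assumes that a Waiting value of 444444 is a placeholder published when the dynamic information is unavailable. That reading is an assumption, not something established by these tests. The function name ce_usage is hypothetical.

```shell
#!/bin/bash
# ce_usage (hypothetical helper): read "lcg-infosites --vo <vo> ce" output
# on stdin. For each CE line print the CE name, the number of busy CPUs
# (#CPU minus Free), the published Running count, and a flag that is
# "suspect" when Waiting is 444444 (assumed placeholder value).
ce_usage() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 6 {
            busy = $1 - $2
            note = ($5 == 444444) ? "suspect" : "ok"
            print $6, "busy=" busy, "running=" $4, "waiting=" note
        }
    '
}
```

Piping the second listing through it, as in `lcg-infosites --vo gilda ce | ce_usage`, gives `busy=10 running=0 waiting=suspect` for glite-ce.grid.uj.ac.za, which makes the mismatch between occupied CPUs and published job counts easy to spot.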

Files used in the stress test

JDL

-bash-3.2$ cat cpuload.jdl 
[
   Type = "Job";
   JobType = "Parametric";
   Executable = "cpuload.sh";
   Parameters = 5;
   ParameterStep = 1;
   ParameterStart = 0;
   StdOutput = "std.out";
   StdError = "std.err";
   InputSandbox = {"cpuload.sh"};
   OutputSandbox = {"std.out","std.err"};
   OutputSandboxBaseDestUri = "gsiftp://localhost/";
   Requirements = (other.GlueCEUniqueID != "srvslngrd011.uct.ac.za:8443/cream-pbs-gilda"); # exclude the known good site
#   Requirements = (other.GlueCEUniqueID == "glite-ce.grid.uj.ac.za:8443/cream-pbs-gilda"); # alternative: select a single site
   PerusalFileEnable = True;
   PerusalTimeInterval = 60;
]

shell script

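#!/bin/bash
#
# cpuload.sh: burn some CPU on the worker node and report where the job ran.
HST=$(hostname)
WHOAMI=$(whoami)
DATE=$(date)

echo "$DATE : Testing batch host $HST..."
echo ""
echo "$DATE : I am $WHOAMI"

TMP=1
J=1

# Two outer passes of 1999 multiplications each. The value of TMP is
# irrelevant (it overflows); the loops exist only to generate CPU load.
while [ "$J" -lt 3 ] ;
do
        I=1
        while [ "$I" -lt 2000 ] ;
        do
                TMP=$(( TMP * I ))
                I=$(( I + 1 ))
        done
        J=$(( J + 1 ))
done

DATE=$(date)
echo "$DATE : Batch host $HST tested !"
exit 0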
-- BruceBecker - 14 May 2010
Topic revision: r3 - 26 May 2010 - 15:27:20 - BruceBecker
 