Network and System Monitoring of the South African HPC Grid - (WORK STILL IN PROGRESS)

Purpose

Network and system monitoring has become an essential part of any system administrator's job. It allows administrators to be notified of problems and to respond promptly, without having to watch every server, service and network constantly. The purpose of this document is to underline the importance of network and system monitoring in the High Performance Computing arena, a topic closely related to performance tuning and optimisation. Monitoring systems are a traditional weak point in the design of grid infrastructures, largely because of the federated nature of the resources, which are often under the control of administrators with very different ideas. However, in the context of the SAGrid project, we have the opportunity to design and deploy a coherent and consistent monitoring solution at all sites.

The right tool for the job

Overview of Tools

Some background discussion was held to understand the state of play in distributed computing environments. There are many choices on offer, providing various levels of ease of use and functionality. Several tools were considered for monitoring network and system performance. Each of these is already in use at various sites and for various other projects, and the experience of the site operations team will be used in choosing the standard for each site. A comparison of these tools was also consulted: http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems

The Official Monitor

After some discussion and preliminary tests, Zabbix was chosen as the standard for the distributed monitor. Its configuration should be exactly the same at every site, so some experimentation was planned in order to define the standard procedure (see ZabbixHowtos, ZabbixDiscussion).
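As a rough illustration of how the "same configuration at every site" requirement could be checked, the sketch below compares each site's configuration text against a reference copy. It is not part of the SAGrid procedure: the function names, the site names and the sample keys are hypothetical, and it assumes a simple `key=value` file in which `#` starts a comment and the order of lines does not matter.

```python
import hashlib


def normalise(config_text):
    """Drop blank lines and '#' comments so cosmetic differences are ignored.

    Assumption: line order is not significant, so lines are sorted.
    """
    lines = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return "\n".join(sorted(lines))


def fingerprint(config_text):
    """Return a SHA-256 digest of the normalised configuration."""
    return hashlib.sha256(normalise(config_text).encode("utf-8")).hexdigest()


def deviating_sites(reference_text, site_configs):
    """List the sites whose configuration differs from the reference."""
    expected = fingerprint(reference_text)
    return sorted(
        site for site, text in site_configs.items()
        if fingerprint(text) != expected
    )


if __name__ == "__main__":
    # Hypothetical reference and per-site configurations (illustrative keys).
    reference = "# standard settings\nServer=monitor.example.org\nTimeout=3\n"
    sites = {
        "site-a": "Timeout=3\nServer=monitor.example.org\n",
        "site-b": "Server=monitor.example.org\nTimeout=30\n",
    }
    print(deviating_sites(reference, sites))
```

In practice some settings, such as a host name, necessarily differ between sites; those would have to be excluded or templated before a comparison of this kind is meaningful.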

What would we like to monitor?

Network

Network schematic of the network monitor

Grid services

Core Services

Site services

-- TimothyCarr - 07 Jul 2009

Topic revision: r5 - 07 Aug 2009 - 13:57:04 - BruceBecker
 