Network and System Monitoring of the South African HPC Grid - (WORK STILL IN PROGRESS)
Purpose
Network and system monitoring has become a necessity in any system administrator's job. It notifies administrators of problems and lets them respond early, without having to watch the server, service or network constantly. The purpose of this document is to underline the importance of network and system monitoring in the High Performance Computing arena. This topic is closely related to performance tuning and optimisation.
Monitoring systems are a traditional weak point in the design of grid infrastructures. This is largely due to the federated nature of the resources, which are often under the control of administrators with very different ideas. However, in the context of the SAGrid project, we have the opportunity to design and deploy a coherent and consistent monitoring solution at all sites.
The right tool for the job
Overview of Tools
Some background discussion was held to understand the state of play in distributed computing environments. There are many choices on offer, all providing various levels of ease of use and functionality. The tools considered for monitoring network and system performance were:
Each of these is already in use at various sites and for various other projects, and the experience of the site operations team will be used in choosing the standard for each site. A comparison of these tools was also consulted: http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems
The Official Monitor
After some discussion and preliminary tests, Zabbix was chosen as the standard tool for distributed monitoring. Its configuration should be exactly the same at each site, so some experimentation was planned in order to define the standard procedure (see ZabbixHowtos, ZabbixDiscussion).
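The requirement that every site carries the same configuration can be checked mechanically. The following is only a minimal sketch, not part of the SAGrid procedure: it assumes each site's agent configuration is available as plain `Key=Value` text with `#` comments, and the site names, server address and the choice to ignore the per-host `Hostname` setting are hypothetical values chosen for illustration.

```python
import hashlib


def normalise(text, ignore_keys=()):
    """Parse Key=Value lines, skipping blanks, comments and ignored keys."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key not in ignore_keys:
            settings[key] = value.strip()
    return settings


def digest(settings):
    """Hash the settings in a canonical (sorted) order."""
    canonical = "\n".join(f"{k}={v}" for k, v in sorted(settings.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_sites_by_config(site_configs, ignore_keys=()):
    """Return lists of site names that share an identical configuration."""
    groups = {}
    for site in sorted(site_configs):
        key = digest(normalise(site_configs[site], ignore_keys))
        groups.setdefault(key, []).append(site)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical sites and settings, for illustration only.
    configs = {
        "site_a": "Server=mon.example.org\nHostname=a\nTimeout=3",
        "site_b": "# local note\nTimeout = 3\nServer=mon.example.org\nHostname=b",
        "site_c": "Server=mon.example.org\nHostname=c\nTimeout=10",
    }
    groups = group_sites_by_config(configs, ignore_keys=("Hostname",))
    if len(groups) > 1:
        print("Configuration drift detected:", groups)
    else:
        print("All sites share the same configuration.")
```

More than one group in the result means at least one site has drifted from the standard. Comparing parsed settings rather than raw files keeps differences in comments, ordering and whitespace from being reported as drift.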
What would we like to monitor?
Network
Network schematic of the network monitor
Grid services
Core Services
Site services
-- TimothyCarr - 07 Jul 2009
Topic revision: r5 - 07 Aug 2009 - 13:57:04 - BruceBecker