This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.
[Newbie] Help request: understanding slowdowns in a network file system
- From: Daniel Ankers <md1clv at md1clv dot com>
- To: systemtap at sourceware dot org
- Date: Fri, 27 Jul 2012 15:36:51 +0100
- Subject: [Newbie] Help request: understanding slowdowns in a network file system
Hi all,
I'm trying to understand what is causing occasional disk I/O slowdowns
in a virtual environment I manage.
The disks are stored as image files on a Gluster
(http://www.gluster.org/) FUSE filesystem, and each image file is
stored on two different Gluster servers. This means that any disk
request from an application on a virtual server goes through something
similar to the following layers:
(1) Linux VFS on guest system
(2) Hypervisor on host system
(3) Linux VFS on host system
(4) Gluster client FUSE module on host system
(5) Network layer on host system
(6) Physical network
(7) Network layer on Gluster server system
(8) Gluster server FUSE module on Gluster server system
(9) Linux VFS on Gluster server system
(10) Filesystem code on Gluster server system
(11) Physical disk on Gluster server system
The question I need to answer is "what do I need to upgrade to fix
this problem?", and I've not been able to find an answer using the
usual troubleshooting tools. So far the only evidence I have is the
behaviour observed on the guest system.
I'm reading the SystemTap Beginners Guide, which has some examples
that will help at certain layers (e.g. iotime.stp), but I'm struggling
to understand how to pull everything together to get helpful
diagnostic information.
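As a rough illustration of a per-layer measurement, the sketch below
times small synced writes in a given directory. It is Python rather
than SystemTap and is not taken from the Beginners Guide. The function
names, the scratch-file prefix, and the write count and size are all
assumptions made for the example. The idea is that the same test
could be run against a directory on the guest, on the host's Gluster
mount, and on the server's backing filesystem, so that the latencies
can be compared between layers.

```python
import os
import statistics
import tempfile
import time


def time_synced_writes(directory, count=20, size=4096):
    """Return per-write latencies (seconds) for small fsync'd writes in `directory`."""
    payload = b"\0" * size
    latencies = []
    # Hypothetical scratch file; it is deleted once the run finishes.
    fd, path = tempfile.mkstemp(prefix="latency_probe_", dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # push the write down through the layers below
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to a few comparable numbers."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Assumption: point this at a directory on the filesystem under test.
    print(summarize(time_synced_writes(tempfile.gettempdir())))
```

A large gap between the median and the maximum at one vantage point,
but not at the layer beneath it, would suggest where the occasional
stalls are introduced. It would not explain why they happen.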
The questions I have are:
1) Is SystemTap the right tool to help me get to the bottom of this
problem? If not, the rest of the questions don't matter...
2) As an administrator rather than a developer, I don't really know
which system calls I need to monitor. What is the best way to
work this out?
3) Is there a neat way to tie together requests going out of the
client with requests coming into the server?
4) Are there any hints anyone can give on the best way to approach
troubleshooting across several different processes, layers and
services like this?
Thanks,
Dan