I was recently in a situation where a number of customer VMs suddenly became unusable. When investigating the VMs in vCenter we could see that disk latency was measured in seconds. The back-end storage was a NetApp AFF connected to the hosts using Fibre Channel.
Looking at the AFF event logs, I could see multiple connection established events (I can't remember the event name). My train of thought was that for there to be a big clump of connection established events, the hosts must be reconnecting after losing their connections. Looking through the logs did not show any disconnections or anything else out of the ordinary, just lots of the connection established events, all to two specific controllers on different HA pairs.
Anyway, long story short… we got NetApp involved and we found a bad SFP module on a cable. NetApp found the port from the AutoSupport; we found it by looking at the FC switch.
These were some of the commands that got us to the problem.
Logging on to the AFF using PowerShell
I have probably covered this somewhere else in this blog, but here it is again.
# test to see if the DataONTAP module is installed
get-module dataontap -ListAvailable
# install dataONTAP
install-module dataONTAP
# load module
import-module dataontap
# connect to FAS/AFF (will prompt for creds)
connect-NcController "AFF01-clus.domian.local"
Once connected to the AFF, I dumped the event log using something like this:
Get-NcEmsMessage | export-csv "Eventlog.csv" -NoTypeInformation
Going through this, I found that the reconnects went back about 12 hours and had been increasing in frequency. Not normal behavior.
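To see that kind of trend without scrolling through the CSV, the rows can be bucketed by hour. This is only a rough sketch: `Get-EventCountPerHour` is a helper made up for this post, the `Time` and `MessageName` column names are assumptions (check the header of your own export), and `example.connect.established` is a placeholder because I can't remember the real event name.

```powershell
# Sketch: count matching events per hour.
# 'Time' and 'MessageName' are ASSUMED column names; adjust to your CSV header.
function Get-EventCountPerHour {
    param(
        [Parameter(Mandatory)] [object[]] $Events,
        [Parameter(Mandatory)] [string]   $NamePattern,
        [string] $TimeProperty = 'Time',
        [string] $NameProperty = 'MessageName'
    )
    $Events |
        Where-Object { $_.$NameProperty -like $NamePattern } |
        # Truncate each timestamp to the hour and group on that string
        Group-Object {
            ([datetime]$_.$TimeProperty).ToString('yyyy-MM-dd HH:00',
                [cultureinfo]::InvariantCulture)
        } |
        Sort-Object Name |
        Select-Object @{ n = 'Hour'; e = { $_.Name } }, Count
}

# Made-up rows standing in for: Import-Csv "Eventlog.csv"
$sample = @(
    [pscustomobject]@{ Time = '2020-01-01 08:05:00'; MessageName = 'example.connect.established' }
    [pscustomobject]@{ Time = '2020-01-01 09:01:00'; MessageName = 'example.connect.established' }
    [pscustomobject]@{ Time = '2020-01-01 09:20:00'; MessageName = 'example.connect.established' }
    [pscustomobject]@{ Time = '2020-01-01 09:45:00'; MessageName = 'example.connect.established' }
    [pscustomobject]@{ Time = '2020-01-01 09:50:00'; MessageName = 'example.other.event' }
)

Get-EventCountPerHour -Events $sample -NamePattern 'example.connect*'
```

Against a real export you would pass `(Import-Csv "Eventlog.csv")` as `-Events`. Timestamps written by Export-Csv may be in your local date format, so the `[datetime]` cast might need replacing with an explicit parse.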
At this point we went through the FC switches and found the bad port. NetApp confirmed they could also see the bad port on the node.
Using the DataONTAP cmdlet Get-NcFcpAdapter we could get the info for all the FC ports. Once we knew what we were looking for, the problem was easy to spot.
Running the following shows all used FC ports on all controllers.
get-ncnode | Get-NcFcpAdapter | where{$_.state -eq "online"} | select node,Adapter,portname,SfpTxPower,SfpRxPower,switchport
I don't have the output from the time of the issue, but a normal output looks like this.
At the time of the issue, the SfpRxPower (receive) value on one of these ports was 200 microwatts. In fact, Adapter 2c on node 01 looks a bit low. I will be keeping an eye on that.
The SwitchPort attribute was especially nice as it gave the connected switch name and port number.
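Since I want to keep an eye on receive power, here is a rough sketch of how low ports could be flagged automatically. It makes several assumptions: `Get-LowRxPowerPort` is a helper made up for this post, SfpRxPower is assumed to come back as a string that starts with a number in microwatts (something like "441.4 (uWatts)"), and the 300 microwatt threshold is an arbitrary figure for illustration, not a NetApp recommendation. The sample objects are invented so the snippet runs without a cluster.

```powershell
# Sketch: list FC ports whose SFP receive power is below a threshold.
# ASSUMPTIONS: SfpRxPower is a string beginning with a microwatt figure,
# and the default threshold is a made-up value for illustration only.
function Get-LowRxPowerPort {
    param(
        [Parameter(Mandatory)] [object[]] $Adapters,
        [double] $MinRxMicrowatts = 300
    )
    foreach ($a in $Adapters) {
        # Pull the first number out of the SfpRxPower text
        if ("$($a.SfpRxPower)" -match '(\d+(\.\d+)?)') {
            $rx = [double]::Parse($Matches[1], [cultureinfo]::InvariantCulture)
            if ($rx -lt $MinRxMicrowatts) {
                [pscustomobject]@{
                    Node         = $a.Node
                    Adapter      = $a.Adapter
                    RxMicrowatts = $rx
                    SwitchPort   = $a.SwitchPort
                }
            }
        }
    }
}

# Invented objects standing in for: Get-NcNode | Get-NcFcpAdapter
$sampleAdapters = @(
    [pscustomobject]@{ Node = 'node-a'; Adapter = '0e'; SfpRxPower = '450.2 (uWatts)'; SwitchPort = 'switch-x:1' }
    [pscustomobject]@{ Node = 'node-a'; Adapter = '0f'; SfpRxPower = '200.0 (uWatts)'; SwitchPort = 'switch-x:2' }
    [pscustomobject]@{ Node = 'node-b'; Adapter = '0e'; SfpRxPower = '512.7 (uWatts)'; SwitchPort = 'switch-y:1' }
)

Get-LowRxPowerPort -Adapters $sampleAdapters
```

On a live system the input would be the online adapters from the Get-NcFcpAdapter one-liner above, and the threshold should be set from what your own healthy ports normally report.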