NRPE Timeouts / Sockets not closing


#1

Hi,

We are encountering timeouts and sockets that do not close correctly on several Windows machines. Not all of them are affected, but a lot of checks time out randomly. Checking the affected servers with netstat -a shows hundreds, sometimes thousands, of connections in CLOSE_WAIT.

Local:5666 Remote:60912 CLOSE_WAIT

Some checks still work, but others time out with “CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.” Increasing the timeout value does not help. The temporary workaround is to restart the nscp service, which clears the connections that were not closed properly.
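To see how quickly the stuck connections pile up between restarts, the netstat output can be counted with a small script. This is only a rough sketch: it assumes the usual Windows netstat column layout (protocol, local address, foreign address, state), and the function name and the sample addresses are made up for illustration.

```python
import subprocess

# Hypothetical sample in the usual Windows `netstat -an` layout
# (documentation addresses, not taken from the affected servers).
SAMPLE = """
  TCP    0.0.0.0:5666           0.0.0.0:0              LISTENING
  TCP    192.0.2.5:5666         192.0.2.9:60912        CLOSE_WAIT
  TCP    192.0.2.5:5666         192.0.2.9:60950        ESTABLISHED
  TCP    192.0.2.5:49700        192.0.2.9:5666         CLOSE_WAIT
  TCP    [2001:db8::5]:5666     [2001:db8::9]:34786    CLOSE_WAIT
"""


def count_close_wait(netstat_text: str, port: int = 5666) -> int:
    """Count TCP rows whose LOCAL port is `port` and whose state is CLOSE_WAIT."""
    count = 0
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue  # header lines, UDP rows, blank lines
        local_address, state = fields[1], fields[3]
        # rsplit keeps this working for bracketed IPv6 addresses such as [::]:5666
        local_port = local_address.rsplit(":", 1)[-1]
        if local_port == str(port) and state == "CLOSE_WAIT":
            count += 1
    return count


def live_close_wait(port: int = 5666) -> int:
    """Run netstat on the local machine and count stuck connections."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return count_close_wait(result.stdout, port)


print(count_close_wait(SAMPLE))
```

On an affected server, calling live_close_wait() every few minutes from a scheduled task would show whether the number keeps growing or levels off.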

Nsclient.log is filled mostly with “Failed to handle…” messages. The “Socket ERROR:” message only appears at the beginning of the log.

2018-05-29 22:00:41: error:c:\source\master\include\socket/server.hpp:255: Socket ERROR: Already open
2018-05-29 22:00:48: error:c:\source\master\include\socket/server.hpp:258: Failed to handle incoming connection: remote_endpoint: The file handle supplied is not valid
2018-05-29 22:01:01: error:c:\source\master\include\socket/server.hpp:258: Failed to handle incoming connection: remote_endpoint: The file handle supplied is not valid
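To check whether “Socket ERROR” really only shows up once while the “Failed to handle” lines keep repeating, the log can be tallied by message type. The sketch below assumes every line follows the format of the three lines above (timestamp, level, source file, line number, message). The regular expression and the function name are my own, not part of NSClient++.

```python
import re
from collections import Counter

# The three log lines quoted in this post.
SAMPLE_LOG = r"""
2018-05-29 22:00:41: error:c:\source\master\include\socket/server.hpp:255: Socket ERROR: Already open
2018-05-29 22:00:48: error:c:\source\master\include\socket/server.hpp:258: Failed to handle incoming connection: remote_endpoint: The file handle supplied is not valid
2018-05-29 22:01:01: error:c:\source\master\include\socket/server.hpp:258: Failed to handle incoming connection: remote_endpoint: The file handle supplied is not valid
"""

# timestamp: level:source-file:line-number: message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<level>\w+):(?P<source>.+?):(?P<lineno>\d+): (?P<msg>.*)$"
)


def summarize(log_text: str) -> Counter:
    """Count log entries by the first part of the message (text before the first colon)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # blank or unrecognized line
        kind = match.group("msg").split(":", 1)[0]
        counts[kind] += 1
    return counts


for kind, n in summarize(SAMPLE_LOG).most_common():
    print(n, kind)
```

Pointing summarize() at the contents of the full nsclient.log would give the same breakdown for the whole file.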

Output from nscp test --debug

PS C:\Program Files\NSClient++> .\nscp test --debug
D       core NSClient++ 0.5.3.4 2018-04-26 x64 Loading settings and logger...
D       core Settings not ready so we cant lookup: base-path
D       core Settings not ready so we cant lookup: exe-path
D   settings Boot.ini found in: C:\Program Files\NSClient++/boot.ini
D       core Settings not ready so we cant lookup: shared-path
D   settings Activating: ini://${shared-path}/nsclient.ini
D   settings Creating instance for: ini://${shared-path}/nsclient.ini
D       core Settings not ready so we cant lookup: shared-path
D   settings Loading: C:\Program Files\NSClient++/nsclient.ini
D       core NSClient++ 0.5.3.4 2018-04-26 x64 booting...
D       core Booted settings subsystem...
D       core On crash: restart: NSCP
D       core Archiving crash dumps in: C:\Program Files\NSClient++/crash-dumps
D       core Found: CheckDisk
D       core Found: CheckEventLog
D       core Found: CheckExternalScripts
D       core Found: CheckHelpers
D       core Found: CheckSystem
D       core Found: NRPEServer
D       core Loading module C:\Program Files\NSClient++\modules\CheckDisk.dll ()
D       core Loading module C:\Program Files\NSClient++\modules\CheckEventLog.dll ()
D       core Loading module C:\Program Files\NSClient++\modules\CheckExternalScripts.dll ()
D       core Loading module C:\Program Files\NSClient++\modules\CheckHelpers.dll ()
D       core Loading module C:\Program Files\NSClient++\modules\CheckSystem.dll ()
D       core Loading module C:\Program Files\NSClient++\modules\NRPEServer.dll ()
D       core Loading plugin: CheckDisk
D       core Loading plugin: CheckEventLog
D       core Loading plugin: CheckExternalScripts
D       core Loading plugin: CheckHelpers
D       core Loading plugin: CheckSystem
D       core Loading plugin: NRPEServer
D       nrpe Allowed hosts definition: 'IPv4-subnet', 'IPv6-subnet'
D       nrpe Server config: address: :5666, ssl enabled: none, no certificate, dh: C:\Program Files\NSClient++/security/nrpe_dh_512.pem, ciphers: ADH, ca: C:\Program Files\NSClient++/security/ca.pem, options:
D       nrpe Binding to: [::]:5666(ipv6)
D       nrpe Attempting to bind to: [::]:5666(ipv6)
D       nrpe Binding to: 0.0.0.0:5666(ipv4), reopen: true, reuse: true
D       nrpe Attempting to bind to: 0.0.0.0:5666(ipv4)
D       core NSClient++ - 0.5.3.4 2018-04-26 Started!
D       core Loading module C:\Program Files\NSClient++\modules\CommandClient.dll ()
D       core Loading plugin: CommandClient...
D       cli Enter command to execute, help for help or exit to exit...
D  		w32system Loading counter: disk_queue_length_0 C: = \\hostname\PhysicalDisk(0 C:)\% Disk Time
D  		w32system Loading counter: disk_queue_length__Total = \\hostname\PhysicalDisk(_Total)\% Disk Time

NSClient.ini

[/modules]
NRPEServer = 1
CheckSystem = 1
CheckDisk = 1
CheckEventLog = 1
CheckHelpers = 1
CheckExternalScripts = 1

[/paths]
certificate-path = ${shared-path}/security
module-path = ${shared-path}/modules
shared-path = C:\Program Files\NSClient++

[/settings/default]
allowed hosts = 'IPv6-subnet','IPv4-subnet'
bind to =

[/settings/log]
date format = %Y-%m-%d %H:%M:%S
file name = ${shared-path}/nsclient.log
; LOG LEVEL - Log level to use. Available levels are error,warning,info,debug,trace
level = info

[/settings/NRPE/server]
port = 5666
allow arguments = true
allow nasty characters = true
use ssl = true
insecure = true
timeout = 115

[/settings/system/windows]
default buffer length=61m

[/settings/external scripts/alias]
check_puppetservice = check_service service=puppet
check_vmwaretoolsservice = check_service service=VMTools
check_splunkservice = check_service service=SplunkForwarder

[/settings/external scripts/scripts]
timeout = 110
check_updates = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -file scripts\check_windows_updates.ps1
check_license = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -file scripts\check_license.ps1
check_puppet = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -command "&{ruby.exe scripts\check_puppet.rb}"; exit($LastExitCode)
check_puppet_cert = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -command "&{ruby.exe scripts\check_puppet_cert.rb -w 30 -c 10 C:/ProgramData/PuppetLabs/puppet/etc/ssl/certs/hostname.pem}"; exit($LastExitCode)
check_puppet_ca_cert = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -command "&{ruby.exe scripts\check_puppet_cert.rb -w 30 -c 10 C:/ProgramData/PuppetLabs/puppet/etc/ssl/certs/ca.pem}"; exit($LastExitCode)
check_puppet_crl = powershell.exe -ExecutionPolicy RemoteSigned -noprofile -command "&{ruby.exe scripts\check_puppet_cert.rb -w 200 -c 365 C:/ProgramData/PuppetLabs/puppet/etc/ssl/crl.pem}"; exit($LastExitCode)

We recently installed the beta client (0.5.3.4) on some test machines, but the problem appears there as well as on the stable version (0.5.2.35).

Process Explorer shows that it is nscp.exe that holds the connections in CLOSE_WAIT

nscp.exe TCP local 5666 remote 36420 CLOSE_WAIT

I would appreciate any help or tips for troubleshooting this further.


#2

I get this too. I think it’s PowerShell: it doesn’t seem to terminate properly when it times out.

Try killing all PowerShell processes in Task Manager, leave it for 5 minutes, and see whether all of the blocked ports close themselves.


#3

That doesn’t seem to be my problem. No PowerShell processes are running, but connections are still not closing properly. In my case it seems to be nscp.exe itself that doesn’t terminate the connections.

nscp.exe	5228	TCPV6	local	5666	remote	34786	CLOSE_WAIT
nscp.exe	5228	TCPV6	local	5666	remote	34904	CLOSE_WAIT
nscp.exe	5228	TCPV6	local	5666	remote	35004	CLOSE_WAIT

#4

I think I have found the solution to our problem. Maybe this solves yours as well, @Brick.

I stumbled upon https://github.com/mickem/nscp/issues/312, edited “bind to” in nsclient.ini so that it only listens on IPv4 (0.0.0.0), and restarted the nsclient service.

[/settings/default]
allowed hosts = 'IPv6-subnet','IPv4-subnet'
bind to = 0.0.0.0

I modified the file only a few hours ago, and all of the Windows machines on which the problem occurred are now OK. I’ll update this post if the problem occurs again, but it looks promising.
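For anyone who has to verify this setting on many hosts, a quick check of nsclient.ini could look something like the following. It is a sketch only: it assumes Python's standard configparser can read the file (interpolation is switched off so values like ${shared-path} and %Y-%m-%d are left alone), and the function names are hypothetical.

```python
import configparser

# The [/settings/default] section before and after the change described above.
BEFORE = """
[/settings/default]
allowed hosts = 'IPv6-subnet','IPv4-subnet'
bind to =
"""

AFTER = """
[/settings/default]
allowed hosts = 'IPv6-subnet','IPv4-subnet'
bind to = 0.0.0.0
"""


def bind_address(ini_text: str) -> str:
    """Return the 'bind to' value from [/settings/default], or '' if unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.get("/settings/default", "bind to", fallback="").strip()


def listens_ipv4_only(ini_text: str) -> bool:
    """True when 'bind to' is set to 0.0.0.0, as in the workaround above."""
    return bind_address(ini_text) == "0.0.0.0"


print(listens_ipv4_only(BEFORE), listens_ipv4_only(AFTER))
```

An empty “bind to” is reported as not IPv4-only, which matches the debug output earlier in the thread where the server binds to both [::]:5666 and 0.0.0.0:5666.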