Wednesday, December 10, 2014

AppFabric Caching Service stopping unexpectedly

Today I had to fix an issue with slow page loads, which was quickly traced to the Distributed Cache being down. The environment is SharePoint 2013, running on the minimum hardware requirements. Dynamic memory, which is known to cause issues on virtualized servers running the AppFabric Caching Service, is not used. The issue exists on both the dev and prod environments, which run the same version of everything. We'll take a closer look at dev for the purpose of this post. It's a single server hosting everything except SQL, which is on a separate box.

The service is configured with a managed account that has the correct permissions.
Once started, it runs for about 3 minutes before it goes down. The events with source .NET Runtime record the actual crash of the service; they sit just 2 lines above event ID 0, which marks its start.
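
For anyone who prefers to pull these entries from PowerShell rather than Event Viewer, here is a minimal sketch. It assumes the crash entries sit in the Application log under the .NET Runtime and Application Error providers (the second is a guess based on event ID 1000). The function names and the 30-minute window are hypothetical and not taken from the environment described here.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for recent crash-related
# entries in the Application log. Provider names and time window are assumptions.
function New-CrashEventFilter {
    param(
        [int]$MinutesBack = 30
    )
    @{
        LogName      = 'Application'
        ProviderName = '.NET Runtime', 'Application Error'
        StartTime    = (Get-Date).AddMinutes(-$MinutesBack)
    }
}

# Hypothetical helper: lists the matching events, oldest first.
# Get-WinEvent raises an error when nothing matches, hence SilentlyContinue.
function Get-RecentCrashEvents {
    param(
        [int]$MinutesBack = 30
    )
    Get-WinEvent -FilterHashtable (New-CrashEventFilter -MinutesBack $MinutesBack) -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, Id, ProviderName, LevelDisplayName
}
```

If those assumptions hold, running Get-RecentCrashEvents -MinutesBack 10 shortly after the service stops should list the crash entries in chronological order.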


A closer look at the errors:



EventID 1000:



The following command shows the host as online, though (server name and user details are removed for confidentiality):

Get-SPServiceInstance | ? {($_.service.tostring()) -eq "SPDistributedCacheService Name=AppFabricCachingService"} | select Server, Status




If you try to get the cache cluster health while the Windows service is running, you can't connect:


Health Analyzer shows nothing related to the Distributed Cache. The Services on Server page shows the service as running, but there isn't much more you can do from there. I decided to recreate the instance as the quickest fix.

The steps to do that are:

1. Remove-SPDistributedCacheServiceInstance

And then

2. Add-SPDistributedCacheServiceInstance

The service will then warm up in a few minutes, and then it’s time to check its health:

3. Get-CacheClusterHealth

If any Healthy value is not 10.00, we would see some unallocated fractions. In that case, wait a couple more minutes to give the service time to warm up, then execute the command again.
Here’s the nice output that we’d expect from a healthy instance (all Healthy = 10.00):

Cluster health statistics
=========================

HostName = <localhost>
-------------------------

    NamedCache = DistributedActivityFeedLMTCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedDefaultCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedActivityFeedCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedSecurityTrimmingCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedAccessCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedLogonTokenCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedViewStateCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedServerToAppServerAccessTokenCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedBouncerCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

    NamedCache = DistributedSearchCache_6275b5f8-662d-4d06-bb63-ff3ab18a0e21
        Healthy               = 10.00
        UnderReconfiguration  = 0.00
        NotPrimary            = 0.00
        InadequateSecondaries = 0.00
        Throttled             = 0.00

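The wait-and-recheck step can also be scripted. The sketch below is only a sketch. It assumes Use-CacheCluster has to be called in the session before Get-CacheClusterHealth, and that piping the result through Out-String yields the same text layout as the output above. Test-CacheHealthText and Wait-CacheClusterHealthy are hypothetical helper names, and the expected value of 10.00 is taken from this farm's output, so it may differ elsewhere.

```powershell
# Hypothetical helper: returns $true when every Healthy line equals the expected
# fraction and every other counter is 0. Works on plain text lines only.
function Test-CacheHealthText {
    param(
        [string[]]$Lines,
        [double]$ExpectedHealthy = 10.0   # assumption: value seen in this post's output
    )

    $healthyLines = 0
    foreach ($line in $Lines) {
        if ($line -match '^\s*(Healthy|UnderReconfiguration|NotPrimary|InadequateSecondaries|Throttled)\s*=\s*([\d.]+)\s*$') {
            $name  = $Matches[1]
            $value = [double]$Matches[2]
            if ($name -eq 'Healthy') {
                if ($value -ne $ExpectedHealthy) { return $false }
                $healthyLines++
            }
            elseif ($value -ne 0) {
                return $false
            }
        }
    }
    # No Healthy lines at all (for example, a connection error) counts as not healthy.
    return ($healthyLines -gt 0)
}

# Hypothetical helper: polls the cluster health until it looks healthy or gives up.
# Assumes it runs in the SharePoint Management Shell on the cache host.
function Wait-CacheClusterHealthy {
    param(
        [int]$MaxAttempts = 10,
        [int]$DelaySeconds = 60
    )

    Use-CacheCluster
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $text = (Get-CacheClusterHealth | Out-String) -split "\r?\n"
        if (Test-CacheHealthText -Lines $text) { return $true }
        Start-Sleep -Seconds $DelaySeconds
    }
    return $false
}
```

After running Remove-SPDistributedCacheServiceInstance and Add-SPDistributedCacheServiceInstance, calling Wait-CacheClusterHealthy returns $true once the output matches the healthy pattern, or $false if it never does within the allotted attempts.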

That's it. The service is now running stably. My guess is that the root cause was insufficient memory in the first place, as I saw a warning to that effect in the event logs prior to the issue, but that's still to be confirmed.
