SoMee Site Performance RCA - 2/3/2023

zackrspv

RCA (Root Cause Analysis) Report, 2/3/2023

Primary Issue: The site is very slow

Primary Indications: Service Apdex went from .6 to 5 (a massive change in site performance). The service would time out and was not responsive for most users.

Number of Users Affected: 100%

Classification: N1 Primary Issue (Major Failure)

Cause: Service outage

--------------------

On 2/3/2023 at 7:31 am PST, I was notified of the slowness of the service. I checked our service dashboard and didn't see any significant service outages - service endpoints and hosting services were all online and working. All looked fine: system memory, server memory nodes, SQL connections, etc. I then proceeded to go to the somee.social website, and it took about 39 seconds to load. 

I then remoted into the primary node and did some testing. A standard PHP script loaded almost instantly. However, loading the bootstrap file directly took 39 seconds. I then began stepping through the code line by line to find out exactly where the issue was.

At 08:09 am PST, after tracing the issue to the initial Hive SQL connection line, I wrote a new PHP test script and ran just the connection code. Indeed, it took 39 seconds to respond.
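The isolation approach described above (time each step, then rerun only the slowest one) can be sketched roughly as follows. This is a minimal Python illustration, not the PHP script that was actually used; the helper names `time_step` and `slowest_step` are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


def time_step(step: Callable[[], Any]) -> Tuple[float, Any]:
    """Run one step and return (elapsed_seconds, result_or_exception)."""
    start = time.perf_counter()
    try:
        outcome = step()
    except Exception as exc:  # a failing step is still worth timing
        outcome = exc
    return time.perf_counter() - start, outcome


def slowest_step(steps: Dict[str, Callable[[], Any]]) -> Tuple[str, float]:
    """Time every named step in order; return the slowest name and its duration."""
    timings = {name: time_step(fn)[0] for name, fn in steps.items()}
    name = max(timings, key=timings.get)
    return name, timings[name]
```

Passing each bootstrap stage as a named callable would, under these assumptions, point at the connection step as the one consuming the 39 seconds.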

ROOT CAUSE IDENTIFIED: Hive SQL Service is not operational

Upon further investigation, all Hive SQL endpoints, including the primary status page, were offline. I joined the support channel for Hive SQL, and several people had noted the outage starting right around 3:23 am PST.

REMEDIATION STEPS INITIATED. 

To reduce the impact of this issue on SoMee, given that we have multiple redundant backups for Hive data, I added a timeout check to the connection request for Hive SQL. By lowering the connection timeout to a small value, we limit the connection attempt's impact on the service.
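As an illustration of the idea only (the real change was made in the PHP connection code, which is not shown here), a bounded connection attempt might look like this in Python. The function name `can_connect` and the default of 2 seconds are assumptions for the sketch.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection opens within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts and refused connections
        return False
```

With a check like this, a caller waits at most `timeout_s` seconds before falling back to the redundant Hive data sources, rather than blocking for the full 39 seconds.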

CURRENT REMEDIATION PROGRESS: Completed, provisionally.

The remediation is currently static, meaning the timeout is hard-coded to about 1-3 seconds, depending on load. This isn't ideal. So I'm going to write another tool that checks the Hive SQL connection and then publishes the appropriate timeout to Redis and our database so that the primary scripts can use it. This way, once Hive SQL comes back online, we can set the timeout back to where it should be, and if it goes offline again, we can keep it at a low value.
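A rough sketch of how such a tool could work, in Python for illustration. The dict-like `store` stands in for Redis or the database, and the key name, the 30-second "normal" value, and the function names are all assumptions; the report does not specify them.

```python
from typing import Callable, MutableMapping

NORMAL_TIMEOUT_S = 30    # assumed healthy value; not stated in the report
DEGRADED_TIMEOUT_S = 2   # within the 1-3 second range currently in use
TIMEOUT_KEY = "hivesql:connect_timeout_s"  # hypothetical key name


def refresh_timeout(store: MutableMapping[str, int],
                    probe: Callable[[], bool]) -> int:
    """Probe the endpoint once and publish the timeout scripts should use."""
    timeout = NORMAL_TIMEOUT_S if probe() else DEGRADED_TIMEOUT_S
    store[TIMEOUT_KEY] = timeout
    return timeout


def current_timeout(store: MutableMapping[str, int]) -> int:
    """Read the published timeout; default to the short value if unset."""
    return store.get(TIMEOUT_KEY, DEGRADED_TIMEOUT_S)
```

Defaulting to the short value when nothing has been published keeps the site responsive even if the checking tool itself has not run yet.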

REMEDIATION AUGMENTATION: 

Part of this was a failure to monitor the Hive SQL data endpoint and to see its status alongside the SoMee endpoints. This was because we don't control the Hive SQL data endpoint; technically, we don't entirely rely on it, as we have multiple redundant backups in place to pull data if it's down. Thus, there was no reason to monitor it. We will probably not add it to our dashboard, given that the only thing we need to know is whether it is online, so that we can change our timeout accordingly.

SERVICE IMPACT: 4 hours and 46 minutes
SERVICE REMEDIATED: 38 minutes
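
These two durations appear to follow from the timestamps given earlier, assuming the impact window runs from the first outage reports (3:23 am) to the fix (8:09 am) and the remediation window from notification (7:31 am) to the fix. A quick check of that arithmetic, with a hypothetical helper name:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times such as '3:23 am'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Assumed window boundaries, taken from the timeline in this report.
impact = divmod(minutes_between("3:23 am", "8:09 am"), 60)  # (hours, minutes)
remediation = minutes_between("7:31 am", "8:09 am")          # minutes
```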

Number of Users Currently Impacted: 0%

RCA was Completed and reviewed by Phillip on 2/3/2023 at 8:47 am