As an SRE, what do I do about Alerts caused almost entirely by poor customer communication or misuse of a product?

th3raid0r · 1 year ago


@cbarrick@lemmy.world · edited 1 year ago

I think the issue is that customers can escalate directly to SRE.

SRE is supposed to work on the health and reliability of the service. It does sound like there is a reliability issue when loading large datasets. But this should be project work, not incident response work.

Is your service violating your internal SLOs when this happens?

Where I work, customers escalate to a support team, which tries to work with them. It’s only after the support team decides it’s a product issue that it makes it to SRE. Even then, 90% of the time, the support staff will file a ticket to be handled during business hours rather than page SRE.

If this auto scaling delay is expected, I’d try to do two things:

1. Produce better error messages, so that the customer can know what’s happening and hopefully not need to escalate.
2. Work with the rest of the company (typically the Product or Support teams if you have them) to make sure customers understand these limitations.

Edit: Oh, also, don’t let customers page you for known limitations. Design a better process around this.

And if it’s that bad, SRE should invest in project work to make the autoscaling less painful.

Edit: Your service should return some kind of client error (i.e. exempt from SLO) in this situation. In gRPC, that would probably be RESOURCE_EXHAUSTED, and the error message should be something like “Yo your DB is out of disk, chill out while we fetch more disk. To avoid these errors in the future, pre-scale before large writes.”
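To make that last edit concrete, here is a rough Python sketch of the idea, not anyone's real implementation. Everything in it is made up for illustration: the `Status` enum is a local stand-in for gRPC status codes (it is not the `grpc` library's enum), and `handle_write`, `availability`, and the `SLO_EXEMPT` policy are hypothetical names. It shows a write that fails with RESOURCE_EXHAUSTED and an actionable message, plus an availability calculation that leaves exempt errors out of the SLO.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Local stand-in for a few gRPC status codes (not the grpc library's enum)."""
    OK = "OK"
    RESOURCE_EXHAUSTED = "RESOURCE_EXHAUSTED"
    INTERNAL = "INTERNAL"


# Assumption: the team's SLO policy treats these codes as client errors,
# so they do not count against availability.
SLO_EXEMPT = {Status.RESOURCE_EXHAUSTED}


@dataclass
class Response:
    status: Status
    message: str = ""


def handle_write(bytes_requested: int, bytes_free: int) -> Response:
    """Hypothetical write handler: reject writes that exceed free disk."""
    if bytes_requested > bytes_free:
        return Response(
            Status.RESOURCE_EXHAUSTED,
            "Your DB is out of disk; more disk is being provisioned. "
            "To avoid these errors in the future, pre-scale before large writes.",
        )
    return Response(Status.OK)


def availability(responses: list[Response]) -> float:
    """Fraction of OK responses among those that count toward the SLO."""
    counted = [r for r in responses if r.status not in SLO_EXEMPT]
    if not counted:
        return 1.0  # nothing counted, so nothing violated the SLO
    good = sum(1 for r in counted if r.status is Status.OK)
    return good / len(counted)


if __name__ == "__main__":
    responses = [
        handle_write(10, 100),      # fits: OK
        handle_write(500, 100),     # too big: RESOURCE_EXHAUSTED, exempt
        Response(Status.INTERNAL),  # a real server fault, counts against SLO
    ]
    print(availability(responses))
```

In a real Python gRPC servicer, the equivalent of the rejection branch would probably be a call like `context.abort(grpc.StatusCode.RESOURCE_EXHAUSTED, details)`. Which codes are exempt from the SLO is a policy decision for your team, not something gRPC decides for you.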