Plumsail Not Working, Again

Hello!

For the third time in less than two weeks, Plumsail is not working. Document generation is timing out.

Plumsail outages: Monday, Sept 11th; Monday, Sept 18th; and now today, Thursday, Sept 21st. These are the dates on which I know Plumsail was not functioning.

Plumsail Support: Please shed some light on today's issue, on what you are doing that's causing the increase in problems, and on what actions you are taking to improve the reliability of your product.

Thank you.

EDIT: If anyone in this community has insights on this too, please share! In the past, Plumsail has not been responsive or transparent, and we are looking for answers. Any help is appreciated! Thank you in advance!

Hello @bdito,

I apologize for the incident, and I've forwarded your request to our developers. We will review the situation, and I will get in touch with more details. If you are experiencing any ongoing issues with the service now, please also contact us at support@plumsail.com.

Best regards,
Petr
Plumsail team

Hello @bdito,

I'm a developer at Plumsail. I want to apologize for the issues, and I deeply regret that we were unable to meet your expectations. I want you to know that we are planning to spend the next sprint researching the root causes of all issues that occurred over the last month and to dedicate a developer to fixing them as soon as possible.

I will return in a week with a list of implemented fixes.

Thank you for your patience.
Roman

There was another outage today that lasted nearly an hour. We'll be awaiting the communication mentioned by @Roman_Rylov. These issues have been going on for quite some time, so we would appreciate knowing what steps are being taken to resolve the instability of this service. These disruptions cause significant issues for us, as Plumsail is a crucial part of our process. Please keep us updated.

Hello @jquerido,

I replied in the support ticket SP33161 and attached the RCA.

Best regards,
Petr
Plumsail team

Hello @bdito,

Firstly, I want to thank you again for pointing our attention to this issue.
Below, I want to shed some light on our investigation of the problem.

During the investigation, we analyzed all errors that happened during the last month. Additionally, we tested each element of the infrastructure, from DNS to each server's configuration and network settings.

We found the following issues:

  1. Misconfiguration in the load balancer - I suppose this is the main issue that led to the timeout errors. We had a very low limit on incoming connections. Thus, in peak hours, connections waited in a queue or were aborted with a timeout error. We fine-tuned the settings of our load balancer: increased the connection limit, timeouts, and internal buffers, set priorities for authenticated users, improved DDoS protection, and ran a load test to check the capacity of current resources.

  2. Unnecessary dependencies on the cache server - we noticed that in a few cases, the errors occurred because our cache server was unavailable. We inspected our code, and it now handles this kind of error correctly.

  3. Invalid behavior in critical situations - in some rare cases, for example when the server is out of memory, our app didn't handle the error correctly. We have now fixed this behavior: the app terminates immediately, and the load balancer should pass the request to another server.
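To illustrate the second point for other readers, here is a minimal Python sketch of one common way to stop a cache outage from failing a request: treat an unreachable cache as a miss and go to the source instead. This is only an assumption about the general technique, not Plumsail's actual code; `FlakyCache`, `CacheUnavailable`, and `get_with_fallback` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache server is unreachable."""


class FlakyCache:
    """Stand-in cache client used only for this illustration."""

    def __init__(self, available: bool = True) -> None:
        self.available = available
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise CacheUnavailable("cache server is down")
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        if not self.available:
            raise CacheUnavailable("cache server is down")
        self._store[key] = value


def get_with_fallback(cache: FlakyCache, key: str, load: Callable[[str], str]) -> str:
    """Return the cached value, falling back to the source if the cache is down."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Treat an unreachable cache as a miss instead of failing the request.
        return load(key)

    # Cache is reachable but has no entry: load it and store it (best effort).
    value = load(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # The cache went away mid-request; the request still succeeds.
    return value
```

With this shape, the cache only affects speed: when it is down, every request is served from the source, which is slower but does not produce an error.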

Additionally, we have started our next project, aimed at improving the observability of the system. We are planning to add tracking of performance-related metrics. Our goal is to track percentiles of user-facing metrics such as response/processing time, error counts, and so on. In the future, we plan to add alerting on the most critical metrics and create a single dashboard for the on-duty engineer.
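For anyone unfamiliar with percentile tracking, the idea can be sketched in a few lines of Python. This is a rough illustration using the nearest-rank method, not a description of Plumsail's monitoring; the function names and the sample response times are made up for the example.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Report the percentiles commonly used for latency dashboards."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


# Hypothetical response times in milliseconds; one slow request dominates the tail.
response_times_ms = [120, 80, 95, 2000, 110]
print(summarize(response_times_ms))
```

The point of percentiles over an average is visible even in this tiny sample: the median stays near typical requests, while the high percentiles expose the one request that took two seconds.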

I hope the implemented changes help us improve stability and reduce request processing time. We are working to improve our service and are open to any criticism.

Thanks
Roman