Making errors easy to fix

As we start developing services to be hosted in a cloud-like system, the way we manage those services changes. What happens when an error occurs on a remote system? Can you log on to the system to collect all the information you need to correct the error in your code? How will you even notice that an error has occurred?

On a cloud-based system you often won't be able to log on to the remote server. Accessing the error logs might not be possible. You might not even notice an error that occurs on a remote server because no monitoring is set up.

Previously, when an error occurred in the hosting environment, we had to work out which server to log on to and then dig through the different servers until we found the error in a log somewhere. This process could take minutes or hours, adding to the time it took to fix the problem before we could even start debugging.

In the RemoteX Applications REST service I recently added an endpoint that lets you access, directly through the API, all errors that have occurred in the service. The goal is to have an empty list, and having this endpoint makes it very easy to get information about what went wrong on the remote server.

In fact, the endpoint was added in an effort to identify a service bug that occurred only occasionally in our staging environment. I needed a log. The inspiration came from Google AppEngine, where the Application Dashboard shows which paths cause errors in the hosted application and how often.

Now all we have to do is log in to the API and ask it what went wrong. It will tell us whether anything has gone wrong, on which path, and which HTTP verb was used. It will even give us a stack trace. Some might say that you don't want to display stack traces, since they might give hints about how your application is designed. But in this case the benefits far outweigh the drawbacks. Remember, the goal in the end is to have an empty error list.
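To make the idea concrete, here is a minimal sketch in Python of what such an error list could record. It is not the actual RemoteX implementation: the names ErrorRecord, ErrorLog, capture and as_list are hypothetical, and the store is kept in memory purely for illustration.

```python
import traceback
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class ErrorRecord:
    """One entry in the error list (hypothetical shape)."""
    occurred_at: str   # ISO 8601 timestamp in UTC
    method: str        # HTTP verb of the failing request
    path: str          # request path that failed
    message: str       # text of the exception
    stack_trace: str   # formatted traceback


class ErrorLog:
    """Hypothetical in-memory store that an errors endpoint could expose."""

    def __init__(self) -> None:
        self._records: List[ErrorRecord] = []

    def capture(self, method: str, path: str, handler: Callable[[], object]) -> object:
        """Run a request handler; record any exception, then re-raise it."""
        try:
            return handler()
        except Exception as exc:
            self._records.append(ErrorRecord(
                occurred_at=datetime.now(timezone.utc).isoformat(),
                method=method,
                path=path,
                message=str(exc),
                stack_trace=traceback.format_exc(),
            ))
            raise

    def as_list(self) -> List[dict]:
        """What the endpoint would return; an empty list means all is well."""
        return [asdict(record) for record in self._records]
```

Wrapping each request handler in capture means a failing request still fails as before, but it leaves behind the verb, the path and the stack trace for whoever asks the API later.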

The endpoint is implemented in such a way that we can ask the API, “Has anything gone wrong since we last asked?” This allows us to build a monitoring service that asks our installations every once in a while, “Are you OK?”
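The "since we last asked" question could be sketched roughly like this in Python. Everything here is an assumption made for illustration: that each error carries an increasing integer id, that the service can filter on it, and the names ErrorMonitor, check and make_fake_service. The real API may well use a timestamp or some other cursor.

```python
from typing import Callable, List, Optional


class ErrorMonitor:
    """Polls one installation and remembers the newest error it has seen."""

    def __init__(self, fetch_errors: Callable[[Optional[int]], List[dict]]) -> None:
        # fetch_errors(since) stands in for the HTTP call to the errors endpoint.
        self._fetch = fetch_errors
        self._last_seen: Optional[int] = None

    def check(self) -> List[dict]:
        """Ask "are you OK?"; return only errors not reported before."""
        new_errors = self._fetch(self._last_seen)
        if new_errors:
            self._last_seen = max(error["id"] for error in new_errors)
        return new_errors


def make_fake_service(errors: List[dict]) -> Callable[[Optional[int]], List[dict]]:
    """Stand-in for a real installation, filtering on the assumed id cursor."""
    def fetch(since: Optional[int]) -> List[dict]:
        return [e for e in errors if since is None or e["id"] > since]
    return fetch


if __name__ == "__main__":
    service_errors = [{"id": 1, "method": "GET", "path": "/cases"}]
    monitor = ErrorMonitor(make_fake_service(service_errors))
    print(monitor.check())  # first poll reports the existing error
    print(monitor.check())  # nothing new since the last poll
```

A monitoring service would run check against each installation on a schedule and raise an alert whenever the returned list is not empty.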

Other solutions