Monitoring

Monitoring the overall number of runs

  1. The Overview page shows the number of runs from all runnables for which you have the “Can retrieve this runnable” permission. Successes are runs whose error_type is null. Failures are runs whose error_type is not null.
  2. Click “Overview” in the left-side menu.
_images/monitor_overall_2.png
  1. Note that the numbers of successes and failures are plotted on different axes.
  2. The bottom left shows the percentage of successful runs over the last 24 hours.
  3. The bottom right shows the percentage of runs whose response time is less than or equal to the slow run threshold of the run’s runnable.
  4. Click the small refresh button at the top right to refresh the graph. Otherwise, the page reloads every 5 minutes.
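The success and failure counts described above can be sketched in a few lines of Python. This is only an illustration of the rule "error_type is null means success"; the run records here are plain dictionaries made up for the example, not the platform's actual data format or API.

```python
from typing import Mapping, Optional, Sequence, Tuple


def count_runs(
    runs: Sequence[Mapping[str, Optional[str]]]
) -> Tuple[int, int, Optional[float]]:
    """Return (successes, failures, percent of successful runs).

    Each run is a hypothetical record with an "error_type" key.
    A run is a success when error_type is None (null), a failure otherwise.
    """
    successes = sum(1 for run in runs if run["error_type"] is None)
    failures = len(runs) - successes
    # With no runs there is no meaningful percentage to report.
    percent = 100.0 * successes / len(runs) if runs else None
    return successes, failures, percent


example_runs = [
    {"error_type": None},
    {"error_type": "timeout"},
    {"error_type": None},
    {"error_type": "hub_not_ready"},
]
print(count_runs(example_runs))
```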

Monitoring the recent error runs

  1. The bottom left of the Overview page shows a list of recent errors.
_images/monitor_error_run_1.png
  1. Click a link to see the related entity.
  2. Refer to “Troubleshooting an error run” for hints on how to fix errors.

Monitoring a runnable’s number of runs

  1. A runnable’s detail page shows the number of runs of that runnable over the last 24 hours, in the same way that the Overview page shows the number of runs of all runnables for which the user has the retrieve permission.

Monitoring a runnable’s response time over time

  1. A runnable’s detail page shows response time trends
_images/monitor_resp_time_1.png
  1. It shows the minimum, 25th percentile, median, 75th percentile, and maximum of the runnable’s response times for each hour over the last 24 hours. If an hour has no runs, no dots are shown for that hour.
  2. Note that the y-axis uses a log scale. Hovering the mouse cursor over a dot shows the exact value of that response time statistic.
  3. The bottom shows the minimum and maximum response times over the last 24 hours.
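To make the five per-hour statistics concrete, here is a minimal Python sketch that groups response times by hour and computes them. The input format (pairs of start time and response time in milliseconds) and the function name are assumptions for illustration, and the percentile method used by the platform is not documented here, so the "inclusive" method below is just one reasonable choice.

```python
from collections import defaultdict
from datetime import datetime
from statistics import quantiles
from typing import Dict, Iterable, Tuple


def hourly_summary(
    runs: Iterable[Tuple[datetime, float]]
) -> Dict[datetime, Dict[str, float]]:
    """Summarize response times (ms) per hour: min, p25, median, p75, max.

    Hours without any run do not appear in the result, which mirrors
    the chart showing no dots for such an hour.
    """
    by_hour = defaultdict(list)
    for started_at, response_ms in runs:
        # Truncate the timestamp to the start of its hour.
        hour = started_at.replace(minute=0, second=0, microsecond=0)
        by_hour[hour].append(response_ms)

    summary = {}
    for hour, times in sorted(by_hour.items()):
        if len(times) >= 2:
            p25, median, p75 = quantiles(times, n=4, method="inclusive")
        else:
            # A single run: every statistic equals that one value.
            p25 = median = p75 = times[0]
        summary[hour] = {
            "min": min(times),
            "p25": p25,
            "median": median,
            "p75": p75,
            "max": max(times),
        }
    return summary
```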

Configuring to receive an error alert

  1. You can configure a runnable to send an alert email to you and the run request when a run fails. On a runnable’s detail page, click “Edit.” Set Error alert to True and click “Edit” to submit.
  2. When an error happens, you will receive an email notification like the one below:
_images/receive_alert_2.png

Adjusting the slow run threshold of a runnable

  1. The slow run threshold is purely for reporting purposes. It decides whether a run is considered slow when the fast run percent is rendered at the bottom right of the activity overview section.
  2. For example, suppose there is only one run in the last 24 hours and its response time is 1 second. If the slow run threshold is 100 ms, the fast run percent will be 0%. If the slow run threshold is 2 seconds, the fast run percent will be 100%. The fast run percent shows, at an aggregate level, whether the platform and microservices are processing requests as fast as users expect.
  3. To change the slow run threshold, click “Edit” on a runnable’s detail page and change Slow run threshold in milliseconds.
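The worked example above can be reproduced with a short Python sketch. The function name and its arguments are hypothetical; the only rule taken from this page is that a run counts as fast when its response time is less than or equal to the slow run threshold.

```python
from typing import Optional, Sequence


def fast_run_percent(
    response_times_ms: Sequence[float], slow_run_threshold_ms: float
) -> Optional[float]:
    """Percent of runs whose response time is <= the slow run threshold."""
    if not response_times_ms:
        return None  # no runs, so there is nothing to report
    fast = sum(1 for t in response_times_ms if t <= slow_run_threshold_ms)
    return 100.0 * fast / len(response_times_ms)


# One run of 1 second (1000 ms) in the last 24 hours:
print(fast_run_percent([1000], 100))   # 0.0   (threshold of 100 ms)
print(fast_run_percent([1000], 2000))  # 100.0 (threshold of 2 seconds)
```

Because the threshold only changes how runs are counted in this report, adjusting it has no effect on the runs themselves.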

Troubleshooting an error run

  1. A run is considered an error if its error_type is not null. If error_type starts with knowru_runtime_error, the error is likely caused by the source code that users uploaded. Writers of the microservice should read the error_message carefully to fix the issue.
  2. If error_type is timeout, the run request took longer than the timeout limit of its runnable. If the writer of the microservice believes that the timeout limit is too short, they should increase the limit. If many timeout errors occurred only within a short period of time, the microservice may depend on external resources, such as other databases or third-party services, that were slow to process its requests. In this case, the writer of the microservice should consult the owners of those resources to resolve the issue.
  3. If error_type is hub_not_ready, some parts of the backend servers or services are corrupted. Knowru’s support team handles such issues, mostly within a couple of hours.
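The triage rules above can be summarized as a small lookup, sketched here in Python. The function and the wording of its hints are illustrative only; the three error_type values are the ones named on this page, and any other value is simply reported as not covered here.

```python
from typing import Optional


def triage_hint(error_type: Optional[str]) -> str:
    """Map a run's error_type to the troubleshooting hint from this page."""
    if error_type is None:
        return "Not an error: error_type is null."
    if error_type.startswith("knowru_runtime_error"):
        return "Likely caused by the uploaded source code; read error_message."
    if error_type == "timeout":
        return (
            "Run exceeded the runnable's timeout limit; raise the limit or "
            "check external resources for slowness."
        )
    if error_type == "hub_not_ready":
        return "Backend issue; handled by the support team."
    return "Error type not covered by this guide."


for value in (None, "knowru_runtime_error_example", "timeout", "hub_not_ready"):
    print(value, "->", triage_hint(value))
```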