System health review

Monitoring the state of the system lets you adjust it for optimal performance.

Review system health frequently to keep the system performing well over the long term. This task involves identifying and resolving potential issues. Perform system health checks at least once per month or once per quarter, depending on the amount of data flowing into the system and how often the system is used.

Use the following approach to set up regular system health checks:

  1. Determine which system health checks are appropriate. The following system health checks are appropriate for most PME systems. Customized PME systems might need additional checks or might not need some of them, but all of the system health checks listed should be considered:

    List of checks Application Server Database Server
    Anti-malware
    Services running -
    Message queues -
    Device communication -
    Processor usage
    Memory usage
    Disk usage
    File system growth
    File system fragmentation
    Log files
    Windows scheduled task history and status
    Database growth -
    Database fragmentation -
    Database integrity -
    Database backup -
    Software licensing
    Software updates
  2. Identify and document how the above information can be collected for the system health review. See Tools for troubleshooting. Note the following tools:

    • PME Diagnostic Tool - Install and deploy this tool to obtain a snapshot of the current state of the system. See PME Diagnostics tool for more information.

    • PME Diagnostics Viewer. See Diagnostics Viewer for more information.

  3. Create a template system health report. This report should contain at least the following information:

    • Report date

    • Contact information

    • A list of each system health check with the following information for each line item:

      • Status – Passed, Caution, or Failed

      • Description of the factors contributing to the given status

      • Recommended action

  4. Determine a storage location for system health reports. Store reports in a consistent location that is accessible to administrators and support users.

  5. Create an initial system health report.

  6. In the location chosen in step 4, save the template report and the initial system health report.
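The report template described in step 3 can be sketched as a small data structure. This is a minimal illustration, not a PME feature: the class names, field names, and the overall_status helper are hypothetical, and only the report fields and the three status values come from the list above.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class Status(Enum):
    PASSED = "Passed"
    CAUTION = "Caution"
    FAILED = "Failed"


@dataclass
class CheckResult:
    """One line item of the report (hypothetical structure)."""
    check: str                      # for example "Disk usage"
    status: Status
    contributing_factor: str = ""   # why the check got this status
    recommended_action: str = ""


@dataclass
class HealthReport:
    report_date: date
    contact: str
    results: list[CheckResult] = field(default_factory=list)

    def overall_status(self) -> Status:
        """Return the worst status across all line items (an assumed convention)."""
        order = [Status.PASSED, Status.CAUTION, Status.FAILED]
        return max((r.status for r in self.results), key=order.index,
                   default=Status.PASSED)
```

Saving an empty HealthReport as the template and a filled-in copy as the initial report keeps every later report in the same shape, which makes month-to-month comparison easier.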

The following table lists each system health check, why it is required, and what you need to check:

List of checks Why What to check
Anti-Malware If the PME server has an internet connection, it is at risk for viruses and malware. The anti-malware software should be monitoring for threats in real-time and running full scans once per month.
  • Check for warnings and threats.

  • If threats were found, were the infected files quarantined?

  • Check that the latest anti-malware definitions are installed.

Services running These services are the core of PME and must always be running.
  • Use Windows Services, Windows Event Logs and Diagnostics Viewer – Service Diagnostics to ensure all ION, SQL Server and IIS services are running.

  • If any services are stopped, investigate the logs for the root cause.

See Diagnostics Viewer for more information.

Message queues
  • Log Inserter writes log data into a message queue instead of writing it to SQL Server directly. Another process (the Log Subsystem Router Service) reads the messages from the queue and writes the data to SQL Server.

  • The message queues should be at or near zero the majority of the time. One indicator of poor system health is when PME is operating in steady-state and the message queue size stays above zero for any queue for an extended period of time.

Use Diagnostics Viewer – Log Pipeline Service to check the status of the PME MSMQs.

See Diagnostics Viewer for more information.
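If queue sizes are recorded at regular intervals, the "stays above zero for an extended period" condition can be tested with a short script. The sketch below assumes the samples have already been collected by some other means (it does not read MSMQ itself), and the max_run threshold is a hypothetical value that should be chosen to match the sampling interval.

```python
def longest_nonzero_run(depths):
    """Return the length of the longest run of consecutive samples above zero."""
    longest = current = 0
    for depth in depths:
        current = current + 1 if depth > 0 else 0
        longest = max(longest, current)
    return longest


def queue_needs_attention(depths, max_run=6):
    """Flag a queue whose size stayed above zero for more than max_run samples.

    max_run is an assumed threshold, not a PME setting. With one sample
    every 10 minutes, max_run=6 corresponds to roughly one hour.
    """
    return longest_nonzero_run(depths) > max_run
```

Short bursts above zero are expected while data is being inserted; only a sustained non-zero run is treated as a warning sign here.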

Device communication Networking issues can lead to communication loss with devices. For devices without onboard logging, communication loss means data loss. Monitor device communications often to ensure the expected devices are communicating.
  • Use Diagnostics Viewer – Communication Diagnostics to check for communication issues, such as Timeouts or Log Inserter issues.

  • Repeated and frequent timeouts are common for long daisy chains – this is a sign that the loop performance needs to be assessed.

  • The LogInserter service diagnostic reveals which devices cannot log data at the device level, and which DataRecorders have issues (see the LogHandle column).

See Diagnostics Viewer for more information.

Processor usage CPU usage over time should be less than 80%.
  • Is the CPU trend showing at least 20% free?

  • Are processes using and releasing CPU resources?

In Windows Resource Monitor, track the following object counters:

  • Processor: % Privileged Time

  • Processor: % User Time

  • System: Processor Queue Length

See Monitor CPU Usage for more information.

Memory usage Prevent low memory problems by applying the appropriate server resources (RAM, CPU) as the system grows. Take into consideration future extensions and upgrades of the system, which could be performed in place if the server is prepared ahead of time. Monitor memory usage to confirm that it is within range.

Use Windows Resource Monitor to track the following object counters over time to determine normal usage and identify issues:

  • Memory: Available Bytes

  • Memory: Pages/sec

See Monitor Memory Usage for more information.

Disk usage There is a risk of disk I/O issues particularly in PME systems with large ION_Data databases.

Use Windows Resource Monitor to track the following object counters for each disk:

Primary

  • PhysicalDisk: Avg. Disk sec/Write

  • PhysicalDisk: Avg. Disk sec/Read

Secondary

  • PhysicalDisk: Avg. Disk Queue Length

  • PhysicalDisk: Disk Bytes/sec

  • PhysicalDisk: Disk Transfers/sec

Track these counters over time to determine normal usage and identify issues. See Monitoring Disk Usage for more information.

File system growth

If a disk completely fills up, data loss occurs. You must ensure that all disks have adequate disk space for all maintenance tasks (defragmentation, backups, database reindexing). There should be at least 20 to 30% free disk space at all times for optimum performance. Possible causes of file system growth include database file growth, log file growth, data archives, and space used by third-party software.

The best preventive measure is to keep track of disk space usage and assess the growth over time. If the used space for a disk has consistently increased for several months and the percent free disk space is below 30%, then action is required. Investigate the root cause of the growth, and develop a plan either to prevent more disk space from being used or to add more disk space.

  • Ensure the file system defragmentation job is running regularly.

  • Check for unsustainable file system growth. If found, take preventive action to reduce the risk of a full disk, that is, add more disk space or adjust system configuration to reduce disk space usage.
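The free-space guidance above can be turned into a small check. The following sketch uses Python's standard shutil.disk_usage call; mapping the 20% and 30% figures to "Failed" and "Caution", and treating three months as "several months", are assumptions made for illustration.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total space."""
    return 100.0 * free_bytes / total_bytes


def disk_status(pct_free, caution_below=30.0, fail_below=20.0):
    """Map percent free to a report status (the mapping is an assumption)."""
    if pct_free < fail_below:
        return "Failed"
    if pct_free < caution_below:
        return "Caution"
    return "Passed"


def check_drive(path):
    """Return the status for the drive that holds the given path."""
    usage = shutil.disk_usage(path)
    return disk_status(percent_free(usage.total, usage.free))


def action_required(monthly_used_bytes, pct_free, months=3, free_limit=30.0):
    """True if used space rose in each of the last `months` months
    and free space is below free_limit percent."""
    recent = monthly_used_bytes[-(months + 1):]
    growing = len(recent) == months + 1 and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    return growing and pct_free < free_limit
```

For example, check_drive("D:\\") would report the status of the D: drive, and action_required could be fed the used-space figures recorded in previous health reports.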

File system fragmentation

If file system (or disk) fragmentation is greater than 40% on the database drive, SQL Server experiences a thrashing/page faulting condition when manipulating a large volume of rows.

For PME, file system fragmentation usually results from database auto-growth. It is most common in small and medium databases because of the SQL Server Express database size limit.

  • Pre-allocate hard drive space for the ION_Data database. See Diagnostics Viewer for more information.

  • Check file system fragmentation on all drives used by PME at least once per month

  • Schedule time and perform file system defragmentation if necessary. For standalone systems, ensure SQL Server services are stopped before defragmentation.

Log files

There are many logs that contain critical error information, non-critical error information, warnings, and informational messages. Logs can be a good source of information on how well a system is performing and of data for troubleshooting specific issues.

IIS logs should be trimmed regularly.

Check:

  • ION_SystemLog

  • IIS Logs

  • Windows Event Logs

  • SQL Server Logs

  • Check for unexpected errors in the log files

  • Check for errors related to PME components

  • Check log size to ensure total size is not excessively large (> 1 GB)

  • Archive historical logs if the folder has too many log files.

Sometimes certain irrelevant errors or warnings can be ignored. Any anomalous messages in these logs should be recorded and investigated.
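The log size check can be automated with a few lines of script. This is a minimal sketch that sums the size of every file under a folder and compares it with the 1 GB guideline; the folder path is supplied by the caller, and treating 1 GB as 1024³ bytes is an assumption.

```python
import os

ONE_GB = 1024 ** 3  # assumed interpretation of the 1 GB guideline


def folder_size_bytes(folder):
    """Return the total size of all files under folder, including subfolders."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total


def logs_too_large(folder, limit_bytes=ONE_GB):
    """True if the files under folder exceed limit_bytes in total."""
    return folder_size_bytes(folder) > limit_bytes
```

Running logs_too_large against each log folder used by the system gives a quick pass or fail result for the report.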

Windows scheduled task history and status

PME’s default database maintenance tasks are configured as Windows Scheduled tasks.

Ensure scheduled tasks launch at the scheduled time and complete successfully.

  • Check Windows Scheduled Task Logs, if any.

  • Check PME System Logs for errors related to these tasks.

Database growth

Unexpected database growth can lead to poor performance. Significant database growth is usually a trigger for further investigation.

Unexpected database growth suggests possible excessive data logging due to device misconfiguration. Examples include unexpected high-frequency logging (1-second logging intervals) or waveform logging when waveform data is not necessary.

Check database growth since the last system health check. Does the growth align with expectations?
Database fragmentation Database fragmentation, if not addressed, is a common cause of poor system performance. Check for index fragmentation over 10%.
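One way to find indexes above the 10% guideline is to query SQL Server's sys.dm_db_index_physical_stats function. The sketch below holds such a query as a string and filters its result rows in Python. How the query is run (for example, through a database driver or SQL Server Management Studio) is left open, and the row layout assumed by the helper matches the column order of the query.

```python
# T-SQL to run in the context of the database being checked.
FRAGMENTATION_QUERY = """
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.index_id > 0  -- skip heaps, which have no index to rebuild
ORDER BY ips.avg_fragmentation_in_percent DESC;
"""


def fragmented_indexes(rows, limit_percent=10.0):
    """Return the rows whose fragmentation (third column) exceeds limit_percent.

    Each row is assumed to be (table_name, index_name, fragmentation_percent).
    """
    return [row for row in rows if row[2] > limit_percent]
```

Very small indexes often report high fragmentation that has little practical effect, so it can be worth reviewing the flagged list before scheduling maintenance.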
Database integrity Database corruption is a rare event that is usually caused by inoperative hardware on the server. A database integrity check reviews the allocation and structural integrity of all objects in each database to ensure it is not corrupt.
  • Run DBCC CHECKDB on all PME-related databases.

  • Check for errors reported in the output of DBCC CHECKDB.
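As an illustration, the DBCC CHECKDB statements can be generated for a list of databases. In this sketch only ION_Data is taken from this document; the list of databases to check depends on the installation and must be supplied by the administrator. The NO_INFOMSGS and ALL_ERRORMSGS options are standard DBCC CHECKDB options that suppress informational messages and show all errors found.

```python
def checkdb_statement(database):
    """Build a DBCC CHECKDB statement for one database.

    The name is wrapped in square brackets, with any closing bracket
    doubled, so that unusual database names are quoted safely.
    """
    quoted = "[" + database.replace("]", "]]") + "]"
    return f"DBCC CHECKDB ({quoted}) WITH NO_INFOMSGS, ALL_ERRORMSGS;"


def checkdb_script(databases):
    """Return one statement per database, joined into a single script."""
    return "\n".join(checkdb_statement(name) for name in databases)
```

For example, checkdb_script(["ION_Data"]) produces a one-line script. With NO_INFOMSGS, a run that returns no messages indicates that no errors were reported.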

Database backup Confirm that the backup scheduled tasks are completing successfully.
  • Review SQL Server Logs for errors related to each job.

  • Confirm that the expected database backup files exist, and that copies of the backups have been made to other media and stored off-site.
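Checking that backup files exist and are recent can be scripted. The sketch below assumes that backups are written as .bak files to a single folder and that a backup older than seven days is considered stale; both the file pattern and the age limit are hypothetical and should be adjusted to the backup schedule in use.

```python
from datetime import datetime, timedelta
from pathlib import Path


def newest_backup(folder, pattern="*.bak"):
    """Return the most recently modified backup file, or None if there is none."""
    files = [p for p in Path(folder).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def backup_is_recent(folder, max_age=timedelta(days=7), now=None):
    """True if the newest backup in folder is no older than max_age."""
    newest = newest_backup(folder)
    if newest is None:
        return False  # no backup file found at all
    now = now or datetime.now()
    modified = datetime.fromtimestamp(newest.stat().st_mtime)
    return now - modified <= max_age
```

The same function can be pointed at the off-site or secondary copy location to confirm that the copies are being refreshed as well.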

Software licensing A valid license ensures that all PME and SQL Server features are functional. Check that the license is still valid.
Software updates The latest updates ensure that the system has the latest cybersecurity protection and that known software bugs are fixed. Check whether the software is outdated and identify the correct time for an upgrade.