Disaster recovery strategy

Disaster recovery requires planning and assessment to develop a strategy that meets both the business requirements and the PME system configuration. The disaster recovery strategy is the result of two objectives:

Data retention - The amount of data required in the active system.
System recovery - The minimal state of the system that should be recovered after a disaster and the acceptable limit of data and time loss.

Disasters can occur at any time and, if unprepared, such events can lead to data loss and service disruptions. Factors leading to system disasters include:

Inoperative hardware
Sudden power interruption / outage
External threats such as malware, virus attacks, and hacking
Human errors such as accidental data deletion
Implementation or upgrade issues
Database corruption, such as a database exceeding the maximum expected size or allowable hard drive space
Natural disasters (for example, earthquakes, fire, flood, storms, and so on)

Some SQL Server disasters cannot be prevented, so it is important to prepare a complete disaster recovery plan (DRP) to ensure minimal impact on service and data availability.

An effective disaster recovery plan should comprise:

Identify disaster recovery objectives
IT architecture and resources plan
Backup plan
Recovery plan

Developing the plan requires collaboration with the IT team, application champions (administrators, power users) and recovery experts.

NOTE: If you have limited time and resources to define the strategy, you can consider third-party products and services for assistance.

Identify disaster recovery objectives

The plan starts with identifying disaster recovery objectives. This includes the business aspect of the system. You might want to consider the value of the PME system and the data it contains:

What is the business cost of one day of downtime, both explicit and implied?
What would be the result if an hour’s, day's or month's worth of analysis, reporting, alarming and data were lost?
What would be the result of a complete loss of the PME system?

If your system is not critical, the best strategy may be a simple one: redeploy a new PME system after a disaster, re-import the device data, and accept the loss of historical data that may no longer be relevant. If your system is critical, you may develop a plan for a quick recovery with minimal data loss.

You must set a written expectation of what constitutes an acceptable loss. Consider the following questions:

What is an acceptable level of data loss in your PME systems?

The answer to this question determines the Recovery Point Objective (RPO). It is the maximum amount of recent data the business can lose when a disaster strikes. It helps to measure how much time can elapse between your last data backup and the disaster without causing serious damage to your business. RPO is used to determine how often to perform data backups.

For example, your backup schedule is set to daily at midnight and a disaster occurs at 8 AM. At the point of the disaster, you would have lost 8 hours’ worth of data. If your RPO is one day of data then the loss of the last 8 hours of data is not an issue. However, if your RPO is one hour of data, then you must revise your backup schedule to at least one backup per hour.
What is an acceptable recovery time?

The answer to this question determines the Recovery Time Objective (RTO). It is the amount of time the business can survive without the system after a disaster, before operations are restored to normal. It determines how quickly you need to recover the PME system after a disaster.

For example, if your RTO is 24 hours, you can wait up to 24 hours before the system must be available to users. If data and infrastructure are not recovered within 24 hours, the business might be impacted.
What level of disaster should we be prepared for?

Identify the possible disasters that could affect your PME system and the level of impact of each disaster. For example, if your PME system is an on-premises solution, you should prepare for disasters such as power loss, fire, and flood. If your PME system is hosted on off-site servers in a data center, still prepare for natural disasters, but give them lower priority than cybersecurity risks.
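
The RPO and RTO examples above come down to simple time arithmetic. The following Python sketch reproduces them. The function names are hypothetical and are not part of PME or SQL Server, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def data_loss(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data written after the last backup is lost when the disaster strikes."""
    return disaster - last_backup


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, the disaster hits just before the next backup runs,
    so the backup interval itself must not exceed the RPO."""
    return backup_interval <= rpo


def recovery_meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    """The system must be back in service within the RTO."""
    return measured_recovery <= rto


# Example from the text: daily backup at midnight, disaster at 8 AM.
loss = data_loss(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 8, 0))
print(loss)                                                          # 8:00:00
print(schedule_meets_rpo(timedelta(days=1), timedelta(days=1)))      # True
print(schedule_meets_rpo(timedelta(days=1), timedelta(hours=1)))     # False
print(recovery_meets_rto(timedelta(hours=30), timedelta(hours=24)))  # False
```

The RPO check compares the backup interval, not the loss in one particular incident, because the schedule has to hold for a disaster at the worst possible moment.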

IT architecture and resources plan

It is important to design the IT architecture so you can allocate the necessary hardware and networking resources in support of optimal performance and disaster recovery. Maintenance and backup activities often require additional resources (CPU, RAM and hard drive space) to perform and complete the activity. You can also prevent disasters using additional hardware. The recommended best practices are as follows:

Hard drive space allocation

Ensure there is enough hard drive space to perform backup operations and store at least two backup files. Spare hard drives help minimize rebuild time. RAID arrays (commonly used on PME systems) can protect against disk failure. See Storage Performance and Availability for recommended hard drive sizing.
Backup power

UPS systems and redundant power supplies to servers can prevent server power interruption.
Connection redundancy

If available, redundant data links can protect critical data transmission when a network connection cannot be established.
Standby servers

Where the supporting infrastructure and budget allow, standby servers provide a second set of hardware that can replace the PME system hardware if a server becomes inoperative. This approach is valuable when PME is a critical system.

Backup plan

Creating backups is a key part of every PME deployment. A backup solution unique to the PME deployment can be created based on the recovery objectives, the PME system, and available IT resources. The backup plan should comprise:

Components backup
Backup frequency
Storage and retention of backup
Test the backup

Once you have a strategy with details, document the details and supporting processes. Whenever a system or process change occurs, review and update this document. Store the document outside of the PME server.

Components backup

The following table contains the components of a standard PME system that must be considered for backup:

Component	Name	Description
PME Database	ION_Network	Sometimes called the NOM (Network Object Model), the ION_Network database stores device information such as device name, device type, and connection address (for example, IP address and TCP/IP port or device/Modbus ID). It also contains information about the optional Application Module settings, other ION Servers, Sites, Dial Out Modems, and Connection Schedules. There is only one ION_Network per system.
PME Database	Application Module	The Application_Modules database contains configuration settings (for example, layouts, colors, application events, and so on) and cached historical data for some of the Web Applications (for example, Dashboards and Trends).
PME Database	ION_Data	The ION_Data database contains the historical data, events and waveforms from devices connected to the system. This includes: onboard logging configured on devices; and, PC-based logging configured in the device translators and the Virtual Processors.
PME Database	ION_Data archive	The ION_Data archive databases contain historical data that have been sectioned off from the main ION_Data database.
System Database	master	The master database is the core system database for a SQL Server installation. It contains information such as SQL Server credentials and system configuration settings.
System Database	model	The model database is used as a template for all databases created on the SQL Server instance.
System Database	msdb	The msdb database is used by SQL Server Agent for scheduling alerts and jobs. msdb also contains history tables such as the backup and restore history tables.
PME Files	PME	The application folder is where all the program and configuration files for PME are stored. By default, this is “%Program Files%\Schneider Electric\Power Monitoring Expert”.
System Files	SQL Server	The SQL Server folder is where all the program and configuration files for SQL Server, SQL Server Reporting Services, and SQL Server Agent are stored. By default, this is “%Program Files%\Microsoft SQL Server”.
System Files	Windows registry	Contains configuration information for the entire server.
System and PME Files	Full Server Backup	It is advised to take an image of the entire PME application and database servers (excluding the actual database files – MDF, NDF, and LDF files). This backs up other important configuration information such as service credentials, security policies, and IIS setup. It can be important when simplifying and reducing the time taken for a system recovery.

All PME databases should be backed up frequently and a full server backup should be taken upon system configuration changes (for example, Vista diagrams, updating device drivers, registry settings, VIP framework changes, and so on). Use Configuration Manager for performing the backup.

Database recovery model

When backing up databases it is important to choose an appropriate recovery model. The recovery model is a database property that controls how transactions are logged, whether the transaction log requires (and allows) backing up, and what types of restore operations are available.

PME databases use one of two recovery models:

Simple recovery model

A complete database backup is taken, and a restore can only return the database to the point when that backup was taken.
Full recovery model

Adds transaction log backups to full and differential database backups, so a restore can target different points in time.

All PME databases are configured with the simple recovery model by default. The ION_Data database recovery model should be updated to reflect your backup plan.

The recovery model is determined by comparing the disaster identification time with the backup schedule. For example, as per the following diagram:

A system that is configured to keep a single backup might not be recoverable. Suppose the system is not accessed by users over the weekend and becomes inoperative in a way that still allows the automated backup jobs to run. By the time the problem is noticed, the only backup was taken after the failure, so it is not valid and the complete PME system is lost.

You can prevent this situation by setting the ION_Data database recovery model to Full, thus allowing more refined backup options.

In the case of critical PME systems, consider:

Being aware of your disaster identification time and adjusting your backup schedule appropriately
Using a full recovery model with several differential backups (advanced configuration)
Keeping multiple backup copies on a rotational basis

The key benefit of the full recovery model is that it can restore a database to any point in time since the last full backup was taken, potentially including the point at which the disaster occurred, resulting in no data loss. Use it only if simple recovery is not sufficient to meet the recovery needs, because it costs performance and storage space.
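
To illustrate how a point-in-time restore works under the full recovery model, the sketch below picks which backup files would be needed to reach a target time: the most recent full backup before the target, followed by each later transaction log backup in order, stopping at the first one that covers the target. This is a simplified model with hypothetical names; it ignores differential backups and does not check that the log chain is unbroken.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str            # "full" or "log" (hypothetical labels)
    finished: datetime   # time the backup completed


def restore_chain(backups, target):
    """Return the backups to restore, in order, to reach `target`."""
    fulls = [b for b in backups if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup exists before the target time")
    base = max(fulls, key=lambda b: b.finished)
    logs = sorted(
        (b for b in backups if b.kind == "log" and b.finished > base.finished),
        key=lambda b: b.finished,
    )
    chain = [base]
    for log in logs:
        chain.append(log)
        if log.finished >= target:
            break  # this log backup covers the target time
    return chain


# Full backup at midnight, log backups at 01:00, 02:00 and 03:00.
day = datetime(2024, 1, 1)
backups = [Backup("full", day)] + [
    Backup("log", day.replace(hour=h)) for h in (1, 2, 3)
]
chain = restore_chain(backups, day.replace(hour=2, minute=30))
print([b.finished.hour for b in chain])  # [0, 1, 2, 3]
```

If the target is later than the last log backup, the chain ends at that backup, and anything written afterward is recoverable only if the tail of the transaction log can still be backed up.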

Backup frequency

By default, PME is configured to back up the ApplicationModules and ION_Network databases daily, while the ION_Data database is backed up once per week. This default configuration assumes that meters installed throughout the network have onboard memory and onboard logging enabled with a log of at least 14 days of data. This weekly frequency balances the need for performance in steady state and disaster recovery preparation. Under the full recovery model, transaction log backups that are not taken regularly can lead to an unnecessarily bloated LDF file, which can cause performance issues.

If your PME system is critical, it is important to ensure you have a frequent backup strategy to support quick recovery. In this case, the recommended practices are:

Set the ION_Data database recovery model to Full
Schedule daily full backups
Schedule hourly transaction log backups
Continue to keep the last 2 full backup files on the server
Increase hard drive storage space by 2 x the size of the ION_Data.MDF file for the additional transaction log backup files
Keep the last 24 transaction log backup files on the server
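
The retention counts in this list (the last 2 full backups and the last 24 transaction log backups) can be enforced with a small cleanup routine. The following sketch only decides which files fall outside the window; the tuple layout and file names are hypothetical, and a real job would also delete or move the returned files.

```python
from datetime import datetime, timedelta


def files_to_delete(backups, keep_full=2, keep_log=24):
    """backups: list of (file_name, kind, finished) tuples, where kind is
    "full" or "log". Returns the file names outside the retention window."""
    limits = {"full": keep_full, "log": keep_log}
    doomed = []
    for kind, keep in limits.items():
        newest_first = sorted(
            (b for b in backups if b[1] == kind),
            key=lambda b: b[2],
            reverse=True,
        )
        # Everything after the newest `keep` entries is outside the window.
        doomed.extend(name for name, _, _ in newest_first[keep:])
    return doomed


# Three daily full backups and thirty hourly log backups.
start = datetime(2024, 1, 1)
fulls = [(f"full_{i}.bak", "full", start + timedelta(days=i)) for i in range(3)]
logs = [(f"log_{i}.trn", "log", start + timedelta(hours=i)) for i in range(30)]
old = files_to_delete(fulls + logs)
print(len(old))  # 7
print(old[0])    # full_0.bak
```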

The recommended backup configuration and frequency for the PME and system databases are as follows:

Component	Name	Description	Recovery Model	Backup Frequency
PME Database	ION_Network	All	Simple	Daily
PME Database	Application Module	All	Simple	Daily
PME Database	ION_Data	For systems with meters that have at least 14 days of onboard logging	Simple	Weekly
PME Database *	ION_Data	For systems without onboard logging or for critical systems. NOTE: Perform hourly transaction log backups. Review data storage requirements for the additional transaction log backup files.	Full	Daily
PME Database	ION_Data archive	All	Simple	Upon creation
System Database	master	All	Simple	Daily
System Database	model	All	Full	As required
System Database	msdb	All	Simple	Daily

Additionally, all the components should be manually backed up after an update.

The recommended backup configuration and frequency for PME and system files are as follows:

Component	Name	Backup frequency
PME Files	PME	Back up upon significant system change. Use Configuration Manager to back up and archive PME files. Be sure to deselect the database options.
System Files	SQL Server	Back up upon significant system change. Back up the “%PROGRAMFILES%\Microsoft SQL Server” folder upon major system changes (hotfixes and upgrades).
System Files	Windows registry	Monthly and after a significant system change
System and PME Files	Full Server Backup	Annually or upon significant system change. Back up the entire PME application and database servers (excluding MDF, NDF, and LDF database files) once a year and after each significant system change (upgrade).

Storage and retention of backup

In the PME planning stage, we recommend having enough additional hard drive space to hold at least three times the expected size of the main ION_Data (MDF) database. This estimate assumes that two backup files are stored on the production server.
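
As a rough worked example of this sizing rule, the sketch below multiplies the expected MDF size by three and, for critical systems that take hourly transaction log backups, adds the extra 2 x MDF size mentioned under Backup frequency. Adding the two figures together is an assumption made for illustration; see the sizing references in this guide for authoritative numbers.

```python
def backup_disk_space_gb(expected_mdf_gb, hourly_log_backups=False):
    """Estimate the additional disk space to reserve for backups, in GB."""
    space = 3 * expected_mdf_gb          # baseline rule from this section
    if hourly_log_backups:
        space += 2 * expected_mdf_gb     # extra room for log backup files
    return space


print(backup_disk_space_gb(100))        # 300
print(backup_disk_space_gb(100, True))  # 500
```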

We recommend the following storage and retention strategy:

Follow the 3-2-1 Rule
- Store backups locally on a RAID protected drive for the shortest amount of recovery time.
- Store a copy of backups on a centralized set of disks so you can recover the backups on another server if the production SQL Server encounters a critical issue.
- Store a copy of the backups off-site on external drives or in the cloud in case a site disaster occurs.

Set up automated processes to back up and move files to separate locations.

Maintain a reasonable set of backups off-site and outside of the PME servers. Check with your legal team about how much critical data must be kept in the event of a disaster. We recommend the following backup retention strategy:
- 10 daily backups
- 5 weekly backups
- 6 monthly backups
- 3 on-demand or annual backups
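
One way to apply these retention counts is to keep the newest backup from each day, week, month, and year, up to the stated limits. The sketch below assumes one backup per calendar day, uses ISO weeks, and treats the "on-demand or annual" tier as annual only; all names are hypothetical.

```python
from datetime import date, timedelta


def latest_per_bucket(days, bucket_of, keep):
    """Pick the newest backup in each bucket, then keep the newest `keep`."""
    newest = {}
    for d in sorted(days):
        newest[bucket_of(d)] = d  # later dates overwrite earlier ones
    return set(sorted(newest.values(), reverse=True)[:keep])


def backups_to_keep(days):
    """Return the backup dates to retain: 10 daily, 5 weekly, 6 monthly, 3 annual."""
    days = sorted(set(days), reverse=True)
    keep = set(days[:10])
    keep |= latest_per_bucket(days, lambda d: tuple(d.isocalendar())[:2], 5)
    keep |= latest_per_bucket(days, lambda d: (d.year, d.month), 6)
    keep |= latest_per_bucket(days, lambda d: d.year, 3)
    return keep


# 400 consecutive daily backups ending on 31 March 2024.
newest = date(2024, 3, 31)
history = [newest - timedelta(days=n) for n in range(400)]
keep = backups_to_keep(history)
print(len(keep))   # 18
print(min(keep))   # 2023-10-31
```

The tiers overlap (the newest daily backup is also the newest weekly, monthly, and annual one), so the number of files kept is usually lower than the sum of the four limits.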

Historical backup files should be stored off-site.

Delete the old backup files on a regular basis in order to manage the storage cost.

Test the backup

A critical aspect of the backup strategy is ensuring that you can recover files from the backups. Prepare a test procedure to verify that the backup files contain the expected data and that the backups can actually be restored, that is, the backup files are not corrupt.
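
A full restore test is the only real proof that a backup is usable, but a checksum comparison is a cheap first check that a copied backup file was not damaged in transit. The sketch below is one possible approach using SHA-256 and is not a PME or SQL Server feature; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backup files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original_path, copy_path):
    """True when the copy is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(copy_path)
```

A matching checksum only shows that the copy equals the original. It does not show that the original backup is restorable, so it complements the restore test rather than replacing it.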

When practicing the restore procedure, ensure that you restore to a different server and at a different location.

This practice ensures that the recovery team:

Knows the steps to follow when recovering from a data loss or disaster.
Has existing infrastructure to support a recovery.
Can stay calm and act efficiently in a real disaster situation.

Recovery plan

Backup files are worthless if they cannot be restored, so you must have a recovery plan with the goal of getting a recovered PME system functional with minimum downtime and data loss. The disaster recovery objectives and backup and archive strategies help create a recovery strategy.

The most important point to remember when creating a recovery plan is that it is not valid until it is actually tested, and your recovery position is only as good as your last recovery test. Once you have a recovery plan, allocate some time to test your disaster recovery strategy. Be aware of who is executing the recovery as well. Do not assume that a specific person is available to restore the PME system.

We recommend the following approach to developing a recovery strategy:

Set a time expectation for recovery (Recovery Time Objective).
Identify the necessary hardware, software, backup and archive files and types (full, differential, and log).

Ensure resources – physical (servers, software, network) and personnel – are allocated and assigned, so they are readily available if a disaster strikes.
Document the entire recovery procedure.

If you have a large recovery time window, such as 1 week, you may have enough time to contact the PME support team to assist in a recovery procedure. If you have a smaller time window, then any PME administrator (factoring in employee turnover) should be capable of performing a restore, so this procedure should be well documented. At a minimum, all backup and archive locations should be documented and accessible to any PME administrator. Store the documentation outside of the PME production servers.

NOTE: Training PME administrators and / or support staff on PME disaster recovery may be important to ensure you have redundant personnel available.
Schedule system downtime and test the restore procedure. This is a necessary step to ensure the disaster recovery strategy is valid. Track the time the recovery procedure takes to verify your time expectation for recovery is valid. Take corrective action for any areas missed in your recovery documentation, backup, or archive strategies.
Keep the recovery documentation up to date after any major system change that alters the restore procedure.
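
When you time a restore rehearsal, recording the duration of each step makes it easier to compare the total against the RTO and to see where the time goes. The sketch below shows the idea; the step names and durations are invented for illustration and are not measurements from a real PME system.

```python
from datetime import timedelta


def total_recovery_time(step_durations):
    """Sum the duration of every step in the rehearsal."""
    return sum(step_durations.values(), timedelta())


def slowest_step(step_durations):
    """Name of the step that took the longest."""
    return max(step_durations, key=step_durations.get)


# Hypothetical rehearsal log.
rehearsal = {
    "provision replacement server": timedelta(hours=4),
    "install SQL Server and PME": timedelta(hours=6),
    "restore databases": timedelta(hours=10),
    "verify data and reconnect devices": timedelta(hours=6),
}
rto = timedelta(hours=24)

total = total_recovery_time(rehearsal)
print(total)                    # 1 day, 2:00:00
print(total <= rto)             # False
print(slowest_step(rehearsal))  # restore databases
```

In this made-up case the rehearsal overruns a 24-hour RTO by two hours, which would call for corrective action such as a standby server or a revised RTO.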

NOTE: See SQL Server Limitations on Restoring System Databases in cases where a full SQL server recovery is needed.

Recommended consolidated disaster recovery strategy plan

The recommended consolidated disaster recovery strategy for the ION_Data database, by purpose of the PME system, is as follows:

Disaster recovery strategy parameter	Purpose of PME system
Disaster recovery strategy parameter	Analysis & decision making (Capacity management, Energy usage analysis, Power Quality compliance)	Real-time monitoring & troubleshooting (Electric distribution monitoring & alarming, Insulation monitoring, Backup power testing)	Critical large advanced distributed system (with mix of real-time and analysis based applications)
Device memory	On board logging (14 to 30 days)	None	None
Recovery point objective	Up to 1 week	Yesterday	As close to the point of disaster as possible
Recovery time objective	24 hours	36 hours	24 hours
Database recovery model	Simple	Simple	Full
Full backup frequency	Weekly	Weekly	Daily
Additional backup type and frequency	Not applicable	Differential, daily	Transaction log, hourly
Number of full backup files to store on server	2	2	2
Number of additional backup files to store on server	Not applicable	2 to 6	Depends on how often the transaction log backup files are validated
Additional IT resources needed for additional backup files*	Standard	Additional storage required for differential backup	Additional storage required for transaction log backup

*See IT Requirements for recommended system sizing.