Released: July 12, 2017
FCC’S PUBLIC SAFETY AND HOMELAND SECURITY BUREAU REMINDS COMMUNICATIONS SERVICE PROVIDERS OF IMPORTANCE OF IMPLEMENTING NETWORK RELIABILITY BEST PRACTICES
PS Docket No. 17-68
The Federal Communications Commission’s (Commission’s) Public Safety and Homeland Security Bureau (Bureau) encourages communications service providers to implement appropriate measures to prevent major service disruptions.
Based on submissions to the Commission’s Network Outage Reporting System (NORS) and publicly available data, the Bureau has observed a number of major service outages caused by minor changes in network management systems. These so-called “sunny day” outages do not result from a natural weather-related disaster or other unforeseeable catastrophe, and can result in “silent failures,” which are outages that occur without providing explicit notification or alarm to the service provider. In 2014, the Bureau first highlighted the occurrence of major “sunny day” outages affecting users in multiple states. These major outages continue to occur, some affecting users nationwide. Outages that impact 911 service are of particular concern, given the importance of ensuring continuity of 911 service.
After an analysis of the facts and circumstances, Bureau staff have determined that service providers likely could have prevented most of these outages if they had implemented certain industry best practices. In particular, seven best practices recommended by the Commission’s Communications Security, Reliability, and Interoperability Council (CSRIC) II, a former federal advisory committee, could help prevent sunny day outages and silent failures:
1. Awareness Training: “Network Operators, Service Providers and Equipment Suppliers should provide awareness training that stresses the services impact of network failure, the risks of various levels of threatening conditions and the roles components play in the overall architecture. Training should be provided for personnel involved in the direct operation, maintenance, provisioning, security and support of network elements.”
2. Required Experience and Training: “Network Operators, Service Providers, and Equipment Suppliers should establish a minimum set of work experience and training courses which must be completed before personnel may be assigned to perform maintenance activities on production network elements, especially when new technology is introduced in the network.”
3. Access Privileges: “Service Providers, Network Operators, and Equipment Suppliers should have policies on changes to and removal of access privileges upon staff member status changes.”
4. Network Change Verification: “Network Operators should establish policies and processes for adding and configuring network elements that include approval for additions and changes to configuration tables (e.g., screening tables, call tables, trusted hosts, and calling card tables). Verification rules should minimize the possibility of receiving inappropriate messages.”
5. Network Reconfiguration 911 Assessment: “Service Providers and Network Operators when reconfiguring their network (e.g., changes to VoIP Positioning Center (VPC), Mobile Positioning Center (MPC), Gateway Mobile Location Center (GMLC), or Emergency Services Gateway (ESGW)) should assess the impact on the routing of 911 calls.”
6. Diversity Audits: “Network Operators and Public Safety should periodically audit the physical and logical diversity called for by network design of their network segment(s) and take appropriate measures as needed.”
7. Network Monitoring: “Network Operators, Service Providers, and Public Safety should monitor their network to enable quick response to network issues.”
The Bureau encourages service providers to review and consider voluntarily implementing these network reliability best practices as appropriate.
In addition to considering CSRIC-recommended best practices, the Bureau recommends that service providers consider implementing the following lessons learned, derived from the Bureau’s fact-based analysis of several recent outages. The Bureau finds that taking these steps could help to prevent future outages or mitigate the impact of outages that do occur.
Access Control: Limit direct access to operations support systems that control a large number of switches, soft switches, or routers.
Validation and Authentication: Implement validation and authentication procedures for any changes that affect call routing.
Software-based Alarming: Work with vendors to implement software that warns technicians when a change is being made that could potentially affect a large number of calls or customers.
Enhanced Outage Detection: Implement traffic measurements or other mechanisms in major network elements to enable the detection of failures where calls are lost but associated equipment continues to operate.
Automatic Re-routing: Examine whether automatic re-routing of calls would be an effective remediation strategy in the event of outages.
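As an illustration only, two of the lessons learned above (Software-based Alarming and Enhanced Outage Detection) might be sketched as follows. This is a minimal sketch and not a Bureau-endorsed implementation: the names TrafficSample, change_needs_warning, and silent_failure_suspected, the data layout, and every threshold are hypothetical assumptions, and a provider would need to tune them to its own network and traffic patterns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrafficSample:
    """Call counts for one network element over one interval (hypothetical schema)."""
    attempts: int      # calls offered to the element
    completions: int   # calls the element routed onward successfully


def change_needs_warning(affected_elements: int,
                         affects_call_routing: bool,
                         element_threshold: int = 10) -> bool:
    """Software-based alarming: flag a pending change before it is applied.

    Assumption: a change is treated as high-impact if it touches call routing
    at all, or if it reaches at least `element_threshold` switches or routers.
    """
    return affects_call_routing or affected_elements >= element_threshold


def silent_failure_suspected(samples: List[TrafficSample],
                             baseline_completion_ratio: float,
                             expected_attempts: int,
                             max_drop: float = 0.5) -> bool:
    """Enhanced outage detection: infer lost calls from traffic measurements.

    The element may still report itself as healthy, so the check relies only
    on traffic counts. `max_drop` is an illustrative tolerance, not a standard.
    """
    attempts = sum(s.attempts for s in samples)
    completions = sum(s.completions for s in samples)

    # Case 1: far fewer calls are arriving than the provider expects,
    # which can mean upstream routing is sending them nowhere.
    if attempts < expected_attempts * (1 - max_drop):
        return True

    # No traffic was expected and none arrived: nothing to judge.
    if attempts == 0:
        return False

    # Case 2: calls arrive, but far fewer than usual complete.
    ratio = completions / attempts
    return ratio < baseline_completion_ratio * (1 - max_drop)
```

In this sketch, a provisioning tool would call change_needs_warning before committing a change and require a technician to acknowledge the warning, while a monitoring job would call silent_failure_suspected on each measurement window and raise an alarm when it returns True. Both checks use simple fixed thresholds for clarity; seasonal or time-of-day baselines would likely be needed in practice.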
For further information, contact John Healy, Associate Chief, Cybersecurity and Communications Reliability Division, Public Safety and Homeland Security Bureau, (202) 418-2448, firstname.lastname@example.org or Robert Finley, Attorney, Cybersecurity and Communications Reliability Division, Public Safety and Homeland Security Bureau, (202) 418-7835, email@example.com.
The Public Safety and Homeland Security Bureau issues this Public Notice under delegated authority pursuant to Sections 0.191 and 0.392 of the Commission’s rules, 47 CFR §§ 0.191, 0.392.