Network troubleshooting tips from a solo sysadmin who solved a mystery outage

Question:

How would you advise a solo sysadmin who faced a network outage caused by a malfunctioning security NVR? The sysadmin shared their experience on [r/sysadmin], a community of IT professionals, and received some encouragement and feedback. They also provided some details about their network configuration, troubleshooting steps, and lessons learned. You can read their full post below.

Answer:

How to deal with a network outage caused by a rogue device

Network outages can be frustrating and costly for any organization, especially when they are caused by unexpected factors. In this article, we will look at a real-life case of a solo sysadmin who faced a network outage caused by a malfunctioning security NVR (network video recorder). We will also discuss some best practices and tips on how to prevent, diagnose, and resolve such issues in the future.

The sysadmin, who works for a medium-sized organization, shared their experience on [r/sysadmin], a community of IT professionals. They described how their network started acting up one morning, with RDP sessions dropping, websites loading slowly or not at all, and users complaining. The only thing that worked fine was the VoIP phones.

The sysadmin initially suspected the FortiGate firewall they had installed two weeks earlier. They contacted FortiGate support and changed some firewall policies, but that did not help. They also tried configuring traffic shaping, which made no difference either.

After four hours of frustration and self-doubt, the sysadmin decided to run a packet capture on the LAN from the FortiGate. They noticed that one device was sending a lot of data, but it turned out to be a smartboard that was in use for a meeting, so its traffic was expected. They then sorted the conversations by packets instead of bytes and found the culprit: a security NVR that was spewing out mDNS (multicast DNS) packets at around 2400 per second.
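Why the sort order mattered can be sketched in a few lines of Python. This is a minimal illustration, not the sysadmin's actual workflow: the record format, the device names, and every count except the roughly 2400 packets per second reported in the post are made up for the example. A flood of small packets can rank low by bytes and still rank first by packet count.

```python
from collections import defaultdict

# Hypothetical capture summary over a 10-second window:
# (source, packet_count, total_bytes). Only the NVR rate (~2400 packets/s)
# comes from the post; all other values are illustrative assumptions.
CAPTURE = [
    ("smartboard", 9_000, 12_000_000),
    ("nvr", 24_000, 2_400_000),
    ("workstation-1", 1_500, 900_000),
    ("voip-phone", 1_000, 200_000),
]

def top_talkers(records, key):
    """Aggregate records per source and rank sources by 'packets' or 'bytes'."""
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for source, packets, size in records:
        totals[source]["packets"] += packets
        totals[source]["bytes"] += size
    return sorted(totals, key=lambda s: totals[s][key], reverse=True)

by_bytes = top_talkers(CAPTURE, "bytes")
by_packets = top_talkers(CAPTURE, "packets")
print("by bytes:  ", by_bytes)    # smartboard first
print("by packets:", by_packets)  # nvr first
```

With these assumed numbers, the smartboard tops the byte ranking while the NVR tops the packet ranking, which mirrors what the sysadmin saw in the capture.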

The sysadmin quickly disabled the switch port that the NVR was connected to, and the network returned to normal. They later learned that the NVR was about nine years old and was the first one they had installed, before they started using VLANs for different devices. They also discovered that they had misconfigured the storm control settings on the Meraki switches, which allowed the NVR to flood the network with mDNS packets.
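Storm control caps how much broadcast or multicast traffic a switch port will forward. The sketch below is a deliberately simplified model of that idea, not Meraki's implementation: the function name and the limit of 100 packets per second are assumptions chosen for illustration, and real switches may express the limit in different units.

```python
def forwarded_per_second(offered_pps, limit_pps):
    """Return (forwarded, dropped) packet counts for one second on a rate-limited port."""
    forwarded = min(offered_pps, limit_pps)
    return forwarded, offered_pps - forwarded

NVR_PPS = 2400           # multicast rate reported in the post
ASSUMED_LIMIT_PPS = 100  # hypothetical storm control limit

# A sensible limit drops most of the flood before it reaches the rest of the LAN.
print(forwarded_per_second(NVR_PPS, ASSUMED_LIMIT_PPS))  # (100, 2300)

# A limit set far too high behaves as if storm control were off.
print(forwarded_per_second(NVR_PPS, NVR_PPS * 10))       # (2400, 0)
```

The second call shows the failure mode described above: a limit that is too generous lets the entire flood through.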

The sysadmin was relieved that they had solved the problem, but also felt embarrassed that they had wasted so much time and energy on something that could have been avoided or detected sooner. They received some encouragement and feedback from the r/sysadmin community, and also shared some lessons learned from the incident.

The lessons

The sysadmin’s experience highlights some common challenges and pitfalls that solo sysadmins or small IT teams may face when dealing with network issues. Here are some of the lessons that we can learn from this case:

Have a systematic approach to troubleshooting. When faced with a network problem, it is tempting to jump to conclusions or try random fixes without a clear plan. However, this can lead to more confusion and frustration, and may even worsen the situation. A better approach is to work methodically, for example by moving layer by layer through the OSI or TCP/IP model, or by following a structured process: define the problem, gather data, form a hypothesis, test it, and verify the fix. These methods provide a logical framework to isolate the problem, identify the root cause, and implement the solution. They also help to avoid unnecessary steps or changes that are not relevant to the issue.
Use the tools available to you. Modern networks have a variety of tools and features that can help sysadmins monitor, analyze, and troubleshoot network performance and behavior. Some of these tools include packet capture, network analyzer, network mapper, ping, traceroute, SNMP, NetFlow, syslog, and more. These tools can provide valuable information and insights into the network traffic, topology, configuration, and status. They can also help to identify anomalies, errors, or bottlenecks that may affect the network performance or functionality. Sysadmins should familiarize themselves with these tools and use them regularly to check the health and performance of their network.
Keep your network documentation up to date. Network documentation is essential for any sysadmin, as it provides a reference for the network design, configuration, inventory, and policies. It can also help to troubleshoot network issues, as it can show the expected behavior and state of the network, and highlight any changes or deviations that may have occurred. Network documentation should include diagrams, IP addresses, VLANs, subnets, routing tables, firewall rules, switch ports, device models, firmware versions, and more. Sysadmins should update their network documentation whenever they make any changes to the network, and review it periodically to ensure its accuracy and completeness.
Segment your network and apply security policies. Network segmentation is the practice of dividing a network into smaller, logical units based on criteria such as function, location, or security level. Segmentation can improve network performance, security, and manageability, as it shrinks broadcast domains, limits the scope of attacks, and enforces access control. Sysadmins should use VLANs, subnets, firewalls, and other methods to segment their network and apply appropriate security policies to each segment. They should also avoid placing special-purpose devices, such as a security NVR, on the same segment as general user traffic, as this can create security risks and performance issues.
Keep your devices updated and maintained. Network devices, such as routers, switches, firewalls, servers, and cameras, are prone to wear and tear, bugs, and vulnerabilities over time. Sysadmins should keep their devices updated with the latest firmware, patches, and security updates, as these can fix known issues, improve performance, and enhance security. They should also perform regular maintenance on their devices, such as cleaning, testing, replacing, or upgrading components, as these can prevent failures, malfunctions, or degradation. Sysadmins should also monitor the status and performance of their devices, and look for any signs of trouble, such as high CPU or memory usage, errors, or unusual traffic.
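The advice to watch for unusual traffic can be sketched as a simple packet-rate check. This is a minimal sketch under stated assumptions, not a production monitor: the function name, the thresholds, and all sample rates except the NVR's reported 2400 packets per second are hypothetical, and in practice the rates would come from a source such as SNMP counters or NetFlow.

```python
from statistics import median

def flag_noisy_devices(rates_pps, factor=10.0, floor_pps=500):
    """Return devices whose rate is at least floor_pps and above factor times the median rate."""
    baseline = median(rates_pps.values())
    return sorted(
        device
        for device, rate in rates_pps.items()
        if rate >= floor_pps and rate > factor * baseline
    )

# Hypothetical per-device packet rates in packets per second.
# Only the NVR value comes from the post.
SAMPLE = {
    "nvr": 2400,
    "smartboard": 900,
    "workstation-1": 150,
    "voip-phone": 100,
    "printer": 5,
}

print(flag_noisy_devices(SAMPLE))  # ['nvr']
```

Comparing against the median keeps one busy but legitimate device, such as the smartboard here, from being flagged, while the absolute floor avoids alerts on quiet networks where every rate is tiny.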

The conclusion

Network outages can be a nightmare for any sysadmin, especially when they are caused by unexpected or obscure factors. However, by following a systematic approach to troubleshooting, using the tools available, keeping the network documentation up to date, segmenting the network and applying security policies, and keeping the devices updated and maintained, sysadmins can prevent, diagnose, and resolve network issues more effectively and efficiently. They can also learn from their mistakes and improve their skills and knowledge. As the sysadmin in this case said, “We can definitely be our own worst enemy eh! But we can also be our own best friend!”

TechNsight
