Load-Balancing Strategies for Linux Mirrors: A Case Study of a University Network

Question:

I am a member of a student organization that runs Linux mirrors for our university. Our objective is to offer fast and reliable mirrors for our users and other mirror sites (we are a tier 1 source). We currently have a single server, but we are planning to add another one for redundancy. Our network bandwidth is 1Gb/s for WAN and 10Gb/s for LAN, and we expect to upgrade the WAN to 3Gb/s soon. You can see a simple diagram of our network [here](https://imgur.com/96DCTYw).

We are considering different options for load-balancing our mirrors, and we would like to know which one is the best in terms of performance, scalability, and security. These are the options we have in mind:

– Using pfSense as a reverse proxy, with layer 7 (HTTP) load-balancing for ports 80 and 443, and layer 4 (TCP) load-balancing for the rest. We have heard that HTTP load-balancing can degrade performance due to double processing of requests, but we are not sure if this is true or if it can be avoided by using TCP load-balancing instead.
– Using pfSense as a firewall, with layer 4 (TCP) load-balancing for all ports, on a separate IP address. This would conflict with our existing HTTP HAProxy that listens on ports 80 and 443 for some services, and uses the HTTP Host header to route requests. We also need to handle TCP protocols such as FTP and rsync, which require layer 4 load-balancing anyway. We could set up a second WAN IP on pfSense, and then forward ports 80, 443, 21, and 873 to a TCP HAProxy VM in the cluster.
– Using CARP on the WAN interface of the mirrors, to provide failover and redundancy. As far as we know, this would make one server handle all requests while the other is idle, unless the primary server fails. We are not sure if this would improve or worsen performance.
– Using DNS round-robin to distribute requests among the mirrors, based on their domain names. We understand that this is not a true load-balancing solution, but rather a way to provide some degree of redundancy and fault tolerance. Modern clients would try another record if one does not respond. We were advised against this option by a mentor, but he did not explain why. It seems to be the simplest and most suitable option for our needs.

Could you please advise us on which option is the best for our situation, or suggest any other alternatives that we may have overlooked? We appreciate any additional information or feedback that you can provide. Thank you for your time and expertise.

Answer:

How to Load-Balance Linux Mirrors for a University Network

Linux mirrors are servers that provide copies of Linux distributions and software packages for users to download. They are essential for the Linux community, as they offer faster and more reliable access to the latest updates and releases. However, running a Linux mirror can be challenging, especially when it comes to managing the network traffic and ensuring high availability and performance. In this article, we will explore some of the options for load-balancing Linux mirrors for a university network, and provide some recommendations based on our experience and expertise.

The Scenario

We are a student organization that runs Linux mirrors for our university, and we act as a tier 1 source for other mirror sites as well as for end users. We currently have a single server and plan to add a second one for redundancy. Our network bandwidth is 1Gb/s for WAN and 10Gb/s for LAN, and we expect to upgrade the WAN to 3Gb/s soon. A simple diagram of our network is available [here](https://imgur.com/96DCTYw).

We are weighing four load-balancing options against performance, scalability, and security:

– pfSense as a reverse proxy, with layer 7 (HTTP) load-balancing for ports 80 and 443 and layer 4 (TCP) load-balancing for everything else.
– pfSense as a firewall only, forwarding ports 80, 443, 21, and 873 from a second WAN IP to a TCP HAProxy VM in the cluster. The second IP is needed because our existing HTTP HAProxy already listens on ports 80 and 443 and routes requests by Host header.
– CARP on the WAN interface of the mirrors, so that one server handles all requests and the other takes over if it fails.
– DNS round-robin across the mirrors, relying on clients to try another record when one does not respond.
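To put the link speeds above in perspective, here is a small back-of-the-envelope sketch. The link speeds are taken from the scenario; the 4 GiB image size and the assumption that each mirror has a 10Gb/s interface are illustrative only, and the function names are made up for this example.

```python
# Rough link arithmetic for the scenario. Protocol overhead and disk
# throughput are ignored, so real numbers will be somewhat worse.

def transfer_seconds(size_bytes: int, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps gigabits/s."""
    return (size_bytes * 8) / (link_gbps * 1e9)


def external_capacity_gbps(n_mirrors: int, nic_gbps: float, wan_gbps: float) -> float:
    """Traffic the site can serve to the outside: capped by the WAN link."""
    return min(n_mirrors * nic_gbps, wan_gbps)


ISO_BYTES = 4 * 1024**3  # hypothetical 4 GiB installer image

for label, gbps in [("WAN today", 1.0), ("WAN upgraded", 3.0), ("LAN", 10.0)]:
    secs = transfer_seconds(ISO_BYTES, gbps)
    print(f"{label:13s} {gbps:4.1f} Gb/s  {secs:5.1f} s per image")

for mirrors in (1, 2):
    cap = external_capacity_gbps(mirrors, nic_gbps=10.0, wan_gbps=1.0)
    print(f"{mirrors} mirror(s): at most {cap:.0f} Gb/s to external clients")
```

Under these assumptions a single mirror can already saturate the WAN link, so for external clients the second server mainly buys redundancy; extra throughput would only show up for clients on the LAN.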

The Analysis

To evaluate each option, we use five criteria: performance (how quickly and efficiently the mirrors serve requests), scalability (how well the setup copes with growing or bursty traffic), security (how well it protects the mirrors and their users from attacks), complexity (how hard it is to set up and maintain), and cost (the money and resources it requires).

– pfSense as a reverse proxy: pfSense, a free and open-source firewall and router distribution, acts as a reverse proxy that distributes requests to the mirrors according to protocol and port. For HTTP, it uses layer 7 load-balancing: it parses each request and can route on its content, for example the Host header. For other protocols, such as FTP and rsync, it uses layer 4 load-balancing and looks only at IP addresses and port numbers. The advantages are a single place to filter and block malicious requests and a single entry point for all services. The proxy can in principle also compress or cache responses, although that helps little for a mirror, whose traffic is mostly large, already-compressed packages and images. The disadvantages are overhead and latency: every HTTP request is handled twice, once by the proxy and once by the mirror, and all response data flows back through the pfSense machine. TCP mode avoids the HTTP parsing but still relays every byte. This option is also complex to configure and manage, because pfSense has many features that need tuning.
– pfSense as a firewall: pfSense only filters and forwards traffic. A second WAN IP on pfSense forwards ports 80, 443, 21, and 873 to a TCP HAProxy VM in the cluster, and HAProxy performs layer 4 load-balancing for all of them, looking only at IP addresses and port numbers. The advantages are high performance, as pfSense and HAProxy can both handle a large number of concurrent connections, and good security, as pfSense still filters traffic before it reaches the cluster. The disadvantages are complexity, as both pfSense and HAProxy need tuning, and cost, as the setup requires a separate IP address and a VM for HAProxy, which also becomes one more component that can fail. FTP needs extra care here: it opens separate data connections on negotiated ports, so forwarding port 21 alone is not enough, and the passive port range has to reach the same mirror as the control connection.
– CARP on the WAN interface: CARP (Common Address Redundancy Protocol) is a free protocol, originally from OpenBSD, that lets multiple hosts share one IP address for failover. The mirrors would share a WAN IP address; one is the master and handles all requests, while the other is the backup and takes over the address if the master fails. The advantages are high availability, as the backup takes over automatically (connections open at the moment of failure are dropped and must be retried), and simplicity, as CARP needs only a few configuration parameters. The disadvantage is that CARP provides failover, not load sharing: the master serves every request alone and the backup sits idle, so performance is the same as with today's single server and scalability does not improve.
– DNS round-robin: DNS, the system that translates domain names to IP addresses, spreads clients across the mirrors. A single service name, such as mirror.example.edu, carries one address record per mirror, and the DNS server rotates or randomizes the order of the records in each answer. The mirrors can still have individual names, such as mirror1.example.edu and mirror2.example.edu, for direct access. The advantages are high performance, as the mirrors serve requests independently with no device in the data path, and good scalability, as adding a mirror means adding a record. The disadvantages are weaker availability and security. DNS does not check mirror health, so it keeps handing out the address of a failed mirror until the record is removed and cached answers expire, and clients that do not retry another record will see errors. Caching resolvers also make the distribution uneven. And because DNS is not in the request path, it cannot filter or block malicious requests; each mirror is exposed directly and needs its own protection.
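The practical difference between plain DNS round-robin and a layer 4 balancer is the health check. The following is a minimal sketch of that idea, not how HAProxy or pfSense implement it: a round-robin picker that skips backends whose TCP port does not answer. The class name, the `check` parameter, and the hostnames in the usage note are all hypothetical.

```python
import socket
from typing import Callable, List, Optional


def tcp_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthAwareRoundRobin:
    """Round-robin over backends, skipping any that fail the health check."""

    def __init__(self, backends: List[str], port: int,
                 check: Callable[[str, int], bool] = tcp_alive) -> None:
        self.backends = list(backends)
        self.port = port
        self.check = check      # injectable so the logic can be tested offline
        self._next = 0          # index of the backend to try first next time

    def pick(self) -> Optional[str]:
        """Return the next healthy backend, or None if every backend is down."""
        n = len(self.backends)
        for offset in range(n):
            idx = (self._next + offset) % n
            host = self.backends[idx]
            if self.check(host, self.port):
                self._next = (idx + 1) % n
                return host
        return None
```

For example, `HealthAwareRoundRobin(["mirror1.example.edu", "mirror2.example.edu"], 873).pick()` would probe the rsync port on each mirror in turn. Plain DNS round-robin corresponds to the same rotation with a check that always returns True, which is why it keeps sending clients to a mirror that is down.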

TechNsight
