Modern cluster systems and their use. Means of implementing High Availability clusters. Cluster projects in Russia

High performance cluster (computer group)

A computer cluster is a group of computers interconnected by high-speed communication links that jointly process the same requests and are presented to the user as a single computing system.

Main properties of clusters

Clusters consist of several computer systems;

They work as a single computing system (though not all clusters do);

The cluster is managed and presented to the user as a single computing system;

Why clusters are needed

Clusters can be used for different purposes: to build fault-tolerant systems, to increase the performance of a computing node, or to perform time-consuming calculations.

Types of clusters

Failover clusters

Such clusters are created to ensure a high level of availability of the service provided by the cluster. The more computers included in the cluster, the lower the probability of failure of the provided service. Computers that are part of a geographically dispersed cluster also provide protection against natural disasters, terrorist attacks and other threats.

Such clusters can be built according to three main principles

  • Cold standby cluster - the active node processes requests while the passive node is idle and simply waits for the active one to fail. The passive node starts working only after the failure of the active one. A cluster built according to this principle can provide high fault tolerance, but at the moment the active node goes down, the requests it was processing may be lost.
  • Hot standby cluster - all nodes of the system jointly process requests, and in the event of the failure of one or more nodes, the load is redistributed among the remaining ones. This type of cluster can also be called a load balancing cluster, which we will talk about later, but with support for redistributing requests when one or more nodes fail. When using this type of cluster, there is also a possibility of losing the data being processed by the node that failed.
  • Modular redundancy cluster - all computers in the cluster process the same requests in parallel with each other, and after processing, any one of the results is taken. Such a scheme guarantees the execution of the request, since any of the processing results can be used.

Load balancing cluster

These clusters are created primarily for performance reasons, but they can also be used to improve fault tolerance, as in the case of a hot standby failover cluster. In such clusters, requests are distributed through input nodes to all other nodes of the cluster.
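As a toy illustration of the distribution idea only (a sketch, not the code of any real load balancer, which would also track node health and current load), requests can be spread across nodes in round-robin fashion:

  /* round_robin.c - toy illustration of spreading requests over cluster nodes */
  #include <stdio.h>

  #define NODES 4

  int main(void)
  {
      const char *node[NODES] = { "node1", "node2", "node3", "node4" };

      /* each incoming request is handed to the next node in turn */
      for (int request = 0; request < 10; request++)
          printf("request %d -> %s\n", request, node[request % NODES]);

      return 0;
  }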

Computing clusters

This type of cluster is usually used for scientific purposes. In such systems, the task is divided into parts that are executed in parallel on all nodes of the cluster, which significantly reduces processing time compared to a single computer.

A fresh look at clustering

The rapid development of information technology, the growth in the volume of data processed and transmitted, and at the same time the increasing requirements for reliability, availability, fault tolerance and scalability make us take a fresh look at the already far from young technology of clustering. This technology allows building quite flexible systems that meet all of the above requirements. It would be wrong to think that installing a cluster will solve absolutely all problems, but it is quite possible to achieve impressive results from clustering. You just need to clearly understand what it is, what the most significant differences between its individual varieties are, and also know the advantages of particular systems in terms of the effectiveness of their application in your business.

Analysts from IDC calculated that the size of the cluster market in 1997 was only 85 million dollars, while last year this market was already "worth" 367.7 million dollars. The upward trend is obvious.

So, let's try to dot the i's. To date, there is no clear definition of a cluster. Moreover, there is not a single standard that clearly regulates what a cluster is. However, there is no need to despair, because the very essence of clustering does not imply compliance with any standard. The only thing that determines that a cluster is a cluster is the set of requirements placed on such systems. We list these requirements (the four rules): reliability; availability of function (availability); scalability; computing power. Based on this, we can formulate a definition of a cluster. A cluster is a system of arbitrary devices (servers, disk drives, storage systems, etc.) that provides 99.999% fault tolerance and also satisfies the "four rules". For example: a server cluster is a group of servers (commonly referred to as cluster nodes) connected and configured in such a way as to provide the user with access to the cluster as a single coherent resource.

Fault tolerance

Undoubtedly, the main characteristic of a cluster is fault tolerance. This is confirmed by a user survey: 95% of respondents answered that they need reliability and fault tolerance in clusters. However, these two concepts should not be confused. Fault tolerance is understood as the availability of certain functions in the event of a failure, in other words, redundancy of functions and load balancing. Reliability is understood as a set of means for providing protection against failures. Such requirements for the reliability and fault tolerance of cluster systems follow from the specifics of their use. Let's take a small example. Suppose a cluster serves an electronic payment system: if the client is left without service from the operating company at some point, it will cost the company dearly. In other words, the system must operate continuously 24 hours a day, seven days a week (7x24).

At the same time, a fault tolerance of 99% is clearly not enough, since it means that for almost four days a year the information system of the enterprise or operator will be inoperable. This may not seem like such a long time, given preventive work and maintenance of the system. But today's client is absolutely indifferent to the reasons why the system does not work: he simply needs the service. So 99.999% becomes the acceptable figure for fault tolerance, which is equivalent to about 5 minutes of downtime per year. Such figures can be achieved by the cluster architecture itself. Consider a server cluster: each server in the cluster remains relatively independent, that is, it can be stopped and turned off (for example, for maintenance or installation of additional equipment) without disrupting the operation of the cluster as a whole.

The close interaction of the servers that form the cluster (the cluster nodes) guarantees maximum performance and minimum application downtime because: in the event of a software failure on one node, the application continues to function (or is automatically restarted) on other cluster nodes; failure of a cluster node (or nodes) for any reason (including human error) does not mean failure of the cluster as a whole, since applications keep running on the other nodes of the cluster. Potential outages that conventional systems cannot prevent result, in a cluster, either in some performance degradation (if nodes go down) or in a brief interruption (applications are unavailable only for the short period required to switch to another node), which allows a 99.99% availability level to be achieved.
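The downtime figures quoted above are easy to verify with a back-of-the-envelope calculation (added here for clarity, not part of the original survey data):

  \text{annual downtime} = (1 - A) \times 8760\ \text{h}

  A = 0.99:\quad 0.01 \times 8760\ \text{h} \approx 87.6\ \text{h} \approx 3.65\ \text{days}

  A = 0.99999:\quad 10^{-5} \times 8760\ \text{h} \approx 0.088\ \text{h} \approx 5.3\ \text{minutes}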

Scalability

The high cost of cluster systems is due to their complexity, so cluster scalability is quite relevant: computers whose performance satisfies today's requirements will not necessarily meet them in the future. With almost any resource in the system, sooner or later one has to face a performance problem. In this case, two scaling options are possible: vertical and horizontal.

Most computer systems allow several ways to improve performance: adding memory, increasing the number of processors in multiprocessor systems, or adding new adapters or drives. This is called vertical scaling, and it allows system performance to be improved only temporarily. Sooner or later the maximum supported amount of memory, processors or disks will be installed, system resources will be exhausted, and the user will face the same problem of improving the performance of the computer system as before.

Horizontal scaling provides the ability to add additional computers to the system and distribute work between them. Thus, the performance of the new system as a whole goes beyond the limits of the previous one. The natural limitation of such a system is the software that you decide to run on it. The simplest example of using such a system is distributing different applications among different components of the system. For example, you can move office applications to one cluster node, Web applications to another, and corporate databases to a third. However, this raises the question of how these applications interact with each other. Again, scalability is usually limited by the data used in the applications. Different applications requiring access to the same data need a way to access it from different nodes of such a system. The solution in this case is technologies that, in fact, make a cluster a cluster, and not just a set of machines connected together. At the same time, of course, the possibility of vertical scaling of the cluster system remains. Thus, thanks to both vertical and horizontal scaling, the cluster model provides serious protection of the consumer's investment.

As a variant of horizontal scaling, it is also worth noting the use of a group of computers connected through a load-balancing switch (Load Balancing technology). We will talk about this rather popular option in detail in the next article. Here we only note the low cost of such a solution, which is essentially the sum of the price of the switch ($6,000 or more, depending on the functional equipment) and a host adapter (on the order of several hundred dollars each; although, of course, ordinary network cards can also be used). Such solutions find their primary use in high-traffic Web sites where a single server cannot handle all incoming requests. The ability to distribute the load among the server nodes of such a system makes it possible to build a single Web site on many servers.

Beowulf, or Computing Power

Solutions similar to those described above are often called Beowulf clusters. Such systems are designed primarily for maximum computing power, so additional subsystems for improving reliability and fault tolerance are simply not provided. This solution has an extremely attractive price, which is probably why it has gained the most popularity in many educational and research organizations.

The Beowulf project appeared in 1994, when the idea arose to build parallel computing systems (clusters) from commodity Intel-based computers and inexpensive Ethernet networks, installing Linux and one of the free communication libraries (PVM, and later MPI) on these computers. It turned out that for many classes of problems, given a sufficient number of nodes, such systems provide performance comparable to that of a supercomputer. As practice shows, building such a system is quite simple: all that is needed is a high-performance switch and several workstations (servers) connected to it with the Linux operating system installed. However, this is not enough. For this pile of hardware to come to life, special software for parallel computing is needed.

The most common parallel programming interface in the message-passing model is MPI (Message Passing Interface). The name speaks for itself: it is a well-standardized mechanism for building parallel programs in a message-passing model. There are free (!) and commercial implementations for almost all supercomputing platforms, as well as for networks of UNIX and Windows NT workstations. MPI is currently the most widely used and dynamically developing interface of its class. The recommended free implementation of MPI is the MPICH package developed at Argonne National Laboratory. MPI is standardized by the MPI Forum. The latest version of the standard is 2.0, which adds important features such as dynamic process control, one-sided communications (Put/Get), and parallel I/O.

The constant demand for high computing power has created an attractive market for many manufacturers, some of which have developed their own technologies for connecting computers into a cluster. The best known of these are Myrinet from MyriCom and cLAN from Giganet. Myrinet is an open standard; to implement it, MyriCom offers a wide range of network equipment at relatively low prices. At the physical layer, SAN (System Area Network), LAN (CL-2) and fiber-optic networking environments are supported. Myrinet technology provides high network scalability and is currently very widely used in building high-performance clusters. Giganet develops software and hardware for direct interaction between the central processors of cluster servers at gigabit speeds, bypassing OS functions. The approximate cost of the solutions: about $2,500 for an 8-port switch and $150 per adapter for Myrinet, and about $6,250 for an 8-port switch and $800 per adapter for Giganet. The latter, by the way, received the Best of Show award at the Microsoft Tech Ed 2000 exhibition.

As an example, consider the implementation of a Beowulf cluster at the Institute for High-Performance Computing and Databases of the Ministry of Science and Technology Policy of the Russian Federation. The cluster, called "PARITET", is based on commonly available components for personal computers and workstations and provides a total peak performance of 3.2 GFLOP/s. The cluster consists of four dual-processor computing nodes based on Intel Pentium II/450 MHz processors. Each node has 512 MB of RAM and a 10 GB HDD with an Ultra Wide SCSI interface. The computing nodes of the cluster are connected by a high-performance Myrinet switch (channels with a bandwidth of 1.28 GB/s, full duplex). There is also a redundant network used for management and configuration (100 Mbit Fast Ethernet). The Linux operating system (Red Hat 5.2 distribution) is installed on the nodes of the computing cluster. The MPI/PVM message-passing interfaces are used for programming parallel applications.
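For illustration, here is a minimal MPI program of the kind such a cluster would run; it can be compiled with MPICH (mpicc) and launched with mpirun. The process count in the run command is just an example.

  /* hello_mpi.c - each process reports its rank and the node it runs on.
     Compile: mpicc hello_mpi.c -o hello_mpi
     Run (example): mpirun -np 8 ./hello_mpi                               */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size, len;
      char host[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* number of this process       */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes    */
      MPI_Get_processor_name(host, &len);     /* name of the cluster node     */

      printf("process %d of %d running on %s\n", rank, size, host);

      MPI_Finalize();
      return 0;
  }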

Mini cluster from Dell and Compaq

In addition to the switch-based solution for building a cluster, there are a number of other solutions, both hardware and software. Some are complete and are delivered "as is" - "all in one box". This last option - let's call it a "cluster in a box" - is also a fairly popular solution, since it is designed for the mass market and is an entry-level cluster (in terms of performance and scaling options). However, the way such systems are built, the interconnection of internal components, and their reliability and fault tolerance are fully consistent with "large" systems.

To understand how such a cluster works, consider two similar production systems from Compaq and Dell. Clusters from these well-known players in the computer market are built from two servers: Dell PowerEdge 6100 or PowerEdge 4200, or, in turn, Compaq ProLiant 1850R. The software used is Microsoft Cluster Server (Compaq, Dell) or Novell High-Availability Services for NetWare 4.0 / Clustering Services for NetWare 5.0 (Compaq). The software allows the two servers to be configured so that if one of the servers in the cluster fails, its work and applications are immediately and automatically transferred to the other server, eliminating downtime. Both cluster servers provide their resources for production work, so neither of them is idle waiting for the other to fail.

Communication between the two servers is carried out over a so-called heartbeat connection on a dedicated segment of the local network. When the primary server fails, the second server, which monitors the heartbeat connection, learns that the primary server is down and takes over the workload that was running on the failed machine. The functions taken over include running the applications, processes, and services required to respond to client requests directed at the failed server. Although each of the servers in the cluster must have all the resources required to take over the functions of the other, the main duties they perform can be completely different. A secondary server that is part of a failover cluster meets the requirement of providing hot standby capability, but it can also run its own applications. However, despite the extensive duplication of resources, such a cluster has a bottleneck: the SCSI bus interface and the shared external storage system, whose failure causes the cluster to fail. Although, according to the manufacturers, the probability of this is negligible.

Such mini-clusters are primarily designed for autonomous operation without constant monitoring and administration. An example of use is a solution for remote offices of large companies to ensure high availability (7x24) of the most critical applications (databases, mail systems, etc.). Given the increasing demand for powerful yet fault-tolerant entry-level systems, the market for these clusters looks quite favorable. The only "but" is that not every potential consumer of cluster systems is ready to pay about $20,000 for a two-server system.

The bottom line

To sum up, clusters have finally reached the mass market. Such a conclusion can easily be drawn from the forecasts of Standish Group International analysts, who claim that over the next two years the worldwide growth in the number of installed cluster systems will be 160%. In addition, analysts from IDC have calculated that the size of the cluster market in 1997 was only 85 million dollars, while last year this market was already "worth" 367.7 million dollars. The growth trend is obvious. Indeed, the need for cluster solutions today arises not only in large data centers, but also in small companies that do not want to live by the principle of "the miser pays twice" and prefer to invest their money in highly reliable and easily scalable cluster systems. Fortunately, there are more than enough options for implementing a cluster. However, when choosing any solution, one should not forget that all cluster parameters are interdependent. In other words, you need to clearly prioritize the required functionality of the cluster, because as performance increases, the degree of availability decreases. Increasing performance while ensuring the required level of availability inevitably leads to an increase in the cost of the solution. Thus, the user needs to do the most important thing: find the golden mean of the cluster's capabilities for his needs at the moment. This is all the more difficult to do, the more different solutions are offered on the cluster market today. In preparing the article, materials from the following WWW servers were used: http://www.dell.ru/, http://www.compaq.ru/, http://www.ibm.ru/, http://www.parallel.ru/, http://www.giganet.com/, http://www.myri.com/

ComputerPress 10'2000

As examples of modern high-performance systems, the IBM Blue Gene/L supercomputer and the SGI Altix family are considered.

Windows Compute Cluster Server (CCS) 2003 is considered as the base software for organizing computations on cluster systems. Its general characteristics and the set of services running on cluster nodes are described.

At the end of this section, the rules for working with the CCS console for launching and managing jobs are given, along with the details of how the CCS scheduler works when executing sequences of jobs on a cluster.

1.1. Architecture of high-performance processors and cluster systems

In the history of the development of the architecture of computer processors, two major stages can be distinguished:

  • Stage 1 - increasing the clock frequency of processors (up to 2000);
  • Stage 2 - the emergence of multi-core processors (after 2000).

Thus, the SMP (Symmetric MultiProcessing) approach, which was developed for building high-performance servers in which several processors share system resources, first of all RAM (see Figure 1.1), has shifted "down" to the level of the cores inside the processor.


Fig. 1.1.

On the way to multi-core processors, Hyper-Threading technology appeared first; it was first used in 2002 in Intel Pentium 4 processors:


Fig. 1.2.

In this technology, two virtual processors share all the resources of one physical processor, namely the caches, the execution pipeline, and the individual execution units. If one virtual processor has occupied a shared resource, the second waits for it to be released. Thus, a processor with Hyper-Threading can be compared to a multitasking operating system that provides each process running under it with its own virtual computer with a full set of facilities and schedules the order and time of these processes on the physical hardware; only in the case of Hyper-Threading, all of this happens at a much lower, hardware level. Nevertheless, two instruction streams allow the processor's execution units to be loaded more efficiently. The real increase in processor performance from the use of Hyper-Threading technology is estimated at 10 to 20 percent.

A full-fledged dual-core processor (see Figure 1.3) shows performance gains of 80 to 100 percent on some tasks.


Fig. 1.3.

Thus, a dual-core and, in general, multi-core processor can be considered as a miniature SMP system that does not need complex and expensive multiprocessor motherboards.

Moreover, each core can (as, for example, in the Intel Pentium Extreme Edition 840 processor) support Hyper-Threading technology, and therefore this kind of dual-core processor can execute four program threads simultaneously.

In early 2007, Intel introduced an 80-core single-chip processor called the Teraflops Research Chip (http://www.intel.com/research/platform/terascale/teraflops.htm). This processor achieves 1.01 teraflops of performance at a minimum core clock speed of 3.16 GHz and a voltage of 0.95 V, while the total power consumption of the chip is only 62 W.
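The quoted figure is easy to reconcile with the chip parameters if one assumes, as in Intel's published descriptions of this research chip, that each of the 80 cores contains two floating-point multiply-accumulate units and therefore performs 4 single-precision operations per cycle:

  80 \times 4\ \text{FLOP/cycle} \times 3.16\ \text{GHz} \approx 1.01 \times 10^{12}\ \text{FLOP/s} = 1.01\ \text{teraflops}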

According to Intel forecasts, commercial versions of processors with a large number of cores will appear within the next 5 years, and by 2010 a quarter of all servers shipped will have teraflops-level performance.

Cluster computing systems and their architecture

A cluster is a local (geographically located in one place) computing system consisting of many independent computers and a network connecting them. A cluster is also local in the sense that it is managed within a separate administrative domain as a single computer system.

The computing nodes of which it is composed are standard, general-purpose (personal) computers used in various fields and for a variety of applications. A computing node may contain either one microprocessor or several, forming, in the latter case, a symmetric (SMP) configuration.

The network component of the cluster can either be a conventional local network or be built on the basis of special network technologies that provide ultra-fast data transfer between cluster nodes. The cluster network is designed to integrate cluster nodes and is usually separated from the external network through which users access the cluster.

The cluster software consists of two components:

  • development/programming tools and
  • resource management tools.

Development tools include language compilers, libraries for various purposes, performance measurement tools, and debuggers, which together allow parallel applications to be built.

Resource management software includes installation, administration, and workflow planning tools.

Although there are many programming models for parallel processing, at the moment, the dominant approach is the message passing model, implemented in the form of the MPI (Message Passing Interface) standard. MPI is a library of functions that can be used in C or Fortran programs to send messages between parallel processes and also control these processes.
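As a sketch of what the message-passing model looks like in practice (a minimal example, not tied to any particular system described here), two processes can exchange a value as follows:

  /* sendrecv.c - rank 0 sends a number to rank 1, which doubles it and replies.
     Compile: mpicc sendrecv.c -o sendrecv ; run: mpirun -np 2 ./sendrecv        */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* to rank 1   */
          MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);  /* reply       */
          printf("rank 0 received %d back\n", value);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
          value *= 2;                                                   /* some "work" */
          MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }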

Alternatives to this approach are languages based on the so-called partitioned global address space (PGAS), typical representatives of which are HPF (High Performance Fortran) and UPC (Unified Parallel C).

Some thoughts on when it makes sense to use high-availability clusters to protect applications.

One of the main tasks in operating an IT system in any business is to ensure the continuity of the service provided. However, very often neither engineers nor IT managers have a clear idea of what "continuity" specifically means in their business. In the author's opinion, this is due to the ambiguity and vagueness of the very concept of continuity, which is why it is not always possible to say clearly what interval of service counts as continuous and what interval counts as unavailability. The situation is aggravated by the multitude of technologies designed, ultimately, to solve one common problem, but in different ways.

What technology should be chosen in each specific case to solve the tasks at hand within the available budget? In this article, we will take a closer look at one of the most popular approaches to protecting applications, namely the introduction of hardware and software redundancy, i.e. building a high-availability cluster. This task, despite the seeming simplicity of implementation, is actually very difficult to fine-tune and operate. In addition to describing well-known configurations, we will try to show what other features, not used very often, are available in such solutions, and how different cluster implementations are arranged. In addition, it is desirable that the customer, having seriously weighed all the advantages of the cluster approach, also keep its disadvantages in mind, and therefore consider the entire range of possible solutions.

What threatens applications...

According to various estimates, 55-60% of applications are critical to a company's business: the absence of the service provided by these applications will seriously affect the financial well-being of the company. In this regard, the concept of availability becomes a fundamental aspect of the data center's activities. Let's see where the threats to application availability come from.

Data destruction. One of the main threats to the availability of the service. The simplest means of protection is to take frequent "snapshots" of the data in order to be able to return to a complete copy at any time.

Hardware failure. Manufacturers of hardware systems (servers, disk storage) produce solutions with redundant components - processor boards, system controllers, power supplies, etc. However, in some cases, a hardware failure can lead to applications being unavailable.

An error in the application. A programmer's error in an application that has already been tested and put into production may manifest itself in one case out of tens or even hundreds of thousands, but if such an incident does occur, it leads to a direct loss of the organization's profit, since transaction processing stops, and the way to eliminate the error is not obvious and takes time.

Human error. A simple example: an administrator makes changes to configuration files, such as DNS. When he tests the changes, the DNS service itself works, but a service that uses DNS, such as e-mail, starts to experience problems that are not immediately detected.

Scheduled maintenance. System maintenance - replacing components, installing service packs, rebooting - is the main cause of unavailability. Gartner estimates that 80% of the time a system is unavailable is planned downtime.

General problems on the computing platform. Even if the organization does everything to protect itself from local problems, this does not guarantee the availability of the service if for some reason the entire site is unavailable. This must also be taken into account when planning the system.

...and how to deal with it

Depending on the criticality of the task, the following mechanisms for restoring the health of the computing system can be used.

Backup data to tape or disk. This is the basic level of availability - the simplest, cheapest, but also the slowest.

Local mirroring. Provides real-time data availability; data is protected from destruction.

Local clustering. Once data protection is organized, the next step in ensuring application availability is local clustering, i.e. creating redundancy in terms of both hardware and software.

Remote replication. Assumes the separation of computing sites in order to maintain a copy of the data in the separate data centers.

Remote clustering. Since the availability of data at different sites is ensured, it is also possible to maintain the availability of the service from different sites by organizing access of applications to this data.

We will not dwell here on the description of all these methods, since each item may well become the topic of a separate article. The idea is clear - the more redundancy we introduce, the higher the cost of the solution, but the better the applications are protected. For each of the methods listed above, there is an arsenal of solutions from different manufacturers, but with a typical set of features. It is very important for the solution designer to keep all these technologies in mind, since only their competent combination will lead to an exhaustive solution of the task set by the customer.

In the author's opinion, Symantec's approach is very helpful for understanding the service recovery strategy (Fig. 1). There are two key points here: the point in time to which the system is restored (recovery point objective, RPO), and the time required to restore the service (recovery time objective, RTO).
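In slightly more formal terms (a paraphrase for clarity, not Symantec's wording): if a failure occurs at time t_f, the last usable copy of the data was made at time t_c, and the service is back in operation at time t_r, then the solution must guarantee

  t_f - t_c \le \text{RPO} \qquad \text{and} \qquad t_r - t_f \le \text{RTO}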

The choice of one or another tool depends on the specific requirements that apply to a critical application or database.

For the most critical systems, RTO and RPO should not exceed 1 hour. Tape-based systems provide a recovery point of two or more days. In addition, tape recovery is not automated; the administrator must constantly keep track of whether everything has been properly restored and started.

Moreover, as already mentioned, when planning an availability scheme, one tool is not enough. For example, it hardly makes sense to use only the replication system. Although critical data is located at a remote site, the applications must be manually launched in the appropriate order. Thus, replication without automatic startup of applications can be considered as a kind of expensive backup.

If you need to provide an RTO and RPO measured in minutes, i.e. the task requires minimizing downtime (both planned and unplanned), then the only right solution is a high-availability cluster. It is such systems that are considered in this article.

Since the concept of a "computing cluster" has for some time been overloaded because of the great diversity of such systems, let us first say a little about what clusters are.

Cluster types

In its simplest form, a cluster is a system of computers working together to solve common problems. This is not a client/server data processing model, where an application can be logically separated in such a way that clients can send requests to different servers. The idea of a cluster is to pool the computing resources of related nodes to create redundant resources that provide greater shared computing power, high availability, and scalability. Thus, clusters do not just process client requests to servers, but use many computers simultaneously, presenting them as a single system and thereby providing significantly greater computing capabilities.

A cluster of computers must be a self-organizing system: the work performed on one of the nodes must be coordinated with the work on the other nodes. This leads to complex configuration relationships, intensive communication between cluster nodes, and the need to solve the problem of access to data in a shared file system. There are also operational issues associated with running a potentially large number of computers as a single resource.

Clusters can exist in various forms. The most common types of clusters are high performance computing (HPC) and high availability (HA).

High-performance computing clusters use parallel computing methods with the participation of as much processor power as possible to solve the problem. There are many examples of such solutions in scientific computing, where many low-cost processors are used in parallel to perform a large number of operations.

However, the topic of this article is high availability systems. Therefore, further, speaking of clusters, we will have in mind just such systems.

As a rule, a high degree of redundancy is used when building clusters to create a reliable environment, i.e. a computing system is created in which the failure of one or more components (hardware, software or network facilities) does not significantly affect the availability of the application or the system as a whole.

In the simplest case, these are two identically configured servers with access to a shared storage system (Fig. 2). During normal operation, the application software runs on one system, while the second system waits to take over the applications if the first one fails. When a failure is detected, the second system takes over the corresponding resources (file system, network addresses, etc.). This process is commonly referred to as failover. The second system completely replaces the failed one, and the user does not need to know that his applications are now running on a different physical machine. This is the most common two-node asymmetric configuration, where one server is active and the other is passive, that is, it is in a standby state in case the main one fails. In practice, this is the scheme that works in most companies.
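To make the failover mechanism more concrete, below is a highly simplified sketch (not the code of any real cluster product) of what the passive node does: it listens for heartbeat packets from the active node and, if they stop arriving, takes over the resources. The port number, the timeout and the take_over_resources() function are hypothetical placeholders.

  /* failover_monitor.c - minimal sketch of a passive node watching heartbeats */
  #include <stdio.h>
  #include <time.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <sys/select.h>

  #define HB_PORT    9000   /* hypothetical heartbeat port                     */
  #define HB_TIMEOUT 5      /* seconds without a heartbeat before failing over */

  static void take_over_resources(void)
  {
      /* Placeholder: mount shared storage, bring up the service IP address,
         start the protected applications.                                   */
      printf("failover: taking over cluster resources\n");
  }

  int main(void)
  {
      int sock = socket(AF_INET, SOCK_DGRAM, 0);
      struct sockaddr_in addr = { 0 };
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(HB_PORT);
      if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
          return 1;

      time_t last_hb = time(NULL);
      char   buf[64];

      for (;;) {
          fd_set fds;
          FD_ZERO(&fds);
          FD_SET(sock, &fds);
          struct timeval tv = { 1, 0 };            /* poll once a second       */

          if (select(sock + 1, &fds, NULL, NULL, &tv) > 0 &&
              recv(sock, buf, sizeof(buf), 0) > 0)
              last_hb = time(NULL);                /* active node is alive     */

          if (time(NULL) - last_hb > HB_TIMEOUT) {
              take_over_resources();
              break;                               /* we are now the active node */
          }
      }
      return 0;
  }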

However, the question must be asked: how acceptable is it to keep an additional set of equipment that is effectively in reserve and is not used most of the time? The problem of unloaded equipment is solved by changing the cluster scheme and the allocation of resources in it.

Cluster Configurations

In addition to the two-node asymmetric cluster structure mentioned above, there are configurations that different cluster software vendors may call by different names, but whose essence is the same.

Symmetric cluster

A symmetric cluster is also built on two nodes, but each of them runs an active application (Fig. 3). The cluster software ensures the correct automatic transfer of an application from server to server if one of the nodes fails. In this case, the hardware is used more efficiently, but if a malfunction occurs, the applications of the entire system end up running on one server, which can have undesirable consequences in terms of performance. In addition, you need to consider whether it is possible to run multiple applications on the same server.

N+1 configuration

This configuration already includes more than two nodes, and among them there is one dedicated, redundant one (Fig. 4). In other words, there is one hot standby for every N running servers. In the event of a malfunction, the application will “move” from the problem node to a dedicated free node. In the future, the cluster administrator will be able to replace the failed node and designate it as a standby.

A less flexible variant of N+1 is the N-to-1 configuration, where the standby node always remains the same for all working nodes. In the event of a failure of an active server, the service switches to the standby one, and the system remains without a standby until the failed node is brought back into operation.

Of all the cluster configurations, N+1 is probably the best in terms of the ratio of complexity to efficiency of equipment use. Table 1 below confirms this assessment.

N to N configuration

This is the most efficient configuration in terms of the use of computing resources (Fig. 5). All servers in it are working, each of them runs applications that are part of the cluster system. If a failure occurs on one of the nodes, applications are moved from it in accordance with the established policies to the remaining servers.

When designing such a system, it is necessary to take into account the compatibility of applications, their connections when they "move" from node to node, server load, network bandwidth, and much more. This configuration is the most complex to design and operate, but it provides the most value for your hardware when using cluster redundancy.

Assessment of cluster configurations

Table 1 summarizes what has been said above about the various cluster configurations. The assessment is given on a four-point scale (4 is the highest score, 1 the lowest).

Table 1 shows that the classical asymmetric system is the simplest in terms of design and operation. If the customer can operate such a system independently, the other configurations would be better handed over to external maintenance.

In conclusion of the discussion of configurations, I would like to say a few words about the criteria by which the cluster core can automatically give the command to "move" an application from node to node. The vast majority of administrators define only one criterion in the configuration files: the unavailability of some component of the node, that is, a software or hardware error.

Meanwhile, modern cluster software provides the ability to balance the load. If the load on one of the nodes reaches a critical value, with a correctly configured policy, the application on it will be correctly shut down and launched on another node, where the current load allows it. Moreover, the server load control tools can be both static - the application in the cluster configuration file itself indicates how many resources it needs - and dynamic, when the load balancing tool is integrated with an external utility (for example, Precise), which calculates the current system load.
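A dynamic policy of this kind boils down to periodically comparing the node's load with a threshold. The sketch below illustrates the idea only; the threshold value, the application name and the migrate_application() call are hypothetical placeholders, not part of any real cluster product.

  /* load_policy.c - sketch of a dynamic, load-based migration policy */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define LOAD_THRESHOLD 8.0   /* hypothetical critical 1-minute load average */

  static void migrate_application(const char *app)
  {
      /* Placeholder: a real cluster core would stop the application here
         and start it on a less loaded node.                               */
      printf("policy: requesting migration of %s to another node\n", app);
  }

  int main(void)
  {
      double load[1];

      for (;;) {
          if (getloadavg(load, 1) == 1 && load[0] > LOAD_THRESHOLD)
              migrate_application("billing-db");   /* example application name */
          sleep(60);                               /* re-check once a minute   */
      }
  }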

Now, in order to understand how clusters work in specific implementations, let's look at the main components of any high availability system.

Main cluster components

Like any complex system, a cluster, regardless of the specific implementation, consists of hardware and software components.

As for the hardware on which the cluster is assembled, the main component here is the inter-node connection, or internal cluster interconnect, that provides the physical and logical connection of the servers. In practice, this is an internal Ethernet network with duplicated links. Its purpose is, firstly, to carry the packets that confirm the integrity of the system (the so-called heartbeat), and secondly, depending on the design or on the scheme that arises after a failure, to carry between nodes the data traffic that is intended for transmission to the outside world. Other components are obvious: the nodes running the OS with the cluster software, the disk storage that the cluster nodes have access to, and finally the common network through which the cluster interacts with the outside world.

Software components provide control over the operation of the cluster application. First of all, this is a common OS (not necessarily the same version on all nodes). The cluster core, the cluster software, runs in the environment of this OS. The applications that are clustered, i.e. that can migrate from node to node, are controlled - started, stopped, tested - by small scripts, the so-called agents. There are standard agents for most tasks, but at the design stage it is imperative to check the compatibility matrix to see whether agents exist for your specific applications.
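Conceptually, an agent is just a set of three entry points: online (start the resource), offline (stop it) and monitor (check its health). The sketch below illustrates this idea in C; it is not the actual agent API of VCS or Sun Cluster, and the apachectl/pgrep commands are only examples.

  /* agent.c - conceptual sketch of a cluster agent for a single application */
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
      const char *name;
      int (*online)(void);    /* start the application                    */
      int (*offline)(void);   /* stop it cleanly                          */
      int (*monitor)(void);   /* return 0 if the application is healthy   */
  } cluster_agent;

  static int web_online(void)  { return system("apachectl start"); }
  static int web_offline(void) { return system("apachectl stop");  }
  static int web_monitor(void) { return system("pgrep httpd > /dev/null"); }

  int main(void)
  {
      cluster_agent web = { "webserver", web_online, web_offline, web_monitor };

      /* The cluster core would call monitor() periodically and decide,
         based on the result, whether to restart the resource locally
         or fail it over to another node.                                */
      if (web.monitor() != 0)
          printf("agent %s: resource is offline\n", web.name);
      return 0;
  }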

Cluster implementations

There are many implementations of the cluster configurations described above on the software market. Almost all major server and software manufacturers - for example, Microsoft, HP, IBM, Sun, Symantec - offer their products in this area. Microtest has experience with Sun Cluster Server (SC) solutions from Sun Microsystems (www.sun.com) and Veritas Cluster Server (VCS) from Symantec (www.symantec.com). From the point of view of the administrator, these products are very similar in terms of functionality - they provide the same settings and reactions to events. However, in terms of their internal organization, these are completely different products.

SC was developed by Sun for its own Solaris OS and therefore only runs on that OS (both SPARC and x86). As a result, during installation, SC is deeply integrated with the OS and becomes part of it, part of the Solaris kernel.

VCS is a multi-platform product that works with almost all currently popular operating systems - AIX, HP-UX, Solaris, Windows, Linux, and is an add-on - an application that controls the operation of other applications that are subject to clustering.

We will look at the internal implementation of these two systems - SC and VCS. But we emphasize once again that despite the difference in terminology and completely different internal structure, the main components of both systems with which the administrator interacts are essentially the same.

Sun Cluster Server Software Components

The SC core (Figure 6) is Solaris 10 (or 9) OS with an add-on shell that provides a high availability feature (the core is highlighted in green). Next are the global components (light green) that provide their services derived from the cluster core. And finally, at the very top - custom components.

The HA framework is a component that extends the Solaris kernel to provide cluster services. The framework's work starts with the initialization code that boots the node into cluster mode. The main tasks of the framework are inter-node interaction, managing the state of the cluster, and managing membership in it.

The node-to-node communication module transmits heartbeat messages between nodes. These are short messages confirming that the neighboring node is alive. The interaction of data and applications is also managed by the HA framework as part of inter-node communication. In addition, the framework manages the integrity of the cluster configuration and, if necessary, performs recovery and update tasks. Integrity is maintained through the quorum device; if necessary, reconfiguration is performed. The quorum device is an additional mechanism for checking the integrity of cluster nodes through small sections of the shared file system. The latest version of the cluster, SC 3.2, introduced the ability to place the quorum device outside the cluster system, that is, to use an additional server on the Solaris platform accessible via TCP/IP. Failed cluster members are removed from the configuration; an element that becomes operational again is automatically included back into the configuration.

The functions of the global components derive from the HA framework. These include:

  • global devices with a common cluster device namespace;
  • a global file service that organizes access to every file in the system for each node as if it were in its own local file system;
  • a global network service that provides load balancing and the ability to access cluster services through a single IP.

Custom components manage the cluster environment at the top level of the application interface. It is possible to administer both through the graphical interface and through the command line. The modules that monitor the operation of applications, start and stop them are called agents. There is a library of ready-made agents for standard applications; this list grows with every release.

Veritas Cluster Server software components

Schematically, a two-node VCS cluster is shown in Fig. 7. Inter-node communication in VCS is based on two protocols - LLT and GAB. VCS uses an internal network to maintain cluster integrity.

LLT (Low Latency Transport) is a protocol developed by Veritas that runs over Ethernet as a highly efficient replacement for the IP stack and is used by the nodes for all internal communication. The required redundancy in inter-node communication calls for at least two completely independent internal networks. This is necessary so that VCS can distinguish a network failure from a system failure.

The LLT protocol performs two main functions: traffic distribution and sending heartbeats. LLT distributes (balances) inter-node communication among all available internal links. This scheme ensures that all internal traffic is randomly distributed among the internal networks (there can be a maximum of eight), which improves performance and fault tolerance. In the event of a failure of one link, the data is redirected to the remaining ones. In addition, LLT is responsible for sending heartbeat traffic over the network, which is used by GAB.

GAB (Group Membership Services/Atomic Broadcast) is the second protocol used in VCS for internal communication. Like LLT, it is responsible for two tasks. The first is the membership of nodes in the cluster. GAB receives heartbeats from each node via LLT. If the system does not receive a response from a node for a long time, it marks its state as DOWN - inoperative.

The second function of GAB is to provide reliable inter-cluster communication. GAB provides guaranteed delivery of broadcasts and point-to-point messages between all nodes.

The control component of VCS is the VCS engine, or HAD (High Availability Daemon), running on every system. It is responsible for:

  • building working configurations obtained from configuration files;
  • distribution of information between new nodes joining the cluster;
  • processing input from the administrator (operator) of the cluster;
  • performing routine actions in case of failure.

HAD uses agents to monitor and manage resources. Information about the state of resources is collected from agents on the local systems and transmitted to all members of the cluster. The HAD of each node receives information from the other nodes, updating its own picture of the entire system. HAD acts as a replicated state machine (RSM), i.e. the engine on each node has a picture of the resource state that is fully synchronized with all other nodes.

The VCS cluster is managed either through the Java console or through the Web interface.

Which is better

We have already discussed above when which cluster is better to use. Let us emphasize once again that the SC product was written by Sun for its own OS and is deeply integrated with it, while VCS is a multi-platform product and therefore more flexible. Table 2 compares some of the capabilities of these two solutions.

In conclusion, I would like to give one more argument in favor of using SC in the Solaris environment. By using both hardware and software from a single manufacturer, Sun Microsystems, the customer receives "single window" service for the entire solution. Despite the fact that vendors are now creating joint centers of competence, the time spent relaying requests between the software and hardware manufacturers reduces the speed of response to an incident, which does not always suit the user of the system.

Territorially distributed cluster

We have looked at how a high-availability cluster is built and operates within a single site. Such an architecture can only protect against local problems within a single node and the data associated with it. In the event of problems affecting the entire site, whether technical, natural or otherwise, the entire system becomes inaccessible. Today, more and more tasks arise whose criticality requires the migration of services not only within a site, but also between geographically dispersed data centers. When designing such solutions, new factors have to be taken into account: the distance between sites, channel bandwidth, and so on. Which replication should be preferred - synchronous or asynchronous, host-based or array-based - and what protocols should be used? The success of the project may depend on the answers to these questions.

Replication of data from the main site to the backup site is most often performed using one of the popular packages: Veritas Volume Replicator, EMC SRDF, Hitachi TrueCopy, Sun StorageTek Availability Suite.

In the event of a hardware failure or an application or database problem, the cluster software will first try to move the application service to another node in the main site. If the primary site becomes unavailable to the outside world for any reason, all services, including DNS, migrate to the backup site, where data is already present due to replication. Thus, for users, the service is resumed.

The disadvantage of this approach is the huge cost of deploying an additional "hot" site with equipment and network infrastructure. However, the benefit of complete protection may outweigh these additional costs. If the central node is unable to provide service for a long time, this can lead to large losses and even to the death of the business.

System test before disaster

According to a study by Symantec, only 28% of companies test a disaster recovery plan. Unfortunately, most of the customers with whom the author had to talk on this issue did not have such a plan at all. The reasons why testing is not done are the lack of time for administrators, the reluctance to do it on a "live" system, and the lack of test equipment.

For testing, you can use the simulator included in the VCS package. Users who choose VCS as their cluster software can test their setup on the Cluster Server Simulator, which allows them to test their application migration strategy between nodes on a PC.

Conclusion

The task of providing a service with a high level of availability is very expensive, both in terms of the cost of hardware and software and in terms of the cost of further maintenance and technical support of the system. Despite the apparent simplicity of the theory and the simple installation, a cluster system, when studied in depth, turns out to be a complex and expensive solution. In this article, the technical side of the system was considered only in general terms; a separate article could be written on certain aspects of clustering, for example, determining membership in the cluster.

Clusters are usually built for business-critical tasks, where even a short period of downtime results in large losses, for example, for billing systems. The following rule of thumb could be recommended for determining where it is reasonable to use clusters: where the downtime of the service must not exceed an hour and a half, a cluster is the appropriate solution. In other cases, less expensive options can be considered.

First, we need to determine whom this article is intended for, so that readers can decide whether it is worth spending time on it.

The need to write this article arose after I gave a seminar at the ENTEREX'2002 exhibition in Kyiv. It was then, at the beginning of 2002, that I saw that interest in the topic of cluster systems had grown significantly compared with what was observed just a couple of years earlier.

At the seminar, and in this article, I did not set out to analyze ways of solving specific applied problems on cluster systems; that is a separate and very extensive topic. I set myself the task of acquainting readers with the terminology and the tools for building cluster systems, as well as showing for what kinds of tasks clustering is useful. To fully convince the doubters, the article provides specific examples of cluster system implementations and my contact details, through which I am ready, as far as possible, to answer questions related to cluster technologies, as well as to accept your comments and suggestions.

The concept of cluster systems

Figure 1. Cluster system

  • LAN - Local Area Network, local area network
  • SAN - Storage Area Network, storage area network

The term "cluster" was first defined in the classification of computing systems by Digital Equipment Corporation (DEC).

According to the DEC definition, a cluster is a group of computers that are interconnected and function as one information processing node.

The cluster functions as a single system, that is, for a user or an application task the entire set of computing hardware looks like a single computer. This is what matters most when building a cluster system.

Digital's first clusters were built on VAX machines. These machines are no longer in production, but are still in operation at sites where they were installed many years ago. And perhaps the most important thing is that the general principles laid down in their design remain the basis for building cluster systems today.

The general requirements for cluster systems include:

  1. High Availability
  2. High performance
  3. Scaling
  4. Sharing resources
  5. Serviceability

Naturally, in particular implementations some of these requirements are put at the forefront while others fade into the background. For example, when implementing a cluster for which speed is most important, less attention is paid to high availability in order to save resources.

In the general case, the cluster functions as a multiprocessor system, therefore, it is important to understand the classification of such systems within the framework of the distribution of software and hardware resources.


Figure 2. A tightly coupled multiprocessor system


Figure 3. Moderately coupled multiprocessor system


Figure 4 Loosely coupled multiprocessor system

The PC platforms that I usually work with use cluster system implementations based on the tightly coupled and moderately coupled multiprocessor architecture models.

Division into High Availability and High Performance systems

In the functional classification, clusters can be divided into "High Performance" (HP), "High Availability" (HA), and "Mixed Systems".

High-speed clusters are used for tasks that require significant processing power. The classic areas in which such systems are used are:

  • image processing: rendering, pattern recognition
  • scientific research: physics, bioinformatics, biochemistry, biophysics
  • industry (geoinformation tasks, mathematical modeling)

and many others…

Clusters that are high-availability systems are used wherever the cost of possible downtime exceeds the cost of building a cluster system, for example:

  • billing systems
  • banking operations
  • electronic commerce
  • enterprise management, etc.

Mixed systems combine the features of both. When positioning them, it should be noted that a cluster that has both High Performance and High Availability properties will inevitably lose in performance to a system oriented purely toward high-speed computing, and in possible downtime to a system oriented toward high-availability operation.

Issues of High Performance Clusters


Figure 5. High Speed Cluster

In almost any parallel-oriented task, it is impossible to avoid the need to transfer data from one subtask to another.

Thus, the performance of a High Performance cluster system is determined by the performance of the nodes and of the links between them. Moreover, the influence of the speed parameters of these links on the overall performance of the system depends on the nature of the task being executed. If a task requires frequent data exchange between subtasks, then maximum attention should be paid to the speed of the communication interface. Naturally, the less the parts of a parallel task interact with each other, the less time it will take to complete it, which also dictates certain requirements for how parallel tasks are programmed.

The main problems with the need to exchange data between subtasks arise from the fact that the speed of data transfer between the central processor and the RAM of a node significantly exceeds the speed of the computer-to-computer interconnect. In addition, the difference between the speed of the processor cache and that of the inter-node links significantly changes how the system functions compared with the SMP systems familiar to us.

The speed of interfaces is characterized by two parameters: the throughput of a continuous data stream and the maximum number of the smallest packets that can be transmitted per unit of time. We will consider options for implementing communication interfaces in the section “Implementation Tools for High Performance Clusters”.
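A common first-order model (added here for clarity, not taken from the original text) combines these two parameters into the time needed to transfer a message of size m:

  T(m) = L + \frac{m}{B}, \qquad B_{\text{eff}}(m) = \frac{m}{L + m/B}

where L is the per-message latency and B the peak throughput: for short messages the latency term dominates, for long messages the throughput does.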

Issues of High Availability Cluster Systems

Today, several types of high-availability systems are common in the world. Among them, the cluster system is the embodiment of technologies that provide the highest level of fault tolerance at the lowest cost. Cluster fault tolerance is achieved by duplicating all vital components. A maximally fault-tolerant system should not have a single point of failure, that is, an active element whose failure can lead to a loss of system functionality. This characteristic is usually called NSPF (No Single Point of Failure).
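The effect of such duplication can be estimated with the usual reliability formula (a textbook approximation assuming independent failures, not a figure from the original text): if a single component has availability A, then n redundant copies together give

  A_n = 1 - (1 - A)^n

so, for example, two components with A = 0.99 each yield 1 - 0.01^2 = 0.9999.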


Figure 6. Cluster system with no single point of failure

When building high availability systems, the main goal is to ensure minimal downtime.

For a system to achieve high availability, it is necessary that:

  • its components are as reliable as possible
  • it is fault-tolerant, ideally with no single point of failure
  • it is easy to maintain and allows components to be replaced without shutting down the system

Neglect of any of the specified parameters may lead to loss of system functionality.

Let's briefly go over all three points.

Maximum reliability is achieved by using highly and ultra-highly integrated electronic components and by maintaining normal operating conditions, including thermal ones.

Fault tolerance is provided by specialized components (ECC and Chipkill memory modules, redundant power supplies, etc.) as well as by clustering technologies. Thanks to clustering, when one of the computers fails, its tasks are redistributed among the remaining, properly functioning nodes of the cluster. One of the most important goals for cluster software vendors is therefore to minimize system recovery time after a failure, since fault tolerance exists precisely to minimize so-called unscheduled downtime.

Many people forget that ease of maintenance, which reduces planned downtime (for example, when replacing failed equipment), is one of the most important parameters of a high availability system. If the system does not allow components to be replaced without shutting down the entire complex, its availability factor decreases.

Mixed architectures


Figure 7. High Speed Failover Cluster

Today you can often find mixed cluster architectures that are at once high-availability systems and high-speed clusters in which application tasks are distributed across the nodes. A fault-tolerant complex whose speed can be increased simply by adding a node is considered an ideal solution when building a computing system. However, such mixed architectures require combining a large number of expensive components to provide high performance and redundancy at the same time, and since the most expensive component of a High Performance cluster is the high-speed communication system, duplicating it entails significant financial cost. It should also be noted that high availability systems are often used for OLTP tasks, which run best on symmetric multiprocessor systems, so implementations of such clusters are often limited to two-node configurations focused primarily on high availability. Recently, however, using inexpensive systems with more than two nodes as components of mixed HA/HP clusters has become a popular solution.

This is confirmed, in particular, by a report published by The Register:

"The chairman of Oracle Corporation announced that in the near future the three Unix servers that run the bulk of the company's business applications will be replaced with a block of servers based on Intel processors running Linux OS. Larry Ellison insists that the introduction of cluster support when working with applications and databases, reduces costs and improves fault tolerance."

Tools for implementing High Performance Clusters

The most popular communication technologies for building supercomputers based on cluster architectures today are:

Myrinet, Virtual Interface Architecture (Giganet's cLAN is one of the first commercial hardware implementations), SCI (Scalable Coherent Interface), QsNet (Quadrics Supercomputers World), Memory Channel (developed by Compaq Computer and Encore Computer Corp), as well as the well-known Fast Ethernet and Gigabit Ethernet.


Figure 8. Continuous Data Rate


Figure 9. Zero Length Packet Transmission Time

These diagrams (Fig. 8 and 9) show the speed of the hardware implementations of the various technologies, but it should be remembered that on real tasks, and on different hardware platforms, the latency and data transfer rate actually obtained are 20-40%, and sometimes as much as 100%, worse than the maximum possible values.

For example, when using MPI libraries with cLAN communication cards in Intel-based servers with a PCI bus, the real channel throughput is 80-100 MByte/s and the latency is about 20 µs.
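
Plugging these measured figures into the simple model sketched above (as an estimate only, not a vendor specification): the message size at which half of the peak bandwidth is reached is roughly the latency-bandwidth product, about 20 µs × 100 MByte/s ≈ 2 KByte. Messages much shorter than that are dominated by latency, which is why the packet-rate parameter matters so much for fine-grained parallel tasks.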

One of the problems that arise when using high-speed interfaces such as SCI is that the PCI architecture is poorly suited to high-speed devices of this type. However, if the PCI bridge is redesigned around a single data-transfer device, this problem is solved. Such implementations exist in the solutions of some manufacturers, for example Sun Microsystems.

Thus, when designing high-speed cluster systems and calculating their performance, one should take into account the performance losses associated with the processing and transmission of data in the cluster nodes.

Table 1. Comparison of high-speed communication interfaces

Technology         | Bandwidth, MByte/s | Latency, µs | Card / 8-port switch cost | Platform support     | Comment
Fast Ethernet      | 12.5               | 158         | 50/200                    | Linux, UNIX, Windows | Low prices, popular
Gigabit Ethernet   | 125                | 33          | 150/3500                  | Linux, UNIX, Windows | Ease of upgrade
Myrinet            | 245                | 6           | 1500/5000                 | Linux, UNIX, Windows | Open standard, popular
VI (Giganet cLAN)  | 150                | 8           | 800/6500                  | Linux, Windows       | First industrial hardware implementation of VI
SCI                | 400                | 1.5         | 1200/5000 *               | Linux, UNIX, Windows | Standardized, widely used
QsNet              | 340                | 2           | N/A **                    | Tru64 UNIX           | AlphaServer SC and Quadrics systems
Memory Channel     | 100                | 3           | N/A                       | Tru64 UNIX           | Used in Compaq AlphaServer

* SCI hardware (and its supporting software) allows so-called mesh topologies to be built without the use of switches

** no data


Figure 10. A tightly coupled multiprocessor system with non-uniform memory access

One interesting property of low-latency communication interfaces is that they can be used to build NUMA systems, as well as systems that emulate multiprocessor SMP machines at the software level. The advantage of such a system is that standard operating systems and software designed for SMP can be used; however, because inter-node latency is several times higher than in a real SMP machine, the performance of such a system is hard to predict.

Parallelization tools

There are several different approaches to programming parallel computing systems:

  • in standard, widely used programming languages, using communication libraries and interfaces for organizing interprocessor interaction (PVM, MPI, HPVM, MPL, OpenMP, ShMem); a brief C sketch follows this list
  • using specialized parallel programming languages and parallel extensions of standard languages (parallel implementations of Fortran and C/C++, ADA, Modula-3)
  • using automatic and semi-automatic parallelization of sequential programs (BERT 77, FORGE, KAP, PIPS, VAST)
  • programming in standard languages using parallel procedures from specialized libraries oriented towards problems in specific areas, for example linear algebra, Monte Carlo methods, genetic algorithms, image processing, molecular chemistry, etc. (ATLAS, DOUG, GALOPPS, NAMD, ScaLAPACK).
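
As a small illustration of the first approach, here is a minimal OpenMP fragment in C. It is a sketch under generic assumptions rather than a recipe for any particular package, and OpenMP itself parallelizes work across the processors of a shared-memory node rather than across cluster nodes.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    /* Sum a vector with a directive-based parallel loop.
       Compile with an OpenMP-capable compiler, e.g.  cc -fopenmp sum.c  */
    int main(void)
    {
        static double a[N];
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = 1.0;                      /* fill with test data */

        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.0f (max OpenMP threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

A distributed-memory counterpart of the same idea, written with MPI, appears a little further on.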

There are also many tools that simplify the design of parallel programs. For example:

  • CODE: a graphical system for creating parallel programs. A parallel program is represented as a graph whose vertices are sequential parts of the program. The PVM and MPI libraries are used for message passing.
  • TRAPPER: a commercial product of the German company Genias; a graphical programming environment containing components for building parallel software.

According to the experience of users of high-speed cluster systems, programs written specifically with interprocessor interaction in mind run most efficiently. Even though it is far more convenient to program with packages that use a shared-memory interface or with automatic parallelization tools, the MPI and PVM libraries remain the most widely used today.

Given the wide popularity of MPI (the Message Passing Interface), I would like to say a little about it.

"Message passing interface" is a standard that is used to build parallel programs and uses a messaging model. There are MPI implementations for C/C++ and Fortran both in free and commercial versions for most common supercomputing platforms, including High Performance cluster systems built on Unix, Linux and Windows nodes. The MPI Forum () is responsible for standardizing MPI. The new version of the 2.0 standard describes a large number of new interesting mechanisms and procedures for organizing the operation of parallel programs: dynamic process control, one-way communications (Put/Get), parallel I/O. But unfortunately, there are no complete ready-made implementations of this version of the standard yet, although some of the innovations are already being actively used.

To give a sense of what MPI can do, I would like to present a graph of the time needed to solve a system of linear equations as a function of the number of processors involved in the cluster. The cluster is built on Intel processors with SCI (Scalable Coherent Interface) inter-node connections. Naturally, this is one particular problem, and the results should not be taken as a general model for predicting the performance of an arbitrary system.


Figure 11. Time to solve a system of linear equations as a function of the number of processors in the cluster

The graph shows two curves: the blue one is linear speedup, and the red one is the speedup actually obtained in the experiment. That is, each newly added node yields better-than-linear speedup. The author of the experiment attributes this to more effective use of cache memory, which is quite logical: as the data are spread over more nodes, a larger share of each node's working set fits into its cache. If anyone has thoughts or ideas on this, I would be grateful if you shared them (my e-mail: [email protected]).
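
For anyone who wants to repeat such a measurement on their own cluster, the curves in Figure 11 are derived from raw timings in a straightforward way. The numbers in the sketch below are invented placeholders used only to show the arithmetic; superlinear scaling shows up as an efficiency greater than 1.

    #include <stdio.h>

    /* Compute speedup and parallel efficiency from measured run times.
       The timings below are made up, not the data behind Figure 11. */
    int main(void)
    {
        double t1 = 100.0;                          /* run time on 1 node, seconds */
        double tp[] = { 100.0, 48.0, 31.0, 23.0 };  /* hypothetical times for 1..4 nodes */
        int p;

        for (p = 1; p <= 4; p++) {
            double speedup    = t1 / tp[p - 1];
            double efficiency = speedup / p;        /* > 1.0 means superlinear scaling */
            printf("p=%d  speedup=%.2f  efficiency=%.2f\n", p, speedup, efficiency);
        }
        return 0;
    }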

Means for implementing High Availability Clusters

High Availability clusters can be divided into:

  • Shared Nothing Architecture (architecture without resource sharing)
  • Shared Disk Architecture (architecture with shared disks)


Figure 12. Architecture without resource sharing

A shared-nothing architecture uses no shared storage: each node has its own disk drives, which are not shared with the other nodes of the cluster. In fact, at the hardware level only the communication channels are shared.


Figure 13. Architecture with shared drives

The shared-disk architecture is classically used to build high-availability cluster systems oriented towards processing large volumes of data. Such a system consists of a shared storage subsystem and cluster nodes that share access to the data. When the storage system has sufficient capacity and the workload is data-oriented, the shared-disk architecture is the more efficient one: there is no need to keep multiple copies of the data, and when a node fails its tasks are instantly available to the other nodes.

If the data can be logically partitioned so that a request from a certain subset of requests can be processed using only a portion of the data, then a shared-nothing system may be the more efficient solution.
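
A toy sketch of such logical separation is given below; the key format and the hash are invented for the illustration and are not a feature of any product mentioned in this article. Each incoming request is routed to the node that owns the corresponding slice of the data, so no node needs another node's disks to answer it.

    #include <stdio.h>

    #define NODES 4

    /* Route a request to a node by hashing its key: the essence of the
       data partitioning that makes a shared-nothing cluster effective. */
    static unsigned node_for_key(const char *key)
    {
        unsigned h = 0;
        while (*key)
            h = h * 31u + (unsigned char)*key++;
        return h % NODES;
    }

    int main(void)
    {
        const char *accounts[] = { "ACC-1001", "ACC-1002", "ACC-2077" };
        int i;

        for (i = 0; i < 3; i++)
            printf("%s -> node %u\n", accounts[i], node_for_key(accounts[i]));
        return 0;
    }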

In my opinion, the possibility of building heterogeneous cluster systems is interesting. For example, the Tivoli SANergy software makes it possible to build systems in which heterogeneous nodes share access to data. Such a solution can be very useful for collaborative processing of video or other data in an organization where the required range of solutions simply does not exist on a single platform, or where an existing fleet of hardware and software needs to be used more efficiently.


Figure 14. Heterogeneous cluster system

The most popular commercial systems today are two-node failover clusters. With respect to how software resources are distributed, Active-Active and Active-Passive implementation models of fault-tolerant cluster systems are distinguished.


Figure 15. Active-Active Model

In the Active-Active model we get not only a fault-tolerant solution but, in effect, a high-speed one as well, since a single task runs on several servers at once. This option is implemented, for example, in Oracle Parallel Server, MS SQL 2000 and IBM DB2. In other words, such a model is possible only if the application software is written to operate in cluster mode (the exception being cluster systems with shared RAM). In the Active-Active model the speed of a task can be scaled by adding a new node, provided, of course, that the software supports the required number of nodes. For example, Oracle Parallel Server 8.0.5 supports clusters of 2 to 6 nodes.


Figure 16. Active-Active cluster on 3 nodes

Very often users face the problem of ensuring fault-tolerant operation of off-the-shelf software. Unfortunately, the Active-Active model does not work in this case. For such situations a model is used in which the tasks that were running on the failed node migrate to other nodes; this is the Active-Passive implementation.


Figure 17. Active-Passive Model

Considering that in many cases a single task can be split into several areas of responsibility, and that in general an enterprise needs to run many different tasks, the so-called pseudo Active-Active cluster model is used.


Figure 18. Pseudo Active-Active cluster on 3 nodes

If fault-tolerant operation of several software resources is required, it is enough to add a new node to the system and run the required tasks on the cluster; if that node fails, those tasks are transferred for execution to another node. Such a model is implemented in the ReliantHA software for the Caldera OpenUnix and UnixWare operating systems, which supports clusters of 2 to 4 nodes, as well as in MSCS (Microsoft Cluster Service) and in Linux failover cluster solutions.

The communication system of a failover cluster can be built on the same hardware as in high-speed clusters. However, when implementing a shared-disk architecture, high-speed access to the shared storage system must also be provided. Today this problem has many solutions.

In the simplest two-node model, access to the disks can be arranged by connecting them directly to a common SCSI bus,


Figure 19. Architecture with shared SCSI bus

or by using a standalone disk subsystem with a built-in SCSI-to-SCSI controller. In the latter case, the disks are attached to the internal, independent channels of the disk subsystem.


Figure 20. Variant using SCSI to SCSI disk subsystem

The variant with a SCSI-to-SCSI disk subsystem is more scalable, more functional and more fault-tolerant. Even though there is an extra bridge between the node and the disks, the speed of such a system is usually higher, since we get switched access to the drives (the situation is analogous to a hub versus a switch in a local network). Unlike the variant with shared access to disks on a common SCSI bus, a separate independent disk subsystem also conveniently allows building systems with no single point of failure and building multi-node configurations.

Recently, FC (Fiber Channel), a new serial interface for carrying the SCSI protocol, has been gaining popularity. So-called storage area networks, SAN (Storage Area Network), are built on the basis of FC.


Figure 21. Cluster system using Fiber Channel SAN

Almost all of Fiber Channel's features can be counted among its main advantages:

  • High data rates
  • Protocol independence (levels 0-3)
  • Large distances between points
  • Low latency when transmitting short packets
  • High reliability of data transmission
  • Virtually unlimited scaling
  • Multidrop topologies

These remarkable features of Fiber Channel come from the fact that specialists in both channel and network interfaces took part in its design, managing to combine the strengths of both in a single FC interface.

To understand the significance of FC, I will give a comparative table of FC and a parallel SCSI interface.

Table 2. Comparative characteristics of FC and the parallel SCSI interface

Today FC devices are more expensive than their parallel SCSI counterparts, but the price gap has been shrinking rapidly in recent years. Disks and storage systems are already nearly equal in cost to parallel SCSI implementations; only the FC adapters still account for a significant difference in price.

There is one more very interesting implementation of the cluster architecture: the Shared Memory Cluster, a cluster system with shared memory (including RAM). Such a cluster can operate both as a moderately coupled and as a tightly coupled multiprocessor system. In the latter case, as mentioned at the beginning of the article, such a system is called NUMA.


Figure 22. Shared memory cluster model

A shared-memory cluster uses software (cluster services) that provides a single system image even if, from the operating system's point of view, the cluster is built as an architecture without shared resources.

To conclude the discussion of high-availability cluster systems, I want to present statistics on the downtime of various systems.


Figure 23. Comparison of the average downtime of different systems

The figures are averages, and some are taken from the promotional materials of one of the manufacturers, so they should be viewed with a degree of skepticism. Nevertheless, the overall picture they paint is quite accurate.

As you can see, high-availability clusters are not a panacea for minimizing downtime. If downtime is absolutely critical, then Fault Tolerant or Continuous Availability class systems should be used; systems of that class have an availability factor an order of magnitude higher than High Availability class systems.

Examples of proven solutions

Since the success of any technology is proved by examples of its practical use, I want to show specific implementations of several cluster solutions that I consider most important.

First, about high-speed clusters.

One of the most telling examples, in my opinion, is that the top places, and indeed most of the places, in the 18th edition of the list of the world's most powerful supercomputers are occupied by IBM SP2 and Compaq AlphaServer SC systems. Both are massively parallel processing (MPP) systems, structurally similar to High Performance cluster solutions.

The IBM SP2 uses RS/6000 machines as nodes, connected by the SP Switch2. The switch throughput is 500 MB/s in one direction, with a latency of 2.5 µs.

In the Compaq AlphaServer SC, the nodes are 4-processor Compaq AlphaServer ES45 systems connected by the QsNet communication interface, whose parameters were given above.

The same supercomputer list also includes machines built on ordinary Intel platforms with SCI and Myrinet switches, and even with ordinary Fast Ethernet and Gigabit Ethernet. Moreover, both on the first two systems and on high-speed clusters built from commodity equipment, MPI packages are used for programming.

Finally, I would like to give an elegant example of a scalable high-availability cluster system: a hardware model of a cluster solution for fault-tolerant, high-speed processing of an IBM DB2 database.


Figure 24. IBM DB2 cluster

That's all. If anyone has questions, advice, or simply wants to talk, you are welcome to get in touch; my contact details are at the end of the article.

Literature

  • "Sizing Up Parallel Architectures" - Greg Pfister, Senior Technical Specialist at IBM.
  • "Is fault tolerance possible for Windows?" - Natalya Pirogova, materials of the Open Systems publishing house.
  • "Using systems for parallelizing tasks in a loosely coupled cluster", - M.N. Ivanov.
  • "Fault-tolerant computers of the Stratus company", - Victor Shnitman, materials of the publishing house "Open Systems".
  • "Modern high-performance computers", - V. Shnitman, information-analytical materials of the Center of Information Technologies.
  • "A step towards data storage networks", information and analytical materials of the USTAR company.
  • "The Evolution of the Virtual Interface Architecture" - Torsten von Aiken, Werner Vogels, materials of the publishing house "Open Systems".
  • Materials of the Laboratory of Parallel Information Technologies "NIVTs MGU".
  • Cluster Computing Info Centre.
  • Materials SCI Europe.
  • Materials VI Forum (Virtual Architecture Developers Forum).
  • Caldera materials.
  • Dolphinics materials.
  • Emulex materials.
  • Content from KAI Software, a Division of Intel Americas, Inc. (KAI).
  • Materials from Myricom, Inc.
  • Oracle Materials.
  • Intel Technical Support Recommendations.