RAID + Distributed Clustering

RAID (redundant array of independent disks) is a data storage system that uses multiple drives (hard drives or SSDs), across which the data is distributed or replicated.

Storage server with 24 hard drive bays and hardware RAID that supports various configurations, or RAID levels.

Depending on the configuration (usually referred to as the level), the benefits of a RAID over a single disk are one or more of the following: greater data integrity, fault tolerance, transfer rates, and capacity. In its original implementations, its key advantage was the ability to combine several low-cost devices built on older technology into a unit that offered greater capacity, reliability, speed, or a combination of these than a single, more expensive, latest-generation device.

At its simplest, a RAID combines multiple hard drives into a single logical drive, so the operating system sees one drive instead of several. RAIDs are typically used in servers and are usually (though not necessarily) built from drives of the same capacity. With falling hard drive prices and RAID options increasingly built into motherboard chipsets, RAID is also an option in more advanced personal computers. It is especially common in computers that are dedicated to intensive tasks and must preserve data integrity in the event of a system failure. Hardware RAID systems offer this protection, depending on the level chosen. Software-based systems, in contrast, are much more flexible, whereas hardware-based systems add one more point of failure to the system: the RAID controller.
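To make the idea of a single logical drive more concrete, the sketch below estimates the usable capacity of an array for a few widely used RAID levels. The function name `usable_capacity` and the choice of levels (0, 1, 5, 6) are ours, not taken from the text; the formulas are the textbook ones and ignore the small amount of space real controllers reserve for metadata.

```python
# Minimum number of drives for each level covered by this sketch.
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4}


def usable_capacity(level: int, disk_sizes_tb: list[float]) -> float:
    """Approximate usable capacity (TB) of a RAID array.

    Every member contributes only as much as the smallest drive,
    which is why arrays are usually built from equal-sized drives.
    """
    if level not in MIN_DISKS:
        raise ValueError(f"RAID level {level} is not covered by this sketch")
    n = len(disk_sizes_tb)
    if n < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} drives")

    smallest = min(disk_sizes_tb)
    if level == 0:                 # striping: all space is usable, no redundancy
        return n * smallest
    if level == 1:                 # mirroring: every drive holds the same data
        return smallest
    if level == 5:                 # one drive's worth of space holds parity
        return (n - 1) * smallest
    return (n - 2) * smallest      # RAID 6: two drives' worth of parity


if __name__ == "__main__":
    drives = [4.0, 4.0, 4.0, 4.0]  # four hypothetical 4 TB drives
    for level in (0, 1, 5, 6):
        print(f"RAID {level}: {usable_capacity(level, drives)} TB usable")
```

The trade-off is visible in the numbers: the more capacity a level gives up, the more drive failures it can survive.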

For some time now, computer clusters have been popular as an attractive alternative, in terms of cost, performance, and reliability, for building high-performance computers, data processing centers, and Internet servers. In all of these cases, it is very important that the cluster include a storage system capable of supporting the users and clients of the system. This master's thesis (TFM) analyzes several storage system alternatives for use in a computer cluster. For each alternative considered, it gives a brief description, the installation process, and an evaluation of its behavior.

Why is RAID used?

RAID 1 is one of the most widely used RAID levels for those who want their data duplicated to protect it against a drive failure. At this level, data is mirrored across all disks. Write speed does not improve, but with two drives read speed can be up to double, since data can be read from both drives at the same time. In addition, if one of the disks fails, the data remains intact on the other and, once the failed disk is replaced, the data is copied onto the new disk automatically.
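The mirroring behavior described above can be sketched with a toy in-memory model. The `Mirror` class and its method names are hypothetical and exist only for illustration; each "disk" is a plain dictionary, not a real block device, and a real controller does considerably more (error detection, background rebuild, and so on).

```python
class Mirror:
    """Toy RAID 1 array: each disk is a dict mapping block number -> bytes.

    A failed disk is represented by None.
    """

    def __init__(self, n_disks: int = 2) -> None:
        self.disks = [{} for _ in range(n_disks)]

    def write(self, block: int, data: bytes) -> None:
        # Every write goes to all healthy disks, so writes do not get faster.
        for disk in self.disks:
            if disk is not None:
                disk[block] = data

    def read(self, block: int) -> bytes:
        # Any healthy disk can serve the read; here we take the first one.
        for disk in self.disks:
            if disk is not None:
                return disk[block]
        raise OSError("all disks in the mirror have failed")

    def fail(self, index: int) -> None:
        # Simulate losing one drive.
        self.disks[index] = None

    def replace(self, index: int) -> None:
        # Rebuild: copy every block from a surviving disk onto the new one.
        source = next((d for d in self.disks if d is not None), None)
        if source is None:
            raise OSError("no healthy disk left to rebuild from")
        self.disks[index] = dict(source)


if __name__ == "__main__":
    array = Mirror()
    array.write(0, b"important data")
    array.fail(0)                  # first drive dies
    print(array.read(0))           # still served by the second drive
    array.replace(0)               # new drive is rebuilt from the survivor
```

In this sketch reads always come from the first healthy disk; a real implementation may spread reads across the mirrors, which is where the read speed gain comes from.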

GlusterFS is a scalable, parallel network file system suited to data-intensive tasks such as cloud storage. It lets you build high-capacity, network-distributed storage solutions. GlusterFS is free, open-source software that runs on commodity hardware.

Cluster concept

Here are some terms used in cluster file systems. Distributed file system: a file system in which the data is distributed among several nodes and users can access it without knowing the actual location of the files. The user does not perceive the access as remote.
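One simple way to get this location transparency is to derive the storage node from a hash of the file path, so that clients never need to name a node themselves. The sketch below is only an illustration of that idea: `node_for`, `DistributedFS`, and the node names are hypothetical, and it is not meant to reproduce the placement algorithm of any particular system.

```python
import hashlib


def node_for(path: str, nodes: list[str]) -> str:
    """Choose the node that stores `path` by hashing the path."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class DistributedFS:
    """Toy distributed file system: one in-memory dict per node."""

    def __init__(self, nodes: list[str]) -> None:
        self.nodes = list(nodes)
        self.storage = {name: {} for name in self.nodes}

    def write(self, path: str, data: bytes) -> None:
        # The caller supplies only the path; the node is chosen internally.
        self.storage[node_for(path, self.nodes)][path] = data

    def read(self, path: str) -> bytes:
        # The same hash leads back to the node that holds the file.
        return self.storage[node_for(path, self.nodes)][path]


if __name__ == "__main__":
    fs = DistributedFS(["node-a", "node-b", "node-c"])  # hypothetical nodes
    fs.write("/docs/report.txt", b"quarterly numbers")
    print(fs.read("/docs/report.txt"))
    print("stored on:", node_for("/docs/report.txt", fs.nodes))
```

A plain modulo placement like this one moves most files when the number of nodes changes, which is one reason production systems use more elaborate schemes.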

A cluster is a group of computers linked by a high-speed network in such a way that the whole is seen as a single computer, more powerful than a typical desktop computer.

Clusters are usually used to improve performance and/or availability beyond what a single computer provides, and they are typically cheaper than a single computer of comparable speed and availability.

Cluster computing emerges as a result of the convergence of several current trends, including the availability of inexpensive high-performance microprocessors and high-speed networks, the development of software tools for high-performance distributed computing, as well as the increasing need for computational power for applications that require it.

Building a cluster is easier and cheaper thanks to its flexibility.

The nodes can all have the same hardware and operating system configuration (homogeneous cluster), differ in performance but share similar architectures and operating systems (semi-homogeneous cluster), or have different hardware and operating systems (heterogeneous cluster), which makes a cluster easier and more economical to build.
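The three categories above can be expressed as a small classification function. This is a simplified sketch: the `Node` fields, the use of clock speed as a stand-in for performance, and the decision to treat "similar" architectures and operating systems as "identical" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    architecture: str       # e.g. "x86_64"
    operating_system: str   # e.g. "Linux"
    cpu_ghz: float          # crude stand-in for "performance"


def classify(nodes: list[Node]) -> str:
    """Label a cluster as homogeneous, semi-homogeneous, or heterogeneous."""
    if not nodes:
        raise ValueError("a cluster needs at least one node")

    # Identical hardware and operating system on every node.
    if len(set(nodes)) == 1:
        return "homogeneous"

    # Same platform everywhere, but performance differs between nodes.
    platforms = {(n.architecture, n.operating_system) for n in nodes}
    if len(platforms) == 1:
        return "semi-homogeneous"

    # Nodes differ in architecture and/or operating system.
    return "heterogeneous"


if __name__ == "__main__":
    cluster = [
        Node("x86_64", "Linux", 3.2),
        Node("x86_64", "Linux", 2.4),   # older, slower machine
    ]
    print(classify(cluster))
```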

The speedup a program can gain from multiple processors in distributed computing is limited by the part of the program that must still run sequentially, a result known as Amdahl's law.
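This limit is usually stated through Amdahl's law. In one standard form, with $p$ the fraction of the run time that can be parallelized and $n$ the number of processors (the symbols are chosen here and do not come from the text), the overall speedup $S(n)$ and its upper bound are:

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

However many processors are added, the speedup cannot exceed $1/(1-p)$, because the sequential part $(1 - p)$ is not shortened.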

In other words, the execution speed of the program has improved by a factor of 1.2 (rounded). The result of Amdahl's law is a dimensionless ratio, that is, it is neither a percentage nor a unit of time. Amdahl's law can be interpreted in a more technical way, but in simple terms it means that the algorithm, not the number of processors, determines the speed improvement. Eventually, a point is reached where the algorithm cannot be parallelized any further.
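A short calculation shows how such a factor comes about. The sketch below implements the usual form of Amdahl's law; the function name and all input values are assumptions chosen for illustration. In particular, the first call assumes that 30% of the run time is made twice as fast, which happens to give a result close to the rounded 1.2 mentioned above, but the original figures behind that number are not given in the text.

```python
def amdahl_speedup(fraction_improved: float, factor: float) -> float:
    """Overall speedup when `fraction_improved` of the run time
    becomes `factor` times faster and the rest is unchanged."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / factor)


if __name__ == "__main__":
    # Assumed example: 30% of the run time is made 2x faster.
    print(round(amdahl_speedup(0.3, 2), 2))

    # Assumed example: a program that is 90% parallelizable, run on more
    # and more processors. The speedup levels off below 1 / (1 - 0.9) = 10.
    for processors in (2, 8, 64, 1024):
        print(processors, round(amdahl_speedup(0.9, processors), 2))
```

The second loop illustrates the point about the algorithm deciding the improvement: going from 64 to 1024 processors adds very little, because the sequential 10% dominates.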

Cluster classification

The term cluster has different connotations for different groups of people. The types of clusters, defined by how the clusters are used and the services they offer, determine what the term means for the group that uses it. Clusters can be classified based on their characteristics: high-performance clusters (HPC, High Performance Computing), high-availability clusters (HA, High Availability), or high-throughput clusters (HT, High Throughput).

High performance: These are clusters that run tasks requiring large computational capacity, large amounts of memory, or both. Carrying out these tasks can tie up cluster resources for long periods of time.