Sep 21, 2004

RAIN Architecture Scales Storage

Most corporations use relatively isolated and expensive disk subsystems for primary storage, and they protect this data with tape back-up systems that are stored offsite for disaster-recovery purposes.

A new storage system architecture called Redundant Array of Inexpensive Nodes (RAIN) surpasses this traditional storage architecture by offering data-storage and protection systems that are more distributed, shareable and scalable. RAIN systems also are less expensive than traditional systems.

RAIN is an open architecture approach that combines standard, off-the-shelf computing and networking hardware with highly intelligent management software. This combination lets a host of storage and data-protection applications be cost-effectively deployed across a grid of devices that are highly available and self-healing.

RAIN-based storage and protection systems consist of:

  • RAIN nodes: These hardware components are 1U servers that provide about 1 terabyte of serial ATA (SATA) disk storage capacity, standard Ethernet networking and CPU processing power to run RAIN and data management software. Data is stored and protected reliably among multiple RAIN nodes instead of within a single storage subsystem with its own redundant power, cooling and hot-swap disk-drive hardware.

  • IP-based internetworking: RAIN nodes are physically interconnected using standard IP-based LANs, metropolitan-area networks (MAN) and/or WANs. This lets administrators create an integrated storage and protection grid of RAIN nodes across multiple data centers. With MAN and WAN connectivity, RAIN nodes can protect local data while offering off-site protection for data created at other data centers.

  • RAIN management software: This software lets RAIN nodes continuously communicate their assets, capacity, performance and health among themselves. RAIN management software can automatically detect the presence of new RAIN nodes on the network, and these nodes are self-configuring.

    The management software creates virtual pools of storage and protection capacity without administrative intervention. It also manages all recovery operations related to one or more RAIN nodes becoming unavailable because of RAIN node or network failures. RAIN nodes do not require immediate replacement upon component failure because lost data is automatically replicated among the surviving RAIN nodes in the grid (a sketch of this re-replication idea follows this list).


  • Information life-cycle management software: This software replaces traditional snapshot, back-up and mirroring data-management tools with innovative virtualization, compression, versioning, encryption, self-healing integrity checking and correcting, retention and replication algorithms. Information life-cycle management software increases the overall reliability of lower-cost SATA disk drives by replicating data among multiple RAIN nodes.
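
To make the re-replication idea concrete, here is a minimal sketch in Python. It is purely illustrative - the node names, the three-copy replication factor and the placement table are assumptions for the example, not any vendor's actual software.

    # Minimal sketch of RAIN-style re-replication (illustrative only, not vendor code).
    # Each object is kept on REPLICAS nodes; when a node drops out of the grid,
    # its objects are re-copied onto other healthy nodes.

    REPLICAS = 3

    def rebalance_after_failure(failed_node, nodes, placement):
        """placement maps object_id -> set of node names holding a replica."""
        survivors = [n for n in nodes if n != failed_node]
        for obj, holders in placement.items():
            if failed_node not in holders:
                continue
            holders.discard(failed_node)
            # Choose new homes from the least-loaded surviving nodes.
            candidates = sorted(
                (n for n in survivors if n not in holders),
                key=lambda n: sum(n in h for h in placement.values()),
            )
            while len(holders) < REPLICAS and candidates:
                holders.add(candidates.pop(0))

    # Example: three objects spread over four nodes; node "rain-2" fails.
    nodes = ["rain-1", "rain-2", "rain-3", "rain-4"]
    placement = {
        "obj-a": {"rain-1", "rain-2", "rain-3"},
        "obj-b": {"rain-2", "rain-3", "rain-4"},
        "obj-c": {"rain-1", "rain-3", "rain-4"},
    }
    rebalance_after_failure("rain-2", nodes, placement)
    print(placement)  # every object is back to three replicas on surviving nodes

The point is simply that recovery becomes a data-placement decision made by software across the grid rather than an immediate hardware swap.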


A grid of RAIN nodes also can adapt to changing application workloads by load-balancing data across nodes based on utilization or storage capacity.

In a RAIN-based storage system, each RAIN node regularly checks all its own files. The combination of hundreds of RAIN nodes forms a parallel data-management grid - one that is much more powerful than today's independent protection architectures. When file corruption is detected, the associated RAIN node initiates a replication request to all other RAIN nodes, which verify their own replicas and work collectively to replace the defective file.
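
A rough sketch of that scrub-and-repair loop, again with purely hypothetical names (file_digest, fetch_verified) standing in for whatever the real management software does:

    # Hypothetical sketch of the scrub-and-repair loop described above.
    # file_digest and fetch_verified are invented names for the example.
    import hashlib
    from pathlib import Path

    def file_digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def scrub(local_files: dict, peers: list) -> None:
        """local_files maps Path -> digest recorded when the file was first stored."""
        for path, expected in local_files.items():
            if file_digest(path) == expected:
                continue  # file is intact, nothing to do
            # Corruption detected: ask peers for a replica that matches the known digest.
            for peer in peers:
                good_copy = peer.fetch_verified(path.name, expected)  # hypothetical RPC
                if good_copy is not None:
                    path.write_bytes(good_copy)  # replace the defective file
                    break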

Grids of RAIN nodes will replace existing isolated data-storage systems. Low-cost, high-performance disk drives, CPUs and IP networking make this evolution possible. In addition, businesses are demanding simplified, lower-cost, site disaster-recovery systems and faster and more reliable back-up and restore processes.

By executing information life-cycle management applications across hundreds of powerful, internetworked storage and compute RAIN nodes, RAIN systems will deliver unprecedented long-term data availability, cost-effective and rapid site disaster recovery, and automated onsite and offsite data back-up protection.



>>> http://www.networkworld.com/news/tech/2004/0209techupdate.html

Google Storage Strategy

Not a SAN in sight
With 6 billion web pages to index and millions of Google searches run daily, you would think, wouldn't you, that Google has an almighty impressive storage setup. It does, but not the way you think. The world's largest search company does use networked storage, but in the form of networked clusters of Linux servers: rack-'em-high, buy-'em-cheap x86 servers with one or two internal drives.

A cluster will consist of several hundred, even thousands of machines, each with its own internal disks. At the last public count, in April 2003, there were 15,000-plus such machines with 80GB drives. As an exercise let's assume 16,000 machines with 1.5 disk drives, or 120GB, per machine. That totals up to roughly 1.9PB. In fact Google probably has between two and five petabytes altogether, if we add in duplicated systems, test systems and news systems and Froogle systems and so forth. Why does Google use such a massively distributed system?
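
For what it's worth, the arithmetic behind that estimate works out as follows (the 16,000-machine and 1.5-drive figures are the assumptions stated above):

    # Back-of-the-envelope check of the figures above.
    machines = 16_000
    drives_per_machine = 1.5
    drive_size_gb = 80            # the 80GB drives from the April 2003 count

    raw_gb = machines * drives_per_machine * drive_size_gb
    print(f"{raw_gb / 1_000_000:.2f} PB")   # -> 1.92 PB of raw capacity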

It's the application
Crudely speaking, Google's storage has to do two production jobs. First it has to assimilate the results of the web crawlers which discover and index new pages. In file system terms the bulk of this activity is appending data to existing files rather than overwriting them.

The second task is to respond to the millions of online search requests, query the stored data, and come up with results. These searches can be extensively parallelised.

Google has its own GFS - Google File System - which it has described in a published paper. It has implemented this on several very large clusters of Linux machines spread across the globe in data centres.

Google's application is unique and not comparable to a general enterprise application which typically involves file data being overwritten and a much lower degree of parallelism. Google also requires that its services be up and running 7 x 24, every day of the year, no matter what. Single or even double points of failure, or network bottlenecks are simply not acceptable - ever.

Overall system configuration
Google has devised its own cluster architecture, which has evolved from the first Google system set up at Stanford in 1998 (so recent!) by the founders, Sergey Brin and Larry Page.

The nature of a Google query, such as a search for 'EMC', requires the scanning of hundreds of megabytes of data and billions of CPU cycles. But each web page that might contain the term 'EMC' can be read independently of the others. Thus it is inherently parallel. Brin and Page reasoned it was better to have many cheap Linux machines do the search in parallel rather than running an SMP Unix server. The Unix server would cost 5-10 times as much and represent a single point of failure.

Run the search on clustered Linux PC servers (cheap, very cheap), each with its own internal disk rather than a networked storage device (expensive; the network link is a bottleneck), and combine the results. Even better, store the index data for the web pages separately from the web pages themselves. Run the search across the web page index, then aggregate the positive hits and search the web pages to extract the little snippets of text surrounding the search term. Aggregate these and serve them to the user.
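
A toy sketch of that fan-out-and-aggregate pattern - hypothetical shard and document-server objects, not Google's code - looks something like this:

    # Toy sketch of the fan-out/aggregate pattern: search index shards in parallel,
    # merge the hits, then fetch only the snippets from the document servers.
    # The shard.lookup and doc_server.snippet calls are invented for the example.
    from concurrent.futures import ThreadPoolExecutor

    def search(query, index_shards, doc_servers):
        # 1. Scan every index shard in parallel; each returns (doc_id, score) pairs.
        with ThreadPoolExecutor() as pool:
            shard_hits = pool.map(lambda shard: shard.lookup(query), index_shards)

        # 2. Merge the per-shard hits and keep the best-scoring documents.
        merged = sorted((hit for hits in shard_hits for hit in hits),
                        key=lambda h: h[1], reverse=True)[:10]

        # 3. Ask the document servers only for the text around each hit.
        return [doc_servers[doc_id % len(doc_servers)].snippet(doc_id, query)
                for doc_id, _score in merged]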

Linux was chosen because it was inexpensive and more reliable than either Windows NT or any proprietary Unix version.

There is no concept of state as there would be with a commercial web transaction. Each search request is atomic, can be dealt with and forgotten.

In scaling terms this is a classic scale-out (horizontal scaling) scenario, not a scale-up requirement such as adding CPUs to a server.

The index is separated into what Google calls shards and these are stored on separate index servers.

The hard drives
Given this, why not have a large disk server used by the clustered Linux machines? It's cost and reliability that drive this. A disk server is expensive and, as a single box, is vulnerable. Getting the hard drives with the PC servers means that the data is stored across hundreds if not thousands of drives. Google replicates data three times for redundancy. It can afford to be cavalier about hardware failures. So a drive fails. Log it, switch queries on that data to a replica and move on. It's all pretty instant.
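
In pseudo-Python, that "log it, switch to a replica, move on" behaviour might look like this; the replica map and the server read call are invented for the example:

    # Sketch of "log it, switch to a replica, move on"; the replica map and the
    # server read call are invented for the example.
    import logging
    import random

    REPLICA_MAP = {}  # chunk_id -> list of server names holding one of the three copies

    def read_chunk(chunk_id, servers):
        """Try replicas in random order; a dead drive just means trying the next copy."""
        replicas = REPLICA_MAP[chunk_id]
        for name in random.sample(replicas, k=len(replicas)):
            try:
                return servers[name].read(chunk_id)   # hypothetical RPC to a PC server
            except OSError:
                logging.warning("replica %s failed for %s, trying next", name, chunk_id)
        raise RuntimeError(f"all replicas unavailable for {chunk_id}")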

There isn't even RAID protection. In a way the Google cluster architecture is similar to the RAIN storage idea, a redundant array of inexpensive nodes. (Techworld has covered RAIN before - see the article above - and Exagrid is one supplier with RAIN-based storage products that Techworld discussed recently.)

The drives are IDE drives and not SCSI, which would be more expensive. Google's workload involves streaming through large files rather than making many small random reads, so latency is not that great an issue and lightning-fast 15,000rpm SCSI drives are not a requirement. In 2001, 5400rpm 80GB Maxtor IDE drives were mentioned as being used by Google.

Google's architecture is home-grown. Its PC servers are supplied by two specialist server builders. There is no great case study material here for Sun or IBM or HP, none whatsoever. The only well-known supplier is Red Hat for Linux, and much of its distribution is discarded as not needed.

Google gets its system reliability from software and hardware duplication. It uses commodity PCs to build a high-end computing cluster.

File System
The Google file system basics are that each GFS cluster has a single GFS master node and many chunk servers. These are accessed by many, many clients. Files are divided into fixed-size chunks of 64MB. The master maintains all file system metadata. The chunk servers store chunks on their local disks as Linux files. They need not cache file data because the local systems' Linux buffer cache keeps frequently accessed data in RAM.
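
Based on that description, a much-simplified read path might look like the following sketch; the class and method names are illustrative, not the real GFS API:

    # Much-simplified sketch of a GFS-style read path (illustrative names, not the real API).
    # The master holds only metadata; file data lives as 64MB chunks on chunk servers.
    CHUNK_SIZE = 64 * 1024 * 1024

    class Master:
        def __init__(self):
            # filename -> list of (chunk_handle, [addresses of chunk servers with a replica])
            self.metadata = {}

        def locate(self, filename, offset):
            chunk_index = offset // CHUNK_SIZE           # which 64MB chunk holds this offset
            return self.metadata[filename][chunk_index]  # (handle, replica locations)

    def read(master, chunk_servers, filename, offset, length):
        handle, locations = master.locate(filename, offset)
        server = chunk_servers[locations[0]]             # any replica will do
        # The chunk server stores the chunk as an ordinary Linux file and serves bytes
        # from it; a read spanning two chunks would simply repeat this lookup (omitted here).
        return server.read(handle, offset % CHUNK_SIZE, length)   # hypothetical RPC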

To understand more about this, read the GFS paper referenced above. The assumptions behind the file system include that component failures are the norm. So system component health is watched rigorously and constantly, and automatic recovery is integral to Google's operations.

Growth
Google has been growing at a phenomenal rate. In June 2000 it had three data centres and 4,000 Linux servers. Six months earlier it had 2,000. By April 2001 it had 8,000 servers and was moving to four data centres from its then total of five. At that point it had 1 petabyte of storage. The number of servers had passed 15,000 by April 2003, and is probably well past that now.

By the end of this year Google could have around 18,000 servers and more than 5PB of storage. It is a fascinating exercise in commodity computing economics, performance and reliability but, unless your applications are inherently parallel, not a general role model, alas.

>>> http://www.techworld.com/features/index.cfm?featureID=467&printerfriendly=1