Data Requiring Special Treatment
All commercial backup products can back up normal filesystem data. However, there is a
lot of data that does not reside in a normal filesystem. Some data does reside in a
filesystem but still requires special treatment before it can be backed up. Find out how
the product that you are considering handles data such as this:
- Network-mounted filesystems: It is sometimes necessary to back up data that resides on a disk that is mounted from another machine.
- Data that needs a custom script to back it up: Some types of data fall in between “normal” filesystem data and database data. If this data is created by a special utility, that utility must be shut off prior to running the backup.
- Relational Database (RDBMS) data: Although some RDBMS data can be backed up using shutdown/startup scripts, there is now a much better way to back up most commercial database products.
The following sections discuss these three needs.
Network-Mounted Filesystems
Why would anyone want to back up via NFS? In almost every shop, there is a client
that is not supported by commercial backup software. Perhaps it’s an older operating
system that is no longer supported. Perhaps the software vendors aren’t convinced that
the operating system has enough market share. (Remember that these vendors aren’t in this
for free.) What good does it do to port your software to a platform that no one is
using?
One solution to back up such a client would be to NFS-mount its filesystems to the
backup server and back up the data via NFS. Yes, NFS can be a horrible way to back up.
Yes, there are problems with restoring via NFS. But, as they say, something is better
than nothing.
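As a rough sketch of how this might look, the backup server could mount the unsupported client’s export read-only and archive it locally. The Python sketch below assumes a hypothetical host named legacyhost exporting /data and simply shells out to the standard mount and tar commands; it illustrates the idea rather than any particular product’s behavior.

```python
#!/usr/bin/env python3
"""Minimal sketch: back up an unsupported client over NFS.

Assumes a hypothetical host 'legacyhost' exporting /data and a backup
server with permission to mount it. Illustration only, not a production tool.
"""
import datetime
import os
import subprocess

CLIENT_EXPORT = "legacyhost:/data"      # hypothetical NFS export
MOUNT_POINT = "/mnt/legacyhost_data"    # where we mount it locally
BACKUP_DIR = "/backups/legacyhost"      # where the archive is written

def backup_over_nfs():
    os.makedirs(MOUNT_POINT, exist_ok=True)
    os.makedirs(BACKUP_DIR, exist_ok=True)

    # Mount the remote filesystem read-only so the backup cannot alter it.
    subprocess.run(["mount", "-o", "ro", CLIENT_EXPORT, MOUNT_POINT], check=True)
    try:
        stamp = datetime.date.today().isoformat()
        archive = os.path.join(BACKUP_DIR, f"legacyhost-data-{stamp}.tar.gz")
        # Archive the mounted tree. Client-side metadata such as ACLs may
        # not survive this path, as noted in the text.
        subprocess.run(["tar", "-czf", archive, "-C", MOUNT_POINT, "."], check=True)
    finally:
        # Always unmount, even if the archive step fails.
        subprocess.run(["umount", MOUNT_POINT], check=False)

if __name__ == "__main__":
    backup_over_nfs()
```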
Along with NFS is Microsoft’s CIFS filesystem, which uses the SMB protocol. This
protocol works similarly to NFS, and a supported backup server could mount such drives
over the network and back them up that way. (One of the problems with this method is
that it must mount these drives to a Windows server, or it will not back up or restore
the ACLs—but at least you’ll have the data.)
The other issue with NFS- and CIFS-mounted filesystems is whether the backup
software can exclude them. Imagine what would happen in most Windows environments if the
backup product started to back up all CIFS partitions! Typically,
this is avoided by excluding all NFS and CIFS mount points, but some products can
selectively back up NFS and CIFS partitions. Some products allow you to configure a
client in such a way that it backs up all CIFS/NFS-mounted partitions that it
contains.
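The exclusion logic itself is straightforward. The sketch below shows one way a Linux client could separate local filesystems from NFS and CIFS mount points by reading /proc/mounts; the set of filesystem types to exclude is an assumption for illustration, not any particular product’s behavior.

```python
#!/usr/bin/env python3
"""Sketch: decide which mount points a Linux client backup should include.

Reads /proc/mounts and skips network filesystem types (NFS, CIFS/SMB) by
default, mirroring the typical exclusion behavior described in the text.
"""

NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs", "smb3"}  # assumed list

def classify_mounts(mounts_file="/proc/mounts"):
    local, network = [], []
    with open(mounts_file) as f:
        for line in f:
            # Each line: device mountpoint fstype options dump pass
            _device, mountpoint, fstype = line.split()[:3]
            if fstype in NETWORK_FS_TYPES:
                network.append(mountpoint)   # excluded by default
            else:
                local.append(mountpoint)     # candidate for backup
    return local, network

if __name__ == "__main__":
    local, network = classify_mounts()
    print("Back up:", local)
    print("Excluded network mounts:", network)
```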
Custom User Scripts
At one point, custom user scripts were the only option available for backing up
databases. The backup product would run a special program written by the administrator
that shut down the database. The backup product then would back up the files or raw
partitions on which the database resided. Finally, the program would restart the
database. Depending on the size of the database and the uptime requirements, this
approach to database backups may still be a viable solution for some
environments.
This method is now used for some types of data that do not fully qualify as
“database” data but are dynamic enough that they require custom handling. Perhaps you
use a network-monitoring tool that continually probes the network and stores the result
in several interrelated files. Since those files must be backed up at a consistent point
in time, the network-monitoring tool will need to be shut down while the backup is
running. This can be accomplished by a custom script.
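A minimal example of this shut-down/back-up/restart pattern is sketched below. The service name netmon and the paths are hypothetical; a commercial backup product would typically invoke a script like this as a pre/post-backup command.

```python
#!/usr/bin/env python3
"""Sketch: the classic shut-down / back-up / restart pattern.

'netmon' is a hypothetical network-monitoring service whose interrelated
data files must be quiesced before they are copied.
"""
import datetime
import os
import subprocess
import tarfile

SERVICE = "netmon"                 # hypothetical service name
DATA_DIR = "/var/lib/netmon"       # hypothetical data directory
ARCHIVE_DIR = "/backups/netmon"

def backup_with_quiesce():
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    subprocess.run(["systemctl", "stop", SERVICE], check=True)   # quiesce the app
    try:
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        archive = os.path.join(ARCHIVE_DIR, f"netmon-{stamp}.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(DATA_DIR)      # copy a consistent, quiesced snapshot
    finally:
        subprocess.run(["systemctl", "start", SERVICE], check=True)  # restart it

if __name__ == "__main__":
    backup_with_quiesce()
```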
Databases
Several years ago, no commercial backup utilities backed up database data directly.
The best you could do was to shut them down with a custom script as discussed in the
previous section. However, databases are now so large that this option is no longer
viable for many environments. All of the major database vendors have since come out with
some sort of Application Programming Interface (API) that commercial backup utilities
can use to back up the vendor’s database. These utilities share four common
traits:
- They can pass a data stream to a third-party utility, such as a commercial backup product. (Some also can perform standalone backups.)
- They can pass a data stream to a third-party utility while the database is online.
- They can provide multiple, simultaneous threads from the same database.
- Once set up, they can be called from either that commercial backup utility or the database interface. (For example, you can log in to the database server as the DBA and issue a database command, and that will automatically start a session with the commercial backup utility. You can also issue a backup command from the backup product and have it call the database product.)
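The stream-handoff pattern behind these traits can be sketched generically. In the sketch below, open_backup_stream() is a hypothetical stand-in for whatever call a given vendor’s API actually provides; the point is only the shape of the interaction, with the online database producing several simultaneous streams and the backup utility consuming them.

```python
#!/usr/bin/env python3
"""Sketch of the stream-handoff pattern shared by database backup APIs.

Nothing here is a real vendor API: open_backup_stream() is a stand-in for
whatever the database actually exposes. The database stays online and hands
one or more data streams to the backup utility, which consumes them in
parallel and writes them to its own storage.
"""
import concurrent.futures
import io

def open_backup_stream(channel_id):
    """Hypothetical placeholder for a vendor API call that returns a
    readable stream of backup data for one channel of a running database."""
    return io.BytesIO(f"backup data for channel {channel_id}".encode())

def consume_stream(channel_id, chunk_size=1 << 20):
    """What the backup product does with each stream: read chunks and write
    them to its own media (here, just a local file)."""
    stream = open_backup_stream(channel_id)
    with open(f"/tmp/db-backup-channel-{channel_id}.dat", "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)

if __name__ == "__main__":
    # Multiple, simultaneous threads from the same database (the third trait).
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(consume_stream, range(4)))
```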
The question you must ask the backup vendor is, “Which of these databases have you
ported to?” Even if you aren’t using any database products today, the number of
databases that a particular product can back up demonstrates its level of commitment to
the enterprise market.
It should also be mentioned that some commercial backup utility vendors have independently written their own interfaces to these database products, and these interfaces do not use the database vendor’s API. The preferred method is to use the vendor-supplied API; backups and restores done through any other method usually are not supported by the database software vendor.
Storage Management Features
Another important term to know is storage management. Just as some people think the word
Internet was invented within the last few years, many think that
storage management is a new concept. It actually goes way back to the mainframe days when
3480 tapes were much less expensive than disk. It became necessary to move important, yet
unused, data off the expensive disks and onto less expensive 3480 media. (This is one way
of managing the available storage, thus the term
storage management.) SCSI disks, by contrast, were a lot cheaper, so Unix
environments just bought more disks when they needed more storage space. Unfortunately,
this “disk is cheap” mentality has led users to create far more and far bigger files than
ever before. Within the last few years, though, IS managers have started to become very
frustrated with the amount of money they are spending on disk. They believe that a number
of files could be moved from disk to a slower storage medium without anyone noticing. This
belief has given rise to a demand for storage management in the Unix arena. Just what is
storage management, though?
This section examines the three principal storage management concepts that relate to backups: archives, hierarchical storage management (HSM), and information lifecycle management (ILM).
Archives
Archives are designed for logical retrieval of information, that is, the retrieval
of information grouped in a logical way. For example, with archives you can store
reference data such as:
- The CAD drawings, parts lists, and other manufacturing information for a widget your company used to make
- All of the information pertaining to a former customer
- All of the information related to a closed project, account, or legal case
- Tax returns, financial records, or other records for a particular year
In other words, information that can be grouped in a logical way can be archived and
stored so that a company can retrieve it based on that logical grouping. Once a widget
is no longer produced, a case is closed, or a tax year has passed, the information
pertaining to that event or item just takes up space. We might need to reference it
again for some reason, but we don’t want it filling up our high-end storage, so we
archive it and delete it from our tier-one storage.
The second way that archives manifest themselves is in the logical storage of active
data. Suppose, for example, it was discovered that a critical safety part was removed
from a particular widget’s design. It would be important to see every version of the
specification, along with information about who changed it. Also, consider the common
practice of electronic discovery of email systems. Think about the discovery requests
that can come from someone in management being accused of harassment or discrimination,
a trader accused of promising financial returns, or a company charged with colluding
with its competitors. Such accusations may result in e-discovery requests that look like
the following:
- All emails from employee A to employees B, C, and D for the last year
- All emails and instant messages from all traders to all customers for the last three years that contain the words “promise,” “guarantee,” “vow,” “assure,” or “warranty”
- All emails that left a company going to domains X, Y, and Z or to certain specific email addresses
Using a backup program to create archive files isn’t a good idea: trying to find
specific information in backups is costly and time-consuming.
Warning
A bottle of grape juice left on the shelf long enough will ferment, but no one
would call it wine. Similarly, it’s possible to retrieve data from old backups, but no
one should call them archives. Simply put, backups make lousy archives.
Backups make lousy archives
The most common way data is archived is by keeping backups for a long time. Weekly
or monthly full backups are performed, and then the backup is kept from 1 year to 50
years, depending on business requirements. There couldn’t be a worse way to
archive.
Using backups as archives poses many difficulties. The most common use of backups
as archives is for the retrieval of reference data. The assumption is that if someone
asks for widget ABC’s parts (or some other piece of reference data), the appropriate
files can just be restored from the system where they used to reside. The first
problem with that scenario is remembering where the files were several years ago.
While backup products and even some backup devices are starting to offer full-text
search against all your backups, the problems in the following paragraph still
exist.
Even if you can remember where the files belong, the number of operating systems
or application versions that have come and gone in the intervening time can stymie the
effort. To restore files that were backed up from “Apollo” five years ago, the first
requirement is a system named Apollo. Someone also has to handle any authentication
issues between the backup server and the new Apollo because it isn’t the same Apollo
it backed up from five years ago. Depending on the backup software and operating
system in question, the new Apollo may also need to be running the same version of the
OS and applications the old Apollo was running five years ago. Otherwise, there may be
incompatibilities in the filesystem or database being restored.
Satisfy electronic discovery requests
Backups are also used to satisfy electronic discovery requests, which can be even
more challenging. Let’s use the most common electronic discovery request as an
example: a request for emails that match a particular pattern and were sent via an
Exchange server. (The following also applies to other email systems, such as Lotus
Notes or SMTP.) There are two big problems with using backups to satisfy such a
request. The first is that it’s impossible to retrieve all emails sent or received by
a particular person. It’s only possible to restore the emails that were in the
Exchange server when backups were made. If the discovery request is looking for an
email that somebody sent, deleted, and then cleared from her Deleted
Items folder, it wouldn’t be on that night’s backup, and thus would never
show up when you attempted to retrieve it weeks, months, or years later. It would
therefore be technically impossible to meet the discovery request using backups. This
means that even after doing your best to successfully satisfy the discovery request, a
plaintiff may claim that you haven’t proven your case.
The second problem with using backups to satisfy an Exchange electronic discovery
request is that it’s very difficult to retrieve months or years of emails using
backups. Suppose, for example, a company performs a full backup of its Exchange server
once a week, and for compliance reasons, it stores these backups for seven years. If
the company received an electronic discovery request for emails from the last seven
years, it would need to perform many restores of its entire Exchange server to satisfy
the request. The first step would be to restore the Exchange server to an alternate
server using last week’s backup. Next, you would have to run a query against Exchange
to look for the emails in question, saving them to a .PST file. You would then have to restore the Exchange server using the
backup from two weeks ago, rerun the query, and create another .PST file. It would be necessary to restore the entire
Exchange server 364 times (7 years multiplied by 52 weeks) before you’re done. And
almost every step in this process would have to be done manually.
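To make the scale of that effort concrete, the loop looks something like the following sketch; restore_exchange_backup() and export_matching_mail_to_pst() are hypothetical placeholders for the manual restore and query steps just described.

```python
#!/usr/bin/env python3
"""Sketch of why e-discovery from weekly full backups is so painful.

The two functions are hypothetical stand-ins for the manual restore and
query steps described in the text; the loop only makes the scale explicit.
"""

YEARS = 7
WEEKS_PER_YEAR = 52

def restore_exchange_backup(weeks_ago):
    """Hypothetical: restore the entire Exchange server from the weekly
    backup taken this many weeks ago onto an alternate server."""
    print(f"Restoring full Exchange backup from {weeks_ago} week(s) ago ...")

def export_matching_mail_to_pst(weeks_ago):
    """Hypothetical: rerun the discovery query and save the results to a PST."""
    print(f"Exporting matches to discovery-{weeks_ago:03d}.pst")

if __name__ == "__main__":
    total = YEARS * WEEKS_PER_YEAR            # 7 x 52 = 364 full restores
    for weeks_ago in range(1, total + 1):
        restore_exchange_backup(weeks_ago)
        export_matching_mail_to_pst(weeks_ago)
    print(f"Done after {total} full server restores.")
```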
Accomplishing this isn’t impossible, but the recovery effort will entail an
incredible amount of time and money. A plaintiff in a civil suit or the government
doesn’t care how much it costs the defendant; your company has a court order to
produce the emails—regardless of the cost. Yes, a good lawyer can argue against this,
but it can be argued both ways. Do you really want to take that chance?
Backups are also an extremely inefficient way to store archives. While an archive
system makes sure it has only one or two copies of a particular version of a file, a
backup system usually has no such logic. If a company is using weekly full backups as
archives (or creating “archives” with its backup product but not deleting the original
files), and storing its archives for 7 years, it will have 364 copies of many of its
files stored on tape—even if those files have never changed. This leads to an
incredible amount of media waste.
Another strike against using backups as archives is the number of times a company
changes backup formats and tape formats over the years. Almost every company using
backups as its archives has a number of older tape and backup formats it must continue
to support for archive purposes. While older tape formats can be converted with a lot
of copying, converting older backup formats is another story. Most people choose to
hold onto both old tape formats and old backup formats, and hope they never actually
have to read them.
The most important feature of an archiving system is that the archive should
contain enough metadata to allow information to be retrieved in logical ways. For
example, metadata can include the author or the business unit that created an item.
(An item can be any piece of archived information, such as a file, a record from a
database, or an email.) Metadata might also contain the project the item is attached
to or some other logical grouping. An email archive system would include who sent and
received an email, the subject of the email, and all other appropriate metadata.
Finally, an archive system may import the full text of the item into its database,
allowing for full-text searches against the archive. This can be useful, especially if
multiple formats can be supported. It’s particularly expedient to be able to do a
full-text search against all emails, Word documents, PDF files, etc.
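A toy version of such a catalog might look like the sketch below: each archived item is stored once, and its location plus its metadata go into a small searchable database. The schema and the simple LIKE query are illustrative assumptions; a real archive system would use a proper full-text index.

```python
#!/usr/bin/env python3
"""Sketch: an archive catalog that keeps item metadata in a searchable database.

Each archived item (a file, an email, a database record) is stored once;
its metadata -- author, business unit, project, extracted text -- goes into
a catalog so it can be retrieved by any logical grouping.
"""
import sqlite3

def create_catalog(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS items (
            item_id       INTEGER PRIMARY KEY,
            location      TEXT,   -- where the single stored copy lives
            author        TEXT,
            business_unit TEXT,
            project       TEXT,
            item_type     TEXT,   -- file, email, database record, ...
            body_text     TEXT    -- extracted text for crude full-text search
        )""")
    return db

def add_item(db, **fields):
    cols = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    db.execute(f"INSERT INTO items ({cols}) VALUES ({marks})",
               tuple(fields.values()))

def search(db, project=None, text=None):
    query, args = "SELECT item_id, location FROM items WHERE 1=1", []
    if project:
        query += " AND project = ?"
        args.append(project)
    if text:                                  # stands in for full-text search
        query += " AND body_text LIKE ?"
        args.append(f"%{text}%")
    return db.execute(query, args).fetchall()

if __name__ == "__main__":
    db = create_catalog()
    add_item(db, location="tape:A1/34", author="jdoe", business_unit="mfg",
             project="widget-XYZ", item_type="file",
             body_text="parts list and CAD notes for widget XYZ")
    print(search(db, project="widget-XYZ"))
    print(search(db, text="parts list"))
```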
Another important feature of archive systems is their ability to store a
predetermined number of copies of an archived item. A company can then decide how many
copies it wishes to keep. For example, if a firm is storing its archives on a
RAID-protected system, it may choose to have one copy on disk and another on a
removable medium such as optical or tape.
Archive systems can be roughly divided into two categories depending on the way
they store data. The first is the traditional, low-retrieval archive system that’s
attached to your backup software package. Such an archive system allows you to make an
archive of a selected group of files; attach limited metadata to it, such as “widget
XYZ”; and then have the archive system delete the files in question. The good
news is that it allows the attachment of metadata and can reduce multiple copies in
the archive by deleting the duplicate files as they’re archived. The bad news
is that if you want to search archives using different types of metadata—such as
owner, time frame, etc.—you may need to create multiple archives. The main use for
this type of archive is to save space by deleting files attached to projects or
entities that are no longer active.
The second—and newer—category of archive systems realizes that any archived item
might need to be retrieved for different reasons and would thus require different
metadata. To support multiple types of retrievals, it’s important to store the actual
archived item only once but to store all of its metadata in a searchable database.
Such a system realizes that a given archived item might be put into the archive not to
save space, but to allow it to be searched for logically. Therefore, unlike its
predecessors that stored the only copies of reference data, newer archive programs
store an extra copy of the data, leaving the original in place.
As discussed previously, one of the problems with using backups as archives is
that they won’t have all occurrences of a file or message; they’ll have only those
items that were available when the backup was made. Some of the newer archive systems
solve this problem by archiving data automatically. For example, every email that
comes in or is sent out is captured by the archiving system. Every time a file is
saved, a version of the file is sent to the archive system.
Another advantage of newer archive systems is their use of single-instance store
concepts. They store only one copy of a file or email, no matter where it came from or
who it went to. (Of course, the archiving system records who it came from or who it
was sent to.) If that file or email is then changed and sent/stored again, the
archiving application stores only the changed bytes in the new version.
Single-instance store saves a lot of disk space.
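The core idea can be sketched with a content-addressed store: each item is keyed by a hash of its contents, so the second, third, and fourth copies add only references to the copy already stored. The class below illustrates the concept (delta storage of changed versions is omitted); it is not any vendor’s implementation.

```python
#!/usr/bin/env python3
"""Sketch: single-instance storage keyed by content hash.

An item (file or attachment) is stored once, no matter how many mailboxes
or servers reference it; later copies only add references to the same hash.
"""
import hashlib

class SingleInstanceStore:
    def __init__(self):
        self.blobs = {}        # content hash -> item bytes (stored once)
        self.references = {}   # content hash -> list of (mailbox, note)

    def add(self, mailbox, data, note=""):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data          # first copy: actually store it
        # Every later copy adds only a reference, not another blob.
        self.references.setdefault(digest, []).append((mailbox, note))
        return digest

if __name__ == "__main__":
    store = SingleInstanceStore()
    attachment = b"quarterly-report.pdf contents"
    for mailbox in ("alice", "bob", "carol", "dave"):
        digest = store.add(mailbox, attachment, note="To: all-hands")
    print("copies stored:", len(store.blobs))                 # 1
    print("references kept:", len(store.references[digest]))  # 4
```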
As for the format issues that plague backups used as archives, many archive systems still grapple with them as well. Many people still store their archives on tape, and archive software changes over time, so this problem can persist even in archives.
Newer archiving systems also serve as a hierarchical storage management-like
system, automatically deleting large, older files and emails, and invisibly replacing
them with stubs that automatically retrieve the appropriate content when accessed.
This is one of the main business justifications used to sell email archive software.
In addition to satisfying e-discovery requests, you can save a lot of space by
archiving redundant and unneeded emails and attachments.
Surveys show that more than 90 percent of typical email storage is consumed by
attachments. If you can store only one copy of an attachment across multiple email
servers (and Exchange Storage Groups), and replace it with a stub, you can save a lot
of storage. If you add delta-block incrementals to that, you can save even more
storage.
Tip
While Exchange does single-instance storage within a storage group, it does not
do so across multiple storage groups or multiple servers. Suppose you send an email
to four people, two of whom are in the same storage group on the same server, one of
whom is in a different storage group on the same server, and one of whom is on a
completely different Exchange Server. Exchange would store one copy of the email in
the storage group with the first two users, another copy in the storage group with
the third user, and another copy in the second Exchange Server with the fourth user.
A product like the one described here would take all three copies out, store one
copy on the archive server, and put a stub in place so that any user accessing the
email would retrieve it automatically from the archive.
If your company has more than one employee, it wouldn’t be hard to build a
business case for archiving. And if you’re using backups as archives, you could be in
for a pretty rough time when you get an electronic discovery request. Perhaps you
should look at an email archiving product or an enterprise content management product
today.
Hierarchical Storage Management
The second principal storage management concept is Hierarchical Storage Management, or HSM. HSM
is a horse of a different color. While archiving allows someone to delete files from
disk once they have been archived, that is not its primary purpose. Its primary purpose
is to make those files easy to find years from now when the user doesn’t remember what
system they were on. HSM’s primary purpose is to automatically monitor filesystems,
looking for files that meet certain conditions, such as a file that has not been looked
at for a long time.
In a truly hierarchical system, this would involve successively less expensive media
types. For example, the file might be moved from a high-speed, high-availability system
to older, nonmirrored disks. It then might be moved to optical disks and eventually to
tape. Each of these levels is less expensive than the previous level but also has a
longer access time. When a file is moved, or “migrated,” to a less expensive level, the
HSM system leaves a stub file behind that has the
same name as the original file. If anyone attempts to access the stub file, the original
file is retrieved automatically by the HSM system. This is invisible to the end user,
other than the extra time taken to access the file. (Some high-speed systems can reduce
this time to something that a typical user would never notice.)
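A toy migration pass might look like the sketch below: it walks a primary filesystem, moves files that haven’t been accessed within a policy window to a cheaper tier, and leaves a same-named stub recording where the data went. The paths, the one-year policy, and the stub format are arbitrary examples, and the automatic recall on access is not shown.

```python
#!/usr/bin/env python3
"""Sketch: an HSM-style migration pass.

Files not accessed within the policy window are moved to a cheaper storage
tier, and a small stub with the same name is left behind so a recall
mechanism (not shown) could fetch the data back on access.
"""
import os
import shutil
import time

AGE_DAYS = 365              # migrate files untouched for a year (example policy)
PRIMARY = "/data/cad"       # expensive, fast storage (example path)
SECONDARY = "/archive/cad"  # cheaper storage tier (example path)
STUB_MARKER = "#HSM-STUB#"

def migrate_old_files():
    cutoff = time.time() - AGE_DAYS * 86400
    for dirpath, _dirs, files in os.walk(PRIMARY):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getatime(path) > cutoff:
                continue                       # accessed recently; leave it alone
            dest = os.path.join(SECONDARY, os.path.relpath(path, PRIMARY))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.move(path, dest)            # move data to the cheaper tier
            # Leave a same-named stub pointing at the migrated copy.
            with open(path, "w") as stub:
                stub.write(f"{STUB_MARKER} {dest}\n")

if __name__ == "__main__":
    migrate_old_files()
```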
Consider a heavy CAD environment. Engineers may make new drawings every day but want
to keep the old ones around for reference. Suppose somebody asks five years from now,
“What parts went into XYZ?” Without an HSM system, the drawing for XYZ would have to
remain on disk for years, simply taking up space. Now, if the engineer making the
drawing was conscientious about the disk space she was using, she might contact the
administrator and ask for her files to be archived. (However, most end users are not
concerned about the administrator’s disk space problems.) An HSM system would
proactively monitor this CAD directory and “notice” that this particular group of CAD
drawings hasn’t been examined for more than a year. It then would migrate them to a less
expensive storage medium without any action from the engineer. When the engineer did
eventually need this file, all she would have to do is open it up in her CAD
application. It would be retrieved automatically. To her, it would appear as if the
system was really slow that day. If she had been educated about HSM, she might think,
“Oh, the file must have been migrated.” As long as the file comes back, she probably
won’t mind the HSM system at all. In an HSM system, there does need to be some education
of the user community; they need to know that their files are being migrated. If they do
not know, they will call the help desk whenever they experience a delay when retrieving
a file.
Warning
Implementing an HSM system should be done slowly and methodically with significant
help from someone who has set up such a system somewhere else. Unrealistic migration
policies and improper end-user education can spell disaster for an HSM system. Many
system administrators have been required by their end users to remove or deactivate
their HSM systems. HSM can work, but so many people have seen
poorly configured HSM systems that they have earned a really bad reputation. Step very
carefully when moving toward an HSM solution.
Information Lifecycle Management
Information Lifecycle Management (ILM) is more
of a concept than a technology. Where HSM and archiving systems typically assume that a
file gets less valuable as it gets older, an ILM system recognizes that different data
have different values over time, and the value of a piece of data may go up and down
several times.
The best example of a pattern that an ILM system would recognize is the exploration data of an oil-and-gas company. They spend years searching for oil. Then they get some information indicating that oil is in a particular place. As they prepare permits to drill in that area, the data is highly valuable. While they wait for approval, the value of the data goes down. Once they get approval, the value of the data goes right back up. Once they start drilling, they’d better not lose that data, or they could cost themselves millions of dollars in lost drilling time. Once the oil has been extracted and there’s no more in that particular place, the data is suddenly unimportant again. It sits there for some indeterminate amount of time until someone decides to file a lawsuit claiming they drilled in the wrong place. Voila! The data is important again, as is the data surrounding it, such as permits, drilling records, etc.
This is an extreme example, but it illustrates the important concept behind ILM.
What matters is the lifecycle of your data. Ask your backup
vendor how they will help you manage that lifecycle.