Maintaining a SharePoint farm is not an easy task for administrators. They must
find time amid their fire-fighting efforts to plan and perform
maintenance on the server systems. When maintenance tasks are
routine in an environment, they can eliminate much of that
fire-fighting.
The processes and
procedures for maintaining the Windows Server systems that host SharePoint can be grouped
by how often a particular aspect of SharePoint needs attention.
Some maintenance procedures require daily attention, whereas others may
require only yearly checkups. The maintenance processes and procedures
that an organization follows depend strictly on the organization;
however, the categories described in the following sections and their
corresponding procedures are best practices for organizations of all
sizes and varying IT infrastructures.
Outlining Daily Maintenance Tasks
Certain
maintenance procedures require more attention than others. The
procedures that require the most attention are categorized as daily
procedures. It is recommended that a SharePoint administrator take on
these procedures each day to ensure system reliability, availability,
performance, and security. These procedures are examined in the
following three sections.
Checking Overall SharePoint Server Functionality
Although checking the
overall server health and functionality may seem redundant or
elementary, this procedure is critical to keeping the system environment
and users working productively.
Some questions that should be addressed during the checking and verification process are the following:
Can users access data in SharePoint document libraries?
Can remote users access SharePoint via SSL if configured?
Is there an exceptionally long wait to access the portal (that is, longer than normal)?
Do SMTP alerts function properly?
Are searches properly locating newly created or modified content?
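The first and third questions above can be partly automated. The following is a minimal sketch, not a complete health check: it requests a page once and reports whether it answered with HTTP 200 within a time limit. The URL and the five-second threshold are placeholders, and the sketch assumes the page can be fetched without Windows authentication, which many SharePoint sites will not allow.

```python
import time
import urllib.request


def check_portal(url, max_seconds=5.0, opener=urllib.request.urlopen):
    """Fetch url once and report whether it answered quickly enough.

    url and max_seconds are placeholders. opener is injectable so the
    check can be exercised without a live SharePoint server.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=max_seconds) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError (for example 401 or 500) and socket
        # timeouts all derive from OSError.
        return {"url": url, "ok": False, "detail": str(exc)}
    elapsed = time.monotonic() - start
    return {
        "url": url,
        "ok": status == 200 and elapsed <= max_seconds,
        "detail": f"HTTP {status} in {elapsed:.2f}s",
    }


if __name__ == "__main__":
    # Hypothetical portal address; replace with the farm's real URL.
    print(check_portal("https://sharepoint.example.com/"))
```

A script like this only confirms that a page loads. Alerts, search freshness, and document library access still need to be checked by hand or with a monitoring product.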
Verifying That Backups Are Successful
To provide a secure and
fault-tolerant environment, it is imperative that a successful backup
be performed every night. If a server failure occurs, the administrator
may be required to perform a restore from tape. Without a backup each
night, the IT organization is forced to rely on rebuilding the
SharePoint server without the data. Therefore, the administrator should
always back up servers so that the IT organization can restore them with
minimum downtime if a disaster occurs. Because of the importance of the
tape backups, the first priority of the administrator each day needs to
be verifying and maintaining the backup sets.
If disaster ever
strikes, the administrators want to be confident that a system or entire
farm can be recovered as quickly as possible. Successful backup
mechanisms are imperative to the recovery operation; recoveries are only
as good as the most recent backups.
Although the backup programs in Windows
Server and SharePoint do not offer alerting
mechanisms for bringing attention to unsuccessful backups, many
third-party programs do. Many of these third-party backup
programs can also send emails or pages reporting whether a backup succeeded or
failed.
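Where no alerting is available, a simple freshness check can serve as a first line of defense. The sketch below assumes that backups are written as files to a folder and that their names match a pattern such as `*.bak`. The folder, the pattern, and the 24-hour limit are all assumptions for illustration. A recent file does not prove the backup is usable, so this complements checking the backup logs and running test restores; it does not replace them.

```python
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, pattern="*.bak", now=None):
    """Return the age in hours of the newest matching file, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [
        p.stat().st_mtime
        for p in Path(backup_dir).glob(pattern)
        if p.is_file()
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_current(backup_dir, max_age_hours=24, pattern="*.bak", now=None):
    """True only if at least one backup file exists and is recent enough."""
    age = newest_backup_age_hours(backup_dir, pattern, now)
    return age is not None and age <= max_age_hours


if __name__ == "__main__":
    # Hypothetical backup share; replace with the real location.
    print(backup_is_current(r"\\backupserver\sharepoint"))
```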
Monitoring the Event Viewer
The Windows Event Viewer is
used to check the system, security, application, and other logs on a
local or remote system. These logs are an invaluable source of
information regarding the system. The following event logs are present for SharePoint servers running on Windows Server:
Security: Captures all security-related events being audited on a system. Auditing is turned on by default to record success and failure of security events.
Application: Stores specific application information. This information includes services and any applications running on the server.
System: Stores Windows Server–specific information.
All Event Viewer events are categorized either as informational, warning, or error.
Note
Checking these logs
often makes them easier to understand. Some events appear constantly but aren’t
significant. As events begin to look familiar, it becomes
noticeable when something is new or amiss in the event logs. Because this
familiarity takes time to build, an intelligent log filter such as SCOM 2007 R2 is a welcome
addition to a SharePoint environment.
Some best practices for monitoring event logs include the following:
Understanding the events being reported
Setting up a database for archived event logs
Archiving event logs frequently
Using an automatic log parsing and alerting tool, such as System Center Operations Manager
To simplify monitoring
hundreds or thousands of generated events each day, the administrator
should use the filtering mechanism provided in the Event Viewer.
Although warnings and errors should take priority, the informational
events should be reviewed to track what was happening before the problem
occurred. After the administrator reviews the informational events, she
can filter out the informational events and view only the warnings and
errors.
To filter events, do the following:
1. Start the Event Viewer by choosing Start, All Programs, Administrative Tools, Event Viewer.
2. Select the log from which you want to filter events.
3. Right-click the log and select Filter Current Log.
4. In the Filter Current Log window, select the types of events to filter.
5. Optionally, select the time frame in which the events occurred, event source, category, event ID, or other options that will narrow down the search. Click OK when finished.
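The same filtering can be applied to a log that has been saved from the Event Viewer as a CSV file. The following is a minimal sketch that assumes the export has a header row with a `Level` column holding values such as Information, Warning, and Error. The column names used here are an assumption about the export format and should be checked against an actual file.

```python
import csv
import io


def load_events(csv_text):
    """Parse exported event log text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_events(rows, levels=("Warning", "Error")):
    """Keep only the rows whose Level column matches one of levels."""
    wanted = {level.lower() for level in levels}
    return [
        row for row in rows
        if (row.get("Level") or "").strip().lower() in wanted
    ]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = (
        "Level,Date and Time,Source,Event ID,Task Category\n"
        "Information,1/1/2011 1:00:00 AM,SourceA,1,None\n"
        "Warning,1/1/2011 1:05:00 AM,SourceB,2,None\n"
    )
    for event in filter_events(load_events(sample)):
        print(event["Level"], event["Source"], event["Event ID"])
```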
Some warnings and errors are
normal because of bandwidth constraints or other environmental issues.
The more often the logs are monitored, the more familiar an administrator becomes
with the messages, and the more likely the administrator is to spot a problem before it affects
the user community.
Note
You might need to increase the size of the log files in the Event Viewer to accommodate an increase in logging activity.
Performing Weekly SharePoint Maintenance
Maintenance procedures
that require slightly less attention than daily checking are categorized
in a weekly routine and are examined in the following sections.
Checking Disk Space
Disk space is a precious
commodity. Although the disk capacity of a Windows Server system can
seem virtually endless, the amount of free space on all drives should be
checked daily. Serious problems can occur if there isn’t enough disk
space.
One of the most common
disk space problems occurs on database drives where all SQL SharePoint
data is held. Other volumes such as the system drive and partitions with
logging data can also quickly fill up.
As mentioned earlier, lack of free disk space can cause a multitude of problems: services can fail, applications can stop responding, and systems can even crash.
To prevent these problems from occurring, administrators should keep at least 25 percent of each volume free.
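The 25 percent guideline is easy to check with a script. The sketch below uses the standard library call `shutil.disk_usage` to list volumes that fall below the threshold. The drive letters in the example are placeholders for the farm's actual system, database, and log volumes.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the volume containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def low_space_volumes(paths, min_free_percent=25.0, usage_fn=free_percent):
    """Return {path: free percent} for every volume below the threshold."""
    report = {}
    for path in paths:
        pct = usage_fn(path)
        if pct < min_free_percent:
            report[path] = round(pct, 1)
    return report


if __name__ == "__main__":
    # Placeholder volumes; replace with the drives that matter on each server.
    for volume, pct in low_space_volumes(["C:\\", "D:\\"]).items():
        print(f"{volume} has only {pct}% free")
```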
Caution
When freeing disk
space, move or delete files and folders with caution. System
files are automatically protected by Windows Server, but data files are
not.
Verifying SharePoint Hardware Components
Hardware components
supported by Windows Server are reliable, but this doesn’t mean that
they’ll always run continuously without failure. Hardware availability
is measured in terms of mean time between failures (MTBF) and mean time
to repair (MTTR). This includes downtime for both planned and unplanned
events. These measurements provided by the manufacturer are good
guidelines to follow; however, mechanical parts are bound to fail at one
time or another. As a result, hardware should be monitored weekly to
ensure efficient operation.
Hardware
can be monitored in many different ways. For example, server systems
may have internal checks and logging functionality to warn against
possible failure, Windows Server’s System Monitor may bring a
hardware failure to light, and a physical hardware check can help determine
whether the system is about to experience a problem with the hardware.
If a failure occurs or is
about to occur on a SharePoint server, having an inventory of spare
hardware can significantly improve both the odds and the speed of
recovery. Checking system hardware on a weekly basis provides the
opportunity to correct an issue before it becomes a problem.
Archiving Event Logs
The three event logs on all
servers can be archived manually, or a script can be written to automate
the task. You should archive the event logs to a central location for
ease of management and retrieval.
The specific amount of time to
keep archived log files varies by organization. For example,
banks or other high-security organizations may be required to keep
event logs for up to a few years. As a best practice, organizations should
keep event logs for at least three months.
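A retention rule like this can be applied to the central archive folder with a short script. The sketch below only lists archives older than the retention period and leaves the decision to delete or move them to the administrator. It assumes archives are saved as `.evtx` files and treats three months as 90 days. Both are assumptions, and organizations with longer retention requirements should raise `keep_days` accordingly.

```python
import time
from pathlib import Path


def expired_archives(archive_dir, keep_days=90, pattern="*.evtx", now=None):
    """Return archived log files last modified more than keep_days ago."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(archive_dir).glob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Hypothetical central archive location.
    for path in expired_archives(r"\\logserver\eventlogs"):
        print("Past retention:", path)
```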
Tip
Organizations that deploy
System Center Operations Manager with SharePoint can take advantage of
SCOM’s capability to automatically archive event log information,
which significantly improves monitoring and reporting of
SharePoint.
Performing Monthly Maintenance Tasks
Once the maintenance
required for SharePoint is understood, it is vital to
formalize the procedures into documented steps. A maintenance plan can
contain information on what tasks to perform at different intervals. It
is recommended to perform the tasks examined in the following sections
on a monthly basis.
Maintaining File System Integrity
CHKDSK scans for file
system integrity and can check for lost clusters, cross-linked files,
and more. If Windows Server senses a problem, it runs CHKDSK
automatically at startup.
Administrators can maintain
FAT, FAT32, and NTFS file system integrity by running CHKDSK once a
month. To run CHKDSK, do the following:
1. At the command prompt, change to the partition that you want to check.
2. Type CHKDSK without any parameters to check only for file system errors.
3. If any errors are found, run the CHKDSK utility with the /f parameter to attempt to correct the errors found.
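The monthly check can be scripted so that the read-only pass runs on a schedule. The sketch below is a thin wrapper around the same two commands. It assumes the script runs from an elevated prompt on the Windows server and that a zero exit code from CHKDSK means no errors were found. Treat that interpretation as an assumption and confirm it against the CHKDSK documentation. The repair pass with /f is deliberately left as a manual decision.

```python
import subprocess


def chkdsk_command(volume, fix=False):
    """Build the CHKDSK command line for a volume such as 'D:'."""
    cmd = ["chkdsk", volume]
    if fix:
        cmd.append("/f")  # attempt to correct errors, as in step 3
    return cmd


def run_read_only_check(volume, runner=subprocess.run):
    """Run CHKDSK without parameters and return (clean, output text).

    runner is injectable so the wrapper can be tested without running
    CHKDSK. A zero exit code is assumed to mean no errors were found.
    """
    result = runner(chkdsk_command(volume), capture_output=True, text=True)
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    clean, output = run_read_only_check("D:")  # placeholder volume
    if not clean:
        print("CHKDSK reported a problem. Review the output, then consider:")
        print(" ".join(chkdsk_command("D:", fix=True)))
```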
Testing the UPS Battery
An uninterruptible power
supply (UPS) can be used to protect the system or group of systems from
power problems (such as spikes and surges) and keep the system running
long enough
after a power outage so that an administrator can gracefully shut down
the system. It is recommended that a SharePoint administrator perform the
UPS maintenance specified by the manufacturer at least once a month
and schedule a battery test each month as well.
Validating Backups
Once a month, an
administrator should validate backups by restoring the backups to a
server located in a lab environment. This is in addition to verifying
that backups were successful from log files or the backup program’s
management interface. A restore gives the administrator the opportunity
to verify the backups and to practice the restore procedures that would
be used when recovering the server during a disaster. In addition, this
procedure tests the state of the backup media to ensure that they are in
working order and builds administrator confidence for recovering from a
true disaster.
Updating Documentation
An integral part of
managing and maintaining any IT environment is to document the network
infrastructure and procedures. The following are just a few of the
documents you should consider having on hand:
SharePoint Server build guides
Disaster recovery guides and procedures
Maintenance checklists
Configuration settings
Change control logs
Historical performance data
Special user rights assignments
SharePoint site configuration settings
Special application settings
As systems and services are
built and procedures are established, document them to shorten
learning curves and reduce the effort spent on administration and maintenance.
It is important
to adequately document the IT environment, and often even more
important to keep those documents up to date. Otherwise, documents
quickly become outdated as the environment, processes, and procedures
change along with the business.
Performing Quarterly Maintenance Tasks
As the name implies, quarterly
maintenance is performed four times a year. Areas to maintain and manage
on a quarterly basis are typically self-sufficient and self-sustaining,
so only infrequent maintenance is required to keep the system healthy. This
doesn’t mean, however, that the tasks are simple or that they aren’t as
critical as those tasks that require more frequent maintenance.
Checking Storage Limits
Storage
capacity on all volumes should be checked to ensure that all volumes
have ample free space. Keep approximately 25 percent free space on all
volumes.
Running low on or
completely out of disk space creates unnecessary risk for any system.
Services can fail, applications can stop responding, and systems can
even crash if there isn’t enough disk space.
Keeping SQL database
disk space consumption to a minimum can be accomplished through a
combination of limiting document library versioning and implementing site
quotas.
Changing Administrator Passwords
Administrator passwords
should, at a minimum, be changed every quarter (90 days). Changing these
passwords strengthens security measures so that systems can’t easily be
compromised. In addition to changing passwords, other password
requirements such as password age, history, length, and strength should
be reviewed.
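A quarterly review is simpler with a list of the accounts whose passwords are past the 90-day mark. The sketch below assumes the administrator has already gathered each account's last password change date from the directory or a password vault. The account names and dates in the example are invented, and how the dates are collected is outside the scope of the sketch.

```python
from datetime import datetime, timedelta


def overdue_passwords(last_changed, max_age_days=90, today=None):
    """Return account names whose password is older than max_age_days.

    last_changed maps account name to the datetime of the last change.
    """
    today = datetime.now() if today is None else today
    limit = timedelta(days=max_age_days)
    return sorted(
        name for name, changed in last_changed.items()
        if today - changed > limit
    )


if __name__ == "__main__":
    # Invented accounts and dates, for illustration only.
    accounts = {
        "farm_admin": datetime(2011, 1, 1),
        "sql_service": datetime(2011, 5, 1),
    }
    print(overdue_passwords(accounts, today=datetime(2011, 6, 1)))
```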
Summary of Maintenance Tasks and Recommendations
Table 1 summarizes some of the maintenance tasks and recommendations examined.
Table 1. Maintenance Tasks for SharePoint Servers
| Daily | Weekly | Monthly | Quarterly | Task |
|---|---|---|---|---|
| X | | | | Check overall server functionality, including the SharePoint Health Analyzer. |
| X | | | | Verify backups. |
| X | | | | Monitor Event Viewer. |
| X | X | | | Check disk space. |
| X | X | | | Verify hardware. |
| | X | | | Archive event logs. |
| | X | | | Check SharePoint logs. |
| | X | | | Test the UPS. |
| | | X | | Check SQL maintenance plans. |
| | | X | | Run CHKDSK. |
| | | | X | Update documentation. |
| | | | X | Change administrator passwords. |
| | | | X | Test farm restores. |