When the problem is defined and well
understood, and the success criteria have been agreed upon, the next
step is to gather data. What data you should gather depends on the
problem and what (if any) work has already been completed.
It is critical that decisions about remedial
action are based on data. Decisions or recommendations without the
foundation of empirical data are simply guesses or assumptions.
Gathering data is an iterative process that may require several
iterations before the complete problem is captured and a conclusive
decision can be reached. As mentioned earlier, ensure that stakeholders
understand that sufficient data will be required ahead of any
recommendations and problem resolution. These stakeholders must also
understand the value and long-term benefits of quantitative analysis
and evidence-based decisions. Clearly explaining your methodology helps
to win their confidence in the process and its outcome.
Since data collection is so important in
identifying the root cause, the following section provides an approach
around data collection and specific guidance to ensure the data
collection objectives are met.
Focused Data Collection
Only rarely is complete fault
information provided when an incident is escalated. The nature of
databases means serverwide problems are more immediately obvious, and
support teams are notified rapidly. High-impact database problems reach
support teams by phone or walk-up much faster than automated alerts or
formal support-ticket escalation. Typically, escalated support cases
contain insufficient data to make any kind of decision, and further
analysis is required before any kind of remediation can begin. In the
early minutes (sometimes hours) of these incidents, information can be
vague while the cause of the problem is unknown. At this point, the
scope of the issue is often the complete solution, consisting of all
components of the solution.
During this time, it can be useful to adopt a
top-down approach to troubleshooting by starting with Performance
Monitor (PerfMon). PerfMon has the advantage of being a serverwide
diagnostics tool, and it can be useful in identifying or indemnifying
the database tier as the root cause of the problem.
The divide-and-conquer method is especially
useful when troubleshooting performance problems. This approach takes
the complete end-to-end application environment and selects a midpoint
between the client and the database server to determine whether the
performance problem exists at that point. Based on the outcome, you can
then focus on the problem half and iterate through it until the problem
component is identified. This approach can work particularly well with
the database tier, as calling a number of stored procedures to validate
database functionality and responsiveness can be a useful junction in
identifying or absolving SQL Server as the cause of the performance
problem.
Understanding Data Gathering
Data gathering is a balance between
collecting sufficient information to capture the problem and not
collecting so much data that the collection process itself affects
system performance or there is simply too much data to analyze
efficiently.
BLACK BOX TRACING
Consider a scenario with an intermittent problem for which there is no clear pattern to occurrences. Black box tracing
creates a server-side trace, writing trace data to a circular log file
that contains data for a specified time period (e.g., 1 hour or 4
hours). If the problem event occurs again and the trace is stopped
before the occurrence of the problem is overwritten in the log (this
could be automated), the trace will contain the problem. You can also
look a setting up in Extended Events to help with this.
If the problem can be reproduced it will be much
easier to collect data and refine the scope rapidly. If it occurs in a
predictable pattern, it is usually possible to restrict data collection
to a short period of time and gather all necessary data.
Conversely, if the problem happens infrequently
or without pattern, a different strategy is required. Often it isn’t
possible to start data capture when the problem occurs because events
occurring just before the problem starts may be important. Therefore,
consider using a black box circular trace to enable a continuous
lightweight trace that can be stopped when the problem occurs.
Tools and Utilities
The following list summarizes some of
the most commonly used data collection tools and analysis utilities.
- PerfMon — Performance Monitor
(PerfMon) ships with Windows and can be used to gather information on
server resources and services. It can track serverwide information such
as CPU and memory usage, I/O statistics, and network activity. Several
SQL Server-specific counters can be useful for various troubleshooting
and monitoring scenarios.
- Profiler — SQL Server Profiler can be used to capture statement-level information from within the database engine.
- XEvents — Extended Events are a
lightweight event-driven data-capture feature that can assist
troubleshooting while minimizing the monitoring footprint.
- PSSDiag — This is a wrapper around
SQLDiag, PerfMon, and other add-ins. SQLDiag can do anything that
PSSDiag can do, but it is not pre-packaged with all the add-ins that
PSSDiag may be configured with. PSSDiag is usually configured by a
Microsoft support engineer and sent to help troubleshoot a specific
problem. It is specific to a certain version of SQL Server, and the
add-ins are usually architecture specific (x86 or x64).
- SQLDiag — SQLDiag ships with SQL
Server. It can be used to gather basic environmental information such
as the SQL Server Error Logs, Event Logs, and SQL Server configuration
settings. It can also be used to capture time-synchronized Profiler and
PerfMon information.
- Event Logs (Application, System, and Security) — These logs are often useful, displaying which errors, warnings, and informational messages have occurred in the recent past.
- Application logs — If the
application instrumentation includes writing Error Log output or
diagnostic information, these logs can be useful for identifying the
cause of a problem.
- User dumps — If you see an exception in the SQL Server Error Logs, you should also see a mini-dump file with the extension .mdmp. This can be used by Microsoft CSS to help determine why the exception occurred.
- NetMon — This is a network sniffer
that is used to look at data as it is sent over the network. It is
often used to diagnose connectivity or Kerberos problems.
- CMS — Central Management Server is
a feature with SQL Server Management Studio and provides a method of
storing your SQL Server registrations in a central database. It can be
useful in a troubleshooting scenario because you don’t have to remember
specific SQL Server instance names and passwords — they are already
stored in CMS. In addition, you can execute commands against groups of
CMS servers at once.
- Management Data Warehouse — This
SQL Server Management Studio tool is used for performance trending. You
can use it to collect and consolidate various data over time, which you
can then analyze to see how performance has changed.
- Policy-Based Management (PBM) — PBM
can be used to validate whether predetermined standards have been
followed. Some policies can prevent certain actions from ever occurring.