SQL Server 2012 : Troubleshooting Methodology and Practices - Approaching Problems

- How To Install Windows Server 2012 On VirtualBox
- How To Bypass Torrent Connection Blocking By Your ISP
- How To Install Actual Facebook App On Kindle Fire
1/1/2015 8:31:34 PM


SQL Server is a complex product, often tightly integrated with mid-tier and end-user applications. Problems can be many and varied, and the scope of a database administrator (DBA) is broad, covering storage, servers, networking, applications, and meeting business requirements. DBAs often find themselves supporting third-party and in-house-developed applications with varying levels of maturity and robustness. There are no fixed requirements for Transact-SQL development, and application developers often overlook the performance and scalability of code they write.

Consisting of the following three steps (the data collection and analysis may require multiple iterations until the problem’s cause is identified):

1. Define the problem.
2. Iterate through:
  • Data collection
  • Data analysis
3. Validate and implement resolution.

Investing time and effort to develop and refine a troubleshooting methodology helps improve the efficiency and speed with which you troubleshoot problems. Much like planning a trip, the route and endpoint may vary, but identifying the destination and developing an approach to planning an efficient route is a distinct and different skill from driving itself. As you plan subsequent journeys, you can refine your approach, becoming more adept at determining the quickest route and better at estimating the length of time it takes between breaks and optimal departure time.

Troubleshooting SQL Server problems is similar to planning a long car trip.

As you do so, consider roles and responsibilities, communication, reporting, and seeking external help. Reaching a successful resolution can often be achieved by more than one route. Identifying the path of least resistance while achieving the goal is the hallmark of a database professional experienced in troubleshooting complex issues.


Developing a professional methodology to managing problems will lead to a less stressful time at work, help make work more rewarding, and differentiate you from others. Although SQL Server is a discrete technology, it is often the case that when problems occur, uncertainty arises regarding the root cause, and problem scope is rarely well-defined. As such, issues can be passed around support teams with little progress or ownership.

Although many of the details here are not specific to SQL Server problems, they are good practices for troubleshooting many types of complex IT issues.

Ten Steps to Successful Troubleshooting

The following steps provide a detailed methodology for successful and efficient incident resolution. They intentionally separate identification of the root cause and issue resolution. These are different tasks, and many situations require equal (or greater) effort to identify the root cause of an issue versus actually fixing it. Indeed, the fix itself may be trivial, but knowing exactly which fix to make is completely dependent on accurately understanding the problem and its cause; therefore, accurate root cause diagnosis is vital.

To get in front of a complex issue — that is, understand it and resolve it — use the following ten steps:

1. Define the problem — Establish a clear problem statement. The objective is to capture in one or two sentences a summary of the technical problem and success criteria. A detailed explanation will likely be required later, but aim initially to create a concise summary for circulation to interested parties.
2. Ascertain the problem’s impact — The business stakeholders and sponsors often don’t want to know technical details. They want to know the operational and financial impact of the incident. This must be categorized and monetized to the furthest extent possible. For example, if you had a website outage, you should estimate the cost to the organization — e.g., $10,000/ hour. If degraded service is likely, how much will it cost in lost revenue or reputation? If the incident prevents employees from completing their work (e.g., call center workers are unproductive), this can be estimated by the cost of wages plus operational impact (e.g., $10/ hour for 50 call center employees plus any overtime to make callbacks).
3. Engage the correct resources — These could be internal or external. In many enterprise scenarios, it is necessary to formally engage internal resources from other disciplines, such as storage operations, application support, and incident management. There may be external suppliers or third parties who should be engaged, such as hardware manufacturers, software vendors, or implementation consultants. Ensure that all participants are briefed with the same problem description and have a good understanding of the success criteria.
4. Identify potential causes — Meet all necessary parties (physically or virtually) to share the problem description, its impact, and any troubleshooting steps already performed. Consider proposed options to mitigate the impact or work around the problem. Identify any possibility to minimize the immediate impact to the business while a long-term solution is sought.
5. Plan and coordinate tasks across teams — Develop a plan, consisting of a number of hypotheses and a number of scenarios that may cause or influence the problem. Seek to prove or disprove each hypothesis by assigning it to a team with the skills and experience necessary to prove the hypothesis and reach a conclusion. — The intention is to narrow the focus by eliminating components that are not causing the problem, until eventually the problem component is found. Iterate around this method until the hypotheses are proven or disproven.
6. Select a communication plan and review — Document the plan and agree who will keep management, end users, and the technical team updated. Mutually agree on a time to reconvene, (e.g., every 2 hours or 4 hours may be appropriate). In scenarios with geographically dispersed teams, maintaining an open conference call to assist troubleshooting can be useful, but it’s still important to plan and execute regular reviews.
7. Identify root cause — After a number of iterations (each iteration should be isolated, repeatable, and have narrow scope),you will have disproved a number of hypotheses, and hopefully proved one. Once the cause of the problem is understood, progress to the next step to find a fix.
8. Determine solution — This step involves identifying a resolution to the defined and understood cause of the problem.
9. Test and implement — Even if the problem does not exist in the test or pre-production environment, implement the fix there first. This involves making the identified change and confirming no undesired impact, then deploying to the production environment. If possible, ensure a rollback position and be prepared to invoke this plan if necessary.
10. Review — Post-mortem analysis will help prevent further recurrence of this issue or new issues in the future and can be used to identify other vulnerable systems within the organization which should be fixed, and will improve the troubleshooting approach to ensure it is as optimized and efficient as possible.

The ten steps outlined above and described in more detail in the following sections describe a troubleshooting approach you can adapt and simplify as desired. Not all problems require full formal engagement, but adopting an approximation of these disciplines can help you prioritize other activities, such as monetizing the impact of problems and defining a clear problem statement.

Behavior and Attitude

In addition to employing a good troubleshooting approach, adopting a positive attitude with moderate determination and persistence to identify the root cause and resolve issues definitely helps. A positive attitude leads to better quality results, faster resolution, and it will reduce the stress level for you and co-workers during the troubleshooting process. Using a consistent approach to resolving problems by decomposing them scientifically is a proven and effective method, and many of these aspects are within your control.

The following behaviors and attitudes are characteristic of the most effective database professionals when troubleshooting complex problems:

  • Remain calm — Stay objective, no matter how urgent the problem. Project confidence and calmness to your peers, end users, and management, even if they show signs of stress or panic. This reassures them that you are in control and able to resolve the problem. These people are more likely to give you the time and space necessary to investigate and resolve the issue if they trust your capability.
  • Remember that problems are never random — Problems with computers happen for a reason. When you don’t understand the reason, the cause may seem random, but there is always an explanation. Intermittent or infrequent problems in particular appear random; seek to identify patterns or correlating events that could lead to the circumstances that cause the problem.
  • Avoid prejudice — Never assume that you know how to solve a problem until you have a problem description and have done some basic testing. It is not necessary to provide an instant answer; the correct answer with a short delay trumps a quick, inaccurate answer. This habit also builds your credibility with management as a reliable and capable engineer.
  • Avoid looking for fixes — Ensure that finding the cause is your first priority! The people around you will be pressing hard for a fix or an estimated time to fix. The fix is the goal, but you must first lay the foundation by understanding the cause.
  • Think ahead — Proactively consider potential blockers. If you may need to restore the database, start the tape retrieval process in parallel with troubleshooting. This reduces overall downtime and impact if you do need to revert to the backup.

Success Criteria

Having defined the problem, recognizing its resolution is usually relatively straightforward. Nonetheless, explicitly agreeing on a set of success criteria helps to structure troubleshooting steps and provide a positive test case scenario. Otherwise, what constitutes problem resolution can be subjective.

With performance problems, for example, it can be difficult to reach a consensus about what constitutes good-enough performance, which can mean different things to different people. From a DBA’s perspective, it’s often the case that the first few optimizations realize the most performance gains, with each subsequent performance improvement harder to achieve — meaning more effort, more fundamental schema changes, and smaller incremental performance improvement. For this reason, it’s important to agree on the performance objective and when to stop tuning.

Unfortunately, it’s common to see an enterprise spend a lot of time troubleshooting numerous issues that have nothing to do with the main source of the problem. Avoid this by defining both the problem and the success criteria, and seeking agreement with the sponsor; that way, expectations are clear and understood by all parties.

Working with Stakeholders

Stakeholders are a group of people usually consisting of business management, IT management, owners, shareholders, and anyone with an interest in the success or failure of the organization. Most business stakeholders want problems resolved as fast as possible using the fewest possible resources, and managers often feel under pressure to provide answers to users, their superiors, and external stakeholders such as customers, investors, auditors, or the media.

When managers are not well informed or they don’t have confidence in the incident team, this can lead to the undesirable behavior of micro-management. These are the managers who hover, requesting constant updates and generally inhibiting the troubleshooting process. You can avoid this, however, by proactively handling an incident to ensure both that stakeholders have the information they need and that they receive regular updates.

Broadly speaking, managers look first for a solution, then the cause. Database professionals should first attempt to understand the cause, then identify a solution. These opposing approaches can lead to friction, so it’s important to recognize them and respect each other’s priorities.

To minimize friction with management, try enlisting their help by nominating a single spokesperson for the incident. Request that they communicate with stakeholders and anyone who isn’t directly involved in troubleshooting the problem. Agree on a schedule for providing updates and stick to that schedule to reduce distractions, such as requests for information. Identify one person to whom you will provide updates, letting that person communicate with anyone else who needs the information. If more than one person is directly involved in the technical aspects of troubleshooting, nominate just one technical person to talk to the management contact.

Managers can also help by gathering information to determine the problem’s real impact on the business. As a guideline, try to establish the following:

  • How severely is the system affected?
  • How many users cannot work?
  • Is money being lost? If so, quantify the amount.
  • What is the visibility of the issue?
  • Are external customers affected?
  • Could any regulatory or compliance obligations be breeched?
  • How serious are the consequences if the problem persists?

Management can also be enlisted to identify mitigating factors. Are any options available to run a degraded service such as manual systems that enable some operations to continue? Encourage managers to generate ideas for a short-term tactical solution while the root cause is investigated and a resolution implemented.

Managers might also be helpful in engaging third parties, initially to make contact and open a dialog, and, in situations in which escalation is required, to engage the right resources to advance a solution. Each of these factors can be used to help shape the solution.

Service-Level Agreements

A service-level agreement (SLA) forms an agreement between IT and the business or between an outsourcer and an organization. The SLA should define availability and performance metrics for key business applications. SLAs often include metrics for response and resolution times in the event of an incident. These agreements are non-functional requirements and useful for managing business expectations in terms of application performance, availability, and response time in the event of an incident.

Two terms commonly used in storage solution design can be borrowed and adapted to most other areas of IT and business agreements: recovery point objective (RPO) and recovery time objective (RTO). Both can be included within an SLA to govern the data loss and recovery period following an incident.

RTO refers to the amount of time a solution can be down before the system is recovered. This varies according to the type of failure — for example, in the event of a single server failure in a failover cluster, the RTO could reasonably be 1–2 minutes; in the event of a total site loss, it might reasonably be four hours. This RTO metric essentially governs how long IT has to restore service in the event of various types of failures.

RPO refers to how much data loss can be tolerated without impact to the business. In the SQL Server world this commonly determines the frequency of transaction log backups. If, for example, the RPO were five minutes, you would need to take log backups every five minutes to ensure a maximum data loss of the same duration. Combining these facets of an agreement, it would be fairly common for a DBA to agree to configure five-minute log backups, and log shipping to a second location with an RPO of 15 minutes and an RTO of four hours. This would mean bringing the disaster recovery location online within four hours and ensuring a maximum data loss duration of 15 minutes. Agreeing to these objectives ahead of time with the business is an important part of setting and managing expectations.

Engaging External Help

It is not always necessary or possible to solve a problem with external assistance if there is a lack of knowledge, experience or time. Knowing who and when to call are important aspects of successful troubleshooting. Often, the objection to hiring a consultant, specialist, or support provider, or to open a support request with Microsoft Customer Service and Support (CSS), is financial. In reality, many problem scenarios can be much more expensive to resolve without external help. The time, resources, and opportunity costs of taking a long time to solve a problem, solving it in an inappropriate or inefficient way, or not solving it at all can be high. Ensure that all factors are taken into consideration when deciding if and when to engage outside help.

In some situations, it may be cheaper to engage help immediately — e.g., when the day rate for a consultant is half the cost of revenue loss per day; in this scenario it may make sense to bring in a consultant immediately. For example, it may be most beneficial to engage a specialist for problems related to rarely used features, as an organization might not have deep expertise with such features.

Besides cost, another barrier to enlisting external help is a desire to be perceived by the organization as the expert in a particular feature or technology. This can be quite short-sighted, particularly if an incident is causing revenue or reputation damage to the organization. Knowing when to ask for help is a valuable trait, and engaging an external resource also provides the opportunity to learn and increase the value you deliver to the business. Using external resources also provides a firsthand opportunity to see different approaches to troubleshooting, which can be more valuable than the technical skills themselves.

Certain types of problems are well suited for outside help. One such example is database corruption. This can be a serious problem, and many urban legends and “common wisdom” surround the best approach to resolving corruption problems, and mistakes could easily make a problem worse, without solving the underlying cause of the problem.

If you do engage support, whether it’s from CSS, a consultant, or another outside assistance, you will need to provide them with some basic information. Consider the following as a starting point:

  • Environment overview (network diagram, application architecture)
  • Problem statement and steps to reproduce
  • success criteria
  • Key stakeholders
  • Steps already taken to resolve issue and outcome
  • Windows System and Application Event Logs and SQL Server Error Logs
  • Profiler trace containing the problem (if possible)
  • SQLDiag output if it will add value
Top 10
- Microsoft Visio 2013 : Adding Structure to Your Diagrams - Finding containers and lists in Visio (part 2) - Wireframes,Legends
- Microsoft Visio 2013 : Adding Structure to Your Diagrams - Finding containers and lists in Visio (part 1) - Swimlanes
- Microsoft Visio 2013 : Adding Structure to Your Diagrams - Formatting and sizing lists
- Microsoft Visio 2013 : Adding Structure to Your Diagrams - Adding shapes to lists
- Microsoft Visio 2013 : Adding Structure to Your Diagrams - Sizing containers
- Microsoft Access 2010 : Control Properties and Why to Use Them (part 3) - The Other Properties of a Control
- Microsoft Access 2010 : Control Properties and Why to Use Them (part 2) - The Data Properties of a Control
- Microsoft Access 2010 : Control Properties and Why to Use Them (part 1) - The Format Properties of a Control
- Microsoft Access 2010 : Form Properties and Why Should You Use Them - Working with the Properties Window
- Microsoft Visio 2013 : Using the Organization Chart Wizard with new data
- First look: Apple Watch

- 3 Tips for Maintaining Your Cell Phone Battery (part 1)

- 3 Tips for Maintaining Your Cell Phone Battery (part 2)
programming4us programming4us