2. Database Optimization Best Practices
As mentioned earlier, there are many great examples, techniques, and
practices that come highly recommended by the world’s best database
performance experts. Rather than passing judgment or suggesting any
particular tool or technique, we will instead discuss the most common
best practices for improving database performance. We encourage you to
examine some of the texts referenced earlier for more detail on each of
these practices.
2.1. Use indexes sparingly but effectively
Most database professionals understand the importance of indexes and
how they improve performance. Using the EXPLAIN command is often the best way to determine which indexes are
needed. While the problem of not having enough indexes is understood,
having too much of a good thing can cause a performance issue.
As you saw when exploring the EXPLAIN command, it is possible to create
too many indexes or indexes that are of little or no use. Each index
adds overhead for every insert and delete against the table. In some
cases, having too many indexes with wide (as in many values)
distributions can slow insert and delete performance considerably. It
can also lead to slower replication and restore operations.
You should periodically check your indexes to ensure they are
all meaningful and utilized. You should remove any indexes that are
not used, have limited use, or have wide distributions. You can often
use normalization to overcome some of the problems with wide
distributions.
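For example, a quick way to verify that an index is actually used is to run EXPLAIN on a representative query and then list the table's indexes. The following sketch assumes a hypothetical employees table with an index on last_name; the key column of the EXPLAIN output shows whether that index was chosen, and an index that is never chosen (idx_unused here is a placeholder name) can be dropped.
EXPLAIN SELECT first_name, last_name
FROM employees
WHERE last_name = 'Smith' \G

-- List all indexes on the table and drop any that are never chosen.
SHOW INDEX FROM employees \G
DROP INDEX idx_unused ON employees;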
2.2. Use normalization, but don’t overdo it
Many database experts who studied computer science or a related
discipline may have fond memories (or nightmares) of learning the
normal forms as described by C.J. Date and others. We won’t revisit
the material here; rather we will discuss the impacts of taking those
lessons too far.
Normalization (at least to third normal form) is a
well-understood and standard practice. However, there are situations
in which you may want to violate these rules.
The use of lookup tables is often a by-product of normalization.
That is, you create a special table that contains a list of related
information that is used frequently in other tables. However, you can
impede performance when you use lookup tables with limited
distributions (only a few rows or a limited number of rows with small
values) that are accessed frequently. In this case, every time your
users query information, they must use a join to get the complete
data. Joins are expensive, and frequently accessed data can add up
over time. To mitigate this potential performance problem, you can use
enumerated fields to store the data rather than a lookup table. For
example, rather than creating a table for hair color (despite what
some subcultures may insist upon, there really are only a limited
number of hair color types), you can use an enumerated field and avoid
the join altogether.
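A minimal sketch of this approach, using a hypothetical person table, might look like the following; the hair color values live in the column definition itself, so no join is needed to read them.
CREATE TABLE person (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL,
  hair_color ENUM('black', 'brown', 'blonde', 'red', 'gray', 'other')
);

-- No lookup table and no join; the value is read directly from the row.
SELECT name FROM person WHERE hair_color = 'red';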
Another potential issue concerns calculated fields. Typically,
we do not store data that is formed from other data (such as sales tax
or the sum of several columns). Rather, the calculation is
performed either during data retrieval via a view or in the
application. This may not be a real issue if the calculations are
simple or are seldom performed, but what if the calculations are
complex and they are performed many times? In this case, you are
potentially wasting a lot of time performing these calculations. One
way to mitigate this problem is to use a trigger to calculate the
value and store it in the table. While this technically duplicates
data (a big no-no for normalization theorists), it can improve
performance in situations where there are a lot of calculations being
performed.
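For example, a trigger-maintained calculated field might be sketched as follows, assuming a hypothetical order_items table in which line_total is computed from quantity and unit_price at insert time (a similar BEFORE UPDATE trigger would keep it current on updates).
CREATE TABLE order_items (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  quantity   INT NOT NULL,
  unit_price DECIMAL(10,2) NOT NULL,
  line_total DECIMAL(10,2)              -- calculated and stored by the trigger
);

DELIMITER //
CREATE TRIGGER order_items_bi BEFORE INSERT ON order_items
FOR EACH ROW
BEGIN
  -- Store the calculated value so readers do not repeat the calculation.
  SET NEW.line_total = NEW.quantity * NEW.unit_price;
END //
DELIMITER ;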
2.3. Use the right storage engine for the task
One of the most powerful features of MySQL is its support for
different storage engines. Storage engines govern how data is stored
and retrieved. MySQL supports a number of them, each with unique
features and uses. This allows database designers to tune their
database performance by selecting the storage engine that best meets
their application needs. For example, if you have an environment that
requires transaction control for highly active databases, choose a
storage engine best suited for this task (yes, Virginia, there are
some storage engines in MySQL that do not provide
transactional support). You may also have identified a view or table
that is often queried but almost never updated (e.g., a lookup table).
In this case, you may want to use a storage engine that keeps the data
in memory for faster access.
Recent changes to MySQL have permitted some storage engines to
become plug-ins, and some distributions of MySQL have only certain
storage engines enabled by default. To find out which storage engines
are enabled, issue the SHOW ENGINES
command. Example 10 shows the storage engines
on a typical installation.
Example 10. Storage engines
mysql> SHOW ENGINES \G
*************************** 1. row ***************************
Engine: InnoDB
Support: YES
Comment: Supports transactions, row-level locking, and foreign keys
Transactions: YES
XA: YES
Savepoints: YES
*************************** 2. row ***************************
Engine: MyISAM
Support: DEFAULT
Comment: Default engine as of MySQL 3.23 with great performance
Transactions: NO
XA: NO
Savepoints: NO
*************************** 3. row ***************************
Engine: BLACKHOLE
Support: YES
Comment: /dev/null storage engine (anything you write to it disappears)
Transactions: NO
XA: NO
Savepoints: NO
*************************** 4. row ***************************
Engine: CSV
Support: YES
Comment: CSV storage engine
Transactions: NO
XA: NO
Savepoints: NO
*************************** 5. row ***************************
Engine: MEMORY
Support: YES
Comment: Hash based, stored in memory, useful for temporary tables
Transactions: NO
XA: NO
Savepoints: NO
*************************** 6. row ***************************
Engine: FEDERATED
Support: NO
Comment: Federated MySQL storage engine
Transactions: NULL
XA: NULL
Savepoints: NULL
*************************** 7. row ***************************
Engine: ARCHIVE
Support: YES
Comment: Archive storage engine
Transactions: NO
XA: NO
Savepoints: NO
*************************** 8. row ***************************
Engine: MRG_MYISAM
Support: YES
Comment: Collection of identical MyISAM tables
Transactions: NO
XA: NO
Savepoints: NO
8 rows in set (0.00 sec)
The result set includes all of the known storage engines;
whether they are installed and configured (where Support = YES); a
note about the engine’s features; and whether it supports
transactions, distributed transactions (XA), or savepoints.
Note:
A savepoint is a named event that you can use like a
transaction. You can establish a savepoint and either release
(delete the savepoint) or roll back the changes since the savepoint.
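A minimal sketch, assuming a hypothetical InnoDB table t1 with a single integer column:
START TRANSACTION;
INSERT INTO t1 VALUES (1);
SAVEPOINT sp1;                  -- establish the savepoint
INSERT INTO t1 VALUES (2);
ROLLBACK TO SAVEPOINT sp1;      -- roll back the changes made since sp1
-- (or RELEASE SAVEPOINT sp1; to delete the savepoint and keep the changes)
COMMIT;                         -- only the first insert is committed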
With so many storage engines to choose from, it can be confusing
when designing your database for performance. The following describes
each of the storage engines briefly, including some of the uses for
which they are best suited. You can choose the storage engine for a
table using the ENGINE parameter on
the CREATE statement,
and you can change the storage engine by issuing an ALTER TABLE
command:
CREATE TABLE t1 (a int) ENGINE=InnoDB;
ALTER TABLE t1 ENGINE=MEMORY;
The InnoDB storage engine is the premier transactional support storage
engine. You should always choose this storage engine when requiring
transactional support; it is currently the only transactional engine
in MySQL. There are third-party storage engines in various states of
production that can support transactions, but the only
“out-of-the-box” option is InnoDB. Interestingly, all indexes in
InnoDB are B-trees, in which the index records
are stored in the leaf pages of the tree. InnoDB is the storage engine
of choice for high reliability and transaction-processing
environments.
The MyISAM storage engine is the default engine; this engine
will be used if you omit the ENGINE
option on the CREATE statement.
MyISAM is often used for data warehousing, e-commerce, and enterprise
applications. MyISAM uses advanced caching and indexing mechanisms to
improve data retrieval and indexing. MyISAM is an excellent choice
when you need storage in a wide variety of applications requiring fast
retrieval of data without the need for transactions.
The Blackhole storage engine is very interesting. It doesn’t store anything at all.
In fact, it is what its name suggests—data goes in but never returns.
All jocularity aside, the Blackhole storage engine fills a very
special need. If binary logging is enabled, SQL statements are written
to the logs, and Blackhole is used as a relay agent (or proxy) in a
replication topology. In this case, the relay agent processes data
from the master and passes it on to its slaves but does not actually
store any data. The Blackhole storage engine can be handy in
situations where you want to test an application to ensure it is
writing data, but you don’t want to store anything on disk.
The CSV storage engine can create, read, and write comma-separated value (CSV)
files as tables. The CSV storage engine is best used to rapidly export
structured business data to spreadsheets. The CSV storage engine does
not provide any indexing mechanisms and has certain issues in storing
and converting date/time values (they do not obey locality during
queries). The CSV storage engine is best used when you want to permit
other applications to share or exchange data in a common format. Given
that it is not as efficient for storing data, you should use the CSV
storage engine sparingly.
Note:
The CSV storage engine is used for writing logfiles. For
example, the backup logs are CSV files and can be opened by other
applications that use the CSV protocol (but not while the server is
running).
The Memory storage engine (sometimes called HEAP) is an in-memory
storage engine that uses a hashing mechanism to retrieve frequently used
data. This allows for much faster retrieval. Data is accessed in the
same manner as with the other storage engines, but the data is stored
in memory and lasts only as long as the server is running; it is
flushed and deleted on shutdown. The Memory storage engine is typically
good for situations in which static data is accessed frequently and
rarely ever altered (e.g., lookup tables). Examples include zip code
listings, state and county names, category listings, and other data
that is accessed frequently and seldom updated. You can also use the
Memory storage engine for databases that utilize snapshot techniques
for distributed or historical data access.
The Federated storage engine creates a single table reference from
multiple database systems. The Federated storage engine allows you to
link tables together across database servers. This mechanism is
similar in purpose to the linked data tables available in other
database systems. The Federated storage engine is best suited for
distributed or data mart environments. The most interesting feature of
the Federated storage engine is that it does not move data, nor does
it require the remote tables to use the same storage engine.
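As a sketch, a federated table is defined locally with the same structure as the remote table and a CONNECTION string that points at it; the host, credentials, and table names below are placeholders.
CREATE TABLE remote_orders (
  id    INT NOT NULL,
  total DECIMAL(10,2),
  PRIMARY KEY (id)
) ENGINE=FEDERATED
  CONNECTION='mysql://app_user:app_pass@remote_host:3306/sales/orders';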
The Archive storage engine can store large amounts of data in a
compressed format. The Archive storage engine is best suited for
storing and retrieving large amounts of seldom-accessed archival or
historical data. Indexes are not supported and the only access method
is via a table scan. Thus, you should not use the Archive storage
engine for normal database storage and retrieval.
The Merge (MRG_MYISAM) storage engine encapsulates a set of MyISAM tables
that have the same structure (table layout or schema) so they can be
referenced as a single table. Thus, the tables are partitioned by the location of
the individual tables, but no additional partitioning mechanisms are
used. All tables must reside on the same server (but not necessarily
the same database).
Note:
When a DROP command is
issued on a merged table, only the Merge specification is removed.
The original tables are not altered.
The best attribute of the Merge storage engine is speed. It
permits you to split a large table into several smaller tables on
different disks, combine them using a merge table specification, and
access them simultaneously. Searches and sorts will execute more
quickly, since there is less data in each table to manipulate. Also,
repairs on tables are more efficient because it is faster and easier
to repair several smaller individual tables than a single large table.
Unfortunately, this configuration has several disadvantages:
You must use identical MyISAM tables to form a single merge
table.
The replace operation is not allowed.
Indexes are less efficient than for a single table.
The Merge storage engine is best suited for very large database
(VLDB) applications, like data warehousing, where data resides in more
than one table in one or more databases. You can also use it to help
solve partitioning problems where you want to partition horizontally
but do not want to add the complexity of the partition table
options.
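A minimal sketch of a merge table, assuming two identical (and hypothetical) MyISAM log tables:
CREATE TABLE log_2009 (msg VARCHAR(255)) ENGINE=MyISAM;
CREATE TABLE log_2010 (msg VARCHAR(255)) ENGINE=MyISAM;

CREATE TABLE log_all (msg VARCHAR(255))
  ENGINE=MRG_MYISAM
  UNION=(log_2009, log_2010)
  INSERT_METHOD=LAST;           -- inserts go into the last table in the UNION list

-- Queries against log_all see the rows of both underlying tables.
SELECT COUNT(*) FROM log_all;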
Clearly, with so many choices of storage engines, it is possible
to choose engines that can hamper performance or, in some cases,
prohibit certain solutions. For example, if you never specify a
storage engine when the table is created, MySQL uses the default
storage engine. If not set manually, the default storage engine
reverts to the platform-specific default, which may be MyISAM on some
platforms. This may mean you are missing out on optimizing lookup
tables or limiting features of your application by not having
transactional support. It is well worth the extra time to include an
analysis of storage engine choices when designing or tuning your
databases.
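As a quick check, you can see which engine will be used when the ENGINE option is omitted and, if desired, change it for the current session. The variable is named storage_engine on older servers and default_storage_engine on newer ones, so adjust the sketch below to your version.
SHOW VARIABLES LIKE '%storage_engine%';

-- Change the default for this session only (use default_storage_engine on newer servers).
SET SESSION storage_engine = InnoDB;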
2.4. Use views for faster results via the query cache
Views are a very handy way to encapsulate complex queries to make it
easier to work with the data. You can use views to limit data both
vertically (fewer columns) and horizontally (a WHERE clause on the underlying SELECT statement).
Both are very handy and, of course, more complex views combine both
practices to limit the result set returned to the user, to hide
certain base tables, or to ensure an efficient join is executed.
Using views to limit the columns returned can help you in ways
you may not have considered. It not only reduces the amount of data
processed, it can also help you avoid costly SELECT * operations that users tend to do
without much thought. When many of these types of operations are run,
your applications process far too much data, which can affect the
performance of not only the application but also the server, and, more
importantly, can decrease available bandwidth on your network. It is
always a good idea to use views to limit data in this manner and to hide
access to the base table(s), removing any temptation users may have to
access the base table directly.
Views that limit the number of rows returned also help reduce
network bandwidth and can improve the performance of your
applications. These types of views also protect against proliferation
of SELECT * queries. Using views in
this manner requires a bit more planning, because your goal is to
create meaningful subsets of the data. You will have to examine the
requirements for your database and understand the queries issued to
form the correct WHERE clauses for
these queries.
With a little effort, you may find you can create combinations
of vertically and horizontally restrictive views, thereby ensuring
your applications operate on only the data that is needed. The less
data moving around, the more data your applications can process in the
same amount of time.
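For example, a view that restricts a hypothetical customers table both vertically (three columns) and horizontally (active customers only) might be sketched as follows.
CREATE VIEW v_active_customers AS
  SELECT customer_id, name, email
  FROM customers
  WHERE status = 'active';

-- Applications query the view rather than the base table.
SELECT name, email FROM v_active_customers;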
Perhaps the best way to use views is to eliminate poorly formed
joins. This is especially true when you have a complex normalized
schema. It may not be obvious to users how to combine the tables to
form a meaningful result set. Indeed, most of the work done by DBAs
when striving for better performance is focused on correcting poorly
formed joins. Sometimes the gain is trivial (for example, fewer rows
processed during the join operation), but most of the time the improvement
in response time is significant.
Views can also be helpful when using the query cache in MySQL.
The query cache stores the results of frequently used (accessed)
queries. Using views that provide a standardized result set can
improve the likelihood that the results will be cached and, therefore,
retrieved more efficiently.
You can improve performance with a little design work and the
judicious use of views in your databases. Take the time to examine how
much data is being moved around (both the number of columns and rows)
and examine your application for any query that uses joins. Spend some
time forming views that limit the data and identify the most efficient
joins and wrap them in a view as well. Imagine how much easier you’ll
rest knowing your users are executing efficient joins.
2.5. Use constraints
The use of constraints provides another tool in your arsenal for
combating performance problems. Rather than proselytizing about
limitations on using constraints, we encourage you to consider
constraints a standard practice and not an afterthought.
There are several types of constraints available in MySQL,
including the following:
Unique indexes
Primary keys
Foreign keys
Enumerated values
Sets
Default values
NOT NULL option
We’ve already discussed the use (and overuse) of indexes. Indexes improve data
retrieval by allowing the system to store and find data more
quickly.
A unique index is simply an index on one
field in a table that, when used with the NOT NULL
constraint to require a value, guarantees there are no duplicates in
the table. That is, any given value can appear in only one row of
the indexed field. Use unique indexes for fields that
you want to prohibit duplicates of, such as sequence numbers, order
numbers, social security numbers, etc. A table can have one or more
unique indexes.
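A brief sketch, using a hypothetical orders table in which duplicate order numbers are prohibited:
CREATE TABLE orders (
  order_no    CHAR(10) NOT NULL,      -- NOT NULL plus the unique index prevents duplicates
  customer_id INT      NOT NULL,
  UNIQUE INDEX idx_order_no (order_no)
);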
A primary key is also considered a unique index, but in this case it
uniquely identifies each row in a table and prohibits duplication.
Primary keys are created as a result of normalization and, when
designed effectively, form the join columns for table joins.
One of the most common primary keys is an automatically
generated sequence number (called a surrogate key) that uniquely
identifies a row. MySQL provides an AUTO_INCREMENT
option to tell the system to generate these special unique values. The
use of surrogate key values is considered a compromise by some
database theorists and some discourage its use, because a primary key
should be made from the existing fields and not artificially
generated. While we won’t go so far as to say you should never use
surrogate keys, we will say that you should use them sparingly. If you
find yourself using AUTO_INCREMENT
on virtually every table, you are probably overusing this
feature.
Foreign keys are also created as a result of the normalization process. They allow the formation of a
parent/child or master/detail relationship where a row in one table is
the master and one or more rows in another table contain the details
of the master. A foreign key is a field identified in the detail table
that refers back to the master. Foreign keys also permit cascading
operations where deletion of the master row also deletes all of the
detail rows.
Note:
Currently, only InnoDB supports foreign keys.
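A minimal sketch of a master/detail relationship with a cascading delete, using hypothetical invoices and invoice_lines tables (both InnoDB, since only InnoDB enforces foreign keys):
CREATE TABLE invoices (
  invoice_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE invoice_lines (
  line_id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  invoice_id INT NOT NULL,
  FOREIGN KEY (invoice_id) REFERENCES invoices (invoice_id)
    ON DELETE CASCADE               -- deleting a master row deletes its detail rows
) ENGINE=InnoDB;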
We’ve discussed using enumerated values to
replace small lookup tables. However, enumerated values can be a
performance tool. This is because the text for the enumerated values
is stored only once—in the table header structures. What is saved in
the rows is a numeric reference value that forms an index (array
index) of the enumerated value. Thus, enumerated value lists can save
space and can make traversing the data a bit more efficient. An
enumerated field type allows one and only one value.
The use of sets in MySQL is similar to
using enumerated values. However, a set field type allows storage of
one or more values in the set. You can use sets to store information
that represents attributes of the data rather than using a
master/detail relationship. This not only saves space in the table
(set values are bitwise combinations), but also eliminates the need to
access another table for the values.
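A short sketch, using a hypothetical articles table whose tags column can hold more than one value:
CREATE TABLE articles (
  id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  tags SET('news', 'sports', 'weather', 'opinion')
);

INSERT INTO articles (tags) VALUES ('news,opinion');   -- two values in one field

-- FIND_IN_SET() locates rows that contain a given member.
SELECT id FROM articles WHERE FIND_IN_SET('news', tags);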
The use of the DEFAULT option
to supply default values is an excellent way to prevent problems
associated with poorly constructed data. For example, if you have a
numeric field that represents values used for calculations, you may
want to ensure that when the field is unknown, a default value is
given instead. You can set defaults on most data types. You can also
use defaults for date and time fields to avoid problems processing
invalid date and time values. More importantly, default values can
save your application from having to supply the values (or using the
less reliable method of asking the user to provide them), thereby
reducing the amount of data sent to the server during data
entry.
You should also consider using the NOT
NULL option when specifying fields that must have a value.
If an entry is attempted where there are NOT
NULL columns and no data values are provided, the INSERT statement will fail. This prevents
data integrity issues by ensuring all important fields have
values.
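The following sketch combines both options in a hypothetical line_items table; only item_name must be supplied, and the remaining columns fall back to safe defaults.
CREATE TABLE line_items (
  id        INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
  item_name VARCHAR(100) NOT NULL,                 -- a value must be provided
  quantity  INT          NOT NULL DEFAULT 1,       -- defaults to 1 if omitted
  tax_rate  DECIMAL(4,2) NOT NULL DEFAULT 0.00     -- safe value for calculations
);

INSERT INTO line_items (item_name) VALUES ('widget');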
2.6. Use EXPLAIN, ANALYZE, and OPTIMIZE
We have already discussed the benefits of these commands. We list them
here as a best practice to remind you that these tools are vital for
diagnostic and tuning efforts. Use them often and with impunity, but
follow their use carefully. Specifically, use ANALYZE and OPTIMIZE when it makes sense and not as a
regular, scheduled event. We have encountered administrators who run
these commands nightly, and in some cases that may be warranted, but
in the general case it is not warranted and can lead to unnecessary
table copies (as we saw in the earlier examples). Clearly, forcing
the system to copy data regularly can be a waste of time and could
lead to limited access during the operation.
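When they are warranted (for example, after a large bulk load or many deletions), the commands themselves are simple; the table name below is a placeholder.
ANALYZE TABLE t1;    -- refreshes the key distribution statistics used by the optimizer
OPTIMIZE TABLE t1;   -- reclaims unused space and defragments the table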
Now that we’ve discussed how to monitor and improve database
performance, let us look more closely at one of the most successful
and most popular features—replication. In the next section, we will
discuss how you can monitor and improve replication in MySQL. We have
placed this topic last because, as you shall see, you must have a
well-performing server with well-performing databases and queries
before you can begin to improve replication performance.