2. Database Optimization Best Practices
As mentioned earlier, there are many great examples, techniques, and
practices that come highly recommended by the world’s best database
performance experts. Rather than passing judgment or suggesting any
particular tool or technique, we will instead discuss the most common
best practices for improving database performance. We encourage you to
examine some of the texts referenced earlier for more detail on each of
these practices.
2.1. Use indexes sparingly but effectively
Most database professionals understand the importance of indexes and
how they improve performance. Using the EXPLAIN command is often the best way to determine which indexes are
needed. While the problem of not having enough indexes is understood,
having too much of a good thing can cause a performance issue.
As you saw when exploring the EXPLAIN command, it is possible to create
too many indexes or indexes that are of little or no use. Each index
adds overhead for every insert and delete against the table. In some
cases, having too many indexes with wide (as in many values)
distributions can slow insert and delete performance considerably. It
can also lead to slower replication and restore operations.
You should periodically check your indexes to ensure they are
all meaningful and utilized. You should remove any indexes that are
not used, have limited use, or have wide distributions. You can often
use normalization to overcome some of the problems with wide
distributions.
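For example, a quick way to verify that an index is actually used is to run EXPLAIN on a representative query and then list the table's indexes. The following sketch assumes a hypothetical employees table with an index on last_name; the key column of the EXPLAIN output shows whether that index was chosen, and an index that is never chosen (idx_unused here is a placeholder name) can be dropped.
EXPLAIN SELECT first_name, last_name
FROM employees
WHERE last_name = 'Smith' \G

-- List all indexes on the table and drop any that are never chosen.
SHOW INDEX FROM employees \G
DROP INDEX idx_unused ON employees;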
2.2. Use normalization, but don’t overdo it
Many database experts who studied computer science or a related
discipline may have fond memories (or nightmares) of learning the
normal forms as described by C.J. Date and others. We won’t revisit
the material here; rather we will discuss the impacts of taking those
lessons too far.
Normalization (at least to third normal form) is a
well-understood and standard practice. However, there are situations
in which you may want to violate these rules.
The use of lookup tables is often a by-product of normalization.
That is, you create a special table that contains a list of related
information that is used frequently in other tables. However, you can
impede performance when you use lookup tables with limited
distributions (only a few rows or a limited number of rows with small
values) that are accessed frequently. In this case, every time your
users query information, they must use a join to get the complete
data. Joins are expensive, and frequently accessed data can add up
over time. To mitigate this potential performance problem, you can use
enumerated fields to store the data rather than a lookup table. For
example, rather than creating a table for hair color (despite what
some subcultures may insist upon, there really are only a limited
number of hair color types), you can use an enumerated field and avoid
the join altogether.
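A minimal sketch of this approach, using a hypothetical person table, might look like the following; the hair color values live in the column definition itself, so no join is needed to read them.
CREATE TABLE person (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL,
  hair_color ENUM('black', 'brown', 'blonde', 'red', 'gray', 'other')
);

-- No lookup table and no join; the value is read directly from the row.
SELECT name FROM person WHERE hair_color = 'red';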
Another potential issue concerns calculated fields. Typically,
we do not store data that is formed from other data (such as sales tax
or the sum of several columns). Rather, the calculation is
performed either during data retrieval via a view or in the
application. This may not be a real issue if the calculations are
simple or are seldom performed, but what if the calculations are
complex and they are performed many times? In this case, you are
potentially wasting a lot of time performing these calculations. One
way to mitigate this problem is to use a trigger to calculate the
value and store it in the table. While this technically duplicates
data (a big no-no for normalization theorists), it can improve
performance in situations where there are a lot of calculations being
performed.
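For example, a trigger-maintained calculated field might be sketched as follows, assuming a hypothetical order_items table in which line_total is computed from quantity and unit_price at insert time (a similar BEFORE UPDATE trigger would keep it current on updates).
CREATE TABLE order_items (
  id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  quantity   INT NOT NULL,
  unit_price DECIMAL(10,2) NOT NULL,
  line_total DECIMAL(10,2)              -- calculated and stored by the trigger
);

DELIMITER //
CREATE TRIGGER order_items_bi BEFORE INSERT ON order_items
FOR EACH ROW
BEGIN
  -- Store the calculated value so readers do not repeat the calculation.
  SET NEW.line_total = NEW.quantity * NEW.unit_price;
END //
DELIMITER ;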
2.3. Use the right storage engine for the task
One of the most powerful features of MySQL is its support for
different storage engines. Storage engines govern how data is stored
and retrieved. MySQL supports a number of them, each with unique
features and uses. This allows database designers to tune their
database performance by selecting the storage engine that best meets
their application needs. For example, if you have an environment that
requires transaction control for highly active databases, choose a
storage engine best suited for this task (yes, Virginia, there are
some storage engines in MySQL that do not provide
transactional support). You may also have identified a view or table
that is often queried but almost never updated (e.g., a lookup table).
In this case, you may want to use a storage engine that keeps the data
in memory for faster access.
Recent changes to MySQL have permitted some storage engines to
become plug-ins, and some distributions of MySQL have only certain
storage engines enabled by default. To find out which storage engines
are enabled, issue the SHOW ENGINES
command. Example 10 shows the storage engines
on a typical installation.
Example 10. Storage engines
mysql> SHOW ENGINES \G
*************************** 1. row ***************************
Engine: InnoDB
Support: YES
Comment: Supports transactions, row-level locking, and foreign keys
Transactions: YES
XA: YES
Savepoints: YES
*************************** 2. row ***************************
Engine: MyISAM
Support: DEFAULT
Comment: Default engine as of MySQL 3.23 with great performance
Transactions: NO
XA: NO
Savepoints: NO
*************************** 3. row ***************************
Engine: BLACKHOLE
Support: YES
Comment: /dev/null storage engine (anything you write to it disappears)
Transactions: NO
XA: NO
Savepoints: NO
*************************** 4. row ***************************
Engine: CSV
Support: YES
Comment: CSV storage engine
Transactions: NO
XA: NO
Savepoints: NO
*************************** 5. row ***************************
Engine: MEMORY
Support: YES
Comment: Hash based, stored in memory, useful for temporary tables
Transactions: NO
XA: NO
Savepoints: NO
*************************** 6. row ***************************
Engine: FEDERATED
Support: NO
Comment: Federated MySQL storage engine
Transactions: NULL
XA: NULL
Savepoints: NULL
*************************** 7. row ***************************
Engine: ARCHIVE
Support: YES
Comment: Archive storage engine
Transactions: NO
XA: NO
Savepoints: NO
*************************** 8. row ***************************
Engine: MRG_MYISAM
Support: YES
Comment: Collection of identical MyISAM tables
Transactions: NO
XA: NO
Savepoints: NO
8 rows in set (0.00 sec)
The result set includes all of the known storage engines;
whether they are installed and configured (where Support = YES); a
note about the engine’s features; and whether it supports
transactions, distributed transactions (XA), or savepoints.
Note:
A savepoint is a named event that you can use like a
transaction. You can establish a savepoint and either release
(delete the savepoint) or roll back the changes since the savepoint.
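A minimal sketch, assuming a hypothetical InnoDB table t1 with a single integer column:
START TRANSACTION;
INSERT INTO t1 VALUES (1);
SAVEPOINT sp1;                  -- establish the savepoint
INSERT INTO t1 VALUES (2);
ROLLBACK TO SAVEPOINT sp1;      -- roll back the changes made since sp1
-- (or RELEASE SAVEPOINT sp1; to delete the savepoint and keep the changes)
COMMIT;                         -- only the first insert is committed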
With so many storage engines to choose from, it can be confusing
when designing your database for performance. The following describes
each of the storage engines briefly, including some of the uses for
which they are best suited. You can choose the storage engine for a
table using the ENGINE parameter on
the CREATE statement,
and you can change the storage engine by issuing an ALTER TABLE
command:
CREATE TABLE t1 (a int) ENGINE=InnoDB;
ALTER TABLE t1 ENGINE=MEMORY;
The InnoDB storage engine is the premier transactional support storage
engine. You should always choose this storage engine when requiring
transactional support; it is currently the only transactional engine
in MySQL. There are third-party storage engines in various states of
production that can support transactions, but the only
“out-of-the-box” option is InnoDB. Interestingly, all indexes in
InnoDB are B-trees, in which the index records
are stored in the leaf pages of the tree. InnoDB is the storage engine
of choice for high reliability and transaction-processing
environments.
The MyISAM storage engine is the default engine; this engine
will be used if you omit the ENGINE
option on the CREATE statement.
MyISAM is often used for data warehousing, e-commerce, and enterprise
applications. MyISAM uses advanced caching and indexing mechanisms to
improve data retrieval and indexing. MyISAM is an excellent choice
when you need storage in a wide variety of applications requiring fast
retrieval of data without the need for transactions.
The Blackhole storage engine is very interesting. It doesn’t store anything at all.
In fact, it is what its name suggests—data goes in but never returns.
All jocularity aside, the Blackhole storage engine fills a very
special need. If binary logging is enabled, SQL statements are written
to the logs, and Blackhole is used as a relay agent (or proxy) in a
replication topology. In this case, the relay agent processes data
from the master and passes it on to its slaves but does not actually
store any data. The Blackhole storage engine can be handy in
situations where you want to test an application to ensure it is
writing data, but you don’t want to store anything on disk.
The CSV storage engine can create, read, and write comma-separated value (CSV)
files as tables. The CSV storage engine is best used to rapidly export
structured business data to spreadsheets. The CSV storage engine does
not provide any indexing mechanisms and has certain issues in storing
and converting date/time values (they do not obey locality during
queries). The CSV storage engine is best used when you want to permit
other applications to share or exchange data in a common format. Given
that it is not as efficient for storing data, you should use the CSV
storage engine sparingly.
Note:
The CSV storage engine is used for writing logfiles. For
example, the backup logs are CSV files and can be opened by other
applications that use the CSV protocol (but not while the server is
running).
The Memory storage engine (sometimes called HEAP) is an in-memory
storage engine that uses a hashing mechanism to retrieve frequently used
data. This allows for much faster retrieval. Data is accessed in the
same manner as with the other storage engines, but the data is stored
in memory and lasts only as long as the server is running; it is
flushed and deleted on shutdown. The Memory storage engine is typically
good for situations in which static data is accessed frequently and
rarely ever altered (e.g., lookup tables). Examples include zip code
listings, state and county names, category listings, and other data
that is accessed frequently and seldom updated. You can also use the
Memory storage engine for databases that utilize snapshot techniques
for distributed or historical data access.
The Federated storage engine creates a single table reference from
multiple database systems. The Federated storage engine allows you to
link tables together across database servers. This mechanism is
similar in purpose to the linked data tables available in other
database systems. The Federated storage engine is best suited for
distributed or data mart environments. The most interesting feature of
the Federated storage engine is that it does not move data, nor does
it require the remote tables to use the same storage engine.
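As a sketch, a federated table is defined locally with the same structure as the remote table and a CONNECTION string that points at it; the host, credentials, and table names below are placeholders.
CREATE TABLE remote_orders (
  id    INT NOT NULL,
  total DECIMAL(10,2),
  PRIMARY KEY (id)
) ENGINE=FEDERATED
  CONNECTION='mysql://app_user:app_pass@remote_host:3306/sales/orders';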
The Archive storage engine can store large amounts of data in a
compressed format. The Archive storage engine is best suited for
storing and retrieving large amounts of seldom-accessed archival or
historical data. Indexes are not supported and the only access method
is via a table scan. Thus, you should not use the Archive storage
engine for normal database storage and retrieval.
The Merge (MRG_MYISAM) storage engine encapsulates a set of MyISAM tables
that have the same structure (table layout or schema) so they can be
referenced as a single table. Thus, the tables are partitioned by the location of
the individual tables, but no additional partitioning mechanisms are
used. All tables must reside on the same server (but not necessarily
the same database).
Note:
When a DROP command is
issued on a merged table, only the Merge specification is removed.
The original tables are not altered.
The best attribute of the Merge storage engine is speed. It
permits you to split a large table into several smaller tables on
different disks, combine them using a merge table specification, and
access them simultaneously. Searches and sorts will execute more
quickly, since there is less data in each table to manipulate. Also,
repairs on tables are more efficient because it is faster and easier
to repair several smaller individual tables than a single large table.
Unfortunately, this configuration has several disadvantages:
You must use identical MyISAM tables to form a single merge
table.
The replace operation is not allowed.
Indexes are less efficient than for a single table.
The Merge storage engine is best suited for very large database
(VLDB) applications, like data warehousing, where data resides in more
than one table in one or more databases. You can also use it to help
solve partitioning problems where you want to partition horizontally
but do not want to add the complexity of the partition table
options.
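A minimal sketch of a merge table, assuming two identical (and hypothetical) MyISAM log tables:
CREATE TABLE log_2009 (msg VARCHAR(255)) ENGINE=MyISAM;
CREATE TABLE log_2010 (msg VARCHAR(255)) ENGINE=MyISAM;

CREATE TABLE log_all (msg VARCHAR(255))
  ENGINE=MRG_MYISAM
  UNION=(log_2009, log_2010)
  INSERT_METHOD=LAST;           -- inserts go into the last table in the UNION list

-- Queries against log_all see the rows of both underlying tables.
SELECT COUNT(*) FROM log_all;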
Clearly, with so many choices of storage engines, it is possible
to choose engines that can hamper performance or, in some cases,
prohibit certain solutions. For example, if you never specify a
storage engine when the table is created, MySQL uses the default
storage engine. If not set manually, the default storage engine
reverts to the platform-specific default, which may be MyISAM on some
platforms. This may mean you are missing out on optimizing lookup
tables or limiting features of your application by not having
transactional support. It is well worth the extra time to include an
analysis of storage engine choices when designing or tuning your
databases.
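As a quick check, you can see which engine will be used when the ENGINE option is omitted and, if desired, change it for the current session. The variable is named storage_engine on older servers and default_storage_engine on newer ones, so adjust the sketch below to your version.
SHOW VARIABLES LIKE '%storage_engine%';

-- Change the default for this session only (use default_storage_engine on newer servers).
SET SESSION storage_engine = InnoDB;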
2.4. Use views for faster results via the query cache
Views are a very handy way to encapsulate complex queries to make it
easier to work with the data. You can use views to limit data both
vertically (fewer columns) and horizontally (a WHERE clause on the underlying SELECT statement).
Both are very handy and, of course, more complex views combine both
practices to limit the result set returned to the user, to hide
certain base tables, or to ensure an efficient join is executed.
Using views to limit the columns returned can help you in ways
you may not have considered. It not only reduces the amount of data
processed, it can also help you avoid costly SELECT * operations that users tend to do
without much thought. When many of these types of operations are run,
your applications process far too much data, which can affect the
performance of not only the application but also the server, and, more
importantly, can decrease available bandwidth on your network. It is
always a good idea to use views to limit data in this manner and to hide
access to the base table(s), removing any temptation users may have to
access the base table directly.
Views that limit the number of rows returned also help reduce
network bandwidth and can improve the performance of your
applications. These types of views also protect against proliferation
of SELECT * queries. Using views in
this manner requires a bit more planning, because your goal is to
create meaningful subsets of the data. You will have to examine the
requirements for your database and understand the queries issued to
form the correct WHERE clauses for
these queries.
With a little effort, you may find you can create combinations
of vertically and horizontally restrictive views, thereby ensuring
your applications operate on only the data that is needed. The less
data moving around, the more data your applications can process in the
same amount of time.
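For example, a view that restricts a hypothetical customers table both vertically (three columns) and horizontally (active customers only) might be sketched as follows.
CREATE VIEW v_active_customers AS
  SELECT customer_id, name, email
  FROM customers
  WHERE status = 'active';

-- Applications query the view rather than the base table.
SELECT name, email FROM v_active_customers;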
Perhaps the best way to use views is to eliminate poorly formed
joins. This is especially true when you have a complex normalized
schema. It may not be obvious to users how to combine the tables to
form a meaningful result set. Indeed, most of the work done by DBAs
when striving for better performance is focused on correcting poorly
formed joins. Sometimes the gain is trivial (for example, fewer rows
processed during the join operation), but most of the time the improvement
in response time is significant.
Views can also be helpful when using the query cache in MySQL.
The query cache stores the results of frequently used (accessed)
queries. Using views that provide a standardized result set can
improve the likelihood that the results will be cached and, therefore,
retrieved more efficiently.
You can improve performance with a little design work and the
judicious use of views in your databases. Take the time to examine how
much data is being moved around (both the number of columns and rows)
and examine your application for any query that uses joins. Spend some
time forming views that limit the data and identify the most efficient
joins and wrap them in a view as well. Imagine how much easier you’ll
rest knowing your users are executing efficient joins.
2.5. Use constraints
The use of constraints provides another tool in your arsenal for
combating performance problems. Rather than proselytizing about
limitations on using constraints, we encourage you to consider
constraints a standard practice and not an afterthought.
There are several types of constraints available in MySQL,
including the following:
Unique indexes
Primary keys
Foreign keys
Enumerated values
Sets
Default values
NOT NULL option
We’ve already discussed the use (and overuse) of indexes. Indexes improve data
retrieval by allowing the system to store and find data more
quickly.
A unique index is simply an index on one
field in a table that, when used with the NOT NULL
constraint to require a value, guarantees there are no duplicates in
the table. That is, any given value can appear in only one row of
the indexed field. Use unique indexes for fields that
you want to prohibit duplicates of, such as sequence numbers, order
numbers, social security numbers, etc. A table can have one or more
unique indexes.
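A brief sketch, using a hypothetical orders table in which duplicate order numbers are prohibited:
CREATE TABLE orders (
  order_no    CHAR(10) NOT NULL,      -- NOT NULL plus the unique index prevents duplicates
  customer_id INT      NOT NULL,
  UNIQUE INDEX idx_order_no (order_no)
);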
A primary key is also considered a unique index, but in this case it
uniquely identifies each row in a table and prohibits duplication.
Primary keys are created as a result of normalization and, when
designed effectively, form the join columns for table joins.
One of the most common primary keys is an automatically
generated sequence number (called a surrogate key) that uniquely
identifies a row. MySQL provides an AUTO_INCREMENT
option to tell the system to generate these special unique values. The
use of surrogate key values is considered a compromise by some
database theorists and some discourage its use, because a primary key
should be made from the existing fields and not artificially
generated. While we won’t go so far as to say you should never use
surrogate keys, we will say that you should use them sparingly. If you
find yourself using AUTO_INCREMENT
on virtually every table, you are probably overusing this
feature.
Foreign keys are also created as a result of the normalization process. They allow the formation of a
parent/child or master/detail relationship where a row in one table is
the master and one or more rows in another table contain the details
of the master. A foreign key is a field identified in the detail table
that refers back to the master. Foreign keys also permit cascading
operations where deletion of the master row also deletes all of the
detail rows.
Note:
Currently, only InnoDB supports foreign keys.
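A minimal sketch of a master/detail relationship with a cascading delete, using hypothetical invoices and invoice_lines tables (both InnoDB, since only InnoDB enforces foreign keys):
CREATE TABLE invoices (
  invoice_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE invoice_lines (
  line_id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  invoice_id INT NOT NULL,
  FOREIGN KEY (invoice_id) REFERENCES invoices (invoice_id)
    ON DELETE CASCADE               -- deleting a master row deletes its detail rows
) ENGINE=InnoDB;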
We’ve discussed using enumerated values to
replace small lookup tables. However, enumerated values can be a
performance tool. This is because the text for the enumerated values
is stored only once—in the table header structures. What is saved in
the rows is a numeric reference value that forms an index (array
index) of the enumerated value. Thus, enumerated value lists can save
space and can make traversing the data a bit more efficient. An
enumerated field type allows one and only one value.
The use of sets in MySQL is similar to
using enumerated values. However, a set field type allows storage of
one or more values in the set. You can use sets to store information
that represents attributes of the data rather than using a
master/detail relationship. This not only saves space in the table
(set values are bitwise combinations), but also eliminates the need to
access another table for the values.
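A short sketch, using a hypothetical articles table whose tags column can hold more than one value:
CREATE TABLE articles (
  id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  tags SET('news', 'sports', 'weather', 'opinion')
);

INSERT INTO articles (tags) VALUES ('news,opinion');   -- two values in one field

-- FIND_IN_SET() locates rows that contain a given member.
SELECT id FROM articles WHERE FIND_IN_SET('news', tags);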
The use of the DEFAULT option
to supply default values is an excellent way to prevent problems
associated with poorly constructed data. For example, if you have a
numeric field that represents values used for calculations, you may
want to ensure that when the field is unknown, a default value is
given instead. You can set defaults on most data types. You can also
use defaults for date and time fields to avoid problems processing
invalid date and time values. More importantly, default values can
save your application from having to supply the values (or using the
less reliable method of asking the user to provide them), thereby
reducing the amount of data sent to the server during data
entry.
You should also consider using the NOT
NULL option when specifying fields that must have a value.
If an entry is attempted where there are NOT
NULL columns and no data values are provided, the INSERT statement will fail. This prevents
data integrity issues by ensuring all important fields have
values.
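The following sketch combines both options in a hypothetical line_items table; only item_name must be supplied, and the remaining columns fall back to safe defaults.
CREATE TABLE line_items (
  id        INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
  item_name VARCHAR(100) NOT NULL,                 -- a value must be provided
  quantity  INT          NOT NULL DEFAULT 1,       -- defaults to 1 if omitted
  tax_rate  DECIMAL(4,2) NOT NULL DEFAULT 0.00     -- safe value for calculations
);

INSERT INTO line_items (item_name) VALUES ('widget');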
2.6. Use EXPLAIN, ANALYZE, and OPTIMIZE
We have already discussed the benefits of these commands. We list them
here as a best practice to remind you that these tools are vital for
diagnostic and tuning efforts. Use them often and with impunity, but
follow their use carefully. Specifically, use ANALYZE and OPTIMIZE when it makes sense and not as a
regular, scheduled event. We have encountered administrators who run
these commands nightly, and in some cases that may be warranted, but
in the general case it is not warranted and can lead to unnecessary
table copies (as we saw in the earlier examples). Clearly, forcing
the system to copy data regularly can be a waste of time and could
lead to limited access during the operation.
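When they are warranted (for example, after a large bulk load or many deletions), the commands themselves are simple; the table name below is a placeholder.
ANALYZE TABLE t1;    -- refreshes the key distribution statistics used by the optimizer
OPTIMIZE TABLE t1;   -- reclaims unused space and defragments the table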
Now that we’ve discussed how to monitor and improve database
performance, let us look more closely at one of the most successful
and most popular features—replication. In the next section, we will
discuss how you can monitor and improve replication in MySQL. We have
placed this topic last because, as you shall see, you must have a
well-performing server with well-performing databases and queries
before you can begin to improve replication performance.