The biggest problem with copy-and-paste as a
means of solving a set of similar problems is that, of course, it leads
to code duplication. In turn, this means that we need to maintain
multiples copies of essentially the same code, but with each copy subtly
modified to suit a particular need. The real danger arises when
requirements change, and we need to make sure that this is reflected,
not just in the original code, but in all the subsequent copies. We can
easily demonstrate this risk with an example. Listing 1 creates the Sales table and loads it with some test data.
Listing 2 shows the stored procedure, SelectTotalSalesPerStateForMonth, which returns the total sales per state for a given month.
At the time we developed this code, our understanding
of a report "for a given month" is one that covers the period of time
from the first calendar day of the month until the day we run the
report. For this purpose, our stored procedure serves the customers'
needs well, and we soon receive a request for a similar report,
returning the average sales per state, for a given month. Note that our new report is required to use the same definition of "for a given month."
It is very tempting to just copy the existing SelectTotalSalesPerStateForMonth procedure, and replace sum with avg to meet the new requirements, as shown in Listing 3.
In this way, we have completed the task in just a few seconds and, in the short term at least, it will do the job.
Suppose, however, that at some later time the users
request to change the definition of "for a given month" to "thirty
consecutive calendar days, ending on the day we run the report."
Unfortunately, the definition of "for a given month" is implemented
twice, both in SelectTotalSalesPerStateForMonth and in SelectAverageSalesPerStateForMonth.
Even if one and the same person developed them both, it is possible to
forget it by the time we need to implement the change. Even if it is
clearly documented that both procedures should use one and the same
definition, it is still possible that the developer implementing the
change has failed to modify both stored procedures in a consistent way.
Suppose, for example, that only the SelectAverageSalesPerStateForMonth stored procedure was modified to meet this new requirement. Listing 4 shows how it was changed.
When we make such changes, it is very easy to forget
that we have implemented the definition of "for a given month" in two
places. If we update the definition in one place and not the other, we
will get inconsistent results, as demonstrated by Listing 5.
Clearly the average sales size for the state of
California (4,630,000) is many times greater than the total sales for
the same state (10,000), which makes no sense at all. In this example, I
have deliberately used test data that makes the discrepancy obvious. In
general, however, such discrepancies may be more subtle and difficult
to spot, so they can lurk around for a long time.
As this example clearly demonstrates, when we cut and
paste code, we expose our code to the possibility of bugs if our
requirements change, and we fail to change each of the multiple
implementations of one and the same logic in exactly the same way. In
short, copy-and-paste coding is a direct violation of the DRY (Don't
Repeat Yourself) principle, which is so fundamental in software
engineering.
The code to implement a given logic should
be implemented once, and once only, and reused by all applications that
need it. However, of course, due care must be taken when reusing SQL
code. Careless reuse of code can lead to maintenance and performance
issues, especially when this reuse takes the form of scalar UDFs. We
cannot reuse code without first verifying that it runs fast enough.