Truncation
Truncation is a method of protecting sensitive data
where a portion of its value is removed. The concept is very similar to
the repeating character masking covered earlier except that rather than
replacing values with a "mask", such as an "x" or "*", truncating
simply discards those values. For example, a Social Security Number,
"555-86-1234", that is stored in plain text might be truncated to the
value of "1234".
One way to apply this method is to permanently modify the stored value in the database by executing an UPDATE statement using the LEFT, RIGHT or SUBSTRING system function to define the remaining portion of the value.
For example, the script in Listing 9 uses the LEFT function to truncate all but the last four digits from the Identification_Value column.
Alternatively, in order to maintain the original
value but perform the truncation for viewing, we can simply reference
the column in views and stored procedures that use the LEFT, RIGHT or SUBSTRING system functions to define the remaining portion of the value. For example, Listing 10 returns only the last four digits of the values of the Identification_Value column.
Encoding
Encoding is a technique in which a series of
characters is used to represent another value. This technique can be
used to camouflage sensitive data, since the code used has no meaning
outside the system in which the code is defined.
NOTE
There are many benefits to encoding, beyond
securing sensitive data, such as overcoming language barriers when
working in an international environment and providing an expedient
means of entering data.
Encoding is a practice that is found in abundance in
the health care industry. The World Health Organization maintains the
International Classification of Diseases (ICD), which is an industry
standard that defines codes that represent diseases and health
problems. These codes are used in health records and death
certificates. For example, the ICD code for bacterial pneumonia is
J15.9.
In the establishment of foreign keys in the HomeLending database we have, at a basic level, implemented encoding. The Loan table, for example, contains two columns that are named Purpose_Type_ID and Mortgage_Type_ID as illustrated in Figure 1.
The Purpose_Type_ID and Mortgage_Type_ID columns are foreign keys to the Purpose_Type table and Mortgage_Type
table. These tables contain, respectively, the list of potential
purposes for a loan and the types of available mortgage, as defined in
the Uniform Residential Loan Application, developed by the Federal
National Mortgage Association, commonly known as Fannie Mae.
In the Purpose_Type table, we have used a
sequence of numbers to indicate these purposes. So, for example, when a
new loan record is created, the value of "2" is captured instead of the
value "Refinance".
To further enhance this encoding, we may choose to
either utilize a higher starting number in our sequence, such as
"5,000", so that the options can be organized into logical groups. For
example, we may have various types of refinance options for our
borrowers. Through a higher starting number we could use the value
range of 5,000 through 5,100 to represent the available refinance
options, while construction loans might be found in the 2,000 through
2,100 range.
Aggregation
Aggregation is a technique in which identifying
details of data are obfuscated through its provision in a summarized
format. A few examples of presenting data as an aggregation are as
follows:
As an average: 40% of the loans originated in the HomeLending database during the past quarter were refinance loans.
As a calculated sum: $2.5 million in loans were originated in the HomeLending during the past quarter.
As a geographical statistic: The median home value in the city of Indianapolis, Indiana is $150,000.
Aggregating is a common technique used to populate
data warehouses for data analysis. This not only protects the
underlying sensitive data, but also reduces the storage requirements
for the data.
An advantage of this approach is that the data that
is provided to the user is only that which they need for their
reporting and analysis requirements, so the potential for the leakage
of sensitive data is greatly reduced.
A disadvantage to this approach is that if the
aggregations are determined to be inaccurate, the detail data is not
available to identify the cause. Another challenge to this approach is
that a given aggregation may not meet everyone's needs, resulting in
requests for different views of the same aggregated data, which
increases your maintenance footprint.
Within the HomeLending database, aggregation may be beneficial in the collection of the borrower's liabilities. The current design of the Borrower_Liability table requires the capture of the monthly payment amount and remaining balance. As shown in Figure 2, the Borrower_Liability table is related to the Liability_Account table, which reveals the creditor and account number of the liability.
An alternate approach would be to dispose of the Liability_Account
table and simply capture a single record for the loan application,
indicating the sum of their monthly payments and remaining balances for
all liabilities, as shown in Figure 3.
This level of detail would suffice for most
users of this database and would protect this sensitive information
from being inappropriately disclosed. The Underwriters, who may need
access to the detailed liability data for qualification purposes, would
refer to the credit report data, which is stored in a separate
database, to determine whether or not the borrower can be approved for
the loan.