The best way to understand the role XML plays is to consider the evolution of a simple file format without
XML. For example, consider a simple program that stores product items
as a list in a file. Say when you first create this program, you decide
it will store three pieces of product information (ID, name, and price),
and you'll use a simple text file format for easy debugging and
testing. The file format you use looks like this:
1
Chair
49.33
2
Car
43399.55
3
Fresh Fruit Basket
49.99
This is the sort of format you might create by using
.NET classes such as StreamWriter. It's easy to work with: you just
write all the information, in order, from top to bottom. Of course, it's
a fairly fragile format. If you decide to store an extra piece of
information in the file (such as a flag that indicates whether an item
is available), your old code won't work. Instead, you might need to
resort to adding a header that indicates the version of the file:
SuperProProductList
Version 2.0
1
Chair
49.33
True
2
Car
43399.55
True
3
Fresh Fruit Basket
49.99
False
Now, you could check the file version when you open
it and run the file reading code that matches that version. Unfortunately, as
you add more and more possible versions, the file reading code will
become incredibly tangled, and you may accidentally break compatibility
with one of the earlier file formats without realizing it. A better
approach would be to create a file format that indicates where every
product record starts and stops. Your code would then just set some
appropriate defaults if it finds missing information in an older file
format.
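To make the versioning problem concrete, here is a minimal sketch of a reader that checks for the header and switches its parsing logic. It uses Python rather than the .NET classes mentioned above, and the function name read_products and the two sample strings are assumptions made for illustration only.

```python
def read_products(lines):
    """Read the unversioned (3-field) or Version 2.0 (4-field) text format."""
    lines = [line.strip() for line in lines]
    if lines[:1] == ["SuperProProductList"]:
        # A header is present, so the second line names the version.
        if lines[1] != "Version 2.0":
            raise ValueError(f"unsupported file version: {lines[1]}")
        body, fields = lines[2:], 4
    else:
        # No header: assume the original three-field format.
        body, fields = lines, 3

    products = []
    for i in range(0, len(body), fields):
        record = body[i:i + fields]
        products.append({
            "id": int(record[0]),
            "name": record[1],
            "price": float(record[2]),
            # Old files carry no availability flag; True is an assumed default.
            "available": (record[3] == "True") if fields == 4 else True,
        })
    return products


V1_SAMPLE = "1\nChair\n49.33\n2\nCar\n43399.55\n"
V2_SAMPLE = "SuperProProductList\nVersion 2.0\n3\nFresh Fruit Basket\n49.99\nFalse\n"

print(read_products(V1_SAMPLE.splitlines()))
print(read_products(V2_SAMPLE.splitlines()))
```

Every new version adds another branch to the header check and another special case to the record loop, which is how the reading code ends up tangled.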
Here's a relatively crude solution that improves the
SuperProProductList by adding a special sequence of characters
(##Start##) to show where each new record begins:
SuperProProductList
Version 3.0
##Start##
1
Chair
49.33
True
##Start##
2
Car
43399.55
True
##Start##
3
Fresh Fruit Basket
49.99
False
All in all, this isn't a bad effort. Unfortunately,
you may as well use a binary file format at this point: the text file
is becoming hard to read, and it's even harder to guess what piece of
information each value represents. On the code side, you'll also need
some basic error checking abilities of your own. For example, you should
make your code able to skip over accidentally entered blank lines,
detect a missing ##Start## tag, and so on, just to provide a basic level
of protection.
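The error checking described here might look something like the following sketch. It is written in Python for brevity, and the names read_marked_products, FIELDS, and DEFAULTS are hypothetical, as is the choice of "True" as the default availability for records that lack the flag.

```python
START = "##Start##"
FIELDS = ("id", "name", "price", "available")
DEFAULTS = {"available": "True"}  # assumed default for shorter, older records


def read_marked_products(lines):
    # Skip accidentally entered blank lines.
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) < 2 or lines[0] != "SuperProProductList":
        raise ValueError("missing SuperProProductList header")
    body = lines[2:]  # everything after the name and version lines
    if body and body[0] != START:
        raise ValueError("missing ##Start## marker before the first record")

    # Group the remaining lines into records, one per start marker.
    records = []
    for line in body:
        if line == START:
            records.append([])
        else:
            records[-1].append(line)

    products = []
    for values in records:
        if len(values) > len(FIELDS):
            # Too many values usually means a marker was left out.
            raise ValueError("record too long; a ##Start## marker may be missing")
        product = dict(DEFAULTS)
        product.update(zip(FIELDS, values))  # values are kept as strings
        products.append(product)
    return products


SAMPLE = """SuperProProductList
Version 3.0
##Start##
1
Chair
49.33

##Start##
3
Fresh Fruit Basket
49.99
False
"""

print(read_marked_products(SAMPLE.splitlines()))
```

Even this small reader needs its own rules for blank lines, missing markers, and defaults, and none of those rules are visible to anyone who only has the file.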
The central problem with this homegrown solution is
that you're reinventing the wheel. While you're trying to write basic
file access code and create a reasonably flexible file format for a
simple task, other programmers around the world are creating their own
private, ad hoc solutions. Even if your program works fine and you can
understand it, other programmers will almost certainly find your format hard to work with.
1. Improving the List with XML
This is where XML comes into the picture. XML is an all-purpose way to identify any type of data using elements.
These elements use the same sort of format found in an HTML file, but
while HTML elements indicate formatting, XML elements indicate content.
(Because an XML file is just about data, there is no standardized way to
display it in a browser, although Internet Explorer shows a collapsible
view that lets you show and hide different portions of the document.)
The SuperProProductList could use the following, clearer XML syntax:
<?xml version="1.0"?>
<SuperProProductList>
<Product>
<ID>1</ID>
<Name>Chair</Name>
<Price>49.33</Price>
<Available>True</Available>
<Status>3</Status>
</Product>
<Product>
<ID>2</ID>
<Name>Car</Name>
<Price>43399.55</Price>
<Available>True</Available>
<Status>3</Status>
</Product>
<Product>
<ID>3</ID>
<Name>Fresh Fruit Basket</Name>
<Price>49.99</Price>
<Available>False</Available>
<Status>4</Status>
</Product>
</SuperProProductList>
This format is clearly understandable. Every product
item is enclosed in a <Product> element, and every piece of
information has its own element with an appropriate name. Elements are
nested several layers deep to show relationships. Essentially, XML
provides the basic element syntax, and you (the programmer) define the
elements you want to use. That's why XML is often described as a metalanguage—it's
a language you use to create your own language. In the
SuperProProductList example, this custom XML language defines elements
such as <Product>, <ID>, <Name>, and so on.
Best of all, when you read this XML document in most
programming languages (including those in the .NET Framework), you can
use XML parsers to make your life easier. In other words, you don't need
to worry about detecting where an element starts and stops, collapsing
whitespace, and so on (although you do need to worry about
capitalization, because XML is case sensitive). Instead, you can just
read the file into some helpful XML data objects that make navigating
the entire document much easier. Similarly, you can now extend the
SuperProProductList with more information using additional elements, and
code that looks up elements by name will keep working, because it simply
skips over the elements it doesn't recognize.
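As a rough illustration of what a parser buys you, the following sketch reads a shortened product list with Python's standard xml.etree.ElementTree module (a stand-in for the .NET XML classes, which work along similar lines). The function name load_products is hypothetical, and the second product deliberately omits the Available and Status elements to mimic an older file.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<SuperProProductList>
    <Product>
        <ID>1</ID>
        <Name>Chair</Name>
        <Price>49.33</Price>
        <Available>False</Available>
        <Status>3</Status>
    </Product>
    <Product>
        <ID>2</ID>
        <Name>Car</Name>
        <Price>43399.55</Price>
    </Product>
</SuperProProductList>
"""


def load_products(xml_text):
    root = ET.fromstring(xml_text)
    products = []
    for node in root.findall("Product"):
        products.append({
            "id": int(node.findtext("ID")),
            "name": node.findtext("Name"),
            "price": float(node.findtext("Price")),
            # Assumed default: a product with no Available element is available.
            "available": node.findtext("Available", default="True") == "True",
        })
    return products


for product in load_products(DOCUMENT):
    print(product)
```

The code never mentions Status, so adding that element to the file does not disturb it, and the missing Available element is covered by a default instead of a separate version of the reader.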
2. XML Basics
Part of XML's popularity is a result of its
simplicity. When creating your own XML document, you need to remember
only a few rules:
XML elements are composed of a start tag
(like <Name>) and an end tag (like </Name>). Content is
placed between the start and end tags. If you include a start tag, you must
also include a corresponding end tag. The only other option is to
combine the two by creating an empty element, which includes a forward
slash at the end and has no content (like <Name />). This is
similar to the syntax for ASP.NET controls.
Whitespace between elements is ignored. That means you can freely use tabs and hard returns to properly align your information.
You can use only valid characters in the content for an element. You can't
enter special characters, such as the angle brackets (< >) and the
ampersand (&), as content. Instead, you'll have to use the entity
equivalents (such as &lt; and &gt; for angle brackets, and
&amp; for the ampersand). These equivalents will be automatically
converted to the original characters when you read them into your
program with the appropriate .NET classes.
XML elements are case sensitive, so <ID> and <id> are completely different elements.
All
elements must be nested in a root element. In the SuperProProductList
example, the root element is <SuperProProductList>. As soon as the
root element is closed, the document is finished, and you cannot add
anything else after it. In other words, if you omit the
<SuperProProductList> element and start with a <Product>
element, you'll be able to enter information for only one product; this
is because as soon as you add the closing </Product>, the document
is complete. (HTML has a similar rule and requires that all page
content be nested in a root <html> element, but most browsers let
you get away without following this rule.)
Every
element must be fully enclosed. In other words, when you open a
subelement, you need to close it before you can close the parent.
<Product><ID></ID></Product> is valid, but
<Product><ID></Product></ID> isn't. As a general
rule, indent when you open a new element, because this will allow you
to see the document's structure and notice if you accidentally close the
wrong element first.
XML documents should
start with an XML declaration like <?xml version="1.0"?>. This
signals that the document contains XML and indicates any special text
encoding. Strictly speaking, XML 1.0 makes the declaration optional, which
is why XML parsers work fine even if this detail is omitted.
As long as you meet these requirements, your XML
document can be parsed and displayed as a basic tree. This means your
document is well formed, but it doesn't mean it is valid. For example,
you may still have your elements in the wrong order (for example,
<ID><Product></Product></ID>), or you may have
the wrong type of data in a given field (for example,
<ID>Chair</ID><Name>2</Name>). You can impose these
additional rules on your XML documents by validating them against an XML schema.
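You can try these rules out with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module as a stand-in for the .NET classes; the helper name is_well_formed is hypothetical. It checks only whether a document is well formed, not whether it is valid.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested elements and empty elements are accepted.
print(is_well_formed("<Product><ID></ID></Product>"))    # True
print(is_well_formed("<Name />"))                         # True

# Closing the parent before the subelement is rejected.
print(is_well_formed("<Product><ID></Product></ID>"))    # False

# Element names are case sensitive, so </id> does not close <ID>.
print(is_well_formed("<ID>1</id>"))                       # False

# Without a root element, the document ends at the first </Product>.
print(is_well_formed("<Product>1</Product><Product>2</Product>"))  # False

# A bare ampersand is not allowed in content.
print(is_well_formed("<Name>Fish & Chips</Name>"))        # False

# Entity equivalents are converted back when the document is read.
name = ET.fromstring("<Name>Fish &amp; Chips &lt;large&gt;</Name>")
print(name.text)                                          # Fish & Chips <large>

# Well formed, but the elements are in the wrong order for a product list.
print(is_well_formed("<ID><Product></Product></ID>"))    # True
```

The last call is the important one: the parser is satisfied, yet the document makes no sense as a product list, which is the gap between well formed and valid.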
Elements are the primary units for organizing
information in XML (as demonstrated with the SuperProProductList
example), but they aren't the only option. You can also use attributes.
3. Attributes
Attributes add extra information to an element.
Instead of putting information into a subelement, you can use an
attribute. In the XML community, deciding whether to use subelements or
attributes—and what information should go into an attribute—is a matter
of great debate, with no clear consensus.
Here's the SuperProProductList example with ID and Name attributes instead of ID and Name subelements:
<?xml version="1.0"?>
<SuperProProductList>
<Product ID="1" Name="Chair">
<Price>49.33</Price>
<Available>True</Available>
<Status>3</Status>
</Product>
<Product ID="2" Name="Car">
<Price>43399.55</Price>
<Available>True</Available>
<Status>3</Status>
</Product>
<Product ID="3" Name="Fresh Fruit Basket">
<Price>49.99</Price>
<Available>False</Available>
<Status>4</Status>
</Product>
</SuperProProductList>
Of course, you've already seen this sort of syntax with HTML elements and ASP.NET server controls:
<asp:DropDownList id="lstBackColor" AutoPostBack="True"
Width="194px" Height="22px" runat="server" />
Attributes are also common in the configuration file:
<sessionState mode="InProc" cookieless="false" timeout="20" />
Using attributes in XML is more stringent than in
HTML. In XML, attributes must always have values, and these values must
use quotation marks. For example, <Product Name="Chair" /> is
acceptable, but <Product Name=Chair /> or <Product Name />
isn't. However, you do have one bit of flexibility—you can use single or
double quotes around any attribute value. It's convenient to use single
quotes if you know the text value inside will contain a double quote
(as in <Product Name='Red "Sizzle" Chair' />). If your text value
has both single and double quotes, use double quotes around the value
and replace the double quotes inside the value with the &quot;
entity equivalent.
Order is not important when dealing with attributes.
XML parsers treat attributes as a collection of unordered information
relating to an element. On the other hand, the order of elements often is
important. Thus, if you need a way of arranging information and
preserving its order, or if you have a list of items with the same name,
then use elements, not attributes.
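The attribute rules are easy to confirm with a parser. This is a small sketch using Python's standard xml.etree.ElementTree module in place of the .NET classes; the variable names and the "Bob's" product name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Attribute order and quote style make no difference to the parsed result.
first = ET.fromstring('<Product ID="1" Name="Chair" />')
second = ET.fromstring("<Product Name='Chair' ID='1' />")
print(first.attrib == second.attrib)    # True

# Single quotes around a value that contains double quotes.
single_quoted = ET.fromstring("""<Product Name='Red "Sizzle" Chair' />""")
print(single_quoted.get("Name"))        # Red "Sizzle" Chair

# Both kinds of quote: double quotes outside, &quot; inside.
both = ET.fromstring("""<Product Name="Bob's &quot;Sizzle&quot; Chair" />""")
print(both.get("Name"))                 # Bob's "Sizzle" Chair

# Unquoted values and attributes without values are rejected.
rejected = []
for bad in ("<Product Name=Chair />", "<Product Name />"):
    try:
        ET.fromstring(bad)
    except ET.ParseError:
        rejected.append(bad)
print(len(rejected))                    # 2
```

Because the parser hands back attributes as a name-to-value mapping, there is nowhere to record their order or to hold two attributes with the same name, which is the practical reason to use elements for ordered or repeated data.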
4. Comments
You can also add comments to an XML document.
Comments go just about anywhere and are ignored for data processing
purposes. Comments are bracketed by the <!-- and --> character
sequences. The following listing includes three valid comments:
<?xml version="1.0"?>
<SuperProProductList>
<!-- This is a test file. -->
<Product ID="1" Name="Chair">
<Price>49.33<!-- Why so expensive? --></Price>
<Available>True</Available>
<Status>3</Status>
</Product>
<!-- Other products omitted for clarity. -->
</SuperProProductList>
The only place you can't put a comment is embedded within a start or end tag (as in <myData <!-- A comment should not go here --></myData>).
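To see that comments really are ignored for data processing, you can run the listing through a parser. The sketch below uses Python's standard xml.etree.ElementTree module, whose default parser discards comments; other parsers (including the .NET ones) may expose comments as separate nodes instead, so treat this as one possible behavior.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<?xml version="1.0"?>
<SuperProProductList>
    <!-- This is a test file. -->
    <Product ID="1" Name="Chair">
        <Price>49.33<!-- Why so expensive? --></Price>
        <Available>True</Available>
        <Status>3</Status>
    </Product>
    <!-- Other products omitted for clarity. -->
</SuperProProductList>
"""

root = ET.fromstring(DOCUMENT)

# Only the Product element is a child of the root; the comments are gone.
print([child.tag for child in root])

# The comment inside <Price> does not become part of the price text.
price_text = root.find("Product/Price").text
print(price_text)

# A comment embedded in a start tag makes the document malformed.
try:
    ET.fromstring("<myData <!-- A comment should not go here --></myData>")
    comment_in_tag_accepted = True
except ET.ParseError:
    comment_in_tag_accepted = False
print(comment_in_tag_accepted)
```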