Strip HTML of Tags

9/26/2010 7:45:19 PM
When a web data source offers no API for structured access, a common fallback is to scrape the web pages themselves. Scraping is imprecise and breaks easily when the page markup changes, but it is sometimes still useful. One step in the process is to remove the HTML markup, leaving just the text on the page; this is also handy in a simple web-indexing application. You can use regular expressions to accomplish this:
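// Requires at the top of the file: using System.Text.RegularExpressions;
private string StripHtml(string source)
{
    // Order matters: script blocks (tags plus contents) must be removed
    // before the general pattern, which would strip only the tags and
    // leave the script body behind as text.
    string[] patterns = {
        @"<script\b.*?</script\s*>", // script blocks, including their contents
        @"<[^>]*>"                   // any remaining HTML tags
    };
    string stripped = source;
    foreach (string pattern in patterns)
    {
        // Singleline lets '.' match newlines; IgnoreCase also catches <SCRIPT>.
        stripped = Regex.Replace(stripped, pattern, string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    return stripped;
}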
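Stripping tags alone usually leaves entities such as &amp;amp; and stray whitespace in the result. The following is a minimal sketch of one way to tidy that up, not a definitive implementation. The class name HtmlTextDemo and the helper ToPlainText are hypothetical names introduced here for illustration; the sketch assumes .NET 4 or later, where System.Net.WebUtility.HtmlDecode is available.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextDemo
{
    // Hypothetical helper (not from the original article): removes script
    // blocks, then remaining tags, then tidies the leftover text.
    public static string ToPlainText(string html)
    {
        const RegexOptions options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Script blocks go first so their contents do not survive as text.
        string text = Regex.Replace(html, @"<script\b.*?</script\s*>", string.Empty, options);
        // Replace remaining tags with a space so words in adjacent elements stay separated.
        text = Regex.Replace(text, @"<[^>]*>", " ", options);
        // Turn entities such as &amp; back into the characters they represent.
        text = WebUtility.HtmlDecode(text);
        // Collapse the runs of whitespace left behind by the removed markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string html = "<html><body><h1>Tom &amp; Jerry</h1>\n"
                    + "<script>var x = 1 < 2;</script>\n"
                    + "<p>Hello <b>world</b></p></body></html>";
        Console.WriteLine(ToPlainText(html)); // prints: Tom & Jerry Hello world
    }
}
```

Replacing each tag with a space rather than an empty string is a design choice: it keeps text from neighbouring elements from running together, at the cost of splitting a word that has inline markup in the middle of it. Like the method above, this remains an approximation; malformed markup or a ">" inside an attribute value can still confuse it.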