Friday, November 23, 2007

Stripping HTML Markup and Extra Whitespace from Strings

I needed a simple way to strip HTML markup from a given string. Since this is a ColdFusion app, I ended up finding a useful entry on Ray Camden's blog: Quick example of cleaning up Verity results.

This is the gist of it:
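<!--- Remove every complete tag --->
<cfset var cleaned = rereplace(arguments.input, "<.*?>", "", "all")>
<!--- Remove a tag that was cut off at the end of the string (opened but never closed) --->
<cfset cleaned = rereplace(cleaned, "<.*?$", "", "all")>
<!--- Remove a tag that was cut off at the start of the string (closed but never opened) --->
<cfset cleaned = rereplace(cleaned, "^.*?>", "", "all")>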
In addition, to get rid of extra whitespace, you can do this:

<cfset var cleaned = rereplace(arguments.input, "\s{2,}", " ", "all")>
The above regex collapses any run of two or more whitespace characters into a single space (we got plenty of those once the markup was stripped from the original string).
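
For reference, here is a rough sketch of how the two steps might be combined into a single function. The function name stripMarkup and the final trim() are my own assumptions for illustration and are not from Ray's entry; the whitespace pass runs on the already-stripped string rather than on the raw input.

```cfml
<!--- Hypothetical helper: the name and the trim() call are illustrative assumptions --->
<cffunction name="stripMarkup" returntype="string" output="false">
    <cfargument name="input" type="string" required="true">

    <!--- Remove complete tags, then any partial tag left at either end of the string --->
    <cfset var cleaned = rereplace(arguments.input, "<.*?>", "", "all")>
    <cfset cleaned = rereplace(cleaned, "<.*?$", "", "all")>
    <cfset cleaned = rereplace(cleaned, "^.*?>", "", "all")>

    <!--- Collapse runs of two or more whitespace characters into one space --->
    <cfset cleaned = rereplace(cleaned, "\s{2,}", " ", "all")>

    <!--- Drop any leading or trailing whitespace that remains --->
    <cfreturn trim(cleaned)>
</cffunction>

<!--- Example call with a made-up input string --->
<cfoutput>#stripMarkup("<p>Hello   <b>world</b></p>")#</cfoutput>
```

One thing to keep in mind: because the third pattern removes everything up to the first ">" that is left in the string, plain text containing a literal ">" (for example "a > b") would lose its leading portion. That is probably fine for search-result snippets, but worth checking against your own data.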

