PHP Quick Script: Remove all tags and the content in between them
Sometimes in PHP you’ll want or need to remove all the tags from a string, like the content coming in from a form field. PHP has its own function for this, strip_tags(), but what about the content in between the tags? This small function removes all tags and the content in between.
Now I may hear you ask: why would I want to remove the tags, or even the content in between them? The answer is simple: to stop the bad guys. Take the body field of a comment form on your custom blog. A user might post a comment with a link in the body to his or her own website (often a malicious one), or post a piece of malicious JavaScript to steal your users’ sessions.
PHP has its own built-in function, strip_tags(), that strips tags like <a> or <script>, but it leaves the content between them in place. Then there are tags nested inside each other, like <scr<script>ipt>. A filter that removes only the inner tag leaves the outer fragments to join into an intact <script> tag, and the script still executes.
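Here is a quick sketch of what strip_tags() does with the content between tags. The sample comment string is made up for illustration; strip_tags() itself is the standard PHP function.

[sourcecode language="PHP"]
<?php
// A made-up comment body containing a script and a link
$comment = 'Nice post! <script>alert("stolen");</script><a href="http://example.com">click me</a>';
// strip_tags() removes the tags themselves but keeps the text between them
$result = strip_tags($comment);
echo $result;
// Prints: Nice post! alert("stolen");click me
[/sourcecode]

The script can no longer run, but its source and the link text are still sitting in the comment, which is what the function in this post sets out to remove.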
The following function strips all tags and all content between them. You could easily modify it to handle only one tag, only the <script> and <object> tags, or any other tags you want. Let’s see the script first:
[sourcecode language="PHP"]
<?php
// Remove tags and everything in between them from text.
// Also handles the nested-tag trick, e.g. <scr<script>ipt>.
function strip_tag_content($string) {
    // Match <n>n</n>: open and close tags and everything in between
    $regex = '/<[^>]*>[^<]*<\/[^>]*>/';
    $string = preg_replace($regex, '', $string);
    // Match <n />: XHTML self-closing tags like img and br
    $regex = '/<[^>]*\/>/';
    $string = preg_replace($regex, '', $string);
    // Match <n>: single tag only, like strip_tags()
    // Used for multiple nested tags
    $regex = '/<[^>]*[^<]>/';
    $string = preg_replace($regex, '', $string);
    // Match n>: text followed by the close of an opening tag
    // Cleans up the mess left over from the previous replaces
    $regex = '/[a-zA-Z]+>/';
    $string = preg_replace($regex, '', $string);
    // Clean up: replace double spaces with a single space
    $string = str_replace("  ", " ", $string);
    // Clean up: remove any remaining less-than signs
    $string = str_replace("<", "", $string);
    // Clean up: remove any remaining greater-than signs
    $string = str_replace(">", "", $string);
    // Return the stripped string
    return $string;
}
?>
[/sourcecode]
For those of you not familiar with functions, you need to include this piece of code in the page where you want it to execute. You could do this by placing the code in a separate file and then pulling it in with include() on the page where you want to run the script.
You would then call the function like this:
[sourcecode language="PHP"]
$string_to_strip = "<scri<script>pt>Some stuff here</scri</script>pt>Lorem<a href=\"malicious_site.com\">Innocent site</a> Ipsum <img src=\"offensive_image.jpg\" />dolor.";
// Because the function returns the string that has been stripped you can
// assign the string to a variable while calling the function to use later
// on in your script.
$stripped = strip_tag_content($string_to_strip);
// Echo the stripped version of the string
echo $stripped;
[/sourcecode]
As you can see, the script now returns only “Lorem Ipsum dolor.”, with the tags and their content completely stripped away. No need to go into a loop and waste more time and server resources.
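If you’re curious how the nested tags get taken apart, it can help to watch the string after each pattern runs. The sketch below applies the same four regular expressions in order and prints the intermediate results. The shortened sample string and the $steps and $trace names are my own, purely for illustration.

[sourcecode language="PHP"]
<?php
// The four patterns from strip_tag_content(), in the order they are applied
$steps = array(
    'open/close pair'  => '/<[^>]*>[^<]*<\/[^>]*>/',
    'self-closing tag' => '/<[^>]*\/>/',
    'single tag'       => '/<[^>]*[^<]>/',
    'leftover text'    => '/[a-zA-Z]+>/',
);
// Illustrative input: a nested script tag plus a self-closing image tag
$string = '<scri<script>pt>evil()</scri</script>pt>Lorem <img src="x.jpg" />ipsum';
$trace = array();
foreach ($steps as $label => $regex) {
    $string = preg_replace($regex, '', $string);
    $trace[$label] = $string;
    echo $label . ': ' . $string . "\n";
}
// Prints:
// open/close pair: pt>Lorem <img src="x.jpg" />ipsum
// self-closing tag: pt>Lorem ipsum
// single tag: pt>Lorem ipsum
// leftover text: Lorem ipsum
[/sourcecode]

The first pattern swallows the whole nested script in one go, and the stray “pt>” it leaves behind is exactly what the fourth pattern is there to clean up.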
You can modify this little script to suit your needs, for example to strip only some tags like <a> and <script>. If you only want to strip the tags and not the content in between, you’re better off going with the strip_tags() function.
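As a rough idea of what such a modification could look like, here is a minimal sketch that removes only the tags you name, together with their content. The function name strip_selected_tag_content() and its $tags parameter are hypothetical, not part of the script above. It also makes a single pass per tag and does not attempt the nested-tag cleanup that the full function performs.

[sourcecode language="PHP"]
<?php
// Hypothetical helper (not part of the original script): removes only the
// listed tags together with the content between them.
function strip_selected_tag_content($string, array $tags) {
    foreach ($tags as $tag) {
        // Escape the tag name so it is safe to use inside the pattern
        $name = preg_quote($tag, '/');
        // Match <tag ...>...</tag>, case-insensitively and across line breaks
        $regex = '/<' . $name . '\b[^>]*>.*?<\/' . $name . '\s*>/is';
        $string = preg_replace($regex, '', $string);
    }
    return $string;
}

$input = 'Lorem<a href="http://example.com">a link</a> ipsum <b>bold</b><script>alert(1);</script> dolor.';
$clean = strip_selected_tag_content($input, array('a', 'script'));
echo $clean;
// Prints: Lorem ipsum <b>bold</b> dolor.
[/sourcecode]

The <b> tag survives because it was not in the list, while the link and the script are gone along with their content.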
