Tuesday, November 10 2009

Data driven factory: I give you data, you give me an object...

I've been working on a data warehouse project lately, in Python, to support the different kinds of data analysis I am developing as part of my current work. I decided to use SQLAlchemy as the ORM, so I can quickly move from my development version, which uses a SQLite database, to production, which uses MySQL or MSSQL databases.

SQLAlchemy is also one of those amazing ORMs that support sharding. Needless to say, that matters a lot when you develop a tool that will import, format, process and analyze gigabytes of data.

Also, since I work with a lot of data types that I need to register with my ORM and persist into a database, my software must be able to quickly generate an object representing each data type: a particular instance of the object. Developers usually create factories in order to create instances of objects. The main idea is to delegate the instantiation of the object to a third-party object. In most factories, we specify the type of object that we want to create: give me an instance of a pizza with mushrooms, tomatoes and ham.

Having to ask for a particular type (or sub-type) of object was the main limitation for my use. Most of my types are related in some way, but without strong inheritance (Dish > Pie > Pizza). Another important point is the maintainability of code that would list all the different types of objects my factory needs to create... Well, I wanted something more generic: a data driven factory.

The data driven factory is a factory that, based on the data sent to the factory object, will produce an instance. A simple example would be getting an instance of a Margherita pizza when giving certain ingredients (tomatoes, mozzarella and parmesan), or a Neapolitan if I add anchovies.

This type of factory, which depends only on the data given as parameters, is possible in Python thanks to the class inspection capabilities of the language. The implementation I propose requires two things: you register each class to be constructed with the factory (its constructor arguments and default arguments are analyzed for a matcher later on), and you then give the factory the "type" of each data field (basically, the argument names). The factory will then get the appropriate object for you.
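
To make the matching step more concrete, here is a minimal sketch of the idea. It is not the code from dd_factory.py: the class name SketchFactory, the matching rule (all required arguments supplied, no unknown field, fewest unused defaults wins) and the use of Python 3's inspect.signature are all assumptions made for illustration.

```python
import inspect


class SketchFactory(object):
    """Hypothetical stand-in for DDFactory: maps field names to a class."""

    def __init__(self):
        self.registrar = {}

    def register(self, cls):
        # Assumption: only classes that define their own constructor are kept.
        if '__init__' not in vars(cls):
            return
        required, optional = [], []
        params = list(inspect.signature(cls.__init__).parameters.values())
        for param in params[1:]:  # skip 'self'
            if param.kind not in (param.POSITIONAL_OR_KEYWORD,
                                  param.KEYWORD_ONLY):
                continue  # ignore *args and **kwargs
            if param.default is param.empty:
                required.append(param.name)
            else:
                optional.append(param.name)
        self.registrar[cls.__name__] = (
            cls, frozenset(required), frozenset(optional))

    def get(self, fields):
        """Return the best matching class (not an instance), or None."""
        fields = frozenset(fields)
        best, best_unused = None, None
        for cls, required, optional in self.registrar.values():
            # Every required argument is supplied and no field is unknown.
            if required <= fields <= (required | optional):
                unused = len(optional - fields)
                if best is None or unused < best_unused:
                    best, best_unused = cls, unused
        return best
```

Preferring the class with the fewest unused defaults is only one possible tie-break; it is chosen here so that a field list covering a whole constructor wins over a class that would leave a default untouched.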

Side note: the factory returns the class rather than an instance of the object for performance reasons. I get the class from the factory once, store it, and loop through the instantiation with millions of data records...

Example of use:

class Shape(object):
        pass

class Circle(Shape):
        def __init__(self, center, radius=RAD_MAX):
                ....

class DiskHole(Shape):
        def __init__(self, center, radius, small_radius=RAD_SMALL):
                ....

factory = DDFactory()
factory.register(Shape)
factory.register(Circle)
factory.register(DiskHole)

print factory.get(['center', 'radius'])                   #> return 'Circle' ctor
print factory.get(['center', 'radius', 'small_radius'])   #> return 'DiskHole' ctor

You can access this factory here: dd_factory.py

In the distributed code, I assume that each object to create has a __tablename__ class member that tells which database table is the eventual target (which is my case using SQLAlchemy / declarative objects). This is easy to change by replacing the factory's register method with something like this:

def register(self, cls):
        if hasattr(cls, '__init__'):
                s_cls = str(cls)
                args, defaults_dict = DDFactory.defaults_values(cls)
                if s_cls not in self.registrar:
                        self.registrar[s_cls] = {'class' : cls, 'args' : args, 'defaults' : defaults_dict}
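
The register method above relies on a defaults_values helper that is not shown in this post. As a rough idea of what such a helper could look like, here is a hypothetical version; the return shape (an ordered list of argument names plus a dict of default values) is inferred from how register uses it, and the use of Python 3's inspect.signature is an assumption.

```python
import inspect


def defaults_values(cls):
    """Hypothetical helper: return (args, defaults_dict) for cls.__init__."""
    args, defaults = [], {}
    signature = inspect.signature(cls.__init__)
    for name, param in signature.parameters.items():
        # Skip 'self' as well as *args / **kwargs catch-alls.
        if name == 'self' or param.kind in (param.VAR_POSITIONAL,
                                            param.VAR_KEYWORD):
            continue
        args.append(name)
        if param.default is not param.empty:
            defaults[name] = param.default
    return args, defaults
```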

Monday, July 23 2007

Python script utility called wwwCall and Grabber news

wwwCall: HTTP(S) utilities

wwwCall is a very small module for Python (tested under Python 2.5 but should be okay for Python >= 2.3) which handles HTTP(S) connections with some special features like proxy, cookies, and authentication (basic, digest). This morning, I was working on Grabber and I realized how ugly the code was, mostly because of how I handled the web connections, so I decided to create a simple module to do the job easily. The idea is to have a single object handling some basic functions of Python's urllib2.

If you have ever used Python for web calls, you'll see that using it is damn simple and, I think, pretty cool... Example:

# create the object
http = wwwCall('http://rgaucher.info')
# add the features you want (cookies,auth)
http.setCookieFile('./the_path/file.cookie')
# reach a login URL and save the cookie
http.post("http://rgaucher.info/login.php",{'username' : 'foo', 'password' : 'bar'})
# register the username/password for the basic authentication
http.setAuthBasic("romain","mypassword")
# print the content of the protected page
print http.get("http://rgaucher.info/401protected").read()
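
For readers curious how a single object can wrap these features, here is a minimal sketch built on urllib.request, the Python 3 successor of urllib2. It is not the wwwCall source: the MiniCall class, its method names and the in-memory cookie jar are assumptions made for illustration.

```python
import http.cookiejar
import urllib.parse
import urllib.request


class MiniCall(object):
    """Hypothetical wwwCall-like wrapper; not the real module."""

    def __init__(self, base_url):
        self.base_url = base_url
        # Cookies are kept in memory here; the real module uses a file.
        self.cookies = http.cookiejar.CookieJar()
        self.passwords = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        # One opener shared by every call: cookies plus basic/digest auth.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies),
            urllib.request.HTTPBasicAuthHandler(self.passwords),
            urllib.request.HTTPDigestAuthHandler(self.passwords),
        )

    def set_auth(self, user, password):
        # Credentials apply to every URL under base_url, whatever the realm.
        self.passwords.add_password(None, self.base_url, user, password)

    def get(self, url):
        return self.opener.open(url)

    def post(self, url, fields):
        # Passing data to open() turns the request into a POST.
        data = urllib.parse.urlencode(fields).encode('ascii')
        return self.opener.open(url, data)
```

Because the cookie processor and the auth handlers live in the same opener, a session cookie set by a login POST is sent again on later GET calls without any extra bookkeeping.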

Download: wwwCall.zip

The next Grabber

So, I've been working on Grabber for a couple of months without a release now. It's mainly because I don't have that much time to work on it, but also because I made lots of modifications. Today I added a couple of features:

  • Understanding some mod_rewrite rules for the spider
  • URL exclusion
  • Basic/Digest Authentication

This comes in addition to the previous features I added, mainly:

  • Multi Site
  • Multi threads
  • Cookie analyzer
  • XSS Locator in addition to the XSS Fuzzer, which is definitely faster
  • Spider module, only to crawl the site and export it in XML
  • Login ability, keeping session state

I cannot give a D-day for the release of version 0.2 because I really want a more stable product, and I will feed the tool some test suites I made at work to be sure it's reasonable (I will not give comparison results with commercial products :P). I also want to have a better spider...

Saturday, February 3 2007

pyIndex: File Indexer in Python

A couple of months ago, I had to make a Source Code Search Engine for the SAMATE Reference Dataset. The organization of our source code is not really common, but still, it's organized and easy to understand.
I am now releasing this tiny Python script in the beta section: pyIndex.

You should have all the information you need to use/adapt this script for your own purposes; it uses a MySQL database and MySQLdb to connect to it. The script only adds words or references to the database; the search is not implemented (but it's only a really simple SQL query...)

Sunday, January 7 2007

iDumper: Embedded iPod Music Copy

One thing I really hate about the iPod is that the songs are pseudo-obfuscated in a hidden directory on the iPod. Therefore, we cannot use iTunes to copy the mp3s from the iPod to the iTunes library (at least under Windows)... This is really stupid!

Anyway, there are lots of tools that do that very well, but I decided to write one: an embedded one. The executable/script is on your iPod, so you can copy your files anywhere :)

iDumper is available in my beta/ repository.
