Of late, I’ve had to think about mechanisms to protect web forms from getting spammed. For reasons that aren’t very clear to me, spammers are writing bots to randomly fill in forms and submit them, as if that will really boost sales of the advertised (and often misspelled) product.

I apologize in advance for the scattered and disorganized state of this entry. I wanted to say something about what I’ve been working on.

You can look at a working version of the solution described below here.

One solution I’ve seen comes from Blogger, which can require users to type in the letters that appear in an image. The letters in the image are distorted and there is often some background noise. This is to throw off optical character recognition routines that spammers might employ to figure out what letters appear in the image. The distortions and noise thwart most OCR attempts.

I like this approach, but since I’m not very good with image manipulation, I needed another way to do something similar. The answer came from some reading I did years ago when I wanted to make video games for a living. In particular, I recalled that console fonts stored in PC ROM were typically composed of 8x8 matrices. I then thought that if I could recreate these matrices, I could represent their patterns as alternating colors in an HTML table.
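Here’s a minimal sketch of the idea. The bitmap below is hand-drawn for illustration (it’s not copied from any real ROM font), and the attribute choices are just one way to do it:

```javascript
// A hand-drawn 8x8 bitmap for the letter "T" -- each byte is one row,
// with set bits marking the "on" pixels. Illustrative only, not from
// any actual PC ROM font.
const GLYPH_T = [
  0b11111111,
  0b11111111,
  0b00011000,
  0b00011000,
  0b00011000,
  0b00011000,
  0b00011000,
  0b00000000,
];

// Render one glyph as an HTML table: "on" bits become dark cells,
// "off" bits become light cells.
function glyphToTable(glyph) {
  const rows = glyph.map(bits => {
    let cells = '';
    for (let col = 7; col >= 0; col--) {
      const on = (bits >> col) & 1;
      cells += `<td bgcolor="${on ? '#000000' : '#ffffff'}"></td>`;
    }
    return `<tr>${cells}</tr>`;
  });
  return `<table cellspacing="0" cellpadding="2">${rows.join('')}</table>`;
}
```

A word is then just a row of these little tables laid out side by side.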

Of course, it wouldn’t be especially hard to create routines that read in these tables of characters and produce the original character, but that’s a refinement for another day.

Once I could translate characters into an HTML table, it was an easy thing to create a random series of characters to feed to this routine. To prevent simple packet sniffing of the secret word, I needed a way to pass something back to the client that could be used to verify the word. Since sending the password in clear text was obviously a non-starter, I needed some kind of hash. The most natural hashing algorithm to use for what amounts to passwords is DES, which UNIX systems have used to protect their authentication systems for years (although it’s not really as unbreakable as PKI schemes).

The next problem was to create a mechanism where an HTML page could request this bundle of HTML tables that represent a word. This is where the oft-talked-about Ajax techniques come into play. Ajax is simply an asynchronous RPC mechanism built with JavaScript and some server-side programming technology. It took me a good bit of time to work through the most common gotchas that prevent reliable transport of Ajax RPC messages. Popular browsers have different DOM parsers that treat large (> 4096 bytes) data differently.
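The 4096-byte gotcha deserves a concrete illustration: some DOM parsers split character data larger than 4096 bytes into several adjacent text nodes, so reading just `firstChild.nodeValue` silently truncates a big payload. Concatenating every text child is the defensive way to pull the full response out of the Ajax reply (the function below works against any DOM-shaped object):

```javascript
// Gather ALL text-node children of an element, not just the first one.
// Parsers that chunk character data at 4096 bytes will produce several
// adjacent text nodes for a large payload; this stitches them back up.
function collectText(node) {
  let text = '';
  for (let i = 0; i < node.childNodes.length; i++) {
    const child = node.childNodes[i];
    if (child.nodeType === 3) { // TEXT_NODE
      text += child.nodeValue;
    }
  }
  return text;
}
```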

There are two major problems with the approach described here. The first is that it would be easy to write an OCR routine to read the HTMLized font. The second problem is that anyone can bypass the security by creating their own DES hash of an arbitrary string and passing that string in clear text to the ajax server.

Let me tackle the second problem first. To ensure that the hash value received by the ajax server can be trusted, two methods can be used. The first is to rewrite the secret word checker to use a session-based system. The user would be passed a session ID from the server. The server would remember what secret phrase it sent to the user. The user sends back their guess in clear text. The validator retrieves the secret from a backing store and makes the check.

I don’t much like this stateful solution, although I believe it is superior to the original implementation. If you’ve got an app that already has sessions, perhaps this is a natural fit.

Another solution to this problem is to use a Public Key Infrastructure mechanism to encode the secret into a hexadecimal string. When the user returns this encoded secret along with the clear-text guess, the server decrypts the secret using a private key. If the encoded secret has been tampered with, the decryption will fail. I would recommend this solution as a refinement to the anti-spamming mechanism I’ve described.

The second and more difficult problem with my mechanism is that it would be trivial to write OCR for the HTML emitted. It’s true that the spam bot would have to speak JavaScript to get to this data at all, since the HTML is injected into the static web page via innerHTML(). But I don’t think that’s sufficiently obscure to baffle spammers. What would be needed is some kind of “noise” in the HTML to make OCR difficult. Other than adding spacing, altering capitalization and adding weird HTML attributes, there’s nothing that a solid HTML parser and a good coder couldn’t get around.
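To make the limitation concrete, here’s what that kind of attribute noise might look like for a single cell. It ensures no two renderings of the same word are byte-identical, but, as noted above, a real HTML parser strips all of it right back out:

```javascript
// Sprinkle randomly varying, semantically meaningless attributes into
// each emitted cell. This defeats naive string matching on the markup,
// but not anything that actually parses the HTML.
function noisyCell(on) {
  const junkId = 'c' + Math.floor(Math.random() * 1e6); // throwaway id
  const pad = 1 + Math.floor(Math.random() * 3);        // jittered padding
  const color = on ? '#000000' : '#ffffff';
  return `<td id="${junkId}" style="padding:${pad}px;background:${color}"></td>`;
}
```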

All of this has made me appreciate just how good the human brain is at finding meaning in chaos. Digital creations just aren’t up to that task yet.