
Automatically determining PageRank, or, unsigned integers in PHP

Market Ruler, LLC develops software for web marketers – and as such, I’m always on the lookout for new technologies to make life easier on the PPC and SEO crowd.

I recently took the SEOMoz toolset for a spin, and in one of their tests, I saw that they automatically checked the Google PageRank of a site. Since I’m the type who likes to see how this is done … ahem, automatically, I dug into their system to see what they were doing.

All I could find was the URL used, which contained a request to Google with some additional parameters, one of which, ch, was set to a really big number.

This is a checksum, and it varies based on which URL you are checking. Google implemented it to prevent automated queries. However, they release their code out into the wild (their toolbar, for example), so some enterprising engineer reverse engineered it, and their “security” is now worthless.

I found, shortly thereafter, this Perl Module to check PageRank which outlines how it’s done.

Since our PPC Tracking Software is written in PHP, I thought I would quickly port the module to PHP. Problem is, it was kind of a nightmare.

As a fair warning, severe geek talk is approaching rapidly.

As well, I should mention that this type of code is technically against Google’s Terms of Service: they request no automated queries. Don’t say you haven’t been warned. Rumors of Google banning IP addresses, or user agents who use Web Position Gold, are very much true.

I’m debating on how to use this without making it “automated” – does that mean that a human being has to initiate the action? If that’s all, it may be possible to do without violating their terms.

Anyway, this long digression gets into the specific problem I had. (Here comes the geek talk …) The checksum algorithm works, roughly, as follows:

  • Checksum the actual URL
    • Convert the string into ASCII codes: A = 65, Z = 90, etc.
    • Convert the first 12 characters into 3 unsigned integers, and add each to a “magic” number which accumulates
    • Run the “mix” step which shifts bits around in each integer and combines them with the other integers
    • Complete on the next 12 characters until you run out of characters
  • You’ll get back a big number.
  • Then take this big number, and subtract multiples of nine from it 20 times (0, 9, 18, etc.)
  • Convert these big numbers into 4 characters each, and combine them into an 80 character string
  • Run the checksum (above) on the new string
  • Prepend the number “6” to the final number (denoting the version number, probably) and you are ready to go!
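To make the "12 characters into 3 unsigned integers" step concrete, here is a minimal sketch. It is not taken from the Perl module: the function name words_from_block is hypothetical, and both the little-endian byte order and the zero-padding of a short final block are assumptions made for illustration.

```php
<?php
// Sketch only: byte order and padding are assumptions, not the module's behaviour.
function words_from_block(string $block): array
{
    // Pad a short final block out to 12 bytes (assumed behaviour).
    $block = str_pad(substr($block, 0, 12), 12, "\0");

    // 'V' is an unsigned 32-bit little-endian integer; 'V3' reads three of them.
    return array_values(unpack('V3', $block));
}

// ASCII codes, as in the list above: A = 65, Z = 90.
printf("A = %d, Z = %d\n", ord('A'), ord('Z'));

// Walk a URL 12 characters at a time and show the three words per block.
$url = 'http://www.example.com/';
foreach (str_split($url, 12) as $block) {
    $words = words_from_block($block);
    printf("%-12s => %08x %08x %08x\n", $block, $words[0], $words[1], $words[2]);
}
```

On a 64-bit build the three words come back as non-negative integers. On a 32-bit build any word with its top bit set comes back negative, which is exactly the trouble described next.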

The details can be easily found in the Perl module above. And yet, I still haven’t divulged my problem: PHP is terrible at handling unsigned integers properly.

That is, PHP is loosely typed, meaning it does its best to convert values to the most appropriate type depending on what you’re doing. This is usually great, except when it’s not.

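That is, in this case, when you do 51234231 << 13 on a 32-bit build of PHP, the mathematical result is greater than 2147483647 (the maximum value for a signed 32-bit integer), so it no longer fits in an integer. Arithmetic that overflows like this gets converted to a double, while the shift itself silently drops the high bits and can hand back a negative number.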
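Here is a minimal sketch of what that looks like in practice. It assumes a 64-bit build of PHP, where 0xFFFFFFFF is still an integer, and uses a mask to imitate what a 32-bit word would keep. The variable names are just for illustration.

```php
<?php
// Integer arithmetic that overflows is promoted to a float (double).
$sum = PHP_INT_MAX + 1;
var_dump(is_float($sum));

// A shift always stays an integer; bits pushed past the word size are dropped.
$shifted = 51234231 << 13;
var_dump(is_int($shifted));

// Keep only the low 32 bits, i.e. the unsigned 32-bit result.
// Assumes a 64-bit build, where the mask itself fits in an integer.
$low32 = $shifted & 0xFFFFFFFF;
printf("unsigned 32-bit view: %u\n", $low32);

// The same 32 bits read as a signed value: the top bit is set, so the
// number a signed 32-bit integer would hold is negative.
$asSigned = ($low32 & 0x80000000) ? $low32 - 0x100000000 : $low32;
printf("signed 32-bit view:   %d\n", $asSigned);
```

The same bit pattern reads as two very different numbers depending on whether you treat it as signed or unsigned, and PHP only offers the signed view.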

The problem is that when doing various bit-wise manipulations (as this algorithm does), PHP just pukes left and right. Depending on the sign, and the number of bits you’re manipulating, it converts from integer to double seemingly randomly.

The solution for this, if you ever encounter it, is to avoid using integers at all.

I defined a class called “ulong” with methods bit_and, bit_or, bit_xor, etc., which allow me to simulate unsigned long behavior without PHP’s automatic conversion problems. The crux is that I break the long into two “short” parts of 16 bits each, then map all of the operations onto them.
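To give a feel for the two-halves idea, here is a rough sketch. It is not the contents of ulong.inc: the class name ULongSketch and the methods add, shift_left, and to_hex are hypothetical, and only a few operations are shown.

```php
<?php
// Hypothetical sketch (not ulong.inc): a 32-bit unsigned value kept as two
// 16-bit halves, so no intermediate result ever needs more than 31 bits.
final class ULongSketch
{
    private $hi; // upper 16 bits, 0..0xFFFF
    private $lo; // lower 16 bits, 0..0xFFFF

    public function __construct(int $hi, int $lo)
    {
        $this->hi = $hi & 0xFFFF;
        $this->lo = $lo & 0xFFFF;
    }

    public function bit_and(self $o): self
    {
        return new self($this->hi & $o->hi, $this->lo & $o->lo);
    }

    public function bit_xor(self $o): self
    {
        return new self($this->hi ^ $o->hi, $this->lo ^ $o->lo);
    }

    // Addition modulo 2^32: carry from the low half into the high half.
    public function add(self $o): self
    {
        $lo = $this->lo + $o->lo;               // at most 17 bits
        $hi = $this->hi + $o->hi + ($lo >> 16); // add the carry bit
        return new self($hi, $lo);              // constructor masks both halves
    }

    // Left shift by 0..31 bits; bits pushed past bit 31 are dropped.
    public function shift_left(int $n): self
    {
        if ($n >= 16) {
            return new self($this->lo << ($n - 16), 0);
        }
        $carry = $this->lo >> (16 - $n);        // bits moving into the high half
        return new self(($this->hi << $n) | $carry, $this->lo << $n);
    }

    public function to_hex(): string
    {
        return sprintf('%04x%04x', $this->hi, $this->lo);
    }
}

// 51234231 is 0x030DC5B7, so the halves are 0x030D and 0xC5B7.
$x = new ULongSketch(0x030D, 0xC5B7);
echo $x->shift_left(13)->to_hex(), "\n"; // prints b8b6e000
```

Because each half is masked back to 16 bits after every operation, nothing ever grows large enough for PHP to switch types behind your back.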

There may be an easier way, but I got more and more frustrated by having to convert to integer and back to double, using fmod instead of bitwise ANDs, and having PHP come up with seemingly random results, so this is the approach I settled on.

Note that this has only been tested in a limited environment, and is not suited for anything high-performance (such as cryptography).

If you find bugs or tweaks, let me know. You can download the source here: ulong.inc

If you find this useful, please comment!

6 replies on “Automatically determining PageRank, or, unsigned integers in PHP”

I also had a few problems like yours and did not find any appropriate solution. I will try your source as soon as I can get it. I hope it will solve some of my problems. Anyway, I will notify you.


Webdesign Stuttgart

That’s fairly impressive.
I mean, I was using SEOMoz data too, but I needed to get PageRanks (which are not covered by their API).
I wrote a C++ application to calculate the checksum, but then I was asked to port it to PHP.

And I got stuck with the unsigned problem too. :)
Thanks a bunch for this solution.

Your class is cool, but I’ve found a different solution.
In the end I’ve used something like:

sprintf('%u', $num)

This code will return the unsigned representation of the number (and that’s a string btw)
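A minimal sketch of that trick, for anyone who wants to try it. The helper name to_unsigned32_string is hypothetical, and the mask assumes a 64-bit build of PHP. On a 32-bit build the plain sprintf('%u', $num) call on its own should be enough.

```php
<?php
// Hypothetical helper: show a (possibly negative) 32-bit value as the
// unsigned decimal string it represents. Assumes a 64-bit PHP build,
// where masking with 0xFFFFFFFF keeps only the low 32 bits.
function to_unsigned32_string(int $num): string
{
    return sprintf('%u', $num & 0xFFFFFFFF);
}

// -1 has all 32 bits set, so it reads as the largest unsigned 32-bit value.
echo to_unsigned32_string(-1), "\n";

// The result is a string, not a number.
var_dump(is_string(to_unsigned32_string(-1)));
```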

Fellow developer: Unfortunately, employing basic arithmetic operations on strings doesn’t work out so well. For instance, say you need to work with 32-bit RGBA values, and you needed to employ 32-bit unsigned math on them. Without a whole lot of fudging, you can kiss your little project good-bye.

It’s funny that it wasn’t until now, after 10 years of on-and-off PHP coding, that I discovered this discrepancy. I never thought I would refer to PHP’s programmers as lazy, but this takes the cake. PHP should inherently know whether to deal with an integer as signed or unsigned; the CPU does, so PHP can.

By the way, poster, cool class. You’re right about its speed, though. PHP is still interpreted, bytecode or not, and having this functionality coded into PHP would be much quicker. I’m not at all surprised you didn’t code in float, multiply and divide operations.

(OMG, Razzed needs JS and cookies to post a message? Ok, I’ll add you guys to the list with PHP. Thanks.)

@Ron: Thanks for your comment. I’m curious: How would you determine whether to deal with an integer as signed or unsigned automatically, and always make the right choice? I am unsure how the compiler could intuit this without some sort of hint.

As for the library, the sole purpose was to solve this problem, no more no less. If I really needed speed, I’d do it in C, or I’d extend PHP to do it as a plugin. I didn’t add multiply and divide because I didn’t need it for the problem at hand.

kent: I wouldn’t have responded had I not stumbled upon this very same post. It’s not like me to let someone have the last word.

However, I did have some time to think about this, and realise now that, aside from third-party solutions such as your own, there appear to be only two ways to resolve this issue:

1) evolve PHP to support unsigned integers, which would require a reworking of the protocols, API and future developer PHP code;
2) develop a built-in class to handle additional precision (in machine code, it would be faster than my pappy on his honeymoon).

I did find this tasty link, too: http://stackoverflow.com/questions/872424/unsigned-int-to-signed-in-php

Regarding your library, I trust my post didn’t appear critical. Real programmers use whatever tools best serve their needs in the moment. I never thought I’d pay Java any attention, but now I find myself enjoying coding in it. This is a sure sign Armageddon’s coming fast.
