BotID

From WoozleCodes
BotID: a method of dealing with poorly-behaved scraper-bots

The key concept for this might be called "reputation tracking" or "proof of identity".

The idea is that every attempt to request a web page could be quickly sorted into one of three categories:

  • A (highest): Known users with accounts
  • B: Known browsers with a signed cookie
  • C (lowest): Anonymous

This filtering would take place before any request reaches the actual web site. Each incoming request would be placed in the queue for its class, and requests would be passed to the site from the "A" queue first; requests in a lower queue would be passed along only when there was capacity to handle them and every higher-ranked queue was empty.
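The queueing rule can be illustrated with a minimal sketch. The class name `PriorityDispatcher`, its methods, and the `capacity` parameter are hypothetical names chosen for this example; a real filter would sit in front of the web server rather than run as a standalone Python object.

```python
from collections import deque

# Class labels in priority order, highest first.
CLASSES = ("A", "B", "C")


class PriorityDispatcher:
    """Holds one FIFO queue per class and releases requests in strict priority order."""

    def __init__(self):
        self.queues = {cls: deque() for cls in CLASSES}

    def enqueue(self, request_class, request):
        """Place a request in the queue for its class."""
        self.queues[request_class].append(request)

    def next_request(self):
        """Return the next request to forward, or None if every queue is empty."""
        for cls in CLASSES:
            if self.queues[cls]:
                return self.queues[cls].popleft()
        return None

    def drain(self, capacity):
        """Forward at most `capacity` requests, highest class first."""
        forwarded = []
        while len(forwarded) < capacity:
            request = self.next_request()
            if request is None:
                break
            forwarded.append(request)
        return forwarded
```

With strict priority like this, a steady stream of class A traffic can starve the lower queues entirely. That is the intended trade-off here, since the goal is to keep the site responsive for known users.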

Mechanism

Every anonymous request (class C) would be offered a unique cookie. This cookie would contain server-signed identity-and-state information.

If a valid cookie is presented with an access-request, then that request defaults to class B; if not, it remains in class C. The cookies presented with class B requests allow us to distinguish between individual actors (even if -- or perhaps especially if -- they share the cookie amongst multiple bots at different IP addresses) for purposes of rate-limiting, checking ROBOTS.TXT compliance, and detecting any other problematic behavior.
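One way to issue and check such a cookie without a database lookup is an HMAC signature over the cookie payload. The following is a minimal sketch under stated assumptions: the payload fields (`id`, `issued`), the `body.signature` cookie layout, and the names `SECRET_KEY`, `issue_cookie`, `verify_cookie`, and `classify` are all hypothetical and not part of the proposal itself.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

# Hypothetical server-side secret; a real deployment would load this from config.
SECRET_KEY = b"replace-with-a-real-server-secret"


def issue_cookie(key=SECRET_KEY):
    """Create a cookie value with a unique ID, signed by the server."""
    payload = {"id": secrets.token_hex(8), "issued": int(time.time())}
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    signature = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def verify_cookie(cookie, key=SECRET_KEY):
    """Return the payload dict if the signature is valid, otherwise None."""
    if not cookie or "." not in cookie:
        return None
    body, signature = cookie.rsplit(".", 1)
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    try:
        return json.loads(base64.urlsafe_b64decode(body.encode()))
    except ValueError:
        return None


def classify(cookie):
    """Class B if a valid signed cookie is presented, otherwise class C."""
    return "B" if verify_cookie(cookie) is not None else "C"
```

Because verification needs only the secret key and the cookie itself, the check stays cheap. The `id` field is what would let the filter attribute requests to one actor even when the cookie is shared across IP addresses.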

Scenarios

So, for example: if a class-B request isn't obeying ROBOTS.TXT, and we have no reason to think it comes from a real user (e.g. the delay between unrelated requests is too short), then we might demote it to class C.
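A rough sketch of that demotion rule follows. The thresholds `MIN_HUMAN_INTERVAL` and `MAX_STRIKES`, the `BehaviorTracker` class, and the idea of counting "strikes" are assumptions made for illustration; the proposal does not specify how many violations should trigger demotion or what delay counts as too short.

```python
from collections import defaultdict

# Hypothetical thresholds, chosen only for illustration.
MIN_HUMAN_INTERVAL = 0.5  # seconds between requests below which we suspect a bot
MAX_STRIKES = 3           # violations tolerated before demotion to class C


class BehaviorTracker:
    """Tracks per-cookie behavior in memory and decides when to demote B to C."""

    def __init__(self):
        self.last_seen = {}
        self.strikes = defaultdict(int)

    def record(self, cookie_id, now, disallowed_path):
        """Record one class-B request and return the class it should be treated as.

        `now` is a timestamp in seconds; `disallowed_path` is True when the
        requested path is one that ROBOTS.TXT asks bots not to fetch.
        """
        previous = self.last_seen.get(cookie_id)
        self.last_seen[cookie_id] = now
        too_fast = previous is not None and (now - previous) < MIN_HUMAN_INTERVAL
        # A strike needs both signals: a disallowed path and bot-like timing.
        if disallowed_path and too_fast:
            self.strikes[cookie_id] += 1
        return "C" if self.strikes[cookie_id] >= MAX_STRIKES else "B"
```

Requiring both signals before counting a strike is meant to spare a human visitor who wanders into a no-bots zone at normal browsing speed. In this sketch a demoted cookie stays demoted; whether strikes should expire is left open.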

Even if it's difficult to tell the difference between "legit but non-logged-in user accessing a no-bots zone" and "bot misbehaving", B would still get lower priority than A, so the site should always stay responsive for logged-in users (because they get top access priority).

Obstacles

  • The filtering code itself would need to be very fast and should avoid using a database, because that would create congestion even for class C requests.
    • A key technique, which may pose its own issues, is for applications to add login-state to the cookie. Many applications could easily support this with extensions or addons, but there may not be a universal solution.
  • The filtering code needs to be run across all vhosts on a server. I'm thinking this can be done via httpd configuration, but the exact methodology remains to be determined.
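The login-state technique from the first obstacle might look like the sketch below: the application re-issues the signed cookie at login with a user field, so the filter can tell class A from class B using only the cookie and a shared secret. The `user` field, the `body.signature` layout, and the names `sign` and `classify` are hypothetical. How each application would hook this into its login flow, and how logout or expiry would be handled, is not addressed.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret shared between the filter and the application.
SECRET_KEY = b"replace-with-a-real-server-secret"


def sign(payload, key=SECRET_KEY):
    """Encode a payload dict and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    signature = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{signature}"


def classify(cookie, key=SECRET_KEY):
    """Return "A", "B", or "C" using only the cookie contents (no database)."""
    if not cookie or "." not in cookie:
        return "C"
    body, signature = cookie.rsplit(".", 1)
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return "C"
    payload = json.loads(base64.urlsafe_b64decode(body.encode()))
    # A signed, non-empty "user" field marks a logged-in user (class A).
    return "A" if payload.get("user") else "B"
```

For example, `classify(sign({"id": "abc", "user": "alice"}))` yields class A, the same payload without `user` yields class B, and a cookie signed with a different key falls back to class C.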

Notes