MWX/SpamFerret: Difference between revisions

From Woozle Writes Code
< MWX
Jump to navigation Jump to search
(→‎Design: attempts.URL_Page)
(→‎Design: decided to store server + page instead of URL (easier to code, mainly, but also allows easier reporting by site))
Line 48: Line 48:
   `ID_Pattern` INT NOT NULL COMMENT '(patterns.ID) matching pattern found',
   `ID_Pattern` INT NOT NULL COMMENT '(patterns.ID) matching pattern found',
   `ID_Client` INT NOT NULL COMMENT '(clients.ID) spamming client',
   `ID_Client` INT NOT NULL COMMENT '(clients.ID) spamming client',
   `URL_Page` varchar(255) COMMENT 'URL of attacked site/page',
   `PageServer` varchar(63) COMMENT 'identifier of wiki being attacked (usually domain)',
  `PageName` varchar(255) COMMENT 'name of page where the spam would have displayed',
   PRIMARY KEY(`ID`)
   PRIMARY KEY(`ID`)
)
)
ENGINE = MYISAM;
ENGINE = MYISAM;
</sql>
</sql>

Revision as of 20:54, 2 May 2007

Navigation

computing: software: MediaWiki: fighting spam

Overview

SpamFerret is my attempt at an improvement over the SpamBlacklist extension. It is currently in development.

Purpose

The SpamBlacklist extension has a number of shortcomings:

  • Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through
  • Does not keep track of which entries are still being "tried" (to allow for periodic "cleaning" of the list)
  • Does not keep track of offending IP addresses
  • Handles only domains; cannot blacklist by URL path (for partially compromised servers) or "catch phrases" found in spam and nowhere else
  • Does not keep a log of failed spam attempts, so there is no way to measure effectiveness

SpamFerret will:

  • be database-driven
  • keep logs and counts of spam attempts by blacklisting and by IP
  • match domains ("http://*.domain"), URLs ("http://*.domain/path") and catch-phrases ("helo please to forgive my posting but my children are hungary")
    • should also be able to match patterns, like long lists of links in a certain format

It may also be unsuitable for use on busier wikis, as the checking process (which only happens when an edit is submitted) may take a fair amount of CPU time (checks the entire page once per blacklisted pattern). This shouldn't be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.

Design

<sql> CREATE TABLE `patterns` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `Pattern` varchar(255) COMMENT 'pattern to match (regex)',
 `WhenAdded` DATETIME DEFAULT NULL COMMENT 'when this entry was added',
 `WhenTried` DATETIME DEFAULT NULL COMMENT 'when a spammer last attempted to include this pattern',
 `isActive` BOOL COMMENT 'if FALSE, do not include in checking',
 `isURL` BOOL COMMENT 'TRUE indicates that additional URL-related stats may be collected',
 `isRegex` BOOL COMMENT 'TRUE indicates that the string should not be escaped before feeding to preg_match()',
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM;

CREATE TABLE `clients` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `Address` varchar(15) COMMENT 'IP address',
 `WhenFirst` DATETIME COMMENT 'when this IP address first submitted a spam',
 `WhenLast` DATETIME COMMENT 'when this IP address last submitted a spam',
 `Count` INT COMMENT 'number of attempts',
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM;

CREATE TABLE `attempts` (

 `ID` INT NOT NULL AUTO_INCREMENT,
 `When` DATETIME COMMENT 'timestamp of attempt',
 `ID_Pattern` INT NOT NULL COMMENT '(patterns.ID) matching pattern found',
 `ID_Client` INT NOT NULL COMMENT '(clients.ID) spamming client',
 `PageServer` varchar(63) COMMENT 'identifier of wiki being attacked (usually domain)',
 `PageName` varchar(255) COMMENT 'name of page where the spam would have displayed',
 PRIMARY KEY(`ID`)

) ENGINE = MYISAM; </sql>