MWX/SpamFerret
Latest revision as of 20:17, 1 May 2022
About
SpamFerret was an attempt to improve on the SpamBlacklist extension, as a way for small open-edit MediaWiki sites to minimize spam edits. It is no longer supported.
The version posted here worked with older versions of MediaWiki and PHP, but even when it was working, patterns (blacklisted items) had to be entered into the database manually. This isn't hard to do, but if the extension were revived it would be nice to have a friendlier admin interface, especially some way to take a spam page, break it up into unique URLs, and let you select which ones to add to the database.
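As a rough illustration of the "break it up into unique URLs" idea, here is a minimal PHP sketch. The function name extractUniqueUrls and the URL regex are assumptions made up for this example, not part of SpamFerret; a real admin page would still need a way to select entries and write them to the database.

```php
<?php
// Hypothetical helper, not part of SpamFerret: collect the distinct URLs
// found in a chunk of spam text so an admin could choose which to blacklist.
function extractUniqueUrls(string $text): array
{
    // Match http/https URLs up to the next whitespace, bracket, quote or pipe.
    // Trailing punctuation (e.g. a full stop) is not stripped in this sketch.
    preg_match_all('~https?://[^\s\]\[<>"\'|]+~i', $text, $matches);

    // De-duplicate exact repeats and return them in a stable order.
    $urls = array_values(array_unique($matches[0]));
    sort($urls);
    return $urls;
}

$spam = "Buy now [http://pills.example.com/cheap cheap pills] and again "
      . "http://pills.example.com/cheap or https://casino.example.net/win";

foreach (extractUniqueUrls($spam) as $url) {
    echo $url, "\n";
}
```

Each returned URL could then be shown with a checkbox, and the selected ones turned into candidate patterns.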
Requirements
SpamFerret requires the mysqli extension to be installed and enabled in PHP. This would need revision.
To-Do
- If spam filter db cannot be contacted, fall back to a captcha rather than letting spam through
- Important: Document the installation process (pretty simple; the only non-obvious bit is the database spec)
- Option to let users at banned IP addresses create an account and follow a link received by email in order to unblock the IP
- Improved pages for reporting automatic IP suspension, ampersandbots
- Automated reporting of intercepted spam
- Web-based tool ("Special" page) for:
  - Entering new patterns
  - Deactivating/modifying disused patterns
  - Testing sample spam against existing or candidate filters (Kiki regex tester shows matches for spam which SpamFerret is letting through...)
- Log of changes to "patterns" table
- Management tools for new spam (generate candidate patterns from spam page, allow user to fine-tune and choose which ones to use, and add chosen/tweaked patterns to database)
- Easy way to automatically revert to chosen revision while showing all edits below filter-creation form, to make it easier to add new patterns
- Manual reporting tools:
  - [BUG] "text tester" utility does not handle diffs properly (at all?) - e.g. pattern #684 isn't matched even though the regex works
  - basic data viewing, i.e. received spam grouped by IP address, by triggered filter, or in chronological order
  - list patterns least recently used, for possible deactivation
  - whois of all recently promoted domains and create a consolidated list of owners
- Optional log of complete spam contents, for possible data-mining or filter-training (e.g. to answer the question "if I disable this set of filters, how many spams would have gotten through instead of being caught by the other filters?" or "how effective would this proposed new filter have been at catching the spams received thus far?") (done)
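The two questions quoted in the last item could be answered by replaying the logged spam against a set of filters. The following is only a sketch under assumptions: the log is taken to be a plain array of spam texts and the filters to be PCRE strings with delimiters, and the function names caughtBy and countMissed are hypothetical. SpamFerret's actual log tables are not shown on this page.

```php
<?php
// Hypothetical replay helpers, not taken from SpamFerret's code.

// True if at least one of the given PCRE patterns matches the text.
function caughtBy(string $text, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

// Number of logged spam texts that none of the given patterns would catch.
function countMissed(array $loggedSpam, array $patterns): int
{
    $missed = 0;
    foreach ($loggedSpam as $text) {
        if (!caughtBy($text, $patterns)) {
            $missed++;
        }
    }
    return $missed;
}

// Made-up log and filters for illustration.
$log = [
    'buy pills at http://a.example',
    'casino bonus http://b.example',
    'pills and casino together',
];
$allFilters  = ['~pills~i', '~casino~i'];
$withoutPill = ['~casino~i']; // pretend the "pills" filter was disabled

echo "missed with all filters: ", countMissed($log, $allFilters), "\n";
echo "missed without 'pills':  ", countMissed($log, $withoutPill), "\n";
```

Comparing the two counts shows how many spams only the disabled filter was catching. Running countMissed with a single proposed pattern gives the complementary "how effective would this new filter have been" figure.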
MW Versions
- SpamFerret (no version number yet) has been used without modification on MediaWiki versions 1.5.5, 1.7.1, 1.8.2, 1.9.3, 1.11.0, 1.12, 1.14.0, 1.15.0, and 1.15.1.
Purpose
The SpamBlacklist extension has a number of shortcomings:
- Can only handle a limited number of entries before exceeding the maximum string-length it can process, at which point all spam is allowed through
- Does not keep track of which entries are still being "tried" (to allow for periodic "cleaning" of the list)
- Does not keep track of offending IP addresses
- Handles only domains; cannot blacklist by URL path (for partially compromised servers) or "catch phrases" found in spam and nowhere else
- Does not keep a log of failed spam attempts, so there is no way to measure effectiveness
SpamFerret, on the other hand:
- is database-driven
- keeps logs and counts of spam attempts by blacklist entry and by IP address
- matches domains ("http://*.domain"), URLs ("http://*.domain/path") and catch-phrases ("helo please to forgive my posting but my children are hungary")
- can also match patterns, like long lists of links in a certain format
However, SpamFerret may be unsuitable for busier wikis: the checking process (which runs only when an edit is submitted) scans the entire page once per blacklisted pattern and so may take a fair amount of CPU time. This shouldn't be a problem for smaller wikis, which are often monitored less frequently than busier wikis and hence are more vulnerable to spam.
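To make the cost concrete, the check described above amounts to one regex scan of the page per pattern. The sketch below assumes patterns are stored as PCRE strings keyed by a numeric ID; the function name firstMatchingPattern, the IDs and the .example domains are all invented for illustration and are not SpamFerret's actual code or data.

```php
<?php
// Hypothetical sketch of "check the entire page once per blacklisted pattern".
// Returns the ID of the first pattern that matches, or null if none does.
function firstMatchingPattern(string $pageText, array $patterns): ?int
{
    foreach ($patterns as $id => $regex) {
        // One full scan of the page for every pattern in the list.
        if (preg_match($regex, $pageText) === 1) {
            return $id;
        }
    }
    return null;
}

// Made-up patterns covering the kinds of match listed above.
$patterns = [
    1 => '~https?://([a-z0-9-]+\.)*spamdomain\.example~i',        // domain
    2 => '~https?://([a-z0-9-]+\.)*hacked\.example/cheap-pills~i', // URL path
    3 => '~my children are hungary~i',                             // catch-phrase
];

$edit = 'helo please to forgive my posting but my children are hungary';
var_dump(firstMatchingPattern($edit, $patterns));
```

Because the loop stops at the first hit, a legitimate edit is the worst case: every pattern is tried before the edit is allowed through.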
Technical Docs
- /tables
- /views
Installation
Rough installation instructions, to be refined later. The code posted here is often not the latest, and version mismatches between the different files may cause incompatibilities; bug User:Woozle for the latest code.
- SpamFerret.php goes in the MediaWiki extensions folder
- data.php goes in the "includes" folder because PHP seems to want it there.
- Add these lines to your LocalSettings.php:
<php>require_once( "$IP/extensions/SpamFerret.php" );
$wgSpamFerretSettings['dbspec'] = 'mysql:db_user_name:db_user_password@db_server/spamferret_db_name'; // connection string - see notes below
$wgSpamFerretSettings['throttle_retries'] = 5; // 5 strikes and you're out
$wgSpamFerretSettings['throttle_timeout'] = 86400; // 86400 seconds = 24 hours</php>
- The connection string has the following format: mysql:db_user_name:db_user_password@db_server/spamferret_db_name
  - Example: mysql:spfbot:b0tpa55@myserver.com/spamferretdb
- SpamFerret.php and data.php still contain some debugging code, most of which I'll clean up later (some of it calls stubbed debug-printout routines which can come in handy when adding features or fixing the inevitable bugs).
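For reference, the connection string format given above can be split into its parts as follows. This is a sketch only: parseDbSpec is a hypothetical name and the regex is an assumption based on the documented format, not the parser that data.php actually uses.

```php
<?php
// Hypothetical parser for the documented format
//   engine:user:password@server/database
// Returns an associative array, or null if the string does not fit.
function parseDbSpec(string $spec): ?array
{
    // The password group is greedy so that it may itself contain ':' or '@';
    // the server is everything between the last '@' and the final '/'.
    $regex = '~^([^:]+):([^:]+):(.*)@([^@/]+)/([^/]+)$~';
    if (preg_match($regex, $spec, $m) !== 1) {
        return null;
    }
    return [
        'engine'   => $m[1],
        'user'     => $m[2],
        'password' => $m[3],
        'host'     => $m[4],
        'database' => $m[5],
    ];
}

print_r(parseDbSpec('mysql:spfbot:b0tpa55@myserver.com/spamferretdb'));
```

Checking the dbspec this way before saving LocalSettings.php can catch a missing separator early, since a malformed string returns null instead of a partial result.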
Update Log
- 2009-08-05 Developing SpecialPage for administrative purposes
- 2007-12-27 see SpamFerret.php code comments
- 2007-10-13 (1) IP throttling, and (2) logging of ampersandbot attempts
  - If an IP address makes more than N spam attempts with no more than T seconds between them, it will not be allowed to post anything until a further T seconds have elapsed without spam.
- 2007-06-10 Added code to prevent ampersandbot edits; need to add logging of those blocks, but don't have time right now. Also don't know if the ampersandbots trim off whitespace or if that's just how MediaWiki is displaying the changes.
- 2007-08-30 Current version accommodates some changes to the data.php class library
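The throttling rule from the 2007-10-13 entry can be sketched as below. Assumptions: N and T presumably correspond to the throttle_retries and throttle_timeout settings shown under Installation, state is kept in memory rather than in the database, and the class name SpamThrottle and its methods are hypothetical. The real implementation may differ.

```php
<?php
// Hypothetical in-memory version of the IP throttle (requires PHP 7.4+).
final class SpamThrottle
{
    private int $maxRetries;   // N: attempts tolerated within a streak
    private int $timeout;      // T: seconds
    private array $state = []; // ip => ['count' => int, 'last' => int]

    public function __construct(int $maxRetries, int $timeout)
    {
        $this->maxRetries = $maxRetries;
        $this->timeout = $timeout;
    }

    // Record one spam attempt from $ip at Unix time $now.
    public function recordSpam(string $ip, int $now): void
    {
        $prev = $this->state[$ip] ?? null;
        // The streak continues only if the gap since the last attempt is <= T.
        $inStreak = $prev !== null && ($now - $prev['last']) <= $this->timeout;
        $this->state[$ip] = [
            'count' => $inStreak ? $prev['count'] + 1 : 1,
            'last'  => $now,
        ];
    }

    // Blocked while the streak exceeds N and fewer than T seconds have
    // passed since the most recent spam attempt.
    public function isBlocked(string $ip, int $now): bool
    {
        $s = $this->state[$ip] ?? null;
        if ($s === null) {
            return false;
        }
        return $s['count'] > $this->maxRetries
            && ($now - $s['last']) < $this->timeout;
    }
}

$throttle = new SpamThrottle(5, 86400);
$ip = '203.0.113.7';
for ($i = 0; $i < 6; $i++) {
    $throttle->recordSpam($ip, 1000 + $i * 60); // six attempts, one minute apart
}
var_dump($throttle->isBlocked($ip, 1400));         // bool(true)
var_dump($throttle->isBlocked($ip, 1300 + 86400)); // bool(false)
```

Note that any new spam attempt during the block resets the clock, which matches "a further T seconds have elapsed without spam".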