In #5099 many people regularly report new spammers (who spam Piwik with fake visits). We need to find a more scalable solution, as this is becoming a real problem.
Goals: make it easier for users to report new spammers; Piwik should auto-update the spammers list (every day or so); the list should be kept up to date in future releases (for the Piwik installs that are set up to avoid any external network call); optionally, share that spammers list with the world as open data?
- store the spammers list in a new GitHub repository in a JSON file (or YAML or whatever)
- users can report new spammers with issues and pull requests (later we can create a better UI/website for that)
- register that package on Packagist:
- Piwik requires that package, which means the list is bundled in Piwik's releases (no first download required)
- any other project can use the list by cloning the git repository or requiring it in Packagist
- Piwik would download new versions of that list in `tmp/` every day or week -> the version in `tmp/` would override the one installed in
I'm not too sure yet about the Packagist part (it's not a PHP package, and it would require running `composer update` before releases) but using submodules is definitely a no-go…
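A minimal sketch of the "download every day or week" check from the proposal above. It is written in Python purely for illustration (Piwik itself is PHP), and the function name, parameters, and default interval are assumptions, not existing Piwik code.

```python
from datetime import datetime, timedelta
from typing import Optional


def update_is_due(
    last_download: Optional[datetime],
    now: datetime,
    interval: timedelta = timedelta(days=1),  # assumed default: daily
) -> bool:
    """Return True when the list was never downloaded or the copy is older than `interval`."""
    if last_download is None:
        # No downloaded copy yet: only the bundled list exists.
        return True
    return now - last_download >= interval
```

Passing `timedelta(weeks=1)` as the interval would give the weekly variant.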
We need some UI in Piwik to:
1. Manage the list of URLs.
2. Update list via
It's a more technical step, but Piwik must support regex in these URLs.
At least, Piwik on the `update list` event should check for `userlist` and merge it if `userlist` has different data.
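A rough sketch of that merge step, in Python for illustration only. It assumes every entry in both lists is a regular expression matched against the referrer host; `merge_lists` and `is_spam` are hypothetical names, not Piwik functions.

```python
import re
from typing import Iterable, List


def merge_lists(updated: Iterable[str], userlist: Iterable[str]) -> List[str]:
    """Merge the user's entries into the updated list, dropping blanks and duplicates."""
    merged: List[str] = []
    seen = set()
    for entry in list(updated) + list(userlist):
        entry = entry.strip()
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


def is_spam(referrer_host: str, patterns: Iterable[str]) -> bool:
    """Return True if any pattern (treated as a regex) matches the host."""
    return any(re.search(pattern, referrer_host) for pattern in patterns)
```

Entries from the updated list come first, so a user entry that duplicates an official one is simply skipped.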
And please, make it "low tech" so people with servers that can't reach the internet can "copy-paste & save" the content of a file to upgrade the list...
So it seems the easiest way from a user standpoint would be to keep the spammer list in an external file that is updated on a daily basis. In the Piwik admin panel, there would be a "report as spam" button beside each referrer URL. Each report would be sent to a verification list, and a URL with several reports would be moved from the verification list to the main list.
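A small sketch of that verification list, in Python for illustration. The threshold of 3 and the idea of counting distinct reporters are assumptions; the comment above only says "several reports".

```python
from collections import defaultdict
from typing import Dict, Set


class ReportQueue:
    """Holds reported URLs until enough distinct reporters have flagged them."""

    def __init__(self, threshold: int = 3):  # assumed value for "several reports"
        self.threshold = threshold
        self.pending: Dict[str, Set[str]] = defaultdict(set)  # verification list
        self.main_list: Set[str] = set()

    def report(self, url: str, reporter: str) -> bool:
        """Record a report; return True once the URL is on the main list."""
        if url in self.main_list:
            return True
        # A set means the same reporter is only counted once per URL.
        self.pending[url].add(reporter)
        if len(self.pending[url]) >= self.threshold:
            self.main_list.add(url)
            del self.pending[url]
            return True
        return False
```

Counting distinct reporters rather than raw reports makes it harder for a single install to push a URL onto the main list.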
I guess it would make sense to have an (auto-updated) "general list" and something user/installation based, as European users often see different referrer spam than US, Chinese, ... users. I use Piwik for about 500-600 different sites, and referrer spam often differs from site to site (of course some of it is the same on nearly all sites).
So one could add specific entries only for one's own instance of Piwik and not have to wait for the general list to be updated.
A custom user list seems like a good idea, but I don't think it's beneficial for everybody in the long run: spammers are spammers for everybody. If people do not report them because they can flag them in their user list, then the interest in the global list is gone.
We should maybe take the problem the other way: when admins report spammers, they are added to the custom list. That way they don't have to wait for the spammer to be added to the official list, but it still means that users will report spammers and not simply create a custom list.
However we may want to start with a simpler goal at first (one where there is no UI to report spammers, and no way to have a custom list).
@gaumondp The list should be updated on each Piwik update, like now, I'm not sure letting users manually update the list is that necessary. That should be enough for a start for those installs that don't have internet access, especially since those might not be the target of spammers (since they don't have internet access).
I like the idea that reporting spammers to the global list also adds them to the custom list ... it prevents bad reporting behaviour, as you say. ~~But updating the global list only on Piwik updates is not frequent enough, as spammers will alter their domains faster when they know they are banned ... I would suggest updating the global list like the GeoCity database, on a regular basis, like AgentGod already posted.~~ (already stated in your initial post)
@mnapoli I don't want to make my case the rule, but I know a few people with big installations and very rigid environments/infrastructure who can't keep up with Piwik's fast release cycle.
In fact, I usually update 4 times a year. So we're often 3 releases behind at update time. I don't think I'm alone though.
@gaumondp and those setups cannot use auto-update of the list?
Exactly: no auto-update of the spammer list possible, no one-click GeoIP updating, no easy install for stuff at http://plugins.piwik.org/ ...
And considering size (DB is at 22 GB here right now), no Web interface Piwik update possible. We use the CLI for that.
@gaumondp that's why it should be a simple external txt file with the spammer list in it, which can be updated easily through the CLI. Some people will report through the user interface, some with big installs will not. The idea is that before a link ends up in the generally distributed spam list, it is automatically verified from several sources.
@gaumondp OK then we can document how to update the updated list, i.e. there will be 2 files:
- built-in list, updated with every Piwik update (this is the one installed by Composer)
- latest list, updated either through auto-update or manually (will probably be in the
That doesn't require any additional effort and should address all use cases. Then once that is done we can discuss how to let users update manually through the UI if that's really necessary.
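A minimal sketch of how the two files could be resolved at read time, in Python for illustration. Both paths are passed in by the caller because the actual locations are not settled in this thread; `resolve_list_path` is a hypothetical helper, not Piwik code.

```python
from pathlib import Path


def resolve_list_path(latest: Path, built_in: Path) -> Path:
    """Prefer the latest list when it exists and is non-empty, else use the built-in one."""
    if latest.is_file() and latest.stat().st_size > 0:
        return latest
    # No auto-update or manual copy has happened yet: fall back to the bundled file.
    return built_in
```

With this fallback, an install without internet access keeps working on the built-in list, and dropping a newer file at the `latest` location is enough to upgrade it.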
@mnapoli, I'm just giving information and use cases about a few environments and scenarios I know about that maybe you don't see often. I'm not "requiring" stuff. :) I'm just good at being devil's advocate.
I'm not sure about saving the list in the /tmp/ directory though. In my view, everything in /tmp/, once emptied, will be "auto-generated". Tell me if I'm wrong about this! But you know Piwik internals better than me for sure, so I trust you about where to store such a file.
Listing the different use cases is appreciated, I for sure don't have a clear overview of all of them. In that case there is no additional effort so I don't see any issue ;)
Regarding the directory, maybe somebody else can chime in on this but I'm afraid we need a folder with write access.
Maybe a new table in Piwik, but "feedable" from the text file or a future web interface with a simple "each line is a spammer" format, so it's copy-paste enabled?
Or, if the GeoIP database is in /misc/, maybe it makes sense to use that directory instead of /tmp/ if you don't want an additional table in the DB?
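To illustrate the "each line is a spammer" idea, here is a small Python sketch of a parser for pasted text. Lowercasing the entries is an assumption (hostnames are case-insensitive); `parse_spammer_text` is a hypothetical name.

```python
from typing import List


def parse_spammer_text(text: str) -> List[str]:
    """Turn pasted text into a list of hosts: one per line, blanks and duplicates dropped."""
    entries: List[str] = []
    for line in text.splitlines():
        host = line.strip().lower()  # assumption: entries are plain hostnames
        if host and host not in entries:
            entries.append(host)
    return entries
```

Because `splitlines()` handles both Unix and Windows line endings, text copied from either kind of machine parses the same way.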
For the record the new list is here: https://github.com/piwik/referrer-spam-blacklist
Should the improved handling also discount spam visits retroactively?
@openjck no it will not remove referrer spammers from historical data
Is there a way/command to remove referrers from historical data? (Maybe a "rebuild"?)
Since we will have a long dev cycle for 3.0.0 I reckon we need to provide users a solution to have constantly auto-updated spammers and really leverage our referrer spammer list. Moving to 2.14.0
:+1: makes sense
Note: we can't easily store the file on disk (not ideal to store in `tmp/` as it can be flushed). So I suggest to cache the spammers.txt file in DB
@mattab doing so prevents users from updating manually (in environments without internet access)
@mnapoli I guess one possibility for environments without internet access would be to update the DB from a temporary file on disk (tmp/)? So flushing tmp/ wouldn't be an issue. - Update file - run update script (CLI or GUI triggered)
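A rough sketch of such an update script, in Python with the standard `sqlite3` module standing in for Piwik's real database layer. The table name `referrer_spammers` and the function `import_list` are made up for illustration.

```python
import sqlite3
from pathlib import Path


def import_list(conn: sqlite3.Connection, list_file: Path) -> int:
    """Replace the cached list in the DB with the file's contents; return the row count."""
    hosts = [line.strip() for line in list_file.read_text().splitlines() if line.strip()]
    with conn:  # one transaction: the old list stays intact if anything fails
        conn.execute(
            "CREATE TABLE IF NOT EXISTS referrer_spammers (host TEXT PRIMARY KEY)"
        )
        conn.execute("DELETE FROM referrer_spammers")
        conn.executemany(
            "INSERT OR IGNORE INTO referrer_spammers (host) VALUES (?)",
            [(host,) for host in hosts],
        )
    return conn.execute("SELECT COUNT(*) FROM referrer_spammers").fetchone()[0]
```

Once the import has run, the file in tmp/ is no longer needed, so flushing that directory would not lose the list.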
@futureweb it would require more effort to implement, and would be less practical to use (it requires SSH access, or logging into Piwik, instead of just dropping a file through FTP), but that's still a better solution than nothing so I guess we could do that.