@mnapoli opened this issue on April 14th 2015

In #5099, many people regularly report new spammers (who spam Piwik with fake visits). We need to find a more scalable solution, as this is becoming a real problem.

Goals:

- make it easier for users to report new spammers
- Piwik should auto-update the spammers list (every day or something)
- the list should be kept up to date in future releases (for the Piwik installs that are set up to avoid any external network call)
- optional: share that spammers list with the world as open data?

Ideas:

- store the spammers list in a new GitHub repository in a JSON file (or YAML or whatever)
- users can report new spammers with issues and pull requests (later we can create a better UI/website for that)
- register that package on Packagist:
  - Piwik requires that package, which means the list is bundled in Piwik's releases (no first download required)
  - any other project can use the list by cloning the git repository or requiring it in Packagist
- Piwik would download new versions of that list in tmp/ every day or week -> the version in tmp/ would override the one installed in vendor/
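A minimal sketch of the "download a new version every day or week" idea above, written in Python for illustration only (Piwik itself is PHP). The function name, the weekly interval, and the use of the file's modification time are all assumptions, not Piwik's actual mechanism.

```python
import os
import time

# Assumed refresh interval: one week (the thread mentions "every day or week").
UPDATE_INTERVAL_SECONDS = 7 * 24 * 3600


def needs_refresh(path, now=None, max_age=UPDATE_INTERVAL_SECONDS):
    """Return True when the downloaded list is missing or older than max_age.

    `path` is a hypothetical location of the downloaded copy of the list.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No downloaded copy yet: a refresh is due.
        return True
    return (now - mtime) > max_age
```

A scheduled task could call `needs_refresh()` and only contact the network when it returns `True`, so installs without internet access simply keep using the bundled list.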

I'm not too sure yet about the Packagist part (it's not a PHP package, and it would require running composer update before releases) but using submodules is definitely a no-go…

@Globulopolis commented on April 14th 2015

We need some UI in Piwik to:

1. Manage the list of URLs.
2. Update list via one click.

That is a more technical step, but Piwik must support regexes in these URLs.

At the very least, when the list is updated, Piwik should check for a user list and merge it in if it contains different entries.
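The merge step described above could look roughly like this. It is an illustrative Python sketch; `merge_lists` is a hypothetical helper, and treating both lists as plain collections of domain strings is an assumption.

```python
def merge_lists(updated_global, user_list):
    """Merge a freshly updated global list with a local user list.

    Returns a tuple: (sorted union of both lists,
                      sorted entries that only the user list contained).
    """
    global_set = set(updated_global)
    user_only = set(user_list) - global_set
    return sorted(global_set | user_only), sorted(user_only)
```

Returning the user-only entries separately makes it easy to see which local additions have not reached the global list yet.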

@gaumondp commented on April 14th 2015

And please, make it "low tech" so people whose servers can't reach the internet can "copy-paste & save" the content of a file to upgrade the list...

@AgentGod commented on April 14th 2015

So it seems the easiest way from a user standpoint would be to keep the spammer list in an external file that is updated on a daily basis. In the Piwik admin panel, there would be a "report as spam" button beside each referrer URL. Each report would be sent to a verification list, and a URL that receives several reports would be moved from the verification list to the main list.
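A rough sketch of that verification flow, in Python for illustration. The class name, the default threshold of three reports, and counting each reporter only once are assumptions; the comment above only says "several reports".

```python
from collections import defaultdict


class ReportQueue:
    """Hypothetical verification list: promote a domain after enough reports."""

    def __init__(self, threshold=3):
        self.threshold = threshold        # assumed number of distinct reporters
        self.pending = defaultdict(set)   # domain -> ids of reporters so far
        self.main_list = set()            # domains confirmed as spammers

    def report(self, domain, reporter_id):
        """Record one report; return True once the domain is on the main list."""
        domain = domain.strip().lower()
        if domain in self.main_list:
            return True
        self.pending[domain].add(reporter_id)
        if len(self.pending[domain]) >= self.threshold:
            # Enough independent reports: move to the main list.
            self.main_list.add(domain)
            del self.pending[domain]
            return True
        return False
```

Counting distinct reporters rather than raw reports keeps one installation from promoting a domain on its own.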

@futureweb commented on April 14th 2015

I guess it would make sense to have an (auto-updated) "general list" and something user/installation-based, as European users often see different referrer spam than US, Chinese, ... users. I use Piwik for about 500-600 different sites, and referrer spam often differs from site to site (of course some spammers are the same on nearly all sites).

So one could add specific entries only for one's own instance of Piwik, without having to wait for the general list to be updated.

@mnapoli commented on April 14th 2015

A custom user list seems like a good idea, but I don't think it's beneficial for everybody in the long run: spammers are spammers for everybody. If people do not report them because they can flag them in their user list, then the interest in the global list is gone.

Maybe we should approach the problem the other way around: when admins report spammers, the spammers are added to the custom list. That way they don't have to wait for the spammer to be added to the official list, but it still means that users will report spammers and not simply create a custom list.

However we may want to start with a simpler goal at first (one where there is no UI to report spammers, and no way to have a custom list).

@gaumondp The list should be updated on each Piwik update, like now. I'm not sure letting users manually update the list is that necessary. That should be enough for a start for those installs that don't have internet access, especially since those might not be the target of spammers (since they don't have internet access).

@futureweb commented on April 14th 2015

I like the idea that reporting spammers to the global list also adds them to the custom list ... it prevents bad reporting behaviour, as you say. ~~But updating the global list only on Piwik updates is not frequent enough, as spammers will alter their domains faster when they know they are banned ... I would suggest updating the global list on a regular basis, like the GeoCity database, as AgentGod already posted.~~ (already stated in your initial post)

@gaumondp commented on April 15th 2015

@mnapoli I don't want to make my case the rule, but I know a few people with big installations and very rigid environments/infrastructure who can't keep up with Piwik's fast release cycle.

In fact, I usually update 4 times a year. So we're often 3 releases behind at update time. I don't think I'm alone though.

@mnapoli commented on April 15th 2015

@gaumondp and those setups cannot use auto-update of the list?

@gaumondp commented on April 15th 2015

Exactly: no auto-update of the spammer list possible, no one-click GeoIP updating, no easy install for stuff at http://plugins.piwik.org/ ...

And considering the size (the DB is at 22 GB here right now), no Piwik update through the web interface is possible. We use the CLI for that.

@AgentGod commented on April 15th 2015

@gaumondp that's why it should be a simple external txt file with the spammer list in it, which can be updated easily through the CLI. Some people will report through the user interface; some with big installs will not. The idea is that before a link ends up in the generally distributed spam list, it is automatically verified from several sources.

@mnapoli commented on April 15th 2015

@gaumondp OK then we can document how to update the latest list, i.e. there will be 2 files:

- built-in list, updated with every Piwik update (this is the one installed by Composer)
- latest list, updated either through auto-update or manually (will probably be in the tmp/ directory)

That doesn't require any additional effort and should address all use cases. Then once that is done we can discuss how to let users update manually through the UI if that's really necessary.
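The two-file arrangement can be sketched as a simple precedence rule. This is an illustrative Python sketch; `resolve_list_path` and both path parameters are hypothetical, and falling back when the latest file is empty is an added assumption.

```python
import os


def resolve_list_path(builtin_path, latest_path):
    """Pick which spammer list file to read.

    Prefer the latest list (auto-updated or dropped in manually) when it
    exists and is not empty; otherwise fall back to the built-in list that
    ships with each release.
    """
    if os.path.isfile(latest_path) and os.path.getsize(latest_path) > 0:
        return latest_path
    return builtin_path
```

With this rule, an install without internet access can be updated by copying a newer file to the latest-list location, and deleting that file safely reverts to the bundled list.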

@gaumondp commented on April 16th 2015

@mnapoli, I'm just giving information and use cases about a few environments and scenarios I know about that you maybe don't see often. I'm not "requiring" stuff. :) I'm just good at being devil's advocate.

I'm not sure about saving the list in the /tmp/ directory though. In my view, everything in /tmp/ should be something that gets "auto-generated" again once the directory is emptied. Tell me if I'm wrong about this! But you know Piwik internals better than me for sure, so I trust you about where to store such a file.

@mnapoli commented on April 16th 2015

Listing the different use cases is appreciated; I for sure don't have a clear overview of all of them. In this case there is no additional effort, so I don't see any issue ;)

Regarding the directory, maybe somebody else can chime in on this but I'm afraid we need a folder with write access.

@gaumondp commented on April 16th 2015

Maybe a new table in Piwik, "feedable" from the text file or from a future web interface, with a simple "each line is a spammer" format so it's copy-paste enabled?
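A small sketch of reading such an "each line is a spammer" text, in Python for illustration. The function name is hypothetical, and lowercasing entries and dropping duplicates are assumptions about how pasted input might be cleaned up.

```python
def parse_spammer_text(text):
    """Turn pasted text (one spammer per line) into a clean list.

    Blank lines are skipped, surrounding whitespace is trimmed, entries are
    lowercased, and duplicates are dropped while keeping first-seen order.
    """
    domains = []
    seen = set()
    for line in text.splitlines():
        entry = line.strip().lower()
        if not entry or entry in seen:
            continue
        seen.add(entry)
        domains.append(entry)
    return domains
```

The same parser would work for a file dropped on disk or for text pasted into a web form.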

Or, if the GeoIP database is in /misc/, maybe it makes sense to use that directory instead of /tmp/ if you don't want an additional table in the DB?

@mnapoli commented on April 19th 2015

For the record the new list is here: https://github.com/piwik/referrer-spam-blacklist

@openjck commented on April 29th 2015

Should the improved handling also discount spam visits retroactively?

@mattab commented on April 29th 2015

@openjck no, it will not remove referrer spammers from historical data.

@pedrosanchezpernia commented on May 1st 2015

Is there a way/command to remove referrers from historical data? (Maybe a "rebuild"?)

@mattab commented on June 12th 2015

Since we will have a long dev cycle for 3.0.0, I reckon we need to provide users with a solution that keeps the spammer list constantly auto-updated and really leverages our referrer spammer list. Moving to 2.14.0

@mnapoli commented on June 12th 2015

:+1: makes sense

@mattab commented on June 20th 2015

Note: we can't easily store the file on disk (not ideal to store it in tmp/, as it can be flushed), so I suggest caching the spammers.txt file in the DB option table.

@mnapoli commented on June 21st 2015

@mattab doing so prevents users from updating manually (in environments without internet access)

@futureweb commented on June 21st 2015

@mnapoli I guess one possibility for environments without internet access would be to update the DB from a temporary file on disk (tmp/)? So flushing tmp/ wouldn't be an issue.

- update the file
- run an update script (CLI or GUI triggered)
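The "update the file, then run an update script" flow could be sketched like this. It is a Python illustration that uses an in-memory SQLite table as a stand-in for the DB option table; the table name `demo_option`, its columns, and the option key are hypothetical and do not describe Piwik's real schema.

```python
import sqlite3

OPTION_NAME = "referrer_spam_list"  # hypothetical option key


def create_option_table(conn):
    """Create a stand-in key/value table (not Piwik's real option table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_option "
        "(option_name TEXT PRIMARY KEY, option_value TEXT)"
    )


def import_list_file(conn, path):
    """Copy the content of a list file on disk into the option table."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    conn.execute(
        "INSERT OR REPLACE INTO demo_option (option_name, option_value) "
        "VALUES (?, ?)",
        (OPTION_NAME, content),
    )
    conn.commit()


def load_list(conn):
    """Read the cached list back from the table, one entry per line."""
    row = conn.execute(
        "SELECT option_value FROM demo_option WHERE option_name = ?",
        (OPTION_NAME,),
    ).fetchone()
    if row is None:
        return []
    return [line.strip() for line in row[0].splitlines() if line.strip()]
```

Once `import_list_file()` has run, the file in tmp/ is no longer needed, so flushing that directory does not lose the list.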

@mnapoli commented on June 21st 2015

@futureweb it would require more effort to implement, and would be less practical to use (it requires SSH access, or logging in to Piwik, instead of just dropping a file through FTP), but that's still a better solution than nothing, so I guess we could do that.

@mattab commented on June 22nd 2015

- when users upgrade Piwik to the latest version, they will get the latest version of the referrer spammer list.
  - on the release checklist, before releasing a new stable version, we will tag a new version of the spammer list and update composer.lock to use the latest
- additionally, to get the latest spammer list, Piwik users who have access to the internet will receive the latest file (proposed: update once a week)

@mnapoli commented on June 23rd 2015

PR: #8186

This issue was closed on June 25th 2015