@mnapoli opened this Pull Request on June 23rd 2015 Member

Fixes #7674

Auto-update the list from piwik/referrer-spam-blacklist (full URL is https://raw.githubusercontent.com/piwik/referrer-spam-blacklist/master/spammers.txt).

The up-to-date list is stored serialized in the option table. If it doesn't exist, the one in vendor/ is used.

I also added the possibility to run a specific scheduled task, which is pretty useful to test it:

./console scheduled-tasks:run "Piwik\Plugins\CoreAdminHome\Tasks.updateSpammerBlacklist"
@gaumondp commented on June 23rd 2015

Do I understand it right that if my Piwik server can't use the internet I just have to manually copy the list from https://raw.githubusercontent.com/piwik/referrer-spam-blacklist/master/spammers.txt to /vendor/piwik/referrer-spam-blacklist/spammers.txt and then run ./console scheduled-tasks:run "Piwik\Plugins\CoreAdminHome\Tasks.updateSpammerBlacklist" and then magic will happen ? ;)

@mnapoli commented on June 23rd 2015 Member

Not with this implementation: the auto-updated list is written in the database (and overrides the file spammers.txt).

You could overwrite the file in vendor yes, but that would be overwritten on update (which should be fine since a new Piwik release should have the latest version of the list). You don't have any command to run in that case.

@quba commented on June 23rd 2015 Contributor

Is it cached or each tracking request selects this list from piwik_option table?

@mnapoli commented on June 23rd 2015 Member

Good point, currently it isn't cached, maybe it should or is it negligible?

@mattab commented on June 23rd 2015 Owner

Good point, currently it isn't cached, maybe it should or is it negligible?

currently in Tracking API requests we only query log_* tables and not any other table (the round trip to mysql and query runtime are not negligible when you get eg. 500 requests per second.). DB performance would be affected and great tracking API performance is a big part of what makes Piwik scalable. we cache all information from the DB during tracker in the tracker cache.

Ultimately we will need to have performance regression tests to ensure we don't regress performance in Tracking API or other key features of Piwik #7889

+1 to add caching and looks good to merge!

@mnapoli commented on June 24th 2015 Member

I've added a cache, but I was quite confused by the Tracker\Cache class. I ended up using Piwik\Cache::getLazyCache(). Please advise if there's a better solution.

@sgiehl commented on June 24th 2015 Member

I guess getEagerCache would be the right choice, as the lazyCache invalidates more often.
@tsteur correct?

@mnapoli commented on June 24th 2015 Member

I was afraid that the eager cache would load the list (which can be huge, especially in the future) on every single Piwik request/process?

@sgiehl commented on June 24th 2015 Member

It already does, as the DeviceDetector uses the eager cache. See https://github.com/piwik/piwik/commit/2d2b8df3c4d2fbfde92af76ffbda26b0cb2bcaa7

@mnapoli commented on June 24th 2015 Member

:+1: thanks for the link, I'll update to use the eager cache

@mattab commented on June 25th 2015 Owner

Looks good :+1: It's awesome to know all piwik users with at least 2.14.0 will have an always up to date spam filter. This will make it much more efficient for all of us to fight referrer spammers.

Anyone reading: feel free to join the fun at: https://github.com/piwik/referrer-spam-blacklist/

This Pull Request was closed on June 25th 2015
Powered by GitHub Issue Mirror