@diosmosis opened this issue on March 6th 2015

Refs #7181

Purpose

There are two purposes behind the refactoring in this pull request: - clarity: to increase the ability for developers to understand the purging system, both in terms of how it is used and how it works. This involves better scoped classes, better documentation/comments and better class names. - resilience: to increase resilience of the system, by decoupling intra-system dependencies and providing coping methods for eventual failures.

Description of Purging System

Here is a description of the refactored purging system:

   Users               +---------------------+             +---------------------------------+          +-------------+
+-------------+        |                     |    adds to  |                                 | consumes |             |
| API         |  calls | Archive\Invalidator +------+------> SitesToReprocessDistributedList <----------+ CronArchive |
|  +          +-------->                     |      |      |     (holds site IDs)            |          |             |
|  |          |        +---------------------+      |      +----------------+----------------+          +-------------+
|  +          |                   v -updates        |                       v                                          
| Archive.php |                archive              |                    option                                        
|  +          |                  table              |                      table                                       
|  |          |                   ^ -deletes        |                       ^                                          
|  +          |  calls +---------------------+      |      +----------------+----------------+          +------------------------+
| Commands    +-------->                     |      +------>                                 | consumes |                        |
+-------------+        | Archive\Purger      |             | ArchivesToPurgeDistributedList  <----------+ Purging Scheduled Task |
                       |                     +             |     (holds year months)         |          |                        |
                       +----------^----------+             +---------------------------------+          +--------+---------------+
                                  |                                     uses                                     |
                                  +-------------------------------------------------------------------------------

Users interact primarily w/ Archive\Invalidator & Archive\Purger services. - Archive\Invalidator: - invalidate: when archives are invalidated, the invalidator changes the archive done flags to ArchiveWriter::DONE_INVALIDATED. Then it adds the archive's site to the SitesToReprocess list and the archive's table-month to the ArchivesToPurge list. - invalidate later: adds information identifying an archive to a distributed list (not pictured above) which is later read and its elements sent to the invalidate operation. Note: I did not modify this part of the code. - Archive\Purger: - purge invalidated: when purging invalidated archives is requested, the purger will look through the requested table (an archive table must be supplied) for invalidated archives that have newer, valid archives, and deletes them. Before doing this (which depends on joining the table w/ itself), a query is issued to find all site IDs in the table for invalidated archives. This is done only for performance, so the INNER JOIN query does not look through every row in the table. valid archives are created after a site is reprocessed by CronArchive. - purge others: other archive purging (temporary + custom ranges) do not depend on distributed lists. they just issue queries and delete archives. - CronArchive: - when cron archiving is run, the SitesToReprocessDistributedList is read. Every site referenced by the list is reprocessed, which should create new valid archives, thus allowing the purger to purge the invalidated archives.

** Purging Scheduled Task - the daily scheduled task that purges invalidated archives will read from the ArchivesToPurgeDistributedList, and purge invalidated archives from the tables indicated. Since there can be many archive tables, and most of them won't have invalidated archives, we depend on the distributed list for efficiency. That is the sole reason for the list's existence and use.

List of changes 1. Made ArchivePurger + ArchiveInvalidator services and moved them out of core\DataAccess. Since they access & manipulate different kinds of data in different contexts, they are not simple DAOs. 2. Moved purge skipping logic to scheduled task for purging, since it is only required there. This allows other code to force a purge if necessary. 3. Added command core:purge-old-archive-data so users can force archive purging. 4. Decouple invalidated archive purging from site entity. Originally, site IDs of invalidated archives were stored in the ArchivesToPurge list, now, instead of depending on values in the option table, we do a select on the archive table to find the sites first. Then we do the expensive query using the site IDs. 5. Decouple ArchivePurger from the ArchivesToPurgeDistributedList, so the class is not itself aware of the distributed list. Nor is it required for the list to be used when purging archive data. 6. Split InvalidatedReports class into three classes (one is a common base type). The InvalidatedReports class was used to mange two distributed lists stored in the option table. The refactor splits them up into two classes for each list (ArchivesToPurgeDistributedList & SitesToReprocessDistributedList) + a common base type for a simple distributed list (core\Concurrency\DistributedList.php). - List classes were moved to the systems that were their primary consumers. SitesToReprocessDistributedList was moved to core\CronArchive & ArchivesToPurgeDistributedList to CoreAdminHome plugin for the scheduled task. - The base type DistributedList does exactly what InvalidatedReports did before, nothing more or less. It does not introduce more logic (save one method, removeByIndex). 7. Remove custom range archive purging from outdated archive purging, since they are not the same thing. 8. Added lots of tests.

Also includes this change (for testing the command by injecting a mock command into core\Console.php): a73225f

Resilience

Resilience of the system is described in three ways, how many ways each part of the system can fail, how many ways for the users to cope if a part of the system fails, and how easy it is to debug potential bugs. - Archive\Invalidator - failures: can fail if hit by race condition when adding to distributed lists.

On failure to add to SitesToReprocess, user can run core:archive for the site, HOWEVER, there may be no easy way for the user to find which sites require this. A query on tables for invalidated rows w/o new data & w/o site entries in distributed list is required.

On failure to add to ArchivesToPurge, user can purge manually using new command, HOWEVER, there may be no easy way for the user to find which archive tables have invalided archives. Purging however is not as big an issue, since the user will notice table size problems, and use the command to deal w/ them.

- to debug: there is no way to debug the code easily at this point. Logging can be added, since invalidation does quite a lot of different stuff, but figuring out the specific type of logging is non-trivial. How can we find out what went wrong, BEFORE we tell a user to enable debug logging? - Archive\Purger - failures: can fail if it deletes more than it should.

On failure to delete correct archives, user can run core:archive, if the log data remains. If log data does not remain, there is no way for the system to cope.

- to debug: also no way to debug the code easily. Need to add logging. - Purging Scheduled Task - failures: can fail if it deletes more than it should and UI shows 'no data' or when, and can fail if scheduled tasks are not run.

On failure to delete correct archives, user can run core:archive. No immediate action in the UI, though.

On failure of task scheduler, user can manually purge to deal w/ table sizes.

- to debug: First failure can be debugged in same way as Archive\Purger failure. Second failure cannot be debugged due to current task scheduler system.

Possible future TODO: - We could add a diagnostic command to look for invalidated archives that do not have new data. - We might want to make it easy to backup archive tables, since they contain the useful report data. Perhaps this could augment the log purging PrivacyManager feature. - The ability to re-process an individual report through the UI that forces archiving for a SINGLE report (regardless of whether browser archiving is enabled), can help w/ any 'no data for this report' inconsistencies.

@piwik/core-team Ready for review.

@mnapoli commented on March 8th 2015

This is awesome, especially all the explanations :)

@mnapoli commented on March 8th 2015

Thinking about your schema again, would it make sense to encapsulate the lists (SitesToReprocessDistributedList and ArchivesToPurgeDistributedList) behind Archive\Invalidator and Archive\Purger? That would simplify the interactions between classes (everything would just use Invalidator/Purger), yet still let Invalidator and Purger ignorant of how those lists are stored…

Also that's probably a matter of preferences but Invalidator and Purger are not really explicit class names IMO, ArchiveInvalidator and ArchivePurger is more explicit. Yes it's redundant with the namespace, but take new Invalidator() for example: the namespace never appear anywhere so you don't understand what it's invalidating/purging. It's a bit like Plugin\Manager, we always alias it to PluginManager because Manager itself doesn't mean anything. Anyway, just opening the discussion, I don't have strong opinions on this.

@diosmosis commented on March 9th 2015

Thinking about your schema again, would it make sense to encapsulate the lists (SitesToReprocessDistributedList and ArchivesToPurgeDistributedList) behind Archive\Invalidator and Archive\Purger?

I was thinking about this too, it would be better to only interact w/ "services" (not really the right word since Piwik isn't a SOA), but since Piwik isn't currently organized around these "service" classes (nor does it necessarily have them), I think it would decrease the clarity of the system.

The SitesToReprocess list, for example, is more of a detail of CronArchive, but CronArchive isn't a service, nor can it currently be defined as one.

And the ArchivesToPurge list is just used for the scheduled task, so I think it would be a mistake to couple it w/ a core service. If Piwik had a real SOA, then the archive purging system would perform the scheduled purging itself, w/o introducing dependencies on a plugin or another system in core (ie, "task scheduling").

If it were possible for service classes to handle events, then what we could do is have an event for when archives are invalidated, and have the 'CronArchive' service (which doesn't currently exist) add the site to the SitesToReprocess list, and have the Archive\Purger add the table-month to its list when that event is fired. But, this isn't possible.

For now, I think having the classes be explicitly used minimizes coupling and increases clarity about how the, still very inter-dependent, system actually works. However, if you can find other ways to simplify the system, or think I'm missing something, please let me know ;)

Btw, on a slightly related note, what's your opinion on making any class in Piwik potentially able to listen to events? maybe by implementing an interface and storing the object in DI...

Also that's probably a matter of preferences but Invalidator and Purger are not really explicit class names IMO, ArchiveInvalidator and ArchivePurger is more explicit.

Have no issue w/ this, I'll rename them.

@mnapoli commented on March 9th 2015

Btw, on a slightly related note, what's your opinion on making any class in Piwik potentially able to listen to events? maybe by implementing an interface and storing the object in DI...

You mean like in Symfony you write an EventListener that declares which events it listens to, and you register it in the container and that's it? So that would be an alternative to listening to events in the Plugin class right? If that's what you mean then yes I think that's a good idea.

However the interface wouldn't be good enough for all this to work unfortunately (for technical reasons like the fact that you can register services using a factory/closure so you can't know the interface of the object before resolving all the definitions). It would have to be done by registering it explicitly (like in Symfony with tags) e.g.:

// plugin config
return [
    'event.listeners' => DI\add([
        DI\link('Piwik\Plugin\Foo\RequestEventListener')
    ]),
];
@diosmosis commented on March 10th 2015

However the interface wouldn't be good enough for all this to work unfortunately (for technical reasons like the fact that you can register services using a factory/closure so you can't know the interface of the object before resolving all the definitions).

That makes it less useful, but it could still be used to reduce coupling...

@mnapoli commented on March 10th 2015

That makes it less useful, but it could still be used to reduce coupling...

I'm missing the point? What I'm saying is that detecting entries based on an interface is extremely limited to the point where it's probably useless. I wouldn't bother going that way. Explicitly registering classes as event listeners would be necessary.

@diosmosis commented on March 10th 2015

I'm missing the point? What I'm saying is that detecting entries based on an interface is extremely limited to the point where it's probably useless. I wouldn't bother going that way. Explicitly registering classes as event listeners would be necessary.

I know, I just don't really like having to register them, but I can't see another way to go. Should still work, I think, though would need DI to be available in more places, I think?

@diosmosis commented on March 10th 2015

@mattab Ready to merge.

@mattab commented on March 11th 2015

Ok, I'm going to review it shortly

@mattab commented on March 11th 2015

I'm happy about this refactor: it cleans up some messy code and makes this process now very clear.

I have some feedback: - overall: looks good to me! - can you add the new command to the changelog? - in the test functions in PurgeOldArchiveDataTest and in PurgerTest can you add some asserts that assert that the state is expected before the console commands executed (it makes the test easier to read as well as 'stronger' i'm sure you see what i mean) - similarly i'd add something like self::$fixture->assertInvalidatedArchivesNotPurged($this->january); in test_purgeInvalidatedArchives_PurgesCorrectInvalidatedArchives_AndOnlyPurgesDataForDatesAndSites_InInvalidatedReportsDistributedList - in purgeOutdatedArchives you deconstructed the method Rules::isRequestAuthorizedToArchive() and copied the code, but it should be using the method

@diosmosis commented on March 11th 2015

in purgeOutdatedArchives you deconstructed the method Rules::isRequestAuthorizedToArchive() and copied the code, but it should be using the method

It should not use the method, since using the method will make the code confusing. It doesn't matter if the request 'is authorized to archive', what matters is 'will purging more data than expected cause problems for the UI'. Using the method will lead to confusion where developers may think that somehow archive purging requires archiving, which is not the case. Also, the wording of the method, ie, 'authorize' may result in ideas about user permissions, which will further increase confusion.

@mattab commented on March 11th 2015

what matters is 'will purging more data than expected cause problems for the UI'.

Maybe we create a method like Rules::willPurgeDataCauseProblemsForUI (or better named) that would then call the isRequestAuthorizedToArchive as implementation detail? I don't know 100% if this is good idea but I'm afraid of duplicating such code. Maybe it's ok though.

@diosmosis commented on March 11th 2015

Maybe we create a method like Rules::willPurgeDataCauseProblemsForUI (or better named) that would then call the isRequestAuthorizedToArchive as implementation detail? I don't know 100% if this is good idea but I'm afraid of duplicating such code. Maybe it's ok though.

This is an acceptable compromise to me. We should be careful though. Updates to isRequestAuthorizedToArchive() may not be applicable to willPurgeDataCauseProblemsForUI() so we may accidentally introduce regressions in the future. And I'm not sure how to effectively test this logic since it depends on the overall system (ie core:archive/browser archiving/whatever else in the future might participate/interfere).

@mattab commented on March 11th 2015

we may accidentally introduce regressions in the future

a comment in the function isRequestAuthorizedToArchive could help prevent such regression, as well as the fact that when we change a method implementation, all of us do / should check the call tree and verify that it makes sense / is correct for all callers of the method to be changed

@mattab commented on March 13th 2015

Looks good!

I'll release a beta and notify users to see if that fixes the issue #7181

This issue was closed on March 13th 2015
Powered by GitHub Issue Mirror