@mgazdzik opened this issue on April 1st 2015

It would be a great feature to be able to archive all sites while sharing the idsites list across multiple threads. Currently, when we force archiving for all sites using --force-all-websites, the process holds the list of sites internally, which makes it impossible to share this sequence with other threads. This could be an additional option such as --share-idsites-sequence. It would also apply when forcing specific idsites, so that they are split across all threads.
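To illustrate the idea, here is a minimal Python sketch of several workers draining one shared sequence of site IDs. It is not the archiver's actual code: `archive_site`, `run_shared_archiving` and the in-memory queue are hypothetical stand-ins. In practice the archiver runs as separate processes, so the shared sequence would presumably have to live in shared storage (for example the database) rather than in memory.

```python
import queue
import threading


def archive_site(idsite):
    # Hypothetical stand-in for the real per-site archiving work.
    return f"archived site {idsite}"


def run_shared_archiving(idsites, num_workers=3):
    """Let several workers take site IDs from one shared queue."""
    shared = queue.Queue()
    for idsite in idsites:
        shared.put(idsite)

    processed = {}  # idsite -> name of the worker that handled it
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                idsite = shared.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this worker
            archive_site(idsite)
            with lock:
                processed[idsite] = name

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed
```

Each site is taken by exactly one worker, and a worker that finishes a small site early simply picks up the next one, so no thread sits idle while others still have a backlog.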

Another benefit would be reducing the time spent computing which sites:

- have been invalidated
- had new visits since the last archiving
- were last archived on a different day
- should be re-processed for some other reason

This especially affects big instances that are archived once per day, where we want to archive all sites anyway. In our observations, establishing the list of idsites can currently take 15 to 60 minutes. Also, maybe the Cron archiver could be smart enough to detect the situation where we want to process all sites anyway (e.g. archiving was 24hrs ago, or so?) so we don't need to play with params?
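The checks listed above, together with the "process everything if the last full run was about 24 hours ago" idea, could be sketched roughly like this in Python. The `SiteState` fields, the function names and the 24-hour threshold are all assumptions made for illustration; they do not describe how the archiver stores or evaluates this today.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SiteState:
    # Hypothetical per-site bookkeeping, not the real schema.
    idsite: int
    invalidated: bool
    new_visits_since_last_archive: int
    last_archive: datetime


def needs_archiving(site, now):
    """Per-site checks: invalidated, new visits, or archived on another day."""
    return (
        site.invalidated
        or site.new_visits_since_last_archive > 0
        or site.last_archive.date() != now.date()
    )


def select_sites(sites, now, last_full_run, full_run_interval=timedelta(hours=24)):
    """Skip the per-site checks when a full run is due anyway."""
    if now - last_full_run >= full_run_interval:
        return [s.idsite for s in sites]
    return [s.idsite for s in sites if needs_archiving(s, now)]
```

The point of the shortcut is that when every site will be archived anyway, the per-site checks are pure overhead and can be skipped.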

Please let me know your thoughts

@mattab commented on April 8th 2015

Hi @mgazdzik

if we can make the archive command smarter so that we wouldn't even need the new parameter, it would be really great to go this direction :+1:

> Also, maybe the Cron archiver could be smart enough to detect the situation where we want to process all sites anyway (e.g. archiving was 24hrs ago, or so?) so we don't need to play with params?

sounds interesting, can you explain in which situations you have to add the parameter --force-all-websites currently? maybe you also have other situations where you need to manually add any other of those --force-* parameters?

if we can list each such use case, we can brainstorm how to improve the archive console command so that it is smart and always archives the data when it is expected to.

@mgazdzik commented on April 8th 2015

hi @mattab, currently we don't use force-all-websites apart from some manual archiving runs, or on rare occasions when some mechanisms don't work for some reason (e.g. processing all sites after midnight). One of the reasons is that we cannot attach multiple threads to a single archiving queue of all sites.

Having this param could also help us manage archiving more flexibly. Consider the case where we have lots of sites and the initial checks take 1-2 hours. It is of course possible to force the idsites to process, but that's not really an option when an instance has 30k+ websites. Also, there's currently no option to run multi-threaded archiving for all sites, even though that is probably the heaviest archiving possible. Changing the default behaviour to share idsites wouldn't be good either, as we may also have the following case:

- we force the 1-2 biggest sites in a single thread
- those can be contained within that thread
- we exclude those biggest sites from the 'main thread' of archiving
- they should be shared across other threads
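One reading of the case above is a simple partition: the biggest sites go to a dedicated thread, and everything else forms the sequence shared by the remaining threads. A small Python sketch under that assumption (`split_idsites` is a hypothetical helper, not an existing function):

```python
def split_idsites(all_idsites, dedicated_idsites):
    """Separate sites handled by a dedicated thread from the shared rest."""
    dedicated_set = set(dedicated_idsites)
    # Keep the original ordering in both lists.
    dedicated = [i for i in all_idsites if i in dedicated_set]
    shared = [i for i in all_idsites if i not in dedicated_set]
    return dedicated, shared
```

The first list would be forced in a single run, while the second would be the input to the runs that share the idsites sequence.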

So as you can see, there are at least two cases where we could manage archiving better and split the work to be done using just params. Please let me know if I should elaborate on the described use cases.

Also, do you think it would be possible to move this param to the 2.13.0 milestone?

@mattab commented on April 10th 2015

it makes sense @mgazdzik - moved this needed request to 2.13.0 :+1:

@mnapoli commented on April 15th 2015

See a pull request here: #7682

@mnapoli commented on April 17th 2015

#7682 has been merged.

This issue was closed on April 17th 2015