This is especially a problem for Log Importer and QueuedTracking, but can happen with normal tracking as well. It's hard to explain but I will try :)
It's a problem when a user logs in and turns from a visitor into a user, or when a user logs out and becomes a visitor again. It is a problem when the requests are not inserted in exactly the same order as they were sent.
Imagine the following tracking requests:
1: http://apache.piwik/piwik.php?action_name=foo&_id=visitorId&idsite=1 // visitor
2: http://apache.piwik/piwik.php?action_name=bar&_id=visitorId&idsite=1 // visitor pageview
3: http://apache.piwik/piwik.php?action_name=foo&_id=visitorId&idsite=1&uid=5 // logs in
We will create a new visit for tracking request 1. So far so good. If then for some reason request 3 is processed before request 2, a second visit will be created. Why? When a userId is detected, we use the `uid` as visitorId and we overwrite the `idvisitor` of all past visits (in this case of request 1). This means that when the second tracking request is executed, it won't find an existing `idvisitor`, as request 2 carries no `uid` and its original visitor id no longer exists there, and it will create a new visit.
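To make the ordering effect concrete, here is a minimal in-memory sketch of the behaviour described above. It is not the actual tracker code: `uid_to_visitor_id`, `count_visits` and the dictionary layout are hypothetical names chosen for illustration, and visit timeouts are ignored.

```python
import hashlib

def uid_to_visitor_id(uid):
    # Hypothetical derivation of a visitor id from a user id.
    return hashlib.sha1(uid.encode("utf-8")).hexdigest()[:16]

def count_visits(requests):
    """Process requests in the given order and return the number of visits."""
    visits = []  # each visit is {"idvisitor": str, "actions": [str, ...]}
    for req in requests:
        visitor_id = req["_id"]
        if "uid" in req:
            user_based_id = uid_to_visitor_id(req["uid"])
            # Overwrite idvisitor of all past visits of this visitor.
            for visit in visits:
                if visit["idvisitor"] == visitor_id:
                    visit["idvisitor"] = user_based_id
            visitor_id = user_based_id
        # Look for an existing visit with the (possibly user based) id.
        visit = next((v for v in visits if v["idvisitor"] == visitor_id), None)
        if visit is None:
            visit = {"idvisitor": visitor_id, "actions": []}
            visits.append(visit)
        visit["actions"].append(req["action_name"])
    return len(visits)

r1 = {"action_name": "foo", "_id": "visitorId"}
r2 = {"action_name": "bar", "_id": "visitorId"}
r3 = {"action_name": "foo", "_id": "visitorId", "uid": "5"}

print(count_visits([r1, r2, r3]))  # 1 (sent order)
print(count_visits([r1, r3, r2]))  # 2 (request 3 processed before request 2)
```

In the second call, request 2 still looks up `visitorId`, but the only existing visit has already been rewritten to the user based id, so a new visit is created.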
When is this a problem? As mentioned, this is especially a problem when using the log importer or QueuedTracking with multiple workers / recorders. Both split requests into different queues to process them in parallel, see: https://github.com/piwik/piwik-log-analytics/blob/master/import_logs.py#L1642-L1651 and https://github.com/piwik/plugin-QueuedTracking/blob/multi_test/Queue/Manager.php#L161-L177 . This means once a `uid` is set, a request might go into a different queue than the one without `uid`, and they are likely to be processed in a different order.
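A rough sketch of why the queue can change once a `uid` appears. This is an assumption about the general shape of such routing (hash a per-request key, take it modulo the number of queues), not a copy of the linked code; `routing_key` and `queue_for` are hypothetical names.

```python
import hashlib

def routing_key(req):
    # Assumption: route by uid when present, otherwise by visitor id.
    return req.get("uid") or req["_id"]

def queue_for(req, num_queues):
    # Stable hash of the routing key, mapped onto one of the queues.
    digest = hashlib.md5(routing_key(req).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_queues

r1 = {"action_name": "foo", "_id": "visitorId"}
r2 = {"action_name": "bar", "_id": "visitorId"}
r3 = {"action_name": "foo", "_id": "visitorId", "uid": "5"}

for name, req in (("1", r1), ("2", r2), ("3", r3)):
    print("request", name, "-> key", routing_key(req), "-> queue", queue_for(req, 4))
```

Requests 1 and 2 share a key and therefore always share a queue. Request 3 has a different key, so it may end up in another queue that is drained by a different worker, and nothing guarantees it is processed after request 2.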
The same problem can occur if someone has, for example, multiple PHP nodes with load balancing etc., but it is less likely and, realistically, only one request would be affected and all following requests would be fine. Still, it can create one additional visit.
(here are notes from Slack discussion):
Current analysis is that:
- we may want to change radically the behaviour of user_id tracking and revert a technical decision made during initial implementation of user ID ( #3490 #6169 )
- we cannot offer both behaviors (eg. via INI setting) as it would be too complicated, therefore we want to make a product decision, and understand and document the behavior
- what we want to change is that instead of setting the `visitor_id` as a hash of the `user_id`, we would leave the `visitor_id` as it is from the first party cookie
- this would complement #6954 #6959 where we decoupled User ID from Visitor ID for the Custom Segment user id
- also in piwik.js and PiwikTracker do not set Visitor id as a hash of user id ie. revert #7167 (note: Android/iOS SDKs + C# client etc. may need to change too)
Making this change means a few important things will be affected
- whenever a user (1) clears cookies or (2) connects via multiple devices simultaneously, currently, sessions opened on each device will be recorded in the same visit
- after the change, each simultaneous visit on separate devices would each create a new visit (but with the same user id)
- the User ID user guide will need to be updated as we change how User id works, especially this part: "How requests with a User ID are tracked > Same user from multiple device use case: [....]"
- if several users connect on the same device, within 30min, the same visit would be re-used and only the latest User Id would be kept in the visit (currently, we create a new visit for each separate user id)
- in the Tracking API, to be friendly to devs who only want to use the user id `uid` and don't want to care about the visitor id `_id`, we'd ensure `_id` defaults to the user id hash so multiple actions for the same user id are still tracked in the same visit
- making the change would include reverting #7167 #6838, which discussed the case with `trust_visitors_cookies=1` -> maybe we could explain how User id would work in the user id user guide
- we would no longer have to update old visitorId's when someone logs in (refs #6313)
- it will be helpful that when working on this we also include "User id Signing out use case" raised & discussed in #7556 (#7368 #7518)
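A minimal sketch of the proposed behaviour from the notes above, under the same simplifications as the original report (in-memory visits, no 30 minute timeout). The names `default_visitor_id` and `track_proposed` are hypothetical: the visitor id from the cookie is left untouched, the user id is stored as an attribute of the visit, and `_id` only defaults to a hash of the user id when the request carries no `_id` at all.

```python
import hashlib

def default_visitor_id(uid):
    # Hypothetical user id hash used only when no _id was sent.
    return hashlib.sha1(uid.encode("utf-8")).hexdigest()[:16]

def track_proposed(requests):
    """Return the list of visits after processing requests in order."""
    visits = []  # each visit: {"idvisitor", "user_id", "actions"}
    for req in requests:
        uid = req.get("uid")
        if "_id" in req:
            visitor_id = req["_id"]  # keep the first party cookie id
        elif uid is not None:
            visitor_id = default_visitor_id(uid)
        else:
            raise ValueError("request needs _id or uid")
        visit = next((v for v in visits if v["idvisitor"] == visitor_id), None)
        if visit is None:
            visit = {"idvisitor": visitor_id, "user_id": None, "actions": []}
            visits.append(visit)
        if uid is not None:
            visit["user_id"] = uid  # only the latest user id is kept
        visit["actions"].append(req["action_name"])
    return visits

r1 = {"action_name": "foo", "_id": "visitorId"}
r2 = {"action_name": "bar", "_id": "visitorId"}
r3 = {"action_name": "foo", "_id": "visitorId", "uid": "5"}

visits = track_proposed([r1, r3, r2])  # out of order on purpose
print(len(visits), visits[0]["user_id"])  # 1 5
```

Because no past `idvisitor` is rewritten, the processing order of requests 2 and 3 no longer changes the number of visits in this model.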
> also in piwik.js and PiwikTracker do not set Visitor id as a hash of user id ie. revert #7167 (note: Android/iOS SDKs + C# client etc. may need to change too)
No issue for the Android SDK. We don't hash/overwrite the visitor-id client side. By default every call to the Tracker contains a user-id (one per app install) and a visitor-id (per app session).
I would really appreciate it if the current implementation gets changed again, as it makes no sense from a non-technical point of view to split a visit when a user signs in or out. In my opinion the user_id should be a piece of information attached to a visitor, allowing us to aggregate visitors (who are, at least to some extent, browsers on devices) into a single user.
See also: "Incorrect browser logged when user switches browsers when using userId" #7785, where a user expected that Piwik would track two visits when the same user uses the website across two devices (currently the clicks on those two devices would appear in the same visit).
The discussion here is "which assumption should be enforced at the time we process incoming data". I'd like to at least propose that we enforce neither at runtime.
In most cases, the assumptions made by Piwik are reasonable... but they have the side-effect of masking the actual, underlying data so they preclude analysis under different assumptions. Another example of this is a prior forum discussion on referrers.
In my proposal, Piwik simply stores all the facts (userid, visitorid, time, referrer, device, etc.) at runtime. We then delegate the various analytical decisions (like the one debated here) to the reporting system. The default reporting system can select one... someone could write a module to display the other. No one is precluded from analyzing the data in the way that makes the most sense to them.
As an added benefit, eliminating runtime analysis should enhance performance. Obviously, there's an offsetting cost when the report is run, but that can be managed any number of ways (and only when it's actually required).
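To illustrate the idea of storing plain facts and deciding at report time, here is a small hypothetical sketch. The hit fields and the two grouping functions are illustrative assumptions, not Piwik's schema or API.

```python
from collections import defaultdict

# Raw facts as stored at tracking time, with no visitor/user merging applied.
hits = [
    {"visitor_id": "a", "user_id": None, "device": "laptop"},
    {"visitor_id": "a", "user_id": "5", "device": "laptop"},
    {"visitor_id": "b", "user_id": "5", "device": "phone"},
]

def group_by_visitor(hits):
    # Report assumption 1: one group per browser/device (visitor id).
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["visitor_id"]].append(hit)
    return dict(groups)

def group_by_user(hits):
    # Report assumption 2: attribute every hit of a visitor id to the
    # user id last seen on that visitor id, if there is one.
    visitor_to_user = {}
    for hit in hits:
        if hit["user_id"] is not None:
            visitor_to_user[hit["visitor_id"]] = hit["user_id"]
    groups = defaultdict(list)
    for hit in hits:
        key = visitor_to_user.get(hit["visitor_id"], hit["visitor_id"])
        groups[key].append(hit)
    return dict(groups)

print(len(group_by_visitor(hits)))  # 2
print(len(group_by_user(hits)))     # 1
```

Both views are computed from the same stored hits, so neither assumption has to be baked in when the data is recorded.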
> In my proposal, Piwik simply stores all the facts (userid, visitorid, time, referrer, device, etc.) at runtime.
> The default reporting system can select one... someone could write a module to display the other
> As an added benefit, eliminating runtime analysis should enhance performance.
Yep, should enhance tracker performance (or at least not make it slower).
> Obviously, there's an offsetting cost when the report is run
Yep, I think this is why it is built the way it is currently. There were some performance tweaks in mind for faster archiving.
Agreed on all this :)
Tentatively moving this to Short term, as it sounds like we should make this change.
Any updates on this guys? The error seems to persist.
Hi, we started testing piwik recently for high traffic, that requires more than 1 worker to make sense, and since we are using UserID it seems piwik is good as dead for us? Or was this issue fixed just someone forgot to write about it? I'm commenting here, since piwik WebUI (plugin settings) points here.. and says that only 1 worker should be run because of this bug = 7691
@saurtar So far this is not fixed AFAIK. It should be an edge case and not the norm but this issue can still occur