@diosmosis opened this issue on August 3rd 2015

While tracking actions, conversions, or anything else for a visit, it is often necessary to get information about the current visit, such as its visit properties. Currently this is done by selecting the last known visit from the database. If extra information is needed during tracking, a VisitDimension is used, which adds a new column to the log_visit table that is selected automatically. This approach works, but has several downsides:

- If the information is only needed during tracking and will not be used during aggregation, the extra column in the database is a waste of space. Since it is only needed for ongoing visits, the information is useless after a visit ends.
- Not all useful information can be stored as a visit dimension. Using a dimension of any type to associate data with a log_ table row assumes that the information has a 1-1 correspondence with the row. This means that if you need to associate multiple pieces of data with a log_ table row (such as multiple rows of another log_ table), it is not possible.
- Using the log_ tables as a way to cache current visit information means there will always be at least one SELECT on the table(s) per tracking request, even if the visitor has not appeared in a long time. Since there is no "time to idle" when querying the database, we cannot be sure whether a visit is ongoing without doing a SELECT.

# A New ~~Hope~~ Approach

Ongoing visit data has one all-important property: it is temporary. It is initialized when a new visit is created, modified when the visit is updated, and discarded when the visit ends. It will not exist for long and is identified by a single key (the visitor ID). Thus a key-value store that acts as a cache, removing data once it is old, would be ideal and should perform very well.

The caching approach would look like this:

- First, using the current tracker request info, we create a unique ID for the visit, which would be a mix of the visit's config ID or visitor ID or whatever.
- Then we try to get the current visit from the cache. If it is not there, we know there is no ongoing visit, because when we save the data, we'll set a TTL equal to the configured visit length.
- If there is data, we continue with tracking. If there is no data, we have to do a SELECT on the log_visit table to check if the visitor is known. After that, we continue with tracking, assuming a new visit.
- ... normal tracking ... (all inserted logs will also be appended to the in-memory current visit data)
- The current visit data, now recorded, is set back into the cache. The configured visit length is used as the TTL.
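The flow above can be sketched in a few lines. This is only an illustration in Python, not Piwik code: `TTLCache` is a small in-memory stand-in for a store with per-key expiry (such as Redis), and `visit_key`, `track`, `is_known_visitor`, the request fields, and the 1800 second visit length are all hypothetical names and values chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is older than its TTL: treat it as if it never existed.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._items[key] = (self._clock() + ttl, value)


VISIT_LENGTH = 1800  # assumed configured visit length, in seconds


def visit_key(request: Dict[str, Any]) -> str:
    # Assumption: site ID plus visitor ID is enough to identify a visit.
    return f"visit:{request['idsite']}:{request['visitor_id']}"


def track(
    request: Dict[str, Any],
    cache: TTLCache,
    is_known_visitor: Callable[[str], bool],
    visit_length: float = VISIT_LENGTH,
) -> Tuple[Dict[str, Any], bool]:
    """Record one tracking request; return (visit data, is_new_visit)."""
    key = visit_key(request)
    visit = cache.get(key)
    is_new = visit is None
    if is_new:
        # Cache miss means no ongoing visit. Only now do we need the
        # (hypothetical) SELECT on log_visit to see if the visitor is known.
        visit = {
            "returning_visitor": is_known_visitor(request["visitor_id"]),
            "actions": [],
        }
    # ... normal tracking; inserted logs are appended to the visit data ...
    visit["actions"].append(request["action"])
    # Writing back resets the TTL, so the visit expires after inactivity.
    cache.set(key, visit, ttl=visit_length)
    return visit, is_new
```

Calling `track` twice for the same visitor within the visit length hits the database lookup only on the first call; once the TTL has elapsed, the next call is treated as a new visit.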

Benefits include:

- Being able to store whatever you want, not just simple values.
- Not having to modify existing log tables with data that will not be used during aggregation.
- Increased performance for those who want to set up Redis or something similar.

# TODO

- [ ] Move all visit information to a new extendable value object for visits. This should include all visit dimension values, as well as connected log data, such as the list of actions. This log data should also be encapsulated in objects. Plugins must be able to extend this information non-intrusively.
- [ ] Create a service (stored in DI) that fills the new visit object lazily. It should replace VisitorRecognizer (or change it) and by default select data directly from the log_visit table. It should use an intermediate service (i.e., OngoingVisitDataProvider) to get information about the current visit. Plugins have to be able to specify their own querying logic for this data, so perhaps there should be an array of them in DI.
- [ ] Create the ongoing visit cache configuration option and handle it in DI by replacing the default OngoingVisitProvider with one that uses a cache.
- [ ] Make sure visit recording logic will update the value object, and that this change is reflected in the cache, if a cache is used.
- [ ] Test the difference with load tests.
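As a rough illustration of the first two TODO items, here is one possible shape for a lazily filled visit object backed by a list of data providers. It is a sketch in Python rather than Piwik's PHP, and the names `OngoingVisit`, `Provider`, `get`, and `set` are hypothetical; the real design may look quite different.

```python
from typing import Any, Callable, Dict, Iterable, Optional

# A provider takes a visitor ID and returns some visit data (or None).
# Core would register one that reads log_visit; plugins could add more.
Provider = Callable[[str], Optional[Dict[str, Any]]]


class OngoingVisit:
    """Value object whose data is loaded from providers on first access."""

    def __init__(self, visitor_id: str, providers: Iterable[Provider]) -> None:
        self._visitor_id = visitor_id
        self._providers = list(providers)
        self._data: Optional[Dict[str, Any]] = None

    def _load(self) -> Dict[str, Any]:
        if self._data is None:
            merged: Dict[str, Any] = {}
            for provider in self._providers:
                # Later providers may add to or override earlier values.
                merged.update(provider(self._visitor_id) or {})
            self._data = merged
        return self._data

    def get(self, name: str, default: Any = None) -> Any:
        return self._load().get(name, default)

    def set(self, name: str, value: Any) -> None:
        # Tracking logic updates the object; a cache-backed setup would
        # persist it afterwards.
        self._load()[name] = value
```

Nothing is queried until the first `get` or `set`, and each provider runs at most once per visit object, so a request that never inspects the visit pays no lookup cost.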

This issue was closed on May 5th 2016