Calculate the 7-day retention rate for mobile apps from a CSV dataset
assumptions¶
The dataset is small enough to fit in memory. The model loads the entire dataset and performs the calculation on all of it when analyze() is called, so that retrieving the 7-day retention rate for a specific time period afterwards is instant.
- If the data were streaming, or the dataset were too big, I would design the model to load the data as a stream or in blocks, and/or perform the calculation only on the subset of data sliced to the specified time period.
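As a rough illustration of the block-wise alternative mentioned above, the sketch below reads a CSV in chunks with pandas and keeps only each user's earliest date in memory. It is not part of the retention module; the column names `user_id`, `os` and `date`, and the helper name `first_seen_in_chunks`, are assumptions made for this example.

```python
import io
import pandas as pd

def first_seen_in_chunks(csv_source, chunksize=2):
    """Hypothetical helper: earliest date per user, read block by block."""
    first_seen = {}
    # read_csv with chunksize yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(csv_source, parse_dates=["date"], chunksize=chunksize):
        # earliest date per user within this block
        for user, day in chunk.groupby("user_id")["date"].min().items():
            # keep the earliest date seen across all blocks so far
            if user not in first_seen or day < first_seen[user]:
                first_seen[user] = day
    return first_seen

# Tiny in-memory CSV standing in for a large file (assumed column layout)
sample = io.StringIO(
    "user_id,os,date\n"
    "X,android,2016-09-05\n"
    "Y,ios,2016-09-10\n"
    "X,ios,2016-09-10\n"
)
first_seen = first_seen_in_chunks(sample, chunksize=2)
```

Only the per-user dictionary grows with the data, so the raw rows never need to be held in memory all at once.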
Filtering by OS is cross-platform
- For example, suppose we want the 7-day retention rate for 9-10-2016 filtered by iOS, and 9-10-2016 is the first day user X opened the app on iOS, but user X had opened the app EARLIER, on 9-5-2016, on an Android device. In that case, the model will not count user X as a new user, and user X's records will not be used to calculate the 7-day retention rate for 9-10-2016 onwards.
- This assumption can be reversed by changing only a few lines of code, so that the model counts user X as a new user even if user X opened the app earlier on another OS.
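The cross-platform assumption can be sketched as follows: a user's first-seen date is computed over all platforms before any OS filter is applied. This is an illustrative sketch, not the code in the retention module; the column names `user_id`, `os` and `date` and the helper `new_users_on` are assumptions.

```python
import pandas as pd

def new_users_on(events, day, os_name):
    """Hypothetical helper: users active on `os_name` on `day` whose
    first-ever open, on ANY OS, also falls on `day`."""
    events = events.assign(date=pd.to_datetime(events["date"]).dt.normalize())
    # first-seen date is computed BEFORE filtering by OS (cross-platform)
    first_seen = events.groupby("user_id")["date"].min()
    target = pd.Timestamp(day)
    on_day = events[(events["date"] == target) & (events["os"] == os_name)]
    return {u for u in on_day["user_id"].unique() if first_seen[u] == target}

# User X opened on Android on 9-5-2016, then on iOS on 9-10-2016
events = pd.DataFrame({
    "user_id": ["X", "X", "Y"],
    "os": ["android", "ios", "ios"],
    "date": ["2016-09-05", "2016-09-10", "2016-09-10"],
})
new_ios = new_users_on(events, "2016-09-10", "ios")
```

Here user X is excluded because of the earlier Android session, while user Y counts as new. Reversing the assumption would amount to computing `first_seen` after applying the OS filter instead of before it.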
import the model functions¶
In [1]:
from retention import load_data, filter_data, analyze, interval_rate
from retention import get_stat_df, get_all_data, get_filltered_data, plotting
load the CSV data¶
In [2]:
path='sample_data.csv'
load_data(path)
Perform the calculation on the entire dataset. If a filter is needed, filter_data() should be called before analyze()¶
In [3]:
analyze()
calculate 7-day retention rate of a specified time period¶
In [4]:
interval_rate('9-1-16','9-30-16')
Out[4]:
Optional¶
Calling get_stat_df() returns the calculated DataFrame, which contains per-day statistics in the following columns, for reference
- day1 is the number of NEW users on that day
- day7 is the number of unique (not only new) users active 7 days after that day
- 'matched' is the number of users from day1 who reopened the app 7 days after that day
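To make the three columns concrete, here is a minimal sketch of how such a table could be derived with pandas. It is not the implementation behind get_stat_df(); the column names `user_id` and `date`, the helpers `daily_stats` and `interval_rate_sketch`, and the choice of sum(matched) / sum(day1) as the rate over an interval are all assumptions.

```python
import pandas as pd

def daily_stats(events):
    """Hypothetical helper: per-day day1 / day7 / matched counts."""
    ev = events.assign(date=pd.to_datetime(events["date"]).dt.normalize())
    first_seen = ev.groupby("user_id")["date"].min()
    # set of active users for each calendar day
    active = {day: set(users) for day, users in ev.groupby("date")["user_id"]}
    rows = []
    for day, users in active.items():
        new_users = {u for u in users if first_seen[u] == day}
        later = active.get(day + pd.Timedelta(days=7), set())
        rows.append({
            "date": day,
            "day1": len(new_users),             # new users on this day
            "day7": len(later),                 # unique users 7 days later
            "matched": len(new_users & later),  # new users who came back
        })
    return pd.DataFrame(rows).set_index("date")

def interval_rate_sketch(stats, start, end):
    """Assumed definition: returning new users / all new users in the interval."""
    window = stats.loc[pd.Timestamp(start):pd.Timestamp(end)]
    total_new = window["day1"].sum()
    return window["matched"].sum() / total_new if total_new else 0.0

events = pd.DataFrame({
    "user_id": ["A", "B", "A", "C"],
    "date": ["2016-09-01", "2016-09-01", "2016-09-08", "2016-09-08"],
})
stats = daily_stats(events)
rate = interval_rate_sketch(stats, "2016-09-01", "2016-09-01")
```

In this toy data, users A and B are new on September 1 and only A returns on September 8, so one of the two new users is matched.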
In [5]:
df=get_stat_df()
df.info()
In [6]:
df.head()
Out[6]:
Calling get_all_data() returns the raw data used in the analysis, for reference
In [7]:
data=get_all_data()
data.info()
In [8]:
data.head()
Out[8]:
Calling plotting() plots simple line charts from the DataFrame returned by get_stat_df()
- simple plotting only, no further analysis
- passing a name saves the chart as an HTML file; otherwise the chart is plotted in the Jupyter notebook
In [9]:
plotting()
save graph as test.html¶
In [11]:
# plotting('test')
b. What was the Day-7 retention from September 8 through September 10 for Android users?¶
filter data by os 'android'
In [12]:
filter_data('android')
- every time the data is filtered, the retention rates must be recalculated
- so call filter_data() and then analyze()
In [ ]:
analyze()
get 7-day retention rate of September 8 through September 10¶
In [13]:
interval_rate('9-8-16','9-10-16')
Out[13]:
c. What was the Day-7 retention over the month of September for iOS users using version 6.5?¶
filter data first¶
In [14]:
filter_data('ios','6.5.0')
then analyze, and get the 7-day retention rate for September
In [15]:
analyze()
In [16]:
interval_rate('9-1-16','9-30-16')
Out[16]:
Why 0? Examine the statistics¶
In [17]:
df=get_stat_df()
In [18]:
df.head()
Out[18]:
In [19]:
df['matched'].sum()
Out[19]:
The retention rate is 0 because 0 records matched
In [20]:
df
Out[20]: