Hide code cell source
import pandas as pd
import numpy as np
from math import ceil
from bokeh.io import output_notebook, show
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool, LinearAxis, Range1d, Span, Label, LabelSet
from bokeh.layouts import gridplot, column
from bokeh.palettes import Spectral8, Pastel1
from bokeh.transform import cumsum
output_notebook()

data = pd.read_csv('/Users/rachel/Library/CloudStorage/Dropbox-TLPSummerInterns/TLP Summer Intern Folder/Zhou/CODE - MPS_data_july_2023/mps_student_activity_detail.tsv', sep='\t',on_bad_lines='skip')
Loading BokehJS ...

4.1 Rention and Drop-Out#

Engagement Duration#

Hide code cell source
data['created_at'] = pd.to_datetime(data['created_at'])

def compute_histogram(course_name, bins):
    course_data = data[data['course_name'] == course_name]
    course_span = course_data.groupby('student_user_id')['created_at'].agg(['min', 'max'])
    course_span['span_months'] = (course_span['max'] - course_span['min']).dt.days / 30
    active_months = course_span['span_months'].apply(np.ceil)
    hist, edges = np.histogram(active_months, bins=bins)
    return hist, edges

# Determine the maximum active months to set a common bin range for all courses
max_months = max(data.groupby('student_user_id')['created_at'].agg(['min', 'max'])['max'] - data.groupby('student_user_id')['created_at'].agg(['min', 'max'])['min']).days / 30
bins = range(1, int(np.ceil(max_months)) + 2)

hist_csf, edges_csf = compute_histogram('csf', bins)
hist_csd, edges_csd = compute_histogram('csd', bins)
hist_csp, edges_csp = compute_histogram('csp', bins)

# Overall histogram
hist_all = hist_csf + hist_csd + hist_csp

# Overall histogram
hist_all = hist_csf + hist_csd + hist_csp
edges_all = edges_csf  # Assuming all courses have the same range of active months

# Course-wise Active Months Distribution
source_csf = ColumnDataSource(data=dict(hist=hist_csf, left=edges_csf[:-1], right=edges_csf[1:]))
source_csd = ColumnDataSource(data=dict(hist=hist_csd, left=edges_csd[:-1], right=edges_csd[1:]))
source_csp = ColumnDataSource(data=dict(hist=hist_csp, left=edges_csp[:-1], right=edges_csp[1:]))

p_csf = figure(height=250, width=800, title="CS Fundamentals (K-5)",
               x_axis_label='Active Months', y_axis_label='Number of Students')
p_csd = figure(height=250, width=800, title="CS Discoveries (6-10)", x_range=p_csf.x_range,
               x_axis_label='Active Months', y_axis_label='Number of Students')
p_csp = figure(height=250, width=800, title="CS Principles (AP)", x_range=p_csf.x_range,
               x_axis_label='Active Months', y_axis_label='Number of Students')

p_csf.vbar(x='left', top='hist', width=0.7, source=source_csf, color=Spectral8[0])
p_csd.vbar(x='left', top='hist', width=0.7, source=source_csd, color=Spectral8[1])
p_csp.vbar(x='left', top='hist', width=0.7, source=source_csp, color=Spectral8[2])

# Overall Active Months Distribution
source_all = ColumnDataSource(data=dict(hist=hist_all, left=edges_all[:-1], right=edges_all[1:]))

p_overall = figure(height=500, width=800, title="Overall Distribution of Active Months Across All Courses",
                   x_axis_label='Active Months', y_axis_label='Number of Students')
p_overall.vbar(x='left', top='hist', width=0.7, source=source_all, color=Spectral8[3])
GlyphRenderer(
id = 'p1162', …)

Bird’s Eye View: Overall Student Engagement Duration#

show(p_overall)

The histogram showcases the distribution of the number of active months for each student.

Predominant Peaks in Short-term Engagement: The highest bars in the histogram represent the most frequent engagement durations. The tallest bar corresponds to 1-2 months, it indicates that most students are active for a brief engagement with the platform, which might point to a large dropout rate shortly after registration.

Second Peak in 6-7 Month: This peaks around semester durations possibly represent students using the platform for the duration of a school term.

Long-term Engagement: Bars further to the right demonstrate students who have been active for several months, showcasing a strong retention tendency.

Deep Dive: Course-Specific Active Months Distribution#

show(column(p_csf, p_csd, p_csp))

Retention and Drop-out Rate#

By analyzing the span of their activity, we can classify students into categories that reflect their commitment and engagement with the platform. Based on the above active month distribution, here we define:

Retained Students: A student is considered “retained” if their records span over six months.

Dropped Students: A student is considered “dropped out” if their records span fewer than and including two month.

Hide code cell source
# Extract the earliest and latest 'created_at' for each student
student_span = data.groupby('student_user_id')['created_at'].agg(['min', 'max'])
student_span['min'] = pd.to_datetime(student_span['min'])
student_span['max'] = pd.to_datetime(student_span['max'])

# Calculate the span in months
student_span['span_months'] = (student_span['max'] - student_span['min']).dt.days / 30

# Classify students based on the new retention and dropout criteria
student_span['status'] = 'Other'
student_span.loc[student_span['span_months'] > 6, 'status'] = 'Retained'
student_span.loc[student_span['span_months'] <= 2, 'status'] = 'Dropped'

# Calculate absolute numbers and proportions based on the new criteria
status_counts_new = student_span['status'].value_counts()
status_proportions_new = (status_counts_new / len(student_span)) * 100
print(status_counts_new, status_proportions_new)
status
Dropped     8474
Other       4947
Retained    3906
Name: count, dtype: int64 status
Dropped     48.906331
Other       28.550817
Retained    22.542852
Name: count, dtype: float64

Insights

High Dropout Rate: Nearly half of the students (48.91%) drop out early, engaging with the platform for only up to two months. This significant dropout rate shortly after joining might point towards challenges in the initial engagement, course content, or possibly external factors.

Consistent Engagement: About 22.54% of the students have shown consistent engagement, being active for more than half a year. This indicates a segment of students finds long-term value in the platform’s offerings.

Intermediate Engagement: A non-trivial 28.55% of students don’t fall into the “retained” or “dropped” categories, suggesting that a considerable number of students have an intermediate engagement duration. Understanding the nuances of this group could provide actionable insights to further enhance the platform’s engagement strategies.

In conclusion, while there’s a strong core of students who stay committed to the platform, there’s a significant early dropout rate that warrants further investigation. The intermediate group poses an opportunity for targeted interventions to further enhance their learning journey and possibly convert them into the retained category.