ASSISTments2015 Data Analysis

Data Description

Column Description

Field	Annotation
user_id	ID of the student
log_id	Unique ID of the logged action
sequence_id	ID of the problem set
correct	1 if correct on the first attempt; 0 if incorrect on the first attempt or if the student asked for help

[1]:

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

[2]:

path = "2015_100_skill_builders_main_problems.csv"
data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

[3]:

pd.set_option('display.max_columns', 500)
data.head()

[3]:

	user_id	log_id	sequence_id	correct
0	50121	167478035	7014	0.0
1	50121	167478043	7014	1.0
2	50121	167478053	7014	1.0
3	50121	167478069	7014	1.0
4	50964	167478041	7014	1.0

General features

[4]:

data.describe()

[4]:

	user_id	log_id	sequence_id	correct
count	708631.000000	7.086310e+05	708631.000000	708631.000000
mean	296232.978276	1.695323e+08	22683.474821	0.725502
std	48018.650247	3.608096e+06	41593.028018	0.437467
min	50121.000000	1.509145e+08	5898.000000	0.000000
25%	279113.000000	1.660355e+08	7020.000000	0.000000
50%	299168.000000	1.704579e+08	9424.000000	1.000000
75%	335647.000000	1.723789e+08	14442.000000	1.000000
max	362374.000000	1.754827e+08	236309.000000	1.000000

[5]:

print("The number of records: "+ str(len(data['log_id'].unique())))

The number of records: 708631

[6]:

print('Fraction of missing values per column')
print(data.isnull().sum() / len(data))

Fraction of missing values per column
user_id        0.0
log_id         0.0
sequence_id    0.0
correct        0.0
dtype: float64

[7]:

len(data.user_id.unique())

[7]:

[8]:

len(data.sequence_id.unique())

[8]:
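The two counts above can also be read off in a single call. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset); only the column names match the real log.

```python
import pandas as pd

# Toy stand-in for the real log: same column names, made-up values.
toy = pd.DataFrame({
    "user_id":     [1, 1, 2, 3, 3, 3],
    "log_id":      [10, 11, 12, 13, 14, 15],
    "sequence_id": [7014, 7014, 7014, 5898, 5898, 7020],
    "correct":     [0.0, 1.0, 1.0, 0.25, 1.0, 0.0],
})

# Number of distinct students and problem sets in one call.
n_unique = toy[["user_id", "sequence_id"]].nunique()
print(n_unique)

# log_id is described as a unique action ID, so it should contain no repeats.
print(toy["log_id"].is_unique)
```

On the real data, `data[["user_id", "sequence_id"]].nunique()` should give the same numbers as the two `len(...unique())` cells.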

Sort by user id

[9]:

ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

# Append '-' so that Plotly treats the ids as category labels, not numbers
ds['user_id'] = ds['user_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'user_id',
    orientation='h',
    title='Top 40 students by number of actions'
)

fig.show("svg")

[Figure: horizontal bar chart of the 40 students with the most logged actions]

[10]:

ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

ds = ds.sort_values('user_id')

fig = px.histogram(
    ds,
    x = 'user_id',
    y = 'count',
    title = 'User action distribution'
)

fig.show("svg")

[Figure: histogram of logged actions across user_id values]
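The histogram above bins by `user_id`, so it shows where activity falls along the id range. A complementary view, sketched here on made-up data, summarises how many actions each student logged. The toy frame and the variable names `actions_per_user` and `summary` are illustrative assumptions.

```python
import pandas as pd

# Toy log: student 1 has four actions, student 2 has two, student 3 has one.
toy = pd.DataFrame({"user_id": [1, 1, 1, 1, 2, 2, 3]})

# One row per student, holding that student's number of logged actions.
actions_per_user = toy["user_id"].value_counts()

# Summary statistics of the per-student action counts.
summary = actions_per_user.describe()
print(summary[["min", "50%", "max"]])
```

Applied to the real log, the same two lines would show whether a handful of very active students account for most of the records.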

Correct answers

[11]:

ds = data['correct'].value_counts().reset_index()

ds.columns = [
    'correct',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['correct'])

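fig = px.pie(
    ds,
    # Labels assume exactly these 11 score values, in ascending order
    # (ds is sorted by 'correct' above)
    names = ['0', '1/10','1/5', '1/4','1/3', '1/2','2/3', '3/4','4/5','9/10', '1'],
    values = 'percent',
    title = 'Percent of correct answers'
)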

fig.show("svg")

[Figure: pie chart of the share of each value of correct]

Minor note: we also have essay questions that teachers can grade. A value of, say, 0.25 means the teacher gave it 1 out of 4.
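The fractional labels in the pie chart can be derived from the scores themselves instead of being typed by hand. This is a minimal sketch that assumes every score is a simple fraction with a denominator of at most 10, as the labels above suggest. The helper `score_label` and the sample `scores` list are hypothetical, and the six-digit rounding of 1/3 and 2/3 is an assumption about how the values are stored.

```python
from fractions import Fraction

def score_label(value, max_denominator=10):
    """Return the simplest fraction close to `value` as a string, e.g. 0.25 -> '1/4'."""
    return str(Fraction(value).limit_denominator(max_denominator))

# Sample scores (assumed, not read from the dataset).
scores = [0.0, 0.1, 0.2, 0.25, 0.333333, 0.5, 0.666667, 0.75, 0.8, 0.9, 1.0]
labels = [score_label(s) for s in scores]
print(labels)
```

Deriving the labels this way keeps them aligned with the data even if a score value is absent or an unexpected one appears.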

Sort by sequence id

[12]:

ds = data['sequence_id'].value_counts().reset_index()

ds.columns = [
    'sequence_id',
    'count'
]

# Append '-' so that Plotly treats the ids as category labels, not numbers
ds['sequence_id'] = ds['sequence_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'sequence_id',
    orientation = 'h',
    title = 'Top 40 sequence_ids by number of actions'
)

fig.show("svg")

[Figure: horizontal bar chart of the 40 problem sets with the most logged actions]

[13]:

ds = data.groupby('sequence_id')['correct'].mean()
ds = ds.reset_index()

ds['sequence_id'] = ds['sequence_id'].astype(str) + '-'
ds1 = ds.sort_values(['correct']).tail(20)

fig1 = px.bar(
    ds1,
    x = 'correct',
    y = 'sequence_id',
    orientation = 'h',
    title = 'Mean correctness of problem sets (top 20)'
)

fig1.show("svg")

ds2 = ds.sort_values(['correct']).head(20)

fig2 = px.bar(
    ds2,
    x = 'correct',
    y = 'sequence_id',
    orientation = 'h',
    title = 'Mean correctness of problem sets (bottom 20)'
)
fig2.show("svg")

[Figure: horizontal bar chart of the 20 problem sets with the highest mean of correct]

[Figure: horizontal bar chart of the 20 problem sets with the lowest mean of correct]

These figures show the mean value of correct for each problem set: the 20 highest means first, then the 20 lowest. The problem sets with the lowest means deserve more attention from teachers and students.
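A mean computed from only a few attempts can be misleading, so it may help to rank problem sets only once they have enough records. The sketch below runs on made-up data. The `MIN_ATTEMPTS` threshold and the names `stats`, `reliable`, and `hardest` are assumptions for illustration and do not appear in the original analysis.

```python
import pandas as pd

# Toy log: set 1 is mostly correct, set 2 mostly incorrect, set 3 has one attempt.
toy = pd.DataFrame({
    "sequence_id": [1, 1, 1, 1, 2, 2, 2, 2, 3],
    "correct":     [1, 1, 0, 1, 0, 0, 1, 0, 0],
})

MIN_ATTEMPTS = 3  # hypothetical cut-off; tune it for the real data

# Mean correctness and number of attempts per problem set.
stats = toy.groupby("sequence_id")["correct"].agg(
    mean_correct="mean",
    attempts="count",
)

# Keep only problem sets with enough attempts, then list the lowest means first.
reliable = stats[stats["attempts"] >= MIN_ATTEMPTS]
hardest = reliable.sort_values("mean_correct").head(20)
print(hardest)
```

Here set 3 has the lowest mean but is dropped because it has a single attempt, which leaves set 2 as the hardest problem set with enough records.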