2009-2010 ASSISTments Skill Builder Data

Data Description

Column Description

Field	Annotation
order id	Non-chronological id, refer to original problem log
assignment id	Each assignment is specific to single teacher/class.
user id	Id of the student
problem id	Id of the problem
original	1 = main problem, 0 = scaffolding problem
correct	1 = correct on the first attempt, 0 = incorrect on the first attempt or asked for help
attempt count	Number of attempts of the student
ms first response	The time in milliseconds for the student's first response
tutor mode	tutor or test
answer type	choose_1 or algebra or fill_in or open_response
sequence id	Id of the problem set
student class id	Class id
position	Assignment position on the class assignments page
problem set type	Linear or Random or Mastery
base sequence id	If the sequence has been copied, this points to the original copy
skill id	ID of the skill associated with the problem. In this skill builder dataset, a record tagged with several skills is duplicated so that each row carries exactly one skill.
skill name	Name of the skill
teacher id	ID of the teacher
school id	ID of the school
hint count	Number of hints the student requested on the problem
hint total	Number of possible hints on the problem
overlap time	Time in milliseconds
template id	The template ID of the ASSISTment. ASSISTments with the same template ID have similar questions.
answer id	The answer ID for multi-choice questions.
answer text	The answer text for fill-in questions.
first action	The type of first action: attempt or ask for a hint.
bottom hint	Whether or not the student asks for all hints.
opportunity	The number of opportunities the student has to practice on this skill.
opportunity original	The number of opportunities the student has to practice on this skill counting only original problems.
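The two opportunity columns can be read as running counters per student and skill. The sketch below rebuilds them on a toy frame under two assumptions that the source does not confirm: the rows are already in the order the student saw them, and scaffolding rows simply get no value for `opportunity_original`. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Toy log, assumed to be sorted in the order each student saw the problems.
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 1],
    "skill_id": [10, 10, 20, 10, 10, 10],
    "original": [1, 0, 1, 1, 1, 1],
})

# Running count of records per (student, skill), starting at 1.
toy["opportunity"] = toy.groupby(["user_id", "skill_id"]).cumcount() + 1

# Same counter, but only original (main) problems advance it.
# Scaffolding rows are not in the filtered frame, so index alignment
# leaves them as NaN when the result is assigned back.
is_original = toy["original"] == 1
toy["opportunity_original"] = (
    toy[is_original].groupby(["user_id", "skill_id"]).cumcount() + 1
)

print(toy)
```

Under this reading, `opportunity_original` would be missing exactly on scaffolding rows, which is worth checking against the real file before relying on it.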

[1]:

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

[2]:

path = "ASSISTments2009-2010.csv"

data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

[3]:

pd.set_option('display.max_columns', 500)
data.head()

[3]:

	order_id	assignment_id	user_id	assistment_id	problem_id	original	correct	attempt_count	ms_first_response	tutor_mode	answer_type	sequence_id	student_class_id	position	type	base_sequence_id	skill_id	skill_name	teacher_id	school_id	hint_count	hint_total	overlap_time	template_id	answer_id	answer_text	bottom_hint	opportunity	opportunity_original
0	33022537	277618	64525	33139	51424	1	1	1	32454	tutor	algebra	5948	13241	126	MasterySection	5948	1.0	Box and Whisker	22763	73	0	3	32454	30799	NaN	26	NaN	1	1.0
1	33022709	277618	64525	33150	51435	1	1	1	4922	tutor	algebra	5948	13241	126	MasterySection	5948	1.0	Box and Whisker	22763	73	0	3	4922	30799	NaN	55	NaN	2	2.0
2	35450204	220674	70363	33159	51444	1	0	2	25390	tutor	algebra	5948	11816	22	MasterySection	5948	1.0	Box and Whisker	22763	73	0	3	42000	30799	NaN	88	NaN	1	1.0
3	35450295	220674	70363	33110	51395	1	1	1	4859	tutor	algebra	5948	11816	22	MasterySection	5948	1.0	Box and Whisker	22763	73	0	3	4859	30059	NaN	41	NaN	2	2.0
4	35450311	220674	70363	33196	51481	1	0	14	19813	tutor	algebra	5948	11816	22	MasterySection	5948	1.0	Box and Whisker	22763	73	3	4	124564	30060	NaN	65	0.0	3	3.0

General features

[4]:

data.describe()

[4]:

	order_id	assignment_id	user_id	assistment_id	problem_id	original	correct	attempt_count	ms_first_response	sequence_id	student_class_id	position	base_sequence_id	skill_id	teacher_id	school_id	hint_count	hint_total	overlap_time	template_id	answer_id	first_action	bottom_hint	opportunity	opportunity_original
count	4.017560e+05	401756.000000	401756.000000	401756.000000	401756.000000	401756.000000	401756.000000	401756.000000	4.017560e+05	401756.000000	401756.000000	401756.000000	401756.000000	338001.000000	401756.000000	401756.000000	401756.000000	401756.000000	4.017560e+05	401756.000000	45454.000000	401756.000000	67044.000000	401756.000000	328291.000000
mean	3.066256e+07	273701.845882	83414.154542	46443.517526	81117.030011	0.817140	0.642923	1.596417	4.748464e+04	7284.411088	12919.115222	57.163649	6786.020985	127.167032	46875.587322	3031.291025	0.487470	2.235817	5.964848e+04	39571.335029	145094.431667	0.130012	0.724092	20.553535	14.403307
std	5.264886e+06	11338.460017	7417.814021	11832.443427	25426.799662	0.386552	0.479139	12.050437	3.614590e+05	1497.941072	783.548291	65.215464	1263.359735	120.427518	15892.975481	1830.451486	1.187255	1.804244	3.822188e+05	12679.439926	47127.478285	0.394099	0.446974	62.523994	62.393684
min	2.022408e+07	217900.000000	14.000000	86.000000	83.000000	0.000000	0.000000	0.000000	-7.759575e+06	5870.000000	11644.000000	1.000000	5870.000000	1.000000	11158.000000	1.000000	0.000000	0.000000	-7.759575e+06	86.000000	1.000000	0.000000	0.000000	1.000000	1.000000
25%	2.660218e+07	266784.000000	78970.000000	37046.000000	58467.000000	1.000000	0.000000	1.000000	8.518000e+03	5979.000000	12352.000000	9.000000	5968.000000	39.000000	42999.000000	2770.000000	0.000000	0.000000	1.066900e+04	30244.000000	104412.000000	0.000000	0.000000	3.000000	3.000000
50%	3.110513e+07	271629.000000	80111.000000	44498.000000	80734.000000	1.000000	1.000000	1.000000	1.945300e+04	6910.000000	12574.000000	27.000000	6094.000000	74.000000	45778.000000	2770.000000	0.000000	3.000000	2.426450e+04	30987.000000	136247.000000	0.000000	1.000000	8.000000	6.000000
75%	3.494364e+07	279158.000000	88142.000000	53142.000000	93102.000000	1.000000	1.000000	1.000000	4.457825e+04	8032.000000	13241.000000	92.000000	7014.000000	279.000000	59882.000000	5056.000000	0.000000	4.000000	5.698925e+04	46399.000000	184077.000000	0.000000	1.000000	19.000000	13.000000
max	3.831020e+07	291503.000000	96299.000000	106210.000000	207348.000000	1.000000	1.000000	3824.000000	8.407692e+07	13362.000000	14415.000000	295.000000	13362.000000	378.000000	69274.000000	9948.000000	10.000000	10.000000	8.407692e+07	106180.000000	323181.000000	2.000000	1.000000	3371.000000	3371.000000
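The summary above shows negative minimums and very large maximums for `ms_first_response` and `overlap_time`, so some timing values are probably logging artifacts. One possible way to screen them out is sketched below on a toy frame. The one-hour cap `MAX_MS` is a hypothetical threshold chosen for illustration, not something the dataset documentation prescribes.

```python
import pandas as pd

# Toy rows mixing plausible timings with extremes like those in describe().
toy = pd.DataFrame({
    "ms_first_response": [32454, -7759575, 4922, 84076920],
    "overlap_time":      [32454, -7759575, 4922, 84076920],
})

# Hypothetical cap: treat anything longer than one hour as an outlier.
MAX_MS = 60 * 60 * 1000

# Keep rows where both timing columns fall inside [0, MAX_MS].
plausible = (
    toy["ms_first_response"].between(0, MAX_MS)
    & toy["overlap_time"].between(0, MAX_MS)
)
cleaned = toy[plausible]

print(f"kept {len(cleaned)} of {len(toy)} rows")
```

Whether to drop, clip, or keep such rows depends on the analysis, so the filter is best applied to a copy rather than to `data` itself.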

[5]:

print("The number of records: "+ str(len(data['order_id'].unique())))

The number of records: 346860
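The number of unique `order_id` values is smaller than the row count reported by `describe()`, which fits the column description: a record tagged with several skills appears once per skill. The sketch below shows one way to inspect and collapse such duplicates. It uses a made-up `toy` frame, and keeping the first row per `order_id` is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy log: order_id 100 is tagged with two skills, so it appears twice.
toy = pd.DataFrame({
    "order_id": [100, 100, 101, 102],
    "skill_id": [1.0, 2.0, 1.0, None],
    "correct":  [1, 1, 0, 1],
})

n_rows = len(toy)                     # rows, including per-skill duplicates
n_logs = toy["order_id"].nunique()    # distinct problem logs

# Distinct skills attached to each log (NaN is not counted).
skills_per_log = toy.groupby("order_id")["skill_id"].nunique()
multi_skill_logs = skills_per_log[skills_per_log > 1].index.tolist()

# One row per log, for analyses where each response should count once.
one_row_per_log = toy.drop_duplicates(subset="order_id", keep="first")

print(n_rows, n_logs, multi_skill_logs)
```

Per-skill analyses can keep the duplicated rows, while per-response statistics such as overall accuracy are arguably better computed on the collapsed frame.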

[6]:

print('Fraction of missing values per column')
print(data.isnull().sum() / len(data))

Fraction of missing values per column
order_id                0.000000
assignment_id           0.000000
user_id                 0.000000
assistment_id           0.000000
problem_id              0.000000
original                0.000000
correct                 0.000000
attempt_count           0.000000
ms_first_response       0.000000
tutor_mode              0.000000
answer_type             0.000000
sequence_id             0.000000
student_class_id        0.000000
position                0.000000
type                    0.000000
base_sequence_id        0.000000
skill_id                0.158691
skill_name              0.189466
teacher_id              0.000000
school_id               0.000000
hint_count              0.000000
hint_total              0.000000
overlap_time            0.000000
template_id             0.000000
answer_id               0.886862
answer_text             0.222045
first_action            0.000000
bottom_hint             0.833123
opportunity             0.000000
opportunity_original    0.182860
dtype: float64

[7]:

len(data.user_id.unique())

[7]:

[8]:

ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

ds['user_id'] = ds['user_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'user_id',
    orientation='h',
    title='Top 40 students by number of actions'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_11_0.svg
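The count, relabel, and take-the-top-N steps in the cell above recur later for problems, schools, attempt counts, and skills. They could be folded into a small helper. `top_counts` below is a hypothetical function, not part of the original notebook. It keeps the notebook's `'-'` suffix, which presumably exists so that plotly treats the IDs as categories rather than numbers.

```python
import pandas as pd

def top_counts(df: pd.DataFrame, column: str, n: int = 40) -> pd.DataFrame:
    """Return the n most frequent values of `column` with their counts."""
    counts = (
        df[column]
        .dropna()
        .value_counts()            # sorted from most to least frequent
        .rename_axis(column)
        .reset_index(name="count")
    )
    # String labels so the bar chart axis is categorical.
    counts[column] = counts[column].astype(str) + "-"
    # Ascending order puts the largest bar at the top of a horizontal chart.
    return counts.head(n).sort_values("count")

toy = pd.DataFrame({"user_id": [1, 1, 1, 2, 2, 3]})
result = top_counts(toy, "user_id", n=2)
print(result)
```

The returned frame can be passed to `px.bar` with `x='count'`, `y=column`, and `orientation='h'`, as in the cells of this notebook.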

[9]:

ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

ds = ds.sort_values('user_id')

fig = px.histogram(
    ds,
    x = 'user_id',
    y = 'count',
    title = 'User action distribution'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_12_0.svg

[10]:

ds = data['problem_id'].value_counts().reset_index()

ds.columns = [
    'problem_id',
    'count'
]

ds['problem_id'] = ds['problem_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'problem_id',
    orientation = 'h',
    title = 'Top 40 most frequent problem_ids'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_13_0.svg

[11]:

ds = data['problem_id'].value_counts().reset_index()

ds.columns = [
    'problem_id',
    'count'
]

ds = ds.sort_values('problem_id')

fig = px.histogram(
    ds,
    x='problem_id',
    y='count',
    title='problem_id action distribution'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_14_0.svg

[12]:

ds = data['correct'].value_counts().reset_index()

ds.columns = [
    'correct',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['percent'])

# Label each slice from its own value instead of relying on the sort order.
ds['answer'] = ds['correct'].map({0: 'wrong', 1: 'right'})

fig = px.pie(
    ds,
    names = 'answer',
    values = 'percent',
    title = 'Percent of correct answers'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_15_0.svg

Sort by answer types

[13]:

ds = data['answer_type'].value_counts().reset_index()

ds.columns = [
    'answer_type',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['percent'])

fig = px.pie(
    ds,
    names = 'answer_type',
    values = 'percent',
    title = 'Problem type',
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_17_0.svg

[14]:

fig = make_subplots(rows=3, cols=2)

traces = [
    go.Bar(
        x = ['wrong', 'right'],
        y = [
            len(data[(data['answer_type'] == item) & (data['correct'] == 0)]),
            len(data[(data['answer_type'] == item) & (data['correct'] == 1)])
        ],
        name = 'Type: ' + str(item),
        text = [
            str(round(100*len(data[(data['answer_type'] == item)&(data['correct'] == 0)])/len(data[data['answer_type'] == item]),2)) + '%',
            str(round(100*len(data[(data['answer_type'] == item)&(data['correct'] == 1)])/len(data[data['answer_type'] == item]),2)) + '%'
        ],
        textposition = 'auto'
    ) for item in data['answer_type'].unique().tolist()
]

# Place the traces row by row on the two-column grid.
for i, trace in enumerate(traces):
    fig.add_trace(
        trace,
        row = (i // 2) + 1,
        col = (i % 2) + 1
    )

fig.update_layout(
    title_text = 'Percent of correct answers for every problem type',
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_18_0.svg
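The percentages shown on the bars can also be obtained as a single table, which is easier to check than the repeated boolean filters in the cell above. The sketch below uses `pd.crosstab` with row-wise normalization on a made-up `toy` frame. The values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy responses with a few answer types.
toy = pd.DataFrame({
    "answer_type": ["algebra", "algebra", "algebra", "choose_1", "choose_1", "fill_in"],
    "correct":     [1, 0, 1, 1, 0, 1],
})

# Share of wrong/right answers within each answer type, in percent.
pct = (
    pd.crosstab(toy["answer_type"], toy["correct"], normalize="index")
    .mul(100)
    .round(2)
    .rename(columns={0: "wrong", 1: "right"})
)

print(pct)
```

On the real data, the same call with `data["answer_type"]` and `data["correct"]` should reproduce the labels in the figure, provided `correct` only takes the values 0 and 1.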

Sort by schools

[15]:

len(data['school_id'].unique())

[15]:

[16]:

ds = data['school_id'].value_counts().reset_index()

ds.columns = [
    'school_id',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['percent'])

fig = px.pie(
    ds,
    names = 'school_id',
    values = 'percent',
    title = 'Percent of schools',
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_21_0.svg

[17]:

ds = data['school_id'].value_counts().reset_index()

ds.columns = [
    'school_id',
    'count'
]

ds['school_id'] = ds['school_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(20)

fig = px.bar(
    ds,
    x = 'count',
    y = 'school_id',
    orientation = 'h',
    title = 'Top 20 most frequent school_ids'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_22_0.svg

Sort by attempt counts

[18]:

ds = data['attempt_count'].value_counts().reset_index()

ds.columns = [
    'attempt_count',
    'count'
]

ds['attempt_count'] = ds['attempt_count'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'attempt_count',
    orientation = 'h',
    title = 'Top 40 most frequent attempt counts'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_24_0.svg

Sort by skills

[19]:

ds = data['skill_id'].dropna() # 'skill_id' has fewer NaNs than 'skill_name', so count by ID.
ds = ds.value_counts().reset_index()

ds.columns = [
    'skill_id',
    'count'
]

ds['skill_id'] = ds['skill_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'skill_id',
    orientation = 'h',
    title = 'Top 40 most frequent skill_ids'
)

fig.show("svg")

../../../_images/build_blitz_ASSISTments_ASSISTments2009-2010_26_0.svg
