ASSISTments2012-2013 Data Analysis

Data Description

Column Description

Field                               Annotation
problem_log_id                      Unique ID of the logged actions
skill                               Skill name associated with the problem (different skills are in different rows)
problem_id                          The ID of the problem
user_id                             The ID of the student doing the problem
assignment_id                       Two different assignments can have the same sequence_id
assistment_id                       The ID of the ASSISTment
start_time                          Timestamp when the problem starts
end_time                            Timestamp when the problem ends
problem_type                        choose_1, algebra, fill_in, or open_response
original                            Main problem or scaffolding problem
correct                             Correct on the first attempt, incorrect on the first attempt, or asked for help
bottom_hint                         Whether or not the student asks for all hints
hint_count                          Number of hints on this problem
actions                             Every action on this problem
attempt_count                       Number of student attempts on this problem
ms_first_response                   The time in milliseconds for the student’s first response
tutor_mode                          tutor, test mode, pre-test, or post-test
sequence_id                         The content ID of the problem set
student_class_id                    The class ID
position                            Assignment position on the class assignments page
type                                Type of the head section of the problem set
base_sequence_id                    Accounts for whether a sequence has been copied
skill_id                            ID of the skill associated with the problem (different skills are in different rows)
teacher_id                          The ID of the teacher who assigned the problem
school_id                           The ID of the school where the problem was assigned
overlap_time                        The student’s overlap time in milliseconds
template_id                         The template ID of the ASSISTment
answer_id                           The answer ID for multiple-choice questions
answer_text                         The answer text for fill-in questions
first_action                        The type of first action: attempt or ask for a hint
problemlogid                        Unique ID of the logged actions
Average_confidence(FRUSTRATED)      Predicted frustration of the student for the problem
Average_confidence(CONFUSED)        Predicted confusion of the student for the problem
Average_confidence(CONCENTRATING)   Predicted engaged concentration of the student for the problem
Average_confidence(BORED)           Predicted boredom of the student for the problem

[26]:
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
[28]:
path = "2012-2013-data-with-predictions-4-final.csv"

data = pd.read_csv(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

[29]:
pd.set_option('display.max_columns', 500)
data.head()
[29]:
problem_log_id skill problem_id user_id assignment_id assistment_id start_time end_time problem_type original correct bottom_hint hint_count actions attempt_count ms_first_response tutor_mode sequence_id student_class_id position type base_sequence_id skill_id teacher_id school_id overlap_time template_id answer_id answer_text first_action problemlogid Average_confidence(FRUSTRATED) Average_confidence(CONFUSED) Average_confidence(CONCENTRATING) Average_confidence(BORED)
0 137792159 NaN 557460 61394 565736 341511 2012-09-28 15:11:27 2012-09-28 15:11:36.856 choose_1 1 1.0 0.0 0 --- \n- - start\n - 1348859487561\n - "95952... 1 9852 tutor 55482 23643 4 LinearSection 55482 NaN 53472 5048.0 9852 341511 NaN she 0 137792159 0.361323 0.0 0.336529 0.000000
1 138083797 Rounding 365981 61394 573819 204043 2012-10-09 11:01:52 2012-10-09 11:02:13.182 algebra 1 1.0 0.0 0 --- \n- - start\n - 1349794912269\n - "62459... 1 21175 tutor 34221 22967 5 LinearSection 34221 54.0 47424 5048.0 21175 204043 NaN 74.29 0 138083797 0.361323 0.0 0.766925 0.000000
2 142332619 Multiplication and Division Integers 426415 61394 734130 247525 2013-03-07 10:53:20 2013-03-07 10:53:28.661 algebra 1 0.0 0.0 0 --- \n- - start\n - 1362671600405\n - "74107... 1 8645 tutor 39601 22967 58 LinearSection 39601 279.0 47424 5048.0 8645 247525 NaN 00 0 142332619 0.361323 0.0 0.766925 0.442968
3 145939397 Proportion 86686 61394 821352 48081 2013-08-20 19:54:56 2013-08-20 19:55:21.753 algebra 1 1.0 0.0 0 --- \n- - start\n - 1377042896503\n - "73630... 1 25728 tutor 6912 26303 21 MasterySection 6912 79.0 47424 5048.0 25728 46362 NaN 3.8 0 145939397 0.775000 0.0 0.766925 0.912281
4 137111284 NaN 399669 76592 557216 227869 2012-09-10 17:20:10 2012-09-10 17:24:56.579 choose_1 1 1.0 0.0 0 --- \n- - start\n - 1347312010563\n - "69479... 1 286578 tutor 37143 21696 3 LinearSection 37143 NaN 152676 7561.0 286578 227869 NaN C (wr - 1)(wr + 1) 0 137111284 0.361323 0.0 0.766925 0.000000

General features

[30]:
data.describe()
[30]:
problem_log_id problem_id user_id assignment_id assistment_id original correct bottom_hint hint_count attempt_count ms_first_response sequence_id student_class_id position base_sequence_id skill_id teacher_id school_id overlap_time template_id answer_id first_action problemlogid Average_confidence(FRUSTRATED) Average_confidence(CONFUSED) Average_confidence(CONCENTRATING) Average_confidence(BORED)
count 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.062922e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 2.711813e+06 6.123270e+06 6.123113e+06 6.123270e+06 6.123270e+06 8.275000e+03 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06 6.123270e+06
mean 1.414932e+08 3.685675e+05 1.770492e+05 6.773074e+05 2.202825e+05 9.504296e-01 6.768206e-01 1.200497e-01 3.373479e-01 1.339212e+00 4.873469e+04 6.689567e+04 2.342511e+04 7.402669e+01 6.214174e+04 1.932575e+02 1.210437e+05 6.925225e+03 4.907237e+04 2.088952e+05 4.324879e+05 6.151860e-02 1.414932e+08 3.894586e-01 4.479487e-02 6.823843e-01 2.567723e-01
std 2.693733e+06 2.195421e+05 3.172431e+04 9.425983e+04 1.393519e+05 2.170557e-01 4.674909e-01 3.250197e-01 9.851956e-01 1.056276e+00 2.673557e+05 5.933111e+04 1.612341e+03 3.697118e+02 5.687449e+04 1.303155e+02 4.978645e+04 3.314489e+03 2.884992e+05 1.458227e+05 3.534885e+05 2.635170e-01 2.693733e+06 1.027662e-01 1.924793e-01 1.713734e-01 2.862460e-01
min 1.368431e+08 1.000000e+00 2.142100e+04 1.814560e+05 5.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -6.767000e+03 2.000000e+00 1.139300e+04 0.000000e+00 2.000000e+00 1.000000e+00 1.143600e+04 1.000000e+00 -6.767000e+03 5.000000e+00 1.000000e+00 0.000000e+00 1.368431e+08 3.613230e-01 0.000000e+00 1.707320e-01 0.000000e+00
25% 1.391705e+08 1.284030e+05 1.719780e+05 5.863570e+05 6.883725e+04 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 9.436000e+03 1.266200e+04 2.251800e+04 4.000000e+00 1.189800e+04 6.500000e+01 7.305500e+04 5.260000e+03 9.468000e+03 5.259000e+04 1.060495e+05 0.000000e+00 1.391705e+08 3.613230e-01 0.000000e+00 7.669250e-01 0.000000e+00
50% 1.414916e+08 4.168130e+05 1.791670e+05 6.785645e+05 2.399180e+05 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 2.233100e+04 4.614100e+04 2.314400e+04 1.200000e+01 4.493100e+04 2.770000e+02 1.285010e+05 5.978000e+03 2.241500e+04 2.395460e+05 3.442820e+05 0.000000e+00 1.414916e+08 3.613230e-01 0.000000e+00 7.669250e-01 2.214840e-01
75% 1.438272e+08 5.644030e+05 1.972510e+05 7.672320e+05 3.466830e+05 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00 5.486500e+04 9.901300e+04 2.460300e+04 3.800000e+01 8.711200e+04 3.100000e+02 1.562940e+05 9.394000e+03 5.505400e+04 3.434800e+05 7.385615e+05 0.000000e+00 1.438272e+08 3.613230e-01 0.000000e+00 7.669250e-01 4.429680e-01
max 1.462357e+08 7.671430e+05 2.282130e+05 8.330540e+05 4.925890e+05 1.000000e+00 1.000000e+00 1.000000e+00 1.400000e+01 2.900000e+01 3.450552e+08 2.084530e+05 2.738600e+04 8.533000e+03 2.084530e+05 1.641000e+03 2.205230e+05 1.242800e+04 3.452775e+08 4.925890e+05 1.184706e+06 2.000000e+00 1.462357e+08 8.671330e-01 1.000000e+00 7.669250e-01 1.000000e+00
[31]:
print("The number of records: "+ str(len(data['problem_log_id'].unique())))
The number of records: 6123270
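The unique-ID count above equals the row count reported by `data.describe()`, which suggests one row per logged problem. The `describe()` statistics for `problemlogid` are also identical to those for `problem_log_id`, so the two columns may be duplicates. A minimal sketch of how both points could be checked directly, using a made-up toy frame in place of `data` (the values are illustrative only):

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "problem_log_id": [101, 102, 103],
    "problemlogid": [101, 102, 103],
})

# True when no log ID repeats, i.e. one row per logged problem.
ids_are_unique = toy["problem_log_id"].is_unique

# True when the two ID columns hold the same values row by row.
columns_match = toy["problem_log_id"].equals(toy["problemlogid"])

print(ids_are_unique, columns_match)
```

If both checks hold on the real frame, one of the two ID columns could be dropped without losing information.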
[32]:
print('Fraction of missing values for every column')
print(data.isnull().sum() / len(data))
Fraction of missing values for every column
problem_log_id                       0.000000
skill                                0.570478
problem_id                           0.000000
user_id                              0.000000
assignment_id                        0.000000
assistment_id                        0.000000
start_time                           0.000000
end_time                             0.000000
problem_type                         0.000000
original                             0.000000
correct                              0.000000
bottom_hint                          0.009856
hint_count                           0.000000
actions                              0.000000
attempt_count                        0.000000
ms_first_response                    0.000000
tutor_mode                           0.000000
sequence_id                          0.000000
student_class_id                     0.000000
position                             0.000000
type                                 0.000000
base_sequence_id                     0.000000
skill_id                             0.557130
teacher_id                           0.000000
school_id                            0.000026
overlap_time                         0.000000
template_id                          0.000000
answer_id                            0.998649
answer_text                          0.056561
first_action                         0.000000
problemlogid                         0.000000
Average_confidence(FRUSTRATED)       0.000000
Average_confidence(CONFUSED)         0.000000
Average_confidence(CONCENTRATING)    0.000000
Average_confidence(BORED)            0.000000
dtype: float64
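The same fractions can be obtained with `isnull().mean()`, since the mean of a boolean column is the share of `True` values. The sketch below uses a made-up toy frame (not the real data) to show the equivalence and to sort the report so the sparsest columns come first:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "skill": ["Rounding", None, None, "Proportion"],
    "user_id": [1, 2, 3, 4],
})

# Two equivalent ways to compute the per-column missing fraction.
by_sum = toy.isnull().sum() / len(toy)
by_mean = toy.isnull().mean()

# Put the columns with the most missing values at the top.
report = by_mean.sort_values(ascending=False)
print(report)
```

On the real frame, sorting this way would surface `answer_id`, `skill`, and `skill_id` first, as they have the largest missing fractions in the output above.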
[33]:
len(data.user_id.unique())
[33]:
46674
[34]:
ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

ds['user_id'] = ds['user_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'user_id',
    orientation='h',
    title='Top 40 students by number of actions'
)

fig.show('svg')
[Figure: Top 40 students by number of actions]
[35]:
ds = data['user_id'].value_counts().reset_index()

ds.columns = [
    'user_id',
    'count'
]

ds = ds.sort_values('user_id')

fig = px.histogram(
    ds,
    x = 'user_id',
    y = 'count',
    title = 'User action distribution'
)

fig.show('svg')
[Figure: User action distribution]
[36]:
ds = data['problem_id'].value_counts().reset_index()

ds.columns = [
    'problem_id',
    'count'
]

ds['problem_id'] = ds['problem_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'problem_id',
    orientation = 'h',
    title = 'Top 40 most frequent problem_ids'
)

fig.show('svg')
[Figure: Top 40 most frequent problem_ids]
[37]:
ds = data['problem_id'].value_counts().reset_index()

ds.columns = [
    'problem_id',
    'count'
]

ds = ds.sort_values('problem_id')

fig = px.histogram(
    ds,
    x='problem_id',
    y='count',
    title='problem_id action distribution'
)

fig.show('svg')
[Figure: problem_id action distribution]
[38]:
ds = data['correct'].value_counts().reset_index()

ds.columns = [
    'correct',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['correct'])

fig = px.pie(
    ds,
    names = 'correct',  # label each slice with its actual 'correct' value
    values = 'percent',
    title = 'Percent of correct answers'
)

fig.show('svg')
"Minor note: we also have Essay questions that teachers can grade. If this value is, say, 0.25, that means the teacher gave it a 1 out of 4."
[Figure: Percent of correct answers]
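Because teacher-graded essay questions produce fractional `correct` values, it can help to separate plain right/wrong rows from partial-credit rows before computing accuracy. The following is a minimal sketch on a made-up toy frame; the names `is_binary`, `binary_share`, and `partial_values` are illustrative and not part of the dataset:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({"correct": [1.0, 0.0, 0.25, 1.0, 0.75, 0.0]})

# Rows scored as plain right/wrong, as opposed to teacher-graded partial credit.
is_binary = toy["correct"].isin([0, 1])

# Share of rows that are plain right/wrong.
binary_share = is_binary.mean()

# The distinct partial-credit scores that occur.
partial_values = sorted(toy.loc[~is_binary, "correct"].unique())

print(binary_share, partial_values)
```

Whether partial-credit rows should be dropped, rounded, or kept as they are depends on the downstream analysis; this sketch only makes them visible.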

Sort by sequence id

[39]:
ds = data['sequence_id'].value_counts().reset_index()

ds.columns = [
    'sequence_id',
    'count'
]

ds['sequence_id'] = ds['sequence_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'sequence_id',
    orientation = 'h',
    title = 'Top 40 most frequent sequence_ids'
)

fig.show('svg')
[Figure: Top 40 most frequent sequence_ids]

Sort by problem types

[40]:
ds = data['problem_type'].value_counts().reset_index()

ds.columns = [
    'problem_type',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['percent'])

fig = px.pie(
    ds,
    names = 'problem_type',
    values = 'percent',
    title = 'Percent of Problem Types',
)

fig.show('svg')
[Figure: Percent of Problem Types]
[41]:
ds = ds.sort_values(['percent']).tail(6)

fig = make_subplots(rows=3, cols=2)

traces = [
    go.Bar(
        x = ['wrong', 'right'],
        y = [
            len(data[(data['problem_type'] == item) & (data['correct'] == 0)]),
            len(data[(data['problem_type'] == item) & (data['correct'] == 1)])
        ],
        name = 'Type: ' + str(item),
        text = [
            str(round(100*len(data[(data['problem_type'] == item)&(data['correct'] == 0)])/len(data[data['problem_type'] == item]),2)) + '%',
            str(round(100*len(data[(data['problem_type'] == item)&(data['correct'] == 1)])/len(data[data['problem_type'] == item]),2)) + '%'
        ],
        textposition = 'auto'
    ) for item in ds['problem_type'].unique().tolist()
]

for i, trace in enumerate(traces):
    # Fill the 3x2 grid row by row (plotly rows and columns are 1-based).
    fig.add_trace(
        trace,
        row = (i // 2) + 1,
        col = (i % 2) + 1
    )

fig.update_layout(
    title_text = 'Percent of correct answers for every problem type',
)

fig.show('svg')

[Figure: Percent of correct answers for every problem type]
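The cell above filters the full frame several times for each problem type. A more compact way to get the same percentages, sketched here on a made-up toy frame, is `pd.crosstab` with `normalize="index"`. As in the cell above, the denominator is every row of that problem type, including partial-credit rows. The column labels `wrong %` and `right %` are illustrative:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "problem_type": ["algebra", "algebra", "choose_1",
                     "choose_1", "choose_1", "algebra"],
    "correct": [1.0, 0.0, 1.0, 1.0, 0.0, 0.5],
})

# Row-normalised counts: within each problem type, the shares of all
# observed 'correct' values sum to 1.
shares = pd.crosstab(toy["problem_type"], toy["correct"], normalize="index")

# Keep only the plain wrong (0) and right (1) columns, as percentages.
wrong_right = (100 * shares[[0.0, 1.0]]).round(2)
wrong_right.columns = ["wrong %", "right %"]

print(wrong_right)
```

On the real frame this is a single pass over the data, and the resulting table could feed the bar traces directly.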

Sort by schools

[42]:
len(data['school_id'].unique())
[42]:
662
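Note that `unique()` treats NaN as a value, and `school_id` has a small fraction of missing entries (see the missing-value report above), so the count above includes NaN as one "school". A minimal sketch of the difference between `unique()` and `nunique()`, on a made-up toy series:

```python
import pandas as pd

# Toy series standing in for data['school_id']; the values are invented.
toy = pd.Series([5048.0, 7561.0, None, 5048.0], name="school_id")

# unique() keeps NaN as one of the distinct values.
with_nan = len(toy.unique())

# nunique() excludes NaN by default (dropna=True).
without_nan = toy.nunique()

print(with_nan, without_nan)
```

Using `data['school_id'].nunique()` on the real frame would report the number of actual school IDs.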
[43]:
ds = data['school_id'].value_counts().reset_index()

ds.columns = [
    'school_id',
    'percent'
]

ds['percent'] /= len(data)
ds = ds.sort_values(['percent'])

fig = px.pie(
    ds,
    names = 'school_id',
    values = 'percent',
    title = 'Percent of schools',
)

fig.show('svg')
[Figure: Percent of schools]
[44]:
ds = data['school_id'].value_counts().reset_index()

ds.columns = [
    'school_id',
    'count'
]

ds['school_id'] = ds['school_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(20)

fig = px.bar(
    ds,
    x = 'count',
    y = 'school_id',
    orientation = 'h',
    title = 'Top 20 most frequent school_ids'
)

fig.show('svg')
[Figure: Top 20 most frequent school_ids]

Sort by attempt counts

[45]:
ds = data['attempt_count'].value_counts().reset_index()

ds.columns = [
    'attempt_count',
    'count'
]

ds['attempt_count'] = ds['attempt_count'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'attempt_count',
    orientation = 'h',
    title = 'Top 40 most frequent attempt counts'
)

fig.show('svg')
[Figure: Top 40 most frequent attempt counts]

Sort by skills

[46]:
ds = data['skill_id'].dropna() # The 'skill_id' column has fewer NaNs than the 'skill' column.
ds = ds.value_counts().reset_index()

ds.columns = [
    'skill_id',
    'count'
]

ds['skill_id'] = ds['skill_id'].astype(str) + '-'
ds = ds.sort_values(['count']).tail(40)

fig = px.bar(
    ds,
    x = 'count',
    y = 'skill_id',
    orientation = 'h',
    title = 'Top 40 most frequent skill_ids'
)

fig.show('svg')
[Figure: Top 40 most frequent skill_ids]