KDD Cup 2010 —— Data Analysis on algebra_2006_2007_train

Data Description

Column Description

Attribute

Annotaion

Row

The row number

Anon Student Id

Unique, anonymous identifier for a student

Problem Hierarchy

The hierarchy of curriculum levels containing the problem

Problem Name

Unique identifier for a problem

Problem View

The total number of times the student encountered the problem so far

Step Name

Unique identifier for one of the steps in a problem

Step Start Time

The starting time of the step (Can be null)

First Transaction Time

The time of the first transaction toward the step

Correct Transaction Time

The time of the correct attempt toward the step, if there was one

Step End Time

The time of the last transaction toward the step

Step Duration (sec)

The elapsed time of the step in seconds, calculated by adding all of the durations for transactions that were attributed to the step (Can be null if step start time is null)

Correct Step Duration (sec)

The step duration if the first attempt for the step was correct

Error Step Duration (sec)

The step duration if the first attempt for the step was an error (incorrect attempt or hint request)

Correct First Attempt

The tutor’s evaluation of the student’s first attempt on the step—1 if correct, 0 if an error

Incorrects

Total number of incorrect attempts by the student on the step

Hints

Total number of hints requested by the student for the step

Corrects

Total correct attempts by the student for the step (only increases if the step is encountered more than once)

KC(KC Model Name)

The identified skills that are used in a problem, where available

Opportunity(KC Model Name)

A count that increases by one each time the student encounters a step with the listed knowledge component

Additional KC models, which exist for the challenge data sets, will appear as additional pairs of columns (KC and Opportunity columns for each model)

For the test portion of the challenge data sets, values will not be provided for the following columns:

♦ Step Start Time

♦ First Transaction Time

♦ Correct Transaction Time

♦ Step End Time

♦ Step Duration (sec)

♦ Correct Step Duration (sec)

♦ Error Step Duration (sec)

♦ Correct First Attempt

♦ Incorrects

♦ Hints

♦ Corrects

[1]:
import pandas as pd
import plotly.express as px
[2]:
path = "algebra_2006_2007_train.txt"
data = pd.read_table(path, encoding="ISO-8859-15", low_memory=False)

Record Examples

[3]:
pd.set_option('display.max_column', 500)
data.head()
[3]:
Row Anon Student Id Problem Hierarchy Problem Name Problem View Step Name Step Start Time First Transaction Time Correct Transaction Time Step End Time Step Duration (sec) Correct Step Duration (sec) Error Step Duration (sec) Correct First Attempt Incorrects Hints Corrects KC(Default) Opportunity(Default)
0 1 JG4Tz Unit CTA1_01, Section CTA1_01-1 LDEMO_WKST 1 R1C1 2006-10-26 09:51:58.0 2006-10-26 09:52:30.0 2006-10-26 09:53:30.0 2006-10-26 09:53:30.0 92.0 NaN 92.0 0 2 0 1 NaN NaN
1 2 JG4Tz Unit CTA1_01, Section CTA1_01-1 LDEMO_WKST 1 R1C2 2006-10-26 09:53:30.0 2006-10-26 09:53:41.0 2006-10-26 09:53:41.0 2006-10-26 09:53:41.0 11.0 11.0 NaN 1 0 0 1 NaN NaN
2 3 JG4Tz Unit CTA1_01, Section CTA1_01-1 LDEMO_WKST 1 R2C1 2006-10-26 09:53:41.0 2006-10-26 09:53:46.0 2006-10-26 09:53:46.0 2006-10-26 09:53:46.0 5.0 5.0 NaN 1 0 0 1 Identifying units 1
3 4 JG4Tz Unit CTA1_01, Section CTA1_01-1 LDEMO_WKST 1 R2C2 2006-10-26 09:53:46.0 2006-10-26 09:53:50.0 2006-10-26 09:53:50.0 2006-10-26 09:53:50.0 4.0 4.0 NaN 1 0 0 1 Identifying units 2
4 5 JG4Tz Unit CTA1_01, Section CTA1_01-1 LDEMO_WKST 1 R4C1 2006-10-26 09:53:50.0 2006-10-26 09:54:05.0 2006-10-26 09:54:05.0 2006-10-26 09:54:05.0 15.0 15.0 NaN 1 0 0 1 Entering a given 1
[4]:
data.describe()
[4]:
Row Problem View Step Duration (sec) Correct Step Duration (sec) Error Step Duration (sec) Correct First Attempt Incorrects Hints Corrects
count 2.270384e+06 2.270384e+06 2.267551e+06 1.751638e+06 515913.000000 2.270384e+06 2.270384e+06 2.270384e+06 2.270384e+06
mean 1.513120e+06 1.092910e+00 1.958364e+01 1.171716e+01 46.292087 7.722359e-01 4.455044e-01 1.184311e-01 1.062878e+00
std 8.736198e+05 3.448857e-01 4.768345e+01 2.645318e+01 81.817794 4.193897e-01 2.000914e+00 6.199071e-01 6.894285e-01
min 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 7.577408e+05 1.000000e+00 3.000000e+00 3.000000e+00 11.000000 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
50% 1.511844e+06 1.000000e+00 7.000000e+00 5.000000e+00 22.000000 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
75% 2.269432e+06 1.000000e+00 1.700000e+01 1.100000e+01 47.000000 1.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
max 3.025933e+06 1.000000e+01 3.208000e+03 1.204000e+03 3208.000000 1.000000e+00 3.600000e+02 1.020000e+02 9.200000e+01
[5]:
print("Part of missing values for every column")
print(data.isnull().sum() / len(data))
Part of missing values for every column
Row                            0.000000
Anon Student Id                0.000000
Problem Hierarchy              0.000000
Problem Name                   0.000000
Problem View                   0.000000
Step Name                      0.000000
Step Start Time                0.001103
First Transaction Time         0.000000
Correct Transaction Time       0.034757
Step End Time                  0.000000
Step Duration (sec)            0.001248
Correct Step Duration (sec)    0.228484
Error Step Duration (sec)      0.772764
Correct First Attempt          0.000000
Incorrects                     0.000000
Hints                          0.000000
Corrects                       0.000000
KC(Default)                    0.203407
Opportunity(Default)           0.203407
dtype: float64
[6]:
print("the number of records:")
print(len(data))
the number of records:
2270384
[7]:
print("how many students are there in the table:")
print(len(data['Anon Student Id'].unique()))
how many students are there in the table:
1338
[8]:
print("how many problems are there in the table:")
print(len(data['Problem Name'].unique()))
how many problems are there in the table:
91913

Sort by Anon Student Id

[9]:
ds = data['Anon Student Id'].value_counts().reset_index()
ds.columns = [
    'Anon Student Id',
    'count'
]
ds['Anon Student Id'] = ds['Anon Student Id'].astype(str) + '-'
ds = ds.sort_values('count').tail(40)

fig = px.bar(
    ds,
    x='count',
    y='Anon Student Id',
    orientation='h',
    title='Top 40 students by number of steps they have done'
)
fig.show("svg")
../../_images/build_blitz_KDD_Cup_2010_16_0.svg

Percent of corrects, hints and incorrects

[10]:
count_corrects = data['Corrects'].sum()
count_hints = data['Hints'].sum()
count_incorrects = data['Incorrects'].sum()

total = count_corrects + count_hints + count_incorrects

percent_corrects = count_corrects / total
percent_hints = count_hints / total
percent_incorrects = count_incorrects / total

dfl = [['corrects', percent_corrects], ['hints', percent_hints], ['incorrects', percent_incorrects]]

df = pd.DataFrame(dfl, columns=['transaction type', 'percent'])

fig = px.pie(
    df,
    names=['corrects', 'hints', 'incorrects'],
    values='percent',
    title='Percent of corrects, hints and incorrects'
)
fig.show("svg")
../../_images/build_blitz_KDD_Cup_2010_18_0.svg

Sort by Problem Name

[11]:
storeProblemCount = [1]
storeProblemName = [data['Problem Name'][0]]
currentProblemName = data['Problem Name'][0]
currentStepName = [data['Step Name'][0]]
lastIndex = 0

for i in range(1, len(data), 1):
    pbNameI = data['Problem Name'][i]
    stNameI = data['Step Name'][i]
    if pbNameI != data['Problem Name'][lastIndex]:
        currentStepName = [stNameI]
        currentProblemName = pbNameI
        if pbNameI not in storeProblemName:
            storeProblemName.append(pbNameI)
            storeProblemCount.append(1)
        else:
            storeProblemCount[storeProblemName.index(pbNameI)] += 1
        lastIndex = i
    elif stNameI not in currentStepName:
        currentStepName.append(stNameI)
        lastIndex = i
    else:
        currentStepName = [stNameI]
        storeProblemCount[storeProblemName.index(pbNameI)] += 1
        lastIndex = i

dfData = {
    'Problem Name': storeProblemName,
    'count': storeProblemCount
}
df = pd.DataFrame(dfData).sort_values('count').tail(40)
df["Problem Name"] += '-'

fig = px.bar(
    df,
    x='count',
    y='Problem Name',
    orientation='h',
    title='Top 40 useful problem'
)
fig.show("svg")
../../_images/build_blitz_KDD_Cup_2010_20_0.svg
[12]:
data['total transactions'] = data['Incorrects'] + data['Hints'] + data['Corrects']
df1 = data.groupby('Problem Name')['total transactions'].sum().reset_index()
df2 = data.groupby('Problem Name')['Corrects'].sum().reset_index()
df1['Corrects'] = df2['Corrects']
df1['Correct rate'] = df1['Corrects'] / df1['total transactions']

df1 = df1.sort_values('total transactions')
count = 0
standard = 500
for i in df1['total transactions']:
    if i > standard:
        count += 1
df1 = df1.tail(count)

df1 = df1.sort_values('Correct rate')

df1['Problem Name'] = df1['Problem Name'].astype(str) + "-"

df_px = df1.tail(20)

fig = px.bar(
    df_px,
    x='Correct rate',
    y='Problem Name',
    orientation='h',
    title='Correct rate of each problem (top 20)  (total transactions of \
each problem are required to be more than 500)',
    text='Problem Name'
)
fig.update_layout(title_font_size=10)
fig.show("svg")

df_px = df1.head(20)

fig = px.bar(
    df_px,
    x='Correct rate',
    y='Problem Name',
    orientation='h',
    title='Correct rate of each problem (bottom 20)  (total transactions of \
each problem are required to be more than 500)',
    text='Problem Name'
)
fig.update_layout(title_font_size=10)
fig.show("svg")
../../_images/build_blitz_KDD_Cup_2010_21_0.svg
../../_images/build_blitz_KDD_Cup_2010_21_1.svg

These two figures present the correct rate of problems. Problems with low correct rate deserve more attention from teachers and students.

Sort by KC

[13]:
data.dropna(subset=['KC(Default)'], inplace=True)

data['total transactions'] = data['Corrects'] + data['Hints'] + data['Incorrects']
df1 = data.groupby('KC(Default)')['total transactions'].sum().reset_index()
df2 = data.groupby('KC(Default)')['Corrects'].sum().reset_index()
df1['Corrects'] = df2['Corrects']
df1['correct rate'] = df1['Corrects'] / df1['total transactions']

count = 0
standard = 300
for i in df1['total transactions']:
    if i > standard:
        count += 1
df1 = df1.sort_values('total transactions').tail(count)

df1 = df1.sort_values('correct rate')

df1['KC(Default)'] = df1['KC(Default)'].astype(str) + '-'

df_px = df1.tail(20)

fig = px.bar(
    df_px,
    x='correct rate',
    y='KC(Default)',
    orientation='h',
    title='Correct rate of each KC(Default) (top 20)  (total transactions of \
each KC are required to be more than 300)',
    text='KC(Default)'
)
fig.update_yaxes(visible=False)
fig.update_layout(title_font_size=10)
fig.show("svg")

df_px = df1.head(20)

fig = px.bar(
    df_px,
    x='correct rate',
    y='KC(Default)',
    orientation='h',
    title='Correct rate of each KC(Default) (bottom 20)  (total transactions of \
each KC are required to be more than 300)',
    text='KC(Default)'
)
fig.update_yaxes(visible=False)
fig.update_layout(title_font_size=10)
fig.show("svg")
../../_images/build_blitz_KDD_Cup_2010_24_0.svg
../../_images/build_blitz_KDD_Cup_2010_24_1.svg

These two figures present the correct rate of KCs. KCs with low correct rate deserve more attention from teachers and students.

Postscript

Given that the whole data package is composed of 5 data sets and data files in these 5 data sets that can be used to conduct data analysis share the same data format, the following analysis based on “algebra_2006_2007_train” is just an example of data analysis on KDD Cup, and the code can be used to analyse other data files with some small changes on the file path and column names.