OLI Data in Fall 2011 (Step Level)

[1]:
%matplotlib inline
import pandas as pd
import numpy as np
# global configuration: show all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

1. Data Description

1.1 Column Description

[2]:
# help_table2: column descriptions for the step-level export
df2 = pd.read_csv('OLI_data/help_table2.csv',sep=',',encoding="gbk")
df2 = df2.loc[:, ['Field', 'Annotation']]
df2
[2]:
Field Annotation
0 Row A row counter.
1 Sample The sample that includes this step. If you select more than one sample to export, steps that occur in more than one sample will be duplicated in the export.
2 Anon Student ID The student that performed the step.
3 Problem Hierarchy The location in the curriculum hierarchy where this step occurs.
4 Problem Name The name of the problem in which the step occurs.
5 Problem View The number of times the student encountered the problem so far. This counter increases with each instance of the same problem. Note that problem view increases regardless of whether or not the step was encountered in previous problem views. For example, a step can have a "Problem View" of "3", indicating the problem was viewed three times by this student, but that same step need not have been encountered by that student in all instances of the problem. If this number does not increase as you expect it to, it might be that DataShop has identified similar problems as distinct: two problems with the same "Problem Name" are considered different "problems" by DataShop if the following logged values are not identical: problem name, context, tutor_flag (whether or not the problem or activity is tutored) and "other" field. For more on the logging of these fields, see the description of the "problem" element in the Guide to the Tutor Message Format. For more detail on how problem view is determined, see Determining Problem View.
6 Step Name Formed by concatenating the "selection" and "action". Also see the glossary entry for "step".
7 Step Start Time The step start time is determined one of three ways: If it's the first step of the problem, the step start time is the same as the problem start time If it's a subsequent step, then the step start time is the time of the preceding transaction, if that transaction is within 10 minutes. If it's a subsequent step and the elapsed time between the previous transaction and the first transaction of this step is more than 10 minutes, then the step start time is set to null as it's considered an unreliable value. For a visual example, see the Examples page.
8 First Transaction Time The time of the first transaction toward the step.
9 Correct Transaction Time The time of the correct attempt toward the step, if there was one.
10 Step End Time The time of the last transaction toward the step.
11 Step Duration (sec) The elapsed time of the step in seconds, calculated by adding all of the durations for transactions that were attributed to the step. See the glossary entry for more detail. This column was previously labeled "Assistance Time". It differs from "Assistance Time" in that its values are derived by summing transaction durations, not finding the difference between only two points in time (step start time and the last correct attempt).
12 Correct Step Duration (sec) The step duration if the first attempt for the step was correct. This might also be described as "reaction time" since it's the duration of time from the previous transaction or problem start event to the correct attempt. See the glossary entry for more detail. This column was previously labeled "Correct Step Time (sec)".
13 Error Step Duration (sec) The step duration if the first attempt for the step was an error (incorrect attempt or hint request).
14 First Attempt The tutor's response to the student's first attempt on the step. Example values are "hint", "correct", and "incorrect".
15 Incorrects Total number of incorrect attempts by the student on the step.
16 Hints Total number of hints requested by the student for the step.
17 Corrects Total correct attempts by the student for the step. (Only increases if the step is encountered more than once.)
18 Condition The name and type of the condition the student is assigned to. In the case of a student assigned to multiple conditions (factors in a factorial design), condition names are separated by a comma and space. This differs from the transaction format, which optionally has "Condition Name" and "Condition Type" columns.
19 KC (model_name) (Only shown when the "Knowledge Components" option is selected.) Knowledge component(s) associated with the correct performance of this step. In the case of multiple KCs assigned to a single step, KC names are separated by two tildes ("~~").
20 Opportunity (model_name) (Only shown when the "Knowledge Components" option is selected.) An opportunity is the first chance on a step for a student to demonstrate whether he or she has learned the associated knowledge component. Opportunity number is therefore a count that increases by one each time the student encounters a step with the listed knowledge component. In the case of multiple KCs assigned to a single step, opportunity number values are separated by two tildes ("~~") and are given in the same order as the KC names. Check here to see how opportunity count is computed when Event Type column is present in transaction data.
21 Predicted Error Rate (model_name) A hypothetical error rate based on the Additive Factor Model (AFM) algorithm. A value of "1" is a prediction that a student's first attempt will be an error (incorrect attempt or hint request); a value of "0" is a prediction that the student's first attempt will be correct. For specifics, see below "Predicted Error Rate" and how it's calculated. In the case of multiple KCs assigned to a single step, Datashop implements a compensatory sum across all of the KCs, thus a single value of predicted error rate is provided (i.e., the same predicted error rate for each KC assigned to a step). For more detail on Datashop's implementation for multi-skilled step, see Model Values page.
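
When a step maps to multiple KCs, the "KC (model_name)" and "Opportunity (model_name)" fields above are "~~"-joined strings. A minimal sketch of expanding them into one row per KC, assuming the step table is loaded as df_step (as in section 1.2 below) and pandas >= 1.3 for multi-column explode:

[ ]:
# Sketch: expand "~~"-joined multi-KC steps into one row per KC.
# Assumes df_step is the step table loaded in section 1.2.
kc_split = df_step['KC (F2011)'].str.split('~~')            # list of KC names per step
opp_split = df_step['Opportunity (F2011)'].str.split('~~')  # parallel opportunity counts
df_kc = df_step.assign(**{'KC (F2011)': kc_split, 'Opportunity (F2011)': opp_split})
df_kc = df_kc.explode(['KC (F2011)', 'Opportunity (F2011)'])  # needs pandas >= 1.3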

1.2 Data Summary

This table organizes the data at the student-problem-step level.

[3]:
df_step = pd.read_csv('OLI_data/AllData_student_step_2011F.csv', low_memory=False) # sep="\t"
df_step.head(2)
[3]:
Row Sample Anon Student Id Problem Hierarchy Problem Name Problem View Step Name Step Start Time First Transaction Time Correct Transaction Time Step End Time Step Duration (sec) Correct Step Duration (sec) Error Step Duration (sec) First Attempt Incorrects Hints Corrects Condition KC (F2011) Opportunity (F2011) Predicted Error Rate (F2011) KC (Single-KC) Opportunity (Single-KC) Predicted Error Rate (Single-KC) KC (Unique-step) Opportunity (Unique-step) Predicted Error Rate (Unique-step)
0 1 All Data Stu_00b2b35fd027e7891e8a1a527125dd65 sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams _m2_assess 1 q1_point1i1 UpdateComboBox 2011/9/21 17:35 2011/9/21 17:35 2011/9/21 17:35 2011/9/21 17:35 23.13 23.13 . correct 0 0 1 . identify_interaction 1 0.3991 Single-KC 1 0.4373 NaN NaN NaN
1 2 All Data Stu_00b2b35fd027e7891e8a1a527125dd65 sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams _m2_assess 1 q1_point3i3 UpdateComboBox 2011/9/21 17:35 2011/9/21 17:35 2011/9/21 17:35 2011/9/21 17:35 23.13 23.13 . correct 0 0 1 . gravitational_forces 1 0.1665 Single-KC 2 0.4373 NaN NaN NaN
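
The Problem Hierarchy values above pack the sequence, unit, and module into a single comma-separated string; a sketch of splitting them into separate columns (assumes every row follows the "sequence ..., unit ..., module ..." pattern visible in this sample):

[ ]:
# Sketch: split Problem Hierarchy into sequence / unit / module columns.
# Assumes the "sequence X, unit Y, module Z" pattern holds for all rows.
hierarchy = df_step['Problem Hierarchy'].str.extract(
    r'sequence (?P<sequence>[^,]+), unit (?P<unit>[^,]+), module (?P<module>.+)')
df_hier = df_step.join(hierarchy)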

2. Data Analysis

[4]:
df_step.describe()
[4]:
Row Problem View Incorrects Hints Corrects Predicted Error Rate (F2011) Opportunity (Single-KC) Predicted Error Rate (Single-KC) Opportunity (Unique-step) Predicted Error Rate (Unique-step)
count 194947.000000 194947.000000 194947.000000 194947.000000 194947.000000 113992.000000 194947.000000 194947.000000 193043.000000 0.0
mean 97474.000000 1.133154 0.379611 0.143172 0.964072 0.237508 419.751066 0.252233 1.035971 NaN
std 56276.495801 0.760515 1.373797 0.852520 0.480346 0.158128 288.365862 0.086406 0.384182 NaN
min 1.000000 1.000000 0.000000 0.000000 0.000000 0.002900 1.000000 0.038600 1.000000 NaN
25% 48737.500000 1.000000 0.000000 0.000000 1.000000 0.117900 171.000000 0.188100 1.000000 NaN
50% 97474.000000 1.000000 0.000000 0.000000 1.000000 0.201400 382.000000 0.240500 1.000000 NaN
75% 146210.500000 1.000000 0.000000 0.000000 1.000000 0.319500 635.000000 0.294700 1.000000 NaN
max 194947.000000 32.000000 413.000000 43.000000 86.000000 0.969300 1410.000000 0.773600 24.000000 NaN
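
The maxima above (413 incorrects and 43 hints on a single step, against 75th percentiles of 0) point to heavy-tailed count distributions. A quick quantile check of those tails (sketch):

[ ]:
# Sketch: how extreme are the attempt-counter tails?
for col in ['Incorrects', 'Hints', 'Corrects']:
    q99 = df_step[col].quantile(0.99)
    print(col, '- 99th percentile:', q99, '- rows above it:', (df_step[col] > q99).sum())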
[5]:
num_total = len(df_step)
num_students = len(df_step['Anon Student Id'].unique())
num_problems = len(df_step['Problem Name'].unique())
num_kcs = len(df_step['KC (F2011)'].unique())
num_null_condition = df_step['Condition'].isnull().sum()  # rows with a null Condition could be dropped
print("num_total:",num_total)
print("num_students:",num_students)
print("num_problems:",num_problems)
print("num_kcs:",num_kcs)
print("num_null_condition:",num_null_condition)

n_incorrects = df_step['Incorrects'].sum()
n_hints = df_step['Hints'].sum()
n_corrects = df_step['Corrects'].sum()
print("\n","*"*30,"\n")
print(n_incorrects,n_hints,n_corrects)
print(n_corrects / (n_incorrects + n_hints + n_corrects))
num_total: 194947
num_students: 333
num_problems: 300
num_kcs: 98
num_null_condition: 0

 ******************************

74004 27911 187943
0.6483968011923079
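
The ~0.648 ratio above pools all transactions across students; a per-student view of first-attempt accuracy (a related but stricter measure) can be sketched as:

[ ]:
# Sketch: distribution of first-attempt accuracy across students.
first_correct = (df_step['First Attempt'] == 'correct')
acc_by_student = first_correct.groupby(df_step['Anon Student Id']).mean()
print(acc_by_student.describe())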

(1) Analysis of null and unique values for each column

[6]:
def work_col_analysis(df_work):
    num_nonull_total = df_work.notnull().sum()   # non-null count per column
    dict_col_1 = {'col_name': num_nonull_total.index, 'num_nonull': num_nonull_total.values}
    df_work_col_1 = pd.DataFrame(dict_col_1)

    num_null_total = df_work.isnull().sum()      # null count per column
    dict_col_2 = {'col_name': num_null_total.index, 'num_null': num_null_total.values}
    df_work_col_2 = pd.DataFrame(dict_col_2)

    num_unique_total = df_work.apply(lambda col: len(col.unique()))  # axis=0; NaN counted as a value
    dict_col_3 = {'col_name': num_unique_total.index, 'num_unique': num_unique_total.values}
    df_work_col_3 = pd.DataFrame(dict_col_3)

    df_work_col = pd.merge(df_work_col_1, df_work_col_2, on=['col_name'])
    df_work_col = pd.merge(df_work_col, df_work_col_3, on=['col_name'])
    return df_work_col

print("-------------------num_unique_total and num_nonull_total----------------------")
df_result = work_col_analysis(df_step)
df_result
-------------------num_unique_total and num_nonull_total----------------------
[6]:
col_name num_nonull num_null num_unique
0 Row 194947 0 194947
1 Sample 194947 0 1
2 Anon Student Id 194947 0 333
3 Problem Hierarchy 194947 0 27
4 Problem Name 194947 0 300
5 Problem View 194947 0 32
6 Step Name 194947 0 382
7 Step Start Time 194632 315 33098
8 First Transaction Time 194947 0 34578
9 Correct Transaction Time 182132 12815 33501
10 Step End Time 194947 0 34351
11 Step Duration (sec) 194947 0 2521
12 Correct Step Duration (sec) 194947 0 2187
13 Error Step Duration (sec) 194947 0 2105
14 First Attempt 194947 0 3
15 Incorrects 194947 0 32
16 Hints 194947 0 30
17 Corrects 194947 0 17
18 Condition 194947 0 1
19 KC (F2011) 113992 80955 98
20 Opportunity (F2011) 113992 80955 1206
21 Predicted Error Rate (F2011) 113992 80955 7623
22 KC (Single-KC) 194947 0 1
23 Opportunity (Single-KC) 194947 0 1410
24 Predicted Error Rate (Single-KC) 194947 0 317
25 KC (Unique-step) 193043 1904 1179
26 Opportunity (Unique-step) 193043 1904 25
27 Predicted Error Rate (Unique-step) 0 194947 1
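
The same three summaries can be assembled more compactly with vectorized pandas calls; an equivalent sketch (nunique(dropna=False) counts NaN as a value, matching len(col.unique()) above):

[ ]:
# Sketch: compact equivalent of work_col_analysis.
summary = pd.DataFrame({
    'num_nonull': df_step.notnull().sum(),
    'num_null': df_step.isnull().sum(),
    'num_unique': df_step.nunique(dropna=False),  # NaN counted as a value
})
summary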

(2) Data Cleaning

Data Cleaning Suggestions

  • Redundant columns: columns that are entirely null or contain only a single value.

  • Rows where KC (F2011) is null (the knowledge component for the step is unknown).

  • Rows where Step Start Time is null (per the Step Start Time description, a gap over 10 minutes makes the value unreliable).

  • Others: issues found along the way, such as the non-numeric '.' placeholder handled under Outlier Analysis below.

[7]:
df_step_clear = df_step.copy(deep=True) # deep copy
[8]:
# drop every "redundant" column (single-valued or all-null) directly
cols = list(df_step.columns.values)
drop_cols = []
for col in cols:
    if len(df_step_clear[col].unique().tolist()) == 1:
        df_step_clear.drop(col,axis =1,inplace=True)
        drop_cols.append(col)

print("the cols num before clear: ",len(df_step.columns.to_list()))
print("the cols num after clear:",len(df_step_clear.columns.to_list()))
for col in drop_cols:
    print("drop:---",col)
the cols num before clear:  28
the cols num after clear: 24
drop:--- Sample
drop:--- Condition
drop:--- KC (Single-KC)
drop:--- Predicted Error Rate (Unique-step)
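
The same cleanup can be done without an explicit loop; an equivalent vectorized sketch (producing a new frame df_step_alt rather than mutating in place):

[ ]:
# Sketch: vectorized drop of every single-valued column (NaN counted as a value).
single_valued = df_step.columns[df_step.nunique(dropna=False) <= 1]
df_step_alt = df_step.drop(columns=single_valued)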
[9]:
# drop rows where 'KC (F2011)' or 'Step Start Time' is null
df_step_clear.dropna(axis=0, how='any', subset=['KC (F2011)','Step Start Time'],inplace = True)
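
A quick sanity check of the filter (because some rows are null in both columns, the result should land slightly above 194947 - 80955 - 315):

[ ]:
# Sketch: row counts before and after dropping null KC / Step Start Time rows.
print("rows before:", len(df_step), "rows after:", len(df_step_clear))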
[10]:
# the remaining columns
print("-------------------num_unique_toal and num_nonull_toal----------------------")
df_result = work_col_analysis(df_step_clear)
df_result
-------------------num_unique_toal and num_nonull_toal----------------------
<class 'pandas.core.series.Series'>
[10]:
col_name num_nonull num_null num_unique
0 Row 113817 0 113817
1 Anon Student Id 113817 0 331
2 Problem Hierarchy 113817 0 26
3 Problem Name 113817 0 154
4 Problem View 113817 0 32
5 Step Name 113817 0 240
6 Step Start Time 113817 0 18856
7 First Transaction Time 113817 0 19745
8 Correct Transaction Time 103454 10363 19146
9 Step End Time 113817 0 19623
10 Step Duration (sec) 113817 0 2382
11 Correct Step Duration (sec) 113817 0 2093
12 Error Step Duration (sec) 113817 0 1949
13 First Attempt 113817 0 3
14 Incorrects 113817 0 25
15 Hints 113817 0 25
16 Corrects 113817 0 15
17 KC (F2011) 113817 0 97
18 Opportunity (F2011) 113817 0 1205
19 Predicted Error Rate (F2011) 113817 0 7622
20 Opportunity (Single-KC) 113817 0 1164
21 Predicted Error Rate (Single-KC) 113817 0 315
22 KC (Unique-step) 112869 948 625
23 Opportunity (Unique-step) 112869 948 25

Outlier Analysis

  • The duration columns contain a non-numeric value '.', which appears to represent 0.

  • In addition, box plots can be used to check whether some outliers need to be removed.

[11]:
print(df_step_clear.columns.tolist())
print("-"*100)
print(df_step_clear.describe().columns.tolist()) # many object-dtype columns are excluded from numeric summaries
print("-"*100)
print(df_step_clear.dtypes)
['Row', 'Anon Student Id', 'Problem Hierarchy', 'Problem Name', 'Problem View', 'Step Name', 'Step Start Time', 'First Transaction Time', 'Correct Transaction Time', 'Step End Time', 'Step Duration (sec)', 'Correct Step Duration (sec)', 'Error Step Duration (sec)', 'First Attempt', 'Incorrects', 'Hints', 'Corrects', 'KC (F2011)', 'Opportunity (F2011)', 'Predicted Error Rate (F2011)', 'Opportunity (Single-KC)', 'Predicted Error Rate (Single-KC)', 'KC (Unique-step)', 'Opportunity (Unique-step)']
----------------------------------------------------------------------------------------------------
['Row', 'Problem View', 'Incorrects', 'Hints', 'Corrects', 'Predicted Error Rate (F2011)', 'Opportunity (Single-KC)', 'Predicted Error Rate (Single-KC)', 'Opportunity (Unique-step)']
----------------------------------------------------------------------------------------------------
Row                                   int64
Anon Student Id                      object
Problem Hierarchy                    object
Problem Name                         object
Problem View                          int64
Step Name                            object
Step Start Time                      object
First Transaction Time               object
Correct Transaction Time             object
Step End Time                        object
Step Duration (sec)                  object
Correct Step Duration (sec)          object
Error Step Duration (sec)            object
First Attempt                        object
Incorrects                            int64
Hints                                 int64
Corrects                              int64
KC (F2011)                           object
Opportunity (F2011)                  object
Predicted Error Rate (F2011)        float64
Opportunity (Single-KC)               int64
Predicted Error Rate (Single-KC)    float64
KC (Unique-step)                     object
Opportunity (Unique-step)           float64
dtype: object
[12]:
# Replace '.' with 0 in the duration columns and cast to float
rectify_cols = ['Step Duration (sec)', 'Correct Step Duration (sec)', 'Error Step Duration (sec)']
for col in rectify_cols:
    df_step_clear[col] = df_step_clear[col].apply(lambda x: 0 if x=='.' else x)
    df_step_clear[col] = df_step_clear[col].astype(float)
print(df_step_clear[rectify_cols].dtypes)
Step Duration (sec)            float64
Correct Step Duration (sec)    float64
Error Step Duration (sec)      float64
dtype: object
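
An equivalent but slightly more defensive conversion uses pd.to_numeric with errors='coerce', which turns any stray non-numeric token (not just '.') into NaN before filling with 0 (sketch):

[ ]:
# Sketch: coerce every non-numeric token to NaN, then fill with 0.
for col in rectify_cols:
    df_step_clear[col] = pd.to_numeric(df_step_clear[col], errors='coerce').fillna(0)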

3. Data Visualization

[13]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
%matplotlib inline
[14]:
# Outlier analysis for each duration column

fig = plt.figure()
box_cols = ['Step Duration (sec)', 'Correct Step Duration (sec)','Error Step Duration (sec)']
for i, col in enumerate(box_cols):
    ax = fig.add_subplot(3, 1, i+1)
    ax.boxplot(df_step_clear[df_step_clear[col].notnull()][col].tolist())
plt.tight_layout()
plt.show()

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_21_1.png
[15]:
# The distribution of continuous values
def show_value_counts_histogram(colname, sort = True):
    # create the bins
    start = int(df_step_clear[colname].min()/10)*10
    end = int(df_step_clear[colname].quantile(q=0.95)/10+1)*10
    step = int((end - start)/20)
    print(start, end, step)  # bin start, bin end, bin width
    counts, bins = np.histogram(df_step_clear[colname],bins=range(start, end, step))
    bins = 0.5 * (bins[:-1] + bins[1:])

    fig = px.bar(x=bins, y=counts, labels={'x': colname, 'y':'count'})
    fig.show("svg")


# Box plot of continuous values
def show_value_counts_box(colname, sort = True):
#     fig = px.box(df_step_clear, y=colname)
#     fig.show("svg")
    plt.figure(figsize=(10,5))
    plt.title('Box-plot for '+ colname,fontsize=20)  # set the title and its font size
    plt.boxplot([df_step_clear[colname].tolist()])
    plt.show()


# Histogram of discrete values
def show_value_counts_bar(colname, sort = True):
    ds = df_step_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'Count'
    ]
    if sort:
        ds = ds.sort_values(by='Count', ascending=False)
    # histogram
    fig = px.bar(
        ds,
        x = colname,
        y = 'Count',
        title = colname + ' distribution'
    )
    fig.show("svg")


# Pie of discrete values
def show_value_counts_pie(colname, sort = True):
    ds = df_step_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'percent'
    ]
    ds['percent'] /= len(df_step_clear)
    if sort:
        ds = ds.sort_values(by='percent', ascending=False)
    fig = px.pie(
        ds,
        names = colname,
        values = 'percent',
        title = colname+ ' Percentage',
    )
    fig.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)
    fig.show("svg")
[16]:
# Bar
show_value_counts_bar('First Attempt')
show_value_counts_histogram('Step Duration (sec)')
show_value_counts_box('Step Duration (sec)')
../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_0.svg
0 70 3
../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_2.svg
../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_3.png
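
Step Duration (sec) is heavily right-skewed, so the box plot above is dominated by its long tail; plotting the histogram on a log1p scale (which keeps the zero durations introduced earlier) makes the bulk of the distribution visible. A sketch:

[ ]:
# Sketch: histogram of log1p(Step Duration) to tame the long right tail.
log_dur = np.log1p(df_step_clear['Step Duration (sec)'])
fig = px.histogram(x=log_dur, labels={'x': 'log1p(Step Duration (sec))'})
fig.show("svg")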
[17]:
# Pie
# show_value_counts_pie('KC (F2011)')
show_value_counts_pie('Problem Hierarchy')
show_value_counts_pie('Problem Name')
# show_value_counts_pie('Step Name')
../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_24_0.svg
../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_24_1.svg
[19]:
# distributions of four categorical columns, shown in a 2x2 grid

topnum_max = 50 # show top 50 for each type
fig = make_subplots(rows=2, cols=2,   # 2*2
      start_cell="top-left",
      subplot_titles=('KC (F2011)','Problem Hierarchy','Problem Name','Step Name'),
      column_widths=[0.5, 0.5])
traces = [
    go.Bar(
        x = df_step[colname].value_counts().reset_index().index.tolist()[:topnum_max],
        y = df_step[colname].value_counts().reset_index()[colname].tolist()[:topnum_max],
        name = 'Type: ' + str(colname),
        text = df_step[colname].value_counts().reset_index()['index'].tolist()[:topnum_max],
        textposition = 'auto',
    ) for colname in ['KC (F2011)','Problem Hierarchy','Problem Name','Step Name']
]
for i, trace in enumerate(traces):
    fig.add_trace(
        trace,
        row=(i // 2) + 1,
        col=(i % 2) + 1
    )

fig.update_layout(
    title_text = 'Bar of top 50 distributions for each type ',
)

fig.show("svg")

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_25_0.svg
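
Finally, the cleaned step table can be persisted for later modeling; a sketch (the output file name is hypothetical):

[ ]:
# Sketch: save the cleaned table (file name is hypothetical).
df_step_clear.to_csv('OLI_data/AllData_student_step_2011F_clean.csv', index=False)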