OLI data in fall, 2011（step）¶

[1]:

%matplotlib inline
import pandas as pd
import numpy as np
# global configuration: show every rows and cols
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth',None)
pd.set_option('display.max_columns', None)

1. Data Description¶

1.1 Column Description¶

[2]:

# help_table2: the description for data by steps
df2 = pd.read_csv('OLI_data/help_table2.csv',sep=',',encoding="gbk")
df2 = df2.loc[:, ['Field', 'Annotation']]
df2

[2]:

	Field	Annotation
0	Row	A row counter.
1	Sample	The sample that includes this step. If you select more than one sample to export, steps that occur in more than one sample will be duplicated in the export.
2	Anon Student ID	The student that performed the step.
3	Problem Hierarchy	The location in the curriculum hierarchy where this step occurs.
4	Problem Name	The name of the problem in which the step occurs.
5	Problem View	The number of times the student encountered the problem so far. This counter increases with each instance of the same problem. Note that problem view increases regardless of whether or not the step was encountered in previous problem views. For example, a step can have a "Problem View" of "3", indicating the problem was viewed three times by this student, but that same step need not have been encountered by that student in all instances of the problem. If this number does not increase as you expect it to, it might be that DataShop has identified similar problems as distinct: two problems with the same "Problem Name" are considered different "problems" by DataShop if the following logged values are not identical: problem name, context, tutor_flag (whether or not the problem or activity is tutored) and "other" field. For more on the logging of these fields, see the description of the "problem" element in the Guide to the Tutor Message Format. For more detail on how problem view is determined, see Determining Problem View.
6	Step Name	Formed by concatenating the "selection" and "action". Also see the glossary entry for "step".
7	Step Start Time	The step start time is determined one of three ways: If it's the first step of the problem, the step start time is the same as the problem start time If it's a subsequent step, then the step start time is the time of the preceding transaction, if that transaction is within 10 minutes. If it's a subsequent step and the elapsed time between the previous transaction and the first transaction of this step is more than 10 minutes, then the step start time is set to null as it's considered an unreliable value. For a visual example, see the Examples page.
8	First Transaction Time	The time of the first transaction toward the step.
9	Correct Transaction Time	The time of the correct attempt toward the step, if there was one.
10	Step End Time	The time of the last transaction toward the step.
11	Step Duration (sec)	The elapsed time of the step in seconds, calculated by adding all of the durations for transactions that were attributed to the step. See the glossary entry for more detail. This column was previously labeled "Assistance Time". It differs from "Assistance Time" in that its values are derived by summing transaction durations, not finding the difference between only two points in time (step start time and the last correct attempt).
12	Correct Step Duration (sec)	The step duration if the first attempt for the step was correct. This might also be described as "reaction time" since it's the duration of time from the previous transaction or problem start event to the correct attempt. See the glossary entry for more detail. This column was previously labeled "Correct Step Time (sec)".
13	Error Step Duration (sec)	The step duration if the first attempt for the step was an error (incorrect attempt or hint request).
14	First Attempt	The tutor's response to the student's first attempt on the step. Example values are "hint", "correct", and "incorrect".
15	Incorrects	Total number of incorrect attempts by the student on the step.
16	Hints	Total number of hints requested by the student for the step.
17	Corrects	Total correct attempts by the student for the step. (Only increases if the step is encountered more than once.)
18	Condition	The name and type of the condition the student is assigned to. In the case of a student assigned to multiple conditions (factors in a factorial design), condition names are separated by a comma and space. This differs from the transaction format, which optionally has "Condition Name" and "Condition Type" columns.
19	KC (model_name)	(Only shown when the "Knowledge Components" option is selected.) Knowledge component(s) associated with the correct performance of this step. In the case of multiple KCs assigned to a single step, KC names are separated by two tildes ("~~").
20	Opportunity (model_name)	(Only shown when the "Knowledge Components" option is selected.) An opportunity is the first chance on a step for a student to demonstrate whether he or she has learned the associated knowledge component. Opportunity number is therefore a count that increases by one each time the student encounters a step with the listed knowledge component. In the case of multiple KCs assigned to a single step, opportunity number values are separated by two tildes ("~~") and are given in the same order as the KC names. Check here to see how opportunity count is computed when Event Type column is present in transaction data.
21	Predicted Error Rate (model_name)	A hypothetical error rate based on the Additive Factor Model (AFM) algorithm. A value of "1" is a prediction that a student's first attempt will be an error (incorrect attempt or hint request); a value of "0" is a prediction that the student's first attempt will be correct. For specifics, see below "Predicted Error Rate" and how it's calculated. In the case of multiple KCs assigned to a single step, Datashop implements a compensatory sum across all of the KCs, thus a single value of predicted error rate is provided (i.e., the same predicted error rate for each KC assigned to a step). For more detail on Datashop's implementation for multi-skilled step, see Model Values page.

1.2 Summarization of Data¶

This table organizes the data as student-problem-step

[3]:

df_step =  pd.read_csv('OLI_data/AllData_student_step_2011F.csv',low_memory=False) # sep="\t"
df_step.head(2)

[3]:

	Row	Sample	Anon Student Id	Problem Hierarchy	Problem Name	Problem View	Step Name	Step Start Time	First Transaction Time	Correct Transaction Time	Step End Time	Step Duration (sec)	Correct Step Duration (sec)	Error Step Duration (sec)	First Attempt	Incorrects	Hints	Corrects	Condition	KC (F2011)	Opportunity (F2011)	Predicted Error Rate (F2011)	KC (Single-KC)	Opportunity (Single-KC)	Predicted Error Rate (Single-KC)	KC (Unique-step)	Opportunity (Unique-step)	Predicted Error Rate (Unique-step)
0	1	All Data	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams	_m2_assess	1	q1_point1i1 UpdateComboBox	2011/9/21 17:35	2011/9/21 17:35	2011/9/21 17:35	2011/9/21 17:35	23.13	23.13	.	correct	0	0	1	.	identify_interaction	1	0.3991	Single-KC	1	0.4373	NaN	NaN	NaN
1	2	All Data	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams	_m2_assess	1	q1_point3i3 UpdateComboBox	2011/9/21 17:35	2011/9/21 17:35	2011/9/21 17:35	2011/9/21 17:35	23.13	23.13	.	correct	0	0	1	.	gravitational_forces	1	0.1665	Single-KC	2	0.4373	NaN	NaN	NaN

2. Data Analysis¶

[4]:

df_step.describe()

[4]:

	Row	Problem View	Incorrects	Hints	Corrects	Predicted Error Rate (F2011)	Opportunity (Single-KC)	Predicted Error Rate (Single-KC)	Opportunity (Unique-step)	Predicted Error Rate (Unique-step)
count	194947.000000	194947.000000	194947.000000	194947.000000	194947.000000	113992.000000	194947.000000	194947.000000	193043.000000	0.0
mean	97474.000000	1.133154	0.379611	0.143172	0.964072	0.237508	419.751066	0.252233	1.035971	NaN
std	56276.495801	0.760515	1.373797	0.852520	0.480346	0.158128	288.365862	0.086406	0.384182	NaN
min	1.000000	1.000000	0.000000	0.000000	0.000000	0.002900	1.000000	0.038600	1.000000	NaN
25%	48737.500000	1.000000	0.000000	0.000000	1.000000	0.117900	171.000000	0.188100	1.000000	NaN
50%	97474.000000	1.000000	0.000000	0.000000	1.000000	0.201400	382.000000	0.240500	1.000000	NaN
75%	146210.500000	1.000000	0.000000	0.000000	1.000000	0.319500	635.000000	0.294700	1.000000	NaN
max	194947.000000	32.000000	413.000000	43.000000	86.000000	0.969300	1410.000000	0.773600	24.000000	NaN

[5]:

num_total = len(df_step)
num_students = len(df_step['Anon Student Id'].unique())
num_problems = len(df_step['Problem Name'].unique())
num_kcs = len(df_step['KC (F2011)'].unique())
num_null_condition = df_step['Condition'].isnull().sum()  # 空值可不要
print("num_total:",num_total)
print("num_students:",num_students)
print("num_problems:",num_problems)
print("num_kcs:",num_kcs)
print("num_null_condition:",num_null_condition)

n_incorrects = df_step['Incorrects'].sum()
n_hints = df_step['Hints'].sum()
n_corrects = df_step['Corrects'].sum()
print("\n","*"*30,"\n")
print(n_incorrects,n_hints,n_corrects)
print(n_corrects / (n_incorrects + n_hints + n_corrects))

num_total: 194947
num_students: 333
num_problems: 300
num_kcs: 98
num_null_condition: 0

 ******************************

74004 27911 187943
0.6483968011923079

（1）Analysis for Null and Unique value of column attributes¶

[6]:

def work_col_analysis(df_work):
    num_nonull_toal = df_work.notnull().sum()  # Not Null
    dict_col_1 = {'col_name':num_nonull_toal.index,'num_nonull':num_nonull_toal.values}
    df_work_col_1 = pd.DataFrame(dict_col_1)

    num_null_toal = df_work.isnull().sum()  # Null
    dict_col_2 = {'col_name':num_null_toal.index,'num_null':num_null_toal.values}
    df_work_col_2 = pd.DataFrame(dict_col_2)

    num_unique_toal = df_work.apply(lambda col: len(col.unique()))   # axis=0
    print(type(num_unique_toal))
    dict_col_3 = {'col_name':num_unique_toal.index,'num_unique':num_unique_toal.values}
    df_work_col_3 = pd.DataFrame(dict_col_3)

    # df_work_col = pd.concat([df_work_col_1, df_work_col_2], axis=1)
    df_work_col = pd.merge(df_work_col_1, df_work_col_2, on=['col_name'])
    df_work_col = pd.merge(df_work_col, df_work_col_3, on=['col_name'])
    return df_work_col
print("-------------------num_unique_toal and num_nonull_toal----------------------")
df_result = work_col_analysis(df_step)
df_result

-------------------num_unique_toal and num_nonull_toal----------------------
<class 'pandas.core.series.Series'>

[6]:

	col_name	num_nonull	num_null	num_unique
0	Row	194947	0	194947
1	Sample	194947	0	1
2	Anon Student Id	194947	0	333
3	Problem Hierarchy	194947	0	27
4	Problem Name	194947	0	300
5	Problem View	194947	0	32
6	Step Name	194947	0	382
7	Step Start Time	194632	315	33098
8	First Transaction Time	194947	0	34578
9	Correct Transaction Time	182132	12815	33501
10	Step End Time	194947	0	34351
11	Step Duration (sec)	194947	0	2521
12	Correct Step Duration (sec)	194947	0	2187
13	Error Step Duration (sec)	194947	0	2105
14	First Attempt	194947	0	3
15	Incorrects	194947	0	32
16	Hints	194947	0	30
17	Corrects	194947	0	17
18	Condition	194947	0	1
19	KC (F2011)	113992	80955	98
20	Opportunity (F2011)	113992	80955	1206
21	Predicted Error Rate (F2011)	113992	80955	7623
22	KC (Single-KC)	194947	0	1
23	Opportunity (Single-KC)	194947	0	1410
24	Predicted Error Rate (Single-KC)	194947	0	317
25	KC (Unique-step)	193043	1904	1179
26	Opportunity (Unique-step)	193043	1904	25
27	Predicted Error Rate (Unique-step)	0	194947	1

（3）Data Cleaning¶

Data Cleaning Suggestions¶

Redundant columns: Columns that are all NULL or Single value.

rows that KC (F2011) == null（Do not know the knowledge source）

rows that Step Start Time == null（This step is too short or more than 10mins, so the data is not reliable）

Others

[7]:

df_step_clear = df_step.copy(deep=True) # deep copy

[8]:

# 直接清除所有”冗余列“
cols = list(df_step.columns.values)
drop_cols = []
for col in cols:
    if len(df_step_clear[col].unique().tolist()) == 1:
        df_step_clear.drop(col,axis =1,inplace=True)
        drop_cols.append(col)

print("the cols num before clear: ",len(df_step.columns.to_list()))
print("the cols num after clear:",len(df_step_clear.columns.to_list()))
for col in drop_cols:
    print("drop:---",col)

the cols num before clear:  28
the cols num after clear: 24
drop:--- Sample
drop:--- Condition
drop:--- KC (Single-KC)
drop:--- Predicted Error Rate (Unique-step)

[9]:

# Others：'KC (F2011)','Step Start Time' with null value
df_step_clear.dropna(axis=0, how='any', subset=['KC (F2011)','Step Start Time'],inplace = True)

[10]:

# the remaining columns
print("-------------------num_unique_toal and num_nonull_toal----------------------")
df_result = work_col_analysis(df_step_clear)
df_result

-------------------num_unique_toal and num_nonull_toal----------------------
<class 'pandas.core.series.Series'>

[10]:

	col_name	num_nonull	num_null	num_unique
0	Row	113817	0	113817
1	Anon Student Id	113817	0	331
2	Problem Hierarchy	113817	0	26
3	Problem Name	113817	0	154
4	Problem View	113817	0	32
5	Step Name	113817	0	240
6	Step Start Time	113817	0	18856
7	First Transaction Time	113817	0	19745
8	Correct Transaction Time	103454	10363	19146
9	Step End Time	113817	0	19623
10	Step Duration (sec)	113817	0	2382
11	Correct Step Duration (sec)	113817	0	2093
12	Error Step Duration (sec)	113817	0	1949
13	First Attempt	113817	0	3
14	Incorrects	113817	0	25
15	Hints	113817	0	25
16	Corrects	113817	0	15
17	KC (F2011)	113817	0	97
18	Opportunity (F2011)	113817	0	1205
19	Predicted Error Rate (F2011)	113817	0	7622
20	Opportunity (Single-KC)	113817	0	1164
21	Predicted Error Rate (Single-KC)	113817	0	315
22	KC (Unique-step)	112869	948	625
23	Opportunity (Unique-step)	112869	948	25

Outlier Analysis¶

It is found that there is a non-numeric type in duration that is ‘.’ , which should represent 0

In addition, box diagrams can be used to analyze whether some outliers need to be removed

[11]:

print(df_step_clear.columns.tolist())
print("-"*100)
print(df_step_clear.describe().columns.tolist()) #有许多object类无法统计分析
print("-"*100)
print(df_step_clear.dtypes)

['Row', 'Anon Student Id', 'Problem Hierarchy', 'Problem Name', 'Problem View', 'Step Name', 'Step Start Time', 'First Transaction Time', 'Correct Transaction Time', 'Step End Time', 'Step Duration (sec)', 'Correct Step Duration (sec)', 'Error Step Duration (sec)', 'First Attempt', 'Incorrects', 'Hints', 'Corrects', 'KC (F2011)', 'Opportunity (F2011)', 'Predicted Error Rate (F2011)', 'Opportunity (Single-KC)', 'Predicted Error Rate (Single-KC)', 'KC (Unique-step)', 'Opportunity (Unique-step)']
----------------------------------------------------------------------------------------------------
['Row', 'Problem View', 'Incorrects', 'Hints', 'Corrects', 'Predicted Error Rate (F2011)', 'Opportunity (Single-KC)', 'Predicted Error Rate (Single-KC)', 'Opportunity (Unique-step)']
----------------------------------------------------------------------------------------------------
Row                                   int64
Anon Student Id                      object
Problem Hierarchy                    object
Problem Name                         object
Problem View                          int64
Step Name                            object
Step Start Time                      object
First Transaction Time               object
Correct Transaction Time             object
Step End Time                        object
Step Duration (sec)                  object
Correct Step Duration (sec)          object
Error Step Duration (sec)            object
First Attempt                        object
Incorrects                            int64
Hints                                 int64
Corrects                              int64
KC (F2011)                           object
Opportunity (F2011)                  object
Predicted Error Rate (F2011)        float64
Opportunity (Single-KC)               int64
Predicted Error Rate (Single-KC)    float64
KC (Unique-step)                     object
Opportunity (Unique-step)           float64
dtype: object

[12]:

# Change . to 0 in "xxx-duration"
rectify_cols = ['Step Duration (sec)', 'Correct Step Duration (sec)', 'Error Step Duration (sec)']
for col in rectify_cols:
    df_step_clear[col] = df_step_clear[col].apply(lambda x: 0 if x=='.' else x)
    df_step_clear[col] = df_step_clear[col].astype(float)
print(df_step_clear[rectify_cols].dtypes)

Step Duration (sec)            float64
Correct Step Duration (sec)    float64
Error Step Duration (sec)      float64
dtype: object

3. Data Visualization¶

[13]:

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
%matplotlib inline

[14]:

# Outlier analysis for each column

fig=plt.figure()
box_cols = ['Step Duration (sec)', 'Correct Step Duration (sec)','Error Step Duration (sec)']
for i, col in enumerate(box_cols):
    ax=fig.add_subplot(3, 1, i+1)
    ax.boxplot(df_step_clear[df_step_clear[col].notnull()][col].tolist())
fig.show("svg")

D:\MySoftwares\Anaconda\envs\data\lib\site-packages\ipykernel_launcher.py:8: UserWarning:

Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure.

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_21_1.png

[15]:

# The distribution of continuous values
def show_value_counts_histogram(colname, sort = True):
    # create the bins
    start = int(df_step_clear[colname].min()/10)*10
    end = int(df_step_clear[colname].quantile(q=0.95)/10+1)*10
    step = int((end - start)/20)
    print(start, end, step)
    counts, bins = np.histogram(df_step_clear[colname],bins=range(start, end, step))
    bins = 0.5 * (bins[:-1] + bins[1:])

    fig = px.bar(x=bins, y=counts, labels={'x': colname, 'y':'count'})
    fig.show("svg")


# Box distribution of continuous values
def show_value_counts_box(colname, sort = True):
#     fig = px.box(df_step_clear, y=colname)
#     fig.show("svg")
    plt.figure(figsize=(10,5))
    plt.title('Box-plot for '+ colname,fontsize=20)#标题，并设定字号大小
    plt.boxplot([df_step_clear[colname].tolist()])
    plt.show("svg")


# Histogram of discrete values
def show_value_counts_bar(colname, sort = True):
    ds = df_step_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'Count'
    ]
    if sort:
        ds = ds.sort_values(by='Count', ascending=False)
    # histogram
    fig = px.bar(
        ds,
        x = colname,
        y = 'Count',
        title = colname + ' distribution'
    )
    fig.show("svg")


# Pie of discrete values
def show_value_counts_pie(colname, sort = True):
    ds = df_step_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'percent'
    ]
    ds['percent'] /= len(df_step_clear)
    if sort:
        ds = ds.sort_values(by='percent', ascending=False)
    fig = px.pie(
        ds,
        names = colname,
        values = 'percent',
        title = colname+ ' Percentage',
    )
    fig.update_traces(textposition='inside', textinfo='percent+label',showlegend=False)
    fig.show("svg")

[16]:

# Bar
show_value_counts_bar('First Attempt')
show_value_counts_histogram('Step Duration (sec)')
show_value_counts_box('Step Duration (sec)')

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_0.svg

0 70 3

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_2.svg

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_23_3.png

[17]:

# Pie
# show_value_counts_pie('KC (F2011)')
show_value_counts_pie('Problem Hierarchy')
show_value_counts_pie('Problem Name')
# show_value_counts_pie('Step Name')

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_24_0.svg

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_24_1.svg

[19]:

# four column labels are individually distributed as follows

topnum_max = 50 # show top 50 for each type
fig = make_subplots(rows=2, cols=2,   # 2*2
      start_cell="top-left",
      subplot_titles=('KC (F2011)','Problem Hierarchy','Problem Name','Step Name'),
      column_widths=[0.5, 0.5])
traces = [
    go.Bar(
        x = df_step[colname].value_counts().reset_index().index.tolist()[:topnum_max],
        y = df_step[colname].value_counts().reset_index()[colname].tolist()[:topnum_max],
        name = 'Type: ' + str(colname),
        text = df_step[colname].value_counts().reset_index()['index'].tolist()[:topnum_max],
        textposition = 'auto',
    ) for colname in ['KC (F2011)','Problem Hierarchy','Problem Name','Step Name']
]
for i in range(len(traces)):
    fig.append_trace(
        traces[i],
        (i //2) + 1, # pos_row
        (i % 2) + 1  # pos_col
    )

fig.update_layout(
    title_text = 'Bar of top 50 distributions for each type ',
)

fig.show("svg")

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_step_25_0.svg