OLI data in fall, 2011（problem）¶

[1]:

%matplotlib inline
import pandas as pd
import numpy as np
# global configuration: show every rows and cols
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth',None)
pd.set_option('display.max_columns', None)

1. Data Description¶

1.1 Column Description¶

[2]:

# help_table3: the description for data by problems
df3 = pd.read_csv('OLI_data/help_table3.csv',sep=',',encoding="gbk")
df3 = df3.loc[:, ['Field', 'Annotation']]
df3

[2]:

	Field	Annotation
0	Row	A row counter.
1	Sample	The sample that includes this problem. If you select more than one sample to export, problems that occur in more than one sample will be duplicated in the export.
2	Anon Student ID	The student that worked on the problem.
3	Problem Hierarchy	The location in the curriculum hierarchy where this problem occurs.
4	Problem Name	The name of the problem.
5	Problem View	The number of times the student encountered the problem so far. This counter increases with each instance of the same problem. See "Problem View" in the "By Student-Step" table above.
6	Problem Start Time	If the problem start time is not given in the original log data, then it is set to the time of the last transaction of the prior problem. If there is no prior problem for the session, the time of the earliest transaction is used. Earliest transaction time is equivalent to the minimum transaction time for the earliest step of the problem. For more detail on how problem start time is determined, see Determining Problem Start Time.
7	Problem End Time	Derived from the maximum transaction time of the latest step of the problem.
8	Latency (sec)	The amount of time the student spent on this problem. Specifically, the difference between the problem start time and the last transaction on this problem.
9	Steps Missing Start Times	The number of steps (from the student-step table) with "Step Start Time" values of "null".
10	Hints	Total number of hints the student requested for this problem.
11	Incorrects	Total number of incorrect attempts the student made on this problem.
12	Corrects	Total number of correct attempts the student made for this problem.
13	Avg Corrects	The total number of correct attempts / total number of steps in the problem.
14	Steps	Total number of steps the student took while working on the problem.
15	Avg Assistance Score	Calculated as (total hints requested + total incorrect attempts) / total steps.
16	Correct First Attempts	Total number of correct first attempts made by the student for this problem.
17	Condition	The name and type of the condition the student is assigned to. In the case of a student assigned to multiple conditions (factors in a factorial design), condition names are separated by a comma and space. This differs from the transaction format, which optionally has "Condition Name" and "Condition Type" columns.
18	KCs	Total number of KCs practiced by the student for this problem.
19	Steps without KCs	Total number of steps in this problem (performed by the student) without an assigned KC.
20	KC List	Comma-delimited list of KCs practiced by the student for this problem.

1.2 Summarization of Data¶

This table organizes the data as student-problem

[3]:

df_problem =  pd.read_csv('OLI_data/AllData_problem_2011F.csv',low_memory=False) # sep="\t"
df_problem.head(2)

[3]:

	Row	Sample	Anon Student Id	Problem Hierarchy	Problem Name	Problem View	Problem Start Time	Problem End Time	Latency (sec)	Steps Missing Start Times	Hints	Incorrects	Corrects	Avg Corrects	Steps	Avg Assistance Score	Correct First Attempts	Condition	KCs (F2011)	Steps without KCs (F2011)	KC List (F2011)	KCs (Single-KC)	Steps without KCs (Single-KC)	KC List (Single-KC)	KCs (Unique-step)	Steps without KCs (Unique-step)	KC List (Unique-step)
0	1	All Data	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams	_m2_assess	1	2011/9/21 17:35	2011/9/21 17:35	0	0	0	9	12	0.571	21	0.429	12	NaN	5	0	gravitational_forces, identify_interaction, represent_interaction_cord, represent_interaction_spring, simple_step	1	0	Single-KC	0	21	.
1	2	All Data	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Effects of Force	tutor_03_01	1	2011/9/21 17:49	2011/9/21 17:49	9	0	0	0	3	1.000	3	0.000	3	NaN	1	0	distinguish_rotation_translation	1	0	Single-KC	3	0	KC523, KC680, KC768

2. Data Analysis¶

[4]:

df_problem.describe()

[4]:

	Row	Problem View	Latency (sec)	Steps Missing Start Times	Hints	Incorrects	Corrects	Avg Corrects	Steps	Avg Assistance Score	Correct First Attempts	Condition	KCs (F2011)	Steps without KCs (F2011)	KCs (Single-KC)	Steps without KCs (Single-KC)	KCs (Unique-step)	Steps without KCs (Unique-step)
count	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	45002.000000	0.0	45002.000000	45002.000000	45002.0	45002.0	45002.000000	45002.000000
mean	22501.500000	1.221146	85.639883	0.007000	0.620217	1.644460	4.176325	0.959571	4.331963	0.928014	3.219479	NaN	1.223923	1.798920	1.0	0.0	4.289654	0.042309
std	12991.102744	1.140622	301.895374	0.106748	1.956302	3.378211	5.125742	0.358850	5.079484	2.221907	4.603916	NaN	1.733856	3.830471	0.0	0.0	5.084490	0.557118
min	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	NaN	0.000000	0.000000	1.0	0.0	0.000000	0.000000
25%	11251.250000	1.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	1.000000	0.000000	1.000000	NaN	0.000000	0.000000	1.0	0.0	1.000000	0.000000
50%	22501.500000	1.000000	20.000000	0.000000	0.000000	1.000000	3.000000	1.000000	3.000000	0.250000	2.000000	NaN	1.000000	0.000000	1.0	0.0	3.000000	0.000000
75%	33751.750000	1.000000	73.000000	0.000000	0.000000	2.000000	5.000000	1.000000	5.000000	1.000000	4.000000	NaN	2.000000	2.000000	1.0	0.0	5.000000	0.000000
max	45002.000000	32.000000	20426.000000	8.000000	50.000000	413.000000	232.000000	19.333000	32.000000	210.500000	32.000000	NaN	9.000000	32.000000	1.0	0.0	32.000000	29.000000

（1）Analysis for Null and Unique value of column attributes¶

[5]:

def work_col_analysis(df_work):
    num_nonull_toal = df_work.notnull().sum()  # Not Null
    dict_col_1 = {'col_name':num_nonull_toal.index,'num_nonull':num_nonull_toal.values}
    df_work_col_1 = pd.DataFrame(dict_col_1)

    num_null_toal = df_work.isnull().sum()  # Null
    dict_col_2 = {'col_name':num_null_toal.index,'num_null':num_null_toal.values}
    df_work_col_2 = pd.DataFrame(dict_col_2)

    num_unique_toal = df_work.apply(lambda col: len(col.unique()))   # axis=0
    print(type(num_unique_toal))
    dict_col_3 = {'col_name':num_unique_toal.index,'num_unique':num_unique_toal.values}
    df_work_col_3 = pd.DataFrame(dict_col_3)

    df_work_col = pd.merge(df_work_col_1, df_work_col_2, on=['col_name'])
    df_work_col = pd.merge(df_work_col, df_work_col_3, on=['col_name'])
    return df_work_col
print("-------------------num_unique_toal and num_nonull_toal----------------------")
df_result = work_col_analysis(df_problem)
df_result

-------------------num_unique_toal and num_nonull_toal----------------------
<class 'pandas.core.series.Series'>

[5]:

	col_name	num_nonull	num_null	num_unique
0	Row	45002	0	45002
1	Sample	45002	0	1
2	Anon Student Id	45002	0	333
3	Problem Hierarchy	45002	0	27
4	Problem Name	45002	0	300
5	Problem View	45002	0	32
6	Problem Start Time	45002	0	25983
7	Problem End Time	45002	0	25884
8	Latency (sec)	45002	0	1290
9	Steps Missing Start Times	45002	0	8
10	Hints	45002	0	35
11	Incorrects	45002	0	37
12	Corrects	45002	0	51
13	Avg Corrects	45002	0	195
14	Steps	45002	0	31
15	Avg Assistance Score	45002	0	335
16	Correct First Attempts	45002	0	33
17	Condition	0	45002	1
18	KCs (F2011)	45002	0	10
19	Steps without KCs (F2011)	45002	0	31
20	KC List (F2011)	45002	0	170
21	KCs (Single-KC)	45002	0	1
22	Steps without KCs (Single-KC)	45002	0	1
23	KC List (Single-KC)	45002	0	1
24	KCs (Unique-step)	45002	0	32
25	Steps without KCs (Unique-step)	45002	0	16
26	KC List (Unique-step)	45002	0	1470

（2）Analysis for Discrete value of column attributes¶

Columns with a small number of discrete values may represent very informative, so identify these columns first and analyze them one by one

[6]:

discrete_cols = []
series = []
cols = list(df_problem.columns.values)

for col in cols:
    if len(df_problem[col].unique().tolist()) <= 20 and len(df_problem[col].unique().tolist()) >= 2:
        discrete_cols.append(col)
        series.append(df_problem[col].unique().tolist())

for a,b in zip(discrete_cols,series):
    print(a," : ",b)
    print("-"*80)

Steps Missing Start Times  :  [0, 1, 2, 5, 7, 6, 3, 8]
--------------------------------------------------------------------------------
KCs (F2011)  :  [5, 1, 4, 2, 3, 9, 0, 8, 6, 7]
--------------------------------------------------------------------------------
Steps without KCs (Unique-step)  :  [21, 0, 17, 15, 9, 2, 5, 1, 4, 3, 12, 10, 8, 11, 14, 29]
--------------------------------------------------------------------------------

（3）Data Cleaning¶

Data Cleaning Suggestions - Redundant columns: Columns that are all NULL or Single value. - Others

[7]:

df_problem_clear = df_problem.copy(deep=True) # deep copy

[8]:

# Clear all redundant columns directly.
cols = list(df_problem.columns.values)
drop_cols = []
for col in cols:
    if len(df_problem_clear[col].unique().tolist()) == 1:
        df_problem_clear.drop(col,axis =1,inplace=True)
        drop_cols.append(col)

print("the cols num before clear: ",len(df_problem.columns.to_list()))
print("the cols num after clear:",len(df_problem_clear.columns.to_list()))
for col in drop_cols:
    print("drop:---",col)

the cols num before clear:  27
the cols num after clear: 22
drop:--- Sample
drop:--- Condition
drop:--- KCs (Single-KC)
drop:--- Steps without KCs (Single-KC)
drop:--- KC List (Single-KC)

[9]:

df_problem_clear.head(2)

[9]:

	Row	Anon Student Id	Problem Hierarchy	Problem Name	Problem View	Problem Start Time	Problem End Time	Latency (sec)	Steps Missing Start Times	Hints	Incorrects	Corrects	Avg Corrects	Steps	Avg Assistance Score	Correct First Attempts	KCs (F2011)	Steps without KCs (F2011)	KC List (F2011)	KCs (Unique-step)	Steps without KCs (Unique-step)	KC List (Unique-step)
0	1	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Introduction to Free Body Diagrams	_m2_assess	1	2011/9/21 17:35	2011/9/21 17:35	0	0	0	9	12	0.571	21	0.429	12	5	0	gravitational_forces, identify_interaction, represent_interaction_cord, represent_interaction_spring, simple_step	0	21	.
1	2	Stu_00b2b35fd027e7891e8a1a527125dd65	sequence Statics, unit Concentrated Forces and Their Effects, module Effects of Force	tutor_03_01	1	2011/9/21 17:49	2011/9/21 17:49	9	0	0	0	3	1.000	3	0.000	3	1	0	distinguish_rotation_translation	3	0	KC523, KC680, KC768

[10]:

# the remaining columns
print("-------------------num_unique_toal and num_nonull_toal----------------------")
df_result = work_col_analysis(df_problem_clear)
df_result

-------------------num_unique_toal and num_nonull_toal----------------------
<class 'pandas.core.series.Series'>

[10]:

	col_name	num_nonull	num_unique
0	Row	45002	45002
1	Anon Student Id	45002	333
2	Problem Hierarchy	45002	27
3	Problem Name	45002	300
4	Problem View	45002	32
5	Problem Start Time	45002	25983
6	Problem End Time	45002	25884
7	Latency (sec)	45002	1290
8	Steps Missing Start Times	45002	8
9	Hints	45002	35
10	Incorrects	45002	37
11	Corrects	45002	51
12	Avg Corrects	45002	195
13	Steps	45002	31
14	Avg Assistance Score	45002	335
15	Correct First Attempts	45002	33
16	KCs (F2011)	45002	10
17	Steps without KCs (F2011)	45002	31
18	KC List (F2011)	45002	170
19	KCs (Unique-step)	45002	32
20	Steps without KCs (Unique-step)	45002	16
21	KC List (Unique-step)	45002	1470

3. Data Visualization¶

[11]:

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt

[12]:

# The distribution of continuous values
def show_value_counts_histogram(colname, sort = True):
    # create the bins
    start = int(df_problem_clear[colname].min()/10)*10
    end = int(df_problem_clear[colname].quantile(q=0.95)/10+1)*10
    problem = int((end - start)/20)
    print(start, end, problem)
    counts, bins = np.histogram(df_problem_clear[colname],bins=range(start, end, problem))
    bins = 0.5 * (bins[:-1] + bins[1:])

    fig = px.bar(x=bins, y=counts, labels={'x': colname, 'y':'count'})
    fig.show("svg")

# Box distribution of continuous values
def show_value_counts_box(colname, sort = True):
    # way1: plotly (too costy for box-plot)
    # fig = px.box(df_problem_clear, y=colname)
    # fig.show("svg")
    # way2: matplotlib
    plt.figure(figsize=(10,5))
    plt.title('Box-plot for '+ colname,fontsize=20)#标题，并设定字号大小
    plt.boxplot([df_problem_clear[colname].tolist()])
    plt.show("svg")

# Histogram of discrete values
def show_value_counts_bar(colname, sort = True):
    ds = df_problem_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'Count'
    ]
    if sort:
        ds = ds.sort_values(by='Count', ascending=False)
    # histogram
    fig = px.bar(
        ds,
        x = colname,
        y = 'Count',
        title = colname + ' distribution'
    )
    fig.show("svg")


# Pie of discrete values
def show_value_counts_pie(colname, sort = True):
    ds = df_problem_clear[colname].value_counts().reset_index()
    ds.columns = [
        colname,
        'percent'
    ]
    ds['percent'] /= len(df_problem_clear)
    if sort:
        ds = ds.sort_values(by='percent', ascending=False)
    fig = px.pie(
        ds,
        names = colname,
        values = 'percent',
        title = colname+ 'Percentage',
    )
    fig.show("svg")

（1）sort by single attributes¶

[13]:

# Bar
show_value_counts_bar('KCs (F2011)')
show_value_counts_bar('Problem Hierarchy')

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_problem_21_0.svg

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_problem_21_1.svg

[14]:

# analysis for "duration"
# It is obvious that there are unreasonable outliers

show_value_counts_box('Latency (sec)')
show_value_counts_histogram('Latency (sec)')

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_problem_22_0.png

0 340 17

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_problem_22_2.svg

（2）group by Problem Name, sorted by meam(avg corrects)¶

[15]:

# Classification Statistic

# Problem Name,Avg Corrects, Avg Assistance Score
df_problem_group1 = df_problem_clear.groupby(['Problem Name'])['Avg Corrects'].mean().reset_index()
df_problem_group1.columns = ["Problem Name","Avg Corrects"]
df_problem_group1 = df_problem_group1.sort_values(by='Avg Corrects', ascending=False)
fig = px.bar(df_problem_group1, x="Problem Name", y="Avg Corrects", title="Questions sorted by the average accuracy")
fig.show("svg")

../../../_images/build_blitz_OLI_Fall2011_OLI_2011F_problem_24_0.svg