Junyi

Data Source

Authorization

Any form of commercial use is not allowed!
Please cite the following paper if you publish work based on this dataset:

Haw-Shiuan Chang, Hwai-Jung Hsu and Kuan-Ta Chen,
"Modeling Exercise Relationships in E-Learning: A Unified Approach,"
International Conference on Educational Data Mining (EDM), 2015.

Introduction

The dataset contains the problem log and exercise-related information from Junyi Academy ( http://www.junyiacademy.org/ ), an e-learning platform established in 2012 on the basis of the open-source code released by Khan Academy. In addition, the annotations of exercise relationships that we collected for building our models are also available.

Data Description

Column Description

Field

Annotation

name

Exercise name (the name is also the exercise's unique ID in the dataset). To access the exercise on the website, append this name to the URL http://www.junyiacademy.org/exercise/ (e.g., http://www.junyiacademy.org/exercise/similar_triangles_1 ). Please note that Junyi Academy, like Khan Academy, constantly updates its content, so some exercise URLs might be unavailable when you access them.

live

Whether the exercise was still accessible on the website as of Jan. 2015

prerequisites

The prerequisite exercise (its parent in the knowledge map)

h_position

The x-axis coordinate in the knowledge map

v_position

The y-axis coordinate in the knowledge map

creation_date

The date this exercise was created

seconds_per_fast_problem

The website judges that a student finished the exercise quickly if he/she answers in less than this many seconds. The threshold is manually assigned by experts at Junyi Academy.

pretty_display_name

The Chinese name of the exercise shown in the knowledge map (use UTF-8 to decode the Chinese characters)

short_display_name

Another Chinese name of the exercise (use UTF-8 to decode the Chinese characters)

topic

The topic of the exercise; topics are shown as larger nodes in the knowledge map

area

The area of the exercise (each area contains several topics)
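The seconds_per_fast_problem threshold only becomes meaningful next to the problem log's time_taken column. A minimal sketch of flagging "fast" answers, using small hypothetical frames in place of the real CSVs (column names follow the field descriptions above; the values are illustrative):

```python
import pandas as pd

# Hypothetical mini-frames standing in for junyi_Exercise_table.csv and
# the problem log; only the columns needed for the join are included.
exercises = pd.DataFrame({
    "name": ["multiplication_1", "circles_and_arcs"],
    "seconds_per_fast_problem": [5.0, 27.0],
})
logs = pd.DataFrame({
    "exercise": ["multiplication_1", "multiplication_1", "circles_and_arcs"],
    "time_taken": [2, 9, 30],
})

# Join each log entry to its exercise's expert-assigned threshold, then
# flag answers that came in under it.
merged = logs.merge(exercises, left_on="exercise", right_on="name", how="left")
merged["fast"] = merged["time_taken"] < merged["seconds_per_fast_problem"]
print(merged["fast"].tolist())  # [True, False, False]
```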

[1]:
import numpy as np
import dask.dataframe as dd
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

[2]:
path = "./junyi/junyi_Exercise_table.csv"

data = pd.read_csv(path, encoding="utf-8", low_memory=False)
data.head()
[2]:
name live prerequisites h_position v_position creation_date seconds_per_fast_problem pretty_display_name short_display_name topic area
0 parabola_intuition_1 True recognizing_conic_sections 47 2 2012-10-11 17:55:24.8056 UTC 13.0 拋物線直覺 1 拋物線直覺1 conic-sections algebra
1 circles_and_arcs True NaN 40 -20 2012-10-11 17:55:33.41014 UTC 27.0 圓與弧 圓與弧 area-perimeter-and-volume geometry
2 inscribed_angles_3 True inscribed_angles_2 44 -22 2012-10-11 17:55:44.11836 UTC 5.0 圓周角與圓心角換算 3 圓周角與圓心角換算3 circle-properties geometry
3 solving_quadratics_by_factoring True factoring_polynomials_1 50 -2 2012-10-11 17:54:59.28029 UTC 7.0 因式分解法 因式分解法 quadtratics algebra
4 graphing_parabolas_1 True graphing_parabolas_0.5 52 0 2012-10-11 17:55:00.48268 UTC 24.0 畫拋物線 1 畫拋物線1 quadtratics algebra
[3]:
data.describe()
[3]:
h_position v_position seconds_per_fast_problem
count 837.000000 837.000000 837.000000
mean 25.402628 -5.704898 10.782557
std 15.876667 12.721159 8.935352
min -15.000000 -34.000000 0.000000
25% 15.000000 -17.000000 5.000000
50% 26.000000 -5.000000 8.000000
75% 36.000000 5.000000 13.000000
max 60.000000 19.000000 60.000000
[4]:
# Replace missing area labels ("null"/"nan") with "unknown"
data["area"] = data["area"].astype(str).replace({"null": "unknown", "nan": "unknown"})

fig = px.scatter(
    data,
    x = 'h_position',
    y = 'v_position',
    color='area',
    title='Exercises distribution on area in knowledge map'
)

fig.show('svg')
[5]:
# Replace missing topic labels ("null"/"nan") with "unknown"
data["topic"] = data["topic"].astype(str).replace({"null": "unknown", "nan": "unknown"})

fig = px.scatter(
    data,
    x = 'h_position',
    y = 'v_position',
    color='topic',
    title='Exercises distribution on topics in knowledge map'
)

fig.show('svg')
[6]:
# Horizontal bar chart of exercise counts grouped by a given column
def makeplot(title='Exercise count', groupByItem='area'):
    ds = data.groupby(groupByItem, as_index=False).agg(exercise_count=('topic', 'count'))
    ds = ds.sort_values('exercise_count')

    fig = px.bar(
        ds,
        x='exercise_count',
        y=groupByItem,
        orientation='h',
        title=title,
    )
    fig.show('svg')

makeplot(title='Exercise count on area', groupByItem='area')
makeplot(title='Exercise count on topics', groupByItem='topic')


Field

Description

Exercise_A, Exercise_B

The exercise names being compared

Similarity_avg, Difficulty_avg, Prerequisite_avg

The mean opinion scores for the different relationships. This is also the ground truth we used to train/test our model.

Similarity_raw, Difficulty_raw, Prerequisite_raw

The raw scores given by individual workers (delimited by "_")
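The *_avg columns are simply the mean of the underscore-delimited worker ratings stored in the matching *_raw columns, which is easy to verify. The string below is the Similarity_raw value of the first annotation row:

```python
# Parse an underscore-delimited raw rating string and recompute its mean
raw = "1_4_1_1_1_1_2_1_1_1_3_1_3_5"
scores = [int(s) for s in raw.split("_")]
avg = sum(scores) / len(scores)
print(round(avg, 6))  # 1.857143, matching Similarity_avg
```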

[8]:
path = "./junyi/relationship_annotation_training.csv"

data = dd.read_csv(path, encoding="utf-8", low_memory=False)
data.head()
[8]:
Exercise_A Exercise_B Similarity_avg Similarity_raw Difficulty_avg Difficulty_raw Prerequisite_avg Prerequisite_raw
0 radius_diameter_and_circumference arithmetic_word_problems_1 1.857143 1_4_1_1_1_1_2_1_1_1_3_1_3_5 2.857143 4_5_1_1_1_1_7_1_1_4_2_5_2_5 3.000000 1_6_1_1_1_3_2_1_9_2_3_2_8_2
1 radius_diameter_and_circumference parts_of_circles 6.785714 6_9_6_6_7_8_7_8_8_8_4_6_5_7 2.428571 3_5_1_3_2_1_5_1_1_1_1_2_5_3 7.285714 6_7_7_6_8_8_9_5_9_9_7_7_5_9
2 radius_diameter_and_circumference perimeter_of_squares_and_rectangles 3.571429 2_6_4_1_1_2_4_4_7_2_3_4_4_6 2.285714 2_5_1_1_1_1_3_2_1_1_5_2_3_4 5.000000 2_6_5_4_2_8_3_5_9_5_5_3_7_6
3 vertex_of_a_parabola solving_quadratics_by_taking_the_square_root 5.923077 6_7_6_7_8_4_5_4_3_6_6_8_7 3.307692 3_3_3_1_2_2_4_4_4_3_5_5_4 5.846154 5_8_7_7_6_2_6_5_6_7_3_7_7
4 vertex_of_a_parabola completing_the_square_1 5.692308 7_5_7_8_3_4_5_5_3_6_7_7_7 3.307692 2_3_3_4_2_2_4_4_5_3_4_4_3 5.461538 6_4_6_8_2_2_5_6_5_6_7_7_7
[9]:
data.describe().compute()
[9]:
Similarity_avg Difficulty_avg Prerequisite_avg
count 1131.000000 1131.000000 1131.000000
mean 5.088256 4.402577 4.801077
std 2.248680 1.586114 1.934648
min 1.000000 1.000000 1.166667
25% 3.100000 3.153846 3.160256
50% 5.333333 4.333333 4.777778
75% 7.000000 5.538462 6.333333
max 9.000000 8.454545 8.800000

Field

Description

user_id

A number representing a user

exercise

Exercise name

problem_type

Some exercises record which problem template the student encounters at this time

problem_number

How many times this student has practiced this exercise (e.g., the number is 1 the first time the student attempts this exercise)

topic_mode

Whether the student was assigned this exercise by clicking the topic icon (this function has since been disabled)

suggested

Whether the exercise was suggested by the system according to prerequisite relationships on the knowledge map

review_mode

Whether the exercise was done by the student after he/she had earned proficiency

time_done

Unix timestamp in microseconds

time_taken

Seconds the student spent on this exercise

time_taken_attempts

Seconds the student spent on each answering attempt

correct

Whether the student's first attempt was correct; the field is false if any hint was requested

count_attempts

How many times the student attempted to answer the problem

hint_used

Whether the student requested hints

count_hints

How many times the student requested hints

hint_time_taken_list

Seconds the student spent on each requested hint

earned_proficiency

Whether the student reached proficiency. Please refer to http://david-hu.com/2011/11/02/how-khan-academy-is-using-machine-learning-to-assess-student-mastery.html for the algorithm that determines proficiency

points_earned

How many points the student earned for this practice
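Two of the fields above need light parsing: time_done is a Unix timestamp in microseconds, and time_taken_attempts joins the per-attempt durations with "&" (as in the sample log rows further down). A minimal sketch:

```python
import pandas as pd

# time_done: Unix timestamp in microseconds -> datetime
ts = 1420714810324490  # time_done of the first sample log row
print(pd.to_datetime(ts, unit="us"))  # 2015-01-08 11:00:10.324490

# time_taken_attempts: "&"-joined seconds per attempt -> list of ints
attempts = "3&1"
print([int(s) for s in attempts.split("&")])  # [3, 1]
```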

[11]:
path = "./junyi/junyi_ProblemLog_original.csv"

data = dd.read_csv(path, encoding="utf-8", low_memory=False, dtype={'hint_time_taken_list': 'object'})
data.head()
[11]:
user_id exercise problem_type problem_number topic_mode suggested review_mode time_done time_taken time_taken_attempts correct count_attempts hint_used count_hints hint_time_taken_list earned_proficiency points_earned
0 12884 time_terminology analog_word 1 False False False 1420714810324490 4 3&1 False 2 False 0 NaN False 0
1 239464 multiplication_1 0 6 False False False 1403098400836660 2 2 True 1 False 0 NaN False 14
2 147359 adding_decimals_0.5 0 6 False False False 1418890695540340 16 16 True 1 False 0 NaN False 75
3 158155 multiplication_1 0 3 False False False 1400469444264040 2 2 True 1 False 0 NaN False 75
4 147151 subtraction_2 subtraction-2 10 True True False 1382650905730160 4 4 True 1 False 0 NaN False 225
[12]:
data.describe().compute()
[12]:
user_id problem_number time_done time_taken count_attempts count_hints points_earned
count 2.592599e+07 2.592599e+07 2.592599e+07 2.592599e+07 2.592599e+07 2.592599e+07 2.592599e+07
mean 1.236557e+05 2.859253e+01 3.263023e+11 9.955710e+01 1.363888e+00 2.850791e-01 8.219998e+01
std 7.121600e+04 9.871659e+01 1.248303e+13 2.157362e+05 2.391150e+00 1.276758e+00 9.056150e+01
min 0.000000e+00 1.000000e+00 1.350004e+15 -5.049212e+08 0.000000e+00 0.000000e+00 0.000000e+00
25% 6.199900e+04 4.000000e+00 1.395736e+15 4.000000e+00 1.000000e+00 0.000000e+00 5.000000e+00
50% 1.242630e+05 9.000000e+00 1.405395e+15 8.000000e+00 1.000000e+00 0.000000e+00 5.000000e+01
75% 1.856380e+05 2.200000e+01 1.415168e+15 1.800000e+01 1.000000e+00 0.000000e+00 1.950000e+02
max 2.476050e+05 5.174000e+03 1.421000e+15 4.067572e+08 1.000000e+03 2.000000e+01 2.250000e+02
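The describe() output above shows that time_taken contains negative values (min ≈ -5.0e8) and implausibly large ones (max ≈ 4.1e8 seconds), so some filtering is usually warranted before any time-on-task analysis. A sketch on a toy frame; the one-hour cap is an arbitrary choice for illustration, not part of the dataset:

```python
import pandas as pd

# Toy log with the kinds of outliers describe() revealed
logs = pd.DataFrame({"time_taken": [4, -5, 16, 40_000_000, 8]})

# Keep only non-negative durations below an (assumed) one-hour cap
clean = logs[logs["time_taken"].between(0, 3600)]
print(clean["time_taken"].tolist())  # [4, 16, 8]
```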
[13]:
data['user_id'].nunique().compute()
[13]:
247606
[14]:
total_count = len(data)
total_count
[14]:
25925992
[15]:
ds = data['earned_proficiency'].value_counts().reset_index().compute()

ds.columns = [
    'earned_proficiency',
    'percent'
]

ds['percent'] /= total_count
ds = ds.sort_values(['percent'])


[16]:
ds
[16]:
earned_proficiency percent
1 True 0.046066
0 False 0.953934
[17]:
fig = px.pie(
    ds,
    names = ['mastered','not mastered'],
    values = 'percent',
    title = 'Percent of mastered exercises',
)

fig.show('svg')
[18]:
ds = data['correct'].value_counts().reset_index().compute()
ds.columns = [
    'correct',
    'percent'
]
ds['percent'] /= total_count
ds = ds.sort_values(['percent'])
[19]:
ds
[19]:
correct percent
1 False 0.172126
0 True 0.827874
[20]:
fig = px.pie(
    ds,
    names = ['wrong','correct'],
    values = 'percent',
    title = 'Percent of answer correctly at first attempt',
)
fig.show('svg')

This file uses the tab-delimited format of the PSLC DataShop; please refer to their documentation ( https://pslcdatashop.web.cmu.edu/help?page=importFormatTd ). The text file is too large (9.1 GB) to analyze with the website's tools, so we compress it and provide it as an extra file of the dataset. We also upload a small subset of the data to the website for illustration purposes. Note that some assumptions were made when converting the data into this format; please read the description of our dataset for more details.

[22]:
path = "./junyi/junyi_ProblemLog_for_PSLC.txt"
data = dd.read_csv(path, sep='\t', encoding="utf-8")
pd.set_option('display.max_columns', 2000)
data.head()
[22]:
Anon Student Id Session Id Time Student Response Type Tutor Response Type Level (Unit) Level (Section) Problem Name Problem Start Time Step Name Outcome Condition Name Condition Type Selection Action Input KC (Exercise) KC (Topic) KC (Area) CF (points_earned) CF (earned_proficiency)
0 12884 148691 1420714809324 ATTEMPT RESULT telling-time time_terminology time_terminology--analog_word 1420714806324 time_terminology--analog_word INCORRECT Choose_Exercise NaN NaN NaN NaN time_terminology telling-time arithmetic 0 0
1 12884 148691 1420714810324 ATTEMPT RESULT telling-time time_terminology time_terminology--analog_word 1420714809324 time_terminology--analog_word INCORRECT Choose_Exercise NaN NaN NaN NaN time_terminology telling-time arithmetic 0 0
2 239464 93497 1403098400837 ATTEMPT RESULT multiplication-division multiplication_1 multiplication_1--0 1403098398837 multiplication_1--0 CORRECT Choose_Exercise NaN NaN NaN NaN multiplication_1 multiplication-division arithmetic 14 0
3 147359 145156 1418890695540 ATTEMPT RESULT decimals adding_decimals_0.5 adding_decimals_0.5--0 1418890679540 adding_decimals_0.5--0 CORRECT Choose_Exercise NaN NaN NaN NaN adding_decimals_0.5 decimals arithmetic 75 0
4 158155 105559 1400469444264 ATTEMPT RESULT multiplication-division multiplication_1 multiplication_1--0 1400469442264 multiplication_1--0 CORRECT Choose_Exercise NaN NaN NaN NaN multiplication_1 multiplication-division arithmetic 75 0

Questions and Collaboration:

1. If you have any question about this dataset, please e-mail hschang@cs.umass.edu.
2. If you would like to acquire more data that fits your research purposes, please contact Junyi Academy directly at support@junyiacademy.org to discuss further cooperation opportunities.

Note:

1. The dataset we used in our paper (Modeling Exercise Relationships in E-Learning: A Unified Approach) was extracted from Junyi Academy in July 2014, while this dataset was extracted in January 2015. After applying our method to the new dataset, we obtained observations similar to those in our paper, even though this dataset contains more users and exercises.
2. After decompression, the original problem log and the PSLC-format problem log take around 2.6 GB and 9.1 GB respectively. Please make sure you have enough disk space.

Annotation:

  1. The PSLC dataset is generated from the original dataset by splitting the time_taken_attempts field, so the PSLC dataset contains more entries than the original one.
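The note above can be illustrated: an original row whose time_taken_attempts holds two attempts ("3&1") becomes two PSLC rows. A sketch of that expansion with pandas explode, on a hypothetical one-row frame:

```python
import pandas as pd

# One original-format row with two answering attempts
row = pd.DataFrame({
    "user_id": [12884],
    "exercise": ["time_terminology"],
    "time_taken_attempts": ["3&1"],
})

# Split the "&"-joined attempts, then emit one row per attempt
row["attempt_seconds"] = row["time_taken_attempts"].str.split("&")
expanded = row.explode("attempt_seconds")
print(len(expanded))  # 2 rows, one per attempt
```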

Analysis

[23]:
len(data)
[23]:
39462201
[24]:
ds = data.groupby('Anon Student Id').agg({'Session Id': 'count'}).describe().compute()
[25]:
ds
[25]:
Session Id
count 247547.000000
mean 159.412964
std 598.876158
min 1.000000
25% 7.000000
50% 19.000000
75% 82.000000
max 55984.000000
[26]:
data1 = data.sample(frac=0.01).compute()
[27]:
# Number of distinct exercises and topics per session (1% sample).
# Note: data1 is a pandas DataFrame (after .compute()), so pandas' built-in
# 'nunique' aggregation is used below; the custom dask Aggregation here
# would only be needed when grouping the dask dataframe directly.
nunique = dd.Aggregation(
    name="nunique",
    chunk=lambda s: s.apply(lambda x: list(set(x))),
    agg=lambda s0: s0.obj.groupby(level=list(range(s0.obj.index.nlevels))).sum(),
    finalize=lambda s1: s1.apply(lambda final: len(set(final))),
)
ds = data1.groupby('Session Id').agg({'KC (Exercise)': 'nunique', 'KC (Topic)': 'nunique', 'Time': lambda x: x.max() - x.min()})
ds.describe()
[27]:
KC (Exercise) KC (Topic) Time
count 164015.000000 164015.000000 1.640150e+05
mean 1.994543 1.564345 5.521162e+08
std 2.457822 1.288020 2.618666e+09
min 1.000000 1.000000 0.000000e+00
25% 1.000000 1.000000 0.000000e+00
50% 1.000000 1.000000 0.000000e+00
75% 2.000000 2.000000 1.515309e+06
max 121.000000 35.000000 6.488305e+10
[ ]: