math2015-Math1 Data Analysis

[1]:

import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

[2]:

path = "C:/Users/Administrator/Desktop/Math1/rawdata.txt"
data = pd.read_table(path,header=None)

RECORDS

The learning records are saved in a score matrix. Each row corresponds to the records of a student on different test items.

[3]:

pd.set_option('display.max_rows',10)
data

[3]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19
0	4	4	4	4	4	4	4	4	4	4	4	4	4	4	0	5	8	0	5	4
1	4	0	4	0	4	4	4	4	0	4	4	4	0	0	0	5	0	8	2	0
2	4	4	0	0	0	0	4	0	0	4	0	4	0	0	0	0	0	0	0	0
3	4	4	4	0	0	0	4	4	4	0	4	4	4	4	0	5	0	0	0	0
4	0	4	4	0	4	0	4	0	0	0	4	4	0	4	0	3	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4204	4	0	4	4	4	0	4	0	0	4	0	0	0	0	0	4	2	0	1	0
4205	0	4	4	0	4	4	4	0	0	0	0	0	0	0	0	3	0	0	1	0
4206	4	4	0	0	0	4	4	4	0	0	0	0	0	0	0	3	2	0	4	0
4207	4	0	4	4	0	4	0	0	4	0	0	4	0	4	0	4	1	2	0	0
4208	4	0	4	0	4	0	0	4	0	0	0	4	0	0	0	3	0	0	1	0

4209 rows × 20 columns

For example, the first row holds the learning records of student 0, who scores 4 points on item 0 and 5 points on item 18.
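As a minimal sketch of how one record can be read from the score matrix, the snippet below rebuilds the first two rows of the output above in a small frame and looks up a score by student row and item column. The name `sample` is illustrative and not part of the notebook.

```python
import pandas as pd

# First two rows of the score matrix, copied from the output above.
sample = pd.DataFrame([
    [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 5, 8, 0, 5, 4],
    [4, 0, 4, 0, 4, 4, 4, 4, 0, 4, 4, 4, 0, 0, 0, 5, 0, 8, 2, 0],
])

# Rows are students and columns are items, so .loc[student, item] returns one score.
score_item0 = sample.loc[0, 0]
score_item18 = sample.loc[0, 18]
print(score_item0, score_item18)
```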

General features

[4]:

data.describe()

[4]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19
count	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000	4209.000000
mean	2.983131	2.421478	2.532668	2.827275	2.205750	2.109765	3.573295	2.387265	1.588976	1.663103	2.438584	2.986933	1.044429	2.305536	0.022808	4.247327	2.692564	1.226657	2.333333	0.953196
std	1.741888	1.955317	1.927991	1.821100	1.989625	1.997223	1.234951	1.962381	1.957542	1.971655	1.951550	1.739736	1.757162	1.976759	0.301222	1.476007	2.789518	2.238923	2.110050	1.378800
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	3.000000	0.000000	0.000000	1.000000	0.000000
50%	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	0.000000	0.000000	4.000000	4.000000	0.000000	4.000000	0.000000	5.000000	2.000000	0.000000	2.000000	0.000000
75%	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	0.000000	5.000000	4.000000	1.000000	4.000000	2.000000
max	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	4.000000	6.000000	8.000000	8.000000	9.000000	9.000000

[5]:

print('The number of records:' + str(len(data)))

The number of records:4209

[6]:

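# Total score per student (row sum over all items)
data['count'] = data.sum(axis=1)
data['index1'] = data.index
ds = data.loc[:, ['count', 'index1']]
# The trailing '-' turns the student index into a text label, so the bar chart treats it as a category
ds['index1'] = ds['index1'].astype(str) + '-'
# Keep the 50 students with the highest totals
ds = ds.sort_values(['count']).tail(50)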
fig = px.bar(
    ds,
    x = 'count',
    y = 'index1',
    orientation='h',
    title='Top 50 students by score'
)

fig.show("svg")

../../../_images/build_blitz_math2015_Math1_10_0.svg

This figure shows the total scores of the top 50 students.
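The same ranking can be sketched without plotting. The example below uses a made-up 4 × 3 score matrix (the values and the names `scores`, `totals` and `top2` are illustrative only) and picks the highest totals with `nlargest`, as an alternative to sorting and taking the tail.

```python
import pandas as pd

# Toy score matrix (made-up values): 4 students x 3 items.
scores = pd.DataFrame([
    [4, 0, 5],
    [4, 4, 8],
    [0, 0, 2],
    [4, 4, 0],
])

# Row sums give each student's total score.
totals = scores.sum(axis=1)

# nlargest keeps the k highest totals, sorted in descending order.
top2 = totals.nlargest(2)
print(top2)
```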

[7]:

ds=data.loc[:, ['count', 'index1']]
ds = ds.sort_values(['index1'])
fig = px.histogram(
    ds,
    x='index1',
    y='count',
    title='Students score distribution'
)
fig.show("svg")

../../../_images/build_blitz_math2015_Math1_12_0.svg
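The histogram above bins students by index and, under Plotly's default aggregation when `y` is given, sums their totals within each bin. To look at how the total scores themselves are distributed, one option is to bin the totals directly. The sketch below does this with `numpy.histogram` on made-up totals; the bin edges are an arbitrary choice for illustration.

```python
import numpy as np

# Made-up total scores for six students.
totals = np.array([9, 16, 2, 8, 12, 15])

# Hypothetical bin edges; every bin is half-open except the last, which includes its right edge.
edges = [0, 5, 10, 15, 20]

# counts[i] is the number of students whose total falls in the i-th bin.
counts, _ = np.histogram(totals, bins=edges)
print(counts)
```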

Sort by correct rate

[8]:

path = "C:/Users/Administrator/Desktop/Math1/data.txt"
data = pd.read_table(path,header=None)

[9]:

# Mean score per item; the file is read without a header, so the column labels are the item indices
ds = data.mean()
# Build the frame in one step (DataFrame.append was removed in pandas 2.0)
ds1 = pd.DataFrame({'problem_id': ds.index, 'count': ds.values})

ds1=ds1.sort_values(['count'])
ds1['problem_id'] = (ds1['problem_id']).astype(str) + '-'
fig = px.bar(
    ds1,
    x = 'count',
    y = 'problem_id',
    orientation = 'h',
    title = 'Average correct rate of questions'
)

fig.show("svg")

../../../_images/build_blitz_math2015_Math1_15_0.svg

This figure presents the average correct rate of each question. Students do best on item 6 and need to improve most on item 14.
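A minimal sketch of the per-item correct rate, assuming it is the mean over students of each score divided by the item's full score (the notebook does not state how data.txt was derived). The scores, the full scores and the variable names are all made up for illustration.

```python
import pandas as pd

# Toy raw scores (made-up values): 3 students x 3 items.
raw = pd.DataFrame([
    [4, 0, 3],
    [4, 0, 3],
    [0, 4, 3],
])
# Hypothetical full score of each item, indexed by item.
full_score = pd.Series([4, 4, 6])

# Divide each item column by its full score, then average over students.
correct_rate = raw.div(full_score, axis=1).mean()

# Items with the highest and lowest average correct rate.
best_item = correct_rate.idxmax()
worst_item = correct_rate.idxmin()
print(correct_rate.round(3).tolist(), best_item, worst_item)
```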

Sort by problem type

[10]:

ds = data.mean()
# Build the per-item frame in one step (DataFrame.append was removed in pandas 2.0)
ds1 = pd.DataFrame({'problem_id': ds.index, 'count': ds.values})

# Items 0-14 are objective and items 15-19 are subjective, so the two masks must not overlap
data2= [('Obj',ds1[ds1['problem_id']<15]['count'].mean()),
    ('Sub',ds1[ds1['problem_id']>=15]['count'].mean())]
ds2 = pd.DataFrame(
    data=data2,
    columns=['Type','Percent']
)

fig = make_subplots(rows=1,cols=2)
traces = []
for item in ds2['Type'].tolist():
    # Mean correct rate for this problem type as a plain float
    rate = float(ds2.loc[ds2['Type']==item, 'Percent'].iloc[0])
    traces.append(
        go.Bar(
            x=['wrong','right'],
            y=[1-rate, rate],
            name='Type:' + str(item),
            text=[str(round(100*(1-rate))) + '%', str(round(100*rate)) + '%'],
            textposition='auto'
        )
    )
# One subplot column per problem type
for i, trace in enumerate(traces):
    fig.add_trace(trace, row=(i // 2) + 1, col=(i % 2) + 1)
fig.update_layout(
    title_text = 'Average correct rate of questions for every problem type',
)

fig.show("svg")

../../../_images/build_blitz_math2015_Math1_18_0.svg

This figure shows that students do much better on objective questions than on subjective ones: there is a large gap between the average correct rates of the two types. Subjective questions are a challenge for students, and their ability to answer them needs to be strengthened.
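The comparison by problem type can be sketched with a single boolean mask and its complement, which guarantees that every item lands in exactly one group. The rates and the split point `n_objective` below are made up for illustration.

```python
import pandas as pd

# Toy per-item correct rates (made-up values) for 5 items.
rates = pd.Series([0.8, 0.6, 0.7, 0.3, 0.1])
# Hypothetical split: items 0-2 are objective, items 3-4 are subjective.
n_objective = 3

# One mask for objective items; its complement selects the subjective ones.
is_objective = rates.index < n_objective
obj_rate = rates[is_objective].mean()
sub_rate = rates[~is_objective].mean()
print(round(obj_rate, 2), round(sub_rate, 2))
```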

[11]:

path = "C:/Users/Administrator/Desktop/Math1/problemdesc.txt"
data1 = pd.read_table(path,header=0)

[12]:

count = data1['Full Score'].sum()
data2= [('Obj',data1[data1['Type']=='Obj']['Full Score'].sum()),
    ('Sub',data1[data1['Type']=='Sub']['Full Score'].sum())]
ds = pd.DataFrame(
    data=data2,
    columns=['Type','Percent']
)

ds['Percent']/=count
ds=ds.sort_values('Percent')

fig=px.pie(
    ds,
    names='Type',
    values='Percent',
    title='Problem type',
)
fig.show("svg")

../../../_images/build_blitz_math2015_Math1_21_0.svg
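The pie chart shows the share of the total full score contributed by each problem type. A compact way to compute the same shares is a `groupby` on the `Type` column. The sketch below reuses the `Type` and `Full Score` column names from the cell above, but the rows are made up.

```python
import pandas as pd

# Toy problem descriptions (made-up rows) with the column names used above.
desc = pd.DataFrame({
    'Type': ['Obj', 'Obj', 'Obj', 'Sub', 'Sub'],
    'Full Score': [4, 4, 4, 6, 8],
})

# Sum the full scores within each type, then normalise by the grand total.
type_totals = desc.groupby('Type')['Full Score'].sum()
share = type_totals / type_totals.sum()
print(share.round(3))
```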

Sort by skills

[13]:

path1 = "C:/Users/Administrator/Desktop/Math1/q.txt"
data1 = pd.read_table(path1,header=None)

[14]:

ds1 = data1.sum()

[15]:

path2 = "C:/Users/Administrator/Desktop/Math1/qnames.txt"
data2 = pd.read_table(path2,header=0)

[16]:

# Pair each skill name with its column sum from q.txt (DataFrame.append was removed in pandas 2.0)
ds = pd.DataFrame({
    'skill': data2['Skill Names'].iloc[:len(ds1)].to_numpy(),
    'count': ds1.to_numpy()
})

[17]:

ds=ds.sort_values(['count'])
fig = px.bar(
    ds,
    x='count',
    y='skill',
    orientation='h',
    title='Skill count'
)
fig.show("svg")

../../../_images/build_blitz_math2015_Math1_27_0.svg

This figure shows that calculation is the most important skill in the test, as almost every item is related to it. Reasoning and demonstration is also key to doing well on the test.
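As a minimal sketch of the skill count, assume q.txt is a binary item-by-skill matrix in which a 1 means the item requires the skill. This is consistent with the column sums taken above, although the notebook does not spell it out. The matrix and the skill labels below are made up.

```python
import pandas as pd

# Toy Q-matrix (made-up values): rows are items, columns are skills.
q = pd.DataFrame([
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
])
# Hypothetical skill labels, one per column of q.
skill_names = ['calculation', 'reasoning', 'geometry']

# Column sums count how many items require each skill.
skill_count = pd.Series(q.sum().to_numpy(), index=skill_names)
skill_count = skill_count.sort_values(ascending=False)
print(skill_count)
```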