Welcome to Part 2 of Beyond the SLC Headlines. In case you have forgotten, we are looking at the 2069 SLC results data set and trying to see what we can learn from it. Part 1 of this series is at http://www.karkhana.asia/beyond-the-slc-headlines-part-1/ but you don’t have to read it to make sense of this blog post. You just need to know two things:
- We are looking at data to see if the standard post-SLC headline… of govt school bad, private school good… holds up to a more subtle analysis grounded in data. While acknowledging that private school students do better on the SLC as a whole, we want to paint a fuller picture that is not in black and white.
- We are using Private/Institutional and Govt/Community school interchangeably, though this is not 100% accurate. As we get more identifying data for schools we will try to slice SLC performance using other categories too (e.g. rural vs. urban).
So think back to the last time you went to the doctor. When you go to the doctor with a problem she almost always takes your pulse. But is that enough to make a diagnosis? No!
Depending on your complaints, she might test your reflexes, order blood tests, or peer into your nose and throat. Why? Because multiple points of data allow us to make a better diagnosis of the problem.
Yet when it comes to the SLC we do the equivalent of just taking the pulse, i.e. we only look at the total pass percentage. In 2069 we are told that the private school pass rate was 86.40 percent, because 80663 of the 93360 students who appeared from private schools passed. And because 86760 of the 309699 students who appeared from community schools passed, we are told their pass rate was 28.01 percent.
What does this tell us? It tells us that as a whole the private school students are much more likely to pass the SLC. That is nice to know but it is not really all that illuminating. For example, it does not tell us whether a few community schools are failing badly and bringing down the national average, or whether most community schools are showing a below average performance. Knowing that would help us because the response to the two problems should look quite different. So what do we do?
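For anyone who wants to check the headline arithmetic, here is a minimal Python sketch. The function name `pass_rate` is my own label for illustration; the student counts are the ones quoted above.

```python
def pass_rate(passed: int, appeared: int) -> float:
    """Return the pass percentage for a group of students."""
    return 100 * passed / appeared


# Student counts quoted in the post for the 2069 SLC.
private_rate = pass_rate(passed=80663, appeared=93360)
community_rate = pass_rate(passed=86760, appeared=309699)

print(f"Private schools:   {private_rate:.2f}%")
print(f"Community schools: {community_rate:.2f}%")
```

Both figures are computed over students, not over schools, which is exactly why they behave like a single pulse reading.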
Fortunately we have some basic statistical tools we can use to figure out a bit more about this data. Let’s start by looking at the mean, the median and other percentiles, the range and the standard deviation. Those of you who are familiar with these concepts can skip ahead. For the rest, here is a little primer.
——————————————————————————————————————————————
For the purpose of this primer we are looking at 5 schools with the following results
School | # Students Appearing on SLC | # Students Passed | Pass Percentage |
School 1 | 6 | 6 | 100% |
School 2 | 50 | 12 | 24% |
School 3 | 50 | 33 | 66% |
School 4 | 40 | 8 | 20% |
School 5 | 100 | 15 | 15% |
Total | 246 | 74 | 30% |
Mean: is what we generally call the average. This is where you take a bunch of numbers, add them up, then divide by how many numbers were added. E.g. to get the mean of the pass percentages (100, 24, 66, 20, 15): first add them up, 100+24+66+20+15 = 225, then count how many numbers were added up, i.e. 5, so the mean pass percentage is 225/5 = 45%.
You can already see how much it matters which statistic we use to gauge the health of the govt school system… because if we use ‘total students passed/total students appeared’ for this example we get only 30%, but if we use the mean of the pass percentages we get 45%. Rather than choosing the number that suits our message better, what we need to do is look at both of these numbers, because together they give us more information than either one would alone.
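Here is a rough Python sketch of that difference, using only the five primer schools from the table above. The variable names are mine; nothing here comes from the real SLC data set.

```python
from statistics import mean

# (students appearing, students passed) for the five primer schools.
schools = [(6, 6), (50, 12), (50, 33), (40, 8), (100, 15)]

# Student-level number: pool every student together, then divide.
total_appeared = sum(appeared for appeared, _ in schools)
total_passed = sum(passed for _, passed in schools)
pooled_rate = 100 * total_passed / total_appeared

# School-level number: work out each school's pass percentage first,
# then average those percentages, so every school counts equally.
school_rates = [100 * passed / appeared for appeared, passed in schools]
mean_of_rates = mean(school_rates)

print(f"Pooled pass rate:     {pooled_rate:.1f}%")
print(f"Mean of school rates: {mean_of_rates:.1f}%")
```

The gap comes from school size: School 1 has only 6 students but counts as much as School 5 with 100 when we average the school percentages.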
Median: is another type of average. Let’s take the same numbers (100, 24, 66, 20, 15). To get the median we first arrange them in ascending order, so (15, 20, 24, 66, 100). Then we pick the number in the middle, so: 24. The median tells us that half the numbers in the data set are below it, and half are above.
Looking at the mean and median together can help improve our diagnosis. In the example above, if we were told the mean is a 45% pass rate and the median is a 24% pass rate, what could we know about the data? One thing we could say with confidence is that while at least half of the schools are doing a bad job getting kids to pass the SLC (we know this because half of them have a pass rate of 24% or less), some schools are doing a much better job (else how could the mean be so much higher than 24%?). If we see a median that is a lot bigger than the mean then we would say the opposite.
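To make the mean vs. median comparison concrete, here is a small sketch on the primer's five pass percentages. It uses Python's standard `statistics` module; the variable names are just for illustration.

```python
from statistics import mean, median

rates = [100, 24, 66, 20, 15]  # pass percentages of the five primer schools

middle = median(rates)   # the middle value once the list is sorted
average = mean(rates)    # the sum divided by the count

# Schools at or below the median, and schools above the mean.
at_or_below_median = [r for r in rates if r <= middle]
above_mean = [r for r in rates if r > average]

print("Sorted rates:", sorted(rates))
print("Median:", middle)
print("Mean:", average)
print("At or below the median:", at_or_below_median)
print("Above the mean:", above_mean)
```

Three of the five schools sit at or below the median while only two sit above the mean, and those two are what pull the mean up.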
Percentile: is a tricky concept to explain with just 5 items in the data set! But it is a valuable concept to grapple with. The easiest way to come at it might be by thinking of the median as the 50th percentile, i.e. the median pass percentage tells us that half the schools in the country had pass percentages lower than that… and half the schools had a pass percentage higher than that. So if I said the 25th percentile for community schools is 20% and for private schools is 60%, what would you understand? You should understand that ¼ of community schools have pass percentages below 20% while ¼ of private schools have pass percentages below 60%. This will become valuable later when we look to see if the top performing community schools are outperforming the worst performing private schools.
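If you want to play with percentiles yourself, here is a minimal sketch using `statistics.quantiles` (Python 3.8 or later) on the five primer schools. There are several conventions for computing percentiles on small data sets; the `method="inclusive"` choice below is my assumption and not necessarily the one used for the figures in this post.

```python
from statistics import quantiles

rates = [100, 24, 66, 20, 15]  # pass percentages of the five primer schools

# n=4 asks for the three cut points that split the data into quarters:
# the 25th, 50th and 75th percentiles.
# method="inclusive" (an assumption) treats the five schools as the whole
# population rather than a sample from a bigger one.
p25, p50, p75 = quantiles(rates, n=4, method="inclusive")

print("25th percentile:", p25)
print("50th percentile (the median):", p50)
print("75th percentile:", p75)
```

With only five schools the cut points are very coarse, which is why percentiles only become really useful on the full list of schools.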
Standard Deviation: is a way to look at variation, i.e. to figure out how spread out a data set is. Let’s make up a data set (18, 2, 2, 4, 4) and compare it to another data set (4, 9, 4, 4, 9). Both data sets add up to 30, and both have a mean of 6 and a median of 4. But for the first the Std. Dev is 6.07 while the second has 2.45. This tells us that the data in the second data set is closer together and the data in the first is more spread apart.
This is important to us because it shows how much variation there is in the private school vs community school data sets. In practical terms, it will allow us to see if there is a lot of difference in SLC pass percentages among community schools or not! Why do we want to know this? Because, if you remember, the purpose of this exercise is to a large degree to figure out if community schools are uniformly bad at getting kids to pass the SLC… or if there are some that do it well.
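The standard deviation figures in the primer can be checked with a short sketch. Two assumptions on my part: the second data set is taken as the five values (4, 9, 4, 4, 9), which is what adds up to 30, and the population standard deviation (`statistics.pstdev`) is used because it reproduces the 6.07 and 2.45 quoted above. The sample version (`statistics.stdev`) would give somewhat larger numbers.

```python
from statistics import mean, median, pstdev

spread_out = [18, 2, 2, 4, 4]
close_together = [4, 9, 4, 4, 9]  # assumed five-value version of the second set

for name, data in [("first", spread_out), ("second", close_together)]:
    print(
        f"{name}: sum={sum(data)}, mean={mean(data)}, "
        f"median={median(data)}, std dev={pstdev(data):.2f}"
    )

sd_first = pstdev(spread_out)
sd_second = pstdev(close_together)
```

Same sum, same mean, same median, yet very different spread, which is the whole reason to look at this number alongside the averages.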
——————————————————————————————————————————————
Now that we know the terms we are going to use, let’s apply them to real data. We will look at the following:
- National Pass Percentage: this is (total number of students passed / total number of students appearing) * 100 for private vs community. While this number is based on the total number of students, all the others will use schools as the basic unit of analysis.
- Mean: to get the mean we take the pass percentage for each school and get an average of these numbers for private vs community.
- Median & Percentiles: we find the median pass percentage for each type of school around the country and look at percentiles.
- Range: we find the range for each type of school
- Std Dev: we look for the standard deviation in each kind of school’s pass percentage
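The school-level items in this list could be computed along these lines. This is only a sketch: `summarize_pass_rates` is a hypothetical helper name, the percentile method and the use of the population standard deviation are my assumptions, and the demo runs on the five primer schools because the per-school SLC data is not reproduced in this post.

```python
from statistics import mean, median, pstdev, quantiles


def summarize_pass_rates(rates):
    """Hypothetical helper: school-level summary of a list of pass percentages."""
    # With n=100, cuts[k - 1] is the k-th percentile.
    cuts = quantiles(rates, n=100, method="inclusive")
    return {
        "mean": mean(rates),
        "median": median(rates),
        "p10": cuts[9],
        "p25": cuts[24],
        "p75": cuts[74],
        "p90": cuts[89],
        "range": max(rates) - min(rates),
        "std_dev": pstdev(rates),
    }


# Demo on the primer's five schools, not on the real SLC data.
summary = summarize_pass_rates([100, 24, 66, 20, 15])
for label, value in summary.items():
    print(f"{label}: {value:.2f}")
```

Running the same function once on the private school pass percentages and once on the community school ones would give the two columns of a comparison table.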
Ok, so let’s make a little table to help compare these data items.
Statistic | Private Schools | Community Schools |
National Pass Percentage | 86.4 | 28.01 |
Mean | 86.47 | 30.37 |
Median | 95.45 | 24.32 |
10th percentile | 57.14 | 4.35 |
25th percentile | 83.33 | 11.48 |
75th percentile | 100 | 43.75 |
90th percentile | 100 | 66.7 |
Range | 100 | 100 |
Std Dev | 21.08 | 24.29 |
What do we observe when looking at this data?
- There is a lot more variability in community school performance as compared to private schools.
Though the range and the standard deviation don’t show much of a difference in variability, the percentile figures make it clear that there is one. The private school data quickly clusters around a relatively high point (about 83% at the 25th percentile). This tells us that there are a few private schools with low-ish pass percentages, but the vast majority of them are on the high end. The community school data, though, shows some nutty stuff. We find that 10% of community schools have a pass rate below 4.35%!!! That is insane. But we also see that the top 10% of community schools have a pass rate of over 66%. From this data we can conclude that the National Pass Percentage is not an adequate way to look at community school performance. For private schools the 86.4% is fair since most of the private schools cluster around that number. But 28.01% misrepresents community school performance, since many of them do much better than that… and many of them do worse. (Note: the top 10% of community schools also do better than the bottom 10% of the private schools.)
- A few bad performers hide in the private school numbers
We notice that the median for private schools (95.45%) is much higher than the mean (86.47%). This means that half of private schools have a pass rate of 95.45% or higher, but because a few perform really badly (there are 27 private schools with a 0% pass rate) the overall average gets dragged down.
- A few good performers go unnoticed in the community schools numbers
Community schools show the opposite tendency: their mean (30.37%) is higher than their median (24.32%). This tells us that even though 50% of community schools have pass percentages lower than 24.32%, some high performers drag the mean higher than that.
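The observations above can be read straight off the comparison table with a few lines of Python. This sketch only copies the published figures into a dictionary; the helper name `pull_on_mean` and the wording of its labels are mine.

```python
# Figures copied from the comparison table (all in percent).
table = {
    "private": {"mean": 86.47, "median": 95.45, "p10": 57.14, "p90": 100.0},
    "community": {"mean": 30.37, "median": 24.32, "p10": 4.35, "p90": 66.7},
}


def pull_on_mean(mean_value, median_value):
    """Say which way a minority of extreme schools is pulling the mean."""
    if mean_value < median_value:
        return "pulled down by a few low performers"
    if mean_value > median_value:
        return "pulled up by a few high performers"
    return "not pulled either way"


for kind, stats in table.items():
    gap = round(stats["mean"] - stats["median"], 2)
    print(f"{kind}: mean - median = {gap}, mean is "
          f"{pull_on_mean(stats['mean'], stats['median'])}")

# Do the best community schools beat the weakest private schools?
top_community_beats_bottom_private = (
    table["community"]["p90"] > table["private"]["p10"]
)
print("Top 10% community > bottom 10% private:",
      top_community_beats_bottom_private)
```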
This ends this episode of “Beyond the SLC Headlines”. I see this exploration going in two possible directions.
The first possible direction will be to go more visual. So far my analysis has been entirely textual but it would be nice to visualize some of this data. My plan is to play around with some graphical representations of this data over the next week and we’ll see if there is anything interesting there. It has been a few years since I messed around with the statistical package R.
The second possible direction is to see what happens when we start playing with outliers, both in the sense of only analyzing them, or removing them from the analysis. I am going to do some reading into how we work with outliers and what we can learn from them.
Hope you enjoyed the series and will keep tuning in. As I understand the data better (and get my hands on more data… there is an effort to extract more data from various sources) the posts should get even more juicy.