Sunday, February 17, 2008

CS805 Doctor Howard Project #2 Normal model

Normal Model
Tai Cleveland
Class CS805
Doctor Carol Howard
Project #2 Normal Model
Due on Monday, November 5th, 2007

The following lists the number of errors per 10000 lines of code for a large service-oriented architecture (SOA) software project on the inventory management system of Kuiper Leda, a global electronics components provider.
516
548
566
534
551
546
523
538
523
529
486
558
574
586
552

We assume a normal model in analyzing the data above (Weiers, 2005). First, we determine the measures of central tendency (the mean, median, and mode) to describe the number of errors per 10000 lines of code for this IT project. The descriptive statistics are shown below.
Mean
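The mean is the sum of the observations divided by their count. As a minimal sketch, the calculation can be checked in Python; the values are copied from the sorted list given in this report, and the variable names are illustrative only.

```python
# Error counts per 10000 lines of code, copied from the sorted list in this report.
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

n = len(errors)        # number of observations
total = sum(errors)    # sum of all observations
mean = total / n       # arithmetic mean

print(n, total, mean)  # 15 8130 542.0
```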



Median
The data are first sorted from lowest to highest, as shown below:
486 516 523 523 529 534 538 546 548 551 552 558 566 574 586

With 15 observations, the median is the 8th ordered measurement, 546.

Mode: The value 523 appears twice in the sorted data, so it is the mode.
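The median and mode can be reproduced with a short sketch. It assumes the sorted values listed above and relies on the sample size being odd, so the median sits at position (n + 1) / 2; the variable names are illustrative.

```python
from collections import Counter

# Error counts per 10000 lines of code, copied from the sorted list in this report.
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

ordered = sorted(errors)
n = len(ordered)

# For an odd n, the median is the value at 1-based position (n + 1) / 2.
position = (n + 1) // 2
median = ordered[position - 1]

# The mode is the value (or values) with the highest frequency.
counts = Counter(ordered)
highest = max(counts.values())
modes = [value for value, count in counts.items() if count == highest]

print(position, median, modes)  # 8 546 [523]
```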

Standard deviation
To measure the dispersion of the data, we also calculate the standard deviation of the above data, as shown in the following:
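The standard deviation calculation can be sketched as follows. The report does not state whether the sample form (divisor n - 1) or the population form (divisor n) was used, so both are computed here; treating the sample form as the intended one is an assumption.

```python
import math

# Error counts per 10000 lines of code, copied from the sorted list in this report.
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

n = len(errors)
mean = sum(errors) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in errors)

# Sample standard deviation (divisor n - 1); assumed to be the intended form.
sample_sd = math.sqrt(squared_deviations / (n - 1))

# Population standard deviation (divisor n), shown for comparison.
population_sd = math.sqrt(squared_deviations / n)

print(squared_deviations, round(sample_sd, 2), round(population_sd, 2))
# 8808.0 25.08 24.23
```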




Another way of describing the variation or spread in the data set is to determine the values that divide the ordered observations into four equal parts. These values are known as quartiles (Weiers, 2005). To locate the first quartile, we multiply 0.25 by the sample size n plus 1: 0.25 (15 + 1) = 0.25 (16) = 4. Likewise, the third quartile is located at 0.75 (15 + 1) = 0.75 (16) = 12. Therefore, the first and third quartile values are located at positions 4 and 12. The 4th value in the ordered array used to determine the median is 523; the 12th value is 558. These are the first and third quartiles, respectively. Below is the 5-Number-Summary, which also includes the minimum, median, and maximum values:
Minimum: 486
First Quartile: 523
Median: 546
Third Quartile: 558
Maximum: 586
The summary above allows us to make the box plot, which is shown below:
The box plot shows that the middle 50 percent of the measurements fall between 523 and 558 errors per 10000 lines of code. The distance between the ends of the box, 558 - 523 = 35 errors per 10000 lines of code, is known as the interquartile range, which describes the dispersion of the central half of the measurements on Kuiper Leda’s SOA project.
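The quartile positions, 5-Number-Summary, and interquartile range described above can be reproduced with a minimal sketch. The helper value_at_position is hypothetical; its linear interpolation for fractional positions is an assumption that goes beyond this report, where both positions happen to be whole numbers.

```python
def value_at_position(ordered, position):
    """Return the value at a 1-based position in a sorted list.

    Fractional positions are linearly interpolated between the two
    neighbouring values (an assumption; not needed for this data set).
    """
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Error counts per 10000 lines of code, copied from the sorted list in this report.
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

ordered = sorted(errors)
n = len(ordered)

q1 = value_at_position(ordered, 0.25 * (n + 1))      # position 4
median = value_at_position(ordered, 0.50 * (n + 1))  # position 8
q3 = value_at_position(ordered, 0.75 * (n + 1))      # position 12

five_number_summary = (ordered[0], q1, median, q3, ordered[-1])
iqr = q3 - q1

print(five_number_summary, iqr)  # (486, 523, 546, 558, 586) 35
```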
Lastly, we plot the histogram of values using a class interval of 20 errors per 10000 lines of code. The histogram is illustrated below:


A histogram groups data into classes (Weiers, 2005). When we organized the data into nine classes, we lost the exact values of the measurements. Because the sample is small, the histogram shown above is not as smooth as we would expect from a normal distribution. Nevertheless, the histogram does exhibit central tendency, with a well-defined mean, median, mode, and standard deviation, which is consistent with the normal model that we assumed before analyzing the data.
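The grouping into classes of width 20 can be sketched as a simple frequency count. The class boundaries of the original figure are not given in the text, so starting the first class at 480 is an assumption, and the number of classes printed here may differ from the original histogram.

```python
# Error counts per 10000 lines of code, copied from the sorted list in this report.
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

width = 20   # class interval stated in the report
start = 480  # lower edge of the first class (assumed, not from the report)

# Count how many observations fall into each class [lower, lower + width).
counts = {}
for x in errors:
    lower = start + ((x - start) // width) * width
    counts[lower] = counts.get(lower, 0) + 1

# Print a text histogram, one row per non-empty class.
for lower in sorted(counts):
    print(f"{lower}-{lower + width - 1}: {'*' * counts[lower]}")
```

Under these assumed boundaries, most observations fall in the two classes on either side of the mean, with fewer in the outer classes.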
Reference
Weiers, R.M. (2005). Introduction to business statistics. 5th ed. Duxbury: Brooks/Cole.
