As of today, I'm still struggling with class CS802 from Doctor Debra Beazley. I'm not sure what it is she wants.
Sunday, February 17, 2008
Class CS805 Doctor Carol Howard Project 3 Sampling Distributions
Sampling Distributions
Professor Doctor Carol Howard
Class CS 805
By Tai Cleveland
Project # 3
Due on Monday November 12th, 2007
As defined by David W. Stockburger, “the sampling distribution is the distribution of a sample statistic” (Stockburger, 1998). Although it is a distribution like any other, its values are statistics and not raw data. For example, if we repeatedly draw samples of 10 scores and compute the mean of each sample, the means we obtain are the values in our sampling distribution. The mean of a sampling distribution is represented by µ. There is always a subscript to the µ, which tells us which statistic the sampling distribution refers to.
As mentioned, there is a sampling distribution for each statistic. If we want the sampling distribution of medians, then we use medians, not means. However, it was found that the mean generally has a smaller standard error than the median or the mode, since the mean takes into consideration all the values or scores included in the sample. The median is simply the middle number, while the mode is the value that comes up most often in a sample. The standard error (σ, with a subscript naming the statistic), on the other hand, is “the degree by which the computed statistics will differ from one another when calculated from sample of similar size and selected from similar population models” (Stockburger, 1998).
The sampling distribution of means is simply made up of means computed from samples of scores or values. The sampling distribution of means is closely related to the population distribution (Stockburger, 1998); this relationship is described by the Central Limit Theorem. There are 2 rules under this theorem. The first rule is that the mean of the sampling distribution of means and the mean of the population are equal. To illustrate the first rule, below is an example of a population of 5 values, from which samples were drawn.
Population of values | Samples from the Population | Means from the samples
1 | 1, 2, 4, 5 | 3
2 | 1, 2, 3, 4 | 2.5
3 | 2, 3, 4, 5 | 3.5
4 | 1, 3, 4, 5 | 3.25
5 | 1, 2, 3, 5 | 2.75
Population mean = 3 | (All possible samples) n = 4 | Total = 15 / 5 = 3
To verify the first rule of the Central Limit Theorem, follow the steps outlined below:
1. Start with the population values. In this case, we had 5 values in our population.
2. Obtain all possible samples. With a sample size of n = 4, this population yields 5 possible samples.
3. Add up the values in each sample, then divide the sum by n, the sample size. This gives us the mean of that particular sample.
4. Add all the sample means, then divide by the number of samples. This yields the same value as the population mean.
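The four steps can be checked with a short script. This is only a sketch, assuming samples are drawn without replacement (as in the table); the variable names are illustrative and not part of the original project.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5]
n = 4  # sample size used in the example

# Step 2: every distinct sample of size n, drawn without replacement.
samples = list(combinations(population, n))

# Step 3: mean of each sample, kept as exact fractions to avoid rounding.
sample_means = [Fraction(sum(s), n) for s in samples]

# Step 4: mean of the sample means, compared with the population mean.
mean_of_means = sum(sample_means) / len(sample_means)
population_mean = Fraction(sum(population), len(population))

for s, m in zip(samples, sample_means):
    print(s, float(m))
print("population mean:", float(population_mean))
print("mean of sample means:", float(mean_of_means))
```

Both printed means come out equal, which is what the first rule states.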
The second rule under the Central Limit Theorem states that the sampling distribution of means approaches a normal curve as the sample size grows, regardless of the shape of the population distribution (Sampling Distribution Demo). The reason behind this rule is that even though the samples differ from one another, their means tend to fall near the center of the population distribution.
The first rule holds even for small sample sizes, as shown above, while the normal approximation improves with a larger sample size, since a larger sample is closer to the true population. As such, the Central Limit Theorem is the basis of much of hypothesis testing and sampling theory (Stockburger, 1998). Additionally, the Central Limit Theorem serves as a powerful tool for researchers: because the sampling distribution of means is approximately normal, it gives scientists and researchers a basis or justification for studies of many phenomena, including naturally occurring ones (Stockburger, 1998).
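The effect of sample size can be sketched with a small simulation. Everything specific here is an assumption made for illustration and does not come from the cited sources: the skewed (exponential) population with mean 1, the sample sizes 5 and 50, the 2000 repetitions, and the fixed random seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable (illustrative choice)

def sample_means(sample_size, n_samples=2000):
    """Means of n_samples random samples drawn from a skewed population."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

small = sample_means(5)    # sampling distribution of means for n = 5
large = sample_means(50)   # sampling distribution of means for n = 50

for label, means in (("n = 5", small), ("n = 50", large)):
    print(label,
          "| mean of means:", round(statistics.fmean(means), 3),
          "| spread of means:", round(statistics.stdev(means), 3))
```

In both cases the mean of the sample means stays close to the population mean of 1, while the spread of the means (the standard error) is smaller for the larger sample size.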
References
Sampling Distributions. Sampling Distributions Demo. Accessed October 31, 2007, from http://faculty.uncfsu.edu/dwallace/ssample.html
Stockburger, David W. (1998). The Sampling Distribution. Introductory Statistics: Concepts, Models, and Applications, 1.0. Accessed October 29, 2007, from http://www.psychstat.missouristate.edu/introbook/SBK19.htm
Posted by Dr. Tai Cleveland at 7:22 AM
CS805 Doctor Howard Project #2 Normal model
Normal Model
Tai Cleveland
Class CS805
Doctor Carol Howard
Project #2 Normal Model
Due on Monday November 5th, 2007
The following lists the number of errors per 10,000 lines of code for a large service-oriented architecture (SOA) software project on the inventory management system of Kuiper Leda, a global electronics components provider.
516
548
566
534
551
546
523
538
523
529
486
558
574
586
552
We assume a normal model in analyzing the data above (Weiers, 2005). First we determine the measures of central tendency (the mean, median, and mode) to describe the number of errors per 10,000 lines of code for this IT project. The descriptive statistics follow.
Mean: 8130 / 15 = 542, the sum of the 15 measurements divided by the number of measurements.
Median
The data is first sorted in the order from lowest to highest as shown below:
486 516 523 523 529 534 538 546 548 551 552 558 566 574 586
The 8th ordered measurement, 546, is the median of the 15 observations.
Mode: The value 523 occurs twice, more often than any other value, so it serves as the mode.
Standard deviation
To measure the dispersion of the data, we also calculate the standard deviation of the above data.
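The descriptive statistics can be reproduced with Python's standard `statistics` module. This is a sketch using the ordered list of 15 measurements given in the text. The original calculation of the standard deviation is not shown, so whether the sample form (dividing by n - 1) or the population form (dividing by n) was intended is an assumption; both are computed here.

```python
import statistics

# Ordered measurements from the text (errors per 10,000 lines of code).
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

mean = statistics.mean(errors)
median = statistics.median(errors)
mode = statistics.mode(errors)

# Assumption: sample standard deviation (divides by n - 1).
sample_sd = statistics.stdev(errors)
# For comparison: population standard deviation (divides by n).
population_sd = statistics.pstdev(errors)

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("sample standard deviation:", round(sample_sd, 2))
print("population standard deviation:", round(population_sd, 2))
```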
Another way of describing the variation or spread in the data set is to determine the values that divide a set of observations into four equal parts. These values are known as quartiles (Weiers, 2005). To locate the first quartile, we multiply 0.25 by the sample size n plus 1: 0.25 (15 + 1) = 0.25 (16) = 4. Likewise, the third quartile is located at 0.75 (15 + 1) = 0.75 (16) = 12. Therefore, the first and third quartile values are located at positions 4 and 12. The 4th value in the ordered array used to determine the median is 523; the 12th value is 558. These are the first and third quartiles, respectively. Below is the 5-Number-Summary, which also includes the minimum, median and maximum values:
Minimum: 486
First Quartile: 523
Median: 546
Third Quartile: 558
Maximum: 586
The summary above allows us to make the box plot, which is shown below:
The box plot shows that the middle 50 percent of the measurements lie between 523 and 558 errors per 10,000 lines of code. The distance between the ends of the box, 558 - 523 = 35 errors per 10,000 lines of code, is known as the interquartile range, which shows the dispersion of the central half of the measurements on Kuiper Leda’s SOA project.
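The 5-Number-Summary and the interquartile range can be recomputed with the (n + 1) position rule described above. This is a minimal sketch; the helper `value_at_position` is a hypothetical name, and its interpolation between neighboring values for fractional positions is an added assumption that this data set never triggers.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in an ordered list.

    Whole-number positions return that value directly; fractional
    positions interpolate between the two neighbors (an assumption).
    """
    lower = int(position)
    fraction = position - lower
    value = ordered[lower - 1]
    if fraction:
        value += fraction * (ordered[lower] - ordered[lower - 1])
    return value

# Ordered measurements from the text (errors per 10,000 lines of code).
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]
n = len(errors)

summary = {
    "minimum": errors[0],
    "first quartile": value_at_position(errors, 0.25 * (n + 1)),
    "median": value_at_position(errors, 0.50 * (n + 1)),
    "third quartile": value_at_position(errors, 0.75 * (n + 1)),
    "maximum": errors[-1],
}
interquartile_range = summary["third quartile"] - summary["first quartile"]

for name, value in summary.items():
    print(f"{name}: {value}")
print("interquartile range:", interquartile_range)
```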
Lastly, we plot the histogram of values using a class interval of 20 errors per 10,000 lines of code. The histogram is illustrated below:
A histogram groups data into classes (Weiers, 2005). When we organized the data into nine classes, we lost the exact value of the measurements. Since we only have a small number of observations, the histogram shown above is not as smooth as we expect from a normal distribution. Nevertheless, the histogram does exhibit central tendency, with a well-defined mean, median, mode, and standard deviation, which is consistent with (though it does not prove) the normal model that we had assumed before analyzing the data.
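The grouping into classes can be sketched as a text histogram. The class interval of 20 comes from the text, but the lower limit of the first class (480 here) is an assumption, since the original class limits are not given; a different starting point would give a different number of classes than this sketch produces.

```python
from collections import Counter

# Ordered measurements from the text (errors per 10,000 lines of code).
errors = [486, 516, 523, 523, 529, 534, 538, 546,
          548, 551, 552, 558, 566, 574, 586]

class_width = 20  # class interval stated in the text
start = 480       # assumed lower limit of the first class

# Index of the class each measurement falls into.
counts = Counter((x - start) // class_width for x in errors)

for k in range(max(counts) + 1):
    low = start + k * class_width
    high = low + class_width - 1
    print(f"{low}-{high}: {'*' * counts[k]}")
```

Once the values are counted per class, only the class frequencies remain, which is the loss of exact values mentioned above.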
Reference
Weiers, R.M. (2005). Introduction to business statistics. 5th ed. Duxbury: Brooks/Cole.
Posted by Dr. Tai Cleveland at 7:17 AM
CS805 Doctor Carol Howard Project #1
Tai Cleveland
Class CS 805
Project # 1 Hacker Attacks
Dr. Carol Howard
Due Monday Oct 29th, 2007
Hacker Attacks
Aside from DUI incidents, the war on terrorism and the continuous propagation of viruses on the internet, web server attacks have been among the most serious crimes to date. Unlike the aforementioned crimes and incidents, they have not been given considerable attention. According to a survey conducted by Zone-H (BBC News, 2005), web server attacks and web defacements grew about 36% in 2004, to about 400,000 reported incidents in that year alone. Christmas holidays are the most popular time for malicious hackers to attack sites.
Meanwhile, the Web Application Security Consortium reported the following statistics from 1999 to 2007 (2007):
Year | Total | Security Breaches | Vulnerability Disclosures
1999 | 1 | 1
2000 | 5 | 2 | 3
2001 | 6 | 1 | 5
2002 | 4 | 3 | 1
2003 | 9 | 3 | 6
2004 | 17 | 6 | 11
2005 | 62 | 31 | 31
2006 | 44 | 18 | 26
2007 | 45 | 42 | 3
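In addition, the following table summarizes the number of incidents recorded by attack classification (Web Application Security Consortium, 2007). It must be noted that most of these counts are very small, so conclusions drawn from them should be treated with caution (the law of small numbers).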
Class | Total | Security Breaches | Vulnerability Disclosures
Cross-site Scripting | 54 | 16 | 38
Unknown | 41 | 38 | 3
SQL Injection | 25 | 16 | 9
Insufficient Authorization | 22 | 9 | 13
Credential/Session Prediction | 16 | 3 | 13
Insufficient Authentication | 14 | 6 | 8
OS Commanding | 10 | 9 | 1
Predictable Resource Location | 7 | 3 | 4
Other | 7 | 6 | 1
Weak Password Recovery Validation | 4 | 1 | 3
Information Leakage | 4 | 4
Content Spoofing | 4 | 4
Abuse of Functionality | 4 | 3 | 1
Misconfiguration | 3 | 3
Worm | 2 | 2
Insufficient Anti-automation | 2 | 2
Known Vulnerability | 2 | 1 | 1
Denial of Service | 1 | 1
Brute Force | 1 | 1
Defacement | 1 | 1
Directory Indexing | 1 | 1
HTTP Response Splitting | 1 | 1
Insufficient Session Expiration | 1 | 1
Path Traversal | 1 | 1
Phishing | 1 | 1
Redirection | 1 | 1
Insufficient Process Validation | 1 | 1
References
BBC News. (2005). “Web server attacks 'growing fast': More than 2,500 web servers every day are being hacked, reveals a report.” Retrieved November 2007.
Web Application Security Consortium. (2007). “Web Hacking Statistics.” Retrieved November 2007.
Posted by Dr. Tai Cleveland at 7:12 AM