14.8 - Uniform Applications

Perhaps not surprisingly, the uniform distribution is not particularly useful in describing much of the randomness we see in the natural world. Its claim to fame is instead its usefulness in random number generation. That is, approximate values of the \(U(0,1)\) distribution can be simulated on most computers using a random number generator. The generated numbers can then be used to randomly assign people to treatments in experimental studies, or to randomly select individuals for participation in a survey.

Before we explore the above-mentioned applications of the \(U(0,1)\) distribution, it should be noted that the random numbers generated from a computer are not technically truly random, because they are generated from some starting value (called the seed). If the same seed is used again and again, the same sequence of random numbers will be generated. It is for this reason that such random number generation is sometimes referred to as pseudo-random number generation. Yet, despite a sequence of random numbers being pre-determined by a seed number, the numbers do behave as if they are truly randomly generated, and are therefore very useful in the above-mentioned applications. They would probably not be particularly useful in the applications of cryptography or internet security, however!
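To see the role of the seed concretely, here is a minimal sketch using Python's standard `random` module (the notes themselves use Minitab; the helper name `uniform_draws` and the seed values 414 and 415 are made up for illustration). It starts two generators from the same seed and a third from a different seed:

```python
import random

def uniform_draws(seed, count):
    """Return `count` pseudo-random U(0,1) values from a generator started at `seed`."""
    rng = random.Random(seed)  # a separate generator with its own seed
    return [rng.random() for _ in range(count)]

# Hypothetical seed values, chosen only for illustration.
first_run = uniform_draws(seed=414, count=5)
second_run = uniform_draws(seed=414, count=5)
other_seed = uniform_draws(seed=415, count=5)

print(first_run == second_run)  # same seed, so the sequences are compared element by element
print(first_run == other_seed)  # different seed, so a different sequence is expected
```

Fixing the seed is what makes a simulation or a random assignment reproducible: anyone who reruns the code with the same seed gets the same numbers.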

Quantile-Quantile (Q-Q) Plots

Before we jump in and use a computer and a \(U(0,1)\) distribution to make random assignments and random selections, it would be useful to discuss how we might evaluate whether a particular set of data follows a particular probability distribution. One possibility is to compare the theoretical mean (\(\mu\)) and variance (\(\sigma^2\)) with the sample mean (\(\bar{x}\)) and sample variance (\(s^2\)). It shouldn't be surprising that such a comparison is hardly sufficient, since many different distributions share the same mean and variance. Another technique used frequently is the creation of what is called a quantile-quantile plot (or a q-q plot, for short). The basic idea behind a q-q plot is a two-step process: 1) first determine the theoretical quantiles (from the supposed probability distribution) and the sample quantiles (from the data), and then 2) compare them on a plot. If the theoretical and sample quantiles "match," there is good evidence that the data follow the supposed probability distribution. Here are the specific details of how to create a q-q plot:

Determine the theoretical quantile of order \(p\), that is, the \((100p)^{th}\) percentile \(\pi_p\).
Determine the sample quantile, \(y_r\), of order \(\dfrac{r}{n+1}\), that is, the \(\left(100\dfrac{r}{n+1}\right)^{th}\) percentile. While that might sound complicated, it amounts to just ordering the data \(x_1, x_2, \ldots, x_n\) to get the order statistics \(y_1\le y_2\le \ldots \le y_n\).
Plot the theoretical quantile on the y-axis against the sample quantile on the x-axis. If the sample data follow the theoretical probability distribution, we would expect the points \((y_r, \pi_p)\) to lie close to a line through the origin with slope equal to one.

In the case of the \(U(0,1)\) distribution, the cumulative distribution function is \(F(x)=x\). Now, recall that to find the \((100p)^{th}\) percentile \(\pi_p\), we set \(p\) equal to \(F(\pi_p)\) and solve for \(\pi_p\). That means in the case of the \(U(0,1)\) distribution, we set \(F(\pi_p)=\pi_p\) equal to \(p\) and solve for \(\pi_p\). Ahhhhaaa! In the case of the \(U(0,1)\) distribution, \(\pi_p=p\). That is, \(\pi_{0.05}=0.05\), \(\pi_{0.35}=0.35\), and so on. Let's take a look at an example!
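The three steps, specialized to the \(U(0,1)\) case where \(\pi_p=p\), can be sketched in a few lines of Python. This is only an illustration: the function names and the four-point `sample` are hypothetical and do not come from the notes.

```python
def uniform_theoretical_quantiles(n):
    """Theoretical U(0,1) quantiles pi_p = p for p = r/(n+1), r = 1, ..., n."""
    return [r / (n + 1) for r in range(1, n + 1)]

def qq_pairs(data):
    """Pair each order statistic y_r with its theoretical quantile, as (y_r, pi_p)."""
    ordered = sorted(data)  # order statistics y_1 <= ... <= y_n
    return list(zip(ordered, uniform_theoretical_quantiles(len(ordered))))

# Hypothetical data, used only to show the shape of the result.
sample = [0.62, 0.11, 0.87, 0.40]
pairs = qq_pairs(sample)

for y_r, pi_p in pairs:
    print(f"sample quantile {y_r:.2f}  theoretical quantile {pi_p:.2f}")
```

For a distribution other than \(U(0,1)\), only the first function would change: it would have to solve \(F(\pi_p)=p\) for that distribution's cumulative distribution function.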

Example 14-9

Consider the following set of 19 numbers generated from Minitab's \(U(0,1)\) random number generator. Do these data appear to have come from the probability model given by \(f(x)=1\) for \(0<x<1\)?

Uniform
0.790222	0.367893	0.446442	0.889043	0.227839	0.541575
0.805958	0.156496	0.157753	0.465619	0.805580	0.784926
0.288771	0.010717	0.511768	0.496895	0.076856	0.254670
0.752679

Solution

Here are the original data (the column labeled Uniform) along with their sample quantiles (the column labeled Sorted) and their theoretical quantiles (the column labeled Percent):

r	Uniform	Sorted	Percent
1	0.790222	0.010717	0.05
2	0.367893	0.076856	0.10
3	0.446442	0.156496	0.15
4	0.889043	0.157753	0.20
5	0.227839	0.227839	0.25
6	0.541575	0.254670	0.30
7	0.805958	0.288771	0.35
8	0.156496	0.367893	0.40
9	0.157753	0.446442	0.45
10	0.465619	0.465619	0.50
11	0.805580	0.496895	0.55
12	0.784926	0.511768	0.60
13	0.288771	0.541575	0.65
14	0.010717	0.752679	0.70
15	0.511768	0.784926	0.75
16	0.496895	0.790222	0.80
17	0.076856	0.805580	0.85
18	0.254670	0.805958	0.90
19	0.752679	0.889043	0.95

As might be obvious, the Sorted column is just the original data in increasing sorted order. The Percent column is determined from the \(\pi_p=p\) relationship. In a set of 19 data points, we'd expect the 1st of the 19 points to be the 1/20th or fifth percentile, we'd expect the 2nd of the 19 points to be the 2/20th or tenth percentile, and so on. Plotting the Percent column on the vertical axis (labeled Theoretical Quantile) and the Sorted column on the horizontal axis (labeled Sample Quantile), here's the resulting q-q plot:

Now, the key to interpreting q-q plots is to do it loosely! If the data points generally follow an (approximate) straight line, then go ahead and conclude that the data are consistent with the tested probability distribution. That's what we'll do here!
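For readers who want to rebuild the Sorted and Percent columns of Example 14-9 themselves, here is a short Python sketch (Python is an assumption here, since the example was produced in Minitab). The variable names are illustrative, and the data are the 19 values listed in the Uniform column:

```python
# Data from Example 14-9 (the Uniform column).
uniform = [
    0.790222, 0.367893, 0.446442, 0.889043, 0.227839, 0.541575,
    0.805958, 0.156496, 0.157753, 0.465619, 0.805580, 0.784926,
    0.288771, 0.010717, 0.511768, 0.496895, 0.076856, 0.254670,
    0.752679,
]

n = len(uniform)
sorted_values = sorted(uniform)                    # sample quantiles y_r
percent = [r / (n + 1) for r in range(1, n + 1)]   # theoretical quantiles pi_p = p

# Each row: r, original value, sample quantile, theoretical quantile.
rows = list(zip(range(1, n + 1), uniform, sorted_values, percent))
for r, original, y_r, p in rows:
    print(f"{r}\t{original:.6f}\t{y_r:.6f}\t{p:.2f}")
```

Plotting `percent` against `sorted_values` with any plotting tool gives the q-q plot described above.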

Incidentally, the theoretical mean and variance of the \(U(0,1)\) distribution are \(\dfrac{1}{2}=0.5\) and \(\dfrac{1}{12}=0.0833\), respectively. If you calculate the sample mean and sample variance of the 19 data points, you'll find that they are 0.4648 and 0.078, respectively. Not too shabby of an approximation for such a small data set.
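The sample mean and sample variance quoted above can be checked with Python's standard `statistics` module. This is a small sketch, with the data copied from Example 14-9; `statistics.variance` uses the \(n-1\) divisor, which matches the sample variance \(s^2\):

```python
import statistics

# Data from Example 14-9 (the Uniform column).
uniform = [
    0.790222, 0.367893, 0.446442, 0.889043, 0.227839, 0.541575,
    0.805958, 0.156496, 0.157753, 0.465619, 0.805580, 0.784926,
    0.288771, 0.010717, 0.511768, 0.496895, 0.076856, 0.254670,
    0.752679,
]

sample_mean = statistics.mean(uniform)          # compare with the theoretical mean 1/2
sample_variance = statistics.variance(uniform)  # n-1 divisor; compare with 1/12

print(f"sample mean     = {sample_mean:.4f}  (theoretical {1/2:.4f})")
print(f"sample variance = {sample_variance:.4f}  (theoretical {1/12:.4f})")
```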

Random Assignment to Treatment

As suggested earlier, the \(U(0,1)\) distribution can be quite useful in randomly assigning experimental units to the treatments in an experiment. First, let's review why randomization is a useful venture when conducting an experiment. Suppose we were interested in measuring how high a person could reach after "taking" an experimental treatment. It would be awfully hard to draw a strong conclusion about the effectiveness of the experimental treatment if the people in one treatment group were, to begin with, significantly taller than the people in the other treatment group. Randomly assigning people to the treatments in an experiment minimizes the chances that such important differences exist in the treatment groups. That way, if differences exist in the two groups at the conclusion of the study with respect to the primary variable of interest, we can feel confident in attributing the difference to the treatment of interest rather than to some other fundamental difference in the groups.

Okay, now let's talk about how the \(U(0,1)\) distribution can help us randomly assign the experimental units to the treatments in a completely randomized experiment. For the sake of concreteness, suppose we wanted to randomly assign 20 students to one group (those who complete a blue data collection form, say) and 20 students to a second group (those who complete a green data collection form, say). This is what the procedure might look like:

Assign the pool of 40 potential students each one number from 1 to 40. It doesn't matter how you assign these numbers.
Generate 40 \(U(0,1)\) numbers in one column of a spreadsheet. Enter the numbers 1 to 40 in a second column of a spreadsheet.
Sort the 40 \(U(0,1)\) numbers in increasing order, so that the numbers in the second column follow along during the sorting process. For example, if the 13th generated \(U(0,1)\) number was the smallest number generated, then the number 13 should appear, after sorting, in the first row of the second column. If the 24th generated \(U(0,1)\) number was the second smallest number generated, the number 24 should appear, after sorting, in the second row of the second column. And so on.
The students whose numbers appear in the first 20 rows of the second column should be assigned to complete the blue data collection form. The students whose numbers appear in the second 20 rows of the second column should be assigned to complete the green data collection form.
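The spreadsheet procedure above translates almost line for line into code. Here is a minimal Python sketch, in which the function name `assign_two_groups` and the seed value are hypothetical choices made for illustration:

```python
import random

def assign_two_groups(n_units, first_group_size, seed=None):
    """Randomly split units numbered 1..n_units into two groups.

    Mirrors the spreadsheet steps: pair each unit number with a U(0,1)
    draw, sort by the draw, and cut the sorted list at first_group_size.
    """
    rng = random.Random(seed)
    # "Column 1" is the U(0,1) draw, "column 2" is the unit number.
    table = [(rng.random(), unit) for unit in range(1, n_units + 1)]
    table.sort()  # sort by the draw; the unit numbers follow along
    ordered_units = [unit for _, unit in table]
    return ordered_units[:first_group_size], ordered_units[first_group_size:]

# 40 students, 20 per form; the seed is an arbitrary illustrative value.
blue, green = assign_two_groups(n_units=40, first_group_size=20, seed=2024)
print("blue form: ", sorted(blue))
print("green form:", sorted(green))
```

Because every ordering of the 40 draws is equally likely, every split of the students into two groups of 20 is equally likely as well.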

One semester, I conducted the above experiment exactly as described. Twenty students were randomly assigned to complete a blue version of the following form, and the remaining twenty students were randomly assigned to complete a green version of the form:

Data Collection Form
1. What is your gender?	Male ___	Female ___
2. What is your current cumulative grade point average?	___.___ ___
3. What is your height?	___ ___.___ inches	(or ___ ___ ___.___ centimeters)
4. What is your weight?	___ ___ ___.___ pounds	(or ___ ___ ___.___ kilograms)
5. Excluding class time, how much do you study for this course?	___ ___ hours/week
6. What is your current exam average in this class?	___ ___ ___
7. Are you a native resident of Pennsylvania?	Yes ___	No ___
8. How many credits are you taking this semester?	___ ___ credits

After administering the forms to the 40 students, here's what the resulting data looked like:

ROW	COLOR	GENDER	GPA	HEIGHT	WEIGHT	HOURS	EXAM	NATIVE	CREDITS
1	blue	1	3.44	68.00	140.0	2.0	85.0	1	18.0
2	blue	1	3.90	71.00	210.0	1.0	98.5	0	18.0
3	blue	2	*	68.00	200.0	10.0	83.0	0	9.0
4	blue	2	*	67.00	185.0	4.0	100.0	1	10.0
5	blue	1	3.82	66.00	143.0	3.0	98.0	0	9.0
6	blue	2	3.66	66.00	140.0	5.0	91.5	0	17.0
7	blue	2	2.98	66.00	135.0	4.0	71.0	1	16.0
8	blue	2	3.67	67.00	118.0	5.0	65.0	1	18.0
9	blue	1	3.15	69.50	180.5	10.0	61.0	1	13.0
10	blue	1	3.29	72.00	120.0	2.0	61.0	1	17.0
11	blue	1	4.00	69.00	175.0	4.0	95.0	0	3.0
12	blue	2	2.46	69.70	181.0	7.0	59.0	0	16.0
13	blue	2	2.97	52.00	105.0	6.0	62.0	0	15.0
14	blue	1	*	68.80	136.5	2.0	95.0	0	12.0
15	blue	1	3.50	70.00	215.0	1.0	87.5	1	15.0
16	blue	2	3.61	65.00	135.0	9.0	89.0	0	12.0
17	blue	2	3.24	64.00	148.0	10.0	65.0	1	16.0
18	blue	2	3.41	65.50	126.0	5.0	70.0	1	15.0
19	blue	2	3.40	65.00	115.0	2.0	83.0	1	17.0
20	blue	1	3.80	68.00	236.0	6.0	87.5	1	18.0
21	blue	2	2.95	65.00	140.0	6.0	79.0	1	15.0
22	green	1	3.80	71.50	145.0	3.0	77.5	1	18.0
23	green	2	4.00	62.00	185.0	6.0	98.0	1	10.0
24	green	2	*	60.00	100.0	8.0	95.0	0	10.0
25	green	2	3.65	62.00	150.0	4.0	85.0	1	15.0
26	green	1	3.54	71.75	160.0	0.5	83.0	1	6.0
27	green	2	*	65.00	113.0	7.0	97.5	0	18.0
28	green	1	3.63	69.50	155.0	10.0	80.0	1	18.0
29	green	2	3.90	65.20	110.0	1.0	96.0	0	9.0
30	green	1	*	76.00	165.0	3.0	88.0	1	15.0
31	green	1	3.80	70.40	143.0	1.0	100.0	0	9.0
32	green	2	3.19	64.50	140.0	10.0	65.0	1	16.0
33	green	1	2.97	71.00	140.0	5.0	72.0	1	17.5
34	green	1	2.87	65.50	160.3	7.0	45.0	1	13.0
35	green	1	3.43	60.70	160.0	12.0	90.0	0	12.0
36	green	1	3.61	66.00	126.0	18.0	88.0	0	13.0
37	green	1	*	66.00	120.0	7.0	79.0	1	15.0
38	green	1	2.97	69.70	185.5	6.0	83.0	1	13.0
39	green	1	3.26	70.00	152.0	3.0	70.5	1	13.0
40	green	1	3.30	72.00	190.0	2.5	81.0	1	17.0

And, here's a portion of a basic descriptive analysis of six of the variables on the form:

VARIABLE	COLOR	N	N*	MEAN	MEDIAN	TrMean
GPA	blue	18	3	3.4028	3.4250	3.4244
	green	15	4	3.4613	3.5400	3.4654
HEIGHT	blue	21	0	66.786	67.000	67.289
	green	19	0	67.30	66.00	67.22
WEIGHT	blue	21	0	156.38	140.00	154.89
	green	19	0	147.36	150.00	147.64
HOURS	blue	21	0	4.952	5.000	4.895
	green	19	0	5.526	6.000	5.441
EXAM	blue	21	0	80.29	83.00	80.37
	green	19	0	82.82	83.00	84.03
CREDITS	blue	21	0	14.238	15.000	14.632
	green	19	0	13.553	13.000	13.735

VARIABLE	COLOR	StDev	SE Mean	MINIMUM	MAXIMUM	Q1
GPA	blue	0.3978	0.0938	2.4600	4.0000	3.1075
	green	0.3564	0.0920	2.8700	4.0000	3.1900
HEIGHT	blue	4.025	0.878	52.000	72.000	65.250
	green	4.43	1.02	60.00	76.00	64.50
WEIGHT	blue	36.97	8.07	105.00	236.00	130.50
	green	25.53	5.86	100.00	190.00	126.00
HOURS	blue	2.941	0.642	1.000	10.000	2.000
	green	3.251	0.746	0.500	12.000	3.000
EXAM	blue	14.5	3.09	59.00	100.00	65.00
	green	13.42	3.08	45.00	100.00	77.50
CREDITS	blue	3.885	0.848	3.000	18.000	12.000
	green	3.547	0.814	6.000	18.000	10.000

VARIABLE	COLOR	Q3
GPA	blue	3.7025
	green	3.8000
HEIGHT	blue	69.250
	green	71.00
WEIGHT	blue	183.00
	green	160.30
HOURS	blue	6.500
	green	8.000
EXAM	blue	93.25
	green	95.00
CREDITS	blue	17.000
	green	17.000

The analysis suggests that my randomization worked quite well. For example, the mean grade-point average of those students completing the blue form was 3.40, while the mean g.p.a. for those students completing the green form was 3.46. And, the mean height of those students completing the blue form was 66.8 inches, while the mean height for those students completing the green form was 67.3 inches. The two groups appear to be similar, on average, with respect to the other collected data as well. It should be noted that there is no guarantee that any particular randomization will be as successful as the one I illustrated here. The only thing that the randomization ensures is that the chance that the groups will differ with respect to key measurements will be small.
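As a quick check on the grade-point averages quoted above, the group means can be recomputed directly from the GPA column of the data table. The sketch below is in Python rather than the software used for the original output, and it records the missing values (shown as `*` in the table) as `None`:

```python
import statistics

# GPA column from the data table; None marks a missing value ("*").
blue_gpa = [3.44, 3.90, None, None, 3.82, 3.66, 2.98, 3.67, 3.15, 3.29, 4.00,
            2.46, 2.97, None, 3.50, 3.61, 3.24, 3.41, 3.40, 3.80, 2.95]
green_gpa = [3.80, 4.00, None, 3.65, 3.54, None, 3.63, 3.90, None, 3.80,
             3.19, 2.97, 2.87, 3.43, 3.61, None, 2.97, 3.26, 3.30]

def mean_ignoring_missing(values):
    """Return (count of observed values, mean of observed values)."""
    observed = [v for v in values if v is not None]
    return len(observed), statistics.mean(observed)

blue_n, blue_mean = mean_ignoring_missing(blue_gpa)
green_n, green_mean = mean_ignoring_missing(green_gpa)

print(f"blue:  N = {blue_n}, mean GPA = {blue_mean:.4f}")
print(f"green: N = {green_n}, mean GPA = {green_mean:.4f}")
print(f"difference in means = {green_mean - blue_mean:.4f}")
```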

Random Selection for Participation in a Survey

Just as you should always randomly assign experimental units to treatments when conducting an experiment, you should always randomly select your participants when conducting a survey. If you don't, you might very well end up with biased survey results. (The people who choose to take the time to complete a survey in a magazine or on a web site typically have quite strong opinions!) The procedure we can use to randomly select participants for a survey is quite similar to that used for randomly assigning people to treatments in a completely randomized experiment. This is what the procedure would look like if you wanted to randomly select, say, 1000 students to participate in a survey from a potential pool of, say, 40000 students:

Assign the pool of 40000 potential participants each one number from 1 to 40000. It doesn't matter how you assign these numbers.
Generate 40000 \(U(0,1)\) numbers in one column of a spreadsheet. Enter the numbers 1 to 40000 in a second column of a spreadsheet.
Sort the 40000 \(U(0,1)\) numbers in increasing order, so that the numbers in the second column follow along during the sorting process. For example, if the 23rd generated \(U(0,1)\) number was the smallest number generated, the number 23 should appear, after sorting, in the first row of the second column. If the 102nd generated \(U(0,1)\) number was the second smallest number generated, the number 102 should appear, after sorting, in the second row of the second column. And so on.
The students whose numbers appear in the first 1000 rows of the second column should be selected to participate in the survey.

Following the procedure as described, the 1000 selected students represent a random sample from the population of 40000 students.
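The same sort-by-uniform idea can be sketched in Python for the survey selection. The function name `select_sample` and the seed are hypothetical, and the code assumes the pool has already been numbered 1 to 40000:

```python
import random

def select_sample(pool_size, sample_size, seed=None):
    """Select sample_size distinct numbers from 1..pool_size.

    Mirrors the spreadsheet steps: attach a U(0,1) draw to every number,
    sort by the draw, and keep the numbers in the first sample_size rows.
    """
    rng = random.Random(seed)
    table = [(rng.random(), number) for number in range(1, pool_size + 1)]
    table.sort()  # sort by the draw; the participant numbers follow along
    return [number for _, number in table[:sample_size]]

# 1000 students out of 40000; the seed is an arbitrary illustrative value.
selected = select_sample(pool_size=40000, sample_size=1000, seed=7)
print(len(selected), "students selected; first few:", selected[:5])
```

In practice, Python's standard library also offers `random.sample(range(1, 40001), 1000)`, which draws without replacement in a single call. The sort-based version is shown here only because it follows the steps listed above.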