34.3 - Stratified Random Sampling

In the two previous sections, we took random samples from a data set without regard to whether an observation comes from a particular subgroup. When you are conducting a survey, it often behooves you to make sure that your sample contains a certain number of observations from each subgroup. We'll concern ourselves with such a restriction here. That is, in this section, we'll focus on ways of using SAS to obtain a stratified random sample, in which a subset of observations is selected randomly from each subgroup of observations as determined by the value of a stratifying variable. We'll also go back to sampling without replacement, in which an observation, once selected, cannot be selected again.

Selecting a Stratified Sample of Equal-Sized Groups

We'll first focus on the situation in which an equal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable.

Example 34.9

The following code illustrates how to select a stratified random sample of equal-sized groups. Specifically, the code tells SAS to randomly select 5 observations from each of the three subgroups — State College, Port Matilda, Bellefonte — as determined by the value of the variable city:

PROC FREQ data=stat482.mailing;
	table city/out=bycount noprint;
RUN;

PROC SORT data=stat482.mailing;
	by city;
RUN;

DATA sample5;
	merge stat482.mailing bycount (drop = percent);
	by city;
	retain k;
	if first.city then k=5;
	random = ranuni(109);
	propn = k/count;
	if random le propn then
	do;
		output;
		k=k-1;
	end;
	count=count-1;
RUN;

PROC PRINT data=bycount;
	title 'Count by CITY';
RUN;

PROC PRINT data=sample5;
	title 'Sample5: Stratified Random Sample with Equal-Sized Strata';
	by city;
RUN;

Count by CITY
Obs	City	COUNT	PERCENT
1	Bellefonte	15	30
2	Port Matilda	13	26
3	State College	22	44

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.
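
One way to make that check without counting rows by eye is to tabulate the sample by city. The following is a minimal sketch, assuming the sample5 data set created above exists in the WORK library. If the program ran as intended, each city should show a frequency of 5.

```sas
/* Sketch: count the sampled observations in each stratum. */
/* Assumes sample5 was created by the DATA step above.     */
PROC FREQ data=sample5;
	table city / nocum nopercent;
	title 'Sample5: Number of Sampled Observations by CITY';
RUN;
```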

Now, how does the program work? In order to select a stratified random sample in SAS, we basically use code similar to selecting equal-sized random samples without replacement, except now we process within each subgroup. More specifically, here's how the program works step-by-step:

The sole purpose of the FREQ procedure is to determine the number of observations in the stat482.mailing data set that correspond to each level of the stratification variable city (hence, "table city"). The OUT = option tells SAS to create a data set called bycount that contains the variable city and two variables that contain the number (count) and percentage (percent) of records for each level of city.
The SORT procedure merely sorts the stat482.mailing data set by city so that it can be processed by city in the next DATA step. Because the PROC SORT statement has no OUT= option, the sorted observations replace the original data set rather than being stored in a separate temporary data set.
Merge, by city, the sorted stat482.mailing data set with the bycount data set, so that the number of observations per subgroup is available. Since the percentage of observations is not needed, drop it from the data set on input.
The rest of the code in the DATA step should look very familiar. That is, once the number of observations per subgroup in the original stat482.mailing data set is available, you can randomly select records from the subgroup as you would select equal-sized random samples without replacement, except you select within city (hence, "by city"). Every time SAS reads in a new city (hence, "if first.city"), the number of observations still needed in the subgroup's sample (k) is set to the number of observations desired in each of the subgroups (5, here).

Note that, again, the random = ranuni( ) and propn = k/count assignments are made here only so their values can be printed. In another situation, these values would be incorporated directly in the IF statement: if ranuni( ) le k/count then do; Additionally, k and count could be dropped from the output data set, but are kept here so their values can be printed for educational purposes.
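
As a sketch of what that more compact form might look like, the DATA step below folds the random number and the proportion directly into the IF statement and drops the two bookkeeping variables. The output data set name sample5b is hypothetical, and the step assumes that stat482.mailing has already been sorted by city and that bycount has been created by the FREQ procedure above.

```sas
/* Sketch only: sample5b is a hypothetical name.                        */
/* Assumes stat482.mailing is sorted by city and bycount already exists. */
DATA sample5b (drop = k count);
	merge stat482.mailing bycount (drop = percent);
	by city;
	retain k;
	if first.city then k = 5;          /* observations still needed in this city */
	if ranuni(109) le k/count then     /* select with probability k/count        */
	do;
		output;
		k = k - 1;                     /* one fewer observation still needed     */
	end;
	count = count - 1;                 /* one fewer observation left to consider */
RUN;
```

With the same seed and the same input order, this step should pick the same observations as sample5, since only the printing variables have been removed.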

Example 34.10

The following code illustrates an alternative way of selecting a stratified random sample of equal-sized groups. The code, while less efficient because it requires that the data set be processed twice and sorted once, may feel more natural and intuitive to you:

DATA scollege pmatilda bellefnt;
	set stat482.mailing;
	if city = 'State College' then output scollege;
	else if city = 'Port Matilda'  then output pmatilda;
	else if city = 'Bellefonte'    then output bellefnt;
RUN;
 
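%MACRO select (dsn, num);
	/* attach a uniform random number to every observation */
	DATA &dsn;
		set &dsn;
		random=ranuni(85329);
	RUN;
	/* put the observations in random order */
	PROC SORT data=&dsn;
		by random;
	RUN;
	/* keep only the first &num observations */
	DATA &dsn;
		set &dsn;
		if _N_ le &num;
	RUN;
%MEND select;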
 
%SELECT(scollege, 5);  %SELECT(pmatilda, 5);  %SELECT(bellefnt, 5); 
 
DATA sample6A;
	set bellefnt pmatilda scollege;
RUN;
 
PROC PRINT data=sample6A;
	title 'Sample6A: Stratified Random Sample with Equal-Sized Strata';
	by city;
RUN;

Sample6A: Stratified Random Sample with Equal-Sized Strata

F4=Bellefonte

Obs	Num	Name	Street	State	random
1	10	Laura Mills	704 Hill Street	PA	0.05728
2	4	Mark Adams	312 Oak Lane	PA	0.22701
3	13	James Whitney	104 Pine Hill Drive	PA	0.28315
4	12	Fran Cipolla	912 Cardinal Drive	PA	0.34773
5	5	Lisa Brothers	89 Elm Street	PA	0.42637
6	6	Delilah Fequa	2094 Acorn Street	PA	0.46698
7	3	Jim Jefferson	10101 Allegheny Street	PA	0.60821
8	11	Linda Bentlager	1010 Tricia Lane	PA	0.63431
9	1	Jonathon Smothers	103 Oak Lane	PA	0.67112
10	2	Jane Doe	845 Main Street	PA	0.70002
11	8	Mamie Davison	102 Cherry Avenue	PA	0.72302
12	14	William Edwards	79 Oak Lane	PA	0.79275
13	7	John Doe	812 Main Street	PA	0.86987
14	9	Ernest Smith	492 Main Street	PA	0.87446
15	15	Harold Harvey	480 Main Street	PA	0.88875

F4=Port Matilda

Obs	Num	Name	Street	State	random
16	47	Barb Wyse	21 Cleveland Drive	PA	0.05728
17	41	Lou Barr	219 Eagle Street	PA	0.22701
18	50	George Matre	75 Ashwind Drive	PA	0.28315
19	49	Tim Winters	95 Dove Street	PA	0.34773
20	42	Casey Spears	123 Main Street	PA	0.42637
21	43	Leslie Olin	487 Bluebird Haven	PA	0.46698
22	40	Jane Smiley	298 Cardinal Drive	PA	0.60821
23	48	Coach Pierce	74 Main Street	PA	0.63431
24	38	Miriam Denders	2348 Robin Avenue	PA	0.67112
25	39	Scott Fitzgerald	43 Blue Jay Drive	PA	0.70002
26	45	Ann Draper	72 Lake Road	PA	0.72302
27	44	Edwin Hoch	389 Dolphin Drive	PA	0.86987
28	46	Linda Nicolson	71 Liberty Terrace	PA	0.87446

F4=State College

Obs	Num	Name	Street	State	random
29	33	Steve Ignella	367 Whitehall Road	PA	0.00548
30	25	Steve Lindhoff	130 E. College Avenue	PA	0.05728
31	19	Frank Smith	238 Waupelani Drive	PA	0.22701
32	31	Robert Williams	156 Straford Drive	PA	0.26377
33	28	Srabashi Kundu	112 E. Beaver Avenue	PA	0.28315
34	27	Lucy Arnets	345 E. College Avenue	PA	0.34773
35	34	Mike Dahlberg	1201 No. Atherton	PA	0.36894
36	20	Kristin Jones	120 Stratford Drive	PA	0.42637
37	21	Amy Kuntz	357 Park Avenue	PA	0.46698
38	36	Daniel Fremgen	103 W. College Avenue	PA	0.53660
39	18	Ade Fequa	803 Allen Street	PA	0.60821
40	26	Jan Davison	201 E. Beaver Avenue	PA	0.63431
41	35	Doris Alcorn	453 Fraser Street	PA	0.66005
42	16	Linda Edmonds	410 College Avenue	PA	0.67112
43	32	George Ball	888 Park Avenue	PA	0.69135
44	17	Rigna Patel	101 Beaver Avenue	PA	0.70002
45	23	Greg Pope	5100 No. Atherton	PA	0.72302
46	37	Scott Henderson	245 W. Beaver Avenue	PA	0.72795
47	29	Joe White	678 S. Allen Street	PA	0.79275
48	22	Roberta Kudla	312 Whitehall Road	PA	0.86987
49	24	Mark Mendel	256 Fraser Street	PA	0.87446
50	30	Daniel Peterson	328 Waupelani Drive	PA	0.88875

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, how does the program work? In summary, this approach proceeds as follows:

The first DATA step uses an IF-THEN-ELSE statement in conjunction with OUTPUT statements to divide the original mailing data set up into several data sets based on the value of city. (Here, we create three data sets, one for each city... namely, scollege, pmatilda, and bellefnt.)
Then, the macro select exactly mimics the creation of the sample3A data set in Example 10.5 on the Random Sampling Without Replacement page. That is, the macro generates a random number for each observation, the data set is sorted by the random number, and then the first num observations are selected.
Then, call the macro select three times, once for each of the city data sets (scollege, pmatilda, and bellefnt), selecting five observations from each.
Finally, the last DATA step concatenates the three data sets bellefnt, pmatilda, and scollege, with 5 observations each, back into one data set called sample6A containing the 15 randomly selected observations.
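
The same shuffle-and-take-the-first-few idea can also be sketched without splitting the data into one data set per city. The variant below is an illustration, not the program described above: the data set names shuffled and sample6a_alt and the counter variable rank are hypothetical. It sorts by city and then by the random number, and keeps the first five observations within each city.

```sas
/* Sketch only: shuffled, sample6a_alt, and rank are hypothetical names. */
DATA shuffled;
	set stat482.mailing;
	random = ranuni(85329);            /* one random number per observation */
RUN;

PROC SORT data=shuffled;
	by city random;                    /* random order within each city     */
RUN;

DATA sample6a_alt (drop = rank);
	set shuffled;
	by city;
	if first.city then rank = 0;       /* restart the counter for each city */
	rank + 1;
	if rank le 5;                      /* keep the first five in each city  */
RUN;
```

Because all of the random numbers come from one stream here, the three cities no longer share the same sequence of random values, and a fourth city would require no change to the code.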

Lo and behold, when all is said and done, we have another stratified random sample of equal-sized groups! Approach #2 checked off. Now, onto one last approach!

Example 34.11

The following code illustrates yet another way of selecting a stratified random sample of equal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly five observations from each of the three city subgroups in the permanent SAS data set stat482.mailing:

PROC SURVEYSELECT data = stat482.mailing
		out = sample6B
			method = SRS
			seed = 12345678
			sampsize = (5 5 5);
	strata city notsorted;
	title;
RUN;
 
PROC PRINT data = sample6B;
	title1 'Sample6B: Stratified Random Sample';
	title2 'with Equal-Sized Strata (using PROC SURVEYSELECT)';
RUN;

The SURVEYSELECT Procedure

Selection Method	Simple Random Sampling
Strata Variable	City

Input Data Set	MAILING
Random Number Seed	12345678
Number of Strata	3
Total Sample Size	15
Output Data Set	SAMPLE6B

Sample6B: Stratified Random Sample
with Equal-Sized Strata (using PROC SURVEYSELECT)
Obs	City	Num	Name	Street	State	SelectionProb	SamplingWeight
1	Bellefonte	5	Lisa Brothers	89 Elm Street	PA	0.33333	3.0
2	Bellefonte	7	John Doe	812 Main Street	PA	0.33333	3.0
3	Bellefonte	8	Mamie Davison	102 Cherry Avenue	PA	0.33333	3.0
4	Bellefonte	11	Linda Bentlager	1010 Tricia Lane	PA	0.33333	3.0
5	Bellefonte	15	Harold Harvey	480 Main Street	PA	0.33333	3.0
6	Port Matilda	41	Lou Barr	219 Eagle Street	PA	0.38462	2.6
7	Port Matilda	42	Casey Spears	123 Main Street	PA	0.38462	2.6
8	Port Matilda	44	Edwin Hoch	389 Dolphin Drive	PA	0.38462	2.6
9	Port Matilda	48	Coach Pierce	74 Main Street	PA	0.38462	2.6
10	Port Matilda	50	George Matre	75 Ashwind Drive	PA	0.38462	2.6
11	State College	20	Kristin Jones	120 Stratford Drive	PA	0.22727	4.4
12	State College	30	Daniel Peterson	328 Waupelani Drive	PA	0.22727	4.4
13	State College	32	George Ball	888 Park Avenue	PA	0.22727	4.4
14	State College	35	Doris Alcorn	453 Fraser Street	PA	0.22727	4.4
15	State College	37	Scott Henderson	245 W. Beaver Avenue	PA	0.22727	4.4

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, the specifics about the code. The only things that should look new here are the STRATA statement and the form of the SAMPSIZE= option. The STRATA statement tells SAS to partition the input data set stat482.mailing into nonoverlapping groups defined by the variable city. The NOTSORTED option does not tell SAS that the data set is unsorted. Instead, the NOTSORTED option tells SAS that the observations in the data set are arranged in city groups, but the groups are not necessarily in alphabetical order. The SAMPSIZE= option, which here lists one value per stratum in parentheses, tells SAS that we are interested in sampling five observations from each of the city groups.
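
When every stratum gets the same sample size, the list of values can be shortened. The sketch below relies on the SURVEYSELECT behavior that a single SAMPSIZE= value is applied to each stratum, and it sorts the data first so that NOTSORTED is not needed. The data set names mailing_by_city and sample6b_alt are hypothetical.

```sas
/* Sketch only: mailing_by_city and sample6b_alt are hypothetical names. */
PROC SORT data=stat482.mailing out=mailing_by_city;
	by city;
RUN;

PROC SURVEYSELECT data = mailing_by_city
		out = sample6b_alt
		method = SRS
		seed = 12345678
		sampsize = 5;                  /* one value: applied to every stratum */
	strata city;                       /* input is sorted, so no NOTSORTED    */
RUN;
```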

Selecting a Stratified Sample of Unequal-Sized Groups

Now, we'll focus on the situation in which an unequal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable. If there are an unequal number of observations for each subgroup in the original data set, this sampling scheme may be accomplished by selecting the same proportion of observations from each subgroup. Again, we'll sample without replacement, in which once an observation is selected it cannot be re-selected.

To select a stratified random sample of unequal-sized groups, we could use the code from Example 34.10 by passing the different group sample sizes into the macro select. Alternatively, we could create a data set containing two count variables: one that contains the number of observations in each subgroup (n), and another that contains the number of observations that need to be selected from each subgroup (k). Once the data set is created, we could merge it with the original data set, and select observations randomly as we have done previously for random samples without replacement. That's the strategy that the following example uses.
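
The first of those two options might be sketched as follows. The macro name select_n and the output data set name sample7_alt are hypothetical, and the sketch assumes that the three city data sets bellefnt, pmatilda, and scollege have already been created as in Example 34.10. The essential point is that the macro's second parameter is used as the cutoff, so each call can request a different sample size.

```sas
/* Sketch only: select_n and sample7_alt are hypothetical names.        */
/* Assumes bellefnt, pmatilda, and scollege already exist, one per city. */
%MACRO select_n (dsn, num);
	DATA &dsn;
		set &dsn;
		random = ranuni(85329);
	RUN;
	PROC SORT data=&dsn;
		by random;
	RUN;
	DATA &dsn;
		set &dsn;
		if _N_ le &num;                /* keep the first &num observations */
	RUN;
%MEND select_n;

%SELECT_N(bellefnt, 5);  %SELECT_N(pmatilda, 6);  %SELECT_N(scollege, 8);

DATA sample7_alt;
	set bellefnt pmatilda scollege;
RUN;
```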

Example 34.12

The following code illustrates how to select a stratified random sample of unequal-sized groups. Specifically, the code tells SAS to randomly select 5, 6, and 8 observations, respectively, from each of the three subgroups — Bellefonte, Port Matilda, and State College — as determined by the value of the variable city:

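DATA nselect;
	set stat482.mailing (keep = city);
	by city;
	n+1;                /* count the observations within the current city */
	if last.city;       /* continue only at the last record of each city  */
	input k;            /* read the sample size for this city             */
	output;
	n=0;                /* reset the counter for the next city            */
	DATALINES;
5
6
8
;
RUN;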
 
DATA sample7 (drop = k n);
	merge stat482.mailing nselect;
	by city;
	if ranuni(7841) le k/n then
	do;
		output;
		k=k-1;
	end;
	n=n-1;
RUN;

PROC PRINT data=nselect;
	title 'NSELECT: Count by CITY';
RUN;

PROC PRINT data=sample7;
	title 'Sample7: Stratified Random Sample of Unequal-Sized Groups';
	by city;
RUN;

NSELECT: Count by CITY

Obs	City	n	k
1	Bellefonte	15	5
2	Port Matilda	13	6
3	State College	22	8

Sample7: Stratified Random Sample of Unequal-Sized Groups

F4=Bellefonte

Obs	Num	Name	Street	State
1	1	Jonathon Smothers	103 Oak Lane	PA
2	3	Jim Jefferson	10101 Allegheny Street	PA
3	6	Delilah Fequa	2094 Acorn Street	PA
4	11	Linda Bentlager	1010 Tricia Lane	PA
5	15	Harold Harvey	480 Main Street	PA

F4=Port Matilda

Obs	Num	Name	Street	State
6	38	Miriam Denders	2348 Robin Avenue	PA
7	42	Casey Spears	123 Main Street	PA
8	43	Leslie Olin	487 Bluebird Haven	PA
9	46	Linda Nicolson	71 Liberty Terrace	PA
10	48	Coach Pierce	74 Main Street	PA
11	49	Tim Winters	95 Dove Street	PA

F4=State College

Obs	Num	Name	Street	State
12	16	Linda Edmonds	410 College Avenue	PA
13	17	Rigna Patel	101 Beaver Avenue	PA
14	18	Ade Fequa	803 Allen Street	PA
15	21	Amy Kuntz	357 Park Avenue	PA
16	24	Mark Mendel	256 Fraser Street	PA
17	26	Jan Davison	201 E. Beaver Avenue	PA
18	34	Mike Dahlberg	1201 No. Atherton	PA
19	35	Doris Alcorn	453 Fraser Street	PA

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.

Now, how does the program work? The key to understanding the program is to understand the first DATA step. The remainder of the program is much like code we've seen before, such as that in Example 10.4, in which a random sample is selected without replacement. Now, the first DATA step creates a temporary data set called nselect that contains three variables, city, n, and k:

To count the number of observations n from each city, we use a counter variable n in conjunction with the last.city variable. Because n appears in a sum statement (n+1), SAS initializes n to 0 before the first iteration of the DATA step and retains its value from one iteration to the next, increasing n by 1 on each iteration until it has counted the number of observations for one of the levels of city.
To tell SAS the number of observations to select from each city, we use an INPUT statement in conjunction with a DATALINES statement. The numbers k are listed in the sorted order of city, so here we tell SAS we want to randomly select 5 observations from Bellefonte, 6 observations from Port Matilda, and 8 observations from State College.
To write the numbers n and k to the new data set nselect, we use the last.city variable in a subsetting IF statement. So here, when SAS finds the last record within a city subgroup, n and k are written to the nselect data set, and n is reset to 0 in preparation for counting the number of observations for the next city in the data set.

The second DATA step creates a temporary data set called sample7 by merging the stat482.mailing data set with the nselect data set. After merging, the code then randomly selects the desired number of observations from each city just as we did previously for random samples without replacement.
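
Because the values of k in the DATALINES block are matched to cities purely by position, one possible alternative is to tie each k to its city by name. The sketch below is an illustration under assumptions: the data set names citycount and nselect_alt are hypothetical, and it reuses the PROC FREQ counting idea from Example 34.9 rather than the counter in the program above.

```sas
/* Sketch only: citycount and nselect_alt are hypothetical names. */
PROC FREQ data=stat482.mailing;
	table city / out=citycount (drop=percent rename=(count=n)) noprint;
RUN;

DATA nselect_alt;
	set citycount;
	select (city);                     /* assign k by city name, not by position */
		when ('Bellefonte')    k = 5;
		when ('Port Matilda')  k = 6;
		when ('State College') k = 8;
		otherwise              k = 0;  /* select nothing from an unexpected city */
	end;
RUN;
```

The resulting data set has the same three variables (city, n, and k) as nselect, so it could be merged with stat482.mailing in the same way.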

Example 34.13

The following code illustrates an alternative way of selecting a stratified random sample of unequal-sized groups. In selecting such a sample, rather than specifying the desired number sampled from each group, we could tell SAS to select an equal proportion of observations from each group. The following code does just that. Specifically, the code tells SAS to randomly select 25% of the observations from each of the three subgroups (Bellefonte, Port Matilda, and State College):

DATA nselect2;
	set stat482.mailing (keep=city);
	by city;
	n+1;
	if last.city;
	k=ceil(0.25*n);
	output;
	n=0;
RUN;
		 
DATA sample8 (drop = k n);
	merge stat482.mailing nselect2;
	by city;
	if ranuni(7841) le k/n then
		do;
			output;
			k=k-1;
		end;
	n=n-1;
RUN;
		 
PROC PRINT data=nselect2;
	title 'NSELECT2: Count by CITY';
RUN;
		 
PROC PRINT data=sample8;
	title 'Sample8: Stratified Random Sample of Unequal-Sized Groups';
RUN;

NSELECT2: Count by CITY
Obs	City	n	k
1	Bellefonte	15	4
2	Port Matilda	13	4
3	State College	22	6

Sample8: Stratified Random Sample of Unequal-Sized Groups
Obs	Num	Name	Street	City	State
1	3	Jim Jefferson	10101 Allegheny Street	Bellefonte	PA
2	6	Delilah Fequa	2094 Acorn Street	Bellefonte	PA
3	11	Linda Bentlager	1010 Tricia Lane	Bellefonte	PA
4	15	Harold Harvey	480 Main Street	Bellefonte	PA
5	38	Miriam Denders	2348 Robin Avenue	Port Matilda	PA
6	42	Casey Spears	123 Main Street	Port Matilda	PA
7	46	Linda Nicolson	71 Liberty Terrace	Port Matilda	PA
8	49	Tim Winters	95 Dove Street	Port Matilda	PA
9	16	Linda Edmonds	410 College Avenue	State College	PA
10	17	Rigna Patel	101 Beaver Avenue	State College	PA
11	18	Ade Fequa	803 Allen Street	State College	PA
12	24	Mark Mendel	256 Fraser Street	State College	PA
13	26	Jan Davison	201 E. Beaver Avenue	State College	PA
14	35	Doris Alcorn	453 Fraser Street	State College	PA

In this case, it probably makes most sense to first compare the code here with the code from the previous example. The only difference you should see is that rather than using INPUT and DATALINES statements to read in the number of observations, k, to be selected from each of the subgroups, here we use the ceiling function, ceil( ), to determine k. Specifically, k is calculated using:

k=ceil(0.25*n);

Now, if you think about it, if I tell you to select 25% of the n = 16 observations in a subgroup, you'd tell me that we need to select 4 observations. But what if I tell you to select 25% of the n = 15 observations in a subgroup? If you calculate 25% of 15, you get 3.75. Hmmm.... how can you select 3.75 observations? That's where the ceiling function comes into play. The ceiling function, ceil(argument), returns the smallest integer that is greater than or equal to the argument. So, for example, ceil(3.75) equals 4 ... as does of course ceil(3.1), ceil(3.25), and ceil(3.87) ...you get the idea.
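
To see the rounding for yourself, a small DATA _NULL_ step can write the values to the log. This is only an illustrative sketch: the variable name exact is hypothetical, and the subgroup sizes are the three city counts from this lesson plus the n = 16 case mentioned above. You should find k = 4 for n = 13, 15, and 16, and k = 6 for n = 22.

```sas
/* Sketch only: writes 25% of n, and its ceiling, to the SAS log. */
DATA _NULL_;
	do n = 13, 15, 16, 22;
		exact = 0.25*n;                /* 3.25, 3.75, 4, and 5.5 */
		k = ceil(exact);               /* rounded up to 4, 4, 4, and 6 */
		put n= exact= k=;
	end;
RUN;
```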

That's it ... that's all there is to it. Once k is determined using the ceiling function, the rest of the program is identical to the program in the previous example.

Now, try it out... launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, 25% of the observations from Bellefonte, Port Matilda, and State College. In this case, that translates to 4, 4, and 6 observations, respectively.
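
For comparison, the same kind of proportional allocation can be requested from the SURVEYSELECT procedure with its SAMPRATE= option. The following is a sketch only: the output data set name sample8_alt is hypothetical, and the rounding behavior is an assumption worth checking, namely that for simple random sampling the procedure rounds each stratum's sample size up, much as ceil( ) does. Compare the Total Sample Size line in the procedure output with the 4 + 4 + 6 = 14 observations expected here.

```sas
/* Sketch only: sample8_alt is a hypothetical name.             */
/* Assumption: stratum sample sizes are rounded up, like ceil(). */
PROC SURVEYSELECT data = stat482.mailing
		out = sample8_alt
		method = SRS
		seed = 12345678
		samprate = 0.25;               /* 25% of each stratum */
	strata city notsorted;
RUN;
```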

Example 34.14

The following code illustrates yet another way of selecting a stratified random sample of unequal-sized groups. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 5, 6, and 8 observations, respectively, from each of the three city subgroups in the permanent SAS data set stat482.mailing:

PROC SURVEYSELECT data = stat482.mailing
		out = sample9
			method = SRS
			seed = 12345678
			sampsize = (5 6 8);
	strata city notsorted;
	title;
RUN;

PROC PRINT data = sample9;
	title1 'Sample9: Stratified Random Sample';
	title2 'with Unequal-Sized Strata (using PROC SURVEYSELECT)';
RUN;

The SURVEYSELECT Procedure

Selection Method	Simple Random Sampling
Strata Variable	City

Input Data Set	MAILING
Random Number Seed	12345678
Number of Strata	3
Total Sample Size	19
Output Data Set	SAMPLE9

Sample9: Stratified Random Sample
with Unequal-Sized Strata (using PROC SURVEYSELECT)
Obs	City	Num	Name	Street	State	SelectionProb	SamplingWeight
1	Bellefonte	5	Lisa Brothers	89 Elm Street	PA	0.33333	3.00000
2	Bellefonte	7	John Doe	812 Main Street	PA	0.33333	3.00000
3	Bellefonte	8	Mamie Davison	102 Cherry Avenue	PA	0.33333	3.00000
4	Bellefonte	11	Linda Bentlager	1010 Tricia Lane	PA	0.33333	3.00000
5	Bellefonte	15	Harold Harvey	480 Main Street	PA	0.33333	3.00000
6	Port Matilda	40	Jane Smiley	298 Cardinal Drive	PA	0.46154	2.16667
7	Port Matilda	41	Lou Barr	219 Eagle Street	PA	0.46154	2.16667
8	Port Matilda	42	Casey Spears	123 Main Street	PA	0.46154	2.16667
9	Port Matilda	43	Leslie Olin	487 Bluebird Haven	PA	0.46154	2.16667
10	Port Matilda	49	Tim Winters	95 Dove Street	PA	0.46154	2.16667
11	Port Matilda	50	George Matre	75 Ashwind Drive	PA	0.46154	2.16667
12	State College	20	Kristin Jones	120 Stratford Drive	PA	0.36364	2.75000
13	State College	24	Mark Mendel	256 Fraser Street	PA	0.36364	2.75000
14	State College	25	Steve Lindhoff	130 E. College Avenue	PA	0.36364	2.75000
15	State College	28	Srabashi Kundu	112 E. Beaver Avenue	PA	0.36364	2.75000
16	State College	30	Daniel Peterson	328 Waupelani Drive	PA	0.36364	2.75000
17	State College	32	George Ball	888 Park Avenue	PA	0.36364	2.75000
18	State College	34	Mike Dahlberg	1201 No. Atherton	PA	0.36364	2.75000
19	State College	37	Scott Henderson	245 W. Beaver Avenue	PA	0.36364	2.75000

Straightforward enough! The only difference between this code and the code in Example 34.11 is that here the sample sizes are specified as 5, 6, and 8 rather than 5, 5, and 5. Note that you must list the stratum sample size values in the order in which the strata appear in the input data set.

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the mailing data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.
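
A quick numerical check is also possible. In the printed output, each SamplingWeight is the reciprocal of the corresponding SelectionProb (for example, 1/0.33333 is about 3), so the weights within a city should add up to that city's size in the original data set: 5 × 3 = 15, 6 × 2.16667 ≈ 13, and 8 × 2.75 = 22. The sketch below, which assumes the sample9 data set created above, reports the number sampled and the sum of the weights for each city.

```sas
/* Sketch: assumes sample9 was created by PROC SURVEYSELECT above. */
PROC MEANS data=sample9 n sum maxdec=2;
	class city;
	var SamplingWeight;
	title 'Sample9: Sampled Count and Sum of Sampling Weights by CITY';
RUN;
```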