15.2 - One-to-One Merging

At first glance, one-to-one merging appears to be the same as one-to-one reading, since it too combines two or more SAS data sets, one "to the right" of the other into a single "fat" data set. That is, just like one-to-one reading, one-to-one merging combines observations from two or more data sets into a single observation in a new data set. There is just one primary difference though — SAS continues to merge observations until it has read all of the observations from all of the data sets. For example, suppose again that our patients data set contains three variables: patient ID number (ID), gender (Sex), and age of the patient (Age):

ID	Sex	Age
1157	F	33
2395	F	48
1098	M	39
4829	F	24
3456	M	30
5920	M	41
1493	F	42

and our scale data set contains three variables: ID (number), Height (in inches) and Weight (in pounds):

ID	Height	Weight
1157	65	122
2395	64	130
1098	70	178
4829	67	142
3456	72	190
5920	71	188

Then, when we one-to-one merge the two data sets, we get a data set, called say one2onemerge, that looks like this:

ID	Sex	Age	Height	Weight
157	F	33	65	122
2395	F	48	64	130
1098	M	39	70	178
4829	F	24	67	142
3456	M	30	72	190
5920	M	41	71	188
1493	F	42	.	.

in which the second data set gets placed to the "right" of the first data set. Again, the observations are combined based on their relative position in the data set. The first observation of patient is combined with the first observation of scale to create the first observation in one2onemerge; the second observation of patient is combined with the second observation of scale to create the second observation in one2onemerge; and so on. When SAS performs a one-to-one merge, the DATA step continues to read observations until the last observation is read from the largest data set. That's why the one2onemerge data set has one more observation than the one2oneread data set. In general, the number of observations in a data set created by a one-to-one merge always equals the numbers of observations in the largest data set named for one-to-one merging.

Example 15.4 Section

The following program uses one-to-one merging to combine the patients data set with the scale data set:

DATA one2onemerge;
    merge patients scale;
RUN;

PROC PRINT NOOBS;
    title 'The one2onemerge data set';
RUN;

The one2onemerge data set
ID	VarA	VarB	VarC
40	A1	B4	C1
50	A2	B5	C2

The MERGE statement tells SAS to create a new data set by combining patients and scale. Launch and run the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. You should see that the first observation in one2onemerge contains the first observation of patients and scale, the second observation in one2onemerge contains the second observation of patients and scale, and so on. Since there are seven observations in patients and six observations in scale, the new one2onemerge data set contains seven observations, with missing values for the Height and Weight variables in the seventh observation. Note that although this example only combined two data sets, the MERGE statement can contain any number of input data sets.

Just as is true for one-to-one reading, if data sets that are being one-to-one merged contain variables that have the same names, the values that are read in from the last data set overwrite the values that were read in from earlier data sets. Let's go back to our contrived example to illustrate this point.

Example 15.5 Section

The following program uses one-to-one merging to combine the one data set with the two data set to create a new data set called onetwomerged:

DATA one;
    input ID VarA $ VarB $;
    DATALINES;
    10 A1 B1
    20 A2 B2
    30 A3 B3
    ;
RUN;
 
DATA two;
    input ID VarB $ VarC $;
    DATALINES;
    40 B4 C1
    50 B5 C2
    ;
RUN;
 
DATA onetwomerged;
    merge one two;
RUN;
 
PROC PRINT data = onetwomerged NOOBS;
    title 'The onetwomerged data set';
RUN;

The onetwomerged data set
ID	VarA	VarB	VarC
40	A1	B4	C1
50	A2	B5	C2
30	A3	B3

Note again that the one and two data sets share two variables, namely ID and VarB. The third (and last) DATA step tells SAS to combine the two data sets using the one-to-one merging method. Let's walk our way through how SAS processes the DATA step. At the end of the compile phase, SAS will have created a program data vector containing the variables from the one and two data sets in the order in which they appear in the DATA step:

_N_	_ERROR_	Num	VarA	VarB	VarC
1	0	.			.

During the first iteration of the DATA step, the MERGE statement reads first one observation from data set one:

_N_	_ERROR_	Num	VarA	VarB	VarC
1	0	10	A1	B1	.

and then one observation from data set two. The values for ID and VarB in data set two overwrites the values for ID and VarB in data set one:

_N_	_ERROR_	Num	VarA	VarB	VarC
1	0	40	A1	B4	C1

Being at the end of the first iteration of the DATA step, SAS writes the contents of the program data vector as the first observation in the onetwomerged data set. Upon returning to the top of the DATA step, the program data vector looks like this at the beginning of the second iteration of the DATA step:

_N_	_ERROR_	Num	VarA	VarB	VarC
2	0	40	A1	B4	C1

Just as is true for a SET statement, SAS retains the values of variables that were read from a SAS data set with the MERGE statement. Now, the MERGE statement reads the second observation from the one data set:

_N_	_ERROR_	Num	VarA	VarB	VarC
2	0	20	A2	B2	C1

and the second observation from the two data set. Again, the values for ID and VarB in data set two overwrites the values for ID and VarB in data set one:

_N_	_ERROR_	Num	VarA	VarB	VarC
2	0	50	A2	B5	C2

Being at the end of the second iteration of the DATA step, SAS writes the contents of the program data vector as the second observation in the onetwomerged data set. Now this is where things get different! Because an observation still exists in the one data set, SAS returns to the top of the DATA step. The program data vector looks like this at the beginning of the third iteration of the DATA step:

_N_	_ERROR_	Num	VarA	VarB	VarC
3	0	50	A2	B5	C2

And, the MERGE statement reads the third observation from the one data set:

_N_	_ERROR_	Num	VarA	VarB	VarC
3	0	30	A3	B3	C2

SAS attempts to read a third observation from the two data set but instead encounters an end-of-data set marker. Therefore, as is always the case in this kind of situation, SAS sets the values of all of that data set's variables in the program data vector to missing:

_N_	_ERROR_	Num	VarA	VarB	VarC
3	0	30	A3	B

Being at the end of the third iteration of the DATA step, SAS writes the contents of the program data vector as the third observation in the onetwomerged SAS data set. Because there are no more observations in either the one or the two data set, processing stops. Thank goodness!

Now, launch and run the SAS program, and review the output to convince yourself that the one and two data sets are combined as described.

One more closing comment. One-to-one reading and one-to-one merging require users to exercise extreme caution when combining two (or more) data sets based on relative position only. It would just take one of the data sets to be "shifted" ever so slightly to get really messed up results. It's for this reason that I personally don't find the one-to-one read or the one-to-one merge all that practical. The more useful and therefore much more common merge performed in SAS is what is called match-merging. We'll learn about it in the next lesson.