16.1 - Match-Merging SAS Data Sets

Match-merging is one of the most powerful methods of combining two or more SAS data sets. A match-merge combines observations across data sets based on the values of one or more common variables. For example, when data set bird:

Year	X
2000	1
2001	2
2002	3
2003	4
2004	5

and data set bee:

Year	X
2000	1
2001	2
2002	3
2003	4
2004	5

are match-merged by the variable year, we get the data set baby:

Year	X	Y
2000	1	1
2000	1	2
2001	2	.
2002	3	3
2003	4	4
2004	5	5

As you can see, the observations in the baby data set are created by joining observations from the bird and bee data sets that share the same year. To match-merge, you simply specify the data sets you would like to merge in a MERGE statement and indicate the variables on which you would like to merge in a BY statement. One thing to keep in mind, though, is you can't match-merge SAS data sets unless they are sorted by the variables appearing in the BY statement.

Throughout the rest of this section, we will look at examples of match-merges under various situations. In increasing order of complexity, the situations are:

a match-merge in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set
a match-merge in which the data sets to be merged are sorted by the BY variable in descending order
a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set
a match-merge in which the data sets to be merged share, aside from the variable that links the data sets together, at least one common variable name
a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged

Let's jump right in!

Example 16.1 Section

The following program illustrates the simplest case of a match-merge, in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set. Specifically, the demog and status data sets each contain five observations — one for each subj 1000, 1001, 1002, 1003, and 1004:

DATA demog;
    input subj gender $ age;
    cards;
    1000 M 42
    1001 M 20
    1002 F 53
    1003 F 40
    1004 M 29;
RUN;
 
DATA status;
    input subj disease $ test $ ;
    cards;
    1000 Y Y
    1001 N Y
    1002 N N
    1003 Y Y
    1004 N N;
RUN;
 
DATA patients;
    merge demog status;
    by subj;
RUN;
 
PROC PRINT data=patients NOOBS;
    title 'The patients data set';
RUN;

The patients data set
subj	gender	age	disease	test
1000	M	42	Y	Y
1001	M	20	N	Y
1002	F	53	N	N
1003	F	40	Y	Y
1004	M	29	N	N

Of course, the first two DATA steps just read in the respective demog and status data sets. Note that the two data sets are "linked" by the subject's ID number (subj). The meat of the match-merge takes place in the third (and last) DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (demog and status), by the variable appearing in the BY statement (subj), and to store the result in a new data set called patients. This is a perfectly valid request because both of the data sets (demog and status) are already sorted — in ascending order — by the subj variable.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the data set patients contains five observations — one observation for each of the five subjects appearing in both the demog and status data sets.

Example 16.2 Section

The following program is another example of a simple match-merge, except this time the data sets to be merged — descdemog and descstatus — are sorted by the BY variable — subj — in descending order:

PROC SORT data = demog out = descdemog;
    by descending subj;
RUN;
 
PROC SORT data = status out = descstatus;
    by descending subj;
RUN;
 
DATA descpatients;
    merge descdemog descstatus;
    by descending subj;
RUN;
 
PROC PRINT data = descpatients NOOBS;
    title 'The descpatients data set';
RUN;

The descpatients data set
subj	gender	age	disease	test
1004	M	29	N	N
1003	F	40	Y	Y
1002	F	53	N	N
1001	M	20	N	Y
1000	M	42	Y	Y

The two SORT procedures tell SAS to sort the demog and status data sets by the variable subj, and to store the results, respectively, in data sets called descdemog and descstatus. The keyword DESCENDING that precedes the subj variable in each SORT procedure tells SAS to sort the subj values in descending order — that is, with the largest subj value appearing first, and the smallest subj value appearing last.

The match-merge again takes place in the third DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (descdemog and descstatus), by the variable appearing in the BY statement (subj), and to store the result in a new data set called descpatients. Note that, to accomplish the match-merge, we have to place the DESCENDING keyword in the BY statement before the subj variable, so that SAS knows that the input data sets (descdemog and descstatus) are sorted in descending order.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the data set descpatients contains five observations created from the observations in descdemog and descstatus, and arranged in descending order of the subj variable.

Example 16.3 Section

The following program illustrates a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set. Specifically, the newdemog data set contains six observations — one for each subj 1000, 1001, 1002, 1003, 1004, and 1005 — while the status data set contains just five observations — one for each subj 1000, 1001, 1002, 1003, and 1004:

DATA newdemog;
    input subj gender $ age;
    cards;
    1000 M 42
    1001 M 20
    1002 F 53
    1003 F 40
    1004 M 29
    1005 F 29;
RUN;
 
DATA status;
    input subj disease $ test $ ;
    cards;
    1000 Y Y
    1001 N Y
    1002 N N
    1003 Y Y
    1004 N N;
RUN;
 
DATA newpatients;
    merge newdemog status;
    by subj;
RUN;
 
PROC PRINT data=newpatients NOOBS;
    title 'The newpatients data set';
RUN;

The newpatients data set
subj	gender	age	disease	test
1000	M	42	Y	Y
1001	M	20	N	Y
1002	F	53	N	N
1003	F	40	Y	Y
1004	M	29	N	N
1005	F	29

Again, the first two DATA steps just tell SAS to read in the newdemog and status data sets. The third DATA step tells SAS to match-merge the newdemog and status data sets by the subj variable, and to store the result in a new data set called newpatients. Although the newdemog and status data sets contain different numbers of observations, SAS continues to merge the two data sets until it runs out of observations. Where necessary, SAS fills variables with missing values.

Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the merged data set newpatients contains six subjects. SAS fills the variables in the newpatients data set with values based on the variables' values in the data set from which they came. Because subject 1005 does not appear in the status data set, SAS assigns a missing value to the variables disease and test for subject 1005.

Example 16.4 Section

In the examples we've looked at so far, the variables in the datasets to be merged were unique. That is, the data sets did not share any common variable names other than the variable that linked the data sets together. The following program illustrates how SAS merges two data sets when there are common variables across the data sets besides the linking variable(s). Specifically, in addition to the linking variable subj, the moredemog and morestatus data sets share a variable called v_date (for visit date):

DATA moredemog;
    input subj gender $ age v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 M 42 03/10/96
    1001 M 20 02/19/96
    1002 F 53 02/01/96
    1003 F 40 12/31/95
    1004 M 29 01/10/97;
RUN;
 
DATA morestatus;
    input subj disease $ test $ v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 Y Y 03/17/96
    1001 N Y 03/01/96
    1002 N N 02/18/96
    1003 Y Y 01/15/96
    1004 N N 02/01/97;
RUN;
 
DATA morepatients;
    merge moredemog morestatus;
    by subj;
RUN;
 
PROC PRINT data=morepatients NOOBS;
    title 'The morepatients data set';
RUN;

The morepatients data set
subj	gender	age	v_date	disease	test
1000	M	42	03/17/96	Y	Y
1001	M	20	03/01/96	N	Y
1002	F	53	02/18/96	N	N
1003	F	40	01/15/96	Y	Y
1004	M	29	02/01/97	N	N

Again, the first two DATA steps just tell SAS to read in the moredemog and morestatus data sets. As you can see, the moredemog and morestatus data sets have in common not only the linking variable, subj, but also the v_date variable. So, when the two data sets are merged into a data set called morepatients, as happens in the third DATA step, the variable v_date gets its values from the morestatus data set. That's because morestatus appears last in the MERGE statement. So, the values of v_date from moredemog get over-written in the program data vector by the values of v_date in morestatus. Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the values for v_date in the merged data set morepatients are the same as the values for v_date in the morestatus data set.

So, in general, if data sets share common variable names, the variable in the merged data set takes its value from the data set appearing last in the MERGE statement. To reinforce the point, the following data step merges data sets morestatus and moredemog again, but this time with moredemog appearing after morestatus in the MERGE statement:

DATA morepatients2;
    merge morestatus moredemog;
    by subj;
RUN;
PROC PRINT data=morepatients2 NOOBS;
    title 'The morepatients2 data set';
RUN;

The morepatients2 data set
subj	disease	test	v_date	gender	age
1000	Y	Y	03/10/96	M	42
1001	N	Y	02/19/96	M	20
1002	N	N	02/01/96	F	53
1003	Y	Y	12/31/95	F	40
1004	N	N	01/10/97	M	29

Here, the variable v_date in the morepatients2 data set gets its value from the moredemog data set. That's because moredemog appears last in the MERGE statement. Therefore, the values of v_date from morestatus get over-written in the program data vector by the values of v_date in moredemog. Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that this time the values for v_date in the merged data set morepatients2 are the same as the values for v_date in the moredemog data set.

Example 16.5 Section

The following program illustrates a match-merge in which the observations in the first data set to be merged match one or more observations in the second data set to be merged. Case in point, the salesone data set contains two observations for the 2004 year, while the salestwo data set contains three observations for the 2004 year:

DATA salesone;
    input year prd sales;
    DATALINES;
    2004 1 100
    2004 2 200
    2005 3 300
    2006 4 400
    2007 5 500
    2008 6 600;
RUN;
 
DATA salestwo;
    input year loc sales;
    DATALINES;
    2004 7  700
    2004 8  800
    2004 9  900
    2006 10 950
    2007 11 960
    2008 12 970;
RUN;
 
DATA allsales;
    merge salesone salestwo;
    by year;
RUN;
 
PROC PRINT data=allsales NOOBS;
    title 'The allsales data set';
RUN;

the allsales data set
year	prd	sales	loc
2004	1	700	7
2004	2	800	8
2004	2	900	9
2005	3	300	.
2006	4	950	10
2007	5	960	11
2008	6	970	12

Again, the first two DATA steps just tell SAS to read in the salesone and salestwo data sets. Note that the salesone data set contains two 2004 years, while the salestwo data set contains three 2004 years. And, although the salesone data set contains a 2005 year, the salestwo data set contains no 2005 year. The third DATA step tells SAS to match-merge the salesone and salestwo data sets by the year variable, and to store the result in a new data set called allsales. When you launch and run this program, you should get the following output:

the allsales data set
year	prd	sales	loc
2004	1	700	7
2004	2	800	8
2004	2	900	9
2005	3	300	.
2006	4	950	10
2007	5	960	11
2008	6	970	12

Read what follows slowly and carefully, as it gets messy! As you can see, the last three observations for the 2006, 2007, and 2008 years are obtained simply:

by merging the 2006 observation in salesone with the 2006 observation in salestwo,
by merging the 2007 observation in salesone with the 2007 observation in salestwo,
and by merging the 2008 observation in salesone with the 2008 observation in salestwo.

Note that because the sales variable is common to both the salesone and salestwo data sets, its values in the merged allsales data set comes from salestwo, the data set appearing last in the MERGE statement.

Let's now focus on the more difficult years of 2004 and 2005. Recall that the salesone data set contains two 2004 years and one 2005 year, while the salestwo data set contains three 2004 years and no 2005 years. The first two observations are again obtained simply enough:

by merging the first 2004 observation in the salesone data set with the first 2004 observation in the salestwo data set,
and by merging the second 2004 observation in the salesone data set with the second 2004 observation in the salestwo data set.

Again, because the sales variable is common to both the salesone and salestwo data sets, its values in the merged allsales data set comes from salestwo, the data set appearing last in the MERGE statement.

Now, for the messiest observation of all! SAS creates the third observation for 2004 by first getting what it can from the salesone data set — it gets year 2004, prd 2 (the last known value of it), and sales 200 (the last known value of it). Then, it gets what it can from the salestwo data set — it gets year 2004, loc 9, and sales 900 (and thereby writing over the value of 200 it obtained from the salesone data set).

Similarly, SAS creates the fourth observation for 2005 by first getting what it can from the salesone data set — it gets year 2005, prd 3, and sales 300. Then, SAS gets what it can from the salestwo data set —which is nothing. Hence, loc is assigned a missing value for 2005.

So, do you get it? What do you think the data set would look like if you merged in the reverse order? The following DATA step tells SAS to merge salestwo and salesone by year and to store the result in a new data set called allsales2:

DATA allsales2;
    merge salestwo salesone;
    by year;
RUN;
 
PROC PRINT data=allsales2 NOOBS;
    title 'The allsales2 data set';
RUN;

the allsales2 data set
year	loc	sales	prd
2004	7	100	1
2004	8	200	2
2004	9	900	2
2005	.	300	3
2006	10	400	4
2007	11	500	5
2008	12	600	6

Launch and run this program. Then, see if you can justify each of the values obtained in the resulting output:

the allsales2 data set
year	loc	sales	prd
2004	7	100	1
2004	8	200	2
2004	9	900	2
2005	.	300	3
2006	10	400	4
2007	11	500	5
2008	12	600	6

One closing comment. Note that although each of our match-merging examples involved combining just two data sets, you can specify any number of data sets in a MERGE statement.