Lesson 16: Combining SAS Data Sets -- Part II

Lesson 16: Combining SAS Data Sets -- Part II

Overview

In the last lesson, we learned different ways of combining SAS data sets — one-to-one reading, one-to-one merging, concatenating, and interleaving. In this lesson, we'll finish up our work in this arena by investigating the process of match-merging, in which we combine two or more SAS data sets based on the values of one or more common variables using MERGE and BY statements. This method deserves its own lesson, because it is far and away the most commonly used method of combining SAS data sets. Once we've learned how to match-merge two or more SAS data sets, we'll also spend some time exploring how to use DATA step options at the same time that match-merge.

Objectives

Upon completion of this lesson, you should be able to:

Upon completing this lesson, you should be able to do the following:

  • recognize that SAS data sets must be sorted properly in order to match-merge them
  • write code to match-merge data sets that contain the same number of observations, when each observation in the first data set matches with exactly one observation in the second data set
  • write code to match-merge data sets that are sorted by the BY variable in descending order
  • write code to match-merge data sets that contain a different number of observations, when each observation in the first data set matches with no more than one observation in the second data set
  • write code to match-merge data sets that share, aside from the variable that links the data sets together, at least one common variable name
  • write code to match-merge data sets when the observations in the first data set to be merged matches one or more observations in the second data set to be merged
  • explain how SAS completes a match-merge, and therefore be able to predict the outcome of a match-merge
  • use the various DATA step options that are available when match-merging

16.1 - Match-Merging SAS Data Sets

16.1 - Match-Merging SAS Data Sets

Match-merging is one of the most powerful methods of combining two or more SAS data sets. A match-merge combines observations across data sets based on the values of one or more common variables. For example, when data set bird:

Year

X

2000

1

2001

2

2002

3

2003

4

2004

5

and data set bee:

Year

X

2000

1

2001

2

2002

3

2003

4

2004

5

are match-merged by the variable year, we get the data set baby:

Year

X

Y

2000

1

1

2000

1

2

2001

2

.

2002

3

3

2003

4

4

2004

5

5

As you can see, the observations in the baby data set are created by joining observations from the bird and bee data sets that share the same year. To match-merge, you simply specify the data sets you would like to merge in a MERGE statement and indicate the variables on which you would like to merge in a BY statement. One thing to keep in mind, though, is you can't match-merge SAS data sets unless they are sorted by the variables appearing in the BY statement.

Throughout the rest of this section, we will look at examples of match-merges under various situations. In increasing order of complexity, the situations are:

  • a match-merge in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set
  • a match-merge in which the data sets to be merged are sorted by the BY variable in descending order
  • a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set
  • a match-merge in which the data sets to be merged share, aside from the variable that links the data sets together, at least one common variable name
  • a match-merge in which the observations in the first data set to be merged matches one or more observations in the second data set to be merged

Let's jump right in!

Example 16.1

The following program illustrates the simplest case of a match-merge, in which the data sets to be merged contain the same number of observations, and each observation in the first data set matches with exactly one observation in the second data set. Specifically, the demog and status data sets each contain five observations — one for each subj 1000, 1001, 1002, 1003, and 1004:

DATA demog;
    input subj gender $ age;
    cards;
    1000 M 42
    1001 M 20
    1002 F 53
    1003 F 40
    1004 M 29;
RUN;
 
DATA status;
    input subj disease $ test $ ;
    cards;
    1000 Y Y
    1001 N Y
    1002 N N
    1003 Y Y
    1004 N N;
RUN;
 
DATA patients;
    merge demog status;
    by subj;
RUN;
 
PROC PRINT data=patients NOOBS;
    title 'The patients data set';
RUN;

The patients data set

subj

gender

age

disease

test

1000

M

42

Y

Y

1001

M

20

N

Y

1002

F

53

N

N

1003

F

40

Y

Y

1004

M

29

N

N

Of course, the first two DATA steps just read in the respective demog and status data sets. Note that the two data sets are "linked" by the subject's ID number (subj). The meat of the match-merge takes place in the third (and last) DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (demog and status), by the variable appearing in the BY statement (subj), and to store the result in a new data set called patients. This is a perfectly valid request because both of the data sets (demog and status) are already sorted — in ascending order — by the subj variable.

Launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that the data set patients contains five observations — one observation for each of the five subjects appearing in both the demog and status data sets.

Example 16.2

The following program is another example of a simple match-merge, except this time the data sets to be merged — descdemog and descstatus — are sorted by the BY variable — subj — in descending order:

PROC SORT data = demog out = descdemog;
    by descending subj;
RUN;
 
PROC SORT data = status out = descstatus;
    by descending subj;
RUN;
 
DATA descpatients;
    merge descdemog descstatus;
    by descending subj;
RUN;
 
PROC PRINT data = descpatients NOOBS;
    title 'The descpatients data set';
RUN;

The descpatients data set

subj

gender

age

disease

test

1004

M

29

N

N

1003

F

40

Y

Y

1002

F

53

N

N

1001

M

20

N

Y

1000

M

42

Y

Y

The two SORT procedures tell SAS to sort the demog and status data sets by the variable subj, and to store the results, respectively, in data sets called descdemog and descstatus. The keyword DESCENDING that precedes the subj variable in each SORT procedure tells SAS to sort the subj values in descending order — that is, with the largest subj value appearing first, and the smallest subj value appearing last.

The match-merge again takes place in the third DATA step, in which we see a MERGE statement and a BY statement. The DATA step tells SAS that we want to match-merge the data sets appearing in the MERGE statement (descdemog and descstatus), by the variable appearing in the BY statement (subj), and to store the result in a new data set called descpatients. Note that, to accomplish the match-merge, we have to place the DESCENDING keyword in the BY statement before the subj variable, so that SAS knows that the input data sets (descdemog and descstatus) are sorted in descending order.

Launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that the data set descpatients contains five observations created from the observations in descdemog and descstatus, and arranged in descending order of the subj variable.

Example 16.3

The following program illustrates a match-merge in which the data sets to be merged contain a different number of observations, and each observation in the first data set matches with no more than one observation in the second data set. Specifically, the newdemog data set contains six observations — one for each subj 1000, 1001, 1002, 1003, 1004, and 1005 — while the status data set contains just five observations — one for each subj 1000, 1001, 1002, 1003, and 1004:

DATA newdemog;
    input subj gender $ age;
    cards;
    1000 M 42
    1001 M 20
    1002 F 53
    1003 F 40
    1004 M 29
    1005 F 29;
RUN;
 
DATA status;
    input subj disease $ test $ ;
    cards;
    1000 Y Y
    1001 N Y
    1002 N N
    1003 Y Y
    1004 N N;
RUN;
 
DATA newpatients;
    merge newdemog status;
    by subj;
RUN;
 
PROC PRINT data=newpatients NOOBS;
    title 'The newpatients data set';
RUN;

The newpatients data set

subj

gender

age

disease

test

1000

M

42

Y

Y

1001

M

20

N

Y

1002

F

53

N

N

1003

F

40

Y

Y

1004

M

29

N

N

1005

F

29

  

Again, the first two DATA steps just tell SAS to read in the newdemog and status data sets. The third DATA step tells SAS to match-merge the newdemog and status data sets by the subj variable, and to store the result in a new data set called newpatients. Although the newdemog and status data sets contain different numbers of observations, SAS continues to merge the two data sets until it runs out of observations. Where necessary, SAS fills variables with missing values.

Launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that the merged data set newpatients contains six subjects. SAS fills the variables in the newpatients data set with values based on the variables' values in the data set from which they came. Because subject 1005 does not appear in the status data set, SAS assigns a missing value to the variables disease and test for subject 1005.

Example 16.4

In the examples we've looked at so far, the variables in the datasets to be merged were unique. That is, the data sets did not share any common variable names other than the variable that linked the data sets together. The following program illustrates how SAS merges two data sets when there are common variables across the data sets besides the linking variable(s). Specifically, in addition to the linking variable subj, the moredemog and morestatus data sets share a variable called v_date (for visit date):

DATA moredemog;
    input subj gender $ age v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 M 42 03/10/96
    1001 M 20 02/19/96
    1002 F 53 02/01/96
    1003 F 40 12/31/95
    1004 M 29 01/10/97;
RUN;
 
DATA morestatus;
    input subj disease $ test $ v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 Y Y 03/17/96
    1001 N Y 03/01/96
    1002 N N 02/18/96
    1003 Y Y 01/15/96
    1004 N N 02/01/97;
RUN;
 
DATA morepatients;
    merge moredemog morestatus;
    by subj;
RUN;
 
PROC PRINT data=morepatients NOOBS;
    title 'The morepatients data set';
RUN;

    

The morepatients data set

subj

gender

age

v_date

disease

test

1000

M

42

03/17/96

Y

Y

1001

M

20

03/01/96

N

Y

1002

F

53

02/18/96

N

N

1003

F

40

01/15/96

Y

Y

1004

M

29

02/01/97

N

N

Again, the first two DATA steps just tell SAS to read in the moredemog and morestatus data sets. As you can see, the moredemog and morestatus data sets have in common not only the linking variable, subj, but also the v_date variable. So, when the two data sets are merged into a data set called morepatients, as happens in the third DATA step, the variable v_date gets its values from the morestatus data set. That's because morestatus appears last in the MERGE statement. So, the values of v_date from moredemog get over-written in the program data vector by the values of v_date in morestatus. Launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that the values for v_date in the merged data set morepatients are the same as the values for v_date in the morestatus data set.

So, in general, if data sets share common variable names, the variable in the merged data set takes its value from the data set appearing last in the MERGE statement. To reinforce the point, the following data step merges data sets morestatus and moredemog again, but this time with moredemog appearing after morestatus in the MERGE statement:

DATA morepatients2;
    merge morestatus moredemog;
    by subj;
RUN;
PROC PRINT data=morepatients2 NOOBS;
    title 'The morepatients2 data set';
RUN;

The morepatients2 data set

subj

disease

test

v_date

gender

age

1000

Y

Y

03/10/96

M

42

1001

N

Y

02/19/96

M

20

1002

N

N

02/01/96

F

53

1003

Y

Y

12/31/95

F

40

1004

N

N

01/10/97

M

29

Here, the variable v_date in the morepatients2 data set gets its value from the moredemog data set. That's because moredemog appears last in the MERGE statement. Therefore, the values of v_date from morestatus get over-written in the program data vector by the values of v_date in moredemog. Launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that this time the values for v_date in the merged data set morepatients2 are the same as the values for v_date in the moredemog data set.

Example 16.5

The following program illustrates a match-merge in which the observations in the first data set to be merged match one or more observations in the second data set to be merged. Case in point, the salesone data set contains two observations for the 2004 year, while the salestwo data set contains three observations for the 2004 year:

DATA salesone;
    input year prd sales;
    DATALINES;
    2004 1 100
    2004 2 200
    2005 3 300
    2006 4 400
    2007 5 500
    2008 6 600;
RUN;
 
DATA salestwo;
    input year loc sales;
    DATALINES;
    2004 7  700
    2004 8  800
    2004 9  900
    2006 10 950
    2007 11 960
    2008 12 970;
RUN;
 
DATA allsales;
    merge salesone salestwo;
    by year;
RUN;
 
PROC PRINT data=allsales NOOBS;
    title 'The allsales data set';
RUN;

the allsales data set

year

prd

sales

loc

2004

1

700

7

2004

2

800

8

2004

2

900

9

2005

3

300

.

2006

4

950

10

2007

5

960

11

2008

6

970

12

Again, the first two DATA steps just tell SAS to read in the salesone and salestwo data sets. Note that the salesone data set contains two 2004 years, while the salestwo data set contains three 2004 years. And, although the salesone data set contains a 2005 year, the salestwo data set contains no 2005 year. The third DATA step tells SAS to match-merge the salesone and salestwo data sets by the year variable, and to store the result in a new data set called allsales. When you launch and run  this program, you should get the following output:

the allsales data set

year

prd

sales

loc

2004

1

700

7

2004

2

800

8

2004

2

900

9

2005

3

300

.

2006

4

950

10

2007

5

960

11

2008

6

970

12

Read what follows slowly and carefully, as it gets messy! As you can see, the last three observations for the 2006, 2007, and 2008 years are obtained simply:

  • by merging the 2006 observation in salesone with the 2006 observation in salestwo,
  • by merging the 2007 observation in salesone with the 2007 observation in salestwo,
  • and by merging the 2008 observation in salesone with the 2008 observation in salestwo.

Note that because the sales variable is common to both the salesone and salestwo data sets, its values in the merged allsales data set comes from salestwo, the data set appearing last in the MERGE statement.

Let's now focus on the more difficult years of 2004 and 2005. Recall that the salesone data set contains two 2004 years and one 2005 year, while the salestwo data set contains three 2004 years and no 2005 years. The first two observations are again obtained simply enough:

  • by merging the first 2004 observation in the salesone data set with the first 2004 observation in the salestwo data set,
  • and by merging the second 2004 observation in the salesone data set with the second 2004 observation in the salestwo data set.

Again, because the sales variable is common to both the salesone and salestwo data sets, its values in the merged allsales data set comes from salestwo, the data set appearing last in the MERGE statement.

Now, for the messiest observation of all! SAS creates the third observation for 2004 by first getting what it can from the salesone data set — it gets year 2004, prd 2 (the last known value of it), and sales 200 (the last known value of it). Then, it gets what it can from the salestwo data set — it gets year 2004, loc 9, and sales 900 (and thereby writing over the value of 200 it obtained from the salesone data set).

Similarly, SAS creates the fourth observation for 2005 by first getting what it can from the salesone data set — it gets year 2005, prd 3, and sales 300. Then, SAS gets what it can from the salestwo data set —which is nothing. Hence, loc is assigned a missing value for 2005.

So, do you get it? What do you think the data set would look like if you merged in the reverse order? The following DATA step tells SAS to merge salestwo and salesone by year and to store the result in a new data set called allsales2:

DATA allsales2;
    merge salestwo salesone;
    by year;
RUN;
 
PROC PRINT data=allsales2 NOOBS;
    title 'The allsales2 data set';
RUN;

the allsales2 data set

year

loc

sales

prd

2004

7

100

1

2004

8

200

2

2004

9

900

2

2005

.

300

3

2006

10

400

4

2007

11

500

5

2008

12

600

6

Launch and run  this program. Then, see if you can justify each of the values obtained in the resulting output:

the allsales2 data set

year

loc

sales

prd

2004

7

100

1

2004

8

200

2

2004

9

900

2

2005

.

300

3

2006

10

400

4

2007

11

500

5

2008

12

600

6

One closing comment. Note that although each of our match-merging examples involved combining just two data sets, you can specify any number of data sets in a MERGE statement.


16.2 - How SAS Match-Merges

16.2 - How SAS Match-Merges

The match-merge examples we looked at in the previous section started first with the simplest case and then increased in complexity. The truth is that match-merging can be even more complex than we've seen. For example, we haven't yet investigated the situation in which we merge more than one variable at a time. It is because of this potential complexity that you will be best served by understanding how the DATA step performs match-merges. Only then will you be able to predict the results of, and therefore perform, match-merges correctly.

In this section, we'll revisit two examples from the previous section first Example 16.4 and then Example 16.5. Our focus this time won't be on how to perform the match-merge, but rather on how SAS performs the match-merge.

Example 16.6

The following program is identical to the first program that appears in Example 16.4:

DATA moredemog;
    input subj gender $ age v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 M 42 03/10/96
    1001 M 20 02/19/96
    1002 F 53 02/01/96
    1003 F 40 12/31/95
    1004 M 29 01/10/97;
RUN;
 
DATA morestatus;
    input subj disease $ test $ v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 Y Y 03/17/96
    1001 N Y 03/01/96
    1002 N N 02/18/96
    1003 Y Y 01/15/96
    1004 N N 02/01/97;
RUN;
 
DATA morepatients;
    merge moredemog morestatus;
    by subj;
RUN;
      
PROC PRINT data=morepatients NOOBS;
        title 'The morepatients data set';
RUN;

The morepatients data set

subj

gender

age

v_date

disease

test

1000

M

42

03/17/96

Y

Y

1001

M

20

03/01/96

N

Y

1002

F

53

02/18/96

N

N

1003

F

40

01/15/96

Y

Y

1004

M

29

02/01/97

N

N

You might want to launch and re-run  the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that, in addition to the linking variable subj, the moredemog and morestatus data sets share a variable called v_date. In this case, the variable v_date gets its values from the morestatus data set, since it is the last data set to appear in the MERGE statement.

Our focus this time is going to be to work our way through the program to see how SAS match-merges the moredemog and morestatus data sets to get the morepatients data set. As always, at the end of the compile phase, SAS will have created a program data vector containing the variables from the moredemog and morestatus data sets in the order in which they appear in the DATA step:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

1

0

.

  

.

  

As always, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "BY group" in each data set to determine which BY group should appear first. In this case, the first BY group, that is, observations for which subj = 1000, is the same in both data sets. So, SAS reads the first observation from the moredemog data set:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

1

0

1000

M

42

03/10/96

  

and then SAS reads the first observation from the morestatus data set:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

1

0

1000

M

42

03/17/96

Y

Y

As you can see, the value of v_date from the moredemog data set (03/10/96) gets written over by the value of v_date from the morestatus data set (03/17/96). Having no more data sets from which to read, SAS writes the first observation to the morepatients data set and retains the values in the program data vector. (If the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the subj = 1000 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

2

0

.

 

.

.

  

SAS reads the second observation from the moredemog data set:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

2

0

1001

M

20

02/19/96

  

and then SAS reads the second observation from the morestatus data set:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

2

0

1001

M

20

03/01/96

N

Y

Again, the value of v_date from the moredemog data set (02/19/96) gets written over by the value of v_date from the morestatus data set (03/01/96). Having no more data sets from which to read, SAS writes the second observation to the morepatients data set and retains the values in the program data vector.

SAS then looks in each data set for another observation in the subj = 1001 BY group. Because neither data set has one, SAS sets all of the values in the program data vector to missing and begins processing the next BY group. Shall we stop this seemingly endless cycle? You should have the basic idea now... SAS continues processing observations as just described until it exhausts all of the observations in both data sets.

Example 16.7

The following program is identical to the first program that appears in Example 16.5:

DATA salesone;
    input year prd sales;
    DATALINES;
    2004 1 100
    2004 2 200
    2005 3 300
    2006 4 400
    2007 5 500
    2008 6 600
;
RUN;
    
    DATA salestwo;
    input year loc sales;
    DATALINES;
    2004 7  700
    2004 8  800
    2004 9  900
    2006 10 950
    2007 11 960
    2008 12 970
;
RUN;
    
DATA allsales;
        merge salesone salestwo;
        by year;
RUN;
    
PROC PRINT data=allsales NOOBS;
    title 'The allsales data set';
RUN;

The allsales data set

year

prd

sales

loc

2004

1

700

7

2004

2

800

8

2004

2

900

9

2005

3

300

.

2006

4

950

10

2007

5

960

11

2008

6

970

12

Again, you might want to launch and re-run  the program, so that you can review the output again to reacquaint yourself with the issues of this particular match-merge. Specifically, recall that the observations in the first data set to be merged matches one or more observations in the second data set to be merged. For example, the salesone data set contains two observations for the 2004 year, while the salestwo data set contains three observations for the 2004 year.

Let's work our way through this match-merge. At the end of the compile phase, SAS will have created a program data vector containing the variables from the salesone and salestwo data sets in the order in which they appear in the DATA step:

_N_

_ERROR_

year

prd

sales

loc

1

0

.

  

.

Again, SAS starts the execution phase having initialized the values of all of the data set variables in the program data vector to missing. Then, SAS looks at the first "BY group" in each data set to determine which BY group should appear first. In this case, the first BY group, that is observations for which year = 2004, is the same in both data sets. So, SAS reads the first observation from the salesone data set:

_N_

_ERROR_

year

prd

sales

loc

1

0

2004

1

100

.

and then SAS reads the first observation from the salestwo data set:

_N_

_ERROR_

year

prd

sales

loc

1

0

2004

1

700

7

You now shouldn't be surprised to see that the value of sales from the salesone data set (100) gets written over by the value of sales from the salestwo data set (700). Having no more data sets from which to read, SAS writes the first observation to the allsales data set and retains the values in the program data vector. (Again, if the program data vector contained variables created in the DATA step, SAS would set them to missing after writing their values to the new data set.)

SAS then looks in each data set for a second observation in the year = 2004 BY group. Both data sets have one. Therefore, SAS reads the second observation from the salesone data set:

_N_

_ERROR_

year

prd

sales

loc

2

0

2004

2

200

7

The value 7 stored in the loc variable is a remnant from the previous iteration of the DATA step. It doesn't get written over until SAS reads the second observation from the salestwo data set:

_N_

_ERROR_

year

prd

sales

loc

2

0

2004

2

800

8

Again, the value of sales from the salesone data set (200) gets written over by the value of sales from the salestwo data set (800). Having no more data sets from which to read, SAS writes the second observation to the allsales data set and retains the values in the program data vector.

SAS then looks in each data set for another observation in the subj = 2004 BY group. The salesone data set has no more, but the salestwo data set has one more. So, SAS retains the value of the prd variable from the previous iteration and reads the next observation from the salestwo data set:

_N_

_ERROR_

year

prd

sales

loc

3

0

2004

2

900

9

Again, the value of sales that was retained from the last iteration (800) gets written over by the value of sales from the salestwo data set (900). Having no more data sets from which to read, SAS writes the third observation to the allsales data set and retains the values in the program data vector.

SAS then looks in each data set for another observation in the year = 2004 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_

_ERROR_

year

prd

sales

loc

4

0

.

.

.

.

When is this circus going to end?! Soon... bear with us! Okay... so SAS determines that the year = 2005 is the next BY group and that salesone has an observation for that BY group, while salestwo does not. SAS reads the available observation from the salesone data set:

_N_

_ERROR_

year

prd

sales

loc

4

0

2005

3

300

.

and having no more observations to read, SAS writes the contents of the program data vector to the fourth observation in the allsales data set.

SAS then looks in each data set for another observation in the year = 2005 BY group. Neither data set has one. Therefore, SAS sets all of the values in the program data vector to missing and begins processing the next BY group:

_N_

_ERROR_

subj

gender

age

v_date

5

0

.

 

.

.

Okay, let's cry, uncle! At this point, it really is a straightforward trek to the finish line provided you follow the general guidelines, which are probably worth summarizing once and for all:

  • At the beginning of the execution phase, set the values in the program data vector to missing.
  • Determine the first BY group, and read all of the observations from all of the data sets having the BY group.
  • Write the contents of the program data vector to the output data set.
  • If there are more observations in the BY group, retain the values in the program data vector.
  • If there are no more observations in the BY group, set all of the values in the program data vector to missing and begin processing the next BY group.
  • Continue processing observations as described until all of the observations in all of the input data sets are exhausted.

Using these guidelines, you should be able to predict the contents of just about any match-merge you encounter.


16.3 - Renaming Variables

16.3 - Renaming Variables

As is the case when combining data sets by other methods, data set options, such as RENAME=, IN=, DROP=, and KEEP=, can be used when match-merging data sets. In this section, we'll look at an example that uses the RENAME= option to rename variable names that are shared by the data steps to be merged.

Example 16.8

The following program uses the RENAME= option to rename the v_date variables in the demogtwo and statustwo data sets so that when they are merged into a new data set called patientstwo, both visit dates are preserved:

DATA demogtwo;
    input subj gender $ age v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 M 42 03/10/96
    1001 M 20 02/19/96
    1002 F 53 02/01/96
    1003 F 40 12/31/95
    1004 M 29 01/10/97;
RUN;
 
DATA statustwo;
    input subj disease $ test $ v_date mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    1000 Y Y 03/17/96
    1001 N Y 03/01/96
    1002 N N 02/18/96
    1003 Y Y 01/15/96
    1004 N N 02/01/97;
RUN;
DATA patientstwo;
    merge demogtwo (rename = (v_date = demogdate))
        statustwo (rename = (v_date = statusdate));
    by subj;
RUN;
 
PROC PRINT data=patientstwo NOOBS;
    title 'The patientstwo data set';
RUN;

The patientstwo data set

subj

gender

age

demogdate

disease

test

statusdate

1000

M

42

03/10/96

Y

Y

03/17/96

1001

M

20

02/19/96

N

Y

03/01/96

1002

F

53

02/01/96

N

N

02/18/96

1003

F

40

12/31/95

Y

Y

01/15/96

1004

M

29

01/10/97

N

N

02/01/97

When reviewing the first two DATA steps, in which we tell SAS to read in the demogtwo and statustwo data sets, note that both of the data sets contain a date variable called v_date. The third DATA step tells SAS to merge the demogtwo and statustwo data sets by the subj variable, and when doing so change the name of the v_date variable in the demogtwo data set to demogdate and the name of the v_date variable in the statustwo data set to statusdate. Because of this renaming that takes place, rather than the program data vector looking like this:

_N_

_ERROR_

subj

gender

age

v_date

disease

test

1

0

.

  

.

  

it looks like this:

_N_

_ERROR_

subj

gender

demogdate

disease

test

statusdate

1

0

.

.

.

  

.

Therefore, the merge reduces to a simple match-merge in which all of the values in the input data sets have a rightful position in the program data vector and are therefore preserved.

Launch and run  the program, and review the output to convince yourself that the demogtwo and statustwo data sets are merged by subj successfully and that the values in each input data set are preserved in the output data set patientstwo.


16.4 - Excluding Unmatched Observations

16.4 - Excluding Unmatched Observations

By default, when match-merging, the DATA step combines all of the observations in all of the input data sets. There may be situations, however, in which we want SAS to select only those observations for which a match exists in all of the input data sets. This is a rather common thing to do when, for instance, one of the data sets contains a master list of patients, say, and the other data sets contain information on actions taken by the patients. When we merge the data sets, we'd only want to include those patients who appear both in the master list data set and in the action data set. For example, suppose we have a data set containing a master list of patients:

id v_date
110011 01/01/06
110012 01/02/06
110013 01/04/06

and a data set called allvoids containing information about some patients' urinary voiding habits (void_no and volume) for a given visit date (v_date):

id v_date void_no volume
110011 01/01/06 1 250
110011 01/01/06 2 300
110011 01/01/06 3 302
110011 01/01/06 4 231
110012 01/02/06 1 305
110012 01/02/06 2 225
110012 01/02/06 3 400
110013 01/04/06 1 300
110013 01/04/06 2 333
110013 01/04/06 3 401
110013 01/04/06 4 404
110014 01/06/06 1 398
110014 01/06/06 2 413

Note that the patients data set contains three id numbers (110011, 110012, and 110013), whereas the allvoids data set contains information on four id numbers (110011, 110012, 110013, and 110014). Because the patients data set is viewed as the master list of patients to include in any statistical analysis, we would want to merge the patients and allvoids data set in such a way to ensure that only those patients whose id appears in both input data sets are included in the output analysis data set:

id v_date void_no volume
110011 01/01/06 1 250
110011 01/01/06 2 300
110011 01/01/06 3 302
110011 01/01/06 4 231
110012 01/02/06 1 305
110012 01/02/06 2 225
110012 01/02/06 3 400
110013 01/04/06 1 300
110013 01/04/06 2 333
110013 01/04/06 3 401
110013 01/04/06 4 404

Fortunately, telling SAS to exclude unmatched observations is as simple as using the IN= data set option along with a subsetting IF statement. Specifically:

  • We use the IN= DATA step option to create a variable that indicates whether the data set contributed to the current observation.
  • Then, we use a subsetting IF statement to check the IN= values and to write to the output (merged) data set only those observations that appear in the data sets for which IN= specified.

Let's try this approach out on our patients and allvoids data sets!

Example 16.9

The following program simply reads in the patients and allvoids data sets:

DATA patients;
    input id v_date : mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    110011 01/01/06
    110012 01/02/06
    110013 01/04/06;
RUN;
     
DATA allvoids;
    input id v_date : mmddyy8. void_no volume;
    format v_date mmddyy8.;
    DATALINES;
    110011  01/01/06 1 250
    110011  01/01/06 2 300
    110011  01/01/06 3 302
    110011  01/01/06 4 231
    110012  01/02/06 1 305
    110012  01/02/06 2 225
    110012  01/02/06 3 400
    110013  01/04/06 1 300
    110013  01/04/06 2 333
    110013  01/04/06 3 401
    110013  01/04/06 4 404
    110014  01/06/06 1 398
    110014  01/06/06 2 413;
RUN;

Launch and run  the program, so that we can work with the data sets. Note that the two data sets are linked by the variables id and v_date. The patients data set contains one observation for each of the three patients. And, the allvoids data set contains multiple observations for each of the four patients. Incidentally, in case you are curious, each observation in the allvoids data set corresponds to a patient's reported void of urine. For example, subject 110011 reported four voids on 01/01/06, subject 110012 reported three voids on 01/02/06, and so on.

Now, if we wanted to analyze the voiding data, we'd have to make sure that we didn't include any data from patients not included in the patients data set. That is, we'd want to exclude the voiding data corresponding to subject 110014. If we merge the patients and allvoids data sets by the id and v_date variables, we get an analysis data set that has one observation for each patient's reported void:

DATA analysis;
    merge patients allvoids;
    by id v_date;
RUN;
      
PROC PRINT data=analysis NOOBS; 
    title 'The analysis data set'; 
RUN;

The analysis data set

id

v_date

void_no

volume

110011

01/01/06

1

250

110011

01/01/06

2

300

110011

01/01/06

3

302

110011

01/01/06

4

231

110012

01/02/06

1

305

110012

01/02/06

2

225

110012

01/02/06

3

400

110013

01/04/06

1

300

110013

01/04/06

2

333

110013

01/04/06

3

401

110013

01/04/06

4

404

110014

01/06/06

1

398

110014

01/06/06

2

413

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the data sets were merged as described. You might also want to take particular note that this is the first example we've seen in which we have specified two variables in the BY statement. There is nothing that much more difficult about it. Other than making sure that both of the input data sets are sorted by both the id and v_date variables.

In reviewing the output, you should also note that we have not yet achieved what we set out to do, namely to create an analysis data set that contains only the voiding data for the patients appearing in the patients data set. That is, the analysis data set still includes the voiding data on subject 110014. The following code uses the IN= option and a subsetting IF statement to help us accomplish our task:

DATA analysis;
    merge patients (in = inpatients)
            allvoids (in = inallvoids);
    by id v_date;
    if inpatients and inallvoids;
RUN;
      
PROC PRINT data=analysis NOOBS; 
    title 'The analysis data set'; 
RUN;

The analysis data set

id

v_date

void_no

volume

110011

01/01/06

1

250

110011

01/01/06

2

300

110011

01/01/06

3

302

110011

01/01/06

4

231

110012

01/02/06

1

305

110012

01/02/06

2

225

110012

01/02/06

3

400

110013

01/04/06

1

300

110013

01/04/06

2

333

110013

01/04/06

3

401

110013

01/04/06

4

404

The IN = inpatients option tells SAS to assign a value of 1 to the inpatients variable when an observation from the patients data set contributes to the current observation. Likewise, the IN = inallvoids option tells SAS to assign a value of 1 to the inallvoids variable when an observation from the allvoids data set contributes to the current observation. The subsetting IF statement tells SAS to write only those observations to the analysis data set whose value for both inpatients and inallvoids is 1, that is, only those observations that were created from observations in both the patients and allvoids data sets. Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the analysis data set now contains only the voiding data for the patients appearing in the patients data set.


16.5 - Selecting Variables

16.5 - Selecting Variables

It is often the case that input data sets contain many more variables than you want in your final merged data set. Just as you can use the DROP= and KEEP= DATA step options to drop or keep a subset of variables when reading SAS data sets, you can use the DROP= and KEEP= options to drop or keep a subset of variables when merging SAS data sets. Let's take a look at an example!

Example 16.10

The following program prints a subset of the observations in the ICDB Study's background (back), pain and urgency (purg), and family history (fhx) data sets:

LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
PROC PRINT data = icdb.back (OBS=5) NOOBS;
        title 'The back data set';
RUN;       
PROC PRINT data = icdb.purg (OBS=5) NOOBS;
        title 'The purg data set';
                    where v_type = 0;
RUN;        
PROC PRINT data = icdb.fhx (OBS=5) NOOBS;
        title 'The fhx data set';
RUN;

The back data set

subj

v_type

v_date

r_id

b_date

sex

state

country

race

ethnic

relig

mar_st

ed_level

emp_st

job_chng

income

110027

0

10/05/93

2068

07/05/62

2

5

1

4

0

0

1

3

1

.

2

110029

0

10/05/93

2068

09/07/26

2

5

1

4

0

2

1

5

8

.

2

110039

0

12/07/93

2068

07/24/24

2

22

1

4

0

3

1

3

8

.

2

110040

0

11/30/93

2068

10/20/67

2

32

1

4

0

7

1

5

1

.

2

110045

0

01/11/94

2068

04/18/25

1

36

1

4

0

3

1

1

8

0

2


The purg data set

subj

v_type

v_date

r_id

purg_1

purg_2

110027

0

10/05/93

2068

8

8

110029

0

10/05/93

2068

8

8

110039

0

12/07/93

2068

2

3

110040

0

11/30/93

2068

3

5

110045

0

01/11/94

2068

2

2


The fhx data set

subj

v_type

v_date

i_id

fhx_1

fhx_2

fhx_3

fhx_4

fhx_5

fhx_6

fhx_7

fhx_8

fhx_9

fhx_10

fhx_11

fhx_12

fhx_13

fhx_14

110027

0

10/05/93

2068

0

0

0

.

0

1

1

1

0

0

.

0

1

0

110029

0

10/05/93

2068

0

0

0

.

1

1

8

0

8

0

.

0

0

8

110039

0

12/07/93

2068

1

0

0

.

0

0

0

0

0

0

.

0

0

1

110040

0

11/30/93

2068

0

0

8

.

8

8

1

1

8

0

.

1

8

1

110045

0

01/11/94

2068

0

0

1

0

0

0

1

1

0

1

0

0

0

1

In each case, the PRINT procedure's OBS=5 option tells SAS to print just the first five observations of each data set. The second PRINT procedure contains an extra direction to SAS to print only those observations for which the v_type variable equals 0. Before you can run the program, you'll need to right-click the data set links... back, purg, and fhx... so that you can save the data sets to a convenient location on your computer. After doing so, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data sets. Then, run  the program, and review the output so that you can familiarize yourself with each data set.

As you can see, the data sets share three variables — subj, v_type, and v_date, and then each data set has its own set of unique variables. Our interest in this example is to merge the three data sets into a new data set called icdbdata that contains the common subj variable, the b_date and sex variables from the back data set, the purg_1 and purg_2 variables from the purg data set, and the fhx_1 and fhx_2 variables from the fhx data set. That is, when all is said and done, this is what the first ten observations of the icdbdata set should look like:

subj

b_date

sex

purg_1

purg_2

fhx_1

fhx_2

110027

07/05/62

2

8

8

0

0

110029

09/07/26

2

8

8

0

0

110039

07/24/24

2

2

3

1

0

110040

10/20/67

2

3

5

0

0

110045

04/18/25

1

2

2

0

0

110049

10/05/23

2

3

3

1

0

110051

12/02/23

2

6

7

0

8

110052

1202/42

2

6

6

0

0

110053

03/15/22

2

7

7

0

0

110055

03/31/41

2

6

6

0

0

The following code does the trick for us:

LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
DATA icdbdata (drop = v_type);
   merge icdb.back (keep = subj b_date sex)
           icdb.purg (where = (v_type = 0)
                      keep = subj v_type purg_1 purg_2)
           icdb.fhx  (keep = subj fhx_1 fhx_2);
   by subj;
RUN;
PROC PRINT data = icdbdata (OBS = 10) NOOBS;
        title 'The icdbdata data set';
RUN;

The icdbdata data set

subj

b_date

sex

purg_1

purg_2

fhx_1

fhx_2

110027

07/05/62

2

8

8

0

0

110029

09/07/26

2

8

8

0

0

110039

07/24/24

2

2

3

1

0

110040

10/20/67

2

3

5

0

0

110045

04/18/25

1

2

2

0

0

110049

10/05/23

2

3

3

1

0

110051

12/02/42

2

6

7

0

8

110052

01/04/25

2

6

6

0

0

110053

03/15/22

2

7

7

0

0

110055

03/31/41

2

6

6

0

0

At a global level, the DATA step tells SAS to merge the (permanent) back, purg, and fhx data sets by the subj variable. The data set options give SAS more specific directions about what to do while match-merging the data sets. Specifically:

  • The KEEP= option attached to the icdb.back data set tells SAS to keep just three variables — subj, b_date, and sex — from the icdb.back data set.
  • The WHERE= option attached to the icdb.purg data set tells SAS to select the observations for which the v_type variable equals 0.
  • The KEEP= option attached to the icdb.purg data set tells SAS to keep four variables — subj, v_type, purg_1, and purg_2 — from the icdb.purg data set. Although we don't want the v_type variable in the final icdbdata data set, we still need to tell SAS to keep the variable just so we can use it to select the observations for which v_type equals 0.
  • The KEEP= option attached to the icdb.fhx data set tells SAS to keep just three variables — subj, fhx_1, and fhx_2 — from the icdb.purg data set.
  • Finally, the DROP= option attached to the icdbdata output data set tells SAS to drop the v_type variable from the output data set.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the icdbdata data set contains only the variables and observations that we desired.


16.6 - Summary

16.6 - Summary

In this lesson, we learned how to match-merge two or more SAS data sets.

The homework for this lesson will give you practice with this technique.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility