15.3 - Concatenating Two or More Data Sets

To concatenate two or more SAS data sets means to stack one "on top" of the other into a single SAS data set. For example, suppose the data set store1 contains three variables, store (number), day (of the week), and sales (in dollars):

StoreDaySale
1M1200
1T1435
1W1712
1R1529
1F1920
1S2325

and the data set store2 contains the same three variables:

StoreDaySales
2M2215
2T2458
2W1789
2R1692
2F2105
2S2847

Then, when we concatenate the two data sets, we get what I like to call a "tall" data set:

StoreDaySales
1M1200
1T1435
1W1712
1R1529
1F1920
1S2325
2M2215
2T2458
2W1789
2R1692
2F2105
2S2847

in which the data sets are stacked on top of each other. Note that the number of observations in the new data set is the sum of the numbers of observations in the original data sets. To concatenate SAS data sets, you simplify specify a list of data set names in one SET statement.

Example 15.6 Section

The following program concatenates the store1 and store2 data sets to create a new "tall" data set called bothstores:

DATA store1;
    input Store Day $ Sales;
    DATALINES;
    1 M 1200
    1 T 1435
    1 W 1712
    1 R 1529
    1 F 1920
    1 S 2325
    ;
RUN;

DATA store2;
    input Store Day $ Sales;
    DATALINES;
    2 M 2215
    2 T 2458
    2 W 1798
    2 R 1692
    2 F 2105
    2 S 2847
    ;
RUN;

DATA bothstores;
    set store1 store2;
RUN;

PROC PRINT data = bothstores NOOBS;
    title 'The bothstores data set';
RUN;

The bothstores data set
StoreDaySales
1M1200
1T1435
1W1712
1R1529
1F1920
1S2325
2M2215
2T2458
2W1798
2R1692
2F2105
2S2847

Note that the input data sets — store1 and store2 — contain the same variables — Store, Day, and Sales — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called bothstores, and the SET statement tells SAS that the data set should contain first the observations from store1 and then the observations from store2. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the store1 and store2 data sets to make one "tall" data set called bothstores. You might then want to edit the SET statement so that store1 follows store2, and re-run  the SAS program to see that then the contents of store1 follow the contents of store2 in the bothstores data set.

In general, a data set that is created by concatenating data sets contains all of the variables and all of the observations from all of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the number of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.

Example 15.7 Section

The following program concatenates the one and two data sets to create a new "tall" data set called onetopstwo:

DATA one;
    input ID VarA $ VarB $;
    DATALINES;
    10 A1 B1
    20 A2 B2
    30 A3 B3
    ;
RUN;
 
DATA two;
    input ID VarB $ VarC $;
    DATALINES;
    40 B4 C1
    50 B5 C2
    ;
RUN;
 
DATA onetopstwo;
    set one two;
RUN;
 
PROC PRINT data = onetopstwo NOOBS;
    title 'The onetopstwo data set';
RUN;

The onetopstwo data set
IDVarAVarBVarC
10A1B1 
20A2B2 
30A3B3 
40 B4C1
50 B5C2

As you review the first two DATA steps, in which SAS reads in the respective one and two data sets, note that the total number of unique variables is four — ID, VarA, VarB, and VarC. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set onetopstwo to contain four variables and five observations. Launch and run  the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the one data set and then all of the variables and all of the observations from the two data set. As you can see, to make it all work out okay, observations arising from the one data set have missing values for VarC, and observations from the two data set have missing values for VarA.

When Variable Attributes Differ

As you know, variable attributes include the type of variable (character vs. numeric), the informat (how the variable is read in) and format (how its values are printed) of a variable, the length of the variable, and the label (how its variable name is printed) of a variable. Concatenating data sets when variable attributes differ across the input data sets may pose problems for SAS (and therefore you):

  • If the data sets you name in the SET statement contain variables with the same names and types, you can concatenate the data sets without modification.
  • If the variable types differ, you must modify one or more of the data sets before concatenating them. SAS will not concatenate the data sets until you do.
  • If the lengths, formats, informats, or labels differ, you may want to modify one or more of the data sets before concatenating them. SAS will concatenate the data sets; you may just not like the results.

Example 15.8 Section

If a variable is defined as numeric in one data set named in the SET statement and as a character in another data set, SAS issues an error message and will not concatenate the data sets. The following program attempts to concatenate the store3 and store4 data sets when the Store variable is defined as a character in the store3 data set, but as numeric in the store4 data set:

DATA store3;
	input Store $ 1 Day $ 3 Sales 5-8;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
	;
RUN;
 
DATA store4;
	input Store 1 Day $ 3 Sales 5-8;
    DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
	;
RUN;
 
DATA bothstores2;
	set store3 store4;
RUN;
 
PROC PRINT data = bothstores2 NOOBS;
	title 'The bothstores2 data set';
RUN;

 

Launch and run  the SAS program, and review the log window. You can see that SAS balks at the idea of concatenating the two data sets and gives us a "dead in the water" message:

DATA bothstores2;
      set store3 store4;
ERROR: Variable Store has been defined as both character and numeric.
RUN;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.BOTHSTORES2 may be incomplete.  When this step was stopped there were 0 
         observations and 3 variables.
WARNING: Data set WORK.BOTHSTORES2 was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

In order to concatenate the store3 and store4 data sets successfully, we need to change either the Store character variable in the store3 data set to numeric or the Store numeric variable in the store4 data set to character. As you know from our work in Stat 480, to perform an explicit character-to-numeric conversion of the Store variable in the store3 data set, we'd have to use the INPUT function. Alternatively, we could use the PUT function to perform an explicit numeric-to-character conversion of the Store variable in the store4 data set. That's what the following DATA step does:

DATA store4 (rename = (Store2 = Store));
    set store4;
    Store2 = put(Store,1.);
    drop Store;
RUN;
PROC CONTENTS data = store3;
    title 'Contents of store3';
RUN;
PROC CONTENTS data = store4;
    title 'Contents of store4';
RUN;

 

Contents of store3
Alphabetic List of Variables and Attributes
#VariableTypeLen
2DayChar1
3SalesNum8
1StoreChar1
Contents of store4
Alphabetic List of Variables and Attributes
#VariableTypeLen
1DayChar1
2SalesNum8
3StoreChar1

If you launch and run  the SAS program, and review the output from the CONTENTS procedures, you can see that now the Store variable in the store3 data set is defined as a character, as is the Store variable in the store4 data set.

How does the conversion work? Well, of course, it's all because of the PUT function. The general form of the PUT function is:

PUT(source,format)

where source is a numeric variable, constant or expression to be converted to a character value, and format is a valid format that matches the data type of the source variable. In our example, Store is the numeric variable that we want to convert to a character variable, and since the values for Store are one-digit numbers, the numeric format 1. is an appropriate format to specify.

The assignment statement in our program:

Store2 = put(Store, 1.);

thus tells SAS to convert the numeric values in Store to character values and to store the results in a character variable called Store2. A little detail issue here ... if we do not use a different variable name than Store on the left side of the equation, SAS will be forced to do an automatic numeric-to-character conversion rather than the explicit numeric-to-character conversion that we desire. The one other statement in the DATA step, as well as the RENAME= data set option, is necessary just because of this naming issue. The DROP statement tells SAS to drop the Store variable, since we now have the desired character variable Store2. And, the RENAME= option tells SAS to rename the Store2 variable back to Store, just so it matches the original name of the variable.

Now that we've seen both the PUT function and INPUT function in action, it pays to summarize:

  • To perform an explicit numeric-to-character conversion, use the (source, format) with a numeric source and a numeric format.
  • To perform an explicit character-to-numeric conversion, use the INPUT(source, informat) function with a character source and a numeric informat.

To remember which function requires a format versus an informat, note that the INPUT function requires an informat.

Ahhh... but back to the task on hand. Finally, the following program allows us to accomplish our original goal of concatenating the store3 and store4 data sets:

DATA bothstores2;
    set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
    title 'The bothstores2 data set';
RUN;

The bothstores2 data set
StoreDaySales
1M1200
1T1435
1W1712
1R1529
1F1920
1S2325
2M2215
2T2458
2W1798
2R1692
2F2105
2S2847

Simple enough! Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that SAS successfully concatenated store3 and store4 into a data set called bothstores2.

Now that we've taken a look at what we need to do when we want to concatenate data sets containing the same variables but with different variable types, let's address a few more attributes. When data sets are concatenated with the SET statement, the formats, informats, and labels associated with the variables in the new data set are determined by the following rule:

If two or more data sets define different formats, informats, or labels for the same variable, the variable in the new data set takes the attribute from the first data set in the SET statement that contains the variable with the attribute.

Example 15.9 Section

The following program creates two data sets — store5 and store6 — that intentionally contain different labels and different formats for the Sales variable:

DATA store5;
	input Store 1 Day $ 3 Sales 5-8;
    format Sales comma5.;
    label Sales = 'Total Sales';
    DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
    ;
RUN;
DATA store6;
    input Store 1 Day $ 3 Sales 5-8;
    format Sales dollar6.;
    label Sales = 'Sales for Day';
    DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
    ;
RUN;
PROC CONTENTS data = store5;
    title 'Contents of the store5 data set';
RUN;
PROC CONTENTS data = store6;
    title 'Contents of the store6 data set';
RUN;

Contents of the store5 data set
Alphabetic List of Variables and Attributes
#VariableTypeLenFormatLabel
2DayChar1  
3SalesNum8COMMA5.Total Sales
1StoreNum8  
Contents of the store6 data set
Alphabetic List of Variables and Attributes
#VariableTypeLenFormatLabel
2DayChar1  
3SalesNum8DOLLAR6.Sales for Day
1StoreNum8  

As you can see, the first DATA step tells SAS to set the format and label for the Sales variable as comma5. and Total Sales, respectively. And, the second DATA step tells SAS to set the format and label for the Sales variable as dollar6. and Sales for Day, respectively. Launch and run  the SAS program, and review the output from the CONTENTS procedures to convince yourself that the formats and labels for the two versions of the Sales variable have been assigned as described.

Now, the following program tells SAS to store first the observations from the store5 data set, followed by the observations from the store6 data set, into a new data set called bothstores3:

DATA bothstores3;
    set store5 store6;
RUN;
 
PROC PRINT data = bothstores3 NOOBS LABEL;
    title 'The bothstores3 data set';
RUN;

The bothstores3 data set
StoreDayTotal Sales
1M1,200
1T1,435
1W1,712
1R1,529
1F1,920
1S2,325
2M2,215
2T2,458
2W1,798
2R1,692
2F2,105
2S2,847

Recall the rule that if two data sets explicitly define different formats and labels for the same variable, then the variable in the new data set takes the attribute from the first data set in the SET statement that explicitly defines the attribute. That said, we should expect the Sales variable in the bothstores3 data set to take on the attributes of Sales in the store5 data set. Specifically, Sales should be formatted as comma5. and labeled as Total Sales. Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the store5 and store6 data sets. That is, the program tells SAS to store first the observations from the store6 data set, followed by the observations from the store5 data set, into a new data set called bothstores4:

DATA bothstores4;
    set store6 store5;
RUN;
 
PROC PRINT data = bothstores4 NOOBS LABEL;
    title 'The bothstores4 data set';
RUN;

The bothstores4 data set
StoreDaySales for Day
2M$2,215
2T$2,458
2W$1,798
2R$1,692
2F$2,105
2S$2,847
1M$1,200
1T$1,435
1W$1,712
1R$1,529
1F$1,920
1S$2,325

According to the rule, we should expect the Sales variable in the bothstores4 data set to take on the attributes of Sales in the store6 data set. Specifically, Sales should be formatted as dollar6. and labeled as Sales for Day. Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

And, now onto one last attribute, namely that of variable lengths. When the SET statement is used to concatenate data sets in which the same variable is assigned different lengths, the lengths of the variables in the new data set are determined by the following rule:

If two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement.

Example 15.10 Section

The following program creates two data sets — store7 and store8 — that intentionally contain different lengths for the numeric variable Store and the character variable Day:

DATA store7;
    length Store 4;
    input Store 1 Day $ 3-5 Sales 7-10;
    DATALINES;
1 Mon 1200
1 Tue 1435
1 Wed 1712
1 Thu 1529
1 Fri 1920
1 Sat 2325
    ;
RUN;
DATA store8;
    input Store 1 Day $ 3 Sales 5-8;
    DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
    ;
RUN;
PROC CONTENTS data = store7;
    title 'Contents of the store7 data set';
RUN;
PROC CONTENTS data = store8;
    title 'Contents of the store8 data set';
RUN;

Contents of the store7 data set
Alphabetic List of Variables and Attributes
#VariableTypeLen
2DayChar3
3SalesNum8
1StoreNum4
Contents of the store8 data set
Alphabetic List of Variables and Attributes
#VariableTypeLen
2DayChar1
3SalesNum8
1StoreNum8

As you can see, the LENGTH statement in the first DATA step tells SAS to set the length for the numeric variable Store to 4 in the store7 data set. Since the character variable Day is not explicitly mentioned in a LENGTH statement, SAS sets the length of Day in the store7 data set to 3, the number of columns specified in the INPUT statement. And, since neither the numeric variable Store nor the character variable Day are explicitly mentioned in a LENGTH statement in the second DATA step, SAS sets the length of Store in store8 to 8 by default, and the length of Day to 1, the number of columns specified in the INPUT statement. Launch and run  the SAS program, and review the output from the CONTENTS procedures to convince yourself that the lengths for the two versions of the Store and Day variables have been assigned as described.

Now, the following program tells SAS to store first the observations from the store7 data set, followed by the observations from the store8 data set, into a new data set called bothstores5:

DATA bothstores5;
    set store7 store8;
RUN;
 
PROC PRINT data = bothstores5 NOOBS LABEL;
    title 'The bothstores5 data set';
RUN;
 
PROC CONTENTS data = bothstores5;
    title '
Contents of the bothstores5 data set';
RUN;

The bothstores5 data set
StoreDaySales
1Mon1200
1Tue1435
1Wed1712
1Thu1529
1Fri1920
1Sat2325
2M2215
2T2458
2W1798
2R1692
2F2105
2S2847
Contents of the bothstores5 data set
Alphabetic List of Variables and Attributes
#VariableTypeLen
2DayChar3
3SalesNum8
1StoreNum4

Recall the rule that if two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement. That said, we should expect the Store variable in the bothstores5 data set to take on the attributes of Store in the store7 data set. Specifically, Store should be assigned a length of 4. Likewise, we should expect the Day variable in the bothstores5 data set to take on the attributes of Day in the store7 data set. Specifically, Day should be assigned a length of 3. Launch and run  the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the store7 and store8 data sets. That is, the program tells SAS to store first the observations from the store8 data set, followed by the observations from the store7 data set, into a new data set called bothstores6:

DATA bothstores6;
    set store8 store7;
RUN;
 
PROC PRINT data = bothstores6 NOOBS LABEL;
    title 'The bothstores6 data set';
RUN;
 
PROC CONTENTS data = bothstores6;
    title '
Contents of the bothstores6 data set';
RUN;

The bothstores6 data set
StoreDaySales
2M2215
2T2458
2W1798
2R1692
2F2105
2S2847
1M1200
1T1435
1W1712
1T1529
1F1920
1S2325
Contents of the bothstores6 data set
Alphabetic List of Variables and Attributes
#VariableTypeLen
2DayChar1
3SalesNum8
1StoreNum8

According to the rule, we should expect the Store variable in the bothstores6 data set to take on the attributes of Store in the store8 data set. Specifically, Store should be assigned a length of 8. Likewise, we should expect the Day variable in the bothstores6 data set to take on the attributes of Day in the store8 data set. Specifically, Day should be assigned a length of 1. Launch and run  the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected. You should note in particular how SAS is forced to truncate the values of the Day variable in order to get them to fit within their assigned length of 1.