Lesson 2: Reading Data into a SAS Data Set - Part ILesson 2: Reading Data into a SAS Data Set - Part I
In order to be able to analyze our data, we need to be able to read it into a data set that our SAS software understands. In this lesson, we learn methods that will work in many situations to get your data read into a SAS data set. Keep in mind though that when reading data into a SAS data set, you always need to tell SAS three things:
- where your data reside — Are they instream data, that is, embedded within your SAS program? Are they contained in a raw data file external to your SAS program? Or are they contained already in another SAS data set?
- the format of the data — Are the data values arranged in neatly defined columns so that they can be read in using column input? Are the data values separated by at least one blank space so that they can be read in using list input? Do the data values contain special characters so that they must be read in using formatted input?
- the kind of SAS data set that you want to create — Do you want to create a permanent SAS data set? Or do you want to create a temporary SAS data set?
In this lesson, we'll learn how to tackle all but two of the above situations. That is, we'll investigate how to read instream data, how to read data from an external raw data file, how to read data from another SAS data set. We'll learn how to use column input to read in data values arranged in neatly defined columns. And, we'll learn how to create both temporary and permanent SAS data sets. In the next lesson, we'll investigate how to use formatted input and list input to read data into our SAS data sets.
- read instream data into SAS using a DATALINES or CARDS statement
- read data into SAS from an external raw data file using an INFILE statement
- use the SET statement to read data into a SAS data set from other SAS data sets
- know how to create temporary SAS data sets
- use the LIBNAME statement to create a new SAS data library
- know how to create permanent SAS data sets
- be able to read data into SAS using column input
2.1 - Reading Instream Data2.1 - Reading Instream Data
Although the title of this section is reading instream data, it is hard to focus on just one method of reading data into SAS. As discussed in the introduction to this lesson, every time we read data into a SAS data set, we need to tell SAS three things — where our data reside, the form of the data, and the kind of SAS data set that we want to create. Let's jump right in and take a look at an example.
The following SAS program illustrates how to create a temporary SAS data set called temp1 to read instream data using column input:
DATA temp1; input subj 1-4 gender 6 height 8-9 weight 11-13; DATALINES; 1024 1 65 125 1167 1 68 140 1168 2 68 190 1201 2 72 190 1302 1 63 115 ; RUN; PROC PRINT data=temp1; title 'Output dataset: TEMP1'; RUN;
It would probably be most helpful to you if you started by inspecting the SAS code. The key things to note about the program are:
The DATALINES statement is the statement that you must use to tell SAS to expect instream data. The DATALINES statement:
- Must be the last statement to appear in the DATA step (that is, except for the RUN statement)
- Must immediately precede the data lines
- Must be closed by a null statement (that is, a single semicolon). Only one DATALINES statement can appear in a DATA step
The INPUT statement is the statement that you must use to tell SAS the form of the data. Here, we use what is called column input, because the data values are:
Recall that standard numeric data values can contain only numbers, decimal points, numbers in scientific notation (e.g., 3.1E5), and plus or minus signs.
In general, for each field of raw data that you want to read into your SAS data set, you must specify the following information in the INPUT statement:
If you intend for the variable to be a character variable, place one blank space and then a dollar sign ($) right after the variable's name in the INPUT statement. None of the variables in our data set are character variables, and therefore no dollar signs appear in the INPUT statement in our program. As our INPUT statement informs SAS, the subject number (subj) begins in column 1 and ends in column 4, gender occupies just column 6, the subject's height begins in column 8 and ends in column 9, and the subject's weight begins in column 11 and ends in column 13. You might want to count the columns out from left to right to convince yourself that we've defined the fields correctly.
- standard character or numeric values, and
- arranged in neatly defined columns.
- a valid SAS variable name,
- a type (character or numeric),
- and the number of the column in which the field starts and the number of the column in which the field ends, separated by a dash (-).
The DATA statement is the statement that you must use to tell SAS whether the data set that you intend to create should be temporary or permanent. We'll learn more about temporary and permanent data sets in the lesson pages that follow. Know for now .... errrrr, that is, trust me? ... that the above DATA statement tells SAS to create a temporary data set called temp1. Okay, I'll let the cat out of the bag a little bit ... the DATA statement tells SAS that temp1 should be treated as temporary by specifying what is called a one-level name, such as temp1, rather than a two-level name, such as stat480.temp1. Okay, best to stop there. The key thing for now is to know that, because temp1 is a temporary data set, it exists only until the end of your current SAS session. That is, once you close out your SAS session, the SAS data set is removed from memory, and would have to be created again if you needed to use it again.
Okay, enough explaining! Let's go ahead and have you launch and run the SAS program. Then, as always, view the log window first to see if SAS displays any errors from running the code. Then, view the output window. You should see a display of the data set that arises from the PRINT procedure in our code:
Output dataset: TEMP1
Note that the CARDS statement is an alias for the DATALINES statement. That is, we could have alternatively entered the data by replacing the "DATALINES;" statement with a "CARDS;" statement. In your program editor, replace "DATALINES;" with "CARDS;" and rerun your program to convince yourself that this is indeed true.
One more thing ... if any of your data values contain semicolons, the DATALINES statement will not work. Instead you must replace the DATALINES statement with a DATALINES4 statement, and the null statement with a single semicolon (;) with a null statement containing four semicolons (;;;;). Strange, I know.
2.2 - SAS Data Libraries2.2 - SAS Data Libraries
Before we can really get a handle on the distinction between temporary and permanent SAS data sets, we need to understand what SAS refers to as SAS data libraries. In short, a SAS library is simply a collection of SAS files that are stored in the same folder or directory on your computer. Other files can be stored in the same folder or directory, but only the files that have SAS file extensions are recognized as part of the SAS library.
Depending on the library name that you use when you create a SAS file, such as a SAS data set, SAS files are stored either temporarily or permanently as follows:
- When creating a SAS file, if you use the library name Work or you don't specify a library name at all, then the file is stored in a temporary SAS library called Work. When you close out the SAS session in which you created the SAS file, the temporary library and all of its files are removed from your computer's memory.
- If you use a library name other than the default library name Work when creating a SAS file, then the file is stored permanently until you delete it. That is, you can use permanent SAS libraries and their contents in subsequent SAS sessions.
Referencing Permanent SAS Files
Regardless of whether a SAS data set is temporary or permanent, SAS always refers to the data set by a two-level name:
In the two-level name, libref is the (nick)name that you gave the SAS data library that contains the SAS data set, and filename is the name of the file itself. For example:
in order to print the permanent data set called Back that is stored in the SAS library called Stat480, we have to refer to the permanently stored SAS data set as:
as in the following code:
PROC print data = Stat480.Back; RUN;
In this example, we know the SAS data set called Back is permanent, because the SAS library name is Stat480, not Work.
Referencing Temporary SAS Files
As stated previously, regardless of whether a SAS data set is temporary or permanent, SAS always refers to the data set by a two-level name:
Now, we know that the libref of a temporary data set is always Work. Therefore, we can refer to any SAS data set stored temporarily in Work by:
where filename is the name of the file itself. For example:
in order to print the temporary SAS data set called Temp1 that is stored in the default temporary SAS library called Work, we can refer to the temporarily stored SAS data set as:
as in the following code:
PROC PRINT data = work.temp1; RUN;
Now, although SAS always refers to SAS data sets by their two-level names, it doesn't mean you have to do the same! In this case, SAS lets you take a shortcut by using just the one-level name:
When you specify a one-level name, the default libref Work is assumed. For example, rather than using the previous code to print the temporary SAS data set called Temp1, we could use the following code:
PROC PRINT data = Temp1; RUN;
We have just one more thing to tend to before we complete our discussion of SAS libraries, and that is how to define a library. To do so, we use a LIBNAME statement to tell SAS to associate a a name — called a libref — to one of the folders or directories on your computer that contains SAS files. For example, I have a subdirectory on my computer:
that contains three SAS data sets (back, log, and qul):
To tell SAS to associate the libref Stat480 with this subdirectory, we simply use a
LIBNAME statement as follows:
LIBNAME Stat480 'C:\Simon\Stat480WC\sp08\04dtatstep\sasdata';
In general, the libref can be a nickname of your choosing, as long as it is between 1 and 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores. The specification of the physical name of the library, of course, adheres to the conventions of the Windows operating system. You can use as many LIBNAME statements in a program as necessary to assign the librefs that you need.
Now, after submitting our LIBNAME statement, we can (and should!) look at the log window to verify that SAS assigned the libref successfully. If all went well, we should see a message similar to this in this log window:
32 33 LIBNAME Stat480 'C:\Simon\Stat480WC\sp08\04datastep\sasndata'; NOTE: Libref STAT 480 was successfully assigned as follows: Engine: V9 Physical Name: C:\Simon\Stat480WC\sp08\04datastep\sasndata 34
Because the LIBNAME statement is a global statement, our Stat480 libref remains in effect until we have modified it, canceled it, or ended the SAS session. That is, the LIBNAME statement assigns the libref for the current SAS session only. Each time you begin a SAS session, you must assign a libref to each permanent SAS data library that contains files that you want to access in that SAS session.
Phewww ... now that we've completed our digression on SAS libraries, let's revisit Example 2.1 on the previous page and create a permanent SAS data set rather than a temporary one.
2.3 - Reading Data into Permanent SAS Data Sets2.3 - Reading Data into Permanent SAS Data Sets
Rather than having created a temporary data set in Example 2.1 that disappears at the end of our SAS session, we could have just as easily created a permanent data set that instead gets stored in a directory of our choosing and is therefore available to us permanently. Let's try that!
The following SAS program illustrates how to create a permanent SAS data set called stat480.temp2 to read instream data using column input:
The following SAS program illustrates the simplest example of column input.
LIBNAME stat480 'C:\stat480\data\'; *Specifies the SAS data library (directory); DATA stat480.temp2; input subj 1-4 gender 6 height 8-9 weight 11-13; DATALINES; 1024 1 65 125 1167 1 68 140 1168 2 68 190 1201 2 72 190 1302 1 63 115 ; RUN; PROC PRINT data=stat480.temp2; title 'Output dataset: STAT480.TEMP2'; RUN;
Note that the only differences between this program and the program in Example 2.1 is that:
- there is now a LIBNAME statement. This is how we tell SAS to which of our directories we want the data permanently written. Here, the SAS data set is written to our C:\stat480\data directory.
- the name of the data set appearing in the DATA statement is now a two-level name. This is how SAS knows to put the permanent dataset temp2 in our C:\stat480\data directory.
- in subsequent procedures, such as the PRINT procedure here, we refer to the permanent data set by its two-level name.
Launch and run the SAS program. Don't forget that you must make sure that you have already created the C:\stat480\data subdirectory on your computer (or at least some other appropriately placed and named subdirectory). After running the program, note that:
- the contents and structure of stat480.temp2 is identical to the contents and structure of work.temp1. The only thing that differs between the two datasets is that temp1 is temporary and temp2 is permanent.
- in the Explorer Window, the dataset temp2 appears in the stat480 library indicating that the data set is permanent.
- using your Windows Explorer, you can see that the SAS data set is permanently stored in your C:\stat480\data directory.
2.4 - Reading From a Raw Data File2.4 - Reading From a Raw Data File
Thus far, we've only looked at examples in which we've read instream data into SAS data sets. Now, let's direct our attention to learning how to read data from a raw data file into SAS data sets.
A raw data file is an external text file whose records contain data values that are arranged in fields. Typical filename extensions of raw data files are .dat and .txt. A raw data file (also commonly called an ASCII file) is the kind of data file that you would view using your Notepad or Wordpad software. Data sets stored in spreadsheets, such as Microsoft's Excel, are binary, not raw ASCII data files.
The following SAS program illustrates how to create a temporary SAS data set called temp3 by using column input to read in data stored in a raw data file called temp3.dat:
The following SAS program illustrates the simplest example of column input.
DATA temp3; infile 'C:\stat480\data\temp3.dat'; input subj 1-4 gender 6 height 8-9 weight 11-13; RUN; PROC PRINT data=temp3; title 'Output dataset: TEMP3'; RUN;
Notice that the INFILE statement, which must precede the INPUT statement, merely replaces the DATALINES statement and the data that appeared in the previous two examples. The INFILE statement tells SAS where the raw data file is stored on your computer. The name and location of the raw data file must appear in single quotes and immediately follow the INFILE keyword. In this case, our raw data file temp3.dat is stored in the directory "C:\stat480\data\". As you can see, the data values are the only items stored in the file.
In order to run this program, first save the temp3.dat file to a convenient location on your computer. Then, edit the INFILE statement as necessary to reflect the correct location. Finally, launch and run the SAS program. Review the log and output windows to convince yourself that the data were properly read into the temporary data set temp3.
Instead of identifying the raw data file by specifying the entire filename and location in the INFILE statement, we can alternatively use what is called a fileref (for file reference). Just as we use a LIBNAME statement to assign a libref, we use a FILENAME statement to assign a filref. Filerefs perform the same function as librefs. That is, they temporarily point to a storage location for data. However, librefs point to SAS data libraries, whereas filerefs point to external data files.
The following SAS program illustrates the use of a fileref in the INFILE statement, in conjunction with a FILENAME statement, to read data stored in a raw data file called temp3.dat to create a temporary SAS data set called temp4:
The following SAS program illustrates the simplest example of column input.
FILENAME patients 'C:\stat480\data\temp3.dat'; DATA temp4; infile patients; input subj 1-4 gender 6 height 8-9 weight 11-13; RUN; PROC PRINT data = temp4; title 'Output dataset: TEMP4'; RUN;
The FILENAME statement in our program assigns the fileref patients to our temp3.dat file stored in our C:\stat480\data folder. In general, the fileref can be a nickname of our choosing, as long as it is between 1 and 8 characters long, begins with a letter or underscore, and contains only letters, numbers, or underscores. The specification of the physical name of the file, of course, adheres to the conventions of the Windows operating system.
If you haven't already done so for the previous example, save the temp3.dat file to a convenient location on your computer. Then, edit the FILENAME statement as necessary to reflect the correct location. Finally, launch and run the SAS program. Review the log and output windows to convince yourself that the data were properly read into the temporary data set temp4.
2.5 - Reading Column Input2.5 - Reading Column Input
As mentioned in the introduction to this lesson, there are three different styles of input that are available to us in SAS. They are:
- column input, which is the most commonly used style, allows you to read data values that are entered in fixed columns.
- list input, which allows you to read data by simply listing the variable names in the INPUT statement. At least one space (or character) must occur between each value in the data set.
- formatted input, which allows you to read numeric data containing special characters, such as dates and dollar amounts.
In this section, we will take a look at two simple examples of column input. In the next lesson, we will spend some time investigating list input and formatted input.
A couple of comments. For the sake of the examples that follow, we'll use the DATALINES statement to read in data. We could have just as easily used the INFILE statement to illustrate each point. Additionally, we'll create temporary data sets rather than permanent ones, even though we could have just as easily created permanent data sets to illustrate each point. Finally, after each SAS DATA step, we'll use the SAS print procedure (PROC PRINT) to print the resulting SAS data set for your perusal.
Column input allows you to read variable values that occupy the same columns within each record. To use column input, list the variable names in the INPUT statement, immediately following each variable name with its corresponding column positions in each of the data lines. (Of course, you'll need to follow each character variable with a dollar sign ($) first.) Column input can be used whenever your raw data are in fixed columns and in standard character or numeric format. Column input reads data values until it reaches the last specified column for the field.
The important points to note about column input are:
- When using column input, you are not required to indicate missing values with a placeholder, such as a period. That is, missing values can be left as blank.
- Column input uses the columns specified to determine the length of character variables, thereby allowing the character values to exceed the default 8 characters and to have embedded spaces.
- Column input allows fields to be skipped altogether or to be read in any order.
- Column input allows only part of a value to be read and allows values to be re-read.
- Spaces are not required between the data values.
The following SAS program illustrates the simplest example of column input.
DATA temp; input subj 1-4 name $ 6-23 gender 25 height 27-28 weight 30-32; CARDS; 1024 Alice Smith 1 65 125 1167 Maryann White 1 68 140 1168 Thomas Jones 2 68 190 1201 Benedictine Arnold 2 68 190 1302 Felicia Ho 1 63 115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
Output dataset: TEMP
First, inspect the SAS code to make sure you understand how to set up the INPUT statement for column input.
Then, launch and run the SAS program.
Finally, review the output (click onfrom the print procedure to convince yourself that the data are read in properly.
The following SAS program illustrates some of the key features of column input:
DATA temp; input init $ 6 f_name $ 6-16 l_name $ 18-23 weight 30-32 height 27-28; CARDS; 1024 Alice Smith 1 65 125 1167 Maryann White 1 68 140 1168 Thomas Jones 2 190 1201 Benedictine Arnold 2 68 190 1302 Felicia Ho 1 63 115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
First, inspect the SAS code so that you can become familiar with some of the features of column input. Then, launch and run the SAS program.
Review the output (click on ) from the print procedure to convince yourself that the data are read in properly. Note that the position of the variables within the temporary data set temp corresponds to the order in which the variables appear in the input statement, not the order in which the variables appear in the data set.
2.6 - Reading Data in From Other SAS Data Sets2.6 - Reading Data in From Other SAS Data Sets
In terms of telling SAS where your input data reside, we've investigated how to read instream data into a SAS data set. We've also investigated how to read data contained in an external raw data file into a SAS data set. Now, we'll investigate how to read data already contained in one SAS data set into another SAS data set.
The following SAS program creates a temporary SAS data set called work.temp, which is identical to the permanent SAS data set called stat480.temp2:
LIBNAME stat480 'C:\stat480\data'; DATA temp; set stat480.temp2; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
In this case:
- because a one-level name is used, the DATA statement tells SAS to create a temporary dataset called temp.
- the SET statement tells SAS to assign the data in the existing permanent SAS data set stat480.temp2 — observation by observation — to the temporary temp data set appearing in the DATA statement. Because the variables in the existing data set have already been named, no input statement is necessary. That is, the SET statement indicates that the data being read are already in the structure of a SAS data set, and therefore merely gives the name of the data set.
Note that before this program will work, you will have already had to create the permanent SAS data set stat480.temp2 in Example 2.2 of this lesson. Launch the SAS program. If stat480.temp2 is not stored in a directory called C:\stat480\data, you will need to edit the LIBNAME statement so it reflects the location that stat480.temp2 is stored on your computer. After convincing yourself that you've made any necessary changes, run the SAS program and review the output. Note that the structure and contents of the new temporary data set temp are identical to the permanent SAS data set stat480.temp2.
2.7 - Summary2.7 - Summary
In this lesson, we investigated various methods for reading data into a SAS data set. One of the key things to remember from this lesson is that every time you read data into a SAS data set, you need to tell SAS three things:
- Where your data reside
- The format of your data
- Whether you want to create a temporary or permanent SAS data set
- If you are reading data instream, use a DATALINES statement.
- If you are reading data from a raw data file, use an INFILE statement.
- If you are reading data from another SAS data set, use a SET statement.
- Use an INPUT statement containing column numbers to read data arranged in neatly defined columns.
- Use a two-level name in the DATA statement if you want to create a permanent SAS data set.
- Use a one-level name in the DATA statement if you want to create a temporary SAS data set.
Please be aware that there are other ways of getting your data into SAS that we won't learn in this course. For example, you could use the VIEWTABLE feature, which is similar to entering your data into an Excel spreadsheet. It is simple enough, but we won't investigate that approach here. One of the primary purposes of this course is to learn how to use the SAS programming language, something we can't do when using the VIEWTABLE feature. At least now that you know the VIEWTABLE feature exists, you can investigate it on your own.
We also won't learn how to use the IMPORT procedure to import data directly from other software, such as Excel and Access, into SAS data sets. Doing so, requires having the SAS/ACCESS module, i.e. more than the Base SAS software, at your disposal. Instead, we'll proceed assuming that data stored in other software, such as an Excel spreadsheet, can be read into SAS first by exporting it to a raw data file and then reading the data from the raw file into a SAS data set.
In the next lesson, we will investigate the two styles of input that we didn't do much with in this lesson, namely list input and formatted input. Now, it is time to work on some homework problems to give you a chance to put what you have learned in this lesson to good use.