Lesson 3: Reading Data into a SAS Data Set - Part IILesson 3: Reading Data into a SAS Data Set - Part II
As the title suggests, we'll spend some time in this lesson extending our knowledge of how to read data into a SAS data set. In the last lesson, we focused on column input. In this lesson, we'll investigate two other styles of input, namely list input and formatted input.
Along the way, we'll digress here and there to address a few topics that are best to be exposed to early on in your SAS education. In that vein, we'll learn:
- How to use the CONTENTS procedure to view the contents of SAS libraries
- How to use the SAS on-line Help and Documentation facilities
- How to use the OPTIONS statement to alter SAS system options that control the appearance of your output
- use the CONTENTS procedure to view the contents of a SAS library
- use the CONTENTS procedure to view the descriptor portion of a SAS data set
- be able to read data into SAS using either list input or formatted input
- understand that formatted input does not imply formatted output, and in so doing know the basic FORMAT statement
- be able to affect the appearance of your output using the OPTIONS statement
- access the electronic index in the SAS Help and Documentation that comes with the SAS software
3.1 - Viewing the Contents of SAS Libraries3.1 - Viewing the Contents of SAS Libraries
Before investigating other methods for reading data into a SAS data set, let's digress for a few minutes to learn how to use the CONTENTS procedure to view the contents of SAS libraries.
The CONTENTS Procedure
The CONTENTS procedure allows us to create SAS output that describes either the contents of a SAS library or the descriptor information for an individual SAS data set. In order to view the contents of a SAS library, we can use the following general form of the procedure:
PROC CONTENTS data = libref._ALL_ NODS; RUN;
- libref is the libref that you assigned to the library
- the _ALL_ option requests a listing of all of the SAS files in the library
- the NODS option (which stands for "no details") suppresses the printing of detailed information about each file when you specify the _ALL_ option.
- You need to connect the _ALL_ option to the libref with a period (.)
- You can specify the NODS option only when you specify the _ALL_ option. Let's take a look at an example
The following SAS code requests a listing of the contents of the library called sashelp:
PROC CONTENTS data = sashelp._ALL_ nods; RUN;
First, what are SAS Libraries all about? Let's have you launch your SAS application. If the Explorer Window is not active, then click on the Explorer tab that appears among the bottom tabs.
Double click Libraries, which contains all available libraries. Then you should see a listing of your libraries, that is, something that looks like this:
You should see, at least, three libraries that SAS defines for you by default:
- As you know, Work is a temporary library for files that do not need to be saved from session to session.
- Sashelp is a permanent library that contains more than 200 sample data and other files that control how SAS works at your site. You can download these sample data for practicing or illustrating purpose.
- Sasuser is a permanent library that contains SAS files in the Profile catalog that store your personal settings. You may also opt to store your own files there. (Personally, I don't.)
Now, let's have you run the SAS code. Upon doing so, you should see output that looks something like this:
The CONTENTS Procedure
|Physical Name||C:\Program Files\SASHome\SASFoundation\9.4\nls\en\SASCFG|
|Physical Name||C:\Program Files\SASHome\SASFoundation\9.4\core\sashelp|
|Physical Name||C:\Program Files\SASHome\SASFoundation\9.4\aacomp\sashelp|
|Physical Name||C:\Program Files\SASHome\SASFoundation\9.4\af\sashelp|
... partial listing below ...
|#||Name||Member Type||Level||File Size||Last Modified|
Now, to view the descriptor information for any of the files that appear in the sashelp library, we need to modify our CONTENTS procedure a little bit.
The following SAS code tells SAS to display the descriptor information for the class data set that appears in the sashelp library:
PROC CONTENTS data = sashelp.class; RUN;
Launch and run the SAS code. (If, by chance, the class data set is not stored in your sashelp library, modify the code so that it references a data set that is stored in the library.) Review the resulting output to convince yourself that SAS does indeed display the descriptor information. Note in particular that, by default, SAS displays the variable information in alphabetical order:
Alphabetic List of Variables and Attributes
If we instead preferred to view a listing of the variables in the order in which they appear in the data set, we again need to modify the CONTENTS procedure a bit.
The following SAS code again tells SAS to display the descriptor information for the class data set that appears in the sashelp library. The varnum option requests that SAS display the listing of the variables in the order in which they appear in the data set:
PROC CONTENTS data = sashelp.class varnum; RUN;
Launch and run the SAS code. (Again, if the deskact data set is not stored in your sasuser library, modify the code so that it references a data set that is stored in the library.) Review the resulting output to convince yourself that SAS displays the descriptor information, and in particular, the variable information in the order in which the variables appear in the data set:
Variables in Creation Order
3.2 - List Input3.2 - List Input
Okay, now let's spend some time on a subject that is a little more fitting for the title of this lesson. In the previous lesson, we investigated how to use column input to read data values that appear in neatly define columns looking like this, say:
Smith 8145551354 3.89 Washington 8145569847 2.73 Wing 8145359376 3.56 Jackson 8145557437 3.12
Here, we'll investigate how to use list input to read in free-format data looking like this, say:
Smith 8145551354 3.89 Washington 8145569847 2.73 Wing 8145359376 3.56 Jackson 8145557437 3.12
List input might be the easiest input style to use because, as shown in the examples that follow, you simply list the variable names in the same order as the corresponding raw data fields. The ease comes with a bit of a price, however. Because you do not have to tell SAS the columns in which the data values appear, you must take note of the following restrictions:
- Fields must be separated by at least one blank (or other delimiters).
- Fields must be read in order from left to right.
- You cannot skip or re-read fields.
- Missing values must be represented by a place holder such as a period. (A blank field causes the matching of variable names and values to get out of sync.)
- Character values can't contain embedded blanks.
- The default length of character values is 8 bytes. A longer value is truncated when it is written to the data set NOTE! 1 byte = 1 character
- Data must be in standard character or numeric format.
The following SAS program illustrates the simplest example of list input. Note that there is one blank space between each of the data values. Also note that although the data values need not be lined up in columns, we still recommend doing so because of the difficulty otherwise in "eyeing" the data quickly.
DATA temp; input subj name $ gender height weight; * The $ that follows name tells SAS that it is a character variable; * By default, name only allows up to 8 characters to be read in; CARDS; 1024 Alice 1 65 125 1167 Maryann 1 68 140 1168 Thomas 2 68 190 1201 Benny 2 72 190 1302 Felicia 1 63 115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
The INPUT statement is how you tell SAS to read in the data using list input. For list input, simply list the variable names — leaving at least one space between names — in the order in which the variables appear in the data file. Remember to use the dollar sign ($) to distinguish character variables from numeric variables.
Launch and run the SAS program. Review the output (click on ) from the print procedure to convince yourself that the data are read in properly.
The following SAS program illustrates the necessary use of the missing value (.) placeholder when a data value is missing:
DATA temp; input subj name $ gender height weight; CARDS; 1024 Alice 1 65 125 1167 Maryann 1 68 140 1168 Thomas 2 68 190 1201 Benny 2 . 190 1302 Felicia 1 63 115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
Note that Benny's height is missing. Therefore, since we are using the list input style to read in the data, we have to put in a missing value (.) placeholder.
First, launch and run the SAS program. Review the output (click on ) from the print procedure to convince yourself that the data are read in properly. Then, edit the program by deleting the missing value (.) placeholder. Rerun the SAS program to see what happens when you fail to account for the missing value. In the log file, you should see a note that says:
NOTE: SAS went to a new line when INPUT statement reached past the end of a line
And the resulting output:
Output dataset: TEMP
should indicate that something has clearly gone awry. What is going on here is that, by default, SAS goes to the next data line to find more data if there are more variable names in the INPUT statement then there are values in the data line. In this case, Benny's height becomes 190, the first number to appear in the data line after gender, and Benny's weight becomes 1302, the first number to appear in the next data line.
The following SAS program illustrates how a character variable is, by default, truncated if it contains more than 8 characters. The name 'Benedictine' is saved in the variable name as 'Benedict'.
DATA temp; input subj name $ gender height weight; CARDS; 1024 Alice 1 65 125 1167 Maryann 1 68 140 1168 Thomas 2 68 190 1201 Benedictine 2 68 190 1302 Felicia 1 63 115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
Launch and run the SAS program. Review the output (click on ) from the print procedure to convince yourself that the name 'Benedictine' is indeed truncated to 'Benedict'. Incidentally, it is possible to use a LENGTH statement to tell SAS to allow the character variable name to contain more than eight characters. We'll learn about the LENGTH statement later.
The following SAS program illustrates how you can use the DELIMITER option of the INFILE statement to use values separators other than blanks. This example, in particular, illustrates it for the commonly used comma (,) as a delimiter:
DATA temp; infile cards delimiter=','; input subj name $ gender height weight; CARDS; 1024,Alice,1,65,125 1167,Maryann,1,68,140 1168,Thomas,2,68,190 1201,Benny,2,.,190 1302,Felicia,1,63,115 ; RUN; PROC PRINT data=temp; title 'Output dataset: TEMP'; RUN;
By default, SAS assumes data are space-delimited. The DELIMITER option of the INFILE statement here instead warns SAS that the data are comma-delimited — that is, that commas separate the data values rather than blank spaces. You might also have noted that although the INFILE statement typically directs SAS to externally-stored data, here the CARDS option included in the INFILE statement alerts SAS that the data are actually included in the code. Launch and run the SAS program. Review the output (click on ) from the print procedure to convince yourself that the data are indeed read in properly.
Now, let's go investigate another style of input, namely formatted input.
3.3 - Formatted Input3.3 - Formatted Input
The fundamental difference between column input, which we studied in the previous lesson, and formatted input, which we'll explore now, is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.
Standard numeric data values
Recall that standard numeric data values can contain only:
- decimal points
- numbers using scientific (E) notation
- negative (minus) and positive (plus) signs
Examples of standard numeric values include: 26, 3.9, -13, +3.14, 314E-2, and 2.193E3.
Nonstandard numeric data values
On the other hand, nonstandard numeric data values include:
- values that contain special characters, such as dollar signs ($), percent signs (%), and commas (,)
- date and time values
- data in fraction, integer binary, real binary, and hexadecimal forms
Examples of nonstandard numeric values include: 23.3%, $1.26, and 03/07/47.
The INPUT Statement
Here's the general form of the INPUT statement when using formatted input:
INPUT <pointer-control> variable informat.;
- pointer-control tells SAS at what column to start reading the data value
- variable is the name of the variable being created
- informat is a special instruction that tells SAS how to read the raw data values
A couple of things here. The above INPUT statement is written using standard SAS Help notation. For example, the pointer-control appears in brackets (<>) only to indicate that it is optional, or rather, not necessary for every variable you create. For example, you need not tell SAS to go to column 1 if that's where you want to start reading data values, because that's where SAS starts by default. In practice, the brackets (<>) should not appear in the INPUT statements in your program ... otherwise, SAS will hiccup.
There are two pointer controls that we'll learn about here:
- The @n pointer control moves the input pointer a specific column number n
- The +n pointer control moves the input pointer forward n columns to a column number that is relative to the current position
Again, an informat is what is used to tell SAS what special instructions are required to read in the raw data values. Many (too many?) special informats are available in SAS. For example, the numeric informat "mmddyy8." tells SAS that you want to read in a date that takes up to 8 spaces and looks like 10/27/05. The numeric informat "comma6." tells SAS that you want to read in a number that contains a comma and takes up to 6 spaces (including the comma), such as the number 11,387.
Let's take a look at an example!
The following SAS program uses the @n column pointer control and standard numeric informats to read three numeric variables — subj, height, and weight — into a temporary SAS data set called temp:
DATA temp; input @1 subj 4. @27 height 2. @30 weight 3.; DATALINES; 1024 Alice Smith 1 65 125 12/1/95 2,036 1167 Maryann White 1 68 140 12/01/95 1,800 1168 Thomas Jones 2 190 12/2/95 2,302 1201 Benedictine Arnold 2 68 190 11/30/95 2,432 1302 Felicia Ho 1 63 115 1/1/96 1,972 ; RUN; PROC PRINT data = temp; title 'Output dataset: TEMP'; RUN;
If you look at the INPUT statement, you'll see that it uses @n absolute pointer controls to move the input pointer first to column 1 to read subj, then to column 27 to read height, and finally to column 30 to read weight. In general, the @ moves the pointer to column n, the first column of the field that is being read.
The 4. that appears after subj, the 2. that appears after height, and 3. that appears after weight are the informats that tell SAS how to read each of the raw data values. In each case here, we are trying to read in standard numeric data values, and so we use what is called the w.d informat. The w tells SAS the width of the raw data value, that is how many columns it occupies. The d, which is optional, tells SAS the number of implied decimal places for the raw data value. The w and d must be connected by a period (.) delimiter. If the d is not present, you still need to make sure that you include the period in the informat name.
Making this all a little more concrete ... here, the subj values are four columns wide with no decimal places, and hence we use the 4. informat. We alternatively could have specified a 4.0 informat, but we could not have specified 4 (without the period) as the informat. The height values are two columns wide with no decimal places, and hence we use the 2. informat. Finally, the weight values are three columns wide with no decimal places, and hence we use the 3. informat.
Incidentally, the w.d informat ignores any d that we specify if the data value already contains a decimal point. So, for example, if we had a raw data value of 23.001 (occupying 6 columns with 5 digits, 1 decimal point, and 3 decimal places) and specified a 6. informat, SAS would still store the value as 23.001, even though we told SAS not to expect any decimal places.
Okay ... launch and run the SAS program, and review the output (click on ) obtained from the print procedure to convince yourself that the three variables were read incorrectly. Oh, do you remember when I said that you needn't tell SAS to move the input pointer to the first column because it does so by default? You might want to convince yourself of this by removing the @1 that appears in the INPUT statement. Then, rerun the program to convince yourself that SAS still reads the data values properly.
The following SAS program uses the @n column pointer control and standard character and numeric informats to read, respectively, two character variables — l_name and f_name — and two numeric variables — weight and height — into a temporary SAS data set called temp:
DATA temp; input @18 l_name $6. @6 f_name $11. @30 weight 3. @27 height 2.; DATALINES; 1024 Alice Smith 1 65 125 12/1/95 2,036 1167 Maryann White 1 68 140 12/01/95 1,800 1168 Thomas Jones 2 190 12/2/95 2,302 1201 Benedictine Arnold 2 68 190 11/30/95 2,432 1302 Felicia Ho 1 63 115 1/1/96 1,972 ; RUN; PROC PRINT data = temp; title 'Output dataset: TEMP'; RUN;
The INPUT statement uses @n absolute pointer controls to move the input pointer first to column 18 to read l_name, then back to column 6 to read f_name, forward to column 30 to read weight, and back again to column 27 to read height. This example illustrates, therefore, how you can use @n pointer controls to read data fields in any order ... backwards, forwards, and backwards again.
The 3. that appears after weight and the 2. that appears after height should look familiar. They are again the numeric informats that tell SAS how to read each of the two numeric variables of interest. Now, if you look at the informats for the two character variables, l_name and f_name, you should see that they each begin with a dollar sign ($). Because we are trying to read in character data values, we use what is called the $w. informat. The w tells SAS the width of the raw data value, that is how many columns it occupies. The dollar sign ($) and period (.) are required delimiters. In our example, the last name (l_name) occupies as many as 6 columns and hence we use the $6. informat, and the first name (f_name) occupies as many as 11 columns and hence we use the $11. informat.
Okay ... launch and run the SAS program, and review the output (click on ) obtained from the print procedure to convince yourself that the four variables were read in correctly.
In addition to using @n absolute pointer controls with numeric and character informats, the following SAS program uses +n relative pointer controls with nonstandard informats to create a temporary SAS data set called temp containing six variables:
DATA temp; input @1 subj 4. @6 f_name $11. @18 l_name $6. +3 height 2. +5 wt_date mmddyy8. +1 calorie comma5.; DATALINES; 1024 Alice Smith 1 65 125 12/1/95 2,036 1167 Maryann White 1 68 140 12/01/95 1,800 1168 Thomas Jones 2 190 12/2/95 2,302 1201 Benedictine Arnold 2 68 190 11/30/95 2,432 1302 Felicia Ho 1 63 115 1/1/96 1,972 ; RUN; PROC PRINT data = temp; title 'Output dataset: TEMP'; RUN;
You should now understand the specifications for reading subj, f_name and l_name. The INPUT statement tells SAS to go to column 1 to read the numeric field subj that is four columns wide, then to go to column 6 to read the character field f_name that is 11 columns wide, and then to go to column 18 to read the character field l_name that is 6 columns wide.
The +n relative pointer controls are used to read the remaining three variables — height, wt_date, and calorie. The +n pointer control moves the input pointer forward to a column number that is relative to the current position of the pointer. That is, the + moves the pointer forward n columns. In order to count correctly, it is important to understand where the column pointer is located after each data value is read. In general, when using formatted input, the column pointer moves to the first column that follows the field that was just read.
Again, let's make it more concrete. When SAS finishes reading the l_name field, the column pointer moves to the first column that follows that field, that is, to the column that immediately follows the d in Arnold (column 24). Now, the height field begins 3 columns to the right, so that's why we tell SAS to move +3 before reading the height data. When SAS finishes reading the height field, the column pointer moves to the first column that follows that field, that is, to the empty column that follows the heights (column 29). Now, the wt_date field begins 5 columns to the right, so that's why we tell SAS to move +5 before reading the weight dates. When SAS finished reading the wt_date field, the column pointer moves to the first column that follows the field, that is to the empty column just before the calorie data (column 42). Now, the calorie field begins 1 column to the right, so that's why we tell SAS to move +1 before reading the calorie data.
We'd be all done explaining the INPUT statement if it weren't for the wt_date and calorie fields containing nonstandard numeric data. The wt_date field contains forward slashes (/) for specifying the dates, and the calorie data contains commas (,). To read the wt_date field, we use the mmddyy8. informat to tell SAS to expect dates written in mm/dd/yy form. The 8, of course, tells SAS that the dates could occupy as many as eight columns. To read the calorie field, we use the comma6. informat to tell SAS to read numeric data containing commas and occupying as many as 6 columns.
In general, the COMMAw.d informat is used to read numeric values and to remove embedded blanks, commas, dashes, dollar signs, percent signs, right parentheses, and left parentheses (which are converted to minus signs). The w. tells SAS the width of the field (including dollar signs, decimal places, or other special characters). The d, which is optional, tells SAS the number of implied decimal places for a value.
Whew!!! Now, go ahead and launch and run the SAS program, and review the output (click on ) obtained from the print procedure. In doing so, note that formatted input does not imply formatted output. For example, note that SAS stores the date 12/01/95 as the number 13118. We won't worry about this too much yet — it's just that, interestingly enough, SAS stores dates as numeric values equal to the number of days that have passed since January 1, 1960. Yup, that's right —dates after January 1, 1960 are stored as unique positive integers, and dates before January 1, 1960 are stored as unique negative integers. Also, note that SAS prints the values of the variable calorie without commas. In order to get SAS to print the values in a more understandable format, we must tell SAS how to format the output by using a SAS FORMAT statement.
Type in the following statement exactly as it appears:
FORMAT wt_date mmddyy8. calorie comma5.;
between the INPUT statement and the DATALINES statement in your program. Rerun the SAS program with the now included FORMAT statement, and review the output from the print procedure to convince yourself that the data are now printed as desired. You should now see that it looks something like this:
Output dataset: TEMP
3.4 - SAS Help and Documentation3.4 - SAS Help and Documentation
Here's as good an opportunity as any to practice using the SAS Help feature that comes with SAS. Be forewarned that any information you could possibly need is in SAS Help — it's just that sometimes it takes some patience to find it.
Note that SAS changes the interface and access of help files dramatically since 9.2. The contents shown below apply to SAS 9.3 and above.
In the previous section, we used two informats for reading nonstandard numeric data, namely MMDDYY8. and COMMA6. There are oodles of other informats available making it impossible to address them all in this course. To name just a few, there's PERCENTw.d, DATEw., TIMEw., and JULIANw. There's no doubt that at some point in your future SAS programming, you'll need to look up an informat in the SAS Help and Documentation. So, let's give you some practice with doing that now. Let's look for a list of the available SAS informats. To do so:
On the menu toolbar, under Help, select SAS Help and Documentation:
A new window pops out:
In the box labeled Quick Search, type the word informats:
Then click the arrow button next to it to start searching…
Finally, under the relevant list that SAS displays, use your mouse to select the item 36 labeled Informats by Category:
In the righthand portion of the window, you should see that the relevant content appears. It is titled Informats by Category. Page down and take a look at the (long) list of available SAS informats. For example, you should see one in the Character category called $CHARw. which SAS tells you that it is the informat you can use if you want to read in characters with blanks. You should also see one in the Character category called $w. which SAS tells you that it is the informat to use if you want to read in standard character data. Finally, you should see one in the Numeric category called w.d which SAS tells you that it is the informat to use if you want to read in standard numeric data.
I'll be the first person to agree with you that some of the SAS Help can be intimidating and not particularly clear on first pass. On the other hand, it has been my experience that if you are patient with it, you can usually find a good example or two to help clarify sticking points. It is in your best interest to get comfortable with SAS Help, as no course (not even this one!) is going to teach you everything you need to know about SAS.
3.5 - SAS System Options3.5 - SAS System Options
Here's a topic that can really be brought up anywhere and therefore doesn't necessarily belong in this lesson. On the other hand, it's a subject that we don't want to wait too long to introduce, because of its inherent benefits. So, let's go ahead and do it now before any more time passes by.
SAS System options are parameters that you set that affect how SAS works, how SAS formats your output, how SAS uses memory, how SAS handles errors, and so on. All SAS System options have default settings that are used unless you specify otherwise. For example, page numbers are automatically displayed on your output. To modify system options, we just need to submit an OPTIONS statement. Here's one example of an OPTIONS statement that you might want to include at the beginning of each of your SAS programs:
Option PS = 60 LS = 80 NODATE;
The above statement tells SAS that you want:
- each page of output to contain no more than 60 lines (PS = 60, where PS stands for "pagesize")
- each line of output to contain no more than 80 characters (LS = 80, where LS stands for "linesize")
- each page of output to contain no date and time at which the output was created
You can place an OPTIONS statement anywhere in a SAS program to change the settings from that point onward. However, it is good programming practice to place OPTIONS statements outside of DATA or PROC steps so that your programs are easier to read and debug.
In general, the common options you might consider specifying are:
- CENTER or NOCENTER. CENTER, which is the default, centers the text in your output.
- DATE or NODATE. DATE, which is the default, prints the current date and time on your output, while NODATE doesn't.
- NUMBER or NONUMBER. NUMBER, which is the default, prints an accumulative page number on each page of your output.
- LINESIZE = n, where n can be any number between 64 and 256 (inclusive), specifies the width of the print line for your procedure output and log. Observations that do not fit within the line size continue on a different line or page.
- PAGESIZE = n, where n can be any number between 15 and 32767 (inclusive), specifies how many lines each page of output contains.
- PAGENO = n, where n is the page number at which you want SAS to start numbering your output pages. If you don't specify the PAGENO= option, your output is numbered sequentially throughout your SAS session, starting with page 1.
There are just a couple of things to keep in mind about the OPTIONS statement. First, it is a global statement, meaning it affects every DATA step and PROC step that follows it. Second, it overrides any previous OPTIONS statements. In this sense, you can think of the options you specify as toggle switches. The options stay in place until you change them with a new OPTIONS statement or you end your SAS session. We certainly encourage you to use OPTIONS statements throughout this course (and beyond!) to make your output more user friendly.
3.6 - Summary3.6 - Summary
Although this lesson has the (deceiving?) title about reading data into a SAS data set, we covered that and much more. In terms of reading data into a SAS data set, we investigated how the INPUT statement works with list input and how it works with formatted input. Then, we addressed other miscellaneous, but important, issues such as using the CONTENTS procedure to view the contents of your data sets, getting help from SAS Help and Documentation, and setting SAS system options.
Now, it is time to get some practice using these techniques. See the homework assignment for this lesson.