Entries in Revision (2)

Friday
Feb132009

Revision Time (Missing and Undefined Values)

Missing and undefined values is a source of confusion, for those previously using SPSS because it is different and for those from traditional programming environments where the concept are not so well used. The following is extracted from the SIR 2 Manual but still remains true.

The user may specify one, two or three missing values for each variable in the data base with the MISSING VALUES command. This capability is also available for summary variables during Retrieval.

For example:

MISSING VALUES VAR1 TO VAR5 (7,8,9)
 RO01 TO R025 (-1,-2)
 DEPT JOBNO HOURS (BLANK)

When a value of 7 is input or computed for VAR1, SIR will recognize it as a missing value.

In SIR, missing values are distinguished from "undefined" values. The latter can arise in several ways:

a. as the result of undefined computations such as by division zero, square root of a negative, logarithm of zero or negative, tangent of pi/2, etc.

COMPUTE V = SQRT(Z/Y)

V will be undefined if Y is zero or Z/Y is negative.

b. as the result of a computation using missing or undefined values.

COMPUTE RADIUS = Ql + 7*Q2

If either Ql or Q2 is missing or undefined, RADIUS will be set to undefined.

c. as the result of a computation which sets a variable to an invalid or out-of-range value.

VAR RANGES MEAS1 TO MEAS5 (1,50)
...
...
COMPUTE MEASl = (BAK + BAL) 

If the result of the expression on the right is outside the range 1 through 50, MEAS1 will be set to undefined.

d. as the result of a Batch Data Input operation which the user has chosen to ACCEPT records containing invalid or out-of-range values. The records are stored in the data base with "undefined" substituted for the "bad" data.

SIR provides the user with several functions that detect missing and undefined values.

EXISTS (X)

This function returns the value 1 if X is defined and the value 0 if X is missing or undefined. For example, the logical expression (EXISTS(XNUM) EQ 0) will be true if XNUM contains a missing or undefined value.

MISNUM (X)
This function returns the value 0 if X is undefined and the value 1, 2 or 3 if X is the contains the first, second or third missing value. If X is defined, MISNUM returns undefined. For example, if the missing values for ANSO1 are defined by:

MISSING VALUES ANS01(97,98,99)

then the statement

XANS = MISNUM(ANS01)

will set XANS to 3 if ANS01 is 99.

MISSING (X)
This function returns the actual missing value of X when X contains one of its three missing values. Otherwise, MISSING returns undefined. The MISSING function is the only way to retrieve a missing value for further use in SIR. For example, suppose the missing values for ANSO1 are 97, 98 and 99 (as before), and ANSO1 in a particular record contains 99,
then the statement

XANS = ANSO1

would set XANS to undefined, but the statement

XANS = MISSING(ANS01)

would set XANS to 99.

And the following will allow comparison with a specific missing value:

IF (MISSING (ANSO1)=99)

It is important to check for missing and undefined values because they generate undefined results and undefined whenever they appear in arithmetic and logical expressions.

Manipulation of missing values in calculations
There are also several functions that detect and ignore missing and undefined values. The group of functions listed below compute various aggregates for two or more variables.

CNT : count the defined values
FST : find first defined value

LST : find last defined value
MAX : find maximum defined value
MEAN: mean of the defined values
MIN : find minimum defined value
STD : standard deviation
SUM : sum of the defined values

The CNT function is particularly useful for detecting simultaneously defined (or undefined) values. For example, suppose we want to compute R only if A, Band C are all defined; if anyone of them is undefined or missing we want R set to zero. The EXISTS function (see above) could be used as follows:

COMPUTE R = 0
IF (EXISTS (A) EQ 1 AND EXISTS (B) EQ 1 AND EXISTS (C) EQ 1) R = SQRT(A + B + C)

However the CNT functions provides a more concise and elegant form:

COMPUTE R = 0
IF (CNT(A,B,C) EQ 3) R = SQRT(A + B + C)

There are several other functions (available in Retrieval' only) that compute aggregates of defined quantities across several records of a given type.

CNTR : count defined quantities
FSTR : get first defined quantity
LSTR : get last defined quantity

MAXR : get maximum defined quantity
MINR : get minimum defined quantity

MEANR : get mean defined quantities

STDEVR: get standard deviation of defined quantities

SUMR : get sum of the defined quantities

Tuesday
Jan272009

Revision Time (PROCESS REC)

PROCESS REC is an essential part of virtually any data retrieval in a case-structured database, here a short article from Report Issue No 1 (Feb 1996) runs us through the basics.

PROCESS REC VIA | WITH
Via and With are actually synonymous, but they cause some new users a good deal of grief so it may be useful to run through some examples.

PROCESS REC N
A loop over all records of type N within the current case (as defined by the PROCESS CASE loop). The loop ends with an END PROCESS REC or an END REC command. If the case has no records of type N, execution skips to the command after the END REC.

PROCESS REC N VIA (1)
A loop over all records of type N in the current case, where the first keyfield (sometimes called a sort-id) variable takes a value equal to 1. If the record has only one keyfield this will be at most a single record; if there are other keyfields they can vary freely. If the current case has no record(s) where the first keyfield is equal to 1, execution skips to the command after the END REC.

PROCESS REC N VIA (4,5,6)
A loop over all records type N (in the current case) where the first key field is equal to 4, the second keyfield is 5 and the third keyfield is 6. This command would be invalid unless there were at least three key fields. With exactly three keyfields this would define a single record which may or may not exist for the current case. If there are more than three keyfields, the fourth and succeeding keyfields can take any value.

The examples above use constant values as the arguments in the brackets, but you may use any mixture of constants and local (sometimes called summary) variables. Eg PROCESS REC N WITH (A,B,C) will only compile if A, B and C are the names of local variables (created with GET VARS, COMPUTE etc). If you do not know what your keyfields are, there are various options under the Database menu item which will show you. In the COMPANY database the OCCUP record has one keyfield called POSITION, which means that the command:
PROCESS REC OCCUP VIA (10)
will retrieve the OCCUP records where POSITION has value 10. Because it specifies a keyfield value and thereby uses the indexes, it is much faster than its functional equivalent:
PROCESS REC OCCUP
. IFNOT (POSITION = 10) NEXT REC
which has to read every OCCUP record.

The keyword VIA (or WITH) matches exact values; ranges are supported with the keywords FROM, THRU, AFTER and UNTIL.