Introduction to Stata

Part II: Data Management

1  Stata Command Syntax

Having used a few Stata commands, it may be time to comment briefly on their structure. One of Stata’s great strengths is the consistency of its command syntax. Most of Stata’s commands share the following syntax, where square brackets mean that something is optional:

[by varlist:] command [varlist] [if exp] [in range] [weight] [, options]

In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range denotes an observation range, weight denotes a weighting expression, and options denotes a list of options. Let’s briefly describe each syntax element:

  • varlist: If no varlist appears, most commands assume a varlist of _all, the Stata shorthand for all the variables in the dataset. Some commands take a varname rather than a varlist. A varname refers to exactly one variable.
  • if exp: The if-qualifier restricts the scope of a command to observations for which the value of the expression is true (which is equivalent to the expression being nonzero).
  • in range: The in-qualifier restricts the scope of the command to a specific observation range. A range specification takes the form $\#_1 [/ \#_2]$, where $\#_1$ and $\#_2$ are positive or negative integers. Negative integers are understood to mean "from the end of the data", with -1 referring to the last observation. The implied first observation must be less than or equal to the implied last observation. The first and last observations in the dataset may be denoted by f and l, respectively.
  • weight: Some commands allow the use of weights. You should use them if you want to draw conclusions about the whole population from your sample and the observations in the sample do not all represent the same number of population units.
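
To make the qualifiers concrete, here is a minimal sketch. It assumes Stata's bundled auto example dataset (loaded with sysuse) instead of the GSOEP data, and the price threshold as well as the weighting variable are illustrative choices only.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* if-qualifier: restrict the command to cars that cost more than 10,000
summarize mpg if price > 10000

* in-qualifier: the first ten observations ...
list make price in 1/10

* ... and the last five (negative numbers count from the end, l is the last observation)
list make price in -5/l

* weight: weights are given in square brackets; aweight and the variable
* weight are chosen for illustration only, not as a recommendation
summarize mpg [aweight = weight]
```

The if- and in-qualifiers can also be combined in a single command, in which case only observations satisfying both restrictions are used.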

1.1  The by-Prefix

The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then given the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option. The by prefix is important for understanding data manipulation and working with subpopulations within Stata. Furthermore, the varlist in by varlist: may contain string variables, numeric variables, or both.

Let’s show how the by-prefix works with a small example. We reload the GSOEP data set and calculate the average income for the women and men in the sample. The data must first be sorted by gender, and there are two ways to do this. On the one hand, we can sort the data first and then use the by-prefix in combination with the sum command (short for summarize):

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
(SOEP 2009 (Kohler/Kreuter))

. sort gender

. by gender: sum income

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,458       13323    21290.77          0     612757

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,320    28190.75    47868.24          0     897756

On the other hand, we can also sort directly within the by-prefix by using bysort:

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
(SOEP 2009 (Kohler/Kreuter))

. bysort gender: sum income

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,458       13323    21290.77          0     612757

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,320    28190.75    47868.24          0     897756

As we can see, both approaches lead to the same result. Which of the two alternatives is preferable? As already explained in Part 1 of this tutorial, the sort command sorts the data set in ascending order of the respective variables. Thus, the first alternative, i.e. sorting outside the by-prefix, is the appropriate one whenever we need a descending sort order and therefore work with the gsort command.
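
The equivalence of the two alternatives can be sketched on a small scale. The example below assumes Stata's bundled auto dataset and its variables foreign and price, which are not part of the GSOEP data; it also shows the sort option of by, which is the long form of bysort.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* Alternative 1: sort first, then use the by-prefix
sort foreign
by foreign: summarize price

* Alternative 2: let bysort do the sorting
bysort foreign: summarize price

* Alternative 3: the sort option of by, which is what bysort abbreviates
by foreign, sort: summarize price
```

All three variants print the same two blocks of output, one for each value of foreign.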

Since the concept of the by-prefix is not clearly visible from the command lines, let’s go step by step through Stata’s procedure. First, let’s sort the data by the identification number pnr and view the selected variables in the Browser window.

. sort pnr

. br pnr gender income

As you can see, the observations are sorted by pnr, and each respondent has his or her own values for gender and income. Next, we sort the data by gender and view the data again in the Browser window. Since gender is a string variable, the sort command sorts the observations alphabetically, so all women (Female) come before all men (Male). The command that follows the by-prefix is then executed once for each category of the variable gender. Since gender has only two possible values, the sum command is executed exactly twice.

Now, let’s obtain average incomes by gender (gender), broken down by educational attainment (educ). To do so, we type

. sort gender educ

. by gender educ: sum income

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female, educ = Elementary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        755    8132.087    24628.12          0     612757

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female, educ = Intermediate Secondary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        821    14754.12    13836.29          0     102111

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female, educ = Technical Secondary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        107    20141.42    17049.57          0      84958

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female, educ = Maturity

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        448    22042.61    27588.56          0     424107

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female, educ = .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        327    7537.804    13151.93          0      96065

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male, educ = Elementary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        783    19431.59    25558.53          0     365076

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male, educ = Intermediate Secondary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        628    27109.07    21148.42          0     109121

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male, educ = Technical Secondary

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        138     40967.2    41181.82          0     287194

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male, educ = Maturity

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        449    51046.58       92057          0     897756

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male, educ = .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        322    14253.83    18768.37          0     108143

The sorting works as follows: First, the observations are sorted by the first variable, gender; at this stage, the second variable educ plays no role. In the next step, the sorting by gender is preserved and, within each value of gender, the observations are sorted by the second variable educ. Finally, the command that follows the by-prefix is executed once for each combination of values of the two variables in the varlist of the by-prefix. Here this happens ten times, because the missing values of educ form a group of their own.

1.2  Wildcards and Ordering

Variable lists (or varlists) can be specified in a variety of ways, all designed to save typing and encourage good variable names. If you want to address several variables in a command, you can use different placeholders to save some time.

  • If the variables differ only in a single character, you can use a question mark, which stands for exactly one character.
    . use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
    (SOEP 2009 (Kohler/Kreuter))
    
    . desc educ?
    
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------------------------------------------------------
    educy           float   %9.0g                 Number of Years of Education
    
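  • The wildcard * stands for any number of characters and can be used to address variables whose names are partially identical. A variable that matches several patterns, such as hhnr below, is listed once per match.
    . desc hh* *nr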
    
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------------------------------------------------------
    hhnr            long    %12.0g                Houshold Number
    hhmem           byte    %8.0g                 Number of Persons in Household
    hhkids          byte    %8.0g                 Number of Kids (0-14 Years) in Household
    hhtyp           byte    %35.0g     hhtyp      Household Type
    hhinc           long    %10.0g                Household Post-Government Income (in Euro)
    pnr             long    %12.0g                Person Number
    hhnr            long    %12.0g                Houshold Number
    
  • Variables that are arranged one after the other in the data set can be addressed together with a hyphen.
    . desc task-rooms
    
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------------------------------------------------------
    task            byte    %58.0g     task       Working Task of Dependent Employees
    state           byte    %22.0g     state      State of Residence
    health          byte    %32.0g     health     Satisfaction with Health
    satlif          byte    %32.0g     satlif     Overall Life Satisfaction
    polint          byte    %20.0g     polint     Political Interests
    party           byte    %15.0g     party      Political party supported
    suppar          byte    %20.0g     suppar     Supports political party
    worpea          byte    %20.0g     worpea     Worried about peace
    worter          byte    %20.0g     worter     Worried about global terrorism
    worcri          byte    %20.0g     worcri     Worried about crime in Germany
    worimm          byte    %20.0g     worimm     Worried about immigration to Germany
    worhfo          byte    %20.0g     worhfo     Worried about hostility to foreigners
    worjos          byte    %20.0g     worjos     Worried about job security
    size            float   %12.0g                Size of Housing (in m^2)
    rent            float   %12.0g                Rent Minus Heating Costs (in Euro)
    rooms           byte    %8.0g                 Number of Rooms > 6m^2
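
The three placeholders can also be tried out on a dataset that every Stata installation provides. The following sketch assumes the bundled auto dataset; the variable names in the comments are those of that dataset, not of the GSOEP data.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* ? stands for exactly one character: matches mpg
describe mp?

* * stands for any number of characters: matches trunk and turn
describe t*

* A hyphen selects a run of adjacent variables: price, mpg, and rep78
describe price-rep78
```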
    

Since the use of hyphens in varlists depends on the order of the variables in the dataset, we briefly introduce the order command. This command enables changing the order of variables in the dataset. If we type

. desc

Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta
  obs:         5,410                          SOEP 2009 (Kohler/Kreuter)
 vars:            36                          23 Sep 2015 16:20
 size:       384,110                          
---------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------
pnr             long    %12.0g                Person Number
hhnr            long    %12.0g                Houshold Number
gender          str6    %9s                   Gender
female          byte    %20.0g     female     Female - Dummy
age             float   %9.0g                 Age
marst           byte    %29.0g     marst      Marital Status of Individual
marr            float   %11.0g     marr       Married / Not Married - Dummy
hhmem           byte    %8.0g                 Number of Persons in Household
hhkids          byte    %8.0g                 Number of Kids (0-14 Years) in Household
hhtyp           byte    %35.0g     hhtyp      Household Type
income          long    %10.0g                Individual Labor Earnings (in Euro)
hhinc           long    %10.0g                Household Post-Government Income (in Euro)
educ            byte    %28.0g     educ       Education
educy           float   %9.0g                 Number of Years of Education
ausb            byte    %40.0g     ausb       Ausbildungsabschluss
emplst          byte    %44.0g     emplst     Employment Status
lfp             float   %18.0g     lfp        Labor Force Participation
task            byte    %58.0g     task       Working Task of Dependent Employees
state           byte    %22.0g     state      State of Residence
health          byte    %32.0g     health     Satisfaction with Health
satlif          byte    %32.0g     satlif     Overall Life Satisfaction
polint          byte    %20.0g     polint     Political Interests
party           byte    %15.0g     party      Political party supported
suppar          byte    %20.0g     suppar     Supports political party
worpea          byte    %20.0g     worpea     Worried about peace
worter          byte    %20.0g     worter     Worried about global terrorism
worcri          byte    %20.0g     worcri     Worried about crime in Germany
worimm          byte    %20.0g     worimm     Worried about immigration to Germany
worhfo          byte    %20.0g     worhfo     Worried about hostility to foreigners
worjos          byte    %20.0g     worjos     Worried about job security
size            float   %12.0g                Size of Housing (in m^2)
rent            float   %12.0g                Rent Minus Heating Costs (in Euro)
rooms           byte    %8.0g                 Number of Rooms > 6m^2
renttype        byte    %20.0g     renttype   Status of living
condit          byte    %24.0g     condit     Condition of house
satliv          byte    %45.0g     satliv     Satisfaction with Living/Habitation
---------------------------------------------------------------------------------------------------------------------------
Sorted by: pnr

. order wor* hh* female

. desc

Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta
  obs:         5,410                          SOEP 2009 (Kohler/Kreuter)
 vars:            36                          23 Sep 2015 16:20
 size:       384,110                          
---------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------
worpea          byte    %20.0g     worpea     Worried about peace
worter          byte    %20.0g     worter     Worried about global terrorism
worcri          byte    %20.0g     worcri     Worried about crime in Germany
worimm          byte    %20.0g     worimm     Worried about immigration to Germany
worhfo          byte    %20.0g     worhfo     Worried about hostility to foreigners
worjos          byte    %20.0g     worjos     Worried about job security
hhnr            long    %12.0g                Houshold Number
hhmem           byte    %8.0g                 Number of Persons in Household
hhkids          byte    %8.0g                 Number of Kids (0-14 Years) in Household
hhtyp           byte    %35.0g     hhtyp      Household Type
hhinc           long    %10.0g                Household Post-Government Income (in Euro)
female          byte    %20.0g     female     Female - Dummy
pnr             long    %12.0g                Person Number
gender          str6    %9s                   Gender
age             float   %9.0g                 Age
marst           byte    %29.0g     marst      Marital Status of Individual
marr            float   %11.0g     marr       Married / Not Married - Dummy
income          long    %10.0g                Individual Labor Earnings (in Euro)
educ            byte    %28.0g     educ       Education
educy           float   %9.0g                 Number of Years of Education
ausb            byte    %40.0g     ausb       Ausbildungsabschluss
emplst          byte    %44.0g     emplst     Employment Status
lfp             float   %18.0g     lfp        Labor Force Participation
task            byte    %58.0g     task       Working Task of Dependent Employees
state           byte    %22.0g     state      State of Residence
health          byte    %32.0g     health     Satisfaction with Health
satlif          byte    %32.0g     satlif     Overall Life Satisfaction
polint          byte    %20.0g     polint     Political Interests
party           byte    %15.0g     party      Political party supported
suppar          byte    %20.0g     suppar     Supports political party
size            float   %12.0g                Size of Housing (in m^2)
rent            float   %12.0g                Rent Minus Heating Costs (in Euro)
rooms           byte    %8.0g                 Number of Rooms > 6m^2
renttype        byte    %20.0g     renttype   Status of living
condit          byte    %24.0g     condit     Condition of house
satliv          byte    %45.0g     satliv     Satisfaction with Living/Habitation
---------------------------------------------------------------------------------------------------------------------------
Sorted by: pnr

the specified variables are placed at the beginning of the data set. Now, the variables beginning with wor come first. They are followed by the variables beginning with hh and finally by the female variable. The remaining variables follow in their previous order. Further layout rules can be found by typing help order.
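
Some of these further layout rules can be sketched as follows. The example assumes Stata's bundled auto dataset, and the choice of variables is purely illustrative; the options after() and alphabetic are described in help order.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* Move foreign and make to the front of the dataset
order foreign make

* Place price directly after mpg
order price, after(mpg)

* Arrange all variables alphabetically by name
order _all, alphabetic

* Inspect the resulting variable order
describe
```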

2  Create New Variables

One of the three commandments says that the original data should never be overwritten. However, this does not mean that the data must not be changed. To analyze the data and gain new knowledge from the raw data set, we often have to generate new variables from the existing ones and modify existing variables. In addition, when working with large datasets it is common to delete variables that are not relevant to a particular application or project. This gives the user a better overview and shortens the computation time of certain analysis methods and algorithms.

2.1  Creating and Modifying Variables

The most common command to generate a new variable is generate, which is usually abbreviated to gen. This command can be used with the by-prefix as well as with the in- and if-qualifiers. The basic syntax for this command is

generate newvariable = expression

First, we have to choose a name for our new variable and then type a single equals sign to start the definition of the new variable. An expression is a formula made up of constants, existing variables, operators, and functions. In general, we can distinguish between mathematical and logical expressions. The operators needed for these expressions are given below:

Arithmetic                Logical     Relational
+  addition               !  not      >   greater than
-  subtraction            |  or       <   less than
*  multiplication         &  and      >=  greater than or equal to
/  division                           <=  less than or equal to
^  power                              ==  equal
+  string concatenation               !=  not equal

First, let's generate a new variable using a mathematical expression. Since the GSOEP has data on a person's individual labor earnings, we want to take the logarithm of this income variable and call the new variable loginc to make our operation identifiable in the new variable's name.

. gen loginc = log(income)
(2,001 missing values generated)

. sum loginc income

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      loginc |      3,409    9.770343    1.124561   3.828641   13.70765
      income |      4,778    20542.17    37426.25          0     897756

The new variable has many more missing values than the original income variable because the logarithm of zero is undefined, so every zero income becomes a missing value in loginc. Stata has many mathematical, statistical, string, date, time-series, and programming functions. Just type help functions to get an overview.
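
The source of the additional missing values can be reproduced with a tiny dataset. The three income values below are made up for illustration and are not taken from the GSOEP.

```stata
* Hypothetical mini dataset: one zero income, one positive income, one missing
clear
input income
0
100
.
end

* log(0) is undefined and log(.) is missing, so two missing values are generated
generate loginc = log(income)
list

* Count the missing values that stem from zero incomes only
count if income == 0 & missing(loginc)
```

Counting zero incomes this way before taking logs shows in advance how many observations the transformation will drop.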

Now, let's generate a new variable using a logical expression. We want to generate a new variable called midage that takes the value 1 if a person is at least 25 but younger than 64 years old, and the value 0 otherwise:

. gen midage = age >= 25 & age < 64

. tab midage

     midage |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,913       35.36       35.36
          1 |      3,497       64.64      100.00
------------+-----------------------------------
      Total |      5,410      100.00

Thus, 3,497 persons fall into the defined age range of at least 25 but younger than 64 years. In the next step, we want to generate a new variable using a string variable within the logical expression. For this purpose, let's use the string variable gender, which has two possible values, Male and Female. We want to create a new variable called male that takes the value 1 if a person is a man and the value 0 if a person is a woman:

. gen male = gender == "Male"

. tab male

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,825       52.22       52.22
          1 |      2,585       47.78      100.00
------------+-----------------------------------
      Total |      5,410      100.00

. tab gender

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
     Female |      2,825       52.22       52.22
       Male |      2,585       47.78      100.00
------------+-----------------------------------
      Total |      5,410      100.00

The logical expression above is evaluated separately for each person: if the gender variable equals the string "Male", the expression is true and the person is assigned the value 1 in the new variable; otherwise the person is assigned the value 0. Note that string comparisons are case-sensitive and sensitive to spaces. The following command leads to a completely different and undesired result because of the extra space at the end of the string:

. gen male2 = gender == "Male "

. tab male2

      male2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,410      100.00      100.00
------------+-----------------------------------
      Total |      5,410      100.00
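
When string values may contain stray spaces or inconsistent capitalization, one possible approach is to normalize them inside the comparison. The sketch below uses a made-up three-observation dataset and the string functions lower() and strtrim() (called trim() in older Stata versions); it is one option among several, not the only way to handle such data.

```stata
* Hypothetical mini dataset with three spellings of the same category
clear
set obs 3
generate str8 gender = "Male"
replace gender = "male" in 2
replace gender = " Male" in 3

* Exact comparison: only the first observation matches
generate male_exact = gender == "Male"

* Comparison after removing surrounding blanks and ignoring case
generate male_clean = lower(strtrim(gender)) == "male"

list

* The exact match is true for observation 1 only, the cleaned match for all three
assert male_exact == 1 in 1
assert male_exact == 0 in 2/3
assert male_clean == 1
```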

Using the generate command, we can also generate new string variables. We can combine two or more string variables into a new string variable, or create a new string variable by attaching strings to an existing string variable. First, let's generate the new string variable gender2 by combining the gender variable with itself:

. gen gender2 = gender + gender

. list gender2 gender in 1/5

     +-----------------------+
     |      gender2   gender |
     |-----------------------|
  1. |     MaleMale     Male |
  2. | FemaleFemale   Female |
  3. |     MaleMale     Male |
  4. | FemaleFemale   Female |
  5. |     MaleMale     Male |
     +-----------------------+

As you can see, the operator "+" concatenates the strings, in our case two copies of the gender variable, without spaces, i.e. it simply joins them together. Note also that the missing value for a string variable is nothing special: it is simply the empty string "". Second, let's create a new string variable gender3 by combining the gender variable with a constant string, e.g. the string " - Gender".

. gen gender3 = gender + " - Gender"

. list gender3 gender2 gender in 1/5

     +-----------------------------------------+
     |         gender3        gender2   gender |
     |-----------------------------------------|
  1. |   Male - Gender       MaleMale     Male |
  2. | Female - Gender   FemaleFemale   Female |
  3. |   Male - Gender       MaleMale     Male |
  4. | Female - Gender   FemaleFemale   Female |
  5. |   Male - Gender       MaleMale     Male |
     +-----------------------------------------+

Stata behaves in a particular way when you want to change the values of an existing variable: it will not let you overwrite an existing variable with the generate command. If you really want to replace the values of an existing variable, you have to use the replace command. Stata uses two different commands to prevent you from accidentally modifying your data. The syntax of the replace command is similar to the syntax of the generate command, but unlike generate, replace cannot be abbreviated.

Now, let's change the values of the male variable to 2 if a person is a man and to 1 if a person is a woman, using the gender variable within the if-qualifier.

. replace male = 2  if gender == "Male"
(2,585 real changes made)

. replace male = 1  if gender == "Female"
(2,825 real changes made)

. tab male

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,825       52.22       52.22
          2 |      2,585       47.78      100.00
------------+-----------------------------------
      Total |      5,410      100.00
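
The division of labor between generate and replace can be sketched as follows. The example assumes Stata's bundled auto dataset, and the variable name price_k is hypothetical; capture is used only so that the failing command does not stop a do-file.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* Create a new variable: price in thousands
generate price_k = price / 1000

* A second generate with the same name fails; capture suppresses the error
* message so that the return code can be inspected afterwards
capture generate price_k = price
display "return code of the second generate: " _rc

* replace is the command that modifies an existing variable
replace price_k = round(price_k, 0.1)
```

A nonzero return code confirms that generate refused to overwrite price_k.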

2.2  More Generating and Recoding Variables

There is another important command to create new variables: the more powerful egen command, which is useful for working across groups of variables or within groups of observations. There are plenty of functions that can be used with the egen command. Just type help egen to explore some of them. For example, if we are interested in the number of missing values each observation has across a selected list of variables, we can generate a new variable miss by using the rowmiss() function with the egen command:

. egen miss = rowmiss(gender - educ)

. tab miss

       miss |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      4,129       76.32       76.32
          1 |      1,169       21.61       97.93
          2 |        112        2.07      100.00
------------+-----------------------------------
      Total |      5,410      100.00

Thus, 4,129 persons have valid values for all selected variables, while 112 persons have missing values for two of the selected variables. The rowmiss() function is useful if you want to create a dataset containing only observations without any missing values for your selected variables. Furthermore, you can use the egen command if you want to store summary statistics of a variable in a new variable by group membership. Let's generate a new variable that stores the mean income of men if a person is a man and the mean income of women otherwise.

. bysort gender: egen incgen_av = mean(income)

. bysort gender: sum income

---------------------------------------------------------------------------------------------------------------------------
-> gender = Female

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,458       13323    21290.77          0     612757

---------------------------------------------------------------------------------------------------------------------------
-> gender = Male

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      2,320    28190.75    47868.24          0     897756


. sort pnr

. list incgen_av gender in 1/5

     +-------------------+
     | incgen~v   gender |
     |-------------------|
  1. | 28190.75     Male |
  2. |    13323   Female |
  3. | 28190.75     Male |
  4. |    13323   Female |
  5. | 28190.75     Male |
     +-------------------+
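
A typical follow-up step is to relate each observation to its group mean. The sketch below assumes Stata's bundled auto dataset rather than the GSOEP data, and the new variable names are hypothetical.

```stata
* Assumption: the auto dataset that ships with Stata stands in for real data
sysuse auto, clear

* Store the group mean of price in every observation of the group
bysort foreign: egen price_mean = mean(price)

* Deviation of each car's price from the mean of its group
generate price_dev = price - price_mean

* Number of missing values per observation across price, mpg, and rep78
egen nmiss = rowmiss(price-rep78)
tabulate nmiss
```

Because the group mean is stored in every observation, it can be used in ordinary generate expressions like any other variable.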

There is another command to generate new variables and modify existing variables. The recode command is used to group numeric variables into categories or to easily change the values of existing categories in categorical variables. Now, let's generate a new variable agecat5 that divides persons into four age groups, whereby the first age group is assigned the value 1 and the last age group the value 4.

. sum age

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      5,410    49.50961    18.12717         17        100

. recode age (17/24 = 1) (25/44 = 2) (45/64 = 3) (65/100 = 4), gen(agecat5)
(5410 differences between age and agecat5)

. tab agecat5

  RECODE of |
  age (Age) |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        562       10.39       10.39
          2 |      1,678       31.02       41.40
          3 |      1,870       34.57       75.97
          4 |      1,300       24.03      100.00
------------+-----------------------------------
      Total |      5,410      100.00

Each expression in parentheses is a recoding rule and consists of a list or range of values, followed by an equals sign and a new value. A range is specified with a slash and includes both boundaries, so 17/24 covers 17 to 24. The gen() option stores the recoded values in a new variable, while the existing variable age remains unchanged. Moreover, you can use min to refer to the smallest value and max to refer to the largest value of the variable, as in min/24 and 65/max. Values that are not covered by any rule are kept as they are. You can add an else rule to capture these values and assign them to a specific category.
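The min, max, and else keywords can be sketched as follows. This is only a minimal illustration that assumes the GSOEP dataset used in this tutorial; the variable name agecat_alt is hypothetical and chosen just for this example.

```stata
* Sketch only: agecat_alt is a hypothetical variable name
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear

* min and max stand for the smallest and largest nonmissing value of age;
* else catches every value not covered by an earlier rule (including missings)
recode age (min/24 = 1) (25/44 = 2) (45/64 = 3) (65/max = 4) (else = .), gen(agecat_alt)

* Check the result and show missings as a separate row
tab agecat_alt, missing
```

Writing the outer boundaries with min and max keeps the rules valid even if the age range of the data changes.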

The next example shows that the recode command can also be used to swap certain numeric values of a variable:

. recode female (0 = 1) (1 = 0)
(female: 5410 changes made)

. tab female

   Female - |
      Dummy |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |      2,825       52.22       52.22
     Female |      2,585       47.78      100.00
------------+-----------------------------------
      Total |      5,410      100.00

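Since no option was applied, the existing variable female itself has been recoded. Now, all women take the value 0 and all men take the value 1. We simply swapped the values of this variable. Note that recode does not touch the value labels: they are still attached to the old codes, so the table above now lists the women under the label Male. I recommend that you always use the gen() option or make a copy of the original variable before recoding it.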
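Making a copy before recoding could look like the following sketch. It assumes the GSOEP dataset of this tutorial; the name female_orig is hypothetical. The clonevar command copies the values together with the storage type, labels, and display format.

```stata
* Sketch only: female_orig is a hypothetical variable name
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear

* Keep an untouched copy before recoding
clonevar female_orig = female

* Recode in place; female_orig still holds the original coding
recode female (0 = 1) (1 = 0)

* Cross-tabulate the copy and the recoded variable to verify the swap
tab female_orig female, nolabel
```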

2.3  Variable Names Convention, Dropping Variables, and Missings

Variable names can have up to 32 characters, but many commands print only 12. Since shorter names are easier to type, I recommend a maximum length of 8 to 12 characters for variable names. A variable name is a sequence of 1 to 32 letters (A-Z, a-z, and any Unicode letter), digits (0-9), and underscores (_). Stata names are case-sensitive, which means that Age and age are two different variables.

Furthermore, the first character of a variable name must be a letter or an underscore. I recommend, however, that you do not begin your variable names with an underscore because all of Stata's built-in variables begin with an underscore. Moreover, Stata reserves the following names:

_all float _n _skip
_b if _N str#
byte in _pi strL
_coef int _pred using
_cons long _rc with
double

It pays to develop a convention for naming variables and to stick to it. I prefer short lowercase names and tend to use single words or abbreviations rather than multi-word names with underscores; for example, I prefer hhinc to household_income, although both names are legal.
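The naming rules above can be tried out on a small scratch dataset. The following is a minimal sketch; the variables are created only for illustration and have nothing to do with the GSOEP data.

```stata
* Scratch dataset with three empty observations
clear
set obs 3

* Names are case-sensitive: age and Age are two different variables
gen age = 30
gen Age = 40
describe

* Reserved names such as _all cannot be used as variable names.
* capture suppresses the error and stores the return code in _rc
* (a nonzero value means that the command failed)
capture gen _all = 1
display _rc
```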

There are two main commands for removing data and variables from memory: drop and keep. Remember that they affect only what is in memory. Neither command alters anything that has been saved to the disk. The drop command is used to remove variables or observations from the dataset in memory. If you want to drop variables after you reload the GSOEP dataset, just type

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear 
(SOEP 2009 (Kohler/Kreuter))

. drop rent size rooms

If you want to drop observations, you have to use an if- or an in-qualifier. For example, we first drop the last ten observations by applying the in-qualifier, and then we delete all men from the dataset by applying the if-qualifier.

. drop in -10/l
(10 observations deleted)

. drop if gender == "Male"
(2,581 observations deleted)

These changes apply only to the data in memory. If you want to make the changes permanent, you need to save the dataset. The keep command preserves the specified variables or observations and removes everything else. Thus, it works as the inverse of the drop command. If you want to keep a specific list of variables and drop the rest, just type

. keep pnr female age 

If you want to keep certain observations, the same syntax as for the drop command applies. For example, we first keep only the first 100 observations, and then keep all individuals younger than 45 years.

. keep in 1/100
(2,719 observations deleted)

. keep if age < 45
(52 observations deleted)

You can use the browse and describe commands to take a look at your miniature dataset in memory.

So far we have become acquainted with Stata without dealing with missing values (missings for short). For the rest of the tutorial, however, it is indispensable to understand how missings are coded in Stata. Like other statistical packages, Stata distinguishes missing values from valid values. The basic missing value for numeric variables is represented by a dot (.). There are 26 additional missing-value codes, denoted by .a to .z. Internally, these values are represented as very large numbers that exceed every valid value, and the following ranking applies:

. < .a < .b < ... < .z
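This ranking can be checked directly with display, which prints 1 for a true comparison and 0 for a false one. The sketch below is meant to be run from a do-file (because of the // comments); the number 897756 is simply the largest income in the dataset and serves as an arbitrary valid value. Each of the four lines displays 1.

```stata
* Each comparison is displayed as 1 (true) or 0 (false)
display 897756 < .      // a valid number is smaller than the basic missing
display . < .a          // the basic missing is smaller than .a
display .a < .z         // the additional codes are ordered alphabetically
display missing(.b)     // missing() is true for every missing code
```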

Because missings are internally stored as very large numbers, using the operators > and >= in an if-qualifier may produce erroneous results if this property is not taken into account. Let's look at an example of this problem. We want to know the mean income of individuals who obtained more than 14 years of education (educy):

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear 
(SOEP 2009 (Kohler/Kreuter))

. sum income if educy > 14

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |      1,119    32226.08    64576.71          0     897756

The summary statistics, however, are incorrect if there are individuals with valid income values but missings for the education variable, because a missing counts as larger than 14. The command thus calculates the mean income of individuals who obtained more than 14 years of education or have not indicated their years of education. Since we do not want to consider the latter group of individuals, the correct command must be

. sum income if educy > 14 & educy < .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        804    42125.13    73451.05          0     897756

As you can see, the two commands yield different mean incomes, which is due to the exclusion of individuals without education information in the second command. Furthermore, Stata has the missing() function, which can be used within an if-qualifier to exclude observations with missings for a certain variable.

. sum income if educy > 14 & !missing(educy)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      income |        804    42125.13    73451.05          0     897756

We get the same result as above. But why are there several missing values in Stata and why is a simple dot not sufficient as a missing definition? Various missing values are useful for survey data. If a respondent has a missing for a particular question or variable, there may be different reasons behind that. The first would be that the respondent did not answer the question. The second possible reason is that the question did not apply to the respondent. For example, if there is a question about my spouse's income, a missing could occur due to one of these two reasons. In the first case, the reason would be that I don't want to disclose my spouse's income. In the second case, the reason would be that I have no spouse. Of course, there could be many more reasons, but I think these two examples make my point clear. Thus, if there is only one missing value, we lose the information about the reasons for a missing observation. Being able to assign different missing values enables us to account for the reasons. In our example, we can choose the missing values .a and .b to distinguish between the two reasons.
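The spouse's income example can be sketched with a small toy dataset. Everything in it is an assumption made for illustration: the variable spinc, the four observations, and the convention that the raw data use -1 for "no answer" and -2 for "not applicable". The mvdecode command converts such numeric codes into missing values.

```stata
* Toy data (hypothetical): -1 = no answer, -2 = not applicable
clear
input id spinc
1 32000
2 -1
3 -2
4 45000
end

* Convert the numeric codes to two different missing values
mvdecode spinc, mv(-1=.a \ -2=.b)

* Both codes are ignored in statistics ...
sum spinc

* ... but remain distinguishable
count if spinc == .a
count if spinc == .b
count if missing(spinc)
```

The mean is now computed from the two valid incomes only, while the reason for each missing can still be recovered.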

Now, how can we detect missings in our data or variables? If we want to know the number of missings for a categorical variable with numeric values, we can use the missing option of the tab command

. tab educy, missing

  Number of |
   Years of |
  Education |      Freq.     Percent        Cum.
------------+-----------------------------------
        8.7 |        640       11.83       11.83
         10 |      1,321       24.42       36.25
         11 |      1,168       21.59       57.84
         12 |        487        9.00       66.84
         13 |        357        6.60       73.44
         14 |        191        3.53       76.97
         15 |        243        4.49       81.46
       16.1 |        166        3.07       84.53
         18 |        465        8.60       93.12
          . |        372        6.88      100.00
------------+-----------------------------------
      Total |      5,410      100.00

If we want to know the number of missings for a continuous or quantitative variable, we can apply the misstable summarize command

. misstable sum income
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        income |       632               4,778  |   >500          0      897756
  -----------------------------------------------------------------------------

. misstable sum _all
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        income |       632               4,778  |   >500          0      897756
         hhinc |         4               5,406  |   >500        583      507369
          educ |       761               4,649  |      4          1           4
         educy |       372               5,038  |      9        8.7          18
          ausb |     1,309               4,101  |      4          1           4
        emplst |       155               5,255  |      6          1           6
           lfp |        23               5,387  |      4          1           4
          task |       622               4,788  |      7          1           7
        health |        77               5,333  |     11          0          10
        satlif |        88               5,322  |     11          0          10
        polint |        89               5,321  |      4          1           4
         party |     3,309               2,101  |      7          1           7
        suppar |        85               5,325  |      2          1           2
        worpea |        85               5,325  |      3          1           3
        worter |        92               5,318  |      3          1           3
        worcri |        92               5,318  |      3          1           3
        worimm |       101               5,309  |      3          1           3
        worhfo |       111               5,299  |      3          1           3
        worjos |     2,329               3,081  |      3          1           3
          rent |     3,049               2,361  |   >500       27.3      3003.7
        condit |        10               5,400  |      4          1           4
        satliv |        95               5,315  |     11          0          10
  -----------------------------------------------------------------------------

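The notation _all tells Stata to apply the command to all variables of the dataset. By default, misstable summarize lists only those variables that contain missing values, which is why variables such as age do not appear in the table. If you want to recode a valid numeric value of a variable to a specific missing, you can use the recode command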

. recode female (2 = .a)
(female: 0 changes made)

Data Documentation

Now, we will discuss, in brief, the labeling of the dataset, variables, and values. Such labeling is critical to the careful use of data. Labeling variables with descriptive names clarifies their meanings and their measurements. Labeling values of numerical categorical variables ensures that the real-world meanings of the encodings are not forgotten. These points are crucial when sharing data with others, including your future self. Labels are also used in the output of most Stata commands, so proper labeling of the dataset will produce much more readable results.

Let's start with variable labels. Since we use abbreviations and short notations for variables in the dataset, labeling variables is essential. We can label a variable by using the command label variable.

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear 
(SOEP 2009 (Kohler/Kreuter))

. label var ausb "Educational Attainment"

. desc aus

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------
ausb            byte    %40.0g     ausb       Educational Attainment

The command is followed by the variable to be labeled. Then we type our preferred label in quotation marks. Next, we will take a look at value labels. These are very important because the numerical values of the categorical variables would have no real-world meaning otherwise. Value labels allow numeric variables to have words associated with numeric codes. Stata has a two-step approach to defining labels. First, you define a named label set which associates integer codes with labels, using the label define command. Value labels can be up to 32,000 characters long, whereas variable labels are limited to 80 characters, but short labels are far easier to read in output. Then you assign the set of labels to a variable, using the label values command.

. recode female (1 = 0) (0 = 1), gen(male)
(5410 differences between female and male)

. label define male_lb 0 "Female" 1 "Male"

. label values male male_lb

First, we created a new variable male which takes the value 1 for men and 0 for women. Then we defined the new value label male_lb and assigned the text "Female" to the value 0 and "Male" to the value 1. Next, we associated our new value label with the male variable. I highly recommend using the same name for the value label set and the variable because then you do not have to remember which name comes first in the label values command.

. label define male 0 "Female" 1 "Male"

. label values male male

One advantage of this two-step approach is that you can use the same set of value labels for several variables. The canonical example is

. label define yesno 1 "yes" 0 "no"

This label set can then be associated with all 0-1 variables in your dataset by listing all of these variables after label values and putting the name of the value label, yesno, at the end of the command. Moreover, label sets can be changed using the options add or modify. Just check help label. Since we can define different missing values, we can also assign value labels to them. For example,

. label define party .a "No answer" .b "Not applicable", modify
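Attaching the yesno label set to several variables at once could look like the sketch below. It assumes the GSOEP dataset of this tutorial; the two 0-1 variables retired and haskids are hypothetical and are generated here only to have something to label.

```stata
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear

* Two illustrative 0-1 variables (hypothetical names); missings are preserved
gen byte retired = (emplst == 5) if !missing(emplst)
gen byte haskids = (hhkids > 0)  if !missing(hhkids)

* Define the label set once ...
label define yesno 1 "yes" 0 "no"

* ... and attach it to several variables: variable list first, label name last
label values retired haskids yesno

tab retired haskids, missing
```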

Using the desc command, you can check whether your variables have value labels. If you want to take a look at one or several specific value labels you can use the label list command.

. desc

Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta
  obs:         5,410                          SOEP 2009 (Kohler/Kreuter)
 vars:            37                          23 Sep 2015 16:20
 size:       389,520                          
---------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------
pnr             long    %12.0g                Person Number
hhnr            long    %12.0g                Houshold Number
gender          str6    %9s                   Gender
female          byte    %20.0g     female     Female - Dummy
age             float   %9.0g                 Age
marst           byte    %29.0g     marst      Marital Status of Individual
marr            float   %11.0g     marr       Married / Not Married - Dummy
hhmem           byte    %8.0g                 Number of Persons in Household
hhkids          byte    %8.0g                 Number of Kids (0-14 Years) in Household
hhtyp           byte    %35.0g     hhtyp      Household Type
income          long    %10.0g                Individual Labor Earnings (in Euro)
hhinc           long    %10.0g                Household Post-Government Income (in Euro)
educ            byte    %28.0g     educ       Education
educy           float   %9.0g                 Number of Years of Education
ausb            byte    %40.0g     ausb       Educational Attainment
emplst          byte    %44.0g     emplst     Employment Status
lfp             float   %18.0g     lfp        Labor Force Participation
task            byte    %58.0g     task       Working Task of Dependent Employees
state           byte    %22.0g     state      State of Residence
health          byte    %32.0g     health     Satisfaction with Health
satlif          byte    %32.0g     satlif     Overall Life Satisfaction
polint          byte    %20.0g     polint     Political Interests
party           byte    %15.0g     party      Political party supported
suppar          byte    %20.0g     suppar     Supports political party
worpea          byte    %20.0g     worpea     Worried about peace
worter          byte    %20.0g     worter     Worried about global terrorism
worcri          byte    %20.0g     worcri     Worried about crime in Germany
worimm          byte    %20.0g     worimm     Worried about immigration to Germany
worhfo          byte    %20.0g     worhfo     Worried about hostility to foreigners
worjos          byte    %20.0g     worjos     Worried about job security
size            float   %12.0g                Size of Housing (in m^2)
rent            float   %12.0g                Rent Minus Heating Costs (in Euro)
rooms           byte    %8.0g                 Number of Rooms > 6m^2
renttype        byte    %20.0g     renttype   Status of living
condit          byte    %24.0g     condit     Condition of house
satliv          byte    %45.0g     satliv     Satisfaction with Living/Habitation
male            byte    %9.0g      male       RECODE of female (Female - Dummy)
---------------------------------------------------------------------------------------------------------------------------
Sorted by: pnr
     Note: Dataset has changed since last saved.

. label list emplst
emplst:
           1 Full-Time Employee
           2 Part-Time Employee
           3 Irregular Employee
           4 Unemployed
           5 Retired
           6 Not in Labor Force

Work Documentation

While it is fun to type commands interactively and see the results straightaway, serious work requires that you save your results and keep track of the commands that you have used, so that you can document your work and reproduce it later if needed.

4.1  Log-Files

When you work on an analysis, it is worthwhile to behave like a bench scientist and keep a lab notebook of your actions so that your work can be easily replicated. Everyone has a feeling of complete omniscience while working intensely - this feeling is wonderful but fleeting. By the next day, the exact little details needed for perfect duplication have become obscure. Stata has a lab notebook on hand: the log file.

A log file is simply a record of your Results window. It records all commands and all textual output in real time. Thus it keeps your lab notebook for you as you work. Because it saves the file to the disk while it writes the Results window, it also protects you from disastrous failures, be they power failures or computer crashes. I recommend that you start a log file whenever you begin any serious work in Stata.

To open a log file, use the log using command and give your log file a meaningful filename.

. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear 
(SOEP 2009 (Kohler/Kreuter))

. log using project1, replace
---------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl
  log type:  smcl
 opened on:   2 Aug 2018, 19:08:13

The replace option ensures that an existing log file with the name project1 will be overwritten. This will often be the case if you need to re-run your commands several times to get them right. By default, Stata will save the log file in its Stata Markup and Control Language (SMCL) format, which preserves all formatting and links from the Results window.

If you want to temporarily suspend logging and resume it later, just use the commands log off and log on:

. tab emplst

 Employment Status |      Freq.     Percent        Cum.
-------------------+-----------------------------------
Full-Time Employee |      2,040       38.82       38.82
Part-Time Employee |        599       11.40       50.22
Irregular Employee |        288        5.48       55.70
        Unemployed |        312        5.94       61.64
           Retired |      1,389       26.43       88.07
Not in Labor Force |        627       11.93      100.00
-------------------+-----------------------------------
             Total |      5,255      100.00

. log off
      name:  <unnamed>
       log:  N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl
  log type:  smcl
 paused on:   2 Aug 2018, 19:08:13
---------------------------------------------------------------------------------------------------------------------------

. drop party

. log on
---------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl
  log type:  smcl
resumed on:   2 Aug 2018, 19:08:14

To finish your logging, close your log file using the log close command. Once the log file is closed and saved on the hard disk, you can use the view command to open the file in the Viewer of Stata or you can directly print the content of your log file using the print command.

. tab lfp

       Labor Force |
     Participation |      Freq.     Percent        Cum.
-------------------+-----------------------------------
Dependent Employee |      2,846       52.83       52.83
     Self-Employed |        213        3.95       56.78
        Unemployed |        312        5.79       62.58
Not in Labor Force |      2,016       37.42      100.00
-------------------+-----------------------------------
             Total |      5,387      100.00

. log close
      name:  <unnamed>
       log:  N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl
  log type:  smcl
 closed on:   2 Aug 2018, 19:22:29
---------------------------------------------------------------------------------------------------------------------------

. view project1.smcl
. print project1.smcl

As log files in SMCL format are best read within Stata, a log file can alternatively be saved in plain-text (ASCII) format, either by giving the filename the extension .log or by adding the text option to log using (with the text option, other extensions such as .txt are possible as well). However, I recommend that you use the default SMCL format because SMCL files can be translated into a variety of formats, such as plain-text log, PostScript, and PDF, using the translate command.

. translate project1.smcl project1.pdf, replace
(file project1.pdf written in PDF format)
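A plain-text log could be written as in the following sketch. The filename project1_text is hypothetical, and the last command assumes that project1.smcl from the session above exists in the working directory.

```stata
* Sketch only: project1_text is a hypothetical filename
capture log close
log using project1_text.log, replace text   // plain-text log instead of SMCL

display "This line is recorded as plain text"

log close

* An existing SMCL log can also be converted to plain text afterwards
translate project1.smcl project1.log, replace
```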

4.2  Do-Files and Comments

Stata comes with an integrated text editor called the Do-file Editor, which can be used for many tasks. It gets its name from the term do-file, which is a file containing a list of commands for Stata to run (called a batch file or a script in other programs). Although the Do-file Editor has advanced features that can help in writing such files, it can also be used to build up a series of commands that can then be submitted to Stata all at once. This feature can be handy when writing a loop to process multiple variables in a similar fashion or when doing complex, repetitive tasks interactively. Thus, you can run your program directly from the editor without using the Command Window anymore.

To access Stata's Do-File Editor, use the shortcut Ctrl+9 or type the command doedit in the Command window. Do-files have the extension .do and existing do-files can be opened by typing

doedit dofilename.do

There are several useful shortcuts for working within a do-file:

Shortcut            Action
Ctrl + s            Save the do-file
Ctrl + d            Execute all commands of the do-file, starting at the beginning of the do-file
Ctrl + Shift + d    Execute all commands starting from the current cursor position

If you want to execute a do-file from your hard disk, you can use the do command by typing

do dofilename.do

You will notice that the color of the text changes as you type within a do-file. The different colors are examples of the Do-File Editor's syntax highlighting which you can modify if you want to.

Code that looks obvious to you may not be so obvious to a co-worker, or even to you a few months later. It is always a good idea to annotate your do-files with explanatory comments that provide the gist of what you are trying to do. If the default settings of highlighting within do-files have not been modified, comments are shown in green. There are three alternative ways of writing comments in a do-file:

  1. Single Comment: *
    You can start a new line with a * to indicate that this line is a comment, not a command.

    . sum income
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          income |      4,778    20542.17    37426.25          0     897756
    
    . * The sum command calculates the mean value of a single variable or several variables
    
  2. Toggle Comment: //
    A toggle comment // is placed at the end of a command and indicates that everything that follows to the end of the line is a comment and should be ignored by Stata.

    . gen loginc = log(income)        //      New Variable with Logarithm of Income
    (2,001 missing values generated)
    
  3. Block Comment: /*[...]*/
    A block comment /*[...]*/ is used to indicate that all text between the opening /* and the closing */, which may be a few characters or may span several lines, is a comment to be ignored by Stata. This type of comment can be used anywhere, even in the middle of a line, and is usually used to "comment out" temporarily unused commands.

    . replace loginc = 0              if loginc >= .
    (2,001 real changes made)
    
    . /*
    > All Missings in "loginc"
    > are assigned a zero value
    > */
    

Commands can often be very long, especially when it comes to graph commands. In a do-file you will probably want to break long commands into several lines to improve readability. There are two alternative ways to tell Stata that a command continues on the next line or lines:

  1. Triple Slashes: ///
    Triple Slashes say that everything after them to the end of the line is a comment and the command itself continues on the next line.

    . sum income ///
    > if educ == 1 & ///
    > female == 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          income |        755    8132.087    24628.12          0     612757
    
  2. Delimiter: ;
    Alternatively, you can tell Stata to use a semicolon instead of the carriage return at the end of the line to mark the end of a command by using #delimit ;. From then on, all commands need to terminate with a semicolon. To return to using the carriage return as the delimiter, use #delimit cr. Remember, the delimiter can only be changed in do-files.

    . desc income
    
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------------------------------------------------------
    income          long    %10.0g                Individual Labor Earnings (in Euro)
    
    . #delimit ;
    delimiter now ;
    . sum income
    > if educ == 1 &
    > female == 1 ;
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
          income |        755    8132.087    24628.12          0     612757
    
    . #delimit cr
    delimiter now cr
    . desc income
    
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------------------------------------------------------
    income          long    %10.0g                Individual Labor Earnings (in Euro)
    

Now, let's take a look at a sample do-file and what it should contain at a minimum:

/*
An Introduction to Stata
Mustafa Coban
July 2018
*/

version 15
clear
set more off
capture log close
log using project1.smcl, replace

// Load GSOEP dataset
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear 
 
sum income ///
if educ == 1 & ///
female == 1

#delimit ;
gen loginc = log(income) ;
replace loginc = 0
if loginc >= . ;
#delimit cr

* Replace Missings with zero values

desc loginc income

log close
exit

It is always a good idea to start every do-file with comments that include at least a title, the name of the programmer who wrote the file, and the date. Assumptions about required files should also be noted. Then we specify the version of Stata we are using, in this case 15. This ensures that future versions of Stata will continue to interpret the commands correctly, even if Stata has changed. The clear statement deletes the data currently held in memory and any value labels you might have. We need clear in case we have to rerun the program. The set more off command ensures that the execution of the do-file is not interrupted when the output does not fit into the Results window. If an earlier run of the do-file has failed, it is likely that you still have a log file open, in which case the log using command will fail. Thus, we first have to close any open log with log close. The problem with this solution is that log close itself stops with an error if no log file is open. The way out of this problem is the prefix capture. This prefix tells Stata to run the command that follows and ignore any errors. Use it judiciously, because it also hides errors that you would want to see. At the end of the do-file we close the log file and exit the do-file.
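How capture behaves can be sketched as follows. After a captured command, Stata stores the return code in _rc, where 0 means success and any other value means that the command failed. The second part is only an illustration of a more targeted use and assumes the GSOEP dataset of this tutorial.

```stata
* capture runs the command and suppresses its output and any error message
capture log close
display "Return code of log close: " _rc

* A more targeted use: create loginc only if it does not exist yet
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
capture confirm variable loginc
if _rc != 0 {
    gen loginc = log(income)
}
```

Checking _rc right after the captured command keeps the do-file running while still letting you react to the specific failure.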

Recommendation
After learning how to use do-files, there is no need to use the Command window anymore. Any command you want to test out can be written in a do-file, commented on, and executed directly from the do-file. The saved do-file allows you to check the next day everything that you have coded so far. Furthermore, there is no need to save your data transformations in many different temporary datasets after each milestone. You can start by loading the original dataset and, after all transformations, save a single final master dataset for your upcoming analysis.
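Such a workflow could be sketched as a short do-file. The transformations are taken from the examples in this part; the variable label text and the filename gsoep_master.dta are hypothetical.

```stata
* Sketch of a do-file workflow; gsoep_master.dta is a hypothetical filename
version 15
clear
use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear

* All transformations live in the do-file, not in intermediate datasets
recode age (17/24 = 1) (25/44 = 2) (45/64 = 3) (65/100 = 4), gen(agecat5)
gen loginc = log(income)
label variable loginc "Log of individual labor earnings"

* Save one final dataset for the upcoming analysis
save gsoep_master.dta, replace
```

Rerunning the do-file rebuilds the master dataset from the original data, so no intermediate copies need to be kept.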