1 Stata Command Syntax
Having used a few Stata commands, it may be time to comment briefly on their structure. One of Stata’s great strengths is the consistency of its command syntax. Most of Stata’s commands share the following syntax, where bold indicates keywords and square brackets mean that something is optional
[by varlist:] command [varlist] [if exp] [in range] [weight] [, options]
In this diagram, varlist denotes a list of variable names, command denotes a Stata command, exp denotes an algebraic expression, range
denotes an observation range, weight denotes a weighting expression, and options denotes a list of options. Let’s briefly describe each syntax element:
- varlist: If no varlist appears, these commands assume a varlist of _all, the Stata shorthand for indicating all the variables in the dataset. Some commands take a varname, rather than a varlist. A varname refers to exactly one variable
- if exp: The if-qualifier restricts the scope of a command to observations for which the value of the expression is true (which is equivalent to the expression being nonzero)
- in range: The in-qualifier qualifier restricts the scope of the command to a specific observation range. A range specification takes the form $\#_1 [/ \#_2]$ , where $\#_1$ and $\#_2$ are positive or negative integers. Negative integers are understood to mean „from the end of the data“, with -1 referring to the last observation. The implied first observation must be less than or equal to the implied last observation. The first and last observations in the dataset may be denoted by f and l, respectively.
- weight: Some commands allow the use of weights. You should use these if you want to do some analysis for the whole population based on your sample.
1.1 The by-Prefix
The by varlist: prefix causes Stata to repeat a command for each subset of the data for which the values of the variables in varlist are equal. When prefixed with by varlist:, the result of the command will be the same as if you had formed separate datasets for each group of observations, saved them, and then given the command on each dataset separately. The data must already be sorted by varlist, although by has a sort option. The by prefix is important for understanding data manipulation and working with subpopulations within Stata. Furthermore, the varlist in by varlist: may contain string variables, numeric variables, or both.
Let’s show how the by-prefix works with a small example. We reload the GSOEP data set and calculate the average income for the women and men in the sample. First, the data are sorted by gender, whereby two alternatives are available for sorting. On the one hand, we can just sort the data before we use the by-prefix in combination with the sum
command.
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
(SOEP 2009 (Kohler/Kreuter))
. sort gender
. by gender: sum income
---------------------------------------------------------------------------------------------------------------------------
-> gender = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,458 13323 21290.77 0 612757
---------------------------------------------------------------------------------------------------------------------------
-> gender = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,320 28190.75 47868.24 0 897756
On the other hand, we can also sort directly within the by-prefix
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
(SOEP 2009 (Kohler/Kreuter))
. bysort gender: sum income
---------------------------------------------------------------------------------------------------------------------------
-> gender = Female
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,458 13323 21290.77 0 612757
---------------------------------------------------------------------------------------------------------------------------
-> gender = Male
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,320 28190.75 47868.24 0 897756
As we can see, both approaches lead to the same result. Now, which of the two alternatives is preferred? As already explained in Part 1 of this tutorial, the command sort leads to an ascending sorting of the data set by the respective variables. Thus, the first alternative, i.e. sorting outside the by-prefix is always appropriate when we need descending sorting and therefore work with the gsort command.
Since the concept of the by-prefix is not clearly visible from the command lines, let’s go step by step through Stata’s procedure. First, let’s sort the data by the identification number pnr and view the selected variables in the Browser window.
. sort pnr . br pnr gender income
As you can see, the observations are sorted by pnr and each respondent has a different gender and income. Next, we sort the data by gender and view the data again in the Browser window. Since the gender variable is a string variable, the sort command sorts the observations alphabetically. The command that follows the by-prefix is then executed for each category of the variable gender. Since the variable gender only has two possible values, the sum command is
only executed twice.
Now, let’s obtain average incomes according to gender (gender) broken down by educational attainment (educ). Thus, we type
. sort gender educ . by gender educ: sum income --------------------------------------------------------------------------------------------------------------------------- -> gender = Female, educ = Elementary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 755 8132.087 24628.12 0 612757 --------------------------------------------------------------------------------------------------------------------------- -> gender = Female, educ = Intermediate Secondary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 821 14754.12 13836.29 0 102111 --------------------------------------------------------------------------------------------------------------------------- -> gender = Female, educ = Technical Secondary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 107 20141.42 17049.57 0 84958 --------------------------------------------------------------------------------------------------------------------------- -> gender = Female, educ = Maturity Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 448 22042.61 27588.56 0 424107 --------------------------------------------------------------------------------------------------------------------------- -> gender = Female, educ = . Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 327 7537.804 13151.93 0 96065 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male, educ = Elementary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 783 19431.59 25558.53 0 365076 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male, educ = Intermediate Secondary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 628 27109.07 21148.42 0 109121 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male, educ = Technical Secondary Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 138 40967.2 41181.82 0 287194 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male, educ = Maturity Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 449 51046.58 92057 0 897756 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male, educ = . Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 322 14253.83 18768.37 0 108143
Sorting is as follows: First, the observations are sorted by the gender variable, while no sorting by the second variable educ takes place. In the next step, the first sorting by the gender variable is preserved and the observations are sorted by the second variable educ in each category or possible value of the first variable gender. Now, the command that follows the by-prefix is executed for each combination of the two variables in varlist of the by-prefix.
1.2 Wildcards and Ordering
Variable lists (or varlists) can be specified in a variety of ways, all designed to save typing and encourage good variable names. If you want to address several variables in a command, you can use different placeholders to save some time.
- If the variables differ only in a single character, you should use a question mark.
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear (SOEP 2009 (Kohler/Kreuter)) . desc educ? storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- educy float %9.0g Number of Years of Education
- The wildcard * can be used to address variables whose names are partially identical.
. desc hh* *nr storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- hhnr long %12.0g Houshold Number hhmem byte %8.0g Number of Persons in Household hhkids byte %8.0g Number of Kids (0-14 Years) in Household hhtyp byte %35.0g hhtyp Household Type hhinc long %10.0g Household Post-Government Income (in Euro) pnr long %12.0g Person Number hhnr long %12.0g Houshold Number
- Variables that are arranged one after the other in the data set can be addressed together with a hyphen.
. desc task-rooms storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- task byte %58.0g task Working Task of Dependent Employees state byte %22.0g state State of Residence health byte %32.0g health Satisfaction with Health satlif byte %32.0g satlif Overall Life Satisfaction polint byte %20.0g polint Political Interests party byte %15.0g party Political party supported suppar byte %20.0g suppar Supports political party worpea byte %20.0g worpea Worried about peace worter byte %20.0g worter Worried about global terrorism worcri byte %20.0g worcri Worried about crime in Germany worimm byte %20.0g worimm Worried about immigration to Germany worhfo byte %20.0g worhfo Worried about hostility to foreigners worjos byte %20.0g worjos Worried about job security size float %12.0g Size of Housing (in m^2) rent float %12.0g Rent Minus Heating Costs (in Euro) rooms byte %8.0g Number of Rooms > 6m^2
Since the use of hyphens in varlists depends on the order of the variables in the dataset, we briefly introduce the order
command. This command enables changing the order of variables in the dataset. If we type
. desc Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta obs: 5,410 SOEP 2009 (Kohler/Kreuter) vars: 36 23 Sep 2015 16:20 size: 384,110 --------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- pnr long %12.0g Person Number hhnr long %12.0g Houshold Number gender str6 %9s Gender female byte %20.0g female Female - Dummy age float %9.0g Age marst byte %29.0g marst Marital Status of Individual marr float %11.0g marr Married / Not Married - Dummy hhmem byte %8.0g Number of Persons in Household hhkids byte %8.0g Number of Kids (0-14 Years) in Household hhtyp byte %35.0g hhtyp Household Type income long %10.0g Individual Labor Earnings (in Euro) hhinc long %10.0g Household Post-Government Income (in Euro) educ byte %28.0g educ Education educy float %9.0g Number of Years of Education ausb byte %40.0g ausb Ausbildungsabschluss emplst byte %44.0g emplst Employment Status lfp float %18.0g lfp Labor Force Participation task byte %58.0g task Working Task of Dependent Employees state byte %22.0g state State of Residence health byte %32.0g health Satisfaction with Health satlif byte %32.0g satlif Overall Life Satisfaction polint byte %20.0g polint Political Interests party byte %15.0g party Political party supported suppar byte %20.0g suppar Supports political party worpea byte %20.0g worpea Worried about peace worter byte %20.0g worter Worried about global terrorism worcri byte %20.0g worcri Worried about crime in Germany worimm byte %20.0g worimm Worried about immigration to Germany worhfo byte %20.0g worhfo Worried about hostility to foreigners worjos byte %20.0g worjos Worried about job security size float %12.0g Size of Housing (in m^2) rent float %12.0g Rent Minus Heating Costs (in Euro) rooms byte %8.0g Number of Rooms > 6m^2 renttype byte %20.0g renttype Status of living condit byte %24.0g condit Condition of house satliv byte %45.0g satliv Satisfaction with Living/Habitation --------------------------------------------------------------------------------------------------------------------------- Sorted by: pnr . order wor* hh* female . desc Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta obs: 5,410 SOEP 2009 (Kohler/Kreuter) vars: 36 23 Sep 2015 16:20 size: 384,110 --------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- worpea byte %20.0g worpea Worried about peace worter byte %20.0g worter Worried about global terrorism worcri byte %20.0g worcri Worried about crime in Germany worimm byte %20.0g worimm Worried about immigration to Germany worhfo byte %20.0g worhfo Worried about hostility to foreigners worjos byte %20.0g worjos Worried about job security hhnr long %12.0g Houshold Number hhmem byte %8.0g Number of Persons in Household hhkids byte %8.0g Number of Kids (0-14 Years) in Household hhtyp byte %35.0g hhtyp Household Type hhinc long %10.0g Household Post-Government Income (in Euro) female byte %20.0g female Female - Dummy pnr long %12.0g Person Number gender str6 %9s Gender age float %9.0g Age marst byte %29.0g marst Marital Status of Individual marr float %11.0g marr Married / Not Married - Dummy income long %10.0g Individual Labor Earnings (in Euro) educ byte %28.0g educ Education educy float %9.0g Number of Years of Education ausb byte %40.0g ausb Ausbildungsabschluss emplst byte %44.0g emplst Employment Status lfp float %18.0g lfp Labor Force Participation task byte %58.0g task Working Task of Dependent Employees state byte %22.0g state State of Residence health byte %32.0g health Satisfaction with Health satlif byte %32.0g satlif Overall Life Satisfaction polint byte %20.0g polint Political Interests party byte %15.0g party Political party supported suppar byte %20.0g suppar Supports political party size float %12.0g Size of Housing (in m^2) rent float %12.0g Rent Minus Heating Costs (in Euro) rooms byte %8.0g Number of Rooms > 6m^2 renttype byte %20.0g renttype Status of living condit byte %24.0g condit Condition of house satliv byte %45.0g satliv Satisfaction with Living/Habitation --------------------------------------------------------------------------------------------------------------------------- Sorted by: pnr
the specified variables are placed at the beginning of the data set. Now, variables beginning with a wor are at the beginning. They are followed by the variables beginning with an hh and finally by the gender variable. The remaining variables are appended with no change to their sorting. Further layout rules can be found typing help order.
2 Create New Variables
One of the three commandments says that the original data should never be overwritten. However, this does not mean that the data must not be changed. For data analysis and generation of new knowledge from the existing raw data set, new variables often have to be generated from the existing variables and existing variables have to be modified. In addition, it is common for large datasets to delete variables that are not relevant to a particular application or project. This ensures a better overview for the user and shortens the calculation time of certain analysis methods and algorithms.
2.1 Creating and Modifying Variables
The most common command to generate a new variable is generate
, which is usually abbreviated to gen. This command can be used with the by-prefix as well as with the in– and if-qualifiers. Thus, the basic syntax for this command is
generatenewvariable = expression
First we have to choose a variable name for our newvariable and then type a single equals sign to start the definition of the new variable. An expression is a formula made up of constants, existing variables, operators, and functions. In general we can distinguish between mathematical and logical expressions. The operators needed for these expressions are given below
Arithmetic | Logical | Relational | |||
---|---|---|---|---|---|
+ | addition | ! | not | > | greater than |
- | subtraction | | | or | < | less than |
* | multiplication | & | and | >= | greater than or equal to |
/ | division | <= | less than or equal to | ||
^ | power | == | equal | ||
!= | not equal | ||||
+ | string concatenation |
First, let’s generate a new variable using a mathematical expression. Since the GSOEP has data on a person’s individual labor earnings, we want to take the logarithm of this income variable and call the new variable loginc to make our operation identifiable in the new variable’s name.
. gen loginc = log(income) (2,001 missing values generated) . sum loginc income Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- loginc | 3,409 9.770343 1.124561 3.828641 13.70765 income | 4,778 20542.17 37426.25 0 897756
The new variable has many more missings than the original income variable because the logarithm of a zero income generates always a missing. Stata has many mathematical, statistical, string, date, time-series, and programming functions. Just type help functions to see some basic functions.
Now, let’s generate a new variable using a logical expression. We want to generate a new variable called midage that takes the value 1 if a person is aged between 25 and 64 and otherwise takes the value 0
. gen midage = age >= 25 & age < 64 . tab midage midage | Freq. Percent Cum. ------------+----------------------------------- 0 | 1,913 35.36 35.36 1 | 3,497 64.64 100.00 ------------+----------------------------------- Total | 5,410 100.00
Thus, 3,497 persons are within the defined age range between 25 and 64 years. In a next step we want to generate a new variable using a string variable within the logical expression. For this purpose, let’s apply the string variable gender, which has two possible values, Male and Female. We want to create a new variable called male that takes the value 1 if a person is a man and the value 0 if a person is a woman
. gen male = gender == "Male" . tab male male | Freq. Percent Cum. ------------+----------------------------------- 0 | 2,825 52.22 52.22 1 | 2,585 47.78 100.00 ------------+----------------------------------- Total | 5,410 100.00 . tab gender Gender | Freq. Percent Cum. ------------+----------------------------------- Female | 2,825 52.22 52.22 Male | 2,585 47.78 100.00 ------------+----------------------------------- Total | 5,410 100.00
The logical expression above says that a person is assigned the value 1 in the new variable if the expression is true for this person and otherwise is assigned the value 0, i.e. if the gender variable equals the string „Male“ for a person, then the expression is true. Therefore, logical expressions are case-sensitive and sensitive to spaces. The following procedure leads to a completely different and undesired result due to the extra space at the end of the string notation
. gen male2 = gender == "Male " . tab male2 male2 | Freq. Percent Cum. ------------+----------------------------------- 0 | 5,410 100.00 100.00 ------------+----------------------------------- Total | 5,410 100.00
Using the generate command, we can also generate new string variables. We can combine two or more string variables to a new string variable. But we can also create a new string variable by attaching strings to an existing string variable. First, let’s generate the new string variable gender2 by combining the gender variable with itself
. gen gender2 = gender + gender . list gender2 gender in 1/5 +-----------------------+ | gender2 gender | |-----------------------| 1. | MaleMale Male | 2. | FemaleFemale Female | 3. | MaleMale Male | 4. | FemaleFemale Female | 5. | MaleMale Male | +-----------------------+
As you can see, using the operator „+“ will concatenate the string variables – in our case the replication of the gender variable – without spaces, i.e. simply join them together. Further, the missing value for a string variable is nothing special – it is simply the empty string " ". Second, let’s create a new string variable gender3 by combing the gender variable with a constant string, e.g. the string " - Gender".
. gen gender3 = gender + " - Gender" . list gender3 gender2 gender in 1/5 +-----------------------------------------+ | gender3 gender2 gender | |-----------------------------------------| 1. | Male - Gender MaleMale Male | 2. | Female - Gender FemaleFemale Female | 3. | Male - Gender MaleMale Male | 4. | Female - Gender FemaleFemale Female | 5. | Male - Gender MaleMale Male | +-----------------------------------------+
Stata shows a particularity if you want to change the values of an existing variable because Stata will not let you overwrite an existing variable using the generate command. If you really want to replace the values of an old variable you have to use the replace
command. Thus, Stata uses two different commands to prevent you from accidentally modifying your data. The syntax of the replace command is similar to syntax of the generate command, although the former cannot be abbreviated.
Now, let’s change the values in the male variable to 2 if a person is a man and to 1 if a person is a woman by application of the gender variable within the if-qualifier.
. replace male = 2 if gender == "Male" (2,585 real changes made) . replace male = 1 if gender == "Female" (2,825 real changes made) . tab male male | Freq. Percent Cum. ------------+----------------------------------- 1 | 2,825 52.22 52.22 2 | 2,585 47.78 100.00 ------------+----------------------------------- Total | 5,410 100.00
2.2 More Generating and Recoding Variables
There is another important command to create new variable. Let me introduce the more powerful egen
command which is useful for working across groups of variables or within groups of observations. There are plenty of functions that can be applied by the egen command. Just type help egen to explore some of them. For example, if we are interested in the amount of missing values for an observation from a selected variable list, we can generate a new variable miss by applying the rowmiss()
function to the egen command
. egen miss = rowmiss(gender - educ) . tab miss miss | Freq. Percent Cum. ------------+----------------------------------- 0 | 4,129 76.32 76.32 1 | 1,169 21.61 97.93 2 | 112 2.07 100.00 ------------+----------------------------------- Total | 5,410 100.00
Thus, 4,129 persons have valid values for all five variables, while 112 persons have missing values for two out of the five variables. The rowmiss function is useful if you want to create a dataset containing observations without any missings for your selected variables. Furthermore, you can use the egen command if you want to store summary statistics of a variable in a new variable by group membership. Let’s generate a new variable that stores the mean income of men if a person is a man and the mean income of women otherwise.
. bysort gender: egen incgen_av = mean(income) . bysort gender: sum income --------------------------------------------------------------------------------------------------------------------------- -> gender = Female Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 2,458 13323 21290.77 0 612757 --------------------------------------------------------------------------------------------------------------------------- -> gender = Male Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 2,320 28190.75 47868.24 0 897756 . sort pnr . list incgen_av gender in 1/5 +-------------------+ | incgen~v gender | |-------------------| 1. | 28190.75 Male | 2. | 13323 Female | 3. | 28190.75 Male | 4. | 13323 Female | 5. | 28190.75 Male | +-------------------+
There is another command to generate new variables and modify existing variables. The recode
command is used to group numeric variables into categories or the easily change the values for existing categories in categorical variables. Now, let’s generate a new variable agecat4 that divides persons into four age groups, whereby the first age group is assigned the value 1 and the last age group the value 4.
. sum age Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- age | 5,410 49.50961 18.12717 17 100 . recode age (17/24 = 1) (25/44 = 2) (45/64 = 3) (65/100 = 4), gen(agecat5) (5410 differences between age and agecat5) . tab agecat5 RECODE of | age (Age) | Freq. Percent Cum. ------------+----------------------------------- 1 | 562 10.39 10.39 2 | 1,678 31.02 41.40 3 | 1,870 34.57 75.97 4 | 1,300 24.03 100.00 ------------+----------------------------------- Total | 5,410 100.00
Each expression in the parentheses is a recoding rule and consists of a list or range of values, followed by an equals sign and a new value. A range is specified by using a slash and includes the two boundaries, so 17/24 is 17 to 24. The gen() option guarantees that the new variable is created following the recoding rule, while the existing variable age remains unchanged. Moreover, you can use min to refer to the smallest value and max to refer to the largest value within the recoding rule, as in min/24 and 65/max. Values that are never assigned to a category are kept as they are. You can use else within the recoding rule to capture these values and assign them a specific category.
The next example shows that the recode command can also be used to swap certain numeric values for a variable
. recode female (0 = 1) (1 = 0) (female: 5410 changes made) . tab female Female - | Dummy | Freq. Percent Cum. ------------+----------------------------------- Male | 2,825 52.22 52.22 Female | 2,585 47.78 100.00 ------------+----------------------------------- Total | 5,410 100.00
Since no option was applied, the existing variable female has been recoded. Now, all women take the value 0 and all men take the value 1. We simply swapped the values for this variable. I recommend that you always use the gen() option or make a copy of the original variable before recoding it.
2.3 Variable Names Convention, Dropping Variables, and Missings
Variable names can have up to 32 characters, but many commands print only 12. Since shorter names are easier to type, I recommend a maximum length of 8 to 12 characters for variable names. A variable name is a sequence of 1 to 32 letters (A-Z, a-z, and any Unicode letter), digits (0-9), and underscores (_). Thus, Stata names are case-sensitive, which means that Age and age are two different variables.
Furthermore, the first character of a variable name must be a letter or an underscore. I recommend, however, that you not begin your variable names with an underscore because all Stata’s built-in variables begin with an underscore. Moreover, Stata reserves the following names
_all | float | _n | _skip |
_b | if | _N | str# |
byte | in | _pi | strL |
_coef | int | _pred | using |
_cons | long | _rc | with |
double |
It pays to develop a convention for naming variables and sticking to it. I prefer short lowercase names and tend to use single words or abbreviations rather than multi-word names with underscores; for example, I prefer hhinc to household_income, although both names are legal.
There are two main commands for removing data and variables from memory: drop
and keep
. Remember that they affect only what is in memory. None of these commands alter anything that has been saved to the disk. The drop command is used to remove variables or observations from the dataset in memory. If you want to drop variables after you reload the GSOEP dataset, just type
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear (SOEP 2009 (Kohler/Kreuter)) . drop rent size rooms
If you want to drop observations, you have to use an if– or an in-qualifier. For example, at first we drop the last ten observations by applying the in-qualifier, and then we delete all men from the dataset by applying the if-qualifier.
. drop in -10/l (10 observations deleted) . drop if gender == "Male" (2,581 observations deleted)
These changes are only to the data in memory. If you want to make the changes permanent, you need to save the dataset. The keep command is a command for preserving specified variables or observations. Thus, it works inversely the drop command. If you want to keep a specific list of variables and drop the rest, just type
. keep pnr female age
If you want to keep certain observations, the same syntax of the drop command applies. For example, at first we only keep the first 100 observations, and then keep all individuals younger than 45 years.
. keep in 1/100 (2,719 observations deleted) . keep if age < 45 (52 observations deleted)
You can use the browse and describe commands to take a look at your miniature dataset in memory.
So far we have become acquainted with Stata without dealing with the topic of missings. For the rest of the tutorial, however, it is indispensable to understand how missings are coded and programed in Stata. Like other statistical packages, Stata distinguishes missing values. The basic missing value for numeric variables is represented by a dot . There are 26 additional missing-value codes denoted by .a to .z. These values are represented internally as incredibly large numbers and the following ranking applies
. < .a < .b < ... < .z
Because missings internally take very large numbers, using the operators > and >= with an if-qualifier may produce erroneous results if this programming property is not taken into account. Let’s look at example for this missing problem. We want to know the mean income of individuals who obtained more than 14 years of education (educy)
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear (SOEP 2009 (Kohler/Kreuter)) . sum income if educy > 14 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 1,119 32226.08 64576.71 0 897756
The summary statistics, however, may be incorrect if there are individuals with valid income values, but missings for the education variable. Thus, the command calculates the mean income of individuals who obtained more than 14 years of education or have not indicated their years of education. Since we do not want to consider the latter group of individuals, the correct command must be
. sum income if educy > 14 & educy < . Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 804 42125.13 73451.05 0 897756
As you can see, the two commands lead to a difference in mean incomes, which is due to the exclusion of individuals with no education information in the second command. Furthermore, Stata has the missing()
function which can be used within an if qualifier to exclude observations with missings for a certain variable.
. sum income if educy > 14 & !missing(educy) Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 804 42125.13 73451.05 0 897756
We get the same result as above. But why are there several missing values in Stata and why is a simple dot not sufficient as a missing definition? Various missing values are useful for survey data. If a respondent has a missing for a particular question or variable, there may be different reasons behind that. The first would be if the respondent hadn’t answered the question. The second possible reason is if the question did not apply to the respondent. For example, if there is a question about my spouse’s income, a missing could occur due to one of these two reasons. In the first case, the reason would be that I don’t want to disclose my spouse’s income. The second case would be that I have no spouse. Of course, there could many more reasons, but I think these two examples make my point clear. Thus, if there is only one missing value, we lose the information about the reasons for this missing observation. But having the possibility to assign different missing values enables us to account for the reasons. In our example, we can choose the missing values .a and .b to distinguish between the two reasons.
Now, how can we detect missings in our data or variables? If we want to know the amount of missings for a categorical variable with numeric values, we can use the missing
of the tab command
. tab educy, missing Number of | Years of | Education | Freq. Percent Cum. ------------+----------------------------------- 8.7 | 640 11.83 11.83 10 | 1,321 24.42 36.25 11 | 1,168 21.59 57.84 12 | 487 9.00 66.84 13 | 357 6.60 73.44 14 | 191 3.53 76.97 15 | 243 4.49 81.46 16.1 | 166 3.07 84.53 18 | 465 8.60 93.12 . | 372 6.88 100.00 ------------+----------------------------------- Total | 5,410 100.00
If we want to know the amount of missings for a continuous or quantitative variable, we can apply the misstable summarize
command
. misstable sum income Obs<. +------------------------------ | | Unique Variable | Obs=. Obs>. Obs<. | values Min Max -------------+--------------------------------+------------------------------ income | 632 4,778 | >500 0 897756 ----------------------------------------------------------------------------- . misstable sum _all Obs<. +------------------------------ | | Unique Variable | Obs=. Obs>. Obs<. | values Min Max -------------+--------------------------------+------------------------------ income | 632 4,778 | >500 0 897756 hhinc | 4 5,406 | >500 583 507369 educ | 761 4,649 | 4 1 4 educy | 372 5,038 | 9 8.7 18 ausb | 1,309 4,101 | 4 1 4 emplst | 155 5,255 | 6 1 6 lfp | 23 5,387 | 4 1 4 task | 622 4,788 | 7 1 7 health | 77 5,333 | 11 0 10 satlif | 88 5,322 | 11 0 10 polint | 89 5,321 | 4 1 4 party | 3,309 2,101 | 7 1 7 suppar | 85 5,325 | 2 1 2 worpea | 85 5,325 | 3 1 3 worter | 92 5,318 | 3 1 3 worcri | 92 5,318 | 3 1 3 worimm | 101 5,309 | 3 1 3 worhfo | 111 5,299 | 3 1 3 worjos | 2,329 3,081 | 3 1 3 rent | 3,049 2,361 | >500 27.3 3003.7 condit | 10 5,400 | 4 1 4 satliv | 95 5,315 | 11 0 10 -----------------------------------------------------------------------------
Using the notation _all tells Stata to apply all numeric variables of the dataset to the command. If you want to recode a valid numeric value of a variable to a specific missing, you can use the recode command
. recode female (2 = .a)
(female: 0 changes made)
3 Data Documentation
Now, we will discuss, in brief, the labeling of the dataset, variables, and values. Such labeling is critical to the careful use of data. Labeling variables with descriptive names clarifies their meanings and their measurements. Labeling values of numerical categorical variables ensures that the real-world meanings of the encodings are not forgotten. These points are crucial when sharing data with others, including your future self. Labels are also used in the output of most Stata commands, so proper labeling of the dataset will produce much more readable results.
Let’s start with variable labels. Since we use abbreviations and short notations for variables in the dataset, labelling variables is essential. We can label a variable by using the command label variable
.
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear (SOEP 2009 (Kohler/Kreuter)) . label var ausb "Educational Attainment" . desc aus storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- ausb byte %40.0g ausb Educational Attainment
The command is followed by the variable to be labeled. Then we type our preferred label in quotation marks. Next, we will take a look at value labels. These are very important because the numerical values of the categorical variables would have no real-world meaning otherwise. Value labels allow numeric variables to have words associated with numeric codes. Stata has a two-step approach to defining labels. First, you define a named label set which associates integer codes with labels of up to 80 characters, using the label define
command. Then you assign the set of labels to a variable, using the label values
command.
. recode female (1 = 0) (0 = 1), gen(male) (5410 differences between female and male) . label define male_lb 0 "Female" 1 "Male" . label values male male_lb
First, we created a new variable male which takes a value of one for men and a zero for women. Then we defined the new value label male_lb and assigned the 0 to women and the 1 to men. Next, we associated our new value label with the male variable. I highly recommend using the same name for the value label set and the variable because then you don’t have to remember the order in the last step.
. label define male 0 "Female" 1 "Male" . label values male male
One advantage of this two-step approach is that you can use the same set of value labels for several variables. The canonical example is
. label define yesno 1 "yes" 0 "no"
, which can then be associated with all 0-1 variables in your dataset by simply stringing all variables together after label values and by putting the name of the value label yesno at the end of the command. Moreover, label sets can be modified using the options add or modify. Just check help label. Since we have seen that we can define different missing values you can also assign value labels to them. For example,
. label define party .a "No answer" .b "Not applicable", modify
Using the desc command, you can check whether your variables have value labels. If you want to take a look at one or several specific value labels you can use the label list
command.
. desc Contains data from https://www.mustafacoban.de/wp-content/stata/gsoep.dta obs: 5,410 SOEP 2009 (Kohler/Kreuter) vars: 37 23 Sep 2015 16:20 size: 389,520 --------------------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- pnr long %12.0g Person Number hhnr long %12.0g Houshold Number gender str6 %9s Gender female byte %20.0g female Female - Dummy age float %9.0g Age marst byte %29.0g marst Marital Status of Individual marr float %11.0g marr Married / Not Married - Dummy hhmem byte %8.0g Number of Persons in Household hhkids byte %8.0g Number of Kids (0-14 Years) in Household hhtyp byte %35.0g hhtyp Household Type income long %10.0g Individual Labor Earnings (in Euro) hhinc long %10.0g Household Post-Government Income (in Euro) educ byte %28.0g educ Education educy float %9.0g Number of Years of Education ausb byte %40.0g ausb Educational Attainment emplst byte %44.0g emplst Employment Status lfp float %18.0g lfp Labor Force Participation task byte %58.0g task Working Task of Dependent Employees state byte %22.0g state State of Residence health byte %32.0g health Satisfaction with Health satlif byte %32.0g satlif Overall Life Satisfaction polint byte %20.0g polint Political Interests party byte %15.0g party Political party supported suppar byte %20.0g suppar Supports political party worpea byte %20.0g worpea Worried about peace worter byte %20.0g worter Worried about global terrorism worcri byte %20.0g worcri Worried about crime in Germany worimm byte %20.0g worimm Worried about immigration to Germany worhfo byte %20.0g worhfo Worried about hostility to foreigners worjos byte %20.0g worjos Worried about job security size float %12.0g Size of Housing (in m^2) rent float %12.0g Rent Minus Heating Costs (in Euro) rooms byte %8.0g Number of Rooms > 6m^2 renttype byte %20.0g renttype Status of living condit byte %24.0g condit Condition of house satliv byte %45.0g satliv Satisfaction with Living/Habitation male byte %9.0g male RECODE of female (Female - Dummy) --------------------------------------------------------------------------------------------------------------------------- Sorted by: pnr Note: Dataset has changed since last saved. . label list emplst emplst: 1 Full-Time Employee 2 Part-Time Employee 3 Irregular Employee 4 Unemployed 5 Retired 6 Not in Labor Force
4 Work Documentation
While it is fun to type commands interactively and see the results straightaway, serious work requires that you save your results and keep track of the commands that you have used, so that you can document your work and reproduce it later if needed.
4.1 Log-Files
When you work on an analysis, it is worthwhile to behave like a bench scientist and keep a lab notebook of your actions so that your work can be easily replicated. Everyone has a feeling of complete omniscience while working intensely – this feeling is wonderful but fleeting. By the next day, the exact little details needed for perfect duplication have become obscure. Stata has a lab notebook on hand: the log file.
A log file is simply a record of your Results window. It records all commands and all textual output in real time. Thus it keeps your lab notebook for you as you work. Because it saves the file to the disk while it writes the Results window, it also protects you from disastrous failures, be they power failures or computer crashes. We recommend that you start a log file whenever you begin any serious work in Stata.
To open a log file, use the log using
command and give your log file a meaningful filename.
. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear (SOEP 2009 (Kohler/Kreuter)) . log using project1, replace --------------------------------------------------------------------------------------------------------------------------- name: <unnamed> log: N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl log type: smcl opened on: 2 Aug 2018, 19:08:13
The replace option ensures that an existing log file with the name project1 will be overwritten. This will often be the case if you need to re-run your commands several times to get them right. By default, Stata will save the log file in its Stata Markup and Control Language (SMCL) format, which preserves all formatting and links from the Results window.
If you want to temporarily suspend logging and then resume logging, just use the commands log off
and log on
. tab emplst Employment Status | Freq. Percent Cum. -------------------+----------------------------------- Full-Time Employee | 2,040 38.82 38.82 Part-Time Employee | 599 11.40 50.22 Irregular Employee | 288 5.48 55.70 Unemployed | 312 5.94 61.64 Retired | 1,389 26.43 88.07 Not in Labor Force | 627 11.93 100.00 -------------------+----------------------------------- Total | 5,255 100.00 . log off name: <unnamed> log: N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl log type: smcl paused on: 2 Aug 2018, 19:08:13 --------------------------------------------------------------------------------------------------------------------------- . drop party . log on --------------------------------------------------------------------------------------------------------------------------- name: <unnamed> log: N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl log type: smcl resumed on: 2 Aug 2018, 19:08:14
To finish your logging, close your log file using the log close
command. Once the log file is closed and saved on the hard disk, you can use the view
command to open the file in the Viewer of Stata or you can directly print the content of your log file using the print
command.
. tab lfp Labor Force | Participation | Freq. Percent Cum. -------------------+----------------------------------- Dependent Employee | 2,846 52.83 52.83 Self-Employed | 213 3.95 56.78 Unemployed | 312 5.79 62.58 Not in Labor Force | 2,016 37.42 100.00 -------------------+----------------------------------- Total | 5,387 100.00 . log close name: <unnamed> log: N:\Lehre\Stata\7.Homepage\3.Parts\Part 2\project1.smcl log type: smcl closed on: 2 Aug 2018, 19:22:29 --------------------------------------------------------------------------------------------------------------------------- . view project1.smcl . print project1.smcl
As log files in SMCL format can only be opened with Stata, a log file can alternatively be saved in ASCII/HTML format with the extension .log or in plain-text format with the extension .txt. However, I recommend that you use the default SMCL format because SMCL files can be translated into variety of formats, such as plain log, plain-text, PostScript, and PDF, using the translate
command.
. translate project1.smcl project1.pdf, replace
(file project1.pdf written in PDF format)
4.2 Do-Files and Comments
Stata comes with an integrated text editor called the Do-file Editor, which can be used for many tasks. It gets its name from the term do-file, which is a file containing a list of commands for Stata to run (called a batch file or a script in other programs). Although the Do-file Editor has advanced features that can help in writing such files, it can also be used to build up a series of commands that can then be submitted to Stata all at once. This feature can be handy when writing a loop to process multiple variables in a similar fashion or when doing complex, repetitive tasks interactively. Thus, you can run your program directly from the editor without using the Command Window anymore.
To access Stata’s Do-File Editor, use the shortcut Ctrl+9 or type the command doedit
in the Command window. Do-files have the extension .do and existing do-files can be opened by typing
doedit dofilename.do
There are several useful shortcuts to handle your new do-file if you’re working within a do-file
Shortcurt | Execution |
---|---|
Ctrl + s | Save the do-file |
Ctrl + d | All commands of the do-file are executed, starting at the beginning of the do-file |
Ctrl + Shift + d | All commands starting from the current cursor position are executed |
If you want to execute a do-file from your hard disk, you can use the do
command by typing
doedit dofilename.do
You will notice that the color of the text changes as you type within a do-file. The different colors are examples of the Do-File Editor’s syntax highlighting which you can modify if you want to.
Code that looks obvious to you may not be so obvious to a co-worker, or even to you a few months later. It is always a good idea to annotate your do-files with explanatory comments that provide the gist of what you are trying to do. If the default settings of highlighting within do-files have not been modified, comments are in green. There are three alternative ways of using comments in a do-file
- Single Comment:
*
You can start a new line with a * to indicate that this line is a comment, not a command.
. sum income Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 4,778 20542.17 37426.25 0 897756 . * The sum command calculates the mean value of a single variable or several variables
- Toggle Comment:
//
A toggle comment // is at the end of a command and indicates that everything that follows to the end of the line is a comment and should be ignored by Stata.
. gen loginc = log(income) // New Variable with Logarith of Income
(2,001 missing values generated)
- Block Comment:
/*
[…]*/
A block comment /*[…]*/ is used to indicate that all text between the opening /* and the closing */, which may be a few characters or may span several lines, is a comment to be ignored by Stata. This type of comment can be used anywhere, even in the middle of a line, and is usually used to „comment out“ temporarily unused commands.
. replace loginc = 0 if loginc >= . (2,001 real changes made) . /* > All Missings in "loginc" > are assigned a zero value > */
Often, commands can be very long, especially when it comes to graph commands. In a do-file you will probably want to break long commands into lines to improve readability. There are two alternatives to tell Stata that a command continues on the next line or lines
- Triple Slashes:
///
Triple Slashes say that everything after them to the end of the line is a comment and the command itself continues on the next line.. sum income /// > if educ == 1 & /// > female == 1 Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 755 8132.087 24628.12 0 612757
- Delimiter:
;
Alternatively, you tell Stata to use a semi-colon instead of the carriage return at the end of the line to mark the end of a command by using#delimit ;
. Now all commands need to terminate with a semi-colon. To return to using carriage return as the delimiter, use#delimit cr
. Remember, the delimiter can only be changed in do-files.. desc income storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- income long %10.0g Individual Labor Earnings (in Euro) . #delimit ; delimiter now ; . sum income > if educ == 1 & > female == 1 ; Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- income | 755 8132.087 24628.12 0 612757 . #delimit cr delimiter now cr . desc income storage display value variable name type format label variable label --------------------------------------------------------------------------------------------------------------------------- income long %10.0g Individual Labor Earnings (in Euro)
Now, let’s take a look at a sample do-file and what it should contain at minimum
/* An Introduction to Stata Mustafa Coban July 2018 */ version 15 clear set more off capture log close log using project1.smcl, replace // Load GSOEP dataset use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear sum income /// if educ == 1 & /// female == 1 #delimit ; gen loginc = log(income) ; replace loginc = 0 if loginc >= . ; #delimit cr * Replace Missings with zero values desc loginc income log close exit
It is always a good idea to start every do file with comments that include at least a title, the name of the programmer who wrote the file, and the date. Assumptions about required files should also be noted. Then we continue with specifying the version of Stata we are using, in this case 15. This ensures that future versions of Stata will continue to interpret the commands correctly, even if Stata has changed. The clear
statement deletes the data currently held in memory and any value labels you might have. We need clear just in case we need to rerun the program. The set more off
command ensures that the execution of the do-file is not interrupted if the Results window is not large enough. If an earlier run of the do-file has failed, it is likely that you still have a log file open, in which case the log using
command will fail. Thus, at first we have to close any open logs. The problem with this solution is that it will not work if there is no log file open. The way out of this problem is to use the prefix capture
. This prefix tells Stata to run the command that follows and ignore any errors. Use judiciously. At the end of the do-file we close the log-file and exit
the do-file.
Recommendation
After learning how to use do-files, there is no need to use the Command anymore. Any command you want to test out can be written in a do-file, commented on, and executed directly from the do-file. The saved do-file allows you to check everything that you have coded so far the next day. Furthermore, there is no need to save your data transformations in many different temporarily datasets after each milestone. You can start loading the original dataset and, after all transformations, save a single final master dataset for your upcoming analysis.