## 1 Getting Started

Stata is a statistical software package that can be used to process, clean, and structure many types of data. It offers a wide range of statistical procedures for analyzing and interpreting your data. The active user community in the Stata Forum makes it easy to exchange information and to get help with programming and estimation problems. This community also develops and shares a wealth of up-to-date graphical and statistical tools, so that new methods can be used in Stata as they emerge. Some of these user-written commands are published in the Stata Journal, and many others can be found in the Statistical Software Components (SSC) archive provided by IDEAS. Although your progress in learning the software may seem slow at first, working with Stata ultimately proves to be simple, fast, and accurate. The initial investment is worth it, because common, repetitive tasks can be automated.

This tutorial closely follows my course "Introduction to Stata" at the University of Würzburg and is largely based on Kohler and Kreuter (2012), Data Analysis Using Stata, 3rd Edition. We start with a brief introduction to the structure and syntax of the software. The most important programming tools are then used to make initial adjustments and transformations of prepared data sets. Since government agencies, statistical offices, and businesses often provide their data as Excel files, we place particular emphasis on importing data from Excel. A course project often involves collecting data from different sources, which leaves you with many individual data sets. Therefore, we look at how to combine multiple data sets in Stata into one unified data set for the subsequent analysis. Since a picture is worth a thousand words, we also concentrate on creating graphs with Stata.

Some introductory literature on data analysis, graphical analysis and econometrics using Stata:

- Kohler, U. and Kreuter, F. (2012), **Data Analysis Using Stata**, Third Edition, Stata Press.
- Acock, A. C. (2016), **A Gentle Introduction to Stata**, Fifth Edition, Stata Press.
- Mitchell, M. N. (2012), **A Visual Guide to Stata Graphics**, Third Edition, Stata Press.
- Baum, C. F. (2006), **An Introduction to Modern Econometrics Using Stata**, Stata Press.

When you start Stata for the first time, you will see five windows in the default interface.

The Command Window is where you type your commands. When you have done so and press Enter, the result of your command appears in the larger window above, called the Results Window. Your command is added to a list in the Review Window on the left so you can keep track of the commands you use during a session. The Variables Window, on the top right, shows the variables in your data as well as their labels. The Properties Window immediately below it displays the properties of your variables and dataset. You can resize or even close some of these windows. Stata remembers previously chosen settings the next time it is opened. I recommend that you close all windows except the Command and Results windows, since the features of the other windows can be called up at any time via the command line.

At the beginning of each term, I usually set forth 3 commandments to my students for their daily work with Stata:

**Thou shalt not click:**

Operating the software directly via the keyboard is faster and more practical than doing so via the menu bar in Stata.

**Thou shalt not work destructively:**

An original dataset should never (without exceptions) be overwritten. It is virtually impossible to recover a dataset that has been overwritten.

**Thou shalt bear witness:**

Any change, transformation, and analysis of original data must be replicable by another user of your script.
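The three commandments translate naturally into a short do-file. The following is only a minimal sketch, not a prescribed template: the file name `auto_modified.dta` and the variable `price_1000` are hypothetical, the bundled `auto` dataset merely stands in for your original data, and `version 15` assumes the Stata release used in this tutorial.

```stata
* Minimal do-file sketch (all file and variable names are illustrative)
version 15                          // assumed release; pins command behavior
clear all                           // start from an empty session

* Thou shalt not click: load the data with a command
sysuse auto, clear

* Thou shalt bear witness: every transformation is written down
generate price_1000 = price / 1000
label variable price_1000 "Price in 1,000 US dollars"

* Thou shalt not work destructively: save under a NEW name
save "auto_modified.dta", replace
```

Running the do-file again reproduces the same result from the untouched original data, which is the point of all three rules.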

## 2 Loading Data and Getting Help

### 2.1 Loading Data

By default, Stata reads data from and saves data to the working directory. You can print the path of the current working directory using the command `pwd` or change the directory using the command `cd`.

`. cd "J:/Stata/Tutorial 1/"
J:\Stata\Tutorial 1
`

Don’t forget to put the path to your working directory in quotes if any folder names contain blanks. I recommend that you create a separate directory for each course or research project you are involved in and start your Stata session by switching to that directory. As you can see, Stata accepts either type of slash; on Windows, it converts forward slashes to backslashes when it displays the path. The dot at the start of the line is how Stata marks the command lines you type and is not part of the command. Thus, you don’t type the dot when writing and executing a command.

Stata files located on the drive may be opened via the command `use`. When doing this, it is always useful to include the option `clear`, which guarantees that the previously loaded data set is removed from Stata before the new data set is loaded. To load a new data set, you can enter the entire path directly.

`. use "J:/Stata/Tutorial 1/dataset1.dta" , clear`

Or, if the working directory is already set to the required folder, you can just enter the file name. Stata’s data files always carry the file extension `.dta`.

`. use dataset1.dta , clear`

A few sample data files are included with the Stata software. During the tutorial, we load different sample files using the command `sysuse` as well as some prepared files from this website. Let’s start with `lifeexp.dta`, which features data on life expectancy and gross national product (GNP) per capita in 1998 for 68 countries. With the command `browse`, you can open the Data Editor window, which displays the current data.

`. sysuse lifeexp.dta, clear
(Life expectancy, 1998)
. browse
`

You can see that Stata regards the data as one rectangular table and data are displayed in multiple colors.

Columns represent variables and rows represent observations. Variables listed in black are numeric, e.g. `popgrowth`, whereas variables listed in red are strings or text, e.g. `country`. Furthermore, variables listed in blue are categorical variables which are stored as numbers but displayed as human-readable text. This is done by what Stata calls value labels. Finally, under the `safewater` variable, which looks to be numeric, there are some cells containing just a period (.). The periods correspond to missing values, but more on that later.

If you want to make changes to the data directly from the data editor window, type `edit`. Now changes can be made within each cell of the data set. But I urge you to avoid this approach, as it is error-prone and not reproducible.

`. edit`

To get more details about what the data are and how the data are stored, type `describe` and hit Enter.

`. sysuse lifeexp.dta, clear
(Life expectancy, 1998)
. describe
Contains data from C:\Program Files (x86)\Stata15\ado\base/l/lifeexp.dta
  obs:            68                          Life expectancy, 1998
 vars:             6                          26 Mar 2016 09:40
 size:         2,652                          (_dta has notes)
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
region          byte    %12.0g     region     Region
country         str28   %28s                  Country
popgrowth       float   %9.0g               * Avg. annual % growth
lexp            byte    %9.0g               * Life expectancy at birth
gnppc           float   %9.0g               * GNP per capita
safewater       byte    %9.0g               *
                                            * indicated variables have notes
-------------------------------------------------------------------------------
Sorted by:
`

You can see that the data set contains 6 variables and 68 observations. The variables have `variable names` that can be used within commands and `variable labels` that provide user-defined information about the variables. The storage type describes the way Stata stores its data. For now, it’s enough to know that variables with the storage type `str` are string or text variables, whereas all others in this dataset are numeric. There is another nice feature provided by Stata. If the cursor is placed within the command line area, you can use the Page-Down and Page-Up keys to navigate through previously used commands.
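To inspect only part of a dataset, `describe` also accepts a variable list. The sketch below additionally uses `ds` with the `has(type string)` option, which is not covered in this tutorial, to list only the string variables; treat it as an optional shortcut rather than part of the course material.

```stata
sysuse lifeexp.dta, clear

* Describe only selected variables
describe country lexp

* List the names of all string variables (in this dataset: only country)
ds, has(type string)
```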

### 2.2 Getting Help

Stata has excellent online support. Since Stata has thousands of commands, many of which offer a large number of additional options, it is practically impossible to memorize all of them. Therefore, it is more important to know where to look for answers than to learn everything by heart. To obtain help on the command `summarize`, type

`. help summarize`

and the help information will be displayed in a separate window called the Viewer.

**[R] summarize** -- Summary statistics (View complete PDF manual entry)

**Syntax**

`summarize [varlist] [if] [in] [weight] [, options]`

| options | Description |
|---|---|
| `detail` | display additional statistics |
| `meanonly` | suppress the display; calculate only the mean; programmer's option |
| `format` | use variable's display format |
| `separator(#)` | draw separator line after every # variables; default is `separator(5)` |
| `display_options` | control spacing, line width, and base and empty cells |

varlist may contain factor variables; see fvvarlist. varlist may contain time-series operators; see tsvarlist. `by`, `rolling`, and `statsby` are allowed; see prefix. `aweight`s, `fweight`s, and `iweight`s are allowed. However, `iweight`s may not be used with the `detail` option; see weight.

**Menu**

Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics

**Description**

`summarize` calculates and displays a variety of univariate summary statistics. If no varlist is specified, summary statistics are calculated for all the variables in the dataset.

**Links to PDF documentation**

Quick start, Remarks and examples, Methods and formulas. The above sections are not included in this help file.

**Options**

- `detail` produces additional statistics, including skewness, kurtosis, the four smallest and four largest values, and various percentiles.
- `meanonly`, which is allowed only when `detail` is not specified, suppresses the display of results and calculation of the variance. Ado-file writers will find this useful for fast calls.
- `format` requests that the summary statistics be displayed using the display formats associated with the variables rather than the default `g` display format; see [D] format.
- `separator(#)` specifies how often to insert separation lines into the output. The default is `separator(5)`, meaning that a line is drawn after every five variables. `separator(10)` would draw a line after every 10 variables. `separator(0)` suppresses the separation line.
- `display_options`: `vsquish`, `noemptycells`, `baselevels`, `allbaselevels`, `nofvlabel`, `fvwrap(#)`, and `fvwrapon(style)`; see [R] estimation options.

**Examples**

`. sysuse auto
. summarize
. summarize mpg weight
. summarize mpg weight if foreign
. summarize mpg weight if foreign, detail
. summarize i.rep78
`

**Video example**

Descriptive statistics in Stata

**Stored results**

`summarize` stores the following in `r()`:

| Scalar | Description |
|---|---|
| `r(N)` | number of observations |
| `r(mean)` | mean |
| `r(skewness)` | skewness (`detail` only) |
| `r(min)` | minimum |
| `r(max)` | maximum |
| `r(sum_w)` | sum of the weights |
| `r(p1)` | 1st percentile (`detail` only) |
| `r(p5)` | 5th percentile (`detail` only) |
| `r(p10)` | 10th percentile (`detail` only) |
| `r(p25)` | 25th percentile (`detail` only) |
| `r(p50)` | 50th percentile (`detail` only) |
| `r(p75)` | 75th percentile (`detail` only) |
| `r(p90)` | 90th percentile (`detail` only) |
| `r(p95)` | 95th percentile (`detail` only) |
| `r(p99)` | 99th percentile (`detail` only) |
| `r(Var)` | variance |
| `r(kurtosis)` | kurtosis (`detail` only) |
| `r(sum)` | sum of variable |
| `r(sd)` | standard deviation |

If you don’t know the name of the command you need, you can search for it. Stata has a `search` command that will search for keywords within the support documentation as well as other resources. Just type, for example,

`. search inequality`

to search for Stata commands that can calculate inequality measures based on survey data. One of the most convenient features of Stata is that all documentation on the commands is available as PDF files. Moreover, these files are linked in the online support interface, so you can jump directly to the relevant section of the manual from within the Stata Viewer.

If you have a data set with a large number of variables, you might also be interested in the `lookfor` command, which enables you to search for words or strings of letters in variable names and variable labels. For example, let’s search for the word live in `nlsw88.dta`.

`. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)
. lookfor live
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
south           byte    %8.0g                 lives in south
smsa            byte    %9.0g      smsalbl    lives in SMSA
c_city          byte    %8.0g                 lives in central city
`

Although there is no variable in the data set whose name contains the word live, there are three variables whose variable labels contain it.
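The variables found by `lookfor` can be reused in later commands, because `lookfor` leaves their names in `r(varlist)`. A minimal sketch, in which the local macro name `found` is an arbitrary choice for illustration:

```stata
sysuse nlsw88.dta, clear

* lookfor leaves the names of the matching variables in r(varlist)
lookfor live
local found `r(varlist)'

* Reuse the matches, e.g. to summarize them
summarize `found'
```

Copying `r(varlist)` into a local macro right away is a precaution, since the next r-class command (here `summarize`) overwrites the stored results.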

## 3 Some Descriptive Statistics and Conditions

The data editor can be used to take a first glance at the data. You can also display specific data points (or all data points) in the Results window with the command `list`. Let’s use the data on life expectancy again and take a look at the variables `country` and `popgrowth`.

`. sysuse lifeexp.dta, clear
(Life expectancy, 1998)
. list country popgrowth
(output omitted)
`

If your data have many observations, your listing may stop short, and you may see a blue `--more--` tag at the bottom of the Results window. Pressing the Spacebar or clicking on the blue `--more--` lets the command run to completion. If you want to abort the listing instead of scrolling through it, just press the Q key and you can continue working.

### 3.1 Descriptive Statistics

Let’s run some summary statistics using the `summarize` command followed by the variables you are interested in.

`. sysuse lifeexp.dta, clear
(Life expectancy, 1998)
. summarize lexp gnppc
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        lexp |         68    72.27941    4.715315         54         79
       gnppc |         63    8674.857    10634.68        370      39980
`

We see that life expectancy averages 72.3 years and its standard deviation is 4.7. Further, life expectancy is between 54 and 79 years across all countries included in the data and there are 68 countries with valid observations. We also see that Stata reports only 63 observations on GNP per capita, so we must have some missing values. If we are interested in the median or in further statistics such as percentiles, skewness, and kurtosis of GNP per capita, we can use the option `detail`. Options follow the comma in a Stata command. So we type

`. summarize gnppc, detail
                       GNP per capita
-------------------------------------------------------------
      Percentiles      Smallest
 1%          370            370
 5%          410            380
10%          740            380       Obs                  63
25%         1360            410       Sum of Wgt.          63
50%         3360                      Mean           8674.857
                        Largest       Std. Dev.      10634.68
75%        14100          29240
90%        25580          33040       Variance       1.13e+08
95%        29240          34310       Skewness        1.30502
99%        39980          39980       Kurtosis       3.382168
`

The median GNP per capita is 3360 US dollars, which is less than the mean value. Thus, the distribution of GNP per capita is right-skewed. The same conclusion can be drawn from the positive skewness value of 1.305. We can also see that the four countries with the highest GNP per capita have values between 29240 and 39980 US dollars.
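The figures discussed above do not have to be read off the screen. As the help file for `summarize` states, the command stores its results in `r()`, so they can be used in later commands. A minimal sketch of this, where the display labels are arbitrary:

```stata
sysuse lifeexp.dta, clear

* Run summarize without printing the table; results stay available in r()
quietly summarize gnppc, detail

display "Observations: " r(N)
display "Mean:         " r(mean)
display "Median:       " r(p50)
display "Skewness:     " r(skewness)

* A mean above the median is consistent with a right-skewed distribution
* (the expression evaluates to 1 if true and 0 if false)
display "Mean > median? " (r(mean) > r(p50))
```

Note that `r(p50)` and `r(skewness)` are only available because the `detail` option was specified.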

For certain variables we want to know how often specific values appear in the data. This naturally only makes sense with nominal- and ordinal-scaled variables with a limited number of unique values, such as gender or educational attainment. We can do this with one-way tables via the `tabulate` command. Let’s take a closer look at the `auto` dataset, which is a record of diverse car models in the US in 1978.

`. sysuse auto.dta, clear
(1978 Automobile Data)
. tabulate foreign
   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
`

You can see that roughly 70% of the cars are domestic, whereas 30% are foreign. The value labels of the variable `foreign` are used to make the table easy to read. Now, if you want to compare prices across the cars’ origins, you need a one-way table of car type (foreign vs. domestic) that also summarizes car prices within each type.

`. tabulate foreign, summarize(price)
            |          Summary of Price
   Car type |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Domestic |   6,072.423   3,097.104          52
    Foreign |   6,384.682   2,621.915          22
------------+------------------------------------
      Total |   6,165.257   2,949.496          74
`

As you can see, the variable to be summarized is passed via an option. Thus, we get the mean and standard deviation of the price of foreign and domestic cars. Foreign cars are more expensive, on average, with a price of 6385 US dollars, than domestic cars, with a price of 6072 US dollars.
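The same group-wise comparison can also be obtained with the `by` prefix, which the help file for `summarize` lists as allowed. The sketch below uses `bysort`, which sorts by the grouping variable before running the command for each group; it is meant as an alternative view of the same numbers, not as a replacement for `tabulate, summarize()`.

```stata
sysuse auto.dta, clear

* Summary statistics of price, separately for domestic and foreign cars
bysort foreign: summarize price
```

Compared with the table above, this output additionally shows the minimum and maximum price within each group.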

### 3.2 The in- and if-Conditions

This section gives a basic lesson on Stata’s command syntax via the example of the `list` command while showing how to control the appearance of a data list. The syntax for the `list` command can be seen by typing `help list`:

`list [varlist] [if] [in] [, options]`

Anything inside square brackets is optional. For the `list` command,

- varlist is optional. A varlist is a list of variable names.
- if is optional. The `if` qualifier restricts the command to run only on observations for which the qualifier is true.
- in is optional. The `in` qualifier restricts the command to run on particular observation numbers.
- options are optional. They are separated from the rest of the command by a comma.

If a part of a command name is underlined in the help file, the underlined part is the minimum abbreviation. Any abbreviation at least this long is acceptable. Since the `l` in `list` is underlined, `l`, `li`, and `lis` are all equivalent to `list`. The `in` qualifier takes a range of observation numbers that should be listed when it is applied to the `list` command. A range has the form of a single number or first/last. Positive numbers count from the beginning of the dataset, whereas negative numbers count from the end of the dataset. However, the beginning and the end of the dataset depend on the sorting of the data. With the `sort` command, the data can be sorted in ascending order by one or more variables.

`. sysuse auto, clear
(1978 Automobile Data)
. sort price
. list make price foreign in 1/5
+-----------------------------------+
| make price foreign |
|-----------------------------------|
1. | Merc. Zephyr 3,291 Domestic |
2. | Chev. Chevette 3,299 Domestic |
3. | Chev. Monza 3,667 Domestic |
4. | Toyota Corolla 3,748 Foreign |
5. | Subaru 3,798 Foreign |
+-----------------------------------+
. list make price foreign in -5/-1
+--------------------------------------+
| make price foreign |
|--------------------------------------|
70. | Peugeot 604 12,990 Foreign |
71. | Linc. Versailles 13,466 Domestic |
72. | Linc. Mark V 13,594 Domestic |
73. | Cad. Eldorado 14,500 Domestic |
74. | Cad. Seville 15,906 Domestic |
+--------------------------------------+
`

Because the dataset is sorted in ascending order by the price of the car models, the first list shows the 5 cheapest models, while the second list displays the 5 most expensive models. The command `gsort` can sort the data in descending order by one or more variables. For details, take a look at the help file by typing `help gsort`.
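As a minimal sketch of `gsort`: a minus sign in front of a variable name requests descending order for that variable, so the most expensive models can be listed with a positive range instead of `in -5/-1`.

```stata
sysuse auto, clear

* Sort from the most expensive to the cheapest model
gsort -price

* The first five observations are now the five most expensive cars
list make price foreign in 1/5
```

The list contains the same five models as `in -5/-1` after `sort price`, only in reverse order.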

The `if` qualifier uses a logical expression to determine which observations to use. If the expression is true, the observation is used in the command; otherwise, it is skipped. The operators whose results are either true or false are

| Operator | Meaning |
|---|---|
| `<` | less than |
| `<=` | less than or equal |
| `==` | equal |
| `>` | greater than |
| `>=` | greater than or equal |
| `!=` | not equal |
| `&` | and |
| `\|` | or |
| `!` | not (logical negation) |
| `()` | parentheses are for grouping to specify order of evaluation |

In logical expressions, `&` is evaluated before `|` (similar to multiplication before addition in arithmetic). You can rely on this in your expressions, but it is often better to use parentheses to ensure that the expressions are evaluated in the intended order.

To illustrate the `if` qualifier, let’s use a random sample of the German Socio-Economic Panel (GSOEP). The GSOEP annually surveys adults in Germany about their demographic and socio-economic characteristics. The altered random sample `gsoep.dta` is available on my website and can be loaded directly into Stata if you have an internet connection. First, let’s find out the average income of people aged 30 or younger in the GSOEP sample. Afterwards, we want to know the average income of adults who are 25 to 55 years old. Although the `summarize` command can be abbreviated to two letters (`su`), I prefer the more common abbreviation `sum`.

`. use "https://www.mustafacoban.de/wp-content/stata/gsoep.dta", clear
(SOEP 2009 (Kohler/Kreuter))
. sum income if age <= 30
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 857 10506.67 12122.77 0 79295
. sum income if age >= 25 & age <= 55
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,700 28261.41 40074.5 0 897756
`

As you can see, there are 857 persons in the dataset who are 30 years old or younger, and their average yearly income is 10507 Euro. Persons who are 25 to 55 years old earn, on average, 28261 Euro, and this age group’s incomes vary widely, ranging from 0 to 897756 Euro. Furthermore, the GSOEP sample contains the string variable `gender`, which indicates each respondent’s gender. Now, if we want to calculate the average yearly income of women, we must type

`. sum income if gender == "Female"
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
income | 2,458 13323 21290.77 0 612757
`

The string variable `gender` has two distinct values, namely "Male" and "Female". In Stata, strings must always be written in quotation marks.

`. sum income if gender == "Female"`
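The precedence rule for `&` and `|` described above can be checked directly. The following sketch uses the bundled `auto` data rather than the GSOEP sample so that it runs without an internet connection; the thresholds (a price above 10000, more than 20 miles per gallon) and the local macro names are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Without parentheses, & is evaluated before |
count if foreign == 1 | price > 10000 & mpg > 20
local n_implicit = r(N)

* The same condition with the grouping made explicit
count if foreign == 1 | (price > 10000 & mpg > 20)
local n_explicit = r(N)

* Both counts are identical
display `n_implicit' " " `n_explicit'

* A different grouping is a different condition:
* here mpg > 20 must hold for every counted car
count if (foreign == 1 | price > 10000) & mpg > 20
```

Writing the parentheses explicitly costs nothing and spares the reader of your do-file from having to remember the precedence rule.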