Introduction to Stata

Part I: Introduction and Overview

Getting Started

Stata is a powerful statistical software package that can be used to process, clean, and structure almost any type of data. It gives you access to a wide range of statistical procedures for analyzing and interpreting your data. An active user community in the Stata Forum makes it easy to exchange information and to get help with programming and estimation problems. This community also independently develops and shares a wealth of current graphical and statistical tools, so that new methods can be used in Stata as they emerge. Some of these user-written commands are published in the Stata Journal, and many others can be found in the Statistical Software Components (SSC) archive, which is hosted at Boston College and indexed by IDEAS. Although your progress may feel slow at the beginning, working with Stata ultimately proves to be simple, fast, and accurate. The initial investment is worth it, because common, repetitive tasks can be fully automated.

This tutorial closely follows my course, "Introduction to Stata", at the University of Würzburg, and is largely based on Kohler and Kreuter (2012): Data Analysis Using Stata, 3rd Edition. We start with a brief introduction to the structure and syntax of the software. The most important programming tools are then used to make initial adjustments and transformations of prepared data sets. Since government agencies, statistical offices, and businesses often provide their data as Excel files, we place particular emphasis on importing data from Excel. A course project often involves collecting data from different sources, which leaves you with many individual data sets. Therefore, we take a look at how to combine multiple data sets in Stata into one unified data set for the subsequent analysis. Since a picture is worth a thousand words, we will also concentrate on creating graphs with Stata.

Some introductory literature on data analysis, graphical analysis and econometrics using Stata:

  • Kohler, U. and Kreuter, F. (2012), Data Analysis Using Stata, Third Edition, Stata Press.
  • Acock, A. C. (2016), A Gentle Introduction to Stata, Fifth Edition, Stata Press.
  • Mitchell, M. N. (2012), A Visual Guide to Stata Graphics, Third Edition, Stata Press.
  • Baum, C.F. (2006), An Introduction to Modern Econometrics Using Stata, Stata Press.

When you start Stata for the first time, you will see five windows in the default interface.

The Command Window is where you type your commands. When you have done so and press Enter, the result of your command appears in the larger window above, called the Results Window. Your command is added to a list in the Review Window on the left so you can keep track of the commands you use during a session. The Variables Window, on the top right, shows the variables in your data as well as their labels. The Properties Window immediately below it displays the properties of your variables and dataset. You can resize or even close some of these windows. Stata remembers previously chosen settings the next time it is opened. I recommend that you close all windows except the Command and Results windows, since the features of the other windows can be called up at any time via the command line.

At the beginning of each term, I usually set forth three commandments to my students for their daily work with Stata:

  1. Thou shalt not click:
    Operating the software directly via the keyboard is faster and more practical than doing so via the menu bar in Stata.
  2. Thou shalt not work destructively:
    An original dataset should never (without exceptions) be overwritten. It is virtually impossible to recover a dataset that has been overwritten.
  3. Thou shalt bear witness:
    Any change, transformation, and analysis of original data must be replicable by another user of your script.
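
The three rules can be sketched as a short do-file. This is only an illustration under stated assumptions: the file names session1.log and auto_work.dta and the variable price_k are hypothetical, and the version number should match your own installation.

```stata
* Sketch of a do-file that follows the three commandments.
* Hypothetical names: session1.log, auto_work.dta, price_k.

* Assumed Stata version; adjust to yours
version 15
clear all

* Close any log left open by an earlier run, then start a new one
capture log close
log using session1.log, replace text

* Load a shipped example dataset
sysuse auto.dta, clear

* Derive a new variable by command instead of editing cells by hand
generate price_k = price / 1000
label variable price_k "Price in 1,000 US dollars"

* Save under a new name so the original file stays untouched
save auto_work.dta, replace
log close
```

Because every step is written down, another user can rerun the file and arrive at the same result.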

Loading Data and Getting Help

2.1  Loading Data

Stata reads and saves data to and from the working directory. You can print the path of the current working directory using the command pwd, or change the directory using the command cd:

. cd "J:/Stata/Tutorial 1/"
J:\Stata\Tutorial 1

Don’t forget to put the path to your working directory in quotes if any folder names contain blanks. I recommend that you create a separate directory for each course or research project you are involved in and start your Stata session by switching to that directory. As you can see, on Windows Stata is not sensitive to the chosen slash type: when forward slashes are used, it converts them directly to backslashes. The dot at the start of the line is how Stata marks the command lines you type and is not part of the command. Thus, you don’t type the dot when writing and executing a command.
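
A typical start of a session might look like the following sketch. The path is the one used in the example above and will differ on your machine.

```stata
* Print the current working directory
pwd

* Switch to the project folder (path taken from the example above)
cd "J:/Stata/Tutorial 1/"

* List the Stata data files found in that folder
dir *.dta
```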

Stata files located on the drive may be opened via the command use. When doing this, it is always useful to include the option clear, which guarantees that the previously loaded data set is removed from Stata before the new data set is loaded. To load a new data set, you can enter the entire new path directly.

. use "J:/Stata/Tutorial 1/dataset1.dta" , clear

Or, if the working directory is already set to the required folder, you can just enter the file name. Stata’s data files always carry the file extension .dta.

. use dataset1.dta , clear

A few sample data files are included with the Stata software. During the tutorial, we load different sample files using the command sysuse as well as some prepared files from this website. Let’s start with lifeexp.dta, which features data on life expectancy and gross national product (GNP) per capita in 1998 for 68 countries. With the command browse, you can open the Data Editor window, which displays the current data.

. sysuse lifeexp.dta, clear
(Life expectancy, 1998)

. browse

You can see that Stata regards the data as one rectangular table and data are displayed in multiple colors.

Columns represent variables and rows represent observations. Variables listed in black are numeric, e.g. popgrowth, whereas variables listed in red are strings or text, e.g. country. Furthermore, variables listed in blue are categorical variables which are stored as numbers but displayed as human-readable text. This is done by what Stata calls value labels. Finally, under the safewater variable, which looks to be numeric, there are some cells containing just a period (.). The periods correspond to missing values, but more on that later.
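
What the colors in the Data Editor stand for can also be checked from the command line. The following is a minimal sketch that uses only the variables shown above; it relies on the value label being named region, as it is in this dataset.

```stata
sysuse lifeexp.dta, clear

* Show the mapping between stored numbers and displayed text
label list region

* Display the underlying numeric codes instead of the labels
list country region in 1/5, nolabel

* Count and list the observations with a missing value in safewater
count if missing(safewater)
list country safewater if missing(safewater)
```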

If you want to make changes to the data directly from the data editor window, type edit. Now changes can be made within each cell of the data set. But I urge you to avoid this approach, as it is error-prone and not reproducible.

. edit
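
A reproducible alternative to typing into a cell is to make the change with a command. The correction below is purely hypothetical and serves only to show the mechanics.

```stata
sysuse lifeexp.dta, clear

* Look at the value before the change
list country lexp in 1

* Hypothetical correction for illustration: set lexp to 75 in observation 1
replace lexp = 75 in 1

* Confirm the change; it affects only the data in memory, not the file on disk
list country lexp in 1
```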

To get more details about what the data are and how the data are stored, type describe and hit enter.

. sysuse lifeexp.dta, clear
(Life expectancy, 1998)

. describe

Contains data from C:\Program Files (x86)\Stata15\ado\base/l/lifeexp.dta
  obs:            68                          Life expectancy, 1998
 vars:             6                          26 Mar 2016 09:40
 size:         2,652                          (_dta has notes)
              storage   display    value
variable name   type    format     label      variable label
region          byte    %12.0g     region     Region
country         str28   %28s                  Country
popgrowth       float   %9.0g               * Avg. annual % growth
lexp            byte    %9.0g               * Life expectancy at birth
gnppc           float   %9.0g               * GNP per capita
safewater       byte    %9.0g               * 
                                            * indicated variables have notes
Sorted by: 

You can see that the data set contains 6 variables and 68 observations. The variables have variable names that can be used within commands and variable labels that provide user-defined information about the variables. The storage type describes the way Stata stores its data. For now, it’s enough to know that variables with the storage type str are string or text variables, whereas all others in this dataset are numeric. One more convenient feature: if the cursor is placed in the Command window, you can use the Page-Up and Page-Down keys to step through previously used commands.
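
The describe command accepts a variable list and options, which is handy for larger data sets. A short sketch:

```stata
sysuse lifeexp.dta, clear

* Only the header information (observations, variables, size)
describe, short

* Restrict the description to selected variables
describe country region

* List the names of all string variables
ds, has(type string)
```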

2.2  Getting Help

Stata has excellent online support. Since Stata has thousands of commands, many of them with a long list of additional options, it is basically impossible to memorize all of them. Therefore, it is more important to know where to look for answers than to learn everything by heart. To obtain help on the command summarize, type

. help summarize

and the help information will be displayed in a separate window called the Viewer.

[R] summarize -- Summary statistics
                 (View complete PDF manual entry)


Syntax

        summarize [varlist] [if] [in] [weight] [, options]

    options           Description
      detail          display additional statistics
      meanonly        suppress the display; calculate only the mean; programmer's option
      format          use variable's display format
      separator(#)    draw separator line after every # variables; default is separator(5)
      display_options control spacing, line width, and base and empty cells

    varlist may contain factor variables; see fvvarlist.
    varlist may contain time-series operators; see tsvarlist.
    by, rolling, and statsby are allowed; see prefix.
    aweights, fweights, and iweights are allowed.  However, iweights may not be used with the detail option; see weight.


Menu

    Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics


Description

    summarize calculates and displays a variety of univariate summary statistics.  If no varlist is specified, summary
    statistics are calculated for all the variables in the dataset.

Links to PDF documentation

        Quick start

        Remarks and examples

        Methods and formulas

    The above sections are not included in this help file.


Options

    ----+ Main +---------------------------------------------------------------------------------------------------------

    detail produces additional statistics, including skewness, kurtosis, the four smallest and four largest values, and
        various percentiles.

    meanonly, which is allowed only when detail is not specified, suppresses the display of results and calculation of
        the variance.  Ado-file writers will find this useful for fast calls.

    format requests that the summary statistics be displayed using the display formats associated with the variables
        rather than the default g display format; see [D] format.

    separator(#) specifies how often to insert separation lines into the output.  The default is separator(5), meaning
        that a line is drawn after every five variables.  separator(10) would draw a line after every 10 variables.
        separator(0) suppresses the separation line.

    display_options:  vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, fvwrap(#), and fvwrapon(style); see
        [R] estimation options.


Examples

    . sysuse auto
    . summarize
    . summarize mpg weight
    . summarize mpg weight if foreign
    . summarize mpg weight if foreign, detail
    . summarize i.rep78

Video example

    Descriptive statistics in Stata

Stored results

    summarize stores the following in r():

      r(N)           number of observations
      r(mean)        mean
      r(skewness)    skewness (detail only)
      r(min)         minimum
      r(max)         maximum
      r(sum_w)       sum of the weights
      r(p1)          1st percentile (detail only)
      r(p5)          5th percentile (detail only)
      r(p10)         10th percentile (detail only)
      r(p25)         25th percentile (detail only)
      r(p50)         50th percentile (detail only)
      r(p75)         75th percentile (detail only)
      r(p90)         90th percentile (detail only)
      r(p95)         95th percentile (detail only)
      r(p99)         99th percentile (detail only)
      r(Var)         variance
      r(kurtosis)    kurtosis (detail only)
      r(sum)         sum of variable
      r(sd)          standard deviation
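
The stored results listed in the help file can be inspected and reused right after the command has run. A minimal sketch with the auto data:

```stata
sysuse auto.dta, clear
summarize mpg

* Show everything summarize left behind in r()
return list

* Use stored results in a calculation: coefficient of variation
display "CV of mpg: " r(sd) / r(mean)
```

The results in r() are replaced by the next command that stores results, so use them directly after summarize.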

If you don’t know the name of the command you need, you can search for it. Stata has a search command that will search for keywords within the support documentation as well as other resources. Just type, for example,

. search inequality

to search for Stata commands that can calculate inequality measures based on survey data. One of the most convenient features of Stata is that all documentation on the commands is available as PDF files. Moreover, these files are linked in the online support interface, so you can jump directly to the relevant section of the manual from within the Stata Viewer.
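
The search can be extended to user-written material on the internet. The sketch below assumes a working internet connection.

```stata
* Search the local keyword database and resources on the internet
search inequality, all

* Show ten popular user-written packages in the SSC archive
ssc hot, n(10)

* Read how packages from the SSC archive are described and installed
help ssc
```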

If you have a data set with a large number of variables, you might also be interested in the lookfor command, which searches variable names and variable labels for given words or letters. For example, let’s search for the word live in nlsw88.dta:

. sysuse nlsw88.dta, clear
(NLSW, 1988 extract)

. lookfor live

              storage   display    value
variable name   type    format     label      variable label
south           byte    %8.0g                 lives in south
smsa            byte    %9.0g      smsalbl    lives in SMSA
c_city          byte    %8.0g                 lives in central city

Although no variable name in the data set contains the word live, there are three variables whose variable labels contain it.
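
The variables found by lookfor can be passed on to other commands, because lookfor stores their names in r(varlist). A sketch; the local macro name livevars is my own choice, and the lines should be run together, for example from a do-file.

```stata
sysuse nlsw88.dta, clear

* Several search terms can be given; a match on any of them counts
lookfor live union

* Keep the matching variable names in a local macro
lookfor live
local livevars `r(varlist)'

* Reuse the list in other commands
describe `livevars'
summarize `livevars'
```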

Some Descriptive Statistics and Conditions

The data editor can be used to take a first glance at the data. You can also display specific data points (or all data points) in the Results window with the command list. Let’s use the data on life expectancy again and take a look at the variables country and popgrowth:

. sysuse lifeexp.dta, clear
(Life expectancy, 1998)

. list country popgrowth
(output omitted)

If your data have many observations, the listing may pause, and you will see a blue --more-- message at the bottom of the Results window. Pressing the Spacebar or clicking on the blue --more-- continues the output. If you want to stop the listing instead, just press the Q key and you can continue working.
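
Long listings can be kept manageable from the command line as well. A short sketch:

```stata
sysuse lifeexp.dta, clear

* Turn off the --more-- pauses for the rest of the session
set more off

* List only the first ten observations
list country popgrowth in 1/10

* Same list without observation numbers and with a line after every five rows
list country popgrowth in 1/10, noobs separator(5)
```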

3.1  Descriptive Statistics

Let’s run some summary statistics using the summarize command followed by the variables which you are interested in.

. sysuse lifeexp.dta, clear
(Life expectancy, 1998)

. summarize lexp gnppc

    Variable |        Obs        Mean    Std. Dev.       Min        Max
        lexp |         68    72.27941    4.715315         54         79
       gnppc |         63    8674.857    10634.68        370      39980

We see that life expectancy averages 72.3 years with a standard deviation of 4.7 years. Life expectancy ranges from 54 to 79 years across the countries in the data, and all 68 countries have valid observations. Stata reports only 63 observations on GNP per capita, so five values must be missing. If we are interested in the median or other distributional statistics of GNP per capita, we can use the option detail. Options always follow a comma after the main part of the command. So we type

. summarize gnppc, detail

                       GNP per capita
      Percentiles      Smallest
 1%          370            370
 5%          410            380
10%          740            380       Obs                  63
25%         1360            410       Sum of Wgt.          63

50%         3360                      Mean           8674.857
                        Largest       Std. Dev.      10634.68
75%        14100          29240
90%        25580          33040       Variance       1.13e+08
95%        29240          34310       Skewness        1.30502
99%        39980          39980       Kurtosis       3.382168

The median GNP per capita is 3360 US dollars, which is less than the mean of 8675 US dollars. This suggests that the distribution of GNP per capita is right-skewed, a conclusion supported by the positive skewness value of 1.305. We can also see that the four countries with the highest GNP per capita have values between 29240 and 39980 US dollars.
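
If only a few of these statistics are needed, they can be requested directly. The following sketch uses tabstat as an alternative to summarize with the detail option.

```stata
sysuse lifeexp.dta, clear

* Number of missing values in gnppc (68 countries, 63 valid observations)
count if missing(gnppc)

* Compact table with a chosen set of statistics
tabstat gnppc, statistics(n mean median sd skewness)

* Difference between mean and median, taken from the stored results
summarize gnppc, detail
display r(mean) - r(p50)
```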

For certain variables we want to know how often specific values appear in the data. This naturally only makes sense with nominal- and ordinal-scaled variables with a limited number of unique values, such as gender or educational attainment. We can do this with one-way tables via the tabulate command. Let’s take a closer look at the auto dataset, which is a record of diverse car models in the US in 1978.

. sysuse auto.dta, clear
(1978 Automobile Data)

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
      Total |         74      100.00

You can see that roughly 70% of the cars are domestic, whereas 30% are foreign. The value labels of the variable foreign make the table easy to read. If you want to compare prices across the cars' origins, you need a one-way table of car type (foreign vs. domestic) that summarizes car prices within each category.

. tabulate foreign, summarize(price)

            |          Summary of Price
   Car type |        Mean   Std. Dev.       Freq.
   Domestic |   6,072.423   3,097.104          52
    Foreign |   6,384.682   2,621.915          22
      Total |   6,165.257   2,949.496          74

As you can see, the variable to be summarized is passed via the summarize() option. This gives us the mean and standard deviation of price for foreign and domestic cars. Foreign cars are more expensive on average, at 6385 US dollars, than domestic cars, at 6072 US dollars.
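
A few variations on these tables, again with the auto data. This is a sketch of related options, not an exhaustive overview.

```stata
sysuse auto.dta, clear

* Same table as above, but without the frequency column
tabulate foreign, summarize(price) nofreq

* Frequency table that shows missing values as a category of their own
tabulate rep78, missing

* Alternative: full summary statistics of price for each car type
bysort foreign: summarize price
```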

3.2  The in- and if-Conditions

This section gives a basic lesson on Stata’s command syntax, using the list command as an example, while showing how to control the appearance of a data list. The syntax of the list command can be seen by typing help list:

list [varlist] [if] [in] [, options]

Anything inside square brackets is optional. For the list command,

  1. varlist is optional. A varlist is a list of variable names.
  2. if is optional. The if qualifier restricts the command to run only on observations for
    which the qualifier is true.
  3. in is optional. The in qualifier restricts the command to run on a particular range of observations.
  4. options are optional. They are separated from the rest of the command by a comma.

If part of a command name is underlined in the help file, the underlined part is the minimum abbreviation. Any abbreviation at least this long is acceptable. Since the l in list is underlined, l, li, and lis are all equivalent to list. When applied to the list command, the in qualifier takes a range of observations to be listed. A range is either a single number or has the form first/last. Positive numbers count from the beginning of the dataset, whereas negative numbers count from the end of the dataset. However, the beginning and the end of the dataset depend on how the data are sorted. With the sort command, the data can be sorted in ascending order by one or more variables.

. sysuse auto, clear
(1978 Automobile Data)

. sort price 

. list make price foreign  in 1/5

     | make             price    foreign |
  1. | Merc. Zephyr     3,291   Domestic |
  2. | Chev. Chevette   3,299   Domestic |
  3. | Chev. Monza      3,667   Domestic |
  4. | Toyota Corolla   3,748    Foreign |
  5. | Subaru           3,798    Foreign |

. list make price foreign  in -5/-1

     | make                price    foreign |
 70. | Peugeot 604        12,990    Foreign |
 71. | Linc. Versailles   13,466   Domestic |
 72. | Linc. Mark V       13,594   Domestic |
 73. | Cad. Eldorado      14,500   Domestic |
 74. | Cad. Seville       15,906   Domestic |

Because the dataset is sorted in ascending order according to the price of the car models, the first list shows the 5 cheapest models, while the second list displays the 5 most expensive car models. The command gsort can sort the data in descending order according to one or more variables. For details, type help gsort.
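
A short sketch of gsort together with the in qualifier. A minus sign in front of a variable name requests descending order for that variable.

```stata
sysuse auto.dta, clear

* Sort by price in descending order, then list the five most expensive cars
gsort -price
list make price foreign in 1/5

* f and l are shorthand for the first and the last observation
list make price in f/3
list make price in -3/l

* Sort by car type (ascending) and, within car type, by price (descending)
gsort foreign -price
list make price foreign in 1/5
```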

The if qualifier uses a logical expression to determine which observations to use. If the expression is true, the observation is used in the command; otherwise, it is skipped. The operators whose results are either true or false are

< less than
<= less than or equal
== equal
> greater than
>= greater than or equal
!= not equal
& and
| or
! not (logical negation)
() parentheses are for grouping to specify order of evaluation
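
The operators can be tried out with the count command, which reports how many observations satisfy a condition. A minimal sketch using the auto data:

```stata
sysuse auto.dta, clear

* and: foreign cars that cost more than 10,000 US dollars
count if foreign == 1 & price > 10000

* or: cars with a repair record of 3 or 4
count if rep78 == 3 | rep78 == 4

* not: everything except foreign cars
count if !(foreign == 1)

* parentheses: repair record of 3 or 4, restricted to foreign cars
count if (rep78 == 3 | rep78 == 4) & foreign == 1
```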

In logical expressions, & is evaluated before | (similar to multiplication before addition in arithmetic). You can rely on this in your expressions, but it is often better to use parentheses to make the intended order of evaluation explicit. To illustrate the if qualifier, let's use a random sample of the German Socio-Economic Panel (GSOEP). The GSOEP annually surveys adults in Germany about their demographic and socio-economic characteristics. The altered random sample gsoep.dta is available on my website and can be loaded directly into Stata if you have an internet connection. First, let's find out the average income of people aged 30 or younger in the GSOEP sample. Afterwards, we want to know the average income of adults who are 25 to 55 years old. Although the summarize command can be abbreviated to two letters (su), I prefer the more common abbreviation sum.

. use "", clear
(SOEP 2009 (Kohler/Kreuter))

. sum income if age <= 30

    Variable |        Obs        Mean    Std. Dev.       Min        Max
      income |        857    10506.67    12122.77          0      79295

. sum income if age >= 25 & age <= 55

    Variable |        Obs        Mean    Std. Dev.       Min        Max
      income |      2,700    28261.41     40074.5          0     897756

As you can see, there are 857 persons aged 30 or younger in the dataset, and their average yearly income is 10507 euros. Persons who are 25 to 55 years old earn, on average, 28261 euros; the range of incomes in this age group is large, from 0 to 897756 euros. The GSOEP sample also contains the variable gender, a string variable indicating each respondent's gender. If we want to calculate the average yearly income of women, we must type

. sum income if gender == "Female"

    Variable |        Obs        Mean    Std. Dev.       Min        Max
      income |      2,458       13323    21290.77          0     612757

The string variable gender takes two distinct values, namely "Male" and "Female". In Stata, string values must always be enclosed in quotation marks.
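
Conditions on string variables can be practiced with the string variable make in the auto data. The sketch below also shows that string comparisons are case-sensitive, which is a common source of empty results.

```stata
sysuse auto.dta, clear

* Exact match on a string value
list make price if make == "Subaru"

* String comparisons are case-sensitive, so this finds no observation
count if make == "subaru"

* Compare against a lowercased copy to ignore case
count if lower(make) == "subaru"

* inlist() checks several string values at once
list make price if inlist(make, "Subaru", "Toyota Corolla")
```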